[SMT-COMP] [SMT-LIB] SMT-COMP 2016: Draft Rules and Timeline Available
Giles Reger
giles.reger at manchester.ac.uk
Mon May 16 19:45:40 EDT 2016
Tjark,
I understand that making changes to the rules this close to the competition would be difficult and that this is a topic that needs further discussion. But I do think it is something that should be seriously discussed.
On 15 May 2016, at 14:49, Tjark Weber <tjark.weber at it.uu.se<mailto:tjark.weber at it.uu.se>> wrote:
Giles,
On Fri, 2016-05-13 at 14:39 +0000, Giles Reger wrote:
In experiments we solve many problems with an unknown status and find
the notion of restricting the competition to only those with a known
status to be at odds with the overall aim here: to solve hard
problems efficiently.
We are more familiar with how the CASC competition is organised where
(i) All solvers are run on all problems (with competition timing, not
an inflated time limit) to establish status (rating in CASC)
(i) All solvable problems (by anybody) are eligible for the
competition
It seems that requiring two solvers to agree on the status of a
benchmark does not encourage finding new ways to solve problems but
only solving problems solvable by others.
I understand your point of view. Indeed, the current rules disadvantage
solvers that can solve benchmarks not solvable (with a much higher time
limit, nonetheless) by any other solver.
I am not familiar with the situation in SMT technology but it would not be unusual for one ATP system to solve a problem in a few seconds and no other solver be able to solve it with any amount of resources.
But historically, SMT-COMP has valued correct answers much more highly
than incorrect answers. Certainly returning an incorrect answer does
not amount to solving the benchmark. If a single solver's answer
suffices, how can we be confident that the answer is correct?
I am not sure the decision is between valuing correct or valuing incorrect answers. Few would value incorrect answers. It seems that more value is placed on solving problems shown to be solvable than solving new problems.
Obviously soundness is very important, an unsound system has little worth. But I am not convinced that having two solvers agree on a status is necessary or sufficient for establishing reasonable confidence in soundness. I would argue that confidence should be built in a system rather than a single answer. I agree that this is difficult to achieve. Proof checking has been discussed in this thread and this seems like a reasonable approach. We have a mechanism for utilising other solvers (SMT solvers and ATPs) for checking individual steps in our proofs. In CASC there is a lot of policing from Geoff, who actively tries to seek out unsoundness, a lot of effort on his part.
As an additional note, we observed that in some cases solvers entered into last year’s competition returned different results on problems with unknown status [1]. Our interpretation of the rules is that this marks the problem as ineligible. But in our view it should raise questions about the soundness of the systems in question, only one can be correct. By disregarding problems with unknown status one loses one source of soundness checking.
(It appears to me that the CASC system could potentially be gamed by a
wrapper script that runs last year's best solver, and arbitrarily
returns "sat" or "unsat" if the solver times out.)
I’m afraid this would not work (for anybody suddenly thinking of quickly entering CASC) for two main reasons. The first (pragmatic) reason is that entrants must submit their source code and Geoff would spot this game quickly and penalise the entrant under the ‘No cheating is allowed’ rule [1]. The second more compelling reason is that as (i) the previous best solver will not solve everything that all solvers can solve, (ii) unseen problems are added before the competition each year, and (iii) many problems have known statuses (e.g. from mathematics) independent of solvers, this wrapper will be forced to return an arbitrary result on at least one problem for which the status is known. The likelihood is that at least one wrong answer will be given and the wrapper immediately disqualified.
Opening up SMT-COMP to benchmarks with unknown status would, at the
very least, require us to redefine when an answer is (in)correct. I am
happy to have an open-minded discussion about this, but I do not think
it would be appropriate to make such a substantial change on short
notice (i.e., for SMT-COMP 2016).
I think I have made my initial input to this open-minded discussion. However, I would venture a further suggestion, to include unknown problems in a non-competitive part of the results as is done with experimental divisions until a consensus is formed. These numbers could be disregarded by those not trusting a single solver but could indicate the extent to which individual solvers claim to be solving unknown problems (I venture that it is not just us).
Best,
Giles
[1] For LIA/pysco/107.smt2 and LIA/pysco/181.smt2 (both of unknown status) the experimental version of CVC4 from 2015 and Z3 4.4.0 give different results. This is something we only recently observed and does not seem to appear in other versions of the solvers (based on some limited checks) and we have not informed the authors of the tools.
[2] http://www.cs.miami.edu/~tptp/CASC/J8/Design.html
More information about the SMT-COMP
mailing list