[SMT-COMP] Fwd: [SMT-LIB] SMT-COMP 2016: Draft Rules and Timeline Available

Mon May 16 23:21:06 EDT 2016

I sympathize with this and I agree with Andrei on the purpose of the
competition.

Basically we have benchmarks that the establishment can solve and there is
some consensus that the solutions are correct. But, since we have the
computational power and time, there is no reason not to run the competition
on all benchmarks, even those with status unknown. It should be possible to
compute for each solver the number of solved problems:

 N1 = form the "known" set,
 N2 = from the "unknown" set where also all other successful solvers
concur,
 N3 = from the "unknown" set where no other solver concurs.

This gives a choice scoring over N1 (conservative), N1+N2 (moderate), or
N1+N2+N3 (liberal).

I don't see any reason not to use the moderate scoring. Although the
liberal score is appealing, and I would support it, it's hard to see a
viable path to nomination in 2016. But, I would still like to see it at
least presented in the competition results.

Dejan

On Mon, May 16, 2016 at 4:59 PM Giles Reger <giles.reger at manchester.ac.uk>
wrote:

>
> [Sending on Andrei’s behalf as he is not on the list]
>
> Hi Tjark,
>
> you have already got a reply from Giles.
>
> To me, competitions in our area should stimulate development of better
> systems. Every unsolved problem must be a challenge. This is not what
> SMT-COMP encourages. You can solve hundreds of problems nobody can
> solve ... but this won't give you a point according to your rules.
>
> Another important point for a competition is fairness. Your rules
> treat unfavorably two kinds of systems:
>
> (1) those that use methods different from other systems (and therefore
> are likely to solve different problems)
> (2) newcomers, because they have to compete on problems at least two
> old system solve (this is in the area where many divisions have
> previously had only two entrants!)
>
> 1) is not a speculation. We made preliminary tests: there is a
> division where we solve over 30% more problems than all other systems
> together and would not be winners according to your rules. Even more
> interestingly, in this division on the union of problems solvable by
> at least one other system, we solve more than any of them, which also
> illustrates 2).
>
> Note that Vampire is both 1) and 2) .
>
> If you main motivation is 100% soundness, you did not achieve it. In
> quantifier divisions, there are essentially 2 systems and we know at
> least one of them is unsound (see the example from Giles).
>
> As Giles pointed out, (mainly in quantifier divisions) giving solvers
> 10 hours (or 10 days) does not make a lot of new problems solvable. We
> run Vampire all year round on 180+ cores and know this too well :)
>
> If your motivation is progress in the area, changing the current rules
> seems to be the only way to go.
>
> Best, Andrei
>
> On 15 May 2016 at 14:49, Tjark Weber <tjark.weber at it.uu.se> wrote:
> > Giles,
> >
> > On Fri, 2016-05-13 at 14:39 +0000, Giles Reger wrote:
> >> In experiments we solve many problems with an unknown status and find
> >> the notion of restricting the competition to only those with a known
> >> status to be at odds with the overall aim here: to solve hard
> >> problems efficiently.
> >>
> >> We are more familiar with how the CASC competition is organised where
> >> (i) All solvers are run on all problems (with competition timing, not
> >> an inflated time limit) to establish status (rating in CASC)
> >> (i) All solvable problems (by anybody) are eligible for the
> >> competition
> >>
> >> It seems that requiring two solvers to agree on the status of a
> >> benchmark does not encourage finding new ways to solve problems but
> >> only solving problems solvable by others.
> >
> > I understand your point of view. Indeed, the current rules disadvantage
> > solvers that can solve benchmarks not solvable (with a much higher time
> > limit, nonetheless) by any other solver.
> >
> > But historically, SMT-COMP has valued correct answers much more highly
> > than incorrect answers. Certainly returning an incorrect answer does
> > not amount to solving the benchmark. If a single solver's answer
> > suffices, how can we be confident that the answer is correct?
> >
> > (It appears to me that the CASC system could potentially be gamed by a
> > wrapper script that runs last year's best solver, and arbitrarily
> > returns "sat" or "unsat" if the solver times out.)
> >
> > Opening up SMT-COMP to benchmarks with unknown status would, at the
> > very least, require us to redefine when an answer is (in)correct. I am
> > happy to have an open-minded discussion about this, but I do not think
> > it would be appropriate to make such a substantial change on short
> > notice (i.e., for SMT-COMP 2016).
> >
> > Best,
> > Tjark
> >
>
> _______________________________________________
> SMT-COMP mailing list
> SMT-COMP at cs.nyu.edu
> http://cs.nyu.edu/mailman/listinfo/smt-comp
>