[SMT-COMP] SMT-COMP 2019: Call for Comments, Benchmarks, Solvers

Aina Niemetz niemetz at cs.stanford.edu
Tue Jan 29 12:09:24 EST 2019


Tjark,

sorry for my late reply.

> Emphasizing challenging benchmarks may seem like a good idea, and I
> agree that there is a need for more difficult benchmarks in SMT-LIB.
> But excluding easy benchmarks from SMT-COMP is not obviously
> beneficial.
It is true that 'easy' benchmarks do not have a significant impact on
the overall run time. However, when selecting a subset of benchmarks, it
*does* make a difference whether they are included or not, in particular
since we want to focus on more challenging benchmarks.

> - On the other hand, if a solver fails to solve an easy benchmark (or
> worse, gives an incorrect result), isn't this useful information that
> *should* affect the ranking?
Well, to a certain extent one could argue that the 'selection' of
benchmarks in SMT-LIB is again only a partial view. Sure, this is useful
information. But it seems to belong more to a comprehensive evaluation
than to a competition.

> A key finding of the 2013 evaluation of SMT-COMP and SMT-LIB was that
> the competition results were quite sensitive to randomness. Before you
> re-introduce a random selection of benchmarks, I would encourage you to
> take a look at the 2013 report
> https://link.springer.com/article/10.1007%2Fs10817-015-9328-2
> (in particular Section 3.5.2 and its Conclusions and Recommendations).
We are aware of this problem.

We hope to remedy some of it by removing the 'easy' benchmarks and
selecting a larger percentage of benchmarks (50% instead of 10%).
Further, we will normalize the weight of the benchmarks with respect to
the size of the selected set per family (similar to what was done in
previous years, where the weight was normalized with respect to the size
of the whole benchmark family).
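
To illustrate what I mean by per-family normalization (this is only a
hypothetical Python sketch, not the actual scoring code; the plain 1/n
weighting and the function name are my own simplifications):

    from collections import Counter

    def per_family_weights(selected):
        # selected: list of (benchmark, family) pairs chosen for the run.
        # Each benchmark is weighted by the inverse of the number of
        # *selected* benchmarks from its family, so every family
        # contributes a comparable total weight.
        sizes = Counter(family for _, family in selected)
        return {bench: 1.0 / sizes[family] for bench, family in selected}

    # A family with four selected benchmarks contributes the same total
    # weight (1.0) as a family with a single selected benchmark:
    weights = per_family_weights([("a1", "F1"), ("a2", "F1"),
                                  ("a3", "F1"), ("a4", "F1"),
                                  ("b1", "F2")])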

Note that in the SMT-EVAL 2013 experiment you cited in your email, the
winner was determined as the solver that solved the largest number of
benchmarks. I believe that this may be a significant reason for the
sensitivity to randomness you observed when selecting benchmarks. We
would need to run simulations on previous competition data to back up
this claim, though; for now it is more of a conjecture.

We hope that keeping and adapting (and hopefully improving, as mentioned
in the call) the scoring scheme introduced in 2016 will help to
de-emphasize the weight of very large and very small benchmark families,
and that it will further reduce the above-mentioned sensitivity to the
randomness of the benchmark selection. Again, as above, one could argue
that the benchmarks in SMT-LIB are themselves a selection and thus a
partial view anyhow.
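
Purely as an illustration of the kind of dampening I mean (this is not
necessarily the exact 2016 formula), a logarithmic family weight grows
much more slowly than the family size, so huge families do not dominate
the score and tiny families do not count as much as a whole large one:

    import math

    def benchmark_weight(family_size):
        # Hypothetical per-benchmark weight: a family of size n then
        # contributes a total of 1 + log(n), instead of n (raw count)
        # or 1 (flat normalization), dampening both extremes.
        return (1 + math.log(family_size)) / family_size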

> When you are obtaining an increased time limit at the cost of running
> (far) fewer benchmarks, I believe it is worth pointing out that *very
> few* benchmarks are solved after > 20 minutes. Is it really better to
> run solvers for another 20 minutes, than to use 100% of the (eligible)
> benchmarks?
The main reason for the decision to use a subset rather than the whole
set of benchmarks in SMT-LIB is not to increase the time limit. It is to
make the results less predictable and to discourage solver developers
from overfitting to a fixed problem set. As a consequence, this also
allows us to increase the time limit. Considering that different
application domains require different time limits (some of them
significantly larger than 20 minutes, hence the new challenge track),
and that we want to make further progress on yet-unsolved benchmarks, I
think it is reasonable to increase the time limit back to 40 minutes.
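
As a rough sketch of the kind of selection this implies (again purely
hypothetical; the per-family draw and the seed handling are assumptions
of mine, not a description of the actual selection script):

    import random

    def select_benchmarks(families, fraction=0.5, seed=None):
        # families: dict mapping a family name to its list of eligible
        # (i.e. non-'easy') benchmarks. Draw a random sample of the
        # given fraction from each family, so the concrete problem set
        # changes from year to year.
        rng = random.Random(seed)
        selection = {}
        for family, benchmarks in families.items():
            k = max(1, round(fraction * len(benchmarks)))
            selection[family] = rng.sample(benchmarks, k)
        return selection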


Aina


