[SMT-COMP] SMT-COMP 2019: Call for Comments, Benchmarks, Solvers

Tjark Weber tjark.weber at it.uu.se
Wed Jan 23 08:26:48 EST 2019


Aina,

On Tue, 2019-01-22 at 22:20 -0800, Aina Niemetz wrote:
> * As a first step we will remove all benchmarks in a division that
>   were solved by *all* solvers (including non-competing solvers) in
>   this division in under a second during last year's competition.

Emphasizing challenging benchmarks may seem like a good idea, and I
agree that there is a need for more difficult benchmarks in SMT-LIB.
But excluding easy benchmarks from SMT-COMP is not obviously
beneficial.

- If all solvers can solve a benchmark, that benchmark does not affect
the ranking anyway (except possibly via solving time); the ranking will
thus be determined by the more challenging benchmarks.

- On the other hand, if a solver fails to solve an easy benchmark (or
worse, gives an incorrect result), isn't this useful information that
*should* affect the ranking?

- In any cost-benefit analysis of running easy benchmarks, note that
they are relatively cheap in terms of computing time: the 78% of easy
benchmarks likely take *far less* than 78% of the main track's total
run-time, which is dominated by time-outs. Removing the easy benchmarks
(without further random selection) would therefore have little impact
on the viable time limit (see the rough sketch below).
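
To make the cost side concrete, here is a back-of-the-envelope sketch
in Python. All numbers are hypothetical, chosen only to illustrate the
shape of the argument, not actual SMT-COMP measurements:

    n_total = 1000                       # benchmarks in a division
    n_easy  = 780                        # the 78% solved by all in < 1 s
    n_hard  = n_total - n_easy
    timeout = 1200                       # 20-minute limit, in seconds

    easy_time = n_easy * 1.0             # at most ~1 s each
    hard_time = 0.25 * n_hard * timeout  # suppose only 1/4 even time out

    print(easy_time / (easy_time + hard_time))  # ~0.01, i.e. ~1%

Even under these conservative assumptions, the easy 78% accounts for
roughly 1% of the total run-time.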

> * From the remaining benchmarks we will randomly select a minimum of
>   300 benchmarks. If the division contains fewer, we include all
>   benchmarks. If 50% of the eligible benchmarks in the division
>   exceeds 300, we select 50% of the benchmarks.
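
If I read this correctly, the number of selected benchmarks per
division would come out as follows (a sketch of my reading in Python;
the function name is mine, not anything official):

    def benchmarks_selected(n):
        # n: number of eligible benchmarks in a division
        if n <= 300:
            return n             # small division: include everything
        return max(300, n // 2)  # at least 300, or 50% if that is larger

    # e.g. benchmarks_selected(200)  == 200
    #      benchmarks_selected(500)  == 300
    #      benchmarks_selected(1000) == 500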

A key finding of the 2013 evaluation of SMT-COMP and SMT-LIB was that
the competition results were quite sensitive to randomness. Before you
re-introduce a random selection of benchmarks, I would encourage you to
take a look at the 2013 report
https://link.springer.com/article/10.1007%2Fs10817-015-9328-2
(in particular Section 3.5.2 and its Conclusions and Recommendations).

Of course, a competition only offers a snapshot of performance at a
given moment in time, and a certain degree of randomness (e.g., caused
by the benchmark scrambling) isn't necessarily bad. But I find it
difficult to argue that this form of randomness, where the ranking may
hinge on whether one particular benchmark happens to be included in the
random sample, is desirable.
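
A toy simulation makes this concrete: compare two solvers of
essentially equal strength on the full benchmark set and on random 50%
samples, and count how often the sampled "winner" differs from the
full-set winner. All data below is synthetic:

    import random

    random.seed(0)
    n = 600
    # synthetic per-benchmark outcomes (True = solved), two near-equal solvers
    a = [random.random() < 0.70 for _ in range(n)]
    b = [random.random() < 0.70 for _ in range(n)]

    full_winner = sum(a) >= sum(b)
    trials, flips = 10000, 0
    for _ in range(trials):
        s = random.sample(range(n), n // 2)
        sample_winner = sum(a[i] for i in s) >= sum(b[i] for i in s)
        flips += (sample_winner != full_winner)

    print(flips / trials)  # fraction of samples on which the winner flips

Whenever the solvers are close, a non-negligible fraction of the random
samples reverses the ranking.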

> 2. Increase time limit to 40 minutes.

Clearly, a longer time limit is better. But the real question is how to
use the available computational resources to get the most meaningful
data.

If the increased time limit is obtained at the cost of running (far)
fewer benchmarks, I believe it is worth pointing out that *very few*
benchmarks are solved after more than 20 minutes. Is it really better
to run solvers for another 20 minutes than to use 100% of the
(eligible) benchmarks?
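
A quick sanity check of the worst-case budget (illustrative numbers
only, assuming every benchmark times out):

    n = 1000                    # eligible benchmarks (hypothetical)
    budget_a = (n // 2) * 40    # 40-minute limit on a 50% random sample
    budget_b = n * 20           # 20-minute limit on all benchmarks
    print(budget_a, budget_b)   # 20000 20000 -- solver-minutes either way

The worst-case budget is the same in both cases, so the trade-off is
really between doubling the per-benchmark limit (which, as noted above,
rarely yields new solutions) and doubling the benchmark coverage.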

Best,
Tjark
