[SMT-COMP] SMT-COMP 2019: Call for Comments, Benchmarks, Solvers

Giles Reger giles.reger at manchester.ac.uk
Thu Jan 31 08:39:48 EST 2019


Tjark (and all),

I would add some points to what Aina wrote previously.

Aina pointed out that part of our reasoning was “to make results less predictable”, but I would extend this to “to make results less predictable and more meaningful”. I would argue that the benchmarks we are removing are not useful in distinguishing solvers; they just make the gap between solvers (in terms of % of problems solved) smaller. For example, if 90% of benchmarks are ‘uninteresting’, then the gap between solvers is about an order of magnitude smaller than it would be if those benchmarks were removed. Solving an extra 50 problems out of a set of 500 is a clearer success than out of a set of 50,000. I would also note that by using 1 second as the cut-off we still include problems solved by all solvers in, say, 2 seconds. We still preserve some of the ‘simple problems’ whilst removing a large number of uninteresting cases.
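To make the arithmetic concrete, here is a toy calculation (the numbers are purely illustrative, not competition data):

    # Toy illustration: the same absolute difference of 50 solved problems
    # looks very different depending on how many benchmarks pad the total.
    def relative_gap_percent(extra_solved, total_benchmarks):
        return 100.0 * extra_solved / total_benchmarks

    print(relative_gap_percent(50, 500))    # 10.0  -- a clear difference
    print(relative_gap_percent(50, 50000))  #  0.1  -- the same 50 problems look negligible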

With respect to overfitting, you make a good point that random selection doesn’t really solve this. We would ideally like a situation where unknown benchmarks were used, but given the openness of SMT-LIB and the fact that many solver developers are involved in SMT-LIB and SMT-COMP, we feel this would be difficult to achieve in a transparent manner. Of course, an ideal situation would be the ‘Bring Your Own Benchmarks’ approach taken by the SAT competition.
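For concreteness, a minimal sketch of the kind of selection being discussed (the fraction and seed below are made-up illustration values, not the actual SMT-COMP procedure):

    import random

    def select_benchmarks(all_benchmarks, fraction=0.4, seed=2019):
        # Hypothetical parameters, chosen only for illustration.
        rng = random.Random(seed)
        k = int(len(all_benchmarks) * fraction)
        return rng.sample(all_benchmarks, k)

    # Even with a fresh seed every year, the pool being sampled is the public
    # SMT-LIB set, so tuning a solver to the whole library remains effective.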

You are right that it’s important to focus on validity (i.e., is the competition measuring the right thing?) and reliability (i.e., would repeated measurements produce similar results?). We have spent quite a while discussing what it is we think the competition should be trying to achieve.

With respect to validity, i.e. what do we want to measure, I would say that we want to measure the performance of solvers on interesting, challenging benchmarks. It is good to reflect on the primary goals of the competition. To us, and I’m sure to many, the primary goal is to encourage new developments. These uninteresting benchmarks don’t seem useful for this.

With respect to reliability, would things look the same if we ran the competition now and again after some more benchmarks were added? Perhaps not. I would also suggest that there is far greater randomness at play here: the randomness of which problems are submitted to SMT-LIB. It seems doubtful that SMT-LIB is anything close to a uniform sample of the set of all interesting problems. We can’t do much about this, but it is part of the reason why I am not opposed, in principle, to random selection.

As a final point, one motivation for us as organisers is to make things a little more fun. Extra prizes etc. help, but we are also keen to inject a bit of suspense into the competition.

Cheers, Giles

On 30 Jan 2019, at 10:39, Tjark Weber <tjark.weber at it.uu.se> wrote:

Aina,

You made a number of good points, with which I generally agree. Thus,
just briefly:

On Tue, 2019-01-29 at 09:09 -0800, Aina Niemetz wrote:
The main reason for the decision to use a subset rather than the whole
set of benchmarks in SMT-LIB is not to increase the time limit. It is to
make the results less predictable and to discourage solver developers from
overfitting to a fixed problem set.

Re predictability: it is certainly good if multiple solvers receive
recognition. A good way to achieve this is to use multiple tracks,
divisions, and special awards. (Your plans to recognize solvers based
on different timeouts, and based on sat/unsat performance only, are
great examples.) However, for the results within each division, I would
encourage you to focus on validity (i.e., is the competition measuring
the right thing?) and reliability (i.e., would repeated measurements
produce similar results?). May the best solvers win!

Re overfitting: when you select a random subset of SMT-LIB benchmarks,
isn't it still the best strategy for a solver to (over-)fit to the set
of (all) SMT-LIB benchmarks? To discourage overfitting, it would help
to have a significant proportion of new, previously unknown benchmarks
each year (but that may not be very practical, and not publishing these
benchmarks beforehand would of course be a break with tradition).

Best,
Tjark


E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/en/about-uu/data-protection-policy
_______________________________________________
SMT-COMP mailing list
SMT-COMP at cs.nyu.edu
https://cs.nyu.edu/mailman/listinfo/smt-comp
