[SMT-COMP] Some further questions on benchmarks status and

Clark Barrett barrett at cs.nyu.edu
Mon Jun 16 12:33:05 EDT 2008


Sorry for the delayed response to this email.  We needed some discussion and
investigation to come up with satisfactory answers.

> First, there are several instances on nec-smt/large that are still
> marked as unknown, but can be solved (with an agreement on status) by at
> least three solvers: the latest version of yices and z3 (which are
> public), and the current version of mathsat (which has not been released
> yet). Therefore, we wanted to know if this is enough to assign them a
> status (we could not report this before, since we did not know which
> were the unknown instances).

Our standard policy for determining the status of "unknown" benchmarks is to
run the previous year's solvers that finished in good standing (typically with
as high a timeout as we can afford) and look for broad agreement (in
particular, at least two solvers should agree and none should disagree).  We
have used this approach in advance of this year's SMT-COMP.  If a benchmark
still has an "unknown" status, it means that at most one of last year's
solvers was able to solve it, which is not sufficient evidence.

We feel that this method provides a good balance of safety (ensuring the
status fields are correct) and inclusion (updating as many statuses as
possible).
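
For concreteness, the agreement rule amounts to the following (a minimal
Python sketch; the result lists shown are hypothetical, not actual
competition data):

    # Decide a benchmark's status from last year's solvers' answers.
    # Each answer is 'sat', 'unsat', or 'unknown' (no answer in time).
    def consensus_status(results):
        answers = [r for r in results if r != 'unknown']
        # At least two solvers must agree, and none may disagree.
        if len(answers) >= 2 and len(set(answers)) == 1:
            return answers[0]
        return 'unknown'

    print(consensus_status(['sat', 'sat', 'unknown']))      # 'sat'
    print(consensus_status(['sat', 'unknown', 'unknown']))  # 'unknown': one solver
    print(consensus_status(['sat', 'unsat', 'sat']))        # 'unknown': disagreement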

We should mention that we would welcome help from the community in establishing
the status of additional benchmarks.  However, the evidence must be compelling
and submitted before the benchmark deadline.

For this year, we will not be accepting any more status updates, simply
because the deadline for freezing the benchmarks was June 1.  This year's
solvers will be used to update the status fields before next year's
competition, so the updated results you mention should be reflected then.


> Second, on QF_BV there are some instances whose status seems to depend
> on the way division by zero is handled. These are the instances we have
> discovered that seem to display this problem:
>   QF_BV/spear/wget_v1.10.2/src_wget_vc18197.smt:  :status unsat
>   QF_BV/spear/wget_v1.10.2/src_wget_vc18755.smt:  :status unsat
>   QF_BV/spear/wget_v1.10.2/src_wget_vc18756.smt:  :status unsat
> Both z3 1.2 and yices 1.0.12 report "sat" for them, while spear 2.3
> reports unsat for the first two, and times out (1 hour on our machines)
> for the third. According to our experiments, we get different results
> with different semantics of division by zero which, as far as we
> understand, should be undefined, meaning that it should be OK to choose
> whichever semantics one prefers. Is our interpretation correct? We got it
> from this passage of
> http://goedel.cs.uiowa.edu/smtlib/theories/BitVectors.smt:
> 
>    Note that the semantic interpretation above is underspecified because
>    it does not specify the meaning of (bvudiv s t) or (bvurem s t) in
>    case bv2nat([[t]]) is 0.  Since the semantics of SMT-LIB's underlying
>    logic associates *total* functions to function symbols, we then
>    consider as models of this theory *any* interpretation conforming to
>    the specifications above (and defining bvudiv and bvurem arbitrarily
>    when the second argument evaluates to 0).  Benchmarks using this
>    theory should only include a :status sat or :status unsat attribute
>    if the status is independent of the particular choice of model for
>    the theory.
> 
> If our interpretation is correct, then maybe such instances should not
> be part of the competition pool?

Thank you for bringing this issue to our attention.  We took a close look,
and it generated a lot of discussion.  Fortunately, removing these benchmarks
does not greatly reduce the variety available for the competition (see tables
below).  We have therefore decided to eliminate all such benchmarks from the
competition this year.

I will open a thread on the SMT-LIB mailing list to discuss the best solution
for handling these operators in the future.
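
To illustrate the underlying problem, here is a minimal sketch in Python
with 4-bit vectors (not any solver's actual implementation).  Because
SMT-LIB leaves bvudiv unspecified when the divisor is zero, a legal model
may assign that case any fixed value, and a benchmark's status can hinge
on that choice:

    # bvudiv on 4-bit values; zero_case is the arbitrary value a model
    # assigns to division by zero (the standard leaves it unspecified).
    def bvudiv(s, t, zero_case):
        return s // t if t != 0 else zero_case

    # The formula "(bvudiv x 0) = 0" holds for every 4-bit x in a model
    # mapping division by zero to 0, and for no x in a model mapping it
    # to all-ones (15), so the status depends on the model chosen.
    for zero_case in (0, 15):
        print(zero_case, all(bvudiv(x, 0, zero_case) == 0 for x in range(16)))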

> Third, we are not sure to understand how the difficulty of some new
> instances (nec-smt/large in particular) was determined. According to the
> rules, difficulty is assigned by running last year's solvers with a
> timeout of 600 seconds, and then computed with the formula:
> 
>         5 * (1 - solved/total)
> 
> In the current version of the library, there is no instance in
> QF_LIA/nec-smt/large with difficulty > 3. But according to our
> measurements, there are several that can't be solved by any of last year
> solvers within 600 seconds. For instance, we ran
> nec-smt/large/checkpass/prp-1-50.smt (which is unsat) and
> nec-smt/large/getoption_user/prp-18-48.smt (which is sat) on SMT-EXEC,
> using a timeout of 1200 seconds, and none of last year's solvers can
> solve them. But they are both marked with difficulty 3... Therefore, we
> would like to know if we missed something, or if the rules have changed,
> or if we just did something wrong...

The rules for SMT-COMP 2008 say that a 10-minute timeout will be used for
difficulty assessment.  This was simply an error in the rules: we used a 30-minute
timeout last year in the competition, and we intend to do so again this year,
resources permitting.  For difficulty assessment we therefore used the longer
timeout, so that the assigned difficulties reflect hardness under actual
competition conditions.  We failed to update the rules accordingly and
apologize for the confusion that resulted.
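
For reference, here is the difficulty computation in a minimal Python
sketch; the formula is the one quoted above, and rounding to an integer
bucket is our simplification here, reflecting the integer difficulty
values seen in the benchmark library:

    # difficulty = 5 * (1 - solved/total), where 'solved' counts the
    # reference solvers that succeed within the timeout.
    def difficulty(solved, total):
        return round(5 * (1 - solved / total))  # integer bucket: assumption

    print(difficulty(4, 4))  # 0: every reference solver solved it
    print(difficulty(1, 4))  # 4: hard for last year's solvers
    print(difficulty(0, 4))  # 5: unsolved within the timeout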

Sincerely,
SMT-COMP organizers

---------------------------------------------------------------------------
Here is a table showing the benchmarks currently available in the bitvector
divisions, broken down by difficulty and status.

+----------+------------+----------+----------+
| name     | difficulty | status   | count(*) |
+----------+------------+----------+----------+
| QF_AUFBV |          0 | sat      |     4785 |
| QF_AUFBV |          0 | unsat    |     3339 |
| QF_AUFBV |          2 | sat      |       60 |
| QF_AUFBV |          2 | unsat    |       93 |
| QF_AUFBV |          5 | sat      |       90 |
| QF_AUFBV |          5 | unsat    |       90 |
| QF_BV    |          0 | sat      |     1565 |
| QF_BV    |          0 | unsat    |      204 |
| QF_BV    |          2 | sat      |      551 |
| QF_BV    |          2 | unsat    |       22 |
| QF_BV    |          3 | sat      |      109 |
| QF_BV    |          3 | unsat    |       11 |
| QF_BV    |          5 | sat      |        9 |
| QF_BV    |          5 | unsat    |       30 |
+----------+------------+----------+----------+

...and here are the numbers with the problematic benchmarks removed (notice
that the biggest drop is in QF_BV at difficulty 0).  We feel that there is
enough variety left to keep the competition interesting and challenging.

+----------+------------+----------+----------+
| name     | difficulty | status   | count(*) |
+----------+------------+----------+----------+
| QF_AUFBV |          0 | sat      |     4785 |
| QF_AUFBV |          0 | unsat    |     3339 |
| QF_AUFBV |          2 | sat      |       60 |
| QF_AUFBV |          2 | unsat    |       91 |
| QF_AUFBV |          5 | sat      |       90 |
| QF_AUFBV |          5 | unsat    |       87 |
| QF_BV    |          0 | sat      |      336 |
| QF_BV    |          0 | unsat    |      187 |
| QF_BV    |          2 | sat      |      122 |
| QF_BV    |          2 | unsat    |       21 |
| QF_BV    |          3 | sat      |      108 |
| QF_BV    |          3 | unsat    |       11 |
| QF_BV    |          5 | sat      |        9 |
| QF_BV    |          5 | unsat    |       30 |
+----------+------------+----------+----------+

