Artificial Intelligence: Problem Set 5 Assigned: Mar. 28 Due: April 11 Problem 1 A. There are three states: Playing (P), Won (W), Stopped (S). W and S are terminal states, with rewards 10 and 1 respectively. (Why is the reward on S 1 rather than 0? Well, really the ante should be associated with the choice of action: 1 if you flip, 0 if you quit. But since the book sets things up in terms of the rewards on states, we can kludge things to that format by imagining that you ante whenever you are in P, but that S pays back your most recent ante when you quit.) P is a non-terminal state with a reward of -1. From P there are two actions: flip (f) and quit (q). If you choose "f", then you have a 1/2 probability of reaching W and a 1/2 probability of remaining in P. If you choose "q" then you have a probability of 1 of reaching W. There are therefore only two Markov policies (1) In P, always quit; (2) In P, always flip. The expected utility of state P in policy (1) has the value
u(P) = r(P) + u(S) = r(P) + r(S) = -1 + 1 = 0. The expected utility of state P in policy 2 satisfies the equation
u(P) = r(P) + 1/2 u(W) + 1/2 u(P) = 4 + 1/2 u(P).
which has solution u(P)=8. B. Intuitively, the utility function is not separable, because the cost associated with choosing to flip depends on the past history, not just on the state of the game. Formally, according to the equation in Russell and Norvig (p. 502), a utility function on histories is separable if there exists a function f such that U(s0, s1, ... sn) = f(s0,U(s1 ... sn)). Now, let S0 = P and let s1 .. s5 be the sequence P,P,P,P,W, and let v1, v2 be the sequence P,Q. Both of these have utility 0. But U(s0,s1,s2,s3,s4,s5) = -5 while U(s0,v1,v2) = -1, so there cannot be any such function. (Of course, this argument depends on the particular coincidence that there are these two different histories with the exact same utility. Suppose the numbers were changed so that no such coincidence were possible? Actually, I think the formula in Russell and Norvig is too general, but I shall have to check on this.) C. Give an argument to show that, if the ante increases monotonically over time, then the optimal strategy is to continue playing until the ante reaches \$5, and then stop. This question has two parts (I) that as long as the ante is less than 5, you should continue; (II) that as soon as the ante is at least 5, you should stop. As for (I); if the ante is x < 5, then the expected payoff for the strategy "play the next move; then quit", which is 5-x, is always better than the strategy "quit", which has a payoff of 0. Hence quitting is never the optimal strategy; hence, it is always better to play. As for (II); if all future antes are exactly 5, then, by the same kind of calculation as in part (A), the expected payoff is 0. Hence if the antes are greater than 5, the expected payoff will be less than 0. D. Assuming that the player uses the strategy in C for the game in B, what is the expected value of the game? The probabilities are 1/2 that he will win on the first roll with a payoff of 9; 1/4 that he will win on the second roll with a payoff of 7; 1/8 that he will win on the third roll with a payoff of 4; 1/16 that he will win on the fourth roll with a payoff of 0; and 1/16 that he will lose all four rolls, with a payoff of -10. Hence, the expected payoff is (9/2) + (7/4) + (4/8) + 0 - 10/16 = 49/8 = 4.125
• E. Suppose that the ante starts at \$6 and decreases by \$1 each turn until it reaches 0, and then stays 0. What is the value of the game at the beginning? Clearly, if it is worth playing at all, it is worth playing to the end. If the player plays, the probabilities are 1/2 that he will win on the first roll with a payoff of 4; 1/4 that he will win on the second roll with a payoff of -1; 1/8 that he will win on the third roll with a payoff of -5; 1/16 that he will win on the fourth roll with a payoff of -8; 1/32 that he will win on the fifth roll with a payoff of -10; 1/32 that he will eventually win after the fifth roll, with a payoff of -11. Hence the expected return is (4/2) - (1/4) - (5/8) - (8/16) - (10/32) - (11/32) = -1/32. Hence, it is better to quit at the beginning, with an expected return of 0. Problem 2 The simplest procedure, as described in the book proceeds as follows: At the start G = { TRUE } and S = { FALSE } A. We encounter the example that R(3,1,4,2) is satisfied. The single rule in G is satisfied, and so need not be modified. The rule in S is too specialized, and so must be generalized. The possible minimal generalizations give S = { R1: c > a & a > b; R2: a > d & d > b; R3: c > a & a > d; R4: c > d & d > b; R5: a > b & c > d; R6: c > a & d > b; R7: a > d & c > b } (Note: rules such as "c > b & a > b" or "c > a & c > b" are strictly more general than R1, and likewise for similar examples.) B. We encounter the example that R(2,2,1,3) is not satisfied. All the rules in S are consistent with this, so none are changed. The rule in G gives a false positive here, so must be specialized. The maximally general specializations that exclude this example give G = { R8: a > b; R9: b > a; R10: c >= a; R11: a >= d; R12: c >= b; R13: b >= d; R14: c >= d } C. We encounter the example that R(1,3,4,4) is satisfied. In S: R1 is too specialized, and is replaced by c > a. R2 is too specialized, and is replaced by d > b. R3 is too specialized, and is replaced by c > a. R4 is too specialized, and is replaced by c >= d & d > b. R5 is too specialized, and is replaced by c >= d. R6 is satisfied and is unchanged. R7 is too specialized and is replaced by c > b; Hence, the new value of S is { R15: c > a; R16: d > b; R17: c >= d & d > b; R18: c >= d; R6: c > a & d > b; R19: c > b} In G: R8, R11, and R13 give false negatives and so are deleted. This gives G = { R9: b > a; R10: c >= a; R12: c >= b; R14: c >= d } Now, there are two improvements to the most basic routine that are possible and sensible. The first improvement is that one can go through all the *previous* positive examples, and remove all the rules in G that give a false negative; and go through all the previous negative examples, and remove all the rules in S that give a false positive. The second improvement is that, at each step, if there are two rules RA and RB where RA is strictly more general than RB, then, if RA and RB are both in G, then RB can be deleted; if they are both in S, then RA can be deleted. For instance, this happens with R6 and R15 in our example above; it is not prevented by the most simple statement of the algorithm, because these derive from different previous rules. Applying both of these improvements gives the following results: No change after step A: G= { TRUE } S = { R1: c > a & a > b; R2: a > d & d > b; R3: c > a & a > d; R4: c > d & d > b; R5: a > b & c > d; R6: c > a & d > b; R7: a > d & c > b } After step B: the first improvement allows the elimination of R9, R11, and R13 giving G = { R8: a > b; R10: c >= a; R12: c >= b; R14: c >= d } S is as above. After step C: The improvement allows the subsumption of R15 and R16 by R6 and of R18 and R19 by R17 giving S= { R17: c >= d & d > b; R6: c > a & d > b} In G: R8 gives a false negative leaving G = { R10: c >= a; R12: c >= b; R14: c >= d }