In the dating game, you play the role of a matchmaker trying to satisfy a picky dater. Much as in real life, suitors (or suitresses) are reduced to a real-valued vector of n (≤ 100) features. Each element of the vector is in the range [0..1] and corresponds to some characteristic of the candidate. The picky dater (called "person" or P in the above link) judges the quality of each candidate based on an unknown weighting of their features. Each feature can be weighted positively or negatively (between -1 and 1), and we are guaranteed that, before modification, the negative feature weights will sum to -1 and the positive weights will sum to 1. Additionally, we know that each weight will use no more than 2 decimal places, which I'll ignore for the rest of this page.
Initially, you are given P's dating history in the form of m (≈ 20) feature vectors along with the score that P assigned to each vector. Formally, we can describe this as:

w ∈ [-1..1]^n                       (P's weight vector of length n)
∑_{i : w_i < 0} w_i = -1            (negative weights sum to -1)
∑_{j : w_j > 0} w_j = 1             (positive weights sum to 1)
dating history = { (x_k, y_k) | 1 ≤ k ≤ m }
Your job is to generate an ideal match for P, which we'll call x*, such that w^T·x* = 1.
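To make the setup concrete, here's a minimal simulation sketch in Python; the construction of the hidden weight vector is my own illustration of the stated guarantees, not code from the game itself.

import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of features (the real game allows up to 100)

# Hypothetical hidden weights: rescale so the negative entries sum to -1
# and the positive entries sum to 1, matching the game's guarantee.
w = rng.uniform(-1, 1, size=n)
w[w < 0] /= -w[w < 0].sum()
w[w > 0] /= w[w > 0].sum()

# A candidate is a feature vector in [0..1]^n; P's score is w^T x.
x = rng.uniform(0, 1, size=n)
print("score:", w @ x)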
To do this, you will probably want to figure out the weighting function
by which your candidates are being judged. If we approximate w
using only the dating history we are first given, we are doing a type of function estimation called linear regression.
Additionally, every candidate you submit (paired with the score you get back) can be used as evidence to refine your estimate of w. This puts us in an interesting machine learning setting called active learning (an area of current research). Since I don't know of any canonical techniques for active learning, I'll leave that part to you. Instead, I'll focus on the easier problem: given some data, how do we derive the best linear model (i.e., the optimal estimate of w)?
To find our optimal w (which we'll call w* from now on), we need to pick a learning algorithm. Most machine learning algorithms boil down to different definitions of the following two concepts:

C(w), the cost function, tells us how undesirable a particular weight vector is. The cost function consists of two parts: L(w), the loss, computes how well the weights predict the data we already have; R(w), the regularization term, imposes some assumptions on what w should look like in order to improve performance on unseen data. Regularization terms are optional, but often helpful.

optimize, the procedure that searches for the parameters that minimize the cost.

In summary, to say that we have a "learning algorithm" means we have defined:

C(w) = L(w) + R(w)
optimize :: (sample data X, sample labels Y, cost function C) → best parameters
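As a rough sketch of this decomposition in Python (the names here are mine, not from the original write-up):

# Illustrative names: a cost function is a loss plus an (optional)
# regularization term.
def make_cost(loss, reg=lambda w: 0.0):
    return lambda w, X, Y: loss(w, X, Y) + reg(w)

# "optimize" is then any procedure with the shape:
#   optimize(X, Y, cost) -> best parameters w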
The simplest cost function worth talking about is called mean squared error (MSE). MSE ignores regularization and thus consists of just a loss function: the average squared difference between the score our estimate of w assigns to each candidate x_i and the correct value y_i, over the m matches tried so far.

C(w) = (1/m) · ∑_i (x_i^T·w - y_i)^2

or in matrix form, where the rows of X are the x_i and Y is a column vector:

C(w) = (1/m) · (Xw - Y)^T(Xw - Y)
Minimizing MSE for a linear system (of the form Y = X·w) gives us Ordinary Least Squares (OLS) linear regression.
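In numpy terms, the cost is nearly a one-liner (a sketch, assuming X is an m × n array and Y a length-m vector):

import numpy as np

def mse(w, X, Y):
    # Average squared difference between predictions Xw and labels Y.
    residual = X @ w - Y
    return residual @ residual / len(Y)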
OLS regression is appealing because it's arguably the simplest sort of
optimization we can imagine doing. We don't have to differentiate
anything and there's no iterative update process. Just a little bit of
term shuffling and we'll have a solution.
First, let's assume we have this desired w* in hand. Let's further assume that it's a perfect fit, resulting in no error.

X ∈ R^(m×n)     (previous candidates)
Y ∈ R^(m×1)     (scores of previous candidates)
w ∈ R^(n×1)     (picky dater's weights)

Xw* = Y
X^T·Xw* = X^T·Y                  (multiply both sides by X^T)
w* = (X^T·X)^(-1)·X^T·Y          (multiply both sides by (X^T·X)^(-1))
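In numpy, that final line is a single call (a sketch; lstsq computes the least-squares solution without explicitly forming the inverse, which is also numerically safer):

import numpy as np

def ols(X, Y):
    # Least-squares estimate of w; equivalent to (X^T X)^(-1) X^T Y
    # when the inverse exists, but better behaved when it doesn't.
    w_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w_star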
By algebraic magic, we now know how to calculate w*. OK, good job guys. Can we go home? Well, not so fast. There are a few things worth noting here: for one, (X^T·X)^(-1) only exists when X has full column rank, and with m ≈ 20 candidates and up to n = 100 features our system is underconstrained, so we can't count on that.
If we want to keep the same cost function (mean squared error) but ditch the matrices, we can instead use gradient descent. The idea here is that at every value of w we choose, we can calculate the gradient of the loss function and then move down the gradient towards the minimum. To do this we, of course, have to actually figure out what the gradient looks like. To make the math a little easier, let's drop the 1/m from in front of C(w) and replace it with a 1/2. Since constants won't affect the location of a minimum, this is kosher.
C(w) = (1/2) · ∑_i (x_i^T·w - y_i)^2
∇C = ∑_i ∇C_i
∇C_i = (x_i^T·w - y_i)·x_i
Now that we know how to calculate the gradient, we can initialize the weight vector w_0 to some random value and proceed to iteratively refine that value.

until some stopping condition:
    G_t = vector of length n (number of features), initialized to all 0s
    foreach candidate i:
        dotprod = ∑_f x_{i,f} * w_{t,f}
        diff = dotprod - y_i
        foreach feature index f:
            G_{t,f} = G_{t,f} + diff * x_{i,f}
    foreach feature index f:
        w_{t+1,f} = w_{t,f} - η * G_{t,f}
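The same loop in numpy might look like this (a sketch; the stopping condition, initialization, and default η are placeholder choices of mine):

import numpy as np

def batch_gradient_descent(X, Y, eta=0.01, tol=1e-8, max_iters=10000):
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, size=X.shape[1])   # random initial w_0
    for _ in range(max_iters):
        G = X.T @ (X @ w - Y)        # sum of per-candidate gradients
        w_next = w - eta * G
        if np.linalg.norm(w_next - w) < tol:  # stop once updates vanish
            return w_next
        w = w_next
    return w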
Take an example: suppose that diff is positive, so the prediction is greater than it should be. If a particular index f has a large positive value (i.e., x_{i,f} is large and positive), then making the weight on that index less positive or more negative would have brought the error down. So G_{t,f} should increase and the f-th weight coefficient w_f should decrease. By contrast, if index f has a large negative value while diff is still positive, then the gradient entry will tend to decrease (at least as far as this candidate is concerned) and the weight should increase. The point is that each candidate exerts an influence to increase or decrease each weight.
The two key decisions to make with the above algorithm are the stopping criterion and the learning rate (denoted η).
A common stopping criterion is to halt once the change in w starts to vanish. The above gradient descent method is called "batch learning" since it aggregates the gradients from all candidates before updating the weights. A variant called "stochastic gradient descent" or "online learning" updates the weight vector using the individual gradient from each candidate. Stochastic gradient descent is often preferable for large samples and/or complicated cost surfaces. Since in the Dating Game we have neither, I can't see a reason to use the stochastic variant over the batch algorithm. For a more thorough explanation of gradient descent (both batch and stochastic), see this page on Linear Neural Networks.
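For contrast, one stochastic pass would look something like this sketch (updating w after each candidate rather than once per pass):

import numpy as np

def sgd_epoch(w, X, Y, eta=0.01):
    # One online pass: apply each candidate's gradient immediately.
    for x_i, y_i in zip(X, Y):
        w = w - eta * (x_i @ w - y_i) * x_i
    return w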
Here's a very simple algorithm for choosing the best learning rate (η) to use with gradient descent; it's leave-one-out cross-validation, where each candidate in turn is held out of training and then used to score the result:

bestCost = ∞
bestEta = 0
foreach η in eta_range:
    currCost = 0
    foreach candidate i:
        w_i = train(X without x_i, Y without y_i, η)
        currCost = currCost + (x_i^T·w_i - y_i)^2
    if currCost < bestCost:
        bestCost = currCost
        bestEta = η
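A direct translation into numpy (a sketch: train stands in for any trainer that accepts an η, e.g. the batch_gradient_descent sketch above, and eta_range is an assumed grid):

import numpy as np

def pick_eta(X, Y, train, eta_range=(0.001, 0.01, 0.1)):
    best_cost, best_eta = np.inf, None
    for eta in eta_range:
        cost = 0.0
        for i in range(len(Y)):
            # Train with candidate i held out...
            w_i = train(np.delete(X, i, axis=0), np.delete(Y, i), eta=eta)
            # ...then measure squared error on the held-out candidate.
            cost += (X[i] @ w_i - Y[i]) ** 2
        if cost < best_cost:
            best_cost, best_eta = cost, eta
    return best_eta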
When trying to solve for parameters in an underconstrained or noisy system, a smart thing to do is regularization. Regularization imposes some soft constraints on your weight vector; it lets you express an assumption about what a good solution should look like. Actually, even with sufficient data, regularization turns out (for reasons too complicated to discuss here) to be beneficial.
A common regularized form of linear regression is ridge regression, which uses the cost function:
C(w) = ∑_i (x_i^T·w - y_i)^2 + α·∑_j w_j^2

You'll notice that the first term (the loss) is the same loss function we used in OLS regression. The regularization term (on the right) penalizes the norm of w and thus guides the algorithm to prefer smaller weights. In practice, this works quite well. The choice of a regularization term may seem arbitrary, but ridge regression can actually be derived from some simple probabilistic assumptions (derivation care of Sergey Feldman).
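Ridge regression also has a standard closed-form solution, w* = (X^T·X + αI)^(-1)·X^T·Y; here's a numpy sketch:

import numpy as np

def ridge(X, Y, alpha=1.0):
    # Minimizes ||Xw - Y||^2 + alpha * ||w||^2 in closed form.
    # For alpha > 0, X^T X + alpha*I is always invertible, so this works
    # even in the underconstrained m < n case that trips up plain OLS.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)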
Finally, a note on support vector machines (SVMs): a standard SVM is a classifier, labeling each input as +1 or -1. Since we're dealing with continuous values of y_i, a classifier wouldn't do us much good. You can do regression with an SVM variant, but it's definitely overkill for the Dating Game.