- Time Period: September 2006 - present.
- Participants: Sumit Chopra, Yann LeCun (Courant Institute/CBLL),
Trivikraman Thampy, John Leahy, Andrew Caplin (Economics Dept, NYU).
- Description: We are developing a new type of relational
graphical model that can be applied to "structured regression"
problems. A prime example of a structured regression problem is the
prediction of house prices. The price of a house depends not only on
the characteristics of the house itself, but also on the prices of similar
houses in the neighborhood, or perhaps on hidden features of the
neighborhood that influence them. Our relational regression model
infers a hidden "desirability surface" from which house prices
are predicted.
130.
Sumit Chopra, Trivikraman Thampy, John Leahy, Andrew Caplin and Yann LeCun: Discovering the hidden structure of house prices with non-parametric latent manifold model, Proc. Knowledge Discovery in Databases (KDD'07), 2007, \cite{chopra-kdd-07}.
A majority of supervised learning algorithms process an input point
independently of other points, based on the assumption that
the input data is sampled independently and identically from a fixed
underlying distribution. However, in a number of real-world problems,
the values of the variables associated with each sample depend not only on
features specific to the sample, but also on the features and variables of
other samples related to it. We say that such data possesses a
relational structure. Real estate prices are one such example.
The price of a house is a function of its individual features,
such as the number of bedrooms. In addition, the price is also
influenced by features of the neighborhood in which it lies.
Some of these features are measurable, such as
the quality of the local schools. However, most of the features that
make a particular neighborhood desirable are very difficult to measure
directly and are merely reflected in the prices of houses in that
neighborhood. Hence the "desirability" of a location or neighborhood can
be modeled as a latent variable that
must be estimated as part of the learning process and efficiently
inferred for unseen samples.

A number of authors have recently proposed architectures and learning
algorithms that make use of relational structure. The earlier
techniques were based on the idea of influence
propagation [1, 3, 6, 5].
Probabilistic Relational Models (PRMs)
were introduced in [4, 2] as an extension of Bayesian
networks to relational data. Their discriminative extension,
called Relational Markov Networks (RMNs), was later
proposed in [7].
This paper introduces a general framework for prediction in relational
data. An architecture is presented that allows efficient inference algorithms
for continuous variables with relational dependencies.
The class of models introduced is novel in several ways:
1. it pertains to relational regression problems in which the answer variable is continuous;
2. it allows inter-sample dependencies through hidden variables as well as through the answer variables;
3. it allows log-likelihood functions that are non-linear in the parameters (non-exponential family), which leads to non-convex loss functions but is considerably more flexible;
4. it eliminates the intractable partition function problem through appropriate design of the relational and non-relational factors.

The idea behind a relational factor graph is to have a single factor
graph that models the entire collection of data samples and their
dependencies.
The relationships between samples are captured by the factors that
connect the variables associated with multiple samples. We are
given a set of N training samples, each of which is described
by a sample-specific feature vector Xi and an answer to be
predicted Yi. Let the collection of input variables be denoted
by X = {Xi, i = 1 … N}, the output variables by
Y = { Yi, i = 1 … N}, and the latent variables by Z.
The energy-based relational factor graph (EBRFG) is defined by an energy function of the form
E(W,Z, Y,X) = E(W,Z, Y1, …, YN, X1, …, XN),
in which W is the set of parameters to be estimated by learning.
Given a test sample feature vector X0, the model is used to predict
the value of the corresponding answer variable Y0.
One way to do this is by minimizing the following energy function
augmented with the test sample (X0, Y0):

Y0* = argminY0 E(W, Z, Y0, …, YN, X0, …, XN).    (1)
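As a concrete illustration of Eq. (1), the sketch below appends a single test sample to the collection and minimizes a toy energy over its answer variable Y0. The toy energy, its feature-based and relational terms, and all numerical values are stand-ins chosen for illustration; they are not the model defined in this paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def toy_energy(y0, x0, neighbor_prices):
    """Stand-in energy: Y0 should agree both with a feature-based guess
    (the non-relational term) and with the prices of related samples
    (the relational term)."""
    feature_guess = 100.0 * x0.sum()              # placeholder parametric term
    relational_guess = np.mean(neighbor_prices)   # placeholder relational term
    return (y0 - feature_guess) ** 2 + (y0 - relational_guess) ** 2

x0 = np.array([3.0, 1.5, 0.2])                     # made-up features of the test house
neighbor_prices = np.array([410.0, 395.0, 450.0])  # made-up prices of related houses

# Y0* = argmin over Y0 of the energy augmented with the test sample, as in Eq. (1).
result = minimize_scalar(toy_energy, args=(x0, neighbor_prices))
y0_star = result.x
```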
For it to be usable on new test samples without requiring excessive work,
the energy function must be carefully constructed in such a way that the
addition of a new sample to the arguments will not require re-training
the entire system, or re-estimating some high-dimensional hidden
variables. Moreover, the parameterization must be designed in such a
way that its estimation on the training samples will actually result in
good prediction on test samples. Training an EBRFG can be
performed by minimizing the negative log
conditional probability of the answer variables with respect to the
parameters W. We propose an efficient training and inference
algorithm for the general model.

The architecture of the factor graph used for predicting
the prices of real estate properties is shown in Figure 1 (top).
The price of a house is modeled as a product of two quantities:
1. its "intrinsic" price which is dependent only on its individual features,
and 2. the desirability of its location.
A pair of factors Exyzi and
Ezzi are associated with every house. Exyzi is non-relational
and captures the sample specific dependencies. It is modeled as a
parametric function with learnable parameters Wxyz.
The parameters Wxyz are shared across all
the instances of Exyzi. The factor Ezzi is relational and
captures the dependencies between the samples via the
"hidden" variables Zi.
These dependencies influence the answer
for a sample through the intermediary hidden variable di. The variables
Zi can be interpreted as the desirability of the location of the i-th
house, and di can be viewed as the estimated desirability of the house
using the desirabilities of the houses related to it (those that lie in its
vicinity). This factor is modeled as a non-parametric function. In
particular, we use locally weighted linear regression with weights
given by a Gaussian kernel.
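As a rough sketch of such a relational factor, the code below implements a Gaussian-kernel locally weighted linear regression that estimates a house's desirability from the hidden desirabilities Z of nearby houses. The function names, the bandwidth, and the choice of 2-D locations as the regression inputs are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def gaussian_weights(query_loc, neighbor_locs, bandwidth=1.0):
    """Gaussian-kernel weights based on the distance to the query location."""
    d2 = np.sum((neighbor_locs - query_loc) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def desirability_estimate(query_loc, neighbor_locs, neighbor_z, bandwidth=1.0):
    """Locally weighted linear regression of the neighbors' hidden
    desirabilities Z on their locations, evaluated at the query location."""
    w = gaussian_weights(query_loc, neighbor_locs, bandwidth)
    # Design matrix: intercept column plus the neighbor locations.
    A = np.hstack([np.ones((len(neighbor_locs), 1)), neighbor_locs])
    # Weighted least squares: minimize sum_j w_j * (A_j . beta - z_j)^2.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * neighbor_z, rcond=None)
    return np.array([1.0, *query_loc]) @ beta

# Example: estimate the desirability at a query location from three neighbors.
locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([0.8, 1.2, 1.0])
d = desirability_estimate(np.array([0.5, 0.5]), locs, z)
```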
The model is trained by maximizing the likelihood of the training
data, which is realized by minimizing the negative log likelihood function
with respect to W and Z. However, we show that this minimization reduces to

L(W,Z) = Σi (Yi − (G(Wxyz, Xi) + H(ZNi, Xi)))² + R(Z),    (2)
where G(Wxyz, ·) denotes the parametric factor producing the intrinsic price,
H(ZNi, Xi) is the relational estimate of the desirability of sample i computed
from the hidden variables of its neighbors Ni, and R(Z) is a regularizer on Z
(an L2 regularizer in the experiments).
This is achieved by applying a type of deterministic generalized EM algorithm.
It consists of iterating through two phases. Phase 1 involves
keeping the parameters W fixed and minimizing the loss with respect to Z.
The loss is quadratic in Z, and we show that its minimization
reduces to solving a large-scale sparse quadratic system.
We use a conjugate gradient method with an adaptive threshold to solve it.
Phase 2 involves fixing the hidden variables Z
and minimizing with respect to W. Since the parameters W are shared
among the factors, this can be done using gradient descent.
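A minimal sketch of this two-phase procedure is given below. For brevity, the parametric factor G is reduced to a linear model and the relational factor H to a fixed sparse neighbor-averaging matrix M, so that Phase 1 becomes a sparse linear system solved by conjugate gradients and Phase 2 becomes ordinary least squares; the actual model uses a nonlinear parametric factor trained by gradient descent and the locally weighted regression factor described above. All names and the regularization weight are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

def fit_relational_sketch(X, Y, M, lam=1.0, n_rounds=10):
    """Alternating (generalized-EM-style) minimization of
    L(w, Z) = sum_i (Y_i - (X_i @ w + (M @ Z)_i))^2 + lam * ||Z||^2.

    X : (N, d) house features, Y : (N,) prices,
    M : (N, N) sparse matrix averaging Z over each sample's neighbors."""
    N, d = X.shape
    w = np.zeros(d)
    Z = np.zeros(N)
    for _ in range(n_rounds):
        # Phase 1: fix W, minimize over Z. The loss is quadratic in Z, so Z
        # solves the sparse system (M^T M + lam I) Z = M^T (Y - X w),
        # which we solve with conjugate gradients.
        A = (M.T @ M + lam * sparse.identity(N)).tocsc()
        b = M.T @ (Y - X @ w)
        Z, _ = cg(A, b, x0=Z)
        # Phase 2: fix Z, minimize over the shared parameters W. For the
        # linear stand-in this is ordinary least squares; the paper instead
        # uses gradient descent on a nonlinear parametric factor.
        w, *_ = np.linalg.lstsq(X, Y - M @ Z, rcond=None)
    return w, Z
```

In this sketch the matrix M plays the role of the neighborhood structure Ni; it could, for instance, be built from the k nearest houses of each training sample.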
Inference on a new sample X0 involves
computing its neighboring training samples and using the learnt values
of their hidden variables ZN0 to obtain an estimate of its
"desirability" d0; passing the house-specific features of X0 through the learnt
parametric model to obtain its "intrinsic" price; and
combining the two to produce the predicted price.
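Under the same simplifications as in the training sketch above, this inference step could look as follows: a kernel-weighted average of the learnt Z values of nearby training houses stands in for the locally weighted regression estimate of d0, a linear model stands in for the parametric "intrinsic" price, and the two are combined as a product, following the factor-graph description above. All names are placeholders.

```python
import numpy as np

def predict_price_sketch(x0, loc0, w, train_locs, Z, bandwidth=1.0):
    """Sketch of inference on a new house: estimate its desirability d0 from
    the learnt hidden variables of nearby training houses, compute an
    'intrinsic' price from the (linear stand-in) parametric factor, and
    combine the two into a predicted price."""
    d2 = np.sum((train_locs - loc0) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian-kernel weights
    d0 = k @ Z / k.sum()                       # kernel-weighted desirability estimate
    intrinsic = x0 @ w                         # 'intrinsic' price from house features
    return intrinsic * d0                      # price = intrinsic price x desirability
```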
The model was trained and tested on a challenging and diverse real-world
dataset that included 42,025 sale transactions of houses in
Los Angeles County in the year 2004.
Each house was described by a set of 18 house-specific
attributes, such as GPS coordinates, living area,
year built, number of bedrooms, etc.
In addition, for each house, a number of neighborhood-specific
attributes obtained from census tract data and school district data were
also used. These included attributes such as the
average household income of the area, the percentage of owner-occupied
homes, etc.
The performance of the proposed model
was compared with a number of standard non-relational techniques
that have been used in the literature for this problem, namely
nearest neighbor, locally weighted regression, linear regression,
and a fully connected neural network.
The EBRFG gives by far the best prediction accuracy among these
models (Table 1). In addition, we plot the
"desirabilities" learnt by our model (Figure 1 (bottom)).
The plot shows that the model is actually able
to learn the "desirabilities" of various areas in a way that is
reflective of the real world situation.
References
[1] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD '98, pages 307–318, 1998.
[2] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proc. IJCAI '99, pages 1300–1309, 1999.
[3] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[4] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI '98, pages 580–587, 1998.
[5] J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI '00 Workshop on Learning Statistical Models from Relational Data, pages 13–20, 2000.
[6] S. Slattery and T. Mitchell. Discovering test set regularities in relational domains. In Proc. ICML '00, pages 895–902, 2000.
[7] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. Uncertainty in Artificial Intelligence (UAI '02), 2002.
Table 1: Prediction accuracies of various algorithms on the test set.
Absolute Relative Forecasting Error (fe) was measured. The error (fei)
on the ith sample is defined as
fei = |Ai − Pri|/Ai,
where Ai is the actual selling price and Pri is the predicted price. Three
performance quantities on the test set are reported: the percentage of houses
with a forecasting error of less than 5%, less than 10%, and less than 15%.
Model Class | Model | < 5% | < 10% | < 15% |
non-parametric | Nearest Neighbor | 25.41 | 47.44 | 64.72 |
non-parametric | Locally Weighted Regression | 32.98 | 58.46 | 75.21 |
parametric | Linear Regression | 26.58 | 48.11 | 65.12 |
parametric | Fully Connected Neural Network | 33.79 | 60.55 | 76.47 |
hybrid | Relational Factor Graph | 39.47 | 65.76 | 81.04 |
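For reference, the error measure and the three summary quantities reported in Table 1 can be computed as in the following sketch; the array names are placeholders.

```python
import numpy as np

def forecasting_error_summary(actual, predicted):
    """Absolute relative forecasting error fe_i = |A_i - Pr_i| / A_i and the
    percentage of test houses with error below 5%, 10%, and 15%."""
    fe = np.abs(actual - predicted) / actual
    return {f"< {int(t * 100)}%": 100.0 * np.mean(fe < t) for t in (0.05, 0.10, 0.15)}
```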
Figure 1: (Top) A typical energy-based relational factor
graph showing the connections between three samples. The factors
Exyzi capture the dependencies between the features of
individual samples and their answer variables Yi, as well as the
dependence on the local latent variables di.
The factors Ezzi capture the dependencies between the
hidden variables of multiple samples. The connections to these two
factors may exist only from a subset of samples that are related to
sample i.
(Bottom) The color-coded values of the desirability surface at
the locations of the test samples. For every test sample, the
estimate of its desirability is computed and color-coded according
to its value. Blue indicates low desirability and red indicates high
desirability.