- Time Period: September 2006 - present.
- Participants: Sumit Chopra, Yann LeCun (Courant Institute/CBLL),
Trivikraman Thampy, John Leahy, Andrew Caplin (Economics Dept, NYU).
- Description: We are developing a new type of relational
graphical model that can be applied to "structured regression"
problems. A prime example of a structured regression problem is the
prediction of house prices. The price of a house depends not only on
the characteristics of the house itself, but also on the prices of similar
houses in the neighborhood, or perhaps on hidden features of the
neighborhood that influence them. Our relational regression model
infers a hidden "desirability surface" from which house prices
are predicted.
130.
Sumit Chopra, Trivikraman Thampy, John Leahy, Andrew Caplin and Yann LeCun: Discovering the hidden structure of house prices with non-parametric latent manifold model, Proc. Knowledge Discovery in Databases (KDD'07), 2007, \cite{chopra-kdd-07}.
A majority of supervised learning algorithms process an input point
independently of other points, based on the assumption that
the input data is sampled independently and identically from a fixed
underlying distribution. However, in a number of real-world problems,
the values of the variables associated with each sample depend not only on
features specific to the sample, but also on the features and variables of
other samples related to it. We say that such data possesses a
relational structure. Real estate prices are one such example.
The price of a house is a function of its individual features,
such as the number of bedrooms. In addition, the price is also
influenced by features of the neighborhood in which it lies.
Some of these features are measurable, such as
the quality of the local schools. However, most of the features that
make a particular neighborhood desirable are very difficult to measure
directly and are merely reflected in the prices of houses in that
neighborhood. Hence the "desirability" of a location or neighborhood can
be modeled as a latent variable that
must be estimated as part of the learning process and efficiently
inferred for unseen samples.

A number of authors have recently proposed architectures and learning
algorithms that make use of relational structure. The earlier
techniques were based on the idea of influence
propagation [1, 3, 6, 5].
Probabilistic Relational Models (PRMs)
were introduced in [4, 2] as an extension of Bayesian
networks to relational data. Their discriminative extension,
called Relational Markov Networks (RMNs), was later
proposed in [7].
This paper introduces a general framework for prediction in relational
data. An architecture is presented that allows efficient inference algorithms
for continuous variables with relational dependencies.
The class of models introduced is novel in several ways:
1. it pertains to relational regression problems in which the answer variable is continuous;
2. it allows inter-sample dependencies through hidden variables as well as through the answer variables;
3. it allows log-likelihood functions that are non-linear in the parameters (non-exponential family), which leads to non-convex loss functions but is considerably more flexible;
4. it eliminates the intractable partition function problem through appropriate design of the relational and non-relational factors.

The idea behind a relational factor graph is to have a single factor
graph that models the entire collection of data samples and their
dependencies.
The relationships between samples are captured by the factors that
connect the variables associated with multiple samples. We are
given a set of N training samples, each of which is described
by a sample-specific feature vector Xi and an answer to be
predicted Yi. Let the collection of input variables be denoted
by X = {Xi, i = 1 … N}, the output variables by
Y = { Yi, i = 1 … N}, and the latent variables by Z.
The energy-based relational factor graph (EBRFG) is defined by an energy function of the form
E(W,Z, Y,X) = E(W,Z, Y1, …, YN, X1, …, XN),
in which W is the set of parameters to be estimated by learning.
Given a test sample feature vector X0, the model is used to predict
the value of the corresponding answer variable Y0.
One way to do this is by minimizing the following energy function
augmented with the test sample (X0, Y0):

Y0* = argminY0 E(W, Z, Y0, …, YN, X0, …, XN).    (1)
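As a concrete illustration of Eq. (1), the sketch below appends a single test sample to the collection and minimizes a toy energy over its answer variable Y0. The toy energy, its feature-based and relational terms, and all numerical values are stand-ins chosen for illustration; they are not the model defined in this paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def toy_energy(y0, x0, neighbor_prices):
    """Stand-in energy: Y0 should agree both with a feature-based guess
    (the non-relational term) and with the prices of related samples
    (the relational term)."""
    feature_guess = 100.0 * x0.sum()              # placeholder parametric term
    relational_guess = np.mean(neighbor_prices)   # placeholder relational term
    return (y0 - feature_guess) ** 2 + (y0 - relational_guess) ** 2

x0 = np.array([3.0, 1.5, 0.2])                     # made-up features of the test house
neighbor_prices = np.array([410.0, 395.0, 450.0])  # made-up prices of related houses

# Y0* = argmin over Y0 of the energy augmented with the test sample, as in Eq. (1).
result = minimize_scalar(toy_energy, args=(x0, neighbor_prices))
y0_star = result.x
```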
For it to be usable on new test samples without requiring excessive work,
the energy function must be carefully constructed in such a way that the
addition of a new sample to the arguments will not require re-training
the entire system, or re-estimating some high-dimensional hidden
variables. Moreover, the parameterization must be designed in such a
way that its estimation on the training samples will actually result in
good prediction on test samples. Training an EBRFG can be
performed by minimizing the negative log
conditional probability of the answer variables with respect to the
parameters W. We propose an efficient training and inference
algorithm for the general model.

The architecture of the factor graph used for predicting
the prices of real estate properties is shown in Figure 1 (top).
The price of a house is modeled as a product of two quantities:
1. its "intrinsic" price which is dependent only on its individual features,
and 2. the desirability of its location.
A pair of factors Exyzi and
Ezzi are associated with every house. Exyzi is non-relational
and captures the sample specific dependencies. It is modeled as a
parametric function with learnable parameters Wxyz.
The parameters Wxyz are shared across all
the instances of Exyzi. The factor Ezzi is relational and
captures the dependencies between the samples via the
"hidden" variables Zi.
These dependencies influence the answer
for a sample through the intermediary hidden variable di. The variables
Zi can be interpreted as the desirability of the location of the i-th
house, and di can be viewed as the estimated desirability of the house
using the desirabilities of the houses related to it (those that lie in its
vicinity). This factor is modeled as a non-parametric function. In
particular, we use locally weighted linear regression with weights
given by a Gaussian kernel.
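As a rough sketch of such a relational factor, the code below implements a Gaussian-kernel locally weighted linear regression that estimates a house's desirability from the hidden desirabilities Z of nearby houses. The function names, the bandwidth, and the choice of 2-D locations as the regression inputs are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def gaussian_weights(query_loc, neighbor_locs, bandwidth=1.0):
    """Gaussian-kernel weights based on the distance to the query location."""
    d2 = np.sum((neighbor_locs - query_loc) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def desirability_estimate(query_loc, neighbor_locs, neighbor_z, bandwidth=1.0):
    """Locally weighted linear regression of the neighbors' hidden
    desirabilities Z on their locations, evaluated at the query location."""
    w = gaussian_weights(query_loc, neighbor_locs, bandwidth)
    # Design matrix: intercept column plus the neighbor locations.
    A = np.hstack([np.ones((len(neighbor_locs), 1)), neighbor_locs])
    # Weighted least squares: minimize sum_j w_j * (A_j . beta - z_j)^2.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * neighbor_z, rcond=None)
    return np.array([1.0, *query_loc]) @ beta

# Example: estimate the desirability at a query location from three neighbors.
locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([0.8, 1.2, 1.0])
d = desirability_estimate(np.array([0.5, 0.5]), locs, z)
```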
The model is trained by maximizing the likelihood of the training
data, which is realized by minimizing the negative log likelihood function
with respect to W and Z. However, we show that this minimization reduces to

L(W,Z) = Σi (Yi − (G(Wxyz, Xi) + H(ZNi, Xi)))² + R(Z),    (2)
where G(Wxyz, ·) denotes the parametric factor producing the intrinsic price,
H(ZNi, Xi) is the relational estimate of the desirability of sample i computed
from the hidden variables of its neighbors Ni, and R(Z) is a regularizer on Z
(an L2 regularizer in the experiments).
This is achieved by applying a type of deterministic generalized EM algorithm.
It consists of iterating through two phases. Phase 1 involves
keeping the parameters W fixed and minimizing the loss with respect to Z.
The loss is quadratic in Z, and we show that its minimization
reduces to solving a large-scale sparse quadratic system.
We use a conjugate gradient method with an adaptive threshold to solve it.
Phase 2 involves fixing the hidden variables Z
and minimizing with respect to W. Since the parameters W are shared
among the factors, this can be done using gradient descent.
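A minimal sketch of this two-phase procedure is given below. For brevity, the parametric factor G is reduced to a linear model and the relational factor H to a fixed sparse neighbor-averaging matrix M, so that Phase 1 becomes a sparse linear system solved by conjugate gradients and Phase 2 becomes ordinary least squares; the actual model uses a nonlinear parametric factor trained by gradient descent and the locally weighted regression factor described above. All names and the regularization weight are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

def fit_relational_sketch(X, Y, M, lam=1.0, n_rounds=10):
    """Alternating (generalized-EM-style) minimization of
    L(w, Z) = sum_i (Y_i - (X_i @ w + (M @ Z)_i))^2 + lam * ||Z||^2.

    X : (N, d) house features, Y : (N,) prices,
    M : (N, N) sparse matrix averaging Z over each sample's neighbors."""
    N, d = X.shape
    w = np.zeros(d)
    Z = np.zeros(N)
    for _ in range(n_rounds):
        # Phase 1: fix W, minimize over Z. The loss is quadratic in Z, so Z
        # solves the sparse system (M^T M + lam I) Z = M^T (Y - X w),
        # which we solve with conjugate gradients.
        A = (M.T @ M + lam * sparse.identity(N)).tocsc()
        b = M.T @ (Y - X @ w)
        Z, _ = cg(A, b, x0=Z)
        # Phase 2: fix Z, minimize over the shared parameters W. For the
        # linear stand-in this is ordinary least squares; the paper instead
        # uses gradient descent on a nonlinear parametric factor.
        w, *_ = np.linalg.lstsq(X, Y - M @ Z, rcond=None)
    return w, Z
```

In this sketch the matrix M plays the role of the neighborhood structure Ni; it could, for instance, be built from the k nearest houses of each training sample.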
Inference on a new sample X0 involves
computing its neighboring training samples and using the learnt values
of their hidden variables ZN0 to obtain an estimate of its
"desirability" d0; passing the house-specific features of X0 through the learnt
parametric model to obtain its "intrinsic" price; and
combining the two to produce the predicted price.
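Under the same simplifications as in the training sketch above, this inference step could look as follows: a kernel-weighted average of the learnt Z values of nearby training houses stands in for the locally weighted regression estimate of d0, a linear model stands in for the parametric "intrinsic" price, and the two are combined as a product, following the factor-graph description above. All names are placeholders.

```python
import numpy as np

def predict_price_sketch(x0, loc0, w, train_locs, Z, bandwidth=1.0):
    """Sketch of inference on a new house: estimate its desirability d0 from
    the learnt hidden variables of nearby training houses, compute an
    'intrinsic' price from the (linear stand-in) parametric factor, and
    combine the two into a predicted price."""
    d2 = np.sum((train_locs - loc0) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian-kernel weights
    d0 = k @ Z / k.sum()                       # kernel-weighted desirability estimate
    intrinsic = x0 @ w                         # 'intrinsic' price from house features
    return intrinsic * d0                      # price = intrinsic price x desirability
```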
The model was trained and tested on a challenging and diverse real-world
dataset that included 42,025 sale transactions of houses in
Los Angeles County in the year 2004.
Each house was described by a set of 18 house-specific
attributes, such as GPS coordinates, living area,
year built, number of bedrooms, etc.
In addition, for each house, a number of neighborhood-specific
attributes obtained from census tract data and school district data were
also used. These included attributes such as the
average household income of the area, the percentage of owner-occupied
homes, etc.
The performance of the proposed model
was compared with a number of standard non-relational techniques
that have been used in the literature for this problem, namely
nearest neighbor, locally weighted regression, linear regression,
and a fully connected neural network.
The EBRFG gives by far the best prediction accuracy among these
models (Table 1). In addition, we plot the
"desirabilities" learnt by our model (Figure 1 (bottom)).
The plot shows that the model is actually able
to learn the "desirabilities" of various areas in a way that is
reflective of the real world situation.
References
[1] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD '98, pages 307–318, 1998.
[2] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proc. IJCAI '99, pages 1300–1309, 1999.
[3] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[4] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI '98, pages 580–587, 1998.
[5] J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI '00 Workshop on Learning Statistical Models from Relational Data, pages 13–20, 2000.
[6] S. Slattery and T. Mitchell. Discovering test set regularities in relational domains. In Proc. ICML '00, pages 895–902, 2000.
[7] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. Uncertainty in Artificial Intelligence (UAI '02), 2002.
Table 1: Prediction accuracies of various algorithms on the test set.
Absolute Relative Forecasting Error (fe) was measured. The error (fei)
on the ith sample is defined as
fei = |Ai − Pri|/Ai,
where Ai is the actual selling price and Pri is the predicted price. Three
performance quantities on the test set are reported: the percentage of houses
with a forecasting error of less than 5%, less than 10%, and less than 15%.
Model Class | Model | < 5% | < 10% | < 15% |
non-parametric | Nearest Neighbor | 25.41 | 47.44 | 64.72 |
non-parametric | Locally Weighted Regression | 32.98 | 58.46 | 75.21 |
parametric | Linear Regression | 26.58 | 48.11 | 65.12 |
parametric | Fully Connected Neural Network | 33.79 | 60.55 | 76.47 |
hybrid | Relational Factor Graph | 39.47 | 65.76 | 81.04 |
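For reference, the error measure and the three summary quantities reported in Table 1 can be computed as in the following sketch; the array names are placeholders.

```python
import numpy as np

def forecasting_error_summary(actual, predicted):
    """Absolute relative forecasting error fe_i = |A_i - Pr_i| / A_i and the
    percentage of test houses with error below 5%, 10%, and 15%."""
    fe = np.abs(actual - predicted) / actual
    return {f"< {int(t * 100)}%": 100.0 * np.mean(fe < t) for t in (0.05, 0.10, 0.15)}
```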
Figure 1: (Top) A typical energy-based relational factor
graph showing the connections between three samples. The factors
Exyzi capture the dependencies between the features of
individual samples and their answer variables Yi, as well as the
dependence on the local latent variables di.
The factors Ezzi capture the dependencies between the
hidden variables of multiple samples. The connections to these two
factors may exist only from a subset of samples that are related to
sample i.
(Bottom) The color-coded values of the desirability surface at
the locations of the test samples. For every test sample, the
estimate of its desirability is computed and color-coded according
to its value. Blue indicates low desirability and red indicates high
desirability.