G22.2591 - Advanced Natural Language Processing - Spring 2011

Lecture 6

March 1, 2011

Capturing Hyponymy, cont'd

Looking at some data .... searching for Hearst patterns
- Ngram tool demo
- discuss assignment

Using structured data on the web

The focus of the work by Hearst and by Snow et al. was on textual evidence of hyponymy.  Shinzato and Torisawa looked at a structured source:  list headings and list elements in Web pages.

Keiji Shinzato and Kentaro Torisawa.  Extracting Hyponyms of Prespecified Hypernyms from Itemizations and Headings in Web Documents. COLING 2004.

Hyponymy and entity sets

Systems for capturing hyponymy generally combine two approaches:  collecting the relations directly and augmenting them with words similar to the hyponyms (to improve coverage).  We saw examples already in the work of Snow et al.

Automatic entity set expansion

Using Linguistic evidence

Distributional similarity: based on the idea that words that occur in similar contexts have related meanings ("distributional hypothesis"). Create a context vector for each word, using as contexts either the adjacent words on each side or the governor in a dependency tree. Typical vector values are based on pointwise mutual information, log(p(word,context) / (p(word)*p(context))). Compute vector similarity, e.g. using the cosine metric.
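
The following is a minimal sketch of this procedure, not the implementation used in any of the cited papers. It assumes adjacent-word contexts with a window of one word on each side, estimates probabilities from raw co-occurrence counts, and keeps only positive PMI values (a common simplification). The function names pmi_vectors and cosine are illustrative.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(sentences, window=1):
    """Build a PMI-weighted context vector for each word in tokenized sentences."""
    pair_counts = Counter()   # (word, context) co-occurrence counts
    word_counts = Counter()   # how often each word appears as a target
    ctx_counts = Counter()    # how often each word appears as a context
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
                    word_counts[word] += 1
                    ctx_counts[tokens[j]] += 1
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (word, ctx), n in pair_counts.items():
        # PMI = log( p(word, ctx) / (p(word) * p(ctx)) )
        p_joint = n / total
        p_word = word_counts[word] / total
        p_ctx = ctx_counts[ctx] / total
        pmi = math.log(p_joint / (p_word * p_ctx))
        if pmi > 0:  # assumption: discard negative associations
            vectors[word][ctx] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = pmi_vectors([["cat", "purrs"], ["dog", "barks"]])
similarity = cosine(vectors["cat"], vectors["dog"])  # no shared contexts
```

In this toy corpus "cat" and "dog" share no contexts, so their similarity is zero; on real data the vectors would be built from a large corpus.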

(Word similarity measures are discussed in Jurafsky and Martin, section 20.7, pp. 658-667.)

(Patrick Pantel et al. 2009: Web-scale distributional similarity and entity set expansion. Proc. EMNLP 2009.)

Lexico-semantic patterns:  extract co-ordinate terms from text patterns "X, Y, and/or Z"
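
As a rough illustration of the pattern-based approach, the sketch below uses a regular expression to pull coordinate terms out of "X, Y, and/or Z" constructions and counts how often each pair is coordinated. It is an assumption-laden simplification: it handles only single-word items, requires at least two comma-separated items before the conjunction, and does no part-of-speech filtering. The names COORD and coordinate_pairs are illustrative.

```python
import re
from collections import Counter
from itertools import combinations

# Two or more comma-separated words, then "and" or "or", then a final word.
COORD = re.compile(r"\b(\w+(?:, \w+)+),? (?:and|or) (\w+)")

def coordinate_pairs(text):
    """Count unordered pairs of words that appear in the same coordination."""
    pairs = Counter()
    for match in COORD.finditer(text):
        items = match.group(1).split(", ") + [match.group(2)]
        for a, b in combinations(sorted(set(items)), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = coordinate_pairs(
    "We visited Paris, London, and Rome. He likes apples, pears or plums."
)
```

Pairs with high counts are candidates for membership in the same entity class.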

(Sarmento, L.; Jijkoun, V.; de Rijke, M.; and Oliveira, E. “More like these”: growing entity classes from seeds. In Proceedings of CIKM-2007.)

Lexical resources (WordNet, thesaurus.com, Moby thesaurus)

Using Structured evidence

Google sets (labs.google.com/sets) is a patented approach (patent 7,350,187) for searching lists on the web to find related items.  When the user inputs a set of seeds, the collected lists are searched for the seed items;  then the lists are weighted and the items on the lists merged.
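
The search, weight, and merge steps can be sketched as follows. The weighting shown here (the fraction of seeds a list contains) is a hypothetical choice for illustration; the notes do not say how Google sets actually weights lists. The function name expand_from_lists is likewise illustrative.

```python
from collections import defaultdict

def expand_from_lists(seeds, lists):
    """Rank non-seed items by the summed weight of the lists they appear on."""
    seeds = set(seeds)
    scores = defaultdict(float)
    for items in lists:
        items = set(items)
        overlap = len(seeds & items)
        if overlap == 0:
            continue  # the list contains no seed, so it contributes nothing
        weight = overlap / len(seeds)  # hypothetical weighting scheme
        for item in items - seeds:
            scores[item] += weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = expand_from_lists(
    ["red", "green"],
    [["red", "green", "blue"], ["red", "apple"], ["cat", "dog"]],
)
```

Here "blue" outranks "apple" because it appears on a list that contains both seeds.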

SEAL (Richard Wang and William Cohen. Language-Independent Set Expansion of Named Entities using the Web in ICDM-2007) sought to generalize and improve on Google sets.  It begins by using the seeds as search terms to retrieve a set of web pages.  It then builds a page-specific wrapper for the list items (so it is not dependent on a specific list-markup format), and uses the wrapper to extract the items from each page.  These are then combined by a ranker.
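
The idea of a page-specific wrapper can be illustrated with a much-simplified sketch: take the longest character string that precedes every seed on the page and the longest string that follows every seed, then extract whatever else appears between the same two strings. This is only an approximation in the spirit of the description above, not the published algorithm. It assumes each seed is located by its first occurrence only, and the names build_wrapper and extract_items are illustrative.

```python
import os
import re

def build_wrapper(page, seeds):
    """Return (left, right) contexts shared by all seeds, or None if there are none."""
    lefts, rights = [], []
    for seed in seeds:
        pos = page.find(seed)  # simplification: first occurrence only
        if pos < 0:
            return None
        # Reverse the left text so a common suffix becomes a common prefix.
        lefts.append(page[:pos][::-1])
        rights.append(page[pos + len(seed):])
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    if not left or not right:
        return None
    return left, right

def extract_items(page, wrapper):
    """Extract every string bracketed by the wrapper's left and right contexts."""
    left, right = wrapper
    # Lookarounds let neighbouring items share the markup between them.
    pattern = "(?<=" + re.escape(left) + ")(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)

page = (
    '<li class="c">Ford</li>\n'
    '<li class="c">Toyota</li>\n'
    '<li class="c">Honda</li>\n'
    '<li class="c">Nissan</li>\n'
)
wrapper = build_wrapper(page, ["Ford", "Toyota"])
items = extract_items(page, wrapper)
```

On this toy page the wrapper recovers "Honda" in addition to the two seeds. It misses "Nissan" because the right context learned from the seeds includes the opening markup of the next item, which the last item lacks; this is a limitation of the simplification.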

ASIA (Richard Wang and William W. Cohen.  Automatic Set Instance Extraction using the Web.  ACL-IJCNLP 2009) combines an improved version of SEAL with Hearst-style patterns for finding hyponyms.

Evaluation

If a procedure produces a single, fixed expansion of the seed set, its performance can be measured in terms of recall and precision (and F measure).
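
For a fixed expansion, the three measures can be computed directly from the predicted and true sets. The sketch below assumes the balanced F measure (F1); the function name is illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F measure of a predicted entity set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In this example two of the four predicted items are correct and two of the three true items are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.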

Most procedures, however, produce ranked lists of entities of (nearly) arbitrary length. This is similar to the situation with document retrieval, and it is possible to produce a recall-precision curve as a result.

If a single number is desired, the most popular metric is Average Precision: the mean of the precision values at each rank where a correct entity appears. An alternate measure is R-precision = P(R), the precision of the top R items, where R is the number of true entities.
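
Both single-number measures are simple to compute from a ranked list and a gold set. The sketch below uses the standard information-retrieval definitions and assumes the ranked list contains no duplicates; gold items that never appear in the list contribute zero to Average Precision. The function names are illustrative.

```python
def average_precision(ranked, gold):
    """Mean of the precision at each rank where a gold item appears."""
    gold = set(gold)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(gold) if gold else 0.0

def r_precision(ranked, gold):
    """Precision of the top R items, where R is the size of the gold set."""
    gold = set(gold)
    r = len(gold)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in gold) / r

ranked = ["a", "x", "b", "y", "c"]
gold = {"a", "b", "c"}
ap = average_precision(ranked, gold)  # (1/1 + 2/3 + 3/5) / 3
rp = r_precision(ranked, gold)        # 2 of the top 3 are correct
```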

Performance may depend greatly on the set of seeds chosen. A typical approach is to select seeds at random from the sets of true entities.

Putting the pieces together

Marius Pasca and Benjamin Van Durme
Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs
ACL-2008.

(presentation by Bonan Min)

Semi-automatic methods for entity sets

The quality of automatic set expansion is limited by the fact that a small seed set may not uniquely define the concept to be captured. (It is also limited by the amount of evidence we can marshal for every word -- for rare words the system may not have access to enough data to properly classify the word.)

(Vishnu Vyas and Patrick Pantel.  Semi-automatic entity set refinement.  NAACL 2009.)

(David Vickrey, Oscar Kipersztok, and Daphne Koller.  An Active Learning approach to finding related terms.  ACL 2010.)


Looking ahead: identifying semantic relations (supervised methods)

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel Methods for Relation Extraction. J. Machine Learning Research 3 (2003) 1083-1106.