Tomek Strzalkowski; Jin Wang. A Self-Learning
Universal Concept Spotter.
One of the earliest efforts at bootstrapping name categories from a
small set of seeds, using features based on words and bigrams
preceding, within, and following the name. Demonstrated good
results at finding organization names and moderately good results on
Michael Collins; Yoram Singer. Unsupervised Models
for Named Entity Classification. EMNLP 99.
(presentation by Michal Novemsky)
Silviu Cucerzan; David Yarowsky. Language
Independent Named Entity Recognition Combining Morphological and
Contextual Evidence. 1999.
Riloff, E. and Jones, R. (1999) "Learning
Dictionaries for Information Extraction by
Multi-Level Bootstrapping", Proceedings of the Sixteenth
National Conference on Artificial Intelligence (AAAI-99).
(presentation by Sunandan Chakraborty)
All of these efforts aim to produce taggers with moderate performance
starting with minimal resources. We showed that self-training can
also be used for improving a tagger which
has already been trained on a large annotated corpus. The
self-training strategy we used is quite simple: we tag a large corpus
with an HMM name tagger, and select those sentences for which the margin
-- the difference between the probability of the most likely analysis
and the second most likely analysis -- is large (a large margin
indicates confidence in the most likely analysis). We add the selected
sentences to the tagged
corpus and retrain the tagger. (Heng Ji and Ralph Grishman.
Data Selection in Semi-supervised Learning for Name Tagging
. ACL 06
Workshop on Information Extraction Beyond the Document.)
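The margin-based selection step can be sketched as follows. This is a minimal illustration, not the authors' code: the probabilities are toy values, and in a real system the top two scores would come from the HMM tagger's n-best output.

```python
# Sketch of margin-based sentence selection for self-training.
# A sentence is added to the training corpus only when the tagger's
# best analysis beats the runner-up by a wide margin.

def margin(nbest_scores):
    """Difference between the probabilities of the two best analyses."""
    top = sorted(nbest_scores, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

def select_confident(sentences, threshold):
    """Keep sentences whose tagging margin exceeds the threshold;
    these would be added to the tagged corpus before retraining."""
    return [s for s, scores in sentences if margin(scores) > threshold]

# Toy (sentence, n-best analysis probabilities) pairs
corpus = [
    ("Acme Corp hired Jones .", [0.90, 0.05]),   # confident tagging
    ("Rose met Violet .",       [0.40, 0.35]),   # ambiguous tagging
]
print(select_confident(corpus, threshold=0.3))
```

The threshold value here is arbitrary; in practice it would be tuned on held-out data.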
Self-training can also be used to update an old name tagger. If a
tagger is trained on an old annotated corpus, it will be missing many
of the names which appear in contemporary text. The tagger can be
effectively updated through self-training using contemporary
unannotated texts; this spares the effort of manually tagging new
text. (Cristina Mota and Ralph Grishman. Updating a
name tagger using contemporary unlabeled data. ACL 2009.)
Problems of self-training
Several problems arise with self-training ("bootstrapping"). One
problem is semantic drift: a procedure which is given a seed
consisting of one type of name gradually starts assigning that label
to names of another type.
This is particularly likely if the two sets of names intersect, such as
women's first names and names of flowers (Rose, Violet, ...).
More generally, these bootstrapping methods often lack a good stopping
criterion, and so tend to label too many examples.
This problem is less severe if we assign labels to all the
examples and learn all the types concurrently. For example,
capitalization is a fairly good clue to identifying names in English,
and most names in news text are people, organizations, or
locations. So concurrent self-training on these three types can
be quite effective.
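The idea of classes competing for examples can be sketched as below. This is an illustrative simplification under assumed scores, not any of the cited systems: a candidate name is labeled only when its best class clearly beats the runner-up, which is one way to limit drift between overlapping classes.

```python
# Sketch of class 'competition' in concurrent bootstrapping: training
# all classes together lets each class claim only the candidates it
# wins decisively. Scores are hypothetical; min_gap is illustrative.

def compete(class_scores, min_gap=0.2):
    """Return the winning class label, or None when no class wins
    clearly (assumes at least two candidate classes)."""
    ranked = sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_gap else None

print(compete({"PER": 0.70, "ORG": 0.20, "LOC": 0.10}))  # clear winner
print(compete({"PER": 0.45, "ORG": 0.40, "LOC": 0.15}))  # too close: abstain
```

Abstaining on close calls also gives the procedure a natural stopping behavior: as the remaining candidates become ambiguous, fewer new examples are labeled.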
Locally, we have investigated the problem of tagging technical terms,
where capitalization does not help in identification. We have shown the
benefit of 'competition', where we identify and add additional name
classes (and train on all the classes concurrently) in order to improve
the training of the original classes. (Roman Yangarber; Winston
Lin; Ralph Grishman. Unsupervised
Learning of Generalized Names
. COLING 2002.) A similar
approach was taken in Weighted Mutual Exclusion Bootstrapping (WMEB) by
McIntosh and Curran. The disadvantage of this approach is that it
requires a manual analysis and creation of these competing classes.
The manual creation of competing classes can be avoided by using
unsupervised term clustering (based on all the contexts in which a term
appears) to create these negative categories (Tara McIntosh, Unsupervised
discovery of negative categories in lexicon bootstrapping
, EMNLP 2010.)
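The clustering idea can be illustrated with a deliberately tiny sketch: group terms by their observed contexts, so that a cluster distinct from the seed category can serve as a negative category. The data, the single-feature "signature", and the function names are all hypothetical; a real system would cluster full context vectors.

```python
# Toy sketch: propose candidate (negative) categories by grouping
# terms that share contexts, instead of crafting competing classes
# by hand. Real systems cluster vectors over all observed contexts.
from collections import Counter

def context_signature(contexts):
    """Most frequent context word serves as a crude cluster key."""
    return Counter(contexts).most_common(1)[0][0]

# Hypothetical term -> observed context words
terms = {
    "aspirin":   ["take", "dose", "take"],
    "ibuprofen": ["take", "dose"],
    "Pfizer":    ["announced", "shares"],
}
clusters = {}
for term, ctx in terms.items():
    clusters.setdefault(context_signature(ctx), []).append(term)
print(clusters)  # drug-like terms separate from company-like terms
```

A cluster that does not match the seed category (here, the company-like terms) can then be trained on as a competing class without manual analysis.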
Unsupervised clustering has also proven effective to limit drift and
halt bootstrapping for relation discovery (Ang Sun and Ralph Grishman, Semi-supervised
Semantic Pattern Discovery with Guidance from Unsupervised Pattern