Problem of ambiguity (p. 372)
How fine should these classes be?
Range of answers ... different part-of-speech 'tag' sets
Penn Tag Set ... used to tag the University of Pennsylvania Treebank (1 million words) (p. 297)
The tagging task: determining the tag of each word
Not trivial: many common words have several tags
A dictionary will tell us which tags are possible for a word, independent of context.
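To make the dictionary idea concrete, here is a minimal sketch of a context-independent lookup. The lexicon entries and the names LEXICON, possible_tags, and ambiguous_words are illustrative assumptions, not taken from a real dictionary or from J&M.

```python
# Toy lexicon with Penn-style tags. Every entry is a hand-picked
# illustration (an assumption), not data from a real dictionary.
LEXICON = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "back": {"JJ", "NN", "RB", "VB"},
    "fish": {"NN", "VB"},
}

def possible_tags(word):
    """Return the set of tags the lexicon allows for a word, ignoring context."""
    return LEXICON.get(word.lower(), set())

def ambiguous_words(sentence):
    """List the words in a sentence that have more than one possible tag."""
    return [w for w in sentence.split() if len(possible_tags(w)) > 1]
```

The lookup only narrows the choice: for "can" it returns three candidates, and picking among them is the tagging task itself.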
We could parse the sentence, and see which tags are used in the parses, but that's an expensive
and difficult process (we might not always get a parse).
Instead, we develop separate part-of-speech taggers. Their output can:
Help parsing (reduce ambiguity).
Resolve pronunciation ambiguities (for text-to-speech).
Resolve semantic ambiguities.
Rule-based part-of-speech tagging (J&M 8.4)
Ex: Constraint-grammar tagger
Needs large tagged corpus for testing
Statistical part-of-speech tagger (J&M 8.5)
Needs large tagged corpus for training
Unigram statistics (assigning each word its most common part of speech) get us to about 90% accuracy
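A unigram tagger of this kind can be sketched in a few lines. The tiny tagged corpus below is made up for illustration, and the names TAGGED, train_unigram, tag_unigram, and the default tag "NN" for unseen words are assumptions of this sketch.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, Penn-style tags).
TAGGED = [
    ("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("full", "JJ"),
    ("we", "PRP"), ("can", "MD"), ("go", "VB"),
    ("you", "PRP"), ("can", "MD"), ("stay", "VB"),
]

def train_unigram(tagged_words):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, words, default="NN"):
    """Tag each word independently; unseen words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

MODEL = train_unigram(TAGGED)
```

On this toy data "can" occurs twice as MD and once as NN, so the tagger labels it MD everywhere, including in "the can", where NN is correct. Errors of this kind are what the unigram approach cannot fix.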
For greater accuracy, we need some information about adjacent words
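One simple way to use adjacent-word information is to condition each decision on the tag just assigned to the previous word, backing off to unigram counts when that context was never seen. The sketch below is a greedy left-to-right tagger under those assumptions; it is not the statistical tagger of J&M 8.5, and the corpus and all names in it are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged sentences (hypothetical data, Penn-style tags).
SENTENCES = [
    [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("full", "JJ")],
    [("we", "PRP"), ("can", "MD"), ("go", "VB")],
    [("you", "PRP"), ("can", "MD"), ("stay", "VB")],
]

def train(sentences):
    """Collect tag counts per word and per (previous tag, word) context."""
    by_word = defaultdict(Counter)
    by_context = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # start-of-sentence marker
        for word, tag in sent:
            by_word[word][tag] += 1
            by_context[(prev, word)][tag] += 1
            prev = tag
    return by_word, by_context

def tag(words, by_word, by_context, default="NN"):
    """Greedy tagging: use the previous tag if that context was seen,
    otherwise back off to the word's most common tag, then to the default."""
    prev, out = "<s>", []
    for w in words:
        if (prev, w) in by_context:
            t = by_context[(prev, w)].most_common(1)[0][0]
        elif w in by_word:
            t = by_word[w].most_common(1)[0][0]
        else:
            t = default
        out.append((w, t))
        prev = t
    return out

BY_WORD, BY_CONTEXT = train(SENTENCES)
```

With the previous tag available, "can" after a determiner comes out as NN and "can" after a pronoun as MD, which word-by-word unigram counts alone cannot distinguish.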