APFtoXML
The APFtoXML utility extracts information from an ACE APF file and
produces a file in which the selected information is marked by in-line
XML tags, such as <ENAMEX TYPE=type> for names. It is invoked by
xjet AceJet.APFtoXML year apf-directory output-directory filelist apf-extension output-extension [gazetteer pre-dictionary] flag flag ...
where
- year
- is one of 2003, 2004, or 2005, reflecting
the different APF formats used
- apf-directory
- is the directory which contains both
the text and apf files
- output-directory
- is the directory which will contain the files with in-line XML
tags
- filelist
- is a file containing a list of the documents to be processed, one
per line. Text and apf files are relative to apf-directory; output
files are relative to output-directory. If a line in this file is F,
the text file is read from F.sgm, the apf file is read from
F.apf-extension, and the output file is written to F.output-extension.
- apf-extension
- is the file extension for apf files (added to the document name)
- output-extension
- is the file extension for output files (added to the document name)
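The file-naming rule described for filelist can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not code taken from Jet: the function names read_document_names and resolve_paths are hypothetical, and skipping blank lines in the file list is an assumption.

```python
from pathlib import Path


def read_document_names(lines):
    """Return the document names from a file list, one per line.

    Blank lines are skipped (an assumption; the utility itself may differ).
    """
    return [line.strip() for line in lines if line.strip()]


def resolve_paths(apf_directory, output_directory, doc, apf_extension, output_extension):
    """Return (text file, apf file, output file) for a document name F.

    Follows the stated rule: F.sgm and F.apf-extension are relative to
    apf-directory, and F.output-extension is relative to output-directory.
    """
    text_file = Path(apf_directory) / f"{doc}.sgm"
    apf_file = Path(apf_directory) / f"{doc}.{apf_extension}"
    output_file = Path(output_directory) / f"{doc}.{output_extension}"
    return text_file, apf_file, output_file
```

The extensions are appended with an f-string rather than Path.with_suffix, so document names or extensions that themselves contain a dot are left intact.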
In the 2004 format, pre-nominals were tagged PRE whether or not they
were names, so additional information is required to identify names.
This information is provided by two additional files:
- gazetteer
- a Jet gazetteer, listing country and state names
- pre-dictionary
- a list of words, indicating for each word whether or not it is a name
- flag
- is one or more of sentences, timex, mentions, types, and names,
each indicating a type of information to be included in the output
files
sentences:
output <sentence> tags
timex:
output <timex2> tags
mentions:
output <mention entity=n> tags indicating co-reference relations
types:
include ACE type and subtype features with mention tags
names:
include ENAMEX tags
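As a rough illustration of the argument order and the allowed values described above, the following Python sketch assembles the argument list that follows the launcher. It is not part of Jet: build_args is a hypothetical helper, and the checks it performs (rejecting unknown years and flags, and requiring gazetteer and pre-dictionary to be given together) are assumptions drawn from the description rather than documented behavior of the utility.

```python
VALID_YEARS = {"2003", "2004", "2005"}
VALID_FLAGS = {"sentences", "timex", "mentions", "types", "names"}


def build_args(year, apf_directory, output_directory, filelist,
               apf_extension, output_extension, flags,
               gazetteer=None, pre_dictionary=None):
    """Assemble the APFtoXML arguments in the documented order."""
    year = str(year)
    if year not in VALID_YEARS:
        raise ValueError(f"year must be one of {sorted(VALID_YEARS)}")
    if not flags:
        raise ValueError("at least one flag is required")
    unknown = [flag for flag in flags if flag not in VALID_FLAGS]
    if unknown:
        raise ValueError(f"unknown flags: {unknown}")
    # Assumption: the two optional files are supplied as a pair or not at all.
    if (gazetteer is None) != (pre_dictionary is None):
        raise ValueError("gazetteer and pre-dictionary must be given together")

    args = ["AceJet.APFtoXML", year, apf_directory, output_directory,
            filelist, apf_extension, output_extension]
    if gazetteer is not None:
        args += [gazetteer, pre_dictionary]
    args += list(flags)
    return args
```

For example, build_args(2004, "apf", "out", "files.txt", "apf.xml", "xml", ["sentences", "names"], gazetteer="gaz.dict", pre_dictionary="pre.dict") yields the class name followed by the six required arguments, the two optional files, and the flags. The directory and file names in this call are placeholders.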