BIOWriter
The BIOWriter utility converts a tagged name corpus from XML tags to BIO tags. It is invoked by
jet -BIOWriter XML-collection BIO-file
where XML-collection is the name of the input file, which lists a collection of XML-annotated files, and BIO-file is the name of the output file. XML-collection should contain the file names of the XML-annotated document files, one file name per line. File names may be either absolute or relative paths; relative paths are interpreted relative to the directory containing XML-collection.
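The path rule above can be illustrated with a short Python sketch. This is not Jet's own code: the function name read_collection is hypothetical, and skipping blank lines and reading the file as UTF-8 are assumptions not stated in this documentation.

```python
from pathlib import Path


def read_collection(collection_path):
    """Return the document paths listed in a collection file.

    Hypothetical helper, not part of Jet. Relative entries are resolved
    against the directory that contains the collection file.
    """
    collection = Path(collection_path)
    base_dir = collection.parent
    paths = []
    for line in collection.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # assumption: blank lines are skipped
        entry = Path(name)
        # Absolute paths are kept as-is; relative ones are joined to base_dir.
        paths.append(entry if entry.is_absolute() else base_dir / entry)
    return paths
```

For a collection file /data/names/collection.txt containing the line docs/a.xml, this sketch would yield /data/names/docs/a.xml.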
Each XML-annotated document file is annotated in MUC format. Only
data between <TEXT> and </TEXT> is processed. Names should be marked with <ENAMEX TYPE=type> ... </ENAMEX>; tags of the form <TIMEX> ... </TIMEX> and <NUMEX> ... </NUMEX> are also allowed but are ignored.
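As a rough illustration of the input conventions, the following Python sketch pulls the names out of one MUC-style document. It is an approximation under stated assumptions, not how BIOWriter itself parses its input: the function name extract_names and the regular expressions are hypothetical, the TYPE value is accepted with or without quotes, and nested tags are not handled.

```python
import re

# Assumed patterns, for illustration only.
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
ENAMEX_RE = re.compile(
    r'<ENAMEX\s+TYPE="?([^">\s]+)"?\s*>(.*?)</ENAMEX>', re.DOTALL
)
IGNORED_RE = re.compile(r"</?(?:TIMEX|NUMEX)\b[^>]*>")


def extract_names(document):
    """Return (type, text) pairs for each ENAMEX span inside <TEXT>...</TEXT>."""
    names = []
    for body in TEXT_RE.findall(document):
        # TIMEX and NUMEX tags are allowed but ignored: drop the tags,
        # keep the text they enclose.
        body = IGNORED_RE.sub("", body)
        for name_type, text in ENAMEX_RE.findall(body):
            names.append((name_type, text.strip()))
    return names
```

Anything outside the <TEXT> region, including any ENAMEX tags in a document header, is left out of the result.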
The output file (BIO-file) consists of one token per line, with a blank line between sentences. Each line consists of the token, a space, and a BIO tag. Tokens outside a name are tagged "O". A sequence of the form <ENAMEX TYPE=type> token1 token2 token3</ENAMEX> will be rendered as
token1 B-type
token2 I-type
token3 I-type
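The tagging scheme can be sketched in Python as follows. This is a minimal illustration, not the BIOWriter implementation: the function name to_bio is hypothetical, and tokens are split on whitespace only, whereas Jet applies its own tokenizer (so punctuation handling will differ).

```python
import re

# Matches an opening ENAMEX tag, a closing ENAMEX tag, or a run of
# non-space characters (treated as one token). Assumed for illustration.
PIECE_RE = re.compile(
    r'<ENAMEX\s+TYPE="?([^">\s]+)"?\s*>|(</ENAMEX>)|([^\s<]+)'
)


def to_bio(sentence):
    """Convert one ENAMEX-annotated sentence to a list of 'token tag' lines."""
    lines = []
    name_type = None      # type of the name currently open, or None
    first_in_name = False
    for match in PIECE_RE.finditer(sentence):
        if match.group(1) is not None:    # opening tag
            name_type, first_in_name = match.group(1), True
        elif match.group(2) is not None:  # closing tag
            name_type, first_in_name = None, False
        else:                             # ordinary token
            token = match.group(3)
            if name_type is None:
                lines.append(f"{token} O")
            elif first_in_name:
                lines.append(f"{token} B-{name_type}")
                first_in_name = False
            else:
                lines.append(f"{token} I-{name_type}")
    return lines
```

Only the first token of a name receives the B- prefix; every later token in the same name receives I-, and tokens outside any name receive O.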
The single BIO-file
contains information from all the documents in the input collection.
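Because the output format is line-oriented, it is straightforward to read back. The Python sketch below groups the lines of a BIO-file into sentences; the function name read_bio is hypothetical, and the sketch assumes each non-blank line ends with a single space followed by the tag, as described above.

```python
def read_bio(lines):
    """Group 'token tag' lines into sentences of (token, tag) pairs.

    Hypothetical reader, not part of Jet. A blank line ends a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last space so the tag is always the final field.
        token, tag = line.rsplit(" ", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```

It can be applied to an open file, for example read_bio(open("names.bio", encoding="utf-8")), where names.bio is a placeholder file name. Since the BIO-file merges all documents, document boundaries cannot be recovered from the result.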
Note: if the input is available as a single file with <DOC> ... </DOC> surrounding each document, it can be converted to the required form (one document per file) with the MakeCollection utility.