BIOWriter

The BIOWriter utility converts a name-tagged corpus from XML tags to BIO tags. It is invoked as follows:

jet -BIOWriter XML-collection BIO-file

where XML-collection is the name of the input file, which lists a collection of XML-annotated document files, and BIO-file is the name of the output file. XML-collection should contain the file names of the XML-annotated document files, one file name per line. File names may be either absolute or relative paths; relative paths are interpreted relative to the directory containing XML-collection.
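The path rule above can be illustrated with a short Python sketch. It is not part of Jet; the function name resolve_collection is hypothetical, and skipping blank lines is an assumption that the description above does not state.

```python
import os

def resolve_collection(collection_path):
    """Return the document paths listed in a collection file.

    Hypothetical helper, not part of Jet: it only illustrates how
    relative names are resolved against the collection's directory.
    """
    base_dir = os.path.dirname(os.path.abspath(collection_path))
    paths = []
    with open(collection_path, encoding="utf-8") as f:
        for line in f:
            name = line.strip()
            if not name:
                continue  # assumption: blank lines are skipped
            # os.path.join returns an absolute name unchanged, so both
            # absolute and relative entries are handled by the same call.
            paths.append(os.path.normpath(os.path.join(base_dir, name)))
    return paths
```

With a collection file /data/muc/docs.list containing the line train/doc1.xml, this sketch would yield /data/muc/train/doc1.xml regardless of the current working directory.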

Each document file must be annotated in MUC format. Only data between <TEXT> and </TEXT> is processed. Names should be marked with <ENAMEX TYPE=type> ... </ENAMEX>; tags of the form <TIMEX> ... </TIMEX> and <NUMEX> ... </NUMEX> are also allowed but are ignored.

The output file (BIO-file) contains one token per line, with a blank line between sentences. Each line consists of the token, a space, and a BIO tag. Tokens outside a name are tagged "O". A sequence of the form <ENAMEX TYPE=type> token1 token2 token3 </ENAMEX> is rendered as

token1 B-type
token2 I-type
token3 I-type
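The mapping from ENAMEX spans to BIO tags can be sketched in Python as follows. This is an illustration only, not the Jet implementation. It rests on several assumptions: tokens are split on whitespace (Jet uses its own tokenizer), no sentence splitting is done, TIMEX and NUMEX tags are dropped while the words they enclose are kept and tagged "O", and ENAMEX elements are not nested. The function name to_bio is hypothetical.

```python
import re

# Text between <TEXT> and </TEXT>; everything else is skipped.
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
# TIMEX/NUMEX tags are dropped (assumption: the enclosed words are kept).
IGNORED_RE = re.compile(r"</?(?:TIMEX|NUMEX)\b[^>]*>")
# Accepts TYPE=type with or without quotes around the type.
ENAMEX_RE = re.compile(
    r'<ENAMEX\s+TYPE="?([^">\s]+)"?\s*>(.*?)</ENAMEX>', re.DOTALL
)

def to_bio(document):
    """Return (token, tag) pairs for the text inside <TEXT> ... </TEXT>."""
    pairs = []
    for body in TEXT_RE.findall(document):
        body = IGNORED_RE.sub("", body)
        pos = 0
        for m in ENAMEX_RE.finditer(body):
            # Tokens before the name are outside any name.
            pairs += [(tok, "O") for tok in body[pos:m.start()].split()]
            # First token of the name gets B-type, the rest I-type.
            for i, tok in enumerate(m.group(2).split()):
                prefix = "B-" if i == 0 else "I-"
                pairs.append((tok, prefix + m.group(1)))
            pos = m.end()
        pairs += [(tok, "O") for tok in body[pos:].split()]
    return pairs

sample = (
    "<DOC><TEXT>Yesterday <ENAMEX TYPE=PERSON>John F. Smith</ENAMEX> "
    'visited <ENAMEX TYPE="LOCATION">Paris</ENAMEX> on '
    '<TIMEX TYPE="DATE">Monday</TIMEX> .</TEXT></DOC>'
)
for token, tag in to_bio(sample):
    print(token, tag)
```

Whitespace tokenization keeps the sketch short; on real text it leaves punctuation attached to words, which a proper tokenizer would separate.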

The single BIO-file contains the tagged tokens from all the documents in the input collection.
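A program that consumes the BIO-file might read it back into sentences along these lines. This is a minimal Python sketch based only on the format described above (token, separator, tag, with a blank line between sentences); the function name read_bio is hypothetical, and splitting on the last space of each line is an assumption.

```python
def read_bio(lines):
    """Group BIO lines into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is everything after the last space.
        token, tag = line.rsplit(" ", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of lines, it can be passed an open file object directly, for example read_bio(open("names.bio", encoding="utf-8")), where names.bio is a placeholder file name.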

Note: if input is available as a single file with <DOC> ... </DOC> surrounding each document, it can be converted to the required (one document per file) form with the MakeCollection utility.