BIOWriter

The BIOWriter utility converts a tagged name corpus from XML tags to BIO tags.  It is invoked by

jet -BIOWriter  XML-collection BIO-file

where XML-collection is the name of the input file, which lists the XML-annotated document files to convert, and BIO-file is the name of the output file.  XML-collection should contain the file names of the XML-annotated document files, one file name per line.  File names may be absolute or relative paths;  relative paths are interpreted relative to the directory containing XML-collection.
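
The path rule above can be illustrated with a short Python sketch.  It is not part of Jet;  the function name read_collection, the UTF-8 encoding, and the skipping of blank lines are assumptions made for illustration only.

```python
from pathlib import Path


def read_collection(collection_path):
    """Return the document paths listed in a collection file.

    Relative entries are resolved against the directory that contains
    the collection file itself, not the current working directory.
    """
    collection = Path(collection_path)
    base_dir = collection.parent
    documents = []
    for line in collection.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:  # assumption: blank lines are skipped
            continue
        path = Path(name)
        # Absolute paths are kept as-is; relative ones are anchored at base_dir.
        documents.append(path if path.is_absolute() else base_dir / path)
    return documents
```

With a collection file /data/corpus/docs.list containing the line a.xml, this sketch yields /data/corpus/a.xml regardless of the directory from which it is run.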

Each XML-annotated document file is annotated in MUC format.  Only data between <TEXT> and </TEXT> is processed.  Names should be marked with <ENAMEX TYPE=type> ... </ENAMEX>;  tags of the form <TIMEX> ... </TIMEX> and <NUMEX> ... </NUMEX> are also allowed but are ignored.
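
As a rough Python sketch of this input handling (not Jet's own code):  the function extract_text_regions is hypothetical, and it assumes that "ignored" means the TIMEX and NUMEX tags are dropped while the words they enclose are kept as ordinary text.

```python
import re

# Contents of each <TEXT> ... </TEXT> region, possibly spanning several lines.
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
# Opening and closing TIMEX / NUMEX tags, with or without attributes.
IGNORED_TAG_RE = re.compile(r"</?(?:TIMEX|NUMEX)\b[^>]*>")


def extract_text_regions(document):
    """Return the contents of each <TEXT> region of a document string.

    TIMEX and NUMEX tags are removed but their enclosed words are kept
    (an assumption); ENAMEX tags are left in place for later processing.
    """
    return [IGNORED_TAG_RE.sub("", body) for body in TEXT_RE.findall(document)]
```

Anything outside <TEXT> ... </TEXT>, such as headline or date fields, is simply never returned by this sketch.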

The output file (BIO-file) contains one token per line, with a blank line between sentences.  Each line consists of the token, a blank, and a BIO tag.  Tokens outside a name are tagged "O".  A sequence of the form
<ENAMEX TYPE=type> token1 token2 token3</ENAMEX> will be rendered as

token1 B-type
token2 I-type
token3 I-type
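
The tagging scheme can be sketched in Python as follows.  This is an illustration only, not the BIOWriter implementation:  the names to_bio and write_bio are hypothetical, tokens are split on whitespace (Jet's own tokenizer will differ, for example on punctuation), and the input is assumed to be already divided into sentences.

```python
import re

# Matches <ENAMEX TYPE=type> ... </ENAMEX>, with or without quotes around the type.
ENAMEX_RE = re.compile(
    r'<ENAMEX\s+TYPE="?([^">\s]+)"?\s*>(.*?)</ENAMEX>', re.DOTALL
)


def to_bio(sentence):
    """Convert one ENAMEX-annotated sentence to a list of (token, tag) pairs."""
    pairs = []
    position = 0
    for match in ENAMEX_RE.finditer(sentence):
        # Text before this name lies outside any name.
        pairs += [(tok, "O") for tok in sentence[position:match.start()].split()]
        name_type = match.group(1)
        # First token of the name gets B-type, the rest get I-type.
        for i, tok in enumerate(match.group(2).split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + name_type))
        position = match.end()
    # Remaining text after the last name.
    pairs += [(tok, "O") for tok in sentence[position:].split()]
    return pairs


def write_bio(sentences):
    """Render sentences as "token tag" lines, with a blank line between sentences."""
    return "\n\n".join(
        "\n".join(f"{tok} {tag}" for tok, tag in to_bio(s)) for s in sentences
    ) + "\n"
```

For instance, to_bio('<ENAMEX TYPE=type> token1 token2 token3</ENAMEX>') returns the three pairs shown in the example above.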

A single BIO-file holds the output for all the documents in the input collection.

Note:  If the input is available as a single file with <DOC> ... </DOC> surrounding each document, it can be converted to the required form (one document per file) with the MakeCollection utility.