Corpus Tools

Last Modified: Tue Oct 31 12:58:26 EST 2006 (10/31, 17:58 UTC)

All the tools are written in Python. Python 2.4 or newer is required to run them.

Presentation Slides about Unicode and Foreign Encodings [PDF, 640k]

Tips

Here is a Unicode to GB2312 or GBK table (WARNING: This is very big!)

Here is a Unicode to Shift_JIS or MS-932 table (WARNING: This is very big!)

How to edit UTF-8 text with Emacs.

  1. Grab the MULE-UCS package from http://www.meadowy.org/~shirai/elips/mule-ucs.tar.gz
  2. $ mkdir /some/where/common/place
  3. $ tar zxf mule-ucs.tar.gz
  4. $ emacs -q --no-site-file -batch -l mucs-comp.el
  5. Move the extracted directory to the place created in step 2.
  6. Add the following lines to your .emacs:
    (setq load-path (cons "somewhere/mule-ucs/lisp/" load-path))
    (require 'un-define)
    (require 'un-tools)
    
  7. Try editing a UTF-8 text file.

validchar.py

Encoding validator. Can be used for cleansing corpora that include unexpected characters.

Download validchar.py (4kbytes)

It checks whether every input file is valid in the character encoding specified by the -i option, and displays the lines that could not be decoded properly. Optionally, it also checks whether the input characters can be safely converted to another encoding specified by the -o option. Checking is done line by line. When it finds a non-decodable line, it writes the whole line, with its line number and file name, to stdout. After checking all the input files, it displays a summary of all unencodable characters that appeared in them. When the -v (verbose) option is given, it displays every occurrence of each unencodable character.

Note: This program is for checking purposes only and does not output the successfully converted text.

Usage:

$ ./validchar.py [-v] [-i input_encoding] [-o output_encoding] file ...
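
The checking procedure described above can be sketched roughly as follows. This is only an illustration written for Python 3, not the code of validchar.py itself; the function name check_lines and the sample data are assumptions made up for this example.

```python
def check_lines(raw_lines, input_encoding, output_encoding=None):
    """Check byte strings line by line.

    Returns (bad_lines, unencodable):
      bad_lines    list of (line_number, raw_bytes) that failed to decode
      unencodable  dict mapping each character that cannot be encoded in
                   output_encoding to its number of occurrences
    """
    bad_lines = []
    unencodable = {}
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode(input_encoding)
        except UnicodeDecodeError:
            # The whole line is reported; no character check is possible.
            bad_lines.append((lineno, raw))
            continue
        if output_encoding is None:
            continue
        for ch in text:
            try:
                ch.encode(output_encoding)
            except UnicodeEncodeError:
                unencodable[ch] = unencodable.get(ch, 0) + 1
    return bad_lines, unencodable


# Hypothetical sample input: three "lines" of raw bytes.
sample = [
    "caf\u00e9\n".encode("utf-8"),      # valid UTF-8, fits in Latin-1
    b"\xff\xfe broken\n",               # not valid UTF-8
    "\u7f8e\u56fd\n".encode("utf-8"),   # valid UTF-8, not in Latin-1
]
bad, missing = check_lines(sample, "utf-8", "latin-1")
print(bad)       # the second line is reported as non-decodable
print(missing)   # the two Chinese characters are counted once each
```

Testing one character at a time is slow but keeps the summary simple: every character that raises UnicodeEncodeError is counted, which corresponds to the summary of unencodable characters mentioned above.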

convchar.py

Character encoding converter. Optionally, it can emulate non-ASCII characters with ASCII strings.

Download convchar.py (9kbytes)

It re-encodes all the input characters in a different encoding and writes the result to stdout. The input and output encodings are specified by the -i and -o options, respectively. Unlike validchar.py, it operates strictly: it aborts if any input character cannot be encoded properly, unless a recovery method is specified with the -D option (shown below). When the -E option is given, it tries to emulate some non-ASCII characters, such as hyphens and Western European characters, with ASCII strings. The emulation table is embedded in the program (in the CONVERSION_TABLE variable).

Available recovery methods:

Usage:

$ ./convchar.py [-i input_encoding] [-o output_encoding] [-E] [-D recovery_method] file ...
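
As a rough illustration of strict conversion with optional emulation, here is a minimal Python 3 sketch. It is not the code of convchar.py: the EMULATION table is a tiny made-up stand-in for CONVERSION_TABLE, the function name convert is hypothetical, and the errors argument uses Python's own codec error handlers ("strict", "replace"), which are not necessarily the recovery methods accepted by -D.

```python
# Hypothetical emulation table; the real CONVERSION_TABLE is not reproduced here.
EMULATION = {
    "\u2010": "-",   # HYPHEN
    "\u2013": "-",   # EN DASH
    "\u00e9": "e",   # LATIN SMALL LETTER E WITH ACUTE
}


def convert(raw, input_encoding, output_encoding, emulate=False, errors="strict"):
    """Decode raw bytes, optionally emulate characters, then re-encode.

    With errors="strict" (the default) a UnicodeEncodeError is raised as
    soon as a character cannot be represented in output_encoding.
    """
    text = raw.decode(input_encoding)
    if emulate:
        text = "".join(EMULATION.get(ch, ch) for ch in text)
    return text.encode(output_encoding, errors)


raw = "caf\u00e9 \u2013 ok".encode("utf-8")
print(convert(raw, "utf-8", "ascii", emulate=True))       # b'cafe - ok'
print(convert(raw, "utf-8", "ascii", errors="replace"))   # b'caf? ? ok'
```

Calling convert(raw, "utf-8", "ascii") with neither emulation nor a lenient error handler raises UnicodeEncodeError, which mirrors the abort-on-failure behavior described above.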

pinyin.py

Chinese-to-Pinyin (ASCII) converter.

Download pinyin.py (270kbytes)

It converts Chinese characters (hanzi) into pronunciation strings (pinyin). The -C (Cantonese) option makes it use the Cantonese pronunciation instead of Mandarin. When the -T (tone) option is given, a tone number is appended to each pronunciation string. Non-ASCII characters that are not Chinese are left untouched. When the -H (HTML) option is given, each pronunciation string is attached to its Chinese character as superscript text using the <sup> tag (e.g. 美<sup>Mei3</sup>).

Usage:

$ ./pinyin.py [-i input_encoding] [-C] [-T] [-H] file ...

Sample Run:

$ cat xinhua.txt
新华社波士顿3月31日电(记者杨志望)美国马萨诸塞州港务局3月31日在
这里举行活动,隆重庆祝中国远洋运输(集团)总公司“珍河轮”首航波士顿港
一周年。
$ ./pinyin.py xinhua.txt
Xin Hua She Bo Shi Dun 3 Yue 31 Ri Dian(Ji Zhe Yang Zhi Wang)Mei Guo Ma Sa Zhu Sai Zhou Gang Wu Ju 3 Yue 31 Ri Zai
Zhe Li Ju Xing Huo Dong,Long Zhong Qing Zhu Zhong Guo Yuan Yang Yun Shu(Ji Tuan)Zong Gong Si“Zhen He Lun”Shou Hang Bo Shi Dun Gang
Yi Zhou Nian。
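
The conversion can be pictured as a per-character table lookup. The following Python 3 sketch is only an approximation: the four-entry PINYIN table, the function name to_pinyin, and the spacing rule (a space between a reading and a neighboring reading or ASCII letter/digit, none next to punctuation) are assumptions inferred from the sample run, not taken from pinyin.py.

```python
# Tiny hypothetical table (Mandarin readings with tone numbers);
# the real tool ships a far larger one.
PINYIN = {
    "\u7f8e": "Mei3",  # 美
    "\u56fd": "Guo2",  # 国
    "\u6708": "Yue4",  # 月
    "\u65e5": "Ri4",   # 日
}


def _ascii_alnum(s):
    """True if s consists only of ASCII letters or digits."""
    return s.isascii() and s.isalnum()


def to_pinyin(text, tone=False, html=False):
    out = []
    after_reading = False  # True when the last piece emitted was a plain reading
    for ch in text:
        reading = PINYIN.get(ch)
        if reading is None:
            # Characters without a reading are passed through unchanged.
            if after_reading and _ascii_alnum(ch):
                out.append(" ")
            out.append(ch)
            after_reading = False
        else:
            if not tone:
                reading = reading.rstrip("0123456789")
            if html:
                out.append(ch + "<sup>" + reading + "</sup>")
            else:
                if after_reading or (out and _ascii_alnum(out[-1])):
                    out.append(" ")
                out.append(reading)
                after_reading = True
    return "".join(out)


print(to_pinyin("\u7f8e\u56fd"))                        # Mei Guo
print(to_pinyin("3\u670831\u65e5"))                     # 3 Yue 31 Ri
print(to_pinyin("\u7f8e\u56fd", tone=True))             # Mei3 Guo2
print(to_pinyin("\u7f8e", tone=True, html=True))        # 美<sup>Mei3</sup>
```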

etokenizer.py

A simplistic tokenizer and sentence splitter for English text.

Download etokenizer.py (9kbytes)

Each token is delimited by a blank character. The output contains one sentence per line. A line that starts with a '#' character is regarded as a comment and is passed directly to the output without processing. When the -u (uncapitalized) option is given, it applies a special heuristic for sentence splitting instead of assuming that the beginning of each sentence is capitalized. Both the input and output encodings are UTF-8.

Usage:

$ ./etokenizer.py [-u] file ...

Sample Run:

$ cat input.txt
# wsj_0001
Pierre Vinken, 61 years old, will join the board as a 
nonexecutive director Nov. 29. Mr. Vinken is chairman
of Elsevier N.V., the Dutch publishing group.
$ ./etokenizer.py input.txt
# wsj_0001
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
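
To make the default behavior concrete, here is a small Python 3 sketch that reproduces the sample run above. It is not the algorithm of etokenizer.py: the ABBREVIATIONS set and the function names are hypothetical, only commas, semicolons, colons and sentence-final periods are split off, and the -u heuristic is not covered. A period ends a sentence only when the word is not a listed abbreviation and the next word starts with a capital letter (or the text ends).

```python
# Hypothetical abbreviation list, just large enough for the sample text.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Nov.", "N.V."}


def _split_word(word):
    """Peel trailing commas, semicolons and colons off a word."""
    tail = []
    while len(word) > 1 and word[-1] in ",;:":
        tail.insert(0, word[-1])
        word = word[:-1]
    return word, tail


def _sentences(words):
    """Yield tokenized sentences from a list of whitespace-separated words."""
    sentence = []
    for i, word in enumerate(words):
        core, tail = _split_word(word)
        nxt = words[i + 1] if i + 1 < len(words) else None
        ends = (
            len(core) > 1
            and core.endswith(".")
            and core not in ABBREVIATIONS
            and not tail
            and (nxt is None or nxt[0].isupper())
        )
        if ends:
            sentence += [core[:-1], "."]
            yield " ".join(sentence)
            sentence = []
        else:
            sentence += [core] + tail
    if sentence:
        yield " ".join(sentence)


def tokenize(lines):
    """Yield output lines: comments unchanged, one tokenized sentence per line."""
    pending = []  # words collected since the last comment line
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield from _sentences(pending)
            pending = []
            yield line
        else:
            pending.extend(line.split())
    yield from _sentences(pending)


sample = [
    "# wsj_0001\n",
    "Pierre Vinken, 61 years old, will join the board as a \n",
    "nonexecutive director Nov. 29. Mr. Vinken is chairman\n",
    "of Elsevier N.V., the Dutch publishing group.\n",
]
for out_line in tokenize(sample):
    print(out_line)
```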

Terms and Conditions

Copyright (c) 2004 onward, Yusuke Shinyama (yusuke at cs dot nyu dot edu)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Yusuke Shinyama (yusuke at cs dot nyu dot edu)