All the tools are written in Python. Python 2.4 or newer is required to run them.
Presentation Slides about Unicode and Foreign Encodings [PDF, 640k]
Here is a Unicode to GB2312 or GBK table (WARNING: This is very big!)
Here is a Unicode to Shift_JIS or MS-932 table (WARNING: This is very big!)
$ mkdir /some/where/common/place
$ tar zxf mule-ucs.tar.gz
$ emacs -q --no-site-file -batch -l mucs-comp.el
(setq load-path (cons "somewhere/mule-ucs/lisp/" load-path))
(require 'un-define)
(require 'un-tools)
Encoding validator. Can be used for cleansing corpora that include unexpected characters.
Download validchar.py (4kbytes)
It checks whether every input file is valid in the character
encoding specified by the -i option and reports the lines that
cannot be decoded. Optionally, it also checks whether the
input characters can be safely converted to another encoding
specified by the -o option. Checking is done line by line. When
it finds a line that cannot be decoded, it prints the whole line with
its line number and file name to stdout. After checking all the input
files, it prints a summary of all unencodable characters that
appeared in the files. When the -v (verbose) option is given,
it prints every occurrence of each unencodable character.
Notice: This program is for checking purposes only; it does not output the converted text.
Usage:
$ ./validchar.py [-v] [-i input_encoding] [-o output_encoding] file ...
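The line-by-line check described above can be sketched roughly as follows. This is not the code of validchar.py; it is a minimal Python 3 illustration, and the function name `check_lines` and the shape of its return value are assumptions made for the example.

```python
from collections import Counter

def check_lines(raw_lines, input_encoding, output_encoding=None):
    """Return (bad_lines, unencodable) for an iterable of byte strings.

    bad_lines   -- list of (line_number, raw_line) that failed to decode
    unencodable -- Counter of characters that cannot be encoded in
                   output_encoding (empty if no output encoding is given)
    """
    bad_lines = []
    unencodable = Counter()
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode(input_encoding)
        except UnicodeDecodeError:
            # The whole line is reported, as the description above says.
            bad_lines.append((lineno, raw))
            continue
        if output_encoding is None:
            continue
        for ch in text:
            try:
                ch.encode(output_encoding)
            except UnicodeEncodeError:
                unencodable[ch] += 1
    return bad_lines, unencodable

# Hypothetical input: line 2 is not valid UTF-8, line 3 has a non-ASCII character.
lines = [b"plain ascii\n", b"broken \xff byte\n", "caf\u00e9\n".encode("utf-8")]
bad, missing = check_lines(lines, "utf-8", "ascii")
```

Here `bad` holds the undecodable line together with its line number, and `missing` counts each character that the output encoding cannot represent, which corresponds to the summary printed at the end of a run.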
Character encoding converter. Optionally, it can emulate non-ASCII characters with ASCII strings.
Download convchar.py (9kbytes)
It re-encodes all the input characters in a different encoding
and writes the result to stdout.
The input and output encodings are specified by the -i and
-o options respectively. Unlike validchar.py, it
operates strictly and aborts if any input
character cannot be encoded, unless a recovery method
is specified. The recovery method is chosen with the -D option
(listed below). When the -E option is given, it tries
to emulate some non-ASCII characters, such as hyphens and Western
European characters, with ASCII strings. The emulation table is
embedded in the program (in the CONVERSION_TABLE variable).
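The emulation idea amounts to a table lookup per character. The sketch below is only an illustration in Python 3: `SAMPLE_TABLE` and `emulate_ascii` are hypothetical names, and the entries are examples chosen here, not the contents of the program's CONVERSION_TABLE.

```python
# Hypothetical stand-in for the table embedded in convchar.py; the real
# entries are not listed in this document.
SAMPLE_TABLE = {
    "\u2010": "-",    # HYPHEN
    "\u2013": "-",    # EN DASH
    "\u00e9": "e",    # LATIN SMALL LETTER E WITH ACUTE
    "\u00fc": "u",    # LATIN SMALL LETTER U WITH DIAERESIS
    "\u00df": "ss",   # LATIN SMALL LETTER SHARP S
}

def emulate_ascii(text, table=SAMPLE_TABLE):
    """Replace each character found in the table with its ASCII string."""
    return "".join(table.get(ch, ch) for ch in text)

result = emulate_ascii("caf\u00e9 \u2013 Stra\u00dfe")
```

Characters missing from the table pass through unchanged, so they would still need one of the recovery methods if the output encoding cannot represent them.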
Available recovery methods:
-D none: Omits all unencodable characters from the output.
-D blank: Replaces an unencodable character with a single blank character.
This preserves the byte offsets of the input text.
-D xxx: Replaces an unencodable character with a constant string 'xxx'.
-D html: Replaces an unencodable character with an HTML numeric character reference '&#XXXX;',
where XXXX is the Unicode code point of the character.
-D code: Replaces an unencodable character with a string of the form
"unknown_char(U+XXXX)", where XXXX is the Unicode code point
of the character.
Usage:
$ ./convchar.py [-i input_encoding] [-o output_encoding] [-E] [-D recovery_method] file ...
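One way to picture the recovery methods is as error handlers for Python's codec machinery. The following Python 3 sketch is an assumption about how such behavior could be built, not a description of convchar.py itself; the handler names (`recover_none` and so on) are hypothetical. For the HTML case it uses Python's built-in `xmlcharrefreplace` handler, which emits decimal references and may not match the tool's exact output.

```python
import codecs

def make_handler(render):
    """Build a codec error handler that rewrites each unencodable character."""
    def handler(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        # A single error may cover a run of several unencodable characters.
        bad = exc.object[exc.start:exc.end]
        return "".join(render(ch) for ch in bad), exc.end
    return handler

# Handler names are hypothetical; they only mirror the -D values listed above.
codecs.register_error("recover_none", make_handler(lambda ch: ""))
codecs.register_error("recover_blank", make_handler(lambda ch: " "))
codecs.register_error("recover_xxx", make_handler(lambda ch: "xxx"))
codecs.register_error(
    "recover_code", make_handler(lambda ch: "unknown_char(U+%04X)" % ord(ch)))

text = "na\u00efve"
results = {
    "none": text.encode("ascii", "recover_none"),
    "blank": text.encode("ascii", "recover_blank"),
    "xxx": text.encode("ascii", "recover_xxx"),
    # Built-in handler; it emits decimal references such as &#239;.
    "html": text.encode("ascii", "xmlcharrefreplace"),
    "code": text.encode("ascii", "recover_code"),
}
```

Without any handler, `text.encode("ascii")` raises `UnicodeEncodeError`, which is comparable to the strict default in which the conversion aborts.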
Chinese Pinyin to ASCII converter.
Download pinyin.py (270kbytes)
It converts Chinese characters (hanzi) into pronunciation strings (pinyin).
The -C (Cantonese) option selects Cantonese
pronunciation instead of Mandarin. When the -T (tone) option is given,
a tone number is appended to each pronunciation string.
Non-ASCII characters that are not Chinese are left untouched.
When the -H (HTML) option is given, each pronunciation string is
attached to its Chinese character as superscript text using the <sup> tag
(e.g. 美<sup>Mei3</sup>).
Usage:
$ ./pinyin.py [-i input_encoding] [-C] [-T] [-H] file ...
Sample Run:
$ cat xinhua.txt
新华社波士顿3月31日电(记者杨志望)美国马萨诸塞州港务局3月31日在
这里举行活动,隆重庆祝中国远洋运输(集团)总公司“珍河轮”首航波士顿港
一周年。
$ ./pinyin.py xinhua.txt
Xin Hua She Bo Shi Dun 3 Yue 31 Ri Dian(Ji Zhe Yang Zhi Wang)Mei Guo Ma Sa Zhu Sai Zhou Gang Wu Ju 3 Yue 31 Ri Zai
Zhe Li Ju Xing Huo Dong,Long Zhong Qing Zhu Zhong Guo Yuan Yang Yun Shu(Ji Tuan)Zong Gong Si“Zhen He Lun”Shou Hang Bo Shi Dun Gang
Yi Zhou Nian。
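The conversion can be thought of as a per-character lookup with optional tone numbers and HTML markup. The Python 3 sketch below is a simplified illustration only: `SAMPLE_PINYIN` and `to_pinyin` are hypothetical names, the three-entry Mandarin table is a tiny excerpt made up for the example, and the spacing rules are simpler than those visible in the sample run above.

```python
# Tiny hypothetical excerpt (Mandarin readings); pinyin.py ships its own,
# much larger table.
SAMPLE_PINYIN = {
    "美": ("Mei", 3),
    "国": ("Guo", 2),
    "中": ("Zhong", 1),
}

def to_pinyin(text, tone=False, as_html=False, table=SAMPLE_PINYIN):
    """Convert known hanzi to pinyin; leave every other character untouched."""
    out = []
    prev_was_hanzi = False
    for ch in text:
        entry = table.get(ch)
        if entry is None:
            out.append(ch)
            prev_was_hanzi = False
            continue
        syllable, number = entry
        reading = syllable + (str(number) if tone else "")
        if as_html:
            # Keep the character and attach its reading as a superscript.
            out.append("%s<sup>%s</sup>" % (ch, reading))
        else:
            if prev_was_hanzi:
                out.append(" ")
            out.append(reading)
        prev_was_hanzi = True
    return "".join(out)

plain = to_pinyin("美国, 中国")
toned = to_pinyin("美国, 中国", tone=True)
marked = to_pinyin("美国", tone=True, as_html=True)
```

In this sketch the `tone` and `as_html` parameters play the roles of the -T and -H options; whether -H implies tone numbers in the actual tool is not stated here, so the example passes both explicitly.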
A simplistic tokenizer and sentence splitter for English text.
Download etokenizer.py (9kbytes)
Tokens are separated by a single blank character. The output
contains one sentence per line. A line that starts with
a '#' character is regarded as a comment and passed
to the output unprocessed. When the -u (uncapitalized)
option is given, it applies a special heuristic for sentence splitting
instead of assuming that each sentence begins with a capital letter.
Both the input and output encodings are UTF-8.
Usage:
$ ./etokenizer.py [-u] file ...
Sample Run:
$ cat input.txt
# wsj_0001
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
$ ./etokenizer.py input.txt
# wsj_0001
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
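The default behavior, where a sentence is assumed to start with a capital letter, can be approximated with a few lines of Python 3. This is a rough sketch under stated assumptions, not the logic of etokenizer.py: the function names and the small `ABBREVIATIONS` set are hypothetical, only trailing punctuation is split off, and each input line is handled independently.

```python
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Nov."}  # hypothetical, far from complete

def split_word(word):
    """Split one whitespace-delimited word into tokens."""
    tail = []
    # Peel closing punctuation off the end, one character at a time.
    while word and word[-1] in ",;:!?\")":
        tail.insert(0, word[-1])
        word = word[:-1]
    if word.endswith(".") and len(word) > 1:
        stem = word[:-1]
        # Keep the period on known abbreviations and dotted forms like "N.V.".
        if word not in ABBREVIATIONS and "." not in stem:
            tail.insert(0, ".")
            word = stem
    return ([word] if word else []) + tail

def tokenize(text):
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word))
    return tokens

def split_sentences(tokens):
    """End a sentence after . ! ? when the next token is capitalized."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        at_end = i + 1 == len(tokens)
        if tok in {".", "!", "?"} and (at_end or tokens[i + 1][:1].isupper()):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def process(lines):
    """Yield output lines: comments verbatim, other text one sentence per line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        for sentence in split_sentences(tokenize(line)):
            yield " ".join(sentence)

sample = [
    "# wsj_0001\n",
    "Pierre Vinken, 61 years old, will join the board as a nonexecutive "
    "director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch "
    "publishing group.\n",
]
out = list(process(sample))
```

The abbreviation set is what keeps "Nov." and "Mr." from ending a sentence; a realistic list would need to be much longer, and text without capitalization would need a different boundary rule, which is the case the -u option addresses.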
Copyright (c) 2004 onward, Yusuke Shinyama (yusuke at cs dot nyu dot edu)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.