Theses & Reports
Text Representation using Convolutional Networks
Title: Text Representation using Convolutional Networks
Candidate: Zhang, Xiang
Advisor(s): LeCun, Yann
This dissertation applies convolutional networks to learning representations of text, and it consists of several parts. The first part offers an empirical exploration of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level ConvNets can achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TF-IDF variants, and against deep learning models such as word-based ConvNets and recurrent neural networks. These results indicate that using low-level inputs -- in this case characters -- is feasible for learning text representations with convolutional networks.
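As a concrete illustration (not code from the dissertation itself), character-level input can be prepared by quantizing a string against a fixed alphabet; the alphabet and sequence length below are illustrative choices:

```python
# Minimal sketch of character-level quantization for a ConvNet input.
# The alphabet and fixed length here are illustrative, not necessarily
# those used in the dissertation's experiments.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 16  # fixed input length; real models use much longer windows

def one_hot_chars(text, max_len=MAX_LEN):
    """Encode text as a (max_len, len(ALPHABET)) one-hot matrix.

    Characters outside the alphabet map to an all-zero row, and the
    sequence is truncated or zero-padded to max_len.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix

encoded = one_hot_chars("Hello!")  # '!' falls outside the alphabet
```

The resulting matrix is what a 1-D convolution would slide over, treating the alphabet dimension as input channels.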
The second part concerns which text encoding method works best for convolutional networks. We present a comprehensive comparison of encoding methods for the task of text classification using 14 large-scale datasets in 4 languages: Chinese, English, Japanese and Korean. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. Among the conclusions drawn from these 473 models is that byte-level one-hot encoding works consistently best for convolutional networks.

Based on this result, in the third part of the dissertation we develop a convolutional network at the level of bytes for learning representations through the task of auto-encoding. The proposed model is a multi-stage deep convolutional encoder-decoder framework using residual connections, containing up to 160 parameterized layers. Each encoder or decoder contains a shared group of modules consisting of either pooling or upsampling layers, making the network recursive in terms of abstraction levels in the representation. The decoding process is non-sequential. Results are reported for 6 large-scale paragraph datasets in 3 languages: Arabic, Chinese and English. Analyses are conducted to study several properties of the proposed model, and experiments verify that the auto-encoder can learn useful representations.
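To make the byte-level one-hot scheme concrete, here is a minimal sketch (illustrative only; the sequence length is an arbitrary choice, not the dissertation's setting). One appeal of operating on UTF-8 bytes is that every language maps into the same 256-dimensional input space:

```python
# Minimal sketch of byte-level one-hot encoding, the input scheme the
# comparison found to work consistently best for ConvNets.
MAX_BYTES = 32  # illustrative fixed input length

def one_hot_bytes(text, max_len=MAX_BYTES):
    """Encode a string's UTF-8 bytes as a (max_len, 256) one-hot matrix.

    The byte sequence is truncated or zero-padded to max_len. Because
    any string has a UTF-8 encoding, the same 256-way input space covers
    English, Chinese, Arabic, or any other script without a per-language
    alphabet.
    """
    data = text.encode("utf-8")[:max_len]
    matrix = [[0] * 256 for _ in range(max_len)]
    for pos, byte in enumerate(data):
        matrix[pos][byte] = 1
    return matrix

ascii_enc = one_hot_bytes("ab")
cjk_enc = one_hot_bytes("你")  # one CJK character spans 3 UTF-8 bytes
```

Note that multi-byte characters occupy several rows, so the effective sequence length in characters varies by script even at a fixed byte length.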
In the fourth part of the dissertation, we apply the improved design from the auto-encoding model to text classification, adding comparisons between residual and dense connections. This further validates the architectural choices made for the auto-encoding model, and the effectiveness of the recursive architecture with residual or dense connections.