Data for MATLAB hackers
Here are some datasets in MATLAB format.
I'm working on better documentation, but
if you decide to use one of these and don't have
enough info, send me a note and I'll try to help.
Also, if you discover something, let me know
and I'll try to include it for others.
Handwritten Digits
- USPS 16x16 "Raw"
[data/usps_16x16_raw.mat]
Handwritten digits from US post office, "0" through "9".
Two cell arrays, one for training, one for testing.
d{1} is "one", ..., d{9} is "nine", and d{10} is "zero".
7291 training cases, 2007 test cases.
Warning: the test data comes from a totally different distribution
than the training data. Use at your own risk.
- USPS Digits 16x16 "Nice"
[data/usps_16x16.mat]
A Hinton Mafia classic, pre-processed ages ago by Mike Revow (and
perhaps Yann LeCun?); 16x16 grayscale, 1100 of each digit, no test or
training splits (welcome to unsupervised world).
Case-by-case mapping from this one to the above one is unknown.
- USPS Digits 8x8
[data/usps_8x8.mat]
Same as the "Nice" set above, but at 8x8 resolution.
- Binary Alphadigits
[data/binaryalphadigs.mat]
Binary 20x16 digits of "0" through "9" and capital "A" through "Z".
39 examples of each class.
From Simon Lucas' (sml@essex.ac.uk) Algoval system.
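The d{10}-is-"zero" convention in the USPS "Raw" file is easy to trip over when you build label vectors. Here is a minimal sketch in Python, assuming the cell positions are counted from 1 as in MATLAB and that you have already loaded the cell arrays by some other means. The function name is hypothetical and not part of the dataset.

```python
def usps_cell_index_to_digit(cell_index: int) -> int:
    """Map a 1-based cell position (1..10) to the digit stored there.

    Assumes the layout described for the USPS "Raw" file:
    position 1 holds "one", ..., position 9 holds "nine",
    and position 10 holds "zero".
    """
    if not 1 <= cell_index <= 10:
        raise ValueError("cell_index must be between 1 and 10")
    # 1..9 map to themselves; 10 wraps around to 0.
    return cell_index % 10


# Digit labels in the order the cells appear in the file.
labels = [usps_cell_index_to_digit(k) for k in range(1, 11)]
print(labels)
```

If your loader hands back 0-based positions instead, add 1 before calling the function.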
Faces
- Frey Face
[data/frey_rawface.mat]
From Brendan Frey. Almost 2000 images of Brendan's face,
taken from sequential frames of a small video. Size: 20x28.
- Saul Faces
[data/saulfaces.mat]
8-bit grayscale faces (values 0-255), a few images each of several
different people. 400 total images, 64x64 size.
From Kearns and Saul's Advanced Topics in AI course page at Penn.
Text
- NIPS Conference Papers Vols 0-12
[matlab or raw data]
A whole lot of fun! I massaged the OCR'd data
from NIPS1-12 (the pre-electronic submission era)
that Yann made available.
I've included a tarball of the massaged raw data, as well as a
MATLAB package in which the data is already read in and pre-processed.
See the readme file for the raw data massaging notes and the
MATLAB notes file for explanations of the MATLAB data.
There are also a couple of extra MATLAB files: one with
conference and page number info, which you can't make yourself
but seems boring to me, and one with word counts by author,
which is cool, but you could easily have made yourself.
- 20 Newsgroups
[data/20news_w100.mat]
A tiny version of the 20newsgroups data, with binary occurrence
data for 100 words across 16242 postings.
I've also tagged the postings by their highest-level
domain in the array "newsgroups".
- Some English Text
[data/dfre_txt.mat]
From "Decline and Fall of the Roman Empire" by Gibbon.
Cell array, one entry per sentence.
Also grab the function data/dfre_read.m
to convert the cell arrays to readable text.