Fall 2006, Machine Learning
Professor Yann LeCun
Instrument recognition is the task of training a learning machine to identify an instrument by name from a sound clip. In this project I implement and test a simple machine that recognizes four instruments: saxophone, trumpet, piano, and flute. The sound files are processed by a hand-written feature extractor implemented in Processing, a Java-based language. The features are stored in a Lush-readable .ldat format. The data is then used to train and test neural networks with 10, 20, 40, and 80 hidden units.
The features used to identify an instrument are the relative volumes of the overtones above the fundamental frequency. This provides pitch invariance and enough information to distinguish the four instruments reasonably well. The sound is sampled in 100 ms slices, each run through a fast Fourier transform into the frequency domain. The extractor then reads the volume of the first nine overtone frequencies and divides each by the volume of the fundamental frequency for normalization. The overtones are extracted for ten consecutive 100 ms slices, so each sample yields 90 values: ten slices times nine overtones.
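The extraction steps above can be sketched in Python. This is an illustrative reconstruction, not the original Processing code: the function name `overtone_features` is hypothetical, and the fundamental frequency is assumed to be known in advance, since the text does not describe how it is detected.

```python
import numpy as np

def overtone_features(signal, sample_rate, fundamental_hz, n_overtones=9,
                      n_slices=10, slice_ms=100):
    """Sketch of the described extractor: for each 100 ms slice, take the
    FFT magnitude at the first nine overtones (harmonics 2..10) and
    normalize each by the magnitude at the fundamental."""
    slice_len = int(sample_rate * slice_ms / 1000)
    freqs = np.fft.rfftfreq(slice_len, d=1.0 / sample_rate)
    features = []
    for i in range(n_slices):
        chunk = signal[i * slice_len:(i + 1) * slice_len]
        spectrum = np.abs(np.fft.rfft(chunk))
        # magnitude at the bin closest to a given frequency
        mag_at = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
        fundamental = mag_at(fundamental_hz)
        for k in range(2, 2 + n_overtones):
            features.append(mag_at(k * fundamental_hz) / fundamental)
    return np.array(features)  # 10 slices x 9 overtones = 90 values
```

For a clip whose second harmonic is half the volume of its fundamental, the first value of each slice's nine-feature group comes out near 0.5.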
The data is organized in four different ways. See Table 1 for the distribution of samples.
There are a total of 90 sound files. The sound files are samples of the four instruments playing sustained notes at different octaves, with different tones, with and without vibrato. Some sound files contain several notes separated by silence; the feature extractor divides up these files and extracts several samples from them. The distribution of samples for each instrument by source can be found in Table 1. Most of the samples come from the UIowa database, which had the most complete and accessible collection of instrument samples I could find on the Internet. However, using only a single database restricts the machine to recognizing the particular instruments used to create the UIowa database: for each instrument type, the UIowa database has only one instrument and one musician providing all the sound clips. Each instrument is played over several octaves, with and without vibrato. The miscellaneous sound samples are used to broaden the recognition capabilities of the machine and to explore how accurate the machine is when trained only with samples from a single database. For all the datasets, 70% of the data is used for training and the rest for testing, except for the Separate dataset, where this split is not applicable.
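The 70/30 split described above can be sketched as follows. This is a minimal illustration; the function name and the use of a fixed shuffle seed are my own assumptions, not details from the original project.

```python
import random

def split_dataset(samples, train_frac=0.7, seed=0):
    """Shuffle and split a list of (features, label) pairs,
    70% for training and the rest for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```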
The sound clips were drawn from a few different sources. The main source was the University of Iowa instrument database (http://theremin.music.uiowa.edu/MIS.html), referred to here as the UIowa data source. While the database includes other instruments, only the alto saxophone, trumpet, piano, and flute were used. Additional instrument recordings were gathered from two sites, the Free Sounds project repository (http://freesound.iua.upf.edu/index.php) and the Find Sounds sound search engine (http://www.findsounds.com/); this data is referred to as Miscellaneous.
For each dataset, four machines were trained and tested: neural networks with 10, 20, 40, and 80 hidden units. The learning rate, eta, was 0.003 for all four machines. For the 10- and 20-hidden-unit networks, the weight decay, lambda, was 0.0001; for the 40- and 80-unit networks, it was 0.00008.
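A network of this shape can be sketched as a one-hidden-layer net trained by stochastic gradient descent with L2 weight decay, using the hyperparameters reported above. The original machines were built in Lush; this NumPy version is an illustrative reconstruction, and the tanh hidden units and softmax output are my assumptions about details the text does not specify.

```python
import numpy as np

class SimpleNet:
    """One-hidden-layer net (tanh hidden units, softmax output) trained
    by SGD with L2 weight decay. Defaults mirror the reported setup:
    90 inputs, 4 classes, eta = 0.003, lambda = 8e-5 for 40/80 units."""
    def __init__(self, n_in=90, n_hidden=40, n_out=4,
                 eta=0.003, lam=8e-5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)
        self.eta, self.lam = eta, lam

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        z = self.W2 @ h + self.b2
        p = np.exp(z - z.max())
        return h, p / p.sum()          # hidden activations, class probs

    def train_step(self, x, y):        # y is a class index 0..3
        h, p = self.forward(x)
        dz = p.copy(); dz[y] -= 1.0    # softmax + cross-entropy gradient
        dh = self.W2.T @ dz
        du = dh * (1 - h * h)          # tanh derivative
        # SGD updates with L2 weight decay on the weight matrices
        self.W2 -= self.eta * (np.outer(dz, h) + self.lam * self.W2)
        self.b2 -= self.eta * dz
        self.W1 -= self.eta * (np.outer(du, x) + self.lam * self.W1)
        self.b1 -= self.eta * du
```

With eta = 0.003 the reported training would take many passes over the data; the decay term shrinks the weights slightly at every step, which is what lambda controls.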
For all four datasets the test error was calculated for neural networks with 10, 20, 40, and 80 hidden units. The best, worst, and average test error from five runs can be found in Table 2. This data is visualized with bar charts in Figures 1 and 2.
Dataset | # of Units | Best | Worst | Average
When training and testing with only the UIowa dataset, the results are very good, because a single instrument and musician provides all the samples for each type of instrument. While recognizing sounds from the same instrument is not useful for practical applications, it is not entirely trivial, because the instrument plays different notes in different ways. This means the feature extractor is working properly and the neural network is learning to recognize four different instruments.
To expand on the UIowa database, samples taken from various other sources were combined with it to broaden the recognition capability of the machine. This is equivalent to recognizing an instrument type regardless of the instrument's manufacturer or the playing style, and it also tests the feature extractor's ability to extract relevant features despite differences in recording quality. The best results were with the 40- and 80-unit neural nets, which yielded about 8% test error in the best case. The 80-unit net, however, was better on average, with about 12% test error.
The Separate dataset trains with the UIowa database and tests with the miscellaneous data. The results are much worse than for any of the other datasets. Even though the UIowa dataset has a variety of samples of good recording quality, having one instrument and musician provide all the sounds is not enough to generalize to a wider range of recordings.
The miscellaneous data was used for training and testing without the UIowa dataset. This was done to discover the overall quality of the recordings and how well a disparate set of data would work for recognition. When testing the UIowa dataset alone, a single instrument provided all the samples for each type of instrument; in the miscellaneous set, almost every sample is played on a different instrument and recorded by a different person. But the dataset is small, 47 samples, and this impacted training, as can be seen in the worsening test error of the machines with more hidden units. The variation in the dataset, coupled with the low sample size, created wider variability in test error. The test error was worse than with the UIowa dataset because the UIowa machine was trained and tested on the same instrument and musician for a given instrument type. The miscellaneous data brings a wider variety of instruments, but the recordings are generally of lower quality, so the features are not extracted as accurately.
When testing with only the UIowa database, which amounts to recognizing one person playing one instrument across different samples, the machine recognized samples it had not been trained on from the same database very well, but performed poorly on samples from other sources. The miscellaneous data trained the machine to recognize a wide variety of different recordings of the same instrument, but its performance was unstable, yielding poor average test errors. Using the All dataset, however, gave good performance even when identifying different recordings of different instruments. In the end, the best performance came from combining the two datasets: the same instrument played in many different ways taught the recognizer the slight differences between samples, while several different recordings of different instruments gave the recognizer breadth.
There is much room for further development of this instrument recognizer. The dataset was small, only 125 samples; with more data, the test error would likely be smaller than the results listed here. More instruments could be added, features could be extracted automatically with a convolutional network, and real-time instrument recognition could be achieved with a time-delay neural network. Multiple instruments could be recognized simultaneously given a database of overlapped instrument samples.