Music has always been present in human life, both individually and socially, through its cultural, professional, leisure and religious dimensions. Music is a means by which composers express their innermost feelings. As a form of celebration, it has always accompanied our festive moments, and in the expression of human religiosity it has long been regarded as a form of prayer. In sports, music can be used to keep an athlete motivated. More recently, music has been explored as an aid in the therapy of nervous disorders and even as a means of improving student performance. Music is associated with the most significant moments of our lives; it brings back memories, arouses emotions, and is part of our individual and collective imagination.
For the reasons stated above, music also plays an important role in the development of the world economy. In fact, in the USA alone, the music industry moves on the order of several billion dollars per year [16].
As a result of recent technological innovations, the Electronic Music Distribution (EMD) industry has grown tremendously. Widespread access to the Internet, increasing bandwidth in domestic connections and the generalized use of compact audio formats with CD or near-CD quality, such as mp3, have all contributed to that boom, which poses new and exciting challenges.
Presently, whether in a digital music library, on the Internet or in any other music database, search and retrieval are carried out mostly in a textual manner, based on categories such as author, title or genre. This approach creates difficulties for both service providers and end users: labeling music is laborious for the former, and searching databases in a transparent and intuitive way is hard for the latter.
Therefore, efforts are now under way to make it possible to search music databases by content similarity [12; 22; 24]. In such systems, the goal is to let users create musical queries by example, e.g., by humming the melody they are looking for, or by specifying a song that is similar to the desired one according to certain search criteria (theme, rhythm, genre, instrumentation, etc.). This approach can be very useful when users do not know, or are not especially interested in, the melody: an aerobics instructor may look for songs with a certain tempo, or a truck driver may look for a song that keeps him alert [9], regardless of melody or genre. Doing this manually can be a daunting task, given the thousands or even millions of songs that many music libraries contain, sometimes organized into tens or hundreds of different and often non-uniform genres.
Another objective is to simplify the task of organizing musical databases via automatic classification systems, where similar songs may be found close to each other [7; 16; 21]. Such systems should overcome the limitations resulting from manual song labeling, which may be a highly time-consuming and subjective task.
Regarding EMD, music web crawlers, which “traverse the web and index music-related files” [9], are applications with enormous potential. Automatic classification systems should also be extremely useful for labeling and updating huge music databases, as well as serving as tools for content-based retrieval, as stated before. The same applies to multimedia databases and operating systems.
Besides the possibilities for EMD, systems for education and training can also benefit from the results attained. For example, systems for automatic music transcription [2; 6; 13] may simplify tasks such as manual transcription, music composition, music analysis or the evaluation of musical performances. Professional composers may also find useful tools that help detect plagiarism.
Additionally, audio editors or audio browsers could become more intelligent with tools for automatic indexing of music/audio files [23].
Also, in the cinema and advertising industries, it is often necessary to search for songs that induce a certain mood in the intended audience [9].
Figure 1. General System Architecture.
A general system for music retrieval is depicted in Figure
1, adapted from [4].
The architecture presented is a typical client-server one, suited for web-based applications, as is usually the case in music retrieval. However, such a general architecture can easily be adapted, for example, to standalone applications or to music retrieval in local file systems or distributed databases.
Below, I will present an overview of the main tasks performed by both client and
server.
When users search for songs, either they know what they want, e.g., song title, artist or genre, in which case a traditional text-based search is enough, or they want to find songs based on content similarity. They may also search for a song based on lyrics or other criteria.
In the case of searching by similarity, it must be possible for users to build their queries in an intuitive and interactive way [14]. One of the most intuitive ways to search music databases is by humming or singing some melody at the microphone, i.e., query-by-singing (QBS) or query-by-humming (QBH) [1; 4]. Another way of specifying the query is by selecting a song file, typically an mp3 file, similar to the desired song or songs in some way, e.g., rhythm, melody or genre, i.e., query-by-example (QBE) [12; 22; 24]. Either way, the client is responsible for building a query signature, which is sent to the server and then compared to the signatures present in the database, in the server-side.
Regarding QBH / QBS, the main task of the client application is to extract the sequence of notes hummed or its melodic contour, i.e., the sequence of note transitions (up, down, equal), which is especially suited for transposition-invariant searches [11]. In fact, many people can sing the same melody in different keys, e.g., “Happy Birthday”, and the search engine should respond adequately (this point will be addressed later on, when the server-side is described). Also, rhythmic information should be captured [4], since two songs can be equal in terms of note sequence but differ in beat and tempo. In order to make the query as precise and robust as possible, its construction should also be interactive [14]. Namely, the notes extracted should be shown in a score, so that the user could evaluate and correct the sequence extracted by the application. The same applies if a melodic contour is extracted. Additionally, it should be possible to listen both to the songs hummed and to the notes extracted.
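To make the idea concrete, the following minimal Python sketch derives a melodic contour from a sequence of already-transcribed notes; MIDI note numbers are assumed as input, and the note extraction itself is not shown.

```python
# Minimal sketch: derive a transposition-invariant melodic contour from a
# sequence of note pitches (MIDI note numbers assumed). The contour keeps
# only the direction of each transition: U(p), D(own), S(ame).

def melodic_contour(notes):
    """Return the contour string for a list of note pitches."""
    contour = []
    for prev, curr in zip(notes, notes[1:]):
        if curr > prev:
            contour.append("U")
        elif curr < prev:
            contour.append("D")
        else:
            contour.append("S")
    return "".join(contour)

# The same melody sung in two different keys yields the same contour,
# which is what makes contour-based search transposition-invariant.
print(melodic_contour([60, 60, 62, 60, 65, 64]))  # SUDUD
print(melodic_contour([62, 62, 64, 62, 67, 66]))  # SUDUD
```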
As for QBE, similarity criteria should be defined by individual users. This can be accomplished by creating an interface where users can select the criteria relevant to their searches, as well as the weights given to those criteria [23].
Queries could encompass higher-level criteria such as melody [8; 10], rhythm [18; 21], timbre [21], loudness [7; 16] or mood [9], as well as lower-level physical criteria such as energy, harmonicity or bandwidth [23]. The signature of the query can also be made up of statistical data such as mean, variance, maximum or minimum, as well as histograms of volume, frequency and so forth.
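As an illustration only (not tied to any particular system from the literature), such a statistical signature could be assembled along the following lines, assuming per-frame energy and zero-crossing-rate trajectories have already been computed elsewhere:

```python
import numpy as np

# Illustrative sketch of a query signature built from simple statistics of
# short-term feature trajectories, plus a coarse volume histogram.

def signature(energy, zcr, n_bins=10):
    energy = np.asarray(energy, dtype=float)
    zcr = np.asarray(zcr, dtype=float)
    stats = [energy.mean(), energy.var(), energy.max(), energy.min(),
             zcr.mean(), zcr.var(), zcr.max(), zcr.min()]
    hist, _ = np.histogram(energy, bins=n_bins, density=True)
    return np.concatenate([stats, hist])
```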
Still regarding QBE, the system should also be extensible with
user-defined concepts [23].
Methodologies for empirical learning, e.g., neural networks, could be applied to
induce those concepts by means of training examples.
Finally, the signature of the query is sent to the server, which returns the search results ranked by similarity. For example, if a user had sent a query regarding some Bossa-Nova song, he/she could have obtained a list like the one in Figure 1 (visualization box). The server should also return a short summary for every song, which is useful for song recognition and validation of results [9].
The creation of signatures results from the need to obtain a meaningful representation from raw audio data. In fact, raw audio cannot be used directly for content analysis: for instance, it is not possible to look at a sequence of sample values and say, directly, which notes are present. Thus, it is essential to transform the raw data into a meaningful representation suited for music content analysis.
This representation must also allow for data reduction by eliminating any redundant information present in the song [9]. For example, most songs have a chorus that is repeated several times throughout the song. Thus, it is important to segment the song into its most relevant components, e.g., introduction or chorus, and to eliminate all repeated segments. A signature is then obtained for each of the final non-redundant segments. However, in most cases found in the literature, authors simply extract a signature for the whole song, or for segments taken at regular intervals, without taking the aspects referred to above into account. For instance, in [16], Pampalk divides every piece of music into six-second segments and analyzes only every third segment. This is very useful for capturing different styles in music classification problems but does not work well for music summarization.
After segmentation, the signatures are then obtained based on features extracted from each of the segments. Features can be temporal sequences of fundamental frequency, energy, zero-crossing rate and so on, as stated before. Additionally, statistical information can be obtained from those sequences. The process of feature extraction will be described later on.
Based on the extracted features, each segment is then classified, typically within a genre hierarchy. Here, tools such as neural networks, k-nearest neighbors, clustering techniques or support vector machines can be used, where classes such as jazz, baroque or pop are learned from training examples made up of such features. It is important to note that classifying segments rather than whole songs can resolve ambiguities in songs that encompass different styles: a single song can then be classified into several genres, avoiding many ambiguities in song labeling.
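A hedged sketch of the k-nearest-neighbors case is given below; the distance measure, the value of k and the class labels are illustrative assumptions, not the configuration of any cited system.

```python
import numpy as np

# Sketch: label a segment's feature vector by majority vote of its k nearest
# training segments, using plain Euclidean distance.

def knn_classify(query, train_vectors, train_labels, k=5):
    train_vectors = np.asarray(train_vectors, dtype=float)
    dists = np.linalg.norm(train_vectors - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```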
The methodology for signature creation and song classification is applied to all songs in the server, and all signatures are stored in the database. This task is carried out offline while setting up a new system and when new songs are added to the database. Therefore, no real-time requirements are imposed, though temporal efficiency is desired.
The situation is different when the server receives a query from a client. In fact, the signature of the query must then be compared with all signatures stored in the database. Clearly, a “world-wide-wait” is to be avoided and the system must give quick responses to the client. In huge databases, this can be a difficult problem. That is one of the reasons why it is important that the signatures stored are as short as possible, while keeping their relevance. When melody is not a main issue, comparison can be simply performed by using a metric distance. In fact, in such queries, signatures consist typically of feature-vectors with information regarding statistical data extracted from temporal sequences [23]. However, in query-by-humming and query-by-singing, algorithms for string matching must be used in order to compare sequences [11], e.g., melodic contour or sequence of notes, possibly in different keys. Therefore, such algorithms must be highly efficient and optimized, so as to minimize server response time in transposition-invariant searches [11]. Additionally, the system must be robust and flexible regarding query imprecision, e.g., a few notes out of tune, non-uniform tempo and beat. Furthermore, other procedures for search optimization are very important, e.g., database indexing.
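For illustration, a basic (unoptimized) edit-distance computation between two melodic contours is sketched below; production systems rely on far more efficient approximate string-matching algorithms and on database indexing, as noted above.

```python
# Sketch of approximate matching between melodic contour strings via the
# standard edit (Levenshtein) distance: the minimum number of insertions,
# deletions and substitutions turning one contour into the other.

def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# A hummed query with one wrong transition still ranks close to the target:
print(edit_distance("SUDUD", "SUDUU"))  # 1
```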
Finally, the results are sent to the client, ranked by similarity, where the most similar song is the one containing the most similar segment. Song summaries, constructed from the segments obtained previously, are also sent. As stated before, summaries are important for users, so that they can easily hear what the server returned and validate the results. One common and easy way of delivering summaries consists of taking only the incipit, i.e., “the initial few seconds” of the song [9]. This procedure mirrors common web search engines, which return the initial words of every page. However, it is a rather limited approach, since many songs have introductions that are very different from the chorus [9]. Furthermore, many songs have several very distinct passages, all of them relevant, which should be present in the summary. Thus, summaries based on song segmentation and redundancy elimination are more effective. In conclusion, song summaries are relevant on both the client and server sides: as a way to optimize searches, on the server, and as a way to evaluate results, on the client.
As I mentioned before, extracting adequate query signatures is a key issue for music classification and retrieval systems. In order to accomplish that goal, good features must be selected and extracted from the audio data. I will now address this crucial aspect.
Before describing those features, it is important to know what characterizes a good feature. Intuitively, a good feature should carry meaningful information [16], such as melody, rhythm or timbre. Ideally, the process of feature extraction should mimic the human brain, which has led to attempts to apply psychoacoustic findings to the problem of music content analysis [3; 20]. However, many aspects of the brain's behavior are still poorly understood. That is why some physical features, i.e., features extracted directly from the data without perceptual concerns, can be useful despite being less intuitive. They also have the advantage of being less costly in terms of computing time.
Below, I describe some of the most relevant features for music content-analysis.
Fundamental frequency
When we hear a song, it is amazing how our brains can distinguish the instruments present and the melodic line followed by each of them (or at least part of it). In an orchestra, for instance, we can try to follow the violins, flutes, trumpets and so on. We can also easily follow the global melodic line instead of paying attention to particular instruments. However, no computer system can do anything similar yet.
The main issues here are therefore i) to build computer systems that can extract the dominant or fundamental frequency at any instant of time and ii) to isolate each audio stream present in a music signal, e.g., vocals, guitar, violin (a problem known as source separation). Pitch and fundamental frequency are often used interchangeably, but they are different concepts. Fundamental frequency is a physical variable that represents the base frequency of harmonic sounds, i.e., sounds where all main frequencies present are multiples of a base frequency. Pitch, on the other hand, is a perceptual variable that determines our individual perception of frequency. It entails aspects such as our brain's response to frequency and intensity, and is discussed, e.g., in [20].
There are presently hundreds of algorithms for fundamental frequency detection or estimation, many of them developed in the context of voice recognition research. However, most of them are only suited for monophonic music analysis, i.e., music with only one audio stream, such as folk music, “shower singing”, humming, etc.
As for polyphonic music, i.e., music with several audio streams, such as pop music or orchestras, algorithms for dominant frequency detection or estimation are still in their infancy.
One of the most widely used tools for the frequency analysis of discrete signals is the Discrete Fourier Transform (DFT) [17]. Through the DFT, signals are transformed from the time domain to the frequency domain, where their frequency spectrum, i.e., the range of frequencies covered, is revealed. The DFT of a signal is depicted in Figure 2.
Figure 2. DFT of a simple signal.
In the previous figure, the three frequencies present in the signal are clearly visible. However, what happens when we have a signal whose frequency content varies over time? The DFT is only suited to stationary signals, i.e., signals whose frequency content does not change. This calls for a windowed version of the DFT: the Short-Time Fourier Transform (STFT) [17]. Here, the main idea is to divide the signal into a set of time frames and to calculate the DFT of each frame. Typically, the frame length is around 20 ms, so that stationarity can be assumed.
A simple algorithm for fundamental frequency detection would then consist of performing an STFT analysis of the signal and determining the fundamental frequency of each frame. Filter banks can be used to extend this approach by measuring the signal energy in each frequency band. This technique tries to mimic the behavior of the inner ear, which itself acts as a bank of filters, and is therefore the most valuable strategy in terms of human auditory perception.
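A naive version of this idea, for monophonic signals only, is sketched below: the signal is cut into roughly 20 ms frames and the strongest spectral peak of each frame is taken as a crude estimate of the fundamental frequency. For harmonically rich sounds the strongest peak may actually be a harmonic, which real detectors must handle.

```python
import numpy as np

# Naive sketch, monophonic signals only: frame the signal, take the magnitude
# spectrum of each frame (STFT) and pick the strongest non-DC peak as a rough
# fundamental-frequency estimate.

def dominant_frequency_track(x, sr, frame_ms=20):
    frame_len = int(sr * frame_ms / 1000)
    window = np.hanning(frame_len)
    f0 = []
    for start in range(0, len(x) - frame_len, frame_len):
        frame = x[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        f0.append(freqs[np.argmax(spectrum[1:]) + 1])  # skip the DC bin
    return np.array(f0)
```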
Another possibility is to perform cepstral analysis [12; 17]. In this technique, the Fourier coefficients are first determined, their logarithms are then calculated, and finally the inverse Fourier transform is applied to the result. The outcome is a large peak at the quefrency corresponding to the fundamental period of the original signal.
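A minimal single-frame sketch of this cepstral approach follows; the 50-500 Hz search range is an assumption for illustration, and the frame must span at least two periods of the lowest frequency of interest.

```python
import numpy as np

# Sketch of cepstral analysis on one frame: log-magnitude spectrum followed by
# an inverse transform; a peak in the cepstrum marks the period in samples.

def cepstral_pitch(frame, sr, fmin=50.0, fmax=500.0):
    spectrum = np.abs(np.fft.rfft(frame))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    lo = int(sr / fmax)                               # shortest period searched
    hi = min(int(sr / fmin), len(cepstrum) // 2)      # longest period searched
    period = lo + np.argmax(cepstrum[lo:hi])
    return sr / period
```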
One other strategy consists of calculating the autocorrelation function [17]. There, if a signal is periodic (or pseudo-periodic, i.e., “almost” periodic), its autocorrelation function will also be a periodic signal with the same period as the original. Then, the period found will be equal to the distance between peaks.
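A corresponding single-frame autocorrelation sketch is shown below, again with an assumed 50-500 Hz search range.

```python
import numpy as np

# Sketch of autocorrelation-based pitch detection for one frame of a
# (pseudo-)periodic signal: the lag of the strongest peak in the searched
# range is taken as the period.

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)
    hi = min(int(sr / fmin), len(ac) - 1)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag
```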
As for polyphonic music, many more difficulties are present. To illustrate such difficulties, let’s look at the spectra resulting from playing middle C on a flute, piano and trumpet, in Figure 3 (adapted from [5]).
Figure 3. Spectra of middle C played on a flute, piano and a trumpet [5].
In the previous figure, we can see that the same note played on different instruments produces very different spectra. The flute is almost a pure tone (a single peak at the fundamental frequency), the piano has clear peaks at the second and fourth harmonics, and the trumpet is the richest of the three instruments in terms of harmonic content. There are also overtones present, i.e., spectral components that are not multiples of the fundamental frequency (these overtones contribute to the particular timbre of each instrument). If we now imagine a song where those instruments play together, it is easy to conclude that the signal spectrum will be very complex, because of the interaction between the fundamental frequencies, harmonics and overtones of each of them. Extracting each audio stream, or the dominant frequency, is then no longer an easy task.
Efforts are now being made to attack the problems of source separation and dominant frequency detection/estimation.
Source separation is a major concern for polyphonic music analysis and automatic music transcription systems, and has no general solution yet. The human brain processes auditory information in a process called “auditory scene analysis” [3]. In an attempt to replicate human behavior, some work has been carried out to develop computational auditory scene analysis systems. The results obtained are not yet very accurate and are only acceptable for simpler or well-constrained problems. Namely, Ellis [6] tries to analyze a sound signal by means of competing theories, each of which proposes a combination of sounds that might produce the sound observed; sound source models are used as a basis for the proposed method. Bello et al. [2] and Martin [13] have used computational blackboard systems for simple automatic music transcription. Scheirer [19] proposes a model based on perceptual issues, using dynamic clustering of comodulation data; in contrast to the other systems referred to, this model is designed for the analysis of complex music. Other models impose constraints on the number of instruments present or on the harmonic interaction between them, as referred to in [8].
Dominant frequency detection/estimation, possibly a less difficult problem, consists of detecting only the main melodic line in a song. For instance, when we hear a pop song, we have vocals, guitar, bass, percussion and so forth; yet, in spite of all that information, our brains can still retain the main melodic line. Klapuri et al [10] proposed a method for predominant pitch estimation in which the musical signal is analyzed in separate frequency bands: 18 logarithmically distributed bands from 50 Hz to 6 kHz are used. At each band, a fundamental-frequency likelihood vector is calculated, and the results from all bands are then combined to yield global pitch likelihoods. They report results that outperform the average of ten trained musicians. Goto [8] uses a probabilistic model for the detection of melody and bass lines. The signal is first band-pass filtered and a probability density function (PDF) is then computed for each signal component. The PDFs are generated from a weighted mixture of tone models of all possible fundamental frequencies: the more dominant a model is in the PDF, the more likely the fundamental frequency belongs to that model. The author compared the extracted dominant frequencies with hand-labeled notes and reports an average detection rate of 88.4% for the melodic line.
Tonal Histograms and Transitions
The frequency content of a music signal can also be analyzed by means of tonal histograms and transitions. In [22], histograms of frequency amplitudes across the notes of the Western music scale are proposed. This information can be used to detect dominant chords, as well as the key in which the song is played.
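A rough sketch of the tonal-histogram idea (not the exact extractor of [22]) is given below: spectral magnitude is accumulated into the semitone bins of the equal-tempered Western scale.

```python
import numpy as np

# Sketch: map each spectral bin to the nearest equal-tempered note
# (A0 = 27.5 Hz as note 0) and accumulate its magnitude.

def tonal_histogram(x, sr, fmin=27.5, n_notes=88):
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    hist = np.zeros(n_notes)
    valid = freqs > 0                       # skip the DC bin
    notes = np.rint(12 * np.log2(freqs[valid] / fmin)).astype(int)
    for note, mag in zip(notes, spectrum[valid]):
        if 0 <= note < n_notes:
            hist[note] += mag
    return hist
```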
The same authors suggest tonal transitions for QBE. Basically, music signals can be seen as sequences of frequency transitions over time. Therefore, they propose an extractor that measures the number of tonal transitions in a given frequency range for ten-second samples. Five feature vectors are obtained, each containing 306 values.
Zero-Crossing Rate
Measuring the zero-crossing rate (ZCR) in an audio signal consists of counting the number of times the sound wave crosses the zero axis [17]. Thus, this feature gives frequency-related information. Namely, a high ZCR indicates a signal with high-frequency content, whereas a low ZCR suggests the opposite.
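A minimal per-frame implementation could look as follows (the treatment of samples exactly at zero is a simplification):

```python
import numpy as np

# Zero-crossing rate of one frame: the fraction of consecutive sample pairs
# whose signs differ.

def zero_crossing_rate(frame):
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])
```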
ZCR is a feature imported from voice recognition systems. In fact, it has been used there as a robust measure to detect unvoiced speech. Also, in general audio classification, it has been used for music/speech discrimination [23].
Regarding music content analysis, ZCR-based features, namely statistical ones, are present in feature-vectors for music signal classification.
Rhythm analysis encompasses aspects such as beat and tempo analysis, which I describe below.
Beat and Tempo
Regarding beat and tempo analysis, the main idea is to find periodicities in the signal amplitude envelope.
In [21], a bank of filters is used to divide the signal into a number of bands, each of them representing an octave. Then, the amplitude envelope of the signal at each band is extracted, by means of full wave rectification, low pass filtering and downsampling. Next, the envelopes at each band are summed up. Finally, periodicities are detected by finding peaks in the envelope autocorrelation function.
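A much-simplified, single-band sketch of this idea is shown below (the method in [21] operates on several octave bands and sums their envelopes); the hop size and tempo range are illustrative assumptions, and the excerpt should be at least a few seconds long.

```python
import numpy as np

# Simplified single-band tempo estimation: full-wave rectify, smooth and
# downsample to get an amplitude envelope, then pick the strongest
# periodicity in the envelope autocorrelation within a plausible tempo range.

def estimate_tempo(x, sr, hop_ms=10, bpm_range=(40, 200)):
    hop = int(sr * hop_ms / 1000)
    n_hops = len(x) // hop
    # Envelope: mean absolute amplitude per hop (rectification + crude
    # low-pass filtering + downsampling).
    env = np.abs(x[:n_hops * hop]).reshape(n_hops, hop).mean(axis=1)
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    env_sr = 1000.0 / hop_ms                   # envelope sample rate (Hz)
    lo = int(env_sr * 60 / bpm_range[1])       # lag of the fastest tempo
    hi = int(env_sr * 60 / bpm_range[0])       # lag of the slowest tempo
    lag = lo + np.argmax(ac[lo:hi])
    return 60.0 * env_sr / lag
```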
In [18], a filterbank is also used, which divides the signal into six bands. For each band, the amplitude envelope is calculated, as well as its derivative. Next, each of the derivatives is passed to a set of comb filter resonators, where only one of them will phase-lock. The output of those resonator filterbanks is then summed across the frequency bands. Then, the energy output from each resonator channel is examined. The tempo of the signal is selected as the frequency of the resonator with the maximum energy output. Finally, beats are detected by looking back to the peak phase points in the phase-locked resonators.
Energy
Signal energy, also called volume, is also useful for rhythmic analysis. In fact, signal energy can be used as a basis for amplitude envelope extraction.
Energy is obtained by computing the sum of squares of signal amplitude values for each frame [17]. This feature has been largely used in voice recognition systems for silence detection and voiced/unvoiced speech discrimination.
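A per-frame computation is straightforward:

```python
import numpy as np

# Frame-level energy (volume): sum of squared sample values per frame.

def frame_energy(x, frame_len):
    n = len(x) // frame_len
    frames = np.asarray(x[:n * frame_len], dtype=float).reshape(n, frame_len)
    return np.sum(frames ** 2, axis=1)
```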
Loudness is the perceptual counterpart of volume [16]: a perceptual variable that determines our individual perception of volume. It will not be discussed here.
Instrument detection must be grounded in timbre analysis. Physically speaking, timbre is a feature related to the shape of the sound wave. This idea is illustrated in Figure 4 (extracted from [5]), where the sound waves of middle C played on a flute, piano and trumpet are depicted.
Figure 4. Sound waves of middle C played on a flute, piano and trumpet [5].
As can be seen, the same note has a different “color” or “texture” for each instrument. This is also reflected in the spectral content, shown previously in Figure 3. Therefore, timbre has much to do with a signal’s harmonic content. The number of harmonics and overtones present, as well as their intensity, strongly influence the richness of the sound.
Deriving perceptually-inspired features for capturing timbre from an audio signal is not a trivial task. Some physical features, such as harmonicity or centroid, have been suggested and are described below.
Harmonicity
Harmonicity is a physical feature that tries to derive timbre information
from the analysis of harmonics present in an audio signal [23].
Once again, this is a rather complex task for polyphonic music. In [23],
harmonicity is measured as the deviation of the signal’s spectrum from a
perfectly harmonic spectrum. Tzanetakis et al [21]
propose fundamental frequency histograms for the analysis of harmonic content.
Another possible approach could be to measure the number and percentage of
harmonic peaks in the spectrum.
In polyphonic signals, the harmonic content can also be used as a basis for measuring noise levels. The noise content of a music signal can be useful to discriminate between soft and aggressive songs, e.g., ambient music versus heavy metal [22].
Centroid
Centroid is usually used as an indicator of signal brightness, i.e., “the higher frequency content of the signal” [23]. This feature is calculated as the energy-weighted mean of frequencies of the short-time Fourier magnitude spectra.
Bandwidth
This feature is used as a measure of the frequency range of the signal. It is computed as the “magnitude-weighted average of differences between the spectral components and the centroid” [23]. Bandwidth is equivalent to frequency standard deviation.
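A single-frame sketch of both features, following the definitions quoted above (here with magnitude weighting for both), could look as follows:

```python
import numpy as np

# Spectral centroid (brightness) and bandwidth for one frame: the centroid is
# the magnitude-weighted mean frequency; the bandwidth is the magnitude-
# weighted spread of the frequencies around the centroid.

def centroid_and_bandwidth(frame, sr):
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * mag)
                        / (np.sum(mag) + 1e-12))
    return centroid, bandwidth
```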
Low Energy
In measuring the amount of bass in a song, it is useful to use a feature called low energy [21]. This feature consists of calculating the percentage of frames that have less energy than the average energy in all frames.
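Following the description above, this feature reduces to a one-line computation over the per-frame energies:

```python
import numpy as np

# Low-energy feature: percentage of frames whose energy is below the average
# frame energy.

def low_energy(frame_energies):
    frame_energies = np.asarray(frame_energies, dtype=float)
    return 100.0 * np.mean(frame_energies < frame_energies.mean())
```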
A first approach consists of deriving short-term features from raw audio data, as in fundamental frequency estimation methods, for example. Analysis is performed over short time windows, as in STFT analysis. Such features therefore consist of temporal trajectories of meaningful variables. Variations in those trajectories can also be obtained, e.g., by measuring signal derivatives (first differences) [7].
Based on such short-term features, medium-term or long-term features can be obtained. Here, statistical data such as means and standard deviations [23], as well as histograms, are derived. What distinguishes medium-term from long-term features is the size of the window over which the statistical analysis is performed: the time window is wider for long-term features.
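As an illustration, medium- or long-term features can be derived from a short-term feature trajectory as follows; the window size (in frames) is the only parameter that distinguishes the two time scales and is an assumption here.

```python
import numpy as np

# Sketch: mean and standard deviation of a short-term feature trajectory over
# non-overlapping analysis windows. A wider window yields long-term features.

def windowed_stats(trajectory, window_frames):
    trajectory = np.asarray(trajectory, dtype=float)
    n = len(trajectory) // window_frames
    windows = trajectory[:n * window_frames].reshape(n, window_frames)
    return np.column_stack([windows.mean(axis=1), windows.std(axis=1)])
```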
We are now witnessing a strong interest in the problems of searching and classifying audio music signals. Some systems for QBE, e.g., [12; 22; 24], and for music classification and clustering, e.g., [7; 16; 21], have been developed. In these systems, different authors use different subsets of the features described above.
Typically, feature-vectors are constructed with statistical data obtained from features such as cepstral coefficients, energy, ZCR, harmonicity, centroid, bandwidth or tonal transitions.
Regarding classification, most results reported indicate 60 to 80% accuracy in relatively simple problems, where two to seven classes are separated (jazz, pop and classical music are the most common). Namely, Golub [7] reports an average classification accuracy of 77% in a problem involving two highly similar genres, 82% for three highly dissimilar genres and 64% for seven genres with varying levels of similarity. A database of 1714 songs was used, and the classifiers evaluated were the generalized linear model, the multilayer perceptron and k-nearest neighbors.
As for QBE systems, evaluation is usually carried out by counting the average number of similar songs among the first 5, 10 or 20 results in the list. Typically, a set of users is chosen to evaluate the results obtained personally. This is clearly a subjective metric, since similarity is a rather vague concept. In [12], a database of over 8000 songs encompassing several genres is used, and the matches obtained were judged by two users. The results reported indicate that, of the first 5 matches returned by the system, on average 2.5 are similar to the query; for the first 10 and 20, 4.7 and 8.2 are said to be similar, respectively.
As stated before, the analysis carried out above is very subjective. Furthermore, results are not comparable, since many different song databases are used. Regarding classification, empirical supervised learning is normally used to train music classifiers (one exception is [16], where clustering is performed so that similar songs are found close to each other in a self-organizing map). Therefore, training examples must have been labeled previously. However, if we look at the taxonomies used in some music libraries, it is difficult to find any uniformity. In fact, there are many semantic discrepancies among the classes defined, both in horizontal and vertical terms: the number of classes used (horizontal dimension) and the number of subclasses derived from them (vertical dimension) vary widely. Also, the same song appears with different labels in different libraries. These and other problems are analyzed in [15]. As a consequence, it is urgent to define standard test collections and benchmark problems, as well as a uniform taxonomy. This problem is the subject of the Workshop on the “Creation of Standardized Test Collections, Tasks and Metrics for Music Information Retrieval (MIR) and Music Digital Library (MDL) Evaluation”, to be held at the Second Joint Conference on Digital Libraries (JCDL’ 2002).
As for QBH/QBS systems, the bulk of the research deals with databases of MIDI songs, as stated before. QBH/QBS systems for real-world music must be grounded on robust and accurate strategies for polyphonic music analysis, which is still an open problem. In the same way, systems for automatic wave-based music segmentation and summarization are still non-existent.
In conclusion, content-based classification and retrieval of music is a fascinating research problem, with a broad range of possible commercial applications. Nevertheless, we still have ahead many years of hard research before robust and accurate products are available.
EMD - Electronic Music Distribution
QBE - Query-By-Example
QBH - Query-By-Humming
QBS - Query-By-Singing
STFT - Short-Time Fourier Transform
ZCR - Zero-Crossing Rate