
Song Genre Classification through Quantitative Analysis of Lyrics

Doran Walsten
Johns Hopkins University
dwalste1@jhu.edu

Daivik Orth
Johns Hopkins University
dorth4@jhu.edu

Abstract

The goal of this project was to apply a machine learning algorithm to the challenging problem of song genre identification. Specifically, we were interested in learning how well a standard clustering algorithm, K-means, could cluster similar songs into genres. In addition, we attempted to find which of the features we calculated from the text were most valuable in clustering.

1 Introduction

An important concept in machine learning is classification. Classification is loosely defined as labeling an input piece of data with a particular label that identifies that specific input with a subset of all data received. Machine learning is the training of a computer or algorithm to successfully label input with minimal error.

For this project, our inspirations were mobile song identification applications such as SoundHound and Shazam. These applications are able to listen to audio and accurately identify songs. However, we were interested in whether the lyrics of a song contain enough information to classify the song into a particular genre.

Mayer et al. (2008) wrote an article detailing their method for identifying the genre of a song. The three types of features that they analyzed in the text were rhyme scheme, part of speech, and general statistics. These general statistics included average word length, average length of line, words per minute, etc. The novel approach of their paper was to analyze the rhyming scheme to assist with classification. This helped improve classification by a marginal amount. However, we wanted to try a different feature of the text instead: sentiment.

Sentiment analysis in natural language processing is a complex problem. The goal of sentiment analysis is to determine the positivity or negativity of a string based on the words given and the relationships between those words. For a human, it is relatively simple to detect this sentiment. However, for machines it is a bit more challenging. Just as in real life, sentiment depends on context. For example, although the word "funny" is considered positive, "not funny" is considered negative. Thus, an algorithm cannot simply look at the words present in the string; it must first be able to determine which words are related to each other.

2 Machine Learning Methods

In this project, we utilized two main machine learning concepts in order to generate our clusters of genres. These two concepts were K-means clustering and sentiment analysis.

2.1 K-means

K-means is an algorithm that aims to take n observations and cluster them into k clusters, where every observation belongs to the cluster with the nearest mean. Because the problem is computationally NP-hard, we implemented an efficient heuristic algorithm that converges to a locally optimal solution. Given a set of observations x1, x2, ..., xn, each of which is an m-dimensional vector, the goal of K-means is to group these vectors into sets S1, S2, ..., Sk, where k is the number of clusters. The objective function to minimize is the sum, over all points, of the squared Euclidean distance between each point and the centroid of the cluster it is assigned to. This outline of K-means, as well as the pseudocode for the actual implementation, was found on the Stanford Artificial Intelligence course webpage (Piech, 2013).
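In symbols, with mu_j denoting the centroid of cluster S_j, the within-cluster sum-of-squares objective described above can be written as follows (a standard formulation consistent with the description above, not quoted from Piech, 2013):

    \min_{S_1,\dots,S_k} \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2,
    \qquad \mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i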
The first step of the algorithm is to initialize a starting set of clusters by choosing k observations from the data set. K-means works best when the initial clusters are as far apart as possible. To do this, the first cluster center is chosen uniformly at random. Over the next k-1 iterations, the distance to the nearest already-chosen center is computed for each observation. Then, another center is chosen using a weighted probability distribution such that the probability of a particular observation being chosen as the next center is proportional to its squared distance to the nearest center.
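A minimal sketch of this distance-weighted seeding step is shown below. It assumes feature vectors stored as double arrays; the class and helper names (Seeding, distanceSquared, chooseInitialCenters) are ours for illustration and are not the method names in our actual kmeans.java.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Seeding {
    // Squared Euclidean distance between two feature vectors.
    static double distanceSquared(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Choose k initial centers: the first uniformly at random, the rest with
    // probability proportional to squared distance to the nearest chosen center.
    static List<double[]> chooseInitialCenters(List<double[]> data, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(data.get(rng.nextInt(data.size())));
        while (centers.size() < k) {
            double[] weights = new double[data.size()];
            double total = 0.0;
            for (int i = 0; i < data.size(); i++) {
                double nearest = Double.MAX_VALUE;
                for (double[] c : centers) {
                    nearest = Math.min(nearest, distanceSquared(data.get(i), c));
                }
                weights[i] = nearest;
                total += nearest;
            }
            // Sample one observation index according to the weights.
            double r = rng.nextDouble() * total;
            int chosen = weights.length - 1;
            for (int i = 0; i < weights.length; i++) {
                r -= weights[i];
                if (r <= 0) { chosen = i; break; }
            }
            centers.add(data.get(chosen));
        }
        return centers;
    }
}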
Next, we repeat the following assignment and update steps until convergence:

- Assign every observation to the nearest cluster mean based on Euclidean distance.
- Update the new means to be the centroids of the observations in the new clusters.
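The two steps above can be written as a short loop. Again, this is only an illustrative sketch, not the exact code in kmeans.java; it reuses the distanceSquared helper from the Seeding sketch above.

import java.util.List;

public class KMeansLoop {
    // Iterate assignment and update until the assignments stop changing.
    static int[] cluster(List<double[]> data, List<double[]> centers) {
        int[] assignment = new int[data.size()];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: nearest center by Euclidean distance.
            for (int i = 0; i < data.size(); i++) {
                int best = 0;
                for (int j = 1; j < centers.size(); j++) {
                    if (Seeding.distanceSquared(data.get(i), centers.get(j))
                            < Seeding.distanceSquared(data.get(i), centers.get(best))) {
                        best = j;
                    }
                }
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            // Update step: each center becomes the mean of its assigned points.
            for (int j = 0; j < centers.size(); j++) {
                double[] mean = new double[data.get(0).length];
                int count = 0;
                for (int i = 0; i < data.size(); i++) {
                    if (assignment[i] == j) {
                        for (int d = 0; d < mean.length; d++) mean[d] += data.get(i)[d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < mean.length; d++) mean[d] /= count;
                    centers.set(j, mean);
                }
            }
        }
        return assignment;
    }
}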
We chose to use K-Means primarily because we wanted to determine whether or not artists could be grouped into genres by clustering the artists' songs without genre labels. A supervised approach would also have required us to generate our own labels, which is both time consuming and restricts the potential genres a song could be a member of. We felt that K-Means would be a good starting point for unsupervised learning/classification for a few reasons. Firstly, K-Means is relatively sensitive to outliers and noise in the data, which helped in determining what set of features and artists allowed for the best training and test accuracies. If we were to continue this project, we would take our K-Means implementation and use it to initialize Expectation Maximization with a Gaussian Mixture.

2.2 Sentiment Analysis

For sentiment analysis, we used a package developed by the Stanford NLP Group (see http://nlp.stanford.edu/software/corenlp.shtml for more information about the Stanford CoreNLP package). The backbone of the sentiment analysis package is the Stanford Sentiment Treebank and the Recursive Neural Tensor Network.

The Treebank is composed of hundreds of thousands of phrases labeled by human judges with certain levels of positivity or negativity (Socher, 2013). These phrases were generated by the Stanford parser, an algorithm that identifies the groups of words that make up phrases in a line of text. These phrases can then be organized into a tree that shows the relations between different parts of a sentence or paragraph (The Stanford NLP Group).

The Recursive Neural Tensor Network is used to predict the sentiment of a phrase by recursively iterating through the nodes of the phrase tree and computing the output of a composition function for the two children of each node (Socher, 2013). In this way, the network integrates the features of individual phrases at increasing scales to predict the overall sentiment at the top of the tree.
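As an illustration of how a per-line sentiment label can be obtained, the sketch below runs the CoreNLP sentiment annotator over a single lyric line. This is a simplified sketch based on the publicly documented CoreNLP pipeline API (the annotation key names can differ between CoreNLP releases), not a copy of our StringParameterExtractor code.

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class LineSentiment {
    private final StanfordCoreNLP pipeline;

    public LineSentiment() {
        Properties props = new Properties();
        // Sentiment requires tokenization, sentence splitting, and parsing first.
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        pipeline = new StanfordCoreNLP(props);
    }

    // Returns a label such as "Negative", "Neutral", or "Positive" for one lyric line.
    public String labelForLine(String line) {
        Annotation annotation = new Annotation(line);
        pipeline.annotate(annotation);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            return sentence.get(SentimentCoreAnnotations.SentimentClass.class);
        }
        return "Neutral"; // empty line
    }
}

Counting how many lines of a song receive each label and dividing by the number of lines yields the three sentiment-percentage features described in section 3.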
3 Work Completed

The first step completed in this project was data collection. Copyright laws make it difficult to collect data about songs, including their lyrics, so we wanted to find a database that had this information readily accessible. The first database we looked into was one entitled the Million Song Dataset. This database was created by faculty members at Columbia University to make more of this data available to researchers in academia. They proposed a bag-of-words approach to storing lyric information: they generated a set of lemmas of common words and then determined the frequency of these words in a song. In addition, this dataset included other important features of songs, including duration. While we were working on the proposal, this seemed like a great resource. However, we realized that the database would only be helpful for a couple of our features, and it had actually been poorly maintained and was missing the majority of its data. We had no choice but to abandon this database. This was frustrating because we lost a lot of time trying to see if it could work for our project.

We turned to LyricWikia for help. This site stores lyrics online for almost every song imaginable. Some very friendly people on the Internet posted an outline of the Javascript necessary to extract lyrics from this website. The approach was to first perform an HTTP request of the site to get URL information for our song of interest. Then, we used a website parsing package, Jsoup (see http://jsoup.org for additional information about the Jsoup package), to actually extract the lyrics from the web page. This approach worked very well, and was performed in the LyricsExtractor class we wrote.
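A condensed sketch of this kind of extraction is shown below. The URL pattern and the "lyricbox" selector are assumptions about how LyricWikia structured its pages at the time (not guaranteed to match the site today), and the class and method names are ours rather than the actual API of our LyricsExtractor class.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LyricsFetcher {
    // Fetch the lyrics page for one song and pull out the lyric text.
    public static String fetchLyrics(String artist, String title) throws IOException {
        // Assumed URL pattern for LyricWikia song pages.
        String url = "http://lyrics.wikia.com/wiki/"
                + artist.replace(' ', '_') + ":" + title.replace(' ', '_');
        Document page = Jsoup.connect(url).get();
        // Assumed container element holding the lyrics on the page.
        Element box = page.select("div.lyricbox").first();
        return box == null ? "" : box.text();
    }
}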
Another challenge in data collection was storing the song names to be analyzed. Unfortunately, LyricWikia did not have the best interface for making requests, and we were forced to store songs in individual text files for individual artists. LyricWikia does not store songs by genre, only by artist, so we had to associate each song with its artist. A further problem was getting a list of songs for a given artist quickly. Luckily, the website www.songfacts.com had a list of songs for the majority of artists we were interested in. We used these lists to generate our text files.

Next, we computed all of the features for a particular song in Java. Every feature was computed by iterating over strings derived from the original song text. This was done through basic string operations in StringParameterExtractor.java. This class also contained calls to methods in the Stanford NLP package to compute the sentiment. Here is a more detailed description of the features we extracted (a sketch of how the count-based features can be computed appears after the list):
- Type/Token Ratio: the ratio of unique words to the total number of words in the song
- Unique words per line
- Mean word length: the average number of characters in a word
- Words per line
- Sentiment Percentages: a 3-dimensional vector with each index representing the percentage of lines that are negative, neutral, and positive in the song
- Total number of words in the song (a basic indicator of song duration)
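The sketch below shows one way the simple count-based features could be computed from the raw lyric text. It is illustrative only and does not reproduce StringParameterExtractor.java; the sentiment percentages would come from the CoreNLP pipeline sketched in section 2.2.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CountFeatures {
    // Compute [type/token ratio, unique words per line, mean word length,
    // words per line, total words] from the raw lyric text.
    public static double[] compute(String lyrics) {
        String[] lines = lyrics.toLowerCase().split("\n");
        Set<String> unique = new HashSet<>();
        int totalWords = 0;
        int totalChars = 0;
        int uniquePerLineSum = 0;
        int nonEmptyLines = 0;
        for (String line : lines) {
            String[] words = line.trim().split("\\s+");
            if (words.length == 1 && words[0].isEmpty()) continue;
            nonEmptyLines++;
            Set<String> uniqueInLine = new HashSet<>(Arrays.asList(words));
            uniquePerLineSum += uniqueInLine.size();
            for (String w : words) {
                totalWords++;
                totalChars += w.length();
                unique.add(w);
            }
        }
        double typeToken = totalWords == 0 ? 0 : (double) unique.size() / totalWords;
        double uniqueWpl = nonEmptyLines == 0 ? 0 : (double) uniquePerLineSum / nonEmptyLines;
        double meanWordLen = totalWords == 0 ? 0 : (double) totalChars / totalWords;
        double wordsPerLine = nonEmptyLines == 0 ? 0 : (double) totalWords / nonEmptyLines;
        return new double[] { typeToken, uniqueWpl, meanWordLen, wordsPerLine, totalWords };
    }
}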
We implemented K-means with the initialization algorithm described in section 2. This was a good choice: because of this initialization step, our algorithm converged within 10 iterations in every case we checked. However, we learned that getting K-means to work was the easy part. Understanding the output and deciding how to run the algorithm on our data was another challenge entirely.

We wrote kmeans.java to perform all operations associated with K-means on input data. Then, we constructed a kmeans object in GenreClustering.java to actually run K-means on our data. We attempted to cluster our data in many different ways. Unfortunately, none were very successful. Early in the project, we did not include the total-number-of-words feature, so we were working with 7 features instead of 8. To get our training set, we chose two artists from each genre in our collection of artist text files. Our genres were Classic Rock, Country, Emo, Grunge, Modern Rock, Pop, R&B, and Rap. For example, we chose Nirvana and Soundgarden for Grunge and The Beatles and Led Zeppelin for Classic Rock. These files contained between 50 and 100 songs, and we combined all of the feature vectors from all of the songs into one giant ArrayList of ArrayLists in Java. We then ran K-means over this data, assuming 8 clusters.

We immediately began to notice problems with this approach. We were getting horrible training accuracy; in fact, this plagued us throughout the project. We found this accuracy by determining which cluster center each of our input songs was assigned to. We had hoped that a majority of the songs would share similar cluster centers, but they were all over the board. In most cases, there were 3 to 4 cluster centers included in each genre's training data. The one exception was Eminem and some of the other rap artists, who were consistently isolated. We believe this was due to their high words-per-line feature. However, we quite enjoyed seeing that country songs were included with rap songs because they too had high words per line. We decided that we wanted to try a different approach to clustering that would hopefully lead to better results.

We noticed that the variance of the features between songs for one artist was very high. For example, see Table 1 in Appendix A for statistics on a subset of Beyonce's songs. There is significant variability between songs, and we thought this was leading to some of the problems with our clustering. Artists like Beyonce are diverse and write about a variety of material. Our proposed solution was to take the average feature vector of 20 songs of a particular artist, collect 7 to 10 of these averaged vectors for artists in a particular genre, and then cluster the resulting vectors. Our argument for this approach is that artists as a whole can be considered part of a genre, so we should consider an averaged sample of their work.

Unfortunately, this approach still had some issues. Rappers and Country artists were still clustered together, and our training vectors within a predicted genre were still assigned to a variety of different clusters. However, it was better than the first go-around of K-means: because the variance in our initial predicted clusters had decreased, K-means did not have to deal with data that was as noisy while it ran. See Table 2 in Appendix A for a summary of statistics on the Pop training data.
The next step we took proved to be a great example of what not to do with K-means. We were hoping to add one more feature to our current 7, and we decided to try the total number of words, which was hopefully going to be a good indicator of song length. When we added this feature, we started to get very consistent results for our training data, so we decided to actually try out some test cases. However, we soon realized that there was a problem. When we started to remove features to determine which ones had more importance, the output clusters of K-means did not change until we removed the total-words feature. In the rush to finish the project, we had forgotten the fundamentals of Euclidean distance.

Up to this point, we had not normalized any of the features. However, the total number of words was on the scale of hundreds, while every other feature was either less than one or between 0 and 10. As a result, by not normalizing the total number of words, we placed a large bias on that feature. While this led to clustering that was not terrible, it was effectively clustering along a number line, which was not our original goal.

To compensate for this, we normalized the data using rescaling and reran K-means. This produced the results described below.
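Rescaling here means mapping each feature independently onto [0, 1] using its minimum and maximum over the data. A small sketch of that transformation is given below; our actual implementation may have differed in details such as how constant features are handled.

import java.util.List;

public class Rescaler {
    // Rescale each feature (column) of the data to the range [0, 1]
    // using its observed minimum and maximum.
    public static void rescaleInPlace(List<double[]> data) {
        int dims = data.get(0).length;
        for (int d = 0; d < dims; d++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] v : data) {
                min = Math.min(min, v[d]);
                max = Math.max(max, v[d]);
            }
            double range = max - min;
            for (double[] v : data) {
                // Leave constant features at 0 to avoid division by zero.
                v[d] = range == 0 ? 0.0 : (v[d] - min) / range;
            }
        }
    }
}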
4 Results

One of the first tests we completed was to check that our K-means algorithm was initialized properly and working. To do so, we created some artificial data. These vectors were in 3-space and formed clusters of 20 points surrounding (0, 0, 0), (10, 10, 10), and (20, 20, 20). We added a small amount of random noise to the points to finish our preparation. When we ran our method over this data, it converged in two iterations, and the cluster centroids were accurate: the three centroids returned were (0.5296, 0.4936, 0.5499), (10.5488, 10.52336, 10.6080), and (20.4972, 20.4777, 20.6394). With these results, we were confident that our method was working and decided to try it out on our lyrics data.
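The artificial data can be generated with a few lines of code; the sketch below shows the kind of setup we mean (the noise magnitude and the class and method names are illustrative, not the exact values or code we used). With uniform noise in [0, 1), the recovered centroids would be expected to sit roughly 0.5 above each true center, which is consistent with the centroids reported above.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SyntheticData {
    // Three clusters of 20 points each around (0,0,0), (10,10,10), (20,20,20),
    // with small uniform noise added to every coordinate.
    public static List<double[]> generate(Random rng) {
        double[][] trueCenters = { {0, 0, 0}, {10, 10, 10}, {20, 20, 20} };
        List<double[]> points = new ArrayList<>();
        for (double[] center : trueCenters) {
            for (int i = 0; i < 20; i++) {
                double[] p = new double[3];
                for (int d = 0; d < 3; d++) {
                    p[d] = center[d] + rng.nextDouble(); // noise in [0, 1)
                }
                points.add(p);
            }
        }
        return points;
    }
}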
To complete testing of our K-means algorithm, we focused on the results produced by the last round of data, in which we rescaled the data between 0 and 1. We first trained a K-means object with centroids for the number of clusters we were interested in. After running K-means with a wide variety of cluster counts and examining the results, we settled on 10 clusters. Many artists or songs cannot be defined specifically as just one of the genres we included in our project; it is highly likely that there will be songs that share features with multiple genres, and having additional clusters takes this possibility into account.

Even with rescaling, there was still a lot of variance in the assignment of training samples within a particular genre. We believe that rescaling brought the points too close to one another, and finding 10 clusters among nearby points is difficult. Our testing protocol was as follows:

- Find which two clusters were most frequently assigned to training samples in a particular genre.
- Find the cluster assignments for songs in the test files for each genre.
- Determine the percentage of these test songs labeled with either cluster.
- After running once with all features given to K-means, remove one feature and repeat the above steps.

This final step allowed us to get a rough idea of the importance of different features for the clustering of a specific genre. We chose not to mix multiple genres together in the test file for a couple of reasons. First, labeling every song would have been a challenge for us. Second, we designed our K-means to account for the fact that a genre might have songs that are included in more than one cluster. Third, we were more interested in seeing how well new samples in what we consider the same genre agree with the training samples. Although we did not use the traditional method to determine the effectiveness of our algorithm, we still uncovered some interesting results. See Table 3 below for a summary of the accuracies we observed for various genres and exclusions of features.
Genre | All Features Included | Type/Token | Unique WPL | Mean WL | Words Per Line | %Negative | %Neutral | %Positive | Total # Words
Classic Rock | 13.33 | 26.67 | 12 | 12 | 14.67 | 12 | 21.33 | 21.33 | 17.33
Country | 11.27 | 32.39 | 12.68 | 23.94 | 11.27 | 12.68 | 12.68 | 35.21 | 14.08
Emo | 9.84 | 26.23 | 24.59 | 16.39 | 22.95 | 8.2 | 18.03 | 13.11 | 8.2
Grunge | 15.48 | 28.57 | 25 | 42.86 | 35.71 | 40.48 | 23.81 | 26.19 | 13.1
Modern Rock | 25.76 | 19.7 | 19.7 | 27.27 | 21.21 | 7.58 | 25.76 | 12.12 | 13.64
Pop | 11.94 | 8.96 | 13.43 | 28.36 | 29.85 | 22.39 | 17.91 | 16.42 | 23.88
R&B | 19.15 | 12.77 | 21.28 | 21.28 | 21.28 | 17.02 | 14.89 | 21.28 | 4.26
Rap | 16.07 | 23.21 | 58.93 | 23.21 | 17.86 | 23.21 | 60.71 | 21.43 | 25

Table 3: Percent inclusion for test songs in each genre of interest across all test cases. The first data column gives the percentage with all features included; each remaining column gives the percentage when the named feature is removed.

Neutral sentiment seems to be a generally bad feature for genre clustering. For every genre except R&B, there was a relatively significant increase in accuracy when the neutral sentiment feature was removed. Based upon further inspection of the cluster means, this makes sense because there is relatively little variance in neutral sentiment between genre types. This behavior is also explained by the methods used by Stanford to develop the package. In their article, they noticed that for strings of shorter length, the majority of the supervised labels that judges gave were neutral (Socher, 2013); it is difficult to determine the sentiment of only a couple of words. For the majority of genres tested, the average number of words per line is between 5 and 6 words. On the other hand, some songs, such as rap songs, have close to 10 words per line. This means that there is more variance in the potential positive or negative labels for the lines in a rap song than in others, which is probably the reason that both Country and Rap have the lowest percentage of neutral lines. We could have made this a more global trend by determining the sentiment of stanzas instead of individual lines, which would have further improved the sentiment percentages as an informative feature.

Mean word length also proved to be a feature that only seemed to add noise. While we thought it was plausible that different genres would use varying levels of vocabulary, mean word length is simply not a good measure of this: because of the abundance of words like "the", "as", and "and", this feature did not vary significantly between different songs. A more intelligently designed feature that would more accurately reflect the difference in vocabulary between genres would be to average the length of the longest 10 words in each song.

Although mean word length seemed to be a universally useless feature, there were many features that were highly informative for certain genres but not at all in others. For example, negative sentiment produced different results for Grunge and Modern Rock. When negative sentiment was removed for Grunge, accuracy improved from 15% to 40%. In Modern Rock, when negative sentiment was removed, accuracy dropped from 25% to 7.58%, a dramatic decrease. The fact that certain features improved the accuracy for certain genres but reduced the accuracy in other genres may indicate that binary or ternary classification would be a better starting point for genre classification. One way to do this would be to run an SVM or logistic regression and find a separating hyper-plane between songs in and not in a particular genre. Of course, this would require labeling our training examples.

There are some interesting trends in the accuracy data when you focus on the changes in accuracy as sentiment parameters are removed. For example, the accuracy of Rap increases up to 60%. This change can be understood in the context of the average sentiment values for Rap compared to the other genres (see Table 4 for values). Rap has the largest gap between negative and positive percentages, at 21 percentage points. This means that when the neutral feature is taken away, feature vectors of this form stand out. Another interesting case is when the negative sentiment feature is removed: the percent inclusion for Emo and Modern Rock decreases a lot, while Grunge increases to 40%. As shown in Table 4, these three genres are very similar in regard to neutral and positive sentiment. It is highly possible that the clustering algorithm found a centroid that is unable to discern between Grunge and these other two.
Genre | %Negative | %Neutral | %Positive
Classic Rock | 0.28 | 0.54 | 0.18
Country | 0.29 | 0.49 | 0.22
Emo | 0.31 | 0.53 | 0.16
Grunge | 0.23 | 0.59 | 0.18
Modern Rock | 0.26 | 0.57 | 0.17
Pop | 0.3 | 0.5 | 0.2
R&B | 0.27 | 0.51 | 0.22
Rap | 0.37 | 0.48 | 0.16

Table 4: Mean sentiment percentage values for each genre of interest

Sentiment plays a heightened role in the rescaled version of the feature vectors because every feature is now found between 0 and 1. Before, many of the other features were much larger than sentiment, which meant that sentiment did not carry as much weight in the Euclidean distance calculations.
5 Proposal Commentary

The goals and milestones described in our proposal ended up being a little too lofty for us to achieve. However, we are still proud of our efforts. We were able to complete our first four milestones, which included:

- Extract the data we needed from the internet.
- Choose features based on the Mayer article.
- Use ArrayLists of ArrayLists as our data structure of choice.
- Implement and test our K-means algorithm.
- Analyze our results in such a way that we have clear next steps if we were to continue the project.

However, there were a few bumps along the way that inhibited us. First and foremost was data collection. Because we were building our data set from scratch, this process took a long time. In addition, every time we wanted to make an adjustment to our features, we had to re-compute the features over every song. Doing a project where the data has already been accumulated would definitely have allowed us to focus more time on improving and learning more about our algorithm.

As suggested in our proposal, one way that we could dramatically improve the output of our clustering is to couple K-means with Expectation-Maximization over a Gaussian Mixture. In K-means we make the implicit assumption that the clusters are spherically shaped. Of course, this is a naive assumption: in 2-D space, there might be two very thin oval-shaped clusters oriented in such a way that the outer edges of one cluster are significantly closer to the mean of the adjacent cluster. By applying E-M to a GMM, or another probability distribution, one can find clusters of varying shapes and distributions.

One of the last goals in our proposal was to run K-Nearest Neighbors and compare it to K-means. At that time, we believed that using an unsupervised approach might work better. After the challenges we had with K-means, we are now convinced that a supervised approach to genre identification would be better. In the Mayer article, they used both K-Nearest Neighbors and SVMs to attempt to classify songs by genre. However, even with supervised learning and sometimes thousands of features, their test accuracy only occasionally broke 25%. In our proposal, we said we were hoping for 70% test accuracy; this was based on experience we had in class rather than evidence demonstrated in papers on the subject.

References

Antonio Castro and Brian Lindauer. 2012. Author Identification on Twitter. Retrieved from http://cs229.stanford.edu/proj2012/CastroLindauer-AuthorIdentificationOnTwitter.pdf

Rudolf Mayer, Robert Neumayer, and Andreas Rauber. 2008. Rhyme and Style Features for Musical Genre Classification by Song Lyrics. Retrieved from http://www.ismir2008.ismir.net/papers/ISMIR2008_235.pdf

Chris Piech. 2013. K Means. Retrieved from http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

Richard Socher, et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Retrieved from http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

The Stanford NLP Group. The Stanford Parser: A statistical parser. Retrieved from http://nlp.stanford.edu/software/lex-parser.shtml
Appendix A: Feature Data

Statistic | Type/Token | Unique WPL | Mean WL | Words Per Line | %Negative | %Neutral | %Positive | Total # Words
Mean | 0.31 | 2 | 3.7 | 6.81 | 0.35 | 0.44 | 0.22 | 393.2
Variance | 0.065 | 0.519 | 0.17 | 1.777 | 0.162 | 0.158 | 0.139 | 178.276
Min | 0.18 | 1.23 | 3.44 | 3.48 | 0.13 | 0.19 | 0.06 | 161
Max | 0.4 | 3.16 | 4.06 | 9.66 | 0.72 | 0.76 | 0.47 | 835
Range | 0.22 | 1.93 | 0.62 | 6.18 | 0.59 | 0.57 | 0.41 | 674

Table 1: Statistics on a sample of Beyonce feature vectors; note the variance and range.

Statistic | Type/Token | Unique WPL | Mean WL | Words Per Line | %Negative | %Neutral | %Positive | Total # Words
Mean | 0.32 | 1.92 | 3.72 | 6.06 | 0.3 | 0.5 | 0.2 | 356.45
Variance | 0.027 | 0.243 | 0.098 | 0.372 | 0.027 | 0.056 | 0.038 | 39.66
Min | 0.28 | 1.67 | 3.59 | 5.52 | 0.25 | 0.38 | 0.16 | 297.22
Max | 0.37 | 2.34 | 3.9 | 6.58 | 0.34 | 0.55 | 0.28 | 409.35
Range | 0.09 | 0.67 | 0.31 | 1.06 | 0.09 | 0.17 | 0.12 | 112.13

Table 2: Statistics on the collection of averaged Pop artist feature vectors.
