Music Mood

Posted by: Chris, Sept. 11, 2020, 11:36 p.m.


  • Language
    • Python
  • Package
    • Pandas
    • Numpy
    • Matplotlib
    • Scikit
  • API
  • Web
    • Scrapy
    • Beautifulsoup

Brief Description

 Repo Link

Music Mood is a project that started as a school assignment. The prompt was to do some analysis on data. I immediately loved the idea and chose to analyse music. At the time I was in love with angry music, so I decided to investigate that idea a bit.

Since music can make me feel some type of way, I thought it would be a good idea to make music the focus of my analysis. I needed a lot of data to analyse, and I did not want to grab a dataset that was already cleaned and integrated. Thanks to several resources available through APIs, I was able to make my own dataset with some pretty cool data on music. 


Analysing music tracks takes a ton of information. Data was gathered using web scraping and streaming APIs. Then the music track data was analysed with Pandas. Finally, the music tracks were classified by genre based on various features.

Data Gathering

Sometimes data is only available through the front end of a website. In those cases, Scrapy comes in handy. You build a spider to crawl the site, and it extracts the information you want for analysis. That data can then be formatted into a CSV file and used with Pandas. For example, this is one of the pages whose table was scraped for data on Eminem's albums.
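The extraction step itself can be sketched without Scrapy. This is a minimal illustration that uses only Python's standard library as a stand-in for a spider: it pulls the cells out of an HTML table and writes them as CSV text. The `TableParser` class, the `table_to_csv` helper and the sample HTML are all made up for the example; in a real spider the rows would come from Scrapy's selectors instead.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up HTML standing in for a scraped album table.
SAMPLE_HTML = """
<table>
  <tr><th>Album</th><th>Year</th></tr>
  <tr><td>Example Album A</td><td>2001</td></tr>
  <tr><td>Example Album B</td><td>2005</td></tr>
</table>
"""


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_HTML)
print(csv_text)
```

The resulting text can be saved to a file and loaded with `pandas.read_csv`.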

The majority of the data comes from the Spotify and Genius APIs. The Genius API offers track lyrics, the number of views for a track's lyric page, and other features. The Spotify API offers some really neat features like how danceable a song is, or how much energy the song has.

Since the Genius API and Spotify API offer different data on any given track, I needed to combine them into one large dataset. That meant cleaning the data from both APIs so that everything known about a track ended up in a single record, and then writing the integrated records out to one CSV file.
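A minimal sketch of how such an integration could look in Pandas. The column names and rows are invented for illustration and are not the real API fields. The assumption here is that a track can be matched across sources by artist and title once whitespace and capitalisation are normalised.

```python
import pandas as pd

# Made-up rows; column names are assumptions, not the real API fields.
genius = pd.DataFrame({
    "title": ["Song A ", "song b", "Song C"],
    "artist": ["Artist X", "Artist Y", "Artist Z"],
    "page_views": [120000, 45000, 8000],
})
spotify = pd.DataFrame({
    "title": ["song a", "Song B", "Song D"],
    "artist": ["artist x", "Artist Y ", "Artist W"],
    "danceability": [0.71, 0.42, 0.55],
    "energy": [0.80, 0.95, 0.60],
})


def add_key(df):
    """Add a normalised join key so small formatting differences still match."""
    out = df.copy()
    out["key"] = (
        out["artist"].str.strip().str.lower()
        + "|"
        + out["title"].str.strip().str.lower()
    )
    return out


# Inner join: keep only tracks that appear in both sources.
tracks = add_key(genius).merge(
    add_key(spotify)[["key", "danceability", "energy"]],
    on="key",
    how="inner",
).drop(columns="key")

csv_text = tracks.to_csv(index=False)  # pass a path instead to write a file
print(tracks)
```

An inner join drops tracks that only one source knows about. Switching to `how="outer"` would keep them, with missing values to clean up afterwards.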

For a full list of features, visit the repository README. For the full dataset, go here


Once the data was gathered, the analysis began. Statistics such as the mean and normalized totals, along with plots, were used to look for correlations between features of a music track. One way to spot correlations is a heatmap, like in figure 1.

Figure 1: Heatmap of feature correlations. The redder a feature pair, the more correlated the two features are.
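A correlation heatmap like this can be sketched with Pandas and Matplotlib. The data below is randomly generated and the feature names are placeholders, with `loudness` deliberately built to follow `energy` so that at least one pair shows up as strongly correlated. This is an illustration of the technique, not the code behind the figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the track features.
rng = np.random.default_rng(0)
n = 200
energy = rng.random(n)
features = pd.DataFrame({
    "energy": energy,
    "loudness": energy * 0.8 + rng.normal(0, 0.1, n),  # built to track energy
    "danceability": rng.random(n),
    "word_count": rng.integers(50, 800, n),
})

# Pairwise Pearson correlation between all numeric columns.
corr = features.corr()

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
# Call fig.savefig("some_name.png") or plt.show() to view the result.
```

Fixing the colour range to -1..1 keeps the colours comparable between plots, so a strong red always means the same thing.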

Aside from correlation matrices, some basic analysis was done on each genre within the dataset, such as averaging the features for each genre. The music genres present in the dataset are Country, Metal, Pop, Rock and Hip Hop.

Figure 2: Metal is the least danceable genre, which makes sense. Unless you consider fighting a dance.

Figure 3: Metal has the fewest words, but most energy.
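Per-genre averages like these come down to a group-by in Pandas. A minimal sketch, with invented numbers and assumed column names that only mirror the shape of the findings:

```python
import pandas as pd

# Made-up tracks; the values are illustrative, not taken from the dataset.
tracks = pd.DataFrame({
    "genre": ["Metal", "Metal", "Pop", "Pop", "Country", "Country"],
    "danceability": [0.30, 0.40, 0.70, 0.80, 0.55, 0.65],
    "energy": [0.95, 0.90, 0.65, 0.70, 0.50, 0.60],
    "word_count": [150, 210, 380, 420, 300, 340],
})

# Average every feature within each genre.
feature_columns = ["danceability", "energy", "word_count"]
genre_means = tracks.groupby("genre")[feature_columns].mean()

# Genre with the lowest average danceability.
least_danceable = genre_means["danceability"].idxmin()

print(genre_means.sort_values("danceability"))
print("least danceable:", least_danceable)
```

Calling `genre_means.plot.bar()` on the result gives one bar group per genre, which is the usual starting point for charts like figures 2 and 3.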

Then, using a sentiment extractor, each track was analysed for its sentiment. I got the idea of extracting sentiment from this post, which uses an NLTK model trained on social media posts. The results aren't exactly perfect, but they seemed to be good enough.

Figure 4: The negative and positive scores come from the compound score. Metal is more negative than positive. The compound score is a good metric for the sentiment in a song. The lower the bar, the more negative the genre is.
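One way positive and negative scores could be derived from a compound score is sketched below. It assumes each track already has a compound value between -1 and 1, which is the range that sentiment tools such as NLTK's VADER analyzer report. The split itself and the numbers are hypothetical and may differ from how the figure was produced.

```python
import pandas as pd

# Made-up compound scores per track.
scores = pd.DataFrame({
    "genre": ["Metal", "Metal", "Metal", "Pop", "Pop", "Pop"],
    "compound": [-0.8, -0.4, 0.3, 0.6, 0.2, -0.1],
})

# Positive part keeps scores above zero, negative part keeps scores below zero.
scores["positive"] = scores["compound"].clip(lower=0)
scores["negative"] = scores["compound"].clip(upper=0)

# Average each part per genre; positive + negative equals the mean compound.
by_genre = scores.groupby("genre")[["compound", "positive", "negative"]].mean()
print(by_genre)
```

With this split, a genre whose negative bar outweighs its positive bar has a mean compound score below zero.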

Machine Learning

Another component of this project was an attempt at classifying the genre of a music track using the various features collected. The features used in this classification process were not very good. A test accuracy of 60% was achieved with the K-Nearest Neighbors classifier. Suffice it to say, a 60% prediction rate is not very good.
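A minimal sketch of genre classification with K-Nearest Neighbors in scikit-learn. The features are synthetic, the three genres and their centers are invented, and the scaling step and `n_neighbors=5` are assumptions for the example rather than the settings used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
genres = ["Country", "Metal", "Pop"]

# Synthetic (danceability, energy, word_count) centers, one per genre.
centers = np.array([
    [0.60, 0.50, 320.0],
    [0.35, 0.90, 180.0],
    [0.75, 0.70, 400.0],
])
spread = np.array([0.1, 0.1, 80.0])

# 100 noisy tracks around each center, labelled with the matching genre.
X = np.vstack([rng.normal(center, spread, size=(100, 3)) for center in centers])
y = np.repeat(genres, 100)

# Hold out a quarter of the tracks to measure accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale first: KNN is distance based, so word_count would otherwise dominate.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Synthetic clusters like these are far cleaner than real tracks, so the accuracy printed here says nothing about what to expect on real music data.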

Further Improvements

Since this project was done during school, there was a deadline to meet. I would have liked to go more in depth into some interesting genre qualities. For instance, why would anyone listen to metal if it's sad and angry all the time? Scraping social media posts on metal bands would be my go-to for next time.

The machine learning portion of the project was rushed because of the deadline. I believe that with more time, I could have gotten more data and features. More features and data mean a better chance at correctly predicting the genre. Trying methods other than K-Nearest Neighbors might also be useful.

I was also in a group while working on this project, and I couldn't go too far off topic since it was tied to a grade. On my own, I would probably take more liberties and try more methods, with nicer looking charts and plots. There are a ton of things I would love to try, but being in a group whose grades depended on how the project turned out weighed on my creative freedom.