Building a Content-Based Song Recommendation Website in Python from Spotify’s Songs Data

Meysam Raz
13 min read · Sep 20, 2023


From Extracting Track’s Data from Spotify’s API to Web Deployment

Project Demo. Image by author

Introduction

This project came to mind when I was desperately searching for an addition to my portfolio. While listening to Spotify's weekly recommendations, I noticed that some of the suggested songs were far from my preferred taste in music! That's when the idea struck me – what could be a better project than building a custom song recommendation system? Not only would it allow me to discover songs that truly suit my preferences, but it would also showcase my skills as a data scientist.

In this article, I am thrilled to share the journey of creating this data science portfolio project. Throughout this adventure, you’ll have the opportunity to explore the following key aspects:

  • The process of doing a portfolio data science project
  • Working with Spotify’s API
  • Building a content-based recommendation system
  • Building and deploying a website entirely in Python

Live demo. Image by author

Where to get data?

Like in all my data science portfolio projects, the first hurdle I faced after understanding the project was data collection. As usual, I began by exploring online data platforms like Kaggle and Google Dataset Search to find relevant datasets. Although I came across some related datasets, they presented significant challenges. The primary issue was that they hadn’t been updated with the latest songs, rendering them unsuitable for my needs. Additionally, the datasets were too small, and some crucial features required for powering my recommendation engine were missing.

So, I searched for an API that would let me extract Spotify’s track details directly or indirectly, and fortunately, I came across Spotify’s own Web API, which enabled me to retrieve all the necessary track and artist details — exactly what I needed! To my surprise, it proved to be remarkably user-friendly and well-documented as I delved into Spotify’s Web API documentation.

Getting started with Spotify’s API

To begin working with Spotify’s API, you need to visit the Spotify developers’ website and create an account if you don’t have one (there’s no distinction between a regular Spotify account and a Spotify developers’ account; you can use your existing Spotify account to log in). Once you’re logged in, navigate to the dashboard and create an app to obtain your `client id` and `client secret`.

Creating an app steps. Image by author

Pulling data from Spotify’s API

With my app set up and the client id and client secret in hand, it's time to begin collecting track data. Given the vast number of songs available on Spotify, spanning various languages and years, attempting to collect all of them would be a daunting task. Therefore, I made the decision to focus on a specific subset: 1,000 English songs from each year between 2017 and 2022, resulting in a total of 6,000 songs.

# pip install spotipy
import pandas as pd  # needed later to build the dataframes
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

You can utilize Spotify’s API in two ways in Python: you can make direct requests, or you can use Spotipy, which is a lightweight Python library for the Spotify Web API. I opted for Spotipy as it provides full access to all the music data offered by the Spotify platform.
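For the curious, here is roughly what the direct-request route looks like: a minimal sketch of the client-credentials flow with the requests library, using the endpoints from Spotify’s Web API documentation (this is not the code I used, just an illustration of what Spotipy wraps for you):

# Minimal sketch of calling the Web API directly (what Spotipy wraps)
import requests

# Exchange client credentials for a short-lived bearer token
token_resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"),  # placeholders
)
access_token = token_resp.json()["access_token"]

# Search for tracks released in 2022, 50 at a time
search_resp = requests.get(
    "https://api.spotify.com/v1/search",
    headers={"Authorization": f"Bearer {access_token}"},
    params={"q": "year:2022", "type": "track", "limit": 50},
)
tracks = search_resp.json()["tracks"]["items"]
print(tracks[0]["name"])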

# setup API
cid = 'YOUR_CLIENT_ID'
secret = 'YOUR_CLIENT_SECRET'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
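Assuming the credentials are valid, a quick sanity check like this should print a track name:

# Quick sanity check: the credentials work if this prints a song
results = sp.search(q='year:2022', type='track', limit=1)
print(results['tracks']['items'][0]['name'])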

Here is what I’ve collected from Spotify’s tracks API:

  • id : the track’s Spotify id
  • song name : name of the song
  • artist name : name of the artist who sings the song
  • artist genres : genres associated with the artist
  • album genres : genres of the song’s album
  • release_date : release date of the song
  • preview_url : link to a 30-second audio preview of the song
  • song link : link to the song on Spotify
  • image : cover image of the song
  • song duration : song duration in milliseconds
  • song popularity : song popularity on Spotify

I looped over the years 2017 to 2022, collected the first 1,000 tracks returned for each year by Spotify’s search API, and then placed them into a dataframe.

songs_list = []

years = ['2017', '2018', '2019', '2020', '2021', '2022']

for year in years:
    # Search results are paginated, so step through them 50 tracks at a time
    for offset in range(0, 1000, 50):
        track_results = sp.search(q='year:' + year, type='track', limit=50, offset=offset)
        for t in track_results['tracks']['items']:
            artist = sp.artist(t["artists"][0]["external_urls"]["spotify"])
            album = sp.album(t["album"]["external_urls"]["spotify"])
            songs_data = {
                'id': t['id'],
                'song name': t['name'],
                'artist name': t['artists'][0]['name'],
                'artist genres': artist["genres"],
                'preview_url': t['preview_url'],
                'album genres': album["genres"],
                'release_date': t['album']['release_date'],
                'song link': t['external_urls']['spotify'],
                'image': t['album']['images'][0]['url'],
                'song duration': t['duration_ms'],
                'song popularity': t['popularity']
            }
            songs_list.append(songs_data)

df_songs = pd.DataFrame(songs_list)

Collecting the tracks’ audio features

After collecting the track data, I proceeded to gather the audio features of each track, which turned out to be quite enjoyable. Spotify provided comprehensive details of the audio features, which proved to be essential and played a significant role in my recommendation system. Here are some of the audio features I obtained from Spotify:

  • Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • Acousticness: A measure from 0.0 to 1.0 of whether the track is acoustic.
  • Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • Instrumentalness: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
  • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.
  • Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track. Values typically range between -60 and 0 dB.
  • Speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
  • Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

It was surprisingly straightforward. All I needed to do was gather the song IDs, which I had already collected and stored in a dataframe. Then, I simply passed these IDs to the audio_features API, retrieved the audio features, and organized them neatly in another dataframe called df_audios. Finally, I combined the two dataframes on the track id.

  • Tip: To handle cases where some songs lacked audio features on Spotify, I used a try/except block. This allowed me to gracefully handle any exceptions that arose during the retrieval of audio features for those particular songs.

audio_features = []

for ids in df_songs['id']:
    try:
        results = sp.audio_features(ids)
        audio_data = {
            'id': ids,
            'danceability': results[0]['danceability'],
            'energy': results[0]['energy'],
            'key': results[0]['key'],
            'loudness': results[0]['loudness'],
            'mode': results[0]['mode'],
            'speechiness': results[0]['speechiness'],
            'acousticness': results[0]['acousticness'],
            'instrumentalness': results[0]['instrumentalness'],
            'liveness': results[0]['liveness'],
            'valence': results[0]['valence'],
            'tempo': results[0]['tempo'],
            'time_signature': results[0]['time_signature'],
        }
        audio_features.append(audio_data)
    except Exception:
        # Skip tracks that have no audio features on Spotify
        print(f'No audio features for {ids}')

df_audios = pd.DataFrame(audio_features)


# Combine the track details and the audio features on the track id
df_songs = pd.merge(df_songs, df_audios, on='id')

Collecting data on the songs I like

Now that I have my tracks dataset ready with the features I want, I decided to take another step before diving into building the recommendation website.

I wanted to gain deeper insights into my music taste by analyzing the audio features and details of the tracks I liked. To achieve this, I created a playlist containing all of my favorite songs and utilized Spotify’s playlist API to gather their data. The process was similar to the tracks collection process, with only minor differences.

# Get the playlist id from its share link
playlist_link = "https://open.spotify.com/playlist/2wQaxhZC3HXtmWd2yjqJG7?si=bc4f208ae2a14ab2"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

playlist_songs = []

for t in sp.playlist_tracks(playlist_URI)["items"]:
    try:
        artist = sp.artist(t['track']["artists"][0]["external_urls"]["spotify"])
        album = sp.album(t['track']["album"]["external_urls"]["spotify"])
        songs_data = {
            'id': t['track']['id'],
            'song name': t['track']['name'],
            'artist name': t['track']['artists'][0]['name'],
            'artist genres': artist["genres"],
            'album genres': album["genres"],
            'release_date': t['track']['album']['release_date'],
            'song link': t['track']['external_urls']['spotify'],
            'image': t['track']['album']['images'][0]['url'],
            'song duration': t['track']['duration_ms'],
            'song popularity': t['track']['popularity']
        }
        playlist_songs.append(songs_data)
    except Exception:
        # Skip entries with missing fields (e.g. removed or local tracks)
        print('Skipping a track with incomplete data')

df_playlist_songs = pd.DataFrame(playlist_songs)

Understanding the data

Similar to all data science projects, the second step involved analyzing the collected data. This allowed me to ensure that the data processing part executed smoothly and provided an opportunity to gain a deeper understanding of the dataset. By familiarizing myself with the data, I could leverage its full potential.

I’m always saying, ‘You’ve got to be friends with your data!’ The better you understand it, the better companionship you can establish with it! :D

Here are the steps I used to analyze the collected data. While I won’t delve deep into all of the steps since most are foundational, if you’re interested, I’ll include the link to the project repository at the end of the article. There, you can explore the code in more detail and gain further insights (a quick sketch of these checks follows the list).

  • Shape of the data
  • Check column dtypes
  • Check for null values
  • Describe the numerical columns
  • Check the correlations
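For completeness, here is roughly what those checks look like in pandas, assuming df_songs is the merged tracks dataframe from the collection step above:

# Basic EDA checks on the collected tracks dataframe
print(df_songs.shape)            # number of rows and columns
print(df_songs.dtypes)           # data type of each column
print(df_songs.isnull().sum())   # null count per column
print(df_songs.describe())       # summary statistics for numeric columns

# Correlation between the numeric columns (audio features, duration, popularity)
print(df_songs.select_dtypes(include='number').corr())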

Cleaning the data

Data doesn’t always come in perfect shape; in fact, it can often be quite messy, especially when sourced from the internet. As a data scientist, it’s your role to shape it like a skilled mason crafting a statue. Fortunately, in this particular dataset, there weren’t many challenges to address. I only needed to handle duplicate songs, which had nothing to do with the collection process itself: since Spotify is a public platform, some people upload the same songs multiple times. Additionally, I encountered some data type issues, which I easily resolved using pandas.

df_songs = df_songs.drop_duplicates(['song name','artist name'],keep='first')
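As for the data type issues, here is a plausible sketch of the kind of pandas fixes involved; the column names follow the collection step above, but the exact fixes in the project may differ:

# Illustrative dtype fixes (exact columns and fixes may differ)
df_songs['release_date'] = pd.to_datetime(df_songs['release_date'], errors='coerce')
df_songs['song popularity'] = pd.to_numeric(df_songs['song popularity'])

# A human-friendly duration column, converted from milliseconds to minutes
df_songs['duration_min'] = df_songs['song duration'] / 60_000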

Analyzing the cleaned data

In this phase, my main goal was to gain better insights from the track data and develop a deeper understanding of my music taste through data analysis. While I won’t provide an exhaustive explanation of all the analysis results (as that’s beyond the scope of this article), I’ll share the key findings to give you an idea of the project’s flow and process.

Tip: to make my visualizations visually appealing and cohesive, I decided to spice things up with Spotify’s color palette. It worked like magic, giving my graphs a touch of beauty and personality. However, it’s worth noting that this approach may not fit every project, especially when some graphs contain sensitive information. Moreover, certain colors might not communicate the right message to your audience. For instance, you wouldn’t want to go all ‘Netflix-red’ if it doesn’t suit the overall vibe you’re aiming for.

1 - Number of songs per year after removing duplicates

Collected songs released date. Image by author

2 - When were my favorite songs released?

My favorite songs released date. Image by author

3 - What’s the duration of my favorite songs?

Collected songs duration distribution. Image by author

4 - Do I have a common taste in music based on the popularity of my favorite songs?

My favorite songs popularity. Image by author

5 - My favorite songs genre

My favorite songs genre wordcloud. Image by author

6 - Apply PCA to the songs’ audio features

In this part, I separated the songs’ audio features and standardized them (think of standardizing as cutting all fruits into equal-sized pieces for a salad, ensuring you taste each fruit equally). Next, I applied PCA, a dimensionality reduction algorithm, and transformed my data down to three components. This allowed me to reduce the dimensionality of my data without losing significant information, much like zipping a file to make it lighter to compute on and a better fit for the downstream algorithms.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df_audios = df_songs[['danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo']]

# Standard-scale the songs' audio features
std_scaler = StandardScaler()
scaled_df = std_scaler.fit_transform(df_audios)

# Apply PCA to the scaled data, keeping 3 components
pca = PCA(3, svd_solver='full')
pca.fit(scaled_df)
lowdim_df = pca.transform(scaled_df)

Reduced-dimension songs data after PCA. Image by author
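One quick check worth doing at this point is seeing how much of the original variance those three components actually retain:

# How much of the original variance the 3 components keep
print(pca.explained_variance_ratio_)        # share per component
print(pca.explained_variance_ratio_.sum())  # total variance retained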

7 - Apply K-means clustering to the reduced dimensionality data

In the final part of my analysis, I was eager to cluster songs based on their audio features. This way, I could uncover patterns and explore how similar they are to each other.

Choosing the right ‘K’ (the number of clusters) for K-means posed a challenge since it’s an unsupervised learning algorithm with no definitive answer. To tackle this, I employed the Elbow method, a clever technique that helped me identify the ‘k’ best suited to my data.

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Use the Elbow method to pick the number of clusters
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(lowdim_df)
Elbow_M.show()

Choosing the right K using the Elbow method. Image by author

# Fit K-means with the chosen k
KM = KMeans(n_clusters=5)
km_pred = KM.fit_predict(lowdim_df)

K-means clustering output. Image by author

Building the Recommendation Website

Now that we’ve collected, cleaned, and analyzed our data, it’s time to construct the recommendation website and engine. But before we dive into the nitty-gritty, let me provide you with a brief introduction to two common types of recommendation systems:

  1. Content-Based Recommendation Systems: Picture a movie streaming platform like Netflix that suggests movies based on attributes of films you’ve enjoyed, such as their genre, cast, and release date.
  2. Collaborative Filtering Recommendation Systems: In this approach, the streaming platform leverages preferences from users with similar tastes to yours. For instance, if those users liked ‘Harry Potter,’ the system will likely recommend it to you as well.

Tip: Some websites utilize both approaches to provide more personalized recommendations, as seen on platforms like Amazon and Netflix. This is known as a hybrid recommendation system.

I’m only interested in content-based recommendation systems since I don’t have access to other users’ preferences. What I have in mind is to create a reduced-dimensional latent vector of all songs’ audio features and details like genre and singer. Then, I’ll use cosine similarity to compare them against my liked songs and, based on that, the system can suggest similar songs to me.

Let me quickly explain cosine similarity; it’s a simple yet powerful algorithm. As someone who’s a big fan of linear algebra and vectors, I find it fascinating. Cosine similarity measures the cosine of the angle (theta) between two vectors: the closer the cosine is to 1, the smaller the angle and the more similar the vectors are.

Cosine similarity example. Image by opensourceforu
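To make that concrete, here is a minimal sketch of the idea with scikit-learn’s cosine_similarity. It reuses pca, std_scaler, lowdim_df, and df_songs from the earlier steps, while df_playlist_audio_features is an assumed stand-in for my liked songs’ audio-feature columns; this illustrates the approach rather than reproducing the project’s exact code:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Project the liked songs into the same scaled PCA space as the catalog.
# df_playlist_audio_features: assumed dataframe with the same audio-feature
# columns as df_audios above, but for the liked-playlist songs.
liked_vecs = pca.transform(std_scaler.transform(df_playlist_audio_features))

# Similarity of every liked song to every catalog song
sim = cosine_similarity(liked_vecs, lowdim_df)  # shape: (n_liked, n_songs)

# Rank catalog songs by their average similarity to my liked songs
mean_sim = sim.mean(axis=0)
top_idx = np.argsort(mean_sim)[::-1][:10]
print(df_songs.iloc[top_idx][['song name', 'artist name', 'song link']])

In practice you would also filter out the songs that are already in the playlist, otherwise the top matches are trivially my own liked songs.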

Build and deploy the website

For the next part, let’s jump into how I handled the front-end of the project. When it comes to the front-end, there are plenty of excellent options out there. But for this data science project, I didn’t want to spend ages fussing over UI and front-end complexities. That’s when Streamlit came to my rescue! It’s a fantastic Python library that’s perfect for data scientists like me who just want to show off their work without getting tangled in web development intricacies.

You see, Streamlit has its ups and downs. On the bright side, it offers dynamic components that blend seamlessly with Python, making it a breeze to create interactive elements. But, well, it doesn’t have all the fancy components you’d find in frameworks like React.js or Vue.js. And, to be honest, it’s not as customizable as I would have liked for some of its components. So, if you’re dreaming of building a website with all the bells and whistles, you might want to explore other options. But, you know what? For this particular project, Streamlit ticked all the boxes and met my needs just right.
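To give you a feel for how little code a page like this takes, here is a minimal, hypothetical sketch of a Streamlit recommendation page; songs.csv and recommend() are illustrative stand-ins, not the project’s actual file or function:

# app.py: a minimal, hypothetical sketch of the recommendation page
import pandas as pd
import streamlit as st

df_songs = pd.read_csv('songs.csv')  # hypothetical export of the collected tracks

st.title('Spotify Song Recommender')

choice = st.selectbox('Pick a song you like:', df_songs['song name'])

if st.button('Recommend'):
    recs = recommend(choice)  # assumed helper wrapping the cosine-similarity step
    for _, row in recs.iterrows():
        st.write(f"{row['song name']} by {row['artist name']}")
        if pd.notna(row.get('preview_url')):
            st.audio(row['preview_url'])  # plays the 30-second Spotify preview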

Now, let me give you a sneak peek at the features I cooked up for this website:

  • Listen to a preview of the recommended songs
  • Create a playlist of your favorite songs
  • Sort recommended songs by popularity, duration, or similarity
  • Shuffle recommended songs
  • Go straight to the song’s page on Spotify

Project Demo. Image by author

Deploying it (Making it online)

The last part of the project involved taking it online, and for this task, I explored two different approaches: Streamlit Cloud and Heroku, both user-friendly platforms.

Now, here’s a helpful tip: Starting from 28th November 2022, Heroku no longer offers free deployment, so I had to go with the Streamlit Cloud option.

Thankfully, Streamlit comes to the rescue with its Streamlit Cloud platform, which let me deploy my Streamlit app effortlessly. The process is simple: create a public or private GitHub repository (whichever you prefer), point Streamlit Cloud at your app’s Python file, and voilà! Streamlit Cloud takes care of the deployment automatically, or as they like to call it, ‘baking’ the app. 😄

Streamlit cloud deploy example. Image by Streamlit blog

Heroku isn’t that different either! You just need to create a repo and add a Procfile and a setup.sh file. Then, simply copy-paste the code below:

# setup.sh
mkdir -p ~/.streamlit/

echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml

And add the following code to your Procfile (make sure to change app.py to match your app’s name):

web: sh setup.sh && streamlit run app.py

Then log in to Heroku and create an app with the commands below (note that Heroku app names only allow lowercase letters, digits, and dashes):

heroku login
heroku create your-app-name

and push the app files to it (use main instead of master if that’s your default branch):

git push heroku master

To make sure your app deployed correctly, you can check the URL and the logs using:

heroku logs --tail

Conclusion

I hope you didn’t get bored, because we did a lot, from collecting data from Spotify’s API to building and deploying a website. I tried to take you on the journey of creating my portfolio project and to cover all the practical parts along the way.

As you embark on your own data science projects, I hope this article inspires you to explore the endless possibilities of data-driven applications. Remember, finding the right tools, like Streamlit, can simplify the process and allow you to focus on what matters most: uncovering the hidden gems in your data and sharing your discoveries with the world.

Thank you for joining me on this adventure, and I invite you to explore the project’s repository linked at the end of this article for a deeper dive into the code and implementation details.

Happy data exploring and coding!

Meysam Raz

Follow me on LinkedIn

Project Repository

Project Live Demo
