Sparkify: Churn Prediction

Project Overview & Problem

Like Spotify, Sparkify is a music streaming service that users can enjoy either for free (with ads) or through a monthly subscription that removes the ads.

Predicting churn is critical for any subscription-based company: it identifies the customers who are most likely to leave, so the company can act to retain them before they churn.

Retaining an existing customer is usually more cost-effective than acquiring a new one.

Data science can help us anticipate customers’ needs and send them targeted offers at the right time.

Sparkify, like every company, has to be economically viable. We assume that most of its revenue comes directly from the premium offer (the monthly subscription).

By analyzing the historical data provided by Sparkify, our goal is to detect and predict whether a customer is about to cancel or downgrade their account.

We’ll use Spark to build a machine learning model to address this challenge.

The Dataset

We will be working with a small fraction (128MB) of the original data, containing 225 unique accounts and 286,500 transactions.

We work with such a small sample because we don’t want to spend too long on training and testing. Eventually, the model could be deployed on a cluster with the full 12 GB dataset, which may give less imbalanced results.

Data Exploration

The data provided is in .json format with the following schema:

| # | Column | Type | Description |
|---|---|---|---|
| 1 | userId | string | ID of the user |
| 2 | artist | string | Name of the artist |
| 3 | auth | string | “Logged in” or “Cancelled” |
| 4 | firstName | string | First name of the user |
| 5 | gender | string | Gender of the user, “F” or “M” |
| 6 | itemInSession | Long Int | Item in session |
| 7 | lastName | string | Last name of the user |
| 8 | length | Double | Length of the song related to the event |
| 9 | level | string | Level of the user’s subscription, “free” or “paid”. A user can change level, so events for the same user can have different levels |
| 10 | location | string | Location of the user at the time of the event |
| 11 | method | string | “GET” or “PUT” |
| 12 | page | string | Type of action: “Next Song”, “Login”, “Thumbs Up” etc. |
| 13 | registration | Long Int | Registration number |
| 14 | sessionId | Long Int | Session ID |
| 15 | song | string | Name of the song |
| 16 | status | Long Int | Response status: 200, 404, 307 |
| 17 | ts | Long Int | Timestamp of the event |
| 18 | userAgent | string | Agent the user used for the event |

The ‘ts’ and ‘registration’ columns are timestamps stored as long integers rather than dates. Once converted to a readable date format, we can observe that the sample begins on 2018-10-01 02:01:57 and ends two months later, on 2018-12-03 02:11:16.
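The conversion can be sketched in plain Python. This is a minimal sketch that assumes the long integers are milliseconds since the Unix epoch, which the article does not state explicitly. The helper name `ms_to_datetime` and the sample value are illustrative, not taken from the dataset description.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime.

    Assumption: the raw value is expressed in milliseconds.
    """
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Illustrative value chosen for this sketch.
sample_ts = 1538352117000
print(ms_to_datetime(sample_ts).strftime("%Y-%m-%d %H:%M:%S"))
```

The printed date is in UTC. Dates rendered in a local time zone will be shifted by that zone's offset.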

There are 286,500 rows in the JSON file. We spot three patterns: columns with no missing values, columns with 8,346 missing values, and columns with 58,392 missing values.

Because the rows with 8,346 missing values represent only 2.9% of the dataset, I decided to remove them.

The artist, length and song columns have exactly the same number of null values, which mainly come from page events that do not involve a song.
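The missing-value check and the row removal can be sketched as follows. This is a plain-Python illustration on three hand-made rows, not the Spark code used for the project. Treating an empty string as missing and filtering on `userId` are both assumptions, since the article does not say which columns hold the 8,346 gaps.

```python
def count_missing(rows, columns):
    """Count, per column, how many rows hold None or an empty string."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if row.get(col) in (None, ""):
                missing[col] += 1
    return missing

def drop_missing(rows, column):
    """Keep only the rows where `column` is present and non-empty."""
    return [row for row in rows if row.get(column) not in (None, "")]

# Hand-made rows for illustration only.
rows = [
    {"userId": "1", "gender": "F", "page": "NextSong", "song": "A"},
    {"userId": "1", "gender": "F", "page": "Home", "song": None},
    {"userId": "", "gender": None, "page": "Home", "song": None},
]

print(count_missing(rows, ["userId", "gender", "song"]))
print(len(drop_missing(rows, "userId")))

# Share of the dataset removed, using the counts quoted in the text.
share_removed = 8346 / 286500
print(f"{share_removed:.1%}")
```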

To begin the data exploration, I checked the different values of the page column.

This allowed me to define churn as the cancellation confirmation event. Based on this observation, I created a new column, “churn”, set to 1 for users who reached the cancellation confirmation page and 0 for those who did not.

A churned user is therefore a user who reached the cancellation confirmation page. The plot above shows that 173 users never reached that page, while 52 users did.
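The labelling rule can be sketched in a few lines of plain Python. The exact page value `"Cancellation Confirmation"` is an assumption about how the event is spelled in the data, and `label_churn` is a hypothetical helper name.

```python
# Assumed exact value of the page column for the cancellation confirmation event.
CHURN_PAGE = "Cancellation Confirmation"

def label_churn(events):
    """Return {userId: 1 or 0}, where 1 means the user reached the churn page at least once."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = 1 if event["page"] == CHURN_PAGE else 0
        # A single cancellation confirmation is enough to flag the user.
        labels[user] = max(labels.get(user, 0), hit)
    return labels

# Hand-made events for illustration only.
events = [
    {"userId": "10", "page": "NextSong"},
    {"userId": "10", "page": "Cancellation Confirmation"},
    {"userId": "20", "page": "NextSong"},
    {"userId": "20", "page": "Thumbs Up"},
]

print(label_churn(events))
```

The flag is attached to the user rather than to a single event, so every row of a churned user carries the same label once it is joined back onto the log.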

From the churned users per gender plot, we can observe that the proportion of churned users is broadly similar across genders, although male users seem somewhat more likely to churn.

89 male and 84 female users did not churn, while 32 male and 20 female users did.
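To make the comparison concrete, the counts above can be turned into churn rates. The snippet below only does arithmetic on the figures quoted in the text; the helper name `churn_rate` is illustrative.

```python
# Counts quoted in the text.
counts = {
    "M": {"churned": 32, "stayed": 89},
    "F": {"churned": 20, "stayed": 84},
}

def churn_rate(churned, stayed):
    """Share of users in a group who churned."""
    return churned / (churned + stayed)

for gender, c in counts.items():
    print(f"{gender}: {churn_rate(c['churned'], c['stayed']):.1%}")

# Overall rate across the 225 users.
overall = churn_rate(32 + 20, 89 + 84)
print(f"overall: {overall:.1%}")
```

With groups this small, a gap of a few percentage points should be read with caution.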

The level of a user determines whether the user has a “free” or “paid” membership.

From this plot we can observe that most of the churned users were paying for a membership (31 people).

Lifetime is a feature that shows the number of days a user has been registered, whether they churned or not. It is based on the registration date and the event timestamp.

From this boxplot we can observe that non-churned users tend to be longer-term users, with a median of 75 days versus 50 days for users who cancelled their membership.
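A lifetime of this kind can be computed as the gap between two timestamps. The sketch below assumes both values are in milliseconds since the Unix epoch and that the user's latest event is the one being compared; the article does not spell out either point. The sample values are made up.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def lifetime_days(registration_ms, last_event_ms):
    """Days elapsed between registration and the user's latest event.

    Assumption: both inputs are milliseconds since the Unix epoch.
    """
    return (last_event_ms - registration_ms) / MS_PER_DAY

# Made-up values: a user whose last event falls 75 days after registration.
registration = 1538352000000
last_event = registration + 75 * MS_PER_DAY

print(lifetime_days(registration, last_event))
```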

Windows and Mac are the most common operating systems among Sparkify users, and they show roughly the same proportion of churned users. On Linux, by contrast, more than half of the users churned.

Windows 7 is the most common operating system version among Sparkify users, followed by macOS.
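The operating system is not a column of its own, so it presumably has to be derived from the userAgent string. The sketch below shows one simple way to do that with substring checks. The category names, the helper names and the sample strings are assumptions for illustration, and the article does not describe how its own extraction was done.

```python
# Windows NT kernel versions as they appear in user-agent strings.
WINDOWS_VERSIONS = {
    "Windows NT 6.1": "Windows 7",
    "Windows NT 6.2": "Windows 8",
    "Windows NT 6.3": "Windows 8.1",
}

def os_family(user_agent):
    """Map a raw user-agent string to a coarse operating-system family."""
    ua = user_agent or ""
    # Check mobile markers first: iOS agents contain "like Mac OS X"
    # and Android agents contain "Linux".
    if "iPhone" in ua or "iPad" in ua:
        return "iOS"
    if "Android" in ua:
        return "Android"
    if "Windows" in ua:
        return "Windows"
    if "Macintosh" in ua:
        return "Mac"
    if "Linux" in ua or "X11" in ua:
        return "Linux"
    return "Other"

def windows_version(user_agent):
    """Return the Windows release name if a known NT marker is present, else None."""
    ua = user_agent or ""
    for marker, name in WINDOWS_VERSIONS.items():
        if marker in ua:
            return name
    return None

# Sample strings written for this sketch.
agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
for agent in agents:
    print(os_family(agent), windows_version(agent))
```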

Feature Engineering

| # | Feature | Description |
|---|---|---|
| 1 | UserId | Unique identification number of the user |
| 2 | churn | Whether the user cancelled their subscription |
| 3 | gender | Male or female |
| 4 | sessionCount | Number of sessions of a user |
| 5 | minSessionTime | Minimum session time of a user |
| 6 | maxSessionTime | Maximum session time of a user |
| 7 | AvgSessionDurationMinutes | Average session time of a user |
| 8 | AvgSongsPerSession | Average number of songs listened to per session by a user |
| 9 | artistCount | Number of artists a user has listened to |
| 10 | level | Whether a user pays for a membership or not |
| 11 | SongCount | Number of songs a user has listened to |
| 12 | days_total_subscription | Number of days a user has been registered |
| 13 | Region Northeast | Number of user connections in the Northeast |
| 14 | Region Midwest | Number of user connections in the Midwest |
| 15 | Region West | Number of user connections in the West |
| 16 | Region South | Number of user connections in the South |
| 17 | About | Number of about page events per user |
| 18 | Add Friend | Number of add friend events per user |
| 19 | Add to Playlist | Number of add to playlist events per user |
| 20 | Downgrade | Number of downgrade events per user |
| 21 | Error | Number of error page events per user |
| 22 | Help | Number of help page events per user |
| 23 | Home | Number of home page events per user |
| 24 | LogOut | Number of logouts per user |
| 25 | NextSong | Number of next song events per user |
| 26 | Roll Advert | Number of roll advert events per user |
| 27 | Save Settings | Number of save settings events per user |
| 28 | Settings | Number of settings page events per user |
| 29 | SubmitDowngrade | Number of submit downgrade events per user |
| 30 | SubmitUpgrade | Number of submit upgrade events per user |
| 31 | Thumbs down | Number of thumbs down events per user |
| 32 | Thumbs Up | Number of thumbs up events per user |
| 33 | Upgrade | Number of upgrade events per user |
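Most of these features are per-user aggregates of the event log. As a rough illustration of the idea, the plain-Python sketch below builds the page event counts and the session count for each user. It is not the Spark code behind the article, and `user_features` and the sample events are hypothetical.

```python
from collections import Counter, defaultdict

def user_features(events):
    """Aggregate the event log into one feature dict per user.

    Produces a sessionCount plus one counter per page type seen for that user.
    """
    page_counts = defaultdict(Counter)
    sessions = defaultdict(set)
    for event in events:
        user = event["userId"]
        page_counts[user][event["page"]] += 1
        sessions[user].add(event["sessionId"])
    return {
        user: {"sessionCount": len(sessions[user]), **dict(page_counts[user])}
        for user in page_counts
    }

# Hand-made events for illustration only.
events = [
    {"userId": "10", "sessionId": 1, "page": "NextSong"},
    {"userId": "10", "sessionId": 1, "page": "NextSong"},
    {"userId": "10", "sessionId": 2, "page": "Thumbs Up"},
    {"userId": "20", "sessionId": 7, "page": "Roll Advert"},
]

features = user_features(events)
print(features["10"])
print(features["20"])
```

Page types a user never triggered are simply absent from that user's dict here. In a tabular feature set they would be filled with 0.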

All of the above features are used to build our final dataframe. It is composed of 1 dependent variable (churn) and 32 independent variables, which are used to train the following supervised machine learning models:

  • Logistic Regression
  • SVM
  • Decision Tree
  • Random Forest

As seen on the churned users plot, churned users represent a small share of all users. That’s why we chose the F1 score as the evaluation metric: it is the harmonic mean of precision and recall, which makes it better suited than accuracy to imbalanced classes such as churned users.
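A short calculation shows why accuracy is misleading here. The sketch below computes precision, recall and F1 for the churn class from confusion counts. The 173 / 52 split comes from the text, while the second set of counts is a hypothetical classifier invented for the example. Which F1 variant the article reports (single class or averaged over classes) is not stated, so this only illustrates the single-class form.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive (churn) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Baseline that predicts "no churn" for all 225 users (173 stayed, 52 churned).
baseline_accuracy = 173 / 225
_, _, baseline_f1 = precision_recall_f1(tp=0, fp=0, fn=52)
print(f"baseline accuracy: {baseline_accuracy:.1%}, F1: {baseline_f1:.2f}")

# Hypothetical classifier: finds 30 of the 52 churners with 10 false alarms.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=22)
print(f"precision: {p:.2f}, recall: {r:.2f}, F1: {f1:.2f}")
```

The baseline reaches about 77% accuracy while detecting no churner at all, which is exactly the failure that F1 exposes.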

After training and testing each model, the support vector machine obtained the highest F1 score, 74.89%, after hyperparameter tuning.

The above plot shows which features impact our model the most when predicting churn. We observe that:

  • People who show dissatisfaction through thumbs down, logout or downgrade events are the ones who leave most easily.
  • Apart from these, we can see that people from the Midwest and West are more likely to churn.
  • Finally, roll advert events are also a feature to consider when looking at churned users.
  • In contrast, people who have used Sparkify for a long time are confident in the service and not inclined to churn or cancel their subscription.


We used Spark machine learning models to estimate churn for Sparkify’s users from previous log data. Once we had the dataset, we trained four distinct types of models and fine-tuned each one to see which performed best. With a 74.89% F1 score, the resulting model can detect customers at risk of churning.

At first sight the results seem acceptable, but we worked on a dataset that was too small to be statistically significant and was highly imbalanced. To get better outcomes, it would be interesting to have access to a cluster and test our model on a larger dataset.

Still, we managed to predict churn, which is a good start, even if the results would be more accurate with more recent log data (this sample dates from 2018). Sparkify could run our model every day to improve customer satisfaction, using A/B testing to offer discounts or targeted messages to the people who are likely to churn.
