Sparkify: Churn Prediction

Project Overview & Problem

Like Spotify, Sparkify is a music streaming service that lets users listen for free (with ads) or pay a monthly subscription to listen without ads.

Predicting churn is critical for a subscription-based company: it identifies the customers who are most likely to leave, so the company can act to retain them before they churn.

Retaining an existing customer is usually more cost-effective than acquiring a new one.

Data science can help anticipate customers’ needs, so that targeted offers reach them at the right time.

Like every company, Sparkify has to be economically viable. We assume that most of its revenue comes directly from the premium offer (the monthly subscription).

By analyzing the historical data provided by Sparkify, our goal is to detect and predict whether a customer is about to cancel or downgrade their account.

We’ll use Spark to build a machine learning model that addresses this challenge.

The Dataset

We will be working with a small fraction (128 MB) of the original data, containing 225 unique accounts and 286,500 transactions.

We work with such a small sample to keep training and testing times short. The model could later be deployed on a cluster with the full 12 GB of data, which may give less imbalanced results.

Data Exploration

The data sample is provided in .json format with the following schema:

#	Column	type	Description
1	userId	string	ID of the user
2	artist	string	Name of the artist
3	auth	string	“Logged in” or “Cancelled”
4	firstName	string	First name of the user
5	gender	string	Gender of the user, “F” or “M”
6	itemInSession	Long Int	Item in session
7	lastName	string	Last name of the user
8	length	Double	Length of the song related to the event
9	level	string	Level of the user’s subscription, “free” or “paid”. User can change the level, so events for the same user can have different levels
10	location	string	Location of the user at the time of the event
11	method	string	“GET” or “PUT”
12	page	string	Type of action: “Next Song”, “Login”, “Thumbs Up” etc
13	registration	Long Int	Registration timestamp of the user
14	sessionId	Long Int	Session id
15	song	string	Name of the song
16	status	Long Int	Response status: 200, 404, 307
17	ts	Long Int	Timestamp of the event
18	userAgent	string	User agent (browser and operating system) used for the event

The ‘ts’ and ‘registration’ columns are timestamps that are not stored in a readable date format. Once converted, we can observe that the sample starts on 2018-10-01 02:01:57 and ends about two months later, on 2018-12-03 02:11:16.
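The conversion can be sketched in plain Python. This is only an illustration: it assumes the raw values are epoch timestamps in milliseconds, which the schema does not state, and `example_ts` is an illustrative value rather than one read from the dataset. In Spark the same idea would be expressed as a column transformation.

```python
from datetime import datetime, timezone


def epoch_ms_to_datetime(ts_ms: int) -> datetime:
    """Convert an epoch timestamp in milliseconds to a timezone-aware UTC datetime.

    Assumption: the raw 'ts' and 'registration' values are in milliseconds.
    """
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


# Illustrative value only, not read from the dataset.
example_ts = 1538352117000
converted = epoch_ms_to_datetime(example_ts)
print(converted.strftime("%Y-%m-%d %H:%M:%S"))
```

The printed time depends on the timezone used for display; the sketch uses UTC, so a local rendering of the same instant can differ by a few hours.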

The JSON file contains 286,500 rows. Missing values follow three patterns: some columns have no missing values, some are missing 8,346 values, and others are missing 58,392 values.

The rows behind the 8,346 missing values represent only 2.9% of the dataset, so I decided to remove them.

The artist, length and song columns have exactly the same number of null values, which mostly come from events whose page does not involve a song.
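A minimal sketch of this missing-value check, in plain Python on a toy sample. The rows below are made up for illustration, and the choice of an empty `userId` as the marker for rows to drop is an assumption, since the article does not say which columns carry the 8,346 gaps.

```python
from collections import Counter


def count_missing(rows, columns):
    """Count, per column, how many rows have a None or empty-string value."""
    counts = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                counts[col] += 1
    return {col: counts[col] for col in columns}


# Toy rows, invented for illustration.
rows = [
    {"userId": "10", "page": "NextSong", "artist": "A", "song": "s1", "length": 200.0},
    {"userId": "10", "page": "Home", "artist": None, "song": None, "length": None},
    {"userId": "", "page": "Home", "artist": None, "song": None, "length": None},
]

missing = count_missing(rows, ["userId", "page", "artist", "song", "length"])

# Assumption: incomplete rows are the ones without a userId.
cleaned = [row for row in rows if row.get("userId")]

# Share of the dataset removed, using the figures quoted in the article.
share_pct = round(100 * 8346 / 286500, 1)
print(missing, len(cleaned), share_pct)
```

In the toy sample, artist, song and length are missing together on the non-song pages, which mirrors the pattern described above.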

To begin the data exploration, I checked the different values of the page column.

This allowed me to define churn as the cancellation confirmation event. Based on this observation, I created a new “churn” column, set to 1 for users who reached the cancellation confirmation page and 0 for those who did not.

A churned user is therefore a user who reached the cancellation confirmation page. The plot above shows that 173 users never reached it, while 52 users did.
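The labelling rule can be sketched as follows in plain Python. The exact page label `"Cancellation Confirmation"` is an assumption about how the event appears in the page column, and the events are invented for illustration.

```python
# Assumed label of the cancellation confirmation event in the 'page' column.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 or 0}, where 1 means the user reached the churn page."""
    churn = {}
    for event in events:
        user_id = event["userId"]
        reached = int(event["page"] == CHURN_PAGE)
        # Once a user is flagged as churned, the flag stays at 1.
        churn[user_id] = max(churn.get(user_id, 0), reached)
    return churn


# Toy events, invented for illustration.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Thumbs Up"},
]

labels = label_churn(events)
print(labels)
```

The flag is computed per user rather than per event, so every event of a churned user ends up with the same label when the result is joined back to the log.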

The churned users per gender plot shows that the proportion of churned users is broadly similar for both genders, although male users appear somewhat more likely to churn.

89 male and 84 female users did not churn, while 32 male and 20 female users did, which is about 26% of male users versus 19% of female users.

The level of a user indicates whether they have a “free” or “paid” membership.

From this plot we can observe that most of the churned users (31 people) have a paid membership.

Lifetime is a feature giving the number of days a user has been on the service, whether or not they churned. It is based on the registration date and the event timestamp.

From this boxplot we can observe that non-churned users tend to be longer-term users, with a median lifetime of 75 days versus 50 days for users who cancelled their membership.
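One way to compute such a lifetime and compare medians per group is sketched below in plain Python. It assumes that both timestamps are in epoch milliseconds and that lifetime is measured up to the user's last event; the field name `last_ts` and the user records are hypothetical.

```python
from statistics import median

MS_PER_DAY = 24 * 60 * 60 * 1000


def lifetime_days(registration_ms, last_event_ms):
    """Days between registration and the last event (assumes millisecond timestamps)."""
    return (last_event_ms - registration_ms) / MS_PER_DAY


def median_lifetime_by_churn(users):
    """Group users by churn flag and return the median lifetime of each group."""
    groups = {}
    for user in users:
        days = lifetime_days(user["registration"], user["last_ts"])
        groups.setdefault(user["churn"], []).append(days)
    return {label: median(values) for label, values in groups.items()}


# Hypothetical users, invented for illustration.
users = [
    {"churn": 0, "registration": 0, "last_ts": 60 * MS_PER_DAY},
    {"churn": 0, "registration": 0, "last_ts": 80 * MS_PER_DAY},
    {"churn": 0, "registration": 0, "last_ts": 100 * MS_PER_DAY},
    {"churn": 1, "registration": 0, "last_ts": 30 * MS_PER_DAY},
    {"churn": 1, "registration": 0, "last_ts": 50 * MS_PER_DAY},
]

medians = median_lifetime_by_churn(users)
print(medians)
```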

Windows and Mac are the most common operating systems among Sparkify users, and they show similar proportions of churned users. On Linux, by contrast, more than half of the users churned.

Looking at OS versions, Windows 7 is the most common among Sparkify users, followed by macOS.
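The operating system is not a column of its own, so it has to be derived from the userAgent string. The article does not show how this was done; the following is a deliberately simple sketch based on substring matching, with a hypothetical helper name and example strings that follow common browser conventions.

```python
def detect_os(user_agent):
    """Map a user agent string to a coarse OS family using substring checks.

    Mobile platforms are checked first because their user agents also
    contain "Mac OS X" (iOS) or "Linux" (Android).
    """
    ua = user_agent or ""
    if "iPhone" in ua or "iPad" in ua:
        return "iOS"
    if "Android" in ua:
        return "Android"
    if "Windows" in ua:
        return "Windows"
    if "Macintosh" in ua or "Mac OS X" in ua:
        return "Mac"
    if "Linux" in ua or "X11" in ua:
        return "Linux"
    return "Other"


# Example strings in the usual browser format, for illustration only.
samples = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit/537.51.2",
]

detected = [detect_os(ua) for ua in samples]
print(detected)
```

In user agent strings, "Windows NT 6.1" is the token that corresponds to Windows 7, so a finer version breakdown would look at the text following the OS name.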

Feature Engineering

#	Features	Description
1	UserId	Unique identifier of the user
2	churn	Whether the user cancelled their subscription
3	gender	Male or female
4	sessionCount	Number of sessions of the user
5	minSessionTime	Shortest session time of the user
6	maxSessionTime	Longest session time of the user
7	AvgSessionDurationMinutes	Average session time of the user
8	AvgSongsPerSession	Average number of songs the user listened to per session
9	artistCount	Number of artists the user has listened to
10	level	Whether the user pays for a membership or not
11	SongCount	Number of songs the user has listened to
12	days_total_subscription	Number of days the user has been registered
13	Region Northeast	Number of user connections in the Northeast
14	Region Midwest	Number of user connections in the Midwest
15	Region West	Number of user connections in the West
16	Region South	Number of user connections in the South
17	About	Number of About page events per user
18	Add Friend	Number of Add Friend events per user
19	Add to Playlist	Number of Add to Playlist events per user
20	Downgrade	Number of Downgrade events per user
21	Error	Number of Error page events per user
22	Help	Number of Help page events per user
23	Home	Number of Home page events per user
24	LogOut	Number of log-outs per user
25	NextSong	Number of Next Song events per user
26	Roll Advert	Number of Roll Advert events per user
27	Save Settings	Number of Save Settings events per user
28	Settings	Number of Settings page events per user
29	SubmitDowngrade	Number of Submit Downgrade events per user
30	SubmitUpgrade	Number of Submit Upgrade events per user
31	Thumbs down	Number of Thumbs Down events per user
32	Thumbs Up	Number of Thumbs Up events per user
33	Upgrade	Number of Upgrade events per user
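Many of the features in this table are per-user counts of page events. A minimal plain Python sketch of that aggregation follows; the events are invented for illustration, and in Spark the same result would typically come from a group-by on the user with a pivot on the page column.

```python
from collections import Counter, defaultdict


def page_counts_per_user(events, pages):
    """Return {userId: {page: count}} with an explicit 0 for pages never visited."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["userId"]][event["page"]] += 1
    # Counter returns 0 for unseen keys, so every user gets every column.
    return {
        user_id: {page: counter[page] for page in pages}
        for user_id, counter in counts.items()
    }


# Toy events, invented for illustration.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "Roll Advert"},
]

features = page_counts_per_user(events, ["NextSong", "Thumbs Up", "Roll Advert"])
print(features)
```

Filling the missing pages with 0 keeps one row per user with the same set of columns, which is what the models expect as input.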

All of the above features are used to build our final dataframe. It is composed of 1 dependent variable (churn) and 32 independent variables, which are used to train the following supervised machine learning models:

Logistic Regression

Decision Tree

Random Forest

Support Vector Machine

As seen on the churned users plot, churned users represent a small part of all users. That is why we chose the F1 score as our metric: it is the harmonic mean of precision and recall, which makes it better suited than accuracy to imbalanced classes such as churned users.
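To see why accuracy is misleading here, consider a model that never predicts churn. The sketch below computes the F1 score of the churn class in plain Python, using the 173 / 52 split reported earlier. The article does not say which averaging was used for its own F1 figure, so this is an illustration of the metric rather than a reproduction of the reported result.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class (1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 173 non-churned users and 52 churned users, as reported in the article.
y_true = [0] * 173 + [1] * 52

# A naive model that never predicts churn.
y_pred = [0] * 225

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
naive_f1 = f1_score(y_true, y_pred)
print(round(accuracy, 3), naive_f1)
```

The naive model is right about 77% of the time while detecting no churned user at all, and its F1 score of 0 makes that failure visible.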

After training and testing each model, the support vector machine obtained the highest F1 score, with 74.89% after hyperparameter tuning.

The above plot shows which features impact our model the most when predicting churn. We observe that:

People showing dissatisfaction through thumbs downs, logouts and downgrades are the ones who leave most easily.
Apart from these, people from the Midwest and West are more likely to churn.
Finally, roll advert events are also a feature to consider when looking at churned users.
In contrast, people who have used Sparkify for a long time are less likely to churn and cancel their subscription.

Conclusion

We deployed Spark machine learning models to estimate churn for Sparkify’s users using previous log data.
With a 74.89% F1 score, we developed a model that can detect customers at risk of churning.
Once we had the dataset, we trained four distinct types of models and fine-tuned each of them to see which performed best.

At first sight the results seem acceptable, but we worked on a dataset that was highly imbalanced and too small to be statistically significant. To get better outcomes, it would be interesting to have access to a cluster and test our model on a larger dataset.

Finally, we still managed to predict churn, which is a good start, even if results would be more accurate with more recent log data (this sample dates from 2018). Sparkify could use our model every day to improve customer satisfaction, using A/B testing to offer discounts or targeted messages to the people who are likely to churn.