{"id":516,"date":"2022-06-17T09:36:56","date_gmt":"2022-06-17T07:36:56","guid":{"rendered":"https:\/\/www.labo.mathieurella.fr\/?p=516"},"modified":"2022-06-26T15:42:21","modified_gmt":"2022-06-26T13:42:21","slug":"sparkify-churn-prediction","status":"publish","type":"post","link":"https:\/\/www.labo.mathieurella.fr\/?p=516","title":{"rendered":"Sparkify : Churn Prediction"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify-1024x467.jpeg\" alt=\"\" class=\"wp-image-549\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"467\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify-1024x467.jpeg\" alt=\"\" class=\"wp-image-549\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify-1024x467.jpeg 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify-300x137.jpeg 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify-768x350.jpeg 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/sparkify.jpeg 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Project Overview &amp; Problem<\/h2>\n\n\n\n<p>Like Spotify, Sparkify is a music streaming service that users can enjoy either for free (with ads) or through a monthly subscription that removes ads while listening.<\/p>\n\n\n\n<p>Predicting churn is critical for a subscription-based company: it identifies the customers who are most likely to leave, so the company can act to retain them before they churn. 
<\/p>\n\n\n\n<p>Retaining existing customers is usually more cost-effective than acquiring new ones.<\/p>\n\n\n\n<p>Data science can help us anticipate customers&#8217; needs and send them timely, targeted offers.<\/p>\n\n\n\n<p>Sparkify, like every company, has to be economically viable. We assume that most of its earnings come directly from the premium offer (monthly subscription).<\/p>\n\n\n\n<p>By analyzing the historical data provided, our goal is to detect and predict whether a customer is about to cancel or downgrade their account, based on the features supplied by Sparkify.<\/p>\n\n\n\n<p>We&#8217;ll use Spark to build a machine learning model that addresses this challenge.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Dataset<\/h2>\n\n\n\n<p>We will be working with a small fraction (128 MB) of the original data, containing 225 unique accounts and 286,500 transactions.<\/p>\n\n\n\n<p>We work with such a small sample because we don\u2019t want to spend too long on training and testing. 
In the end, the model could be deployed on a cluster with the full 12 GB dataset to obtain less imbalanced results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Exploration<\/h2>\n\n\n\n<p>The data sample is provided in JSON format with the following schema:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM-1024x557.png\" alt=\"\" class=\"wp-image-524\" width=\"410\" height=\"223\"\/><noscript><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM-1024x557.png\" alt=\"\" class=\"wp-image-524\" width=\"410\" height=\"223\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM-1024x557.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM-300x163.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM-768x418.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.00.35-PM.png 1518w\" sizes=\"(max-width: 410px) 100vw, 410px\" \/><\/noscript><\/figure><\/div>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\"><div class=\"wp-block-group__inner-container\">\n<figure class=\"wp-block-table aligncenter is-style-stripes\"><table><tbody><tr><td class=\"has-text-align-center\" 
data-align=\"center\"><strong>#<\/strong><\/td><td><strong>Column<\/strong><\/td><td><strong>Type<\/strong><\/td><td><strong>Description<\/strong><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">1<\/td><td>userId<\/td><td>string<\/td><td>ID of the user<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">2<\/td><td>artist<\/td><td>string<\/td><td>Name of the artist<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">3<\/td><td>auth<\/td><td>string<\/td><td>\u201cLogged in\u201d or \u201cCancelled\u201d<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">4<\/td><td>firstName<\/td><td>string<\/td><td>First name of the user<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">5<\/td><td>gender<\/td><td>string<\/td><td>Gender of the user, \u201cF\u201d or \u201cM\u201d<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">6<\/td><td>itemInSession<\/td><td>Long Int<\/td><td>Position of the event within the session<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">7<\/td><td>lastName<\/td><td>string<\/td><td>Last name of the user<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">8<\/td><td>length<\/td><td>Double<\/td><td>Length of the song related to the event<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">9<\/td><td>level<\/td><td>string<\/td><td>Level of the user\u2019s subscription, \u201cfree\u201d or \u201cpaid\u201d. 
Users can<br>change their level, so events for the same user can have<br>different levels<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">10<\/td><td>location<\/td><td>string<\/td><td>Location of the user at the time of the event<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">11<\/td><td>method<\/td><td>string<\/td><td>\u201cGET\u201d or \u201cPUT\u201d<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">12<\/td><td>page<\/td><td>string<\/td><td>Type of action: \u201cNext Song\u201d, \u201cLogin\u201d, \u201cThumbs Up\u201d etc.<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">13<\/td><td>registration<\/td><td>Long Int<\/td><td>Registration timestamp of the user<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">14<\/td><td>sessionId<\/td><td>Long Int<\/td><td>Session ID<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">15<\/td><td>song<\/td><td>string<\/td><td>Name of the song<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">16<\/td><td>status<\/td><td>Long Int<\/td><td>Response status: 200, 404, 307<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">17<\/td><td>ts<\/td><td>Long Int<\/td><td>Timestamp of the event<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">18<\/td><td>userAgent<\/td><td>string<\/td><td>User agent used for the event<\/td><\/tr><\/tbody><\/table><\/figure>\n<\/div><\/div>\n\n\n\n<p>The &#8216;ts&#8217; and &#8216;registration&#8217; columns are timestamps stored as long integers rather than dates. Once converted, we can observe that the sample starts on 2018-10-01 02:01:57 and ends two months later, on 2018-12-03 02:11:16.<\/p>\n\n\n\n<p> 
<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.02.57-PM.png\" alt=\"\" class=\"wp-image-525\" width=\"317\" height=\"284\"\/><noscript><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.02.57-PM.png\" alt=\"\" class=\"wp-image-525\" width=\"317\" height=\"284\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.02.57-PM.png 764w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-12.02.57-PM-300x269.png 300w\" sizes=\"(max-width: 317px) 100vw, 317px\" \/><\/noscript><\/figure><\/div>\n\n\n\n<p>There are 286,500 rows in the JSON file. We can spot three patterns: columns with no missing values, columns with 8,346 missing values, and columns with 58,392 missing values.<\/p>\n\n\n\n<p>Because the rows with 8,346 missing values represent only 2.9% of the dataset, I decided to remove them.<\/p>\n\n\n\n<p>The artist, length and song columns have exactly the same number of null values, which mainly come from page events that do not involve a song.<\/p>\n\n\n\n<p>To begin the data exploration, I checked the distinct values of the page column.<\/p>\n\n\n\n<p> <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM-1024x543.png\" alt=\"\" class=\"wp-image-527\"\/><noscript><img loading=\"lazy\" decoding=\"async\" 
width=\"1024\" height=\"543\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM-1024x543.png\" alt=\"\" class=\"wp-image-527\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM-1024x543.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM-300x159.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM-768x407.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-09-a\u0300-5.51.01-PM.png 1468w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><figcaption>This allowed me to define churn as the cancellation confirmation event. From this observation I created a new column &#8220;churn&#8221;, set to 1 for users who reached the cancellation confirmation page and 0 for those who did not. 
<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM-1024x768.png\" alt=\"\" class=\"wp-image-531\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM-1024x768.png\" alt=\"\" class=\"wp-image-531\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM-1024x768.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM-300x225.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM-768x576.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.48.49-PM.png 1320w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>A churned user is a user who reached the cancellation confirmation page. The plot above shows that 173 users never reached it, while 52 users did.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM-1024x772.png\" alt=\"\" class=\"wp-image-532\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"772\" 
src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM-1024x772.png\" alt=\"\" class=\"wp-image-532\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM-1024x772.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM-300x226.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM-768x579.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.00-PM.png 1310w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>The churned users per gender plot shows that the proportion of churned users is roughly similar for both genders, although male users appear somewhat more likely to churn.<\/p>\n\n\n\n<p>89 male and 84 female users did not churn, while 32 male and 20 female users did (about 26% of men versus 19% of women).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM-1024x833.png\" alt=\"\" class=\"wp-image-533\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"833\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM-1024x833.png\" alt=\"\" class=\"wp-image-533\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM-1024x833.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM-300x244.png 300w, 
https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM-768x625.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.49.12-PM.png 1202w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>The level of a user indicates whether the user has a &#8220;free&#8221; or &#8220;paid&#8221; membership.<\/p>\n\n\n\n<p>From this plot we can observe that most of the churned users were paying for a membership (31 people).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM-1024x732.png\" alt=\"\" class=\"wp-image-530\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"732\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM-1024x732.png\" alt=\"\" class=\"wp-image-530\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM-1024x732.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM-300x214.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM-768x549.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-13-a\u0300-3.07.27-PM.png 1366w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>Lifetime is a feature giving the number of days a user has been registered, whether or not they churned. It is based on the registration date and the event timestamps.<\/p>\n\n\n\n<p>From this boxplot we can observe that non-churned 
users tend to be longer-term users, with a median of 75 days versus 50 days for users who cancelled their membership.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM-1024x742.png\" alt=\"\" class=\"wp-image-537\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"742\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM-1024x742.png\" alt=\"\" class=\"wp-image-537\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM-1024x742.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM-300x217.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM-768x556.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.07-PM.png 1336w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>Windows and Mac are the most common operating systems among Sparkify users and show a similar proportion of churned users, whereas more than half of the Linux users churned.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM-1024x842.png\" alt=\"\" class=\"wp-image-538\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"842\" 
src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM-1024x842.png\" alt=\"\" class=\"wp-image-538\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM-1024x842.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM-300x247.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM-768x632.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-14-a\u0300-12.26.15-PM.png 1262w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>Windows 7 is the most common Microsoft version among Sparkify users, followed by macOS.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Feature Engineering<\/h2>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td><strong>#<\/strong><\/td><td class=\"has-text-align-left\" data-align=\"left\"><strong>Features<\/strong><\/td><td><strong>Description<\/strong><\/td><\/tr><tr><td>1<\/td><td class=\"has-text-align-left\" data-align=\"left\">UserId<\/td><td>Unique identifier of the user<\/td><\/tr><tr><td>2<\/td><td class=\"has-text-align-left\" data-align=\"left\">churn<\/td><td>Whether the user cancelled their subscription<\/td><\/tr><tr><td>3<\/td><td class=\"has-text-align-left\" data-align=\"left\">gender<\/td><td>Male or female<\/td><\/tr><tr><td>4<\/td><td class=\"has-text-align-left\" data-align=\"left\">sessionCount<\/td><td>Number of sessions of the user<\/td><\/tr><tr><td>5<\/td><td class=\"has-text-align-left\" data-align=\"left\">minSessionTime<\/td><td>Shortest session duration of the user<\/td><\/tr><tr><td>6<\/td><td class=\"has-text-align-left\" data-align=\"left\">maxSessionTime<\/td><td>Longest session duration of the
user<\/td><\/tr><tr><td>7<\/td><td class=\"has-text-align-left\" data-align=\"left\">AvgSessionDurationMinutes<\/td><td>Average session duration of the user, in minutes<\/td><\/tr><tr><td>8<\/td><td class=\"has-text-align-left\" data-align=\"left\">AvgSongsPerSession<\/td><td>Average number of songs the user listened to per session<\/td><\/tr><tr><td>9<\/td><td class=\"has-text-align-left\" data-align=\"left\">artistCount<\/td><td>Number of artists the user has listened to<\/td><\/tr><tr><td>10<\/td><td class=\"has-text-align-left\" data-align=\"left\">level<\/td><td>Whether the user pays for a membership<\/td><\/tr><tr><td>11<\/td><td class=\"has-text-align-left\" data-align=\"left\">SongCount<\/td><td>Number of songs the user has listened to<\/td><\/tr><tr><td>12<\/td><td class=\"has-text-align-left\" data-align=\"left\">days_total_subscription<\/td><td>Number of days the user has been registered<\/td><\/tr><tr><td>13<\/td><td class=\"has-text-align-left\" data-align=\"left\">Region Northeast<\/td><td>Number of user connections in the Northeast<\/td><\/tr><tr><td>14<\/td><td class=\"has-text-align-left\" data-align=\"left\">Region Midwest<\/td><td>Number of user connections in the Midwest<\/td><\/tr><tr><td>15<\/td><td class=\"has-text-align-left\" data-align=\"left\">Region West<\/td><td>Number of user connections in the West<\/td><\/tr><tr><td>16<\/td><td class=\"has-text-align-left\" data-align=\"left\">Region South<\/td><td>Number of user connections in the South<\/td><\/tr><tr><td>17<\/td><td class=\"has-text-align-left\" data-align=\"left\">About<\/td><td>Number of About page events per user<\/td><\/tr><tr><td>18<\/td><td class=\"has-text-align-left\" data-align=\"left\">Add Friend<\/td><td>Number of Add Friend events
per user<\/td><\/tr><tr><td>19<\/td><td class=\"has-text-align-left\" data-align=\"left\">Add to Playlist<\/td><td>Number of Add to Playlist events per user<\/td><\/tr><tr><td>20<\/td><td class=\"has-text-align-left\" data-align=\"left\">Downgrade<\/td><td>Number of Downgrade events per user<\/td><\/tr><tr><td>21<\/td><td class=\"has-text-align-left\" data-align=\"left\">Error<\/td><td>Number of Error page events per user<\/td><\/tr><tr><td>22<\/td><td class=\"has-text-align-left\" data-align=\"left\">Help<\/td><td>Number of Help page events per user<\/td><\/tr><tr><td>23<\/td><td class=\"has-text-align-left\" data-align=\"left\">Home<\/td><td>Number of Home page events per user<\/td><\/tr><tr><td>24<\/td><td class=\"has-text-align-left\" data-align=\"left\">LogOut<\/td><td>Number of Logout events per user<\/td><\/tr><tr><td>25<\/td><td class=\"has-text-align-left\" data-align=\"left\">NextSong<\/td><td>Number of Next Song events per user<\/td><\/tr><tr><td>26<\/td><td class=\"has-text-align-left\" data-align=\"left\">Roll Advert<\/td><td>Number of Roll Advert events per user<\/td><\/tr><tr><td>27<\/td><td class=\"has-text-align-left\" data-align=\"left\">Save Settings<\/td><td>Number of Save Settings events per user<\/td><\/tr><tr><td>28<\/td><td class=\"has-text-align-left\" data-align=\"left\">Settings<\/td><td>Number of Settings page events per user<\/td><\/tr><tr><td>29<\/td><td class=\"has-text-align-left\" data-align=\"left\">SubmitDowngrade<\/td><td>Number of
Submit Downgrade events per user<\/td><\/tr><tr><td>30<\/td><td class=\"has-text-align-left\" data-align=\"left\">SubmitUpgrade<\/td><td>Number of Submit Upgrade events per user<\/td><\/tr><tr><td>31<\/td><td class=\"has-text-align-left\" data-align=\"left\">Thumbs down<\/td><td>Number of Thumbs Down events per user<\/td><\/tr><tr><td>32<\/td><td class=\"has-text-align-left\" data-align=\"left\">Thumbs Up<\/td><td>Number of Thumbs Up events per user<\/td><\/tr><tr><td>33<\/td><td class=\"has-text-align-left\" data-align=\"left\">Upgrade<\/td><td>Number of Upgrade events per user<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>All the features above are used to build our final dataframe, which is composed of 1 dependent variable (churn) and 32 independent variables. It is used to train the following supervised machine learning models:<\/p>\n\n\n\n<ul><li>Logistic Regression<\/li><\/ul>\n\n\n\n<ul><li>SVM<\/li><\/ul>\n\n\n\n<ul><li>Decision Tree<\/li><\/ul>\n\n\n\n<ul><li>Random Forest<\/li><\/ul>\n\n\n\n<p>As seen on the churned users plot, churned users represent a small share of all users. That is why we chose the F1 score as our metric: it is the harmonic mean of precision and recall, which makes it better suited than accuracy to imbalanced classes such as churned users.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM-1024x843.png\" alt=\"\" class=\"wp-image-545\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"843\" 
src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM-1024x843.png\" alt=\"\" class=\"wp-image-545\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM-1024x843.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM-300x247.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM-768x633.png 768w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.10.50-PM.png 1214w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><figcaption>After training and testing each model, the support vector machine obtained the highest F1 score, 74.89%, after hyperparameter tuning.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"data:image\/gif;base64,R0lGODlhAQABAIAAAAAAAP\/\/\/yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\" data-src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM-1024x812.png\" alt=\"\" class=\"wp-image-546\"\/><noscript><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"812\" src=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM-1024x812.png\" alt=\"\" class=\"wp-image-546\" srcset=\"https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM-1024x812.png 1024w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM-300x238.png 300w, https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM-768x609.png 768w, 
https:\/\/www.labo.mathieurella.fr\/wp-content\/uploads\/2022\/06\/Capture-de\u0301cran-2022-06-16-a\u0300-8.11.10-PM.png 1284w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/noscript><\/figure>\n\n\n\n<p>The plot above shows which features impact our model the most when predicting churn. We observe that:<\/p>\n\n\n\n<ul><li>People showing dissatisfaction through thumbs down, logout or downgrade events are the ones most likely to leave.<\/li><li>Apart from these, people from the Midwest and West appear more likely to churn.<\/li><li>Roll adverts are also a feature to consider when looking at churned users.<\/li><li>In contrast, people who have used Sparkify for a long time are less likely to churn or cancel their subscription.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>We built Spark machine learning models to estimate churn for Sparkify&#8217;s users from<br>previous log data.<br>With a 74.89% F1 score, we developed a model that can detect customers at risk of churning.<br>Once we had the dataset, we trained four distinct types of models and fine-tuned each of them to see which performed best. 
<\/p>\n\n\n\n<p>At first sight the results seem acceptable, but we were working on a dataset that was highly imbalanced and too small to be statistically significant. To get better outcomes, it would be interesting to have access to a cluster and test our model on a larger dataset.<\/p>\n\n\n\n<p>Finally, we still managed to predict churn, which is a good start, even if the results would be more relevant with more recent log data (this sample dates from 2018). Sparkify could use our model every day to improve customer satisfaction, using A\/B testing to offer discounts or targeted messages to the people who are likely to churn.<br><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Project Overview &amp; Problem Like Spotify, Sparkify is a music streaming service offering the possibility to its users to enjoy &#8230;<\/p>\n","protected":false},"author":1,"featured_media":549,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/posts\/516"}],"collection":[{"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=516"}],"version-history":[{"count":16,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/posts\/516\/revisions"}],"predecessor-version":[{"id":550,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/posts\/516\/revisions\/550"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=\/wp\/v2\/media\/549"}],"wp:attachment":[{"href":"https:\/\/www.lab
o.mathieurella.fr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=516"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=516"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.labo.mathieurella.fr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=516"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}