PySpark for Churn Analysis

8 min read · Nov 17, 2020

Predicting churn is a common and challenging problem for customer-facing businesses. Since the predictions are usually made from a huge volume of user activity logs, we need a distributed way to process a large dataset efficiently without having to fit it all in memory at once.

Project: Analysing Customer Churn with PySpark

Definition

Project Overview
Problem Statement
Metrics

Analysis

Data Exploration
Data Visualisation

Methodology

Data Pre-processing
Implementation
Refinement

Conclusion

Reflection
Challenges
Files
Software Requirements

I. Definition

Project Overview

You might have heard of the two music streaming giants: Apple Music and Spotify. Which one is better? Well, that depends on multiple factors, like the UI/UX of their apps, the frequency of new content, user-curated playlists and subscriber count. The factor we are studying is called the churn rate. Churn rate has a direct impact on the subscriber count and also on the long-term growth of the business.

So what is this churn rate anyway?

For a business, the churn rate is a measure of the number of customers leaving the service or downgrading their subscription plan within a given period of time.

Problem Statement

Imagine you are working on the data team for a popular digital music service similar to Spotify or Pandora. Millions of users stream their favourite songs from your service every day, either using the free tier that plays advertisements between songs or using the premium subscription model, where they stream music ad-free but pay a flat monthly rate. Users can upgrade, downgrade or cancel their service at any time, so it is crucial to make sure your users love the service.

In this project, our aim is to identify customer churn for Sparkify (a fictional Spotify-like music streaming service). The analysis does not cover the variety of music that the service provides (such as genres, curated playlists or top regional charts). It mainly explores user behaviour and how we can identify the possibility that a user will churn. Such customers are those who decide to downgrade their service, i.e. go from a paid subscription to the free tier, or leave the service entirely.

Our target variable is isChurn. It cannot be read directly from the JSON file, so we will use feature engineering to create it. The isChurn column is 1 for users who visited the Cancellation Confirmation page and 0 otherwise.
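The labelling rule can be sketched in plain Python on a toy event log. This is only an illustration of the logic, not the project's PySpark code: the field names userId and page are assumptions about the schema, and the four events are made up. In PySpark the same idea would typically be written as a per-user aggregation.

```python
# Toy event log standing in for rows of the JSON file.
# The field names "userId" and "page" are assumptions about the schema.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]


def label_churn(events):
    """Return {userId: isChurn}, where isChurn is 1 if the user ever
    reached the Cancellation Confirmation page and 0 otherwise."""
    labels = {}
    for event in events:
        hit = int(event["page"] == "Cancellation Confirmation")
        user = event["userId"]
        # Once a user is flagged as churned, the flag stays at 1.
        labels[user] = max(labels.get(user, 0), hit)
    return labels


is_churn = label_churn(events)
print(is_churn)
```

The label is defined per user rather than per event, so every user ends up with exactly one value.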

II. Analysis

Data Exploration

The input data file is mini_sparkify_event_data.json in the data directory. It contains 286500 rows and 18 columns. data/metadata.xlsx contains information about the features.

A preview of data:

I have used the toPandas() method above because 18 columns cannot be displayed in a user-friendly way by PySpark's built-in .show() method.

Feature Space

Univariate Plots

1. Distribution of pages

Cancellations are rare. They are what we have to predict.

We will remove the Cancel and Cancellation Confirmation classes in our modelling section to avoid lookahead bias, since these pages directly reveal the target.
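One way to picture this step is to skip the two cancellation pages when building per-user page counts. The sketch below is plain Python on made-up events, with userId and page assumed as field names. It shows the idea only and is not the feature pipeline used in the project.

```python
from collections import Counter

# Pages that reveal the target and must not be used as features.
LEAKY_PAGES = {"Cancel", "Cancellation Confirmation"}


def page_count_features(events):
    """Count page visits per user, skipping the pages that leak the label."""
    features = {}
    for event in events:
        if event["page"] in LEAKY_PAGES:
            continue
        features.setdefault(event["userId"], Counter())[event["page"]] += 1
    return features


# Made-up events for a single user who ends up cancelling.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancel"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
]

features = page_count_features(events)
print(features["u1"])
```

If these two pages stayed in the feature set, a model could reach a near-perfect score simply by reading the answer from its inputs.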

Most commonly browsed pages include activities like the addition to a playlist, home page, and thumbs up.

2. Distribution of levels (free or paid)

70% of churned users are paying customers. Customer retention is relatively more important for paid ones because they are directly connected with the revenue of the company.

3. Song length

No additional information is available from this. It just shows that most songs are about 4 minutes long.

4. What type of device is the user streaming from?

This is what we expected. Windows is the most used platform.

Multivariate Plots

1. Gender distribution

Male users are in the majority.

2. Distribution of pages based on churn

No strong conclusion can be drawn from this graph. It shows the same common actions as the univariate plot.

3. Distribution of hour based on churn

We can see that non-churn users are more active during the daytime.

4. Behaviour across weekdays

Activity is higher on weekdays, especially for churned users. However, the difference is not significant.

5. Behaviour at the month level

Non-churn users are generally less active than churn users at the start of the month, and the opposite is the case at the end of the month.
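The hour, weekday and month-level views all need time features derived from each event's timestamp. A minimal sketch follows, assuming the timestamp is stored as epoch milliseconds and interpreted in UTC. Both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timezone

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def time_features(ts_ms):
    """Derive hour, weekday and day of month from an epoch-millisecond
    timestamp, interpreted in UTC (an assumption for this sketch)."""
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "hour": moment.hour,
        "weekday": WEEKDAY_NAMES[moment.weekday()],
        "day_of_month": moment.day,
    }


# 2018-10-01 15:00:00 UTC expressed in milliseconds.
example = time_features(1538352000000 + 15 * 3600 * 1000)
print(example)
```

Grouping events by these derived values and by isChurn gives the counts behind plots like the ones above.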

Target Space

For the page column, we have 22 distinct values:

About
Add Friend
Add to Playlist
Cancel
Cancellation Confirmation
Downgrade
Error
Help
Home
Login
Logout
NextSong
Register
Roll Advert
Save Settings
Settings
Submit Downgrade
Submit Registration
Submit Upgrade
Thumbs Down
Thumbs Up
Upgrade

We are interested in the fifth category. Our target variable is isChurn. It is 1 if the user visited the Cancellation Confirmation page and 0 otherwise.

Data Visualisation

The imbalance in the data suggests that we should not use accuracy as our evaluation metric. We will use the F1 score instead and use under-sampling to further optimise it.

III. Methodology

Data Preprocessing

Handling null values

First, we will remove null values for some columns. Two distinct null counts are observed across the columns: 8346 and 58392.

58392 is about 20% of the data (286500 rows) and 8346 is only about 3%. So we keep the columns that have about 3% nulls and check whether the missing values in the 20% columns can be imputed in some way.

These are the columns with 20% missing values. They seem difficult to impute, so we will drop the rows that have null values in these columns.
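The percentages follow directly from the counts given above. The sketch below reproduces that arithmetic and shows row dropping on a toy table in plain Python. The column names in the toy rows are hypothetical placeholders, not a statement about which columns the project dropped on.

```python
TOTAL_ROWS = 286500
NULL_COUNTS = [8346, 58392]  # the two distinct null counts observed


def null_share(null_count, total_rows):
    """Percentage of rows that are null for a column."""
    return 100 * null_count / total_rows


shares = [round(null_share(count, TOTAL_ROWS), 1) for count in NULL_COUNTS]
print(shares)


def drop_rows_with_nulls(rows, columns):
    """Keep only rows where none of the given columns is None."""
    return [row for row in rows
            if all(row[col] is not None for col in columns)]


# Toy rows; "col_a" and "col_b" are hypothetical column names.
rows = [
    {"col_a": "x", "col_b": 1.0},
    {"col_a": None, "col_b": 2.0},
    {"col_a": "y", "col_b": None},
]
clean_rows = drop_rows_with_nulls(rows, ["col_a", "col_b"])
print(len(clean_rows))
```

In PySpark, dropping rows on a subset of columns is usually done with the DataFrame's dropna method and its subset argument.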

Implementation

We use the same training and testing features for all the models. PySpark’s ML library (pyspark.ml) provides the most common machine learning classification algorithms. Others are still in development, like Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.

The ones which we’ll be using are:

Logistic Regression
Random Forest Classifier
Gradient Boosting Tree Classifier

Refinement

Since the class distribution is highly imbalanced, we will perform random undersampling to optimize our F1 score.
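Random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class. Here is a minimal plain-Python sketch using the user counts reported in this article (225 users, 52 churned). The function name, the seed and the 1:1 target ratio are assumptions, and in practice the resampling should be applied to the training split only.

```python
import random


def undersample(rows, label_key="isChurn", seed=42):
    """Balance a binary data set by randomly discarding majority-class rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = [row for row in rows if row[label_key] == 1]
    negatives = [row for row in rows if row[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    kept_majority = rng.sample(majority, len(minority))
    return minority + kept_majority


# 225 toy users, of which the first 52 are labelled as churned.
users = [{"userId": i, "isChurn": int(i < 52)} for i in range(225)]
balanced = undersample(users)
print(len(balanced), sum(row["isChurn"] for row in balanced))
```

The price of balancing this way is that most of the majority class is thrown away, which matters a great deal when the data set is already small.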

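F1 is the harmonic mean of precision and recall, i.e. F1 = 2 × Precision × Recall / (Precision + Recall). Precision is True Positive / (True Positive + False Positive) and recall is True Positive / (True Positive + False Negative).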

This article will deepen your understanding of why to use the F1 score when evaluating your model on an imbalanced data set.

Comparison of average metrics before and after under-sampling:

Model | Average Metrics Before | Average Metrics After
Logistic Regression | 0.717, 0.684, 0.681 | 0.486, 0.344, 0.192
Random Forest Classifier | 0.710, 0.699, 0.698 | 0.540, 0.537, 0.499
Gradient Boosting Tree Classifier | 0.710, 0.705, 0.684 | 0.629, 0.627, 0.616

Since the data size is still relatively small and the performance difference is huge, we prefer the model that performs best. The GBT Classifier provided a fairly good F1 score of 0.64 after under-sampling. Therefore, we choose the GBT model as our final model and conduct a grid search to fine-tune it.

Hyperparameter tuning

Grid search to fine tune GBT Classifier.
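Conceptually, a grid search tries every combination of candidate hyperparameter values and keeps the one with the best validation score. The sketch below shows that loop in plain Python. maxDepth and maxIter are real GBTClassifier parameter names, but the candidate values and the toy scoring function are placeholders. In PySpark this loop is normally delegated to ParamGridBuilder and CrossValidator from pyspark.ml.tuning.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values, not the grid used in the project.
param_grid = {"maxDepth": [3, 5], "maxIter": [10, 20]}


def toy_score(params):
    """Placeholder for 'train a GBT model with these params and return its
    validation F1'. It simply peaks at maxDepth=5, maxIter=10."""
    return -abs(params["maxDepth"] - 5) - abs(params["maxIter"] - 10)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The number of models trained grows multiplicatively with each added parameter, which is one reason a full grid search on a local machine gets slow quickly.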

Metrics

Out of 225 unique users, only 52 churned (23.1%). So, accuracy will not be a good metric to handle this imbalance. We will instead use the F1 score to evaluate our model.
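To see why accuracy is misleading here, consider a trivial model that predicts "no churn" for every user. The short calculation below uses only the two counts given above; the variable names are illustrative.

```python
TOTAL_USERS = 225
CHURNED = 52

# Share of users who churned.
churn_rate = CHURNED / TOTAL_USERS

# A model that always predicts "no churn" is right for every non-churned user.
baseline_accuracy = (TOTAL_USERS - CHURNED) / TOTAL_USERS

# The same model finds none of the churned users.
baseline_recall = 0 / CHURNED

print(round(churn_rate, 3), round(baseline_accuracy, 3), baseline_recall)
```

Such a model looks respectable on accuracy while being useless for the actual task, which is exactly the failure the F1 score exposes.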

F1-Score is the harmonic mean of precision and recall.

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)

We use the F1 score here because it gives us a single measure that combines the precision (whether we send the offer to the right person) and the recall (whether we miss someone who should have received the offer) of the model. We want to identify the users who are likely to churn and give them special offers to try to keep them, but at the same time we do not want to send too many offers (most likely a monetary incentive) to users who are not likely to churn and thereby waste money and resources.
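The formulas above translate directly into a small helper. The confusion-matrix counts in the example call are hypothetical and chosen only to make the arithmetic easy to follow. They are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for any quantity whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 churners caught, 10 false alarms, 20 churners missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a model cannot score well by ignoring the minority class.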

IV. Conclusion

Reflection

I enjoyed the data pre-processing part of the project. For the data visualisation part, instead of using “for” loops to get arrays for our bar charts, I first converted our Spark data frame to a Pandas data frame using the toPandas() method. Data visualisation is easier from then on.

The shape of our final data is just 225 x 32. This is too small for our model to generalise. Just 225 users for a streaming company? That’s nothing. The full 12 GB data set might provide some useful results. If you want more statistically significant results, I suggest you run this notebook on Amazon EMR for the 12 GB data set. I have skipped that part for now because it costs $30 for one week.

Challenges

Some of the challenges which I faced in this project are:

The official documentation of PySpark is less helpful as compared to that of Pandas
For the sake of mastering Spark, we only used the most common machine learning classification models instead of using the advanced ones
Highly imbalanced data led to a poor F1 score
If you run this notebook on your local machine without any change, then it will take around an hour to run completely

Further improvements

A larger dataset would help to get a richer exploratory analysis of the churned users. Optimised data preparation can complement feature engineering, and a comprehensive grid search run on cloud computing platforms such as AWS or IBM can improve model training and testing.

V. Files

1. Folders gbtModel, lrModel and rfModel

Saved models before under-sampling

2. Folders new_gbt_model, new_lr_model and new_rf_model

Saved models after under-sampling

3. helper.py

Helper functions for Plotly visualizations

VI. Software Requirements

This project uses Python 3.6.6, and the necessary libraries are listed in the requirements.txt file.

VII. References

Check out the code on GitHub: here