Predicting churn is a common and challenging problem for customer-facing businesses. Because the predictions are usually made from a huge number of user activity logs, we need a distributed way to process a large data set efficiently without having to fit it all in memory at once.
Project: Analysing Customer Churn with PySpark
Table of Contents
Definition
- Project Overview
- Problem Statement
- Metrics
Analysis
- Data Exploration
- Data Visualisation
Methodology
- Data Pre-processing
- Implementation
- Refinement
Conclusion
- Reflection
- Challenges
- Files
- Software Requirements
I. Definition
Project Overview
You might have heard of the two music streaming giants, Apple Music and Spotify. Which one is better? That depends on multiple factors, such as the UI/UX of the app, the frequency of new content, user-curated playlists and subscriber count. The factor we study here is the churn rate, which has a direct impact on subscriber count and on the long-term growth of the business.
So what is this churn rate anyway?
For a business, the churn rate is a measure of the number of customers leaving the service or downgrading their subscription plan within a given period of time.
Problem Statement
Imagine you are working on the data team for a popular digital music service similar to Spotify or Pandora. Millions of users stream their favourite songs through your service every day, either on the free tier that plays advertisements between songs or on the premium subscription model, where they stream music ad-free but pay a monthly flat rate. Users can upgrade, downgrade or cancel their service at any time, so it is crucial to make sure your users love the service.
In this project, our aim is to identify customer churn for Sparkify (a fictional Spotify-like music streaming service). This does not include the variety of music that the service provides (such as genres, curated playlists or top regional charts). It mainly explores user behaviour and how we can identify the possibility that a user will churn. Churned customers are those who decide to downgrade their service, i.e. go from a paid subscription to free, or leave the service entirely.
Our target variable is isChurn. It cannot be interpreted directly from the JSON file, but we will use feature engineering to create it. The isChurn column is 1 for users who visited the Cancellation Confirmation page and 0 otherwise.
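As a rough illustration of this labelling rule, the sketch below marks a user as churned if any of their events has the page Cancellation Confirmation. It uses plain Python on a few made-up events rather than the project's actual PySpark code; the userId field name and the sample records are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical event log: field names and values are illustrative only.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancel"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "Home"},
    {"userId": "2", "page": "Thumbs Up"},
]


def label_churn(events):
    """Return {userId: isChurn}, where isChurn is 1 if the user ever
    reached the Cancellation Confirmation page and 0 otherwise."""
    is_churn = defaultdict(int)
    for event in events:
        hit = int(event["page"] == "Cancellation Confirmation")
        # Keep the label at 1 once any event of the user has triggered it.
        is_churn[event["userId"]] = max(is_churn[event["userId"]], hit)
    return dict(is_churn)


labels = label_churn(events)
print(labels)
```

In PySpark the same idea would typically be expressed as a per-user aggregation over a 0/1 flag column, but the exact implementation depends on how the notebook builds its features.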
II. Analysis
Data Exploration
The input data file is mini_sparkify_event_data.json in the data directory. The data set has 286500 rows and 18 columns. data/metadata.xlsx contains information about the features.
A preview of data:
I have used the toPandas() method above because 18 columns cannot be displayed in a user-friendly way by PySpark's built-in .show() method.
Feature Space
Univariate Plots
1. Distribution of pages
Cancellations are rare, and they are what we have to predict.
We will remove the Cancel and Cancellation Confirmation classes in the modelling section to avoid lookahead bias, since these pages directly encode the outcome we are trying to predict.
The most commonly browsed pages include adding to a playlist, the home page, and thumbs up.
2. Distribution of levels (free or paid)
70% of churned users are paying customers. Retaining paid customers is relatively more important because they are directly tied to the company's revenue.
3. Song length
This plot adds little information. It just shows that most songs are about 4 minutes long.
4. What type of device is the user streaming from?
This is what we expected: Windows is the most used platform.
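Returning to the Cancel and Cancellation Confirmation pages discussed under the distribution of pages: one way to keep them out of the model's inputs is to filter them before building per-user page counts. The sketch below is a plain-Python illustration under that assumption; the sample page list and the helper name page_counts are hypothetical.

```python
from collections import Counter

# Pages that reveal the label and would leak it into the features.
LEAKY_PAGES = {"Cancel", "Cancellation Confirmation"}


def page_counts(pages):
    """Count page visits for one user, skipping pages that reveal the label."""
    return Counter(page for page in pages if page not in LEAKY_PAGES)


# Hypothetical page history of a single user.
user_pages = [
    "NextSong",
    "NextSong",
    "Thumbs Up",
    "Cancel",
    "Cancellation Confirmation",
]
features = page_counts(user_pages)
print(features)
```

The label itself is still derived from the unfiltered events; only the feature side drops these pages.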
Multivariate Plots
1. Gender distribution
Male users are in the majority.
2. Distribution of pages based on churn
No strong conclusion can be drawn from this graph. It shows the same common actions as the univariate plot.
3. Distribution of hour based on churn
We can see that non-churn users are more active during the daytime.
4. Behaviour across weekdays
Activity is higher on weekdays, especially for churned users. However, the difference is not significant.
5. Behaviour at the month level
Non-churn users are generally less active at the start of the month than churn users, and the opposite holds at the end of the month.
Target Space
The page column has 22 distinct values:
- About
- Add Friend
- Add to Playlist
- Cancel
- Cancellation Confirmation
- Downgrade
- Error
- Help
- Home
- Login
- Logout
- NextSong
- Register
- Roll Advert
- Save Settings
- Settings
- Submit Downgrade
- Submit Registration
- Submit Upgrade
- Thumbs Down
- Thumbs Up
- Upgrade
We are interested in the fifth category, Cancellation Confirmation. Our target variable is isChurn. It is 1 if the user visited the Cancellation Confirmation page and 0 otherwise.
Data Visualisation
This imbalanced data suggests that we should not use accuracy as our evaluation metric. We will use F1 score instead and use under-sampling to further optimise it.
III. Methodology
Data Preprocessing
- Handling null values
First, we will remove null values for some columns. Two distinct null counts are observed: 8346 and 58392.
58392 is about 20% of the data (286500 rows) and 8346 is only about 3%. So we keep the columns with the smaller share of NaNs and check whether the missing values in the 20% columns can be imputed in some way.
These are the columns with 20% missing values. It seems difficult to impute them, so we will drop the rows that have null values in these columns.
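The two null shares can be checked directly from the counts quoted above, and dropping rows with nulls in selected columns is a simple filter. The sketch below shows both in plain Python; the row contents and the column name song are hypothetical and only serve to make the example runnable.

```python
TOTAL_ROWS = 286500


def null_percentage(null_count, total_rows=TOTAL_ROWS):
    """Share of rows with a null value, as a percentage rounded to 1 decimal."""
    return round(100 * null_count / total_rows, 1)


def drop_rows_with_nulls(rows, columns):
    """Keep only the rows in which none of the given columns is None."""
    return [
        row for row in rows
        if all(row.get(col) is not None for col in columns)
    ]


# Null shares for the two distinct null counts observed in the data.
shares = {count: null_percentage(count) for count in (8346, 58392)}
print(shares)

# Hypothetical rows and column names, purely for illustration.
rows = [
    {"userId": "1", "song": "Song A"},
    {"userId": "2", "song": None},
]
clean = drop_rows_with_nulls(rows, ["song"])
print(clean)
```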
Implementation
We use the same training and testing features for all the models. PySpark’s ML library (pyspark.ml) provides the most common machine learning classification algorithms. Others are still in development, like Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
The ones we will be using are logistic regression, the random forest classifier and the gradient-boosted tree (GBT) classifier.
Refinement
Since the class distribution is highly imbalanced, we will perform random undersampling to optimize our F1 score.
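A minimal sketch of random under-sampling is shown below, assuming the goal is a 1:1 class ratio (the article does not state the target ratio). It uses plain Python lists of dicts rather than Spark data frames, and the user counts mirror the 225 users and 52 churned users of this data set; the field names are assumptions.

```python
import random


def undersample(rows, label_key="isChurn", seed=42):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class, the longer one the majority.
    minority, majority = sorted([positives, negatives], key=len)
    sampled_majority = rng.sample(majority, len(minority))
    balanced = minority + sampled_majority
    rng.shuffle(balanced)
    return balanced


# 225 users in total, of which 52 churned.
rows = [{"userId": i, "isChurn": int(i < 52)} for i in range(225)]
balanced = undersample(rows)
print(len(balanced), sum(r["isChurn"] for r in balanced))
```

Under-sampling is normally applied to the training split only, so that the test set keeps the original class distribution.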
F1 is the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall).
The article on the F1 score listed in the References explains in more depth why it is a better metric for evaluating a model on an imbalanced data set.
Comparison of average metrics before and after under-sampling:
Model | Average Metrics Before | Average Metrics After
Logistic Regression | 0.717, 0.684, 0.681 | 0.486, 0.344, 0.192
Random Forest Classifier | 0.710, 0.699, 0.698 | 0.540, 0.537, 0.499
Gradient Boosting Tree Classifier | 0.710, 0.705, 0.684 | 0.629, 0.627, 0.616
Since the data set is still relatively small and the performance differences are large, we prefer the model that performs best. The GBT classifier provided a fairly good F1 score of 0.64 after under-sampling. We therefore choose the GBT model as our final model and run a grid search to fine-tune it.
Hyperparameter tuning
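The idea behind a grid search can be sketched in a few lines of plain Python: try every combination of candidate values and keep the one with the best score. The parameter names below mirror GBTClassifier's maxDepth and maxIter, but the candidate values and the scoring function are placeholders, not the ones used in this project.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score).

    `evaluate` is any callable that takes a dict of parameters and returns
    a score where higher is better (for example, a validation F1 score).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: the candidate values are illustrative only.
param_grid = {"maxDepth": [3, 5], "maxIter": [10, 20]}


# Stand-in scorer so the sketch runs without Spark. In practice this would
# train the model with `params` and return its validation F1 score.
def toy_score(params):
    return -abs(params["maxDepth"] - 5) - abs(params["maxIter"] - 10)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

In PySpark itself, pyspark.ml.tuning offers ParamGridBuilder and CrossValidator for this purpose, which combine the grid with cross-validation.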
Metrics
Out of 225 unique users, only 52 churned (23.1%). Accuracy is therefore not a good metric for this imbalanced data, so we will use the F1 score to evaluate our models instead.
F1-Score is the harmonic mean of precision and recall.
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
We use the F1 score because it gives a simple combined measure of precision (whether we send offers to the right people) and recall (whether we miss anyone who should have received an offer). We want to identify users who are likely to churn and give them special offers to try to keep them, but we do not want to send too many offers (most likely monetary incentives) to users who are unlikely to churn, which would waste money and resources.
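To make the argument against accuracy concrete, the sketch below computes precision, recall and F1 for the positive (churn) class and applies them to a trivial model that predicts "no churn" for all 225 users. It is a plain-Python illustration; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# 52 churned users out of 225, as in this data set.
y_true = [1] * 52 + [0] * 173

# A trivial model that predicts "no churn" for everyone.
always_stay = [0] * 225
accuracy = sum(t == p for t, p in zip(y_true, always_stay)) / len(y_true)
baseline = precision_recall_f1(y_true, always_stay)
print(round(accuracy, 3), baseline)
```

The trivial model reaches roughly 77% accuracy while finding no churned user at all, which is why its F1 score for the churn class is zero.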
IV. Conclusion
Reflection
I enjoyed the data pre-processing part of the project. For the data visualisation part, instead of using “for” loops to get arrays for our bar charts, I first converted our Spark data frame to a Pandas data frame using the toPandas() method. Data visualisation is easier from then on.
The shape of our final data is just 225 x 32, which is too small for the model to generalise. Just 225 users for a streaming company? That’s nothing. The full 12 GB data set might provide more useful results. If you want more statistically significant results, I suggest you run this notebook on Amazon EMR with the 12 GB data set. I have skipped that part for now because it costs $30 for one week.
Challenges
Some of the challenges which I faced in this project are:
- The official PySpark documentation is less extensive than that of Pandas
- For the sake of mastering Spark, we only used the most common machine learning classification models instead of using the advanced ones
- Highly imbalanced data led to a poor F1 score
- If you run this notebook on your local machine without any change, then it will take around an hour to run completely
Further improvements
A larger data set would allow a richer exploratory analysis of the churned users. Optimised data preparation can complement feature engineering, and a comprehensive grid search run on a cloud platform such as AWS or IBM can improve model training and testing.
V. Files
1. Folders gbtModel, lrModel and rfModel
Saved models before under-sampling
2. Folders new_gbt_model, new_lr_model and new_rf_model
Saved models after under-sampling
3. helper.py
Helper functions for Plotly visualizations
VI. Software Requirements
This project uses Python 3.6.6. The necessary libraries are listed in the requirements.txt file.
VII. References
- Getting Pandas like dummies in PySpark
- Using multiple if-else conditions in a list comprehension
- Business Insider article on classifying region based on U.S. State
- Write single CSV file (instead of batching) using Spark
- Python API docs for Logistic Regression
- Python API docs for Random Forest Classifier
- Python API docs for GBTClassifier
- Knowledge about F1 score and why it is a better metric for imbalanced data set
Check out the code on GitHub: here