Assume this: you are working in the HR department of Ola Cabs and are responsible for hiring and retaining drivers. You've noticed an increase in the number of drivers leaving the company over the past few months, and your manager has tasked you with identifying the reasons for this trend. What do you do? You set up a meeting with the analytics department and explain the problem. This is where I come in: I am the data scientist responsible for finding the reasons for the churn and predicting which drivers will churn in the future.
A few moments later
I have completed the analysis and the report of my findings is as follows:
Ola Cabs provides convenience and transparency by connecting customers to various vehicles. Ola's main focus is to ensure a quality driving experience and retain efficient drivers. Recruiting and retaining drivers is challenging due to high churn rates: losing drivers affects morale, and acquiring new drivers is costlier than retaining current ones. We will build a predictive model to identify drivers who are likely to churn, and we will also determine which driver features contribute most to churn.
The dataset contains monthly information for a segment of drivers for 2019 and 2020, including each driver's demographics (age, gender, city, education level), income, date of joining, last working date, joining designation, grade, total business value, and quarterly rating.
Predictive Algorithms and Metrics
Since I was already proficient in Pandas and scikit-learn, I wanted to explore a new technology. PySpark, a big-data processing library built to handle large datasets, was a great opportunity: I used it for data preprocessing, feature engineering, and machine learning.
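For reference, here is a minimal sketch of loading the data into PySpark; the file name and path are placeholders, not the exact ones used in the project.

```python
from pyspark.sql import SparkSession

# Start a local Spark session and load the monthly driver data.
spark = SparkSession.builder.appName("ola-driver-churn").getOrCreate()

# "ola_driver_scaler.csv" is a placeholder path for the dataset described above.
spark_df = spark.read.csv("ola_driver_scaler.csv", header=True, inferSchema=True)
spark_df.printSchema()
```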
Another advantage of PySpark is that I can execute SQL queries on the data directly, which makes it easy to filter, group, and aggregate. The only thing to keep in mind is that you have to create a temporary view of the data before running SQL queries.
Following is an example of a SQL query to check the change in business value for drivers whose rating decreased.
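A minimal sketch of such a query, assuming `spark` is the active SparkSession and using the column names that appear later in the feature-engineering code:

```python
# Register the PySpark DataFrame as a temporary view so it can be queried with SQL.
spark_df.createOrReplaceTempView("drivers")

# Compare each driver's first and last quarterly rating, then look at the
# business value of the drivers whose rating went down.
rating_drop_df = spark.sql("""
    WITH history AS (
        SELECT Driver_ID,
               `Total Business Value` AS business_value,
               FIRST_VALUE(`Quarterly Rating`) OVER (
                   PARTITION BY Driver_ID ORDER BY reporting_month_year
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS first_rating,
               LAST_VALUE(`Quarterly Rating`) OVER (
                   PARTITION BY Driver_ID ORDER BY reporting_month_year
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               ) AS last_rating
        FROM drivers
    )
    SELECT Driver_ID,
           SUM(business_value) AS total_business_value,
           AVG(business_value) AS avg_business_value
    FROM history
    WHERE last_rating < first_rating
    GROUP BY Driver_ID
""")
rating_drop_df.show(5)
```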
This was a small dataset with some missing values. I wanted to use scikit-learn's KNN Imputer to impute these values. However, I didn't want to go through the hassle of converting my dataset to a Pandas dataframe, applying the KNN Imputer, and then converting it back to a PySpark dataframe.
Luckily, PySpark supports UDFs (User Defined Functions), which let me run scikit-learn's KNN Imputer directly on my PySpark DataFrame.
There are better ways to impute missing values in PySpark, given the time and space complexity of the KNN algorithm; I just wanted to learn how to use UDFs with scikit-learn's functions.
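As a concrete illustration, here is a minimal sketch of one way to do this with a pandas-style UDF via `mapInPandas`. The imputed columns and `n_neighbors` are assumptions, and the single-partition trick only works because the dataset is small.

```python
import pandas as pd
from sklearn.impute import KNNImputer

NUMERIC_COLS = ["Age", "Income"]  # assumption: numeric columns containing the gaps

def impute_with_knn(batches):
    """Run scikit-learn's KNNImputer on each pandas batch of rows."""
    imputer = KNNImputer(n_neighbors=5)
    for pdf in batches:
        # Round so imputed values still fit integer-typed columns in the schema.
        pdf[NUMERIC_COLS] = imputer.fit_transform(pdf[NUMERIC_COLS]).round()
        yield pdf

# The dataset is small, so collapse it into one partition first; that way the
# imputer sees every row at once instead of one partition at a time.
spark_df = spark_df.coalesce(1).mapInPandas(impute_with_knn, schema=spark_df.schema)
```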
Next, I engineered a few new features for each driver: Tenure, Had_Negative_Business, Has_Income_Increased, Has_Rating_Increased, and Churned.
```python
from pyspark.sql import Window, functions as sf

# Window spanning each driver's full history, from first to last reported month.
window_spec = (
    Window.partitionBy("Driver_ID")
    .orderBy("reporting_month_year")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

spark_df = spark_df.withColumns({
    # Propagate the (single) non-null LastWorkingDate to every row of a driver.
    "LastWorkingDate": sf.coalesce(sf.col("LastWorkingDate"), sf.first("LastWorkingDate", True).over(window_spec)),
    "Churned": sf.when(sf.col("LastWorkingDate").isNotNull(), 1).otherwise(0),
    "Had_Negative_Business": sf.when(sf.col("Total Business Value") < 0, 1).otherwise(0),
    "Has_Income_Increased": sf.when(sf.last("Income").over(window_spec) > sf.first("Income").over(window_spec), 1).otherwise(0),
    "Has_Rating_Increased": sf.when(sf.last("Quarterly Rating").over(window_spec) > sf.first("Quarterly Rating").over(window_spec), 1).otherwise(0),
})

# Collapse the monthly records into one row per driver.
agg_map = [
    sf.first("Dateofjoining").alias("Date_Of_Joining"),
    sf.sum("Total Business Value").alias("Total_Business_Value"),
    sf.sum("Had_Negative_Business").alias("Total_Had_Negative_Business"),
    sf.max("Has_Income_Increased").alias("Has_Income_Increased"),
    sf.max("Has_Rating_Increased").alias("Has_Rating_Increased"),
    sf.avg("Total Business Value").cast("int").alias("Avg_Business_Value"),
    sf.max("reporting_month_year").alias("Last_Reporting_Month"),
    sf.max("Age").alias("Age"),
    sf.mode("Gender").alias("Gender"),
    sf.last("Income").alias("Income"),
    sf.sum("Income").alias("Total_Income"),
    sf.first("Education_Level").alias("Education_Level"),
    sf.last("City").alias("City"),
    sf.first("Joining Designation").alias("Joining_Designation"),
    sf.last("Grade").alias("Grade"),
    sf.last("Quarterly Rating").alias("Quarterly_Rating"),
    sf.max("LastWorkingDate").alias("Last_Working_Date"),
    sf.max("Churned").alias("Churned"),
]

merged_df = spark_df.sort("reporting_month_year").groupBy("Driver_ID").agg(*agg_map)
```
Although I am not a Tableau expert, I created a dashboard to visualize the data. You can find the link to the dashboard below. A more detailed exploratory data analysis (EDA) is available in the notebook linked at the end.
The Tableau dashboard covers the following views:
- Driver rating
- Change in ratings for different cities
- Effect on business value when ratings decrease
- Effect of ratings based on the month of the year
- Effect of ratings based on city
- Other features affecting the quarterly rating
Based on this analysis, I recommend the following:
- Drivers should be given raises more often.
- Recently joined drivers should be asked about their expectations from the job, as they tend to churn the most.
- Feedback should be collected from drivers with consistently low ratings.
- Drivers could be assigned to different cities to see whether their ratings improve.
- Ratings could be collected monthly instead of quarterly to reflect progress better.
Across the models' feature importances, Tenure, Quarterly Rating, and Income are the biggest contributors to the predictions.
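As an illustration, this is roughly how those importances can be read off a fitted model; `model` is a placeholder for a fitted classifier with the scikit-learn API (such as the LightGBM model used here) and `feature_names` for its input columns.

```python
import pandas as pd

# `model` is the fitted classifier and `feature_names` the columns it was trained on
# (both placeholders in this sketch).
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(3))
```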
| Model | F1 | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|---|
| LightGBM | 0.940727 | 0.925659 | 0.946154 | 0.935361 | 0.969285 |
| XGBoost | 0.937743 | 0.923261 | 0.960159 | 0.916350 | 0.968273 |
| Gradient Boosting | 0.919325 | 0.896882 | 0.907407 | 0.931559 | 0.955817 |
| Random Forest | 0.904762 | 0.875300 | 0.872792 | 0.939163 | 0.936497 |
From the comparison above, we can conclude that LightGBM performs better than the other models across the metrics.
Recall measures, out of all the drivers who actually churned, how many the model correctly identifies; precision measures, out of all the drivers the model flags as likely to churn, how many actually churn. Assume the company has decided to give a raise to those drivers who are predicted to churn.
Case 1: The business wants the model to detect even a small possibility of churn. Here recall matters most, so the decision threshold should be lowered only as far as needed to catch the potential churners.
Case 2: The business wants to be sure that a driver predicted to churn will actually churn. Here precision matters most, so the threshold should be raised so that only high-confidence predictions are flagged.
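A minimal sketch of how the decision threshold could be tuned for each case, assuming a fitted classifier and a held-out validation set; `y_valid` and the predicted probabilities `probs` are placeholders.

```python
from sklearn.metrics import precision_recall_curve

# probs = model.predict_proba(X_valid)[:, 1]  # churn probabilities (placeholder)
precision, recall, thresholds = precision_recall_curve(y_valid, probs)

# Case 1: favour recall -> the highest threshold that still keeps recall above a target.
recall_target = 0.95
case1_threshold = thresholds[recall[:-1] >= recall_target].max()

# Case 2: favour precision -> the lowest threshold whose precision clears a target.
precision_target = 0.95
case2_threshold = thresholds[precision[:-1] >= precision_target].min()
```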
Conclusion
I have hosted the app on Streamlit; the link is below. Although the final model was built using PySpark, I couldn't host it on Streamlit due to resource constraints, so I rebuilt the model with scikit-learn, using the same hyperparameters and features, and hosted that instead.
Streamlit shuts down the app after a few hours of inactivity; please click the 'Wake up' button to restart it.
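The snippet below is a minimal sketch of what such a Streamlit app could look like; the model file name and the reduced set of inputs are placeholders, not the hosted app itself.

```python
import pickle

import pandas as pd
import streamlit as st

# Placeholder artefact: the scikit-learn model retrained with the same
# hyperparameters/features as the PySpark model, saved alongside the app.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Ola Driver Churn Prediction")

# A few illustrative inputs; the real app would expose every model feature.
tenure = st.number_input("Tenure (months)", min_value=0, value=12)
quarterly_rating = st.slider("Latest quarterly rating", 1, 5, 3)
income = st.number_input("Monthly income", min_value=0, value=50000)

if st.button("Predict"):
    features = pd.DataFrame(
        [{"Tenure": tenure, "Quarterly_Rating": quarterly_rating, "Income": income}]
    )
    proba = model.predict_proba(features)[0, 1]
    st.write(f"Estimated churn probability: {proba:.2%}")
```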
The main notebook used for the analysis can be found below. You can view it with nbviewer or open it in Google Colab.
The complete code can be found in the GitHub repository below.