Is predictive churn modeling a new thing? Certainly not. The loss of profitable customers is an inherent challenge in all competitive industries, and for many companies this has been among the very first application areas for machine learning. Still, we see repeated examples of such initiatives not meeting expectations when it comes to actual improvements in customer retention.
So, how to do better? We have learned a great deal ourselves over the past five years, building on constant developments in business know-how, data availability and machine learning capabilities. In this article, we share some key steps to successfully implement a churn prediction model and actually reduce churn, based on our experience in industries such as banking, media and telecom, as well as a master's thesis written by two of the authors in collaboration with a large Nordic financial institution.
Before diving into our churn prediction tips, we first encourage you to consider the churn problem holistically. An important starting point is to have control over target figures such as churn rates and customer profitability, so you can quantify the problem and the financial upside of reducing the churn rate. Once that is in place, we recommend four categories of measures that take advantage of both analytics and domain expertise, as illustrated in the figure below. The topic of this article, churn prediction, is highlighted in green; it is highly operational, taking actions to prevent churn at the individual customer level. Thinking more strategically, it is also possible to incorporate anti-churn measures in product development and marketing, in customer journeys (which we have previously written an article about), and by finding churn drivers to re-engineer situations where churn occurs.
Now, let’s get into our top tips for successful churn prediction.
One common pitfall when embarking on any analytics project is the lack of a clearly defined business problem. A problem definition sets the goal for the analytical model (i.e., defines the target variable we want our model to predict), describes how it will be operationalized, and gives structure to the business case. When working with customer churn this means that we need to clearly define what constitutes churn. Often, there may be multiple definitions of churn within the same line of business. We recommend that analytics and business resources work closely together; domain expertise is crucial to produce definitions that are both feasible to extract from the company's data and have the potential for value creation, such as identifying profitable customers who can be influenced to stay.
There are several elements that should be considered in the definitions, most typically:
We want to identify the customers we want to keep long-term. Ideally, we know each customer’s historic profitability, or even better, future estimated profitability (predicted customer lifetime value), and can filter out customers with negative profitability. If this information is not readily available, identifying heuristics for profitability such as tenure or total revenue might make sense.
What constitutes churn? And what specific event do we want to predict? For a subscription-based service, churn is typically defined as a canceled subscription, but is it the customer churn intent that we want to predict, or is it the actual churn (i.e. actual deactivation of an agreement)? It depends on when we can counteract the churn, which determines the timing of the anti-churn measure. Within retail, can “no purchases within a certain timeframe” be the churn event? Ensure crystal clear definitions of these matters!
This is often straightforward, but a complicated internal product structure can make this messy. Does it make sense to group together similar products (e.g., within banking, different long-term saving products), both to keep things simple and to have more historical churn events as examples to train our model on?
The most important consideration when deciding the length of the prediction window is how quickly the business can act on the prediction, which connects directly to the churn event and the related anti-churn measure (point 2 above). A short window will typically not allow sufficient time to contact the customer (and may also make training a prediction model harder, as the fraction of customers labeled "churn" in the training data shrinks with the window), whereas a long window gives too general a prediction (in the long term, all customers churn in one form or another!).
Can we differentiate between customers who have churned willingly (which we can influence to stay) and customers who have stopped using the product or service due to some other external circumstance such as relocation or no longer having any need? Clearly, our anti-churn measures will be futile in the latter group.
We summarize these considerations in the below figure.
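To make the definition work above concrete, here is a minimal sketch of how such definitions translate into training labels. The 90-day window, the cutoff date, and the rule "churn = no purchase inside the prediction window" are our own illustrative assumptions, not a recommendation for any specific business:

```python
from datetime import date, timedelta

# Illustrative assumption: churn = no purchase within a 90-day
# prediction window following the observation cutoff date.
PREDICTION_WINDOW = timedelta(days=90)

def label_churn(purchases_by_customer, cutoff, window=PREDICTION_WINDOW):
    """Label each customer 1 (churned) if none of their purchases fall
    inside the prediction window [cutoff, cutoff + window), else 0."""
    labels = {}
    for customer, purchase_dates in purchases_by_customer.items():
        # Any purchase during the prediction window counts as activity.
        active = any(cutoff <= d < cutoff + window for d in purchase_dates)
        labels[customer] = 0 if active else 1
    return labels

purchases = {
    "A": [date(2023, 1, 5), date(2023, 4, 1)],  # active in the window
    "B": [date(2022, 11, 20)],                  # silent after cutoff
}
labels = label_churn(purchases, cutoff=date(2023, 3, 1))
```

In practice the same sweep would run over every customer snapshot in the training period, with the cutoff shifted to generate many labeled examples per customer.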
After carefully defining the churn variable so that you can train, test and deploy an accurate prediction model, how should you reach out to the identified customers? What measures can you take to retain them? A special offer, a “check-in” conversation, or a lower price? An anti-churn project should include strong involvement of domain experts to shortlist possible actions, which are then piloted to test which work best (A/B testing of different offers for separate experiment groups, as explained in one of our previous articles).
It is possible to make this process more efficient by finding churn drivers for each individual customer, answering the question "why will this customer churn?". This can help narrow down what actions are worth testing and create differentiated customer groups for even more accurate targeting of actions. Essentially, we would like to know how important each feature (a data variable describing a customer characteristic, behavior etc.) is in making the specific prediction, and in what direction (lower/higher churn probability) the feature value contributed to the prediction. We want to know this independently of our choice of machine learning model, as less explainable "black box" models such as XGBoost and deep neural nets are likely to give the best performance. One way to achieve this is SHAP (SHapley Additive exPlanations), which we have recently used in several churn projects. We won't go into the technical details here, but this blog post provides an overview of the theory behind SHAP. In Python, the SHAP tools we need are implemented in the shap package.
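To give a feel for what the shap package computes under the hood, here is a self-contained toy sketch of exact Shapley attribution, using only the standard library. The `churn_score` function and its coefficients are invented for illustration; a real project would fit a model and pass it to the shap library instead:

```python
from itertools import permutations

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.
    'Absent' features are set to the baseline (background) value,
    mirroring the idea behind the SHAP framework."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)       # start from the background
        prev = f(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            val = f(current)
            phi[i] += val - prev       # marginal contribution
            prev = val
    return [p / len(perms) for p in phi]

# Invented toy churn-score model with an interaction between data-plan
# overrun and service calls (illustrative only, not a fitted model).
def churn_score(v):
    usage_overrun, service_calls = v
    return (0.18 + 0.2 * usage_overrun + 0.1 * service_calls
            + 0.15 * usage_overrun * service_calls)

phi = exact_shapley(churn_score, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The attributions always sum to the difference between the customer's prediction and the baseline probability, which is exactly the property that makes waterfall charts like the one below add up.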
The figure below shows a fictitious example of a SHAP waterfall chart for a single customer, calculated from a churn model in a telecom company. The green boxes indicate an increase in the probability of churn, while the blue boxes indicate a decrease in the probability of churn, compared to a baseline probability of 0.18.
This plot tells us that the top churn driver for this customer is frequent contact with customer service. This variable could be an indication of a customer who has experienced some churn-triggering event which required lots of interaction with customer service to resolve (e.g., loss of cell phone service), or poor customer service itself could be driving churn. Discussion with domain experts who understand the operation of the business and have a good understanding of customer behavior and churn events is crucial to make the right judgment calls in such situations: do we understand the driver well enough to create an appropriate anti-churn action? Essentially, what we are after is to determine which features have a causal relationship with churn and which are merely correlated with it. The causal features are the ones you want to base your churn actions on!
Continuing with our example, let’s say we know for sure that the “number of contacts to customer service” is indicative of an underlying problem which causes churn (and frequent calls to the customer service department), and not in itself a cause of churn. Coming up with measures to decrease the number of times a customer calls customer service is thus not a good strategy for reducing churn, even though there is a clear correlation between a customer’s frequent contacts with the company and their propensity to churn! We will not delve too deeply into the issue of correlation vs causation, but we encourage the interested reader to read this article, which explores some common pitfalls and explains approaches to infer causal relationships.
We also see that surpassing the limit for internet usage each month according to the subscription plan is a strong churn driver, which is clearly a causal feature and much simpler to understand: The customer is not on an appropriate plan, and is running out of data. Sending a text message with a link to change the plan may be exactly what’s needed in this case.
Lastly, tailoring anti-churn measures to each unique customer is of course not realistic, and not our suggestion. However, based on aggregating insights from driver analysis on the individual level, some repeating patterns typically emerge where typical causes for churn behavior can be understood. This again can be exploited to identify anti-churn measures, preferably supported by A/B testing of the different measures.
Engagement with key stakeholders throughout the anti-churn project is of particular importance. The impact of the model on a company’s bottom line needs to be analyzed and communicated clearly. One straightforward way of doing this is by assigning an assumed income/cost for each customer in the model’s four different prediction categories:
True positive (TP): Customers correctly predicted to churn. For each of these customers, we have some probability of preventing churn, which will give us increased income from a longer customer lifetime. However, there is also a cost associated with the intervention.
False positive (FP): Customers incorrectly predicted to churn. The company incurs intervention costs without any churn reduction.
True negative (TN): Customers correctly predicted to not churn. No action taken, so no cost or income is associated with these customers.
False negative (FN): Customers incorrectly predicted not to churn, whom we do not contact even though we ideally should. The company loses out on potential income from preventing churn, but we do not count this as a cost of operationalizing the churn model.
This is summarized in the cost matrix below:
The assumed incomes/costs for the TP and FP categories, together with the number of customers in each category, let us estimate the total profitability of operationalizing the model. The distribution of customers across the four categories depends on the intervention level: the minimum model score required to classify a customer as a predicted churner (TP + FP). By plotting profitability versus intervention level, we can read off which intervention level gives the maximum profitability. If we have trained different models (i.e., tested different algorithms), the curves also tell us which model gives the maximum profitability overall, an effective way to give the business side some insight into the model development process!
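The threshold sweep behind such a chart can be sketched in a few lines. The income and cost figures here are made-up assumptions purely for illustration; in a real project they would come from the domain-expert discussions described below:

```python
# All monetary figures are illustrative assumptions, not real cost data.
INCOME_TP = 500.0    # assumed expected value of a saved customer
COST_CONTACT = 50.0  # assumed cost of one retention intervention

def profit_at_threshold(scores, churned, threshold,
                        income_tp=INCOME_TP, cost=COST_CONTACT):
    """Expected profit of contacting every customer scoring >= threshold.
    A TP earns income_tp minus the contact cost, an FP only incurs the
    cost; TN and FN carry no operational cost in this simple model."""
    profit = 0.0
    for score, did_churn in zip(scores, churned):
        if score >= threshold:
            profit += (income_tp - cost) if did_churn else -cost
    return profit

def best_threshold(scores, churned, grid):
    """Pick the intervention level on the grid with maximum profit."""
    return max(grid, key=lambda t: profit_at_threshold(scores, churned, t))

# Tiny synthetic example: model scores and actual churn outcomes.
scores = [0.9, 0.8, 0.4, 0.2, 0.1]
churned = [True, True, False, False, False]
t_star = best_threshold(scores, churned, grid=[0.1, 0.3, 0.5, 0.7])
```

Plotting `profit_at_threshold` over the whole grid, for each candidate model, reproduces the profitability-versus-intervention-level curves discussed above.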
This type of chart can be calculated for different sets of cost/income assumptions for the four categories, based on discussions with domain experts. Typical assumptions are the average cost of targeting a customer, the probability that a customer responds positively to a specific type of measure, and how long the customer remains after a successful measure. Typically, no clear answers to these questions are found, and undertaking a pilot project based on an A/B-testing framework before fully committing will reduce risks.
A final best practice is to enrich your predictive churn model with several data sources. Time is virtually always better spent hunting down and structuring new data sources into features than improving the model itself, which is the essence of the Data-Centric AI manifesto. In addition to potentially picking up more relevant signals statistically linked to customer churn, more features, up to a certain point, will also make your model more explainable.
The obvious place to start is of course the company’s own internal data, typically sourced from CRM or ERP systems, such as demographics, product data, and customer interactions. We would like to highlight some especially important features for churn modeling:
Net promoter score: Perhaps one of the most indicative measures of customer satisfaction can be obtained by simply asking the customer about their perception of your company’s products and/or services. You might have received a text message after calling a customer call center, asking you to give a score on whether you are likely to recommend a product or service to a friend or colleague, followed up by an encouragement to leave a comment. This is commonly referred to as the “Net promoter score” and is a widely used metric in marketing research. An obvious benefit to using such a survey is that it allows your business to gain deep insights into the exact pain points of your customer journeys, which can be used to produce actionable measures to counteract overall customer churn. The net promoter scores gathered from this survey can then be used to track customer satisfaction following any anti-churn measures, using the A/B testing framework mentioned earlier.
Lastly, information gathered from this type of survey may be used as an input to your predictive churn model, which together with the other variables in your model may provide an effective predictor of customer churn.
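The score itself follows a standard definition: respondents answering 9 or 10 are promoters, 0 through 6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors. As a minimal sketch:

```python
def net_promoter_score(responses):
    """NPS from 0-10 survey answers: % promoters (9-10)
    minus % detractors (0-6); 7-8 are passives and cancel out."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100.0 * (promoters - detractors) / len(responses)

nps = net_promoter_score([10, 9, 8, 6, 3])  # 2 promoters, 2 detractors
```

The resulting score ranges from -100 (all detractors) to +100 (all promoters), and both the aggregate score and the raw per-customer responses can feed the churn model.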
RFM variables: RFM stands for recency, frequency, and monetary value. These are behavioral factors that are widely used in marketing analytics and often turn out to be significant predictors of churn. In short, these can be defined as follows:
Recency: How long ago did a customer make a purchase/use a service?
Frequency: How often does the customer, on average, make a purchase/use a service?
Monetary value: How much does the customer spend on a given product/service?
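The three definitions above translate directly into a feature-engineering step. Here is a standard-library sketch, assuming transactions arrive as `(customer_id, purchase_date, amount)` records (the field layout is our own assumption):

```python
from collections import defaultdict
from datetime import date

def rfm_features(transactions, as_of):
    """Compute recency (days since last purchase), frequency (number of
    purchases) and monetary value (total spend) per customer from
    (customer_id, purchase_date, amount) records."""
    per_customer = defaultdict(list)
    for customer, when, amount in transactions:
        per_customer[customer].append((when, amount))
    features = {}
    for customer, rows in per_customer.items():
        last_purchase = max(when for when, _ in rows)
        features[customer] = {
            "recency": (as_of - last_purchase).days,
            "frequency": len(rows),
            "monetary": sum(amount for _, amount in rows),
        }
    return features

tx = [("A", date(2023, 1, 10), 120.0),
      ("A", date(2023, 3, 1), 80.0),
      ("B", date(2022, 12, 1), 40.0)]
rfm = rfm_features(tx, as_of=date(2023, 4, 1))
```

In a production pipeline these aggregates would typically be computed per observation snapshot, so that recency and frequency align with the cutoff date used for the churn label.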
We also recommend looking outside the walls of the company, including external data sources where possible. For example, in the master's thesis mentioned in the introduction, we utilized external data from financial markets in predicting mutual fund outflows for a large Nordic financial institution. We found (as perhaps expected) that stock market uncertainty, measured by the volatility index (VIX), contributed substantially to both model performance and explainability, allowing us to gauge how much uncertainty in financial markets increases the likelihood of churn. In addition, customers whose mutual funds performed poorly were more likely to churn if their stated reason for saving was short-term savings, compared to someone saving for retirement.
In short, considering which features to include in your model requires a blend of experience, creativity, domain knowledge, and a healthy dose of trial and error. Iterating with domain experts such as customer advisors or business developers is crucial. Talking directly to existing and previous customers is also often forgotten and can be a valuable source of information.
Leveraging data through machine learning to get a clearer picture of which customers are going to churn for what reasons makes good economic sense, as it is often more expensive to gain a new customer than to retain an existing customer. However, it is not as simple as just training a prediction model, contacting the customers that the model predicts will churn and hoping for the best. There are many pitfalls, and the success of your machine learning anti-churn project starts at the very top, with a clear definition of the business problem. When operationalizing the model you should consider the underlying business case, and tailor the churn-preventing communication to why the customer is likely to churn in the first place, instead of a “one size fits all”-approach. To achieve a sufficiently accurate model it can also be well worth the time to look for new data sources, often outside the four walls of the company. Finally, we also highly recommend having an A/B testing framework in place.
Jarrod Carmichael is a Technology Analyst at BearingPoint Oslo, with experience in machine learning in safety-critical systems within the energy industry and customer retention within the finance industry. Before joining BearingPoint in 2021, Jarrod obtained a master's degree in Financial Economics and Business Analytics from BI Norwegian Business School.
Alborz Sabetrasekh is a Technology Analyst in the Data & Analytics team of BearingPoint Oslo, with several years of working experience in the financial industry, including involvement in multiple projects related to customer retention. Alborz obtained a master's degree in Business Analytics from BI Norwegian Business School in 2021.
Vebjørn Axelsen is a Partner working at BearingPoint in Oslo and head of the Advanced Analytics area. He has significant experience as strategic advisor, architect and developer across data science and data engineering and has led a multitude of data science efforts across industries. Vebjørn obtained his MSc in Computer Science specialized in AI from the Norwegian University of Science and Technology in 2007.
Dr. Lars Bjålie is a Technology Advisor at BearingPoint Oslo, with experience in customer analytics from the financial industry and predictive maintenance within railroad infrastructure. Before joining BearingPoint in 2016, Lars obtained a Ph.D. in computational materials science from the University of California, Santa Barbara.