Machine learning models to predict customer behavior and A/B testing of marketing actions have become key capabilities in any modern marketing function. Together, they increase the impact of the marketing spend by finding the right messages to deliver to each customer to increase sales and loyalty. In our consulting practice, we unfortunately often see the techniques used together in ways that defeat the value of one or both.
In this article, we share some key learnings, rooted in both theory and practice, on properly combining them: how to select experiment groups, evaluate the results, and determine statistical significance. The principles we describe are general and apply equally well in, for example, predictive maintenance, but we keep our discussion within the realm of customer analytics.
Whether one uses cutting-edge machine learning algorithms trained on vast quantities of customer data, or business rules built on well-aged gut feeling, the essence is to find customer groups who have a high likelihood of behaving in a certain way and influence them by sending the right message.
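Typically, we want to monitor two things: the model’s ability to select the right customers, and the effectiveness of the action on the customers who receive it.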
For example, within banking one can build a model to select customers who have a high probability of moving their mortgage to another provider, and then tailor a customer journey with specific offers designed to increase loyalty within the selected group. When the model’s selection ability decreases, the model should be retrained on fresh data, or we need an entirely new model. The same principle goes for action effectiveness, where an ineffective action needs to be redesigned. We therefore want to implement A/B testing of both model and action, such that different models (champion/challenger) and different actions (classical A/B) can be evaluated over time. With the right experiment groups, we can measure these two things independently and at the same time.
The approach we present is straightforward to implement and has been used in projects at several large Norwegian companies. It is based on a structured setup for continuous evaluation by using three or four different experiment groups (depending on the case at hand).
Consider a customer churn example. A machine learning model is developed to predict which customers are likely to cancel their subscription or membership (i.e., churn) in the future. To prevent the cancellation, some action is to be taken. Based on our machine learning model, we divide the customer base into mutually exclusive experiment groups, as illustrated in the figure below.
The customers the model believes will churn are in the “believes churn” group, while the rest are in the “believes not churn” group. Here we need to manually decide a cut-off, for example selecting the top 5% of customers with the highest scores, or customers with a sufficiently high probability of churn as estimated by the model. These considerations are part of the business case of the anti-churn activity, which we do not go further into here.
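The two cut-off strategies mentioned above can be sketched as follows. This is a minimal illustration only: the function names, the customer IDs, and the scores are hypothetical, and the right cut-off depends on the business case.

```python
import math


def select_by_top_fraction(scores, fraction=0.05):
    """Return the IDs of the highest-scoring customers (the top `fraction`)."""
    n_selected = math.ceil(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_selected])


def select_by_threshold(scores, threshold=0.5):
    """Return the IDs of customers whose churn score is at or above `threshold`."""
    return {cid for cid, score in scores.items() if score >= threshold}


# Hypothetical churn scores, for illustration only.
scores = {"c1": 0.91, "c2": 0.15, "c3": 0.62, "c4": 0.05, "c5": 0.33}

# "Believes churn" group under each strategy; everyone else is "believes not churn".
believes_churn_top = select_by_top_fraction(scores, fraction=0.2)
believes_churn_thr = select_by_threshold(scores, threshold=0.6)
```

A top-fraction cut-off keeps the size of the target group predictable, while a probability threshold lets the group size vary with how many customers the model considers at risk.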
We then apply an action to prevent churn to a percentage of the “believes churn” group (A), while the rest (B) are left out as a control group. In the example, customers that received an action have a 10% churn rate, whereas those that did not have a 20% churn rate — the action reduces churn by 50% in the customers picked out by the model.
To measure the performance of the machine learning model, we need to compare the churn rates of the “believes churn” and “believes not churn” customers who do not receive any action, that is, control groups B and D. The churn rates for these groups are 20% and 5%, respectively, so our example model finds customers who are four times as likely to churn as the rest.
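The two comparisons can be written out as simple arithmetic. The sketch below uses the churn rates from the example; the function names are our own and purely illustrative.

```python
def relative_reduction(rate_with_action, rate_control):
    """Share of the control group's churn that the action removes."""
    return (rate_control - rate_with_action) / rate_control


def lift(rate_selected, rate_rest):
    """How many times as likely to churn the selected customers are."""
    return rate_selected / rate_rest


# Churn rates from the example: A = action, B = control within "believes churn",
# D = control within "believes not churn".
churn_rate = {"A": 0.10, "B": 0.20, "D": 0.05}

action_effect = relative_reduction(churn_rate["A"], churn_rate["B"])  # A vs. B
model_lift = lift(churn_rate["B"], churn_rate["D"])                   # B vs. D

print(f"Action reduces churn by {action_effect:.0%}")
print(f"Model lift: {model_lift:.1f}x")
```

Note that the model is evaluated only on customers who received no action (B and D), so the action cannot distort the measurement of the model.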
We can also measure the effectiveness of the action in the “believes not churn” group by sending the action to a percentage of these customers (C) and comparing the churn rate with the rest of the customers in the “believes not churn” group (D). Doing this increases the complexity of the experimental setup, but if one sees a large positive effect by doing this comparison, it is a hint that one may want to decrease the selectivity of the model and include more customers in group A. In the extreme case, one may want to do away with the model entirely and send the action to all customers. In the example, an increased churn rate for group C is illustrated, meaning the action has a negative effect in the “believes not churn” group as more customers churn — better only target those the model selects!
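A minimal sketch of how customers could be assigned to the four groups is shown below. The cut-off, the share of customers receiving the action in each segment, and all names are assumptions chosen for illustration; the key point is that assignment within each segment is random.

```python
import random


def assign_groups(scores, cutoff, action_share_selected=0.5,
                  action_share_rest=0.1, seed=42):
    """Assign every customer to exactly one of the groups A, B, C, D.

    A: believes churn, receives action      B: believes churn, control
    C: believes not churn, receives action  D: believes not churn, control
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    groups = {}
    for customer_id, score in scores.items():
        selected = score >= cutoff
        share = action_share_selected if selected else action_share_rest
        gets_action = rng.random() < share  # random split within the segment
        if selected:
            groups[customer_id] = "A" if gets_action else "B"
        else:
            groups[customer_id] = "C" if gets_action else "D"
    return groups


# Hypothetical model scores, for illustration only.
score_rng = random.Random(0)
scores = {f"customer_{i}": score_rng.random() for i in range(1000)}
groups = assign_groups(scores, cutoff=0.95)
```

If measuring the action in the “believes not churn” segment is not needed, setting `action_share_rest=0` leaves group C empty and gives the three-group variant of the setup.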
To summarize, the model selects a customer group with a four times higher churn rate than the rest of the customer base, and the action on the target group reduces the churn rate by 50%. It is tempting to stop our analysis here, concluding that the model’s ability to select customers is good and that the action is effective. However, the result must be evaluated more carefully: We must check if our findings are statistically significant, or just random noise. A common approach to do so is to calculate the P-value of the experiment.
A P-value is a way of quantifying the findings of a test: it tells us how likely it is to observe a result at least as extreme as ours if the action in fact had no effect. There are countless articles explaining P-values, so we will not explain the concept further here. Similarly, there are many online P-value calculators you can use to test the validity of your results, or you can set up an Excel spreadsheet.
The P-value is by no means a perfect tool, and there is no such thing as a black-and-white significant or non-significant outcome (the statistically inclined reader may for instance read this article for a nuanced take on P-values). However, P-values still have their usefulness, and we recommend using them to get an idea of whether the action caused an effect or the result was due to randomness. A threshold of 0.05 is often used to determine significance: if the action truly had no effect, a result at least this extreme would occur by chance less than 5% of the time.
Below we note some key learnings for performing tests for statistical significance:
Returning to our churn example, if the target group was only 20 customers, 10 receiving the action and 10 reserved in the control group, then one customer would have churned in the action group A, and two customers in the control group B. This hardly tells you anything, as it could easily be due to chance. For a more realistic test, say the two groups were 100 customers each, so that 10 customers churned in group A and 20 in group B. Then the test statistic is -1.9803 and the corresponding P-value (for a two-tailed test) is 0.0477 < 0.05, and you would conclude that the result is statistically significant.
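The numbers above are consistent with a two-proportion z-test using a pooled standard error, which we assume here. The sketch below uses only the Python standard library; the function name is our own.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-tailed z-test for a difference between two proportions (pooled SE)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no variation at all: nothing to test
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-tailed P-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 10 customers per group: 1 churned in A, 2 churned in B.
z_small, p_small = two_proportion_z_test(1, 10, 2, 10)

# 100 customers per group: 10 churned in A, 20 churned in B.
z_large, p_large = two_proportion_z_test(10, 100, 20, 100)

print(f"n=10 per group:  z={z_small:.4f}, P={p_small:.4f}")
print(f"n=100 per group: z={z_large:.4f}, P={p_large:.4f}")
```

The same churn rates give a clearly non-significant result with 10 customers per group. With groups that small, the normal approximation behind the z-test is itself unreliable, and an exact test would be a better choice.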
The model selects which customers end up in the target group, and within the target group we decide how many customers receive the action (A) and how many are held out as control (B). Very often one faces organizational pressure to operate with a smaller control group in order to reach more customers with the action, but a control group that is too small makes it impossible to reach statistical significance. It is also possible (and encouraged!) to send the action to the customers in the control group once the effectiveness of the action on the target group has been established.
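To illustrate the cost of shrinking the control group, the sketch below compares two hypothetical ways of splitting 200 target customers, keeping the churn rates at 10% with the action and 20% without. We assume a two-proportion z-test with a pooled standard error; the split sizes are made up for illustration.

```python
import math


def two_tailed_p_value(events_a, n_a, events_b, n_b):
    """P-value of a two-proportion z-test with pooled standard error."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Balanced split: 100 action (10 churned) vs. 100 control (20 churned).
p_balanced = two_tailed_p_value(10, 100, 20, 100)

# Small control group: 180 action (18 churned) vs. 20 control (4 churned).
p_small_control = two_tailed_p_value(18, 180, 4, 20)

print(f"Balanced split:      P={p_balanced:.3f}")
print(f"Small control group: P={p_small_control:.3f}")
```

With identical churn rates and the same total number of customers, the split with only 20 control customers no longer yields a P-value below 0.05.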
Each customer should be subjected to only one test at a time (unless the tests are undeniably independent). The control group for one action must also be kept clear of similar actions. If the customer base is large, a tidy solution is to divide it into several non-overlapping groups. In the figure below, this setup issue is demonstrated for a case where three A/B tests with similar actions are run in parallel. For Test 1, a separate part of the customer base is used, so the results are not affected by the other tests. For Tests 2 and 3, however, the same customer base is used, resulting in a very small control group (customers who didn’t receive any action at all), and the effects of one test are hard to distinguish from the other.
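One simple way to obtain non-overlapping groups is to shuffle the customer base once and deal customers out to the tests, then split each pool into action and control. The sketch below is one possible approach under that assumption; the test names and function names are hypothetical.

```python
import random


def partition_customers(customer_ids, test_names, seed=7):
    """Randomly split the customer base into one disjoint pool per test."""
    rng = random.Random(seed)
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    pools = {name: [] for name in test_names}
    for index, customer_id in enumerate(shuffled):
        # Deal customers out round-robin so pools are (almost) equally large.
        pools[test_names[index % len(test_names)]].append(customer_id)
    return pools


def split_action_control(pool, control_share=0.5, seed=7):
    """Randomly split one test's pool into a control and an action group."""
    rng = random.Random(seed)
    shuffled = list(pool)
    rng.shuffle(shuffled)
    n_control = int(len(shuffled) * control_share)
    return {"control": shuffled[:n_control], "action": shuffled[n_control:]}


customers = [f"customer_{i}" for i in range(90)]
pools = partition_customers(customers, ["test_1", "test_2", "test_3"])
test_1_groups = split_action_control(pools["test_1"], control_share=0.5)
```

Because every customer lands in exactly one pool, each test keeps a control group that has not been exposed to the actions of the other tests.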
Before running the experiment, decide on the number of customers in each of the groups and which outcomes you will measure (e.g., clicks, e-mail opens, or purchases). Failing to do so may inadvertently lead to invalid results: with a significance threshold of 0.05, a test of an action that has no effect still has a 5% chance of giving a P-value below 0.05. If you “stack” multiple tests, it is easy for one of them to randomly give a P-value below 0.05; exploiting this, deliberately or not, is known as P-hacking. Within customer analytics, it is tempting to run the experiment and expose customers to the action until you reach significance (P < 0.05), which is a form of P-hacking. The probability of declaring a non-existent effect significant then increases well beyond 5%, and the result will be misleading (see this blog for a more detailed explanation). There exist several techniques to overcome these pitfalls, but these are beyond the scope of this article.
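The effect of stopping as soon as significance is reached can be illustrated with a small simulation. In the sketch below, both groups have the same true churn rate, so every "significant" result is a false positive. We compare testing once at the pre-planned sample size with checking repeatedly along the way. All parameters (group size, number of looks, churn rate) are assumptions chosen for illustration.

```python
import math
import random


def p_value(events_a, n_a, events_b, n_b):
    """Two-tailed P-value of a two-proportion z-test with pooled SE."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (events_a / n_a - events_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rates(n_sims=1000, n_per_group=1000, look_every=100,
                                  rate=0.10, alpha=0.05, seed=1):
    """Return (fixed_rate, peeking_rate) when the action has no effect.

    `n_per_group` should be a multiple of `look_every`, so that the last
    look coincides with the pre-planned sample size.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        events_a = events_b = 0
        significant_at_any_look = False
        p = 1.0
        for n in range(1, n_per_group + 1):
            events_a += rng.random() < rate  # same churn rate in both groups
            events_b += rng.random() < rate
            if n % look_every == 0:
                p = p_value(events_a, n, events_b, n)
                if p < alpha:
                    significant_at_any_look = True
        fixed_hits += p < alpha                  # only the final, planned look
        peeking_hits += significant_at_any_look  # stop at first "significance"
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_false_positive_rates()
print(f"False positives, single planned test: {fixed_rate:.1%}")
print(f"False positives, stopping at first P < 0.05: {peeking_rate:.1%}")
```

Since the final look is one of the peeks, the peeking strategy can never produce fewer false positives than the single planned test, and in practice it produces considerably more.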
The approach we have described may seem straightforward to the statistician, analyst, or data scientist. However, in our consulting practice, we have encountered and overcome several practical challenges that may also be new to this group. Here we present our top tips for running experiments in real life:
We need to have a plan for collecting and storing data for our experiments. Important things to consider are how the customers in the experimental groups are labeled, and how the outcomes (product bought, link clicked, etc.) are registered. If the organization wants to run many A/B tests in parallel, a policy-based system that makes sure customers are not subjected to several tests influencing each other (as discussed in the previous section) is also necessary. We highly recommend building a common infrastructure that tackles these issues, automatically allocates customers to the different experimental groups, and calculates effect and statistical significance. This ensures scalability for running multiple tests, automates away as much manual work as possible, and upholds the integrity of each test by applying a common logic.
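As a rough sketch of what such labeling and outcome registration could look like, the example below stores group assignments and outcomes as simple records and derives the churn rate per test and group. The record layout and all names are hypothetical; a real setup would typically live in a database or campaign management tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    customer_id: str
    test_id: str
    group: str  # "A", "B", "C" or "D"


@dataclass(frozen=True)
class Outcome:
    customer_id: str
    test_id: str
    churned: bool


def churn_rate_by_group(assignments, outcomes):
    """Churn rate per (test_id, group).

    Assumption: customers without a registered outcome have not churned.
    """
    churned = {(o.customer_id, o.test_id): o.churned for o in outcomes}
    totals = defaultdict(int)
    events = defaultdict(int)
    for a in assignments:
        key = (a.test_id, a.group)
        totals[key] += 1
        events[key] += churned.get((a.customer_id, a.test_id), False)
    return {key: events[key] / totals[key] for key in totals}


# Tiny made-up data set, for illustration only.
assignments = [
    Assignment("c1", "t1", "A"), Assignment("c2", "t1", "A"),
    Assignment("c3", "t1", "B"), Assignment("c4", "t1", "B"),
]
outcomes = [Outcome("c2", "t1", True), Outcome("c3", "t1", True),
            Outcome("c4", "t1", True)]

rates = churn_rate_by_group(assignments, outcomes)
```

Storing the group label at assignment time, rather than reconstructing it afterwards, makes it possible to apply the same evaluation logic to every test.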
Oftentimes, the data science function meets resistance from the sales/marketing function. Below we list some typical situations we have experienced and questions that may be worth considering if they come up:
Removing the control group prevents the measurement of the impact — wouldn’t it be nice to know just how much impact was generated, to argue for more marketing budget next year? And are you sure that the action couldn’t be optimized further?
Perhaps counterintuitive from a statistical standpoint, it makes perfect sense from a pragmatic point of view. The effort to create the action (e.g., copywriting, image selection, configuration in the campaign management tool, etc.) has already been expended, and the business may not have any other relevant messages to send to its customers; something is often better than nothing. However, if the action involves some special offer, the business case may not be viable when sending to all customers rather than only those selected by the model.
Like the previous situation, it may still make sense to keep doing the action from a pragmatic viewpoint. However, there are also costs associated with continuing any action as it requires maintenance to stay up to date. For example, if the details of an advertised product change, the message has to change as well, and the model/business rule used to generate the target group may have to be updated if data sources change.
In the day-to-day work of designing and implementing tests, it is easy to lose focus on the long-term goal of measuring effect: increasing business value. Appreciate that a negative or inconclusive test gives useful information for the future; if an action is not effective, the business can focus their time and energy elsewhere. When communicating the results, have a plan for the alternative next step ready, and focus on what the results tell us instead of what they do not — take clear leadership in the development of an iterative, data-driven culture in the whole organization.
We have explained how predictions of behavior and the effect of marketing actions can be tested independently by correctly designing A/B tests. There are several practical challenges to be aware of, often arising from business wants and needs. In the desire for quick results, experimental setups may become faulty. Clarifying expectations, keeping a simple and clear experimental setup, and critically evaluating the results from a statistical standpoint will help develop better solutions giving long-term business value.
Dr. Lars Bjålie is a Senior Data Scientist at BearingPoint Oslo, with experience in customer analytics from the financial industry and predictive maintenance within railroad infrastructure. Before joining BearingPoint in 2016, Lars obtained a Ph.D. in computational materials science from the University of California, Santa Barbara. Email address: email@example.com.
Karine Foss is a Technology Analyst at BearingPoint Oslo, specializing in Data & Analytics, where she has experience with A/B testing within the media industry. She has a Master’s degree in Applied Physics and Mathematics, specializing in Statistics, from the Norwegian University of Science and Technology. Email address: firstname.lastname@example.org.
Stian Mikelsen is a Senior Manager at BearingPoint Oslo, specializing in Data & Analytics, where he has experience helping clients from big banks to start-ups build Data Platforms that provide real insight and value. He has a Master’s degree in Artificial Intelligence from the Norwegian University of Science and Technology. Email address: email@example.com.