In 2018, the Norwegian Ministry of Finance published a tender asking for ideas on how to estimate the market value of holiday homes for tax purposes. Respondents were encouraged to consider “advanced solution algorithms”, such as Machine Learning techniques. BearingPoint delivered a concept description proposing an approach that uses Machine Learning models to inspire simpler and more transparent statistical models, and was selected as the winner. In this article we share our key takeaways from the proposed solution.

Why combining Machine Learning with traditional statistics makes sense when balancing model accuracy and transparency

Machine Learning (ML) methods such as neural networks and large ensemble models are rapidly becoming the preferred tools for solving analytical problems in both business and science. Their rise in popularity can be attributed to increased ease of use through constantly improving open source code libraries, as well as the fact that these methods regularly outperform traditional statistical techniques. The performance gains often stem from their ability to exploit interaction effects, non-linear effects and peculiar outliers.

There is, however, a catch. Many ML models are far less transparent than their traditional counterparts due to their inherent complexity. This is often referred to as the “black box problem”: the inner workings of the algorithms prove extremely difficult to explain. The output may be highly accurate, but the deductive rules that produced it are effectively hidden under the hood. This is not an issue when your only objective is an accurate prediction, but when you need to understand the underlying reasoning for each prediction, many ML models come up short.

When taxing holiday homes, the market value prediction has a significant impact on the private economy of a large number of Norwegian citizens. The Norwegian Ministry of Finance’s challenge was therefore to create a model that is at once accurate, robust, transparent and fair.

We proposed a three-step approach that utilizes Machine Learning models to inspire simpler and more transparent statistical models

Figure 1: Our three-step approach

1. Data preparation

Collect an as exhaustive data set as possible. Use imagination, expert advice and clever feature engineering to produce a data set with variables that are thought to influence your target variable. Cleanse and transform your data into a final data set to be used in model development. This data set is usually called an analytical base table (ABT).
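As a minimal sketch of what building an ABT can look like in practice – the columns, values and derived features below are purely illustrative, not the actual data set:

```python
import pandas as pd

# Hypothetical raw data for holiday homes; column names and values are illustrative only.
raw = pd.DataFrame({
    "sale_price_nok": [2_500_000, 4_100_000, None, 3_300_000],
    "size_sqm": [80, 120, 95, 100],
    "shore_dist_m": [150, 40, 900, 60],
    "build_year": [1978, 2005, 1990, 2012],
})

# Cleanse: drop observations missing the target variable.
abt = raw.dropna(subset=["sale_price_nok"]).copy()

# Feature engineering: derive variables thought to influence market value.
abt["age_years"] = 2018 - abt["build_year"]
abt["is_shorefront"] = (abt["shore_dist_m"] < 100).astype(int)
abt["price_per_sqm"] = abt["sale_price_nok"] / abt["size_sqm"]
```

The resulting `abt` is the table that all later model development would run against.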

Perform standard exploratory data analysis (simple plotting, correlation analysis, principal component analysis, etc.) to gain insight into the dynamics of the data. Obtain insight into the predictive power of each variable on a stand-alone basis and start to build an understanding of any relevant phenomena that appear to be represented in the data. So far, business as usual.
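A compact sketch of this kind of exploratory analysis, here run on synthetic data standing in for the ABT (two informative features and one pure-noise feature):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the ABT: size and shore distance drive price, one column is noise.
n = 200
size_sqm = rng.uniform(50, 150, n)
shore_dist = rng.uniform(0, 1000, n)
noise = rng.normal(size=n)
price = 30_000 * size_sqm - 1_000 * shore_dist + rng.normal(0, 1e5, n)

X = np.column_stack([size_sqm, shore_dist, noise])

# Stand-alone predictive power: correlation of each candidate variable with the target.
corrs = [np.corrcoef(X[:, j], price)[0, 1] for j in range(X.shape[1])]

# PCA on standardized features to see how much variance a few components capture.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA(n_components=2).fit(Xs)
explained = pca.explained_variance_ratio_.sum()
```

On this toy data the correlations immediately separate the informative variables from the noise column, which is exactly the kind of stand-alone insight the EDA step is after.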

2. Train a semi-transparent ML model, interpret and construct data transformation rules

Develop an ML model of choice according to recognized best practice: an appropriate train-test split, hyperparameter tuning with cross validation on the training data and a final evaluation on the test data. When the model has been trained and performs at an adequate level, the next step is to analyze the model and extract insight from it. The goal is to implement transformations on our ABT that mimic how features are used within the ML model.
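A minimal sketch of this training procedure using scikit-learn – the model choice (gradient boosting), the synthetic data and the hyperparameter grid are illustrative assumptions, not the actual setup used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)

# Synthetic ABT stand-in with a threshold effect and an interaction effect.
n = 400
X = rng.uniform(0, 1, (n, 3))
y = 5 * (X[:, 0] > 0.5) + 3 * X[:, 1] * (X[:, 2] > 0.3) + rng.normal(0, 0.1, n)

# Hold out a test set for the final, one-time evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameter tuning with cross validation on the training data only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    cv=5,
)
search.fit(X_train, y_train)

ml_score = search.score(X_test, y_test)  # R^2 on held-out test data
```

The held-out `ml_score` becomes the benchmark that the transparent model in step 3 is compared against.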

We propose using transparency features provided by open source software and code libraries, such as the following:

  • Feature importance is a metric that, roughly speaking, describes how important each variable is for the prediction output of the model.
    - The metric can be used to evaluate which variables to keep (high importance) and remove (low importance) from the ABT.
  • Partial dependency plots tell us something about the dependency between an input variable and the target variable.
    - The plots illustrate how one of the input variables influences the prediction outcome throughout the value range of the input variable, averaging out the effect of all other input variables.
    - Similarly, 3D partial dependency plots illustrate the influence of two input variables on the output variable. 

Assess each variable carefully, look for any relevant dependencies and construct transformations accordingly:

  • Binary threshold
    - Interpretation: the dependence between variable X and the prediction output is non-linear, such that X has a static, slightly negative influence when X < Xa and a static, positive influence when X > Xa. Example: median income in the area above 10 MNOK.
    - Transformation: create a new binary variable Y, e.g. Y = 1 when X > Xa and Y = 0 otherwise.
  • Exponential effect
    - Interpretation: variable X has an exponentially increasing, positive effect on the prediction output throughout its value range. Example: proximity to the nearest city.
    - Transformation: create a new variable Y, e.g. Y = exp(X), so that the effect becomes linear in Y.
  • Combination of linear and threshold
    - Interpretation: in the interval where X is less than Xa there is a linear dependence between variable X and the prediction output; when X is larger than Xa, X has a static, slightly negative effect on the output. Example: proximity to the shoreline may have a linear relationship with housing prices, but only in coastal areas (i.e. in areas within a certain distance of the shore).
    - Transformation: split X into two variables Y and Z, e.g. Y = min(X, Xa) for the linear part and Z = 1 when X > Xa (0 otherwise) for the flat part.
  • Interaction effect
    - Interpretation: in a 3D plot, we observe an interaction between X and Y where the prediction output is linearly correlated with Y, but only when X is between Xa and Xb and Y is larger than Ya. Example: average price per square foot between 30 kNOK and 50 kNOK, combined with the percentage of properties sold over the last three years being between 50 and 100.
    - Transformation: create a new variable Z, e.g. Z = Y when Xa < X < Xb and Y > Ya, and Z = 0 otherwise.

Figure 2: Examples of interpretation of partial dependency plots
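These transformation rules can be sketched in code. The variable names and cut-off values below are hypothetical stand-ins for what the partial dependence plots would reveal:

```python
import numpy as np
import pandas as pd

# Hypothetical cut-offs, as if read off the partial dependence plots.
INCOME_CUT = 10.0   # binary threshold on median area income (MNOK)
SHORE_CUT = 500.0   # linear shore effect only within 500 m of the shore

# Illustrative slice of the ABT.
abt = pd.DataFrame({
    "median_income": [8.0, 12.0, 15.0],
    "city_proximity": [0.2, 0.5, 0.9],
    "shore_dist": [100.0, 450.0, 2000.0],
    "price_per_sqft_knok": [25, 40, 60],
    "pct_sold_3y": [30, 70, 80],
})
ext = abt.copy()

# Binary threshold: indicator for X exceeding the observed cut-off.
ext["income_above_cut"] = (abt["median_income"] > INCOME_CUT).astype(int)

# Exponential effect: transform so the effect becomes linear for the regression.
ext["city_proximity_exp"] = np.exp(abt["city_proximity"])

# Combination of linear and threshold: linear part capped at the cut-off,
# plus an indicator for the flat region beyond it.
ext["shore_linear"] = abt["shore_dist"].clip(upper=SHORE_CUT)
ext["shore_beyond_cut"] = (abt["shore_dist"] > SHORE_CUT).astype(int)

# Interaction effect: Y contributes only inside the region found in the 3D plot.
in_region = (abt["price_per_sqft_knok"].between(30, 50)
             & abt["pct_sold_3y"].between(50, 100))
ext["turnover_interaction"] = abt["pct_sold_3y"] * in_region.astype(int)
```

The extended table `ext` is the "new ABT" that the transparent model in step 3 is trained on.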

3. Train a transparent model using transformed data and assess model accuracy

After transforming the original ABT, use the new, extended ABT to train a transparent linear regression model. Evaluate the linear model by scoring it on the test data and compare this result with the original score from the ML model. Hopefully, the score from the linear model is close to that of the ML model – meaning we have succeeded in preserving the predictive power throughout the process.
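A minimal end-to-end sketch of this comparison on synthetic data with a single threshold effect (the data, model choices and cut-off are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Data with a linear trend plus a threshold effect at x = 0.5,
# as step 2 would have revealed in a partial dependence plot.
n = 600
X = rng.uniform(0, 1, (n, 1))
y = 4.0 * (X[:, 0] > 0.5) + 2.0 * X[:, 0] + rng.normal(0, 0.1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Benchmark: the semi-transparent ML model from step 2.
ml = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
ml_score = ml.score(X_test, y_test)

# Transformation layer: add the threshold indicator found in the plots,
# then fit a fully transparent linear regression on the extended ABT.
def extend(X):
    return np.column_stack([X, (X[:, 0] > 0.5).astype(float)])

lin = LinearRegression().fit(extend(X_train), y_train)
lin_score = lin.score(extend(X_test), y_test)

# A plain linear model without the transformation, for contrast.
plain_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
```

On this toy problem the transformed linear model recovers nearly all of the ML model's accuracy, while the untransformed linear model falls visibly short – which is the whole point of the transformation layer.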

Our approach is summarized in figure 1. As the figure shows, the final model consists of a transformation layer and a regression layer. Both contain specific rules that can be explained to an end user. This stands in contrast to an “unexplainable black box” machine learning algorithm and can serve as an alternative strategy when both accuracy and transparency are important.

Good luck on your quest to increased model transparency!

BearingPoint wishes the Norwegian Ministry of Finance good luck with implementing our approach, hopefully resulting in fair taxation of holiday homes. We also hope our approach will help solve similar cases in other industries where increased model transparency is needed.
