In 2018, the Norwegian Ministry of Finance published a tender asking for ideas on how to estimate the market value of holiday homes for tax purposes. Respondents were encouraged to consider “advanced solution algorithms”, such as Machine Learning techniques. BearingPoint delivered a concept description proposing an approach that uses Machine Learning models to inform simpler and more transparent statistical models, and was selected as the winner. In this article we share our key takeaways from the proposed solution.
Machine Learning (ML) methods such as neural networks and large ensemble models are rapidly becoming the preferred tools for solving analytical problems in both business and science. Their rise in popularity can be attributed to constantly improving open source libraries that make them ever easier to use, and to the fact that they regularly outperform traditional statistical techniques. The performance gains often stem from their ability to exploit interaction effects, non-linear effects and peculiar outliers.
There is, however, a catch. Many ML models are far less transparent than their traditional counterparts due to their inherent complexity. This is often referred to as the “black box” problem: the inner workings of the algorithms are extremely difficult to explain. The output may be highly accurate, but the deductive rules that produced it are effectively hidden under the hood. This is not an issue when your only objective is an accurate prediction, but when you need to understand the underlying reasoning behind each prediction, many ML models fall short.
When taxing holiday homes, the predicted market value has a significant impact on the personal finances of a large number of Norwegian citizens. The Norwegian Ministry of Finance’s challenge was therefore to create a model that is at once accurate, robust, transparent and fair.
Figure 1: Overview of the proposed approach, in which an ML model informs a final model consisting of a transformation layer and a regression layer
Collect as exhaustive a data set as possible. Use imagination, expert advice and clever feature engineering to produce a data set with variables that are thought to influence your target variable. Cleanse and transform your data into a final data set to be used in model development. This data set is usually called an analytical base table (ABT).
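As a minimal sketch of what this preparation step might look like in Python with pandas, consider the following. The file names, column names and derived features are all illustrative assumptions, not part of the actual solution:

```python
import pandas as pd

# Hypothetical raw sources: one row per property sale, one row per area.
sales = pd.read_csv("holiday_home_sales.csv")
area_stats = pd.read_csv("area_statistics.csv")

# Feature engineering: derive variables thought to influence market value.
sales["age"] = sales["sale_year"] - sales["build_year"]
sales["price_per_sqm"] = sales["price"] / sales["living_area_sqm"]

# Enrich with area-level context such as median income.
abt = sales.merge(area_stats, on="area_code", how="left")

# Cleansing: drop obvious recording errors, impute sparse variables.
abt = abt[abt["price"] > 0]
abt["median_income"] = abt["median_income"].fillna(abt["median_income"].median())

abt.to_parquet("abt.parquet")  # the analytical base table (ABT)
```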
Perform standard exploratory data analysis (simple plotting, correlation analysis, principal component analysis, etc.) to gain insight into the dynamics of the data. Assess the predictive power of each variable on a stand-alone basis and start building an understanding of any relevant phenomena that appear to be represented in the data. So far, business as usual.
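A few lines of standard tooling cover most of this step. The sketch below, continuing with the hypothetical ABT from above, computes stand-alone correlations with the target and a principal component analysis:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

abt = pd.read_parquet("abt.parquet")
numeric = abt.select_dtypes("number")

# Stand-alone predictive power: correlation of each variable with the target.
print(numeric.corr()["price"].sort_values(ascending=False))

# PCA on standardized features reveals the dominant directions of variation.
features = numeric.drop(columns="price").fillna(0)
pca = PCA(n_components=5)
pca.fit(StandardScaler().fit_transform(features))
print(pca.explained_variance_ratio_)
```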
Develop an ML model of choice according to recognized standards: an appropriate train-test split, hyperparameter tuning with cross-validation on the training data, and final evaluation on the test data. Once the model is trained and performs at an adequate level, the next step is to analyze the model and extract insight from it. The goal is to be able to implement transformations on our ABT that mimic how features are used within the ML model.
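Continuing the sketch with scikit-learn, such a workflow could look as follows. The choice of a gradient boosting regressor and the parameter grid are illustrative assumptions:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X = abt.select_dtypes("number").drop(columns="price")
y = abt["price"]

# Hold out a test set for the final, unbiased evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter tuning with cross-validation on the training data only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [2, 3, 4]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

ml_model = search.best_estimator_
print("ML model test R^2:", ml_model.score(X_test, y_test))
```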
We propose using the model transparency features provided by open source code libraries, in particular partial dependence plots. Assess each variable carefully, look for any relevant dependencies and construct transformations accordingly; figure 2 shows typical patterns, and a short code sketch follows it:
| Partial dependence plot | Interpretation | Transformation |
|---|---|---|
| *(plot)* | **Binary threshold.** In this example, the dependence between variable X and the prediction output seems to be non-linear: X has a static, slightly negative influence when X < Xa and a static, positive influence when X > Xa. Example: median income in the area above 10 MNOK. | Create a new variable Y such that Y = 1 if X > Xa, else Y = 0. |
| *(plot)* | **Exponential effect.** In this example, variable X seems to have an exponentially increasing positive effect on the prediction output throughout its value range. Example: proximity to the nearest city. | Create a new variable Y such that Y = e^X. |
| *(plot)* | **Combination of linear and threshold.** In this example, where X is less than Xa there seems to be a linear dependence between X and the prediction output; when X is larger than Xa, however, X seems to have a static, slightly negative effect. Example: proximity to the shoreline may have a linear relationship with housing prices, but only in coastal areas (i.e. in areas within a certain distance of the shore). | Split X into two variables Y and Z such that Y = min(X, Xa) and Z = 1 if X > Xa, else Z = 0. |
| *(3D plot)* | **Interaction effect.** In this 3D plot, we observe an interaction between X and Y where the prediction output seems to be linearly correlated with Y, but only when X is between Xa and Xb and Y is larger than Ya. Example: average price per square foot between 30 kNOK and 50 kNOK, combined with the percentage of properties sold in the last three years being between 50 and 100. | Create a new variable Z such that Z = Y if Xa < X < Xb and Y > Ya, else Z = 0. |
Figure 2: Examples of interpretation of partial dependence plots
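Partial dependence plots are readily available in open source libraries; scikit-learn, for instance, exposes them through `sklearn.inspection`. The sketch below also shows how the four transformation patterns in figure 2 could be implemented with pandas and NumPy. All thresholds and column names are illustrative placeholders that would in practice be read off the actual plots:

```python
import numpy as np
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence on single features, plus a feature pair to expose
# interaction effects.
PartialDependenceDisplay.from_estimator(
    ml_model, X_train,
    features=["median_income", ("price_per_sqm", "pct_sold_3y")],
)

# Binary threshold: Y = 1 if X > Xa, else 0 (Xa = 10 MNOK median income).
abt["high_income_area"] = (abt["median_income"] > 10_000_000).astype(int)

# Exponential effect: Y = e^X (proximity to nearest city).
abt["city_proximity_exp"] = np.exp(abt["city_proximity"])

# Combination of linear and threshold: Y = min(X, Xa), Z = 1 if X > Xa.
shore_xa = 2_000  # placeholder threshold
abt["shore_prox_capped"] = abt["shore_proximity"].clip(upper=shore_xa)
abt["outside_coastal_band"] = (abt["shore_proximity"] > shore_xa).astype(int)

# Interaction effect: Z = Y when Xa < X < Xb and Y > Ya, else 0.
in_band = (
    abt["price_per_sqm"].between(30_000, 50_000)
    & abt["pct_sold_3y"].between(50, 100)
)
abt["pct_sold_in_band"] = abt["pct_sold_3y"].where(in_band, 0.0)
```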
After transforming the original ABT, use the new, extended ABT to train a transparent linear regression model. Evaluate the linear model by scoring it on the test data and compare the result with the original score from the ML model. Hopefully, the score of the linear model is close to that of the ML model, meaning we have succeeded in preserving the predictive power throughout the process.
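Concretely, with the extended ABT split into training and test data in the same way as before, the comparison could be as simple as the sketch below, where `X_train_ext` and `X_test_ext` denote the transformed feature matrices:

```python
from sklearn.linear_model import LinearRegression

# Fit a fully transparent linear model on the extended (transformed) ABT.
linear_model = LinearRegression().fit(X_train_ext, y_train)

# Compare predictive power on the same held-out test data.
print("ML model test R^2:    ", ml_model.score(X_test, y_test))
print("Linear model test R^2:", linear_model.score(X_test_ext, y_test))

# Unlike the ML model, every coefficient here has a direct interpretation.
for name, coef in zip(X_train_ext.columns, linear_model.coef_):
    print(f"{name}: {coef:+.3f}")
```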
Our approach is summarized in figure 1. As seen, the final model consists of a transformation layer and a regression layer, both containing explicit rules that can be explained to an end user. This stands in contrast to an unexplainable “black box” machine learning algorithm, and can serve as an alternative strategy when accuracy and transparency are both important.
BearingPoint wishes the Norwegian Ministry of Finance good luck with implementing our approach, hopefully resulting in fair taxation of holiday homes. We also hope our approach can help solve similar cases in other industries where increased model transparency is needed.