In Part 1 of this series we looked at how we could train a machine to predict whether an investigator would find a certain combination of behaviours suspicious. For simplicity those behaviours were binary: either the customer exhibited them or they didn't. Using training data - that is, assessments of suspicion in previous cases worked by skilled investigators - our machine learnt some simple rules to predict how those investigators would adjudicate future cases exhibiting the same behaviours.
In that very basic example the rules could easily have been derived using simple data analytics. In this article we move on to the next stage and consider how to train our machine to identify suspicious cash deposit values for customers with a range of expected turnover. The example is again simplistic (and a little contrived), but it is intended to illustrate how a machine can learn to make finer judgements based on numeric measures in addition to binary behaviour flags.
Consider the table below showing a population of customers with various levels of expected turnover and their corresponding cash deposits for the period.
Our Subject Matter Expert has set a threshold: cash deposits of greater than 70% of turnover are suspicious. In the fourth column we can see that applying this rule produces 9 alerts. In the last column our investigators have marked 4 of those cases as suspicious, each having a ratio of cash deposits to turnover greater than 80%, and the remainder as not suspicious.
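If you want to see the shape of that baseline rule, here is a minimal sketch in Python using pandas. The column names and figures are placeholders for illustration only; they are not the table above.

```python
import pandas as pd

# Placeholder data: a handful of made-up customers, not the article's table.
customers = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4],
    "turnover":      [500, 800, 300, 1000],
    "cash_deposits": [200, 700, 280, 760],
})

# The SME's baseline rule: alert when cash deposits exceed 70% of turnover.
customers["deposit_ratio"] = customers["cash_deposits"] / customers["turnover"]
customers["alert"] = customers["deposit_ratio"] > 0.70

print(customers)
```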
Unhappy with 5 False Positives out of 9 alerts, we seek to train a machine to see if it can outperform the simple 70% detection rule. The training process produces a decision tree like this:
These decision trees can be a little hard to read, so I'll decode it for you. Using the examples given, the machine has learned to segment the customer population by turnover and apply cash deposit thresholds as follows:
At first glance this seems simplistic, but reasonable, except for the last line where no alert will be generated if turnover exceeds 900 and the cash deposit is above 995.50 (more on this later).
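If you would like to reproduce something similar yourself, here is a minimal sketch of how such a tree can be trained and inspected with scikit-learn. The feature names, values and labels are placeholders standing in for the training table; only DecisionTreeClassifier and export_text are the real library calls.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder training data: each row is [turnover, cash_deposits] and each
# label is the investigator's verdict (1 = suspicious, 0 = not suspicious).
X_train = [[500, 200], [800, 700], [300, 280], [1000, 760], [600, 550]]
y_train = [0, 1, 1, 0, 1]

# A CART decision tree, as used for this article.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Print the learned splits as readable if/else rules.
print(export_text(model, feature_names=["turnover", "cash_deposits"]))
```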
However, the real proof is how accurately the machine predicts the judgements investigators will make on the next set of customers. So we give it data on unseen customers and examine its response. As before, suspicion is confirmed by the investigators, who implicitly but consistently apply an 80% threshold.
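Scoring those unseen customers looks something like the sketch below, which continues the placeholder example above; the confusion matrix gives us the False Positive and False Negative counts directly.

```python
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Retrain on the placeholder examples from the earlier sketch.
X_train = [[500, 200], [800, 700], [300, 280], [1000, 760], [600, 550]]
y_train = [0, 1, 1, 0, 1]
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Placeholder "unseen" customers and the investigators' confirmed verdicts
# (implicitly applying their 80% threshold).
X_new = [[400, 350], [900, 650], [700, 600], [1200, 1000]]
y_new = [1, 0, 1, 1]

predictions = model.predict(X_new)

# tn/fp/fn/tp counts let us compare False Positives and False Negatives.
tn, fp, fn, tp = confusion_matrix(y_new, predictions).ravel()
print(f"False Positives: {fp}, False Negatives: {fn}")
```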
Our simple >70% rule produces 11 alerts against this population, including 6 False Positives. In contrast, our machine learning model produces only 7 alerts, with 4 False Positives. So far so good. But whilst our simple rule didn't produce any False Negatives (i.e. cases where it failed to alert when it should have done), our learned model produced two (customers 25 and 27). Why is that? What is going wrong?
This is a case of over-fitting. The model has learned to segment the population and predict suspicion on the examples it was shown. The thresholds it has learned are tailored precisely to match the training data, but don't translate well to the new customer population. The model does not generalise well.
This also explains why the model has learnt that it should never alert on very high turnover with very high cash deposits: the only examples it was shown in that region (customers 13 & 14) were judged not suspicious. This illustrates the vital importance of having a sufficient quantity of good-quality training data.
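One common way of guarding against this kind of over-fitting (an illustrative option, not a description of how this particular model was built) is to constrain how finely the tree may segment the population, so that every learned threshold has to be supported by a reasonable number of training examples:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative constraints: limit the depth of the tree and require each
# leaf to cover several customers, so thresholds cannot be carved out to
# fit one or two training examples (such as customers 13 & 14).
model = DecisionTreeClassifier(
    max_depth=3,
    min_samples_leaf=5,
    random_state=0,
)
# model.fit(X_train, y_train) as before; with too little data these
# constraints mainly reveal where the model cannot be trusted.
```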
Now you might be thinking: we could just do some simple analytics on the data, work out that the investigators were applying a higher 80% threshold (rather than the 70% figure provided by the SME), and adjust our rule accordingly. In this example, with only one variable (cash deposits) and one piece of customer context (expected turnover), that would be possible; but in an operational scenario with dozens of inter-related variables, hundreds of investigators and millions of customers, manually discerning those correlations can be very challenging.
Machine learning automates this analytic process, deriving from the judgements of investigators a complex set of rules to apply in future cases. These derived rules (many thousands of them) can collectively provide a more granular level of decision-making than could be maintained by a manually tuned rules-based approach. This is one of the ways that machine learning can outperform a traditional approach to detecting suspicion if a sufficiently large and rich set of training data is available.
So far we have considered supervised machine learning where the machine learns from the investigators and attempts to reproduce their judgements to predict how they would adjudicate new cases. In the next article we will take a look at unsupervised learning where the machine tries to identify suspicion based on statistically unusual customer behaviour.
[This article is based on a scikit-learn DecisionTreeClassifier using the CART algorithm]