Detecting fraud in utility billing

An energy utility approached us with an interesting problem:

“We know our meter readings are incorrect. This is for various reasons, but fraud is a key component. We don’t, however, have the concrete proof we need to act on this.”

Part of their problem was their inexperience with the tools and analyses needed to identify such patterns. The other was the volume of data: the meter readings for just one city ran to 2 gigabytes.

We took the data in the raw database format, extracted it, and ran it through our toolset. The first step was to look at the frequency of subscribers at various meter readings.


It looks mostly like a log-normal distribution, except for the large spikes at 50 units, 100 units and 200 units. Interestingly, those are exactly the slab boundaries. A subscriber who consumes even one unit more than 50 pays at a higher rate plan, and similarly at 100 and 200.
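Computing this frequency distribution is straightforward. A minimal sketch in Python, with toy readings standing in for the real data:

```python
from collections import Counter

def reading_histogram(readings):
    """Count how many subscriber-months fall at each meter reading value."""
    return Counter(readings)

# Toy readings standing in for the real data: note the pile-up at 50.
readings = [42, 48, 50, 50, 50, 50, 51, 55, 63, 71]
hist = reading_histogram(readings)
print(hist[50])  # -> 4
```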

It is statistically impossible (p &lt; 10⁻¹⁸) for this to happen by chance. This clearly shows fraud of some kind.
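The significance claim can be sanity-checked with a one-sided tail test. A sketch using the normal approximation to a Poisson count; the numbers and the choice of test here are illustrative, not the exact analysis we ran:

```python
import math

def spike_p_value(observed, expected):
    """One-sided probability of seeing `observed` or more events when
    `expected` are predicted, via the normal approximation to the Poisson."""
    z = (observed - expected) / math.sqrt(expected)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative counts: ~1M subscribers expected at a reading, 1.5M seen.
p = spike_p_value(observed=1_500_000, expected=1_000_000)
print(p < 1e-18)  # -> True: the spike cannot be a chance fluctuation
```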

This is not “random fraud” either – it’s not a random set of people benefitting from these just-at-the-boundary readings. There is a relatively small group of people who consistently report the same set of readings. Here are the monthly meter readings of 10 subscribers:


Notice the pattern on the first row. 200, 200, 200, 200, 200… such precision in usage would be admirable if it were believable.
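That per-subscriber consistency can itself be turned into a filter. A sketch, where the boundary set and the six-month threshold are assumptions chosen for illustration:

```python
SLAB_BOUNDARIES = {50, 100, 200}

def suspicious_subscribers(monthly_readings, min_months=6):
    """Return subscriber IDs whose reading sits exactly on a slab
    boundary in at least `min_months` of the months supplied."""
    flagged = []
    for sub_id, readings in monthly_readings.items():
        on_boundary = sum(1 for r in readings if r in SLAB_BOUNDARIES)
        if on_boundary >= min_months:
            flagged.append(sub_id)
    return flagged

data = {
    "A001": [200] * 12,  # the first-row pattern: 200, 200, 200...
    "A002": [48, 52, 61, 55, 49, 58, 63, 50, 47, 59, 62, 54],
}
print(suspicious_subscribers(data))  # -> ['A001']
```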

Also interesting are the smaller spikes at 10, 20, 30, … 90. The spikes at 50 and 100 have an economic reason; these smaller ones appear to have none. In this case, a different vice was suggested: laziness. These would represent meter readings that were never taken in the first place, and were just entered as round numbers.
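The same histogram separates the two vices: spikes at slab boundaries point to fraud, spikes at other multiples of 10 point to readings that were never taken. A sketch, where the 2×-over-neighbours threshold is an assumption:

```python
def round_number_spikes(hist, slab_boundaries=(50, 100, 200)):
    """Multiples of 10 that are not slab boundaries but still tower
    over their neighbours -- the signature of invented readings."""
    spikes = []
    for r, count in hist.items():
        if r % 10 != 0 or r in slab_boundaries:
            continue
        neighbours = (hist.get(r - 1, 0) + hist.get(r + 1, 0)) / 2
        if count > 2 * neighbours:  # assumed threshold
            spikes.append(r)
    return sorted(spikes)

hist = {29: 100, 30: 400, 31: 110, 49: 300, 50: 900, 51: 300}
print(round_number_spikes(hist))  # -> [30]; 50 is excluded as a slab boundary
```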

So we have a mechanism to detect not just fraud, but laziness too!

To measure the fraud and narrow it down by region, we took the height of the spike as a proxy for its extent. For example, if on average 1 million subscribers have a reading of 99 or 101, but 1.5 million subscribers have a reading of 100, the extent of fraud is:

Fraud = (1.5 million – 1 million) / 1 million = 50%.
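As code, this is a direct translation of the formula above, with the neighbouring readings averaged as the baseline:

```python
def fraud_extent(hist, boundary):
    """(spike - baseline) / baseline, where the baseline is the average
    count at the readings just below and just above the boundary."""
    baseline = (hist.get(boundary - 1, 0) + hist.get(boundary + 1, 0)) / 2
    return (hist.get(boundary, 0) - baseline) / baseline

hist = {99: 1_000_000, 100: 1_500_000, 101: 1_000_000}
print(fraud_extent(hist, 100))  # -> 0.5, i.e. 50%
```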

We then plotted the extent of fraud by different sections in a city.


This is sorted in descending order of fraud. Section 1 has fraud of around 100% – which means there are nearly twice as many subscribers with a reading of 100 as compared to 99 or 101. While this number has fluctuated a bit, it’s remained quite high right through.

In contrast, Section 9 has relatively less fraud – ranging just up to 37%. (That’s still a huge number, of course.)

Section 5 shows a strange pattern. In June 2010, fraud dipped dramatically. Then, almost as if to make up, it shot back up in September 2010. In our discussions, we identified the cause behind this, but we’ll leave it for you to work out as a lateral thinking puzzle: what do you think caused the anomalous pattern in Section 5?