Detecting fraud in utility billing

An energy utility approached us with an interesting problem:

“We know our meter readings are incorrect. This is for various reasons, but fraud is a key component. We don’t, however, have the concrete proof we need to act on this.”

Part of their problem was inexperience with the tools and analyses needed to identify such patterns. The other part was the volume of data: the meter readings for just one city came to 2 gigabytes.

We took the data in the raw database format, extracted it, and ran it through our toolset. The first step was to look at the frequency of subscribers at various meter readings.


It looks mostly like a log-normal distribution, except that there are large spikes at 50 units, 100 units and 200 units. Interestingly, those are exactly the slab boundaries. Subscribers who consume even one unit more than 50 would pay at a higher rate, and similarly for 100 and 200.

It is statistically impossible (p < 10^-18) for this to be a normal event. This clearly shows fraud of some kind.
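The post does not say which test produced that p-value. One plausible way to get a figure of this kind, sketched here under stated assumptions, is to treat the count at a boundary as Poisson with a mean equal to the count expected from the neighbouring readings, and use a normal approximation for the upper tail. The function name and the counts below are hypothetical.

```python
import math

def spike_p_value(observed, expected):
    """One-sided p-value for a count of `observed` or more when the
    count is Poisson with mean `expected` (normal approximation)."""
    z = (observed - expected) / math.sqrt(expected)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p

# Hypothetical counts: 10,000 subscribers expected at the boundary
# reading, 11,000 observed.
z, p = spike_p_value(observed=11_000, expected=10_000)
print(f"z = {z:.1f}, p = {p:.2e}")
```

Even a 10% excess is ten standard deviations out at this scale, which is why large subscriber counts make such tiny p-values reachable.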

This is not “random fraud” either: it’s not a random set of people who benefit from these just-at-the-boundary slab readings. There is a relatively small group of people who consistently have the same set of readings. Here are the monthly meter readings of 10 subscribers:


Notice the pattern on the first row: 200, 200, 200, 200, 200… Such precision in usage would be admirable if it were believable.
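A simple way to surface such subscribers is to compute, for each one, the share of monthly readings that land exactly on a slab boundary. The sketch below assumes the readings are available as a dictionary of subscriber id to a list of monthly values; the ids, the readings and the 80% cut-off are invented for illustration.

```python
SLAB_BOUNDARIES = {50, 100, 200}  # the slab limits named in the text

def boundary_share(monthly_readings):
    """Fraction of a subscriber's readings that sit exactly on a slab boundary."""
    if not monthly_readings:
        return 0.0
    hits = sum(1 for r in monthly_readings if r in SLAB_BOUNDARIES)
    return hits / len(monthly_readings)

def suspicious_subscribers(history, min_share=0.8):
    """Return the ids (sorted) whose boundary share is at least `min_share`."""
    return sorted(sid for sid, readings in history.items()
                  if boundary_share(readings) >= min_share)

# Hypothetical subscribers and readings.
history = {
    "A": [200, 200, 200, 200, 200, 200],
    "B": [143, 157, 162, 139, 171, 150],
    "C": [100, 100, 98, 100, 100, 100],
}
print(suspicious_subscribers(history))  # prints ['A', 'C']
```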

What’s also interesting are the smaller spikes at 10, 20, 30, … 90. For the spikes at 50 and 100, there’s an economic reason. For these smaller spikes, there appears to be none. Instead, a different vice was suggested: laziness. These would represent meter readings that were never taken in the first place, and were just entered as round numbers.

So we have a mechanism to detect not just fraud, but laziness too!
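One way to put a number on the laziness signal is to ask how often readings end in zero. The sketch below assumes that honestly taken readings have a roughly uniform last digit, so about 10% should be multiples of ten; that baseline is our assumption for this example, not a figure from the analysis. Slab boundaries are left out so the economic effect is not counted as laziness. The function name and sample readings are hypothetical.

```python
def round_number_excess(readings, exclude=(50, 100, 200)):
    """Share of readings that are multiples of 10, minus the 10% expected
    if last digits were uniform. Readings in `exclude` are ignored."""
    kept = [r for r in readings if r not in exclude]
    if not kept:
        return 0.0
    share = sum(1 for r in kept if r % 10 == 0) / len(kept)
    return share - 0.10

# Made-up readings from one meter reader's route.
route = [10, 20, 23, 30, 37, 41, 50, 60, 66, 70]
print(f"excess of round readings: {round_number_excess(route):.0%}")
```

A value near zero is what careful reading should produce; a large positive value suggests numbers were typed in rather than read off the meter.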

To measure the fraud and narrow it down by region, we took the height of the spike as a proxy for the extent of fraud. So if the average number of subscribers with a reading of 99 or 101 is 1 million, but 1.5 million customers had a reading of 100, the extent of fraud is:

Fraud = (1.5 million - 1 million) / 1 million = 50%.
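The same calculation can be written as a small function. This is a sketch of the formula above, with the function name and the dictionary layout (reading mapped to subscriber count) chosen for the example.

```python
def fraud_extent(freq, boundary):
    """Excess of subscribers at `boundary` relative to the average count
    at the two adjacent readings, as a fraction (0.5 means 50%)."""
    baseline = (freq.get(boundary - 1, 0) + freq.get(boundary + 1, 0)) / 2
    if baseline == 0:
        raise ValueError("no subscribers at the neighbouring readings")
    return (freq.get(boundary, 0) - baseline) / baseline

# The worked example from the text: 1 million subscribers on average
# at 99 and 101, and 1.5 million at 100.
freq = {99: 1_000_000, 100: 1_500_000, 101: 1_000_000}
print(f"{fraud_extent(freq, 100):.0%}")  # prints 50%
```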

We then plotted the extent of fraud by different sections in a city.


This is sorted in descending order of fraud. Section 1 has fraud of around 100%, which means there are nearly twice as many subscribers with a reading of 100 as with a reading of 99 or 101. While this number has fluctuated a bit, it has remained quite high throughout.

In contrast, Section 9 has relatively less fraud, ranging just up to 37%. (That’s still a huge number, of course.)
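To get a figure like this per section and per month, the readings can be grouped before applying the spike measure. The sketch below assumes each record is a (section, month, reading) tuple; that layout and the sample records are invented for illustration, with counts picked so the results are easy to check by hand.

```python
from collections import Counter, defaultdict

def fraud_by_section(records, boundary=100):
    """Map (section, month) to the excess at `boundary` relative to the
    average count at the two adjacent readings."""
    counts = defaultdict(Counter)
    for section, month, reading in records:
        counts[(section, month)][reading] += 1
    result = {}
    for key, freq in counts.items():
        baseline = (freq[boundary - 1] + freq[boundary + 1]) / 2
        if baseline > 0:  # skip groups with no neighbouring readings
            result[key] = (freq[boundary] - baseline) / baseline
    return result

# Hypothetical records for two sections in one month.
records = (
    [("Section 1", "2010-06", 99)] * 2
    + [("Section 1", "2010-06", 100)] * 4
    + [("Section 1", "2010-06", 101)] * 2
    + [("Section 9", "2010-06", 99)] * 4
    + [("Section 9", "2010-06", 100)] * 5
    + [("Section 9", "2010-06", 101)] * 4
)
result = fraud_by_section(records)
# Sort in descending order of fraud, as in the chart described above.
for (section, month), extent in sorted(result.items(),
                                       key=lambda kv: kv[1], reverse=True):
    print(f"{section} {month}: {extent:.0%}")
```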

Section 5 shows a strange pattern. In June 2010, fraud dipped dramatically. Then, almost as if to make up for it, it shot back up in September 2010. In our discussions, we identified the cause behind this, but we’ll leave it for you to work out as a lateral thinking puzzle: what do you think caused the anomalous pattern in Section 5?
