A guide to online data plotting

http://www.livemint.com/2012/01/24211809/A-guide-to-online-data-plottin.html


When you’re deal­ing with com­plex data, visu­al­iz­a­tion tools can help you sim­pli­fy it and, more im­port­antly, spot key trends and gain new in­sights

Shweta Taneja

Sales fig­ures, con­sumer be­ha­vi­our and mar­ket re­search – the work we do of­ten in­volves un­der­stand­ing and com­mu­nic­at­ing a lot of com­plex in­form­a­tion. To make good de­cisions, you need to be able to un­der­stand the data, and quickly. Visualization tools can sim­pli­fy data, and make it easi­er to un­der­stand and spot key trends.

According to Deloitte’s “Tech Trends 2011: The Natural Convergence of Business and IT” re­port re­leased in March, data-visualization tools were the fast­est de­vel­op­ing area in soft­ware last year.

Data in, graphic out: Visual representations of data are easier to understand.


“Data visu­al­iz­a­tion com­presses in­form­a­tion quickly,” says S. Anand, 37, chief data sci­ent­ist, Gramener, a Hyderabad-based data-visualization com­pany. “For ex­ample, in a chart, a bar can give you a data set with its height, col­our and thick­ness, so you have already com­pressed a table with three columns in­to one graph,” he ex­plains. “A 40-page re­port can eas­ily be con­ver­ted in­to a single page of graph­ics.” By do­ing this, a large amount of data be­comes eas­ily ac­cess­ible, and trends and high­lights are easy to pick out, com­pared to a table of num­bers.

“Data-visualization tools are typically designed to highlight relevant insights, rather than just present raw data as in a dashboard,” explains Stewart Langille, co-founder, Visual.ly, a new online visualization tool. Another useful aspect of viewing data as visuals is that you can highlight the information that’s really important and even get new, completely unexpected insights into the data sets.

Like the idea? We list some of the most in­nov­at­ive on­line data-visualization tools:

Tableau Public

Website: www.tableausoftware.com/public

After you install the software, you enter the data either as a spreadsheet or database file (Microsoft Excel, Microsoft Access) or as a tab-delimited text file. The software reads the file to identify variables. Once you choose the relevant variables, it creates a visual chart of your data. The software automatically tries to pick the right kind of chart, but you can also manually choose from options such as bar charts, histograms, scatter plots, bubble charts, pie charts, bullet graphs, maps and heat matrices. Tableau charts can also be interactive, so viewers can rearrange the data to analyse it from different perspectives. The chart is saved on www.tableausoftware.com.

The down­side is that visu­al­iz­a­tions and data are pub­lic — any­one can down­load your work. To keep it private, and for ad­ded fea­tures such as more fil­ters and rep­res­ent­a­tions, you could buy the Personal Edition for $999 (around Rs. 50,300), or the Professional Edition for $1,999.

Cost: Free to use, with paid edi­tions start­ing at $999.

Many Eyes

Website: www-958.ibm.com

Many Eyes, launched in January 2007, is one of the first data-visualization tools, and was cre­ated by IBM Research. You have to up­load your data to the site, and can view it as a scat­ter plot, mat­rix chart, net­work dia­gram, bar chart, block his­to­gram, bubble chart, graph, pie chart, tree­map, and many oth­er visu­al­iz­a­tions. The up­load pro­cess is cum­ber­some, though — you can copy-paste, but only from a prop­erly format­ted text file, not a spread­sheet. It ac­cepts a spe­cific style of rows and column data. So even if you have a spread­sheet, you might need to ed­it it to make sure Many Eyes un­der­stands your data.

As with Tableau Public, whatever data you up­load be­comes pub­lic prop­er­ty, but un­like Tableau, there is no paid, private op­tion.

Cost: Free to use

Spotfire Analytics

Website: spotfire.tibco.com

The soft­ware cre­ates an in­ter­act­ive dash­board which shows your data and your company’s pro­gress to you at a glance through 3D graphs. You can also use the dash­board to give your cli­ents or in­vestors a clear pic­ture of ex­actly what they are in­vest­ing in. All you need to do is drag and drop your Excel or CSV-formatted text files in­to Spotfire, and then start play­ing around with your data. The ana­lys­is can be shared or em­bed­ded on web­sites, blogs or so­cial net­works, and there is no down­load needed as it runs off the Web. A paid ser­vice, Spotfire keeps your data private.

Cost: Starts from $199 per month.

QlikView

Website: www.qlikview.com

QlikView is meant to help you find answers to business problems through data analysis. The tool offers comparative analysis of a product or person. It also has versions for smartphones and tablets, and is well suited to use on touch screens.

Cost: QlikView Personal Edition soft­ware is free, but you can only use it to ana­lyse data, not share your res­ults. To share data, you need the en­ter­prise ver­sion. Pricing var­ies, so con­tact www.qlikview.com

Visual.ly

Website: Visual.ly

A col­lec­tion of in­fograph­ics from vari­ous pro­fes­sion­als in the in­dustry, Visual.ly al­lows users to eas­ily share charts and in­fograph­ics. In March, the com­pany plans to launch a free on­line tool to con­vert data in­to visu­als — there is a small de­mo avail­able on the web­site where you can con­vert your Twitter feed in­to an in­fograph­ic about you.

The input data has to be in a known format such as an Excel or CSV file. Users need to create a login for the tool, input their data and select from a choice of templates to customize.

Cost: Free. Paid pack­ages are ex­pec­ted but have not yet been an­nounced.

FusionCharts Suite

Website: www.fusioncharts.com

FusionCharts Suite is an Indian commercial visualization tool which can convert any database or Web script into Flash or HTML5 charts, gauges and maps. Creating the chart takes 15 minutes, and you can choose from over 90 types of charts. The visualization helps you analyse data by providing trendlines, colour ranges and number scales. You can choose subsets, add tool tips, export charts and edit them visually. Once you are ready, the chart is rendered and can be shared or embedded in a website. You can choose whether the chart is editable by others, and can use the software to create real-time charts that update automatically.

Cost: One-time pur­chase for com­mer­cial use starts at $1,299. For per­son­al use, you can get a li­cence for $499. Also avail­able as ex­ten­sions for Flex (start­ing at $299), Dreamweaver (start­ing at $69) and VisualBasic 6 (start­ing at $99).

When to invest

Sometimes, tim­ing is everything in in­vest­ments.

Last year, The New York Times pub­lished a piece titled In Investing, It’s When You Start And When You Finish. This showed the sig­ni­fic­ant im­pact of tim­ing in in­vest­ment de­cisions.

At Gramener, we ap­plied the same visu­al­isa­tion to a few Indian stocks over the last 5 years.

Here’s what it looks like for ICICI’s stock.

If you invested in ICICI stock in Jan 2007, the first row of boxes shows the kind of returns you would have seen.

The col­ours in­dic­ate the de­gree of profit or loss. Red for losses, green for profits, and yel­low for neut­ral val­ues. Selling in March 2007 would have made sig­ni­fic­ant losses. Selling in Jan 2008, one year later, would have given you a good profit. And so on.

The same is ex­ten­ded to in­vest­ments made in oth­er months.

The black boxes show a hold­ing pat­tern of 1 year, 2 years, etc. You can get a sense of what kind of re­turns you would make with a strategy of hold­ing for 1 year, 2 years, and so on.
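The grid described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the Gramener visualisation: the function names and the sample prices are hypothetical, and returns are computed as a simple price change, ignoring dividends and transaction costs.

```python
def returns_grid(prices):
    """Build grid[buy][sell]: the fractional return from buying in month
    `buy` and selling in a later month `sell`. Cells where sell <= buy
    stay None, since you cannot sell before you buy."""
    n = len(prices)
    grid = [[None] * n for _ in range(n)]
    for buy in range(n):
        for sell in range(buy + 1, n):
            grid[buy][sell] = (prices[sell] - prices[buy]) / prices[buy]
    return grid


def holding_period_returns(grid, months):
    """Return the cells for a fixed holding period: one value per purchase
    month, each sold exactly `months` later. These correspond to the
    black boxes in the picture."""
    return [grid[buy][buy + months] for buy in range(len(grid) - months)]


# Hypothetical month-end prices, not real ICICI data.
prices = [100.0, 80.0, 120.0, 150.0]
grid = returns_grid(prices)

# Each row is one purchase month; a colour scale (red for negative,
# green for positive) would then be applied to these values.
for buy, row in enumerate(grid):
    print(buy, row)
```

Each row answers "what if I had bought in this month?", and each fixed holding period traces a diagonal across the rows.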

Here are sim­il­ar pic­tures for Infosys stock and SBI stock.

At Gramener, we took a look at a number of such stocks and their performance over the last five years. An interactive app showcasing a sample of those is available at http://gramener.com/whentoinvest/.

Detecting fraud in utility billing

An en­ergy util­ity ap­proached us with an in­ter­est­ing prob­lem:

We know our meter read­ings are in­cor­rect. This is for vari­ous reas­ons, but fraud is a key com­pon­ent. We don’t, how­ever, have the con­crete proof we need to act on this.

Part of their problem was inexperience with the tools and analyses needed to identify such patterns. The other part was the volume of data: the meter readings for just one city ran to 2 gigabytes.

We took the data in the raw data­base form­at, ex­trac­ted it, and ran it through our tool­set. The first step was to look at the fre­quency of sub­scribers at vari­ous meter read­ings.
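As a rough illustration of that first step, here is a minimal Python sketch that counts how many subscribers recorded each reading. The function name and the sample readings are made up for the example; it says nothing about the actual toolset or database format used.

```python
from collections import Counter


def reading_frequency(readings):
    """Count how many subscribers recorded each meter reading (in units)."""
    return Counter(readings)


# Hypothetical readings, one per subscriber, not the utility's data.
readings = [48, 49, 50, 50, 50, 51, 99, 100, 100, 101]
freq = reading_frequency(readings)

# Print the frequency table in order of reading; plotting these counts
# against the reading gives the frequency chart.
for units in sorted(freq):
    print(units, freq[units])
```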

[Figure: frequency of subscribers at each meter reading]

It looks mostly like a log-normal distribution, except that there are large spikes at 50 units, 100 units and 200 units. Interestingly, those are exactly the slab boundaries. Subscribers who consume even one unit more than 50 would pay at a higher rate plan, and similarly for 100 and 200.

It is statistically impossible (p < 10^-18) for this to be a normal event. This clearly shows fraud of some kind.

This is not “random fraud” either – it’s not a random set of people that are benefitting from this just-at-the-boundary slab reading. There is a relatively small group of people who consistently have the same set of readings. Here are the monthly meter readings of 10 subscribers:

[Figure: monthly meter readings of 10 subscribers]

Notice the pat­tern on the first row. 200, 200, 200, 200, 200… such pre­ci­sion in us­age would be ad­mir­able if it were be­liev­able.
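One way to pick out such subscribers is to measure how often each one's monthly reading lands exactly on a slab boundary. The sketch below is an assumption about how this could be done, not the method actually used: the 80% threshold, the function names and the sample data are all hypothetical. Only the boundaries of 50, 100 and 200 units come from the analysis above.

```python
SLAB_BOUNDARIES = {50, 100, 200}  # slab boundaries mentioned in the text


def boundary_share(monthly_readings):
    """Fraction of a subscriber's readings that sit exactly on a boundary."""
    hits = sum(1 for r in monthly_readings if r in SLAB_BOUNDARIES)
    return hits / len(monthly_readings)


def suspicious_subscribers(readings_by_subscriber, min_share=0.8):
    """Subscribers whose readings land on a boundary at least `min_share`
    of the time. The 0.8 default is an arbitrary illustrative threshold."""
    return sorted(
        subscriber
        for subscriber, readings in readings_by_subscriber.items()
        if readings and boundary_share(readings) >= min_share
    )


# Hypothetical subscribers with six months of readings each.
readings_by_subscriber = {
    "A": [200, 200, 200, 200, 200, 200],
    "B": [187, 200, 213, 195, 204, 199],
    "C": [100, 100, 100, 100, 98, 100],
}
flagged = suspicious_subscribers(readings_by_subscriber)
print(flagged)
```

A subscriber who hits a boundary once, like B, is not flagged; it is the month-after-month repetition that stands out.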

What’s also in­ter­est­ing are the smal­ler spikes at 10, 20, 30, … 90. For the spikes at 50 and 100, there’s an eco­nom­ic reas­on. For these smal­ler spikes, there ap­pears to be no eco­nom­ic reas­on. However, in this case, a dif­fer­ent vice was sug­ges­ted: lazi­ness. These would rep­res­ent meter read­ings that were nev­er taken in the first place, and were just entered as round num­bers.
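A simple check for this kind of rounding is to look at the share of readings that are multiples of 10 but not slab boundaries. The sketch below uses hypothetical names and data. Treating roughly 10% as the share to expect from honest readings is itself an assumption (that the last digit is close to uniformly distributed), not something established by the analysis.

```python
SLAB_BOUNDARIES = {50, 100, 200}  # excluded: these have an economic motive


def round_number_share(readings, step=10):
    """Fraction of readings that are exact multiples of `step` while not
    being slab boundaries."""
    rounded = sum(
        1 for r in readings if r % step == 0 and r not in SLAB_BOUNDARIES
    )
    return rounded / len(readings)


# Hypothetical readings for one meter reader's route.
route_readings = [10, 20, 33, 50, 30, 58, 61, 70, 82, 100]
share = round_number_share(route_readings)
print(f"{share:.0%} of readings are round numbers")
```

A share far above one in ten on a given route would suggest readings that were entered rather than taken.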

So we have a mech­an­ism to de­tect not just fraud, but lazi­ness too!

To measure the fraud and narrow it down by region, we took the height of the spike as a proxy for the extent of fraud. So if the average number of subscribers with a reading of 99 or 101 is 1 million, but 1.5 million customers had a reading of 100, the extent of fraud is:

Fraud = (1.5 mil­lion – 1 mil­lion) / 1 mil­lion = 50%.
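The same calculation can be written as a small Python function. The function name and the dictionary layout are illustrative assumptions. The arithmetic follows the formula above: the baseline is the average count at the readings immediately below and above the boundary.

```python
def fraud_extent(freq, boundary):
    """Excess of subscribers at `boundary` over the average count of its
    two neighbouring readings, as a fraction of that average.

    `freq` maps a meter reading to the number of subscribers with it.
    """
    baseline = (freq.get(boundary - 1, 0) + freq.get(boundary + 1, 0)) / 2
    if baseline == 0:
        raise ValueError("no subscribers at neighbouring readings")
    return (freq.get(boundary, 0) - baseline) / baseline


# The worked example from the text: 1 million on either side of 100,
# and 1.5 million at exactly 100.
freq = {99: 1_000_000, 100: 1_500_000, 101: 1_000_000}
extent = fraud_extent(freq, 100)
print(f"Fraud = {extent:.0%}")
```

Running the function once per section, with that section's own frequency table, gives a figure that can be compared across a city.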

We then plot­ted the ex­tent of fraud by dif­fer­ent sec­tions in a city.

[Figure: extent of fraud by section]

This is sor­ted in des­cend­ing or­der of fraud. Section 1 has fraud of around 100% – which means there are nearly twice as many sub­scribers with a read­ing of 100 as com­pared to 99 or 101. While this num­ber has fluc­tu­ated a bit, it’s re­mained quite high right through.

In con­trast, Section 9 has re­l­at­ively less fraud – ran­ging just up to 37%. (That’s still a huge num­ber, of course.)

Section 5 shows a strange pattern. In June 2010, fraud dipped dramatically. Then, almost as if to make up, it shot back up in September 2010. In our discussions, we identified the cause behind this, but we’ll leave this for you to work out as a lateral thinking puzzle: what do you think caused the anomalous pattern in Section 5?