Data visualisation course at IIIT

We are offering a Data Visualisation course at IIIT Hyderabad and JNTU Hyderabad as part of the Master of Science in Information Technology (MSIT) outreach programme. This programme is offered by a consortium of universities in collaboration with Carnegie Mellon, with the support of the State Government of Andhra Pradesh.

Through this partnership, Gramener is collaborating to create course content, design the curriculum to industry standards, and jointly execute projects on predictive analytics and data visualisation.

The course has 5 mod­ules:

  1. Handling big data
    • How to scrape data from ex­tern­al sources
    • How to parse and trans­form it in­to a form­at you need
  2. Analysis
    • Segmentation
    • Predictive ana­lyt­ics
  3. Vector graph­ics
    • Drawing graphs us­ing SVG
    • Tools to ma­nip­u­late SVG
  4. Templates
    • Programmatically cre­at­ing graphs us­ing tem­plates
    • Using data to drive the tem­plates
  5. Gramener visu­al­isa­tion server
    • Using lib­rar­ies to cre­ate visu­al­isa­tions

The course is also avail­able on­line to those who are in­ter­ested. You may email us at contact@gramener.com to ac­cess the con­tent, ex­er­cises and videos.

The top Indian one-day batsmen

I’ve always been curious… who among India’s prolific one-day run-getters had a good strike rate? The picture below shows the top 50 ODI run-getters for India.

batting-plain-summary

Each little square indicates one player. The size indicates the number of runs scored, and the colour indicates the average strike rate (red is low, green is high).

Firstly, you can see that Sachin, apart from being a prolific run-getter, has a slightly above-average strike rate. The same can’t be said of the next three: Saurav, Azhar and Rahul. The three after them, however (Yuvraj, Sehwag and Dhoni), are as fast as or faster than Sachin – especially Sehwag.

We do have a few slow scorers there (Sunil Gavaskar, Ravi Shastri, Mohinder Amarnath, Dilip Vengsarkar, etc.), but allowance must be made for the increase in run rate over time:

In the 1975 World Cup, for in­stance, the av­er­age run-rate in the en­tire tour­na­ment was 3.91 runs per over; in the next edi­tion, in 1979, it dropped to 3.54. Compare that with the run-rate in the most re­cent edi­tion of the World Cup, when the over­all tour­na­ment scor­ing rate ex­ceeded five for the first time, and it’s ob­vi­ous that the way the ODI is played has changed hugely over 35 years.

For Indian batsmen, the strike rate seems to go up by about 3.4% every decade. Adjusting for that, this is what the picture looks like:
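The post does not say exactly how the adjustment was applied, so here is only a minimal sketch of one plausible approach: scale each strike rate up to a common reference year, assuming the 3.4% per decade compounds. The function name, the idea of using a player's mid-career year, and the reference year of 2010 are all assumptions made for illustration.

```python
def adjust_strike_rate(strike_rate, mid_career_year,
                       reference_year=2010, growth_per_decade=0.034):
    """Scale a strike rate to what it might be worth in reference_year.

    Assumes (hypothetically) that scoring rates compound at
    growth_per_decade every ten years.
    """
    decades = (reference_year - mid_career_year) / 10
    return strike_rate * (1 + growth_per_decade) ** decades


# A strike rate of 60 from a career centred on 1980, in 2010 terms
print(round(adjust_strike_rate(60, 1980), 1))
```

With these assumptions, an older player's strike rate is lifted by roughly 10% for a career three decades before the reference year, which is why the earlier generation looks a little better after adjusting.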

batting-adjusted-summary

These play­ers do look a bit bet­ter now, but they’re still fairly slow. The big ex­cep­tion of that gen­er­a­tion was Kapil Dev. His strike rate is the only one that rivals Virender Sehwag’s rate today.

Based on this pic­ture, if I were to pick the top 3 fast run-getters across time, I’d pick Kapil Dev, Sehwag and Yusuf Pathan. The slowest would prob­ably be Mohinder Amarnath, Manoj Prabhakar and Sadagopan Ramesh.

We can drill a little deeper into their performance, at the match level. In the picture below, each box is a match, colour-coded by strike rate.

batting-plain-detailed

A pat­tern emerges here: higher totals (on the top left for each play­er) are scored at a higher strike rate. This isn’t par­tic­u­larly sur­pris­ing, how­ever.

Another in­ter­est­ing view is to see how our bats­men fare again­st the rest of the world. On an ad­jus­ted basis, this is what it looks like:

batting-adjusted-world

Shahid Afridi, with an average strike rate of 115, stands way above the rest – and the second player on this list is Sehwag. Interestingly, Afridi has slightly fewer runs than the legendary Viv Richards, but these have been scored at a much faster rate than even the master blaster.

Visit gramener.com/cricket to see the crick­et visu­al­isa­tion live.

The visu­al­isa­tion you just saw is a Treemap. It’s a very power­ful way of com­par­ing ele­ments in a hier­archy with re­l­at­ive im­port­ance. Some oth­er ways you can use tree­maps in a busi­ness con­text are:

  • Profitability by busi­ness unit. The col­our in­dic­ates profits, and size in­dic­ates sales. Large un­prof­it­able units stand out clearly.
  • Sales growth by cat­egory. The size in­dic­ates the cat­egory sales, and col­our in­dic­ates growth over a peri­od.
  • Risk by cus­tom­er seg­ment. The size in­dic­ates the ex­pos­ure to each seg­ment / sub-segment. The col­our in­dic­ates de­gree of risk.
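To make the "size indicates one measure" idea concrete, here is a minimal sketch of the simplest treemap layout, usually called slice-and-dice: a rectangle is split into strips whose areas are proportional to each item's size. This is not necessarily how the cricket treemap above was drawn (production treemaps often use a layout that keeps the boxes closer to square), and the business units and sales figures are made up.

```python
def slice_and_dice(items, x, y, width, height, vertical=True):
    """Split a rectangle into strips proportional to each item's size.

    items is a list of (label, size) pairs.
    Returns a list of (label, x, y, width, height) rectangles.
    """
    total = sum(size for _, size in items)
    rects = []
    offset = 0.0
    for label, size in items:
        share = size / total
        if vertical:
            # Strips side by side, each taking a share of the width
            rects.append((label, x + offset, y, width * share, height))
            offset += width * share
        else:
            # Strips stacked, each taking a share of the height
            rects.append((label, x, y + offset, width, height * share))
            offset += height * share
    return rects


# Hypothetical sales by business unit
units = [("Retail", 50), ("Wholesale", 30), ("Online", 20)]
for rect in slice_and_dice(units, 0, 0, 100, 10):
    print(rect)
```

For a hierarchy, the same function can be applied again inside each rectangle, alternating the `vertical` flag at each level. The colour of each box would come from a second measure, such as profit or growth.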

Charting one-dimensional data linearly

How many ways are there of looking at a series of data? Consider this rainfall data:

chart-table

We have rain­fall for every dis­trict in Tamil Nadu for every month over the last 5 years. That’s 60 data points per dis­trict. How many ways are there of plot­ting it?

In this post, we’ll look at 10 ways you can rep­res­ent a sim­ple series – in a straight line.

Data Bars

chart-bars

These are a quick way of plotting bar graphs within the cells. The eye is naturally drawn to numbers with large values. It’s an easy way of locating big numbers and, in particular, of comparing data across series. But it isn’t very easy to find trends within a series.

Colour scales

chart-gradient

These shade each cell with a col­our gradi­ent. Red for low, green for high. While they’re much worse at ex­act com­par­is­ons, they’re much bet­ter at help­ing identi­fy trends – both with­in a series and across.
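As a rough illustration of how a cell's value turns into a shade, here is a minimal sketch that blends linearly from red to green. Spreadsheet colour scales typically offer other blends too (for example, passing through yellow), so treat the straight red-to-green mix and the function name as assumptions.

```python
def colour_scale(value, low, high):
    """Return a hex colour between red (at low) and green (at high)."""
    if high == low:
        t = 0.5  # No range to compare against: use the midpoint colour
    else:
        t = (value - low) / (high - low)
    t = max(0.0, min(1.0, t))  # Clamp values outside the range
    red = round(255 * (1 - t))
    green = round(255 * t)
    return f"#{red:02x}{green:02x}00"


# Hypothetical monthly rainfall values, shaded on a 0 to 400 scale
for rainfall in [0, 100, 400]:
    print(rainfall, colour_scale(rainfall, 0, 400))
```

Using the same `low` and `high` for every cell in the table is what makes the shades comparable both within a series and across series.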

Heatmap

chart-heatmap

The colour scales can be shrunk without much loss of information if we’re more interested in the trend than in the numbers.

This heat­map is a com­pact way of com­par­ing in­form­a­tion over time, and across dis­tricts. Reading left-to-right, the pat­terns of growth, de­cline or sea­son­al­ity can be ob­served. Reading top-to-bottom, patches of high or low that cut across data series be­come evid­ent.

This is a sim­pli­fied one-dimensional ver­sion of the tra­di­tion­al heat­map which typ­ic­ally shows data in two di­men­sions.

Bar chart

chart-barchart

If be­ing able to com­pare quant­it­ies with­in a series be­comes im­port­ant, one can use bar charts in­stead.

The bar chart shown here is a vari­ant of the tra­di­tion­al bar chart. It does away with the ho­ri­zont­al and ver­tic­al axes, as well as the la­bels, and just shows the bars.

This is an ex­ample of a micro-chart, the most clas­sic ex­ample of which is the spark­line. Microsoft has in­tro­duced a num­ber of these micro-charts in Excel 2010. This is one of the sig­ni­fic­ant up­grades in Excel 2010’s chart­ing cap­ab­il­it­ies.

Sparkline

chart-sparkline

Sparklines are among the earli­est mi­crocharts, ini­tially cre­ated by Edward Tufte. They are the equi­val­ent of line graphs, but without the la­bels and axes.

These make it very easy to compare trends within a series. However, comparing across series may not be easy. In fact, it would not be possible at all unless the sparklines are drawn to a common scale.

Sparklines are, how­ever, par­tic­u­larly ef­fect­ive in rep­res­ent­ing growth or de­cline trends in a com­pact fash­ion.
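The point about scale is easy to see in a text-only approximation. The sketch below draws a series using Unicode block characters (so it is closer to a tiny bar chart than a true line), and the optional `low` and `high` arguments are an assumed way of forcing several series onto a shared scale.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values, low=None, high=None):
    """Render values as a row of block characters.

    By default each series is scaled to its own min and max.
    Pass the same low and high to make series comparable.
    """
    low = min(values) if low is None else low
    high = max(values) if high is None else high
    span = high - low
    chars = []
    for v in values:
        if span == 0:
            index = 0  # Flat series: draw the lowest block throughout
        else:
            index = int((v - low) / span * (len(BLOCKS) - 1) + 0.5)
            index = max(0, min(len(BLOCKS) - 1, index))
        chars.append(BLOCKS[index])
    return "".join(chars)


# Two hypothetical districts: own scale, then a shared 0 to 300 scale
wet = [20, 60, 150, 300, 120, 40]
dry = [2, 6, 15, 30, 12, 4]
print(sparkline(wet), sparkline(dry))
print(sparkline(wet, 0, 300), sparkline(dry, 0, 300))
```

On their own scales the two series look identical, even though one is ten times the other. Only the shared scale shows the difference in magnitude.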

Trendline

chart-sparkline-trend

A trend­line over­lays spark­lines with a trend. This may be a mov­ing av­er­age, a best-fit line (e.g. lin­ear re­gres­sion), etc.

The high variability of sparklines can be smoothed out by the trendline, making it slightly easier to spot long-term trends.

This is particularly useful when the data shows multi-seasonal patterns (e.g. a weekly as well as a monthly pattern), and we want to bring out both effects in the same chart.
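Of the trend options mentioned, the moving average is the simplest to compute. Here is a minimal sketch of a trailing moving average; the choice of a trailing (rather than centred) window, and of averaging over fewer points at the start of the series, are assumptions made to keep the output the same length as the input.

```python
def moving_average(values, window=3):
    """Trailing moving average with the same length as the input.

    Early points are averaged over however many values exist so far.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    trend = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        trend.append(sum(chunk) / len(chunk))
    return trend


# Hypothetical monthly rainfall with a spike in the fourth month
rainfall = [10, 12, 8, 90, 11, 9]
print(moving_average(rainfall, window=3))
```

A longer window gives a smoother trend but reacts more slowly to genuine changes, so the window is usually matched to the seasonal cycle you want to average out.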

Streamgraph

chart-streamgraph

A streamgraph is similar to a sparkline, except that instead of the height above a baseline representing the value, it is the thickness of the band that represents the value.

These are also re­ferred to as stacked graphs. They are par­tic­u­larly ef­fect­ive when visu­al­ising mul­tiple series one on top of an­other. See Lee Byron’s Last.fm listen­ing his­tory for an ex­ample of ef­fect­ive use of this graph.

These are most ef­fect­ive in identi­fy­ing which series is dom­in­ant at a given point in time, and how the series grows or dies around that point.
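To show how the bands are stacked, here is a minimal sketch that centres the stack around zero at every time step, so each series becomes a band whose thickness is its value. This symmetric baseline is the simplest choice and is an assumption here; published streamgraph layouts use more refined baselines to reduce the wiggle.

```python
def stream_layers(series):
    """Stack equal-length series symmetrically around zero.

    Returns, for each series, a list of (bottom, top) pairs,
    one pair per time step. top - bottom equals the value.
    """
    steps = len(series[0])
    layers = [[] for _ in series]
    for t in range(steps):
        total = sum(s[t] for s in series)
        bottom = -total / 2  # Centre the whole stack on zero
        for i, s in enumerate(series):
            top = bottom + s[t]
            layers[i].append((bottom, top))
            bottom = top
    return layers


# Two hypothetical series over three time steps
for layer in stream_layers([[2, 4, 6], [2, 0, 2]]):
    print(layer)
```

At any time step, the thickest band is the dominant series, which is what makes this layout good for spotting who leads when.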

Horizon graph

chart-horizon

The horizon graph expands the resolution of sparklines. First, it uses an absolute scale, differentiating between positives and negatives. Negatives are coloured red, and positives are coloured green.

The chart is then folded repeatedly into bands, and uses colour intensity in conjunction with height to show the value. Panopticon, who created Horizon Graphs, have a good introduction to the use and construction of these graphs.

Like heat­maps, these are use­ful in spot­ting ho­ri­zont­al and ver­tic­al trends, but us­ing an ab­so­lute rather than a re­l­at­ive scale.
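The folding step can be sketched as follows: a value is cut into bands of equal height, and each band is drawn on top of the previous one in a darker shade. The function name, the default of three bands, and the clipping of anything beyond the last band are assumptions for illustration.

```python
def horizon_bands(value, band_height, n_bands=3):
    """Fold a value into stacked bands.

    Returns (sign, heights): sign is -1 for negatives (red) and
    1 otherwise (green); heights[i] is how much of band i is filled.
    Magnitudes beyond n_bands * band_height are clipped.
    """
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    heights = []
    for band in range(n_bands):
        filled = magnitude - band * band_height
        heights.append(max(0.0, min(band_height, filled)))
    return sign, heights


# With bands of height 10, 25 fills two bands fully and half of the third
print(horizon_bands(25, 10))
print(horizon_bands(-7, 10))
```

Because every band shares the same vertical space, a chart that would need 30 units of height fits into 10, with the darkest shade marking the largest values.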

Jitter plot

chart-jitter

Jitter plots are use­ful ways of visu­al­ising the dens­ity and fre­quency of a data series. They plot the val­ues ho­ri­zont­ally, rather than ver­tic­ally. That is, the x-axis is the value rather than the y-axis. The y-axis just spreads the points around ran­domly to min­im­ise the over­lap.

This is useful in comparing frequency data. For example, here, it is clear that no rainfall is the most frequent state. It can also be seen that Cuddalore typically has many months with little rainfall.

When the data dens­ity be­comes too high, how­ever, jit­ter plots are not as ef­fect­ive.
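The construction described above is simple enough to sketch directly: keep each value as the x-coordinate and pick a random y-coordinate within a small band. The `spread` and `seed` parameters are assumptions; the seed just makes the random scatter repeatable.

```python
import random


def jitter_points(values, spread=1.0, seed=0):
    """Return (x, y) points: x is the value, y is random vertical jitter."""
    rng = random.Random(seed)  # Own generator, so the scatter is repeatable
    return [(v, rng.uniform(-spread / 2, spread / 2)) for v in values]


# Hypothetical monthly rainfall with several dry months
rainfall = [0, 0, 0, 12, 0, 85, 140, 0]
for x, y in jitter_points(rainfall):
    print(x, round(y, 2))
```

The y-coordinate carries no information. Its only job is to stop the five zero-rainfall months from being drawn as a single dot.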

Box plot

chart-boxplot

In such cases, box-plots make for a bet­ter dis­play. Invented by John Tukey in 1977, these sum­mar­ise a data series us­ing just five num­bers: the min­im­um, the lower quart­ile, the me­di­an, the up­per quart­ile and the max­im­um.

The box rep­res­ents the area where 50% of the ob­ser­va­tions lie. The ho­ri­zont­al line rep­res­ents the full range of val­ues in the series. The ver­tic­al line is the me­di­an. Half the val­ues lie to the left, and half to the right.

While this plot appears simplistic, it is often much more robust (i.e. safe to use for a wide variety of datasets).
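The five numbers behind a box plot can be computed with Python's standard library. This is a minimal sketch; quartiles can be defined in several slightly different ways, and the "inclusive" method used here is just one assumed convention, so other tools may give slightly different box edges.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical monthly rainfall for one district
rainfall = [0, 0, 5, 12, 30, 85, 140]
print(five_number_summary(rainfall))
```

The box is drawn from the lower quartile to the upper quartile, the line inside it at the median, and the whiskers out to the minimum and maximum.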