## Composing data visualisations

How does one create new data visualisations? Apart from the art, is there a science to it?

Let’s explore a few popular charts. We have the vertical bar graph or the horizontal bar graph. The stacked bar. The variwide or Marimekko chart. The waterfall. The scatterplot. The treemap. And so on.

The first thing you’ll observe is that all of these are a series of rectangles. (We’re treating the dots on the scatterplot as little squares.) The only thing that varies across these charts is the position and size of the rectangles – and the colour as well.

That gives us a hint. Perhaps there are many ways of creating visualisations just by changing the position, size and colour of rectangles. For example, the horizontal bar graph can be constructed as follows:

• The x position is constant for each rectangle. It starts at zero.
• The width is proportional to the value of the series.
• The y position is proportional to the index of the values (1,2,3,…).
• The height is constant for each of the bars.
• The colour is constant too.
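As a minimal sketch, the five rules above can be written as a function that turns a series into rectangles. The `Rect` type, the `horizontal_bars` name, and the default height and colour are our own illustrative choices, not part of any charting library.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float
    colour: str


def horizontal_bars(values, bar_height=1.0, colour="steelblue"):
    """Map each value to one rectangle, one rule per attribute."""
    return [
        Rect(
            x=0.0,               # x is constant: every bar starts at zero
            y=i * bar_height,    # y is proportional to the index (1, 2, 3, ...)
            width=float(v),      # width is proportional to the value
            height=bar_height,   # height is constant
            colour=colour,       # colour is constant
        )
        for i, v in enumerate(values, start=1)
    ]


bars = horizontal_bars([3, 5, 2])
```

Any drawing library that can draw a rectangle from position, size and colour could then render `bars`.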

Whereas, if we look at a horizontal stacked bar, then:

• The x position is proportional to the cumulative value of the series.
• The width is proportional to the value of the series.
• The y position is constant at zero.
• The height is constant for each of the bars.
• The colour is based on the index of the values (distinct colours labelled 1,2,3,…)
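The stacked bar differs mainly in where each rectangle starts. Here is a rough sketch, assuming "cumulative value" means the sum of all the values before the current one, so that the first segment starts at zero. The `stacked_bar` name and the tuple layout are hypothetical.

```python
from itertools import accumulate


def stacked_bar(values, bar_height=1.0):
    """Return (x, y, width, height, colour_index) for each segment."""
    values = [float(v) for v in values]
    # Running total *before* each value gives the left edge of its segment.
    starts = list(accumulate(values, initial=0.0))[:-1]
    return [
        (x, 0.0, v, bar_height, i)  # y is constant at zero; colour follows index
        for i, (x, v) in enumerate(zip(starts, values), start=1)
    ]


segments = stacked_bar([3, 5, 2])
```

Each segment begins exactly where the previous one ends, which is what makes the bar look stacked.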

Generalising this, we can construct a table like this that shows the structure of various visualisations:

| Chart | x | width | y | height | colour |
|---|---|---|---|---|---|
| Vertical bar chart | index | constant | constant | value | constant |
| Stacked bar | index | constant | cumulative | value | index |
| Waterfall | index | constant | cumulative | value | constant |
| Scatterplot | value | constant | value | constant | index |
| Horizontal bar chart | constant | value | index | constant | constant |
| Variwide | cumulative | value | constant | value | constant |
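One way to read this table is as data: each row is a small specification, and a single function can turn any row into rectangles. The sketch below is illustrative only. It uses one series for every attribute (a real scatterplot or variwide would use a different series per attribute), and the names `CHARTS`, `resolve` and `layout` are our own.

```python
from itertools import accumulate

# Each chart is one row of the table: attribute -> how it is derived from data.
CHARTS = {
    "vertical bar": {"x": "index", "width": "constant", "y": "constant",
                     "height": "value", "colour": "constant"},
    "waterfall": {"x": "index", "width": "constant", "y": "cumulative",
                  "height": "value", "colour": "constant"},
    "horizontal bar": {"x": "constant", "width": "value", "y": "index",
                       "height": "constant", "colour": "constant"},
    "variwide": {"x": "cumulative", "width": "value", "y": "constant",
                 "height": "value", "colour": "constant"},
}

# Assumed defaults: constant positions sit at 0, constant sizes are 1.
DEFAULTS = {"x": 0, "y": 0, "width": 1, "height": 1, "colour": 0}


def resolve(rule, values, constant):
    """Turn one cell of the table into a list with one number per data point."""
    if rule == "constant":
        return [constant] * len(values)
    if rule == "index":
        return list(range(len(values)))
    if rule == "value":
        return list(values)
    if rule == "cumulative":
        # Sum of all preceding values, so the first rectangle starts at zero.
        return list(accumulate(values, initial=0))[:-1]
    raise ValueError(f"unknown rule: {rule}")


def layout(chart, values):
    """Return one dict of rectangle attributes per data point."""
    spec = CHARTS[chart]
    columns = {a: resolve(rule, values, DEFAULTS[a]) for a, rule in spec.items()}
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


waterfall = layout("waterfall", [3, 5, 2])
```

Adding a new chart type then means adding a row to `CHARTS`, not writing new drawing code.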

That leads to a line of thought: what if we tweaked this table? Would we get new visualisations that might be interesting?

Let’s experiment with a few.

What if we took the waterfall chart, and made the constant widths proportional to value, instead? The waterfall chart shows a cumulative series of values (e.g. percentages). This new chart – a cascade chart – allows us to depict each bar’s relative importance as well as value.
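A rough sketch of how such a cascade could be laid out follows. The text only says that the widths stop being constant, so two things here are our assumptions: x accumulates the widths so that bars do not overlap, and an optional `weights` series (hypothetical) can carry "importance" separately from the value.

```python
from itertools import accumulate


def cascade(values, weights=None):
    """Waterfall rectangles whose widths carry a measure of their own."""
    values = list(values)
    weights = values if weights is None else list(weights)
    # Each bar starts where the previous one ended: vertically at the running
    # total of the values, horizontally at the running total of the widths.
    ys = list(accumulate(values, initial=0))[:-1]
    xs = list(accumulate(weights, initial=0))[:-1]
    return [
        {"x": x, "y": y, "width": w, "height": v}
        for x, y, w, v in zip(xs, ys, weights, values)
    ]


bars = cascade([10, 20, 5], weights=[2, 1, 3])
```

With `weights` omitted, width and height both follow the same series, which is the literal reading of the tweak described above.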

What if we kept the width, height and y constant, and just let the x values vary as the index? It would just be a row of boxes. But we’d have the option of colouring them with a value. This could be useful when showing performance along a discrete series (e.g. attendance by weekday).
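Here is a small sketch of that row of boxes, assuming a simple linear blend from white to a single colour. The `shade` and `colour_row` helpers and the attendance figures are made up for illustration.

```python
def shade(value, lo, hi, rgb=(70, 130, 180)):
    """Linearly blend from white (at lo) to rgb (at hi); return a hex colour."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    channels = [round(255 + t * (c - 255)) for c in rgb]
    return "#{:02x}{:02x}{:02x}".format(*channels)


def colour_row(values, box=1.0):
    """x follows the index; y, width and height are constant; colour = value."""
    lo, hi = min(values), max(values)
    return [
        {"x": i * box, "y": 0.0, "width": box, "height": box,
         "colour": shade(v, lo, hi)}
        for i, v in enumerate(values)
    ]


# Hypothetical attendance (%) for Monday to Friday.
boxes = colour_row([80, 95, 90, 70, 60])
```

The lowest value comes out white and the highest in the full colour, with everything else in between.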

What if we allowed the x, y, width, height and colour to each vary with a different value? The graph looks like a scatterplot, but every dimension here – position, size, colour, even aspect ratio – indicates some informational measure.

This chart can show the position and spread of two metrics. For example, if the X-axis were sales and the Y-axis were price, each rectangle could be the distribution of price and sales in a branch, with the colour indicating growth of the branch.

Just using the combinations discussed above, there are 75 possible types of visualisations – many of which are meaningful in different circumstances. And this is just using rectangles.

What we’ve done here is mapped data to attributes of a visualisation. This is part of a generalised approach to graphics, similar to that covered by Leland Wilkinson’s Grammar of Graphics and implemented in libraries like ggplot2 or D3. Once we establish that basic concept – that a chart is a mapping of attributes to data – the variety of charts you’ll be able to create is unlimited, and you move from being a user of charts to a composer of data-driven visualisations.

## Tracking computer usage

CIHL (a consortium of universities in Andhra Pradesh) offers a masters course in information technology. As part of that, the computer usage of volunteering students was tracked for 7 weeks. The raw data shows how long each application was used.

We visualised the total usage of the top applications by student.

Before we go on to the results, a few words about the visualisation.

• Each row is one application. They are sorted by usage.
• Each column is one student. The width of the column is proportional to their usage. They are sorted by the amount of time spent on computers.
• Each cell shows the % of time spent on the application. For example, 20% means that the student spent 20% of her time on that application.

This is similar to the heatgrid we saw last month, but with a difference – the widths of the columns are not constant, and represent the hours of usage. This means that the colour represents not just the % usage by a student – it has an additional significance. The amount of purple ink used in each row is the total hours of usage of the application.
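To see why the ink adds up, here is a minimal sketch with made-up numbers. We assume the colour intensity of a cell is simply its share of the student's time, and that columns are sorted from heaviest to lightest user. The function names and the data are hypothetical.

```python
def heatgrid_columns(hours):
    """hours: {student: {app: hours}} -> one column per student, widest first."""
    columns = []
    for student, apps in hours.items():
        total = sum(apps.values())
        if total == 0:
            continue  # a student with no recorded usage gets no column
        shares = {app: h / total for app, h in apps.items()}
        columns.append({"student": student, "width": total, "shares": shares})
    columns.sort(key=lambda c: c["width"], reverse=True)
    return columns


def ink(columns, app):
    """Column width times colour intensity, summed along one row."""
    return sum(c["width"] * c["shares"].get(app, 0.0) for c in columns)


# Hypothetical usage, in hours.
usage = {
    "A": {"Firefox": 30, "Word": 10},
    "B": {"Firefox": 5, "Word": 15},
}
columns = heatgrid_columns(usage)
firefox_ink = ink(columns, "Firefox")  # 40 * 0.75 + 20 * 0.25 = 35 hours
```

Width times share gives back the hours for each cell, so the ink in a row equals the total hours spent on that application.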

Now for what we found.

Browsers are clearly the most popular application, with people spending 25-50% of their time on the browser. Firefox is the most popular browser, followed by Chrome. Only 3 students used IE as their main browser.

Microsoft Word emerged as the second most popular application. This is what students submitted their assignments in.

VLC was the next most popular, ignoring the time spent on Windows Explorer. While their coursework did require them to view a number of videos, an analysis of the window titles showed that course-related videos were in a minority. This also provided us with a number of interesting movie recommendations that have kept us busy over the last month.

Two games made their way into the top applications list: Half Life and Warcraft III. While only 4 students were serious gamers, the time they spent on this was significant. The student spending the most time on the PC spent almost 20% of that time on games, with another 30% on movies. (We are yet to investigate whether this had a positive or negative effect on grades.)

Chat applications did not show significant usage. IPMsg was the most popular, with up to 0.5% of time being spent on this. Google Talk was used by fewer people, but those that used it spent up to 3% of time on it.

But the strangest observation was regarding two students, both of whom spent about 10% of their time looking at screen savers. One of them was, in fact, a blank screen saver. We have still not been able to figure out what exactly they were up to.

## Browsing a tale of 5 cities

You can learn a lot about a city by the kind of activity it displays. In our case, we were interested in when a city wakes up – virtually.

Using data from a leading internet service provider, we looked at the time at which people log on to the Internet.

That’s the average browsing behaviour in 2011 across India. Darker blue indicates more people browsing online at that time. Most people seem to be browsing at 10pm (remember: this is primarily a domestic ISP), and there doesn’t seem to be a huge difference between the days, except that people wake up slightly later and sleep slightly later on Sundays.

However, between cities, there is a considerable variation in this pattern. Let’s compare Bangalore and Mumbai, for example.

The red areas indicate times when Bangalore-ans browse more, and blue areas indicate times when Mumbai-ites browse more. Folks from Bangalore are relatively early risers, starting as early as 4am on most days and retaining the lead until around 10am, when the Mumbai usage catches up. Mumbai leads in the afternoon, while Bangalore recaptures the lead a bit after 6pm. Most Bangalore-ans are asleep by 11pm, though, and Mumbai-ites zoom past, capturing the lion’s share of browsing between 11pm and 4am.
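A comparison like this can be sketched as the difference between two normalised grids. The normalisation is our assumption: each city's grid is scaled to shares of its own total, so a larger city does not win every cell. The tiny grids below are made-up stand-ins for the real day-by-hour data.

```python
def normalise(grid):
    """Scale a grid of counts so that all its cells sum to 1."""
    total = sum(sum(row) for row in grid)
    return [[cell / total for cell in row] for row in grid]


def compare(city_a, city_b):
    """Positive cells: city_a browses relatively more; negative: city_b does."""
    a, b = normalise(city_a), normalise(city_b)
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Hypothetical counts: 2 days (rows) by 3 time slots (columns).
bangalore = [[4, 2, 2], [1, 1, 0]]
mumbai = [[1, 2, 2], [2, 2, 1]]
diff = compare(bangalore, mumbai)
```

Cells above zero would be shaded red and cells below zero blue, with the intensity following the size of the difference.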

Also, it appears that Mumbai-ites work a lot less than Bangalore-ans on Sundays – opting to go online only after 9pm.

On the whole, “work-hard, play-hard” might capture the spirit of Mumbai-ites, while “early-to-rise” seems to define Bangalore. (From personal experience, we find that a bit hard to digest, but the data is irrefutable.)

A comparison of Bangalore with New Delhi shows a somewhat similar contrast through the hours of the day, but the behaviour across the days of the week is similar. So Bangalore does rise earlier and sleep earlier than New Delhi, but works about the same on weekdays and weekends.

In contrast, Bangalore and Chennai seem to have quite a similar browsing profile through the day. However, on weekends, Chennai seems to browse a lot less than Bangalore, thus qualifying as both “early-to-rise” and “relax-on-weekends”. (From personal experience, we find that quite believable.)

If we take Chennai and compare that with Mumbai or Delhi…

… the profiles are quite similar. As before, both cities wake up and sleep late. Mumbai is a bit more active on weekends than Delhi.

Just based on this data, if we summarise our experience, this is what it appears to be:

| City behaviour | Rises early | Rises late |
|---|---|---|
| Works on weekends | Bangalore | Mumbai |
| Takes weekends off | Chennai | New Delhi |

The visualisation we used above is a heat-grid. It’s analogous to scatter plots, but for discrete data instead of continuous data. Some of the situations where we’ve used heat-grids include:

• Plotting call volume in a call centre by date and hour
• Plotting number of complaints by customer group and reason
• Plotting training course expense by type of course and instructor
• Plotting profitability by geography and branch type

… and so on.

If you’re interested in creating heat-grids yourself, you will find it fairly easy to do in Excel 2007 and later. Just fill in the values you want, select them, and choose Conditional Formatting – Color Scales.
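If your data starts out as individual records, not a filled-in grid, the values can be produced by counting. The sketch below assumes each record is a dictionary; the `heat_grid` helper and the call records are hypothetical.

```python
from collections import Counter


def heat_grid(records, row_key, col_key):
    """Count records per (row, column) pair and lay the counts out as a grid."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Counter returns 0 for pairs that never occur, so empty cells are filled.
    grid = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, grid


# Hypothetical call-centre records: one dictionary per call.
calls = [
    {"date": "2011-03-01", "hour": 9},
    {"date": "2011-03-01", "hour": 9},
    {"date": "2011-03-02", "hour": 10},
]
rows, cols, grid = heat_grid(calls, "date", "hour")
```

The resulting `grid` can be pasted into a spreadsheet and coloured with the same colour scale.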

Update: We got requests to add a few cities – Hyderabad in particular. I’m afraid that this ISP does not have a strong presence in Hyderabad, so data is sparse. But here’s Pune:

As you can see, Pune seems to be a late riser. Probably not too different from Mumbai…

… and other than checking mails a bit earlier than Mumbai-ites on Sunday night, it’s pretty much the same pattern!