Student browsing patterns

This is a guest post by Rahul Gonsalves of Pixelogue.

About a week ago, Anand suggested that we spend a day some weekend working collaboratively on data visualisation. I jumped at the chance to spend a day working with and learning from him, and this is how we found ourselves at the Gramener office on a Sunday morning.

We decided to look at a dataset that Anand has blogged about before – computer usage of MSIT students at CIHL, a consortium of universities based out of IIIT, Hyderabad. Over a period of seven weeks, students’ computer usage was tracked. The data includes application usage and duration, internet browsing patterns, and even keystrokes, broken down by user. If this data sounds like a privacy landmine, that’s because it is! The only consolation is that all the students involved in the study consented to have their usage tracked, and so were presumably aware of what was happening.

We decided to look at a subset of this data, the students’ internet usage, and to try to answer the following questions: What websites do people browse at different times of day? Are there interesting patterns that emerge? Do “social” websites constitute a significant portion of their browsing time?

We created an interactive visualisation, as well as an Excel-based one. The interactive version is available at http://gramener.com/siteusage/

In Excel, the variables at our disposal included:

  1. User
  2. URL
  3. Time of browsing

We pulled the data into Excel, and had the following table:

excel-1

We then split up the time values in Excel into their component pieces (hour and minute), so that 22-11-2011 10:19 becomes:

excel-2

You can see the raw data and the formulas used in the following screenshot:

excel-3

We combined the hour and minute into a value which we called “Minute of the Day”, which is simply the number of minutes elapsed since 12AM. 1am is 60, 2am is 120, 3am is 180 and so forth.
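For readers who would rather see the same calculation outside Excel, here is a minimal Python sketch. The function name `minute_of_day` is made up for this example, and the timestamp format is an assumption based on the sample value 22-11-2011 10:19.

```python
from datetime import datetime

def minute_of_day(timestamp: str) -> int:
    """Return the minutes elapsed since midnight for a 'DD-MM-YYYY HH:MM' string."""
    # Assumed format, inferred from the sample value 22-11-2011 10:19.
    moment = datetime.strptime(timestamp, "%d-%m-%Y %H:%M")
    return moment.hour * 60 + moment.minute

print(minute_of_day("22-11-2011 10:19"))  # 10 * 60 + 19 = 619
```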

We then used a pivot table to count how often each domain was accessed, which allowed us to generate the top 10 most accessed domains. (Facebook, unsurprisingly, was 2nd, right behind a local address, 10.10.10.68, which is presumably a development server.)

excel-4
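The pivot table step can be approximated in a few lines of Python. This is only a sketch of the idea: the URLs below are invented for illustration, and `top_domains` is a hypothetical helper, not something from the original workbook.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up sample URLs; the real log is not reproduced here.
urls = [
    "http://10.10.10.68/course/home",
    "http://10.10.10.68/course/quiz",
    "http://www.facebook.com/home.php",
    "http://10.10.10.68/login",
    "http://www.facebook.com/profile.php",
    "http://www.google.com/search?q=excel",
]

def top_domains(urls, n=10):
    """Count visits per domain and return the n most frequent as (domain, count)."""
    domains = (urlparse(url).netloc for url in urls)
    return Counter(domains).most_common(n)

print(top_domains(urls, 2))  # [('10.10.10.68', 3), ('www.facebook.com', 2)]
```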

We arranged the domains on the horizontal axis, with the hour of day listed on the vertical axis, as below:

At this point, Anand pulled out his Excel magic and pulled in the number of times within that hour that a particular domain was accessed. COUNTIFS counts the number of times the domain was accessed in that particular time slot. IFERROR ensures that errors are counted as zeroes. (This formula works only in Excel 2007 and later.)

excel-6
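The formula itself lives in the screenshot, so it is not reproduced here. As a rough equivalent of the same counting idea, the Python sketch below builds a grid with one row per hour and one column per domain. The sample rows and the `count_grid` helper are assumptions made for illustration; if the data were keyed by minute of the day rather than hour, the same approach would apply.

```python
from collections import Counter

# Made-up (domain, hour of day) pairs standing in for the real rows.
visits = [
    ("10.10.10.68", 10), ("10.10.10.68", 10), ("facebook.com", 10),
    ("facebook.com", 22), ("10.10.10.68", 11), ("facebook.com", 22),
]

def count_grid(visits, domains, hours=range(24)):
    """Build rows of counts: one row per hour, one column per domain."""
    counts = Counter(visits)  # a missing (domain, hour) pair counts as 0
    return [[counts[(domain, hour)] for domain in domains] for hour in hours]

grid = count_grid(visits, ["10.10.10.68", "facebook.com"])
print(grid[10])  # [2, 1]
print(grid[22])  # [0, 2]
```

A `Counter` returns zero for combinations that never occur, which plays roughly the role that IFERROR plays in the spreadsheet.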

The result of applying this formula across the whole table is given below:

excel-7

Using the conditional formatting tools, we are able to apply a colour scale that changes the cell background colour — a darker green implies a higher frequency while a lighter colour implies a lower incidence at that point in time.

excel-8

The extreme preponderance of the top hit (the local dev server, 10.10.10.68) led to a not very useful visualisation, with only the highest values being marked out.

excel-9

Using a logarithmic scale helps give a better heatmap, as can be seen in the following screenshots.

excel-a
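To see why the logarithmic scale helps, here is a small Python sketch comparing the two scalings on invented counts with one dominant value. Using log(1 + count) so that zero stays at zero is an assumption of this example; the post does not say how zero counts were handled in Excel.

```python
import math

def log_scale(counts):
    """Map raw counts to the range 0..1 using log(1 + count)."""
    logged = [math.log1p(c) for c in counts]
    top = max(logged)
    if top == 0:
        return [0.0 for _ in logged]
    return [value / top for value in logged]

counts = [0, 5, 50, 5000]  # made-up counts with one dominant value
linear = [c / max(counts) for c in counts]
print([round(v, 3) for v in linear])             # small values vanish near 0
print([round(v, 3) for v in log_scale(counts)])  # small values stay visible
```

On a linear scale, a count of 5 gets a thousandth of the colour intensity of the top value and is effectively invisible. On the logarithmic scale it keeps a clearly distinguishable share.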

We finally arrived at the following heatmap, which offers some insights into the ways that the students at this particular course spent their time.

excel-b

We talked about different ways of depicting this data, which resulted in the following interactive visualization of the way a student spends his or her time on an average day in Hyderabad. We hope you enjoy it!

The Social Network of Coders

Every company faces the problem of finding smart, motivated people. Joel Spolsky offers this advice for finding great developers:

Think about where the people you want to hire are hanging out… Go to their conferences where you’ll find early adopters who are curious about new things and always interested in improving.

These days, the smart folks hang out at Github. (Github is like Facebook for coders. Coders can follow each other, and instead of uploading photos, they upload code.)

Last year, Matt Biddulph published a piece on Algorithmic recruitment with Github, and plotted the social network of coders on Github in specific cities: San Francisco and London in particular. People have extended this effort to other cities, but none in India.

At Gramener, we took a look at the Github follower network in various cities in India. The images below show the social network of Github users in Bangalore and Chennai – the Indian cities with the most users on Github.

bangalorechennai

Firstly, Bangalore, with 1460 users, clearly has more coders than Chennai (658). But what’s also interesting is the relatively large networked cluster in Bangalore. This is something that’s lacking in most other cities, as you can see below.

punemumbaidelhihyderabad

These cities tend to have smaller, disparate clusters. In Bangalore, by contrast, if you know some of the top Github users, you can easily hop from person to person and cover most of the popular users on Github. You can also guess that Hyderabad, Mumbai and Delhi (especially) are a bit less “sociable” and tend to form islands, when compared to Chennai or Pune.
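One way to put a number on “islands” is to count the connected clusters in the follower network and compare their sizes. The Python sketch below does this for a toy network. The user IDs and follow pairs are hypothetical, and treating a follow as a link in both directions is an assumption of this example rather than something the images are known to do.

```python
from collections import defaultdict

def cluster_sizes(users, follows):
    """Return the sizes of connected clusters, largest first.

    A follow pair links two users regardless of direction.
    """
    neighbours = defaultdict(set)
    for follower, followed in follows:
        neighbours[follower].add(followed)
        neighbours[followed].add(follower)

    seen = set()
    sizes = []
    for user in users:
        if user in seen:
            continue
        # Walk outward from this user to collect the whole cluster.
        stack = [user]
        seen.add(user)
        size = 0
        while stack:
            current = stack.pop()
            size += 1
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical user IDs and follow pairs, for illustration only.
users = ["a", "b", "c", "d", "e", "f"]
follows = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
print(cluster_sizes(users, follows))  # [3, 2, 1]
```

A city with one large cluster would show a first value close to the total number of users, while a city of islands would show many small values.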

In a way, this is reflected in the city’s social interaction as well. It’s a whole lot easier to meet a group of developers in Bangalore than it is in almost any other city in India.

To make your life easier, we’ve created a tool that lets you explore this social network.

coder-network

Each coder is shown as a circle. The size of the circle increases with the number of followers. The colour of the circle changes based on their primary programming language. The labels indicate their Github user ID, the number of followers and their main programming language. Lines indicate that a user is following another. You can move each circle around to get a better view, and click on the circle to open their Github page.

This kind of graph uses a force-directed layout. Such layouts are an excellent way of exploring and visualising small-scale networks interactively, since they let you compare the structure of different networks, and also drill deep into every node in a network.
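For the curious, the idea behind a force-directed layout fits in a short Python sketch: every pair of nodes repels, linked nodes attract, and the movement allowed per step shrinks over time. This is a simplified version in the style of the Fruchterman-Reingold algorithm, written for illustration. It is not the code behind the tool, and the function name, parameters and constants are all assumptions.

```python
import math

def force_layout(nodes, links, iterations=100, k=1.0):
    """Simplified Fruchterman-Reingold-style layout; returns {node: (x, y)}."""
    # Start on a unit circle so that no two nodes share a position.
    pos = {
        node: [math.cos(2 * math.pi * i / len(nodes)),
               math.sin(2 * math.pi * i / len(nodes))]
        for i, node in enumerate(nodes)
    }
    for step in range(iterations):
        move = {node: [0.0, 0.0] for node in nodes}
        # Every pair of nodes repels, more strongly when close together.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                move[a][0] += dx / dist * push
                move[a][1] += dy / dist * push
                move[b][0] -= dx / dist * push
                move[b][1] -= dy / dist * push
        # Linked nodes attract, more strongly when far apart.
        for a, b in links:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            move[a][0] -= dx / dist * pull
            move[a][1] -= dy / dist * pull
            move[b][0] += dx / dist * pull
            move[b][1] += dy / dist * pull
        # Cap each move by a "temperature" that cools towards zero.
        temperature = 0.1 * (1 - step / iterations)
        for node in nodes:
            mx, my = move[node]
            length = math.hypot(mx, my) or 1e-9
            capped = min(length, temperature)
            pos[node][0] += mx / length * capped
            pos[node][1] += my / length * capped
    return {node: tuple(xy) for node, xy in pos.items()}

# Hypothetical users: "a" follows "b", while "c" is connected to no one.
layout = force_layout(["a", "b", "c"], [("a", "b")])
for node, (x, y) in layout.items():
    print(node, round(x, 2), round(y, 2))
```

In the resulting layout the linked pair settles close together while the unconnected node drifts away, which is how islands and clusters become visible at a glance. This sketch has no bounding frame, so isolated nodes keep moving outward until the temperature reaches zero.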

Visit gramener.com/codersearch to see the tool in action.

Colouring the calendar

Sometimes, just viewing a time series as a simple graph isn’t enough.

The graph below shows the daily visitors to a leading Indian website in 2011. The overall trends are apparent. There was a dip in Mar-Apr, and again in Oct, followed by a steady rise in November.

analytics-line

But what’s also apparent is a weekly cyclicality: the steady pattern of rises and falls several times a month, that disturbs this trend.

Yet there’s considerable insight within that cyclicality that a calendar heatmap can bring out. Here is the same data on a calendar heatmap. This is simply a calendar on which the values are plotted as a range of colours: red for fewer visitors, green for more visitors.

analytics-calendar
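The layout behind a calendar heatmap is simple to sketch in Python: arrange each month as rows of weeks and columns of weekdays, then put the day's value in each cell. The visitor counts below are invented, `month_grid` is a hypothetical helper, and starting the week on Sunday is an assumption. Colouring the cells is left to whatever charting tool you use.

```python
import calendar
from datetime import date

def month_grid(year, month, visitors):
    """Lay out one month as weeks (rows) by weekdays (columns, Sunday first).

    `visitors` maps date -> count. Cells outside the month, or without
    data, are None.
    """
    cal = calendar.Calendar(firstweekday=calendar.SUNDAY)
    weeks = cal.monthdayscalendar(year, month)  # 0 marks days outside the month
    return [
        [visitors.get(date(year, month, day)) if day else None for day in week]
        for week in weeks
    ]

# Made-up counts for two days in October 2011.
visitors = {date(2011, 10, 25): 900, date(2011, 10, 26): 400}
grid = month_grid(2011, 10, visitors)
for week in grid:
    print(week)
```

Because every column is a single weekday, a cell that is out of line with the cells above and below it stands out immediately.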

analytics-october

Those dips you saw on the line graph? Those were Sundays, when browsing activity dives down consistently. However, as you can see from above, not all Sundays are equal. July 31st and August 7th, though they were Sundays, had considerable traffic. Similarly, weekdays can also experience dips. Jun 23rd is an example of a somewhat unusual dip, and so is Oct 26th – Diwali.

Calendar heatmaps provide a way of exploring information at a far richer level of detail than traditional line graphs or bar graphs do.

First, they focus on weekly trends. In businesses where there is a weekly cyclicality, it becomes much easier to spot an unusual weekday. In the month of August (see below), it’s fairly obvious from both graphs that August 14th had a bad dip. But what becomes clearer from the calendar map (but not the line graph) is that August 13th was a relatively bad Saturday, and August 16th was a relatively bad Tuesday.

analytics-Aug
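What the eye does on the calendar, comparing a day with other days of the same weekday, can also be sketched in Python. The example below divides each day's count by the median for its weekday. The counts are invented, `weekday_ratios` is a hypothetical helper, and using the median as the “typical” value is an assumption of this sketch. It also assumes no weekday has a median of zero.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median

def weekday_ratios(visitors):
    """Return each day's count divided by the median count for its weekday."""
    by_weekday = defaultdict(list)
    for day, count in visitors.items():
        by_weekday[day.weekday()].append(count)
    typical = {wd: median(counts) for wd, counts in by_weekday.items()}
    return {day: count / typical[day.weekday()] for day, count in visitors.items()}

# Made-up counts: five consecutive Saturdays, one of them weak.
start = date(2011, 7, 30)  # a Saturday
visitors = {start + timedelta(weeks=i): 800 for i in range(5)}
visitors[date(2011, 8, 13)] = 600
ratios = weekday_ratios(visitors)
print(ratios[date(2011, 8, 13)])  # 0.75
```

A ratio well below 1 flags a day that was weak for its weekday, even if its raw count looks ordinary next to a Sunday.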

analytics-october

Secondly, they focus on individual days. It’s a lot easier to see the exact date on which an event occurred. For example, in the graph alongside, there is a big dip in October. The most significant drop is in the last week, specifically on October 26th. Once you know the date, it’s easy to associate the change in behaviour with Diwali as its cause.

On the line graph below, you can see the major dip in October. However, mapping this specifically to Diwali is a far tougher task.

analytics-line

Below is another calendar heatmap – this time, showing the percentage of visitors from New Delhi. Consider the month of August. We saw from the earlier calendar map that there was a decline in traffic between August 13 – 16. If that decrease was uniform across cities, the colours below would be uniform too. However, New Delhi’s percentage share declines as well on these days.

analytics-calendar-delhi-pc
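The reasoning here rests on a simple ratio: a city's visitors divided by total visitors for the same day. If traffic fell evenly everywhere, that ratio would stay flat. A minimal Python sketch with invented numbers follows; `city_share` and the figures are assumptions for illustration only.

```python
def city_share(city_visitors, total_visitors):
    """Percentage of each day's visitors that came from one city."""
    return {
        day: 100 * city_visitors.get(day, 0) / total
        for day, total in total_visitors.items()
        if total  # skip days with no traffic at all
    }

# Made-up daily counts: overall traffic and one city's part of it.
total = {"2011-08-12": 1000, "2011-08-14": 600}
delhi = {"2011-08-12": 200, "2011-08-14": 90}
print(city_share(delhi, total))  # {'2011-08-12': 20.0, '2011-08-14': 15.0}
```

In this toy data, overall traffic drops by 40% but the city's share also falls from 20% to 15%, so the city's decline is steeper than the average.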

Apparently, people in New Delhi are more likely to spend the day outside on Independence Day than those in most other cities! In fact, they seem to spend the whole of August avoiding browsing. However, the same cannot be said of Diwali. Delhi-ites are as likely (or unlikely) to be browsing during Diwali as the denizens of any other city.

The next time you look at data with weekly patterns, where you need to figure out quickly when exactly the numbers rose or fell, do try out a calendar heatmap.