Student browsing patterns

This is a guest post by Rahul Gonsalves of Pixelogue.

About a week ago, Anand sug­ges­ted that we spend a day some week­end work­ing col­lab­or­at­ively on data visu­al­isa­tion. I jumped at the chance to spend a day work­ing and learn­ing from him and this is how we found ourselves at the Gramener of­fice on a Sunday morn­ing.

We de­cided to look at a data­set that Anand has blogged about be­fore – com­puter us­age of MSIT stu­dents at CIHL, a con­sor­ti­um of uni­ver­sit­ies based out of IIIT, Hyderabad. Over a peri­od of sev­en weeks, stu­dents’ com­puter us­age was tracked. The data in­cludes ap­plic­a­tion us­age and dur­a­tion, in­ter­net brows­ing pat­terns, and even key­strokes, broken down by user. If this data sounds like a pri­vacy land­mine, that’s be­cause it is! The only con­sol­a­tion is that all the stu­dents in­volved in the study con­sen­ted to have their us­age tracked, and so were pre­sum­ably aware of what was hap­pen­ing.

We decided to look at a subset of this data, the students' internet usage, and to try to answer the following questions: What websites do people browse at different times of day? Are there interesting patterns that emerge? Do “social” websites constitute a significant portion of their browsing time?

We cre­ated an in­ter­act­ive visu­al­isa­tion, as well as an Excel based one. The in­ter­act­ive ver­sion is avail­able at http://gramener.com/siteusage/

On Excel, the vari­ables at our dis­pos­al in­cluded:

  1. User
  2. URL
  3. Time of brows­ing

We pulled the data in­to Excel, and had the fol­low­ing table:

excel-1

We then split up the time val­ues in Excel in­to their com­pon­ent pieces (hour and minute), so that 22-11-2011 10:19 be­comes:

excel-2

You can see the raw data and the for­mu­las used in the fol­low­ing screen­shot:

excel-3

We combined the hour and minute into a value which we called “Minute of the Day”, which is simply the number of minutes elapsed since 12AM: 1am is 60, 2am is 120, 3am is 180 and so forth.
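Outside Excel, the same "Minute of the Day" value could be computed along these lines. This is only a sketch: the function name `minute_of_day` is made up for illustration, and it assumes timestamps follow the day-month-year pattern of the 22-11-2011 10:19 example.

```python
from datetime import datetime


def minute_of_day(timestamp: str) -> int:
    """Return minutes elapsed since midnight for a 'DD-MM-YYYY HH:MM' string."""
    parsed = datetime.strptime(timestamp, "%d-%m-%Y %H:%M")
    return parsed.hour * 60 + parsed.minute


# 10 hours * 60 + 19 minutes
print(minute_of_day("22-11-2011 10:19"))
```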

We then used a pivot table to count how often each domain was accessed, which allowed us to generate the top 10 most accessed domains. (Facebook, unsurprisingly, was 2nd, right behind a local address, 10.10.10.68, which is presumably a development server.)

excel-4
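For readers who prefer code to pivot tables, the frequency count above can be approximated as follows. It is a rough sketch, not what we ran: `top_domains` is a hypothetical helper, and the sample URLs are invented purely to show the shape of the result.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=10):
    """Count how often each domain appears and return the n most frequent."""
    counts = Counter(urlparse(url).netloc for url in urls)
    return counts.most_common(n)


# Invented sample rows, not the study's data.
sample_urls = [
    "http://10.10.10.68/index.html",
    "http://10.10.10.68/login",
    "http://www.facebook.com/home.php",
    "http://10.10.10.68/report",
    "http://www.facebook.com/profile.php",
    "http://www.google.com/search?q=excel",
]
print(top_domains(sample_urls, n=2))
```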

We ar­ranged the do­mains on the ho­ri­zont­al ax­is, with the hour of day lis­ted on the y-axis, as be­low:

At this point, Anand pulled out his Excel magic and pulled in the number of times within each hour that a particular domain was accessed. COUNTIFS counts the number of times the domain was accessed during that particular hour. IFERROR ensures that errors are counted as zeroes. (This formula works only in Excel 2007 and later.)

excel-6

The results of applying this particular formula across the whole table are shown below:

excel-7
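The hour-by-domain table above can also be built in a few lines of code. The sketch below is an analogue of the COUNTIFS step, not a translation of the actual formula: `hourly_counts` is a hypothetical name, the visits are invented, and combinations that never occur simply come out as 0, which is loosely the job IFERROR does in the sheet.

```python
from collections import Counter


def hourly_counts(visits, domains):
    """Build a 24-row table where table[hour][domain] is the visit count.

    `visits` is an iterable of (domain, hour) pairs. A Counter returns 0
    for pairs it has never seen, so empty cells need no special handling.
    """
    tally = Counter(visits)
    return [{d: tally[(d, hour)] for d in domains} for hour in range(24)]


# Invented visits, not the study's data.
visits = [
    ("facebook.com", 10),
    ("facebook.com", 10),
    ("10.10.10.68", 10),
    ("facebook.com", 22),
]
table = hourly_counts(visits, ["10.10.10.68", "facebook.com"])
print(table[10])
print(table[3])
```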

Using the conditional formatting tools, we are able to apply a colour scale that changes the cell background colour: a darker green implies a higher frequency, while a lighter colour implies a lower incidence at that point in time.

excel-8

The ex­treme pre­pon­der­ance of the top hit (the loc­al dev server, 10.10.10.68) led to a not very use­ful visu­al­isa­tion, with only the highest val­ues be­ing marked out.

excel-9

Using a logarithmic scale helps give a better heatmap, as can be seen in the following screenshots.

excel-a
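To see why the logarithmic scale helps, compare how a linear and a log mapping turn counts into colour intensity. This is a sketch under assumptions: `log_intensity` is a hypothetical helper, the counts are invented, and log(1 + count) is just one reasonable choice of transform (it keeps a count of zero at zero), not necessarily the one used in the sheet.

```python
import math


def log_intensity(count, max_count):
    """Map a count to a 0..1 shade using log(1 + count), so 0 stays 0."""
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


# Invented counts: one dominant value, as with the local server.
counts = [0, 5, 50, 5000]
linear = [round(c / max(counts), 3) for c in counts]
logged = [round(log_intensity(c, max(counts)), 3) for c in counts]
print(linear)  # small counts are squashed close to zero
print(logged)  # small counts are spread across the scale
```

On a linear scale the small counts are nearly indistinguishable from zero, which is the "only the highest values marked out" effect described above.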

We fi­nally ar­rived at the fol­low­ing heat­map, which of­fers some in­sights in­to the ways that the stu­dents at this par­tic­u­lar course spent their time.

excel-b

We talked about dif­fer­ent ways of de­pict­ing this data, which res­ul­ted in the fol­low­ing in­ter­act­ive visu­al­iz­a­tion of the way a stu­dent spends his or her time on an av­er­age day in Hyderabad. We hope you en­joy it!

The Social Network of Coders

Every company faces the problem of finding smart, motivated people. Joel Spolsky offers this advice for finding great developers:

Think about where the people you want to hire are hanging out… Go to their con­fer­ences where you’ll find early ad­op­ters who are curi­ous about new things and al­ways in­ter­ested in im­prov­ing.

These days, the smart folks hang out at Github. (Github is like Facebook for coders. Coders can fol­low each oth­er, and in­stead of up­load­ing pho­tos, they up­load code.)

Last year, Matt Biddulph pub­lished a piece on Algorithmic re­cruit­ment with Github, and plot­ted the so­cial net­work of coders on Github in spe­cific cit­ies: San Francisco and London in par­tic­u­lar. People have ex­ten­ded this ef­fort to oth­er cit­ies, but none in India.

At Gramener, we took a look at the Github fol­low­er net­work in vari­ous cit­ies in India. The im­ages be­low show the so­cial net­work of Github users at Bangalore and Chennai – the Indian cit­ies with the most users on Github.

bangalorechennai

Firstly, Bangalore, with 1460 users, clearly has more coders than Chennai (658). But what’s also in­ter­est­ing is the re­l­at­ively large net­worked cluster in Bangalore. This is some­thing that’s lack­ing in most oth­er cit­ies, as you can see be­low.

punemumbaidelhihyderabad

These cities tend to have smaller, disparate clusters. In Bangalore, by contrast, if you know some of the top Github users, you can easily hop from person to person and cover most of the popular users on Github. You can also guess that Hyderabad, Mumbai and (especially) Delhi are a bit less “sociable” and tend to form islands, when compared to Chennai or Pune.

In a way, this is re­flec­ted in the city’s so­cial in­ter­ac­tion as well. It’s a whole lot easi­er to meet a group of de­velopers in Bangalore than it is in al­most any oth­er city in India.
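One way to put a number on the "islands" described above is to group users who are linked, directly or indirectly, by follow relationships. The sketch below is illustrative only: `follower_islands` and the user IDs are invented, each follow is treated as an undirected link (an assumption), and users with no links at all would not appear.

```python
from collections import defaultdict


def follower_islands(follows):
    """Group users into islands, treating each follow link as undirected."""
    neighbours = defaultdict(set)
    for follower, followed in follows:
        neighbours[follower].add(followed)
        neighbours[followed].add(follower)

    seen, islands = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk collecting everyone reachable from `start`.
        stack, island = [start], set()
        while stack:
            user = stack.pop()
            if user in seen:
                continue
            seen.add(user)
            island.add(user)
            stack.extend(neighbours[user] - seen)
        islands.append(island)
    return sorted(islands, key=len, reverse=True)


# Invented (follower, followed) pairs.
follows = [("asha", "bala"), ("bala", "chitra"), ("dev", "esha")]
for island in follower_islands(follows):
    print(sorted(island))
```

A city with one large island and few small ones would look like Bangalore; many small islands would look more like the other cities.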

To make your life easi­er, we’ve cre­ated a tool that lets you ex­plore this so­cial net­work.

coder-network

Each coder is shown as a circle. The size of the circle in­creases with the num­ber of fol­low­ers. The col­our of the circle changes based on their primary pro­gram­ming lan­guage. The la­bels in­dic­ate their Github user ID, the num­ber of fol­low­ers and their main pro­gram­ming lan­guage. Lines in­dic­ate that a user is fol­low­ing an­other. You can move each circle around to get a bet­ter view, and click on the circle to open their Github page.

This graph is called a force-directed layout. Such layouts are an excellent way of exploring and visualising small-scale networks interactively, since they let you compare the structure of different networks and also drill deep into every node in a network.

Visit gramener.com/codersearch to see the tool in ac­tion.

Colouring the calendar

Sometimes, just view­ing a time series as a sim­ple graph isn’t enough.

The graph be­low shows the daily vis­it­ors to a lead­ing Indian web­site in 2011. The over­all trends are ap­par­ent. There was a dip in Mar-Apr, and again in Oct, fol­lowed by a steady rise in November.

analytics-line

But what’s also ap­par­ent is a weekly cyc­lic­al­ity: the steady pat­tern of rises and falls sev­er­al times a month, that dis­turbs this trend.

Yet, there’s con­sid­er­able in­sight with­in that cyc­lic­al­ity, that a cal­en­dar heat­map can bring out. Here is the same data on a cal­en­dar heat­map. This is simply a cal­en­dar on which the val­ues are plot­ted as a range of col­ours: red for few­er vis­it­ors, green for more vis­it­ors.

analytics-calendar

analytics-october

Those dips you saw on the line graph? Those were Sundays, when browsing activity dips consistently. However, as you can see from above, not all Sundays are equal. July 31st and August 7th, though they were Sundays, had considerable traffic. Similarly, weekdays can also experience dips. Jun 23rd is an example of a somewhat unusual dip, and so is Oct 26th – Diwali.

Calendar heat­maps provide a way of ex­plor­ing in­form­a­tion at a far rich­er level of de­tail than tra­di­tion­al line graphs or bar graphs do.
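The layout behind a calendar heatmap is simple: each day lands in a row for its week and a column for its weekday, and the cell is then coloured by its value. Here is a minimal sketch of that arrangement using Python's standard `calendar` module. The name `month_grid` and the visitor figure are made up, and weeks are assumed to start on Monday.

```python
import calendar


def month_grid(year, month, values):
    """Arrange daily values into week rows (Monday first) for one month.

    `values` maps day-of-month to a number. Cells outside the month, and
    days with no value, are None.
    """
    cal = calendar.Calendar(firstweekday=calendar.MONDAY)
    weeks = cal.monthdayscalendar(year, month)
    return [[values.get(day) if day else None for day in week] for week in weeks]


# Invented figure for 26 October 2011, a Wednesday.
grid = month_grid(2011, 10, {26: 100})
for week in grid:
    print(week)
```

Because every column is one weekday, an unusual Sunday or an unusual Wednesday stands out against its own column rather than against the whole month.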

Firstly, they focus on weekly trends. In businesses where there is a weekly cyclicality, it becomes much easier to spot an unusual weekday. In the month of August (see below), it's fairly obvious from both graphs that August 14th had a bad dip. But what becomes clearer from the calendar map (but not the line graph) is that August 13th was a relatively bad Saturday, and August 16th was a relatively bad Tuesday.

analytics-Aug
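What the eye does on the calendar, comparing a Saturday with other Saturdays, can be mimicked by dividing each day's figure by the average for its weekday. The sketch below assumes positive daily counts; `weekday_ratio` is a hypothetical helper and all visitor numbers are invented.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_ratio(daily):
    """Return each day's value divided by the mean for its weekday."""
    by_weekday = defaultdict(list)
    for day, value in daily.items():
        by_weekday[day.weekday()].append(value)
    baseline = {wd: mean(vals) for wd, vals in by_weekday.items()}
    return {day: value / baseline[day.weekday()] for day, value in daily.items()}


# Invented visitor counts for the Saturdays and Tuesdays of August 2011.
daily = {
    date(2011, 8, 6): 800, date(2011, 8, 13): 500,
    date(2011, 8, 20): 800, date(2011, 8, 27): 900,
    date(2011, 8, 2): 1000, date(2011, 8, 9): 1000,
    date(2011, 8, 16): 700, date(2011, 8, 23): 1100,
}
ratios = weekday_ratio(daily)
low_days = sorted(day for day, ratio in ratios.items() if ratio < 0.8)
print(low_days)
```

A ratio well below 1 flags a day that is weak for its weekday, even when it would not look like a dip on a plain line graph.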

analytics-october

Secondly, they focus on individual days. It's a lot easier to see the exact date on which an event occurred. For example, in the graph alongside, there is a big dip in October. The most significant dip is in the last week, specifically on October 26th. Once you know the date, it's easy to associate the change in behaviour with Diwali as its cause.

On the line graph be­low, you can see the ma­jor dip in October. However, map­ping this spe­cific­ally to Diwali is a far tougher task.

analytics-line

Below is another calendar heatmap – this time, showing the percentage of visitors from New Delhi. Consider the month of August. We saw from the earlier calendar map that there was a decline in traffic between August 13 and 16. If that decrease were uniform across cities, the colours below would be uniform too. However, New Delhi's percentage share declines as well on these days.

analytics-calendar-delhi-pc
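The reasoning about percentage share can be checked with a little arithmetic: if every city's traffic falls by the same proportion, one city's share does not move, so a change in share means that city fell faster (or slower) than the rest. The numbers below are invented and `city_share` is a hypothetical helper.

```python
def city_share(city_visits, total_visits):
    """Percentage of a day's visitors that came from one city."""
    return 100.0 * city_visits / total_visits


normal = city_share(200, 1000)       # an ordinary day
uniform_dip = city_share(100, 500)   # every city's traffic halves
uneven_dip = city_share(60, 500)     # this city drops faster than the rest
print(normal, uniform_dip, uneven_dip)
```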

Apparently, people in New Delhi are more likely to spend Independence Day outdoors than people in most other cities! In fact, they seem to spend the whole of August avoiding browsing. However, the same cannot be said of Diwali: Delhi-ites are as likely (or unlikely) to be browsing during Diwali as the denizens of any other city.

The next time you look at data with weekly pat­terns, where you need to fig­ure out quickly when ex­actly the num­bers rose or fell, do try out a cal­en­dar heat­map.