The language of tweets

This post is part of the output of the Bangalore Fifth Elephant Hacknight.


What you see above are the words most often used on Twitter by Indians. (Click for a larger image). The size of the bubble indicates how often the word is used.

We wanted to know whether people with many followers use different words from people with few followers. The words on the left (coloured red) are used mainly by people with few followers. The words on the right (coloured green) are used mainly by people with many followers.

(At this point, it’s worth discussing the dataset. These are 1 week’s worth of geocoded tweets, mainly around India (but including Pakistan, Nepal, etc.). It’s interesting that there were just 80,000 geocoded tweets in this period – and many of them were FourSquare entries.)

It’s interesting that people with few followers often talk about “know”, “high” and “traffic”. People with many followers use significantly more hashtags. Whether this is a cause or an effect of having many followers is, of course, debatable. But the correlation is quite definite.

It also appears that those with more followers are polite. The “good morning”s and “thank you”s are quite to the right. Those with more followers are more likely to say “good” than “bad”, and vice versa. Perhaps there’s something about having Twitter followers that leads to happiness – or is it the other way around?


This picture shows you the words more often used in replies (on the left, in red) when compared to new tweets (on the right, in green).

“haha” and “lol” appear rather prominently in replies. Either folks who reply are an amused bunch, or it’s the funny tweets that get more replies. A lot of replies are also to thank people. The dominance of Mumbai, Maharashtra and Delhi on the right is most easily explained by the presence of the words “@foursquare” and “mayor” – most of these tweets appear to be FourSquare related.


The above shows the words used in the morning (up to 12 noon) vs the evening. Clearly, people mention “morning” in the morning – often, but not always, in the context of “good morning”. The evenings, at least in this week, were dominated by Euro 2012.

The visualisation used above is a document contrast diagram. Each word is drawn as a bubble, whose size represents its frequency. The horizontal position determines whether the word is closer to one aspect or another – e.g. replies on the left vs new tweets on the right. This is a very quick and easy way of understanding what characterises an aspect (e.g. which words are often used with good vs bad), as well as the context in which words are used.
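As a rough illustration of the idea, here is a minimal Python sketch of how the bubble size and horizontal position could be computed. The function name `contrast_positions`, the sample tweets and the choice to normalise by group size are all assumptions made for this example, not the method used for the charts above.

```python
from collections import Counter

def contrast_positions(left_docs, right_docs):
    """Return {word: (total_count, position)}.

    position runs from 0.0 (used only in left_docs) to 1.0 (used only
    in right_docs); total_count would drive the bubble size.
    """
    left = Counter(w for doc in left_docs for w in doc.lower().split())
    right = Counter(w for doc in right_docs for w in doc.lower().split())
    # Normalise by group size so a bigger group does not pull every word its way.
    left_total = sum(left.values()) or 1
    right_total = sum(right.values()) or 1
    result = {}
    for word in left.keys() | right.keys():
        l_rate = left[word] / left_total
        r_rate = right[word] / right_total
        result[word] = (left[word] + right[word], r_rate / (l_rate + r_rate))
    return result

# Made-up sample data: replies on the left, new tweets on the right.
replies = ["haha thanks", "lol thanks a lot"]
new_tweets = ["good morning mumbai", "good morning"]
positions = contrast_positions(replies, new_tweets)
```

A word that appears equally often (relative to group size) on both sides lands at 0.5, in the middle of the diagram.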

Student browsing patterns

This is a guest post by Rahul Gonsalves of Pixelogue.

About a week ago, Anand suggested that we spend a day some weekend working collaboratively on data visualisation. I jumped at the chance to spend a day working with and learning from him, and this is how we found ourselves at the Gramener office on a Sunday morning.

We decided to look at a dataset that Anand has blogged about before – computer usage of MSIT students at CIHL, a consortium of universities based out of IIIT, Hyderabad. Over a period of seven weeks, students’ computer usage was tracked. The data includes application usage and duration, internet browsing patterns, and even keystrokes, broken down by user. If this data sounds like a privacy landmine, that’s because it is! The only consolation is that all the students involved in the study consented to have their usage tracked, and so were presumably aware of what was happening.

We decided to look at a subset of this data, their internet usage, and to try to answer the following questions: What websites do people browse at different times of day? Are there interesting patterns that emerge? Do “social” websites constitute a significant portion of their browsing time?

We created an interactive visualisation, as well as an Excel based one. The interactive version is available at

In Excel, the variables at our disposal included:

  1. User
  2. URL
  3. Time of browsing

We pulled the data into Excel, and had the following table:


We then split up the time values in Excel into their component pieces (hour and minute), so that 22-11-2011 10:19 becomes:


You can see the raw data and the formulas used in the following screenshot:


We combined the hour and minute into a value which we called “Minute of the Day”, which is simply the number of minutes elapsed since 12AM. 1am is 60, 2am is 120, 3am is 180 and so forth.
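For readers who would rather do this step outside Excel, here is a minimal Python sketch of the same calculation. The function name `minute_of_day` is made up for this example, and the day-month-year format string is an assumption based on the sample timestamp above.

```python
from datetime import datetime

def minute_of_day(timestamp, fmt="%d-%m-%Y %H:%M"):
    """Number of minutes elapsed since midnight for a timestamp string."""
    t = datetime.strptime(timestamp, fmt)
    return t.hour * 60 + t.minute

# 10:19 is 10 * 60 + 19 minutes after midnight.
example = minute_of_day("22-11-2011 10:19")
```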

We then used a pivot table to count how often each domain was accessed, which allowed us to generate the top 10 most accessed domains. (Facebook, unsurprisingly, was 2nd, right behind a local address, which is presumably a development server.)
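The pivot table step amounts to a frequency count per domain. A rough Python equivalent is sketched below; the function name `top_domains` and the sample URLs are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=10):
    """Return the n most frequently accessed domains as (domain, count) pairs."""
    domains = Counter(urlparse(u).netloc for u in urls)
    return domains.most_common(n)

# Hypothetical browsing log: a local address plus a couple of public sites.
sample_urls = [
    "http://10.0.0.5/project",
    "http://10.0.0.5/project/build",
    "http://10.0.0.5/",
    "https://www.facebook.com/",
    "https://www.facebook.com/home.php",
    "https://mail.google.com/mail",
]
ranking = top_domains(sample_urls, n=2)
```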


We arranged the domains on the horizontal axis, with the hour of day listed on the vertical axis, as below:

At this point, Anand pulled out his Excel magic and pulled in the number of times within each hour that a particular domain was accessed. COUNTIFS counts the number of times the domain was accessed within that hour. IFERROR ensures that errors are counted as zeroes. (This formula works only in Excel 2007 and later.)
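The same hour-by-domain grid can be sketched in Python. This is only an approximation of what the spreadsheet formula does: the function name `hourly_counts`, the (domain, hour) input shape and the sample visits are assumptions for illustration. A `Counter` returns 0 for combinations it has never seen, which plays the role that IFERROR plays in the spreadsheet.

```python
from collections import Counter

def hourly_counts(visits, domains):
    """Build a 24-row grid of access counts.

    visits  : iterable of (domain, hour) pairs, hour in 0..23
    domains : column order for the grid
    Returns one row per hour, one column per domain.
    """
    counts = Counter(visits)
    # Missing (domain, hour) combinations count as zero.
    return [[counts[(d, h)] for d in domains] for h in range(24)]

# Hypothetical visits, not taken from the dataset.
visits = [
    ("10.0.0.5", 10),
    ("10.0.0.5", 10),
    ("www.facebook.com", 10),
    ("www.facebook.com", 22),
]
grid = hourly_counts(visits, ["10.0.0.5", "www.facebook.com"])
```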


The result of applying this formula across the whole table is given below:


Using the conditional formatting tools, we applied a colour scale that changes the cell background colour: a darker green implies a higher frequency, while a lighter colour implies a lower frequency at that point in time.


The extreme preponderance of the top hit (the local dev server) led to a not very useful visualisation, with only the highest values being marked out.


Using a logarithmic scale helps give a better heatmap, as can be seen in the following screenshots.
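To see why the logarithm helps, here is a small Python sketch that rescales a grid of counts to the range 0 to 1. The function name `log_scale` is made up, and using log(1 + count) rather than a plain logarithm is an assumption made here so that cells with zero visits stay at zero; the spreadsheet may have handled zeroes differently.

```python
import math

def log_scale(grid):
    """Rescale counts to 0..1 using log(1 + count), so one huge value
    no longer washes out all the smaller ones."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0 for _ in row] for row in grid]
    top = math.log1p(peak)
    return [[math.log1p(v) / top for v in row] for row in grid]

# On a linear scale 9 and 99 are almost invisible next to 999.
# On the log scale they sit at roughly one third and two thirds of the range.
scaled = log_scale([[0, 9], [99, 999]])
```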


We finally arrived at the following heatmap, which offers some insights into the ways that the students on this particular course spent their time.


We talked about different ways of depicting this data, which resulted in the following interactive visualization of the way a student spends his or her time on an average day in Hyderabad. We hope you enjoy it!

Common birthdays


This visualisation shows the popularity of birthdays in the US between 1973 and 1999. The darkness of the colour shows how that birthday ranks in popularity. Darker colours are more popular (i.e. better ranked) birthdays.

  • Most people are born in August & September (and therefore were conceived around November & December, during the holidays, perhaps?)
  • However, very few people are actually born during holidays – New Year, Independence Day, Halloween, Thanksgiving and Christmas. (People don’t like to spoil their holidays?)
  • Few people are born on the 1st of April. (You don’t want your kid born on Fool’s Day.)
  • Few people are born on the 13th of any month. (Unlucky?)
  • Plenty are born on Valentine’s Day and St Patrick’s Day.

We tried to see what this looked like in India.

Based on school registration data for ~700,000 students born between 1992 and 1995, here’s what it looks like. (Click for a larger version.)


This shows a number of bizarre patterns:

  • Almost everyone’s born in May and June – just before school opens.
  • Almost no one is born in August – after school opens.
  • An unusual number of people have round-numbered days as birthdays – 5th, 10th, 15th, 20th, 25th and 30th. (This round-numbered pattern was also seen when we analysed utility fraud.)
  • January 1st is fairly popular. Other than that, none of the holidays seem to have an effect.
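One simple way to check the round-number pattern is to measure what share of birthdays fall on those days. The sketch below is only illustrative: the function name `round_day_share` and the sample dates are made up. If birthdays were spread evenly, round-numbered days would account for 71 of the 365 days in a non-leap year (six per month, except February, which has no 30th), or a little under 20%.

```python
from collections import Counter
from datetime import date

ROUND_DAYS = {5, 10, 15, 20, 25, 30}

def round_day_share(birthdays):
    """Fraction of birthdays that fall on the 5th, 10th, ..., 30th."""
    days = Counter(d.day for d in birthdays)
    total = sum(days.values())
    return sum(days[d] for d in ROUND_DAYS) / total if total else 0.0

# Made-up sample: three of these four dates are round-numbered days.
sample = [date(1993, 5, 10), date(1993, 5, 15), date(1993, 6, 3), date(1993, 6, 20)]
share = round_day_share(sample)

# Share expected if every day of a non-leap year were equally likely.
expected_share = 71 / 365
```

A share well above the expected value would point to recorded dates being rounded rather than real.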

In fact, these results are so striking that we are tempted to believe that the popularly accepted proof for a person’s age – their Class 10 certificate – generally bears a convenient fiction created for the purposes of school admission several years ago.