What the World is looking for

Why is Andre Agassi

Large-scale sociological research has never been this easy. Google’s search suggestions are based on what people search for on its search engine. This can be a fairly good reflection of what people are currently interested in, making it a powerful tool for research. (You could also save these results and look at them over time to see trends in these preferences, but that’s a topic for a different day.)

So, to learn what ques­tions people are ask­ing about Andre Agassi, just go to Google’s search box and type “Why is Andre Agassi” and wait for a second. (People want to know why he’s fam­ous, why he’s bald, why he broke up with Brooke Shields, and why he wore a wig.)

Or, to see what India is in­ter­ested in learn­ing, just type “How to” on google.co.in and you’ll find – per­haps to your sur­prise – that Indians want to learn:

  • how to kiss
  • how to lose weight
  • how to down­load you­tube videos
  • how to get preg­nant (clearly less im­port­ant than kiss­ing well)

Search for How to on Google India

On the oth­er hand, the UK wants to know

  • how to make loom bands (but why?)
  • how to lose weight
  • how to make pan­cakes (which may not be a good idea if you want to lose weight)
  • how to write a cv

Search for How to on Google UK

The US wants to learn

  • how to train your dragon 2 (that’s the an­im­ated film)
  • how to tie a tie
  • how to hard boil eggs
  • how to lose weight

Search for How to on Google US

What’s clear is that people of all three na­tions have los­ing weight as one of their top 4 pri­or­it­ies, but vary quite a bit in their pref­er­ences oth­er­wise.
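The comparison above amounts to a set intersection. Here is a minimal Python sketch using only the four suggestions listed for each country in this post; the variable names are illustrative.

```python
# Top "How to" suggestions per country, exactly as listed in this post.
top_suggestions = {
    "India": ["how to kiss", "how to lose weight",
              "how to download youtube videos", "how to get pregnant"],
    "UK": ["how to make loom bands", "how to lose weight",
           "how to make pancakes", "how to write a cv"],
    "US": ["how to train your dragon 2", "how to tie a tie",
           "how to hard boil eggs", "how to lose weight"],
}

# Suggestions that appear in every country's list.
common = set.intersection(*(set(items) for items in top_suggestions.values()))

# Suggestions that appear in one country's list but not in all of them.
distinct = {country: sorted(set(items) - common)
            for country, items in top_suggestions.items()}

print(common)  # {'how to lose weight'}
```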

At Gramener, we put to­geth­er a com­pil­a­tion of the search res­ults for com­mon ques­tions.

Search for questions on Google

There are sev­er­al nug­gets in here. The world is gen­er­ally curi­ous about why Salman Khan is not mar­ried, and why he’s not in jail. But the pref­er­ence and or­der of ques­tions var­ies from coun­try to coun­try.

Why is Salman Khan

The focus on inventions varies a lot across regions too. Indians are the only ones who seem concerned about who invented zero. For the British, football comes ahead of the Internet and electricity.

Who invented

You can explore these and more at https://gramener.com/search/

If you find any interesting query patterns, please let us know either in the comments below or via Twitter. We’ll add them here.

The language of tweets

This post is part of the out­put of the Bangalore Fifth Elephant Hacknight.

followers

What you see above are the words most often used on Twitter by Indians. The size of the bubble indicates how often the word is used.

We were look­ing at wheth­er there are spe­cific words that people with a large num­ber of fol­low­ers use, that are dis­tinct from people with few fol­low­ers. The words on the left (also col­oured red) are used mainly by people with few fol­low­ers. The words on the right (also col­oured green) are mainly used by people with many fol­low­ers.

(At this point, it’s worth discussing the dataset. These are 1 week’s worth of geocoded tweets, mainly around India (but including Pakistan, Nepal, etc.). It’s interesting that there were just 80,000 geocoded tweets in this period – and many of them were FourSquare entries.)

It’s interesting that people with few followers often talk about “know”, “high” and “traffic”. People with many followers use significantly more hashtags. Whether this is a cause or an effect of having many followers is, of course, debatable. But the correlation is quite definite.

It also ap­pears that those with more fol­low­ers are po­lite. The “good morning”s and “thank you”s are quite to the right. Those with more fol­low­ers are more likely to say “good” than “bad”, and vice ver­sa. Perhaps there’s some­thing about hav­ing Twitter fol­low­ers that leads to hap­pi­ness – or is it the oth­er way around?

replies

This pic­ture shows you the words more of­ten used in replies (on the left, in red) when com­pared to new tweets (on the right, in green).

“haha” and “lol” ap­pear rather prom­in­ently in replies. Either folks who reply are an amused bunch, or it’s the funny tweets that get more replies. A lot of replies are also to thank people. The dom­in­ance of Mumbai, Maharashtra and Delhi on the right is easi­est ex­plained by the pres­ence of the words “@foursquare” and “may­or” – most of these tweets ap­pear to be FourSquare re­lated.

morning

The above shows the words used in the morning (up to 12 noon) vs the evening. Clearly, people mention “morning” in the morning – often, but not always, in the context of “good morning”. The evenings, at least in this week, were dominated by Euro 2012.

The visu­al­isa­tion used above is a doc­u­ment con­trast dia­gram. Each word is drawn as a bubble, whose size rep­res­ents its fre­quency. The ho­ri­zont­al po­s­i­tion de­term­ines wheth­er the word is closer to one as­pect or an­other – e.g. replies on the left vs new tweets on the right. This is a very quick and easy way of un­der­stand­ing what char­ac­ter­ises an as­pect (e.g. which words are of­ten used with good vs bad), as well as the con­text in which words are used.
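As a rough illustration of how such a layout could be computed, here is a minimal Python sketch. The position formula (difference over total, from -1 to +1), the function name `contrast_layout` and the sample tweets are all assumptions made for this example. They are not the method or data behind the diagrams above.

```python
from collections import Counter

def contrast_layout(left_docs, right_docs):
    """Return {word: (position, size)}.

    position runs from -1.0 (used only on the left) to +1.0 (only on the right);
    size is the total number of times the word is used.
    """
    left = Counter(word for doc in left_docs for word in doc.lower().split())
    right = Counter(word for doc in right_docs for word in doc.lower().split())
    layout = {}
    for word in left.keys() | right.keys():
        total = left[word] + right[word]  # Counter gives 0 for missing words
        layout[word] = ((right[word] - left[word]) / total, total)
    return layout

# Hypothetical tweets, invented for illustration only.
replies = ["haha thanks", "lol thanks"]
new_tweets = ["good morning mumbai", "thanks mumbai"]

layout = contrast_layout(replies, new_tweets)
for word, (position, size) in sorted(layout.items(), key=lambda item: item[1][0]):
    print(f"{word:10s} position={position:+.2f} size={size}")
```

This sketch uses raw counts. If the two groups differ a lot in size, dividing each count by its group's total word count first would probably give a fairer position.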

Student browsing patterns

This is a guest post by Rahul Gonsalves of Pixelogue.

About a week ago, Anand sug­ges­ted that we spend a day some week­end work­ing col­lab­or­at­ively on data visu­al­isa­tion. I jumped at the chance to spend a day work­ing and learn­ing from him and this is how we found ourselves at the Gramener of­fice on a Sunday morn­ing.

We de­cided to look at a data­set that Anand has blogged about be­fore – com­puter us­age of MSIT stu­dents at CIHL, a con­sor­ti­um of uni­ver­sit­ies based out of IIIT, Hyderabad. Over a peri­od of sev­en weeks, stu­dents’ com­puter us­age was tracked. The data in­cludes ap­plic­a­tion us­age and dur­a­tion, in­ter­net brows­ing pat­terns, and even key­strokes, broken down by user. If this data sounds like a pri­vacy land­mine, that’s be­cause it is! The only con­sol­a­tion is that all the stu­dents in­volved in the study con­sen­ted to have their us­age tracked, and so were pre­sum­ably aware of what was hap­pen­ing.

We decided to look at a subset of this data, the students’ internet usage, and to try to answer questions such as: What websites do people browse at different times of day? Are there interesting patterns that emerge? Do “social” websites constitute a significant portion of their browsing time?

We cre­ated an in­ter­act­ive visu­al­isa­tion, as well as an Excel based one. The in­ter­act­ive ver­sion is avail­able at http://gramener.com/siteusage/

On Excel, the vari­ables at our dis­pos­al in­cluded:

  1. User
  2. URL
  3. Time of brows­ing

We pulled the data in­to Excel, and had the fol­low­ing table:

excel-1

We then split up the time val­ues in Excel in­to their com­pon­ent pieces (hour and minute), so that 22-11-2011 10:19 be­comes:

excel-2

You can see the raw data and the for­mu­las used in the fol­low­ing screen­shot:

excel-3

We combined the hour and minute into a single value which we called “Minute of the Day”, which is simply the number of minutes elapsed since 12 AM: 1 AM is 60, 2 AM is 120, 3 AM is 180 and so forth.
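For readers who would rather not use Excel, the same calculation can be sketched in Python. The day-month-year format is assumed from the sample timestamp 22-11-2011 10:19, and the function name is illustrative.

```python
from datetime import datetime

def minute_of_day(timestamp: str) -> int:
    """Return the minutes elapsed since midnight for a 'DD-MM-YYYY HH:MM' string."""
    parsed = datetime.strptime(timestamp, "%d-%m-%Y %H:%M")
    return parsed.hour * 60 + parsed.minute

print(minute_of_day("22-11-2011 10:19"))  # 10 * 60 + 19 = 619
```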

We then used a pivot table to tabulate the domains accessed by frequency, which gave us the top 10 most accessed domains. (Facebook, unsurprisingly, was 2nd, right behind a local address, 10.10.10.68, which is presumably a development server.)

excel-4
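The pivot-table step has a close analogue in Python's standard library. This is a sketch only: the URLs are hypothetical sample rows rather than data from the study, and it assumes each URL includes its scheme (such as http://) so the hostname can be parsed.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=10):
    """Count visits per domain and return the n most frequent as (domain, count) pairs."""
    counts = Counter(urlparse(url).hostname for url in urls)
    return counts.most_common(n)

# Hypothetical sample rows, not taken from the study's data.
sample_urls = [
    "http://10.10.10.68/course/1",
    "http://10.10.10.68/course/2",
    "http://10.10.10.68/course/3",
    "http://www.facebook.com/home",
    "http://www.facebook.com/profile",
    "http://www.google.com/search",
]

print(top_domains(sample_urls, n=2))
# [('10.10.10.68', 3), ('www.facebook.com', 2)]
```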

We ar­ranged the do­mains on the ho­ri­zont­al ax­is, with the hour of day lis­ted on the y-axis, as be­low:

At this point, Anand pulled out his Excel magic and pulled in the number of times within that hour that a particular domain was accessed. COUNTIFS counts the number of times the domain was accessed at that particular minute. IFERROR ensures that errors are counted as zeroes. (This formula works only in Excel 2007 and later.)

excel-6
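The counting step can also be sketched in Python. This is not the formula from the screenshot. It assumes one row per visit with a domain and a 'DD-MM-YYYY HH:MM' timestamp, and it counts per hour to match the table's axes. The function name and sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime

def visits_by_hour(rows, domains):
    """Build a 24-row grid where grid[hour][domain] is the number of visits in that hour."""
    counts = Counter()
    for domain, timestamp in rows:
        hour = datetime.strptime(timestamp, "%d-%m-%Y %H:%M").hour
        counts[(domain, hour)] += 1
    # A Counter returns 0 for missing keys, so empty cells become zeroes,
    # which is similar in spirit to wrapping the formula in IFERROR.
    return [{d: counts[(d, hour)] for d in domains} for hour in range(24)]

# Hypothetical sample rows, not taken from the study's data.
rows = [
    ("10.10.10.68", "22-11-2011 10:19"),
    ("10.10.10.68", "22-11-2011 10:45"),
    ("www.facebook.com", "22-11-2011 22:05"),
]
grid = visits_by_hour(rows, ["10.10.10.68", "www.facebook.com"])
print(grid[10])  # {'10.10.10.68': 2, 'www.facebook.com': 0}
```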

The results of applying this particular formula across the whole table are given below:

excel-7

Using the con­di­tion­al format­ting tools, we are able to ap­ply a col­our scale that changes the cell back­ground col­our — a dark­er green im­plies a higher fre­quency while a lighter col­our im­plies a lower in­cid­ence at that point in time.

excel-8

The ex­treme pre­pon­der­ance of the top hit (the loc­al dev server, 10.10.10.68) led to a not very use­ful visu­al­isa­tion, with only the highest val­ues be­ing marked out.

excel-9

Using a logarithmic scale helps give a better heatmap, as can be seen in the following screenshots.

excel-a
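To see why the logarithmic scale helps, here is a minimal Python sketch that maps counts to a colour intensity between 0 and 1. The counts are made up, and the use of log(1 + x) is an assumption chosen so that zero counts stay at zero. It is not necessarily the transformation used in the spreadsheet.

```python
import math

def log_intensity(count, max_count):
    """Map a visit count to a 0..1 colour intensity on a logarithmic scale."""
    if max_count <= 0:
        return 0.0
    # log1p(x) = log(1 + x), so a count of zero maps cleanly to zero intensity.
    return math.log1p(count) / math.log1p(max_count)

# Made-up counts: one dominant value (like the local server) and smaller ones.
counts = [0, 9, 99, 9999]
linear = [c / max(counts) for c in counts]
logged = [log_intensity(c, max(counts)) for c in counts]

for c, lin, lg in zip(counts, linear, logged):
    print(f"count={c:5d} linear={lin:.4f} log={lg:.2f}")
```

On the linear scale, the counts of 9 and 99 sit at roughly 0.001 and 0.01, so they are practically invisible next to the maximum. On the logarithmic scale they land at about 0.25 and 0.5, which is why the mid-range cells become distinguishable.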

We finally arrived at the following heatmap, which offers some insights into the ways that the students on this particular course spent their time.

excel-b

We talked about dif­fer­ent ways of de­pict­ing this data, which res­ul­ted in the fol­low­ing in­ter­act­ive visu­al­iz­a­tion of the way a stu­dent spends his or her time on an av­er­age day in Hyderabad. We hope you en­joy it!