The Mahabharatha in Pictures

At 1.8 million words, the Mahabharatha is one of the largest epics – roughly 10 times the size of the Iliad and Odyssey combined. At some level, this represents “big data”. Text is generally considered “unstructured” and therefore tough to analyse. But the growing field of text analytics and text visualisation tells us that there’s a lot more structure to plain text than one might think.

To be­gin with, a word cloud can tell us a lot about the story.

mahabharatha-wordcloud

The story is obviously about a battle between great kings and sons, with the principal characters being Arjuna, Pandu, Bhishma, Bharata, Karna, Duryodhana, Yudhishthira, Vaisampayana, etc. That’s decipherable without having to read the text.

The structure that we glean from it arises from a frequency distribution of the words – i.e. a count of how many times each word occurs. The word cloud plots each word at a font size proportional to its frequency of occurrence. (Wordle is a good place to create word clouds.)
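As a rough illustration of that frequency count, here is a minimal Python sketch. The sample sentence, the tiny stop-word list and the `font_sizes` helper are all made up for this example; they are not the code behind the word cloud above.

```python
import re
from collections import Counter

# Hypothetical sample text; the real input would be the full text of the epic.
text = "Arjuna spoke to Krishna. Krishna answered Arjuna, and Arjuna listened."

# Assumed stop-word list, kept tiny for illustration.
STOP_WORDS = {"to", "and", "the", "of", "a"}

# Lower-case the text, split it into words and drop the stop words.
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
frequency = Counter(words)


def font_sizes(freq, max_size=72):
    """Scale each word's font size in proportion to its frequency."""
    top = max(freq.values())
    return {word: max_size * count / top for word, count in freq.items()}


sizes = font_sizes(frequency)
for word, count in frequency.most_common(3):
    print(word, count, round(sizes[word]))
```

The most frequent word gets the largest font, and every other word is scaled down relative to it.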

Now that we know who the principal characters are, the next questions are: where are they mentioned, and which of them are closely related?

Our Mahabharatha browser provides a sim­ple in­ter­face to browse the full text of the Mahabharatha and find where the char­ac­ters ap­pear.

mahabharatha-mentions

The Mahabharatha is made of 18 books, each with several sections. This visualisation shows each section as a block (the length of the block is proportional to the size of the section). When you click on a character’s name, the positions in each section where they are mentioned are highlighted.

This makes it easy to see where char­ac­ters speak to­geth­er (e.g. where does Kunti throw away Karna? Where does she meet him again? Did Draupadi really love Karna be­fore her wed­ding? Was Arjuna really her fa­vour­ite? Whom does Krishna fa­vour? etc.) By click­ing on the sec­tion, you can read the full text of that sec­tion.

The second question is: which characters are most closely related? Measuring closeness of characters is a difficult thing to do, even for humans. Fortunately, with text, we can rely on a proxy: how often two characters are found within a few words of each other.

If we take Draupadi as a bench­mark char­ac­ter and check how of­ten vari­ous people are men­tioned with­in a few words of her, here’s what the pic­ture looks like:

mahabharatha-draupadi-closeness

Each row has the name of the char­ac­ter (along with ali­ases). The first column shows the num­ber of times they’re men­tioned with­in 50 words of her. The next shows how many times they’re men­tioned with­in 100 words of her. And so on. (All with­in the same sec­tion.)

A visu­al in­spec­tion sug­gests that many char­ac­ters start fad­ing off at a dis­tance of 200 words, so per­haps 200 might be a reas­on­able bound­ary to con­sider. (This is ar­bit­rary. But based on our sub­sequent ana­lys­is, we find that this para­met­er does not im­pact the visu­al res­ult too much.)
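One way to compute such a proximity count is sketched below in Python. It rests on an assumption about what “mentioned within N words” means: a mention of the other character counts once if any mention of the benchmark character lies within N words of it in the same section. The function names, the alias sets and the toy sections are hypothetical, and the windows are tiny only because the sample text is.

```python
import re


def positions(tokens, aliases):
    """Word offsets at which any alias of a character appears."""
    return [i for i, tok in enumerate(tokens) if tok in aliases]


def mentions_within(sections, benchmark, other, window):
    """Count mentions of `other` that lie within `window` words of a
    `benchmark` mention in the same section."""
    total = 0
    for section in sections:
        tokens = re.findall(r"[a-z]+", section.lower())
        anchors = positions(tokens, benchmark)
        for pos in positions(tokens, other):
            if any(abs(pos - a) <= window for a in anchors):
                total += 1
    return total


# Toy sections standing in for the real text.
sections = [
    "draupadi greeted bhima while krishna stayed far away",
    "krishna and arjuna rode out together",
]
DRAUPADI = {"draupadi", "panchali"}  # a name plus one alias

for window in (2, 5):
    print(window,
          mentions_within(sections, DRAUPADI, {"bhima"}, window),
          mentions_within(sections, DRAUPADI, {"krishna"}, window))
```

On the full text, the same function would be called with windows of 50, 100, 200 and so on to fill one row of the picture.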

By plot­ting a net­work of their close­ness, one can get some in­sights about the struc­ture of the tale.

mahabharatha-network

Yudhishthira is clearly at the centre of the plot. Arjuna, sur­pris­ingly, isn’t. Apart from his close re­la­tion­ship with Krishna and Bhishma, his in­ter­ac­tion with oth­er char­ac­ters is not as well spread out (des­pite his pop­ular­ity in the epic.) Contrary to pop­ular opin­ion, Bhima is men­tioned quite of­ten, and is fairly well-networked. Nakula and Sahadeva re­main peri­pher­al char­ac­ters. Gandhari is nearly out­side of the net­work, ex­cept for her con­nec­tion with her hus­band Dhritarashtra, sister-in-law Kunti, and brother-in-law Vidura (with whom she seems to con­verse much more than with her hus­band.)

Another way of look­ing at this pic­ture is through a cor­rel­a­tion mat­rix.

mahabharatha-matrix

This shows each pair of characters and the number of times they occur within 200 words of each other. The closeness between Nakula and Sahadeva is very obvious; so is that between Drona & Kripa, Dhritarashtra & Vidura, and Arjuna & Krishna. Draupadi is mentioned with Dhrishtadhyumna more than with anyone else.
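A matrix like this can be filled by running the proximity count over every pair of characters. The Python sketch below assumes the word offsets of each character’s mentions in a section are already known, and it counts pairs of mentions that fall within the window (one plausible reading of “occur within 200 words of each other”). The offsets and the `pair_counts` name are invented for illustration.

```python
from itertools import combinations


def pair_counts(mentions, window=200):
    """mentions: {character: [word offsets]} for one section.
    Returns {(a, b): number of (a, b) mention pairs within `window` words}."""
    counts = {}
    for a, b in combinations(sorted(mentions), 2):
        counts[(a, b)] = sum(
            1
            for i in mentions[a]
            for j in mentions[b]
            if abs(i - j) <= window
        )
    return counts


# Hypothetical offsets for three characters in one section.
section = {"Nakula": [10, 400], "Sahadeva": [15, 390], "Karna": [900]}
matrix = pair_counts(section)

for pair, count in matrix.items():
    print(pair, count)
```

Summing these per-section dictionaries across all sections would give the totals for each cell.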

You can also see the blocks breaking up into clusters of sorts – on the bottom right are the primary characters. They interact a lot with each other. In the middle are secondary characters, who again interact amongst themselves; and then there are the narrators on the top left. This is in line with the Mahabharatha discussing several side-plots with secondary characters in parallel with the main plot. The story of Dhrishtadhyumna, of Satyaki, Nakula and Sahadeva’s conversations, etc. are examples of these. In fact, in a larger scatterplot, you can see many more tales emerge, such as Nala & Damayanti; Nahusha & Yayati; Uma & Daksha; Vasishta & Vishwamitra; Chitrasena & Vikarna; Virata & Uttara; Dhrishtadhyumna & Shikhandin; Parva & Sambhava; even Ravana & Vali.

If you are in­ter­ested in see­ing the full cor­rel­a­tion mat­rix with all ma­jor and minor char­ac­ters, please reach us at contact@gramener.com.

Comparing school performance

Continuing the design jams, we had one at Akshara’s office last weekend. The dataset we decided to pursue was the Karnataka SSLC results, which we had for 5 years.

We ad­dressed two ques­tions:

  1. How do Government schools per­form when com­pared to private schools?
  2. How does the me­di­um of in­struc­tion af­fect marks in dif­fer­ent sub­jects?

When com­par­ing Government and private schools, here’s the res­ult.

govt-private-schools

Each box is a school. The size of the box rep­res­ents the num­ber of stu­dents from that school who ap­peared in the Class X ex­am. (Only schools with at least 60 stu­dents were con­sidered.) The col­our rep­res­ents the av­er­age mark – red is low, and green is high.
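To make the encoding concrete, here is a small Python sketch of how each school could be turned into a box. The `School` record, the sample schools and the linear red-to-green ramp are assumptions for illustration; the picture above may use a different colour scale.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    students: int        # number who appeared in the Class X exam
    average_mark: float  # average mark as a percentage, 0-100


def mark_colour(average_mark):
    """Assumed linear ramp: 0% is pure red, 100% is pure green."""
    share = max(0.0, min(1.0, average_mark / 100))
    red, green = round(255 * (1 - share)), round(255 * share)
    return f"#{red:02x}{green:02x}00"


def treemap_boxes(schools, min_students=60):
    """Keep schools with enough students; box size = students, colour = mark."""
    return [
        (s.name, s.students, mark_colour(s.average_mark))
        for s in schools
        if s.students >= min_students
    ]


# Hypothetical schools, not taken from the SSLC data.
schools = [School("A", 120, 100.0), School("B", 45, 70.0), School("C", 80, 0.0)]
boxes = treemap_boxes(schools)
print(boxes)
```

School B is dropped because it has fewer than 60 students; the remaining tuples are what a treemap layout would consume.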

What’s immediately obvious is that private schools perform much better on average than Government schools. What’s less clear is where this difference starts. The series of graphs below shows the number of schools at various mark ranges. The first shows schools with an average of 0 – 30%. The next, from 0 – 40%, and so on until 80%. Then it shows schools with an average of 30% – 100%. The next, from 40% – 100%, and so on until 80% – 100%.

bschool-00-30 bschool-00-40 bschool-00-50 bschool-00-60 bschool-00-70 bschool-00-80
bschool-30-100 bschool-40-100 bschool-50-100 bschool-60-100 bschool-70-100 bschool-80-100

From the first graph, you can see that there are as many poor schools (average 0 – 30%) among the private schools as among the Government schools. But from the last graph, you can see that there are far more good private schools (average 80 – 100%) than Government schools.
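The counting behind such a series of graphs is simple. The Python sketch below uses made-up school averages and a hypothetical `count_in_range` helper; it builds the same twelve mark ranges and counts schools of each type in a range.

```python
def count_in_range(schools, low, high):
    """schools: list of (school_type, average_mark).
    Count schools of each type whose average lies in [low, high]."""
    counts = {}
    for school_type, mark in schools:
        if low <= mark <= high:
            counts[school_type] = counts.get(school_type, 0) + 1
    return counts


# Hypothetical averages, not the SSLC results.
schools = [
    ("Government", 25), ("Private", 28), ("Government", 55),
    ("Private", 85), ("Private", 92), ("Government", 81),
]

# 0-30, 0-40, ... 0-80, then 30-100, 40-100, ... 80-100.
ranges = [(0, upper) for upper in range(30, 90, 10)]
ranges += [(lower, 100) for lower in range(30, 90, 10)]

for low, high in ranges:
    print(f"{low}-{high}%", count_in_range(schools, low, high))
```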

So, there are poorly performing schools among the private schools as well. However, there are very few excellent Government schools.

We com­pared the im­pact of me­di­um of in­struc­tion again­st the sub­jects as well. The table be­low shows boxes for each sub­ject taken un­der each me­di­um of in­struc­tion. The size of the box rep­res­ents the num­ber of stu­dents tak­ing that com­bin­a­tion. The col­our in­dic­ates the av­er­age mark (red is low, green is high.)

subject-medium

Clearly, Sanskrit is a high scoring language. (At least one person at the design jam chose Sanskrit for this very reason.) Kannada scores well too – especially as a first or third language; but not as well as a second language.

On average, English medium students have the highest marks, followed by Kannada medium students. Students studying in other mediums of instruction perform poorly in most subjects barring their language.

There’s clearly a strong correlation between the medium and the subject. Kannada medium students score high in Kannada, Urdu medium students score high in Urdu, and so on. But while English medium students do score high in English, they tend to score much better at Kannada, Urdu and Sanskrit!
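The subject-by-medium table boils down to a grouped count and average. Here is a minimal Python sketch of that grouping, using invented records and a hypothetical `average_marks` helper rather than the actual SSLC data.

```python
from collections import defaultdict


def average_marks(records):
    """records: iterable of (medium, subject, mark).
    Returns {(medium, subject): (number of students, average mark)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for medium, subject, mark in records:
        cell = totals[(medium, subject)]
        cell[0] += 1     # box size: students taking this combination
        cell[1] += mark  # running total, turned into an average below
    return {key: (n, total / n) for key, (n, total) in totals.items()}


# Hypothetical marks for illustration only.
records = [
    ("Kannada", "Kannada", 80),
    ("Kannada", "Kannada", 70),
    ("English", "Kannada", 90),
    ("Urdu", "Maths", 40),
]
cells = average_marks(records)

for (medium, subject), (students, mean) in cells.items():
    print(medium, subject, students, mean)
```

Each entry supplies the two numbers a box needs: the student count for its size and the average mark for its colour.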

You can ex­plore these res­ults at http://gramener/karnatakamarks/

Composing data visualisations

How does one cre­ate new data visu­al­isa­tions? Apart from the art, is there a sci­ence to it?

Let’s explore a few popular charts. We have the vertical bar graph and the horizontal bar graph, the stacked bar, the variwide or Marimekko chart, the waterfall, the scatterplot, the treemap, and so on.

The first thing you’ll ob­serve is that all of these are a series of rect­angles. (We’re treat­ing the dots on the scat­ter­plot as little squares.) The only thing that var­ies across these charts is the po­s­i­tion and size of the rect­angles – and the col­our as well.

That gives us a hint. Perhaps there are many ways of creating visualisations just by changing the position, size and colour of rectangles. For example, the horizontal bar graph can be constructed as follows:

  • The x po­s­i­tion is con­stant for each rect­angle. It starts at zero.
  • The width is pro­por­tion­al to the value of the series
  • The y po­s­i­tion is pro­por­tion­al to the in­dex of the val­ues (1,2,3,…)
  • The height is con­stant for each of the bars
  • The col­our is con­stant too.
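The five rules above translate almost directly into code. The Python sketch below is one possible rendering of them; the `Rect` record, the unit bar height and the colour name are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float
    colour: str


def horizontal_bars(values, bar_height=1.0, colour="steelblue"):
    """One rectangle per value, following the five rules in the list above."""
    return [
        Rect(
            x=0.0,                  # x is constant and starts at zero
            y=index * bar_height,   # y is proportional to the index (1, 2, 3, ...)
            width=float(value),     # width is proportional to the value
            height=bar_height,      # height is constant
            colour=colour,          # colour is constant
        )
        for index, value in enumerate(values, start=1)
    ]


bars = horizontal_bars([5, 3])
for bar in bars:
    print(bar)
```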

Whereas, if we look at a horizontal stacked bar, then:

  • The x po­s­i­tion is pro­por­tion­al to the cu­mu­lat­ive value of the series.
  • The width is pro­por­tion­al to the value of the series
  • The y po­s­i­tion is con­stant at zero
  • The height is con­stant for each of the bars
  • The col­our is based on the in­dex of the val­ues (dis­tinct col­ours la­belled 1,2,3,…)

Generalising this, we can con­struct a table like this that shows the struc­ture of vari­ous visu­al­isa­tions:

Chart                 x           width     y           height    colour
Vertical bar chart    index       constant  constant    value     constant
Stacked bar           index       constant  cumulative  value     index
Waterfall             index       constant  cumulative  value     constant
Scatterplot           value       constant  value       constant  index
Horizontal bar chart  constant    value     index       constant  constant
Variwide              cumulative  value     constant    value     constant
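A table like this can drive the drawing directly. The Python sketch below copies three of its rows into a dictionary and resolves each rule into numbers. Several things here are assumptions: “constant” means 0 for positions and colour and 1 for sizes, “cumulative” means the running total before each point, and a single data series feeds every attribute (so the scatterplot row, which needs two series, is left out). The `resolve` and `compose` names are made up for this example.

```python
from itertools import accumulate

# Rows copied from the table above: chart -> rule for each rectangle attribute.
CHARTS = {
    "Vertical bar chart": {"x": "index", "width": "constant", "y": "constant",
                           "height": "value", "colour": "constant"},
    "Waterfall": {"x": "index", "width": "constant", "y": "cumulative",
                  "height": "value", "colour": "constant"},
    "Horizontal bar chart": {"x": "constant", "width": "value", "y": "index",
                             "height": "constant", "colour": "constant"},
}

# Assumed constants: positions sit at 0, sizes are 1 unit, colour is slot 0.
CONSTANTS = {"x": 0, "y": 0, "width": 1, "height": 1, "colour": 0}


def resolve(rule, attribute, values):
    """Turn a rule name into one number per data point."""
    if rule == "index":
        return list(range(len(values)))
    if rule == "constant":
        return [CONSTANTS[attribute]] * len(values)
    if rule == "value":
        return list(values)
    if rule == "cumulative":
        return [0, *accumulate(values)][:-1]  # running total before each point
    raise ValueError(f"unknown rule: {rule}")


def compose(chart, values):
    """Build one rectangle (as a dict) per data point for the named chart."""
    columns = {attr: resolve(rule, attr, values)
               for attr, rule in CHARTS[chart].items()}
    return [{attr: column[i] for attr, column in columns.items()}
            for i in range(len(values))]


for rect in compose("Waterfall", [3, 1, 2]):
    print(rect)
```

Changing a single entry in `CHARTS` changes the chart, which is the kind of tweak the table invites.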

That leads to a line of thought: what if we tweaked this table? Would we get new visu­al­isa­tions that might be in­ter­est­ing?

Let’s ex­per­i­ment with a few.

waterfall-variwide

What if we took the waterfall chart, and made the constant widths proportional to value, instead? The waterfall chart shows a cumulative series of values (e.g. percentages). This new chart – a cascade chart – allows us to depict each bar’s relative importance as well as value.

boxes

What if we kept the width, height and y constant, and just let the x values vary as the index? It would just be a row of boxes. But we’d have the option of colouring them with a value. This could be useful when showing performance along a discrete series (e.g. attendance by weekday).

boxes

What if we allowed the x, y, width, height and colour to each vary with a different value? The graph looks like a scatterplot, but every dimension here – position, size, colour, even aspect ratio – indicates some informational measure.

This chart can, for ex­ample, show the po­s­i­tion and spread of two met­rics. For ex­ample, if the X-axis were sales, and the Y-axis were price, each bar could be the dis­tri­bu­tion of price and sales in a branch, with the col­our in­dic­at­ing growth of the branch.

Just us­ing the com­bin­a­tions dis­cussed above, there are 75 pos­sible types of visu­al­isa­tions – many of which are mean­ing­ful in dif­fer­ent cir­cum­stances. And this is just us­ing rect­angles.

What we’ve done here is map data to attributes of a visualisation. This is part of a generalised approach to graphics, similar to that covered by Leland Wilkinson’s Grammar of Graphics and implemented in libraries like ggplot2 or D3. Once we establish that basic concept – that a chart is a mapping of attributes to data – the variety of charts you’ll be able to create is unlimited, and you move from being a user of charts to a composer of data-driven visualisations.