Reverse moderation of marks

Forensic science applies scientific principles to evidence to discover past fraud. Let’s extend that to data, and begin a journey into data forensics. We’ll start with school marks.

One powerful tool in the hands of a forensic data scientist is the humble histogram: a plot of the frequency distribution of values. For example, a histogram of birthdays shows that they are not random: some dates are systematically avoided (e.g. April Fool’s day, the 13th of any month).

A histogram is powerful for the same reason that fingerprints are effective: the traces are easy to leave behind, difficult to erase, and highlight who did what and where.

Let’s apply this to the marks scored in English by the CBSE class 12 students in 2013. English is the single most common subject taken by students – over 8.5 lakh students wrote the English exams out of the 9.4 lakh students.

Typically, such mark distributions are normal distributions – smooth, thin near the ends and thick at the centre. This is mainly because most exams require a combination of abilities (spelling, grammar, comprehension, creativity, etc.). Few people excel in all of these. Few lack all of them completely. Hence we expect to see fewer people at the edges than at the centre.

Normal distribution
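To see why a combination of abilities tends to produce this shape, here is a minimal simulation sketch. It assumes, purely for illustration, that a mark is the sum of five independent skill scores, each spread evenly between 0 and 20. These numbers and names are ours and are not taken from the CBSE data.

```python
import random
from collections import Counter

def simulate_marks(num_students=10000, num_skills=5, max_per_skill=20, seed=42):
    """Simulate marks as the sum of several independent skill scores (an assumption)."""
    rng = random.Random(seed)
    return [
        sum(rng.randint(0, max_per_skill) for _ in range(num_skills))
        for _ in range(num_students)
    ]

marks = simulate_marks()
histogram = Counter(marks)

# Compare how many students land in the middle band versus the two extremes.
low = sum(count for mark, count in histogram.items() if mark <= 20)
middle = sum(count for mark, count in histogram.items() if 40 <= mark <= 60)
high = sum(count for mark, count in histogram.items() if mark >= 80)
print(low, middle, high)
```

In this toy model, far more simulated students fall in the 40 – 60 band than at either edge, because doing very well (or very badly) requires every skill to line up at once.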

What we observe, in fact, is the following distribution. The height of each bar represents the number of students who got a specific mark from 0 to 100.

CBSE Class 12 English Marks

Several items are noteworthy. Let’s begin with the two large spikes.

Pass mark

The spike on the left appears at 33 marks. Further, no student has a mark from 26 to 32.

According to the CBSE:

The qualifying marks in each subject of external examination shall be 33% at Secondary / Senior School Certificate Examinations. However at Senior School Certificate Examination in a subject involving practical work, a candidate must obtain 33% marks in the theory and 33% marks in the practical separately in addition to 33% marks in aggregate, in order to qualify in that subject.

That gives us a plausible explanation: the kind souls correcting these papers give borderline students the benefit of the doubt, and ensure that no one has “just failed”. Either students score no more than 25%, or they are unofficially bumped up to 33%. This is popularly termed moderation. However, it is not documented in any guidebook that we know of.
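Spotting this kind of fingerprint can be partly automated. The sketch below counts how many students got each mark, then lists the marks nobody scored (gaps) and the marks that tower over their neighbours (spikes). The sample marks are made up, and the function names and the threshold of 3 times the neighbouring average are our own assumptions, not a standard test.

```python
from collections import Counter

def find_gaps(histogram, lowest=0, highest=100):
    """Return the marks in [lowest, highest] that no student scored."""
    return [m for m in range(lowest, highest + 1) if histogram.get(m, 0) == 0]

def find_spikes(histogram, ratio=3.0):
    """Return marks whose count is at least `ratio` times the average count
    of the two neighbouring marks (floored at 1 to avoid dividing by zero)."""
    spikes = []
    for mark, count in sorted(histogram.items()):
        neighbours = (histogram.get(mark - 1, 0) + histogram.get(mark + 1, 0)) / 2
        if count >= ratio * max(neighbours, 1):
            spikes.append(mark)
    return spikes

# Made-up sample: nobody between 26 and 32, and a pile-up at 33.
sample_marks = [24] * 5 + [25] * 6 + [33] * 40 + [34] * 8 + [35] * 9 + [36] * 10
histogram = Counter(sample_marks)

print(find_gaps(histogram, 24, 36))   # marks with no students
print(find_spikes(histogram))         # marks that stand out from their neighbours
```

On this sample, the gap covers 26 to 32 and the only spike is at 33, which is the pattern described above.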

95 percent

The second spike on the right is equally interesting. This time, it is at 95%, and there are very few students scoring above 95%.

Unlike 33%, the 95% score might be driven less by the structure of exams and more by the media. In May 2013, several media organisations reported the steadily rising number of students scoring 95% and more, and the fact that this trend had continued across many years.

Our best guess is that this shows a pattern of reverse moderation. To reduce the number of students scoring above 95%, the marks of several such students were brought down to 95%, leading to the large spike.
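A small sketch shows how such a cap would leave a spike behind. It assumes a hypothetical rule that brings every mark above 95 down to exactly 95. This is only our guess at the mechanism, and the real data still has a few students above 95, so any actual rule would have been applied to some students rather than all. The raw marks are made up.

```python
from collections import Counter

def cap_marks(marks, ceiling=95):
    """Bring every mark above `ceiling` down to `ceiling` (a hypothetical rule)."""
    return [min(mark, ceiling) for mark in marks]

# Made-up raw marks near the top of the range, thinning out smoothly.
raw_marks = [93] * 30 + [94] * 25 + [95] * 20 + [96] * 15 + [97] * 10 + [98] * 5
before = Counter(raw_marks)
after = Counter(cap_marks(raw_marks))

# Number of students at exactly 95, before and after capping.
print(before[95], after[95])
```

Everyone who would have scored 96 or more is added to the bar at 95, so a smoothly thinning tail turns into a single tall bar.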

Clearly, for the students that are naturally good at English, 2013 was not a good year to have taken the CBSE exam – especially given that most colleges have cut-offs at above 95%.

We will be periodically re-visiting education data (marks, infrastructure, spending, etc.) to see what else we can learn – both about the Indian education system as well as data forensics.

The Mahabharatha in Pictures

At 1.8 million words, the Mahabharatha is one of the largest epics – roughly 10 times the size of the Iliad and Odyssey combined. At some level, this represents “big data”. Text is generally considered “unstructured” and therefore tough to analyse. But the growing field of text analytics and text visualisation tells us that there’s a lot more structure to plain text than one might think.

To begin with, a word cloud can tell us a lot about the story.

mahabharatha-wordcloud

The story is obviously about a battle between great kings and sons, with the principal characters being Arjuna, Pandu, Bhishma, Bharata, Karna, Duryodhana, Yudhishthira, Vaisampayana, etc. That’s decipherable without having to read the text.

The structure that we glean from it arises from a frequency distribution of the words – i.e. a count of which words occur how many times. The word cloud plots the words at a font size proportional to the frequency of occurrence. (Wordle is a good place to create word clouds.)

Now that we know who the principal characters are, the next questions are: where are they mentioned? Who are closely related? etc.
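The counting behind a word cloud is simple enough to sketch. The code below counts words in a text and scales a font size in proportion to each count. The stop-word list, the maximum font size and the sample sentence are our own assumptions for illustration, not what Wordle does internally.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "his", "with"}

def word_frequencies(text):
    """Count how often each word occurs, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)

def font_sizes(frequencies, max_size=72):
    """Scale font sizes linearly so the most frequent word gets `max_size`."""
    top = max(frequencies.values())
    return {word: max_size * count / top for word, count in frequencies.items()}

sample = "Arjuna and Karna fought. Arjuna praised Krishna, and Krishna praised Arjuna."
frequencies = word_frequencies(sample)
sizes = font_sizes(frequencies)
print(frequencies.most_common(3))
print(sizes["arjuna"], sizes["karna"])
```

A word that occurs three times as often is drawn three times as large, which is why the principal characters dominate the picture.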

Our Mahabharatha browser provides a simple interface to browse the full text of the Mahabharatha and find where the characters appear.

mahabharatha-mentions

The Mahabharatha is made of 18 books, each with several sections. This visualisation shows each section as a block (the length of the block is proportional to the size of the section). When you click on a character’s name, the positions in each section where they are mentioned are highlighted.

This makes it easy to see where characters speak together (e.g. where does Kunti throw away Karna? Where does she meet him again? Did Draupadi really love Karna before her wedding? Was Arjuna really her favourite? Whom does Krishna favour? etc.) By clicking on the section, you can read the full text of that section.
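Finding where to place the highlights comes down to locating each mention inside its section. Here is a minimal sketch that returns every mention’s position as a fraction of the section’s length, so it can be drawn on a block of any size. The section texts are made up, the `(book, section)` keys are a hypothetical layout, and this is not necessarily how the browser itself is built.

```python
import re

def mention_positions(section_text, names):
    """Return each mention's position as a fraction (0 to 1) of the section length."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    length = max(len(section_text), 1)
    return [match.start() / length for match in pattern.finditer(section_text)]

# Hypothetical sections keyed by (book, section); the text here is made up.
sections = {
    (1, 1): "Kunti spoke to Karna. Then Pritha wept.",
    (1, 2): "Bhishma addressed the court.",
}
kunti_aliases = ["Kunti", "Pritha"]  # a character is matched by any of her names

highlights = {
    key: mention_positions(text, kunti_aliases) for key, text in sections.items()
}
print(highlights)
```

Sections with an empty list get no highlight, so long stretches where a character is absent show up at a glance.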

The second question is, which characters are most closely related? Measuring closeness of characters is a difficult thing to do, even for humans. Fortunately, with text, we can rely on a proxy: how often are two characters found within a few words of each other.

If we take Draupadi as a benchmark character and check how often various people are mentioned within a few words of her, here’s what the picture looks like:

mahabharatha-draupadi-closeness

Each row has the name of the character (along with aliases). The first column shows the number of times they’re mentioned within 50 words of her. The next shows how many times they’re mentioned within 100 words of her. And so on. (All within the same section.)

A visual inspection suggests that many characters start fading off at a distance of 200 words, so perhaps 200 might be a reasonable boundary to consider. (This is arbitrary. But based on our subsequent analysis, we find that this parameter does not impact the visual result too much.)
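The closeness proxy can be sketched as follows. For one section, the code counts how many mentions of a character fall within a given number of words of any mention of the benchmark character. The distances of 50, 100, 150 and 200 words, the single-word alias matching and the sample text are all our assumptions for illustration.

```python
import re

def word_positions(text, names):
    """Return the word indexes at which any of the given (single-word) names appears."""
    wanted = {name.lower() for name in names}
    words = re.findall(r"[A-Za-z']+", text)
    return [i for i, word in enumerate(words) if word.lower() in wanted]

def mentions_within(text, anchor_names, other_names, distances=(50, 100, 150, 200)):
    """For each distance, count the mentions of `other_names` that lie within
    that many words of at least one mention of `anchor_names`."""
    anchors = word_positions(text, anchor_names)
    others = word_positions(text, other_names)
    return {
        d: sum(1 for o in others if any(abs(o - a) <= d for a in anchors))
        for d in distances
    }

# Made-up section: Arjuna appears 61 words and 162 words after Draupadi.
sample_text = "Draupadi " + "word " * 60 + "Arjuna " + "word " * 100 + "Arjuna"
closeness = mentions_within(sample_text, ["Draupadi"], ["Arjuna"])
print(closeness)
```

Running this per section and adding up the results would give one row of the table above. The counts can only grow as the distance widens, which is why the choice of boundary matters less once most mentions are already captured.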

By plotting a network of their closeness, one can get some insights about the structure of the tale.

mahabharatha-network

Yudhishthira is clearly at the centre of the plot. Arjuna, surprisingly, isn’t. Apart from his close relationship with Krishna and Bhishma, his interaction with other characters is not as well spread out (despite his popularity in the epic.) Contrary to popular opinion, Bhima is mentioned quite often, and is fairly well-networked. Nakula and Sahadeva remain peripheral characters. Gandhari is nearly outside of the network, except for her connection with her husband Dhritarashtra, sister-in-law Kunti, and brother-in-law Vidura (with whom she seems to converse much more than with her husband.)

Another way of looking at this picture is through a correlation matrix.

mahabharatha-matrix

This shows each pair of characters and the number of times they occur within 200 words of each other. The closeness between Nakula and Sahadeva is very obvious; so are Drona & Kripa; Dhritarashtra & Vidura; Arjuna & Krishna. Draupadi is mentioned with Dhrishtadhyumna more than anyone else.
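A matrix like this can be built by repeating the proximity count for every pair of characters. The sketch below assumes that “number of times” means the number of pairs of mentions, one per character, that lie within the window in the same section. That definition, the single-word name matching and the sample sections are our assumptions.

```python
import re
from itertools import combinations

def cooccurrence_matrix(sections, characters, window=200):
    """Count, for every pair of characters, the pairs of mentions that lie
    within `window` words of each other, section by section."""
    counts = {pair: 0 for pair in combinations(sorted(characters), 2)}
    for text in sections:
        words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
        positions = {
            name: [i for i, w in enumerate(words) if w == name.lower()]
            for name in characters
        }
        for first, second in counts:
            counts[(first, second)] += sum(
                1
                for a in positions[first]
                for b in positions[second]
                if abs(a - b) <= window
            )
    return counts

# Made-up sections for illustration.
sections = [
    "Nakula and Sahadeva rode out. Sahadeva greeted Nakula.",
    "Krishna advised Arjuna.",
]
matrix = cooccurrence_matrix(sections, ["Nakula", "Sahadeva", "Krishna", "Arjuna"])
for pair, count in matrix.items():
    print(pair, count)
```

Pairs that never share a section stay at zero, and those empty cells are what separate one group of characters from another in the picture.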

You can also see the blocks breaking up into clusters of sorts – on the bottom right are the primary characters. They interact a lot with each other. In the middle are secondary characters, who again interact amongst themselves; and then there are the narrators on the top left. This is in line with the Mahabharatha discussing several side-plots with secondary characters in parallel with the main plot. The story of Dhrishtadhyumna, of Satyaki, Nakula and Sahadeva’s conversations, etc. are examples of these. In fact, in a larger scatterplot, you can see many more tales emerge, such as Nala & Damayanti; Nahusha & Yayati; Uma & Daksha; Vasishta & Vishwamitra; Chitrasena & Vikarna; Virata & Uttara; Dhrishtadhyumna & Shikhandin; Parva & Sambhava; even Ravana & Vali.

If you are interested in seeing the full correlation matrix with all major and minor characters, please reach us at contact@gramener.com.

Comparing school performance

Continuing the design jams, we had one at Akshara’s office last weekend. The dataset we decided to pursue was the Karnataka SSLC results, which we had for 5 years.

We addressed two questions:

  1. How do Government schools perform when compared to private schools?
  2. How does the medium of instruction affect marks in different subjects?

When comparing Government and private schools, here’s the result.

govt-private-schools

Each box is a school. The size of the box represents the number of students from that school who appeared in the Class X exam. (Only schools with at least 60 students were considered.) The colour represents the average mark – red is low, and green is high.
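The numbers behind each box can be sketched as below: group the students by school, drop schools with fewer than 60 students, and turn each school’s average into a colour between red and green. The record layout, the function names and the linear colour scale are our assumptions, and the sample records are made up.

```python
from collections import defaultdict

def school_summary(records, min_students=60):
    """Aggregate (school, mark) records into {school: (student_count, average_mark)},
    keeping only schools with at least `min_students` students."""
    marks_by_school = defaultdict(list)
    for school, mark in records:
        marks_by_school[school].append(mark)
    return {
        school: (len(marks), sum(marks) / len(marks))
        for school, marks in marks_by_school.items()
        if len(marks) >= min_students
    }

def mark_colour(average, low=0.0, high=100.0):
    """Map an average mark to an RGB hex colour running from red (low) to green (high)."""
    share = min(max((average - low) / (high - low), 0.0), 1.0)
    red = round(255 * (1 - share))
    green = round(255 * share)
    return f"#{red:02x}{green:02x}00"

# Made-up records: (school name, percentage mark of one student).
records = [("School A", 40)] * 60 + [("School B", 90)] * 75 + [("School C", 70)] * 10
summary = school_summary(records)
colours = {school: mark_colour(avg) for school, (count, avg) in summary.items()}
print(summary)
print(colours)
```

The student count drives the size of the box and the colour comes from the average, so a large red box is a big school with low marks.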

What’s immediately obvious is that private schools perform much better on average than Government schools. What’s less clear is where this difference starts. The series of graphs below shows the number of schools at various mark ranges. The first shows schools with an average of 0 – 30%. The next, from 0 – 40%, and so on until 80%. Then it shows schools with an average of 30% – 100%. The next, from 40% – 100%, and so on until 80% – 100%.

bschool-00-30 bschool-00-40 bschool-00-50 bschool-00-60 bschool-00-70 bschool-00-80 bschool-30-100 bschool-40-100 bschool-50-100 bschool-60-100 bschool-70-100 bschool-80-100

From the first graph, you can see that there are about as many poor schools (average 0 – 30%) among private schools as among Government schools. But from the last graph, you can see that there are far more good private schools (average 80 – 100%) than Government schools.

So, there are poorly performing schools among the private schools as well. However, there are very few excellent Government schools.
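The comparison across mark ranges amounts to counting schools of each type within a band of averages. Here is a minimal sketch with made-up school averages. The data layout and function name are our own assumptions.

```python
def count_by_range(school_averages, low, high):
    """Count schools per type whose average mark lies in [low, high]."""
    counts = {}
    for school_type, average in school_averages:
        if low <= average <= high:
            counts[school_type] = counts.get(school_type, 0) + 1
    return counts

# Made-up (school type, average percentage) pairs.
school_averages = [
    ("Government", 25), ("Private", 28), ("Government", 55),
    ("Private", 62), ("Private", 85), ("Private", 91), ("Government", 82),
]

# Ranges growing from the bottom (0-30 ... 0-80), then shrinking
# towards the top (30-100 ... 80-100).
ranges = [(0, top) for top in range(30, 90, 10)]
ranges += [(bottom, 100) for bottom in range(30, 90, 10)]

for low, high in ranges:
    print(f"{low}-{high}%:", count_by_range(school_averages, low, high))
```

Reading the first and last rows side by side gives the same kind of contrast as the first and last graphs: similar counts at the bottom, very different counts at the top.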

We compared the impact of medium of instruction against the subjects as well. The table below shows boxes for each subject taken under each medium of instruction. The size of the box represents the number of students taking that combination. The colour indicates the average mark (red is low, green is high.)

subject-medium

Clearly, Sanskrit is a high-scoring language. (At least one person at the design jam chose Sanskrit for this very reason.) Kannada scores well too – especially as a first or third language; but not as well as a second language.

On average, English medium students have the highest marks, followed by Kannada medium students. Students studying in other mediums of instruction perform poorly in most subjects barring their language.

There’s clearly a strong correlation between the medium and the subject. Kannada medium students score high in Kannada, Urdu medium students score high in Urdu, and so on. But while English medium students do score high in English, they tend to score much better at Kannada, Urdu and Sanskrit!

You can explore these results at http://gramener/karnatakamarks/