Comparing school performance

Continuing the design jams, we had one at Akshara’s office last weekend. The dataset we decided to pursue was the Karnataka SSLC results, which we had for the 5 years.

We addressed two questions:

  1. How do Government schools perform when compared to private schools?
  2. How does the medium of instruction affect marks in different subjects?

When comparing Government and private schools, here’s the result.

govt-private-schools

Each box is a school. The size of the box represents the number of students from that school who appeared in the Class X exam. (Only schools with at least 60 students were considered.) The colour represents the average mark – red is low, and green is high.

What’s immediately obvious is that private schools perform much better on average than Government schools, what’s less clear is when this difference starts. The series of graphs below show the number of schools at various mark ranges. The first shows schools with an average of 0 – 30%. The next, from 0 – 40%, and so on until 80%. Then it shows schools with an average of 30% – 100%. The next, from 40% – 100%, and so on until 80% – 100%.

bschool-00-30bschool-00-40bschool-00-50bschool-00-60bschool-00-70bschool-00-80bschool-30-100bschool-40-100bschool-50-100bschool-60-100bschool-70-100bschool-80-100

From the first graph, you can see that there are as many poor schools (average 0 – 30%) among the private and Government schools. But from the last graph, you can see that there are far more good private schools (average 80 – 100%) than Government schools.

So, there are poor performing schools among the private schools as well. However, there are very few excellent Government schools.

We compared the impact of medium of instruction against the subjects as well. The table below shows boxes for each subject taken under each medium of instruction. The size of the box represents the number of students taking that combination. The colour indicates the average mark (red is low, green is high.)

subject-medium

Clearly, Sanksrit is a high scoring language. (At least one person at the design jam chose Sanskrit for this very reason.) Kannada scores well too – especially as a first or third language; but not as well as a second language.

On average, English medium students have the highest marks, followed by Kannada medium students. Students studying other in mediums of instruction perform poorly in most subjects barring their language.

There’s clearly a strong correlation between the medium and the subject. Kannada medium students score high in Kannada, Urdu medium students shore high in Urdu, and so on. But while English medium students do score high in English, they tend to score much better at Kannada, Urdu and Sanskrit!

You can explore these results at http://gramener/karnatakamarks/

Student browsing patterns

This is a guest post by Rahul Gonsalves of Pixelogue.

About a week ago, Anand suggested that we spend a day some weekend working collaboratively on data visualisation. I jumped at the chance to spend a day working and learning from him and this is how we found ourselves at the Gramener office on a Sunday morning.

We decided to look at a dataset that Anand has blogged about before – computer usage of MSIT students at CIHL, a consortium of universities based out of IIIT, Hyderabad. Over a period of seven weeks, students’ computer usage was tracked. The data includes application usage and duration, internet browsing patterns, and even keystrokes, broken down by user. If this data sounds like a privacy landmine, that’s because it is! The only consolation is that all the students involved in the study consented to have their usage tracked, and so were presumably aware of what was happening.

We decided to look at a subset of this data – at their internet usage and to try and answer the following question: What websites do people browse at different times of day? Are there interesting patterns that emerge? Do “social” websites constitute a significant portion of their browsing time? etc.

We created an interactive visualisation, as well as an Excel based one. The interactive version is available at http://gramener.com/siteusage/

On Excel, the variables at our disposal included:

  1. User
  2. URL
  3. Time of browsing

We pulled the data into Excel, and had the following table:

excel-1

We then split up the time values in Excel into their component pieces (hour and minute), so that 22-11-2011 10:19 becomes:

excel-2

You can see the raw data and the formulas used in the following screenshot:

excel-3

We combined the hour into a value which we called “Minute of the Day”, which is merely a numeral value of the minute from 12AM. 1am is 60, 2am is 120, 3am is 180 and so forth.

We then used a pivot table to plot the domain accessed by frequency, which allowed us to generate the top 10 most accessed domains (Facebook, unsurprisingly was 2nd, right behind a local address 10.10.10.68, which is presumably a development server.)

excel-4

We arranged the domains on the horizontal axis, with the hour of day listed on the y-axis, as below:

At this point in time, Anand pulls out his Excel magic, and pulls in the number of times within that hour that a particular domain was accessed. COUNTIFS looks counts the number of times the domain was accessed at that particular minute. IFERROR ensures that errors are counted as zeroes. (This formula works only in Excel 2007 and later.)

excel-6

The results of applying this particular formula across the whole table is given below:

excel-7

Using the conditional formatting tools, we are able to apply a colour scale that changes the cell background colour — a darker green implies a higher frequency while a lighter colour implies a lower incidence at that point in time.

excel-8

The extreme preponderance of the top hit (the local dev server, 10.10.10.68) led to a not very useful visualisation, with only the highest values being marked out.

excel-9

Using a logarithmetic scale helps give a better heatmap, as can be seen in the following screenshots.

excel-a

We finally arrived at the following heatmap, which offers some insights into the ways that the students at this particular course spent their time.

excel-b

We talked about different ways of depicting this data, which resulted in the following interactive visualization of the way a student spends his or her time on an average day in Hyderabad. We hope you enjoy it!

Visualising Sensex returns in Excel

You don’t need specialised data visualisation software. Excel is a powerful tool in itself. For example, you could re-create and extend our When to Invest visualisations in Excel. Here’s an example:

bse-sensex-returns-visualisation

Each row indicates a month in which you’ve made an investment. The columns indicate how long you’ve held the investment for. For example, the 10th row, 10th column shows the returns you’d make if you invested in the 10th month (Oct 1991) and held it for 10 months (until Aug 1992). Red indicates negative returns, green indicates positive returns.

Here’s a tutorial on how you could do this.

First, let’s get the data

Let’s start by downloading the Sensex data. Visit the BSE Historical Indices page, choose the SENSEX and select a monthly feed for Jan 1991.

bse-indices

Click submit to see the results for the year 1991.

bse-sensex-monthly

Copy and paste the table into Excel.

Now, repeat this for each year from 1991 to 2012, and you’ll have the full database of monthly Sensex prices. (It’s possible to scrape, copying and pasting 21 pages is faster.)

bse-excel-data

Next, let’s calculate the returns

To find the returns after 1 month, 2 months, etc, we can use the OFFSET formula:

offset-formula

OFFSET($B2, $C1, 0) says:

  • Start with cell B2 (and when I copy and paste, always get me data from column B)
  • Move down C1 rows (and when I copy and paste, always get the offset from row 1)
  • Move right 0 columns

As we copy and paste the formula, we’ll get the opening price 1 month, 2 months, 3 months after the starting month. (The 1, 2, 3… months come from the first row).

Now, we need to calculate the annualised return from this price. That’s done by the following formula:

=POWER(OFFSET($B2,C$1,0)/$B2, 12/C$1)-1

return-formula

Put another way, this is the same as:

=POWER( Future_Amount / Initial_Amount, 12/Number_of_months) – 1

Copying and pasting this formula to all the cells gives you a sheet that shows the returns for every period.

returns

Lastly, let’s colour it.

If you want to see what the returns look like as a gradient, select the cells, go to Conditional Formatting – Color Scales and choose one of the scales there.

conditional-formatting

You may then want to tweak the colours for each value. Go to any of the cells, then select Conditional Formatting – Manage Rules, and double-click the “Graded Color Scale” rule. Then you can change what Red, Yellow and Green stand for. (In the example below, they’ve been set to –50%, 0% and +50% returns respectively.)

formats

This will give you the visualisation below when you zoom out.

bse-sensex-returns-visualisation

Another way of looking at the same data is to only show those periods where the return was more than, say, the average inflation rate (let’s say 11%).

formats2

… this is what the result will look like.

bse-sensex-returns-inflation

We’re not trying to endorse Excel as a product: We admire all spreadsheets. Excel is merely our abbreviation for “Excel and other spreadsheets”. The same principles would apply equally well to Open Office, Google Docs, or any other spreadsheet of your choice.

There’s a good chance Excel is already on your machine. You’re likely to be familiar with it. Why bother learning a new tool, or a programming language, when you’re already familiar with something?

The point is, most tools can visualise well. But you need to have the imagination.