What do people actually do when they say they do Data Science?

By Bhanu Kamapantula

On a daily basis, I notice a lot of misconceptions about the field of data science. Much of this is due to the ever-evolving mechanics of the field. I will skip the standard definitions of Data Science (think mathematics, statistics, machine learning, software), since others have covered them well enough (see the Links section below), and will instead touch upon different aspects of data science.

1) Engineering

Engineering involves setting up the infrastructure, tools and products.

a) Infrastructure

Infrastructure refers to setting up servers and configuring them for optimal use, and setting up version control and access controls. This is not vastly different from the existing technology infrastructure setup at software firms.

b) Tools

Establishing a tools pipeline early on will ease product development. By tools, I mean the material that continuously helps your team develop rapidly. These include the design approach (+ supporting software), choice of text editor, default programming language (+ supporting software), version control, and testing practices. Notice that some of this can be automated.

c) Products

Product development evolves alongside solutioning. Once the solution is identified (at this stage it is very much a prototype and very glitchy), it has to be productionized (used by clients or opened up for public use, with minimal glitches).

2) Solutioning

The type of problem, of course, varies from scenario to scenario. Predicting the weather for the next 4 days is different from classifying types of flowers, which is again different from creating a dashboard to consume sales data across regions.

Designing a solution involves understanding business requirements, designing the workflow, creating a development pipeline, software testing, and productionizing.

By solution, I refer to an output that is consumed by an end user (business clients, general audience). The exact solutioning will vary across projects. A solution that handles, say, 100 MB of data isn't necessarily suitable for 5 GB of data. A solution that handles flat files (CSV) won't work when the data must be fetched from remote databases. Bring in authentication, authorization, and security, and at the very least you will have a lot of variation in solutions.
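As a rough illustration of why data size changes the solution, the sketch below totals sales per region by streaming a CSV one row at a time instead of loading the whole file into memory. The column names `region` and `amount` and the sample rows are made up for the example; a real project would have its own schema.

```python
import csv
import io
from collections import defaultdict


def total_by_region(lines):
    """Sum the 'amount' column per 'region', reading one row at a time.

    `lines` can be any iterable of text lines (an open file, a stream),
    so only one row is held in memory at once, whatever the input size.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


# Hypothetical sample standing in for a large sales file.
sample = io.StringIO(
    "region,amount\n"
    "north,120.5\n"
    "south,80\n"
    "north,30\n"
)
result = total_by_region(sample)
print(result)
```

With a real file, an open file handle (for instance `open("sales.csv", newline="")`, a hypothetical path) could be passed in place of `sample`.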

A machine learning (ML) solution would use a trained model and run it against real-time data (think real-time Tweet classification). This solution requires identifying an ML algorithm, creating a feature space (from Twitter data, after gathering, cleaning, and analyzing it), and training an ML model, which is then used on the real-time data.
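The steps above can be sketched in a few lines. This is a minimal sketch, not a production approach: it assumes a bag-of-words feature space and a hand-rolled multinomial Naive Bayes classifier (one possible algorithm choice among many), and the labelled "tweets" and the `positive`/`negative` labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (the feature space)."""
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_texts):
    """Train a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training texts
    vocabulary = set()
    for text, label in labelled_texts:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the most probable label for a new, unseen text."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_texts)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labelled training data (already gathered and cleaned).
training_data = [
    ("love this phone, great battery", "positive"),
    ("what a great day", "positive"),
    ("terrible service, very slow", "negative"),
    ("I hate waiting, awful delay", "negative"),
]
model = train(training_data)

# Stand-in for tweets arriving in real time.
incoming = ["great phone", "awful service"]
predictions = [classify(model, tweet) for tweet in incoming]
print(predictions)
```

In practice the training set would be far larger, and a library implementation would likely replace the hand-written classifier; the point here is only the shape of the pipeline: features, training, then classification of new data.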

Given the variation, not every solution can be a product. If its features are abstracted enough to work well for multiple scenarios, you have a product.

3) Communication

Much of the communication here, as in any other field, occurs through internal reports, talks, and seminars.

If you are an academic, you will share the results in a journal, conference, or workshop publication.

If you are in a corporate environment, you will likely share the results with clients in a web view (dashboard) or a slide deck.

If you are an independent investigator (e.g., a citizen data scientist), you will likely share the analysis via a blog post.

All approaches will involve carefully created visuals to narrate the story and results. It is critical to remember that the audience in each of the cases is different. Trained academics write publications for other trained academics. Corporate workers create results for business clients. Citizen data scientists write analyses for journalists and other investigators.

Image credit: Udacity [e]

In short, a variety of skills are useful to a data scientist: data (gathering, cleaning, analysis, visualization), modeling (machine learning algorithms), statistics (understanding causality, accuracy metrics), and software engineering (efficiency, quality assurance). How much these skills overlap in one individual will depend on the team size (an individual vs a small, medium, or large organization). The breadth of the skills (software engineering, data processing, ML, etc.) is as important as the depth (understanding the mathematics behind ML algorithms or different statistical techniques), and both can only be improved over time.

If you are working on any of the cogs above, you are contributing to the wheel of data science. Don’t let anyone tell you otherwise.


Links

a) Trey Causey, Getting started in data science

b) University of Wisconsin, What do data scientists do?

c) KDNuggets, What is Data Science, and What Does a Data Scientist Do?

d) Hillary Mason, What is a Data Scientist?

e) Udacity, Data Science Job Skills

Data Wrangling: What, How and Why

Gramener’s CEO Anand and Senior Data Scientist Kathirmani along with the Upgrad team are running a workshop this Sunday (23 Oct, 11:30am) at Koramangala, Bangalore. The topic is “Data Wrangling – What, How and Why?”


Link: http://events.upgrad.com/da/workshop/blr23oct

Over 90% of data science is about cleaning data – the process of loading, correcting and preparing the data for analysis. This is a tedious process. But what does it involve? What are the tricks of the trade? What tools and techniques make this easier?

Data is messy. (What a surprise.)
Data is messy. (If it surprises you, your career will be messy.)

Once we have the data ready, the toughest part is asking the right questions. Exploratory data analysis is about playing with datasets in a structured way to extract as many of the important insights as possible in a given time. What are effective EDA techniques? Is there a structure to this?

This talk is ideal for people interested in learning about data analysis, as well as analysts who are looking to improve their data wrangling skills. Join us!

To be, or not to be – and other questions


Today’s post-millennial generation doesn’t waste time mulling over something. Why give your brain a workout when the world’s opinion on everything, from the most trivial, nonsensical query to the most hypothetical, is a ‘search’ away? Here’s a look at some of the popular ‘Should I…’ questions Indians have been asking on Google. The most popular ‘Should I’ questions in the India region have mostly been related to technology and gadgets: ‘Should I remove it?’ (software), ‘Should I buy iPhone 5s’, ‘Should I upgrade to Windows 10’, ‘Should I update to iOS 9’ and so on.

There are also unhappy employees seeking the wisdom of the crowd by asking ‘Should I quit my job?’, and then there are the strange ones: ‘Should I marry him?’ and ‘Should I carry an umbrella today?’ Or how about this popular one: ‘Should I die?’ One can only hope the page 1 search results for this one offer some sensible advice. Pity Hamlet didn’t know about google.dk – it could have saved the Prince of Denmark a sombre soliloquy.


To see what else the world is searching for, head out here.