Talking Movies – ‘OK’, how about Talking Data!!

By Amit Pishe

As children, most of us were at some point pushed or encouraged to take part in a storytelling competition. A story had key characters, and its words described even inanimate things vividly enough to create a picture or a sense of motion in our minds. Stories were used to teach and to learn.

Not much has changed today: the same ‘storytelling’ concept is now applied to data and datasets, in what we call narratives or data narratives.

Big Data is the buzzword. However, visualising and explaining an entire dashboard can be challenging, and not every user needs all the information displayed.

Data narratives come in handy for passing on all the relevant information in a form that is succinct and easy to grasp. Narratives are essentially talking data for data-driven stories: they highlight insights, trends and unique patterns, explore the factors shaping the data, and provide crisp, concise information. In a more advanced form, they are embedded in a visualization dashboard to describe the characteristics of the data.

Leveraging the Data Narratives and writing one (in general):

  1. Understand the overall data components (metrics and dimensions) and map each of them individually to weave a storyline for a given chart or dashboard. The storyline is the central theme of the narration. Always ask yourself why you are considering that particular data component.
  2. A storyline can be told in multiple ways. Pick one version and compare how it scores against the rest. Are all the data points included?
  3. The flow of information and insight should transition smoothly throughout the story, without ending abruptly.
  4. Keep the target audience in mind (BFSI, healthcare, media, telecom, etc.) and provide information accordingly.
  5. Get to know the inputs, the analysis of the data, the result orientation and, finally, the conclusion.

Sample template/use case (investment banking): a report for a business user (Racey), prepared by an India-based mutual fund investment banking firm.

Wealth Management Report

Date: dd-mm-yyyy

Dear Racey,

Information about your portfolio

a. Portfolio update: last quarter vs. current month growth/loss. Net worth at the end of the financial year. Peer rating compared to other investors. Benchmarking of the performance.

b. Equity portfolio: information on investment returns, fund index and NIFTY. Types of investment across segments (mid cap, advantage funds, MIP and so on). Display a chart comparing CRISIL, NIFTY and current portfolio returns.

Similar investors at our bank have earned a return of 8.3%. The CRISIL Composite Bond Fund Index has given a return of 5.4%. NIFTY has given a return of 2.0%. Your portfolio has given a return of 15.8%, i.e. on par with or better than similar investors at our bank, the CRISIL Composite Bond Fund Index and NIFTY.

c. Fixed income portfolio details: your fixed income investments (~1% of total portfolio) are mostly in intermediate-term funds. This investment is driven by short-term plan growth investment options.

d. Your Portfolio description through charts spread across segments

Your equity investments (99% of total portfolio) are mostly in large cap growth oriented funds. Your portfolio is dominated largely by Canara Robeco Treasury Advantage Fund – Institutional Plan- Growth option, UTI – Treasury Advantage Fund – Institutional-Growth and HSBC MIP – Savings – Growth.

e. Suggestion on investment, based on the average cash balance in your savings account (~15% of savings into xxx plan): I would suggest investing in Pure Value fund growth options.

For the detailed report, please refer to www.Investmentfirm.co.in/racey/portfolio.htm

Gist: In the sample, the data components are underlined (Portfolio, Equity portfolio, Fixed income portfolio and so on). The full dashboard would contain many other charts; the key data components provide a holistic view without going into too much detail.

  1. The main summary highlights how the user’s portfolio is performing over a variety of time periods, provides benchmarking information and analyses the user’s saving patterns.
  2. Based on the savings, it suggests an investment as the conclusion.
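
The benchmark comparison in the sample report can be produced from the underlying numbers with a small template function. The following is only a minimal sketch of that idea: the function names (`join_names`, `benchmark_narrative`) and the exact wording of the sentences are assumptions made for illustration, not a description of any particular narrative tool.

```python
def join_names(names):
    """Join names as 'A, B and C' for use inside a sentence."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def benchmark_narrative(portfolio_return, benchmarks):
    """Turn a portfolio return and benchmark returns (in percent) into text.

    `benchmarks` maps a benchmark name to its return. Both the names and
    the sentence templates are hypothetical.
    """
    sentences = [
        f"{name} returned {value:.1f}%." for name, value in benchmarks.items()
    ]
    # Split the benchmarks into those the portfolio matched or beat,
    # and those it fell short of.
    beaten = [n for n, v in benchmarks.items() if portfolio_return >= v]
    lagging = [n for n, v in benchmarks.items() if portfolio_return < v]

    summary = f"Your portfolio returned {portfolio_return:.1f}%"
    if beaten:
        summary += ", i.e. on par with or better than " + join_names(beaten)
    if lagging:
        connector = ", but" if beaten else ", i.e."
        summary += connector + " below " + join_names(lagging)
    sentences.append(summary + ".")
    return " ".join(sentences)


# Figures taken from the sample report above.
print(benchmark_narrative(15.8, {
    "Similar investors at our bank": 8.3,
    "CRISIL Composite Bond Fund Index": 5.4,
    "NIFTY": 2.0,
}))
```

Keeping the numbers separate from the sentence templates means the same storyline can be regenerated for another investor or another period by changing only the inputs.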

What do people actually do when they say they do Data Science?

By Bhanu Kamapantula

On a daily basis, I notice a lot of misconceptions about the field of data science. Much of this is due to the ever-evolving mechanics of the field. I will skip the standard definitions of data science (think mathematics, statistics, machine learning, software), since others have covered them well enough (see the Links section below), and will instead touch upon different aspects of data science.

1) Engineering

Engineering involves setting up the infrastructure, tools and products.

a) Infrastructure

Infrastructure refers to setting up servers and configuring them for optimal use, along with setting up version control and access controls. This is not vastly different from the existing technology infrastructure setup at software firms.

b) Tools

Establishing a tools pipeline early on will ease product development. By tools, I mean whatever continuously helps your team develop rapidly. These include the design approach (+ supporting software), your choice of text editor, the default programming language (+ supporting software), version control and testing practices. Notice that some of this can be automated.

c) Products

Product development evolves together with solutioning. Once a solution is identified (at this stage it is very much a prototype and very glitchy), it has to be productionized (used by clients or opened up for public use, with minimal glitches).

2) Solutioning

The type of problem, of course, varies from scenario to scenario. Predicting the weather for the next 4 days is different from classifying types of flowers, which in turn is different from creating a dashboard to view sales across regions.

Designing a solution involves understanding business requirements, designing the workflow, creating a development pipeline, software testing, and productionizing.

By solution, I mean an output that is consumed by an end user (business clients, a general audience). The exact solutioning will vary across projects. A solution that handles, say, 100 MB of data isn’t necessarily suitable for 5 GB of data. A solution that reads flat files (CSV) won’t work as is when the data has to be fetched from remote databases. Bring in authentication, authorization and security, at the least, and you will have a lot of variation in solutions.
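
One way to see why data size changes the solution: code that loads a whole file into memory may be fine at 100 MB and fail at 5 GB, while code that processes one row at a time keeps memory use roughly constant. The sketch below illustrates that design choice with Python's standard `csv` module. The column names (`region`, `sales`) and the function name `total_by_key` are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict


def total_by_key(rows, key_column, value_column):
    """Sum `value_column` per `key_column`, reading one row at a time.

    `rows` can be any iterable of dict-like rows, so only the running
    totals are held in memory, never the whole dataset.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)


# A tiny in-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("region,sales\nNorth,100.5\nSouth,40\nNorth,9.5\n")
print(total_by_key(csv.DictReader(sample), "region", "sales"))
# {'North': 110.0, 'South': 40.0}
```

Because the function only expects an iterable of rows, the same aggregation could in principle be fed from another source that yields dict-like rows, which is one small way of abstracting a solution away from flat files.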

A machine learning (ML) solution would train a model and then apply it to real-time data (think real-time Tweet classification). This requires identifying an ML algorithm, creating a feature space (from Twitter data, after data gathering, cleaning and analysis) and training an ML model, which is then used to classify the real-time data.

Given the variation, not every solution can be a product. If features are abstracted enough to work well for multiple scenarios, you have a product.
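
The train-then-classify workflow can be sketched in a few lines. The example below uses a small Naive Bayes text classifier written in plain Python as a stand-in; the post does not name an algorithm, so this choice, the function names (`tokenize`, `train`, `classify`) and the tiny labelled dataset are all assumptions made for illustration. A real solution would more likely rely on an established ML library and far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Feature space: lower-cased words split on whitespace."""
    return text.lower().split()


def train(labelled_texts):
    """Count labels and per-label words from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_texts:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the most probable label under multinomial Naive Bayes."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled tweets, standing in for gathered and cleaned data.
labelled_tweets = [
    ("great match today", "sports"),
    ("team won the game", "sports"),
    ("new phone launch today", "tech"),
    ("software update for the phone", "tech"),
]
model = train(labelled_tweets)

# Simulated "real-time" tweets arriving after training.
for tweet in ["the team won", "phone software launch"]:
    print(tweet, "->", classify(tweet, model))
```

The model is trained once and then reused for every incoming tweet, which is the separation between the training step and real-time use described above.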

3) Communication

Much of the communication here, as in any other field, occurs through internal reports, talks and seminars.

If you are an academic, you will share the results as part of a journal, conference or workshop publication.

If you are in a corporate environment, you will likely share the results with clients in a web view (dashboard) or a slide deck.

If you are an independent investigator (e.g. a citizen data scientist), you will likely share the analysis via a blog post.

All approaches will involve carefully created visuals to narrate the story and results. It is critical to remember that the audience in each of the cases is different. Trained academics write publications for other trained academics. Corporate workers create results for business clients. Citizen data scientists write analyses for journalists and other investigators.

Image credit: Udacity [e]
In short, a variety of skills are useful to a data scientist: data (gathering, cleaning, analysis, visualization), modeling (machine learning algorithms), statistics (understanding causality, accuracy metrics) and software engineering (efficiency, quality assurance). The overlap of these skills in an individual will depend on the team size (individual vs. small, medium or large organization). The breadth of the skills (software engineering, data processing, ML, etc.) is as important as their depth (understanding the mathematics behind ML algorithms or different statistical techniques), and both can only be built up over time.

If you are working on any of the cogs above, you are contributing to the wheel of data science. Don’t let anyone tell you otherwise.

Links:

a) Trey Causey, Getting started in data science

b) University of Wisconsin, What do data scientists do?

c) KDNuggets, What is Data Science, and What Does a Data Scientist Do?

d) Hillary Mason, What is a Data Scientist?

e) Udacity, Data Science Job Skills

Gramener at PyConf Hyd, 2017

Gramener’s CEO, S Anand, is the keynote speaker at PyConf Hyderabad 2017 on 8th October. Anand will talk about code re-use in Python in a session titled ‘Don’t Repeat Yourself - Adventures in re-use’.

PyConf Hyderabad is the regional gathering for the community that uses and develops the open-source Python programming language, and is hosted by the Hyderabad Python Users Group.