Hear our Chief Data Scientist at Strata+Hadoop conference

Gramener’s Chief Data Scientist, S Anand, will be speaking at the Strata + Hadoop conference, presented by O’Reilly and Cloudera, which will be held in London during 5-7 May 2015.

In his talk, “Visualizing the world’s largest democratic exercise”, Anand will share how Gramener analyzed and visualized the 2014 Indian general elections for CNN and bing.com.


Data science news

Data Visualization: The Next Frontier for Big Data 

In a recent Business2Community column, Howard University Marketing Professor Angela Hausman says data visualization tools can make descriptive and predictive analytics more accessible.
“… regardless of the tool, data visualization, at its best, should uncover new patterns of relationships not visible to the naked eye,” Hausman writes. “Hence, the key to effective data visualization is the ability to capture patterns and relationships in clean, simple visuals that allow the signal to stand out from noise contained in the data.”

How businesses can benefit from visual analytics 

Enterprises seek innovative techniques that help them draw attention to key messages and allow them to make informed business decisions in complex situations. Visual analytics is one such method that allows decision makers to gain insight into complex problems. It simplifies data values, makes them easy to understand, and helps enterprises communicate important messages and insights that would otherwise be difficult to grasp without deep technical expertise.

The practice of presenting information visually is nothing new, and the industry has witnessed a steady progression in techniques over the years: starting with simple hand-drawn charts and tables, followed by spreadsheets that gave rise to graphs such as bar graphs, pie charts, and line graphs.
Visual analytics uses data visualization techniques such as 3-D scatter plots, network graphs, interactive bar graphs, and animated sequence assemblies. These visualizations are driven by code and can be built on most programming platforms. Visual analytics can be used across any industry, as it helps capture changes in a business environment in real time and helps top management take the best-suited business decisions.

How CFOs Benefit from a BI Visualization Tool 

Appearances matter. The way that data is presented affects our capacity to understand it and influences how quickly and easily the information and insights it provides can be used to inform our decision-making.

This helps finance professionals maximize benefits:
• in their software investments
• in their role as users of finance data
• as providers of finance data to other parts of the business
• as holders of the purse strings for budgeting and investment in IT resources

Why should I care about data visualization? “By visualizing information, we turn it into a landscape that you can explore with your eyes, a sort of information map. And when you’re lost in information, an information map is kind of useful,” says David McCandless, an author and information designer. Simply put, data visualization is a better way of displaying the information that you and others are gathering. Instead of looking at a long sheet of numbers, it adds visual meaning to the data, whether that means highlighting areas on a map, creating intuitive charts, or showcasing trends via interactive graphs.

How Data Visualization Improves PR Communications 

Visualization can help public relations professionals communicate data more clearly and effectively. Endless rows and columns on spreadsheets are far too difficult to grasp for normal human beings. Text explaining data is often messy, unclear and usually boring.

Creating visuals to depict data helps an audience understand the numbers faster and better. With well-designed visuals, the audience can grasp insights that were not obvious to them before and incorporate those insights in their decision-making.

Consumers, journalists and other PR audiences are inundated with information. PR communications that include data visualizations and infographics stand out and rise above text-only articles and posts. A well-designed graphic can prompt an editor to publish your press release rather than a competitor’s.
With the rise of “big data,” data visualizations are more useful than ever. However, a hastily produced image won’t suffice and may even misinform or confuse viewers. Careful research and design are crucial for developing a visual that is eye-catching, informative and factually accurate.

Predictive Modelling of Stakeholder Behaviors Using Past (Large) Datasets

Definitions

‘Stakeholder’ here refers to consumers, voters, patients, and subjects of any heterogeneous or homogeneous group; in short, groups of people who interact with a business or institution.

Overview

Gramener’s corporate customers store rich data about their stakeholders, be they consumers, patients and subjects, shareholders, or voters and survey participants. Their past behavior and actions contain plenty of information: preferences for certain products, triggers that led them to switch to a competitor’s offering, or motivations to become an experimental subject in the lifecycle of drug discovery. In this document, Gramener explains how it used past data to predict the behavior of a corporate conglomerate’s shareholders.

Note 1: To protect the identity of this customer, Gramener refrains from stating the exact nature of their business, while preserving the essence of the predictive analysis.

Note 2: The principles used here are equally applicable to predicting the behavior of stakeholders in any business, irrespective of industry, size or geography.

Intended Audience: This write-up makes references to data modeling and statistical methods. Familiarity with these concepts is not a prerequisite to appreciate the essence of the message. However, readers who would like to know the details of these techniques may refer to Appendix A, ‘Project Details & Execution’.

Business Case

A US-based business conglomerate welcomes their shareholders to participate in corporate decision making. This is achieved by shareholders voting to express their opinions on various issues. The conglomerate would like to use past data to predict the voting behavior of these shareholders. A high voting percentage reflects the level of engagement these shareholders have with the company.

Past Data is the key resource

This customer has plenty of information on its shareholders’ past behavior: terabytes of data with hundreds of columns and billions of rows. They would like to predict and influence the voting percentage by analyzing this past data. Gramener, with its analytics and visualization solutions, donned the mantle of both consultant and doer for this predictive exercise.

Approach in brief

The approach can be split into five broad headers:

Consolidating currently recognized variables & data


A good predictive model needs to recognize, in a holistic way, all the variables that influence the problem. Hence, creating an exhaustive list of influencing variables is a critical first step in studying the problem at hand.

Supplementing extrinsic variables as additional influencers

When starting a predictive model, it is very important to look beyond the currently known variables for any other large influencers. With this in mind, and in discussion with the customer, Gramener synthesized relevant extrinsic data, which was added as additional columns alongside the known variables. For example, though the customer knew details at the shareholder level, comparable data for the industry and competitive landscape was scraped from the internet and included.

Dimensionality reduction

While considering all influencing variables was critical, doing so pushed the problem’s dimensionality beyond manageable levels, with over 600 variables. Retaining only those variables whose influence on the voting percentage exceeded a threshold reduced the problem to a manageable size. These impactful variables were identified and prioritized by striking a balance between analytical techniques and inputs from the business user.
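As a rough illustration of this kind of threshold screening (the DataFrame, column names and the correlation-based filter below are illustrative assumptions, not the engagement’s actual method, which is outlined in Appendix A), a crude first pass might look like:

```python
import pandas as pd

def shortlist_variables(df: pd.DataFrame, target: str = "voting_pct",
                        threshold: float = 0.1) -> list[str]:
    """Keep only variables whose absolute correlation with the target
    exceeds a threshold; a crude first-pass filter over ~600 candidates."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()
```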

Data Modelling & Choice of Algorithms

While there are many time-tested statistical models for predictive modelling, and a natural choice would have been a ‘classical’ administration of such an algorithm, Gramener tempered the choice of algorithms with domain knowledge, customer preferences and practical considerations. Details of this approach are provided in Appendix A.

Simplified consumption

The outputs were implemented using Gramener’s visualization engine, which produced visual outputs for uniform consumption and action across the board.

Inherent Challenges & Gramener’s value add

Selecting the right tools & methods when too many variables influence outcomes

At the start, Gramener had to deal with over 600 variables. The magnitude of each variable’s impact on the outcome was not uniform, and no pattern could be gleaned by merely inspecting the variables. Separating out the statistically significant variables was not easy given the volume of data.

The analytical techniques used helped reduce the number of variables judiciously during the preprocessing and elimination stages. A misstep at this stage would have muted the impact of some important variables, leading to unintended consequences and wrong conclusions.

The right proportion of analytical techniques and business inputs

Over-dependence on existing business biases or over-reliance on analytical techniques could both have led to wrong outcomes.

Striking the right balance was critical: starting from existing business intuition and strengthening it with guidance from analytical methods. Experience with similar projects, the team’s expertise in both the descriptive and predictive aspects of data, and the ability to tweak analytical models based on business considerations all helped achieve this balance.

Keeping data models relevant to the business scenario

Rather than going by the variables and data alone, the emphasis was on understanding the customer’s motivation for undertaking the predictive modelling.

The customer would not have been able to act on some recommendations irrespective of what the statistical model suggested, so these limitations were built into the algorithms. This ensured that the levers identified to influence the desired outcomes did not lack practical grounding. For example, communication channels such as snail mail, which were traditionally unreliable, were eliminated from the algorithm even though the collected data contained a lot of information about them.

Key Conclusion

From a set of past data, Gramener’s methodology helped this company arrive at plausible actions and practical insights for predicting the target metric, the voting percentage in this case. Delivering the outputs as visuals further helped the teams consume these insights without loss in translation.

General Conclusion

This is an example of Gramener’s work on how predictive models and algorithms can be used to understand and influence stakeholder behavior from large data sets. These same techniques are applicable in many business scenarios to predict the likely:

  • Enrollment of subjects in a clinical trial
  • Adoption of a new product by consumers
  • Churn of loyal customers from a telecom network
  • Impact of a new advertisement campaign on brand loyalists

Appendix A

Project Details & Execution

Further details on the project execution are provided here for readers familiar with big data and statistical vocabulary. The execution approach can be split into the following phases.

Application of known & extrinsic variables from varied data sources

Structured Data – Variables and associated data existed in multiple data sources within the company. Assimilating this data from the various sources was the first step in understanding the problem better. With terabytes of data and billions of rows, the fastest way to process it was with automated queries. This reduced human-intervention errors and sped up the first-cut analysis, which helped in understanding the problem’s dimensions better: the number of variables, the need for data cleansing, the data volumes, and so on.

Unstructured Data – External data from unstructured sources was collated and appended to the structured data taken from the corporate database. This brought new dimensions to the analysis and increased the possibility of a holistic approach to generating insights. (For example, publicly available competitor shareholder information was brought in to be used alongside existing data.)
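A minimal sketch of this assimilation step, assuming hypothetical database, table, file and column names (the actual pipeline ran automated queries against the customer’s own sources):

```python
import sqlite3
import pandas as pd

# Hypothetical internal database; table and column names are illustrative
conn = sqlite3.connect("shareholders.db")
shareholders = pd.read_sql("SELECT * FROM shareholder_master", conn)
votes = pd.read_sql("SELECT * FROM voting_history", conn)

# Combine internal sources on a shared key
df = shareholders.merge(votes, on="shareholder_id", how="left")

# Append external data scraped earlier (e.g., competitor shareholder info)
external = pd.read_csv("competitor_landscape.csv")
df = df.merge(external, on="industry_code", how="left")
```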

Dimensionality reduction through Modularized Variable Pre-processing

The problem extended to over 600 variables and their associated data, making it very difficult to manage and untenable for any meaningful processing. The dimensionality had to be reduced while ensuring all the impactful variables and associated data remained in consideration. Modularizing and pairing the different variables made the analysis quick and repeatable for identifying the impactful factors.

Quantitative measures: Techniques such as a group-means / 90th-percentile methodology and multivariate analysis were used to test each variable’s impact on the sensitivity of the identified target metric, the voting percentage. This class of techniques was found to be less taxing on the hardware yet very effective in quantifying the magnitude of each variable’s impact on the target metric.

Significance testing measures: Statistical tests such as the t-test and the Wald test were then used to rule out variables whose apparent influence was random. This clearly established the real significance of each variable on the target metric.
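A minimal sketch of how a 90th-percentile group-means split and a t-test can work together to screen variables, assuming a DataFrame of numeric candidate variables plus the target (all names and thresholds below are hypothetical; the engagement’s exact methodology is only summarized above):

```python
import pandas as pd
from scipy import stats

def screen_variables(df: pd.DataFrame, target: str = "voting_pct",
                     alpha: float = 0.05, min_effect: float = 2.0) -> list[str]:
    """Keep variables whose 90th-percentile split shows a meaningful and
    statistically significant difference in the target's group means."""
    keep = []
    for col in df.columns.drop(target):
        # Split rows into high/low groups at the variable's 90th percentile
        cutoff = df[col].quantile(0.90)
        high = df.loc[df[col] >= cutoff, target]
        low = df.loc[df[col] < cutoff, target]
        # Magnitude of impact: difference in group means of the target
        effect = abs(high.mean() - low.mean())
        # Significance: Welch's t-test to rule out random influence
        _, p_value = stats.ttest_ind(high, low, equal_var=False)
        if effect >= min_effect and p_value < alpha:
            keep.append(col)
    return keep
```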

Choice of Data Models and Algorithms: Constructing the predictive model

Preprocessing and delineation of impactful variables reduced the dimensions of the problem, so advanced analytics could now be applied to the most meaningful set of variables.

The two main purposes for data model building were:

  1. Cluster all the voters into common logical groups based on profiles
  2. Prescribe actionable ways to predict and improve voter participation

Data models were constructed for the voters involved, clustering them into groups based on their participation outcomes. Shareholders with similar traits leading to low or high voting participation could then be targeted to make marketing campaigns more effective. For example, shareholders from a certain geography, income group and education level, holding a low number of outstanding shares, had generally voted less frequently.
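One common way to realize such profile-based clustering, sketched here with scikit-learn (the feature names, the choice of k-means and the five segments are assumptions for illustration; the source does not name the clustering method used, and the features are assumed to be already numerically encoded):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical, numerically encoded profile features for each shareholder
profile = df[["income_band", "education_level", "shares_outstanding"]]

# Scale features so no single variable dominates the distance metric
scaled = StandardScaler().fit_transform(profile)

# Group shareholders into a handful of behavioral segments
df["segment"] = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(scaled)

# Profile each segment by its historical voting participation
print(df.groupby("segment")["voting_pct"].mean().sort_values())
```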

Data models thus helped identify levers that could improve voting participation. For example, giving shareholders more days to vote was predicted to improve their participation rate from 18% to 22%.

For predictive modelling, decision trees were chosen for being transparent in their working and visual in their outputs. However, the decision tree algorithms were not administered as-is: a judicious choice of split levels tweaked the algorithm to suit the problem at hand. For example, some split levels were ignored to accommodate practical considerations; lead times (number of days) below certain values were discarded despite their statistical significance, since it was not practical for communication to reach the voters within those lead times.
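One simple way to encode such a lead-time constraint, sketched here with scikit-learn (the feature names, the five-day floor and the tree settings are all hypothetical, and the other features are assumed to be numerically encoded), is to floor the feature before fitting, so the tree can never split below the feasible value:

```python
from sklearn.tree import DecisionTreeClassifier

MIN_FEASIBLE_LEAD_TIME = 5  # hypothetical floor: mail cannot arrive faster

X = df[["lead_time_days", "channel_code", "segment"]].copy()
# Clip lead times at the feasible floor so no split can fall below it
X["lead_time_days"] = X["lead_time_days"].clip(lower=MIN_FEASIBLE_LEAD_TIME)
y = df["voted"]  # hypothetical binary outcome: did the shareholder vote?

# A shallow tree keeps the output transparent and easy to visualize
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=100).fit(X, y)
```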

Outputs as Visuals for easy consumption: Exploratory and Interactive

The most impactful insights from the analysis were condensed into visuals. From a set of past data, there was now a set of visually consumable outputs covering actions and insights. Because the visual representation made the impact of each action on the predicted voting percentage easy to understand, it led to meaningful, action-oriented discussions among the customer teams, who now shared a common mental model.

Visuals were also exploratory: apart from predictive modelling, Gramener’s visual outputs helped all users, irrespective of skill level, to better explore how each variable influenced the outcome. For example, sending a communication on Tuesdays had the maximum impact on the voting percentage; this was not intuitively known before the analysis. When it was represented visually, it became evident to all and led to further exploration of the impact of other business days on the outcome.

Interactive decision trees: The decision trees were mines of insights, which were converted into interactive web links. Users could interact with these links to explore the contexts they were particularly interested in. For example, one user may be interested in the impact of a communication channel, while another may want to see the impact of a geographic cluster of voters.
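Continuing the sketch above, the raw branch structure that such an interactive front end would expose can be extracted with scikit-learn’s export_text (Gramener’s actual interactive links were built on its own visualization engine; this only shows the kind of structure involved):

```python
from sklearn.tree import export_text

# Dump the fitted tree's split structure; an interactive front end can then
# render each branch as a drill-down link (channel, geography, lead time, ...)
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```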