Stakeholders here refers to consumers, voters, patients and subjects of any group, heterogeneous or homogeneous: in short, the people who interact with a business or institution.
Gramener’s corporate customers store rich data about their stakeholders, whether these are consumers, patients or trial subjects, shareholders, or voters and survey participants. The past behavior and actions of these stakeholders contain plenty of information about their preferences for certain products, the triggers that led them to switch to a competitor’s offering, or their motivations for becoming an experimental subject in the lifecycle of drug discovery. In this document Gramener explains how it used past data to predict the behavior of a corporate conglomerate’s shareholders.
Note 1: To protect the identity of this customer, Gramener is refraining from stating the exact nature of their business while ensuring the essence of the predictive analysis is preserved.
Note 2: It is important to note that principles used here are equally applicable in predicting the behavior of stakeholders in any nature of business irrespective of industry, size or geography.
Intended Audience: This write-up refers to data modeling and statistical methods. Familiarity with these concepts is not a prerequisite for appreciating the essence of the message in the document. However, readers who would like to know the details of these techniques may refer to Appendix A – ‘Project Details & Execution’.
A US-based business conglomerate invites its shareholders to participate in corporate decision making. Shareholders do this by voting to make their opinions known on various issues. The conglomerate would like to use past data to predict the voting behavior of these shareholders. A high voting percentage indicates a high level of shareholder engagement with the company.
Past Data is the key resource
This customer holds extensive information on its shareholders’ past behavior: terabytes of data with hundreds of columns and billions of rows. It would like to predict and influence the voting percentages by analyzing this past data. Gramener, with its analysis and visualization solutions, took on the role of both consultant and implementer for this predictive exercise.
Approach in brief
The approach can be split into four broad headers
Consolidating currently recognized variables & data
A good predictive model needs to recognize all the variables which influence the problem in a holistic way. Hence, creating an exhaustive list of influencing variables is a critical first step to study the problem at hand.
Supplementing extrinsic variables as additional influencers
While starting a predictive model it is very important to look beyond the currently known variables to see any other large influencers. With this in mind and in discussion with the customer Gramener synthesized extrinsic and relevant data which were added to the columns to be considered along with known variables. For example, though the customer knew details at a shareholder level, similar data for the industry and competition landscape were scraped from the internet to be included.
While considering all influencing variables was critical, doing so increased the dimensions of the problem beyond manageable levels, to over 600 variables. Isolating the variables that influence the voting percentage above a threshold level reduced the problem to a manageable size. These impactful variables were identified and prioritized by striking a balance between analytical techniques and inputs from the business user.
Data Modelling & Choice of Algorithms
There are many time-tested statistical models for predictive modelling, and a natural choice would have been a ‘classical’ application of one such algorithm. Instead, Gramener moderated the selection of algorithms with domain knowledge, customer preferences and practical considerations. Details of this approach are provided in Appendix A.
The outputs were implemented using Gramener’s visualization engine which produced visual outputs for uniform consumption and action across the board.
Inherent Challenges & Gramener’s value add
Selection of right tools & methods while too many variables influence outcomes
At the start, Gramener had to deal with over 600 variables. The magnitude of each variable’s impact on the outcome is not uniform, nor can a pattern be gleaned merely by studying the variables. Isolating the variables that are statistically significant was not easy considering the volume of data.
Analytical techniques used helped to reduce the number of variables judiciously during the preprocessing and elimination stages. A misstep at this preprocessing stage would have muted the impact of some important variables leading to unintended consequences and wrongful conclusions.
Right proportions of analytical techniques and business inputs
Over-dependence on existing business biases and over-reliance on analytical techniques could each have led to wrong outcomes.
The right balance was critical: starting from existing business beliefs and testing and strengthening them with guidance from analytical methods. Experience with similar projects, the team’s expertise in both descriptive and predictive aspects of data, and the ability to tweak analytical models based on business considerations all helped to achieve this balance.
Keeping data models relevant to the business scenario
Rather than going by variables and data alone, the emphasis was on understanding the customer’s motivation for doing the predictive modelling.
The customer would not have been able to act on some recommendations irrespective of what the statistical model suggested, and hence these limitations were built into the algorithms in advance. This ensured that the levers identified for influencing the desired outcomes were practical to use. For example, communication channels such as postal (“snail”) mail, which were traditionally unreliable, were eliminated from the algorithm even though the collected data held a lot of information about them.
From a set of past data, Gramener’s methodology helped this company understand plausible actions and practical insights on predicting the outcome of the target metric – voting % in this case. The fact that outputs given were visuals further helped the teams to consume these insights without any loss in translation.
This is an example of Gramener’s work on how predictive models and algorithms can be used to understand and influence stakeholder behavior from large data sets. These same techniques are applicable in many business scenarios to predict the likely:
- Enrollment of subjects in a clinical trial
- Adoption of a new product by consumers
- Churn of loyal customers from a telecom network
- Impact of a new advertisement campaign on brand loyalists
Appendix A – Project Details & Execution
Further details on the project execution are provided here for readers who are familiar with Big Data and statistical vocabulary. The execution approach can be split into phases:
Application of known & extrinsic variables from Varied data sources
Structured Data – Variables and associated data existed in multiple data sources within the company. Assimilating this data from the various sources was the first step in understanding the problem better. With terabytes of data and billions of rows, the fastest way to process it was with automated queries. This reduced errors from human intervention and also sped up the first-cut analysis, which helped to understand the problem’s dimensions better: the number of variables, the need for data cleansing, data volumes and so on.
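As an illustration only, a first-cut profile of this kind can be sketched in a few lines of Python. The source does not describe the actual queries or tooling, so the function name, the column names and the sample extract below are all hypothetical. The sketch streams over the rows once and reports the row count, the column count and the number of blank values per column, which is a rough indicator of how much cleansing is needed.

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Stream over dict rows once; return row count, column count, blanks per column."""
    row_count = 0
    missing = Counter()
    columns = []
    for row in rows:
        if not columns:
            columns = list(row.keys())
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {
        "rows": row_count,
        "columns": len(columns),
        "missing": {name: missing[name] for name in columns},
    }


# Hypothetical extract; these column names are illustrative, not from the project.
SAMPLE = """shareholder_id,region,shares_held,voted
1,NE,120,1
2,,40,0
3,SW,,0
4,NE,300,1
"""

profile = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(profile)
```

Because the function only iterates, the same logic could be pointed at a file or a database cursor instead of the in-memory sample; at terabyte scale the equivalent counts would more likely be pushed down into the database itself.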
Unstructured Data – External data from unstructured sources was collated and appended to the structured data taken from the corporate database. This helped to bring new dimensions in analyzing the data and increased the possibility of a holistic approach for generating insights. (For example: publicly available competitor shareholder information was brought in to be used along with existing data)
Dimensionality reduction through Modularized Variable Pre-processing
The problem extended to over 600 variables and their associated data. This made it very difficult to manage and untenable for any meaningful processing. The dimensions had to be reduced while ensuring that all the impactful variables and associated data remained in consideration. Modularizing and pairing the different variables made the analysis quick and repeatable for identifying all the impactful factors.
Quantitative Measures: Group Means – 90 percentile methodology and multivariate analysis were among the techniques used to test how sensitive the identified target metric, voting %, was to each variable. This class of techniques was found to be less taxing on the hardware yet very effective in quantifying the magnitude of each variable’s impact on the target metric.
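A group-means screen of this kind might look roughly like the sketch below. It is not Gramener’s implementation: the source does not define the 90 percentile methodology, so the sketch covers only plain group means, and the record layout, the variable names and the 0.2 threshold are assumptions made for illustration. For each candidate variable it computes the mean voting rate per level and keeps the variable only if the gap between the highest and lowest group mean reaches the threshold.

```python
from collections import defaultdict


def group_mean_spread(records, variable, target="voted"):
    """Return (spread, means): the gap between the highest and lowest group mean."""
    totals = defaultdict(lambda: [0.0, 0])  # level -> [sum of target, count]
    for rec in records:
        bucket = totals[rec[variable]]
        bucket[0] += rec[target]
        bucket[1] += 1
    means = {level: s / n for level, (s, n) in totals.items()}
    return max(means.values()) - min(means.values()), means


def screen_variables(records, variables, threshold, target="voted"):
    """Keep variables whose group-mean spread meets the threshold, largest first."""
    kept = []
    for name in variables:
        spread, _ = group_mean_spread(records, name, target)
        if spread >= threshold:
            kept.append((name, spread))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical shareholder records (voted: 1 = voted, 0 = did not vote).
records = [
    {"channel": "email", "region": "NE", "voted": 1},
    {"channel": "email", "region": "SW", "voted": 1},
    {"channel": "email", "region": "NE", "voted": 1},
    {"channel": "email", "region": "SW", "voted": 0},
    {"channel": "phone", "region": "NE", "voted": 0},
    {"channel": "phone", "region": "SW", "voted": 0},
    {"channel": "phone", "region": "NE", "voted": 0},
    {"channel": "phone", "region": "SW", "voted": 1},
]

kept = screen_variables(records, ["channel", "region"], threshold=0.2)
print(kept)
```

In this toy data the voting rate differs sharply by channel but not by region, so only the channel variable survives the screen. A single pass of sums and counts per level is cheap, which is consistent with the observation that these techniques are light on hardware.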
Significance testing measures: Tests such as the t-test and the Wald test were then used to separate a variable’s real effect from random variation. This established the statistical significance of each variable’s effect on the target metric.
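To make the idea concrete, the sketch below shows one common form of such a test: a two-sided Wald test for the difference between two voting rates, using the unpooled standard error and a normal approximation. The source does not say which variant was used or how it was set up, so the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_test_proportions(votes_a, n_a, votes_b, n_b):
    """Two-sided Wald test for a difference in voting rates between two groups."""
    p_a = votes_a / n_a
    p_b = votes_b / n_b
    # Unpooled standard error of the difference; zero if both rates are 0 or 1.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    # Normal approximation, reasonable for the large samples described here.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 220 of 1,000 shareholders voted in group A, 180 of 1,000 in group B.
z, p_value = wald_test_proportions(220, 1000, 180, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value suggests that the gap between the two groups is unlikely to be random variation alone. With hundreds of candidate variables, some correction for multiple comparisons would normally be applied as well; that step is not shown.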
Choice of Data Models and Algorithms: Constructing the predictive model
Preprocessing and delineation of impactful variables helped to reduce the dimensions of the problem and now advanced analytics could be done on the most meaningful set of variables.
The two main purposes for data model building were:
- Cluster all the voters into common logical groups based on profiles
- Prescribe actionable ways to predict and improve voter participation
Data models were constructed for each of the voters involved. This helped to cluster the voters into groups based on the outcome of their participation. Shareholders with similar traits leading to low or high voting participation could now be targeted to make marketing campaigns more effective. For example, shareholders belonging to certain geographies, income groups and education levels, and holding a low number of outstanding shares, had generally voted less frequently.
Data models thus helped in identifying the levers that may improve voting participation. For example, giving shareholders more days in which to vote would improve their participation rate from 18% to 22%.
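When reading a figure like this it helps to separate the absolute gain from the relative gain: a move from 18% to 22% is a gain of 4 percentage points, which is a relative increase of roughly 22%. The small helper below, whose name is illustrative, makes the arithmetic explicit.

```python
def lift(before, after):
    """Return (percentage-point gain, relative gain in %) between two rates."""
    points = (after - before) * 100
    relative = (after - before) / before * 100
    return points, relative


points, relative = lift(0.18, 0.22)
print(f"{points:.1f} percentage points, {relative:.1f}% relative increase")
```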
For predictive modelling, Decision Trees were found to be transparent in their working and visual in their outputs. However, decision tree algorithms were not applied as-is. A judicious choice of split levels helped to tweak the algorithm to suit the problem at hand. For example, some split levels were ignored to accommodate practical considerations: lead times (number of days) below certain split levels were discarded despite their statistical significance, since it was not practical to have communication reach the voters within these lead times.
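The effect of such a constraint can be sketched with a single split on one variable. This is a simplified illustration, not the project’s algorithm: it assumes a binary voted/not-voted outcome, uses Gini impurity as the split criterion, and treats the 7-day minimum practical lead time and the data as hypothetical. Candidate thresholds below the minimum are skipped even if they would score better.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(lead_days, voted, min_practical_days):
    """Best threshold t (rule: days < t vs days >= t) among t >= min_practical_days.

    Returns (threshold, weighted_gini), or None if no candidate is allowed.
    """
    pairs = list(zip(lead_days, voted))
    values = sorted(set(lead_days))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between adjacent observed values
        if threshold < min_practical_days:
            continue  # may be statistically attractive, but not actionable
        left = [v for d, v in pairs if d < threshold]
        right = [v for d, v in pairs if d >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: lead time in days and whether the shareholder voted.
days = [2, 2, 3, 3, 10, 10, 14, 14, 21, 21]
voted = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

unconstrained = best_split(days, voted, min_practical_days=0)
constrained = best_split(days, voted, min_practical_days=7)
print(unconstrained, constrained)
```

In this toy data the purest split falls at a lead time too short to act on, so the constrained search settles for a less pure but usable threshold. A full tree would apply the same filter at every node and across all variables.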
Outputs as Visuals for easy consumption: Exploratory and Interactive
The most impactful insights from the analysis were condensed into visuals. From a set of past data, now we had a set of visually consumable outputs as actions & insights. The impact of these actions on the predictability of voting % was clear since they were easily understood by all due to visual representation. This led to meaningful action oriented discussions among the customer teams since mental models were common.
Visuals were also exploratory: Apart from predictive modelling, Gramener’s visual outputs helped all users, irrespective of their skill levels, to explore how each variable influenced the outcome. For example, sending a communication on Tuesdays had the maximum impact on the voting %, which was not something that was intuitively known before the analysis. When this was visually represented, it became evident to all and led to further exploration of the impact of other business days on the outcome.
Interactive decision trees: The decision trees were mines of insight and were converted into interactive web links. Users could interact with these links for the contexts they were particularly interested in. For example, one user may be interested in knowing the impact of a communication channel, while another may want to see the impact of a geographic cluster of voters.