Predictive Modelling Of Stakeholder Behaviors Using Past (Large) Datasets


Stakeholder here refers to consumers, voters, patients and subjects of any heterogeneous or homogeneous group: in short, groups of people who interact with a business or institution.


Gramener’s corporate customers store rich data about their stakeholders – be they consumers, patients/subjects, shareholders or voters/participants of a survey. Their past behavior and actions contain plenty of information about their preferences for certain products, the triggers that led them to switch to a competitor’s offering, or their motivations to become an experimental subject in the lifecycle of drug discovery. In this document Gramener explains how it has used past data to predict the behavior of a corporate conglomerate’s shareholders.

Note 1: To protect the identity of this customer, Gramener is refraining from stating the exact nature of their business while ensuring the essence of the predictive analysis is preserved.

Note 2: The principles used here are equally applicable to predicting the behavior of stakeholders in any business, irrespective of industry, size or geography.

Intended Audience: This write-up refers to data modeling and statistical methods. Familiarity with these concepts is not a prerequisite to appreciate the essence of the message. However, readers who would like to know the details of these techniques may refer to Appendix A – ‘Project Details & Execution’.

Business Case

A US-based business conglomerate welcomes its shareholders to participate in corporate decision making; shareholders vote to express their opinion on various issues. The conglomerate would like to use past data to predict the voting behavior of these shareholders. A high voting percentage represents the level of engagement shareholders have with the company.

Past Data is the key resource

This customer has plenty of information on its shareholders’ past behavior: terabytes of data with hundreds of columns and billions of rows. They would like to predict and influence voting percentages by analyzing this past data. Gramener, with its analytics and visualization solutions, donned the mantle of both consultant and doer for this predictive exercise.

Approach in brief

The approach can be split into five broad headers:

Consolidating currently recognized variables & data


A good predictive model needs to recognize all the variables which influence the problem in a holistic way. Hence, creating an exhaustive list of influencing variables is a critical first step in studying the problem at hand.

Supplementing extrinsic variables as additional influencers

While starting a predictive model it is very important to look beyond the currently known variables for any other large influencers. With this in mind, and in discussion with the customer, Gramener synthesized extrinsic and relevant data which was added as columns alongside the known variables. For example, though the customer knew details at a shareholder level, similar data for the industry and competitive landscape was scraped from the internet and included.

Dimensionality reduction

While considering all influencing variables was critical, this increased the dimensions of the problem beyond manageable levels, with over 600 variables. Delineating those variables which influence the voting percentage above a threshold level helped reduce the problem to a manageable size. Prioritizing and identifying these impactful variables was done by striking a balance between analytical techniques and inputs from the business user.

Data Modelling & Choice of Algorithms

While there are many time-tested statistical models for predictive modelling, and a natural choice would have been a ‘classical’ administration of such an algorithm, Gramener tempered the choice with business and practical considerations: the selection of algorithms was moderated by domain knowledge, customer preferences and practical constraints. Details of this approach are provided in Appendix A.

Simplified consumption

The outputs were implemented using Gramener’s visualization engine, which produced visual outputs for uniform consumption and action across the board.

Inherent Challenges & Gramener’s value add

Selection of the right tools & methods when too many variables influence outcomes

At the start, Gramener had to deal with 600 variables. Neither is the magnitude of each variable’s impact on the outcome uniform, nor can a pattern be gleaned by merely studying the variables. Segregating the statistically significant variables was not easy given the volume of data.

The analytical techniques used helped reduce the number of variables judiciously during the preprocessing and elimination stages. A misstep at this preprocessing stage would have muted the impact of some important variables, leading to unintended consequences and wrong conclusions.

The right proportions of analytical techniques and business inputs

Over-dependency on existing business biases or over-reliance on analytical techniques could both have led to wrong outcomes.

The right balance was critical: using existing business intuition and strengthening it with guidance from analytical methods. Experience of doing similar projects, the team’s expertise in both descriptive and predictive aspects of data, and the ability to tweak analytical models based on business considerations all helped achieve this balance.

Keeping data models relevant to the business scenario

Rather than going by variables and data alone, the emphasis was on understanding the customer’s motivation in doing the predictive modelling.

The customer would not have been able to act on some recommendations irrespective of what the statistical model suggested, and hence these limitations were pre-built into the algorithms. This helped ensure that the levers identified to influence the desired outcomes did not lack practical considerations. For example, communication channels like snail mail, which were traditionally unreliable, were eliminated from the algorithm despite the collected data containing a lot of information about snail-mail channels.

Key Conclusion

From a set of past data, Gramener’s methodology helped this company understand plausible actions and practical insights for predicting the outcome of the target metric – voting % in this case. The fact that the outputs were visuals further helped the teams consume these insights without any loss in translation.

General Conclusion

This is an example of Gramener’s work on how predictive models and algorithms can be used to understand and influence stakeholder behavior from large data sets. The same techniques are applicable in many business scenarios to predict the likely:

  • Enrollment of subjects in a clinical trial
  • Adoption of a new product by consumers
  • Churn of loyal customers from a telecom network
  • Impact of a new advertisement campaign on brand loyalists

Appendix A

Project Details & Execution

Further details on the project execution are provided here, meant for readers who are familiar with Big Data and statistical vocabularies. The execution approach can be split into phases:

Application of known & extrinsic variables from varied data sources

Structured Data – Variables and associated data existed in multiple data sources within the company. Assimilating this data from the various sources was the first step in understanding the problem better. With terabytes of data and billions of rows, the fastest way to process it was with automated queries. This reduced human-intervention errors and also sped up the first-cut analysis, which helped in understanding the problem’s dimensions better – the number of variables, the need for data cleansing, data volumes, etc.
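As a toy illustration of such automated profiling queries (a sketch only – the table, columns and data here are invented, not the customer’s actual schema), a first-cut summary might compute participation rates per year directly in SQL:

```python
import sqlite3

# Tiny in-memory stand-in for one corporate source
# (table and column names are purely illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (shareholder_id INTEGER, year INTEGER, voted INTEGER)")
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)",
                 [(1, 2013, 1), (1, 2014, 0), (2, 2013, 1), (2, 2014, 1)])

# A first-cut automated profiling query: participation rate per year,
# the kind of summary that sizes the problem before modelling.
rows = conn.execute(
    "SELECT year, AVG(voted) FROM votes GROUP BY year ORDER BY year"
).fetchall()
print(rows)  # [(2013, 1.0), (2014, 0.5)]
```

In production, the same pattern runs as scheduled queries against the warehouse rather than ad-hoc scripts, which is what removes the manual-intervention errors.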

Unstructured Data – External data from unstructured sources was collated and appended to the structured data taken from the corporate database. This brought new dimensions to the analysis and increased the possibility of a holistic approach to generating insights. (For example, publicly available competitor shareholder information was brought in to be used alongside existing data.)
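The appending step amounts to joining the scraped external features onto the internal records by a shared key. A minimal sketch (the column names, the `segment` key and the values are all invented for illustration):

```python
import pandas as pd

# Internal shareholder data (columns illustrative)
internal = pd.DataFrame({
    "shareholder_id": [1, 2, 3],
    "shares": [120, 15, 600],
    "segment": ["retail", "retail", "institutional"],
})

# Externally collated industry-level data, keyed on the same segment
external = pd.DataFrame({
    "segment": ["retail", "institutional"],
    "industry_avg_turnout": [0.31, 0.74],
})

# Left join keeps every internal record and appends the new dimension
combined = internal.merge(external, on="segment", how="left")
print(combined.columns.tolist())
# ['shareholder_id', 'shares', 'segment', 'industry_avg_turnout']
```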

Dimensionality reduction through Modularized Variable Pre-processing

The problem dimension extended to over 600 variables and associated data. This made the problem very difficult to manage and untenable for any meaningful processing. The dimensions had to be reduced while ensuring all the impactful variables and associated data remained in consideration. Modularizing and pairing the different variables made the analysis quick and repeatable for identifying all the impactful factors.
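One such repeatable pre-processing module can be sketched as a simple correlation screen (an illustration only – the variable names, the 0.1 cutoff and the synthetic data are assumptions, not the project’s actual screen):

```python
import numpy as np

def screen_variables(X, y, names, threshold=0.1):
    """Keep only variables whose absolute correlation with the target
    exceeds a threshold - one cheap, repeatable screening module."""
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            kept.append(name)
    return kept

# Synthetic stand-in data: one influential variable, one pure noise
rng = np.random.default_rng(0)
age = rng.normal(50, 10, 500)
noise = rng.normal(0, 1, 500)
voting_pct = 0.02 * age + rng.normal(0, 0.5, 500)
X = np.column_stack([age, noise])
print(screen_variables(X, voting_pct, ["age", "noise"]))
```

Running the same module over each pair or group of the 600-odd variables is what makes the elimination stage both quick and auditable.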

Quantitative Measures: The group means – 90th percentile methodology and multivariate analysis were some of the techniques used to test the impact of a variable on the sensitivity of the identified target metric – voting %. This class of techniques was found less taxing on the hardware yet very effective in quantifying the magnitude of each variable’s impact on the target metric.
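A group-means comparison of this flavor can be sketched as follows (a simplified stand-in, with invented synthetic data – the actual methodology compared richer groupings than this single cut):

```python
import numpy as np

def group_mean_impact(x, y):
    """Difference in the target metric's mean above vs below a
    variable's 90th percentile - a cheap proxy for impact magnitude."""
    cut = np.percentile(x, 90)
    return float(y[x >= cut].mean() - y[x < cut].mean())

# Synthetic stand-in: voting propensity rises with shares held
rng = np.random.default_rng(1)
shares = rng.exponential(100, 1000)
voting_pct = (0.2 + 0.4 * (shares > np.percentile(shares, 80))
              + rng.normal(0, 0.05, 1000))
print(round(group_mean_impact(shares, voting_pct), 2))
```

Because it needs only group means rather than a full model fit, this style of measure stays cheap on hardware even at billions of rows.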

Significance testing measures: Measures like the t-test, Wald test, etc. were further used to delineate the random influence of a variable. This clearly established the real significance of each variable on the target metric.
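As a minimal sketch of such a test (Welch’s two-sample t statistic, hand-rolled here for self-containment; the “early vs late contact” scenario and the data are invented):

```python
import math
import random

def welch_t(a, b):
    """Welch's two-sample t statistic - used to separate a real
    variable effect from random noise (simplified illustration)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

random.seed(2)
# Synthetic voting % for shareholders contacted early vs late
early = [random.gauss(0.40, 0.05) for _ in range(200)]
late = [random.gauss(0.30, 0.05) for _ in range(200)]
print(round(welch_t(early, late), 1))
```

A large t value (relative to the appropriate t distribution) indicates the difference is unlikely to be random, which is how spurious variables were screened out.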

Choice of Data Models and Algorithms: Constructing the predictive model

Preprocessing and delineation of impactful variables helped to reduce the dimensions of the problem, and advanced analytics could now be done on the most meaningful set of variables.

The two main purposes of data model building were:

  1. Cluster all the voters into common logical groups based on profiles
  2. Prescribe actionable ways to predict and improve voter participation

Data models were constructed for each of the voters involved. This helped cluster the voters into groups based on the outcome of their participation. Shareholders with similar traits leading to low or high voting participation could now be targeted to make marketing campaigns more effective. For example, shareholders belonging to a certain geography, income group and education level, with a low number of outstanding shares, had generally voted less frequently.
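The clustering step can be sketched with a minimal k-means (Lloyd’s algorithm) over two-feature profiles. This is an illustrative stand-in with invented data, not the production clustering, which worked over far more profile dimensions:

```python
import random

def kmeans(points, centers, iters=10):
    """Minimal k-means (Lloyd's algorithm) over tuples of features."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            j = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append(p)
        # move each center to the mean of its assigned points
        centers = [tuple(sum(d) / len(g) for d in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

random.seed(3)
# Synthetic profiles: (normalized shares held, past participation rate)
low = [(random.random() * 0.2, random.random() * 0.3) for _ in range(50)]
high = [(0.7 + random.random() * 0.3, 0.6 + random.random() * 0.4) for _ in range(50)]
points = low + high
centers, groups = kmeans(points, [points[0], points[-1]])
print([len(g) for g in groups])  # two groups of 50
```

Each resulting group is a profile cluster – e.g. low-holding, low-participation shareholders – that a campaign can target as a unit.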

Data models thus helped identify levers which may improve voting participation. For example, giving shareholders more days to vote would improve their participation rate from 18% to 22%.

For predictive modelling, decision trees were found to be transparent in their working and visual in their outputs. However, the decision tree algorithms were not administered as-is. A judicious choice of split levels helped tweak the algorithm to suit the problem at hand. For example, some split levels were ignored to accommodate practical considerations – lead times (# of days) below certain split levels were discarded despite their statistical significance, since it was not practical to have communication reach the voters within those lead times.
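Constraining splits to feasible values can be sketched as follows. The `best_split` helper, the 5-day feasibility floor and the tiny data set are all invented for illustration – the point is only that candidate thresholds below the operational minimum are never considered, regardless of their statistical merit:

```python
def best_split(lead_times, voted, min_feasible=5):
    """Pick the lead-time threshold (by weighted Gini impurity) that
    best separates voters from non-voters, restricted to thresholds
    at or above the operationally feasible minimum."""
    def impurity(th):
        left = [v for t, v in zip(lead_times, voted) if t < th]
        right = [v for t, v in zip(lead_times, voted) if t >= th]
        def gini(g):
            if not g:
                return 0.0
            p = sum(g) / len(g)
            return 2 * p * (1 - p)
        n = len(voted)
        return (len(left) * gini(left) + len(right) * gini(right)) / n
    candidates = sorted({t for t in lead_times if t >= min_feasible})
    return min(candidates, key=impurity)

# Lead time in days before the vote, and whether the shareholder voted
lead_times = [2, 3, 4, 6, 7, 8, 10, 12, 14, 15]
voted =      [0, 0, 0, 0, 0, 1,  1,  1,  1,  1]
print(best_split(lead_times, voted))  # 8
```

The same idea carries over to library implementations: infeasible features or thresholds are filtered out before (or after) the tree is grown, so every branch corresponds to an action the business can actually take.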

Outputs as Visuals for easy consumption: Exploratory and Interactive

The most impactful insights from the analysis were condensed into visuals. From a set of past data, we now had a set of visually consumable outputs as actions & insights. The impact of these actions on the predictability of voting % was clear, since the visual representation made them easily understood by all. This led to meaningful, action-oriented discussions among the customer teams, since mental models were common.

Visuals were also exploratory: Apart from predictive modelling, Gramener’s visual outputs helped users of all skill levels explore how each variable influenced the outcome. For example, sending a communication on Tuesdays had the maximum impact on the voting % – something not intuitively known before the analysis. When this was visually represented, it became evident to all and led to further exploration of the impact of other business days on the outcome.

Interactive decision trees: The decision trees were mines of insights, which were converted into interactive web links. Users could interact with these links for the contexts they were particularly interested in. For example, one user may be interested in the impact of a communication channel, while another may want to see the impact of a geographic cluster of voters.
