
Data and Model Governance in the Day of Data Science and AI


Written by Sri Krishnamurthy

SUMMARY

Data-driven decision making has enabled companies to harness significant value from data in all aspects of business. Chief data officers (CDOs) leading digital transformation efforts are seeing the importance of data, from collection to processing to learning insights from datasets. With innovations in hardware, cloud, and algorithms, and the availability of large datasets and analytical tools, the adoption of data science, artificial intelligence (AI), and machine learning (ML) is exploding. In addition, the AI/ML revolution has provided an impetus to ensure that the collection, processing, storage, and governance of data are robust and well-managed within the organization.

Recognizing the value of data to the enterprise, organizations have scaled up data governance efforts to identify, address, and mitigate risks associated with data. As more and more decisions become automated, AI/ML models, which rely heavily on data, face challenges associated with interpretability, bias, explainability, fairness, and model governance. Most organizations undergoing digital transformation recognize that data governance and AI/ML model governance are heavily intertwined, but continue to treat them as two separate silos.

In this article, we seek to bring clarity to some of the data and model governance challenges that arise when adopting data science and AI/ML processes in the enterprise. As the role and importance of the CDO evolve within an organization, it is essential to recognize how data-driven and model-driven methods are changing traditional business practices. We believe a holistic data and model governance framework is needed to successfully adopt AI/ML within the enterprise and to plan for a future where data-driven decision making plays a key role in the execution of business strategies. Every CDO needs to think about five key drivers when establishing a comprehensive data and model governance practice for adopting AI/ML in
the enterprise.

1. DATA AND MODEL GOVERNANCE ARE INTERTWINED

When working with AI/ML models, it is important to realize that data drives the models. In a recent conversation with a model governance team, in our role as a third-party independent model validation agency, we requested a meeting with the data governance team. The data governance team questioned why they had to be involved in the model validation exercise for machine learning models. We had to discuss the interdependencies and convince the data governance team to engage in the model validation process, since the processes are intertwined. In the past, data governance and model governance were treated as separate silos, and organizations drew distinctions between the realm of data modeling and the model development world. In today's world of AI/ML, a comprehensive strategy is needed to integrate data governance and model governance issues. FIGURE 1 illustrates a typical machine learning workflow and how data drives various modeling decisions.

[FIGURE 1: MACHINE LEARNING WORKFLOW. Stages shown: data scraping/ingestion, data exploration, data cleansing and processing, feature engineering, model selection (including AutoML), model evaluation and tuning, model validation and interpretability, model deployment/inference, and monitoring. Roles shown: data engineer, DevOps engineer, data scientist/quants, software/web engineer.]

2. DATA QUALITY IS PARAMOUNT FOR PRODUCTIONIZING MACHINE LEARNING MODELS

In machine learning, the rule is "garbage in, garbage out." It is important to govern every step of the machine learning workflow, including the data processing steps, because bad-quality data hinders the adoption of AI and machine learning processes in production. In a recent article [1], Tom Redman emphasizes the importance of data quality and concludes that machine learning models are useless if the data quality is bad. It is estimated that 80% of the effort in machine learning is spent on data processing. We recently worked on a project where the modelers threw out 60-70% of the data in the modeling process because of data quality issues, significantly affecting the quality of the machine learning outputs. A comprehensive data quality framework is required to cater to machine learning needs in organizations. This includes
having a comprehensive strategy involving data acquisition, storage, preprocessing, handling missing values, feature engineering, master data management, and archiving, and having processes to address and mitigate data risks.

3. METADATA MANAGEMENT IS IMPORTANT

In addition to processing datasets, it is important to consider the metadata associated with these datasets. In the last five years, the machine learning life cycle has matured significantly. Feature stores [2] are becoming extremely popular for serving features to machine learning applications. Feature stores provide curated datasets to machine learning applications and enable traceability of data. In addition, metadata management, especially versioning different data snapshots and tracking the provenance and lineage of datasets, is essential to enhance the reproducibility and tracking of machine learning models' performance. Companies like Amazon [3] have proposed frameworks to manage the provenance and lineage of metadata when building machine learning models. In addition, open source projects like Delta Lake [4] are being proposed to enable life cycle management of data lakes for machine learning projects. This is an evolving area, but it is becoming important as the scale of data to be managed increases within the enterprise.

[FIGURE 1, modeling panel: supervised methods (regression, naive Bayes, KNN, neural networks, decision trees, ensembles) and unsupervised methods (clustering, PCA, autoencoders). Roles shown: analysts, decision makers, and risk management/compliance at all stages. Copyright www.quantuniversity.com, 2019.]

4. FAIRNESS, BIAS, PRIVACY, SECURITY

AI/ML governance has been an important topic of discussion in the last year. At a recent model governance conference in San Francisco, I had the opportunity to discuss AI model governance with governance teams from multiple financial organizations. The lack of comprehensive guidance from regulators, the pace of technological innovation, and the plethora of options for building machine learning systems today, from open source to black boxes, make adopting machine learning a complicated process. In addition, with models becoming so complex to design, especially in unanticipated and volatile situations like COVID-19, explicit efforts need to be made to understand the behavior of models, especially when addressing stressed and edge cases. The World Economic Forum [5] recently issued comprehensive guidelines to address governance issues pertaining to the adoption of AI/ML products. In addition, GDPR, the European Union's guidelines for the adoption of AI, and similar frameworks aim to ensure that issues like fairness, bias, privacy, security, interpretability, explainability, and auditability are addressed as part of AI adoption within the enterprise. Companies must have a comprehensive strategy to formulate policies on how to address these aspects and to address potential gaps in the data processing steps. Data annotation, synthetic data generation, and tagging/labeling are novel areas to many organizations, and governance policies must assess how these new areas will impact their operational processes, model development, and deployment.

5. ADDRESSING THE SKILL GAPS

Despite the downturn in the economy, organizations adopting and relying on data-driven decision making continue to experience skill shortages in the areas of machine learning and data processing. At QuantUniversity [7], we have trained thousands of analysts and data professionals in data and machine learning techniques in the last few years. Despite the rapid growth of educational programs teaching AI and machine learning topics, there is a shortage of qualified data and machine learning professionals who can address evolving challenges within the enterprise. Companies must proactively review skill gaps and ensure that comprehensive teams are formed within organizations. In addition to
model, data, and operational risks, companies take on a huge reputational risk when data-related or model-related issues affect business processes. Security breaches within organizations, the use of stale data in models, and the wrong parameters applied to models can cause enormous shocks to the operations of organizations and, if the scale is large, could lead to systemic shocks affecting financial markets, supply chains, etc. Organizations must ensure that qualified, trained personnel who can enforce the data and model governance policies are available within the organization to address the growing challenges of machine learning.

CONCLUSION

It is said that data is the new fuel when it comes to AI/ML models. As organizations move toward data-driven decision making, it is important for CDOs to proactively develop strategies to enable the benefits of AI/ML methods within their organizations. The rise of AI/ML in the enterprise has created novel challenges for CDOs, who are responsible for getting the data strategy right to ensure successful adoption of AI/ML in the enterprise. To summarize, with the introduction of AI/ML methods in the enterprise:

1. Governance needs to be more comprehensive and integrated across data and AI/ML.
2. Data quality needs to be prioritized and streamlined from the ground up, and driven by the business.
3. Issues like metadata management must be proactively designed from the beginning.
4. Issues of privacy, bias, security, and fairness need to be assessed and factored into workflow design.
5. Organizations must evaluate the evolving skill needs and proactively gear up toward acquiring or retraining employees to address the skill gaps.

The AI/ML revolution has just begun, and CDOs are front and center in steering their organizations' data strategies toward the fourth industrial revolution. While the technologies are exciting and companies are leaping toward adopting these frontier areas, factoring governance throughout the process is the responsible thing to do and will lead
to successful outcomes.

REFERENCES

[1] https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless
[2] http://featurestore.org
[3] https://pdfs.semanticscholar.org/093c/3b389384812ea16f1ad18ce6c5f43c4f7106.pdf
[4] https://databricks.com/product/delta-lake-on-databricks
[5] https://www.weforum.org/whitepapers/ai-governance-a-holistic-approach-to-implement-ethics-into-ai
[6] https://ec.europa.eu/digital-single-market/en/artificial-intelligence
[7] www.quantuniversity.com

Sri Krishnamurthy, CFA, CAP, is the founder of QuantUniversity.com, a data and quantitative analysis company. Sri is a recognized AI and machine learning expert with more than two decades of experience in quantitative analysis, statistical modeling, and data and model governance. Prior to starting QuantUniversity, Sri worked at Citigroup, Endeca, and MathWorks, and has consulted with more than 25 companies, including leadership teams at many Fortune 500 companies. Sri serves as an adjunct professor and has trained more than 1,000 students in quantitative methods, analytics, and big data in industry and at Babson College, Northeastern University, and Hult International Business School. Sri is a frequent speaker on AI and machine learning topics and has spoken at various industry gatherings and conferences hosted by the CFA Institute, PRMIA, CQF, ARPM, ODSC, ReWork, GFMI, Marcus Evans, QWAFAFEW, QCon, SAMSI, PAPIS, MathWorks, Babson College, Northeastern University, COSEAL, DataCon, etc. Sri earned master of science degrees in computer systems engineering and computer science from Northeastern University and an MBA with a focus on investments from Babson College. Sri can be reached at sri@quantuniversity.com.