QuantUniversity

High Dimensionality in Large Datasets, Part I

Sri Krishnamurthy presents the foundations and basics of dealing with high dimensionality in large financial datasets.

Many financial datasets are characterized by a large number of dimensions. High dimensionality increases the complexity of analysis and requires sophisticated techniques to process these datasets. Whether it is stock data for individual companies or economic data used for macroeconomic modeling, high-dimensional datasets present unique challenges. When building predictive models, quants typically have to deploy statistical methods to reduce data complexity and the number of dimensions, making processing easier and more tractable. Traditional techniques involve either choosing important dimensions (variable selection methods, where a subset of dimensions is chosen) or reducing dimensions (where variables are transformed into a smaller set of new variables) to make analysis feasible and practical. However, these traditional techniques are reaching their limits with today's large datasets. Technological innovations in large-scale data collection and processing over the last decade have made access to large volumes of data possible. In addition, the data collected has high granularity, frequency, and complexity, increasing the need to adopt sophisticated data-handling techniques. Collectively, quants are seeing the four V's of Big Data (Volume, Velocity, Variety, and Veracity) manifest in financial datasets, requiring a rethinking of approaches to process them (see our prior article in the March 2014 issue of Wilmott for more on this topic).

In order to appreciate the nature of the problem of high-dimensional datasets, we need to understand both traditional and modern techniques, and in this two-part article we will cover both. In Part 1, we lay the foundation by discussing some of the common traditional techniques for handling high-dimensional datasets. We begin by discussing the problems posed by high-dimensional datasets, including the famous "curse of dimensionality" problem. We then discuss two methods for dealing with high-dimensional datasets: the goal of the first is to reduce the number of variables by variable selection, and that of the second is to reduce the number of variables by deriving new variables. We will illustrate these methods through sample techniques (regression, decision trees, and principal component analysis (PCA)) and give pointers on implementing them in MATLAB. We will also include sample applications of these techniques in finance and economics as part of this discussion. In Part 2 of this article, we discuss some of the challenges of dealing with high dimensionality in the context of Big Data problems, along with methodologies and innovations to process these large datasets. We will review some of the proposed methodologies and provide guidance on choosing approaches to handle large, high-dimensional datasets.

Curse of dimensionality and other problems in high-dimensional datasets

In a seminal lecture in 2000, Donoho [1] discussed the challenges of dealing with high-dimensional datasets. He spoke about the "curse of dimensionality" problem, a term coined by Richard Bellman, in the context of problems encountered in multivariate data analysis with high-dimensional datasets. Let us discuss its significance through a simple example.
Figure 1: Samples in 2-D space.
Figure 2: The same samples in 3-D space.

Assume there are nine data samples and two dimensions for each sample. They can be visualized on a 3 x 3 grid, as shown in Figure 1. If there are three dimensions, we can visualize these nine points in a 3 x 3 x 3 grid, as shown in Figure 2. Notice that the points are much sparser in the 3-D plot. In order to build a model that ensures statistically sound results, we need to increase the number of samples significantly; in fact, as we keep increasing the dimensions, the number of samples needed grows exponentially.
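A minimal MATLAB sketch of this growth, assuming we want to maintain the same density of three sample points per axis as in the grids above, might look as follows (the choice of ten dimensions is arbitrary and purely illustrative):

```matlab
% Curse of dimensionality: to keep a density of k points per axis,
% the number of samples required grows as k^d with the dimension d.
k = 3;                      % points per axis, as in the 3 x 3 grid above
d = (1:10)';                % number of dimensions considered
samplesNeeded = k .^ d;     % 3, 9, 27, ..., 59,049 samples at d = 10

disp(table(d, samplesNeeded, ...
    'VariableNames', {'Dimensions', 'SamplesNeeded'}));
```

Nine samples that comfortably fill the 2-D grid cover only a third of the 27 cells in 3-D, and the shortfall compounds with every additional dimension.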
Collecting large amounts of data is expensive, and processing it brings computational, model design, and implementation challenges. Fitting models that factor in the entire variable space can be an expensive computational exercise when using algorithms that span the variable space in search of optimal solutions. Sometimes data availability can be an issue: data may be unavailable, noisy, or missing in pockets, restricting the modeling choices available for building statistically sound models. Even when data is available, some variables may be irrelevant and add little value, requiring the modeler to exclude them when building models. In other cases, variables may be dependent and exhibit correlation; when building models such as regression models, multicollinearity then becomes an issue that makes the coefficient estimates subject to huge changes.

When building predictive models, the goal of the modeler should be a parsimonious model: one that incorporates the fewest possible variables, has been tested for predictive power, and is generic enough to be deployed on new datasets. Models that incorporate a large number of variables tend to be statistically unstable and may lack predictive power, as they overfit to the samples and are not generic enough for prediction. This becomes an issue for applications such as forecasting. In techniques such as regression, adding variables can increase the R-squared value, indicating a better in-sample fit, but such models typically do not add predictive power when tested on new data.

Now that we understand the importance of reducing variables in large datasets, the question is: how do we reduce the number of dimensions? A naive approach would be to choose variables for a model manually. This may work when dealing with a few variables chosen by domain experts, but it may not be optimal when dealing with a very large number of variables. The modeler has to make assumptions about the importance of variables that may or may not be valid, and the resulting model may omit variables that add to its predictive power. At the other end of the spectrum, a kitchen-sink approach is to do an exhaustive search of all possible combinations of variables and choose the best model. This is a computationally intensive exercise: testing two variables already means trying three models (for example, if a and b are predictor variables used to model a target response variable Y, we can build three models: Y = f(a, b), Y = f(a), and Y = f(b)), and testing ten variables means testing more than a thousand possible models, which quickly becomes impractical.
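To make the combinatorics concrete, the following MATLAB sketch (with a and b as placeholder predictor names) enumerates every non-empty subset of predictors that an exhaustive search would have to fit; with p predictors there are 2^p - 1 candidate models:

```matlab
% Exhaustive subset search: with p predictors there are 2^p - 1
% non-empty candidate models to fit and compare.
predictors = {'a', 'b'};                 % placeholder predictor names
p = numel(predictors);

for mask = 1:(2^p - 1)                   % each mask encodes one subset
    inModel = predictors(logical(bitget(mask, 1:p)));
    fprintf('Model %d: Y = f(%s)\n', mask, strjoin(inModel, ', '));
end

fprintf('%d predictors -> %d candidate models\n', p, 2^p - 1);      % 3
fprintf('%d predictors -> %d candidate models\n', 10, 2^10 - 1);    % 1,023
```

Even with a fast fitting routine, this enumeration doubles with every predictor added, which is why the selection strategies discussed next trade optimality for tractability.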
Let us now consider some techniques that are commonly applied to handle high-dimensional datasets.

Techniques for handling high dimensionality in finance and sample applications

Two methods are predominantly used when dealing with high-dimensional datasets. Variable selection involves using techniques to select the features that add most to the predictive power of the model. Variable reduction involves generating new sets of variables that are derived from the original variable set.

Variable selection

The goal of variable selection is to select features from the universe of variables based on a specific criterion. Typically, these methods are iterative and computationally intensive, as they search for the best subset of predictors. Variable selection is preferred when the final model needs to preserve the original variables, so that the contribution of each one to the model can be understood. Criterion-based methods use metrics such as the Akaike information criterion (AIC), the Bayes information criterion (BIC), the adjusted R-squared, and Mallows' Cp to compare and evaluate models; for classification, misclassification rates are typically used. In this section we will illustrate variable selection using subset selection methods in regression and variable importance measures for decision trees, and touch upon using criterion-based methods with regression and decision trees.

Regression

When building regression models, subset selection methods provide an effective way to retain the influential variables in a model. Various software packages provide options to implement feature selection. Typically, there are four possibilities:

- Exhaustive search: As previously explained, this method evaluates all possible combinations of variables and chooses the best model based on the chosen criterion.
- Forward selection: The model adds one predictor at a time and continues until adding another predictor is no longer statistically significant.
- Backward selection: The opposite of forward selection; all variables are included in the model to start with, and variables are dropped one at a time until only the statistically significant variables remain.
- Stepwise regression: This combines forward and backward elimination, adding and dropping variables based on their statistical significance.

It should be noted that exhaustive searches give the optimal model but are computationally intensive. Forward, backward, and stepwise regressions reduce the computational burden but can miss the best possible model. MATLAB's implementation of these methods, and examples of using criterion-based methods for regression, can be found at [2].

Regression has many applications in finance, from macroeconomic modeling to understanding which factors are the best predictors of stock returns. When dealing with large sets of variables, variable selection methods are key to getting sound results. For example, Northfield's macroeconomic equity risk model [3] covers 5,000 securities, and security returns are explained by 12 factors whose exposures are inferred through stepwise regression. Jank [4] demonstrates how stepwise regression can be used to predict a company's stock price using 25 variables. An example illustrating the use of stepwise regression to select a basket of securities in MATLAB is available at [5], and a case study is available at [6].
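As an illustration of how such a selection might be run with MATLAB's Statistics and Machine Learning Toolbox, here is a minimal sketch using stepwiselm on simulated data; the matrix X, the response y, and the coefficients are placeholders, not the Northfield or Jank datasets:

```matlab
% Stepwise regression as a variable selection device (sketch).
% X is an n-by-p matrix of candidate predictors, y an n-by-1 response;
% both are simulated here purely for illustration.
rng(1);                                          % reproducibility
n = 200; p = 10;
X = randn(n, p);
y = 2*X(:,1) - 1.5*X(:,4) + 0.5*randn(n, 1);     % only predictors 1 and 4 matter

% Start from a constant model and add/remove linear terms based on
% their statistical significance (stepwiselm's default criterion).
mdl = stepwiselm(X, y, 'constant', 'Upper', 'linear');

disp(mdl.Formula);            % which predictors survived the selection
disp(mdl.Rsquared.Adjusted);  % adjusted R-squared of the selected model
```

The fitted model reports which predictors were added or removed at each step, which is the information a modeler would use to settle on a parsimonious specification.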
Decision trees

Decision trees are binary trees predominantly used for classifying data into subgroups. They are built using splitting rules that recursively generate subtrees, and the leaf nodes typically represent the classification groups of interest. In classification tasks, trees built on a large number of variables become unwieldy, so they are typically pruned to incorporate only the most significant variables required for the classification task. In addition, techniques such as random forests and bagging are used to improve the stability of the algorithm. Here, variable importance measures [7] are used to estimate the importance of each predictor. The TreeBagger algorithm (see [8] for the implementation in MATLAB) uses bagging, that is, an ensemble of decision trees; one of its outputs is a variable importance measure that can be used for variable selection. Criterion-based methods [2] can also be employed for classification, using misclassification rates as the criterion.

Decision trees are used in varied applications such as bankruptcy prediction and fraud detection. Cho et al. [9] employ variable selection using decision trees for bankruptcy prediction, and Genuer et al. [7] discuss variable selection using random forests. An example illustrating the use of the TreeBagger algorithm to select a basket of securities in MATLAB is available at [8], and a case study elaborating the use of decision trees for variable selection is available at [6].
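The sketch below shows one way the variable importance output might be obtained from TreeBagger on simulated data; the data is a stand-in rather than a real credit or bankruptcy dataset, and the option and property names ('OOBPredictorImportance', OOBPermutedPredictorDeltaError) differ slightly across MATLAB releases, so treat this as indicative rather than definitive:

```matlab
% Variable importance from bagged classification trees (sketch).
% X is an n-by-p predictor matrix and Y a class label vector; both are
% simulated stand-ins for illustration only.
rng(2);
n = 500; p = 8;
X = randn(n, p);
Y = double(X(:,2) + 0.5*X(:,5) + 0.3*randn(n, 1) > 0);   % labels driven by x2, x5

% Grow 100 bagged trees and request out-of-bag predictor importance
% (option/property names vary slightly in older MATLAB releases).
model = TreeBagger(100, X, Y, 'Method', 'classification', ...
                   'OOBPredictorImportance', 'on');
importance = model.OOBPermutedPredictorDeltaError;

[~, order] = sort(importance, 'descend');   % most important predictors first
disp(order(1:3));                           % candidates to keep in the model
```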
Variable reduction

Variable reduction methods create new variables that are transformations of the original ones and try to capture the predictive power of a large set of variables in a smaller set. Because variable transformation is involved, the new variables do not provide an intuitive attribution of the contributions of the individual original variables. PCA, independent component analysis, factor analysis, and singular value decomposition are examples of variable reduction methods. We will review PCA as an example and discuss some sample applications.

PCA

When variables are highly correlated, PCA provides a method to transform a large set of variables into a smaller set that retains the predictive power of the original. The new variables are weighted linear combinations of the original ones and are uncorrelated. These new variables, called principal components, are ordered so that the highest variance is captured in the first principal component, the second component captures the second-highest variance, and subsequent components capture the remaining variability in decreasing order. The first few components therefore capture most of the variability observed in the original dataset. Note that PCA works only for numeric variables. A discussion of using PCA in MATLAB is available at [10], and a case study elaborating the use of PCA is available at [6].
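As a minimal illustration of the mechanics, the sketch below applies MATLAB's pca function to a simulated matrix of correlated return series; the data and the 95 percent variance cut-off are arbitrary choices for the example:

```matlab
% PCA on a T-by-N matrix of correlated return series (simulated data).
rng(3);
T = 1000; N = 6;
common  = randn(T, 1);                                % one shared driver
returns = 0.8*repmat(common, 1, N) + 0.3*randn(T, N); % correlated columns

% pca centers the columns and returns the loadings (weights of each
% linear combination), the component scores, and the variance explained.
[coeff, score, ~, ~, explained] = pca(returns);

disp(coeff(:, 1)');       % weights defining the first principal component
disp(explained');         % percentage of variance captured by each component

% Keep only the components needed to explain ~95% of the variance.
k = find(cumsum(explained) >= 95, 1);
reducedData = score(:, 1:k);                          % T-by-k reduced dataset
```

With genuinely correlated inputs, the first few components typically account for most of the variance, which is the property the applications below exploit.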
PCA has found multiple applications in finance. Forecasting economic time series is a classic example where high dimensionality is evident: economic time series data typically offers hundreds of potential variables that could be used for forecasting. Stock and Watson [11] provide a comprehensive survey of the problems of forecasting with a large number of variables and of the various methods, including PCA, for handling them. Fifield et al. [12] discuss how PCA can be used to identify relevant factors from a pool of macroeconomic data.

Portfolio optimization is another high-dimensional problem: the covariance matrix of asset returns needs to be estimated, which becomes challenging for large portfolios. As we discussed in the section on the curse of dimensionality, for a universe of 500 stocks the covariance matrix alone has 125,250 unique parameters to estimate. Jorion [13] discusses ways of simplifying the covariance matrix estimation process, including how PCA can be used to compute the covariance matrix when correlations in the asset return series are high. Tsay [14] provides a detailed example of how to implement PCA for a sample stock series.

Yield curve modeling is another well-known application of PCA. Litterman and Scheinkman [15] discuss their three-factor approach (level, steepness, curvature) to explaining the variation of returns in fixed-income securities using PCA.

Simulation is another application area for PCA: when a large number of simulations is required, running PCA helps reduce the number of factors and thus the computational burden. Huynh et al. [16] discuss how to apply PCA with Monte Carlo (MC) and quasi-Monte Carlo (QMC) simulations to compute the VaR of a bond portfolio, and Jamshidian and Zhu [17] discuss how they use PCA to reduce the number of factors when analyzing large multicurrency portfolios. Many other interesting applications have been published recently; for example, Ambrus et al. [18] describe using PCA to reduce the number of variables when modeling interest rate risk, with 13 risk factors per currency, for the Swiss Solvency Test Standard Model. With these diverse applications, we are seeing the adoption of dimension reduction techniques such as PCA increase, and quants should consider using these techniques when dealing with high-dimensional problems.

In summary

High dimensionality in datasets poses modeling and processing challenges and must be dealt with to build effective statistical models. With the volume and variety of data increasing, renewed focus has been placed on addressing high dimensionality in datasets. In this article, we have provided a foundation for high-dimensional data analysis and discussed some of the problems posed by high-dimensional datasets. We have covered some of the traditional techniques used in quantitative finance to handle datasets with a large number of variables, with a particular focus on variable selection and variable reduction, and we have illustrated how these techniques are implemented in practice when building models using regression, decision trees, and PCA, along with sample financial applications. In the sequel to this article, we will focus on high dimensionality in the context of Big Data problems and on methodologies and innovations to process these large datasets.

The curse of dimensionality has been recognized as one of the most challenging problems in statistics. With the knowledge and tools to deal with high dimensionality, quants can effectively leverage computational power and appropriate algorithms to mine the nuggets of information hidden in large datasets.

References

1. Donoho, D.L. 2000. High-dimensional data analysis: The curses and blessings of dimensionality. Aide-memoire of a lecture at the AMS Conference on Math Challenges of the 21st Century.
2. http://www.mathworks.com/help/stats/feature-selection.html
3. http://www.northinfo.com/documents/7.pdf
4. Jank, W. 2011. Business Analytics for Managers. Springer.
5. http://www.mathworks.com/machine-learning/examples.html?file=/products/demos/machine-learning/basket_selection/basket_selection.html
6. http://www.quantuniversity.com/variableReduction.html
7. Genuer, R., Poggi, J.M., and Tuleau-Malot, C. 2010. Variable selection using random forests. Pattern Recognition Letters 31(14), 2225–2236.
8. http://www.mathworks.com/help/finance/examples/credit-rating-by-bagging-decision-trees.html
9. Cho, S., Hong, H., and Ha, B.C. 2010. A hybrid approach based on the combination of variable selection using decision trees and case-based reasoning using the Mahalanobis distance: for bankruptcy prediction. Expert Systems with Applications 37(4), 3482–3488.
10. http://www.mathworks.com/help/stats/feature-transformation.html
11. Stock, J.H. and Watson, M.W. 2006. Forecasting with many predictors. In Handbook of Economic Forecasting, Vol. 1, Elliott, G., Granger, C., and Timmermann, A. (eds). Elsevier.
12. Fifield, S.G.M., Power, D.M., and Sinclair, C.D. 2002. Macroeconomic factors and share returns: an analysis using emerging market data. International Journal of Finance and Economics 7(1), 51–62.
13. Jorion, P. 2007. Value at Risk: The New Benchmark for Managing Financial Risk, 3rd edn. McGraw-Hill.
14. Tsay, R.S. 2005. Analysis of Financial Time Series, Vol. 543. Wiley.
15. Litterman, R.B. and Scheinkman, J. 1991. Common factors affecting bond returns. Journal of Fixed Income 1(1), 54–61.
16. Huynh, H.T. and Soumare, I. 2011. Stochastic Simulation and Applications in Finance with MATLAB Programs, Vol. 633. Wiley.
17. Jamshidian, F. and Zhu, Y. 1996. Scenario simulation: theory and methodology. Finance and Stochastics 1(1), 43–67.
18. Ambrus, M., Crugnola-Humbert, J., and Schmid, M. 2011. Interest rate risk: dimension reduction in the Swiss Solvency Test. European Actuarial Journal 1(2), 159–172.

About the Author

QuantUniversity offers quantitative modeling and consulting services to financial institutions and specializes in analytics, optimization, and Big Data solutions. Sri Krishnamurthy, CFA, CAP, is the founder of www.QuantUniversity.com, a data and quantitative analysis company. Sri has significant experience in designing quantitative finance applications for some of the world's largest asset management and financial companies. He teaches quantitative methods and analytics to MBA students at Babson College and is the author of the forthcoming book Financial Application Development: A Case Study Approach, to be published by Wiley. Sri can be reached at sri@quantuniversity.com.