Multicollinearity Analysis
Running Multicollinearity Analysis is one way to determine the best inputs for a machine learning model or, conversely, the attributes to exclude from the model so that redundancy among the independent variables does not degrade the results.
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another. When multicollinearity exists, one independent variable can be predicted from another, and including both in the model can affect the accuracy of that model.
The Multicollinearity Analysis tool identifies highly correlated variables using the Variance Inflation Factor (VIF), which measures the degree of multicollinearity. Under ideal conditions, VIF values are small (VIF < 3), indicating low correlation among variables. The default VIF cutoff value is 5: only variables with a VIF less than 5 are included in the model. Note, however, that many sources consider a VIF of less than 10 acceptable.
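For reference, the textbook definition of the VIF for the j-th independent variable is given below, where R_j^2 is the coefficient of determination obtained by regressing that variable on all of the other independent variables. How the tool computes VIF internally is not documented here; it is assumed to follow this standard definition.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, a cutoff of 5 corresponds to R_j^2 = 0.8, and a cutoff of 10 corresponds to R_j^2 = 0.9.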
After the VIF of each variable has been determined, the variables to exclude are chosen iteratively: one variable is eliminated at a time, starting with the variable that has the highest VIF and continuing until the remaining variables fall below the cutoff VIF (typically 5 to 10).
Multicollinearity analysis can be run independently as a data preparation step or as part of the machine learning model creation process.
To run multicollinearity analysis:
- Select the data table and the input attributes (columns) for the model. Remember that you can load a template if you have one saved for this data table. Do not include variables that you know are highly correlated with the variable you want to predict. Instead, focus on independent variables. For example, if you want to predict the next 12 months of oil production, do not include other production variables.
- Set the VIF cutoff value, which is the only editable option. Again, the default is 5. Any attribute with a VIF greater than 5 (or the value you set) is highly correlated with other independent variables and will be excluded from the model.
- Click Run Analysis to begin. When the analysis finishes, the Multicollinearity Analysis dialog box opens. There are 3 columns:
Input Attribute | Lists the attributes, from those you selected for the model, that had a VIF less than the cutoff (1-10). The application determined that these attributes should be used to create the machine learning model. Note that text attributes and attributes with very few values are automatically discarded and are not checked.
VIF Value | Lists the VIF value of each attribute.
Discarded Attribute | Lists attributes that are highly correlated with the selected attribute but have a lower VIF value. The algorithm determines which attributes are highly correlated with each selected attribute and lists them in the dropdown. Using your expertise and knowledge of the data sources, you may want to use one of these highly correlated, discarded attributes instead of the attribute the algorithm selected. To change the attribute, expand the dropdown list and click Select this instead.
Save the attribute selection as a template to use when creating your machine learning model.
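The automatic discarding of text attributes and attributes with very few values, noted in the Input Attribute description, can be approximated with a simple pre-filter. The sketch below is only an illustration: the function name `usable_columns`, the `min_distinct` threshold of 3, and the sample attribute names are assumptions, since the application's actual rule is not documented here.

```python
from numbers import Real


def usable_columns(columns, min_distinct=3):
    """Keep columns that are fully numeric and have enough distinct values.

    `min_distinct` is a hypothetical threshold chosen for illustration.
    """
    kept = {}
    for name, values in columns.items():
        # Booleans are excluded even though Python treats them as numbers.
        is_numeric = all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )
        if is_numeric and len(set(values)) >= min_distinct:
            kept[name] = values
    return kept


# Made-up attributes: one numeric, one text, one with only two distinct values.
attributes = {
    "depth": [1.0, 2.5, 3.0, 4.2],
    "operator": ["A", "B", "A", "C"],
    "flag": [0, 1, 0, 1],
}
print(list(usable_columns(attributes)))
```

Filtering before the VIF calculation matters because VIF is built on correlations, which are undefined for text and unstable for columns that barely vary.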