Principal Component Analysis (PCA)
It is beyond the scope of this help to give instruction in linear algebra.
PCA is a way of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. Patterns in data with a high number of dimensions can be difficult to find. PCA reduces the number of dimensions without much loss of information, which makes graphical representations easier to interpret.
The main goals of Principal Component Analysis are:
- Identify hidden patterns in a data set
- Reduce the dimensionality of the data by removing the noise and redundancy in the data
- Identify correlated variables
PCA is particularly useful when the variables within the data set are highly correlated, because correlation indicates redundancy in the data. PCA transforms the initial variables into a new, smaller set of variables while retaining the most important information in the original data set.
These new variables are called principal components. Each one is a linear combination of the original variables, and together they explain much of the variance in the original variables.
Now for some terminology:
Eigenvalues | Eigenvalues reflect the variance of the data. Large eigenvalues correspond to large variances.
Eigenvectors | Eigenvectors reflect the direction and proximity of the input variables included in the computation. The PCA computations identify and rank the eigenvectors that account for most of the data relationships (covariance). The selected eigenvectors are labeled as principal components (PC), with PC1 responsible for the most variation, PC2 the next most, and so on.
Transformed Data | Transformed Data crossplots two principal components. The defaults are PCA1 on the X axis and PCA2 on the Y axis.
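The three terms above can be illustrated with a minimal NumPy sketch. This is not the tool's own implementation: it assumes PCA is computed from the covariance matrix of mean-centered columns (the tool may also scale or standardize the inputs), and the function name `pca` and the sample values are hypothetical.

```python
import numpy as np

def pca(data):
    """Return eigenvalues, eigenvectors (as columns), and transformed data."""
    # Center each column so the covariance measures spread around the mean.
    centered = data - data.mean(axis=0)
    # Covariance matrix of the input columns (one variable per column).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Rank components from largest to smallest variance: PC1, PC2, ...
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Transformed data: project the centered rows onto the components.
    transformed = centered @ eigenvectors
    return eigenvalues, eigenvectors, transformed

# Hypothetical data set with two strongly correlated columns.
data = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9],
])

eigenvalues, eigenvectors, transformed = pca(data)
# Share of the total variance carried by each principal component.
explained = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues)
print("Share of variance per component:", explained)
```

Because the two columns are highly correlated, the first eigenvalue is much larger than the second, so PC1 alone carries most of the variance. Plotting the first column of `transformed` against the second corresponds to the crossplot described under Transformed Data.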
To run PCA:
- From the Spotfire Analytics Explorer menu, select Data Preparation > Principal Component Analysis.
- Select the Data table that you want to run PCA on. Supported data tables include the following:
- Select the input columns. Only numeric columns are valid. If a column has empty cells, they are filled with the mean of the populated cells in that column. If a large percentage of a column's cells are blank, that input column may skew the results.
- Click Run. When the calculations are complete, the PCA Visualizations tab for the selected data table opens.
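The handling of empty cells described in the steps above can be sketched as follows. This is only an illustration of mean filling, not the tool's code; the function name `fill_empty_with_mean` and the sample values are hypothetical, and `None` stands in for an empty cell.

```python
from statistics import fmean

def fill_empty_with_mean(column):
    """Replace None entries with the mean of the populated entries."""
    populated = [value for value in column if value is not None]
    if not populated:
        raise ValueError("column has no populated cells")
    column_mean = fmean(populated)
    return [column_mean if value is None else value for value in column]

# Hypothetical input column; None marks an empty cell.
values = [0.12, None, 0.18, 0.15, None]
filled = fill_empty_with_mean(values)
print(filled)
```

Every filled cell sits exactly at the column mean, so a column with many empty cells ends up with less variance than its populated values alone would show. That is one way a mostly blank column can skew the results.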
The PCA visualizations include the following: