PCA
Principal component analysis of a FluidDataSet
Principal Component Analysis (PCA) fits to a DataSet to determine its principal components: new axes through the data, each chosen to maximise the variance, or “differences”, within the data. PCA is often used for dimensionality reduction (see below) and is also useful for removing redundancy (i.e., correlation) and/or noise (i.e., dimensions that are uniformly distributed) from a DataSet. After fitting to a DataSet, PCA can transform the original DataSet, or individual points, to position them in relation to the principal components (i.e., the “new axes”), which makes it easier to compare how points differ from one another.
For a great visual explanation of PCA, watch this Computerphile video:
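The fit-then-transform workflow described above can be sketched in Python using scikit-learn's PCA. This is only an illustration of the general technique on made-up data, not the FluidPCA interface itself:

```python
import numpy as np
from sklearn.decomposition import PCA

# A toy "DataSet": 100 points with 10 correlated dimensions
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
data = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

pca = PCA()
pca.fit(data)                        # fit: learn the principal components

transformed = pca.transform(data)    # re-express every point on the new axes
one_point = pca.transform(data[:1])  # or transform a single point
print(transformed.shape, one_point.shape)
```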
Calculating the Principal Components
In order to calculate the principal components, PCA analyses the original DataSet and determines how to “rotate” the data so that an axis can be drawn through it along which the most variance can be observed (for an excellent visual explanation, see the YouTube video above). This new axis, or attribute, is a principal component of the DataSet. PCA continues the process of “drawing axes” through the data, each new axis maximising the remaining variance, until all the principal components have been determined: PCA computes as many principal components as there are dimensions in the original DataSet. The principal components are ordered (starting at 1) according to how much of the variance in the DataSet they account for, from the most to the least. The lower-numbered principal components are therefore more useful for understanding the differences between the points in the DataSet.
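This ordering can be seen directly in the explained variance of each component. A minimal sketch in Python (again illustrating the general technique with scikit-learn on toy data, not the FluidPCA call itself):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 points in 5 correlated dimensions
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

pca = PCA().fit(data)  # one principal component per original dimension

# Components are ordered from most explained variance to least
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC {i}: {ratio:.3f} of total variance")
```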
PCA for Dimensionality Reduction
Because PCA orders the principal components from those that account for the most variance in the data to the least, one can perform dimensionality reduction by discarding some of the higher-numbered principal components in the transformed DataSet, since they are less useful at representing the variance in the data. The DataSet then has fewer dimensions, while retaining the aspects of the data that are most useful for understanding the differences between the points (the lower-numbered principal components).
As this is often how PCA is used, FluidPCA allows the user to specify the number of output dimensions desired, so that transform operates as a dimensionality reduction algorithm (other dimensionality reduction algorithms in FluCoMa include UMAP and MDS). After FluidPCA transforms a DataSet it also returns the explained variance: the fraction (between 0 and 1) of the total variance in the original DataSet accounted for by the reduced number of dimensions it has transformed the DataSet into. Values close to 1 mean that most of the variance in the original data is represented, while values near 0 mean that most of the variance is not represented in the transformed DataSet. When using PCA, knowing the explained variance is quite important: reducing the number of dimensions while losing most of the variance will likely cause problems further along in the process, whereas reducing the number of dimensions while retaining most of the variance can help avoid the curse of dimensionality, speed up future computations, and/or remove redundancy or noise in the data. Aiming to retain an explained variance of 0.99, 0.95, or 0.9 is common.
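A short sketch of this in Python, using scikit-learn's PCA as a stand-in for the same idea (the shapes, dimension counts, and data here are illustrative assumptions, not FluCoMa defaults): reduce a 20-dimensional toy DataSet to 3 dimensions and check how much of the total variance the reduced representation keeps.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 20)) + 0.1 * rng.normal(size=(500, 20))

# Reduce 20 dimensions down to 3 and report how much variance survives
pca = PCA(n_components=3)
reduced = pca.fit_transform(data)

explained = pca.explained_variance_ratio_.sum()
print(reduced.shape)                           # (500, 3)
print(f"explained variance: {explained:.3f}")  # close to 1 = little information lost
```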
whiten
PCA “whitening” can be turned on by setting the whiten parameter to 1. “Whitening” ensures not only that the principal components will be uncorrelated, but also that they will all have unit variance (they’ll all have a standard deviation of 1). “Whitening” is sometimes also referred to as sphering, meaning that it makes the data “spherical” in the sense that each principal component (each new dimension) has roughly the same spread, just as a sphere has the same width in each of its three dimensions.
Whitening loses some information during the transformation (essentially, the relative scales of the principal components); however, it may improve the accuracy of further computations such as classification. If there is noise in the data, whitening might amplify that noise by scaling it to the same size as the other information, making it just as prominent as the relevant information, so whitening will not always improve the data.
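The effect of whitening can be seen by comparing the spread of each output dimension with and without it. A minimal Python sketch using scikit-learn's whiten option (illustrating the same concept, not the FluidPCA whiten parameter itself):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Toy data whose dimensions have very different scales
data = rng.normal(size=(300, 2)) @ np.array([[5.0, 0.5, 0.1],
                                             [0.0, 2.0, 0.3]])
data += 0.1 * rng.normal(size=(300, 3))

plain = PCA().fit_transform(data)
white = PCA(whiten=True).fit_transform(data)

print(plain.std(axis=0))  # very different spreads per component
print(white.std(axis=0))  # roughly 1.0 in every dimension ("spherical")
```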