Uniform Manifold Approximation and Projection (UMAP) on a FluidDataSet
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique. It allows you to take high-dimensional data and reduce it to a lower-dimensional representation. UMAP is a non-linear form of dimension reduction, and is robust when working with data that is noisy, sparse or cannot be reduced effectively using linear techniques such as Principal Component Analysis (PCA).
This example visualises UMAP’s iterative process of taking the original data and transforming it to a lower dimensional space. To begin with, 100 three-dimensional points are randomly generated. For each point we see visually, the original three dimensional data is used to give them an RGB colour. The algorithm then proceeds, interatively finding the optimal two-dimensional representation of the original three-dimensional space. UMAP is able to derive some structure from the RGB colour space resulting in the two dimensional plot becoming a version of a smooth gradient or colour spectrum.
UMAP has three key parameters that affect the result it produces. It is good to remember that UMAP is trying to work within the constraints you provide while giving the “best” possible result. Configuring UMAP with the the right parameters, (i.e constraints) allows you to balance the global or local features of the original space.
There is also a good explanation in the documentation of the original implementation located here.
mindist encourages UMAP to consider how close points can be represented in the low-dimensional space that it produces. Small values of
mindist (tending towards 0) mean that UMAP can pack points in a tight embedding which preserves the local characteristics of the high-dimensional data. Bigger values cause the embedding to be spread out, focusing on the preservation of the global topology. The actual values of
mindist are somewhat arbitrary, though keeping the values in the range of 0.0 to 3.0 is a good place to start.
You can think of the
numneighbours parameter as a control that determines how UMAP balances the preservation of global or local features of the high-dimension data. Because UMAP is estimating another structure in a lower dimensionality that can represent the high-dimensional data, we can tell it how many points around any given point it needs to consider when it transforms the space. As such, high values of
numneighbours will result in UMAP considering perhaps the entire space as a whole, whereas low values will focus on the relationship between only a few points. Remember, this parameter’s value is relative to the number of points in your original data. If you have 200 points and you set
numneighbours to 2 then UMAP is only ever considering 1% of your data at any given time. Therefore, a “low” or “high” value is quantified by its relationship to the size of your data at hand.
UMAP works iteratively to calculate the lower-dimension representation of the input data. As such, performing more or less iterations will drastically effect the result. It is hard to say whether the results will be “better” or “worse” in any case. In fact, you might find something musically useful at any given point of the process. That said, if you find that the low-dimension embedding that UMAP gives you doesn’t make sense it might be the case that more iterations can help it to discern some structure of your data.
An explanation that stays high-level without sacrificing essential details and information.https://pair-code.github.io/understanding-umap
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
The original UMAP paper by Leland McInnes, John Healy and James Melville.https://arxiv.org/abs/1802.03426
UMAP at SciPy 2018
An presentation at the 2018 SciPy conference where creator Leland McInnes discusses the UMAP algorithm.https://www.youtube.com/watch?v=nq6iPZVUxZU&t=1s
How to Use t-SNE Effectively
Discusses another similar dimension reduction algorithim, t-SNE which might be enriching when thinking about the wider world of machine learning and techniques in this space.https://distill.pub/2016/misread-tsne