Removing Outliers with BufStats

Prevent outliers from negatively impacting BufStats' statistical summary.

For a consideration of what might constitute an outlier and more ideas about how to manage them, visit the page on Outliers.

BufStats can find and remove outliers by analysing the data to set boundaries for each channel in the buffer. Any frame that has a value outside its channel’s boundaries will not be used to compute the statistics. The strictness of these boundaries is determined by outliersCutoff (see below for how this parameter is used). Removing these outliers before computing the statistics prevents them from affecting the statistical summary, so the output of BufStats better represents the majority of the data.

The boundaries of each channel are computed using the interquartile range (IQR), which is the range between the 25th and 75th percentiles; 50% of the data will fall within this range. This ensures that the boundaries are relative to the scale and distribution of values. The lower boundary is 25th percentile - (IQR * outliersCutoff). The upper boundary is 75th percentile + (IQR * outliersCutoff). The 25th and 75th percentiles are also called “Q1” and “Q3” respectively, short for the 1st Quartile and 3rd Quartile.

The default outliersCutoff of -1 bypasses this function, so all the frames are used in the statistical measurements.
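The boundary formulas can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not BufStats' own code: the function name outlier_bounds is hypothetical, and treating every negative cutoff (not just the default of -1) as a bypass is an assumption.

```python
def outlier_bounds(q1, q3, outliers_cutoff):
    """Return (lower, upper) boundaries, or None when filtering is bypassed."""
    if outliers_cutoff < 0:
        # Assumption: any negative cutoff, including the default of -1,
        # means "do not remove outliers".
        return None
    iqr = q3 - q1                    # interquartile range
    margin = iqr * outliers_cutoff   # how far the boundaries sit beyond Q1 and Q3
    return q1 - margin, q3 + margin


print(outlier_bounds(10.0, 20.0, 1.5))  # (-5.0, 35.0)
print(outlier_bounds(10.0, 20.0, -1))   # None
```

A larger outliersCutoff widens the boundaries, so fewer frames are treated as outliers.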

An example

To demonstrate how this works, consider this output of a SpectralShape analysis (columns are frames and rows are channels of the features buffer):

            FFT Frame 1  FFT Frame 2  FFT Frame 3  FFT Frame 4  FFT Frame 5  FFT Frame 6  FFT Frame 7  FFT Frame 8
Centroid        3001.34      2347.71      2087.17       2217.7      2282.62      2425.79      2655.61       2607.8
Spread          3182.74      2802.39      2832.76      3051.99      3180.78      3302.01      3462.62      3424.58
Skewness           1.81         2.55         2.98         2.68         2.63         2.45         2.18         2.22
Kurtosis           6.99        11.58         13.5        11.13        10.71         9.67         8.21         8.26
Rolloff         9615.96      7792.53      8347.06      9182.46      9491.74      9785.22     10178.95     10112.94
Flatness         -14.93       -17.33       -17.35       -16.44       -15.56       -14.78          -14       -14.18
Crest             23.67        31.62        32.73        32.57        33.77        34.62        35.73        35.55

First, BufStats finds Q1 and Q3 for each channel. From these it calculates the interquartile range (IQR), which is the difference between the two values (Q3 - Q1). Next, a “Margin” is calculated as IQR * outliersCutoff (in this example, outliersCutoff = 1.1). Finally, the lower and upper bounds are placed one “Margin” below Q1 and above Q3: lower bound = Q1 - Margin and upper bound = Q3 + Margin.

                     Q1           Q3          IQR       Margin  Lower Bound  Upper Bound
Centroid        2282.62       2607.8       325.18        357.7      1924.92      2965.49
Spread          3051.99      3302.01       250.02       275.02      2776.96      3577.03
Skewness           2.22         2.63         0.41         0.45         1.77         3.08
Kurtosis           8.26        11.13         2.87         3.15         5.11        14.28
Rolloff         9182.46      9785.22       602.76       663.03      8519.43     10448.26
Flatness         -16.44       -14.78         1.66         1.83       -18.27       -12.95
Crest             32.57        34.62         2.05         2.25        30.32        36.87
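The Centroid row can be reproduced with a short Python sketch. The quartiles helper is hypothetical: it picks the sorted value nearest to each percentile position without interpolating. That rule is an assumption, chosen because it yields the Q1 and Q3 shown in the table; BufStats may compute percentiles differently.

```python
def quartiles(values):
    """Return (Q1, Q3) as the sorted values nearest the 25% and 75% positions."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Assumption: round the fractional position to the nearest index
    # instead of interpolating between neighbouring values.
    q1 = ordered[int(0.25 * last + 0.5)]
    q3 = ordered[int(0.75 * last + 0.5)]
    return q1, q3


# Centroid values for FFT Frames 1 to 8, copied from the first table.
centroid = [3001.34, 2347.71, 2087.17, 2217.7, 2282.62, 2425.79, 2655.61, 2607.8]
outliers_cutoff = 1.1

q1, q3 = quartiles(centroid)
iqr = q3 - q1
margin = iqr * outliers_cutoff
lower, upper = q1 - margin, q3 + margin

print(f"Q1={q1} Q3={q3} IQR={iqr:.2f} Margin={margin:.2f}")
print(f"lower={lower:.2f} upper={upper:.2f}")
```

This gives Q1 = 2282.62 and Q3 = 2607.8, and bounds that agree with the table to within the rounding of the displayed values.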

Now, using the lower and upper bounds, BufStats checks the original values in the buffer to see if any fall outside this range.

            FFT Frame 1   FFT Frame 2   FFT Frame 3   FFT Frame 4   FFT Frame 5   FFT Frame 6   FFT Frame 7   FFT Frame 8
Centroid        3001.34*      2347.71       2087.17        2217.7       2282.62       2425.79       2655.61        2607.8
Spread          3182.74       2802.39       2832.76       3051.99       3180.78       3302.01       3462.62       3424.58
Skewness           1.81          2.55          2.98          2.68          2.63          2.45          2.18          2.22
Kurtosis           6.99         11.58          13.5         11.13         10.71          9.67          8.21          8.26
Rolloff         9615.96       7792.53*      8347.06*      9182.46       9491.74       9785.22      10178.95      10112.94
Flatness         -14.93        -17.33        -17.35        -16.44        -15.56        -14.78           -14        -14.18
Crest             23.67*        31.62         32.73         32.57         33.77         34.62         35.73         35.55

* value outside its channel’s lower or upper bound

Frames 1, 2, and 3 all have values that fall outside the boundaries: Frame 1 in Centroid and Crest, and Frames 2 and 3 in Rolloff. Each of these three frames is excluded from the statistical summary, leaving these frames:

            FFT Frame 4  FFT Frame 5  FFT Frame 6  FFT Frame 7  FFT Frame 8
Centroid         2217.7      2282.62      2425.79      2655.61       2607.8
Spread          3051.99      3180.78      3302.01      3462.62      3424.58
Skewness           2.68         2.63         2.45         2.18         2.22
Kurtosis          11.13        10.71         9.67         8.21         8.26
Rolloff         9182.46      9491.74      9785.22     10178.95     10112.94
Flatness         -16.44       -15.56       -14.78          -14       -14.18
Crest             32.57        33.77        34.62        35.73        35.55
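The frame selection can be sketched in Python as follows. The data and bounds are copied from the tables above; the variable names are illustrative, and treating a value exactly on a bound as inside the range is an assumption that makes no difference for this data.

```python
# Feature values per channel for FFT Frames 1 to 8.
features = {
    "Centroid": [3001.34, 2347.71, 2087.17, 2217.7, 2282.62, 2425.79, 2655.61, 2607.8],
    "Spread":   [3182.74, 2802.39, 2832.76, 3051.99, 3180.78, 3302.01, 3462.62, 3424.58],
    "Skewness": [1.81, 2.55, 2.98, 2.68, 2.63, 2.45, 2.18, 2.22],
    "Kurtosis": [6.99, 11.58, 13.5, 11.13, 10.71, 9.67, 8.21, 8.26],
    "Rolloff":  [9615.96, 7792.53, 8347.06, 9182.46, 9491.74, 9785.22, 10178.95, 10112.94],
    "Flatness": [-14.93, -17.33, -17.35, -16.44, -15.56, -14.78, -14, -14.18],
    "Crest":    [23.67, 31.62, 32.73, 32.57, 33.77, 34.62, 35.73, 35.55],
}

# (lower bound, upper bound) per channel, with outliersCutoff = 1.1.
bounds = {
    "Centroid": (1924.92, 2965.49),
    "Spread":   (2776.96, 3577.03),
    "Skewness": (1.77, 3.08),
    "Kurtosis": (5.11, 14.28),
    "Rolloff":  (8519.43, 10448.26),
    "Flatness": (-18.27, -12.95),
    "Crest":    (30.32, 36.87),
}

num_frames = len(features["Centroid"])

# A frame is kept only if every channel's value lies within that channel's bounds.
kept = [
    frame
    for frame in range(num_frames)
    if all(low <= features[name][frame] <= high for name, (low, high) in bounds.items())
]

print([frame + 1 for frame in kept])  # frame numbers counted from 1: [4, 5, 6, 7, 8]
```

A single out-of-range value in any channel is enough to drop the whole frame, which is why Frame 2 is removed even though only its Rolloff value is out of range.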

Here are the statistics these selected frames produce:

                 Mean    Std Dev   Skewness   Kurtosis        Low     Middle       High
Centroid      2437.91     172.63       0.03       1.35     2217.7    2425.79    2655.61
Spread         3284.4     152.63      -0.29       1.63    3051.99    3302.01    3462.62
Skewness         2.43       0.21      -0.06        1.3       2.18       2.45       2.68
Kurtosis          9.6       1.21      -0.01       1.31       8.21       9.67      11.13
Rolloff       9750.26      375.7      -0.28        1.6    9182.46    9785.22   10178.95
Flatness       -14.99       0.91      -0.46       1.75     -16.44     -14.78        -14
Crest           34.45       1.17      -0.43       1.78      32.57      34.62      35.73

For comparison, here are the statistics when all the frames are included:

                 Mean    Std Dev   Skewness   Kurtosis        Low     Middle       High
Centroid      2453.22     272.89       0.67       2.58    2087.17    2425.79    3001.34
Spread        3154.98     231.61      -0.27       1.78    2802.39    3182.74    3462.62
Skewness         2.44       0.34      -0.32       2.44       1.81       2.55       2.98
Kurtosis           10          2       0.15       2.04       6.99      10.71       13.5
Rolloff       9313.36     790.44      -0.79       2.32    7792.53    9615.96   10178.95
Flatness       -15.57       1.25      -0.28       1.57     -17.35     -14.93        -14
Crest           32.53       3.62      -1.65       4.66      23.67      33.77      35.73
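To see the effect on a single channel, this Python sketch recomputes the mean and standard deviation of Crest with and without the removed frames. It uses the population standard deviation (dividing by the number of frames); that choice is an assumption, made because it reproduces the values in the tables, and the variable names are illustrative.

```python
from statistics import fmean, pstdev

# Crest values for FFT Frames 1 to 8.
crest = [23.67, 31.62, 32.73, 32.57, 33.77, 34.62, 35.73, 35.55]

# Zero-based indices of FFT Frames 4 to 8, the frames left after outlier removal.
kept_frames = [3, 4, 5, 6, 7]
crest_kept = [crest[i] for i in kept_frames]

# Assumption: population standard deviation (divide by N, not N - 1).
mean_all, std_all = fmean(crest), pstdev(crest)
mean_kept, std_kept = fmean(crest_kept), pstdev(crest_kept)

print(f"all frames:  mean={mean_all:.2f}  std dev={std_all:.2f}")
print(f"kept frames: mean={mean_kept:.2f}  std dev={std_kept:.2f}")
```

Removing the single low Crest value in Frame 1 (23.67) raises the mean from 32.53 to 34.45 and shrinks the standard deviation from 3.62 to 1.17.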
Last modified: Tue May 10 10 by Ted Moore