Descriptive Statistics

While NumPy provides a wide range of statistical functions, calculating multiple statistical variables from the same array often requires multiple passes over the data. The pyinterp.DescriptiveStatistics class offers a more efficient solution by computing several statistical variables in a single pass. This approach is not only faster but also more numerically stable, thanks to its incremental calculation algorithm.

Note

This implementation is based on the following paper:

Pébay, P., Terriberry, T.B., Kolla, H. et al. Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights. Comput Stat 31, 1305-1325, 2016, https://doi.org/10.1007/s00180-015-0637-z

Basic Usage

Let’s start by creating a random array and using it to initialize the pyinterp.DescriptiveStatistics class.

import dask.array
import numpy

import pyinterp

generator = numpy.random.Generator(numpy.random.PCG64(0))
values = generator.random((2, 4, 6, 8))

ds = pyinterp.DescriptiveStatistics(values)

Once the object is created, you can access various statistical variables, such as the count, mean, variance, standard deviation, skewness, kurtosis, minimum, maximum, and sum.

print(f'Count: {ds.count()}')
print(f'Mean: {ds.mean()}')
print(f'Variance: {ds.var()}')
Count: [384]
Mean: [0.52703051]
Variance: [0.08762152]

You can also get all the calculated statistical variables as a structured NumPy array.

stats_array = ds.array()
print('Structured array of statistics:')
print(stats_array)
Structured array of statistics:
[(384, -1.20728744, 0.99720994, 0.52703051, 0.00030069, -0.12457213, 384., 202.37971539, 0.08762152)]

Computing Statistics Along an Axis

Similar to NumPy, you can compute statistics along a specific axis by providing the axis parameter.

ds_axis = pyinterp.DescriptiveStatistics(values, axis=(1, 2))
print('Mean along axis (1, 2):')
print(ds_axis.mean())
Mean along axis (1, 2):
[[0.46094551 0.58985662 0.56335527 0.59960438 0.53755935 0.46486889
  0.59151122 0.50117507]
 [0.5223734  0.50646854 0.56639677 0.43645944 0.53861813 0.48949667
  0.52384772 0.53995116]]

Working with Dask Arrays

The pyinterp.DescriptiveStatistics class also supports Dask arrays, allowing you to work with datasets that are larger than memory.

dask_values = dask.array.from_array(values, chunks=(2, 2, 2, 2))
ds_dask = pyinterp.DescriptiveStatistics(dask_values, axis=(1, 2))
print('Mean with Dask array:')
print(ds_dask.mean())
Mean with Dask array:
[[0.46094551 0.58985662 0.56335527 0.59960438 0.53755935 0.46486889
  0.59151122 0.50117507]
 [0.5223734  0.50646854 0.56639677 0.43645944 0.53861813 0.48949667
  0.52384772 0.53995116]]

Weighted Statistics

You can also calculate weighted statistics by providing a weights array.

weights = generator.random((2, 4, 6, 8))
ds_weighted = pyinterp.DescriptiveStatistics(values,
                                             weights=weights,
                                             axis=(1, 2))
print('Weighted mean:')
print(ds_weighted.mean())
Weighted mean:
[[0.44437477 0.59020081 0.52588079 0.62338788 0.52285862 0.44453644
  0.63767415 0.4967181 ]
 [0.51336212 0.46654736 0.6201174  0.45925923 0.53653586 0.4910487
  0.5093799  0.55740906]]

Total running time of the script: (0 minutes 0.476 seconds)

Gallery generated by Sphinx-Gallery