Note
Go to the end to download the full example code. or to run this example in your browser via Binder
Descriptive Statistics¶
While NumPy provides a wide range of statistical functions, calculating multiple
statistical variables from the same array often requires multiple passes over
the data. The pyinterp.DescriptiveStatistics
class offers a more
efficient solution by computing several statistical variables in a single pass.
This approach is not only faster but also more numerically stable, thanks to its
incremental calculation algorithm.
Note
This implementation is based on the following paper:
Pébay, P., Terriberry, T.B., Kolla, H. et al. Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights. Comput Stat 31, 1305-1325, 2016, https://doi.org/10.1007/s00180-015-0637-z
Basic Usage¶
Let’s start by creating a random array and using it to initialize the
pyinterp.DescriptiveStatistics
class.
import dask.array
import numpy
import pyinterp
generator = numpy.random.Generator(numpy.random.PCG64(0))
values = generator.random((2, 4, 6, 8))
ds = pyinterp.DescriptiveStatistics(values)
Once the object is created, you can access various statistical variables, such as the count, mean, variance, standard deviation, skewness, kurtosis, minimum, maximum, and sum.
print(f'Count: {ds.count()}')
print(f'Mean: {ds.mean()}')
print(f'Variance: {ds.var()}')
Count: [384]
Mean: [0.52703051]
Variance: [0.08762152]
You can also get all the calculated statistical variables as a structured NumPy array.
stats_array = ds.array()
print('Structured array of statistics:')
print(stats_array)
Structured array of statistics:
[(384, -1.20728744, 0.99720994, 0.52703051, 0.00030069, -0.12457213, 384., 202.37971539, 0.08762152)]
Computing Statistics Along an Axis¶
Similar to NumPy, you can compute statistics along a specific axis by
providing the axis
parameter.
ds_axis = pyinterp.DescriptiveStatistics(values, axis=(1, 2))
print('Mean along axis (1, 2):')
print(ds_axis.mean())
Mean along axis (1, 2):
[[0.46094551 0.58985662 0.56335527 0.59960438 0.53755935 0.46486889
0.59151122 0.50117507]
[0.5223734 0.50646854 0.56639677 0.43645944 0.53861813 0.48949667
0.52384772 0.53995116]]
Working with Dask Arrays¶
The pyinterp.DescriptiveStatistics
class also supports Dask
arrays, allowing you to work with datasets that are larger than memory.
dask_values = dask.array.from_array(values, chunks=(2, 2, 2, 2))
ds_dask = pyinterp.DescriptiveStatistics(dask_values, axis=(1, 2))
print('Mean with Dask array:')
print(ds_dask.mean())
Mean with Dask array:
[[0.46094551 0.58985662 0.56335527 0.59960438 0.53755935 0.46486889
0.59151122 0.50117507]
[0.5223734 0.50646854 0.56639677 0.43645944 0.53861813 0.48949667
0.52384772 0.53995116]]
Weighted Statistics¶
You can also calculate weighted statistics by providing a weights
array.
weights = generator.random((2, 4, 6, 8))
ds_weighted = pyinterp.DescriptiveStatistics(values,
weights=weights,
axis=(1, 2))
print('Weighted mean:')
print(ds_weighted.mean())
Weighted mean:
[[0.44437477 0.59020081 0.52588079 0.62338788 0.52285862 0.44453644
0.63767415 0.4967181 ]
[0.51336212 0.46654736 0.6201174 0.45925923 0.53653586 0.4910487
0.5093799 0.55740906]]
Total running time of the script: (0 minutes 0.476 seconds)