Descriptive Statistics
NumPy offers many statistical functions, but obtaining several statistics from the same array requires processing the data once for each of them. This example shows how to use the DescriptiveStatistics class to obtain several statistics with a single calculation. The calculation algorithm is also incremental, which makes it more numerically stable.
Note
Pébay, P., Terriberry, T.B., Kolla, H. et al. Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights. Comput Stat 31, 1305-1325, 2016, https://doi.org/10.1007/s00180-015-0637-z
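To illustrate what an incremental calculation looks like, here is a minimal sketch of a single-pass update for the mean and variance (Welford's method, which the formulas in the reference above generalize to higher moments and weights). It is not pyinterp's implementation; the function name online_mean_var is hypothetical, and the population variance (division by the count) is an assumption made for the illustration.

```python
import numpy


def online_mean_var(samples):
    """Return (count, mean, population variance) after one pass over samples.

    Hypothetical helper written for illustration; not part of pyinterp.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in samples:
        count += 1
        delta = x - mean
        mean += delta / count
        # Uses the deviation from the old mean and from the updated mean.
        m2 += delta * (x - mean)
    variance = m2 / count if count else float("nan")
    return count, mean, variance


generator = numpy.random.Generator(numpy.random.PCG64(0))
data = generator.random(1000)
count, mean, variance = online_mean_var(data)

# Compare with the usual NumPy functions, which each make their own pass.
print(count, mean, data.mean())
print(variance, data.var())
```

Because each sample only updates three running values, the data never needs to be traversed a second time, and no large sums of squares are accumulated.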
import dask.array
import numpy
import pyinterp
Create a random array.
generator = numpy.random.Generator(numpy.random.PCG64(0))
values = generator.random((2, 4, 6, 8))
Create a DescriptiveStatistics object.
ds = pyinterp.DescriptiveStatistics(values)
The constructor calculates the statistics for the provided data. The results are stored in the instance and can be retrieved with the following methods:
mean
var
std
skewness
kurtosis
min
max
sum
sum_of_weights
count
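For reference, the quantities in this list can be written in terms of central moments. The sketch below computes them with plain NumPy, assuming population (biased) estimators and excess kurtosis; these conventions are assumptions for the illustration and may not match the library's exactly. The function name moments_summary is hypothetical.

```python
import numpy


def moments_summary(x):
    """Summarize x using central moments (illustrative, not pyinterp's code)."""
    x = numpy.asarray(x, dtype=float).ravel()
    mean = x.mean()
    centered = x - mean
    m2 = numpy.mean(centered**2)  # second central moment (population variance)
    m3 = numpy.mean(centered**3)  # third central moment
    m4 = numpy.mean(centered**4)  # fourth central moment
    return {
        "count": x.size,
        "mean": mean,
        "var": m2,
        "std": numpy.sqrt(m2),
        "skewness": m3 / m2**1.5,
        "kurtosis": m4 / m2**2 - 3.0,  # excess kurtosis (assumed convention)
        "min": x.min(),
        "max": x.max(),
        "sum": x.sum(),
    }


summary = moments_summary([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in summary.items():
    print(name, value)
```

For uniformly distributed data the excess kurtosis is theoretically -1.2, so a value close to that for the random array used in this example would be consistent with the excess-kurtosis convention.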
ds.count()
array([384], dtype=uint64)
ds.mean()
array([0.52703051])
The array method returns a structured NumPy array containing all of the calculated statistics.
ds.array()
array([(384, -1.20728744, 0.99720994, 0.52703051, 0.00030069, -0.12457213, 384., 202.37971539, 0.08762152)],
dtype=[('count', '<u8'), ('kurtosis', '<f8'), ('max', '<f8'), ('mean', '<f8'), ('min', '<f8'), ('skewness', '<f8'), ('sum_of_weights', '<f8'), ('sum', '<f8'), ('var', '<f8')])
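The fields of a structured array are read by name. The sketch below rebuilds a record with the dtype and values printed above using NumPy only (so it runs without pyinterp) and shows a few ways of accessing it; the variable names are chosen for illustration.

```python
import numpy

# Same field layout as the output shown above.
dtype = [("count", "<u8"), ("kurtosis", "<f8"), ("max", "<f8"),
         ("mean", "<f8"), ("min", "<f8"), ("skewness", "<f8"),
         ("sum_of_weights", "<f8"), ("sum", "<f8"), ("var", "<f8")]

# Values copied from the output shown above.
record = numpy.array([(384, -1.20728744, 0.99720994, 0.52703051, 0.00030069,
                       -0.12457213, 384.0, 202.37971539, 0.08762152)],
                     dtype=dtype)

# Access a single field by name.
mean = record["mean"][0]

# Derive the standard deviation from the stored variance.
std = numpy.sqrt(record["var"][0])

# Convert the record to a plain dictionary.
as_dict = {name: record[name][0] for name in record.dtype.names}
print(as_dict)
```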
As with NumPy, statistics can be computed along one or more axes.
ds = pyinterp.DescriptiveStatistics(values, axis=(1, 2))
ds.mean()
array([[0.46094551, 0.58985662, 0.56335527, 0.59960438, 0.53755935,
0.46486889, 0.59151122, 0.50117507],
[0.5223734 , 0.50646854, 0.56639677, 0.43645944, 0.53861813,
0.48949667, 0.52384772, 0.53995116]])
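For comparison, the same kind of reduction in plain NumPy looks as follows. Reducing a (2, 4, 6, 8) array over axes 1 and 2 leaves an array of shape (2, 8), which matches the shape of the result above. Each NumPy call below makes its own pass over the data.

```python
import numpy

generator = numpy.random.Generator(numpy.random.PCG64(0))
values = generator.random((2, 4, 6, 8))

# Axes 1 and 2 (sizes 4 and 6) are reduced; axes 0 and 3 are kept.
reduced_mean = values.mean(axis=(1, 2))
reduced_var = values.var(axis=(1, 2))

print(reduced_mean.shape)
print(reduced_var.shape)
```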
The class can also process a Dask array. In this case, calling the constructor triggers the calculation.
ds = pyinterp.DescriptiveStatistics(
    dask.array.from_array(values, chunks=(2, 2, 2, 2)),
    axis=(1, 2))
ds.mean()
array([[0.46094551, 0.58985662, 0.56335527, 0.59960438, 0.53755935,
0.46486889, 0.59151122, 0.50117507],
[0.5223734 , 0.50646854, 0.56639677, 0.43645944, 0.53861813,
0.48949667, 0.52384772, 0.53995116]])
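Chunked processing works because partial statistics computed on separate blocks can be merged exactly. The sketch below shows the pairwise merge for the count, mean, and sum of squared deviations using NumPy only. It illustrates the general idea behind the parallel formulas in the reference cited at the top of this page, not the code pyinterp or Dask actually run; chunk_stats and merge are hypothetical helper names.

```python
import functools

import numpy


def chunk_stats(chunk):
    """Return (count, mean, sum of squared deviations) for one chunk."""
    chunk = numpy.asarray(chunk, dtype=float).ravel()
    mean = chunk.mean()
    return chunk.size, mean, numpy.sum((chunk - mean) ** 2)


def merge(a, b):
    """Combine the statistics of two disjoint chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2


generator = numpy.random.Generator(numpy.random.PCG64(0))
data = generator.random(384)

# Split the data into four blocks, summarize each, then merge the summaries.
partials = [chunk_stats(block) for block in numpy.array_split(data, 4)]
n, mean, m2 = functools.reduce(merge, partials)

print(n, mean, m2 / n)
print(data.size, data.mean(), data.var())
```

The merged values agree with the statistics of the full array up to rounding error, regardless of how the data is split.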
Finally, it’s possible to calculate weighted statistics.
weights = generator.random((2, 4, 6, 8))
ds = pyinterp.DescriptiveStatistics(values, weights=weights, axis=(1, 2))
ds.mean()
array([[0.44437477, 0.59020081, 0.52588079, 0.62338788, 0.52285862,
0.44453644, 0.63767415, 0.4967181 ],
[0.51336212, 0.46654736, 0.6201174 , 0.45925923, 0.53653586,
0.4910487 , 0.5093799 , 0.55740906]])
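As a point of comparison, a weighted mean along the same axes can be written directly in NumPy as the sum of the weighted values divided by the sum of the weights. This is a sketch of the standard definition, shown under the assumption that it is the quantity of interest here; it does not reproduce the library's internals.

```python
import numpy

generator = numpy.random.Generator(numpy.random.PCG64(0))
values = generator.random((2, 4, 6, 8))
weights = generator.random((2, 4, 6, 8))

# Weighted mean over axes 1 and 2: sum(w * x) / sum(w).
weighted_mean = (values * weights).sum(axis=(1, 2)) / weights.sum(axis=(1, 2))
print(weighted_mean.shape)

# With equal weights, the weighted mean reduces to the ordinary mean.
ones = numpy.ones_like(values)
equal_weight_mean = (values * ones).sum(axis=(1, 2)) / ones.sum(axis=(1, 2))
print(numpy.allclose(equal_weight_mean, values.mean(axis=(1, 2))))
```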