Intro to the stats module¶
from scipy import stats
import numpy as np
from xarray_einstats.stats import XrContinuousRV, rankdata, hmean, skew, median_abs_deviation
from xarray_einstats.tutorial import generate_mcmc_like_dataset
ds = generate_mcmc_like_dataset(11)
ds
<xarray.Dataset> Size: 6kB
Dimensions: (plot_dim: 20, chain: 4, draw: 10, team: 6, match: 12)
Coordinates:
* team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
* chain (chain) int64 32B 0 1 2 3
* draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9
Dimensions without coordinates: plot_dim, match
Data variables:
x_plot (plot_dim) float64 160B 0.0 0.5263 1.053 1.579 ... 8.947 9.474 10.0
mu (chain, draw, team) float64 2kB 0.2296 0.5383 ... 0.4452 2.004
sigma (chain, draw) float64 320B 0.3703 0.00899 0.1398 ... 0.2246 0.2875
score (chain, draw, match) int64 4kB 1 0 0 1 4 1 1 0 ... 0 0 2 1 2 1 0 2Probability distributions¶
Initialization¶
Initialization takes a class from an external library that defines distributions with array inputs and the positional/keyword arguments needed to initialize it. Tested distributions are:
SciPy distributions
PreliZ distributions
SciPy random variables (introduced in SciPy 1.14).
Note
Positional and keyword arguments are broadcasted with all other inputs, but they are then reconstructed and passed as is. For example: preliz.Normal(0, 1) is valid, so XrContinuousRV(pz.Normal, 0, 1) is also valid but scipy.stats.Normal(0, 1) is not valid only keyword arguments are accepted, so the same will apply when wrapping with with xarray-einstats.
Moreover, the use of XrContinuousRV and XrDiscreteRV only changes for some of the methods like pdf/pmf. However, PreliZ always uses pdf, even for discrete distributions so it needs to always be wrapped through XrContinuousRV
dist = stats.norm # SciPy distribution
# import preliz as pz
# dist = pz.Normal # PreliZ distribution
# dist = stats.Normal # SciPy random variable
norm = XrContinuousRV(dist, ds["mu"], ds["sigma"])
Using its methods¶
Once initialized, you can use its methods exactly as you’d use them with frozen scipy distributions or preliz distributions. The only two differences are
They now take scalars or DataArrays as inputs, arrays are only accepted as the arguments on which to evaluate the methods (in scipy docs they are represented by
x,korqdepending on the method)sizebehaves differently in thervsmethod; heresizerepresents theshapeof a single draw. This ensures that you don’t need to care about any broadcasting or alignment of arrays,xarray_einstatsdoes this for you and will return a DataArray with combined shape(*size, *broadcasted_input_shape).
You can generate 100 random draws from the initialized distribution. As we have mentioned and unlike what would happen with scipy the output won’t have shape 100 but instead will have shape 100, *broadcasted_input_shape. xarray generates the broadcasted_input_shape and size is independent from it so you can relax and not care about broadcasting.
norm.rvs(size=100)
<xarray.DataArray (rv_dim0: 100, chain: 4, draw: 10, team: 6)> Size: 192kB -0.1369 0.312 1.122 0.296 -0.3041 2.919 ... 0.2673 0.07624 0.02325 0.3478 2.179 Coordinates: * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f' * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 Dimensions without coordinates: rv_dim0
If the dimension names are not provided, xarray_einstats assings rv_dim# as dimension name as many times as necessary. To define the names manually you can use the dims argument:
norm.rvs(size=(5, 3), dims=["subject", "batch"])
<xarray.DataArray (subject: 5, batch: 3, chain: 4, draw: 10, team: 6)> Size: 29kB 0.1375 0.729 1.113 0.5688 0.1979 3.794 ... -0.1246 0.2988 0.3684 0.1294 1.877 Coordinates: * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f' * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 Dimensions without coordinates: subject, batch
In the output above we’ll often want to provide coordinate values to the new dimensions. .rvs takes a coords argument which can be used for that, but it isn’t much of an improvement over .rvs(...).assign_coords(coords). It is also possible however to skip size and dims altogether and use coords to inform both:
subject = [
"Monstera deliciosa", "Monstera borsigiana", "Monstera siltepecana",
"Monstera variegata", "Monstera pinnatipartita"
]
batch = ["March", "June", "October"]
norm.rvs(coords={"subject": subject, "batch": batch})
<xarray.DataArray (subject: 5, batch: 3, chain: 4, draw: 10, team: 6)> Size: 29kB -0.1794 0.5826 1.101 -0.07705 0.1534 3.711 ... 0.7631 0.2766 0.4849 0.5801 2.014 Coordinates: * subject (subject) <U23 460B 'Monstera deliciosa' ... 'Monstera pinnatipa... * batch (batch) <U7 84B 'March' 'June' 'October' * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f' * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9
The behaviour for other methods is similar:
norm.logpdf(ds["x_plot"])
<xarray.DataArray 'x_plot' (plot_dim: 20, chain: 4, draw: 10, team: 6)> Size: 38kB -0.1177 -0.9821 -4.519 0.06682 0.02491 ... -594.3 -600.6 -561.9 -551.8 -386.4 Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f' Dimensions without coordinates: plot_dim
For convenience, you can also use array_like input which is converted to a DataArray under the hood. In such cases, the dimension name is quantile for ppf and isf, point otherwise. In both cases, the values passed as input are preserved as coordinate values.
norm.ppf([.25, .5, .75])
<xarray.DataArray (quantile: 3, chain: 4, draw: 10, team: 6)> Size: 6kB -0.02018 0.2885 0.8726 -0.204 -0.1332 ... 0.2786 0.2264 0.5523 0.6391 2.198 Coordinates: * quantile (quantile) float64 24B 0.25 0.5 0.75 * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
pdf = norm.pdf(np.linspace(-5, 5))
pdf
<xarray.DataArray (point: 50, chain: 4, draw: 10, team: 6)> Size: 96kB 5.321e-44 2.898e-49 4.753e-60 5.206e-41 ... 3.563e-57 4.449e-55 3.664e-24 Coordinates: * point (point) float64 400B -5.0 -4.796 -4.592 -4.388 ... 4.592 4.796 5.0 * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9 * team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
Plot a subset of the pdf we just calculated with matplotlib.
import matplotlib.pyplot as plt
plt.rcParams["figure.facecolor"] = "white"
fig, ax = plt.subplots()
ax.plot(pdf.point, pdf.sel(team="d", chain=2), color="C0", alpha=.5)
ax.set(xlabel="x", ylabel="pdf of normal distribution", );
Other functions¶
The rest of the functions in the module have a very similar API to their scipy counterparts, the only differences are:
They take
dimsinstead ofaxis. Moreover,dimscan bestror a sequence ofstrinstead of a single integer only as supported byaxis.Arguments that take array_like as values take
DataArrayinputs instead. For example thescaleargument inmedian_abs_deviationThey accept extra arbitrary kwargs, that are passed to
xarray.apply_ufunc.
Here are some examples of using functions in the stats module of xarray_einstats with dims argument instead of axis.
hmean(ds["mu"], dims="team")
<xarray.DataArray 'mu' (chain: 4, draw: 10)> Size: 320B 0.1588 0.2123 0.5543 0.7826 0.1913 0.6035 ... 0.1269 0.712 0.3044 0.1936 0.1223 Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 80B 0 1 2 3 4 5 6 7 8 9
rankdata(ds["score"], dims=("chain", "draw"), method="min")
<xarray.DataArray 'score' (match: 12, chain: 4, draw: 10)> Size: 4kB 14 14 14 14 14 31 14 1 31 14 31 1 14 1 ... 15 15 15 15 15 1 34 15 15 1 34 34 34 Dimensions without coordinates: match, chain, draw
Important
The statistical summaries and other statistical functions can take both DataArray and Dataset. Methods in probability functions and functions in linear algebra module
are tested only on DataArrays.
When using Dataset inputs, you must make sure that all the dimensions in dims are
present in all the DataArrays within the Dataset.
skew(ds[["score", "mu", "sigma"]], dims=("chain", "draw"))
<xarray.Dataset> Size: 176B
Dimensions: (match: 12, team: 6)
Coordinates:
* team (team) <U1 24B 'a' 'b' 'c' 'd' 'e' 'f'
Dimensions without coordinates: match
Data variables:
score (match) float64 96B 1.466 0.2149 0.6788 1.361 ... 1.099 1.156 1.265
mu (team) float64 48B 0.8152 1.84 2.102 1.806 1.091 0.9678
sigma float64 8B 1.314median_abs_deviation(ds)
<xarray.Dataset> Size: 32B
Dimensions: ()
Data variables:
x_plot float64 8B 2.632
mu float64 8B 0.4878
sigma float64 8B 0.39
score float64 8B 1.0%load_ext watermark
%watermark -n -u -v -iv -w -p xarray_einstats,xarray
Last updated: Thu May 22 2025
Python implementation: CPython
Python version : 3.12.7
IPython version : 8.29.0
xarray_einstats: 0.9.0
xarray : 2025.4.0
matplotlib : 3.10.1
numpy : 2.2.6
xarray_einstats: 0.9.0
scipy : 1.15.2
Watermark: 2.5.0