How frequent are 'unusual' sample sizes?
Downsampling large data sets is often necessary to make analysis tractable. This typically has little effect on statistics computed over the entire sample; when examining subgroups, however, some strata may end up under- or over-represented purely by chance.
How frequent are these ‘unusual’ sample sizes?
Theoretical derivation #
Suppose we have a data set containing \( n \) observations. We wish to draw a random sample without replacement where each observation has probability \( p \) of being included.
Call \( X \) the random variable representing the size of the sample. Since each observation is included independently with probability \( p \) and, as we sample without replacement, appears at most once, the number of inclusions is a sum of \( n \) independent Bernoulli(\( p \)) trials. It follows that \( X \sim \text{Binomial}(n, p) \), and so our sample size will be \( \mathbb{E}[X] = n p \) on average.
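As a quick sanity check, here is a minimal simulation sketch (the replicate count and variable names are illustrative, not part of the derivation): keeping each observation independently with probability \( p \) yields sample sizes with the Binomial mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10_000, 0.001, 10_000

# Keep each of the n observations independently with probability p;
# the size of each simulated sample is then one draw of X.
sample_sizes = np.array([(rng.random(n) < p).sum() for _ in range(reps)])

print(sample_sizes.mean())  # close to n * p = 10
print(sample_sizes.std())   # close to sqrt(n * p * (1 - p)) ≈ 3.16
```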
Given an exceedance level \( l > 0 \), define a ‘usual’ sample size as falling within the closed interval \( [ x_{lb}, x_{ub} ] \), where: $$ x_{lb} = \left\lceil \frac{n p}{1 + l} \right\rceil \quad\text{and}\quad x_{ub} = \left\lfloor n p (1 + l) \right\rfloor \text{.} $$ For example, for \( l = 1 \), a ‘usual’ sample size will be between \( \lceil n p / 2 \rceil \) and \( \lfloor 2 n p \rfloor \), i.e. between half and double its expected value.
Conversely, an ‘unusual’ sample size falls outside the interval \( [ x_{lb}, x_{ub} ] \). Bigger values of \( l \) correspond to larger intervals and a less stringent definition of ‘unusual’.
The probability of an ‘unusual’ sample size is given by: $$ \begin{align*} \mathbb{P} \left[ X \notin [ x_{lb}, x_{ub} ] \right] &= \mathbb{P} [ X < x_{lb} \lor X > x_{ub} ] \\ &= \mathbb{P} [ X < x_{lb} ] + \mathbb{P} [ X > x_{ub} ] \\ &= F_{X} (x_{lb} - 1) + \left[ 1 - F_{X} (x_{ub}) \right] \text{,} \end{align*} $$ where \( F_{X} \) is the cumulative distribution function of \( X \).
Numerical calculation #
The probability of an ‘unusual’ sample size can be computed using NumPy and SciPy:
```python
import numpy as np
from scipy.stats import binom


def compute_prob_unusual_sample_size(n, p, l=1):
    # Sample size X ~ Binomial(n, p)
    X = binom(n, p)
    # Bounds of the 'usual' interval [x_lb, x_ub]; the maximum guards
    # against x_ub < x_lb when n * p is small
    x_lb = (np.ceil(n * p / (1 + l))).astype(np.int64)
    x_ub = np.maximum(x_lb, (np.floor(n * p * (1 + l))).astype(np.int64))
    # P[X < x_lb] + P[X > x_ub]
    return X.cdf(x_lb - 1) + (1 - X.cdf(x_ub))
```
Fix \( l = 1 \) and consider the case of subsampling a data set of size \( n = 10^{9} \) down to \( 10^{6} \), corresponding to \( p = 0.001 \).
Using the function above, the probability of an ‘unusual’ sample size is less than the machine precision, i.e. practically zero. This probability, however, increases significantly if we consider a subgroup but keep \( p \) constant.
For example, a subgroup of 10,000 observations will be under- or over-represented (i.e. have fewer than 5 or more than 20 observations included in the sample, against the expected 10) about 3% of the time. This increases to almost 45% for a subgroup of 1,000 observations.
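These figures can be reproduced with the function defined above (a usage sketch; the printed values are rounded):

```python
# Full data set: n = 10^9, p = 0.001 (effectively zero probability)
print(compute_prob_unusual_sample_size(10**9, 0.001))

# Subgroups of 10,000 and 1,000 observations at the same p
print(compute_prob_unusual_sample_size(10_000, 0.001))  # roughly 0.03
print(compute_prob_unusual_sample_size(1_000, 0.001))   # roughly 0.45
```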
Conclusions #
The probability of encountering ‘unusual’ sample sizes when conducting stratified analyses is non-negligible. This has important consequences, as strata may be significantly under- or over-represented, leading to skewed results.
For subgroup analyses, stratified sampling is a preferable option to ensure that each subgroup is adequately represented.
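One possible implementation, sketched below under the assumption that the data live in a pandas DataFrame with a column identifying each subgroup (the function and column names are illustrative), samples the same fraction within every stratum:

```python
import pandas as pd

def stratified_subsample(df: pd.DataFrame, stratum_col: str, frac: float, seed=None) -> pd.DataFrame:
    # Sampling the same fraction within each stratum keeps every
    # subgroup's size in the subsample (nearly) deterministic instead
    # of binomially distributed.
    return df.groupby(stratum_col).sample(frac=frac, random_state=seed)

# e.g. stratified_subsample(df, "subgroup", frac=0.001, seed=42)
```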