
How frequent are 'unusual' sample sizes?


Downsampling large data sets is often necessary to make analysis tractable. This typically has little effect on statistics computed over the entire sample; when examining subgroups, however, some strata may become under- or over-represented purely through random sampling variability.

How frequent are these ‘unusual’ sample sizes?

Theoretical derivation #

Suppose we have a data set containing \( n \) observations. We wish to draw a random sample without replacement where each observation has probability \( p \) of being included.

Call \( X \) the random variable representing the size of the sample. Under this scheme, each observation is included with probability \( p \) independently of all other observations. It follows that \( X \sim \text{Binomial}(n, p) \), and so our sample size will be \( \mathbb{E}[X] = n p \) on average.
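
As a quick sanity check, we can simulate this sampling scheme and compare the empirical moments of the sample size against the binomial model. This is a minimal sketch; the values of n, p and the number of trials are arbitrary:

import numpy as np

rng = np.random.default_rng(42)
n, p, n_trials = 1_000, 0.01, 10_000

# Each trial includes every observation independently with probability p;
# the sample size is the number of included observations.
sample_sizes = (rng.random((n_trials, n)) < p).sum(axis=1)

print(sample_sizes.mean())  # ~10, i.e. n * p
print(sample_sizes.var())   # ~9.9, i.e. n * p * (1 - p)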

Given an exceedance level \( l > 0 \), define a ‘usual’ sample size as falling within the closed interval \( [ x_{lb}, x_{ub} ] \), where: $$ x_{lb} = \left\lceil \frac{n p}{1 + l} \right\rceil \quad\text{and}\quad x_{ub} = \left\lfloor n p (1 + l) \right\rfloor \text{.} $$ For example, for \( l = 1 \), a ‘usual’ sample size will be between \( \lceil n p / 2 \rceil \) and \( \lfloor 2 n p \rfloor \), i.e. between half and double its expected value.

Conversely, an ‘unusual’ sample size falls outside the interval \( [ x_{lb}, x_{ub} ] \). Bigger values of \( l \) correspond to larger intervals and a less stringent definition of ‘unusual’.

The probability of an ‘unusual’ sample size is given by: $$ \begin{align*} \mathbb{P} \left[ X \notin [ x_{lb}, x_{ub} ] \right] &= \mathbb{P} [ X < x_{lb} \lor X > x_{ub} ] \\ &= \mathbb{P} [ X < x_{lb} ] + \mathbb{P} [ X > x_{ub} ] \\ &= F_{X} (x_{lb} - 1) + \left[ 1 - F_{X} (x_{ub}) \right] \text{,} \end{align*} $$ where \( F_{X} \) is the cumulative distribution function of \( X \); the last step uses the fact that \( X \) only takes integer values, so that \( \mathbb{P} [ X < x_{lb} ] = F_{X} (x_{lb} - 1) \).

Numerical calculation #

The probability of an ‘unusual’ sample size can be computed using NumPy and SciPy:

import numpy as np
from scipy.stats import binom

def compute_prob_unusual_sample_size(n, p, l=1):
    # Sample size X ~ Binomial(n, p) under independent inclusion.
    X = binom(n, p)
    # Bounds of the 'usual' interval for exceedance level l.
    x_lb = int(np.ceil(n * p / (1 + l)))
    # Guard against a degenerate interval (x_ub < x_lb) when n * p is small.
    x_ub = max(x_lb, int(np.floor(n * p * (1 + l))))
    # P[X < x_lb] + P[X > x_ub]
    return X.cdf(x_lb - 1) + (1 - X.cdf(x_ub))

Fix \( l = 1 \) and consider the case of subsampling a data set of size \( n = 10^{9} \) down to \( 10^{6} \), corresponding to \( p = 0.001 \).

Using the function above, the probability of an ‘unusual’ sample size is less than the machine precision, i.e. practically zero. This probability, however, increases significantly if we consider a subgroup but keep \( p \) constant.

For example, a subgroup of 10,000 observations will be under- or over-represented (i.e. have fewer than five or more than 20 observations included in the sample, against an expected 10) about 3% of the time. This rises to almost 45% for a subgroup of 1,000 observations.
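
These figures can be reproduced with the function defined above; a quick sketch, with the printed values being rounded approximations:

p = 0.001

# Full data set of n = 10^9 observations: practically zero.
print(compute_prob_unusual_sample_size(10**9, p))

# Subgroups of 10,000 and 1,000 observations.
print(compute_prob_unusual_sample_size(10_000, p))  # ~0.03
print(compute_prob_unusual_sample_size(1_000, p))   # ~0.45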

Conclusions #

The probability of encountering ‘unusual’ sample sizes when conducting stratified analyses is non-negligible. This has important consequences, as strata may be significantly under- or over-represented, leading to skewed results.

For subgroup analyses, stratified sampling is a preferable option to ensure that each subgroup is adequately represented.
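
One way to implement this in Python is via pandas’ grouped sampling. The sketch below is illustrative only: the data frame and its stratum column are made up, and frac plays the role of \( p \):

import numpy as np
import pandas as pd

# Toy data set: 100,000 observations across three unevenly sized strata.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "stratum": rng.choice(["a", "b", "c"], size=100_000, p=[0.9, 0.09, 0.01]),
    "value": rng.normal(size=100_000),
})

# Sample 1% within each stratum: every subgroup ends up represented
# in (almost exactly) the same proportion as in the full data set.
sample = df.groupby("stratum").sample(frac=0.01, random_state=42)
print(sample["stratum"].value_counts())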