How frequent are 'unusual' sample sizes?
Downsampling large data sets is often necessary to make analysis tractable. This typically has little effect on statistics computed over the entire sample; when examining subgroups, however, some strata may end up under- or over-represented purely by chance.
How frequent are these ‘unusual’ sample sizes?
Theoretical derivation #
Suppose we have a data set containing \( n \) observations. We wish to draw a random sample without replacement where each observation has probability \( p \) of being included.
Call \( X \) the random variable representing the size of the sample. Since each observation is included independently with probability \( p \) and, as we sample without replacement, appears at most once, the number of inclusions is a sum of \( n \) independent Bernoulli(\( p \)) trials. It follows that \( X \sim \text{Binomial}(n, p) \), and so our sample size will be \( \mathbb{E}[X] = n p \) on average.
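As a quick sanity check, here is a minimal simulation sketch (the replicate count and variable names are illustrative, not part of the derivation): keeping each observation independently with probability \( p \) yields sample sizes with the Binomial mean and spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10_000, 0.001, 10_000

# Keep each of the n observations independently with probability p;
# the size of each simulated sample is then one draw of X.
sample_sizes = np.array([(rng.random(n) < p).sum() for _ in range(reps)])

print(sample_sizes.mean())  # close to n * p = 10
print(sample_sizes.std())   # close to sqrt(n * p * (1 - p)) ≈ 3.16
```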
Given an exceedance level \( l > 0 \), define a ‘usual’ sample size as falling within the closed interval \( [ x_{lb}, x_{ub} ] \), where: $$ x_{lb} = \left\lceil \frac{n p}{1 + l} \right\rceil \quad\text{and}\quad x_{ub} = \left\lfloor n p (1 + l) \right\rfloor \text{.} $$ For example, for \( l = 1 \), a ‘usual’ sample size will be between \( \lceil n p / 2 \rceil \) and \( \lfloor 2 n p \rfloor \), i.e. between half and double its expected value.
Conversely, an ‘unusual’ sample size falls outside the interval \( [ x_{lb}, x_{ub} ] \). Bigger values of \( l \) correspond to larger intervals and a less stringent definition of ‘unusual’.
The probability of an ‘unusual’ sample size is given by: $$ \begin{align*} \mathbb{P} \left[ X \notin [ x_{lb}, x_{ub} ] \right] &= \mathbb{P} [ X < x_{lb} \lor X > x_{ub} ] \\ &= \mathbb{P} [ X < x_{lb} ] + \mathbb{P} [ X > x_{ub} ] \\ &= F_{X} (x_{lb} - 1) + \left[ 1 - F_{X} (x_{ub}) \right] \text{,} \end{align*} $$ where \( F_{X} \) is the cumulative distribution function of \( X \).
Numerical calculation #
The probability of an ‘unusual’ sample size can be computed using NumPy and SciPy:
```python
import numpy as np
from scipy.stats import binom


def compute_prob_unusual_sample_size(n, p, l=1):
    # Sample size X ~ Binomial(n, p)
    X = binom(n, p)
    # Bounds of the 'usual' interval [x_lb, x_ub]; the maximum guards
    # against x_ub < x_lb when n * p is small
    x_lb = (np.ceil(n * p / (1 + l))).astype(np.int64)
    x_ub = np.maximum(x_lb, (np.floor(n * p * (1 + l))).astype(np.int64))
    # P[X < x_lb] + P[X > x_ub]
    return X.cdf(x_lb - 1) + (1 - X.cdf(x_ub))
```
Fix \( l = 1 \) and consider the case of subsampling a data set of size \( n = 10^{9} \) down to \( 10^{6} \), corresponding to \( p = 0.001 \).
Using the function above, the probability of an ‘unusual’ sample size is less than the machine precision, i.e. practically zero. This probability, however, increases significantly if we consider a subgroup but keep \( p \) constant.
For example, a subgroup of 10,000 observations will be under- or over-represented (i.e. have fewer than 5 or more than 20 observations included in the sample, against the expected 10) about 3% of the time. This increases to almost 45% for a subgroup of 1,000 observations.
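These figures can be reproduced with the function defined above (a usage sketch; the printed values are rounded):

```python
# Full data set: n = 10^9, p = 0.001 (effectively zero probability)
print(compute_prob_unusual_sample_size(10**9, 0.001))

# Subgroups of 10,000 and 1,000 observations at the same p
print(compute_prob_unusual_sample_size(10_000, 0.001))  # roughly 0.03
print(compute_prob_unusual_sample_size(1_000, 0.001))   # roughly 0.45
```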
Conclusions #
The probability of encountering ‘unusual’ sample sizes when conducting stratified analyses is non-negligible. This has important consequences, as strata may be significantly under- or over-represented, leading to skewed results.
For subgroup analyses, stratified sampling is a preferable option to ensure that each subgroup is adequately represented.
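One possible implementation, sketched below under the assumption that the data live in a pandas DataFrame with a column identifying each subgroup (the function and column names are illustrative), samples the same fraction within every stratum:

```python
import pandas as pd

def stratified_subsample(df: pd.DataFrame, stratum_col: str, frac: float, seed=None) -> pd.DataFrame:
    # Sampling the same fraction within each stratum keeps every
    # subgroup's size in the subsample (nearly) deterministic instead
    # of binomially distributed.
    return df.groupby(stratum_col).sample(frac=frac, random_state=seed)

# e.g. stratified_subsample(df, "subgroup", frac=0.001, seed=42)
```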