What is Sampling? - Flux+Form

What it is

A working definition of sampling.

Statistical sampling selects a subset of observations from a population to estimate the population’s properties without measuring every member. Simple random sampling gives every member of the population an equal probability of selection, which makes estimates unbiased; the law of large numbers then ensures those estimates converge on the population values as the sample grows. Stratified sampling draws from each key subgroup separately so that subgroups are represented at specified proportions, which is important when the population contains rare subgroups that random sampling might underrepresent. The sampling method determines whether conclusions drawn from the sample are valid generalizations about the population. Sampling bias, where the sample is systematically unrepresentative, is distinct from ordinary sampling error (the random variation between samples) and is a major source of bias in both statistical analyses and machine learning models trained on historical data.
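As a minimal sketch of the two methods described above, the following Python uses only the standard library. The population, the `segment` field, and the sampling fractions are hypothetical values chosen for illustration.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Draw n members without replacement, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(population, key, fractions, seed=0):
    """Sample each stratum separately at its own sampling fraction.

    key: function mapping a member to its stratum label.
    fractions: dict of stratum label -> fraction of that stratum to draw.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[key(member)].append(member)
    sample = []
    for label, members in strata.items():
        k = round(len(members) * fractions[label])
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 1,000 customers, 2% of them in a rare "vip" segment.
population = [
    {"id": i, "segment": "vip" if i < 20 else "standard"} for i in range(1000)
]

srs = simple_random_sample(population, 100)
strat = stratified_sample(
    population,
    key=lambda m: m["segment"],
    fractions={"vip": 0.5, "standard": 0.1},  # deliberately oversample the rare stratum
)

vip_in_srs = sum(m["segment"] == "vip" for m in srs)      # varies from draw to draw
vip_in_strat = sum(m["segment"] == "vip" for m in strat)  # fixed by the fractions
```

The stratified draw guarantees a fixed number of rare-segment members, while the simple random draw may contain very few or none. When strata are sampled at different fractions, as here, population-level estimates need to weight each stratum back to its true share.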

In machine learning, sampling refers both to how training data is selected from a larger dataset and to how outputs are drawn from probabilistic models during inference. Training data sampling includes random subsampling to create train and validation splits, bootstrapping (sampling with replacement) to create multiple training sets for ensemble methods, and importance sampling, which draws examples with non-uniform probability (for example, favoring rare but informative ones) and reweights them to correct for the altered distribution, so that the model does not underperform on minority cases. During inference, language models and other generative models use temperature-controlled sampling to draw from the output probability distribution. Sampling with high temperature produces more diverse and creative outputs; sampling with low temperature, or greedy decoding (always selecting the most probable token), produces more predictable, consistent outputs.
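The effect of temperature can be sketched in a few lines. This is an illustrative implementation of temperature-scaled softmax sampling, not the code of any particular model or API; the `logits` values are hypothetical scores for four candidate tokens.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Greedy decoding when temperature is 0, otherwise a random draw."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
rng = random.Random(0)

low = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
high = softmax_with_temperature(logits, 1.5)  # mass spreads across all tokens
greedy_choice = sample_token(logits, 0, rng)  # always the index of the top token
```

Dividing the logits by a small temperature widens the gaps between them, so the top token takes nearly all of the probability; a large temperature narrows the gaps and makes lower-ranked tokens more likely to be drawn.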

Negative sampling is a training technique used in recommendation systems and word embedding models to create useful negative training examples from unlabeled data. In word2vec and similar embedding models, negative sampling teaches the model that a center word and a randomly sampled non-context word are unrelated, balancing the positive training signal from observed co-occurrences. In recommendation models, negative sampling treats items that a user did not interact with as implicit negatives for training, with various strategies for choosing which non-interacted items to use as negatives and how many negative examples to pair with each positive interaction.
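For the recommendation case, a minimal sketch of negative sampling might look like the following. It assumes uniform sampling over non-interacted items, which is only one of the strategies mentioned above; the `interactions` and `catalog` data and the function name are hypothetical.

```python
import random

def sample_negatives(user_items, all_items, num_negatives, seed=0):
    """Pair each observed interaction with sampled non-interacted items.

    Returns (user, item, label) triples: label 1 for an observed
    interaction, label 0 for a sampled implicit negative.
    """
    rng = random.Random(seed)
    examples = []
    for user, positives in user_items.items():
        # Items the user never interacted with are the negative candidates.
        candidates = [item for item in all_items if item not in positives]
        for item in sorted(positives):
            examples.append((user, item, 1))
            k = min(num_negatives, len(candidates))
            for negative in rng.sample(candidates, k):
                examples.append((user, negative, 0))
    return examples

# Hypothetical interaction log and item catalog.
interactions = {"u1": {"a", "b"}, "u2": {"c"}}
catalog = ["a", "b", "c", "d", "e", "f"]

training = sample_negatives(interactions, catalog, num_negatives=2)
```

With two negatives per positive, the three observed interactions yield nine training examples. Production systems often replace the uniform draw with popularity-weighted or hard-negative sampling, which changes only how `candidates` is sampled.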

Why ad agencies care

Why sampling methodology determines the validity of AI experiments and the quality of generative AI outputs.

A working ad agency running A/B tests, training audience models, or generating AI content at scale is making sampling decisions at every step. The sampling method used to split a test audience affects whether the test result is a valid estimate of campaign performance. The sampling method used to construct training data affects whether the model generalizes to the full population or only to the sample it was trained on. The sampling temperature used to draw outputs from a language model affects whether the generated content is appropriately varied or monotonously similar. Sampling is not a low-level technical detail; it is a first-order factor in whether AI and analytics work produces valid, useful results.

Improper test audience sampling produces invalid A/B test results that lead to wrong budget and creative decisions. A creative A/B test that assigns treatment and control based on day of week (users reached on Monday and Wednesday see creative A; users reached on Tuesday and Thursday see creative B) is not a valid random assignment, because day of week is confounded with engagement patterns, audience composition, and competitive intensity. The observed difference between creative A and creative B performance reflects both genuine creative quality differences and systematic day-of-week differences in audience behavior, so the result cannot be attributed to creative quality alone. Proper test design requires random user-level assignment that breaks the correlation between treatment and any potential confounding variable.
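One common way to implement user-level assignment is to hash a stable user identifier into buckets. The sketch below is an assumption about how this could be done, not a description of any specific ad platform; the function name, the experiment label, and the ID format are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, control_buckets=50):
    """Deterministically map a user to 'control' or 'treatment'.

    The experiment name is mixed into the hash so that the same user
    can land in different groups across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer from 0 to 99
    return "control" if bucket < control_buckets else "treatment"

# Hypothetical user IDs.
user_ids = [f"user-{i}" for i in range(10_000)]
groups = [assign_variant(uid, "creative-test") for uid in user_ids]
treatment_share = groups.count("treatment") / len(groups)  # close to 0.5
```

Because the assignment depends only on the hashed ID, a user sees the same creative every day of the week, and day-of-week effects fall equally on both groups.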

Temperature sampling in generative AI controls the diversity-quality tradeoff in content production workflows. A content generation system using a language model set to temperature 0.2 will produce very similar outputs for the same prompt across multiple generations, which is appropriate for content that must be consistent and predictable but produces a low-diversity creative portfolio. Setting temperature to 0.9 produces more diverse outputs that explore more of the model’s output space, but a larger share of them will be off-brand, incoherent, or low-quality. The optimal temperature for a specific production task is determined empirically by generating a sample of outputs at multiple temperatures and evaluating the tradeoff between diversity and quality at each setting, then applying quality filtering to keep only outputs above the acceptance threshold.

Oversampling rare conversion events in propensity model training improves the model’s ability to identify low-frequency outcomes, provided the output scores are recalibrated afterward. A purchase propensity model trained on a dataset with a 1% positive rate (purchases) and a 99% negative rate tends to rarely predict positive, because always predicting negative already achieves 99% accuracy. Oversampling the positive class (or, alternatively, undersampling the negative class) during training creates a more balanced training distribution that forces the model to learn what distinguishes purchasers from non-purchasers rather than simply learning to predict the majority class. Because resampling inflates the apparent base rate, raw scores from the resulting model overstate purchase probability. The oversampling factor should be reported alongside model performance metrics so that output scores can be recalibrated to reflect the true base rate in the population the model will score.
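The recalibration step can be sketched as a simple odds correction. Oversampling positives by a factor multiplies the odds of a positive by that factor, so dividing the predicted odds by the same factor maps scores back to the original base rate. This assumes the model is well calibrated on the resampled data; the function names and the toy dataset are hypothetical.

```python
import random

def oversample_positives(rows, label_key, factor, seed=0):
    """Repeat each positive row so positives appear `factor` times as often."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        copies = factor if row[label_key] == 1 else 1
        out.extend([row] * copies)
    rng.shuffle(out)
    return out

def recalibrate(score, factor):
    """Map a score from the oversampled distribution to the true base rate.

    Equivalent to dividing the predicted odds by `factor`:
    p_true = s / (s + factor * (1 - s)).
    """
    return score / (score + factor * (1.0 - score))

# Hypothetical dataset: 1,000 rows with a 1% purchase rate.
rows = [{"id": i, "purchased": 1 if i < 10 else 0} for i in range(1000)]
balanced = oversample_positives(rows, "purchased", factor=99)

positive_share = sum(r["purchased"] for r in balanced) / len(balanced)  # 0.5
corrected = recalibrate(0.5, factor=99)  # a score of 0.5 maps back to about 0.01
```

A score of 0.5 on the balanced data corresponds to the average subscriber, whose true purchase probability is the 1% base rate, which is what the correction returns.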

In practice

What sampling looks like inside a working ad agency.

An agency is designing a test to measure the lift from a new email subject line strategy that uses personalized dynamic content versus the client’s standard templated subject lines. The email audience is 280,000 active subscribers. The existing email system assigns treatment based on the first letter of the subscriber’s last name (A through M receive treatment, N through Z receive control), which the agency identifies as invalid because last name distribution correlates with demographic and geographic patterns that affect open rates independently of the subject line.

The agency redesigns the test with proper random assignment: a hash of each subscriber’s unique ID modulo 100 determines assignment, with values 0 through 49 assigned to control and 50 through 99 to treatment. The hash is deterministic, but it produces an approximately 50/50 split that behaves like random assignment and is independent of any subscriber attribute. Sample size analysis using the prior week’s open rate (21.3%), a minimum detectable effect of 2 percentage points (roughly a 10% relative improvement), 80% statistical power, and 95% confidence requires roughly 6,800 subscribers per group under the standard two-proportion normal approximation. With 140,000 subscribers per group, the test has about 20 times the required minimum sample and very high statistical power.

The test runs over a single send to all 280,000 subscribers. Treatment group open rate: 24.1%. Control group open rate: 21.4%. The 2.7 percentage point lift is statistically significant with p less than 0.001. The agency recommends adopting the personalized subject line strategy and estimates the annual revenue impact based on the open rate lift, click rate, conversion rate from email, and annual email frequency.
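The sample size and significance figures in this example can be checked with a short calculation. The sketch below assumes a two-sided test and the normal approximation for two proportions with unpooled variance in the sample size formula; other calculators and more conservative assumptions give somewhat different figures. The function names are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, mde, alpha=0.05, power=0.80):
    """Subscribers needed per group to detect an absolute lift of `mde`."""
    p_treat = p_control + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

def two_proportion_z_test(p_a, n_a, p_b, n_b):
    """Return the z statistic and two-sided p-value for a difference in rates."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Inputs taken from the example: 21.3% baseline, 2-point minimum detectable effect.
n_required = sample_size_per_group(0.213, 0.02)

# Observed open rates: 24.1% treatment vs 21.4% control, 140,000 per group.
z, p_value = two_proportion_z_test(0.241, 140_000, 0.214, 140_000)
```

Under these assumptions the required sample comes out near 6,800 per group, and the observed lift corresponds to a z statistic of roughly 17, far beyond the threshold for p less than 0.001.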


Build the statistical foundations that produce valid experiments and well-trained AI models through The Creative Cadence Workshop.

