AI Glossary · Letter I

Importance Sampling.

A statistical technique for estimating properties of one probability distribution by using samples drawn from a different distribution, with each sample weighted by the ratio of the two distributions’ probabilities at that point. Importance sampling is used in offline evaluation of recommendation and ranking systems, in causal inference for campaign effectiveness, and in reinforcement learning, making it a foundational technique for the kind of counterfactual analysis that agencies use to evaluate AI system changes without running live experiments.

Also known as IS weighting, weighted sampling, importance weights

What it is

A working definition of importance sampling.

Importance sampling addresses the problem of estimating an expected value under a target distribution when samples are only available from a different source distribution. If direct sampling from the target distribution is expensive, impossible, or would require running a live experiment, importance sampling uses existing samples from the source distribution and reweights each one by the ratio of the target distribution’s probability to the source distribution’s probability at that sample point. The resulting weighted average is an unbiased estimator of the expectation under the target distribution, provided the source distribution assigns nonzero probability to every outcome the target distribution can produce. The variance of the estimator depends on how different the two distributions are: when they are similar, the weights are near 1 and variance is low; when they are very different, a few weights become very large and inflate the variance.
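As a rough illustration of the reweighting described above, the sketch below estimates the mean of a target distribution using only samples from a source distribution. The two normal distributions, the function names, and the sample size are assumptions chosen for the example, not part of any particular system.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_mean(f, n_samples, seed=0):
    """Estimate E_target[f(X)] using samples drawn only from the source.

    Illustrative assumption: the source is N(0, 2) and the target is N(1, 1).
    The source is wider than the target, so the weights stay well behaved.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 2.0)  # sample from the source distribution
        # Importance weight: target density divided by source density.
        weight = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)
        total += weight * f(x)
    return total / n_samples


# The true mean of the target distribution is 1.0.
estimate = importance_sampling_mean(lambda x: x, n_samples=100_000)
print(f"Estimated target mean: {estimate:.3f}")
```

If the source were much narrower than the target, a handful of samples in the tails would carry very large weights and the estimate would become unstable, which is the variance problem noted above.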

In the context of recommendation and ranking system evaluation, importance sampling enables offline policy evaluation: estimating how a new recommendation policy would have performed on historical user interactions if it had been deployed instead of the policy that was actually running. The historical interactions were generated under the logging policy, but by reweighting each interaction by the ratio of the new policy’s probability of taking that action to the logging policy’s probability, the expected performance under the new policy can be estimated from historical data without running a live A/B test. This offline evaluation capability is valuable because it allows safe evaluation of many candidate policies before choosing which to test live, reducing the number of suboptimal policies that expose users to worse experiences during testing.
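The reweighting of logged interactions can be sketched as an inverse propensity scoring (IPS) estimator. This is a minimal sketch under stated assumptions: the `LoggedInteraction` record, the `ips_estimate` function, the article names, and the toy log are all hypothetical, and the logging policy’s probabilities are assumed to have been recorded at serving time.

```python
from dataclasses import dataclass


@dataclass
class LoggedInteraction:
    context: str         # e.g. a user segment (hypothetical)
    action: str          # item the logging policy actually showed
    reward: float        # observed outcome, such as a click
    logging_prob: float  # probability the logging policy gave to `action`


def ips_estimate(logs, target_policy):
    """Estimate the mean reward the target policy would have earned.

    `target_policy(context)` returns a dict mapping actions to probabilities.
    """
    total = 0.0
    for row in logs:
        target_prob = target_policy(row.context).get(row.action, 0.0)
        # Importance weight: new policy's probability / logging policy's.
        weight = target_prob / row.logging_prob
        total += weight * row.reward
    return total / len(logs)


# Toy log from a logging policy that showed each article half the time.
logs = [
    LoggedInteraction("returning_user", "article_a", 1.0, 0.5),
    LoggedInteraction("returning_user", "article_a", 0.0, 0.5),
    LoggedInteraction("returning_user", "article_b", 0.0, 0.5),
    LoggedInteraction("returning_user", "article_b", 0.0, 0.5),
]


def always_article_a(context):
    """Candidate policy that always shows article_a."""
    return {"article_a": 1.0}


print(ips_estimate(logs, always_article_a))
```

In this toy log, article_a earned a reward in one of its two impressions, so the estimate for a policy that always shows article_a matches that rate of 0.5, even though the logging policy showed it only half the time.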

Doubly robust estimation combines importance sampling with a direct model of the outcome, producing an estimator that remains consistent if either the importance weights or the outcome model is misspecified, as long as the other one is correct. This robustness is practically valuable because both the importance weights and the outcome model are typically learned from finite data and will be imperfect approximations. When the logging and target policies differ substantially, doubly robust estimation for offline policy evaluation tends to produce lower-variance, more reliable estimates than pure importance sampling, because the importance weights are applied only to the outcome model’s residual errors rather than to the full outcome.
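One common form of the doubly robust estimator can be sketched as follows. The function name, the tuple layout of the log, and the deliberately crude outcome model are assumptions made for illustration; the logging probabilities are taken to be known exactly.

```python
def doubly_robust_estimate(logs, target_policy, reward_model):
    """Doubly robust estimate of the target policy's mean reward.

    logs: list of (context, action, reward, logging_prob) tuples.
    target_policy(context): dict mapping actions to probabilities.
    reward_model(context, action): predicted reward (may be inaccurate).
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        probs = target_policy(context)
        # Direct term: the model's predicted reward under the target policy.
        direct = sum(p * reward_model(context, a) for a, p in probs.items())
        # Correction term: importance-weighted residual on the logged action.
        weight = probs.get(action, 0.0) / logging_prob
        correction = weight * (reward - reward_model(context, action))
        total += direct + correction
    return total / len(logs)


# Toy log: each article was shown with probability 0.5 (hypothetical data).
toy_logs = [
    ("returning_user", "article_a", 1.0, 0.5),
    ("returning_user", "article_a", 0.0, 0.5),
    ("returning_user", "article_b", 0.0, 0.5),
    ("returning_user", "article_b", 0.0, 0.5),
]


def always_a(context):
    """Candidate policy that always shows article_a."""
    return {"article_a": 1.0}


def crude_model(context, action):
    """Intentionally rough outcome model: predicts 0.4 for everything."""
    return 0.4


print(doubly_robust_estimate(toy_logs, always_a, crude_model))
```

Because the logging probabilities here are correct, the correction term offsets the crude model’s error and the result agrees with the pure importance sampling estimate on the same log. With a model that predicts 0.0 everywhere, the estimator reduces to plain importance weighting.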

Why ad agencies care

Why importance sampling might matter more in agency work than in most industries.

Running live A/B tests to evaluate every change to a recommendation system, ranking algorithm, or personalization policy is slow, costly, and exposes users to potentially worse experiences during the test period. Importance sampling-based offline evaluation reduces the need for live testing by estimating the performance of a new policy from historical data. An agency that uses offline evaluation can evaluate more candidate changes per unit of time, de-risk larger changes before live deployment, and explain its evaluation methodology to clients with technical precision.

Recommendation system improvements are evaluated more efficiently with offline policy evaluation. When an agency is testing changes to a content recommendation algorithm for a client, running a full A/B test for each candidate change requires weeks of traffic and introduces user experience risk if a suboptimal candidate is tested. Offline policy evaluation using importance sampling can identify which of many candidates is most likely to perform well before any live testing, enabling the agency to run one A/B test with the most promising candidate rather than sequential tests with all candidates. This can compress the improvement cycle from months to weeks.

Causal effect estimation in marketing analytics uses a close relative of importance sampling. Estimating the causal effect of an ad exposure on conversion, while controlling for the selection bias in who receives the ad, relies on propensity score weighting: the probability of receiving the treatment given observed covariates is estimated, and units are reweighted by the inverse of this propensity to create a pseudo-randomized sample. This inverse propensity weighting is mathematically a form of importance sampling, with the propensity playing the role of the source distribution’s probability. It is used to estimate counterfactual outcomes from observational data, specifically what conversions would have occurred if the unexposed group had been exposed.
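A minimal sketch of inverse propensity weighting, assuming the propensities have already been estimated and are correct, might look like the following. The two audience segments, their propensities, and the conversion counts are invented for the example, and the data are constructed so that exposure has no true effect.

```python
def naive_difference(units):
    """Raw conversion-rate gap between exposed and unexposed units."""
    exposed = [y for t, y, _ in units if t]
    unexposed = [y for t, y, _ in units if not t]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


def ipw_effect(units):
    """Inverse-propensity-weighted estimate of the average exposure effect.

    units: list of (exposed, converted, propensity) tuples, where propensity
    is the probability of exposure given the unit's observed covariates.
    """
    n = len(units)
    # Exposed units are weighted by 1 / propensity.
    exposed_mean = sum(y / e for t, y, e in units if t) / n
    # Unexposed units are weighted by 1 / (1 - propensity).
    unexposed_mean = sum(y / (1.0 - e) for t, y, e in units if not t) / n
    return exposed_mean - unexposed_mean


# Hypothetical data: a high-intent segment is targeted more often
# (propensity 0.8) and converts more often regardless of exposure.
high_intent = (
    [(True, 1.0, 0.8)] * 4 + [(True, 0.0, 0.8)] * 4
    + [(False, 1.0, 0.8)] + [(False, 0.0, 0.8)]
)
low_intent = [(True, 0.0, 0.2)] * 2 + [(False, 0.0, 0.2)] * 8
units = high_intent + low_intent

print("naive difference:", naive_difference(units))
print("IPW estimate:", ipw_effect(units))
```

In this constructed data the naive comparison suggests a sizeable lift purely because the ad was shown mostly to people who would have converted anyway, while the weighted estimate is approximately zero.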

Off-policy learning in reinforcement learning often relies on importance sampling. When an AI agent learns from historical data generated by a different policy than the one being learned, importance sampling weights each historical transition by the ratio of the target policy’s probability of the logged action to the behavior policy’s probability of that action. This enables learning from logged data, such as historical ad serving decisions and their outcomes, without requiring online interaction with the environment during training. Agencies building AI systems that learn from historical campaign data rather than online interaction need to understand importance sampling as a core mechanism for keeping off-policy estimates valid.
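For multi-step data, the per-step probability ratios are multiplied along each logged trajectory. The sketch below shows ordinary trajectory-level importance sampling under simplifying assumptions: a single state, two hypothetical bidding actions, and policies expressed as plain probability functions. It is one of several possible formulations, not a complete off-policy learning algorithm.

```python
def trajectory_weight(trajectory, target_policy, behavior_policy):
    """Product of per-step probability ratios along one logged trajectory."""
    weight = 1.0
    for state, action, _reward in trajectory:
        weight *= target_policy(state, action) / behavior_policy(state, action)
    return weight


def off_policy_value(trajectories, target_policy, behavior_policy, gamma=1.0):
    """Importance-sampled estimate of the target policy's expected return."""
    total = 0.0
    for trajectory in trajectories:
        # Discounted return actually observed under the behavior policy.
        ret = sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))
        total += trajectory_weight(trajectory, target_policy, behavior_policy) * ret
    return total / len(trajectories)


def behavior(state, action):
    """Hypothetical logging policy: picks each action half the time."""
    return 0.5


def target(state, action):
    """Hypothetical new policy: strongly prefers bid_high."""
    return {"bid_high": 0.9, "bid_low": 0.1}[action]


# Four two-step trajectories; in this toy setup bid_high earns reward 1.0.
trajectories = [
    [("s", "bid_high", 1.0), ("s", "bid_high", 1.0)],
    [("s", "bid_high", 1.0), ("s", "bid_low", 0.0)],
    [("s", "bid_low", 0.0), ("s", "bid_high", 1.0)],
    [("s", "bid_low", 0.0), ("s", "bid_low", 0.0)],
]

print(off_policy_value(trajectories, target, behavior))
```

Because the weight is a product over steps, it can grow or shrink quickly as trajectories get longer, which is one reason long-horizon off-policy estimates tend to have high variance.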

In practice

What importance sampling looks like inside a working ad agency.

An agency manages a content recommendation system for a subscription media client and wants to evaluate whether switching from a popularity-based ranking to a personalized ranking model would improve time-on-site per session. Running a live A/B test with the personalized model would require at least 4 weeks to achieve statistical significance and would expose half the user base to the new model before its quality is confirmed. The agency uses offline policy evaluation with importance sampling: for each article impression in the historical log, it computes the probability that the current popularity-based policy would have shown that article and the probability that the new personalized policy would have shown the same article. The ratio of the new policy’s probability to the current policy’s probability is the importance weight for that observation. Applying these weights to the historical engagement data, the agency estimates that the personalized model would produce a 17% improvement in time-on-site per session compared to the popularity baseline. Confident in this estimate, the agency runs a single 2-week live A/B test with the personalized model, which achieves a 14% improvement in time-on-site. The offline estimate was close enough to the live result to justify the decision to proceed directly to a single definitive test rather than running multiple exploratory tests.

Build the evaluation methodology that lets you test more AI system changes with less live experiment risk through The Creative Cadence Workshop.

The automations and agents module covers how to build and evaluate AI-powered recommendation and optimization systems, including the offline evaluation methods that enable faster iteration with lower user experience risk.