AI Glossary · Letter P

Policy.

In reinforcement learning, a function that maps observations about the current situation to the action the agent should take. The policy is what the agent learns: it defines behavior across all possible situations the agent might encounter, from choosing which ad to bid on to selecting which content to recommend. Training a reinforcement learning system is the process of learning the best possible policy for the given objective.

Also known as decision policy, RL policy, agent policy

What it is

A working definition of policy in reinforcement learning.

A policy is a function from states (observations of the environment) to actions (decisions the agent makes). A bidding policy maps an impression’s context signals to a bid amount. A recommendation policy maps a user’s current session context to the item to recommend. A budget allocation policy maps the current campaign performance state to the spend adjustment to make. In reinforcement learning, the agent interacts with the environment, observes the state, takes an action according to the current policy, receives a reward signal, and updates the policy to improve future rewards. After many iterations, the policy converges toward the behavior that maximizes cumulative reward over time.
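The observe, act, reward, update loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: the state names, action names, `toy_environment`, and the learning rate are all hypothetical, and the sketch uses optimistic initial preferences (every action starts at 1.0) so that a purely greedy policy still tries each action at least once.

```python
import random

# Hypothetical states and actions for a toy budget-allocation agent.
STATES = ["under_target", "on_target", "over_target"]
ACTIONS = ["decrease_spend", "hold_spend", "increase_spend"]


def policy(state, preferences):
    """The policy: map a state to the action with the highest learned preference."""
    prefs = preferences[state]
    return max(ACTIONS, key=lambda action: prefs[action])


def toy_environment(state, action, rng):
    """Made-up dynamics, not a real campaign model: return (reward, next_state)."""
    best = {
        "under_target": "increase_spend",
        "on_target": "hold_spend",
        "over_target": "decrease_spend",
    }
    reward = 1.0 if action == best[state] else 0.0
    return reward, rng.choice(STATES)


def run(steps=200, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Assumption: optimistic start, so untried actions look attractive.
    preferences = {s: {a: 1.0 for a in ACTIONS} for s in STATES}
    state = rng.choice(STATES)
    for _ in range(steps):
        action = policy(state, preferences)                    # act on current policy
        reward, next_state = toy_environment(state, action, rng)  # observe reward
        old = preferences[state][action]
        # Update: move the preference toward the reward just received.
        preferences[state][action] = old + learning_rate * (reward - old)
        state = next_state
    return preferences
```

Calling `run()` and then `policy("under_target", preferences)` on the result shows the learned mapping from state to action. In a real system the table of preferences would typically be replaced by a parameterized model, and the environment would be the live auction or audience rather than a stub.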

Policies can be deterministic, producing the same action for the same state every time, or stochastic, producing a probability distribution over possible actions for each state. Stochastic policies are often preferable during training because the agent must try different actions to discover which ones produce the best long-term outcomes. A purely deterministic policy that always selects the action it currently believes is best will never discover whether alternative actions might be better; balancing the two is the exploration-exploitation tradeoff. Stochastic policies that assign small probabilities to non-greedy actions continue exploring even after reaching a good solution, which is important when the environment is non-stationary and the optimal action may change over time.
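The distinction can be made concrete with a small sketch. Epsilon-greedy is used here as one common way to build a stochastic policy that gives small probabilities to non-greedy actions; the source does not name a specific scheme, so treat the choice, the action names, and the value estimates as assumptions.

```python
import random


def greedy_action(action_values):
    """Deterministic policy: always return the action with the highest estimate."""
    return max(action_values, key=action_values.get)


def epsilon_greedy_distribution(action_values, epsilon=0.1):
    """Stochastic policy: a probability for every action.

    Each action gets epsilon / n, and the greedy action gets the remaining
    1 - epsilon on top, so non-greedy actions keep a small chance of being tried.
    """
    n = len(action_values)
    best = greedy_action(action_values)
    probs = {action: epsilon / n for action in action_values}
    probs[best] += 1.0 - epsilon
    return probs


def sample_action(probs, rng):
    """Draw one action according to the policy's probabilities."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical value estimates for three bid levels.
estimates = {"bid_low": 0.2, "bid_mid": 0.5, "bid_high": 0.4}
```

With these estimates, `greedy_action(estimates)` always returns `"bid_mid"`, while repeated calls to `sample_action(epsilon_greedy_distribution(estimates), random.Random(0))` return `"bid_mid"` most of the time and occasionally one of the other two bid levels.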

Policy gradient methods are a family of reinforcement learning algorithms that optimize the policy directly by computing the gradient of the expected reward with respect to the policy’s parameters. Rather than first learning a value function and deriving a policy from it, policy gradient methods use the reward signals from sampled interactions to estimate the gradient that improves the policy. REINFORCE, Proximal Policy Optimization, and actor-critic methods are all policy gradient approaches. Reinforcement learning from human feedback, the process used to align large language models, typically applies a policy gradient method such as Proximal Policy Optimization, with the reward signal coming from a reward model trained on human preference ratings of model outputs.
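As a rough illustration of optimizing a policy directly, the sketch below applies a REINFORCE-style update to a softmax policy over three actions in a single-step (bandit) setting. The conversion rates, learning rate, and running-average baseline are assumptions chosen for the example, not part of any particular vendor's system.

```python
import math
import random


def softmax(theta):
    """Turn policy parameters into action probabilities."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(true_rates, steps=5000, learning_rate=0.1, seed=0):
    """Learn softmax policy parameters from sampled rewards.

    true_rates are hypothetical per-action success probabilities that the
    agent never sees directly; it only observes 0/1 rewards.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(true_rates)
    baseline = 0.0  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        reward = 1.0 if rng.random() < true_rates[action] else 0.0
        advantage = reward - baseline
        for k in range(len(theta)):
            # Gradient of log pi(action) with respect to theta[k] for a softmax policy.
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += learning_rate * advantage * grad_log_pi
        baseline += (reward - baseline) / t
    return softmax(theta)


final_probs = reinforce_bandit([0.2, 0.5, 0.8])
```

No value table is consulted to pick actions: the parameters `theta` are nudged so that actions followed by above-average reward become more probable. Over enough steps the policy should shift most of its probability toward the action with the highest underlying rate.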

Why ad agencies care

Why understanding policies clarifies how AI systems make decisions in bidding, recommendation, and content optimization.

A working ad agency deploying AI for bid optimization, next-best-action marketing, or content recommendation is deploying policies, whether the vendor describes them that way or not. A bid optimization system is a policy that maps impression signals to bid amounts. A recommendation engine is a policy that maps user context to content selection. A marketing orchestration system’s decision engine is a policy that maps customer state to communication actions. Understanding these systems as policies illuminates what they learn, what data shapes their behavior, and how the reward structure drives the decisions they make on the agency’s behalf.

Bid optimization policies encode the campaign objective through the reward signal they are trained on. A policy trained to maximize attributed conversions at a fixed CPA will learn to bid aggressively on high-intent signals and conservatively on low-intent signals in a way that produces the target CPA on average. A policy trained to maximize brand-weighted reach will behave differently, concentrating on high-reach inventory that maximizes audience coverage rather than targeting only conversion-predictive signals. Auditing the reward signal of a bidding policy reveals what objective it is actually optimizing, which may not match the business objective the agency believes it is serving.
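To see how the reward definition alone changes what counts as good behavior, the sketch below scores the same two bidding outcomes under two hypothetical reward functions. The formulas, field names, and numbers are illustrative assumptions; real platforms define their rewards differently and rarely expose them in this form.

```python
def reward_conversions_at_cpa(outcome, target_cpa):
    """Hypothetical CPA-style reward.

    Conversions earn credit and spend is charged at the target CPA, so the
    reward is positive only when the actual cost per conversion beats target.
    """
    return outcome["conversions"] - outcome["spend"] / target_cpa


def reward_reach(outcome):
    """Hypothetical reach-style reward: unique people reached, in thousands."""
    return outcome["unique_reach"] / 1000.0


# Two made-up outcomes with the same spend.
high_intent = {"spend": 100.0, "conversions": 5, "unique_reach": 800}
broad_reach = {"spend": 100.0, "conversions": 1, "unique_reach": 12000}

cpa_scores = {
    "high_intent": reward_conversions_at_cpa(high_intent, target_cpa=25.0),
    "broad_reach": reward_conversions_at_cpa(broad_reach, target_cpa=25.0),
}
reach_scores = {
    "high_intent": reward_reach(high_intent),
    "broad_reach": reward_reach(broad_reach),
}
```

Under the CPA-style reward the high-intent outcome scores 1.0 and the broad outcome scores -3.0; under the reach-style reward the ranking flips to 0.8 versus 12.0. A policy trained on either signal would be pushed toward opposite inventory, which is why inspecting the reward definition is a practical audit step.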

Next-best-action policies in customer engagement systems learn when to communicate, on which channel, and with which message. A customer engagement policy learns the optimal sequence of marketing touches across email, push, SMS, and paid media based on the customer’s current engagement state and behavioral history. Unlike rule-based orchestration systems that follow explicit decision trees, reinforcement learning-based engagement policies discover non-obvious communication sequences that maximize long-term customer value rather than immediate response. These policies typically require 6 to 12 months of deployment data before the learned policy quality exceeds well-designed rule-based alternatives, because the policy needs substantial interaction history to estimate long-term value accurately.
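One common way to formalize "long-term customer value rather than immediate response" is a discounted return: the sum of future rewards, each weighted by a discount factor raised to the number of steps until it arrives. The sketch below assumes a discount factor of 0.9 and two made-up weekly reward sequences; neither comes from the source.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps in the future is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical weekly rewards for one customer under two contact sequences.
quick_response = [0.1, 0.1, 0.0, 0.0]   # early opens, then the customer disengages
patient_sequence = [0.0, 0.0, 0.1, 1.0]  # nothing early, then an open and a purchase

quick_value = discounted_return(quick_response)
patient_value = discounted_return(patient_sequence)
```

Here `quick_value` is about 0.19 and `patient_value` is about 0.81, so a policy optimizing the discounted return prefers the sequence that looks worse in the first two weeks. Estimating these longer-horizon values reliably is part of why such policies need substantial interaction history.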

Policy exploration requirements mean reinforcement learning systems must occasionally take suboptimal actions to improve over time. A bidding or recommendation policy that never explores alternative actions will stagnate at its current level of performance because it never discovers whether different behaviors might produce better outcomes. Agencies deploying RL-based systems should understand that some exploration-driven suboptimal actions are a necessary cost of the system’s long-term improvement, not a sign of malfunction. The exploration rate is usually reduced over time as the policy converges, but a fully converged, non-exploring policy will eventually degrade when the environment changes, because it no longer gathers the data it needs to adapt.
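A simple way to express "explore less over time, but never stop entirely" is a decaying exploration rate with a floor. The starting rate, decay factor, and floor below are hypothetical values picked for illustration, and exponential decay is only one of several possible schedules.

```python
def exploration_rate(step, start=0.2, floor=0.02, decay=0.999):
    """Exploration probability at a given step.

    Decays exponentially from `start`, but never drops below `floor`, so the
    policy keeps testing alternatives if the environment shifts later on.
    """
    return max(floor, start * decay ** step)


# Sample the schedule at a few points in a deployment.
schedule = [exploration_rate(step) for step in (0, 500, 1000, 5000, 10000)]
```

Early in deployment a fifth of decisions are exploratory under these assumed settings; late in deployment the rate settles at the floor rather than at zero. Setting the floor to zero would recover the fully converged, non-exploring policy that the paragraph above warns about.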

In practice

What policy looks like inside a working ad agency.

An agency implements a reinforcement learning-based email engagement policy for a retail client’s reactivation program targeting 85,000 lapsed subscribers. The goal is to maximize 90-day purchase conversion among lapsed subscribers, which requires deciding for each subscriber: whether to send a reactivation email this week, which offer type to include (discount versus content versus social proof), and at what point to stop attempting reactivation and remove the subscriber from the active list.

A rule-based baseline policy sends a 5-email sequence over 10 weeks to all lapsed subscribers with a fixed 20% discount offer. The RL policy is initialized with the rule-based policy’s behavior and begins updating based on reward signals: a reward of 1.0 for a 90-day purchase conversion, 0.1 for an email open with no conversion, and 0.0 for no engagement.

After 120 days of deployment, the learned policy has evolved away from the uniform discount-first sequence. For subscribers who opened but did not click the first email, the policy has learned to switch to a content-led second email rather than repeating the discount offer. For subscribers with no engagement after 3 contacts, the policy has learned to extend the gap to 3 weeks rather than continuing weekly contacts, as weekly repetition was associated with unsubscribes that terminated the relationship entirely.

The learned policy achieves a 7.4% 90-day conversion rate versus 4.2% for the rule-based baseline, a 76% relative improvement (3.2 percentage points). The post-training analysis reveals that the content-led follow-up for subscribers who opened but did not click was the single largest driver of improvement, a tactic the rule-based policy’s designers had not considered.
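The reward scheme and the headline comparison in this example can be written down directly. The function and argument names below are hypothetical; the reward values and conversion rates are the ones quoted in the example.

```python
def engagement_reward(purchased_within_90_days, opened_email):
    """Reward scheme from the example: purchase 1.0, open only 0.1, otherwise 0.0."""
    if purchased_within_90_days:
        return 1.0
    if opened_email:
        return 0.1
    return 0.0


def relative_lift(new_rate, baseline_rate):
    """Relative improvement of new_rate over baseline_rate, as a fraction."""
    return (new_rate - baseline_rate) / baseline_rate


# Conversion rates quoted in the example.
learned_rate = 0.074
baseline_rate = 0.042

lift = relative_lift(learned_rate, baseline_rate)        # about 0.762
point_gain = (learned_rate - baseline_rate) * 100        # about 3.2 percentage points
```

The lift works out to roughly 0.762, which rounds to the 76% figure. Note that the open-only reward of 0.1 is a design choice: it gives the policy an early signal to learn from, at the risk of rewarding opens that never lead to a purchase if it is weighted too heavily.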
