
Markov Decision Process.

A mathematical framework for modeling decisions made sequentially in environments where outcomes are partly controlled by the decision maker and partly determined by randomness. Markov decision processes provide the formal foundation for reinforcement learning and are used to model bid optimization, customer journey interventions, and any AI system that must make a sequence of decisions to maximize a long-term objective.

Also known as MDP, sequential decision process, stochastic decision process

What it is

A working definition of the Markov decision process.

A Markov decision process has four components: a state space that describes all possible situations the decision maker can be in; an action space that describes all possible decisions; a transition function that specifies the probability of moving from one state to another given an action; and a reward function that specifies the immediate value received for taking an action in a state. The goal is to find a policy, a mapping from states to actions, that maximizes the expected cumulative reward over time. The “Markov” property means that the probability of transitioning to the next state depends only on the current state and action, not on the full history of prior states and actions.
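The four components and the Markov property can be written compactly. The notation below is a standard textbook convention rather than anything specific to this glossary, and the discount factor \gamma is an added assumption: it is commonly introduced so that the cumulative reward stays finite over long horizons.

```latex
\begin{aligned}
&\text{MDP: } (\mathcal{S}, \mathcal{A}, P, R)
  \quad \text{(state space, action space, transition function, reward function)} \\
&P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\text{Markov property: }
  \Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_0, A_0, \dots, S_t, A_t) \\
&\text{Goal: } \pi^{*} \in \arg\max_{\pi} \;
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t)\right],
  \qquad 0 \le \gamma < 1
\end{aligned}
```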

The Markov property is both a mathematical simplification and a substantive assumption about the problem structure. It says that the current state contains all relevant information about the past that matters for predicting the future. In a well-designed MDP, the state representation is rich enough to encode all the history that affects what will happen next, so the Markov property holds. In practice, designing a state representation that makes the Markov property approximately true requires careful thought about what information is decision-relevant and what can be safely ignored.

Solving an MDP means finding an optimal policy, one that maximizes expected cumulative reward. For small MDPs with known transition and reward functions, dynamic programming methods can compute the optimal policy exactly. For large MDPs or those with unknown dynamics, reinforcement learning methods learn policies from experience by interacting with the environment and observing the rewards that result from each action. Q-learning, policy gradient methods, and actor-critic algorithms are distinct approaches to this learning problem, each with its own tradeoffs in sample efficiency, computational cost, and the problem structures it suits.
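As an illustration of the dynamic programming case, here is a minimal sketch of value iteration on a made-up two-state MDP. The states "A" and "B", the actions "stay" and "go", the probabilities, the rewards, and the discount factor of 0.9 are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple

# transitions[state][action] -> list of (probability, next_state, reward)
Outcomes = List[Tuple[float, str, float]]
Transitions = Dict[str, Dict[str, Outcomes]]


def action_value(outcomes: Outcomes, values: Dict[str, float], gamma: float) -> float:
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9, tol: float = 1e-8):
    """Repeat Bellman optimality updates until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The greedy policy picks, in each state, the action with the highest value.
    policy = {
        s: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for s, actions in transitions.items()
    }
    return values, policy


# Hypothetical toy MDP: staying in "B" pays more than staying in "A",
# and "go" from "A" reaches "B" only 80% of the time.
toy_mdp: Transitions = {
    "A": {"stay": [(1.0, "A", 1.0)], "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

values, policy = value_iteration(toy_mdp)
print(policy)
```

In this toy problem the computed policy gives up the immediate reward in "A" to reach "B", which is the kind of long-term tradeoff a myopic rule would miss. Exact methods like this need the full transition table, which is why they apply only to small, fully known MDPs.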

Why ad agencies care

Why the MDP framework is the right mental model for evaluating AI bid optimization and customer journey systems.

A working ad agency using AI-powered bid optimization, budget pacing, or customer journey automation is implicitly deploying systems that can be understood as policies operating in Markov decision processes. The bid optimization system observes a state (auction characteristics, user signals, campaign delivery metrics) and takes an action (bid amount) to maximize cumulative reward (conversions or revenue over the campaign flight). Understanding this MDP structure helps agencies diagnose when these systems are underperforming, what their exploration-exploitation tradeoffs look like, and what state information they are or are not using.

Automated bidding systems are learned policies in high-dimensional auction MDPs. A programmatic bidding system observes a state that includes user signals, page context, historical conversion patterns, and campaign delivery constraints, then selects a bid action that determines whether the impression is won and at what cost. The system’s policy is trained to maximize conversions subject to a cost-per-acquisition constraint; together, that objective and constraint play the role of the reward function. When these systems underperform, the most common causes are state representations that are too coarse to capture the relevant auction dynamics, reward functions that are misspecified relative to the actual business objective, and insufficient exploration during the learning phase, which causes the policy to converge to locally optimal but globally suboptimal bidding behavior.
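One common way to express "maximize conversions subject to a cost-per-acquisition constraint" is as a constrained objective that is then folded into a single per-step reward with a Lagrange multiplier. This is a generic formulation offered as a sketch; it is not a description of how any particular bidding platform is implemented. The symbols are assumptions for the example: c_t is the conversions earned at step t, k_t is the cost paid, \tau is the target CPA, and \lambda is the penalty weight.

```latex
\begin{aligned}
\max_{\pi}\;& \mathbb{E}_{\pi}\!\left[\sum_{t} c_t\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t} k_t\right] \le \tau \, \mathbb{E}_{\pi}\!\left[\sum_{t} c_t\right] \\
r_t^{\lambda} &= c_t - \lambda \,(k_t - \tau\, c_t), \qquad \lambda \ge 0
\end{aligned}
```

Under this reading, a misspecified reward function means that c_t counts the wrong events or values them incorrectly, so the policy can satisfy the constraint while missing the business objective.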

Customer journey intervention models are MDPs where the state is the customer’s engagement history. A next-best-action system that decides when to send what message to a customer at risk of churning is navigating a customer journey MDP: the state is the customer’s current behavioral profile and history of prior communications; the actions are the available interventions including email, push, call, or no contact; and the reward is the long-term customer lifetime value contribution. Systems that frame this as an MDP explicitly can learn policies that trade off short-term engagement metrics against long-term retention, avoiding the over-communication trap that plagues systems that optimize only for immediate response.
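The short-term versus long-term tradeoff can be made concrete with a discounted return. The sketch below compares two fixed, hypothetical reward sequences rather than learned policies: an aggressive contact schedule that earns a high response for three periods before the customer churns, and a restrained schedule that earns less per period but keeps the customer for twelve. All numbers and the discount factors are invented for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical per-period rewards.
aggressive = [1.0, 1.0, 1.0] + [0.0] * 9   # strong response, then churn
restrained = [0.4] * 12                    # weaker response, customer retained

for gamma in (0.0, 0.95):
    a = discounted_return(aggressive, gamma)
    r = discounted_return(restrained, gamma)
    winner = "aggressive" if a > r else "restrained"
    print(f"gamma={gamma}: aggressive={a:.2f}, restrained={r:.2f} -> {winner}")
```

With a discount factor of 0, only the immediate response counts and the aggressive schedule looks better. With a discount factor near 1, later periods carry weight and the restrained schedule wins, which is the effect an explicit MDP framing is meant to capture.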

The exploration-exploitation tradeoff in MDPs explains why AI bidding systems need warmup periods. A bidding policy that always exploits its current best estimate of the optimal bid will learn slowly because it never tries actions outside its current belief about what is optimal. A policy that explores widely will learn faster but sacrifice short-term performance. The warmup periods that many programmatic platforms recommend when deploying new automated bidding strategies reflect the exploration phase of the underlying reinforcement learning system. Cutting short the warmup period by imposing tight bid constraints prematurely prevents the policy from exploring enough to find the globally optimal bidding behavior, locking it into a locally optimal but ultimately underperforming strategy.
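The lock-in effect described above can be reproduced with an epsilon-greedy rule on a simplified problem. The sketch below uses a multi-armed bandit, which is an MDP with a single state, and treats each arm as a hypothetical bid level with a fixed, noise-free payoff. Real auctions are noisy and multi-state, so this shows the mechanism only and says nothing about any platform's actual algorithm.

```python
import random
from typing import List


def run_epsilon_greedy(arm_rewards: List[float], epsilon: float,
                       steps: int, seed: int = 0) -> int:
    """Return the index of the arm the learner believes is best after `steps`."""
    rng = random.Random(seed)
    n = len(arm_rewards)
    estimates = [0.0] * n   # running average reward per arm
    counts = [0] * n        # number of times each arm was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                         # explore: random arm
        else:
            arm = max(range(n), key=lambda i: estimates[i])  # exploit: best so far
        reward = arm_rewards[arm]   # deterministic payoff (a simplification)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return max(range(n), key=lambda i: estimates[i])


# Hypothetical payoffs for three bid levels; the third is the best.
bid_level_payoffs = [0.2, 0.5, 0.8]

print(run_epsilon_greedy(bid_level_payoffs, epsilon=0.0, steps=1000))
print(run_epsilon_greedy(bid_level_payoffs, epsilon=0.1, steps=1000))
```

With epsilon set to 0, the learner tries the first arm, sees a positive payoff, and never tests the others, so it stays on the weakest bid level. With a small exploration rate it eventually samples every arm and settles on the best one, at the cost of some deliberately suboptimal choices along the way.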

In practice

What a Markov decision process looks like inside a working ad agency.

An agency is troubleshooting underperformance in a newly deployed smart bidding campaign for a home services client. The automated bidding strategy is set to Target CPA with a $95 target, and the campaign has been running for three weeks with an observed CPA of $142, well above target. The platform’s diagnostic recommends “adding conversion data” and “allowing more time for learning,” but the account has over 200 conversions recorded during the period.

The agency examines the situation through an MDP lens and identifies two issues. First, the state representation available to the bidding system does not distinguish between high-intent service area searches and low-intent broad research queries, because the keyword structure groups both in the same ad groups. The system is learning a single policy across two qualitatively different state types that warrant different bid levels. Second, the conversion events being used as rewards include form fills and phone calls, but the client has told the agency that phone call leads convert to booked jobs at 3x the rate of form fills. The reward function that the bidding system is optimizing is giving equal weight to both conversion types.

The agency restructures the keyword segmentation to separate high-intent and low-intent searches into distinct campaigns with separate bidding policies, and implements separate conversion actions with conversion value weights reflecting the relative lead quality of calls versus form fills. Over the following four weeks, CPA decreases to $88 and booked job rate from leads improves by 34%.
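The reward misspecification in this scenario can be shown with a few lines of arithmetic. The sketch below assumes a 3:1 value weight for phone calls versus form fills, taken from the booking-rate ratio the client reported. The two segments and their conversion counts are hypothetical and are not drawn from the account described above.

```python
from typing import Dict


def reward(conversions: Dict[str, int], weights: Dict[str, float]) -> float:
    """Total reward signal: each conversion type counted at its weight."""
    return sum(weights[kind] * count for kind, count in conversions.items())


# Equal weighting is what the original setup was optimizing.
equal_weights = {"phone_call": 1.0, "form_fill": 1.0}
# Assumed weights: calls book jobs at 3x the rate of form fills.
value_weights = {"phone_call": 3.0, "form_fill": 1.0}

# Hypothetical segments with the same raw conversion count but a different mix.
segment_a = {"phone_call": 10, "form_fill": 30}
segment_b = {"phone_call": 30, "form_fill": 10}

for name, segment in (("A", segment_a), ("B", segment_b)):
    print(name, reward(segment, equal_weights), reward(segment, value_weights))
```

Under equal weights the two segments look identical to the bidding system, so it has no reason to bid differently on them. Under value weights the call-heavy segment produces a larger reward, which gives the policy a signal aligned with booked jobs.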
