The initial training phase in which a large model learns general representations from massive datasets using self-supervised objectives, before being adapted for specific downstream tasks through fine-tuning. Pre-training is what produces foundation models such as GPT, BERT, and CLIP: the models that agencies fine-tune, prompt, and deploy as the AI layer in their products and services.
Also known as foundation model training, upstream training, self-supervised training
Pre-training trains a model on a large, general dataset using a self-supervised learning objective that does not require human-labeled examples. Autoregressive language models such as GPT are pre-trained to predict the next token in a sequence of text, a task whose training signal is generated automatically: the next token is always known from the text itself. Masked language models such as BERT use a related objective, predicting tokens that have been hidden from the input. Vision-language models such as CLIP are pre-trained on image-text pairs to align visual and textual representations in a shared embedding space. The objective is chosen so that doing well at it forces the model to develop rich general representations: predicting the next word in natural language requires a working grasp of syntax, semantics, world knowledge, and discourse structure.
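To make the "labels come from the text itself" point concrete, here is a minimal sketch of how next-token training pairs can be derived from raw text. It assumes whitespace-separated words as tokens and a hypothetical helper name, `next_token_pairs`; real systems use subword tokenizers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) training pairs.

    No human labels are involved: each target is simply the token
    that follows its context in the original text.
    """
    # Assumption: whitespace tokens stand in for a real subword tokenizer.
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the campaign launches next week")
for context, target in pairs:
    print(context, "->", target)
```

Every sentence of length n yields n - 1 training examples at no labeling cost, which is why this kind of objective can be applied to web-scale text.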
Pre-training on large datasets produces representations that generalize broadly across downstream tasks. A language model pre-trained on hundreds of billions of tokens of internet text has been exposed to a vast range of linguistic patterns, factual knowledge, and reasoning styles. When fine-tuned on a specific task such as sentiment classification, the model starts from these rich general representations rather than from random initialization, requiring far fewer task-specific labeled examples to reach high performance. This transfer efficiency is the primary economic rationale for the pre-train-then-fine-tune paradigm: the cost of pre-training is amortized across all downstream tasks that benefit from the general representations, and each fine-tuning task requires only a small incremental investment.
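The amortization argument can be illustrated with a few lines of arithmetic. The cost figures below are hypothetical placeholders chosen only to make the pattern visible; they are not estimates for any real model.

```python
def cost_per_task(pretraining_cost, fine_tuning_cost, num_tasks):
    """Average cost per downstream task when one pre-trained model is shared."""
    return pretraining_cost / num_tasks + fine_tuning_cost


# Hypothetical figures, in arbitrary currency units.
PRETRAINING_COST = 1_000_000
FINE_TUNING_COST = 5_000

for n in (1, 10, 100, 1000):
    print(n, cost_per_task(PRETRAINING_COST, FINE_TUNING_COST, n))
```

As the number of downstream tasks grows, the per-task cost approaches the fine-tuning cost alone, which is the sense in which pre-training is amortized.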
The scale of pre-training compute and data has been a primary driver of foundation model capability improvements. GPT-3’s in-context learning, the ability to learn from examples in the prompt without any gradient updates, was much weaker in smaller versions of the same architecture and appears to arise from the scale of pre-training rather than from architectural innovation. The observation that capabilities emerge with scale motivates continued investment in larger pre-training runs, even though the theoretical reasons why scale produces these capabilities are not fully understood.
An agency that fine-tunes, prompts, or deploys pre-trained foundation models is always working within the capability bounds established by the pre-training phase. A language model pre-trained predominantly on English text will tend to underperform on French marketing copy even after fine-tuning on French examples, because its pre-training representations for French are weaker than those for English. A vision model pre-trained on general photography will tend to underperform on specialized domains such as medical or satellite imagery, because those image types are underrepresented in the pre-training data. Understanding what data and objectives a model was pre-trained on is essential context for predicting where it will and will not generalize.
The pre-training data determines what knowledge a foundation model has and what it lacks. A language model’s factual knowledge, stylistic range, and reasoning capabilities are directly shaped by what text was included in its pre-training corpus. A model has no knowledge of events after its training cutoff: one whose pre-training data ends in 2023 knows nothing of later events. A model pre-trained predominantly on English has limited multilingual capability. A model pre-trained on general web text may have only a weak pre-training signal for specialized domains such as pharmaceutical marketing, legal copywriting, or technical B2B content. Agencies using AI for specialized marketing domains should evaluate whether the pre-training data behind their tools covers the relevant domain well enough, and whether domain-specific fine-tuning is needed to supplement limited coverage.
Fine-tuning adapts pre-trained representations to specific tasks but cannot introduce capabilities absent from pre-training. Fine-tuning adjusts the pre-trained model’s parameters toward better performance on a specific task using labeled examples, but it cannot create entirely new capabilities that have no analog in the pre-training phase. A model that never encountered a particular writing style or content format during pre-training will not develop strong performance in that area through fine-tuning alone, because fine-tuning adapts existing representations rather than building new ones from scratch. For highly specialized content types with minimal representation in standard pre-training corpora, combining fine-tuning with retrieval-augmented generation that injects relevant examples at inference time often outperforms fine-tuning alone.
Continual pre-training on domain-specific data is an intermediate option between general fine-tuning and full pre-training from scratch. Continual pre-training extends the pre-training phase on a domain-specific corpus before task-specific fine-tuning, deepening the model’s domain representations without the cost of pre-training from random initialization. A model continually pre-trained on marketing and advertising text before fine-tuning for copywriting tasks typically outperforms a model that goes directly from general pre-training to fine-tuning on copywriting examples, because the continual pre-training phase builds richer marketing-specific representations that the fine-tuning phase then specializes. This approach is practical for agencies with sufficient domain-specific text data and the compute to run a continual pre-training phase.
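The idea of continuing the same objective on a domain corpus can be sketched with a toy model. The example below uses a simple bigram counter rather than a neural network, and the corpora and the class name `BigramModel` are invented for illustration; it is an analogy for the staging, not a description of how any production model is trained.

```python
from collections import Counter, defaultdict


class BigramModel:
    """Toy next-word model: counts which word follows which."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, corpus):
        # Same objective at every stage: observe (word, next word) pairs.
        for sentence in corpus:
            words = sentence.lower().split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def prob(self, prev, nxt):
        total = sum(self.counts[prev].values())
        return self.counts[prev][nxt] / total if total else 0.0


# Hypothetical corpora, invented for this sketch.
GENERAL_CORPUS = ["safety first on the road", "the road was long"]
DOMAIN_CORPUS = [
    "see important safety information",
    "important safety information is below",
]

model = BigramModel()
model.train(GENERAL_CORPUS)          # stage 1: general pre-training
before = model.prob("safety", "information")

model.train(DOMAIN_CORPUS)           # stage 2: continual pre-training
after = model.prob("safety", "information")

print(before, after)
```

After the second stage the model assigns real probability to a domain phrase it had never seen, while what it learned from the general corpus is still in place. Task-specific fine-tuning would then start from this domain-aware model.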
An agency is evaluating AI writing tools for a pharmaceutical client that produces DTC (direct-to-consumer) advertising for prescription medications. The client’s regulatory team requires that all AI-generated copy proposals be reviewed by a medical writer for accuracy and compliance before entering the creative review process, but wants to use AI to accelerate first-draft generation. The agency tests three language models: a general-purpose GPT-class model, an open-source model pre-trained on biomedical text (BioMedLM), and a proprietary model marketed as specialized for regulated industries. The evaluation consists of generating 50 first-draft copy proposals for three campaigns across different therapeutic areas, rated by the medical writer on accuracy, appropriate use of indication language, correct ISI (Important Safety Information) referencing patterns, and brand voice alignment.
The general-purpose model achieves high scores on brand voice (3.9/5) and copywriting quality (4.1/5) but low scores on ISI referencing (1.8/5) and indication language accuracy (2.3/5), producing copy that reads well but routinely uses language that would require significant regulatory revision. The biomedically pre-trained model scores higher on indication language accuracy (3.7/5) because of its exposure to clinical language during pre-training, but lower on brand voice (2.6/5) because its pre-training skewed toward academic rather than marketing text. The specialized proprietary model achieves the highest combined scores (3.8/5 accuracy, 3.6/5 brand voice) but at a cost 4x higher than the general model.
The agency recommends the biomedically pre-trained model together with a brand voice fine-tuning project using 300 approved DTC copy examples, predicting that fine-tuning on marketing-specific data will improve brand voice scores while retaining the domain accuracy advantage gained in pre-training.
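The comparison in this example can be tabulated in a few lines. This sketch uses only the scores quoted above and assumes that the proprietary model's "accuracy" score is comparable to the other two models' indication language accuracy scores; the dictionary layout and the `summarize` helper are illustrative choices, not part of the evaluation itself.

```python
# Scores quoted in the example (1-5 scale).
# Assumption: "accuracy" for the proprietary model is treated as comparable
# to indication language accuracy for the other two models.
scores = {
    "general": {"accuracy": 2.3, "brand_voice": 3.9},
    "biomedical": {"accuracy": 3.7, "brand_voice": 2.6},
    "proprietary": {"accuracy": 3.8, "brand_voice": 3.6},
}


def summarize(model_scores):
    """Return the mean score and the name of the weakest dimension."""
    mean = sum(model_scores.values()) / len(model_scores)
    weakest = min(model_scores, key=model_scores.get)
    return round(mean, 2), weakest


summary = {name: summarize(s) for name, s in scores.items()}
for name, (mean, weakest) in summary.items():
    print(f"{name}: mean={mean}, weakest={weakest}")
```

On these two dimensions the general and biomedical models have similar averages but opposite weaknesses: accuracy for the general model and brand voice for the biomedical one. The agency's recommendation rests on the judgment that a brand voice gap is the easier of the two to close with fine-tuning.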
The generative AI foundations module covers pre-training, the pre-train-then-fine-tune paradigm, and the data and capability considerations that determine whether a foundation model will perform reliably in a specific marketing domain.