The process of rescaling numeric features to a common range or distribution before feeding them into a machine learning model. Normalization ensures that no single feature dominates the learning process due to its scale rather than its predictive value, stabilizes gradient-based training, and enables models to converge faster and more reliably.
Also known as feature scaling, data normalization, standardization
Machine learning algorithms that use distance measures or gradient-based optimization are sensitive to the scale of input features. A dataset with one feature measured in millions (annual revenue) and another measured in fractions (click-through rate) will have the distance and gradient computations dominated by the large-scale feature, not because it is more informative but simply because it is numerically larger. Normalization rescales features to remove this spurious scale dominance, ensuring that the learning algorithm treats features equitably based on their predictive content rather than their units of measurement.
Min-max normalization scales each feature to the range [0, 1] by subtracting the minimum and dividing by the range (maximum minus minimum). Standardization, also called Z-score normalization, rescales each feature to have zero mean and unit variance by subtracting the feature mean and dividing by the standard deviation. Which method to use depends on the algorithm: tree-based methods such as random forests and gradient boosting are invariant to monotonic feature transformations and do not require normalization; methods that rely on distances or kernels, such as k-nearest neighbors and support vector machines, generally require it; neural networks benefit from it for training stability but are less strictly dependent on it than distance-based methods.
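The two formulas above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production implementation: the function names and the sample session counts are hypothetical, and returning zeros for a constant feature is an assumed convention.

```python
from statistics import fmean, pstdev

def min_max_scale(values):
    """Map values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant feature: no range to divide by (assumed convention: all zeros).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance via (x - mean) / std."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

session_counts = [1, 20, 50, 120, 500]  # illustrative values only
scaled = min_max_scale(session_counts)
z_scores = standardize(session_counts)
```

Here the smallest value in `scaled` is 0 and the largest is 1, while `z_scores` has mean 0 and standard deviation 1 up to floating-point error.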
Batch normalization and layer normalization are internal normalization techniques applied inside neural networks, independently of feature normalization at the input. Batch normalization normalizes the activations of each unit across the examples in a training batch; it was originally motivated as a way to reduce internal covariate shift, where the distribution of activations at each layer changes as the parameters in earlier layers are updated. Layer normalization instead normalizes across the units of a layer for each example separately, so it does not depend on the batch size. Both stabilize and accelerate training and allow higher learning rates, which has made them standard components of modern deep network architectures.
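The difference between the two techniques comes down to which axis the statistics are computed over. The sketch below assumes a small matrix of activations with one row per example and one column per unit. It deliberately omits the learnable scale and shift parameters and the running statistics that real implementations keep, and all names and values are illustrative.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant to avoid division by zero (assumed value)

def _normalize(vec):
    """Subtract the mean and divide by sqrt(variance + EPS)."""
    mu = fmean(vec)
    var = pvariance(vec, mu)
    return [(v - mu) / (var + EPS) ** 0.5 for v in vec]

def batch_norm(acts):
    """Normalize each unit (column) across the examples in the batch."""
    columns = [list(col) for col in zip(*acts)]
    normed_cols = [_normalize(col) for col in columns]
    return [list(row) for row in zip(*normed_cols)]

def layer_norm(acts):
    """Normalize each example (row) across the units of the layer."""
    return [_normalize(row) for row in acts]

# Hypothetical activations: 4 examples in the batch, 3 units in the layer.
activations = [
    [1.0, 200.0, 0.5],
    [2.0, 180.0, 0.1],
    [3.0, 220.0, 0.9],
    [4.0, 240.0, 0.3],
]
bn = batch_norm(activations)  # every column now has mean ~0
ln = layer_norm(activations)  # every row now has mean ~0
```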
A working ad agency building propensity models, audience classifiers, or performance prediction tools from structured behavioral data will encounter normalization as a necessary preprocessing step that, when omitted, produces models that train slowly, converge to suboptimal solutions, or fail to learn from low-variance features entirely. Normalization is rarely the most interesting part of model development, but incorrect or absent normalization is a common source of unexplained model underperformance that is easy to diagnose once recognized.
Feature normalization in audience scoring models puts all behavioral signals on a comparable footing. A conversion propensity model trained on features including session count (range: 1 to 500), click-through rate (range: 0.001 to 0.35), and days since last visit (range: 0 to 365) will be dominated by the large-range features, session count and days since last visit, in distance-based calculations, and by whichever features produce the largest gradients in gradient-based training, not because these features are more predictive but because they have larger absolute values. Standardizing all features to zero mean and unit variance before training makes initial gradient magnitudes comparable across features, allowing the learning algorithm to discover the predictive importance of each feature from the data.
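A small sketch can make the scale effect concrete. The five user rows below are invented for illustration, as are the function names. The code measures what share of the squared Euclidean distance between two users comes from each feature, before and after standardization.

```python
from statistics import fmean, pstdev

# Hypothetical users: (session count, click-through rate, days since last visit)
users = [
    (100, 0.005, 10),
    (140, 0.300, 20),
    (5, 0.050, 200),
    (450, 0.120, 360),
    (60, 0.010, 90),
]

def standardize_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    stats = [(fmean(c), pstdev(c)) for c in cols]
    return [tuple((v - mu) / sd for v, (mu, sd) in zip(row, stats))
            for row in rows]

def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(sq)
    return [s / total for s in sq]

# Users 0 and 1 differ by 40 sessions but by a factor of 60 in click-through rate.
raw_share = squared_diff_share(users[0], users[1])
std_users = standardize_columns(users)
std_share = squared_diff_share(std_users[0], std_users[1])
```

On the raw values, session count accounts for more than 90% of the squared distance between the two users, even though their click-through rates differ far more in relative terms. After standardization, click-through rate accounts for most of it.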
Normalization statistics must be computed on training data only and then applied unchanged to validation and test data. A common error is computing these statistics, the mean and standard deviation for standardization or the min and max for min-max scaling, from the full dataset including the test set. This leaks test set information into the training pipeline: the preprocessing has seen statistics from the test set before evaluation, which can produce inflated performance estimates. The same training-set parameters should be reused for the training, validation, and test sets, as well as for new data at inference time.
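One way to enforce this is to separate fitting from transforming, the same pattern used by scaler objects in libraries such as scikit-learn. The class below is a hypothetical stand-in written with the standard library, and the data values are made up.

```python
from statistics import fmean, pstdev

class SimpleStandardizer:
    """Hypothetical minimal scaler: learn statistics once, reuse them everywhere."""

    def fit(self, values):
        self.mean_ = fmean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 1000.0]  # held out; must not influence the statistics

# Correct: statistics come from the training split only.
scaler = SimpleStandardizer().fit(train)
train_z = scaler.transform(train)
test_z = scaler.transform(test)

# Leaky (shown for contrast only): statistics computed over train + test.
leaky_scaler = SimpleStandardizer().fit(train + test)
```

In this sketch the extreme test value shifts the mean and standard deviation of `leaky_scaler` considerably, so the training features themselves would be scaled differently depending on what the test set contains.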
Embeddings from pre-trained models often benefit from normalization before downstream use. Embedding vectors produced by pre-trained language and image models vary in magnitude across examples and across dimensions. When these embeddings are used as inputs to downstream classifiers or distance-based retrieval systems, normalizing each embedding vector to unit length, called L2 normalization, ensures that dot-product and Euclidean-distance comparisons reflect the direction of the vectors rather than being dominated by embeddings with large norms. Cosine similarity is already insensitive to vector length, and after L2 normalization the dot product of two vectors equals their cosine similarity, which is why many retrieval systems normalize once and then use the cheaper dot product. L2 normalization of embedding vectors is standard practice in semantic search and similarity-based retrieval applications.
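The effect on retrieval ranking can be shown with a small sketch. The two-dimensional vectors below are toy stand-ins for real embeddings, and the helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        return list(vec)  # zero vector has no direction; return it unchanged
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings.
query = [1.0, 1.0]
aligned_small = [2.0, 2.0]    # same direction as the query, small norm
misaligned_big = [30.0, 0.0]  # different direction, large norm

# Raw dot product favors the large-norm vector despite its worse direction.
raw_prefers_big = dot(query, misaligned_big) > dot(query, aligned_small)

# After L2 normalization, the dot product ranks by direction only.
q, a, m = (l2_normalize(v) for v in (query, aligned_small, misaligned_big))
normalized_prefers_aligned = dot(q, a) > dot(q, m)
```

After normalization, `dot(q, a)` matches `cosine(query, aligned_small)` up to floating-point error.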
An agency is building a lead scoring model for a B2B software client to predict which free trial users are most likely to convert to paid subscriptions. The training dataset has 8 behavioral features per user: number of sessions in the trial period, average session duration in minutes, number of core features activated, number of team members invited, number of support tickets submitted, days since account creation, product pages viewed, and a binary indicator for whether the user attended the product onboarding webinar. The team initially trains a logistic regression model without normalization and observes that the model achieves only 0.63 AUC on the validation set, much lower than the 0.75 to 0.80 expected from similar datasets. Inspection of the learned coefficients reveals that the days since account creation feature has a coefficient near zero despite having strong bivariate correlation with conversion, while the session count feature has a very large coefficient. Standardizing all features to zero mean and unit variance and retraining produces a model with 0.77 AUC. The coefficient on days since account creation is now appropriately large and negative, correctly reflecting that users who have not converted earlier in the trial period are less likely to convert overall. The normalization step, requiring 10 lines of preprocessing code, is responsible for a 14-point AUC improvement, which translates to meaningfully better prioritization of high-propensity leads for the client’s sales team.
The generative AI foundations module covers feature engineering and preprocessing including normalization, encoding, and the data pipeline steps that determine whether a model trains effectively and generalizes to production data.