AI Glossary · Letter K

K-Nearest Neighbors.

A classification and regression algorithm that predicts the label or value for a new point by finding the k training examples most similar to it and returning the majority vote or average of their labels. K-nearest neighbors is the simplest instance-based learning method, requires no training phase, and forms the foundation of recommendation engines, content similarity search, and audience lookalike modeling.

Also known as KNN, k-NN, nearest neighbor classifier

What it is

A working definition of k-nearest neighbors.

K-nearest neighbors makes predictions by memorizing the entire training set and, at prediction time, computing the distance between the query point and every training example, identifying the k closest examples, and aggregating their labels. For classification, the prediction is the majority class among the k neighbors; for regression, it is the average of their values. The algorithm has no learning phase in the traditional sense: it stores the training data and performs all computation at prediction time. This makes it computationally expensive at inference scale, because each prediction requires computing distances to all training examples, but trivially simple to implement and update as new training data arrives.

The choice of k controls the bias-variance tradeoff. With k=1, the prediction is determined by the single nearest neighbor, producing low-bias but high-variance predictions that are sensitive to individual noisy training points. With large k, predictions are determined by many neighbors and converge toward the class distribution in large regions of the feature space, producing higher-bias but lower-variance predictions. The optimal k is found empirically through cross-validation and depends on the noise level and sample density of the training data. The distance metric used also substantially affects prediction quality: Euclidean distance is appropriate for continuous numerical features of similar scale, but requires feature normalization to prevent high-magnitude features from dominating the distance calculation.

Approximate nearest neighbor algorithms including FAISS, ScaNN, and HNSW make k-nearest neighbor search tractable at large scale by finding approximate rather than exact nearest neighbors in sublinear time, using data structures that organize embeddings for efficient similarity search. These algorithms are the infrastructure behind semantic search, recommendation systems, and content retrieval systems that need to find the most similar items from collections of millions or billions of items in milliseconds. An embedding model combined with an approximate nearest neighbor index is the modern production architecture for any similarity-based retrieval task.

Why ad agencies care

Why k-nearest neighbors might matter more in agency work than in most industries.

Recommendation, content similarity search, and audience lookalike modeling all reduce, at their core, to finding the nearest neighbors of a query in an embedding space. A working ad agency that understands k-nearest neighbors understands the foundational mechanism behind these tools, which enables more principled evaluation of their quality and more effective configuration of their parameters.

Approximate nearest neighbor search is the infrastructure behind modern semantic retrieval. When an agency deploys a semantic search system over a content library or knowledge base, the core operation is embedding each document, building an approximate nearest neighbor index over those embeddings, and retrieving the top-k most similar embeddings to a query embedding at search time. Understanding that this retrieval operation is k-nearest neighbor search in embedding space, using approximate algorithms for scale, explains why retrieval quality depends on both embedding quality and the choice of approximate nearest neighbor parameters, and informs how to tune both for the specific retrieval task.

Lookalike audience modeling is nearest-neighbor modeling in customer behavioral space. Finding users who are similar to a seed audience of known converters or high-value customers is a nearest-neighbor problem in the feature space of user behavioral embeddings. The quality of the lookalike audience depends on the quality of the feature representation: if the embedding space correctly encodes the behavioral dimensions that matter for conversion, nearest neighbors in that space will be genuinely similar users. If the embedding space is dominated by irrelevant or noisy dimensions, nearest neighbors will be spuriously similar. Evaluating whether lookalike audiences contain users who are actually similar to the seed, rather than just metrically nearest in a poorly constructed feature space, is the validation step that distinguishes high-quality lookalike modeling from cosmetic audience expansion.

KNN as a baseline provides a sanity check for more complex models. For many classification and regression tasks, a KNN classifier or regressor with appropriate k and feature normalization provides a surprisingly competitive baseline. If a complex model does not outperform a well-configured KNN baseline by a meaningful margin, the additional complexity of the complex model may not be justified. Running KNN as a baseline before investing in more complex model development is a practical check that catches cases where the problem does not benefit from the additional model capacity.

In practice

What k-nearest neighbors looks like inside a working ad agency.

An agency is building a content recommendation system for a professional development platform client. The system needs to recommend courses to users based on their profile and learning history. The agency builds a collaborative filtering approach using matrix factorization to produce user and course embedding vectors. At recommendation time, for each user the system needs to find the k most similar courses to what the user has previously engaged with, from a library of 8,000 courses. Exact nearest neighbor search over 8,000 128-dimensional course embeddings takes 4.2 milliseconds per query, which is acceptable for a recommendation use case where results can be pre-computed. However, as the course library grows toward the client’s projected 40,000 courses, exact search would take approximately 21 milliseconds per query. The agency builds an HNSW approximate nearest neighbor index over the course embeddings, which reduces query time to 0.3 milliseconds for approximate top-k retrieval with 97% recall accuracy compared to exact search. This indexing infrastructure makes the recommendation system scale-independent: adding 10x more courses will not change query latency meaningfully because the HNSW graph structure maintains sublinear search complexity as the index grows.

Build the similarity search and retrieval infrastructure that powers recommendation, semantic search, and audience modeling through The Creative Cadence Workshop.

The automations and agents module covers how to build AI-powered retrieval and recommendation systems, including the nearest neighbor search algorithms that make embedding-based similarity practical at production scale.