A supervised learning algorithm that finds the decision boundary between classes that maximizes the margin, the distance between the boundary and the nearest training examples from each class. Support vector machines are effective for high-dimensional classification problems with limited training data, are interpretable through their support vectors, and use kernel functions to learn non-linear boundaries without explicitly computing high-dimensional feature maps.
Also known as SVM, support vector classifier, kernel machine
A support vector machine finds a hyperplane that separates training examples from two classes, choosing among all separating hyperplanes the one that maximizes the margin: the distance from the hyperplane to the nearest examples of each class. Maximum-margin classification is motivated by a generalization argument: among all separating hyperplanes, the one with the largest margin is the most robust to small perturbations of the input data and has the best theoretical generalization guarantee under certain distributional assumptions. The support vectors, the training examples that lie on the margin boundary, are the only examples that determine the decision boundary; examples farther from the boundary could be removed from the training set without changing the hyperplane.
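As a rough illustration of the margin geometry described above, the sketch below computes the distance from each training example to a hyperplane, then reads off the margin width and the support vectors. The toy data and the hyperplane parameters `w` and `b` are hypothetical values chosen for the example, not the output of a trained model.

```python
import math

# Hypothetical toy data: (features, label), with labels in {-1, +1}.
train = [
    ((2.0, 2.0), +1),
    ((3.0, 3.0), +1),
    ((0.0, 0.0), -1),
    ((-1.0, 0.0), -1),
]

# Assumed separating hyperplane w . x + b = 0 (illustrative values).
w = (0.5, 0.5)
b = -1.0


def decision_value(w, b, x):
    """Signed score w . x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane: |w . x + b| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


distances = [distance_to_hyperplane(w, b, x) for x, _ in train]
nearest = min(distances)

# The margin width is twice the distance to the nearest example.
margin_width = 2 * nearest

# Support vectors are the examples sitting at that nearest distance.
support_vectors = [
    x for (x, _), d in zip(train, distances) if math.isclose(d, nearest)
]

# Every example is on the correct side when label * score is positive.
all_separated = all(y * decision_value(w, b, x) > 0 for x, y in train)
```

In this toy setup only the two nearest points are support vectors; moving or deleting either of the other two points (while keeping it outside the margin) would leave the hyperplane unchanged.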
The soft-margin SVM extends the hard-margin formulation to data that is not linearly separable by allowing some examples to fall inside the margin or on the wrong side of the boundary, with each violation measured by a slack variable and penalized in the objective. In this formulation the support vectors include the examples that violate the margin as well as those lying exactly on it. The cost parameter C controls the tradeoff between maximizing the margin and minimizing the number and severity of margin violations: high C tolerates little violation (narrower margin, fewer training errors), while low C accepts more violations in exchange for a wider margin (higher training error, but potentially better generalization). C is typically chosen by cross-validation.
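The role of C can be seen by evaluating the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of slacks), for two candidate hyperplanes. The sketch below is illustrative only: the one-dimensional data and both candidates are hypothetical, and neither candidate is claimed to be the exact optimum for either value of C.

```python
def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (the slack values)."""
    norm_sq = sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Slack is zero when the example is on or outside its margin.
        total_slack += max(0.0, 1.0 - y * score)
    return 0.5 * norm_sq + C * total_slack


# Hypothetical 1-D data; the positive example at -0.5 sits near the negatives.
data = [((-3.0,), -1), ((-2.0,), -1), ((-0.5,), +1), ((2.0,), +1), ((3.0,), +1)]

# Candidate A: wide margin (width 2 / 0.5 = 4) that misclassifies the point at -0.5.
wide = ((0.5,), 0.0)
# Candidate B: narrow margin (width 2 / 2 = 1) that classifies every point correctly.
narrow = ((2.0,), 2.5)

results = {}
for C in (0.1, 10.0):
    results[C] = {
        "wide": soft_margin_objective(wide[0], wide[1], data, C),
        "narrow": soft_margin_objective(narrow[0], narrow[1], data, C),
    }

# The candidate with the lower objective is preferred at each C.
preferred = {C: min(scores, key=scores.get) for C, scores in results.items()}
```

With the low C the wide-margin candidate has the lower objective despite its one violation; with the high C the violation becomes expensive and the narrow-margin candidate is preferred.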
The kernel trick allows SVMs to learn decision boundaries in high-dimensional feature spaces without explicitly transforming the input features: the algorithm needs only inner products between pairs of examples, and a kernel function supplies those inner products directly from the original inputs. The radial basis function (RBF or Gaussian) kernel, k(x, z) = exp(-gamma * ||x - z||^2), decays exponentially with the squared distance between two inputs and corresponds to an inner product in an infinite-dimensional feature space, enabling SVMs to learn complex non-linear boundaries. The kernel choice determines the family of decision boundaries the SVM can represent: linear kernels for problems that are approximately linearly separable, polynomial kernels for degree-d polynomial boundaries, and RBF kernels for smooth non-linear boundaries. With a well-chosen kernel, SVMs can be competitive with many deep learning approaches on tabular classification problems with limited training data.
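The three kernels named above can be written in a few lines each. This is a minimal sketch: the function names, the default parameter values, and the `coef0` offset in the polynomial kernel are choices made for this example rather than the conventions of any particular library.

```python
import math


def linear_kernel(x, z):
    """Plain inner product x . z."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """(x . z + coef0) ** degree, giving degree-d polynomial boundaries."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2): 1 for identical inputs, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical inputs: `near` equals `anchor`; `far` is 5 units away from it.
anchor = (1.0, 2.0)
near = (1.0, 2.0)
far = (4.0, 6.0)

similarity_near = rbf_kernel(anchor, near, gamma=0.1)
similarity_far = rbf_kernel(anchor, far, gamma=0.1)
```

Larger values of gamma make the RBF similarity fall off faster with distance, so each training example influences a smaller neighborhood and the resulting boundary can become more irregular.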
An agency building classification models for clients should consider SVMs alongside gradient boosted trees and neural network classifiers when training data is limited, when interpretability of the decision boundary is important, or when the primary input is high-dimensional and sparse, such as text or behavioral event sequences. SVMs are particularly strong on text classification tasks with TF-IDF or n-gram features, where a sparse, high-dimensional input space combined with limited labeled data creates the conditions in which SVMs have historically performed well. For small datasets (under 5,000 examples), SVMs often match or outperform gradient boosted trees, which tend to overfit more readily when data is limited.
SVMs with TF-IDF features are a strong baseline for text classification tasks such as intent classification, brand voice scoring, and content category labeling when labeled data is limited. A brand safety classifier trained on 800 labeled examples with TF-IDF features and a linear-kernel SVM may achieve accuracy competitive with fine-tuned transformer models on simple binary classification tasks, at a fraction of the training cost and without the GPU or API infrastructure those models require. For agencies building initial versions of text classifiers for client deployments while labeled data is still being collected, SVMs provide a reliable and interpretable starting point that can be replaced with more powerful models as labeled data accumulates.
The support vector interpretation provides a principled account of which training examples are most influential for the model’s classification decisions. In a neural network, identifying what drives a prediction requires post-hoc explanation methods such as SHAP or LIME; an SVM’s decision boundary, by contrast, is explicitly defined by its support vectors, the training examples nearest the boundary. Inspecting the support vectors reveals which borderline examples shape the class boundary, and can surface unusual or challenging examples that should be reviewed for label quality. This interpretability is valuable for audit purposes, for client trust in regulated industries, and for diagnosing misclassification patterns by examining the support vectors near the boundary in the region where the errors occur.
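One way to turn this inspection into a review queue is to rank training examples by their functional margin, label * decision value. The sketch below assumes the decision values have already been obtained from a trained SVM; the example IDs, the scores, and the helper name `review_candidates` are hypothetical.

```python
def review_candidates(examples, threshold=1.0):
    """Return (example_id, functional_margin) pairs, most suspicious first.

    `examples` is a list of (example_id, label, decision_value) tuples with
    labels in {-1, +1}. A functional margin at or below 1 means the example
    lies on or inside the margin; a negative value means it is misclassified.
    """
    flagged = []
    for example_id, label, decision_value in examples:
        functional_margin = label * decision_value
        if functional_margin <= threshold:
            flagged.append((functional_margin, example_id))
    flagged.sort()  # smallest (most violating) margin first
    return [(example_id, margin) for margin, example_id in flagged]


# Hypothetical decision values for four labeled training examples.
scored = [
    ("pitch-a", +1, 2.3),   # confidently correct, not flagged
    ("pitch-b", -1, -0.4),  # correct but inside the margin
    ("pitch-c", +1, -0.8),  # misclassified: a candidate for a label check
    ("pitch-d", -1, -1.0),  # exactly on the margin
]

queue = review_candidates(scored)
```

Examples with a negative margin appear at the front of the queue, which is usually where a label audit would start.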
RBF kernel SVMs provide effective non-linear classification on moderate-dimensional feature vectors without the architecture search and hyperparameter complexity of neural networks. For propensity models on structured tabular data with 20 to 100 features and 2,000 to 15,000 training examples, RBF kernel SVMs with cross-validated C and gamma parameters are competitive with random forests and gradient boosted trees and require no architectural decisions, provided the features are standardized first, since the RBF kernel depends on distances between inputs. The hyperparameter search space is two-dimensional (C and gamma), smaller than the search space for gradient boosted trees (learning rate, maximum depth, number of estimators, subsample fraction) and orders of magnitude smaller than neural network hyperparameter spaces. For agencies without deep machine learning expertise who need reliable classification models, SVMs with RBF kernels reduce the tuning burden while delivering strong performance on moderate-sized classification problems.
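Because the search space has only two dimensions, an exhaustive grid over log-spaced values of C and gamma is usually affordable. The sketch below shows the shape of such a search; the grid ranges are illustrative, and `toy_score` is a placeholder for a real cross-validated accuracy function, which would train and evaluate an SVM for each pair.

```python
import itertools
import math


def log_grid(low_exp, high_exp, base=10.0):
    """Values base**low_exp, ..., base**high_exp, one per integer exponent."""
    return [base ** e for e in range(low_exp, high_exp + 1)]


def grid_search(score_fn, c_values, gamma_values):
    """Evaluate score_fn(C, gamma) on every pair and return the best triple."""
    best = None
    for C, gamma in itertools.product(c_values, gamma_values):
        score = score_fn(C, gamma)
        if best is None or score > best[0]:
            best = (score, C, gamma)
    return best


def toy_score(C, gamma):
    """Placeholder scorer that peaks at C=10, gamma=0.01 (not real accuracy)."""
    return -abs(math.log10(C) - 1.0) - abs(math.log10(gamma) + 2.0)


c_grid = log_grid(-2, 3)       # 0.01 up to 1000
gamma_grid = log_grid(-4, 1)   # 0.0001 up to 10

best_score, best_C, best_gamma = grid_search(toy_score, c_grid, gamma_grid)
n_candidates = len(c_grid) * len(gamma_grid)
```

A 6 by 6 grid means 36 candidate pairs; with 5-fold cross-validation that is 180 model fits, which is typically manageable at the dataset sizes discussed above.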
An agency is building a creative intent classifier for a publishing client that automatically categorizes incoming content pitches into 7 categories: news, feature, opinion, product review, sponsored content, data journalism, and evergreen how-to. The classifier is used to route pitches to the appropriate editorial desk. The labeled dataset contains 680 pitch emails (97 per category on average, with a minimum of 62 for product review), of which 20% (136 examples) are held out as a test set.
The agency evaluates four models: logistic regression with TF-IDF features, a linear SVM with TF-IDF features, a gradient boosted tree with TF-IDF features, and a fine-tuned DistilBERT transformer. On the 136-example test set the models score as follows: logistic regression 71% accuracy, linear SVM 76%, gradient boosted tree 73%, DistilBERT 82%. The linear SVM outperforms logistic regression and the gradient boosted tree on this small, high-dimensional text classification problem, although on a test set of this size a gap of a few percentage points amounts to only a handful of examples. DistilBERT achieves the highest accuracy but requires GPU inference at 38ms per pitch versus 2ms for the SVM.
The agency deploys the linear SVM for initial production because it is fast, requires no GPU, and scores higher than the simpler logistic regression baseline. The 6 percentage point gap to DistilBERT does not justify the additional infrastructure cost for a low-stakes routing task in which occasional misrouting is easily corrected by the receiving editor. The SVM is flagged for retraining and a potential model upgrade once the labeled dataset reaches 2,000 examples, at which point the transformer is expected to show a larger accuracy advantage, since more labeled data lets its greater representational capacity pull ahead of the SVM’s simpler decision boundary.
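The figures in this example can be sanity-checked with a little arithmetic. The sketch below recomputes the holdout size, converts the accuracy gaps into counts of test examples, and estimates the sampling uncertainty of each accuracy with a simple binomial standard error. That estimate is a rough normal approximation introduced here as an assumption; it ignores the fact that all four models were scored on the same test examples, so it is a back-of-envelope check rather than a formal significance test.

```python
import math

n_labeled = 680
test_fraction = 0.20
n_test = round(n_labeled * test_fraction)
n_train = n_labeled - n_test

# Test accuracies as reported in the example above.
accuracy = {
    "logistic regression": 0.71,
    "linear SVM": 0.76,
    "gradient boosted tree": 0.73,
    "DistilBERT": 0.82,
}


def binomial_std_error(acc, n):
    """Approximate standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


std_error = {name: binomial_std_error(acc, n_test) for name, acc in accuracy.items()}

# Accuracy gaps expressed as an approximate number of test examples.
svm_vs_logreg_examples = (accuracy["linear SVM"] - accuracy["logistic regression"]) * n_test
bert_vs_svm_examples = (accuracy["DistilBERT"] - accuracy["linear SVM"]) * n_test

# Per-pitch inference latency in milliseconds, from the example above.
latency_ratio = 38 / 2
```

Each accuracy carries a standard error of roughly 3 to 4 percentage points at this test-set size, and the gaps between models correspond to fewer than ten test examples each, which supports treating the ranking as provisional until more labeled data is available.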
The generative AI foundations module covers support vector machines, kernel methods, and algorithm selection principles, with guidance on when SVMs are the right tool for text classification, small-dataset problems, and high-dimensional marketing classification tasks.