Supervised Text Classification and Sentiment Analysis

Supervised Text Classification and Sentiment Analysis is a pillar of modern natural language processing. From spam filters to large language models, NLP systems rely on a small set of core ideas applied with increasing sophistication, and this lesson grounds one of them: learning to assign labels such as topic or sentiment to text from annotated examples.

Why Supervised Text Classification Matters

Text is the world's most abundant data type and the primary way humans communicate. Systems that understand natural language unlock enormous product and research opportunities.

  • Tokenisation shapes everything downstream — choose it with care.
  • Contextual embeddings encode far more than bag-of-words ever could.
  • Evaluate with task-appropriate metrics (BLEU, F1, exact-match, human).
  • Always spot-check outputs: automatic NLP metrics often diverge from human judgement.
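The point about task-appropriate metrics is worth anchoring with a concrete sketch. The snippet below uses invented toy labels to show why raw accuracy can mislead on an imbalanced classification task, and why macro-averaged F1 is a safer default:

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels (invented for illustration): 8 neutral, 1 positive, 1 negative.
y_true = ["neutral"] * 8 + ["positive", "negative"]

# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

# Accuracy looks healthy (0.8), but the model never detects sentiment at all.
print("accuracy :", accuracy_score(y_true, y_pred))
# Macro F1 averages per-class F1 equally, exposing the two classes it misses.
print("macro F1 :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Macro F1 drops below 0.3 here while accuracy reads 0.8, which is exactly the gap spot-checking is meant to catch.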

How Supervised Text Classification Shows Up in Practice

In a typical project, supervised text classification and sentiment analysis are combined with the rest of the natural language processing toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.

These techniques are applied in search, support automation, compliance monitoring, sentiment analysis, clinical text understanding and LLM-powered products of every kind.
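Since sentiment analysis is the headline use case, here is a minimal supervised sentiment classifier trained end to end. The training sentences and labels are invented purely for illustration; a real project would use thousands of annotated texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set (illustrative only, not a real dataset).
texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience, highly recommend",
    "Great value and great quality",
    "What a wonderful, delightful purchase",
    "This is terrible, it broke immediately",
    "Awful experience, I want a refund",
    "Horrible quality, very disappointed",
    "The worst purchase I have ever made",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

for sentence in ["great quality, highly recommend", "terrible and disappointed"]:
    print(sentence, "->", clf.predict([sentence])[0])
```

The same fit/predict shape scales from this toy setup to the TF-IDF pipeline in Example 4 below; only the features and the model change.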


Code Examples: Supervised Text Classification and Sentiment Analysis (5 runnable snippets)

Copy any block into a file or notebook and run it end-to-end — each example stands alone.

Example 1: Semantic search with sentence embeddings

# Example 1: Semantic search with sentence embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs  = [
    "A neural network learns a nonlinear mapping from inputs to outputs.",
    "Gradient boosting trains shallow trees sequentially on residuals.",
    "Transformers use self-attention to model token dependencies.",
    "The central limit theorem underpins many inference procedures.",
]
query = "How do attention mechanisms work?"

E    = model.encode(docs + [query], normalize_embeddings=True)
sims = E[-1] @ E[:-1].T

for doc, s in sorted(zip(docs, sims), key=lambda t: -t[1]):
    print(f"{s:.3f}  {doc}")

Example 2: Transformers pipeline for zero-shot classification

# Example 2: Transformers pipeline for zero-shot classification
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=-1,                              # CPU; set 0 for first GPU
)

texts = [
    "Our quarterly revenue exceeded forecasts by 12%.",
    "The deployment failed after the memory leak in the worker pool.",
    "This apple pie is delicious and the crust is perfectly flaky.",
]
labels = ["business", "engineering", "food", "politics"]

for text in texts:
    result = classifier(text, candidate_labels=labels, multi_label=False)
    pairs  = list(zip(result["labels"], result["scores"]))
    top    = ", ".join(f"{l}={s:.2f}" for l, s in pairs[:3])
    print(f"- {text[:50]}...\n    {top}")

Example 3: BPE tokenizer training with tokenizers

# Example 3: BPE tokenizer training with tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer                  = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer    = Whitespace()
trainer = BpeTrainer(
    vocab_size      = 2_000,
    special_tokens  = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"],
    min_frequency   = 2,
)

corpus = [
    "machine learning tokenizers split words into sub-units",
    "byte pair encoding merges frequent pairs iteratively",
    "a small vocabulary keeps the model efficient",
] * 50
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("machine learning is efficient")
print("tokens:", enc.tokens)
print("ids   :", enc.ids)
print("vocab :", tokenizer.get_vocab_size())

Example 4: TF-IDF + logistic regression text classifier

# Example 4: TF-IDF + logistic regression text classifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

cats  = ["sci.space", "rec.sport.hockey", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers"))
test  = fetch_20newsgroups(subset="test",  categories=cats,
                           remove=("headers", "footers"))

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf",   LogisticRegression(max_iter=1_000, C=1.0)),
])
pipe.fit(train.data, train.target)
print(classification_report(test.target, pipe.predict(test.data),
                            target_names=cats, digits=3))

Example 5: Named-entity recognition with spaCy

# Example 5: Named-entity recognition with spaCy
import spacy

nlp  = spacy.load("en_core_web_sm")
text = (
    "In 2024, OpenAI and Microsoft announced a multi-billion-dollar "
    "partnership in Redmond, Washington, to accelerate AI research."
)
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:<30} {ent.label_:<12} ({ent.start_char}..{ent.end_char})")

orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print("organisations:", orgs)