Supervised Text Classification and Sentiment Analysis
Supervised text classification and sentiment analysis are pillars of modern natural language processing. From spam filters to large language models, NLP systems rely on a small set of core ideas applied with increasing sophistication, and this lesson grounds one of them in hands-on practice.
Why Supervised Text Classification Matters
Text is the world's most abundant data type and the primary way humans communicate. Systems that understand natural language unlock enormous product and research opportunities.
- Tokenisation shapes everything downstream — choose it with care.
- Contextual embeddings encode far more than bag-of-words ever could.
- Evaluate with task-appropriate metrics (BLEU, F1, exact-match, human).
- Always spot-check outputs: automatic NLP metrics can be gamed or can leak test data into scores, so they often overstate real quality.
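The F1 item above can be made concrete with a small hand computation. The gold labels and predictions below are toy data invented for illustration: F1 is the harmonic mean of precision and recall over the positive class.

```python
# Toy gold labels and predictions for a binary "spam" classifier (illustrative data).
gold = [1, 0, 1, 1, 0, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)                     # of predicted positives, how many were right
recall = tp / (tp + fn)                        # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision and recall both come out to 0.75, so F1 is also 0.75; in general F1 sits between the two and is pulled toward the smaller one.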
How Supervised Text Classification Shows Up in Practice
In a typical project, supervised text classification and sentiment analysis is combined with the rest of the Natural Language Processing toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.
These techniques are applied in search, support automation, compliance monitoring, sentiment analysis, clinical text understanding, and LLM-powered products of every kind.
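To make the sentiment-analysis use case concrete before the full examples below, here is a deliberately tiny lexicon-based scorer. The word lists and the `sentiment` helper are invented for illustration; real systems learn these associations from labelled data rather than hand-written lists.

```python
# Tiny lexicon-based sentiment scorer (word lists are invented for illustration).
POSITIVE = {"great", "delicious", "love", "excellent", "happy"}
NEGATIVE = {"terrible", "failed", "hate", "awful", "broken"}

def sentiment(text: str) -> str:
    """Count positive and negative words; the sign of the difference is the label."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("This pie is delicious"))   # positive
print(sentiment("The deployment failed"))   # negative
print(sentiment("The sky is blue"))         # neutral
```

A supervised classifier replaces the hand-picked word sets with weights fitted to labelled examples, which is exactly what Example 4 below does at scale.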
- Foundations: Computational Linguistics and Natural Language Processing
- Unsupervised Topic Modeling: Latent Dirichlet Allocation
- Information Extraction: Named Entity Recognition (NER)
- Applied Natural Language Understanding: Large Language Models
Code Examples: Supervised Text Classification and Sentiment Analysis (5 runnable snippets)
Copy any block into a file or notebook and run it end-to-end — each example stands alone.
Example 1: Semantic search with sentence embeddings
```python
# Example 1: Semantic search with sentence embeddings -- Supervised Text Classification and Sentiment Analysis
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "A neural network learns a nonlinear mapping from inputs to outputs.",
    "Gradient boosting trains shallow trees sequentially on residuals.",
    "Transformers use self-attention to model token dependencies.",
    "The central limit theorem underpins many inference procedures.",
]
query = "How do attention mechanisms work?"

# With normalised embeddings, the dot product is the cosine similarity.
E = model.encode(docs + [query], normalize_embeddings=True)
sims = E[-1] @ E[:-1].T
for doc, s in sorted(zip(docs, sims), key=lambda t: -t[1]):
    print(f"{s:.3f}  {doc}")
```
Example 2: Transformers pipeline for zero-shot classification
```python
# Example 2: Transformers pipeline for zero-shot classification -- Supervised Text Classification and Sentiment Analysis
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=-1,  # CPU; set 0 for the first GPU
)
texts = [
    "Our quarterly revenue exceeded forecasts by 12%.",
    "The deployment failed after the memory leak in the worker pool.",
    "This apple pie is delicious and the crust is perfectly flaky.",
]
labels = ["business", "engineering", "food", "politics"]
for text in texts:
    result = classifier(text, candidate_labels=labels, multi_label=False)
    pairs = list(zip(result["labels"], result["scores"]))
    top = ", ".join(f"{l}={s:.2f}" for l, s in pairs[:3])
    print(f"- {text[:50]}...\n  {top}")
```
Example 3: BPE tokenizer training with tokenizers
```python
# Example 3: BPE tokenizer training with tokenizers -- Supervised Text Classification and Sentiment Analysis
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=2_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]"],
    min_frequency=2,
)
corpus = [
    "machine learning tokenizers split words into sub-units",
    "byte pair encoding merges frequent pairs iteratively",
    "a small vocabulary keeps the model efficient",
] * 50
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("machine learning is efficient")
print("tokens:", enc.tokens)
print("ids   :", enc.ids)
print("vocab :", tokenizer.get_vocab_size())
```
Example 4: TF-IDF + logistic regression text classifier
```python
# Example 4: TF-IDF + logistic regression text classifier -- Supervised Text Classification and Sentiment Analysis
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

cats = ["sci.space", "rec.sport.hockey", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers"))
test = fetch_20newsgroups(subset="test", categories=cats,
                          remove=("headers", "footers"))
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1_000, C=1.0)),
])
pipe.fit(train.data, train.target)
# Use train.target_names, not cats: fetch_20newsgroups sorts categories,
# so passing cats directly would mislabel the rows of the report.
print(classification_report(test.target, pipe.predict(test.data),
                            target_names=train.target_names, digits=3))
```
Example 5: Named-entity recognition with spaCy
```python
# Example 5: Named-entity recognition with spaCy -- Supervised Text Classification and Sentiment Analysis
import spacy

# First run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = (
    "In 2024, OpenAI and Microsoft announced a multi-billion-dollar "
    "partnership in Redmond, Washington, to accelerate AI research."
)
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:<30} {ent.label_:<12} ({ent.start_char}..{ent.end_char})")
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print("organisations:", orgs)
```