Applied Natural Language Understanding with Large Language Models (LLMs)
Applied natural language understanding with large language models (LLMs) is a pillar of modern natural language processing. From spam filters to LLM-powered assistants, NLP systems rely on a small set of ideas applied with increasing sophistication, and this lesson anchors one of them in your practice.
Why Applied Natural Language Understanding Matters
Text is the world's most abundant data type and the primary way humans communicate. Systems that understand natural language unlock enormous product and research opportunities.
- Tokenisation shapes everything downstream; choose it with care.
- Contextual embeddings encode far more information than bag-of-words representations ever could.
- Evaluate with task-appropriate metrics (BLEU, F1, exact match, human review); a minimal sketch follows this list.
- Always spot-check outputs by hand: automatic metrics can look strong while the model fails in obvious ways.
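To make the evaluation bullet concrete, the sketch below computes macro-F1 for a toy classification task and exact match for a toy extractive-QA task. The labels and predictions are invented purely for illustration; in practice you would score a held-out test set.
# Sketch: task-appropriate metrics on toy data (labels and predictions are invented)
from sklearn.metrics import f1_score

# A small three-class classification task: macro-F1 weights every class equally.
gold = ["business", "engineering", "food", "food", "business"]
pred = ["business", "food", "food", "food", "engineering"]
print("macro-F1   :", round(f1_score(gold, pred, average="macro"), 3))

# An extractive-QA style task: exact match after light normalisation.
qa_gold = ["Paris", "42", "the mitochondria"]
qa_pred = ["Paris", "42", "mitochondria"]
matches = sum(g.strip().lower() == p.strip().lower() for g, p in zip(qa_gold, qa_pred))
print("exact match:", round(matches / len(qa_gold), 3))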
How Applied Natural Language Understanding Shows Up in Practice
In a typical project, applied natural language understanding with large language models (LLMs) is combined with the rest of the Natural Language Processing toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.
You will see it applied in search, support automation, compliance monitoring, sentiment analysis, clinical text understanding and LLM-powered products of every kind; the sketch below combines two of these building blocks into a single support-automation flow.
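As one illustration of combining techniques, this sketch chains two ideas from the examples further down: sentence embeddings retrieve the closest historical support ticket, and a zero-shot classifier routes the incoming message to a team. The ticket texts and the label set are invented for the example; both models download on first use.
# Sketch: retrieval plus zero-shot routing for support automation (tickets and labels are invented)
from sentence_transformers import SentenceTransformer
from transformers import pipeline

tickets = [
    "The invoice total does not match my card statement.",
    "The app crashes whenever I upload a photo.",
    "How do I export my data before closing the account?",
]
query = "My bill seems wrong this month."

# Step 1: semantic retrieval -- find the most similar historical ticket.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
E = embedder.encode(tickets + [query], normalize_embeddings=True)
best = int((E[-1] @ E[:-1].T).argmax())
print("closest ticket:", tickets[best])

# Step 2: zero-shot classification -- route the new message to a team.
router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = router(query, candidate_labels=["billing", "bug report", "account management"])
print("route to:", result["labels"][0])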
Related lessons:
- Foundations: Computational Linguistics and Natural Language Processing
- Unsupervised Topic Modeling: Latent Dirichlet Allocation
- Supervised Text Classification and Sentiment Analysis
- Information Extraction: Named Entity Recognition (NER)
Code Examples: Applied Natural Language Understanding with LLMs (5 runnable snippets)
Copy any block into a file or notebook and run it end-to-end; each example stands alone once its libraries (transformers, tokenizers, scikit-learn, spaCy with the en_core_web_sm model, sentence-transformers) are installed.
Example 1: Transformers pipeline for zero-shot classification
# Example 1: Transformers pipeline for zero-shot classification -- Applied Natural Language Understanding with LLMs
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=-1,  # CPU; set device=0 for the first GPU
)

texts = [
    "Our quarterly revenue exceeded forecasts by 12%.",
    "The deployment failed after the memory leak in the worker pool.",
    "This apple pie is delicious and the crust is perfectly flaky.",
]
labels = ["business", "engineering", "food", "politics"]

for text in texts:
    result = classifier(text, candidate_labels=labels, multi_label=False)
    pairs = list(zip(result["labels"], result["scores"]))
    top = ", ".join(f"{l}={s:.2f}" for l, s in pairs[:3])
    print(f"- {text[:50]}...\n  {top}")
Example 2: BPE tokenizer training with tokenizers
# Example 2: BPE tokenizer training with tokenizers -- Applied Natural Language Understanding with LLMs
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=2_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]"],
    min_frequency=2,
)

corpus = [
    "machine learning tokenizers split words into sub-units",
    "byte pair encoding merges frequent pairs iteratively",
    "a small vocabulary keeps the model efficient",
] * 50

tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("machine learning is efficient")
print("tokens:", enc.tokens)
print("ids   :", enc.ids)
print("vocab :", tokenizer.get_vocab_size())
Example 3: TF-IDF + logistic regression text classifier
# Example 3: TF-IDF + logistic regression text classifier -- Applied Natural Language Understanding with LLMs
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

cats = ["sci.space", "rec.sport.hockey", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers"))
test = fetch_20newsgroups(subset="test", categories=cats,
                          remove=("headers", "footers"))

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1_000, C=1.0)),
])
pipe.fit(train.data, train.target)

print(classification_report(test.target, pipe.predict(test.data),
                            target_names=cats, digits=3))
Example 4: Named-entity recognition with spaCy
# Example 4: Named-entity recognition with spaCy -- Applied Natural Language Understanding with LLMs
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
text = (
    "In 2024, OpenAI and Microsoft announced a multi-billion-dollar "
    "partnership in Redmond, Washington, to accelerate AI research."
)
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:<30} {ent.label_:<12} ({ent.start_char}..{ent.end_char})")

orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print("organisations:", orgs)
Example 5: Semantic search with sentence embeddings
# Example 5: Semantic search with sentence embeddings -- Applied Natural Language Understanding with LLMs
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "A neural network learns a nonlinear mapping from inputs to outputs.",
    "Gradient boosting trains shallow trees sequentially on residuals.",
    "Transformers use self-attention to model token dependencies.",
    "The central limit theorem underpins many inference procedures.",
]
query = "How do attention mechanisms work?"

# Encode documents and query together; with normalised vectors the dot product is cosine similarity.
E = model.encode(docs + [query], normalize_embeddings=True)
sims = E[-1] @ E[:-1].T

for doc, s in sorted(zip(docs, sims), key=lambda t: -t[1]):
    print(f"{s:.3f}  {doc}")