1. NLP Concepts and Pipeline

1.1 Agenda

Estimated reading time: ~11 minutes

Learning Outcomes

Define Natural Language Processing and articulate (diễn đạt chính xác) why it is a distinct (riêng biệt) challenge for machines
Identify the core NLP tasks and give a concrete example for each
Describe the stages of an NLP pipeline and the purpose of each stage
Recognize the responsible AI considerations unique (đặc thù) to NLP systems

1.2 Glossary

Term	Quick Explanation
Token	The smallest unit an NLP model processes, usually a word or part of a word (subword). "Machine learning" → ["Machine", "learning"]
Tokenization	The process of splitting text into tokens for the model to process.
Stop Words	Common words with little discriminative value, such as "the", "is", and "a", that are often filtered out during preprocessing.
Embedding	A representation of a word as a multi-dimensional numeric vector; semantically similar words have vectors that lie close together.
Intent (Ý định)	The purpose behind a user's utterance. "Book a flight to Hanoi" → intent: `book_flight`.
Entity (Thực thể)	A specific piece of information extracted from text. "Book a flight to Hanoi on Friday" → entities: `location=Hanoi`, `date=Friday`.
Corpus (Kho ngữ liệu)	A large collection of text used to train or evaluate an NLP model.
Transformer	A deep learning architecture built on the attention mechanism; the foundation of GPT, BERT, and most modern LLMs.

2. Problem Statement

Human language is ambiguous (mơ hồ), contextual (phụ thuộc ngữ cảnh), and constantly evolving (liên tục thay đổi). These properties make it fundamentally (về bản chất) difficult for machines:

Ambiguity: "Bank" can mean a financial institution (tổ chức tài chính) or a river bank. Context determines (xác định) meaning.
Implicit context: "It was too hot to eat" — what was too hot? The food or the weather? Humans infer (suy luận) automatically; machines struggle (gặp khó khăn).
Synonymy and paraphrase (Đồng nghĩa và diễn đạt khác): "The CEO resigned" and "The top executive stepped down" convey (truyền đạt) the same meaning with no shared keywords.
Cultural and domain variation: Medical jargon (biệt ngữ y tế), slang (tiếng lóng), and regional dialects (phương ngữ) create enormous vocabulary variation.

NLP uses ML and deep learning to address these challenges — not by encoding rules, but by learning patterns from massive text corpora (kho ngữ liệu khổng lồ).
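The synonymy problem above can be made concrete with a small sketch. It compares the two example sentences by keyword overlap alone; the stop-word list and the function names are assumptions made up for this illustration, not part of any library.

```python
# Toy stop-word list (an assumption for this sketch, not a standard resource).
STOP_WORDS = {"the", "a", "an", "is", "was"}


def content_words(sentence: str) -> set[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}


def jaccard_overlap(a: str, b: str) -> float:
    """Shared content words divided by all content words in either sentence."""
    words_a, words_b = content_words(a), content_words(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


s1 = "The CEO resigned"
s2 = "The top executive stepped down"
print(jaccard_overlap(s1, s2))  # 0.0: no shared content words, same meaning
```

A keyword-matching rule would treat these two sentences as unrelated, which is why learned representations are needed.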

3. What is Natural Language Processing?

3.1 Definition

Natural Language Processing (NLP) is a subfield of AI that focuses on enabling machines to understand, interpret (diễn giải), and generate human language — in both written and spoken form — with sufficient (đủ) accuracy to be useful in real-world applications.
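One way to picture the "interpret" part of this definition is as a mapping from a free-text request to structured data that software can act on. The sketch below reuses the glossary's flight-booking example. It relies on a single hand-written pattern purely to show the shape of the output; the `Interpretation` class and `FLIGHT_PATTERN` are hypothetical names, and a real system would use a trained model rather than a regular expression.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """Structured result: one intent plus any extracted entities."""
    intent: str
    entities: dict[str, str] = field(default_factory=dict)


# Hypothetical hand-written pattern; stands in for a trained model.
FLIGHT_PATTERN = re.compile(
    r"book a flight to (?P<location>\w+)(?: on (?P<date>\w+))?", re.IGNORECASE
)


def interpret(utterance: str) -> Interpretation:
    """Map an utterance to an intent and entities, or 'unknown' if no match."""
    match = FLIGHT_PATTERN.search(utterance)
    if match is None:
        return Interpretation(intent="unknown")
    # Keep only the named groups that actually matched.
    entities = {k: v for k, v in match.groupdict().items() if v is not None}
    return Interpretation(intent="book_flight", entities=entities)


print(interpret("Book a flight to Hanoi on Friday"))
```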

3.2 Definition Anatomy

"Understand" — Extract the meaning from text, not just the literal (nghĩa đen) words. This includes intent, sentiment, and implied (ẩn ý) context.
"Interpret" — Map unstructured (phi cấu trúc) language to structured (có cấu trúc) data usable by software (e.g., intent + entities for a booking system).
"Generate" — Produce coherent (mạch lạc), contextually appropriate (phù hợp ngữ cảnh) language as output — the defining capability of LLMs.
"Both written and spoken" — NLP spans text and speech; Azure AI Speech handles the audio-to-text conversion (chuyển đổi) before NLP models process it.

4. Core NLP Tasks

4.1 Overview

4.2 Core Tasks Explained

Task	What It Extracts	Example
Sentiment Analysis	Emotional tone (cảm xúc) of text	Review: "Terrible service, never again." → Negative (positive score: 0.03)
Aspect-based Sentiment (Cảm xúc theo khía cạnh)	Sentiment for specific (cụ thể) aspects of an entity	"Great camera but terrible battery life" → camera: Positive, battery: Negative
Key Phrase Extraction	Most informative (có giá trị thông tin nhất) concepts	Article → ["supply chain disruption", "Q3 earnings", "margin pressure"]
Language Detection	Language of input text	"Xin chào" → Vietnamese (vi), confidence: 0.99
Named Entity Recognition (NER)	Structured entities (thực thể có cấu trúc)	"Elon Musk launched SpaceX in 2002" → PERSON: Elon Musk, ORG: SpaceX, DATE: 2002
Relation Extraction (Trích xuất quan hệ)	Semantic relationships (quan hệ ngữ nghĩa) between entities	"Elon Musk founded SpaceX" → (Elon Musk) [FOUNDED] (SpaceX)
PII Detection	Personally Identifiable Information (dữ liệu nhận dạng cá nhân)	"Call me at 0912-345-678" → PHONE: [REDACTED]
Text Classification	Predefined (định sẵn) labels for text	Support ticket → Category: "billing", Priority: "urgent"
Summarization — Extractive (Trích xuất)	Selects (chọn lọc) the most important sentences	Returns actual sentences from the document — no new text generated
Summarization — Abstractive (Tổng hợp)	Generates (tạo ra) a new summary in different words	Generates sentences not in the original document
Machine Translation	Converts text between languages	"Bonjour monde" → "Hello world" (FR → EN)
Transliteration (Phiên âm)	Converts script (chữ viết) without translating meaning	"Москва" → "Moskva": same pronunciation, Cyrillic script rendered in Latin script
Intent Recognition	User's goal (mục tiêu) in a conversational utterance (lượt nói)	"Book a table for two tonight" → intent: `reserve_table`
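As a rough illustration of the PII Detection row, the sketch below redacts phone numbers written in the same format as the table's example. The pattern is an assumption that covers only that one format; production PII detection relies on trained models and handles many more categories.

```python
import re

# Assumed phone format for this sketch: 4 digits, then two groups of 3,
# separated by hyphens (as in "0912-345-678").
PHONE_PATTERN = re.compile(r"\b\d{4}-\d{3}-\d{3}\b")


def redact_phones(text: str) -> str:
    """Replace every phone number matching the assumed format."""
    return PHONE_PATTERN.sub("[REDACTED]", text)


print(redact_phones("Call me at 0912-345-678"))  # Call me at [REDACTED]
```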

5. The NLP Pipeline

5.1 Pipeline Stages

Stage	What Happens	Example
Preprocessing	Remove noise (nhiễu): lowercase, strip (loại bỏ) HTML tags, drop stray punctuation, normalize (chuẩn hóa) whitespace, expand contractions (viết tắt)	"Don't CLICK HERE!!!" → "do not click here"
Tokenization	Split text into tokens; handle punctuation (dấu câu) and special characters	"I'm running." → ["I", "'m", "running", "."]
Feature Extraction	Convert tokens into numeric representations (biểu diễn số) the model can process	Word embeddings (embedding từ), TF-IDF, BERT subword tokens
Model Inference	Apply the trained NLP model to produce predictions	Sentiment model outputs: Positive: 0.91, Negative: 0.09
Post-processing	Format output into structured (có cấu trúc) data for the application	Return JSON with `{"sentiment": "positive", "score": 0.91}`
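The five stages in the table can be chained into a toy end-to-end pipeline. The sketch below is a simplification under stated assumptions: the contraction list and the sentiment lexicon are invented for the example, the "model" is a plain word-weight sum rather than a trained network, and punctuation is removed during preprocessing, so the tokenizer only needs to split on whitespace.

```python
import json
import re
from collections import Counter

CONTRACTIONS = {"don't": "do not", "i'm": "i am"}  # toy list (assumption)
LEXICON = {"great": 1, "love": 1, "terrible": -1, "never": -1}  # toy weights


def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text).lower()  # strip HTML tags, lowercase
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)          # expand contractions
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # drop punctuation
    return re.sub(r"\s+", " ", text).strip()      # normalize whitespace


def tokenize(text: str) -> list[str]:
    return text.split()


def extract_features(tokens: list[str]) -> Counter:
    return Counter(tokens)                        # bag-of-words counts


def infer(features: Counter) -> float:
    # Stand-in for a trained model: sum the lexicon weights of seen tokens.
    return float(sum(LEXICON.get(tok, 0) * n for tok, n in features.items()))


def postprocess(score: float) -> str:
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return json.dumps({"sentiment": label, "score": score})


def run_pipeline(text: str) -> str:
    return postprocess(infer(extract_features(tokenize(preprocess(text)))))


print(preprocess("Don't CLICK HERE!!!"))               # do not click here
print(run_pipeline("Terrible service, never again."))
```

Each function corresponds to one row of the table, so swapping in a real tokenizer or model would change only that one stage.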

AI-900 note: You need to understand the role of each pipeline stage, not implement (thực hiện) them. Know that "preprocessing is why NLP models need clean text" and "tokenization determines how the model sees words."

5.2 Deep Dive: Feature Extraction Techniques

Feature Extraction is the step that converts human-readable tokens into numbers a model can compute. Three techniques are important to understand conceptually:

TF-IDF (Term Frequency – Inverse Document Frequency)

Idea: A word that appears often in one document but rarely across the whole corpus has high discriminative value.

Component	Formula Intuition	Example
TF (Tần suất trong tài liệu)	The word appears 5 times in a 100-word document → TF = 0.05	"ngân hàng" in a banking article = high TF
IDF (Nghịch đảo tần suất trong corpus)	Word appears in only 3 of 10,000 documents → IDF is high	"ngân hàng" appears in almost every finance doc → IDF is low
TF-IDF score	TF × IDF	High score = important AND distinctive (phân biệt) word

Limitation: TF-IDF treats each word independently (độc lập) — it does not understand that "bank" and "financial institution" mean the same thing.
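The TF × IDF calculation can be worked through on a tiny corpus. This sketch uses one common unsmoothed variant (raw term frequency and a natural-log IDF); libraries often apply smoothing or normalization, so their numbers will differ. The three-document corpus is invented for the example.

```python
import math


def tf(term: str, doc: list[str]) -> float:
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse document frequency: log(total docs / docs containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)


corpus = [
    "the bank approved the loan".split(),
    "the bank raised interest rates".split(),
    "the bank reported margin pressure".split(),
]

# "bank" is in every document, so its IDF is log(3/3) = 0.
print(tf_idf("bank", corpus[0], corpus))            # 0.0
# "loan" is in one of three documents, so it scores higher.
print(round(tf_idf("loan", corpus[0], corpus), 3))  # 0.22
```

The word that appears everywhere scores zero even though it is frequent, which matches the low-IDF case in the table.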

Word Embedding (Nhúng từ)

Idea: Represent each word as a multi-dimensional numeric vector so that words with related meanings have vectors that lie close together in vector space.

"king"  → [0.82, 0.41, -0.23, 0.95, ...] (300 dimensions)
"queen" → [0.80, 0.39, -0.28, 0.94, ...] (similar vector)
"car"   → [-0.12, 0.73, 0.56, -0.44, ...] (distant vector)

The famous (nổi tiếng) example that shows embeddings capture semantic relationships (quan hệ ngữ nghĩa):

king − man + woman ≈ queen

This works because the difference (sự khác biệt) between "king" and "man" in vector space roughly encodes the concept of royalty (hoàng tộc) with gender removed; adding "woman" puts gender back and lands near "queen." Word embeddings are a large part of how modern NLP models capture synonyms (từ đồng nghĩa) and analogies (tương tự).
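The vector arithmetic can be tried on hand-made vectors. The three-dimensional embeddings below are invented for this sketch (roughly "royalty", "gender", "vehicle") so that the arithmetic is easy to follow; real embeddings have hundreds of dimensions learned from text, and the analogy holds only approximately.

```python
import math

# Invented toy vectors: [royalty, gender, vehicle]. Not real embeddings.
EMBEDDINGS = {
    "king":  [0.9, 0.3, 0.0],
    "queen": [0.9, -0.3, 0.0],
    "man":   [0.1, 0.3, 0.0],
    "woman": [0.1, -0.3, 0.0],
    "car":   [0.0, 0.0, 1.0],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a: str, b: str, c: str) -> str:
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    candidates = [w for w in EMBEDDINGS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


print(analogy("king", "man", "woman"))                            # queen
print(round(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]), 2))  # 0.8
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["car"]))              # 0.0
```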

Subword Tokenization and the OOV Problem

Problem: Classic word-level tokenization breaks on rare (hiếm gặp) or new words. A model trained before 2020 has never seen "COVID-19", so the word is Out-of-Vocabulary (OOV, ngoài từ điển). A word-level model can only map it to a generic unknown token, which discards its meaning.

Subword solution: Break unknown words into known (đã biết) subword units (đơn vị subword):

"COVID-19"  → ["CO", "##VID", "-", "19"]
"unhappiness" → ["un", "##happi", "##ness"]

This is how BERT, GPT, and modern LLMs handle any word — including brand names, technical terms, and neologisms (từ mới) — without needing to retrain on new vocabulary.
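The splits shown above can be reproduced with a greedy longest-match-first procedure in the style of WordPiece. This is a minimal sketch: the vocabulary is a hypothetical hand-picked set (real vocabularies are learned from a corpus and contain tens of thousands of pieces), and the text is first split on punctuation, which is why "-" and "19" carry no "##" prefix.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"un", "##happi", "##ness", "CO", "##VID", "-", "19"}


def wordpiece(word: str, vocab: set[str]) -> list[str]:
    """Greedily take the longest vocabulary piece at each position."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Split on whitespace and punctuation, then apply wordpiece per word."""
    words = re.findall(r"\w+|[^\w\s]", text)
    return [piece for word in words for piece in wordpiece(word, vocab)]


print(tokenize("COVID-19"))     # ['CO', '##VID', '-', '19']
print(tokenize("unhappiness"))  # ['un', '##happi', '##ness']
```

With a realistic vocabulary that includes single characters, the `[UNK]` fallback is rarely reached, which is what lets subword models cover words they never saw as a whole.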

6. Responsible AI for NLP

NLP systems carry unique (đặc thù) responsible AI risks:

Risk	Explanation	Mitigation (Biện pháp giảm thiểu)
Bias in training data	If training text reflects historical discrimination (phân biệt đối xử), the model inherits (kế thừa) those biases	Audit (kiểm tra) training corpora; use fairness metrics across demographic groups
PII exposure (Lộ dữ liệu cá nhân)	Text fed to an NLP API may contain sensitive personal information	Apply PII detection and redaction (che đi) before sending to any external API
Language bias (Thiên lệch ngôn ngữ)	Models trained primarily (chủ yếu) on English data perform worse on other languages	Use multilingual (đa ngôn ngữ) models; test explicitly on target languages
Hallucination in generation	Generative NLP may produce confident (tự tin) but factually wrong (sai thực tế) outputs	Ground answers in retrieved sources with Retrieval-Augmented Generation (RAG); add human review (xem xét con người) for high-stakes outputs
Harmful content amplification (Khuếch đại nội dung có hại)	A language model can generate or reinforce (củng cố) toxic (độc hại) content at scale	Apply content filters (bộ lọc nội dung) and Azure AI Content Safety

7. Discussion Questions

Q1 — Ambiguity in Practice: A bank deploys a sentiment analysis system to classify customer complaint emails. A customer writes: "I can't believe how fast my loan was approved — I'm shocked!" The word "shocked" often correlates (tương quan) with negative sentiment (cảm xúc tiêu cực). How should the NLP system handle this? What does this case reveal about the limits of word-level (cấp độ từ) sentiment analysis?

Q2 — PII Risk: A company uses Azure AI Language's NER and Sentiment Analysis API to analyze customer service transcripts (bản ghi) — which contain customer names, phone numbers, and health information. The data is sent raw (thô) to the API. What Responsible AI violation (vi phạm) is occurring, and what specific Azure AI Language capability should be applied first in the pipeline?

Q3 — Extractive vs. Abstractive Summarization: A law firm needs to summarize lengthy court decisions (bản án dài) for clients who are non-lawyers. A junior developer proposes using extractive summarization. A senior engineer counters (phản biện) that abstractive summarization is necessary. What is the fundamental difference, and why does the use case (tình huống sử dụng) favor (ủng hộ) one over the other?

Made by Anh Tu - Share to be shared
