When you're building NLP systems—whether for classification, translation, or text generation—you need concrete ways to measure success. These evaluation metrics aren't just numbers to report; they reveal what kind of errors your model makes, how well it balances competing goals, and whether it's actually solving the problem you care about. Understanding when to use each metric is just as important as knowing the formulas.
You're being tested on your ability to select appropriate metrics for specific tasks, interpret what different scores mean, and recognize the trade-offs between metrics like precision and recall. Don't just memorize formulas—know what each metric prioritizes, when it fails, and how it connects to real-world NLP applications like spam detection, machine translation, summarization, and information retrieval.
Classification metrics evaluate how well your model assigns correct labels to inputs. The key insight is that different errors have different costs: sometimes false positives hurt more, sometimes false negatives do.
Compare: Precision vs. Recall—both use true positives in the numerator, but precision penalizes false positives while recall penalizes false negatives. If asked to choose a metric for medical diagnosis, recall is typically preferred because missing a disease (FN) is worse than a false alarm (FP).
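To make the trade-off concrete, here is a minimal sketch that derives precision, recall, and F1 from confusion-matrix counts; the counts themselves are made up purely for illustration.

```python
# Minimal sketch: precision, recall, and F1 from confusion-matrix counts.
# The counts below are made-up numbers for illustration only.
tp, fp, fn = 80, 10, 30   # true positives, false positives, false negatives

precision = tp / (tp + fp)          # of everything flagged positive, how much was right
recall    = tp / (tp + fn)          # of all actual positives, how much was found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.89 recall=0.73 f1=0.80
```

Notice that the same model can look strong on precision and weak on recall at once; which number matters depends on whether false positives or false negatives are costlier for the task.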
When models produce text (translations, summaries, responses), we need metrics that compare generated output to reference texts. These metrics capture how closely the output matches human-quality text.
Compare: BLEU vs. ROUGE—both measure n-gram overlap, but BLEU was designed for translation (precision-focused with brevity penalty) while ROUGE targets summarization (recall-focused). Use BLEU for MT systems, ROUGE for summarizers.
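The distinction is easy to see in code. The sketch below computes a simplified unigram precision (the core idea behind BLEU, ignoring the brevity penalty and higher-order n-grams) and unigram recall (the core idea behind ROUGE-1). The example sentences are made up, and a real evaluation would use a library implementation rather than this toy version.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Count clipped unigram matches between candidate and reference tokens."""
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    return sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)

reference = "the cat sat on the mat".split()   # 6 tokens
candidate = "the cat sat".split()              # 3 tokens, all of them correct

matches = unigram_overlap(candidate, reference)            # 3 matches
bleu_style_precision = matches / len(candidate)            # 3/3 = 1.00
rouge_style_recall   = matches / len(reference)            # 3/6 = 0.50

print(f"precision={bleu_style_precision:.2f} recall={rouge_style_recall:.2f}")
```

The short candidate scores perfect precision but poor recall, which is exactly why BLEU adds a brevity penalty and why ROUGE's recall focus suits summarization, where coverage of the reference matters.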
Probability-based metrics such as perplexity evaluate how well your model's predicted distributions match the actual text. Lower perplexity means the model assigns higher probability to the test data, so it is less surprised by what it sees.
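Concretely, perplexity is the exponentiated average negative log-likelihood per token. The sketch below computes it from a short list of per-token probabilities; those probabilities are made up for illustration rather than produced by a real model.

```python
import math

# Per-token probabilities P(w_i | context) for a tiny test sequence (made up).
token_probs = [0.25, 0.10, 0.50, 0.05, 0.20]

# Perplexity = exp of the average negative log-likelihood per token.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

print(f"perplexity={perplexity:.2f}")  # lower means the model is less surprised
```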
When models rank or retrieve documents, we need metrics that evaluate the quality of the entire ranked list, not just individual predictions.
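As one example, Mean Average Precision (MAP) averages the precision at each rank where a relevant document appears, then averages that value across queries. The sketch below uses binary relevance labels and assumes every relevant document appears somewhere in the ranked list; the query results are made up for illustration.

```python
def average_precision(relevance):
    """Average precision for one ranked list of binary relevance labels (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

# Made-up ranked results for two queries (1 = relevant document, 0 = not relevant).
queries = [
    [1, 0, 1, 0, 0],   # relevant docs at ranks 1 and 3
    [0, 1, 1, 0, 1],   # relevant docs at ranks 2, 3, and 5
]
mean_ap = sum(average_precision(q) for q in queries) / len(queries)
print(f"MAP={mean_ap:.3f}")
```

Because each hit is weighted by its rank, a system that buries relevant documents deep in the list scores lower even if it eventually retrieves all of them.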
Before training models, you need reliable labeled data. Agreement metrics assess whether human annotators label consistently, which determines dataset quality.
Compare: Accuracy vs. Cohen's Kappa—both measure agreement, but accuracy ignores chance agreement. If two annotators each assign binary labels uniformly at random, they'll agree about 50% of the time purely by chance. Kappa subtracts that expected agreement, making it the preferred metric for annotation reliability.
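A small sketch makes the correction explicit: observed agreement minus chance agreement, divided by the maximum possible improvement over chance. The two annotation lists below are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (nominal labels)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # raw agreement rate
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the same label.
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Made-up binary sentiment annotations for 10 items.
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
print(f"kappa={cohens_kappa(a, b):.2f}")   # 0.58, versus a raw agreement rate of 0.80
```

The raw agreement of 0.80 looks impressive, but once chance agreement is subtracted the kappa of 0.58 tells a more modest story, which is exactly the gap the last review question below asks you to explain.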
| Concept | Best Metrics |
|---|---|
| Binary/multiclass classification | Accuracy, Precision, Recall, F1 Score |
| Imbalanced datasets | F1 Score, Precision, Recall (avoid Accuracy alone) |
| Machine translation | BLEU Score |
| Text summarization | ROUGE Score |
| Language model quality | Perplexity |
| Search/retrieval systems | Mean Average Precision (MAP) |
| Annotation reliability | Cohen's Kappa |
| Error analysis | Confusion Matrix |
You're building a medical diagnosis system where missing a disease is far worse than a false alarm. Which metric should you prioritize—precision or recall—and why?
A spam classifier achieves 98% accuracy but only 40% F1 score. What does this tell you about the dataset, and why is F1 more informative here?
Compare BLEU and ROUGE: which emphasizes precision vs. recall, and which would you use for a summarization task?
Your language model achieves perplexity of 15 on test data, while a competitor's achieves 45. Which model is better, and what does perplexity actually measure?
Two annotators labeled 1,000 sentences for sentiment. Their accuracy (agreement rate) is 85%, but Cohen's Kappa is only 0.60. Explain why these numbers differ and which better reflects annotation quality.