Annotated corpus

An annotated corpus is a collection of texts labeled with extra information, like part of speech, semantic roles, or pragmatic features. In Intro to Semantics and Pragmatics, it gives you real language data to study meaning in use.

Last updated July 2026

What is an annotated corpus?

In Intro to Semantics and Pragmatics, an annotated corpus is a collection of texts that has been tagged with extra information so you can study meaning more systematically. The text itself is the raw corpus, and the annotations are the added labels, such as part-of-speech tags, word-sense labels, discourse markers, speech-act tags, or notes about context and reference.
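To make the split between raw text and annotation layers concrete, here is a minimal sketch in Python. The class names (Token, Utterance), the tag values, and the sense label are all hypothetical choices for illustration, not a standard corpus format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str                     # the raw word as it appears in the text
    pos: str                      # part-of-speech tag (illustrative tagset)
    sense: Optional[str] = None   # optional word-sense label


@dataclass
class Utterance:
    tokens: List[Token]
    speech_act: Optional[str] = None  # optional pragmatic label for the whole utterance


# One annotated utterance: an interrogative form that functions as a request.
utterance = Utterance(
    tokens=[
        Token("Can", "AUX"),
        Token("you", "PRON"),
        Token("open", "VERB"),
        Token("the", "DET"),
        Token("window", "NOUN", sense="window.opening_in_wall"),
        Token("?", "PUNCT"),
    ],
    speech_act="request",
)

# The raw corpus is recoverable by ignoring the labels.
raw_text = " ".join(t.text for t in utterance.tokens)

# The labels allow queries the raw text cannot answer directly.
nouns = [t.text for t in utterance.tokens if t.pos == "NOUN"]
```

Here raw_text holds only the words, while nouns and utterance.speech_act come entirely from the annotation layers.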

Those labels let linguists ask questions that would be hard to answer by just reading a few examples. For instance, you can look at how often a word appears with certain nearby words, how a phrase is used across genres, or whether a particular expression tends to signal irony, politeness, or a request. That makes the corpus useful for both semantics, which focuses on meaning in the words and sentences themselves, and pragmatics, which focuses on how context shapes interpretation.
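The "nearby words" question can be sketched as a small counting routine. This is a toy example under stated assumptions: the three sentences, the tags, and the function name neighbours are invented for illustration, and a real study would use a much larger corpus.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, part-of-speech) pairs.
corpus = [
    [("she", "PRON"), ("drank", "VERB"), ("strong", "ADJ"), ("tea", "NOUN")],
    [("strong", "ADJ"), ("tea", "NOUN"), ("helps", "VERB")],
    [("he", "PRON"), ("likes", "VERB"), ("sweet", "ADJ"), ("tea", "NOUN")],
]


def neighbours(corpus, target, window=1, pos=None):
    """Count words within `window` positions of `target`.

    If `pos` is given, only neighbours carrying that tag are counted.
    """
    counts = Counter()
    for sentence in corpus:
        for i, (word, _) in enumerate(sentence):
            if word != target:
                continue
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # skip the target word itself
                neighbour, tag = sentence[j]
                if pos is None or tag == pos:
                    counts[neighbour] += 1
    return counts


# Which adjectives appear directly next to "tea"?
adjectives_near_tea = neighbours(corpus, "tea", window=1, pos="ADJ")
```

The part-of-speech filter is what the annotation adds: without tags, the count would mix adjectives with verbs and other neighbours.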

A simple way to think about it is that the corpus is the data set, and the annotation is the layer that makes patterns visible. If you want to compare how a word like bank is used in financial writing versus river contexts, annotation can separate the different senses. If you want to study implied meaning, annotation can mark places where speakers use implicature, presupposition, or a speech act that is not literally stated in the sentence.
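The bank comparison can be sketched as a tally of sense labels per genre. The genre names, sense labels, and counts below are made up for illustration; they are not drawn from any real corpus.

```python
from collections import Counter, defaultdict

# Hypothetical sense-annotated occurrences of "bank": (genre, sense label).
occurrences = [
    ("finance_news", "bank.financial_institution"),
    ("finance_news", "bank.financial_institution"),
    ("finance_news", "bank.financial_institution"),
    ("finance_news", "bank.river_edge"),
    ("nature_writing", "bank.river_edge"),
    ("nature_writing", "bank.river_edge"),
    ("nature_writing", "bank.river_edge"),
    ("nature_writing", "bank.financial_institution"),
]


def sense_distribution(occurrences):
    """Return, for each genre, the proportion of occurrences carrying each sense."""
    by_genre = defaultdict(Counter)
    for genre, sense in occurrences:
        by_genre[genre][sense] += 1
    return {
        genre: {sense: n / sum(counts.values()) for sense, n in counts.items()}
        for genre, counts in by_genre.items()
    }


distribution = sense_distribution(occurrences)
```

Without the sense labels, every occurrence would just be the string "bank", and the difference between the two genres would be invisible.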

Creating an annotated corpus can be slow because someone has to decide what each label means and apply it consistently. Sometimes humans do the tagging, sometimes software helps, and often both are involved. Good annotation matters because if the labels are inconsistent, the pattern you think you see may just be noise.
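One common way to check consistency is to have two annotators label the same items and compare their agreement with what chance alone would produce. The sketch below uses Cohen's kappa, which is a standard agreement statistic but is not named in this guide, so treat it as one possible choice. The labels and the function name cohens_kappa are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)

    # Proportion of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected if each annotator assigned labels at their own base rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical speech-act labels from two annotators on the same four utterances.
annotator_a = ["request", "request", "statement", "statement"]
annotator_b = ["request", "statement", "statement", "statement"]

kappa = cohens_kappa(annotator_a, annotator_b)  # 0.5 for this toy data
```

The annotators agree on three of four items (0.75), but half of that agreement is expected by chance, so kappa comes out at 0.5. A low value suggests the label definitions need tightening before the corpus is used as evidence.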

In this course, annotated corpora connect theory to real language. They give you a way to test claims about meaning with actual examples instead of relying only on invented sentences.

Why annotated corpora matter in Intro to Semantics and Pragmatics

Annotated corpora matter because semantics and pragmatics become much clearer when you can study how people actually use language, not just how examples are imagined in a textbook. If a chapter asks whether a phrase has one meaning or several, an annotated corpus can show the different senses in real contexts. If a discussion turns to implicature, you can look for patterns in how speakers hint, hedge, or leave information unsaid.

This term also bridges theory and method. A course on meaning often asks you to analyze reference, truth conditions, presupposition, or speech acts, and an annotated corpus shows how those ideas are tracked in data. That is especially useful in corpus-based and computational semantics, where researchers need labeled examples before they can build or evaluate models.

It also teaches you to be careful about evidence. A tiny set of hand-picked examples can make a pattern look stronger than it is, while a well-annotated corpus can show variation across genres, speakers, or time periods. That makes it easier to separate a real semantic pattern from a one-off coincidence.

How an annotated corpus connects across the course

Corpus linguistics

Annotated corpora are a tool inside corpus linguistics. Corpus linguistics looks at large collections of real text to find patterns, and annotation adds labels that make those patterns easier to measure and compare. In other words, corpus linguistics is the broader approach, while an annotated corpus is one of the main data sources it uses.

Semantic annotation

Semantic annotation is the labeling layer that marks meaning-related information in a corpus. That might include word senses, semantic roles, or relations between expressions. If you are working with an annotated corpus in semantics, this is often the specific type of tagging you are looking at when the goal is to study meaning rather than just structure.

Natural language processing

Natural language processing uses annotated corpora to train and test language models. The labels act as reference answers: a system learns from one portion of the labeled data and is scored against another portion it has not seen. In a semantics course, this connection shows how linguistic theory can feed into tools that process language automatically.
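As a rough illustration of training and testing on labeled data, the sketch below fits a very simple baseline that predicts the most frequent sense seen with each context cue, then scores it on held-out examples. The data, labels, and function names (fit, predict) are hypothetical, and real systems are far more elaborate than this.

```python
from collections import Counter, defaultdict

# Hypothetical labeled examples for "bank": (context cue word, sense label).
train = [
    ("loan", "financial"),
    ("loan", "financial"),
    ("deposit", "financial"),
    ("river", "river_edge"),
    ("river", "river_edge"),
]
test = [
    ("loan", "financial"),
    ("river", "river_edge"),
    ("fishing", "river_edge"),  # cue never seen in training
]


def fit(examples):
    """Learn the most frequent sense per cue, plus an overall fallback sense."""
    per_cue = defaultdict(Counter)
    overall = Counter()
    for cue, label in examples:
        per_cue[cue][label] += 1
        overall[label] += 1
    table = {cue: counts.most_common(1)[0][0] for cue, counts in per_cue.items()}
    fallback = overall.most_common(1)[0][0]
    return table, fallback


def predict(model, cue):
    """Return the learned sense for a cue, or the fallback for unseen cues."""
    table, fallback = model
    return table.get(cue, fallback)


model = fit(train)
correct = sum(predict(model, cue) == gold for cue, gold in test)
accuracy = correct / len(test)
```

The baseline gets the two familiar cues right and misses the unseen one, so accuracy is two out of three. Both the learning step and the scoring step depend on the gold labels, which is why annotation quality limits how far the evaluation can be trusted.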

semantic similarity

Semantic similarity is often measured using data from annotated corpora, especially when researchers want to compare word meanings or sentence meanings in context. The annotations help show whether two expressions are genuinely close in meaning or just appear together often. This makes similarity analysis more precise than simple word matching.

Is the annotated corpus on the Intro to Semantics and Pragmatics exam?

A quiz question or short-answer prompt may give you a scenario and ask you to identify why an annotated corpus is useful. Your job is to explain that it is a labeled text collection that lets researchers trace meaning, context, and usage patterns in real language. If the item mentions computational semantics, connect the corpus to training data, sense distinctions, or tagging for semantic and pragmatic features.

In a passage analysis, you might be asked to say what kind of annotation would help study a target word, a presupposition trigger, or a speech act. A strong answer names the label type and explains what it would reveal. For example, you could mention semantic annotation for word senses or pragmatic tagging for context-dependent uses like implicature or politeness.

Key things to remember about annotated corpora

An annotated corpus is a text collection with added labels that mark features like syntax, meaning, or pragmatic use.
In Intro to Semantics and Pragmatics, it helps you study how meaning works in real language instead of only in made-up examples.
The annotations make it possible to compare senses, track patterns, and test claims about context-dependent interpretation.
Good annotation has to be consistent, or the patterns you find may be misleading.
Annotated corpora are especially useful in corpus-based semantics and natural language processing because they give models and researchers structured evidence.

Frequently asked questions about annotated corpora

What is an annotated corpus in Intro to Semantics and Pragmatics?

It is a collection of texts that has been labeled with extra linguistic information, such as part of speech, semantic tags, or pragmatic features. In this course, it gives you real language data for studying meaning, reference, and context. The annotations turn a plain text collection into something you can analyze systematically.

How is an annotated corpus different from a regular corpus?

A regular corpus is just a large set of texts. An annotated corpus adds labels that mark features you want to study, like grammatical categories, word senses, or discourse functions. Those labels make it much easier to test ideas about semantics and pragmatics across many examples.

What kinds of annotations can an annotated corpus have?

It can include syntactic, semantic, or pragmatic labels. For example, one corpus might tag parts of speech, another might mark sense distinctions, and another might identify speech acts or context-dependent meanings. The type of annotation depends on the question being studied.

Why do linguists use annotated corpora for meaning?

They let you study patterns in actual language use instead of relying only on intuition. That is useful when you want to compare meanings across contexts, find recurring pragmatic effects, or see how a word behaves in different genres. They also give computational models training data for language tasks.