15.3 Corpus-based and computational semantics

Written by the Fiveable Content Team • Last updated August 2025

Corpus-based semantics uses large collections of text to study word meanings and relationships. By analyzing how words appear together in real-world language data, researchers can uncover semantic patterns and build computational models that represent meaning as vectors in multidimensional space. These techniques matter for semantics and pragmatics because they offer a way to test theoretical claims about meaning using actual language use at massive scale.

Corpus-based Semantics

Use of large-scale language corpora

A language corpus is a large, structured collection of text data used for linguistic analysis. Two well-known examples are the British National Corpus (100 million words of British English) and the Corpus of Contemporary American English (COCA, over 1 billion words). These collections let researchers study semantic relationships by looking at co-occurrence patterns, that is, which words tend to appear near each other.

This approach rests on a key theoretical idea: the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings. If "cat" and "dog" consistently appear alongside words like "pet," "feed," and "veterinarian," the model treats them as semantically related.

  • Collocations are word pairs that frequently appear together, like "strong coffee" or "powerful argument." Statistical measures like mutual information and t-scores quantify how strongly two words are associated beyond what you'd expect by chance.
  • Words with similar distributional profiles across a corpus are likely to share meaning. This is what makes it possible to represent words as vectors in a high-dimensional semantic space, where proximity reflects semantic similarity.
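Association measures like mutual information can be sketched directly from co-occurrence counts. The following is a minimal illustration of pointwise mutual information (PMI) over a tiny hypothetical corpus; the `pmi_scores` function and the example sentences are invented for demonstration, and a real study would use a corpus of millions of words.

```python
import math
from collections import Counter

def pmi_scores(sentences, window=2):
    """Compute pointwise mutual information for word pairs.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ); values well above 0
    suggest the pair co-occurs more often than chance alone predicts.
    """
    word_counts = Counter()
    pair_counts = Counter()
    total_words = 0
    total_pairs = 0
    for sent in sentences:
        tokens = sent.lower().split()
        word_counts.update(tokens)
        total_words += len(tokens)
        for i, w in enumerate(tokens):
            # Count pairs within a small window to the right of each word.
            for v in tokens[i + 1 : i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
                total_pairs += 1
    scores = {}
    for (w, v), c in pair_counts.items():
        p_xy = c / total_pairs
        p_x = word_counts[w] / total_words
        p_y = word_counts[v] / total_words
        scores[(w, v)] = math.log2(p_xy / (p_x * p_y))
    return scores

corpus = [
    "strong coffee in the morning",
    "she drinks strong coffee daily",
    "a powerful argument won the case",
    "the lawyer made a powerful argument",
]
scores = pmi_scores(corpus)
print(scores[("coffee", "strong")])  # positive PMI: the pair recurs together
```

On this toy corpus, "strong coffee" and "powerful argument" both score well above zero, matching the intuition that they are collocations.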

Computational techniques for word meanings

Several techniques turn raw co-occurrence data into useful representations of meaning:

Vector space models represent each word as a vector (a list of numbers) in a high-dimensional space. Each dimension corresponds to some contextual feature, typically derived from co-occurrence counts. To measure how semantically related two words are, you calculate the cosine similarity between their vectors. A cosine similarity close to 1 means the words are used in very similar contexts; close to 0 means they have little in common.
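The cosine computation itself is short. Below is a minimal sketch using hand-written toy co-occurrence vectors (the counts and context words are invented for illustration; real vectors are derived from corpus statistics).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy co-occurrence counts over context words [pet, feed, vet, engine, fuel]
vectors = {
    "cat": [10, 8, 7, 0, 1],
    "dog": [12, 9, 6, 1, 0],
    "car": [0, 1, 0, 9, 11],
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))  # near 1
print(cosine_similarity(vectors["cat"], vectors["car"]))  # near 0
```

Because "cat" and "dog" co-occur with the same context words, their cosine is close to 1, while "cat" and "car" score close to 0.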

Topic modeling discovers hidden thematic structure in a corpus. The most common technique is Latent Dirichlet Allocation (LDA), which treats each document as a mixture of topics and each topic as a distribution over words. For example, a topic model trained on news articles might discover a "sports" topic where words like "game," "score," and "team" have high probability. This helps uncover the underlying semantic organization of large text collections.
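LDA's generative story can be sketched in a few lines: to produce each word, first draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. The topic names, word probabilities, and mixture below are hypothetical; a real LDA model infers these distributions from data rather than being given them.

```python
import random

random.seed(0)

# Hypothetical topic-word distributions (a trained model would learn these).
topics = {
    "sports": {"game": 0.4, "score": 0.3, "team": 0.3},
    "finance": {"bank": 0.4, "market": 0.35, "loan": 0.25},
}

def sample_document(topic_mixture, length):
    """Generate words per LDA's generative story: for each word position,
    draw a topic from the document's mixture, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        word = random.choices(list(word_dist),
                              weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# A document that is 80% "sports" and 20% "finance".
doc = sample_document({"sports": 0.8, "finance": 0.2}, length=10)
print(doc)
```

Inference runs this story in reverse: given only the documents, LDA estimates which mixtures and word distributions most plausibly generated them.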

Word embeddings are dense vector representations learned through neural networks. Models like Word2Vec and GloVe learn these embeddings by training on the task of predicting context words from a target word (or vice versa). The resulting vectors capture both semantic and syntactic relationships. The classic demonstration: the vector for "king" minus "man" plus "woman" produces a vector close to "queen." This shows the model has learned an abstract relationship between gender and royalty without being explicitly taught it.
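The analogy test is just vector arithmetic plus a nearest-neighbor search. Here is a minimal sketch with hand-written 4-dimensional toy vectors chosen so the analogy holds; real Word2Vec or GloVe embeddings have hundreds of dimensions and are learned from corpora.

```python
import math

# Toy embeddings, invented for illustration only.
emb = {
    "king":  [0.9, 0.8, 0.1, 0.7],
    "man":   [0.1, 0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9, 0.1],
    "queen": [0.9, 0.0, 0.9, 0.7],
    "apple": [0.0, 0.1, 0.0, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest neighbor among the remaining vocabulary
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # "queen"
```

With real embeddings the offset vector does not land exactly on "queen", but "queen" is typically its nearest neighbor, which is what the classic demonstration reports.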

Computational Semantics

Performance of semantic models

How do you know if a semantic model is actually working? Researchers evaluate them on specific tasks:

  • Word sense disambiguation (WSD) tackles polysemy, where a single word has multiple meanings. The word "bank" could mean a financial institution or the edge of a river. WSD models use surrounding context to pick the correct sense. Performance is measured with accuracy (proportion of correct assignments) and F1 score (which balances precision and recall).
  • Semantic similarity evaluation compares model judgments against human judgments. Benchmark datasets like WordSim-353 and SimLex-999 contain word pairs that humans have rated for relatedness. A good model's cosine similarity scores should correlate strongly with those human ratings.
  • Sentiment analysis classifies text as positive, negative, or neutral. Classifiers are typically trained on labeled data using algorithms like Naive Bayes or Support Vector Machines (SVM), then evaluated on accuracy, precision, recall, and F1 score.
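The metrics mentioned above follow directly from true/false positive counts. This is a minimal sketch of precision, recall, and F1 for one sense of "bank"; the gold and predicted labels are hypothetical WSD output invented for illustration.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one target class.

    precision = TP / (TP + FP); recall = TP / (TP + FN);
    F1 is their harmonic mean.
    """
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical disambiguation of "bank": financial vs. river sense
gold = ["fin", "fin", "river", "river", "fin", "river"]
pred = ["fin", "river", "river", "river", "fin", "fin"]

p, r, f1 = precision_recall_f1(gold, pred, positive="fin")
print(p, r, f1)
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well on it by inflating one at the expense of the other.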

Applications of computational semantics

Natural Language Processing (NLP):

  • Machine translation: Distributional models help align words across languages by identifying which words in two languages appear in similar contexts.
  • Text classification: Topic models and word embeddings improve categorization by representing documents in terms of their semantic content rather than just raw word counts.
  • Named entity recognition: Distributional patterns help systems identify whether a word refers to a person, location, organization, or other entity type.

Information Retrieval (IR):

  • Query expansion: If you search for "car," semantic similarity can automatically include related terms like "automobile" and "vehicle," improving recall.
  • Document ranking: Semantic models improve relevance scoring by recognizing that a document about "automobiles" is relevant to a query about "cars," even without exact word overlap.
  • Personalized search: User profiles built on semantic preferences help tailor results to individual interests.
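The query-expansion idea from the list above can be sketched as a nearest-neighbor lookup in embedding space. The embeddings, threshold, and `expand_query` helper below are invented for illustration; a real system would load pretrained vectors and tune the cutoff empirically.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy embeddings for demonstration only.
emb = {
    "car":        [0.90, 0.80, 0.10],
    "automobile": [0.85, 0.75, 0.15],
    "vehicle":    [0.80, 0.70, 0.20],
    "banana":     [0.10, 0.00, 0.90],
}

def expand_query(term, k=2, threshold=0.9):
    """Add up to k neighbors above a similarity threshold to the query."""
    neighbors = sorted(
        (w for w in emb if w != term),
        key=lambda w: cosine(emb[term], emb[w]),
        reverse=True,
    )
    expansion = [w for w in neighbors[:k]
                 if cosine(emb[term], emb[w]) >= threshold]
    return [term] + expansion

print(expand_query("car"))  # car plus its closest synonyms
```

The threshold matters: set too low, expansion pulls in loosely related terms and hurts precision; set too high, it adds nothing and recall stays flat.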

Other applications:

  • Sentiment analysis for social media monitoring and opinion mining
  • Dialogue systems and chatbots with improved natural language understanding
  • Content recommendation systems that suggest items based on semantic similarity rather than surface-level keyword matching