Cosine similarity

Cosine similarity is a way to compare two word or document vectors by measuring the cosine of the angle between them. In Intro to Semantics and Pragmatics, it is used in computational semantics to estimate semantic similarity from corpus data.

Last updated July 2026

What is cosine similarity?

Cosine similarity is a score for comparing two vectors that represent language, usually words, phrases, or documents in a semantic model. In Intro to Semantics and Pragmatics, you run into it when meaning is represented numerically instead of described only in words.

The basic idea is simple: if two vectors point in nearly the same direction, they are treated as more similar. If they point in different directions, the similarity is lower. The formula divides the dot product of the two vectors by the product of their lengths (norms), so the score depends on direction rather than raw size. For vectors with no negative values, such as word counts, the score falls between 0 and 1.
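The formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name cosine_similarity and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: an all-zero vector has no direction,
        # so it is treated as having zero similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)

# Same direction, different length: score is (approximately) 1.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: score is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])
```

The second vector in the first pair is just the first one doubled, which is why the score stays at the top of the range even though the vectors differ in size.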

That matters in language work because texts are often different lengths. A longer document has higher raw term counts simply because it is longer, not because it is more semantically related to another text. Cosine similarity normalizes for that, so a short paragraph and a long paragraph can still be compared fairly if their distributions of terms or features line up.
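A small sketch of the length point, assuming simple whitespace tokenization and raw word counts as the vector features. The helper names and the example review text are made up for illustration.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Raw word counts for text, in the order given by vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

short_review = "good food good service"
# Hypothetical longer text: the same review repeated five times.
long_review = " ".join([short_review] * 5)

vocabulary = sorted(set(short_review.split()))       # ['food', 'good', 'service']
short_vec = count_vector(short_review, vocabulary)   # [1, 2, 1]
long_vec = count_vector(long_review, vocabulary)     # [5, 10, 5]

# The counts differ by a factor of five, but the proportions are the same,
# so the cosine score is (approximately) 1.
length_invariant_score = cosine_similarity(short_vec, long_vec)
```

Real documents of different lengths will not have identical proportions, so the score will usually fall below 1, but the extra length by itself does not pull the score down.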

In this course, the vectors often come from a vector space model or from counts transformed with TF-IDF. Instead of asking whether two words literally match, cosine similarity asks whether their patterns across a corpus look alike. For example, two restaurant reviews might score highly similar even if they do not repeat the exact same wording, because they share terms and contexts about food, service, and price.

You can think of it as a bridge between corpus data and semantic interpretation. It does not tell you everything about meaning, especially context-dependent or pragmatic meaning, but it gives a measurable way to compare semantic content at scale. That is why it shows up in computational semantics, text mining, and information retrieval, where you need a number that reflects how close two language items are in a model.

Why cosine similarity matters in Intro to Semantics and Pragmatics

Cosine similarity matters because it gives corpus-based semantics a concrete way to test whether two pieces of language are semantically close. In a class on semantics and pragmatics, that is useful when you move from theory to data, since you can compare words, phrases, or whole documents without relying only on intuition.

It also connects directly to how meaning is modeled in multidimensional space. When you see a vector space model, cosine similarity is often the comparison step that turns those coordinates into an interpretable score. That makes it a common tool for showing semantic relatedness, clustering similar texts, or checking whether a computational model matches human judgments.

Cosine similarity also helps you see the limits of surface form. Two sentences can look different but still receive a high similarity score if their vector representations are close. At the same time, two sentences on the same topic can receive a low score if they use very different vocabulary or if the model does not capture context well. That tension is exactly the kind of thing semantics and pragmatics classes care about, because it shows where meaning is captured by distribution and where it depends on use, context, or inference.
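The vocabulary problem is easiest to see with the simplest possible representation. The sketch below assumes a plain bag-of-words model (one dimension per word, raw counts); the function name and the example sentences are invented for illustration, and richer models such as word embeddings would behave differently.

```python
import math
from collections import Counter

def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity between raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[word] for word in vocabulary]
    vec_b = [counts_b[word] for word in vocabulary]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Near-paraphrases with no words in common.
paraphrase_score = bag_of_words_cosine(
    "the film was great", "that movie is excellent"
)  # 0.0

# Different evaluations, but three of four words shared.
overlap_score = bag_of_words_cosine(
    "the film was great", "the film was long"
)  # 0.75
```

Under this representation the near-paraphrases score lower than the pair that disagrees about the film, which is the gap between surface form and meaning described above.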

How cosine similarity connects across the course

Vector Space Model

Cosine similarity usually works on top of a vector space model. The model turns linguistic items into coordinates, and cosine similarity compares the direction of those coordinates. If you understand the vector space setup, cosine similarity becomes the measurement step that tells you which texts or words are closer in semantic representation.

TF-IDF

TF-IDF often creates the vectors that cosine similarity compares. By downweighting common words and boosting more informative ones, TF-IDF makes similarity scores more meaningful for text analysis. Without that weighting, very frequent words can blur the semantic signal and make unrelated texts look too close.
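The effect of the weighting can be sketched with a three-document toy corpus. This uses one plain textbook variant of TF-IDF (raw count times log of N over document frequency), which is an assumption for the example; many libraries apply smoothing or other variants, so their numbers will differ. The documents and helper names are made up.

```python
import math
from collections import Counter

documents = [
    "the pasta was delicious",
    "the service was slow",
    "the election results were close",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def raw_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Unsmoothed idf: a word found in every document gets weight log(1) = 0.
    return [counts[word] * math.log(n_docs / doc_freq[word]) for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Restaurant review vs election report: the only shared word is "the".
raw_score = cosine_similarity(raw_vector(tokenized[0]), raw_vector(tokenized[2]))
tfidf_score = cosine_similarity(tfidf_vector(tokenized[0]), tfidf_vector(tokenized[2]))
# raw_score is about 0.22; tfidf_score is 0.0
```

With raw counts, the shared function word alone makes the two unrelated texts look somewhat similar. Once "the" is weighted down to zero, that spurious overlap disappears.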

Semantic Similarity

Semantic similarity is the broader idea that two language items share meaning or meaning-related features. Cosine similarity is one way to estimate that relationship numerically. In class, the distinction matters because semantic similarity is the concept, while cosine similarity is the computational tool that approximates it.

Natural Language Processing

Natural Language Processing uses cosine similarity in tasks like document retrieval, clustering, recommendation, and paraphrase detection. In an Intro to Semantics and Pragmatics setting, this shows how semantic theory can be applied with algorithms. It is a good example of meaning being studied through large-scale language data rather than only handcrafted examples.

Is cosine similarity on the Intro to Semantics and Pragmatics exam?

A quiz question or short-answer prompt may give you two vectors, two texts, or a description of a corpus model and ask what cosine similarity tells you. Your job is to explain that a higher score means the items are more similar in their vector direction, not necessarily that they share the same exact words.
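As a practice illustration of the two-vector case, here is a worked computation. The vectors are invented for this example and are not taken from any course material.

```latex
\[
\mathbf{a} = (3, 4), \qquad \mathbf{b} = (4, 3)
\]
\[
\cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2} \, \sqrt{4^2 + 3^2}}
  = \frac{24}{5 \cdot 5}
  = 0.96
\]
```

A score of 0.96 says the two vectors point in nearly the same direction. It does not say the underlying items use the same words, only that their values are distributed across the dimensions in a similar way.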

If the item appears in a data analysis question, you may need to interpret why one document pair scores higher than another, especially when document length differs. In a passage analysis or essay response, you might connect cosine similarity to corpus-based semantics by explaining how language meaning can be measured from usage patterns in a corpus.

You should also be ready to recognize what cosine similarity does not do. It does not directly capture speaker intention, irony, implicature, or other pragmatic effects. That distinction often shows up in discussion questions that ask you to separate computational meaning from context-dependent meaning.

Cosine similarity vs Euclidean distance

Cosine similarity and Euclidean distance both compare vectors, but they ask different questions. Cosine similarity looks at angle and direction, so it is about how alike the patterns are even if the vectors have different sizes. Euclidean distance looks at straight-line distance, so it is more sensitive to magnitude and overall scale.
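The two measures can disagree about which document is "closest". The sketch below uses three made-up count vectors to show one such case; the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same proportions as short_doc, ten times the counts
other_doc = [2, 1, 0]    # similar size to short_doc, different proportions

cos_long = cosine_similarity(short_doc, long_doc)     # about 1.0
cos_other = cosine_similarity(short_doc, other_doc)   # about 0.8

dist_long = euclidean_distance(short_doc, long_doc)   # about 20.1
dist_other = euclidean_distance(short_doc, other_doc) # about 1.41
```

Cosine similarity ranks long_doc as the better match for short_doc because the pattern of counts is the same. Euclidean distance ranks other_doc as nearer because its overall magnitude is close to that of short_doc.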

Key things to remember about cosine similarity

  • Cosine similarity compares two vectors by measuring the angle between them, which makes it useful for semantic comparison in vector-based models of language.

  • In Intro to Semantics and Pragmatics, it shows up when meaning is represented numerically from corpus data instead of described only with words.

  • It is especially useful with TF-IDF or vector space models because it normalizes for document length and focuses on pattern similarity.

  • A high cosine similarity score suggests that two texts or terms have similar distributions in a corpus, not that they are identical.

  • It is a computational tool for semantic similarity, but it does not by itself capture pragmatic meaning like irony, implicature, or speaker intent.

Frequently asked questions about cosine similarity

What is cosine similarity in Intro to Semantics and Pragmatics?

Cosine similarity is a numerical way to compare two vectors that represent linguistic meaning. In this course, it is used in computational semantics to estimate how semantically close two words, phrases, or documents are based on corpus-derived features.

How is cosine similarity different from Euclidean distance?

Cosine similarity compares direction, while Euclidean distance compares straight-line distance. That means cosine similarity is usually better when you care about whether two texts pattern alike, even if one is much longer than the other. Euclidean distance changes more when vector size changes.

Why do linguists use cosine similarity for texts?

It gives a fair comparison across documents of different lengths by normalizing vector size. That makes it useful for corpus-based semantics, where you want to compare semantic content from real language data rather than just count exact word matches.

Does cosine similarity measure meaning or context?

It measures semantic relatedness in a computational model, not full pragmatic context. It can show that two texts are close in their usage patterns, but it does not directly capture implicature, irony, or speaker intention.