🧩Discrete Mathematics Unit 11 Review


11.1 Languages and Grammars


Written by the Fiveable Content Team • Last updated August 2025

Fundamentals

Alphabets and Strings

An alphabet is a finite set of symbols used to build strings. You'll often see it written as $\Sigma$. The symbols can be letters, digits, or special characters. For example, $\Sigma = \{a, b, c\}$ or $\Sigma = \{0, 1\}$.

A string is a finite sequence of symbols drawn from an alphabet. For instance, if $\Sigma = \{0, 1\}$, then $0110$ is a string over that alphabet.

A few key definitions to know:

  • The empty string, denoted $\varepsilon$, contains no symbols. Its length is zero.
  • The length of a string $w$, written $|w|$, is the number of symbols it contains. So $|0110| = 4$.
  • Concatenation joins two strings end-to-end. If $w_1 = ab$ and $w_2 = cd$, then $w_1 w_2 = abcd$. Concatenating any string with $\varepsilon$ leaves it unchanged: $w\varepsilon = w$.
  • $\Sigma^*$ denotes the set of all strings over $\Sigma$, including $\varepsilon$. $\Sigma^+$ is the same set but without $\varepsilon$.

Languages and Their Properties

A language is any set of strings over a given alphabet. More precisely, a language $L$ over $\Sigma$ is a subset of $\Sigma^*$. A language can be finite (like $L = \{cat, dog\}$) or infinite (like the set of all binary strings that start with $1$).

  • A formal language is defined by precise, unambiguous rules. This is what we study in discrete math and computer science.
  • A natural language is human language like English or Mandarin. Natural languages are full of ambiguity, which is exactly why formal languages matter for computation.

Languages support set operations just like any other set:

  • Union: $L_1 \cup L_2$ contains all strings in either language.
  • Intersection: $L_1 \cap L_2$ contains only strings in both.
  • Complement: $\overline{L}$ contains all strings in $\Sigma^*$ that are not in $L$.
  • Kleene closure: $L^*$ is the set of all strings formed by concatenating zero or more strings from $L$.
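Because languages are sets, Python's set operators model the first three operations directly. The complement and Kleene closure are infinite in general, so this sketch (helper names are illustrative) truncates both to strings of length at most 3:

```python
from itertools import product

sigma = ['a', 'b']
# A finite slice of Sigma*: all strings of length <= 3 over {a, b}.
sigma_star_3 = {"".join(t) for k in range(4) for t in product(sigma, repeat=k)}

L1 = {"a", "ab"}
L2 = {"ab", "b"}

print(L1 | L2)            # union
print(L1 & L2)            # intersection: {'ab'}
print(sigma_star_3 - L1)  # complement, restricted to length <= 3

def kleene_up_to(L, n):
    """Concatenations of zero or more strings from L, up to length n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {u + v for u in frontier for v in L
                    if len(u + v) <= n} - result
        result |= frontier
    return result

print(sorted(kleene_up_to(L1, 3)))
# ['', 'a', 'aa', 'aaa', 'aab', 'ab', 'aba']
```

The empty string is always in $L^*$, which is why `kleene_up_to` seeds its result with `""`.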

Grammar Fundamentals

A grammar is a set of rules that specifies exactly which strings belong to a language. Instead of listing every valid string (impossible for infinite languages), a grammar gives you a recipe for generating them.

Every formal grammar $G$ has four components, written as $G = (V, T, S, P)$:

  • $V$ (non-terminals): Variables that act as placeholders during string generation. Often written as uppercase letters like $S, A, B$.
  • $T$ (terminals): The actual symbols of the alphabet that appear in the final strings. Often lowercase letters like $a, b, c$.
  • $S$ (start symbol): A designated non-terminal where every derivation begins. $S \in V$.
  • $P$ (production rules): Rules that tell you how to replace non-terminals with combinations of terminals and non-terminals.

A derivation is the process of applying production rules starting from $S$ until you have a string made entirely of terminals. Each step replaces one non-terminal according to a production rule, and you write $\Rightarrow$ between steps.


Grammar Components

Production Rules and Their Application

Production rules have the form $A \rightarrow \alpha$, where $A$ is a non-terminal on the left-hand side and $\alpha$ is a string of terminals and/or non-terminals on the right-hand side.

Here's a concrete example. Consider the grammar with $S \rightarrow aSb$ and $S \rightarrow ab$. A derivation of the string $aaabbb$ looks like this:

  1. Start with $S$
  2. Apply $S \rightarrow aSb$: $\quad S \Rightarrow aSb$
  3. Apply $S \rightarrow aSb$: $\quad aSb \Rightarrow aaSbb$
  4. Apply $S \rightarrow ab$: $\quad aaSbb \Rightarrow aaabbb$

This grammar generates the language $L = \{a^n b^n \mid n \geq 1\}$.
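The derivation above is mechanical enough to script. A minimal sketch (the `derive` helper is invented for this example): each step replaces the leftmost occurrence of a non-terminal with the right-hand side of a chosen production.

```python
def derive(choices, start="S"):
    """Apply one production (lhs, rhs) per step, replacing the leftmost
    occurrence of lhs; return every sentential form along the way."""
    forms = [start]
    current = start
    for lhs, rhs in choices:
        current = current.replace(lhs, rhs, 1)  # count=1: leftmost only
        forms.append(current)
    return forms

# The three steps from the worked example: S -> aSb, S -> aSb, S -> ab.
steps = [("S", "aSb"), ("S", "aSb"), ("S", "ab")]
print(" => ".join(derive(steps)))
# S => aSb => aaSbb => aaabbb
```

Applying $S \rightarrow aSb$ a total of $n-1$ times and then $S \rightarrow ab$ once yields $a^n b^n$, which is why the grammar generates exactly that language.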

When a string has multiple non-terminals at some step, you have a choice about which one to replace first:

  • Leftmost derivation: Always replace the leftmost non-terminal first.
  • Rightmost derivation: Always replace the rightmost non-terminal first.

Both approaches produce the same final string, but the order of steps differs. This distinction becomes important in parsing.
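The two strategies can be sketched with a toy grammar that has two non-terminals, $S \rightarrow AB$, $A \rightarrow a$, $B \rightarrow b$ (chosen for this illustration; non-terminals are uppercase by convention). The intermediate forms differ, but both derivations end at the same string:

```python
def leftmost_step(form, productions):
    """Replace the leftmost non-terminal (uppercase letter) once."""
    for i, ch in enumerate(form):
        if ch.isupper():
            return form[:i] + productions[ch] + form[i + 1:]
    return form

def rightmost_step(form, productions):
    """Replace the rightmost non-terminal once."""
    for i in range(len(form) - 1, -1, -1):
        if form[i].isupper():
            return form[:i] + productions[form[i]] + form[i + 1:]
    return form

productions = {"S": "AB", "A": "a", "B": "b"}

left, form = ["S"], "S"
while any(c.isupper() for c in form):
    form = leftmost_step(form, productions)
    left.append(form)

right, form = ["S"], "S"
while any(c.isupper() for c in form):
    form = rightmost_step(form, productions)
    right.append(form)

print(" => ".join(left))   # S => AB => aB => ab
print(" => ".join(right))  # S => AB => Ab => ab
```

Notice the middle step is $aB$ in the leftmost derivation but $Ab$ in the rightmost one, even though both finish at $ab$.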

Chomsky Hierarchy and Language Classifications

The Chomsky hierarchy classifies grammars into four types based on the restrictions placed on their production rules. Each type generates a corresponding class of languages, and each class is a proper subset of the next, less restricted class.

| Type | Grammar Name | Language Class | Rule Restriction | Recognizing Machine |
| --- | --- | --- | --- | --- |
| 3 | Regular | Regular languages | $A \rightarrow aB$ or $A \rightarrow a$ | Finite automaton |
| 2 | Context-free | Context-free languages | $A \rightarrow \alpha$ (single non-terminal on LHS) | Pushdown automaton |
| 1 | Context-sensitive | Context-sensitive languages | $\alpha A \beta \rightarrow \alpha \gamma \beta$ where $\gamma \neq \varepsilon$ | Linear bounded automaton |
| 0 | Unrestricted | Recursively enumerable languages | No restrictions | Turing machine |

The hierarchy goes: Regular $\subset$ Context-Free $\subset$ Context-Sensitive $\subset$ Recursively Enumerable. As you move from Type 3 to Type 0, the production rules become less restricted, the languages become more expressive, and the machines needed to recognize them become more powerful.


Regular and Context-Free Languages

Regular languages are the simplest class in the hierarchy. Their production rules are tightly constrained: every rule must have the form $A \rightarrow aB$ or $A \rightarrow a$ (right-linear), where $A$ and $B$ are non-terminals and $a$ is a terminal. You can also describe regular languages using regular expressions or recognize them with finite automata.

Regular languages can handle patterns like "strings ending in $01$" or "strings with an even number of $a$'s," but they cannot handle nested or matched structures.
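A finite automaton for the "even number of $a$'s" pattern needs just two states, which is exactly why it is regular. A minimal sketch (function name invented for this example):

```python
def accepts_even_as(w):
    """Simulate a two-state finite automaton over {a, b}:
    'even' (the accepting start state) and 'odd'. Reading an 'a'
    toggles the state; any other symbol leaves it unchanged."""
    state = "even"
    for ch in w:
        if ch == "a":
            state = "odd" if state == "even" else "even"
    return state == "even"

print(accepts_even_as("abab"))  # True  (two a's)
print(accepts_even_as("aaa"))   # False (three a's)
```

The machine only remembers one bit of information (parity), no matter how long the input gets; that bounded memory is the hallmark of regular languages.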

Context-free languages (CFLs) are strictly more powerful. Their production rules have the form $A \rightarrow \alpha$, where the left side is always a single non-terminal but the right side $\alpha$ can be any combination of terminals and non-terminals. This extra freedom lets CFLs describe recursive, nested structures.

Examples of context-free languages:

  • Balanced parentheses: $\{(^n)^n \mid n \geq 0\}$
  • The language $\{a^n b^n \mid n \geq 1\}$ from the earlier example
  • The syntax of most programming languages (if-else blocks, nested function calls)

The key distinction: regular languages can't "count" or match pairs, while context-free languages can. That's why programming language syntax is defined with context-free grammars, not regular ones.
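The "counting" a pushdown automaton does can be sketched with a single counter. This illustrative checker for $\{a^n b^n \mid n \geq 1\}$ (function name invented here) increments on each $a$ and decrements on each $b$; a finite automaton cannot do this, because the count can grow without bound:

```python
def is_an_bn(w):
    """Accept strings of n a's followed by exactly n b's, n >= 1."""
    n = 0
    i = 0
    while i < len(w) and w[i] == "a":   # count the leading a's
        n += 1
        i += 1
    if n == 0:
        return False                     # n >= 1 required
    while i < len(w) and w[i] == "b":   # match them against b's
        n -= 1
        i += 1
    return i == len(w) and n == 0        # all input consumed, counts equal

print(is_an_bn("aaabbb"))  # True
print(is_an_bn("aabbb"))   # False
```

A real pushdown automaton uses a stack rather than an integer counter, but for this language one counter suffices, and the idea is the same: unbounded memory for matching.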

Language Processing

Parsing Techniques and Applications

Parsing is the process of analyzing a string to determine its structure according to a grammar. Given a string and a grammar, a parser figures out how (or whether) the grammar can generate that string.

There are two main strategies:

  • Top-down parsing starts from the start symbol $S$ and tries to derive the input string by expanding non-terminals. Recursive descent parsing is a common implementation, where each non-terminal gets its own procedure that tries to match part of the input.
  • Bottom-up parsing works in reverse: it starts with the input string and tries to reduce it back to $S$ by recognizing right-hand sides of production rules and replacing them with the corresponding left-hand side. Shift-reduce parsing is a standard bottom-up technique that uses a stack to hold symbols as it scans the input.

A successful parse produces a parse tree, which shows the hierarchical structure of the derivation. The root is the start symbol, internal nodes are non-terminals, and leaves are terminals. An abstract syntax tree (AST) is a simplified version of the parse tree that strips out unnecessary grammatical detail, keeping only the structure that matters for processing.
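A recursive descent parser for the earlier grammar $S \rightarrow aSb \mid ab$ makes this concrete. In this sketch (function names and the nested-tuple tree format are choices for the example, not a standard API), the single non-terminal $S$ gets one procedure, and one symbol of lookahead decides which production to apply:

```python
def parse(s):
    """Parse a full input string; raise SyntaxError on failure."""
    tree, pos = parse_S(s, 0)
    if pos != len(s):
        raise SyntaxError("trailing input")
    return tree

def parse_S(s, pos):
    """Match 'a', then either a nested S or a lone 'b', then 'b'.
    Returns (parse tree as nested tuples, next input position)."""
    if pos >= len(s) or s[pos] != "a":
        raise SyntaxError(f"expected 'a' at position {pos}")
    pos += 1
    if pos < len(s) and s[pos] == "a":      # lookahead 'a': use S -> aSb
        child, pos = parse_S(s, pos)
        node = ("S", "a", child, "b")
    else:                                    # otherwise: use S -> ab
        node = ("S", "a", "b")
    if pos >= len(s) or s[pos] != "b":
        raise SyntaxError(f"expected 'b' at position {pos}")
    return node, pos + 1

print(parse("aaabbb"))
# ('S', 'a', ('S', 'a', ('S', 'a', 'b'), 'b'), 'b')
```

Each nested tuple is one node of the parse tree: the root and every internal node is the non-terminal $S$, and the leaves are the terminals `a` and `b`, mirroring the description above.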

Parsing is central to compiler design (turning source code into executable programs), natural language processing, and processing structured data formats like XML and JSON.