Fundamentals
Alphabets and Strings
An alphabet is a finite set of symbols used to build strings. You'll often see it written as $\Sigma$. The symbols can be letters, digits, or special characters. For example, $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, c\}$.
A string is a finite sequence of symbols drawn from an alphabet. For instance, if $\Sigma = \{a, b\}$, then $abba$ is a string over that alphabet.
A few key definitions to know:
- The empty string, denoted $\varepsilon$, contains no symbols. Its length is zero.
- The length of a string $w$, written $|w|$, is the number of symbols it contains. So $|abba| = 4$.
- Concatenation joins two strings end-to-end. If $x = ab$ and $y = ba$, then $xy = abba$. Concatenating any string with $\varepsilon$ leaves it unchanged: $w\varepsilon = \varepsilon w = w$.
- $\Sigma^*$ denotes the set of all strings over $\Sigma$, including $\varepsilon$. $\Sigma^+$ is the same set but without $\varepsilon$.
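These definitions map directly onto ordinary Python strings. A minimal sketch (the names `SIGMA` and `is_string_over` are illustrative, not standard):

```python
# The alphabet {a, b} and the empty string, modeled with Python strings.
SIGMA = {"a", "b"}
EPSILON = ""  # the empty string has length zero

def is_string_over(w, alphabet):
    """True if every symbol of w is drawn from the alphabet."""
    return all(symbol in alphabet for symbol in w)

w = "abba"
assert is_string_over(w, SIGMA)
assert len(w) == 4                         # |abba| = 4
assert "ab" + "ba" == "abba"               # concatenation joins end-to-end
assert w + EPSILON == EPSILON + w == w     # the empty string is the identity
```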
Languages and Their Properties
A language is any set of strings over a given alphabet. More precisely, a language $L$ over $\Sigma$ is a subset of $\Sigma^*$. A language can be finite (like $\{a, ab, abb\}$) or infinite (like the set of all binary strings that start with $1$).
- A formal language is defined by precise, unambiguous rules. This is what we study in discrete math and computer science.
- A natural language is human language like English or Mandarin. Natural languages are full of ambiguity, which is exactly why formal languages matter for computation.
Languages support set operations just like any other set:
- Union: $L_1 \cup L_2$ contains all strings in either language.
- Intersection: $L_1 \cap L_2$ contains only strings in both.
- Complement: $\overline{L}$ contains all strings in $\Sigma^*$ that are not in $L$.
- Kleene closure: $L^*$ is the set of all strings formed by concatenating zero or more strings from $L$.
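For finite languages, these operations are just Python set operations. The Kleene closure of a nonempty language is infinite, so the sketch below truncates it to strings up to a chosen maximum length (a choice made for this illustration only):

```python
# Finite languages as sets of strings.
L1 = {"a", "ab"}
L2 = {"ab", "b"}

union = L1 | L2          # all strings in either language
intersection = L1 & L2   # only strings in both

def kleene_closure(L, max_len):
    """All concatenations of zero or more strings from L, truncated to max_len."""
    result = {""}          # the empty string: concatenating zero strings
    frontier = {""}
    while frontier:
        frontier = {s + w for s in frontier for w in L
                    if len(s + w) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_closure({"a", "b"}, 2)))
# ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```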
Grammar Fundamentals
A grammar is a set of rules that specifies exactly which strings belong to a language. Instead of listing every valid string (impossible for infinite languages), a grammar gives you a recipe for generating them.
Every formal grammar has four components, written as $G = (N, \Sigma, S, P)$:
- $N$ (non-terminals): Variables that act as placeholders during string generation. Often written as uppercase letters like $S, A, B$.
- $\Sigma$ (terminals): The actual symbols of the alphabet that appear in the final strings. Often lowercase letters like $a, b, c$.
- $S$ (start symbol): A designated non-terminal where every derivation begins. $S \in N$.
- $P$ (production rules): Rules that tell you how to replace non-terminals with combinations of terminals and non-terminals.
A derivation is the process of applying production rules starting from $S$ until you have a string made entirely of terminals. Each step replaces one non-terminal according to a production rule, and you write $\Rightarrow$ between steps.

Grammar Components
Production Rules and Their Application
Production rules have the form $A \to \beta$, where $A$ is a non-terminal on the left-hand side and $\beta$ is a string of terminals and/or non-terminals on the right-hand side.
Here's a concrete example. Consider the grammar with $N = \{S\}$, $\Sigma = \{a, b\}$, and productions $S \to aSb$ and $S \to ab$. A derivation of the string $aaabbb$ looks like this:
- Start with $S$
- Apply $S \to aSb$: $S \Rightarrow aSb$
- Apply $S \to aSb$: $aSb \Rightarrow aaSbb$
- Apply $S \to ab$: $aaSbb \Rightarrow aaabbb$
This grammar generates the language $\{a^n b^n \mid n \geq 1\}$.
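A derivation like this can be carried out mechanically. A minimal sketch, assuming the classic grammar $S \to aSb \mid ab$ that generates $\{a^n b^n \mid n \geq 1\}$, with productions stored in a plain dict:

```python
# Productions: each non-terminal maps to its list of alternatives.
PRODUCTIONS = {"S": ["aSb", "ab"]}

def apply(sentential_form, non_terminal, rhs):
    """Replace the leftmost occurrence of the non-terminal with rhs."""
    return sentential_form.replace(non_terminal, rhs, 1)

form = "S"
form = apply(form, "S", "aSb")   # S => aSb
form = apply(form, "S", "aSb")   # aSb => aaSbb
form = apply(form, "S", "ab")    # aaSbb => aaabbb
print(form)  # aaabbb
```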
When a string has multiple non-terminals at some step, you have a choice about which one to replace first:
- Leftmost derivation: Always replace the leftmost non-terminal first.
- Rightmost derivation: Always replace the rightmost non-terminal first.
Both approaches produce the same final string, but the order of steps differs. This distinction becomes important in parsing.
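To see the difference concretely, here is a sketch using a toy grammar $S \to AB$, $A \to a$, $B \to b$ (invented for this illustration, since the distinction only shows up when a sentential form holds more than one non-terminal):

```python
# Toy grammar with single-alternative rules, so only the *order* of
# expansion differs between leftmost and rightmost derivations.
RULES = {"S": "AB", "A": "a", "B": "b"}

def derive(start, leftmost):
    """Return the sequence of sentential forms, always expanding the
    leftmost (or rightmost) non-terminal."""
    steps = [start]
    form = start
    while any(c in RULES for c in form):
        positions = [i for i, c in enumerate(form) if c in RULES]
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(form)
    return steps

print(derive("S", leftmost=True))    # ['S', 'AB', 'aB', 'ab']
print(derive("S", leftmost=False))   # ['S', 'AB', 'Ab', 'ab']
```

Both calls end at the same terminal string `ab`; only the intermediate sentential forms differ.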
Chomsky Hierarchy and Language Classifications
The Chomsky hierarchy classifies grammars into four types based on the restrictions placed on their production rules. Each type generates a corresponding class of languages, and each class is a proper subset of the one above it.
| Type | Grammar Name | Language Class | Rule Restriction | Recognizing Machine |
|---|---|---|---|---|
| 3 | Regular | Regular languages | $A \to aB$ or $A \to a$ | Finite automaton |
| 2 | Context-free | Context-free languages | $A \to \gamma$ (single non-terminal on LHS) | Pushdown automaton |
| 1 | Context-sensitive | Context-sensitive languages | $\alpha A\beta \to \alpha\gamma\beta$ where $\gamma \neq \varepsilon$ | Linear-bounded automaton |
| 0 | Unrestricted | Recursively enumerable languages | No restrictions | Turing machine |
The hierarchy goes: Regular $\subset$ Context-Free $\subset$ Context-Sensitive $\subset$ Recursively Enumerable. As you move from Type 3 to Type 0, the production rules become less restricted, the languages become more expressive, and the machines needed to recognize them become more powerful.

Regular and Context-Free Languages
Regular languages are the simplest class in the hierarchy. Their production rules are tightly constrained: every rule must have the form $A \to aB$ or $A \to a$ (right-linear), where $A$ and $B$ are non-terminals and $a$ is a terminal. You can also describe regular languages using regular expressions or recognize them with finite automata.
Regular languages can handle patterns like "strings ending in $b$" or "strings with an even number of $a$'s," but they cannot handle nested or matched structures.
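One such pattern, strings over $\{a, b\}$ with an even number of $a$'s (the specific symbols are assumed here), needs only two states. A sketch of the finite automaton as a transition table:

```python
def has_even_as(w):
    """Finite automaton with states EVEN (accepting, also the start
    state) and ODD; each 'a' toggles between them, 'b' changes nothing."""
    transitions = {
        ("EVEN", "a"): "ODD",  ("EVEN", "b"): "EVEN",
        ("ODD",  "a"): "EVEN", ("ODD",  "b"): "ODD",
    }
    state = "EVEN"
    for symbol in w:
        state = transitions[(state, symbol)]
    return state == "EVEN"

assert has_even_as("abab")      # two a's
assert has_even_as("")          # zero a's is even
assert not has_even_as("abb")   # one a
```

Note the machine keeps no count, only a fixed, finite amount of state, which is exactly why this pattern is regular.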
Context-free languages (CFLs) are strictly more powerful. Their production rules have the form $A \to \gamma$, where the left side is always a single non-terminal but the right side can be any combination of terminals and non-terminals. This extra freedom lets CFLs describe recursive, nested structures.
Examples of context-free languages:
- Balanced parentheses: strings like $()$, $(())$, and $()()$
- The language $\{a^n b^n \mid n \geq 1\}$ from the earlier example
- The syntax of most programming languages (if-else blocks, nested function calls)
The key distinction: regular languages can't "count" or match pairs, while context-free languages can. That's why programming language syntax is defined with context-free grammars, not regular ones.
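A sketch of the "counting" that a finite automaton lacks: checking balanced parentheses needs unbounded memory, here a counter standing in for a pushdown automaton's stack:

```python
def balanced(s):
    """True if every '(' is matched by a later ')' and vice versa."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1        # push
        elif ch == ")":
            depth -= 1        # pop
            if depth < 0:     # a ')' with no matching '(' before it
                return False
    return depth == 0         # every '(' was eventually closed

assert balanced("(())()")
assert not balanced("(()")
assert not balanced(")(")
```

No fixed number of states can track `depth`, because nesting can be arbitrarily deep; that is precisely the boundary between regular and context-free.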
Language Processing
Parsing Techniques and Applications
Parsing is the process of analyzing a string to determine its structure according to a grammar. Given a string and a grammar, a parser figures out how (or whether) the grammar can generate that string.
There are two main strategies:
- Top-down parsing starts from the start symbol and tries to derive the input string by expanding non-terminals. Recursive descent parsing is a common implementation, where each non-terminal gets its own procedure that tries to match part of the input.
- Bottom-up parsing works in reverse: it starts with the input string and tries to reduce it back to $S$ by recognizing right-hand sides of production rules and replacing them with the corresponding left-hand side. Shift-reduce parsing is a standard bottom-up technique that uses a stack to hold symbols as it scans the input.
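A minimal recursive descent sketch for the grammar $S \to aSb \mid ab$ (carried over from the earlier example): the single non-terminal $S$ gets one procedure, and a successful parse returns the parse tree as a nested tuple:

```python
def parse_S(s, pos=0):
    """Try to match one S starting at pos; return (tree, next_pos)."""
    if pos >= len(s) or s[pos] != "a":
        raise SyntaxError(f"expected 'a' at position {pos}")
    if pos + 1 < len(s) and s[pos + 1] == "a":
        child, nxt = parse_S(s, pos + 1)   # choose S -> aSb
        tree = ("S", "a", child, "b")
    else:
        nxt = pos + 1                       # choose S -> ab
        tree = ("S", "a", "b")
    if nxt >= len(s) or s[nxt] != "b":
        raise SyntaxError(f"expected 'b' at position {nxt}")
    return tree, nxt + 1

def parse(s):
    tree, end = parse_S(s)
    if end != len(s):
        raise SyntaxError("trailing input")
    return tree

print(parse("aabb"))   # ('S', 'a', ('S', 'a', 'b'), 'b')
```

One symbol of lookahead (is the next character another `a`?) is enough to pick the right production for this grammar, which is what makes the recursive descent approach work here without backtracking.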
A successful parse produces a parse tree, which shows the hierarchical structure of the derivation. The root is the start symbol, internal nodes are non-terminals, and leaves are terminals. An abstract syntax tree (AST) is a simplified version of the parse tree that strips out unnecessary grammatical detail, keeping only the structure that matters for processing.
Parsing is central to compiler design (turning source code into executable programs), natural language processing, and processing structured data formats like XML and JSON.