Fundamentals
Alphabets and Strings
An alphabet is a finite set of symbols used to build strings. You'll often see it written as $\Sigma$. The symbols can be letters, digits, or special characters. For example, $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, c\}$.
A string is a finite sequence of symbols drawn from an alphabet. For instance, if $\Sigma = \{a, b\}$, then $abba$ is a string over that alphabet.
A few key definitions to know:
- The empty string, denoted $\varepsilon$, contains no symbols. Its length is zero.
- The length of a string $w$, written $|w|$, is the number of symbols it contains. So $|abba| = 4$.
- Concatenation joins two strings end-to-end. If $x = ab$ and $y = ba$, then $xy = abba$. Concatenating any string with $\varepsilon$ leaves it unchanged: $w\varepsilon = \varepsilon w = w$.
- $\Sigma^*$ denotes the set of all strings over $\Sigma$, including $\varepsilon$. $\Sigma^+$ is the same set but without $\varepsilon$.
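These definitions map directly onto ordinary Python strings. A minimal sketch (the names `SIGMA` and `is_string_over` are illustrative, not standard):

```python
# The alphabet {a, b} and the empty string, modeled with Python strings.
SIGMA = {"a", "b"}
EPSILON = ""  # the empty string has length zero

def is_string_over(w, alphabet):
    """True if every symbol of w is drawn from the alphabet."""
    return all(symbol in alphabet for symbol in w)

w = "abba"
assert is_string_over(w, SIGMA)
assert len(w) == 4                         # |abba| = 4
assert "ab" + "ba" == "abba"               # concatenation joins end-to-end
assert w + EPSILON == EPSILON + w == w     # the empty string is the identity
```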
Languages and Their Properties
A language is any set of strings over a given alphabet. More precisely, a language $L$ over $\Sigma$ is a subset of $\Sigma^*$. A language can be finite (like $\{a, ab, abb\}$) or infinite (like the set of all binary strings that start with $1$).
- A formal language is defined by precise, unambiguous rules. This is what we study in discrete math and computer science.
- A natural language is human language like English or Mandarin. Natural languages are full of ambiguity, which is exactly why formal languages matter for computation.
Languages support set operations just like any other set:
- Union: $L_1 \cup L_2$ contains all strings in either language.
- Intersection: $L_1 \cap L_2$ contains only strings in both.
- Complement: $\overline{L}$ contains all strings in $\Sigma^*$ that are not in $L$.
- Kleene closure: $L^*$ is the set of all strings formed by concatenating zero or more strings from $L$.
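For finite languages, these operations are just Python set operations. The Kleene closure of a nonempty language is infinite, so the sketch below truncates it to strings up to a chosen maximum length (a choice made for this illustration only):

```python
# Finite languages as sets of strings.
L1 = {"a", "ab"}
L2 = {"ab", "b"}

union = L1 | L2          # all strings in either language
intersection = L1 & L2   # only strings in both

def kleene_closure(L, max_len):
    """All concatenations of zero or more strings from L, truncated to max_len."""
    result = {""}          # the empty string: concatenating zero strings
    frontier = {""}
    while frontier:
        frontier = {s + w for s in frontier for w in L
                    if len(s + w) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_closure({"a", "b"}, 2)))
# ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```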
Grammar Fundamentals
A grammar is a set of rules that specifies exactly which strings belong to a language. Instead of listing every valid string (impossible for infinite languages), a grammar gives you a recipe for generating them.
Every formal grammar has four components, written as $G = (N, \Sigma, S, P)$:
- $N$ (non-terminals): Variables that act as placeholders during string generation. Often written as uppercase letters like $S, A, B$.
- $\Sigma$ (terminals): The actual symbols of the alphabet that appear in the final strings. Often lowercase letters like $a, b, c$.
- $S$ (start symbol): A designated non-terminal where every derivation begins. $S \in N$.
- $P$ (production rules): Rules that tell you how to replace non-terminals with combinations of terminals and non-terminals.
A derivation is the process of applying production rules starting from $S$ until you have a string made entirely of terminals. Each step replaces one non-terminal according to a production rule, and you write $\Rightarrow$ between steps.

Grammar Components
Production Rules and Their Application
Production rules have the form $A \to \beta$, where $A$ is a non-terminal on the left-hand side and $\beta$ is a string of terminals and/or non-terminals on the right-hand side.
Here's a concrete example. Consider the grammar with $N = \{S\}$, $\Sigma = \{a, b\}$, and productions $S \to aSb$ and $S \to ab$. A derivation of the string $aaabbb$ looks like this:
- Start with $S$
- Apply $S \to aSb$: $S \Rightarrow aSb$
- Apply $S \to aSb$: $aSb \Rightarrow aaSbb$
- Apply $S \to ab$: $aaSbb \Rightarrow aaabbb$
This grammar generates the language $\{a^n b^n \mid n \geq 1\}$.
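A derivation like this can be carried out mechanically. A minimal sketch, assuming the classic grammar $S \to aSb \mid ab$ that generates $\{a^n b^n \mid n \geq 1\}$, with productions stored in a plain dict:

```python
# Productions: each non-terminal maps to its list of alternatives.
PRODUCTIONS = {"S": ["aSb", "ab"]}

def apply(sentential_form, non_terminal, rhs):
    """Replace the leftmost occurrence of the non-terminal with rhs."""
    return sentential_form.replace(non_terminal, rhs, 1)

form = "S"
form = apply(form, "S", "aSb")   # S => aSb
form = apply(form, "S", "aSb")   # aSb => aaSbb
form = apply(form, "S", "ab")    # aaSbb => aaabbb
print(form)  # aaabbb
```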
When a string has multiple non-terminals at some step, you have a choice about which one to replace first:
- Leftmost derivation: Always replace the leftmost non-terminal first.
- Rightmost derivation: Always replace the rightmost non-terminal first.
Both approaches produce the same final string, but the order of steps differs. This distinction becomes important in parsing.
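To see the difference concretely, here is a sketch using a toy grammar $S \to AB$, $A \to a$, $B \to b$ (invented for this illustration, since the distinction only shows up when a sentential form holds more than one non-terminal):

```python
# Toy grammar with single-alternative rules, so only the *order* of
# expansion differs between leftmost and rightmost derivations.
RULES = {"S": "AB", "A": "a", "B": "b"}

def derive(start, leftmost):
    """Return the sequence of sentential forms, always expanding the
    leftmost (or rightmost) non-terminal."""
    steps = [start]
    form = start
    while any(c in RULES for c in form):
        positions = [i for i, c in enumerate(form) if c in RULES]
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(form)
    return steps

print(derive("S", leftmost=True))    # ['S', 'AB', 'aB', 'ab']
print(derive("S", leftmost=False))   # ['S', 'AB', 'Ab', 'ab']
```

Both calls end at the same terminal string `ab`; only the intermediate sentential forms differ.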
Chomsky Hierarchy and Language Classifications
The Chomsky hierarchy classifies grammars into four types based on the restrictions placed on their production rules. Each type generates a corresponding class of languages, and each class is a proper subset of the one above it.
| Type | Grammar Name | Language Class | Rule Restriction | Recognizing Machine |
|---|---|---|---|---|
| 3 | Regular | Regular languages | $A \to aB$ or $A \to a$ | Finite automaton |
| 2 | Context-free | Context-free languages | $A \to \gamma$ (single non-terminal on LHS) | Pushdown automaton |
| 1 | Context-sensitive | Context-sensitive languages | $\alpha A\beta \to \alpha\gamma\beta$ where $\gamma \neq \varepsilon$ | Linear-bounded automaton |
| 0 | Unrestricted | Recursively enumerable languages | No restrictions | Turing machine |
The hierarchy goes: Regular $\subset$ Context-Free $\subset$ Context-Sensitive $\subset$ Recursively Enumerable. As you move from Type 3 to Type 0, the production rules become less restricted, the languages become more expressive, and the machines needed to recognize them become more powerful.

Regular and Context-Free Languages
Regular languages are the simplest class in the hierarchy. Their production rules are tightly constrained: every rule must have the form $A \to aB$ or $A \to a$ (right-linear), where $A$ and $B$ are non-terminals and $a$ is a terminal. You can also describe regular languages using regular expressions or recognize them with finite automata.
Regular languages can handle patterns like "strings ending in $b$" or "strings with an even number of $a$'s," but they cannot handle nested or matched structures.
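One such pattern, strings over $\{a, b\}$ with an even number of $a$'s (the specific symbols are assumed here), needs only two states. A sketch of the finite automaton as a transition table:

```python
def has_even_as(w):
    """Finite automaton with states EVEN (accepting, also the start
    state) and ODD; each 'a' toggles between them, 'b' changes nothing."""
    transitions = {
        ("EVEN", "a"): "ODD",  ("EVEN", "b"): "EVEN",
        ("ODD",  "a"): "EVEN", ("ODD",  "b"): "ODD",
    }
    state = "EVEN"
    for symbol in w:
        state = transitions[(state, symbol)]
    return state == "EVEN"

assert has_even_as("abab")      # two a's
assert has_even_as("")          # zero a's is even
assert not has_even_as("abb")   # one a
```

Note the machine keeps no count, only a fixed, finite amount of state, which is exactly why this pattern is regular.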
Context-free languages (CFLs) are strictly more powerful. Their production rules have the form $A \to \gamma$, where the left side is always a single non-terminal but the right side can be any combination of terminals and non-terminals. This extra freedom lets CFLs describe recursive, nested structures.
Examples of context-free languages:
- Balanced parentheses: strings like $()$, $(())$, and $()()$
- The language $\{a^n b^n \mid n \geq 1\}$ from the earlier example
- The syntax of most programming languages (if-else blocks, nested function calls)
The key distinction: regular languages can't "count" or match pairs, while context-free languages can. That's why programming language syntax is defined with context-free grammars, not regular ones.
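A sketch of the "counting" that a finite automaton lacks: checking balanced parentheses needs unbounded memory, here a counter standing in for a pushdown automaton's stack:

```python
def balanced(s):
    """True if every '(' is matched by a later ')' and vice versa."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1        # push
        elif ch == ")":
            depth -= 1        # pop
            if depth < 0:     # a ')' with no matching '(' before it
                return False
    return depth == 0         # every '(' was eventually closed

assert balanced("(())()")
assert not balanced("(()")
assert not balanced(")(")
```

No fixed number of states can track `depth`, because nesting can be arbitrarily deep; that is precisely the boundary between regular and context-free.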
Language Processing
Parsing Techniques and Applications
Parsing is the process of analyzing a string to determine its structure according to a grammar. Given a string and a grammar, a parser figures out how (or whether) the grammar can generate that string.
There are two main strategies:
- Top-down parsing starts from the start symbol and tries to derive the input string by expanding non-terminals. Recursive descent parsing is a common implementation, where each non-terminal gets its own procedure that tries to match part of the input.
- Bottom-up parsing works in reverse: it starts with the input string and tries to reduce it back to $S$ by recognizing right-hand sides of production rules and replacing them with the corresponding left-hand side. Shift-reduce parsing is a standard bottom-up technique that uses a stack to hold symbols as it scans the input.
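A minimal recursive descent sketch for the grammar $S \to aSb \mid ab$ (carried over from the earlier example): the single non-terminal $S$ gets one procedure, and a successful parse returns the parse tree as a nested tuple:

```python
def parse_S(s, pos=0):
    """Try to match one S starting at pos; return (tree, next_pos)."""
    if pos >= len(s) or s[pos] != "a":
        raise SyntaxError(f"expected 'a' at position {pos}")
    if pos + 1 < len(s) and s[pos + 1] == "a":
        child, nxt = parse_S(s, pos + 1)   # choose S -> aSb
        tree = ("S", "a", child, "b")
    else:
        nxt = pos + 1                       # choose S -> ab
        tree = ("S", "a", "b")
    if nxt >= len(s) or s[nxt] != "b":
        raise SyntaxError(f"expected 'b' at position {nxt}")
    return tree, nxt + 1

def parse(s):
    tree, end = parse_S(s)
    if end != len(s):
        raise SyntaxError("trailing input")
    return tree

print(parse("aabb"))   # ('S', 'a', ('S', 'a', 'b'), 'b')
```

One symbol of lookahead (is the next character another `a`?) is enough to pick the right production for this grammar, which is what makes the recursive descent approach work here without backtracking.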
A successful parse produces a parse tree, which shows the hierarchical structure of the derivation. The root is the start symbol, internal nodes are non-terminals, and leaves are terminals. An abstract syntax tree (AST) is a simplified version of the parse tree that strips out unnecessary grammatical detail, keeping only the structure that matters for processing.
Parsing is central to compiler design (turning source code into executable programs), natural language processing, and processing structured data formats like XML and JSON.