Recurrent neural network fundamentals
Recurrent neural networks (RNNs) process sequential data by maintaining an internal state that evolves as each new element arrives. This makes them fundamentally different from feedforward networks, which treat every input independently. For signal processing, RNNs are critical because most signals (speech, sensor readings, communications data) have temporal structure where past values inform the meaning of current ones.
The defining feature of an RNN is its feedback connection: the hidden layer's output at one time step feeds back as input to the same layer at the next time step. This recurrence gives the network a form of memory, letting it learn patterns that unfold over time.
Sequence modeling with RNNs
RNNs handle sequence modeling tasks where inputs, outputs, or both are variable-length sequences. Typical tasks include:
- Language modeling and machine translation
- Speech recognition from acoustic feature sequences
- Time series forecasting for financial, weather, or sensor data
Because the network processes one element at a time and updates its state at each step, it naturally accommodates sequences of any length without requiring a fixed input dimension.
Hidden state in RNNs
The hidden state (sometimes called the context vector or memory) is what gives an RNN its temporal awareness. At each time step t, the hidden state h_t is recomputed from two sources: the current input x_t and the previous hidden state h_{t-1}.
Think of the hidden state as a compressed summary of everything the network has seen so far. It doesn't store every past input literally; instead, it learns to retain whatever information is most useful for the task. This summary then influences how the network interprets the current input and what output it produces.
Input, output, and state equations
The core RNN dynamics are captured by two equations:
State update equation: h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
where:
- h_t is the hidden state at time t
- x_t is the input at time t
- W_xh is the input-to-hidden weight matrix
- W_hh is the hidden-to-hidden (recurrent) weight matrix
- b_h is the bias vector
- f is a nonlinear activation function (typically tanh or ReLU)
Output equation: y_t = g(W_hy h_t + b_y)
where:
- y_t is the output at time t
- W_hy is the hidden-to-output weight matrix
- b_y is the output bias
- g is an activation function chosen for the task (e.g., softmax for classification)
The key point: the same weight matrices W_xh, W_hh, and W_hy are shared across all time steps. This parameter sharing is what allows the network to generalize across different sequence positions and handle variable-length inputs.
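The two equations above can be turned into a short NumPy sketch of the forward pass. The dimensions and the weight-matrix names follow the definitions above; the random initialization and toy sizes are illustrative only.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple RNN over a sequence, returning all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])          # h_0: initial hidden state
    hs, ys = [], []
    for x in xs:                          # same weights reused at every step
        # State update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        # Output: y_t = W_hy h_t + b_y (linear here; apply softmax for classification)
        y = W_hy @ h + b_y
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Toy dimensions: input size 3, hidden size 4, output size 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hh = rng.normal(size=(4, 4)) * 0.1
W_hy = rng.normal(size=(2, 4)) * 0.1
b_h, b_y = np.zeros(4), np.zeros(2)
xs = rng.normal(size=(5, 3))             # sequence of 5 input vectors
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
```

Note that `rnn_forward` accepts a sequence of any length: nothing in the loop depends on a fixed input dimension, which is the parameter-sharing property in action.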
Training recurrent neural networks
Training an RNN means adjusting its shared weights and biases to minimize a loss function (e.g., cross-entropy for classification, MSE for regression) that measures the gap between predicted and target outputs. Standard gradient-based optimizers like SGD, Adam, or RMSprop are used, but the sequential nature of RNNs introduces unique challenges in how gradients are computed.
Backpropagation through time (BPTT)
BPTT is the standard algorithm for computing gradients in RNNs. It works by "unfolding" the recurrent network into an equivalent feedforward network with one layer per time step, then applying standard backpropagation to this unfolded graph.
The procedure:
- Forward pass: Process the entire input sequence step by step, computing h_t and y_t at each time step and accumulating the loss.
- Unfold: Conceptually create a copy of the network for each time step, with shared weights.
- Backward pass: Propagate gradients backward through the unfolded network, from the final time step to the first.
- Accumulate: Sum the gradients for each shared weight across all time steps.
- Update: Apply the accumulated gradients to update the weights.
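The procedure above can be checked concretely on a scalar RNN. The sketch below runs full BPTT by hand, accumulating the shared-weight gradients across all time steps, and verifies the recurrent-weight gradient against a finite-difference estimate (all names and the squared-error loss are illustrative choices, not part of the text above):

```python
import numpy as np

def bptt_scalar(xs, ys, w_x, w_h):
    """Full BPTT for a scalar RNN h_t = tanh(w_x*x_t + w_h*h_{t-1}),
    with loss L = 0.5 * sum((h_t - y_t)^2). Returns (L, dL/dw_x, dL/dw_h)."""
    # Forward pass: unfold through time, storing every hidden state
    hs, h = [], 0.0
    for x in xs:
        h = np.tanh(w_x * x + w_h * h)
        hs.append(h)
    loss = 0.5 * sum((h_t - y_t) ** 2 for h_t, y_t in zip(hs, ys))
    # Backward pass: from the final time step to the first,
    # summing the gradients for each shared weight
    dw_x = dw_h = dh_next = 0.0
    for t in reversed(range(len(xs))):
        dh = (hs[t] - ys[t]) + dh_next      # direct loss term + gradient from step t+1
        da = dh * (1.0 - hs[t] ** 2)        # back through tanh
        h_prev = hs[t - 1] if t > 0 else 0.0
        dw_x += da * xs[t]                  # accumulate across time steps
        dw_h += da * h_prev
        dh_next = da * w_h                  # gradient flowing back to h_{t-1}
    return loss, dw_x, dw_h

# Sanity check against a central finite-difference gradient
xs = [0.5, -1.0, 0.8, 0.3]
ys = [0.1, 0.2, -0.1, 0.4]
L, gx, gh = bptt_scalar(xs, ys, w_x=0.7, w_h=0.4)
eps = 1e-6
num_gh = (bptt_scalar(xs, ys, 0.7, 0.4 + eps)[0] -
          bptt_scalar(xs, ys, 0.7, 0.4 - eps)[0]) / (2 * eps)
```

The agreement between `gh` and `num_gh` confirms that the accumulate-across-time-steps rule computes the correct gradient for a weight shared by every unfolded copy.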
For long sequences, full BPTT becomes computationally expensive and memory-intensive. Truncated BPTT addresses this by limiting backpropagation to a fixed number of time steps rather than the full sequence length. This trades off some gradient accuracy for practical trainability.
Vanishing and exploding gradients
These are the central training challenges for RNNs. During BPTT, the gradient of the loss with respect to an early hidden state involves a product of Jacobian matrices across all intervening time steps. Specifically, the gradient flowing from time step t back to an earlier step k involves the product ∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i-1}:
- Vanishing gradients: If the spectral norm of these Jacobians is consistently less than 1, the product shrinks exponentially. The network effectively "forgets" that early inputs matter, making it unable to learn long-range dependencies.
- Exploding gradients: If the spectral norm is consistently greater than 1, the product grows exponentially, causing weight updates to become enormous and training to diverge.
This is not just a theoretical concern. For simple RNNs with tanh activations, vanishing gradients become severe for dependencies spanning more than roughly 10-20 time steps.
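The exponential shrinkage is easy to see numerically in the scalar case, where each Jacobian factor is w_h · tanh'(a_t). The values below (recurrent weight 0.9, pre-activations near 0.5) are illustrative:

```python
import numpy as np

# Per-step factor |dh_t/dh_{t-1}| = w_h * (1 - tanh(a_t)^2) for a scalar RNN,
# assuming pre-activations hover around a_t ≈ 0.5
w_h, a = 0.9, 0.5
factor = w_h * (1.0 - np.tanh(a) ** 2)

# Gradient scale surviving across 20 intervening time steps
grad_20 = factor ** 20
```

With `factor` ≈ 0.71, the gradient reaching 20 steps back is attenuated by roughly three orders of magnitude, which is why dependencies at that range are so hard for simple RNNs to learn.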
Gradient clipping techniques
Gradient clipping directly addresses exploding gradients by capping the magnitude of gradients before the weight update. Two common approaches:
- Clipping by value: Each gradient component is clipped independently to a range like [−c, c].
- Clipping by norm: If the global norm ‖g‖ of the gradient vector g exceeds a threshold τ, the entire gradient is rescaled: g ← (τ/‖g‖) g. This preserves the gradient direction, which is generally preferred.
Clipping by norm is more widely used in practice because it maintains the relative proportions of gradient components. Typical threshold values are in the range of 1 to 5, though this is task-dependent.
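Clipping by global norm can be sketched in a few lines of NumPy (the gradient list and threshold here are toy values):

```python
import numpy as np

def clip_by_norm(grads, threshold):
    """Rescale the entire set of gradients if the global norm exceeds
    `threshold`, preserving the gradient direction."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > threshold:
        scale = threshold / global_norm
        return [g * scale for g in grads], global_norm
    return grads, global_norm

# Toy gradients for two parameter tensors; global norm = sqrt(9+16+144) = 13
grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]
clipped, norm = clip_by_norm(grads, threshold=5.0)
```

Because every tensor is scaled by the same factor, the relative proportions of the gradient components (and hence the update direction) are unchanged; only the step size shrinks.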
Note that gradient clipping only solves the exploding gradient problem. The vanishing gradient problem requires architectural solutions like LSTMs and GRUs.

Types of recurrent neural networks
Simple recurrent neural networks (SRNNs)
Simple RNNs (also called vanilla RNNs or Elman networks) use the basic architecture described above: a single hidden layer with a recurrent connection and a tanh or ReLU activation. They're straightforward to implement and sufficient for tasks involving short-range dependencies.
Their main limitation is the vanishing gradient problem, which makes them poor at learning dependencies that span more than a handful of time steps. In practice, SRNNs are mostly useful as a pedagogical baseline. For real applications, LSTMs or GRUs are almost always preferred.
Long short-term memory (LSTM) networks
LSTMs were specifically designed to solve the vanishing gradient problem by introducing a cell state that runs through time with minimal modification, plus three gates that regulate information flow:
- Forget gate f_t: Decides what fraction of the previous cell state to retain. A sigmoid output near 1 means "keep this," near 0 means "discard."
- Input gate i_t: Controls how much of the new candidate information to write into the cell state.
- Output gate o_t: Determines what portion of the cell state to expose as the hidden state output.
The cell state update is: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
where c̃_t is the candidate cell state and ⊙ denotes element-wise multiplication.
The hidden state is then: h_t = o_t ⊙ tanh(c_t)
The reason this solves vanishing gradients: the cell state can propagate gradients across many time steps with only element-wise operations (no repeated matrix multiplications). When the forget gate is close to 1 and the input gate is close to 0, information flows through unchanged, and gradients pass through nearly unattenuated.
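A single LSTM step can be sketched directly from the gate equations. Here the four gate pre-activations are stacked into one matrix W acting on the concatenated [h_{t-1}, x_t], a common packing convention; the layout and toy dimensions are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_{t-1}, x_t] to the four stacked
    pre-activations: forget, input, output, candidate."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[:n])              # forget gate f_t
    i = sigmoid(z[n:2 * n])         # input gate i_t
    o = sigmoid(z[2 * n:3 * n])     # output gate o_t
    c_tilde = np.tanh(z[3 * n:])    # candidate cell state c̃_t
    c = f * c_prev + i * c_tilde    # c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h = o * np.tanh(c)              # h_t = o_t ⊙ tanh(c_t)
    return h, c

rng = np.random.default_rng(1)
hidden, inp = 4, 3
W = rng.normal(size=(4 * hidden, hidden + inp)) * 0.1
b = np.zeros(4 * hidden)
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), W, b)
```

Note that the cell-state line `c = f * c_prev + i * c_tilde` involves only element-wise operations, which is exactly the gradient highway described above.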
LSTMs remain one of the most widely used RNN architectures for speech recognition, language modeling, and time series analysis.
Gated recurrent units (GRUs)
GRUs simplify the LSTM architecture by merging the cell state and hidden state into a single hidden state, and reducing three gates to two:
- Update gate z_t: Combines the roles of the LSTM's forget and input gates. It controls how much of the previous hidden state to carry forward versus how much to replace with new information.
- Reset gate r_t: Controls how much of the previous hidden state influences the computation of the new candidate state h̃_t.
The hidden state update: h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
GRUs have fewer parameters than LSTMs (two gates instead of three, no separate cell state), which makes them faster to train and less prone to overfitting on smaller datasets. Empirically, GRUs and LSTMs perform comparably on most tasks, so the choice often comes down to computational budget and dataset size.
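The GRU step can be sketched the same way as the LSTM step above; note how the reset gate r_t modulates h_{t-1} before the candidate is computed (weight shapes and initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step with update gate z_t and reset gate r_t."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx + b_z)      # update gate z_t
    r = sigmoid(W_r @ hx + b_r)      # reset gate r_t
    # Candidate state: reset gate scales how much of h_{t-1} is visible
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    # Interpolate between carrying forward and rewriting
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(4)
hidden, inp = 4, 3
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + inp)) * 0.1 for _ in range(3))
b = np.zeros(hidden)
h_new = gru_step(rng.normal(size=inp), np.zeros(hidden), W_z, W_r, W_h, b, b, b)
```

The parameter saving is visible directly: three weight matrices here versus the four stacked blocks (plus a separate cell state) in the LSTM.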
Bidirectional RNNs
Standard RNNs only have access to past context when processing a given time step. Bidirectional RNNs (BiRNNs) run two separate RNNs over the same sequence: one processes left-to-right (forward), the other right-to-left (backward). At each time step, the outputs of both directions are concatenated (or summed) to form the final representation.
This means each output has access to the full sequence context, both past and future. BiRNNs are particularly effective for tasks where the entire sequence is available at inference time, such as:
- Speech recognition (the full utterance is recorded before processing)
- Named entity recognition (surrounding words help disambiguate)
- Signal classification (the complete signal segment is available)
BiRNNs can wrap any recurrent architecture. A BiLSTM (bidirectional LSTM) is one of the most common configurations in practice. The trade-off is that bidirectional processing requires the full sequence upfront, so it can't be used for real-time or causal applications where you must produce output before seeing future inputs.
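The run-twice-and-concatenate idea can be sketched with the simple RNN pass from earlier (parameter names and sizes are illustrative):

```python
import numpy as np

def rnn_pass(xs, W_x, W_h, b):
    """One directional pass of a simple RNN; returns all hidden states."""
    h, hs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        hs.append(h)
    return hs

def birnn(xs, fwd_params, bwd_params):
    """Run a forward and a backward RNN over the same sequence and
    concatenate their hidden states at each time step."""
    hs_f = rnn_pass(xs, *fwd_params)
    hs_b = rnn_pass(xs[::-1], *bwd_params)[::-1]  # re-align to forward time order
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]

rng = np.random.default_rng(5)
hidden, inp = 4, 3
def make_params():
    return (rng.normal(size=(hidden, inp)) * 0.1,
            rng.normal(size=(hidden, hidden)) * 0.1,
            np.zeros(hidden))

xs = [rng.normal(size=inp) for _ in range(5)]
outs = birnn(xs, make_params(), make_params())
```

The backward pass consumes the reversed sequence and its outputs are reversed again, so `outs[t]` combines a summary of x_1..x_t with a summary of x_t..x_T.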
Applications of recurrent neural networks
Language modeling and text generation
Language modeling predicts the probability of the next token (word or character) given all preceding tokens: P(w_t | w_1, …, w_{t-1}). RNNs, especially LSTMs and GRUs, learn these conditional distributions by training on large text corpora.
At generation time, the network samples from its predicted distribution at each step, feeding the sampled token back as the next input. This autoregressive process can produce coherent paragraphs of text. Applications include text completion, summarization, and dialogue systems.
Worth noting: transformer-based models have largely supplanted RNNs for large-scale language modeling, but RNN-based language models remain relevant for resource-constrained settings and as foundational concepts.

Sentiment analysis and classification
For sequence classification tasks like sentiment analysis, the RNN processes the entire input sequence and uses the final hidden state as a fixed-length representation of the whole sequence. This representation is then passed through a fully connected layer with softmax activation to produce class probabilities (e.g., positive, negative, neutral).
BiLSTMs often outperform unidirectional models here because sentiment can depend on words appearing anywhere in the text. Applications include social media monitoring, customer feedback analysis, and market research.
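The final-hidden-state classification recipe can be sketched as follows; the softmax head and all names are illustrative, and a trained model would of course use learned rather than random weights:

```python
import numpy as np

def classify_sequence(xs, W_xh, W_hh, b_h, W_cls, b_cls):
    """Run a simple RNN over the whole sequence, then feed the FINAL
    hidden state through a softmax layer for class probabilities."""
    h = np.zeros(W_hh.shape[0])
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    logits = W_cls @ h + b_cls           # fully connected head on h_T
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hh = rng.normal(size=(4, 4)) * 0.1
W_cls = rng.normal(size=(3, 4)) * 0.1    # e.g., positive / negative / neutral
probs = classify_sequence(rng.normal(size=(6, 3)),
                          W_xh, W_hh, np.zeros(4), W_cls, np.zeros(3))
```

The final hidden state acts as the fixed-length sequence representation, so the classifier head is independent of the input length.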
Speech recognition and synthesis
In speech recognition, the input is a sequence of acoustic feature vectors (e.g., mel-frequency cepstral coefficients extracted from short overlapping frames), and the output is a sequence of phonemes, characters, or words. BiLSTMs have been a core component of speech recognition pipelines because they capture both forward and backward temporal context in the acoustic signal.
For speech synthesis (text-to-speech), RNNs predict acoustic features or waveform samples from input text. Attention-based sequence-to-sequence models improved synthesis quality significantly by allowing the decoder to align flexibly with different parts of the input text rather than relying on a single fixed-length context vector.
Time series forecasting
RNNs are a natural fit for time series forecasting because the problem is inherently sequential: predict future values given a window of past observations. The network ingests a sequence of past values x_{t−n+1}, …, x_t and outputs predictions x̂_{t+1}, …, x̂_{t+k}.
RNNs handle both univariate forecasting (single signal channel) and multivariate forecasting (multiple correlated channels, e.g., temperature, humidity, and pressure together). For signal processing specifically, RNN-based forecasting is used in predictive maintenance, channel estimation in communications, and adaptive filtering scenarios where the underlying system dynamics are nonlinear and time-varying.
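A common preprocessing step for this setup is slicing the series into (past-window, future-target) pairs; a minimal sketch (window sizes are illustrative):

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slice a 1-D series into (past window, future target) training pairs:
    X[j] holds n_in consecutive values, Y[j] the n_out values that follow."""
    X, Y = [], []
    for t in range(len(series) - n_in - n_out + 1):
        X.append(series[t:t + n_in])
        Y.append(series[t + n_in:t + n_in + n_out])
    return np.array(X), np.array(Y)

series = np.arange(10.0)                 # toy series 0, 1, ..., 9
X, Y = make_windows(series, n_in=4, n_out=2)
```

Each row of X is a window the RNN consumes step by step, and the matching row of Y supplies the forecasting targets; the multivariate case works the same way with vector-valued samples.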
Advanced topics in RNNs
Attention mechanisms in RNNs
Standard sequence-to-sequence models compress the entire input into a single fixed-size context vector, which becomes a bottleneck for long sequences. Attention mechanisms solve this by letting the decoder look back at all encoder hidden states and compute a weighted combination at each decoding step.
For each decoder time step t, attention computes:
- A score e_{t,i} for each encoder hidden state h_i, measuring its relevance to the current decoding step.
- Attention weights α_{t,i} = softmax_i(e_{t,i}) that normalize the scores into a probability distribution.
- A context vector c_t = Σ_i α_{t,i} h_i that is a weighted sum of encoder states.
Two common scoring functions:
- Additive (Bahdanau) attention: e_{t,i} = v^T tanh(W_1 s_{t-1} + W_2 h_i), where s_{t-1} is the decoder state
- Multiplicative (Luong) attention: e_{t,i} = s_t^T W h_i
Attention dramatically improves performance on long sequences and also provides interpretability, since the attention weights reveal which parts of the input the model focuses on for each output.
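The three-step recipe above (score, normalize, weighted sum) can be sketched for the simplest multiplicative case, plain dot-product scoring (the state sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def dot_attention(s_t, enc_hs):
    """Dot-product attention: score each encoder state against the decoder
    state, normalize the scores, and return the weighted-sum context."""
    scores = np.array([s_t @ h_i for h_i in enc_hs])  # e_{t,i}
    alphas = softmax(scores)                           # attention weights
    context = sum(a * h for a, h in zip(alphas, enc_hs))  # c_t
    return context, alphas

rng = np.random.default_rng(2)
enc_hs = [rng.normal(size=4) for _ in range(6)]   # 6 encoder hidden states
context, alphas = dot_attention(rng.normal(size=4), enc_hs)
```

Because `alphas` sums to one, it can be read directly as a soft alignment over input positions, which is where attention's interpretability comes from.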
Sequence-to-sequence models
Sequence-to-sequence (Seq2Seq) models map a variable-length input sequence to a variable-length output sequence using an encoder-decoder architecture:
- The encoder RNN processes the input sequence and produces a sequence of hidden states (or, in the basic version, a single final hidden state used as the context vector).
- The decoder RNN generates the output sequence one element at a time, conditioned on the context and its own previous outputs.
In the basic Seq2Seq model, the encoder's final hidden state initializes the decoder. With attention, the decoder accesses all encoder hidden states at every step.
Seq2Seq models are the backbone of machine translation, abstractive summarization, and speech recognition systems. They introduced the idea of separating the "understanding" (encoding) and "generation" (decoding) stages, which remains influential even in transformer-based architectures.
Recurrent neural network regularization
RNNs are prone to overfitting, especially on smaller datasets, because their recurrent weights are applied at every time step. Common regularization strategies:
- Dropout: Applied to non-recurrent connections (input-to-hidden and hidden-to-output). Naively applying dropout to recurrent connections disrupts the hidden state dynamics. Variational dropout (Gal & Ghahramani, 2016) addresses this by using the same dropout mask at every time step, which works much better for recurrent connections.
- Weight decay (L2 regularization): Adds a penalty proportional to the squared magnitude of the weights (λ‖W‖²) to the loss function, discouraging large weights.
- Gradient noise: Adds Gaussian noise to gradients during training, which can help escape sharp minima and improve generalization.
Choosing the right regularization depends on model size and dataset. For large LSTMs, dropout rates of 0.2 to 0.5 on non-recurrent connections are typical starting points.
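The same-mask idea behind variational dropout can be sketched as follows (inverted-dropout scaling and the helper name are illustrative):

```python
import numpy as np

def variational_masks(hidden, inp, p, rng):
    """Sample ONE dropout mask per connection type, to be reused at EVERY
    time step, rather than resampling a fresh mask per step. Inverted
    dropout scaling (divide by keep probability) preserves expectations."""
    keep = 1.0 - p
    mask_x = (rng.random(inp) < keep) / keep      # mask for the input connections
    mask_h = (rng.random(hidden) < keep) / keep   # mask for the recurrent connections
    return mask_x, mask_h

rng = np.random.default_rng(6)
mask_x, mask_h = variational_masks(hidden=8, inp=5, p=0.3, rng=rng)
# Inside the RNN loop, the SAME masks would apply at every step, e.g.:
#   h = tanh(W_xh @ (x_t * mask_x) + W_hh @ (h * mask_h) + b_h)
```

Reusing one mask across time is what makes dropout on the recurrent connection viable: the hidden state sees a consistent sub-network for the whole sequence instead of fresh noise at every step.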
RNNs vs convolutional neural networks (CNNs)
CNNs and RNNs take fundamentally different approaches to sequence processing:
- RNNs process sequences step by step, maintaining a hidden state. They naturally capture long-range dependencies but are inherently sequential, limiting parallelization.
- CNNs apply learned filters across local windows of the input. They're highly parallelizable and efficient at capturing local patterns, but capturing long-range dependencies requires stacking many layers (or using dilated convolutions).
For signal processing tasks: if the relevant structure is primarily local (e.g., edge detection in spectrograms, short-duration transient detection), CNNs may be more efficient. If the task requires modeling long-range temporal dependencies (e.g., tracking slow-varying channel conditions, modeling prosody in speech), RNNs are the more natural choice.
Hybrid architectures combine both. For example, a 1D CNN can extract local features from a raw signal, and an LSTM can then model the temporal evolution of those features. Convolutional LSTMs extend this idea by replacing the fully connected operations inside LSTM gates with convolutions, which is useful for spatiotemporal data like video or radar imagery.
It's also worth noting that transformers have increasingly replaced both RNNs and CNNs for many sequence tasks, using self-attention to capture dependencies at all ranges with full parallelism. However, RNNs retain advantages in low-latency streaming applications and scenarios with very long sequences where transformer memory costs become prohibitive.