
⌨️ AP Computer Science Principles

Data Compression Techniques


Why This Matters

Data compression sits at the heart of nearly everything you do on a computer or the internet. When you stream music, send photos, or download files, compression algorithms are working behind the scenes to make that data small enough to transmit efficiently and store affordably. The AP CSP exam tests your understanding of how data is represented digitally and the trade-offs involved in different approaches to managing that data—compression is a perfect case study for both concepts.

You're being tested on more than just knowing that "JPEG is lossy." The exam wants you to understand why certain techniques work better for certain data types, what information gets sacrificed in lossy compression, and how algorithms like Huffman coding exploit patterns to achieve smaller file sizes. Don't just memorize which formats use which methods—know what concept each technique illustrates and be ready to explain the trade-offs involved.


Lossless Compression: Keeping Every Bit

Lossless compression reduces file size while preserving all original data, allowing perfect reconstruction. These techniques exploit patterns and redundancy in data without discarding any information.

Run-Length Encoding (RLE)

  • Replaces consecutive repeated values with a count and single value—for example, "AAAAAABBB" becomes "6A3B"
  • Most effective for data with long runs of identical values, such as simple graphics, icons, or monochrome images
  • Can actually increase file size when data has high variability, since storing "1A1B1C" takes more space than "ABC"
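
Here's a minimal sketch of RLE in Python, using the count-then-character string format from the example above (the function name and output format are illustrative, not a standard):

```python
def rle_encode(data: str) -> str:
    """Collapse runs of repeated characters into count + character pairs."""
    if not data:
        return ""
    encoded = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1                      # extend the current run
        else:
            encoded.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1         # start a new run
    encoded.append(f"{run_len}{run_char}")
    return "".join(encoded)

print(rle_encode("AAAAAABBB"))  # 6A3B -- 9 characters shrink to 4
print(rle_encode("ABC"))        # 1A1B1C -- high-variability data grows instead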

Huffman Coding

  • Assigns shorter binary codes to frequent symbols—a letter appearing 1000 times gets a 2-bit code while a rare letter gets a 12-bit code
  • Builds a binary tree from symbol frequencies to determine optimal code lengths, guaranteeing no code is a prefix of another
  • Foundation for formats like ZIP and PNG—understanding this algorithm helps you explain how lossless compression achieves significant size reduction
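
A compact sketch of Huffman code construction using Python's heapq module. The tie-breaking order is an arbitrary choice here, so the exact codes can differ between implementations, but the code lengths (the part that matters) will match:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build prefix-free binary codes: frequent symbols get shorter codes."""
    # Heap entries are (frequency, tiebreaker, node); a node is either a
    # symbol string or a (left, right) pair of child nodes.
    heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):                   # read codes off the tree
        if isinstance(node, str):
            codes[node] = prefix or "0"       # single-symbol edge case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes("aaaaaaaabbbc"))  # 'a' (most frequent) gets the shortest code
```

Because every symbol sits at a leaf of the tree, no code can ever be the start of another code, which is the prefix guarantee mentioned above.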

Dictionary-Based Compression (LZW Algorithm)

  • Builds a dictionary of repeated sequences on-the-fly—replaces patterns like "the" with shorter codes as they're encountered
  • Powers GIF and TIFF formats and works especially well on text files with repetitive phrases or structured data
  • Less effective on random or encrypted data where no patterns exist for the dictionary to exploit
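
A bare-bones LZW encoder sketch. The dictionary starts with all 256 single-byte values; real GIF/TIFF encoders add details like variable code widths that are omitted here:

```python
def lzw_encode(data: bytes) -> list:
    """Replace repeated byte sequences with dictionary codes,
    learning new sequences as they appear."""
    dictionary = {bytes([i]): i for i in range(256)}
    current, output = b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                       # keep extending the match
        else:
            output.append(dictionary[current])        # emit code for known prefix
            dictionary[candidate] = len(dictionary)   # learn the new sequence
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

text = b"the cat and the hat and the bat"
codes = lzw_encode(text)
print(len(text), "bytes in,", len(codes), "codes out")  # repetition shrinks the output
```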

Compare: Run-Length Encoding vs. Huffman Coding—both are lossless, but RLE exploits spatial repetition (same values in a row) while Huffman exploits frequency patterns (some symbols appear more often overall). If an FRQ asks about compressing a simple black-and-white image, RLE is your go-to example.


Lossy Compression: Strategic Sacrifice

Lossy compression achieves much higher compression ratios by permanently removing data deemed less important. These techniques rely on human perception—we don't notice when certain details disappear.

Image Compression (JPEG)

  • Discards visual information humans can't easily perceive—small color variations and high-frequency details get removed
  • Uses Discrete Cosine Transform (DCT) to convert image blocks into frequency components, then quantizes (rounds) less important frequencies
  • Chroma subsampling reduces color resolution because human eyes are more sensitive to brightness than color differences
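
The quantization step is where JPEG actually discards information. Here's a toy illustration with numpy and scipy; the 8x8 gradient block and the quantization matrix below are made up for simplicity (real JPEG uses standard quantization tables scaled by the quality setting):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """2-D DCT-II: transform rows, then columns."""
    return dct(dct(block.T, norm='ortho').T, norm='ortho')

def idct2(block):
    return idct(idct(block.T, norm='ortho').T, norm='ortho')

# A smooth 8x8 grayscale block (a diagonal gradient, like a patch of sky).
block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 4 + 100

coeffs = dct2(block)                                   # to frequency components
q = 1 + 4 * np.add.outer(np.arange(8), np.arange(8))   # coarser steps for high frequencies
quantized = np.round(coeffs / q)                       # rounding is the lossy step
restored = idct2(quantized * q)                        # decode from surviving coefficients

print(np.count_nonzero(quantized), "of 64 coefficients survive")
print("max pixel error:", round(float(np.abs(block - restored).max()), 2))
```

Most high-frequency coefficients round to zero, and the long runs of zeros then compress extremely well with lossless techniques, which is why JPEG pairs quantization with entropy coding.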

Audio Compression (MP3)

  • Removes frequencies humans can't hear or sounds masked by louder nearby frequencies—this is called perceptual coding
  • Exploits psychoacoustic principles to prioritize sounds that matter most to human hearing
  • Achieves roughly 10:1 compression while maintaining acceptable quality for most listeners, enabling music streaming and portable players
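
A very crude stand-in for perceptual coding, using a fixed loudness threshold where real encoders use full psychoacoustic models (the signal, frequencies, and threshold below are all invented for the demo):

```python
import numpy as np

# One second of fake audio: a loud 440 Hz tone plus a faint 4400 Hz tone.
rate = 44100
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 440 * t) + 0.01 * np.sin(2 * np.pi * 4400 * t)

spectrum = np.fft.rfft(signal)                 # to the frequency domain
loudest = np.abs(spectrum).max()
kept = np.where(np.abs(spectrum) > 0.05 * loudest, spectrum, 0)  # drop quiet bins

print(np.count_nonzero(kept), "of", len(spectrum), "frequency bins kept")
# Only the loud tone survives; storing the few surviving bins (plus their
# positions) takes far less space than storing every sample.
```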

Video Compression (MPEG)

  • Eliminates redundancy between frames—if only a small part of the scene changes, only that change gets stored
  • Combines intra-frame compression (like JPEG for individual frames) with inter-frame compression (storing differences between frames)
  • Essential for streaming and storage—without it, a single HD movie would require hundreds of gigabytes
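
A minimal sketch of the inter-frame idea, with small numpy arrays standing in for frames (real codecs use motion-compensated block matching, not raw pixel differences):

```python
import numpy as np

frame1 = np.full((4, 4), 50)     # a tiny 4x4 "video frame"
frame2 = frame1.copy()
frame2[1, 2] = 255               # one pixel changes between frames

diff = frame2 - frame1           # inter-frame: record only what changed
print(np.count_nonzero(diff), "of", frame1.size, "pixels changed")  # 1 of 16

# The decoder rebuilds frame2 as frame1 + diff; storing the sparse diff
# is far cheaper than storing the whole frame again.
assert np.array_equal(frame1 + diff, frame2)
```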

Compare: JPEG vs. PNG—both compress images, but JPEG is lossy (smaller files, some quality loss) while PNG is lossless (larger files, perfect quality). Choose JPEG for photographs where minor quality loss is acceptable; choose PNG for graphics, text, or anything requiring exact pixel preservation.


Specialized Techniques: Exploiting Data Structure

Some compression methods target specific data characteristics rather than general patterns. These techniques work by understanding what makes certain data types predictable.

Delta Encoding

  • Stores differences between sequential values rather than complete data points—if frame 2 is almost identical to frame 1, only the changes get recorded
  • Ideal for time-series data and video where consecutive values are highly correlated (temperature readings, stock prices, video frames)
  • Often combined with other compression methods—delta encoding reduces the data first, then RLE or Huffman coding compresses the differences further
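
A small sketch of delta encoding and decoding in Python, using temperature readings as the time-series data (the helper names are ours):

```python
def delta_encode(values):
    """Keep the first value; store only the change to each following value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)   # rebuild by accumulating changes
    return values

temps = [70, 71, 71, 72, 71, 70, 70]    # slowly changing sensor readings
deltas = delta_encode(temps)
print(deltas)                           # [70, 1, 0, 1, -1, -1, 0]
assert delta_decode(deltas) == temps    # lossless: perfect reconstruction
```

Notice how small and repetitive the deltas are; that's exactly what makes the follow-up RLE or Huffman pass mentioned above so effective.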

Compare: Delta Encoding vs. Dictionary-Based Compression—delta encoding exploits sequential similarity (nearby values are similar) while dictionary compression exploits repeated patterns (same sequences appear multiple times). Delta encoding is your best example when discussing video compression or sensor data.


The Lossless vs. Lossy Trade-Off

The fundamental choice in compression comes down to what you're willing to sacrifice. Understanding this trade-off is one of the most commonly tested concepts on the AP CSP exam.

Compression Ratios and Quality

  • Compression ratio measures size reduction—a 10:1 ratio means the compressed file is one-tenth the original size
  • Lossless compression typically achieves 2:1 to 3:1 for most data, while lossy can reach 10:1 or higher depending on acceptable quality loss
  • Higher compression in lossy formats means more artifacts—heavily compressed JPEGs show visible blocks, and low-bitrate MP3s sound "muddy"
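
The ratio arithmetic in code form (the sizes here are illustrative):

```python
original_size = 5_000_000      # bytes, e.g. an uncompressed image
compressed_size = 500_000      # bytes after lossy compression

ratio = original_size / compressed_size
print(f"{ratio:.0f}:1 compression ratio")                          # 10:1
print(f"space saved: {1 - compressed_size / original_size:.0%}")   # 90%
```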

Choosing the Right Approach

  • Use lossless when accuracy matters—medical images, legal documents, source code, or any data that will be edited further
  • Use lossy when perception matters more than perfection—streaming media, web images, or any content where slight quality loss is acceptable
  • Consider the use case and audience—a professional photographer needs lossless originals, but social media posts work fine with lossy compression

Compare: Lossless vs. Lossy Compression—lossless preserves all data (perfect for text, code, medical images) while lossy sacrifices some data for dramatically smaller files (ideal for media streaming). An FRQ might ask you to justify which approach fits a given scenario—always consider whether the data can tolerate any loss.


File Formats and Their Methods

| Format | Compression Type | Primary Technique |
| --- | --- | --- |
| PNG | Lossless | DEFLATE (LZ77 dictionary coding + Huffman) |
| GIF | Lossless | LZW dictionary compression |
| JPEG | Lossy | DCT + quantization |
| MP3 | Lossy | Perceptual coding |
| FLAC | Lossless | Predictive coding + entropy coding |
| MP4/MPEG | Lossy | Inter-frame + intra-frame compression |
| ZIP | Lossless | DEFLATE (Huffman + LZ77 dictionary coding) |

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Exploiting repetition | Run-Length Encoding, dictionary-based (LZW) |
| Exploiting frequency patterns | Huffman coding |
| Exploiting sequential similarity | Delta encoding, video inter-frame compression |
| Exploiting human perception limits | JPEG (vision), MP3 (hearing), MPEG (both) |
| Lossless image formats | PNG, GIF, TIFF |
| Lossy media formats | JPEG, MP3, MP4/MPEG |
| Trade-off decision factors | Accuracy needs, file size constraints, editing requirements |

Self-Check Questions

  1. Which two compression techniques both exploit patterns in data but target different types of patterns—and what's the key difference between them?

  2. A hospital needs to store X-ray images for patient records. Should they use JPEG or PNG compression, and what principle guides this decision?

  3. Explain why delta encoding is particularly effective for video compression but would be nearly useless for compressing a collection of unrelated photographs.

  4. Compare Huffman coding and Run-Length Encoding: under what conditions would each technique be most effective, and could they ever be used together?

  5. An FRQ describes a music streaming service that wants to reduce bandwidth costs. Explain the trade-offs involved in increasing the compression ratio of their audio files, referencing specific techniques and their effects on user experience.