
⌨️ AP Computer Science Principles

Data Compression Techniques


Why This Matters

Data compression sits at the heart of nearly everything you do on a computer or the internet. When you stream music, send photos, or download files, compression algorithms are working behind the scenes to make that data small enough to transmit efficiently and store affordably. The AP CSP exam tests your understanding of how data is represented digitally and the trade-offs involved in different approaches to managing that data—compression is a perfect case study for both concepts.

You're being tested on more than just knowing that "JPEG is lossy." The exam wants you to understand why certain techniques work better for certain data types, what information gets sacrificed in lossy compression, and how algorithms like Huffman coding exploit patterns to achieve smaller file sizes. Don't just memorize which formats use which methods—know what concept each technique illustrates and be ready to explain the trade-offs involved.


Lossless Compression: Keeping Every Bit

Lossless compression reduces file size while preserving all original data, allowing perfect reconstruction. These techniques exploit patterns and redundancy in data without discarding any information.

Run-Length Encoding (RLE)

  • Replaces consecutive repeated values with a count and single value—for example, "AAAAAABBB" becomes "6A3B"
  • Most effective for data with long runs of identical values, such as simple graphics, icons, or monochrome images
  • Can actually increase file size when data has high variability, since storing "1A1B1C" takes more space than "ABC"
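
Here's a minimal sketch of RLE in Python, using the count-then-character string format from the example above (the function name and output format are illustrative, not a standard):

```python
def rle_encode(data: str) -> str:
    """Collapse runs of repeated characters into count + character pairs."""
    if not data:
        return ""
    encoded = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1                      # extend the current run
        else:
            encoded.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1         # start a new run
    encoded.append(f"{run_len}{run_char}")
    return "".join(encoded)

print(rle_encode("AAAAAABBB"))  # 6A3B -- 9 characters shrink to 4
print(rle_encode("ABC"))        # 1A1B1C -- high-variability data grows instead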

Huffman Coding

  • Assigns shorter binary codes to frequent symbols—a letter appearing 1000 times gets a 2-bit code while a rare letter gets a 12-bit code
  • Builds a binary tree from symbol frequencies to determine optimal code lengths, guaranteeing no code is a prefix of another
  • Foundation for formats like ZIP and PNG—understanding this algorithm helps you explain how lossless compression achieves significant size reduction
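
A compact sketch of Huffman code construction using Python's heapq module. The tie-breaking order is an arbitrary choice here, so the exact codes can differ between implementations, but the code lengths (the part that matters) will match:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build prefix-free binary codes: frequent symbols get shorter codes."""
    # Heap entries are (frequency, tiebreaker, node); a node is either a
    # symbol string or a (left, right) pair of child nodes.
    heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):                   # read codes off the tree
        if isinstance(node, str):
            codes[node] = prefix or "0"       # single-symbol edge case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes("aaaaaaaabbbc"))  # 'a' (most frequent) gets the shortest code
```

Because every symbol sits at a leaf of the tree, no code can ever be the start of another code, which is the prefix guarantee mentioned above.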

Dictionary-Based Compression (LZW Algorithm)

  • Builds a dictionary of repeated sequences on-the-fly—replaces patterns like "the" with shorter codes as they're encountered
  • Powers GIF and TIFF formats and works especially well on text files with repetitive phrases or structured data
  • Less effective on random or encrypted data where no patterns exist for the dictionary to exploit
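
A bare-bones LZW encoder sketch. The dictionary starts with all 256 single-byte values; real GIF/TIFF encoders add details like variable code widths that are omitted here:

```python
def lzw_encode(data: bytes) -> list:
    """Replace repeated byte sequences with dictionary codes,
    learning new sequences as they appear."""
    dictionary = {bytes([i]): i for i in range(256)}
    current, output = b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                       # keep extending the match
        else:
            output.append(dictionary[current])        # emit code for known prefix
            dictionary[candidate] = len(dictionary)   # learn the new sequence
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

text = b"the cat and the hat and the bat"
codes = lzw_encode(text)
print(len(text), "bytes in,", len(codes), "codes out")  # repetition shrinks the output
```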

Compare: Run-Length Encoding vs. Huffman Coding—both are lossless, but RLE exploits spatial repetition (same values in a row) while Huffman exploits frequency patterns (some symbols appear more often overall). If an FRQ asks about compressing a simple black-and-white image, RLE is your go-to example.


Lossy Compression: Strategic Sacrifice

Lossy compression achieves much higher compression ratios by permanently removing data deemed less important. These techniques rely on human perception—we don't notice when certain details disappear.

Image Compression (JPEG)

  • Discards visual information humans can't easily perceive—small color variations and high-frequency details get removed
  • Uses Discrete Cosine Transform (DCT) to convert image blocks into frequency components, then quantizes (rounds) less important frequencies
  • Chroma subsampling reduces color resolution because human eyes are more sensitive to brightness than color differences
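
The quantization step is where JPEG actually discards information. Here's a toy illustration with numpy and scipy; the 8x8 gradient block and the quantization matrix below are made up for simplicity (real JPEG uses standard quantization tables scaled by the quality setting):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """2-D DCT-II: transform rows, then columns."""
    return dct(dct(block.T, norm='ortho').T, norm='ortho')

def idct2(block):
    return idct(idct(block.T, norm='ortho').T, norm='ortho')

# A smooth 8x8 grayscale block (a diagonal gradient, like a patch of sky).
block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 4 + 100

coeffs = dct2(block)                                   # to frequency components
q = 1 + 4 * np.add.outer(np.arange(8), np.arange(8))   # coarser steps for high frequencies
quantized = np.round(coeffs / q)                       # rounding is the lossy step
restored = idct2(quantized * q)                        # decode from surviving coefficients

print(np.count_nonzero(quantized), "of 64 coefficients survive")
print("max pixel error:", round(float(np.abs(block - restored).max()), 2))
```

Most high-frequency coefficients round to zero, and the long runs of zeros then compress extremely well with lossless techniques, which is why JPEG pairs quantization with entropy coding.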

Audio Compression (MP3)

  • Removes frequencies humans can't hear or sounds masked by louder nearby frequencies—this is called perceptual coding
  • Exploits psychoacoustic principles to prioritize sounds that matter most to human hearing
  • Achieves roughly 10:1 compression while maintaining acceptable quality for most listeners, enabling music streaming and portable players
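
A very crude stand-in for perceptual coding, using a fixed loudness threshold where real encoders use full psychoacoustic models (the signal, frequencies, and threshold below are all invented for the demo):

```python
import numpy as np

# One second of fake audio: a loud 440 Hz tone plus a faint 4400 Hz tone.
rate = 44100
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 440 * t) + 0.01 * np.sin(2 * np.pi * 4400 * t)

spectrum = np.fft.rfft(signal)                 # to the frequency domain
loudest = np.abs(spectrum).max()
kept = np.where(np.abs(spectrum) > 0.05 * loudest, spectrum, 0)  # drop quiet bins

print(np.count_nonzero(kept), "of", len(spectrum), "frequency bins kept")
# Only the loud tone survives; storing the few surviving bins (plus their
# positions) takes far less space than storing every sample.
```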

Video Compression (MPEG)

  • Eliminates redundancy between frames—if only a small part of the scene changes, only that change gets stored
  • Combines intra-frame compression (like JPEG for individual frames) with inter-frame compression (storing differences between frames)
  • Essential for streaming and storage—without it, a single HD movie would require hundreds of gigabytes
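
A minimal sketch of the inter-frame idea, with small numpy arrays standing in for frames (real codecs use motion-compensated block matching, not raw pixel differences):

```python
import numpy as np

frame1 = np.full((4, 4), 50)     # a tiny 4x4 "video frame"
frame2 = frame1.copy()
frame2[1, 2] = 255               # one pixel changes between frames

diff = frame2 - frame1           # inter-frame: record only what changed
print(np.count_nonzero(diff), "of", frame1.size, "pixels changed")  # 1 of 16

# The decoder rebuilds frame2 as frame1 + diff; storing the sparse diff
# is far cheaper than storing the whole frame again.
assert np.array_equal(frame1 + diff, frame2)
```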

Compare: JPEG vs. PNG—both compress images, but JPEG is lossy (smaller files, some quality loss) while PNG is lossless (larger files, perfect quality). Choose JPEG for photographs where minor quality loss is acceptable; choose PNG for graphics, text, or anything requiring exact pixel preservation.


Specialized Techniques: Exploiting Data Structure

Some compression methods target specific data characteristics rather than general patterns. These techniques work by understanding what makes certain data types predictable.

Delta Encoding

  • Stores differences between sequential values rather than complete data points—if frame 2 is almost identical to frame 1, only the changes get recorded
  • Ideal for time-series data and video where consecutive values are highly correlated (temperature readings, stock prices, video frames)
  • Often combined with other compression methods—delta encoding reduces the data first, then RLE or Huffman coding compresses the differences further
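
A small sketch of delta encoding and decoding in Python, using temperature readings as the time-series data (the helper names are ours):

```python
def delta_encode(values):
    """Keep the first value; store only the change to each following value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)   # rebuild by accumulating changes
    return values

temps = [70, 71, 71, 72, 71, 70, 70]    # slowly changing sensor readings
deltas = delta_encode(temps)
print(deltas)                           # [70, 1, 0, 1, -1, -1, 0]
assert delta_decode(deltas) == temps    # lossless: perfect reconstruction
```

Notice how small and repetitive the deltas are; that's exactly what makes the follow-up RLE or Huffman pass mentioned above so effective.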

Compare: Delta Encoding vs. Dictionary-Based Compression—delta encoding exploits sequential similarity (nearby values are similar) while dictionary compression exploits repeated patterns (same sequences appear multiple times). Delta encoding is your best example when discussing video compression or sensor data.


The Lossless vs. Lossy Trade-Off

The fundamental choice in compression comes down to what you're willing to sacrifice. Understanding this trade-off is one of the most commonly tested concepts on the AP CSP exam.

Compression Ratios and Quality

  • Compression ratio measures size reduction—a 10:1 ratio means the compressed file is one-tenth the original size
  • Lossless compression typically achieves 2:1 to 3:1 for most data, while lossy can reach 10:1 or higher depending on acceptable quality loss
  • Higher compression in lossy formats means more artifacts—heavily compressed JPEGs show visible blocks, and low-bitrate MP3s sound "muddy"
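
The ratio arithmetic in code form (the sizes here are illustrative):

```python
original_size = 5_000_000      # bytes, e.g. an uncompressed image
compressed_size = 500_000      # bytes after lossy compression

ratio = original_size / compressed_size
print(f"{ratio:.0f}:1 compression ratio")                          # 10:1
print(f"space saved: {1 - compressed_size / original_size:.0%}")   # 90%
```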

Choosing the Right Approach

  • Use lossless when accuracy matters—medical images, legal documents, source code, or any data that will be edited further
  • Use lossy when perception matters more than perfection—streaming media, web images, or any content where slight quality loss is acceptable
  • Consider the use case and audience—a professional photographer needs lossless originals, but social media posts work fine with lossy compression

Compare: Lossless vs. Lossy Compression—lossless preserves all data (perfect for text, code, medical images) while lossy sacrifices some data for dramatically smaller files (ideal for media streaming). An FRQ might ask you to justify which approach fits a given scenario—always consider whether the data can tolerate any loss.


File Formats and Their Methods

| Format | Compression Type | Primary Technique |
| --- | --- | --- |
| PNG | Lossless | DEFLATE (LZ77 dictionary coding + Huffman) |
| GIF | Lossless | LZW dictionary compression |
| JPEG | Lossy | DCT + quantization |
| MP3 | Lossy | Perceptual coding |
| FLAC | Lossless | Predictive coding + entropy coding |
| MP4/MPEG | Lossy | Inter-frame + intra-frame compression |
| ZIP | Lossless | DEFLATE (Huffman + LZ77 dictionary coding) |

Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Exploiting repetition | Run-Length Encoding, dictionary-based (LZW) |
| Exploiting frequency patterns | Huffman coding |
| Exploiting sequential similarity | Delta encoding, video inter-frame compression |
| Exploiting human perception limits | JPEG (vision), MP3 (hearing), MPEG (both) |
| Lossless image formats | PNG, GIF, TIFF |
| Lossy media formats | JPEG, MP3, MP4/MPEG |
| Trade-off decision factors | Accuracy needs, file size constraints, editing requirements |

Self-Check Questions

  1. Which two compression techniques both exploit patterns in data but target different types of patterns—and what's the key difference between them?

  2. A hospital needs to store X-ray images for patient records. Should they use JPEG or PNG compression, and what principle guides this decision?

  3. Explain why delta encoding is particularly effective for video compression but would be nearly useless for compressing a collection of unrelated photographs.

  4. Compare Huffman coding and Run-Length Encoding: under what conditions would each technique be most effective, and could they ever be used together?

  5. An FRQ describes a music streaming service that wants to reduce bandwidth costs. Explain the trade-offs involved in increasing the compression ratio of their audio files, referencing specific techniques and their effects on user experience.