Overview
Big Idea 2: Data makes up 17-22% of the AP CSP exam, one of the more heavily weighted Big Ideas; only Algorithms and Programming (30-35%) and Impact of Computing (21-26%) carry more weight. It covers how computers represent everything as bits, how compression shrinks data, and how programs turn raw data into useful information. The core claim of the whole unit fits in one sentence: data and information facilitate the creation of knowledge.
Here's the mental model. A video of your marching band, a census spreadsheet, a song on Spotify... all of it is ultimately 1s and 0s. This unit explains how that translation works (binary, sampling, abstraction), how we store it efficiently (compression), and how we squeeze meaning out of it (filtering, cleaning, visualizing). You can find all four topic guides on the Unit 2 hub page.

What Big Idea 2 Covers
Big Idea 2 has four topics, and they flow in a logical order: represent the data, store it efficiently, analyze it, then use programs to do the analysis at scale.
| Topic | What it's about |
|---|---|
| 2.1 Binary Numbers | How bits represent numbers, text, color, and sound, plus converting between binary and decimal |
| 2.2 Data Compression | Lossless vs. lossy compression and choosing the right one for a situation |
| 2.3 Extracting Information from Data | What data and metadata can tell you, plus the challenges of messy real-world data |
| 2.4 Using Programs with Data | How programs filter, transform, combine, and visualize data to find patterns |
Topic 2.1, Binary Numbers, is the foundation. Computers store everything digitally, meaning the lowest-level component of any value is a bit (a binary digit, either 0 or 1). Eight bits make a byte. You'll learn to convert positive integers between binary (base 2) and decimal (base 10) and to compare binary numbers. You'll also see what bits can't do perfectly: fixed numbers of bits cause overflow errors with integers and roundoff errors with real numbers, and analog data (like the smooth pitch of a song) can only be approximated digitally through sampling.
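Here's a minimal Python sketch of the two conversions described above. The AP exam doesn't use Python, and the function names `binary_to_decimal` and `decimal_to_binary` are made up for illustration; treat this as one way to spell out the place-value idea, not an official method.

```python
def binary_to_decimal(bits: str) -> int:
    """Add up each bit times its place value (a power of 2)."""
    total = 0
    # Walk from the rightmost bit (position 0) toward the leftmost.
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * (2 ** position)
    return total


def decimal_to_binary(number: int) -> str:
    """Repeatedly divide by 2 and collect the remainders (non-negative integers only)."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        digits.append(str(number % 2))  # the remainder is the next bit, right to left
        number //= 2
    return "".join(reversed(digits))


print(binary_to_decimal("1011"))  # 8 + 0 + 2 + 1 = 11
print(decimal_to_binary(11))      # 1011
```

Doing the same steps by hand is exactly what the exam expects: write the powers of 2 under the bits and add the ones that sit under a 1.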
Topic 2.2, Data Compression, asks one practical question: how do we make files smaller? Lossless compression reduces bits while guaranteeing you can perfectly reconstruct the original. Lossy compression shrinks files even more, but you can only get back an approximation. The exam skill here is judgment. When quality matters most, choose lossless. When file size or transmission speed matters most, choose lossy.
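To make the difference concrete, here's a small Python sketch. It uses run-length encoding as a stand-in for lossless compression and simple rounding as a stand-in for lossy compression. Both are illustrative choices, not techniques the exam requires you to know, and the sample data is invented.

```python
def rle_encode(text):
    """Lossless: store each run of repeated characters as (character, count)."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs


def rle_decode(runs):
    """Rebuild the exact original text from the runs."""
    return "".join(ch * count for ch, count in runs)


def quantize(values, step):
    """Lossy: round each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in values]


row = "WWWWWWBBBWWWW"                 # hypothetical row of white/black pixels
encoded = rle_encode(row)
print(encoded)                        # [('W', 6), ('B', 3), ('W', 4)]
print(rle_decode(encoded) == row)     # True: nothing was lost

samples = [12, 18, 23, 27, 31]        # hypothetical sensor readings
print(quantize(samples, 10))          # [10, 20, 20, 30, 30]: originals are gone
```

The rounded list has fewer distinct values, which is the kind of simplification that lets lossy formats shrink further, but there's no way to get 12 or 18 back from 10 or 20.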
Topic 2.3, Extracting Information from Data, is about what data can actually tell you. Information is the collection of facts and patterns extracted from data. You'll work with metadata (data about data, like an image's file size or creation date) and the real headaches of data analysis: cleaning inconsistent entries, handling incomplete or invalid data, combining multiple sources, and recognizing that bias comes from how data is collected, not how much you have.
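As a rough illustration of cleaning, the Python sketch below standardizes inconsistent state entries and flags a blank one as incomplete. The responses, the `STANDARD` lookup table, and the `clean_state` helper are all hypothetical.

```python
# Hypothetical survey responses with inconsistent spellings and a blank entry.
raw_states = ["CA", "ca ", "Calif.", "California", " NY", "New York", ""]

# Assumed lookup table mapping known variants to one standard abbreviation.
STANDARD = {
    "ca": "CA", "calif.": "CA", "california": "CA",
    "ny": "NY", "new york": "NY",
}


def clean_state(entry):
    """Make an entry uniform without changing what it means."""
    key = entry.strip().lower()
    if not key:
        return None                          # incomplete data: nothing to standardize
    return STANDARD.get(key, entry.strip())  # unknown values are kept, just trimmed


cleaned = [clean_state(entry) for entry in raw_states]
print(cleaned)  # ['CA', 'CA', 'CA', 'CA', 'NY', 'NY', None]

incomplete = sum(1 for entry in cleaned if entry is None)
print(incomplete, "incomplete entry")  # 1 incomplete entry
```

Notice that cleaning only changes the form of each entry. If the survey reached only people in two states, no amount of cleaning fixes that bias.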
Topic 2.4, Using Programs with Data, connects data to programming. Large datasets can't be analyzed by hand, so programs transform every element, filter out what you don't need, combine or compare values, and visualize results through charts and graphs. This is where Big Idea 2 links up with Big Idea 3 (Algorithms and Programming) and Big Idea 5 (Impact of Computing).
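The filter, transform, and combine steps can be sketched in a few lines of Python. The `songs` list is invented sample data, and on the exam you'd see these ideas in AP pseudocode rather than Python.

```python
# Hypothetical dataset: one record per song.
songs = [
    {"title": "Intro",  "seconds": 95,  "plays": 300},
    {"title": "Anthem", "seconds": 240, "plays": 1500},
    {"title": "Ballad", "seconds": 210, "plays": 800},
    {"title": "Outro",  "seconds": 120, "plays": 400},
]

# Filter: keep only the songs longer than three minutes.
long_songs = [s["title"] for s in songs if s["seconds"] > 180]

# Transform: apply the same change to every element (seconds -> whole minutes).
whole_minutes = [s["seconds"] // 60 for s in songs]

# Combine: reduce many values to a single summary value.
total_plays = sum(s["plays"] for s in songs)

print(long_songs)     # ['Anthem', 'Ballad']
print(whole_minutes)  # [1, 4, 3, 2]
print(total_plays)    # 3000
```

With four songs you could do this by hand. The point is that the same three lines work unchanged on four million.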
Key Concepts and Vocabulary
These terms show up constantly in Unit 2 questions. For the full course glossary, check the AP CSP key terms page.
- Bit: shorthand for binary digit, either 0 or 1. The smallest unit of data a computer stores.
- Byte: a group of 8 bits.
- Binary (base 2): a number system using only 0s and 1s. Each position's place value is 2 raised to the power of the position, counting from 0 at the rightmost digit.
- Decimal (base 10): the everyday number system using digits 0-9.
- Abstraction: reducing complexity by hiding irrelevant details and focusing on the main idea. Representing analog data digitally is itself an abstraction.
- Analog data: data with values that change smoothly over time, like the pitch of music or colors in a painting.
- Sampling: measuring an analog signal at regular intervals to approximate it digitally.
- Overflow error: what happens when a number is too large for the fixed number of bits available to store it.
- Roundoff error: error from real numbers being stored as approximations.
- Lossless compression: reduces data size while guaranteeing complete reconstruction of the original.
- Lossy compression: reduces data size more aggressively but only allows reconstruction of an approximation.
- Information: the collection of facts and patterns extracted from data.
- Metadata: data about data, like a photo's creation date or file size. Changing metadata doesn't change the primary data.
- Cleaning data: making data uniform without changing its meaning, like standardizing spellings and abbreviations.
- Correlation vs. causation: data may show two variables move together, but that doesn't prove one causes the other.
- Scalability: how well a system handles growing datasets. Very large datasets may require parallel systems instead of a single computer.
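Overflow and roundoff are easier to believe once you see them. Python's own integers grow as needed and don't overflow, so the sketch below simulates a 4-bit unsigned integer by keeping only the lowest 4 bits. Wrapping around is one common overflow behavior, not the only one, and `add_fixed_width` is a made-up helper.

```python
BITS = 4
MAX_VALUE = 2 ** BITS - 1  # largest value 4 bits can hold: 1111 in binary, 15 in decimal


def add_fixed_width(a, b, bits=BITS):
    """Add two numbers, then keep only the lowest `bits` bits (simulated wraparound)."""
    return (a + b) & (2 ** bits - 1)


print(MAX_VALUE)               # 15
print(add_fixed_width(14, 3))  # the true sum 17 needs 5 bits, so the result wraps to 1

# Roundoff: 0.1 and 0.2 are stored as binary approximations,
# so their sum is not exactly the stored value of 0.3.
print(0.1 + 0.2 == 0.3)        # False
```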
How This Unit Shows Up on the Exam
Big Idea 2 accounts for 17-22% of the multiple-choice exam, so expect a meaningful slice of your questions to come from this unit. The questions tend to fall into a few predictable buckets.
Binary conversion and comparison. You'll be told how text or media (like color values) are represented and asked to convert between binary and decimal or to order binary numbers. The trick is realizing binary works exactly like decimal, just with place values that are powers of 2 instead of powers of 10. Lean on what you already know about base 10.
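A quick way to check your hand conversions while studying is Python's built-in `int(text, 2)`, which reads a string as a base-2 number. The sketch below assumes a color stored as three 8-bit values (red, green, blue); that layout and the sample values are just for illustration.

```python
# Hypothetical color: three 8-bit binary strings for red, green, and blue.
color_bits = ["11111111", "10100101", "00000000"]

# int(text, 2) interprets the string as a base-2 number.
red, green, blue = (int(bits, 2) for bits in color_bits)
print(red, green, blue)  # 255 165 0

# Ordering binary numbers: compare their values, not their spelling.
numbers = ["1010", "0111", "1100"]
ordered = sorted(numbers, key=lambda bits: int(bits, 2))
print(ordered)  # ['0111', '1010', '1100'] which is 7, 10, 12
```

On the exam you won't have an interpreter, so use this only to confirm answers you worked out on paper first.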
Compression judgment calls. Expect scenario questions asking which compression algorithm is best in a given situation. The deciding factor is always the priority: perfect reconstruction points to lossless, minimum size or transmission time points to lossy. Knowing whether data can be restored to its uncompressed state is exactly the distinction these questions test.
Data and metadata scenarios. You'll see descriptions of a dataset and its metadata and be asked what information can be extracted, or what programming process (transform, filter, combine, visualize) could extract or modify it.
Code segments with data. Several Unit 2 skills involve determining the result of code segments that process data, so this unit overlaps with your programming skills more than you might expect.
Common Mistakes
- Treating binary like a foreign language instead of a place-value system. Binary works just like decimal: each digit times its place value, summed up. The only change is that place values are powers of 2. Write out the powers (1, 2, 4, 8, 16, 32...) under each bit and add.
- Assuming fewer bits means less information. Not necessarily. Compression can shrink the bit count while preserving all the information, which is the entire point of lossless algorithms.
- Mixing up lossy and lossless. Lossless guarantees perfect reconstruction; lossy gives you a smaller file but only an approximation. If a question says quality or exact reconstruction matters most, the answer is lossless every time.
- Confusing correlation with causation. Data showing two variables move together does not prove one causes the other. The exam loves this distinction, and the correct answer usually involves "additional research is needed."
- Thinking more data fixes bias. Bias comes from the type or source of data collected. Collecting more of the same biased data just gives you more bias.
- Believing editing metadata changes the data. Changes and deletions to metadata never alter the primary data. Deleting a photo's timestamp doesn't change the photo.
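The last mistake is easy to check in code. In this Python sketch a photo is modeled as a dictionary with separate `pixels` and `metadata` entries. That structure and every value in it are hypothetical, since real image formats store metadata in their own ways.

```python
# Hypothetical photo: the pixel values are the data, everything else is metadata.
photo = {
    "pixels": [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    "metadata": {"created": "2024-05-01", "camera": "ExamplePhone"},
}

# Keep a copy of the pixel data so we can compare afterward.
pixels_before = [row[:] for row in photo["pixels"]]

# Delete one piece of metadata and change another.
del photo["metadata"]["created"]
photo["metadata"]["camera"] = "Unknown"

print(photo["pixels"] == pixels_before)  # True: the image itself is untouched
print(photo["metadata"])                 # {'camera': 'Unknown'}
```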
Practice and Next Steps
Start with Topic 2.1 and get genuinely fast at binary-to-decimal conversion, since it's the most mechanical skill in the unit and the easiest points to lock down. Then work through the compression, data extraction, and programs-with-data topics on the Unit 2 page, focusing on the scenario judgment those topics require.
To check your understanding, run Unit 2 questions in guided practice, which gives you instant feedback on multiple-choice questions like the ones on the exam. Since Big Idea 2 connects directly to programming, mixing in some FRQ-style practice helps you see how data processing shows up in code. When you're further into the course, a full-length practice exam will show you how Unit 2 questions feel mixed in with everything else, and the score calculator can translate your results into a projected AP score.
Vocabulary
The following words are mentioned explicitly in the College Board Course and Exam Description for this topic.
| Term | Definition |
|---|---|
| abstraction | The process of reducing complexity by focusing on main ideas and hiding irrelevant details to allow focus on the essential concept. |
| analog data | Data that have values changing smoothly and continuously over time, such as pitch, volume, or position. |
| binary | A base-2 number system that uses only the digits 0 and 1 to represent data. |
| bit | Shorthand for binary digit; the smallest unit of data in computing, represented as either 0 or 1. |
| byte | A unit of digital data consisting of 8 bits. |
| constants | Fixed data values that do not change during program execution. |
| data | Information represented in a form that can be processed by a program, such as numbers, text, or records. |
| decimal | A base-10 number system that uses the digits 0-9 to represent data. |
| fixed number of bits | A predetermined, limited quantity of bits allocated to represent a data value in programming languages, which constrains the range of representable values. |
| integers | Whole numbers (positive, negative, or zero) that are represented in programming languages using a fixed number of bits. |
| number bases | Different systems for representing numerical values, such as binary (base 2) and decimal (base 10). |
| overflow | An error that occurs when a computed value exceeds the maximum value that can be represented by a fixed number of bits. |
| place value | The numeric value assigned to a digit's position in a number, determined by the base raised to the power of the position. |
| real numbers | Numbers that include both integers and decimal values, represented in programming languages with a fixed number of bits as approximations. |
| roundoff error | An error that occurs when real numbers are approximated in computer storage due to limitations in the fixed number of bits used to represent them. |
| sampling | A technique for approximating analog data digitally by measuring values of an analog signal at regular intervals. |
| variable | A named container in a program that stores a value which can be changed through assignment. |
Frequently Asked Questions
What is Big Idea 2 in AP CSP?
Big Idea 2: Data covers how computers represent information using bits and how programs extract knowledge from data. It includes four topics: Binary Numbers (2.1), Data Compression (2.2), Extracting Information from Data (2.3), and Using Programs with Data (2.4). You can review all four on the Unit 2 page.
How much of the AP CSP exam is Big Idea 2?
Big Idea 2: Data accounts for 17-22% of the AP CSP multiple-choice exam, making it one of the more heavily weighted Big Ideas; only Algorithms and Programming (30-35%) and Impact of Computing (21-26%) are weighted more. Expect questions on binary conversion, choosing compression algorithms, and what information can be extracted from data and metadata.
What is the difference between lossy and lossless compression in AP CSP?
Lossless compression reduces the number of bits while guaranteeing complete reconstruction of the original data. Lossy compression usually shrinks data more, but you can only reconstruct an approximation of the original. On the exam, pick lossless when quality or exact reconstruction matters most, and lossy when minimizing file size or transmission time matters most.
Does collecting more data eliminate bias?
No. Bias comes from the type or source of the data being collected, so gathering more of the same data just reproduces the same bias. This is a favorite misconception on AP CSP exam questions about data processing challenges, alongside the related trap of confusing correlation with causation.
What is metadata in AP CSP?
Metadata is data about data. For example, an image is the data, while its creation date and file size are metadata. Metadata is used for finding, organizing, and managing information, and changing or deleting metadata never changes the primary data itself.