
📊 Principles of Data Science

Types of Data Structures


Why This Matters

In data science, choosing the right data structure isn't just a programming decision—it's a performance decision that can make or break your analysis. You're being tested on understanding time complexity, memory trade-offs, and access patterns because these concepts determine whether your code runs in seconds or hours on real-world datasets. The structures you'll encounter—from simple arrays to complex graphs—each solve specific problems, and knowing when to reach for each one separates competent data scientists from struggling ones.

Don't just memorize what each structure looks like—know why it exists and what problem it solves. Can you explain why a hash table beats an array for lookups? Why you'd choose a heap over a sorted list for finding maximums? These are the conceptual connections that appear in exam questions and, more importantly, in actual data science work. Master the underlying mechanics, and the implementation details will follow naturally.


Sequential Access Structures

These structures store elements in a specific order where position matters. The key trade-off here is between access speed and modification flexibility—you're essentially choosing between fast reads or fast writes.

Arrays

  • Contiguous memory storage—elements sit next to each other in memory, enabling direct index-based access in O(1) time
  • Fixed size requires knowing capacity upfront; resizing means creating an entirely new array and copying elements
  • Insertion/deletion costs O(n) because shifting elements maintains the contiguous structure—critical for understanding why other structures exist
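A minimal sketch of the trade-off above, using Python's built-in list (which is backed by a dynamic array); the values are illustrative:

```python
arr = [10, 20, 30, 40]

# O(1): index-based access jumps straight to a memory offset.
third = arr[2]      # 30

# O(n): inserting at the front forces every existing element to
# shift right so the storage stays contiguous.
arr.insert(0, 5)    # arr is now [5, 10, 20, 30, 40]
```

Note that Python lists hide the resizing step: when capacity runs out, the interpreter allocates a larger array and copies the elements, exactly as described above.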

Linked Lists

  • Node-based architecture—each element contains data plus a pointer to the next node, eliminating the need for contiguous memory
  • Dynamic sizing allows O(1) insertions and deletions once you've found the location (no shifting required)
  • Sequential traversal required for access, making lookups O(n)—you sacrifice random access for flexibility
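The node-pointer architecture can be sketched in a few lines; the `Node` class and values here are illustrative:

```python
class Node:
    """A singly linked list node: data plus a pointer to the next node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

# Build 1 -> 2 -> 4, then splice a 3 between 2 and 4.
head = Node(1, Node(2, Node(4)))

# O(1) insertion once the position is known: just rewire two pointers.
second = head.next
second.next = Node(3, second.next)

# O(n) access: reaching any element requires walking from the head.
values = []
node = head
while node:
    values.append(node.data)
    node = node.next
# values == [1, 2, 3, 4]
```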

Compare: Arrays vs. Linked Lists—both store ordered sequences, but arrays win on access speed (O(1) vs. O(n)) while linked lists win on modification flexibility (O(1) vs. O(n) for insertions). If an FRQ asks about trade-offs in data structure selection, this is your foundational example.


Constrained Access Structures

Stacks and queues restrict how you interact with data, enforcing specific access patterns. This constraint isn't a limitation—it's a feature that models real-world processes and simplifies certain algorithms.

Stacks

  • Last-In-First-Out (LIFO) ordering means the most recently added element comes out first—think of a stack of plates
  • Two core operations: push (add to top) and pop (remove from top), both running in O(1) time
  • Essential for recursion and function call management; also powers undo mechanisms and expression parsing in compilers
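The expression-parsing use case can be sketched with a Python list standing in for the stack (`append` is push, `pop` is pop); the `balanced` helper is illustrative:

```python
def balanced(expr):
    """Check bracket matching -- the classic stack application in parsing."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in expr:
        if ch in "([{":
            stack.append(ch)                   # push an opener
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                   # closer with no matching opener
    return not stack                           # leftovers mean unclosed openers

balanced("(a[b]{c})")   # True
balanced("(a[b)]")      # False
```

The LIFO discipline is exactly what nesting requires: the most recently opened bracket must be the first one closed.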

Queues

  • First-In-First-Out (FIFO) ordering processes elements in arrival order—like a line at a coffee shop
  • Two core operations: enqueue (add to back) and dequeue (remove from front), both running in O(1) time
  • Powers scheduling systems from CPU task management to print queues; fundamental for breadth-first search algorithms
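A minimal sketch using `collections.deque`, which gives O(1) operations at both ends (a plain list's `pop(0)` would be O(n) because it shifts every remaining element); the names are illustrative:

```python
from collections import deque

line = deque()
line.append("alice")     # enqueue at the back
line.append("bob")
line.append("carol")

served = line.popleft()  # dequeue from the front: "alice", first in, first out
```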

Compare: Stacks vs. Queues—both restrict access to specific ends, but LIFO vs. FIFO determines completely different use cases. Stacks handle nested/recursive processes; queues handle sequential/fair-ordering processes. Know which real-world scenario maps to which structure.


Hierarchical Structures

When data has natural parent-child relationships or requires efficient sorting/searching, tree-based structures shine. The branching factor and balance of these structures determine their performance characteristics.

Trees

  • Parent-child hierarchy with a single root node; each node can have multiple children but only one parent
  • Binary search trees enable O(log n) search, insert, and delete when balanced—the structure encodes sorting information
  • Models real hierarchies like file systems, organizational charts, and decision processes; foundation for many ML algorithms
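A sketch of binary search tree lookup, assuming a small hand-built tree; each comparison discards half of the remaining tree, which is where the O(log n) bound comes from:

```python
class BSTNode:
    """Binary search tree node: left subtree < key < right subtree."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search(node, key):
    """Walk down the tree, halving the search space at each step."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

# A small balanced tree:    8
#                          / \
#                         3   10
root = BSTNode(8, BSTNode(3), BSTNode(10))
search(root, 10)   # True
search(root, 7)    # False
```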

Heaps

  • Specialized tree satisfying the heap property—in a max-heap, every parent is greater than its children; min-heap reverses this
  • O(1) access to extreme values (max or min) with O(log n) insertion and deletion—perfect for "find the biggest/smallest" problems
  • Powers priority queues and heapsort; essential when you need repeated access to extreme values without full sorting
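Python's standard-library `heapq` module maintains a min-heap inside a plain list; a quick sketch with illustrative values:

```python
import heapq

jobs = [5, 1, 9, 3]
heapq.heapify(jobs)           # O(n) to build the heap in place

smallest = jobs[0]            # O(1) peek at the minimum -- here 1
heapq.heappush(jobs, 0)       # O(log n) insert
popped = heapq.heappop(jobs)  # O(log n) removal of the new minimum, 0
```

For a max-heap, a common idiom is to push negated values and negate again on the way out.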

Compare: Trees vs. Heaps—both are hierarchical, but trees optimize for searching any element while heaps optimize for accessing extremes. A binary search tree finds any value in O(log n); a heap finds only the max/min in O(1) but can't efficiently search for arbitrary values.


Key-Value Structures

When you need to retrieve data by a unique identifier rather than position, key-value structures provide near-instant lookups. The magic lies in hash functions that convert keys into array indices.

Hash Tables

  • Hash function maps keys to indices—converts any key into an array position, enabling O(1) average-case search, insert, and delete
  • Collision handling is critical; techniques like chaining (linked lists at each index) or open addressing (probing for empty slots) maintain performance
  • Foundation of database indexing and caching; when lookups dominate your workload, hash tables are usually the answer
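The hash-plus-chaining mechanism can be sketched as a toy class (the `ChainedHashTable` name, bucket count, and sample data are illustrative, not a production design):

```python
class ChainedHashTable:
    """Toy hash table using chaining: each bucket holds a list of pairs."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash function maps any key to one of the bucket indices.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key exists: overwrite
                return
        bucket.append((key, value))        # collision: chain in this bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 30)
table.put("bob", 25)
table.get("alice")   # 30
```

With a good hash function and a sensible load factor, each chain stays short, which is what keeps the average case at O(1).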

Dictionaries

  • Key-value pairs with unique keys—conceptually similar to hash tables, often implemented using them under the hood
  • Average O(1) operations for search, insert, and delete; Python's dict is your go-to for fast lookups
  • Ideal for labeled data like user profiles, configuration settings, and any scenario where you access data by name rather than position
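A quick sketch of the access-by-name interface with an illustrative user profile:

```python
# Access by name, not position: Python's dict handles the hashing for you.
profile = {"name": "Ada", "role": "analyst", "logins": 0}

profile["logins"] += 1                  # O(1) average update
role = profile.get("role")              # O(1) average lookup: "analyst"
missing = profile.get("email", "n/a")   # safe default for an absent key
```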

Compare: Hash Tables vs. Dictionaries—functionally similar (dictionaries are typically implemented as hash tables), but "hash table" emphasizes the mechanism while "dictionary" emphasizes the interface. In Python, you'll use dictionaries; in algorithm discussions, you'll reference hash table properties.


Relational Structures

Graphs and matrices represent relationships and multidimensional data. These structures are essential for network analysis, linear algebra, and most machine learning computations.

Graphs

  • Vertices connected by edges—models relationships between entities; can be directed (one-way) or undirected (two-way)
  • Weighted edges add magnitude to relationships (distance, cost, strength); unweighted graphs treat all connections equally
  • Powers network analysis from social connections to shortest-path algorithms (Dijkstra's, BFS/DFS); essential for recommendation systems
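The shortest-path use case can be sketched with an adjacency-list graph and breadth-first search (the graph below is an illustrative undirected, unweighted example):

```python
from collections import deque

# Undirected, unweighted graph as an adjacency list (dict of sets).
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def shortest_hops(graph, start, goal):
    """BFS finds the fewest-edge path in an unweighted graph."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None   # goal unreachable from start

shortest_hops(graph, "A", "E")   # 3 hops: A -> B -> D -> E (or via C)
```

Note how the queue from the previous section reappears here: BFS depends on FIFO ordering to explore the graph level by level.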

Matrices

  • Two-dimensional array organized in rows and columns; the fundamental structure for numerical computing in data science
  • Supports linear algebra operations including addition, multiplication, transposition, and inversion—backbone of ML algorithms
  • Represents datasets directly—rows as observations, columns as features; also encodes images (pixel grids) and graph adjacency
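The row-times-column mechanics of matrix multiplication can be sketched in pure Python (libraries like NumPy do this far faster, but the logic is the same; `X` and `w` are an illustrative 2x3 dataset and 3x1 weight vector):

```python
def matmul(A, B):
    """Multiply matrices: result[i][j] is the dot product of A's row i and B's column j."""
    rows, inner, cols = len(A), len(B), len(B[0])
    assert len(A[0]) == inner, "inner dimensions must match"
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

# Rows as observations, columns as features.
X = [[1, 2, 3],
     [4, 5, 6]]
w = [[1], [0], [2]]
matmul(X, w)   # [[7], [16]]
```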

Compare: Graphs vs. Matrices—graphs emphasize relationships while matrices emphasize numerical computation, but they're deeply connected. An adjacency matrix represents a graph numerically, enabling linear algebra techniques on network problems. Know both representations.


Quick Reference Table

Concept → Best Examples

  • O(1) Access by Position → Arrays, Matrices
  • O(1) Access by Key → Hash Tables, Dictionaries
  • O(1) Modification → Linked Lists, Stacks, Queues
  • O(log n) Search → Binary Search Trees, Heaps (for extremes only)
  • Hierarchical Relationships → Trees, Heaps
  • Network/Relational Data → Graphs, Matrices (adjacency)
  • LIFO/FIFO Constraints → Stacks, Queues
  • Numerical Computation → Matrices, Arrays

Self-Check Questions

  1. Which two structures offer O(1) average-case lookups, and what mechanism makes this possible?

  2. Compare arrays and linked lists: if you need to frequently insert elements in the middle of a sequence, which would you choose and why?

  3. A task scheduler needs to process jobs in the order they arrive. Which constrained access structure models this, and what would happen if you used the other one instead?

  4. Explain why a heap can find the maximum element faster than a binary search tree, but a binary search tree can find any element faster than a heap.

  5. You're building a social network analysis tool that needs to find the shortest path between users and also perform matrix operations on connection strengths. Which two structures would you use together, and how do they relate to each other?