
📊 Principles of Data Science

Types of Data Structures


Why This Matters

In data science, choosing the right data structure isn't just a programming decision—it's a performance decision that can make or break your analysis. You're being tested on understanding time complexity, memory trade-offs, and access patterns because these concepts determine whether your code runs in seconds or hours on real-world datasets. The structures you'll encounter—from simple arrays to complex graphs—each solve specific problems, and knowing when to reach for each one separates competent data scientists from struggling ones.

Don't just memorize what each structure looks like—know why it exists and what problem it solves. Can you explain why a hash table beats an array for lookups? Why you'd choose a heap over a sorted list for finding maximums? These are the conceptual connections that appear in exam questions and, more importantly, in actual data science work. Master the underlying mechanics, and the implementation details will follow naturally.


Sequential Access Structures

These structures store elements in a specific order where position matters. The key trade-off here is between access speed and modification flexibility—you're essentially choosing between fast reads or fast writes.

Arrays

  • Contiguous memory storage—elements sit next to each other in memory, enabling direct index-based access in O(1) time
  • Fixed size requires knowing capacity upfront; resizing means creating an entirely new array and copying elements
  • Insertion/deletion costs O(n) because shifting elements maintains the contiguous structure—critical for understanding why other structures exist
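A minimal sketch of the trade-off above, using Python's built-in list (which is backed by a dynamic array); the values are illustrative:

```python
arr = [10, 20, 30, 40]

# O(1): index-based access jumps straight to a memory offset.
third = arr[2]      # 30

# O(n): inserting at the front forces every existing element to
# shift right so the storage stays contiguous.
arr.insert(0, 5)    # arr is now [5, 10, 20, 30, 40]
```

Note that Python lists hide the resizing step: when capacity runs out, the interpreter allocates a larger array and copies the elements, exactly as described above.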

Linked Lists

  • Node-based architecture—each element contains data plus a pointer to the next node, eliminating the need for contiguous memory
  • Dynamic sizing allows O(1) insertions and deletions once you've found the location (no shifting required)
  • Sequential traversal required for access, making lookups O(n)—you sacrifice random access for flexibility
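The node-pointer architecture can be sketched in a few lines; the `Node` class and values here are illustrative:

```python
class Node:
    """A singly linked list node: data plus a pointer to the next node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

# Build 1 -> 2 -> 4, then splice a 3 between 2 and 4.
head = Node(1, Node(2, Node(4)))

# O(1) insertion once the position is known: just rewire two pointers.
second = head.next
second.next = Node(3, second.next)

# O(n) access: reaching any element requires walking from the head.
values = []
node = head
while node:
    values.append(node.data)
    node = node.next
# values == [1, 2, 3, 4]
```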

Compare: Arrays vs. Linked Lists—both store ordered sequences, but arrays win on access speed (O(1) vs. O(n)) while linked lists win on modification flexibility (O(1) vs. O(n) for insertions). If an FRQ asks about trade-offs in data structure selection, this is your foundational example.


Constrained Access Structures

Stacks and queues restrict how you interact with data, enforcing specific access patterns. This constraint isn't a limitation—it's a feature that models real-world processes and simplifies certain algorithms.

Stacks

  • Last-In-First-Out (LIFO) ordering means the most recently added element comes out first—think of a stack of plates
  • Two core operations: push (add to top) and pop (remove from top), both running in O(1) time
  • Essential for recursion and function call management; also powers undo mechanisms and expression parsing in compilers
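The expression-parsing use case can be sketched with a Python list standing in for the stack (`append` is push, `pop` is pop); the `balanced` helper is illustrative:

```python
def balanced(expr):
    """Check bracket matching -- the classic stack application in parsing."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in expr:
        if ch in "([{":
            stack.append(ch)                   # push an opener
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                   # closer with no matching opener
    return not stack                           # leftovers mean unclosed openers

balanced("(a[b]{c})")   # True
balanced("(a[b)]")      # False
```

The LIFO discipline is exactly what nesting requires: the most recently opened bracket must be the first one closed.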

Queues

  • First-In-First-Out (FIFO) ordering processes elements in arrival order—like a line at a coffee shop
  • Two core operations: enqueue (add to back) and dequeue (remove from front), both running in O(1) time
  • Powers scheduling systems from CPU task management to print queues; fundamental for breadth-first search algorithms
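A minimal sketch using `collections.deque`, which gives O(1) operations at both ends (a plain list's `pop(0)` would be O(n) because it shifts every remaining element); the names are illustrative:

```python
from collections import deque

line = deque()
line.append("alice")     # enqueue at the back
line.append("bob")
line.append("carol")

served = line.popleft()  # dequeue from the front: "alice", first in, first out
```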

Compare: Stacks vs. Queues—both restrict access to specific ends, but LIFO vs. FIFO determines completely different use cases. Stacks handle nested/recursive processes; queues handle sequential/fair-ordering processes. Know which real-world scenario maps to which structure.


Hierarchical Structures

When data has natural parent-child relationships or requires efficient sorting/searching, tree-based structures shine. The branching factor and balance of these structures determine their performance characteristics.

Trees

  • Parent-child hierarchy with a single root node; each node can have multiple children but only one parent
  • Binary search trees enable O(log n) search, insert, and delete when balanced—the structure encodes sorting information
  • Models real hierarchies like file systems, organizational charts, and decision processes; foundation for many ML algorithms
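A sketch of binary search tree lookup, assuming a small hand-built tree; each comparison discards half of the remaining tree, which is where the O(log n) bound comes from:

```python
class BSTNode:
    """Binary search tree node: left subtree < key < right subtree."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search(node, key):
    """Walk down the tree, halving the search space at each step."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

# A small balanced tree:    8
#                          / \
#                         3   10
root = BSTNode(8, BSTNode(3), BSTNode(10))
search(root, 10)   # True
search(root, 7)    # False
```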

Heaps

  • Specialized tree satisfying the heap property—in a max-heap, every parent is greater than its children; min-heap reverses this
  • O(1) access to extreme values (max or min) with O(log n) insertion and deletion—perfect for "find the biggest/smallest" problems
  • Powers priority queues and heapsort; essential when you need repeated access to extreme values without full sorting
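Python's standard-library `heapq` module maintains a min-heap inside a plain list; a quick sketch with illustrative values:

```python
import heapq

jobs = [5, 1, 9, 3]
heapq.heapify(jobs)           # O(n) to build the heap in place

smallest = jobs[0]            # O(1) peek at the minimum -- here 1
heapq.heappush(jobs, 0)       # O(log n) insert
popped = heapq.heappop(jobs)  # O(log n) removal of the new minimum, 0
```

For a max-heap, a common idiom is to push negated values and negate again on the way out.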

Compare: Trees vs. Heaps—both are hierarchical, but trees optimize for searching any element while heaps optimize for accessing extremes. A binary search tree finds any value in O(log n); a heap finds only the max/min in O(1) but can't efficiently search for arbitrary values.


Key-Value Structures

When you need to retrieve data by a unique identifier rather than position, key-value structures provide near-instant lookups. The magic lies in hash functions that convert keys into array indices.

Hash Tables

  • Hash function maps keys to indices—converts any key into an array position, enabling O(1) average-case search, insert, and delete
  • Collision handling is critical; techniques like chaining (linked lists at each index) or open addressing (probing for empty slots) maintain performance
  • Foundation of database indexing and caching; when lookups dominate your workload, hash tables are usually the answer
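The hash-plus-chaining mechanism can be sketched as a toy class (the `ChainedHashTable` name, bucket count, and sample data are illustrative, not a production design):

```python
class ChainedHashTable:
    """Toy hash table using chaining: each bucket holds a list of pairs."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        # The hash function maps any key to one of the bucket indices.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key exists: overwrite
                return
        bucket.append((key, value))        # collision: chain in this bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 30)
table.put("bob", 25)
table.get("alice")   # 30
```

With a good hash function and a sensible load factor, each chain stays short, which is what keeps the average case at O(1).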

Dictionaries

  • Key-value pairs with unique keys—conceptually similar to hash tables, often implemented using them under the hood
  • Average O(1) operations for search, insert, and delete; Python's dict is your go-to for fast lookups
  • Ideal for labeled data like user profiles, configuration settings, and any scenario where you access data by name rather than position
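A quick sketch of the access-by-name interface with an illustrative user profile:

```python
# Access by name, not position: Python's dict handles the hashing for you.
profile = {"name": "Ada", "role": "analyst", "logins": 0}

profile["logins"] += 1                  # O(1) average update
role = profile.get("role")              # O(1) average lookup: "analyst"
missing = profile.get("email", "n/a")   # safe default for an absent key
```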

Compare: Hash Tables vs. Dictionaries—functionally similar (dictionaries are typically implemented as hash tables), but "hash table" emphasizes the mechanism while "dictionary" emphasizes the interface. In Python, you'll use dictionaries; in algorithm discussions, you'll reference hash table properties.


Relational Structures

Graphs and matrices represent relationships and multidimensional data. These structures are essential for network analysis, linear algebra, and most machine learning computations.

Graphs

  • Vertices connected by edges—models relationships between entities; can be directed (one-way) or undirected (two-way)
  • Weighted edges add magnitude to relationships (distance, cost, strength); unweighted graphs treat all connections equally
  • Powers network analysis from social connections to shortest-path algorithms (Dijkstra's, BFS/DFS); essential for recommendation systems
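The shortest-path use case can be sketched with an adjacency-list graph and breadth-first search (the graph below is an illustrative undirected, unweighted example):

```python
from collections import deque

# Undirected, unweighted graph as an adjacency list (dict of sets).
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def shortest_hops(graph, start, goal):
    """BFS finds the fewest-edge path in an unweighted graph."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None   # goal unreachable from start

shortest_hops(graph, "A", "E")   # 3 hops: A -> B -> D -> E (or via C)
```

Note how the queue from the previous section reappears here: BFS depends on FIFO ordering to explore the graph level by level.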

Matrices

  • Two-dimensional array organized in rows and columns; the fundamental structure for numerical computing in data science
  • Supports linear algebra operations including addition, multiplication, transposition, and inversion—backbone of ML algorithms
  • Represents datasets directly—rows as observations, columns as features; also encodes images (pixel grids) and graph adjacency
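The row-times-column mechanics of matrix multiplication can be sketched in pure Python (libraries like NumPy do this far faster, but the logic is the same; `X` and `w` are an illustrative 2x3 dataset and 3x1 weight vector):

```python
def matmul(A, B):
    """Multiply matrices: result[i][j] is the dot product of A's row i and B's column j."""
    rows, inner, cols = len(A), len(B), len(B[0])
    assert len(A[0]) == inner, "inner dimensions must match"
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

# Rows as observations, columns as features.
X = [[1, 2, 3],
     [4, 5, 6]]
w = [[1], [0], [2]]
matmul(X, w)   # [[7], [16]]
```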

Compare: Graphs vs. Matrices—graphs emphasize relationships while matrices emphasize numerical computation, but they're deeply connected. An adjacency matrix represents a graph numerically, enabling linear algebra techniques on network problems. Know both representations.


Quick Reference Table

Concept → Best Examples

  • O(1) Access by Position → Arrays, Matrices
  • O(1) Access by Key → Hash Tables, Dictionaries
  • O(1) Modification → Linked Lists, Stacks, Queues
  • O(log n) Search → Binary Search Trees, Heaps (for extremes only)
  • Hierarchical Relationships → Trees, Heaps
  • Network/Relational Data → Graphs, Matrices (adjacency)
  • LIFO/FIFO Constraints → Stacks, Queues
  • Numerical Computation → Matrices, Arrays

Self-Check Questions

  1. Which two structures offer O(1) average-case lookups, and what mechanism makes this possible?

  2. Compare arrays and linked lists: if you need to frequently insert elements in the middle of a sequence, which would you choose and why?

  3. A task scheduler needs to process jobs in the order they arrive. Which constrained access structure models this, and what would happen if you used the other one instead?

  4. Explain why a heap can find the maximum element faster than a binary search tree, but a binary search tree can find any element faster than a heap.

  5. You're building a social network analysis tool that needs to find the shortest path between users and also perform matrix operations on connection strengths. Which two structures would you use together, and how do they relate to each other?