📊 Principles of Data Science

Fundamental Python Libraries for Data Science

Why This Matters

In data science, you're not writing code from scratch—you're orchestrating a powerful ecosystem of specialized tools. The Python libraries you'll encounter on exams and in practice each solve specific computational problems: numerical operations, data manipulation, visualization, statistical modeling, and machine learning. Understanding which library to reach for (and why) separates competent data scientists from those who just copy code snippets.

You're being tested on more than syntax. Exam questions will probe whether you understand why NumPy arrays outperform Python lists, when to use Pandas versus raw NumPy, or what distinguishes TensorFlow from Scikit-learn. Don't just memorize library names—know what computational problem each one solves and how they connect in a typical data science workflow.


Data Structures and Numerical Computing

These libraries form the foundation of nearly every data science workflow. They provide optimized data structures that make Python competitive with lower-level languages for numerical work.

NumPy

  • N-dimensional arrays (ndarrays)—the core data structure enabling vectorized operations that are orders of magnitude faster than Python lists
  • Broadcasting allows operations on arrays of different shapes without explicit data replication, reducing memory overhead and code complexity (see the sketch after this list)
  • Mathematical functions operate element-wise on entire arrays, forming the computational backbone for nearly every other data science library
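
As a rough sketch of the vectorization and broadcasting ideas above (the array values are arbitrary, chosen only for illustration):

```python
import numpy as np

# Element-wise (vectorized) arithmetic on a whole array: no explicit Python loop
prices = np.array([10.0, 12.5, 9.75, 14.0])
discounted = prices * 0.9          # every element scaled in one operation

# Broadcasting: a (3, 1) column and a (1, 4) row combine into a (3, 4) grid
rows = np.arange(3).reshape(3, 1)
cols = np.arange(4).reshape(1, 4)
grid = rows + cols                 # shapes are stretched virtually, no data copied

print(discounted)
print(grid.shape)                  # (3, 4)
```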

Pandas

  • DataFrame and Series structures provide labeled, tabular data with intuitive indexing—think of it as Excel for programmers
  • Data wrangling tools for cleaning, merging, grouping, and reshaping data make it essential for the often-cited 80% of data science work that is data preparation (see the sketch after this list)
  • Time series functionality with built-in date parsing, resampling, and rolling windows makes temporal analysis straightforward
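
A small, self-contained sketch of typical DataFrame wrangling; the column names and values are made up purely for illustration:

```python
import pandas as pd

# A small labeled table: heterogeneous columns with row/column labels
df = pd.DataFrame({
    "city":  ["Austin", "Boston", "Austin", "Boston"],
    "sales": [250, 300, None, 410],
    "date":  pd.to_datetime(["2024-01-05", "2024-01-05", "2024-01-12", "2024-01-12"]),
})

df["sales"] = df["sales"].fillna(df["sales"].mean())          # basic cleaning
by_city = df.groupby("city")["sales"].sum()                   # split-apply-combine
weekly = df.set_index("date").resample("W")["sales"].sum()    # time series resampling

print(by_city)
print(weekly)
```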

SciPy

  • Scientific computing extensions build on NumPy with specialized modules for optimization, integration, interpolation, and signal processing
  • Linear algebra and statistics modules provide functions beyond NumPy's basics, including sparse matrix operations and statistical distributions
  • Numerical solvers handle complex mathematical computations like differential equations and root-finding that pure NumPy can't address (see the sketch after this list)
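
A minimal sketch of two capabilities noted above, numerical integration and root-finding; the function cos(x) - x is an arbitrary example:

```python
import numpy as np
from scipy import integrate, optimize

f = lambda x: np.cos(x) - x        # arbitrary test function

# Numerical integration of cos(x) - x over [0, 1]
area, err = integrate.quad(f, 0, 1)

# Root-finding: locate x where cos(x) = x, bracketed between 0 and 1
root = optimize.brentq(f, 0, 1)

print(area, root)                  # root is approximately 0.739
```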

Compare: NumPy vs. Pandas—both handle array-like data, but NumPy optimizes for homogeneous numerical arrays while Pandas excels at heterogeneous, labeled tabular data. If an exam asks about performance-critical numerical computation, reach for NumPy; for data cleaning and exploration, Pandas is your answer.


Data Visualization

Visualization libraries transform numerical results into interpretable graphics. The choice between them often comes down to customization needs versus speed of development.

Matplotlib

  • Low-level control over every plot element makes it the foundation for publication-quality figures and custom visualizations
  • Figure and Axes objects provide the grammar for building plots programmatically—understanding this hierarchy is essential for debugging (see the sketch after this list)
  • Integration with NumPy and Pandas allows direct plotting from arrays and DataFrames with minimal conversion
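
A short sketch of the Figure/Axes workflow, plotting directly from NumPy arrays; the curves and output filename are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# plt.subplots() returns a Figure and one (or more) Axes objects
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Functions plotted directly from NumPy arrays")
ax.legend()
fig.savefig("trig.png")            # or plt.show() in an interactive session
```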

Seaborn

  • Statistical visualizations like heatmaps, violin plots, and pair plots require just one function call instead of dozens of Matplotlib commands
  • Built-in themes and color palettes produce attractive defaults that follow data visualization best practices
  • DataFrame-native interface accepts column names directly, eliminating the need to extract arrays manually (see the sketch after this list)
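
A brief sketch of the DataFrame-native interface, using a toy DataFrame so the example stays self-contained; the column names and values are invented:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy tabular data; in practice this would come from a real dataset
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [3.1, 2.8, 3.6, 4.2, 4.8, 4.0],
})

# Columns are referenced by name; Seaborn handles the statistical summary
sns.boxplot(data=df, x="group", y="score")
plt.show()
```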

Compare: Matplotlib vs. Seaborn—Matplotlib offers granular control for custom figures, while Seaborn provides high-level statistical plots with sensible defaults. Use Seaborn for exploratory analysis and Matplotlib when you need pixel-perfect customization.


Statistical Modeling and Inference

When you need formal statistical tests, confidence intervals, or interpretable model coefficients, these libraries provide the rigor that machine learning tools often skip.

Statsmodels

  • Statistical model estimation covers OLS regression, logistic regression, and time series models (ARIMA) with full diagnostic output
  • Hypothesis testing tools provide p-values, confidence intervals, and test statistics essential for validating data-driven conclusions
  • Formula interface using R-style syntax (e.g., y ~ x1 + x2) makes model specification intuitive and readable (see the sketch after this list)
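
A minimal sketch of the formula interface and the diagnostic output it produces; the data are randomly generated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=100)

# R-style formula: response on the left, predictors on the right
model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())   # coefficients, p-values, confidence intervals, R-squared
```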

Compare: Statsmodels vs. Scikit-learn—both can fit linear regression, but Statsmodels emphasizes statistical inference (p-values, R-squared, residual diagnostics) while Scikit-learn focuses on prediction accuracy. Choose based on whether you're explaining relationships or making predictions.


Machine Learning

These libraries span from classical algorithms to deep neural networks. The key distinction is between traditional ML (Scikit-learn) and deep learning frameworks (TensorFlow, PyTorch, Keras).

Scikit-learn

  • Consistent API with fit(), predict(), and transform() methods across all algorithms makes switching between models trivial
  • Comprehensive algorithm coverage includes classification, regression, clustering, dimensionality reduction, and preprocessing—all in one package
  • Pipeline and cross-validation tools enable reproducible workflows and proper model evaluation without data leakage (see the sketch after this list)
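
A compact sketch of the consistent fit/predict API used inside a Pipeline with cross-validation; the synthetic classification dataset exists only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Preprocessing and model chained so scaling is re-fit inside each CV fold (no leakage)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```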

TensorFlow

  • Computation graphs represent mathematical operations as nodes, enabling automatic differentiation and optimization across CPUs, GPUs, and TPUs (see the sketch after this list)
  • Deep learning at scale supports neural networks from simple feedforward architectures to complex models for image recognition, NLP, and reinforcement learning
  • Production deployment tools allow models to run on mobile devices, web browsers, and distributed server clusters
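
A tiny sketch of the automatic differentiation that TensorFlow's graph machinery enables; the function being differentiated is arbitrary:

```python
import tensorflow as tf

x = tf.Variable(3.0)

# GradientTape records operations so gradients can be computed automatically
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x           # y = x^2 + 2x

dy_dx = tape.gradient(y, x)        # analytically: 2x + 2 = 8 at x = 3
print(dy_dx.numpy())
```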

PyTorch

  • Dynamic computation graphs build the network on the fly during execution, making debugging intuitive—you can use standard Python debugging tools (see the sketch after this list)
  • Research-friendly design with Pythonic syntax has made it the dominant framework in academic deep learning
  • GPU acceleration through CUDA integration enables efficient training of large neural networks with minimal code changes
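
A minimal sketch of define-by-run autograd, where the graph is built as ordinary Python code executes; the values are illustrative:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# The computation graph is created on the fly as these lines run,
# so standard Python tools (print, pdb) can inspect intermediate values
y = x ** 2 + 2.0 * x
y.backward()                       # backpropagate through the recorded graph

print(x.grad)                      # tensor(8.), i.e. 2x + 2 at x = 3
```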

Keras

  • High-level API abstracts away boilerplate code, letting you define complex architectures in just a few lines
  • Sequential and Functional APIs support everything from simple stacked layers to multi-input, multi-output models with shared components (see the sketch after this list)
  • Backend flexibility lets Keras run on top of TensorFlow (its default backend; recent Keras releases also support JAX and PyTorch), pairing Keras simplicity with TensorFlow's production capabilities
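
A short sketch of the Sequential API, assuming the TensorFlow backend; the layer sizes and synthetic data are arbitrary:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A small feedforward classifier defined in a few lines
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic data purely for illustration
X = np.random.rand(100, 10).astype("float32")
y = (X.sum(axis=1) > 5).astype("float32")
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```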

Compare: Scikit-learn vs. TensorFlow/PyTorch—Scikit-learn handles classical ML algorithms with minimal setup, while TensorFlow and PyTorch are necessary for deep learning. If your data is tabular and your model is a random forest or SVM, Scikit-learn is faster to implement; for neural networks with custom architectures, use a deep learning framework.

Compare: TensorFlow vs. PyTorch—TensorFlow emphasizes production deployment and scalability, while PyTorch prioritizes research flexibility with dynamic graphs. If an FRQ asks about prototyping or debugging neural networks, PyTorch's define-by-run (eager) execution is the commonly cited advantage, though TensorFlow 2 also runs eagerly by default.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Numerical array operations | NumPy, SciPy |
| Tabular data manipulation | Pandas |
| Basic visualization | Matplotlib |
| Statistical visualization | Seaborn |
| Statistical inference & hypothesis testing | Statsmodels |
| Classical machine learning | Scikit-learn |
| Deep learning (production) | TensorFlow, Keras |
| Deep learning (research) | PyTorch |
| Scientific computing (optimization, integration) | SciPy |
| Time series analysis | Pandas, Statsmodels |

Self-Check Questions

  1. Which two libraries both handle array-like data but differ in their optimization focus? Explain when you'd choose one over the other.

  2. If you need to report p-values and confidence intervals for a regression model, which library should you use instead of Scikit-learn, and why?

  3. Compare and contrast TensorFlow and PyTorch: What type of computation graph does each use, and how does this affect the debugging experience?

  4. A data science workflow involves loading CSV data, cleaning missing values, training a random forest classifier, and plotting feature importances. Which libraries would you use for each step?

  5. Why might a researcher choose PyTorch for developing a novel neural network architecture, while a company deploying models to mobile devices might prefer TensorFlow?