📊 Principles of Data Science

Fundamental Python Libraries for Data Science

Why This Matters

In data science, you're not writing code from scratch—you're orchestrating a powerful ecosystem of specialized tools. The Python libraries you'll encounter on exams and in practice each solve specific computational problems: numerical operations, data manipulation, visualization, statistical modeling, and machine learning. Understanding which library to reach for (and why) separates competent data scientists from those who just copy code snippets.

You're being tested on more than syntax. Exam questions will probe whether you understand why NumPy arrays outperform Python lists, when to use Pandas versus raw NumPy, or what distinguishes TensorFlow from Scikit-learn. Don't just memorize library names—know what computational problem each one solves and how they connect in a typical data science workflow.


Data Structures and Numerical Computing

These libraries form the foundation of nearly every data science workflow. They provide optimized data structures that make Python competitive with lower-level languages for numerical work.

NumPy

  • N-dimensional arrays (ndarrays)—the core data structure enabling vectorized operations that are orders of magnitude faster than Python lists
  • Broadcasting allows operations on arrays of different shapes without explicit data replication, reducing memory overhead and code complexity (see the sketch after this list)
  • Mathematical functions operate element-wise on entire arrays, forming the computational backbone for nearly every other data science library
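
As a rough sketch of the vectorization and broadcasting ideas above (the array values are arbitrary, chosen only for illustration):

```python
import numpy as np

# Element-wise (vectorized) arithmetic on a whole array: no explicit Python loop
prices = np.array([10.0, 12.5, 9.75, 14.0])
discounted = prices * 0.9          # every element scaled in one operation

# Broadcasting: a (3, 1) column and a (1, 4) row combine into a (3, 4) grid
rows = np.arange(3).reshape(3, 1)
cols = np.arange(4).reshape(1, 4)
grid = rows + cols                 # shapes are stretched virtually, no data copied

print(discounted)
print(grid.shape)                  # (3, 4)
```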

Pandas

  • DataFrame and Series structures provide labeled, tabular data with intuitive indexing—think of it as Excel for programmers
  • Data wrangling tools for cleaning, merging, grouping, and reshaping data make it essential for the often-cited 80% of data science work that is data preparation (see the sketch after this list)
  • Time series functionality with built-in date parsing, resampling, and rolling windows makes temporal analysis straightforward
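
A small, self-contained sketch of typical DataFrame wrangling; the column names and values are made up purely for illustration:

```python
import pandas as pd

# A small labeled table: heterogeneous columns with row/column labels
df = pd.DataFrame({
    "city":  ["Austin", "Boston", "Austin", "Boston"],
    "sales": [250, 300, None, 410],
    "date":  pd.to_datetime(["2024-01-05", "2024-01-05", "2024-01-12", "2024-01-12"]),
})

df["sales"] = df["sales"].fillna(df["sales"].mean())          # basic cleaning
by_city = df.groupby("city")["sales"].sum()                   # split-apply-combine
weekly = df.set_index("date").resample("W")["sales"].sum()    # time series resampling

print(by_city)
print(weekly)
```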

SciPy

  • Scientific computing extensions build on NumPy with specialized modules for optimization, integration, interpolation, and signal processing
  • Linear algebra and statistics modules provide functions beyond NumPy's basics, including sparse matrix operations and statistical distributions
  • Numerical solvers handle complex mathematical computations like differential equations and root-finding that pure NumPy can't address (see the sketch after this list)
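
A minimal sketch of two capabilities noted above, numerical integration and root-finding; the function cos(x) - x is an arbitrary example:

```python
import numpy as np
from scipy import integrate, optimize

f = lambda x: np.cos(x) - x        # arbitrary test function

# Numerical integration of cos(x) - x over [0, 1]
area, err = integrate.quad(f, 0, 1)

# Root-finding: locate x where cos(x) = x, bracketed between 0 and 1
root = optimize.brentq(f, 0, 1)

print(area, root)                  # root is approximately 0.739
```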

Compare: NumPy vs. Pandas—both handle array-like data, but NumPy optimizes for homogeneous numerical arrays while Pandas excels at heterogeneous, labeled tabular data. If an exam asks about performance-critical numerical computation, reach for NumPy; for data cleaning and exploration, Pandas is your answer.


Data Visualization

Visualization libraries transform numerical results into interpretable graphics. The choice between them often comes down to customization needs versus speed of development.

Matplotlib

  • Low-level control over every plot element makes it the foundation for publication-quality figures and custom visualizations
  • Figure and Axes objects provide the grammar for building plots programmatically—understanding this hierarchy is essential for debugging (see the sketch after this list)
  • Integration with NumPy and Pandas allows direct plotting from arrays and DataFrames with minimal conversion
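
A short sketch of the Figure/Axes workflow, plotting directly from NumPy arrays; the curves and output filename are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# plt.subplots() returns a Figure and one (or more) Axes objects
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Functions plotted directly from NumPy arrays")
ax.legend()
fig.savefig("trig.png")            # or plt.show() in an interactive session
```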

Seaborn

  • Statistical visualizations like heatmaps, violin plots, and pair plots require just one function call instead of dozens of Matplotlib commands
  • Built-in themes and color palettes produce attractive defaults that follow data visualization best practices
  • DataFrame-native interface accepts column names directly, eliminating the need to extract arrays manually (see the sketch after this list)
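
A brief sketch of the DataFrame-native interface, using a toy DataFrame so the example stays self-contained; the column names and values are invented:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy tabular data; in practice this would come from a real dataset
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [3.1, 2.8, 3.6, 4.2, 4.8, 4.0],
})

# Columns are referenced by name; Seaborn handles the statistical summary
sns.boxplot(data=df, x="group", y="score")
plt.show()
```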

Compare: Matplotlib vs. Seaborn—Matplotlib offers granular control for custom figures, while Seaborn provides high-level statistical plots with sensible defaults. Use Seaborn for exploratory analysis and Matplotlib when you need pixel-perfect customization.


Statistical Modeling and Inference

When you need formal statistical tests, confidence intervals, or interpretable model coefficients, these libraries provide the rigor that machine learning tools often skip.

Statsmodels

  • Statistical model estimation covers OLS regression, logistic regression, and time series models (ARIMA) with full diagnostic output
  • Hypothesis testing tools provide p-values, confidence intervals, and test statistics essential for validating data-driven conclusions
  • Formula interface using R-style syntax (e.g., y ~ x1 + x2) makes model specification intuitive and readable (see the sketch after this list)
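
A minimal sketch of the formula interface and the diagnostic output it produces; the data are randomly generated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=100)

# R-style formula: response on the left, predictors on the right
model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())   # coefficients, p-values, confidence intervals, R-squared
```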

Compare: Statsmodels vs. Scikit-learn—both can fit linear regression, but Statsmodels emphasizes statistical inference (p-values, R-squared, residual diagnostics) while Scikit-learn focuses on prediction accuracy. Choose based on whether you're explaining relationships or making predictions.


Machine Learning

These libraries span from classical algorithms to deep neural networks. The key distinction is between traditional ML (Scikit-learn) and deep learning frameworks (TensorFlow, PyTorch, Keras).

Scikit-learn

  • Consistent API with fit(), predict(), and transform() methods across all algorithms makes switching between models trivial
  • Comprehensive algorithm coverage includes classification, regression, clustering, dimensionality reduction, and preprocessing—all in one package
  • Pipeline and cross-validation tools enable reproducible workflows and proper model evaluation without data leakage (see the sketch after this list)
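
A compact sketch of the consistent fit/predict API used inside a Pipeline with cross-validation; the synthetic classification dataset exists only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Preprocessing and model chained so scaling is re-fit inside each CV fold (no leakage)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```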

TensorFlow

  • Computation graphs represent mathematical operations as nodes, enabling automatic differentiation and optimization across CPUs, GPUs, and TPUs (see the sketch after this list)
  • Deep learning at scale supports neural networks from simple feedforward architectures to complex models for image recognition, NLP, and reinforcement learning
  • Production deployment tools allow models to run on mobile devices, web browsers, and distributed server clusters
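
A tiny sketch of the automatic differentiation that TensorFlow's graph machinery enables; the function being differentiated is arbitrary:

```python
import tensorflow as tf

x = tf.Variable(3.0)

# GradientTape records operations so gradients can be computed automatically
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x           # y = x^2 + 2x

dy_dx = tape.gradient(y, x)        # analytically: 2x + 2 = 8 at x = 3
print(dy_dx.numpy())
```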

PyTorch

  • Dynamic computation graphs build the network on the fly during execution, making debugging intuitive—you can use standard Python debugging tools (see the sketch after this list)
  • Research-friendly design with Pythonic syntax has made it the dominant framework in academic deep learning
  • GPU acceleration through CUDA integration enables efficient training of large neural networks with minimal code changes
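
A minimal sketch of define-by-run autograd, where the graph is built as ordinary Python code executes; the values are illustrative:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# The computation graph is created on the fly as these lines run,
# so standard Python tools (print, pdb) can inspect intermediate values
y = x ** 2 + 2.0 * x
y.backward()                       # backpropagate through the recorded graph

print(x.grad)                      # tensor(8.), i.e. 2x + 2 at x = 3
```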

Keras

  • High-level API abstracts away boilerplate code, letting you define complex architectures in just a few lines
  • Sequential and Functional APIs support everything from simple stacked layers to multi-input, multi-output models with shared components (see the sketch after this list)
  • Backend flexibility lets Keras run on top of TensorFlow (its default backend; recent Keras releases also support JAX and PyTorch), pairing Keras simplicity with TensorFlow's production capabilities
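
A short sketch of the Sequential API, assuming the TensorFlow backend; the layer sizes and synthetic data are arbitrary:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A small feedforward classifier defined in a few lines
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic data purely for illustration
X = np.random.rand(100, 10).astype("float32")
y = (X.sum(axis=1) > 5).astype("float32")
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```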

Compare: Scikit-learn vs. TensorFlow/PyTorch—Scikit-learn handles classical ML algorithms with minimal setup, while TensorFlow and PyTorch are necessary for deep learning. If your data is tabular and your model is a random forest or SVM, Scikit-learn is faster to implement; for neural networks with custom architectures, use a deep learning framework.

Compare: TensorFlow vs. PyTorch—TensorFlow emphasizes production deployment and scalability, while PyTorch prioritizes research flexibility with dynamic graphs. If an FRQ asks about prototyping or debugging neural networks, PyTorch's define-by-run (eager) execution is the commonly cited advantage, though TensorFlow 2 also runs eagerly by default.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Numerical array operations | NumPy, SciPy |
| Tabular data manipulation | Pandas |
| Basic visualization | Matplotlib |
| Statistical visualization | Seaborn |
| Statistical inference & hypothesis testing | Statsmodels |
| Classical machine learning | Scikit-learn |
| Deep learning (production) | TensorFlow, Keras |
| Deep learning (research) | PyTorch |
| Scientific computing (optimization, integration) | SciPy |
| Time series analysis | Pandas, Statsmodels |

Self-Check Questions

  1. Which two libraries both handle array-like data but differ in their optimization focus? Explain when you'd choose one over the other.

  2. If you need to report p-values and confidence intervals for a regression model, which library should you use instead of Scikit-learn, and why?

  3. Compare and contrast TensorFlow and PyTorch: What type of computation graph does each use, and how does this affect the debugging experience?

  4. A data science workflow involves loading CSV data, cleaning missing values, training a random forest classifier, and plotting feature importances. Which libraries would you use for each step?

  5. Why might a researcher choose PyTorch for developing a novel neural network architecture, while a company deploying models to mobile devices might prefer TensorFlow?