In data science, you're not writing code from scratch—you're orchestrating a powerful ecosystem of specialized tools. The Python libraries you'll encounter on exams and in practice each solve specific computational problems: numerical operations, data manipulation, visualization, statistical modeling, and machine learning. Understanding which library to reach for (and why) separates competent data scientists from those who just copy code snippets.
You're being tested on more than syntax. Exam questions will probe whether you understand why NumPy arrays outperform Python lists, when to use Pandas versus raw NumPy, or what distinguishes TensorFlow from Scikit-learn. Don't just memorize library names—know what computational problem each one solves and how they connect in a typical data science workflow.
These libraries form the foundation of nearly every data science workflow. They provide optimized data structures that make Python competitive with lower-level languages for numerical work.
Compare: NumPy vs. Pandas—both handle array-like data, but NumPy optimizes for homogeneous numerical arrays while Pandas excels at heterogeneous, labeled tabular data. If an exam asks about performance-critical numerical computation, reach for NumPy; for data cleaning and exploration, Pandas is your answer.
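A minimal sketch of that distinction, using small made-up temperature data: NumPy vectorizes math over a homogeneous numerical array, while Pandas applies the same operation to a labeled column inside a heterogeneous table.

```python
import numpy as np
import pandas as pd

# NumPy: homogeneous numerical array, vectorized elementwise math (no Python loop)
temps_f = np.array([68.0, 72.5, 75.2, 70.1])
temps_c = (temps_f - 32) * 5 / 9

# Pandas: heterogeneous, labeled tabular data (strings and floats side by side)
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago", "Denver"],
    "temp_f": temps_f,
})
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9  # same math, but on a labeled column

# Labels let you filter by meaning, not just position
warm_cities = df.loc[df["temp_c"] > 21, "city"].tolist()
print(warm_cities)  # → ['Boston', 'Chicago', 'Denver']
```

Note that the conversion formula is identical in both cases; the difference is that Pandas carries the row and column labels through the computation.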
Visualization libraries transform numerical results into interpretable graphics. The choice between them often comes down to customization needs versus speed of development.
Compare: Matplotlib vs. Seaborn—Matplotlib offers granular control for custom figures, while Seaborn provides high-level statistical plots with sensible defaults. Use Seaborn for exploratory analysis and Matplotlib when you need pixel-perfect customization.
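As a sketch of the Matplotlib side of that trade-off (assuming Matplotlib is installed; the data points here are made up), notice how each figure element is set explicitly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt

# Matplotlib: granular, explicit control over every figure element
fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1], color="tab:blue")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Manual scatter plot")
fig.tight_layout()
fig.savefig("scatter.png")

# The Seaborn equivalent (if seaborn is installed) is a single high-level call:
#   import seaborn as sns
#   sns.scatterplot(x=[1, 2, 3, 4], y=[2.1, 3.9, 6.2, 8.1])
```

Seaborn's one-liner is faster for exploration, but the Matplotlib version is where you go when the exact placement, color, or labeling of each element matters.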
When you need formal statistical tests, confidence intervals, or interpretable model coefficients, these libraries provide the rigor that machine learning tools often skip.
Statsmodels' R-style formula interface (e.g., y ~ x1 + x2) makes model specification intuitive and readable.
Compare: Statsmodels vs. Scikit-learn—both can fit linear regression, but Statsmodels emphasizes statistical inference (p-values, R-squared, residual diagnostics) while Scikit-learn focuses on prediction accuracy. Choose based on whether you're explaining relationships or making predictions.
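A sketch of that inference-first workflow, assuming statsmodels is installed and using synthetic data with a known relationship (y = 2·x1 − x2 plus noise):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship: y = 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=200)

# R-style formula; the intercept is included implicitly
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Statsmodels surfaces inference, not just predictions
print(model.params)    # coefficient estimates (should recover ~2 and ~-1)
print(model.pvalues)   # per-coefficient p-values
print(model.rsquared)  # goodness of fit
```

Scikit-learn's `LinearRegression` would give you the same coefficients, but p-values, confidence intervals, and a full summary table are exactly what it does not report.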
These libraries span from classical algorithms to deep neural networks. The key distinction is between traditional ML (Scikit-learn) and deep learning frameworks (TensorFlow, PyTorch, Keras).
Scikit-learn's consistent API—fit(), predict(), and transform() methods shared across all algorithms—makes switching between models trivial.
Compare: Scikit-learn vs. TensorFlow/PyTorch—Scikit-learn handles classical ML algorithms with minimal setup, while TensorFlow and PyTorch are necessary for deep learning. If your data is tabular and your model is a random forest or SVM, Scikit-learn is faster to implement; for neural networks with custom architectures, use a deep learning framework.
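A minimal sketch of that uniform interface, assuming scikit-learn is installed: two very different models are trained and scored through the exact same calls, on a synthetic classification dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic tabular data for a binary classification task
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Swapping models requires changing only the constructor line
scores = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)              # same call for both
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```

This is the design choice the comparison above hinges on: because every estimator exposes the same fit/predict contract, model selection becomes a loop rather than a rewrite.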
Compare: TensorFlow vs. PyTorch—TensorFlow emphasizes production deployment and scalability, while PyTorch prioritizes research flexibility with dynamic graphs. If an FRQ asks about prototyping or debugging neural networks, PyTorch's eager execution is the key advantage.
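A tiny sketch of PyTorch's eager (define-by-run) execution, assuming PyTorch is installed: each operation runs immediately, so intermediate values and gradients can be inspected with ordinary Python—the debugging advantage the comparison above refers to.

```python
import torch

# Eager execution: operations compute immediately, no separate
# graph-compilation or session step before you can see values
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = 1 + 4 + 9 = 14, available right now
print(y.item())     # → 14.0

# Autograd builds the graph dynamically as the code ran: dy/dx = 2x
y.backward()
print(x.grad)       # → tensor([2., 4., 6.])
```

You could set an ordinary Python breakpoint between these lines and inspect `x` or `y` directly, which is harder in a statically compiled graph.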
| Concept | Best Examples |
|---|---|
| Numerical array operations | NumPy, SciPy |
| Tabular data manipulation | Pandas |
| Basic visualization | Matplotlib |
| Statistical visualization | Seaborn |
| Statistical inference & hypothesis testing | Statsmodels |
| Classical machine learning | Scikit-learn |
| Deep learning (production) | TensorFlow, Keras |
| Deep learning (research) | PyTorch |
| Scientific computing (optimization, integration) | SciPy |
| Time series analysis | Pandas, Statsmodels |
Which two libraries both handle array-like data but differ in their optimization focus? Explain when you'd choose one over the other.
If you need to report p-values and confidence intervals for a regression model, which library should you use instead of Scikit-learn, and why?
Compare and contrast TensorFlow and PyTorch: What type of computation graph does each use, and how does this affect the debugging experience?
A data science workflow involves loading CSV data, cleaning missing values, training a random forest classifier, and plotting feature importances. Which libraries would you use for each step?
Why might a researcher choose PyTorch for developing a novel neural network architecture, while a company deploying models to mobile devices might prefer TensorFlow?