Data Science Interview Help: ML, Statistics, and Python Questions
Data science interview questions covering machine learning, statistics, Python, and business problem-solving — with AI-powered preparation strategies.
What Data Science Interviews Cover in 2026
Data science interviews test four pillars: statistics fundamentals, machine learning theory and practice, programming (Python/SQL), and business problem-solving. The balance varies by company — research-heavy companies go deep on ML theory, product companies emphasize business impact, and startups want full-stack capability.
Statistics and Probability
1. Explain the central limit theorem and why it matters.
The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population's distribution. This is why confidence intervals and hypothesis tests built on sample means remain valid for large samples even when the underlying data is not normal.
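A quick simulation makes this concrete. This sketch (NumPy, with an arbitrary skewed exponential population chosen for illustration) shows sample means clustering normally even though the population is far from normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population: exponential distribution (mean = 1)
population = rng.exponential(scale=1.0, size=100_000)

# Draw many samples of size 50 and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(5_000)
])

# The population is skewed, but the sample means cluster symmetrically
print(f"population mean:      {population.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")  # ~ sigma / sqrt(50)
```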
2. What is a p-value and what does it NOT tell you?
A p-value is the probability of observing results at least as extreme as the data, assuming the null hypothesis is true. It does NOT tell you the probability that the null hypothesis is true, the practical significance of the effect, or whether your study is well-designed.
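For a concrete illustration, here is a one-sample t-test in SciPy on made-up data; the comment spells out exactly what the p-value does and does not mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=40)  # synthetic sample

# H0: the population mean is 0. The p-value is the probability of a
# t-statistic at least this extreme *if H0 were true* -- nothing more.
t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```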
3. Type I vs Type II errors — give a practical example.
Type I (false positive): concluding a treatment works when it does not. Type II (false negative): concluding it does not work when it does. In A/B testing: Type I = shipping a feature that has no real effect. Type II = killing a feature that actually helps.
4. When do you use parametric vs non-parametric tests?
Parametric (t-test, ANOVA): assumes approximately normal data; more powerful when the assumptions hold. Non-parametric (Mann-Whitney, Kruskal-Wallis): no distribution assumptions; better for small samples or skewed data.
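A minimal side-by-side sketch with SciPy, using deliberately skewed (lognormal) synthetic samples where the non-parametric test is the safer choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.lognormal(mean=0.0, sigma=1.0, size=25)   # skewed sample
b = rng.lognormal(mean=0.4, sigma=1.0, size=25)

# Parametric: Welch's t-test (does not assume equal variances)
print(stats.ttest_ind(a, b, equal_var=False))

# Non-parametric: Mann-Whitney U compares ranks, no normality assumption
print(stats.mannwhitneyu(a, b, alternative="two-sided"))
```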
5. Explain Bayesian vs frequentist approaches.
Frequentist: probability is long-run frequency, parameters are fixed, data varies. Bayesian: probability is degree of belief, parameters have distributions, update beliefs with data. Bayesian is more intuitive but computationally intensive.
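A textbook Bayesian update fits in a few lines: the conjugate Beta-Binomial model for a conversion rate, with made-up prior and counts purely for illustration:

```python
from scipy import stats

# Prior belief about a conversion rate: Beta(2, 8), roughly 20%
alpha_prior, beta_prior = 2, 8

# Observed data: 30 conversions out of 100 trials
conversions, trials = 30, 100

# Conjugacy: posterior is Beta(alpha + successes, beta + failures)
posterior = stats.beta(alpha_prior + conversions,
                       beta_prior + (trials - conversions))

print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```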
Machine Learning Questions
6. Explain the bias-variance tradeoff.
Bias: error from oversimplified assumptions (underfitting). Variance: error from sensitivity to training data (overfitting). Total error = bias^2 + variance + irreducible error. The goal is to find the sweet spot — complex enough to capture patterns, simple enough to generalize.
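One way to see the tradeoff is to fit polynomials of increasing degree to noisy data; this sketch (NumPy, synthetic sine data chosen for illustration) shows validation error rising at both extremes:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=30)  # noisy training data
x_val = np.linspace(0, 1, 200)
y_val = np.sin(2 * np.pi * x_val)                        # noise-free truth

for degree in (1, 3, 15):
    p = Polynomial.fit(x, y, deg=degree)      # least-squares polynomial fit
    val_mse = np.mean((p(x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: validation MSE = {val_mse:.3f}")
# degree 1 underfits (high bias); degree 15 overfits (high variance)
```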
7. How do you handle imbalanced datasets?
Strategies: oversampling minority (SMOTE), undersampling majority, class weights in loss function, ensemble methods (balanced random forest), evaluation with precision-recall instead of accuracy, threshold tuning.
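Two of those strategies, class weights and precision-recall evaluation, sketched with scikit-learn on a synthetic 19:1 imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem: ~5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights the loss instead of resampling
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Evaluate with precision/recall, not accuracy
proba = clf.predict_proba(X_te)[:, 1]
print(f"average precision: {average_precision_score(y_te, proba):.3f}")
print(classification_report(y_te, clf.predict(X_te)))
```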
8. Explain gradient descent and its variants.
Iteratively adjust parameters in the direction of the negative gradient to minimize the loss function. Batch GD: uses all data (stable, slow). Stochastic GD: uses one sample (fast, noisy). Mini-batch: compromise. Adam optimizer: adaptive learning rates per parameter. Learning rate scheduling for convergence.
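Mini-batch gradient descent for linear regression in plain NumPy, as a minimal sketch on synthetic data (the learning rate and batch size here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient
        w -= lr * grad                                 # step downhill
print(w)   # converges close to [2.0, -1.0, 0.5]
```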
9. When do you use random forest vs gradient boosting?
Random forest: parallel trees, resistant to overfitting, good default. Gradient boosting (XGBoost, LightGBM): sequential trees correcting errors, usually higher accuracy, but prone to overfitting without tuning. GBM generally wins on structured data with proper hyperparameter tuning.
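A quick, hedged comparison with scikit-learn's built-in implementations (default hyperparameters, synthetic data, so treat the numbers as illustrative rather than a general verdict):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.3f}")
```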
10. Explain cross-validation and when you would NOT use k-fold.
K-fold: split data into k parts, train on k-1, test on 1, rotate. Not appropriate for time series (use time-based splits), when data has groups that should not be split (use grouped k-fold), or when dataset is very small (use leave-one-out).
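Both exceptions are one-liners in scikit-learn; this sketch (tiny toy arrays for readability) shows how the splits differ from plain k-fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Time series: every training fold strictly precedes its test fold
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)

# Grouped data: all rows from one group stay on the same side of the split
groups = np.repeat([0, 1, 2, 3], 3)   # e.g. 4 users, 3 rows each
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=groups):
    print("train:", train_idx, "test:", test_idx)
```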
11. How do you handle missing data?
First: understand WHY data is missing (MCAR, MAR, MNAR). Options: deletion (if MCAR and small proportion), imputation (mean/median for numerical, mode for categorical, KNN imputation, model-based), indicator variable for missingness as a feature.
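A minimal pandas sketch of two of those options, imputation plus a missingness indicator, on a made-up five-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, np.nan, 48_000],
                   "city": ["NY", "SF", None, "NY", "SF"]})

# Keep the missingness signal as its own feature before filling it in
df["income_missing"] = df["income"].isna().astype(int)

# Median for a possibly skewed numeric column, mode for a categorical one
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```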
12. Explain regularization — L1 vs L2.
Both add penalty terms to prevent overfitting. L1 (Lasso): absolute-value penalty, produces sparse models (feature selection). L2 (Ridge): squared penalty, shrinks coefficients toward zero but rarely exactly to zero. Elastic Net combines both.
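The sparsity difference is easy to demonstrate with scikit-learn on synthetic data where only a few features matter (the penalty strength alpha=1.0 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 zeroes out irrelevant coefficients; L2 only shrinks them
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```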
Deep Learning Questions
13. Explain transformer architecture.
Self-attention mechanism allowing each token to attend to all others. Multi-head attention captures different relationship types. Positional encodings supply sequence order. Encoder-decoder structure (or encoder-only like BERT, decoder-only like GPT).
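The core computation fits in a few NumPy lines; this is a single head with no masking or learned projections, purely to show the shape of the mechanism:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # token-to-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(5)
tokens, d_model = 4, 8
Q = rng.normal(size=(tokens, d_model))
K = rng.normal(size=(tokens, d_model))
V = rng.normal(size=(tokens, d_model))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```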
14. What is transfer learning and when do you use it?
Using a model pretrained on a large dataset and fine-tuning for a specific task. Use when: limited training data, similar domain to pretrained model, want to save training time/compute. Foundation models (BERT, GPT, ResNet) are the starting point for most modern tasks.
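A typical fine-tuning pattern, sketched with PyTorch/torchvision (assuming both are installed; the pretrained weights download on first use, and the 10-class head is a hypothetical target task):

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 10-class task
model.fc = nn.Linear(model.fc.in_features, 10)
# During training, only model.fc's parameters receive gradient updates
```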
15. How do you prevent overfitting in neural networks?
Dropout, early stopping, data augmentation, weight decay (L2 regularization), batch normalization, reducing model complexity, more training data, ensemble methods.
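Three of those in one small PyTorch sketch: dropout, weight decay via the optimizer, and a patience-based early-stopping loop, all on synthetic data for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, :5].sum(dim=1, keepdim=True) > 0).float()
X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

# Dropout randomly zeroes activations during training
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # L2-style penalty
loss_fn = nn.BCEWithLogitsLoss()

# Early stopping: halt once validation loss stops improving
best_val, patience, bad = float("inf"), 10, 0
for epoch in range(500):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(X_tr), y_tr)
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()
    if val < best_val - 1e-4:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:
            break
print(f"stopped at epoch {epoch}, best val loss {best_val:.3f}")
```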
Python and Implementation
16. How do you optimize a pandas pipeline for large datasets?
Use appropriate dtypes (int32 vs int64, category for low-cardinality strings), vectorized operations over loops, chunked reading for files larger than memory, consider Polars or Dask for parallelism.
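The dtype and chunking pieces in one sketch (the file events.csv and its columns are hypothetical stand-ins):

```python
import pandas as pd

# Downcast numeric dtypes; use category for low-cardinality strings
dtypes = {"user_id": "int32", "price": "float32", "country": "category"}

# Chunked reading keeps memory bounded for files larger than RAM
total = 0.0
for chunk in pd.read_csv("events.csv", dtype=dtypes, chunksize=100_000):
    # Vectorized operation on each chunk -- no Python-level row loop
    total += (chunk["price"] * 0.1).sum()
print(total)
```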
17. Explain how you would deploy an ML model to production.
Model serialization (pickle, ONNX), API wrapper (FastAPI/Flask), containerization (Docker), CI/CD for model updates, monitoring (data drift, model performance), A/B testing, feature store for consistency between training and serving.
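A minimal serving sketch with FastAPI; the pickled scikit-learn-style model, its file name, and the two feature names are all assumptions for illustration:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:      # model serialized at training time
    model = pickle.load(f)

class Features(BaseModel):
    age: float
    income: float

@app.post("/predict")
def predict(features: Features):
    # Same feature order as training: training-serving consistency matters
    pred = model.predict([[features.age, features.income]])
    return {"prediction": float(pred[0])}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```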
Business Case Questions
18. How would you measure the success of a recommendation system?
Offline metrics: precision@k, recall@k, NDCG, coverage, diversity. Online metrics: CTR, conversion rate, engagement time, revenue per user. A/B test against baseline. Monitor for filter bubbles and long-term user satisfaction.
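Precision@k is simple enough to define inline; the recommendation and relevance lists below are hypothetical:

```python
def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Fraction of the top-k recommendations the user actually engaged with."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

# Hypothetical example: 3 of the top 5 recommendations were relevant
recs = ["a", "b", "c", "d", "e"]
liked = {"a", "c", "e", "z"}
print(precision_at_k(recs, liked, k=5))   # 0.6
```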
19. A model performs well in testing but poorly in production. Why?
Data drift (production data differs from training), feature pipeline bugs, training-serving skew, concept drift (underlying patterns changed), data leakage in training, different preprocessing in production.
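One common first check for data drift is a two-sample Kolmogorov-Smirnov test per feature; this sketch uses SciPy with synthetic training and production distributions (and an arbitrary 0.01 alert threshold):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
train_feature = rng.normal(0.0, 1.0, size=10_000)   # feature at training time
prod_feature = rng.normal(0.4, 1.2, size=10_000)    # same feature in production

# Two-sample KS test: has the feature's distribution shifted?
stat, p = stats.ks_2samp(train_feature, prod_feature)
if p < 0.01:
    print(f"drift detected (KS statistic {stat:.3f}), investigate the pipeline")
```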
20. How do you explain a complex model to non-technical stakeholders?
Focus on what the model does and its business impact, not how it works. Use analogies, visualizations, and examples. SHAP values for feature importance explanations. Confidence intervals for prediction uncertainty.
AI-Powered Data Science Interview Prep
Data science interviews span statistics, ML theory, coding, and business thinking. Craqly's AI assistant provides real-time reminders of formulas, algorithm trade-offs, and evaluation metrics during your interview — the details you reference textbooks for in daily work.
Practice with mock interviews to get comfortable with AI assistance for technical questions.