FOUNDATIONS OF MACHINE LEARNING
1. What is Artificial Intelligence? How is it different from Machine Learning and Deep Learning?
Answer. AI is the broad field of building systems that perceive, reason, plan, and act under uncertainty—rule engines, search, and learning-based systems all qualify historically. Machine learning is the subfield where behavior improves from data and experience rather than hand-authored rules for every case. Deep learning is ML with layered representations (typically neural networks) where features are learned end-to-end. In panels I use a product example: rules for hard constraints, classical ML for tabular signal, deep nets for perception/language—and I stress engineering around evaluation, drift, and failure budgets.
2. What are the main types of machine learning? Explain each with examples.
Answer. Main buckets: supervised (labeled input–output pairs—spam detection, ranking), unsupervised (structure/hidden factors—clustering, representation learning), semi-supervised (few labels + many unlabeled), self-supervised (labels generated from the data via a pretext task—BERT MLM, contrastive image learning), and reinforcement learning (sequential decisions with rewards—gameplay, some control). I name a production pitfall for the big ones: leakage for supervised, stability/tuning for RL, and evaluation validity for self-supervised.
3. What is the difference between a parametric and a non-parametric model?
Answer. Parametric models summarize the data with a fixed parameter vector after training (e.g., linear models, neural nets): storage and inference are O(params), but misspecification hurts. Non-parametric models retain training data or grow structure with data (k-NN, kernel SVMs, many trees); flexible but costlier at scale and need careful control of complexity. Interview nuance: “non-parametric” still has hyperparameters; the distinction is about fixed vs growing functional capacity with n.
4. What is the bias-variance tradeoff? Explain with examples.
Answer. Bias is the error you’d have even with infinite data (underfitting / wrong inductive bias). Variance is how much your model swings if you resampled the training set (overfitting sensitivity). The tradeoff: more capacity usually lowers bias but raises variance unless regularized or fed more diverse data. I narrate with learning curves and deployment slices: fragile cohorts often signal variance; systematic underperformance across a segment often signals bias.
5. What is overfitting? What is underfitting? How do you detect and fix them?
Answer. Overfitting: great on train, worse on held-out—model memorized idiosyncrasies. Underfitting: bad everywhere—model too simple or features too weak. Detection: stratified validation, learning curves, slice analysis, compare to baselines. Fixes: regularization, more/better data, simpler architecture, early stopping, feature engineering, ensembling with diversity, or constraining the hypothesis class. I always mention leakage checks because “overfitting” is often contaminated validation.
6. What is the curse of dimensionality?
Answer. As dimensionality grows, volume explodes exponentially: data becomes sparse, distances concentrate (nearest and farthest neighbors start to look alike), and the samples needed to cover the space grow exponentially. Practical symptoms: k-NN and kernel methods degrade, overfitting risk rises, and density estimation becomes hopeless. Mitigations: feature selection, dimensionality reduction, regularization, and models whose inductive bias matches the data's true low-dimensional structure.
7. What is feature engineering? Why is it important?
Answer. Feature engineering transforms raw data into representations that expose signal to the model—encodings, interactions, aggregations, domain ratios, time-window statistics. It matters because no amount of model capacity recovers information the representation destroys, and on tabular data a good feature routinely beats a bigger model. Production caveat: every feature is a contract—compute it identically at train and serve time, or you've built in leakage or skew.
8. What is the difference between a model's capacity and its generalization ability?
Answer. Capacity is how rich a function family the model can represent (parameter count, depth, VC dimension); generalization is how well the fitted model performs on unseen data from the same distribution. High capacity enables low training error but widens the train–test gap unless regularization or data volume controls it. The goal is matching capacity to the data regime, not maximizing it—that's the bias-variance tradeoff restated.
9. Explain the No Free Lunch theorem.
Answer. Wolpert's result: averaged over all possible target functions, every learning algorithm has the same expected off-training-set performance—no learner dominates universally. The practical reading: algorithms win only by encoding assumptions (inductive bias) that happen to match the actual data-generating process. That's why "which model is best?" is ill-posed without the domain, and why benchmark results transfer only when the data distribution does.
10. What is Occam's Razor and how does it apply to ML?
Answer. Occam's Razor: among hypotheses that explain the data equally well, prefer the simpler one. In ML it motivates regularization, pruning, early stopping, and minimum-description-length thinking—simpler models have fewer ways to fit noise, so they tend to generalize better. Caveat I flag: "simple" means constrained effective capacity, not raw parameter count; a well-regularized large net can be effectively simpler than an unconstrained small one.
11. What is a hyperparameter? How is it different from a model parameter?
Answer. A hyperparameter is set before training via search or heuristics (tree depth, learning rate, batch size); it defines the training process or inductive bias. A parameter is learned by optimization (weights). Interview punch: changing hyperparameters changes the optimization landscape and effective capacity; track them like code with experiments and reproduce sweeps, not only final weights.
12. What is cross-validation? What are its types?
Answer. Cross-validation estimates generalization by repeatedly training on one subset and validating on a held-out subset, then averaging. Types: k-fold, stratified k-fold (preserves class ratios), leave-one-out (k = n), repeated k-fold, group k-fold (keeps all rows from one user/entity together), and time-series splits (train on the past, validate on the future). Picking a split that mirrors the deployment boundary—time, user, geography—matters more than picking k.
13. What is k-fold cross-validation? What value of k is commonly chosen and why?
Answer. Split the data into k equal folds; train k times, each time holding out one fold for validation, and average the k scores. k = 5 or 10 is the common choice: small k wastes data per fit (pessimistic bias), large k is expensive and yields high-variance estimates because the k training sets overlap heavily. Report the mean plus the spread across folds, not the mean alone.
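A minimal sketch of stratified 5-fold CV with scikit-learn (synthetic data; every name here is illustrative, not from the source):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(f"AUC {scores.mean():.3f} +/- {scores.std():.3f}")  # report spread, not just the mean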
14. What is stratified k-fold cross-validation?
Answer. Stratified k-fold constrains each fold to match the full dataset's class proportions, so every fold sees a representative label mix. It matters most with imbalanced classes or small datasets, where plain random folds can leave a fold with few or zero positives and produce wild metric swings. For regression, the common trick is stratifying on binned target values.
15. Explain the difference between training set, validation set, and test set.
Answer. The training set fits parameters; the validation set steers every choice you or the search makes—hyperparameters, early stopping, model selection; the test set is touched once, at the end, for an unbiased estimate. Every decision made while looking at a split contaminates it: tune against the test set and it silently becomes a validation set. In production I add one more check: a time-shifted holdout to estimate drift before launch.
16. What is data leakage and how can it affect model performance?
Answer. Leakage is any path where future or target-adjacent information reaches inputs during training or validation. It inflates metrics and then collapses in production. Classic cases: target-derived features, global normalization before splits, random CV on time series, or retrieval that peeks at labels. Prevention: point-in-time joins, split before any statistics, separate transform fit per fold, and parity tests between train and serving code paths.
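To make "separate transform fit per fold" concrete, a hedged sketch: wrapping the scaler in a Pipeline means each CV fold refits normalization on its own training split instead of leaking validation statistics (synthetic data, illustrative names):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# Leaky version: StandardScaler().fit(X) on ALL rows before splitting.
# Safe version: the pipeline refits the scaler inside each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())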
17. What is the inductive bias in machine learning?
Answer. Inductive bias is the set of assumptions a learner uses to generalize beyond its training data—the preference that picks one hypothesis among the many consistent with the data. Examples: linearity in linear models, locality/smoothness in k-NN, translation equivariance in CNNs, hierarchical axis-aligned splits in trees. No bias, no generalization (No Free Lunch); the craft is matching the bias to the domain.
18. What is a loss function? Give examples used in regression and classification.
Answer. A loss function scores how wrong a prediction is and defines what the optimizer minimizes. Regression: squared error (MSE), absolute error (MAE), Huber for outlier robustness, quantile loss for intervals. Classification: log-loss/cross-entropy, hinge loss (SVMs), focal loss for heavy imbalance. Interview nuance: the loss is a training surrogate—make sure it tracks the business metric, or you'll optimize the wrong thing with great precision.
19. What is the difference between a discriminative and a generative model?
Answer. Discriminative models learn P(y|x) or the decision boundary directly (logistic regression, SVMs, most deep nets); generative models learn P(x, y) or P(x)—enough to sample data (Naive Bayes, GMMs, VAEs, diffusion). Generative models can handle missing features and enable simulation but pay the cost of modeling x; discriminative models usually win on pure prediction accuracy once labels are plentiful. Decision rule: need samples, densities, or semi-supervised leverage → generative; need the best classifier → discriminative.
20. What is the Probably Approximately Correct (PAC) learning framework?
Answer. PAC learning formalizes learnability: a concept class is PAC-learnable if, for any accuracy ε and confidence δ, an efficient algorithm can—with probability at least 1−δ over i.i.d. samples—output a hypothesis with error at most ε using polynomially many examples. Sample-complexity bounds tie generalization to hypothesis-class capacity (VC dimension). Practical takeaway: richer classes need more data for the same guarantee—theory's version of the bias-variance story.
STATISTICS & PROBABILITY
21. What is Bayes' theorem? How is it used in machine learning?
Answer. Bayes’ rule relates posterior P(h|D) ∝ P(D|h)P(h)—how evidence updates beliefs. ML uses it for generative classifiers (Naive Bayes), Bayesian neural nets, topic models, and as conceptual glue for MAP vs MLE. Practically, interviewers often want Bayes error (irreducible noise ceiling), calibration as a Bayesian narrative, and when priors regularize ill-posed problems.
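A worked toy example I'd sketch on a whiteboard (the prevalence and test-accuracy numbers are made up for illustration):

# P(disease | positive test) via Bayes' rule
prior = 0.01        # assumed P(disease)
sens = 0.95         # assumed P(positive | disease)
fpr = 0.05          # assumed P(positive | healthy)
evidence = sens * prior + fpr * (1 - prior)   # P(positive)
posterior = sens * prior / evidence
print(round(posterior, 3))  # ~0.161: at low prevalence, most positives are false alarms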
22. What is the difference between frequentist and Bayesian probability?
Answer. Frequentist: probability is long-run frequency; parameters are fixed unknowns, data are random; inference runs through estimators, confidence intervals, p-values. Bayesian: probability is degree of belief; parameters get priors and inference is the posterior P(θ|D) ∝ P(D|θ)P(θ). Practically: Bayesian methods shine with small data, hierarchical structure, and decisions under uncertainty; frequentist methods dominate where scale and simple guarantees matter. The views meet in practice—MAP with a prior is just regularized MLE.
23. What is the Central Limit Theorem and why does it matter in ML?
Answer. The CLT: the mean of many i.i.d. variables with finite variance tends toward a Gaussian regardless of the underlying distribution, with error shrinking like 1/√n. In ML it justifies Gaussian noise assumptions, normal-approximation confidence intervals on metrics, and A/B-test statistics. Caveats: heavy tails, dependence, and small n break it—bootstrap when in doubt.
24. Explain the concept of Maximum Likelihood Estimation (MLE).
Answer. MLE picks parameters that make the observed data most probable: θ̂ = argmax_θ Π P(xᵢ|θ), in practice by maximizing the log-likelihood sum. Under Gaussian noise, MLE for regression is exactly least squares; for classification it's cross-entropy minimization—most standard losses are negative log-likelihoods in disguise. Properties: consistent and asymptotically efficient, but with little data it overfits (no prior) and it's sensitive to model misspecification.
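A quick numeric check of the Gaussian closed forms (synthetic data; numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_hat = x.mean()                        # MLE of the mean is the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of variance uses 1/n (biased), not 1/(n-1)
print(mu_hat, np.sqrt(sigma2_hat))       # ≈ 3.0 and 2.0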
25. What is Maximum A Posteriori (MAP) estimation? How does it differ from MLE?
Answer. MAP maximizes the posterior: θ̂ = argmax_θ P(D|θ)P(θ)—MLE plus a prior. The log-prior acts as a regularizer: a Gaussian prior gives L2 (ridge), a Laplace prior gives L1 (lasso). It differs from MLE by exactly that prior term, and from full Bayesian inference by returning a point estimate instead of the whole posterior distribution.
26. What is a Gaussian/Normal distribution? Why is it commonly assumed in ML?
Answer. The Gaussian N(μ, σ²) is the bell curve fully specified by mean and variance. It's assumed so often for three reasons: the CLT (aggregated noise tends Gaussian), maximum entropy (it's the least-committal distribution for a given mean and variance), and mathematical convenience (closed-form marginals, conditionals, conjugacy). Failure mode: real data is frequently heavy-tailed or skewed—check residuals before trusting Gaussian-based intervals.
27. What is covariance? What is correlation? How are they different?
Answer. Covariance measures how two variables move together, in their raw units: Cov(X,Y) = E[(X−μ_X)(Y−μ_Y)]. Correlation is covariance normalized by both standard deviations—unitless, in [−1, 1], and therefore comparable across feature pairs. Both capture only linear association: zero correlation does not imply independence (think Y = X²).
28. What is the difference between standard deviation and standard error?
Answer. Standard deviation describes the spread of the data itself; standard error describes the uncertainty of an estimate (typically the mean): SE = SD/√n. SD doesn't shrink with more data; SE does. Interview trap: error bars on a model metric should be standard errors (or confidence intervals) across folds/resamples, not the SD of individual predictions.
29. What are Type I and Type II errors?
Answer. Type I error: rejecting a true null—a false positive, rate controlled by α. Type II error: failing to reject a false null—a false negative, rate β, with power = 1−β. They trade off like precision vs recall: lowering α raises β at a fixed sample size. In experimentation I anchor on costs—whichever mistake is more expensive decides where the threshold sits.
30. What is a p-value? What does statistical significance mean?
Answer. A p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one seen. "Statistically significant" just means p fell below a preset α (often 0.05)—not that the effect is large, and not that the null has probability p. Pitfalls I always flag: multiple comparisons, peeking/early stopping, and confusing statistical with practical significance.
31. What is the difference between a parametric and non-parametric statistical test?
Answer. Parametric tests assume a distributional form (the t-test assumes roughly normal sample means) and gain power when that assumption holds; non-parametric tests (Mann–Whitney U, Wilcoxon, permutation tests) rank or resample instead and stay valid under much weaker assumptions. Decision rule: small samples with unknown or skewed distributions → non-parametric or permutation; large samples under CLT cover → parametric is fine and more powerful.
32. What is the law of large numbers?
Answer. The law of large numbers: sample averages converge to the true expectation as n → ∞ (weak version: in probability; strong: almost surely). It's the license behind estimating metrics and probabilities from data at all—empirical risk approaches true risk for a fixed model. Caveat: it says nothing about the rate; the CLT and concentration inequalities fill in how fast.
33. Explain conditional probability and independence.
Answer. Conditional probability P(A|B) = P(A∩B)/P(B) re-weights beliefs given that B occurred. Independence means P(A∩B) = P(A)P(B)—conditioning changes nothing; conditional independence (A ⊥ B given C) underlies Naive Bayes and graphical models. Pitfalls: pairwise independence doesn't imply mutual independence, and independence can appear or vanish after conditioning (explaining away).
34. What is the expectation and variance of a random variable?
Answer. Expectation E[X] = Σ x·P(x) (or the integral for densities) is the probability-weighted average—the long-run mean. Variance Var(X) = E[(X−E[X])²] = E[X²] − (E[X])² measures spread around it. Key facts: E[aX+bY] = aE[X]+bE[Y] always; Var(X+Y) = Var(X)+Var(Y) only if X and Y are uncorrelated—an assumption worth stating out loud before using it.
35. What is the KL divergence? Where is it used in ML?
Answer. KL divergence D_KL(P‖Q) = Σ P(x) log(P(x)/Q(x)): the extra bits needed to encode samples from P using a code built for Q. Non-negative, zero iff P = Q, and asymmetric—not a metric. ML uses: cross-entropy training (minimizing KL from data to model), the VAE ELBO regularizer, policy constraints like PPO's KL penalty, and drift monitoring between training and serving distributions.
36. What is entropy in the context of information theory?
Answer. Entropy H(X) = −Σ P(x) log P(x) quantifies average surprise—the minimum bits per symbol needed to encode X. Uniform distributions maximize it; deterministic ones have zero. It shows up as the decision-tree splitting criterion (information gain), inside cross-entropy loss, and in temperature/uncertainty diagnostics for model outputs.
37. What is mutual information?
Answer. Mutual information I(X;Y) = H(X) − H(X|Y) = D_KL(P(x,y)‖P(x)P(y)): how much knowing Y reduces uncertainty about X—it captures any dependency, not just linear. Uses: feature selection, information-bottleneck analyses, and contrastive representation learning (InfoNCE lower-bounds MI). Practical catch: estimating MI from finite high-dimensional samples is hard; estimators are biased.
38. What is the difference between L1 and L2 regularization?
Answer. L1 (lasso) adds λΣ|w|: its corner geometry pushes weights exactly to zero—built-in feature selection, but unstable among correlated features. L2 (ridge) adds λΣw²: smooth shrinkage that handles multicollinearity by spreading weight, never zeroing anything. Bayesian view: Laplace vs Gaussian priors. Elastic net blends both when you want sparsity with correlated-group stability.
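The sparsity difference is easy to demonstrate; a sketch with scikit-learn on synthetic data (the alpha values are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("lasso zeros:", int((lasso.coef_ == 0).sum()))  # many exact zeros: feature selection
print("ridge zeros:", int((ridge.coef_ == 0).sum()))  # shrunk, but almost never exactly zero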
39. What is the likelihood function?
Answer. The likelihood L(θ) = P(D|θ) is the probability (or density) of the observed data viewed as a function of the parameters, with the data held fixed. It is not a distribution over θ—it needn't integrate to 1. The log-likelihood turns products into sums for optimization; maximizing it gives MLE, and adding a log-prior gives MAP.
40. What is hypothesis testing and how is it applied in data science?
Answer. Hypothesis testing frames a decision: assume a null H0, compute a test statistic, and measure how incompatible the data is with H0 (the p-value) against a significance level α chosen in advance. Data-science applications: A/B tests on conversion, model-comparison tests on metric deltas, drift tests between distributions. Operational musts: pre-register the metric and sample size, correct for multiple looks, and report effect sizes with confidence intervals—not just p.
LINEAR & LOGISTIC REGRESSION
41. What are the assumptions of linear regression?
Answer. The classical assumptions: (1) linearity of the conditional mean in the features, (2) independent errors, (3) homoskedasticity (constant error variance), (4) no perfect multicollinearity, and (5) normally distributed errors for small-sample inference. Clustered: (1) shapes the model, (2)–(3) shape the error structure, (4)–(5) shape the inference. I check them with residual plots: curvature → add nonlinear terms; a funnel → heteroskedasticity, use robust standard errors; autocorrelation → time-aware modeling.
42. Explain Ordinary Least Squares (OLS) estimation.
Answer. OLS chooses coefficients minimizing the residual sum of squares; with full-rank X the closed form is β̂ = (XᵀX)⁻¹Xᵀy. Geometrically it projects y onto the column space of X. Gauss–Markov: under the standard assumptions it's the best linear unbiased estimator. Implementation notes: solve via QR or SVD rather than inverting XᵀX (conditioning), and switch to gradient methods when n or p gets large.
43. What is multicollinearity? How do you detect and handle it?
Answer. Multicollinearity is strong linear dependence among features: coefficients become unstable with huge standard errors and sign flips, even while predictions stay fine. Detection: pairwise correlations, variance inflation factor (VIF above roughly 5–10), near-zero eigenvalues of XᵀX. Handling: drop or combine features, PCA, or ridge regularization. Key nuance: it poisons coefficient interpretation, not necessarily predictive accuracy.
44. What is the difference between simple and multiple linear regression?
Answer. Simple regression has one predictor: y = β0 + β1x. Multiple regression has several, and each coefficient becomes a partial effect—the change in y per unit of that feature holding the others fixed. That "holding fixed" is the real difference: adding or removing correlated covariates changes the coefficients (omitted-variable bias), which is why single-variable intuition fails in multivariate settings.
45. What is Ridge regression? What problem does it solve?
Answer. Ridge adds an L2 penalty: minimize ‖y − Xβ‖² + λ‖β‖², giving β̂ = (XᵀX + λI)⁻¹Xᵀy. It solves the instability from multicollinearity and p ≈ n settings by making the problem well-conditioned, trading a little bias for a large variance reduction. All coefficients shrink toward zero but none hit exactly zero; pick λ by cross-validation and standardize features first so the penalty treats them fairly.
46. What is Lasso regression? How does it differ from Ridge?
Answer. Lasso penalizes the L1 norm: ‖y − Xβ‖² + λΣ|βⱼ|. Unlike ridge, the penalty's corners drive some coefficients exactly to zero—feature selection during fitting. Tradeoffs vs ridge: among highly correlated features lasso arbitrarily keeps one; with p > n it selects at most n features; and there's no closed form—fit by coordinate descent.
47. What is Elastic Net regularization?
Answer. Elastic net combines both penalties: λ₁Σ|βⱼ| + λ₂Σβⱼ², mixing lasso's sparsity with ridge's stability. It fixes lasso's weaknesses—arbitrary picks within correlated groups and the p > n selection cap—by letting correlated features enter together with shrunk weights. Tune the overall strength and the L1/L2 mixing ratio jointly by cross-validation.
48. How do you interpret the coefficients in a linear regression model?
Answer. In y = β0 + Σβⱼxⱼ, each βⱼ is the expected change in y for a one-unit increase in xⱼ with the other features held fixed—association, not causation. Mechanics that change the reading: standardized features make coefficients comparable per standard deviation; a log-transformed target makes them approximate percentage effects; dummy coefficients are offsets against the reference category. Pitfall I check first: correlated features make individual coefficients unstable, so I always pair them with confidence intervals.
49. What is R-squared? What are its limitations?
Answer. R² = 1 − SS_res/SS_tot: the fraction of target variance the model explains; 0 matches predicting the mean, 1 is a perfect fit. Limitations: it never decreases as you add features (hence adjusted R²), it says nothing about whether the model's assumptions hold (Anscombe's quartet is the classic warning), it can be inflated by overfitting, and it can go negative out-of-sample.
50. What is adjusted R-squared?
Answer. Adjusted R² = 1 − (1−R²)(n−1)/(n−p−1) penalizes the feature count p, so it rises only if a new feature adds more explanatory power than chance would. Use it for comparing models of different sizes; report plain R² for describing fit. It's still an in-sample statistic—cross-validated error remains the sturdier model-selection signal.
51. What is logistic regression? How does it model probabilities?
Answer. Logistic regression models P(y=1|x) = σ(wᵀx + b), the sigmoid squashing a linear score into (0,1); equivalently, the log-odds are linear in x. It's fit by maximizing the Bernoulli likelihood—minimizing log-loss—which is convex, so optimization is reliable. Despite the name it's a classifier, and its calibrated probabilities plus interpretable coefficients keep it the production baseline of choice.
52. What is the sigmoid function? Why is it used in logistic regression?
Answer. σ(z) = 1/(1+e^(−z)) maps the real line to (0,1) monotonically, with σ(0) = 0.5 and the convenient derivative σ(z)(1−σ(z)). It's used in logistic regression because it's the inverse of the logit link—assuming log-odds linear in x forces the sigmoid—and pairing it with log-loss yields a convex objective with clean gradients (p − y). Numerics note: implement with log-sum-exp tricks to avoid overflow at large |z|.
53. What is the log-loss (cross-entropy loss) function?
Answer. Binary log-loss: −(1/n)Σ[y log p + (1−y) log(1−p)]—the negative log-likelihood of a Bernoulli model, and the cross-entropy between the label distribution and the predicted one. It rewards calibrated confidence and punishes confident mistakes without bound (hence probability clipping in implementations). Multi-class version: −Σ y_k log p_k over one-hot labels with softmax outputs.
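A sketch comparing a hand-rolled binary cross-entropy against sklearn's log_loss (toy labels and probabilities):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.99])   # predicted P(y=1)

eps = 1e-15                            # clip so log(0) can't blow up
p_c = np.clip(p, eps, 1 - eps)
manual = -np.mean(y * np.log(p_c) + (1 - y) * np.log(1 - p_c))
print(manual, log_loss(y, p))          # the two should agree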
54. How do you handle class imbalance in logistic regression?
Answer. In rough order of preference: (1) leave training alone and move the decision threshold to match costs; (2) class weights in the loss (class_weight='balanced'), which reweights gradients; (3) resampling—undersample the majority or oversample/SMOTE the minority—fitted inside CV folds only, to avoid leakage; (4) recalibrate probabilities after any resampling, since resampling distorts the base rate. Evaluate with PR-AUC or cost-weighted metrics, never raw accuracy.
55. What is multinomial logistic regression?
Answer. Multinomial (softmax) logistic regression generalizes to K classes: each class gets a weight vector and P(y=k|x) = softmax(Wx)_k; training minimizes categorical cross-entropy, still convex. It differs from one-vs-rest, which fits K independent binary models whose scores need not form a coherent distribution. Identifiability note: one class's weights are redundant and are often pinned to zero as the reference.
56. What is the decision boundary in logistic regression?
Answer. The decision boundary is where the predicted probability crosses the threshold: at 0.5 it's the set wᵀx + b = 0—a hyperplane, which is why logistic regression is a linear classifier in its feature space. Curved boundaries come from feature engineering (polynomials, interactions), not from the model itself. Moving the threshold for cost reasons shifts the hyperplane's offset, trading precision against recall.
57. What is the odds ratio?
Answer. Odds are p/(1−p); the odds ratio is the multiplicative change in odds for a one-unit change in a feature. In logistic regression e^(βⱼ) is exactly that multiplier: βⱼ = 0.7 means each unit of xⱼ multiplies the odds by e^0.7 ≈ 2. It's the clean way to translate coefficients for stakeholders; the caveat is that odds ratios overstate relative risk when base rates are high.
58. How does regularization affect logistic regression?
Answer. Same story as linear models—L2 shrinks and stabilizes correlated features, L1 sparsifies and selects—plus two logistic-specific points. First, with linearly separable data the unregularized MLE diverges (weights chase probability 1 toward infinity), so some penalty is effectively required. Second, regularization tempers overconfident probabilities, improving calibration. Remember sklearn's LogisticRegression applies L2 by default with strength 1/C.
EVALUATION METRICS
59. What is a confusion matrix?
Answer. A confusion matrix tabulates predictions vs ground truth across classes—true/false positives/negatives extend to multi-class counts. It’s the raw material for precision, recall, specificity, and cost-sensitive thresholds. In production I review slice-confusion matrices by market/region/device to find hidden failure modes global accuracy hides.
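A sketch of pulling the four cells out and deriving precision/recall from them (toy labels):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary layout: [[tn, fp], [fn, tp]]
print("precision", tp / (tp + fp), "recall", tp / (tp + fn))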
60. What is accuracy? When can it be misleading?
Answer. Accuracy = correct predictions / total predictions. It misleads under class imbalance (99% accuracy by always predicting the majority in a 99:1 problem), under asymmetric costs (a miss and a false alarm rarely cost the same), and under shifting base rates. I treat it as reportable only after checking balance; otherwise precision/recall, PR-AUC, or cost-weighted metrics.
61. What is precision? What is recall? What is the tradeoff between them?
Answer. Precision = TP/(TP+FP): of the items we flagged, how many were right. Recall = TP/(TP+FN): of the actual positives, how many we caught. The threshold trades them off—stricter thresholds raise precision and drop recall. Costs pick the operating point: a fraud review queue wants precision (analyst time is the cost), cancer screening wants recall (a miss is catastrophic).
62. What is the F1 score? When would you use it over accuracy?
Answer. F1 = 2·(precision·recall)/(precision+recall)—the harmonic mean, dragged down by whichever side is worse, so it rewards balance. Prefer it over accuracy with imbalanced classes when the positive class is what matters and FP/FN costs are roughly comparable; use Fβ to weight recall (β > 1) or precision (β < 1) when they aren't. Note F1 ignores true negatives entirely—sometimes a feature, sometimes a bug.
63. What is the ROC curve? What does AUC represent?
Answer. The ROC curve plots true-positive rate against false-positive rate as the threshold sweeps. AUC is the area under it: the probability that a random positive scores above a random negative—0.5 is chance, 1.0 perfect; it's threshold-free and insensitive to base rate. Weakness: with rare positives ROC looks flattering because FPR's denominator (all negatives) is huge—switch to PR curves there.
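The "probability a random positive outranks a random negative" reading is checkable directly (toy scores):

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1, 1, 0])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
print(roc_auc_score(y, s))  # equals the fraction of (positive, negative) pairs ranked correctly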
64. What is the PR (Precision-Recall) curve? When is it preferred over ROC?
Answer. The PR curve plots precision against recall across thresholds; its area (average precision) summarizes ranking quality on the positives. Prefer it to ROC when positives are rare or the positive class is what matters—fraud, ads, retrieval—because precision feels every false positive in a way FPR doesn't once negatives dominate. Sanity check: a random classifier's PR baseline is the positive prevalence, not 0.5.
65. What is Mean Absolute Error (MAE)?
Answer. MAE = (1/n)Σ|yᵢ − ŷᵢ|: the average absolute miss, in target units. It's robust to outliers relative to squared losses, and the predictor minimizing it is the conditional median—so MAE-trained models track medians, not means. Use it when errors hurt linearly; avoid it when big misses are disproportionately expensive.
66. What is Mean Squared Error (MSE)? What is Root Mean Squared Error (RMSE)?
Answer. MSE = (1/n)Σ(yᵢ−ŷᵢ)² penalizes misses quadratically—its minimizer is the conditional mean, and it's the MLE loss under Gaussian noise. RMSE = √MSE puts the number back in target units, which is what makes it communicable. Both are outlier-sensitive: one wild point can dominate training and evaluation, which I verify with an error histogram.
67. What is the difference between MAE and RMSE?
Answer. RMSE ≥ MAE always, and the gap grows with error variance—comparing the two is a quick outlier diagnostic. RMSE (squared loss) targets the conditional mean and punishes large errors disproportionately; MAE targets the median and treats errors linearly. Decision rule: large errors extra-costly or noise roughly Gaussian → RMSE; heavy tails, outliers, or "typical miss" reporting → MAE.
68. What is mean absolute percentage error (MAPE)?
Answer. MAPE = (100/n)Σ|yᵢ−ŷᵢ|/|yᵢ|: the average percentage miss—popular in forecasting because it's scale-free and business-friendly. Failure modes: undefined or exploding near zero actuals, and asymmetric—over-forecasts can exceed 100% error while under-forecasts cap at 100%, so optimizing MAPE biases forecasts low. Alternatives when it bites: sMAPE, WAPE (volume-weighted), or errors on a log scale.
69. What is the Matthews Correlation Coefficient (MCC)?
Answer. MCC is the correlation between predicted and true binary labels, computed from all four confusion-matrix cells: (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging −1 to +1 with 0 at chance. It stays honest under imbalance because it requires doing well on both classes—a high F1 hiding terrible true-negative behavior gets exposed. A good default single-number summary for imbalanced binary problems.
70. How do you choose the right evaluation metric for a given problem?
Answer. Work backwards from the decision the model drives: who pays for a false positive vs a false negative, and in what currency. Then match the metric to the task—ranking → ROC/PR-AUC, probabilities → log-loss and calibration, regression → MAE/RMSE per cost shape—check class balance, and pin one primary metric with a small set of guardrails (latency, calibration, fairness slices). The mechanical pitfall is optimizing a proxy that diverges from the business metric; I revisit alignment once deployment data arrives.
71. What is the Gini coefficient in the context of ML models?
Answer. In credit scoring and risk modeling, the Gini coefficient measures ranking power and maps linearly to AUC: Gini = 2·AUC − 1, so 0 is random and 1 perfect. It carries the same information as ROC (derived via the Lorenz/CAP curve) in a convention that industry reports prefer. Don't confuse it with Gini impurity in decision trees—same name, different object.
72. How do you evaluate a clustering model?
Answer. Without labels, internal indices: silhouette, Davies–Bouldin, Calinski–Harabasz—all variants of compactness vs separation—plus stability checks (resample, re-cluster, compare assignments). With reference labels, external indices: adjusted Rand index, normalized mutual information. The evaluation that actually matters is downstream utility: do the clusters improve the decision they were built for (segmentation lift, routing quality)? I also eyeball clusters in a 2-D projection for sanity.
73. What is the silhouette score?
Answer. For each point, silhouette s = (b − a)/max(a, b), where a is the mean distance to its own cluster and b the mean distance to the nearest other cluster—ranging −1 to 1, high meaning tight and well-separated. Averaging over points scores a clustering and helps choose k. Caveats: it favors convex, evenly sized clusters and needs O(n²) distances, which gets expensive at scale.
SUPERVISED LEARNING ALGORITHMS
74. How does a Decision Tree work? What are its splitting criteria?
Answer. A decision tree greedily partitions feature space: at each node, scan features and thresholds, pick the split that most reduces impurity, recurse until stopping rules fire (max depth, min samples, purity). Splitting criteria: Gini impurity or entropy/information gain for classification, variance/MSE reduction for regression. Training costs roughly O(n·p·log n) with sorted features. Pitfalls: axis-aligned greed misses oblique structure, and an unconstrained tree memorizes—depth is the first knob I touch.
75. What is information gain? What is Gini impurity?
Answer. Gini impurity = 1 − Σp_k²: the chance of mislabeling a random point if you label it by the node's class distribution. Information gain = parent entropy minus the weighted child entropies: bits of uncertainty the split removes. Both peak for pure nodes and rarely disagree in practice; Gini is marginally cheaper (no log) and is CART's default, while entropy slightly favors balanced splits. Gain ratio corrects information gain's bias toward high-cardinality features.
76. What is pruning in decision trees? Why is it done?
Answer. Pruning removes branches that don't earn their complexity, to fight overfitting. Pre-pruning stops growth early (max depth, min samples per leaf, min impurity decrease); post-pruning grows the full tree and cuts back—cost-complexity pruning picks the subtree minimizing error + α·|leaves|, with α chosen by cross-validation. Post-pruning usually wins because a weak-looking split can enable a strong one below it (the horizon effect).
77. What are the differences between ID3, C4.5, and CART algorithms?
Answer. ID3: categorical features only, splits by information gain, no pruning—the historical baseline. C4.5: adds gain ratio (fixing gain's bias toward high-cardinality features), thresholds for continuous features, missing-value handling, and error-based pruning. CART: binary splits only, Gini for classification and variance for regression (so it covers both tasks), with cost-complexity pruning—the formulation behind sklearn's trees and the learners inside modern boosting libraries.
78. What is a Random Forest? How does it improve over a single decision tree?
Answer. A random forest trains many deep trees on bootstrap samples, considering a random feature subset at each split, then averages (regression) or votes (classification). The two sources of randomness decorrelate the trees, so averaging slashes variance while keeping bias low—attacking exactly the instability that plagues a single tree. Bonuses: OOB error for nearly free validation and feature importances; costs: model size, latency, and less interpretability than one tree.
79. What is bagging? What is boosting? How are they different?
Answer. Bagging trains high-variance learners in parallel on bootstrap resamples and averages them—variance reduction, since errors of decorrelated trees cancel (random forests). Boosting trains weak learners sequentially, each focusing on what the ensemble still gets wrong—bias reduction via additive stage-wise fitting (AdaBoost, gradient boosting). Operationally: bagging is robust to noise and embarrassingly parallel; boosting usually wins accuracy on tabular data but is more sensitive to label noise and needs careful regularization.
80. What is the Out-Of-Bag (OOB) error in Random Forests?
Answer. Each bootstrap sample leaves out about 37% (≈ e⁻¹) of the rows; for any row, the trees that never saw it form a mini-ensemble whose prediction yields the out-of-bag error—an almost-free cross-validation estimate. Good for quick tuning and sanity checks. Caveat: with grouped or temporal data, OOB inherits the optimism of random splits, so I still hold out along the deployment boundary.
81. What is AdaBoost? How does it work?
Answer. AdaBoost fits weak learners (typically stumps) sequentially on reweighted data: after each round, misclassified points get their weights increased so the next learner focuses on them, and each learner earns a vote α = ½·ln((1−ε)/ε) in the final weighted sum. It's stage-wise minimization of exponential loss, which explains its sensitivity to label noise—mislabeled points get exponentially inflated attention. Historically important; gradient boosting generalizes it to arbitrary differentiable losses.
82. What is Gradient Boosting? Explain the algorithm step by step.
Answer. Gradient boosting builds an additive model stage by stage: (1) initialize with a constant prediction (e.g., the mean); (2) compute pseudo-residuals—the negative gradient of the loss at the current predictions (for squared loss, just the residuals); (3) fit a small tree to those pseudo-residuals; (4) update F ← F + η·tree with learning rate η; (5) repeat for M rounds with early stopping on validation. The regularizers that matter: shrinkage (small η, more trees), row/column subsampling, and depth limits.
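A minimal from-scratch sketch for squared loss, where the pseudo-residuals are plain residuals (synthetic 1-D data; the depth and rate are arbitrary):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

eta, F, trees = 0.1, np.full(len(y), y.mean()), []           # step 1: constant init
for _ in range(100):
    residual = y - F                                          # step 2: negative gradient of squared loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # step 3: weak learner on residuals
    F = F + eta * t.predict(X)                                # step 4: shrunken additive update
    trees.append(t)
print("train MSE:", np.mean((y - F) ** 2))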
83. What is XGBoost? What makes it faster and more accurate?
Answer. XGBoost is gradient boosting engineered hard: a second-order (Hessian-aware) approximation of the loss for better splits, explicit L1/L2 regularization plus a γ penalty on leaf count, sparsity-aware default directions for missing values, row/column subsampling, and real systems work—cache-aware histogram split finding, parallelism, out-of-core training. The combination delivers accuracy (regularized objective) and speed (histograms plus parallelism), which is why it became the tabular default for years.
84. What is LightGBM? How does it differ from XGBoost?
Answer. LightGBM differs mainly in growth strategy and sampling: leaf-wise (best-first) tree growth instead of level-wise—lower loss for the same leaf budget, at some overfitting risk on small data; histogram binning throughout; GOSS (keep large-gradient rows, subsample the rest); EFB (bundle mutually exclusive sparse features); and native categorical handling. Net effect: typically faster with less memory on large datasets. Tune num_leaves and max depth carefully, because leaf-wise trees grow deep.
85. What is CatBoost? How does it handle categorical features?
Answer. CatBoost's signature is principled categorical handling: ordered target statistics—encoding each row's category using target values only from "earlier" rows in a random permutation—which avoids the target leakage that naive mean-encoding causes; plus ordered boosting to reduce prediction shift, and symmetric (oblivious) trees for fast inference. Result: strong defaults on categorical-heavy tabular data with minimal encoding work in the pipeline.
86. What is a Support Vector Machine (SVM)?
Answer. An SVM finds the separating hyperplane that maximizes the margin—the distance to the closest points (the support vectors), which alone determine the solution. Formally: minimize ‖w‖²/2 subject to yᵢ(wᵀxᵢ+b) ≥ 1 (hard margin), or with slack and hinge loss for non-separable data (soft margin). Margin maximization is capacity control, which is why SVMs generalized so well pre-deep-learning, especially kernelized on medium-sized datasets.
87. What is the kernel trick in SVMs? Name commonly used kernels.
Answer. The dual SVM touches data only through inner products, so replacing ⟨x, x'⟩ with a kernel K(x, x') computes inner products in an implicit high-dimensional feature space without ever materializing it—that's the trick. Common kernels, clustered: linear (baseline; high-dimensional sparse data like text), polynomial (bounded-degree feature interactions), RBF/Gaussian (local similarity; the default nonlinear choice), sigmoid (mostly historical). Cost reality: kernel methods scale O(n²)–O(n³) in samples, which is what pushed large-scale practice toward linear models and deep nets.
88. What is the role of C and gamma hyperparameters in SVM?
Answer. C is the soft-margin penalty: high C punishes margin violations hard (tight fit, higher variance); low C tolerates them (wider margin, higher bias). Gamma (for RBF) sets the kernel's reach: high gamma means each support vector influences only a tiny neighborhood (wiggly boundaries, overfit risk); low gamma approaches a smooth, near-linear boundary. They interact, so tune jointly on a log grid; the symptom train ≫ validation accuracy says lower C or gamma first.
89. What is a soft-margin SVM?
Answer. A soft-margin SVM allows margin violations via slack variables ξᵢ: minimize ‖w‖²/2 + CΣξᵢ subject to yᵢ(wᵀxᵢ+b) ≥ 1 − ξᵢ. Equivalently, it's hinge-loss minimization with L2 regularization. It exists because real data is noisy and rarely separable; C dials the tradeoff between a wide margin and few violations, and the hard margin is the C → ∞ limit.
90. What is the k-Nearest Neighbors (k-NN) algorithm?
Answer. k-NN is lazy learning: store the training set; to predict, find the k nearest points under a distance metric and vote (classification) or average (regression). No training cost, O(n) naive query cost—approximate indexes (KD-trees, HNSW) fix that at scale. It lives or dies by the metric and feature scaling, and it degrades in high dimensions as distances concentrate—the curse of dimensionality in its purest form.
91. How do you choose the value of k in k-NN?
Answer. Cross-validate k over a sensible grid: small k gives flexible, noisy boundaries (high variance); large k smooths toward the global majority (high bias). My procedure: scale features, evaluate odd k (to break ties) from 1 to roughly √n by CV, pick where validation error stabilizes, and prefer distance-weighted voting to soften the choice. Debugging knob: if accuracy swings wildly across folds at small k, the metric or the scaling is usually wrong before k is.
92. What is a Naive Bayes classifier? What assumption does it make?
Answer. Naive Bayes is a generative classifier: model P(y) and P(x|y), predict argmax_y P(y)·Π P(xⱼ|y). The naive assumption is that features are conditionally independent given the class—almost always false, but it collapses estimation into per-feature one-dimensional distributions, which is why it works with tiny data and huge dimensionality. Fast, streaming-friendly, and a strong baseline for text.
93. What are the variants of Naive Bayes (Gaussian, Multinomial, Bernoulli)?
Answer. Gaussian NB: continuous features, each modeled N(μ, σ²) per class—sensor and tabular data. Multinomial NB: count features (word counts, TF-IDF) via class-conditional multinomials with Laplace smoothing—the classic text classifier. Bernoulli NB: binary presence/absence features, explicitly penalizing absent terms—short texts. Choice rule: match the variant's likelihood to the feature's actual data type; a mismatch (Gaussian on counts, say) is a quiet accuracy killer.
94. When does Naive Bayes work well despite its naive assumption?
Answer. NB works when only the decision matters, not the probabilities: even with violated independence, the argmax can still be right because errors in the estimated log-odds often leave the ranking intact (the Domingos–Pazzani result on zero-one loss). Favorable conditions: high dimensions with little data (text), many weak features contributing roughly evenly, and a need for speed. Caveat: its probabilities are badly calibrated—run them through calibration before using them as confidences.
UNSUPERVISED LEARNING
95. What is clustering? What are its main types?
Answer. Clustering groups points so in-group similarity beats cross-group similarity, with no labels. Main types: centroid/partitioning (k-means), hierarchical (agglomerative/divisive dendrograms), density-based (DBSCAN—arbitrary shapes, noise handling), model/distribution-based (GMMs), and graph/spectral methods. Selection rule: expected cluster geometry and noise pick the family; downstream decision utility picks the winner.
96. How does the k-Means algorithm work?
Answer. After initializing k centroids, k-means alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points; repeat until assignments stabilize. Each iteration monotonically decreases within-cluster sum of squares, so it always converges—but to a local optimum, hence k-means++ initialization and multiple restarts. Cost per iteration is O(n·k·d); scale features first, since the whole algorithm is Euclidean at heart.
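The two-step loop in numpy, as a sketch (random init for brevity; real code would use k-means++ and guard against empty clusters):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                                        # assignment: nearest center
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # update: cluster means
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels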
97. What are the limitations of k-Means clustering?
Answer. Three clusters of limitations. Geometry: it assumes convex, isotropic, similar-sized blobs, so it fails on elongated or nested shapes. Robustness: sensitive to initialization (local optima), outliers (means get dragged), and feature scaling. Specification: k must be chosen up front, and every point is forced into some cluster—there's no noise concept. Remedies, respectively: k-means++/restarts; k-medoids or trimming; DBSCAN or GMMs when the shape and noise assumptions break.
98. How do you choose the optimal number of clusters (k)?
Answer. My procedure: sweep k over a range and score each clustering with (1) the elbow of within-cluster sum of squares, (2) average silhouette, and (3) the gap statistic (WCSS against a null reference distribution); check stability by re-clustering on resamples; then weigh downstream utility—does k = 8 segment users more actionably than k = 5? No single criterion is decisive; agreement between two plus a domain sanity check is my stopping rule.
99. What is the Elbow method?
Answer. Plot within-cluster sum of squares (inertia) against k: WCSS always falls as k grows, but the curve bends where extra clusters stop buying much compactness—that bend is the chosen k. It's a fast heuristic, but on real data the elbow is often ambiguous or absent, so I corroborate with silhouette or the gap statistic rather than trusting the eyeball alone.
100. What is hierarchical clustering? What are agglomerative and divisive approaches?
Answer. Hierarchical clustering builds a dendrogram—a nested tree of clusters—that you cut at whatever granularity you need, so no upfront k. Agglomerative (the common one): start with every point as its own cluster and repeatedly merge the closest pair under a linkage rule (single, complete, average, Ward). Divisive: start with one cluster and recursively split—rarer and more expensive. Costs: the O(n²) distance matrix limits scale, and the linkage choice drives cluster shape (single linkage chains, Ward makes blobs).
101. What is DBSCAN? How does it differ from k-Means?
Answer. DBSCAN grows clusters from density: a core point has at least min_samples neighbors within radius ε; clusters are maximal sets of density-connected points, and everything else is labeled noise. Versus k-means: no k to specify, arbitrary cluster shapes, built-in outlier handling—but two density parameters to tune, trouble when clusters differ widely in density, and ε loses meaning in high dimensions. HDBSCAN relaxes the single-density assumption.
102. What is Gaussian Mixture Models (GMM) clustering?
Answer. A GMM models the data as a weighted sum of Gaussians; clustering is soft—each point gets a posterior responsibility over components rather than one hard label. It generalizes k-means (the limiting case of equal spherical covariances with hard assignments) and handles elliptical clusters plus uncertainty. Fit by EM; choose the component count with BIC/AIC. Watch for covariance singularities—a component collapsing onto a few points—and regularize the covariances.
103. What is the EM (Expectation-Maximization) algorithm?
Answer. EM maximizes likelihood in the presence of latent variables by alternating: E-step—compute the posterior over the latents given current parameters (e.g., cluster responsibilities); M-step—re-estimate parameters by maximizing the expected complete-data log-likelihood under those posteriors. Each iteration never decreases the likelihood, guaranteeing convergence to a local optimum—so initialize well and restart. It's the engine under GMMs, HMM training (Baum–Welch), and missing-data estimation.
104. What is dimensionality reduction? Why is it needed?
Answer. Dimensionality reduction maps high-dimensional data to fewer dimensions while preserving the structure that matters—variance (PCA), class separation (LDA), neighborhoods (t-SNE/UMAP). Why: fight the curse of dimensionality, cut compute and storage, denoise, drop redundant correlated features, and enable visualization. Two flavors: feature selection keeps original columns (interpretability); feature extraction builds new ones (power). The silent failure is compressing away exactly the signal the downstream task needs.
105. What is Principal Component Analysis (PCA)?
Answer. PCA finds orthogonal directions of maximal variance—the eigenvectors of the covariance matrix, ordered by eigenvalue—and projects data onto the top few. It's the optimal linear compression in the least-squares sense; keep components by cumulative explained variance (say 95%). Musts and caveats: center the data, usually standardize (else high-variance features dominate); and it's linear and unsupervised—the top-variance direction isn't necessarily the most predictive one.
106. How does PCA work mathematically?
Answer. Mechanics: (1) center X (optionally standardize); (2) form the covariance matrix C = XᵀX/(n−1); (3) eigendecompose C = VΛVᵀ—eigenvectors are the principal components, eigenvalues their variances; (4) project Z = XV_k onto the top k; explained variance ratio is λᵢ/Σλ. In practice, compute via the SVD of the centered matrix, X = USVᵀ, where V holds the components and λᵢ = sᵢ²/(n−1)—numerically stabler than forming C explicitly.
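A sketch of the SVD route (synthetic data; keeping 2 components is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

Xc = X - X.mean(axis=0)                            # center: PCA is about the covariance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # SVD instead of forming X^T X
components = Vt[:2]                                # top-2 principal directions
Z = Xc @ components.T                              # projected scores
lam = S**2 / (len(X) - 1)                          # eigenvalues of the covariance
print(lam[:2] / lam.sum())                         # explained-variance ratios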
107. What is the difference between PCA and LDA (Linear Discriminant Analysis)?
Answer. PCA is unsupervised: it maximizes retained variance and ignores labels. LDA is supervised: it maximizes between-class scatter relative to within-class scatter, yielding at most K−1 discriminant directions for K classes. Decision rule: visualization, compression, or denoising without labels → PCA; a low-dimensional space for classification with labels available → LDA (or PCA first to denoise, then LDA). The classic trap: PCA can discard exactly the low-variance direction that separates the classes.
108. What is t-SNE? When is it used?
Answer. t-SNE embeds high-dimensional points into 2-D/3-D by matching neighbor probabilities: Gaussian similarities in the original space, heavy-tailed Student-t in the embedding (which relieves crowding), minimizing the KL divergence between the two. Use: visualizing embeddings and cluster structure. Hard limits: it preserves local neighborhoods, not global geometry—inter-cluster distances and cluster sizes are not meaningful; perplexity reshapes the picture; there's no natural transform for new points; and its output shouldn't feed downstream models.
109. What is UMAP? How does it compare to t-SNE?
Answer. UMAP builds a weighted k-nearest-neighbor graph and optimizes a low-dimensional layout that preserves its fuzzy topological structure. Versus t-SNE: substantially faster, scales to millions of points, arguably preserves more global structure, and supports transforming new points. It shares the core caveats—distances and densities in the embedding remain untrustworthy, and n_neighbors/min_dist reshape the plot—so I treat both as exploratory lenses, not feature generators.
110. What is Singular Value Decomposition (SVD)?
Answer. SVD factors any real matrix as X = UΣVᵀ: orthonormal left/right singular vectors and non-negative singular values. Truncating to the top k gives the best rank-k approximation in Frobenius norm (Eckart–Young)—the backbone of PCA (SVD of the centered matrix), LSA on term-document matrices, and low-rank compression/denoising. Numerically it's the stable route to least squares and pseudo-inverses, avoiding the conditioning problems of forming XᵀX.
111. What is matrix factorization? How is it used in recommender systems?
Answer. Matrix factorization approximates the sparse user–item rating matrix R ≈ UVᵀ, learning latent vectors for users and items whose dot products predict missing entries—made famous by the Netflix Prize. Train by minimizing squared error over observed entries plus L2 (via SGD or ALS), usually with user/item bias terms. It's collaborative filtering: taste dimensions emerge without supervision. Limits: cold-start users and items need content features or hybrid models; implicit-feedback variants reweight the unobserved entries.
112. What is Latent Dirichlet Allocation (LDA) in topic modeling?
Answer. LDA the topic model is generative: each document holds a Dirichlet-distributed mixture of topics, each topic is a distribution over words, and every word is drawn by sampling a topic then a word from it. Inference (variational or Gibbs) recovers topic-word and document-topic distributions from raw counts. Uses: corpus exploration, soft document features. Caveats: bag-of-words assumption, a topic count to pick, and unstable topics across runs—inspect before shipping. Not to be confused with Linear Discriminant Analysis.
113. What is the difference between hard and soft clustering?
Answer. Hard clustering assigns each point to exactly one cluster (k-means); soft clustering assigns membership probabilities across clusters (GMM responsibilities, fuzzy c-means). Soft wins when boundaries are genuinely ambiguous or downstream logic can use the uncertainty (route borderline users to both campaigns); hard wins for simplicity and crisp operational actions. You can always harden a soft assignment by argmax—the information loss only runs one way.
OPTIMIZATION & GRADIENT DESCENT
114. What is gradient descent? Explain the intuition.
Answer. Gradient descent iteratively moves parameters opposite the loss gradient to reduce empirical risk. Intuition: locally linearize the loss, take a step sized by the learning rate, and repeat until a convergence criterion is met. I mention stochasticity—mini-batch gradient noise (vs full-batch) can help escape sharp minima—and the need for schedules and clipping in deep nets.
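A minimal sketch on least squares, where the exact gradient is available (synthetic data; the learning rate is chosen small enough for this particular problem):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w, lr = np.zeros(3), 0.1
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ w - y)  # gradient of mean squared error
    w -= lr * grad                            # step opposite the gradient
print(w)                                      # ≈ [1, -2, 0.5]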
115. What is the difference between batch, stochastic, and mini-batch gradient descent?
Answer. Batch GD computes the exact gradient over the whole dataset each step: stable but expensive and memory-hungry. Stochastic GD uses a single example: cheap and noisy—the noise helps escape poor regions but demands decaying learning rates. Mini-batch (the practical default, 32–1024) averages a small batch: it vectorizes on accelerators, tames gradient variance, and keeps useful stochasticity. Decision rule: batch size trades gradient variance against hardware utilization; rescale the learning rate when you rescale the batch.
116. What is the learning rate? How does it affect training?
Answer. The learning rate scales each gradient step—the single most sensitive hyperparameter in deep learning. Too high: the loss diverges or oscillates; too low: crawling convergence and a tendency to settle early. The symptoms map cleanly: NaN or exploding loss → cut it; a flat loss that barely moves → raise it. Practice: sweep on a log scale (1e-4 to 1e-1), pair with schedules (warm-up, cosine decay), and retune whenever batch size or architecture changes.
117. What are saddle points and local minima in optimization?
Answer. Critical points cluster into: local minima—all directions curve upward (zero gradient, positive-semidefinite Hessian); saddle points—up in some directions, down in others; and plateaus—near-zero gradients over wide regions. In high-dimensional deep nets, saddles and plateaus dominate—a random critical point with all-positive curvature is exponentially unlikely—and many local minima turn out empirically near-equivalent in quality. That's why SGD noise, momentum, and adaptive methods matter: they escape saddles that stall plain gradient descent.
118. What is momentum in gradient descent?
Answer. Momentum accumulates an exponential moving average of gradients—v ← βv + g, then w ← w − ηv—so steps build speed along persistently downhill directions and cancel oscillations across ravines of ill-conditioned curvature. β ≈ 0.9 is standard. Physics intuition: a heavy ball rolling downhill rather than a massless point recomputing direction each step. It also helps coast through small plateaus and gradient noise.
119. What is the Nesterov Accelerated Gradient (NAG)?
Answer. Nesterov momentum evaluates the gradient at the look-ahead position w − ηβv instead of at w, then corrects: it peeks where momentum is about to carry the ball and brakes earlier near a minimum. The change yields better convergence guarantees on smooth convex problems (O(1/T²) vs O(1/T)) and modest practical gains; frameworks expose it as nesterov=True on SGD.
120. What is AdaGrad? What problem does it solve?
Answer. AdaGrad divides each parameter's learning rate by the root of its accumulated squared gradients: rarely updated, informative features get large steps; frequent ones get small steps—built for sparse problems (NLP embeddings, click features). The flaw it introduced: the accumulator only grows, so effective learning rates decay toward zero and training stalls on long runs—exactly the problem RMSProp and Adam fixed with exponential averaging.
121. What is RMSProp?
Answer. RMSProp replaces AdaGrad's ever-growing sum of squared gradients with an exponential moving average, so the per-parameter step adapts to recent gradient scale instead of decaying to zero. That keeps training alive on non-stationary deep-learning objectives; it was the standard RNN optimizer before Adam absorbed the same idea and added momentum.
122. What is the Adam optimizer? Why is it popular?
Answer. Adam adapts per-parameter steps using momentum and second-moment estimates of gradients—fast progress on noisy, ill-conditioned losses. It’s popular as a default in deep learning (often AdamW with decoupled weight decay) because it reduces manual LR tuning versus vanilla SGD, at the cost of extra memory for moments and occasional edge cases where SGD+momentum generalizes better if tuned long enough.
123. What is AdamW? How does it differ from Adam?
Answer. AdamW decouples weight decay from the adaptive update. Adam's usual "weight decay" is an L2 term added to the gradient, which then gets rescaled by the second-moment denominator and stops acting like uniform regularization; AdamW instead shrinks the weights directly by λθ each step. Result: decay behaves like true regularization independent of gradient scale, tunes more predictably, and is the default for Transformer training.
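A single-tensor sketch of the step; the point is the last line, where decay is applied to the weights directly instead of being folded into g (hyperparameter values are the common defaults, shown for illustration):

    import numpy as np

    def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
        m = b1 * m + (1 - b1) * g              # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g          # second moment (scale)
        m_hat = m / (1 - b1 ** t)              # bias corrections
        v_hat = v / (1 - b2 ** t)
        w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decoupled decay
        return w, m, v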
124. What is the learning rate schedule? Name common strategies.
Answer. A schedule changes the learning rate over training instead of holding it fixed. Common strategies cluster into decay (step, exponential, linear, cosine annealing), adaptive (reduce-on-plateau), and cyclical (warm restarts, one-cycle), usually preceded by a warm-up phase. Typical Transformer recipe: linear warm-up, then cosine or linear decay; for fine-tuning I keep peaks low and decay fast.
125. What is warm-up in learning rate scheduling?
Answer. Warm-up starts training at a tiny learning rate and ramps it up over the first steps (hundreds to thousands). Early in training, Adam's moment estimates and normalization statistics are unreliable, so full-size steps can destabilize or permanently damage the run; warm-up protects that window. It matters most for large batches and Transformers; the symptom of skipping it is an early loss spike or outright divergence.
126. What is gradient clipping? Why is it used?
Answer. Gradient clipping caps the gradient (usually its global norm, sometimes per value) at a threshold before the update, so one exploding step cannot wreck the weights. It is standard for RNNs and large Transformer training, where rare batches produce huge gradients; clip-by-norm preserves direction. The pitfall is using aggressive clipping to hide a real problem such as a bad learning rate or corrupted data.
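Placement matters: clip after backward, before the optimizer step. A PyTorch sketch (max_norm=1.0 is a common but illustrative choice):

    import torch

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                          # compute gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # then clip by global norm
    opt.step()                                               # then update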
127. What is the vanishing gradient problem?
Answer. Gradients shrink multiplicatively as they propagate backward: if each layer's Jacobian has norm below one (saturating sigmoid/tanh units, bad init), the product across many layers or time steps decays toward zero and early layers stop learning. Symptom: flat loss with tiny gradient norms near the input. Fixes: ReLU-family or GELU activations, variance-preserving initialization, normalization layers, residual connections, and gating (LSTM/GRU) for recurrence.
128. What is the exploding gradient problem?
Answer. The mirror image: when layer Jacobians have norm above one, gradients grow multiplicatively through depth or time, producing loss spikes, NaNs, and destroyed weights. It is common in RNNs and very deep or large-batch training. Fixes: gradient clipping, lower learning rate with warm-up, careful initialization, normalization; I also check for outlier data before blaming the optimizer.
129. What is the difference between convex and non-convex optimization?
Answer. In convex optimization every local minimum is global, so gradient methods come with convergence guarantees; linear/logistic regression and SVMs live here. Deep learning is non-convex: no global guarantees, saddles and plateaus everywhere, yet overparameterized nets trained with SGD reliably find low, well-generalizing minima in practice. Decision rule: when I need auditability, guarantees, or tiny-data robustness, I prefer a convex formulation; I go non-convex when representational capacity pays for itself.
130. What is early stopping? How does it prevent overfitting?
Answer. Early stopping monitors a validation metric during training and halts when it stops improving for a patience window, restoring the best checkpoint. It prevents overfitting by limiting the number of optimization steps, which acts as implicit regularization on effective capacity. Pitfalls: a noisy validation metric triggers premature stops (use patience and a min-delta), and the validation set must be trustworthy or you stop on the wrong signal.
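A patience-loop sketch; train_one_epoch, evaluate, model, the loaders, and max_epochs are hypothetical stand-ins for your own training code:

    import copy

    best, bad_epochs, patience = float("inf"), 0, 5
    best_state = None
    for epoch in range(max_epochs):              # max_epochs: your budget
        train_one_epoch(model, train_loader)     # hypothetical helper
        score = evaluate(model, val_loader)      # e.g., validation loss
        if score < best - 1e-4:                  # min-delta guards metric noise
            best, bad_epochs = score, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)            # restore the best checkpoint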
NEURAL NETWORKS & DEEP LEARNING
131. What is a neural network? Describe its basic architecture.
Answer. A neural network is layered differentiable functions: linear maps, nonlinearities, and often normalization/dropout, composed to approximate a target function. Training is end-to-end backprop. Architecturally: MLP for tabular, CNN for spatial locality, RNN/Transformer for sequences, GNN for graphs—pick structure matching invariances in the domain.
132. What is the role of an activation function?
Answer. Activations introduce nonlinearity between linear layers; without them, any depth of stacked affine maps collapses into a single affine map, so the network could only learn linear functions. They also shape gradient flow (saturation kills it) and set output ranges for heads. The choice is mostly about optimization behavior: ReLU/GELU for hidden layers, sigmoid or softmax only where the output semantics demand them.
133. What are sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU, and Swish activations?
Answer. I cluster them. Saturating squashers: sigmoid maps to (0,1) and tanh to (−1,1); both starve gradients at the tails, so they survive mainly in gates and output heads. Piecewise linear: ReLU, max(0, x), is cheap and sparse but units can die; Leaky ReLU adds a small negative slope to keep gradient alive. Smooth variants: ELU saturates softly below zero; GELU (x·Φ(x)) and Swish (x·sigmoid(x)) are smooth and slightly non-monotonic near zero, the Transformer-era defaults. In practice I pick ReLU for cheap inference and GELU for Transformer blocks.
134. What is the dying ReLU problem? How is it addressed?
Answer. A ReLU unit whose pre-activation goes negative for essentially all inputs outputs zero, receives zero gradient, and never recovers: it is dead. Causes: too-high learning rate, poor initialization, or large negative bias drift. Fixes: Leaky ReLU/ELU/GELU, He initialization, lower learning rates, and normalization; I diagnose it by tracking the fraction of always-zero activations per layer.
135. What is backpropagation? Explain the algorithm.
Answer. Backprop applies the chain rule backwards through the computation graph to accumulate ∂L/∂w. The forward pass stores activations; the backward pass propagates gradients layer by layer, reusing them. Implementation-wise I mention autograd, activation checkpointing to trade compute for memory, and that graph structure enables hardware-friendly fused kernels; operationally, NaNs send me first to the learning rate, loss scaling, and normalization layers.
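A hand-rolled sketch for a two-layer ReLU MLP on squared error (shapes and data illustrative); the backward lines are exactly the chain rule reusing stored activations:

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
    W1, W2 = 0.5 * rng.normal(size=(3, 8)), 0.5 * rng.normal(size=(8, 1))

    # forward: store intermediates
    z1 = x @ W1
    a1 = np.maximum(z1, 0)                 # ReLU
    yhat = a1 @ W2
    loss = ((yhat - y) ** 2).mean()

    # backward: chain rule, layer by layer
    dyhat = 2 * (yhat - y) / len(y)
    dW2 = a1.T @ dyhat                     # uses stored a1
    da1 = dyhat @ W2.T
    dz1 = da1 * (z1 > 0)                   # ReLU gate from stored z1
    dW1 = x.T @ dz1                        # uses stored x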
136. What is the chain rule and how is it used in backpropagation?
Answer. The chain rule says the derivative of a composition multiplies through: d/dx f(g(x)) = f′(g(x))·g′(x). Backpropagation is the chain rule organized over the computation graph in reverse topological order, reusing each node's intermediate gradient so the full gradient costs about one extra forward pass (reverse-mode automatic differentiation). That reuse is the entire trick; naive symbolic differentiation would be exponentially wasteful.
137. What is weight initialization? What are Xavier and He initialization?
Answer. Initialization sets weight scales so activations and gradients keep roughly unit variance through depth; get it wrong and signals vanish or explode before training starts. Xavier/Glorot draws weights with variance about 1/fan_avg, derived for tanh/sigmoid-like units; He initialization uses 2/fan_in to compensate for ReLU zeroing half its inputs. Residual connections and normalization make the choice more forgiving, but bad init still shows up as a dead first epoch.
138. What is batch normalization? How does it help training?
Answer. Batch norm standardizes each feature to zero mean and unit variance over the mini-batch, then applies a learned scale γ and shift β. It smooths optimization, permits higher learning rates, and adds mild regularization via batch noise. Pitfalls I always name: small batches make the statistics noisy, train/serve behavior differs (running averages at inference), and it interacts awkwardly with sequence models and with dropout placement.
139. What is layer normalization? When is it preferred over batch normalization?
Answer. Layer norm standardizes across the feature dimension within each example, so it is independent of batch size and identical at train and inference time. It is preferred over batch norm in Transformers and RNNs, with small or variable batch sizes, and in autoregressive decoding where batch statistics are unavailable or unstable.
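The whole distinction is an axis choice; a NumPy sketch on a (batch, features) activation (eps value illustrative):

    import numpy as np

    x = np.random.default_rng(0).normal(size=(4, 6))
    eps = 1e-5
    # batch norm: per-feature statistics across the batch (axis 0)
    bn = (x - x.mean(0)) / np.sqrt(x.var(0) + eps)
    # layer norm: per-example statistics across features (axis 1)
    ln = (x - x.mean(1, keepdims=True)) / np.sqrt(x.var(1, keepdims=True) + eps)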
140. What is dropout? How does it prevent overfitting?
Answer. Dropout randomly zeroes each unit with probability p during training (with rescaling so expectations match at inference), forcing redundancy instead of co-adapted features; it approximates ensembling an exponential family of subnetworks. It is off at inference. In modern nets it matters most in large fully connected heads and in fine-tuning; I check its interaction with batch norm and reduce it when data is plentiful.
141. What is a Convolutional Neural Network (CNN)?
Answer. A CNN applies learned convolutional filters that slide over the input, exploiting spatial locality and translation equivariance with shared weights. Stacked conv blocks with downsampling grow the receptive field hierarchically: edges to textures to parts to objects. Versus dense layers, parameter sharing slashes parameter count and is the inductive bias that makes image learning tractable.
142. What are convolution, pooling, and fully connected layers?
Answer. Convolution layers extract local patterns with shared filters, producing feature maps. Pooling (max or average) downsamples those maps, buying translation tolerance and compute savings. Fully connected layers mix global features for the final decision; modern architectures often replace them with global average pooling to cut parameters. The design story: local feature extraction, spatial summarization, then global reasoning.
143. What is the receptive field in a CNN?
Answer. The receptive field of an activation is the input region that can influence it; it grows with depth, kernel size, stride, and dilation. It matters because a model whose receptive field is smaller than the structures it must recognize literally cannot see them; the production symptom is failure on large objects or long-range context, fixed with deeper stacks, dilation, or attention.
144. What is stride and padding in convolutions?
Answer. Stride is the step the filter takes across the input: stride 1 preserves resolution, stride above 1 downsamples. Padding adds a border (typically zeros) so filters can cover edges and output size is controlled: "same" padding keeps spatial dimensions, "valid" uses none and shrinks the map. Together they determine each layer's output geometry, which I sanity-check because shape bugs are the most common CNN implementation error.
145. What is depth-wise separable convolution?
Answer. Depthwise separable convolution factors a standard convolution into a depthwise step (one spatial filter per channel) followed by a 1×1 pointwise convolution that mixes channels. Compute and parameters drop by roughly the kernel area, with a modest accuracy cost; it is the core trick in MobileNet and Xception and my default when targeting edge or latency-constrained deployment.
146. What is a Recurrent Neural Network (RNN)?
Answer. An RNN processes a sequence step by step, updating a hidden state h_t = f(h_{t−1}, x_t) with shared weights, so context accumulates across time and variable lengths are handled naturally. Weaknesses: sequential computation resists parallelization, and vanishing gradients limit long-range memory, which is why gated variants and then Transformers displaced vanilla RNNs for most workloads.
147. What is the vanishing gradient problem in RNNs?
Answer. Backpropagation through time multiplies the same recurrent Jacobian across every step, so for long sequences gradients shrink geometrically whenever its norm is below one; distant inputs then have vanishing influence on the loss and the model learns only short-range structure. Mitigations: gated cells (LSTM/GRU), careful initialization (e.g., orthogonal recurrent weights), truncated BPTT, and ultimately attention, which gives constant-length gradient paths.
148. What is an LSTM? How does it solve the vanishing gradient problem?
Answer. An LSTM adds a cell state regulated by input, forget, and output gates. The cell update is additive, c_t = f_t⊙c_{t−1} + i_t⊙ĉ_t, so gradient can flow along the cell with little attenuation when the forget gate stays open; that near-constant-error path is what defeats vanishing gradients. Practical notes: initialize the forget-gate bias positive, and keep gradient clipping, since exploding is still possible.
149. What is a GRU? How does it compare to LSTM?
Answer. A GRU merges LSTM's gating into update and reset gates and drops the separate cell state, giving fewer parameters and faster training with accuracy usually on par. My rule of thumb: GRU for smaller datasets or tighter budgets, LSTM when I need the extra capacity, and a Transformer when data and compute justify it.
150. What is a bidirectional RNN?
Answer. A bidirectional RNN runs one recurrence forward and one backward over the sequence and concatenates their states, so every position sees both left and right context. It is strong for tagging and encoding tasks (NER, speech) but inapplicable to causal generation or strict streaming, since it needs the full sequence before producing output.
151. What are skip connections / residual connections?
Answer. A residual connection adds a block's input to its output, y = x + F(x), so the block learns a perturbation of identity rather than a full mapping. The identity path gives gradients a direct route through depth, curing the degradation problem that capped plain deep stacks (ResNet) and enabling hundred-plus-layer models; Transformers wrap every attention and feed-forward block this way. Skip connections that concatenate instead of add (U-Net, DenseNet) serve the same gradient-flow and feature-reuse purpose.
152. What is transfer learning? How is it used in deep learning?
Answer. Transfer learning reuses representations learned on a large source task for a target task with less data: pretrain on ImageNet or web-scale text, then adapt to your domain. It collapses data and compute requirements and is the default workflow in vision and NLP. Watch the domain gap: when source and target differ badly, features transfer poorly and can even hurt, which I detect with a frozen-feature baseline before investing in fine-tuning.
153. What is fine-tuning vs feature extraction in transfer learning?
Answer. Feature extraction freezes the pretrained backbone and trains only a new head: cheap, stable, hard to overfit with small data. Fine-tuning unfreezes some or all backbone weights at a low learning rate: higher ceiling, but risks overfitting and catastrophic forgetting. Decision rule: small dataset plus similar domain pushes me to freeze; more data or a larger domain shift pushes me to progressive unfreezing with discriminative learning rates.
154. What is a Generative Adversarial Network (GAN)?
Answer. A GAN trains two networks adversarially: a generator maps noise to samples, a discriminator classifies real versus generated, and the generator is optimized to fool it; at the (rarely reached) equilibrium the generated distribution matches the data. Strength: sharp samples. Costs: no explicit likelihood and notoriously unstable training with mode collapse, which is a big part of why diffusion models displaced GANs for most generation tasks.
155. What is a Variational Autoencoder (VAE)?
Answer. A VAE is a probabilistic autoencoder: the encoder outputs a distribution over latent variables, the decoder reconstructs data from latent samples, and training maximizes the ELBO, a reconstruction term minus a KL penalty pulling the posterior toward the prior. You get a principled likelihood-based model with a smooth, sampleable latent space; the trade is blurrier samples than GANs or diffusion, and posterior collapse is the classic failure mode.
156. What is the reparameterization trick in VAEs?
Answer. The ELBO requires gradients through a sampling step, and sampling z ~ N(μ, σ²) is not differentiable in μ and σ. The trick rewrites the sample as z = μ + σ⊙ε with ε ~ N(0, I): the randomness moves into an input, the transformation becomes deterministic and differentiable, and backprop flows into the encoder. It is the low-variance gradient estimator that makes VAEs trainable end to end.
157. What is a diffusion model?
Answer. A diffusion model defines a forward process that gradually corrupts data with Gaussian noise over many steps and trains a network to reverse it, typically by predicting the added noise at each noise level. Generation starts from pure noise and iteratively denoises. Training is a stable regression objective and image quality is state of the art; the cost is slow many-step sampling, mitigated by better samplers and distillation.
158. What is a mixture of experts (MoE) model?
Answer. An MoE layer holds many expert subnetworks and a learned router that sends each token to only a few of them (top-k routing), so parameter count scales far beyond per-token compute. That sparsity is how very large sparse LLMs keep training and inference FLOPs manageable. Engineering pain points: load balancing so experts are evenly used (auxiliary losses), routing instability, and the memory and communication cost of hosting all experts at serving time.
159. What is Neural Architecture Search (NAS)?
Answer. NAS automates architecture design: define a search space of operations and wirings, then search it with reinforcement learning, evolution, or gradient-based relaxation (DARTS), evaluating candidates with proxies like weight sharing or early stopping. It produced the EfficientNet-class families, but the honest take is that search is expensive and most gains come from a well-designed search space and training recipe; I reach for proven architectures unless hardware-specific constraints justify the spend.
160. What is AutoML?
Answer. AutoML automates parts of the modeling pipeline: preprocessing, feature handling, model selection, hyperparameter optimization, ensembling, and sometimes NAS. It is excellent for fast strong baselines and for democratizing tabular ML. It does not own the parts that fail in production: problem framing, leakage audits, label quality, and deployment constraints, so I treat AutoML output as a candidate, not a conclusion.
NATURAL LANGUAGE PROCESSING (NLP)
161. What is tokenization? What are word-level, sub-word, and character-level tokenization?
Answer. Tokenization splits raw text into model units—words, subwords (BPE/Unigram), or characters. Subword balances vocabulary size vs unknown-token risk and is standard for Transformers. I stress tokenizer coupling: changing tokenization changes the dataset distribution; train/serve must use the same vocabulary and normalization rules.
162. What is Byte Pair Encoding (BPE)?
Answer. BPE builds a subword vocabulary bottom-up: start from characters (or bytes), count adjacent symbol pairs in the corpus, merge the most frequent pair into a new token, and repeat until the target vocabulary size. Frequent words end up as single tokens while rare words decompose into pieces, and byte-level BPE makes every string representable with no unknown tokens; it is the tokenizer family behind the GPT line.
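A toy version of the merge loop (corpus and merge count are illustrative; real tokenizers add byte fallback, cached pair counts, and special tokens):

    from collections import Counter

    def merge(seq, a, b, ab):
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(ab); i += 2
            else:
                out.append(seq[i]); i += 1
        return out

    words = ["low", "lower", "lowest", "newest", "widest"]
    tokens = [list(w) + ["</w>"] for w in words]
    for _ in range(10):                          # learn 10 merges
        pairs = Counter(p for t in tokens for p in zip(t, t[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]      # most frequent adjacent pair
        tokens = [merge(t, a, b, a + b) for t in tokens]
    print(tokens)                                # frequent substrings have fused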
163. What is TF-IDF?
Answer. TF-IDF weights a term by how often it appears in a document, discounted by how many documents contain it: w(t, d) = tf(t, d) · log(N / df(t)). Common-everywhere words get crushed; discriminative words get boosted. With a linear model on top it remains a strong, cheap, interpretable baseline for search and text classification, and it is the first thing I benchmark a neural text system against.
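The baseline is a few lines with scikit-learn (toy corpus shown for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "attention is all you need"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)        # sparse (n_docs, n_terms) matrix
    print(vec.get_feature_names_out())
    print(X.toarray().round(2))        # rare terms carry the heaviest weights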
164. What is a bag of words model?
Answer. Bag of words represents a document as (possibly weighted) counts over a fixed vocabulary, discarding word order entirely. It is simple, fast, and interpretable, and with TF-IDF weighting plus a linear classifier it is surprisingly competitive on topical tasks. Limits: no context or order ("not good" versus "good"), vocabulary explosion, and zero generalization across related words, which is exactly what embeddings fixed.
165. What is word2vec? What are CBOW and Skip-Gram models?
Answer. word2vec learns static word embeddings from local co-occurrence with a shallow network. CBOW predicts the center word from averaged context (fast, favors frequent words); Skip-Gram predicts context words from the center word (slower, better for rare words). Negative sampling makes training cheap at scale. The famous payoff is linear structure (king − man + woman ≈ queen); the limit is one vector per word type regardless of context.
166. What is GloVe? How is it different from word2vec?
Answer. GloVe fits embeddings to global corpus statistics: a weighted least-squares regression making vector dot products match log co-occurrence counts, versus word2vec's incremental prediction over local windows. In practice the two deliver similar quality, and both are static embeddings superseded by contextual models for most tasks; they survive in lightweight pipelines and as initializations.
167. What is FastText? How does it handle out-of-vocabulary words?
Answer. FastText represents each word as a bag of character n-grams plus the word itself and sums their vectors. Out-of-vocabulary or misspelled words still get a composed embedding from their subword pieces, which word2vec and GloVe cannot offer; that makes it strong for morphologically rich languages and noisy user text, and it remains a very fast text-classification baseline.
168. What are contextual word embeddings?
Answer. Contextual embeddings give each token a vector that depends on its sentence: "bank" in "river bank" versus "bank account" gets different representations, where word2vec/GloVe assign one static vector per type. ELMo introduced the idea with bidirectional LMs; BERT-style Transformers made it standard. The production note: embeddings from different models or versions are not comparable, so I version embedding models like any other artifact.
169. What is named entity recognition (NER)?
Answer. NER locates and classifies entity spans (people, organizations, locations, dates, domain entities) in text, usually framed as BIO sequence labeling over a Transformer encoder, sometimes with a CRF layer for consistent span decoding. Production pain points: domain-specific entity types need annotation, nested and overlapping entities break the flat scheme, and tokenizer-to-character offset alignment is a perennial bug source.
170. What is part-of-speech (POS) tagging?
Answer. POS tagging labels each token with its grammatical category (noun, verb, adjective, and so on). It is a classic sequence-labeling task, solved first with HMMs and CRFs and now near ceiling with neural encoders on standard benchmarks. Today it is mostly a preprocessing or feature step for parsing and rule-based components rather than an end goal.
171. What is dependency parsing?
Answer. Dependency parsing produces a directed tree over a sentence's words where edges are grammatical relations (subject, object, modifier) from heads to dependents. Transition-based parsers build the tree with shift-reduce actions; graph-based parsers score all candidate edges and decode a maximum spanning tree. It is useful for relation extraction and query understanding, though end-to-end neural models have absorbed many of its former uses.
172. What is sentiment analysis? How is it done?
Answer. Sentiment analysis classifies the polarity (and sometimes aspects or intensity) of text. The lineage: sentiment lexicons, then TF-IDF plus linear models, now fine-tuned Transformer encoders or LLM prompting. Failure modes I call out: negation and sarcasm, domain shift (product reviews to support tickets), and subjective labels, so I invest in annotation guidelines and slice-level evaluation as much as in the model.
173. What is machine translation? What are its approaches?
Answer. Machine translation maps text between languages. Approaches in order: rule-based systems, statistical phrase-based MT, neural seq2seq with attention, then Transformer NMT, and now LLMs that translate as one capability among many. Evaluation combines automatic metrics (BLEU, learned metrics like COMET) with human review; the hard parts in production are low-resource pairs, named entities and numbers, and domain terminology.
174. What is the BLEU score?
Answer. BLEU scores a translation by modified n-gram precision (usually up to 4-grams) against reference translations, multiplied by a brevity penalty so truncated outputs cannot cheat. It is corpus-level, cheap, and reproducible, which made it the standard; but it is insensitive to meaning-preserving paraphrase and correlates only loosely with human judgment, so I treat it as a trend metric within one setup, not an absolute quality claim.
175. What is text summarization? What are extractive and abstractive approaches?
Answer. Summarization condenses a document while preserving key content. Extractive methods select existing sentences or spans: faithful by construction but often choppy. Abstractive methods generate new phrasing with seq2seq models or LLMs: fluent and flexible but exposed to hallucination. ROUGE gives a rough automatic signal; for production abstractive systems I add faithfulness checks (entailment- or QA-based), because fluency without grounding is the dangerous failure.
176. What is question answering in NLP?
Answer. QA systems answer natural-language questions. Variants: extractive QA selects an answer span from a given passage (SQuAD-style); open-domain QA adds a retrieval stage over a corpus; generative QA produces free-form answers, today usually an LLM with retrieval (RAG). Metrics split accordingly: exact match and F1 for spans, groundedness and citation accuracy for generated answers, with retrieval recall as the upstream bottleneck I check first.
TRANSFORMERS & ATTENTION MECHANISMS
177. What is the attention mechanism in neural networks?
Answer. Attention computes a weighted mixture of values based on similarity between queries and keys—it's a differentiable, content-based lookup. Scaled dot-product attention uses softmax(QKᵀ/√dₖ)V. I emphasize permutation invariance without position, why scaling prevents softmax saturation, and that multi-head attention learns complementary relationship patterns.
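The formula in a few lines of NumPy (shapes illustrative; one head, no mask):

    import numpy as np

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # scaled similarity logits
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w = w / w.sum(-1, keepdims=True)                   # row-wise softmax
        return w @ V                                       # weighted mixture of values

    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(5, 16))                   # self-attention: same source
    out = attention(Q, K, V)                               # shape (5, 16)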
178. What is self-attention?
Answer. Self-attention is attention where queries, keys, and values all come from the same sequence: every token mixes information from every other token, weighted by content similarity. The payoff is constant path length between any two positions (versus O(n) in an RNN), which is why long-range dependencies become learnable; the bill is O(n²) compute and memory in sequence length.
179. What is the Transformer architecture? Describe its key components.
Answer. A Transformer stacks self-attention and position-wise feed-forward blocks with residuals and norm. Encoder stacks can be bidirectional (BERT-style); decoder stacks use causal masks for autoregressive LMs. I close with complexity—O(n²) attention over length, KV-cache memory at decode—and the product tie-in: batching/streaming changes latency profiles.
180. What is multi-head attention? Why is it beneficial?
Answer. Multi-head attention runs several attention operations in parallel on learned lower-dimensional projections of Q, K, and V, then concatenates and projects the results. Heads specialize in different relations (local syntax, long-range links, positional patterns), giving the model multiple representation subspaces at roughly the cost of one full-width attention. Interpretability caveat: head roles are suggestive rather than guaranteed, and many heads can be pruned with little loss.
181. What is the role of positional encoding in Transformers?
Answer. Attention is permutation-invariant, so without positional information a Transformer sees a bag of tokens. Positional encodings inject order: fixed sinusoids or learned absolute embeddings added to token embeddings in the original design; relative schemes like RoPE (rotations applied to Q/K) and ALiBi (distance-based logit biases) in modern LLMs. The choice shapes length extrapolation, which is why long-context work keeps revisiting it.
182. What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers?
Answer. Encoder-only models (BERT) attend bidirectionally and excel at understanding: classification, tagging, embeddings. Decoder-only models (GPT) use a causal mask and generate; they are the simplest to scale and dominate LLMs. Encoder-decoder models (T5, translation systems) encode an input fully and cross-attend while generating: strong for tightly input-conditioned transduction. Decision rule: representations and classification push encoder-only; open-ended generation pushes decoder-only; faithful seq2seq transformation pushes encoder-decoder.
183. What is BERT? How is it pre-trained?
Answer. BERT is an encoder-only Transformer pretrained on large unlabeled corpora with masked language modeling (plus next sentence prediction in the original recipe), then fine-tuned with a small task head. Its contribution was showing that bidirectional pretraining transfers broadly; descendants like RoBERTa mostly tuned the recipe. It produces representations, not free-form text, which is the key contrast with GPT.
184. What is masked language modeling (MLM)?
Answer. MLM corrupts the input by masking roughly 15% of tokens and trains the model to reconstruct them from both sides of context, a denoising objective that forces bidirectional representations. The 80/10/10 recipe (mask, random token, keep) softens the pretrain/fine-tune mismatch, since [MASK] never appears downstream. Trade-off versus next-token prediction: richer per-token context, but only a fraction of positions yield loss signal per pass.
185. What is next sentence prediction (NSP)?
Answer. NSP was BERT's auxiliary objective: given a sentence pair (A, B), classify whether B actually follows A or was randomly sampled, intended to teach inter-sentence coherence for tasks like QA and inference. Follow-up work found it adds little: RoBERTa dropped it with no loss, and ALBERT replaced it with sentence-order prediction. I mention it as a lesson in validating that an auxiliary objective earns its compute.
186. What is GPT? How does it differ from BERT?
Answer. GPT is decoder-only, trained to predict the next token; BERT is encoder-focused on masked language modeling for rich contextual vectors. Practically GPT powers open-ended generation; BERT-style encoders power classification/retrieval encoders when generation isn’t required. Both depend on data curation, tokenizer, and fine-tuning recipe—architecture alone is never the full story.
187. What is the difference between autoregressive and autoencoding language models?
Answer. Autoregressive LMs (GPT) factorize p(x) into left-to-right next-token conditionals: generation-native and trained on every token, but each prediction sees only leftward context. Autoencoding LMs (BERT) corrupt and reconstruct text: bidirectional context and strong representations, but no natural generation story. Decision rule: generation or few-shot prompting pushes autoregressive; classification, tagging, and embedding quality per FLOP push autoencoding encoders.
188. What is scaled dot-product attention?
Answer. Scaled dot-product attention is Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V: dot products score query-key compatibility, softmax normalizes the scores into mixture weights, and the output is the weighted sum of values. The √dₖ divisor keeps logit variance roughly constant as dimension grows; without it, softmax saturates on large dot products and gradients through attention collapse.
189. What is the query, key, and value concept in attention?
Answer. Each token emits three projections: a query (what I am looking for), a key (what I advertise for matching), and a value (what I contribute if selected). Attention weights come from query-key similarity; the output is the weight-averaged values. Separating the matching representation (K) from the content representation (V) is the design point, and because the projections are learned, the model decides what "relevant" means.
190. What is cross-attention?
Answer. Cross-attention is attention where queries come from one sequence and keys/values from another: a decoder querying encoder outputs in translation, text queries attending image features in multimodal models, or a generator attending retrieved passages in RAG-style systems. It is the architectural bridge for conditioning, versus self-attention, which mixes a sequence with itself.
191. What is the computational complexity of self-attention?
Answer. Self-attention costs O(n²·d) time and O(n²) attention-matrix memory in sequence length n, plus O(n·d²) for the projections; the quadratic term dominates at long contexts. At autoregressive decode, the KV cache adds O(n·d) memory per layer, and memory bandwidth rather than FLOPs becomes the usual bottleneck. This is the root motivation for sparse attention, FlashAttention, GQA, and every long-context architecture.
192. What is sparse attention?
Answer. Sparse attention restricts each token to a structured subset of positions, such as sliding windows, strided patterns, and a few global tokens (Longformer, BigBird), cutting cost from O(n²) toward roughly O(n·w). The risk is that a fixed pattern misses long-range pairs it was not designed for; in practice local-plus-global patterns recover most quality on long documents, and I treat the pattern choice as an inductive bias to validate per task.
193. What is Flash Attention?
Answer. FlashAttention is an IO-aware exact attention kernel: it tiles Q, K, V through fast on-chip SRAM, computes softmax online with running statistics, and never materializes the n×n attention matrix in GPU main memory. Same mathematics, drastically less memory traffic, so it delivers large wall-clock speedups and makes longer contexts feasible; it is a reminder that at this scale memory movement, not FLOPs, is the binding constraint.
194. What is RoPE (Rotary Positional Embedding)?
Answer. RoPE encodes position by rotating each query/key dimension pair by an angle proportional to absolute position; the dot product between a rotated query and key then depends only on their relative offset. It injects relative position without extra parameters, composes cleanly with KV caching, and supports context extension via frequency/base rescaling tricks, which is why most modern open LLMs (the Llama family among them) use it.
195. What is ALiBi positional encoding?
Answer. ALiBi drops positional embeddings entirely and instead subtracts a head-specific linear penalty proportional to query-key distance from the attention logits, so each head prefers a different recency horizon. It is simple and parameter-free, and it extrapolates to sequences longer than those seen in training better than learned absolute positions, which made it attractive for long-context models.
196. What is grouped query attention (GQA)?
Answer. GQA lets groups of query heads share a single key/value head: the midpoint between multi-head attention (per-head K/V) and multi-query attention (one K/V for all heads). Since the decode-time KV cache scales with the number of K/V heads, GQA cuts cache memory and bandwidth substantially with minor quality loss, which is why large serving-oriented models such as Llama-2 70B adopted it.
LARGE LANGUAGE MODEL (LLM) THEORY
197. What is a Large Language Model (LLM)?
Answer. An LLM is a large neural language model (often decoder-only Transformer) pretrained on broad text, then adapted via prompting or fine-tuning. The interview frame: parametric knowledge bounded by context window + training cutoff + decoding randomness; product quality turns on eval harnesses, guardrails, retrieval when facts matter, and cost/latency envelopes.
198. What is instruction tuning?
Answer. Instruction tuning is supervised fine-tuning on (instruction, response) pairs spanning many tasks, so the model learns to follow natural-language directives zero-shot instead of merely continuing text. It is what turns a base model into an assistant-shaped one. Quality and diversity of the instruction data dominate the outcome; narrow or templated data yields a model that regurgitates formats rather than generalizing.
199. What is the difference between pre-training and fine-tuning in LLMs?
Answer. Pretraining is next-token prediction over web-scale corpora: it builds general language ability and world knowledge and consumes nearly all the compute. Fine-tuning is a short, targeted second phase (SFT, preference optimization, or PEFT) that shapes behavior, format, and domain skill. Decision rule: facts that change belong in retrieval, behavior and format belong in fine-tuning, and a missing core capability means a stronger base model, not more epochs.
200. What is LoRA (Low-Rank Adaptation)? How does it reduce training costs?
Answer. LoRA freezes the pretrained weights and learns a low-rank update ΔW = BA (rank r ≪ d) on selected matrices, usually the attention projections. Trainable parameters drop to a fraction of a percent, optimizer state shrinks accordingly, and adapters can be merged into the base weights or hot-swapped per tenant at serving time. The knobs are the rank, the scaling factor α, and which matrices to adapt; too low a rank underfits style-heavy tasks.
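A forward-pass sketch of the idea (dimensions, rank, and α are illustrative; in the published recipe B starts at zero so ΔW = 0 at initialization):

    import numpy as np

    d, r, alpha = 512, 8, 16
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, d))          # frozen pretrained weight
    A = 0.01 * rng.normal(size=(r, d))   # trainable, small random init
    B = np.zeros((d, r))                 # trainable, zero init

    def lora_forward(x):
        # base path + low-rank update, scaled by alpha / r
        return x @ W.T + (x @ A.T) @ B.T * (alpha / r)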
201. What is QLoRA?
Answer. QLoRA fine-tunes LoRA adapters on top of a base model quantized to 4-bit NF4, with double quantization of the quantization constants and paged optimizer states to survive memory spikes. Gradients flow through the frozen quantized weights into the adapters, making single-GPU fine-tuning of large models practical, at quality close to 16-bit LoRA in the original paper's evaluations.
202. What is parameter-efficient fine-tuning (PEFT)?
Answer. PEFT is the umbrella term for adapting large frozen models by training small added components: LoRA and its variants, adapter layers, prefix and prompt tuning. Benefits: a sliver of trainable parameters, per-task artifacts measured in megabytes instead of full checkpoints, and reduced catastrophic forgetting because the base stays frozen. It is the default adaptation strategy whenever full fine-tuning's cost or checkpoint sprawl is unjustifiable.
203. What is quantization in LLMs? What are INT8, INT4 quantization?
Answer. Quantization stores weights (and sometimes activations and the KV cache) at lower precision to shrink memory and speed inference. INT8 schemes with outlier handling are typically near-lossless; INT4 weight-only methods (GPTQ, AWQ) push a model into a fraction of the memory with modest, task-dependent degradation. Non-negotiable: re-run task evals after quantizing, because aggregate perplexity can hide slice-level damage.
204. What is knowledge distillation?
Answer. Distillation trains a small student to match a large teacher, on soft output distributions, intermediate representations, or teacher-generated data; the soft targets carry inter-class similarity structure ("dark knowledge") that hard labels lack. It is the standard route to cheap serving (DistilBERT-style encoders), and for LLMs the common flavor is training on teacher-generated instruction data, with licensing terms as a real constraint to check.
205. What is the context window in an LLM?
Answer. The context window is the maximum number of tokens the model can attend to at once, prompt plus generation; it is a hard cap on in-context information and the budget that retrieval results, few-shot examples, and conversation history all compete for. Longer windows cost quadratic attention compute and linear KV-cache memory, and long-context models still show position effects ("lost in the middle"), so I measure placement sensitivity rather than trusting the advertised length.
206. What is temperature in language model sampling?
Answer. Temperature scales logits before softmax: higher T increases entropy (more diverse outputs), T→0 nears greedy decoding. It’s an explicit knob on the generative distribution, not magic “creativity.” In production I pair with top-p/top-k, set defaults per task risk, and log token distributions for incident replay.
207. What is top-k and top-p (nucleus) sampling?
Answer. Both truncate the distribution's tail before sampling. Top-k keeps the k most probable tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches p, so the candidate set adapts to the distribution's shape: narrow when the model is confident, wide when it is not. They compose with temperature, and I set all three per task risk: tight for extraction and code, looser for ideation.
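A sketch of temperature plus nucleus truncation over a toy logit vector (the T and p values are illustrative task knobs):

    import numpy as np

    def sample(logits, T=0.8, p=0.9, rng=np.random.default_rng(0)):
        z = logits / T
        probs = np.exp(z - z.max())
        probs = probs / probs.sum()                    # softmax with temperature
        order = np.argsort(probs)[::-1]                # high to low
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, p) + 1]   # smallest set with mass >= p
        kept = probs[keep] / probs[keep].sum()         # renormalize the nucleus
        return int(rng.choice(keep, p=kept))

    print(sample(np.array([2.0, 1.5, 0.3, -1.0, -3.0])))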
208. What is beam search in language generation?
Answer. Beam search keeps the B highest-scoring partial sequences each step and expands them all, approximating the most likely output sequence rather than sampling one. It suits closed-ended transduction like translation, where likelihood tracks quality, with length normalization to offset the short-sequence bias. For open-ended generation it produces repetitive, low-diversity text, which is why chat systems sample instead.
209. What is chain-of-thought prompting?
Answer. Chain-of-thought prompting elicits intermediate reasoning steps before the final answer, via "let's think step by step" or few-shot exemplars with worked reasoning. It reliably lifts multi-step arithmetic, logic, and planning performance, especially in larger models, at the cost of extra tokens and latency. Two caveats I flag: the stated rationale is not guaranteed faithful to the model's actual computation, and self-consistency (sample several chains, majority-vote) buys further accuracy for more compute.
210. What is an embedding model? How is it different from a generative model?
Answer. An embedding model maps text to a dense vector so that similarity, retrieval, clustering, and deduplication become geometry; it is typically a contrastively trained encoder. A generative model emits tokens. They cooperate in RAG: the embedding model retrieves candidate context, the generative model writes an answer grounded in it. Operationally, changing the embedding model invalidates the vector index, so I version the model and the index together.
211. What is a foundation model?
Answer. A foundation model is a large model pretrained on broad data that serves as the shared base for many downstream applications via prompting or fine-tuning; the term comes from the Stanford CRFM report. The point of the framing is centralization: capability and cost, but also risk, concentrate in one artifact, so evaluation, bias auditing, and governance of the base propagate to everything built on it.
212. What is Constitutional AI?
Answer. Constitutional AI is Anthropic's alignment method: the model critiques and revises its own outputs against an explicit written set of principles (the constitution), and preference labels for the subsequent RL phase come from AI feedback guided by those principles (RLAIF) rather than from humans rating harmful content directly. Benefits: scalable feedback and an inspectable value specification; open question: the constitution's coverage and interpretation under adversarial prompts.
213. What is model alignment?
Answer. Alignment is making a model's behavior match human intent and values: helpful, honest, harmless, and instruction-following rather than merely likelihood-maximizing. In practice it is a pipeline, not a switch: curated SFT data, preference optimization (RLHF or DPO), system prompts, and guardrails, validated by red-teaming and behavioral evals. Characteristic failures I name: reward hacking, sycophancy, and over-refusal; all are measurable, so I treat alignment as an evaluated property.
214. What is RLHF (Reinforcement Learning from Human Feedback)?
Answer. RLHF aligns generations to human preferences: train a reward model on rankings, optimize the LM policy (often PPO) against it with a KL penalty to a reference model. Strength: helpful/harmless style; risks: reward hacking, rater disagreement, and optimizing proxy rewards. DPO reparameterizes preference learning without an explicit reward model—simpler stack, still needs thoughtful data.
215. What is Direct Preference Optimization (DPO)?
Answer. DPO trains directly on preference pairs with a loss derived in closed form from the KL-regularized RLHF objective: increase the policy's log-probability margin for the chosen over the rejected response, measured relative to a frozen reference model, with β controlling how far the policy may drift. No reward model, no rollouts, no PPO machinery: a simpler, more stable stack. Quality still lives or dies on preference-data coverage, and heavily off-policy pairs remain a known weakness.
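The per-pair loss is one line once you have the summed log-probabilities of each response under the policy and the frozen reference (all numbers below are illustrative):

    import numpy as np

    def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
        # margin between policy and reference log-ratios, scaled by beta
        margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
        return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

    print(dpo_loss(-12.0, -15.0, -13.0, -14.5))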
216. What is the scaling law in LLMs? (Kaplan et al.)
Answer. Kaplan et al. showed test loss falls as a smooth power law in parameters, data, and compute when none of the three bottlenecks the others, letting you forecast returns before spending. Chinchilla (Hoffmann et al.) corrected the compute-optimal allocation: earlier models were undertrained, and at fixed compute you should scale data and parameters together, roughly 20 tokens per parameter. Practical use: budget planning and deciding whether the next dollar goes to model size, data, or longer training.
217. What is emergent ability in large language models?
Answer. Emergent abilities are capabilities that appear abruptly with scale rather than improving smoothly: multi-step arithmetic or certain benchmarks jumping from near-chance to competent across model sizes. The honest caveat: later analysis argued much apparent emergence is a metric artifact, since discontinuous measures like exact match can hide smooth underlying progress. My interview position: plan for capability thresholds, but verify with continuous metrics before calling something emergent.
218. What is in-context learning?
Answer. In-context learning is the model adapting its behavior from examples provided in the prompt, with no weight updates: it infers the task pattern at inference time. It is what makes few-shot prompting work and why prompt format matters so much; performance is sensitive to example selection, order, and label distribution. Operationally I treat prompt templates and exemplars as versioned, evaluated artifacts, exactly like model weights.
219. What is the difference between zero-shot, one-shot, and few-shot prompting?
Answer. Zero-shot gives an instruction only; one-shot adds a single worked example; few-shot provides several. Demonstrations buy format compliance, label-space grounding, and disambiguation at the cost of context tokens and latency. My escalation path: start zero-shot with a crisp instruction, add shots when outputs drift in format or edge cases, and move to fine-tuning when the prompt becomes an unmaintainable pile of exemplars.
220. What is catastrophic forgetting? How is it addressed in LLMs?
Answer. Catastrophic forgetting is a fine-tuned model losing earlier capabilities as new-task gradients overwrite the weights that carried them: the narrow task improves while general knowledge and safety behavior quietly degrade. Mitigations: PEFT/LoRA so the base stays frozen, low learning rates and few epochs, mixing general data into the fine-tuning set, and regularizing toward the base model. I always run broad capability evals after fine-tuning, not just the target-task metric.
221. What is hallucination in large language models? What causes it?
Answer. Hallucination is fluent output that is false or unsupported by any source. Causes: the pretraining objective rewards plausible continuation rather than truth; knowledge gaps and training cutoffs; sampling randomness; and prompts demanding specifics the model lacks, which it fills with plausible inventions. Mitigations I stack: retrieval grounding, lower temperature for factual tasks, citation and verification layers, abstention training, and groundedness metrics in the eval harness.
222. What is perplexity as a language model evaluation metric?
Answer. Perplexity is the exponential of the average per-token negative log-likelihood on held-out text: lower means the model assigns higher probability to what it sees. It is the cleanest intrinsic LM metric for comparing models on the same tokenizer and corpus. Caveats: it is not comparable across tokenizers, and it correlates imperfectly with downstream task quality or factuality, so I never ship on perplexity alone.
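The computation from per-token log-probabilities (natural log; the values are illustrative):

    import numpy as np

    token_logprobs = np.array([-2.1, -0.4, -3.0, -1.2, -0.7])
    ppl = np.exp(-token_logprobs.mean())
    print(ppl)   # lower = the model was less surprised by this text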
223. What is SFT (Supervised Fine-Tuning)?
Answer. SFT fine-tunes a pretrained model on curated prompt-response demonstrations with the ordinary next-token loss, usually masked so only response tokens contribute. It is the step that converts a raw base model into an instruction follower and establishes the format and tone that preference optimization then refines. Data quality dominates: a few thousand excellent demonstrations beat mountains of mediocre ones, and demonstration errors get faithfully cloned.
224. What is a system prompt in LLMs?
Answer. The system prompt is a privileged instruction block prepended to the conversation that sets the model's role, constraints, tone, and tool-use rules; chat training teaches models to weight it above user turns. It steers behavior without retraining, so I treat it as configuration: versioned, evaluated, and monitored. The caveat is that it is soft control; prompt injection can override it, so security-relevant rules need enforcement outside the model too.
225. What is the difference between open-source and closed-source LLMs?
Answer. Open-weights models (the Llama and Mistral families) give you the weights: on-prem data control, fine-tuning freedom, and unit economics you own, at the price of serving infrastructure and ops. Closed API models offer frontier capability with zero infrastructure, but bring data-governance questions, per-token costs, rate limits, and vendor lock-in. Decision rule: sensitive data, customization depth, and high-volume economics push open; fastest access to top capability with a small team pushes closed. "Open-source" often really means open-weights with license restrictions, which I flag for legal review.
REINFORCEMENT LEARNING
226. What is reinforcement learning? What are its key components?
Answer. RL is learning by interaction: an agent observes a state, takes an action, receives a reward and a next state, and improves a policy to maximize expected cumulative discounted reward. Key components: agent, environment, state, action, reward, policy, value function, and optionally a learned model of the dynamics. The contrast with supervised learning: feedback is evaluative and delayed, not corrective and immediate, which creates the credit-assignment and exploration problems.
227. What is the Markov Decision Process (MDP)?
Answer. An MDP is the tuple (S, A, P, R, γ): states, actions, transition kernel P(s′|s, a), reward function, and discount factor. The Markov property, that the future depends only on the current state and action, is what makes Bellman recursions and dynamic programming valid. When observations do not capture the full state you are in a POMDP, handled in practice with recurrent policies or frame stacking; misdiagnosing partial observability is a classic RL failure.
228. What is the difference between model-based and model-free RL?
Answer. Model-based RL learns or is given a dynamics model and plans through it: sample-efficient and capable of lookahead, but planning against a wrong model compounds errors. Model-free RL learns values or policies directly from experience: simpler and robust to modeling error, but far hungrier for samples. Decision rule: expensive real-world interaction with learnable dynamics pushes model-based; a cheap, fast simulator pushes model-free simplicity.
229. What is a policy in RL?
Answer. A policy is the agent's behavior: a mapping from states to actions, deterministic a = π(s) or stochastic π(a|s). It is the object RL ultimately optimizes; value functions exist in service of improving it. Stochastic policies matter for exploration, for smooth policy-gradient optimization, and in adversarial or partially observed settings where determinism is exploitable.
230. What is a value function? What is a Q-function?
Answer. V^π(s) is the expected return starting from state s and following π; Q^π(s, a) is the expected return taking action a first, then following π. Q is directly actionable without a dynamics model: act greedily via the argmax over actions, which is exactly what Q-learning exploits. Their difference, the advantage A(s, a) = Q(s, a) − V(s), measures how much better an action is than the policy's average and is the variance-reduced signal actor-critic methods use.
231. What is the Bellman equation?
Answer. The Bellman equation expresses values recursively: V^π(s) = E[r + γV^π(s′)], with the optimality form Q*(s, a) = E[r + γ max_{a′} Q*(s′, a′)]. It converts infinite-horizon return estimation into a one-step bootstrap, the foundation of dynamic programming, TD learning, and Q-learning. The engineering echo: bootstrapping propagates value information quickly but also propagates its own estimation errors, which is the root of instability under function approximation.
232. What is Q-learning?
Answer. Q-learning is off-policy temporal-difference control: Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]. Because the target takes the max over next actions, it learns the optimal Q regardless of the (sufficiently exploratory) behavior policy. Tabular Q-learning converges under standard conditions; the classic pathology is maximization bias, addressed by Double Q-learning.
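The update in code, against a hypothetical Gym-style environment (env and its reset/step interface are assumptions for illustration, not a real library API; α, γ, ε are typical values):

    import numpy as np

    n_states, n_actions = 16, 4
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.99, 0.1
    rng = np.random.default_rng(0)

    for _ in range(500):                               # episodes
        s, done = env.reset(), False                   # hypothetical env
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)              # assumed interface
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])      # TD update toward the target
            s = s_next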
233. What is Deep Q-Network (DQN)?
Answer. DQN scales Q-learning to high-dimensional states with a deep network, stabilized by two tricks: experience replay, which breaks sample correlation by training on shuffled past transitions, and a periodically synced target network, which keeps bootstrap targets from chasing themselves. It was the Atari-from-pixels milestone; standard upgrades are Double DQN (maximization bias), dueling heads, and prioritized replay.
234. What is the exploration vs exploitation tradeoff?
Answer. Exploration gathers information about actions you have not tried; exploitation cashes in current knowledge for reward. Too little exploration locks in a suboptimal policy; too much pays regret indefinitely. Tools in rising sophistication: ε-greedy, optimism/UCB, Thompson sampling, and entropy bonuses in policy gradients. My framing for production bandits (ranking, pricing): the cost of a bad action and the environment's nonstationarity set the exploration budget.
235. What is epsilon-greedy policy?
Answer. ε-greedy acts randomly with probability ε and greedily with respect to current value estimates otherwise. It is trivially simple, needs no uncertainty estimates, and with an annealed ε is surprisingly serviceable. Weaknesses: it is undirected, wasting samples re-exploring clearly bad actions, and in large action spaces directed methods (UCB, Thompson sampling) explore far more efficiently.
236. What is policy gradient?
Answer. Policy-gradient methods optimize policy parameters directly by gradient ascent on expected return, using the score-function identity ∇J(θ) = E[∇ log π_θ(a|s) · Ψ], where Ψ is a return, advantage, or TD-error term. They handle continuous action spaces and stochastic policies natively, which value-based methods struggle with. The central engineering problem is estimator variance, attacked with baselines, advantage estimation (GAE), and trust-region-style constraints.
237. What is the REINFORCE algorithm?
Answer. REINFORCE is the Monte Carlo policy gradient: run full episodes, then update θ ← θ + α·∇ log π(a_t|s_t)·(G_t − b), where G_t is the observed return and b a baseline to cut variance. It is unbiased but high-variance and sample-hungry, so it survives mainly as the pedagogical ancestor of actor-critic and PPO; its log-derivative trick reappears wherever gradients must pass through sampling.
238. What is Actor-Critic architecture?
Answer. Actor-critic pairs a policy (actor) with a learned value function (critic): the critic supplies low-variance advantage estimates that scale the actor's policy gradient, trading REINFORCE's Monte Carlo variance for bootstrap bias. With advantage estimators like GAE to tune that trade, it is the template underlying A2C/A3C, PPO, and SAC. Failure mode: a bad critic poisons the actor, so I monitor critic loss and advantage statistics as first-line diagnostics.
239. What is Proximal Policy Optimization (PPO)?
Answer. PPO is a policy-gradient method that constrains each update's size with a clipped surrogate objective: the probability ratio r = π_new(a|s)/π_old(a|s) is clipped to [1−ε, 1+ε], removing the incentive to move the policy far from the data-collecting policy. First-order, stable, and relatively insensitive to hyperparameters, it became the default for continuous control and the workhorse of RLHF, where a KL penalty to the reference model plays the same keep-it-close role.
240. What is the difference between on-policy and off-policy learning?
Answer. On-policy methods (SARSA, PPO) learn about the policy that generated the data, so every update wants fresh rollouts; off-policy methods (Q-learning, anything with a replay buffer) learn about a target policy from data produced by another, enabling data reuse and offline learning. Off-policy is sample-efficient but fragile: function approximation plus bootstrapping plus off-policy data is the "deadly triad." Decision rule: cheap simulators favor on-policy simplicity; expensive or logs-only data forces off-policy and offline-RL care.
241. What is reward shaping?
Answer. Reward shaping adds auxiliary reward signals to densify sparse feedback and accelerate learning, e.g. rewarding progress toward a goal instead of only goal completion. Potential-based shaping, F(s, s′) = γΦ(s′) − Φ(s), provably preserves the optimal policy; ad-hoc shaping does not, and it invites reward hacking, where the agent optimizes the proxy (circling a checkpoint) instead of the goal. My rule: shape with potentials when possible, and always watch the learned behavior, not just the reward curve.
MLOps & MODEL LIFECYCLE
242. What is MLOps? Why is it important?
Answer. MLOps is how ML systems are built, deployed, and operated like reliable software: versioned data/features, reproducible training, experiment tracking, CI/CD for models, registries, monitoring for drift/latency/quality, and rollback. I emphasize that integration failures dominate in practice: train/serve schema skew sinks more launches than any optimizer choice.
243. What is a machine learning pipeline?
Answer. An ML pipeline is the orchestrated, versioned sequence from raw data to serving: ingestion, validation, feature computation, training, evaluation, registration, deployment, monitoring. Codifying it as a DAG (Airflow, Kubeflow, and similar) makes runs reproducible, failures localized, and retraining a button instead of a ritual. The design principle I stress: every stage has a testable contract (schemas, metric gates), so bad data or bad models stop moving before production.
244. What is model versioning? Why does it matter?
Answer. Model versioning pins every model artifact to the exact code, data snapshot, features, hyperparameters, and evaluation metrics that produced it. It matters because production questions are forensic: which model made this prediction, what changed since last week, can we roll back exactly? Without lineage, incident response is archaeology; with it, rollback is a pointer flip in the registry.
245. What is a feature store? What problems does it solve?
Answer. A feature store is a central registry of feature definitions with two access paths computed from the same logic: offline (point-in-time-correct historical values for training) and online (low-latency lookups for serving). It solves train/serve skew from reimplemented feature logic, enables feature reuse across teams, and enforces point-in-time correctness so training never leaks future information. The cost is real infrastructure and governance overhead, which small teams should weigh honestly.
246. What is model drift? What are data drift and concept drift?
Answer. Data drift: the input distribution p(x) shifts, e.g. a new user segment arrives. Concept drift: the relationship p(y|x) itself changes, e.g. fraud tactics evolve, so familiar inputs get new labels. Model drift is the resulting performance decay. Detection: distribution tests on inputs and prediction outputs (PSI, KS) immediately, true metrics when delayed labels arrive; response: retraining triggers, rollback plans, and drift-robust features.
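A PSI check for one numeric feature between a training reference and live traffic (the bin count and the ~0.2 alert threshold are common rules of thumb, shown as illustration):

    import numpy as np

    def psi(reference, live, bins=10):
        edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range live values
        p = np.histogram(reference, edges)[0] / len(reference)
        q = np.histogram(live, edges)[0] / len(live)
        p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
        return float(np.sum((p - q) * np.log(p / q)))

    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, 10_000)
    cur = rng.normal(0.3, 1.0, 10_000)                 # shifted live distribution
    print(psi(ref, cur))                               # alert if PSI exceeds ~0.2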
247. How do you monitor a deployed ML model?
Answer. I monitor in layers. System health: latency, throughput, error rates. Data quality: schema conformance, null and range checks, input and prediction distribution shift against a training reference. Model quality: proxy metrics immediately (score distributions, confidence, agreement with a baseline), true metrics once delayed labels arrive. Business KPIs close the loop. Two practices I insist on: alert on slices rather than aggregates, since cohort regressions hide in global means, and log features alongside predictions so incidents can be replayed.
248. What is A/B testing in ML model deployment?
Answer. A/B testing routes live traffic randomly between the incumbent model and a challenger and compares pre-registered product metrics with a proper power analysis. It is the only evaluation that captures real user-interaction effects offline metrics miss. Pitfalls I call out: peeking at results and stopping early, interference between arms (shared inventory, network effects), and horizons too short for delayed outcomes like churn.
249. What is shadow deployment / shadow testing?
Answer. In shadow deployment the candidate model receives mirrored production traffic and its predictions are logged but never served to users. It validates latency, stability, and prediction deltas against the incumbent at full production scale with zero user risk. What it cannot measure is anything requiring user response to the new predictions, which is why shadowing precedes rather than replaces A/B testing.
250. What is canary deployment in ML?
Answer. A canary deployment routes a small slice of real traffic (say 1–5%) to the new model while the rest stays on the incumbent, with automated rollback if guardrail metrics regress. Unlike shadow mode, canary users actually receive the new predictions, so it catches feedback-dependent problems at a bounded blast radius. I pair it with sticky, user-level routing so a person doesn't flip between models mid-session.
251. What is the difference between online and batch inference?
Answer. Batch inference scores large sets of records on a schedule and writes results to a store—throughput-optimized, cheap, tolerant of heavy models. Online inference answers individual requests within a latency budget (often tens of milliseconds), which constrains model size, feature lookups, and preprocessing. My decision rule: if the prediction is consumed at request time and depends on fresh context, go online; if predictions can be precomputed for known entities, batch is simpler and cheaper. Many systems do both—precompute in batch, then serve the lookup online.
252. What is model serialization? What formats are commonly used? (ONNX, pickle, SavedModel)
Answer. Model serialization is persisting a trained model so it can be reloaded for inference, ideally in a different process, environment, or language. Pickle/joblib are convenient for Python-only stacks but are version-fragile and unsafe to load from untrusted sources (arbitrary code execution). TensorFlow's SavedModel bundles graph plus weights for serving; ONNX is a framework-neutral exchange format that lets you train in one framework and serve with an optimized runtime. My rule: pin library versions next to the artifact, and treat deserialization of untrusted files as a security boundary.
253. What is model quantization and why is it done for deployment?
Answer. Quantization converts weights (and often activations) from float32 to lower precision—typically int8—shrinking the model roughly 4x and speeding up inference on hardware with fast integer units. Post-training quantization is cheap but can cost accuracy on sensitive layers; quantization-aware training simulates quantization during training to recover it. I always report the accuracy delta on representative slices, not just the average, because quantization error is rarely uniform.
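A minimal sketch with PyTorch's dynamic quantization (the toy model is illustrative; newer releases also expose this under torch.ao.quantization):

    import torch
    import torch.nn as nn

    # Dynamic quantization rewrites Linear layers to int8 weights with
    # on-the-fly activation quantization, targeting CPU inference.
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                 dtype=torch.qint8)

    x = torch.randn(1, 128)
    print(qmodel(x))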
254. What is model pruning?
Answer. Pruning removes weights or whole structures (neurons, heads, channels) that contribute little, producing a smaller or faster model. Unstructured (magnitude) pruning zeroes individual weights—high sparsity, but you need sparse kernels to realize speedups; structured pruning removes whole units and yields real speedups on dense hardware. The typical recipe is iterative: prune, fine-tune, repeat—and I validate that pruning didn't disproportionately hurt rare classes.
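A minimal sketch with PyTorch's pruning utilities; the layer and sparsity level are illustrative:

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(128, 64)
    # Zero the 50% smallest-magnitude weights (unstructured L1 pruning);
    # a weight mask is kept so fine-tuning can continue under the mask.
    prune.l1_unstructured(layer, name="weight", amount=0.5)
    print(float((layer.weight == 0).float().mean()))  # ~0.5 sparsity
    prune.remove(layer, "weight")  # bake the mask into the tensor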
255. What is experiment tracking? Name popular tools.
Answer. Experiment tracking records, for every run, the code version, data snapshot, hyperparameters, metrics, and artifacts, so results are comparable and reproducible months later. The popular tools cluster into open-source self-hosted (MLflow), managed SaaS (Weights & Biases, Neptune, Comet), and platform-integrated (SageMaker Experiments, Vertex AI Experiments). Where it matters in systems work: the tracked run ID becomes the lineage link from a production model back to the exact data and code that produced it.
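A minimal sketch assuming the mlflow package and a local tracking store; the parameter and metric names are hypothetical:

    import mlflow

    with mlflow.start_run(run_name="baseline"):
        mlflow.log_param("lr", 0.01)
        mlflow.log_param("max_depth", 6)
        mlflow.log_metric("val_auc", 0.912)
        mlflow.log_metric("val_auc", 0.918, step=2)  # metrics can be stepped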
256. What is CI/CD for machine learning?
Answer. CI/CD for ML extends software CI/CD with data and models: CI runs unit tests plus data validation and cheap training smoke tests; CD automates packaging, staged rollout (shadow → canary → full), and rollback. Many teams add CT (continuous training), where retraining is itself a pipeline triggered by drift or a schedule. The extra discipline versus plain software: a green build isn't enough—you also gate promotion on model quality against a frozen evaluation set.
257. What is the difference between a model registry and a model store?
Answer. A model registry is a governance catalog: it tracks model versions, their lineage and metrics, and their lifecycle stage (staging, production, archived), plus who approved each promotion. A model store (artifact store) is the storage layer holding the serialized binaries themselves. In practice the registry points into the store: the registry answers "which version is in production and why," the store answers "where are the bytes." My decision rule: if you need approvals and auditability, you need a real registry, not just a bucket of files.
258. What is the role of Docker and Kubernetes in ML deployment?
Answer. Docker packages the model server with its exact dependencies into an image, killing "works on my machine" drift between training and serving environments. Kubernetes then schedules those containers: replicas and autoscaling for load, health checks and restarts for resilience, rolling updates and traffic splitting for canary releases. For ML specifically I call out GPU scheduling, resource requests sized to model memory, and the pattern of pulling the model artifact at startup rather than baking it into the image, so model updates don't require image rebuilds.
259. What is feature importance and how is it measured?
Answer. Feature importance quantifies how much each input contributes to a model's predictions. Common measures: linear coefficients (on standardized features), tree impurity gains (fast, but biased toward high-cardinality features), permutation importance (shuffle a feature and measure the metric drop—model-agnostic, best computed on held-out data), and SHAP values (local attributions that aggregate to global importance). The standard trap is correlated features splitting or stealing credit, so I interpret importances as descriptive of the model, not causal claims about the world.
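A minimal sketch of permutation importance with scikit-learn; dataset and model are illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # The score drop when a feature is shuffled on held-out data estimates
    # how much the model relies on that feature.
    result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    top = result.importances_mean.argsort()[::-1][:5]
    print(top, result.importances_mean[top])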
RESPONSIBLE AI, EXPLAINABILITY & ADVANCED TOPICS
260. What is explainability (XAI) in machine learning?
Answer. Explainability is the ability to surface evidence for why a model produced an output—local vs global, intrinsic vs post-hoc. Methods span coefficients, partial dependence, SHAP/LIME, attention weights (caution: not causal), and counterfactuals. I always say: explainability serves debugging, compliance, and user trust—not proof of correctness—and should align with the stakeholder’s decision task.
261. What is SHAP (SHapley Additive exPlanations)?
Answer. SHAP assigns each feature a contribution to a single prediction using Shapley values from cooperative game theory: the prediction's deviation from a baseline is distributed across features according to their average marginal contribution over feature coalitions. The additive property (contributions sum to that deviation) makes the attributions self-consistent, and TreeSHAP computes them exactly and efficiently for tree ensembles. Caveats I raise: correlated features make "marginal contribution" ambiguous, and SHAP explains the model, not the causal process.
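A minimal sketch assuming the shap package is installed; dataset and model are illustrative:

    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    # TreeSHAP: exact Shapley values for tree ensembles; row 0's
    # attributions sum (per class) to its prediction minus the baseline.
    explainer = shap.TreeExplainer(model)
    print(explainer.shap_values(X[:1]))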
262. What is LIME (Local Interpretable Model-agnostic Explanations)?
Answer. LIME explains one prediction by fitting a simple interpretable surrogate (usually sparse linear) to the black-box model's outputs on perturbed samples around that instance, weighted by proximity. It's model-agnostic and works for tabular data, text, and images (via superpixels). Weaknesses I name: explanations can be unstable under resampling, the locality kernel width is a sensitive hyperparameter, and the surrogate is faithful only locally—so I sanity-check LIME output against SHAP or simple ablations before trusting it.
263. What are the main types of bias in ML models?
Answer. I cluster bias sources by pipeline stage. Data collection: sampling/selection bias (non-representative data), historical bias (past inequities baked into outcomes), and measurement bias (proxies that distort the target). Labeling and features: label bias from subjective annotators, aggregation bias from pooling heterogeneous groups. Modeling and feedback: algorithmic bias amplified by the objective, and feedback loops where the model's own outputs shape future training data (e.g., recommendation systems narrowing what users see). The fix lives at the stage that caused it—reweighting won't repair a mismeasured label.
264. What is algorithmic fairness? What fairness metrics exist?
Answer. Algorithmic fairness asks whether a model's errors and benefits are distributed acceptably across groups. Common criteria: demographic parity (equal positive rates), equalized odds (equal TPR and FPR), equal opportunity (equal TPR only), predictive parity (equal precision), and calibration within groups. A key impossibility result (Kleinberg et al.; Chouldechova): with different base rates, calibration and equalized error rates cannot hold simultaneously—so the real answer is choosing, with stakeholders, the metric that matches the harm, not trying to satisfy all of them.
265. What is model robustness? What are adversarial attacks?
Answer. Robustness is stability of predictions under input perturbations—natural (noise, distribution shift) or adversarial (worst-case perturbations crafted to flip the output). Canonical attacks: FGSM (one signed-gradient step), PGD (iterated FGSM, the standard strong baseline), and black-box/transfer attacks; beyond evasion there are data poisoning and model extraction. Defenses that hold up are few: adversarial training and certified approaches like randomized smoothing—gradient-masking defenses tend to fall to adaptive attacks.
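A minimal FGSM sketch in PyTorch; the untrained toy model, data, and epsilon are illustrative assumptions:

    import torch
    import torch.nn as nn

    # FGSM: one signed-gradient step that increases the loss within an
    # L-infinity ball of radius eps around the input.
    model = nn.Sequential(nn.Linear(10, 2))
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(1, 10, requires_grad=True)
    y = torch.tensor([1])

    loss = loss_fn(model(x), y)
    loss.backward()
    eps = 0.1
    x_adv = (x + eps * x.grad.sign()).detach()
    # The adversarial loss is typically higher than the clean loss.
    print(float(loss), float(loss_fn(model(x_adv), y)))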
266. What is differential privacy in ML?
Answer. Differential privacy gives a formal guarantee: the output distribution of an algorithm changes by at most a factor e^ε (plus δ) whether or not any single individual's record is in the dataset, so no one's participation can be confidently inferred from the output. In ML it's usually implemented via DP-SGD: clip per-example gradients, add calibrated Gaussian noise, and account for the cumulative privacy budget across steps. The practical tension is privacy versus utility—tight ε costs accuracy, often most on rare subgroups, which I'd measure explicitly.
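A minimal sketch of the Laplace mechanism for a counting query; the count and epsilon are illustrative:

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng):
        # Adding Laplace(sensitivity / epsilon) noise to a query with the
        # given L1 sensitivity makes the released value epsilon-DP.
        return true_value + rng.laplace(scale=sensitivity / epsilon)

    rng = np.random.default_rng(0)
    count = 1234  # e.g., "how many users clicked"
    # A counting query changes by at most 1 when one person is added/removed.
    print(laplace_mechanism(count, sensitivity=1.0, epsilon=0.5, rng=rng))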
267. What is federated learning? Why is it useful for privacy?
Answer. Federated learning trains a shared model across many clients (phones, hospitals) without centralizing raw data: each client computes an update locally, and a server aggregates them—canonically FedAvg, a data-size-weighted average of client weights. Privacy-wise, raw data never leaves the device, and you can layer on secure aggregation and differential privacy because gradients themselves can leak information. The hard parts are non-IID client data, stragglers, and communication cost.
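A minimal sketch of the FedAvg aggregation step; the client weight vectors and data sizes are hypothetical:

    import numpy as np

    def fedavg(client_weights, client_sizes):
        # FedAvg: average client parameter vectors, weighted by local data size.
        sizes = np.asarray(client_sizes, dtype=float)
        w = sizes / sizes.sum()
        return sum(wi * cw for wi, cw in zip(w, client_weights))

    # Three hypothetical clients, each holding a 4-parameter model.
    clients = [np.array([1.0, 2, 3, 4]),
               np.array([2.0, 2, 2, 2]),
               np.array([0.0, 1, 0, 1])]
    print(fedavg(clients, client_sizes=[100, 50, 50]))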
268. What is a Graph Neural Network (GNN)?
Answer. A GNN learns node (or edge, or whole-graph) representations by message passing: at each layer, a node aggregates its neighbors' features (sum/mean/attention), combines them with its own, and applies a shared nonlinearity—so k layers see a k-hop neighborhood. Variants differ mainly in the aggregation: GCN (normalized mean), GraphSAGE (sampled neighbors, inductive), GAT (learned attention weights). Failure modes I name: over-smoothing as depth grows, and neighborhood explosion on large graphs.
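A minimal one-layer GCN-style update in numpy; graph, features, and weights are toy assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 0],   # adjacency of a 3-node path graph
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    H = rng.normal(size=(3, 4))  # node features
    W = rng.normal(size=(4, 8))  # learnable weights

    # Symmetric-normalized adjacency with self-loops, then a linear map
    # and ReLU: each node mixes its 1-hop neighborhood.
    A_hat = A + np.eye(3)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_next = np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
    print(H_next.shape)  # (3, 8)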
269. What is a time series? What makes it different from regular tabular data?
Answer. A time series is a sequence of observations indexed by time, where the ordering itself carries information. Unlike regular tabular data, rows are not i.i.d.: there is autocorrelation, trend, seasonality, and possibly regime change, so random-shuffle cross-validation leaks the future into training. That forces time-aware splits (train on the past, validate on the future), lag and window features instead of raw rows, and care that every feature was actually available at prediction time.
270. What is stationarity in time series?
Answer. A series is (weakly) stationary when its mean, variance, and autocovariance structure don't change over time. It matters because classical models (ARMA) and many statistical guarantees assume it; trending or seasonal series are typically made stationary by differencing, detrending, or seasonal adjustment. I test with the ADF test (null: unit root, i.e., non-stationary) and KPSS (null: stationary), and note that they can disagree—which is itself informative.
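A minimal ADF check with statsmodels on a synthetic random walk and its first difference:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(0)
    random_walk = np.cumsum(rng.normal(size=500))  # non-stationary
    differenced = np.diff(random_walk)             # stationary after d=1

    # ADF null hypothesis is a unit root (non-stationary); a small
    # p-value lets us reject it.
    for name, series in [("walk", random_walk), ("diff", differenced)]:
        stat, p = adfuller(series)[:2]
        print(f"{name}: ADF={stat:.2f}, p={p:.3f}")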
271. What is the ARIMA model?
Answer. ARIMA(p, d, q) models a series as an autoregression of order p on its own past values, after differencing d times to remove trend, plus a moving-average term of order q on past forecast errors. Orders come from ACF/PACF plots or information criteria (AIC/BIC); SARIMA adds seasonal terms. Its limits: linear dynamics, one series at a time, and the differenced series must be roughly stationary—so I treat it as the strong, cheap baseline to beat before anything deep.
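A minimal fit-and-forecast sketch with statsmodels; the synthetic series and the (1,1,1) order are illustrative, not tuned:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=300))  # random walk: d=1 is appropriate

    fit = ARIMA(y, order=(1, 1, 1)).fit()
    print(fit.forecast(steps=10))  # 10-step-ahead point forecasts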
272. What are deep learning approaches for time series forecasting?
Answer. I cluster the deep approaches: recurrent (LSTM/GRU—sequential state, fine for moderate horizons), convolutional (TCN/WaveNet-style dilated convolutions—parallel training, long receptive fields), attention-based (Informer, Temporal Fusion Transformer, PatchTST—long contexts and covariates), and global forecasters like DeepAR and N-BEATS that train one model across many related series, often with probabilistic outputs. Where it matters: retail demand, where a global model shares strength across thousands of sparse series—though strong classical baselines (ETS/ARIMA) still win on short, clean, single series.
273. What is anomaly detection? What techniques are used?
Answer. Anomaly detection flags observations that deviate from learned normal behavior, usually without labeled anomalies. Technique families: statistical (z-scores, robust estimators, residual monitoring), density/distance-based (k-NN distance, Local Outlier Factor), isolation-based (Isolation Forest—anomalies are isolated by fewer random splits), one-class models (One-Class SVM), and reconstruction-based (autoencoders, where high reconstruction error signals anomaly; forecast residuals for time series). The production issues: setting a threshold for an acceptable false-positive budget, and drift redefining what "normal" means.
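A minimal Isolation Forest sketch with scikit-learn; data and contamination rate are illustrative:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(0, 1, size=(500, 2))
    outliers = rng.uniform(-6, 6, size=(10, 2))
    X = np.vstack([normal, outliers])

    # Points isolated by few random splits score as anomalies;
    # contamination sets the expected outlier fraction.
    iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
    labels = iso.predict(X)  # -1 = anomaly, 1 = normal
    print((labels == -1).sum(), "flagged")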
274. What is causal inference? How is it different from correlation?
Answer. Causal inference estimates what would happen under intervention—P(y | do(x))—rather than what merely co-occurs, P(y | x). Correlation conflates the causal effect with confounding (a common cause driving both variables), selection effects, and reverse causation. Tools: randomized experiments as the gold standard; otherwise identification strategies—backdoor adjustment on measured confounders, instrumental variables, difference-in-differences, regression discontinuity—each resting on assumptions I'd state explicitly. In ML practice this is why a feature that predicts churn is not automatically a lever for reducing it.
275. What is meta-learning (learning to learn)?
Answer. Meta-learning trains across a distribution of tasks so the model can adapt quickly to a new task from little data—learning the learner, not one fixed function. Main families: optimization-based (MAML learns an initialization that fine-tunes well in a few gradient steps), metric-based (prototypical and matching networks learn an embedding where nearest-class wins), and model-based (a network that adapts in its activations, including in-context learning in LLMs). Evaluation is episodic (N-way K-shot), and the failure mode is meta-overfitting to the training task distribution.
276. What is few-shot learning?
Answer. Few-shot learning means learning a new task from a handful of labeled examples per class (N-way K-shot). Approaches: metric learning over a pretrained embedding (compare the query to class prototypes), meta-learned initializations (MAML), fine-tuning a large pretrained model with heavy regularization, and, for LLMs, in-context learning where the examples sit in the prompt with no weight updates. The key ingredient is prior knowledge from pretraining; with K this small, evaluation variance is huge, so I report results over many sampled episodes.
277. What is contrastive learning?
Answer. Contrastive learning learns representations by pulling embeddings of positive pairs (two augmentations of the same image, aligned text–image pairs) together while pushing apart negatives, typically with the InfoNCE loss—a softmax over similarities at a temperature. SimCLR, MoCo, and CLIP are the canonical examples. The knobs that matter: augmentation strength defines which invariances you learn, and you need many or hard negatives (or a non-contrastive design like BYOL) to avoid representation collapse.
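A minimal InfoNCE sketch in numpy, where row i of each batch is the positive pair and all other rows are negatives; batch size, dimension, and temperature are illustrative:

    import numpy as np

    def info_nce(z1, z2, temperature=0.1):
        # L2-normalize, compute pairwise similarities, and treat the
        # diagonal (matched views) as the positives in a softmax.
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    # Two nearly identical "views" of each item yield a low loss.
    print(info_nce(z + 0.01 * rng.normal(size=z.shape), z))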
278. What is multi-task learning?
Answer. Multi-task learning trains one model on several related tasks simultaneously, usually with a shared backbone and task-specific heads, so the tasks regularize each other and share representation. Benefits: data efficiency and a single deployable model. Risks: negative transfer (one task degrades another) and loss balancing—naively summed losses let the largest-gradient task dominate, hence uncertainty weighting or gradient-surgery methods like PCGrad. I always check per-task metrics against single-task baselines before claiming a win.
279. What is continual / lifelong learning?
Answer. Continual (lifelong) learning trains on a stream of tasks or distributions without access to all past data, aiming to learn new things without catastrophically forgetting old ones. Strategy families: regularization-based (EWC penalizes moving weights important to old tasks), rehearsal (keep or generate a replay buffer of past examples), and architectural (grow or route capacity per task). The metric pair to quote is new-task accuracy plus forgetting on old tasks; in production, the pragmatic version is periodic retraining on a mixed replay sample.
280. What is curriculum learning?
Answer. Curriculum learning orders training examples or tasks from easy to hard, on the intuition that a shaped path smooths optimization—mirroring human pedagogy. Implementations: hand-crafted difficulty scores, self-paced learning (the model's own loss gates what counts as ready), and anti-curricula like hard-example mining, which sometimes win instead. The honest answer: gains are task-dependent; it helps most with sparse rewards in RL, noisy labels, or very heterogeneous data, and I'd A/B it against a shuffled baseline before adopting it.