Data Scientist Learning Path
A structured 16-week journey through the Knowledge Vault for aspiring data scientists. This path follows the natural progression: math foundations, Python tooling, EDA (69 pages), statistics, ML algorithms (30 pages), deep learning (25 pages), model evaluation, and deployment. It is the most comprehensive data-focused path, combining analytical depth with engineering skills.
Who This Is For
- Students or professionals starting a data science career
- Analysts transitioning into data science (you know SQL and Excel; now learn ML)
- Engineers who want to understand the full data science workflow
- Anyone preparing for data science interviews
Prerequisites
- Basic Python programming (functions, loops, dictionaries)
- High-school-level math (algebra, basic probability)
- Willingness to learn linear algebra and calculus fundamentals
- No prior ML or statistics experience required
Total estimated time: ~80 hours across 16 weeks (5 hrs/week)
Learning Progression
Week 1-2: Math Foundations
Estimated reading time: 5 hours
Data science is applied math. Build the foundation before touching data.
- [ ] Required -- Math Foundations (35 min)
- [ ] Required -- Probability for Engineers (30 min)
- [ ] Required -- Statistics & A/B Testing (30 min)
- [ ] Required -- System Design Math (25 min)
- [ ] Reference -- Python Cheat Sheet (10 min)
EDA statistics cross-reference:
- [ ] Required -- SciPy Stats (25 min)
- [ ] Required -- Understanding Distributions (25 min)
- [ ] Required -- Statistical Test Selector (20 min)
- [ ] Optional -- Statistical Power (20 min)
Checkpoint
After this section you should be able to: explain Bayes' theorem, compute expected values, understand normal/binomial/Poisson distributions, perform basic hypothesis tests, and use SciPy for statistical computations.
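A minimal sketch of what this checkpoint looks like in practice. All numbers (prevalence, test accuracy, measurements) are made up for illustration, and the variable names are hypothetical rather than taken from any Knowledge Vault page.

```python
from scipy import stats

# Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
prior = 0.01
sensitivity = 0.95
false_positive_rate = 0.05
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
posterior = sensitivity * prior / p_positive

# Expected value of a fair six-sided die: sum of outcome * probability.
expected_die = sum(face * (1 / 6) for face in range(1, 7))

# Binomial and Poisson probabilities via SciPy.
p_three_heads = stats.binom.pmf(3, n=10, p=0.5)  # P(exactly 3 heads in 10 flips)
p_no_events = stats.poisson.pmf(0, mu=2.0)       # P(0 events when the mean is 2)

# Two-sample t-test on made-up measurements.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.0, 6.2]
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"posterior={posterior:.3f}, E[die]={expected_die:.2f}, p={p_value:.4g}")
```

Even with a 95% sensitive test, the posterior here is only about 16%, because the condition is rare. That is the kind of result you should be able to explain after this section.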
Week 2-3: Python Data Science Tools
Estimated reading time: 5 hours
Master the Python tools you will use every day as a data scientist.
- [ ] Required -- Python ML Ecosystem (25 min)
- [ ] Required -- NumPy (25 min)
- [ ] Required -- Pandas Fundamentals (30 min)
- [ ] Required -- Pandas Advanced (30 min)
- [ ] Required -- Matplotlib (25 min)
- [ ] Required -- Seaborn (25 min)
- [ ] Required -- Plotly (25 min)
- [ ] Optional -- Polars for EDA (25 min)
- [ ] Optional -- Streamlit (20 min)
- [ ] Reference -- Pandas EDA Cheat Sheet (10 min)
- [ ] Reference -- Scikit-learn Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: manipulate dataframes with pandas fluently, create publication-quality visualizations, and, if you read the optional Streamlit page, build interactive dashboards.
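As a rough self-check for pandas fluency, you should be able to read and write something like the sketch below without looking anything up. The `orders` table and its column names are invented for this example.

```python
import pandas as pd

# Made-up order data for illustration only.
orders = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south", "north"],
    "units": [3, 5, 2, 7, 1, 4],
    "unit_price": [10.0, 12.0, 10.0, 8.0, 12.0, 10.0],
})

# Derive a column, then aggregate per group with named aggregation.
orders["revenue"] = orders["units"] * orders["unit_price"]
summary = (
    orders.groupby("region")
    .agg(total_revenue=("revenue", "sum"), order_count=("units", "size"))
    .sort_values("total_revenue", ascending=False)
)
print(summary)
```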
Week 3-5: EDA Foundations (Part 1 of 2; 69 EDA pages in total)
Estimated reading time: 8 hours
Exploratory Data Analysis is the core skill of data science; by the common rule of thumb, most of a project's time goes here.
Workflow & Data Understanding
- [ ] Required -- EDA Overview (15 min)
- [ ] Required -- EDA Workflow (25 min)
- [ ] Required -- Asking Right Questions (20 min)
- [ ] Required -- Data Collection (20 min)
- [ ] Required -- Data Shapes & Structures (20 min)
- [ ] Required -- Data Types Deep Dive (20 min)
- [ ] Required -- Data Profiling (20 min)
Univariate Analysis
- [ ] Required -- Univariate Numerical (25 min)
- [ ] Required -- Univariate Categorical (25 min)
- [ ] Required -- Univariate Temporal (20 min)
- [ ] Required -- Univariate Text (20 min)
- [ ] Required -- Understanding Scale (20 min)
Data Cleaning
- [ ] Required -- Missing Data (25 min)
- [ ] Required -- Outlier Analysis (25 min)
- [ ] Required -- Data Cleaning Categories (20 min)
- [ ] Required -- Data Cleaning Dates (20 min)
- [ ] Required -- Data Cleaning Text (20 min)
- [ ] Optional -- Data Cleaning Edge Cases (20 min)
Checkpoint
After this section you should be able to: profile any dataset systematically, perform univariate analysis for all data types, handle missing data with appropriate strategies, and detect and handle outliers.
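The missing-data and outlier skills in this checkpoint can be sketched as follows. This is one simple combination (median imputation plus Tukey's 1.5 x IQR rule), chosen for brevity; the pages above cover other strategies, and the `ages` data is made up.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one extreme value.
ages = pd.Series([23, 25, 27, np.nan, 24, 26, 95], name="age")

# Profile: what share of the column is missing?
missing_share = ages.isna().mean()

# One simple strategy: impute with the median, which is robust to the outlier.
ages_filled = ages.fillna(ages.median())

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = ages_filled.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (ages_filled < q1 - 1.5 * iqr) | (ages_filled > q3 + 1.5 * iqr)

print(f"missing: {missing_share:.0%}, outliers: {ages_filled[is_outlier].tolist()}")
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on the domain. The rule only tells you where to look.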
Week 5-7: EDA Advanced (Part 2 of 2; 69 EDA pages in total)
Estimated reading time: 10 hours
Bivariate & Multivariate
- [ ] Required -- Bivariate Num-Num (25 min)
- [ ] Required -- Bivariate Cat-Num (25 min)
- [ ] Required -- Bivariate Cat-Cat (20 min)
- [ ] Required -- Multivariate (25 min)
- [ ] Required -- Correlation Traps (20 min)
- [ ] Required -- Multicollinearity (20 min)
- [ ] Required -- Relational Data EDA (20 min)
Feature Engineering
- [ ] Required -- Feature Creation (25 min)
- [ ] Required -- Encoding Strategies (25 min)
- [ ] Required -- Scaling & Normalization (20 min)
- [ ] Required -- Transformations (20 min)
- [ ] Required -- Text Features (20 min)
- [ ] Required -- Datetime Features (20 min)
- [ ] Optional -- High Cardinality (20 min)
Advanced EDA Topics
- [ ] Required -- Imbalanced Data (20 min)
- [ ] Required -- Data Quality Validation (20 min)
- [ ] Required -- Data Leakage (20 min)
- [ ] Required -- Sampling Strategies (20 min)
- [ ] Required -- Data Drift (20 min)
- [ ] Required -- Visualization Decision Tree (15 min)
- [ ] Required -- EDA Checklist (15 min)
- [ ] Optional -- Automated EDA (20 min)
- [ ] Optional -- Communicating Findings (20 min)
- [ ] Optional -- EDA Ethics & Bias (20 min)
- [ ] Optional -- Explainability EDA (20 min)
- [ ] Optional -- Geospatial EDA (20 min)
- [ ] Optional -- EDA for Different Domains (20 min)
- [ ] Optional -- Large Datasets (20 min)
- [ ] Optional -- Small Datasets (15 min)
- [ ] Optional -- Reproducibility (20 min)
- [ ] Optional -- Common Mistakes (15 min)
- [ ] Optional -- Streamlit EDA App (20 min)
- [ ] Optional -- Post-Modeling EDA (20 min)
- [ ] Optional -- Image & Audio EDA (20 min)
Checkpoint
After this section you should be able to: perform bivariate and multivariate analysis, engineer features from raw data, detect data leakage, handle imbalanced datasets, and communicate findings effectively.
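One common way to avoid preprocessing leakage on imbalanced data is to split first and keep every fitted transform inside a pipeline. The sketch below assumes scikit-learn and uses synthetic data; the model and parameter choices are illustrative, not a recommendation from the pages above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (about 90% / 10%) stands in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Split first; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training rows only, so test-set
# statistics never leak into preprocessing.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Accuracy is printed only to keep the sketch short. On imbalanced data you would normally look at precision, recall, or ROC AUC instead.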
Week 7-8: Statistics Deep Dive
Estimated reading time: 4 hours
Statistical rigor separates data science from data analysis.
- [ ] Required -- Statistics & A/B Testing (30 min -- deep read)
- [ ] Required -- Statistical Test Selector (20 min -- revisit)
- [ ] Required -- Statistical Power (20 min)
- [ ] Required -- Evaluation Metrics (25 min)
- [ ] Required -- Cross-Validation (20 min)
A/B testing production reference:
- [ ] Optional -- A/B Testing Blueprint (15 min)
- [ ] Optional -- Statistical Significance (25 min)
- [ ] Optional -- Analysis Pipeline (25 min)
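To make the statistical power reading concrete, here is a rough sample-size sketch for a two-group comparison. It uses the standard normal-approximation formula n = 2 * ((z_(1-alpha/2) + z_(1-beta)) / d)^2; the effect size, alpha, and power values are assumptions picked for illustration, and an exact t-based calculation gives a slightly larger n.

```python
import math

from scipy.stats import norm

# Assumed inputs (illustrative, not from the source pages).
effect_size = 0.5  # Cohen's d: difference in means divided by the standard deviation
alpha = 0.05       # two-sided significance level
power = 0.80       # desired probability of detecting the effect

z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
z_beta = norm.ppf(power)           # quantile matching the desired power

# Normal-approximation sample size per group.
n_per_group = math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
print(n_per_group)
```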
Week 8-10: Machine Learning Foundations (30 pages)
Estimated reading time: 8 hours
Core Algorithms
- [ ] Required -- Machine Learning Overview (15 min)
- [ ] Required -- ML Workflow (25 min)
- [ ] Required -- Data Preparation (25 min)
- [ ] Required -- Linear Regression (30 min)
- [ ] Required -- Logistic Regression (25 min)
- [ ] Required -- Decision Trees (25 min)
- [ ] Required -- Random Forests (25 min)
- [ ] Required -- Gradient Boosting (30 min)
- [ ] Required -- SVM (25 min)
- [ ] Required -- KNN (20 min)
- [ ] Required -- Naive Bayes (20 min)
- [ ] Required -- Model Selection (25 min)
- [ ] Required -- Hyperparameter Tuning (25 min)
- [ ] Required -- Algorithm Selection Guide (20 min)
Unsupervised Learning & Ensembles
- [ ] Required -- Clustering (25 min)
- [ ] Required -- Dimensionality Reduction (25 min)
- [ ] Required -- Ensemble Methods (25 min)
Checkpoint
After this section you should be able to: implement and evaluate all major ML algorithms, tune hyperparameters, perform feature selection, and explain the bias-variance tradeoff.
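Hyperparameter tuning with cross-validation can be sketched as below, assuming scikit-learn. The grid is deliberately tiny so that it runs quickly; real searches usually cover more values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# A small, illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that `best_score_` is a cross-validated estimate on the data used for the search. Keep a separate held-out set if you need an unbiased final number.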
Week 10-11: ML Advanced Topics
Estimated reading time: 5 hours
- [ ] Required -- Feature Engineering Advanced (25 min)
- [ ] Required -- Anomaly Detection (20 min)
- [ ] Required -- Recommendation Systems (25 min)
- [ ] Required -- Time Series ML (25 min)
- [ ] Required -- ML Interpretability (25 min)
- [ ] Required -- ML Checklist (15 min)
- [ ] Optional -- Topic Modeling (20 min)
- [ ] Optional -- Association Rules (15 min)
Week 11-13: Deep Learning Foundations
Estimated reading time: 8 hours
Data scientists need deep learning for NLP, computer vision, and tabular data at scale.
- [ ] Required -- Deep Learning Overview (15 min)
- [ ] Required -- Neural Network Basics (35 min)
- [ ] Required -- PyTorch Fundamentals (30 min)
- [ ] Required -- Training Techniques (25 min)
- [ ] Required -- Transformers (30 min)
- [ ] Required -- BERT Family (25 min)
- [ ] Required -- Transfer Learning (25 min)
- [ ] Required -- NLP Fundamentals (25 min)
- [ ] Optional -- CNN (25 min)
- [ ] Optional -- Language Models (30 min)
- [ ] Optional -- Architecture Selection Guide (25 min)
- [ ] Optional -- Model Optimization (25 min)
- [ ] Optional -- DL Checklist (20 min)
Checkpoint
After this section you should be able to: build neural networks in PyTorch, fine-tune BERT for NLP tasks, apply transfer learning to custom datasets, and choose between classical ML and deep learning for a given problem.
Week 13-14: Model Serving & Production
Estimated reading time: 5 hours
Data scientists who can deploy their own models are considerably more valuable to a team.
- [ ] Required -- Model Serving (30 min)
- [ ] Required -- ML Pipelines (30 min)
- [ ] Required -- AI Testing (30 min)
- [ ] Required -- Docker Overview (15 min)
- [ ] Required -- Production Dockerfiles (25 min)
- [ ] Optional -- GPU Kubernetes (30 min)
- [ ] Optional -- CI/CD Overview (15 min)
Data pipeline cross-reference:
- [ ] Optional -- Pipeline Monitoring (20 min)
- [ ] Optional -- Great Expectations (25 min)
- [ ] Optional -- Pandera Validation (20 min)
Week 14-16: End-to-End Projects
Estimated reading time: 5 hours + project time
EDA Projects
- [ ] Required -- Project: Titanic (30 min)
- [ ] Required -- Project: E-Commerce (30 min)
- [ ] Required -- Project: Financial (30 min)
- [ ] Optional -- Project: Healthcare (30 min)
- [ ] Optional -- Project: NLP (30 min)
Capstone: Full Data Science Project
- Problem: Define a business question with real data
- EDA: Comprehensive exploratory analysis (following EDA Checklist)
- Feature Engineering: Create, encode, and scale features
- Modeling: Try 3+ algorithms, tune hyperparameters, evaluate
- Deep Learning: Apply neural networks if appropriate
- Deploy: Containerize and serve the best model
- Present: Communicate findings with visualizations
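The modeling step of the capstone ("try 3+ algorithms, evaluate") might start from a comparison loop like this one. It is a sketch assuming scikit-learn and one of its built-in datasets; your capstone would substitute its own data, candidates, and metric.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Three candidate models; scale-sensitive ones get a scaler inside the pipeline.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold ROC AUC for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```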
What You Will Be Able to Do After This Path
- Perform comprehensive EDA on any dataset (69 pages of techniques)
- Apply statistical methods with proper hypothesis testing and A/B testing
- Implement and evaluate all major ML algorithms (30 pages)
- Build deep learning models for NLP and vision tasks (25 pages)
- Engineer features that improve model performance
- Deploy models to production with monitoring
- Communicate findings effectively to stakeholders
Cross-References to Related Paths
- ML/DL Engineer Path -- Deep dive into DL architectures and research
- AI/ML Engineer Path -- LLM integration, RAG, agents, production AI
- Data Engineer Path -- Data pipelines, orchestration, lakehouse
- Backend Engineer Path -- APIs and infrastructure
- All EDA pages: EDA Overview -- index of all 69 topics
- All ML pages: Machine Learning Overview -- index of all 30 topics
- All DL pages: Deep Learning Overview -- index of all 25 topics
Total Progress
This path contains approximately 150 pages (69 EDA + 30 ML + 25 DL + statistics + tools + projects). Budget 16 weeks at 5 hours per week. The EDA section (weeks 3-7) is the largest and most important for data scientists.
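The time budget can be checked against the per-section reading estimates listed in this path. The sketch below copies those estimates and treats the remainder of the ~80-hour total as project time, which is an assumption; the path does not state how many hours the projects take.

```python
# Estimated reading hours per section, copied from the section headers in this path.
reading_hours = {
    "Week 1-2: Math Foundations": 5,
    "Week 2-3: Python Data Science Tools": 5,
    "Week 3-5: EDA Foundations": 8,
    "Week 5-7: EDA Advanced": 10,
    "Week 7-8: Statistics Deep Dive": 4,
    "Week 8-10: Machine Learning Foundations": 8,
    "Week 10-11: ML Advanced Topics": 5,
    "Week 11-13: Deep Learning Foundations": 8,
    "Week 13-14: Model Serving & Production": 5,
    "Week 14-16: End-to-End Projects": 5,
}

total_reading = sum(reading_hours.values())
budget = 16 * 5  # 16 weeks at 5 hours per week

# Assumption: whatever the budget leaves over is available for project work.
remaining_for_projects = budget - total_reading

print(total_reading, budget, remaining_for_projects)
```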