R is a language built for statistics. That means it thinks differently from most programming languages — and that difference is worth understanding from the ground up.
This repository covers R systematically, from the basics of the language to applied machine learning. The code is organized in sections that build on each other. You can read them in order or jump to what you need.
The applied section at the end uses the National Survey of Children's Health (NSCH) 2023 — public microdata from the U.S. Census Bureau — to demonstrate the full analytical workflow in a real context.
r-from-scratch/
├── data/ # Sample files + NSCH download instructions
├── install_packages.R # Install all CRAN dependencies at once
├── 01_basics/ # Data types, operators, coercion
├── 02_data_structures/ # Vectors, matrices, lists, data frames, factors
├── 03_control_flow/ # Conditionals, loops
├── 04_functions/ # Basics, lexical scope, closures, recursion
├── 05_apply_family/ # apply, lapply, sapply, tapply, mapply, purrr
├── 06_strings_and_dates/ # stringr, regex, lubridate
├── 07_io_and_data_import/ # CSV, Excel, JSON, XML, databases, web scraping
├── 08_data_manipulation/ # Base R, dplyr, tidyr, data.table
├── 09_visualization/ # Base graphics, ggplot2, plotly
├── 10_statistics/ # Descriptive stats, distributions, hypothesis testing, regression
├── 11_debugging_and_performance/ # Debugging, profiling, benchmarking, parallel computing
├── 12_oop/ # S3, S4, Reference Classes (RC, informally "R5")
├── 13_functional_programming/ # Higher-order functions, memoization, pipe operators
├── 14_environment_and_packages/ # Namespaces, package creation, renv
├── 15_reporting/ # R Markdown, parameterized reports, Quarto
├── 16_ml_supervised/ # KNN, decision trees, random forest, SVM, naive Bayes
├── 17_ml_unsupervised/ # K-means, hierarchical clustering, DBSCAN
├── 18_dimensionality_reduction/ # PCA, t-SNE, UMAP
├── 19_model_evaluation/ # Confusion matrix, cross-validation, ROC/AUC, benchmarking
└── 20_applied_project/ # End-to-end analysis on NSCH 2023 (Florida subset)
Language fundamentals (start here if R is new)
01_basics → 02_data_structures → 03_control_flow → 04_functions → 05_apply_family
Data science workflow
06_strings_and_dates → 07_io_and_data_import → 08_data_manipulation → 09_visualization → 10_statistics
Machine learning
16_ml_supervised → 17_ml_unsupervised → 18_dimensionality_reduction → 19_model_evaluation → 20_applied_project
Full curriculum: follow sections 01–20 in order. Each folder is self-contained; later sections assume familiarity with earlier ones.
.R files contain code and inline comments. They run as-is in R or RStudio.
.Rmd files combine code, output, and explanation in a single document. Render them with:
rmarkdown::render("file.Rmd")
.qmd files use Quarto. Render with:
quarto render file.qmd
Clone the repository and open it in RStudio or any R environment.
git clone https://github.com/samuelfabel/r-from-scratch.git
cd r-from-scratch
Install all dependencies once:
source("install_packages.R")
Each section is self-contained. Dependencies are loaded at the top of each file. If a single package is missing, install it with:
install.packages("package_name")
For reproducible package management with pinned versions, see 14_environment_and_packages/renv_reproducibility.R and run renv::init() in the project root.
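The single-package fallback above extends naturally to a list of packages. The following is a minimal sketch, not the contents of install_packages.R: the helper `find_missing_packages()` and the package list are hypothetical and chosen only for illustration. It reports which packages are unavailable and leaves the actual install commented out.

```r
# Hypothetical helper (not part of the repository): return the names in
# `pkgs` whose namespaces cannot be loaded from the current library paths.
find_missing_packages <- function(pkgs) {
  available <- vapply(
    pkgs,
    function(p) requireNamespace(p, quietly = TRUE),
    logical(1)
  )
  pkgs[!available]
}

# Illustrative list only; the real dependencies live in install_packages.R.
needed <- c("stats", "utils")
missing_pkgs <- find_missing_packages(needed)

if (length(missing_pkgs) > 0) {
  message("Missing: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All listed packages are available.")
}
```

Checking before installing avoids reinstalling packages that are already present, which matters most on slow connections or in shared libraries.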
Sample data for section 07 is in data/. The applied project requires a separate NSCH download; see data/README.md.
From the project root, after npm install:
npm run check
This runs markdown lint on all .md files and R syntax checks on all .R, .Rmd, and .qmd files. With R installed, you can also run:
Rscript scripts/check_syntax.R
GitHub Actions runs the same checks on every push (see .github/workflows/check-syntax.yml).
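For readers curious what an R syntax check involves, here is a minimal sketch. It is not the contents of scripts/check_syntax.R; the helper `parses_ok()` is hypothetical. It relies on base R's `parse()`, which reads a file and raises an error on invalid syntax without evaluating any code. It handles plain .R files only; .Rmd and .qmd files would first need their code chunks extracted (for example with `knitr::purl()`).

```r
# Hypothetical helper: TRUE if `path` parses as valid R code, FALSE otherwise.
# parse() only checks syntax; it does not evaluate the file.
parses_ok <- function(path) {
  tryCatch({
    parse(file = path)
    TRUE
  }, error = function(e) FALSE)
}

# Demonstrate on two temporary files rather than on repository files.
good <- tempfile(fileext = ".R")
bad <- tempfile(fileext = ".R")
writeLines("x <- c(1, 2, 3)", good)
writeLines("x <- c(1, 2,", bad)  # unbalanced parenthesis

results <- vapply(c(good = good, bad = bad), parses_ok, logical(1))
print(results)
```

Because nothing is evaluated, a check like this is safe to run on scripts that download data or write files.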
Section 20_applied_project/ applies the techniques from sections 16–19 to a real dataset.
Data source: National Survey of Children's Health (NSCH) 2023, U.S. Census Bureau / HRSA Maternal and Child Health Bureau.
- Download: https://www.census.gov/programs-surveys/nsch/data/datasets/nsch2023.html
- Format: SAS (.sas7bdat), read with haven::read_sas()
- Place at: data/nsch_2023_topical.sas7bdat
The analysis filters to Florida and uses demographic and socioeconomic predictors to model parent-reported ASD diagnosis. It runs through data exploration, preprocessing, dimensionality reduction, and comparison of supervised classification models.
See 20_applied_project/README.md for the full pipeline, variables, and constraints.
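The first step of that pipeline, reading the file and subsetting to Florida, might look like the sketch below. This is an illustration under stated assumptions, not the project's code: the helper `filter_to_state()` is hypothetical, and the column names `FIPSST` (state FIPS code) and `K2Q35A` (the ASD item) are assumed from the NSCH codebook and should be verified against the 2023 documentation. Florida's FIPS code is 12. The toy data frame stands in for the real file so the example runs without the download.

```r
# Hypothetical helper: keep rows whose state FIPS code equals `fips`.
# `state_col` defaults to "FIPSST", an assumed NSCH column name.
filter_to_state <- function(df, fips, state_col = "FIPSST") {
  stopifnot(state_col %in% names(df))
  # Compare as integers so that "12" and 12 both match.
  keep <- as.integer(as.character(df[[state_col]])) == as.integer(fips)
  df[!is.na(keep) & keep, , drop = FALSE]
}

# Toy stand-in for the survey file; the values are arbitrary.
toy <- data.frame(
  FIPSST = c("12", "06", "12", "48"),
  K2Q35A = c(2, 1, 1, 2),
  stringsAsFactors = FALSE
)
florida_toy <- filter_to_state(toy, 12)
print(florida_toy)

# With the real file in place (and haven installed), the same helper applies.
nsch_path <- "data/nsch_2023_topical.sas7bdat"
if (file.exists(nsch_path) && requireNamespace("haven", quietly = TRUE)) {
  nsch <- haven::read_sas(nsch_path)
  florida <- filter_to_state(nsch, 12)
}
```

Filtering early keeps later steps fast, since the national file is much larger than a single-state subset.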
- R >= 4.1.0
- RStudio (recommended) or any R environment
- Quarto CLI (for .qmd files in section 15)
This repository started as a fork of the Johns Hopkins R Programming assignment on Coursera (cachematrix.R, 2015). That file now lives in 13_functional_programming/memoization.R, which is where it conceptually belongs.
- R Programming for Data Science, Roger D. Peng — Chapter 4: R Nuts and Bolts
- Advanced R, Hadley Wickham — Chapter 3: Vectors
- R for Data Science, Wickham et al. — https://r4ds.hadley.nz
- An Introduction to Statistical Learning, James et al. — https://www.statlearning.com
MIT — see LICENSE.