wevanjohnson · natLIMbio · May 6, 2026 · May 6, 2026 · May 11, 2026 · May 18, 2026
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ By the end of this course, students will:
 A working knowledge and experience with R, GitHub, and Unix is required. Students are required to complete GSND 5345Q (Fundamentals of Data Science) prior to taking this course, or have equivalent experience. An introductory course in statistics or biostatistics is also recommended (e.g., GSND 5135Q or MSBS N5030), or comparable experience in statistical analysis. Students without sufficient background in coding or statistics are expected to acquire these skills before the course begins. Please contact Dr. Johnson for a list of required proficiencies and available asynchronous resources to help build the necessary programming and statistical foundations.
 
 ## COURSE FORMAT:
-This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm.. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs.
+This class will be taught virtually using a synchronous remote modality, although students will be provided a classroom to gather for each lecture. A co-instructor will be present in the classroom for each lecture. Class will occur Mondays and Wednesdays from 12:00pm-1:50pm. Courses may also be recorded and made available for students who need to miss classes due to personal reasons, illness, or research related needs.
 
 ## ZOOM LINK AND CLASSROOM:
 Zoom Meeting ID for all sessions is Meeting ID: 929 7046 3934, with the password: 850292, or use the following direct link (the link is also available though the course GitHub page): https://rutgers.zoom.us/j/92970463934?pwd=wVF7nGblCNaAMon3fFlRwboSgEiDUg.1
@@ -69,12 +69,12 @@ This is an 8-week, 2.0 credit class near the end of Spring 2024. In general, you
 I **strongly** encourage you to contact early me if you have difficulty with the material. This course builds on material from prior lectures, so do not fall behind! My job is to help you understand the material as well as possible, and I am flexible with meeting times.
 
 ### ACADEMIC INTEGRITY: 
-You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu ). These principles forbid plagiarism and require that every Rutgers University student to:
+You are expected to have read and follow the guidelines at the university’s academic integrity website (http://academicintegrity.rutgers.edu). These principles forbid plagiarism and require that every Rutgers University student to:
 
 * Properly acknowledge and cite all use of the ideas, results, or words of others
 * Properly acknowledge all contributors to a given piece of work
 * Make sure that all work submitted as his or her own in a course or other academic activity isproduced without the aid of unsanctioned materials or unsanctioned collaboration
-* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: ttp://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/ ).
+* Treat all other students in an ethical manner, respecting their integrity and right to pursue their educational goals without interference. This requires that a student neither facilitate academic dishonesty by others nor obstruct their academic progress (reproduced from: http://academicintegrity.rutgers.edu/academic-integrity-at-rutgers/).
 
 Violations of academic integrity will be treated in accordance with university policy, and sanctions for violations may range from no credit for the assignment, to a failing course grade to (for the most severe violations) dismissal from the university.
 

diff --git a/homework/HW3BMDA.Rmd b/homework/HW3BMDA.Rmd
@@ -0,0 +1,316 @@
+---
+title: "HW 3 BMDA"
+author: "Nat Lim"
+date: "2026-06-13"
+output:
+  html_document: default
+  pdf_document: default
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = TRUE)
+```
+
+In Homework 1 you downloaded and QC'd the 33 FASTQ files for the HIV/TB dataset from SRA. In this assignment you will carry those reads through a complete RNA-seq differential expression analysis.
+
+## RNA-sequencing analysis
+
+##### 1. Align the HIV/TB reads (downloaded in Homework 1) to the human genome reference using your choice of the `Rsubread` or the `STAR` aligners. Confirm whether or not you were able to successfully complete this step (on your honor).
+
+  1. Yes
+  2. No
+
+```{r, eval=FALSE}
+require("BiocManager")
+
+library(Rsubread)
+buildindex("genomefile", "genomefile.fa")
+
+align("genomefile", "myfastqfile.fq", 
+      output_file="myfastq.bam", nthreads = 4)
+```
+
+
+##### 2. Use the `featureCounts` function in the `Rsubread` package to generate a counts file for this dataset. Upload your feature counts .txt file here.
+
+
+
+For the rest of the assignment, please use the feature counts provided by Dr. Johnson in the lecture materials: {https://github.com/wevanjohnson/2026_Spring_BMDA/raw/refs/heads/main/lectures/week6/example_data/features_combined.txt}{features_combined.txt (link)}
+
+
+
+##### 3. Generate a `SummarizedExperiment` object for your counts. The `colData` for these data are provided in the `homework2\_metadata.txt` file. What function is used to create a SummarizedExperiment object?
+
+  1. **SummarizedExperiment(assays = list(counts = as.matrix(counts)), colData = metadata)**
+  2. makeSummarizedExp(as.matrix(counts), metadata)
+  3. seObject(counts, metadata)
+  4. summarizeData(data = counts, metadata = metadata)
+  5. buildExperiment(assay = as.matrix(counts), metadata = metadata)
+
+```{r}
+library(SummarizedExperiment)
+
+counts <- read.delim("features_combined.txt", row.names = 1, check.names = FALSE)
+
+metadata <- read.delim("hivtb_SRR_metadata.txt")
+
+rownames(metadata) <- metadata$SampleID
+metadata <- metadata[colnames(counts), ]
+
+se <- SummarizedExperiment(assays = list(counts = as.matrix(counts)), colData = metadata
+)
+
+se
+```
+
+
+##### 4. Preprocess these data by removing `TB-HIV-ART` samples (should be two of them), removing any genes with zero expression for all samples, and by generating a log counts per million assay. How may genes were removed with the zero expression filter?
+
+1. 48
+2. **2,204**
+3. 9,700
+4. 17,385
+
+```{r}
+se <- se[, se$disease_status != "TB-HIV-ART"]
+dim(se)
+
+keep <- rowSums(assay(se, "counts")) > 0
+n_removed <- sum(!keep)
+n_removed
+
+se <- se[keep, ]
+dim(se)
+
+library(edgeR)
+logcpm <- cpm(assay(se, "counts"), log = TRUE)
+assay(se, "logCPM") <- logcpm
+
+se
+```
+
+
+##### 5. Create a batch corrected assay in your `SummarizedExperiment` using ComBat-Seq. You can use the `ComBat\_Seq` function in the `sva` package, or simply do it in `BatchQC` and then extract the `SummarizedExperiment`. For this example, pretend that `disease\_status` is the batch variable. Note that this will remove the disease status variability, so **don't** use this assay in the following analyses! This was merely a practice for cases where you have an actual batch variable. What command/syntax would work to apply ComBat-Seq?
+
+  1. **ComBat_seq(as.matrix(assay(se, "counts")), batch = se$disease_status)**
+  2. ComBat_seq(counts = assay(se), batch = disease_status)
+  3. BatchFix(se, "counts", batch = "disease_status")
+  4. combatCorrect(se$counts, batch = se\$disease_status)
+  5. ComBat(se, "counts", batch = "disease_status")
+
+```{r, eval=FALSE}
+library(sva)
+
+combat_counts <- ComBat_seq(as.matrix(assay(se, "counts")), batch = se$disease_status)
+
+assay(se, "combat_counts") <- combat_counts
+
+se
+```
+
+
+
+##### 6. Apply PCA and UMAP to your data and generate dimension reduction plots for the results. Color the TB-HIV to the HIV only samples in different colors. Note that you should be using the log CPM values for this analysis. Please attach your UMAP plot here. 
+
+```{r}
+library(umap)
+
+logcpm_mat <- t(assay(se, "logCPM"))
+
+pca <- prcomp(logcpm_mat, scale. = TRUE)
+pca_df <- data.frame(
+  PC1 = pca$x[,1],
+  PC2 = pca$x[,2],
+  disease_status = se$disease_status
+)
+
+library(ggplot2)
+ggplot(pca_df, aes(PC1, PC2, color = disease_status)) +
+  geom_point(size = 3) +
+  theme_minimal() +
+  labs(title = "PCA of log CPM values")
+
+set.seed(42)
+umap_res <- umap(logcpm_mat)
+umap_df <- data.frame(
+  UMAP1 = umap_res$layout[,1],
+  UMAP2 = umap_res$layout[,2],
+  disease_status = se$disease_status
+)
+
+ggplot(umap_df, aes(UMAP1, UMAP2, color = disease_status)) +
+  geom_point(size = 3) +
+  theme_minimal() +
+  labs(title = "UMAP of log CPM values")
+```
+
+
+##### 7. Use `DESeq2` to do a differential expression analysis (on the counts) comparing the TB-HIV to the HIV only samples. What is the twenty second gene on the differentially expressed gene list?
+
+  1. IGF2BP3
+  2. AIM2
+  3. IL1R2
+  4. RAB20
+  5. **FCGR1C**
+
+```{r}
+library(DESeq2)
+
+dds <- DESeqDataSetFromMatrix(
+  countData = assay(se, "counts"),
+  colData   = colData(se),
+  design    = ~ disease_status
+)
+
+dds <- DESeq(dds)
+res <- results(dds, contrast = c("disease_status", "TB-HIV", "HIV"))
+
+# Order by significance (typically by p-value)
+res_ordered <- res[order(res$pvalue), ]
+head(as.data.frame(res_ordered), 25)
+
+# The 22nd gene:
+rownames(res_ordered)[22]
+```
+
+
+
+##### 8. Now conduct the same analysis using `limma` on the log CPM values. How do the `DESeq2` results compare to the `limma` results? What is the seventh gene on the differentially expressed gene list?
+
+  1. IGF2BP3
+  2. AIM2
+  3. **IL1R2**
+  4. RAB20
+  5. FCGR1C
+
+```{r}
+library(limma)
+
+logcpm_mat <- assay(se, "logCPM")  # genes as rows, samples as columns
+
+design <- model.matrix(~ disease_status, data = colData(se))
+
+fit <- lmFit(logcpm_mat, design)
+fit <- eBayes(fit)
+
+res_limma <- topTable(fit, coef = 2, number = Inf, sort.by = "P")
+head(res_limma, 25)
+
+# The 7th gene:
+rownames(res_limma)[7]
+```
+
+##### 9. Give a heatmap plot of either the `DESeq2` or the `limma` results (top 50). Add a colorbar for disease status. Upload your plot here.
+
+```{r}
+library(ComplexHeatmap)
+library(circlize)
+
+# Top 50 genes from DESeq2 results
+top50 <- rownames(res_ordered)[1:50]
+
+# Pull their logCPM values
+mat <- assay(se, "logCPM")[top50, ]
+
+# Scale by row (z-score) so the heatmap shows relative expression patterns
+mat_scaled <- t(scale(t(mat)))
+
+# Annotation colorbar for disease status
+ha <- HeatmapAnnotation(
+  disease_status = se$disease_status,
+  col = list(disease_status = c("TB-HIV" = "firebrick", "HIV" = "steelblue"))
+)
+
+Heatmap(
+  mat_scaled,
+  name = "z-score",
+  top_annotation = ha,
+  show_row_names = TRUE,
+  show_column_names = TRUE,
+  column_title = "Top 50 DESeq2 DE Genes (TB-HIV vs HIV)",
+  row_names_gp = grid::gpar(fontsize = 6)
+)
+```
+
+
+##### 10. Conduct a pathway analysis of the top 50 Limma genes using enrichR (through R or online). Use the "Reactome\_Pathways\_2024" database. What is the top scoring pathway?
+
+  1. "Interferon Gamma Signaling"
+  2. "Fatty Acid Metabolism"
+  3. **"Cytokine Signaling in Immune System"**
+  4. "RAB Geranylgeranylation"
+
+```{r}
+require("enrichR")  # if not already installed
+library(enrichR)
+
+# Confirm the database is available
+dbs <- listEnrichrDbs()
+"Reactome_Pathways_2024" %in% dbs$libraryName
+
+# Top 50 limma genes
+top50_limma <- rownames(res_limma)[1:50]
+
+# Run enrichment
+enriched <- enrichr(top50_limma, databases = "Reactome_Pathways_2024")
+
+# View top results
+reactome_results <- enriched[["Reactome_Pathways_2024"]]
+head(reactome_results[order(reactome_results$P.value), c("Term","P.value","Adjusted.P.value")], 50)
+
+# Top scoring pathway:
+reactome_results$Term[which.min(reactome_results$P.value)]
+```
+
+
+##### 11. Conduct a TBSignatureProfiler analysis on these data -- including signature heatmaps, individual boxplots, and AUC boxplots. Interpret your findings. Upload one of your TBSignatureProfiler plots here.
+
+```{r}
+library(TBSignatureProfiler)
+
+# Run profiling using ssGSEA on the logCPM assay
+sigs_result <- runTBsigProfiler(
+  input        = se,
+  useAssay     = "logCPM",
+  algorithm    = "ssGSEA",
+  signatures   = TBsignatures,   # the full built-in signature library
+  parallel.sz  = 1,
+  combineSigAndAlgorithm = TRUE
+)
+
+# Check what signature score columns got added
+colnames(colData(sigs_result))
+```
+```{r}
+sigs_to_plot <- c("Chen_HIV_4", "Sambarey_HIV_10", "Sweeney_OD_3", "Zak_RISK_16")
+
+signatureHeatmap(
+  sigs_result,
+  signatureColNames = sigs_to_plot,
+  annotationColNames = "disease_status",
+  scale = TRUE,
+  showColumnNames = FALSE,
+  name = "TB/HIV Signature Scores"
+)
+```
+
+```{r}
+signatureBoxplot(
+  inputData = sigs_result,
+  name = "Chen_HIV_4 Signature Score",
+  signatureColNames = "Chen_HIV_4",
+  annotationColName = "disease_status"
+)
+```
+
+```{r}
+auc_table <- tableAUC(
+  SE_scored = sigs_result,
+  annotationColName = "disease_status",
+  signatureColNames = sigs_to_plot,
+  num.boot = 100
+)
+
+auc_table
+```
+
diff --git a/homework/HW3BMDA.html b/homework/HW3BMDA.html
diff --git a/homework/HW3BMDA_files/figure-latex/unnamed-chunk-11-1.pdf b/homework/HW3BMDA_files/figure-latex/unnamed-chunk-11-1.pdf
diff --git a/homework/HW3BMDA_files/figure-latex/unnamed-chunk-12-1.pdf b/homework/HW3BMDA_files/figure-latex/unnamed-chunk-12-1.pdf
diff --git a/homework/HW3BMDA_files/figure-latex/unnamed-chunk-5-1.pdf b/homework/HW3BMDA_files/figure-latex/unnamed-chunk-5-1.pdf
diff --git a/homework/HW3BMDA_files/figure-latex/unnamed-chunk-5-2.pdf b/homework/HW3BMDA_files/figure-latex/unnamed-chunk-5-2.pdf
diff --git a/homework/HW3BMDA_files/figure-latex/unnamed-chunk-8-1.pdf b/homework/HW3BMDA_files/figure-latex/unnamed-chunk-8-1.pdf