Laura's Portfolio

Logo

Classwork for BIMM143

View the Project on GitHub Lauralu05/bimm143_github

Class 14: RNASeq Mini Project

Laura Lu (PID: A17844089)

Background

Here we work through a complete RNASeq analysis project. The input data ceoms from a knock-down experiment of a HOX gene

Data Import

Reading the counts and metadata CSV files

counts <- read.csv("GSE37704_featurecounts.csv", row.names = 1)
metadata <- read.csv("GSE37704_metadata.csv") 

Check on data structure

head(counts)
                length SRR493366 SRR493367 SRR493368 SRR493369 SRR493370
ENSG00000186092    918         0         0         0         0         0
ENSG00000279928    718         0         0         0         0         0
ENSG00000279457   1982        23        28        29        29        28
ENSG00000278566    939         0         0         0         0         0
ENSG00000273547    939         0         0         0         0         0
ENSG00000187634   3214       124       123       205       207       212
                SRR493371
ENSG00000186092         0
ENSG00000279928         0
ENSG00000279457        46
ENSG00000278566         0
ENSG00000273547         0
ENSG00000187634       258
metadata
         id     condition
1 SRR493366 control_sirna
2 SRR493367 control_sirna
3 SRR493368 control_sirna
4 SRR493369      hoxa1_kd
5 SRR493370      hoxa1_kd
6 SRR493371      hoxa1_kd

Some book-keeping is required as there looks to be a mis-match between metadata and counts columns

ncol(counts)
[1] 7
nrow(metadata) 
[1] 6

Looks like we need to get rid of the first “length” column of our counts object.

cleancounts <- counts[ , -1]
colnames(cleancounts) 
[1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"
metadata$id
[1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"
all( colnames(cleancounts) == metadata$id ) 
[1] TRUE

Remove zero count genes

There are lots of genes with zero counts. We can remove these.

head(cleancounts)
                SRR493366 SRR493367 SRR493368 SRR493369 SRR493370 SRR493371
ENSG00000186092         0         0         0         0         0         0
ENSG00000279928         0         0         0         0         0         0
ENSG00000279457        23        28        29        29        28        46
ENSG00000278566         0         0         0         0         0         0
ENSG00000273547         0         0         0         0         0         0
ENSG00000187634       124       123       205       207       212       258
to.keep.inds <- rowSums(cleancounts) > 0 
nonzero_counts <- cleancounts[to.keep.inds,]

DESeq analysis

Load the package

library(DESeq2) 
Warning: package 'DESeq2' was built under R version 4.5.2

Setup DESeq object

dds <- DESeqDataSetFromMatrix(countData = nonzero_counts, 
                              colData = metadata, 
                              design = ~condition) 
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

run DESeq

dds <- DESeq(dds)
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

get results

res <- results(dds) 
head(res) 
log2 fold change (MLE): condition hoxa1 kd vs control sirna 
Wald test p-value: condition hoxa1 kd vs control sirna 
DataFrame with 6 rows and 6 columns
                 baseMean log2FoldChange     lfcSE       stat      pvalue
                <numeric>      <numeric> <numeric>  <numeric>   <numeric>
ENSG00000279457   29.9136      0.1792571 0.3248216   0.551863 5.81042e-01
ENSG00000187634  183.2296      0.4264571 0.1402658   3.040350 2.36304e-03
ENSG00000188976 1651.1881     -0.6927205 0.0548465 -12.630158 1.43990e-36
ENSG00000187961  209.6379      0.7297556 0.1318599   5.534326 3.12428e-08
ENSG00000187583   47.2551      0.0405765 0.2718928   0.149237 8.81366e-01
ENSG00000187642   11.9798      0.5428105 0.5215598   1.040744 2.97994e-01
                       padj
                  <numeric>
ENSG00000279457 6.86555e-01
ENSG00000187634 5.15718e-03
ENSG00000188976 1.76549e-35
ENSG00000187961 1.13413e-07
ENSG00000187583 9.19031e-01
ENSG00000187642 4.03379e-01

Data Visualization

Volcano Plot

dds <- DESeq(dds)
using pre-existing size factors

estimating dispersions

found already estimated dispersions, replacing these

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing
res <- results(dds)
library(ggplot2) 

ggplot(res) + 
  aes(log2FoldChange, -log(padj) ) + 
  geom_point()
Warning: Removed 1237 rows containing missing values or values outside the scale range
(`geom_point()`).

Add threshold lines for fold-change and P-value and color our subset of genes that make these threshold cut-offs in the plot

Add Annotation

Add gene symbols and entrez id

BiocManager::install("org.Hs.eg.db", force = TRUE)
Bioconductor version 3.22 (BiocManager 1.30.27), R 4.5.1 (2025-06-13)

Installing package(s) 'org.Hs.eg.db'

installing the source package 'org.Hs.eg.db'

Warning in install.packages(...): installation of package 'org.Hs.eg.db' had
non-zero exit status
library(AnnotationDbi)
library(org.Hs.eg.db)
res$symbol <- mapIds(
  x = org.Hs.eg.db, 
  keys = row.names(res),
  keytype = "ENSEMBL",
  column = "SYMBOL"
)
'select()' returned 1:many mapping between keys and columns
res$entrez <- mapIds(
  x = org.Hs.eg.db, 
  keys = row.names(res),
  keytype = "ENSEMBL",
  column = "ENTREZID"
)
'select()' returned 1:many mapping between keys and columns

Pathway Analysis

Run gage analysis

library(gage)
library(gageData) 
library(pathview) 

We need a named vector of fold-change values as input for gage

foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
head(foldchanges)
       <NA>      148398       26155      339451       84069       84808 
 0.17925708  0.42645712 -0.69272046  0.72975561  0.04057653  0.54281049 
data(kegg.sets.hs) 

keggres = gage(foldchanges, gsets=kegg.sets.hs)
head(keggres$less, 2) 
                            p.geomean stat.mean        p.val       q.val
hsa04110 Cell cycle      8.995727e-06 -4.378644 8.995727e-06 0.001889103
hsa03030 DNA replication 9.424076e-05 -3.951803 9.424076e-05 0.009841047
                         set.size         exp1
hsa04110 Cell cycle           121 8.995727e-06
hsa03030 DNA replication       36 9.424076e-05
pathview(pathway.id = "hsa04110", gene.data = foldchanges) 
'select()' returned 1:1 mapping between keys and columns

Info: Working in directory /Users/lauralu/Desktop/untitled folder/lecture 4/BIMM 143/Class 14

Info: Writing image file hsa04110.pathview.png

Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)
Warning in cbind(blk.ind, j): number of rows of result is not a multiple of
vector length (arg 2)

GO terms

Same analysis but using GO genesets rather than KEGG

data(go.sets.hs) 
data(go.subs.hs) 

# Focus on Biological Process subset of GO 

gobpsets = go.sets.hs[go.subs.hs$BP] 

gobpres = gage(foldchanges, gsets=gobpsets) 
head(gobpres$less) 
                                            p.geomean stat.mean        p.val
GO:0048285 organelle fission             1.536227e-15 -8.063910 1.536227e-15
GO:0000280 nuclear division              4.286961e-15 -7.939217 4.286961e-15
GO:0007067 mitosis                       4.286961e-15 -7.939217 4.286961e-15
GO:0000087 M phase of mitotic cell cycle 1.169934e-14 -7.797496 1.169934e-14
GO:0007059 chromosome segregation        2.028624e-11 -6.878340 2.028624e-11
GO:0000236 mitotic prometaphase          1.729553e-10 -6.695966 1.729553e-10
                                                q.val set.size         exp1
GO:0048285 organelle fission             5.841698e-12      376 1.536227e-15
GO:0000280 nuclear division              5.841698e-12      352 4.286961e-15
GO:0007067 mitosis                       5.841698e-12      352 4.286961e-15
GO:0000087 M phase of mitotic cell cycle 1.195672e-11      362 1.169934e-14
GO:0007059 chromosome segregation        1.658603e-08      142 2.028624e-11
GO:0000236 mitotic prometaphase          1.178402e-07       84 1.729553e-10

Reactome

Lots of folks like the reactome web interface. You can also run this as an R function but lets look at the website first < https://reactome.org/ >

The website wants a text file with one gene symbol per line of the genes you want to map to pathways.

sig_genes <- res[res$padj <= 0.05 & !is.na(res$padj), ]$symbol 
head(sig_genes) 
ENSG00000187634 ENSG00000188976 ENSG00000187961 ENSG00000188290 ENSG00000187608 
       "SAMD11"         "NOC2L"        "KLHL17"          "HES4"         "ISG15" 
ENSG00000188157 
         "AGRN" 
#res$symbol

and write out to a file:

write.table(sig_genes, file="significant_genes.txt",
            row.names=FALSE, col.names=FALSE, quote=FALSE)

Save Our Results

write.csv(res, file="myresults.csv")