Bulk RNAseq: Differential Expression Analysis

Environment Setup

salloc -N 1 --exclusive -p amd -t 8:00:00
# use the same env used for #1 
conda activate smallrna
# working dir
mkdir -p /work/LAS/geetu-lab/arnstrm/mouse.ev.RNAseq
cd /work/LAS/geetu-lab/arnstrm/mouse.ev.RNAseq
# file structure
tree -L 1
.
├── 1_data
├── 2_fastqc
├── 3_STAR-mapping
├── 4_featureCounts
└── 5_multiqc

Raw data

Raw data was downloaded from the sequencing facility using the rsync command, with authentication. The downloaded files were checked for md5sum and compared against list of files expected as per the input samples provided.

cd 1_data
rsync -rltvPh ext-arnstrm@download.bioinformatics.missouri.edu:xyz .
# link masked 
# GEO link will be included later

Genome/annotation

Additional files required for the analyses were downloaded from GenCode. The downloaded files are as follows:

cd 3_STAR-mapping
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/GRCm39.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gff3.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gtf.gz
gunzip GRCm39.primary_assembly.genome.fa.gz
gunzip gencode.vM30.annotation.gff3.gz
gunzip gencode.vM30.annotation.gtf.gz
# ids for prot coding genes
awk '$3=="gene" {print $9}' gencode.vM30.annotation.gff3 |\
  cut -f 1-3 -d ";" |\
  sed -e 's/;/\t/g' -e 's/gene_id=//g' -e 's/ID=//g' |\
  grep "gene_type=protein_coding" |\
  cut -f 1 > mm10.protein_coding

FastQC

Quality inspection of the reads. The multiqc report, collating all samples together are provided as html file.

cd 2_fastqc
for fq in ../1_data/*.fq.gz; do
  fastqc --threads $SLURM_JOB_CPUS_PER_NODE $fq;
done

Mapping

To index the genome, following command was run (in an interactive session).

fastaGenome="GRCm39.genome.fa"
gtf="gencode.vM30.annotation.gtf"
STAR --runThreadN $SLURM_JOB_CPUS_PER_NODE \
     --runMode genomeGenerate \
     --genomeDir $(pwd) \
     --genomeFastaFiles $fastaGenome \
     --sjdbGTFfile $gtf \
     --sjdbOverhang 1

Each fastq file was mapped to the indexed genome as using runSTAR_map.sh script shown below:

#!/bin/bash
read1=$1
read2=$(echo ${read1} | sed 's/_L003_R1_001.fastq.gz/_L003_R2_001.fastq.gz/g')
cpus=${SLURM_JOB_CPUS_PER_NODE}
out=$(basename ${read1} | sed 's/_L003_R1_001.fastq.gz//g')
index=/work/LAS/geetu-lab/arnstrm/GRCm39_index
STAR \
--runThreadN ${cpus} \
--genomeDir ${index} \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--outFileNamePrefix ${out}_ \
--readFilesCommand zcat \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outWigNorm RPM \
--readFilesIn ${read1} ${read2}

Mapping was run with a simple loop:

for fq in *_R1_*fastq.gz; do
  runSTAR_map.sh $fq;
done

Counts

For generating counts from the mapped reads, we used subread package program featureCounts. All bam files were supplied togehter to generate a single count file for individual samples.

cd 3_STAR-mapping
realpath *.bam > ../4_featureCounts/bam.fofn
cd ../4_featureCounts
while read line; do
ln -s $line;
done
featureCounts \
   -T ${SLURM_CPUS_ON_NODE} \
   -a gencode.vM30.annotation.gtf \
   -t exon \
   -g gene_id \
   -p \
   -B \
   --countReadPairs \
   -o merged_counts.txt \
   --tmpDir ./tmp *.bam

The generated counts file was processed to use it direclty with DESeq2

cut -f 1,7- merged_counts.txt |\
   grep -v "^#" |\
  sed 's/_S...\?_Aligned.sortedByCoord.out.bam//g' |\
  sed 's/mNCSC_//g' > merged_counts-clean.tsv
grep -Fw -f mm10.protein_coding merged_counts-clean.tsv > body
head -n 1 merged_counts-clean.tsv > head
cat head body >> merged_counts-clean-prot.tsv
rm head body

Create a info file:

head -n 1 merged_counts-clean.tsv |\
   tr "\t" "\n" |\
   grep -v "^Geneid" |\
   awk '{print $1"\t"$1}' |\
   sed 's/_.$//g' > info.tsv

Differential expression analysis

Differential expression (DE) analyses using DESeq2 was performed as shown below.

Prerequisites

R packages required for this section are loaded

# set path
setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
# load the modules
library(tidyverse)
library(DESeq2)
library(vsn)
library(pheatmap)
library(ggrepel)
library(RColorBrewer)
library(reshape2)
require(biomaRt)
library(TissueEnrich)
library(plotly)
library(cowplot)
library(biomaRt)
library(scales)
library(kableExtra)
library(htmlwidgets)
library(DT)
library(enrichR)

Import datasets

The counts data and its associated metadata (coldata) are imported for analyses.

countsFile = 'assets/merged_counts-clean-prot.tsv'
groupFile = 'assets/info.tsv'
coldata <-
  read.csv(
    groupFile,
    row.names = 1,
    sep = "\t",
    header = FALSE,
    stringsAsFactors = TRUE
  )
colnames(coldata) <- "condition"
cts <- as.matrix(read.delim(countsFile, row.names = 1, header = TRUE))

Reorder columns of cts according to coldata rows. Check if samples in both files match.

all(rownames(coldata) %in% colnames(cts))
#> [1] TRUE
cts <- cts[, rownames(coldata)]

DESeq2

The batch corrected read counts are then used for running DESeq2 analyses

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
vsd <- vst(dds, blind = FALSE)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
dds <- DESeq(dds)

PCA plot for QC

PCA plot for the dataset that includes all libraries.

rv <- rowVars(assay(vsd))
pcaData <-
  plotPCA(vsd,
          intgroup = "condition",
          returnData = TRUE)
percenttVar1 <- round(100 * attr(pcaData, "percentVar"))
# @ Thomas W. Battaglia

#' Plot DESeq2's PCA plotting with Plotly 3D scatterplot
#'
#' The function will generate a plot_ly 3D scatter plot image for
#' a 3D exploration of the PCA.
#'
#' @param object a DESeqTransform object, with data in assay(x), produced for example by either rlog or varianceStabilizingTransformation.
#' @param intgroup interesting groups: a character vector of names in colData(x) to use for grouping
#' @param ntop number of top genes to use for principal components, selected by highest row variance
#' @param returnData should the function only return the data.frame of PC1, PC2 and PC3 with intgroup covariates for custom plotting (default is FALSE)
#' @return An object created by plot_ly, which can be assigned and further customized.
#' @export
plotPCA3D <-
  function (object,
            intgroup = "condition",
            ntop = 500,
            returnData = FALSE) {
    rv <- rowVars(assay(object))
    select <-
      order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    pca <- prcomp(t(assay(object)[select,]))
    percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
    if (!all(intgroup %in% names(colData(object)))) {
      stop("the argument 'intgroup' should specify columns of colData(dds)")
    }
    intgroup.df <-
      as.data.frame(colData(object)[, intgroup, drop = FALSE])
    group <- if (length(intgroup) > 1) {
      factor(apply(intgroup.df, 1, paste, collapse = " : "))
    }
    else {
      colData(object)[[intgroup]]
    }
    d <- data.frame(
      PC1 = pca$x[, 1],
      PC2 = pca$x[, 2],
      PC3 = pca$x[, 3],
      group = group,
      intgroup.df,
      name = colnames(object)
    )
    if (returnData) {
      attr(d, "percentVar") <- percentVar[1:3]
      return(d)
    }
    message("Generating plotly plot")
    p <- plotly::plot_ly(
      data = d,
      x = ~ PC1,
      y = ~ PC2,
      z = ~ PC3,
      color = group,
      mode = "markers",
      type = "scatter3d"
    )
    return(p)
  }


select <-
  order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(assay(vsd)[select, ]))
percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
intgroup = "condition"
intgroup.df <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
group <- if (length(intgroup) == 1) {
  factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
d <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  intgroup.df,
  name = colnames(vsd)
)
ggplot(d, aes(PC1, PC2, color = condition)) +
  scale_shape_manual(values = 1:12) +
  scale_color_manual(values = c('mTSC_EV'       = '#0178c7',
                                'No_EV'     = '#7acc37',
                                'pTGC_EV' = '#8e27af')) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  geom_text_repel(aes(label = name)) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))
Figure 4: PCA plot for the first 2 principal components

Figure 4: PCA plot for the first 2 principal components

Interactive 3D PCA plot

g= plotPCA3D(vsd, intgroup = "condition")
g
saveWidget(g, file = "PCA.html")

View the interactive 3D PCA plot in a new tab.

Sample distance for QC

sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- colnames(vsd)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)
Figure 5: Euclidean distance between samples

Figure 5: Euclidean distance between samples

Set contrasts and find DE genes

res.NoEvsTSC <-
  results(dds,
          contrast = c(
            "condition",
            "No_EV",
            "mTSC_EV"))
res.NoEvsTGC <-
  results(dds,
          contrast = c(
            "condition",
            "No_EV",
            "pTGC_EV"))

Functions

Processing DE objects

processDE <- function(res.se, string) {
  fc = 1.5
  log2fc = log(fc, base = 2)
  neg.log2fc = log2fc * -1
  res.se <- res.se[order(res.se$padj), ]
  res.data <-
    merge(as.data.frame(res.se),
          as.data.frame(counts(dds, normalized = TRUE)),
          by = "row.names",
          sort = FALSE)
  names(res.data)[1] <- "Gene"
  res.up <-
    res.data %>%
    filter(log2FoldChange >= log2fc) %>%
    filter(padj <= 0.05) %>%
    arrange(desc(log2FoldChange)) %>%
    dplyr::select(Gene)
  res.dw <-
    res.data %>%
    filter(log2FoldChange <= neg.log2fc) %>%
    filter(padj <= 0.05) %>%
    arrange(desc(log2FoldChange)) %>%
    dplyr::select(Gene)
  res.up.new <-
    annot[annot$ensembl_gene_id_version %in% res.up$Gene,]
  res.dw.new <-
    annot[annot$ensembl_gene_id_version %in% res.dw$Gene,]
  res.data.info <-
    left_join(res.data, mart, by = c('Gene' = 'ensembl_gene_id_version'))
  res.data.filtered <- res.data.info %>%
    filter(padj <= 0.05) %>%
    filter(log2FoldChange >= log2fc | log2FoldChange <= neg.log2fc) %>%
    arrange(desc(log2FoldChange))
  pce.up1 <- paste0(string, ".up.pce", 1)
  pce.dw1 <- paste0(string, ".dw.pce", 1)
  pce.up2 <- paste0(string, ".up.pce", 2)
  pce.dw2 <- paste0(string, ".dw.pce", 2)
  DEGtable <- paste0(string, ".DE.table")
  assign(pce.up1, as.character(res.up.new$ensembl_gene_id), envir = .GlobalEnv)
  assign(pce.dw1, as.character(res.dw.new$ensembl_gene_id), envir = .GlobalEnv)
  assign(pce.up2, as.character(res.up.new$external_gene_name), envir = .GlobalEnv)
  assign(pce.dw2, as.character(res.dw.new$external_gene_name), envir = .GlobalEnv)
  assign(DEGtable, res.data.info, envir = .GlobalEnv)
  # save full table
  write_delim(
    res.data.info,
    file = paste0("results/DESeq2results-", string, "_fc.tsv"),
    delim = "\t"
  )
  # save filtered table (fc = 1.5 & padj <= 0.05)
  write_delim(
    res.data.filtered,
    file = paste0("results/DE_", string, "_filtered.tsv"),
    delim = "\t"
  )
}

Gene information

ensembl = useMart("ENSEMBL_MART_ENSEMBL")
listDatasets(ensembl) %>%
  filter(str_detect(description, "Mouse"))
ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
listFilters(ensembl) %>%
  filter(str_detect(name, "ensembl"))
filterType <- "ensembl_gene_id_version"
head(rownames(counts))
counts <- read.delim(countsFile, row.names = 1, header = TRUE)
head(rownames(counts))
filterValues <- rownames(counts)
listAttributes(ensembl) %>%
  head(20)
attributeNames <- c('ensembl_gene_id_version',
                    'ensembl_gene_id',
                    'external_gene_name')
annot <- getBM(
  attributes = attributeNames,
  filters = filterType,
  values = filterValues,
  mart = ensembl
)
attributeNames <- c('ensembl_gene_id_version',
                    'gene_biotype',
                    'external_gene_name',
                    'description')
mart <- getBM(
  attributes = attributeNames,
  filters = filterType,
  values = filterValues,
  mart = ensembl
)
write_delim(
  annot,
  file = "assets/annot.tsv",
  delim = "\t"
)
write_delim(
  mart,
  file = "assets/mart.tsv",
  delim = "\t"
)    

Files were saved, so we don’t query BioMart everytime we run the markdown. The files will be loaded, instead

mart <-
    read.csv(
        "assets/mart.tsv",
        sep = "\t",
        header = TRUE,
    )
annot <-
    read.csv(
        "assets/annot.tsv",
        sep = "\t",
        header = TRUE,
    )

Volcano plotting function

g <- volcanoPlots <-
  function(res.se,
           string,
           first,
           second,
           color1,
           color2,
           color3,
           ChartTitle) {
    fc = 1.5
    log2fc = log(fc, base = 2)
    neg.log2fc = log2fc * -1
    res.se <- res.se[order(res.se$padj),]
    res.se <-
      rownames_to_column(as.data.frame(res.se[order(res.se$padj), ]))
    names(res.se)[1] <- "Gene"
    res.data <-
      merge(res.se,
            mart,
            by.x = "Gene",
            by.y = "ensembl_gene_id_version")
    res.data <- res.data %>% mutate_all(na_if, "")
    res.data <- res.data %>% mutate_all(na_if, " ")
    res.data <-
      res.data %>% mutate(external_gene_name = coalesce(external_gene_name, Gene))
    res.data$diffexpressed <- "other.genes"
    res.data$diffexpressed[res.data$log2FoldChange >= log2fc &
                             res.data$padj <= 0.05] <-
      paste("Higher expression in", first)
    res.data$diffexpressed[res.data$log2FoldChange <= neg.log2fc &
                             res.data$padj <= 0.05] <-
      paste("Higher expression in", second)
    upgenes <- res.data %>%
      dplyr::filter(log2FoldChange >= log2fc & padj <= 0.05) %>%
      arrange(desc(log2FoldChange)) %>%
      mutate(delabel = external_gene_name) %>%
      select(Gene, delabel) %>%
      top_n(10)
    downgenes <- res.data %>%
      dplyr::filter(log2FoldChange <= neg.log2fc &
                      padj <= 0.05) %>%
      arrange(desc(log2FoldChange)) %>%
      mutate(delabel = external_gene_name) %>%
      select(Gene, delabel) %>%
      top_n(10)
    fullgenes <- rbind(upgenes, downgenes)
    res.data <- left_join(res.data, fullgenes, by = "Gene")
    res.data$delabel[res.data$external_gene_name == "Htr2b"] <-
      "Htr2b"
    ggplot(res.data,
           aes(
             x = log2FoldChange,
             y = -log10(padj),
             col = diffexpressed,
             label = delabel
           )) +
      geom_point(alpha = 0.5) +
      xlim(-2.5, 2.5) +
      theme_classic() +
      scale_color_manual(name = "Expression", values = c(color1, color3, color2)) +
       geom_label_repel(
        data = res.data,
        aes(size = 0.5, point.size = 0.5),
        max.overlaps = Inf,
        force_pull   = 0,
        min.segment.length = 0.5,
        show.legend = F,
        seed = 11,
        box.padding = 0.5
) +
      ggtitle(ChartTitle) +
      xlab(paste("log2 fold change")) +
      ylab("-log10 pvalue (adjusted)") +
      theme(legend.text.align = 0)
  }

GenePlot function


gnCountFn <- function(geneName="ENSMUSG00000026228.7") {
  geneCounts <- as.data.frame(counts(dds, normalized = TRUE)[geneName, ])
colnames(geneCounts) <- "normalizedCounts"
geneCounts <- geneCounts %>% 
  rownames_to_column("condition") %>% 
  mutate(condition = str_sub(condition, 1, str_length(condition)-2))
geneCounts$condition <- factor(geneCounts$condition, 
                               levels=c('No_EV', 'mTSC_EV', 'pTGC_EV'))
g <- ggplot(geneCounts, aes(x = condition, y = normalizedCounts, fill = condition)) +
  geom_boxplot(color = "black") +
  stat_summary(
    fun.y = mean,
    geom = "point",
    shape = 23,
    size = 2
  ) + 
  geom_dotplot(
    binaxis = 'y',
    binwidth = 1,
    stackdir = 'center',
    dotsize = 0.75
  ) +
  scale_fill_manual(values = c(
    "No_EV" = "#7acc37",
    "mTSC_EV" = "#0178c7",
    "pTGC_EV" = "#8e27af"
  )) +
  theme_clean()
g
}

TissueEnrich function

source("assets/theme_clean.R")
plotTE <- function(inputGenes = gene.list,
                   myColor = "color") {
  gs <-
    GeneSet(geneIds = inputGenes,
            organism = "Mus Musculus",
            geneIdType = SymbolIdentifier())
  output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
  en.output <-
    setNames(data.frame(assay(output[[1]]),
                        row.names = rowData(output[[1]])[, 1]),
             colData(output[[1]])[, 1])
  en.output$Tissue <- rownames(en.output)
  logp <- -log10(0.05)
  en.output <-
    mutate(en.output,
           significance = ifelse(Log10PValue > logp,
                                 "colored", "nocolor"))
  en.output$Sig <- "NA"
  ggplot(en.output, aes(reorder(Tissue, Log10PValue),
                        Log10PValue,
                        fill = significance)) +
    geom_bar(stat = 'identity') +
    theme_clean() + ylab("- log10 adj. p-value") + xlab("") +
    scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
    scale_y_continuous(expand = expansion(mult = c(0, .1)),
                       breaks = scales::pretty_breaks()) +
    coord_flip()
}

enrichR function

plotEnrichR <- function(enriched, table="string", myColor = "slateblue") {
  logp <- -log10(0.05)
  myData <- enriched[[table]]
  myData$negLogP <-  -log10(myData$P.value)
  myData <-
    mutate(myData,
           significance = ifelse(negLogP > logp, "colored", "nocolor"))
  myData$Sig <- "NA"
  myData <- head(arrange(myData, -negLogP, Term), 15)
  ggplot(myData, aes(reorder(Term, negLogP),
                     negLogP,
                     fill = significance)) +
    geom_bar(stat = 'identity') +
    theme_clean() + ylab("- log10 p-value") + xlab("") +
    scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
    scale_y_continuous(expand = expansion(mult = c(0, .1)),
                       breaks = scales::pretty_breaks()) +
    coord_flip()
}

Results

Write files

processDE(res.NoEvsTSC, "NoEvsTSC")
processDE(res.NoEvsTGC, "NoEvsTGC")

Htr2b gene expression

gnCountFn(geneName="ENSMUSG00000026228.7")
placeholder

placeholder

Volcano plots

no_EVs vs. pTGC_EV

g <- volcanoPlots(
  res.NoEvsTGC,
  "NoEvsTGC",
  "No_EVs",
  "pTGC_EVs",
  "#7acc37",
  "#4d4d4d",
  "#8e27af",
  ChartTitle = "No_EVs vs. pTGC_EVs"
)
g
#> Warning: Removed 2397 rows containing missing values (`geom_point()`).
#> Warning: Removed 15477 rows containing missing values (`geom_label_repel()`).
Fig X: No_EVs vs. pTGC_EVs

Fig X: No_EVs vs. pTGC_EVs

No_EVs vs. mTSC_EV

g <- volcanoPlots(
  res.NoEvsTSC,
  "NoEvsTSC",
  "No_EVs",
  "mTSC_EV",
  "#0178c7",
  "#4d4d4d",
  "#7acc37",
  ChartTitle = "No_EVs vs. mTSC_EV"
)
g
#> Warning: Removed 2397 rows containing missing values (`geom_point()`).
#> Warning: Removed 15477 rows containing missing values (`geom_label_repel()`).
Fig X: No_EVs vs. pTSC_EVs

Fig X: No_EVs vs. pTSC_EVs

Overlap of genes TSC EV (vs. no EV) and pTGC EV (vs. no EV)

All significant

library(ggvenn)
y <-  list(pTGC = c(NoEvsTGC.dw.pce1,NoEvsTGC.up.pce1),
           TSC = c(NoEvsTSC.dw.pce1, NoEvsTSC.up.pce1))
ggvenn(
    y,
    fill_color = c("#8e27af", "#0178c7"),
    stroke_size = 0.3,
    set_name_size = 4
)
placeholder

placeholder

upregulated in pTGC or TSC

library(ggvenn)
y <- list(pTGC = NoEvsTGC.dw.pce1, TSC = NoEvsTSC.dw.pce1)
ggvenn(
    y,
    fill_color = c("#8e27af", "#0178c7"),
    stroke_size = 0.3,
    set_name_size = 4
)
placeholder

placeholder

upregulated in no EV

library(ggvenn)
y <- list(pTGC = NoEvsTGC.up.pce1, TSC = NoEvsTSC.up.pce1)
ggvenn(
    y,
    fill_color = c("#8e27af", "#0178c7"),
    stroke_size = 0.3,
    set_name_size = 4
)
placeholder

placeholder

TE: No_EV vs. pTGC

up in NoEvs

plotTE(unique(NoEvsTGC.up.pce2), "#7acc37")
Fig X: TE for genes upregulated in No_EV samples

Fig X: TE for genes upregulated in No_EV samples

up in pTGC

plotTE(unique(NoEvsTGC.dw.pce2), "#8e27af")
Fig X: TE for genes upregulated in pTGC samples

Fig X: TE for genes upregulated in pTGC samples

TE: No_EV vs. mTSC

up in NoEvs

plotTE(unique(NoEvsTSC.up.pce2), "#7acc37")
Fig X: TE for genes upregulated in No_EV sample

Fig X: TE for genes upregulated in No_EV sample

up in mTSC

plotTE(unique(NoEvsTSC.dw.pce2), "#0178c7")
Fig X: TE for genes upregulated in mTSC samples

Fig X: TE for genes upregulated in mTSC samples

Enrichment Analyses

setEnrichrSite("Enrichr")
websiteLive <- TRUE
myDBs <-
  c(
    "DisGeNET",
    "WikiPathways_2019_Human",
    "WikiPathways_2019_Mouse",
    "KEGG_2019_Mouse"
  )
if (websiteLive) {
  NoEvsTGC.up.enriched <- enrichr(NoEvsTGC.up.pce2, myDBs)
  Sys.sleep(60)
  NoEvsTGC.dw.enriched <- enrichr(NoEvsTGC.dw.pce2, myDBs)
  Sys.sleep(60)
  NoEvsTSC.up.enriched <- enrichr(NoEvsTSC.up.pce2, myDBs)
  Sys.sleep(60)
  NoEvsTSC.dw.enriched <- enrichr(NoEvsTSC.dw.pce2, myDBs)
}
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.

up in NoEvs (No_EV vs. pTGC)

DisGeNET

plotEnrichR(NoEvsTGC.up.enriched, table="DisGeNET" , myColor = "#7acc37")
placeholder

placeholder

WikiPathways_2019_Human

plotEnrichR(NoEvsTGC.up.enriched, table="WikiPathways_2019_Human" , "#7acc37")
placeholder

placeholder

WikiPathways_2019_Mouse

plotEnrichR(NoEvsTGC.up.enriched, table="WikiPathways_2019_Mouse" , "#7acc37")
placeholder

placeholder

KEGG_2019_Mouse

plotEnrichR(NoEvsTGC.up.enriched, table="KEGG_2019_Mouse" , "#7acc37")
placeholder

placeholder

up in pTGC (No_EV vs. pTGC)

DisGeNET

plotEnrichR(NoEvsTGC.dw.enriched, table="DisGeNET" , myColor = "#8e27af")
placeholder

placeholder

WikiPathways_2019_Human

plotEnrichR(NoEvsTGC.dw.enriched, table="WikiPathways_2019_Human" , "#8e27af")
placeholder

placeholder

WikiPathways_2019_Mouse

plotEnrichR(NoEvsTGC.dw.enriched, table="WikiPathways_2019_Mouse" , "#8e27af")
placeholder

placeholder

KEGG_2019_Mouse

plotEnrichR(NoEvsTGC.dw.enriched, table="KEGG_2019_Mouse" , "#8e27af")
placeholder

placeholder

up in NoEvs (No_EV vs. mTSC)

DisGeNET

plotEnrichR(NoEvsTSC.up.enriched, table="DisGeNET" , myColor = "#7acc37")
placeholder

placeholder

WikiPathways_2019_Human

plotEnrichR(NoEvsTSC.up.enriched, table="WikiPathways_2019_Human" , "#7acc37")
placeholder

placeholder

WikiPathways_2019_Mouse

plotEnrichR(NoEvsTGC.up.enriched, table="WikiPathways_2019_Mouse" , "#7acc37")
placeholder

placeholder

KEGG_2019_Mouse

plotEnrichR(NoEvsTGC.up.enriched, table="KEGG_2019_Mouse" , "#7acc37")
placeholder

placeholder

up in mTSC (No_EV vs. mTSC)

DisGeNET

plotEnrichR(NoEvsTSC.dw.enriched, table="DisGeNET" , myColor = "#0178c7")
placeholder

placeholder

WikiPathways_2019_Human

plotEnrichR(NoEvsTSC.dw.enriched, table="WikiPathways_2019_Human" , "#0178c7")
placeholder

placeholder

WikiPathways_2019_Mouse

plotEnrichR(NoEvsTSC.dw.enriched, table="WikiPathways_2019_Mouse" , "#0178c7")
placeholder

placeholder

KEGG_2019_Mouse

plotEnrichR(NoEvsTSC.dw.enriched, table="KEGG_2019_Mouse" , "#0178c7")
placeholder

placeholder

Overlap analyses

For the top quartile (expression) miRNAs, targets were predicted in the previous sheet. Here we will import them (both target gene lists as well as brain specific target gene lists) and perform overlap analyses with the bulk RNAseq results.

brainGeneLists <- readRDS("results/brainGeneLists.rds")
targetLists <- readRDS("results/targetsLists_75pc.rds")
mergeTargetsWithDEG <- function(deTable = NoEvsTSC.DE.table,
                                targetTable = targetLists$miRNAs_75pc_pTGC,
                                brainGenes = brainGeneLists$brEnrTargets_75pc_pTGC) {
 # thresholds
  fc = 1.5
  log2fc = log(fc, base = 2)
  neg.log2fc = log2fc * -1
  # make clean DE table
  cleanDE <- deTable %>%
    dplyr::select(Gene, log2FoldChange, padj, external_gene_name, description) %>%
    mutate(Gene = gsub("\\..*", "", Gene))
  # make clean brain gene list
  brainPAV <- as.data.frame(brainGenes) %>%
    rename(SYMBOL = 1)  %>%
    mutate(SYMBOL = toupper(SYMBOL), BrainSpecific = 1)
  # merge targets with clean tables
  mergedTable <-
    targetTable %>%
    left_join(cleanDE, by = c("ENSEMBL" = "Gene")) %>%
    mutate(SYMBOL = toupper(SYMBOL)) %>%
    mutate(RNAseq = case_when((log2FoldChange > log2fc &
                                 padj <= 0.05) ~ 'up',
                              (log2FoldChange < neg.log2fc &
                                 padj <= 0.05) ~ 'down',
                              TRUE ~ 'NA'
    )) %>%
    dplyr::select(mirbase_id, ENSEMBL, SYMBOL, RNAseq, log2FoldChange, padj, description) %>%
    left_join(as_tibble(brainPAV), by = "SYMBOL") %>%
    filter(mirbase_id != "") %>% distinct()
  mergedTable
}
createSummaryTable <-
  function(mergedTabe = NoEvsTSC.DE_with_pTGC.75pc_targets) {
    t1 <- mergedTabe %>%
      group_by(mirbase_id, RNAseq) %>%
      summarise(BrainIntersecting = sum(!is.na(BrainSpecific))) %>%
      filter(mirbase_id != "") %>%
      group_by(mirbase_id) %>%
      spread(RNAseq, BrainIntersecting) %>%
      select(mirbase_id, down, up, `NA`) %>%
      rename(brainDownDE = down,
             brainUpDE = up,
             Brain.NA = `NA`) %>%
      mutate(totalBrain = sum(brainDownDE, 
                              Brain.NA, 
                              brainUpDE, 
                              na.rm =TRUE))
    t2 <- mergedTabe %>%
      group_by(mirbase_id, RNAseq) %>%
      summarise(AllIntersecting = n()) %>%
      filter(mirbase_id != "") %>%
      group_by(mirbase_id) %>%
      spread(RNAseq, AllIntersecting) %>%
      select(mirbase_id, down, up, `NA`) %>%
      rename(
        TargetsDownDE = down,
        TargetsUpDE = up,
        TargetsNA = `NA`
      ) %>%
      mutate(totalTargets = sum(TargetsDownDE, 
                                TargetsUpDE, 
                                TargetsNA, 
                                na.rm =TRUE))
    summaryTable <- inner_join(t1, t2, by = "mirbase_id")
    summaryTable
  }
NoEvsTSC.DE_with_mTSC.75pc_targets <-
  mergeTargetsWithDEG(
    deTable = NoEvsTSC.DE.table,
    targetTable = targetLists$miRNAs_75pc_mTSC,
    brainGenes = brainGeneLists$brEnrTargets_75pc_mTSC
  )
NoEvsTSC.DE_with_mTSC.75pc_summary <-
  createSummaryTable(NoEvsTSC.DE_with_mTSC.75pc_targets)
write_delim(
  NoEvsTSC.DE_with_mTSC.75pc_targets ,
  "results/NoEvsTSC.DE_with_mTSC.75pc_targets.tsv",
  delim = "\t"
)
write_delim(
  NoEvsTSC.DE_with_mTSC.75pc_summary,
  "results/NoEvsTSC.DE_with_mTSC.75pc_summary.tsv",
  delim = "\t"
)
NoEvsTGC.DE_with_pTGC.75pc_targets <-
  mergeTargetsWithDEG(
    deTable = NoEvsTGC.DE.table,
    targetTable = targetLists$miRNAs_75pc_pTGC,
    brainGenes = brainGeneLists$brEnrTargets_75pc_pTGC
  )
NoEvsTGC.DE_with_pTGC.75pc_summary <- 
  createSummaryTable(NoEvsTGC.DE_with_pTGC.75pc_targets)
write_delim(
  NoEvsTGC.DE_with_pTGC.75pc_targets,
  "results/NoEvsTGC.DE_with_pTGC.75pc_targets.tsv",
  delim = "\t"
)
write_delim(
  NoEvsTGC.DE_with_pTGC.75pc_summary,
  "results/NoEvsTGC.DE_with_pTGC.75pc_summary.tsv",
  delim = "\t"
)

MultiQC report

MultiQC report is available at this link

Session Information

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] ggvenn_0.1.10               enrichR_3.1                
#>  [3] DT_0.27                     htmlwidgets_1.5.4          
#>  [5] kableExtra_1.3.4            scales_1.2.1               
#>  [7] cowplot_1.1.1               plotly_4.10.1              
#>  [9] TissueEnrich_1.18.0         GSEABase_1.60.0            
#> [11] graph_1.76.0                annotate_1.76.0            
#> [13] XML_3.99-0.12               AnnotationDbi_1.60.0       
#> [15] ensurer_1.1                 biomaRt_2.54.1             
#> [17] reshape2_1.4.4              RColorBrewer_1.1-3         
#> [19] ggrepel_0.9.3               pheatmap_1.0.12            
#> [21] vsn_3.66.0                  DESeq2_1.38.3              
#> [23] SummarizedExperiment_1.28.0 Biobase_2.58.0             
#> [25] MatrixGenerics_1.10.0       matrixStats_0.63.0         
#> [27] GenomicRanges_1.50.1        GenomeInfoDb_1.34.3        
#> [29] IRanges_2.32.0              S4Vectors_0.36.0           
#> [31] BiocGenerics_0.44.0         forcats_0.5.2              
#> [33] stringr_1.5.0               dplyr_1.0.10               
#> [35] purrr_1.0.1                 readr_2.1.3                
#> [37] tidyr_1.3.0                 tibble_3.1.8               
#> [39] ggplot2_3.4.0               tidyverse_1.3.2            
#> 
#> loaded via a namespace (and not attached):
#>   [1] readxl_1.4.1           backports_1.4.1        BiocFileCache_2.6.1   
#>   [4] systemfonts_1.0.4      plyr_1.8.8             lazyeval_0.2.2        
#>   [7] crosstalk_1.2.0        BiocParallel_1.32.1    digest_0.6.30         
#>  [10] htmltools_0.5.3        fansi_1.0.3            magrittr_2.0.3        
#>  [13] memoise_2.0.1          googlesheets4_1.0.1    tzdb_0.3.0            
#>  [16] limma_3.54.0           Biostrings_2.66.0      modelr_0.1.10         
#>  [19] vroom_1.6.0            svglite_2.1.0          timechange_0.1.1      
#>  [22] rmdformats_1.0.4       prettyunits_1.1.1      colorspace_2.0-3      
#>  [25] blob_1.2.3             rvest_1.0.3            rappdirs_0.3.3        
#>  [28] haven_2.5.1            xfun_0.35              crayon_1.5.2          
#>  [31] RCurl_1.98-1.9         jsonlite_1.8.3         glue_1.6.2            
#>  [34] gtable_0.3.1           gargle_1.2.1           zlibbioc_1.44.0       
#>  [37] XVector_0.38.0         webshot_0.5.4          DelayedArray_0.24.0   
#>  [40] DBI_1.1.3              Rcpp_1.0.9             viridisLite_0.4.1     
#>  [43] xtable_1.8-4           progress_1.2.2         bit_4.0.5             
#>  [46] preprocessCore_1.60.2  httr_1.4.4             ellipsis_0.3.2        
#>  [49] farver_2.1.1           pkgconfig_2.0.3        sass_0.4.3            
#>  [52] dbplyr_2.2.1           locfit_1.5-9.7         utf8_1.2.2            
#>  [55] tidyselect_1.2.0       labeling_0.4.2         rlang_1.1.1           
#>  [58] munsell_0.5.0          cellranger_1.1.0       tools_4.2.2           
#>  [61] cachem_1.0.6           cli_3.4.1              generics_0.1.3        
#>  [64] RSQLite_2.2.18         broom_1.0.1            evaluate_0.18         
#>  [67] fastmap_1.1.0          yaml_2.3.6             knitr_1.41            
#>  [70] bit64_4.0.5            fs_1.5.2               KEGGREST_1.38.0       
#>  [73] xml2_1.3.3             compiler_4.2.2         rstudioapi_0.14       
#>  [76] filelock_1.0.2         curl_4.3.3             png_0.1-7             
#>  [79] affyio_1.68.0          reprex_2.0.2           geneplotter_1.76.0    
#>  [82] bslib_0.4.1            stringi_1.7.8          highr_0.9             
#>  [85] lattice_0.20-45        Matrix_1.5-3           vctrs_0.6.2           
#>  [88] pillar_1.8.1           lifecycle_1.0.3        BiocManager_1.30.19   
#>  [91] jquerylib_0.1.4        data.table_1.14.6      bitops_1.0-7          
#>  [94] R6_2.5.1               affy_1.76.0            bookdown_0.33         
#>  [97] codetools_0.2-18       assertthat_0.2.1       rjson_0.2.21          
#> [100] withr_2.5.0            GenomeInfoDbData_1.2.9 parallel_4.2.2        
#> [103] hms_1.1.2              rmarkdown_2.18         googledrive_2.0.0     
#> [106] lubridate_1.9.0