Small RNAseq: Differential Expression Analysis

Environment Setup

salloc -N 1 --exclusive -p amd -t 8:00:00
conda env create -f conda-env.yml
conda activate smallrna

Downloading datasets

Raw data

Raw data was downloaded from the sequencing facility using the secure link, with wget command. The downloaded files were checked for md5sum and compared against list of files expected as per the input samples provided.

wget https://oc1.rnet.missouri.edu/xyxz
# link masked 
# GEO link will be included later
# merge files of same samples (technical replicates)
paste <(ls *_L001_R1_001.fastq.gz) <(ls *_L002_R1_001.fastq.gz) | \
   sed 's/\t/ /g' |\
   awk '{print "cat",$1,$2" > "$1}' |\
   sed 's/_L001_R1_001.fastq.gz/.fq.gz/2' > concatenate.sh
chmod +x concatenate.sh
sh concatenate.sh

Genome/annotation

Additional files required for the analyses were downloaded from GenCode. The downloaded files are as follows:

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/GRCm39.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gff3.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gtf.gz
gunzip GRCm39.primary_assembly.genome.fa.gz
gunzip gencode.vM30.annotation.gff3.gz
gunzip gencode.vM30.annotation.gtf.gz

FastQC (before processing)

for fq in *.fq.gz; do
  fastqc --threads $SLURM_JOB_CPUS_PER_NODE $fq;
done
mkdir -p fastqc_pre
mv *.zip *.html fastqc_pre/

Mapping

To index the genome, following command was run (in an interactive session).

fastaGenome="GRCm39.genome.fa"
gtf="gencode.vM30.annotation.gtf"
STAR --runThreadN $SLURM_JOB_CPUS_PER_NODE \
     --runMode genomeGenerate \
     --genomeDir $(pwd) \
     --genomeFastaFiles $fastaGenome \
     --sjdbGTFfile $gtf \
     --sjdbOverhang 1

Each fastq file was mapped to the indexed genome as using runSTAR_map.sh script shown below:

#!/bin/bash
read1=$1
out=$(basename ${read1%%.*})
STARgenomeDir=$(pwd)
# illumina adapter
adapterseq="AGATCGGAAGAGC"
STAR \
    --genomeDir ${STARgenomeDir} \
    --readFilesIn ${read1} \
    --outSAMunmapped Within \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts \
    --outFilterMultimapNmax 20 \
    --clip3pAdapterSeq ${adapterseq} \
    --clip3pAdapterMMp 0.1 \
    --outFilterMismatchNoverLmax 0.03 \
    --outFilterScoreMinOverLread 0 \
    --outFilterMatchNminOverLread 0 \
    --outFilterMatchNmin 16 \
    --outFileNamePrefix ${out} \
    --alignSJDBoverhangMin 1000 \ 
    --alignIntronMax 1 \
    --runThreadN ${SLURM_JOB_CPUS_PER_NODE} \
    --genomeLoad LoadAndKeep \
    --limitBAMsortRAM 30000000000 \
    --outSAMheaderHD "@HD VN:1.4 SO:coordinate"

Mapping was run with a simple loop:

for fq in *.fq.gz; do
  runSTAR_map.sh $fq;
done

Counting Stats

Counts generated by STAR with option --quantMode GeneCounts were parsed to generate summary stats as well as to extract annotated small RNA feature counts.

mkdir -p counts_files
# copy counts for each sample
cp *ReadsPerGene.out.tab counts_files/
cd counts_files
# merge counts
join_files.sh *ReadsPerGene.out.tab |\
   sed 's/ReadsPerGene.out.tab//g' |\
   grep -v "^N_" > counts_star.tsv
# merge stats
join_files.sh *ReadsPerGene.out.tab |\
   sed 's/ReadsPerGene.out.tab//g' |\
   head -n 1 > summary_star.tsv
join_files.sh *ReadsPerGene.out.tab |\
   sed 's/ReadsPerGene.out.tab//g' |\
   grep "^N_" >> summary_star.tsv
# parse GTF to extact gene.id and its biotype:
gtf=gencode.vM30.annotation.gtf
awk 'BEGIN{OFS=FS="\t"} $3=="gene" {split($9,a,";"); print a[1],a[2]}' ${gtf} |\
   awk '{print $4"\t"$2}' |\
   sed 's/"//g' > GeneType_GeneID.tsv
cut -f 1 GeneType_GeneID.tsv | sort |uniq > features.txt

The information for biotype as provided by the gencodegenes were used for categorizing biotype.

The smallRNA group consists of following biotype:

miRNA
misc_RNA
scRNA
snRNA
snoRNA
sRNA
scaRNA

The full table is as follows:

library(knitr)
setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
file1="assets/GeneType_Group.tsv"
info <-
  read.csv(
    file1,
    header = TRUE,
    sep = "\t",
    stringsAsFactors = TRUE
  )
kable(info, caption = "Table 1: biotype and its groupings")

Table 1: biotype and its groupings
biotype	group
protein_coding	coding_genes
pseudogene	pseudogenes
TR_C_gene	Ig_genes
TR_D_gene	Ig_genes
TR_J_gene	Ig_genes
TR_V_gene	Ig_genes
IG_C_gene	Ig_genes
IG_D_gene	Ig_genes
IG_J_gene	Ig_genes
IG_LV_gene	Ig_genes
IG_V_gene	Ig_genes
TR_J_pseudogene	pseudogenes
TR_V_pseudogene	pseudogenes
IG_C_pseudogene	pseudogenes
IG_D_pseudogene	pseudogenes
IG_pseudogene	pseudogenes
IG_V_pseudogene	pseudogenes
lncRNA	long_non_conding_RNA
miRNA	non_conding_RNA
misc_RNA	non_conding_RNA
ribozyme	non_conding_RNA
rRNA	non_conding_RNA
scaRNA	non_conding_RNA
scRNA	non_conding_RNA
snoRNA	non_conding_RNA
snRNA	non_conding_RNA
sRNA	non_conding_RNA
Mt_rRNA	non_conding_RNA
Mt_tRNA	non_conding_RNA
processed_pseudogene	pseudogenes
unprocessed_pseudogene	pseudogenes
translated_unprocessed_pseudogene	pseudogenes
transcribed_processed_pseudogene	pseudogenes
transcribed_unitary_pseudogene	pseudogenes
transcribed_unprocessed_pseudogene	pseudogenes
unitary_pseudogene	pseudogenes
TEC	unconfirmed_genes

A samples table (samples.tsv) categorizing samples to its condition were also generated:

file2="assets/samples.tsv"
samples <-
  read.csv(
    file2,
    header = TRUE,
    sep = "\t",
    stringsAsFactors = TRUE
  )
kable(samples, caption = "Table 2: Samples in the study")

Table 2: Samples in the study
Sample	Group
pTGC_1	pTGC
pTGC_2	pTGC
pTGC_3	pTGC
pTGC_4	pTGC
mTSC_1	mTSC
mTSC_2	mTSC
mTSC_3	mTSC
mTSC_4	mTSC

This information was then merged withe counts table to generate QC plots:

awk 'BEGIN{OFS=FS="\t"}FNR==NR{a[$1]=$2;next}{ print $2,$1,a[$1]}' \
    GeneType_Group.tsv GeneType_GeneID.tsv > GeneID_GeneType_Group.tsv

awk 'BEGIN{OFS=FS="\t"}FNR==NR{a[$1]=$2"\t"$3;next}{print $1,a[$1],$0}' \
    GeneID_GeneType_Group.tsv counts_star.tsv |\
    cut -f 1-3,5- > processed_counts_star.tsv

Plotting the mapping summary and count statistics for various biotypes:

library(scales)
library(tidyverse)
library(plotly)

setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
file1="assets/processed_counts_star.tsv"
file2="assets/summary_stats_star.tsv"
counts <-
  read.csv(
    file1,
    sep = "\t",
    stringsAsFactors = TRUE
  )
subread <-
  read.csv(
    file2,
    sep = "\t",
    stringsAsFactors = TRUE
  )
# convert long format
counts.long <- gather(counts, Sample, Count, pTGC_1:mTSC_4, factor_key=TRUE)
subread.long <- gather(subread, Sample,  Count, pTGC_1:mTSC_4, factor_key=TRUE)
# organize
counts.long$Group <-
  factor(
    counts.long$Group,
    levels = c(
      "coding_genes",
      "non_conding_RNA",
      "long_non_conding_RNA",
      "pseudogenes",
      "unconfirmed_genes",
      "Ig_genes"
    )
  )

subread.long$Assignments <-
  factor(
    subread.long$Assignments,
    levels = c(
      "N_input",
      "N_unmapped",
      "N_multimapping",
      "N_unique",
      "N_ambiguous",
      "N_noFeature"
    )
  )

ggplot(subread.long, aes(x = Assignments, y = Count, fill = Assignments)) +
  geom_bar(stat = 'identity') +
  labs(x = "Subread assingments", y = "reads") + theme_minimal() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_manual(
    values = c(
      "N_input"        = "#577b92",
      "N_unmapped"     = "#021927",
      "N_multimapping" = "#007caa",
      "N_unique"       = "#1a3c54",
      "N_ambiguous"    = "#3a5b63",
      "N_noFeature"    = "#33526c"
    )
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1,
      size = 12
    ),
    strip.text = element_text(
      face = "bold",
      color = "gray35",
      hjust = 0,
      size = 10
    ),
    strip.background = element_rect(fill = "white", linetype = "blank"),
    legend.position = "none"
  ) +
  facet_wrap("Sample", scales = "free_y", ncol = 4)

Figure 1: STAR read mapping and feature assignment. Here, N_input is total input reads, N_unmapped is reads that were either too short to map after adapter removal or had higher mismatch rate to place reliably on the genome, N_multimapping is reads mapped to multiple loci, N_unique is reads mapped to unique loci. A subset of N_unique reads that were unable to clearly assign to a feature or assign any feature at all are grouped as N_ambigious or N_noFeature, respectively

g <- ggplot(counts.long, aes(x = Group, y = Count, fill = Group)) +
  geom_bar(stat = 'sum') +
  labs(x = "biotype", y = "read counts") + theme_minimal() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_manual(values = c(
      "coding_genes"         = "#92b5b7",
      "non_conding_RNA"      = "#be4f54",
      "long_non_conding_RNA" = "#eca87a",
      "pseudogenes"          = "#784440",
      "unconfirmed_genes"    = "#eba3a4",
      "Ig_genes"             = "#8a3a1b"
    ))  +
  theme(
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1,
      size = 12
    ),
    strip.text = element_text(
      face = "bold",
      color = "gray35",
      hjust = 0,
      size = 10
    ),
    strip.background = element_rect(fill = "white", linetype = "blank"),
    legend.position = "none"
  ) +
  facet_wrap("Sample", scales = "free_y", ncol = 4)
g

Figure 2: Features with read counts

counts.nc <- filter(counts.long, Group %in% "non_conding_RNA" )
counts.nc$GeneType <-
  factor(
    counts.nc$GeneType,
    levels = c(
      "miRNA",
      "misc_RNA",
      "snoRNA",
      "snRNA",
      "sRNA",
      "scRNA",
      "scaRNA",
      "Mt_tRNA",
      "Mt_rRNA",
      "rRNA",
      "ribozyme"
    )
  )

g <-
  ggplot(counts.nc, aes(x = GeneType, y = Count, fill = GeneType)) +
  geom_bar(stat = 'sum') +
  labs(x = "biotype", y = "read counts") + theme_minimal() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_manual(
    values = c(
      'miRNA'       = '#54693e',
      'misc_RNA'    = '#9c47cb',
      'snRNA'       = '#94d14f',
      'snoRNA'      = '#5c4f9c',
      'scaRNA'      = '#cca758',
      'sRNA'        = '#c85a90',
      'Mt_tRNA'     = '#80d0a8',
      'Mt_rRNA'     = '#c4533b',
      'rRNA'        = '#9ea5c0',
      'ribozyme'    = '#51333c'
    )
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1,
      size = 12
    ),
    strip.text = element_text(
      face = "bold",
      color = "gray35",
      hjust = 0,
      size = 10
    ),
    strip.background = element_rect(fill = "white", linetype = "blank"),
    legend.position = "none"
  ) +
  facet_wrap("Sample", scales = "free_y", ncol = 4)
#ggplotly(g)
g

Figure 3: non-coding biotype read counts

subset the counts file to select only smallRNA genes

snrna <- c('miRNA',
           'misc_RNA',
           'scRNA',
           'snRNA',
           'snoRNA',
           'sRNA',
           'scaRNA')
cts <- dplyr::filter(counts, GeneType %in% snrna) %>%
  dplyr::select(Geneid, pTGC_1:mTSC_4)
write_delim(cts, file = "assets/noncoding_counts_star.tsv", delim = "\t")

This noncoding_counts_star.tsv and samples.tsv file will be used for DESeq2 analyses.

DESeq2

For the next steps, we used DESeq2 for performing the DE analyses. Results were visualized as volcano plots and tables were exported to excel.

Load packages

setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(genefilter)
library(ggrepel)
library(biomaRt)
library(reshape2)
library(PupillometryR)
library(ComplexUpset)
library(splitstackshape)
library(enrichR)

Import counts and sample metadata

The counts data and its associated metadata (coldata) are imported for analyses.

counts = 'assets/noncoding_counts_star.tsv'
groupFile = 'assets/samples.tsv'
coldata <-
  read.csv(
    groupFile,
    row.names = 1,
    sep = "\t",
    stringsAsFactors = TRUE
  )
cts <- as.matrix(read.csv(counts, sep = "\t", row.names = "Geneid"))

Reorder columns of cts according to coldata rows. Check if samples in both files match.

colnames(cts)
#> [1] "pTGC_1" "pTGC_2" "pTGC_3" "pTGC_4" "mTSC_1" "mTSC_2" "mTSC_3" "mTSC_4"
all(rownames(coldata) %in% colnames(cts))
#> [1] TRUE
cts <- cts[, rownames(coldata)]

Normalize

The batch corrected read counts are then used for running DESeq2 analyses

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ Group)
vsd <- vst(dds, blind = FALSE, nsub =500)
keep <- rowSums(counts(dds)) >= 5
dds <- dds[keep, ]
dds <- DESeq(dds)
dds
#> class: DESeqDataSet 
#> dim: 1266 8 
#> metadata(1): version
#> assays(4): counts mu H cooks
#> rownames(1266): ENSMUSG00000119106.1 ENSMUSG00000119589.1 ...
#>   ENSMUSG00000065444.3 ENSMUSG00000077869.3
#> rowData names(22): baseMean baseVar ... deviance maxCooks
#> colnames(8): pTGC_1 pTGC_2 ... mTSC_3 mTSC_4
#> colData names(2): Group sizeFactor

vst <- assay(vst(dds, blind = FALSE, nsub = 500))
vsd <- vst(dds, blind = FALSE, nsub = 500)
pcaData <-
  plotPCA(vsd,
          intgroup = "Group",
          returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

PCA plot for QC

PCA plot for the dataset that includes all libraries.

rv <- rowVars(assay(vsd))
select <-
  order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(assay(vsd)[select, ]))
percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
intgroup = "Group"
intgroup.df <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
group <- if (length(intgroup) == 1) {
  factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
d <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  intgroup.df,
  name = colnames(vsd)
)

plot PCA for components 1 and 2

#nudge <- position_nudge(y = 0.5)
g <- ggplot(d, aes(PC1, PC2, color = Group)) +
  scale_shape_manual(values = 1:8) +
  scale_color_manual(values = c('pTGC'      = '#c6007b',
                                'mTSC'  = '#a0b600')) +
  theme_bw() +
  theme(legend.title = element_blank()) +
  geom_point(size = 2, stroke = 2) +
  geom_text_repel(aes(label = name)) +
  xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
  ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))
g

Figure 4: PCA plot for the first 2 principal components

Sample distance for QC

sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- colnames(vsd)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Figure 5: Euclidean distance between samples

Set contrasts and find DE genes

resultsNames(dds)
#> [1] "Intercept"          "Group_pTGC_vs_mTSC"
res.UndfvsDiff <- results(dds, contrast = c("Group", "mTSC", "pTGC"))
table(res.UndfvsDiff$padj < 0.05)
#> 
#> FALSE  TRUE 
#>   579   294
res.UndfvsDiff <- res.UndfvsDiff[order(res.UndfvsDiff$padj),]
res.UndfvsDiffdata <-
  merge(
    as.data.frame(res.UndfvsDiff),
    as.data.frame(counts(dds, normalized = TRUE)),
    by = "row.names",
    sort = FALSE
  )

Creating gene lists

The gene lists have Ensembl gene-ID-version. We need mirbase_id, gene_biotype and external_gene_name attached to make the results interpretable. So we will download metadata from ensembl using biomaRt package.

ensembl = useMart("ENSEMBL_MART_ENSEMBL")
ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
filterType <- "ensembl_gene_id_version"
filterValues <- rownames(cts)
attributeNames <- c('ensembl_gene_id_version',
                    'ensembl_gene_id',
                    'gene_biotype',
                    'mirbase_id',
                    'external_gene_name')
annot <- getBM(
  attributes = attributeNames,
  filters = filterType,
  values = filterValues,
  mart = ensembl
)
isDup <- duplicated(annot$ensembl_gene_id)
dup <- annot$ensembl_gene_id[isDup]
annot <- annot[!annot$ensembl_gene_id %in% dup, ] #this object will be saved and used later
saveRDS(annot, file = "assets/annot.rds")

The rds is saved and reloaded for subsequent run (instead of querying biomart over and over for the same information)

annot <- readRDS("assets/annot.rds")

Volcano plots

volcanoPlots2 <-
  function(res.se,
           string,
           first,
           second,
           color1,
           color2,
           color3,
           ChartTitle) {
    res.se <- res.se[order(res.se$padj),]
    res.se <-
      rownames_to_column(as.data.frame(res.se[order(res.se$padj),]))
    names(res.se)[1] <- "Gene"
    res.data <-
      merge(res.se,
            annot,
            by.x = "Gene",
            by.y = "ensembl_gene_id_version")
    res.data <- res.data %>% mutate_all(na_if, "")
    res.data <- res.data %>% mutate_all(na_if, " ")
    res.data <-
      res.data %>% mutate(gene_symbol = coalesce(external_gene_name, Gene))
    fc = 1.5
    log2fc = log(fc, base = 2)
    neg.log2fc = log2fc * -1
    
    res.data$diffexpressed <- "other.genes"
    res.data$diffexpressed[res.data$log2FoldChange >= log2fc &
                             res.data$padj <= 0.05] <- paste("Higher expression in", first)
    res.data$diffexpressed[res.data$log2FoldChange <= neg.log2fc &
                             res.data$padj <= 0.05] <- paste("Higher expression in", second)
    
    res.data$delabel <- NA
    res.data$delabel[res.data$log2FoldChange >= log2fc &
                       res.data$padj <= 0.05 &
                       !is.na(res.data$padj)] <-
      res.data$gene_symbol[res.data$log2FoldChange >= log2fc &
                             res.data$padj <= 0.05 & !is.na(res.data$padj)]
    res.data$delabel[res.data$log2FoldChange <= neg.log2fc &
                       res.data$padj <= 0.05 &
                       !is.na(res.data$padj)] <-
      res.data$gene_symbol[res.data$log2FoldChange <= neg.log2fc &
                             res.data$padj <= 0.05 & !is.na(res.data$padj)]
    ggplot(res.data,
           aes(
             x = log2FoldChange,
             y = -log10(padj),
             col = diffexpressed,
             label = delabel
           )) +
      geom_point(alpha = 0.5) +
      xlim(-6, 6) +
      theme_classic() +
      scale_color_manual(name = "Expression", values = c(color1, color2, color3)) +
      geom_text_repel(
        data = subset(res.data, padj <= 0.05),
        max.overlaps  = 15,
        show.legend = F,
        min.segment.length = Inf,
        seed = 42,
        box.padding = 0.5
      ) +
      ggtitle(ChartTitle) +
      xlab(paste("log2 fold change")) +
      ylab("-log10 pvalue (adjusted)") +
      theme(legend.text.align = 0)
  }

g <- volcanoPlots2(
  res.UndfvsDiff,
  "mTSC_vs_pTGC",
  "mTSC",
  "pTGC",
  "#a0b600",
  "#c6007b",
  "grey",
  ChartTitle = "mTSC (undifferentiated) vs. pTGC (differentiated)"
)
g
#> Warning: Removed 394 rows containing missing values (`geom_point()`).
#> Warning: Removed 1 rows containing missing values (`geom_text_repel()`).
#> Warning: ggrepel: 282 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

Figure 6: Volcano plot showing genes overexpressed in undifferentiated and differentiated states.

Heatmap

Heatmap for the top 30 variable genes:

topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 30)
mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
mat2 <-    merge(mat,
          annot,
          by.x = 'row.names',
          by.y = "ensembl_gene_id_version")
mat2$gene <- paste0(mat2$external_gene_name," (",mat2$gene_biotype,")")
rownames(mat2) <- mat2[,'gene']
mat2 <- mat2[2:9]
heat_colors <- brewer.pal(9, "YlOrRd")
g <- pheatmap(
  mat2,
  color = heat_colors,
  main = "Top 30 variable small RNA genes",
  cluster_rows = T,
  cluster_cols  = T,
  show_rownames = T,
  border_color = NA,
  fontsize = 10,
  scale = "row",
  fontsize_row = 10
)
g

Figure 7: Heat map for top 30 variable small RNA genes

Write result tables

Here, we will attach the mirbase_id, gene_biotype and external_gene_name downloaded in the previous section to the results table. We will also filter the the table to write:

DE list that have padj less than or equal to 0.05
DE list that have padj less than or equal to 0.05, and log2FoldChange greater than or equal to fold change of 1.5
DE list that have padj less than or equal to 0.05, and log2FoldChange less than or equal to fold change of -1.5
Full list of DE table without any filtering.

names(res.UndfvsDiffdata)[1] <- "ensembl_gene_id_version"
res.UndfvsDiffdata <-
  merge(x = res.UndfvsDiffdata,
        y = annot,
        by = "ensembl_gene_id_version",
        all.x = TRUE)
res.UndfvsDiffdata <-
  res.UndfvsDiffdata[, c(1, 
                         (ncol(res.UndfvsDiffdata) - 2):ncol(res.UndfvsDiffdata),
                         2:(ncol(res.UndfvsDiffdata) - 3))]
res.UndfvsDiffSig <- res.UndfvsDiffdata %>%
  filter(padj <= 0.05)
fc = 1.5
log2fc = log(fc, base = 2)
neg.log2fc = log2fc * -1
res.UndfvsDiffSig.up <- res.UndfvsDiffdata %>%
  filter(padj <= 0.05 & log2FoldChange >= log2fc)
res.UndfvsDiffSig.dw <- res.UndfvsDiffdata %>%
  filter(padj <= 0.05 & log2FoldChange <= neg.log2fc)
mTSC.up <- res.UndfvsDiffSig.up %>% dplyr::filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)
pTGC.up <- res.UndfvsDiffSig.dw %>% dplyr::filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)

write_delim(res.UndfvsDiffdata, 
            file = "results/DESeq2_results_mTSC_vs_pTGC_full-table.tsv", 
            delim = "\t")
write_delim(res.UndfvsDiffSig, 
            file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05.tsv", 
            delim = "\t")
write_delim(res.UndfvsDiffSig.up, 
            file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05_fc-ge-1.5.tsv", 
            delim = "\t")
write_delim(res.UndfvsDiffSig.dw, 
            file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05_fc-le-neg1.5.tsv", 
            delim = "\t")

mi.mTSC.up       <- mTSC.up
mi.pTGC.up       <- pTGC.up 
mi.mTSC.and.pTGC <- unique(rbind(mi.mTSC.up, mi.pTGC.up))


write_delim(mTSC.up, 
            file = "results/miRNAs_up_mTSC.txt", 
            delim = "\t")
write_delim(pTGC.up, 
            file = "results/miRNAs_up_pTGC.txt", 
            delim = "\t")
write_delim(mi.mTSC.and.pTGC, 
            file = "results/miRNAs_up_mTSC.and.pTGC.txt", 
            delim = "\t")

Characterizing smallRNA expression

To check the expression of genes in each condition, we looked at the highly expressed genes and their composition. The analyses is shown below.

exp <- res.UndfvsDiffdata[, c(1:4, 11:ncol(res.UndfvsDiffdata))]
exp$external_gene_name <-
  ifelse(exp$external_gene_name == "",
         exp$ensembl_gene_id_version,
         exp$external_gene_name)
exp$gene <- paste0(exp$external_gene_name, "(", exp$gene_biotype, ")")
renamed.exp <- exp %>% dplyr::select(gene, ensembl_gene_id_version:mTSC_4)
renamed.exp.long <-
  melt(
    renamed.exp,
    id.vars = c(
      'gene',
      'ensembl_gene_id_version',
      'gene_biotype',
      'mirbase_id',
      'external_gene_name'
    )
  )
colnames(renamed.exp.long) <-
  c('gene',
    'ensembl_gene_id_version',
    'gene_biotype',
    'mirbase_id',
    'external_gene_name',
    'replicates',
    'norm.expression'
  )
renamed.exp.long$condition <- "NA"
renamed.exp.long$condition[which(str_detect(renamed.exp.long$replicates, "pTGC_"))] <-
  "pTGC"
renamed.exp.long$condition[which(str_detect(renamed.exp.long$replicates, "mTSC_"))] <-
  "mTSC"
renamed.exp.long <-
  renamed.exp.long  %>% filter(norm.expression > 0)
renamed.exp.long <- na.omit(renamed.exp.long)
# SOURCE: https://ourcodingclub.github.io/tutorials/dataviz-beautification/
theme_niwot <- function() {
  theme_bw() +
    theme(
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 12),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"),
      panel.border = element_blank(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      panel.grid.major.y = element_blank(),
      plot.margin = unit(c(1, 1, 1, 1), units = , "cm"),
      plot.title = element_text(
        size = 12,
        vjust = 1,
        hjust = 0
      ),
      legend.text = element_text(size = 12),
      legend.title = element_blank(),
      legend.position = c(0.95, 0.15),
      legend.key = element_blank(),
      legend.background = element_rect(
        color = "black",
        fill = "transparent",
        size = 2,
        linetype = "blank"
      )
    )
}

Normalized expression plots

vlnPlot <-
  function(dataIn = renamed.exp.long,
           xcol = "replicates",
           fillcol = "condition",
           expre = "norm.expression") {
    p <- ggplot(data = dataIn,
                aes_string(x = xcol, y = expre, fill = fillcol)) +
      geom_flat_violin(position = position_nudge(x = 0.2, y = 0),
                       alpha = 0.6,
                       trim = FALSE) +
      geom_point(
        aes_string(y = expre, color = fillcol),
        position = position_jitter(width = 0.15),
        size = 1,
        alpha = 0.5
      ) +
      geom_boxplot(width = 0.2,
                   outlier.shape = NA,
                   alpha = 0.8) + stat_summary(
                     fun = mean,
                     geom = "point",
                     shape = 23,
                     size = 2
                   ) +
      labs(y = "Normalized Expression", x = NULL) +
      guides(fill = "none", color = "none") +
      scale_y_log10(label = comma) +
      scale_fill_manual(values = c('pTGC'       = '#c6007b',
                                   'mTSC'   = '#a0b600')) +
      scale_color_manual(values = c('pTGC'      = '#c6007b',
                                   'mTSC'   = '#a0b600')) +
      theme_niwot() + theme(axis.text.x = element_text(
        angle = 45,
        vjust = 1,
        hjust = 1
      ))
    p
  }

Expression plots

vlnPlot(xcol="replicates")
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation ideoms with `aes()`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: Using the `size` aesthietic with geom_polygon was deprecated in ggplot2 3.4.0.
#> ℹ Please use the `linewidth` aesthetic instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Figure 8A: Normalized expression of genes in undifferentiated and differentiated samples (replicate)

vlnPlot(xcol="condition")

Figure 8B: Normalized expression of genes in undifferentiated and differentiated samples (condition)

vlnPlot(xcol="gene_biotype")

Figure 8C: Normalized expression of genes in undifferentiated and differentiated samples (split based on gene biotype)

Highly expressed small RNAs

df.both <- renamed.exp %>%
  rowwise() %>%
  mutate(Both = mean(c(
    pTGC_1, pTGC_2, pTGC_3, pTGC_4, mTSC_1, mTSC_2, mTSC_3, mTSC_4
  ))) %>%
  dplyr::select(gene:external_gene_name, Both) %>%
  dplyr::filter(Both > 0)  %>%
  ungroup() %>%
  mutate(quart = ntile(Both, 4)) %>%
  mutate(decile = ntile(Both, 10))
df.diff <- renamed.exp %>%
  rowwise() %>%
  mutate(pTGC = mean(c(
    pTGC_1, pTGC_2, pTGC_3, pTGC_4
  ))) %>%
  dplyr::select(gene:external_gene_name, pTGC) %>%
  dplyr::filter(pTGC > 0)  %>%
  ungroup() %>%
  mutate(quart = ntile(pTGC, 4)) %>%
  mutate(decile = ntile(pTGC, 10))
df.undiff <-  renamed.exp %>%
  rowwise() %>%
  mutate(mTSC = mean(c(
    mTSC_1, mTSC_2, mTSC_3, mTSC_4
  ))) %>%
  dplyr::select(gene:external_gene_name, mTSC) %>%
  dplyr::filter(mTSC > 0) %>%
  ungroup() %>%
  mutate(quart = ntile(mTSC, 4)) %>%
  mutate(decile = ntile(mTSC, 10))

filterCuts <- function(dataIn = df.undiff,
                       cutOff = 4,
                       type = decile) {
  type <- enquo(type)
  dataIn %>%
    dplyr::filter(!!type == cutOff) %>%
    dplyr::select(ensembl_gene_id_version)
}
both.75pc   <- filterCuts(df.both, 4, type = quart)
undiff.75pc <- filterCuts(df.undiff, 4, type = quart)
diff.75pc   <- filterCuts(df.diff, 4, type = quart)
filterMiRNA <- function(dataIn = df) {
  dataIn %>% left_join(
    annot,
    by = "ensembl_gene_id_version",
) %>% filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)
} 
mi.both.75pc     <- filterMiRNA(both.75pc)
mi.undiff.75pc   <- filterMiRNA(undiff.75pc)
mi.diff.75pc     <- filterMiRNA(diff.75pc)
write_delim(
  mi.undiff.75pc,
  file = paste0("results/miRNAs_75pc_mTSC.txt"),
  delim = "\t"
)
write_delim(
  mi.diff.75pc,
  file = paste0("results/miRNAs_75pc_pTGC.txt"),
  delim = "\t"
)
write_delim(
  mi.both.75pc,
  file = paste0("results/miRNAs_75pc_both.txt"),
  delim = "\t"
)
quartile.data <- data.frame(table(diff.75pc)) %>% 
    full_join(data.frame(table(undiff.75pc)), 
              by = c("ensembl_gene_id_version" = "ensembl_gene_id_version")) %>%
    replace(is.na(.), 0) %>%
    dplyr::rename(diff75pc=Freq.x, undiff75pc=Freq.y) %>%
    left_join(annot, by = c("ensembl_gene_id_version" = "ensembl_gene_id_version")) %>%
    dplyr::select(ensembl_gene_id_version, diff75pc, undiff75pc, gene_biotype) %>%
    column_to_rownames(var = "ensembl_gene_id_version")
colnames(quartile.data) <- c("pTGC.75pc", "mTSC.75pc", "gene_biotype")
plotUpSet <- function(fulltable = decile.data) {
  inter <- colnames(fulltable)[1:2]
  p <- upset(
    fulltable,
    inter,
    annotations = list(
      'smallRNA type' = (
        ggplot(mapping = aes(fill = gene_biotype)) +
          geom_bar(stat = 'count', position = 'fill') +
          scale_y_continuous(labels = scales::percent_format()) +
          scale_fill_manual(
            values = c(
              'miRNA'       = '#54693e',
              'misc_RNA'    = '#9c47cb',
              'snRNA'       = '#94d14f',
              'snoRNA'      = '#5c4f9c',
              'scaRNA'      = '#cca758',
              'sRNA'        = '#c85a90'
            )
          )
        + ylab('smallRNA proportion')
      )
    ),
    queries = list(
      upset_query(
        intersect = inter,
        color = '#C19B5D',
        fill = '#C19B5D',
        only_components = c('intersections_matrix', 'Intersection size')
      ),
      upset_query(
        intersect = inter[1],
        color = '#c6007b',
        fill = '#c6007b',
        only_components = c('intersections_matrix', 'Intersection size')
      ),
      upset_query(
        intersect = inter[2],
        color = '#a0b600',
        fill = '#a0b600',
        only_components = c('intersections_matrix', 'Intersection size')
      )
    ),
    width_ratio = 0.4,
    set_sizes = (
      upset_set_size(geom = geom_bar(
        aes(fill = gene_biotype, x = group),
        width = 0.8
      ),
      position = 'right') + theme(legend.position = "none") +
        scale_fill_manual(
          values = c(
            'miRNA'         = '#54693e',
            'misc_RNA'  = '#9c47cb',
            'snRNA'     = '#94d14f',
            'snoRNA'        = '#5c4f9c',
            'scaRNA'        = '#cca758',
            'sRNA'      = '#c85a90'
          )
        )
    ),
    guides = 'over'
  )
  p
}

Intersection plots

quartile

plotUpSet(quartile.data)

Figure 9A: Gene expression greater than 75th percentile (intersection)

Save intersection results

filterUpsetRes <-
  function(datatable, title = "title") {
    data.annot <-
      merge(datatable[1:2],
            annot,
            by.x = 0,
            by.y = "ensembl_gene_id_version",
            all.x = TRUE)
    mirna.annot <- data.annot %>%
      dplyr::filter(.[[2]] == 1 & .[[3]] == 1) %>%
      dplyr::filter(gene_biotype == "miRNA")
    snrna.annot <- data.annot %>%
      dplyr::filter(.[[2]] == 1 & .[[3]] == 1) 
    write_delim(
      mirna.annot,
      file = paste0("results/intersection_miRNA_", title , ".tsv"),
      delim = "\t"
    )
    write_delim(
      snrna.annot,
      file = paste0("results/intersection_smallRNA_", title , ".tsv"),
      delim = "\t"
    )
  }
filterUpsetRes(quartile.data, title = "quartile")

Finding target genes for miRNAs

The mirdb.org database allows you to find target genes for each miRNA. We downloaded the file from the server and linked each miRNA form our lists to their targets The information downloaded from the server had score information, and for many other species. A filtered list of just mouse, with score >= 80 was created. The target gene ids were in RefSeq format, which was convererted to gene symbol as well.

cd assets
wget https://mirdb.org/download/miRDB_v6.0_prediction_result.txt.gz
gunzip miRDB_v6.0_prediction_result.txt.gz

library(org.Mm.eg.db)
mirdb <- read.delim("assets/miRDB_v6.0_prediction_result.txt", header=FALSE)
colnames(mirdb) <- c("mirbase_id", "refseq", "score")
mmu.mirdb <- mirdb %>% filter(str_detect(mirbase_id, "^mmu-") & score >= 80) %>%
  dplyr::select(mirbase_id, refseq)
cols <- c("REFSEQ", "ENSEMBL", "SYMBOL",  "GENENAME")
mirdb.ids <- AnnotationDbi::select(org.Mm.eg.db, keys=mmu.mirdb$refseq, columns=cols, keytype="REFSEQ")
mirdb.info <- left_join(mmu.mirdb, mirdb.ids, by=c('refseq'='REFSEQ')) %>% 
  dplyr::select(mirbase_id, refseq, ENSEMBL, SYMBOL) %>%
  distinct() %>% 
  mutate(mirbase_noprime = gsub("-3p$|-5p$", "", mirbase_id),
         prime = gsub("^-", "", str_extract(mirbase_id, "-3p$|-5p$"))) %>% 
  dplyr::select(mirbase_noprime, refseq, ENSEMBL, SYMBOL, prime)
colnames(mirdb.info) <- c("mirbase_id", "refseq", "ENSEMBL", "SYMBOL", "prime")
mirdb.info$mirbase_id <- tolower(mirdb.info$mirbase_id)
write_delim(mirdb.info, "targets/mirDB_info.tsv", delim = "\t")
saveRDS(mirdb.info, file = "assets/mirdb.rds")

The rds is saved and reloaded for subsequent run (instead of rerunning everytime we compiled the RMarkdown).

mirdb.info <- readRDS("assets/mirdb.rds")

Target genes for miRNAs

miTargets.mTSC.up <- mi.mTSC.up %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
  dplyr::select(SYMBOL) %>%
  distinct(SYMBOL)
miTargets.pTGC.up <- mi.pTGC.up %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
  dplyr::select(SYMBOL) %>%
  distinct(SYMBOL)
miTargets.mTSC.and.pTGC <- as.data.frame(intersect(miTargets.mTSC.up$SYMBOL, 
                                                   miTargets.pTGC.up$SYMBOL))
colnames(miTargets.mTSC.and.pTGC) <- "SYMBOL"
miTargets.only.in.mTSC <-
  as.data.frame(miTargets.mTSC.up$SYMBOL[!(miTargets.mTSC.up$SYMBOL %in%
                                             miTargets.pTGC.up$SYMBOL)])
colnames(miTargets.only.in.mTSC) <- "SYMBOL"
miTargets.only.in.pTGC <-
  as.data.frame(miTargets.pTGC.up$SYMBOL[!(miTargets.pTGC.up$SYMBOL %in%
                                             miTargets.mTSC.up$SYMBOL)])
colnames(miTargets.only.in.pTGC) <- "SYMBOL"
miTargets.diff.75pc <- mi.diff.75pc %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
  dplyr::select(SYMBOL) %>%
  distinct(SYMBOL)
miTargets.undiff.75pc <- mi.undiff.75pc %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
  dplyr::select(SYMBOL) %>%
  distinct(SYMBOL)

miRNAs_75pc_mTSC <- mi.undiff.75pc %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
miRNAs_75pc_pTGC <- mi.diff.75pc %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
mirs_up_mTSC <- mi.mTSC.up %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
mirs_up_pTGC <- mi.pTGC.up %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
mirs_mTSC_and_pTGC <- mi.mTSC.and.pTGC %>%
  left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
# create miRNA and thier target lists
targetsLists = list(
  miRNAs_75pc_mTSC = miRNAs_75pc_mTSC,
  miRNAs_75pc_pTGC = miRNAs_75pc_pTGC,
  miRNAs_up_mTSC = mirs_up_mTSC,
  miRNAs_up_pTGC = mirs_up_pTGC,
  miRNAs_mTSC.and.pTGC = mirs_mTSC_and_pTGC
)
#create a geneList for later use
geneLists = list(
  miRNAs_75pc_mTSC = miTargets.undiff.75pc$SYMBOL,
  miRNAs_75pc_pTGC = miTargets.diff.75pc$SYMBOL,
  miRNAs_up_mTSC = miTargets.mTSC.up$SYMBOL,
  miRNAs_up_pTGC = miTargets.pTGC.up$SYMBOL,
  miRNAs_up_mTSC.and.pTGC = miTargets.mTSC.and.pTGC$SYMBOL,
  miRNAs_up_only.in.mTSC = miTargets.only.in.mTSC$SYMBOL,
  miRNAs_up_only.in.pTGC = miTargets.only.in.pTGC$SYMBOL
)
lengths(geneLists)
#>        miRNAs_75pc_mTSC        miRNAs_75pc_pTGC          miRNAs_up_mTSC 
#>                   10791                   11171                    9733 
#>          miRNAs_up_pTGC miRNAs_up_mTSC.and.pTGC  miRNAs_up_only.in.mTSC 
#>                    5810                    4589                    5144 
#>  miRNAs_up_only.in.pTGC 
#>                    1221


library(ggvenn)
y <- list(pTGC = miTargets.pTGC.up$SYMBOL, mTSC = miTargets.mTSC.up$SYMBOL)
ggvenn(
  y,
  fill_color = c("#c6007b", "#a0b600"),
  stroke_size = 0.3,
  set_name_size = 4
)

Figure 10: Target genes of miRNAs DE in diff and mTSC samples (intersection)

Enrichment analyses of target genes

source(  "/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq/assets/theme_clean.R")
library(TissueEnrich)
plotTE <- function(inputGenes = gene.list,
                   myColor = "color") {
  gs <-
    GeneSet(geneIds = inputGenes,
            organism = "Mus Musculus",
            geneIdType = SymbolIdentifier())
  output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
  en.output <-
    setNames(data.frame(assay(output[[1]]), 
                        row.names = rowData(output[[1]])[, 1]),
             colData(output[[1]])[, 1])
  en.output$Tissue <- rownames(en.output)
  logp <- -log10(0.05)
  en.output <-
    mutate(en.output,
           significance = ifelse(Log10PValue > logp,
                                 "colored", "nocolor"))
  en.output$Sig <- "NA"
  ggplot(en.output, aes(reorder(Tissue, Log10PValue),
                        Log10PValue, 
                        fill = significance)) +
    geom_bar(stat = 'identity') +
    theme_clean() + ylab("- log10 adj. p-value") + xlab("") +
    scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
    scale_y_continuous(expand = expansion(mult = c(0, .1)),
                       breaks = scales::pretty_breaks()) +
    coord_flip()
}

plotTE.heatmap <-
  function(inputGenes = gene.list,
           inputTissue = "E14.5-Brain",
           GeneNames = FALSE
           ) {
    gs <-
      GeneSet(geneIds = inputGenes,
              organism = "Mus Musculus",
              geneIdType = SymbolIdentifier())
    output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
    en.output <-
      setNames(data.frame(assay(output[[1]]),
                          row.names = rowData(output[[1]])[, 1]),
               colData(output[[1]])[, 1])
    en.output$Tissue <- rownames(en.output)
    seExp <- output[[2]][[inputTissue]]
    exp <-
      setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]), 
               colData(seExp)[, 1])
    exp <- head(exp[order(exp$`E14.5-Brain`, decreasing = TRUE), ], 25)
    g <- pheatmap(
      exp,
      color = heat_colors,
      cluster_rows = F,
      cluster_cols  = T,
      show_rownames = GeneNames,
      border_color = NA,
      fontsize = 10,
      scale = "row",
      fontsize_row = 8
    )
    g
  }
plotTE.heatmap.full <-
  function(inputGenes = gene.list,
           inputTissue = "E14.5-Brain",
           GeneNames = FALSE,
           string = "table.tsv"
           ) {
    gs <-
      GeneSet(geneIds = inputGenes,
              organism = "Mus Musculus",
              geneIdType = SymbolIdentifier())
    output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
    en.output <-
      setNames(data.frame(assay(output[[1]]),
                          row.names = rowData(output[[1]])[, 1]),
               colData(output[[1]])[, 1])
    en.output$Tissue <- rownames(en.output)
    seExp <- output[[2]][[inputTissue]]
    exp <-
      setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]), 
               colData(seExp)[, 1])
    d <- as.data.frame(rownames(exp))
    g <- pheatmap(
      exp,
      color = heat_colors,
      cluster_rows = F,
      cluster_cols  = T,
      show_rownames = GeneNames,
      border_color = NA,
      fontsize = 10,
      scale = "row",
      fontsize_row = 8
    )
    g
  }
setEnrichrSite("Enrichr")

makeEnrichR <-
  function(inputGenes = gene.list) {
    websiteLive <- TRUE
    myDBs <-
      c(
        "DisGeNET",
        "WikiPathways_2019_Human",
        "WikiPathways_2019_Mouse",
        "KEGG_2019_Mouse"
      )
    if (websiteLive) {
      enrichr(inputGenes, myDBs)
    }
  }
plotEnrichR <- function(enriched, table="string", myColor = "slateblue") {
  logp <- -log10(0.05)
  myData <- enriched[[table]]
  myData$negLogP <-  -log10(myData$P.value)
  myData <-
    mutate(myData,
           significance = ifelse(negLogP > logp, "colored", "nocolor"))
  myData$Sig <- "NA"
  myData <- head(arrange(myData, -negLogP, Term), 15)
  ggplot(myData, aes(reorder(Term, negLogP),
                     negLogP,
                     fill = significance)) +
    geom_bar(stat = 'identity') +
    theme_clean() + ylab("- log10 p-value") + xlab("") +
    scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
    scale_y_continuous(expand = expansion(mult = c(0, .1)),
                       breaks = scales::pretty_breaks()) +
    coord_flip()
}


heatmap.genelst <-
  function(inputGenes = gene.list,
           inputTissue = "E14.5-Brain"){
    gs <-
      GeneSet(geneIds = inputGenes,
              organism = "Mus Musculus",
              geneIdType = SymbolIdentifier())
    output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
    en.output <-
      setNames(data.frame(assay(output[[1]]),
                          row.names = rowData(output[[1]])[, 1]),
               colData(output[[1]])[, 1])
    en.output$Tissue <- rownames(en.output)
    seExp <- output[[2]][[inputTissue]]
    exp <-
      setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]),
               colData(seExp)[, 1])
    gl <- dplyr::select(exp, `E14.5-Brain`)  %>% rownames
    gl
  }

Enrichment plots (`TissueEnrich`)

plotTE(geneLists$miRNAs_up_pTGC, myColor = "#c6007b")

Figure 11A: Target genes of miRNAs up-regulated in pTGC enrichment in Mouse ENCODE data

plotTE(geneLists$miRNAs_up_mTSC, myColor = "#a0b600")

Figure 11B: Target genes of miRNAsup-regulated in mTSC) enrichment in Mouse ENCODE data

plotTE(geneLists$miRNAs_up_mTSC.and.pTGC, myColor = "#C19B5D")

Figure 11C: Target genes of miRNAs up-regulated in both pTGC and mTSC) enrichment in Mouse ENCODE data

plotTE(geneLists$miRNAs_up_only.in.pTGC, myColor = "#c6007b")

Figure 11D: Target genes of miRNAs up-regulated ONLY in pTGC) enrichment in Mouse ENCODE data

plotTE(geneLists$miRNAs_up_only.in.mTSC, myColor = "#a0b600")

Figure 11E: Target genes of miRNAs up-regulated ONLY in mTSC) enrichment in Mouse ENCODE data

Enrichment Heatmaps for E14.5-Brain (top 25, with gene names)

plotTE.heatmap(geneLists$miRNAs_up_pTGC,
               inputTissue = "E14.5-Brain",
               GeneNames = TRUE)

Figure 11A: Target genes of miRNAs up-regulated in pTGC enrichment in Mouse ENCODE data

plotTE.heatmap(geneLists$miRNAs_up_mTSC,
               inputTissue = "E14.5-Brain",
               GeneNames = TRUE)

Figure 11B: Target genes of miRNAsup-regulated in mTSC) enrichment in Mouse ENCODE data

plotTE.heatmap(
  geneLists$miRNAs_up_mTSC.and.pTGC,
  inputTissue = "E14.5-Brain",
  GeneNames = TRUE
)

Figure 11C: Target genes of miRNAs up-regulated in both pTGC and mTSC) enrichment in Mouse ENCODE data

plotTE.heatmap(
  geneLists$miRNAs_up_only.in.pTGC,
  inputTissue = "E14.5-Brain",
  GeneNames = TRUE
)

Figure 11D: Target genes of miRNAs up-regulated ONLY in pTGC) enrichment in Mouse ENCODE data

plotTE.heatmap(
  geneLists$miRNAs_up_only.in.mTSC,
  inputTissue = "E14.5-Brain",
  GeneNames = TRUE
)

Figure 11E: Target genes of miRNAs up-regulated ONLY in mTSC) enrichment in Mouse ENCODE data

Enrichment Heatmaps for E14.5-Brain (all, without gene names)

plotTE.heatmap.full(geneLists$miRNAs_up_pTGC,
                    inputTissue = "E14.5-Brain",
                    string = "up_in_diff.tsv",
                    GeneNames = FALSE)

Figure 11A: Target genes of miRNAs up-regulated in pTGC enrichment in Mouse ENCODE data

plotTE.heatmap.full(geneLists$miRNAs_up_mTSC,
                    inputTissue = "E14.5-Brain",
                    string = "up_in_undiff.tsv",
                    GeneNames = FALSE)

Figure 11B: Target genes of miRNAsup-regulated in mTSC) enrichment in Mouse ENCODE data

plotTE.heatmap.full(
  geneLists$miRNAs_up_mTSC.and.pTGC,
  inputTissue = "E14.5-Brain",
                    string = "up_in_diff_and_undiff.tsv",
  GeneNames = FALSE
)

Figure 11C: Target genes of miRNAs up-regulated in both pTGC and mTSC) enrichment in Mouse ENCODE data

plotTE.heatmap.full(
  geneLists$miRNAs_up_only.in.pTGC,
  inputTissue = "E14.5-Brain",
                    string = "up_in_diff_only.tsv",
  GeneNames = FALSE
)

Figure 11D: Target genes of miRNAs up-regulated ONLY in pTGC) enrichment in Mouse ENCODE data

plotTE.heatmap.full(
  geneLists$miRNAs_up_only.in.mTSC,
  inputTissue = "E14.5-Brain",
                    string = "up_in_udiff_only.tsv",
  GeneNames = FALSE
)

Figure 11E: Target genes of miRNAs up-regulated ONLY in mTSC) enrichment in Mouse ENCODE data

Extract target genes enriched for E14.5-Brain

brEnrTargets_up_mTSC.and.pTGC <-
  heatmap.genelst(geneLists$miRNAs_up_mTSC.and.pTGC,
                  inputTissue = "E14.5-Brain")
brEnrTargets_up_mTSC <- heatmap.genelst(geneLists$miRNAs_up_mTSC,
                                        inputTissue = "E14.5-Brain")
brEnrTargets_up_only.in.mTSC <-
  heatmap.genelst(geneLists$miRNAs_up_only.in.mTSC,
                  inputTissue = "E14.5-Brain")
brEnrTargets_up_only.in.pTGC <-
  heatmap.genelst(geneLists$miRNAs_up_only.in.pTGC,
                  inputTissue = "E14.5-Brain")
brEnrTargets_up_pTGC <- heatmap.genelst(geneLists$miRNAs_up_pTGC,
                                        inputTissue = "E14.5-Brain")
brEnrTargets_75pc_mTSC <- heatmap.genelst(geneLists$miRNAs_75pc_mTSC,
                                          inputTissue = "E14.5-Brain")
brEnrTargets_75pc_pTGC <- heatmap.genelst(geneLists$miRNAs_75pc_pTGC,
                                          inputTissue = "E14.5-Brain")


brainGeneLists = list(
  brEnrTargets_up_mTSC.and.pTGC = brEnrTargets_up_mTSC.and.pTGC,
  brEnrTargets_up_mTSC = brEnrTargets_up_mTSC,
  brEnrTargets_up_only.in.mTSC = brEnrTargets_up_only.in.mTSC,
  brEnrTargets_up_only.in.pTGC = brEnrTargets_up_only.in.pTGC,
  brEnrTargets_up_pTGC = brEnrTargets_up_pTGC,
  brEnrTargets_75pc_pTGC = brEnrTargets_75pc_pTGC,
  brEnrTargets_75pc_mTSC = brEnrTargets_75pc_mTSC
)
lengths(brainGeneLists)
#> brEnrTargets_up_mTSC.and.pTGC          brEnrTargets_up_mTSC 
#>                           353                           578 
#>  brEnrTargets_up_only.in.mTSC  brEnrTargets_up_only.in.pTGC 
#>                           225                            50 
#>          brEnrTargets_up_pTGC        brEnrTargets_75pc_pTGC 
#>                           403                           641 
#>        brEnrTargets_75pc_mTSC 
#>                           627

miRNA targets, disease db enrichment tests

undiffEnr <- makeEnrichR(brainGeneLists$brEnrTargets_up_mTSC)
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
diffEnr <- makeEnrichR(brainGeneLists$brEnrTargets_up_pTGC)
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
undiffOnlyEnr <- makeEnrichR(brainGeneLists$brEnrTargets_up_only.in.mTSC)
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
diffOnlyEnr <- makeEnrichR(brainGeneLists$brEnrTargets_up_only.in.pTGC)
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
bothEnr <- makeEnrichR(brainGeneLists$brEnrTargets_up_mTSC.and.pTGC)
#> Uploading data to Enrichr... Done.
#>   Querying DisGeNET... Done.
#>   Querying WikiPathways_2019_Human... Done.
#>   Querying WikiPathways_2019_Mouse... Done.
#>   Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
# save these, so we don't query the server frequentyly
save(undiffEnr, diffEnr, undiffOnlyEnr, diffOnlyEnr, bothEnr, file = "results/enrichmentResults.Rdata")

1. pTGC

plotEnrichR(diffEnr, table = "DisGeNET", myColor = "#c6007b")

placeholder

plotEnrichR(diffEnr, table = "WikiPathways_2019_Mouse", myColor = "#c6007b")

placeholder

plotEnrichR(diffEnr, table = "WikiPathways_2019_Human", myColor = "#c6007b")

placeholder

plotEnrichR(diffEnr, table = "KEGG_2019_Mouse", myColor = "#c6007b")

placeholder

2. TSC

plotEnrichR(undiffEnr, table = "DisGeNET", myColor = "#a0b600")

placeholder

plotEnrichR(undiffEnr, table = "WikiPathways_2019_Mouse", myColor = "#a0b600")

placeholder

plotEnrichR(undiffEnr, table = "WikiPathways_2019_Human", myColor = "#a0b600")

placeholder

plotEnrichR(undiffEnr, table = "KEGG_2019_Mouse", myColor = "#a0b600")

placeholder

3. Both pTGC and TSC

plotEnrichR(bothEnr, table = "DisGeNET", myColor = "#C19B5D")

placeholder

plotEnrichR(bothEnr, table = "WikiPathways_2019_Mouse", myColor = "#C19B5D")

placeholder

plotEnrichR(bothEnr, table = "WikiPathways_2019_Human", myColor = "#C19B5D")

placeholder

plotEnrichR(bothEnr, table = "KEGG_2019_Mouse", myColor = "#C19B5D")

placeholder

4. pTGC only

plotEnrichR(diffOnlyEnr, table = "DisGeNET", myColor = "#c6007b")

placeholder

plotEnrichR(diffOnlyEnr, table = "WikiPathways_2019_Mouse", myColor = "#c6007b")

placeholder

plotEnrichR(diffOnlyEnr, table = "WikiPathways_2019_Human", myColor = "#c6007b")

placeholder

plotEnrichR(diffOnlyEnr, table = "KEGG_2019_Mouse", myColor = "#c6007b")

placeholder

5. TSC only

plotEnrichR(undiffOnlyEnr, table = "DisGeNET", myColor = "#a0b600")

placeholder

plotEnrichR(undiffOnlyEnr, table = "WikiPathways_2019_Mouse", myColor = "#a0b600")

placeholder

plotEnrichR(undiffOnlyEnr, table = "WikiPathways_2019_Human", myColor = "#a0b600")

placeholder

plotEnrichR(undiffOnlyEnr, table = "KEGG_2019_Mouse", myColor = "#a0b600")

placeholder

Other plots

myTE <- function(inputGenes = gene.list,
                 myColor = "slateblue") {
  inputGenes
  gs <-
    GeneSet(
      geneIds = unique(inputGenes),
      organism = "Mus Musculus",
      geneIdType = SymbolIdentifier()
    )
  output <- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
  en.output <-
    setNames(data.frame(assay(output[[1]]),
                        row.names = rowData(output[[1]])[, 1]),
             colData(output[[1]])[, 1])
  en.output$Tissue <- rownames(en.output)
  rownames(en.output) <- c()
  en.output
}

runTEonList <- function(myTargetList = miRNAs_75pc_mTSC) {
  data_list <- split(myTargetList, f = myTargetList$mirbase_id)
  data_list <- Filter(Negate(is.na), data_list)
  myDataList <- lapply(data_list, `[[`, "SYMBOL")
  myDataList <-
    myDataList[names(myDataList) != "" &
                 myDataList != "" & lengths(myDataList) > 0]
  myDataList <- compact(myDataList)
  te.miRNA <- lapply(myDataList, myTE)
  infoTE <- lapply(te.miRNA, `[`, c("Log10PValue", "Tissue"))
  infoTE <- purrr::imap(infoTE, ~ mutate(.x, miRNA = .y))
  merged_info  <- bind_rows(infoTE, .id = "miRNA")
  merged_info <-
    merged_info[!(is.na(merged_info$miRNA) |
                    merged_info$miRNA == ""), ]
  merged_info
}
plotBubbleTE <- function(merged_info = TE.miRNAs_75pc_mTSC) {
  ggplot(merged_info, aes(x = miRNA, y = Tissue)) +
  geom_point(aes(size = Log10PValue, fill = Tissue),
             alpha = 0.75,
             shape = 21) +
  scale_size_continuous(limits = c(1, 8), range = c(1, 8)) +
  guides(fill = "none") +
  labs(x = "miRNAs",
       y = "",
       size = "-log10(p-adj)",
       fill = "")  +
  theme(
    legend.key = element_blank(),
    axis.text.x = element_text(
      colour = "black",
      size = 8,
      angle = 45,
      vjust = 1,
      hjust = 1
    ),
    axis.text.y = element_text(colour = "black", size = 8),
    legend.text = element_text(size = 10, colour = "black"),
    legend.title = element_text(size = 12),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA),
    legend.position = "right") 
}

TE bubble plots for each miRNA

Enrichment bubble plots for each miRNA (`TissueEnrich`)

TE.miRNAs_75pc_pTGC <- runTEonList(myTargetList = mirs_up_pTGC)
plotBubbleTE(merged_info = TE.miRNAs_75pc_pTGC)
#> Warning: Removed 271 rows containing missing values (`geom_point()`).

TE results for each miRNA, upregulated in pTGC

TE.miRNAs_up_mTSC <- runTEonList(myTargetList = mirs_up_mTSC)
plotBubbleTE(merged_info = TE.miRNAs_up_mTSC)
#> Warning: Removed 1697 rows containing missing values (`geom_point()`).

TE results for each miRNA, upregulated in mTSC

TE.miRNAs_mTSC.and.pTGC <- runTEonList(myTargetList = mirs_mTSC_and_pTGC)
plotBubbleTE(merged_info = TE.miRNAs_mTSC.and.pTGC)
#> Warning: Removed 2329 rows containing missing values (`geom_point()`).

TE results for each miRNA, both mTSC and pTGC

TE.miRNAs_75pc_mTSC <- runTEonList(myTargetList = miRNAs_75pc_mTSC)
plotBubbleTE(merged_info = TE.miRNAs_75pc_mTSC)
#> Warning: Removed 2082 rows containing missing values (`geom_point()`).

TE results for each miRNA, top quartile mTSC

TE.miRNAs_75pc_pTGC <- runTEonList(myTargetList = miRNAs_75pc_pTGC)
plotBubbleTE(merged_info = TE.miRNAs_75pc_pTGC)
#> Warning: Removed 2310 rows containing missing values (`geom_point()`).

TE results for each miRNA, top quartile pTGC

Save RData

Apart from saving the entire session, we will also genelists as R objects (rds) for for later.

save.image(file = "results/smallRNAseq_draft.RData")
saveRDS(targetsLists, "results/targetsLists_75pc.rds")
saveRDS(brainGeneLists, "results/brainGeneLists.rds")

MultiQC report:

MultiQC report is available at this link

Session Information

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] TissueEnrich_1.18.0         GSEABase_1.60.0            
#>  [3] graph_1.76.0                annotate_1.76.0            
#>  [5] XML_3.99-0.12               AnnotationDbi_1.60.0       
#>  [7] ensurer_1.1                 ggvenn_0.1.10              
#>  [9] enrichR_3.1                 splitstackshape_1.4.8      
#> [11] ComplexUpset_1.3.3          PupillometryR_0.0.4        
#> [13] rlang_1.1.1                 reshape2_1.4.4             
#> [15] biomaRt_2.54.1              ggrepel_0.9.3              
#> [17] genefilter_1.80.3           pheatmap_1.0.12            
#> [19] RColorBrewer_1.1-3          DESeq2_1.38.3              
#> [21] SummarizedExperiment_1.28.0 Biobase_2.58.0             
#> [23] MatrixGenerics_1.10.0       matrixStats_0.63.0         
#> [25] GenomicRanges_1.50.1        GenomeInfoDb_1.34.3        
#> [27] IRanges_2.32.0              S4Vectors_0.36.0           
#> [29] BiocGenerics_0.44.0         plotly_4.10.1              
#> [31] forcats_0.5.2               stringr_1.5.0              
#> [33] dplyr_1.0.10                purrr_1.0.1                
#> [35] readr_2.1.3                 tidyr_1.3.0                
#> [37] tibble_3.1.8                ggplot2_3.4.0              
#> [39] tidyverse_1.3.2             scales_1.2.1               
#> [41] knitr_1.41                 
#> 
#> loaded via a namespace (and not attached):
#>  [1] readxl_1.4.1           backports_1.4.1        BiocFileCache_2.6.1   
#>  [4] plyr_1.8.8             lazyeval_0.2.2         splines_4.2.2         
#>  [7] BiocParallel_1.32.1    digest_0.6.30          htmltools_0.5.3       
#> [10] fansi_1.0.3            magrittr_2.0.3         memoise_2.0.1         
#> [13] googlesheets4_1.0.1    tzdb_0.3.0             Biostrings_2.66.0     
#> [16] modelr_0.1.10          vroom_1.6.0            timechange_0.1.1      
#> [19] rmdformats_1.0.4       prettyunits_1.1.1      colorspace_2.0-3      
#> [22] blob_1.2.3             rvest_1.0.3            rappdirs_0.3.3        
#> [25] haven_2.5.1            xfun_0.35              crayon_1.5.2          
#> [28] RCurl_1.98-1.9         jsonlite_1.8.3         survival_3.4-0        
#> [31] glue_1.6.2             gtable_0.3.1           gargle_1.2.1          
#> [34] zlibbioc_1.44.0        XVector_0.38.0         DelayedArray_0.24.0   
#> [37] DBI_1.1.3              Rcpp_1.0.9             viridisLite_0.4.1     
#> [40] xtable_1.8-4           progress_1.2.2         bit_4.0.5             
#> [43] htmlwidgets_1.5.4      httr_1.4.4             ellipsis_0.3.2        
#> [46] pkgconfig_2.0.3        farver_2.1.1           sass_0.4.3            
#> [49] dbplyr_2.2.1           locfit_1.5-9.7         utf8_1.2.2            
#> [52] tidyselect_1.2.0       labeling_0.4.2         munsell_0.5.0         
#> [55] cellranger_1.1.0       tools_4.2.2            cachem_1.0.6          
#> [58] cli_3.4.1              generics_0.1.3         RSQLite_2.2.18        
#> [61] broom_1.0.1            evaluate_0.18          fastmap_1.1.0         
#> [64] yaml_2.3.6             bit64_4.0.5            fs_1.5.2              
#> [67] KEGGREST_1.38.0        xml2_1.3.3             compiler_4.2.2        
#> [70] rstudioapi_0.14        filelock_1.0.2         curl_4.3.3            
#> [73] png_0.1-7              reprex_2.0.2           geneplotter_1.76.0    
#> [76] bslib_0.4.1            stringi_1.7.8          highr_0.9             
#> [79] lattice_0.20-45        Matrix_1.5-3           vctrs_0.6.2           
#> [82] pillar_1.8.1           lifecycle_1.0.3        jquerylib_0.1.4       
#> [85] data.table_1.14.6      bitops_1.0-7           patchwork_1.1.2       
#> [88] R6_2.5.1               bookdown_0.33          codetools_0.2-18      
#> [91] assertthat_0.2.1       rjson_0.2.21           withr_2.5.0           
#> [94] GenomeInfoDbData_1.2.9 parallel_4.2.2         hms_1.1.2             
#> [97] rmarkdown_2.18         googledrive_2.0.0      lubridate_1.9.0

Small RNAseq: Differential Expression Analysis