Small RNAseq: Differential Expression Analysis
Environment Setup
salloc -N 1 --exclusive -p amd -t 8:00:00
conda env create -f conda-env.yml
conda activate smallrna
Downloading datasets
Raw data
Raw data was downloaded from the sequencing facility using the secure
link, with wget
command. The downloaded files were checked
for md5sum and compared against list of files expected as per the input
samples provided.
wget https://oc1.rnet.missouri.edu/xyxz
# link masked
# GEO link will be included later
# merge files of same samples (technical replicates)
paste <(ls *_L001_R1_001.fastq.gz) <(ls *_L002_R1_001.fastq.gz) | \
sed 's/\t/ /g' |\
awk '{print "cat",$1,$2" > "$1}' |\
sed 's/_L001_R1_001.fastq.gz/.fq.gz/2' > concatenate.sh
chmod +x concatenate.sh
sh concatenate.sh
Genome/annotation
Additional files required for the analyses were downloaded from GenCode. The downloaded files are as follows:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/GRCm39.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gff3.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M30/gencode.vM30.annotation.gtf.gz
gunzip GRCm39.primary_assembly.genome.fa.gz
gunzip gencode.vM30.annotation.gff3.gz
gunzip gencode.vM30.annotation.gtf.gz
FastQC (before processing)
for fq in *.fq.gz; do
fastqc --threads $SLURM_JOB_CPUS_PER_NODE $fq;
done
mkdir -p fastqc_pre
mv *.zip *.html fastqc_pre/
Mapping
To index the genome, following command was run (in an interactive session).
fastaGenome="GRCm39.genome.fa"
gtf="gencode.vM30.annotation.gtf"
STAR --runThreadN $SLURM_JOB_CPUS_PER_NODE \
--runMode genomeGenerate \
--genomeDir $(pwd) \
--genomeFastaFiles $fastaGenome \
--sjdbGTFfile $gtf \
--sjdbOverhang 1
Each fastq
file was mapped to the indexed genome as
using runSTAR_map.sh
script shown below:
#!/bin/bash
read1=$1
out=$(basename ${read1%%.*})
STARgenomeDir=$(pwd)
# illumina adapter
adapterseq="AGATCGGAAGAGC"
STAR \
--genomeDir ${STARgenomeDir} \
--readFilesIn ${read1} \
--outSAMunmapped Within \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFilterMultimapNmax 20 \
--clip3pAdapterSeq ${adapterseq} \
--clip3pAdapterMMp 0.1 \
--outFilterMismatchNoverLmax 0.03 \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outFilterMatchNmin 16 \
--outFileNamePrefix ${out} \
--alignSJDBoverhangMin 1000 \
--alignIntronMax 1 \
--runThreadN ${SLURM_JOB_CPUS_PER_NODE} \
--genomeLoad LoadAndKeep \
--limitBAMsortRAM 30000000000 \
--outSAMheaderHD "@HD VN:1.4 SO:coordinate"
Mapping was run with a simple loop:
for fq in *.fq.gz; do
runSTAR_map.sh $fq;
done
Counting Stats
Counts generated by STAR
with option
--quantMode GeneCounts
were parsed to generate summary
stats as well as to extract annotated small RNA feature counts.
mkdir -p counts_files
# copy counts for each sample
cp *ReadsPerGene.out.tab counts_files/
cd counts_files
# merge counts
join_files.sh *ReadsPerGene.out.tab |\
sed 's/ReadsPerGene.out.tab//g' |\
grep -v "^N_" > counts_star.tsv
# merge stats
join_files.sh *ReadsPerGene.out.tab |\
sed 's/ReadsPerGene.out.tab//g' |\
head -n 1 > summary_star.tsv
join_files.sh *ReadsPerGene.out.tab |\
sed 's/ReadsPerGene.out.tab//g' |\
grep "^N_" >> summary_star.tsv
# parse GTF to extact gene.id and its biotype:
gtf=gencode.vM30.annotation.gtf
awk 'BEGIN{OFS=FS="\t"} $3=="gene" {split($9,a,";"); print a[1],a[2]}' ${gtf} |\
awk '{print $4"\t"$2}' |\
sed 's/"//g' > GeneType_GeneID.tsv
cut -f 1 GeneType_GeneID.tsv | sort |uniq > features.txt
The information for biotype
as provided by the gencodegenes
were used for categorizing biotype
.
The smallRNA
group consists of following
biotype
:
miRNA
misc_RNA
scRNA
snRNA
snoRNA
sRNA
scaRNA
The full table is as follows:
library(knitr)
setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
="assets/GeneType_Group.tsv"
file1<-
info read.csv(
file1,header = TRUE,
sep = "\t",
stringsAsFactors = TRUE
)kable(info, caption = "Table 1: biotype and its groupings")
biotype | group |
---|---|
protein_coding | coding_genes |
pseudogene | pseudogenes |
TR_C_gene | Ig_genes |
TR_D_gene | Ig_genes |
TR_J_gene | Ig_genes |
TR_V_gene | Ig_genes |
IG_C_gene | Ig_genes |
IG_D_gene | Ig_genes |
IG_J_gene | Ig_genes |
IG_LV_gene | Ig_genes |
IG_V_gene | Ig_genes |
TR_J_pseudogene | pseudogenes |
TR_V_pseudogene | pseudogenes |
IG_C_pseudogene | pseudogenes |
IG_D_pseudogene | pseudogenes |
IG_pseudogene | pseudogenes |
IG_V_pseudogene | pseudogenes |
lncRNA | long_non_conding_RNA |
miRNA | non_conding_RNA |
misc_RNA | non_conding_RNA |
ribozyme | non_conding_RNA |
rRNA | non_conding_RNA |
scaRNA | non_conding_RNA |
scRNA | non_conding_RNA |
snoRNA | non_conding_RNA |
snRNA | non_conding_RNA |
sRNA | non_conding_RNA |
Mt_rRNA | non_conding_RNA |
Mt_tRNA | non_conding_RNA |
processed_pseudogene | pseudogenes |
unprocessed_pseudogene | pseudogenes |
translated_unprocessed_pseudogene | pseudogenes |
transcribed_processed_pseudogene | pseudogenes |
transcribed_unitary_pseudogene | pseudogenes |
transcribed_unprocessed_pseudogene | pseudogenes |
unitary_pseudogene | pseudogenes |
TEC | unconfirmed_genes |
A samples table (samples.tsv
) categorizing samples to
its condition were also generated:
="assets/samples.tsv"
file2<-
samples read.csv(
file2,header = TRUE,
sep = "\t",
stringsAsFactors = TRUE
)kable(samples, caption = "Table 2: Samples in the study")
Sample | Group |
---|---|
pTGC_1 | pTGC |
pTGC_2 | pTGC |
pTGC_3 | pTGC |
pTGC_4 | pTGC |
mTSC_1 | mTSC |
mTSC_2 | mTSC |
mTSC_3 | mTSC |
mTSC_4 | mTSC |
This information was then merged withe counts table to generate QC plots:
awk 'BEGIN{OFS=FS="\t"}FNR==NR{a[$1]=$2;next}{ print $2,$1,a[$1]}' \
> GeneID_GeneType_Group.tsv
GeneType_Group.tsv GeneType_GeneID.tsv
awk 'BEGIN{OFS=FS="\t"}FNR==NR{a[$1]=$2"\t"$3;next}{print $1,a[$1],$0}' \
|\
GeneID_GeneType_Group.tsv counts_star.tsv cut -f 1-3,5- > processed_counts_star.tsv
Plotting the mapping summary and count statistics for various biotypes:
library(scales)
library(tidyverse)
library(plotly)
setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
="assets/processed_counts_star.tsv"
file1="assets/summary_stats_star.tsv"
file2<-
counts read.csv(
file1,sep = "\t",
stringsAsFactors = TRUE
)<-
subread read.csv(
file2,sep = "\t",
stringsAsFactors = TRUE
)# convert long format
<- gather(counts, Sample, Count, pTGC_1:mTSC_4, factor_key=TRUE)
counts.long <- gather(subread, Sample, Count, pTGC_1:mTSC_4, factor_key=TRUE)
subread.long # organize
$Group <-
counts.longfactor(
$Group,
counts.longlevels = c(
"coding_genes",
"non_conding_RNA",
"long_non_conding_RNA",
"pseudogenes",
"unconfirmed_genes",
"Ig_genes"
)
)
$Assignments <-
subread.longfactor(
$Assignments,
subread.longlevels = c(
"N_input",
"N_unmapped",
"N_multimapping",
"N_unique",
"N_ambiguous",
"N_noFeature"
) )
ggplot(subread.long, aes(x = Assignments, y = Count, fill = Assignments)) +
geom_bar(stat = 'identity') +
labs(x = "Subread assingments", y = "reads") + theme_minimal() +
scale_y_continuous(labels = label_comma()) +
scale_fill_manual(
values = c(
"N_input" = "#577b92",
"N_unmapped" = "#021927",
"N_multimapping" = "#007caa",
"N_unique" = "#1a3c54",
"N_ambiguous" = "#3a5b63",
"N_noFeature" = "#33526c"
)+
) theme(
axis.text.x = element_text(
angle = 45,
vjust = 1,
hjust = 1,
size = 12
),strip.text = element_text(
face = "bold",
color = "gray35",
hjust = 0,
size = 10
),strip.background = element_rect(fill = "white", linetype = "blank"),
legend.position = "none"
+
) facet_wrap("Sample", scales = "free_y", ncol = 4)
<- ggplot(counts.long, aes(x = Group, y = Count, fill = Group)) +
g geom_bar(stat = 'sum') +
labs(x = "biotype", y = "read counts") + theme_minimal() +
scale_y_continuous(labels = label_comma()) +
scale_fill_manual(values = c(
"coding_genes" = "#92b5b7",
"non_conding_RNA" = "#be4f54",
"long_non_conding_RNA" = "#eca87a",
"pseudogenes" = "#784440",
"unconfirmed_genes" = "#eba3a4",
"Ig_genes" = "#8a3a1b"
+
)) theme(
axis.text.x = element_text(
angle = 45,
vjust = 1,
hjust = 1,
size = 12
),strip.text = element_text(
face = "bold",
color = "gray35",
hjust = 0,
size = 10
),strip.background = element_rect(fill = "white", linetype = "blank"),
legend.position = "none"
+
) facet_wrap("Sample", scales = "free_y", ncol = 4)
g
<- filter(counts.long, Group %in% "non_conding_RNA" )
counts.nc $GeneType <-
counts.ncfactor(
$GeneType,
counts.nclevels = c(
"miRNA",
"misc_RNA",
"snoRNA",
"snRNA",
"sRNA",
"scRNA",
"scaRNA",
"Mt_tRNA",
"Mt_rRNA",
"rRNA",
"ribozyme"
)
)
<-
g ggplot(counts.nc, aes(x = GeneType, y = Count, fill = GeneType)) +
geom_bar(stat = 'sum') +
labs(x = "biotype", y = "read counts") + theme_minimal() +
scale_y_continuous(labels = label_comma()) +
scale_fill_manual(
values = c(
'miRNA' = '#54693e',
'misc_RNA' = '#9c47cb',
'snRNA' = '#94d14f',
'snoRNA' = '#5c4f9c',
'scaRNA' = '#cca758',
'sRNA' = '#c85a90',
'Mt_tRNA' = '#80d0a8',
'Mt_rRNA' = '#c4533b',
'rRNA' = '#9ea5c0',
'ribozyme' = '#51333c'
)+
) theme(
axis.text.x = element_text(
angle = 45,
vjust = 1,
hjust = 1,
size = 12
),strip.text = element_text(
face = "bold",
color = "gray35",
hjust = 0,
size = 10
),strip.background = element_rect(fill = "white", linetype = "blank"),
legend.position = "none"
+
) facet_wrap("Sample", scales = "free_y", ncol = 4)
#ggplotly(g)
g
subset the counts file to select only smallRNA genes
<- c('miRNA',
snrna 'misc_RNA',
'scRNA',
'snRNA',
'snoRNA',
'sRNA',
'scaRNA')
<- dplyr::filter(counts, GeneType %in% snrna) %>%
cts ::select(Geneid, pTGC_1:mTSC_4)
dplyrwrite_delim(cts, file = "assets/noncoding_counts_star.tsv", delim = "\t")
This noncoding_counts_star.tsv
and
samples.tsv
file will be used for DESeq2
analyses.
DESeq2
For the next steps, we used DESeq2
for performing the DE
analyses. Results were visualized as volcano plots and tables were
exported to excel.
Load packages
setwd("/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq")
library(DESeq2)
library(RColorBrewer)
library(pheatmap)
library(genefilter)
library(ggrepel)
library(biomaRt)
library(reshape2)
library(PupillometryR)
library(ComplexUpset)
library(splitstackshape)
library(enrichR)
Import counts and sample metadata
The counts
data and its associated metadata
(coldata
) are imported for analyses.
= 'assets/noncoding_counts_star.tsv'
counts = 'assets/samples.tsv'
groupFile <-
coldata read.csv(
groupFile,row.names = 1,
sep = "\t",
stringsAsFactors = TRUE
)<- as.matrix(read.csv(counts, sep = "\t", row.names = "Geneid")) cts
Reorder columns of cts
according to coldata
rows. Check if samples in both files match.
colnames(cts)
#> [1] "pTGC_1" "pTGC_2" "pTGC_3" "pTGC_4" "mTSC_1" "mTSC_2" "mTSC_3" "mTSC_4"
all(rownames(coldata) %in% colnames(cts))
#> [1] TRUE
<- cts[, rownames(coldata)] cts
Normalize
The batch corrected read counts are then used for running DESeq2 analyses
<- DESeqDataSetFromMatrix(countData = cts,
dds colData = coldata,
design = ~ Group)
<- vst(dds, blind = FALSE, nsub =500)
vsd <- rowSums(counts(dds)) >= 5
keep <- dds[keep, ]
dds <- DESeq(dds)
dds
dds#> class: DESeqDataSet
#> dim: 1266 8
#> metadata(1): version
#> assays(4): counts mu H cooks
#> rownames(1266): ENSMUSG00000119106.1 ENSMUSG00000119589.1 ...
#> ENSMUSG00000065444.3 ENSMUSG00000077869.3
#> rowData names(22): baseMean baseVar ... deviance maxCooks
#> colnames(8): pTGC_1 pTGC_2 ... mTSC_3 mTSC_4
#> colData names(2): Group sizeFactor
<- assay(vst(dds, blind = FALSE, nsub = 500))
vst <- vst(dds, blind = FALSE, nsub = 500)
vsd <-
pcaData plotPCA(vsd,
intgroup = "Group",
returnData = TRUE)
<- round(100 * attr(pcaData, "percentVar")) percentVar
PCA plot for QC
PCA plot for the dataset that includes all libraries.
<- rowVars(assay(vsd))
rv <-
select order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
<- prcomp(t(assay(vsd)[select, ]))
pca <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
percentVar = "Group"
intgroup <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
intgroup.df <- if (length(intgroup) == 1) {
group factor(apply(intgroup.df, 1, paste, collapse = " : "))
}<- data.frame(
d PC1 = pca$x[, 1],
PC2 = pca$x[, 2],
intgroup.df,name = colnames(vsd)
)
plot PCA for components 1 and 2
#nudge <- position_nudge(y = 0.5)
<- ggplot(d, aes(PC1, PC2, color = Group)) +
g scale_shape_manual(values = 1:8) +
scale_color_manual(values = c('pTGC' = '#c6007b',
'mTSC' = '#a0b600')) +
theme_bw() +
theme(legend.title = element_blank()) +
geom_point(size = 2, stroke = 2) +
geom_text_repel(aes(label = name)) +
xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))
g
Sample distance for QC
<- dist(t(assay(vsd)))
sampleDists <- as.matrix( sampleDists )
sampleDistMatrix rownames(sampleDistMatrix) <- colnames(vsd)
colnames(sampleDistMatrix) <- NULL
<- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
colors pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
Set contrasts and find DE genes
resultsNames(dds)
#> [1] "Intercept" "Group_pTGC_vs_mTSC"
<- results(dds, contrast = c("Group", "mTSC", "pTGC"))
res.UndfvsDiff table(res.UndfvsDiff$padj < 0.05)
#>
#> FALSE TRUE
#> 579 294
<- res.UndfvsDiff[order(res.UndfvsDiff$padj),]
res.UndfvsDiff <-
res.UndfvsDiffdata merge(
as.data.frame(res.UndfvsDiff),
as.data.frame(counts(dds, normalized = TRUE)),
by = "row.names",
sort = FALSE
)
Creating gene lists
The gene lists have Ensembl gene-ID-version. We need
mirbase_id
, gene_biotype
and
external_gene_name
attached to make the results
interpretable. So we will download metadata from ensembl using
biomaRt
package.
= useMart("ENSEMBL_MART_ENSEMBL")
ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
ensembl <- "ensembl_gene_id_version"
filterType <- rownames(cts)
filterValues <- c('ensembl_gene_id_version',
attributeNames 'ensembl_gene_id',
'gene_biotype',
'mirbase_id',
'external_gene_name')
<- getBM(
annot attributes = attributeNames,
filters = filterType,
values = filterValues,
mart = ensembl
)<- duplicated(annot$ensembl_gene_id)
isDup <- annot$ensembl_gene_id[isDup]
dup <- annot[!annot$ensembl_gene_id %in% dup, ] #this object will be saved and used later
annot saveRDS(annot, file = "assets/annot.rds")
The rds is saved and reloaded for subsequent run (instead of querying biomart over and over for the same information)
<- readRDS("assets/annot.rds") annot
Volcano plots
<-
volcanoPlots2 function(res.se,
string,
first,
second,
color1,
color2,
color3,
ChartTitle) {<- res.se[order(res.se$padj),]
res.se <-
res.se rownames_to_column(as.data.frame(res.se[order(res.se$padj),]))
names(res.se)[1] <- "Gene"
<-
res.data merge(res.se,
annot,by.x = "Gene",
by.y = "ensembl_gene_id_version")
<- res.data %>% mutate_all(na_if, "")
res.data <- res.data %>% mutate_all(na_if, " ")
res.data <-
res.data %>% mutate(gene_symbol = coalesce(external_gene_name, Gene))
res.data = 1.5
fc = log(fc, base = 2)
log2fc = log2fc * -1
neg.log2fc
$diffexpressed <- "other.genes"
res.data$diffexpressed[res.data$log2FoldChange >= log2fc &
res.data$padj <= 0.05] <- paste("Higher expression in", first)
res.data$diffexpressed[res.data$log2FoldChange <= neg.log2fc &
res.data$padj <= 0.05] <- paste("Higher expression in", second)
res.data
$delabel <- NA
res.data$delabel[res.data$log2FoldChange >= log2fc &
res.data$padj <= 0.05 &
res.data!is.na(res.data$padj)] <-
$gene_symbol[res.data$log2FoldChange >= log2fc &
res.data$padj <= 0.05 & !is.na(res.data$padj)]
res.data$delabel[res.data$log2FoldChange <= neg.log2fc &
res.data$padj <= 0.05 &
res.data!is.na(res.data$padj)] <-
$gene_symbol[res.data$log2FoldChange <= neg.log2fc &
res.data$padj <= 0.05 & !is.na(res.data$padj)]
res.dataggplot(res.data,
aes(
x = log2FoldChange,
y = -log10(padj),
col = diffexpressed,
label = delabel
+
)) geom_point(alpha = 0.5) +
xlim(-6, 6) +
theme_classic() +
scale_color_manual(name = "Expression", values = c(color1, color2, color3)) +
geom_text_repel(
data = subset(res.data, padj <= 0.05),
max.overlaps = 15,
show.legend = F,
min.segment.length = Inf,
seed = 42,
box.padding = 0.5
+
) ggtitle(ChartTitle) +
xlab(paste("log2 fold change")) +
ylab("-log10 pvalue (adjusted)") +
theme(legend.text.align = 0)
}
<- volcanoPlots2(
g
res.UndfvsDiff,"mTSC_vs_pTGC",
"mTSC",
"pTGC",
"#a0b600",
"#c6007b",
"grey",
ChartTitle = "mTSC (undifferentiated) vs. pTGC (differentiated)"
)
g#> Warning: Removed 394 rows containing missing values (`geom_point()`).
#> Warning: Removed 1 rows containing missing values (`geom_text_repel()`).
#> Warning: ggrepel: 282 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
Heatmap
Heatmap for the top 30 variable genes:
<- head(order(rowVars(assay(vsd)), decreasing = TRUE), 30)
topVarGenes <- assay(vsd)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
mat <- merge(mat,
mat2
annot,by.x = 'row.names',
by.y = "ensembl_gene_id_version")
$gene <- paste0(mat2$external_gene_name," (",mat2$gene_biotype,")")
mat2rownames(mat2) <- mat2[,'gene']
<- mat2[2:9]
mat2 <- brewer.pal(9, "YlOrRd")
heat_colors <- pheatmap(
g
mat2,color = heat_colors,
main = "Top 30 variable small RNA genes",
cluster_rows = T,
cluster_cols = T,
show_rownames = T,
border_color = NA,
fontsize = 10,
scale = "row",
fontsize_row = 10
) g
Write result tables
Here, we will attach the mirbase_id
,
gene_biotype
and external_gene_name
downloaded
in the previous section to the results table. We will also filter the
the table to write:
- DE list that have
padj
less than or equal to0.05
- DE list that have
padj
less than or equal to0.05
, andlog2FoldChange
greater than or equal to fold change of1.5
- DE list that have
padj
less than or equal to0.05
, andlog2FoldChange
less than or equal to fold change of-1.5
- Full list of DE table without any filtering.
names(res.UndfvsDiffdata)[1] <- "ensembl_gene_id_version"
<-
res.UndfvsDiffdata merge(x = res.UndfvsDiffdata,
y = annot,
by = "ensembl_gene_id_version",
all.x = TRUE)
<-
res.UndfvsDiffdata c(1,
res.UndfvsDiffdata[, ncol(res.UndfvsDiffdata) - 2):ncol(res.UndfvsDiffdata),
(2:(ncol(res.UndfvsDiffdata) - 3))]
<- res.UndfvsDiffdata %>%
res.UndfvsDiffSig filter(padj <= 0.05)
= 1.5
fc = log(fc, base = 2)
log2fc = log2fc * -1
neg.log2fc <- res.UndfvsDiffdata %>%
res.UndfvsDiffSig.up filter(padj <= 0.05 & log2FoldChange >= log2fc)
<- res.UndfvsDiffdata %>%
res.UndfvsDiffSig.dw filter(padj <= 0.05 & log2FoldChange <= neg.log2fc)
<- res.UndfvsDiffSig.up %>% dplyr::filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)
mTSC.up <- res.UndfvsDiffSig.dw %>% dplyr::filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)
pTGC.up
write_delim(res.UndfvsDiffdata,
file = "results/DESeq2_results_mTSC_vs_pTGC_full-table.tsv",
delim = "\t")
write_delim(res.UndfvsDiffSig,
file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05.tsv",
delim = "\t")
write_delim(res.UndfvsDiffSig.up,
file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05_fc-ge-1.5.tsv",
delim = "\t")
write_delim(res.UndfvsDiffSig.dw,
file = "results/DESeq2_results_mTSC_vs_pTGC_adj.p-le-0.05_fc-le-neg1.5.tsv",
delim = "\t")
<- mTSC.up
mi.mTSC.up <- pTGC.up
mi.pTGC.up <- unique(rbind(mi.mTSC.up, mi.pTGC.up))
mi.mTSC.and.pTGC
write_delim(mTSC.up,
file = "results/miRNAs_up_mTSC.txt",
delim = "\t")
write_delim(pTGC.up,
file = "results/miRNAs_up_pTGC.txt",
delim = "\t")
write_delim(mi.mTSC.and.pTGC,
file = "results/miRNAs_up_mTSC.and.pTGC.txt",
delim = "\t")
Characterizing smallRNA expression
To check the expression of genes in each condition, we looked at the highly expressed genes and their composition. The analyses is shown below.
<- res.UndfvsDiffdata[, c(1:4, 11:ncol(res.UndfvsDiffdata))]
exp $external_gene_name <-
expifelse(exp$external_gene_name == "",
$ensembl_gene_id_version,
exp$external_gene_name)
exp$gene <- paste0(exp$external_gene_name, "(", exp$gene_biotype, ")")
exp<- exp %>% dplyr::select(gene, ensembl_gene_id_version:mTSC_4)
renamed.exp <-
renamed.exp.long melt(
renamed.exp,id.vars = c(
'gene',
'ensembl_gene_id_version',
'gene_biotype',
'mirbase_id',
'external_gene_name'
)
)colnames(renamed.exp.long) <-
c('gene',
'ensembl_gene_id_version',
'gene_biotype',
'mirbase_id',
'external_gene_name',
'replicates',
'norm.expression'
)$condition <- "NA"
renamed.exp.long$condition[which(str_detect(renamed.exp.long$replicates, "pTGC_"))] <-
renamed.exp.long"pTGC"
$condition[which(str_detect(renamed.exp.long$replicates, "mTSC_"))] <-
renamed.exp.long"mTSC"
<-
renamed.exp.long %>% filter(norm.expression > 0)
renamed.exp.long <- na.omit(renamed.exp.long)
renamed.exp.long # SOURCE: https://ourcodingclub.github.io/tutorials/dataviz-beautification/
<- function() {
theme_niwot theme_bw() +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
panel.border = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
plot.margin = unit(c(1, 1, 1, 1), units = , "cm"),
plot.title = element_text(
size = 12,
vjust = 1,
hjust = 0
),legend.text = element_text(size = 12),
legend.title = element_blank(),
legend.position = c(0.95, 0.15),
legend.key = element_blank(),
legend.background = element_rect(
color = "black",
fill = "transparent",
size = 2,
linetype = "blank"
)
) }
Normalized expression plots
<-
vlnPlot function(dataIn = renamed.exp.long,
xcol = "replicates",
fillcol = "condition",
expre = "norm.expression") {
<- ggplot(data = dataIn,
p aes_string(x = xcol, y = expre, fill = fillcol)) +
geom_flat_violin(position = position_nudge(x = 0.2, y = 0),
alpha = 0.6,
trim = FALSE) +
geom_point(
aes_string(y = expre, color = fillcol),
position = position_jitter(width = 0.15),
size = 1,
alpha = 0.5
+
) geom_boxplot(width = 0.2,
outlier.shape = NA,
alpha = 0.8) + stat_summary(
fun = mean,
geom = "point",
shape = 23,
size = 2
+
) labs(y = "Normalized Expression", x = NULL) +
guides(fill = "none", color = "none") +
scale_y_log10(label = comma) +
scale_fill_manual(values = c('pTGC' = '#c6007b',
'mTSC' = '#a0b600')) +
scale_color_manual(values = c('pTGC' = '#c6007b',
'mTSC' = '#a0b600')) +
theme_niwot() + theme(axis.text.x = element_text(
angle = 45,
vjust = 1,
hjust = 1
))
p }
Expression plots
Replicates
vlnPlot(xcol="replicates")
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation ideoms with `aes()`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: Using the `size` aesthietic with geom_polygon was deprecated in ggplot2 3.4.0.
#> ℹ Please use the `linewidth` aesthetic instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Conditions
vlnPlot(xcol="condition")
Biotypes
vlnPlot(xcol="gene_biotype")
Highly expressed small RNAs
<- renamed.exp %>%
df.both rowwise() %>%
mutate(Both = mean(c(
pTGC_1, pTGC_2, pTGC_3, pTGC_4, mTSC_1, mTSC_2, mTSC_3, mTSC_4%>%
))) ::select(gene:external_gene_name, Both) %>%
dplyr::filter(Both > 0) %>%
dplyrungroup() %>%
mutate(quart = ntile(Both, 4)) %>%
mutate(decile = ntile(Both, 10))
<- renamed.exp %>%
df.diff rowwise() %>%
mutate(pTGC = mean(c(
pTGC_1, pTGC_2, pTGC_3, pTGC_4%>%
))) ::select(gene:external_gene_name, pTGC) %>%
dplyr::filter(pTGC > 0) %>%
dplyrungroup() %>%
mutate(quart = ntile(pTGC, 4)) %>%
mutate(decile = ntile(pTGC, 10))
<- renamed.exp %>%
df.undiff rowwise() %>%
mutate(mTSC = mean(c(
mTSC_1, mTSC_2, mTSC_3, mTSC_4%>%
))) ::select(gene:external_gene_name, mTSC) %>%
dplyr::filter(mTSC > 0) %>%
dplyrungroup() %>%
mutate(quart = ntile(mTSC, 4)) %>%
mutate(decile = ntile(mTSC, 10))
<- function(dataIn = df.undiff,
filterCuts cutOff = 4,
type = decile) {
<- enquo(type)
type %>%
dataIn ::filter(!!type == cutOff) %>%
dplyr::select(ensembl_gene_id_version)
dplyr
}.75pc <- filterCuts(df.both, 4, type = quart)
both.75pc <- filterCuts(df.undiff, 4, type = quart)
undiff.75pc <- filterCuts(df.diff, 4, type = quart)
diff<- function(dataIn = df) {
filterMiRNA %>% left_join(
dataIn
annot,by = "ensembl_gene_id_version",
%>% filter(gene_biotype == "miRNA") %>% dplyr::select(mirbase_id)
)
} .75pc <- filterMiRNA(both.75pc)
mi.both.75pc <- filterMiRNA(undiff.75pc)
mi.undiff.75pc <- filterMiRNA(diff.75pc)
mi.diffwrite_delim(
.75pc,
mi.undifffile = paste0("results/miRNAs_75pc_mTSC.txt"),
delim = "\t"
)write_delim(
.75pc,
mi.difffile = paste0("results/miRNAs_75pc_pTGC.txt"),
delim = "\t"
)write_delim(
.75pc,
mi.bothfile = paste0("results/miRNAs_75pc_both.txt"),
delim = "\t"
)<- data.frame(table(diff.75pc)) %>%
quartile.data full_join(data.frame(table(undiff.75pc)),
by = c("ensembl_gene_id_version" = "ensembl_gene_id_version")) %>%
replace(is.na(.), 0) %>%
::rename(diff75pc=Freq.x, undiff75pc=Freq.y) %>%
dplyrleft_join(annot, by = c("ensembl_gene_id_version" = "ensembl_gene_id_version")) %>%
::select(ensembl_gene_id_version, diff75pc, undiff75pc, gene_biotype) %>%
dplyrcolumn_to_rownames(var = "ensembl_gene_id_version")
colnames(quartile.data) <- c("pTGC.75pc", "mTSC.75pc", "gene_biotype")
<- function(fulltable = decile.data) {
plotUpSet <- colnames(fulltable)[1:2]
inter <- upset(
p
fulltable,
inter,annotations = list(
'smallRNA type' = (
ggplot(mapping = aes(fill = gene_biotype)) +
geom_bar(stat = 'count', position = 'fill') +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(
values = c(
'miRNA' = '#54693e',
'misc_RNA' = '#9c47cb',
'snRNA' = '#94d14f',
'snoRNA' = '#5c4f9c',
'scaRNA' = '#cca758',
'sRNA' = '#c85a90'
)
)+ ylab('smallRNA proportion')
)
),queries = list(
upset_query(
intersect = inter,
color = '#C19B5D',
fill = '#C19B5D',
only_components = c('intersections_matrix', 'Intersection size')
),upset_query(
intersect = inter[1],
color = '#c6007b',
fill = '#c6007b',
only_components = c('intersections_matrix', 'Intersection size')
),upset_query(
intersect = inter[2],
color = '#a0b600',
fill = '#a0b600',
only_components = c('intersections_matrix', 'Intersection size')
)
),width_ratio = 0.4,
set_sizes = (
upset_set_size(geom = geom_bar(
aes(fill = gene_biotype, x = group),
width = 0.8
),position = 'right') + theme(legend.position = "none") +
scale_fill_manual(
values = c(
'miRNA' = '#54693e',
'misc_RNA' = '#9c47cb',
'snRNA' = '#94d14f',
'snoRNA' = '#5c4f9c',
'scaRNA' = '#cca758',
'sRNA' = '#c85a90'
)
)
),guides = 'over'
)
p }
Intersection plots
quartile
plotUpSet(quartile.data)
Save intersection results
<-
filterUpsetRes function(datatable, title = "title") {
<-
data.annot merge(datatable[1:2],
annot,by.x = 0,
by.y = "ensembl_gene_id_version",
all.x = TRUE)
<- data.annot %>%
mirna.annot ::filter(.[[2]] == 1 & .[[3]] == 1) %>%
dplyr::filter(gene_biotype == "miRNA")
dplyr<- data.annot %>%
snrna.annot ::filter(.[[2]] == 1 & .[[3]] == 1)
dplyrwrite_delim(
mirna.annot,file = paste0("results/intersection_miRNA_", title , ".tsv"),
delim = "\t"
)write_delim(
snrna.annot,file = paste0("results/intersection_smallRNA_", title , ".tsv"),
delim = "\t"
)
}filterUpsetRes(quartile.data, title = "quartile")
Finding target genes for miRNAs
The mirdb.org database allows you to find target genes for each miRNA. We downloaded the file from the server and linked each miRNA form our lists to their targets The information downloaded from the server had score information, and for many other species. A filtered list of just mouse, with score >= 80 was created. The target gene ids were in RefSeq format, which was convererted to gene symbol as well.
cd assets
wget https://mirdb.org/download/miRDB_v6.0_prediction_result.txt.gz
gunzip miRDB_v6.0_prediction_result.txt.gz
library(org.Mm.eg.db)
<- read.delim("assets/miRDB_v6.0_prediction_result.txt", header=FALSE)
mirdb colnames(mirdb) <- c("mirbase_id", "refseq", "score")
<- mirdb %>% filter(str_detect(mirbase_id, "^mmu-") & score >= 80) %>%
mmu.mirdb ::select(mirbase_id, refseq)
dplyr<- c("REFSEQ", "ENSEMBL", "SYMBOL", "GENENAME")
cols <- AnnotationDbi::select(org.Mm.eg.db, keys=mmu.mirdb$refseq, columns=cols, keytype="REFSEQ")
mirdb.ids <- left_join(mmu.mirdb, mirdb.ids, by=c('refseq'='REFSEQ')) %>%
mirdb.info ::select(mirbase_id, refseq, ENSEMBL, SYMBOL) %>%
dplyrdistinct() %>%
mutate(mirbase_noprime = gsub("-3p$|-5p$", "", mirbase_id),
prime = gsub("^-", "", str_extract(mirbase_id, "-3p$|-5p$"))) %>%
::select(mirbase_noprime, refseq, ENSEMBL, SYMBOL, prime)
dplyrcolnames(mirdb.info) <- c("mirbase_id", "refseq", "ENSEMBL", "SYMBOL", "prime")
$mirbase_id <- tolower(mirdb.info$mirbase_id)
mirdb.infowrite_delim(mirdb.info, "targets/mirDB_info.tsv", delim = "\t")
saveRDS(mirdb.info, file = "assets/mirdb.rds")
The rds is saved and reloaded for subsequent run (instead of rerunning everytime we compiled the RMarkdown).
<- readRDS("assets/mirdb.rds") mirdb.info
Target genes for miRNAs
<- mi.mTSC.up %>%
miTargets.mTSC.up left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
::select(SYMBOL) %>%
dplyrdistinct(SYMBOL)
<- mi.pTGC.up %>%
miTargets.pTGC.up left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
::select(SYMBOL) %>%
dplyrdistinct(SYMBOL)
<- as.data.frame(intersect(miTargets.mTSC.up$SYMBOL,
miTargets.mTSC.and.pTGC $SYMBOL))
miTargets.pTGC.upcolnames(miTargets.mTSC.and.pTGC) <- "SYMBOL"
<-
miTargets.only.in.mTSC as.data.frame(miTargets.mTSC.up$SYMBOL[!(miTargets.mTSC.up$SYMBOL %in%
$SYMBOL)])
miTargets.pTGC.upcolnames(miTargets.only.in.mTSC) <- "SYMBOL"
<-
miTargets.only.in.pTGC as.data.frame(miTargets.pTGC.up$SYMBOL[!(miTargets.pTGC.up$SYMBOL %in%
$SYMBOL)])
miTargets.mTSC.upcolnames(miTargets.only.in.pTGC) <- "SYMBOL"
.75pc <- mi.diff.75pc %>%
miTargets.diffleft_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
::select(SYMBOL) %>%
dplyrdistinct(SYMBOL)
.75pc <- mi.undiff.75pc %>%
miTargets.undiffleft_join(mirdb.info, by = c("mirbase_id" = "mirbase_id")) %>%
::select(SYMBOL) %>%
dplyrdistinct(SYMBOL)
<- mi.undiff.75pc %>%
miRNAs_75pc_mTSC left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
<- mi.diff.75pc %>%
miRNAs_75pc_pTGC left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
<- mi.mTSC.up %>%
mirs_up_mTSC left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
<- mi.pTGC.up %>%
mirs_up_pTGC left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
<- mi.mTSC.and.pTGC %>%
mirs_mTSC_and_pTGC left_join(mirdb.info, by = c("mirbase_id" = "mirbase_id"))
# create miRNA and thier target lists
= list(
targetsLists miRNAs_75pc_mTSC = miRNAs_75pc_mTSC,
miRNAs_75pc_pTGC = miRNAs_75pc_pTGC,
miRNAs_up_mTSC = mirs_up_mTSC,
miRNAs_up_pTGC = mirs_up_pTGC,
miRNAs_mTSC.and.pTGC = mirs_mTSC_and_pTGC
)#create a geneList for later use
= list(
geneLists miRNAs_75pc_mTSC = miTargets.undiff.75pc$SYMBOL,
miRNAs_75pc_pTGC = miTargets.diff.75pc$SYMBOL,
miRNAs_up_mTSC = miTargets.mTSC.up$SYMBOL,
miRNAs_up_pTGC = miTargets.pTGC.up$SYMBOL,
miRNAs_up_mTSC.and.pTGC = miTargets.mTSC.and.pTGC$SYMBOL,
miRNAs_up_only.in.mTSC = miTargets.only.in.mTSC$SYMBOL,
miRNAs_up_only.in.pTGC = miTargets.only.in.pTGC$SYMBOL
)lengths(geneLists)
#> miRNAs_75pc_mTSC miRNAs_75pc_pTGC miRNAs_up_mTSC
#> 10791 11171 9733
#> miRNAs_up_pTGC miRNAs_up_mTSC.and.pTGC miRNAs_up_only.in.mTSC
#> 5810 4589 5144
#> miRNAs_up_only.in.pTGC
#> 1221
library(ggvenn)
<- list(pTGC = miTargets.pTGC.up$SYMBOL, mTSC = miTargets.mTSC.up$SYMBOL)
y ggvenn(
y,fill_color = c("#c6007b", "#a0b600"),
stroke_size = 0.3,
set_name_size = 4
)
Enrichment analyses of target genes
source( "/work/LAS/geetu-lab/arnstrm/mouse.trophoblast.smallRNAseq/assets/theme_clean.R")
library(TissueEnrich)
<- function(inputGenes = gene.list,
plotTE myColor = "color") {
<-
gs GeneSet(geneIds = inputGenes,
organism = "Mus Musculus",
geneIdType = SymbolIdentifier())
<- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
output <-
en.output setNames(data.frame(assay(output[[1]]),
row.names = rowData(output[[1]])[, 1]),
colData(output[[1]])[, 1])
$Tissue <- rownames(en.output)
en.output<- -log10(0.05)
logp <-
en.output mutate(en.output,
significance = ifelse(Log10PValue > logp,
"colored", "nocolor"))
$Sig <- "NA"
en.outputggplot(en.output, aes(reorder(Tissue, Log10PValue),
Log10PValue, fill = significance)) +
geom_bar(stat = 'identity') +
theme_clean() + ylab("- log10 adj. p-value") + xlab("") +
scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
scale_y_continuous(expand = expansion(mult = c(0, .1)),
breaks = scales::pretty_breaks()) +
coord_flip()
}
<-
plotTE.heatmap function(inputGenes = gene.list,
inputTissue = "E14.5-Brain",
GeneNames = FALSE
) {<-
gs GeneSet(geneIds = inputGenes,
organism = "Mus Musculus",
geneIdType = SymbolIdentifier())
<- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
output <-
en.output setNames(data.frame(assay(output[[1]]),
row.names = rowData(output[[1]])[, 1]),
colData(output[[1]])[, 1])
$Tissue <- rownames(en.output)
en.output<- output[[2]][[inputTissue]]
seExp <-
exp setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]),
colData(seExp)[, 1])
<- head(exp[order(exp$`E14.5-Brain`, decreasing = TRUE), ], 25)
exp <- pheatmap(
g
exp,color = heat_colors,
cluster_rows = F,
cluster_cols = T,
show_rownames = GeneNames,
border_color = NA,
fontsize = 10,
scale = "row",
fontsize_row = 8
)
g
}<-
plotTE.heatmap.full function(inputGenes = gene.list,
inputTissue = "E14.5-Brain",
GeneNames = FALSE,
string = "table.tsv"
) {<-
gs GeneSet(geneIds = inputGenes,
organism = "Mus Musculus",
geneIdType = SymbolIdentifier())
<- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
output <-
en.output setNames(data.frame(assay(output[[1]]),
row.names = rowData(output[[1]])[, 1]),
colData(output[[1]])[, 1])
$Tissue <- rownames(en.output)
en.output<- output[[2]][[inputTissue]]
seExp <-
exp setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]),
colData(seExp)[, 1])
<- as.data.frame(rownames(exp))
d <- pheatmap(
g
exp,color = heat_colors,
cluster_rows = F,
cluster_cols = T,
show_rownames = GeneNames,
border_color = NA,
fontsize = 10,
scale = "row",
fontsize_row = 8
)
g
}setEnrichrSite("Enrichr")
<-
makeEnrichR function(inputGenes = gene.list) {
<- TRUE
websiteLive <-
myDBs c(
"DisGeNET",
"WikiPathways_2019_Human",
"WikiPathways_2019_Mouse",
"KEGG_2019_Mouse"
)if (websiteLive) {
enrichr(inputGenes, myDBs)
}
}<- function(enriched, table="string", myColor = "slateblue") {
plotEnrichR <- -log10(0.05)
logp <- enriched[[table]]
myData $negLogP <- -log10(myData$P.value)
myData<-
myData mutate(myData,
significance = ifelse(negLogP > logp, "colored", "nocolor"))
$Sig <- "NA"
myData<- head(arrange(myData, -negLogP, Term), 15)
myData ggplot(myData, aes(reorder(Term, negLogP),
negLogP,fill = significance)) +
geom_bar(stat = 'identity') +
theme_clean() + ylab("- log10 p-value") + xlab("") +
scale_fill_manual(values = c("colored" = myColor, "nocolor" = "gray")) +
scale_y_continuous(expand = expansion(mult = c(0, .1)),
breaks = scales::pretty_breaks()) +
coord_flip()
}
<-
heatmap.genelst function(inputGenes = gene.list,
inputTissue = "E14.5-Brain"){
<-
gs GeneSet(geneIds = inputGenes,
organism = "Mus Musculus",
geneIdType = SymbolIdentifier())
<- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
output <-
en.output setNames(data.frame(assay(output[[1]]),
row.names = rowData(output[[1]])[, 1]),
colData(output[[1]])[, 1])
$Tissue <- rownames(en.output)
en.output<- output[[2]][[inputTissue]]
seExp <-
exp setNames(data.frame(assay(seExp), row.names = rowData(seExp)[, 1]),
colData(seExp)[, 1])
<- dplyr::select(exp, `E14.5-Brain`) %>% rownames
gl
gl }
Enrichment plots (TissueEnrich
)
pTGC
plotTE(geneLists$miRNAs_up_pTGC, myColor = "#c6007b")
mTSC
plotTE(geneLists$miRNAs_up_mTSC, myColor = "#a0b600")
pTGC & mTSC
plotTE(geneLists$miRNAs_up_mTSC.and.pTGC, myColor = "#C19B5D")
pTGC ONLY
plotTE(geneLists$miRNAs_up_only.in.pTGC, myColor = "#c6007b")
mTSC ONLY
plotTE(geneLists$miRNAs_up_only.in.mTSC, myColor = "#a0b600")
Enrichment Heatmaps for E14.5-Brain (top 25, with gene names)
pTGC
plotTE.heatmap(geneLists$miRNAs_up_pTGC,
inputTissue = "E14.5-Brain",
GeneNames = TRUE)
mTSC
plotTE.heatmap(geneLists$miRNAs_up_mTSC,
inputTissue = "E14.5-Brain",
GeneNames = TRUE)
pTGC & mTSC
plotTE.heatmap(
$miRNAs_up_mTSC.and.pTGC,
geneListsinputTissue = "E14.5-Brain",
GeneNames = TRUE
)
pTGC ONLY
plotTE.heatmap(
$miRNAs_up_only.in.pTGC,
geneListsinputTissue = "E14.5-Brain",
GeneNames = TRUE
)
mTSC ONLY
plotTE.heatmap(
$miRNAs_up_only.in.mTSC,
geneListsinputTissue = "E14.5-Brain",
GeneNames = TRUE
)
Enrichment Heatmaps for E14.5-Brain (all, without gene names)
pTGC
plotTE.heatmap.full(geneLists$miRNAs_up_pTGC,
inputTissue = "E14.5-Brain",
string = "up_in_diff.tsv",
GeneNames = FALSE)
mTSC
plotTE.heatmap.full(geneLists$miRNAs_up_mTSC,
inputTissue = "E14.5-Brain",
string = "up_in_undiff.tsv",
GeneNames = FALSE)
pTGC & mTSC
plotTE.heatmap.full(
$miRNAs_up_mTSC.and.pTGC,
geneListsinputTissue = "E14.5-Brain",
string = "up_in_diff_and_undiff.tsv",
GeneNames = FALSE
)
pTGC ONLY
plotTE.heatmap.full(
$miRNAs_up_only.in.pTGC,
geneListsinputTissue = "E14.5-Brain",
string = "up_in_diff_only.tsv",
GeneNames = FALSE
)
mTSC ONLY
plotTE.heatmap.full(
$miRNAs_up_only.in.mTSC,
geneListsinputTissue = "E14.5-Brain",
string = "up_in_udiff_only.tsv",
GeneNames = FALSE
)
Extract target genes enriched for E14.5-Brain
<-
brEnrTargets_up_mTSC.and.pTGC heatmap.genelst(geneLists$miRNAs_up_mTSC.and.pTGC,
inputTissue = "E14.5-Brain")
<- heatmap.genelst(geneLists$miRNAs_up_mTSC,
brEnrTargets_up_mTSC inputTissue = "E14.5-Brain")
<-
brEnrTargets_up_only.in.mTSC heatmap.genelst(geneLists$miRNAs_up_only.in.mTSC,
inputTissue = "E14.5-Brain")
<-
brEnrTargets_up_only.in.pTGC heatmap.genelst(geneLists$miRNAs_up_only.in.pTGC,
inputTissue = "E14.5-Brain")
<- heatmap.genelst(geneLists$miRNAs_up_pTGC,
brEnrTargets_up_pTGC inputTissue = "E14.5-Brain")
<- heatmap.genelst(geneLists$miRNAs_75pc_mTSC,
brEnrTargets_75pc_mTSC inputTissue = "E14.5-Brain")
<- heatmap.genelst(geneLists$miRNAs_75pc_pTGC,
brEnrTargets_75pc_pTGC inputTissue = "E14.5-Brain")
= list(
brainGeneLists brEnrTargets_up_mTSC.and.pTGC = brEnrTargets_up_mTSC.and.pTGC,
brEnrTargets_up_mTSC = brEnrTargets_up_mTSC,
brEnrTargets_up_only.in.mTSC = brEnrTargets_up_only.in.mTSC,
brEnrTargets_up_only.in.pTGC = brEnrTargets_up_only.in.pTGC,
brEnrTargets_up_pTGC = brEnrTargets_up_pTGC,
brEnrTargets_75pc_pTGC = brEnrTargets_75pc_pTGC,
brEnrTargets_75pc_mTSC = brEnrTargets_75pc_mTSC
)lengths(brainGeneLists)
#> brEnrTargets_up_mTSC.and.pTGC brEnrTargets_up_mTSC
#> 353 578
#> brEnrTargets_up_only.in.mTSC brEnrTargets_up_only.in.pTGC
#> 225 50
#> brEnrTargets_up_pTGC brEnrTargets_75pc_pTGC
#> 403 641
#> brEnrTargets_75pc_mTSC
#> 627
miRNA targets, disease db enrichment tests
<- makeEnrichR(brainGeneLists$brEnrTargets_up_mTSC)
undiffEnr #> Uploading data to Enrichr... Done.
#> Querying DisGeNET... Done.
#> Querying WikiPathways_2019_Human... Done.
#> Querying WikiPathways_2019_Mouse... Done.
#> Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
<- makeEnrichR(brainGeneLists$brEnrTargets_up_pTGC)
diffEnr #> Uploading data to Enrichr... Done.
#> Querying DisGeNET... Done.
#> Querying WikiPathways_2019_Human... Done.
#> Querying WikiPathways_2019_Mouse... Done.
#> Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
<- makeEnrichR(brainGeneLists$brEnrTargets_up_only.in.mTSC)
undiffOnlyEnr #> Uploading data to Enrichr... Done.
#> Querying DisGeNET... Done.
#> Querying WikiPathways_2019_Human... Done.
#> Querying WikiPathways_2019_Mouse... Done.
#> Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
<- makeEnrichR(brainGeneLists$brEnrTargets_up_only.in.pTGC)
diffOnlyEnr #> Uploading data to Enrichr... Done.
#> Querying DisGeNET... Done.
#> Querying WikiPathways_2019_Human... Done.
#> Querying WikiPathways_2019_Mouse... Done.
#> Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
Sys.sleep(60)
<- makeEnrichR(brainGeneLists$brEnrTargets_up_mTSC.and.pTGC)
bothEnr #> Uploading data to Enrichr... Done.
#> Querying DisGeNET... Done.
#> Querying WikiPathways_2019_Human... Done.
#> Querying WikiPathways_2019_Mouse... Done.
#> Querying KEGG_2019_Mouse... Done.
#> Parsing results... Done.
# save these, so we don't query the server frequentyly
save(undiffEnr, diffEnr, undiffOnlyEnr, diffOnlyEnr, bothEnr, file = "results/enrichmentResults.Rdata")
1. pTGC
DisGeNET
plotEnrichR(diffEnr, table = "DisGeNET", myColor = "#c6007b")
WikiPathways_2019_Mouse
plotEnrichR(diffEnr, table = "WikiPathways_2019_Mouse", myColor = "#c6007b")
WikiPathways_2019_Human
plotEnrichR(diffEnr, table = "WikiPathways_2019_Human", myColor = "#c6007b")
KEGG_2019_Mouse
plotEnrichR(diffEnr, table = "KEGG_2019_Mouse", myColor = "#c6007b")
2. TSC
DisGeNET
plotEnrichR(undiffEnr, table = "DisGeNET", myColor = "#a0b600")
WikiPathways_2019_Mouse
plotEnrichR(undiffEnr, table = "WikiPathways_2019_Mouse", myColor = "#a0b600")
WikiPathways_2019_Human
plotEnrichR(undiffEnr, table = "WikiPathways_2019_Human", myColor = "#a0b600")
KEGG_2019_Mouse
plotEnrichR(undiffEnr, table = "KEGG_2019_Mouse", myColor = "#a0b600")
3. Both pTGC and TSC
DisGeNET
plotEnrichR(bothEnr, table = "DisGeNET", myColor = "#C19B5D")
WikiPathways_2019_Mouse
plotEnrichR(bothEnr, table = "WikiPathways_2019_Mouse", myColor = "#C19B5D")
WikiPathways_2019_Human
plotEnrichR(bothEnr, table = "WikiPathways_2019_Human", myColor = "#C19B5D")
KEGG_2019_Mouse
plotEnrichR(bothEnr, table = "KEGG_2019_Mouse", myColor = "#C19B5D")
4. pTGC only
DisGeNET
plotEnrichR(diffOnlyEnr, table = "DisGeNET", myColor = "#c6007b")
WikiPathways_2019_Mouse
plotEnrichR(diffOnlyEnr, table = "WikiPathways_2019_Mouse", myColor = "#c6007b")
WikiPathways_2019_Human
plotEnrichR(diffOnlyEnr, table = "WikiPathways_2019_Human", myColor = "#c6007b")
KEGG_2019_Mouse
plotEnrichR(diffOnlyEnr, table = "KEGG_2019_Mouse", myColor = "#c6007b")
5. TSC only
DisGeNET
plotEnrichR(undiffOnlyEnr, table = "DisGeNET", myColor = "#a0b600")
WikiPathways_2019_Mouse
plotEnrichR(undiffOnlyEnr, table = "WikiPathways_2019_Mouse", myColor = "#a0b600")
WikiPathways_2019_Human
plotEnrichR(undiffOnlyEnr, table = "WikiPathways_2019_Human", myColor = "#a0b600")
KEGG_2019_Mouse
plotEnrichR(undiffOnlyEnr, table = "KEGG_2019_Mouse", myColor = "#a0b600")
Other plots
<- function(inputGenes = gene.list,
myTE myColor = "slateblue") {
inputGenes<-
gs GeneSet(
geneIds = unique(inputGenes),
organism = "Mus Musculus",
geneIdType = SymbolIdentifier()
)<- teEnrichment(inputGenes = gs, rnaSeqDataset = 3)
output <-
en.output setNames(data.frame(assay(output[[1]]),
row.names = rowData(output[[1]])[, 1]),
colData(output[[1]])[, 1])
$Tissue <- rownames(en.output)
en.outputrownames(en.output) <- c()
en.output
}
<- function(myTargetList = miRNAs_75pc_mTSC) {
runTEonList <- split(myTargetList, f = myTargetList$mirbase_id)
data_list <- Filter(Negate(is.na), data_list)
data_list <- lapply(data_list, `[[`, "SYMBOL")
myDataList <-
myDataList names(myDataList) != "" &
myDataList[!= "" & lengths(myDataList) > 0]
myDataList <- compact(myDataList)
myDataList <- lapply(myDataList, myTE)
te.miRNA <- lapply(te.miRNA, `[`, c("Log10PValue", "Tissue"))
infoTE <- purrr::imap(infoTE, ~ mutate(.x, miRNA = .y))
infoTE <- bind_rows(infoTE, .id = "miRNA")
merged_info <-
merged_info !(is.na(merged_info$miRNA) |
merged_info[$miRNA == ""), ]
merged_info
merged_info
}<- function(merged_info = TE.miRNAs_75pc_mTSC) {
plotBubbleTE ggplot(merged_info, aes(x = miRNA, y = Tissue)) +
geom_point(aes(size = Log10PValue, fill = Tissue),
alpha = 0.75,
shape = 21) +
scale_size_continuous(limits = c(1, 8), range = c(1, 8)) +
guides(fill = "none") +
labs(x = "miRNAs",
y = "",
size = "-log10(p-adj)",
fill = "") +
theme(
legend.key = element_blank(),
axis.text.x = element_text(
colour = "black",
size = 8,
angle = 45,
vjust = 1,
hjust = 1
),axis.text.y = element_text(colour = "black", size = 8),
legend.text = element_text(size = 10, colour = "black"),
legend.title = element_text(size = 12),
panel.background = element_blank(),
panel.border = element_rect(colour = "black", fill = NA),
legend.position = "right")
}
TE bubble plots for each miRNA
Enrichment bubble plots for each miRNA
(TissueEnrich
)
pTGC
<- runTEonList(myTargetList = mirs_up_pTGC)
TE.miRNAs_75pc_pTGC plotBubbleTE(merged_info = TE.miRNAs_75pc_pTGC)
#> Warning: Removed 271 rows containing missing values (`geom_point()`).
mTSC
<- runTEonList(myTargetList = mirs_up_mTSC)
TE.miRNAs_up_mTSC plotBubbleTE(merged_info = TE.miRNAs_up_mTSC)
#> Warning: Removed 1697 rows containing missing values (`geom_point()`).
pTGC & mTSC
<- runTEonList(myTargetList = mirs_mTSC_and_pTGC)
TE.miRNAs_mTSC.and.pTGC plotBubbleTE(merged_info = TE.miRNAs_mTSC.and.pTGC)
#> Warning: Removed 2329 rows containing missing values (`geom_point()`).
upper quartile mTSC
<- runTEonList(myTargetList = miRNAs_75pc_mTSC)
TE.miRNAs_75pc_mTSC plotBubbleTE(merged_info = TE.miRNAs_75pc_mTSC)
#> Warning: Removed 2082 rows containing missing values (`geom_point()`).
upper quartile pTGC
<- runTEonList(myTargetList = miRNAs_75pc_pTGC)
TE.miRNAs_75pc_pTGC plotBubbleTE(merged_info = TE.miRNAs_75pc_pTGC)
#> Warning: Removed 2310 rows containing missing values (`geom_point()`).
Save RData
Apart from saving the entire session, we will also genelists as R
objects (rds
) for for later.
save.image(file = "results/smallRNAseq_draft.RData")
saveRDS(targetsLists, "results/targetsLists_75pc.rds")
saveRDS(brainGeneLists, "results/brainGeneLists.rds")
MultiQC report:
MultiQC report is available at this link
Session Information
sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] grid stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] TissueEnrich_1.18.0 GSEABase_1.60.0
#> [3] graph_1.76.0 annotate_1.76.0
#> [5] XML_3.99-0.12 AnnotationDbi_1.60.0
#> [7] ensurer_1.1 ggvenn_0.1.10
#> [9] enrichR_3.1 splitstackshape_1.4.8
#> [11] ComplexUpset_1.3.3 PupillometryR_0.0.4
#> [13] rlang_1.1.1 reshape2_1.4.4
#> [15] biomaRt_2.54.1 ggrepel_0.9.3
#> [17] genefilter_1.80.3 pheatmap_1.0.12
#> [19] RColorBrewer_1.1-3 DESeq2_1.38.3
#> [21] SummarizedExperiment_1.28.0 Biobase_2.58.0
#> [23] MatrixGenerics_1.10.0 matrixStats_0.63.0
#> [25] GenomicRanges_1.50.1 GenomeInfoDb_1.34.3
#> [27] IRanges_2.32.0 S4Vectors_0.36.0
#> [29] BiocGenerics_0.44.0 plotly_4.10.1
#> [31] forcats_0.5.2 stringr_1.5.0
#> [33] dplyr_1.0.10 purrr_1.0.1
#> [35] readr_2.1.3 tidyr_1.3.0
#> [37] tibble_3.1.8 ggplot2_3.4.0
#> [39] tidyverse_1.3.2 scales_1.2.1
#> [41] knitr_1.41
#>
#> loaded via a namespace (and not attached):
#> [1] readxl_1.4.1 backports_1.4.1 BiocFileCache_2.6.1
#> [4] plyr_1.8.8 lazyeval_0.2.2 splines_4.2.2
#> [7] BiocParallel_1.32.1 digest_0.6.30 htmltools_0.5.3
#> [10] fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1
#> [13] googlesheets4_1.0.1 tzdb_0.3.0 Biostrings_2.66.0
#> [16] modelr_0.1.10 vroom_1.6.0 timechange_0.1.1
#> [19] rmdformats_1.0.4 prettyunits_1.1.1 colorspace_2.0-3
#> [22] blob_1.2.3 rvest_1.0.3 rappdirs_0.3.3
#> [25] haven_2.5.1 xfun_0.35 crayon_1.5.2
#> [28] RCurl_1.98-1.9 jsonlite_1.8.3 survival_3.4-0
#> [31] glue_1.6.2 gtable_0.3.1 gargle_1.2.1
#> [34] zlibbioc_1.44.0 XVector_0.38.0 DelayedArray_0.24.0
#> [37] DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1
#> [40] xtable_1.8-4 progress_1.2.2 bit_4.0.5
#> [43] htmlwidgets_1.5.4 httr_1.4.4 ellipsis_0.3.2
#> [46] pkgconfig_2.0.3 farver_2.1.1 sass_0.4.3
#> [49] dbplyr_2.2.1 locfit_1.5-9.7 utf8_1.2.2
#> [52] tidyselect_1.2.0 labeling_0.4.2 munsell_0.5.0
#> [55] cellranger_1.1.0 tools_4.2.2 cachem_1.0.6
#> [58] cli_3.4.1 generics_0.1.3 RSQLite_2.2.18
#> [61] broom_1.0.1 evaluate_0.18 fastmap_1.1.0
#> [64] yaml_2.3.6 bit64_4.0.5 fs_1.5.2
#> [67] KEGGREST_1.38.0 xml2_1.3.3 compiler_4.2.2
#> [70] rstudioapi_0.14 filelock_1.0.2 curl_4.3.3
#> [73] png_0.1-7 reprex_2.0.2 geneplotter_1.76.0
#> [76] bslib_0.4.1 stringi_1.7.8 highr_0.9
#> [79] lattice_0.20-45 Matrix_1.5-3 vctrs_0.6.2
#> [82] pillar_1.8.1 lifecycle_1.0.3 jquerylib_0.1.4
#> [85] data.table_1.14.6 bitops_1.0-7 patchwork_1.1.2
#> [88] R6_2.5.1 bookdown_0.33 codetools_0.2-18
#> [91] assertthat_0.2.1 rjson_0.2.21 withr_2.5.0
#> [94] GenomeInfoDbData_1.2.9 parallel_4.2.2 hms_1.1.2
#> [97] rmarkdown_2.18 googledrive_2.0.0 lubridate_1.9.0