Section 2 Intro
In this section we analyze paralogs behavior during zebrafish development. We
aim to describe paralogs behavior with particular focus on paralogs residing in
macromolecular complexes. First of all we need to load some packages and
functions. The proteomicstools
, paralogstools
, spatialR
are some home made
packages with annotated functions useful for easily analyze protein complexes
and paralogs genes. Most of the functions in these package should be reasonably
documented. In any case the original purpose of these package is for internal
use.
The package are also build for easily communicate between each other. In this
workflow we will use some of the functions present here. If you are working on
a docker container you don’t need to install them. In case you want to install
them, sources are available inside the package
directory.
We load them into our workspace. Plus some other packages and the
complexes_function.R
script, that contains useful functions for our analyes.
library(proteomicstools)
library(paralogstools)
library(reshape2)
library(ggplot2)
library(dplyr)
library(spatialR)
library(gridExtra)
library(Cairo)
library(ggrepel)
library(RColorBrewer)
set.seed(123)
source("../ComplexScript/complexes_function.R")
2.1 Zebrafish Paralogs Percentage
For starting we calculate some statistics regarding Zebrafish paralogs, in particular, how many genes in the Zebrafish genome have paralogs ? This is an important question to ask before carrying out the rest of the analysis. We first read the paralog file, and remove duplicated entries. We then select only paralog genes that are also protein coding genes.
#Load the Zebrafish Paralogs file
<- read.delim("../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt",sep = "\t",header = T)
Paralogs <- unique(Paralogs[,c(1,2)])
Paralogs
#Only protein conding genes
<- read.delim("../Data/Paralogs/drerio_ENSEMBL_type_v102.txt",sep = "\t",header = T)
Protein.coding <- Protein.coding[Protein.coding$Transcript.type=="protein_coding",1]
Protein.coding <- unique(Protein.coding) Protein.coding
And then we apply the filters to select only protein coding genes, and count how many of them have paralogs in the genome.
<- Paralogs %>%
Paralogs filter(Gene.stable.ID %in% Protein.coding) %>%
group_by(Gene.stable.ID) %>%
summarize(N.Paralogs=sum(Zebrafish.paralogue.gene.stable.ID!=""))
Now we can simply calculates the proportion of genes that have paralogs in the genome, by simply dividing.
<- sum(Paralogs$N.Paralogs>0)/nrow(Paralogs)
drerio.paralogs.percentage drerio.paralogs.percentage
## [1] 0.7068228
2.2 Load RNA-Seq data
We first start by loading the RNA-Seq data from (White et al. 2017). This
dataset is a embryo-wise RNAseq analysis of 18 time points during zebrafish
embryo development. We will use some functions from the proteomicstools
package. For more information on the SetExperimentFromFile
function type
?SetExperimentFromFile
. The ExperimentMatrix
object is an object that
collect together all the information about the data. It is an easy way of
storing expression values.
<- "../Data/RNA-seqData/DrerioDevelopmentStage.csv"
RNASeqFileName <- SetExperimentFromFile(RNASeqFileName,exprcol = 3:20,ids = "Gene.ID")
RNASeqData
#Select only trascript where the profile has at least 1TPM in all stages
<- assay(RNASeqData) x
We also defined the paralogs code that we will use throughout the work.
<- c("#e6d5b8","#0278ae") ParalogsColor
2.3 Load Complexes
In case can be possible that we need to update our complex definition file with
zebrafish entries. This is done by matching H.sapiens complexes by orthology
with Zebrafish using the biomaRt resource. In case it can prompt an error saying
that our complex definition has already a label with that name. It means that
there is already a column called Drerio_Genes
, and this means that the complex
file is updated. We then load the Complexes definition via ComplexSet
functions from proteomicstools
package.
# Load Protein Complexes and Data ####
UpdateComplexFile(organism = "drerio","ENSEMBL",label = "Drerio_Genes")
In our case the protein Complexes file is already update so we can simply run
the following code. ComplexSet
creates an S4 Object that stores information
from the different protein complexes.
<- ComplexSet(organism = "drerio",identifier = "ENSEMBL",grepOnCol = F) DrerioComplex
2.4 Load Paralogs
As a first instance we load the paralogs that we are going to use thorught the study. Here we load paralogs for Homo sapiens, Danio rerio, Rattus norvegicus and Mus musculus.
<- "../Data/Paralogs/hsapiens_SYMBOL_paralogs_v102.txt"
ParalogsHsFile <- "../Data/Paralogs/mmusculus_SYMBOL_paralogs_v102.txt"
ParalogsMmFile <- "../Data/Paralogs/rnorvegicus_SYMBOL_paralogs_v102.txt"
ParalogsRnFile <- "../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt"
ParalogsDrFile
<- ParalogsSet(organism = "hsapiens",identifier = "ENSEMBL",filename = ParalogsHsFile)
ParalogsHs <- ParalogsHs@data ParalogsHs.df
Now we load the paralogs file with the ParalogsSet
function from the
paralogstools
package. To have information about how this function works type
?ParalogsSet
. This will create an S4 Object of class ParalogsSet
that will
allow to more easily work with paralogs genes. The general Paralogs file is
pretty big, so it takes a little bit to load completely.
#Get Danio rerio Paralogs <-
<- "../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt"
paralogsfile <- ParalogsSet(organism = "drerio",identifier = "ENSEMBL",filename = paralogsfile) Paralogs
Now in order to work more easily we use the function CleanParalogs
to remove
duplicated paralogs couples es. (Gene1 | Gene2 and Gene2|Gene1). Since the
file is pretty big and the function is not optmized at best, this will take a
couple of minutes this is due to the fact that the paralogs file is pretty huge
~ 1.1Gb.
<- CleanParalogs(Paralogs) Paralogs
Now we create two different variables. One is SubunitsParalogs
it will return
paralog pairs were both paralogs are residing in the same protein complex. By
then removing them (with the Remove
function) we create the
OtherParalogs
variable that represent all the other paralogs, so paralogs that
are not part of a protein complex.
#Paralogs of Protein Complex Subunits
<- ParalogsInComplex(Paralogs,(DrerioComplex))
SubunitsParalogs <- Remove(Paralogs,GetIds(SubunitsParalogs)) OtherParalogs
2.5 Are paralogs driving variability during development?
In this first part of the code we analyze the variability of paralog genes
during zebrafish development. We characterize the variability of genes that have
paralogs against genes that do not have paralogs in the genome. We address their
coefficient of variations defined as sd(exprs)/mean(exprs)
. We express
variability as the coefficient of variation of each genes across 18 embryo time
point. The function hasParalogs
is used to assess easily if a specific genes
possess paralogs or not.
#Add Ids
<- as.data.frame(assay(RNASeqData))
RNASeqData.exprs $ids <- rownames(RNASeqData)
RNASeqData.exprs
#Add Paralogs
$hasParalogs <- hasParalogs(Paralogs,RNASeqData.exprs$ids)
RNASeqData.exprs
#Coeff.Variation
$coeff.var <- apply(RNASeqData.exprs[,1:18],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
RNASeqData.exprs
#Put GeneName annotation
<- spatialR::Annotate(RNASeqData.exprs,idcol = "ids","Dr",annot = "SYMBOL")
RNASeqData.exprs $SYMBOL <- toupper(RNASeqData.exprs$SYMBOL)
RNASeqData.exprs
#WilcoxTest
<- wilcox.test(RNASeqData.exprs$coeff.var[RNASeqData.exprs$hasParalogs],
wx.test $coeff.var[!RNASeqData.exprs$hasParalogs]) RNASeqData.exprs
And we plot and save the results.
pdf("../out/plots/p01RNA.pdf",height = 5,width = 5)
<- ggplot(data=RNASeqData.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) +
p01RNA scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() +
ggtitle("Development",subtitle = paste("Wilcox pvalue",format(wx.test$p.value,digits = 3))) +
xlab(paste("Coefficient of Variation [\u03C3/mean (Gene Expression)]"))
p01RNAdev.off()
## png
## 2
#Print
p01RNA
2.5.1 Percentage of Paralogs in our dataset
In the top section, we saw the general distribution of
protein conding genes that have paralogs. Here using the function
SimplePiePlot
we calculate the percentage of genes that has paralogs in our
considered dataset, using the annotation in hasParalogs
. We plot and save it.
pdf("../out/plots/pie.RNA.pdf",height = 2.10,width = 5)
<- SimplePiePlot(RNASeqData.exprs$hasParalogs) +
pie.RNA theme(title = element_text(size = 7),
legend.position = "bottom")
pie.RNAdev.off()
## png
## 2
pie.RNA
2.6 Correlation of paralogs
As a first analysis we can run the correlation values of all the paralog pairs
along the 18 stages of the RNASeq data. (Also here it will take a while). The
parameter nmin=5
means that along the RNASeqData
ExperimentMatrix, we
maintain only pairs for which we have at least 5 same complete observation (To
have a reliable quantification of the correlation).
# Correlation of Paralogs Subunits, Others, All paralogs ####
<- Correlate(Paralogs,RNASeqData,nmin=5) AllParalogsCorrelations
Occasionally for some comparison a warning message can appear: the standard deviation is zero, this can occur when for some genes the transcriptome quantification is the same accross the all the quantified time-points and so the standard deviation for that gene is 0.
An example:
<- assay(RNASeqData)
x "ENSDARG00000102296",] x[
## zygote cleavage.2.cell
## NA NA
## blastula.128.cell blastula.1k.cell
## 2 2
## blastula.dome gastrula.50..epiboly
## 2 NA
## gastrula.shield gastrula.75..epiboly
## NA NA
## segmentation.1.4.somites segmentation.14.19.somites
## NA NA
## segmentation.20.25.somites pharyngula.prim.5
## NA NA
## pharyngula.prim.15 pharyngula.prim.25
## NA NA
## hatching.long.pec larval.protruding.mouth
## NA NA
## larval.day.4 larval.day.5
## 2 2
Here all the quantified fractions have a value of 2. So when it tries to calculate standard deviations it actually reflect in having a value of 0. And we conversely do the same for paralogs localized inside protein complexes and all the rest of the paralogs.
<- Correlate(SubunitsParalogs,RNASeqData,nmin=5)
ParalogSubunitsCorrelations <- Correlate(OtherParalogs,RNASeqData,nmin=5) OtherParalogsCorrelations
Just for sake of practice we save the workspace, in case we will need for future analysis.
save.image("../Markdown/workspaces/01-Paralogs_RNASeqData.RData")
Now we can go into more detail into the proportion of correlating and
anticorrelating paralogs. We define anticorrelating paralogs as having a
correlation values of < 0
, conversely correlating paralogs are define as
having a correlation value of >=0
. We can plot the general proportion of
anticorrelating paralogs agains all others. We can see that there is a
significant proportions (not so huge but relevant in any way) of anticorrelating
gene pairs.
#Anticorrelating Paralogs
<- nrow(AllParalogsCorrelations[AllParalogsCorrelations$corr<(0),])
Anticorrelating #Prepare data for barplot
<- as.data.frame(c(nrow(AllParalogsCorrelations)-Anticorrelating,Anticorrelating))
Proportions colnames(Proportions) <- "n"
$n <- Proportions$n/sum(Proportions$n)
Proportions$type <- c("Others","Not-correlated")
Proportions$type <- factor(Proportions$type,levels = c("Others","Not-correlated"))
Proportions
$type <- rep("Paralogs in Complexes",nrow(ParalogSubunitsCorrelations))
ParalogSubunitsCorrelations$type <- rep("Other Paralogs",nrow(OtherParalogsCorrelations)) OtherParalogsCorrelations
2.6.1 Density of correlation profiles
A better way of visualizing the distribution is via the function
ParalogsDensityPlot
. It allows to easily summarize the information coming from
the correlation of paralog pairs, in a plot. It will show the density
distribution of the correlation of ALL paralogs pairs, the color code indicates
different correlation threshold. The barplot on the left actually reflects the
count of the data.
pdf("../out/plots/p03RNA.pdf",width = 5,height = 5)
<- ParalogsDensityPlot(AllParalogsCorrelations$corr,
p03RNA title = "Correlation of Paralog Pairs - Development",
x_scale = c(-1,1))
$p.all
p03RNAdev.off()
## png
## 2
$p.all p03RNA
2.6.2 Correlation and sequence Identity
We can relate the proportion of anticorrelating paralogs to their sequence
identity. In this chunk of code here, we fit a GAM with the mean paralog
sequence identity and their correlations. From the results we see that while
paralog pairs that are not taking part in the formation of the same complex,
show a strong relationship between reciprocal identity and co-regulation. On the
contrary paralog pairs that are also protein complex members show a weaker, and
in this case even negative, relationship.
The last step of this code also gives the table AllCorrelations
, basically a
data.frame
with all the paralog pairs annotated if they are complex subuits or
not.
#Create a unique dataframe
<- rbind(ParalogSubunitsCorrelations,OtherParalogsCorrelations)
AllCorrelations $p.1 <- as.character(AllCorrelations$p.1)
AllCorrelations$p.2 <- as.character(AllCorrelations$p.2) AllCorrelations
And here we visualize the results using a GAM.
<- c("#D6604D","#4393C3")
ComplexColor pdf("../out/plots/p05RNA.pdf",width = 5,height = 5)
<- ggplot(data=AllCorrelations) +
p05RNA geom_smooth(data=AllCorrelations,aes(x=mean.identity,y=corr,color=type)) +
theme_bw() + scale_color_manual(values = ComplexColor) +
theme(legend.position = "top") +
xlab("mean identity") +
ylab("correlation") + labs(color="")
#Plot
p05RNAdev.off()
## png
## 2
p05RNA
We can also run a correlation test on the distribution to assess if there is a
significant correlation between sequence identity and correlation of paralog
pairs. We do this for both paralogs in protein complexes, and paralogs that are
not inside the same complex, here names as Other Paralogs
.
#Correlation tests
<- cor.test(AllCorrelations[AllCorrelations$type=="Other Paralogs","corr"],AllCorrelations[AllCorrelations$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations[AllCorrelations$type == "Other Paralogs", "corr"] and AllCorrelations[AllCorrelations$type == "Other Paralogs", "mean.identity"]
## t = 63.33, df = 159149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1519896 0.1615740
## sample estimates:
## cor
## 0.1567855
<- cor.test(AllCorrelations[!AllCorrelations$type=="Other Paralogs","corr"],AllCorrelations[!AllCorrelations$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations[!AllCorrelations$type == "Other Paralogs", "corr"] and AllCorrelations[!AllCorrelations$type == "Other Paralogs", "mean.identity"]
## t = -4.1389, df = 1487, p-value = 3.686e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.15666970 -0.05622364
## sample estimates:
## cor
## -0.1067189
In order to test our claim, we can also sample the same number of paralog
pairs from the two distributions. We get the number of pairs in the comp.pairs
variable, and we sample from the Other Paralogs
distribution. We bind again
the dataset together, and show them in a plot.
<- AllCorrelations %>% filter(type!="Other Paralogs") %>% nrow()
comp.pairs #Sample
<- sample((rownames(AllCorrelations %>%
s filter(type=="Other Paralogs"))),comp.pairs)
#Bind the dataset
<- rbind(AllCorrelations %>% filter(type!="Other Paralogs"),
AllCorr.Samp
AllCorrelations[s,])
#Show them on plot
ggplot(data=AllCorr.Samp) +
geom_point(aes(x=mean.identity,y=corr,color=type),alpha=0.2) +
geom_smooth(data=AllCorrelations,aes(x=mean.identity,y=corr,color=type)) +
theme_bw() + scale_color_manual(values = ComplexColor) +
theme(legend.position = "top") +
xlab("mean identity") +
ylab("correlation") + labs(color="")
We can subset only for paralogs residing in the same complex, and annotate with
gene Name. Using the function Annotate
which uses the Annotation.db resource.
We also define here variable paralog pairs, that represent the bottom 25% of
the distribution of correlation of paralog subunits.
<- AllCorrelations[AllCorrelations$type=="Paralogs in Complexes",]
AllCorr.Compl <- spatialR::Annotate(AllCorr.Compl,organism = "Dr",idcol="p.1",annot="SYMBOL")
AllCorr.Compl <- spatialR::Annotate(AllCorr.Compl,organism = "Dr",idcol="p.2",annot="SYMBOL")
AllCorr.Compl
#Change colnames
colnames(AllCorr.Compl)[c(7:8)] <- c("SYMBOL.x","SYMBOL.y")
#Select the most variable pairs
<- quantile(AllCorr.Compl$corr)["25%"]
quant
#Variable pairs
<- unlist(AllCorr.Compl[AllCorr.Compl$corr<=quant,c("SYMBOL.x","SYMBOL.y")])
variable.pairs <- unique(unlist((unique(variable.pairs))))
variable.pairs
#Extract gene set representing paralogs
<- unique(unlist(AllCorr.Compl[,c("SYMBOL.x","SYMBOL.y")])) geneSet
2.6.3 Variable Pairs in Protien Complexes GOenrich.
We run a GOEnrichement analysis on the variable paralog pair subunits againts
all the rest. The GOEnrichment
function is called from the spatialR
package,
and uses the topGO
package from bioconductor to perform GOenrichment analysis,
using Fisher test. Then a second step is applied in this case. The SummarizeGO
function is used to summarize the results of the data and uses the
rrvgo
package.
<- spatialR::GOEnrichment(geneSet,variable.pairs,
GOEnrich.var.Pairs.dev ontology = "BP",species = "Dr",topnode = 200)
<- spatialR::SummarizeGO(GOEnrich.var.Pairs.dev,plot = "scatter",thr = 0.7) GOEnrich.var.Pairs.Summ
And we save the plot.
pdf("../out/plots/p06RNA.pdf",width = 5,height = 5)
<- GOEnrich.var.Pairs.Summ$plot
p06RNA
p06RNAdev.off()
## png
## 2
p06RNA
In this chunks here we also define very similar paralogs that are paralogs that have a mean sequence identity of >= 50%.
$regulation <- ifelse(AllCorr.Compl$corr<=quant,"HighlyVar","Other")
AllCorr.Compl$similar <- ifelse(AllCorr.Compl$mean.identity>=50,"HighlySimilar","NotSimilar")
AllCorr.Compl
#FisherTest
<- table(AllCorr.Compl[,c("similar","regulation")])
Fisher.SimDF fisher.test(Fisher.SimDF)
##
## Fisher's Exact Test for Count Data
##
## data: Fisher.SimDF
## p-value = 0.03131
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.022917 1.690685
## sample estimates:
## odds ratio
## 1.316132
Plot the output.
pdf("../out/plots/p07RNA.pdf",width = 5,height = 5)
<- ggplot(data=AllCorr.Compl) +
p07RNA geom_density(aes(x=mean.identity,fill=regulation),alpha=0.3)
p07RNAdev.off()
## png
## 2
wilcox.test(AllCorr.Compl[AllCorr.Compl$regulation=="HighlyVar","mean.identity"],
$regulation=="Other","mean.identity"]) AllCorr.Compl[AllCorr.Compl
##
## Wilcoxon rank sum test with continuity correction
##
## data: AllCorr.Compl[AllCorr.Compl$regulation == "HighlyVar", "mean.identity"] and AllCorr.Compl[AllCorr.Compl$regulation == "Other", "mean.identity"]
## W = 225368, p-value = 0.01652
## alternative hypothesis: true location shift is not equal to 0
We observe in general no significant differences in correlations between similar parlaogs and other paralogs inside protein complexes, althought we can observe a clear peak localized in 75% sequence identity for highly variable paralogs.
2.6.4 ING4 and ING5B
Here we show some examples of differentially regulated paralogs. ING4 and ING5B:
<- brewer.pal("BrBG",n=11)[c(4,8)] colors
$Table <- GOEnrich.var.Pairs.dev$table
GOEnrich.var.Pairs.dev<- ShowGenesGOTerms(GOEnrich.var.Pairs.dev,variable.pairs)
LIST
pdf("../out/plots/p17RNA.pdf",height = 5,width = 5)
<- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
profile_plt proteins = c("ING4","ING5B"),idcol = "SYMBOL",foldChange = T)
<- profile_plt[[1]] +
p17RNA theme(axis.text.x = element_blank(),legend.position = "bottom",
legend.title = element_blank(),plot.title = element_blank(),
legend.text = element_text(size = 10)) + xlab("Dev. Stages") +
scale_color_manual(values = colors)
p17RNAdev.off()
## png
## 2
p17RNA
2.6.5 Smarcd1 and Smarcd2
We do the same for SMARCD1 and SMACD2
pdf("../out/plots/p18RNA.pdf",height = 5,width = 5)
<- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
profile_plt proteins = c("SMARCD1","SMARCD2"),idcol = "SYMBOL",foldChange = T)
<- profile_plt[[1]] +
p18RNA theme(axis.text.x = element_blank(),legend.position = "bottom",
legend.title = element_blank(),plot.title = element_blank(),
legend.text = element_text(size = 10)) + xlab("Dev. Stages") +
scale_color_manual(values = colors)
p18RNAdev.off()
## png
## 2
p18RNA
2.6.6 TMED4 and TMED9
TMED4 and TMED9 part of the COPI complex
pdf("../out/plots/p20RNA.pdf",height = 5,width = 5)
<- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
profile_plt proteins = c("TMED4","TMED9"),idcol = "SYMBOL",foldChange = T)
<- profile_plt[[1]] +
p20RNA theme(axis.text.x = element_blank(),legend.position = "bottom",
legend.title = element_blank(),plot.title = element_blank(),
legend.text = element_text(size = 10)) + xlab("Dev. Stages") +
scale_color_manual(values = colors)
p20RNAdev.off()
## png
## 2
p20RNA
2.6.7 Sec23a and Sec23b
And SEC23A and SEC23B inside the COPII complex aswell.
pdf("../out/plots/p19RNA.pdf",height = 5,width = 5)
<- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
profile_plt proteins = c("SEC23A","SEC23B"),idcol = "SYMBOL",foldChange = T)
<- profile_plt[[1]] +
p19RNA theme(axis.text.x = element_blank(),legend.position = "bottom",
legend.title = element_blank(),plot.title = element_blank(),
legend.text = element_text(size = 10)) + xlab("Dev. Stages") +
scale_color_manual(values = colors)
p19RNAdev.off()
## png
## 2
p19RNA
2.7 Function of divergent, but similar paralogs
By selecting for variable AND similar paralogs, we can assess the functions carryed on by those specific genes, running a GOanalysis against all the rest.
<- AllCorr.Compl[AllCorr.Compl$regulation=="HighlyVar" & AllCorr.Compl$similar=="HighlySimilar",c("SYMBOL.x","SYMBOL.y")]
keep <- unique(as.character(unlist(keep)))
keep
<- unique(unlist(strsplit(unlist(keep),split = ';')))
keep
#Table
<- spatialR::GOEnrichment(geneSet,keep,ontology = "BP",species = "Dr")
GOEnrich.var.Pairs.dev.similar <- spatialR::SummarizeGO(GOEnrich.var.Pairs.dev.similar,plot = "scatter") GOEnrich.var.Pairs.Summ.similar
pdf("../out/plots/p08RNA.pdf",width = 5,height = 5)
<- GOEnrich.var.Pairs.Summ.similar$plot
p08RNA dev.off()
## png
## 2
p08RNA
2.7.1 Subunits anticorrelations vs Paralogs anticorrelation
In this chuck of code here we aim to address if the higher proportion of
anti-correlated paralog subunit pairs is due to the fact that those paralogs are
just protein complex subunits, or due a specific characteristics of paralogs
that reside inside protein complexes. Here first we annotate our list of
protein complexes DrerioComplex
with the expression values from the
RNASeqData
using the GetExpressions
function. Selecting a minimum number of
quantied subunits in each complex =5
. All complexes that have less than 5
subunits quantified are discarded from this analysis.
Then for every complex we calculate the correlation values between all possible
subunit pairs, and then we check wether each pair is a paralog pair with the
areParalogs
function.
# Subunits Anticorrelations vs Paralogs Anticorrelations, is anticorrelations inside subunits or paralogs? ####
<- GetExpressions(DrerioComplex,RNASeqData,min=5)
DrerioComplex <- subunits(DrerioComplex)
Subunits <- Subunits[Subunits %in% GetIds(SubunitsParalogs)]
Subunits
<- list()
CorrList for (C in names(DrerioComplex))
{message(C)
<- cor(DrerioComplex[[C]])
Correlations $are.paralogs <- apply(Correlations[,c(1,2)],1,function(x){areParalogs(x[1],x[2],Paralogs)})
Correlations<- Correlations[complete.cases(Correlations),]
Correlations <- Correlations
CorrList[[C]]
}
#
<- do.call(rbind,CorrList) CorrList
We can plot the result in a density plot.
pdf("../out/plots/p09RNA.pdf",width = 5,height = 5)
<- ggplot(data=CorrList) + geom_density(aes(x=corr,color=are.paralogs),alpha=0.5,lwd=1,fill=NA) +
p09RNA theme_minimal() + theme(legend.position = "top") + ggtitle("Subunits:") +
xlab("Correlations")
dev.off()
## png
## 2
p09RNA
And also run a wilcox-test. There is a slightly evidence that the anticorrelating behaviour is a characteristic inside of paralog pairs that reside inside protein complexes. From these results seems that subunits pairs that are paralogs are generally less co-regulated compared to other subunit pairs in the same complexes.
wilcox.test(CorrList$corr[CorrList$are.paralogs==T],CorrList$corr[CorrList$are.paralogs==F])
##
## Wilcoxon rank sum test with continuity correction
##
## data: CorrList$corr[CorrList$are.paralogs == T] and CorrList$corr[CorrList$are.paralogs == F]
## W = 5616205, p-value = 0.03843
## alternative hypothesis: true location shift is not equal to 0
2.8 Complexes Analysis
Following the statement that antagonizing paralogs regulation seems to be particularly localized between paralogs of protein complex subunits, we carried on the characterization of the variation in composition of macromolecular complexes in order to determine the possible roles played by their paralog subunits. We used the median correlation between all the possible subunit pairs in order to classify protein complexes co expression as stable or variable. First we create a RandomComplex list and annotate with the values coming from the RNASeq data. The we calculate the correlations that we interpret as stability for random and not random complexes.
set.seed(123)
# ComplexAnalysis on 18Stages RNASeq Data ####
<- RandomComplexSet(DrerioComplex,rownames(RNASeqData))
RandomComplex <- GetExpressions(RandomComplex,RNASeqData,min=5)
RandomComplex
#Calculate Correlations
<- cor(DrerioComplex)
ComplexStability <- cor(RandomComplex) RandComplexStability
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero
#Sorted Complex , sort complex by stability
<- DrerioComplex[[names(sort(ComplexStability,decreasing = T))]]
DrerioComplex
#Get Ordered subunits by stability
<- subunits(DrerioComplex)
DrerioSubunits write.table(DrerioSubunits,"../out/RNASeq_Drerio_SubunitsCorr.txt",sep = "\t")
As a first point, we observe, as expected, high value of correlations between protein complex members (p-value < 2.2e-16 , wilcox test) compared to the distribution of correlation of random proteins.
#Create ggplotDataframe with Density distributions.
<- data.frame(corr=c(ComplexStability,RandComplexStability))
DF $type <- rep(c("Complexes"),nrow(DF))
DF$type[grep("Random",rownames(DF))] <- "Random Complexes"
DF
#wilcox test
<- wilcox.test(DF$corr[DF$type=="Complexes"],DF$corr[DF$type=="Random Complexes"])
wilc.test
#Stable and Variable threshold
<- quantile(ComplexStability)["25%"]
variable <- quantile(ComplexStability)["75%"] stable
We then highlight some specific protein complexes that varies their
stoichiometry composition during development. At first we select some specific
complex of interest and we highlight them on the density distribution. We use
approxfunc
to map back the y point on the plot for the different complexes.
$complex <- rownames(DF)
DF
#Create density functions
<- approxfun(density(DF[DF$type=="Complexes","corr"]))
densfunc
<- c("cytoplasmic ribosomal large subunit","cytoplasmic ribosomal small subunit","26S Proteasome","COPII","cytoplasmic dynein complex","SNARE complex","RSC")
InterestingSubset
#Density Plot
#Plot Distribution of Correlation of complexes
<- ggplot(data=DF) +
p10RNA geom_density(aes(x=corr,fill=type),alpha = 0.3,size=0.5,color=NA) +
geom_point(data=DF %>% filter(complex %in% InterestingSubset),
aes(x=corr,y=densfunc(corr)),color="red") +
theme_minimal() + ylab('density') +
scale_fill_manual(values=c("darkgreen","grey")) +
geom_text_repel(data=DF %>% filter(complex %in% InterestingSubset),
aes(x=corr,y=densfunc(corr),label=complex),
nudge_y = 0.5) + xlab("correlations") + labs(fill="subset")
pdf("../out/plots/p10RNA.pdf",width = 5,height = 5)
p10RNAdev.off()
## png
## 2
p10RNA
The plot show show the distribution of the correlations for some specific complexes, compared to the distribution of correlations for random complexes.
We then classify protein complexes as stable or variable based on their subunits correlation values along the 18 time points. Variable complexes, are localized in the lower- 25% quantile, and Stable ones are characterized in the > 75% quantile. We then check for paralogs content in each stable and variable complexes. Highlighting the fact that variable complexes possess in general also a higher proportion of subunits that have paralogs.
## Divide in stable and variable Complexes
<- ComplexStability[ComplexStability < quantile(ComplexStability)["25%"]]
VariableComplexes <- ComplexStability[ComplexStability > quantile(ComplexStability)["75%"]]
StableComplexes
<- DrerioComplex[[names(VariableComplexes)]]
VariableComplexes.Set <- DrerioComplex[[names(StableComplexes)]]
StableComplexes.Set
#Paralogs inside variable complexes
<- subunits(VariableComplexes.Set)
VariableComplexes.Set.Members <- hasParalogs(Paralogs,VariableComplexes.Set.Members)
VariableComplexes.Set.Members
#Paralogs inside stable Complexes
<- subunits(StableComplexes.Set)
StableComplexes.Set.Members <- hasParalogs(Paralogs,StableComplexes.Set.Members)
StableComplexes.Set.Members
#Proportion
<- t(rbind(table(VariableComplexes.Set.Members),table(StableComplexes.Set.Members)))
FisherDF colnames(FisherDF) <- c("Variable","Stable")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
<- fisher.test(FisherDF)
fisher <- apply(FisherDF,2,function(x)x/sum(x))
FisherDF <- melt(FisherDF);FisherDF$Var2 <- factor(FisherDF$Var2,levels = c("Stable","Variable")) FisherDF
We show the results with a stacked barplot
pdf("../out/plots/p11RNA.pdf",width = 5,height = 5)
<- ggplot(data=FisherDF) + geom_bar(aes(x=Var2,y=value,fill=Var1),stat = "identity") + theme_minimal() +
p11RNA ggtitle("Proportion of Paralogs",subtitle = fisher$p.value) +
ylab("Proportion of Subunits")
p11RNAdev.off()
## png
## 2
p11RNA
In this chunck below we run the correlation between each subunits against the
other, this using the SubunitsCoregulation
and the GetStability
. To relate
to the protein complex, the “stability,” or correlation in this case, for each
subunits as the average correlation with all the other subunits in the complex.
This results in the density plot below. Similar to what we already done for
protein complexes but at a subunit-wise level.
# Subunits Stability in Protein Complexes ####
<- SubunitsCoregulation(DrerioComplex)
DrerioComplex
<- GetStability(DrerioComplex)
StabilityDF <- StabilityDF[complete.cases(StabilityDF),]
StabilityDF colnames(StabilityDF)[1] <- "complex"
#Quantiles
<- quantile(StabilityDF$cv,na.rm = T,c(0.25,0.75))
Quant $hasParalogs <- hasParalogs(Paralogs,StabilityDF$subunit)
StabilityDF
<- StabilityDF[StabilityDF$cv<Quant["25%"],];HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyVariableSub <- StabilityDF[StabilityDF$cv>Quant["75%"],];HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))
HighlyStableSub
<- rbind(HighlyVariableSub,HighlyStableSub)
StableDF write.table(StabilityDF,"../out/SubunitsStability.txt",sep = "\t",row.names = F)
Plotting the results then shows, the median values of correlation of each subunits with its complex. The density plot really shows that the distribution is skewed toward high values.
pdf("../out/plots/p12RNA.pdf",width = 5,height = 5)
#Plot density of protein subunits regulation
#Plot Distribution of Correlation of complexes
<- ggplot(data=StabilityDF) + geom_density(aes(x=cv),alpha = 0.3,size=0.5) + theme_minimal() +
p12RNA scale_fill_manual(values=c("darkgreen","grey")) + scale_color_manual(values=c("darkgreen")) + xlab("Correlations") +
ggtitle("Subunits correlations",subtitle = "Mean correlation within complex") +
theme(legend.position = "none") + geom_vline(aes(xintercept = stable),linetype="dashed") +
geom_vline(aes(xintercept = variable),linetype="dashed")
p12RNAdev.off()
## png
## 2
p12RNA
Also here in this chunks of code, protein complexes are grouped by their size in
quantified proteins, and their proportion of paralogs. We then merge back this
information with the Stability
data frame, to obtain also the correlation
values for each complex.
#ComplexStability BarPlot
<- data.frame(complex=names(ComplexStability),corr=ComplexStability)
ComplexStability $complex <- factor(ComplexStability$complex,levels = ComplexStability[order(ComplexStability$corr,decreasing = T),"complex"])
ComplexStability
#Stability, paralog proportions and size.
$size <- sapply(StabilityDF$complex,function(x)nrow(StabilityDF[StabilityDF$complex==x,]))
StabilityDF<- StabilityDF %>% group_by(complex) %>%
Stability ::summarize(prop =sum(hasParalogs)/n(),size=n())
dplyr<- as.data.frame(Stability)
Stability <- merge(ComplexStability,Stability,by.x="complex",by.y="complex")
Stability colnames(Stability) <- c("complex","correlation","Paralogs.prop","size")
pdf("../out/plots/p13RNA.pdf",width = 5,height = 5)
<- plot(Stability[,2:4],pch=19,cex=0.7)
p13RNA p13RNA
## NULL
dev.off()
## png
## 2
p13RNA
## NULL
We can also visualize this data in a barplot format. Counting the stable and variable pairs, and the different proportion of subunits that have paralogs.
#Proportion
<- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
FisherDF colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
<- fisher.test(FisherDF)
Fisher
#BarPlot
$type <- factor(StableDF$type,levels = c("HighlyVar","HighlyStable"))
StableDFggplot(data=StableDF) + geom_bar(aes(x=type,fill=hasParalogs),position = "fill") + ggtitle("Subunits") + theme_minimal()
#Run pearson correlation test
cor.test(Stability$correlation,Stability$Paralogs.prop)
##
## Pearson's product-moment correlation
##
## data: Stability$correlation and Stability$Paralogs.prop
## t = -6.6419, df = 237, p-value = 2.099e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4980028 -0.2834975
## sample estimates:
## cor
## -0.3961415
But a better way to visualize this is problably with a scatterplot. This part of code below plot then the variability values for each complex,expressed as 1-R , against the paralog content of each complex.
pdf("../out/plots/p14RNA.pdf",width = 5,height = 5)
#ComplexBarPlot
<-ggplot(data=Stability) + geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity",size=.9) +
p14RNA theme_bw()+
geom_smooth(aes(x=1-correlation,y=Paralogs.prop),method = "lm",color="orange") +
ylab("Proportion of Paralogs") + xlab("complex variability (1-R)") +
ylim(0,1)
p14RNA
## Warning: Removed 7 rows containing missing values (geom_smooth).
#Plot
dev.off()
## png
## 2
p14RNA
## Warning: Removed 7 rows containing missing values (geom_smooth).
The plot highlight clearly a relationship between paralogs content and complex
variability. In addition to that, we can visualize the genral relationship
between subunits that have paralogs and subunits that don’t. Since the median
correlation value of each subunit with its complex is stored in the
StabilityDF
dataframe, we can simply plot the density values similar to what
we have dove above, but highlighting subunits that have paralogs.
<- wilcox.test(StabilityDF$cv[StabilityDF$hasParalogs],
wx.test $cv[!StabilityDF$hasParalogs])
StabilityDF
pdf("../out/plots/p15RNA.pdf",width = 5,height = 5)
<- ggplot(data=StabilityDF) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) +
p15RNA scale_fill_manual(values = ParalogsColor) + theme_bw() +
xlab("correlation within complex") +
ggtitle("Within Compl. Correlations - Development",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p15RNAdev.off()
## png
## 2
p15RNA
2.8.1 GOEnrichment on variable subunits that have paralogs
Here we run a GOEnrichment analysis on the subunits that have paralogs that are behaving more variably, against all other subunits. What are they carrying on ? In this chunks of code we carry on a GO enrichment analysis of the most variable subunits that have paralogs against all other subunits characterized inside protein complexes.
<- unique(StableDF[StableDF$type=="HighlyVar" & StableDF$hasParalogs==T,"subunit"])
keep <- unique(StabilityDF[,"subunit"])
geneset
#GOEnrichment
<- spatialR::GOEnrichment(geneList = geneset,
GOEnrichment.Variable.Sub.Paralogs interesting = keep,IDtype = "ENSEMBL",ontology = "BP",
species = "Dr")
We can show the general enrichment using the SimpleGOPlot
function.
SimpleGOPlot(GOEnrichment.Variable.Sub.Paralogs$table,showTOP = 30,title = "Variable Paralogs Subunits")
Or summarize the redundancy of the GOTerms using SummarizeGO
.
<- SummarizeGO(x = GOEnrichment.Variable.Sub.Paralogs,thr=0.5) SummGO
pdf("../out/plots/p16RNA.pdf",width = 5,height = 5)
<- SummGO$plot
p16RNA
p16RNAdev.off()
## png
## 2
2.9 Assemble Tables
We prepare here some of the table that we would need to use for the future we organize them with speicific columns merging different dataframes.
<- merge(Stability,StabilityDF,by='complex')
Complexes_Table_Drerio <- Complexes_Table_Drerio[,-8]
Complexes_Table_Drerio colnames(Complexes_Table_Drerio) <- c('complex','complex.coexpression',
'paralogs.fraction','comp.size',
'subunit','subunits.coexpr.with.complex',
'subunit.hasParalogs')
2.10 Save workspace
We save the complete workspace under the name
01-Paralogs_RNASeqData_CompleteWorkspace.RData
this saves all the workspace
that is active while writing this markdown. In this ways should be easy to
access back information coming from the different tables.
save.image("../Markdown/workspaces/01-Paralogs_RNASeqData_CompleteWorkspace.RData")