Section 2 Intro

In this section we analyze paralogs behavior during zebrafish development. We aim to describe paralogs behavior with particular focus on paralogs residing in macromolecular complexes. First of all we need to load some packages and functions. The proteomicstools, paralogstools, spatialR are some home made packages with annotated functions useful for easily analyze protein complexes and paralogs genes. Most of the functions in these package should be reasonably documented. In any case the original purpose of these package is for internal use.

The package are also build for easily communicate between each other. In this workflow we will use some of the functions present here. If you are working on a docker container you don’t need to install them. In case you want to install them, sources are available inside the package directory.

We load them into our workspace. Plus some other packages and the complexes_function.R script, that contains useful functions for our analyes.

library(proteomicstools)
library(paralogstools)
library(reshape2)
library(ggplot2)
library(dplyr)
library(spatialR)
library(gridExtra)
library(Cairo)
library(ggrepel)
library(RColorBrewer)

set.seed(123)
source("../ComplexScript/complexes_function.R")

2.1 Zebrafish Paralogs Percentage

For starting we calculate some statistics regarding Zebrafish paralogs, in particular, how many genes in the Zebrafish genome have paralogs ? This is an important question to ask before carrying out the rest of the analysis. We first read the paralog file, and remove duplicated entries. We then select only paralog genes that are also protein coding genes.

#Load the Zebrafish Paralogs file
Paralogs <- read.delim("../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt",sep = "\t",header = T)
Paralogs <- unique(Paralogs[,c(1,2)])

#Only protein conding genes
Protein.coding <- read.delim("../Data/Paralogs/drerio_ENSEMBL_type_v102.txt",sep = "\t",header = T)
Protein.coding <- Protein.coding[Protein.coding$Transcript.type=="protein_coding",1]
Protein.coding <- unique(Protein.coding)

And then we apply the filters to select only protein coding genes, and count how many of them have paralogs in the genome.

Paralogs <- Paralogs %>% 
            filter(Gene.stable.ID %in% Protein.coding) %>%
            group_by(Gene.stable.ID) %>%
            summarize(N.Paralogs=sum(Zebrafish.paralogue.gene.stable.ID!=""))

Now we can simply calculates the proportion of genes that have paralogs in the genome, by simply dividing.

drerio.paralogs.percentage <- sum(Paralogs$N.Paralogs>0)/nrow(Paralogs)
drerio.paralogs.percentage

## [1] 0.7068228

2.2 Load RNA-Seq data

We first start by loading the RNA-Seq data from (White et al. 2017). This dataset is a embryo-wise RNAseq analysis of 18 time points during zebrafish embryo development. We will use some functions from the proteomicstools package. For more information on the SetExperimentFromFile function type ?SetExperimentFromFile. The ExperimentMatrix object is an object that collect together all the information about the data. It is an easy way of storing expression values.

RNASeqFileName <- "../Data/RNA-seqData/DrerioDevelopmentStage.csv"
RNASeqData <- SetExperimentFromFile(RNASeqFileName,exprcol = 3:20,ids = "Gene.ID")

#Select only trascript where the profile has at least 1TPM in all stages
x <- assay(RNASeqData)

We also defined the paralogs code that we will use throughout the work.

ParalogsColor <- c("#e6d5b8","#0278ae")

2.3 Load Complexes

In case can be possible that we need to update our complex definition file with zebrafish entries. This is done by matching H.sapiens complexes by orthology with Zebrafish using the biomaRt resource. In case it can prompt an error saying that our complex definition has already a label with that name. It means that there is already a column called Drerio_Genes, and this means that the complex file is updated. We then load the Complexes definition via ComplexSet functions from proteomicstools package.

# Load Protein Complexes and Data ####
UpdateComplexFile(organism = "drerio","ENSEMBL",label = "Drerio_Genes")

In our case the protein Complexes file is already update so we can simply run the following code. ComplexSet creates an S4 Object that stores information from the different protein complexes.

DrerioComplex <- ComplexSet(organism = "drerio",identifier = "ENSEMBL",grepOnCol = F)

2.4 Load Paralogs

As a first instance we load the paralogs that we are going to use thorught the study. Here we load paralogs for Homo sapiens, Danio rerio, Rattus norvegicus and Mus musculus.

ParalogsHsFile <-  "../Data/Paralogs/hsapiens_SYMBOL_paralogs_v102.txt"
ParalogsMmFile <-  "../Data/Paralogs/mmusculus_SYMBOL_paralogs_v102.txt"
ParalogsRnFile <-  "../Data/Paralogs/rnorvegicus_SYMBOL_paralogs_v102.txt"
ParalogsDrFile <-  "../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt"

ParalogsHs <- ParalogsSet(organism = "hsapiens",identifier = "ENSEMBL",filename = ParalogsHsFile)
ParalogsHs.df <- ParalogsHs@data

Now we load the paralogs file with the ParalogsSet function from the paralogstools package. To have information about how this function works type ?ParalogsSet. This will create an S4 Object of class ParalogsSet that will allow to more easily work with paralogs genes. The general Paralogs file is pretty big, so it takes a little bit to load completely.

#Get Danio rerio Paralogs <-
paralogsfile <- "../Data/Paralogs/drerio_ENSEMBL_Paralogs_v102.txt"
Paralogs <- ParalogsSet(organism = "drerio",identifier = "ENSEMBL",filename = paralogsfile)

Now in order to work more easily we use the function CleanParalogs to remove duplicated paralogs couples es. (Gene1 | Gene2 and Gene2|Gene1). Since the file is pretty big and the function is not optmized at best, this will take a couple of minutes this is due to the fact that the paralogs file is pretty huge ~ 1.1Gb.

Paralogs <- CleanParalogs(Paralogs)

Now we create two different variables. One is SubunitsParalogs it will return paralog pairs were both paralogs are residing in the same protein complex. By then removing them (with the Remove function) we create the OtherParalogsvariable that represent all the other paralogs, so paralogs that are not part of a protein complex.

#Paralogs of Protein Complex Subunits
SubunitsParalogs <- ParalogsInComplex(Paralogs,(DrerioComplex))
OtherParalogs <- Remove(Paralogs,GetIds(SubunitsParalogs))

2.5 Are paralogs driving variability during development?

In this first part of the code we analyze the variability of paralog genes during zebrafish development. We characterize the variability of genes that have paralogs against genes that do not have paralogs in the genome. We address their coefficient of variations defined as sd(exprs)/mean(exprs). We express variability as the coefficient of variation of each genes across 18 embryo time point. The function hasParalogs is used to assess easily if a specific genes possess paralogs or not.

#Add Ids
RNASeqData.exprs <- as.data.frame(assay(RNASeqData))
RNASeqData.exprs$ids <- rownames(RNASeqData)

#Add Paralogs
RNASeqData.exprs$hasParalogs <- hasParalogs(Paralogs,RNASeqData.exprs$ids)

#Coeff.Variation
RNASeqData.exprs$coeff.var <- apply(RNASeqData.exprs[,1:18],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))

#Put GeneName annotation
RNASeqData.exprs <- spatialR::Annotate(RNASeqData.exprs,idcol = "ids","Dr",annot = "SYMBOL")
RNASeqData.exprs$SYMBOL <- toupper(RNASeqData.exprs$SYMBOL)

#WilcoxTest
wx.test <- wilcox.test(RNASeqData.exprs$coeff.var[RNASeqData.exprs$hasParalogs],
                       RNASeqData.exprs$coeff.var[!RNASeqData.exprs$hasParalogs])

And we plot and save the results.

pdf("../out/plots/p01RNA.pdf",height = 5,width = 5)
p01RNA <- ggplot(data=RNASeqData.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) + 
          scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() + 
          ggtitle("Development",subtitle = paste("Wilcox pvalue",format(wx.test$p.value,digits = 3))) + 
          xlab(paste("Coefficient of Variation [\u03C3/mean (Gene Expression)]"))
p01RNA
dev.off()

## png 
##   2

#Print
p01RNA

2.5.1 Percentage of Paralogs in our dataset

In the top section, we saw the general distribution of protein conding genes that have paralogs. Here using the function SimplePiePlot we calculate the percentage of genes that has paralogs in our considered dataset, using the annotation in hasParalogs. We plot and save it.

pdf("../out/plots/pie.RNA.pdf",height = 2.10,width = 5)
pie.RNA <- SimplePiePlot(RNASeqData.exprs$hasParalogs) + 
           theme(title = element_text(size = 7),
                 legend.position = "bottom")
pie.RNA
dev.off()

## png 
##   2

pie.RNA

2.6 Correlation of paralogs

As a first analysis we can run the correlation values of all the paralog pairs along the 18 stages of the RNASeq data. (Also here it will take a while). The parameter nmin=5 means that along the RNASeqData ExperimentMatrix, we maintain only pairs for which we have at least 5 same complete observation (To have a reliable quantification of the correlation).

# Correlation of Paralogs Subunits, Others, All paralogs ####
AllParalogsCorrelations <- Correlate(Paralogs,RNASeqData,nmin=5)

Occasionally for some comparison a warning message can appear: the standard deviation is zero, this can occur when for some genes the transcriptome quantification is the same accross the all the quantified time-points and so the standard deviation for that gene is 0.

An example:

x <- assay(RNASeqData)
x["ENSDARG00000102296",]

##                     zygote            cleavage.2.cell 
##                         NA                         NA 
##          blastula.128.cell           blastula.1k.cell 
##                          2                          2 
##              blastula.dome       gastrula.50..epiboly 
##                          2                         NA 
##            gastrula.shield       gastrula.75..epiboly 
##                         NA                         NA 
##   segmentation.1.4.somites segmentation.14.19.somites 
##                         NA                         NA 
## segmentation.20.25.somites          pharyngula.prim.5 
##                         NA                         NA 
##         pharyngula.prim.15         pharyngula.prim.25 
##                         NA                         NA 
##          hatching.long.pec    larval.protruding.mouth 
##                         NA                         NA 
##               larval.day.4               larval.day.5 
##                          2                          2

Here all the quantified fractions have a value of 2. So when it tries to calculate standard deviations it actually reflect in having a value of 0. And we conversely do the same for paralogs localized inside protein complexes and all the rest of the paralogs.

ParalogSubunitsCorrelations <- Correlate(SubunitsParalogs,RNASeqData,nmin=5)
OtherParalogsCorrelations <- Correlate(OtherParalogs,RNASeqData,nmin=5)

Just for sake of practice we save the workspace, in case we will need for future analysis.

save.image("../Markdown/workspaces/01-Paralogs_RNASeqData.RData")

Now we can go into more detail into the proportion of correlating and anticorrelating paralogs. We define anticorrelating paralogs as having a correlation values of < 0 , conversely correlating paralogs are define as having a correlation value of >=0. We can plot the general proportion of anticorrelating paralogs agains all others. We can see that there is a significant proportions (not so huge but relevant in any way) of anticorrelating gene pairs.

#Anticorrelating Paralogs
Anticorrelating <- nrow(AllParalogsCorrelations[AllParalogsCorrelations$corr<(0),])
#Prepare data for barplot
Proportions <- as.data.frame(c(nrow(AllParalogsCorrelations)-Anticorrelating,Anticorrelating))
colnames(Proportions) <- "n"
Proportions$n <- Proportions$n/sum(Proportions$n)
Proportions$type <- c("Others","Not-correlated")
Proportions$type <- factor(Proportions$type,levels = c("Others","Not-correlated"))

ParalogSubunitsCorrelations$type <- rep("Paralogs in Complexes",nrow(ParalogSubunitsCorrelations))
OtherParalogsCorrelations$type <- rep("Other Paralogs",nrow(OtherParalogsCorrelations))

2.6.1 Density of correlation profiles

A better way of visualizing the distribution is via the function ParalogsDensityPlot. It allows to easily summarize the information coming from the correlation of paralog pairs, in a plot. It will show the density distribution of the correlation of ALL paralogs pairs, the color code indicates different correlation threshold. The barplot on the left actually reflects the count of the data.

pdf("../out/plots/p03RNA.pdf",width = 5,height = 5)
p03RNA <- ParalogsDensityPlot(AllParalogsCorrelations$corr,
                              title = "Correlation of Paralog Pairs - Development",
                              x_scale = c(-1,1))
p03RNA$p.all
dev.off()

## png 
##   2

p03RNA$p.all

2.6.2 Correlation and sequence Identity

We can relate the proportion of anticorrelating paralogs to their sequence identity. In this chunk of code here, we fit a GAM with the mean paralog sequence identity and their correlations. From the results we see that while paralog pairs that are not taking part in the formation of the same complex, show a strong relationship between reciprocal identity and co-regulation. On the contrary paralog pairs that are also protein complex members show a weaker, and in this case even negative, relationship.

The last step of this code also gives the table AllCorrelations, basically a data.frame with all the paralog pairs annotated if they are complex subuits or not.

#Create a unique dataframe
AllCorrelations <- rbind(ParalogSubunitsCorrelations,OtherParalogsCorrelations)
AllCorrelations$p.1 <- as.character(AllCorrelations$p.1)
AllCorrelations$p.2 <- as.character(AllCorrelations$p.2)

And here we visualize the results using a GAM.

ComplexColor <- c("#D6604D","#4393C3")
pdf("../out/plots/p05RNA.pdf",width = 5,height = 5)

p05RNA <- ggplot(data=AllCorrelations) +
        geom_smooth(data=AllCorrelations,aes(x=mean.identity,y=corr,color=type)) +
        theme_bw() + scale_color_manual(values = ComplexColor) +
        theme(legend.position = "top") + 
        xlab("mean identity") +
        ylab("correlation") + labs(color="")

#Plot
p05RNA
dev.off()

## png 
##   2

p05RNA

We can also run a correlation test on the distribution to assess if there is a significant correlation between sequence identity and correlation of paralog pairs. We do this for both paralogs in protein complexes, and paralogs that are not inside the same complex, here names as Other Paralogs.

#Correlation tests
cortest <- cor.test(AllCorrelations[AllCorrelations$type=="Other Paralogs","corr"],AllCorrelations[AllCorrelations$type=="Other Paralogs","mean.identity"])
cortest

## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations[AllCorrelations$type == "Other Paralogs", "corr"] and AllCorrelations[AllCorrelations$type == "Other Paralogs", "mean.identity"]
## t = 63.33, df = 159149, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1519896 0.1615740
## sample estimates:
##       cor 
## 0.1567855

cortest <- cor.test(AllCorrelations[!AllCorrelations$type=="Other Paralogs","corr"],AllCorrelations[!AllCorrelations$type=="Other Paralogs","mean.identity"])
cortest

## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations[!AllCorrelations$type == "Other Paralogs", "corr"] and AllCorrelations[!AllCorrelations$type == "Other Paralogs", "mean.identity"]
## t = -4.1389, df = 1487, p-value = 3.686e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.15666970 -0.05622364
## sample estimates:
##        cor 
## -0.1067189

In order to test our claim, we can also sample the same number of paralog pairs from the two distributions. We get the number of pairs in the comp.pairs variable, and we sample from the Other Paralogs distribution. We bind again the dataset together, and show them in a plot.

comp.pairs <- AllCorrelations %>% filter(type!="Other Paralogs") %>% nrow()
#Sample
s <- sample((rownames(AllCorrelations %>% 
                            filter(type=="Other Paralogs"))),comp.pairs)
#Bind the dataset
AllCorr.Samp <- rbind(AllCorrelations %>% filter(type!="Other Paralogs"),
                      AllCorrelations[s,])

#Show them on plot
ggplot(data=AllCorr.Samp) +
        geom_point(aes(x=mean.identity,y=corr,color=type),alpha=0.2) + 
        geom_smooth(data=AllCorrelations,aes(x=mean.identity,y=corr,color=type)) +
        theme_bw() + scale_color_manual(values = ComplexColor) +
        theme(legend.position = "top") + 
        xlab("mean identity") +
        ylab("correlation") + labs(color="")

We can subset only for paralogs residing in the same complex, and annotate with gene Name. Using the function Annotate which uses the Annotation.db resource. We also define here variable paralog pairs, that represent the bottom 25% of the distribution of correlation of paralog subunits.

AllCorr.Compl <- AllCorrelations[AllCorrelations$type=="Paralogs in Complexes",]
AllCorr.Compl <- spatialR::Annotate(AllCorr.Compl,organism = "Dr",idcol="p.1",annot="SYMBOL")
AllCorr.Compl <- spatialR::Annotate(AllCorr.Compl,organism = "Dr",idcol="p.2",annot="SYMBOL")

#Change colnames
colnames(AllCorr.Compl)[c(7:8)] <- c("SYMBOL.x","SYMBOL.y")

#Select the most variable pairs
quant <- quantile(AllCorr.Compl$corr)["25%"]

#Variable pairs
variable.pairs <- unlist(AllCorr.Compl[AllCorr.Compl$corr<=quant,c("SYMBOL.x","SYMBOL.y")])
variable.pairs <- unique(unlist((unique(variable.pairs))))

#Extract gene set representing paralogs
geneSet <- unique(unlist(AllCorr.Compl[,c("SYMBOL.x","SYMBOL.y")]))

2.6.3 Variable Pairs in Protien Complexes GOenrich.

We run a GOEnrichement analysis on the variable paralog pair subunits againts all the rest. The GOEnrichmentfunction is called from the spatialR package, and uses the topGO package from bioconductor to perform GOenrichment analysis, using Fisher test. Then a second step is applied in this case. The SummarizeGO function is used to summarize the results of the data and uses the rrvgo package.

GOEnrich.var.Pairs.dev <- spatialR::GOEnrichment(geneSet,variable.pairs,
                                             ontology = "BP",species = "Dr",topnode = 200)
GOEnrich.var.Pairs.Summ <- spatialR::SummarizeGO(GOEnrich.var.Pairs.dev,plot = "scatter",thr = 0.7)

And we save the plot.

pdf("../out/plots/p06RNA.pdf",width = 5,height = 5)
p06RNA <- GOEnrich.var.Pairs.Summ$plot
p06RNA
dev.off()

## png 
##   2

p06RNA

In this chunks here we also define very similar paralogs that are paralogs that have a mean sequence identity of >= 50%.

AllCorr.Compl$regulation <- ifelse(AllCorr.Compl$corr<=quant,"HighlyVar","Other")
AllCorr.Compl$similar <- ifelse(AllCorr.Compl$mean.identity>=50,"HighlySimilar","NotSimilar")

#FisherTest
Fisher.SimDF <-  table(AllCorr.Compl[,c("similar","regulation")])
fisher.test(Fisher.SimDF)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  Fisher.SimDF
## p-value = 0.03131
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.022917 1.690685
## sample estimates:
## odds ratio 
##   1.316132

Plot the output.

pdf("../out/plots/p07RNA.pdf",width = 5,height = 5)
p07RNA <- ggplot(data=AllCorr.Compl) + 
          geom_density(aes(x=mean.identity,fill=regulation),alpha=0.3)

p07RNA
dev.off()

## png 
##   2

wilcox.test(AllCorr.Compl[AllCorr.Compl$regulation=="HighlyVar","mean.identity"],
            AllCorr.Compl[AllCorr.Compl$regulation=="Other","mean.identity"])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  AllCorr.Compl[AllCorr.Compl$regulation == "HighlyVar", "mean.identity"] and AllCorr.Compl[AllCorr.Compl$regulation == "Other", "mean.identity"]
## W = 225368, p-value = 0.01652
## alternative hypothesis: true location shift is not equal to 0

We observe in general no significant differences in correlations between similar parlaogs and other paralogs inside protein complexes, althought we can observe a clear peak localized in 75% sequence identity for highly variable paralogs.

2.6.4 ING4 and ING5B

Here we show some examples of differentially regulated paralogs. ING4 and ING5B:

colors <- brewer.pal("BrBG",n=11)[c(4,8)]

GOEnrich.var.Pairs.dev$Table <- GOEnrich.var.Pairs.dev$table
LIST <- ShowGenesGOTerms(GOEnrich.var.Pairs.dev,variable.pairs)

pdf("../out/plots/p17RNA.pdf",height = 5,width = 5)
profile_plt <- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
                              proteins = c("ING4","ING5B"),idcol = "SYMBOL",foldChange = T)

p17RNA <- profile_plt[[1]] + 
  theme(axis.text.x = element_blank(),legend.position = "bottom",
        legend.title = element_blank(),plot.title = element_blank(),
        legend.text = element_text(size = 10)) + xlab("Dev. Stages") + 
  scale_color_manual(values = colors)

p17RNA
dev.off()

## png 
##   2

p17RNA

2.6.5 Smarcd1 and Smarcd2

We do the same for SMARCD1 and SMACD2

pdf("../out/plots/p18RNA.pdf",height = 5,width = 5)

profile_plt <- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
                              proteins = c("SMARCD1","SMARCD2"),idcol = "SYMBOL",foldChange = T)
p18RNA <- profile_plt[[1]] + 
 theme(axis.text.x = element_blank(),legend.position = "bottom",
        legend.title = element_blank(),plot.title = element_blank(),
        legend.text = element_text(size = 10)) + xlab("Dev. Stages") + 
  scale_color_manual(values = colors)

p18RNA
dev.off()

## png 
##   2

p18RNA

2.6.6 TMED4 and TMED9

TMED4 and TMED9 part of the COPI complex

pdf("../out/plots/p20RNA.pdf",height = 5,width = 5)

profile_plt <- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
                              proteins = c("TMED4","TMED9"),idcol = "SYMBOL",foldChange = T)

p20RNA <- profile_plt[[1]] + 
 theme(axis.text.x = element_blank(),legend.position = "bottom",
        legend.title = element_blank(),plot.title = element_blank(),
        legend.text = element_text(size = 10)) + xlab("Dev. Stages") + 
  scale_color_manual(values = colors) 

p20RNA
dev.off()

## png 
##   2

p20RNA

2.6.7 Sec23a and Sec23b

And SEC23A and SEC23B inside the COPII complex aswell.

pdf("../out/plots/p19RNA.pdf",height = 5,width = 5)
profile_plt <- PlotProteinProfiles(RNASeqData.exprs[c(2:19,22)],
                              proteins = c("SEC23A","SEC23B"),idcol = "SYMBOL",foldChange = T)

p19RNA <- profile_plt[[1]] + 
 theme(axis.text.x = element_blank(),legend.position = "bottom",
        legend.title = element_blank(),plot.title = element_blank(),
        legend.text = element_text(size = 10)) + xlab("Dev. Stages") + 
  scale_color_manual(values = colors) 
p19RNA
dev.off()

## png 
##   2

p19RNA

2.7 Function of divergent, but similar paralogs

By selecting for variable AND similar paralogs, we can assess the functions carryed on by those specific genes, running a GOanalysis against all the rest.

keep <- AllCorr.Compl[AllCorr.Compl$regulation=="HighlyVar" & AllCorr.Compl$similar=="HighlySimilar",c("SYMBOL.x","SYMBOL.y")]
keep <- unique(as.character(unlist(keep)))

keep <- unique(unlist(strsplit(unlist(keep),split = ';')))

#Table
GOEnrich.var.Pairs.dev.similar <- spatialR::GOEnrichment(geneSet,keep,ontology = "BP",species = "Dr")
GOEnrich.var.Pairs.Summ.similar <- spatialR::SummarizeGO(GOEnrich.var.Pairs.dev.similar,plot = "scatter")

pdf("../out/plots/p08RNA.pdf",width = 5,height = 5)
p08RNA <- GOEnrich.var.Pairs.Summ.similar$plot
dev.off()

## png 
##   2

p08RNA

2.7.1 Subunits anticorrelations vs Paralogs anticorrelation

In this chuck of code here we aim to address if the higher proportion of anti-correlated paralog subunit pairs is due to the fact that those paralogs are just protein complex subunits, or due a specific characteristics of paralogs that reside inside protein complexes. Here first we annotate our list of protein complexes DrerioComplex with the expression values from the RNASeqData using the GetExpressions function. Selecting a minimum number of quantied subunits in each complex =5. All complexes that have less than 5 subunits quantified are discarded from this analysis.

Then for every complex we calculate the correlation values between all possible subunit pairs, and then we check wether each pair is a paralog pair with the areParalogs function.

# Subunits Anticorrelations vs Paralogs Anticorrelations, is anticorrelations inside subunits or paralogs? ####
DrerioComplex <- GetExpressions(DrerioComplex,RNASeqData,min=5)
Subunits <- subunits(DrerioComplex)
Subunits <- Subunits[Subunits %in% GetIds(SubunitsParalogs)]

CorrList <- list()
for (C in names(DrerioComplex))
{
  message(C)
  Correlations <- cor(DrerioComplex[[C]])
  Correlations$are.paralogs <- apply(Correlations[,c(1,2)],1,function(x){areParalogs(x[1],x[2],Paralogs)})
  Correlations <- Correlations[complete.cases(Correlations),]
  CorrList[[C]] <- Correlations
}

#
CorrList <- do.call(rbind,CorrList)

We can plot the result in a density plot.

pdf("../out/plots/p09RNA.pdf",width = 5,height = 5)
p09RNA <- ggplot(data=CorrList) + geom_density(aes(x=corr,color=are.paralogs),alpha=0.5,lwd=1,fill=NA) + 
          theme_minimal() + theme(legend.position = "top") + ggtitle("Subunits:") +
          xlab("Correlations")
dev.off()

## png 
##   2

p09RNA

And also run a wilcox-test. There is a slightly evidence that the anticorrelating behaviour is a characteristic inside of paralog pairs that reside inside protein complexes. From these results seems that subunits pairs that are paralogs are generally less co-regulated compared to other subunit pairs in the same complexes.

wilcox.test(CorrList$corr[CorrList$are.paralogs==T],CorrList$corr[CorrList$are.paralogs==F])

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CorrList$corr[CorrList$are.paralogs == T] and CorrList$corr[CorrList$are.paralogs == F]
## W = 5616205, p-value = 0.03843
## alternative hypothesis: true location shift is not equal to 0

2.8 Complexes Analysis

Following the statement that antagonizing paralogs regulation seems to be particularly localized between paralogs of protein complex subunits, we carried on the characterization of the variation in composition of macromolecular complexes in order to determine the possible roles played by their paralog subunits. We used the median correlation between all the possible subunit pairs in order to classify protein complexes co expression as stable or variable. First we create a RandomComplex list and annotate with the values coming from the RNASeq data. The we calculate the correlations that we interpret as stability for random and not random complexes.

set.seed(123)
# ComplexAnalysis on 18Stages RNASeq Data ####
RandomComplex <- RandomComplexSet(DrerioComplex,rownames(RNASeqData))
RandomComplex <- GetExpressions(RandomComplex,RNASeqData,min=5)

#Calculate Correlations
ComplexStability <- cor(DrerioComplex)
RandComplexStability <- cor(RandomComplex)

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

## Warning in cor(expressions[x[1], ], expressions[x[2], ], use =
## "pairwise.complete.obs"): the standard deviation is zero

#Sorted Complex , sort complex by stability
DrerioComplex <- DrerioComplex[[names(sort(ComplexStability,decreasing = T))]]

#Get Ordered subunits by stability
DrerioSubunits <- subunits(DrerioComplex)
write.table(DrerioSubunits,"../out/RNASeq_Drerio_SubunitsCorr.txt",sep = "\t")

As a first point, we observe, as expected, high value of correlations between protein complex members (p-value < 2.2e-16 , wilcox test) compared to the distribution of correlation of random proteins.

#Create ggplotDataframe with Density distributions.
DF <- data.frame(corr=c(ComplexStability,RandComplexStability))
DF$type <- rep(c("Complexes"),nrow(DF))
DF$type[grep("Random",rownames(DF))] <- "Random Complexes"

#wilcox test
wilc.test <- wilcox.test(DF$corr[DF$type=="Complexes"],DF$corr[DF$type=="Random Complexes"])

#Stable and Variable threshold
variable <- quantile(ComplexStability)["25%"]
stable <- quantile(ComplexStability)["75%"]

We then highlight some specific protein complexes that varies their stoichiometry composition during development. At first we select some specific complex of interest and we highlight them on the density distribution. We use approxfunc to map back the y point on the plot for the different complexes.

DF$complex <- rownames(DF)

#Create density functions
densfunc <- approxfun(density(DF[DF$type=="Complexes","corr"]))

InterestingSubset <- c("cytoplasmic ribosomal large subunit","cytoplasmic ribosomal small subunit","26S Proteasome","COPII","cytoplasmic dynein complex","SNARE complex","RSC")

#Density Plot
#Plot Distribution of Correlation of complexes
p10RNA <- ggplot(data=DF) + 
          geom_density(aes(x=corr,fill=type),alpha = 0.3,size=0.5,color=NA) + 
          geom_point(data=DF %>% filter(complex %in% InterestingSubset),
                     aes(x=corr,y=densfunc(corr)),color="red") + 
          theme_minimal() + ylab('density') + 
          scale_fill_manual(values=c("darkgreen","grey")) + 
          geom_text_repel(data=DF %>% filter(complex %in% InterestingSubset),
                          aes(x=corr,y=densfunc(corr),label=complex),
                          nudge_y = 0.5) + xlab("correlations") + labs(fill="subset")
         
  
pdf("../out/plots/p10RNA.pdf",width = 5,height = 5)
p10RNA
dev.off()

## png 
##   2

p10RNA

The plot show show the distribution of the correlations for some specific complexes, compared to the distribution of correlations for random complexes.

We then classify protein complexes as stable or variable based on their subunits correlation values along the 18 time points. Variable complexes, are localized in the lower- 25% quantile, and Stable ones are characterized in the > 75% quantile. We then check for paralogs content in each stable and variable complexes. Highlighting the fact that variable complexes possess in general also a higher proportion of subunits that have paralogs.

## Divide in stable and variable Complexes
VariableComplexes <- ComplexStability[ComplexStability < quantile(ComplexStability)["25%"]]
StableComplexes <- ComplexStability[ComplexStability > quantile(ComplexStability)["75%"]]

VariableComplexes.Set <- DrerioComplex[[names(VariableComplexes)]]
StableComplexes.Set <- DrerioComplex[[names(StableComplexes)]]

#Paralogs inside variable complexes
VariableComplexes.Set.Members <- subunits(VariableComplexes.Set)
VariableComplexes.Set.Members <- hasParalogs(Paralogs,VariableComplexes.Set.Members)

#Paralogs inside stable Complexes
StableComplexes.Set.Members <- subunits(StableComplexes.Set)
StableComplexes.Set.Members <- hasParalogs(Paralogs,StableComplexes.Set.Members)

#Proportion
FisherDF <- t(rbind(table(VariableComplexes.Set.Members),table(StableComplexes.Set.Members)))
colnames(FisherDF) <- c("Variable","Stable")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")

fisher <- fisher.test(FisherDF)
FisherDF <- apply(FisherDF,2,function(x)x/sum(x))
FisherDF <- melt(FisherDF);FisherDF$Var2 <- factor(FisherDF$Var2,levels = c("Stable","Variable"))

We show the results with a stacked barplot

pdf("../out/plots/p11RNA.pdf",width = 5,height = 5)
p11RNA <- ggplot(data=FisherDF) + geom_bar(aes(x=Var2,y=value,fill=Var1),stat = "identity") + theme_minimal() +
          ggtitle("Proportion of Paralogs",subtitle = fisher$p.value) + 
          ylab("Proportion of Subunits")
p11RNA
dev.off()

## png 
##   2

p11RNA

In this chunck below we run the correlation between each subunits against the other, this using the SubunitsCoregulation and the GetStability. To relate to the protein complex, the “stability,” or correlation in this case, for each subunits as the average correlation with all the other subunits in the complex. This results in the density plot below. Similar to what we already done for protein complexes but at a subunit-wise level.

# Subunits Stability in Protein Complexes ####
DrerioComplex <- SubunitsCoregulation(DrerioComplex)

StabilityDF <- GetStability(DrerioComplex)
StabilityDF <- StabilityDF[complete.cases(StabilityDF),]
colnames(StabilityDF)[1] <- "complex"

#Quantiles
Quant <- quantile(StabilityDF$cv,na.rm = T,c(0.25,0.75))
StabilityDF$hasParalogs <- hasParalogs(Paralogs,StabilityDF$subunit)

HighlyVariableSub <- StabilityDF[StabilityDF$cv<Quant["25%"],];HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyStableSub <- StabilityDF[StabilityDF$cv>Quant["75%"],];HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))

StableDF <- rbind(HighlyVariableSub,HighlyStableSub)
write.table(StabilityDF,"../out/SubunitsStability.txt",sep = "\t",row.names = F)

Plotting the results then shows, the median values of correlation of each subunits with its complex. The density plot really shows that the distribution is skewed toward high values.

pdf("../out/plots/p12RNA.pdf",width = 5,height = 5)
#Plot density of protein subunits regulation
#Plot Distribution of Correlation of complexes
p12RNA <- ggplot(data=StabilityDF) + geom_density(aes(x=cv),alpha = 0.3,size=0.5) + theme_minimal() +
  scale_fill_manual(values=c("darkgreen","grey")) + scale_color_manual(values=c("darkgreen")) + xlab("Correlations") + 
  ggtitle("Subunits correlations",subtitle = "Mean correlation within complex") + 
  theme(legend.position = "none") + geom_vline(aes(xintercept = stable),linetype="dashed") + 
  geom_vline(aes(xintercept = variable),linetype="dashed")
p12RNA
dev.off()

## png 
##   2

p12RNA

Also here in this chunks of code, protein complexes are grouped by their size in quantified proteins, and their proportion of paralogs. We then merge back this information with the Stability data frame, to obtain also the correlation values for each complex.

#ComplexStability BarPlot
ComplexStability <- data.frame(complex=names(ComplexStability),corr=ComplexStability)
ComplexStability$complex <- factor(ComplexStability$complex,levels = ComplexStability[order(ComplexStability$corr,decreasing = T),"complex"])

#Stability, paralog proportions and size.
StabilityDF$size <- sapply(StabilityDF$complex,function(x)nrow(StabilityDF[StabilityDF$complex==x,]))
Stability <- StabilityDF %>% group_by(complex) %>% 
             dplyr::summarize(prop =sum(hasParalogs)/n(),size=n())
Stability <- as.data.frame(Stability)
Stability <- merge(ComplexStability,Stability,by.x="complex",by.y="complex")
colnames(Stability) <- c("complex","correlation","Paralogs.prop","size")

pdf("../out/plots/p13RNA.pdf",width = 5,height = 5)
p13RNA <- plot(Stability[,2:4],pch=19,cex=0.7)
p13RNA

## NULL

dev.off()

## png 
##   2

p13RNA

## NULL

We can also visualize this data in a barplot format. Counting the stable and variable pairs, and the different proportion of subunits that have paralogs.

#Proportion
FisherDF <- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
Fisher <- fisher.test(FisherDF)

#BarPlot
StableDF$type <- factor(StableDF$type,levels = c("HighlyVar","HighlyStable"))
ggplot(data=StableDF) + geom_bar(aes(x=type,fill=hasParalogs),position = "fill") + ggtitle("Subunits") + theme_minimal()

#Run pearson correlation test
cor.test(Stability$correlation,Stability$Paralogs.prop)

## 
##  Pearson's product-moment correlation
## 
## data:  Stability$correlation and Stability$Paralogs.prop
## t = -6.6419, df = 237, p-value = 2.099e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4980028 -0.2834975
## sample estimates:
##        cor 
## -0.3961415

But a better way to visualize this is problably with a scatterplot. This part of code below plot then the variability values for each complex,expressed as 1-R , against the paralog content of each complex.

pdf("../out/plots/p14RNA.pdf",width = 5,height = 5)

#ComplexBarPlot
p14RNA <-ggplot(data=Stability) + geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity",size=.9) +
        theme_bw()+
        geom_smooth(aes(x=1-correlation,y=Paralogs.prop),method = "lm",color="orange") + 
        ylab("Proportion of Paralogs") + xlab("complex variability (1-R)") + 
        ylim(0,1)
p14RNA

## Warning: Removed 7 rows containing missing values (geom_smooth).

#Plot
dev.off()

## png 
##   2

p14RNA

## Warning: Removed 7 rows containing missing values (geom_smooth).

The plot highlight clearly a relationship between paralogs content and complex variability. In addition to that, we can visualize the genral relationship between subunits that have paralogs and subunits that don’t. Since the median correlation value of each subunit with its complex is stored in the StabilityDF dataframe, we can simply plot the density values similar to what we have dove above, but highlighting subunits that have paralogs.

wx.test <- wilcox.test(StabilityDF$cv[StabilityDF$hasParalogs],
                       StabilityDF$cv[!StabilityDF$hasParalogs])

pdf("../out/plots/p15RNA.pdf",width = 5,height = 5)
p15RNA <- ggplot(data=StabilityDF) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) + 
  scale_fill_manual(values = ParalogsColor) + theme_bw() + 
  xlab("correlation within complex") + 
  ggtitle("Within Compl. Correlations - Development",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p15RNA
dev.off()

## png 
##   2

p15RNA

2.8.1 GOEnrichment on variable subunits that have paralogs

Here we run a GOEnrichment analysis on the subunits that have paralogs that are behaving more variably, against all other subunits. What are they carrying on ? In this chunks of code we carry on a GO enrichment analysis of the most variable subunits that have paralogs against all other subunits characterized inside protein complexes.

keep <- unique(StableDF[StableDF$type=="HighlyVar" & StableDF$hasParalogs==T,"subunit"])
geneset <- unique(StabilityDF[,"subunit"])

#GOEnrichment
GOEnrichment.Variable.Sub.Paralogs <- spatialR::GOEnrichment(geneList = geneset,
                                                             interesting = keep,IDtype = "ENSEMBL",ontology = "BP",
                                                             species = "Dr")

We can show the general enrichment using the SimpleGOPlot function.

SimpleGOPlot(GOEnrichment.Variable.Sub.Paralogs$table,showTOP = 30,title = "Variable Paralogs Subunits")

Or summarize the redundancy of the GOTerms using SummarizeGO.

SummGO <- SummarizeGO(x = GOEnrichment.Variable.Sub.Paralogs,thr=0.5)

pdf("../out/plots/p16RNA.pdf",width = 5,height = 5)
p16RNA <- SummGO$plot
p16RNA
dev.off()

## png 
##   2

2.9 Assemble Tables

We prepare here some of the table that we would need to use for the future we organize them with speicific columns merging different dataframes.

Complexes_Table_Drerio <- merge(Stability,StabilityDF,by='complex')
Complexes_Table_Drerio <- Complexes_Table_Drerio[,-8]
colnames(Complexes_Table_Drerio) <- c('complex','complex.coexpression',
                                      'paralogs.fraction','comp.size',
                                      'subunit','subunits.coexpr.with.complex',
                                      'subunit.hasParalogs')

2.10 Save workspace

We save the complete workspace under the name 01-Paralogs_RNASeqData_CompleteWorkspace.RData this saves all the workspace that is active while writing this markdown. In this ways should be easy to access back information coming from the different tables.

save.image("../Markdown/workspaces/01-Paralogs_RNASeqData_CompleteWorkspace.RData")