Section 3 Paralogs Regulation Along 29 Human Tissue

3.1 Homo sapiens Paralogs

For starting we calculate again some statistics regarding Human paralogs, in particular, how many genes in the Human genome have paralogs ? This is an important question to ask before carrying out the rest of the analysis. We first read the paralogs file and with a similar approach done here, we calculate the proportion of protein coding genes that have paralogs in the human genome.

library(dplyr)

#Load the Zebrafish Paralogs file
Paralogs <- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_paralogs_v102.txt",sep = "\t",header = T)
Paralogs <- unique(Paralogs[,c(1,2)])

#Only protein conding genes
Protein.coding <- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_type_v102.txt",sep = "\t",header = T)
Protein.coding <- Protein.coding[Protein.coding$Transcript.type=="protein_coding",1]
Protein.coding <- unique(Protein.coding)

And we filter, for genes only present in the Protein.coding variable and count the occurrence using the summarize function.

Paralogs <- Paralogs %>%
            filter(Gene.stable.ID %in% Protein.coding) %>%
            group_by(Gene.stable.ID) %>%
            summarize(N.Paralogs=sum(Human.paralogue.gene.stable.ID!=""))

Now we can simply calculates the proportion of genes that have paralogs in the genome

human.paralogs.percentage <- sum(Paralogs$N.Paralogs>0)/nrow(Paralogs)
human.paralogs.percentage
## [1] 0.7008968

After we can proceed to analyze paralogs behavior and regulation across 29 human tissues from transcriptome and proteome data published in Wang et al., 2019. Here we focus in detail about paralogs regulation along tissues, with a particular focus inside protein complexes. Our built-in package paralogstools will help us in simplify some operations with paralogs. We load some build in packages, like spatialR , paralogstools and proteomicstools, plus some other ones needed for further analysis. We also load the script "../ComplexScript/complexes_function.R" that contains useful functions to carry on the analysis.

library(proteomicstools)
library(paralogstools)
library(spatialR)
library(reshape2)
library(biomaRt)
library(ggrepel)
library(ggplot2)
library(dplyr)
library(factoextra)
library(data.table)
library(gridExtra)

source("../ComplexScript/complexes_function.R")

3.2 Load files

We load the different files, coming from proteome and transcriptome.

filename <- "../Data/Dataset/tissue_analysis/Wang_et_al_GENE_FPKM_Tissues.txt"
filename_proteome <- "../Data/Dataset/tissue_analysis/Wang_et_al_PROTEOME_IBAQ_Tissues.txt"

We use our build in function SetExperimentFromFile to load the matrix in a structured way. We extract the matrix and we convert in Log2 the data. For transcriptome, the data are in FPKM. We apply here the same filtering used by the author, setting to NAs, all measurements that were below, 1 FPKM. With ParalogsColor we also chooose a standard palette to use.

#Paralogs Palette
ParalogsColor <- c("#e6d5b8","#0278ae")

#LoadMatrix
em_transcript <- SetExperimentFromFile(filename,exprcol = c(3:31),ids="Gene.name",sep="\t",showFileStructure = F)
em_proteome <- SetExperimentFromFile(filename_proteome,exprcol=c(3:31),ids="Gene.name",showFileStructure = F)

#extract matrix, substitute 0s with NA
em_transcr.exprs <- as.data.frame(assay(em_transcript));em_transcr.exprs[em_transcr.exprs<1] <- NA
em_proteome.exprs <- as.data.frame(assay(em_proteome));em_proteome.exprs[em_proteome.exprs==0] <- NA

#Convert Log2
em_proteome.exprs <- log2(em_proteome.exprs)
em_transcr.exprs <- log2(em_transcr.exprs)

We Load the matrices, substitute 0s with NAs and convert the data in log scale.

3.3 Load Human Protein Complex Data

We load the information coming from Homo sapiens protein complexes definitions. That will be used for future analysis.

Complexes <- ComplexSet(organism = "hsapiens",identifier = "SYMBOL")

3.4 Load Human Paralogs Data

Now we load paralogs information via the paralogstools package. We use CleanParalogs to remove duplicated pairs. The ParalogsSet object contains a dataframe called data that basically contains the paralogs file of reference.

ParalogsFile <- "../Data/Paralogs/hsapiens_SYMBOL_paralogs_v102.txt"
Paralogs <- ParalogsSet(organism = "hsapiens",identifier = "SYMBOL",filename = ParalogsFile)
#Clean paralogs
Paralogs <- CleanParalogs(Paralogs)
#Extract dataframe
Paralogs.df <- Paralogs@data

Alternatively we can also load a workspace with all calculations already done.

load("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")

We select then paralogs that are protein complex members, and other paralogs by just removing SubunitsParalogs from all the rest of the paralogs, using the Remove function.

#Paralogs of Protein Complex Subunits
SubunitsParalogs <- ParalogsInComplex(Paralogs,Complexes)
OtherParalogs <- Remove(Paralogs,GetIds(SubunitsParalogs))

3.4.1 Add Paralogs annotation

In this chunk, we take the matrix containing sample quantification, and annotate it, changing rownames, and adding another column hasParalogs that indicates if a specific gene has paralog in the genome. In order to do this we use the hasParalogs function from the paralogstools package.

#Add Ids
em_proteome.exprs$ids <- rownames(em_proteome.exprs)
em_transcr.exprs$ids <- rownames(em_transcr.exprs)

#Add Paralogs
em_proteome.exprs$hasParalogs <- hasParalogs(Paralogs,em_proteome.exprs$id)
em_transcr.exprs$hasParalogs <- hasParalogs(Paralogs,em_transcr.exprs$id)

3.4.2 Percentage of Genes that have paralogs

In the end, we can plot the percentage of transcripts that have paralogs in the tissues data. Using again the SimplePiePlot function.

pdf("../out/plots/p39Tiss.pdf",height = 2.10,width = 5)
p39Tiss <- SimplePiePlot(em_transcr.exprs$hasParalogs) +
           theme(title = element_text(size = 7),
                 legend.position = "bottom")
p39Tiss
dev.off()
## png 
##   2
p39Tiss

3.5 Are paralogs driving tissues variability?

We can address the variability accross tissues at the level of transcriptome and proteome, we calculate the standard deviation of each genes, and we divide it by the mean total expression of the genes across tissues. We see that genes that have paralogs in the genome are generally more variable across tissues both at transcriptome and at proteome level.

Then for each genes, we calculate the coefficient of variation defined as 𝞼 / mean expression across tissues, for both transcriptome and proteome. We then run two-sample Wilcoxon test on the two distribution to address differences.

#Coeff.Variation
em_proteome.exprs$coeff.var <- apply(em_proteome.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
em_transcr.exprs$coeff.var <- apply(em_transcr.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))

#Run tests
wx.test <- wilcox.test(em_proteome.exprs$coeff.var[em_proteome.exprs$hasParalogs],
                       em_proteome.exprs$coeff.var[!em_proteome.exprs$hasParalogs])

wx.test.2 <- wilcox.test(em_transcr.exprs$coeff.var[em_transcr.exprs$hasParalogs],
                         em_transcr.exprs$coeff.var[!em_transcr.exprs$hasParalogs])

Now we can plot the density of the variability for genes that have paralogs and for genes that do not have paralogs in the genome. We can see that proteins that have paralogs are generally more differentially regulated. We save the plot in a directory as a .pdf.

pdf("../out/plots/p01Tiss.pdf",height = 5,width = 5)
p01Tiss <- ggplot(data=em_proteome.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) + 
      scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() + 
      ggtitle("Tissues - Proteome",subtitle = paste("Wilcox pvalue",format(wx.test$p.value,digits = 3))) + 
      xlab(paste("Coefficient of Variation [\u03C3/mean (Prot. Quantity)]"))
p01Tiss
dev.off()
## png 
##   2
p01Tiss

We plot the same distribution profiles also for transcript, and observe a similar trend.

pdf("../out/plots/p02Tiss.pdf",height = 5,width = 5)
p02Tiss <- ggplot(data=em_transcr.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) + 
      scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() + 
      ggtitle("Tissues - Transcriptome",subtitle = paste("Wilcox pvalue",format(wx.test.2$p.value,digits = 3))) + 
      xlab(paste("Coefficient of Variation [\u03C3/mean (Gene Expression)]"))
p02Tiss
dev.off()
## png 
##   2
p02Tiss

3.6 Correlate Paralog Pairs

Now we correlate the expression of transcript and proteins of paralog pairs along the 29 human tissues, considering only paralogs identified in at least 5 tissues. The function Correlate calculates correlation for paralog pairs, (taken from the Paralogs object) and the nmin parameter, defines the minimum number of pairwise complete observation present in each of the dataset for the correlation to be computed. We do this for both Paralogs, Subunits paralogs, and all other for both transcriptome and proteome.

# TRANSCRIPTOME
# Correlation of Paralogs, paralogs subunits, others, transcriptome
AllParalogsCorrelations.tissue.transcr <- Correlate(Paralogs,em_transcript,nmin=5)
ParalogSubunitsCorrelations.tissue.transcr <- Correlate(SubunitsParalogs,em_transcript,nmin=5)
OtherParalogsCorrelations.tissue.transcr <- Correlate(OtherParalogs,em_transcript,nmin=5)

# PROTEOME
AllParalogsCorrelations.tissue.prot <- Correlate(Paralogs,em_proteome,nmin=5)
ParalogSubunitsCorrelations.tissue.prot <- Correlate(SubunitsParalogs,em_proteome,nmin=5)
OtherParalogsCorrelations.tissue.prot <- Correlate(OtherParalogs,em_proteome,nmin=5)

We add some annotations from the different datasets, in order to be easily mapped in the script.

#Add column for dataset
AllParalogsCorrelations.tissue.transcr$dataset <- "Transcriptome"
AllParalogsCorrelations.tissue.prot$dataset <- "Proteome"

ParalogSubunitsCorrelations.tissue.transcr$dataset <- "Transcriptome"
ParalogSubunitsCorrelations.tissue.prot$dataset <- "Proteome"

OtherParalogsCorrelations.tissue.transcr$dataset <- "Transcriptome"
OtherParalogsCorrelations.tissue.prot$dataset <- "Proteome"

ParalogSubunitsCorrelations.tissue.transcr$type <- "Paralogs in Complex"
ParalogSubunitsCorrelations.tissue.prot$type  <- "Paralogs in Complex"

OtherParalogsCorrelations.tissue.transcr$type <- "Other Paralogs"
OtherParalogsCorrelations.tissue.prot$type <- "Other Paralogs"

And save the workspace in case we want to load it easily.

save.image("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")

3.7 Distribution of paralog pairs correlations

We bind together the dataset, and we select only pairs that are present both at transcriptome and both at proteome level, in order to have an equal comparison. In this chuck below we bind together the correlations coming from the transcriptome and proteome data.

#Binds the datasets
AllParalogsCorrelations.trans.prot <- rbind(AllParalogsCorrelations.tissue.prot,AllParalogsCorrelations.tissue.transcr)
AllParalogsCorrelations.trans.prot$IDs <- as.character(AllParalogsCorrelations.trans.prot$IDs)

3.7.1 Transcript

We can show the distribution of correlations of paralogs pairs across tissues and proteome. The distributions are more or less centered on 0, with an extensive tail toward positive values. The function ParalogsDensityPlot takes in input the vector containing the correlation values and plot the density distribution with the barplot count of the single paralog pairs stored in the variable pTiss$p.all .

#detach old function, use new one.
rm("ParalogsDensityPlot")
source("../ComplexScript/complexes_function.R")

dens <- density(AllParalogsCorrelations.trans.prot[AllParalogsCorrelations.trans.prot$dataset=="Transcriptome","corr"])
df <- data.frame(x=dens$x, y=dens$y)
probs <- c(-0.7,-0.4, 0,0.4, 0.7)
quantiles <- probs
df$quant <- factor(findInterval(df$x,quantiles))

We again save the output as a pdf file, using the ParalogsDensityPlot function to plot the distribution of correlation between pairs, plus a barplot indicating the specific numbers of paralog pairs.

pdf("../out/plots/p15Tiss.pdf",height = 5,width = 5)
p15Tiss <- ParalogsDensityPlot(AllParalogsCorrelations.tissue.transcr$corr,title = "Correlation of Paralog Pairs - Tissues",x_scale = c(-1,1))
p15Tiss$p.all

dev.off()
## png 
##   2
p15Tiss$p.all

3.7.2 Proteome

We can show the distribution of correlations of paralogs pairs across tissues proteome. The distributions are more or less centered on 0, with an extensive tail toward positive values. The function ParalogsDensityPlot takes in input the vector containing the correlation values and plot the density distribution with the barplot count of the single paralog pairs stored in the variable pTiss$p.all .

pdf("../out/plots/p40Tiss.pdf",height = 5,width = 5)
p40Tiss <- ParalogsDensityPlot(AllParalogsCorrelations.tissue.prot$corr,title = "Correlation of Paralog Pairs across Tissues - Proteome")
p40Tiss$p.all
dev.off()
## png 
##   2

3.8 Correlation and sequence Identity – Transcriptome

Now we run the analysis to understand inf there is a specific relation between correlation between paralog pairs and sequence identity. The same as was done for the Zebrafish RNA-Seq Data.

#Create a unique dataframe
AllCorrelations.tr <- rbind(ParalogSubunitsCorrelations.tissue.transcr,OtherParalogsCorrelations.tissue.transcr)

Here we plot the relationship between mean sequence identity and co-regulation, we use a Generalized Additive Model to plot the relationship between sequence identity and co-regulation. And run a correlation test with cor.test

ComplexColor <- c("#D6604D","#4393C3")

pdf("../out/plots/p04Tiss.pdf",height = 5,width = 5)
p04Tiss <- ggplot(data=AllCorrelations.tr) + 
         geom_smooth(aes(x=mean.identity,y=corr,color=type)) + 
         theme_bw() + scale_color_manual(values = ComplexColor) +
         theme(legend.position = "top") + xlab("mean identity") +
         ylab("correlation") + labs(color="")
#Plot
p04Tiss
dev.off()
## png 
##   2
p04Tiss

#correlations tests
cortest <- cor.test(AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = 105, df = 89078, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3260179 0.3377052
## sample estimates:
##       cor 
## 0.3318743
cortest <- cor.test(AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = -1.298, df = 1802, p-value = 0.1945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07660547  0.01561082
## sample estimates:
##         cor 
## -0.03056236

3.9 Correlation and sequence Identity – Proteome

Same approach is also carried out for the proteome data. First the dataset of paralogs that are not complex members and paralogs that are complex members are joined together, then a GAM is fitted to highlight the relationship between sequence identity and co-regulation for paralogs inside protein complexes and outside protein complexes.

#Create a unique dataframe
AllCorrelations.pr <- rbind(OtherParalogsCorrelations.tissue.prot,ParalogSubunitsCorrelations.tissue.prot)

We highlight the relation between the two variable again using a GAM.

pdf("../out/plots/p10Tiss.pdf",height = 5,width = 5)
p10Tiss <- ggplot(data=AllCorrelations.pr) + 
           geom_smooth(aes(x=mean.identity,y=corr,color=type)) + 
           theme_bw() + 
           theme(legend.position = "top") +
           ggtitle("Proteome",subtitle = "along 29 Human tissues")
p10Tiss
dev.off()
## png 
##   2
p10Tiss

cortest <- cor.test(AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 46.774, df = 42649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2118491 0.2299039
## sample estimates:
##       cor 
## 0.2208954
cortest <- cor.test(AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 3.5045, df = 1302, p-value = 0.000473
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04260577 0.15016480
## sample estimates:
##       cor 
## 0.0966675

3.10 GOEnrichment on variable Complex paralog pairs

3.10.0.1 Transcriptome

Now we select a subset of paralogs that is localized specifically inside protein complexes, we store this information in the AllCorr.Compl.tr variable. This specific variable contains the information for the correlation of paralog pairs inside protein complexes in reference to the transcriptome data. In the same block of code we define also stable and variable paralogs, selecting the bottom 25% of the distribution of correlations and labeling them as HiglyVariable . We also take informations from sequence identity and mark paralog pairs as HighlySimilar if their average sequence identity is >= 50%,

AllCorr.Compl.tr <- AllCorrelations.tr[AllCorrelations.tr$type=="Paralogs in Complex",]

#Select the most variable pairs
quant <- quantile(AllCorr.Compl.tr$corr)["25%"]

#Variable pairs
variable.pairs <- unlist(AllCorr.Compl.tr[AllCorr.Compl.tr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)

geneSet <- unlist(AllCorr.Compl.tr[,c("p.1","p.2")])

AllCorr.Compl.tr$regulation <- ifelse(AllCorr.Compl.tr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.tr$similar <- ifelse(AllCorr.Compl.tr$mean.identity>=50,"HighlySimilar","NotSimilar")

Selecting specifically variable pairs against the rest of the subunits (geneSet) we can run a GOEnrichment to understand functions carried out specifically by variable paralogs.

GOEnrich.var.Pairs.tr <- spatialR::GOEnrichment(geneList = geneSet,
                                                interesting = variable.pairs,
                                                ontology = "BP",species = "Hs")

3.10.0.2 Proteome

We carry on the same passages also for the correlations data coming from the proteome, carrying on the same exact approach. We first bind together the datasets, and store the information of paralog pairs residing in the same complex in the AllCorr.Compl.pr variable.

#Create a unique dataframe
AllCorrelations.pr <- rbind(ParalogSubunitsCorrelations.tissue.prot,OtherParalogsCorrelations.tissue.prot)

Then we select the bottom 25% of the correlation distribution, we select variable pairs unlisting the dataframe, after subsetting it for having AllCorr.Compl.pr$corr<=quant, we also create the complete gene list to compare (geneSet) by unlisting all other paralog pairs in complexes. Finally we add information on the type of paralogs, HighlyVar in the case of having a corr<=quant and inside the similar column we add annotation for paralog pairs, HighlySimilar in case they have a mean reciprocal identity >=50%. (mean.identity>=50)

AllCorr.Compl.pr <- AllCorrelations.pr[AllCorrelations.pr$type=="Paralogs in Complex",]

#Select the most variable pairs
quant <- quantile(AllCorr.Compl.pr$corr)["25%"]

#Variable pairs
variable.pairs <- unlist(AllCorr.Compl.pr[AllCorr.Compl.pr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)

geneSet <- unlist(AllCorr.Compl.pr[,c("p.1","p.2")])

AllCorr.Compl.pr$regulation <- ifelse(AllCorr.Compl.pr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.pr$similar <- ifelse(AllCorr.Compl.pr$mean.identity>=50,"HighlySimilar","NotSimilar")

Also here as for the transcriptome data, we carry on a GOEnrichment analysis, using as target list the variable.pairs and as a background the geneSet variable. Again we can also Summarize the GOEnrichment analysis using the SummarizeGO function.

GOEnrich.var.Pairs.pr <- spatialR::GOEnrichment(geneSet,variable.pairs,ontology = "BP",species = "Hs")
GOEnrich.var.Pairs.Summ.pr <- spatialR::SummarizeGO(GOEnrich.var.Pairs.pr,plot = "scatter")
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

Finally we can highlight the enrichment coming out from the SummarizedGO analysis.

pdf("../out/plots/p09Tiss.pdf",height = 5,width = 5)
p09Tiss <- GOEnrich.var.Pairs.Summ.pr$plot
p09Tiss
dev.off()
## png 
##   2
p09Tiss

3.11 Protein Complexes Analysis - Proteome

In this part of the script we analyze protein complexes at the level of the proteome across 29human tissues. First we create a random set of protein complexes using as reference the same number of complex and complex size from the original Complexes assembly, and randomly sampling proteins coming from rownames(em_proteome). The function GetExpressions is used to add expression annotation to the Complexes object. Selecting from protein complexes with at least 5 subunits matched in the data. We also calculate the median correlation for each protein complexes with the function cor. The function subunits is used to extract the different subunits present in the Complexes object.

set.seed(123)
# ComplexAnalysis at the proteome level
RandomComplex <- RandomComplexSet(Complexes,rownames(em_proteome))
RandomComplex <- GetExpressions(RandomComplex,em_proteome,min=5)

#Complexes
Complexes <- GetExpressions(Complexes,em_proteome,min=5)

#Calculate Correlations
ComplexStability <- cor(Complexes)
RandComplexStability <- cor(RandomComplex)

#Sorted Complex , sort complex by stability
Complexes <- Complexes[[names(sort(ComplexStability,decreasing = T))]]

#Get Ordered subunits by stability
Complexes.subunits <- subunits(Complexes)

Now we create a dataframe DF that contains the values of median correlation for each protein complexes. Median correlation is used here as a proxy for stoichiometries variability/stability. High correlation indicates complexes that are co-regulated together along tissues, and viceversa. We just test the distribution of correlation of protein complexes with a list of random complexes to see that there is an effective correlation between protein complex members.

#Create ggplotDataframe with Density distributions.
DF <- data.frame(corr=c(ComplexStability,RandComplexStability))
DF$type <- rep(c("Complexes"),nrow(DF))
DF$type[grep("Random",rownames(DF))] <- "Random Complexes"

#wilcox test
wilc.test <- wilcox.test(DF$corr[DF$type=="Complexes"],DF$corr[DF$type=="Random Complexes"])
wilc.test$p.value
## [1] 8.497342e-62

The plot below shows the correlation density distribution of protein complexes and random protein complexes (grey). It is clear that protein complexes shows higher levels of co-expression compared to random ones.

#Plot Distribution of Correlation of complexes
pdf("../out/plots/p25Tiss.pdf",height = 5,width = 5)
p25Tiss <- ggplot(data=DF) + geom_density(aes(x=corr,fill=type),alpha = 0.3,size=0.5) + theme_minimal() +
  scale_fill_manual(values=c("darkgreen","grey")) + scale_color_manual(values=c("darkgreen")) + xlab("Correlations")
p25Tiss
dev.off()
## png 
##   2
p25Tiss
Density distribution of the correlations for protein complexes in the proteome data

Figure 3.1: Density distribution of the correlations for protein complexes in the proteome data

3.11.1 Variable and Stable complexes along tissues - Proteome

In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.

## Divide in stable and variable Complexes
VariableComplexes <- ComplexStability[ComplexStability < quantile(ComplexStability)["25%"]]
StableComplexes <- ComplexStability[ComplexStability > quantile(ComplexStability)["75%"]]

Then we calculate the median correlation of each subunits with the rest of its protein complexes, we have a median correlation values that can help us understand how the subunits is coregulated together with the complexes across tissues. This is achieved with the functions SubunitsCoregulation and GetStability that helps annotating the complexes with coexpression information, and retrieving that information respectively.

#ComplexStability BarPlot
ComplexStability <- data.frame(complex=names(ComplexStability),corr=ComplexStability)
ComplexStability$complex <- factor(ComplexStability$complex,levels = ComplexStability[order(ComplexStability$corr,decreasing = T),"complex"])

# Subunits Stability in Protein Complexes 
Complexes <- SubunitsCoregulation(Complexes)
StabilityDF <- GetStability(Complexes)
StabilityDF <- StabilityDF[complete.cases(StabilityDF),]
colnames(StabilityDF)[1] <- "complex"

We then annotate each subunits with the information coming from the paralogs dataset, addressing if a specific subunits has paralogs or not, and we also add information coming from the general size of the complex. Finally we merge this information with the total correlation of protein complexes to plot the stability of protein complexes in relationship with paralogs content. For each subints we also add a type annotation.

#Stability, paralog proportions and size.
StabilityDF$size <- sapply(StabilityDF$complex,function(x)nrow(StabilityDF[StabilityDF$complex==x,]))
StabilityDF$hasParalogs <- hasParalogs(Paralogs,StabilityDF$subunit)

#Quantiles
Quant <- quantile(StabilityDF$cv,na.rm = T,c(0.25,0.75))

#Highly Variable Subunits
HighlyVariableSub <- StabilityDF[StabilityDF$cv<Quant["25%"],]
HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))

HighlyStableSub <- StabilityDF[StabilityDF$cv>Quant["75%"],]
HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))

#Stable and Variable Subunits
StableDF <- rbind(HighlyVariableSub,HighlyStableSub)

Then we calculate for each complex the proportion of paralogs and it is done with this line below. After calculating the proportion of subunits that have paralogs in each complex, we merge the dataset by the complex column, with the ComplexStability dataframe, that contains the co-expression values for protein complexes defined above.

#Proportion of paralogs subunits and correlations
Stability <- StabilityDF %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability <- as.data.frame(Stability)
Stability <- merge(ComplexStability,Stability,by.x="complex",by.y="complex")
colnames(Stability) <- c("complex","correlation","Paralogs.prop","size")

Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,

First we can plot for each subinits the correlation values and indicates for each specific ones, if it has paralogs or not. We can first run a wilcoxon test, and then

wx.test <- wilcox.test(StabilityDF$cv[StabilityDF$hasParalogs],
                       StabilityDF$cv[!StabilityDF$hasParalogs])

Show the specific density profiles in a density plot. Storing the results in a .pdf files.

pdf("../out/plots/p28Tiss.pdf",height = 5,width = 5)
p28Tiss <- ggplot(data=StabilityDF) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) + 
  scale_fill_manual(values = ParalogsColor) + theme_bw() + 
  xlab("correlation within complex") + 
  ggtitle("Within Compl. Correlations - Proteome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p28Tiss
dev.off()
## png 
##   2
p28Tiss

This density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.

#Proportion
FisherDF <- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
Fisher <- fisher.test(FisherDF)

It is now time to plot the results in a barplot, that show exactly what is also recapitulated consistently.

#BarPlot
StableDF$type <- factor(StableDF$type,levels = c("HighlyVar","HighlyStable"))

pdf("../out/plots/p29Tiss.pdf",height = 5,width = 5)
p29Tiss <- ggplot(data=StableDF) + 
           geom_bar(aes(x=type,fill=hasParalogs),position = "fill",color="black",alpha=0.5) + 
           ggtitle("Subunits - Across Tissues") + 
           scale_fill_manual(values = c("#e6d5b8","#0278ae")) + 
           theme_minimal() + xlab("Subunits") + ylab("Proportions")
p29Tiss
dev.off()
## png 
##   2
p29Tiss

Then we finally show more clearly the relationship between paralog proportion and complexes stoichiometric variability. We can clearly see the trend that reflect that the more subunits that have paralogs are present in a specific complex the more this specific complex has variable stoichiometries. By defining stoichiometric variability for protein complexes as (1-R) and the paralog fraction (Paralogs.prop) in each complex The scatterplot highlights the existence of a positive trend between complexes stoichiometric variability and paralog content.

Correlations <- cor(Stability$correlation,Stability$Paralogs.prop)

#Proportions Correlation scatterplot
pdf("../out/plots/p30Tiss.pdf",height = 5,width = 5)
p30Tiss <- ggplot(data=Stability) + 
           geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity") + 
           theme_minimal() +
           geom_smooth(aes(x=1 - correlation,y=Paralogs.prop),
                       method = "lm",se=F,color="orange") + 
           ylab("Proportion of Paralogs") + xlab("complex variability (1-R)")
p30Tiss
dev.off()
## png 
##   2
p30Tiss

3.12 Protein Complexes Analysis - Transcriptome

Here we carry on the same analysis that we did, but at the level of the transcriptome. Loading a new ComplexSet object with the function ComplexSet. Annotating with expression values coming from the em_transcript with the GetExpressions function. We calculate correlation with the cor function and create a dataframe ComplexStability.transcr that contains the co-expression values for protein complexes on the transcriptome data. Before binding the data together we also add a reference column to the DF data frame marking it as coming from the proteome data. Finally we remove random complexes from the combined data frame.

#Complexes
Complexes.transcr <- ComplexSet("hsapiens",identifier = "SYMBOL")
Complexes.transcr <- GetExpressions(Complexes.transcr,em_transcript,min=5)

#Calculate Correlations
ComplexStability.transcr <- cor(Complexes.transcr)

#Create ggplotDataframe with Density distributions.
DF.transcr <- data.frame(corr=c(ComplexStability.transcr))
DF.transcr$type <- rep(c("Complexes"),nrow(DF.transcr))
DF.transcr$data <- "Transcriptome"

#Add reference column on proteome data
DF$data <- "Proteome"

#Bind together Proteome and transcriptome removing random from proteome
DF.all <- rbind(DF[DF$type!="Random Complexes",],DF.transcr)

3.12.1 Variable and Stable complexes along tissues - Transcript

In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.

## Divide in stable and variable Complexes
VariableComplexes <- ComplexStability.transcr[ComplexStability.transcr < quantile(ComplexStability.transcr)["25%"]]
StableComplexes <- ComplexStability.transcr[ComplexStability.transcr > quantile(ComplexStability.transcr)["75%"]]

Then we calculate the median correlation of each subunits with the rest of its protein complexes, we have a median correlation values that can help us understand how the subunits is coregulated together with the complexes across tissues.This is achieved with the functions SubunitsCoregulation and GetStability that helps annotating the complexes with coexpression information, and retrieving that information respectively.

#ComplexStability.transcr BarPlot
ComplexStability.transcr <- data.frame(complex=names(ComplexStability.transcr),corr=ComplexStability.transcr)
ComplexStability.transcr$complex <- factor(ComplexStability.transcr$complex,levels = ComplexStability.transcr[order(ComplexStability.transcr$corr,decreasing = T),"complex"])

# Subunits Stability in Protein Complexes 
Complexes.transcr <- SubunitsCoregulation(Complexes.transcr)
StabilityDF.transcr <- GetStability(Complexes.transcr)
StabilityDF.transcr <- StabilityDF.transcr[complete.cases(StabilityDF.transcr),]
colnames(StabilityDF.transcr)[1] <- "complex"

We then annotate each subunits with the information coming from the paralogs dataset, addressing if a specific subunits has paralogs or not, and we also add information coming from the genral size of the complex. Finally we merge this information with the total correlation of protein complexes to plot the stability of protein complexes in relationship with paralogs content.

#Stability, paralog proportions and size.
StabilityDF.transcr$size <- sapply(StabilityDF.transcr$complex,function(x)nrow(StabilityDF.transcr[StabilityDF.transcr$complex==x,]))
StabilityDF.transcr$hasParalogs <- hasParalogs(Paralogs,StabilityDF.transcr$subunit)

#Quantiles
Quant <- quantile(StabilityDF.transcr$cv,na.rm = T,c(0.25,0.75))

#Highly Variable Subunits
HighlyVariableSub <- StabilityDF.transcr[StabilityDF.transcr$cv<Quant["25%"],];HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyStableSub <- StabilityDF.transcr[StabilityDF.transcr$cv>Quant["75%"],];HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))

#Stable and Variable Subunits
StableDF.transcr <- rbind(HighlyVariableSub,HighlyStableSub)

We calculate for each complex the proportion of paralogs and it is done with this line below.

#Proportion of paralogs subunits and correlations
Stability.tr <- StabilityDF.transcr %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability.tr <- as.data.frame(Stability.tr)

Stability.tr <- merge(ComplexStability.transcr,Stability.tr,by.x="complex",by.y="complex")
colnames(Stability.tr) <- c("complex","correlation","Paralogs.prop","size")

Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,

pdf("../out/plots/p32Tiss.pdf",height = 5,width = 5)
p32Tiss <- plot(Stability.tr[,2:4],pch=19,cex=0.7)
p32Tiss
## NULL
dev.off()
## png 
##   2
p32Tiss
## NULL
wx.test <- wilcox.test(StabilityDF.transcr$cv[StabilityDF.transcr$hasParalogs],
                       StabilityDF.transcr$cv[!StabilityDF.transcr$hasParalogs])

#Density plot
pdf("../out/plots/p34Tiss.pdf",height = 5,width = 5)
p34Tiss <- ggplot(data=StabilityDF.transcr) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) + 
  scale_fill_manual(values = ParalogsColor) + theme_bw() + 
  xlab("correlation within complex") + 
  ggtitle("Within Compl. Correlations - Transcriptome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p34Tiss
dev.off()
## png 
##   2
p34Tiss

This density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.

#Proportion
FisherDF <- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
Fisher <- fisher.test(FisherDF)

It is now time to plot the results in a barplot, that show exactly what is also recapitulaed consistently.

#BarPlot
StableDF.transcr$type <- factor(StableDF.transcr$type,levels = c("HighlyVar","HighlyStable"))

pdf("../out/plots/p35Tiss.pdf",height = 5,width = 5)
p35Tiss <- ggplot(data=StableDF.transcr) + geom_bar(aes(x=type,fill=hasParalogs),position = "fill") + ggtitle("Subunits - Across Tissues") + theme_minimal()
p35Tiss
dev.off()
## png 
##   2
p35Tiss

Then we finally show more clearly the relationship between paralog proportion and complexes stoichiometric variability. We can clearly see the trend that reflect that the more subunits that have paralogs are present in a specific complex the more this specific complex has variable stoichiometries.

Correlations <- cor(Stability.tr$correlation,Stability.tr$Paralogs.prop)

#ComplexBarPlot
pdf("../out/plots/p36Tiss.pdf",height = 5,width = 5)
p36Tiss <- ggplot(data=Stability.tr) + geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity",size=.9) +
        theme_bw()+ ylim(0,1) + 
        geom_smooth(aes(x=1-correlation,y=Paralogs.prop),method = "lm",color="orange") + ylim(0,1) + 
        ylab("Proportion of Paralogs") + xlab("complex variability (1-R)") 

p36Tiss 

3.13 Assemble Tables

Complexes_Table_Hsap_P <- merge(Stability,StabilityDF,by='complex')
Complexes_Table_Hsap_P <- Complexes_Table_Hsap_P[,-7]
colnames(Complexes_Table_Hsap_P) <- c('complex','complex.coexpression',
                                      'paralogs.fraction','comp.size',
                                      'subunit','subunits.coexpr.with.complex',
                                      'subunit.hasParalogs')

Complexes_Table_Hsap_T <- merge(Stability.tr,StabilityDF.transcr,by='complex')
Complexes_Table_Hsap_T <- Complexes_Table_Hsap_T[,-7]
colnames(Complexes_Table_Hsap_T) <- c('complex','complex.coexpression',
                                      'paralogs.fraction','comp.size',
                                      'subunit','subunits.coexpr.with.complex',
                                      'subunit.hasParalogs')

3.14 Save workspace

We save the complete workspace under the name 02-Paralogs_AlongTissues_CompleteWorkspace.RData this saves all the workspace that is active while writing this markdown. In this ways should be easy to access back information coming from the different tables.

Sys.time()
## [1] "2021-07-16 19:59:40 CEST"
save.image("../Markdown/workspaces/02-Paralogs_AlongTissues_CompleteWorkspace.RData")