Section 3 Paralogs Regulation Along 29 Human Tissue
3.1 Homo sapiens Paralogs
For starting we calculate again some statistics regarding Human paralogs, in particular, how many genes in the Human genome have paralogs ? This is an important question to ask before carrying out the rest of the analysis. We first read the paralogs file and with a similar approach done here, we calculate the proportion of protein coding genes that have paralogs in the human genome.
library(dplyr)
#Load the Zebrafish Paralogs file
<- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_paralogs_v102.txt",sep = "\t",header = T)
Paralogs <- unique(Paralogs[,c(1,2)])
Paralogs
#Only protein conding genes
<- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_type_v102.txt",sep = "\t",header = T)
Protein.coding <- Protein.coding[Protein.coding$Transcript.type=="protein_coding",1]
Protein.coding <- unique(Protein.coding) Protein.coding
And we filter, for genes only present in the Protein.coding
variable
and count the occurrence using the summarize
function.
<- Paralogs %>%
Paralogs filter(Gene.stable.ID %in% Protein.coding) %>%
group_by(Gene.stable.ID) %>%
summarize(N.Paralogs=sum(Human.paralogue.gene.stable.ID!=""))
Now we can simply calculates the proportion of genes that have paralogs in the genome
<- sum(Paralogs$N.Paralogs>0)/nrow(Paralogs)
human.paralogs.percentage human.paralogs.percentage
## [1] 0.7008968
After we can proceed to analyze paralogs behavior and regulation across
29 human tissues from transcriptome and proteome data published in Wang
et al., 2019.
Here we focus in detail about paralogs regulation along tissues, with a
particular focus inside protein complexes. Our built-in package
paralogstools
will help us in simplify some operations with paralogs.
We load some build in packages, like spatialR
, paralogstools
and
proteomicstools
, plus some other ones needed for further analysis. We
also load the script "../ComplexScript/complexes_function.R"
that
contains useful functions to carry on the analysis.
library(proteomicstools)
library(paralogstools)
library(spatialR)
library(reshape2)
library(biomaRt)
library(ggrepel)
library(ggplot2)
library(dplyr)
library(factoextra)
library(data.table)
library(gridExtra)
source("../ComplexScript/complexes_function.R")
3.2 Load files
We load the different files, coming from proteome and transcriptome.
<- "../Data/Dataset/tissue_analysis/Wang_et_al_GENE_FPKM_Tissues.txt"
filename <- "../Data/Dataset/tissue_analysis/Wang_et_al_PROTEOME_IBAQ_Tissues.txt" filename_proteome
We use our build in function SetExperimentFromFile
to load the matrix
in a structured way. We extract the matrix and we convert in Log2 the
data. For transcriptome, the data are in FPKM. We apply here the same
filtering used by the author, setting to NAs, all measurements that were
below, 1 FPKM. With ParalogsColor
we also chooose a standard palette
to use.
#Paralogs Palette
<- c("#e6d5b8","#0278ae")
ParalogsColor
#LoadMatrix
<- SetExperimentFromFile(filename,exprcol = c(3:31),ids="Gene.name",sep="\t",showFileStructure = F)
em_transcript <- SetExperimentFromFile(filename_proteome,exprcol=c(3:31),ids="Gene.name",showFileStructure = F)
em_proteome
#extract matrix, substitute 0s with NA
<- as.data.frame(assay(em_transcript));em_transcr.exprs[em_transcr.exprs<1] <- NA
em_transcr.exprs <- as.data.frame(assay(em_proteome));em_proteome.exprs[em_proteome.exprs==0] <- NA
em_proteome.exprs
#Convert Log2
<- log2(em_proteome.exprs)
em_proteome.exprs <- log2(em_transcr.exprs) em_transcr.exprs
We Load the matrices, substitute 0s with NAs and convert the data in log scale.
3.3 Load Human Protein Complex Data
We load the information coming from Homo sapiens protein complexes definitions. That will be used for future analysis.
<- ComplexSet(organism = "hsapiens",identifier = "SYMBOL") Complexes
3.4 Load Human Paralogs Data
Now we load paralogs information via the paralogstools
package. We use
CleanParalogs
to remove duplicated pairs. The ParalogsSet
object
contains a dataframe called data that basically contains the paralogs
file of reference.
<- "../Data/Paralogs/hsapiens_SYMBOL_paralogs_v102.txt"
ParalogsFile <- ParalogsSet(organism = "hsapiens",identifier = "SYMBOL",filename = ParalogsFile)
Paralogs #Clean paralogs
<- CleanParalogs(Paralogs)
Paralogs #Extract dataframe
<- Paralogs@data Paralogs.df
Alternatively we can also load a workspace with all calculations already done.
load("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")
We select then paralogs that are protein complex members, and other
paralogs by just removing SubunitsParalogs
from all the rest of the
paralogs, using the Remove
function.
#Paralogs of Protein Complex Subunits
<- ParalogsInComplex(Paralogs,Complexes)
SubunitsParalogs <- Remove(Paralogs,GetIds(SubunitsParalogs)) OtherParalogs
3.4.1 Add Paralogs annotation
In this chunk, we take the matrix containing sample quantification, and
annotate it, changing rownames, and adding another column hasParalogs
that indicates if a specific gene has paralog in the genome. In order to
do this we use the hasParalogs
function from the paralogstools
package.
#Add Ids
$ids <- rownames(em_proteome.exprs)
em_proteome.exprs$ids <- rownames(em_transcr.exprs)
em_transcr.exprs
#Add Paralogs
$hasParalogs <- hasParalogs(Paralogs,em_proteome.exprs$id)
em_proteome.exprs$hasParalogs <- hasParalogs(Paralogs,em_transcr.exprs$id) em_transcr.exprs
3.4.2 Percentage of Genes that have paralogs
In the end, we can plot the percentage of transcripts that have paralogs
in the tissues data. Using again the SimplePiePlot
function.
pdf("../out/plots/p39Tiss.pdf",height = 2.10,width = 5)
<- SimplePiePlot(em_transcr.exprs$hasParalogs) +
p39Tiss theme(title = element_text(size = 7),
legend.position = "bottom")
p39Tissdev.off()
## png
## 2
p39Tiss
3.5 Are paralogs driving tissues variability?
We can address the variability accross tissues at the level of transcriptome and proteome, we calculate the standard deviation of each genes, and we divide it by the mean total expression of the genes across tissues. We see that genes that have paralogs in the genome are generally more variable across tissues both at transcriptome and at proteome level.
Then for each genes, we calculate the coefficient of variation defined as 𝞼 / mean expression across tissues, for both transcriptome and proteome. We then run two-sample Wilcoxon test on the two distribution to address differences.
#Coeff.Variation
$coeff.var <- apply(em_proteome.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
em_proteome.exprs$coeff.var <- apply(em_transcr.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
em_transcr.exprs
#Run tests
<- wilcox.test(em_proteome.exprs$coeff.var[em_proteome.exprs$hasParalogs],
wx.test $coeff.var[!em_proteome.exprs$hasParalogs])
em_proteome.exprs
.2 <- wilcox.test(em_transcr.exprs$coeff.var[em_transcr.exprs$hasParalogs],
wx.test$coeff.var[!em_transcr.exprs$hasParalogs]) em_transcr.exprs
Now we can plot the density of the variability for genes that have paralogs and for genes that do not have paralogs in the genome. We can see that proteins that have paralogs are generally more differentially regulated. We save the plot in a directory as a .pdf.
pdf("../out/plots/p01Tiss.pdf",height = 5,width = 5)
<- ggplot(data=em_proteome.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) +
p01Tiss scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() +
ggtitle("Tissues - Proteome",subtitle = paste("Wilcox pvalue",format(wx.test$p.value,digits = 3))) +
xlab(paste("Coefficient of Variation [\u03C3/mean (Prot. Quantity)]"))
p01Tissdev.off()
## png
## 2
p01Tiss
We plot the same distribution profiles also for transcript, and observe a similar trend.
pdf("../out/plots/p02Tiss.pdf",height = 5,width = 5)
<- ggplot(data=em_transcr.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) +
p02Tiss scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() +
ggtitle("Tissues - Transcriptome",subtitle = paste("Wilcox pvalue",format(wx.test.2$p.value,digits = 3))) +
xlab(paste("Coefficient of Variation [\u03C3/mean (Gene Expression)]"))
p02Tissdev.off()
## png
## 2
p02Tiss
3.6 Correlate Paralog Pairs
Now we correlate the expression of transcript and proteins of paralog
pairs along the 29 human tissues, considering only paralogs identified
in at least 5 tissues. The function Correlate
calculates correlation
for paralog pairs, (taken from the Paralogs
object) and the nmin
parameter, defines the minimum number of pairwise complete observation
present in each of the dataset for the correlation to be computed. We do
this for both Paralogs, Subunits paralogs, and all other for both
transcriptome and proteome.
# TRANSCRIPTOME
# Correlation of Paralogs, paralogs subunits, others, transcriptome
<- Correlate(Paralogs,em_transcript,nmin=5)
AllParalogsCorrelations.tissue.transcr <- Correlate(SubunitsParalogs,em_transcript,nmin=5)
ParalogSubunitsCorrelations.tissue.transcr <- Correlate(OtherParalogs,em_transcript,nmin=5)
OtherParalogsCorrelations.tissue.transcr
# PROTEOME
<- Correlate(Paralogs,em_proteome,nmin=5)
AllParalogsCorrelations.tissue.prot <- Correlate(SubunitsParalogs,em_proteome,nmin=5)
ParalogSubunitsCorrelations.tissue.prot <- Correlate(OtherParalogs,em_proteome,nmin=5) OtherParalogsCorrelations.tissue.prot
We add some annotations from the different datasets, in order to be easily mapped in the script.
#Add column for dataset
$dataset <- "Transcriptome"
AllParalogsCorrelations.tissue.transcr$dataset <- "Proteome"
AllParalogsCorrelations.tissue.prot
$dataset <- "Transcriptome"
ParalogSubunitsCorrelations.tissue.transcr$dataset <- "Proteome"
ParalogSubunitsCorrelations.tissue.prot
$dataset <- "Transcriptome"
OtherParalogsCorrelations.tissue.transcr$dataset <- "Proteome"
OtherParalogsCorrelations.tissue.prot
$type <- "Paralogs in Complex"
ParalogSubunitsCorrelations.tissue.transcr$type <- "Paralogs in Complex"
ParalogSubunitsCorrelations.tissue.prot
$type <- "Other Paralogs"
OtherParalogsCorrelations.tissue.transcr$type <- "Other Paralogs" OtherParalogsCorrelations.tissue.prot
And save the workspace in case we want to load it easily.
save.image("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")
3.7 Distribution of paralog pairs correlations
We bind together the dataset, and we select only pairs that are present both at transcriptome and both at proteome level, in order to have an equal comparison. In this chuck below we bind together the correlations coming from the transcriptome and proteome data.
#Binds the datasets
<- rbind(AllParalogsCorrelations.tissue.prot,AllParalogsCorrelations.tissue.transcr)
AllParalogsCorrelations.trans.prot $IDs <- as.character(AllParalogsCorrelations.trans.prot$IDs) AllParalogsCorrelations.trans.prot
3.7.1 Transcript
We can show the distribution of correlations of paralogs pairs across
tissues and proteome. The distributions are more or less centered on 0,
with an extensive tail toward positive values. The function
ParalogsDensityPlot
takes in input the vector containing the
correlation values and plot the density distribution with the barplot
count of the single paralog pairs stored in the variable pTiss$p.all
.
#detach old function, use new one.
rm("ParalogsDensityPlot")
source("../ComplexScript/complexes_function.R")
<- density(AllParalogsCorrelations.trans.prot[AllParalogsCorrelations.trans.prot$dataset=="Transcriptome","corr"])
dens <- data.frame(x=dens$x, y=dens$y)
df <- c(-0.7,-0.4, 0,0.4, 0.7)
probs <- probs
quantiles $quant <- factor(findInterval(df$x,quantiles)) df
We again save the output as a pdf file, using the ParalogsDensityPlot
function to plot the distribution of correlation between pairs, plus a
barplot indicating the specific numbers of paralog pairs.
pdf("../out/plots/p15Tiss.pdf",height = 5,width = 5)
<- ParalogsDensityPlot(AllParalogsCorrelations.tissue.transcr$corr,title = "Correlation of Paralog Pairs - Tissues",x_scale = c(-1,1))
p15Tiss $p.all
p15Tiss
dev.off()
## png
## 2
$p.all p15Tiss
3.7.2 Proteome
We can show the distribution of correlations of paralogs pairs across
tissues proteome. The distributions are more or less centered on 0, with
an extensive tail toward positive values. The function
ParalogsDensityPlot
takes in input the vector containing the
correlation values and plot the density distribution with the barplot
count of the single paralog pairs stored in the variable pTiss$p.all
.
pdf("../out/plots/p40Tiss.pdf",height = 5,width = 5)
<- ParalogsDensityPlot(AllParalogsCorrelations.tissue.prot$corr,title = "Correlation of Paralog Pairs across Tissues - Proteome")
p40Tiss $p.all
p40Tissdev.off()
## png
## 2
3.8 Correlation and sequence Identity – Transcriptome
Now we run the analysis to understand inf there is a specific relation between correlation between paralog pairs and sequence identity. The same as was done for the Zebrafish RNA-Seq Data.
#Create a unique dataframe
<- rbind(ParalogSubunitsCorrelations.tissue.transcr,OtherParalogsCorrelations.tissue.transcr) AllCorrelations.tr
Here we plot the relationship between mean sequence identity and
co-regulation, we use a Generalized Additive Model to plot the
relationship between sequence identity and co-regulation. And run a
correlation test with cor.test
<- c("#D6604D","#4393C3")
ComplexColor
pdf("../out/plots/p04Tiss.pdf",height = 5,width = 5)
<- ggplot(data=AllCorrelations.tr) +
p04Tiss geom_smooth(aes(x=mean.identity,y=corr,color=type)) +
theme_bw() + scale_color_manual(values = ComplexColor) +
theme(legend.position = "top") + xlab("mean identity") +
ylab("correlation") + labs(color="")
#Plot
p04Tissdev.off()
## png
## 2
p04Tiss
#correlations tests
<- cor.test(AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = 105, df = 89078, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3260179 0.3377052
## sample estimates:
## cor
## 0.3318743
<- cor.test(AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = -1.298, df = 1802, p-value = 0.1945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07660547 0.01561082
## sample estimates:
## cor
## -0.03056236
3.9 Correlation and sequence Identity – Proteome
Same approach is also carried out for the proteome data. First the dataset of paralogs that are not complex members and paralogs that are complex members are joined together, then a GAM is fitted to highlight the relationship between sequence identity and co-regulation for paralogs inside protein complexes and outside protein complexes.
#Create a unique dataframe
<- rbind(OtherParalogsCorrelations.tissue.prot,ParalogSubunitsCorrelations.tissue.prot) AllCorrelations.pr
We highlight the relation between the two variable again using a GAM.
pdf("../out/plots/p10Tiss.pdf",height = 5,width = 5)
<- ggplot(data=AllCorrelations.pr) +
p10Tiss geom_smooth(aes(x=mean.identity,y=corr,color=type)) +
theme_bw() +
theme(legend.position = "top") +
ggtitle("Proteome",subtitle = "along 29 Human tissues")
p10Tissdev.off()
## png
## 2
p10Tiss
<- cor.test(AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 46.774, df = 42649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2118491 0.2299039
## sample estimates:
## cor
## 0.2208954
<- cor.test(AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest cortest
##
## Pearson's product-moment correlation
##
## data: AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 3.5045, df = 1302, p-value = 0.000473
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04260577 0.15016480
## sample estimates:
## cor
## 0.0966675
3.10 GOEnrichment on variable Complex paralog pairs
3.10.0.1 Transcriptome
Now we select a subset of paralogs that is localized specifically inside
protein complexes, we store this information in the AllCorr.Compl.tr
variable. This specific variable contains the information for the
correlation of paralog pairs inside protein complexes in reference to
the transcriptome data. In the same block of code we define also stable
and variable paralogs, selecting the bottom 25% of the distribution of
correlations and labeling them as HiglyVariable
. We also take
informations from sequence identity and mark paralog pairs as
HighlySimilar
if their average sequence identity is >= 50%,
<- AllCorrelations.tr[AllCorrelations.tr$type=="Paralogs in Complex",]
AllCorr.Compl.tr
#Select the most variable pairs
<- quantile(AllCorr.Compl.tr$corr)["25%"]
quant
#Variable pairs
<- unlist(AllCorr.Compl.tr[AllCorr.Compl.tr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)
variable.pairs
<- unlist(AllCorr.Compl.tr[,c("p.1","p.2")])
geneSet
$regulation <- ifelse(AllCorr.Compl.tr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.tr$similar <- ifelse(AllCorr.Compl.tr$mean.identity>=50,"HighlySimilar","NotSimilar") AllCorr.Compl.tr
Selecting specifically variable pairs against the rest of the subunits
(geneSet
) we can run a GOEnrichment to understand functions carried
out specifically by variable paralogs.
<- spatialR::GOEnrichment(geneList = geneSet,
GOEnrich.var.Pairs.tr interesting = variable.pairs,
ontology = "BP",species = "Hs")
3.10.0.2 Proteome
We carry on the same passages also for the correlations data coming from
the proteome, carrying on the same exact approach. We first bind
together the datasets, and store the information of paralog pairs
residing in the same complex in the AllCorr.Compl.pr
variable.
#Create a unique dataframe
<- rbind(ParalogSubunitsCorrelations.tissue.prot,OtherParalogsCorrelations.tissue.prot) AllCorrelations.pr
Then we select the bottom 25% of the correlation distribution, we select
variable pairs unlisting the dataframe, after subsetting it for having
AllCorr.Compl.pr$corr<=quant
, we also create the complete gene list to
compare (geneSet
) by unlisting all other paralog pairs in complexes.
Finally we add information on the type of paralogs, HighlyVar
in the
case of having a corr<=quant
and inside the similar
column we add
annotation for paralog pairs, HighlySimilar
in case they have a mean
reciprocal identity >=50%. (mean.identity>=50
)
<- AllCorrelations.pr[AllCorrelations.pr$type=="Paralogs in Complex",]
AllCorr.Compl.pr
#Select the most variable pairs
<- quantile(AllCorr.Compl.pr$corr)["25%"]
quant
#Variable pairs
<- unlist(AllCorr.Compl.pr[AllCorr.Compl.pr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)
variable.pairs
<- unlist(AllCorr.Compl.pr[,c("p.1","p.2")])
geneSet
$regulation <- ifelse(AllCorr.Compl.pr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.pr$similar <- ifelse(AllCorr.Compl.pr$mean.identity>=50,"HighlySimilar","NotSimilar") AllCorr.Compl.pr
Also here as for the transcriptome data, we carry on a GOEnrichment
analysis, using as target list the variable.pairs
and as a background
the geneSet
variable. Again we can also Summarize the GOEnrichment
analysis using the SummarizeGO
function.
<- spatialR::GOEnrichment(geneSet,variable.pairs,ontology = "BP",species = "Hs")
GOEnrich.var.Pairs.pr <- spatialR::SummarizeGO(GOEnrich.var.Pairs.pr,plot = "scatter") GOEnrich.var.Pairs.Summ.pr
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
Finally we can highlight the enrichment coming out from the
SummarizedGO
analysis.
pdf("../out/plots/p09Tiss.pdf",height = 5,width = 5)
<- GOEnrich.var.Pairs.Summ.pr$plot
p09Tiss
p09Tissdev.off()
## png
## 2
p09Tiss
3.11 Protein Complexes Analysis - Proteome
In this part of the script we analyze protein complexes at the level of
the proteome across 29human tissues. First we create a random set of
protein complexes using as reference the same number of complex and
complex size from the original Complexes
assembly, and randomly
sampling proteins coming from rownames(em_proteome)
. The function
GetExpressions
is used to add expression annotation to the Complexes
object. Selecting from protein complexes with at least 5 subunits
matched in the data. We also calculate the median correlation for each
protein complexes with the function cor
. The function subunits
is
used to extract the different subunits present in the Complexes
object.
set.seed(123)
# ComplexAnalysis at the proteome level
<- RandomComplexSet(Complexes,rownames(em_proteome))
RandomComplex <- GetExpressions(RandomComplex,em_proteome,min=5)
RandomComplex
#Complexes
<- GetExpressions(Complexes,em_proteome,min=5)
Complexes
#Calculate Correlations
<- cor(Complexes)
ComplexStability <- cor(RandomComplex)
RandComplexStability
#Sorted Complex , sort complex by stability
<- Complexes[[names(sort(ComplexStability,decreasing = T))]]
Complexes
#Get Ordered subunits by stability
<- subunits(Complexes) Complexes.subunits
Now we create a dataframe DF
that contains the values of median
correlation for each protein complexes. Median correlation is used here
as a proxy for stoichiometries variability/stability. High correlation
indicates complexes that are co-regulated together along tissues, and
viceversa. We just test the distribution of correlation of protein
complexes with a list of random complexes to see that there is an
effective correlation between protein complex members.
#Create ggplotDataframe with Density distributions.
<- data.frame(corr=c(ComplexStability,RandComplexStability))
DF $type <- rep(c("Complexes"),nrow(DF))
DF$type[grep("Random",rownames(DF))] <- "Random Complexes"
DF
#wilcox test
<- wilcox.test(DF$corr[DF$type=="Complexes"],DF$corr[DF$type=="Random Complexes"])
wilc.test $p.value wilc.test
## [1] 8.497342e-62
The plot below shows the correlation density distribution of protein complexes and random protein complexes (grey). It is clear that protein complexes shows higher levels of co-expression compared to random ones.
#Plot Distribution of Correlation of complexes
pdf("../out/plots/p25Tiss.pdf",height = 5,width = 5)
<- ggplot(data=DF) + geom_density(aes(x=corr,fill=type),alpha = 0.3,size=0.5) + theme_minimal() +
p25Tiss scale_fill_manual(values=c("darkgreen","grey")) + scale_color_manual(values=c("darkgreen")) + xlab("Correlations")
p25Tissdev.off()
## png
## 2
p25Tiss
Figure 3.1: Density distribution of the correlations for protein complexes in the proteome data
3.11.1 Variable and Stable complexes along tissues - Proteome
In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.
## Divide in stable and variable Complexes
<- ComplexStability[ComplexStability < quantile(ComplexStability)["25%"]]
VariableComplexes <- ComplexStability[ComplexStability > quantile(ComplexStability)["75%"]] StableComplexes
Then we calculate the median correlation of each subunits with the rest
of its protein complexes, we have a median correlation values that can
help us understand how the subunits is coregulated together with the
complexes across tissues. This is achieved with the functions
SubunitsCoregulation
and GetStability
that helps annotating the
complexes with coexpression information, and retrieving that information
respectively.
#ComplexStability BarPlot
<- data.frame(complex=names(ComplexStability),corr=ComplexStability)
ComplexStability $complex <- factor(ComplexStability$complex,levels = ComplexStability[order(ComplexStability$corr,decreasing = T),"complex"])
ComplexStability
# Subunits Stability in Protein Complexes
<- SubunitsCoregulation(Complexes)
Complexes <- GetStability(Complexes)
StabilityDF <- StabilityDF[complete.cases(StabilityDF),]
StabilityDF colnames(StabilityDF)[1] <- "complex"
We then annotate each subunits with the information coming from the
paralogs dataset, addressing if a specific subunits has paralogs or not,
and we also add information coming from the general size of the complex.
Finally we merge this information with the total correlation of protein
complexes to plot the stability of protein complexes in relationship
with paralogs content. For each subints we also add a type
annotation.
#Stability, paralog proportions and size.
$size <- sapply(StabilityDF$complex,function(x)nrow(StabilityDF[StabilityDF$complex==x,]))
StabilityDF$hasParalogs <- hasParalogs(Paralogs,StabilityDF$subunit)
StabilityDF
#Quantiles
<- quantile(StabilityDF$cv,na.rm = T,c(0.25,0.75))
Quant
#Highly Variable Subunits
<- StabilityDF[StabilityDF$cv<Quant["25%"],]
HighlyVariableSub $type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyVariableSub
<- StabilityDF[StabilityDF$cv>Quant["75%"],]
HighlyStableSub $type<-rep("HighlyStable",nrow(HighlyStableSub))
HighlyStableSub
#Stable and Variable Subunits
<- rbind(HighlyVariableSub,HighlyStableSub) StableDF
Then we calculate for each complex the proportion of paralogs and it is
done with this line below. After calculating the proportion of subunits
that have paralogs in each complex, we merge the dataset by the
complex
column, with the ComplexStability
dataframe, that contains
the co-expression values for protein complexes defined
above.
#Proportion of paralogs subunits and correlations
<- StabilityDF %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability <- as.data.frame(Stability)
Stability <- merge(ComplexStability,Stability,by.x="complex",by.y="complex")
Stability colnames(Stability) <- c("complex","correlation","Paralogs.prop","size")
Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,
First we can plot for each subinits the correlation values and indicates for each specific ones, if it has paralogs or not. We can first run a wilcoxon test, and then
<- wilcox.test(StabilityDF$cv[StabilityDF$hasParalogs],
wx.test $cv[!StabilityDF$hasParalogs]) StabilityDF
Show the specific density profiles in a density plot. Storing the results in a .pdf files.
pdf("../out/plots/p28Tiss.pdf",height = 5,width = 5)
<- ggplot(data=StabilityDF) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) +
p28Tiss scale_fill_manual(values = ParalogsColor) + theme_bw() +
xlab("correlation within complex") +
ggtitle("Within Compl. Correlations - Proteome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p28Tissdev.off()
## png
## 2
p28Tiss
This density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.
#Proportion
<- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
FisherDF colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
<- fisher.test(FisherDF) Fisher
It is now time to plot the results in a barplot, that show exactly what is also recapitulated consistently.
#BarPlot
$type <- factor(StableDF$type,levels = c("HighlyVar","HighlyStable"))
StableDF
pdf("../out/plots/p29Tiss.pdf",height = 5,width = 5)
<- ggplot(data=StableDF) +
p29Tiss geom_bar(aes(x=type,fill=hasParalogs),position = "fill",color="black",alpha=0.5) +
ggtitle("Subunits - Across Tissues") +
scale_fill_manual(values = c("#e6d5b8","#0278ae")) +
theme_minimal() + xlab("Subunits") + ylab("Proportions")
p29Tissdev.off()
## png
## 2
p29Tiss
Then we finally show more clearly the relationship between paralog
proportion and complexes stoichiometric variability. We can clearly see
the trend that reflect that the more subunits that have paralogs are
present in a specific complex the more this specific complex has
variable stoichiometries. By defining stoichiometric variability for
protein complexes as (1-R) and the paralog fraction (Paralogs.prop
) in
each complex The scatterplot highlights the existence of a positive
trend between complexes stoichiometric variability and paralog content.
<- cor(Stability$correlation,Stability$Paralogs.prop)
Correlations
#Proportions Correlation scatterplot
pdf("../out/plots/p30Tiss.pdf",height = 5,width = 5)
<- ggplot(data=Stability) +
p30Tiss geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity") +
theme_minimal() +
geom_smooth(aes(x=1 - correlation,y=Paralogs.prop),
method = "lm",se=F,color="orange") +
ylab("Proportion of Paralogs") + xlab("complex variability (1-R)")
p30Tissdev.off()
## png
## 2
p30Tiss
3.12 Protein Complexes Analysis - Transcriptome
Here we carry on the same analysis that we did, but at the level of the
transcriptome. Loading a new ComplexSet object with the function
ComplexSet. Annotating with expression values coming from the
em_transcript
with the GetExpressions
function. We calculate
correlation with the cor
function and create a dataframe
ComplexStability.transcr
that contains the co-expression values for
protein complexes on the transcriptome data. Before binding the data
together we also add a reference column to the DF
data frame marking
it as coming from the proteome data. Finally we remove random complexes
from the combined data frame.
#Complexes
<- ComplexSet("hsapiens",identifier = "SYMBOL")
Complexes.transcr <- GetExpressions(Complexes.transcr,em_transcript,min=5)
Complexes.transcr
#Calculate Correlations
<- cor(Complexes.transcr)
ComplexStability.transcr
#Create ggplotDataframe with Density distributions.
<- data.frame(corr=c(ComplexStability.transcr))
DF.transcr $type <- rep(c("Complexes"),nrow(DF.transcr))
DF.transcr$data <- "Transcriptome"
DF.transcr
#Add reference column on proteome data
$data <- "Proteome"
DF
#Bind together Proteome and transcriptome removing random from proteome
<- rbind(DF[DF$type!="Random Complexes",],DF.transcr) DF.all
3.12.1 Variable and Stable complexes along tissues - Transcript
In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.
## Divide in stable and variable Complexes
<- ComplexStability.transcr[ComplexStability.transcr < quantile(ComplexStability.transcr)["25%"]]
VariableComplexes <- ComplexStability.transcr[ComplexStability.transcr > quantile(ComplexStability.transcr)["75%"]] StableComplexes
Then we calculate the median correlation of each subunits with the rest
of its protein complexes, we have a median correlation values that can
help us understand how the subunits is coregulated together with the
complexes across tissues.This is achieved with the functions
SubunitsCoregulation
and GetStability
that helps annotating the
complexes with coexpression information, and retrieving that information
respectively.
#ComplexStability.transcr BarPlot
<- data.frame(complex=names(ComplexStability.transcr),corr=ComplexStability.transcr)
ComplexStability.transcr $complex <- factor(ComplexStability.transcr$complex,levels = ComplexStability.transcr[order(ComplexStability.transcr$corr,decreasing = T),"complex"])
ComplexStability.transcr
# Subunits Stability in Protein Complexes
<- SubunitsCoregulation(Complexes.transcr)
Complexes.transcr <- GetStability(Complexes.transcr)
StabilityDF.transcr <- StabilityDF.transcr[complete.cases(StabilityDF.transcr),]
StabilityDF.transcr colnames(StabilityDF.transcr)[1] <- "complex"
We then annotate each subunits with the information coming from the paralogs dataset, addressing if a specific subunits has paralogs or not, and we also add information coming from the genral size of the complex. Finally we merge this information with the total correlation of protein complexes to plot the stability of protein complexes in relationship with paralogs content.
#Stability, paralog proportions and size.
$size <- sapply(StabilityDF.transcr$complex,function(x)nrow(StabilityDF.transcr[StabilityDF.transcr$complex==x,]))
StabilityDF.transcr$hasParalogs <- hasParalogs(Paralogs,StabilityDF.transcr$subunit)
StabilityDF.transcr
#Quantiles
<- quantile(StabilityDF.transcr$cv,na.rm = T,c(0.25,0.75))
Quant
#Highly Variable Subunits
<- StabilityDF.transcr[StabilityDF.transcr$cv<Quant["25%"],];HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyVariableSub <- StabilityDF.transcr[StabilityDF.transcr$cv>Quant["75%"],];HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))
HighlyStableSub
#Stable and Variable Subunits
<- rbind(HighlyVariableSub,HighlyStableSub) StableDF.transcr
We calculate for each complex the proportion of paralogs and it is done with this line below.
#Proportion of paralogs subunits and correlations
<- StabilityDF.transcr %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability.tr <- as.data.frame(Stability.tr)
Stability.tr
<- merge(ComplexStability.transcr,Stability.tr,by.x="complex",by.y="complex")
Stability.tr colnames(Stability.tr) <- c("complex","correlation","Paralogs.prop","size")
Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,
pdf("../out/plots/p32Tiss.pdf",height = 5,width = 5)
<- plot(Stability.tr[,2:4],pch=19,cex=0.7)
p32Tiss p32Tiss
## NULL
dev.off()
## png
## 2
p32Tiss
## NULL
<- wilcox.test(StabilityDF.transcr$cv[StabilityDF.transcr$hasParalogs],
wx.test $cv[!StabilityDF.transcr$hasParalogs])
StabilityDF.transcr
#Density plot
pdf("../out/plots/p34Tiss.pdf",height = 5,width = 5)
<- ggplot(data=StabilityDF.transcr) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) +
p34Tiss scale_fill_manual(values = ParalogsColor) + theme_bw() +
xlab("correlation within complex") +
ggtitle("Within Compl. Correlations - Transcriptome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p34Tissdev.off()
## png
## 2
p34Tiss
This density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.
#Proportion
<- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
FisherDF colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
<- fisher.test(FisherDF) Fisher
It is now time to plot the results in a barplot, that show exactly what is also recapitulaed consistently.
#BarPlot
$type <- factor(StableDF.transcr$type,levels = c("HighlyVar","HighlyStable"))
StableDF.transcr
pdf("../out/plots/p35Tiss.pdf",height = 5,width = 5)
<- ggplot(data=StableDF.transcr) + geom_bar(aes(x=type,fill=hasParalogs),position = "fill") + ggtitle("Subunits - Across Tissues") + theme_minimal()
p35Tiss
p35Tissdev.off()
## png
## 2
p35Tiss
Then we finally show more clearly the relationship between paralog proportion and complexes stoichiometric variability. We can clearly see the trend that reflect that the more subunits that have paralogs are present in a specific complex the more this specific complex has variable stoichiometries.
<- cor(Stability.tr$correlation,Stability.tr$Paralogs.prop)
Correlations
#ComplexBarPlot
pdf("../out/plots/p36Tiss.pdf",height = 5,width = 5)
<- ggplot(data=Stability.tr) + geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity",size=.9) +
p36Tiss theme_bw()+ ylim(0,1) +
geom_smooth(aes(x=1-correlation,y=Paralogs.prop),method = "lm",color="orange") + ylim(0,1) +
ylab("Proportion of Paralogs") + xlab("complex variability (1-R)")
p36Tiss
3.13 Assemble Tables
<- merge(Stability,StabilityDF,by='complex')
Complexes_Table_Hsap_P <- Complexes_Table_Hsap_P[,-7]
Complexes_Table_Hsap_P colnames(Complexes_Table_Hsap_P) <- c('complex','complex.coexpression',
'paralogs.fraction','comp.size',
'subunit','subunits.coexpr.with.complex',
'subunit.hasParalogs')
<- merge(Stability.tr,StabilityDF.transcr,by='complex')
Complexes_Table_Hsap_T <- Complexes_Table_Hsap_T[,-7]
Complexes_Table_Hsap_T colnames(Complexes_Table_Hsap_T) <- c('complex','complex.coexpression',
'paralogs.fraction','comp.size',
'subunit','subunits.coexpr.with.complex',
'subunit.hasParalogs')
3.14 Save workspace
We save the complete workspace under the name
02-Paralogs_AlongTissues_CompleteWorkspace.RData
this saves all the
workspace that is active while writing this markdown. In this ways
should be easy to access back information coming from the different
tables.
Sys.time()
## [1] "2021-07-16 19:59:40 CEST"
save.image("../Markdown/workspaces/02-Paralogs_AlongTissues_CompleteWorkspace.RData")