Section 3 Paralogs Regulation Along 29 Human Tissue
3.1 Homo sapiens Paralogs
For starting we calculate again some statistics regarding Human paralogs, in particular, how many genes in the Human genome have paralogs ? This is an important question to ask before carrying out the rest of the analysis. We first read the paralogs file and with a similar approach done here, we calculate the proportion of protein coding genes that have paralogs in the human genome.
library(dplyr)
#Load the Zebrafish Paralogs file
Paralogs <- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_paralogs_v102.txt",sep = "\t",header = T)
Paralogs <- unique(Paralogs[,c(1,2)])
#Only protein conding genes
Protein.coding <- read.delim("../Data/Paralogs/hsapiens_ENSEMBL_type_v102.txt",sep = "\t",header = T)
Protein.coding <- Protein.coding[Protein.coding$Transcript.type=="protein_coding",1]
Protein.coding <- unique(Protein.coding)And we filter, for genes only present in the Protein.coding variable
and count the occurrence using the summarize function.
Paralogs <- Paralogs %>%
filter(Gene.stable.ID %in% Protein.coding) %>%
group_by(Gene.stable.ID) %>%
summarize(N.Paralogs=sum(Human.paralogue.gene.stable.ID!=""))Now we can simply calculates the proportion of genes that have paralogs in the genome
human.paralogs.percentage <- sum(Paralogs$N.Paralogs>0)/nrow(Paralogs)
human.paralogs.percentage## [1] 0.7008968
After we can proceed to analyze paralogs behavior and regulation across
29 human tissues from transcriptome and proteome data published in Wang
et al., 2019.
Here we focus in detail about paralogs regulation along tissues, with a
particular focus inside protein complexes. Our built-in package
paralogstools will help us in simplify some operations with paralogs.
We load some build in packages, like spatialR , paralogstools and
proteomicstools, plus some other ones needed for further analysis. We
also load the script "../ComplexScript/complexes_function.R" that
contains useful functions to carry on the analysis.
library(proteomicstools)
library(paralogstools)
library(spatialR)
library(reshape2)
library(biomaRt)
library(ggrepel)
library(ggplot2)
library(dplyr)
library(factoextra)
library(data.table)
library(gridExtra)
source("../ComplexScript/complexes_function.R")3.2 Load files
We load the different files, coming from proteome and transcriptome.
filename <- "../Data/Dataset/tissue_analysis/Wang_et_al_GENE_FPKM_Tissues.txt"
filename_proteome <- "../Data/Dataset/tissue_analysis/Wang_et_al_PROTEOME_IBAQ_Tissues.txt"We use our build in function SetExperimentFromFile to load the matrix
in a structured way. We extract the matrix and we convert in Log2 the
data. For transcriptome, the data are in FPKM. We apply here the same
filtering used by the author, setting to NAs, all measurements that were
below, 1 FPKM. With ParalogsColor we also chooose a standard palette
to use.
#Paralogs Palette
ParalogsColor <- c("#e6d5b8","#0278ae")
#LoadMatrix
em_transcript <- SetExperimentFromFile(filename,exprcol = c(3:31),ids="Gene.name",sep="\t",showFileStructure = F)
em_proteome <- SetExperimentFromFile(filename_proteome,exprcol=c(3:31),ids="Gene.name",showFileStructure = F)
#extract matrix, substitute 0s with NA
em_transcr.exprs <- as.data.frame(assay(em_transcript));em_transcr.exprs[em_transcr.exprs<1] <- NA
em_proteome.exprs <- as.data.frame(assay(em_proteome));em_proteome.exprs[em_proteome.exprs==0] <- NA
#Convert Log2
em_proteome.exprs <- log2(em_proteome.exprs)
em_transcr.exprs <- log2(em_transcr.exprs)We Load the matrices, substitute 0s with NAs and convert the data in log scale.
3.3 Load Human Protein Complex Data
We load the information coming from Homo sapiens protein complexes definitions. That will be used for future analysis.
Complexes <- ComplexSet(organism = "hsapiens",identifier = "SYMBOL")3.4 Load Human Paralogs Data
Now we load paralogs information via the paralogstools package. We use
CleanParalogs to remove duplicated pairs. The ParalogsSet object
contains a dataframe called data that basically contains the paralogs
file of reference.
ParalogsFile <- "../Data/Paralogs/hsapiens_SYMBOL_paralogs_v102.txt"
Paralogs <- ParalogsSet(organism = "hsapiens",identifier = "SYMBOL",filename = ParalogsFile)
#Clean paralogs
Paralogs <- CleanParalogs(Paralogs)
#Extract dataframe
Paralogs.df <- Paralogs@dataAlternatively we can also load a workspace with all calculations already done.
load("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")We select then paralogs that are protein complex members, and other
paralogs by just removing SubunitsParalogs from all the rest of the
paralogs, using the Remove function.
#Paralogs of Protein Complex Subunits
SubunitsParalogs <- ParalogsInComplex(Paralogs,Complexes)
OtherParalogs <- Remove(Paralogs,GetIds(SubunitsParalogs))3.4.1 Add Paralogs annotation
In this chunk, we take the matrix containing sample quantification, and
annotate it, changing rownames, and adding another column hasParalogs
that indicates if a specific gene has paralog in the genome. In order to
do this we use the hasParalogs function from the paralogstools
package.
#Add Ids
em_proteome.exprs$ids <- rownames(em_proteome.exprs)
em_transcr.exprs$ids <- rownames(em_transcr.exprs)
#Add Paralogs
em_proteome.exprs$hasParalogs <- hasParalogs(Paralogs,em_proteome.exprs$id)
em_transcr.exprs$hasParalogs <- hasParalogs(Paralogs,em_transcr.exprs$id)3.4.2 Percentage of Genes that have paralogs
In the end, we can plot the percentage of transcripts that have paralogs
in the tissues data. Using again the SimplePiePlot function.
pdf("../out/plots/p39Tiss.pdf",height = 2.10,width = 5)
p39Tiss <- SimplePiePlot(em_transcr.exprs$hasParalogs) +
theme(title = element_text(size = 7),
legend.position = "bottom")
p39Tiss
dev.off()## png
## 2
p39Tiss3.5 Are paralogs driving tissues variability?
We can address the variability accross tissues at the level of transcriptome and proteome, we calculate the standard deviation of each genes, and we divide it by the mean total expression of the genes across tissues. We see that genes that have paralogs in the genome are generally more variable across tissues both at transcriptome and at proteome level.
Then for each genes, we calculate the coefficient of variation defined as 𝞼 / mean expression across tissues, for both transcriptome and proteome. We then run two-sample Wilcoxon test on the two distribution to address differences.
#Coeff.Variation
em_proteome.exprs$coeff.var <- apply(em_proteome.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
em_transcr.exprs$coeff.var <- apply(em_transcr.exprs[,1:29],1,function(x)sd(x,na.rm = T)/mean(x,na.rm = T))
#Run tests
wx.test <- wilcox.test(em_proteome.exprs$coeff.var[em_proteome.exprs$hasParalogs],
em_proteome.exprs$coeff.var[!em_proteome.exprs$hasParalogs])
wx.test.2 <- wilcox.test(em_transcr.exprs$coeff.var[em_transcr.exprs$hasParalogs],
em_transcr.exprs$coeff.var[!em_transcr.exprs$hasParalogs])Now we can plot the density of the variability for genes that have paralogs and for genes that do not have paralogs in the genome. We can see that proteins that have paralogs are generally more differentially regulated. We save the plot in a directory as a .pdf.
pdf("../out/plots/p01Tiss.pdf",height = 5,width = 5)
p01Tiss <- ggplot(data=em_proteome.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) +
scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() +
ggtitle("Tissues - Proteome",subtitle = paste("Wilcox pvalue",format(wx.test$p.value,digits = 3))) +
xlab(paste("Coefficient of Variation [\u03C3/mean (Prot. Quantity)]"))
p01Tiss
dev.off()## png
## 2
p01TissWe plot the same distribution profiles also for transcript, and observe a similar trend.
pdf("../out/plots/p02Tiss.pdf",height = 5,width = 5)
p02Tiss <- ggplot(data=em_transcr.exprs) + geom_density(aes(x=coeff.var,fill=hasParalogs),alpha=0.2) +
scale_fill_manual(values = c("#e6d5b8","#0278ae")) + theme_bw() +
ggtitle("Tissues - Transcriptome",subtitle = paste("Wilcox pvalue",format(wx.test.2$p.value,digits = 3))) +
xlab(paste("Coefficient of Variation [\u03C3/mean (Gene Expression)]"))
p02Tiss
dev.off()## png
## 2
p02Tiss3.6 Correlate Paralog Pairs
Now we correlate the expression of transcript and proteins of paralog
pairs along the 29 human tissues, considering only paralogs identified
in at least 5 tissues. The function Correlate calculates correlation
for paralog pairs, (taken from the Paralogs object) and the nmin
parameter, defines the minimum number of pairwise complete observation
present in each of the dataset for the correlation to be computed. We do
this for both Paralogs, Subunits paralogs, and all other for both
transcriptome and proteome.
# TRANSCRIPTOME
# Correlation of Paralogs, paralogs subunits, others, transcriptome
AllParalogsCorrelations.tissue.transcr <- Correlate(Paralogs,em_transcript,nmin=5)
ParalogSubunitsCorrelations.tissue.transcr <- Correlate(SubunitsParalogs,em_transcript,nmin=5)
OtherParalogsCorrelations.tissue.transcr <- Correlate(OtherParalogs,em_transcript,nmin=5)
# PROTEOME
AllParalogsCorrelations.tissue.prot <- Correlate(Paralogs,em_proteome,nmin=5)
ParalogSubunitsCorrelations.tissue.prot <- Correlate(SubunitsParalogs,em_proteome,nmin=5)
OtherParalogsCorrelations.tissue.prot <- Correlate(OtherParalogs,em_proteome,nmin=5)We add some annotations from the different datasets, in order to be easily mapped in the script.
#Add column for dataset
AllParalogsCorrelations.tissue.transcr$dataset <- "Transcriptome"
AllParalogsCorrelations.tissue.prot$dataset <- "Proteome"
ParalogSubunitsCorrelations.tissue.transcr$dataset <- "Transcriptome"
ParalogSubunitsCorrelations.tissue.prot$dataset <- "Proteome"
OtherParalogsCorrelations.tissue.transcr$dataset <- "Transcriptome"
OtherParalogsCorrelations.tissue.prot$dataset <- "Proteome"
ParalogSubunitsCorrelations.tissue.transcr$type <- "Paralogs in Complex"
ParalogSubunitsCorrelations.tissue.prot$type <- "Paralogs in Complex"
OtherParalogsCorrelations.tissue.transcr$type <- "Other Paralogs"
OtherParalogsCorrelations.tissue.prot$type <- "Other Paralogs"And save the workspace in case we want to load it easily.
save.image("../Markdown/workspaces/ParalogsAlongTissues_Part01.RData")3.7 Distribution of paralog pairs correlations
We bind together the dataset, and we select only pairs that are present both at transcriptome and both at proteome level, in order to have an equal comparison. In this chuck below we bind together the correlations coming from the transcriptome and proteome data.
#Binds the datasets
AllParalogsCorrelations.trans.prot <- rbind(AllParalogsCorrelations.tissue.prot,AllParalogsCorrelations.tissue.transcr)
AllParalogsCorrelations.trans.prot$IDs <- as.character(AllParalogsCorrelations.trans.prot$IDs)3.7.1 Transcript
We can show the distribution of correlations of paralogs pairs across
tissues and proteome. The distributions are more or less centered on 0,
with an extensive tail toward positive values. The function
ParalogsDensityPlot takes in input the vector containing the
correlation values and plot the density distribution with the barplot
count of the single paralog pairs stored in the variable pTiss$p.all .
#detach old function, use new one.
rm("ParalogsDensityPlot")
source("../ComplexScript/complexes_function.R")
dens <- density(AllParalogsCorrelations.trans.prot[AllParalogsCorrelations.trans.prot$dataset=="Transcriptome","corr"])
df <- data.frame(x=dens$x, y=dens$y)
probs <- c(-0.7,-0.4, 0,0.4, 0.7)
quantiles <- probs
df$quant <- factor(findInterval(df$x,quantiles))We again save the output as a pdf file, using the ParalogsDensityPlot
function to plot the distribution of correlation between pairs, plus a
barplot indicating the specific numbers of paralog pairs.
pdf("../out/plots/p15Tiss.pdf",height = 5,width = 5)
p15Tiss <- ParalogsDensityPlot(AllParalogsCorrelations.tissue.transcr$corr,title = "Correlation of Paralog Pairs - Tissues",x_scale = c(-1,1))
p15Tiss$p.all
dev.off()## png
## 2
p15Tiss$p.all3.7.2 Proteome
We can show the distribution of correlations of paralogs pairs across
tissues proteome. The distributions are more or less centered on 0, with
an extensive tail toward positive values. The function
ParalogsDensityPlot takes in input the vector containing the
correlation values and plot the density distribution with the barplot
count of the single paralog pairs stored in the variable pTiss$p.all .
pdf("../out/plots/p40Tiss.pdf",height = 5,width = 5)
p40Tiss <- ParalogsDensityPlot(AllParalogsCorrelations.tissue.prot$corr,title = "Correlation of Paralog Pairs across Tissues - Proteome")
p40Tiss$p.all
dev.off()## png
## 2
3.8 Correlation and sequence Identity – Transcriptome
Now we run the analysis to understand inf there is a specific relation between correlation between paralog pairs and sequence identity. The same as was done for the Zebrafish RNA-Seq Data.
#Create a unique dataframe
AllCorrelations.tr <- rbind(ParalogSubunitsCorrelations.tissue.transcr,OtherParalogsCorrelations.tissue.transcr)Here we plot the relationship between mean sequence identity and
co-regulation, we use a Generalized Additive Model to plot the
relationship between sequence identity and co-regulation. And run a
correlation test with cor.test
ComplexColor <- c("#D6604D","#4393C3")
pdf("../out/plots/p04Tiss.pdf",height = 5,width = 5)
p04Tiss <- ggplot(data=AllCorrelations.tr) +
geom_smooth(aes(x=mean.identity,y=corr,color=type)) +
theme_bw() + scale_color_manual(values = ComplexColor) +
theme(legend.position = "top") + xlab("mean identity") +
ylab("correlation") + labs(color="")
#Plot
p04Tiss
dev.off()## png
## 2
p04Tiss#correlations tests
cortest <- cor.test(AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest##
## Pearson's product-moment correlation
##
## data: AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = 105, df = 89078, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3260179 0.3377052
## sample estimates:
## cor
## 0.3318743
cortest <- cor.test(AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","corr"],AllCorrelations.tr[!AllCorrelations.tr$type=="Other Paralogs","mean.identity"])
cortest##
## Pearson's product-moment correlation
##
## data: AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "corr"] and AllCorrelations.tr[!AllCorrelations.tr$type == "Other Paralogs", "mean.identity"]
## t = -1.298, df = 1802, p-value = 0.1945
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07660547 0.01561082
## sample estimates:
## cor
## -0.03056236
3.9 Correlation and sequence Identity – Proteome
Same approach is also carried out for the proteome data. First the dataset of paralogs that are not complex members and paralogs that are complex members are joined together, then a GAM is fitted to highlight the relationship between sequence identity and co-regulation for paralogs inside protein complexes and outside protein complexes.
#Create a unique dataframe
AllCorrelations.pr <- rbind(OtherParalogsCorrelations.tissue.prot,ParalogSubunitsCorrelations.tissue.prot)We highlight the relation between the two variable again using a GAM.
pdf("../out/plots/p10Tiss.pdf",height = 5,width = 5)
p10Tiss <- ggplot(data=AllCorrelations.pr) +
geom_smooth(aes(x=mean.identity,y=corr,color=type)) +
theme_bw() +
theme(legend.position = "top") +
ggtitle("Proteome",subtitle = "along 29 Human tissues")
p10Tiss
dev.off()## png
## 2
p10Tisscortest <- cor.test(AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest##
## Pearson's product-moment correlation
##
## data: AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 46.774, df = 42649, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2118491 0.2299039
## sample estimates:
## cor
## 0.2208954
cortest <- cor.test(AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","corr"],AllCorrelations.pr[!AllCorrelations.pr$type=="Other Paralogs","mean.identity"])
cortest##
## Pearson's product-moment correlation
##
## data: AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "corr"] and AllCorrelations.pr[!AllCorrelations.pr$type == "Other Paralogs", "mean.identity"]
## t = 3.5045, df = 1302, p-value = 0.000473
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04260577 0.15016480
## sample estimates:
## cor
## 0.0966675
3.10 GOEnrichment on variable Complex paralog pairs
3.10.0.1 Transcriptome
Now we select a subset of paralogs that is localized specifically inside
protein complexes, we store this information in the AllCorr.Compl.tr
variable. This specific variable contains the information for the
correlation of paralog pairs inside protein complexes in reference to
the transcriptome data. In the same block of code we define also stable
and variable paralogs, selecting the bottom 25% of the distribution of
correlations and labeling them as HiglyVariable . We also take
informations from sequence identity and mark paralog pairs as
HighlySimilar if their average sequence identity is >= 50%,
AllCorr.Compl.tr <- AllCorrelations.tr[AllCorrelations.tr$type=="Paralogs in Complex",]
#Select the most variable pairs
quant <- quantile(AllCorr.Compl.tr$corr)["25%"]
#Variable pairs
variable.pairs <- unlist(AllCorr.Compl.tr[AllCorr.Compl.tr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)
geneSet <- unlist(AllCorr.Compl.tr[,c("p.1","p.2")])
AllCorr.Compl.tr$regulation <- ifelse(AllCorr.Compl.tr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.tr$similar <- ifelse(AllCorr.Compl.tr$mean.identity>=50,"HighlySimilar","NotSimilar")Selecting specifically variable pairs against the rest of the subunits
(geneSet) we can run a GOEnrichment to understand functions carried
out specifically by variable paralogs.
GOEnrich.var.Pairs.tr <- spatialR::GOEnrichment(geneList = geneSet,
interesting = variable.pairs,
ontology = "BP",species = "Hs")3.10.0.2 Proteome
We carry on the same passages also for the correlations data coming from
the proteome, carrying on the same exact approach. We first bind
together the datasets, and store the information of paralog pairs
residing in the same complex in the AllCorr.Compl.pr variable.
#Create a unique dataframe
AllCorrelations.pr <- rbind(ParalogSubunitsCorrelations.tissue.prot,OtherParalogsCorrelations.tissue.prot)Then we select the bottom 25% of the correlation distribution, we select
variable pairs unlisting the dataframe, after subsetting it for having
AllCorr.Compl.pr$corr<=quant, we also create the complete gene list to
compare (geneSet) by unlisting all other paralog pairs in complexes.
Finally we add information on the type of paralogs, HighlyVar in the
case of having a corr<=quant and inside the similar column we add
annotation for paralog pairs, HighlySimilar in case they have a mean
reciprocal identity >=50%. (mean.identity>=50)
AllCorr.Compl.pr <- AllCorrelations.pr[AllCorrelations.pr$type=="Paralogs in Complex",]
#Select the most variable pairs
quant <- quantile(AllCorr.Compl.pr$corr)["25%"]
#Variable pairs
variable.pairs <- unlist(AllCorr.Compl.pr[AllCorr.Compl.pr$corr<=quant,c("p.1","p.2")])
variable.pairs <- unique(variable.pairs)
geneSet <- unlist(AllCorr.Compl.pr[,c("p.1","p.2")])
AllCorr.Compl.pr$regulation <- ifelse(AllCorr.Compl.pr$corr<=quant,"HighlyVar","Other")
AllCorr.Compl.pr$similar <- ifelse(AllCorr.Compl.pr$mean.identity>=50,"HighlySimilar","NotSimilar")Also here as for the transcriptome data, we carry on a GOEnrichment
analysis, using as target list the variable.pairs and as a background
the geneSet variable. Again we can also Summarize the GOEnrichment
analysis using the SummarizeGO function.
GOEnrich.var.Pairs.pr <- spatialR::GOEnrichment(geneSet,variable.pairs,ontology = "BP",species = "Hs")
GOEnrich.var.Pairs.Summ.pr <- spatialR::SummarizeGO(GOEnrich.var.Pairs.pr,plot = "scatter")## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
Finally we can highlight the enrichment coming out from the
SummarizedGO analysis.
pdf("../out/plots/p09Tiss.pdf",height = 5,width = 5)
p09Tiss <- GOEnrich.var.Pairs.Summ.pr$plot
p09Tiss
dev.off()## png
## 2
p09Tiss3.11 Protein Complexes Analysis - Proteome
In this part of the script we analyze protein complexes at the level of
the proteome across 29human tissues. First we create a random set of
protein complexes using as reference the same number of complex and
complex size from the original Complexes assembly, and randomly
sampling proteins coming from rownames(em_proteome). The function
GetExpressions is used to add expression annotation to the Complexes
object. Selecting from protein complexes with at least 5 subunits
matched in the data. We also calculate the median correlation for each
protein complexes with the function cor. The function subunits is
used to extract the different subunits present in the Complexes
object.
set.seed(123)
# ComplexAnalysis at the proteome level
RandomComplex <- RandomComplexSet(Complexes,rownames(em_proteome))
RandomComplex <- GetExpressions(RandomComplex,em_proteome,min=5)
#Complexes
Complexes <- GetExpressions(Complexes,em_proteome,min=5)
#Calculate Correlations
ComplexStability <- cor(Complexes)
RandComplexStability <- cor(RandomComplex)
#Sorted Complex , sort complex by stability
Complexes <- Complexes[[names(sort(ComplexStability,decreasing = T))]]
#Get Ordered subunits by stability
Complexes.subunits <- subunits(Complexes)Now we create a dataframe DF that contains the values of median
correlation for each protein complexes. Median correlation is used here
as a proxy for stoichiometries variability/stability. High correlation
indicates complexes that are co-regulated together along tissues, and
viceversa. We just test the distribution of correlation of protein
complexes with a list of random complexes to see that there is an
effective correlation between protein complex members.
#Create ggplotDataframe with Density distributions.
DF <- data.frame(corr=c(ComplexStability,RandComplexStability))
DF$type <- rep(c("Complexes"),nrow(DF))
DF$type[grep("Random",rownames(DF))] <- "Random Complexes"
#wilcox test
wilc.test <- wilcox.test(DF$corr[DF$type=="Complexes"],DF$corr[DF$type=="Random Complexes"])
wilc.test$p.value## [1] 8.497342e-62
The plot below shows the correlation density distribution of protein complexes and random protein complexes (grey). It is clear that protein complexes shows higher levels of co-expression compared to random ones.
#Plot Distribution of Correlation of complexes
pdf("../out/plots/p25Tiss.pdf",height = 5,width = 5)
p25Tiss <- ggplot(data=DF) + geom_density(aes(x=corr,fill=type),alpha = 0.3,size=0.5) + theme_minimal() +
scale_fill_manual(values=c("darkgreen","grey")) + scale_color_manual(values=c("darkgreen")) + xlab("Correlations")
p25Tiss
dev.off()## png
## 2
p25TissFigure 3.1: Density distribution of the correlations for protein complexes in the proteome data
3.11.1 Variable and Stable complexes along tissues - Proteome
In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.
## Divide in stable and variable Complexes
VariableComplexes <- ComplexStability[ComplexStability < quantile(ComplexStability)["25%"]]
StableComplexes <- ComplexStability[ComplexStability > quantile(ComplexStability)["75%"]]Then we calculate the median correlation of each subunits with the rest
of its protein complexes, we have a median correlation values that can
help us understand how the subunits is coregulated together with the
complexes across tissues. This is achieved with the functions
SubunitsCoregulation and GetStability that helps annotating the
complexes with coexpression information, and retrieving that information
respectively.
#ComplexStability BarPlot
ComplexStability <- data.frame(complex=names(ComplexStability),corr=ComplexStability)
ComplexStability$complex <- factor(ComplexStability$complex,levels = ComplexStability[order(ComplexStability$corr,decreasing = T),"complex"])
# Subunits Stability in Protein Complexes
Complexes <- SubunitsCoregulation(Complexes)
StabilityDF <- GetStability(Complexes)
StabilityDF <- StabilityDF[complete.cases(StabilityDF),]
colnames(StabilityDF)[1] <- "complex"We then annotate each subunits with the information coming from the
paralogs dataset, addressing if a specific subunits has paralogs or not,
and we also add information coming from the general size of the complex.
Finally we merge this information with the total correlation of protein
complexes to plot the stability of protein complexes in relationship
with paralogs content. For each subints we also add a type annotation.
#Stability, paralog proportions and size.
StabilityDF$size <- sapply(StabilityDF$complex,function(x)nrow(StabilityDF[StabilityDF$complex==x,]))
StabilityDF$hasParalogs <- hasParalogs(Paralogs,StabilityDF$subunit)
#Quantiles
Quant <- quantile(StabilityDF$cv,na.rm = T,c(0.25,0.75))
#Highly Variable Subunits
HighlyVariableSub <- StabilityDF[StabilityDF$cv<Quant["25%"],]
HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyStableSub <- StabilityDF[StabilityDF$cv>Quant["75%"],]
HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))
#Stable and Variable Subunits
StableDF <- rbind(HighlyVariableSub,HighlyStableSub)Then we calculate for each complex the proportion of paralogs and it is
done with this line below. After calculating the proportion of subunits
that have paralogs in each complex, we merge the dataset by the
complex column, with the ComplexStability dataframe, that contains
the co-expression values for protein complexes defined
above.
#Proportion of paralogs subunits and correlations
Stability <- StabilityDF %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability <- as.data.frame(Stability)
Stability <- merge(ComplexStability,Stability,by.x="complex",by.y="complex")
colnames(Stability) <- c("complex","correlation","Paralogs.prop","size")Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,
First we can plot for each subinits the correlation values and indicates for each specific ones, if it has paralogs or not. We can first run a wilcoxon test, and then
wx.test <- wilcox.test(StabilityDF$cv[StabilityDF$hasParalogs],
StabilityDF$cv[!StabilityDF$hasParalogs])Show the specific density profiles in a density plot. Storing the results in a .pdf files.
pdf("../out/plots/p28Tiss.pdf",height = 5,width = 5)
p28Tiss <- ggplot(data=StabilityDF) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) +
scale_fill_manual(values = ParalogsColor) + theme_bw() +
xlab("correlation within complex") +
ggtitle("Within Compl. Correlations - Proteome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p28Tiss
dev.off()## png
## 2
p28TissThis density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.
#Proportion
FisherDF <- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
Fisher <- fisher.test(FisherDF)It is now time to plot the results in a barplot, that show exactly what is also recapitulated consistently.
#BarPlot
StableDF$type <- factor(StableDF$type,levels = c("HighlyVar","HighlyStable"))
pdf("../out/plots/p29Tiss.pdf",height = 5,width = 5)
p29Tiss <- ggplot(data=StableDF) +
geom_bar(aes(x=type,fill=hasParalogs),position = "fill",color="black",alpha=0.5) +
ggtitle("Subunits - Across Tissues") +
scale_fill_manual(values = c("#e6d5b8","#0278ae")) +
theme_minimal() + xlab("Subunits") + ylab("Proportions")
p29Tiss
dev.off()## png
## 2
p29TissThen we finally show more clearly the relationship between paralog
proportion and complexes stoichiometric variability. We can clearly see
the trend that reflect that the more subunits that have paralogs are
present in a specific complex the more this specific complex has
variable stoichiometries. By defining stoichiometric variability for
protein complexes as (1-R) and the paralog fraction (Paralogs.prop) in
each complex The scatterplot highlights the existence of a positive
trend between complexes stoichiometric variability and paralog content.
Correlations <- cor(Stability$correlation,Stability$Paralogs.prop)
#Proportions Correlation scatterplot
pdf("../out/plots/p30Tiss.pdf",height = 5,width = 5)
p30Tiss <- ggplot(data=Stability) +
geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity") +
theme_minimal() +
geom_smooth(aes(x=1 - correlation,y=Paralogs.prop),
method = "lm",se=F,color="orange") +
ylab("Proportion of Paralogs") + xlab("complex variability (1-R)")
p30Tiss
dev.off()## png
## 2
p30Tiss3.12 Protein Complexes Analysis - Transcriptome
Here we carry on the same analysis that we did, but at the level of the
transcriptome. Loading a new ComplexSet object with the function
ComplexSet. Annotating with expression values coming from the
em_transcript with the GetExpressions function. We calculate
correlation with the cor function and create a dataframe
ComplexStability.transcr that contains the co-expression values for
protein complexes on the transcriptome data. Before binding the data
together we also add a reference column to the DF data frame marking
it as coming from the proteome data. Finally we remove random complexes
from the combined data frame.
#Complexes
Complexes.transcr <- ComplexSet("hsapiens",identifier = "SYMBOL")
Complexes.transcr <- GetExpressions(Complexes.transcr,em_transcript,min=5)
#Calculate Correlations
ComplexStability.transcr <- cor(Complexes.transcr)
#Create ggplotDataframe with Density distributions.
DF.transcr <- data.frame(corr=c(ComplexStability.transcr))
DF.transcr$type <- rep(c("Complexes"),nrow(DF.transcr))
DF.transcr$data <- "Transcriptome"
#Add reference column on proteome data
DF$data <- "Proteome"
#Bind together Proteome and transcriptome removing random from proteome
DF.all <- rbind(DF[DF$type!="Random Complexes",],DF.transcr)3.12.1 Variable and Stable complexes along tissues - Transcript
In this chunk of code, protein complexes are divided in variable and stable according to their quantile correlation distribution. The upper 25% percentile is considered stable and the lower 25% is considered variable.
## Divide in stable and variable Complexes
VariableComplexes <- ComplexStability.transcr[ComplexStability.transcr < quantile(ComplexStability.transcr)["25%"]]
StableComplexes <- ComplexStability.transcr[ComplexStability.transcr > quantile(ComplexStability.transcr)["75%"]]Then we calculate the median correlation of each subunits with the rest
of its protein complexes, we have a median correlation values that can
help us understand how the subunits is coregulated together with the
complexes across tissues.This is achieved with the functions
SubunitsCoregulation and GetStability that helps annotating the
complexes with coexpression information, and retrieving that information
respectively.
#ComplexStability.transcr BarPlot
ComplexStability.transcr <- data.frame(complex=names(ComplexStability.transcr),corr=ComplexStability.transcr)
ComplexStability.transcr$complex <- factor(ComplexStability.transcr$complex,levels = ComplexStability.transcr[order(ComplexStability.transcr$corr,decreasing = T),"complex"])
# Subunits Stability in Protein Complexes
Complexes.transcr <- SubunitsCoregulation(Complexes.transcr)
StabilityDF.transcr <- GetStability(Complexes.transcr)
StabilityDF.transcr <- StabilityDF.transcr[complete.cases(StabilityDF.transcr),]
colnames(StabilityDF.transcr)[1] <- "complex"We then annotate each subunits with the information coming from the paralogs dataset, addressing if a specific subunits has paralogs or not, and we also add information coming from the genral size of the complex. Finally we merge this information with the total correlation of protein complexes to plot the stability of protein complexes in relationship with paralogs content.
#Stability, paralog proportions and size.
StabilityDF.transcr$size <- sapply(StabilityDF.transcr$complex,function(x)nrow(StabilityDF.transcr[StabilityDF.transcr$complex==x,]))
StabilityDF.transcr$hasParalogs <- hasParalogs(Paralogs,StabilityDF.transcr$subunit)
#Quantiles
Quant <- quantile(StabilityDF.transcr$cv,na.rm = T,c(0.25,0.75))
#Highly Variable Subunits
HighlyVariableSub <- StabilityDF.transcr[StabilityDF.transcr$cv<Quant["25%"],];HighlyVariableSub$type<-rep("HighlyVar",nrow(HighlyVariableSub))
HighlyStableSub <- StabilityDF.transcr[StabilityDF.transcr$cv>Quant["75%"],];HighlyStableSub$type<-rep("HighlyStable",nrow(HighlyStableSub))
#Stable and Variable Subunits
StableDF.transcr <- rbind(HighlyVariableSub,HighlyStableSub)We calculate for each complex the proportion of paralogs and it is done with this line below.
#Proportion of paralogs subunits and correlations
Stability.tr <- StabilityDF.transcr %>% group_by(complex) %>% summarize(prop =sum(hasParalogs)/n(),size=n())
Stability.tr <- as.data.frame(Stability.tr)
Stability.tr <- merge(ComplexStability.transcr,Stability.tr,by.x="complex",by.y="complex")
colnames(Stability.tr) <- c("complex","correlation","Paralogs.prop","size")Finally we can plot the general distribution of the variables we calculated. To understand if there are some specific relation between variables. We can see as for the zebrafish development data that there is a relation between paralogs proportions and coregulation,
pdf("../out/plots/p32Tiss.pdf",height = 5,width = 5)
p32Tiss <- plot(Stability.tr[,2:4],pch=19,cex=0.7)
p32Tiss## NULL
dev.off()## png
## 2
p32Tiss## NULL
wx.test <- wilcox.test(StabilityDF.transcr$cv[StabilityDF.transcr$hasParalogs],
StabilityDF.transcr$cv[!StabilityDF.transcr$hasParalogs])
#Density plot
pdf("../out/plots/p34Tiss.pdf",height = 5,width = 5)
p34Tiss <- ggplot(data=StabilityDF.transcr) + geom_density(aes(x=cv,fill=hasParalogs),alpha=0.4) +
scale_fill_manual(values = ParalogsColor) + theme_bw() +
xlab("correlation within complex") +
ggtitle("Within Compl. Correlations - Transcriptome",subtitle = paste("Wilcox pvalue =",format(wx.test$p.value,digits = 2)))
p34Tiss
dev.off()## png
## 2
p34TissThis density plot highlight the fact that genes that have paralogs are on average more distant = less correlated to their complex, compared to subunits that do not have paralogs. We can also represent this with a barplot that highlight the variable subunits and the stable one, indicating if those specific subunits have paralogs. In this case we bind the dataframe together and run a fisher test for addressing differences in proportions.
#Proportion
FisherDF <- t(rbind(table(HighlyVariableSub$hasParalogs),table(HighlyStableSub$hasParalogs)))
colnames(FisherDF) <- c("Variable Complexes","Stable Complexes")
rownames(FisherDF) <- c("No Paralogs","Has Paralogs")
Fisher <- fisher.test(FisherDF)It is now time to plot the results in a barplot, that show exactly what is also recapitulaed consistently.
#BarPlot
StableDF.transcr$type <- factor(StableDF.transcr$type,levels = c("HighlyVar","HighlyStable"))
pdf("../out/plots/p35Tiss.pdf",height = 5,width = 5)
p35Tiss <- ggplot(data=StableDF.transcr) + geom_bar(aes(x=type,fill=hasParalogs),position = "fill") + ggtitle("Subunits - Across Tissues") + theme_minimal()
p35Tiss
dev.off()## png
## 2
p35TissThen we finally show more clearly the relationship between paralog proportion and complexes stoichiometric variability. We can clearly see the trend that reflect that the more subunits that have paralogs are present in a specific complex the more this specific complex has variable stoichiometries.
Correlations <- cor(Stability.tr$correlation,Stability.tr$Paralogs.prop)
#ComplexBarPlot
pdf("../out/plots/p36Tiss.pdf",height = 5,width = 5)
p36Tiss <- ggplot(data=Stability.tr) + geom_point(aes(x=1-correlation,y=Paralogs.prop),stat = "identity",size=.9) +
theme_bw()+ ylim(0,1) +
geom_smooth(aes(x=1-correlation,y=Paralogs.prop),method = "lm",color="orange") + ylim(0,1) +
ylab("Proportion of Paralogs") + xlab("complex variability (1-R)")
p36Tiss 3.13 Assemble Tables
Complexes_Table_Hsap_P <- merge(Stability,StabilityDF,by='complex')
Complexes_Table_Hsap_P <- Complexes_Table_Hsap_P[,-7]
colnames(Complexes_Table_Hsap_P) <- c('complex','complex.coexpression',
'paralogs.fraction','comp.size',
'subunit','subunits.coexpr.with.complex',
'subunit.hasParalogs')
Complexes_Table_Hsap_T <- merge(Stability.tr,StabilityDF.transcr,by='complex')
Complexes_Table_Hsap_T <- Complexes_Table_Hsap_T[,-7]
colnames(Complexes_Table_Hsap_T) <- c('complex','complex.coexpression',
'paralogs.fraction','comp.size',
'subunit','subunits.coexpr.with.complex',
'subunit.hasParalogs')3.14 Save workspace
We save the complete workspace under the name
02-Paralogs_AlongTissues_CompleteWorkspace.RData this saves all the
workspace that is active while writing this markdown. In this ways
should be easy to access back information coming from the different
tables.
Sys.time()## [1] "2021-07-16 19:59:40 CEST"
save.image("../Markdown/workspaces/02-Paralogs_AlongTissues_CompleteWorkspace.RData")