Section 9 Process Zebrafish MS-data

In this part of the script, we analize the MaxQuant output from the Zebrafish Neurogenesis data. Carrying out a false discovery rate detection with the fdrtool package. In this chunck we select for ratios present in at least 2 out of 3 replicates, and we add new column with the Log2 transformed values for the normalized ratios. We also extract the gene name from the FASTA file header.

library(limma)
library(ggplot2)
library(fdrtool)
library(spatialR)
library(proteomicstools)
library(dplyr)
library(paralogstools)
library(RColorBrewer)
source("../ComplexScript/complexes_function.R")

We read the data frame coming from the output of MaxQuant.

#Load Zebrafish MS data
Drerio.MSData <- read.csv("../Data/Dataset/drerio/combined/txt/proteinGroups.txt",sep = "\t",header = T)

Then we subset for only protein present in 2 out of 3 replicates. By selecting rows that satisfy this requirement and store them in the keep variable.

#Subset only for 2 out of 3 Replicates
keep <- apply(Drerio.MSData[,grep("Ratio.H.L.normalized.mix",colnames(Drerio.MSData))],1,
              function(x)sum(is.nan(x))<2)

We subset then the data, and create a new column inside the data frame called Log.Ratio.H.L.normalized that contains the Log2 transformed normalized Ratio H/L values. We then remove empty values.

#Subset only 2 out of 3 valid ratios
Drerio.MSData <- Drerio.MSData[keep,]
#add Log2 ratios
Drerio.MSData$Log.Ratio.H.L.normalized <- log2(Drerio.MSData$Ratio.H.L.normalized)
#Take only valid ratios
Drerio.MSData <- Drerio.MSData[!is.na(Drerio.MSData$Ratio.H.L.normalized),]

Extract genename from the fasta header, and convert to ’’ empty gene name.

#Take geneName from fasta header
genename <- stringr::str_extract(Drerio.MSData$Fasta.headers,"GN=\\w+")
genename[is.na(genename)] <- ""

Drerio.MSData$genename <- gsub("GN=","",genename)
Drerio.MSData$genename <- toupper(as.character(Drerio.MSData$genename))

9.1 fdrtools processing

We apply fdrtool package in order to detect false discovery rate in proteinGroups ratios.

library(fdrtool)
Ratios <- c("Log.Ratio.H.L.normalized")
fdrtool.res <- fdrtool(as.numeric((Drerio.MSData[,Ratios])) , statistic = "normal")
## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting

#add information to dataset
Drerio.MSData$fdrtool.pval <- fdrtool.res$pval
Drerio.MSData$fdrtool.qval <- fdrtool.res$qval

9.2 Ratio H/L Total Intensity

To have a general overview we can plot the results from the Log2 Ratio H/L. Marking differentially expressed proteins in the vector signif.

Drerio.MSData$signif <- ifelse(Drerio.MSData$fdrtool.pval<0.05,"Yes","No")
Drerio.MSData$Log.Ratio.H.L.normalized <- as.numeric(Drerio.MSData$Log.Ratio.H.L.normalized)

#Plot
pdf("../out/plots/p01Drerio.pdf",height = 5,width = 5)
p01Drerio <- ggplot(data=Drerio.MSData) + 
            geom_point(aes(x=Log.Ratio.H.L.normalized,y=log10(Intensity),color=signif)) +
            scale_color_manual(values = c("lightgrey","black")) + 
            theme_bw() + 
            ggtitle("Neuron/Stem")
p01Drerio
dev.off()
## png 
##   2
p01Drerio

9.3 Load Paralogs Annotation

We load protein Complexes from the package proteomicstools, and paralogs genes from paralogstools. We can in this case just annotate directly the dataframe with information on the single genes if they have paralog or not.

#Load Paralogs, get the Already Cleaned Paralogs from File with filename=""
Paralogs <- ParalogsSet(organism = "drerio",identifier = "SYMBOL")
Paralogs.df <- Paralogs@data
#Load Protein Complexes
Complexes <- ComplexSet(organism = "drerio",identifier = "SYMBOL")

Here we annotate differentially expressed proteins as if they have paralogs or not. We use again the function hasParalogs function.

Data <- Drerio.MSData
Condition <- "Neuron/Stem"

#Differentially expressed proteins Proportions
DiffProteins <- nrow(Data[Data$signif=="Yes",])
Percentage <- DiffProteins/nrow(Data)

##Get Paralogs ####
has.Paralogs <- hasParalogs(Paralogs,Data[,"genename"])
Data$has.Paralogs <- has.Paralogs

9.4 Add condition column

We then add a condition column to the dataset. We will do this for every dataset in order to be able to easily trace back the different datasets when we will carry out more analysis in the future steps.

Data$condition <- "Neu.Stem"

9.5 Save processed data

We save the processed data with all the informations, in a new file called ZebrafishNeurogProcessed.txt

write.table(Data,"../Data/Dataset/processed/ZebrafishNeurogProcessed.txt",sep = "\t",row.names = F)