Section 9 Process Zebrafish MS-data
In this part of the script, we analize the MaxQuant output from the Zebrafish Neurogenesis data. Carrying out a false discovery rate detection with the fdrtool
package. In this chunck we select for ratios present in at least 2 out of 3 replicates, and we add new column with the Log2 transformed values for the normalized ratios. We also extract the gene name from the FASTA file header.
library(limma)
library(ggplot2)
library(fdrtool)
library(spatialR)
library(proteomicstools)
library(dplyr)
library(paralogstools)
library(RColorBrewer)
source("../ComplexScript/complexes_function.R")
We read the data frame coming from the output of MaxQuant.
#Load Zebrafish MS data
<- read.csv("../Data/Dataset/drerio/combined/txt/proteinGroups.txt",sep = "\t",header = T) Drerio.MSData
Then we subset for only protein present in 2 out of 3 replicates. By selecting rows that satisfy this requirement and store them in the keep
variable.
#Subset only for 2 out of 3 Replicates
<- apply(Drerio.MSData[,grep("Ratio.H.L.normalized.mix",colnames(Drerio.MSData))],1,
keep function(x)sum(is.nan(x))<2)
We subset then the data, and create a new column inside the data frame called Log.Ratio.H.L.normalized
that contains the Log2 transformed normalized Ratio H/L values. We then remove empty values.
#Subset only 2 out of 3 valid ratios
<- Drerio.MSData[keep,]
Drerio.MSData #add Log2 ratios
$Log.Ratio.H.L.normalized <- log2(Drerio.MSData$Ratio.H.L.normalized)
Drerio.MSData#Take only valid ratios
<- Drerio.MSData[!is.na(Drerio.MSData$Ratio.H.L.normalized),] Drerio.MSData
Extract genename from the fasta header, and convert to ’’ empty gene name.
#Take geneName from fasta header
<- stringr::str_extract(Drerio.MSData$Fasta.headers,"GN=\\w+")
genename is.na(genename)] <- ""
genename[
$genename <- gsub("GN=","",genename)
Drerio.MSData$genename <- toupper(as.character(Drerio.MSData$genename)) Drerio.MSData
9.1 fdrtools processing
We apply fdrtool
package in order to detect false discovery rate in proteinGroups ratios.
library(fdrtool)
<- c("Log.Ratio.H.L.normalized")
Ratios <- fdrtool(as.numeric((Drerio.MSData[,Ratios])) , statistic = "normal") fdrtool.res
## Step 1... determine cutoff point
## Step 2... estimate parameters of null distribution and eta0
## Step 3... compute p-values and estimate empirical PDF/CDF
## Step 4... compute q-values and local fdr
## Step 5... prepare for plotting
#add information to dataset
$fdrtool.pval <- fdrtool.res$pval
Drerio.MSData$fdrtool.qval <- fdrtool.res$qval Drerio.MSData
9.2 Ratio H/L Total Intensity
To have a general overview we can plot the results from the Log2 Ratio H/L. Marking differentially expressed proteins in the vector signif
.
$signif <- ifelse(Drerio.MSData$fdrtool.pval<0.05,"Yes","No")
Drerio.MSData$Log.Ratio.H.L.normalized <- as.numeric(Drerio.MSData$Log.Ratio.H.L.normalized)
Drerio.MSData
#Plot
pdf("../out/plots/p01Drerio.pdf",height = 5,width = 5)
<- ggplot(data=Drerio.MSData) +
p01Drerio geom_point(aes(x=Log.Ratio.H.L.normalized,y=log10(Intensity),color=signif)) +
scale_color_manual(values = c("lightgrey","black")) +
theme_bw() +
ggtitle("Neuron/Stem")
p01Dreriodev.off()
## png
## 2
p01Drerio
9.3 Load Paralogs Annotation
We load protein Complexes from the package proteomicstools
, and paralogs genes from paralogstools
. We can in this case just annotate directly the dataframe with information on the single genes if they have paralog or not.
#Load Paralogs, get the Already Cleaned Paralogs from File with filename=""
<- ParalogsSet(organism = "drerio",identifier = "SYMBOL")
Paralogs <- Paralogs@data
Paralogs.df #Load Protein Complexes
<- ComplexSet(organism = "drerio",identifier = "SYMBOL") Complexes
Here we annotate differentially expressed proteins as if they have paralogs or not. We use again the function hasParalogs
function.
<- Drerio.MSData
Data <- "Neuron/Stem"
Condition
#Differentially expressed proteins Proportions
<- nrow(Data[Data$signif=="Yes",])
DiffProteins <- DiffProteins/nrow(Data)
Percentage
##Get Paralogs ####
<- hasParalogs(Paralogs,Data[,"genename"])
has.Paralogs $has.Paralogs <- has.Paralogs Data
9.4 Add condition column
We then add a condition
column to the dataset. We will do this for every dataset in order to be able to easily trace back the different datasets when we will carry out more analysis in the future steps.
$condition <- "Neu.Stem" Data
9.5 Save processed data
We save the processed data with all the informations, in a new file called ZebrafishNeurogProcessed.txt
write.table(Data,"../Data/Dataset/processed/ZebrafishNeurogProcessed.txt",sep = "\t",row.names = F)