Guidance for RNA-seq co-expression estimates: the importance of data normalization, batch effects, and correlation measures

2021 
Motivation: Gene co-expression analysis is an attractive tool for leveraging the enormous number of public RNA-seq datasets to predict gene functions and regulatory mechanisms. However, the optimal data processing steps for obtaining high-quality co-expression networks from such large datasets remain unclear. In particular, the importance of batch effect correction is understudied.

Results: We conducted a systematic analysis of 50 different data processing workflows and applied them to RNA-seq data of 68 human and 76 mouse cell types and tissues. We analyzed the resulting 7,200 gene co-expression networks and identified the factors that contribute to their quality, focusing on data normalization, batch effect correction, and the measure of correlation. We confirmed the key importance of large sample counts for generating high-quality networks. However, choosing a suitable normalization approach and applying batch effect correction can further improve the quality of co-expression networks, equivalent to a >70% and >40% increase in sample count, respectively. Finally, Pearson correlation appears more suitable than Spearman correlation, except for smaller datasets.

Conclusion: A key point in constructing high-quality gene co-expression networks is the collection of many samples. However, paying attention to data normalization, batch effects, and the measure of correlation can significantly improve the quality of co-expression estimates.
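To make the three workflow stages named above concrete (normalization, batch effect correction, correlation measure), the following is a minimal sketch in Python with NumPy. It is not the authors' pipeline: the abstract does not name the specific methods compared, so counts-per-million scaling with a log2 transform and per-batch mean centering are used here purely as simple stand-ins, and all function names (`cpm_log2`, `center_by_batch`, `pearson_matrix`, `spearman_matrix`) are hypothetical.

```python
import numpy as np


def cpm_log2(counts, pseudocount=1.0):
    """Scale a genes x samples count matrix to counts per million, then log2.

    Assumption: CPM + log2 is only one of many possible normalizations.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)  # total reads per sample
    cpm = counts / library_sizes * 1e6
    return np.log2(cpm + pseudocount)


def center_by_batch(expr, batches):
    """Naive batch correction: subtract each gene's mean within each batch.

    Assumption: a simple stand-in for dedicated batch correction methods.
    """
    expr = np.array(expr, dtype=float)  # copy so the input is not modified
    batches = np.asarray(batches)
    for batch in np.unique(batches):
        cols = batches == batch
        expr[:, cols] -= expr[:, cols].mean(axis=1, keepdims=True)
    return expr


def average_ranks(values):
    """Rank a 1-D array from 1..n, giving tied values their average rank."""
    _, inverse, group_sizes = np.unique(
        values, return_inverse=True, return_counts=True
    )
    last_rank = np.cumsum(group_sizes)        # last rank in each tie group
    first_rank = last_rank - group_sizes + 1  # first rank in each tie group
    return ((first_rank + last_rank) / 2.0)[inverse]


def pearson_matrix(expr):
    """Gene x gene Pearson correlation (rows of expr are genes)."""
    return np.corrcoef(expr)


def spearman_matrix(expr):
    """Gene x gene Spearman correlation: Pearson correlation of the ranks."""
    ranks = np.vstack([average_ranks(row) for row in np.asarray(expr)])
    return np.corrcoef(ranks)
```

A typical call chain under these assumptions would be `spearman_matrix(center_by_batch(cpm_log2(counts), batches))`, or the same with `pearson_matrix`. Genes with constant expression across all samples produce undefined correlations and should be filtered out beforehand.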