Improving Hierarchical Clustering of Genotypic Data via Principal Component Analysis

2013 
Understanding the genetic structure of germplasm collections is a prerequisite for effective and efficient use of crop genetic resources in genebanks. Currently, hierarchical clustering techniques are most popular for describing genetic structure in germplasm collections. Traditionally performed using dissimilarities based on raw genotypic data, recent studies have shown that cluster analysis can be improved by first condensing the genotypic data using principal component analysis (PCA). Although the two-step approach (PCA followed by cluster analysis) is gaining popularity, no systematic study into its benefits over traditional clustering methods has been performed. In particular, the relationship between the number of principal components (PCs) to be retained and the performance of cluster analysis have not been established. It is also not clear whether genetic data should be scaled before performing PCA. Here we present a detailed study comparing cluster analysis using distances based on condensed data using significant PCs and clustering based on the full dataset. We also studied the effect of data scaling on PCA-based clustering. Using simulations, we show that in discretely subdivided populations, maximum clustering performance is attained by using a subset of PCs that relate to differentiation between subpopulations and that scaling of the data is key to achieving improvement in PCA-based clustering. For scaled data, we report consistently higher clustering success for PCA, particularly at lower levels of population differentiation, while gains for unscaled data are minor. This is confirmed by real data, where PCA-based clustering of scaled genotypic data leads to visible improvements in resolving finer patterns of geographic subdivision. Our results show clearly that proper scaling and reduction of genotypic data is key to improving clustering performance
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    11
    Citations
    NaN
    KQI
    []