Classification and regression with random forests as a standard method for presence-only data SDMs: A future conservation example using China tree species

2019 
Abstract: The random forests (RF) algorithm is a superb learner and classifier in machine learning applications. This ensemble model is also one of the most popular species distribution model (SDM) algorithms available to date. By default, RF can produce categorical and numerical species distribution maps based on its classification tree (CT) and regression tree (RT) algorithms, respectively. Statistically, CT can also produce numerical predictions (class probabilities). Many real-world applications (e.g. conservation planning) employ binary presence–absence outputs, which are obtained by applying classification thresholds to numerical predictions. However, little information is available regarding the difference in model performance between CT and RT in inference settings. Here, under an ensemble modeling framework, 52 forest tree species with presence-only data for all of China were selected to compare the performance of the CT and RT algorithms in projecting the distributions and potential range shifts of these species under current and future climates. Five climatic variables were used to develop the CT and RT models. Eight threshold-setting approaches were employed to convert numerical predictions into binary predictions. With regard to probabilistic predictions, the relative performance of CT and RT depended on the choice of evaluation criteria. For both RT and CT, the threshold-setting method significantly altered the thresholds selected, model performance, and, subsequently, the projections of species range shifts under climate change. The four threshold selection methods based on composite model accuracy measures (MaxKappa, MaxOA, MaxTSS, and MinROCdist) most often achieved significantly higher model performance than the CT default threshold method and the other threshold methods. They consistently projected that species' geographical ranges changed in response to climate change with the same direction and magnitude.
We argue for choosing RT rather than CT as the SDM if model discrimination capacity (the ability to differentiate between occurrences of presence and absence) is viewed as more important than model reliability (the agreement between predicted relative indexes of occurrence and observed proportions of occurrence), and for choosing CT rather than RT in the opposite case. In line with gradient theory, we recommend the use of numerical predictions for species distribution modeling, since they convey more information than binary predictions. Binary conversion of model outputs should only be carried out when it is clearly justified by the application's objective. The four aforementioned threshold methods are promising objective methods for binary conversion of continuous predictions when presence-only data are available. This study proposes guidelines on how machine learning can be used for specific applied and theoretical applications in an SDM context.
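
To make the four composite-measure threshold methods concrete, the following is a minimal sketch of how MaxKappa, MaxOA, MaxTSS, and MinROCdist could be computed from numerical predictions. It uses the textbook definitions of overall accuracy, Cohen's kappa, the true skill statistic, and distance to the top-left corner of the ROC plot; it is not the study's own implementation. The function names, the candidate threshold grid, and the toy data are hypothetical, and the "absences" are assumed to be pseudo-absence or background points, since only presences are observed.

```python
from math import sqrt


def confusion(observed, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called 'present'."""
    tp = fp = tn = fn = 0
    for y, s in zip(observed, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_measures(tp, fp, tn, fn):
    """Composite accuracy measures derived from one confusion matrix."""
    n = tp + fp + tn + fn
    sens = tp / (tp + fn) if tp + fn else 0.0  # sensitivity
    spec = tn / (tn + fp) if tn + fp else 0.0  # specificity
    oa = (tp + tn) / n                         # overall accuracy
    # Agreement expected by chance, used by Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 0.0
    tss = sens + spec - 1                      # true skill statistic
    roc_dist = sqrt((1 - sens) ** 2 + (1 - spec) ** 2)
    return {"OA": oa, "Kappa": kappa, "TSS": tss, "ROCdist": roc_dist}


def select_thresholds(observed, scores, candidates=None):
    """Return the threshold chosen by each of the four methods.

    Ties are broken in favour of the lowest candidate threshold.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # assumed grid: 0.00 to 1.00
    table = [
        (t, accuracy_measures(*confusion(observed, scores, t)))
        for t in candidates
    ]
    return {
        "MaxKappa": max(table, key=lambda row: row[1]["Kappa"])[0],
        "MaxOA": max(table, key=lambda row: row[1]["OA"])[0],
        "MaxTSS": max(table, key=lambda row: row[1]["TSS"])[0],
        "MinROCdist": min(table, key=lambda row: row[1]["ROCdist"])[0],
    }


# Toy example: 1 = presence, 0 = pseudo-absence (hypothetical values).
observed = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
thresholds = select_thresholds(observed, scores)
binary_map = [int(s >= thresholds["MaxTSS"]) for s in scores]
```

The same routine applies to either kind of numerical prediction discussed above (CT class probabilities or RT outputs), because it only needs a score per site and a presence or pseudo-absence label.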