Big data driven outlier detection for soybean straw near infrared spectroscopy

2017 
Abstract: In near infrared spectroscopy (NIRS) analysis, the predictive ability of a model is seriously degraded by outliers, which may result from errors in the spectral measurements, the chemical analysis, or both. This paper describes an outlier detection method based on NIRS analysis data of soybean straw. We improved the resampling by half-mean (RHM) method by adding a confidence interval (IRHM) and combined the IRHM and Cook's distance methods (IRHM-COOK) to detect outlier samples in the NIRS data. The confidence interval is an important parameter of the IRHM-COOK method: the optimal confidence intervals for the IRHM and Cook's distance methods are determined and then used together as the confidence interval for IRHM-COOK. The confidence interval is selected so that the detection of spectrum outliers and the detection of chemical outliers remain relatively independent. Experimental results with a partial least squares regression (PLS) model show that the IRHM-COOK method is superior to the traditional Mahalanobis distance method, the IRHM method alone, and the Cook's distance method alone. The determination coefficient (R²) of a hemicellulose PLS calibration model increased from 0.4397918 to 0.5333039, and the root mean square error (RMSE) decreased from 0.7926415 to 0.7287254. The PLS models for lignin and cellulose also performed better after IRHM-COOK outlier removal than the original models. The results show that the IRHM-COOK method can effectively identify both spectrum outliers and chemical outliers in soybean straw biomass. It is also effective for NIRS analysis data containing only one type of outlier, as demonstrated by an NIRS analysis of starch.
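
The two building blocks named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the IRHM confidence-interval rule or the PLS setup, so the code below assumes the basic RHM procedure (count how often a sample falls among the longest scaled vectors across random half-samples), uses an ordinary least-squares fit as a stand-in for PLS when computing Cook's distance, and applies two hypothetical cut-offs (`freq_cut` and the common 4/n rule of thumb). All function and parameter names are illustrative.

```python
import numpy as np


def rhm_outlier_frequency(X, n_iter=500, top_fraction=0.05, seed=0):
    """Basic resampling by half-means (assumed form, not the paper's IRHM).

    Returns, for each sample, the fraction of resampling rounds in which it
    was among the `top_fraction` longest vectors after scaling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    half = n // 2
    n_top = max(1, int(np.ceil(top_fraction * n)))
    counts = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Draw half of the samples without replacement.
        idx = rng.choice(n, size=half, replace=False)
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # guard against constant columns
        # Scale ALL samples with the half-sample statistics.
        lengths = np.linalg.norm((X - mu) / sd, axis=1)
        counts[np.argsort(lengths)[-n_top:]] += 1
    return counts / n_iter


def cooks_distance(X, y):
    """Cook's distance for a least-squares fit with intercept.

    OLS is an assumption made for brevity; the paper uses a PLS model.
    """
    A = np.column_stack([np.ones(len(y)), X])
    p = A.shape[1]
    H = A @ np.linalg.pinv(A)  # hat matrix
    h = np.diag(H)             # leverages
    resid = y - H @ y
    mse = resid @ resid / (len(y) - p)
    return resid**2 * h / (p * mse * (1.0 - h) ** 2)


def flag_outliers(X, y, freq_cut=0.5, cook_cut=None):
    """Hypothetical combination: spectral flags from RHM, chemical from Cook."""
    if cook_cut is None:
        cook_cut = 4.0 / len(y)  # rule-of-thumb threshold, an assumption
    spectral = rhm_outlier_frequency(X) > freq_cut
    chemical = cooks_distance(X, y) > cook_cut
    return spectral, chemical
```

Calling `flag_outliers(X, y)` with `X` as a low-dimensional feature matrix (for example a few principal-component scores of the spectra) and `y` as the measured chemical values returns two boolean masks, one per outlier type. Raw spectra usually have more wavelengths than samples, so the least-squares stand-in would need such a dimension reduction first.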