Randomizing outputs to increase variable selection accuracy

2016 
Variable selection plays a key role in explanatory modeling and its aim is to identify the variables that are truly important to the outcome. Recently, ensemble learning techniques have manifested great potential in improving the performance of some traditional methods such as lasso, genetic algorithm, stepwise search. Following the main principle to build a variable selection ensemble, we propose in this paper a novel approach by randomizing outputs (i.e., adding some random noise to the response) to maximize variable selection accuracy. In order to generate multiple but slightly different importance measures for each variable, some Gaussian noise is artificially added to the response. The new training set (i.e, the original design matrix together with the new response vector) is then fed into genetic algorithm to perform variable selection. By repeating this process a number of trials and fusing the results by simple averaging, a more reliable importance measure is obtained for each candidate variable. The variables are then ranked and further determined to be important or not by a thresholding rule. The performance of the proposed method is studied with some simulated and real-world data in the framework of linear and logistic regression models. The results demonstrate that it compares favorably with several other existing methods. HighlightsEnsemble learning methods are used to perform variable ranking and selection.Output smearing technique is employed to build a variable selection ensemble.A parameter is introduced to control the amount of added random noise.The measure averaged across multiple trials is used to rank and select variables.Experiments on simulated and real data sets show the efficiency of new method.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    49
    References
    6
    Citations
    NaN
    KQI
    []