Synthetic sampling from small datasets: A modified mega-trend diffusion approach using k-nearest neighbors

2021 
Abstract Data generation techniques have been an emerging trend in machine learning over the last two decades. Despite the wide availability of data, small datasets remain a problem for decision making. Synthetic data generation is a promising alternative for the small dataset problem. However, previous methodologies address data generation for only one type of task, either supervised or unsupervised. A modified Mega-Trend Diffusion (MTD) approach, k-Nearest Neighbor Mega-Trend Diffusion (kNNMTD), is proposed in this research to address these challenges. The method identifies the closest subsamples using the k-Nearest Neighbor (kNN) algorithm and applies MTD to the subsample neighbors to estimate the domain ranges. The proposed methodology can generate data for any data-driven task. kNNMTD is compared with baseline MTD, CTGAN, and the synthetic minority oversampling technique (SMOTE) for classification tasks, and with SMOTE for regression (SmoteR) for regression tasks. The proposed method is validated on several benchmark datasets and simulated datasets, along with a case study. Pairwise correlation difference (PCD) is used to measure the similarity between real and synthetic datasets. kNNMTD outperforms baseline MTD and CTGAN on all datasets, and the differences are statistically significant. On some of the benchmark datasets, kNNMTD also shows low average PCD values and statistically significant differences against SMOTE and SmoteR. In the case study, kNNMTD generates data with the lowest PCD values among the compared methods for both the classification (1.2077) and ordinal regression (1.6017) tasks.
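The PCD metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of PCD as the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic data; the abstract does not state the formula, so this definition, the function name `pairwise_correlation_difference`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def pairwise_correlation_difference(real, synthetic):
    """Frobenius norm of the difference between two correlation matrices.

    Assumed definition of PCD; rows are samples, columns are features.
    Lower values mean the synthetic data preserves the pairwise
    correlations of the real data more closely.
    """
    real_corr = np.corrcoef(np.asarray(real, dtype=float), rowvar=False)
    syn_corr = np.corrcoef(np.asarray(synthetic, dtype=float), rowvar=False)
    return float(np.linalg.norm(real_corr - syn_corr, ord="fro"))


# Toy data (hypothetical): a small "real" dataset with two correlated features.
rng = np.random.default_rng(0)
real = rng.normal(size=(30, 4))
real[:, 1] += 0.8 * real[:, 0]

# An exact copy preserves every pairwise correlation.
identical = real.copy()

# Shuffling each column independently keeps the marginals
# but breaks the relationships between features.
shuffled = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

pcd_same = pairwise_correlation_difference(real, identical)
pcd_shuffled = pairwise_correlation_difference(real, shuffled)
```

Under this assumed definition, a synthetic dataset that reproduces the real correlation structure scores near zero, while one that only matches the marginal distributions scores higher, which is why lower PCD is read as higher similarity.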