Algorithmic Splitting: A Method for Dataset Preparation

Khalid M. Kahloot, Péter Ekler

2021


Datasets that appear in publications are curated and have been split into training, validation, and testing subsets by domain experts. Consequently, machine learning models typically perform well on such hand-prepared splits. Preparing a real-world dataset into comparably curated training, validation, and testing subsets, by contrast, requires extensive effort. Usually, random splits are repeated, with a model trained and evaluated on each one, until a good score on the evaluation metrics is reached. This paper proposes an algorithmic method for preparing the subset splits for machine learning models. The objective of the proposed method is to obtain evenly representative splits of the dataset in a standard, algorithmic way that reduces the perplexity of random splitting.
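The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of one generic way to get evenly representative splits: a seeded, group-stratified split. It assumes each sample already carries a group label (for example a cluster id from a prior cluster analysis); the function name `stratified_split`, the `group_of` callback, the split fractions, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(items, group_of, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split items into train/validation/test lists so that every group
    contributes roughly the same proportions to each subset.

    Illustrative sketch only, not the method proposed in the paper.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    # Bucket the samples by their (assumed, precomputed) group label.
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    train, val, test = [], [], []
    for key in sorted(groups):  # sorted for a deterministic group order
        members = groups[key]
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data: (sample_id, group_id) pairs with four hypothetical groups.
data = [(i, i % 4) for i in range(200)]
train, val, test = stratified_split(
    data, group_of=lambda sample: sample[1], fractions=(0.6, 0.2, 0.2)
)
print(len(train), len(val), len(test))
```

Because the split is computed once per group from a fixed seed, it does not rely on repeating random splits until the evaluation score looks good, which is the practice the abstract argues against.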

Keywords:

Cluster analysis
Perplexity
Machine learning
Data modeling
Domain (software engineering)
Computer science
Artificial intelligence
