Classifying commit messages: A case study in resampling techniques

2017 
Many real-world datasets are imbalanced: one of two classes dominates the data. Such datasets are generally difficult to classify using machine learning algorithms because the skew has a significant impact on the training process. To combat this difficulty, many undersampling and oversampling methods have been proposed to generate more balanced datasets that are more easily classifiable. This study applies multiple resampling techniques to a set of commit messages extracted from multiple GitHub and SourceForge projects in order to answer the question, “Do developers discuss design?” The dataset is highly imbalanced, with less than 15% of all commit messages classified as design-related. Results demonstrate that resampling, coupled with various classification algorithms, improves classification over the state of the art by more than 10% in terms of accuracy.
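To make the idea of resampling concrete, the following is a minimal sketch of the two simplest variants, random oversampling and random undersampling. It is an illustration only: the abstract does not state which resampling techniques the study used, and the function names, the toy commit messages, and the class labels are all hypothetical. The toy data has 2 design messages out of 14, which roughly mirrors the under-15% minority share described above.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Group samples by their class label, preserving input order."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    pairs = []
    for label, items in by_class.items():
        # Draw with replacement to fill the gap up to the majority class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((sample, label) for sample in items + extra)
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    pairs = []
    for label, items in by_class.items():
        # Draw without replacement so no sample is repeated.
        pairs.extend((sample, label) for sample in rng.sample(items, target))
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]


# Hypothetical toy data: 2 design-related messages, 12 others.
messages = ["refactor parser into visitor pattern", "extract interface for storage layer"]
messages += [f"fix typo in file {i}" for i in range(12)]
labels = ["design"] * 2 + ["non-design"] * 12

over_x, over_y = random_oversample(messages, labels)
under_x, under_y = random_undersample(messages, labels)
print(Counter(over_y))
print(Counter(under_y))
```

In a real experiment, resampling of this kind would normally be applied to the training split only, so that the evaluation data keeps the original class distribution.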