EASE 2014 Replication Package

This is the accompanying website for the EASE'14 paper: Preliminary Comparison of Techniques for Dealing with Imbalance in Software Defect Prediction.

Imbalanced data is a common problem in data mining classification tasks, where the samples of one class vastly outnumber those of the other classes. In this situation, many data mining algorithms generate poor models because they try to optimize overall accuracy and therefore perform badly on classes with very few samples. Software engineering data in general, and defect prediction datasets in particular, are no exception. In this paper, we compare different approaches to the problem of defect prediction, namely sampling, cost-sensitive, ensemble and hybrid approaches, using several datasets preprocessed in different ways. We have used the well-known NASA datasets curated by Shepperd et al. The results differ depending on the characteristics of the dataset and the evaluation metrics, especially when duplicates and inconsistencies are removed as a preprocessing step.
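The accuracy pitfall described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the 95/5 class split, the label encoding (0 = non-defective, 1 = defective) and the helper names accuracy and recall are assumptions chosen only for illustration. A trivial model that always predicts the majority class reaches high accuracy while finding no defective modules at all.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels: 95 non-defective modules (0) and 5 defective ones (1).
y_true = [0] * 95 + [1] * 5

# A trivial "classifier" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive=1)
print(f"accuracy={acc:.2f}, recall on defective class={rec:.2f}")
```

This is why evaluation metrics that look at the minority class separately tend to be more informative than overall accuracy on imbalanced defect data.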

Weka output of the 5x5CV (zip file)

Required software

Instructions

Load the provided data into Weka's Experimenter.
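For readers who want to see what the evaluation scheme looks like outside Weka, here is a rough Python sketch. It assumes that "5x5CV" denotes five repetitions of stratified 5-fold cross-validation; it is not Weka's implementation, and the function name repeated_stratified_folds, the seed and the 90/10 label split are hypothetical.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_folds=5, n_repeats=5, seed=1):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified CV."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_folds)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_folds].append(idx)
        for fold, test_idx in enumerate(folds):
            test_set = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in test_set]
            yield repeat, fold, train_idx, sorted(test_idx)


# Hypothetical imbalanced labels: 90 non-defective (0), 10 defective (1).
labels = [0] * 90 + [1] * 10
splits = list(repeated_stratified_folds(labels))
print(len(splits))  # 5 repeats x 5 folds = 25 train/test splits
```

Stratifying the folds matters for imbalanced data because a plain random split could leave a test fold with no defective samples at all.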

Where can I get the paper?

The paper has been accepted at the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE'14). You can download a preprint from UAH.

E-mail daniel.rodriguezg@uah.es for any inquiries about this paper.