TY - JOUR
T1 - Over-sampling methods for mixed data in imbalanced problems
AU - Alonso, Hugo
AU - Pinto da Costa, Joaquim Fernando
N1 - Publisher Copyright: © 2025 Taylor & Francis Group, LLC.
PY - 2025
Y1 - 2025
AB - In practice, it is common to find imbalanced classification problems, where one or more classes have far fewer examples than the others. There are several ways to deal with imbalance in order to improve the classification results in the less represented class(es), and one of them is to apply re-sampling methods. It is equally common for data sets in imbalanced classification problems to be a mix of nominal, ordinal, quantitative discrete and continuous data. However, the true nature of the data tends to be ignored, as when ordinal data are treated as nominal. In this paper, we propose several re-sampling methods for mixed data, which take into account the four scales of measurement usually found in real data. They are based on the popular synthetic minority over-sampling technique, or SMOTE. We consider different distance measures suitable for mixed data. We also introduce new ways of creating the synthetic examples, using all of the nearest neighbors. We show through a comparative study that it pays off to take into account the true nature of the data and to use the new ways of creating synthetic examples.
KW - Classification
KW - Data imbalance
KW - Mixed data
KW - Over-sampling
UR - http://www.scopus.com/inward/record.url?scp=85214368246&partnerID=8YFLogxK
U2 - 10.1080/03610918.2024.2447451
DO - 10.1080/03610918.2024.2447451
M3 - Article
AN - SCOPUS:85214368246
SN - 0361-0918
JO - Communications in Statistics Part B: Simulation and Computation
JF - Communications in Statistics Part B: Simulation and Computation
ER -