The Effect of Sampling Methods on Model Performance for Classification of Imbalanced Datasets
- Weidner, Jeremy
- Advisor(s): Wu, Yingnian
Abstract
This paper applies various statistical techniques with the goal of maximizing model performance for the task of classification on a dataset with heavily imbalanced classes. A comprehensive dataset is created by combining several sources. Exploratory data analysis is performed to understand the available factors, their distributions, and their relationship to the outcome variable. The data are then prepared for classification. Next, a collection of training set sampling strategies is outlined, using methods such as Random Over Sampling, Random Under Sampling, and the Synthetic Minority Oversampling Technique (SMOTE). Machine learning models such as Random Forest Classifiers are fitted for each set of parameters, and model fit is evaluated on the test set to provide insight into how the sampling techniques differ in the imbalanced classification task. Metrics used to evaluate model fit include traditional statistical measures as well as other strategies that more closely align with the specific business problem.
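As a rough illustration of the three sampling strategies named in the abstract, the following minimal sketch rebalances a toy training set using only the Python standard library. It is not the implementation used in the paper: the function names (`random_over_sample`, `random_under_sample`, `smote_like`), the toy data, and the parameter `k` are all hypothetical, and `smote_like` is a simplified take on SMOTE that only interpolates between a minority point and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE: build synthetic points on the segment between a
    minority point and one of its k nearest minority neighbours.
    Assumes X_min holds at least two points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        others = [p for p in X_min if p is not base]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy training set: 8 majority rows (class 0), 2 minority rows (class 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.2, 0.4], [0.4, 0.2], [0.5, 0.5], [0.1, 0.4],
     [2.0, 2.0], [2.2, 2.1]]
y = [0] * 8 + [1] * 2

X_over, y_over = random_over_sample(X, y)
X_under, y_under = random_under_sample(X, y)
X_minority = [x for x, lab in zip(X, y) if lab == 1]
X_synth = smote_like(X_minority, n_new=6)

print(Counter(y_over), Counter(y_under), len(X_synth))
```

In each case only the training set would be resampled; the test set keeps its original class balance so that the evaluation reflects the real distribution.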