******************************************************************************
Author: Barnan Das
Email: barnandas@wsu.edu
Homepage: www.eecs.wsu.edu/~bdas1
Last Updated: June 25, 2012
******************************************************************************

Description of Algorithm:
This code implements SMOTEBoost, an algorithm for handling the class imbalance
problem in data with discrete class labels. It combines SMOTE with the standard
boosting procedure AdaBoost to model the minority class better: the learner is
given not only the minority class examples that were misclassified in the
previous boosting iteration but also a broader representation of those
instances (achieved by SMOTE). Since boosting algorithms give equal weight to
all misclassified examples and sample from a pool of data that consists
predominantly of the majority class, subsequent sampling of the training set is
still skewed towards the majority class. Thus, to reduce the bias inherent in
the learning procedure due to class imbalance and to increase the sampling
weights of the minority class, SMOTE is introduced at each round of boosting.
Introducing SMOTE increases the number of minority class samples available to
the learner and focuses the distribution on these cases at each boosting round.
In addition to maximizing the margin for the skewed class dataset, this
procedure also increases the diversity among the classifiers in the ensemble,
because a different set of synthetic samples is produced at each iteration.
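
To illustrate how SMOTE broadens the representation of minority instances, the
following is a minimal sketch in Python (used here only for illustration; it is
not the Matlab code shipped in this directory). It assumes numeric features and
Euclidean distance; the function name smote, its parameters, and the toy data
are hypothetical.

```python
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical toy data: four minority samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2, rng=random.Random(0))
```

Each synthetic point lies on the line segment between an existing minority
sample and one of its neighbours, so the new samples stay inside the region
already occupied by the minority class.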

For more detail on the theoretical description of the algorithm, please refer
to the following paper:
N.V. Chawla, A. Lazarevic, L.O. Hall, K. Bowyer, "SMOTEBoost: Improving
Prediction of Minority Class in Boosting", Journal of Knowledge Discovery
in Databases: PKDD, 2003.
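
The idea of adding fresh synthetic minority samples at every boosting round
can be sketched as follows, again in Python for illustration only. This is a
simplified sketch under several assumptions: binary labels in {0, 1}, a basic
AdaBoost-style weight update (which may differ from the variant used in the
paper and in SMOTEBoost.m), a decision stump as the weak learner, a uniform
weight for synthetic samples, and a simplified SMOTE step. All function names
and the toy data are hypothetical.

```python
import math
import random


def train_stump(X, y, w):
    """Weak learner: single-feature threshold rule with minimal weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for above in (0, 1):  # label predicted when x[f] > thr
                err = sum(wi for x, t, wi in zip(X, y, w)
                          if (above if x[f] > thr else 1 - above) != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, above)
    _, f, thr, above = best
    return lambda x: above if x[f] > thr else 1 - above


def synthetic_minority(minority, n, rng):
    """Simplified SMOTE: interpolate between two random minority samples
    (real SMOTE restricts the partner to the k nearest neighbours)."""
    out = []
    for _ in range(n):
        a, b = rng.sample(minority, 2)
        gap = rng.random()
        out.append([p + gap * (q - p) for p, q in zip(a, b)])
    return out


def smoteboost(X, y, rounds=5, n_synthetic=10, minority_label=1, seed=0):
    """Binary labels in {0, 1}; returns a predict(x) function."""
    rng = random.Random(seed)
    n = len(X)
    weights = [1.0 / n] * n  # uniform initial distribution
    minority = [x for x, t in zip(X, y) if t == minority_label]
    ensemble = []  # (vote weight, classifier) pairs
    for _ in range(rounds):
        # A fresh set of synthetic minority samples is drawn every round.
        synth = synthetic_minority(minority, n_synthetic, rng)
        # Assumption: each synthetic sample gets the uniform weight 1/n.
        h = train_stump(X + synth, y + [minority_label] * n_synthetic,
                        weights + [1.0 / n] * n_synthetic)
        # The error is measured on the original samples only.
        err = sum(w for w, x, t in zip(weights, X, y) if h(x) != t)
        if err >= 0.5:
            break  # weak learner is no better than chance
        beta = max(err, 1e-10) / (1.0 - max(err, 1e-10))
        # Shrink the weights of correctly classified samples, then renormalise.
        weights = [w * (beta if h(x) == t else 1.0)
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((math.log(1.0 / beta), h))

    def predict(x):
        score = sum(a if h(x) == 1 else -a for a, h in ensemble)
        return 1 if score > 0 else 0
    return predict


# Hypothetical toy data: ten majority (0) and three minority (1) samples.
X = [[i / 10.0] for i in range(10)] + [[2.0], [2.2], [2.4]]
y = [0] * 10 + [1] * 3
predict = smoteboost(X, y)
```

Because the synthetic samples are regenerated in every round, each weak
learner sees a slightly different training set, which is the source of the
ensemble diversity described above.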

Description of Implementation:
The current implementation of SMOTEBoost was written independently by the
author for research purposes. To let users choose among several weak learners
for boosting, it interfaces with the Weka API. Currently, four Weka algorithms
can be used as the weak learner: J48, SMO, IBk, and Logistic.

Files:
weka.jar -> Weka jar file that is called by several Matlab scripts in this
directory.

train.arff, test.arff, resampled.arff -> ARFF (Weka compatible) files generated
by some of the Matlab scripts.

ARFFheader.txt -> Defines the ARFF header for the data file "data.csv". Please
refer to the following link to learn more about the ARFF format:
http://www.cs.waikato.ac.nz/ml/weka/arff.html

SMOTEBoost.m -> Matlab script that implements the SMOTEBoost algorithm. Please
type "help SMOTEBoost" in the Matlab console to see the arguments for this
function.

Test.m -> Matlab script with sample code showing how to use the SMOTEBoost
function in Matlab.

ClassifierTrain.m, ClassifierPredict.m, CSVtoARFF.m -> Matlab functions used by
SMOTEBoost.m
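
For readers unfamiliar with ARFF, the sketch below (in Python, for illustration
only) shows the general layout of an ARFF file: a header that declares the
relation and its attributes, followed by a data section with comma-separated
rows. It is not the logic of CSVtoARFF.m, and the relation name, attribute
names, and values are hypothetical; the actual header for "data.csv" is the one
defined in ARFFheader.txt.

```python
def make_arff(relation, numeric_attributes, class_values, rows):
    """Build ARFF text: a header declaring each attribute, then the data rows."""
    lines = ["@relation " + relation, ""]
    for name in numeric_attributes:
        lines.append("@attribute " + name + " numeric")
    # The class attribute is nominal: its allowed labels are listed in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-feature data set with class labels 0 and 1.
arff_text = make_arff("example", ["f1", "f2"], ["0", "1"],
                      [[1.0, 2.5, 0], [0.3, 1.1, 1]])
print(arff_text)
```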


**************************************xxx**************************************