Title: | Imbalanced Resampling using SMOTE with Boosting (SMOTEWB) |
---|---|
Description: | Provides the SMOTE with Boosting (SMOTEWB) algorithm. See F. Sağlam, M. A. Cengiz (2022) <doi:10.1016/j.eswa.2022.117023>. It is a SMOTE-based resampling technique which creates synthetic data on the links between nearest neighbors. SMOTEWB uses boosting weights to determine where to generate new samples and automatically decides the number of neighbors for each sample. It is robust to noise and outperforms most of the alternatives according to the Matthews Correlation Coefficient metric. Alternative resampling methods are also available in the package. |
Authors: | Fatih Saglam [aut, cre] |
Maintainer: | Fatih Saglam <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.0 |
Built: | 2024-10-22 04:14:44 UTC |
Source: | https://github.com/fatihsaglam/smotewb |
Generates synthetic data for the minority class to balance imbalanced datasets using ADASYN.
ADASYN(x, y, k = 5)
x | feature matrix or data.frame. |
y | a factor class variable with two classes. |
k | number of neighbors. Default is 5. |
Adaptive Synthetic Sampling (ADASYN) is an extension of the Synthetic Minority Over-sampling Technique (SMOTE) algorithm, which is used to generate synthetic examples for the minority class (He et al., 2008). In contrast to SMOTE, ADASYN adaptively generates synthetic examples by focusing on the minority class examples that are harder to learn, meaning those examples that are closer to the decision boundary.
Note: Much faster than smotefamily::ADAS().
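The adaptive weighting can be sketched in a few lines of base R. The helper below (adasyn_counts(), a hypothetical function written only for illustration, not the package's internal code) computes, for each minority sample, the share of majority-class points among its k nearest neighbors and converts those shares into per-sample synthetic counts.

# Illustrative sketch of the ADASYN weighting idea (not the package's internal code).
adasyn_counts <- function(x, y, k = 5, positive = "positive") {
  d <- as.matrix(dist(x))                            # pairwise distances (fine for small data)
  pos_idx <- which(y == positive)
  n_needed <- sum(y != positive) - length(pos_idx)   # synthetic samples needed for balance
  r <- sapply(pos_idx, function(i) {
    nn <- order(d[i, ])[2:(k + 1)]                   # k nearest neighbors, excluding self
    mean(y[nn] != positive)                          # share of majority-class neighbors
  })
  if (sum(r) == 0) r <- rep(1, length(r))            # fall back to uniform weights
  round(n_needed * r / sum(r))                       # per-sample synthetic counts
}
# e.g. head(adasyn_counts(x, y, k = 5)) with the toy data from the example below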
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic data. |
C | Number of synthetic samples for each positive class sample. |
Fatih Saglam, [email protected]
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322-1328). IEEE.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- ADASYN(x = x, y = y, k = 3)
plot(m$x_new, col = m$y_new)
BLSMOTE() applies Borderline-SMOTE, a variation of the SMOTE algorithm that generates synthetic samples only in the vicinity of borderline instances in imbalanced datasets.
BLSMOTE(x, y, k1 = 5, k2 = 5, type = "type1")
x | feature matrix or data.frame. |
y | a factor class variable with two classes. |
k1 | number of neighbors to link. Default is 5. |
k2 | number of neighbors to determine safe levels. Default is 5. |
type | "type1" or "type2". Default is "type1". |
BLSMOTE works by focusing on the instances that are near the decision boundary between the minority and majority classes, known as borderline instances. These instances are more informative and potentially more challenging for classification, and thus generating synthetic samples in their vicinity can be more effective than generating them randomly.
Note: Much faster than smotefamily::BLSMOTE().
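The borderline selection step can be illustrated with a short base R sketch. The helper below (borderline_labels(), hypothetical and simplified, not the package code) labels each minority sample as "safe", "danger" or "noise" from the share of majority-class points among its k2 nearest neighbors; only the "danger" samples are used as seeds for oversampling.

# Simplified sketch of the Borderline-SMOTE danger/safe/noise labelling.
borderline_labels <- function(x, y, k2 = 5, positive = "positive") {
  d <- as.matrix(dist(x))
  pos_idx <- which(y == positive)
  sapply(pos_idx, function(i) {
    nn <- order(d[i, ])[2:(k2 + 1)]        # k2 nearest neighbors, excluding self
    m <- sum(y[nn] != positive)            # majority-class neighbors
    if (m == k2) "noise"                   # surrounded by the majority class
    else if (m >= k2 / 2) "danger"         # borderline: used for oversampling
    else "safe"
  })
}
# e.g. table(borderline_labels(x, y, k2 = 5)) with the toy data from the example below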
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic data. |
C | Number of synthetic samples for each positive class sample. |
Fatih Saglam, [email protected]
Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I 1 (pp. 878-887). Springer Berlin Heidelberg.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- BLSMOTE(x = x, y = y, k1 = 5, k2 = 5)
plot(m$x_new, col = m$y_new)
Resampling with GSMOTE.
GSMOTE(x, y, k = 5, alpha_sel = "combined", alpha_trunc = 0.5, alpha_def = 0.5)
x | feature matrix. |
y | a factor class variable with two classes. |
k | number of neighbors. Default is 5. |
alpha_sel | selection method. Can be "minority", "majority" or "combined". Default is "combined". |
alpha_trunc | truncation factor. A numeric value in [-1, 1]. Default is 0.5. |
alpha_def | deformation factor. A numeric value in [0, 1]. Default is 0.5. |
GSMOTE (Douzas & Bacao, 2019) is an oversampling method which creates synthetic samples geometrically around selected minority samples; see the reference for full details.
NOTE: Cannot work with more than two classes. Only numerical variables are allowed.
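The geometric generation step can be sketched as drawing a point uniformly inside a ball around a selected minority sample, with radius given by the distance to a selected neighbor. The helper below (gsmote_point(), hypothetical and simplified) illustrates only that geometric idea; the actual method additionally applies the truncation and deformation factors to the drawn direction.

# Simplified sketch of the geometric sampling idea (truncation/deformation omitted).
gsmote_point <- function(center, neighbour) {
  radius <- sqrt(sum((neighbour - center)^2))     # distance to the selected neighbour
  e <- rnorm(length(center))                      # random direction ...
  e <- e / sqrt(sum(e^2))                         # ... on the unit hypersphere
  r <- radius * runif(1)^(1 / length(center))     # uniform radius within the ball
  center + r * e                                  # synthetic point inside the ball
}
# e.g. gsmote_point(center = c(5, 5), neighbour = c(5.4, 4.8))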
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic feature data. |
y_syn | Generated synthetic label data. |
Fatih Saglam, [email protected]
Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Information sciences, 501, 118-135.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- GSMOTE(x = x, y = y, k = 7)
plot(m$x_new, col = m$y_new)
Resampling with ROS.
ROS(x, y)
x | feature matrix. |
y | a factor class variable with two classes. |
Random Oversampling (ROS) duplicates randomly selected positive (minority class) samples until balance is achieved.
Can work with more than two classes.
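For the binary case, the whole procedure amounts to sampling minority rows with replacement, as in the hypothetical sketch below (the package function also handles more than two classes).

# Minimal sketch of random oversampling for the binary case (illustrative only).
ros_sketch <- function(x, y, positive = "positive") {
  n_needed <- sum(y != positive) - sum(y == positive)   # copies needed for balance
  idx <- sample(which(y == positive), n_needed, replace = TRUE)
  list(x_new = rbind(x, x[idx, , drop = FALSE]),
       y_new = factor(c(as.character(y), as.character(y[idx]))))
}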
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
Fatih Saglam, [email protected]
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- ROS(x = x, y = y)
plot(m$x_new, col = m$y_new)
Generates synthetic data for each class to balance imbalanced datasets using kernel density estimation. Can be used for multiclass datasets.
ROSE(x, y, h = 1)
x | feature matrix or data.frame. |
y | a factor class variable. Can be multiclass. |
h | a numeric vector of length one or of length equal to the number of classes in y. If a single value is given, all classes use the same shrink factor. If one value per class is given, the values are matched to the classes in the order of levels(y). |
Randomly Over Sampling Examples (ROSE) (Menardi and Torelli, 2014) is an oversampling method which uses conditional kernel densities to balance the dataset. There is already an R package called 'ROSE' (Lunardon et al., 2014); the difference is that this implementation is much faster and can be applied to more than two classes.
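The generation step can be sketched as drawing from a Gaussian kernel centred on a randomly chosen observation of the class, with per-variable bandwidths scaled by the shrink factor h. The helper below (rose_point(), hypothetical and simplified) uses a rule-of-thumb bandwidth as in the reference; it is an illustration, not the package code.

# Rough sketch of drawing one ROSE synthetic point for a given class.
rose_point <- function(x_class, h = 1) {
  n <- nrow(x_class); p <- ncol(x_class)
  bw <- h * (4 / ((p + 2) * n))^(1 / (p + 4)) * apply(x_class, 2, sd)  # rule-of-thumb bandwidth
  i <- sample(n, 1)                                # pick a seed observation of the class
  x_class[i, ] + rnorm(p, mean = 0, sd = bw)       # jitter it by the kernel
}
# e.g. rose_point(x[y == "positive", ], h = 1)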
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
Fatih Saglam, [email protected]
Lunardon, N., Menardi, G., and Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R Journal, 6:82–92.
Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28:92–122.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- ROSE(x = x, y = y, h = c(0.12, 1))
plot(m$x_new, col = m$y_new)
The Relocating Safe-Level SMOTE (RSLS) algorithm improves the quality of synthetic samples generated by Safe-Level SMOTE (SLS) by relocating specific synthetic data points that are too close to the majority class distribution towards the original minority class distribution in the feature space.
RSLSMOTE(x, y, k1 = 5, k2 = 5)
x | feature matrix or data.frame. |
y | a factor class variable with two classes. |
k1 | number of neighbors to link. Default is 5. |
k2 | number of neighbors to determine safe levels. Default is 5. |
In Safe-level SMOTE (SLS), a safe-level threshold is used to control the number of synthetic samples generated from each minority instance. This threshold is calculated based on the number of minority and majority instances in the local neighborhood of each minority instance. SLS generates synthetic samples that are located closer to the original minority class distribution in the feature space.
In Relocating safe-level SMOTE (RSLS), after generating synthetic samples using the SLS algorithm, the algorithm relocates specific synthetic data points that are deemed to be too close to the majority class distribution in the feature space. The relocation process moves these synthetic data points towards the original minority class distribution in the feature space.
This relocation process is performed by first identifying the synthetic data points that are too close to the majority class distribution. Then, for each identified synthetic data point, the algorithm calculates a relocation vector based on the distance between the synthetic data point and its k nearest minority class instances. This relocation vector is used to move the synthetic data point towards the minority class distribution in the feature space.
Note: Much faster than smotefamily::RSLS().
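The relocation step can be sketched, in a deliberately simplified form, as pulling a synthetic point towards its nearest minority neighbors whenever its nearest original neighbor belongs to the majority class. The helper below (relocate_point(), hypothetical; the published rule differs in its details) only illustrates the direction of the adjustment.

# Heavily simplified sketch of relocating one synthetic point (illustrative only).
relocate_point <- function(x_syn_i, x, y, k = 5, positive = "positive", step = 0.5) {
  d <- sqrt(colSums((t(x) - x_syn_i)^2))               # distances to all original samples
  if (y[which.min(d)] != positive) {                   # "too close" to the majority class
    d_pos <- d[y == positive]
    x_pos <- x[y == positive, , drop = FALSE]
    target <- colMeans(x_pos[order(d_pos)[1:k], , drop = FALSE])  # k nearest minority samples
    x_syn_i <- x_syn_i + step * (target - x_syn_i)     # move towards the minority region
  }
  x_syn_i
}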
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic data. |
C | Number of synthetic samples for each positive class sample. |
Fatih Saglam, [email protected]
Siriseriwan, W., & Sinapiromsaran, K. (2016). The effective redistribution for imbalance dataset: Relocating safe-level SMOTE with minority outcast handling. Chiang Mai J. Sci, 43(1), 234-246.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- RSLSMOTE(x = x, y = y, k1 = 5, k2 = 5)
plot(m$x_new, col = m$y_new)
Resampling with RUS.
RUS(x, y)
x | feature matrix. |
y | a factor class variable with two classes. |
Random Undersampling (RUS) randomly removes negative (majority class) samples until balance is achieved.
Can work with more than two classes.
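For the binary case this reduces to keeping all minority rows plus a random majority subset of the same size, as in the hypothetical sketch below (the package function also handles more than two classes).

# Minimal sketch of random undersampling for the binary case (illustrative only).
rus_sketch <- function(x, y, positive = "positive") {
  keep <- c(which(y == positive),
            sample(which(y != positive), sum(y == positive)))  # matched majority subset
  list(x_new = x[keep, , drop = FALSE], y_new = y[keep])
}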
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
Fatih Saglam, [email protected]
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- RUS(x = x, y = y)
plot(m$x_new, col = m$y_new)
Resampling with RWO.
RWO(x, y)
x | feature matrix. |
y | a factor class variable with two classes. |
RWO-Sampling (Random Walk Oversampling; Zhang and Li, 2014) is an oversampling method which generates synthetic data by perturbing observed samples with each variable's standard error, so that the variances of all variables are preserved.
Can work with more than two classes.
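The generation step can be sketched as a random walk away from an observed minority value with step size equal to that variable's standard error, which is what keeps the per-variable means and variances roughly intact. The helper below (rwo_point(), hypothetical) illustrates the idea for one synthetic sample.

# Sketch of drawing one RWO synthetic point from the minority feature matrix x_pos.
rwo_point <- function(x_pos) {
  n <- nrow(x_pos)
  se <- apply(x_pos, 2, sd) / sqrt(n)        # per-variable standard error
  i <- sample(n, 1)                          # pick a minority seed sample
  x_pos[i, ] - se * rnorm(ncol(x_pos))       # take one random-walk step
}
# e.g. rwo_point(x[y == "positive", ])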
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic feature data. |
y_syn | Generated synthetic label data. |
Fatih Saglam, [email protected]
Zhang, H., & Li, M. (2014). RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 20, 99-116.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- RWO(x = x, y = y)
plot(m$x_new, col = m$y_new)
SLSMOTE() generates synthetic samples by considering the safe level of the nearest minority class examples.
SLSMOTE(x, y, k1 = 5, k2 = 5)
x | feature matrix or data.frame. |
y | a factor class variable with two classes. |
k1 | number of neighbors to link. Default is 5. |
k2 | number of neighbors to determine safe levels. Default is 5. |
SLSMOTE uses safe levels to identify the minority class samples that are safe to oversample. The safe level of a sample is the ratio of minority class samples among its k-nearest neighbors; a sample is considered safe to oversample if its safe level exceeds a threshold.
In SLSMOTE, the oversampling process only applies to the safe minority class samples, which avoids the generation of noisy samples that can lead to overfitting. To generate synthetic samples, SLSMOTE randomly selects a minority class sample and finds its k-nearest minority class neighbors. Then, a random minority class neighbor is selected, and a synthetic sample is generated by adding a random proportion of the difference between the selected sample and its neighbor to the selected sample.
Note: Much faster than smotefamily::SLS().
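The role of the safe levels can be sketched through the interpolation gap: the gap is biased towards whichever endpoint has the higher safe level, so synthetic points land in safer regions. The helper below (sls_gap(), hypothetical and simplified relative to the published case analysis) illustrates that rule.

# Simplified sketch of choosing the interpolation gap from two safe levels.
sls_gap <- function(sl_p, sl_n) {
  if (sl_n == 0) 0                                 # unsafe neighbour: duplicate p itself
  else if (sl_p >= sl_n) runif(1, 0, sl_n / sl_p)  # p is safer: stay close to p
  else runif(1, 1 - sl_p / sl_n, 1)                # neighbour is safer: stay close to it
}
# synthetic point: p + sls_gap(sl_p, sl_n) * (nn - p)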
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic data. |
C | Number of synthetic samples for each positive class sample. |
Fatih Saglam, [email protected]
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, April 27-30, 2009 Proceedings 13 (pp. 475-482). Springer Berlin Heidelberg.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- SLSMOTE(x = x, y = y, k1 = 5, k2 = 5)
plot(m$x_new, col = m$y_new)
Resampling with SMOTE.
SMOTE(x, y, k = 5)
x | feature matrix. |
y | a factor class variable with two classes. |
k | number of neighbors. Default is 5. |
SMOTE (Chawla et al., 2002) is an oversampling method which creates links between positive samples and their nearest neighbors and generates synthetic samples along those links.
SMOTE is known to be sensitive to noisy data and may amplify existing noise.
Can work with more than two classes.
Note: Much faster than smotefamily::SMOTE().
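The core interpolation step can be written in a few lines. The helper below (smote_point(), a hypothetical illustration, not the package code) places a synthetic point at a random position on the segment between a minority sample and one of its k nearest minority neighbors.

# Illustrative sketch of one SMOTE interpolation from the minority feature matrix x_pos.
smote_point <- function(x_pos, i, k = 5) {
  d <- sqrt(colSums((t(x_pos) - x_pos[i, ])^2))       # distances within the minority class
  nn <- sample(order(d)[2:(k + 1)], 1)                # pick one of the k nearest neighbours
  x_pos[i, ] + runif(1) * (x_pos[nn, ] - x_pos[i, ])  # random point on the link
}
# e.g. smote_point(x[y == "positive", ], i = 1, k = 5)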
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic feature data. |
y_syn | Generated synthetic label data. |
Fatih Saglam, [email protected]
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- SMOTE(x = x, y = y, k = 7)
plot(m$x_new, col = m$y_new)
Resampling with SMOTE with boosting.
SMOTEWB(x, y, n_weak_classifier = 100, class_weights = NULL, k_max = NULL, ...)
x | feature matrix. |
y | a factor class variable with two classes. |
n_weak_classifier | number of weak classifiers for boosting. |
class_weights | numeric vector of length two. The first value is the weight for the positive class and the second for the negative class. The higher the relative weight, the fewer samples of that class are flagged as noise. Default is NULL. |
k_max | used to increase the maximum number of neighbors. Default is NULL. |
... | additional inputs for ada::ada(). |
SMOTEWB (Saglam & Cengiz, 2022) is a SMOTE-based oversampling method which can handle noisy data and adaptively decides the appropriate number of neighbors to link during resampling with SMOTE.
Models trained on data resampled with this method achieve significantly better Matthews Correlation Coefficient scores than the alternatives.
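A rough intuition for the noise-detection step is that observations which keep accumulating boosting weight are hard to classify and are treated as noise, so they are not used as seeds when the SMOTE links are created. The sketch below (boosting_weights(), hypothetical; it uses rpart stumps rather than ada::ada() and is not the published procedure) only shows how such weights arise.

# Hedged, simplified sketch of boosting weights as a noise signal (illustrative only).
library(rpart)  # stumps as weak learners; an assumption of this sketch, not of SMOTEWB
boosting_weights <- function(x, y, n_rounds = 20) {
  dat <- data.frame(y = y, x)
  w <- rep(1 / nrow(dat), nrow(dat))                       # start with uniform weights
  for (b in seq_len(n_rounds)) {
    stump <- rpart(y ~ ., data = dat, weights = w,
                   control = rpart.control(maxdepth = 1))  # weak learner
    pred <- predict(stump, dat, type = "class")
    err <- sum(w * (pred != y)) / sum(w)
    if (err == 0 || err >= 0.5) break                      # no useful update left
    alpha <- 0.5 * log((1 - err) / err)                    # AdaBoost step size
    w <- w * exp(alpha * ifelse(pred != y, 1, -1))         # up-weight misclassified samples
    w <- w / sum(w)
  }
  w                                                        # persistently high weight = likely noise
}
# e.g. w <- boosting_weights(x, y); summary(w[y == "positive"])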
a list with the resampled dataset.
x_new | Resampled feature matrix. |
y_new | Resampled target variable. |
x_syn | Generated synthetic data. |
w | Boosting weights for the original dataset. |
k | Number of nearest neighbors for positive class samples. |
C | Number of synthetic samples for each positive class sample. |
Fatih Saglam, [email protected]
Sağlam, F., & Cengiz, M. A. (2022). A novel SMOTE-based resampling technique trough noise detection and the boosting procedure. Expert Systems with Applications, 200, 117023.
Currently works with two classes only.
set.seed(1)
x <- rbind(matrix(rnorm(2000, 3, 1), ncol = 2, nrow = 1000),
           matrix(rnorm(100, 5, 1), ncol = 2, nrow = 50))
y <- as.factor(c(rep("negative", 1000), rep("positive", 50)))

plot(x, col = y)

# resampling
m <- SMOTEWB(x = x, y = y, n_weak_classifier = 150)
plot(m$x_new, col = m$y_new)