Introduction to basemodels

library(basemodels)

Introduction

This is an introduction to the basemodels package. The package provides R equivalents of the dummy classifier and dummy regressor in Python's scikit-learn library, with some modifications. Our aim is to help R users easily establish baseline performance for their classification and regression problems. These baseline models do not use any predictors to make predictions. They are useful in cases of class imbalance, in multi-class classification, and whenever users want to quickly compare their statistical and machine learning models against simple baselines to see how much they have improved.

Examples

We show a few examples here. First, we split the iris data into training and testing sets.

set.seed(2023)
index <- sample(1:nrow(iris), nrow(iris) * 0.8)
train_data <- iris[index,]
test_data <- iris[-index,]
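
Before fitting a baseline, it helps to look at the class distribution in the training set, since the stratified strategy used below samples predicted labels according to these training frequencies (as in scikit-learn). A quick check with base R:

# Class counts and proportions in the training set
table(train_data$Species)
prop.table(table(train_data$Species))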

We can pass the dummyClassifier method to the train() function in the caret package.

ctrl1 <- caret::trainControl(method = "none")
# Train a dummy classifier with caret
dummy_model <- caret::train(Species ~ ., 
                            data = train_data,
                            method = dummyClassifier,
                            strategy = "stratified",
                            trControl = ctrl1)

# Make predictions using the trained dummy classifier
pred_vec <- predict(dummy_model, test_data)

# Evaluate the performance of the dummy classifier
conf_matrix <- caret::confusionMatrix(pred_vec, test_data$Species)
print(conf_matrix)
#> Confusion Matrix and Statistics
#> 
#>             Reference
#> Prediction   setosa versicolor virginica
#>   setosa          3          3         3
#>   versicolor      4          2         2
#>   virginica       5          4         4
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.3            
#>                  95% CI : (0.1473, 0.494)
#>     No Information Rate : 0.4            
#>     P-Value [Acc > NIR] : 0.9060         
#>                                          
#>                   Kappa : -0.0448        
#>                                          
#>  Mcnemar's Test P-Value : 0.7269         
#> 
#> Statistics by Class:
#> 
#>                      Class: setosa Class: versicolor Class: virginica
#> Sensitivity                 0.2500           0.22222           0.4444
#> Specificity                 0.6667           0.71429           0.5714
#> Pos Pred Value              0.3333           0.25000           0.3077
#> Neg Pred Value              0.5714           0.68182           0.7059
#> Prevalence                  0.4000           0.30000           0.3000
#> Detection Rate              0.1000           0.06667           0.1333
#> Detection Prevalence        0.3000           0.26667           0.4333
#> Balanced Accuracy           0.4583           0.46825           0.5079
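
As a point of reference, the accuracy of the simplest baseline of all, always predicting the most frequent class in the training data, can be computed directly with base R:

# Accuracy of always predicting the most frequent training class
majority_class <- names(which.max(table(train_data$Species)))
mean(test_data$Species == majority_class)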

For a classification problem, we can also call the dummy_classifier() function directly, without going through caret.

dummy_model <- dummy_classifier(train_data$Species, strategy = "proportional", random_state = 2024)

# Make predictions using the trained dummy classifier
pred_vec <- predict_dummy_classifier(dummy_model, test_data)

# Evaluate the performance of the dummy classifier
conf_matrix <- caret::confusionMatrix(pred_vec, test_data$Species)
print(conf_matrix)
#> Confusion Matrix and Statistics
#> 
#>             Reference
#> Prediction   setosa versicolor virginica
#>   setosa          6          3         3
#>   versicolor      3          4         4
#>   virginica       3          2         2
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.4            
#>                  95% CI : (0.2266, 0.594)
#>     No Information Rate : 0.4            
#>     P-Value [Acc > NIR] : 0.5689         
#>                                          
#>                   Kappa : 0.0909         
#>                                          
#>  Mcnemar's Test P-Value : 0.8810         
#> 
#> Statistics by Class:
#> 
#>                      Class: setosa Class: versicolor Class: virginica
#> Sensitivity                 0.5000            0.4444          0.22222
#> Specificity                 0.6667            0.6667          0.76190
#> Pos Pred Value              0.5000            0.3636          0.28571
#> Neg Pred Value              0.6667            0.7368          0.69565
#> Prevalence                  0.4000            0.3000          0.30000
#> Detection Rate              0.2000            0.1333          0.06667
#> Detection Prevalence        0.4000            0.3667          0.23333
#> Balanced Accuracy           0.5833            0.5556          0.49206
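
Because the proportional strategy bases its predictions on the class frequencies seen in training, the distribution of the predicted labels should roughly track the training distribution. A quick comparison with base R:

# Compare predicted label proportions with training class proportions
prop.table(table(pred_vec))
prop.table(table(train_data$Species))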

For a regression problem, we can use the dummy_regressor() function.

# Train a dummy regressor using the median strategy
reg_model <- dummy_regressor(train_data$Sepal.Length, strategy = "median")
# Make predictions using the trained dummy regressor
y_hat <- predict_dummy_regressor(reg_model, test_data)
# Find the mean squared error
mean((test_data$Sepal.Length-y_hat)^2)
#> [1] 0.6056667
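
With the median strategy, the dummy regressor predicts the training-set median of the response for every observation, so the error above can also be reproduced by hand with base R:

# The median strategy predicts the training median for every test observation
y_baseline <- median(train_data$Sepal.Length)
mean((test_data$Sepal.Length - y_baseline)^2)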

The dummyRegressor method can likewise be passed to the train() function in the caret package.

ctrl1 <- caret::trainControl(method = "none")
# Train a dummy regressor with caret
reg_model <- caret::train(Sepal.Length ~ ., data = train_data,
                          method = dummyRegressor,
                          strategy = "median",
                          trControl = ctrl1)
y_hat <- predict(reg_model, test_data)
# Find mean squared error
mean((test_data$Sepal.Length-y_hat)^2)
#> [1] 0.6056667
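
As mentioned in the introduction, these baselines are meant to be compared against real models. As a quick sketch (using caret's standard "lm" method, which is not part of basemodels), a linear regression model can be fit on the same split and its mean squared error compared with the baseline above:

# Train an ordinary linear regression model with caret for comparison
lm_model <- caret::train(Sepal.Length ~ ., data = train_data,
                         method = "lm",
                         trControl = ctrl1)
y_hat_lm <- predict(lm_model, test_data)
# Mean squared error of the linear model, to compare with the baseline above
mean((test_data$Sepal.Length - y_hat_lm)^2)

A much lower error here tells us the model has learned something beyond the baseline.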