The goal of this vignette is to explain how to compute reproducible machine learning benchmarks.

Introduction

Reproducibility is the ability to re-compute the exact same results, given the exact same inputs, possibly on a variety of different computers.

In mlr3 benchmarks, there are three components, all of which may present barriers to reproducibility: the task (the data), the resampling (the assignment of train/test splits), and the learner (the training algorithm, which may use random numbers).

When using mlr3resampling::proj_grid(), the default is train_seed=1L, which means R’s random seed will be set before training.

For reproducible train/test splits, we recommend saving fold IDs to a CSV file, and assigning that column the fold role in the task, so that the splits are determined by the data rather than by the random seed.

These steps ensure that the benchmark results are reproducible, given the CSV file with fold column, and the random seed for training.
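
The role of the training seed can be sketched with base R alone (a minimal illustration, not part of the benchmark code): setting the same seed before a computation that draws random numbers reproduces the result exactly.

```r
## Setting the same seed before a random computation
## reproduces the result exactly.
set.seed(1)
first <- rnorm(3)
set.seed(1)
second <- rnorm(3)
identical(first, second)
#> [1] TRUE
```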

Example

We begin by defining the Resampling method (default is 3-fold CV).

kfold <- mlr3resampling::ResamplingSameOtherSizesCV$new()

Next we load the spam binary classification task, optionally down-sampling to 200 rows to make the computations quicker.

spam <- mlr3::tsk("spam")
## uncomment next line to speed up rendering:
#spam$filter(as.integer(seq(1, spam$nrow, length.out = 200)))
spam
#>
#>── <TaskClassif> (4601x58): HP Spam Detection ──────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass
#>• Features (57):
#>  • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#>  capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#>  charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#>  email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#>  money, num000, num1999, num3d, num415, num650, num85, num857, order,
#>  original, our, over, parts, people, pm, project, re, receive, remove, report,
#>  table, technology, telnet, will, you, your

Next, we create a new CSV file with fold IDs.

library(data.table)
spam_with_fold.csv <- if(interactive())"~/spam_with_fold.csv" else tempfile()
spam_with_fold.dt <- spam$data()[
, Fold := rep(1:3, length.out = .N)
, by = type]
fwrite(spam_with_fold.dt, spam_with_fold.csv)
spam_with_fold.dt[, table(Fold, type)]
#>    type
#>Fold spam nonspam
#>   1  605     930
#>   2  604     929
#>   3  604     929

The output above shows that the number of samples in each class is approximately constant across folds. Next, we use these data to define a new task with the fold role.

spam_with_fold <- mlr3::TaskClassif$new(
  "spam_with_fold", spam_with_fold.dt, target="type")
spam_with_fold$col_roles$fold <- "Fold"
spam_with_fold$col_roles$feature <- spam$col_roles$feature

Below we assign the stratum role to both tasks, for proportional down-sampling.

spam_with_fold$col_roles$stratum <- c("type","Fold")
spam$col_roles$stratum <- "type"
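
To illustrate what the stratum role means, the base R sketch below (an illustration, not code used by mlr3) groups rows by each combination of the stratum columns; down-sampling within these groups keeps the class and fold proportions constant.

```r
## Each combination of stratum columns defines one group;
## sampling within groups preserves proportions.
toy <- data.frame(
  type = rep(c("spam", "nonspam"), each = 6),
  Fold = rep(1:3, 4))
strata <- split(seq_len(nrow(toy)), list(toy$type, toy$Fold))
sapply(strata, length)  # two rows in each of the 6 strata
```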

Next we define learners and ensure their predict types are real-valued scores (predict_type="prob"), so we can compute AUC.

learner_list <- list(
  mlr3learners::LearnerClassifCVGlmnet$new(),
  mlr3::LearnerClassifRpart$new())
for(learner.i in seq_along(learner_list)){
  learner_list[[learner.i]]$predict_type <- "prob"
}
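
AUC is computed from real-valued scores rather than hard class labels, which is why predict_type="prob" is required. A minimal sketch with WeightedROC (the same package used in the do-it-yourself section), assuming it is installed:

```r
## AUC needs a real-valued score per observation, not just a class label.
score <- c(0.9, 0.8, 0.4, 0.2)  # larger = more likely positive
label <- c(1, 1, 0, 0)          # 1 = positive class
roc.df <- WeightedROC::WeightedROC(score, label)
WeightedROC::WeightedAUC(roc.df)
#> [1] 1
```

Here the scores perfectly separate the two classes, so the AUC is 1.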

Next, we create a project grid.

pdir <- if(interactive())"~/pdir" else tempfile()
task_list <- list(spam, spam_with_fold)
unlink(pdir, recursive = TRUE)
measure_list <- mlr3::msrs("classif.auc")
mlr3resampling::proj_grid(
  pdir, task_list, learner_list, kfold, score_args=measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628123a02a60", max_jobs=1)
#>task.i learner.i resampling.i task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 1 1 3067 1 same
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 2 1 3067 2 same
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 3 1 3067 3 same
#>1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 1 1 3067 1 same
#>1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 1 1 spam_with_fold classif.cv_glmnet same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 1 1 spam_with_fold classif.cv_glmnet same_other_sizes_cv full same 3067 3 1 3067 3 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 1 1 3067 1 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 3 1 3067 3 same

Demonstration with test project

Below we run proj_test() twice.

test_res_list <- list()
for(run.num in 1:2){
  tres <- mlr3resampling::proj_test(pdir, min_samples_per_stratum = 20)
  test_res_list[[run.num]] <- data.table(
    run=paste0("run", run.num), tres$results.csv)
}
test_res <- rbindlist(test_res_list)[
, algorithm := sub("classif.", "", learner_id)]
dcast(
  test_res, task_id + algorithm ~ run, value.var="classif.auc")
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.500 0.500
#>spam rpart 0.557 0.557
#>spam_with_fold cv_glmnet 0.938 0.938
#>spam_with_fold rpart 0.815 0.815

The result above shows that the two runs have the same AUC values. Even for spam, which does not have the fold column role, the random seed is set, so the fold assignments are reproducible.
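
This mechanism can be sketched with base R (an illustration of the idea, not the exact mlr3 code): without a stored fold column, the fold assignment comes from the random number generator, so it is only reproducible when the seed is fixed first.

```r
## With the seed fixed, a random fold assignment is reproducible.
set.seed(1)
folds.a <- sample(rep(1:3, length.out = 10))
set.seed(1)
folds.b <- sample(rep(1:3, length.out = 10))
identical(folds.a, folds.b)
#> [1] TRUE
```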

Demonstration with full benchmark

Below we create two project grids from the same code.

set.seed(1)
jobs_list <- list()
for(run.num in 1:2){
  pdir <- if(interactive())paste0("~/pdir",run.num) else tempfile()
  unlink(pdir, recursive = TRUE)
  pgrid <- mlr3resampling::proj_grid(
    pdir, task_list, learner_list, kfold, score_args=measure_list)
  jobs_list[[run.num]] <- data.table(
    run=paste0("run", run.num), pdir, job.i=pgrid[
    , which(test.fold==1 & grepl("glmnet", learner_id))])
}
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62811059e4f5", max_jobs=1)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file6281138c9384", max_jobs=1)

In the code above, the two grids have the same fold assignments (because the random seed is set inside proj_grid). Below, we compute one glmnet job for each run and task.

proj_res <- rbindlist(jobs_list)[
, mlr3resampling::proj_compute(job.i, pdir)
, by=.(run, pdir, job.i)][
, algorithm := sub("classif.", "", learner_id)]
dcast(
  proj_res, task_id + algorithm ~ run, value.var="classif.auc")
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.971 0.971
#>spam_with_fold cv_glmnet 0.968 0.968

The results here using proj_compute() are consistent with the results above using proj_test(): results are reproducible between runs.

Do it yourself

In this section we demonstrate that it is possible to compute the same test AUC values without using the mlr3 framework.

fold1.dt <- data.table(spam_with_fold.dt)[
, set := ifelse(Fold==1, "test", "train")][]
set.dt.list <- split(fold1.dt, fold1.dt$set)
set.xy.list <- list()
for(set in names(set.dt.list)){
  set.dt <- set.dt.list[[set]]
  set.xy.list[[set]] <- list(
    X=as.matrix(set.dt[, spam$col_roles$feature, with=FALSE]),
    y=set.dt$type)
}
library(glmnet)
#>Loading required package: Matrix
#>Loaded glmnet 4.1-10
set.seed(1)
cvg_train_predict <- function(set.xy.list){
  cvfit <- with(set.xy.list$train, cv.glmnet(X, y, family="binomial"))
  with(set.xy.list$test, {
    pred <- predict(cvfit, X)
    roc.df <- WeightedROC::WeightedROC(pred, y)
    WeightedROC::WeightedAUC(roc.df)
  })
}
fold1.test.auc <- cvg_train_predict(set.xy.list)
rbind(
  data.table(packages="glmnet,WeightedROC", run="only", auc=fold1.test.auc),
  proj_res[task_id=="spam_with_fold", .(
    packages="mlr3resampling", run, auc=classif.auc)])
#>packages run auc
#>glmnet,WeightedROC only 0.968
#>mlr3resampling run1 0.968
#>mlr3resampling run2 0.968

The output above is a table with AUC values that are identical across packages used for computation. These data indicate that the proposed framework enables reproducibility even if mlr3resampling is not used.

If the seed is not obvious from the code, you can read it from the grid RDS file:

grid.rds <- file.path(pdir, "grid.rds")
readRDS(grid.rds)$train_seed
#>[1] 1

Comparison with batchtools

If you want consistent results between runs with batchtools, you need to set the seed argument when making the registry.

(bgrid <- mlr3::benchmark_grid(task_list, learner_list, kfold))
#>task learner resampling
#>TaskClassif:spam LearnerClassifCVGlmnet:classif.cv_glmnet
#>TaskClassif:spam LearnerClassifRpart:classif.rpart
#>TaskClassif:spam_with_fold LearnerClassifCVGlmnet:classif.cv_glmnet
#>TaskClassif:spam_with_fold LearnerClassifRpart:classif.rpart
batchtools.seed <- 1
if(requireNamespace("mlr3batchmark")){
  batch_dt_list <- list()
  for(run.num in 1:2){
    reg_dir <- if(interactive())paste0("~/reg",run.num) else tempfile()
    unlink(reg_dir, recursive = TRUE)
    reg <- batchtools::makeExperimentRegistry(reg_dir, seed=batchtools.seed)
    mlr3batchmark::batchmark(bgrid)
    jt <- batchtools::getJobTable()
    jt1 <- jt[repl==1]
    batchtools::submitJobs(jt1)
    batchtools::waitForJobs()
    ignore.learner <- function(L){
      L$learner_state$model <- NULL
      L
    }
    bt_res <- mlr3batchmark::reduceResultsBatchmark(jt1, fun=ignore.learner)
    bt_score <- bt_res$score(measure_list)
    batch_dt_list[[run.num]] <- data.table(
      run=paste0("run", run.num), bt_score
    )[
    , algorithm := sub("classif.", "", learner_id)]
  }
  batch_dt <- rbindlist(batch_dt_list)
  dcast(
    batch_dt, task_id + algorithm ~ run, value.var="classif.auc")
}
#>Loading required namespace: mlr3batchmark
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62812849799' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.974 0.974
#>spam rpart 0.905 0.905
#>spam_with_fold cv_glmnet 0.967 0.967
#>spam_with_fold rpart 0.889 0.889

The table above shows consistent results across runs, which means that reproducibility is possible as long as batchtools is used with the same seed argument.

Do it yourself

It is possible to reproduce these results without batchtools. First, the code below shows that the bgrid object contains an instantiated resampling with fold IDs that are consistent with the values in the Fold column.

bgrid_dt <- as.data.table(bgrid)[, task_id := sapply(task, "[[", "id")][]
resampler_with_fold <- bgrid_dt[task_id=="spam_with_fold"]$resampling[[1]]
identical(resampler_with_fold$instance$fold.dt$Fold, spam_with_fold.dt$Fold)
#>[1] TRUE

To reproduce these values outside of the batchtools framework, we need to set the same random seed that was used by batchtools. This is documented in ?batchtools::makeExperimentRegistry, which says:

    seed: ['integer(1)']
          Start seed for jobs. Each job uses the ('seed' + 'job.id') as
          seed. Default is a random integer between 1 and 32768.

The code below sets this random seed.

(one_job <- jt1[, let(
  task_id = sapply(prob.pars, "[[", "task_id"),
  learner_id = sapply(algo.pars, "[[", "learner_id")
)][task_id=="spam_with_fold" & learner_id=="classif.cv_glmnet"])
#>job.id submitted started done error mem.used batch.id log.file job.hash job.name repl time.queued time.running problem prob.pars algorithm algo.pars resources tags task_id learner_id
#>7 ce96fb1b-8da9-4027-b300-db0fb034113d 1 e827a0f482a53df3 <list[4]> run_learner <list[4]> [NULL] spam_with_fold classif.cv_glmnet
set.seed(one_job$job.id+batchtools.seed)

Next we train cv.glmnet again.

batchtools.test.auc <- cvg_train_predict(set.xy.list)
rbind(
  data.table(
    packages="glmnet,WeightedROC", run="only", auc=batchtools.test.auc),
  batch_dt[algorithm=="cv_glmnet" & task_id=="spam_with_fold", .(
    packages="batchtools", run, auc=classif.auc)])
#>packages run auc
#>glmnet,WeightedROC only 0.967
#>batchtools run1 0.967
#>batchtools run2 0.967

The DIY results above are consistent with the previous two runs, indicating that reproducibility is possible with batchtools as well.

If the seed was not set in the batchtools code, you can read it from the registry RDS file:

registry.rds <- file.path(reg_dir, "registry.rds")
rbind(
  code=batchtools.seed,
  registry=readRDS(registry.rds)$seed)
#>         [,1]
#>code        1
#>registry    1
unlink(reg_dir, recursive = TRUE)
batchtools::makeExperimentRegistry(reg_dir)
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Experiment Registry
#>  Backend   : Interactive
#>  File dir  : /tmp/Rtmppy43XW/file62811069e00c
#>  Work dir  : /tmp/Rtmp6Vlihj/Rbuild61f83059bb63/mlr3resampling/vignettes
#>  Jobs      : 0
#>  Problems  : 0
#>  Algorithms: 0
#>  Seed      : 32101
#>  Writeable : TRUE
readRDS(registry.rds)$seed
#>[1] 32101

Conclusion

We see that reproducibility is possible in mlr3. Using mlr3resampling::proj_grid(), the training seed is set by default (train_seed=1L), so no extra steps are required.

Using mlr3batchmark, the user needs to give the seed argument to batchtools::makeExperimentRegistry().