The goal of this vignette is to explain how to compute reproducible machine learning benchmarks.

Introduction

Reproducibility is the ability to re-compute the exact same results, given the exact same inputs, possibly on a variety of different computers.

In mlr3 benchmarks, there are three components, all of which may present barriers to reproducibility: the task (the data), the resampling (the assignment of train/test splits), and the learner (the training algorithm, which may use random numbers).

When using mlr3resampling::proj_grid(), the default is train_seed=1L, which means R’s random seed will be set before training.

For reproducible train/test splits, we recommend saving fold IDs to a CSV file, and assigning that column the fold role in the task, so that the splits are determined by the data rather than by the random seed.

These steps ensure that the benchmark results are reproducible, given the CSV file with fold column, and the random seed for training.
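
The role of the training seed can be sketched with base R alone (a minimal illustration, not part of the benchmark code): setting the same seed before a computation that draws random numbers reproduces the result exactly.

```r
## Setting the same seed before a random computation
## reproduces the result exactly.
set.seed(1)
first <- rnorm(3)
set.seed(1)
second <- rnorm(3)
identical(first, second)
#> [1] TRUE
```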

Example

We begin by defining the Resampling method (default is 3-fold CV).

kfold <- mlr3resampling::ResamplingSameOtherSizesCV$new()

Next we load the spam binary classification task, optionally down-sampling to 200 rows to make the computations quicker.

spam <- mlr3::tsk("spam")
## uncomment next line to speed up rendering:
#spam$filter(as.integer(seq(1, spam$nrow, length.out = 200)))
spam
#>
#>── <TaskClassif> (4601x58): HP Spam Detection ──────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass
#>• Features (57):
#>  • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#>  capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#>  charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#>  email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#>  money, num000, num1999, num3d, num415, num650, num85, num857, order,
#>  original, our, over, parts, people, pm, project, re, receive, remove, report,
#>  table, technology, telnet, will, you, your

Next, we create a new CSV file with fold IDs.

library(data.table)
spam_with_fold.csv <- if(interactive())"~/spam_with_fold.csv" else tempfile()
spam_with_fold.dt <- spam$data()[
, Fold := rep(1:3, length.out = .N)
, by = type]
fwrite(spam_with_fold.dt, spam_with_fold.csv)
spam_with_fold.dt[, table(Fold, type)]
#>    type
#>Fold spam nonspam
#>   1  605     930
#>   2  604     929
#>   3  604     929

The output above shows that the number of samples in each class is approximately constant across folds. Next, we use these data to define a new task with the fold role.

spam_with_fold <- mlr3::TaskClassif$new(
  "spam_with_fold", spam_with_fold.dt, target="type")
spam_with_fold$col_roles$fold <- "Fold"
spam_with_fold$col_roles$feature <- spam$col_roles$feature

Below we assign the stratum role to both tasks, for proportional down-sampling.

spam_with_fold$col_roles$stratum <- c("type","Fold")
spam$col_roles$stratum <- "type"
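
To illustrate what the stratum role means, the base R sketch below (an illustration, not code used by mlr3) groups rows by each combination of the stratum columns; down-sampling within these groups keeps the class and fold proportions constant.

```r
## Each combination of stratum columns defines one group;
## sampling within groups preserves proportions.
toy <- data.frame(
  type = rep(c("spam", "nonspam"), each = 6),
  Fold = rep(1:3, 4))
strata <- split(seq_len(nrow(toy)), list(toy$type, toy$Fold))
sapply(strata, length)  # two rows in each of the 6 strata
```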

Next we define learners and ensure their predict types are real-valued scores (predict_type="prob"), so we can compute AUC.

learner_list <- list(
  mlr3learners::LearnerClassifCVGlmnet$new(),
  mlr3::LearnerClassifRpart$new())
for(learner.i in seq_along(learner_list)){
  learner_list[[learner.i]]$predict_type <- "prob"
}
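
AUC is computed from real-valued scores rather than hard class labels, which is why predict_type="prob" is required. A minimal sketch with WeightedROC (the same package used in the do-it-yourself section), assuming it is installed:

```r
## AUC needs a real-valued score per observation, not just a class label.
score <- c(0.9, 0.8, 0.4, 0.2)  # larger = more likely positive
label <- c(1, 1, 0, 0)          # 1 = positive class
roc.df <- WeightedROC::WeightedROC(score, label)
WeightedROC::WeightedAUC(roc.df)
#> [1] 1
```

Here the scores perfectly separate the two classes, so the AUC is 1.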

Next, we create a project grid.

pdir <- if(interactive())"~/pdir" else tempfile()
task_list <- list(spam, spam_with_fold)
unlink(pdir, recursive = TRUE)
measure_list <- mlr3::msrs("classif.auc")
mlr3resampling::proj_grid(
  pdir, task_list, learner_list, kfold, score_args=measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628123a02a60", max_jobs=1)
#>task.i learner.i resampling.i task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 1 1 3067 1 same
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 2 1 3067 2 same
#>1 1 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 3 1 3067 3 same
#>1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 1 1 3067 1 same
#>1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 1 1 spam_with_fold classif.cv_glmnet same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 1 1 spam_with_fold classif.cv_glmnet same_other_sizes_cv full same 3067 3 1 3067 3 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 1 1 3067 1 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 2 1 3067 2 same
#>2 2 1 spam_with_fold classif.rpart same_other_sizes_cv full same 3067 3 1 3067 3 same

Demonstration with test project

Below we run proj_test() twice.

test_res_list <- list()
for(run.num in 1:2){
  tres <- mlr3resampling::proj_test(pdir, min_samples_per_stratum = 20)
  test_res_list[[run.num]] <- data.table(
    run=paste0("run", run.num), tres$results.csv)
}
test_res <- rbindlist(test_res_list)[
, algorithm := sub("classif.", "", learner_id)]
dcast(
  test_res, task_id + algorithm ~ run, value.var="classif.auc")
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.500 0.500
#>spam rpart 0.557 0.557
#>spam_with_fold cv_glmnet 0.938 0.938
#>spam_with_fold rpart 0.815 0.815

The result above shows that the two runs have the same AUC values. Even for spam, which does not have the fold column role, the random seed is set, so the fold assignments are reproducible.
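
This mechanism can be sketched with base R (an illustration of the idea, not the exact mlr3 code): without a stored fold column, the fold assignment comes from the random number generator, so it is only reproducible when the seed is fixed first.

```r
## With the seed fixed, a random fold assignment is reproducible.
set.seed(1)
folds.a <- sample(rep(1:3, length.out = 10))
set.seed(1)
folds.b <- sample(rep(1:3, length.out = 10))
identical(folds.a, folds.b)
#> [1] TRUE
```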

Demonstration with full benchmark

Below we create two project grids from the same code.

set.seed(1)
jobs_list <- list()
for(run.num in 1:2){
  pdir <- if(interactive())paste0("~/pdir",run.num) else tempfile()
  unlink(pdir, recursive = TRUE)
  pgrid <- mlr3resampling::proj_grid(
    pdir, task_list, learner_list, kfold, score_args=measure_list)
  jobs_list[[run.num]] <- data.table(
    run=paste0("run", run.num), pdir, job.i=pgrid[
    , which(test.fold==1 & grepl("glmnet", learner_id))])
}
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62811059e4f5", max_jobs=1)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file6281138c9384", max_jobs=1)

In the code above, the two grids have the same fold assignments (because the random seed is set inside proj_grid). Below, we compute one glmnet job for each run and task.

proj_res <- rbindlist(jobs_list)[
, mlr3resampling::proj_compute(job.i, pdir)
, by=.(run, pdir, job.i)][
, algorithm := sub("classif.", "", learner_id)]
dcast(
  proj_res, task_id + algorithm ~ run, value.var="classif.auc")
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.971 0.971
#>spam_with_fold cv_glmnet 0.968 0.968

The results here using proj_compute() are consistent with the results above using proj_test(): results are reproducible between runs.

Do it yourself

In this section we demonstrate that it is possible to compute the same test AUC values without using the mlr3 framework.

fold1.dt <- data.table(spam_with_fold.dt)[
, set := ifelse(Fold==1, "test", "train")][]
set.dt.list <- split(fold1.dt, fold1.dt$set)
set.xy.list <- list()
for(set in names(set.dt.list)){
  set.dt <- set.dt.list[[set]]
  set.xy.list[[set]] <- list(
    X=as.matrix(set.dt[, spam$col_roles$feature, with=FALSE]),
    y=set.dt$type)
}
library(glmnet)
#>Loading required package: Matrix
#>Loaded glmnet 4.1-10
set.seed(1)
cvg_train_predict <- function(set.xy.list){
  cvfit <- with(set.xy.list$train, cv.glmnet(X, y, family="binomial"))
  with(set.xy.list$test, {
    pred <- predict(cvfit, X)
    roc.df <- WeightedROC::WeightedROC(pred, y)
    WeightedROC::WeightedAUC(roc.df)
  })
}
fold1.test.auc <- cvg_train_predict(set.xy.list)
rbind(
  data.table(packages="glmnet,WeightedROC", run="only", auc=fold1.test.auc),
  proj_res[task_id=="spam_with_fold", .(
    packages="mlr3resampling", run, auc=classif.auc)])
#>packages run auc
#>glmnet,WeightedROC only 0.968
#>mlr3resampling run1 0.968
#>mlr3resampling run2 0.968

The output above is a table with AUC values that are identical across packages used for computation. These data indicate that the proposed framework enables reproducibility even if mlr3resampling is not used.

If the seed is not obvious from the code, you can read it from the grid RDS file:

grid.rds <- file.path(pdir, "grid.rds")
readRDS(grid.rds)$train_seed
#>[1] 1

Comparison with batchtools

If you want consistent results between runs with batchtools, you need to set the seed argument when making the registry.

(bgrid <- mlr3::benchmark_grid(task_list, learner_list, kfold))
#>task learner resampling
#>TaskClassif:spam LearnerClassifCVGlmnet:classif.cv_glmnet
#>TaskClassif:spam LearnerClassifRpart:classif.rpart
#>TaskClassif:spam_with_fold LearnerClassifCVGlmnet:classif.cv_glmnet
#>TaskClassif:spam_with_fold LearnerClassifRpart:classif.rpart
batchtools.seed <- 1
if(requireNamespace("mlr3batchmark")){
  batch_dt_list <- list()
  for(run.num in 1:2){
    reg_dir <- if(interactive())paste0("~/reg",run.num) else tempfile()
    unlink(reg_dir, recursive = TRUE)
    reg <- batchtools::makeExperimentRegistry(reg_dir, seed=batchtools.seed)
    mlr3batchmark::batchmark(bgrid)
    jt <- batchtools::getJobTable()
    jt1 <- jt[repl==1]
    batchtools::submitJobs(jt1)
    batchtools::waitForJobs()
    ignore.learner <- function(L){
      L$learner_state$model <- NULL
      L
    }
    bt_res <- mlr3batchmark::reduceResultsBatchmark(jt1, fun=ignore.learner)
    bt_score <- bt_res$score(measure_list)
    batch_dt_list[[run.num]] <- data.table(
      run=paste0("run", run.num), bt_score
    )[
    , algorithm := sub("classif.", "", learner_id)]
  }
  batch_dt <- rbindlist(batch_dt_list)
  dcast(
    batch_dt, task_id + algorithm ~ run, value.var="classif.auc")
}
#>Loading required namespace: mlr3batchmark
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62812849799' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
#>task_id algorithm run1 run2
#>spam cv_glmnet 0.974 0.974
#>spam rpart 0.905 0.905
#>spam_with_fold cv_glmnet 0.967 0.967
#>spam_with_fold rpart 0.889 0.889

The table above shows consistent results across runs, which means that reproducibility is possible as long as batchtools is used with the same seed argument.

Do it yourself

It is possible to reproduce these results without batchtools. First, the code below shows that the bgrid object contains an instantiated resampling with fold IDs that are consistent with the values in the Fold column.

bgrid_dt <- as.data.table(bgrid)[, task_id := sapply(task, "[[", "id")][]
resampler_with_fold <- bgrid_dt[task_id=="spam_with_fold"]$resampling[[1]]
identical(resampler_with_fold$instance$fold.dt$Fold, spam_with_fold.dt$Fold)
#>[1] TRUE

To reproduce these values outside of the batchtools framework, we need to set the same random seed that was used by batchtools. This is documented in ?batchtools::makeExperimentRegistry, which says:

    seed: ['integer(1)']
          Start seed for jobs. Each job uses the ('seed' + 'job.id') as
          seed. Default is a random integer between 1 and 32768.

The code below sets this random seed.

(one_job <- jt1[, let(
  task_id = sapply(prob.pars, "[[", "task_id"),
  learner_id = sapply(algo.pars, "[[", "learner_id")
)][task_id=="spam_with_fold" & learner_id=="classif.cv_glmnet"])
#>job.id submitted started done error mem.used batch.id log.file job.hash job.name repl time.queued time.running problem prob.pars algorithm algo.pars resources tags task_id learner_id
#>7 ce96fb1b-8da9-4027-b300-db0fb034113d 1 e827a0f482a53df3 <list[4]> run_learner <list[4]> [NULL] spam_with_fold classif.cv_glmnet
set.seed(one_job$job.id+batchtools.seed)

Next we train cv.glmnet again.

batchtools.test.auc <- cvg_train_predict(set.xy.list)
rbind(
  data.table(
    packages="glmnet,WeightedROC", run="only", auc=batchtools.test.auc),
  batch_dt[algorithm=="cv_glmnet" & task_id=="spam_with_fold", .(
    packages="batchtools", run, auc=classif.auc)])
#>packages run auc
#>glmnet,WeightedROC only 0.967
#>batchtools run1 0.967
#>batchtools run2 0.967

The DIY results above are consistent with the previous two runs, indicating that reproducibility is possible with batchtools as well.

If the seed was not set in the batchtools code, you can read it from the registry RDS file:

registry.rds <- file.path(reg_dir, "registry.rds")
rbind(
  code=batchtools.seed,
  registry=readRDS(registry.rds)$seed)
#>         [,1]
#>code        1
#>registry    1
unlink(reg_dir, recursive = TRUE)
batchtools::makeExperimentRegistry(reg_dir)
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Experiment Registry
#>  Backend   : Interactive
#>  File dir  : /tmp/Rtmppy43XW/file62811069e00c
#>  Work dir  : /tmp/Rtmp6Vlihj/Rbuild61f83059bb63/mlr3resampling/vignettes
#>  Jobs      : 0
#>  Problems  : 0
#>  Algorithms: 0
#>  Seed      : 32101
#>  Writeable : TRUE
readRDS(registry.rds)$seed
#>[1] 32101

Conclusion

We see that reproducibility is possible in mlr3. Using mlr3resampling::proj_grid(), the training seed is set by default (train_seed=1L), so no extra steps are required.

Using mlr3batchmark, the user needs to give the seed argument to batchtools::makeExperimentRegistry().