The goal of this vignette is to explain how to compute reproducible machine learning benchmarks.
Reproducibility is the ability to re-compute the exact same results, given the exact same inputs, possibly on a variety of different computers.
In mlr3 benchmarks, there are three components (tasks, learners, and resamplings), all of which may present barriers to reproducibility.
When using mlr3resampling::proj_grid(), the default is train_seed=1L, which means R’s random seed will be set before training.
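The sketch below (an illustration of the idea, not the actual mlr3resampling implementation) shows what that means: the seed is set immediately before a learner is trained, so any randomness inside training, such as the internal fold assignment of cv.glmnet, is the same in every run.
## Illustration only: setting the seed right before training makes any
## randomness inside the learner reproducible.
sketch_task <- mlr3::tsk("spam")
sketch_learner <- mlr3learners::LearnerClassifCVGlmnet$new()
set.seed(1L) # the value of train_seed
sketch_learner$train(sketch_task)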
For reproducible train/test splits, we recommend

- creating a CSV data file with a fold ID column, and assigning that column the fold role in the mlr3 Task,
- using mlr3resampling::ResamplingSameOtherSizesCV, which uses the column with the fold role in cross-validation.

These steps ensure that the benchmark results are reproducible, given the CSV file with the fold column, and the random seed for training.
We begin by defining the Resampling method (default is 3-fold CV).
kfold <- mlr3resampling::ResamplingSameOtherSizesCV$new()
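If you want to inspect the resampling's parameters, you can print its parameter set (an optional inspection step):
## Optional: print the resampling's parameter set; it should include the
## number of folds (3 by default, as noted above).
kfold$param_set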
Next we load the spam binary classification task, optionally down-sampling to 200 rows to make the computations quicker.
spam <- mlr3::tsk("spam")
## uncomment next line to speedup rendering:
#spam$filter(as.integer(seq(1, spam$nrow, length.out = 200)))
spam
#>
#>── <TaskClassif> (4601x58): HP Spam Detection ──────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass
#>• Features (57):
#> • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#> capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#> charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#> email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#> money, num000, num1999, num3d, num415, num650, num85, num857, order,
#> original, our, over, parts, people, pm, project, re, receive, remove, report,
#> table, technology, telnet, will, you, your
Next, we create a new CSV file with fold IDs.
library(data.table)
spam_with_fold.csv <- if(interactive())"~/spam_with_fold.csv" else tempfile()
spam_with_fold.dt <- spam$data()[
, Fold := rep(1:3, length.out = .N)
, by = type]
fwrite(spam_with_fold.dt, spam_with_fold.csv)
spam_with_fold.dt[, table(Fold, type)]
#> type
#>Fold spam nonspam
#> 1 605 930
#> 2 604 929
#> 3 604 929
The output above shows that the number of samples in each class is approximately constant across folds.
Next, we use these data to define a new task with the fold role.
spam_with_fold <- mlr3::TaskClassif$new(
"spam_with_fold", spam_with_fold.dt, target="type")
spam_with_fold$col_roles$fold <- "Fold"
spam_with_fold$col_roles$feature <- spam$col_roles$feature
Below we assign the stratum role to both tasks, for proportional down-sampling.
spam_with_fold$col_roles$stratum <- c("type","Fold")
spam$col_roles$stratum <- "type"
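As an optional check, the strata defined by these role assignments can be inspected via each task's strata field, which lists the number of rows in every stratum:
## Optional check: rows per stratum implied by the stratum role.
spam_with_fold$strata
spam$strata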
Next we define learners and ensure their predict types are real-valued scores (predict_type="prob"), so we can compute AUC.
learner_list <- list(
mlr3learners::LearnerClassifCVGlmnet$new(),
mlr3::LearnerClassifRpart$new())
for(learner.i in seq_along(learner_list)){
learner_list[[learner.i]]$predict_type <- "prob"
}
Next, we create a project grid.
pdir <- if(interactive())"~/pdir" else tempfile()
task_list <- list(spam, spam_with_fold)
unlink(pdir, recursive = TRUE)
measure_list <- mlr3::msrs("classif.auc")
mlr3resampling::proj_grid(
pdir, task_list, learner_list, kfold, score_args=measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628123a02a60", max_jobs=1)
| task.i | learner.i | resampling.i | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 1 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 1 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2 | 1 | 1 | spam_with_fold | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 2 | 1 | 1 | spam_with_fold | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 2 | 2 | 1 | spam_with_fold | classif.rpart | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 2 | 2 | 1 | spam_with_fold | classif.rpart | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 2 | 2 | 1 | spam_with_fold | classif.rpart | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
Below we run proj_test() twice.
test_res_list <- list()
for(run.num in 1:2){
tres <- mlr3resampling::proj_test(pdir, min_samples_per_stratum = 20)
test_res_list[[run.num]] <- data.table(
run=paste0("run", run.num), tres$results.csv)
}
test_res <- rbindlist(test_res_list)[
, algorithm := sub("classif.", "", learner_id)]
dcast(
test_res, task_id + algorithm ~ run, value.var="classif.auc")
| task_id | algorithm | run1 | run2 |
|---|---|---|---|
| spam | cv_glmnet | 0.500 | 0.500 |
| spam | rpart | 0.557 | 0.557 |
| spam_with_fold | cv_glmnet | 0.938 | 0.938 |
| spam_with_fold | rpart | 0.815 | 0.815 |
The result above shows that the two runs have the same AUC values.
Even for spam, which does not have a column with the fold role, the random seed is set for reproducible fold assignments.
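The sketch below illustrates why: instantiating a fresh resampling on spam twice, with the same seed each time, yields the same test set (the variable names here are for illustration only).
## Sketch: with the same seed, fold assignment for spam (no fold column) is identical.
demo_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
set.seed(1)
demo_cv$instantiate(spam)
demo_test_rows <- demo_cv$test_set(1)
set.seed(1)
demo_cv$instantiate(spam)
identical(demo_test_rows, demo_cv$test_set(1)) # should be TRUE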
Below we create two project grids from the same code.
set.seed(1)
jobs_list <- list()
for(run.num in 1:2){
pdir <- if(interactive())paste0("~/pdir",run.num) else tempfile()
unlink(pdir, recursive = TRUE)
pgrid <- mlr3resampling::proj_grid(
pdir, task_list, learner_list, kfold, score_args=measure_list)
jobs_list[[run.num]] <- data.table(
run=paste0("run", run.num), pdir, job.i=pgrid[
, which(test.fold==1 & grepl("glmnet", learner_id))])
}
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62811059e4f5", max_jobs=1)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file6281138c9384", max_jobs=1)
In the code above, the two grids have the same fold assignments (because the random seed is set inside proj_grid()).
Below, we compute one glmnet job for each run and task.
proj_res <- rbindlist(jobs_list)[
, mlr3resampling::proj_compute(job.i, pdir)
, by=.(run, pdir, job.i)][
, algorithm := sub("classif.", "", learner_id)]
dcast(
proj_res, task_id + algorithm ~ run, value.var="classif.auc")
| task_id | algorithm | run1 | run2 |
|---|---|---|---|
| spam | cv_glmnet | 0.971 | 0.971 |
| spam_with_fold | cv_glmnet | 0.968 | 0.968 |
The results here using proj_compute() are consistent with the results above using proj_test(): results are reproducible between runs.
In this section we demonstrate that it is possible to compute the same test AUC values without using the mlr3 framework.
fold1.dt <- data.table(spam_with_fold.dt)[
, set := ifelse(Fold==1, "test", "train")][]
set.dt.list <- split(fold1.dt, fold1.dt$set)
set.xy.list <- list()
for(set in names(set.dt.list)){
set.dt <- set.dt.list[[set]]
set.xy.list[[set]] <- list(
X=as.matrix(set.dt[, spam$col_roles$feature, with=FALSE]),
y=set.dt$type)
}
library(glmnet)
#>Loading required package: Matrix
#>Loaded glmnet 4.1-10
set.seed(1)
cvg_train_predict <- function(set.xy.list){
cvfit <- with(set.xy.list$train, cv.glmnet(X, y, family="binomial"))
with(set.xy.list$test, {
pred <- predict(cvfit, X)
roc.df <- WeightedROC::WeightedROC(pred, y)
WeightedROC::WeightedAUC(roc.df)
})
}
fold1.test.auc <- cvg_train_predict(set.xy.list)
rbind(
data.table(packages="glmnet,WeightedROC", run="only", auc=fold1.test.auc),
proj_res[task_id=="spam_with_fold", .(
packages="mlr3resampling", run, auc=classif.auc)])
| packages | run | auc |
|---|---|---|
| glmnet,WeightedROC | only | 0.968 |
| mlr3resampling | run1 | 0.968 |
| mlr3resampling | run2 | 0.968 |
The output above is a table with AUC values that are identical across packages used for computation.
These data indicate that the proposed framework enables reproducibility even if mlr3resampling is not used.
If the seed is not obvious from the code, you can read it from the grid RDS file:
grid.rds <- file.path(pdir, "grid.rds")
readRDS(grid.rds)$train_seed
#>[1] 1
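As a sketch of how this stored seed can be used, we can repeat the standalone glmnet computation from the previous section, seeding with the value read from grid.rds; since that value is 1, the same value passed to set.seed() above, the AUC should match.
## Sketch: re-run the standalone computation using the seed stored in grid.rds.
set.seed(readRDS(grid.rds)$train_seed)
fold1.test.auc.again <- cvg_train_predict(set.xy.list)
identical(fold1.test.auc, fold1.test.auc.again) # should be TRUE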
If you want consistent results between runs with batchtools, you need to set the seed argument when making the registry.
(bgrid <- mlr3::benchmark_grid(task_list, learner_list, kfold))
batchtools.seed <- 1
if(requireNamespace("mlr3batchmark")){
batch_dt_list <- list()
for(run.num in 1:2){
reg_dir <- if(interactive())paste0("~/reg",run.num) else tempfile()
unlink(reg_dir, recursive = TRUE)
reg <- batchtools::makeExperimentRegistry(reg_dir, seed=batchtools.seed)
mlr3batchmark::batchmark(bgrid)
jt <- batchtools::getJobTable()
jt1 <- jt[repl==1]
batchtools::submitJobs(jt1)
batchtools::waitForJobs()
ignore.learner <- function(L){
L$learner_state$model <- NULL
L
}
bt_res <- mlr3batchmark::reduceResultsBatchmark(jt1, fun=ignore.learner)
bt_score <- bt_res$score(measure_list)
batch_dt_list[[run.num]] <- data.table(
run=paste0("run", run.num), bt_score
)[
, algorithm := sub("classif.", "", learner_id)]
}
batch_dt <- rbindlist(batch_dt_list)
dcast(
batch_dt, task_id + algorithm ~ run, value.var="classif.auc")
}
#>Loading required namespace: mlr3batchmark
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62812849799' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Adding algorithm 'run_learner'
#>Adding problem '4d4715e62a2eaf23'
#>Exporting new objects: 'bae9feab4b45c859' ...
#>Exporting new objects: 'afb1fabfdb92224e' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('4d4715e62a2eaf23'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'e827a0f482a53df3'
#>Exporting new objects: '1f7bf6db193ef5ae' ...
#>Adding 6 experiments ('e827a0f482a53df3'[1] x 'run_learner'[2] x repls[3]) ...
#>Submitting 4 jobs in 4 chunks using cluster functions 'Interactive' ...
| task_id | algorithm | run1 | run2 |
|---|---|---|---|
| spam | cv_glmnet | 0.974 | 0.974 |
| spam | rpart | 0.905 | 0.905 |
| spam_with_fold | cv_glmnet | 0.967 | 0.967 |
| spam_with_fold | rpart | 0.889 | 0.889 |
The table above shows consistent results across runs, which means that reproducibility is possible as long as batchtools is used with the same seed argument.
It is possible to reproduce these results without batchtools.
First, the code below shows that the bgrid object contains an instantiated resampling with fold IDs that are consistent with the values in the Fold column.
bgrid_dt <- as.data.table(bgrid)[, task_id := sapply(task, "[[", "id")][]
resampler_with_fold <- bgrid_dt[task_id=="spam_with_fold"]$resampling[[1]]
identical(resampler_with_fold$instance$fold.dt$Fold, spam_with_fold.dt$Fold)
#>[1] TRUE
To reproduce these values outside of the batchtools framework, we need to set the same random seed that was used by batchtools.
This is documented in ?batchtools::makeExperimentRegistry, which says:
seed: ['integer(1)']
Start seed for jobs. Each job uses the ('seed' + 'job.id') as
seed. Default is a random integer between 1 and 32768.
The code below finds the relevant job, then sets this random seed.
(one_job <- jt1[, let(
task_id = sapply(prob.pars, "[[", "task_id"),
learner_id = sapply(algo.pars, "[[", "learner_id")
)][task_id=="spam_with_fold" & learner_id=="classif.cv_glmnet"])
| job.id | job.hash | repl | problem | prob.pars | algorithm | algo.pars | tags | task_id | learner_id |
|---|---|---|---|---|---|---|---|---|---|
| 7 | ce96fb1b-8da9-4027-b300-db0fb034113d | 1 | e827a0f482a53df3 | <list[4]> | run_learner | <list[4]> | [NULL] | spam_with_fold | classif.cv_glmnet |
set.seed(one_job$job.id+batchtools.seed)
Next, we train cv.glmnet again.
batchtools.test.auc <- cvg_train_predict(set.xy.list)
rbind(
data.table(
packages="glmnet,WeightedROC", run="only", auc=batchtools.test.auc),
batch_dt[algorithm=="cv_glmnet" & task_id=="spam_with_fold", .(
packages="batchtools", run, auc=classif.auc)])
| packages | run | auc |
|---|---|---|
| glmnet,WeightedROC | only | 0.967 |
| batchtools | run1 | 0.967 |
| batchtools | run2 | 0.967 |
The DIY results above are consistent with the previous two runs, indicating that reproducibility is possible with batchtools as well.
If the seed is not obvious from the batchtools code, you can read the seed that was actually used from the registry RDS file:
registry.rds <- file.path(reg_dir, "registry.rds")
rbind(
code=batchtools.seed,
registry=readRDS(registry.rds)$seed)
|  |  |
|---|---|
| code | 1 |
| registry | 1 |
Below, we delete the registry and re-create it without the seed argument, to show that a random default seed is chosen, and that this default can also be read from the registry RDS file.
unlink(reg_dir, recursive = TRUE)
batchtools::makeExperimentRegistry(reg_dir)
#>No readable configuration file found
#>Created registry in '/tmp/Rtmppy43XW/file62811069e00c' using cluster functions 'Interactive'
#>Experiment Registry
#> Backend : Interactive
#> File dir : /tmp/Rtmppy43XW/file62811069e00c
#> Work dir : /tmp/Rtmp6Vlihj/Rbuild61f83059bb63/mlr3resampling/vignettes
#> Jobs : 0
#> Problems : 0
#> Algorithms: 0
#> Seed : 32101
#> Writeable : TRUE
readRDS(registry.rds)$seed
#>[1] 32101
We see that reproducibility is possible in mlr3.
- Using mlr3resampling::proj_grid(), for reproducible train/test splits we recommend defining a column with the fold column role. Even for tasks without the fold column role, random seeds are set for reproducible splitting and training.
- Using mlr3batchmark, the user needs to give the seed argument to batchtools::makeExperimentRegistry().