The goal of this vignette is to compare and contrast calculation of machine learning benchmarks using
- the mlr3 and mlr3batchmark packages, which are the previous/existing methods, and
- the mlr3resampling::proj_*() functions, which are the new methods proposed in this package.

In mlr3, a benchmark is defined as a set of combinations of tasks, learners, and resampling iterations. The code in this section must be run using both methods (previous and proposed).
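Conceptually, the grid of benchmark combinations is just the cross-product of tasks, learners, and resampling iterations, which can be sketched in base R (made-up names, for illustration only):

```r
# Hypothetical task/learner names, to illustrate the cross-product idea only.
tasks <- c("spam", "german_credit")
learners <- c("featureless", "rpart")
iterations <- 1:3  # 3-fold CV
grid <- expand.grid(task = tasks, learner = learners, iteration = iterations)
nrow(grid)  # 2 tasks x 2 learners x 3 folds = 12 benchmark jobs
```

Each row of such a grid is one unit of computation (one train/test split for one learner on one task), which is what both frameworks parallelize.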
First we create an instance of 3-fold CV, which we will use as the train/test splitting method.
(kfoldcv <- mlr3resampling::ResamplingSameOtherSizesCV$new())
#>
#>── <ResamplingSameOtherSizesCV> : Compare Same/Other and Sizes Cross-Validation
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3, seeds=1, ratio=0.5, sizes=-1, ignore_subset=FALSE,
#>subsets=SOA
For reproducibility, we use ResamplingSameOtherSizesCV because it respects the fold column role (and mlr3::ResamplingCV does not).
Note that the resampling should be created before the Task, because constructing it loads the mlr3resampling package (which is needed to avoid errors about the unrecognized column roles subset and fold).
First we define a list of two tasks.
task_list <- mlr3::tsks(c("spam", "german_credit"))
tasks_with_fold <- list()
for(task_i in seq_along(task_list)){
task <- task_list[[task_i]]
tcol <- task$col_roles$target
task_dt <- task$data()
task_dt[, Fold := rep(1:3, length.out=.N), by=c(tcol)]
ftask <- mlr3::TaskClassif$new(
task_dt, id=task$id, target=tcol)
ftask$col_roles$feature <- task$col_roles$feature
ftask$col_roles$fold <- "Fold"
ftask$col_roles$stratum <- c("Fold", tcol)
tasks_with_fold[[task$id]] <- ftask
}
tasks_with_fold
#>$spam
#>
#>── <TaskClassif> (4601x58) ─────────────────────────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass, strata
#>• Features (57):
#> • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#> capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#> charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#> email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#> money, num000, num1999, num3d, num415, num650, num85, num857, order,
#> original, our, over, parts, people, pm, project, re, receive, remove, report,
#> table, technology, telnet, will, you, your
#>• Strata: Fold and type
#>
#>$german_credit
#>
#>── <TaskClassif> (1000x21) ─────────────────────────────────────────────────────
#>• Target: credit_risk
#>• Target classes: good (positive class, 70%), bad (30%)
#>• Properties: twoclass, strata
#>• Features (20):
#> • fct (14): credit_history, employment_duration, foreign_worker, housing,
#> job, other_debtors, other_installment_plans, people_liable,
#> personal_status_sex, property, purpose, savings, status, telephone
#> • int (3): age, amount, duration
#> • ord (3): installment_rate, number_credits, present_residence
#>• Strata: Fold and credit_risk
#>
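As an aside, the per-class fold assignment computed with data.table in the loop above can be mimicked in base R; this sketch uses made-up labels and is only meant to show the idea:

```r
set.seed(1)
y <- sample(c("spam", "nonspam"), 20, replace = TRUE)  # made-up labels
fold <- integer(length(y))
for (cl in unique(y)) {
  idx <- which(y == cl)
  # rep(1:3, length.out=.N) within each class, as in the data.table code above.
  fold[idx] <- rep(1:3, length.out = length(idx))
}
table(y, fold)  # each class is spread roughly evenly over the 3 folds
```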
In the code above, we set column roles for each task:

- fold to Fold, a column of fold IDs that we create in each data set, so that the splits in the code below are the same between benchmarking frameworks.
- stratum to Fold and tcol, so that down-sampling maintains the proportions of the values in these columns. This is important to ensure that all train and test sets have a reasonable number of observations of each class. When running mlr3resampling::proj_test(), tasks are down-sampled to get quicker train times, with a default of 10 samples per stratum.

Next, we define a list of two learners.
learner_list <- list(
mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("rpart")){
learner_list$rpart <- mlr3::LearnerClassifRpart$new()
}
for(learner_i in seq_along(learner_list)){
L <- learner_list[[learner_i]]
L$predict_type <- "prob"
}
In the code above, we use the prob predict type, because we want to use ROC-AUC as an evaluation metric, in addition to accuracy:
measure_list <- mlr3::msrs(c("classif.auc","classif.acc"))
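As a reminder of what these measures compute, here is a minimal base-R sketch of accuracy and ROC-AUC (via the rank-sum formula); these are not the mlr3 implementations:

```r
label <- c(1, 1, 0, 0, 1)            # made-up binary labels (1 = positive)
prob  <- c(0.9, 0.8, 0.7, 0.2, 0.6)  # made-up predicted probabilities
acc <- mean((prob > 0.5) == label)   # accuracy at threshold 0.5
# AUC via the rank-sum (Mann-Whitney) formula: the probability that a
# random positive example is ranked above a random negative example.
r <- rank(prob)
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
c(accuracy = acc, auc = auc)  # accuracy 0.8, auc 5/6
```

AUC is threshold-free, which is why we need probability predictions, whereas accuracy only needs the predicted class.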
In this section, we explain the two alternative methods for defining the grid of combinations.
## mlr3::benchmark_grid()

In the code below, we must save the result, called bgrid, which exists only in R (not yet saved to the file system).
(bgrid <- mlr3::benchmark_grid(tasks_with_fold, learner_list, kfoldcv))
After that, we save the grid to the file system, by first creating a batchtools registry. Various parallel computation methods are supported. Below we use a batchtools configuration file which will use SLURM, a cluster computing system.
extdata <- system.file(package="mlr3resampling", "extdata")
Sys.setenv(R_BATCHTOOLS_SEARCH_PATH=extdata) # comment this line out to use ~/.batchtools.conf.R instead.
The code above sets an environment variable that tells batchtools where to find the config file (that says to use SLURM when it is available for testing on GitHub Actions, otherwise sequential execution for testing on CRAN).
If you use batchtools on a cluster, you should create a ~/.batchtools.conf.R config file, as explained in my R batchtools on Monsoon tutorial.
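For reference, a minimal ~/.batchtools.conf.R for SLURM might look like the sketch below; the template file name is an assumption, and both it and the config should be adapted to your cluster:

```r
# ~/.batchtools.conf.R -- sketch only; adapt to your cluster.
# "slurm.tmpl" is a hypothetical template file describing the sbatch script.
cluster.functions <- batchtools::makeClusterFunctionsSlurm(
  template = "slurm.tmpl")
```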
Then when we make the registry in the code below, the batchtools config will be saved to disk, along with the benchmark grid.
if(requireNamespace("mlr3batchmark")){
reg_dir <- tempfile()
reg <- batchtools::makeExperimentRegistry(reg_dir)
slurm.available <- reg$cluster.functions$name=="Slurm"
mlr3batchmark::batchmark(bgrid)
}
#>Sourcing configuration file '/tmp/Rtmp6Vlihj/Rinst61f87d9e70d4/mlr3resampling/extdata/batchtools.conf.R' ...
#>Created registry in '/tmp/Rtmppy43XW/file62812d2e9acb' using cluster functions 'Slurm'
#>Adding algorithm 'run_learner'
#>Adding problem '860965eb37a273f7'
#>Exporting new objects: '44c00fd8d0c641cb' ...
#>Exporting new objects: 'c1c047f0c08761bb' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('860965eb37a273f7'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'ec1c23f2718a37f6'
#>Exporting new objects: 'b5dfb9daba57cb9e' ...
#>Adding 6 experiments ('ec1c23f2718a37f6'[1] x 'run_learner'[2] x repls[3]) ...
## mlr3resampling::proj_grid()

In the code below, the grid of combinations is saved to the proj_dir directory.
proj_dir <- if(interactive())"~/testproj" else tempfile()
unlink(proj_dir, recursive = TRUE)
mlr3resampling::proj_grid(
proj_dir, tasks_with_fold, learner_list, kfoldcv,
score_args = measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628154a6321a", max_jobs=1)
| task.i | learner.i | resampling.i | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2 | 1 | 1 | german_credit | classif.featureless | same_other_sizes_cv | full | same | 666 | 2 | 1 | 666 | 2 | same |
| 2 | 1 | 1 | german_credit | classif.featureless | same_other_sizes_cv | full | same | 666 | 3 | 1 | 666 | 3 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 1 | 1 | 666 | 1 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 2 | 1 | 666 | 2 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 3 | 1 | 666 | 3 | same |
At this step in the proposed code above,

- proj_dir defines the project directory, in which data and result files are saved. This is the analog of the registry directory, reg_dir, in the previous approach.
- score_args = measure_list defines the evaluation metrics we will use to measure the accuracy of predictions on the held-out test sets. Each parallel job will compute these evaluation metrics for one train/test split and one Learner.

After defining the grid of combinations, we typically want to do a small test on the local system, to make sure there are no errors, before submitting the full computation.
To test one job using the previous approach, we use:
if(requireNamespace("mlr3batchmark")){
batchtools::testJob(1)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>$learner_state
#>$param_vals
#>$param_vals$method
#>[1] "mode"
#>
#>
#>$log
#>Empty data.table (0 rows and 3 cols): stage,class,condition
#>
#>$train_time
#>elapsed
#> 0.001
#>
#>$task_hash
#>[1] "4e0e4517bbfd0a5f"
#>
#>$feature_names
#> [1] "address" "addresses" "all"
#> [4] "business" "capitalAve" "capitalLong"
#> [7] "capitalTotal" "charDollar" "charExclamation"
#>[10] "charHash" "charRoundbracket" "charSemicolon"
#>[13] "charSquarebracket" "conference" "credit"
#>[16] "cs" "data" "direct"
#>[19] "edu" "email" "font"
#>[22] "free" "george" "hp"
#>[25] "hpl" "internet" "lab"
#>[28] "labs" "mail" "make"
#>[31] "meeting" "money" "num000"
#>[34] "num1999" "num3d" "num415"
#>[37] "num650" "num85" "num857"
#>[40] "order" "original" "our"
#>[43] "over" "parts" "people"
#>[46] "pm" "project" "re"
#>[49] "receive" "remove" "report"
#>[52] "table" "technology" "telnet"
#>[55] "will" "you" "your"
#>
#>$validate
#>NULL
#>
#>$mlr3_version
#>[1] ‘1.6.0’
#>
#>$predict_time
#>[1] 0.004
#>
#>attr(,"class")
#>[1] "learner_state" "list"
#>
#>$prediction
#>$prediction$test
#><PredictionDataClassif:1535>
#>
#>
#>$param_values
#>$param_values$method
#>[1] "mode"
#>
#>
#>$learner_hash
#>[1] "c1c047f0c08761bb"
#>
#>$data_extra
#>NULL
#>
To test one job using the proposed approach, we use the code below. Note that the test output below contains meta-data (task, learner, …) as well as prediction accuracy metrics (AUC and accuracy), so is much more useful and interpretable than the test output above.
mlr3resampling::proj_test(proj_dir, max_jobs=1)
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#>1: 1 1 1 spam classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#>1: 50 1 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#>1: 1 1 1 1 spam classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#>1: 50 1 same 2026-04-28 14:58:41
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#>1: 2026-04-28 14:58:41 25217 0.5 0.6
#>
We tested a featureless learner, which is not a great test (other learners may fail for a variety of reasons). So typically we run a more extensive test, as in the next section.
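As context for the numbers above, a featureless classifier simply predicts the majority class of the training labels, so its accuracy equals the proportion of the majority class in the test set (about 0.61 for spam, 0.70 for german_credit). A base-R sketch with made-up labels:

```r
train_y <- c("nonspam", "nonspam", "nonspam", "spam", "spam")  # made-up labels
test_y  <- c("nonspam", "spam", "nonspam")
# Featureless prediction: always the most frequent training label.
majority <- names(which.max(table(train_y)))
pred <- rep(majority, length(test_y))
mean(pred == test_y)  # accuracy = proportion of majority class in test set
```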
A simple way to test each algorithm and data set with batchtools is via the code below, which uses repl==1 to consider only the first cross-validation iteration, run on the local machine:
if(requireNamespace("mlr3batchmark")){
jt <- batchtools::getJobTable()
jt1 <- jt[repl==1]
testJob.repl1 <- sapply(jt1$job.id, batchtools::testJob)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 4 (seed = 3883) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 7 (seed = 3886) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 10 (seed = 3889) ...
There is no error above, but it is not clear if the result is reasonable. Another way to do it with batchtools is to submit a subset of jobs, as below:
submit_job_array <- function(jobs_dt, minutes=1, gigabytes=1){
jobs_dt$chunk <- 1
batchtools::submitJobs(jobs_dt, resources=list(
walltime = minutes*60,#seconds
memory = gigabytes*1000,#megabytes per cpu
ncpus=1, #>1 for multicore/parallel jobs.
ntasks=1, #>1 for MPI jobs.
chunks.as.arrayjobs=slurm.available))
}
if(requireNamespace("mlr3batchmark")){
submit_job_array(jt1)
}
#>Submitting 4 jobs in 1 chunks using cluster functions 'Slurm' ...
The code above adds a job array, one parallel CPU per repl=1 cross-validation iteration, to the SLURM queue.
In the code below, we wait for the jobs to compute, then gather the results.
if(requireNamespace("mlr3batchmark")){
batchtools::waitForJobs(jt1)
test_res <- mlr3batchmark::reduceResultsBatchmark(jt1)
test_res$score(measure_list)
}
| uhash | nr | task | task_id | learner | learner_id | resampling | resampling_id | iteration | prediction_test | classif.auc | classif.acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8328b07f-178f-4c2c-81d8-a294ec67b2f6 | 1 | TaskClassif:spam | spam | LearnerClassifFeatureless:classif.featureless | classif.featureless | same_other_sizes_cv | 1 | 0.500 | 0.606 | ||
| 95668932-f531-421d-b8d6-a590c19031ad | 2 | TaskClassif:spam | spam | LearnerClassifRpart:classif.rpart | classif.rpart | same_other_sizes_cv | 1 | 0.889 | 0.892 | ||
| f3e139e5-0a2a-47a5-8336-2c2516eed410 | 3 | TaskClassif:german_credit | german_credit | LearnerClassifFeatureless:classif.featureless | classif.featureless | same_other_sizes_cv | 1 | 0.500 | 0.701 | ||
| 3b653707-b2cc-4480-bf6d-e4e5a891034b | 4 | TaskClassif:german_credit | german_credit | LearnerClassifRpart:classif.rpart | classif.rpart | same_other_sizes_cv | 1 | 0.721 | 0.740 |
The result above includes test AUC and accuracy values, which are a more convincing test (but this involves actually submitting jobs on the cluster).
Our proposed way to test is via:
mlr3resampling::proj_test(proj_dir)
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#>1: 1 1 1 spam classif.featureless
#>2: 1 2 1 spam classif.rpart
#>3: 2 1 1 german_credit classif.featureless
#>4: 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#>2: same_other_sizes_cv full same 50 1 1
#>3: same_other_sizes_cv full same 66 1 1
#>4: same_other_sizes_cv full same 66 1 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#>1: 50 1 same
#>2: 50 1 same
#>3: 66 1 same
#>4: 66 1 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#>1: 1 1 1 1 spam classif.featureless
#>2: 2 1 2 1 spam classif.rpart
#>3: 3 2 1 1 german_credit classif.featureless
#>4: 4 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#>2: same_other_sizes_cv full same 50 1 1
#>3: same_other_sizes_cv full same 66 1 1
#>4: same_other_sizes_cv full same 66 1 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#>1: 50 1 same 2026-04-28 14:58:48
#>2: 50 1 same 2026-04-28 14:58:48
#>3: 66 1 same 2026-04-28 14:58:48
#>4: 66 1 same 2026-04-28 14:58:48
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#>1: 2026-04-28 14:58:48 25217 0.5000000 0.6000000
#>2: 2026-04-28 14:58:48 25217 0.9100000 0.8400000
#>3: 2026-04-28 14:58:48 25217 0.5000000 0.6969697
#>4: 2026-04-28 14:58:48 25217 0.6630435 0.7575758
#>
The results above include test AUC and accuracy, but with different values because the tasks are down-sampled to make training faster.
For some small benchmarks, you may want to compute all results on your local machine (not a cluster). For both the previous and proposed approaches, we use the code below to enable computation in parallel using all the CPUs on the local machine.
if(interactive())future::plan("multisession")
The previous approach uses the code below:
bench_result <- mlr3::benchmark(bgrid)
bench_score <- bench_result$score(measure_list)
bench_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
| task_id | learner_id | iteration | classif.auc | classif.acc |
|---|---|---|---|---|
| spam | classif.featureless | 1 | 0.500 | 0.606 |
| spam | classif.featureless | 2 | 0.500 | 0.606 |
| spam | classif.featureless | 3 | 0.500 | 0.606 |
| spam | classif.rpart | 1 | 0.889 | 0.892 |
| spam | classif.rpart | 2 | 0.909 | 0.888 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| german_credit | classif.featureless | 2 | 0.500 | 0.700 |
| german_credit | classif.featureless | 3 | 0.500 | 0.700 |
| german_credit | classif.rpart | 1 | 0.721 | 0.740 |
| german_credit | classif.rpart | 2 | 0.771 | 0.775 |
| german_credit | classif.rpart | 3 | 0.688 | 0.730 |
The proposed approach uses the code below:
proj_score <- mlr3resampling::proj_compute_all(proj_dir)
proj_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
| task_id | learner_id | iteration | classif.auc | classif.acc |
|---|---|---|---|---|
| spam | classif.featureless | 1 | 0.500 | 0.606 |
| spam | classif.featureless | 2 | 0.500 | 0.606 |
| spam | classif.featureless | 3 | 0.500 | 0.606 |
| spam | classif.rpart | 1 | 0.889 | 0.892 |
| spam | classif.rpart | 2 | 0.909 | 0.888 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| german_credit | classif.featureless | 2 | 0.500 | 0.700 |
| german_credit | classif.featureless | 3 | 0.500 | 0.700 |
| german_credit | classif.rpart | 1 | 0.721 | 0.740 |
| german_credit | classif.rpart | 2 | 0.771 | 0.775 |
| german_credit | classif.rpart | 3 | 0.688 | 0.730 |
The results are identical because

- we use the fold column role, so there is no randomness in the cross-validation splits, and
- the rpart and featureless learners are deterministic (neither train nor predict is random).

Now we discuss how to run large benchmarks on a cluster.
## batchtools

In batchtools we can check the status of jobs in the registry using the code below.
if(requireNamespace("mlr3batchmark")){
batchtools::getStatus()
}
| defined | submitted | started | done | error | queued | running | expired | system |
|---|---|---|---|---|---|---|---|---|
| 12 | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
The output above shows there are 4 jobs done out of 12 in the registry. To launch the other jobs, we use the code below.
if(requireNamespace("mlr3batchmark")){
not.done <- batchtools::getJobTable()[is.na(done)]
submit_job_array(not.done)
}
#>Submitting 8 jobs in 1 chunks using cluster functions 'Slurm' ...
The code above adds a job array, one parallel CPU for each remaining cross-validation iteration, to the SLURM queue. In the code below, we wait for the jobs to compute, then gather the results.
if(requireNamespace("mlr3batchmark")){
batchtools::waitForJobs()
ignore.learner <- function(L){
L$learner_state$model <- NULL
L
}
bt_res <- mlr3batchmark::reduceResultsBatchmark(jt, fun=ignore.learner)
bt_score <- bt_res$score(measure_list)
}
Note that for large benchmarks, you must use fun to avoid loading all models into memory at once.

- fun=ignore.learner as above is good if you do not want to do model interpretation.
- For model interpretation, you would instead write your own fun which keeps only the parts of the model that are necessary. This can be cumbersome to do here, especially if there are several different learners to interpret (such as torch and glmnet, see examples below).

## mlr3resampling::proj_submit()

The code below does the full computation on SLURM using the proposed method.
if(slurm.available){
slurm_job_id <- mlr3resampling::proj_submit(
proj_dir, tasks=2, hours=1, gigabytes=1)
}
#>Loading required namespace: pbdMPI
The code above submits a SLURM MPI job with two tasks.
After all computations are done, the last worker saves a results.csv file, which can be read back into R using the code below.
(result_file_list <- mlr3resampling::proj_fread(proj_dir))
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#> 1: 1 1 1 spam classif.featureless
#> 2: 1 1 1 spam classif.featureless
#> 3: 1 1 1 spam classif.featureless
#> 4: 1 2 1 spam classif.rpart
#> 5: 1 2 1 spam classif.rpart
#> 6: 1 2 1 spam classif.rpart
#> 7: 2 1 1 german_credit classif.featureless
#> 8: 2 1 1 german_credit classif.featureless
#> 9: 2 1 1 german_credit classif.featureless
#>10: 2 2 1 german_credit classif.rpart
#>11: 2 2 1 german_credit classif.rpart
#>12: 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#> 1: same_other_sizes_cv full same 3067 1 1
#> 2: same_other_sizes_cv full same 3067 2 1
#> 3: same_other_sizes_cv full same 3067 3 1
#> 4: same_other_sizes_cv full same 3067 1 1
#> 5: same_other_sizes_cv full same 3067 2 1
#> 6: same_other_sizes_cv full same 3067 3 1
#> 7: same_other_sizes_cv full same 666 1 1
#> 8: same_other_sizes_cv full same 666 2 1
#> 9: same_other_sizes_cv full same 666 3 1
#>10: same_other_sizes_cv full same 666 1 1
#>11: same_other_sizes_cv full same 666 2 1
#>12: same_other_sizes_cv full same 666 3 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#> 1: 3067 1 same
#> 2: 3067 2 same
#> 3: 3067 3 same
#> 4: 3067 1 same
#> 5: 3067 2 same
#> 6: 3067 3 same
#> 7: 666 1 same
#> 8: 666 2 same
#> 9: 666 3 same
#>10: 666 1 same
#>11: 666 2 same
#>12: 666 3 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#> 1: 1 1 1 1 spam classif.featureless
#> 2: 10 2 2 1 german_credit classif.rpart
#> 3: 11 2 2 1 german_credit classif.rpart
#> 4: 12 2 2 1 german_credit classif.rpart
#> 5: 2 1 1 1 spam classif.featureless
#> 6: 3 1 1 1 spam classif.featureless
#> 7: 4 1 2 1 spam classif.rpart
#> 8: 5 1 2 1 spam classif.rpart
#> 9: 6 1 2 1 spam classif.rpart
#>10: 7 2 1 1 german_credit classif.featureless
#>11: 8 2 1 1 german_credit classif.featureless
#>12: 9 2 1 1 german_credit classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#> 1: same_other_sizes_cv full same 3067 1 1
#> 2: same_other_sizes_cv full same 666 1 1
#> 3: same_other_sizes_cv full same 666 2 1
#> 4: same_other_sizes_cv full same 666 3 1
#> 5: same_other_sizes_cv full same 3067 2 1
#> 6: same_other_sizes_cv full same 3067 3 1
#> 7: same_other_sizes_cv full same 3067 1 1
#> 8: same_other_sizes_cv full same 3067 2 1
#> 9: same_other_sizes_cv full same 3067 3 1
#>10: same_other_sizes_cv full same 666 1 1
#>11: same_other_sizes_cv full same 666 2 1
#>12: same_other_sizes_cv full same 666 3 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#> 1: 3067 1 same 2026-04-28 14:58:49
#> 2: 666 1 same 2026-04-28 14:58:49
#> 3: 666 2 same 2026-04-28 14:58:49
#> 4: 666 3 same 2026-04-28 14:58:49
#> 5: 3067 2 same 2026-04-28 14:58:49
#> 6: 3067 3 same 2026-04-28 14:58:49
#> 7: 3067 1 same 2026-04-28 14:58:49
#> 8: 3067 2 same 2026-04-28 14:58:49
#> 9: 3067 3 same 2026-04-28 14:58:49
#>10: 666 1 same 2026-04-28 14:58:49
#>11: 666 2 same 2026-04-28 14:58:49
#>12: 666 3 same 2026-04-28 14:58:49
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#> 1: 2026-04-28 14:58:49 25217 0.5000000 0.6058632
#> 2: 2026-04-28 14:58:49 25217 0.7214530 0.7395210
#> 3: 2026-04-28 14:58:49 25217 0.7712446 0.7747748
#> 4: 2026-04-28 14:58:50 25217 0.6879399 0.7297297
#> 5: 2026-04-28 14:58:49 25217 0.5000000 0.6060013
#> 6: 2026-04-28 14:58:49 25217 0.5000000 0.6060013
#> 7: 2026-04-28 14:58:49 25217 0.8889185 0.8918567
#> 8: 2026-04-28 14:58:49 25217 0.9085884 0.8884540
#> 9: 2026-04-28 14:58:49 25217 0.8906028 0.8956295
#>10: 2026-04-28 14:58:49 25217 0.5000000 0.7005988
#>11: 2026-04-28 14:58:49 25217 0.5000000 0.6996997
#>12: 2026-04-28 14:58:49 25217 0.5000000 0.6996997
#>
The results above show two tables:

- grid_jobs.csv has meta-data that the workers use to determine what computation to run.
- results.csv has columns including computation time, process ID, and test accuracy measures.

In the code below, we combine the results from both methods into one table for comparison.

acc_in_list <- list(
mlr3resampling=result_file_list$results.csv)
if(requireNamespace("mlr3batchmark"))
acc_in_list$mlr3batchmark <- bt_score
acc_out_list <- list()
library(data.table)
for(package in names(acc_in_list)){
acc_in <- melt(
acc_in_list[[package]],
id.vars=c("task_id", "learner_id", "iteration"),
measure.vars=c("classif.auc", "classif.acc"))
acc_out_list[[package]] <- data.table(package, acc_in)
}
acc_out <- rbindlist(acc_out_list)
(acc_compare <- dcast(
acc_out,
variable + task_id + learner_id + iteration ~ package))
| variable | task_id | learner_id | iteration | mlr3batchmark | mlr3resampling |
|---|---|---|---|---|---|
| classif.auc | german_credit | classif.featureless | 1 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.featureless | 2 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.featureless | 3 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.rpart | 1 | 0.721 | 0.721 |
| classif.auc | german_credit | classif.rpart | 2 | 0.771 | 0.771 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| classif.acc | spam | classif.featureless | 2 | 0.606 | 0.606 |
| classif.acc | spam | classif.featureless | 3 | 0.606 | 0.606 |
| classif.acc | spam | classif.rpart | 1 | 0.892 | 0.892 |
| classif.acc | spam | classif.rpart | 2 | 0.888 | 0.888 |
| classif.acc | spam | classif.rpart | 3 | 0.896 | 0.896 |
if(requireNamespace("mlr3batchmark"))
acc_compare[, all.equal(mlr3batchmark, mlr3resampling)]
#>[1] TRUE
We see above that all accuracy metrics are equal between the two methods. We plot the results below.
if(require(ggplot2)){
ggplot()+
geom_point(aes(
mlr3resampling, learner_id),
data=acc_compare)+
facet_wrap(c("task_id","variable"), labeller=label_both, scales="free", ncol=1)
}
The figure above shows that the decision tree learns something non-trivial in both data sets, and that spam is easier (more class balance, bigger improvement over featureless).
Analysis of computation time is not very interesting for this small data example, but this code could be useful for a larger benchmark.
time_compare <- rbind(
if(requireNamespace("mlr3batchmark"))batchtools::getJobTable()[, .(
package="mlr3batchmark", process=.I, start.time=started, end.time=done)],
result_file_list$results.csv[, .(
package="mlr3resampling", process, start.time, end.time)])
if(require(ggplot2)){
ggplot(time_compare, aes(start.time, process))+
geom_segment(aes(
xend=end.time, yend=process))+
geom_point()+
facet_grid(
package~.,
labeller=label_both,
scales="free")
}
## edit_learner() method for quick testing

For even faster test runs, Learners may define an edit_learner() method, which edits the learner to make training faster. For example, the default with AutoTuner and TorchLearner is to use only two epochs of training during proj_test(), so you can quickly see if results look reasonable, before running training with the full number of epochs, and the full data set.
proj_new <- if(interactive())"~/proj_new" else tempfile()
unlink(proj_new, recursive = TRUE)
learners_new <- list(
mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("torch") && torch::torch_is_installed()){
gen_linear <- torch::nn_module(
"my_linear",
initialize = function(task) {
self$weight = torch::nn_linear(task$n_features, 1)
},
forward = function(x) {
self$weight(x)
}
)
learners_new$torch <- mlr3resampling::AutoTunerTorch_epochs$new(
"torch_linear",
module_generator=gen_linear,
max_epochs=1000,
batch_size=10,
measure_list=mlr3::msrs("classif.auc")
)
}
#>Loading required namespace: torch
if(requireNamespace("glmnet")){
learners_new$glmnet <- mlr3resampling::LearnerClassifCVGlmnetSave$new()
}
for(learner_i in seq_along(learners_new)){
L <- learners_new[[learner_i]]
L$predict_type <- "prob"
}
mlr3resampling::proj_grid(
proj_new, tasks_with_fold$spam, learners_new, kfoldcv,
score_args = measure_list)
#>grid with 9 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62814f2cb14e", max_jobs=1)
| task.i | learner.i | resampling.i | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
system.time({
test_result_list <- mlr3resampling::proj_test(proj_new)
})
#> user system elapsed
#> 1.440 0.015 1.455
Notice how fast the above code is.
Even though the torch learner has a max of 1000 epochs, it is reduced to only 2 epochs when running proj_test().
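The idea behind edit_learner() can be sketched with a plain function that modifies a hypothetical learner's parameters; this illustrates the concept only, and is not the actual method definitions in mlr3resampling:

```r
# Hypothetical learner represented as a plain list, for illustration only.
learner <- list(id = "torch_linear", param_values = list(epochs = 1000))
edit_learner <- function(L) {
  # For a quick test run, drastically reduce the number of training epochs.
  L$param_values$epochs <- 2
  L
}
test_learner <- edit_learner(learner)
test_learner$param_values$epochs  # 2 during testing; 1000 for the full run
```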
## save_learner() method for model interpretation

The two learners above have save_learner() methods, defined in mlr3resampling/R/Learners.R.

- In mlr3resampling::proj_grid(), the default for most learners is to not save anything, because we don’t want to risk running out of memory by saving and loading a bunch of large models, if we are not interested in model interpretation.
- These two learners define save_learner() methods which return a named list of data tables, which should contain the important parts of learner$learner_state$model that we want to use for model interpretation.
- The part of the model to save is defined in the save_learner() method, rather than in the cluster results analysis script (fun argument of mlr3batchmark::reduceResultsBatchmark).
- The prefix learners_ and suffix .csv are appended to the name of each data table, to get a file in which to save these data.
- proj_test() returns all of these data tables in a list:

names(test_result_list)
#>[1] "grid_jobs.csv" "learners_history.csv" "learners_weights.csv"
#>[4] "results.csv"
We see that there are two learner tables. First, the history comes from torch:
test_result_list$learners_history.csv
| grid_job_i | epoch | train.classif.auc | valid.classif.auc | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.562 | 0.371 | spam | torch_linear | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 2 | 2 | 0.575 | 0.379 | spam | torch_linear | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
We see above that there are two rows (one for each epoch). In larger runs (more than two epochs), we plot these data to see if the torch model has a good fit (avoiding both overfitting and underfitting).
Next, the weights table comes from glmnet:
test_result_list$learners_weights.csv
| grid_job_i | feature | weight | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | address | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | addresses | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | all | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | business | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | capitalAve | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 3 | technology | -0.276 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | telnet | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | will | -0.230 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | you | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | your | 0.175 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
The table above contains one row for each input feature. Some are selected by L1 regularization (non-zero weights); the other features are not used for prediction (zero weights).
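To give an idea of what a save_learner() method returns, here is a sketch that extracts coefficients from a hypothetical fitted weight vector into a named list of tables; the real methods for torch and glmnet are defined in mlr3resampling/R/Learners.R:

```r
# Hypothetical coefficient vector standing in for a fitted glmnet model.
coefs <- c(address = 0, technology = -0.276, will = -0.230, your = 0.175)
save_learner <- function(weights) {
  # Return a named list of tables; each element is written to learners_<name>.csv.
  list(weights = data.frame(feature = names(weights), weight = unname(weights)))
}
saved <- save_learner(coefs)
saved$weights[saved$weights$weight != 0, ]  # features selected by L1 regularization
```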
When we ran proj_test() above, it created a test directory, which is another project directory, with a smaller task.
Each original task in the project is down-sampled proportionally (using strata), so that we can quickly test if training and prediction work on a smaller data set.
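The proportional, stratified down-sampling can be sketched in base R as follows; this only illustrates the idea of keeping a minimum number of samples per stratum, and is not the package's exact implementation:

```r
set.seed(1)
# Made-up strata: combinations of class and fold, with imbalanced classes.
stratum <- paste(
  rep(c("spam", "nonspam"), times = c(390, 610)),
  rep(1:3, length.out = 1000))
min_samples_per_stratum <- 10
tab <- table(stratum)
# Scale every stratum so the smallest one keeps min_samples_per_stratum rows,
# preserving the relative stratum proportions.
keep_per_stratum <- round(tab * min_samples_per_stratum / min(tab))
keep_idx <- unlist(lapply(names(tab), function(s) {
  idx <- which(stratum == s)
  sample(idx, keep_per_stratum[[s]])
}))
table(sub(" .*", "", stratum[keep_idx]))  # per-class counts after down-sampling
```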
The code below reads the down-sampled task:
rds.vec <- Sys.glob(file.path(proj_new,"test","tasks","*rds"))
for(rds.i in seq_along(rds.vec)){
mini_task <- readRDS(rds.vec[[rds.i]])
print(table(mini_task$data()[[1]]))
}
#>
#> spam nonspam
#> 30 45
The output above shows that there are at least 30 data per class in the down-sampled task, because

- the smallest stratum is down-sampled to 10 observations (the default min_samples_per_stratum), with the other strata scaled proportionally, and
- the strata are combinations of class and fold, so with 3 folds each class keeps at least 10 × 3 = 30 observations.

We compared mlr3resampling::proj_*() functions to their analogs in batchtools and mlr3batchmark. We saw that the interface is similar, with some convenience features for quick testing, and for efficient model interpretation.