The goal of this vignette is to compare and contrast calculation of machine learning benchmarks using
- the mlr3 and mlr3batchmark packages, which are the previous/existing methods, and
- the mlr3resampling::proj_*() functions, which are the new methods proposed in this package.

In mlr3, a benchmark is defined as a set of combinations of tasks, learners, and resampling iterations. The code in this section must be run using both methods (previous and proposed).
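Conceptually, the grid of benchmark combinations is just the cross-product of tasks, learners, and resampling iterations, which can be sketched in base R (made-up names, for illustration only):

```r
# Hypothetical task/learner names, to illustrate the cross-product idea only.
tasks <- c("spam", "german_credit")
learners <- c("featureless", "rpart")
iterations <- 1:3  # 3-fold CV
grid <- expand.grid(task = tasks, learner = learners, iteration = iterations)
nrow(grid)  # 2 tasks x 2 learners x 3 folds = 12 benchmark jobs
```

Each row of such a grid is one unit of computation (one train/test split for one learner on one task), which is what both frameworks parallelize.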
First we create an instance of 3-fold CV, which we will use as the train/test splitting method.
(kfoldcv <- mlr3resampling::ResamplingSameOtherSizesCV$new())
#>
#>── <ResamplingSameOtherSizesCV> : Compare Same/Other and Sizes Cross-Validation
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3, seeds=1, ratio=0.5, sizes=-1, ignore_subset=FALSE,
#>subsets=SOA
For reproducibility, we use ResamplingSameOtherSizesCV because it respects the fold column role (and mlr3::ResamplingCV does not).
Note that the resampling should be created before the Task, because constructing it loads the mlr3resampling package (which is needed to avoid errors about the unrecognized column roles subset and fold).
First we define a list of two tasks.
task_list <- mlr3::tsks(c("spam", "german_credit"))
tasks_with_fold <- list()
for(task_i in seq_along(task_list)){
task <- task_list[[task_i]]
tcol <- task$col_roles$target
task_dt <- task$data()
task_dt[, Fold := rep(1:3, length.out=.N), by=c(tcol)]
ftask <- mlr3::TaskClassif$new(
task_dt, id=task$id, target=tcol)
ftask$col_roles$feature <- task$col_roles$feature
ftask$col_roles$fold <- "Fold"
ftask$col_roles$stratum <- c("Fold", tcol)
tasks_with_fold[[task$id]] <- ftask
}
tasks_with_fold
#>$spam
#>
#>── <TaskClassif> (4601x58) ─────────────────────────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass, strata
#>• Features (57):
#> • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#> capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#> charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#> email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#> money, num000, num1999, num3d, num415, num650, num85, num857, order,
#> original, our, over, parts, people, pm, project, re, receive, remove, report,
#> table, technology, telnet, will, you, your
#>• Strata: Fold and type
#>
#>$german_credit
#>
#>── <TaskClassif> (1000x21) ─────────────────────────────────────────────────────
#>• Target: credit_risk
#>• Target classes: good (positive class, 70%), bad (30%)
#>• Properties: twoclass, strata
#>• Features (20):
#> • fct (14): credit_history, employment_duration, foreign_worker, housing,
#> job, other_debtors, other_installment_plans, people_liable,
#> personal_status_sex, property, purpose, savings, status, telephone
#> • int (3): age, amount, duration
#> • ord (3): installment_rate, number_credits, present_residence
#>• Strata: Fold and credit_risk
#>
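As an aside, the per-class fold assignment computed with data.table in the loop above can be mimicked in base R; this sketch uses made-up labels and is only meant to show the idea:

```r
set.seed(1)
y <- sample(c("spam", "nonspam"), 20, replace = TRUE)  # made-up labels
fold <- integer(length(y))
for (cl in unique(y)) {
  idx <- which(y == cl)
  # rep(1:3, length.out=.N) within each class, as in the data.table code above.
  fold[idx] <- rep(1:3, length.out = length(idx))
}
table(y, fold)  # each class is spread roughly evenly over the 3 folds
```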
In the code above, we set column roles for each task:

- fold to Fold, a column of fold IDs that we create in each data set, so that the splits in the code below are the same between benchmarking frameworks.
- stratum to Fold and tcol, so that down-sampling maintains the proportions of the values in these columns. This is important to ensure that all train and test sets have a reasonable number of observations of each class. When running mlr3resampling::proj_test(), tasks are down-sampled to get quicker train times, with a default of 10 samples per stratum.

Next, we define a list of two learners.
learner_list <- list(
mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("rpart")){
learner_list$rpart <- mlr3::LearnerClassifRpart$new()
}
for(learner_i in seq_along(learner_list)){
L <- learner_list[[learner_i]]
L$predict_type <- "prob"
}
In the code above, we use the prob predict type, because we want to use ROC-AUC as an evaluation metric, in addition to accuracy:
measure_list <- mlr3::msrs(c("classif.auc","classif.acc"))
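As a reminder of what these measures compute, here is a minimal base-R sketch of accuracy and ROC-AUC (via the rank-sum formula); these are not the mlr3 implementations:

```r
label <- c(1, 1, 0, 0, 1)            # made-up binary labels (1 = positive)
prob  <- c(0.9, 0.8, 0.7, 0.2, 0.6)  # made-up predicted probabilities
acc <- mean((prob > 0.5) == label)   # accuracy at threshold 0.5
# AUC via the rank-sum (Mann-Whitney) formula: the probability that a
# random positive example is ranked above a random negative example.
r <- rank(prob)
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
c(accuracy = acc, auc = auc)  # accuracy 0.8, auc 5/6
```

AUC is threshold-free, which is why we need probability predictions, whereas accuracy only needs the predicted class.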
In this section, we explain the two alternative methods for defining the grid of combinations.
## mlr3::benchmark_grid()

In the code below, we must save the result, called bgrid, which exists only in R (not yet saved to the file system).
(bgrid <- mlr3::benchmark_grid(tasks_with_fold, learner_list, kfoldcv))
After that, we save the grid to the file system, by first creating a batchtools registry. Various parallel computation methods are supported. Below we use a batchtools configuration file which will use SLURM, a cluster computing system.
extdata <- system.file(package="mlr3resampling", "extdata")
Sys.setenv(R_BATCHTOOLS_SEARCH_PATH=extdata) # comment this line out to use ~/.batchtools.conf.R instead.
The code above sets an environment variable that tells batchtools where to find the config file (that says to use SLURM when it is available for testing on GitHub Actions, otherwise sequential execution for testing on CRAN).
If you use batchtools on a cluster, you should create a ~/.batchtools.conf.R config file, as explained in my R batchtools on Monsoon tutorial.
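For reference, a minimal ~/.batchtools.conf.R for SLURM might look like the sketch below; the template file name is an assumption, and both it and the config should be adapted to your cluster:

```r
# ~/.batchtools.conf.R -- sketch only; adapt to your cluster.
# "slurm.tmpl" is a hypothetical template file describing the sbatch script.
cluster.functions <- batchtools::makeClusterFunctionsSlurm(
  template = "slurm.tmpl")
```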
Then when we make the registry in the code below, the batchtools config will be saved to disk, along with the benchmark grid.
if(requireNamespace("mlr3batchmark")){
reg_dir <- tempfile()
reg <- batchtools::makeExperimentRegistry(reg_dir)
slurm.available <- reg$cluster.functions$name=="Slurm"
mlr3batchmark::batchmark(bgrid)
}
#>Sourcing configuration file '/tmp/Rtmp6Vlihj/Rinst61f87d9e70d4/mlr3resampling/extdata/batchtools.conf.R' ...
#>Created registry in '/tmp/Rtmppy43XW/file62812d2e9acb' using cluster functions 'Slurm'
#>Adding algorithm 'run_learner'
#>Adding problem '860965eb37a273f7'
#>Exporting new objects: '44c00fd8d0c641cb' ...
#>Exporting new objects: 'c1c047f0c08761bb' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('860965eb37a273f7'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'ec1c23f2718a37f6'
#>Exporting new objects: 'b5dfb9daba57cb9e' ...
#>Adding 6 experiments ('ec1c23f2718a37f6'[1] x 'run_learner'[2] x repls[3]) ...
## mlr3resampling::proj_grid()

In the code below, the grid of combinations is saved to the proj_dir directory.
proj_dir <- if(interactive())"~/testproj" else tempfile()
unlink(proj_dir, recursive = TRUE)
mlr3resampling::proj_grid(
proj_dir, tasks_with_fold, learner_list, kfoldcv,
score_args = measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628154a6321a", max_jobs=1)
| task.i | learner.i | resampling.i | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 2 | 1 | spam | classif.rpart | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2 | 1 | 1 | german_credit | classif.featureless | same_other_sizes_cv | full | same | 666 | 2 | 1 | 666 | 2 | same |
| 2 | 1 | 1 | german_credit | classif.featureless | same_other_sizes_cv | full | same | 666 | 3 | 1 | 666 | 3 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 1 | 1 | 666 | 1 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 2 | 1 | 666 | 2 | same |
| 2 | 2 | 1 | german_credit | classif.rpart | same_other_sizes_cv | full | same | 666 | 3 | 1 | 666 | 3 | same |
At this step in the proposed code above,

- proj_dir defines the project directory, in which data and result files are saved. This is the analog of the registry directory, reg_dir, in the previous approach.
- score_args = measure_list defines the evaluation metrics we will use to measure the accuracy of predictions on the held-out test sets. Each parallel job will compute these evaluation metrics for one train/test split and one Learner.

After defining the grid of combinations, we typically want to do a small test on the local system, to make sure there are no errors, before submitting the full computation.
To test one job using the previous approach, we use:
if(requireNamespace("mlr3batchmark")){
batchtools::testJob(1)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>$learner_state
#>$param_vals
#>$param_vals$method
#>[1] "mode"
#>
#>
#>$log
#>Empty data.table (0 rows and 3 cols): stage,class,condition
#>
#>$train_time
#>elapsed
#> 0.001
#>
#>$task_hash
#>[1] "4e0e4517bbfd0a5f"
#>
#>$feature_names
#> [1] "address" "addresses" "all"
#> [4] "business" "capitalAve" "capitalLong"
#> [7] "capitalTotal" "charDollar" "charExclamation"
#>[10] "charHash" "charRoundbracket" "charSemicolon"
#>[13] "charSquarebracket" "conference" "credit"
#>[16] "cs" "data" "direct"
#>[19] "edu" "email" "font"
#>[22] "free" "george" "hp"
#>[25] "hpl" "internet" "lab"
#>[28] "labs" "mail" "make"
#>[31] "meeting" "money" "num000"
#>[34] "num1999" "num3d" "num415"
#>[37] "num650" "num85" "num857"
#>[40] "order" "original" "our"
#>[43] "over" "parts" "people"
#>[46] "pm" "project" "re"
#>[49] "receive" "remove" "report"
#>[52] "table" "technology" "telnet"
#>[55] "will" "you" "your"
#>
#>$validate
#>NULL
#>
#>$mlr3_version
#>[1] ‘1.6.0’
#>
#>$predict_time
#>[1] 0.004
#>
#>attr(,"class")
#>[1] "learner_state" "list"
#>
#>$prediction
#>$prediction$test
#><PredictionDataClassif:1535>
#>
#>
#>$param_values
#>$param_values$method
#>[1] "mode"
#>
#>
#>$learner_hash
#>[1] "c1c047f0c08761bb"
#>
#>$data_extra
#>NULL
#>
To test one job using the proposed approach, we use the code below. Note that the test output below contains meta-data (task, learner, …) as well as prediction accuracy metrics (AUC and accuracy), so is much more useful and interpretable than the test output above.
mlr3resampling::proj_test(proj_dir, max_jobs=1)
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#>1: 1 1 1 spam classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#>1: 50 1 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#>1: 1 1 1 1 spam classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#>1: 50 1 same 2026-04-28 14:58:41
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#>1: 2026-04-28 14:58:41 25217 0.5 0.6
#>
We tested a featureless learner, which is not a great test (other learners may fail for a variety of reasons). So typically we run a more extensive test, as in the next section.
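As context for the numbers above, a featureless classifier simply predicts the majority class of the training labels, so its accuracy equals the proportion of the majority class in the test set (about 0.61 for spam, 0.70 for german_credit). A base-R sketch with made-up labels:

```r
train_y <- c("nonspam", "nonspam", "nonspam", "spam", "spam")  # made-up labels
test_y  <- c("nonspam", "spam", "nonspam")
# Featureless prediction: always the most frequent training label.
majority <- names(which.max(table(train_y)))
pred <- rep(majority, length(test_y))
mean(pred == test_y)  # accuracy = proportion of majority class in test set
```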
A simple way to test each algorithm and data set with batchtools is via the code below, which uses repl==1 to consider only the first cross-validation iteration, run on the local machine:
if(requireNamespace("mlr3batchmark")){
jt <- batchtools::getJobTable()
jt1 <- jt[repl==1]
testJob.repl1 <- sapply(jt1$job.id, batchtools::testJob)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 4 (seed = 3883) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 7 (seed = 3886) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 10 (seed = 3889) ...
There is no error above, but it is not clear if the result is reasonable. Another way to do it with batchtools is to submit a subset of jobs, as below:
submit_job_array <- function(jobs_dt, minutes=1, gigabytes=1){
jobs_dt$chunk <- 1
batchtools::submitJobs(jobs_dt, resources=list(
walltime = minutes*60,#seconds
memory = gigabytes*1000,#megabytes per cpu
ncpus=1, #>1 for multicore/parallel jobs.
ntasks=1, #>1 for MPI jobs.
chunks.as.arrayjobs=slurm.available))
}
if(requireNamespace("mlr3batchmark")){
submit_job_array(jt1)
}
#>Submitting 4 jobs in 1 chunks using cluster functions 'Slurm' ...
The code above adds a job array, one parallel CPU per repl=1 cross-validation iteration, to the SLURM queue.
In the code below, we wait for the jobs to compute, then gather the results.
if(requireNamespace("mlr3batchmark")){
batchtools::waitForJobs(jt1)
test_res <- mlr3batchmark::reduceResultsBatchmark(jt1)
test_res$score(measure_list)
}
| uhash | nr | task | task_id | learner | learner_id | resampling | resampling_id | iteration | prediction_test | classif.auc | classif.acc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8328b07f-178f-4c2c-81d8-a294ec67b2f6 | 1 | TaskClassif:spam | spam | LearnerClassifFeatureless:classif.featureless | classif.featureless | same_other_sizes_cv | 1 | 0.500 | 0.606 | ||
| 95668932-f531-421d-b8d6-a590c19031ad | 2 | TaskClassif:spam | spam | LearnerClassifRpart:classif.rpart | classif.rpart | same_other_sizes_cv | 1 | 0.889 | 0.892 | ||
| f3e139e5-0a2a-47a5-8336-2c2516eed410 | 3 | TaskClassif:german_credit | german_credit | LearnerClassifFeatureless:classif.featureless | classif.featureless | same_other_sizes_cv | 1 | 0.500 | 0.701 | ||
| 3b653707-b2cc-4480-bf6d-e4e5a891034b | 4 | TaskClassif:german_credit | german_credit | LearnerClassifRpart:classif.rpart | classif.rpart | same_other_sizes_cv | 1 | 0.721 | 0.740 |
The result above includes test AUC and accuracy values, which are a more convincing test (but this involves actually submitting jobs on the cluster).
Our proposed way to test is via:
mlr3resampling::proj_test(proj_dir)
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#>1: 1 1 1 spam classif.featureless
#>2: 1 2 1 spam classif.rpart
#>3: 2 1 1 german_credit classif.featureless
#>4: 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#>2: same_other_sizes_cv full same 50 1 1
#>3: same_other_sizes_cv full same 66 1 1
#>4: same_other_sizes_cv full same 66 1 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#>1: 50 1 same
#>2: 50 1 same
#>3: 66 1 same
#>4: 66 1 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#>1: 1 1 1 1 spam classif.featureless
#>2: 2 1 2 1 spam classif.rpart
#>3: 3 2 1 1 german_credit classif.featureless
#>4: 4 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#>1: same_other_sizes_cv full same 50 1 1
#>2: same_other_sizes_cv full same 50 1 1
#>3: same_other_sizes_cv full same 66 1 1
#>4: same_other_sizes_cv full same 66 1 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#>1: 50 1 same 2026-04-28 14:58:48
#>2: 50 1 same 2026-04-28 14:58:48
#>3: 66 1 same 2026-04-28 14:58:48
#>4: 66 1 same 2026-04-28 14:58:48
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#>1: 2026-04-28 14:58:48 25217 0.5000000 0.6000000
#>2: 2026-04-28 14:58:48 25217 0.9100000 0.8400000
#>3: 2026-04-28 14:58:48 25217 0.5000000 0.6969697
#>4: 2026-04-28 14:58:48 25217 0.6630435 0.7575758
#>
The results above include test AUC and accuracy, but with different values because the tasks are down-sampled to make training faster.
For some small benchmarks, you may want to compute all results on your local machine (not a cluster). For both the previous and proposed approaches, we use the code below to enable computation in parallel using all the CPUs on the local machine.
if(interactive())future::plan("multisession")
The previous approach uses the code below:
bench_result <- mlr3::benchmark(bgrid)
bench_score <- bench_result$score(measure_list)
bench_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
| task_id | learner_id | iteration | classif.auc | classif.acc |
|---|---|---|---|---|
| spam | classif.featureless | 1 | 0.500 | 0.606 |
| spam | classif.featureless | 2 | 0.500 | 0.606 |
| spam | classif.featureless | 3 | 0.500 | 0.606 |
| spam | classif.rpart | 1 | 0.889 | 0.892 |
| spam | classif.rpart | 2 | 0.909 | 0.888 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| german_credit | classif.featureless | 2 | 0.500 | 0.700 |
| german_credit | classif.featureless | 3 | 0.500 | 0.700 |
| german_credit | classif.rpart | 1 | 0.721 | 0.740 |
| german_credit | classif.rpart | 2 | 0.771 | 0.775 |
| german_credit | classif.rpart | 3 | 0.688 | 0.730 |
The proposed approach uses the code below:
proj_score <- mlr3resampling::proj_compute_all(proj_dir)
proj_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
| task_id | learner_id | iteration | classif.auc | classif.acc |
|---|---|---|---|---|
| spam | classif.featureless | 1 | 0.500 | 0.606 |
| spam | classif.featureless | 2 | 0.500 | 0.606 |
| spam | classif.featureless | 3 | 0.500 | 0.606 |
| spam | classif.rpart | 1 | 0.889 | 0.892 |
| spam | classif.rpart | 2 | 0.909 | 0.888 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| german_credit | classif.featureless | 2 | 0.500 | 0.700 |
| german_credit | classif.featureless | 3 | 0.500 | 0.700 |
| german_credit | classif.rpart | 1 | 0.721 | 0.740 |
| german_credit | classif.rpart | 2 | 0.771 | 0.775 |
| german_credit | classif.rpart | 3 | 0.688 | 0.730 |
The results are identical because

- we use the fold column role, so there is no randomness in the cross-validation splits, and
- the rpart and featureless learners are deterministic (neither train nor predict is random).

Now we discuss how to run large benchmarks on a cluster.
## batchtools

In batchtools we can check the status of jobs in the registry using the code below.
if(requireNamespace("mlr3batchmark")){
batchtools::getStatus()
}
| defined | submitted | started | done | error | queued | running | expired | system |
|---|---|---|---|---|---|---|---|---|
| 12 | 4 | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
The output above shows there are 4 jobs done out of 12 in the registry. To launch the other jobs, we use the code below.
if(requireNamespace("mlr3batchmark")){
not.done <- batchtools::getJobTable()[is.na(done)]
submit_job_array(not.done)
}
#>Submitting 8 jobs in 1 chunks using cluster functions 'Slurm' ...
The code above adds a job array, one parallel CPU for each remaining cross-validation iteration, to the SLURM queue. In the code below, we wait for the jobs to compute, then gather the results.
if(requireNamespace("mlr3batchmark")){
batchtools::waitForJobs()
ignore.learner <- function(L){
L$learner_state$model <- NULL
L
}
bt_res <- mlr3batchmark::reduceResultsBatchmark(jt, fun=ignore.learner)
bt_score <- bt_res$score(measure_list)
}
Note that for large benchmarks, you must use fun to avoid loading all models into memory at once.

- fun=ignore.learner as above is good if you do not want to do model interpretation.
- For model interpretation, you would instead write your own fun which keeps only the parts of the model that are necessary. This can be cumbersome to do here, especially if there are several different learners to interpret (such as torch and glmnet, see examples below).

## mlr3resampling::proj_submit()

The code below does the full computation on SLURM using the proposed method.
if(slurm.available){
slurm_job_id <- mlr3resampling::proj_submit(
proj_dir, tasks=2, hours=1, gigabytes=1)
}
#>Loading required namespace: pbdMPI
The code above submits a SLURM MPI job with two tasks.
After all computations are done, the last worker saves a results.csv file, which can be read back into R using the code below.
(result_file_list <- mlr3resampling::proj_fread(proj_dir))
#>$grid_jobs.csv
#> task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <char> <char>
#> 1: 1 1 1 spam classif.featureless
#> 2: 1 1 1 spam classif.featureless
#> 3: 1 1 1 spam classif.featureless
#> 4: 1 2 1 spam classif.rpart
#> 5: 1 2 1 spam classif.rpart
#> 6: 1 2 1 spam classif.rpart
#> 7: 2 1 1 german_credit classif.featureless
#> 8: 2 1 1 german_credit classif.featureless
#> 9: 2 1 1 german_credit classif.featureless
#>10: 2 2 1 german_credit classif.rpart
#>11: 2 2 1 german_credit classif.rpart
#>12: 2 2 1 german_credit classif.rpart
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#> 1: same_other_sizes_cv full same 3067 1 1
#> 2: same_other_sizes_cv full same 3067 2 1
#> 3: same_other_sizes_cv full same 3067 3 1
#> 4: same_other_sizes_cv full same 3067 1 1
#> 5: same_other_sizes_cv full same 3067 2 1
#> 6: same_other_sizes_cv full same 3067 3 1
#> 7: same_other_sizes_cv full same 666 1 1
#> 8: same_other_sizes_cv full same 666 2 1
#> 9: same_other_sizes_cv full same 666 3 1
#>10: same_other_sizes_cv full same 666 1 1
#>11: same_other_sizes_cv full same 666 2 1
#>12: same_other_sizes_cv full same 666 3 1
#> n.train.groups iteration Train_subsets
#> <int> <int> <char>
#> 1: 3067 1 same
#> 2: 3067 2 same
#> 3: 3067 3 same
#> 4: 3067 1 same
#> 5: 3067 2 same
#> 6: 3067 3 same
#> 7: 666 1 same
#> 8: 666 2 same
#> 9: 666 3 same
#>10: 666 1 same
#>11: 666 2 same
#>12: 666 3 same
#>
#>$results.csv
#> grid_job_i task.i learner.i resampling.i task_id learner_id
#> <int> <int> <int> <int> <char> <char>
#> 1: 1 1 1 1 spam classif.featureless
#> 2: 10 2 2 1 german_credit classif.rpart
#> 3: 11 2 2 1 german_credit classif.rpart
#> 4: 12 2 2 1 german_credit classif.rpart
#> 5: 2 1 1 1 spam classif.featureless
#> 6: 3 1 1 1 spam classif.featureless
#> 7: 4 1 2 1 spam classif.rpart
#> 8: 5 1 2 1 spam classif.rpart
#> 9: 6 1 2 1 spam classif.rpart
#>10: 7 2 1 1 german_credit classif.featureless
#>11: 8 2 1 1 german_credit classif.featureless
#>12: 9 2 1 1 german_credit classif.featureless
#> resampling_id test.subset train.subsets groups test.fold seed
#> <char> <char> <char> <int> <int> <int>
#> 1: same_other_sizes_cv full same 3067 1 1
#> 2: same_other_sizes_cv full same 666 1 1
#> 3: same_other_sizes_cv full same 666 2 1
#> 4: same_other_sizes_cv full same 666 3 1
#> 5: same_other_sizes_cv full same 3067 2 1
#> 6: same_other_sizes_cv full same 3067 3 1
#> 7: same_other_sizes_cv full same 3067 1 1
#> 8: same_other_sizes_cv full same 3067 2 1
#> 9: same_other_sizes_cv full same 3067 3 1
#>10: same_other_sizes_cv full same 666 1 1
#>11: same_other_sizes_cv full same 666 2 1
#>12: same_other_sizes_cv full same 666 3 1
#> n.train.groups iteration Train_subsets start.time
#> <int> <int> <char> <POSc>
#> 1: 3067 1 same 2026-04-28 14:58:49
#> 2: 666 1 same 2026-04-28 14:58:49
#> 3: 666 2 same 2026-04-28 14:58:49
#> 4: 666 3 same 2026-04-28 14:58:49
#> 5: 3067 2 same 2026-04-28 14:58:49
#> 6: 3067 3 same 2026-04-28 14:58:49
#> 7: 3067 1 same 2026-04-28 14:58:49
#> 8: 3067 2 same 2026-04-28 14:58:49
#> 9: 3067 3 same 2026-04-28 14:58:49
#>10: 666 1 same 2026-04-28 14:58:49
#>11: 666 2 same 2026-04-28 14:58:49
#>12: 666 3 same 2026-04-28 14:58:49
#> end.time process classif.auc classif.acc
#> <POSc> <int> <num> <num>
#> 1: 2026-04-28 14:58:49 25217 0.5000000 0.6058632
#> 2: 2026-04-28 14:58:49 25217 0.7214530 0.7395210
#> 3: 2026-04-28 14:58:49 25217 0.7712446 0.7747748
#> 4: 2026-04-28 14:58:50 25217 0.6879399 0.7297297
#> 5: 2026-04-28 14:58:49 25217 0.5000000 0.6060013
#> 6: 2026-04-28 14:58:49 25217 0.5000000 0.6060013
#> 7: 2026-04-28 14:58:49 25217 0.8889185 0.8918567
#> 8: 2026-04-28 14:58:49 25217 0.9085884 0.8884540
#> 9: 2026-04-28 14:58:49 25217 0.8906028 0.8956295
#>10: 2026-04-28 14:58:49 25217 0.5000000 0.7005988
#>11: 2026-04-28 14:58:49 25217 0.5000000 0.6996997
#>12: 2026-04-28 14:58:49 25217 0.5000000 0.6996997
#>
The results above show two tables:

- grid_jobs.csv has meta-data that the workers use to determine what computation to run.
- results.csv has columns including computation time, process ID, and test accuracy measures.

In the code below, we combine the results from both methods into one table for comparison.

acc_in_list <- list(
mlr3resampling=result_file_list$results.csv)
if(requireNamespace("mlr3batchmark"))
acc_in_list$mlr3batchmark <- bt_score
acc_out_list <- list()
library(data.table)
for(package in names(acc_in_list)){
acc_in <- melt(
acc_in_list[[package]],
id.vars=c("task_id", "learner_id", "iteration"),
measure.vars=c("classif.auc", "classif.acc"))
acc_out_list[[package]] <- data.table(package, acc_in)
}
acc_out <- rbindlist(acc_out_list)
(acc_compare <- dcast(
acc_out,
variable + task_id + learner_id + iteration ~ package))
| variable | task_id | learner_id | iteration | mlr3batchmark | mlr3resampling |
|---|---|---|---|---|---|
| classif.auc | german_credit | classif.featureless | 1 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.featureless | 2 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.featureless | 3 | 0.500 | 0.500 |
| classif.auc | german_credit | classif.rpart | 1 | 0.721 | 0.721 |
| classif.auc | german_credit | classif.rpart | 2 | 0.771 | 0.771 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| classif.acc | spam | classif.featureless | 2 | 0.606 | 0.606 |
| classif.acc | spam | classif.featureless | 3 | 0.606 | 0.606 |
| classif.acc | spam | classif.rpart | 1 | 0.892 | 0.892 |
| classif.acc | spam | classif.rpart | 2 | 0.888 | 0.888 |
| classif.acc | spam | classif.rpart | 3 | 0.896 | 0.896 |
if(requireNamespace("mlr3batchmark"))
acc_compare[, all.equal(mlr3batchmark, mlr3resampling)]
#>[1] TRUE
We see above that all accuracy metrics are equal between the two methods. We plot the results below.
if(require(ggplot2)){
ggplot()+
geom_point(aes(
mlr3resampling, learner_id),
data=acc_compare)+
facet_wrap(c("task_id","variable"), labeller=label_both, scales="free", ncol=1)
}
The figure above shows that the decision tree learns something non-trivial in both data sets, and that spam is easier (more class balance, bigger improvement over featureless).
Analysis of computation time is not very interesting for this small data example, but this code could be useful for a larger benchmark.
time_compare <- rbind(
if(requireNamespace("mlr3batchmark"))batchtools::getJobTable()[, .(
package="mlr3batchmark", process=.I, start.time=started, end.time=done)],
result_file_list$results.csv[, .(
package="mlr3resampling", process, start.time, end.time)])
if(require(ggplot2)){
ggplot(time_compare, aes(start.time, process))+
geom_segment(aes(
xend=end.time, yend=process))+
geom_point()+
facet_grid(
package~.,
labeller=label_both,
scales="free")
}
## edit_learner() method for quick testing

For even faster test runs, Learners may define an edit_learner() method, which edits the learner to make training faster. For example, the default with AutoTuner and TorchLearner is to use only two epochs of training during proj_test(), so you can quickly see if results look reasonable, before running training with the full number of epochs, and the full data set.
proj_new <- if(interactive())"~/proj_new" else tempfile()
unlink(proj_new, recursive = TRUE)
learners_new <- list(
mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("torch") && torch::torch_is_installed()){
gen_linear <- torch::nn_module(
"my_linear",
initialize = function(task) {
self$weight = torch::nn_linear(task$n_features, 1)
},
forward = function(x) {
self$weight(x)
}
)
learners_new$torch <- mlr3resampling::AutoTunerTorch_epochs$new(
"torch_linear",
module_generator=gen_linear,
max_epochs=1000,
batch_size=10,
measure_list=mlr3::msrs("classif.auc")
)
}
#>Loading required namespace: torch
if(requireNamespace("glmnet")){
learners_new$glmnet <- mlr3resampling::LearnerClassifCVGlmnetSave$new()
}
for(learner_i in seq_along(learners_new)){
L <- learners_new[[learner_i]]
L$predict_type <- "prob"
}
mlr3resampling::proj_grid(
proj_new, tasks_with_fold$spam, learners_new, kfoldcv,
score_args = measure_list)
#>grid with 9 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62814f2cb14e", max_jobs=1)
| task.i | learner.i | resampling.i | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 1 | 1 | spam | classif.featureless | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 2 | 1 | spam | torch_linear | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 1 | 1 | 3067 | 1 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 2 | 1 | 3067 | 2 | same |
| 1 | 3 | 1 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 3067 | 3 | 1 | 3067 | 3 | same |
system.time({
test_result_list <- mlr3resampling::proj_test(proj_new)
})
#> user system elapsed
#> 1.440 0.015 1.455
Notice how fast the above code is.
Even though the torch learner has a max of 1000 epochs, it is reduced to only 2 epochs when running proj_test().
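The idea behind edit_learner() can be sketched with a plain function that modifies a hypothetical learner's parameters; this illustrates the concept only, and is not the actual method definitions in mlr3resampling:

```r
# Hypothetical learner represented as a plain list, for illustration only.
learner <- list(id = "torch_linear", param_values = list(epochs = 1000))
edit_learner <- function(L) {
  # For a quick test run, drastically reduce the number of training epochs.
  L$param_values$epochs <- 2
  L
}
test_learner <- edit_learner(learner)
test_learner$param_values$epochs  # 2 during testing; 1000 for the full run
```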
## save_learner() method for model interpretation

The two learners above have save_learner() methods, defined in mlr3resampling/R/Learners.R.

- In mlr3resampling::proj_grid(), the default for most learners is to not save anything, because we don’t want to risk running out of memory by saving and loading a bunch of large models, if we are not interested in model interpretation.
- These two learners define save_learner() methods which return a named list of data tables, which should contain the important parts of learner$learner_state$model that we want to use for model interpretation.
- The part of the model to save is defined in the save_learner() method, rather than in the cluster results analysis script (fun argument of mlr3batchmark::reduceResultsBatchmark).
- The prefix learners_ and suffix .csv are appended to the name of each data table, to get a file in which to save these data.
- proj_test() returns all of these data tables in a list:

names(test_result_list)
#>[1] "grid_jobs.csv" "learners_history.csv" "learners_weights.csv"
#>[4] "results.csv"
We see that there are two learner tables. First, the history comes from torch:
test_result_list$learners_history.csv
| grid_job_i | epoch | train.classif.auc | valid.classif.auc | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0.562 | 0.371 | spam | torch_linear | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 2 | 2 | 0.575 | 0.379 | spam | torch_linear | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
We see above that there are two rows (one for each epoch). In larger runs (more than two epochs), we plot these data to see if the torch model has a good fit (avoiding both overfitting and underfitting).
Next, the weights table comes from glmnet:
test_result_list$learners_weights.csv
| grid_job_i | feature | weight | task_id | learner_id | resampling_id | test.subset | train.subsets | groups | test.fold | seed | n.train.groups | iteration | Train_subsets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | address | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | addresses | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | all | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | business | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | capitalAve | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 3 | technology | -0.276 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | telnet | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | will | -0.230 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | you | 0.000 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
| 3 | your | 0.175 | spam | classif.cv_glmnet | same_other_sizes_cv | full | same | 50 | 1 | 1 | 50 | 1 | same |
The table above contains one row for each input feature. Some are selected by L1 regularization (non-zero weights); the other features are not used for prediction (zero weights).
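To give an idea of what a save_learner() method returns, here is a sketch that extracts coefficients from a hypothetical fitted weight vector into a named list of tables; the real methods for torch and glmnet are defined in mlr3resampling/R/Learners.R:

```r
# Hypothetical coefficient vector standing in for a fitted glmnet model.
coefs <- c(address = 0, technology = -0.276, will = -0.230, your = 0.175)
save_learner <- function(weights) {
  # Return a named list of tables; each element is written to learners_<name>.csv.
  list(weights = data.frame(feature = names(weights), weight = unname(weights)))
}
saved <- save_learner(coefs)
saved$weights[saved$weights$weight != 0, ]  # features selected by L1 regularization
```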
When we ran proj_test() above, it created a test directory, which is another project directory, with a smaller task.
Each original task in the project is down-sampled proportionally (using strata), so that we can quickly test if training and prediction work on a smaller data set.
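The proportional, stratified down-sampling can be sketched in base R as follows; this only illustrates the idea of keeping a minimum number of samples per stratum, and is not the package's exact implementation:

```r
set.seed(1)
# Made-up strata: combinations of class and fold, with imbalanced classes.
stratum <- paste(
  rep(c("spam", "nonspam"), times = c(390, 610)),
  rep(1:3, length.out = 1000))
min_samples_per_stratum <- 10
tab <- table(stratum)
# Scale every stratum so the smallest one keeps min_samples_per_stratum rows,
# preserving the relative stratum proportions.
keep_per_stratum <- round(tab * min_samples_per_stratum / min(tab))
keep_idx <- unlist(lapply(names(tab), function(s) {
  idx <- which(stratum == s)
  sample(idx, keep_per_stratum[[s]])
}))
table(sub(" .*", "", stratum[keep_idx]))  # per-class counts after down-sampling
```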
The code below reads the down-sampled task:
rds.vec <- Sys.glob(file.path(proj_new,"test","tasks","*rds"))
for(rds.i in seq_along(rds.vec)){
mini_task <- readRDS(rds.vec[[rds.i]])
print(table(mini_task$data()[[1]]))
}
#>
#> spam nonspam
#> 30 45
The output above shows that there are at least 30 data per class in the down-sampled task, because

- the smallest stratum is down-sampled to 10 observations (the default min_samples_per_stratum), with the other strata scaled proportionally, and
- the strata are combinations of class and fold, so with 3 folds each class keeps at least 10 × 3 = 30 observations.

We compared mlr3resampling::proj_*() functions to their analogs in batchtools and mlr3batchmark. We saw that the interface is similar, with some convenience features for quick testing, and for efficient model interpretation.