The goal of this vignette is to compare and contrast calculation of machine learning benchmarks using two methods: the previous approach, based on mlr3::benchmark_grid() with mlr3batchmark and batchtools, and the proposed approach, based on mlr3resampling::proj_grid().

Introduction: tasks, learners, resampling

In mlr3, a benchmark is defined as a set of combinations of tasks, learners, and resampling iterations. The code in this section is common setup that must be run before using either method (previous or proposed).

Resampling

First we create an instance of 3-fold CV, which we will use as the train/test splitting method.

(kfoldcv <- mlr3resampling::ResamplingSameOtherSizesCV$new())
#>
#>── <ResamplingSameOtherSizesCV> : Compare Same/Other and Sizes Cross-Validation 
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3, seeds=1, ratio=0.5, sizes=-1, ignore_subset=FALSE,
#>subsets=SOA

For reproducibility, we use ResamplingSameOtherSizesCV because it respects the fold column role (whereas mlr3::ResamplingCV does not). Note that the resampling should be created before the tasks, because creating it loads the mlr3resampling package (which is needed to avoid errors about the unrecognized column roles subset and fold).

Tasks

First we define a list of two tasks.

task_list <- mlr3::tsks(c("spam", "german_credit"))
tasks_with_fold <- list()
for(task_i in seq_along(task_list)){
  task <- task_list[[task_i]]
  tcol <- task$col_roles$target
  task_dt <- task$data()
  task_dt[, Fold := rep(1:3, length.out=.N), by=c(tcol)]
  ftask <- mlr3::TaskClassif$new(
    task_dt, id=task$id, target=tcol)
  ftask$col_roles$feature <- task$col_roles$feature
  ftask$col_roles$fold <- "Fold"
  ftask$col_roles$stratum <- c("Fold", tcol)
  tasks_with_fold[[task$id]] <- ftask
}
tasks_with_fold
#>$spam
#>
#>── <TaskClassif> (4601x58) ─────────────────────────────────────────────────────
#>• Target: type
#>• Target classes: spam (positive class, 39%), nonspam (61%)
#>• Properties: twoclass, strata
#>• Features (57):
#>  • dbl (57): address, addresses, all, business, capitalAve, capitalLong,
#>  capitalTotal, charDollar, charExclamation, charHash, charRoundbracket,
#>  charSemicolon, charSquarebracket, conference, credit, cs, data, direct, edu,
#>  email, font, free, george, hp, hpl, internet, lab, labs, mail, make, meeting,
#>  money, num000, num1999, num3d, num415, num650, num85, num857, order,
#>  original, our, over, parts, people, pm, project, re, receive, remove, report,
#>  table, technology, telnet, will, you, your
#>• Strata: Fold and type
#>
#>$german_credit
#>
#>── <TaskClassif> (1000x21) ─────────────────────────────────────────────────────
#>• Target: credit_risk
#>• Target classes: good (positive class, 70%), bad (30%)
#>• Properties: twoclass, strata
#>• Features (20):
#>  • fct (14): credit_history, employment_duration, foreign_worker, housing,
#>  job, other_debtors, other_installment_plans, people_liable,
#>  personal_status_sex, property, purpose, savings, status, telephone
#>  • int (3): age, amount, duration
#>  • ord (3): installment_rate, number_credits, present_residence
#>• Strata: Fold and credit_risk
#>

In the code above, we set two column roles for each task: fold, so that the Fold column determines the cross-validation train/test splits, and stratum, so that each fold contains a balanced proportion of each target class.
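To make the fold assignment concrete, below is a standalone sketch (using a small hypothetical data table, not one of the vignette's tasks) of what `Fold := rep(1:3, length.out=.N), by=target` does: within each class, rows are dealt out to folds 1, 2, 3 in turn, so the folds end up balanced within each class.

```r
library(data.table)
# Hypothetical toy data: 9 rows of class "a" and 6 rows of class "b".
toy <- data.table(y = rep(c("a", "b"), c(9, 6)))
# Same fold assignment as in the task loop above.
toy[, Fold := rep(1:3, length.out = .N), by = y]
# Count rows per (class, fold): each fold gets 3 "a" rows and 2 "b" rows.
toy[, .N, by = .(y, Fold)]
```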

Learners

Next, we define a list of two learners.

learner_list <- list(
  mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("rpart")){
  learner_list$rpart <- mlr3::LearnerClassifRpart$new()
}
for(learner_i in seq_along(learner_list)){
  L <- learner_list[[learner_i]]
  L$predict_type <- "prob"
}

In the code above, we use the prob predict type, because we want to use ROC-AUC as an evaluation metric, in addition to accuracy:

measure_list <- mlr3::msrs(c("classif.auc","classif.acc"))
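As a quick sanity check (not part of the original workflow), you can inspect which predict type each measure requires: classif.auc needs "prob", while classif.acc works with the default "response".

```r
# Inspect the predict type required by each measure.
sapply(mlr3::msrs(c("classif.auc", "classif.acc")), function(m) m$predict_type)
```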

Define the grid of combinations

In this section, we explain the two alternative methods for defining the grid of combinations.

Previous method, mlr3::benchmark_grid()

In the code below, we save the result, bgrid, which at this point exists only in R memory (not yet saved to the file system).

(bgrid <- mlr3::benchmark_grid(tasks_with_fold, learner_list, kfoldcv))
#> task learner resampling
#> TaskClassif:spam LearnerClassifFeatureless:classif.featureless
#> TaskClassif:spam LearnerClassifRpart:classif.rpart
#> TaskClassif:german_credit LearnerClassifFeatureless:classif.featureless
#> TaskClassif:german_credit LearnerClassifRpart:classif.rpart

After that, we save the grid to the file system, by first creating a batchtools registry. Various parallel computation methods are supported. Below we use a batchtools configuration file which will use SLURM, a cluster computing system.

extdata <- system.file(package="mlr3resampling", "extdata")
Sys.setenv(R_BATCHTOOLS_SEARCH_PATH=extdata) #comment to use ~/.batchtools.conf.R instead.

The code above sets an environment variable that tells batchtools where to find the config file (which uses SLURM when it is available, for testing on GitHub Actions, and sequential execution otherwise, for testing on CRAN). If you use batchtools on a cluster, you should create a ~/.batchtools.conf.R config file, as explained in my R batchtools on Monsoon tutorial. Then when we make the registry in the code below, the batchtools config will be saved to disk, along with the benchmark grid.
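For reference, a minimal ~/.batchtools.conf.R for a SLURM cluster can be a single line, as in the sketch below (the template options are site-specific, so treat this as a starting point, not the exact config file shipped with mlr3resampling):

```r
# Minimal sketch of a ~/.batchtools.conf.R that tells batchtools to use SLURM.
# batchtools ships a default "slurm" template; pass template="/path/to/your.tmpl"
# if your cluster needs site-specific #SBATCH options.
cluster.functions <- batchtools::makeClusterFunctionsSlurm()
```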

if(requireNamespace("mlr3batchmark")){
  reg_dir <- tempfile()
  reg <- batchtools::makeExperimentRegistry(reg_dir)
  slurm.available <- reg$cluster.functions$name=="Slurm"
  mlr3batchmark::batchmark(bgrid)
}
#>Sourcing configuration file '/tmp/Rtmp6Vlihj/Rinst61f87d9e70d4/mlr3resampling/extdata/batchtools.conf.R' ...
#>Created registry in '/tmp/Rtmppy43XW/file62812d2e9acb' using cluster functions 'Slurm'
#>Adding algorithm 'run_learner'
#>Adding problem '860965eb37a273f7'
#>Exporting new objects: '44c00fd8d0c641cb' ...
#>Exporting new objects: 'c1c047f0c08761bb' ...
#>Exporting new objects: '2099aa995d4e20f7' ...
#>Exporting new objects: 'ecf8ee265ec56766' ...
#>Overwriting previously exported object: 'ecf8ee265ec56766'
#>Adding 6 experiments ('860965eb37a273f7'[1] x 'run_learner'[2] x repls[3]) ...
#>Adding problem 'ec1c23f2718a37f6'
#>Exporting new objects: 'b5dfb9daba57cb9e' ...
#>Adding 6 experiments ('ec1c23f2718a37f6'[1] x 'run_learner'[2] x repls[3]) ...

Proposed method, mlr3resampling::proj_grid()

In the code below, the grid of combinations is saved to the proj_dir directory.

proj_dir <- if(interactive())"~/testproj" else tempfile()
unlink(proj_dir, recursive = TRUE)
mlr3resampling::proj_grid(
  proj_dir, tasks_with_fold, learner_list, kfoldcv,
  score_args = measure_list)
#>grid with 12 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file628154a6321a", max_jobs=1)
#> task.i learner.i resampling.i task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 1 1 3067 1 same
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 2 1 3067 2 same
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 3 1 3067 3 same
#> 1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 1 1 3067 1 same
#> 1 2 1 spam classif.rpart same_other_sizes_cv full same 3067 2 1 3067 2 same
#> 2 1 1 german_credit classif.featureless same_other_sizes_cv full same 666 2 1 666 2 same
#> 2 1 1 german_credit classif.featureless same_other_sizes_cv full same 666 3 1 666 3 same
#> 2 2 1 german_credit classif.rpart same_other_sizes_cv full same 666 1 1 666 1 same
#> 2 2 1 german_credit classif.rpart same_other_sizes_cv full same 666 2 1 666 2 same
#> 2 2 1 german_credit classif.rpart same_other_sizes_cv full same 666 3 1 666 3 same

At this step in the proposed code above, the grid of combinations has already been saved to the proj_dir directory on the file system, but no benchmark computations have been run yet.

Testing

After defining the grid of combinations, we typically want to do a small test on the local system, to make sure there are no errors, before submitting the full computation.

Testing one job

To test one job using the previous approach, we use:

if(requireNamespace("mlr3batchmark")){
  batchtools::testJob(1)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>$learner_state
#>$param_vals
#>$param_vals$method
#>[1] "mode"
#>
#>
#>$log
#>Empty data.table (0 rows and 3 cols): stage,class,condition
#>
#>$train_time
#>elapsed 
#>  0.001 
#>
#>$task_hash
#>[1] "4e0e4517bbfd0a5f"
#>
#>$feature_names
#> [1] "address"           "addresses"         "all"              
#> [4] "business"          "capitalAve"        "capitalLong"      
#> [7] "capitalTotal"      "charDollar"        "charExclamation"  
#>[10] "charHash"          "charRoundbracket"  "charSemicolon"    
#>[13] "charSquarebracket" "conference"        "credit"           
#>[16] "cs"                "data"              "direct"           
#>[19] "edu"               "email"             "font"             
#>[22] "free"              "george"            "hp"               
#>[25] "hpl"               "internet"          "lab"              
#>[28] "labs"              "mail"              "make"             
#>[31] "meeting"           "money"             "num000"           
#>[34] "num1999"           "num3d"             "num415"           
#>[37] "num650"            "num85"             "num857"           
#>[40] "order"             "original"          "our"              
#>[43] "over"              "parts"             "people"           
#>[46] "pm"                "project"           "re"               
#>[49] "receive"           "remove"            "report"           
#>[52] "table"             "technology"        "telnet"           
#>[55] "will"              "you"               "your"             
#>
#>$validate
#>NULL
#>
#>$mlr3_version
#>[1] ‘1.6.0’
#>
#>$predict_time
#>[1] 0.004
#>
#>attr(,"class")
#>[1] "learner_state" "list"         
#>
#>$prediction
#>$prediction$test
#><PredictionDataClassif:1535>
#>
#>
#>$param_values
#>$param_values$method
#>[1] "mode"
#>
#>
#>$learner_hash
#>[1] "c1c047f0c08761bb"
#>
#>$data_extra
#>NULL
#>

To test one job using the proposed approach, we use the code below. Note that the test output below contains meta-data (task, learner, …) as well as prediction accuracy metrics (AUC and accuracy), so it is much more useful and interpretable than the test output above.

mlr3resampling::proj_test(proj_dir, max_jobs=1)
#>$grid_jobs.csv
#>   task.i learner.i resampling.i task_id          learner_id
#>    <int>     <int>        <int>  <char>              <char>
#>1:      1         1            1    spam classif.featureless
#>         resampling_id test.subset train.subsets groups test.fold  seed
#>                <char>      <char>        <char>  <int>     <int> <int>
#>1: same_other_sizes_cv        full          same     50         1     1
#>   n.train.groups iteration Train_subsets
#>            <int>     <int>        <char>
#>1:             50         1          same
#>
#>$results.csv
#>   grid_job_i task.i learner.i resampling.i task_id          learner_id
#>        <int>  <int>     <int>        <int>  <char>              <char>
#>1:          1      1         1            1    spam classif.featureless
#>         resampling_id test.subset train.subsets groups test.fold  seed
#>                <char>      <char>        <char>  <int>     <int> <int>
#>1: same_other_sizes_cv        full          same     50         1     1
#>   n.train.groups iteration Train_subsets          start.time
#>            <int>     <int>        <char>              <POSc>
#>1:             50         1          same 2026-04-28 14:58:41
#>              end.time process classif.auc classif.acc
#>                <POSc>   <int>       <num>       <num>
#>1: 2026-04-28 14:58:41   25217         0.5         0.6
#>

We tested a featureless learner, which is not a great test (other learners may fail for a variety of reasons). So typically we run a more extensive test, as in the next section.

Test one job for each algo and data set

A simple way to test each algorithm and data set with batchtools is via the code below, which uses repl==1 to consider only the first cross-validation iteration, run on the local machine:

if(requireNamespace("mlr3batchmark")){
  jt <- batchtools::getJobTable()
  jt1 <- jt[repl==1]
  testJob.repl1 <- sapply(jt1$job.id, batchtools::testJob)
}
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 1 (seed = 3880) ...
#>### [bt]: Generating problem instance for problem '860965eb37a273f7' ...
#>### [bt]: Applying algorithm 'run_learner' on problem '860965eb37a273f7' for job 4 (seed = 3883) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 7 (seed = 3886) ...
#>### [bt]: Generating problem instance for problem 'ec1c23f2718a37f6' ...
#>### [bt]: Applying algorithm 'run_learner' on problem 'ec1c23f2718a37f6' for job 10 (seed = 3889) ...

There is no error above, but it is not clear if the result is reasonable. Another way to do it with batchtools is to submit a subset of jobs, as below:

submit_job_array <- function(jobs_dt, minutes=1, gigabytes=1){
  jobs_dt$chunk <- 1
  batchtools::submitJobs(jobs_dt, resources=list(
    walltime = minutes*60,#seconds
    memory = gigabytes*1000,#megabytes per cpu
    ncpus=1,  # >1 for multicore/parallel jobs.
    ntasks=1, # >1 for MPI jobs.
    chunks.as.arrayjobs=slurm.available))
}
if(requireNamespace("mlr3batchmark")){
  submit_job_array(jt1)
}
#>Submitting 4 jobs in 1 chunks using cluster functions 'Slurm' ...

The code above adds a job array, one parallel CPU per repl=1 cross-validation iteration, to the SLURM queue. In the code below, we wait for the jobs to compute, then gather the results.

if(requireNamespace("mlr3batchmark")){
  batchtools::waitForJobs(jt1)
  test_res <- mlr3batchmark::reduceResultsBatchmark(jt1)
  test_res$score(measure_list)
}
#> uhash nr task task_id learner learner_id resampling resampling_id iteration prediction_test classif.auc classif.acc
#> 8328b07f-178f-4c2c-81d8-a294ec67b2f6 1 TaskClassif:spam spam LearnerClassifFeatureless:classif.featureless classif.featureless same_other_sizes_cv 1 0.500 0.606
#> 95668932-f531-421d-b8d6-a590c19031ad 2 TaskClassif:spam spam LearnerClassifRpart:classif.rpart classif.rpart same_other_sizes_cv 1 0.889 0.892
#> f3e139e5-0a2a-47a5-8336-2c2516eed410 3 TaskClassif:german_credit german_credit LearnerClassifFeatureless:classif.featureless classif.featureless same_other_sizes_cv 1 0.500 0.701
#> 3b653707-b2cc-4480-bf6d-e4e5a891034b 4 TaskClassif:german_credit german_credit LearnerClassifRpart:classif.rpart classif.rpart same_other_sizes_cv 1 0.721 0.740

The result above includes test AUC and accuracy values, which makes for a more convincing test (but it involves actually submitting jobs on the cluster).

Our proposed way to test is via:

mlr3resampling::proj_test(proj_dir)
#>$grid_jobs.csv
#>   task.i learner.i resampling.i       task_id          learner_id
#>    <int>     <int>        <int>        <char>              <char>
#>1:      1         1            1          spam classif.featureless
#>2:      1         2            1          spam       classif.rpart
#>3:      2         1            1 german_credit classif.featureless
#>4:      2         2            1 german_credit       classif.rpart
#>         resampling_id test.subset train.subsets groups test.fold  seed
#>                <char>      <char>        <char>  <int>     <int> <int>
#>1: same_other_sizes_cv        full          same     50         1     1
#>2: same_other_sizes_cv        full          same     50         1     1
#>3: same_other_sizes_cv        full          same     66         1     1
#>4: same_other_sizes_cv        full          same     66         1     1
#>   n.train.groups iteration Train_subsets
#>            <int>     <int>        <char>
#>1:             50         1          same
#>2:             50         1          same
#>3:             66         1          same
#>4:             66         1          same
#>
#>$results.csv
#>   grid_job_i task.i learner.i resampling.i       task_id          learner_id
#>        <int>  <int>     <int>        <int>        <char>              <char>
#>1:          1      1         1            1          spam classif.featureless
#>2:          2      1         2            1          spam       classif.rpart
#>3:          3      2         1            1 german_credit classif.featureless
#>4:          4      2         2            1 german_credit       classif.rpart
#>         resampling_id test.subset train.subsets groups test.fold  seed
#>                <char>      <char>        <char>  <int>     <int> <int>
#>1: same_other_sizes_cv        full          same     50         1     1
#>2: same_other_sizes_cv        full          same     50         1     1
#>3: same_other_sizes_cv        full          same     66         1     1
#>4: same_other_sizes_cv        full          same     66         1     1
#>   n.train.groups iteration Train_subsets          start.time
#>            <int>     <int>        <char>              <POSc>
#>1:             50         1          same 2026-04-28 14:58:48
#>2:             50         1          same 2026-04-28 14:58:48
#>3:             66         1          same 2026-04-28 14:58:48
#>4:             66         1          same 2026-04-28 14:58:48
#>              end.time process classif.auc classif.acc
#>                <POSc>   <int>       <num>       <num>
#>1: 2026-04-28 14:58:48   25217   0.5000000   0.6000000
#>2: 2026-04-28 14:58:48   25217   0.9100000   0.8400000
#>3: 2026-04-28 14:58:48   25217   0.5000000   0.6969697
#>4: 2026-04-28 14:58:48   25217   0.6630435   0.7575758
#>

The results above include test AUC and accuracy, but with different values because the tasks are down-sampled to make training faster.

Running all jobs locally (small benchmarks)

For some small benchmarks, you may want to compute all results on your local machine (not a cluster). For both the previous and proposed approaches, we use the code below to enable computation in parallel using all the CPUs on the local machine.

if(interactive())future::plan("multisession")

The previous approach uses the code below:

bench_result <- mlr3::benchmark(bgrid)
bench_score <- bench_result$score(measure_list)
bench_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
#>       task_id          learner_id iteration classif.auc classif.acc
#>          spam classif.featureless         1       0.500       0.606
#>          spam classif.featureless         2       0.500       0.606
#>          spam classif.featureless         3       0.500       0.606
#>          spam       classif.rpart         1       0.889       0.892
#>          spam       classif.rpart         2       0.909       0.888
#> german_credit classif.featureless         2       0.500       0.700
#> german_credit classif.featureless         3       0.500       0.700
#> german_credit       classif.rpart         1       0.721       0.740
#> german_credit       classif.rpart         2       0.771       0.775
#> german_credit       classif.rpart         3       0.688       0.730

The proposed approach uses the code below:

proj_score <- mlr3resampling::proj_compute_all(proj_dir)
proj_score[, .(task_id, learner_id, iteration, classif.auc, classif.acc)]
#>       task_id          learner_id iteration classif.auc classif.acc
#>          spam classif.featureless         1       0.500       0.606
#>          spam classif.featureless         2       0.500       0.606
#>          spam classif.featureless         3       0.500       0.606
#>          spam       classif.rpart         1       0.889       0.892
#>          spam       classif.rpart         2       0.909       0.888
#> german_credit classif.featureless         2       0.500       0.700
#> german_credit classif.featureless         3       0.500       0.700
#> german_credit       classif.rpart         1       0.721       0.740
#> german_credit       classif.rpart         2       0.771       0.775
#> german_credit       classif.rpart         3       0.688       0.730

The results are identical because both methods compute the same learners on the same train/test splits.

Running all jobs on a cluster (large benchmarks)

Now we discuss how to run large benchmarks on a cluster.

Previous method, batchtools

In batchtools we can check the status of jobs in the registry using the code below.

if(requireNamespace("mlr3batchmark")){
  batchtools::getStatus()
}
#> defined submitted started done error queued running expired system
#>      12         4       4    4     0      0       0       0      0

The output above shows there are 4 jobs done out of 12 in the registry. To launch the other jobs, we use the code below.

if(requireNamespace("mlr3batchmark")){
  not.done <- batchtools::getJobTable()[is.na(done)]
  submit_job_array(not.done)
}
#>Submitting 8 jobs in 1 chunks using cluster functions 'Slurm' ...

The code above adds a job array, one parallel CPU for each remaining cross-validation iteration, to the SLURM queue. In the code below, we wait for the jobs to compute, then gather the results.

if(requireNamespace("mlr3batchmark")){
  batchtools::waitForJobs()
  ignore.learner <- function(L){
    L$learner_state$model <- NULL
    L
  }
  bt_res <- mlr3batchmark::reduceResultsBatchmark(jt, fun=ignore.learner)
  bt_score <- bt_res$score(measure_list)
}

Note that for large benchmarks, you must use the fun argument to avoid loading all models into memory at once.

Proposed method, mlr3resampling::proj_submit()

The code below does the full computation on SLURM using the proposed method.

if(slurm.available){
  slurm_job_id <- mlr3resampling::proj_submit(
    proj_dir, tasks=2, hours=1, gigabytes=1)
}
#>Loading required namespace: pbdMPI

The code above submits a SLURM MPI job with two tasks.

After all computations are done, the last worker saves a results.csv file, which can be read back into R using the code below.

(result_file_list <- mlr3resampling::proj_fread(proj_dir))
#>$grid_jobs.csv
#>    task.i learner.i resampling.i       task_id          learner_id
#>     <int>     <int>        <int>        <char>              <char>
#> 1:      1         1            1          spam classif.featureless
#> 2:      1         1            1          spam classif.featureless
#> 3:      1         1            1          spam classif.featureless
#> 4:      1         2            1          spam       classif.rpart
#> 5:      1         2            1          spam       classif.rpart
#> 6:      1         2            1          spam       classif.rpart
#> 7:      2         1            1 german_credit classif.featureless
#> 8:      2         1            1 german_credit classif.featureless
#> 9:      2         1            1 german_credit classif.featureless
#>10:      2         2            1 german_credit       classif.rpart
#>11:      2         2            1 german_credit       classif.rpart
#>12:      2         2            1 german_credit       classif.rpart
#>          resampling_id test.subset train.subsets groups test.fold  seed
#>                 <char>      <char>        <char>  <int>     <int> <int>
#> 1: same_other_sizes_cv        full          same   3067         1     1
#> 2: same_other_sizes_cv        full          same   3067         2     1
#> 3: same_other_sizes_cv        full          same   3067         3     1
#> 4: same_other_sizes_cv        full          same   3067         1     1
#> 5: same_other_sizes_cv        full          same   3067         2     1
#> 6: same_other_sizes_cv        full          same   3067         3     1
#> 7: same_other_sizes_cv        full          same    666         1     1
#> 8: same_other_sizes_cv        full          same    666         2     1
#> 9: same_other_sizes_cv        full          same    666         3     1
#>10: same_other_sizes_cv        full          same    666         1     1
#>11: same_other_sizes_cv        full          same    666         2     1
#>12: same_other_sizes_cv        full          same    666         3     1
#>    n.train.groups iteration Train_subsets
#>             <int>     <int>        <char>
#> 1:           3067         1          same
#> 2:           3067         2          same
#> 3:           3067         3          same
#> 4:           3067         1          same
#> 5:           3067         2          same
#> 6:           3067         3          same
#> 7:            666         1          same
#> 8:            666         2          same
#> 9:            666         3          same
#>10:            666         1          same
#>11:            666         2          same
#>12:            666         3          same
#>
#>$results.csv
#>    grid_job_i task.i learner.i resampling.i       task_id          learner_id
#>         <int>  <int>     <int>        <int>        <char>              <char>
#> 1:          1      1         1            1          spam classif.featureless
#> 2:         10      2         2            1 german_credit       classif.rpart
#> 3:         11      2         2            1 german_credit       classif.rpart
#> 4:         12      2         2            1 german_credit       classif.rpart
#> 5:          2      1         1            1          spam classif.featureless
#> 6:          3      1         1            1          spam classif.featureless
#> 7:          4      1         2            1          spam       classif.rpart
#> 8:          5      1         2            1          spam       classif.rpart
#> 9:          6      1         2            1          spam       classif.rpart
#>10:          7      2         1            1 german_credit classif.featureless
#>11:          8      2         1            1 german_credit classif.featureless
#>12:          9      2         1            1 german_credit classif.featureless
#>          resampling_id test.subset train.subsets groups test.fold  seed
#>                 <char>      <char>        <char>  <int>     <int> <int>
#> 1: same_other_sizes_cv        full          same   3067         1     1
#> 2: same_other_sizes_cv        full          same    666         1     1
#> 3: same_other_sizes_cv        full          same    666         2     1
#> 4: same_other_sizes_cv        full          same    666         3     1
#> 5: same_other_sizes_cv        full          same   3067         2     1
#> 6: same_other_sizes_cv        full          same   3067         3     1
#> 7: same_other_sizes_cv        full          same   3067         1     1
#> 8: same_other_sizes_cv        full          same   3067         2     1
#> 9: same_other_sizes_cv        full          same   3067         3     1
#>10: same_other_sizes_cv        full          same    666         1     1
#>11: same_other_sizes_cv        full          same    666         2     1
#>12: same_other_sizes_cv        full          same    666         3     1
#>    n.train.groups iteration Train_subsets          start.time
#>             <int>     <int>        <char>              <POSc>
#> 1:           3067         1          same 2026-04-28 14:58:49
#> 2:            666         1          same 2026-04-28 14:58:49
#> 3:            666         2          same 2026-04-28 14:58:49
#> 4:            666         3          same 2026-04-28 14:58:49
#> 5:           3067         2          same 2026-04-28 14:58:49
#> 6:           3067         3          same 2026-04-28 14:58:49
#> 7:           3067         1          same 2026-04-28 14:58:49
#> 8:           3067         2          same 2026-04-28 14:58:49
#> 9:           3067         3          same 2026-04-28 14:58:49
#>10:            666         1          same 2026-04-28 14:58:49
#>11:            666         2          same 2026-04-28 14:58:49
#>12:            666         3          same 2026-04-28 14:58:49
#>               end.time process classif.auc classif.acc
#>                 <POSc>   <int>       <num>       <num>
#> 1: 2026-04-28 14:58:49   25217   0.5000000   0.6058632
#> 2: 2026-04-28 14:58:49   25217   0.7214530   0.7395210
#> 3: 2026-04-28 14:58:49   25217   0.7712446   0.7747748
#> 4: 2026-04-28 14:58:50   25217   0.6879399   0.7297297
#> 5: 2026-04-28 14:58:49   25217   0.5000000   0.6060013
#> 6: 2026-04-28 14:58:49   25217   0.5000000   0.6060013
#> 7: 2026-04-28 14:58:49   25217   0.8889185   0.8918567
#> 8: 2026-04-28 14:58:49   25217   0.9085884   0.8884540
#> 9: 2026-04-28 14:58:49   25217   0.8906028   0.8956295
#>10: 2026-04-28 14:58:49   25217   0.5000000   0.7005988
#>11: 2026-04-28 14:58:49   25217   0.5000000   0.6996997
#>12: 2026-04-28 14:58:49   25217   0.5000000   0.6996997
#>

The results above show two tables: grid_jobs.csv, which describes the planned benchmark jobs, and results.csv, which additionally includes timing information and the evaluation metrics for each completed job.

Results comparison

Accuracy measures

acc_in_list <- list(
  mlr3resampling=result_file_list$results.csv)
if(requireNamespace("mlr3batchmark"))
  acc_in_list$mlr3batchmark <- bt_score
acc_out_list <- list()
library(data.table)
for(package in names(acc_in_list)){
  acc_in <- melt(
    acc_in_list[[package]],
    id.vars=c("task_id", "learner_id", "iteration"),
    measure.vars=c("classif.auc", "classif.acc"))
  acc_out_list[[package]] <- data.table(package, acc_in)
}
acc_out <- rbindlist(acc_out_list)
(acc_compare <- dcast(
  acc_out,
  variable + task_id + learner_id + iteration ~ package))
#>    variable       task_id          learner_id iteration mlr3batchmark mlr3resampling
#> classif.auc german_credit classif.featureless         1         0.500          0.500
#> classif.auc german_credit classif.featureless         2         0.500          0.500
#> classif.auc german_credit classif.featureless         3         0.500          0.500
#> classif.auc german_credit       classif.rpart         1         0.721          0.721
#> classif.auc german_credit       classif.rpart         2         0.771          0.771
#> classif.acc          spam classif.featureless         2         0.606          0.606
#> classif.acc          spam classif.featureless         3         0.606          0.606
#> classif.acc          spam       classif.rpart         1         0.892          0.892
#> classif.acc          spam       classif.rpart         2         0.888          0.888
#> classif.acc          spam       classif.rpart         3         0.896          0.896
if(requireNamespace("mlr3batchmark"))
  acc_compare[, all.equal(mlr3batchmark, mlr3resampling)]
#>[1] TRUE

We see above that all accuracy metrics are equal between the two methods. We plot the results below.

if(require(ggplot2)){
  ggplot()+
    geom_point(aes(
      mlr3resampling, learner_id),
      data=acc_compare)+
    facet_wrap(c("task_id","variable"), labeller=label_both, scales="free", ncol=1)
}

The figure above shows that the decision tree learns something non-trivial in both data sets, and that spam is easier (more class balance, bigger improvement over featureless).

Computation time

Analysis of computation time is not very interesting for this small data example, but this code could be useful for a larger benchmark.

time_compare <- rbind(
  if(requireNamespace("mlr3batchmark"))batchtools::getJobTable()[, .(
    package="mlr3batchmark", process=.I, start.time=started, end.time=done)],
  result_file_list$results.csv[, .(
    package="mlr3resampling", process, start.time, end.time)])
if(require(ggplot2)){
  ggplot(time_compare, aes(start.time, process))+
    geom_segment(aes(
      xend=end.time, yend=process))+
    geom_point()+
    facet_grid(
      package~.,
      labeller=label_both,
      scales="free")
}

New features

New edit_learner() method for quick testing

For even faster test runs, a Learner may define an edit_learner() method, which edits the learner to make training faster. For example, the default with AutoTuner and TorchLearner is to use only two epochs of training during proj_test(), so you can quickly see if results look reasonable before running training with the full number of epochs and the full data set.

proj_new <- if(interactive())"~/proj_new" else tempfile()
unlink(proj_new, recursive = TRUE)
learners_new <- list(
  mlr3::LearnerClassifFeatureless$new())
if(requireNamespace("torch") && torch::torch_is_installed()){
  gen_linear <- torch::nn_module(
    "my_linear",
    initialize = function(task) {
      self$weight = torch::nn_linear(task$n_features, 1)
    },
    forward = function(x) {
      self$weight(x)
    }
  )
  learners_new$torch <- mlr3resampling::AutoTunerTorch_epochs$new(
    "torch_linear",
    module_generator=gen_linear,
    max_epochs=1000,
    batch_size=10,
    measure_list=mlr3::msrs("classif.auc")
  )
}
#>Loading required namespace: torch
if(requireNamespace("glmnet")){
  learners_new$glmnet <- mlr3resampling::LearnerClassifCVGlmnetSave$new()
}
for(learner_i in seq_along(learners_new)){
  L <- learners_new[[learner_i]]
  L$predict_type <- "prob"
}
mlr3resampling::proj_grid(
  proj_new, tasks_with_fold$spam, learners_new, kfoldcv,
  score_args = measure_list)
#>grid with 9 jobs created! Test one job with the following code in a new R session:
#>mlr3resampling::proj_test("/tmp/Rtmppy43XW/file62814f2cb14e", max_jobs=1)
#> task.i learner.i resampling.i task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 1 1 3067 1 same
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 2 1 3067 2 same
#> 1 1 1 spam classif.featureless same_other_sizes_cv full same 3067 3 1 3067 3 same
#> 1 2 1 spam torch_linear same_other_sizes_cv full same 3067 1 1 3067 1 same
#> 1 2 1 spam torch_linear same_other_sizes_cv full same 3067 2 1 3067 2 same
#> 1 2 1 spam torch_linear same_other_sizes_cv full same 3067 3 1 3067 3 same
#> 1 3 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 1 1 3067 1 same
#> 1 3 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 2 1 3067 2 same
#> 1 3 1 spam classif.cv_glmnet same_other_sizes_cv full same 3067 3 1 3067 3 same
system.time({
  test_result_list <- mlr3resampling::proj_test(proj_new)
})
#>   user  system elapsed 
#>  1.440   0.015   1.455 

Notice how fast the above code runs. Even though the torch learner allows up to 1000 epochs, training is reduced to only 2 epochs when run via proj_test().

New save_learner() method for model interpretation

The two learners above have save_learner() methods, defined in mlr3resampling/R/Learners.R.
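For the exact definitions, see mlr3resampling/R/Learners.R. As a rough sketch, a save_learner() method for a glmnet learner could extract the coefficient table (the function body below is an assumption for illustration, not the package's actual code):

```r
# Hypothetical sketch: save only the linear model weights, not the
# whole fitted object, so results stay small and easy to interpret.
save_learner <- function(L) {
  # coef() on a cv.glmnet model returns a sparse column matrix of
  # coefficients at the selected lambda (first row is the intercept).
  weight_mat <- coef(L$model)
  data.table::data.table(
    feature = rownames(weight_mat),
    weight = as.numeric(weight_mat))
}
```

Returning a small table per learner is what makes the learners_weights.csv output possible without saving every fitted model to disk.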

names(test_result_list)
#>[1] "grid_jobs.csv"        "learners_history.csv" "learners_weights.csv"
#>[4] "results.csv"         

We see that there are two learner tables. The first, history, comes from torch:

test_result_list$learners_history.csv
#>grid_job_i epoch train.classif.auc valid.classif.auc task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#>2 1 0.562 0.371 spam torch_linear same_other_sizes_cv full same 50 1 1 50 1 same
#>2 2 0.575 0.379 spam torch_linear same_other_sizes_cv full same 50 1 1 50 1 same

We see above that there are two rows (one per epoch). In larger runs (more than two epochs), we can plot these data to check whether the torch model has a good fit (neither over-fitting nor under-fitting).
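For example, assuming the history table is (or can be converted to) a data.table, the train/validation AUC curves could be plotted like this:

```r
# Plot AUC as a function of epoch, one curve per set (train/valid),
# to diagnose over-fitting (valid AUC decreasing with more epochs)
# or under-fitting (both curves still increasing at the last epoch).
history_long <- data.table::melt(
  data.table::as.data.table(test_result_list$learners_history.csv),
  measure.vars = c("train.classif.auc", "valid.classif.auc"),
  variable.name = "set", value.name = "auc")
library(ggplot2)
ggplot(history_long, aes(epoch, auc, color = set)) +
  geom_line() +
  geom_point()
```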

Next, the weights table comes from glmnet:

test_result_list$learners_weights.csv
#>grid_job_i feature weight task_id learner_id resampling_id test.subset train.subsets groups test.fold seed n.train.groups iteration Train_subsets
#>3 address 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 addresses 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 all 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 business 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 capitalAve 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 technology -0.276 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 telnet 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 will -0.230 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 you 0.000 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same
#>3 your 0.175 spam classif.cv_glmnet same_other_sizes_cv full same 50 1 1 50 1 same

The table above contains one row per input feature. Some features are selected by L1 regularization (non-zero weights); the others are not used for prediction (zero weights).
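Assuming the weights table is (or can be converted to) a data.table, the selected features can be listed by filtering out the zero weights:

```r
# Keep only the features with non-zero weights (selected by glmnet),
# ordered by absolute weight, so the strongest effects come first.
weights_dt <- data.table::as.data.table(test_result_list$learners_weights.csv)
weights_dt[weight != 0][order(-abs(weight)), .(feature, weight)]
```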

New Task down-sampling

When we ran proj_test() above, it created a test directory, which is another project directory containing a smaller task. Each original task in the project is down-sampled proportionally (using strata), so we can quickly check that training and prediction work on a smaller data set. The code below reads the down-sampled task:

rds.vec <- Sys.glob(file.path(proj_new,"test","tasks","*rds"))
for(rds.i in seq_along(rds.vec)){
  mini_task <- readRDS(rds.vec[[rds.i]])
  print(table(mini_task$data()[[1]]))
}
#>
#>   spam nonspam 
#>     30      45 

The output above shows 30 spam and 45 nonspam observations in the down-sampled task, consistent with proportional down-sampling that preserves the original class balance (about 39% spam in the full task).
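We can verify that the class proportions in the down-sampled task match those of the original task (this assumes mini_task from the loop above is still in scope):

```r
# Compare class proportions: down-sampled task vs. original task.
# Both should show roughly 40% spam / 60% nonspam.
round(prop.table(table(mini_task$data()$type)), 2)
round(prop.table(table(tasks_with_fold$spam$data()$type)), 2)
```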

Conclusion

We compared mlr3resampling::proj_*() functions to their analogs in batchtools and mlr3batchmark. We saw that the interface is similar, with some convenience features for quick testing, and for efficient model interpretation.