The goal of this vignette is to explain the older resamplers, ResamplingVariableSizeTrainCV and ResamplingSameOtherCV, which output data that are useful for visualizing the train/test splits. If you do not need to visualize the train/test splits, then the newer resampler, ResamplingSameOtherSizesCV, is recommended instead (see the other vignette).

Same/Other/All resampler

The goal of this section is to explain how to quantify the extent to which it is possible to train on one data subset, and predict on another data subset. This kind of problem occurs frequently in many different problem domains:

geography: can we train on one region (say, Europe) and accurately predict on another (North America)?
time series: can we train on one time period (2000) and accurately predict on another (2001)?
personalization: can we train on one person (Alice) and accurately predict on another (Bob)?

The ideas are similar to my previous blog posts about how to do this in Python and R. Below we explain how to use mlr3resampling for this purpose, in simulated regression and classification problems. To use this method on real data, the important sections to read below are named “Benchmark: computing test error”; they show how to create these cross-validation experiments using mlr3 code.

Before creating any mlr3 tasks, we need to load the K-fold cross-validation resampler, as below.

(reg_same_other <- mlr3resampling::ResamplingSameOtherCV$new())
#>
#>── <ResamplingSameOtherCV> : Same versus Other Cross-Validation ────────────────
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3

This step must come first, because it loads mlr3resampling, which adds subset to the set of valid mlr3 column roles; otherwise mlr3 gives an error saying that subset is not a recognized role.

Simulated regression problems

We begin by generating some data which can be used with regression algorithms. Assume there is a data set with some rows from one person, some rows from another,

N <- 300
library(data.table)
set.seed(1)
abs.x <- 2
reg.dt <- data.table(
  x=runif(N, -abs.x, abs.x),
  person=rep(1:2, each=0.5*N))
reg.pattern.list <- list(
  easy=function(x, person)x^2,
  impossible=function(x, person)(x^2+person*3)*(-1)^person)
reg.task.list <- list()
for(task_id in names(reg.pattern.list)){
  f <- reg.pattern.list[[task_id]]
  yname <- paste0("y_",task_id)
  reg.dt[, (yname) := f(x,person)+rnorm(N)][]
  task.dt <- reg.dt[, c("x","person",yname), with=FALSE]
  reg.task <- mlr3::TaskRegr$new(
    task_id, task.dt, target=yname)
  if(requireNamespace("mlr3resampling"))reg.task$col_roles$subset <- "person"
  reg.task$col_roles$stratum <- "person"
  reg.task$col_roles$feature <- "x"
  reg.task.list[[task_id]] <- reg.task
}
reg.dt

x	person	y_easy	y_impossible
-0.938	1	1.330	-2.918
-0.512	1	0.243	-3.866
0.291	1	-0.233	-3.838
1.633	1	1.737	-7.222
-1.193	1	-0.064	-5.878
⋮	⋮	⋮	⋮
0.726	2	-2.481	5.181
-1.603	2	1.205	9.604
-1.524	2	1.900	7.512
-1.798	2	3.470	11.035
1.717	2	0.605	10.720

The table above shows some simulated data for two regression problems:

the easy problem has the same pattern for each person, so it is possible (and easy) to train on one person and accurately predict on another.
the impossible problem has a different pattern for each person, so it is impossible to train on one person and accurately predict on another.
when adapting the code above to real data, the important part is the mlr3::TaskRegr line, which tells mlr3 which data set to use and which column is the target, together with the col_roles lines, which specify the subset/stratum column and the feature columns.
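As a minimal sketch of that adaptation, the code below builds a regression task from a hypothetical data frame. The names my.df, region, temperature, and yield are placeholders which do not appear in the example above, and the task is only created if mlr3 and mlr3resampling are both installed.

```r
# Hypothetical data: two regions, one input feature, one numeric target.
set.seed(1)
my.df <- data.frame(
  region=factor(rep(c("Europe", "NorthAmerica"), each=50)),
  temperature=runif(100, 0, 30))
my.df$yield <- 2*my.df$temperature + rnorm(100)
my.task <- NULL
# Loading the mlr3resampling namespace makes the subset role available.
if(requireNamespace("mlr3", quietly=TRUE) &&
   requireNamespace("mlr3resampling", quietly=TRUE)){
  my.task <- mlr3::TaskRegr$new(
    "my_task", my.df, target="yield")
  my.task$col_roles$subset <- "region"        # train on same/other/all regions
  my.task$col_roles$stratum <- "region"       # balance regions across folds
  my.task$col_roles$feature <- "temperature"  # region itself is not an input
}
```

The same three col_roles assignments are used for the simulated person column above; only the column names change.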

Static visualization of simulated data

First we reshape the data using the code below,

(reg.tall <- nc::capture_melt_single(
  reg.dt,
  task_id="easy|impossible",
  value.name="y"))

x	person	task_id	y
-0.938	1	easy	1.330
-0.512	1	easy	0.243
0.291	1	easy	-0.233
1.633	1	easy	1.737
-1.193	1	easy	-0.064
⋮	⋮	⋮	⋮
0.726	2	impossible	5.181
-1.603	2	impossible	9.604
-1.524	2	impossible	7.512
-1.798	2	impossible	11.035
1.717	2	impossible	10.720

The table above is a more convenient form for the visualization which we create using the code below,

if(require(animint2)){
  print_theme <- theme_bw(20)
  ggplot()+
    print_theme+
    geom_point(aes(
      x, y),
      data=reg.tall)+
    facet_grid(
      task_id ~ person,
      labeller=label_both,
      space="free",
      scales="free")+
    scale_y_continuous(
      breaks=seq(-100, 100, by=2))
}
#>Loading required package: animint2

In the simulated data above, we can see that

for the easy pattern, it is the same for both people, so it should be possible/easy to train on one person, and accurately predict on another.
for the impossible pattern, it is different for each person, so it should not be possible to train on one person, and accurately predict on another.
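The expectation described above can be checked with a small base R sketch, without any mlr3 code. It is an illustration under stated assumptions, not the method used in the rest of this vignette: a quadratic least squares fit (stats::lm) stands in for the learner, there is a single hold-out set (every third row of person 2) instead of K-fold cross-validation, and the names sim, train.rows, and test_mse are hypothetical.

```r
set.seed(1)
N <- 300
sim <- data.frame(
  x=runif(N, -2, 2),
  person=rep(1:2, each=N/2))
sim$y_easy <- sim$x^2 + rnorm(N)
sim$y_impossible <- with(sim, (x^2 + person*3)*(-1)^person) + rnorm(N)
# Single hold-out test set: every third row of person 2.
is.test <- sim$person == 2 & seq_len(N) %% 3 == 0
train.rows <- list(
  same=sim$person == 2 & !is.test,  # same person as the test set
  other=sim$person == 1,            # only the other person
  all=!is.test)                     # both people
test_mse <- function(yname, is.train){
  # Assumption: quadratic least squares stands in for the learner.
  fit <- lm(reformulate("I(x^2)", response=yname), data=sim[is.train, ])
  pred <- predict(fit, sim[is.test, ])
  mean((sim[is.test, yname] - pred)^2)
}
# Rows: train set (same/other/all); columns: pattern (easy/impossible).
(mse.mat <- sapply(
  c(easy="y_easy", impossible="y_impossible"),
  function(yname)sapply(
    train.rows, function(is.train)test_mse(yname, is.train))))
```

For the easy pattern the three train sets should give similar test error, whereas for the impossible pattern training on other should be far worse than training on same.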

Benchmark: computing test error

In the code below, we define two learners to compare,

(reg.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
  mlr3::LearnerRegrFeatureless$new()))
#>[[1]]
#>
#>── <LearnerRegrRpart> (regr.rpart): Regression Tree ────────────────────────────
#>• Model: -
#>• Parameters: xval=0
#>• Packages: mlr3 and rpart
#>• Predict Types: [response]
#>• Feature Types: logical, integer, numeric, factor, and ordered
#>• Encapsulation: none (fallback: -)
#>• Properties: importance, missings, selected_features, and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>
#>[[2]]
#>
#>── <LearnerRegrFeatureless> (regr.featureless): Featureless Regression Learner ─
#>• Model: -
#>• Parameters: robust=FALSE
#>• Packages: mlr3 and stats
#>• Predict Types: [response], se, and quantiles
#>• Feature Types: logical, integer, numeric, character, factor, ordered,
#>POSIXct, and Date
#>• Encapsulation: none (fallback: -)
#>• Properties: featureless, importance, missings, selected_features, and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>
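To make the baseline concrete, here is a minimal base R sketch of what a featureless regression learner with robust=FALSE does: it ignores the inputs and predicts the mean of the train targets. The function names featureless_fit and mse, and the toy numbers, are hypothetical and chosen only for illustration.

```r
# Featureless baseline: ignore inputs, predict the mean of the train targets.
featureless_fit <- function(y.train) mean(y.train)
# Mean squared error between true targets and predictions.
mse <- function(y, pred) mean((y - pred)^2)
y.train <- c(1, 2, 3, 6)
y.test <- c(2, 4)
pred <- featureless_fit(y.train)  # (1+2+3+6)/4 = 3
mse(y.test, pred)                 # ((2-3)^2 + (4-3)^2)/2 = 1
```

Any learner which uses the feature x is only useful if its test error is lower than this baseline.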

In the code below, we define the benchmark grid, which is all combinations of tasks (easy and impossible), learners (rpart and featureless), and the one resampling method.

(reg.bench.grid <- mlr3::benchmark_grid(
  reg.task.list,
  reg.learner.list,
  reg_same_other))

task	learner	resampling
TaskRegr:easy	LearnerRegrRpart:regr.rpart
TaskRegr:easy	LearnerRegrFeatureless:regr.featureless
TaskRegr:impossible	LearnerRegrRpart:regr.rpart
TaskRegr:impossible	LearnerRegrFeatureless:regr.featureless

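In the code below, we execute the benchmark experiment. The multisession future plan, which would run the iterations in parallel, is wrapped in if(FALSE) so that it is skipped on CRAN; change FALSE to TRUE to run in parallel.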

if(FALSE){#for CRAN.
  if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
#>Loading required package: lgr
#>
#>Attaching package: ‘lgr’
#>
#>The following object is masked from ‘package:ggplot2’:
#>
#>    Layout
#>
(reg.bench.result <- mlr3::benchmark(
  reg.bench.grid, store_models = TRUE))
#>
#>── <BenchmarkResult> of 72 rows with 4 resampling run ──────────────────────────
#> nr    task_id       learner_id resampling_id iters warnings errors
#>  1       easy       regr.rpart same_other_cv    18        0      0
#>  2       easy regr.featureless same_other_cv    18        0      0
#>  3 impossible       regr.rpart same_other_cv    18        0      0
#>  4 impossible regr.featureless same_other_cv    18        0      0
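The counts in the output above (18 iterations per resampling run, 72 rows in total) can be reproduced by enumerating the splits, as in the base R sketch below. The column names follow the score table shown later in this vignette; the row order produced by expand.grid is arbitrary and is not claimed to match the iteration numbering used by the resampler.

```r
n.folds <- 3
split.dt <- expand.grid(
  test.fold=seq_len(n.folds),             # 3 cross-validation folds
  test.subset=1:2,                        # 2 people
  train.subsets=c("all", "other", "same"),
  stringsAsFactors=FALSE)
nrow(split.dt)             # 3 folds * 2 subsets * 3 train sets = 18
n.combos <- 2 * 2          # 2 tasks * 2 learners
n.combos * nrow(split.dt)  # 4 * 18 = 72
```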

The code below computes the test error for each split,

reg.bench.score <- mlr3resampling::score(reg.bench.result)
reg.bench.score[1]

train.subsets	test.fold	test.subset	person	iteration	test	train	uhash	nr	task	task_id	learner	learner_id	resampling	resampling_id	prediction_test	regr.mse	algorithm
all	1	1	1	1	1, 3, 5, 6,12,13,…[50]	4, 7, 9,10,18,20,…[200]	27eed994-6b6f-4601-9be2-f258b4ab0342	1	TaskRegr:easy	easy	LearnerRegrRpart:regr.rpart	regr.rpart		same_other_cv		1.638	rpart

The code below visualizes the resulting test error values (mean squared error, regr.mse).

if(require(animint2)){
  ggplot()+
    print_theme+
    scale_x_log10()+
    geom_point(aes(
      regr.mse, train.subsets, color=algorithm),
      shape=1,
      data=reg.bench.score)+
    facet_grid(
      task_id ~ person,
      labeller=label_both,
      scales="free")
}

It is clear from the plot above that

for the easy task, training on same is just as good as all or other subsets. rpart has much lower test error than featureless, in all three train subsets.
for the impossible task, the least test error is using rpart with same train subsets; featureless with same train subsets is next best; training on all is substantially worse (for both featureless and rpart); training on other is even worse (patterns in the two people are completely different).
in a real data task, training on other will most likely not be quite as bad as in the impossible task above, but also not as good as in the easy task.

Interactive visualization of data, test error, and splits

The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.

inst <- reg.bench.score$resampling[[1]]$instance
rect.expand <- 0.3
grid.dt <- data.table(x=seq(-abs.x, abs.x, length.out=101), y=0)
grid.task <- mlr3::TaskRegr$new("grid", grid.dt, target="y")
pred.dt.list <- list()
point.dt.list <- list()
for(score.i in 1:nrow(reg.bench.score)){
  reg.bench.row <- reg.bench.score[score.i]
  task.dt <- data.table(
    reg.bench.row$task[[1]]$data(),
    reg.bench.row$resampling[[1]]$instance$id.dt)
  names(task.dt)[1] <- "y"
  set.ids <- data.table(
    set.name=c("test","train")
  )[
  , data.table(row_id=reg.bench.row[[set.name]][[1]])
  , by=set.name]
  i.points <- set.ids[
    task.dt, on="row_id"
  ][
    is.na(set.name), set.name := "unused"
  ]
  point.dt.list[[score.i]] <- data.table(
    reg.bench.row[, .(task_id, iteration)],
    i.points)
  i.learner <- reg.bench.row$learner[[1]]
  pred.dt.list[[score.i]] <- data.table(
    reg.bench.row[, .(
      task_id, iteration, algorithm
    )],
    as.data.table(
      i.learner$predict(grid.task)
    )[, .(x=grid.dt$x, y=response)]
  )
}
(pred.dt <- rbindlist(pred.dt.list))

task_id	iteration	algorithm	x	y
easy	1	rpart	-2.00	3.558
easy	1	rpart	-1.96	3.558
easy	1	rpart	-1.92	3.558
easy	1	rpart	-1.88	3.558
easy	1	rpart	-1.84	3.558
⋮	⋮	⋮	⋮	⋮
impossible	18	featureless	1.84	7.204
impossible	18	featureless	1.88	7.204
impossible	18	featureless	1.92	7.204
impossible	18	featureless	1.96	7.204
impossible	18	featureless	2.00	7.204

(point.dt <- rbindlist(point.dt.list))

task_id	iteration	set.name	row_id	y	x	fold	person	subset	display_row
easy	1	test	1	1.330	-0.938	1	1	1	1
easy	1	train	2	0.243	-0.512	3	1	1	101
easy	1	test	3	-0.233	0.291	1	1	1	2
easy	1	train	4	1.737	1.633	2	1	1	51
easy	1	test	5	-0.064	-1.193	1	1	1	3
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
impossible	18	train	296	5.181	0.726	1	2	2	198
impossible	18	train	297	9.604	-1.603	1	2	2	199
impossible	18	test	298	7.512	-1.524	3	2	2	299
impossible	18	train	299	11.035	-1.798	1	2	2	200
impossible	18	test	300	10.720	1.717	3	2	2	300
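The set.name column in the table above marks each row as train, test, or unused for one split. The loop above derives it with a data.table join; the sketch below shows the same labeling idea in base R, using hypothetical row ids.

```r
row_id <- 1:10
test.ids <- c(1, 3, 5)   # hypothetical test rows for one split
train.ids <- c(2, 4)     # hypothetical train rows for the same split
# Rows which are in neither set keep the default label.
set.name <- rep("unused", length(row_id))
set.name[row_id %in% train.ids] <- "train"
set.name[row_id %in% test.ids] <- "test"
data.frame(row_id, set.name)
(set.counts <- table(set.name))
```

Rows are unused when, for example, the train set is restricted to one person and the row belongs to the other person but is not in the test set.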

set.colors <- c(
  train="#1B9E77",
  test="#D95F02",
  unused="white")
algo.colors <- c(
  featureless="blue",
  rpart="red")
make_person_subset <- function(DT){
  DT[, "person/subset" := person]
}
make_person_subset(point.dt)
make_person_subset(reg.bench.score)

if(require(animint2)){
  viz <- animint(
    title="SOAK algorithm: train/predict on subsets, regression",
    video="https://vimeo.com/1053413000",
    pred=ggplot()+
      ggtitle("Predictions for selected train/test split")+
      theme_animint(
        rowspan=2,
        width=600,
        height=700)+
      scale_fill_manual(values=set.colors)+
      geom_point(aes(
        x, y, fill=set.name),
        showSelected="iteration",
        size=3,
        help="One dot for each train/test/unused data point.",
        shape=21,
        data=point.dt)+
      scale_color_manual(values=algo.colors)+
      geom_line(aes(
        x, y, color=algorithm,
        group=paste(algorithm, iteration)),
        help="One line for each learned prediction function.",
        showSelected="iteration",
        data=pred.dt)+
      facet_grid(
        task_id ~ `person/subset`,
        labeller=label_both,
        space="free",
        scales="free")+
      scale_x_continuous(
        "x = input/feature in regression")+
      scale_y_continuous(
        "y = output to predict in regression",
        breaks=seq(-100, 100, by=2)),
    err=ggplot()+
      ggtitle("Test error for each split")+
      theme_animint(
        height=400,
        width=400,
        last_in_row=TRUE)+
      guides(fill="none")+
      scale_y_log10(
        "Mean squared error on test set")+
      scale_fill_manual(values=algo.colors)+
      scale_x_discrete(
        "People/subsets in train set")+
      geom_point(aes(
        train.subsets, regr.mse, fill=algorithm),
        help="One dot per test set and learning algorithm.",
        shape=1,
        size=5,
        stroke=2,
        color="black",
        color_off=NA,
        showSelected="algorithm",
        clickSelects="iteration",
        data=reg.bench.score)+
      facet_grid(
        task_id ~ `person/subset`,
        labeller=label_both,
        scales="free"),
    diagram=ggplot()+
      ggtitle("Select train/test split")+
      theme_animint(height=300, width=350)+
      facet_grid(
        . ~ train.subsets,
        scales="free",
        space="free")+
      scale_size_manual(values=c(subset=3, fold=1))+
      scale_color_manual(values=c(subset="orange", fold="grey50"))+
      geom_rect(aes(
        xmin=-Inf, xmax=Inf,
        color=rows,
        size=rows,
        ymin=display_row, ymax=display_end),
        help="One rect per chunk of data with common fold (grey) and subset (gold).",
        fill=NA,
        data=inst$viz.rect.dt)+
      scale_fill_manual(values=set.colors)+
      geom_label_aligned(aes(
        x=ifelse(rows=="subset", Inf, -Inf),
        y=(display_row+display_end)/2,
        color=rows,
        hjust=ifelse(rows=="subset", 1, 0),
        label=paste0(rows, "=", ifelse(rows=="subset", subset, fold))),
        help="Text labels indicate chunks of data with common fold (grey) and subset (gold).",
        showSelected="rows",
        data=data.table(train.name="same", inst$viz.rect.dt))+
      geom_rect(aes(
        xmin=iteration-rect.expand, ymin=display_row,
        xmax=iteration+rect.expand, ymax=display_end,
        fill=set.name),
        help="One rect per chunk of data assigned to train/test set in cross-validation.",
        alpha=0.5,
        alpha_off=0.5,
        color="black",
        color_off=NA,
        clickSelects="iteration",
        data=inst$viz.set.dt)+
      scale_x_continuous(
        "Split number",
        breaks=c(1,6, 7,12, 13,18))+
      scale_y_continuous(
        "Row number"),
    first=list(iteration=10),
    source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
  animint2pages(viz, "2023-12-13-train-predict-subsets-regression", chromote_sleep_seconds = 5)
}

If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/

Simulated classification problems

The previous section investigated a simulated regression problem, whereas in this section we simulate a binary classification problem. Assume there is a data set with some rows from one person, some rows from another,

N <- 200
library(data.table)
(full.dt <- data.table(
  label=factor(rep(c("spam","not spam"), length.out=N)),
  person=rep(1:2, each=0.5*N)
)[, signal := ifelse(label=="not spam", 0, 3)][])

label	person	signal
spam	1	3
not spam	1	0
spam	1	3
not spam	1	0
spam	1	3
⋮	⋮	⋮
not spam	2	0
spam	2	3
not spam	2	0
spam	2	3
not spam	2	0

Above, each row has a person ID which is either 1 or 2. We can imagine a spam filtering system that has training data for multiple people (here just two). Each row in the table above represents a message which has been labeled as spam or not, by one of the two people. Can we train on one person, and accurately predict on the other person? To do that we will need some features, which we generate/simulate below:

set.seed(1)
n.people <- length(unique(full.dt$person))
for(person.i in 1:n.people){
  use.signal.vec <- list(
    easy=rep(if(person.i==1)TRUE else FALSE, N),
    impossible=full.dt$person==person.i)
  for(task_id in names(use.signal.vec)){
    use.signal <- use.signal.vec[[task_id]]
    full.dt[
    , paste0("x",person.i,"_",task_id) := ifelse(
      use.signal, signal, 0
    )+rnorm(N)][]
  }
}
full.dt

label	person	signal	x1_easy	x1_impossible	x2_easy	x2_impossible
spam	1	3	2.374	3.409	1.074	-0.341
not spam	1	0	0.184	1.689	1.896	1.502
spam	1	3	2.164	4.587	-0.603	0.528
not spam	1	0	1.595	-0.331	-0.391	0.542
spam	1	3	3.330	0.715	-0.416	-0.137
⋮	⋮	⋮	⋮	⋮	⋮	⋮
not spam	2	0	-1.048	-0.924	0.768	-1.029
spam	2	3	4.441	1.593	-0.816	2.989
not spam	2	0	-1.016	0.045	-0.436	-1.225
spam	2	3	3.412	-0.715	0.905	0.404
not spam	2	0	-0.381	0.865	-0.763	1.169

In the table above, there are two sets of two features:

For easy features, one is correlated with the label (x1_easy), and one is random noise (x2_easy), so the algorithm just needs to learn to ignore the noise feature and concentrate on the signal feature. That should be possible given data from either person (same signal in each person).
Each impossible feature is correlated with the label when the feature number is the same as the person number, and is just noise when the two numbers differ. So if the algorithm has access to train data from the correct person (same as test, say person 2), then it needs to learn to use the corresponding feature, x2_impossible. But if the algorithm does not have access to that person, then the best it can do is the same as featureless (predict the most frequent class label in the train data).
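One way to check the description of the impossible features is to compare class means within each person, as in the base R sketch below. It re-simulates only x1_impossible with the same recipe (signal 3 for spam, standard normal noise), so the random numbers differ from the table above; the names sim, mean_diff, and diff.vec are hypothetical.

```r
set.seed(1)
N <- 200
sim <- data.frame(
  label=rep(c("spam", "not spam"), length.out=N),
  person=rep(1:2, each=N/2))
signal <- ifelse(sim$label == "spam", 3, 0)
# impossible: feature 1 carries the signal only in rows from person 1.
sim$x1_impossible <- ifelse(sim$person == 1, signal, 0) + rnorm(N)
# Difference between class means of x1_impossible, within one person.
mean_diff <- function(p){
  x <- sim$x1_impossible[sim$person == p]
  lab <- sim$label[sim$person == p]
  mean(x[lab == "spam"]) - mean(x[lab == "not spam"])
}
(diff.vec <- sapply(c(person1=1, person2=2), mean_diff))
```

The difference should be close to 3 for person 1 and close to 0 for person 2, which is why x1_impossible is useless for predicting person 2.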

Static visualization of simulated data

Below we reshape the data to a table which is more suitable for visualization:

(scatter.dt <- nc::capture_melt_multiple(
  full.dt,
  column="x[12]",
  "_",
  task_id="easy|impossible"))

label	person	signal	task_id	x1	x2
spam	1	3	easy	2.374	1.074
not spam	1	0	easy	0.184	1.896
spam	1	3	easy	2.164	-0.603
not spam	1	0	easy	1.595	-0.391
spam	1	3	easy	3.330	-0.416
⋮	⋮	⋮	⋮	⋮	⋮
not spam	2	0	impossible	-0.924	-1.029
spam	2	3	impossible	1.593	2.989
not spam	2	0	impossible	0.045	-1.225
spam	2	3	impossible	-0.715	0.404
not spam	2	0	impossible	0.865	1.169

Below we visualize the pattern for each person and feature type:

if(require(animint2)){
  ggplot()+
    print_theme+
    geom_point(aes(
      x1, x2, color=label),
      shape=1,
      data=scatter.dt)+
    facet_grid(
      task_id ~ person,
      labeller=label_both)
}

In the plot above, it is apparent that

for easy features (top row of panels), the two label classes differ in x1 values for both people. So it should be possible/easy to train on person 1, and predict accurately on person 2.
for impossible features (bottom row of panels), the two people have different label patterns. For person 1, the two label classes differ in x1 values, whereas for person 2, the two label classes differ in x2 values. So it should be impossible to train on person 1, and predict accurately on person 2.

Benchmark: computing test error

We use the code below to create a list of classification tasks, for use in the mlr3 framework.

class.task.list <- list()
for(task_id in c("easy","impossible")){
  feature.names <- grep(task_id, names(full.dt), value=TRUE)
  task.col.names <- c(feature.names, "label", "person")
  task.dt <- full.dt[, task.col.names, with=FALSE]
  this.task <- mlr3::TaskClassif$new(
    task_id, task.dt, target="label")
  this.task$col_roles$subset <- "person"
  this.task$col_roles$stratum <- c("person","label")
  this.task$col_roles$feature <- setdiff(names(task.dt), this.task$col_roles$stratum)
  class.task.list[[task_id]] <- this.task
}
class.task.list
#>$easy
#>
#>── <TaskClassif> (200x3) ───────────────────────────────────────────────────────
#>• Target: label
#>• Target classes: not spam (positive class, 50%), spam (50%)
#>• Properties: twoclass, strata
#>• Features (2):
#>  • dbl (2): x1_easy, x2_easy
#>• Strata: person and label
#>
#>$impossible
#>
#>── <TaskClassif> (200x3) ───────────────────────────────────────────────────────
#>• Target: label
#>• Target classes: not spam (positive class, 50%), spam (50%)
#>• Properties: twoclass, strata
#>• Features (2):
#>  • dbl (2): x1_impossible, x2_impossible
#>• Strata: person and label
#>

Note in the code above that person is assigned roles subset and stratum, whereas label is assigned roles target and stratum. When adapting the code above to real data, the important part is the mlr3::TaskClassif line which tells mlr3 what data set to use, and what columns should be used for target/subset/stratum.
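To illustrate what the stratum role is for, the base R sketch below assigns folds separately within each person/label combination, so that every fold receives nearly the same number of rows from each combination. This is a sketch of the general idea of stratified fold assignment, and is not claimed to be how mlr3 implements it; the names strata.df, stratum, and count.tab are hypothetical.

```r
set.seed(1)
# 50 rows for each of the 4 person/label combinations, 200 rows in total.
strata.df <- expand.grid(
  label=c("spam", "not spam"),
  person=1:2)[rep(1:4, each=50), ]
stratum <- interaction(strata.df$person, strata.df$label)
n.folds <- 3
fold <- integer(nrow(strata.df))
for(s in levels(stratum)){
  is.s <- stratum == s
  # Shuffle fold ids 1,2,3,1,2,3,... within this stratum only.
  fold[is.s] <- sample(rep(seq_len(n.folds), length.out=sum(is.s)))
}
(count.tab <- table(stratum, fold))
```

Each cell of the resulting table is 16 or 17, so no fold is missing a person or a label class.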

The code below is used to define a K-fold cross-validation experiment,

(class_same_other <- mlr3resampling::ResamplingSameOtherCV$new())
#>
#>── <ResamplingSameOtherCV> : Same versus Other Cross-Validation ────────────────
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3

The code below is used to define the learning algorithms to test,

(class.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerClassifRpart$new(),
  mlr3::LearnerClassifFeatureless$new()))
#>[[1]]
#>
#>── <LearnerClassifRpart> (classif.rpart): Classification Tree ──────────────────
#>• Model: -
#>• Parameters: xval=0
#>• Packages: mlr3 and rpart
#>• Predict Types: [response] and prob
#>• Feature Types: logical, integer, numeric, factor, and ordered
#>• Encapsulation: none (fallback: -)
#>• Properties: importance, missings, multiclass, selected_features, twoclass,
#>and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>
#>[[2]]
#>
#>── <LearnerClassifFeatureless> (classif.featureless): Featureless Classification
#>• Model: -
#>• Parameters: method=mode
#>• Packages: mlr3
#>• Predict Types: [response] and prob
#>• Feature Types: logical, integer, numeric, character, factor, ordered,
#>POSIXct, and Date
#>• Encapsulation: none (fallback: -)
#>• Properties: featureless, importance, missings, multiclass, selected_features,
#>twoclass, and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>

The code below defines the grid of tasks, learners, and resamplings.

(class.bench.grid <- mlr3::benchmark_grid(
  class.task.list,
  class.learner.list,
  class_same_other))

task	learner	resampling
TaskClassif:easy	LearnerClassifRpart:classif.rpart
TaskClassif:easy	LearnerClassifFeatureless:classif.featureless
TaskClassif:impossible	LearnerClassifRpart:classif.rpart
TaskClassif:impossible	LearnerClassifFeatureless:classif.featureless

The code below runs the benchmark experiment grid. Note that each iteration can be parallelized by declaring a future plan, which is disabled below via if(FALSE).

if(FALSE){
  if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(class.bench.result <- mlr3::benchmark(
  class.bench.grid, store_models = TRUE))
#>
#>── <BenchmarkResult> of 72 rows with 4 resampling run ──────────────────────────
#> nr    task_id          learner_id resampling_id iters warnings errors
#>  1       easy       classif.rpart same_other_cv    18        0      0
#>  2       easy classif.featureless same_other_cv    18        0      0
#>  3 impossible       classif.rpart same_other_cv    18        0      0
#>  4 impossible classif.featureless same_other_cv    18        0      0

Below we compute scores (test error) for each resampling iteration, and show the first row of the result.

class.bench.score <- mlr3resampling::score(class.bench.result)
class.bench.score[1]

train.subsets	test.fold	test.subset	person	iteration	test	train	uhash	nr	task	task_id	learner	learner_id	resampling	resampling_id	prediction_test	classif.ce	algorithm
all	1	1	1	1	1, 2, 8,11,12,18,…[34]	3, 4, 5, 6, 9,10,…[132]	ce0cbe2e-0541-47fb-a53a-6318bf9ad968	1	TaskClassif:easy	easy	LearnerClassifRpart:classif.rpart	classif.rpart		same_other_cv		0.088	rpart
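The classif.ce column above is the classification error, which is the proportion of test labels that are predicted incorrectly. The base R sketch below computes it for hypothetical label vectors, and also shows the featureless baseline with method=mode, which always predicts the most frequent train label. All names and values in this block are made up for illustration.

```r
truth <- c("spam", "not spam", "spam", "spam", "not spam")
pred.model <- c("spam", "not spam", "not spam", "spam", "not spam")
# Featureless baseline: always predict the most frequent train label.
train.labels <- c("spam", "spam", "not spam")
pred.featureless <- rep(
  names(which.max(table(train.labels))), length(truth))
# Classification error: proportion of incorrect predictions.
classif_ce <- function(truth, pred) mean(truth != pred)
classif_ce(truth, pred.model)        # 1 of 5 wrong = 0.2
classif_ce(truth, pred.featureless)  # 2 of 5 wrong = 0.4
```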

Finally we plot the test error values below.

if(require(animint2)){
  ggplot()+
    print_theme+
    geom_point(aes(
      classif.ce, train.subsets, color=algorithm),
      shape=1,
      data=class.bench.score)+
    facet_grid(
      person ~ task_id,
      labeller=label_both,
      scales="free")
}

It is clear from the plot above that

for the easy task, training on same is just as good as all or other subsets.
for the impossible task, we must train on same subset for minimal test error; training on all is almost as good, because the pattern in person 1 is orthogonal to person 2; training on other is just as bad as featureless, because patterns are different.
in a real data task, training on other will most likely not be quite as bad as in the impossible task above, but also not as good as in the easy task.

Interactive visualization of data, test error, and splits

The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.

inst <- class.bench.score$resampling[[1]]$instance
rect.expand <- 0.3
grid.value.dt <- scatter.dt[
, lapply(.SD, function(x)do.call(seq, c(as.list(range(x)), length.out=21)))
, .SDcols=c("x1","x2")]
grid.class.dt <- data.table(
  label=full.dt$label[1],
  do.call(
    CJ, grid.value.dt
  )
)
class.pred.dt.list <- list()
class.point.dt.list <- list()
for(score.i in 1:nrow(class.bench.score)){
  class.bench.row <- class.bench.score[score.i]
  task.dt <- data.table(
    class.bench.row$task[[1]]$data(),
    class.bench.row$resampling[[1]]$instance$id.dt)
  names(task.dt)[2:3] <- c("x1","x2")
  set.ids <- data.table(
    set.name=c("test","train")
  )[
  , data.table(row_id=class.bench.row[[set.name]][[1]])
  , by=set.name]
  i.points <- set.ids[
    task.dt, on="row_id"
  ][
    is.na(set.name), set.name := "unused"
  ][]
  class.point.dt.list[[score.i]] <- data.table(
    class.bench.row[, .(task_id, iteration)],
    i.points)
  if(class.bench.row$algorithm!="featureless"){
    i.learner <- class.bench.row$learner[[1]]
    i.learner$predict_type <- "prob"
    i.task <- class.bench.row$task[[1]]
    setnames(grid.class.dt, names(i.task$data()))
    grid.class.task <- mlr3::TaskClassif$new(
      "grid", grid.class.dt, target="label")
    pred.grid <- as.data.table(
      i.learner$predict(grid.class.task)
    )[, data.table(grid.class.dt, prob.spam)]
    names(pred.grid)[2:3] <- c("x1","x2")
    pred.wide <- dcast(pred.grid, x1 ~ x2, value.var="prob.spam")
    prob.mat <- as.matrix(pred.wide[,-1])
    contour.list <- contourLines(
      grid.value.dt$x1, grid.value.dt$x2, prob.mat, levels=0.5)
    class.pred.dt.list[[score.i]] <- data.table(
      class.bench.row[, .(
        task_id, iteration, algorithm
      )],
      data.table(contour.i=seq_along(contour.list))[, {
        do.call(data.table, contour.list[[contour.i]])[, .(level, x1=x, x2=y)]
      }, by=contour.i]
    )
  }
}
(class.pred.dt <- rbindlist(class.pred.dt.list))

task_id	iteration	algorithm	contour.i	level	x1	x2
easy	1	rpart	1	0.5	1.856	-3.008
easy	1	rpart	1	0.5	1.856	-2.607
easy	1	rpart	1	0.5	1.856	-2.205
easy	1	rpart	1	0.5	1.856	-1.804
easy	1	rpart	1	0.5	1.856	-1.402
⋮	⋮	⋮	⋮	⋮	⋮	⋮
impossible	18	rpart	1	0.5	3.744	1.225
impossible	18	rpart	1	0.5	4.158	1.225
impossible	18	rpart	1	0.5	4.573	1.225
impossible	18	rpart	1	0.5	4.987	1.225
impossible	18	rpart	1	0.5	5.402	1.225

(class.point.dt <- rbindlist(class.point.dt.list))

task_id	iteration	set.name	row_id	label	x1	x2	fold	person	subset	display_row
easy	1	test	1	spam	2.374	1.074	1	1	1	1
easy	1	test	2	not spam	0.184	1.896	1	1	1	2
easy	1	train	3	spam	2.164	-0.603	2	1	1	35
easy	1	train	4	not spam	1.595	-0.391	2	1	1	36
easy	1	train	5	spam	3.330	-0.416	2	1	1	37
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
impossible	18	train	196	not spam	-0.924	-1.029	2	2	2	166
impossible	18	train	197	spam	1.593	2.989	2	2	2	167
impossible	18	train	198	not spam	0.045	-1.225	1	2	2	133
impossible	18	train	199	spam	-0.715	0.404	1	2	2	134
impossible	18	train	200	not spam	0.865	1.169	2	2	2	168

set.colors <- c(
  train="#1B9E77",
  test="#D95F02",
  unused="white")
algo.colors <- c(
  featureless="blue",
  rpart="red")
make_person_subset <- function(DT){
  DT[, "person/subset" := person]
}
make_person_subset(class.point.dt)
make_person_subset(class.bench.score)
if(require(animint2)){
  viz <- animint(
    title="SOAK algorithm: train/predict on subsets, classification",
    video="https://vimeo.com/1053464329",
    pred=ggplot()+
      ggtitle("Predictions for selected train/test split")+
      theme_animint(height=700, width=700, rowspan=2)+
      scale_fill_manual(values=set.colors)+
      scale_color_manual(values=c(spam="black","not spam"="white"))+
      geom_point(aes(
        x1, x2, color=label, fill=set.name),
        showSelected="iteration",
        size=3,
        help="One dot for each train/test/unused data point.",
        stroke=2,
        shape=21,
        data=class.point.dt)+
      geom_path(aes(
        x1, x2, 
        group=paste(algorithm, iteration, contour.i)),
        showSelected=c("iteration","algorithm"),
        help="Red path represents decision boundary of rpart decision tree learning algorithm.",
        color=algo.colors[["rpart"]],
        data=class.pred.dt)+
      facet_grid(
        task_id ~ `person/subset`,
        labeller=label_both,
        space="free",
        scales="free")+
      scale_y_continuous(
        breaks=seq(-100, 100, by=2)),
    err=ggplot()+
      ggtitle("Test error for each split")+
      theme_animint(height=400, width=400, last_in_row=TRUE)+
      theme(panel.margin=grid::unit(1, "lines"))+
      scale_y_continuous(
        "Classification error on test set",
        breaks=seq(0, 1, by=0.25))+
      scale_fill_manual(values=algo.colors)+
      scale_x_discrete(
        "People/subsets in train set")+
      geom_hline(aes(
        yintercept=yint),
        help="Horizontal lines highlight baseline error rate of 50%.",
        data=data.table(yint=0.5),
        color="grey50")+
      geom_point(aes(
        train.subsets, classif.ce, fill=algorithm),
        help="One dot per test set and learning algorithm.",
        shape=1,
        size=5,
        stroke=2,
        color="black",
        color_off=NA,
        clickSelects="iteration",
        data=class.bench.score)+
      facet_grid(
        task_id ~ `person/subset`,
        labeller=label_both),
    diagram=ggplot()+
      ggtitle("Select train/test split")+
      theme_animint(height=300, width=400)+
      facet_grid(
        . ~ train.subsets,
        scales="free",
        space="free")+
      scale_size_manual(values=c(subset=3, fold=1))+
      scale_color_manual(values=c(subset="orange", fold="grey50"))+
      geom_rect(aes(
        xmin=-Inf, xmax=Inf,
        color=rows,
        size=rows,
        ymin=display_row, ymax=display_end),
        help="One rect per chunk of data with common fold (grey) and subset (gold).",
        fill=NA,
        data=inst$viz.rect.dt)+
      scale_fill_manual(values=set.colors)+
      geom_label_aligned(aes(
        x=ifelse(rows=="subset", Inf, -Inf),
        y=(display_row+display_end)/2,
        color=rows,
        hjust=ifelse(rows=="subset", 1, 0),
        label=paste0(rows, "=", ifelse(rows=="subset", subset, fold))),
        help="Text labels indicate chunks of data with common fold (grey) and subset (gold).",
        showSelected="rows",
        data=data.table(train.name="same", inst$viz.rect.dt))+
      geom_rect(aes(
        xmin=iteration-rect.expand, ymin=display_row,
        xmax=iteration+rect.expand, ymax=display_end,
        fill=set.name),
        help="One rect per chunk of data assigned to train/test set in cross-validation.",
        alpha=0.5,
        alpha_off=0.5,
        color="black",
        color_off=NA,
        clickSelects="iteration",
        data=inst$viz.set.dt)+
      scale_x_continuous(
        "Split number / cross-validation iteration",
        breaks=c(1,6, 7,12, 13,18))+
      scale_y_continuous(
        "Row number"),
    first=list(iteration=10),
    source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
  animint2pages(viz, "2023-12-13-train-predict-subsets-classification", chromote_sleep_seconds=5)
}

If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/

Conclusion

In this section we have shown how to use mlr3resampling for comparing test error of models trained on same/all/other subsets.
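
As a recap of what the three train sets mean, here is a minimal base R sketch; it is not the package implementation. The subset.vec and fold.vec values are made-up toy data, and the sketch assumes that the test fold is held out of every train set, which is consistent with the all row shown earlier (34 test rows and 132 train rows out of 200).

```r
# Toy data: 12 rows, 2 subsets (e.g. persons), 3 folds (made-up values).
subset.vec <- rep(c(1, 2), each = 6)
fold.vec <- rep(1:3, times = 4)
test.subset <- 1
test.fold <- 1
# Test set: rows in both the test subset and the test fold.
test.rows <- which(subset.vec == test.subset & fold.vec == test.fold)
# Candidate train rows: everything outside the test fold (assumption).
not.test.fold <- fold.vec != test.fold
train.rows <- list(
  same = which(not.test.fold & subset.vec == test.subset),
  other = which(not.test.fold & subset.vec != test.subset),
  all = which(not.test.fold))
str(list(test = test.rows, train = train.rows))
```

In this toy example, rows of the other subset that fall in the test fold are neither train nor test, which corresponds to the unused set in the visualization above.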

Variable size train resampler

The goal of this section is to explain how to use ResamplingVariableSizeTrainCV, which can be used to determine how many train data are necessary to provide accurate predictions on a given test set.

Simulated regression problems

The code below creates data for simulated regression problems. First we define a vector of input values,

N <- 300
abs.x <- 10
set.seed(1)
x.vec <- runif(N, -abs.x, abs.x)
str(x.vec)
#> num [1:300] -4.69 -2.56 1.46 8.16 -5.97 ...

Below we define a list of two true regression functions (tasks in mlr3 terminology) for our simulated data,

reg.pattern.list <- list(
  sin=sin,
  constant=function(x)0)

The constant function represents a regression problem which can be solved by always predicting the mean value of outputs (featureless is the best possible learning algorithm). The sin function will be used to generate data with a non-linear pattern that will need to be learned. Below we use a for loop over these two functions/tasks, to simulate the data which will be used as input to the learning algorithms:

library(data.table)
reg.task.list <- list()
reg.data.list <- list()
for(task_id in names(reg.pattern.list)){
  f <- reg.pattern.list[[task_id]]
  task.dt <- data.table(
    x=x.vec,
    y = f(x.vec)+rnorm(N,sd=0.5))
  reg.data.list[[task_id]] <- data.table(task_id, task.dt)
  reg.task.list[[task_id]] <- mlr3::TaskRegr$new(
    task_id, task.dt, target="y"
  )
}
(reg.data <- rbindlist(reg.data.list))

task_id	x	y
sin	-4.690	1.225
sin	-2.558	-0.561
sin	1.457	0.835
sin	8.164	0.488
sin	-5.966	-0.432
⋮	⋮	⋮
constant	3.629	-0.673
constant	-8.017	0.517
constant	-7.622	-0.406
constant	-8.991	0.901
constant	8.585	0.886

In the table above, the input is x, and the output is y. Below we visualize these data, with one task in each facet/panel:

if(require(animint2)){
  ggplot()+
    print_theme+
    geom_point(aes(
      x, y),
      data=reg.data)+
    facet_grid(task_id ~ ., labeller=label_both)
}

In the plot above we can see two different simulated data sets (constant and sin). Note that the code above used the animint2 package, which provides interactive extensions to the static graphics of the ggplot2 package (see the Interactive data viz section below).

Visualizing instance table

In the code below, we define a K-fold cross-validation experiment, with K=3 folds.

reg_size_cv <- mlr3resampling::ResamplingVariableSizeTrainCV$new()
reg_size_cv$param_set$values$train_sizes <- 6
reg_size_cv
#>
#>── <ResamplingVariableSizeTrainCV> : Cross-Validation with variable size train s
#>• Iterations:
#>• Instantiated: FALSE
#>• Parameters: folds=3, min_train_data=10, random_seeds=3, train_sizes=6

In the output above we can see the parameters of the resampling object, all of which should be integer scalars:

folds is the number of cross-validation folds.
min_train_data is the minimum number of train data to consider.
random_seeds is the number of random seeds, each of which determines a different random ordering of the train data. The random ordering determines which data are included in small train set sizes.
train_sizes is the number of train set sizes, evenly spaced on a log scale, from min_train_data to the max number of train data (determined by folds).
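
To make the log-scale spacing concrete, the base R sketch below computes a grid of train set sizes by hand. It is not the package's code: rounding to the nearest integer is an assumption, chosen because it reproduces the train_size values shown in the instance below for the 300-row simulated tasks.

```r
folds <- 3
min_train_data <- 10
train_sizes <- 6
N <- 300  # number of rows in the simulated regression tasks above
# With K folds, each train set can use at most (K-1)/K of the rows.
max_train_data <- N * (folds - 1) / folds
# Evenly spaced on the log scale, then rounded to integers (assumption).
size.vec <- as.integer(round(exp(seq(
  log(min_train_data), log(max_train_data), length.out = train_sizes))))
size.vec
```

Each size is larger than the previous one by a roughly constant factor, so a log-scale x axis shows these sizes evenly spaced.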

Below we instantiate the resampling on one of the tasks:

reg_size_cv$instantiate(reg.task.list[["sin"]])
reg_size_cv$instance
#>$iteration.dt
#>    test.fold  seed small_stratum_size train_size_i train_size
#>        <int> <int>              <int>        <int>      <int>
#> 1:         1     1                 10            1         10
#> 2:         1     1                 18            2         18
#> 3:         1     1                 33            3         33
#> 4:         1     1                 60            4         60
#> 5:         1     1                110            5        110
#> 6:         1     1                200            6        200
#> 7:         1     2                 10            1         10
#> 8:         1     2                 18            2         18
#> 9:         1     2                 33            3         33
#>10:         1     2                 60            4         60
#>11:         1     2                110            5        110
#>12:         1     2                200            6        200
#>13:         1     3                 10            1         10
#>14:         1     3                 18            2         18
#>15:         1     3                 33            3         33
#>16:         1     3                 60            4         60
#>17:         1     3                110            5        110
#>18:         1     3                200            6        200
#>19:         2     1                 10            1         10
#>20:         2     1                 18            2         18
#>21:         2     1                 33            3         33
#>22:         2     1                 60            4         60
#>23:         2     1                110            5        110
#>24:         2     1                200            6        200
#>25:         2     2                 10            1         10
#>26:         2     2                 18            2         18
#>27:         2     2                 33            3         33
#>28:         2     2                 60            4         60
#>29:         2     2                110            5        110
#>30:         2     2                200            6        200
#>31:         2     3                 10            1         10
#>32:         2     3                 18            2         18
#>33:         2     3                 33            3         33
#>34:         2     3                 60            4         60
#>35:         2     3                110            5        110
#>36:         2     3                200            6        200
#>37:         3     1                 10            1         10
#>38:         3     1                 18            2         18
#>39:         3     1                 33            3         33
#>40:         3     1                 60            4         60
#>41:         3     1                110            5        110
#>42:         3     1                200            6        200
#>43:         3     2                 10            1         10
#>44:         3     2                 18            2         18
#>45:         3     2                 33            3         33
#>46:         3     2                 60            4         60
#>47:         3     2                110            5        110
#>48:         3     2                200            6        200
#>49:         3     3                 10            1         10
#>50:         3     3                 18            2         18
#>51:         3     3                 33            3         33
#>52:         3     3                 60            4         60
#>53:         3     3                110            5        110
#>54:         3     3                200            6        200
#>    test.fold  seed small_stratum_size train_size_i train_size
#>        <int> <int>              <int>        <int>      <int>
#>                               train                       test iteration
#>                              <list>                     <list>     <int>
#> 1:  216,197, 81,171,143, 36,...[10]  1, 7,11,13,15,19,...[100]         1
#> 2:  216,197, 81,171,143, 36,...[18]  1, 7,11,13,15,19,...[100]         2
#> 3:  216,197, 81,171,143, 36,...[33]  1, 7,11,13,15,19,...[100]         3
#> 4:  216,197, 81,171,143, 36,...[60]  1, 7,11,13,15,19,...[100]         4
#> 5: 216,197, 81,171,143, 36,...[110]  1, 7,11,13,15,19,...[100]         5
#> 6: 216,197, 81,171,143, 36,...[200]  1, 7,11,13,15,19,...[100]         6
#> 7:  260,291, 16,164,109, 45,...[10]  1, 7,11,13,15,19,...[100]         7
#> 8:  260,291, 16,164,109, 45,...[18]  1, 7,11,13,15,19,...[100]         8
#> 9:  260,291, 16,164,109, 45,...[33]  1, 7,11,13,15,19,...[100]         9
#>10:  260,291, 16,164,109, 45,...[60]  1, 7,11,13,15,19,...[100]        10
#>11: 260,291, 16,164,109, 45,...[110]  1, 7,11,13,15,19,...[100]        11
#>12: 260,291, 16,164,109, 45,...[200]  1, 7,11,13,15,19,...[100]        12
#>13:   14,253,115,102,293, 18,...[10]  1, 7,11,13,15,19,...[100]        13
#>14:   14,253,115,102,293, 18,...[18]  1, 7,11,13,15,19,...[100]        14
#>15:   14,253,115,102,293, 18,...[33]  1, 7,11,13,15,19,...[100]        15
#>16:   14,253,115,102,293, 18,...[60]  1, 7,11,13,15,19,...[100]        16
#>17:  14,253,115,102,293, 18,...[110]  1, 7,11,13,15,19,...[100]        17
#>18:  14,253,115,102,293, 18,...[200]  1, 7,11,13,15,19,...[100]        18
#>19:  203,197, 81,171,130, 43,...[10]  4, 6, 9,12,14,16,...[100]        19
#>20:  203,197, 81,171,130, 43,...[18]  4, 6, 9,12,14,16,...[100]        20
#>21:  203,197, 81,171,130, 43,...[33]  4, 6, 9,12,14,16,...[100]        21
#>22:  203,197, 81,171,130, 43,...[60]  4, 6, 9,12,14,16,...[100]        22
#>23: 203,197, 81,171,130, 43,...[110]  4, 6, 9,12,14,16,...[100]        23
#>24: 203,197, 81,171,130, 43,...[200]  4, 6, 9,12,14,16,...[100]        24
#>25:  251,291, 19,164,109, 55,...[10]  4, 6, 9,12,14,16,...[100]        25
#>26:  251,291, 19,164,109, 55,...[18]  4, 6, 9,12,14,16,...[100]        26
#>27:  251,291, 19,164,109, 55,...[33]  4, 6, 9,12,14,16,...[100]        27
#>28:  251,291, 19,164,109, 55,...[60]  4, 6, 9,12,14,16,...[100]        28
#>29: 251,291, 19,164,109, 55,...[110]  4, 6, 9,12,14,16,...[100]        29
#>30: 251,291, 19,164,109, 55,...[200]  4, 6, 9,12,14,16,...[100]        30
#>31:   15,253,115,110,293, 18,...[10]  4, 6, 9,12,14,16,...[100]        31
#>32:   15,253,115,110,293, 18,...[18]  4, 6, 9,12,14,16,...[100]        32
#>33:   15,253,115,110,293, 18,...[33]  4, 6, 9,12,14,16,...[100]        33
#>34:   15,253,115,110,293, 18,...[60]  4, 6, 9,12,14,16,...[100]        34
#>35:  15,253,115,110,293, 18,...[110]  4, 6, 9,12,14,16,...[100]        35
#>36:  15,253,115,110,293, 18,...[200]  4, 6, 9,12,14,16,...[100]        36
#>37:  203,211, 82,194,130, 43,...[10]  2, 3, 5, 8,10,17,...[100]        37
#>38:  203,211, 82,194,130, 43,...[18]  2, 3, 5, 8,10,17,...[100]        38
#>39:  203,211, 82,194,130, 43,...[33]  2, 3, 5, 8,10,17,...[100]        39
#>40:  203,211, 82,194,130, 43,...[60]  2, 3, 5, 8,10,17,...[100]        40
#>41: 203,211, 82,194,130, 43,...[110]  2, 3, 5, 8,10,17,...[100]        41
#>42: 203,211, 82,194,130, 43,...[200]  2, 3, 5, 8,10,17,...[100]        42
#>43:  251,295, 19,189,102, 55,...[10]  2, 3, 5, 8,10,17,...[100]        43
#>44:  251,295, 19,189,102, 55,...[18]  2, 3, 5, 8,10,17,...[100]        44
#>45:  251,295, 19,189,102, 55,...[33]  2, 3, 5, 8,10,17,...[100]        45
#>46:  251,295, 19,189,102, 55,...[60]  2, 3, 5, 8,10,17,...[100]        46
#>47: 251,295, 19,189,102, 55,...[110]  2, 3, 5, 8,10,17,...[100]        47
#>48: 251,295, 19,189,102, 55,...[200]  2, 3, 5, 8,10,17,...[100]        48
#>49:   15,263,135,110,296, 25,...[10]  2, 3, 5, 8,10,17,...[100]        49
#>50:   15,263,135,110,296, 25,...[18]  2, 3, 5, 8,10,17,...[100]        50
#>51:   15,263,135,110,296, 25,...[33]  2, 3, 5, 8,10,17,...[100]        51
#>52:   15,263,135,110,296, 25,...[60]  2, 3, 5, 8,10,17,...[100]        52
#>53:  15,263,135,110,296, 25,...[110]  2, 3, 5, 8,10,17,...[100]        53
#>54:  15,263,135,110,296, 25,...[200]  2, 3, 5, 8,10,17,...[100]        54
#>                               train                       test iteration
#>                              <list>                     <list>     <int>
#>    train_min_size
#>             <int>
#> 1:             10
#> 2:             18
#> 3:             33
#> 4:             60
#> 5:            110
#> 6:            200
#> 7:             10
#> 8:             18
#> 9:             33
#>10:             60
#>11:            110
#>12:            200
#>13:             10
#>14:             18
#>15:             33
#>16:             60
#>17:            110
#>18:            200
#>19:             10
#>20:             18
#>21:             33
#>22:             60
#>23:            110
#>24:            200
#>25:             10
#>26:             18
#>27:             33
#>28:             60
#>29:            110
#>30:            200
#>31:             10
#>32:             18
#>33:             33
#>34:             60
#>35:            110
#>36:            200
#>37:             10
#>38:             18
#>39:             33
#>40:             60
#>41:            110
#>42:            200
#>43:             10
#>44:             18
#>45:             33
#>46:             60
#>47:            110
#>48:            200
#>49:             10
#>50:             18
#>51:             33
#>52:             60
#>53:            110
#>54:            200
#>    train_min_size
#>             <int>
#>
#>$id.dt
#>     row_id  fold
#>      <int> <int>
#>  1:      1     1
#>  2:      2     3
#>  3:      3     3
#>  4:      4     2
#>  5:      5     3
#> ---             
#>296:    296     2
#>297:    297     1
#>298:    298     1
#>299:    299     3
#>300:    300     2
#>

Above we see the instance. Users do not need to examine it, but for reference it contains the following data:

iteration.dt has one row for each train/test split,
id.dt has one row for each data point.
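
In the train column above, every train set size within a seed starts with the same row numbers, which suggests that smaller train sets are nested inside larger ones. The base R sketch below illustrates that idea in a simplified form. It ignores strata, uses a hypothetical pool of train row numbers (train.pool), and is not the package's actual sampling code.

```r
set.seed(1)
train.pool <- 101:300  # hypothetical rows available for training
size.vec <- c(10, 18, 33, 60, 110, 200)
# One random ordering per seed; each train set is a prefix of that ordering.
ordering <- sample(train.pool)
train.list <- lapply(size.vec, function(n) ordering[seq_len(n)])
sapply(train.list, length)
# Check that every train set is contained in the next larger one.
nested <- mapply(
  function(small, big) all(small %in% big),
  train.list[-length(train.list)], train.list[-1])
all(nested)
```

Repeating this with a different seed gives a different ordering, and therefore different small train sets, while the largest train set contains the same rows.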

Benchmark: computing test error

In the code below, we define two learners to compare,

(reg.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerRegrRpart$new(),
  mlr3::LearnerRegrFeatureless$new()))
#>[[1]]
#>
#>── <LearnerRegrRpart> (regr.rpart): Regression Tree ────────────────────────────
#>• Model: -
#>• Parameters: xval=0
#>• Packages: mlr3 and rpart
#>• Predict Types: [response]
#>• Feature Types: logical, integer, numeric, factor, and ordered
#>• Encapsulation: none (fallback: -)
#>• Properties: importance, missings, selected_features, and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>
#>[[2]]
#>
#>── <LearnerRegrFeatureless> (regr.featureless): Featureless Regression Learner ─
#>• Model: -
#>• Parameters: robust=FALSE
#>• Packages: mlr3 and stats
#>• Predict Types: [response], se, and quantiles
#>• Feature Types: logical, integer, numeric, character, factor, ordered,
#>POSIXct, and Date
#>• Encapsulation: none (fallback: -)
#>• Properties: featureless, importance, missings, selected_features, and weights
#>• Other settings: use_weights = 'use', predict_raw = 'FALSE'
#>

The code above defines

regr.rpart: Regression Tree learning algorithm, which should be able to learn the non-linear pattern in the sin data (if there are enough data in the train set).
regr.featureless: Featureless Regression learning algorithm, which should be optimal for the constant data, and can be used as a baseline in the sin data. When the rpart learner gets smaller prediction error rates than featureless, then we know that it has learned some non-trivial relationship between inputs and outputs.
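
To make the featureless baseline concrete, the base R sketch below computes by hand what such a learner predicts, assuming the robust=FALSE setting shown in the output above (the prediction is the mean of the train outputs). The data are simulated in the same way as the sin task, and the train/test split is a hypothetical one, not one produced by the resampler.

```r
set.seed(1)
N <- 300
x <- runif(N, -10, 10)
y <- sin(x) + rnorm(N, sd = 0.5)
is.test <- seq_len(N) %% 3 == 1  # hypothetical split: every third row is test
# Featureless regression ignores x and predicts the mean train output.
featureless.pred <- mean(y[!is.test])
# Mean squared error on the test set, the same metric as regr.mse.
(test.mse <- mean((y[is.test] - featureless.pred)^2))
```

Any learner that uses x should reach a smaller test error than this value once it has enough train data to pick up the sin pattern.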

In the code below, we define the benchmark grid, which is all combinations of tasks (constant and sin), learners (rpart and featureless), and the one resampling method.

(reg.bench.grid <- mlr3::benchmark_grid(
  reg.task.list,
  reg.learner.list,
  reg_size_cv))

task	learner	resampling
TaskRegr:sin	LearnerRegrRpart:regr.rpart
TaskRegr:sin	LearnerRegrFeatureless:regr.featureless
TaskRegr:constant	LearnerRegrRpart:regr.rpart
TaskRegr:constant	LearnerRegrFeatureless:regr.featureless

In the code below, we execute the benchmark experiment (optionally in parallel using the multisession future plan).

if(FALSE){
  if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(reg.bench.result <- mlr3::benchmark(
  reg.bench.grid, store_models = TRUE))
#>
#>── <BenchmarkResult> of 216 rows with 4 resampling run ─────────────────────────
#> nr  task_id       learner_id          resampling_id iters warnings errors
#>  1      sin       regr.rpart variable_size_train_cv    54        0      0
#>  2      sin regr.featureless variable_size_train_cv    54        0      0
#>  3 constant       regr.rpart variable_size_train_cv    54        0      0
#>  4 constant regr.featureless variable_size_train_cv    54        0      0

The code below computes the test error for each split, and shows the first row of the result:

reg.bench.score <- mlr3resampling::score(reg.bench.result)
reg.bench.score[1]

test.fold	seed	small_stratum_size	train_size_i	train_size	train	test	iteration	train_min_size	uhash	nr	task	task_id	learner	learner_id	resampling	resampling_id	prediction_test	regr.mse	algorithm
1	1	10	1	10	216,197, 81,171,143, 36,…[10]	1, 7,11,13,15,19,…[100]	1	10	1de7da88-134e-44c1-abdc-a8d89cc1cd68	1	TaskRegr:sin	sin	LearnerRegrRpart:regr.rpart	regr.rpart		variable_size_train_cv		0.801	rpart

The output above contains all of the results related to a particular train/test split. In particular for our purposes, the interesting columns are:

test.fold is the cross-validation fold ID.
seed is the random seed used to determine the train set order.
train_size is the number of data in the train set.
train and test are vectors of row numbers assigned to each set.
iteration is an ID for the train/test split, for a particular learning algorithm and task. It is the row number of iteration.dt (see instance above), which has one row for each unique combination of test.fold, seed, and train_size.
learner is the mlr3 learner object, which can be used to compute predictions on new data (including a grid of inputs, to show predictions in the visualization below).
regr.mse is the mean squared error on the test set.
algorithm is the name of the learning algorithm (same as learner_id but without regr. prefix).
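
The numbering of iterations can be reproduced with a short base R sketch, given the parameter values printed earlier (folds=3, random_seeds=3, train_sizes=6). This is only an illustration of the ordering seen in iteration.dt, not the package's code, and iteration.df is a name introduced here.

```r
folds <- 3
random_seeds <- 3
train_sizes <- 6
# expand.grid varies its first argument fastest, so train size changes
# fastest, then seed, then test fold, as in iteration.dt above.
iteration.df <- expand.grid(
  train_size_i = seq_len(train_sizes),
  seed = seq_len(random_seeds),
  test.fold = seq_len(folds))
iteration.df$iteration <- seq_len(nrow(iteration.df))
nrow(iteration.df)
iteration.df[c(7, 19, 54), ]
```

The number of rows is 3 x 3 x 6 = 54, which is the number of iterations reported for each task and learner in the benchmark result above.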

The code below visualizes the resulting test error values.

train_size_vec <- unique(reg.bench.score$train_size)
if(require(animint2)){
  ggplot()+
    print_theme+
    scale_x_log10(
      breaks=train_size_vec)+
    scale_y_log10()+
    geom_line(aes(
      train_size, regr.mse,
      group=paste(algorithm, seed),
      color=algorithm),
      data=reg.bench.score)+
    geom_point(aes(
      train_size, regr.mse, color=algorithm),
      shape=1,
      data=reg.bench.score)+
    facet_grid(
      test.fold~task_id,
      labeller=label_both,
      scales="free")
}

Above we plot the test error for each fold and train set size. There is a different panel for each task and test fold. Each line represents a random seed (ordering of data in train set), and each dot represents a specific train set size. So the plot above shows that some variation in test error, for a given test fold, is due to the random ordering of the train data.

Below we summarize each train set size, by taking the mean and standard deviation over the random seeds.

reg.mean.dt <- dcast(
  reg.bench.score,
  task_id + train_size + test.fold + algorithm ~ .,
  list(mean, sd),
  value.var="regr.mse")
if(require(animint2)){
  ggplot()+
    print_theme+
    scale_x_log10(
      breaks=train_size_vec)+
    scale_y_log10()+
    geom_ribbon(aes(
      train_size,
      ymin=regr.mse_mean-regr.mse_sd,
      ymax=regr.mse_mean+regr.mse_sd,
      fill=algorithm),
      alpha=0.5,
      data=reg.mean.dt)+
    geom_line(aes(
      train_size, regr.mse_mean, color=algorithm),
      data=reg.mean.dt)+
    facet_grid(
      test.fold~task_id,
      labeller=label_both,
      scales="free")
}

The plot above shows a line for the mean, and a ribbon for the standard deviation, over the three random seeds. It is clear from the plot above that

in the constant task, featureless always has prediction error smaller than or equal to that of rpart, which indicates that rpart sometimes overfits for large sample sizes.
in the sin task, more than 30 samples are required for rpart to be more accurate than featureless, which indicates that it has learned a non-trivial relationship between input and output.

Interactive data viz

The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.

grid.dt <- data.table(x=seq(-abs.x, abs.x, l=101), y=0)
grid.task <- mlr3::TaskRegr$new("grid", grid.dt, target="y")
pred.dt.list <- list()
point.dt.list <- list()
for(score.i in 1:nrow(reg.bench.score)){
  reg.bench.row <- reg.bench.score[score.i]
  task.dt <- data.table(
    reg.bench.row$task[[1]]$data(),
    reg.bench.row$resampling[[1]]$instance$id.dt)
  set.ids <- data.table(
    set.name=c("test","train")
  )[
  , data.table(row_id=reg.bench.row[[set.name]][[1]])
  , by=set.name]
  i.points <- set.ids[
    task.dt, on="row_id"
  ][
    is.na(set.name), set.name := "unused"
  ]
  point.dt.list[[score.i]] <- data.table(
    reg.bench.row[, .(task_id, iteration)],
    i.points)
  i.learner <- reg.bench.row$learner[[1]]
  pred.dt.list[[score.i]] <- data.table(
    reg.bench.row[, .(
      task_id, iteration, algorithm
    )],
    as.data.table(
      i.learner$predict(grid.task)
    )[, .(x=grid.dt$x, y=response)]
  )
}
(pred.dt <- rbindlist(pred.dt.list))

task_id	iteration	algorithm	x	y
sin	1	rpart	-10.0	0.250
sin	1	rpart	-9.8	0.250
sin	1	rpart	-9.6	0.250
sin	1	rpart	-9.4	0.250
sin	1	rpart	-9.2	0.250
⋮	⋮	⋮	⋮	⋮
constant	54	featureless	9.2	-0.034
constant	54	featureless	9.4	-0.034
constant	54	featureless	9.6	-0.034
constant	54	featureless	9.8	-0.034
constant	54	featureless	10.0	-0.034

(point.dt <- rbindlist(point.dt.list))

task_id	iteration	set.name	row_id	y	x	fold
sin	1	test	1	1.225	-4.690	1
sin	1	unused	2	-0.561	-2.558	3
sin	1	unused	3	0.835	1.457	3
sin	1	unused	4	0.488	8.164	2
sin	1	unused	5	-0.432	-5.966	3
⋮	⋮	⋮	⋮	⋮	⋮	⋮
constant	54	train	296	-0.673	3.629	2
constant	54	train	297	0.517	-8.017	1
constant	54	train	298	-0.406	-7.622	1
constant	54	test	299	0.901	-8.991	3
constant	54	train	300	0.886	8.585	2

set.colors <- c(
  train="#1B9E77",
  test="#D95F02",
  unused="white")
algo.colors <- c(
  featureless="blue",
  rpart="red")
if(require(animint2)){
  viz <- animint(
    title="Variable size train set, regression",
    pred=ggplot()+
      ggtitle("Predictions for selected train/test split")+
      theme_animint(height=400)+
      scale_fill_manual(values=set.colors)+
      geom_point(aes(
        x, y, fill=set.name),
        help="One dot per sample in train/test/unused set.",
        showSelected="iteration",
        size=3,
        shape=21,
        data=point.dt)+
      scale_size_manual(values=c(
        featureless=3,
        rpart=2))+
      scale_color_manual(values=algo.colors)+
      geom_line(aes(
        x, y,
        color=algorithm,
        size=algorithm,
        group=paste(algorithm, iteration)),
        help="One line per learned prediction function.",
        showSelected="iteration",
        data=pred.dt)+
      facet_grid(
        task_id ~ .,
        labeller=label_both),
    err=ggplot()+
      ggtitle("Test error for each split")+
      theme_animint(width=500)+
      theme(
        panel.margin=grid::unit(1, "lines"),
        legend.position="none")+
      scale_y_log10(
        "Mean squared error on test set")+
      scale_color_manual(values=algo.colors)+
      scale_x_log10(
        "Train set size",
        breaks=train_size_vec)+
      geom_line(aes(
        train_size, regr.mse,
        group=paste(algorithm, seed),
        color=algorithm),
        help="One line per algorithm and random seed used to order train set.",
        clickSelects="seed",
        alpha_off=0.2,
        showSelected="algorithm",
        size=4,
        data=reg.bench.score)+
      facet_grid(
        test.fold~task_id,
        labeller=label_both,
        scales="free")+
      geom_point(aes(
        train_size, regr.mse,
        color=algorithm),
        help="One point per algorithm and train set size, for the selected random ordering.",
        size=5,
        stroke=3,
        fill="black",
        fill_off=NA,
        showSelected=c("algorithm","seed"),
        clickSelects="iteration",
        data=reg.bench.score),
    video="https://vimeo.com/1053467310",
    source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
  animint2pages(viz, "2023-12-26-train-sizes-regression", chromote_sleep_seconds = 5)
}

If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-26-train-sizes-regression/

The interactive data viz consists of two plots:

The first plot shows the data, with each point colored according to the set it was assigned, in the currently selected split/iteration. The red/blue lines additionally show the learned prediction functions for the currently selected split/iteration.
The second plot shows the test error rates, as a function of train set size. Clicking a line selects the corresponding random seed, which makes the corresponding points on that line appear. Clicking a point selects the corresponding iteration (seed, test fold, and train set size).

Simulated classification problems

Whereas in the section above, we focused on regression (output is a real number), in this section we simulate a binary classification problem (output is a factor with two levels).

class.N <- 900
class.abs.x <- 1
rclass <- function(){
  runif(class.N, -class.abs.x, class.abs.x)
}
library(data.table)
set.seed(1)
class.x.dt <- data.table(x1=rclass(), x2=rclass())
class.fun.list <- list(
  constant=function(...)0.5,
  xor=function(x1, x2)xor(x1>0, x2>0))
class.data.list <- list()
class.task.list <- list()
for(task_id in names(class.fun.list)){
  class.fun <- class.fun.list[[task_id]]
  y <- factor(ifelse(
    class.x.dt[, class.fun(x1, x2)+rnorm(class.N, sd=0.5)]>0.5,
    "spam", "not"))
  task.dt <- data.table(class.x.dt, y)
  this.task <- mlr3::TaskClassif$new(
    task_id, task.dt, target="y")
  this.task$col_roles$stratum <- "y"
  class.task.list[[task_id]] <- this.task
  class.data.list[[task_id]] <- data.table(task_id, task.dt)
}
(class.data <- rbindlist(class.data.list))

task_id	x1	x2	y
constant	-0.469	0.664	not
constant	-0.256	0.534	spam
constant	0.146	-0.454	spam
constant	0.816	-0.624	not
constant	-0.597	-0.548	spam
⋮	⋮	⋮	⋮
xor	-0.761	-0.020	not
xor	0.187	-0.963	not
xor	-0.925	-0.641	not
xor	-0.981	-0.401	spam
xor	-0.677	-0.446	not

The simulated data table above consists of two input features (x1 and x2) along with an output/label to predict (y). Below we count the number of times each label appears in each task:

class.data[, .(count=.N), by=.(task_id, y)]

task_id	y	count
constant	not	462
constant	spam	438
xor	spam	462
xor	not	438

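The table above shows that the two labels are nearly balanced in both tasks: not is the majority class in the constant task, whereas spam is the majority class in the xor task. The featureless baseline predicts the most frequent label in its train set. Below we visualize the data in the feature space: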

if(require(animint2)){
  ggplot()+
    print_theme+
    geom_point(aes(
      x1, x2, color=y),
      shape=1,
      data=class.data)+
    facet_grid(. ~ task_id, labeller=label_both)+
    coord_equal()
}

The plot above shows how the output y is related to the two inputs x1 and x2, for the two tasks.

For the constant task, the two inputs are not related to the output.
For the xor task, the spam label is associated with either x1 or x2 being negative (but not both).
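The simulation code above adds Gaussian noise (sd=0.5) to the underlying function, and then thresholds at 0.5, so the probability of the spam label in each region follows from the normal distribution function. The sketch below computes these probabilities using only base R; the variable names (noise.sd, f.values, etc) are illustrative, and are not used elsewhere in this vignette.

```r
## Noise standard deviation and threshold, as in the simulation code.
noise.sd <- 0.5
threshold <- 0.5
## Underlying function values: the constant task is always 0.5,
## and the xor task is either 0 (FALSE) or 1 (TRUE).
f.values <- c(constant=0.5, xor.false=0, xor.true=1)
## P(f + noise > threshold) is an upper tail normal probability.
(prob.spam <- setNames(
  pnorm(threshold, mean=f.values, sd=noise.sd, lower.tail=FALSE),
  names(f.values)))
## Expected error rate in the xor task, when always predicting
## the more probable label in each region.
(xor.best.error <- min(
  prob.spam[["xor.true"]],
  1 - prob.spam[["xor.true"]]))
```

Under these assumptions, we would not expect a learner to achieve a test error rate much below about 0.16 in the xor task, or much below 0.5 in the constant task.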

In the mlr3 code below, we define a list of learners, our resampling method, and a benchmark grid:

class.learner.list <- list(
  if(requireNamespace("rpart"))mlr3::LearnerClassifRpart$new(),
  mlr3::LearnerClassifFeatureless$new())
size_cv <- mlr3resampling::ResamplingVariableSizeTrainCV$new()
(class.bench.grid <- mlr3::benchmark_grid(
  class.task.list,
  class.learner.list,
  size_cv))

task	learner	resampling
TaskClassif:constant	LearnerClassifRpart:classif.rpart
TaskClassif:constant	LearnerClassifFeatureless:classif.featureless
TaskClassif:xor	LearnerClassifRpart:classif.rpart
TaskClassif:xor	LearnerClassifFeatureless:classif.featureless

Below we run each learning algorithm on each of the train/test splits defined by our benchmark grid:

if(FALSE){
  if(require(future))plan("multisession")
}
if(require(lgr))get_logger("mlr3")$set_threshold("warn")
(class.bench.result <- mlr3::benchmark(
  class.bench.grid, store_models = TRUE))
#>
#>── <BenchmarkResult> of 180 rows with 4 resampling run ─────────────────────────
#> nr  task_id          learner_id          resampling_id iters warnings errors
#>  1 constant       classif.rpart variable_size_train_cv    45        0      0
#>  2 constant classif.featureless variable_size_train_cv    45        0      0
#>  3      xor       classif.rpart variable_size_train_cv    45        0      0
#>  4      xor classif.featureless variable_size_train_cv    45        0      0
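The 45 iterations per resampling run can be understood as a cross product of test folds, random seeds, and train set sizes. The sketch below assumes 3 folds, 3 random seeds, and 5 train set sizes; these values are assumptions, chosen because they are consistent with the 45 iterations reported above (the actual values can be checked in size_cv$param_set).

```r
## Assumed parameter values, not read from the resampler.
n.folds <- 3
n.seeds <- 3
n.train.sizes <- 5
## One row per combination of test fold, seed, and train size index.
iteration.grid <- expand.grid(
  test.fold=seq_len(n.folds),
  seed=seq_len(n.seeds),
  train_size_i=seq_len(n.train.sizes))
(iterations.per.run <- nrow(iteration.grid))
## Two tasks times two learners gives four resampling runs.
n.runs <- 2 * 2
(total.rows <- n.runs * iterations.per.run)
```

The total agrees with the 180 rows reported for the benchmark result.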

Below we compute scores (test error) for each resampling iteration, and show the first row of the result.

class.bench.score <- mlr3resampling::score(class.bench.result)
class.bench.score[1]

test.fold	seed	small_stratum_size	train_size_i	train_size	train	test	iteration	train_min_size	uhash	nr	task	task_id	learner	learner_id	resampling	resampling_id	prediction_test	classif.ce	algorithm
1	1	10	1	21	91,746,863,730,208,508,…[21]	4,10,12,33,40,49,…[300]	1	21	2b1a012a-35ed-4acc-b736-75db61fab97a	1	TaskClassif:constant	constant	LearnerClassifRpart:classif.rpart	classif.rpart		variable_size_train_cv		0.527	rpart

The output above has columns which are very similar to the regression example in the previous section. The main difference is the classif.ce column, which is the classification error on the test set.
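To make the meaning of classification error concrete, the sketch below computes it by hand for a small set of made-up labels and predictions (truth and response are hypothetical values, not taken from the benchmark result).

```r
## Hypothetical true labels and predictions for six test samples.
label.levels <- c("not", "spam")
truth <- factor(
  c("spam", "not", "not", "spam", "not", "spam"),
  levels=label.levels)
response <- factor(
  c("spam", "not", "spam", "spam", "not", "not"),
  levels=label.levels)
## Classification error is the proportion of test samples
## for which the predicted label differs from the true label.
(toy.ce <- mean(truth != response))
```

Here two of the six predictions are wrong, so the error rate is one third.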

Finally we plot the test error values below.

if(require(animint2)){
  ggplot()+
    print_theme+
    geom_line(aes(
      train_size, classif.ce,
      group=paste(algorithm, seed),
      color=algorithm),
      data=class.bench.score)+
    geom_point(aes(
      train_size, classif.ce, color=algorithm),
      shape=1,
      data=class.bench.score)+
    facet_grid(
      task_id ~ test.fold,
      labeller=label_both)+
    scale_x_log10(
      breaks=unique(class.bench.score$train_size))+
    scale_y_continuous(
      "Test error rate",
      limits=c(0.1,0.6),
      breaks=seq(0.1,0.6,by=0.1))
}

It is clear from the plot above that

in the constant task, rpart does not have significantly lower error rates than featureless, which is expected, because the best prediction function is constant (predict the most frequent class, because there is no relationship between inputs and output).
in the xor task, more than 30 train samples are required for rpart to be more accurate than featureless, which indicates it has learned a non-trivial relationship between inputs and output.
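To put the featureless error rates in context, the label counts shown earlier in this section imply the error rate of a constant prediction on the full data set. The sketch below retypes those counts by hand (label.counts is an illustrative name); the error rate on a particular test fold can differ somewhat from this value.

```r
library(data.table)
## Label counts copied from the table earlier in this section.
label.counts <- data.table(
  task_id=c("constant", "constant", "xor", "xor"),
  y=c("not", "spam", "spam", "not"),
  count=c(462, 438, 462, 438))
## Always predicting the most frequent label is wrong
## for every sample which has the other label.
(majority.error <- label.counts[, .(
  majority=y[which.max(count)],
  error=1-max(count)/sum(count)
), by=task_id])
```

In both tasks the constant prediction is wrong for 438 of 900 samples, so the featureless error rate should be near 0.49.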

Exercise for the reader: compute and plot mean and SD for these classification tasks, similar to the plot for the regression tasks in the previous section.
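One possible starting point for this exercise is sketched below. To keep the example self-contained, it uses a small table of made-up error values (toy.score is hypothetical, with some of the same column names as class.bench.score); with real results, class.bench.score would be used instead.

```r
library(data.table)
## Made-up scores: two algorithms, two train sizes, three test folds.
toy.score <- data.table(
  task_id="xor",
  algorithm=rep(c("featureless", "rpart"), each=6),
  train_size=rep(rep(c(21, 600), each=3), times=2),
  test.fold=rep(1:3, times=4),
  classif.ce=c(
    0.50, 0.52, 0.48, 0.49, 0.51, 0.50,
    0.45, 0.55, 0.50, 0.18, 0.20, 0.22))
## Mean and SD of test error, over the rows in each group
## (test folds here; real results would also vary over seeds).
(toy.stats <- toy.score[, .(
  mean=mean(classif.ce),
  sd=sd(classif.ce),
  n=.N
), by=.(task_id, algorithm, train_size)])
```

The resulting table has one row per task, algorithm, and train set size, which could then be plotted, for example using geom_line for the mean and geom_ribbon for the mean plus or minus one SD.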

Interactive visualization of data, test error, and splits

The code below can be used to create an interactive data visualization which allows exploring how different functions are learned during different splits.

class.grid.vec <- seq(-class.abs.x, class.abs.x, length.out=21)
class.grid.dt <- CJ(x1=class.grid.vec, x2=class.grid.vec)
class.pred.dt.list <- list()
class.point.dt.list <- list()
for(score.i in 1:nrow(class.bench.score)){
  class.bench.row <- class.bench.score[score.i]
  task.dt <- data.table(
    class.bench.row$task[[1]]$data(),
    class.bench.row$resampling[[1]]$instance$id.dt)
  set.ids <- data.table(
    set.name=c("test","train")
  )[
  , data.table(row_id=class.bench.row[[set.name]][[1]])
  , by=set.name]
  i.points <- set.ids[
    task.dt, on="row_id"
  ][
    is.na(set.name), set.name := "unused"
  ][]
  class.point.dt.list[[score.i]] <- data.table(
    class.bench.row[, .(task_id, iteration)],
    i.points)
  if(class.bench.row$algorithm!="featureless"){
    i.learner <- class.bench.row$learner[[1]]
    i.learner$predict_type <- "prob"
    i.task <- class.bench.row$task[[1]]
    grid.class.task <- mlr3::TaskClassif$new(
      "grid", class.grid.dt[, label:=factor(NA,levels(task.dt$y))], target="label")
    pred.grid <- as.data.table(
      i.learner$predict(grid.class.task)
    )[, data.table(class.grid.dt, prob.spam)]
    pred.wide <- dcast(pred.grid, x1 ~ x2, value.var="prob.spam")
    prob.mat <- as.matrix(pred.wide[,-1])
    if(length(table(prob.mat))>1){
      contour.list <- contourLines(
        class.grid.vec, class.grid.vec, prob.mat, levels=0.5)
      class.pred.dt.list[[score.i]] <- data.table(
        class.bench.row[, .(
          task_id, iteration, algorithm
        )],
        data.table(contour.i=seq_along(contour.list))[, {
          do.call(data.table, contour.list[[contour.i]])[, .(level, x1=x, x2=y)]
        }, by=contour.i]
      )
    }
  }
}
(class.pred.dt <- rbindlist(class.pred.dt.list))

task_id	iteration	algorithm	contour.i	level	x1	x2
constant	1	rpart	1	0.5	-1.000	-0.353
constant	1	rpart	1	0.5	-0.900	-0.353
constant	1	rpart	1	0.5	-0.800	-0.353
constant	1	rpart	1	0.5	-0.700	-0.353
constant	1	rpart	1	0.5	-0.600	-0.353
⋮	⋮	⋮	⋮	⋮	⋮	⋮
xor	45	rpart	2	0.5	0.700	0.050
xor	45	rpart	2	0.5	0.800	0.050
xor	45	rpart	2	0.5	0.847	0.000
xor	45	rpart	2	0.5	0.900	-0.046
xor	45	rpart	2	0.5	1.000	-0.046
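The contour extraction step in the loop above can be illustrated in isolation. The sketch below replaces the rpart predictions with a hand-written probability function (toy.prob is an assumption for illustration, not a learned model), evaluates it on the same kind of 21 by 21 grid, and then uses contourLines from grDevices to get the paths at level 0.5.

```r
## Grid of 21 values between -1 and 1, as in the code above.
grid.vec <- seq(-1, 1, length.out=21)
## Hypothetical predicted probability of spam: high when exactly
## one of x1, x2 is positive (xor pattern), and low otherwise.
toy.prob <- function(x1, x2)ifelse(xor(x1>0, x2>0), 0.9, 0.1)
## Matrix with one row per x1 value, and one column per x2 value.
prob.mat <- outer(grid.vec, grid.vec, toy.prob)
## Each list element is one path, with components level, x, y.
contour.list <- contourLines(
  grid.vec, grid.vec, prob.mat, levels=0.5)
length(contour.list)
str(contour.list[[1]])
```

Each element of contour.list corresponds to one value of contour.i in the table above, and the x and y components become the x1 and x2 columns.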

(class.point.dt <- rbindlist(class.point.dt.list))

task_id	iteration	set.name	row_id	y	x1	x2	fold
constant	1	unused	1	not	-0.469	0.664	3
constant	1	unused	2	spam	-0.256	0.534	2
constant	1	unused	3	spam	0.146	-0.454	2
constant	1	test	4	not	0.816	-0.624	1
constant	1	test	5	spam	-0.597	-0.548	1
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
xor	45	test	896	not	-0.761	-0.020	3
xor	45	test	897	not	0.187	-0.963	3
xor	45	train	898	not	-0.925	-0.641	2
xor	45	train	899	spam	-0.981	-0.401	1
xor	45	train	900	not	-0.677	-0.446	1
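The set.name column in the table above comes from a join between the train/test row ids of one split and the full task data, after which the unmatched rows are labeled unused. The sketch below shows the same idea on a toy table of six rows (toy.task, toy.split, and the row ids are made up for illustration).

```r
library(data.table)
## Toy task data with six rows.
toy.task <- data.table(
  row_id=1:6,
  y=c("not", "spam", "spam", "not", "spam", "not"))
## Row ids assigned to each set, in one hypothetical split.
toy.split <- list(test=c(4L, 5L), train=c(2L, 6L))
set.ids <- rbindlist(lapply(names(toy.split), function(set.name){
  data.table(set.name, row_id=toy.split[[set.name]])
}))
## The join keeps every row of toy.task, and rows which
## do not appear in set.ids get a missing set.name.
toy.points <- set.ids[toy.task, on="row_id"]
## Rows which are neither train nor test are labeled unused.
toy.points[is.na(set.name), set.name := "unused"]
toy.points[]
```

Rows 1 and 3 are in neither set, so they are labeled unused, which is how small train sets leave most of the data unused in the table above.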

set.colors <- c(
  train="#1B9E77",
  test="#D95F02",
  unused="white")
algo.colors <- c(
  featureless="blue",
  rpart="red")
if(require(animint2)){
  viz <- animint(
    title="Variable size train sets, classification",
    pred=ggplot()+
      ggtitle("Predictions for selected train/test split")+
      theme(panel.margin=grid::unit(1, "lines"))+
      theme_animint(width=600)+
      coord_equal()+
      scale_fill_manual(values=set.colors)+
      scale_color_manual(values=c(spam="black", not="white"))+
      geom_point(aes(
        x1, x2, color=y, fill=set.name),
        showSelected="iteration",
        help="One dot per data sample in the train/test/unused set.",
        size=3,
        stroke=2,
        shape=21,
        data=class.point.dt)+
      geom_path(aes(
        x1, x2, 
        group=paste(algorithm, iteration, contour.i)),
        showSelected=c("iteration","algorithm"),
        help="Red path represents decision boundary of rpart decision tree learning algorithm.",
        color=algo.colors[["rpart"]],
        data=class.pred.dt)+
      facet_grid(
        . ~ task_id,
        labeller=label_both,
        space="free",
        scales="free"),
    err=ggplot()+
      ggtitle("Test error for each split")+
      theme_animint(height=400)+
      theme(panel.margin=grid::unit(1, "lines"))+
      scale_y_continuous(
        "Classification error on test set",
        limits=c(0.1,0.6),
        breaks=seq(0.1,0.6,by=0.1))+
      scale_color_manual(values=algo.colors)+
      scale_x_log10(
        "Train set size",
        breaks=unique(class.bench.score$train_size))+
      geom_line(aes(
        train_size, classif.ce,
        group=paste(algorithm, seed),
        color=algorithm),
        help="One line per algorithm and random seed used to order train set.",
        clickSelects="seed",
        alpha_off=0.2,
        showSelected="algorithm",
        size=4,
        data=class.bench.score)+
      facet_grid(
        test.fold~task_id,
        labeller=label_both,
        scales="free")+
      geom_point(aes(
        train_size, classif.ce,
        color=algorithm),
        size=5,
        stroke=3,
        fill="black",
        fill_off=NA,
        help="One point per algorithm and train set size, for the selected random ordering.",
        showSelected=c("algorithm","seed"),
        clickSelects="iteration",
        data=class.bench.score),
    video="https://vimeo.com/1053477025",
    source="https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd")
}
if(FALSE){
  animint2pages(viz, "2023-12-27-train-sizes-classification", chromote_sleep_seconds = 5)
}

If you are viewing this in an installed package or on CRAN, then there will be no data viz on this page, but you can view it on: https://tdhock.github.io/2023-12-27-train-sizes-classification/

The interactive data viz consists of two plots:

The first plot shows the data, with each point colored according to its label/y value (black outline for spam, white outline for not), and the set it was assigned to (fill color) in the currently selected split/iteration. The red lines additionally show the learned decision boundary for rpart, given the currently selected split/iteration. For the constant task, there is no ideal decision boundary (always predict the most frequent class), and for the xor task, the ideal decision boundary looks like a plus sign.
The second plot shows the test error rates, as a function of train set size. Clicking a line selects the corresponding random seed, which makes the corresponding points on that line appear. Clicking a point selects the corresponding iteration (seed, test fold, and train set size).

Conclusion

In this section we have shown how to use mlr3resampling for comparing test error of models trained on train sets of different sizes.

Session info

sessionInfo()
#>R version 4.6.0 (2026-04-24)
#>Platform: x86_64-pc-linux-gnu
#>Running under: Ubuntu 24.04.4 LTS
#>
#>Matrix products: default
#>BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#>LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#>
#>locale:
#> [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#> [4] LC_COLLATE=C           LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#> [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#>[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#>
#>time zone: UTC
#>tzcode source: system (glibc)
#>
#>attached base packages:
#>[1] stats     graphics  grDevices utils     datasets  methods   base     
#>
#>other attached packages:
#>[1] lgr_0.5.2                animint2_2025.10.17      directlabels_2026.4.23  
#>[4] mlr3resampling_2026.4.26 mlr3_1.6.0               future_1.70.0           
#>[7] ggplot2_4.0.3            data.table_1.18.2.1     
#>
#>loaded via a namespace (and not attached):
#> [1] gtable_0.3.6         future.apply_1.20.2  compiler_4.6.0      
#> [4] crayon_1.5.3         rpart_4.1.27         Rcpp_1.1.1-1.1      
#> [7] stringr_1.6.0        parallel_4.6.0       globals_0.19.1      
#>[10] scales_1.4.0         uuid_1.2-2           mime_0.13           
#>[13] plyr_1.8.9           R6_2.6.1             commonmark_2.0.0    
#>[16] mlr3tuning_1.6.0     labeling_0.4.3       knitr_1.51          
#>[19] palmerpenguins_0.1.1 backports_1.5.1      checkmate_2.3.4     
#>[22] paradox_1.0.1        RColorBrewer_1.1-3   mlr3measures_1.3.0  
#>[25] rlang_1.2.0          stringi_1.8.7        litedown_0.9        
#>[28] xfun_0.57            RJSONIO_2.0.0        quadprog_1.5-8      
#>[31] mlr3misc_0.21.0      S7_0.2.2             otel_0.2.0          
#>[34] cli_3.6.6            magrittr_2.0.5       withr_3.0.2         
#>[37] digest_0.6.39        grid_4.6.0           nc_2026.4.20        
#>[40] bbotk_1.10.0         lifecycle_1.0.5      vctrs_0.7.3         
#>[43] evaluate_1.0.5       glue_1.8.1           farver_2.1.2        
#>[46] listenv_0.10.1       codetools_0.2-20     parallelly_1.47.0   
#>[49] reshape2_1.4.5       tools_4.6.0