TabularStudy

Code for Tabular Learning Revisited: An Empirical Study of Tabular Classification.

The paper is currently submitted to Transactions on Machine Learning Research (TMLR). The OpenReview page is available at: https://openreview.net/forum?id=I8BIGp4XOb

Overview

This repository contains the code used to run the experiments in our paper.

Note: This repository is not intended to be a general‑purpose library.
It is a collection of scripts that reproduce the experimental pipeline.

Experiments are conducted on the OpenML-CC18 benchmark suite.
Four of the suite's 72 datasets were excluded due to persistent memory issues, leaving 68.
The final list of dataset IDs is provided in openmlcc18_tasks.txt.
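As a small illustration of how the dataset list might be consumed from Python, the sketch below parses a whitespace-separated file of integer IDs. The helper name `load_task_ids` is hypothetical and not part of this repository, and the assumption that the file holds plain integers separated by spaces or newlines has not been checked against the actual file.

```python
from pathlib import Path


def load_task_ids(path):
    """Return the IDs in a whitespace-separated text file as integers.

    Hypothetical helper: assumes the file contains only integer IDs
    separated by spaces, tabs, or newlines.
    """
    # str.split() with no arguments splits on any run of whitespace, so a
    # single-line file and a one-ID-per-line file both parse the same way.
    return [int(token) for token in Path(path).read_text().split()]
```

If the file matches this assumption, calling `load_task_ids("openmlcc18_tasks.txt")` from the repository root should return 68 IDs.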

Each method resides in its own folder, which contains everything required to run it on the OpenML-CC18 benchmark.

This repository includes code adapted from third-party projects. Their original licenses are retained in the corresponding subdirectories.

Citation

If you use this code, please cite:

@article{zabergja2026tabular,
  title={Tabular Learning Revisited: An Empirical Study of Tabular Classification},
  author={Zab{\"e}rgja, Guri and Kadra, Arlind and Frey, Christian M. M. and Grabocka, Josif},
  journal={Transactions on Machine Learning Research},
  year={2026},
  url={https://openreview.net/forum?id=I8BIGp4XOb}
}

Setup

Create the Conda environment defined in environment.yml:

conda env create -f environment.yml
conda activate tabularstudy

Most experiment scripts log to Weights & Biases. To run them without a W&B account or network logging, set offline mode before launching a script:

export WANDB_MODE=offline

This environment was used for the following methods:

CatBoost
XGBoost
LightGBM
FT-Transformer
MLP-PLR
ResNet
SAINT
TabNet
XTab

The preprocessing and training pipelines are largely based on the original FT-Transformer codebase.

Other methods (e.g. AutoML libraries or foundation models) have their own dependencies and preprocessing.
Please follow the official instructions of each method to create the correct environment.

ModernNCA depends on the external TALENT package, which is not vendored in this repository. Install TALENT separately before running ModernNCA/run_modernnca.py:

pip install TALENT

Quick-start (single fold, single dataset)

To sanity-check your installation, you can run ResNet on the Balance-Scale dataset
(OpenML ID = 11) for outer fold 0 with 100 Optuna trials:

cd ResNet
python run_resnet.py \
    --seed 0 \
    --outer_fold 0 \
    --dataset 11 \
    --n_trials 100 \
    --experiment_name "ResNet_HPO" \
    --tune

Omit --tune to run with default hyper-parameters.
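To go beyond a single fold without a cluster, the quick-start command can be repeated once per outer fold. The following is a minimal sketch of that idea, not a script shipped with this repository: `build_command` and `run_all_folds` are hypothetical names, the flags are the ones shown in the command above, and the default of 10 outer folds is taken from the Slurm example in this README.

```python
import subprocess
import sys


def build_command(script, dataset, outer_fold, n_trials=100,
                  experiment_name="ResNet_HPO", tune=True, seed=0):
    """Build the argv list for one run, using the quick-start flags."""
    cmd = [
        sys.executable, script,
        "--seed", str(seed),
        "--outer_fold", str(outer_fold),
        "--dataset", str(dataset),
        "--n_trials", str(n_trials),
        "--experiment_name", experiment_name,
    ]
    if tune:
        cmd.append("--tune")  # omit to run with default hyper-parameters
    return cmd


def run_all_folds(script, dataset, n_outer_folds=10, dry_run=True, **kwargs):
    """Run (or, with dry_run=True, only print) one command per outer fold."""
    for fold in range(n_outer_folds):
        cmd = build_command(script, dataset, fold, **kwargs)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing fold
```

From inside the ResNet folder, `run_all_folds("run_resnet.py", dataset=11)` prints the ten commands; passing `dry_run=False` executes them one after another.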

Entry-Point Scripts

| Method | Script Path |
| --- | --- |
| AutoGluon | `autogluon/run_autogluon.py` |
| CARTE | `carte/examples/carte_single_tables.py` |
| CatBoost | `Catboost/run_catboost.py` |
| FT-Transformer | `FT_Transformer/bin/run_ft.py` |
| MLP-PLR | `MLP/run_mlp.py` |
| LightGBM | `LightGBM/run_lightgbm.py` |
| ModernNCA | `ModernNCA/run_modernnca.py` |
| RealMLP | `RealMLP/run_realmlp.py` |
| TabM | `RealMLP/run_tabm.py` |
| ResNet | `ResNet/run_resnet.py` |
| SAINT | `saint/run_saint.py` |
| TabNet | `TabNet/run_tabnet.py` |
| TabICL | `TabICL/run_tabicl.py` |
| TabPFN | `TabPFN/run_tabpfn.py` |
| TabPFNv2 | `TabPFNv2/run_tabpfnv2.py` |
| TPBERTa | `TPBerta/scripts/finetune/default/run_default_config_tpberta.py` |
| XGBoost | `XGBoost/run_xgboost.py` |
| XTab | `XTab/run_xtab_finetune.py` |

Parallelisation & Nested Cross-Validation

Experiments use nested cross-validation.
The outer loop is parallelised via Slurm job arrays.
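For readers unfamiliar with the scheme, here is a rough, dependency-free sketch of the structure of nested cross-validation. It is not the splitting code used in this repository (which may, for example, stratify by class): `k_fold_indices`, `nested_cv`, and the `tune_and_score` callback are hypothetical names, and the inner fold count of 3 is an arbitrary assumption. Only the 10 outer folds come from the Slurm example in this README.

```python
import random


def k_fold_indices(indices, k, seed=0):
    """Shuffle `indices` and split them into k (train, test) pairs."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold],
         folds[i])
        for i in range(k)
    ]


def nested_cv(n_samples, tune_and_score, n_outer=10, n_inner=3, seed=0):
    """Return one test score per outer fold.

    `tune_and_score(inner_splits, outer_train, outer_test)` is a placeholder
    for: choose hyper-parameters using only `inner_splits`, refit on
    `outer_train`, and report the score on `outer_test`.
    """
    scores = []
    outer_splits = k_fold_indices(range(n_samples), n_outer, seed)
    for outer_train, outer_test in outer_splits:
        # Inner folds are carved out of the outer training part only, so the
        # outer test fold never influences hyper-parameter selection.
        inner_splits = k_fold_indices(outer_train, n_inner, seed)
        scores.append(tune_and_score(inner_splits, outer_train, outer_test))
    return scores
```

Because each iteration of the outer loop is independent of the others, every outer fold can be handed to a separate job, which is what makes a job array a natural fit.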

Example Slurm script for TabM

#!/bin/bash
#SBATCH -t 1-00:00              # Runtime (D-HH:MM)
#SBATCH --gres=gpu:1            # Number of GPUs
#SBATCH -o log/%x.%N.%j.out     # STDOUT (the log/ directory must already exist)
#SBATCH -e log/%x.%N.%j.err     # STDERR
#SBATCH -J JOB_NAME             # Job name
#SBATCH -a 0-679                # 68 datasets × 10 outer folds

# Load environment
source ~/.bashrc
conda activate ENVIRONMENT_NAME

# (Optional) Proxy for outbound traffic
# export https_proxy=http://proxy:port

echo "Working dir  : $PWD"
echo "Started at   : $(date)"
echo "Job          : $SLURM_JOB_NAME"
echo "CPUs/node    : $SLURM_JOB_CPUS_PER_NODE"
echo "Job ID       : $SLURM_JOB_ID"
echo "Partition    : $SLURM_JOB_PARTITION"

# Read dataset list (read -a splits a single line of input on whitespace)
read -r -a DATASETS < openmlcc18_tasks.txt

# Calculate dataset index & fold
DATASET_INDEX=$((SLURM_ARRAY_TASK_ID / 10))
OUTER_FOLD=$((SLURM_ARRAY_TASK_ID % 10))
DATASET=${DATASETS[$DATASET_INDEX]}

# Run TabM
python run_tabm.py \
    --seed 0 \
    --outer_fold "$OUTER_FOLD" \
    --dataset "$DATASET" \
    --n_trials 100 \
    --experiment_name "TabM_HPO" \
    --tune

echo "DONE"
echo "Finished at  : $(date)"
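The index arithmetic in the script above can be checked offline. The sketch below mirrors the two shell expressions with `divmod` and confirms that an array of 680 tasks visits each of the 68 × 10 (dataset index, outer fold) pairs exactly once. The function names are hypothetical and exist only for this illustration.

```python
N_OUTER_FOLDS = 10  # matches the divisor used in the Slurm script


def split_task_id(task_id, n_folds=N_OUTER_FOLDS):
    """Mirror the shell arithmetic: index = id / n_folds, fold = id % n_folds."""
    return divmod(task_id, n_folds)


def array_range(n_datasets, n_folds=N_OUTER_FOLDS):
    """Return the `-a` range string covering every (dataset, fold) pair."""
    return f"0-{n_datasets * n_folds - 1}"


# Every (dataset index, fold) pair is produced exactly once over the array.
pairs = [split_task_id(t) for t in range(68 * N_OUTER_FOLDS)]
assert len(set(pairs)) == 680

print(array_range(68))     # 0-679
print(split_task_id(437))  # (43, 7)
```

If the dataset list changes length, the `-a` range in the Slurm header should be adjusted to match; `array_range` shows the corresponding value.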
