machinelearningnuremberg/Tabular-Study

TabularStudy

Code for Tabular Learning Revisited: An Empirical Study of Tabular Classification.

The paper is currently submitted to Transactions on Machine Learning Research (TMLR). The OpenReview page is available at: https://openreview.net/forum?id=I8BIGp4XOb

Overview

This repository contains the code used to run the experiments in our paper.

Note: This repository is not intended to be a general‑purpose library.
It is a collection of scripts that reproduce the experimental pipeline.

Experiments are conducted on the OpenML-CC18 benchmark suite.
Four of the suite's 72 datasets were excluded due to persistent memory issues, leaving 68.
The final list of dataset IDs is provided in openmlcc18_tasks.txt.
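
For scripting outside of Slurm, the ID list can be loaded with a few lines of Python. This is a minimal sketch, assuming the file holds whitespace-separated integer IDs; the helper name load_dataset_ids is hypothetical and not part of this repository.

```python
from pathlib import Path


def load_dataset_ids(path="openmlcc18_tasks.txt"):
    """Return the IDs listed in `path` as a list of ints (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace, so this
    # works whether the IDs share one line or sit on separate lines.
    return [int(token) for token in Path(path).read_text().split()]
```

Calling load_dataset_ids() from the repository root should return one entry per dataset kept in the study.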

Each method resides in its own folder, which contains everything required to run it on the OpenML-CC18 benchmark.

This repository includes code adapted from third-party projects. Their original licenses are retained in the corresponding subdirectories.


Citation

If you use this code, please cite:

@article{zabergja2026tabular,
  title={Tabular Learning Revisited: An Empirical Study of Tabular Classification},
  author={Zab{\"e}rgja, Guri and Kadra, Arlind and Frey, Christian M. M. and Grabocka, Josif},
  journal={Transactions on Machine Learning Research},
  year={2026},
  url={https://openreview.net/forum?id=I8BIGp4XOb}
}

Setup

Create the Conda environment defined in environment.yml:

conda env create -f environment.yml
conda activate tabularstudy

Most experiment scripts log to Weights & Biases. To run them without a W&B account or network logging, set offline mode before launching a script:

export WANDB_MODE=offline
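
The same setting can be applied from Python, for example in a notebook or a wrapper script. This is a small sketch, not code from this repository; it relies only on the WANDB_MODE environment variable named above.

```python
import os

# Set the variable before wandb.init() is called so the run starts offline.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("WANDB_MODE", "offline")
print(os.environ["WANDB_MODE"])
```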

This environment was used for the following methods:

  • CatBoost
  • XGBoost
  • LightGBM
  • FT-Transformer
  • MLP-PLR
  • ResNet
  • SAINT
  • TabNet
  • XTab

The preprocessing and training pipelines are largely based on the original FT-Transformer codebase.

Other methods (e.g. AutoML libraries or foundation models) have their own dependencies and preprocessing.
Please follow the official instructions of each method to create the correct environment.

ModernNCA depends on the external TALENT package, which is not vendored in this repository. Install TALENT separately before running ModernNCA/run_modernnca.py.

pip install TALENT

Quick-start (single fold, single dataset)

To sanity-check your installation, you can run ResNet on the Balance-Scale dataset
(OpenML ID = 11) for outer fold 0 with 100 Optuna trials:

cd ResNet
python run_resnet.py \
    --seed 0 \
    --outer_fold 0 \
    --dataset 11 \
    --n_trials 100 \
    --experiment_name "ResNet_HPO" \
    --tune

Omit --tune to run with default hyper-parameters.
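
When launching several runs from Python rather than the shell, the command above can be assembled programmatically. The sketch below uses only the flags shown in the quick-start; the helper name build_command is hypothetical, and it assumes the remaining flags are still accepted when --tune is omitted.

```python
import shlex


def build_command(script, dataset, outer_fold, n_trials=100, seed=0,
                  experiment_name="ResNet_HPO", tune=True):
    """Assemble the argument list for one entry-point run (hypothetical helper)."""
    cmd = [
        "python", script,
        "--seed", str(seed),
        "--outer_fold", str(outer_fold),
        "--dataset", str(dataset),
        "--n_trials", str(n_trials),
        "--experiment_name", experiment_name,
    ]
    if tune:
        # Without this flag the script runs with default hyper-parameters.
        cmd.append("--tune")
    return cmd


cmd = build_command("run_resnet.py", dataset=11, outer_fold=0)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, cwd="ResNet", check=True) to mirror the cd ResNet step.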


Entry-Point Scripts

Method            Script Path
AutoGluon         autogluon/run_autogluon.py
CARTE             carte/examples/carte_single_tables.py
CatBoost          Catboost/run_catboost.py
FT-Transformer    FT_Transformer/bin/run_ft.py
MLP-PLR           MLP/run_mlp.py
LightGBM          LightGBM/run_lightgbm.py
ModernNCA         ModernNCA/run_modernnca.py
RealMLP           RealMLP/run_realmlp.py
TabM              RealMLP/run_tabm.py
ResNet            ResNet/run_resnet.py
SAINT             saint/run_saint.py
TabNet            TabNet/run_tabnet.py
TabICL            TabICL/run_tabicl.py
TabPFN            TabPFN/run_tabpfn.py
TabPFNv2          TabPFNv2/run_tabpfnv2.py
TPBERTa           TPBerta/scripts/finetune/default/run_default_config_tpberta.py
XGBoost           XGBoost/run_xgboost.py
XTab              XTab/run_xtab_finetune.py

Parallelisation & Nested Cross-Validation

Experiments use nested cross-validation.
The outer loop is parallelised via Slurm job arrays.
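
To make the structure concrete, the sketch below generates nested cross-validation index splits in plain Python. It is an illustration only, not the splitting code used by the scripts: the 10 outer folds match the job array described in this section, while the inner fold count of 3 and the unshuffled, unstratified contiguous splits are assumptions chosen for brevity.

```python
def k_fold_indices(indices, k):
    """Split `indices` into k contiguous folds of near-equal size."""
    base, extra = divmod(len(indices), k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def nested_cv(n_samples, outer_k=10, inner_k=3):
    """Yield (outer_fold, inner_splits, test_idx) for every outer fold."""
    outer_folds = k_fold_indices(list(range(n_samples)), outer_k)
    for o, test_idx in enumerate(outer_folds):
        # Everything outside the held-out outer fold is available for tuning.
        train_idx = [i for j, f in enumerate(outer_folds) if j != o for i in f]
        inner_folds = k_fold_indices(train_idx, inner_k)
        inner_splits = []
        for v, val_idx in enumerate(inner_folds):
            fit_idx = [i for j, f in enumerate(inner_folds) if j != v for i in f]
            inner_splits.append((fit_idx, val_idx))
        yield o, inner_splits, test_idx
```

Hyper-parameters are selected on the inner splits only, so each outer test fold stays untouched until the final evaluation. Because the outer folds are independent of each other, each one can run as a separate job.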

Example Slurm script for TabM
#!/bin/bash
#SBATCH -t 1-00:00              # Runtime (D-HH:MM)
#SBATCH --gres=gpu:1            # Number of GPUs
#SBATCH -o log/%x.%N.%j.out     # STDOUT
#SBATCH -e log/%x.%N.%j.err     # STDERR
#SBATCH -J JOB_NAME             # Job name
#SBATCH -a 0-679                # 68 datasets × 10 outer folds

# Load environment
source ~/.bashrc
conda activate ENVIRONMENT_NAME

# (Optional) Proxy for outbound traffic
# export https_proxy=http://proxy:port

echo "Working dir  : $PWD"
echo "Started at   : $(date)"
echo "Job          : $SLURM_JOB_NAME"
echo "CPUs/node    : $SLURM_JOB_CPUS_PER_NODE"
echo "Job ID       : $SLURM_JOB_ID"
echo "Partition    : $SLURM_JOB_PARTITION"

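# Read dataset list (read consumes only the first line of the file,
# so the IDs must be whitespace-separated on a single line)
read -r -a DATASETS < openmlcc18_tasks.txt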

# Calculate dataset index & fold
DATASET_INDEX=$((SLURM_ARRAY_TASK_ID / 10))
OUTER_FOLD=$((SLURM_ARRAY_TASK_ID % 10))
DATASET=${DATASETS[$DATASET_INDEX]}

# Run TabM
python run_tabm.py \
    --seed 0 \
    --outer_fold "$OUTER_FOLD" \
    --dataset "$DATASET" \
    --n_trials 100 \
    --experiment_name "TabM_HPO" \
    --tune

echo "DONE"
echo "Finished at  : $(date)"
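
The index arithmetic in the script above can be checked in isolation. The sketch below mirrors the integer division and modulo used to turn a Slurm array index into a dataset and an outer fold; the function name task_to_job is hypothetical, and the three IDs are placeholders rather than the contents of openmlcc18_tasks.txt.

```python
def task_to_job(task_id, dataset_ids, n_folds=10):
    """Map a Slurm array index to (dataset_id, outer_fold)."""
    # Same arithmetic as the script: index = id // 10, fold = id % 10.
    dataset_index, outer_fold = divmod(task_id, n_folds)
    return dataset_ids[dataset_index], outer_fold


ids = [11, 15, 23]  # placeholder IDs for illustration only
print(task_to_job(0, ids))
print(task_to_job(10, ids))
print(task_to_job(29, ids))
```

With 68 datasets and 10 folds, the largest array index, 679, maps to dataset index 67 and outer fold 9, which is why the array range is 0-679.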
