Reproducing Leaderboards
You can use the HELM package to rerun evaluation runs and recreate a specific public leaderboard.
The general procedure is to first find the appropriate run_entries_*.conf and schema_*.yaml files from the HELM GitHub repository for the leaderboard version and then place them in your current working directory. The locations of these files are as follows:
run_entries_*.conf: thesrc/helm/benchmark/presentation/directory hereschema_*.conf: thesrc/helm/benchmark/static/directory here
Then run the following shell script:
# Pick any suite name of your choice
export SUITE_NAME=my_suite
# Replace this with your model or models
export MODELS_TO_RUN=openai/gpt-3.5-turbo-0613
# Get these from the list below
export RUN_ENTRIES_CONF_PATH=run_entries_repro.conf
export SCHEMA_PATH=schema_repro.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN
helm-summarize --schema $SCHEMA_PATH --suite $SUITE_NAME
helm-server --suite $SUITE_NAME
Leaderboard versions
The following specifies the appropriate parameters and configuration files for a leaderboard, given its project and version number.
Lite
export RUN_ENTRIES_CONF_PATH=run_entries_lite_20240424.conf
export SCHEMA_PATH=schema_lite.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=2
Classic before v0.2.4
export RUN_ENTRIES_CONF_PATH=run_entries.conf
export SCHEMA_PATH=schema_classic.yaml
export NUM_TRAIN_TRIALS=3
export MAX_EVAL_INSTANCES=1000
export PRIORITY=2
Classic v0.2.4 and after
export RUN_ENTRIES_CONF_PATH=run_entries_lite.conf
export SCHEMA_PATH=schema_classic.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=1000
export PRIORITY=2
HEIM
export RUN_ENTRIES_CONF_PATH=run_entries_heim.conf
export SCHEMA_PATH=schema_heim.yaml
export NUM_TRAIN_TRIALS=1
export NUM_EVAL_INSTANCES=1000
export PRIORITY=2
MMLU
export RUN_ENTRIES_CONF_PATH=run_entries_mmlu.conf
export SCHEMA_PATH=schema_mmlu.yaml
export NUM_TRAIN_TRIALS=1
export NUM_EVAL_INSTANCES=10000
export PRIORITY=4
VHELM v1.0.0 (VLM evaluation)
export RUN_ENTRIES_CONF_PATH=run_entries_vhelm_lite.conf
export SCHEMA_PATH=schema_vhelm_lite.yaml
export NUM_TRAIN_TRIALS=1
export NUM_EVAL_INSTANCES=1000
export PRIORITY=2
VHELM >=v2.0.0
export RUN_ENTRIES_CONF_PATH=run_entries_vhelm.conf
export SCHEMA_PATH=schema_vhelm.yaml
export NUM_TRAIN_TRIALS=1
export NUM_EVAL_INSTANCES=1000
export PRIORITY=2
AIR-Bench
export RUN_ENTRIES_CONF_PATH=run_entries_air_bench.conf
export SCHEMA_PATH=schema_air_bench.yaml
export NUM_TRAIN_TRIALS=1
export NUM_EVAL_INSTANCES=10000
export PRIORITY=2