Code Structure
Here's a bird's-eye view of how the benchmarking process interacts with the main
classes (see benchmark):
- A Scenario (given by a ScenarioSpec) specifies a task and a data distribution. It specifies a set of Instances, where each Instance has an input (e.g., question) and a set of Reference outputs (e.g., multiple choice answers).
- A DataPreprocessor takes in a Scenario and produces a list of Instances. Each Instance is given a unique ID. The set of Instances is augmented according to DataAugmenterSpec.
- An Adapter (given by an AdapterSpec) takes a list of Instances and adapts it to a set of Requests to the API (e.g., the model, temperature, number of in-context training examples). Formally, the output is a ScenarioState containing a set of RequestStates, where each RequestState consists of a Request and any metadata used to track the role of this Request (e.g., the relevant Instance and Reference).
- An Executor (given by an ExecutionSpec) executes each Request in the RequestState to produce a RequestResult for each one; everything is encapsulated in a ScenarioState.
- A Metric (given by a MetricSpec) takes a ScenarioState containing RequestResults and produces a set of Stats (e.g., accuracy, accuracy@5, toxicity, bias, etc.).
- A Runner is the top-level controller that runs the above steps and is driven by a set of RunSpecs.
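The flow above can be pictured with a toy pipeline. This is only a sketch: the dataclasses below are simplified stand-ins that borrow names from the list, not the real classes, and fake_model is a hypothetical placeholder for the API call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:  # stand-in: an input plus its reference outputs
    id: str
    input: str
    references: List[str]


@dataclass
class Request:  # stand-in: what gets sent to the API
    model: str
    prompt: str
    temperature: float = 0.0


@dataclass
class RequestState:  # a Request plus metadata tying it back to its Instance
    instance: Instance
    request: Request
    result: Optional[str] = None  # filled in by the executor step


def adapt(instances: List[Instance], model: str) -> List[RequestState]:
    """Adapter step: turn each Instance into a Request."""
    return [
        RequestState(instance=i, request=Request(model=model, prompt=f"Q: {i.input}\nA:"))
        for i in instances
    ]


def execute(states: List[RequestState], call_model: Callable[[Request], str]) -> List[RequestState]:
    """Executor step: run every Request and record its result."""
    for state in states:
        state.result = call_model(state.request)
    return states


def exact_match_accuracy(states: List[RequestState]) -> float:
    """Metric step: fraction of results that match one of the references."""
    hits = sum(1 for s in states if s.result in s.instance.references)
    return hits / len(states)


def fake_model(request: Request) -> str:
    """Hypothetical model that always answers "4"."""
    return "4"


instances = [
    Instance(id="id0", input="2 + 2 = ?", references=["4"]),
    Instance(id="id1", input="3 + 3 = ?", references=["6"]),
]
states = execute(adapt(instances, model="toy/model"), fake_model)
accuracy = exact_match_accuracy(states)
print(accuracy)  # 0.5
```

Keeping the Instance inside each RequestState is what lets the metric step compare a result against the right references without any extra lookup.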
There are three types of classes:
- Specifications (e.g., AdapterSpec, ExecutionSpec, RunSpec): specified manually by the user. Note that Scenario and Metric are subclassed, so they are constructed by ObjectSpec, which specifies the subclass name and a free-form dictionary of arguments.
- States (e.g., Instance, ScenarioState, Request, RequestResult): these are automatically generated and can be serialized.
- Controllers (e.g., Scenario, Adapter, Executor, Metric, Runner): these have the bulk of the code and should not be serialized.
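To make the ObjectSpec idea concrete, here is a minimal sketch of a spec that names a class by its dotted path and carries a free-form dictionary of arguments. The ObjectSpec dataclass and the create_object helper below are illustrative stand-ins written for this example, and a standard-library class is constructed so the snippet runs without HELM installed.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ObjectSpec:
    """Stand-in spec: a dotted class path plus free-form constructor arguments."""

    class_name: str
    args: Dict[str, Any] = field(default_factory=dict)


def create_object(spec: ObjectSpec) -> Any:
    """Import the class named by the spec and construct it with the spec's args."""
    module_name, _, class_name = spec.class_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.args)


# The spec itself is plain data, so it can be written by hand and serialized;
# only create_object turns it into a live controller-like object.
spec = ObjectSpec(class_name="fractions.Fraction", args={"numerator": 3, "denominator": 4})
obj = create_object(spec)
print(obj)  # 3/4
```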
Adding new scenarios
In order to implement new scenarios:
- Create a new Python scenario file in the scenarios folder.
- Within the scenario file, create a Scenario class, e.g. YourScenario.
- YourScenario should implement get_instances, a method that downloads the dataset files if they don't already exist and returns a list of Instances. Each Instance must have a list of (potentially one) Reference answers: a correct answer may be indicated with a CORRECT_TAG in a Reference instance's tags argument. In addition, you must specify the split of the Instance as one of the TRAIN_SPLIT, VALID_SPLIT, or TEST_SPLIT constants defined in scenario.py.
  - For Scenarios with datasets that cannot be publicly shared, place a copy of the dataset at path restricted/<Name of the Scenario> and read from that path. See NewsQAScenario and ICEScenario for some examples.
- Note that you need not enumerate every possible correct answer (nor must there even necessarily be a correct answer).
- Make sure to document your scenario well with a clear docstring.
- In addition, specify its name, description, and tags.
- Define a function get_specname_spec in run_specs.py to retrieve a ScenarioSpec for your scenario using a class name corresponding to the Python path of the class (e.g. helm.benchmark.scenarios.your_scenario.YourScenario) and any arguments, which must be passed as a dictionary of args.
- Have the get_specname_spec function retrieve an AdapterSpec for your scenario specifying the type of language model generation which must be performed for the task.
- Identify the appropriate metric for your task in one of the *_metrics.py files. Many will be in basic_metrics.py. If the metric you'd like to use does not exist, follow the directions in Adding new metrics.
- Have a get_metric_spec function retrieve one or more MetricSpec objects for your task, specifying the class name with the Python path of the object, with the same arguments as the ScenarioSpec constructor.
- Have the get_specname_spec function return a RunSpec object, with a name corresponding to the scenario name and any patterns to match in curly braces, a scenario_spec, an adapter_spec, metric_specs, and groups.
- Attempt to run your task with venv/bin/helm-run -r yourscenarioname:arg=value, where yourscenarioname matches the name specified in YourScenario.
- Add the spec to the dictionary CANONICAL_RUN_SPEC_FUNCS in src/helm/benchmark/run_specs.py.
- Update src/helm/proxy/static/contamination.yaml with models that were trained on your scenario (i.e. contaminated).
- Add a schema to src/helm/benchmark/static/schema.yaml and add the scenario to subgroups as needed.
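The get_instances step can be sketched as follows. This is a hedged illustration, not the framework's code: Reference, Instance, and the tag and split constants are local stand-ins (their string values are assumptions), the class does not inherit from the real Scenario base class, and a tiny in-memory dataset replaces the download step.

```python
from dataclasses import dataclass, field
from typing import List

# Stand-ins for names the steps mention; the constant values are assumptions.
CORRECT_TAG = "correct"
TRAIN_SPLIT = "train"
VALID_SPLIT = "valid"
TEST_SPLIT = "test"


@dataclass
class Reference:
    output: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Instance:
    input: str
    references: List[Reference]
    split: str


class YourScenario:
    """Toy two-choice arithmetic questions (hypothetical example scenario)."""

    name = "yourscenarioname"
    description = "Two-choice arithmetic questions"
    tags = ["question_answering"]

    # A real scenario would download dataset files here if they are missing;
    # this sketch uses a small in-memory dataset instead.
    RAW_DATA = [
        ("2 + 2 = ?", ["4", "5"], "4", TRAIN_SPLIT),
        ("3 + 3 = ?", ["6", "7"], "6", TEST_SPLIT),
    ]

    def get_instances(self) -> List[Instance]:
        instances = []
        for question, choices, answer, split in self.RAW_DATA:
            # Only the correct choice carries CORRECT_TAG.
            references = [
                Reference(output=c, tags=[CORRECT_TAG] if c == answer else [])
                for c in choices
            ]
            instances.append(Instance(input=question, references=references, split=split))
        return instances


instances = YourScenario().get_instances()
for instance in instances:
    print(instance.split, instance.input, [r.output for r in instance.references])
```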
Adding new metrics
To add a new metric:
- If the metric is task-specific, create a new yourtask_metrics.py file. Otherwise, if the metric is generic and likely to be widely used, add it to basic_metrics.py.
- If you are creating a task-specific metric, create a YourTaskMetric which inherits from Metric in metric.py.
- Define the methods __init__ and evaluate_generation; the latter returns a list of Stat objects.
- Each Stat should correspond to a distinct aggregate measurement over the generated examples. Some tasks may have one metric (e.g. accuracy), while others may quantify multiple aspects (e.g. multiple distance metrics).
- For each value generated for a Stat, add it to yourstat using yourstat.add(value). Usually, there will only be one value for each Stat, but multiple can be used, e.g. to show variance.
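As a rough illustration of these steps, the sketch below defines a stand-in Stat with an add method and a hypothetical YourTaskMetric whose evaluate_generation returns one Stat per measurement. The simplified signature (a completion string and a list of reference strings) is an assumption made to keep the example self-contained; the real method receives framework objects, and a real metric would inherit from Metric.

```python
from typing import List


class Stat:
    """Stand-in for a named aggregate that accumulates values via add()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.count = 0
        self.sum = 0.0

    def add(self, value: float) -> "Stat":
        self.count += 1
        self.sum += value
        return self

    @property
    def mean(self) -> float:
        return self.sum / self.count if self.count else 0.0


class YourTaskMetric:
    """Hypothetical task-specific metric (does not inherit from the real Metric)."""

    def __init__(self, ignore_case: bool = True) -> None:
        self.ignore_case = ignore_case

    def evaluate_generation(self, completion: str, references: List[str]) -> List[Stat]:
        pred = completion.strip()
        golds = [r.strip() for r in references]
        if self.ignore_case:
            pred, golds = pred.lower(), [g.lower() for g in golds]
        # One Stat per distinct measurement; each receives a single value here.
        exact_match = Stat("exact_match").add(1.0 if pred in golds else 0.0)
        num_words = Stat("num_completion_words").add(float(len(pred.split())))
        return [exact_match, num_words]


stats = YourTaskMetric().evaluate_generation(" Paris ", ["paris", "Paris, France"])
for stat in stats:
    print(stat.name, stat.mean)
```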
Data augmentations
To apply data augmentation, create a DataAugmenterSpec with a list of
PerturbationSpecs and pass it into RunSpec. The following is an
example:
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="helm.benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec
)
In the example above, the DataPreprocessor will augment the set of evaluation instances by perturbing
the original set of instances with the ExtraSpacePerturbation, where each space in the text is
replaced with num_spaces spaces.
We currently support applying only a single perturbation to an instance; chaining multiple perturbations and applying them to a single instance is not supported.
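The effect on the evaluation set can be sketched in a few lines. The helper names extra_space and augment_eval_instances are hypothetical and operate on plain strings rather than Instance objects; they only mimic the behavior described above (one perturbation per input, with the originals optionally kept).

```python
from typing import Callable, List


def extra_space(text: str, num_spaces: int) -> str:
    """Replace each single space with num_spaces spaces."""
    return text.replace(" ", " " * num_spaces)


def augment_eval_instances(
    inputs: List[str],
    perturb: Callable[[str], str],
    should_include_original_eval: bool,
) -> List[str]:
    """Apply exactly one perturbation per input, optionally keeping the originals."""
    perturbed = [perturb(text) for text in inputs]
    return (inputs + perturbed) if should_include_original_eval else perturbed


augmented = augment_eval_instances(
    ["Hello world"],
    perturb=lambda text: extra_space(text, num_spaces=5),
    should_include_original_eval=True,
)
print(augmented)  # ['Hello world', 'Hello     world']
```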
Adding a new perturbation
- To add a new perturbation to the framework, create a new file in src/helm/benchmark/augmentations with the name <Name of perturbation>_perturbation.py, e.g., typo_perturbation.py. Inside the file, create a new class (name it <Name of the perturbation>Perturbation, e.g., TypoPerturbation) that extends the abstract class Perturbation and implement the perturb method, which takes in text and outputs the perturbed text.
- Add a test for the new perturbation in test_perturbation.py.
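A minimal sketch of both steps, under stated assumptions: the Perturbation base class below is a local stand-in for the framework's abstract class, LowerCasePerturbation is a perturbation chosen only for illustration, and the test function shows the kind of check that could be added to test_perturbation.py.

```python
from abc import ABC, abstractmethod


class Perturbation(ABC):
    """Stand-in for the abstract base class named above."""

    @abstractmethod
    def perturb(self, text: str) -> str:
        """Return the perturbed version of text."""


class LowerCasePerturbation(Perturbation):
    """Illustrative perturbation that lowercases the text."""

    name = "lowercase"

    def perturb(self, text: str) -> str:
        return text.lower()


def test_lower_case_perturbation() -> None:
    perturbation = LowerCasePerturbation()
    assert perturbation.perturb("Hello World") == "hello world"
    # Already-lowercase text should pass through unchanged.
    assert perturbation.perturb("abc") == "abc"


test_lower_case_perturbation()
```

Because perturb is abstract in the base class, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake early.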
Supporting new Hugging Face tokenizers
- Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
- In HuggingFaceTokenizers, we load and cache tokenizers in memory. Add logic to handle the tokenizer in the load_tokenizer method.
- Add a test in test_huggingface_tokenizer.py to make sure we can load the tokenizer from Hugging Face.
- Add a new class <Name of tokenizer>WindowService in file <Name of tokenizer>_window_service.py. Follow what we did for GPTJWindowService.
- Import the new WindowService and map the model(s) to it in WindowServiceFactory.
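The load-and-cache behavior mentioned for HuggingFaceTokenizers can be pictured with the sketch below. TokenizerCache and fake_loader are hypothetical names introduced for this example; the loader is injected so the snippet runs offline, whereas real code would load the tokenizer from Hugging Face inside load_tokenizer.

```python
from typing import Any, Callable, Dict, List


class TokenizerCache:
    """Stand-in for an in-memory tokenizer cache keyed by tokenizer name."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._tokenizers: Dict[str, Any] = {}

    def get_tokenizer(self, name: str) -> Any:
        # Load on first use, then serve the cached object.
        if name not in self._tokenizers:
            self._tokenizers[name] = self._loader(name)
        return self._tokenizers[name]


load_calls: List[str] = []


def fake_loader(name: str) -> Any:
    """Hypothetical loader that records each call and returns a toy tokenizer."""
    load_calls.append(name)
    return str.split  # whitespace "tokenizer" used only for this demo


cache = TokenizerCache(fake_loader)
first = cache.get_tokenizer("EleutherAI/gpt-j-6B")
second = cache.get_tokenizer("EleutherAI/gpt-j-6B")
tokens = first("hello world")
print(len(load_calls), tokens)  # 1 ['hello', 'world']
```

A test along these lines (same name requested twice, loader invoked once) is one way to confirm that a newly added tokenizer goes through the cache.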
HEIM (text-to-image evaluation)
The overall code structure is the same as HELM's.
When adding new scenarios and metrics for image generation, place the Python files under the image_generation package
(e.g., src/helm/benchmark/scenarios/image_generation).