Holistic Evaluation of Language Models
Welcome! The `crfm-helm` Python package contains code used in the Holistic Evaluation of Language Models (HELM) project (paper, website) by Stanford CRFM. This package includes the following features:
- Collection of datasets in a standard format (e.g., NaturalQuestions)
- Collection of models accessible via a unified API (e.g., GPT-3, MT-NLG, OPT, BLOOM)
- Collection of metrics beyond accuracy (efficiency, bias, toxicity, etc.)
- Collection of perturbations for evaluating robustness and fairness (e.g., typos, dialect)
- Modular framework for constructing prompts from datasets
- Proxy server for managing accounts and providing unified interface to access models
The code is hosted [on GitHub](https://github.com/stanford-crfm/helm).
To run the code, refer to the User Guide; a minimal command-line sketch is shown below.
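For orientation, here is a rough sketch of a typical evaluation workflow. The run entry, suite name, and instance count are illustrative assumptions, and flag names may differ between `crfm-helm` versions, so consult the User Guide for the exact commands:

```bash
# Install the package from PyPI
pip install crfm-helm

# Run a small evaluation (run entry and suite name below are illustrative)
helm-run \
  --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
  --suite my-suite \
  --max-eval-instances 10

# Aggregate the results for the suite
helm-summarize --suite my-suite

# Start a local web server to browse the results in benchmark_output
helm-server
```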
To add new models and scenarios, refer to the Developer Guide.
We also support evaluating text-to-image models as introduced in Holistic Evaluation of Text-to-Image Models (HEIM) (paper, website).