Model request API

HELM represents model calls with the shared Request and RequestResult dataclasses in helm.common.request. These classes are the common boundary between scenarios, clients, local execution, and cached raw results.

Use this API when you need to make a model request from Python code or inspect the exact request and response fields used by HELM runs.

Request and response formats

Request dataclass

A Request specifies how to query a language model (given a prompt, complete it). It is the unified representation for communicating with various APIs (e.g., GPT-3, Jurassic).

model_deployment: str = '' class-attribute instance-attribute

Which model deployment to query -> Determines the Client. Refers to a deployment in the model deployment registry.

model: str = '' class-attribute instance-attribute

Which model to use -> Determines the Engine. Refers to a model metadata in the model registry.

embedding: bool = False class-attribute instance-attribute

Whether to query embedding instead of text response

prompt: str = '' class-attribute instance-attribute

What prompt do condition the language model on

temperature: float = 1.0 class-attribute instance-attribute

Temperature parameter that governs diversity

num_completions: int = 1 class-attribute instance-attribute

Generate this many completions (by sampling from the model)

top_k_per_token: int = 1 class-attribute instance-attribute

Take this many highest probability candidates per token in the completion

max_tokens: int = 100 class-attribute instance-attribute

Maximum number of tokens to generate (per completion)

stop_sequences: List[str] = field(default_factory=list) class-attribute instance-attribute

Stop generating once we hit one of these strings.

echo_prompt: bool = False class-attribute instance-attribute

Should prompt be included as a prefix of each completion? (e.g., for evaluating perplexity of the prompt)

top_p: float = 1 class-attribute instance-attribute

Same from tokens that occupy this probability mass (nucleus sampling)

presence_penalty: float = 0 class-attribute instance-attribute

Penalize repetition (OpenAI & Writer only)

frequency_penalty: float = 0 class-attribute instance-attribute

Penalize repetition (OpenAI & Writer only)

random: Optional[str] = None class-attribute instance-attribute

Used to control randomness. Expect different responses for the same request but with different values for random.

messages: Optional[List[Dict[str, str]]] = None class-attribute instance-attribute

Used for chat models. (OpenAI only for now). if messages is specified for a chat model, the prompt is ignored. Otherwise, the client should convert the prompt into a message.

multimodal_prompt: Optional[MultimediaObject] = None class-attribute instance-attribute

Multimodal prompt with media objects interleaved (e.g., text, video, image, text, ...)

image_generation_parameters: Optional[ImageGenerationParameters] = None class-attribute instance-attribute

Parameters for image generation.

response_format: Optional[ResponseFormat] = None class-attribute instance-attribute

EXPERIMENTAL: Response format. Currently only supported by OpenAI and Together.

model_host: str property

Returns the model host (referring to the deployment). Not to be confused with the model creator organization (referring to the model).

'openai/davinci' => 'openai'

'together/bloom' => 'together'

model_engine: str property

Returns the model engine (referring to the model). This is often the same as self.model_deploymentl.split("/")[1], but not always. For example, one model could be served on several servers (each with a different model_deployment) In that case we would have for example: 'aws/bloom-1', 'aws/bloom-2', 'aws/bloom-3' => 'bloom' This is why we need to keep track of the model engine with the model metadata. Example: 'openai/davinci' => 'davinci'

validate()

RequestResult dataclass

What comes back due to a Request.

success: bool instance-attribute

Whether the request was successful

embedding: List[float] instance-attribute

Fixed dimensional embedding corresponding to the entire prompt

completions: List[GeneratedOutput] instance-attribute

List of completion

cached: bool instance-attribute

Whether the request was actually cached

request_time: Optional[float] = None class-attribute instance-attribute

How long the request took in seconds

request_datetime: Optional[int] = None class-attribute instance-attribute

When was the request sent? We keep track of when the request was made because the underlying model or inference procedure backing the API might change over time. The integer represents the current time in seconds since the Epoch (January 1, 1970).

error: Optional[str] = None class-attribute instance-attribute

If success is false, what was the error?

error_flags: Optional[ErrorFlags] = None class-attribute instance-attribute

Describes how to treat errors in the request.

batch_size: Optional[int] = None class-attribute instance-attribute

Batch size (TogetherClient only)

batch_request_time: Optional[float] = None class-attribute instance-attribute

How long it took to process the batch? (TogetherClient only)

render_lines() -> List[str]

GeneratedOutput dataclass

A GeneratedOutput is a single generated output that may contain text or multimodal content.

text: str instance-attribute

logprob: float instance-attribute

tokens: List[Token] instance-attribute

finish_reason: Optional[Dict[str, Any]] = None class-attribute instance-attribute

multimodal_content: Optional[MultimediaObject] = None class-attribute instance-attribute

thinking: Optional[Thinking] = None class-attribute instance-attribute

__add__(other: GeneratedOutput) -> GeneratedOutput

render_lines() -> List[str]

Token dataclass

A Token represents one token position in a Sequence, which has the chosen text as well as the top probabilities under the model.

text: str instance-attribute

logprob: float instance-attribute

render_lines() -> List[str]

Making a local request through AutoClient

Use AutoClient when you want HELM to select the concrete client from the model_deployment field and use local credentials directly. This is the recommended path for making model requests from Python code. AutoClient requires a credentials mapping, a file storage path, and a cache backend configuration. The example below uses BlackHoleCacheBackendConfig, which does not persist cache entries.

from helm.clients.auto_client import AutoClient
from helm.common.cache_backend_config import BlackHoleCacheBackendConfig
from helm.common.request import Request

client = AutoClient(
    credentials={"openaiApiKey": "YOUR_OPENAI_API_KEY"},
    file_storage_path="prod_env/cache",
    cache_backend_config=BlackHoleCacheBackendConfig(),
)

request = Request(
    model_deployment="openai/gpt-4o-mini",
    model="openai/gpt-4o-mini",
    prompt="Explain HELM in one sentence.",
    max_tokens=64,
    temperature=0.0,
)

result = client.make_request(request)

if result.success:
    print(result.completions[0].text)
else:
    print(result.error)

See helm.clients.auto_client.AutoClient for the complete local-client interface.

Using a persistent cache

Use SqliteCacheBackendConfig when you want HELM to persist request results locally:

from helm.common.cache_backend_config import SqliteCacheBackendConfig

cache_backend_config = SqliteCacheBackendConfig(path="prod_env/cache")

Pass this value as cache_backend_config when constructing AutoClient.

Call request.validate() before dispatch if you construct requests dynamically and want to fail early on incompatible prompt fields.