Model request API
HELM represents model calls with the shared Request and RequestResult
dataclasses in helm.common.request. These classes are the common boundary
between scenarios, clients, local execution, and cached raw results.
Use this API when you need to make a model request from Python code or inspect the exact request and response fields used by HELM runs.
Request and response formats
Request
dataclass
A Request specifies how to query a language model (given a prompt,
complete it). It is the unified representation for communicating with
various APIs (e.g., GPT-3, Jurassic).
model_deployment: str = ''
class-attribute
instance-attribute
Which model deployment to query -> Determines the Client. Refers to a deployment in the model deployment registry.
model: str = ''
class-attribute
instance-attribute
Which model to use -> Determines the Engine. Refers to a model metadata in the model registry.
embedding: bool = False
class-attribute
instance-attribute
Whether to query embedding instead of text response
prompt: str = ''
class-attribute
instance-attribute
What prompt do condition the language model on
temperature: float = 1.0
class-attribute
instance-attribute
Temperature parameter that governs diversity
num_completions: int = 1
class-attribute
instance-attribute
Generate this many completions (by sampling from the model)
top_k_per_token: int = 1
class-attribute
instance-attribute
Take this many highest probability candidates per token in the completion
max_tokens: int = 100
class-attribute
instance-attribute
Maximum number of tokens to generate (per completion)
stop_sequences: List[str] = field(default_factory=list)
class-attribute
instance-attribute
Stop generating once we hit one of these strings.
echo_prompt: bool = False
class-attribute
instance-attribute
Should prompt be included as a prefix of each completion? (e.g., for
evaluating perplexity of the prompt)
top_p: float = 1
class-attribute
instance-attribute
Same from tokens that occupy this probability mass (nucleus sampling)
presence_penalty: float = 0
class-attribute
instance-attribute
Penalize repetition (OpenAI & Writer only)
frequency_penalty: float = 0
class-attribute
instance-attribute
Penalize repetition (OpenAI & Writer only)
random: Optional[str] = None
class-attribute
instance-attribute
Used to control randomness. Expect different responses for the same
request but with different values for random.
messages: Optional[List[Dict[str, str]]] = None
class-attribute
instance-attribute
Used for chat models. (OpenAI only for now). if messages is specified for a chat model, the prompt is ignored. Otherwise, the client should convert the prompt into a message.
multimodal_prompt: Optional[MultimediaObject] = None
class-attribute
instance-attribute
Multimodal prompt with media objects interleaved (e.g., text, video, image, text, ...)
image_generation_parameters: Optional[ImageGenerationParameters] = None
class-attribute
instance-attribute
Parameters for image generation.
response_format: Optional[ResponseFormat] = None
class-attribute
instance-attribute
EXPERIMENTAL: Response format. Currently only supported by OpenAI and Together.
model_host: str
property
Returns the model host (referring to the deployment). Not to be confused with the model creator organization (referring to the model).
'openai/davinci' => 'openai'
'together/bloom' => 'together'
model_engine: str
property
Returns the model engine (referring to the model). This is often the same as self.model_deploymentl.split("/")[1], but not always. For example, one model could be served on several servers (each with a different model_deployment) In that case we would have for example: 'aws/bloom-1', 'aws/bloom-2', 'aws/bloom-3' => 'bloom' This is why we need to keep track of the model engine with the model metadata. Example: 'openai/davinci' => 'davinci'
validate()
RequestResult
dataclass
What comes back due to a Request.
success: bool
instance-attribute
Whether the request was successful
embedding: List[float]
instance-attribute
Fixed dimensional embedding corresponding to the entire prompt
completions: List[GeneratedOutput]
instance-attribute
List of completion
cached: bool
instance-attribute
Whether the request was actually cached
request_time: Optional[float] = None
class-attribute
instance-attribute
How long the request took in seconds
request_datetime: Optional[int] = None
class-attribute
instance-attribute
When was the request sent? We keep track of when the request was made because the underlying model or inference procedure backing the API might change over time. The integer represents the current time in seconds since the Epoch (January 1, 1970).
error: Optional[str] = None
class-attribute
instance-attribute
If success is false, what was the error?
error_flags: Optional[ErrorFlags] = None
class-attribute
instance-attribute
Describes how to treat errors in the request.
batch_size: Optional[int] = None
class-attribute
instance-attribute
Batch size (TogetherClient only)
batch_request_time: Optional[float] = None
class-attribute
instance-attribute
How long it took to process the batch? (TogetherClient only)
render_lines() -> List[str]
GeneratedOutput
dataclass
A GeneratedOutput is a single generated output that may contain text or multimodal content.
text: str
instance-attribute
logprob: float
instance-attribute
tokens: List[Token]
instance-attribute
finish_reason: Optional[Dict[str, Any]] = None
class-attribute
instance-attribute
multimodal_content: Optional[MultimediaObject] = None
class-attribute
instance-attribute
thinking: Optional[Thinking] = None
class-attribute
instance-attribute
__add__(other: GeneratedOutput) -> GeneratedOutput
render_lines() -> List[str]
Token
dataclass
A Token represents one token position in a Sequence, which has the
chosen text as well as the top probabilities under the model.
text: str
instance-attribute
logprob: float
instance-attribute
render_lines() -> List[str]
Making a local request through AutoClient
Use AutoClient when you want HELM to select the concrete client from the
model_deployment field and use local credentials directly. This is the
recommended path for making model requests from Python code. AutoClient
requires a credentials mapping, a file storage path, and a cache backend
configuration. The example below uses BlackHoleCacheBackendConfig, which
does not persist cache entries.
from helm.clients.auto_client import AutoClient
from helm.common.cache_backend_config import BlackHoleCacheBackendConfig
from helm.common.request import Request
client = AutoClient(
credentials={"openaiApiKey": "YOUR_OPENAI_API_KEY"},
file_storage_path="prod_env/cache",
cache_backend_config=BlackHoleCacheBackendConfig(),
)
request = Request(
model_deployment="openai/gpt-4o-mini",
model="openai/gpt-4o-mini",
prompt="Explain HELM in one sentence.",
max_tokens=64,
temperature=0.0,
)
result = client.make_request(request)
if result.success:
print(result.completions[0].text)
else:
print(result.error)
See helm.clients.auto_client.AutoClient for the complete local-client
interface.
Using a persistent cache
Use SqliteCacheBackendConfig when you want HELM to persist request results
locally:
from helm.common.cache_backend_config import SqliteCacheBackendConfig
cache_backend_config = SqliteCacheBackendConfig(path="prod_env/cache")
Pass this value as cache_backend_config when constructing AutoClient.
Call request.validate() before dispatch if you construct requests
dynamically and want to fail early on incompatible prompt fields.