Schemas

Scenario dataclass

A scenario represents a (task, data distribution) pair. It is usually based on some raw dataset, which is converted into a list of Instances. Subclass this class to define a new scenario.

Note: the constructor should be lightweight; get_instances should do all the heavy lifting.

name: str = field(init=False) class-attribute instance-attribute

Short unique identifier of the scenario

description: str = field(init=False) class-attribute instance-attribute

Description of the scenario (task, data)

tags: List[str] = field(init=False) class-attribute instance-attribute

Extra metadata (e.g., whether this is a question answering or commonsense task)

definition_path: str = field(init=False) class-attribute instance-attribute

Where the scenario subclass for self is defined.

__post_init__() -> None

get_instances(output_path: str) -> List[Instance] abstractmethod

Does the main work in the Scenario (e.g., downloads datasets and converts them into a list of instances).

render_lines(instances: List[Instance]) -> List[str]
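The split between a lightweight constructor and a heavy get_instances can be sketched as follows. This is a minimal illustration, not the library's code: Scenario and Instance here are simplified stand-ins (the documented Instance wraps its text in Input and Reference objects), and ToyArithmeticScenario is a hypothetical subclass invented for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Instance:
    """Simplified stand-in: plain strings instead of Input/Reference objects."""
    input: str
    references: List[str] = field(default_factory=list)
    split: str = "test"


class Scenario(ABC):
    """Simplified stand-in for the documented base class."""
    name: str
    description: str
    tags: List[str]

    @abstractmethod
    def get_instances(self, output_path: str) -> List[Instance]:
        ...


class ToyArithmeticScenario(Scenario):
    # Hypothetical scenario, used only for illustration.
    name = "toy_arithmetic"
    description = "Add two small integers"
    tags = ["question_answering"]

    def __init__(self, max_operand: int = 2):
        # Lightweight constructor: store configuration only, no I/O.
        self.max_operand = max_operand

    def get_instances(self, output_path: str) -> List[Instance]:
        # Heavy lifting goes here. A real scenario would download or read
        # raw data under output_path; this sketch generates data in memory.
        instances: List[Instance] = []
        for a in range(self.max_operand):
            for b in range(self.max_operand):
                instances.append(Instance(input=f"{a} + {b} =", references=[str(a + b)]))
        return instances


scenario = ToyArithmeticScenario(max_operand=2)
instances = scenario.get_instances(output_path="unused")
```

Constructing the scenario is cheap; all four instances are only built when get_instances is called.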

ScenarioState dataclass

A ScenarioState represents the output of adaptation. It contains the set of RequestStates that were created and, possibly, executed (a ScenarioState can be pre-execution or post-execution).

adapter_spec: AdapterSpec instance-attribute

request_states: List[RequestState] instance-attribute

annotator_specs: Optional[List[AnnotatorSpec]] = None class-attribute instance-attribute

__post_init__()

get_request_states(train_trial_index: int, instance: Instance, reference_index: Optional[int]) -> List[RequestState]

RequestState dataclass

A RequestState represents a single Request made on behalf of an Instance. It should have all the information that's needed later for a Metric to be able to understand the Request and its RequestResult.

instance: Instance instance-attribute

Which instance we're evaluating

reference_index: Optional[int] instance-attribute

Which reference of the instance we're evaluating (if any)

request_mode: Optional[str] instance-attribute

Which request mode ("original" or "calibration") of the instance we're evaluating, if any (used for ADAPT_MULTIPLE_CHOICE_SEPARATE_CALIBRATED)

train_trial_index: int instance-attribute

Which training set this request is for

output_mapping: Optional[Dict[str, str]] instance-attribute

How to map the completion text back to a real output (e.g., for multiple choice, "B" => "the second choice")
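As a rough sketch of how such a mapping might be applied, the helper below translates a raw completion into its real output. The function map_completion is hypothetical and not part of the documented API; falling back to the raw text when the completion is not in the mapping is an assumption made for this example.

```python
from typing import Dict, Optional


def map_completion(completion_text: str, output_mapping: Optional[Dict[str, str]]) -> str:
    """Hypothetical helper: translate a raw completion via output_mapping."""
    text = completion_text.strip()
    if output_mapping is None:
        # No mapping (e.g., free-form generation): use the text as-is.
        return text
    # Assumption: fall back to the raw text when the model answered
    # with something outside the mapping.
    return output_mapping.get(text, text)


output_mapping = {"A": "the first choice", "B": "the second choice"}
mapped = map_completion(" B", output_mapping)
```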

request: Request instance-attribute

The request that is actually made

result: Optional[RequestResult] instance-attribute

The result of the request (filled in when the request is executed)

num_train_instances: int instance-attribute

Number of training instances (i.e., in-context examples)

prompt_truncated: bool instance-attribute

Whether the prompt (instructions + test input) is truncated to fit the model's context window.

num_conditioning_tokens: int = 0 class-attribute instance-attribute

The number of initial tokens that will be ignored when computing language modeling metrics
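One way to read this field: when scoring a sequence, the first num_conditioning_tokens tokens act as context only and contribute nothing to the total. The function below is a hypothetical sketch of that idea over a plain list of per-token log probabilities; it is not taken from the library's metric code.

```python
from typing import List


def scored_logprob(token_logprobs: List[float], num_conditioning_tokens: int) -> float:
    """Hypothetical helper: sum log probabilities, skipping the conditioning prefix."""
    return sum(token_logprobs[num_conditioning_tokens:])


logprobs = [-1.0, -2.0, -0.5, -0.25]
total_all = scored_logprob(logprobs, num_conditioning_tokens=0)
total_scored = scored_logprob(logprobs, num_conditioning_tokens=2)
```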

annotations: Optional[Dict[str, Any]] = None class-attribute instance-attribute

Output of some post-processing step that is needed for the metric to understand the request. Should map the annotator's name to an Annotation (usually a list of dictionaries, one for each completion). Examples: parsing, rendering an image based on the text completion, etc.

__post_init__()

render_lines() -> List[str]

Instance dataclass

An Instance represents one data point that we're evaluating on (e.g., one question in a QA task). Note: eq=False means that we hash by the identity.
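The effect of eq=False can be seen with a small stand-in dataclass. Item below is hypothetical and much simpler than Instance; the point is only that, without a generated __eq__, Python falls back to identity-based equality and hashing, so two objects with identical contents remain distinct dictionary keys.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Item:
    """Hypothetical stand-in, not the documented Instance."""
    text: str


a = Item("same text")
b = Item("same text")

# Both objects hold the same data but hash by identity,
# so they occupy two separate slots in the dictionary.
lookup = {a: "first", b: "second"}
```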

input: Input instance-attribute

The input

references: List[Reference] instance-attribute

References that help us evaluate

split: Optional[str] = None class-attribute instance-attribute

Split (e.g., train, valid, test)

sub_split: Optional[str] = None class-attribute instance-attribute

Sub-split (e.g., toxic, non-toxic)

id: Optional[str] = None class-attribute instance-attribute

Used to group Instances that were created from a particular Instance through data augmentation

perturbation: Optional[PerturbationDescription] = None class-attribute instance-attribute

Description of the Perturbation that was applied when creating this Instance

contrast_inputs: Optional[List[Input]] = None class-attribute instance-attribute

Perturbed input as defined by contrast sets (if available)

contrast_references: Optional[List[List[Reference]]] = None class-attribute instance-attribute

References for the perturbed input above (if available)

first_correct_reference: Optional[Reference] property

Return the first correct reference.

all_correct_references: List[Reference] property

Return all correct references.
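A minimal sketch of how these two properties could behave, using simplified stand-ins for Instance and Reference. It assumes that a reference counts as correct when its tags contain a dedicated tag, written here as the string "correct"; that tag value and the plain-string output field are assumptions of this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CORRECT_TAG = "correct"  # assumed tag value marking a correct reference


@dataclass
class Reference:
    output: str  # simplified: the documented type is Output
    tags: List[str] = field(default_factory=list)

    @property
    def is_correct(self) -> bool:
        return CORRECT_TAG in self.tags


@dataclass(eq=False)
class Instance:
    input: str  # simplified: the documented type is Input
    references: List[Reference] = field(default_factory=list)

    @property
    def all_correct_references(self) -> List[Reference]:
        return [ref for ref in self.references if ref.is_correct]

    @property
    def first_correct_reference(self) -> Optional[Reference]:
        for ref in self.references:
            if ref.is_correct:
                return ref
        return None  # no correct reference available


instance = Instance(
    input="Capital of France?",
    references=[
        Reference("Berlin"),
        Reference("Paris", tags=[CORRECT_TAG]),
        Reference("paris", tags=[CORRECT_TAG]),
    ],
)
```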

render_lines() -> List[str]

Reference dataclass

A Reference specifies a possible output and how good/bad it is. This could be used to represent multiple reference outputs which are all acceptable (e.g., in machine translation) or alternatives (e.g., in a multiple-choice exam).

output: Output instance-attribute

The output

tags: List[str] instance-attribute

Extra metadata (e.g., whether it's correct/factual/toxic)

is_correct: bool property

render_lines() -> List[str]

PerturbationDescription dataclass

DataClass used to describe a Perturbation

name: str instance-attribute

Name of the Perturbation

robustness: bool = False class-attribute instance-attribute

Whether a perturbation is relevant to robustness. Used to aggregate perturbation metrics.

fairness: bool = False class-attribute instance-attribute

Whether a perturbation is relevant to fairness. Used to aggregate perturbation metrics.

computed_on: str = PERTURBATION_PERTURBED class-attribute instance-attribute

Which types of Instances we are evaluating, to be populated during metric evaluation. PERTURBATION_PERTURBED (default) means we are evaluating on perturbed instances, PERTURBATION_ORIGINAL means we are evaluating the unperturbed version of instances where this perturbation applies, and PERTURBATION_WORST means the minimum metric between the two.
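The relationship between the three values can be illustrated with a small function. The string values of the constants and the helper metric_by_computed_on are assumptions made for this sketch; only the "minimum of the two" rule comes from the description above.

```python
from typing import Dict

# Assumed string values for the three constants.
PERTURBATION_ORIGINAL = "original"
PERTURBATION_PERTURBED = "perturbed"
PERTURBATION_WORST = "worst"


def metric_by_computed_on(original: float, perturbed: float) -> Dict[str, float]:
    """Hypothetical helper: one metric value per computed_on setting."""
    return {
        PERTURBATION_ORIGINAL: original,
        PERTURBATION_PERTURBED: perturbed,
        # Worst case: the minimum metric between the two versions.
        PERTURBATION_WORST: min(original, perturbed),
    }


scores = metric_by_computed_on(original=0.9, perturbed=0.7)
```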

seed: Optional[int] = None class-attribute instance-attribute

Seed added to instance_id when generating perturbation

Request dataclass

A Request specifies how to query a language model (given a prompt, complete it). It is the unified representation for communicating with various APIs (e.g., GPT-3, Jurassic).

model_deployment: str = '' class-attribute instance-attribute

Which model deployment to query -> Determines the Client. Refers to a deployment in the model deployment registry.

model: str = '' class-attribute instance-attribute

Which model to use -> Determines the Engine. Refers to a model metadata in the model registry.

embedding: bool = False class-attribute instance-attribute

Whether to query embedding instead of text response

prompt: str = '' class-attribute instance-attribute

What prompt to condition the language model on

temperature: float = 1.0 class-attribute instance-attribute

Temperature parameter that governs diversity

num_completions: int = 1 class-attribute instance-attribute

Generate this many completions (by sampling from the model)

top_k_per_token: int = 1 class-attribute instance-attribute

Take this many highest probability candidates per token in the completion

max_tokens: int = 100 class-attribute instance-attribute

Maximum number of tokens to generate (per completion)

stop_sequences: List[str] = field(default_factory=list) class-attribute instance-attribute

Stop generating once we hit one of these strings.

echo_prompt: bool = False class-attribute instance-attribute

Should prompt be included as a prefix of each completion? (e.g., for evaluating perplexity of the prompt)

top_p: float = 1 class-attribute instance-attribute

Sample from tokens that occupy this probability mass (nucleus sampling)
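For readers unfamiliar with nucleus sampling, the sketch below shows the usual selection step: keep the most probable tokens until their cumulative mass reaches top_p, then sample only among those. It illustrates the general technique on a toy distribution; how each model API implements top_p is not specified here, and nucleus is a hypothetical helper.

```python
from typing import Dict, List


def nucleus(probs: Dict[str, float], top_p: float) -> List[str]:
    """Hypothetical helper: tokens forming the smallest set with mass >= top_p."""
    kept: List[str] = []
    mass = 0.0
    # Visit tokens from most to least probable.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        mass += p
        if mass >= top_p:
            break
    return kept


toy_distribution = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
candidates = nucleus(toy_distribution, top_p=0.75)
```

With top_p = 1 (the default above), every token stays in the candidate set.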

presence_penalty: float = 0 class-attribute instance-attribute

Penalize repetition (OpenAI & Writer only)

frequency_penalty: float = 0 class-attribute instance-attribute

Penalize repetition (OpenAI & Writer only)

random: Optional[str] = None class-attribute instance-attribute

Used to control randomness. Expect different responses for the same request but with different values for random.

messages: Optional[List[Dict[str, str]]] = None class-attribute instance-attribute

Used for chat models (OpenAI only for now). If messages is specified for a chat model, the prompt is ignored. Otherwise, the client should convert the prompt into a message.

multimodal_prompt: Optional[MultimediaObject] = None class-attribute instance-attribute

Multimodal prompt with media objects interleaved (e.g., text, video, image, text, ...)

image_generation_parameters: Optional[ImageGenerationParameters] = None class-attribute instance-attribute

Parameters for image generation.

model_host: str property

Returns the model host (referring to the deployment). Not to be confused with the model creator organization (referring to the model).

'openai/davinci' => 'openai'

'together/bloom' => 'together'

model_engine: str property

Returns the model engine (referring to the model). This is often the same as self.model_deployment.split("/")[1], but not always. For example, one model could be served on several servers (each with a different model_deployment). In that case we would have, for example: 'aws/bloom-1', 'aws/bloom-2', 'aws/bloom-3' => 'bloom'. This is why we need to keep track of the model engine with the model metadata. Example: 'openai/davinci' => 'davinci'
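The distinction between the two properties can be sketched with a reduced stand-in for Request that keeps only the two relevant fields. The property bodies are an assumption consistent with the examples above (host from model_deployment, engine from model), and the model name 'bigscience/bloom' is a hypothetical value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Reduced stand-in: only the fields needed for this illustration."""
    model_deployment: str = ""
    model: str = ""

    @property
    def model_host(self) -> str:
        # Host comes from the deployment name: 'openai/davinci' => 'openai'.
        return self.model_deployment.split("/")[0]

    @property
    def model_engine(self) -> str:
        # Engine comes from the model name, not from the deployment.
        return self.model.split("/")[1]


# One model served by several deployments (hypothetical names).
requests = [
    Request(model_deployment=f"aws/bloom-{i}", model="bigscience/bloom")
    for i in (1, 2, 3)
]
engines = {r.model_engine for r in requests}
hosts = {r.model_host for r in requests}
```

All three deployments resolve to the same engine, which splitting model_deployment would not give.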

validate()

RequestResult dataclass

What comes back in response to a Request.

success: bool instance-attribute

Whether the request was successful

embedding: List[float] instance-attribute

Fixed dimensional embedding corresponding to the entire prompt

completions: List[GeneratedOutput] instance-attribute

List of completions

cached: bool instance-attribute

Whether the request was actually cached

request_time: Optional[float] = None class-attribute instance-attribute

How long did the request take?

request_datetime: Optional[int] = None class-attribute instance-attribute

When was the request sent? We keep track of when the request was made because the underlying model or inference procedure backing the API might change over time. The integer represents the current time in seconds since the Epoch (January 1, 1970).
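A value in this format can be produced and read back with the standard library. This sketch only illustrates the "integer seconds since the Epoch" convention; whether the client code obtains the timestamp exactly this way is an assumption.

```python
import time
from datetime import datetime, timezone

# Current time as whole seconds since the Epoch (January 1, 1970, UTC).
request_datetime = int(time.time())

# Convert back to a timezone-aware datetime for display or comparison.
sent_at = datetime.fromtimestamp(request_datetime, tz=timezone.utc)
```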

error: Optional[str] = None class-attribute instance-attribute

If success is false, what was the error?

error_flags: Optional[ErrorFlags] = None class-attribute instance-attribute

Describes how to treat errors in the request.

batch_size: Optional[int] = None class-attribute instance-attribute

Batch size (TogetherClient only)

batch_request_time: Optional[float] = None class-attribute instance-attribute

How long did it take to process the batch? (TogetherClient only)

render_lines() -> List[str]

PerInstanceStats dataclass

Captures a unit of evaluation.

instance_id: str instance-attribute

perturbation: Optional[PerturbationDescription] instance-attribute

train_trial_index: int instance-attribute

Which replication

stats: List[Stat] instance-attribute

Statistics computed from the predicted output

Stat dataclass

A mutable class that allows us to aggregate values and report mean/stddev.

name: MetricName instance-attribute

count: int = 0 class-attribute instance-attribute

sum: float = 0 class-attribute instance-attribute

sum_squared: float = 0 class-attribute instance-attribute

min: Optional[float] = None class-attribute instance-attribute

max: Optional[float] = None class-attribute instance-attribute

mean: Optional[float] = None class-attribute instance-attribute

variance: Optional[float] = None class-attribute instance-attribute

This is the population variance, not the sample variance.

See https://towardsdatascience.com/variance-sample-vs-population-3ddbd29e498a for details.

stddev: Optional[float] = None class-attribute instance-attribute

This is the population standard deviation, not the sample standard deviation.

See https://towardsdatascience.com/variance-sample-vs-population-3ddbd29e498a for details.

add(x) -> Stat

merge(other: Stat) -> Stat

__repr__()

bare_str() -> str

take_mean()

Return a version of the stat that only has the mean.
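The aggregation described for this class can be sketched with running sums. RunningStat below is a hypothetical stand-in that omits the name field; the update rules (mean = sum / count, population variance = sum_squared / count - mean²) are the standard formulas and are assumed, not copied, to match the library's behavior.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningStat:
    """Hypothetical stand-in for Stat; the name field is omitted for brevity."""
    count: int = 0
    sum: float = 0
    sum_squared: float = 0
    min: Optional[float] = None
    max: Optional[float] = None
    mean: Optional[float] = None
    variance: Optional[float] = None
    stddev: Optional[float] = None

    def _update_derived(self) -> None:
        if self.count == 0:
            return
        self.mean = self.sum / self.count
        # Population variance: E[x^2] - (E[x])^2, clamped at 0 against rounding.
        variance = self.sum_squared / self.count - self.mean ** 2
        self.variance = variance if variance > 0 else 0.0
        self.stddev = math.sqrt(self.variance)

    def add(self, x: float) -> "RunningStat":
        self.count += 1
        self.sum += x
        self.sum_squared += x * x
        self.min = x if self.min is None or x < self.min else self.min
        self.max = x if self.max is None or x > self.max else self.max
        self._update_derived()
        return self

    def merge(self, other: "RunningStat") -> "RunningStat":
        # Running sums are additive, so merging needs no raw values.
        self.count += other.count
        self.sum += other.sum
        self.sum_squared += other.sum_squared
        if other.min is not None:
            self.min = other.min if self.min is None or other.min < self.min else self.min
        if other.max is not None:
            self.max = other.max if self.max is None or other.max > self.max else self.max
        self._update_derived()
        return self

    def take_mean(self) -> "RunningStat":
        # A fresh stat holding only the mean as a single observation.
        if self.count == 0:
            return RunningStat()
        return RunningStat().add(self.mean)


first = RunningStat().add(1).add(3)  # mean 2.0, population variance 1.0
first_snapshot = (first.mean, first.variance, first.stddev)
merged = first.merge(RunningStat().add(5))  # now covers the values 1, 3, 5
```

Because only count, sum, and sum_squared are stored, stats computed on separate shards can be merged without revisiting the underlying values.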