braintrust
A Python library for interacting with Braintrust. This library contains functionality for running evaluations, logging completions, loading and invoking functions, and more.
braintrust
is distributed as a library on PyPI. It is open source and
available on GitHub.
Quickstart
Install the library with pip.
Then, create a file like eval_hello.py
with the following content:
Finally, run the script with braintrust eval eval_hello.py
.
API Reference
braintrust.logger
Exportable Objects
export
Return a serialized representation of the object that can be used to start subspans in other places. See Span.start_span
for more details.
Span Objects
A Span encapsulates logged data and metrics for a unit of work. This interface is shared by all span implementations.
We suggest using one of the various start_span
methods, instead of creating Spans directly. See Span.start_span
for full details.
id
Row ID of the span.
log
Incrementally update the current span with new data. The event will be batched and uploaded behind the scenes.
Arguments:
**event
: Data to be logged. SeeExperiment.log
for full details.
log_feedback
Add feedback to the current span. Unlike Experiment.log_feedback
and Logger.log_feedback
, this method does not accept an id parameter, because it logs feedback to the current span.
Arguments:
**event
: Data to be logged. SeeExperiment.log_feedback
for full details.
start_span
Create a new span. This is useful if you want to log more detailed trace information beyond the scope of a single log event. Data logged over several calls to Span.log
will be merged into one logical row.
We recommend running spans within context managers (with start_span(...) as span
) to automatically mark them as current and ensure they are ended. Only spans run within a context manager will be marked current, so they can be accessed using braintrust.current_span()
. If you wish to start a span outside a context manager, be sure to end it with span.end()
.
Arguments:
name
: Optional name of the span. If not provided, a name will be inferred from the call stack.type
: Optional type of the span. Use theSpanTypeAttribute
enum or just provide a string directly. If not provided, the type will be unset.span_attributes
: Optional additional attributes to attach to the span, such as a type name.start_time
: Optional start time of the span, as a timestamp in seconds.set_current
: If true (the default), the span will be marked as the currently-active span for the duration of the context manager.parent
: Optional parent info string for the span. The string can be generated from[Span,Experiment,Logger].export
. If not provided, the current span will be used (depending on context). This is useful for adding spans to an existing trace.**event
: Data to be logged. SeeExperiment.log
for full details.
Returns:
The newly-created Span
export
Serialize the identifiers of this span. The return value can be used to identify this span when starting a subspan elsewhere, such as another process or service, without needing to access this Span
object. See the parameters of Span.start_span
for usage details.
Callers should treat the return value as opaque. The serialization format may change from time to time. If parsing is needed, use SpanComponentsV3.from_str
.
Returns:
Serialized representation of this span's identifiers.
permalink
Format a permalink to the Braintrust application for viewing this span.
Links can be generated at any time, but they will only become viewable after the span and its root have been flushed to the server and ingested.
Returns:
A permalink to the span.
end
Log an end time to the span (defaults to the current time). Returns the logged time.
Will be invoked automatically if the span is bound to a context manager.
Arguments:
end_time
: Optional end time of the span, as a timestamp in seconds.
Returns:
The end time logged to the span metrics.
flush
Flush any pending rows to the server.
close
Alias for end
.
set_attributes
Set the span's name, type, or other attributes. These attributes will be attached to all log events within the span.
The attributes are equivalent to the arguments to start_span.
Arguments:
name
: Optional name of the span. If not provided, a name will be inferred from the call stack.type
: Optional type of the span. Use theSpanTypeAttribute
enum or just provide a string directly. If not provided, the type will be unset.span_attributes
: Optional additional attributes to attach to the span, such as a type name.
set_http_adapter
Specify a custom HTTP adapter to use for all network requests. This is useful for setting custom retry policies, timeouts, etc.
Braintrust uses the requests
library, so the adapter should be an instance of requests.adapters.HTTPAdapter
.
Arguments:
adapter
: The adapter to use.
init
Log in, and then initialize a new experiment in a specified project. If the project does not exist, it will be created.
Arguments:
project
: The name of the project to create the experiment in. Must specify at least one ofproject
orproject_id
.experiment
: The name of the experiment to create. If not specified, a name will be generated automatically.description
: (Optional) An optional description of the experiment.dataset
: (Optional) A dataset to associate with the experiment. The dataset must be initialized withbraintrust.init_dataset
before passing it into the experiment.update
: If the experiment already exists, continue logging to it. If it does not exist, creates the experiment with the specified arguments.base_experiment
: An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment. Otherwise, it will pick an experiment by finding the closest ancestor on the default (e.g. main) branch.is_public
: An optional parameter to control whether the experiment is publicly visible to anybody with the link or privately visible to only members of the organization. Defaults to private.app_url
: The URL of the Braintrust App. Defaults to https://www.braintrust.dev.api_key
: The API key to use. If the parameter is not specified, will try to use theBRAINTRUST_API_KEY
environment variable. If no API key is specified, will prompt the user to login.org_name
: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.metadata
: (Optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.git_metadata_settings
: (Optional) Settings for collecting git metadata. By default, will collect all git metadata fields allowed in org-level settings.set_current
: If true (the default), set the global current-experiment to the newly-created one.open
: If the experiment already exists, open it in read-only mode. Throws an error if the experiment does not already exist.project_id
: The id of the project to create the experiment in. This takes precedence overproject
if specified.base_experiment_id
: An optional experiment id to use as a base. If specified, the new experiment will be summarized and compared to this. This takes precedence overbase_experiment
if specified.repo_info
: (Optional) Explicitly specify the git metadata for this experiment. This takes precedence overgit_metadata_settings
if specified.
Returns:
The experiment object.
init_experiment
Alias for init
init_dataset
Create a new dataset in a specified project. If the project does not exist, it will be created.
Arguments:
project_name
: The name of the project to create the dataset in. Must specify at least one ofproject_name
orproject_id
.name
: The name of the dataset to create. If not specified, a name will be generated automatically.description
: An optional description of the dataset.version
: An optional version of the dataset (to read). If not specified, the latest version will be used.app_url
: The URL of the Braintrust App. Defaults to https://www.braintrust.dev.api_key
: The API key to use. If the parameter is not specified, will try to use theBRAINTRUST_API_KEY
environment variable. If no API key is specified, will prompt the user to login.org_name
: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.project_id
: The id of the project to create the dataset in. This takes precedence overproject
if specified.metadata
: (Optional) a dictionary with additional data about the dataset. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.use_output
: (Deprecated) If True, records will be fetched from this dataset in the legacy format, with the "expected" field renamed to "output". This option will be removed in a future version of Braintrust.
Returns:
The dataset object.
init_logger
Create a new logger in a specified project. If the project does not exist, it will be created.
Arguments:
project
: The name of the project to log into. If unspecified, will default to the Global project.project_id
: The id of the project to log into. This takes precedence over project if specified.async_flush
: If true (the default), log events will be batched and sent asynchronously in a background thread. If false, log events will be sent synchronously. Set to false in serverless environments.app_url
: The URL of the Braintrust API. Defaults to https://www.braintrust.dev.api_key
: The API key to use. If the parameter is not specified, will try to use theBRAINTRUST_API_KEY
environment variable. If no API key is specified, will prompt the user to login.org_name
: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.force_login
: Login again, even if you have already logged in (by default, the logger will not login if you are already logged in)set_current
: If true (the default), set the global current-experiment to the newly-created one.
Returns:
The newly created Logger.
load_prompt
Loads a prompt from the specified project.
Arguments:
project
: The name of the project to load the prompt from. Must specify at least one ofproject
orproject_id
.slug
: The slug of the prompt to load.version
: An optional version of the prompt (to read). If not specified, the latest version will be used.project_id
: The id of the project to load the prompt from. This takes precedence overproject
if specified.defaults
: (Optional) A dictionary of default values to use when rendering the prompt. Prompt values will override these defaults.no_trace
: If true, do not include logging metadata for this prompt when build() is called.app_url
: The URL of the Braintrust App. Defaults to https://www.braintrust.dev.api_key
: The API key to use. If the parameter is not specified, will try to use theBRAINTRUST_API_KEY
environment variable. If no API key is specified, will prompt the user to login.org_name
: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.project_id
: The id of the project to load the prompt from. This takes precedence overproject
if specified.
Returns:
The prompt object.
login
Log into Braintrust. This will prompt you for your API token, which you can find at
https://www.braintrust.dev/app/token. This method is called automatically by init()
.
Arguments:
app_url
: The URL of the Braintrust App. Defaults to https://www.braintrust.dev.api_key
: The API key to use. If the parameter is not specified, will try to use theBRAINTRUST_API_KEY
environment variable. If no API key is specified, will prompt the user to login.org_name
: (Optional) The name of a specific organization to connect to. This is useful if you belong to multiple.force_login
: Login again, even if you have already logged in (by default, this function will exit quickly if you have already logged in)
log
Log a single event to the current experiment. The event will be batched and uploaded behind the scenes.
Arguments:
**event
: Data to be logged. SeeExperiment.log
for full details.
Returns:
The id
of the logged event.
summarize
Summarize the current experiment, including the scores (compared to the closest reference experiment) and metadata.
Arguments:
summarize_scores
: Whether to summarize the scores. If False, only the metadata will be returned.comparison_experiment_id
: The experiment to compare against. If None, the most recent experiment on the comparison_commit will be used.
Returns:
ExperimentSummary
current_experiment
Returns the currently-active experiment (set by braintrust.init(...)
). Returns None if no current experiment has been set.
current_logger
Returns the currently-active logger (set by braintrust.init_logger(...)
). Returns None if no current logger has been set.
current_span
Return the currently-active span for logging (set by running a span under a context manager). If there is no active span, returns a no-op span object, which supports the same interface as spans but does no logging.
See Span
for full details.
get_span_parent_object
Mainly for internal use. Return the parent object for starting a span in a global context.
traced
Decorator to trace the wrapped function. Can either be applied bare (@traced
) or by providing arguments (@traced(*span_args, **span_kwargs)
), which will be forwarded to the created span. See Span.start_span
for full details on the span arguments.
It checks the following (in precedence order): _ Currently-active span _ Currently-active experiment * Currently-active logger
and creates a span in the first one that is active. If none of these are active, it returns a no-op span object.
The decorator will automatically log the input and output of the wrapped function to the corresponding fields of the created span. Pass the kwarg notrace_io=True
to the decorator to prevent this.
Unless a name is explicitly provided in span_args
or span_kwargs
, the name of the span will be the name of the decorated function.
start_span
Lower-level alternative to @traced
for starting a span at the toplevel. It creates a span under the first active object (using the same precedence order as @traced
), or if parent
is specified, under the specified parent row, or returns a no-op span object.
We recommend running spans bound to a context manager (with start_span
) to automatically mark them as current and ensure they are terminated. If you wish to start a span outside a context manager, be sure to terminate it with span.end()
.
See Span.start_span
for full details.
flush
Flush any pending rows to the server.
ObjectFetcher Objects
fetch
Fetch all records.
Update a span using the output of span.export()
. It is important that you only resume updating
to a span once the original span has been fully written and flushed, since otherwise updates to the span may conflict with the original span.
Arguments:
exported
: The output ofspan.export()
.**event
: Data to update. SeeExperiment.log
for a full list of valid fields.
span_components_to_object_id
Utility function to resolve the object ID of a SpanComponentsV3 object. This function may trigger a login to braintrust if the object ID is encoded lazily.
permalink
Format a permalink to the Braintrust application for viewing the span represented by the provided slug
.
Links can be generated at any time, but they will only become viewable after the span and its root have been flushed to the server and ingested.
If you have a Span
object, use Span.permalink
instead.
Arguments:
slug
: The identifier generated fromSpan.export
.org_name
: The org name to use. If not provided, the org name will be inferred from the global login state.app_url
: The app URL to use. If not provided, the app URL will be inferred from the global login state.
Returns:
A permalink to the exported span.
Experiment Objects
An experiment is a collection of logged events, such as model inputs and outputs, which represent a snapshot of your application at a particular point in time. An experiment is meant to capture more than just the model you use, and includes the data you use to test, pre- and post- processing code, comparison metrics (scores), and any other metadata you want to include.
Experiments are associated with a project, and two experiments are meant to be easily comparable via
their input
. You can change the attributes of the experiments in a project (e.g. scoring functions)
over time, simply by changing what you log.
You should not create Experiment
objects directly. Instead, use the braintrust.init()
method.
log
Log a single event to the experiment. The event will be batched and uploaded behind the scenes.
Arguments:
input
: The arguments that uniquely define a test case (an arbitrary, JSON serializable object). Later on, Braintrust will use theinput
to know whether two test cases are the same between experiments, so they should not contain experiment-specific state. A simple rule of thumb is that if you run the same experiment twice, theinput
should be identical.output
: The output of your application, including post-processing (an arbitrary, JSON serializable object), that allows you to determine whether the result is correct or not. For example, in an app that generates SQL queries, theoutput
should be the result of the SQL query generated by the model, not the query itself, because there may be multiple valid queries that answer a single question.expected
: (Optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare tooutput
to determine if youroutput
value is correct or not. Braintrust currently does not compareoutput
toexpected
for you, since there are so many different ways to do that correctly. Instead, these values are just used to help you navigate your experiments while digging into analyses. However, we may later use these values to re-score outputs or fine-tune your models.error
: (Optional) The error that occurred, if any. If you use tracing to run an experiment, errors are automatically logged when your code throws an exception.scores
: A dictionary of numeric values (between 0 and 1) to log. The scores should give you a variety of signals that help you determine how accurate the outputs are compared to what you expect and diagnose failures. For example, a summarization app might have one score that tells you how accurate the summary is, and another that measures the word similarity between the generated and grouth truth summary. The word similarity score could help you determine whether the summarization was covering similar concepts or not. You can use these scores to help you sort, filter, and compare experiments.metadata
: (Optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.tags
: (Optional) a list of strings that you can use to filter and group records later.metrics
: (Optional) a dictionary of metrics to log. The following keys are populated automatically: "start", "end".id
: (Optional) a unique identifier for the event. If you don't provide one, BrainTrust will generate one for you.dataset_record_id
: (Optional) the id of the dataset record that this event is associated with. This field is required if and only if the experiment is associated with a dataset.allow_concurrent_with_spans
: (Optional) in rare cases where you need to log at the top level separately from using spans on the experiment elsewhere, set this to True.
Returns:
The id
of the logged event.
log_feedback
Log feedback to an event in the experiment. Feedback is used to save feedback scores, set an expected value, or add a comment.
Arguments:
id
: The id of the event to log feedback for. This is theid
returned bylog
or accessible as theid
field of a span.scores
: (Optional) a dictionary of numeric values (between 0 and 1) to log. These scores will be merged into the existing scores for the event.expected
: (Optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare tooutput
to determine if youroutput
value is correct or not.tags
: (Optional) a list of strings that you can use to filter and group records later.comment
: (Optional) an optional comment string to log about the event.metadata
: (Optional) a dictionary with additional data about the feedback. If you have auser_id
, you can log it here and access it in the Braintrust UI.source
: (Optional) the source of the feedback. Must be one of "external" (default), "app", or "api".
start_span
Create a new toplevel span underneath the experiment. The name defaults to "root" and the span type to "eval".
See Span.start_span
for full details
update_span
Update a span in the experiment using its id. It is important that you only update a span once the original span has been fully written and flushed,
since otherwise updates to the span may conflict with the original span.
Arguments:
id
: The id of the span to update.**event
: Data to update. SeeExperiment.log
for a full list of valid fields.
summarize
Summarize the experiment, including the scores (compared to the closest reference experiment) and metadata.
Arguments:
summarize_scores
: Whether to summarize the scores. If False, only the metadata will be returned.comparison_experiment_id
: The experiment to compare against. If None, the most recent experiment on the origin's main branch will be used.
Returns:
ExperimentSummary
close
This function is deprecated. You can simply remove it from your code.
flush
Flush any pending rows to the server.
ReadonlyExperiment Objects
A read-only view of an experiment, initialized by passing open=True
to init()
.
SpanImpl Objects
Primary implementation of the Span
interface. See the Span
interface for full details on each method.
We suggest using one of the various start_span
methods, instead of creating Spans directly. See Span.start_span
for full details.
flush
Flush any pending rows to the server.
Dataset Objects
A dataset is a collection of records, such as model inputs and outputs, which represent data you can use to evaluate and fine-tune models. You can log production data to datasets, curate them with interesting examples, edit/delete records, and run evaluations against them.
You should not create Dataset
objects directly. Instead, use the braintrust.init_dataset()
method.
insert
Insert a single record to the dataset. The record will be batched and uploaded behind the scenes. If you pass in an id
,
and a record with that id
already exists, it will be overwritten (upsert).
Arguments:
input
: The argument that uniquely define an input case (an arbitrary, JSON serializable object).expected
: The output of your application, including post-processing (an arbitrary, JSON serializable object).tags
: (Optional) a list of strings that you can use to filter and group records later.metadata
: (Optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.id
: (Optional) a unique identifier for the event. If you don't provide one, Braintrust will generate one for you.output
: (Deprecated) The output of your application. Useexpected
instead.
Returns:
The id
of the logged record.
update
Update fields of a single record in the dataset. The updated fields will be batched and uploaded behind the scenes.
You must pass in an id
of the record to update. Only the fields provided will be updated; other fields will remain unchanged.
Arguments:
id
: The unique identifier of the record to update.input
: (Optional) The new input value for the record (an arbitrary, JSON serializable object).expected
: (Optional) The new expected output value for the record (an arbitrary, JSON serializable object).tags
: (Optional) A list of strings to update the tags of the record.metadata
: (Optional) A dictionary to update the metadata of the record. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.
Returns:
The id
of the updated record.
delete
Delete a record from the dataset.
Arguments:
id
: Theid
of the record to delete.
summarize
Summarize the dataset, including high level metrics about its size and other metadata.
Arguments:
summarize_data
: Whether to summarize the data. If False, only the metadata will be returned.
Returns:
DatasetSummary
close
This function is deprecated. You can simply remove it from your code.
flush
Flush any pending rows to the server.
Prompt Objects
A prompt object consists of prompt text, a model, and model parameters (such as temperature), which
can be used to generate completions or chat messages. The prompt object supports calling .build()
which uses mustache templating to build the prompt with the given formatting options and returns a
plain dictionary that includes the built prompt and arguments. The dictionary can be passed as
kwargs to the OpenAI client or modified as you see fit.
You should not create Prompt
objects directly. Instead, use the braintrust.load_prompt()
method.
build
Build the prompt with the given formatting options. The args you pass in will
be forwarded to the mustache template that defines the prompt and rendered with
the chevron
library.
Returns:
A dictionary that includes the rendered prompt and arguments, that can be passed as kwargs to the OpenAI client.
Logger Objects
log
Log a single event. The event will be batched and uploaded behind the scenes.
Arguments:
input
: (Optional) the arguments that uniquely define a user input (an arbitrary, JSON serializable object).output
: (Optional) the output of your application, including post-processing (an arbitrary, JSON serializable object), that allows you to determine whether the result is correct or not. For example, in an app that generates SQL queries, theoutput
should be the result of the SQL query generated by the model, not the query itself, because there may be multiple valid queries that answer a single question.expected
: (Optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare tooutput
to determine if youroutput
value is correct or not. Braintrust currently does not compareoutput
toexpected
for you, since there are so many different ways to do that correctly. Instead, these values are just used to help you navigate while digging into analyses. However, we may later use these values to re-score outputs or fine-tune your models.error
: (Optional) The error that occurred, if any. If you use tracing to run an experiment, errors are automatically logged when your code throws an exception.tags
: (Optional) a list of strings that you can use to filter and group records later.scores
: (Optional) a dictionary of numeric values (between 0 and 1) to log. The scores should give you a variety of signals that help you determine how accurate the outputs are compared to what you expect and diagnose failures. For example, a summarization app might have one score that tells you how accurate the summary is, and another that measures the word similarity between the generated and grouth truth summary. The word similarity score could help you determine whether the summarization was covering similar concepts or not. You can use these scores to help you sort, filter, and compare logs.metadata
: (Optional) a dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.metrics
: (Optional) a dictionary of metrics to log. The following keys are populated automatically: "start", "end".id
: (Optional) a unique identifier for the event. If you don't provide one, BrainTrust will generate one for you.allow_concurrent_with_spans
: (Optional) in rare cases where you need to log at the top level separately from using spans on the logger elsewhere, set this to True.
log_feedback
Log feedback to an event. Feedback is used to save feedback scores, set an expected value, or add a comment.
Arguments:
id
: The id of the event to log feedback for. This is theid
returned bylog
or accessible as theid
field of a span.scores
: (Optional) a dictionary of numeric values (between 0 and 1) to log. These scores will be merged into the existing scores for the event.expected
: (Optional) the ground truth value (an arbitrary, JSON serializable object) that you'd compare tooutput
to determine if youroutput
value is correct or not.tags
: (Optional) a list of strings that you can use to filter and group records later.comment
: (Optional) an optional comment string to log about the event.metadata
: (Optional) a dictionary with additional data about the feedback. If you have auser_id
, you can log it here and access it in the Braintrust UI.source
: (Optional) the source of the feedback. Must be one of "external" (default), "app", or "api".
start_span
Create a new toplevel span underneath the logger. The name defaults to "root" and the span type to "task".
See Span.start_span
for full details
update_span
Update a span in the experiment using its id. It is important that you only update a span once the original span
has been fully written and flushed, since otherwise updates to the span may conflict with the original span.
Arguments:
id
: The id of the span to update.**event
: Data to update. SeeExperiment.log
for a full list of valid fields.
export
Return a serialized representation of the logger that can be used to start subspans in other places. See Span.start_span
for more details.
flush
Flush any pending logs to the server.
ScoreSummary Objects
Summary of a score's performance.
name
Name of the score.
score
Average score across all examples.
improvements
Number of improvements in the score.
regressions
Number of regressions in the score.
diff
Difference in score between the current and reference experiment.
MetricSummary Objects
Summary of a metric's performance.
name
Name of the metric.
metric
Average metric across all examples.
unit
Unit label for the metric.
improvements
Number of improvements in the metric.
regressions
Number of regressions in the metric.
diff
Difference in metric between the current and reference experiment.
ExperimentSummary Objects
Summary of an experiment's scores and metadata.
project_name
Name of the project that the experiment belongs to.
project_id
ID of the project. May be None
if the eval was run locally.
experiment_id
ID of the experiment. May be None
if the eval was run locally.
experiment_name
Name of the experiment.
project_url
URL to the project's page in the Braintrust app.
experiment_url
URL to the experiment's page in the Braintrust app.
comparison_experiment_name
The experiment scores are baselined against.
scores
Summary of the experiment's scores.
metrics
Summary of the experiment's metrics.
DataSummary Objects
Summary of a dataset's data.
new_records
New or updated records added in this session.
total_records
Total records in the dataset.
DatasetSummary Objects
Summary of a dataset's scores and metadata.
project_name
Name of the project that the dataset belongs to.
dataset_name
Name of the dataset.
project_url
URL to the project's page in the Braintrust app.
dataset_url
URL to the experiment's page in the Braintrust app.
data_summary
Summary of the dataset's data.
braintrust.framework
EvalCase Objects
An evaluation case. This is a single input to the evaluation task, along with an optional expected output, metadata, and tags.
EvalResult Objects
The result of an evaluation. This includes the input, expected output, actual output, and metadata.
EvalHooks Objects
An object that can be used to add metadata to an evaluation. This is passed to the task
function.
span
Access the span under which the task is run. Also accessible via braintrust.current_span()
meta
Adds metadata to the evaluation. This metadata will be logged to the Braintrust. You can pass in metadaa
as keyword arguments, e.g. hooks.meta(foo="bar")
.
EvalScorerArgs Objects
Arguments passed to an evaluator scorer. This includes the input, expected output, actual output, and metadata.
BaseExperiment Objects
Use this to specify that the dataset should actually be the data from a previous (base) experiment. If you do not specify a name, Braintrust will automatically figure out the best base experiment to use based on your git history (or fall back to timestamps).
name
The name of the base experiment to use. If unspecified, Braintrust will automatically figure out the best base using your git history (or fall back to timestamps).
Evaluator Objects
An evaluator is an abstraction that defines an evaluation dataset, a task to run on the dataset, and a set of scorers to evaluate the results of the task. Each method attribute can be synchronous or asynchronous (for optimal performance, it is recommended to provide asynchronous implementations).
You should not create Evaluators directly if you plan to use the Braintrust eval framework. Instead, you should
create them using the Eval()
method, which will register them so that braintrust eval ...
can find them.
project_name
The name of the project the eval falls under.
eval_name
A name that describes the experiment. You do not need to change it each time the experiment runs.
data
Returns an iterator over the evaluation dataset. Each element of the iterator should be an EvalCase
or a dict
with the same fields as an EvalCase
(input
, expected
, metadata
).
task
Runs the evaluation task on a single input. The hooks
object can be used to add metadata to the evaluation.
scores
A list of scorers to evaluate the results of the task. Each scorer can be a Scorer object or a function
that takes input
, output
, and expected
arguments and returns a Score
object. The function can be async.
experiment_name
Optional experiment name. If not specified, a name will be generated automatically.
metadata
A dictionary with additional data about the test example, model outputs, or just about anything else that's
relevant, that you can use to help find and analyze examples later. For example, you could log the prompt
,
example's id
, or anything else that would be useful to slice/dice later. The values in metadata
can be any
JSON-serializable type, but its keys must be strings.
trial_count
The number of times to run the evaluator per input. This is useful for evaluating applications that have non-deterministic behavior and gives you both a stronger aggregate measure and a sense of the variance in the results.
is_public
Whether the experiment should be public. Defaults to false.
update
Whether to update an existing experiment with experiment_name
if one exists. Defaults to false.
timeout
The duration, in seconds, after which to time out the evaluation. Defaults to None, in which case there is no timeout.
max_concurrency
The maximum number of tasks/scorers that will be run concurrently. Defaults to None, in which case there is no max concurrency.
project_id
If specified, uses the given project ID instead of the evaluator's name to identify the project.
base_experiment_name
An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment.
base_experiment_id
An optional experiment id to use as a base. If specified, the new experiment will be summarized and
compared to this experiment. This takes precedence over base_experiment_name
if specified.
git_metadata_settings
Optional settings for collecting git metadata. By default, will collect all git metadata fields allowed in org-level settings.
repo_info
Optionally explicitly specify the git metadata for this experiment. This
takes precedence over git_metadata_settings
if specified.
ReporterDef Objects
A reporter takes an evaluator and its result and returns a report.
name
The name of the reporter.
report_eval
A function that takes an evaluator and its result and returns a report.
report_run
A function that takes all evaluator results and returns a boolean indicating whether the run was successful.
If you return false, the braintrust eval
command will exit with a non-zero status code.
EvalAsync
A function you can use to define an evaluator. This is a convenience wrapper around the Evaluator
class.
Use this function over Eval()
when you are running in an async context, including in a Jupyter notebook.
Example:
Arguments:
name
: The name of the evaluator. This corresponds to a project name in Braintrust.data
: Returns an iterator over the evaluation dataset. Each element of the iterator should be aEvalCase
.task
: Runs the evaluation task on a single input. Thehooks
object can be used to add metadata to the evaluation.scores
: A list of scorers to evaluate the results of the task. Each scorer can be a Scorer object or a function that takes anEvalScorerArgs
object and returns aScore
object.experiment_name
: (Optional) Experiment name. If not specified, a name will be generated automatically.trial_count
: The number of times to run the evaluator per input. This is useful for evaluating applications that have non-deterministic behavior and gives you both a stronger aggregate measure and a sense of the variance in the results.metadata
: (Optional) A dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.is_public
: (Optional) Whether the experiment should be public. Defaults to false.reporter
: (Optional) A reporter that takes an evaluator and its result and returns a report.timeout
: (Optional) The duration, in seconds, after which to time out the evaluation. Defaults to None, in which case there is no timeout.project_id
: (Optional) If specified, uses the given project ID instead of the evaluator's name to identify the project.base_experiment_name
: An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment.base_experiment_id
: An optional experiment id to use as a base. If specified, the new experiment will be summarized and compared to this experiment. This takes precedence overbase_experiment_name
if specified.git_metadata_settings
: Optional settings for collecting git metadata. By default, will collect all git metadata fields allowed in org-level settings.repo_info
: Optionally explicitly specify the git metadata for this experiment. This takes precedence overgit_metadata_settings
if specified.
Returns:
An EvalResultWithSummary
object, which contains all results and a summary.
Eval
A function you can use to define an evaluator. This is a convenience wrapper around the Evaluator
class.
For callers running in an async context, use EvalAsync()
instead.
Example:
Arguments:
name
: The name of the evaluator. This corresponds to a project name in Braintrust.data
: Returns an iterator over the evaluation dataset. Each element of the iterator should be aEvalCase
.task
: Runs the evaluation task on a single input. Thehooks
object can be used to add metadata to the evaluation.scores
: A list of scorers to evaluate the results of the task. Each scorer can be a Scorer object or a function that takes anEvalScorerArgs
object and returns aScore
object.experiment_name
: (Optional) Experiment name. If not specified, a name will be generated automatically.trial_count
: The number of times to run the evaluator per input. This is useful for evaluating applications that have non-deterministic behavior and gives you both a stronger aggregate measure and a sense of the variance in the results.metadata
: (Optional) A dictionary with additional data about the test example, model outputs, or just about anything else that's relevant, that you can use to help find and analyze examples later. For example, you could log theprompt
, example'sid
, or anything else that would be useful to slice/dice later. The values inmetadata
can be any JSON-serializable type, but its keys must be strings.is_public
: (Optional) Whether the experiment should be public. Defaults to false.reporter
: (Optional) A reporter that takes an evaluator and its result and returns a report.timeout
: (Optional) The duration, in seconds, after which to time out the evaluation. Defaults to None, in which case there is no timeout.project_id
: (Optional) If specified, uses the given project ID instead of the evaluator's name to identify the project.base_experiment_name
: An optional experiment name to use as a base. If specified, the new experiment will be summarized and compared to this experiment.base_experiment_id
: An optional experiment id to use as a base. If specified, the new experiment will be summarized and compared to this experiment. This takes precedence overbase_experiment_name
if specified.git_metadata_settings
: Optional settings for collecting git metadata. By default, will collect all git metadata fields allowed in org-level settings.repo_info
: Optionally explicitly specify the git metadata for this experiment. This takes precedence overgit_metadata_settings
if specified.
Returns:
An EvalResultWithSummary
object, which contains all results and a summary.
Reporter
A function you can use to define a reporter. This is a convenience wrapper around the ReporterDef
class.
Example:
Arguments:
name
: The name of the reporter.report_eval
: A function that takes an evaluator and its result and returns a report.report_run
: A function that takes all evaluator results and returns a boolean indicating whether the run was successful.
set_thread_pool_max_workers
Set the maximum number of threads to use for running evaluators. By default, this is the number of CPUs on the machine.
run_evaluator
Wrapper on _run_evaluator_internal that times out execution after evaluator.timeout.
braintrust.functions.stream
This module provides classes and functions for handling Braintrust streams.
A Braintrust stream is a wrapper around a generator of BraintrustStreamChunk
,
with utility methods to make them easy to log and convert into various formats.
BraintrustTextChunk Objects
A chunk of text data from a Braintrust stream.
BraintrustJsonChunk Objects
A chunk of JSON data from a Braintrust stream.
BraintrustErrorChunk Objects
An error chunk from a Braintrust stream.
BraintrustConsoleChunk Objects
A console chunk from a Braintrust stream.
BraintrustProgressChunk Objects
A progress chunk from a Braintrust stream.
BraintrustInvokeError Objects
An error that occurs during a Braintrust stream.
BraintrustStream Objects
A Braintrust stream. This is a wrapper around a generator of BraintrustStreamChunk
,
with utility methods to make them easy to log and convert into various formats.
__init__
Initialize a BraintrustStream.
Arguments:
base_stream
- Either an SSEClient or a list of BraintrustStreamChunks.
copy
Copy the stream. This returns a new stream that shares the same underlying
generator (via tee
). Since generators are consumed in Python, use copy()
if you
need to use the stream multiple times.
Returns:
BraintrustStream
- A new stream that you can independently consume.
final_value
Get the final value of the stream. The final value is the concatenation of all the chunks in the stream, deserialized into a string or object, depending on the value's type.
This function consumes the stream, so if you need to use the stream multiple
times, you should call copy()
first.
Returns:
The final value of the stream.
__iter__
Iterate over the stream chunks.
Yields:
BraintrustStreamChunk
- The next chunk in the stream.
parse_stream
Parse a BraintrustStream into its final value.
Arguments:
stream
- The BraintrustStream to parse.
Returns:
The final value of the stream.
braintrust.functions.invoke
invoke
Invoke a Braintrust function, returning a BraintrustStream
or the value as a plain
Python object.
Arguments:
input
- The input to the function. This will be logged as theinput
field in the span.messages
- Additional OpenAI-style messages to add to the prompt (only works for llm functions).parent
- The parent of the function. This can be an existing span, logger, or experiment, or the output of.export()
if you are distributed tracing. If unspecified, will use the same semantics astraced()
to determine the parent and no-op if not in a tracing context.stream
- Whether to stream the function's output. If True, the function will return aBraintrustStream
, otherwise it will return the output of the function as a JSON object.mode
- The response shape of the function if returning tool calls. If "auto", will return a string if the function returns a string, and a JSON object otherwise. If "parallel", will return an array of JSON objects with one object per tool call.org_name
- The name of the Braintrust organization to use.api_key
- The API key to use for authentication.app_url
- The URL of the Braintrust application.force_login
- Whether to force a new login even if already logged in.function_id
- The ID of the function to invoke.version
- The version of the function to invoke.prompt_session_id
- The ID of the prompt session to invoke the function from.prompt_session_function_id
- The ID of the function in the prompt session to invoke.project_name
- The name of the project containing the function to invoke.slug
- The slug of the function to invoke.global_function
- The name of the global function to invoke.
Returns:
The output of the function. If stream
is True, returns a BraintrustStream
,
otherwise returns the output as a Python object.