Overview

The ModelBench Workbench is a tool for creating robust, testable prompts and benchmarking them across multiple models. It lets you take your experiments from the Playground and turn them into structured tests that can be run repeatedly and at scale.

Key Features

Prompt Versioning

  • Create and save multiple versions of your prompts.
  • Easily iterate and improve your prompts over time.

Dynamic Inputs

  • Convert parts of your prompt into variable inputs.
  • Test your prompt across different scenarios by varying these inputs (see the sketch after this list).
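
The Workbench handles input substitution in its UI, so no code is required; the sketch below only illustrates the idea, using Python's `string.Template` placeholders rather than whatever placeholder syntax the Workbench actually uses. The prompt text and the `link` input are made up for the example.

```python
from string import Template

# Illustrative only: a prompt in which the link to evaluate has been
# converted into a dynamic input named "link".
prompt_template = Template(
    "You are a security assistant. A user wants to open this link:\n"
    "$link\n"
    "Decide whether it is safe to open, and explain your reasoning."
)

# Varying the input exercises the same prompt across different scenarios.
print(prompt_template.substitute(link="https://example.com/quarterly-report.pdf"))
print(prompt_template.substitute(link="http://198.51.100.7/login.php?next=//evil.example"))
```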

Test Creation

  • Define multiple test cases for your prompt.
  • Set up desired outcomes for each test case.
  • Create tests for different scenarios (e.g., refusing insecure links, handling valid links), as in the sketch below.
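
Test cases are defined through the Workbench UI; as a mental model, each one pairs a set of input values with the outcome you want to see. A minimal sketch, using a hypothetical structure that is not the Workbench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One scenario: input values for the prompt plus the desired outcome.
    Field names are illustrative, not the Workbench's actual schema."""
    name: str
    inputs: dict             # values for the prompt's dynamic inputs
    expected_behavior: str   # what a passing response must do

test_cases = [
    TestCase(
        name="refuses insecure link",
        inputs={"link": "http://198.51.100.7/login.php?next=//evil.example"},
        expected_behavior="advises against opening the link",
    ),
    TestCase(
        name="handles valid link",
        inputs={"link": "https://example.com/quarterly-report.pdf"},
        expected_behavior="confirms the link appears safe to open",
    ),
]
```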

Benchmarking

  • Run your tests across multiple models.
  • Set the number of rounds to ensure a robust sample size.
  • Compare performance across different models and prompt versions (see the sketch below).
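
The Workbench runs benchmarks for you; conceptually, a benchmark is every model paired with every test case, repeated for the chosen number of rounds. The sketch below reuses `prompt_template` and `test_cases` from the earlier sketches; the model names and the `run_model` / `passes` helpers are hypothetical placeholders, not a real API.

```python
from collections import defaultdict

models = ["model-a", "model-b"]   # hypothetical model identifiers
rounds = 5                        # repeat each test for a usable sample size

def run_model(model: str, prompt: str) -> str:
    """Placeholder for the actual model call the Workbench makes internally."""
    raise NotImplementedError

def passes(response: str, expected_behavior: str) -> bool:
    """Placeholder for outcome checking (e.g. a grader or keyword check)."""
    raise NotImplementedError

# results[(model, test case name)] -> one pass/fail outcome per round
results = defaultdict(list)
for model in models:
    for case in test_cases:
        prompt = prompt_template.substitute(**case.inputs)
        for _ in range(rounds):
            response = run_model(model, prompt)
            results[(model, case.name)].append(passes(response, case.expected_behavior))
```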

Result Analysis

  • View detailed results of your benchmarks.
  • Analyze success rates for each model and test case.
  • Drill down into individual test results to understand failures (see the sketch below).
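
The Workbench reports these figures directly; the arithmetic behind a success rate is simply passes divided by rounds for each model and test case. Continuing the sketch above:

```python
for (model, case_name), outcomes in results.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{model} | {case_name}: {rate:.0%} ({sum(outcomes)}/{len(outcomes)} rounds passed)")
    # To understand failures, drill into the individual rounds where the
    # outcome is False and read the full model response for each one.
```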

How to Use

  1. Start by creating a new prompt or importing one from the Playground.
  2. Identify parts of your prompt that you want to vary and convert them to inputs.
  3. Create test cases by defining different input values and desired outcomes.
  4. Set up your benchmark by selecting models and the number of rounds.
  5. Run the benchmark and analyze the results.
  6. Refine your prompt based on the results and create new versions for further testing, as sketched below.
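
Step 6 closes the loop: benchmark a revised prompt version against the same test cases and models, then compare aggregate success rates. A small sketch of that comparison, continuing the hypothetical `results` structure from above:

```python
def overall_success_rate(results) -> float:
    """Fraction of all (model, test case, round) outcomes that passed."""
    outcomes = [o for per_case in results.values() for o in per_case]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Run the benchmark once per prompt version, then compare:
#   rate_v1 = overall_success_rate(results_v1)
#   rate_v2 = overall_success_rate(results_v2)
# A higher rate on the same tests and models suggests the revision helped.
```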

The Workbench allows you to move beyond simple experimentation and into rigorous, data-driven prompt engineering. Use it to ensure your prompts are robust and perform consistently across different scenarios and models.