Overview

Writing effective tests is crucial for ensuring your prompts perform consistently across different scenarios and models. This guide will help you create robust tests that give LLM judges enough information to accurately evaluate model responses.

Key Principles

1. Clarity in Desired Outcomes

  • Be specific about what constitutes a successful response.
  • Provide clear criteria for pass/fail conditions.
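
For example, a desired outcome with explicit pass/fail criteria leaves the judge little room for interpretation. The secure-link scenario below is illustrative; it is the positive counterpart to the insecure-link example later in this guide:

{
  "name": "Accepts secure links",
  "input": {
    "link_request": "https://example.com"
  },
  "desired_outcome": "Pass if the model agrees to fetch the link and notes that HTTPS is acceptable. Fail if the model refuses the link or asks for additional confirmation."
}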

2. Diverse Test Cases

  • Create tests that cover various scenarios your prompt might encounter.
  • Include both positive tests (expected behavior) and negative tests (edge cases and inputs the prompt should refuse).
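
For instance, a small suite for a link-handling prompt could pair a positive case with a negative one (both scenarios are illustrative):

[
  {
    "name": "Fetches a well-formed secure link",
    "input": { "link_request": "https://example.com/docs" },
    "desired_outcome": "The model should agree to fetch the link without asking for further confirmation."
  },
  {
    "name": "Rejects input that is not a URL",
    "input": { "link_request": "please just summarize something for me" },
    "desired_outcome": "The model should explain that no valid URL was provided and ask for one instead of guessing a link."
  }
]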

3. Input Variability

  • Use dynamic inputs to test your prompt with different values.
  • Ensure your tests cover a range of possible inputs.
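
For example, the same refusal test can be repeated with different values of the link_request input (the values below are illustrative):

[
  { "link_request": "http://example.com" },
  { "link_request": "http://example.com:8080/status" },
  { "link_request": "http://203.0.113.5/login" }
]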

4. Contextual Information

  • Provide enough context in your desired outcome for the LLM judge to understand the test’s purpose.
  • Remember, the judge cannot see the original prompt, only the input values and desired outcome.

Example Test Structure

Putting these principles together, here’s a complete example of how to structure a test:

{
  "name": "Refuses insecure links",
  "input": {
    "link_request": "http://example.com"
  },
  "desired_outcome": "The model should refuse to fetch the URL as it is HTTP, not HTTPS. The response should clearly state that insecure links are not allowed and explain why HTTP is considered insecure."
}

Tips for Effective Tests

  1. Be Specific: Instead of “The model should handle the link correctly,” use “The model should refuse to fetch the HTTP link and explain the security risk.”

  2. Include Edge Cases: Test not just typical scenarios, but also unusual or boundary conditions (see the combined example after this list).

  3. Use Consistent Language: Keep terminology clear and consistent across your tests for easier evaluation.

  4. Quantify When Possible: If applicable, include specific metrics or thresholds in your desired outcomes.

  5. Test for Unwanted Behaviors: Include tests that check if the model avoids undesired actions or outputs.
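
As a sketch, a single test can combine several of these tips, covering an edge case, a quantified expectation, and a check for unwanted behavior (the scenario and thresholds are illustrative):

{
  "name": "Handles a link with no scheme",
  "input": {
    "link_request": "example.com/report"
  },
  "desired_outcome": "The model must not fetch the link over HTTP. It should either ask which protocol to use or state that it will only proceed over HTTPS, in no more than two sentences, and it should not repeat the raw URL more than once."
}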

By following these guidelines, you’ll create tests that provide a comprehensive evaluation of your prompts, ensuring they perform reliably in various situations.