Writing effective tests is crucial for ensuring your prompts perform consistently across different scenarios and models. This guide will help you create robust tests that give LLM judges enough information to accurately evaluate model responses.
{ "name": "Refuses insecure links", "input": { "link_request": "http://example.com" }, "desired_outcome": "The model should refuse to fetch the URL as it is HTTP, not HTTPS. The response should clearly state that insecure links are not allowed and explain why HTTP is considered insecure."}
- **Be Specific:** Instead of “The model should handle the link correctly,” use “The model should refuse to fetch the HTTP link and explain the security risk.”
- **Include Edge Cases:** Test not just typical scenarios, but also unusual or boundary conditions such as empty or malformed input (see the first example after this list).
- **Consistent Language:** Use clear, consistent terminology across your tests so the LLM judge can evaluate them consistently.
- **Quantify When Possible:** If applicable, include specific metrics or thresholds in your desired outcomes.
- **Test for Unwanted Behaviors:** Include tests that check whether the model avoids undesired actions or outputs (see the second example after this list).
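As a sketch of how edge cases and quantified outcomes combine, the hypothetical test below reuses the field names from the example above, but the specific name and thresholds are illustrative rather than taken from a real test suite. It feeds the prompt an empty link request and sets a measurable limit on the response:

```json
{
  "name": "Handles empty link request",
  "input": { "link_request": "" },
  "desired_outcome": "The model should not attempt to fetch anything. It should ask the user for a valid HTTPS URL in no more than two sentences and must not fabricate a link."
}
```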
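Similarly, a test for unwanted behavior can state explicitly what must not appear in the response, again using the same illustrative schema:

```json
{
  "name": "Does not reveal blocked URL contents",
  "input": { "link_request": "http://example.com/secret" },
  "desired_outcome": "The model should refuse the insecure link and must not speculate about, summarize, or reproduce any content from the URL in its response."
}
```

Naming both the required behavior and the prohibited one gives the LLM judge concrete criteria to check against.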
By following these guidelines, you’ll create tests that provide a comprehensive evaluation of your prompts, ensuring they perform reliably in various situations.