What’s an eval?
An eval is a test framework that evaluates the quality of LLM output. This guide focuses on evaluating a specific type of output: LLM-generated code.
Scenario: Assess LLM-assisted data transformation
In our data transformation guide, you learn how to use LLM-generated code to transform a CSV into any desired JSON format. The transformation worked on one example, but we want to be more certain that our specific LLM model and prompts are effective on a variety of inputs. We also want to be able to easily test the impact of changing our prompts or using new LLM models.
Solution: Custom code generation eval
We’ll create a simple test harness that prompts an LLM to write code, executes that code safely using Riza, and analyzes the output. As input, we’ll write test cases that represent variations we might see in our CSV data and in the desired output JSON schemas.
Why use Riza?
In general, LLMs are good at writing code, but they can’t execute the code they write. A common use case for Riza is to safely execute code written by LLMs. For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side effects. You can protect your systems by executing that code on Riza instead of in your production environment.
Example code
Get the full code for this example in our GitHub repo.
Step 1: Create a code generation function we want to test
In this example, we want to test the CSV-to-JSON code generation setup we created in our data transformation guide. In that guide, we crafted a prompt and ran it against a specific Claude model. Let’s package that logic into a simple class:
data_transform_codegen.py
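The real class lives in the GitHub repo; here is a minimal sketch of what it might look like. The class name, prompt wording, and model string are assumptions made for illustration, and the prompt asks for an execute(input) function so the generated code can run on Riza’s Execute Function API (used in Step 3b):

```python
# data_transform_codegen.py (illustrative sketch)
import anthropic

PROMPT = """Write a Python function named execute that takes a single argument,
input, which is a dict with a "csv" key containing CSV text. The function must
return JSON-serializable data matching this JSON schema:

{output_schema}

Here is a sample of the CSV data:

{csv_sample}

Respond with only the Python code, no explanation and no markdown formatting."""


class DataTransformCodeGenerator:
    """Asks Claude to write CSV-to-JSON transformation code."""

    def __init__(self, model: str = "claude-3-5-sonnet-latest"):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self.model = model

    def generate_code(self, output_schema: str, csv_sample: str) -> str:
        """Return LLM-generated Python code for the requested transformation."""
        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": PROMPT.format(
                        output_schema=output_schema,
                        csv_sample=csv_sample,
                    ),
                }
            ],
        )
        return message.content[0].text
```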
The generate_code() function here accepts a desired output JSON schema and sample CSV data. These are the inputs we’ll provide in our test cases.
For simplicity here, we’ve combined the prompt with the model. In a real-world situation, you may want to create a way to easily swap in different models and prompts.
Step 2: Create test cases
Next, let’s write our evals: a set of test cases. We’ve created a few evals to demonstrate some of the variations to test for. Below, we’ll explain the general format of each test and write just one test; see GitHub for the full test suite. First, we define a test class:
test_cases.py
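The repo has the real definition; a minimal version might be a dataclass like this, with field names chosen to mirror the list below:

```python
# test_cases.py (illustrative sketch)
from dataclasses import dataclass


@dataclass
class TestCase:
    name: str              # human-readable label for reporting
    output_schema: str     # desired output JSON schema
    csv_example: str       # CSV snippet included in the LLM prompt
    csv_full: str          # full CSV passed to the generated code
    expected_output: list  # expected JSON output
```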
Each test case includes:
- a desired output JSON schema
- an example CSV snippet that is used in our LLM prompt
- a full CSV snippet that is used as input to the LLM-generated code, and
- the expected JSON output
Here’s one test case:
test_cases.py
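The values below are illustrative rather than copied from the repo; they assume the TestCase fields sketched above:

```python
# test_cases.py (continued, illustrative sketch)
TEST_CASES = [
    TestCase(
        name="basic_people_csv",
        output_schema="""{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"}
    },
    "required": ["name", "age"]
  }
}""",
        csv_example="name,age\nAlice,34",
        csv_full="name,age\nAlice,34\nBob,29\nCarol,41",
        expected_output=[
            {"name": "Alice", "age": 34},
            {"name": "Bob", "age": 29},
            {"name": "Carol", "age": 41},
        ],
    ),
]
```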
Step 3: Run the evaluation
Now, let’s run the evals. This main() function shows the outline of the test logic:
eval_codegen.py
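Here’s a sketch of what that outline might look like, using the names assumed elsewhere in this walkthrough:

```python
# eval_codegen.py (illustrative sketch)
from data_transform_codegen import DataTransformCodeGenerator
from test_cases import TEST_CASES


def main():
    code_generator = DataTransformCodeGenerator()
    results = []

    # Evaluate each test case and collect the per-case results.
    for test_case in TEST_CASES:
        result = evaluate_llm_code(code_generator, test_case)
        results.append(result)

    # Compute an overall score: the fraction of test cases that passed.
    passed = sum(1 for r in results if r["passed"])
    print(f"Score: {passed}/{len(results)} test cases passed")


if __name__ == "__main__":
    main()
```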
Most of the work happens in the evaluate_llm_code() function. First, we define the format of the results we want:
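One plausible shape for a per-test-case result, along with the evaluate_llm_code() skeleton and its placeholder (the exact fields in the repo may differ):

```python
# eval_codegen.py (continued, illustrative sketch)
from typing import TypedDict


class EvalResult(TypedDict):
    test_name: str     # which test case this result belongs to
    passed: bool       # overall pass/fail for this test case
    errors: list[str]  # details about anything that went wrong


def evaluate_llm_code(code_generator, test_case) -> EvalResult:
    # Do something
    ...
```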
The # Do something placeholder marks the logic we apply to each test case. Let’s build out this logic.
Step 3a: Generate the code
First, we want to invoke the LLM code generation. This is simple: we call our code_generator() function with inputs from the test case.
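Inside evaluate_llm_code(), that call might look like this, assuming code_generator is an instance of the class sketched in Step 1:

```python
# eval_codegen.py (continued, illustrative sketch)
def evaluate_llm_code(code_generator, test_case) -> EvalResult:
    # Step 3a: ask the LLM to write the transformation code.
    generated_code = code_generator.generate_code(
        test_case.output_schema,
        test_case.csv_example,
    )
    ...
```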
Step 3b: Execute the code on Riza
Next, let’s run the LLM-generated code against the full CSV data in our test case. To run that untrusted code safely, we’ll execute it on Riza. First, install and initialize the Riza API client library:
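The Python client is rizaio. A minimal setup, assuming your API key is in the RIZA_API_KEY environment variable:

```python
# eval_codegen.py (continued, illustrative sketch)
# Install the client first:  pip install rizaio
from rizaio import Riza

riza_client = Riza()  # reads the RIZA_API_KEY environment variable
```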
Then, we write a helper function, _run_code(), that calls the Riza Execute Function API:
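Here’s a sketch of such a helper. It assumes the generated code defines an execute(input) function (as our Step 1 prompt requests) and that the client exposes the Execute Function API as command.exec_func; check the Riza API reference for the exact method name and parameters:

```python
# eval_codegen.py (continued, illustrative sketch)
def _run_code(generated_code: str, csv_data: str):
    """Execute the LLM-generated code on Riza and return the API response."""
    # Assumption: the generated code defines execute(input) and reads
    # input["csv"], which is what the Step 1 prompt asks for. The method
    # name and parameters below may differ from the current rizaio client.
    return riza_client.command.exec_func(
        language="python",
        code=generated_code,
        input={"csv": csv_data},
    )
```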
Step 3c: Score the output
The Riza Execute Function API tells you whether the code ran successfully and, if so, the result of the code execution. Let’s use both kinds of information to evaluate the output. We’ll write a helper function, _validate_result(), that scores the output. Note that this is just an example of the types of checks you can do; in a real-world example, you’ll likely customize this logic:
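A sketch of what those checks might look like; the response attribute names are assumptions, so check the Riza API reference for the exact response shape:

```python
# eval_codegen.py (continued, illustrative sketch)
import json


def _validate_result(response, test_case) -> list[str]:
    """Return a list of error strings; an empty list means the test case passed."""
    errors = []

    # 1. Did the code run at all? (The attribute names on the response are
    #    assumptions; check the Riza API reference for the exact shape.)
    if response.execution.exit_code != 0:
        errors.append(f"Code failed to run: {response.execution.stderr}")
        return errors

    # 2. Does the returned data match the expected JSON output exactly?
    if response.output != test_case.expected_output:
        errors.append(
            "Output mismatch:\n"
            f"  expected: {json.dumps(test_case.expected_output)}\n"
            f"  got:      {json.dumps(response.output)}"
        )

    return errors
```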
Finally, we put it all together in the evaluate_llm_code() function. We:
- Run _validate_result() and add the output to our overall results.
- Compute an overall score after we run all the test cases.
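Putting the pieces together, the finished function might look like this (still a sketch built on the assumptions above):

```python
# eval_codegen.py (continued, illustrative sketch)
def evaluate_llm_code(code_generator, test_case) -> EvalResult:
    # Step 3a: generate the transformation code with the LLM.
    generated_code = code_generator.generate_code(
        test_case.output_schema,
        test_case.csv_example,
    )

    # Step 3b: execute the untrusted code safely on Riza.
    response = _run_code(generated_code, test_case.csv_full)

    # Step 3c: score the output and record the result.
    errors = _validate_result(response, test_case)

    return EvalResult(
        test_name=test_case.name,
        passed=len(errors) == 0,
        errors=errors,
    )
```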
Next steps
Extend this example
Get the full code for this example in our GitHub repo. For further exploration, you can extend this example. Here are a few ideas:
- Expand the test cases: Add more varied and edge-case scenarios.
- Compare different models: Modify the code generator class to test different LLM providers and models against the same test suite.
- Modify the prompt: Experiment with different prompts by modifying the PROMPT template in the code. See how changes to the wording affect performance.
- Expand the scoring: Add more ways to check the output. You could also make _validate_result() output a numeric score that reflects a more nuanced result than “pass” versus “fail”.
Explore Riza
- Try out the API.
- Learn how to use the Riza API with tool use APIs from OpenAI, Anthropic and Google.
- Check out the roadmap to see what we’re working on next.