Code Generation Evals
Programmatically test the quality of LLM-generated code
If you run LLM-generated code in production, you’ll want to know that the code you’re generating works on a variety of inputs. You’ll also want to ensure your system continues to produce working code whenever you upgrade your LLM model or edit a prompt.
This guide shows how Riza can help you create a test framework that gives you this assurance.
What’s an eval?
An eval is a test framework that evaluates the quality of LLM output. This guide focuses on evaluating a specific type of output: LLM-generated code.
Scenario: Assess LLM-assisted data transformation
In our data transformation guide, we show how to use LLM-generated code to transform a CSV into any desired JSON format.
The transformation worked on one example, but we want to be more certain that our specific LLM model and prompts are effective on a variety of inputs. We also want to be able to easily test the impact of changing our prompts, or using new LLM models.
Solution: Custom code generation eval
We’ll create a simple test harness that prompts an LLM to write code, executes that code safely using Riza, and analyzes the output.
As input, we’ll write test cases that represent variations we might see in our CSV data, and in the desired output JSON schemas.
Why use Riza?
In general, LLMs are good at writing code, but they can’t execute the code they write.
A common use case for Riza is to safely execute code written by LLMs.
For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side-effects. You can protect your systems by executing that code on Riza instead of in your production environment.
Example code
Get the full code for this example in our GitHub repository.
Step 1: Create a code generation function we want to test
In this example, we want to test the CSV-to-JSON code generation setup we created in our data transformation guide. In that guide, we crafted a prompt and ran it against a specific Claude model.
Let’s package that logic into a simple class:
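Here’s a minimal sketch of what that class might look like, assuming the Anthropic Python SDK. The prompt wording, the model name, and the class and constant names are illustrative stand-ins for the versions in the repo, and the prompt assumes the generated code defines an execute(input) function as expected by the Riza Execute Function API we use later:

```python
import anthropic

# Illustrative prompt template; the real prompt lives in the GitHub repo.
PROMPT = """Write a Python function named `execute` that takes a single argument,
`input`, a dict with a key "csv" containing CSV data as a string. The function
must return JSON matching this schema:

{schema}

Here is a sample of the CSV data:

{csv_sample}

Respond with only the Python code, no explanation or markdown fences."""


class CodeGenerator:
    def __init__(self, model: str = "claude-3-5-sonnet-latest"):
        self.model = model
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

    def generate_code(self, output_schema: str, csv_sample: str) -> str:
        """Ask the LLM to write CSV-to-JSON transformation code."""
        message = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": PROMPT.format(schema=output_schema, csv_sample=csv_sample),
            }],
        )
        return message.content[0].text
```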
The generate_code() function here accepts a desired output JSON schema and sample CSV data. These are the inputs we’ll provide in our test cases.
For simplicity here, we’ve combined the prompt with the model. In a real-world situation, you may want to create a way to easily swap in different models and prompts.
Step 2: Create test cases
Next, let’s write our evals: a set of test cases.
We’ve created a few evals to demonstrate some variations to test for. Below, we’ll explain the general format of each test and write just one test. See GitHub for the full test suite.
First, we define a test class:
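A sketch of such a test class, written here as a dataclass; the field names are illustrative and the repo’s version may differ:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TestCase:
    name: str              # short description of what the case covers
    output_schema: str     # desired output JSON schema, as a string
    example_csv: str       # small CSV snippet included in the LLM prompt
    full_csv: str          # full CSV passed as input to the generated code
    expected_output: Any   # JSON we expect the generated code to produce
```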
Then, we’ll use the class to create a test case. Each test case includes:
- a desired output JSON schema
- an example CSV snippet that is used in our LLM prompt
- a full CSV snippet that is used as input to the LLM-generated code, and
- the expected JSON output
For example, we can write an eval that tests whether the LLM-generated code correctly handles missing fields in the CSV input:
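An illustrative version of that test case, with made-up CSV data and schema; the point is the shape of the inputs, not the exact data used in the repo:

```python
# One row in the full CSV is missing its email value; we expect the
# generated code to emit null for that field rather than failing.
missing_field_case = TestCase(
    name="missing email field",
    output_schema="""{
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "email": {"type": ["string", "null"]}
        }
      }
    }""",
    example_csv="name,email\nAlice,alice@example.com",
    full_csv="name,email\nAlice,alice@example.com\nBob,",
    expected_output=[
        {"name": "Alice", "email": "alice@example.com"},
        {"name": "Bob", "email": None},
    ],
)
```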
Step 3: Run the evaluation
Now, let’s run the evals. This main() function shows the outline of the test logic:
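A sketch of that outline, assuming the pieces defined throughout this guide; the repo’s main() may differ in the details:

```python
import json


def main():
    code_generator = CodeGenerator()
    test_cases = [missing_field_case]  # plus the other cases from the repo
    results = evaluate_llm_code(code_generator, test_cases)
    print(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()
```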
Let’s look at the evaluate_llm_code() function. First, we define the format of the results we want:
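Something like the structure below, with a # Do something placeholder for the per-test-case logic we build out next (the exact fields are illustrative):

```python
def evaluate_llm_code(code_generator, test_cases):
    results = {
        "total": len(test_cases),
        "passed": 0,
        "failed": 0,
        "cases": [],  # per-test-case details: generated code, checks, pass/fail
    }

    for test_case in test_cases:
        # Do something: generate, execute, and score the code for this test case
        ...

    return results
```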
The meat of this function is in the # Do something placeholder: the logic we apply to each test case. Let’s build out this logic.
Step 3a: Generate the code
First, we want to invoke the LLM code generation. This is simple: we call our code generator’s generate_code() function with inputs from the test case.
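Inside the loop, that might look like this, assuming the CodeGenerator class sketched in Step 1 is passed in as code_generator:

```python
# Ask the LLM for transformation code for this test case.
generated_code = code_generator.generate_code(
    test_case.output_schema,
    test_case.example_csv,
)
```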
Step 3b: Execute the code on Riza
Next, let’s run the LLM-generated code against the full CSV data in our test case. To run the LLM-generated code safely, we’ll run it on Riza.
First, install the Riza API client library, then import and initialize the Riza client. Note that there are multiple ways to set your API key:
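A minimal sketch, assuming the rizaio Python package; setting RIZA_API_KEY in the environment is one option, and passing api_key= explicitly is another:

```python
# Install the client first, e.g. with: pip install rizaio
from rizaio import Riza

# With no arguments the client reads the RIZA_API_KEY environment variable;
# you can also pass api_key="..." explicitly.
client = Riza()
```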
Let’s add a helper function, _run_code(), that calls the Riza Execute Function API:
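A sketch of that helper. It assumes the Python client exposes the Execute Function API as client.command.exec_func() with language, code, and input parameters; check the Riza docs for the exact method name and signature in the current SDK:

```python
def _run_code(client: Riza, code: str, input_data: dict):
    """Execute LLM-generated code on Riza and return the raw API response."""
    # Assumption: exec_func() is the Execute Function API method in the
    # rizaio SDK; adjust the call to match the version you have installed.
    return client.command.exec_func(
        language="python",
        code=code,
        input=input_data,
    )
```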
Finally, we’ll use this helper function to execute the generated code. Note how we pass in the full CSV data from our test case as input:
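Back inside the loop, the call might look like this; the "csv" input key is an assumption that matches the prompt sketched in Step 1:

```python
# Run the generated code on Riza against the full CSV from the test case.
execution = _run_code(
    client,
    generated_code,
    {"csv": test_case.full_csv},
)
```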
Step 3c: Score the output
The Riza Execute Function API tells you whether the code ran successfully, and if so, the result of the code execution. Let’s use both types of information to evaluate the output.
Let’s write a helper function, _validate_result(), that scores the output. Note that this is just an example of the types of checks you can do. In a real-world example, you’ll likely customize this logic:
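One possible version of that helper. The response field names (execution.exit_code for the process exit code, output for the function’s return value) are assumptions about the Execute Function API response shape; adjust them to the fields your SDK version returns:

```python
def _validate_result(execution, expected_output) -> dict:
    """Score one test case: did the code run, and did it produce the expected JSON?"""
    # Assumption: a zero exit code means the generated code ran successfully.
    ran_successfully = execution.execution.exit_code == 0
    # Assumption: execution.output holds the JSON value returned by execute().
    output_matches = ran_successfully and execution.output == expected_output
    return {
        "ran_successfully": ran_successfully,
        "output_matches": output_matches,
        "passed": ran_successfully and output_matches,
    }
```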
Now, let’s finish our evaluate_llm_code() function. We:
- Run _validate_result() and add the output to our overall results.
- Compute an overall score after we run all the test cases.
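Put together, the end of the loop and the final scoring might look like this; the structure mirrors the results format sketched above:

```python
# Inside the loop: score this test case and record the details.
checks = _validate_result(execution, test_case.expected_output)
results["cases"].append({
    "name": test_case.name,
    "generated_code": generated_code,
    "checks": checks,
})
if checks["passed"]:
    results["passed"] += 1
else:
    results["failed"] += 1

# After the loop: compute an overall score across all test cases.
results["score"] = results["passed"] / results["total"] if results["total"] else 0.0
```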
Next steps
Extend this example
Get the full code for this example in our GitHub repository.
For further exploration, you can extend this example. Here are a few ideas:
- Expand the test cases: Add more varied and edge-case scenarios.
- Compare different models: Modify the code generator class to test different LLM providers and models against the same test suite.
- Modify the prompt: Experiment with different prompts by modifying the PROMPT template in the code. See how changes to the wording affect performance.
- Expand the scoring: Add more ways to check the output. You could also make _validate_result() output a numeric score that reflects a more nuanced result than “pass” versus “fail”.
Explore Riza
- Try out the API.
- Learn how to use the Riza API with tool use APIs from OpenAI, Anthropic and Google.
- Check out the roadmap to see what we’re working on next.