If you run LLM-generated code in production, you’ll want to know that the code you’re generating works on a variety of inputs. You’ll also want to ensure your system continues to produce working code whenever you upgrade your LLM model or edit a prompt.

This guide shows how Riza can help you create a test framework that gives you this assurance.

What’s an eval?

An eval is a test framework that evaluates the quality of LLM output. This guide focuses on evaluating a specific type of output: LLM-generated code.

Scenario: Assess LLM-assisted data transformation

In our data transformation guide, you learn how to use LLM-generated code to transform a CSV into any desired JSON format.

The transformation worked on one example, but we want to be more certain that our specific LLM model and prompts are effective on a variety of inputs. We also want an easy way to test the impact of changing our prompts or switching to new LLM models.

Solution: Custom code generation eval

We’ll create a simple test harness that prompts an LLM to write code, executes that code safely using Riza, and analyzes the output.

As input, we’ll write test cases that represent variations we might see in our CSV data, and in the desired output JSON schemas.

Why use Riza?

In general, LLMs are good at writing code, but they can’t execute the code they write.

A common use case for Riza is to safely execute code written by LLMs.

For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side-effects. You can protect your systems by executing that code on Riza instead of in your production environment.

Example code

Get the full code for this example in our GitHub repo.

Before you begin, sign up for Riza and Anthropic API access. There’s no special reason we chose Anthropic for this use case; it’s straightforward to adapt this guide to any other LLM provider.

Step 1: Create a code generation function we want to test

In this example, we want to test the CSV-to-JSON code generation setup we created in our data transformation guide. In that guide, we crafted a prompt and ran it against a specific Claude model.

Let’s package that logic into a simple class:

data_transform_codegen.py
import json
import anthropic

# Get an API key from Anthropic and set it as the value of
# an environment variable named ANTHROPIC_API_KEY
anthropic_client = anthropic.Anthropic()

PROMPT = """
You are given raw CSV data of a list of people.

Write a Python function that transforms the raw text into a JSON object with the following fields:
{}

The function signature of the Python function must be:
def execute(input):

`input` is a Python object. The full data is available as text at `input["data"]`.

Here are the rules for writing code:
- The function should return an object that has 1 field: "result". The "result" data should be a stringified JSON object.
- Use only the Python standard library and built-in modules.

Finally, here are a few lines of the raw text of the CSV:

{}
"""

class ClaudeCsvToJsonCodeGenerator:
    generator_id = "csv2json-claude-3-7-sonnet-latest"

    @staticmethod
    def generate_code(desired_schema_obj, sample_data):
        message = anthropic_client.messages.create(
            model="claude-3-7-sonnet-latest",
            max_tokens=2048,
            system="You are an expert programmer. When given a programming task, " +
                "you will only output the final code, without any explanation. " +
                "Do NOT put the code in a codeblock.",
            messages=[
                {
                    "role": "user",
                    "content": PROMPT.format(json.dumps(desired_schema_obj), sample_data),
                }
            ]
        )
        code = message.content[0].text
        return code

The generate_code() function accepts a desired output JSON schema and sample CSV data. These are the inputs we’ll provide in our test cases.

For simplicity, we’ve hard-coded both the prompt and the model in this class. In a real-world setup, you may want a way to easily swap in different models and prompts.
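
One way to do that, sketched below as a hypothetical variation (it isn’t part of the example repo), is to make the model and prompts constructor arguments, reusing the same anthropic_client as above:

# Hypothetical variation: the model and prompts become constructor arguments,
# so different configurations can be run against the same test suite.
class ConfigurableCsvToJsonCodeGenerator:
    def __init__(self, model, prompt_template, system_prompt):
        self.generator_id = f"csv2json-{model}"
        self.model = model
        self.prompt_template = prompt_template
        self.system_prompt = system_prompt

    def generate_code(self, desired_schema_obj, sample_data):
        message = anthropic_client.messages.create(
            model=self.model,
            max_tokens=2048,
            system=self.system_prompt,
            messages=[
                {
                    "role": "user",
                    "content": self.prompt_template.format(
                        json.dumps(desired_schema_obj), sample_data),
                }
            ]
        )
        return message.content[0].text

You could then create one instance per configuration you want to compare and pass each instance to the eval runner described below.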

Step 2: Create test cases

Next, let’s write our evals: a set of test cases.

We’ve created a few evals to demonstrate some variations to test for. Below, we’ll explain the general format of each test and write just one test. See GitHub for the full test suite.

First, we define a test class:

test_cases.py
from dataclasses import dataclass
from typing import Dict, Any

@dataclass
class DataTransformationTestCase:
    id: str
    name: str
    description: str
    csv_sample: str
    csv_full: str
    desired_schema: Dict[str, Any]
    expected_json_out: Dict[str, Any]

Then, we’ll use the class to create a test case. Each test case includes:

  • a desired output JSON schema
  • an example CSV snippet that is used in our LLM prompt
  • the full CSV data that is used as input to the LLM-generated code, and
  • the expected JSON output

For example, we can write an eval that tests whether the LLM-generated code correctly handles missing fields in the CSV input:

test_cases.py
STANDARD_SCHEMA = {
    "type": "object",
    "properties": {
        "appraisers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "phone": {"type": "string"},
                    "license": {"type": "string"}
                }
            }
        }
    }
}

MISSING_FIELDS_CSV_SUBSET = """License,Name,Company,Address,City,State,Zip,County,Phone
001507,"Smith, Amy",Amy Smith Inc.,1830 Castillo St,Santa Barbara,CA,93101,Santa Barbara,(123) 456-7890
001508,"Smith, Bob",Bob Smith Inc.,1458 Sutter St,San Francisco,CA,94109,San Francisco,
,"Johnson, Carl",,123 Main St,Los Angeles,CA,90001,Los Angeles,(555) 123-4567"""

MISSING_FIELDS_CSV_FULL = "\n".join(
    [MISSING_FIELDS_CSV_SUBSET,
    """001509,"Williams, Sarah",,500 Elm St,San Diego,CA,92101,San Diego,(619) 555-7890
001510,"Brown, David",Brown Consulting,,Sacramento,CA,95814,,(916) 555-2345
,"Taylor, Jessica",Jessica Taylor LLC,742 Oak Ave,San Jose,CA,,Santa Clara,(408) 555-6789"""])

TEST_CASE_MISSING_FIELDS = DataTransformationTestCase(
    id="missing-data-001",
    name="Missing data in CSV",
    description="Test handling missing data for some fields",
    csv_sample=MISSING_FIELDS_CSV_SUBSET,
    csv_full=MISSING_FIELDS_CSV_FULL,
    desired_schema=STANDARD_SCHEMA,
    expected_json_out={
        "appraisers": [
            {"name": "Smith, Amy", "phone": "(123) 456-7890", "license": "001507"},
            {"name": "Smith, Bob", "phone": "", "license": "001508"},
            {"name": "Johnson, Carl", "phone": "(555) 123-4567", "license": ""},
            {"name": "Williams, Sarah", "phone": "(619) 555-7890", "license": "001509"},
            {"name": "Brown, David", "phone": "(916) 555-2345", "license": "001510"},
            {"name": "Taylor, Jessica", "phone": "(408) 555-6789", "license": ""},
        ]
    }
)
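
The eval runner in the next step iterates over ALL_TEST_CASES, a list of test cases. With only the single case defined here, that list might look like the following (the full suite on GitHub adds more cases):

test_cases.py
# Collect every test case the eval runner should execute.
ALL_TEST_CASES = [
    TEST_CASE_MISSING_FIELDS,
]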

Step 3: Run the evaluation

Now, let’s run the evals. This main() function shows the outline of the test logic:

eval_codegen.py
def main():
    code_generator = ClaudeCsvToJsonCodeGenerator()

    results = evaluate_llm_code(
        generator_id=code_generator.generator_id,
        code_generator=code_generator.generate_code,
        test_cases=ALL_TEST_CASES
    )

    print(json.dumps(results, indent=2))

Let’s look at the evaluate_llm_code() function. First, we define the format of the results we want:

def evaluate_llm_code(generator_id, code_generator, test_cases):
    results = {
        "generator_id": generator_id,
        "summary": {
            "total": len(test_cases),
            "passed": 0,
            "failed": 0,
        },
        "test_case_details": [],
        "overall_score": None,
    }

    if len(test_cases) == 0:
        return results

    for t in test_cases:
        # Do something ...

The meat of this function is the # Do something placeholder: the logic we apply to each test case. Let’s build out this logic.

Step 3a: Generate the code

First, we want to invoke the LLM code generation. This is simple: we call our code_generator() function with inputs from the test case.

    for t in test_cases:
        generated_code = code_generator(t.desired_schema, t.csv_sample)

Step 3b: Execute the code on Riza

Next, let’s run the LLM-generated code against the full CSV data in our test case. To execute it safely, we’ll run it on Riza.

First, install the Riza API client library:

pip install rizaio

Import and initialize the Riza client. Note that there are multiple ways to set your API key:

from rizaio import Riza

# Option 1: Pass your API key directly:
riza_client = Riza(api_key="your Riza API key")

# Option 2: Set the `RIZA_API_KEY` environment variable
riza_client = Riza() # Will use `RIZA_API_KEY`

Let’s add a helper function, _run_code(), that calls the Riza Execute Function API:

def _run_code(code, input_data):
    print("Running code on Riza...")
    return riza_client.command.exec_func(
        language="python",
        input=input_data,
        code=code,
    )

Finally, we’ll use this helper function to execute the generated code. Note how we pass in the full CSV data from our test case as input:

    for t in test_cases:
        generated_code = code_generator(t.desired_schema, t.csv_sample)

        input_data = {"data": t.csv_full}
        execution_result = _run_code(generated_code, input_data)

Step 3c: Score the output

The Riza Execute Function API tells you whether the code ran successfully, and if so, the result of the code execution. Let’s use both types of information to evaluate the output.

We’ll write a helper function, _validate_result(), that scores the output. Note that this is just an example of the types of checks you can do. In a real-world setup, you’ll likely customize this logic:

def _validate_result(test_case, code_execution):
    # 1. Check code executed
    if code_execution.execution.exit_code != 0:
        return {
            "passed": False,
            "details": {"error": f"Code failed to execute: {code_execution.execution.stderr}"}
        }
    elif code_execution.output_status != "valid":
        return {
            "passed": False,
            "details": {"error": f"Unsuccessful output status: {code_execution.output_status}, stderr: {code_execution.execution.stderr}"}
        }

    execution_result = code_execution.output

    # 2. Check result exists
    if "result" not in execution_result:
        return {
            "passed": False,
            "details": {"error": "Missing 'result' in output"}
        }

    # 3. Check result is valid stringified JSON
    actual_output = execution_result["result"]
    if isinstance(actual_output, str):
        try:
            json_output = json.loads(actual_output)
        except Exception:
            return {
                "passed": False,
                "details": {"error": f"Failed to return a valid JSON string. Got: {actual_output}"}
            }
    else:
        return {
            "passed": False,
            "details": {"error": f"Did not return a string. Got: {str(actual_output)}"}
        }

    # 4. Check data accuracy
    if json_output != test_case.expected_json_out:
        return {
            "passed": False,
            "details": {"error": f"Actual output did not match expected output. \nExpected: {json.dumps(test_case.expected_json_out)} \nGot: {actual_output}"}
        }

    return {
        "passed": True,
        "details": {},
    }
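
As one example of customizing these checks, you could also validate the parsed output against the desired schema itself. A minimal sketch using the third-party jsonschema package (an addition beyond this example’s code) might look like:

from jsonschema import ValidationError, validate

# Hypothetical extra check: verify the parsed JSON conforms to the
# desired schema from the test case, not just the exact expected output.
def _check_schema(test_case, json_output):
    try:
        validate(instance=json_output, schema=test_case.desired_schema)
        return {"passed": True, "details": {}}
    except ValidationError as e:
        return {
            "passed": False,
            "details": {"error": f"Schema validation failed: {e.message}"}
        }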

Now, let’s finish our evaluate_llm_code() function. We:

  • Run _validate_result() and add the output to our overall results.
  • Compute an overall score after we run all the test cases.

def evaluate_llm_code(generator_id, code_generator, test_cases):
    # ...
    for t in test_cases:
        generated_code = code_generator(t.desired_schema, t.csv_sample)
        input_data = {"data": t.csv_full}
        execution_result = _run_code(generated_code, input_data)

        validation_result = _validate_result(t, execution_result)

        case_result = {
            "id": t.id,
            "name": t.name,
            "passed": validation_result["passed"],
            "details": validation_result["details"],
            "generated_code": generated_code
        }
        results["test_case_details"].append(case_result)

        if validation_result["passed"]:
            results["summary"]["passed"] += 1
        else:
            results["summary"]["failed"] += 1


    # Overall score
    passed_count = results["summary"]["passed"]
    total_count = results["summary"]["total"]
    results["overall_score"] = round(1.0 * passed_count / total_count, 2)

    return results
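
To run the eval from the command line, add the usual entry-point guard (assuming the code above lives in eval_codegen.py and imports the generator and test cases):

eval_codegen.py
# Run the eval when the script is executed directly.
if __name__ == "__main__":
    main()

Then run python eval_codegen.py and inspect the printed JSON: it contains the per-test-case details, the pass/fail counts, and the overall_score computed above.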

Next steps

Extend this example

Get the full code for this example in our GitHub repo.

For further exploration, you can extend this example. Here are a few ideas:

  1. Expand the test cases: Add more varied and edge-case scenarios.

  2. Compare different models: Modify the code generator class to test different LLM providers and models against the same test suite.

  3. Modify the prompt: Experiment with different prompts by modifying the PROMPT template in the code. See how changes to the wording affect performance.

  4. Expand the scoring: Add more ways to check the output. You could also make _validate_result() output a numeric score that reflects a more nuanced result than “pass” versus “fail” (see the sketch below).
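
For instance, a partial-credit scorer that mirrors the checks in _validate_result() might look like this (a sketch only; the weights are arbitrary):

# Hypothetical scorer: each check adds partial credit instead of a
# single pass/fail boolean, producing a score between 0.0 and 1.0.
def _score_result(test_case, code_execution):
    score = 0.0
    if code_execution.execution.exit_code != 0:
        return score
    score += 0.25  # the generated code executed without error
    output = code_execution.output
    if "result" not in output or not isinstance(output["result"], str):
        return score
    score += 0.25  # it returned a string "result" field
    try:
        json_output = json.loads(output["result"])
    except Exception:
        return score
    score += 0.25  # the string is valid JSON
    if json_output == test_case.expected_json_out:
        score += 0.25  # and it matches the expected output exactly
    return score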

Explore Riza