Data Transformation

In this guide, we’ll show you how to use Riza to transform a given dataset into another format. We’ll prompt an LLM to write the code to transform the data, and execute that code using Riza.

Why use Riza?

In general, LLMs are good at writing code, but they can’t execute the code they write. A common use case for Riza is to safely execute code written by LLMs. For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side effects. You can protect your systems by executing that code on Riza instead of in your production environment.

Scenario: Serve a dataset in custom formats

Many government websites provide datasets with commercially useful information. For example, the California Bureau of Real Estate Appraisers provides a list of all current and recently-licensed appraisers via their site. In the Data Extraction guide, we showed how to automatically extract the appraisers from this site. We got the data out as a CSV with nine columns. Now, what if we want to let people get this data in whatever format they want? For example, one person might want just three of the nine fields, in a specific JSON schema.

Solution: Generate & run custom transformation code

We’ll build a script that automatically transforms our dataset to a desired final JSON format. In this script, we’ll prompt an LLM to write code that does the transformation, and we’ll safely execute that code using Riza.

Example input

License,Name,Company,Address,City,State,Zip,County,Phone
001507,"Smith, Amy",Amy Smith Inc.,1830 Castillo St,Santa Barbara,CA,93101,Santa Barbara,(123) 456-7890
001508,"Smith, Bob",Bob Smith Inc.,1458 Sutter St,San Francisco,CA,94109,San Francisco,(987) 654-3210

Example output

{"appraisers": [
    {"name": "Smith, Amy", "phone": "(123) 456-7890", "license": "001507"},
    {"name": "Smith, Bob", "phone": "(987) 654-3210", "license": "001508"},
    ...
]}

Benefits of code generation

Compared to asking an LLM directly to transform your data, using code generation as shown in this example can be more reliable, faster, and more cost-effective, especially for larger datasets. You only have to give the LLM a small part of your data and ask it to write code once, as opposed to calling the LLM with your entire dataset.

Example code

Get the full code and data for this example in our GitHub. The data we’ve prepared is a subset of the full California Bureau of Real Estate Appraisers dataset. We created this dataset by running the code in our Data Extraction guide.

Before you begin, sign up for Riza and Anthropic API access. There is no special reason we chose Anthropic for this use case, and it’s straightforward to adapt the implementation to another LLM provider.

Step 1: Read in data from CSV

First, we’ll read in the data from our CSV.

INPUT_CSV_FILEPATH = "/path/to/appraisers.csv"

def read_file(filepath):
    with open(filepath, "r", encoding="utf-8") as file:
        content = file.read()
    return content

def main():
    full_rows = read_file(INPUT_CSV_FILEPATH)

Step 2: Generate data transformation code with LLM

In this step, we’ll pass a few lines of the CSV we just read to Anthropic, and ask it to generate custom code to transform our CSV data into a specified JSON format. First, install the Anthropic SDK:

pip install anthropic

Import and initialize the Anthropic client.

import anthropic

# Option 1: Pass your API key directly:
anthropic_client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Option 2: Set the `ANTHROPIC_API_KEY` environment variable
anthropic_client = anthropic.Anthropic() # Will use `ANTHROPIC_API_KEY`

We’ll now add a generate_code() function, along with a prompt for the LLM:

PROMPT = """
You are given raw CSV data of a list of people.

Write a Python function that transforms the raw text into a JSON object with the following fields:
{{
  "type": "object",
  "properties": {{
    "appraisers": {{
      "type": "array",
      "items": {{
          "type": "object",
          "properties": {{
            "name": {{
              "type": "string"
            }},
            "phone": {{
              "type": "string"
            }},
            "license": {{
              "type": "string"
            }}
          }}
      }}
    }}
  }}
}}

The function signature of the Python function must be:

def execute(input):

`input` is a Python dictionary. The full CSV data is available as text at `input["data"]`.

Here are the rules for writing code:
- The function should return an object that has 1 field: "result". The "result" data should be a stringified JSON object.
- Use only the Python standard library and built-in modules.

Finally, here are a few lines of the raw text of the CSV:

{}
"""

def generate_code(sample_data):
    message = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=2048,
        system="You are an expert programmer. When given a programming task, " +
           "you will only output the final code, without any explanation. " +
           "Do NOT put the code in a codeblock.",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(sample_data),
            }
        ]
    )
    code = message.content[0].text
    return code

Finally, we’ll call generate_code(sample_data) in main(). We’ll only send a few rows of our CSV data to the LLM, because that’s all it needs to understand the shape of the data:

import itertools

def first_n_lines(text, n):
    return "\n".join(itertools.islice(text.splitlines(), n))

def main():
    full_rows = read_file(INPUT_CSV_FILEPATH)

    first_rows = first_n_lines(full_rows, 10)
    python_code = generate_code(first_rows)
    # Optional: print the generated code
    # print(python_code)
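
Even when the system prompt asks for bare code, a model may occasionally wrap its answer in a Markdown code fence, which would fail to run as Python. The sketch below is an optional safeguard and not part of the original script; `strip_code_fences` is a hypothetical helper name, and it only handles the simple case of a single fence surrounding the whole response.

```python
import re

# Three backticks, built programmatically so this snippet can itself sit inside a fence.
FENCE = "`" * 3

# Matches a response that consists of exactly one fenced block, with an optional
# language tag (for example "python") after the opening fence.
FENCED_BLOCK = re.compile(
    r"^\s*" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text):
    """Return the code inside a single surrounding fence, or the text unchanged."""
    match = FENCED_BLOCK.match(text)
    return match.group(1) if match else text.strip()
```

If you adopt something like this, you could wrap the call in main() as `python_code = strip_code_fences(generate_code(first_rows))`.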

Key components of the prompt

Note that in our prompt above, we explicitly ask the LLM to do a few things:

Write Python code. We plan to execute this code in a Python runtime on Riza.
Write code to transform data to a specific JSON format. In this example, we’ve provided a formal JSON schema, but models may be able to understand less formal definitions too.
Write a function that reads data from an object and returns an object. We plan to use Riza’s Execute Function API to run this code. The Execute Function API lets us pass in an input object and receive an output object.
Use the Python standard library. By default, Riza provides access to standard libraries. If you want to execute code with additional libraries, you can create a custom runtime. You can see an example of using custom runtimes in our Data Analysis Guide.
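
To make these requirements concrete, here is an illustrative sketch of the kind of function the LLM might return for this prompt. It is hand-written for this guide, not actual model output, and real responses will vary. It assumes the CSV header uses the column names shown in the example input (`License`, `Name`, `Phone`).

```python
import csv
import io
import json


def execute(input):
    # Parse the CSV text. csv.DictReader handles quoted fields such as "Smith, Amy".
    reader = csv.DictReader(io.StringIO(input["data"]))

    # Keep only the three fields requested in the JSON schema.
    appraisers = [
        {"name": row["Name"], "phone": row["Phone"], "license": row["License"]}
        for row in reader
    ]

    # The prompt asks for one field, "result", holding stringified JSON.
    return {"result": json.dumps({"appraisers": appraisers})}
```

Because the function only depends on `input["data"]`, the same code works whether the dataset has two rows or many thousands, which is why sending a small sample to the LLM is enough.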

Step 3: Execute the code on Riza

Now that we have LLM-generated code, we’re ready to run it on Riza and finish our script. First, install the Riza API client library:

pip install rizaio

Import and initialize the Riza client. Note that there are multiple ways to set your API key:

from rizaio import Riza

# Option 1: Pass your API key directly:
riza_client = Riza(api_key="your Riza API key")

# Option 2: Set the `RIZA_API_KEY` environment variable
riza_client = Riza() # Will use `RIZA_API_KEY`

Let’s add a function, run_code(), that calls the Riza Execute Function API:

def run_code(code, input_data):
    print("Running code on Riza...")
    result = riza_client.command.exec_func(
        language="python",
        input=input_data,
        code=code,
    )
    if result.execution.exit_code != 0:
        print("Code did not execute successfully. Error:")
        print(result.execution.stderr)
    elif result.output_status != "valid":
        print("Unsuccessful output status:")
        print(result.output_status)
    return result.output

Finally, we’ll update our main() function to run the generated code, and print the resulting JSON:

import itertools
from rizaio import Riza
import anthropic

# ... other functions ...

def main():
    full_rows = read_file(INPUT_CSV_FILEPATH)

    first_rows = first_n_lines(full_rows, 10)
    python_code = generate_code(first_rows)

    input_data = {
        "data": full_rows,
    }
    output = run_code(python_code, input_data)
    print(output)


if __name__ == "__main__":
    main()

The script is now complete. Run it to produce the desired JSON-formatted data.
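
Since the transformation code is LLM-generated, you may want to check its output before serving it to anyone. The sketch below is one possible check, not part of the original script. It assumes the output is a dictionary whose "result" field holds stringified JSON, as the prompt requests; `parse_appraisers` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration.

```python
import json

# The three fields requested in the prompt's JSON schema.
REQUIRED_FIELDS = {"name", "phone", "license"}


def parse_appraisers(output):
    """Parse and sanity-check the object returned by the generated function."""
    if not isinstance(output, dict) or "result" not in output:
        raise ValueError('expected an object with a "result" field')

    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both malformed JSON and schema problems.
    parsed = json.loads(output["result"])
    if not isinstance(parsed, dict) or not isinstance(parsed.get("appraisers"), list):
        raise ValueError('expected a JSON object with an "appraisers" list')

    for i, item in enumerate(parsed["appraisers"]):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} is not an object")
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"item {i} is missing fields: {sorted(missing)}")

    return parsed["appraisers"]
```

In main(), you could call `parse_appraisers(output)` in place of `print(output)` and handle a `ValueError` by regenerating the code or reporting the failure.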

Next steps

Get the full code for this example in our GitHub.
Try out the API.
Learn how to use the Riza API with tool use APIs from OpenAI, Anthropic and Google.
Check out the roadmap to see what we’re working on next.
