Data Extraction
Run LLM-generated code to extract data from a website
In this guide, we’ll show you a simple, real-world use case for Riza: automatically extracting data from a website. We’ll prompt an LLM to write the scraping code, and execute that code using Riza.
Why Riza?
In general, LLMs are good at writing code, but they can’t execute the code they write.
A common use case for Riza is to safely execute code written by LLMs.
For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side-effects. You can protect your systems by executing that code on Riza instead of in your production environment.
Scenario: Download a large dataset
Many government websites provide datasets with commercially-useful information. For example, the California Bureau of Real Estate Appraisers provides a list of all current and recently-licensed appraisers via their site. However, it’s hard to get the data out. There are over 13,000 appraisers, presented in pages of 300 at a time, with no bulk download option.
If you want to download all the data, you’ll want to automate it.
Solution: What we’ll build
We’ll write a script that automatically extracts each appraiser from the Real Estate Appraisers site, and prints out the results as a CSV.
To keep this guide simple, we’ll hand-write the code to download the HTML, and only ask the LLM to write the code to extract data from the HTML.
Step 1: Download one page of HTML
First, we’ll fetch one page of HTML from the Real Estate Appraisers site.
We’ll use the httpx library to make the web request, and the beautifulsoup4 library to further process the HTML. Let’s install them:
Next, we’ll write a function, download_html_body(), to download a page of results:
At this point, you can uncomment the print statement above and run the script to see the extracted HTML.
In our code above, we include an optimization to reduce the overall size of the HTML.
After we download the HTML, we extract the <body> and remove all <script> tags using extract_body_html(). This reduces the size of the content we pass to the LLM in the next step, and thus helps us stay under LLM token limits.
Step 2: Generate parsing code with OpenAI
In this step, we’ll pass the HTML we just downloaded to OpenAI, and ask it to generate custom code to extract our desired data from the HTML.
First, install and initialize the OpenAI SDK:
Import and initialize the OpenAI client.
We’ll now add a generate_code() function, along with a prompt for the LLM:
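Something along these lines; the prompt wording and the model name are illustrative assumptions, not the guide’s exact values:

```python
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.

def build_prompt(html: str) -> str:
    """Assemble the instructions we send to the LLM."""
    return (
        "Write a Python program that extracts every appraiser row from the "
        "HTML below and prints the results as CSV using print(). "
        "Use only the Python standard library plus beautifulsoup4. "
        "Read the HTML from stdin with exactly this code:\n"
        "import sys\nhtml = sys.stdin.read()\n"
        "Respond with only the Python code, no explanation.\n\n"
        f"HTML:\n{html}"
    )

def generate_code(html: str) -> str:
    from openai import OpenAI  # imported lazily; requires the openai package
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute your preferred one
        messages=[{"role": "user", "content": build_prompt(html)}],
    )
    return response.choices[0].message.content
```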
Finally, we’ll call this function in main():
At this point, you can uncomment the print statement above and run the script to see the LLM-generated code.
Key components of the prompt
Note that in our prompt above, we explicitly ask the LLM to do a few things:
- Write Python code. We plan to execute this code in a Python runtime on Riza.
- Use print() to write the output of its code. We make this ask because of how Riza works: Riza’s code execution API will return anything the script writes to stdout. Learn more here.
- Use the Python standard library, plus beautifulsoup4. In this guide, we’re asking the LLM to write code to parse HTML, so we want it to be able to use the beautifulsoup4 HTML parsing library. By default, Riza provides access to standard libraries. To use additional libraries, you can create a custom runtime. In the next step, we’ll create a custom runtime that includes beautifulsoup4.
- Include a specific function in its code to access the HTML. There are several ways to pass input to your Riza script. In this guide, we plan to pass the HTML of the website into stdin, and we provide the LLM the exact code needed to read the input from stdin.
Step 3: Execute the code with Riza
Now that we have LLM-generated code, we’re ready to run it on Riza and finish our script.
Step 3a. Create custom runtime
As we mentioned above, we allowed the LLM to use beautifulsoup4 in its parsing code. To make that library available on Riza, we’ll create a custom runtime.
Follow these steps:
- In the Riza dashboard, select Custom Runtimes.
- Click Create runtime.
- In the runtime creation form, provide the following values:
Language: Python
requirements.txt: beautifulsoup4
- Click Create runtime.
- Wait for the Status of your runtime revision to become “Succeeded”.
- Copy the ID of your runtime revision (not the runtime) to use in the next step.
Step 3b. Call the Riza API
Now, let’s add the final pieces of code to finish our script.
First, install and initialize the Riza SDK:
Import and initialize the Riza client. Note that the SDK offers multiple ways to set your API key:
Let’s add a function, run_code(), that calls the Riza Code Interpreter API and uses our custom runtime. Make sure to fill in your own runtime ID:
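A sketch of what this could look like with the rizaio SDK. The parameter names here (language, code, stdin, runtime_revision_id) are assumptions based on the Code Interpreter API; verify them against the current SDK reference:

```python
# Assumes: pip install rizaio, with RIZA_API_KEY set in the environment.
RUNTIME_REVISION_ID = "YOUR_RUNTIME_REVISION_ID"  # paste your revision ID here

def build_exec_request(code: str, stdin: str) -> dict:
    """Assemble the arguments for the code-execution call."""
    return {
        "language": "python",  # the API may expect "PYTHON"; check the docs
        "code": code,
        "stdin": stdin,
        "runtime_revision_id": RUNTIME_REVISION_ID,
    }

def run_code(code: str, stdin: str) -> str:
    from rizaio import Riza  # imported lazily; requires the rizaio package
    client = Riza()  # reads RIZA_API_KEY from the environment
    response = client.command.exec(**build_exec_request(code, stdin))
    return response.stdout
```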
Finally, we’ll update our main() function to run the generated code:
The script is now complete. Run it to extract the real estate appraisers from the main page:
Benefits of code generation
You might wonder why you should use the code generation approach we present in this guide.
There is one obvious alternative: instead of asking an LLM to write code to extract data, why not ask it to extract the data directly?
In our experience, there are two compelling reasons.
Reliability
LLMs don’t always follow instructions. When we passed the HTML of this website directly to OpenAI and asked it to extract all the rows, it gave us some of the rows. Using the method in this guide, we can consistently extract all rows.
Improve speed and cost on large datasets
This code generation approach shines on large datasets. If you want to extract data from many pages that are all in the same format, using code generation will speed you up and reduce LLM costs. You only have to ask the LLM to write code once, compared to calling the LLM on each individual page.
Next Steps
- Try out the API.
- Learn how to use the Code Interpreter API with tool use APIs from OpenAI, Anthropic and Google.
- Check out the roadmap to see what we’re working on next.