# Data Extraction

> Run LLM-generated code to extract data from a website

In this guide, we'll show you a simple, real-world use case for Riza:
automatically extracting data from a website. We'll prompt an LLM to
write the scraping code, and execute that code using Riza.

<Tip>
  ### Why use Riza?

  In general, LLMs are good at writing code, but they can't
  execute the code they write.

  A common use case for Riza is to safely execute code written by LLMs.

  For example, you can ask an LLM to write code to analyze
  specific data, to generate graphs, or to extract data from a website or document.
  The code written by the LLM is "untrusted" and might have harmful side effects.
  You can protect your systems by executing that code on Riza instead of in your
  production environment.
</Tip>

<Note>
  To see data extraction integrated into an AI agent, see our guide on building [a data analyst agent with LangGraph, Browserbase, and Riza](/guides/frameworks/langgraph-gas-price-agent).
</Note>

## Scenario: Download a large dataset

Many government websites provide datasets with commercially useful information. For
example, the California Bureau of Real Estate Appraisers provides a list of all
current and recently licensed appraisers
<a href="https://www2.brea.ca.gov/breasearch/faces/party/search.xhtml" rel="noreferrer">via their site</a>.
However, it's hard to get the data out. There are over 13,000 appraisers,
presented in pages of 300 at a time, with no bulk download option.

If you want to download all the data, you'll want to automate it.
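
To get a feel for the scale, here is a rough count of the page requests a full download would need. The figures are taken from the numbers above; because the site lists *over* 13,000 appraisers, treat the result as a lower bound.

```python
import math

TOTAL_APPRAISERS = 13_000  # lower bound: the site lists "over 13,000"
PAGE_SIZE = 300            # results shown per page

# Round up, because a partially filled last page still needs its own request.
min_pages = math.ceil(TOTAL_APPRAISERS / PAGE_SIZE)
print(min_pages)  # 44
```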

## Solution: What we'll build

We'll write a script that automatically extracts each appraiser from the Real Estate
Appraisers site, and prints out the results as a CSV.

To keep this guide simple, we'll hand-write the code to download the HTML, and only ask the
LLM to write the code to extract data from the HTML.

## Example code

Get the full code for this example [in our GitHub](https://github.com/riza-io/examples/blob/main/use-cases/data_extraction_intro.py).

<Note>
  Before you begin, sign up for [Riza](https://dashboard.riza.io) and [OpenAI](https://openai.com/) API access.
  We chose OpenAI for no special reason. It's straightforward to adapt this guide to another LLM provider.
</Note>

## Step 1: Download one page of HTML

First, we'll fetch one page of HTML from the Real Estate Appraisers site.

We'll use the `httpx` library to make the web request, and the `beautifulsoup4` library to further process
the HTML. Let's install them:

```sh theme={null}
pip install httpx beautifulsoup4
```

Next, we'll write a function, `download_html_body()`, to download a page of results:

```py theme={null}
from bs4 import BeautifulSoup
import httpx

def extract_body_html(full_html):
    """Returns just the <body> of an HTML page, with all <script> tags removed."""
    soup = BeautifulSoup(full_html, "html.parser")
    body = soup.find("body")
    if body:
        for script in body.find_all("script"):
            script.decompose()
        return str(body)
    else:
        print("No <body> tag found in the HTML.")
        return None

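def download_html_body(website_url):
    """Fetches a page and returns its cleaned <body>, or None on any failure."""
    try:
        response = httpx.get(website_url)
    except httpx.HTTPError as exc:
        # Network errors, timeouts, etc. Report them instead of crashing.
        print(f"Request failed: {exc}")
        return None
    if response.status_code == 200:
        return extract_body_html(response.text)
    return None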

def main():
    WEBSITE = 'https://www2.brea.ca.gov/breasearch/faces/party/search.xhtml'
    html = download_html_body(WEBSITE)
    if html is None:
        print('Could not download HTML')
        return None
    # print(html) # optional: Print out the HTML you've extracted
```

At this point, you can uncomment the print statement above, add a call to `main()` at the bottom of the file, and run the script to see the extracted HTML.

In our code above, we include an optimization to reduce the overall size of the HTML.
After we download the HTML, we extract the `<body>` and remove all `<script>` tags using `extract_body_html()`.
This reduces the size of the content we pass to the LLM in the next step, and thus
helps us stay under LLM token limits.
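
To check whether a page is likely to fit, you can estimate its token count before calling the LLM. The sketch below is a heuristic, not an exact count: it assumes roughly four characters per token, a common rule of thumb for English text, and markup-heavy HTML may tokenize differently. The function names and the `max_tokens` budget are hypothetical; check your model's documented limit.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate. chars_per_token=4 is a rule of thumb, not an exact count."""
    return len(text) // chars_per_token


def fits_budget(text, max_tokens=100_000):
    """Returns True if the estimate is within max_tokens (a hypothetical budget)."""
    return estimate_tokens(text) <= max_tokens


# A made-up HTML body, standing in for the output of extract_body_html().
sample_body = "<body>" + "<tr><td>x</td></tr>" * 1000 + "</body>"
print(estimate_tokens(sample_body), fits_budget(sample_body))
```

For an exact count, use the tokenizer that matches your model instead of this estimate.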

## Step 2: Generate parsing code with OpenAI

In this step, we'll pass the HTML we just downloaded to OpenAI, and ask it to generate custom
code to extract our desired data from the HTML.

First, install the OpenAI SDK:

```sh theme={null}
pip install openai
```

Then import and initialize the OpenAI client:

```py theme={null}
from openai import OpenAI

openai_client = OpenAI(api_key="your OpenAI key")
```

We'll now add a `generate_code()` function, along with a prompt for the LLM:

```py theme={null}
PROMPT = """
You are given the exact HTML of a website that contains a table of results.
Write Python code to extract the data from the table. Your code should print
out the extracted data in CSV format. Include the headings of the table.

Here are the rules for writing code:
- Use print() to write the output of your code to stdout.

- Use only the Python standard library and built-in modules. For example,
do not use `pandas`, but you can use `csv`. The one exception to this rule
is that you should use `beautifulsoup4` to parse HTML.

- In order to access the raw HTML, you must use a function called `get_html()`.
Include these exact lines in your code:

import sys

def get_html():
    stdin = sys.stdin.read()
    return stdin

Finally, here is the HTML string of the website. Remember, the goal is to extract
the rows from the table of results:

{}
"""

def generate_code(site_html):
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
              "role": "system",
              "content": "You are an expert programmer. When given a programming task, " +
                  "you will only output the final code, without any explanation. " +
                  "Do NOT put quotes around the code."
            },
            {
                "role": "user",
                "content": PROMPT.format(site_html),
            }
        ]
    )
    code = completion.choices[0].message.content
    return code
```

Finally, we'll call this function in `main()`:

```py theme={null}
def main():
    WEBSITE = 'https://www2.brea.ca.gov/breasearch/faces/party/search.xhtml'
    html = download_html_body(WEBSITE)
    if html is None:
        print('Could not download HTML')
        return None

    code = generate_code(html)
    # print(code) # Optional: Print out the generated code
```

At this point, you can uncomment the print statement above, add a call to `main()` at the bottom of the file, and run the script to see the LLM-generated code.
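
Even with the system prompt above, a model may still wrap its answer in a Markdown code fence, which would not run as Python. The sketch below is one defensive option rather than part of the original script: `strip_code_fences()` is a hypothetical helper that removes a single surrounding fence if one is present.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly in Markdown

# Matches: opening fence, optional language tag, the code, closing fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text):
    """Returns the code inside one surrounding Markdown fence, or the trimmed text if unfenced."""
    match = _FENCED.match(text)
    return match.group(1) if match else text.strip()


fenced = FENCE + "python\nprint('hello')\n" + FENCE
print(strip_code_fences(fenced))  # print('hello')
```

If you adopt something like this, you could apply it to the return value of `generate_code()` before executing the code.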

### Key components of the prompt

Note that in our prompt above, we explicitly ask the LLM to do a few things:

1. **Write Python code.** We plan to execute this code in a Python runtime on Riza.
2. **Use `print()` to write the output of its code.** We ask for this because Riza's
   code execution API returns anything the script writes to `stdout`. Learn more [here](/reference/output).
3. **Use the Python standard library, plus `beautifulsoup4`.** In this guide, we're asking the LLM to write
   code to parse HTML, so we want it to be able to use the `beautifulsoup4` HTML parsing library. By default,
   Riza provides access to standard libraries. To use additional libraries, you can create a
   [custom runtime](/guides/custom-runtimes). In the next step, we'll create a custom runtime that includes
   `beautifulsoup4`.
4. **Include a specific function in its code to access the HTML.** There are [several ways](/reference/input)
   to pass input to your Riza script. In this guide, we plan to pass the HTML of the website into `stdin`, and
   we provide the LLM the exact code needed to read the input from `stdin`.
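
To make the `stdin`/`stdout` contract concrete, here is a hand-written sketch of the general shape of script this prompt asks for. It is not actual LLM output. To stay runnable without a custom runtime, it uses the standard library's `html.parser` instead of `beautifulsoup4`, and it simulates Riza's `stdin` locally with a made-up table. Apart from `get_html()`, all names are hypothetical.

```python
import csv
import io
import sys
from html.parser import HTMLParser


def get_html():
    stdin = sys.stdin.read()
    return stdin


class TableRows(HTMLParser):
    """Collects the text of each <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed, or None outside a row
        self._cell = None  # text chunks of the cell being parsed, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that contain no cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_to_csv(html):
    parser = TableRows()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


def main():
    # Input arrives on stdin; the result goes to stdout via print().
    print(html_to_csv(get_html()), end="")


# Simulate the HTML that would be passed as stdin, using made-up data.
SAMPLE = (
    "<table><tr><th>License</th><th>Name</th></tr>"
    "<tr><td>000001</td><td>DOE, JANE</td></tr></table>"
)
sys.stdin = io.StringIO(SAMPLE)
main()
sys.stdin = sys.__stdin__  # restore the real stdin
```

Note that `csv.writer` quotes the name field because it contains a comma, which keeps the column count consistent for downstream consumers.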

## Step 3: Execute the code with Riza

Now that we have LLM-generated code, we're ready to run it on Riza and finish our script.

### Step 3a. Create custom runtime

As we mentioned above, we allowed the LLM to use `beautifulsoup4` in its parsing code. To
make that library available on Riza, we'll create a [custom runtime](/guides/custom-runtimes).

Follow these steps:

1. In the Riza dashboard, select **Custom Runtimes**.
2. Click **Create runtime**.
3. In the runtime creation form, provide the following values:
   | Field            | Value            |
   | ---------------- | ---------------- |
   | Language         | Python           |
   | requirements.txt | `beautifulsoup4` |
4. Click **Create runtime**.
5. Wait for the **Status** of your runtime revision to become "Succeeded".
6. Copy the **ID** of your runtime **revision** (not the runtime) to use in the next step.

### Step 3b. Call the Riza API

Now, let's add the final pieces of code to finish our script.

First, install the Riza API client library:

```sh theme={null}
pip install rizaio
```

Then import and initialize the Riza client. Note that there are multiple ways to set your API key:

```py theme={null}
from rizaio import Riza

# Option 1: Pass your API key directly:
riza_client = Riza(api_key="your Riza API key")

# Option 2: Set the `RIZA_API_KEY` environment variable
riza_client = Riza() # Will use `RIZA_API_KEY`
```

Let's add a function, `run_code()`, that calls the Riza Code Interpreter API and uses our custom runtime.
Make sure to fill in your own runtime ID:

```py theme={null}
def run_code(code, input_data):
    result = riza_client.command.exec(
        language="python",
        runtime_revision_id="<the ID of your runtime revision>",
        stdin=input_data,
        code=code,
    )
    if result.exit_code != 0:
        print("Code did not execute successfully. Error:")
        print(result.stderr)
    elif result.stdout == "":
        print("Code executed successfully but produced no output. "
            "Ensure your code includes print statements to get output.")
    return result.stdout
```
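
`run_code()` reports a failure but does not recover from it. Because generated code can differ from one LLM call to the next, one option is to regenerate and retry a few times. The sketch below shows that control flow only, under stated assumptions: `generate_and_run()`, `ExecResult`, and the `generate` and `execute` callables are hypothetical stand-ins (for `generate_code()` and a wrapper that returns the full Riza result), so the example runs without any API keys.

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Stand-in for the three result fields this guide reads: exit_code, stdout, stderr."""
    exit_code: int
    stdout: str
    stderr: str


def generate_and_run(generate, execute, input_data, max_attempts=3):
    """Regenerates code until one attempt exits cleanly with output, or attempts run out.

    generate(input_data) -> str of code
    execute(code, input_data) -> object with exit_code, stdout, and stderr
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(input_data)
        result = execute(code, input_data)
        if result.exit_code == 0 and result.stdout:
            return result.stdout
        last_error = result.stderr or "no output"
        print(f"Attempt {attempt} failed: {last_error}")
    raise RuntimeError(f"No usable output after {max_attempts} attempts: {last_error}")


# Fake callables: the first execution fails, the second succeeds.
fake_results = iter([ExecResult(1, "", "SyntaxError"), ExecResult(0, "a,b\n", "")])
output = generate_and_run(
    generate=lambda html: "print('a,b')",
    execute=lambda code, html: next(fake_results),
    input_data="<body></body>",
)
print(output, end="")
```

Each retry costs another LLM call and another execution, so keep `max_attempts` small.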

Finally, we'll update our `main()` function to run the generated code:

```py theme={null}
from bs4 import BeautifulSoup
import httpx
from openai import OpenAI
from rizaio import Riza

# ... helper functions ...

def main():
    WEBSITE = 'https://www2.brea.ca.gov/breasearch/faces/party/search.xhtml'
    html = download_html_body(WEBSITE)
    if html is None:
        print('Could not download HTML')
        return None

    code = generate_code(html)
    result = run_code(code, html)

    print('Extracted data:\n\n{}'.format(result))

if __name__ == "__main__":
    main()
```

The script is now complete. Run it to extract the real estate appraisers from the main page:

```text Output theme={null}
License,Name,Company,Address,City,State,Zip,County,Phone
001507,BARNWELL, BRIAN,Brian B. Barnwell,1830 Castillo St,Santa Barbara,CA,93101,Santa Barbara,(805) 708-4690
001509,BOEHM, MICHAEL,Senior Living Valuation Services Inc,1458 Sutter St,San Francisco,CA,94109,San Francisco,(415) 385-2832
001511,KETCHAM, DANIEL,Daniel R. Ketcham & Associates,11693 Brunswick Pines Rd,Grass Valley,CA,95945,Nevada,(530) 477-8056
...
```

## Benefits of code generation

You might wonder why you should use the code generation approach we present in this guide.

There is one obvious alternative: instead of asking an LLM to write code to extract data, why not ask it to extract the data
directly?

In our experience, there are two compelling reasons.

### Reliability

LLMs don't always follow instructions. When we passed the HTML of this website directly to OpenAI and asked it
to extract *all* the rows, it gave us *some* of the rows. Using the method in this guide, we can consistently extract all rows.
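
Whichever approach you use, it helps to verify the result rather than trust it. The sketch below is one possible check, not part of the original script: `check_csv()` is a hypothetical helper that confirms every row has as many fields as the header and, optionally, that the row count matches what you expect (for example, the page size).

```python
import csv
import io


def check_csv(text, expected_rows=None):
    """Returns a list of problems found in CSV text; an empty list means it looks consistent."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["no rows found"]
    problems = []
    width = len(rows[0])  # the header defines the expected number of fields
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != width:
            problems.append(f"data row {number}: expected {width} fields, got {len(row)}")
    data_rows = len(rows) - 1
    if expected_rows is not None and data_rows != expected_rows:
        problems.append(f"expected {expected_rows} data rows, got {data_rows}")
    return problems


# Made-up data: the second data row has an extra field.
print(check_csv("License,Name\n000001,DOE\n000002,ROE,extra\n", expected_rows=2))
```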

### Speed and cost on large datasets

This code generation approach shines on large datasets. If you want to extract data from many pages that share the
same format, code generation is faster and reduces LLM costs: you ask the LLM to write code once, instead of
calling it on each individual page.
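
The reuse pattern can be sketched as follows. This guide only downloads the first page, so the sketch assumes you have already downloaded the HTML of every page by some other means. `extract_pages()` and its `execute` callable are hypothetical: `execute` stands in for something like `run_code()` from Step 3b, and is faked here so the example runs without API keys.

```python
import csv
import io


def extract_pages(code, pages_html, execute):
    """Runs the same generated code on every page and merges the CSV output.

    execute(code, html) -> CSV text for one page
    """
    merged = []
    header = None
    for html in pages_html:
        rows = list(csv.reader(io.StringIO(execute(code, html))))
        if not rows:
            continue  # this page produced no output
        if header is None:
            header = rows[0]
            merged.append(header)
        # Each page repeats the header row; keep only the first copy.
        merged.extend(rows[1:] if rows[0] == header else rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(merged)
    return out.getvalue()


# Fake executor with made-up data: pretends each page yields a header plus one row.
def fake_execute(code, html):
    return "License,Name\n" + html + ",EXAMPLE\n"


print(extract_pages("generated code", ["000001", "000002"], fake_execute), end="")
```

In a real run, you would call the LLM once on the first page to produce `code`, then pass that same string to every execution.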

## Next Steps

* Get the full code for this example [in our GitHub](https://github.com/riza-io/examples/blob/main/use-cases/data_extraction_intro.py).
* [Try out the API](https://riza.io/playground).
* Learn how to use the Code Interpreter API with tool use APIs from [OpenAI](/tool-use/openai), [Anthropic](/tool-use/anthropic) and [Google](/tool-use/gemini).
* Check out the [roadmap](/reference/roadmap) to see what we're working on next.
