Data Extraction
Run LLM-generated code to extract data from a website
In this guide, we’ll show you a simple, real-world use case for Riza: automatically extracting data from a website. We’ll prompt an LLM to write the scraping code, and execute that code using Riza.
Why Riza?
In general, LLMs are good at writing code, but they can’t execute the code they write.
A common use case for Riza is to safely execute code written by LLMs.
For example, you can ask an LLM to write code to analyze specific data, to generate graphs, or to extract data from a website or document. The code written by the LLM is “untrusted” and might contain harmful side-effects. You can protect your systems by executing that code on Riza instead of in your production environment.
Scenario: Download a large dataset
Many government websites provide datasets with commercially-useful information. For example, the California Bureau of Real Estate Appraisers provides a list of all current and recently-licensed appraisers via their site. However, it’s hard to get the data out. There are over 13,000 appraisers, presented in pages of 300 at a time, with no bulk download option.
If you want to download all the data, you’ll want to automate it.
Solution: What we’ll build
We’ll write a script that automatically extracts each appraiser from the Real Estate Appraisers site, and prints out the results as a CSV.
To keep this guide simple, we’ll hand-write the code to download the HTML, and only ask the LLM to write the code to extract data from the HTML.
Step 1: Download one page of HTML
First, we’ll fetch one page of HTML from the Real Estate Appraisers site.
We’ll use the httpx library to make the web request, and the beautifulsoup4 library to further process the HTML. Let’s install them:
Next, we’ll write a function, download_html_body(), to download a page of results:
At this point, you can uncomment the print statement above and run the script to see the extracted HTML.
In our code above, we include an optimization to reduce the overall size of the HTML.
After we download the HTML, we extract the <body> and remove all <script> tags using extract_body_html(). This reduces the size of the content we pass to the LLM in the next step, and thus helps us stay under LLM token limits.
Step 2: Generate parsing code with OpenAI
In this step, we’ll pass the HTML we just downloaded to OpenAI, and ask it to generate custom code to extract our desired data from the HTML.
First, install and initialize the OpenAI SDK:
Import and initialize the OpenAI client.
We’ll now add a generate_code() function, along with a prompt for the LLM:
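Something along these lines; the prompt wording and the model name are illustrative assumptions, not the guide’s exact values:

```python
# Assumes: pip install openai, with OPENAI_API_KEY set in the environment.

def build_prompt(html: str) -> str:
    """Assemble the instructions we send to the LLM."""
    return (
        "Write a Python program that extracts every appraiser row from the "
        "HTML below and prints the results as CSV using print(). "
        "Use only the Python standard library plus beautifulsoup4. "
        "Read the HTML from stdin with exactly this code:\n"
        "import sys\nhtml = sys.stdin.read()\n"
        "Respond with only the Python code, no explanation.\n\n"
        f"HTML:\n{html}"
    )

def generate_code(html: str) -> str:
    from openai import OpenAI  # imported lazily; requires the openai package
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute your preferred one
        messages=[{"role": "user", "content": build_prompt(html)}],
    )
    return response.choices[0].message.content
```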
Finally, we’ll call this function in main():
At this point, you can uncomment the print statement above and run the script to see the LLM-generated code.
Key components of the prompt
Note that in our prompt above, we explicitly ask the LLM to do a few things:
- Write Python code. We plan to execute this code in a Python runtime on Riza.
- Use print() to write the output of its code. We make this ask because of how Riza works: Riza’s code execution API will return anything the script writes to stdout. Learn more here.
- Use the Python standard library, plus beautifulsoup4. In this guide, we’re asking the LLM to write code to parse HTML, so we want it to be able to use the beautifulsoup4 HTML parsing library. By default, Riza provides access to standard libraries. To use additional libraries, you can create a custom runtime. In the next step, we’ll create a custom runtime that includes beautifulsoup4.
- Include a specific function in its code to access the HTML. There are several ways to pass input to your Riza script. In this guide, we plan to pass the HTML of the website into stdin, and we provide the LLM the exact code needed to read the input from stdin.
Step 3: Execute the code with Riza
Now that we have LLM-generated code, we’re ready to run it on Riza and finish our script.
Step 3a. Create custom runtime
As we mentioned above, we allowed the LLM to use beautifulsoup4 in its parsing code. To make that library available on Riza, we’ll create a custom runtime.
Follow these steps:
- In the Riza dashboard, select Custom Runtimes.
- Click Create runtime.
- In the runtime creation form, provide the following values:
Language: Python
requirements.txt: beautifulsoup4
- Click Create runtime.
- Wait for the Status of your runtime revision to become “Succeeded”.
- Copy the ID of your runtime revision (not the runtime) to use in the next step.
Step 3b. Call the Riza API
Now, let’s add the final pieces of code to finish our script.
First, install and initialize the Riza SDK:
Import and initialize the Riza client. Note that the SDK offers multiple ways to set your API key:
Let’s add a function, run_code(), that calls the Riza Code Interpreter API and uses our custom runtime. Make sure to fill in your own runtime ID:
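A sketch of what this could look like with the rizaio SDK. The parameter names here (language, code, stdin, runtime_revision_id) are assumptions based on the Code Interpreter API; verify them against the current SDK reference:

```python
# Assumes: pip install rizaio, with RIZA_API_KEY set in the environment.
RUNTIME_REVISION_ID = "YOUR_RUNTIME_REVISION_ID"  # paste your revision ID here

def build_exec_request(code: str, stdin: str) -> dict:
    """Assemble the arguments for the code-execution call."""
    return {
        "language": "python",  # the API may expect "PYTHON"; check the docs
        "code": code,
        "stdin": stdin,
        "runtime_revision_id": RUNTIME_REVISION_ID,
    }

def run_code(code: str, stdin: str) -> str:
    from rizaio import Riza  # imported lazily; requires the rizaio package
    client = Riza()  # reads RIZA_API_KEY from the environment
    response = client.command.exec(**build_exec_request(code, stdin))
    return response.stdout
```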
Finally, we’ll update our main() function to run the generated code:
The script is now complete. Run it to extract the real estate appraisers from the main page:
Benefits of code generation
You might wonder why you should use the code generation approach we present in this guide.
There is one obvious alternative: instead of asking an LLM to write code to extract data, why not ask it to extract the data directly?
In our experience, there are two compelling reasons.
Reliability
LLMs don’t always follow instructions. When we passed the HTML of this website directly to OpenAI and asked it to extract all the rows, it gave us some of the rows. Using the method in this guide, we can consistently extract all rows.
Improve speed and cost on large datasets
This code generation approach shines on large datasets. If you want to extract data from many pages that are all in the same format, using code generation will speed you up and reduce LLM costs. You only have to ask the LLM to write code once, compared to calling the LLM on each individual page.
Next Steps
- Try out the API.
- Learn how to use the Code Interpreter API with tool use APIs from OpenAI, Anthropic and Google.
- Check out the roadmap to see what we’re working on next.