Run LLM-generated code to extract data from a website
We’ll use the `httpx` library to make the web request, and the `beautifulsoup4` library to further process the HTML. Let’s install them with pip: `pip install httpx beautifulsoup4`.
We’ll write a function, `download_html_body()`, to download a page of results:
Next, we’ll extract the `<body>` and remove all `<script>` tags using `extract_body_html()`.
This reduces the size of the content we pass to the LLM in the next step, and thus
helps us stay under LLM token limits.
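One way `extract_body_html()` might be written with `beautifulsoup4` is sketched below. The choice of the built-in `html.parser` backend and the empty-string result when no `<body>` is found are assumptions, not details taken from the guide.

```python
from bs4 import BeautifulSoup


def extract_body_html(html: str) -> str:
    """Return the <body> element as a string, with all <script> tags removed."""
    # Assumption: Python's built-in parser is good enough for this page.
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        # Assumption: treat a document without <body> as having no content.
        return ""
    for script in body.find_all("script"):
        # decompose() removes the tag and its contents from the tree.
        script.decompose()
    return str(body)
```

Everything in `<head>` and every inline script is dropped, so the string sent to the LLM is shorter than the original document.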
Now we’ll write a `generate_code()` function, along with a prompt for the LLM:
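The sketch below shows one possible shape for `generate_code()` and its prompt. The guide's actual prompt text and LLM provider are not shown here, so the prompt wording, the `call_llm` callable (a hypothetical stand-in for whichever LLM client you use), and the fence-stripping helper are all assumptions.

```python
import re
from typing import Callable

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable

# Assumed prompt wording; adapt it to the data you want to extract.
PROMPT_TEMPLATE = """\
Write a Python script that extracts the data from the HTML below.

Rules:
- Read the HTML from stdin using exactly this code:
    import sys
    html = sys.stdin.read()
- You may use the beautifulsoup4 library to parse the HTML.
- Use print() to write the extracted data. Only stdout is returned.
- Reply with the code only.

HTML:
{html}
"""


def strip_code_fences(text: str) -> str:
    """Return the contents of the first fenced block, or the text itself."""
    pattern = re.escape(FENCE) + r"(?:python)?\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    return (match.group(1) if match else text).strip()


def generate_code(html: str, call_llm: Callable[[str], str]) -> str:
    """Build the prompt, send it to the LLM, and return the generated code."""
    prompt = PROMPT_TEMPLATE.format(html=html)
    # call_llm is hypothetical: it takes a prompt string and returns the reply text.
    return strip_code_fences(call_llm(prompt))
```

Stripping Markdown fences is a defensive step, since models often wrap code in them even when asked not to.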
Finally, we’ll tie these steps together in `main()`.
The prompt covers three points:

- We ask the LLM to use `print()` to write the output of its code. We ask for this because of how Riza works: Riza’s code execution API returns anything the script writes to `stdout`.
- We tell the LLM it can use `beautifulsoup4`. In this guide, we’re asking the LLM to write code to parse HTML, so we want it to be able to use the `beautifulsoup4` HTML parsing library. By default, Riza provides access to standard libraries. To use additional libraries, you can create a custom runtime. In the next step, we’ll create a custom runtime that includes `beautifulsoup4`.
- We tell the LLM to read its input from `stdin`, and we give it the exact code needed to do so.
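For reference, a script that follows these prompt rules might look like the sketch below. It is hand-written for illustration, not actual LLM output; the `extract_links()` helper and the choice to extract links are hypothetical.

```python
import json
import sys

from bs4 import BeautifulSoup


def extract_links(html: str) -> list:
    """Hypothetical extraction step: collect the text and href of every link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def main(stream=sys.stdin) -> None:
    # Rule: read the input from stdin.
    html = stream.read()
    # Rule: print() the result, because only stdout is returned to the caller.
    print(json.dumps(extract_links(html)))


# A real script would end by calling main().
```

Printing JSON rather than free-form text makes the output easy to parse on the calling side.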
The LLM may use `beautifulsoup4` in its parsing code. To make that library available on Riza, we’ll create a custom runtime.
Follow these steps:
| Field | Value |
|---|---|
| Language | Python |
| `requirements.txt` | `beautifulsoup4` |
We’ll write a function, `run_code()`, that calls the Riza Code Interpreter API and uses our custom runtime. Make sure to fill in your own runtime ID:
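The sketch below outlines the logic of `run_code()` without depending on a specific SDK version. `exec_command` is a hypothetical stand-in for the Riza client's execution call, and the keyword names (`language`, `code`, `stdin`, `runtime_revision_id`) and result fields (`exit_code`, `stdout`, `stderr`) are assumptions to check against the Riza API reference.

```python
from typing import Any, Callable

# Placeholder: replace with the ID of your own custom runtime.
RIZA_RUNTIME_ID = "<your-runtime-id>"


def run_code(
    code: str,
    html: str,
    exec_command: Callable[..., Any],
    runtime_id: str = RIZA_RUNTIME_ID,
) -> str:
    """Run the generated code in the custom runtime and return its stdout."""
    # Assumption: the execution call accepts these keyword arguments.
    result = exec_command(
        language="python",
        code=code,
        stdin=html,  # the generated script reads the HTML from stdin
        runtime_revision_id=runtime_id,
    )
    # Assumption: the result exposes exit_code, stdout and stderr.
    if result.exit_code != 0:
        raise RuntimeError(
            f"Generated code failed with exit code {result.exit_code}: {result.stderr}"
        )
    return result.stdout
```

Checking the exit code matters here because LLM-generated code can fail, and in that case `stdout` may be empty or incomplete.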
Finally, we’ll update our `main()` function to run the generated code:
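As a rough sketch of the overall flow, the version of `main()` below takes each step as a parameter so it can run on its own. In the guide, `main()` presumably calls `download_html_body()`, `extract_body_html()`, `generate_code()`, and `run_code()` directly; the parameter names and the decision to pass the trimmed body HTML to the generated code are assumptions.

```python
from typing import Callable


def main(
    url: str,
    download: Callable[[str], str],
    extract: Callable[[str], str],
    generate: Callable[[str], str],
    run: Callable[[str, str], str],
) -> str:
    """Download a page, generate parsing code for it, and run that code."""
    html = download(url)         # 1. fetch the raw page
    body = extract(html)         # 2. keep <body>, drop <script> tags
    code = generate(body)        # 3. ask the LLM for a parsing script
    output = run(code, body)     # 4. execute the script with the HTML as input
    print(output)
    return output
```

If a step needs extra arguments, such as an LLM client or a runtime ID, `functools.partial` can bind them before the function is passed in.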