A planning_agent creates and refines high-level plans, and a web_agent observes the screen and executes low-level actions.
The web_agent can analyze “Set-of-Marks” (SoM) screenshots, which are visual representations of the page with interactive elements highlighted, enabling it to perform complex visual reasoning.
VideoAnalysisToolkit.
BrowserToolkit. You can configure the underlying models for the planning and web agents.
browse_url
The browse_url function takes a high-level task and a starting URL, and then autonomously navigates the web to complete the task.
The browse_url function orchestrates a loop between the planning_agent and the web_agent.
Planning: The planning_agent creates a high-level plan to accomplish the task.
Observation: The web_agent observes the current page by taking a “Set-of-Marks” (SoM) screenshot.
Action: The web_agent decides on the next action to take (e.g., click, type, scroll).
Execution: The web_agent executes the chosen action.
Replanning: If the web_agent gets stuck, the planning_agent can re-evaluate the situation and create a new plan.
cookies.json file or a user_data_dir.
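The planning, observation, action, execution, and replanning loop described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the toolkit's implementation: StubPlanner, StubWebAgent, browse_url_sketch, the round_limit parameter, the "stuck" and "stop" signals, and the example URL are all hypothetical names introduced here for illustration, and no real browser or model is involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubPlanner:
    """Hypothetical stand-in for the planning_agent."""
    plans_made: int = 0

    def make_plan(self, task: str, history: List[str]) -> str:
        # A real planner would query a model; here we only count plans.
        self.plans_made += 1
        return f"plan #{self.plans_made} for: {task}"


@dataclass
class StubWebAgent:
    """Hypothetical stand-in for the web_agent; replays scripted actions."""
    script: List[str]
    step: int = 0

    def observe(self, url: str) -> str:
        # A real agent would capture a Set-of-Marks screenshot here.
        return f"som-screenshot:{url}:{self.step}"

    def decide(self, plan: str, observation: str) -> str:
        # Return the next scripted action, repeating the last one if exhausted.
        action = self.script[min(self.step, len(self.script) - 1)]
        self.step += 1
        return action


def browse_url_sketch(
    task: str,
    start_url: str,
    planner: StubPlanner,
    web_agent: StubWebAgent,
    round_limit: int = 12,  # assumed cap on loop iterations
) -> List[str]:
    history: List[str] = []
    plan = planner.make_plan(task, history)           # Planning
    for _ in range(round_limit):
        observation = web_agent.observe(start_url)    # Observation
        action = web_agent.decide(plan, observation)  # Action
        if action == "stop":
            history.append("stop")
            break
        if action == "stuck":
            plan = planner.make_plan(task, history)   # Replanning
            history.append("replan")
            continue
        history.append(f"executed {action}")          # Execution (simulated)
    return history


planner = StubPlanner()
agent = StubWebAgent(script=["click", "stuck", "type", "stop"])
log = browse_url_sketch("find the docs", "https://example.com", planner, agent)
print(log)
```

Keeping the planner and the web agent behind two small interfaces mirrors the division of labor described above: the loop only needs a plan, an observation, and a chosen action at each round, so either agent could be swapped out without changing the control flow.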