How to build a web-native AI Agent with Webfuse and Copilot (Part 2/2)

Sasha Ogurţov

February 3, 2025

Learn how Webfuse preserves a seamless, native browsing experience while maintaining data privacy and improving overall efficiency, offering a powerful alternative to fully remote browser tools.

Quick recap: What we’ve built so far

In the first part of this blog series, we explored how we built a web-native browser AI Agent using Webfuse and Copilot during a Microsoft Hackathon at their Redmond HQ. We’ve made significant progress — handling user prompts, capturing browser screenshots, integrating with OpenAI’s API, and executing commands on websites via injected content scripts. At the heart of this integration is a Webfuse extension.

The Webfuse extension is browser-agnostic, working seamlessly across platforms, including mobile. It follows the familiar structure of traditional browser extensions, making it easy to adopt if you’ve worked with Chrome extensions before. Key components include:

Content Scripts — Inject logic into web pages for DOM manipulation.
Service Workers — Handle background tasks and communication.
Custom Start Pages — Provide an intuitive interface for user input and workflows.
Action Popups — Facilitate user interactions and debugging.

Quick overview of how we solved the initial problem

Missed Part 1 or want a deeper dive? Read the full blog post here.

Defining the limitations

While our implementation is functional, it comes with a few challenges:

Persistent Pop-up — A recurring pop-up appears at the start of each session, requiring an extra click. Worse, it misleadingly states that the user is sharing their tab with Webfuse, which isn’t entirely true.

Disruptive Hints — Hints are always visible, sometimes covering parts of the page and altering the website experience. This can confuse users, making them unsure whether they’re interacting with the original site.

API Key Exposure — Since the extension runs in the user’s browser, the OpenAI API key is accessible via developer tools. This is a major security risk, limiting usage to development environments only.

Cumbersome Interface — We currently rely on a pop-up and a custom start page for user input. However, Webfuse offers powerful APIs — what if we could integrate its chat API for a more seamless, built-in experience?
Performance Overhead — Running everything in the browser adds extra load. Could we offload some processing elsewhere for better efficiency?

Webfuse Agent (aka Virtual Participant) to the rescue!

Wait a second, who is this Virtual Agent-Participant?

The Webfuse Agent introduces the concept of a “hidden participant”, a feature originally designed for recording sessions. However, we took this idea further. The agent is essentially a standard session participant that joins invisibly, loaded with all the necessary extensions. It can “see” and interact with the same session, just like the user does.

Can we solve all these limitations with a hidden remote participant seamlessly joining the session? Let’s find out.

Here’s what we came up with:

No More Pop-ups — Instead of capturing screenshots from the user’s browser, we take them from the hidden participant’s side. This eliminates the intrusive pop-up entirely.
Unobstructed User Experience — Since the agent now handles screenshots, we can also keep hints hidden from the user. The AI still gets the necessary context, while the user enjoys an unaltered, native browsing experience.
Secured Credentials & Improved Performance — Webfuse provides a special virtualParticipantOnly flag in manifest.json. Enabling this ensures the extension runs only on the remote participant’s side, keeping API keys hidden from the user. It also reduces the load on the user’s browser. Here is what our updated manifest.json looks like now:

{
    "manifest_version": 3,
    "name": "Next Remote Agent (Virtual Participant's side)",
    "version": "1.0",

    "action": {
        "default_popup": "popup.html"
    },
    "background": {
        "service_worker": "background.js"
    },
    "content_scripts": [
        {
            "js": ["hints.js", "content.js"]
        }
    ],
    "virtualParticipantOnly": true
}
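
With the extension confined to the virtual participant, its service worker can hold the API key and call OpenAI directly. The sketch below shows one way such a request could be built. It assumes OpenAI's Chat Completions endpoint with an image input rather than the Assistants API we used during the hackathon, and the names buildVisionRequest and askModel are hypothetical helpers, not part of our extension or of Webfuse.

```javascript
// Hypothetical helper (not from the original extension): builds a fetch()
// request for OpenAI's Chat Completions endpoint with one text prompt and
// one screenshot passed as a data URL.
function buildVisionRequest(apiKey, userPrompt, screenshotDataUrl) {
  return {
    url: 'https://api.openai.com/v1/chat/completions',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: 'gpt-4o',
        messages: [
          {
            role: 'user',
            content: [
              { type: 'text', text: userPrompt },
              { type: 'image_url', image_url: { url: screenshotDataUrl } },
            ],
          },
        ],
      }),
    },
  };
}

// Hypothetical helper: sends the request and returns the model's reply text.
// Meant to run only on the virtual participant's side, where the key lives.
async function askModel(apiKey, userPrompt, screenshotDataUrl) {
  const { url, options } = buildVisionRequest(apiKey, userPrompt, screenshotDataUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`OpenAI request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
}
```

Keeping the request builder separate from the network call makes it easy to check the payload without spending API credits.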

Seamless Chat Control — Instead of relying on pop-ups and custom start pages, we can now leverage Webfuse’s native chat API to manage interactions effortlessly:

Using the built-in chat to control the flow

Here is a simple code snippet that demonstrates how to use the Webfuse JS API:

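// Handles messages from the Webfuse session. Chat messages that start
// with "/" are treated as commands for the agent.
// prompt, autoNext, sendScreenshots, lastImageData, next() and nextAction()
// are expected to be defined elsewhere in the extension (not shown here).
function agentMessageHandler(message, sender) {
    console.log('[popup.html] agentMessageHandler: message received: ', message);

    // Ignore everything that is not a slash command typed into the chat.
    if (message.event_type !== 'chat_message' || message.message[0] !== '/') {
        return;
    }

    // "/prompt find flights" -> command "prompt", params ["find", "flights"]
    const [command, ...params] = message.message.slice(1).split(' ');

    switch (command) {
        // Set a new prompt and run the first step.
        case 'start':
        case 'prompt':
        case 's':
        case 'p':
            prompt = params.join(' ');
            next();
            break;
        // Run the next step for the current prompt.
        case 'next':
        case 'n':
            next();
            break;
        // "/auto on" enables automatic stepping; anything else disables it.
        case 'auto':
        case 'a':
            autoNext = params[0] === 'on';
            break;
        // "/screenshots on" enables sending screenshots; anything else disables it.
        case 'screenshots':
        case 'ss':
            sendScreenshots = params[0] === 'on';
            break;
        // Pass the user's answer along with the last captured screenshot.
        case 'answer':
            nextAction(params.join(' '), lastImageData);
            break;
    }
}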
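
The parsing step in the handler above can also be pulled out into a small pure function, which makes the aliases easier to test. This is only a sketch: parseChatCommand and COMMAND_ALIASES are hypothetical names, and unlike the handler it collapses repeated spaces between words.

```javascript
// Maps every alias accepted by the handler to one canonical command name.
const COMMAND_ALIASES = new Map([
  ['start', 'prompt'], ['prompt', 'prompt'], ['s', 'prompt'], ['p', 'prompt'],
  ['next', 'next'], ['n', 'next'],
  ['auto', 'auto'], ['a', 'auto'],
  ['screenshots', 'screenshots'], ['ss', 'screenshots'],
  ['answer', 'answer'],
]);

// Returns { command, params, args } for a known slash command,
// or null for plain chat text and unknown commands.
function parseChatCommand(text) {
  if (typeof text !== 'string' || !text.startsWith('/')) {
    return null;
  }
  const [name, ...params] = text.slice(1).trim().split(/\s+/);
  const command = COMMAND_ALIASES.get(name);
  if (!command) {
    return null;
  }
  return { command, params, args: params.join(' ') };
}
```

For example, parseChatCommand('/p book a flight') yields the canonical command 'prompt' with args 'book a flight', while a regular chat message yields null and can be ignored.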

Final architecture with Virtual Participant

Let's put it all together and see how it works end to end:

So why not just use a remote browser?

Remote browser tools like Browse.ai, Browserbase.com, or even the new kid on the block, Operator from OpenAI, offer powerful capabilities by running entire browser sessions on remote servers. However, for many use cases, this approach introduces significant drawbacks compared to Webfuse’s hybrid model with its remote agent.

1. Native website experience

With remote browsers, users interact with the website through a streamed session, which often leads to:

Quality Loss: The website is rendered on the remote server, and the user sees a compressed video stream of it. This can degrade the visual fidelity and responsiveness of the website.
Increased Latency: Every interaction requires a round trip to the remote server, leading to noticeable delays in browsing or performing tasks.

In contrast, Webfuse loads the original website natively in your browser, ensuring a pixel-perfect experience with no streaming latency. The website behaves exactly as it would in a regular browsing session.

2. Full control over your data

Remote browser tools process all website interactions on their servers, meaning:

Your browsing data, session cookies, and interaction logs may be stored or processed on remote servers, potentially compromising your privacy.
You lose control over the data, even if the tool promises not to save it.

With Webfuse, the website and its state remain in your local browser. The remote agent simply joins as a hidden participant to assist with specific operations, such as offloading screenshots or communicating with APIs. Your data never leaves your local environment unless explicitly intended.

3. Flexible and efficient workflow

By maintaining a local-first approach and offloading only heavy operations to the remote agent, Webfuse offers the best of both worlds:

The performance of a local browser session.
The additional power of remote resources without compromising control, speed, or privacy.

Summary

Unlike fully remote browser solutions, Webfuse and its remote agent preserve the core browsing experience while enhancing it with remote capabilities. You remain in control of your data, enjoy a seamless browsing experience without streaming latency, and can trust that sensitive operations stay secure within your environment.

Highlights and surprises

Solving CAPTCHAs: One of the most impressive moments was watching the agent successfully solve a CAPTCHA — something we hadn’t anticipated.
Effortless Adaptation of the Keyjump Extension: Porting a browser extension to Webfuse without modifying its code highlighted the flexibility and power of Webfuse.
Image Text Blind Spot: In our tests, OpenAI’s Assistants API with GPT-4o couldn’t read text in images. It took us a few precious hackathon hours to realize this. Still not sure why, though!

What’s next?

We’re already brainstorming ways to take this further. Some ideas include:

Exposing APIs for more granular actions like clicking, focusing, and typing.
Potentially using a coordinate-based approach, similar to Anthropic’s computer use capability for Claude.
Making Webfuse extensions cross-compatible with standard browser extensions.
Handling more edge cases (of which there are plenty!).
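
To make the first idea more concrete, here is a rough sketch of what a granular action helper could look like inside a content script. Nothing here is an existing Webfuse API: performAction and the action shape ({ type, selector, text }) are assumptions made purely for illustration, and it relies only on standard DOM calls.

```javascript
// Hypothetical action format: { type: 'click' | 'focus' | 'type', selector, text? }
// doc is the page's document (passed in so the function is easy to test).
function performAction(doc, action) {
  const element = doc.querySelector(action.selector);
  if (!element) {
    return { ok: false, error: `No element matches ${action.selector}` };
  }

  switch (action.type) {
    case 'click':
      element.click();
      break;
    case 'focus':
      element.focus();
      break;
    case 'type':
      element.focus();
      element.value = action.text ?? '';
      // Let page scripts that listen for input events see the change.
      element.dispatchEvent(new Event('input', { bubbles: true }));
      break;
    default:
      return { ok: false, error: `Unknown action type: ${action.type}` };
  }
  return { ok: true };
}
```

In a content script this might be called as performAction(document, { type: 'type', selector: '#search', text: 'flights to Oslo' }). Some sites may need more than a value assignment and an input event to register typed text, which is exactly the kind of edge case mentioned above.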

Source code

Here is the full source code.