
How to build a web-native AI Agent with Webfuse and Copilot (Part 1/2)

Sasha Ogurţov
January 21, 2025


TL;DR: We built a tool on top of Webfuse and Microsoft Copilot that lets you automate tasks on the web. It can handle internet-based tasks (e.g. “Book me a flight ticket to Amsterdam next Tuesday”, “Buy a gift for my brother-in-law on Amazon”, “Order me a pizza”) directly in your browser, with no installation required. In this article, I’ll cover our approach, challenges, highlights, and surprises along the way.

Intro

A few weeks ago, my colleagues and I attended a hackathon at Microsoft HQ for their clients and partners. We didn’t have much information about the event’s conditions or rules, but we were excited to showcase the innovative work we’re doing at Surfly.

As it turned out, the tool we built was aligned with emerging trends in autonomous agents. Its timing coincided with Microsoft releasing a set of autonomous agents, Google introducing Gemini 2 for the agentic era, OpenAI unveiling Swarm, and Claude taking agentic AI to the next level with its computer-use skills — from moving a mouse cursor around the screen to clicking buttons and typing text with a virtual keyboard.

Typically, AI agents follow a three-step process:

  1. Determine the goal through a user-specified prompt.
  2. Figure out how to approach that objective by breaking it down into smaller, simpler subtasks and collecting the needed data.
  3. Execute the subtasks, making use of any functions they can call or tools they have at their disposal.

To put it simply, we wanted our solution to understand and act. This is exactly where the power of Webfuse sessions lies: since we have access to the source code of any web application, we can inject custom code that performs both.
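As a toy illustration of those two halves, using plain DOM APIs rather than the actual Webfuse machinery (the model round-trip is elided):

// Toy sketch of "understand and act" with plain DOM APIs.
// Webfuse lets us inject code like this into any target page.

// Understand: collect the actionable elements and their labels.
const actions = [...document.querySelectorAll('a, button, input')].map((el, i) => ({
    index: i,
    tag: el.tagName.toLowerCase(),
    label: el.innerText || el.getAttribute('aria-label') || el.placeholder || '',
}));

// ...send `actions` (or a screenshot) to a model and get a decision back...

// Act: perform the chosen action, e.g. click the element the model picked.
const chosen = document.querySelectorAll('a, button, input')[3];  // index from the model
chosen.click();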

How it works end-to-end

Before diving into the implementation details, there are two core concepts you need to grasp:

  1. Webfuse Extensions
  2. Webfuse Agent

We will focus on Webfuse extensions in the first part of this article and explore Webfuse Agent in the second part.

If you’re unfamiliar with Webfuse extensions or just curious to learn more, check out our documentation. In short, a Webfuse extension is browser-agnostic and works seamlessly across platforms, including mobile devices. It mirrors the familiar structure of traditional browser extensions, which makes it easy to pick up if you’ve worked with Chrome extensions before. It includes:

  • Content scripts for injecting logic into web pages.
  • Action popups for user interactions.
  • Service workers to handle background tasks.
  • Custom start pages to initialize workflows.
  • Communication APIs to enable messaging between components.

The game-changer here is that with a Webfuse session, users are no longer required to install an extension themselves; the full functionality can be packaged under a new Webfuse link, which takes care of loading the extension code on top of a predefined target application.

Let’s look at an overview of how Webfuse extensions can solve the problem:

Webfuse Extension Components

Now let’s dive into the extension and explore the code. Just like any browser extension, ours starts with a manifest.json file:

{
    "manifest_version": 3,
    "name": "Next Remote Agent",
    "version": "1.0",
    "action": {
        "default_popup": "popup.html"
    },
    "background": {
        "service_worker": "background.js"
    },
    "content_scripts": [
        {
            "js": ["hints.js", "content.js"]
        }
    ],
    "chrome_url_overrides": {
      "newtab": "newtab.html"
    }
}

As you can see, this looks just like any other browser extension’s manifest file. It includes all the components you’d expect — background scripts, content scripts, a popup, and a custom new tab page.

Custom newtab.html: Handling the Initial Prompt

The newtab.html is loaded on the user’s side as the start page. It acts as the user-facing entry point where the process begins. This component is optional; you could also forgo this step and start the execution directly on a predefined target website.

Custom startpage

What it does:

  • Displays a simple HTML form.
  • Captures user input (e.g., “Book me a flight”).
  • Sends this input to the agent’s API via the Webfuse Message API to initialize the process.

Here’s the code for the start page:

<div class="container">
    <div class="cards">
        <div class="card">✈️ Book flight tickets to Amsterdam next Tuesday</div>
        <div class="card">🎅 Find a Christmas present for my brother-in-law</div>
        <!-- Other cards -->
    </div>
    <form id="prompt-form">
        <input type="text" class="prompt-input" placeholder="Enter your prompt here...">
    </form>
</div>

<script>
    document.addEventListener('DOMContentLoaded', () => {
        if (!isAgent) {
            // Use Webfuse Agent API to invite agent
        }
    });    
    document.querySelectorAll('.card').forEach(card => {
        card.addEventListener('click', () => {
            document.querySelector('.prompt-input').value = card.textContent;
            document.querySelector('.prompt-input').focus();
        });
    });
    
    document.getElementById('prompt-form').addEventListener('submit', function(e) {
        e.preventDefault();
        const prompt = this.querySelector('.prompt-input').value;
        startWithPrompt(prompt);
        this.querySelector('.prompt-input').value = '';
        window.location.href = "https://www.google.com";
    });
    function startWithPrompt(prompt) {
        // Use Webfuse Message API to send the prompt
        // https://surfly.com/api/schema/swagger-ui/#/Space%20Session/spaces_sessions_message_create
    }
</script>
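The startWithPrompt stub is where the Webfuse Message API comes in. Purely as a hedged sketch (the endpoint path, IDs, and payload fields below are illustrative assumptions; the real shape is in the Swagger docs linked in the comment above), the stub could be filled in like this:

const API_TOKEN = '<your Webfuse API token>';  // placeholder
const SPACE_ID = '<space id>';                 // placeholder
const SESSION_ID = '<session id>';             // placeholder

async function startWithPrompt(prompt) {
    // Hypothetical endpoint shape, for illustration only.
    await fetch(`https://surfly.com/v2/spaces/${SPACE_ID}/sessions/${SESSION_ID}/message/`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${API_TOKEN}`,
        },
        // The /start prefix routes the prompt through the command
        // handling we set up in background.js below.
        body: JSON.stringify({ message: `/start ${prompt}` }),
    });
}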

Service worker (background.js): Handling local commands

Extension service workers are an extension’s central event handler. A service worker executes in a ServiceWorkerGlobalScope: a special kind of worker context that runs off the main script execution thread and has no DOM access.

We will use the service worker to initialize communication and to propagate commands received from the agent to the tabs, where the content scripts handle them.

let prompt = '';
let autoNext = false;
let sendScreenshots = false;

const openAIToken = "<your OpenAI token>";
const systemPrompt = `<see repository for system prompt>`;
class OpenAIChatAssistant {
    constructor() {
        this.openAIToken = openAIToken;
        this.baseUrl = 'https://api.openai.com/v1/chat/completions';
        this.chatHistory = [{ role: 'system', content: systemPrompt }];
    }
    
    // See source code for full implementation
}
const openAIChatAssistant = new OpenAIChatAssistant();

surflyExtension.surflySession.onMessage.addListener(message => {
    console.log('[background.js] message received: ', message);
    if (message.event_type === 'chat_message') {
        if (message.message[0] === '/') {
            const command = message.message.split(' ')[0].slice(1);
            const params = message.message.split(' ').slice(1);

            switch (command) {
                case 'start':
                case 'prompt':
                case 's':
                case 'p':
                    prompt = params.join(' ');
                    next();
                    break;
                case 'next':
                case 'n':
                    next();
                    break;
                case 'hints':
                case 'h':
                    surflyExtension.tabs.sendMessage(null, {event_type: 'command', command: 'hints'});
                    break;
                case 'auto':
                case 'a':
                    autoNext = params[0] === 'on';
                    break;
                case 'screenshots':
                case 'ss':
                    sendScreenshots = params[0] === 'on';
                    break;
                case 'answer':
                    nextAction(params.join(' '));
                    break;
                case 'help':
                    // Note: 'h' is already taken by the 'hints' command above
                    surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Available commands: /start <prompt>, /next, /hints, /auto <on|off>, /screenshots <on|off>, /help`});
                    break;
            }

            console.log('[background.js] command: ', command, 'params: ', params);
        }
    }
});

function next() {
    surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Looking at the webpage..`});
    surflyExtension.runtime.sendMessage({event_type: 'capture_screen', sendScreenshots});
}

function nextAction(prompt, imageData) {
    surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Deciding what to do next..`});
    openAIChatAssistant.sendMessageAndGetInstruction(prompt, imageData).then(response => {
        surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: ${response.explanation}`});
        surflyExtension.tabs.sendMessage(null, {event_type: 'command', ...response});

        if (sendScreenshots) {
            surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Adding screenshot to the popup..`});    
            surflyExtension.runtime.sendMessage({event_type: 'add_screenshot', image_data: imageData, response});
        }

        if (autoNext) {
            setTimeout(() => {
                next();
            }, 3000);
        }
    });
}

surflyExtension.runtime.onMessage.addListener((message, sender) => {
    console.log('[background.js] message received: ', message);
    if (message.event_type === 'screenshot') {
        const imageData = message.image_data;
        nextAction(prompt, imageData);
    }
});
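The full sendMessageAndGetInstruction implementation lives in the repository. As a rough sketch of what such a method might look like inside OpenAIChatAssistant, assuming the standard OpenAI Chat Completions API with image input (the model name and the JSON instruction format are assumptions here; the real format is dictated by the system prompt):

    // Hedged sketch: send the chat history plus the latest prompt and
    // screenshot to the Chat Completions API, remember the reply, and
    // parse it as a JSON instruction for the content script.
    async sendMessageAndGetInstruction(prompt, imageData) {
        const content = [{ type: 'text', text: prompt }];
        if (imageData) {
            // imageData is the data: URL produced by the popup's captureScreen()
            content.push({ type: 'image_url', image_url: { url: imageData } });
        }
        this.chatHistory.push({ role: 'user', content });

        const response = await fetch(this.baseUrl, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'Authorization': `Bearer ${this.openAIToken}`,
            },
            body: JSON.stringify({ model: 'gpt-4o', messages: this.chatHistory }),
        });
        const data = await response.json();
        const reply = data.choices[0].message;
        this.chatHistory.push(reply);

        // e.g. {"command": "click", "tag": "b", "explanation": "..."}
        return JSON.parse(reply.content);
    }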

popup.html: Taking screenshots and displaying them for debugging

The popup is the only visible element of the extension. It also has access to the DOM, which is useful, since we need to take screenshots of the current tab.

Here's the popup code:

<div id="screenshots"></div>

<script>
    let mediaStream = null;

    async function initializeMediaStream() {
        try {
            const displayMediaOptions = {
                preferCurrentTab: true,
            };
            mediaStream = await navigator.mediaDevices.getDisplayMedia(displayMediaOptions);
        } catch (err) {
            console.error('Failed to initialize media stream:', err);
        }
    }

    surflyExtension.runtime.onMessage.addListener((message, sender) => {
        if (message.event_type === 'capture_screen') {
            captureScreen().then(imageData => {
                surflyExtension.runtime.sendMessage({event_type: 'screenshot', image_data: imageData});
            });
        } else if (message.event_type === 'add_screenshot') {
            const screenshotElement = document.getElementById('screenshots');
            const screenshot = document.createElement('div');
            screenshot.innerHTML = `<img src="${message.image_data}" alt="Screenshot" width="640"><p>${JSON.stringify(message.response)}</p>`;
            screenshotElement.appendChild(screenshot);
        }
    });

    async function captureScreen() {
        if (!mediaStream) {
            // If stream was lost/stopped, try to reinitialize
            await initializeMediaStream();
            if (!mediaStream) {
                throw new Error('Failed to initialize media stream');
            }
        }

        const track = mediaStream.getVideoTracks()[0];
        const imageCapture = new ImageCapture(track);
        
        try {
            const bitmap = await imageCapture.grabFrame();
            const canvas = document.createElement('canvas');
            canvas.width = bitmap.width;
            canvas.height = bitmap.height;
            const context = canvas.getContext('2d');
            context.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
            return canvas.toDataURL('image/png');
        } catch (err) {
            mediaStream = null;
            throw err;
        }
    }
</script>
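Note that browsers only allow navigator.mediaDevices.getDisplayMedia() to be called in response to a user gesture, so the first capture has to be kicked off by a click in the popup, at which point the user also sees a screen-share permission prompt.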

content.js & hints.js: The Content Scripts

The content script runs in the user’s browser and has two primary responsibilities:

1. Displaying Actionable Hints

We ported KeyJump (a simpler Vimium clone) into Webfuse to display small labels (e.g., a, b, c) over actionable elements like buttons, links, and input fields.

Keyjump activated hints
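Conceptually, hint mode boils down to the following simplified sketch; the real hints.js (the KeyJump port) also handles element visibility, scrolling, and label collisions:

// Simplified sketch of hint mode: overlay a one-letter label on each
// actionable element and remember the mapping so a later "keypress"
// can be resolved back to the element it should activate.
const HINT_CHARS = 'abcdefghijklmnopqrstuvwxyz';
const hintTargets = {};

function activateHintModeSketch() {
    const elements = document.querySelectorAll('a, button, input, select, textarea');
    [...elements].slice(0, HINT_CHARS.length).forEach((el, i) => {
        const tag = HINT_CHARS[i];
        hintTargets[tag] = el;

        const rect = el.getBoundingClientRect();
        const label = document.createElement('span');
        label.textContent = tag;
        label.style.cssText =
            `position: fixed; left: ${rect.left}px; top: ${rect.top}px; ` +
            'background: yellow; font: bold 12px monospace; z-index: 99999;';
        document.body.appendChild(label);
    });
}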

2. Executing Commands

Once the model provides an instruction, a content script injected into the webpage interprets and executes the action. The content script uses JavaScript APIs to perform tasks such as:

  • Simulating a keypress for the click action (e.g., pressing "b").
  • Focusing on an input element and simulating typing for the input action.
  • Triggering scrolling behavior for the scroll action.
  • Waiting if a page hasn’t fully loaded.

Handling Additional Information:

When additional input is needed (e.g., selecting a payment method), the agent displays a prompt in the popup extension. The user provides the input, which is passed back to the model for further instructions.

addEventListener("DOMContentLoaded", (event) => {
    document.body.focus();
    setup();
    activateHintMode();

    const handlers = {
        click: (message) => {
            const char = message.tag.toLowerCase();
            const event = new KeyboardEvent("keydown", {
                key: char,
                shiftKey: message.shiftKey,
                ctrlKey: message.ctrlKey,
                altKey: message.altKey,
                metaKey: message.metaKey,
            });
            handleKeydown(event);
        },

        type: (message) => {
            handlers.click({
                tag: message.tag.toLowerCase(),
                shiftKey: message.shiftKey,
                ctrlKey: message.ctrlKey,
                altKey: message.altKey,
                metaKey: message.metaKey,
            });

            window.document.activeElement.value = message.string;
            window.document.activeElement.dispatchEvent(
                new InputEvent("input", { bubbles: true })
            );
            document.body.focus();
        },
        navigate: (message) => {
            window.location.href = message.url;
        },
        scroll: (message) => {
            window.scrollBy(0, message.y);
        },
        wait: (message) => {
            setTimeout(() => {
                activateHintMode();
            }, message.time);
        },
        finish: (message) => {
            surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Finished`});
        },
        ask: (message) => {
            surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Question: ${message.question}`});
        },
        hints: (message) => {
            activateHintMode();
        },
    };

    surflyExtension.runtime.onMessage.addListener((message, sender) => {
        if (message?.command in handlers) {
            try {
                handlers[message.command](message);
            } catch (error) {
                console.error('Error executing command:', error);
            }
        } else {
            console.log("content script: message not found", message);
        }
    });

    surflyExtension.surflySession.apiRequest({cmd: 'send_chat_message', message: `✨ Agent: Hello! Type /help to see available commands.`});
});
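For reference, the handler table above implies the shape of the instructions the model sends back and background.js forwards to the content script. A few illustrative examples (the explanation field is what gets echoed into the chat):

// Examples of instruction objects, as implied by the handlers above:
const exampleInstructions = [
    { command: 'click', tag: 'b', explanation: 'Clicking the search button' },
    { command: 'type', tag: 'c', string: 'Amsterdam', explanation: 'Typing the destination' },
    { command: 'navigate', url: 'https://www.example.com', explanation: 'Opening the site' },
    { command: 'scroll', y: 600, explanation: 'Scrolling down for more results' },
    { command: 'wait', time: 2000, explanation: 'Waiting for the page to load' },
    { command: 'ask', question: 'Which payment method should I use?', explanation: 'Need user input' },
    { command: 'finish', explanation: 'The flight has been booked' },
];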

What’s exciting is that this isn’t a one-trick pony — Webfuse extensions can be used to build countless other use cases. Whether it’s automating workflows, improving accessibility, or building intelligent assistants, the possibilities are endless.

What’s Next?

In the second part, we will explore how to offload heavy operations, such as taking screenshots and communicating with the OpenAI APIs, to the remote agent’s browser. This way, we ensure that the user’s experience stays seamless.

Source Code

In this version, all of the code runs in the user’s browser. The end user will have to grant permission to capture the screen, which produces a MediaStream with tracks containing the requested media types.

Here is the full source code.
