@filip-michalsky (Collaborator) commented Sep 13, 2025

why

Our existing screenshot service is a naive, timer-triggered service: it captures on a timer and never in response to the agent's actions.

what changed

Added an image hash diff algorithm (a quick check with MSE, then verification with SSIM) to detect whether the UI actually changed. A screenshot is stored in the buffer only if it did.
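The two-stage check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names (`mse`, `globalSsim`, `hasUiChanged`), the threshold values, and the use of a single global SSIM window are all assumptions made for the example. Production SSIM implementations usually average the index over local windows.

```typescript
// Grayscale frames as flat arrays of 0-255 intensities with equal dimensions.
type Gray = Uint8Array;

// Hypothetical thresholds; the values used in the PR are not shown in this thread.
interface DiffThresholds {
  mseThreshold: number; // below this, frames are treated as identical
  ssimThreshold: number; // below this, frames are treated as structurally different
}

const DEFAULT_THRESHOLDS: DiffThresholds = { mseThreshold: 30, ssimThreshold: 0.75 };

function assertComparable(a: Gray, b: Gray): void {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("frames must be non-empty and equal in size");
  }
}

// Mean squared error: a cheap first pass over the raw pixels.
function mse(a: Gray, b: Gray): number {
  assertComparable(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum / a.length;
}

// Simplified SSIM computed over the whole frame as one window.
function globalSsim(a: Gray, b: Gray): number {
  assertComparable(a, b);
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let varA = 0;
  let varB = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    varA += da * da;
    varB += db * db;
    cov += da * db;
  }
  varA /= n;
  varB /= n;
  cov /= n;
  // Standard stabilizing constants for 8-bit images (K1 = 0.01, K2 = 0.03).
  const c1 = (0.01 * 255) ** 2;
  const c2 = (0.03 * 255) ** 2;
  return (
    ((2 * meanA * meanB + c1) * (2 * cov + c2)) /
    ((meanA * meanA + meanB * meanB + c1) * (varA + varB + c2))
  );
}

// Quick check with MSE, then verify with SSIM only when MSE suggests a change.
function hasUiChanged(prev: Gray, next: Gray, t: DiffThresholds = DEFAULT_THRESHOLDS): boolean {
  if (mse(prev, next) < t.mseThreshold) return false;
  return globalSsim(prev, next) < t.ssimThreshold;
}
```

The MSE pass keeps the common case (nothing changed) cheap, and SSIM is only computed for frames that already look different at the pixel level. The grayscale input could plausibly be produced with Sharp via `sharp(png).grayscale().raw().toBuffer()` after resizing both frames to the same dimensions, though how the PR prepares its frames is not shown here.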

Added a screenshot interceptor that copies each screenshot the agent takes into a buffer (if it differs enough from the previous screenshot) for later use in evals.

  • There's also a small refactor of the agent initialization config so that the screenshot collector service can be attached
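One way the screenshot interceptor described above might look is sketched below. The `ScreenshotPage` interface, the `ScreenshotCollector` class, and the `maxFrames` cap are hypothetical names introduced for illustration; the real collector in this PR may be structured differently.

```typescript
// Minimal stand-in for a Playwright-like page; only the method this sketch wraps.
interface ScreenshotPage {
  screenshot(): Promise<Uint8Array>;
}

// Caller-supplied comparison, e.g. an MSE/SSIM-based check.
type IsDifferent = (prev: Uint8Array, next: Uint8Array) => boolean;

class ScreenshotCollector {
  private readonly frames: Uint8Array[] = [];
  private readonly isDifferent: IsDifferent;
  private readonly maxFrames: number;

  constructor(isDifferent: IsDifferent, maxFrames = 8) {
    this.isDifferent = isDifferent;
    this.maxFrames = maxFrames;
  }

  // Wraps page.screenshot so every capture is also offered to the buffer.
  // Returns a function that restores the original method.
  attach(page: ScreenshotPage): () => void {
    const original = page.screenshot.bind(page);
    page.screenshot = async () => {
      const shot = await original();
      this.offer(shot);
      return shot; // the agent still receives its screenshot unchanged
    };
    return () => {
      page.screenshot = original;
    };
  }

  // Stores the frame only if it differs from the most recently stored one.
  offer(shot: Uint8Array): void {
    const last = this.frames[this.frames.length - 1];
    if (last !== undefined && !this.isDifferent(last, shot)) return;
    this.frames.push(shot.slice()); // copy so later mutation cannot affect the buffer
    if (this.frames.length > this.maxFrames) this.frames.shift();
  }

  getFrames(): readonly Uint8Array[] {
    return this.frames;
  }
}
```

Because the wrapper returns the original screenshot untouched, the agent's behavior should be unaffected; the collector only observes captures the agent was already making.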

test plan

Tests pass locally

changeset-bot bot commented Sep 13, 2025

🦋 Changeset detected

Latest commit: ac4cb00

The changes in this PR will be included in the next version bump.


@filip-michalsky filip-michalsky marked this pull request as ready for review September 14, 2025 17:29
@greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR replaces the existing time-based screenshot service with an intelligent image-diff based approach that uses MSE (Mean Squared Error) and SSIM (Structural Similarity Index) algorithms to determine if UI changes occurred. The new system intercepts agent screenshots and only stores them if they differ significantly from previous captures, reducing storage overhead while maintaining evaluation quality.

Key changes:
• Added Sharp dependency for image processing and comparison algorithms
• Implemented screenshot interception that overrides the page.screenshot method
• Added MSE/SSIM-based image comparison with configurable thresholds
• Introduced optimized screenshot capture with viewport resizing for performance
• Changed default trial count from 3 to 1 for faster evaluation cycles

Confidence score: 3/5

  • This PR carries moderate risk because viewport manipulation can introduce race conditions
  • The score reflects a solid implementation of the image comparison algorithms, offset by a critical issue: concurrent viewport modification could cause unpredictable behavior in multi-threaded scenarios
  • Pay close attention to evals/utils/imageResize.ts; the viewport manipulation approach needs revision
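To illustrate the concurrency concern raised above, here is a minimal sketch of one possible mitigation: serializing every resize-capture-restore sequence behind a promise-chain lock so that two captures cannot interleave their viewport changes. `AsyncLock`, `ViewportPage`, and `screenshotAtSize` are hypothetical names for this example; this is not the contents of evals/utils/imageResize.ts, and it is not necessarily the revision the reviewer had in mind.

```typescript
// Hypothetical helper: runs async tasks one at a time, in call order.
class AsyncLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable even if a task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

interface Size {
  width: number;
  height: number;
}

// Structural stand-in for the page methods this sketch relies on.
interface ViewportPage {
  viewportSize(): Size | null;
  setViewportSize(size: Size): Promise<void>;
  screenshot(): Promise<Uint8Array>;
}

const viewportLock = new AsyncLock();

// Resizes, captures, and restores as one uninterrupted unit.
async function screenshotAtSize(page: ViewportPage, size: Size): Promise<Uint8Array> {
  return viewportLock.run(async () => {
    const previous = page.viewportSize();
    await page.setViewportSize(size);
    try {
      return await page.screenshot();
    } finally {
      // Restore the viewport even if the capture throws.
      if (previous) await page.setViewportSize(previous);
    }
  });
}
```

A single shared lock only helps when every caller that touches the viewport goes through it; capturing a clipped or scaled screenshot without changing the viewport at all would avoid the problem entirely, if the capture API in use supports it.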

6 files reviewed, 1 comment


@miguelg719 miguelg719 force-pushed the fm/str-798-improve-screenshots-in-evaluator branch from 691334d to 9c5f69a Compare September 22, 2025 21:59
import { LLMParsedResponse } from "@/lib/inference";
import { LLMResponse } from "@/lib/llm/LLMClient";
import { LogLine } from "@/types/log";
import { z } from "zod";
zod3

@miguelg719 miguelg719 added agent Agentic evals and removed agent Agentic evals labels Sep 22, 2025
@miguelg719 miguelg719 force-pushed the fm/str-798-improve-screenshots-in-evaluator branch from 15f737a to 4d9ecf2 Compare September 23, 2025 00:05
@miguelg719 miguelg719 merged commit dc2d420 into main Sep 23, 2025
15 checks passed
miguelg719 pushed a commit that referenced this pull request Oct 7, 2025
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to main, this PR will
be updated.


# Releases
## @browserbasehq/[email protected]

### Patch Changes

- [#1082](#1082)
[`8c0fd01`](8c0fd01)
Thanks [@tkattkat](https://github.com/tkattkat)! - Pass stagehand object
to agent instead of stagehand page

- [#1104](#1104)
[`a1ad06c`](a1ad06c)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix logging for
stagehand agent

- [#1066](#1066)
[`9daa584`](9daa584)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add playwright
arguments to agent execute response

- [#1077](#1077)
[`7f38b3a`](7f38b3a)
Thanks [@tkattkat](https://github.com/tkattkat)! - adds support for
stagehand agent in the api

- [#1032](#1032)
[`bf2d0e7`](bf2d0e7)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fix for zod peer
dependency support

- [#1014](#1014)
[`6966201`](6966201)
Thanks [@tkattkat](https://github.com/tkattkat)! - Replace operator
handler with base of new agent

- [#1089](#1089)
[`536f366`](536f366)
Thanks [@miguelg719](https://github.com/miguelg719)! - Fixed info logs
on api session create

- [#1103](#1103)
[`889cb6c`](889cb6c)
Thanks [@tkattkat](https://github.com/tkattkat)! - patch custom tool
support in anthropic cua client

- [#1056](#1056)
[`6a002b2`](6a002b2)
Thanks [@chrisreadsf](https://github.com/chrisreadsf)! - remove need for
duplicate project id if already passed to Stagehand

- [#1090](#1090)
[`8ff5c5a`](8ff5c5a)
Thanks [@miguelg719](https://github.com/miguelg719)! - Improve failed
act error logs

- [#1014](#1014)
[`6966201`](6966201)
Thanks [@tkattkat](https://github.com/tkattkat)! - replace operator
agent with scaffold for new stagehand agent

- [#1107](#1107)
[`3ccf335`](3ccf335)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: url
extraction not working inside an array

- [#1102](#1102)
[`a99aa48`](a99aa48)
Thanks [@miguelg719](https://github.com/miguelg719)! - Add current page
and date context to agent

- [#1110](#1110)
[`dda52f1`](dda52f1)
Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for
new Gemini Computer Use models

## @browserbasehq/[email protected]

### Minor Changes

- [#1057](#1057)
[`b7be89e`](b7be89e)
Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - added
web voyager ground truth (optional), added web bench, and subset of
OSWorld evals which run on a browser

### Patch Changes

- [#1072](#1072)
[`dc2d420`](dc2d420)
Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - improve
evals screenshot service - add img hashing diff to add screenshots and
change to screenshot intercepts from the agent

- Updated dependencies
\[[`8c0fd01`](8c0fd01),
[`a1ad06c`](a1ad06c),
[`9daa584`](9daa584),
[`7f38b3a`](7f38b3a),
[`bf2d0e7`](bf2d0e7),
[`6966201`](6966201),
[`536f366`](536f366),
[`889cb6c`](889cb6c),
[`6a002b2`](6a002b2),
[`8ff5c5a`](8ff5c5a),
[`6966201`](6966201),
[`3ccf335`](3ccf335),
[`a99aa48`](a99aa48),
[`dda52f1`](dda52f1)]:
    -   @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies
\[[`8c0fd01`](8c0fd01),
[`a1ad06c`](a1ad06c),
[`9daa584`](9daa584),
[`7f38b3a`](7f38b3a),
[`bf2d0e7`](bf2d0e7),
[`6966201`](6966201),
[`536f366`](536f366),
[`889cb6c`](889cb6c),
[`6a002b2`](6a002b2),
[`8ff5c5a`](8ff5c5a),
[`6966201`](6966201),
[`3ccf335`](3ccf335),
[`a99aa48`](a99aa48),
[`dda52f1`](dda52f1)]:
    -   @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>