img diff algo for screenshots #1072

filip-michalsky · 2025-09-13T19:07:23Z

why

Our existing screenshot service is a naive, purely time-triggered service. It does not trigger on any of the agent's actions.

what changed

Added an image hash diff algorithm (a quick check with MSE, verified with SSIM) to detect whether the UI actually changed; a screenshot is stored in the buffer only if it did.
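A minimal sketch of the two-stage check described above, assuming both screenshots have already been decoded to equally sized grayscale buffers. The function names and both thresholds are hypothetical and are not taken from this PR, and the SSIM here is a simplified single-window (global) variant rather than the windowed version a real implementation would typically use.

```typescript
// Hypothetical thresholds for illustration; the PR's actual values are not shown here.
const MSE_THRESHOLD = 30;
const SSIM_THRESHOLD = 0.95;

/** Mean squared error between two equally sized grayscale buffers. */
function mse(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("buffers must be non-empty and the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum / a.length;
}

/** Simplified SSIM computed over the whole image as a single window. */
function globalSsim(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("buffers must be non-empty and the same length");
  }
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;

  let varA = 0;
  let varB = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    varA += da * da;
    varB += db * db;
    cov += da * db;
  }
  varA /= n;
  varB /= n;
  cov /= n;

  // Standard SSIM stabilising constants for 8-bit pixel values.
  const c1 = (0.01 * 255) ** 2;
  const c2 = (0.03 * 255) ** 2;
  return (
    ((2 * meanA * meanB + c1) * (2 * cov + c2)) /
    ((meanA * meanA + meanB * meanB + c1) * (varA + varB + c2))
  );
}

/** Cheap MSE gate first; only run SSIM when MSE suggests a real difference. */
function hasUiChanged(prev: Uint8Array, next: Uint8Array): boolean {
  if (mse(prev, next) < MSE_THRESHOLD) return false;
  return globalSsim(prev, next) < SSIM_THRESHOLD;
}
```

The MSE pass is a single loop with no statistics to accumulate, so it can reject near-identical frames before the more expensive structural comparison runs.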

Added a screenshot interceptor that copies each screenshot the agent takes into a buffer (if it differs enough from the previous screenshot) for later use in evals.
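The interception idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `evals/utils/ScreenshotCollector.ts`: `PageLike`, `interceptScreenshots`, `maxBuffered`, and the pluggable `isDifferent` comparator are all hypothetical names, and the page is reduced to an object with a single async `screenshot` method.

```typescript
// Hypothetical minimal page shape; a real browser page object has many more members.
interface PageLike {
  screenshot(options?: Record<string, unknown>): Promise<Uint8Array>;
}

type IsDifferent = (prev: Uint8Array, next: Uint8Array) => boolean;

interface Interception {
  buffer: Uint8Array[]; // screenshots kept for later evaluation
  restore: () => void; // puts the original method back
}

function interceptScreenshots(
  page: PageLike,
  isDifferent: IsDifferent,
  maxBuffered = 8,
): Interception {
  const original = page.screenshot.bind(page);
  const buffer: Uint8Array[] = [];
  let last: Uint8Array | undefined;

  // Replace the method with a wrapper that still returns the screenshot to the caller.
  page.screenshot = async (options?: Record<string, unknown>) => {
    const shot = await original(options);
    // Keep the first screenshot, and later ones only if they differ enough.
    if (last === undefined || isDifferent(last, shot)) {
      buffer.push(shot);
      if (buffer.length > maxBuffered) buffer.shift(); // drop the oldest entry
      last = shot;
    }
    return shot;
  };

  return {
    buffer,
    restore: () => {
      page.screenshot = original;
    },
  };
}
```

Because the wrapper returns the original result unchanged, the agent's own behaviour is unaffected; the collector only observes the screenshots the agent already takes.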

There's also a small refactor of the agent initialization config so that the screenshot collector service can be attached.

test plan

Tests pass locally

changeset-bot · 2025-09-13T19:07:27Z

🦋 Changeset detected

Latest commit: ac4cb00

The changes in this PR will be included in the next version bump.


greptile-apps

Greptile Summary

This PR replaces the existing time-based screenshot service with an intelligent image-diff based approach that uses MSE (Mean Squared Error) and SSIM (Structural Similarity Index) algorithms to determine if UI changes occurred. The new system intercepts agent screenshots and only stores them if they differ significantly from previous captures, reducing storage overhead while maintaining evaluation quality.

Key changes:
• Added Sharp dependency for image processing and comparison algorithms
• Implemented screenshot interception that overrides the page.screenshot method
• Added MSE/SSIM-based image comparison with configurable thresholds
• Introduced optimized screenshot capture with viewport resizing for performance
• Changed default trial count from 3 to 1 for faster evaluation cycles

Confidence score: 3/5

• This PR has moderate risk due to potential race conditions from viewport manipulation.
• The score reflects a solid implementation of the image comparison algorithms, but also a critical issue with concurrent viewport modification that could cause unpredictable behavior in multi-threaded scenarios.
• Pay close attention to evals/utils/imageResize.ts: the viewport manipulation approach needs revision.

6 files reviewed, 1 comment


evals/utils/imageResize.ts

evals/utils/ScreenshotCollector.ts

evals/index.eval.ts

evals/utils/ScreenshotCollector.ts

evals/utils/imageResize.ts

Co-authored-by: Miguel <[email protected]>

evals/evaluator.ts

evals/tasks/agent/webvoyager.ts

evals/tasks/agent/webbench.ts

evals/tasks/agent/osworld.ts

evals/tasks/agent/gaia.ts

miguelg719 · 2025-09-22T22:42:12Z

evals/evaluator.ts

 import { LLMParsedResponse } from "@/lib/inference";
 import { LLMResponse } from "@/lib/llm/LLMClient";
 import { LogLine } from "@/types/log";
 import { z } from "zod";


@tkattkat

This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

# Releases

## @browserbasehq/[email protected]

### Patch Changes

- [#1082](#1082) [`8c0fd01`](8c0fd01) Thanks [@tkattkat](https://github.com/tkattkat)! - Pass stagehand object to agent instead of stagehand page
- [#1104](#1104) [`a1ad06c`](a1ad06c) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix logging for stagehand agent
- [#1066](#1066) [`9daa584`](9daa584) Thanks [@tkattkat](https://github.com/tkattkat)! - Add playwright arguments to agent execute response
- [#1077](#1077) [`7f38b3a`](7f38b3a) Thanks [@tkattkat](https://github.com/tkattkat)! - adds support for stagehand agent in the api
- [#1032](#1032) [`bf2d0e7`](bf2d0e7) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix for zod peer dependency support
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - Replace operator handler with base of new agent
- [#1089](#1089) [`536f366`](536f366) Thanks [@miguelg719](https://github.com/miguelg719)! - Fixed info logs on api session create
- [#1103](#1103) [`889cb6c`](889cb6c) Thanks [@tkattkat](https://github.com/tkattkat)! - patch custom tool support in anthropic cua client
- [#1056](#1056) [`6a002b2`](6a002b2) Thanks [@chrisreadsf](https://github.com/chrisreadsf)! - remove need for duplicate project id if already passed to Stagehand
- [#1090](#1090) [`8ff5c5a`](8ff5c5a) Thanks [@miguelg719](https://github.com/miguelg719)! - Improve failed act error logs
- [#1014](#1014) [`6966201`](6966201) Thanks [@tkattkat](https://github.com/tkattkat)! - replace operator agent with scaffold for new stagehand agent
- [#1107](#1107) [`3ccf335`](3ccf335) Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: url extraction not working inside an array
- [#1102](#1102) [`a99aa48`](a99aa48) Thanks [@miguelg719](https://github.com/miguelg719)! - Add current page and date context to agent
- [#1110](#1110) [`dda52f1`](dda52f1) Thanks [@miguelg719](https://github.com/miguelg719)! - Add support for new Gemini Computer Use models

## @browserbasehq/[email protected]

### Minor Changes

- [#1057](#1057) [`b7be89e`](b7be89e) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - added web voyager ground truth (optional), added web bench, and subset of OSWorld evals which run on a browser

### Patch Changes

- [#1072](#1072) [`dc2d420`](dc2d420) Thanks [@filip-michalsky](https://github.com/filip-michalsky)! - improve evals screenshot service - add img hashing diff to add screenshots and change to screenshot intercepts from the agent
- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies \[[`8c0fd01`](8c0fd01), [`a1ad06c`](a1ad06c), [`9daa584`](9daa584), [`7f38b3a`](7f38b3a), [`bf2d0e7`](bf2d0e7), [`6966201`](6966201), [`536f366`](536f366), [`889cb6c`](889cb6c), [`6a002b2`](6a002b2), [`8ff5c5a`](8ff5c5a), [`6966201`](6966201), [`3ccf335`](3ccf335), [`a99aa48`](a99aa48), [`dda52f1`](dda52f1)]:
  - @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

img diff algo for screenshots

b8acbcf

filip-michalsky added 5 commits September 13, 2025 15:20

intercept agent ss instead of time based trigger

28d9f65

lint

fd0985e

default 1 trial

ef3947e

image resize

4ce7df8

add changeset

001e8d2

filip-michalsky marked this pull request as ready for review September 14, 2025 17:29

greptile-apps bot reviewed Sep 14, 2025

evals/utils/imageResize.ts (outdated, resolved)

filip-michalsky commented Sep 14, 2025

evals/utils/ScreenshotCollector.ts (outdated, resolved)

miguelg719 reviewed Sep 15, 2025

evals/index.eval.ts (outdated, resolved)

miguelg719 reviewed Sep 15, 2025

evals/utils/ScreenshotCollector.ts (outdated, resolved)

miguelg719 reviewed Sep 15, 2025

evals/utils/ScreenshotCollector.ts (outdated, resolved)

miguelg719 reviewed Sep 15, 2025

evals/utils/ScreenshotCollector.ts (outdated, resolved)

miguelg719 reviewed Sep 15, 2025

evals/utils/imageResize.ts (outdated, resolved)

filip-michalsky and others added 6 commits September 19, 2025 12:08

merge main tip

c89875e

Update evals/index.eval.ts

e76719f

Co-authored-by: Miguel <[email protected]>

do NOT resize viewport, intercept ss by default

14a185e

combine time based screenshot service with an interceptor

9702884

add sharp as dep for evals

c255af5

adjust threshold to be less sensitive for screenshot UI changes

cd2f4a9

filip-michalsky requested a review from miguelg719 September 20, 2025 10:31

filip-michalsky and others added 2 commits September 20, 2025 12:33

add screenshot collector to all external benchmarks

2c40691

updates and refactor

814fccb

miguelg719 reviewed Sep 22, 2025

evals/evaluator.ts (outdated, resolved)

miguelg719 and others added 2 commits September 22, 2025 13:50

Update evals/evaluator.ts

783cf33

enable dom agent

9c5f69a

miguelg719 force-pushed the fm/str-798-improve-screenshots-in-evaluator branch from 691334d to 9c5f69a Compare September 22, 2025 21:59

update benchmark runners

be168ef

miguelg719 reviewed Sep 22, 2025

evals/tasks/agent/webvoyager.ts (outdated, resolved)

miguelg719 reviewed Sep 22, 2025

evals/tasks/agent/webbench.ts (outdated, resolved)

miguelg719 reviewed Sep 22, 2025

evals/tasks/agent/osworld.ts (outdated, resolved)

miguelg719 reviewed Sep 22, 2025

evals/tasks/agent/gaia.ts (outdated, resolved)

miguelg719 and others added 2 commits September 22, 2025 15:02

Apply suggestions from code review

e1929bc

cleanup

6aa1229

miguelg719 reviewed Sep 22, 2025


miguelg719 added 3 commits September 22, 2025 15:55

Merge branch 'main' into fm/str-798-improve-screenshots-in-evaluator

59e2c80

Merge branch 'main' into fm/str-798-improve-screenshots-in-evaluator

d029164

zod v3

4d9ecf2

miguelg719 approved these changes Sep 22, 2025


miguelg719 added agent Agentic evals and removed agent Agentic evals labels Sep 22, 2025

miguelg719 force-pushed the fm/str-798-improve-screenshots-in-evaluator branch from 15f737a to 4d9ecf2 Compare September 23, 2025 00:05

patch osworld

ac4cb00

miguelg719 merged commit dc2d420 into main Sep 23, 2025
15 checks passed

This was referenced Sep 22, 2025

Version Packages #1062

Merged

Version Packages pchaganti/gx-stage-hand#1

Open

Version Packages Malumbo21/stagehand#114

Open

Version Packages CloudEngineHub/stagehand#1

Open

Version Packages erickirt/stagehand#72

Open

github-actions bot mentioned this pull request Aug 13, 2025

Version Packages aaag1980/stagehand#1

Open
