# Config Parameters
| parameter | type | default | description |
|---|---|---|---|
| projects | string[] | [] | If empty, all projects are evaluated. Values look like "@web-bench/calculator". |
| agentMode | "local" \| "http" | "local" | |
| agentEndPoint | string | "" | When agentMode is "http", the HTTP API endpoint that requests are sent to. |
| models | string[] | [] | Models to evaluate; entries correspond to the 'models' field in apps/eval/src/model.json. |
| maxdop | number | 30 | Maximum degree of parallelism. |
| logLevel | "info" \| "warn" \| "debug" \| "error" | "info" | |
| httpLimit | number | 10 | When agentMode is "http", the maximum number of concurrent requests. |
| fileDiffLog | boolean | false | Whether to log the diff of files generated by the LLM. Only takes effect at the 'debug' log level. Note: this affects performance; do not enable it during an all-project evaluation. |
| screenshotLog | boolean | false | Whether to log screenshots. Only takes effect at the 'debug' log level. Note: this affects performance; do not enable it during an all-project evaluation. |
| startTask | string | the first task in tasks.jsonl | The task execution starts from (inclusive). |
| endTask | string | the last task in tasks.jsonl | The task execution ends at (inclusive). |
- `local`: This mode has the basic capability to interact with the LLM. The model to use is specified in `apps/eval/src/model.json`.
- `http`: In this mode, requests are sent to the custom Agent via the configured `agentEndPoint`. You can read Agent Server for more details.
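Taken together, the parameters above describe the shape of the eval configuration. The following TypeScript sketch is illustrative only: the `EvalConfig` interface name and the sample values are assumptions based on the table above, not types or defaults taken from the repository.

```ts
// Illustrative sketch: interface name and values are assumptions drawn from the table above.
interface EvalConfig {
  projects: string[]
  agentMode: 'local' | 'http'
  agentEndPoint: string
  models: string[]
  maxdop: number
  logLevel: 'info' | 'warn' | 'debug' | 'error'
  httpLimit: number
  fileDiffLog: boolean
  screenshotLog: boolean
  startTask?: string // defaults to the first task in tasks.jsonl
  endTask?: string // defaults to the last task in tasks.jsonl
}

// Example: evaluate a single project with one model in local mode.
const config: EvalConfig = {
  projects: ['@web-bench/calculator'],
  agentMode: 'local',
  agentEndPoint: '',
  models: ['anthropic/claude-3-opus'],
  maxdop: 30,
  logLevel: 'info',
  httpLimit: 10,
  fileDiffLog: false,
  screenshotLog: false,
}
```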
- For models deployed on OpenRouter, use the native OpenRouter provider with the following configuration:
```json
{
  "title": "anthropic/claude-3-opus",
  "provider": "openrouter",
  "model": "anthropic/claude-3-opus",
  "apiBase": "https://openrouter.ai/api/v1",
  "apiKey": "{{OPENROUTER_API_KEY}}"
}
```
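After adding an entry like this to apps/eval/src/model.json, its title is what you would then list in the `models` parameter described above to include it in an evaluation run.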
- If existing providers do not meet your requirements, you can evaluate specific models by creating a new Provider. This is achieved by extending the `BaseLLM` (a sketch of a custom provider is shown at the end of this section):
```ts
export abstract class BaseLLM {
  abstract provider: string
  abstract option: LLMOption

  info: Model

  abstract chat(
    compiledMessages: ChatMessage[],
    originOptions: CompletionOptions
  ): Promise<{
    request: string
    error?: string
    response: string
  }>
}
```
- `option` – defines parameters for LLM requests:

```ts
export interface LLMOption {
  contextLength: number
  maxTokens: number
  temperature?: number
  apiBase: string
}
```
- `info` – model metadata in `apps/eval/src/model.json`.
- `chat` – custom request method that returns the generated text from the LLM.
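Below is a minimal sketch of a custom provider built on the types above. It is an assumption-heavy illustration, not the repository's implementation: the import path, the `MyCustomLLM` name, the endpoint, the environment variable, and the fetch-based request format are all placeholders to adapt.

```ts
// Illustrative sketch only. Import path, endpoint, and request format are assumptions.
import { BaseLLM, LLMOption, ChatMessage, CompletionOptions } from './base' // hypothetical path

export class MyCustomLLM extends BaseLLM {
  provider = 'my-custom-provider'

  option: LLMOption = {
    contextLength: 128_000,
    maxTokens: 4096,
    temperature: 0,
    apiBase: 'https://api.example.com/v1', // hypothetical endpoint
  }

  async chat(compiledMessages: ChatMessage[], originOptions: CompletionOptions) {
    // Field names on `info` follow the model.json entry shown above (assumption).
    const request = JSON.stringify({
      model: this.info.model,
      messages: compiledMessages,
      max_tokens: this.option.maxTokens,
      temperature: this.option.temperature,
    })

    try {
      // Send the compiled messages to the model API and return the generated text.
      const res = await fetch(`${this.option.apiBase}/chat/completions`, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.MY_API_KEY}`, // hypothetical key variable
        },
        body: request,
      })
      const data = await res.json()
      return { request, response: data.choices?.[0]?.message?.content ?? '' }
    } catch (err) {
      return { request, response: '', error: String(err) }
    }
  }
}
```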