Add support for a tool call quarantine and guardrail API #27

@bismuthsalamander

We want to support guardrail use cases like data loss prevention (DLP), where a tool call is blocked based on the content of its parameters, not the content of the response. For that, we will likely need:

  • A quarantine not just for tool responses, but for tool calls
  • Support in the GuardrailProvider API for tool quarantining
  • A CLI interface to review tool calls
  • A wrapper-generated retrieve_quarantined_tool_response tool that will allow the model to "release" a tool call (that is, pass it through to the server) and receive the response.
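
As a rough illustration of the pieces listed above, here is a minimal sketch of a quarantine that holds tool calls rather than responses. Every name in it (`CallQuarantine`, `hold`, `approve`, `release`, and the `forward` callback) is hypothetical and not taken from the existing codebase; it assumes the reviewer approves calls out of band (for example through the CLI) and that the release tool only forwards calls that have been approved.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class QuarantinedCall:
    tool_name: str
    arguments: Dict[str, Any]
    approved: bool = False  # flipped by a human reviewer, e.g. via the CLI


@dataclass
class CallQuarantine:
    """Hypothetical in-memory store for tool calls held by a guardrail."""

    _calls: Dict[str, QuarantinedCall] = field(default_factory=dict)

    def hold(self, tool_name: str, arguments: Dict[str, Any]) -> str:
        """Store a blocked call and return an opaque ID for later release."""
        call_id = uuid.uuid4().hex
        self._calls[call_id] = QuarantinedCall(tool_name, arguments)
        return call_id

    def approve(self, call_id: str) -> None:
        """Mark a held call as approved (reviewer side)."""
        self._calls[call_id].approved = True

    def release(
        self,
        call_id: str,
        forward: Callable[[str, Dict[str, Any]], Any],
    ) -> Optional[Any]:
        """Forward an approved call to the server; return None otherwise."""
        call = self._calls.get(call_id)
        if call is None or not call.approved:
            return None
        del self._calls[call_id]  # each approval is single-use
        return forward(call.tool_name, call.arguments)
```

In this sketch the release tool would call `release(call_id, forward)`, where `forward` stands in for whatever function passes the call through to the real server. An unapproved or unknown ID yields `None`, which the wrapper could turn into the same generic message it uses for blocked calls.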

The response from the wrapper should be generic. Explaining to the model why the call was blocked increases the risk of (1) a no-approval/YOLO-mode model attempting to "jailbreak" itself by massaging the tool call until it bypasses the guardrail, or (2) accidentally prompt-injecting the model with the error message.
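
One way to keep the response generic is to send the guardrail's reason only to the reviewer-facing log and return a fixed message to the model. The sketch below assumes a result shaped like an MCP tool result (a `content` list of text items plus `isError`); the function name `blocked_response`, the logger name, the message wording, and the choice to include a reference ID are all illustrative assumptions.

```python
import logging
from typing import Any, Dict

# Reviewer-facing log; the model never sees what is written here.
log = logging.getLogger("guardrail")

# Fixed wording, identical for every guardrail and every block reason.
GENERIC_MESSAGE = "This tool call is being held for review."


def blocked_response(call_id: str, reason: str) -> Dict[str, Any]:
    """Build the model-facing result for a quarantined call.

    The reason is logged for the reviewer but deliberately left out of the
    returned payload, so the model gets no hint about which rule fired.
    """
    log.info("quarantined call %s: %s", call_id, reason)
    return {
        "content": [
            {"type": "text", "text": f"{GENERIC_MESSAGE} Reference: {call_id}"}
        ],
        "isError": True,
    }
```

Because the text is the same regardless of the reason, repeated probing by the model reveals nothing about the guardrail, and no guardrail-authored string (which might echo attacker-controlled parameter content) reaches the model's context.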

I'm also unsure about usability. The last time I tested the notifications/tools/list_changed notification, some clients (including Claude Desktop) seemingly ignored it and never refreshed their list of tools. This flow would be smoothest if the quarantine release tool appeared and disappeared automatically, depending on whether any tool calls are awaiting release. An alternative is to have the user instruct the model to repeat the tool call after approving it through the CLI. In that setup, the quarantine would act more like a temporary allowlist. My concern there is that the model (capricious as models tend to be) might not repeat the tool call byte for byte.
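
For the temporary-allowlist alternative, part of the byte-for-byte concern could be softened by matching on a canonical fingerprint of the call instead of its raw bytes. The following is a sketch under that assumption; `call_fingerprint` and `TemporaryAllowlist` are hypothetical names. It only absorbs differences in JSON key order and whitespace. If the model changes any argument value, the repeated call will still miss the allowlist, which is arguably the desired behavior for a guardrail.

```python
import hashlib
import json
from typing import Any, Dict, Set


def call_fingerprint(tool_name: str, arguments: Dict[str, Any]) -> str:
    """Hash a canonical JSON form of the call (sorted keys, no extra spaces)."""
    canonical = json.dumps(
        {"tool": tool_name, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TemporaryAllowlist:
    """Hypothetical single-use allowlist of reviewer-approved calls."""

    def __init__(self) -> None:
        self._approved: Set[str] = set()

    def approve(self, tool_name: str, arguments: Dict[str, Any]) -> None:
        """Record an approval for exactly this tool and argument set."""
        self._approved.add(call_fingerprint(tool_name, arguments))

    def consume(self, tool_name: str, arguments: Dict[str, Any]) -> bool:
        """Return True once for an approved call, then forget the approval."""
        fingerprint = call_fingerprint(tool_name, arguments)
        if fingerprint in self._approved:
            self._approved.remove(fingerprint)
            return True
        return False
```

With this approach, approving `{"to": "a@example.com", "body": "hi"}` would also admit a repeat that lists `body` before `to`, but not one whose body text differs by even one character.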

The first step will be to experiment with the clients enough to see how well the tool update notification is supported.
