
Conversation

miuosz commented Oct 10, 2025

Description

  1. (vertex) Add cached content token count (cachedContentTokenCount) to Google model responses (supported by Gemini models)
  2. (vertex) Add cache usage to Anthropic model responses

Motivation

Without proper usage details, it's impossible to accurately count tokens or estimate the cost.

1. claude-opus-4

before

"usage": {
    "prompt_tokens": 4,
    "completion_tokens": 317,
    "total_tokens": 321
}

after

"usage": {
    "prompt_tokens": 4,
    "completion_tokens": 310,
    "total_tokens": 2078,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 1764
}

2. gemini-2.5-flash

before

"usage": {
    "prompt_tokens": 2924,
    "completion_tokens": 4,
    "total_tokens": 3036,
    "completion_tokens_details": {
        "reasoning_tokens": 108
    }
}

after

"usage": {
    "prompt_tokens": 2924,
    "completion_tokens": 3,
    "total_tokens": 3078,
    "prompt_tokens_details": {
        "cached_tokens": 2532
    },
    "completion_tokens_details": {
        "reasoning_tokens": 151
    }
}
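
For orientation, here is a minimal sketch (not the gateway's actual transform) of how Gemini's usageMetadata could map onto the usage shape shown above. cachedContentTokenCount is the field named in the description; the other usageMetadata field names (promptTokenCount, candidatesTokenCount, totalTokenCount, thoughtsTokenCount) are assumptions about the Vertex response.

// Sketch only: map Vertex AI Gemini usage metadata to the OpenAI-style
// usage block shown in the gemini-2.5-flash example above.
interface GeminiUsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  totalTokenCount?: number;
  cachedContentTokenCount?: number; // cache read count surfaced by this PR
  thoughtsTokenCount?: number; // reasoning tokens (assumed field name)
}

function toOpenAiUsage(meta: GeminiUsageMetadata) {
  return {
    prompt_tokens: meta.promptTokenCount ?? 0,
    completion_tokens: meta.candidatesTokenCount ?? 0,
    total_tokens: meta.totalTokenCount ?? 0,
    prompt_tokens_details: {
      cached_tokens: meta.cachedContentTokenCount ?? 0,
    },
    completion_tokens_details: {
      reasoning_tokens: meta.thoughtsTokenCount ?? 0,
    },
  };
}

Under these assumptions, the gemini-2.5-flash example above would correspond to a cachedContentTokenCount of 2532 and a reasoning count of 151.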

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Testing

Screenshots (if applicable)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues


matter-code-review bot commented Oct 10, 2025

Labels: Code Quality, new feature, maintainability

Description

Summary by MatterAI

🔄 What Changed

This pull request introduces a prompt_tokens_details object with a cached_tokens field into the usage reported for Anthropic and Google Vertex AI (Gemini) chat completion responses, for both non-streaming and streaming API calls. The AnthropicUsage type definition has been updated to reflect the new structure. The shouldSendCacheUsage condition for Google Vertex AI has also been refined so that raw cache fields are only sent when strict OpenAI compliance is off and cache tokens are actually present.
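
As a rough illustration of the mapping described here (a sketch under assumptions, not the PR's actual code), the Anthropic side could look like the following. The helper and interface names are hypothetical; the OpenAI-style field names and the totals come from the claude-opus-4 example above, and the gating mirrors the shouldSendCacheUsage idea referenced in the review comments below.

// Sketch only: build an OpenAI-style usage object from an Anthropic usage
// payload, exposing cached tokens under prompt_tokens_details.
interface AnthropicApiUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

interface OpenAiStyleUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details: { cached_tokens: number };
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

function buildUsage(
  usage: AnthropicApiUsage,
  strictOpenAiCompliance: boolean
): OpenAiStyleUsage {
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  const cacheCreation = usage.cache_creation_input_tokens ?? 0;
  // Keep the raw Anthropic cache counters only outside strict OpenAI
  // compliance and only when cache tokens are actually present.
  const shouldSendCacheUsage =
    !strictOpenAiCompliance && (cacheRead > 0 || cacheCreation > 0);
  return {
    prompt_tokens: usage.input_tokens,
    completion_tokens: usage.output_tokens,
    // Assumed total: input + output + cache read + cache creation,
    // which matches the claude-opus-4 example (4 + 310 + 0 + 1764 = 2078).
    total_tokens:
      usage.input_tokens + usage.output_tokens + cacheRead + cacheCreation,
    prompt_tokens_details: { cached_tokens: cacheRead },
    ...(shouldSendCacheUsage && {
      cache_read_input_tokens: cacheRead,
      cache_creation_input_tokens: cacheCreation,
    }),
  };
}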

🔍 Impact of the Change

This enhancement provides more granular visibility into token consumption by explicitly reporting cached token usage. This improves cost analysis, observability, and understanding of how caching mechanisms affect token usage for these AI models.

📁 Total Files Changed

  • src/providers/anthropic/chatComplete.ts: Added prompt_tokens_details to usage object in direct and streaming responses.
  • src/providers/anthropic/types.ts: Added prompt_tokens_details to the AnthropicUsage interface (a rough shape is sketched after this list).
  • src/providers/google-vertex-ai/chatComplete.ts: Added prompt_tokens_details to usage object in direct and streaming responses, and refined shouldSendCacheUsage condition.
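
As a data-shape reference, the types.ts change might look roughly like this. Only the prompt_tokens_details.cached_tokens addition is confirmed by the summary above; the surrounding fields and their optionality are assumptions about the existing interface.

// Hypothetical shape of the updated interface in src/providers/anthropic/types.ts.
export interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
  // Added by this PR (per the summary above); exact optionality is assumed.
  prompt_tokens_details?: {
    cached_tokens?: number;
  };
}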

🧪 Test Added

N/A

🔒Security Vulnerabilities

N/A

Motivation

To provide comprehensive and granular token usage information, including cached tokens, for Vertex AI Gemini and Anthropic models, thereby improving cost analysis, observability, and understanding of model interactions.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Testing

Screenshots (if applicable)

N/A

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

N/A

Tip

Quality Recommendations

  1. Implement unit and/or integration tests to verify the accurate calculation and reporting of cached_tokens in prompt_tokens_details for both Anthropic and Google Vertex AI models, covering both direct and streaming response scenarios (a minimal test sketch follows this list).

  2. Update the public API documentation for Anthropic and Google Vertex AI chat completion responses to clearly describe the new prompt_tokens_details field and its cached_tokens sub-field, including its purpose and expected values.
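
One possible shape for such a test, sketched against the hypothetical buildUsage mapping shown earlier and using Node's built-in assert so no particular test runner is assumed:

import assert from 'node:assert';
// buildUsage is the hypothetical Anthropic usage mapping sketched above,
// assumed here to live in a local module.
import { buildUsage } from './buildUsage';

// Numbers taken from the claude-opus-4 example in the PR description.
const usage = buildUsage(
  {
    input_tokens: 4,
    output_tokens: 310,
    cache_read_input_tokens: 0,
    cache_creation_input_tokens: 1764,
  },
  false // strictOpenAiCompliance disabled
);

assert.strictEqual(usage.prompt_tokens_details.cached_tokens, 0);
assert.strictEqual(usage.cache_creation_input_tokens, 1764);
assert.strictEqual(usage.total_tokens, 2078);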

Tanka Poem ♫

Cache tokens now seen,
Usage clear, a new insight,
Models learn and grow.
Data flows, a scientist's joy,
Efficiency's sweet reward. 📊✨

Sequence Diagram

sequenceDiagram
    participant Client
    participant AnthropicChatComplete as Anthropic Chat Complete
    participant GoogleVertexAIChatComplete as Google Vertex AI Chat Complete
    participant AnthropicAPI as Anthropic API
    participant GoogleVertexAIApi as Google Vertex AI API

    Client->>AnthropicChatComplete: chatComplete(request)
    AnthropicChatComplete->>AnthropicAPI: POST /messages (payload)
    AnthropicAPI-->>AnthropicChatComplete: API Response (usage, output_tokens, cache_read_input_tokens)
    Note over AnthropicChatComplete: Calculate total_tokens, prompt_tokens, completion_tokens, and populate usage.prompt_tokens_details.cached_tokens
    AnthropicChatComplete-->>Client: Chat Completion Response (with usage.prompt_tokens_details.cached_tokens)

    Client->>GoogleVertexAIChatComplete: chatComplete(request)
    GoogleVertexAIChatComplete->>GoogleVertexAIApi: POST /generateContent (payload)
    GoogleVertexAIApi-->>GoogleVertexAIChatComplete: API Response (usageMetadata with cachedContentTokenCount)
    Note over GoogleVertexAIChatComplete: Calculate total_tokens, prompt_tokens, completion_tokens, and populate usage.prompt_tokens_details.cached_tokens
    GoogleVertexAIChatComplete-->>Client: Chat Completion Response (with usage.prompt_tokens_details.cached_tokens)

    Note over AnthropicChatComplete,GoogleVertexAIChatComplete: Also applies to streaming responses, where usage is updated per chunk.

matter-code-review bot left a comment


Added cache token details for Vertex AI and Anthropic models to improve usage tracking.

matter-code-review bot left a comment

Added cache and audio token details to Vertex AI and Google models usage metadata

Comment on lines +701 to +709
audio_tokens:
  parsedChunk.usageMetadata?.candidatesTokensDetails?.reduce(
    (acc, curr) => {
      if (curr.modality === VERTEX_MODALITY.AUDIO)
        return acc + curr.tokenCount;
      return acc;
    },
    0
  ),

🔴 Security Issue

Issue: When usageMetadata or candidatesTokensDetails is missing, the optional chain short-circuits and audio_tokens ends up undefined instead of 0.
Fix: Fall back to an empty array (?? []) before calling reduce.
Impact: Keeps audio_tokens numeric even when usage details are absent.

Suggested change
audio_tokens:
  parsedChunk.usageMetadata?.candidatesTokensDetails?.reduce(
    (acc, curr) => {
      if (curr.modality === VERTEX_MODALITY.AUDIO)
        return acc + curr.tokenCount;
      return acc;
    },
    0
  ),
audio_tokens:
  (parsedChunk.usageMetadata?.candidatesTokensDetails ?? []).reduce(
    (acc, curr) => {
      if (curr?.modality === VERTEX_MODALITY.AUDIO)
        return acc + (curr?.tokenCount ?? 0);
      return acc;
    },
    0
  ),

narengogi (Collaborator) commented:

@miuosz I've replicated your changes in the google provider (vertex-ai is the hosted GCP service). Additionally, I've added changes to support audio_tokens in both the request (prompt) and response (completion) token details.

references:

matter-code-review bot left a comment

Fix null dereference in audio token calculation and improve type safety

Comment on lines +659 to 669
audio_tokens: parsedChunk.usageMetadata?.promptTokensDetails?.reduce(
  (acc, curr) => {
    if (curr.modality === VERTEX_MODALITY.AUDIO)
      return acc + curr.tokenCount;
    return acc;
  },
  0
),
},
};
}

🔴 Security Issue

Issue: When usageMetadata or promptTokensDetails is missing, the optional chain short-circuits and audio_tokens ends up undefined instead of 0.
Fix: Fall back to an empty array (?? []) before calling reduce.
Impact: Keeps audio_tokens numeric even when usage details are absent.

Suggested change
audio_tokens: parsedChunk.usageMetadata?.promptTokensDetails?.reduce(
  (acc, curr) => {
    if (curr.modality === VERTEX_MODALITY.AUDIO)
      return acc + curr.tokenCount;
    return acc;
  },
  0
),
},
};
}
audio_tokens: (parsedChunk.usageMetadata?.promptTokensDetails ?? []).reduce(
  (
    acc: number,
    curr: { modality: VERTEX_MODALITY; tokenCount: number }
  ) => {
    if (curr?.modality === VERTEX_MODALITY.AUDIO)
      return acc + (curr?.tokenCount ?? 0);
    return acc;
  },
  0
),

matter-code-review bot left a comment

Add cache details to usage for Anthropic and Vertex AI models

Comment on lines 720 to 726
...streamState.usage,
completion_tokens: parsedChunk.usage?.output_tokens,
total_tokens: totalTokens,
prompt_tokens_details: {
  cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
},
},

🟡 Code Quality Issue

Issue: Inconsistent handling of cache_read_input_tokens in usage object. In one place it's included in the spread operator, in another it's explicitly set.
Fix: Standardize the usage object construction to avoid redundancy.
Impact: Improves code maintainability and reduces potential confusion.

Suggested change
...streamState.usage,
completion_tokens: parsedChunk.usage?.output_tokens,
total_tokens: totalTokens,
prompt_tokens_details: {
  cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
},
},
usage: {
  ...streamState.usage,
  completion_tokens: parsedChunk.usage?.output_tokens,
  total_tokens: totalTokens,
  prompt_tokens_details: {
    cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
  },
  ...(shouldSendCacheUsage && {
    cache_read_input_tokens: cache_read_input_tokens,
    cache_creation_input_tokens: cache_creation_input_tokens,
  }),
},

Comment on lines +1013 to 1019
...streamState.usage,
completion_tokens: parsedChunk.usage?.output_tokens,
prompt_tokens: streamState.usage?.prompt_tokens,
total_tokens:
  (streamState.usage?.prompt_tokens || 0) +
  (parsedChunk.usage?.output_tokens || 0),
total_tokens: totalTokens,
prompt_tokens_details: {
  cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
},
},

🟡 Code Quality Issue

Issue: Inconsistent handling of cache_read_input_tokens in usage object. In one place it's included in the spread operator, in another it's explicitly set.
Fix: Standardize the usage object construction to avoid redundancy.
Impact: Improves code maintainability and reduces potential confusion.

Suggested change
...streamState.usage,
completion_tokens: parsedChunk.usage?.output_tokens,
prompt_tokens: streamState.usage?.prompt_tokens,
total_tokens:
  (streamState.usage?.prompt_tokens || 0) +
  (parsedChunk.usage?.output_tokens || 0),
total_tokens: totalTokens,
prompt_tokens_details: {
  cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
},
},
usage: {
  ...streamState.usage,
  completion_tokens: parsedChunk.usage?.output_tokens,
  total_tokens: totalTokens,
  prompt_tokens_details: {
    cached_tokens: streamState.usage?.cache_read_input_tokens ?? 0,
  },
  ...(shouldSendCacheUsage && {
    cache_read_input_tokens: cache_read_input_tokens,
    cache_creation_input_tokens: cache_creation_input_tokens,
  }),
},
