Description
Checklist
- I agree to the terms within the OpenFGA Code of Conduct.
Describe the problem you'd like to have solved
As a consumer of the SDK, I would like to hook it up to my dashboards to get data on several metrics, as well as being able to configure proper logging and tracing.
Describe the ideal solution
For each SDK, users should be able to set up telemetry and connect it to their infra, delivered in three phases:
- Phase 1: Metrics
- Phase 2: Logging
- Phase 3: Tracing
Metrics:
- The latency of the request, split into
  - time from when the dev calls the method until they get a response
  - time from when the SDK issues a request to the API until it gets a response
  - time reported by the server (query_duration_ms header)
- Response codes
- Error codes and method
- Method calls

- Python: Export metrics python-sdk#93
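To make the three latency measurements concrete, here is a minimal sketch of how they could be captured around a single SDK call. Everything in it is hypothetical and only for illustration: `measureCall`, `LatencySample`, the synchronous `send` callback, and the retry-on-429 rule are assumptions, not part of any OpenFGA SDK. The header name is the one mentioned in the list above.

```typescript
// Hypothetical shapes for illustration; not part of any OpenFGA SDK.
interface HttpResponse {
  status: number;
  headers: Record<string, string | undefined>;
}

interface LatencySample {
  methodDurationMs: number; // dev calls the method -> dev gets a response (includes retries)
  requestDurationMs: number; // SDK issues the final HTTP request -> response arrives
  queryDurationMs: number | null; // server-reported time, null if the header is missing
  attempts: number;
}

type Clock = () => number; // returns a timestamp in milliseconds

function measureCall(
  send: () => HttpResponse, // assumed synchronous to keep the sketch short
  now: Clock = Date.now,
  maxAttempts = 3,
): LatencySample {
  const methodStart = now();

  // Time one HTTP attempt on its own, separate from the overall method call.
  const timedSend = () => {
    const start = now();
    const res = send();
    return { res, ms: now() - start };
  };

  let attempt = timedSend();
  let attempts = 1;
  // Assumed retry rule: retry only on 429 (rate limited).
  while (attempt.res.status === 429 && attempts < maxAttempts) {
    attempt = timedSend();
    attempts += 1;
  }

  const raw = attempt.res.headers["query_duration_ms"];
  const parsed = raw === undefined ? NaN : Number(raw);

  return {
    methodDurationMs: now() - methodStart,
    requestDurationMs: attempt.ms,
    queryDurationMs: Number.isFinite(parsed) ? parsed : null,
    attempts,
  };
}
```

Injecting the clock keeps the sketch testable; a real SDK would be asynchronous and would record these values on histograms rather than return them.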
We're thinking of adding "fine-grained" config for the attributes/tags.
Something along the lines of:

    var configuration = new ClientConfiguration() {
        ApiUrl = "http://localhost:8080",
        StoreId = "...",
        Credentials = new Credentials() { ... },
        Telemetry = new OpenFgaTelemetryConfig {
            Metrics = {
                // Per-metric list of attributes to attach
                [TelemetryHistograms.RequestDuration] = {
                    Attributes = [Attributes.AttributeRequestMethod, Attributes.AttributeRequestStoreId]
                },
                [TelemetryCounters.TokenExchangeCountKey] = {
                    Attributes = [Attributes.AttributeRequestModelId, Attributes.AttributeRequestClientId]
                },
            }
        }
    };
    var fgaClient = new OpenFgaClient(configuration);

If telemetry is not configured, we would enable a base set of metrics with minimal attributes; if it is configured, we follow exactly what is configured. We will couple that with warnings in the OTEL config documentation about which attributes could be cost-prohibitive.
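The fallback rule (base set when nothing is configured, otherwise exactly what the user configured) could look roughly like the sketch below. The metric names are the ones proposed in this issue; `MetricsConfig`, `resolveMetrics`, `attributesFor`, and the attribute choices in `DEFAULT_METRICS` are hypothetical and only illustrate the idea.

```typescript
// Metric names come from the proposal; every type and function here is a hypothetical shape.
type MetricName =
  | "fga-client.request.duration"
  | "fga-client.query.duration"
  | "fga-client.credentials.request"
  | "fga-client.request.count";

interface MetricOptions {
  attributes: string[];
}

type MetricsConfig = Partial<Record<MetricName, MetricOptions>>;

// Assumed base set: the metrics the proposal marks as enabled by default.
// The attribute choices are illustrative, not a decided default.
const DEFAULT_METRICS: MetricsConfig = {
  "fga-client.request.duration": {
    attributes: ["fga-client.request.method", "http.response.status_code"],
  },
  "fga-client.query.duration": {
    attributes: ["fga-client.request.method", "http.response.status_code"],
  },
  "fga-client.credentials.request": { attributes: [] },
};

// No configuration -> base set; any configuration -> used as-is, with no merging.
function resolveMetrics(configured?: MetricsConfig): MetricsConfig {
  return configured ?? DEFAULT_METRICS;
}

// Returns the attributes to attach to a metric, or null if the metric is disabled.
function attributesFor(name: MetricName, config: MetricsConfig): string[] | null {
  const options = config[name];
  return options ? [...options.attributes] : null;
}
```

Not merging user configuration with the defaults is a deliberate choice in this sketch: it means a metric or attribute is only ever emitted because it is in the base set or because the user asked for it explicitly.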
Metrics needed
| Metric Name | Type | Enabled by Default | Description |
|---|---|---|---|
| `fga-client.request.duration` | Histogram | Yes | The total request time for FGA requests |
| `fga-client.query.duration` | Histogram | Yes | The amount of time the FGA server took to internally process and evaluate the request |
| `fga-client.credentials.request` | Counter | Yes | The total number of times a new token was requested when using ClientCredentials |
| `fga-client.request.count` | Counter | No | The total number of requests made to the FGA server |
Supported attributes
| Attribute Name | Type | Enabled by Default | Description |
|---|---|---|---|
| `fga-client.response.model_id` | string | Yes | The authorization model ID that the FGA server used |
| `fga-client.request.method` | string | Yes | The FGA method/action that was performed (e.g. Check, ListObjects, ...) in TitleCase |
| `fga-client.request.store_id` | string | Yes | The store ID that was sent as part of the request |
| `fga-client.request.model_id` | string | Yes | The authorization model ID that was sent as part of the request, if any |
| `fga-client.request.client_id` | string | Yes | The client ID associated with the request, if any |
| `fga-client.user` | string | No | The user that is associated with the action of the request for check and list objects |
| `http.request.resend_count` | int | Yes | The number of retries attempted (only sent if the request was retried; a count of 1 means the request was retried once in addition to the original request) |
| `http.response.status_code` | int | Yes | The status code of the response |
| `http.request.method` | string | No | The HTTP method for the request |
| `http.host` | string | Yes | Host identifier of the origin the request was sent to |
| `url.scheme` | string | No | HTTP scheme of the request (http/https) |
| `url.full` | string | No | Full URL of the request |
| `user_agent.original` | string | Yes | User agent used in the query |
This keeps folks from enabling the non-default attributes by accident (they would have to opt in manually), while still giving them visibility into things like:
- Whether a client ID is sending a disproportionate number of calls or token requests (which could indicate a misconfiguration, e.g. they are initializing the SDK multiple times, causing a credential request per call)
- Whether their new model is causing significantly more latency than the old one
- Whether slow requests are due to retries (tracing helps here, but usually traces are sampled and people might miss this)
- The ratio of successes vs. bad requests vs. rate limits by model ID, store ID or client ID, so folks can understand whether a particular client is being called incorrectly or a particular model is problematic
- Whether they still have old clients running that need to be upgraded, and how that correlates with the errors they are getting (through the user agent)
Documentation
- Documentation in each SDK on how to configure logging, metrics and tracing
- Documentation in our docs on setting up tracing and connecting it to Prometheus and Grafana
Configuration
For each SDK, we need to allow the configuration of tracing, metrics and logging.
For example, in the JS SDK, we may add configuration modeled on the server config (note: the config structure may change).
Implementation
We will be using OpenTelemetry, e.g. open-telemetry/opentelemetry-js (for JS) or the appropriate SDK for each language: Language APIs & SDKs
Alternatives and current workarounds
No response
References
No response
Additional context
Roadmap Item: openfga/roadmap#41