feat: Config logging via helm #6312

Merged · 13 commits · Mar 17, 2025
21 changes: 21 additions & 0 deletions docs-gb/installation/helm/README.md
@@ -7,6 +7,7 @@ This section details the key Helm configuration parameters for Envoy, Autoscaling, …
* **Envoy**: Manage pre-stop behaviors and configure access logging to track request-level interactions.
* **Autoscaling** (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.
* **Servers**: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.
* **Logging**: Define log levels for the different components of the system.


## Envoy
@@ -60,3 +61,23 @@ This section details the key Helm configuration parameters for Envoy, Autoscaling, …
| `agent.maxUnloadElapsedTimeMinutes` | components | Max time allowed for a single model unload command for a model on a particular server replica. Lower values expose errors faster. | 15 |
| `agent.maxUnloadRetryCount` | components | Max number of retries for an unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| `agent.unloadGracePeriodSeconds` | components | A grace period guarding against a race between Envoy applying the cluster change that removes a route and the model replica unload command proceeding. | 2 |


## Logging

### Component Log Level

| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| `logging.logLevel` | components | Component-wide logging level, used when an individual component's level is not set. Options: `debug`, `info`, `error`. | info |
| `controller.logLevel` | components | See the zap log levels [here](https://pkg.go.dev/go.uber.org/zap#pkg-constants). | |
| `dataflow.logLevel` | components | See the klogging levels [here](https://dokka.klogging.io/-klogging/io.klogging/-level/index.html). | |
| `scheduler.logLevel` | components | See the logrus log levels [here](https://pkg.go.dev/github.com/sirupsen/logrus#Level). | |
| `modelgateway.logLevel` | components | See the logrus log levels [here](https://pkg.go.dev/github.com/sirupsen/logrus#Level). | |
| `pipelinegateway.logLevel` | components | See the logrus log levels [here](https://pkg.go.dev/github.com/sirupsen/logrus#Level). | |
| `hodometer.logLevel` | components | See the logrus log levels [here](https://pkg.go.dev/github.com/sirupsen/logrus#Level). | |
| `serverConfig.rclone.logLevel` | components | See rclone's `log-level` options [here](https://rclone.org/docs/). | |
| `serverConfig.agent.logLevel` | components | See the logrus log levels [here](https://pkg.go.dev/github.com/sirupsen/logrus#Level). | |

**Notes**:
- The Kafka client library log level is derived from the level passed to the component, which may differ from the level expected by `librdkafka` (a syslog level). In that case we map the configured value to the closest matching level.
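
For example, a values override that sets a global default while pinning individual components might look like the sketch below; the file name is made up for illustration:

```yaml
# my-logging-values.yaml (hypothetical file name)
logging:
  logLevel: debug          # global fallback for components without their own logLevel
scheduler:
  logLevel: warn           # logrus level; overrides the global fallback for the scheduler
serverConfig:
  rclone:
    logLevel: info         # rclone log-level; upper-cased at render time into RCLONE_LOG_LEVEL
```

This could then be applied with something like `helm upgrade seldon-core-v2-setup k8s/helm-charts/seldon-core-v2-setup -f my-logging-values.yaml` (release name and chart location vary per install).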
@@ -410,6 +410,7 @@ spec:
- --leader-elect
- --namespace=$(POD_NAMESPACE)
- --clusterwide=$(CLUSTERWIDE)
- --log-level=$(LOG_LEVEL)
command:
- /manager
env:
@@ -430,6 +431,9 @@ spec:
value: '{{ .Values.security.controlplane.ssl.client.caPath }}'
- name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.controlplane.ssl.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ hasKey .Values.controller "logLevel" | ternary .Values.controller.logLevel
.Values.logging.logLevel }}'
- name: POD_NAMESPACE
valueFrom:
fieldRef:
@@ -558,7 +562,8 @@ spec:
- name: ENABLE_SERVER_AUTOSCALING
value: '{{ .Values.autoscaling.autoscalingServerEnabled }}'
- name: LOG_LEVEL
value: '{{ .Values.scheduler.logLevel }}'
value: '{{ hasKey .Values.scheduler "logLevel" | ternary .Values.scheduler.logLevel
.Values.logging.logLevel }}'
- name: ALLOW_PLAINTXT
value: "true"
- name: POD_NAMESPACE
@@ -711,7 +716,8 @@ spec:
- name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.controlplane.ssl.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ .Values.pipelinegateway.logLevel }}'
value: '{{ hasKey .Values.pipelinegateway "logLevel" | ternary .Values.pipelinegateway.logLevel
.Values.logging.logLevel }}'
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
@@ -843,7 +849,8 @@ spec:
- name: ENVOY_DOWNSTREAM_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.envoy.ssl.downstream.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ .Values.modelgateway.logLevel }}'
value: '{{ hasKey .Values.modelgateway "logLevel" | ternary .Values.modelgateway.logLevel
.Values.logging.logLevel }}'
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
@@ -895,7 +902,8 @@ spec:
- name: METRICS_LEVEL
value: '{{ .Values.hodometer.metricsLevel }}'
- name: LOG_LEVEL
value: '{{ .Values.hodometer.logLevel }}'
value: '{{ hasKey .Values.hodometer "logLevel" | ternary .Values.hodometer.logLevel
.Values.logging.logLevel }}'
- name: EXTRA_PUBLISH_URLS
value: '{{ .Values.hodometer.extraPublishUrls }}'
- name: CONTROL_PLANE_SECURITY_PROTOCOL
@@ -1058,6 +1066,12 @@ spec:
}}'
- name: SELDON_CORES_COUNT
value: '{{ .Values.dataflow.cores }}'
- name: SELDON_LOG_LEVEL_APP
value: '{{ hasKey .Values.dataflow "logLevel" | ternary .Values.dataflow.logLevel
.Values.logging.logLevel | upper }}'
- name: SELDON_LOG_LEVEL_KAFKA
value: '{{ hasKey .Values.dataflow "logLevel" | ternary .Values.dataflow.logLevel
.Values.logging.logLevel | upper }}'
- name: SELDON_UPSTREAM_HOST
value: seldon-scheduler
- name: SELDON_UPSTREAM_PORT
@@ -1136,7 +1150,11 @@ metadata:
spec:
podSpec:
containers:
- image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository
- env:
- name: RCLONE_LOG_LEVEL
value: '{{ hasKey .Values.serverConfig.rclone "logLevel" | ternary .Values.serverConfig.rclone.logLevel
.Values.logging.logLevel | upper }}'
image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository
}}:{{ .Values.serverConfig.rclone.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.rclone.image.pullPolicy }}'
lifecycle:
@@ -1232,7 +1250,8 @@ spec:
- name: MLSERVER_TRACING_SERVER
value: '{{ .Values.opentelemetry.endpoint }}'
- name: SELDON_LOG_LEVEL
value: '{{ .Values.serverConfig.agent.logLevel }}'
value: '{{ hasKey .Values.serverConfig.agent "logLevel" | ternary .Values.serverConfig.agent.logLevel
.Values.logging.logLevel }}'
- name: SELDON_SERVER_HTTP_PORT
value: "9000"
- name: SELDON_SERVER_GRPC_PORT
@@ -1404,7 +1423,11 @@ metadata:
spec:
podSpec:
containers:
- image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository
- env:
- name: RCLONE_LOG_LEVEL
value: '{{ hasKey .Values.serverConfig.rclone "logLevel" | ternary .Values.serverConfig.rclone.logLevel
.Values.logging.logLevel | upper }}'
image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository
}}:{{ .Values.serverConfig.rclone.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.rclone.image.pullPolicy }}'
lifecycle:
@@ -1498,7 +1521,8 @@ spec:
- name: ENVOY_UPSTREAM_CLIENT_TLS_CA_LOCATION
value: '{{ .Values.security.envoy.ssl.upstream.server.clientCaPath }}'
- name: SELDON_LOG_LEVEL
value: '{{ .Values.serverConfig.agent.logLevel }}'
value: '{{ hasKey .Values.serverConfig.agent "logLevel" | ternary .Values.serverConfig.agent.logLevel
.Values.logging.logLevel }}'
- name: SELDON_SERVER_HTTP_PORT
value: "9000"
- name: SELDON_SERVER_GRPC_PORT
15 changes: 11 additions & 4 deletions k8s/helm-charts/seldon-core-v2-setup/values.yaml
@@ -79,6 +79,15 @@ opentelemetry:
disable: false
ratio: 1

# logging
# Global setting, used when an individual component's logLevel is not set.
# Users should set a value from:
# fatal, error, warn, info, debug, trace
# If it is also used for .rclone.logLevel, the allowed set reduces to:
# debug, info, error
logging:
logLevel: info

hodometer:
image:
pullPolicy: IfNotPresent
@@ -131,7 +140,6 @@ modelgateway:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
logLevel: warn

pipelinegateway:
image:
@@ -147,7 +155,6 @@ pipelinegateway:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
logLevel: warn

dataflow:
image:
@@ -226,7 +233,6 @@ scheduler:
runAsGroup: 1000
runAsNonRoot: true
schedulerReadyTimeoutSeconds: 600
logLevel: warn

autoscaling:
autoscalingModelEnabled: false
@@ -252,6 +258,8 @@ serverConfig:
resources:
cpu: 50m
memory: 128Mi
# should follow `log-level` from https://rclone.org/docs/
logLevel: info

agent:
image:
@@ -274,7 +282,6 @@ serverConfig:
resources:
cpu: 200m
memory: 1Gi
logLevel: warn

mlserver:
image:
15 changes: 11 additions & 4 deletions k8s/helm-charts/seldon-core-v2-setup/values.yaml.template
@@ -79,6 +79,15 @@ opentelemetry:
disable: false
ratio: 1

# logging
# Global setting, used when an individual component's logLevel is not set.
# Users should set a value from:
# fatal, error, warn, info, debug, trace
# If it is also used for .rclone.logLevel, the allowed set reduces to:
# debug, info, error
logging:
logLevel: info

hodometer:
image:
pullPolicy: IfNotPresent
@@ -131,7 +140,6 @@ modelgateway:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
logLevel: warn

pipelinegateway:
image:
@@ -147,7 +155,6 @@ pipelinegateway:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
logLevel: warn

dataflow:
image:
@@ -226,7 +233,6 @@ scheduler:
runAsGroup: 1000
runAsNonRoot: true
schedulerReadyTimeoutSeconds: 600
logLevel: warn

autoscaling:
autoscalingModelEnabled: false
@@ -252,6 +258,8 @@ serverConfig:
resources:
cpu: 50m
memory: 128Mi
# should follow `log-level` from https://rclone.org/docs/
logLevel: info

agent:
image:
@@ -274,7 +282,6 @@ serverConfig:
resources:
cpu: 200m
memory: 1Gi
logLevel: warn

mlserver:
image:
2 changes: 2 additions & 0 deletions k8s/kustomize/helm-components-sc/patch_controller.yaml
@@ -33,3 +33,5 @@ spec:
value: '{{ .Values.security.controlplane.ssl.client.caPath }}'
- name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.controlplane.ssl.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ hasKey .Values.controller "logLevel" | ternary .Values.controller.logLevel .Values.logging.logLevel }}'
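
The fallback used throughout these patches is Sprig's `ternary` fed by `hasKey` through a pipe, which places the boolean condition last: `cond | ternary a b` yields `a` when the component defines its own `logLevel` and `b` (the global `.Values.logging.logLevel`) otherwise. A minimal standalone sketch, with assumed values shown in the comments:

```yaml
# Assumed values for this sketch:
#   logging:
#     logLevel: info
#   controller: {}   # logLevel deliberately unset
env:
  - name: LOG_LEVEL
    # hasKey .Values.controller "logLevel" is false here, so ternary
    # falls back to .Values.logging.logLevel and this renders as 'info'.
    value: '{{ hasKey .Values.controller "logLevel" | ternary .Values.controller.logLevel .Values.logging.logLevel }}'
```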
4 changes: 4 additions & 0 deletions k8s/kustomize/helm-components-sc/patch_dataflow.yaml
@@ -53,6 +53,10 @@ spec:
value: '{{ .Values.security.kafka.ssl.client.endpointIdentificationAlgorithm }}'
- name: SELDON_CORES_COUNT
value: '{{ .Values.dataflow.cores }}'
- name: SELDON_LOG_LEVEL_APP
value: '{{ hasKey .Values.dataflow "logLevel" | ternary .Values.dataflow.logLevel .Values.logging.logLevel | upper }}'
- name: SELDON_LOG_LEVEL_KAFKA
> **Reviewer comment (Member):** Was initially wondering whether we would need a similar setup to the Go one on the dataflow-engine side, but it appears that it knows how to take the same levels as syslog.

value: '{{ hasKey .Values.dataflow "logLevel" | ternary .Values.dataflow.logLevel .Values.logging.logLevel | upper }}'
resources:
requests:
cpu: '{{ .Values.dataflow.resources.cpu }}'
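
The dataflow levels are piped through `upper`, matching klogging's upper-case level names. With the chart default (`logging.logLevel: info`) and no `dataflow.logLevel` override, the rendered container env would contain the following (a rendering sketch, not part of the diff):

```yaml
# Rendered dataflow-engine env (sketch): no dataflow.logLevel set,
# so both variables fall back to the global level and are upper-cased.
- name: SELDON_LOG_LEVEL_APP
  value: 'INFO'
- name: SELDON_LOG_LEVEL_KAFKA
  value: 'INFO'
```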
2 changes: 1 addition & 1 deletion k8s/kustomize/helm-components-sc/patch_hodometer.yaml
@@ -22,7 +22,7 @@ spec:
- name: METRICS_LEVEL
value: '{{ .Values.hodometer.metricsLevel }}'
- name: LOG_LEVEL
value: '{{ .Values.hodometer.logLevel }}'
value: '{{ hasKey .Values.hodometer "logLevel" | ternary .Values.hodometer.logLevel .Values.logging.logLevel }}'
- name: EXTRA_PUBLISH_URLS
value: '{{ .Values.hodometer.extraPublishUrls }}'
- name: CONTROL_PLANE_SECURITY_PROTOCOL
7 changes: 5 additions & 2 deletions k8s/kustomize/helm-components-sc/patch_mlserver.yaml
@@ -6,7 +6,10 @@ spec:
podSpec:
imagePullSecrets: []
containers:
- image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository }}:{{ .Values.serverConfig.rclone.image.tag }}'
- env:
- name: RCLONE_LOG_LEVEL
value: '{{ hasKey .Values.serverConfig.rclone "logLevel" | ternary .Values.serverConfig.rclone.logLevel .Values.logging.logLevel | upper }}'
image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository }}:{{ .Values.serverConfig.rclone.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.rclone.image.pullPolicy }}'
name: rclone
resources:
@@ -73,7 +76,7 @@ spec:
- name: MLSERVER_TRACING_SERVER
value: '{{ .Values.opentelemetry.endpoint }}'
- name: SELDON_LOG_LEVEL
value: '{{ .Values.serverConfig.agent.logLevel }}'
value: '{{ hasKey .Values.serverConfig.agent "logLevel" | ternary .Values.serverConfig.agent.logLevel .Values.logging.logLevel }}'
image: '{{ .Values.serverConfig.agent.image.registry }}/{{ .Values.serverConfig.agent.image.repository }}:{{ .Values.serverConfig.agent.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.agent.image.pullPolicy }}'
name: agent
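
`RCLONE_LOG_LEVEL` is rclone's environment-variable form of its `--log-level` flag, which accepts `DEBUG`, `INFO`, `NOTICE`, and `ERROR`; hence the `| upper` pipe. With the chart default the rclone container renders as (a sketch, not part of the diff):

```yaml
# Rendered rclone container env (sketch): serverConfig.rclone.logLevel
# defaults to info and is upper-cased to match rclone's level names.
- name: RCLONE_LOG_LEVEL
  value: 'INFO'
```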
2 changes: 1 addition & 1 deletion k8s/kustomize/helm-components-sc/patch_modelgateway.yaml
@@ -71,7 +71,7 @@ spec:
- name: ENVOY_DOWNSTREAM_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.envoy.ssl.downstream.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ .Values.modelgateway.logLevel }}'
value: '{{ hasKey .Values.modelgateway "logLevel" | ternary .Values.modelgateway.logLevel .Values.logging.logLevel }}'
resources:
requests:
cpu: '{{ .Values.modelgateway.resources.cpu }}'
k8s/kustomize/helm-components-sc/patch_pipelinegateway.yaml
@@ -86,4 +86,4 @@ spec:
- name: CONTROL_PLANE_SERVER_TLS_CA_LOCATION
value: '{{ .Values.security.controlplane.ssl.client.serverCaPath }}'
- name: LOG_LEVEL
value: '{{ .Values.pipelinegateway.logLevel }}'
value: '{{ hasKey .Values.pipelinegateway "logLevel" | ternary .Values.pipelinegateway.logLevel .Values.logging.logLevel }}'
2 changes: 1 addition & 1 deletion k8s/kustomize/helm-components-sc/patch_scheduler.yaml
@@ -76,7 +76,7 @@ spec:
- name: ENABLE_SERVER_AUTOSCALING
value: '{{ .Values.autoscaling.autoscalingServerEnabled }}'
- name: LOG_LEVEL
value: '{{ .Values.scheduler.logLevel }}'
value: '{{ hasKey .Values.scheduler "logLevel" | ternary .Values.scheduler.logLevel .Values.logging.logLevel }}'
volumeClaimTemplates:
- name: scheduler-state
spec:
7 changes: 5 additions & 2 deletions k8s/kustomize/helm-components-sc/patch_triton.yaml
@@ -6,7 +6,10 @@ spec:
podSpec:
imagePullSecrets: []
containers:
- image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository }}:{{ .Values.serverConfig.rclone.image.tag }}'
- env:
- name: RCLONE_LOG_LEVEL
value: '{{ hasKey .Values.serverConfig.rclone "logLevel" | ternary .Values.serverConfig.rclone.logLevel .Values.logging.logLevel | upper }}'
image: '{{ .Values.serverConfig.rclone.image.registry }}/{{ .Values.serverConfig.rclone.image.repository }}:{{ .Values.serverConfig.rclone.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.rclone.image.pullPolicy }}'
name: rclone
resources:
@@ -71,7 +74,7 @@ spec:
- name: ENVOY_UPSTREAM_CLIENT_TLS_CA_LOCATION
value: '{{ .Values.security.envoy.ssl.upstream.server.clientCaPath }}'
- name: SELDON_LOG_LEVEL
value: '{{ .Values.serverConfig.agent.logLevel }}'
value: '{{ hasKey .Values.serverConfig.agent "logLevel" | ternary .Values.serverConfig.agent.logLevel .Values.logging.logLevel }}'
image: '{{ .Values.serverConfig.agent.image.registry }}/{{ .Values.serverConfig.agent.image.repository }}:{{ .Values.serverConfig.agent.image.tag }}'
imagePullPolicy: '{{ .Values.serverConfig.agent.image.pullPolicy }}'
name: agent