| 1 | +--- |
| 2 | +layout: blog |
| 3 | +title: "PSI Metrics for Kubernetes Graduates to Beta" |
| 4 | +date: 2025-XX-XX |
| 5 | +draft: true |
| 6 | +slug: introducing-psi-metrics-beta |
| 7 | +author: "Haowei Cai (Google)" |
| 8 | +translator: > |
| 9 | + [Wenjun Lou](https://github.com/Eason1118) |
| 10 | +--- |
| 11 | +<!-- |
| 12 | +layout: blog |
| 13 | +title: "PSI Metrics for Kubernetes Graduates to Beta" |
| 14 | +date: 2025-XX-XX |
| 15 | +draft: true |
| 16 | +slug: introducing-psi-metrics-beta |
| 17 | +author: "Haowei Cai (Google)" |
| 18 | +--> |
| 19 | + |
| 20 | +<!-- |
| 21 | +As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, **Pressure Stall Information (PSI) Metrics** has graduated to Beta. |
| 22 | +--> |
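| 23 | +As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical.
| 24 | +We are pleased to announce that, as of Kubernetes v1.34, **Pressure Stall Information (PSI) metrics** have graduated to Beta.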
| 25 | + |
| 26 | +<!-- |
| 27 | +## What is Pressure Stall Information (PSI)? |
| 28 | +--> |
| 29 | +## What is Pressure Stall Information (PSI)? {#what-is-pressure-stall-information-psi}
| 30 | + |
| 31 | +<!-- |
| 32 | +[Pressure Stall Information (PSI)](https://docs.kernel.org/accounting/psi.html) is a feature of the Linux kernel (version 4.20 and later) |
| 33 | +that provides a canonical way to quantify pressure on infrastructure resources, |
| 34 | +in terms of whether demand for a resource exceeds current supply. |
| 35 | +It moves beyond simple resource utilization metrics and instead |
| 36 | +measures the amount of time that tasks are stalled due to resource contention. |
| 37 | +This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance. |
| 38 | +--> |
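| 39 | +[Pressure Stall Information (PSI)](https://docs.kernel.org/accounting/psi.html) is a feature of the Linux kernel (version 4.20 and later)
| 40 | +that provides a canonical way to quantify pressure on infrastructure resources,
| 41 | +that is, whether demand for a resource exceeds its current supply.
| 42 | +Rather than reporting resource utilization alone, it measures the amount of time that tasks are stalled due to resource contention.
| 43 | +This makes it an effective way to identify and diagnose resource bottlenecks that can affect application performance.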
| 44 | + |
| 45 | +<!-- |
| 46 | +PSI exposes metrics for CPU, memory, and I/O, categorized as either `some` or `full` pressure: |
| 47 | +--> |
| 48 | +PSI exposes metrics for CPU, memory, and I/O, each categorized as either `some` or `full` pressure:
| 49 | + |
| 50 | +<!-- |
| 51 | +`some` |
| 52 | +: The percentage of time that **at least one** task is stalled on a resource. This indicates some level of resource contention. |
| 53 | +--> |
| 54 | +`some` |
| 55 | +: The percentage of time that **at least one** task is stalled on a resource. This indicates some level of resource contention.
| 56 | + |
| 57 | +<!-- |
| 58 | +`full` |
| 59 | +: The percentage of time that **all** non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck. |
| 60 | +{{< figure src="/images/psi-metrics-some-vs-full.svg" alt="Diagram illustrating the difference between 'some' and 'full' PSI pressure." title="PSI: 'Some' vs. 'Full' Pressure" >}} |
| 61 | +--> |
| 62 | +`full` |
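| 63 | +: The percentage of time that **all** non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.
| 64 | +{{< figure src="/images/psi-metrics-some-vs-full.svg" alt="Diagram illustrating the difference between 'some' and 'full' PSI pressure." title="PSI: 'Some' vs. 'Full' Pressure" >}}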
| 65 | + |
| 66 | +<!-- |
| 67 | +These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time. |
| 68 | +--> |
| 69 | +These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, showing how resource pressure changes over time.
| 70 | + |
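To make the `some`/`full` split and the three averaging windows concrete, here is a minimal Python sketch that parses text in the format the kernel documents for the `/proc/pressure/*` files. The function name `parse_psi_line` and the sample values are illustrative assumptions, not part of Kubernetes or the kernel.

```python
def parse_psi_line(line: str) -> tuple[str, dict[str, float]]:
    """Parse one PSI line into (kind, fields), where kind is 'some' or 'full'."""
    kind, *pairs = line.split()
    if kind not in ("some", "full"):
        raise ValueError(f"unexpected PSI line kind: {kind!r}")
    fields = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        fields[key] = float(value)
    return kind, fields


# Made-up sample in the documented layout: avg10/avg60/avg300 are percentages
# over the 10 s, 60 s, and 300 s windows; total is cumulative stall time in
# microseconds.
sample = """\
some avg10=1.25 avg60=0.40 avg300=0.10 total=358412
full avg10=0.50 avg60=0.12 avg300=0.03 total=120034
"""

psi = dict(parse_psi_line(line) for line in sample.splitlines())
print(psi["some"]["avg10"])  # 1.25
```

On a Linux node with PSI enabled, the same function could be applied to the lines of `/proc/pressure/cpu`, `/proc/pressure/memory`, or `/proc/pressure/io`.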
| 71 | +<!-- |
| 72 | +## PSI metrics in Kubernetes |
| 73 | +--> |
| 74 | +## PSI metrics in Kubernetes {#psi-metrics-in-kubernetes}
| 75 | + |
| 76 | +<!-- |
| 77 | +With the `KubeletPSI` feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the [Summary API](/docs/reference/instrumentation/node-metrics#summary-api-source) and the `/metrics/cadvisor` Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level. |
| 78 | +--> |
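| 79 | +With the `KubeletPSI` feature gate enabled, the kubelet can collect PSI metrics from the Linux kernel
| 80 | +and expose them through two channels: the [Summary API](/docs/reference/instrumentation/node-metrics#summary-api-source)
| 81 | +and the `/metrics/cadvisor` Prometheus endpoint. This lets you monitor and alert on resource pressure at the node, pod, and container level.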
| 82 | + |
| 83 | +<!-- |
| 84 | +The following new metrics are available in Prometheus exposition format via `/metrics/cadvisor`: |
| 85 | +--> |
| 86 | +The following new metrics are available in Prometheus exposition format via `/metrics/cadvisor`:
| 87 | +* `container_pressure_cpu_stalled_seconds_total` |
| 88 | +* `container_pressure_cpu_waiting_seconds_total` |
| 89 | +* `container_pressure_memory_stalled_seconds_total` |
| 90 | +* `container_pressure_memory_waiting_seconds_total` |
| 91 | +* `container_pressure_io_stalled_seconds_total` |
| 92 | +* `container_pressure_io_waiting_seconds_total` |
| 93 | + |
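Because the metrics above are cumulative `_seconds_total` counters, a useful quantity is the share of wall-clock time spent stalled between two scrapes. The following Python sketch shows that arithmetic; `stall_fraction` and the sample numbers are hypothetical, and a Prometheus `rate()` query would normally do the equivalent work for you.

```python
def stall_fraction(prev_total: float, curr_total: float, interval_seconds: float) -> float:
    """Fraction of the interval spent stalled, from two counter samples.

    prev_total and curr_total are values of one *_seconds_total counter taken
    interval_seconds apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        # Assumption: a decrease means the counter was reset (for example,
        # the container restarted), so count from zero.
        delta = curr_total
    return delta / interval_seconds


# Hypothetical samples of container_pressure_memory_waiting_seconds_total
# taken 30 seconds apart.
fraction = stall_fraction(prev_total=10.0, curr_total=13.0, interval_seconds=30.0)
print(f"{fraction:.0%} of the interval was spent waiting on memory")
```

As far as the cAdvisor metric descriptions indicate, the `waiting` counters correspond to `some` pressure and the `stalled` counters to `full` pressure; treat that mapping as something to confirm against the documentation for your version.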
| 94 | +<!-- |
| 95 | +These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to: |
| 96 | +--> |
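| 97 | +Together with the data from the Summary API, these metrics provide a granular view of resource pressure,
| 98 | +so you can pinpoint the source of performance issues and take corrective action.
| 99 | +For example, you can use these metrics to: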
| 100 | + |
| 101 | +<!-- |
| 102 | +* **Identify memory leaks:** A steadily increasing `some` pressure for memory can indicate a memory leak in an application. |
| 103 | +--> |
| 104 | +* **Identify memory leaks:** A steadily increasing `some` pressure for memory can indicate a memory leak in an application.
| 105 | + |
| 106 | +<!-- |
| 107 | +* **Optimize resource requests and limits:** By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits. |
| 108 | +--> |
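| 109 | +* **Optimize resource requests and limits:** By understanding the resource pressure of your workloads, you can tune their resource requests and limits more accurately.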
| 110 | + |
| 111 | +<!-- |
| 112 | +* **Autoscale workloads:** You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally. |
| 113 | +--> |
| 114 | +* **Autoscale workloads:** You can use PSI metrics to trigger autoscaling events, so that your workloads have the resources they need to perform well.
| 115 | + |
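One simple way to act on the use cases above is to react only when pressure stays high across several consecutive samples, which avoids triggering on a single spike. The Python sketch below is illustrative only: `sustained_pressure`, the threshold, and the sample values are assumptions, not a Kubernetes autoscaling API.

```python
from typing import Sequence


def sustained_pressure(fractions: Sequence[float], threshold: float, min_samples: int) -> bool:
    """Return True if the most recent min_samples values all exceed threshold."""
    if min_samples <= 0:
        raise ValueError("min_samples must be positive")
    recent = list(fractions)[-min_samples:]
    return len(recent) == min_samples and all(value > threshold for value in recent)


# Hypothetical per-scrape stall fractions for one container (0.0 to 1.0).
history = [0.02, 0.05, 0.31, 0.28, 0.35]

if sustained_pressure(history, threshold=0.25, min_samples=3):
    print("pressure has stayed above 25% for the last 3 samples")
```

The threshold and window length here are arbitrary; suitable values depend on the workload and on how often the metrics are scraped.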
| 116 | +<!-- |
| 117 | +## How to enable PSI metrics |
| 118 | +--> |
| 119 | +## How to enable PSI metrics {#how-to-enable-psi-metrics}
| 120 | + |
| 121 | +<!-- |
| 122 | +To enable PSI metrics in your Kubernetes cluster, you need to: |
| 123 | +--> |
| 124 | +To enable PSI metrics in your Kubernetes cluster, you need to:
| 125 | + |
| 126 | +<!-- |
| 127 | +1. **Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.** |
| 128 | +--> |
| 129 | +1. **Ensure your nodes are running Linux kernel version 4.20 or later and are using cgroup v2.**
| 130 | + |
| 131 | +<!-- |
| 132 | +2. **Enable the `KubeletPSI` feature gate on the kubelet.** |
| 133 | +--> |
| 134 | +2. **Enable the `KubeletPSI` feature gate on the kubelet.**
| 135 | + |
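The node prerequisites in the first step can be checked with a short script. This Python sketch assumes the kernel release string starts with `major.minor` and that the presence of `cgroup.controllers` at the cgroup mount root indicates cgroup v2; the function names are hypothetical.

```python
import os
import re


def kernel_supports_psi(release: str) -> bool:
    """Return True if a kernel release string such as '5.15.0-91-generic' is 4.20 or later."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (4, 20)


def uses_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Heuristic: the unified (v2) hierarchy has cgroup.controllers at its root."""
    return os.path.exists(os.path.join(cgroup_root, "cgroup.controllers"))


# Literal release strings keep the example runnable on any platform.
for release in ("4.19.0", "4.20.1", "5.15.0-91-generic"):
    print(release, kernel_supports_psi(release))
```

On a real node you could pass `platform.release()` to `kernel_supports_psi`. A recent enough version number does not by itself guarantee that PSI is compiled in and enabled, so checking that `/proc/pressure` exists is a reasonable extra step.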
| 136 | +<!-- |
| 137 | +Once enabled, you can start scraping the `/metrics/cadvisor` endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics. |
| 138 | +--> |
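| 139 | +Once enabled, you can scrape the `/metrics/cadvisor` endpoint with a Prometheus-compatible monitoring solution,
| 140 | +or query the Summary API, to collect and visualize the new PSI metrics.
| 141 | +Note that PSI is a Linux kernel feature, so these metrics are not available on Windows nodes.
| 142 | +Your cluster can contain a mix of Linux and Windows nodes; on the Windows nodes, the kubelet does not expose PSI metrics.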
| 143 | + |
| 144 | +<!-- |
| 145 | +## What's next? |
| 146 | +--> |
| 147 | +## What's next? {#whats-next}
| 148 | + |
| 149 | +<!-- |
| 150 | +We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us. |
| 151 | +--> |
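| 152 | +We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback.
| 153 | +While the feature is in Beta, we are actively improving and extending it toward a stable GA release.
| 154 | +We encourage you to try it out and share your experiences with us.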
| 155 | + |
| 156 | +<!-- |
| 157 | +To learn more about PSI metrics, check out the official [Kubernetes documentation](/docs/reference/instrumentation/understand-psi-metrics/). You can also get involved in the conversation on the [#sig-node](https://kubernetes.slack.com/messages/sig-node) Slack channel. |
| 158 | +--> |
| 159 | +To learn more about PSI metrics, see the official [Kubernetes documentation](/docs/reference/instrumentation/understand-psi-metrics/).
| 160 | +You can also join the conversation in the [#sig-node](https://kubernetes.slack.com/messages/sig-node) Slack channel.