
Commit 675531a

Merge pull request #51956 from Eason1118/blog-2
[zh-cn] Add blog 2025-08-08-introducing-psi-metrics-beta
2 parents 38fe295 + 337dad4 commit 675531a


1 file changed, +160 -0 lines changed
  • content/zh-cn/blog/_posts/2025-08-08-introducing-psi-metrics-beta

@@ -0,0 +1,160 @@
---
layout: blog
title: "PSI Metrics for Kubernetes Graduates to Beta"
date: 2025-XX-XX
draft: true
slug: introducing-psi-metrics-beta
author: "Haowei Cai (Google)"
translator: >
  [Wenjun Lou](https://github.com/Eason1118)
---
<!--
layout: blog
title: "PSI Metrics for Kubernetes Graduates to Beta"
date: 2025-XX-XX
draft: true
slug: introducing-psi-metrics-beta
author: "Haowei Cai (Google)"
-->

<!--
As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, **Pressure Stall Information (PSI) Metrics** has graduated to Beta.
-->
As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical.
We are excited to announce that as of Kubernetes v1.34, **Pressure Stall Information (PSI) metrics** have graduated to Beta.

<!--
## What is Pressure Stall Information (PSI)?
-->
## What is Pressure Stall Information (PSI)? {#what-is-pressure-stall-information-psi}

<!--
[Pressure Stall Information (PSI)](https://docs.kernel.org/accounting/psi.html) is a feature of the Linux kernel (version 4.20 and later)
that provides a canonical way to quantify pressure on infrastructure resources,
in terms of whether demand for a resource exceeds current supply.
It moves beyond simple resource utilization metrics and instead
measures the amount of time that tasks are stalled due to resource contention.
This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.
-->
[Pressure Stall Information (PSI)](https://docs.kernel.org/accounting/psi.html) is a feature of the Linux kernel (version 4.20 and later)
that provides a canonical way to quantify pressure on infrastructure resources,
that is, whether demand for a resource exceeds the current supply.
Rather than reporting simple resource utilization, it measures how much time tasks spend stalled because of resource contention.
This makes it a powerful way to identify and diagnose resource bottlenecks that can hurt application performance.

<!--
PSI exposes metrics for CPU, memory, and I/O, categorized as either `some` or `full` pressure:
-->
PSI exposes metrics for CPU, memory, and I/O, categorized as either `some` or `full` pressure:

<!--
`some`
: The percentage of time that **at least one** task is stalled on a resource. This indicates some level of resource contention.
-->
`some`
: The percentage of time during which **at least one** task is stalled on a resource. This indicates some level of resource contention.

<!--
`full`
: The percentage of time that **all** non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.
{{< figure src="/images/psi-metrics-some-vs-full.svg" alt="Diagram illustrating the difference between 'some' and 'full' PSI pressure." title="PSI: 'Some' vs. 'Full' Pressure" >}}
-->
`full`
: The percentage of time during which **all** non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.
{{< figure src="/images/psi-metrics-some-vs-full.svg" alt="Diagram illustrating the difference between 'some' and 'full' PSI pressure." title="PSI: 'Some' vs. 'Full' Pressure" >}}
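
To make the difference between `some` and `full` concrete, here is a minimal Python sketch built on a deliberately simplified model: time is split into equal ticks, and for each tick we assume we know how many non-idle tasks exist and how many of them are stalled on the resource. The kernel tracks these states continuously rather than per tick, and the function name `some_full_pressure` is hypothetical; the sketch only illustrates the two definitions.

```python
from typing import List, Tuple

# One sample per equal-length tick: (non_idle_tasks, stalled_tasks).
# This tick-based model is a simplification for illustration only.
Sample = Tuple[int, int]


def some_full_pressure(samples: List[Sample]) -> Tuple[float, float]:
    """Return (some, full) as percentages of the observed ticks."""
    if not samples:
        return 0.0, 0.0
    some_ticks = 0
    full_ticks = 0
    for non_idle, stalled in samples:
        if stalled > 0:
            # "some": at least one task is stalled on the resource.
            some_ticks += 1
            if stalled == non_idle:
                # "full": every non-idle task is stalled at the same time.
                full_ticks += 1
    total = len(samples)
    return 100.0 * some_ticks / total, 100.0 * full_ticks / total


if __name__ == "__main__":
    # Four ticks: no stall, one of two tasks stalled, both stalled, idle.
    ticks = [(2, 0), (2, 1), (2, 2), (0, 0)]
    print(some_full_pressure(ticks))  # (50.0, 25.0)
```

Every tick that counts as `full` also counts as `some`, so in this model `full` can never exceed `some`.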

<!--
These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.
-->
These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of how resource pressure changes over time.
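
On a node, the kernel exposes these averages in `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`, and cgroup v2 provides per-cgroup equivalents (`cpu.pressure`, `memory.pressure`, `io.pressure`). Per the kernel PSI documentation, each file holds lines such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`, where the `avg*` fields are percentages and `total` is the cumulative stall time in microseconds. The following Python sketch parses that text; the function name `parse_psi` is hypothetical and the sample values are illustrative, not real measurements.

```python
from typing import Dict, Union

Number = Union[int, float]


def parse_psi(text: str) -> Dict[str, Dict[str, Number]]:
    """Parse PSI text, e.g. the contents of /proc/pressure/memory.

    Returns {"some": {...}, "full": {...}} where avg10/avg60/avg300 are
    percentages (floats) and total is cumulative stall time in microseconds (int).
    """
    result: Dict[str, Dict[str, Number]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind = fields[0]  # "some" or "full"
        values: Dict[str, Number] = {}
        for pair in fields[1:]:
            key, _, raw = pair.partition("=")
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


if __name__ == "__main__":
    # Illustrative file contents, not a real measurement.
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
        "full avg10=0.40 avg60=0.10 avg300=0.05 total=23456\n"
    )
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], psi["full"]["total"])  # 1.5 23456
```

On a Linux node with PSI available you could pass it the contents of `/proc/pressure/memory`, or of a cgroup's `memory.pressure` file, instead of the hard-coded sample.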

<!--
## PSI metrics in Kubernetes
-->
## PSI metrics in Kubernetes {#psi-metrics-in-kubernetes}

<!--
With the `KubeletPSI` feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the [Summary API](/docs/reference/instrumentation/node-metrics#summary-api-source) and the `/metrics/cadvisor` Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.
-->
With the `KubeletPSI` feature gate enabled, the kubelet can collect PSI metrics from the Linux kernel
and expose them through two channels: the [Summary API](/docs/reference/instrumentation/node-metrics#summary-api-source)
and the `/metrics/cadvisor` Prometheus endpoint. This lets you monitor and alert on resource pressure at the node, pod, and container level.

<!--
The following new metrics are available in Prometheus exposition format via `/metrics/cadvisor`:
-->
The following new metrics are available in Prometheus exposition format via `/metrics/cadvisor`:
* `container_pressure_cpu_stalled_seconds_total`
* `container_pressure_cpu_waiting_seconds_total`
* `container_pressure_memory_stalled_seconds_total`
* `container_pressure_memory_waiting_seconds_total`
* `container_pressure_io_stalled_seconds_total`
* `container_pressure_io_waiting_seconds_total`
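
Unlike the kernel's `avg10`/`avg60`/`avg300` percentages, these metrics are cumulative counters of stall time in seconds, so the useful signal is how fast they grow. In PromQL, an expression such as `rate(container_pressure_memory_waiting_seconds_total[5m])` gives seconds stalled per second, that is, the fraction of wall-clock time spent stalled. The Python sketch below does the same arithmetic on two scrapes of one counter; the function name `stall_fraction` and the sample values are hypothetical, and it assumes (based on cAdvisor's naming) that `waiting` corresponds to PSI `some` and `stalled` to PSI `full`.

```python
def stall_fraction(t1: float, v1: float, t2: float, v2: float) -> float:
    """Fraction of wall-clock time spent stalled between two scrapes.

    t1, t2: scrape timestamps in seconds (t2 must be later than t1).
    v1, v2: values of a *_seconds_total PSI counter at those times.
    """
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    delta = v2 - v1
    if delta < 0:
        # The counter went backwards (for example, the container restarted);
        # like Prometheus, assume it restarted from zero.
        delta = v2
    return delta / (t2 - t1)


if __name__ == "__main__":
    # Illustrative values: 6 s of stall time accumulated over a 60 s window.
    print(stall_fraction(0.0, 100.0, 60.0, 106.0))  # 0.1
```

Multiplying the result by 100 gives a percentage that is roughly comparable to the kernel's `avg*` values; Prometheus's `rate()` additionally extrapolates to the edges of the range window, which this sketch does not do.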

<!--
These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:
-->
Together with the data from the Summary API, these metrics provide a fine-grained view of resource pressure,
so you can pinpoint the source of performance problems and take corrective action.
For example, you can use these metrics to:

<!--
* **Identify memory leaks:** A steadily increasing `some` pressure for memory can indicate a memory leak in an application.
-->
* **Identify memory leaks:** A steadily increasing `some` pressure for memory can indicate a memory leak in an application.

<!--
* **Optimize resource requests and limits:** By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.
-->
* **Optimize resource requests and limits:** By understanding the resource pressure on your workloads, you can tune their resource requests and limits more accurately.

<!--
* **Autoscale workloads:** You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.
-->
* **Autoscale workloads:** You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform well.
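
As a rough illustration of the first and last use cases above, the following Python sketch fits a least-squares slope to equally spaced samples of memory `some` pressure to flag a steady rise, and makes a simple threshold-based scale-out decision from CPU `some` pressure. The function names and thresholds are hypothetical placeholders, not Kubernetes defaults, and a rising trend is a hint worth investigating rather than proof of a leak.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values sampled at equal intervals."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(memory_some: Sequence[float], min_slope: float = 0.5) -> bool:
    """Hypothetical heuristic: memory 'some' pressure (%) keeps rising.

    min_slope is an illustrative threshold in percentage points per sample.
    """
    return slope(memory_some) >= min_slope


def should_scale_out(cpu_some_avg60: float, threshold: float = 20.0) -> bool:
    """Hypothetical autoscaling trigger on CPU 'some' pressure (%)."""
    return cpu_some_avg60 >= threshold


if __name__ == "__main__":
    print(looks_like_leak([1.0, 2.0, 3.0, 4.0, 5.0]))  # True (slope 1.0)
    print(looks_like_leak([2.0, 1.0, 2.0, 1.0, 2.0]))  # False (no upward trend)
    print(should_scale_out(35.0))  # True
```

A least-squares slope is used instead of checking that every sample increases, so that a single noisy dip does not hide an otherwise steady upward trend.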

<!--
## How to enable PSI metrics
-->
## How to enable PSI metrics {#how-to-enable-psi-metrics}

<!--
To enable PSI metrics in your Kubernetes cluster, you need to:
-->
To enable PSI metrics in your Kubernetes cluster, you need to:

<!--
1. **Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.**
-->
1. **Ensure that your nodes run Linux kernel 4.20 or later and use cgroup v2.**

<!--
2. **Enable the `KubeletPSI` feature gate on the kubelet.**
-->
2. **Enable the `KubeletPSI` feature gate on the kubelet.**
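
Before turning on the gate (through the kubelet's `--feature-gates` flag or the `featureGates` field of its configuration file), you might check the first prerequisite on a node. The Python sketch below compares the kernel release string (as printed by `uname -r`, here read via `platform.release()`) against 4.20, treats the presence of `/sys/fs/cgroup/cgroup.controllers` as a sign that the unified cgroup v2 hierarchy is mounted there (a common heuristic, not an official Kubernetes check), and looks for `/proc/pressure/cpu`, since PSI can be compiled out of or disabled in a given kernel build. The function names are hypothetical, and the `root` parameter exists only so the check can be pointed at another directory.

```python
import os
import platform
import re
from typing import Tuple


def kernel_at_least(release: str, major: int = 4, minor: int = 20) -> bool:
    """Compare a release string like '5.15.0-91-generic' with major.minor."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    # Compare numerically so that, for example, 10.0 counts as newer than 4.20.
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def psi_prerequisites(root: str = "/", release: str = "") -> Tuple[bool, bool, bool]:
    """Return (kernel_ok, cgroup_v2, psi_files) for the filesystem at root."""
    release = release or platform.release()
    kernel_ok = kernel_at_least(release)
    # On a cgroup v2 (unified) host, /sys/fs/cgroup is the v2 root and
    # contains cgroup.controllers; on a cgroup v1 host it does not.
    cgroup_v2 = os.path.exists(os.path.join(root, "sys/fs/cgroup/cgroup.controllers"))
    # /proc/pressure/* is typically absent when PSI is unavailable or disabled.
    psi_files = os.path.exists(os.path.join(root, "proc/pressure/cpu"))
    return kernel_ok, cgroup_v2, psi_files


if __name__ == "__main__":
    print(psi_prerequisites())
```

Run on a node, the printed tuple should be `(True, True, True)` before you rely on the kubelet exposing PSI metrics there.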

<!--
Once enabled, you can start scraping the `/metrics/cadvisor` endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.
-->
Once the feature is enabled, you can scrape the `/metrics/cadvisor` endpoint with your Prometheus-compatible monitoring solution,
or query the Summary API, to collect and visualize the new PSI metrics.
Note that PSI is a Linux kernel feature, so these metrics are not available on Windows nodes.
Your cluster can contain a mix of Linux and Windows nodes; on Windows nodes, the kubelet does not expose PSI metrics.
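
One way to inspect the raw data is to fetch a node's cAdvisor metrics through the API server proxy, for example with `kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor"`, and keep only the PSI series. The Python sketch below does that filtering on text you have already fetched. The function name `psi_series` is hypothetical, and the parser handles only simple `name{labels} value [timestamp]` lines, not the full Prometheus exposition format (for example, label values containing commas or escaped quotes).

```python
from typing import Dict, List, Tuple

Series = Tuple[str, Dict[str, str], float]


def psi_series(text: str, prefix: str = "container_pressure_") -> List[Series]:
    """Return (metric_name, labels, value) for lines whose name starts with prefix."""
    series: List[Series] = []
    for line in text.splitlines():
        line = line.strip()
        # Skips blank lines, "# HELP"/"# TYPE" comments and unrelated metrics.
        if not line.startswith(prefix):
            continue
        labels: Dict[str, str] = {}
        if "{" in line:
            name, _, rest = line.partition("{")
            label_text, _, rest = rest.rpartition("}")
            for pair in label_text.split(","):
                if "=" in pair:
                    key, _, value = pair.partition("=")
                    labels[key.strip()] = value.strip().strip('"')
        else:
            name, _, rest = line.partition(" ")
        # rest is "value" or "value timestamp"; keep only the value.
        series.append((name.strip(), labels, float(rest.split()[0])))
    return series


if __name__ == "__main__":
    # Illustrative scrape output, not taken from a real node.
    sample = (
        "# TYPE container_pressure_cpu_waiting_seconds_total counter\n"
        'container_pressure_cpu_waiting_seconds_total{namespace="default",pod="web-0"} 12.5 1700000000000\n'
        'container_cpu_usage_seconds_total{namespace="default",pod="web-0"} 99.0 1700000000000\n'
    )
    for name, labels, value in psi_series(sample):
        print(name, labels["pod"], value)
    # prints: container_pressure_cpu_waiting_seconds_total web-0 12.5
```

For anything beyond a quick look, a real Prometheus server (or a full exposition-format parser) is the better tool; this sketch is only meant to show which series the kubelet exposes.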

<!--
## What's next?
-->
## What's next? {#whats-next}

<!--
We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.
-->
We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback.
Because the feature is in Beta, we are actively improving and extending it toward a stable GA release.
We encourage you to try it out and share your experience with us.

<!--
To learn more about PSI metrics, check out the official [Kubernetes documentation](/docs/reference/instrumentation/understand-psi-metrics/). You can also get involved in the conversation on the [#sig-node](https://kubernetes.slack.com/messages/sig-node) Slack channel.
-->
To learn more about PSI metrics, see the official [Kubernetes documentation](/docs/reference/instrumentation/understand-psi-metrics/).
You can also join the conversation in the [#sig-node](https://kubernetes.slack.com/messages/sig-node) Slack channel.
