
Conversation

XinShuYang (Contributor) commented Jun 27, 2025:

Based on performance testing on a 4C8G Windows node running 100 pods, the antrea-agent container showed an average CPU usage of 30m and memory usage of 50MB, while the OVS container consumed 17m CPU and 23MB memory.
To account for potential burst scenarios and ensure runtime stability, both containers have been configured with resource requests of 100m CPU and 100MB memory.
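For reference, the stanza this PR adds to each Windows container spec looks roughly like the sketch below, using the values stated above; see the diff hunks later in this thread for the exact placement in the manifests.

resources:
  requests:
    cpu: 100m      # request only; no limit is set, so the container may still burst above this
    memory: 100Mi  # roughly 2x the observed steady-state usage of the agent container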

XinShuYang (Contributor, Author):

/test-windows-all

XinShuYang requested a review from wenyingd July 3, 2025 08:12
XinShuYang marked this pull request as ready for review July 3, 2025 08:12
XinShuYang force-pushed the win-resource-request branch from efd4ac5 to a051cbb July 10, 2025 15:08
@@ -62,6 +62,10 @@ spec:
- disable
{{- end}}
name: antrea-agent
resources:
  requests:
    cpu: 100m
Contributor:

We have 200m for the Linux container. Any reason why we are using a lower value for the Windows one?

XinShuYang (Contributor, Author):

I forgot to update the PR description. Based on our tests, I've never observed such high average CPU usage when workload pods are running stably (the CPU cost is indeed higher when workload pods are starting). May I know the Linux test scenario? Is the Linux resource request based on average or burst usage?

Contributor:

We have had the CPU request for Linux for a long time. I don't remember how it was measured.
However, unless you had a testbed with some ongoing activity, I think it would be better to be conservative and match the Linux value. FQDN policies, network policy audit logging, ... are things that can consume quite a bit of CPU in a steady way, and you probably didn't test with these features actively used. On Linux we also have the FlowExporter which could be quite CPU intensive.
Of course, another approach is to say that this request value should really be a baseline, and that users should increase it based on their use case. That being said, we haven't had any complaints about setting it to 200m for Linux, AFAIK.

XinShuYang (Contributor, Author):

Thanks for the explanation. I am ok to set the CPU request value the same as the Linux pod.
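As a rough sketch of the "baseline that users can increase" idea mentioned above, a strategic-merge patch could raise the request after deployment. The DaemonSet name, namespace, and the 500m value below are assumptions for illustration only; verify them against your own manifest.

# agent-cpu-request-patch.yaml (hypothetical override, not part of this PR)
# Apply with: kubectl -n kube-system patch daemonset antrea-agent-windows --patch-file agent-cpu-request-patch.yaml
spec:
  template:
    spec:
      containers:
        - name: antrea-agent        # strategic merge patches match containers by name
          resources:
            requests:
              cpu: 500m             # example value; size this to your own workload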

resources:
  requests:
    cpu: 100m
    memory: 100Mi
Contributor:

Are the memory requests based on some observations for a Windows Node running Antrea?

XinShuYang (Contributor, Author):

In my test, I deployed 100 pods on a 4c8g Windows node. During the test, the average agent memory cost was less than 50MB when the workload pods were running stably. At first I used the burst memory cost as the resource request. However, after @wenyingd 's review, I agree that such a large value might prevent users from scheduling necessary workloads onto the node. Therefore, it's acceptable for the burst cost to be larger than the resource request. Unlike the resource limit setting, it won't prevent the container from acquiring more resources once it's scheduled and running on the node. Could you share more insights?
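To illustrate the distinction being drawn here: a request only influences scheduling, while only a limit caps what the container may actually consume, so with a spec like the sketch below the agent can still burst past 100Mi at runtime. The commented-out limit is shown purely for contrast and is not part of this PR.

resources:
  requests:
    memory: 100Mi    # considered by the scheduler when placing the pod; not a runtime cap
  # limits:
  #   memory: 200Mi  # only a limit like this would cap usage (the container is OOM-killed above it)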

Contributor:

I think it's ok.
Usually, memory should not really vary that much. Once memory is given to a process, it's not always returned to the OS right away, even if the process no longer needs it. So I am a bit surprised that you see big bursts in memory usage.

XinShuYang (Contributor, Author):

Yes, the highest burst memory cost was only captured once by the test script. I've never observed such a cost during long-term pod operation, so I'll keep the current memory request value.

XinShuYang force-pushed the win-resource-request branch 2 times, most recently from 9e07d65 to eb9baac July 14, 2025 08:17
XinShuYang requested a review from antoninbas July 14, 2025 08:19
XinShuYang force-pushed the win-resource-request branch from eb9baac to 35e4119 July 14, 2025 08:31
wenyingd previously approved these changes Jul 14, 2025

wenyingd (Contributor) left a comment:

LGTM

antoninbas previously approved these changes Jul 14, 2025
antoninbas added the action/release-note label Jul 14, 2025
antoninbas (Contributor):

@XinShuYang looks like there is still an issue with manifest generation

Based on performance testing on a 4C8G Windows node running 100 pods,
the antrea-agent container showed an average CPU usage of 30m and memory
usage of 50MB, while the OVS container consumed 17m CPU and 23MB memory.

To account for potential burst scenarios and ensure runtime stability, both
containers have been configured with resource requests of 100m CPU and 100MB memory.

Signed-off-by: Shuyang Xin <[email protected]>
XinShuYang dismissed stale reviews from antoninbas and wenyingd via 6f165cb July 15, 2025 02:18
XinShuYang force-pushed the win-resource-request branch from 35e4119 to 6f165cb July 15, 2025 02:18
XinShuYang (Contributor, Author):

/test-windows-all

XinShuYang requested a review from antoninbas July 15, 2025 08:48
antoninbas merged commit 14876e2 into antrea-io:main Jul 15, 2025
60 of 64 checks passed
luolanzone pushed a commit that referenced this pull request Jul 16, 2025
…) (#7313)

Based on performance testing on a 4C8G Windows node running 100 pods,
the antrea-agent container showed an average CPU usage of 30m and memory
usage of 50MB, while the OVS container consumed 17m CPU and 23MB memory.

To account for potential burst scenarios and ensure runtime stability,
memory requests are set to 100MB and CPU requests are set to 200m (except
for the install-cni initContainer). The CPU requests match the ones for the
Agent on Linux.

Signed-off-by: Shuyang Xin <[email protected]>
Labels: action/release-note (Indicates a PR that should be included in release notes.)
3 participants