feat(BA-2606): Add subagent layer to agents #6268

hhoikoo · 2025-10-16T06:46:06Z

This change introduces a new abstraction layer named 'subagents', which is used to emulate multiple agent backend instances within the same agent deployment.

Users can now specify sub_agents in the unified configuration to list the configurations of the subagent instances to be spawned. The agent server automatically creates the agent instances according to the configuration and routes RPC calls appropriately.

This change does not yet fully handle routing RPC calls to the correct subagent. If the kernel ID, image name, etc. are included in the arguments of the RPC call, the request is directed to the correct subagent. Otherwise, the request is routed to the default agent, which in this change is the first subagent defined in the configuration.

Including subagent ID in the RPC request is future work.
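
The routing rule described above can be sketched roughly as follows. This is an illustration only, not the code in server.py: FakeAgent, SubagentRouter, and pick_agent are hypothetical names, only the kernel ID case is shown, and falling back to the default agent when no owner is found is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeAgent:
    # Hypothetical stand-in for an agent instance; only the fields needed here.
    id: str
    kernel_registry: dict[str, object] = field(default_factory=dict)


class SubagentRouter:
    """Picks the subagent that should serve an RPC call (illustrative only)."""

    def __init__(self, agents: list[FakeAgent]) -> None:
        if not agents:
            raise ValueError("at least one subagent is required")
        # Keep configuration order so that "first subagent" is well defined.
        self.agents = {agent.id: agent for agent in agents}
        self.default_agent_id = agents[0].id

    def pick_agent(self, kernel_id: str | None = None) -> FakeAgent:
        # If the call carries a kernel ID, route to the subagent that owns it.
        if kernel_id is not None:
            for agent in self.agents.values():
                if kernel_id in agent.kernel_registry:
                    return agent
        # Assumption: without a usable hint, fall back to the default agent.
        return self.agents[self.default_agent_id]
```

With two agents where only the second one owns kernel "k2", pick_agent("k2") returns the second agent, while pick_agent() returns the first.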

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

src/ai/backend/agent/config/unified.py

src/ai/backend/agent/server.py

src/ai/backend/agent/config/unified.py

src/ai/backend/agent/server.py

hhoikoo · 2025-10-17T02:09:50Z

src/ai/backend/common/configs/sample_generator.py

Please note that this code was written with assistance from Claude Code, so there may be some imperfections that I've missed. Please let me know if you find any issues.

src/ai/backend/agent/server.py

tests/agent/test_config_server.py

tests/agent/test_config_validation.py

src/ai/backend/agent/agent.py

src/ai/backend/agent/config/unified.py

This change adds support for generating TOML arrays of tables in the sample config generator, which is required for generating the subagents configuration. This change also fixes some subtle bugs where the descriptions of certain optional fields were not included. Note that the sample_generator.py file has been vibe coded with Claude Code, so there may be some subtle bugs in the generation. Care has been taken to ensure that the code generates a correct config file.
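
For reference, a TOML array of tables repeats a `[[name]]` header once per element, with nested tables written as `[name.section]`. A minimal sketch of emitting that shape is shown below. It is not the sample_generator.py implementation; format_value and dump_array_of_tables are hypothetical helpers that handle only a few value types.

```python
def format_value(value: object) -> str:
    # Only the handful of TOML value types needed for this sketch.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    if isinstance(value, list):
        return "[" + ", ".join(format_value(v) for v in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def dump_array_of_tables(name: str, items: list) -> str:
    """Render each item as one [[name]] element with [name.section] sub-tables."""
    lines = []
    for item in items:
        lines.append(f"[[{name}]]")
        for section, fields in item.items():
            lines.append(f"[{name}.{section}]")
            for key, value in fields.items():
                lines.append(f"{key} = {format_value(value)}")
        lines.append("")  # blank line between elements
    return "\n".join(lines)
```

Calling dump_array_of_tables("sub-agents", [...]) with one dict per subagent yields the same `[[sub-agents]]` / `[sub-agents.agent]` layout used in the demo configuration in this PR.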

This change introduces OverridableContainerConfig class to represent container configs that should be overridable by subagents.
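
The idea of overridable settings can be sketched as "subagent value if given, otherwise the global value". The classes below are hypothetical and much smaller than the real config models; they only illustrate the inheritance pattern, not the actual OverridableContainerConfig definition.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ContainerDefaults:
    # Hypothetical subset of the global [container] options.
    port_range: tuple[int, int]
    scratch_root: str
    scratch_size: str


@dataclass(frozen=True)
class ContainerOverrides:
    # Every field is optional: None means "inherit the global value".
    port_range: tuple[int, int] | None = None
    scratch_root: str | None = None
    scratch_size: str | None = None


def resolve(defaults: ContainerDefaults, overrides: ContainerOverrides) -> ContainerDefaults:
    """Return the effective config for one subagent."""
    explicit = {
        f.name: getattr(overrides, f.name)
        for f in fields(overrides)
        if getattr(overrides, f.name) is not None
    }
    # dataclasses.replace copies the defaults and applies only explicit overrides.
    return replace(defaults, **explicit)
```

For example, overriding only port_range and scratch_size leaves scratch_root at its global value.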

This change fixes an uncaught error with agent creation, where the config passed into the constructor was typed as separate AgentGlobalConfig and AgentSpecificConfig objects rather than as AgentUnifiedConfig. This was a remnant of an older version of this change, in which the constructor signature of AbstractAgent was modified, breaking some implicit contracts of subclasses and how they use config objects (especially when dumping the config with pickle).

This change makes all DockerAgent instances share a single global MetadataServer instance. Docker agent instances should not create their own MetadataServer, because doing so leads to unintended resource contention when multiple DockerAgent instances are created.
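
One way to picture the sharing is to create the server once and hand the same object to every agent. The sketch below uses hypothetical stand-in classes (FakeMetadataServer, FakeDockerAgent) and passes the shared instance through the constructor, which is an assumption; the actual wiring in this PR may differ.

```python
class FakeMetadataServer:
    """Hypothetical stand-in that counts how many instances get created."""

    instances_created = 0

    def __init__(self) -> None:
        type(self).instances_created += 1


class FakeDockerAgent:
    def __init__(self, agent_id: str, metadata_server: FakeMetadataServer) -> None:
        self.id = agent_id
        # The agent only keeps a reference; it never constructs its own server.
        self.metadata_server = metadata_server


def create_agents(agent_ids: list) -> list:
    # Create the server exactly once, then share it across all agents.
    shared = FakeMetadataServer()
    return [FakeDockerAgent(agent_id, shared) for agent_id in agent_ids]
```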

src/ai/backend/agent/server.py

hhoikoo · 2025-10-20T06:05:20Z

src/ai/backend/agent/server.py

                    "registry": {
                        str(kern_id): _ensure_serializable(kern.__getstate__())
-                        for kern_id, kern in self.agent.kernel_registry.items()
+                        for agent in self.agents.values()


Here (and below on line 518), when creating the snapshot object, I flatten out the kernel registry and allocs stored across all the subagents. Is this an acceptable thing to do?
@HyeockJinKim @achimnol
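
For discussion, the flattening amounts to merging the per-agent registries into one mapping. A minimal sketch with a hypothetical helper follows; it also raises on duplicate kernel IDs, which is an added assumption since a flat mapping cannot represent two kernels with the same ID. Note that the flat result no longer records which subagent owns each kernel.

```python
def flatten_kernel_registries(registries_by_agent: dict) -> dict:
    """Merge {agent_id: {kernel_id: kernel}} into a single {kernel_id: kernel}."""
    flat = {}
    for agent_id, registry in registries_by_agent.items():
        for kernel_id, kernel in registry.items():
            if kernel_id in flat:
                # Assumption: kernel IDs are unique across subagents.
                raise ValueError(
                    f"kernel {kernel_id} appears in more than one subagent "
                    f"(seen again in {agent_id})"
                )
            flat[kernel_id] = kernel
    return flat
```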

hhoikoo · 2025-10-20T06:06:59Z

src/ai/backend/agent/server.py

+                for agent_config in agent_configs
+            ]
+        agents = [task.result() for task in tasks]
+        self._default_agent_id = agents[0].id


I chose the default agent to be the first agent defined in the subagents config. Is this acceptable?

src/ai/backend/agent/agent.py

This change fixes a bug where the RPC calls for destroying kernels and purging containers did not return proper responses, leading to incorrect behaviors such as infinite retry cycles of kernel destruction or unexpected exceptions while purging non-existent containers.

hhoikoo · 2025-10-20T11:31:33Z

I tested locally by following the steps below. The steps are deliberately more detailed than necessary, mainly for posterity.

Prepare a configuration with multiple subagents. Put it at the root of the repo e.g. ./agent-subagent-demo.toml

[etcd]
namespace = "local"
addr = { host = "127.0.0.1", port = 8121 }
user = ""
password = ""

[service-discovery]
type = "etcd"

# Global agent configuration - serves as defaults for all subagents
[agent]
backend = "docker"
rpc-listen-addr = { host = "127.0.0.1", port = 6011 }
service-addr = { host = "0.0.0.0", port = 6003 }
announce-addr = { host = "127.0.0.1", port = 6003 }
ssl-enabled = false
agent-sock-port = 6007
scaling-group = "default"
scaling-group-type = "compute"
pid-file = "./agent-subagent-demo.pid"
event-loop = "uvloop"
ipc-base-path = "/tmp/backend.ai/ipc"
var-base-path = "./var/lib/backend.ai"
image-commit-path = "./tmp/backend.ai/commit/"
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local"
cohabiting-storage-proxy = true
skip-manager-detection = false
kernel-creation-concurrency = 4

[agent.sync-container-lifecycles]
enabled = true
interval = 10.0

# Global container configuration - serves as defaults for all subagents
[container]
kernel-uid = -1
kernel-gid = -1
bind-host = "127.0.0.1"
# Note: Port ranges must NOT overlap between subagents!
port-range = [30000, 31000]
sandbox-type = "docker"
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"
jail-args = ["--mount", "/tmp"]
swarm-enabled = false

# Global resource configuration - serves as defaults for all subagents
[resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "8G"
memory-align-size = "16M"
allocation-order = ["cuda", "cpu", "mem"]
affinity-policy = "INTERLEAVED"

# Global pyroscope configuration
[pyroscope]
enabled = false

# Global logging configuration
[logging]
level = "INFO"
drivers = ["console"]

[logging.pkg-ns]
"" = "WARNING"
"aiodocker" = "INFO"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[logging.console]
colored = true
format = "verbose"

# Global debug configuration
[debug]
enabled = true
skip-container-deletion = false
log-heartbeats = false
heartbeat-interval = 20.0

[debug.coredump]
enabled = false
path = "./coredumps"
backup-count = 10
size-limit = "64M"

# Global OTEL configuration
[otel]
enabled = true
log-level = "INFO"
endpoint = "http://127.0.0.1:4317"

# Subagent 1: CPU-focused agent
[[sub-agents]]
[sub-agents.agent]
id = "subagent-cpu-001"
agent-sock-port = 6107
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-cpu"

[sub-agents.container]
port-range = [31000, 32000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-cpu"
scratch-size = "2G"

[sub-agents.resource]
reserved-cpu = 2
reserved-mem = "2G"
reserved-disk = "10G"
allocation-order = ["cuda", "cpu", "mem"]

# Subagent 2: High-memory agent (for memory-intensive workloads)
# Note: On macOS ARM64, CUDA/ROCm are not available
# This subagent focuses on memory-intensive tasks instead
[[sub-agents]]
[sub-agents.agent]
id = "subagent-highmem-001"
agent-sock-port = 6207
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-highmem"

[sub-agents.container]
port-range = [32000, 33000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-highmem"
scratch-size = "3G"

[sub-agents.resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "15G"
# Prioritize memory allocation for this subagent
allocation-order = ["cuda", "mem", "cpu"]

# Subagent 3: General-purpose agent with balanced resources
[[sub-agents]]
[sub-agents.agent]
id = "subagent-general-001"
agent-sock-port = 6307
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-general"

[sub-agents.container]
port-range = [33000, 34000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-general"
scratch-size = "1.5G"

[sub-agents.resource]
reserved-cpu = 1
reserved-mem = "1.5G"
reserved-disk = "10G"
# Balanced allocation order suitable for ARM64 macOS
allocation-order = ["cuda", "cpu", "mem"]
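
The note in the configuration above says port ranges must not overlap between subagents. A small check for that could look like the sketch below. find_overlaps is a hypothetical helper, and it assumes both ends of port-range are inclusive, which is not confirmed here. Under that assumption, adjacent ranges such as [30000, 31000] and [31000, 32000] share the boundary port 31000.

```python
from itertools import combinations


def find_overlaps(ranges: dict) -> list:
    """Return pairs of names whose (low, high) port ranges intersect.

    Assumption: both bounds are inclusive.
    """
    overlaps = []
    for (name_a, (lo_a, hi_a)), (name_b, (lo_b, hi_b)) in combinations(ranges.items(), 2):
        # Two closed intervals intersect when each starts before the other ends.
        if lo_a <= hi_b and lo_b <= hi_a:
            overlaps.append((name_a, name_b))
    return overlaps


demo_ranges = {
    "global": (30000, 31000),
    "subagent-cpu-001": (31000, 32000),
    "subagent-highmem-001": (32000, 33000),
    "subagent-general-001": (33000, 34000),
}
print(find_overlaps(demo_ranges))
```

If the upper bound is actually exclusive, the comparison would use `<` instead of `<=` and the demo ranges would not overlap.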

Start halfstack

docker compose -f docker-compose.halfstack.current.yml up -d

Start storage proxy, manager, and agent on separate terminal windows

./py -m ai.backend.storage.server
./backend.ai mgr start-server --debug
./backend.ai ag start-server --debug --config agent-subagent-demo.toml

Run command to see the list of subagents and sessions

# Run to see the list of agents defined
source env-local-admin-api.sh
./backend.ai admin agent list
# > Should see 3 agents listed

# Run to see sessions
source env-local-user-api.sh
./backend.ai ps
# > Should be empty

Run basic hello world to make sure it works for a single agent

./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Hello World')"

The command above creates a session in interactive mode. Test that removing the session works

./backend.ai ps
# > Get the name of the session

./backend.ai rm <session-name>

Create and destroy multiple sessions at the same time

./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 1')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 2')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 3')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 4')"

./backend.ai rm pysdk-xxxxx1 pysdk-xxxxx2 pysdk-xxxxx3 pysdk-xxxxx4

Test that creating and destroying interactive sessions works

./backend.ai create cr.backend.ai/multiarch/python:3.9-ubuntu20.04
./backend.ai destroy pysdk-xxxxx
# Verify destroy successful by checking ps
./backend.ai ps

Copilot

Pull Request Overview

This PR introduces a new subagent abstraction layer to enable multiple agent backend instances within the same agent deployment. Users can now specify sub-agent configurations in the unified configuration, allowing the agent server to automatically create agent instances according to the configuration and route RPC calls appropriately.

- Adds new configuration schema for defining multiple subagents with inheritance from global defaults
- Implements agent selection and routing logic in the RPC server to find appropriate agents by kernel ID, image name, etc.
- Refactors agent initialization to support multiple agent instances with shared metadata server

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File	Description
tests/agent/test_config_validation.py	Comprehensive tests for subagent configuration validation and inheritance patterns
tests/agent/test_config_server.py	Tests for agent RPC server multi-agent mode functionality and agent selection logic
src/ai/backend/common/configs/sample_generator.py	Enhanced TOML config generator to support array of tables syntax and runtime field handling
src/ai/backend/agent/server.py	Major refactor to support multiple agents with routing logic and shared metadata server
src/ai/backend/agent/docker/agent.py	Updates to support shared metadata server across multiple agent instances
src/ai/backend/agent/config/unified.py	New configuration schema with subagent support and global/specific config separation
configs/agent/sample.toml	Updated sample configuration file with new subagent structure
changes/6268.feature.md	Feature changelog entry

src/ai/backend/common/configs/sample_generator.py

src/ai/backend/agent/server.py

src/ai/backend/agent/config/unified.py

This change removes non-determinism in sample config generation by always sorting list-like attribute values. Previously, sample values provided as a set had no consistent ordering, so the generated output could differ between runs.
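
The effect can be illustrated with a small sketch. render_sample_list is a hypothetical helper, not the function in sample_generator.py, and it only handles string values.

```python
import json
from collections.abc import Iterable


def render_sample_list(values: Iterable) -> str:
    """Render string sample values as a TOML-style list in a stable order."""
    # Sets have no guaranteed iteration order, so sort before rendering.
    ordered = sorted(values)
    # json.dumps yields a double-quoted, escaped string for each value.
    return "[" + ", ".join(json.dumps(v) for v in ordered) + "]"
```

Rendering the set {"cuda", "cpu", "mem"} this way always gives `["cpu", "cuda", "mem"]`, regardless of the set's iteration order.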

hhoikoo · 2025-10-21T12:21:24Z

This PR got quite bloated, and the follow-up PR for updating RPC functions will actually undo quite a lot of the changes made in this PR. I'll create new tickets/issues that better match the new agent runtime <-> agent design, and create new PRs that separate out the changes better.

hhoikoo requested review from HyeockJinKim and achimnol October 16, 2025 06:46

hhoikoo self-assigned this Oct 16, 2025

github-actions bot added size:XL 500~ LoC comp:agent Related to Agent component labels Oct 16, 2025

github-advanced-security bot found potential problems Oct 16, 2025

github-actions bot added the comp:common Related to Common component label Oct 16, 2025

github-advanced-security bot found potential problems Oct 17, 2025

hhoikoo commented Oct 17, 2025

github-advanced-security bot found potential problems Oct 17, 2025

HyeockJinKim reviewed Oct 17, 2025

src/ai/backend/agent/agent.py (outdated)

src/ai/backend/agent/config/unified.py (outdated)

hhoikoo added 12 commits October 20, 2025 14:53

docs(BA-2606): Add changelog

d791204

docs(BA-2606): Make field descriptions fit column width

e2a36d7

fix(BA-2606): Mark all subagent.agent fields except ID as optional

32f87e8

feat(BA-2606): Separate out overridable container configs

ca6ab95

This change introduces OverridableContainerConfig class to represent container configs that should be overridable by subagents.

test(BA-2606): Add unit tests for subagent config validation

cbbbb4d

fix(BA-2606): Regenerate sample config file with new changes

2eb6e3c

fix(BA-2606): Properly mark non-overridable agent config fields

75ac110

doc(BA-2606): Remove stale todo comments

1ed7447

hhoikoo force-pushed the feat/BA-2606/subagent branch from 6955400 to 1ed7447 Compare October 20, 2025 05:53

github-advanced-security bot found potential problems Oct 20, 2025

refactor(BA-2606): Remove unnecessary parameter in AbstractAgent

f5ed041

hhoikoo commented Oct 20, 2025

hhoikoo added 2 commits October 20, 2025 15:45

fix(BA-2606): Fix broken tests

00c4478

hhoikoo marked this pull request as ready for review October 20, 2025 11:31

hhoikoo requested review from HyeockJinKim and Copilot October 20, 2025 11:31

Copilot AI reviewed Oct 20, 2025

hhoikoo added 4 commits October 21, 2025 10:30

Merge branch 'main' into feat/BA-2606/subagent

33c7d0a

refactor(BA-2606): Modify ambiguous or repeated names

480273d

feat(BA-2606): Add validator to ensure subagent id uniqueness

b7c2da9

hhoikoo marked this pull request as draft October 21, 2025 12:15

hhoikoo closed this Oct 22, 2025
