@hhoikoo hhoikoo commented Oct 16, 2025

resolves #6144 (BA-2606)

This change introduces a new abstraction layer named 'subagents', which is used to emulate multiple agent backend instances within the same agent deployment.

Users can now specify sub_agents in the unified configuration to list the configurations of the subagent instances to spawn. The agent server creates the agent instances according to the configuration and routes RPC calls to them.

This change does not yet fully handle routing RPC calls to the correct subagent. If the kernel ID, image name, etc. are included in the arguments of the RPC call, the request is directed to the correct subagent. Otherwise the request is routed to the default agent, which in this change is the first subagent defined in the configuration.

Including subagent ID in the RPC request is future work.
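
As a rough illustration of the routing rule described above, here is a minimal Python sketch. `SubAgentStub`, `AgentRouter`, and `resolve` are hypothetical names, not the classes in this PR, and falling back to the default agent for an unknown kernel ID is an assumption; the description only covers the case where no identifying argument is present.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubAgentStub:
    """Hypothetical stand-in for one agent instance."""

    id: str
    kernel_ids: set[str] = field(default_factory=set)


class AgentRouter:
    """Hypothetical router: look up by kernel ID, else use the default agent."""

    def __init__(self, agents: list[SubAgentStub]) -> None:
        if not agents:
            raise ValueError("at least one subagent is required")
        # dicts preserve insertion order, so configuration order is kept.
        self._agents = {agent.id: agent for agent in agents}
        # The default agent is the first subagent in the configuration.
        self._default_agent_id = agents[0].id

    def resolve(self, kernel_id: str | None = None) -> SubAgentStub:
        if kernel_id is not None:
            for agent in self._agents.values():
                if kernel_id in agent.kernel_ids:
                    return agent
        # Assumption: unknown or missing kernel IDs fall back to the default.
        return self._agents[self._default_agent_id]


router = AgentRouter([
    SubAgentStub("subagent-cpu-001", {"kernel-a"}),
    SubAgentStub("subagent-highmem-001", {"kernel-b"}),
])
print(router.resolve("kernel-b").id)  # subagent-highmem-001
print(router.resolve().id)            # subagent-cpu-001
```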

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@hhoikoo hhoikoo self-assigned this Oct 16, 2025
@github-actions github-actions bot added size:XL 500~ LoC comp:agent Related to Agent component labels Oct 16, 2025
@github-actions github-actions bot added the comp:common Related to Common component label Oct 16, 2025
Member Author

Please note that this code was written with assistance from Claude Code, so there may be some imperfections that I've missed. Please let me know if you find any issues.


This change adds support for generating TOML arrays of tables in the
sample config generator, which is required for generating the subagents
configuration.

This change also fixes some subtle bugs where the descriptions of
certain optional fields were not included.

Note that the sample_generator.py file was vibe coded with Claude
Code, so there may be subtle bugs in the generation. Care has been
taken to ensure that the code generates a correct config file.
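
To make "array of tables" concrete, the following is a minimal sketch of emitting the `[[sub-agents]]` layout used later in this thread. It is not the code in sample_generator.py; `format_value` and `render_array_of_tables` are hypothetical helpers, and the string handling (via `json.dumps`) is only adequate for simple strings.

```python
from __future__ import annotations

import json


def format_value(value: object) -> str:
    """Format a small subset of Python values as TOML values."""
    if isinstance(value, bool):  # must come before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return json.dumps(value)  # assumption: simple strings only
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(format_value(v) for v in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def render_array_of_tables(name: str, items: list[dict[str, dict[str, object]]]) -> str:
    """Render each item as one [[name]] entry with [name.section] sub-tables."""
    lines: list[str] = []
    for item in items:
        lines.append(f"[[{name}]]")
        for section, values in item.items():
            lines.append(f"[{name}.{section}]")
            for key, value in values.items():
                lines.append(f"{key} = {format_value(value)}")
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


sample = render_array_of_tables("sub-agents", [
    {"agent": {"id": "subagent-cpu-001"}, "container": {"port-range": [31000, 32000]}},
    {"agent": {"id": "subagent-highmem-001"}, "container": {"port-range": [32000, 33000]}},
])
print(sample)
```

Each `[[sub-agents]]` header starts a new array element, and the `[sub-agents.agent]` and `[sub-agents.container]` headers that follow attach to the most recently started element.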

This change introduces the OverridableContainerConfig class to represent
container configs that should be overridable by subagents.

This change fixes an uncaught error in agent creation, where the config
passed into the constructor was typed as the separate AgentGlobalConfig
and AgentSpecificConfig rather than AgentUnifiedConfig. This was a
remnant of an older version of this change, in which the constructor
type of AbstractAgent was modified. That modification broke some
implicit contracts of subclasses and how they used config objects
(especially when dumping the config with pickle).

This change makes all DockerAgents share a single global instance of
MetadataServer. Docker agent instances should not create their own
MetadataServer, as doing so leads to unintended resource contention
when multiple DockerAgent instances are created.
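
The sharing described in the last commit message can be sketched as follows. `MetadataServerStub`, `DockerAgentStub`, and `create_agents` are hypothetical stand-ins, and passing the server through the constructor is an assumption made for illustration; the commit message only says that a single global instance is shared.

```python
from __future__ import annotations


class MetadataServerStub:
    """Hypothetical stand-in for MetadataServer; counts how many were built."""

    instances_created = 0

    def __init__(self) -> None:
        MetadataServerStub.instances_created += 1


class DockerAgentStub:
    """Hypothetical stand-in for DockerAgent that receives the shared server."""

    def __init__(self, agent_id: str, metadata_server: MetadataServerStub) -> None:
        self.id = agent_id
        self.metadata_server = metadata_server


def create_agents(agent_ids: list[str]) -> list[DockerAgentStub]:
    # Build the server once and hand the same object to every agent.
    shared_server = MetadataServerStub()
    return [DockerAgentStub(agent_id, shared_server) for agent_id in agent_ids]


agents = create_agents(["subagent-cpu-001", "subagent-highmem-001"])
print(agents[0].metadata_server is agents[1].metadata_server)  # True
print(MetadataServerStub.instances_created)  # 1
```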
@hhoikoo hhoikoo force-pushed the feat/BA-2606/subagent branch from 6955400 to 1ed7447 Compare October 20, 2025 05:53
"registry": {
str(kern_id): _ensure_serializable(kern.__getstate__())
for kern_id, kern in self.agent.kernel_registry.items()
for agent in self.agents.values()
Member Author

Here (and below on line 518), when creating the snapshot object, I flatten out the kernel registry and allocs stored across all the subagents. Is this an acceptable thing to do?
@HyeockJinKim @achimnol
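
For reviewers, a minimal sketch of what "flattening" means here. `AgentStub` and `flatten_registries` are hypothetical names, not the code in server.py, and raising on a duplicate kernel ID is an assumption about how collisions could be handled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentStub:
    """Hypothetical stand-in for a subagent with its own kernel registry."""

    id: str
    kernel_registry: dict[str, dict] = field(default_factory=dict)


def flatten_registries(agents: dict[str, AgentStub]) -> dict[str, dict]:
    """Merge the per-agent kernel registries into one snapshot mapping."""
    flattened: dict[str, dict] = {}
    for agent in agents.values():
        for kern_id, kern_state in agent.kernel_registry.items():
            key = str(kern_id)
            if key in flattened:
                # Assumption: a kernel ID must not appear under two subagents.
                raise ValueError(f"duplicate kernel id {key!r}")
            flattened[key] = kern_state
    return flattened


agents = {
    "subagent-cpu-001": AgentStub("subagent-cpu-001", {"kernel-a": {"state": "running"}}),
    "subagent-highmem-001": AgentStub("subagent-highmem-001", {"kernel-b": {"state": "running"}}),
}
print(sorted(flatten_registries(agents)))  # ['kernel-a', 'kernel-b']
```

One consequence worth weighing: the flattened mapping no longer records which subagent owns each kernel, so anything restoring from the snapshot would need another way to recover that.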

for agent_config in agent_configs
]
agents = [task.result() for task in tasks]
self._default_agent_id = agents[0].id
Member Author

I chose the default agent to be the first agent defined in the subagents config. Is this acceptable?

This change fixes a bug where the RPC calls for destroying kernels and
purging containers did not return proper responses, leading to incorrect
behaviors such as infinite retry cycles of kernel destruction or
unexpected exceptions while purging non-existent containers.
hhoikoo commented Oct 20, 2025

I tested locally by following the steps below. The steps are deliberately more detailed than necessary, mainly for posterity.

  1. Prepare a configuration with multiple subagents. Put it at the root of the repo, e.g. ./agent-subagent-demo.toml
[etcd]
namespace = "local"
addr = { host = "127.0.0.1", port = 8121 }
user = ""
password = ""

[service-discovery]
type = "etcd"

# Global agent configuration - serves as defaults for all subagents
[agent]
backend = "docker"
rpc-listen-addr = { host = "127.0.0.1", port = 6011 }
service-addr = { host = "0.0.0.0", port = 6003 }
announce-addr = { host = "127.0.0.1", port = 6003 }
ssl-enabled = false
agent-sock-port = 6007
scaling-group = "default"
scaling-group-type = "compute"
pid-file = "./agent-subagent-demo.pid"
event-loop = "uvloop"
ipc-base-path = "/tmp/backend.ai/ipc"
var-base-path = "./var/lib/backend.ai"
image-commit-path = "./tmp/backend.ai/commit/"
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local"
cohabiting-storage-proxy = true
skip-manager-detection = false
kernel-creation-concurrency = 4

[agent.sync-container-lifecycles]
enabled = true
interval = 10.0

# Global container configuration - serves as defaults for all subagents
[container]
kernel-uid = -1
kernel-gid = -1
bind-host = "127.0.0.1"
# Note: Port ranges must NOT overlap between subagents!
port-range = [30000, 31000]
sandbox-type = "docker"
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"
jail-args = ["--mount", "/tmp"]
swarm-enabled = false

# Global resource configuration - serves as defaults for all subagents
[resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "8G"
memory-align-size = "16M"
allocation-order = ["cuda", "cpu", "mem"]
affinity-policy = "INTERLEAVED"

# Global pyroscope configuration
[pyroscope]
enabled = false

# Global logging configuration
[logging]
level = "INFO"
drivers = ["console"]

[logging.pkg-ns]
"" = "WARNING"
"aiodocker" = "INFO"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[logging.console]
colored = true
format = "verbose"

# Global debug configuration
[debug]
enabled = true
skip-container-deletion = false
log-heartbeats = false
heartbeat-interval = 20.0

[debug.coredump]
enabled = false
path = "./coredumps"
backup-count = 10
size-limit = "64M"

# Global OTEL configuration
[otel]
enabled = true
log-level = "INFO"
endpoint = "http://127.0.0.1:4317"

# Subagent 1: CPU-focused agent
[[sub-agents]]
[sub-agents.agent]
id = "subagent-cpu-001"
agent-sock-port = 6107
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-cpu"

[sub-agents.container]
port-range = [31000, 32000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-cpu"
scratch-size = "2G"

[sub-agents.resource]
reserved-cpu = 2
reserved-mem = "2G"
reserved-disk = "10G"
allocation-order = ["cuda", "cpu", "mem"]

# Subagent 2: High-memory agent (for memory-intensive workloads)
# Note: On macOS ARM64, CUDA/ROCm are not available
# This subagent focuses on memory-intensive tasks instead
[[sub-agents]]
[sub-agents.agent]
id = "subagent-highmem-001"
agent-sock-port = 6207
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-highmem"

[sub-agents.container]
port-range = [32000, 33000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-highmem"
scratch-size = "3G"

[sub-agents.resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "15G"
# Prioritize memory allocation for this subagent
allocation-order = ["cuda", "mem", "cpu"]

# Subagent 3: General-purpose agent with balanced resources
[[sub-agents]]
[sub-agents.agent]
id = "subagent-general-001"
agent-sock-port = 6307
mount-path = "/Users/hhkoo/Developer/backend.ai/vfroot/local/subagent-general"

[sub-agents.container]
port-range = [33000, 34000]
bind-host = "127.0.0.1"
scratch-root = "./scratches/subagent-general"
scratch-size = "1.5G"

[sub-agents.resource]
reserved-cpu = 1
reserved-mem = "1.5G"
reserved-disk = "10G"
# Balanced allocation order suitable for ARM64 macOS
allocation-order = ["cuda", "cpu", "mem"]
  1. Start halfstack
docker compose -f docker-compose.halfstack.current.yml up -d
  1. Start the storage proxy, manager, and agent in separate terminal windows
./py -m ai.backend.storage.server
./backend.ai mgr start-server --debug
./backend.ai ag start-server --debug --config agent-subagent-demo.toml
  1. Run command to see the list of subagents and sessions
# Run to see the list of agents defined
source env-local-admin-api.sh
./backend.ai admin agent list
# > Should see 3 agents listed

# Run to see sessions
source env-local-user-api.sh
./backend.ai ps
# > Should be empty
  1. Run basic hello world to make sure it works for a single agent
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Hello World')"
  1. The previous command creates a session in interactive mode. Test that removing the session works
./backend.ai ps
# > Get the name of the session

./backend.ai rm <session-name>
  1. Create and destroy multiple sessions at the same time
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 1')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 2')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 3')"
./backend.ai run cr.backend.ai/multiarch/python:3.9-ubuntu20.04 -c "print('Test 4')"

./backend.ai rm pysdk-xxxxx1 pysdk-xxxxx2 pysdk-xxxxx3 pysdk-xxxxx4
  1. Test that creating and destroying interactive sessions works
./backend.ai create cr.backend.ai/multiarch/python:3.9-ubuntu20.04
./backend.ai destroy pysdk-xxxxx
# Verify destroy successful by checking ps
./backend.ai ps
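
The demo configuration above notes that port ranges must not overlap between subagents. A small check along those lines is sketched below; `find_overlaps` is a hypothetical helper, and whether the agent treats the upper bound of `port-range` as inclusive is an assumption, so the check takes it as a parameter.

```python
from __future__ import annotations

from itertools import combinations


def find_overlaps(
    ranges: dict[str, tuple[int, int]],
    inclusive: bool = True,
) -> list[tuple[str, str]]:
    """Return the pairs of subagent IDs whose port ranges share a port."""
    overlaps: list[tuple[str, str]] = []
    for (name_a, (lo_a, hi_a)), (name_b, (lo_b, hi_b)) in combinations(ranges.items(), 2):
        if inclusive:
            collide = lo_a <= hi_b and lo_b <= hi_a
        else:
            collide = lo_a < hi_b and lo_b < hi_a
        if collide:
            overlaps.append((name_a, name_b))
    return overlaps


# port-range values copied from the demo configuration in step 1.
demo_ranges = {
    "subagent-cpu-001": (31000, 32000),
    "subagent-highmem-001": (32000, 33000),
    "subagent-general-001": (33000, 34000),
}
print(find_overlaps(demo_ranges, inclusive=True))
print(find_overlaps(demo_ranges, inclusive=False))  # []
```

If the upper bound is inclusive, adjacent ranges in the demo share their boundary port (32000 and 33000); if it is exclusive, they do not collide.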

@hhoikoo hhoikoo marked this pull request as ready for review October 20, 2025 11:31
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new subagent abstraction layer to enable multiple agent backend instances within the same agent deployment. Users can now specify sub-agent configurations in the unified configuration, allowing the agent server to automatically create agent instances according to the configuration and route RPC calls appropriately.

  • Adds new configuration schema for defining multiple subagents with inheritance from global defaults
  • Implements agent selection and routing logic in the RPC server to find appropriate agents by kernel ID, image name, etc.
  • Refactors agent initialization to support multiple agent instances with shared metadata server

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/agent/test_config_validation.py | Comprehensive tests for subagent configuration validation and inheritance patterns |
| tests/agent/test_config_server.py | Tests for agent RPC server multi-agent mode functionality and agent selection logic |
| src/ai/backend/common/configs/sample_generator.py | Enhanced TOML config generator to support array of tables syntax and runtime field handling |
| src/ai/backend/agent/server.py | Major refactor to support multiple agents with routing logic and shared metadata server |
| src/ai/backend/agent/docker/agent.py | Updates to support shared metadata server across multiple agent instances |
| src/ai/backend/agent/config/unified.py | New configuration schema with subagent support and global/specific config separation |
| configs/agent/sample.toml | Updated sample configuration file with new subagent structure |
| changes/6268.feature.md | Feature changelog entry |

This change removes the non-determinism in generating sample configs by
always sorting list-like attribute values. The non-determinism arose
when sample values were provided as a set, which does not provide a
consistent ordering of values.
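
A minimal sketch of the idea in this commit message, not the code in sample_generator.py: `stable_items` is a hypothetical helper, and the sort key (type name, then string form) is an assumption chosen so that mixed-type values can still be ordered.

```python
from __future__ import annotations

from collections.abc import Iterable


def stable_items(values: Iterable[object]) -> list[object]:
    """Return the values in a deterministic order, whatever the input order."""
    return sorted(values, key=lambda v: (type(v).__name__, str(v)))


# A set has no guaranteed iteration order, so sort before rendering.
print(stable_items({"cuda", "cpu", "mem"}))  # ['cpu', 'cuda', 'mem']
```

Sorting every list-like value also reorders real lists, so a field whose order is meaningful would be rendered in sorted order rather than its original order in the sample output.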
@hhoikoo hhoikoo marked this pull request as draft October 21, 2025 12:15
hhoikoo commented Oct 21, 2025

This PR got quite bloated, and the follow-up PR for updating the RPC functions will undo quite a lot of the changes made in this PR. I'll create new tickets/issues that better match the new agent runtime <-> agent design, and create new PRs that separate out the changes better.

@hhoikoo hhoikoo closed this Oct 22, 2025
Development

Successfully merging this pull request may close these issues.

Agent: Implement sub-agent abstraction layer within agent server

2 participants