Commit b3accb8

drernie and claude authored
feat: add health check endpoint for container orchestration (#198)
## Summary

Implements a basic health check endpoint (`/health`) for the MCP server to support container orchestration systems like Kubernetes and Docker Compose.

## Changes

- ✅ Add `/health` endpoint using FastMCP's custom_route decorator
- ✅ Health check handler returns server status, timestamp, and version info
- ✅ Transport-aware registration (only enabled for HTTP/SSE transports)
- ✅ Comprehensive unit and integration tests
- ✅ Updated Docker container test to verify health endpoint

## Implementation Details

### Health Check Response Format

```json
{
  "status": "ok",
  "timestamp": "2024-01-15T10:30:00Z",
  "server": {
    "name": "quilt-mcp-server",
    "version": "1.0.0",
    "transport": "http"
  }
}
```

### Transport Compatibility

- **HTTP/SSE/Streamable-HTTP**: Health endpoint registered at `/health`
- **stdio**: No HTTP endpoints (maintains compatibility)

## Testing

- [x] Unit tests for health check handler
- [x] Integration tests for FastMCP registration
- [x] Docker container test validates health endpoint
- [x] All existing tests pass

## Documentation

Complete specification documents created in `spec/197-add-health-checks/`:

- `01-requirements.md` - User stories and acceptance criteria
- `02-analysis.md` - Current state analysis
- `03-specifications.md` - Desired end state
- `04-phases.md` - Implementation roadmap

## Future Phases

This PR implements Phase 1 of the health check feature. Future enhancements:

- Phase 2: Component health integration (AWS, Athena, packages)
- Phase 3: Advanced endpoints (/readiness, /liveness)
- Phase 4: Docker HEALTHCHECK and Kubernetes probe support

Fixes #197

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent a658d32 commit b3accb8
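
The response format and transport gating described in the commit message can be sketched with the standard library alone. This is an illustration under stated assumptions, not the committed handler: the names `build_health_payload`, `should_register_health_route`, and `HTTP_TRANSPORTS` are hypothetical, and the FastMCP `custom_route` registration itself is not reproduced here.

```python
from datetime import datetime, timezone
from typing import Optional

# Transports that expose an HTTP listener, per the commit message.
# The exact strings are an assumption made for this sketch.
HTTP_TRANSPORTS = frozenset({"http", "sse", "streamable-http"})


def should_register_health_route(transport: str) -> bool:
    """Return True only for transports that can serve an HTTP route."""
    return transport.lower() in HTTP_TRANSPORTS


def build_health_payload(
    name: str, version: str, transport: str, now: Optional[datetime] = None
) -> dict:
    """Build a body shaped like the JSON example in the commit message."""
    moment = now or datetime.now(timezone.utc)
    return {
        "status": "ok",
        # ISO 8601 in UTC with a trailing "Z", matching the example.
        "timestamp": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "server": {"name": name, "version": version, "transport": transport},
    }
```

Keeping the payload builder separate from the route registration makes it testable without starting a server, which is one plausible way to structure the unit tests the commit mentions.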

13 files changed, +1192 -47 lines changed

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -6,6 +6,22 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.6.14] - 2025-09-24
+
+### Added
+
+- **Health Check Endpoint**: Basic health monitoring for container orchestration (#197)
+  - New `/health` endpoint returning server status, timestamp, and version info
+  - Transport-aware registration (only enabled for HTTP/SSE/streamable-http transports)
+  - Comprehensive test coverage for health check functionality
+  - Foundation for future enhancements (component health, readiness/liveness probes)
+
+### Changed
+
+- **Docker Integration Tests**: Enhanced to verify health check endpoint availability
+  - Tests now validate both `/mcp` and `/health` endpoints
+  - Ensures health check responses include proper server metadata
+
 ## [0.6.13] - 2025-09-22

 ### Added

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "quilt-mcp"
-version = "0.6.13"
+version = "0.6.14"
 description = "Secure MCP server for accessing Quilt data with JWT authentication"
 readme = "README.md"
 requires-python = ">=3.11"
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
+# Health Check Endpoints Requirements
+
+**Issue Reference**: GitHub Issue #197 - Add Health Checks to MCP Server
+**Branch**: `197-add-health-checks`
+**Status**: Requirements Analysis
+
+## Problem Statement
+
+The Quilt MCP server currently lacks standardized health check endpoints that are essential for container orchestration, monitoring systems, and production deployments. While internal health check functionality exists in the codebase (via `error_recovery.health_check_with_recovery()`), there is no exposed HTTP endpoint that external systems can use to monitor server health status.
+
+This creates challenges for:
+1. Container orchestration platforms (Kubernetes, Docker Compose) that need health probes
+2. Load balancers and reverse proxies requiring health check endpoints
+3. Monitoring and alerting systems that track service availability
+4. DevOps workflows that need to verify deployment success
+
+## User Stories
+
+### US1: Container Orchestration Health Probes
+
+**As a** DevOps engineer deploying the MCP server in Kubernetes or Docker Compose
+**I want** standardized health check endpoints (`/health`, `/readiness`, `/liveness`)
+**So that** the orchestration system can automatically restart unhealthy containers and route traffic only to ready instances
+
+**Acceptance Criteria:**
+1. `/health` endpoint returns HTTP 200 with JSON status when server is healthy
+2. `/health` endpoint returns HTTP 503 with error details when server is unhealthy
+3. `/readiness` endpoint indicates when server is ready to accept MCP requests
+4. `/liveness` endpoint indicates when server process is running and responsive
+5. All endpoints work across HTTP, SSE, and stdio transport modes
+6. Health check endpoints do not require MCP authentication
+7. Response format follows standard health check conventions
+
+### US2: Monitoring and Alerting Integration
+
+**As a** Site Reliability Engineer monitoring production MCP deployments
+**I want** detailed health status information with component-level diagnostics
+**So that** I can set up accurate alerts and quickly identify the root cause of issues
+
+**Acceptance Criteria:**
+1. Health endpoints return structured JSON with overall status and component details
+2. Component-level status includes: authentication, AWS connectivity, Athena, package operations
+3. Response includes timing information for performance monitoring
+4. Degraded states are clearly distinguished from healthy and unhealthy states
+5. Health responses include actionable recovery recommendations when issues are detected
+6. Response format is consistent and machine-parseable
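
One way to gather the component-level status and timing information that US2 asks for is to run each check inside a timing wrapper that also traps exceptions. The sketch below is illustrative only: `run_component_checks`, the `Check` alias, and the `duration_ms` and `error` field names are assumptions, not part of the spec or the existing `error_recovery` module.

```python
import time
from typing import Callable, Dict

# A check returns True when the component is usable; it may also raise.
Check = Callable[[], bool]


def run_component_checks(checks: Dict[str, Check]) -> Dict[str, dict]:
    """Run each check, recording its status and duration in milliseconds."""
    results: Dict[str, dict] = {}
    for name, check in checks.items():
        started = time.perf_counter()
        error = None
        try:
            status = "healthy" if check() else "unhealthy"
        except Exception as exc:  # a failing check must not break the report
            status = "unhealthy"
            error = str(exc)
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        entry = {"status": status, "duration_ms": round(elapsed_ms, 3)}
        if error is not None:
            entry["error"] = error
        results[name] = entry
    return results
```

Catching exceptions per component keeps the report complete even when one dependency fails. This sketch does not produce a "degraded" state; that would need checks that return something richer than a boolean.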
+
+### US3: Load Balancer Health Checks
+
+**As an** infrastructure engineer configuring load balancers for MCP server deployment
+**I want** lightweight health check endpoints with configurable response formats
+**So that** the load balancer can efficiently route traffic to healthy instances without impacting server performance
+
+**Acceptance Criteria:**
+1. `/health/simple` endpoint returns minimal response (HTTP status only) for efficiency
+2. Health checks complete within 1-2 seconds under normal conditions
+3. Health check endpoints have minimal resource overhead
+4. Support for custom health check timeouts via query parameters
+5. Health checks work reliably during high server load
+6. No side effects on MCP tool operations from health check requests
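
Criterion 4 of US3 (custom timeouts via query parameters) could be handled by a small parser that validates and clamps the caller's value. The spec does not name the parameter, so `timeout`, the function `parse_timeout`, and the `MAX_TIMEOUT_S` ceiling are all hypothetical choices for this sketch.

```python
import math
from urllib.parse import parse_qs, urlsplit

DEFAULT_TIMEOUT_S = 2.0  # mirrors the 1-2 second target in US3
MAX_TIMEOUT_S = 10.0     # hypothetical ceiling so callers cannot stall the server


def parse_timeout(url: str) -> float:
    """Read ?timeout=<seconds> from a URL, falling back to the default."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("timeout", [""])[0]
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_S
    # Reject NaN, infinities, zero, and negative values; clamp large ones.
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT_S
    return min(value, MAX_TIMEOUT_S)
```

Falling back to the default instead of returning an error keeps a misconfigured load balancer probe from being reported as a server failure.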
+
+### US4: Development and Debugging Support
+
+**As a** developer working on the MCP server or troubleshooting client issues
+**I want** detailed diagnostic information accessible via health endpoints
+**So that** I can quickly identify configuration problems and verify system components
+
+**Acceptance Criteria:**
+1. `/health/detailed` endpoint provides comprehensive diagnostic information
+2. Diagnostic data includes: transport mode, tool registration status, AWS configuration
+3. Error details include specific failure messages and suggested fixes
+4. Health endpoints work in all transport modes (stdio, http, sse)
+5. Diagnostic information helps differentiate client vs server issues
+6. Response format is human-readable for manual debugging
+
+### US5: CI/CD Pipeline Integration
+
+**As a** DevOps engineer managing automated deployments
+**I want** reliable health check endpoints for deployment validation
+**So that** CI/CD pipelines can automatically verify successful deployments before promoting to production
+
+**Acceptance Criteria:**
+1. Health endpoints are available immediately after server startup
+2. Health status accurately reflects server readiness for MCP requests
+3. Health checks integrate with existing Docker container health probe configurations
+4. Consistent behavior across different deployment environments
+5. Health check failures provide actionable information for automated remediation
+6. Support for health check retries with configurable intervals
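
The retry behaviour in US5 (criterion 6) is typically implemented on the client side of the probe, for example in a deployment script. A minimal sketch follows, assuming a caller-supplied `probe` callable; `wait_until_healthy` and its parameters are hypothetical names, not an existing API in this project.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 5,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, pausing `interval_s` between attempts."""
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: the server may still be starting.
            pass
        if attempt < retries:
            sleep(interval_s)
    return False
```

In a pipeline, `probe` might wrap an HTTP GET against `/health` and return whether the status code was 200. Injecting `sleep` keeps the loop testable without real delays.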
+
+## Numbered Acceptance Criteria
+
+1. **HTTP Transport Compatibility**: Health check endpoints MUST work when server is running in HTTP transport mode (`FASTMCP_TRANSPORT=http`)
+
+2. **Cross-Transport Support**: Health check functionality MUST be accessible across all transport modes (stdio, HTTP, SSE, streamable-http)
+
+3. **Standard Endpoint Structure**:
+   - `/health` - General health status (default)
+   - `/health/simple` - Minimal response for load balancers
+   - `/health/detailed` - Comprehensive diagnostics for debugging
+   - `/readiness` - Kubernetes-style readiness probe
+   - `/liveness` - Kubernetes-style liveness probe
+
+4. **Response Format Standardization**: All health endpoints MUST return JSON with consistent schema including:
+   - `status`: "healthy" | "degraded" | "unhealthy"
+   - `timestamp`: ISO 8601 timestamp
+   - `components`: Object with component-level status
+   - `message`: Human-readable status description
+
+5. **Performance Requirements**:
+   - Health checks MUST complete within 2 seconds under normal load
+   - Health checks MUST NOT impact MCP tool operation performance
+   - Memory overhead for health checks MUST be minimal (<10MB)
+
+6. **Error Handling**: Health endpoints MUST return appropriate HTTP status codes:
+   - 200: Healthy/Ready
+   - 503: Service Unavailable (unhealthy/not ready)
+   - 500: Internal server error during health check
+
+7. **Component Health Monitoring**: Health checks MUST validate core components:
+   - Authentication status (Quilt login state)
+   - AWS connectivity (STS, S3, Athena)
+   - MCP tool registration and availability
+   - Transport layer functionality
+
+8. **Container Integration**: Health endpoints MUST work with Docker health check directives and container orchestration health probes
+
+9. **No Authentication Required**: Health check endpoints MUST NOT require MCP protocol authentication or Quilt login
+
+10. **Graceful Degradation**: Health endpoints MUST function even when some server components are failing
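
Criteria 4 and 6 together imply a step that folds component statuses into one overall status and picks an HTTP code. The sketch below shows one way that could look. `aggregate_health` is a hypothetical name, and mapping "degraded" to 200 is an assumption: the criteria only assign codes to the healthy and unhealthy cases.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

# Severity order used to pick the worst component status.
_SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}
# Assumption: a degraded server still serves traffic, so it maps to 200.
_HTTP_STATUS = {"healthy": 200, "degraded": 200, "unhealthy": 503}


def aggregate_health(components: Dict[str, str]) -> Tuple[int, dict]:
    """Return (http_status, body) using the four fields listed in criterion 4."""
    # The overall status is the most severe status of any component.
    overall = max(components.values(), key=_SEVERITY.__getitem__, default="healthy")
    failing = sorted(name for name, status in components.items() if status != "healthy")
    if failing:
        message = "attention needed: " + ", ".join(failing)
    else:
        message = "all components healthy"
    body = {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "components": dict(components),
        "message": message,
    }
    return _HTTP_STATUS[overall], body
```

A status string outside the three allowed values raises `KeyError` here, which a real handler would presumably translate into the 500 case from criterion 6.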
+
+## Success Metrics
+
+1. **Availability**: Health check endpoints respond successfully >99.9% of the time
+2. **Response Time**: 95th percentile response time <1 second for simple health checks
+3. **Accuracy**: Health status correctly reflects actual server state >99% of the time
+4. **Integration Success**: Successfully integrates with at least 2 container orchestration platforms
+5. **Zero False Negatives**: Health endpoints never report unhealthy status when server is actually functional
+
+## Open Questions
+
+1. **Endpoint Naming Convention**: Should we follow Kubernetes convention (`/healthz`, `/readyz`) or HTTP standard (`/health`, `/status`)?
+
+2. **Authentication Requirement**: Should detailed health endpoints require any form of authentication to prevent information disclosure?
+
+3. **Caching Strategy**: Should health check results be cached to reduce overhead, and if so, for how long?
+
+4. **Custom Health Checks**: Should there be a way for users to register custom health checks for their specific deployment requirements?
+
+5. **Metrics Integration**: Should health endpoints also expose Prometheus-style metrics, or keep that separate?
+
+6. **Transport-Specific Behavior**: Should health check behavior differ between stdio and HTTP transports, or remain identical?
+
+7. **Failure Thresholds**: What should be the criteria for marking individual components or overall system as unhealthy vs degraded?
+
+8. **Recovery Actions**: Should health endpoints support triggering automatic recovery actions (like cache clearing or reconnection attempts)?
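
To make the caching question (item 3) concrete, here is what a time-to-live cache around an expensive check might look like. This illustrates one option and does not answer the open question: the class `CachedHealthCheck` and the 5-second default are hypothetical.

```python
import time
from typing import Callable, Optional


class CachedHealthCheck:
    """Re-run an expensive check at most once per `ttl_s` seconds."""

    def __init__(
        self,
        check: Callable[[], dict],
        ttl_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._cached: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        """Return the cached result, refreshing it once the TTL has elapsed."""
        now = self._clock()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._check()
            self._expires_at = now + self._ttl_s
        return self._cached
```

The trade-off is staleness: with a TTL of N seconds, a probe can report a healthy state for up to N seconds after a component fails. The sketch is also not thread-safe, so concurrent callers would need a lock around `get`.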
+
+## Dependencies and Constraints
+
+- **Existing Code**: Must integrate with existing `error_recovery.health_check_with_recovery()` functionality
+- **FastMCP Framework**: Must work within FastMCP server architecture and routing
+- **Transport Compatibility**: Must function across all supported transport modes
+- **Docker Integration**: Must be compatible with existing Docker container configuration
+- **Performance**: Must not significantly impact existing MCP tool performance
+- **Backward Compatibility**: Must not break existing server functionality or client connections
