Skip to content

feat: Add last chance check before orphan termination #4595

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 46 commits into from
Jul 21, 2025
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
3fa09d2
feat(runners): add runnerId to RunnerList and implement untag functio…
stuartp44 May 21, 2025
14df9fe
fix(tests): improve clarity of orphaned runner untagging test descrip…
stuartp44 May 22, 2025
9d7c89a
fmt
stuartp44 May 22, 2025
e09337f
fix(scale-down): remove unnecessary logging of runner variable in ter…
stuartp44 May 22, 2025
9064512
fix(scale-up): remove unused import of run function
stuartp44 May 22, 2025
716e079
fix(scale-down): remove unused import of metricGitHubAppRateLimit
stuartp44 May 23, 2025
e826fbe
Merge branch 'main' into stu/add_tag_plus_check
npalm May 23, 2025
a5fcc88
Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts
stuartp44 May 26, 2025
bb8ba2b
Update lambdas/functions/control-plane/src/scale-runners/scale-down.ts
stuartp44 May 26, 2025
97de234
Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts
stuartp44 May 26, 2025
8b61d39
Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts
stuartp44 May 26, 2025
9883b0b
Remove warning log for orphan runners without runnerId in scale-down …
stuartp44 May 29, 2025
43468a7
Remove logging of runner ID marking in addGhRunnerIdToEC2InstanceTag …
stuartp44 May 29, 2025
bc995ef
readded metricGitHubAppRateLimit
stuartp44 Jun 3, 2025
1182c8b
Merge branch 'main' into stu/add_tag_plus_check
stuartp44 Jun 3, 2025
6e1c72c
Refactor runner interfaces: remove RunnerState interface and update i…
stuartp44 Jun 3, 2025
097c14d
Add headers to runner state return and update logging for busy state
stuartp44 Jun 4, 2025
971ec2d
Remove redundant comment describing RunnerState type
stuartp44 Jun 4, 2025
5a275e5
Implement last chance check for orphan runners and refactor terminati…
stuartp44 Jun 4, 2025
9f59abe
Format return object in getGitHubSelfHostedRunnerState for improved r…
stuartp44 Jun 4, 2025
65c0b0e
Refactor runner state types to use Endpoints for improved type safety…
stuartp44 Jun 4, 2025
8ead598
Fix formatting of type definitions and adjust indentation in getGitHu…
stuartp44 Jun 4, 2025
ab5b6b0
Update lambdas/functions/control-plane/src/aws/runners.ts
stuartp44 Jun 4, 2025
0b462bb
Update lambdas/functions/control-plane/src/scale-runners/scale-down.t…
stuartp44 Jun 4, 2025
102edf0
Fix typo in key for GitHub runner ID in mock running instances
stuartp44 Jun 4, 2025
e6d2d88
Merge branch 'main' into stu/add_tag_plus_check
npalm Jun 10, 2025
83610eb
Merge branch 'main' into stu/add_tag_plus_check
npalm Jun 11, 2025
898226d
fix: add missing ec2:RemoveTags action in lambda-scale-down policy
stuartp44 Jul 7, 2025
15af241
Merge branch 'main' into stu/add_tag_plus_check
stuartp44 Jul 7, 2025
8ff9901
fix: streamline orphan runner handling and improve logging
stuartp44 Jul 8, 2025
e4a4bfa
fix(lambda): replace ec2:RemoveTags with ec2:DeleteTags and tag:Untag…
stuartp44 Jul 9, 2025
f5a0388
fix: update logging messages for orphan runner tagging and state checks
stuartp44 Jul 9, 2025
a245b2b
fix: improve logging message for orphan runner judgment
stuartp44 Jul 9, 2025
6dca64f
fix: change var to let for isOrphan in lastChanceCheckOrphanRunner fu…
stuartp44 Jul 9, 2025
c335fc7
fix: refactor orphan termination logic for clarity and consistency
stuartp44 Jul 9, 2025
7b711c0
Merge branch 'main' into stu/add_tag_plus_check
stuartp44 Jul 9, 2025
1baf4af
fix: rename addGhRunnerIdToEC2InstanceTag to tagRunnerId for clarity
stuartp44 Jul 9, 2025
c2b1033
fix: improve logging message format for orphan runner judgment
stuartp44 Jul 9, 2025
3617608
fix: enhance orphan runner judgment logic for clarity
stuartp44 Jul 9, 2025
b8af3f0
fix: simplify orphan runner judgment logic by removing checks that in…
stuartp44 Jul 9, 2025
1ddc98e
Merge branch 'main' into stu/add_tag_plus_check
stuartp44 Jul 18, 2025
835ca85
fix(scale-down): rename unmarkOrphan function to unMarkOrphan for con…
stuartp44 Jul 21, 2025
3551c77
fix(lambda): remove unused tag:UntagResources action from policy
stuartp44 Jul 21, 2025
441c4f4
fix(runner): update orphan tag value to 'true' in untag runner test
stuartp44 Jul 21, 2025
1a59d77
Merge branch 'main' into stu/add_tag_plus_check
npalm Jul 21, 2025
9aa3203
fix: Tag non jit instances with runner id and add docs (#4667)
npalm Jul 21, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions lambdas/functions/control-plane/src/aws/runners.d.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ export interface RunnerList {
repo?: string;
org?: string;
orphan?: boolean;
runnerId?: string;
}

export interface RunnerInfo {
Expand Down
67 changes: 63 additions & 4 deletions lambdas/functions/control-plane/src/aws/runners.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import {
CreateFleetInstance,
CreateFleetResult,
CreateTagsCommand,
DeleteTagsCommand,
DefaultTargetCapacityType,
DescribeInstancesCommand,
DescribeInstancesResult,
Expand All @@ -17,7 +18,7 @@ import { mockClient } from 'aws-sdk-client-mock';
import 'aws-sdk-client-mock-jest/vitest';

import ScaleError from './../scale-runners/ScaleError';
import { createRunner, listEC2Runners, tag, terminateRunner } from './runners';
import { createRunner, listEC2Runners, tag, untag, terminateRunner } from './runners';
import { RunnerInfo, RunnerInputParameters, RunnerType } from './runners.d';
import { describe, it, expect, beforeEach, vi } from 'vitest';

Expand Down Expand Up @@ -53,14 +54,34 @@ const mockRunningInstances: DescribeInstancesResult = {
},
],
};
const mockRunningInstancesJit: DescribeInstancesResult = {
Reservations: [
{
Instances: [
{
LaunchTime: new Date('2020-10-10T14:48:00.000+09:00'),
InstanceId: 'i-1234',
Tags: [
{ Key: 'ghr:Application', Value: 'github-action-runner' },
{ Key: 'ghr:runner_name_prefix', Value: RUNNER_NAME_PREFIX },
{ Key: 'ghr:created_by', Value: 'scale-up-lambda' },
{ Key: 'ghr:Type', Value: 'Org' },
{ Key: 'ghr:Owner', Value: 'CoderToCat' },
{ Key: 'ghr:github_runner_id', Value: '9876543210' },
],
},
],
},
],
};

describe('list instances', () => {
beforeEach(() => {
vi.resetModules();
vi.clearAllMocks();
});

it('returns a list of instances', async () => {
it('returns a list of instances (Non JIT)', async () => {
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstances);
const resp = await listEC2Runners();
expect(resp.length).toBe(1);
Expand All @@ -73,6 +94,20 @@ describe('list instances', () => {
});
});

it('returns a list of instances (JIT)', async () => {
mockEC2Client.on(DescribeInstancesCommand).resolves(mockRunningInstancesJit);
const resp = await listEC2Runners();
expect(resp.length).toBe(1);
expect(resp).toContainEqual({
instanceId: 'i-1234',
launchTime: new Date('2020-10-10T14:48:00.000+09:00'),
type: 'Org',
owner: 'CoderToCat',
orphan: false,
runnerId: '9876543210',
});
});

it('check orphan tag.', async () => {
const instances: DescribeInstancesResult = mockRunningInstances;
instances.Reservations![0].Instances![0].Tags!.push({ Key: 'ghr:orphan', Value: 'true' });
Expand Down Expand Up @@ -229,11 +264,35 @@ describe('tag runner', () => {
owner: 'owner-2',
type: 'Repo',
};
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'truer' }]);
await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);

expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
});
});

describe('untag runner', () => {
beforeEach(() => {
vi.clearAllMocks();
});
it('removing extra tag', async () => {
mockEC2Client.on(DeleteTagsCommand).resolves({});
const runner: RunnerInfo = {
instanceId: 'instance-2',
owner: 'owner-2',
type: 'Repo',
};
//await tag(runner.instanceId, [{ Key: 'ghr:orphan', Value: '' }]);
expect(mockEC2Client).toHaveReceivedCommandWith(CreateTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'truer' }],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
await untag(runner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
expect(mockEC2Client).toHaveReceivedCommandWith(DeleteTagsCommand, {
Resources: [runner.instanceId],
Tags: [{ Key: 'ghr:orphan', Value: 'true' }],
});
});
});
Expand Down
8 changes: 8 additions & 0 deletions lambdas/functions/control-plane/src/aws/runners.ts
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ import {
CreateFleetCommand,
CreateFleetResult,
CreateTagsCommand,
DeleteTagsCommand,
DescribeInstancesCommand,
DescribeInstancesResult,
EC2Client,
Expand Down Expand Up @@ -91,6 +92,7 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
repo: i.Tags?.find((e) => e.Key === 'ghr:Repo')?.Value as string,
org: i.Tags?.find((e) => e.Key === 'ghr:Org')?.Value as string,
orphan: i.Tags?.find((e) => e.Key === 'ghr:orphan')?.Value === 'true',
runnerId: i.Tags?.find((e) => e.Key === 'ghr:github_runner_id')?.Value as string,
});
}
}
Expand All @@ -112,6 +114,12 @@ export async function tag(instanceId: string, tags: Tag[]): Promise<void> {
await ec2.send(new CreateTagsCommand({ Resources: [instanceId], Tags: tags }));
}

export async function untag(instanceId: string, tags: Tag[]): Promise<void> {
logger.debug(`Untagging '${instanceId}'`, { tags });
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
await ec2.send(new DeleteTagsCommand({ Resources: [instanceId], Tags: tags }));
}

function generateFleetOverrides(
subnetIds: string[],
instancesTypes: string[],
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ import nock from 'nock';

import { RunnerInfo, RunnerList } from '../aws/runners.d';
import * as ghAuth from '../github/auth';
import { listEC2Runners, terminateRunner, tag } from './../aws/runners';
import { listEC2Runners, terminateRunner, tag, untag } from './../aws/runners';
import { githubCache } from './cache';
import { newestFirstStrategy, oldestFirstStrategy, scaleDown } from './scale-down';
import { describe, it, expect, beforeEach, vi } from 'vitest';
Expand Down Expand Up @@ -33,6 +33,7 @@ vi.mock('./../aws/runners', async (importOriginal) => {
return {
...actual,
tag: vi.fn(),
untag: vi.fn(),
terminateRunner: vi.fn(),
listEC2Runners: vi.fn(),
};
Expand Down Expand Up @@ -62,6 +63,7 @@ const mockedInstallationAuth = vi.mocked(ghAuth.createGithubInstallationAuth);
const mockCreateClient = vi.mocked(ghAuth.createOctokitClient);
const mockListRunners = vi.mocked(listEC2Runners);
const mockTagRunners = vi.mocked(tag);
const mockUntagRunners = vi.mocked(untag);
const mockTerminateRunners = vi.mocked(terminateRunner);

export interface TestData {
Expand Down Expand Up @@ -312,7 +314,7 @@ describe('Scale down runners', () => {
checkNonTerminated(runners);
});

it(`Should terminate orphan.`, async () => {
it(`Should terminate orphan (Non JIT)`, async () => {
// setup
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, false, false);
const idleRunner = createRunnerTestData('idle-1', type, MINIMUM_BOOT_TIME + 1, true, false, false);
Expand All @@ -334,6 +336,7 @@ describe('Scale down runners', () => {
Value: 'true',
},
]);

expect(mockTagRunners).not.toHaveBeenCalledWith(idleRunner.instanceId, expect.anything());

// next cycle, update test data set orphan to true and terminate should be true
Expand All @@ -348,6 +351,58 @@ describe('Scale down runners', () => {
checkNonTerminated(runners);
});

it('Should test if orphaned runner, untag if online and busy, else terminate (JIT)', async () => {
// arrange
const orphanRunner = createRunnerTestData(
'orphan-jit',
type,
MINIMUM_BOOT_TIME + 1,
false,
true,
false,
undefined,
1234567890,
);
const runners = [orphanRunner];

mockGitHubRunners([]);
mockAwsRunners(runners);

if (type === 'Repo') {
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
});
} else {
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
data: { id: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'online' },
});
}

// act
await scaleDown();

// assert
expect(mockUntagRunners).toHaveBeenCalledWith(orphanRunner.instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
expect(mockTerminateRunners).not.toHaveBeenCalledWith(orphanRunner.instanceId);

// arrange
if (type === 'Repo') {
mockOctokit.actions.getSelfHostedRunnerForRepo.mockResolvedValueOnce({
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
});
} else {
mockOctokit.actions.getSelfHostedRunnerForOrg.mockResolvedValueOnce({
data: { runnerId: 1234567890, name: orphanRunner.instanceId, busy: true, status: 'offline' },
});
}

// act
await scaleDown();

// assert
expect(mockTerminateRunners).toHaveBeenCalledWith(orphanRunner.instanceId);
});

it(`Should ignore errors when termination orphan fails.`, async () => {
// setup
const orphanRunner = createRunnerTestData('orphan-1', type, MINIMUM_BOOT_TIME + 1, false, true, true);
Expand Down Expand Up @@ -625,6 +680,7 @@ function createRunnerTestData(
orphan: boolean,
shouldBeTerminated: boolean,
owner?: string,
runnerId?: number,
): RunnerTestItem {
return {
instanceId: `i-${name}-${type.toLowerCase()}`,
Expand All @@ -638,5 +694,6 @@ function createRunnerTestData(
registered,
orphan,
shouldBeTerminated,
runnerId: runnerId !== undefined ? String(runnerId) : undefined,
};
}
73 changes: 60 additions & 13 deletions lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Original file line number Diff line number Diff line change
@@ -1,9 +1,10 @@
import { Octokit } from '@octokit/rest';
import { Endpoints } from '@octokit/types';
import { createChildLogger } from '@aws-github-runner/aws-powertools-util';
import moment from 'moment';

import { createGithubAppAuth, createGithubInstallationAuth, createOctokitClient } from '../github/auth';
import { bootTimeExceeded, listEC2Runners, tag, terminateRunner } from './../aws/runners';
import { bootTimeExceeded, listEC2Runners, tag, untag, terminateRunner } from './../aws/runners';
import { RunnerInfo, RunnerList } from './../aws/runners.d';
import { GhRunners, githubCache } from './cache';
import { ScalingDownConfig, getEvictionStrategy, getIdleRunnerCount } from './scale-down-config';
Expand All @@ -12,6 +13,10 @@ import { getGitHubEnterpriseApiUrl } from './scale-up';

const logger = createChildLogger('scale-down');

type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];

async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
const key = runner.owner;
const cachedOctokit = githubCache.clients.get(key);
Expand Down Expand Up @@ -46,7 +51,11 @@ async function getOrCreateOctokit(runner: RunnerInfo): Promise<Octokit> {
return octokit;
}

async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
async function getGitHubSelfHostedRunnerState(
client: Octokit,
ec2runner: RunnerInfo,
runnerId: number,
): Promise<RunnerState> {
const state =
ec2runner.type === 'Org'
? await client.actions.getSelfHostedRunnerForOrg({
Expand All @@ -58,12 +67,15 @@ async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo,
owner: ec2runner.owner.split('/')[0],
repo: ec2runner.owner.split('/')[1],
});

logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.data.busy}`);

metricGitHubAppRateLimit(state.headers);

return state.data.busy;
return state.data;
}

async function getGitHubRunnerBusyState(client: Octokit, ec2runner: RunnerInfo, runnerId: number): Promise<boolean> {
const state = await getGitHubSelfHostedRunnerState(client, ec2runner, runnerId);
logger.info(`Runner '${ec2runner.instanceId}' - GitHub Runner ID '${runnerId}' - Busy: ${state.busy}`);
return state.busy;
}

async function listGitHubRunners(runner: RunnerInfo): Promise<GhRunners> {
Expand Down Expand Up @@ -194,24 +206,59 @@ async function evaluateAndRemoveRunners(
async function markOrphan(instanceId: string): Promise<void> {
try {
await tag(instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
logger.info(`Runner '${instanceId}' marked as orphan.`);
logger.info(`Runner '${instanceId}' tagged as orphan.`);
} catch (e) {
logger.error(`Failed to mark runner '${instanceId}' as orphan.`, { error: e });
logger.error(`Failed to tag runner '${instanceId}' as orphan.`, { error: e });
}
}

async function unmarkOrphan(instanceId: string): Promise<void> {
try {
await untag(instanceId, [{ Key: 'ghr:orphan', Value: 'true' }]);
logger.info(`Runner '${instanceId}' untagged as orphan.`);
} catch (e) {
logger.error(`Failed to un-tag runner '${instanceId}' as orphan.`, { error: e });
}
}

async function lastChanceCheckOrphanRunner(runner: RunnerList): Promise<boolean> {
const client = await getOrCreateOctokit(runner as RunnerInfo);
const runnerId = parseInt(runner.runnerId || '0');
const ec2Instance = runner as RunnerInfo;
const state = await getGitHubSelfHostedRunnerState(client, ec2Instance, runnerId);
let isOrphan = false;
logger.debug(
`Runner '${runner.instanceId}' is '${state.status}' and is currently '${state.busy ? 'busy' : 'idle'}'.`,
);
const isOfflineAndBusy = state.status === 'offline' && state.busy;
if (isOfflineAndBusy) {
isOrphan = true;
}
logger.info(`Runner '${runner.instanceId}' is judged to ${isOrphan ? 'be' : 'not be'} orphaned.`);
return isOrphan;
}

async function terminateOrphan(environment: string): Promise<void> {
try {
const orphanRunners = await listEC2Runners({ environment, orphan: true });

for (const runner of orphanRunners) {
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
await terminateRunner(runner.instanceId).catch((e) => {
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
if (runner.runnerId) {
const isOrphan = await lastChanceCheckOrphanRunner(runner);
if (isOrphan) {
await terminateRunner(runner.instanceId);
} else {
await unmarkOrphan(runner.instanceId);
}
} else {
logger.info(`Terminating orphan runner '${runner.instanceId}'`);
await terminateRunner(runner.instanceId).catch((e) => {
logger.error(`Failed to terminate orphan runner '${runner.instanceId}'`, { error: e });
});
}
}
} catch (e) {
logger.warn(`Failure during orphan runner termination.`, { error: e });
logger.warn(`Failure during orphan termination processing.`, { error: e });
}
}

Expand Down
Loading