quantities of logs. This framework is needed in order to carefully manage AI context and
extract useful information without having to load the entire log contents into context. All output files will be
saved to an <analysis_directory>, which should be named "analysis" and placed inside the original log directory.

## Usage Check

First, verify that a log directory path was provided. If no path was given in the command arguments, respond with:

> Error: No log directory path provided.
>
> Usage Example: /preprocess-logs logs_41509396734
>
> This command analyzes log files in the specified directory by:
> 1. Splitting large files into manageable shards
> 2. Searching for various error patterns
> 3. Generating a human-readable report of failures
>
> The analysis and preprocessing artifacts will be saved inside the log directory.

Only proceed with the analysis if a valid directory path was provided.
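The guard above can be sketched as shell. This is only an executable illustration — the agent performs this validation itself, and `check_args` is a hypothetical helper name:

```shell
# Illustrative only: return non-zero when the argument is missing or is not
# an existing directory, printing the error message from the usage check.
check_args() {
  if [ -z "$1" ] || [ ! -d "$1" ]; then
    echo "Error: No log directory path provided." >&2
    return 1
  fi
}

# Example (hypothetical path): check_args logs_41509396734 || exit 1
```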

## Phase 0: Check for Pre-existing Analysis

Before beginning the log preprocessing procedure, check if a previous analysis has already been completed for the target log files.

2. **Verify analysis completeness**: If the analysis directory exists, check for the presence of key analysis artifacts:
- `shards/` directory containing shard files
- `search_results/` directory containing search result files
- `<original_log_directory_name>_preprocessing_report.md`

3. **User confirmation for re-analysis**: If a complete analysis is found, ask the user for confirmation before proceeding:

> Found existing analysis for <original_log_directory>. The analysis includes:
> - X shard files
> - Search results
> - Preprocessing report
>
> Do you want to re-analyze these logs and overwrite the existing analysis?
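The completeness check in steps 1–2 can be sketched as shell, under the directory layout assumed above. `analysis_complete` is an illustrative helper name, not part of the command itself:

```shell
# Return success only when all key analysis artifacts are present.
# Sketch only; the agent performs this check itself.
analysis_complete() {
  dir="$1/analysis"
  [ -d "$dir/shards" ] &&
    [ -d "$dir/search_results" ] &&
    ls "$dir"/*_preprocessing_report.md >/dev/null 2>&1
}

# Example (hypothetical path):
#   analysis_complete logs_41509396734 && echo "ask before re-analyzing"
```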

## Phase 1: Split Large Logs

Log files are often too large to load into context directly, or even to open with the
intended analysis tool.

1. Store all shard files in a directory called "shards", inside the <analysis_directory>. The analysis directory
should be named `analysis` and placed inside the original log directory. Each shard should be named
`<original_log_name>_shard_<shard_decimal_index>`.

2. **Split Command:** Use the following command to split log files into shards with decimal numbering:

`split -l 1800 -d -a 3 "<original_log_path>" "<original_log_directory>/analysis/shards/<original_log_name>_shard_"`
- `-l 1800`: Split every 1800 lines
- `-d`: Use numeric suffixes instead of alphabetic
- `-a 3`: Use 3-digit suffixes for better readability and sorting

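Applied to every file in a log directory, the split step might look like the sketch below. `shard_logs` is an illustrative helper, and stripping the extension with `${name%.*}` is an assumption about the naming convention:

```shell
# Shard every regular file in the log directory into analysis/shards/.
shard_logs() {
  log_dir="$1"
  mkdir -p "$log_dir/analysis/shards"
  for f in "$log_dir"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    # 1800-line shards with 3-digit numeric suffixes, per the flags above
    split -l 1800 -d -a 3 "$f" "$log_dir/analysis/shards/${name%.*}_shard_"
  done
}

# Example (hypothetical path): shard_logs logs_41509396734
```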
Example shard files:
```
<original_log_name>_shard_000
<original_log_name>_shard_001
<original_log_name>_shard_002
```

## Phase 2: Search for Failure Patterns

Search the shards for failure-related patterns. No deep analysis happens
at this point: we are simply generating an index of lines that might potentially be relevant.

First, create the search results directory:

```bash
mkdir -p "<original_log_directory>/analysis/search_results"
```

### Search Profiles

Use targeted search profiles based on the type of failures you're looking for. If the user didn't specify what to
search for, run each profile in turn.

#### Profile 1: Test Failures
For standard test output failures:
```bash
rg --line-number --ignore-case --json -C 5 -- "^[-]{3} FAIL:|\\s+FAIL\$|\\s+FAIL\\t|\\[FAILED\\]|panic: test timed out" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/test_failures_search.jsonl"
```

#### Profile 2: Connection/Network Errors
For network-related issues:
```bash
rg --line-number --ignore-case --json -C 5 "ECONNREFUSED|connection refused|dial.*failed|cannot connect|connection reset" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/connection_errors_search.jsonl"
```

#### Profile 3: Startup/Initialization Errors
For service startup problems:
```bash
rg --line-number --ignore-case --json -C 5 "error starting|failed to start|initialization failed|startup failed|cannot initialize" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/startup_errors_search.jsonl"
```

#### Profile 4: Docker/Container Issues
For container-related problems:
```bash
rg --line-number --ignore-case --json -C 5 "container.*failed|docker.*error|OCI runtime|container.*exit.*[1-9]" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/container_errors_search.jsonl"
```

#### Profile 5: Resource/Timeout Issues
For resource constraints and timeouts:
```bash
rg --line-number --ignore-case --json -C 5 "out of memory|OOM|deadline exceeded|context canceled|timeout waiting" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/resource_errors_search.jsonl"
```

#### Profile 6: Panic/Crash Detection
For application crashes:
```bash
rg --line-number --ignore-case --json -C 5 "panic:|fatal error:|segmentation fault|SIGSEGV|goroutine.*panic" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/panic_errors_search.jsonl"
```

#### Fallback: General Errors
Only use if specific searches yield no results:
```bash
rg --line-number --ignore-case --json -C 5 "ERROR|FAIL|CRITICAL" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/general_errors_search.jsonl"
```
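When no profile was requested, iterating over all of them can be sketched as below. `pattern_for` and `run_profiles` are illustrative helpers, and the patterns here are abbreviated — use the full commands above in practice:

```shell
# Map an abbreviated profile name to its (shortened) pattern.
pattern_for() {
  case "$1" in
    test_failures)     printf '%s\n' 'panic: test timed out|\[FAILED\]' ;;
    connection_errors) printf '%s\n' 'ECONNREFUSED|connection refused|connection reset' ;;
    panic_errors)      printf '%s\n' 'panic:|fatal error:|SIGSEGV' ;;
  esac
}

# Run each profile, writing one result file per profile.
run_profiles() {
  log_dir="$1"
  for n in test_failures connection_errors panic_errors; do
    # rg exits 1 when a profile finds nothing; that is not an error here
    rg --line-number --ignore-case --json -C 5 -- "$(pattern_for "$n")" \
      "$log_dir/analysis/shards/" \
      > "$log_dir/analysis/search_results/${n}_search.jsonl" || true
  done
}
```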

### Search Result Management

After running each search profile, split the results into manageable shards:

```bash
# Split search results into 1800-line shards
split -l 1800 -d -a 3 "<original_log_directory>/analysis/search_results/test_failures_search.jsonl" \
"<original_log_directory>/analysis/search_results/test_failures_shard_"
```

Repeat this for each search profile that generates results.
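Generalizing the command above, one loop can shard whichever result files are non-empty. This is a sketch; `shard_results` is an illustrative name:

```shell
# Shard every non-empty *_search.jsonl under analysis/search_results/.
shard_results() {
  for f in "$1"/analysis/search_results/*_search.jsonl; do
    [ -s "$f" ] || continue   # skip empty or missing result files
    split -l 1800 -d -a 3 "$f" "${f%_search.jsonl}_shard_"
  done
}
```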

**Ripgrep JSON Output Structure:**
The ripgrep command outputs JSON lines where each entry has a `type` field:
- `"type":"match"` - Contains the actual match with file path, line number, and matched text
- `"type":"context"` - Contains surrounding context lines with their line numbers
- `"type":"begin"` and `"type":"end"` - File boundaries and summary statistics

All match and context information is preserved in the JSON output with precise line numbers and file paths.
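A quick tally of match records per result file — useful for the report's summary counts — needs only grep. `count_matches` is an illustrative helper:

```shell
# Count "type":"match" records in a ripgrep JSON result file.
count_matches() {
  # grep -c prints 0 (and exits non-zero) when there are no matches
  grep -c '"type":"match"' "$1" || true
}

# Example:
#   count_matches "<original_log_directory>/analysis/search_results/test_failures_search.jsonl"
```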
Use the search result files in `<original_log_directory>/analysis/search_results/` directly for analysis in Phase 3.

## Phase 3: Generate Human Readable Log Preprocessing Report

This phase produces a structured summary for human consumption. Store the report as a **Markdown file** at
- Lines that would suffer from being split (e.g., URLs, code snippets, file paths) may exceed this limit
- Apply best-effort line wrapping for readability while preserving technical accuracy

### Report Type: Test Output

If the logs represent output from one or more tests, then the report will focus on describing tests that included failures.
The search results from Phase 2 should be treated as a *starting
point* for finding failed tests.

The basic format of the `Preprocessing Report` for logs representing tests is as follows:

> # Test Output Preprocessing Report
>
> ## Search Results Summary
> - Log Type Detected: <test_output|container_logs|system_logs>
> - Total Matches Found:
> - Test Failures: X matches
> - Connection Errors: Y matches
> - [other profiles...]
>
> ## Test Failures
>
> <list of failed tests> // see below for details of how test failures should be structured
>
> ## Failure Clusters
>
> <list of classes of failures> // see below for details of how failure classes should be structured

For each match entry (`"type":"match"`) in the ripgrep JSON output, perform the following steps:

e.g. "Root component invalid array access", or "runtime type panic in ServerProcess"
3. Record the test failure in the report:

> ### CI Action: Unit Tests <-- this is the group the test belongs to.
> <-- if the test group has already been added to the report, add the test failure entry under the existing heading
>
> 1. `TestParallelProcessing` <-- this is the name of the test
> - failure location: `unit_tests_shard_003` line 62 <-- record where the error can be found in the shard files
> - failure class: `consistency assertion failed in MainLoop` <-- determined failure class
> - relevant log lines: <-- try to show a brief selection of log lines that make it easy to understand what happened
> ```
> ...
> ```

Note that a given test should not have multiple entries. If multiple match entries in the ripgrep JSON output correspond
to a single test, try to determine what the "actual" cause of the failure was. If unsure, include all potentially
Only record the specific test failure (`TestSpecificFunction`), not the suite summary.

Example failure clusters:

> ## Failure Clusters
>
> 1. Nullptr Access
> a. `CI Action: Unit Tests::TestNewImpl`
> 2. Invalid Configuration
> a. `CI Action: Unit Tests::TestProcessing`
> b. `CI Action: E2E Tests::TestEndToEndInMemory`

### Report Type: Arbitrary Log Output

If the logs are not test output, the report should instead focus on the list
of failure clusters. To do this, follow the same procedure defined above.
## Context Compaction

Since you will be dealing with large quantities of data, it is likely that you will need to compact context despite
best efforts to limit what's being loaded.

### Strategies for managing large result sets:

1. **Process shards sequentially**: Load and analyze one shard at a time, maintaining running totals/summaries
2. **Prioritize unique failures**: Focus on distinct error patterns rather than repetitive instances
3. **Discard processed content**: After extracting relevant information from a shard, clear it from context

Discard context related to literal log contents first: retain in context information related to what specific
tests have failed, and what classes of failure are being observed.