quantities of logs. This framework is needed in order to carefully manage AI context and
extract useful information without having to load the entire log contents into context. All output files will be
saved to an <analysis_directory>, which should be named "analysis" and placed inside the original log directory.

## Usage Check

First, verify that a log directory path was provided. If no path was given in the command arguments, respond with:

> Error: No log directory path provided.
>
> Usage Example: /preprocess-logs logs_41509396734
>
> This command analyzes log files in the specified directory by:
> 1. Splitting large files into manageable shards
> 2. Searching for various error patterns
> 3. Generating a human-readable report of failures
>
> The analysis and preprocessing artifacts will be saved inside the log directory.

Only proceed with the analysis if a valid directory path was provided.
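The guard above can be sketched as shell. This is only an executable illustration — the agent performs this validation itself, and `check_args` is a hypothetical helper name:

```shell
# Illustrative only: return non-zero when the argument is missing or is not
# an existing directory, printing the error message from the usage check.
check_args() {
  if [ -z "$1" ] || [ ! -d "$1" ]; then
    echo "Error: No log directory path provided." >&2
    return 1
  fi
}

# Example (hypothetical path): check_args logs_41509396734 || exit 1
```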

## Phase 0: Check for Pre-existing Analysis

Before beginning the log preprocessing procedure, check if a previous analysis has already been completed for the target log files.

2. **Verify analysis completeness**: If the analysis directory exists, check for the presence of key analysis artifacts:
- `shards/` directory containing shard files
- `search_results/` directory containing search result files
- `<original_log_directory_name>_preprocessing_report.md`

3. **User confirmation for re-analysis**: If a complete analysis is found, ask the user for confirmation before proceeding:

> Found existing analysis for <original_log_directory>. The analysis includes:
> - X shard files
> - Search results
> - Preprocessing report
>
> Do you want to re-analyze these logs and overwrite the existing analysis?
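The completeness check in steps 1–2 can be sketched as shell, under the directory layout assumed above. `analysis_complete` is an illustrative helper name, not part of the command itself:

```shell
# Return success only when all key analysis artifacts are present.
# Sketch only; the agent performs this check itself.
analysis_complete() {
  dir="$1/analysis"
  [ -d "$dir/shards" ] &&
    [ -d "$dir/search_results" ] &&
    ls "$dir"/*_preprocessing_report.md >/dev/null 2>&1
}

# Example (hypothetical path):
#   analysis_complete logs_41509396734 && echo "ask before re-analyzing"
```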

## Phase 1: Split Large Logs

Log files are often too large to load into context directly, or even to open with the
intended analysis tool.

1. Store all shard files in a directory called "shards", inside the <analysis_directory>. The analysis directory
should be named `analysis` and placed inside the original log directory. Each shard should be named
`<original_log_name>_shard_<shard_decimal_index>`.

2. **Split Command:** Use the following command to split log files into shards with decimal numbering:

`split -l 1800 -d -a 3 "<original_log_path>" "<original_log_directory>/analysis/shards/<original_log_name>_shard_"`
- `-l 1800`: Split every 1800 lines
- `-d`: Use numeric suffixes instead of alphabetic
- `-a 3`: Use 3-digit suffixes for better readability and sorting

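Applied to every file in a log directory, the split step might look like the sketch below. `shard_logs` is an illustrative helper, and stripping the extension with `${name%.*}` is an assumption about the naming convention:

```shell
# Shard every regular file in the log directory into analysis/shards/.
shard_logs() {
  log_dir="$1"
  mkdir -p "$log_dir/analysis/shards"
  for f in "$log_dir"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    # 1800-line shards with 3-digit numeric suffixes, per the flags above
    split -l 1800 -d -a 3 "$f" "$log_dir/analysis/shards/${name%.*}_shard_"
  done
}

# Example (hypothetical path): shard_logs logs_41509396734
```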
Example shard files:
```
<original_log_name>_shard_000
<original_log_name>_shard_001
<original_log_name>_shard_002
```

## Phase 2: Search for Failure Patterns

Search the shards for failure-related patterns. No deep analysis happens
at this point: we are simply generating an index of lines that might potentially be relevant.

First, create the search results directory:

```bash
mkdir -p "<original_log_directory>/analysis/search_results"
```

### Search Profiles

Use targeted search profiles based on the type of failures you're looking for. If the user didn't specify what to
search for, run each profile in turn.

#### Profile 1: Test Failures
For standard test output failures:
```bash
rg --line-number --ignore-case --json -C 5 -- "^[-]{3} FAIL:|\\s+FAIL\$|\\s+FAIL\\t|\\[FAILED\\]|panic: test timed out" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/test_failures_search.jsonl"
```

#### Profile 2: Connection/Network Errors
For network-related issues:
```bash
rg --line-number --ignore-case --json -C 5 "ECONNREFUSED|connection refused|dial.*failed|cannot connect|connection reset" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/connection_errors_search.jsonl"
```

#### Profile 3: Startup/Initialization Errors
For service startup problems:
```bash
rg --line-number --ignore-case --json -C 5 "error starting|failed to start|initialization failed|startup failed|cannot initialize" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/startup_errors_search.jsonl"
```

#### Profile 4: Docker/Container Issues
For container-related problems:
```bash
rg --line-number --ignore-case --json -C 5 "container.*failed|docker.*error|OCI runtime|container.*exit.*[1-9]" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/container_errors_search.jsonl"
```

#### Profile 5: Resource/Timeout Issues
For resource constraints and timeouts:
```bash
rg --line-number --ignore-case --json -C 5 "out of memory|OOM|deadline exceeded|context canceled|timeout waiting" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/resource_errors_search.jsonl"
```

#### Profile 6: Panic/Crash Detection
For application crashes:
```bash
rg --line-number --ignore-case --json -C 5 "panic:|fatal error:|segmentation fault|SIGSEGV|goroutine.*panic" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/panic_errors_search.jsonl"
```

#### Fallback: General Errors
Only use if specific searches yield no results:
```bash
rg --line-number --ignore-case --json -C 5 "ERROR|FAIL|CRITICAL" "<original_log_directory>/analysis/shards/" > "<original_log_directory>/analysis/search_results/general_errors_search.jsonl"
```
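When no profile was requested, iterating over all of them can be sketched as below. `pattern_for` and `run_profiles` are illustrative helpers, and the patterns here are abbreviated — use the full commands above in practice:

```shell
# Map an abbreviated profile name to its (shortened) pattern.
pattern_for() {
  case "$1" in
    test_failures)     printf '%s\n' 'panic: test timed out|\[FAILED\]' ;;
    connection_errors) printf '%s\n' 'ECONNREFUSED|connection refused|connection reset' ;;
    panic_errors)      printf '%s\n' 'panic:|fatal error:|SIGSEGV' ;;
  esac
}

# Run each profile, writing one result file per profile.
run_profiles() {
  log_dir="$1"
  for n in test_failures connection_errors panic_errors; do
    # rg exits 1 when a profile finds nothing; that is not an error here
    rg --line-number --ignore-case --json -C 5 -- "$(pattern_for "$n")" \
      "$log_dir/analysis/shards/" \
      > "$log_dir/analysis/search_results/${n}_search.jsonl" || true
  done
}
```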

### Search Result Management

After running each search profile, split the results into manageable shards:

```bash
# Split search results into 1800-line shards
split -l 1800 -d -a 3 "<original_log_directory>/analysis/search_results/test_failures_search.jsonl" \
"<original_log_directory>/analysis/search_results/test_failures_shard_"
```

Repeat this for each search profile that generates results.
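Generalizing the command above, one loop can shard whichever result files are non-empty. This is a sketch; `shard_results` is an illustrative name:

```shell
# Shard every non-empty *_search.jsonl under analysis/search_results/.
shard_results() {
  for f in "$1"/analysis/search_results/*_search.jsonl; do
    [ -s "$f" ] || continue   # skip empty or missing result files
    split -l 1800 -d -a 3 "$f" "${f%_search.jsonl}_shard_"
  done
}
```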

**Ripgrep JSON Output Structure:**
The ripgrep command outputs JSON lines where each entry has a `type` field:
- `"type":"match"` - Contains the actual match with file path, line number, and matched text
- `"type":"context"` - Contains surrounding context lines with their line numbers
- `"type":"begin"` and `"type":"end"` - File boundaries and summary statistics

All match and context information is preserved in the JSON output with precise line numbers and file paths.
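A quick tally of match records per result file — useful for the report's summary counts — needs only grep. `count_matches` is an illustrative helper:

```shell
# Count "type":"match" records in a ripgrep JSON result file.
count_matches() {
  # grep -c prints 0 (and exits non-zero) when there are no matches
  grep -c '"type":"match"' "$1" || true
}

# Example:
#   count_matches "<original_log_directory>/analysis/search_results/test_failures_search.jsonl"
```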
Use the search result files in `<original_log_directory>/analysis/search_results/` directly for analysis in Phase 3.

## Phase 3: Generate Human Readable Log Preprocessing Report

This phase produces a structured summary for human consumption. Store the report as a **Markdown file** at
- Lines that would suffer from being split (e.g., URLs, code snippets, file paths) may exceed this limit
- Apply best-effort line wrapping for readability while preserving technical accuracy

### Report Type: Test Output

If the logs represent output from one or more tests, then the report will focus on describing tests that included failures.
The search results from Phase 2 should be treated as a *starting
point* for finding failed tests.

The basic format of the `Preprocessing Report` for logs representing tests is as follows:

> # Test Output Preprocessing Report
>
> ## Search Results Summary
> - Log Type Detected: <test_output|container_logs|system_logs>
> - Total Matches Found:
> - Test Failures: X matches
> - Connection Errors: Y matches
> - [other profiles...]
>
> ## Test Failures
>
> <list of failed tests> // see below for details of how test failures should be structured
>
> ## Failure Clusters
>
> <list of classes of failures> // see below for details of how failure classes should be structured

For each match entry (`"type":"match"`) in the ripgrep JSON output, perform the following steps:

e.g. "Root component invalid array access", or "runtime type panic in ServerProcess"
3. Record the test failure in the report:

> ### CI Action: Unit Tests <-- this is the group the test belongs to.
> <-- if the test group has already been added to the report, add the test failure entry under the existing heading
>
> 1. `TestParallelProcessing` <-- this is the name of the test
> - failure location: `unit_tests_shard_003` line 62 <-- record where the error can be found in the shard files
> - failure class: `consistency assertion failed in MainLoop` <-- determined failure class
> - relevant log lines: <-- try to show a brief selection of log lines that make it easy to understand what happened
> ```
> ...
> ```

Note that a given test should not have multiple entries. If multiple match entries in the ripgrep JSON output correspond
to a single test, try to determine what the "actual" cause of the failure was. If unsure, include all potentially
Only record the specific test failure (`TestSpecificFunction`), not the suite summary.

Example failure clusters:

> ## Failure Clusters
>
> 1. Nullptr Access
> a. `CI Action: Unit Tests::TestNewImpl`
> 2. Invalid Configuration
> a. `CI Action: Unit Tests::TestProcessing`
> b. `CI Action: E2E Tests::TestEndToEndInMemory`

### Report Type: Arbitrary Log Output

If the logs are not test output, the report should instead focus on the list
of failure clusters. To do this, follow the same procedure defined above.
## Context Compaction

Since you will be dealing with large quantities of data, it is likely that you will need to compact context despite
best efforts to limit what's being loaded.

### Strategies for managing large result sets:

1. **Process shards sequentially**: Load and analyze one shard at a time, maintaining running totals/summaries
2. **Prioritize unique failures**: Focus on distinct error patterns rather than repetitive instances
3. **Discard processed content**: After extracting relevant information from a shard, clear it from context

Discard context related to literal log contents first: retain in context information related to what specific
tests have failed, and what classes of failure are being observed.