
Commit e5af88f ("add deepseek r1", 1 parent: 345c179)

File tree: 7,159 files changed, +5,277 / -187,988 lines. (Large commit; some content is hidden by default.)

CLAUDE.md

Lines changed: 7 additions & 0 deletions

@@ -37,4 +37,11 @@ The summarize_benchmark.py script:
 - Creates visualizations comparing model performance with cost information
 - Saves reports in timestamped directories under benchmark-result/
 
+## README Updates
+When new benchmark results are available:
+- Update README.md with latest benchmark results from the most recent report
+- Change the "Last updated" date to match the report timestamp
+- Update the image path to point to the latest benchmark_comparison.png
+- Replace the benchmark table with results from the latest summary_table.md
+
 Always run any new or modified Python scripts with required packages installed before committing changes. For benchmark scripts, always use `nix develop` to ensure a consistent environment.
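The README Updates checklist above describes a manual process. A minimal sketch of how it could be automated, assuming reports live under `benchmark-result/report-<timestamp>/` and each contains a `summary_table.md` (the helper names `latest_report` and `update_readme` are illustrative, not part of this commit):

```python
import re
from pathlib import Path

def latest_report(result_dir: Path) -> Path:
    # Report directories are named report-YYYY-MM-DD-HH-MM-SS, so
    # lexicographic order matches chronological order.
    reports = sorted(result_dir.glob("report-*"))
    return reports[-1]

def update_readme(readme: Path, report: Path) -> str:
    text = readme.read_text()
    stamp = report.name.removeprefix("report-")[:10]  # YYYY-MM-DD
    # 1. Bump the "Last updated" date to the report timestamp.
    text = re.sub(r"_Last updated: \d{4}-\d{2}-\d{2}_",
                  f"_Last updated: {stamp}_", text)
    # 2. Point the image at the latest benchmark_comparison.png.
    text = re.sub(r"/benchmark-result/report-[0-9-]+/benchmark_comparison\.png",
                  f"/benchmark-result/{report.name}/benchmark_comparison.png",
                  text)
    # 3. Replace the benchmark table (header row plus all pipe-prefixed
    #    rows after it) with the latest summary_table.md.
    table = (report / "summary_table.md").read_text().strip()
    text = re.sub(r"\| Model \|.*?\n(?:\|.*\n)+", table + "\n", text)
    return text
```

The caller would then write the returned string back to README.md and commit.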

README.md

Lines changed: 3 additions & 7 deletions

@@ -6,29 +6,25 @@ It is a modified version of the [Aider benchmark harness](https://github.com/Aid
 
 The benchmark is based on [Exercism's Haskell exercises](https://exercism.org/tracks/haskell) ([Github](https://github.com/exercism/haskell)). This benchmark evaluates how effectively a coding assistant and LLMs can translate a natural language coding request into executable code saved into files that pass unit tests. It provides an end-to-end evaluation of not just the LLM's coding ability, but also its capacity to edit existing code and format those code edits so that aider can save the edits to the local source files.
 
-_Last updated: 2025-05-22_
+_Last updated: 2025-05-28_
 
-![Haskell LLM Benchmark](/benchmark-result/report-2025-05-22-13-09-10/benchmark_comparison.png)
+![Haskell LLM Benchmark](/benchmark-result/report-2025-05-28-23-30-12/benchmark_comparison.png)
 
 | Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
 | o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
 | o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
 | claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
+| deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
 | gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
 | o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
 | claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
-| claude-3-7-sonnet-20250219 (thinking) | 112 | 77.7 | 67.9 | 87 | 76 | 99.1 | 2 | 79.5 | 12.55 | 0.1120 |
 | gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
-| gemini-2.5-pro-preview-03-25 | 112 | 75.9 | 68.8 | 85 | 77 | 96.4 | 6 | 44.2 | 0.00 | 0.0000 |
 | o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
 | o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
-| claude-3-7-sonnet-20250219 | 112 | 66.1 | 55.4 | 74 | 62 | 99.1 | 1 | 15.9 | 3.80 | 0.0340 |
 | gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
 | gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
-| deepseek-chat-v3-0324 | 112 | 56.2 | 42.0 | 63 | 47 | 100.0 | 0 | 59.2 | 0.41 | 0.0037 |
-| llama-4-maverick | 112 | 46.4 | 32.1 | 52 | 36 | 91.1 | 10 | 16.0 | 0.00 | 0.0000 |
 
 
 
606 KB binary file (preview not shown)
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+dirname,dir_path,completed_tests,total_tests,model,edit_format,commit_hash,pass_rate_1,pass_rate_2,percent_cases_well_formed,error_outputs,num_malformed_responses,num_with_malformed_responses,seconds_per_case,total_cost,cost_per_case,passes_total,passes_1st_try,model_mode,model_display
+2025-04-16-22-57-05--o3-high-full-run-final,tmp.benchmarks/2025-04-16-22-57-05--o3-high-full-run-final,112,112,o3-high,whole,4331db2,73.2,88.4,100.0,0,0,0,51.7,19.05,0.1701,99,82,,o3-high
+2025-05-06-22-57-05--gemini-2.5-pro-may-6-full-run-final,tmp.benchmarks/2025-05-06-22-57-05--gemini-2.5-pro-may-6-full-run-final,112,113,gemini-2.5-pro-preview,whole,fc5baaa-dirty,73.2,80.4,99.1,2,1,1,109.4,0.0,0.0,90,82,,gemini-2.5-pro-preview
+2025-05-21-16-53-30--flash-2.5-thinking-full-run-final,tmp.benchmarks/2025-05-21-16-53-30--flash-2.5-thinking-full-run-final,112,112,gemini-2.5-flash-preview-05-20:thinking,whole,3cef607-dirty,58.0,75.9,98.2,3,3,2,29.5,0.0,0.0,85,65,,gemini-2.5-flash-preview-05-20:thinking
+2025-04-16-22-43-43--o4-mini-full-run-final,tmp.benchmarks/2025-04-16-22-43-43--o4-mini-full-run-final,112,112,o4-mini,whole,4331db2,67.9,74.1,99.1,1,1,1,29.4,1.81,0.0162,83,76,,o4-mini
+2025-05-22-13-01-26--claude-sonnet-4-full-run-final,tmp.benchmarks/2025-05-22-13-01-26--claude-sonnet-4-full-run-final,112,112,claude-sonnet-4-20250514,whole,e59733b,61.6,77.7,99.1,4,4,1,14.8,0.0,0.0,87,69,regular,claude-sonnet-4-20250514
+2025-05-22-12-52-18--claude-opus-4-full-run-final,tmp.benchmarks/2025-05-22-12-52-18--claude-opus-4-full-run-final,112,112,claude-opus-4-20250514,whole,e59733b,65.2,81.2,100.0,0,0,0,22.5,0.0,0.0,91,73,regular,claude-opus-4-20250514
+2025-04-14-15-42-48--gpt-4.1-full-run-final,tmp.benchmarks/2025-04-14-15-42-48--gpt-4.1-full-run-final,112,112,gpt-4.1-2025-04-14,whole,8e5f06e,57.1,65.2,100.0,0,0,0,7.6,1.14,0.0102,73,64,,gpt-4.1-2025-04-14
+2025-04-14-15-46-37--gpt-4.1-mini-full-run-final,tmp.benchmarks/2025-04-14-15-46-37--gpt-4.1-mini-full-run-final,112,112,gpt-4.1-mini-2025-04-14,whole,8e5f06e,51.8,63.4,100.0,0,0,0,5.3,0.24,0.0021,71,58,,gpt-4.1-mini-2025-04-14
+2025-04-02-22-59-29--o1-full-run-final,tmp.benchmarks/2025-04-02-22-59-29--o1-full-run-final,112,113,o1,whole,f9b60d8-dirty,67.9,79.5,99.1,1,1,1,49.3,29.22,0.2609,89,76,,o1
+2025-04-03-13-21-35--o1-pro-full-run-final,tmp.benchmarks/2025-04-03-13-21-35--o1-pro-full-run-final,112,113,o1-pro,whole,f9b60d8-dirty,72.3,82.1,99.1,1,1,1,301.6,275.04,2.4558,92,81,,o1-pro
+2025-04-02-23-07-00--o3-mini-full-run-final,tmp.benchmarks/2025-04-02-23-07-00--o3-mini-full-run-final,112,112,o3-mini,whole,f9b60d8-dirty,63.4,75.0,100.0,0,0,0,37.5,2.13,0.019,84,71,,o3-mini
+2025-04-16-22-30-22--o3-full-run-final,tmp.benchmarks/2025-04-16-22-30-22--o3-full-run-final,112,112,o3,whole,4331db2,73.2,84.8,100.0,0,0,0,27.2,11.81,0.1055,95,82,,o3
+2025-05-28-22-55-14--deepseek-r1-0528-full-run-final,tmp.benchmarks/2025-05-28-22-55-14--deepseek-r1-0528-full-run-final,112,112,deepseek-r1-0528,whole,345c179,63.4,81.2,99.1,3,1,1,242.8,0.0,0.0,91,71,,deepseek-r1-0528
Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+| Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
+| o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
+| o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
+| claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
+| deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
+| gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
+| o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
+| claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
+| gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
+| o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
+| o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
+| gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
+| gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
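The summary table rows mirror the CSV columns above; deriving the markdown table from the CSV could be sketched as follows. This is a hedged sketch, not the repo's actual script: the function name `summary_table` is hypothetical, and the column mapping is inferred from the data (Pass % appears to be `pass_rate_2`, Pass 1st Try % appears to be `pass_rate_1`).

```python
import csv
import io

def summary_table(csv_text: str) -> str:
    """Render benchmark CSV rows as the markdown summary table,
    sorted by final pass rate (descending)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["pass_rate_2"]), reverse=True)
    header = ("| Model | Tests | Pass % | Pass 1st Try % | Tests Passed "
              "| Passes 1st Try | Well Formed % | Errors | Sec/Test "
              "| Total Cost ($) | Cost/Test ($) |")
    lines = [header, "| --- |" + " --- |" * 10]
    for r in rows:
        # Costs are re-formatted to fixed precision; rate columns are
        # carried through as-is from the CSV.
        lines.append(
            "| {model_display} | {completed_tests} | {pass_rate_2} "
            "| {pass_rate_1} | {passes_total} | {passes_1st_try} "
            "| {percent_cases_well_formed} | {error_outputs} "
            "| {seconds_per_case} "
            "| {total_cost:.2f} | {cost_per_case:.4f} |".format(
                total_cost=float(r["total_cost"]),
                cost_per_case=float(r["cost_per_case"]),
                **{k: r[k] for k in (
                    "model_display", "completed_tests", "pass_rate_2",
                    "pass_rate_1", "passes_total", "passes_1st_try",
                    "percent_cases_well_formed", "error_outputs",
                    "seconds_per_case")}))
    return "\n".join(lines)
```

Sorting on `pass_rate_2` reproduces the ordering seen in the committed summary_table.md.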

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/acronym/src/Acronym.hs

Lines changed: 0 additions & 24 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/acronym/stack.yaml.lock

Lines changed: 0 additions & 12 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/affine-cipher/src/Affine.hs

Lines changed: 0 additions & 65 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/affine-cipher/stack.yaml.lock

Lines changed: 0 additions & 12 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/all-your-base/src/Base.hs

Lines changed: 0 additions & 28 deletions
This file was deleted.
