
Commit e5af88f ("add deepseek r1", 1 parent: 345c179)

File tree: 7,159 files changed, +5,277 / -187,988 lines. (Large commit; some content is hidden by default.)

CLAUDE.md

Lines changed: 7 additions & 0 deletions

@@ -37,4 +37,11 @@ The summarize_benchmark.py script:
 - Creates visualizations comparing model performance with cost information
 - Saves reports in timestamped directories under benchmark-result/
 
+## README Updates
+When new benchmark results are available:
+- Update README.md with latest benchmark results from the most recent report
+- Change the "Last updated" date to match the report timestamp
+- Update the image path to point to the latest benchmark_comparison.png
+- Replace the benchmark table with results from the latest summary_table.md
+
 Always run any new or modified Python scripts with required packages installed before committing changes. For benchmark scripts, always use `nix develop` to ensure a consistent environment.
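The README Updates checklist above describes a manual process. A minimal sketch of how it could be automated, assuming reports live under `benchmark-result/report-<timestamp>/` and each contains a `summary_table.md` (the helper names `latest_report` and `update_readme` are illustrative, not part of this commit):

```python
import re
from pathlib import Path

def latest_report(result_dir: Path) -> Path:
    # Report directories are named report-YYYY-MM-DD-HH-MM-SS, so
    # lexicographic order matches chronological order.
    reports = sorted(result_dir.glob("report-*"))
    return reports[-1]

def update_readme(readme: Path, report: Path) -> str:
    text = readme.read_text()
    stamp = report.name.removeprefix("report-")[:10]  # YYYY-MM-DD
    # 1. Bump the "Last updated" date to the report timestamp.
    text = re.sub(r"_Last updated: \d{4}-\d{2}-\d{2}_",
                  f"_Last updated: {stamp}_", text)
    # 2. Point the image at the latest benchmark_comparison.png.
    text = re.sub(r"/benchmark-result/report-[0-9-]+/benchmark_comparison\.png",
                  f"/benchmark-result/{report.name}/benchmark_comparison.png",
                  text)
    # 3. Replace the benchmark table (header row plus all pipe-prefixed
    #    rows after it) with the latest summary_table.md.
    table = (report / "summary_table.md").read_text().strip()
    text = re.sub(r"\| Model \|.*?\n(?:\|.*\n)+", table + "\n", text)
    return text
```

The caller would then write the returned string back to README.md and commit.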

README.md

Lines changed: 3 additions & 7 deletions

@@ -6,29 +6,25 @@ It is a modified version of the [Aider benchmark harness](https://github.com/Aid
 
 The benchmark is based on [Exercism's Haskell exercises](https://exercism.org/tracks/haskell) ([Github](https://github.com/exercism/haskell)). This benchmark evaluates how effectively a coding assistant and LLMs can translate a natural language coding request into executable code saved into files that pass unit tests. It provides an end-to-end evaluation of not just the LLM's coding ability, but also its capacity to edit existing code and format those code edits so that aider can save the edits to the local source files.
 
-_Last updated: 2025-05-22_
+_Last updated: 2025-05-28_
 
-![Haskell LLM Benchmark](/benchmark-result/report-2025-05-22-13-09-10/benchmark_comparison.png)
+![Haskell LLM Benchmark](/benchmark-result/report-2025-05-28-23-30-12/benchmark_comparison.png)
 
 | Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
 | o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
 | o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
 | claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
+| deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
 | gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
 | o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
 | claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
-| claude-3-7-sonnet-20250219 (thinking) | 112 | 77.7 | 67.9 | 87 | 76 | 99.1 | 2 | 79.5 | 12.55 | 0.1120 |
 | gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
-| gemini-2.5-pro-preview-03-25 | 112 | 75.9 | 68.8 | 85 | 77 | 96.4 | 6 | 44.2 | 0.00 | 0.0000 |
 | o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
 | o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
-| claude-3-7-sonnet-20250219 | 112 | 66.1 | 55.4 | 74 | 62 | 99.1 | 1 | 15.9 | 3.80 | 0.0340 |
 | gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
 | gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
-| deepseek-chat-v3-0324 | 112 | 56.2 | 42.0 | 63 | 47 | 100.0 | 0 | 59.2 | 0.41 | 0.0037 |
-| llama-4-maverick | 112 | 46.4 | 32.1 | 52 | 36 | 91.1 | 10 | 16.0 | 0.00 | 0.0000 |
 
 
 
606 KB binary file (preview not shown)
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
+dirname,dir_path,completed_tests,total_tests,model,edit_format,commit_hash,pass_rate_1,pass_rate_2,percent_cases_well_formed,error_outputs,num_malformed_responses,num_with_malformed_responses,seconds_per_case,total_cost,cost_per_case,passes_total,passes_1st_try,model_mode,model_display
+2025-04-16-22-57-05--o3-high-full-run-final,tmp.benchmarks/2025-04-16-22-57-05--o3-high-full-run-final,112,112,o3-high,whole,4331db2,73.2,88.4,100.0,0,0,0,51.7,19.05,0.1701,99,82,,o3-high
+2025-05-06-22-57-05--gemini-2.5-pro-may-6-full-run-final,tmp.benchmarks/2025-05-06-22-57-05--gemini-2.5-pro-may-6-full-run-final,112,113,gemini-2.5-pro-preview,whole,fc5baaa-dirty,73.2,80.4,99.1,2,1,1,109.4,0.0,0.0,90,82,,gemini-2.5-pro-preview
+2025-05-21-16-53-30--flash-2.5-thinking-full-run-final,tmp.benchmarks/2025-05-21-16-53-30--flash-2.5-thinking-full-run-final,112,112,gemini-2.5-flash-preview-05-20:thinking,whole,3cef607-dirty,58.0,75.9,98.2,3,3,2,29.5,0.0,0.0,85,65,,gemini-2.5-flash-preview-05-20:thinking
+2025-04-16-22-43-43--o4-mini-full-run-final,tmp.benchmarks/2025-04-16-22-43-43--o4-mini-full-run-final,112,112,o4-mini,whole,4331db2,67.9,74.1,99.1,1,1,1,29.4,1.81,0.0162,83,76,,o4-mini
+2025-05-22-13-01-26--claude-sonnet-4-full-run-final,tmp.benchmarks/2025-05-22-13-01-26--claude-sonnet-4-full-run-final,112,112,claude-sonnet-4-20250514,whole,e59733b,61.6,77.7,99.1,4,4,1,14.8,0.0,0.0,87,69,regular,claude-sonnet-4-20250514
+2025-05-22-12-52-18--claude-opus-4-full-run-final,tmp.benchmarks/2025-05-22-12-52-18--claude-opus-4-full-run-final,112,112,claude-opus-4-20250514,whole,e59733b,65.2,81.2,100.0,0,0,0,22.5,0.0,0.0,91,73,regular,claude-opus-4-20250514
+2025-04-14-15-42-48--gpt-4.1-full-run-final,tmp.benchmarks/2025-04-14-15-42-48--gpt-4.1-full-run-final,112,112,gpt-4.1-2025-04-14,whole,8e5f06e,57.1,65.2,100.0,0,0,0,7.6,1.14,0.0102,73,64,,gpt-4.1-2025-04-14
+2025-04-14-15-46-37--gpt-4.1-mini-full-run-final,tmp.benchmarks/2025-04-14-15-46-37--gpt-4.1-mini-full-run-final,112,112,gpt-4.1-mini-2025-04-14,whole,8e5f06e,51.8,63.4,100.0,0,0,0,5.3,0.24,0.0021,71,58,,gpt-4.1-mini-2025-04-14
+2025-04-02-22-59-29--o1-full-run-final,tmp.benchmarks/2025-04-02-22-59-29--o1-full-run-final,112,113,o1,whole,f9b60d8-dirty,67.9,79.5,99.1,1,1,1,49.3,29.22,0.2609,89,76,,o1
+2025-04-03-13-21-35--o1-pro-full-run-final,tmp.benchmarks/2025-04-03-13-21-35--o1-pro-full-run-final,112,113,o1-pro,whole,f9b60d8-dirty,72.3,82.1,99.1,1,1,1,301.6,275.04,2.4558,92,81,,o1-pro
+2025-04-02-23-07-00--o3-mini-full-run-final,tmp.benchmarks/2025-04-02-23-07-00--o3-mini-full-run-final,112,112,o3-mini,whole,f9b60d8-dirty,63.4,75.0,100.0,0,0,0,37.5,2.13,0.019,84,71,,o3-mini
+2025-04-16-22-30-22--o3-full-run-final,tmp.benchmarks/2025-04-16-22-30-22--o3-full-run-final,112,112,o3,whole,4331db2,73.2,84.8,100.0,0,0,0,27.2,11.81,0.1055,95,82,,o3
+2025-05-28-22-55-14--deepseek-r1-0528-full-run-final,tmp.benchmarks/2025-05-28-22-55-14--deepseek-r1-0528-full-run-final,112,112,deepseek-r1-0528,whole,345c179,63.4,81.2,99.1,3,1,1,242.8,0.0,0.0,91,71,,deepseek-r1-0528
Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+| Model | Tests | Pass % | Pass 1st Try % | Tests Passed | Passes 1st Try | Well Formed % | Errors | Sec/Test | Total Cost ($) | Cost/Test ($) |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| o3-high | 112 | 88.4 | 73.2 | 99 | 82 | 100.0 | 0 | 51.7 | 19.05 | 0.1701 |
+| o3 | 112 | 84.8 | 73.2 | 95 | 82 | 100.0 | 0 | 27.2 | 11.81 | 0.1055 |
+| o1-pro | 112 | 82.1 | 72.3 | 92 | 81 | 99.1 | 1 | 301.6 | 275.04 | 2.4558 |
+| claude-opus-4-20250514 | 112 | 81.2 | 65.2 | 91 | 73 | 100.0 | 0 | 22.5 | 0.00 | 0.0000 |
+| deepseek-r1-0528 | 112 | 81.2 | 63.4 | 91 | 71 | 99.1 | 3 | 242.8 | 0.00 | 0.0000 |
+| gemini-2.5-pro-preview | 112 | 80.4 | 73.2 | 90 | 82 | 99.1 | 2 | 109.4 | 0.00 | 0.0000 |
+| o1 | 112 | 79.5 | 67.9 | 89 | 76 | 99.1 | 1 | 49.3 | 29.22 | 0.2609 |
+| claude-sonnet-4-20250514 | 112 | 77.7 | 61.6 | 87 | 69 | 99.1 | 4 | 14.8 | 0.00 | 0.0000 |
+| gemini-2.5-flash-preview-05-20:thinking | 112 | 75.9 | 58.0 | 85 | 65 | 98.2 | 3 | 29.5 | 0.00 | 0.0000 |
+| o3-mini | 112 | 75.0 | 63.4 | 84 | 71 | 100.0 | 0 | 37.5 | 2.13 | 0.0190 |
+| o4-mini | 112 | 74.1 | 67.9 | 83 | 76 | 99.1 | 1 | 29.4 | 1.81 | 0.0162 |
+| gpt-4.1-2025-04-14 | 112 | 65.2 | 57.1 | 73 | 64 | 100.0 | 0 | 7.6 | 1.14 | 0.0102 |
+| gpt-4.1-mini-2025-04-14 | 112 | 63.4 | 51.8 | 71 | 58 | 100.0 | 0 | 5.3 | 0.24 | 0.0021 |
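The summary table rows mirror the CSV columns above; deriving the markdown table from the CSV could be sketched as follows. This is a hedged sketch, not the repo's actual script: the function name `summary_table` is hypothetical, and the column mapping is inferred from the data (Pass % appears to be `pass_rate_2`, Pass 1st Try % appears to be `pass_rate_1`).

```python
import csv
import io

def summary_table(csv_text: str) -> str:
    """Render benchmark CSV rows as the markdown summary table,
    sorted by final pass rate (descending)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["pass_rate_2"]), reverse=True)
    header = ("| Model | Tests | Pass % | Pass 1st Try % | Tests Passed "
              "| Passes 1st Try | Well Formed % | Errors | Sec/Test "
              "| Total Cost ($) | Cost/Test ($) |")
    lines = [header, "| --- |" + " --- |" * 10]
    for r in rows:
        # Costs are re-formatted to fixed precision; rate columns are
        # carried through as-is from the CSV.
        lines.append(
            "| {model_display} | {completed_tests} | {pass_rate_2} "
            "| {pass_rate_1} | {passes_total} | {passes_1st_try} "
            "| {percent_cases_well_formed} | {error_outputs} "
            "| {seconds_per_case} "
            "| {total_cost:.2f} | {cost_per_case:.4f} |".format(
                total_cost=float(r["total_cost"]),
                cost_per_case=float(r["cost_per_case"]),
                **{k: r[k] for k in (
                    "model_display", "completed_tests", "pass_rate_2",
                    "pass_rate_1", "passes_total", "passes_1st_try",
                    "percent_cases_well_formed", "error_outputs",
                    "seconds_per_case")}))
    return "\n".join(lines)
```

Sorting on `pass_rate_2` reproduces the ordering seen in the committed summary_table.md.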

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/acronym/src/Acronym.hs

Lines changed: 0 additions & 24 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/acronym/stack.yaml.lock

Lines changed: 0 additions & 12 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/affine-cipher/src/Affine.hs

Lines changed: 0 additions & 65 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/affine-cipher/stack.yaml.lock

Lines changed: 0 additions & 12 deletions
This file was deleted.

tmp.benchmarks/2025-04-02-23-21-03--claude-3-7-full-run-final/haskell/exercises/practice/all-your-base/src/Base.hs

Lines changed: 0 additions & 28 deletions
This file was deleted.
