
Commit b96454b

Merge branch 'main' into jakep/new_trainer

2 parents 58e4fad + 633b03d · commit b96454b

21 files changed: +518 −623 lines

.github/workflows/main.yml
Lines changed: 20 additions & 19 deletions

@@ -271,25 +271,26 @@ jobs:
       outputs: type=registry
       no-cache: true

-    - name: Setup Beaker CLI
-      uses: allenai/setup-beaker@v2
-      with:
-        token: ${{ secrets.BEAKER_TOKEN }}
-        version: latest
-
-    - name: Push to Beaker
-      env:
-        BEAKER_TOKEN: ${{ secrets.BEAKER_TOKEN }}
-      run: |
-        # Get the version without 'v' prefix
-        VERSION=${GITHUB_REF#refs/tags/v}
-
-        # Push the Docker image to Beaker
-        beaker image create \
-          --name "olmocr-inference-$VERSION" \
-          --workspace ai2/olmocr \
-          "docker://${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:$VERSION"
-
+    # jakep: push to beaker can't work because of limitted disk space on these runners
+    # jakep: (you can try by setting load: true above, but you'll need a larger runner)
+    # - name: Setup Beaker CLI
+    #   uses: allenai/setup-beaker@v2
+    #   with:
+    #     token: ${{ secrets.BEAKER_TOKEN }}
+    #     version: latest
+    # - name: Debug Docker images
+    #   run: docker images
+
+    # - name: Push to Beaker
+    #   env:
+    #     BEAKER_TOKEN: ${{ secrets.BEAKER_TOKEN }}
+    #   run: |
+    #     VERSION=${{ steps.meta.outputs.version }}
+    #     beaker image create \
+    #       --name "olmocr-inference-$VERSION" \
+    #       --workspace ai2/olmocr \
+    #       alleninstituteforai/olmocr:$VERSION
+
     - name: Clean up after build
       if: always()
       run: |

.gitignore
Lines changed: 1 addition & 0 deletions

@@ -21,6 +21,7 @@ olmOCR-bench/*
 table_data*/
 /synth*/
 dolma_samples/*
+old_train/
 /*.html
 scoreelo.csv
 debug.log

CHANGELOG.md
Lines changed: 10 additions & 0 deletions

@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## Unreleased

+## [v0.1.76](https://github.com/allenai/olmocr/releases/tag/v0.1.76) - 2025-06-23
+
+## [v0.1.75](https://github.com/allenai/olmocr/releases/tag/v0.1.75) - 2025-06-17
+
+## [v0.1.74](https://github.com/allenai/olmocr/releases/tag/v0.1.74) - 2025-06-17
+
+## [v0.1.73](https://github.com/allenai/olmocr/releases/tag/v0.1.73) - 2025-06-17
+
+## [v0.1.72](https://github.com/allenai/olmocr/releases/tag/v0.1.72) - 2025-06-17
+
 ## [v0.1.71](https://github.com/allenai/olmocr/releases/tag/v0.1.71) - 2025-05-30

 ## [v0.1.70](https://github.com/allenai/olmocr/releases/tag/v0.1.70) - 2025-05-23

Dockerfile
Lines changed: 7 additions & 7 deletions

@@ -47,19 +47,19 @@ RUN apt-get update -y && apt-get install -y --no-install-recommends \
     unzip

 ENV PYTHONUNBUFFERED=1
-WORKDIR /root
-COPY pyproject.toml pyproject.toml
-COPY olmocr/version.py olmocr/version.py
+
+# keep the build context clean
+WORKDIR /build
+COPY . /build
+

 # Needed to resolve setuptools dependencies
 ENV UV_INDEX_STRATEGY="unsafe-best-match"
-RUN uv pip install --system --no-cache -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
+RUN uv pip install --system --no-cache ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
 RUN uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl
 RUN uv pip install --system --no-cache ".[bench]"
+
 RUN playwright install-deps
 RUN playwright install chromium

-COPY olmocr olmocr
-COPY scripts scripts
-
 RUN python3 -m olmocr.pipeline --help

README.md
Lines changed: 35 additions & 30 deletions

@@ -35,6 +35,7 @@ Features:
 - (Based on a 7B parameter VLM, so it requires a GPU)

 ### News
+- June 17, 2025 - v0.1.75 - Switch from sglang to vllm based inference pipeline, updated docker image to CUDA 12.8.
 - May 23, 2025 - v0.1.70 - Official docker support and images are now available! [See Docker usage](#using-docker)
 - May 19, 2025 - v0.1.68 - [olmOCR-Bench](https://github.com/allenai/olmocr/tree/main/olmocr/bench) launch, scoring 77.4. Launch includes 2 point performance boost in olmOCR pipeline due to bug fixes with prompts.
 - Mar 17, 2025 - v0.1.60 - Performance improvements due to better temperature selection in sampling.
@@ -49,29 +50,29 @@ We also ship a comprehensive benchmark suite covering over 7,000 test cases acro
 <thead>
 <tr>
 <th align="left"><strong>Model</strong></th>
-<th align="center">AR</th>
-<th align="center">OSM</th>
-<th align="center">TA</th>
-<th align="center">OS</th>
-<th align="center">HF</th>
-<th align="center">MC</th>
-<th align="center">LTT</th>
+<th align="center">ArXiv</th>
+<th align="center">Old Scans Math</th>
+<th align="center">Tables</th>
+<th align="center">Old Scans</th>
+<th align="center">Headers and Footers</th>
+<th align="center">Multi column</th>
+<th align="center">Long tiny text</th>
 <th align="center">Base</th>
-<th align="center">Overall Score</th>
+<th align="center">Overall</th>
 </tr>
 </thead>
 <tbody>
 <tr>
-<td align="left">Marker v1.6.2</td>
-<td align="center">24.3</td>
-<td align="center">22.1</td>
-<td align="center">69.8</td>
-<td align="center">24.3</td>
-<td align="center">87.1</td>
-<td align="center">71.0</td>
-<td align="center">76.9</td>
-<td align="center"><strong>99.5</strong></td>
-<td align="center">59.4 ± 1.1</td>
+<td align="left">Marker v1.7.5 (base)</td>
+<td align="center">76.0</td>
+<td align="center">57.9</td>
+<td align="center">57.6</td>
+<td align="center">27.8</td>
+<td align="center">84.9</td>
+<td align="center">72.9</td>
+<td align="center">84.6</td>
+<td align="center">99.1</td>
+<td align="center">70.1 ± 1.1</td>
 </tr>
 <tr>
 <td align="left">MinerU v1.3.10</td>
@@ -94,24 +95,25 @@ We also ship a comprehensive benchmark suite covering over 7,000 test cases acro
 <td align="center">93.6</td>
 <td align="center">71.3</td>
 <td align="center">77.1</td>
-<td align="center">99.4</td>
+<td align="center"><strong>99.4</strong></td>
 <td align="center">72.0 ± 1.1</td>
 </tr>
 <tr>
-<td align="left">olmOCR v0.1.68 (pipeline.py)</td>
-<td align="center">75.6</td>
-<td align="center">75.1</td>
-<td align="center">70.2</td>
-<td align="center"><strong>44.5</strong></td>
-<td align="center">93.4</td>
-<td align="center"><strong>79.4</strong></td>
-<td align="center">81.7</td>
-<td align="center">99.0</td>
-<td align="center"><strong>77.4 ± 1.0</strong></td>
+<td align="left">olmOCR v0.1.75 (Anchored)</td>
+<td align="center">74.9</td>
+<td align="center">71.2</td>
+<td align="center">71.0</td>
+<td align="center">42.2</td>
+<td align="center">94.5</td>
+<td align="center"><strong>78.3</strong></td>
+<td align="center">73.3</td>
+<td align="center">98.3</td>
+<td align="center"><strong>75.5 ± 1.0</strong></td>
 </tr>
 </tbody>
 </table>

+
 ### Installation

 Requirements:
@@ -136,7 +138,10 @@ conda activate olmocr
 pip install olmocr[bench]

 # For actually converting the files with your own GPU
-pip install olmocr[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
+pip install olmocr.[gpu] --extra-index-url https://download.pytorch.org/whl/cu128
+
+# Recommended: Install flash infer for faster inference on GPU
+pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl
 ```

 ### Local Usage Example

olmocr/bench/README.md
Lines changed: 52 additions & 34 deletions

@@ -14,6 +14,9 @@ olmOCR-bench operates on single page PDFs directly. We make this choice because
 We have run the benchmark against some contemporary OCR pipelines, but it is really easy
 to run it against your own OCR tools. Your tool just needs to support Markdown or plain text output.

+<div align="center">
+  <img src="https://github.com/allenai/olmocr/blob/main/scripts/pareto/ocr_pareto.png?raw=true" width=800/>
+</div>

 ## Results

@@ -37,7 +40,7 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="left">GOT OCR</td>
 <td align="center">52.7</td>
 <td align="center">52.0</td>
-<td align="center">0.2</td>
+<td align="center">0.20</td>
 <td align="center">22.1</td>
 <td align="center">93.6</td>
 <td align="center">42.0</td>
@@ -46,16 +49,16 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center">48.3 ± 1.1</td>
 </tr>
 <tr>
-<td align="left">Marker v1.6.2</td>
-<td align="center">24.3</td>
-<td align="center">22.1</td>
-<td align="center">69.8</td>
-<td align="center">24.3</td>
-<td align="center">87.1</td>
-<td align="center">71.0</td>
-<td align="center">76.9</td>
-<td align="center"><strong>99.5</strong></td>
-<td align="center">59.4 ± 1.1</td>
+<td align="left">Marker v1.7.5 (base, force_ocr)</td>
+<td align="center">76.0</td>
+<td align="center">57.9</td>
+<td align="center">57.6</td>
+<td align="center">27.8</td>
+<td align="center">84.9</td>
+<td align="center">72.9</td>
+<td align="center">84.6</td>
+<td align="center">99.1</td>
+<td align="center">70.1 ± 1.1</td>
 </tr>
 <tr>
 <td align="left">MinerU v1.3.10</td>
@@ -78,9 +81,21 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center">93.6</td>
 <td align="center">71.3</td>
 <td align="center">77.1</td>
-<td align="center">99.4</td>
+<td align="center"><strong>99.4</strong></td>
 <td align="center">72.0 ± 1.1</td>
 </tr>
+<tr>
+<td align="left">Nanonets OCR</td>
+<td align="center">67.0</td>
+<td align="center">68.6</td>
+<td align="center"><strong>77.7</strong></td>
+<td align="center">39.5</td>
+<td align="center">40.7</td>
+<td align="center">69.9</td>
+<td align="center">53.4</td>
+<td align="center">99.3</td>
+<td align="center">64.5 ± 1.1</td>
+</tr>
 <tr>
 <td align="left">GPT-4o (No Anchor)</td>
 <td align="center">51.5</td>
@@ -154,33 +169,39 @@ to run it against your own OCR tools. Your tool just needs to support Markdown o
 <td align="center">65.5 ± 1.2</td>
 </tr>
 <tr>
-<td align="left">olmOCR v0.1.68 (No Anchor)</td>
-<td align="center">72.1</td>
-<td align="center">74.7</td>
+<td align="left">olmOCR v0.1.75 (No Anchor)</td>
 <td align="center">71.5</td>
-<td align="center">43.7</td>
-<td align="center">91.6</td>
-<td align="center">78.5</td>
-<td align="center">80.5</td>
-<td align="center">98.1</td>
-<td align="center">76.3 ± 1.1</td>
+<td align="center">71.4</td>
+<td align="center">71.4</td>
+<td align="center"><strong>42.8</strong></td>
+<td align="center">94.1</td>
+<td align="center">77.7</td>
+<td align="center">71.0</td>
+<td align="center">97.8</td>
+<td align="center">74.7 ± 1.1</td>
 </tr>
 <tr>
-<td align="left">olmOCR v0.1.68 (Anchored)</td>
-<td align="center">75.6</td>
-<td align="center">75.1</td>
-<td align="center">70.2</td>
-<td align="center"><strong>44.5</strong></td>
-<td align="center">93.4</td>
-<td align="center"><strong>79.4</strong></td>
-<td align="center">81.7</td>
-<td align="center">99.0</td>
-<td align="center"><strong>77.4 ± 1.0</strong></td>
+<td align="left">olmOCR v0.1.75 (Anchored)</td>
+<td align="center">74.9</td>
+<td align="center">71.2</td>
+<td align="center">71.0</td>
+<td align="center">42.2</td>
+<td align="center">94.5</td>
+<td align="center"><strong>78.3</strong></td>
+<td align="center">73.3</td>
+<td align="center">98.3</td>
+<td align="center"><strong>75.5 ± 1.0</strong></td>
 </tr>
 </tbody>
 </table>


+<sup><sub>There was a small drop in scores from olmOCR v0.1.68 (77.4), which is due to two factors. One, is that we have adjusted our benchmark code to not include
+any "fallback" mechanism when measuring benchmark scores (though it still exists when you run olmocr.pipeline). Second, there is a small drop in scores as we have updated
+from sglang 0.4.2 to vllm 0.9.1. In net, we think the upgrade to vllm is the right choice, given that sglang 0.4.6 had even lower scores by one point, and vllm comes with a
+small performance boost, and great support for quantization.
+</sub></sup>
+
 ## Sourcing Documents and Tests

 We define 7 distinct document types that we found olmOCR (or its earlier iterations) often struggled to process and defined custom acquisition strategies for each (described below). We removed documents that both contained PII and were not meant for public dissemination. We also decontaminate against documents that appear in olmOCR-Mix via URL level deduplication. To scale creation of test cases over these documents, we combined manual design and review with prompting GPT-4o.
@@ -288,6 +309,3 @@ We have an internal data annotation tool that can be used to review the question
 ```bash
 python -m olmocr.bench.review_app --port 5000 --debug ./olmOCR-bench/bench_data/multi_column.jsonl --force
 ```
-
-
-

olmocr/bench/convert.py
Lines changed: 1 addition & 0 deletions

@@ -223,6 +223,7 @@ async def process_with_semaphore(task):
     available_methods = {
         "olmocr_pipeline": ("olmocr.bench.runners.run_olmocr_pipeline", "run_olmocr_pipeline"),
         "gotocr": ("olmocr.bench.runners.run_gotocr", "run_gotocr"),
+        "nanonetsocr": ("olmocr.bench.runners.run_nanonetsocr", "run_nanonetsocr"),
         "marker": ("olmocr.bench.runners.run_marker", "run_marker"),
         "mineru": ("olmocr.bench.runners.run_mineru", "run_mineru"),
         "chatgpt": ("olmocr.bench.runners.run_chatgpt", "run_chatgpt"),

olmocr/bench/runners/run_marker.py
Lines changed: 15 additions & 2 deletions

@@ -1,6 +1,7 @@
 import os
 import tempfile

+from marker.config.parser import ConfigParser
 from marker.converters.pdf import PdfConverter
 from marker.models import create_model_dict
 from marker.output import text_from_rendered
@@ -15,10 +16,22 @@ def run_marker(pdf_path: str, page_num: int = 1) -> str:
     if _marker_converter is None:
         # Create a configuration dictionary with the necessary settings
         config = {
-            "texify_inline_spans": True,  # This enables conversion of inline math to LaTeX
+            "force_ocr": True,  # This enables conversion of inline math to LaTeX
+            "use_llm": False,  # We would prefer to run just plain marker for reporting bench results, not hybrid mode
+            "disable_tqdm": True,  # Disable tqdm for cleaner output
+            "recognition_batch_size": 256,
+            "layout_batch_size": 48,
+            "detection_batch_size": 48,
+            "equation_batch_size": 64,
+            "table_rec_batch_size": 48,
+            "ocr_error_batch_size": 64,
         }
+        config_parser = ConfigParser(config)

-        _marker_converter = PdfConverter(artifact_dict=create_model_dict(), config=config)
+        _marker_converter = PdfConverter(
+            artifact_dict=create_model_dict(),
+            config=config_parser.generate_config_dict(),
+        )

     # Extract the specific page from the PDF
     pdf_to_process = pdf_path
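The runner caches the converter in the module-level `_marker_converter`, so Marker's models are loaded once and reused for every page that gets scored. A minimal, hypothetical usage sketch follows; it assumes the marker dependency is installed and that `sample.pdf` is a local PDF (the file name is illustrative only).

```python
from olmocr.bench.runners.run_marker import run_marker

if __name__ == "__main__":
    # The first call is slow: it builds the shared PdfConverter and loads model weights.
    # Subsequent calls reuse the cached converter, so per-page cost drops substantially.
    markdown = run_marker("sample.pdf", page_num=1)
    print(markdown[:500])
```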
