Commit 91956cb

docs: vlm and picture description options (#149)
Signed-off-by: Michele Dolfi <[email protected]>
1 parent 4c9571a commit 91956cb

File tree

3 files changed: +78 -2 lines changed

docling_serve/datamodel/convert.py

Lines changed: 1 addition & 1 deletion
@@ -310,7 +310,7 @@ class ConvertDocumentsOptions(BaseModel):
         bool,
         Field(
             description=(
-                "If enabled, perform formula OCR, return Latex code. "
+                "If enabled, perform formula OCR, return LaTeX code. "
                 "Boolean. Optional, defaults to false."
             ),
             examples=[False],
docs/configuration.md

Lines changed: 5 additions & 0 deletions
@@ -38,6 +38,11 @@ THe following table describes the options to configure the Docling Serve app.
 | `--artifacts-path` | `DOCLING_SERVE_ARTIFACTS_PATH` | unset | If set to a valid directory, the model weights will be loaded from this path |
 | | `DOCLING_SERVE_STATIC_PATH` | unset | If set to a valid directory, the static assets for the docs and ui will be loaded from this path |
 | `--enable-ui` | `DOCLING_SERVE_ENABLE_UI` | `false` | Enable the demonstrator UI. |
+| | `DOCLING_SERVE_ENABLE_REMOTE_SERVICES` | `false` | Allow pipeline components making remote connections. For example, this is needed when using a vision-language model via APIs. |
+| | `DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS` | `false` | Allow the selection of third-party plugins. |
+| | `DOCLING_SERVE_MAX_DOCUMENT_TIMEOUT` | `604800` (7 days) | The maximum time for processing a document. |
+| | `DOCLING_SERVE_MAX_NUM_PAGES` | | The maximum number of pages for a document to be processed. |
+| | `DOCLING_SERVE_MAX_FILE_SIZE` | | The maximum file size for a document to be processed. |
 | | `DOCLING_SERVE_OPTIONS_CACHE_SIZE` | `2` | How many DocumentConveter objects (including their loaded models) to keep in the cache. |
 | | `DOCLING_SERVE_CORS_ORIGINS` | `["*"]` | A list of origins that should be permitted to make cross-origin requests. |
 | | `DOCLING_SERVE_CORS_METHODS` | `["*"]` | A list of HTTP methods that should be allowed for cross-origin requests. |
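A minimal sketch of how the newly documented settings might be exported before starting the server. The `docling-serve run` launch command and the byte unit for `DOCLING_SERVE_MAX_FILE_SIZE` are assumptions for illustration, not stated in this diff:

```python
import os
import subprocess

# Minimal sketch: export the new settings before launching the server.
# The "docling-serve run" command and the byte unit assumed for
# DOCLING_SERVE_MAX_FILE_SIZE are illustrative assumptions, not from this diff.
env = os.environ.copy()
env["DOCLING_SERVE_ENABLE_REMOTE_SERVICES"] = "true"        # needed for API-based picture description
env["DOCLING_SERVE_MAX_DOCUMENT_TIMEOUT"] = "3600"          # 1 hour instead of the 7-day default
env["DOCLING_SERVE_MAX_NUM_PAGES"] = "500"                  # reject documents longer than 500 pages
env["DOCLING_SERVE_MAX_FILE_SIZE"] = str(50 * 1024 * 1024)  # assumed bytes (~50 MB)

subprocess.run(["docling-serve", "run"], env=env, check=True)
```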

docs/usage.md

Lines changed: 72 additions & 1 deletion
@@ -8,6 +8,7 @@ On top of the source of file (see below), both endpoints support the same parameters.
 
 - `from_format` (List[str]): Input format(s) to convert from. Allowed values: `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md`. Defaults to all formats.
 - `to_formats` (List[str]): Output format(s) to convert to. Allowed values: `md`, `json`, `html`, `text`, `doctags`. Defaults to `md`.
+- `pipeline` (str). The choice of which pipeline to use. Allowed values are `standard` and `vlm`. Defaults to `standard`.
 - `do_ocr` (bool): If enabled, the bitmap content will be processed using OCR. Defaults to `True`.
 - `image_export_mode`: Image export mode for the document (only in case of JSON, Markdown or HTML). Allowed values: embedded, placeholder, referenced. Optional, defaults to `embedded`.
 - `force_ocr` (bool): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to `False`.
@@ -18,7 +19,13 @@ On top of the source of file (see below), both endpoints support the same parameters.
 - `abort_on_error` (bool): If enabled, abort on error. Defaults to false.
 - `return_as_file` (boo): If enabled, return the output as a file. Defaults to false.
 - `do_table_structure` (bool): If enabled, the table structure will be extracted. Defaults to true.
-- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to true.
+- `do_code_enrichment` (bool): If enabled, perform OCR code enrichment. Defaults to false.
+- `do_formula_enrichment` (bool): If enabled, perform formula OCR, return LaTeX code. Defaults to false.
+- `do_picture_classification` (bool): If enabled, classify pictures in documents. Defaults to false.
+- `do_picture_description` (bool): If enabled, describe pictures in documents. Defaults to false.
+- `picture_description_local` (dict): Options for running a local vision-language model in the picture description. The parameters refer to a model hosted on Hugging Face. This parameter is mutually exclusive with picture_description_api.
+- `picture_description_api` (dict): API details for using a vision-language model in the picture description. This parameter is mutually exclusive with picture_description_local.
+- `include_images` (bool): If enabled, images will be extracted from the document. Defaults to false.
 - `images_scale` (float): Scale factor for images. Defaults to 2.0.
 
 ## Convert endpoints
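A minimal sketch of how the new `pipeline` and enrichment flags combine in a request. The endpoint path and port (`/v1alpha/convert/source` on `localhost:5001`), the `http_sources` field, and the `document.md_content` response field are assumptions carried over from the surrounding usage examples rather than defined by this hunk:

```python
import requests

# Minimal sketch of a conversion request using the newly documented options.
# Endpoint path/port, the http_sources field, and the response layout are
# assumptions taken from the surrounding usage examples, not from this hunk.
payload = {
    "options": {
        "to_formats": ["md"],
        "pipeline": "standard",          # or "vlm"
        "do_formula_enrichment": True,   # formulas come back as LaTeX
        "do_picture_classification": True,
        "do_picture_description": True,
        "include_images": True,
    },
    "http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}],
}

response = requests.post(
    "http://localhost:5001/v1alpha/convert/source",
    json=payload,
    timeout=300,
)
response.raise_for_status()
data = response.json()
print(data["document"]["md_content"][:500])
```

The `options` object mirrors the parameter list above; anything omitted falls back to the documented defaults.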
@@ -244,6 +251,70 @@ data = response.json()
 
 </details>
 
+### Picture description options
+
+When the picture description enrichment is activated, users may specify which model and which execution mode to use for this task. There are two choices for the execution mode: _local_ will run the vision-language model directly, _api_ will invoke an external API endpoint.
+
+The local option is specified with:
+
+```jsonc
+{
+  "picture_description_local": {
+    "repo_id": "", // Repository id from the Hugging Face Hub.
+    "generation_config": {"max_new_tokens": 200, "do_sample": false}, // HF generation config.
+    "prompt": "Describe this image in a few sentences. ", // Prompt used when calling the vision-language model.
+  }
+}
+```
+
+The possible values for `generation_config` are documented in the [Hugging Face text generation docs](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig).
+
+The api option is specified with:
+
+```jsonc
+{
+  "picture_description_api": {
+    "url": "", // Endpoint which accepts openai-api compatible requests.
+    "headers": {}, // Headers used for calling the API endpoint. For example, it could include authentication headers.
+    "params": {}, // Model parameters.
+    "timeout": 20, // Timeout for the API request.
+    "prompt": "Describe this image in a few sentences. ", // Prompt used when calling the vision-language model.
+  }
+}
+```
+
+Example URLs are:
+
+- `http://localhost:8000/v1/chat/completions` for the local vllm api, with example `params`:
+  - the `HuggingFaceTB/SmolVLM-256M-Instruct` model
+
+    ```json
+    {
+      "model": "HuggingFaceTB/SmolVLM-256M-Instruct",
+      "max_completion_tokens": 200,
+    }
+    ```
+
+  - the `ibm-granite/granite-vision-3.2-2b` model
+
+    ```json
+    {
+      "model": "ibm-granite/granite-vision-3.2-2b",
+      "max_completion_tokens": 200,
+    }
+    ```
+
+- `http://localhost:11434/v1/chat/completions` for the local ollama api, with example `params`:
+  - the `granite3.2-vision:2b` model
+
+    ```json
+    {
+      "model": "granite3.2-vision:2b"
+    }
+    ```
+
+Note that when using `picture_description_api`, the server must be launched with `DOCLING_SERVE_ENABLE_REMOTE_SERVICES=true`.
+
 ## Response format
 
 The response can be a JSON Document or a File.