Commit c56fb5d

Update docs for extract (#852)
* Update docs for extract
* add more details on async
1 parent b407a5e commit c56fb5d

1 file changed: +93 -21 lines changed
extract.md

Lines changed: 93 additions & 21 deletions
@@ -2,6 +2,29 @@
 
 LlamaExtract provides a simple API for extracting structured data from unstructured documents like PDFs, text files and images.
 
+## Table of Contents
+
+- [Quick Start](#quick-start)
+- [Supported File Types](#supported-file-types)
+- [Different Input Types](#different-input-types)
+- [Async Extraction](#async-extraction)
+- [Core Concepts](#core-concepts)
+- [Defining Schemas](#defining-schemas)
+- [Using Pydantic (Recommended)](#using-pydantic-recommended)
+- [Using JSON Schema](#using-json-schema)
+- [Important restrictions on JSON/Pydantic Schema](#important-restrictions-on-jsonpydantic-schema)
+- [Extraction Configuration](#extraction-configuration)
+- [Configuration Options](#configuration-options)
+- [Extraction Agents (Advanced)](#extraction-agents-advanced)
+- [Creating Agents](#creating-agents)
+- [Agent Batch Processing](#agent-batch-processing)
+- [Updating Agent Schemas](#updating-agent-schemas)
+- [Managing Agents](#managing-agents)
+- [When to Use Agents vs Direct Extraction](#when-to-use-agents-vs-direct-extraction)
+- [Installation](#installation)
+- [Tips & Best Practices](#tips--best-practices)
+- [Additional Resources](#additional-resources)
+
 ## Quick Start
 
 The simplest way to get started is to use the stateless API with the extraction configuration and the file/text to extract from:
@@ -12,7 +35,7 @@ from llama_cloud import ExtractConfig, ExtractMode
 from pydantic import BaseModel, Field
 
 # Initialize client
-extractor = LlamaExtract()
+extractor = LlamaExtract(api_key="YOUR_API_KEY")
 
 
 # Define schema using Pydantic
@@ -64,7 +87,9 @@ result = extractor.extract(Resume, config, SourceText(text_content=text))
 
 ### Async Extraction
 
-For better performance with multiple files or when integrating with async applications:
+For better performance with multiple files or when integrating with async applications.
+Here `queue_extraction` will enqueue the extraction jobs and exit. Alternatively, you
+can use `aextract` to poll for the job and return the extraction results.
 
 ```python
 import asyncio
@@ -80,10 +105,18 @@ async def extract_resumes():
         Resume, config, ["resume1.pdf", "resume2.pdf"]
     )
     print(f"Queued {len(jobs)} extraction jobs")
+    return jobs
 
 
 # Run async function
-asyncio.run(extract_resumes())
+jobs = asyncio.run(extract_resumes())
+# Check job status
+for job in jobs:
+    status = agent.get_extraction_job(job.id).status
+    print(f"Job {job.id}: {status}")
+
+# Get results when complete
+results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
 ```
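
The status check in the snippet above runs once per job, so a caller that needs results usually has to repeat it until every job finishes. A minimal polling sketch, under stated assumptions: `wait_for_jobs` and `TERMINAL_STATUSES` are hypothetical names that are not part of the LlamaExtract client, the status strings are guesses at what a terminal state might be called, and `get_status` stands in for whatever call returns a job's status as a plain string.

```python
import time
from typing import Callable, Dict, Iterable

# Assumed terminal status names; the real service may use different values.
TERMINAL_STATUSES = {"SUCCESS", "ERROR", "CANCELLED"}


def wait_for_jobs(
    job_ids: Iterable[str],
    get_status: Callable[[str], str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> Dict[str, str]:
    """Poll every job until it reaches a terminal status or the timeout expires."""
    pending = list(job_ids)
    final: Dict[str, str] = {}
    deadline = time.monotonic() + timeout
    while pending:
        still_pending = []
        for job_id in pending:
            status = get_status(job_id)
            if status in TERMINAL_STATUSES:
                final[job_id] = status
            else:
                still_pending.append(job_id)
        pending = still_pending
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Jobs still pending after {timeout}s: {pending}")
        time.sleep(poll_interval)
    return final
```

To use it with the snippet above, `get_status` would wrap the `get_extraction_job(job_id).status` lookup and convert the result to whichever string form matches `TERMINAL_STATUSES`. The conversion depends on how the client represents statuses, which this commit does not show.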
 
 ## Core Concepts
@@ -159,24 +192,6 @@ config = ExtractConfig(extraction_mode=ExtractMode.FAST)
 result = extractor.extract(schema, config, "resume.pdf")
 ```
 
-## Extraction Configuration
-
-Configure how extraction is performed using `ExtractConfig`:
-
-```python
-from llama_cloud import ExtractConfig, ExtractMode
-
-# Fast extraction (lower accuracy, faster processing)
-fast_config = ExtractConfig(extraction_mode=ExtractMode.FAST)
-
-# Balanced extraction (good balance of speed and accuracy)
-balanced_config = ExtractConfig(extraction_mode=ExtractMode.BALANCED)
-
-# Use different configs for different needs
-result = extractor.extract(schema, fast_config, "simple_document.pdf")
-result = extractor.extract(schema, balanced_config, "complex_document.pdf")
-```
-
 ### Important restrictions on JSON/Pydantic Schema
 
 _LlamaExtract only supports a subset of the JSON Schema specification._ While limited, it should
@@ -194,6 +209,62 @@ be sufficient for a wide variety of use-cases.
 your extraction workflow to fit within these constraints, e.g. by extracting subset of fields
 and later merging them together.
 
+## Extraction Configuration
+
+Configure how extraction is performed using `ExtractConfig`. The schema is the most important part, but several configuration options can significantly impact the extraction process.
+
+```python
+from llama_cloud import ExtractConfig, ExtractMode, ChunkMode, ExtractTarget
+
+# Basic configuration
+config = ExtractConfig(
+    extraction_mode=ExtractMode.BALANCED, # FAST, BALANCED, MULTIMODAL, PREMIUM
+    extraction_target=ExtractTarget.PER_DOC, # PER_DOC, PER_PAGE
+    system_prompt="Focus on the most recent data",
+    page_range="1-5,10-15", # Extract from specific pages
+)
+
+# Advanced configuration
+advanced_config = ExtractConfig(
+    extraction_mode=ExtractMode.MULTIMODAL,
+    chunk_mode=ChunkMode.PAGE, # PAGE, SECTION
+    high_resolution_mode=True, # Better OCR accuracy
+    invalidate_cache=False, # Bypass cached results
+    cite_sources=True, # Enable source citations
+    use_reasoning=True, # Enable reasoning (not in FAST mode)
+    confidence_scores=True, # MULTIMODAL/PREMIUM only
+)
+```
+
+### Key Configuration Options
+
+**Extraction Mode**: Controls processing quality and speed
+
+- `FAST`: Fastest processing, suitable for simple documents with no OCR
+- `BALANCED`: Good speed/accuracy tradeoff for text-rich documents
+- `MULTIMODAL`: For visually rich documents with text, tables, and images (recommended)
+- `PREMIUM`: Highest accuracy with OCR, complex table/header detection
+
+**Extraction Target**: Defines extraction scope
+
+- `PER_DOC`: Apply schema to entire document (default)
+- `PER_PAGE`: Apply schema to each page, returns array of results
+
+**Advanced Options**:
+
+- `system_prompt`: Additional system-level instructions
+- `page_range`: Specific pages to extract (e.g., "1,3,5-7,9")
+- `chunk_mode`: Document splitting strategy (`PAGE` or `SECTION`)
+- `high_resolution_mode`: Better OCR for small text (slower processing)
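
The `page_range` strings shown above ("1-5,10-15" and "1,3,5-7,9") mix single pages with ranges. A small sketch of how such a string could be expanded for local inspection, assuming pages are 1-based and ranges are inclusive; the docs in this diff do not state either point. `expand_page_range` is a hypothetical helper, not part of the LlamaExtract client, and the service may parse the string differently.

```python
from typing import List


def expand_page_range(spec: str) -> List[int]:
    """Expand a spec such as "1,3,5-7,9" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges (an assumption, not a documented rule).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"Invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(expand_page_range("1,3,5-7,9"))
```

Expanding the string locally can help confirm that a range covers the intended pages before a job is queued.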
+
+**Extensions** (return additional metadata):
+
+- `cite_sources`: Source tracing for extracted fields
+- `use_reasoning`: Explanations for extraction decisions
+- `confidence_scores`: Quantitative confidence measures (MULTIMODAL/PREMIUM only)
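
The mode restrictions noted for `use_reasoning` (not in FAST mode) and `confidence_scores` (MULTIMODAL/PREMIUM only) can be checked before a request is built. A minimal sketch of that check, assuming the restrictions are exactly as the comments and list above state them. `check_extension_flags` is a hypothetical helper, not part of the LlamaExtract client, and modes are passed as plain strings so the block runs without the SDK installed.

```python
from typing import List

KNOWN_MODES = {"FAST", "BALANCED", "MULTIMODAL", "PREMIUM"}


def check_extension_flags(
    extraction_mode: str,
    use_reasoning: bool = False,
    confidence_scores: bool = False,
) -> List[str]:
    """Return human-readable problems; an empty list means no conflict was found."""
    problems: List[str] = []
    mode = extraction_mode.upper()
    if mode not in KNOWN_MODES:
        problems.append(f"unknown extraction mode: {extraction_mode!r}")
        return problems
    if use_reasoning and mode == "FAST":
        problems.append("use_reasoning is not available in FAST mode")
    if confidence_scores and mode not in {"MULTIMODAL", "PREMIUM"}:
        problems.append("confidence_scores requires MULTIMODAL or PREMIUM mode")
    return problems


if __name__ == "__main__":
    for problem in check_extension_flags("FAST", use_reasoning=True, confidence_scores=True):
        print(problem)
```

Returning a list rather than raising on the first conflict lets a caller report every incompatible flag at once.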
+
+For complete configuration options, advanced settings, and detailed examples, see the [LlamaExtract Configuration Documentation](https://docs.cloud.llamaindex.ai/llamaextract/features/options).
+
 ## Extraction Agents (Advanced)
 
 For reusable extraction workflows, you can create extraction agents that encapsulate both schema and configuration:
@@ -326,6 +397,7 @@ Another option (orthogonal to the above) is to break the document into smaller s
 
 ## Additional Resources
 
+- [Extract Documentation](https://docs.cloud.llamaindex.ai/llamaextract/getting_started) - Details on Extract features, API and examples.
 - [Example Notebook](docs/examples-py/extract/resume_screening.ipynb) - Detailed walkthrough of resume parsing
 - [Example Application with TypeScript](./examples-ts/extract/) - End-to-end examples using LlamaExtract TypeScript client.
 - [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback