Commit ace7418

feat(confluence): make SVG processing optional to fix pycairo installation issues
This change addresses installation failures on Debian/Ubuntu systems where svglib 1.6.0 introduced breaking changes that require pycairo compilation, which fails without gcc and cairo-dev system libraries.

Changes:
- Move svglib dependency to optional extras: pip install 'llama-index-readers-confluence[svg]'
- Add graceful degradation in process_svg() when dependencies unavailable
- Add FileType.SVG enum for custom parser support
- Add comprehensive migration guide with 4 different approaches
- Add unit tests for optional dependency behavior
- Add working examples for all SVG processing options
- Update README and CHANGELOG

Breaking Change: SVG processing now requires explicit installation with [svg] extra. Users who need SVG support should install with: pip install 'llama-index-readers-confluence[svg]'

Backward Compatibility: Maintained through graceful degradation - SVG attachments are skipped with informative warnings when dependencies are not installed.

Fixes installation issues on systems without C compilers.

Tested: 3 tests passed, 1 skipped (expected when svglib not installed)
1 parent 44b3f8c commit ace7418

9 files changed, +649 -13 lines changed

llama-index-integrations/readers/llama-index-readers-confluence/CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -1,5 +1,18 @@
11
# CHANGELOG
22

3+
## [Unreleased]
4+
5+
### Changed
6+
7+
- **BREAKING**: Made SVG processing optional to avoid installation issues with pycairo dependency
8+
- SVG support (`svglib`) moved to optional dependencies. Install with `pip install 'llama-index-readers-confluence[svg]'`
9+
- SVG attachments will be skipped with a warning if optional dependencies are not installed
10+
- Pinned svglib to <1.6.0 to avoid breaking changes in newer versions
11+
12+
### Fixed
13+
14+
- Fixed installation failures on Debian/Ubuntu systems due to pycairo compilation issues
15+
316
## [0.1.8] - 2024-08-20
417

518
- Added observability events for ConfluenceReader
Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
1+
# Migration Guide: SVG Support Changes
2+
3+
## Overview
4+
5+
Starting from version 0.4.5, SVG processing support has been moved to an optional dependency to address installation issues on systems where the `pycairo` package cannot be compiled (particularly Debian/Ubuntu systems without C compilers or Cairo development libraries).
6+
7+
## What Changed?
8+
9+
### Before (versions < 0.4.5)
10+
11+
- `svglib` was a required dependency
12+
- All users had to install `pycairo` even if they didn't need SVG support
13+
- Installation could fail on systems without proper build tools
14+
15+
### After (versions >= 0.4.5)
16+
17+
- `svglib` is now an optional dependency
18+
- SVG processing is skipped by default with a warning if optional dependencies are not installed
19+
- Base installation works on all systems without requiring C compilers
20+
- `svglib` is pinned to `<1.6.0` to avoid the breaking changes introduced in 1.6.0
21+
22+
## Migration Paths
23+
24+
### Option 1: Continue Using Built-in SVG Support (Recommended if SVG is needed)
25+
26+
If you need SVG processing and can install the required system dependencies:
27+
28+
```bash
# Uninstall current version
pip uninstall llama-index-readers-confluence

# Install with SVG support
pip install 'llama-index-readers-confluence[svg]'
```
35+
36+
**System Requirements for SVG Support:**
37+
38+
- On Debian/Ubuntu: `sudo apt-get install gcc python3-dev libcairo2-dev`
39+
- On macOS: `brew install cairo`
40+
- On Windows: Install Visual C++ Build Tools
41+
42+
### Option 2: Skip SVG Processing (Recommended for Docker/CI environments)
43+
44+
If you don't need SVG processing or want to avoid installation issues:
45+
46+
```bash
# Install without SVG support (default)
pip install llama-index-readers-confluence
```
50+
51+
SVG attachments will be skipped with a warning in the logs. All other functionality remains unchanged.
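
The skip-with-a-warning behavior follows a common optional-import pattern. The following is a minimal sketch of that pattern, not the package's actual `process_svg()` implementation: the function name `process_svg_or_skip`, its `module_name` parameter, the log message wording, and the placeholder return value are all assumptions made for illustration.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def process_svg_or_skip(svg_path: str, module_name: str = "svglib.svglib") -> str:
    """Return extracted text, or "" if the optional module is missing.

    Hypothetical helper: illustrates graceful degradation only.
    """
    try:
        # Import lazily so that a missing optional dependency never
        # breaks importing the reader itself.
        importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "SVG processing skipped for %s: optional dependency %r is not "
            "installed. Install with: "
            "pip install 'llama-index-readers-confluence[svg]'",
            svg_path,
            module_name,
        )
        return ""
    # The real SVG-to-text conversion would go here; this sketch only
    # signals that the dependency was found.
    return f"[SVG content of {svg_path}]"
```

Because the import happens inside the function, the base installation can be imported on systems where `svglib` and `pycairo` cannot be built, and only SVG attachments are affected.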
52+
53+
### Option 3: Use Custom SVG Parser
54+
55+
If you need SVG processing but cannot install pycairo, use a custom parser:
56+
57+
```python
import xml.etree.ElementTree as ET

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document
from llama_index.readers.confluence import ConfluenceReader
from llama_index.readers.confluence.event import FileType


# Simple text extraction from SVG (no OCR)
class SimpleSVGParser(BaseReader):
    def load_data(self, file_path, **kwargs):
        root = ET.parse(file_path).getroot()

        # SVG tags are usually namespaced, e.g.
        # "{http://www.w3.org/2000/svg}text", so compare the local name.
        # itertext() also picks up text nested in <tspan> children.
        texts = [
            "".join(elem.itertext()).strip()
            for elem in root.iter()
            if elem.tag.rsplit("}", 1)[-1] == "text"
        ]
        extracted_text = " ".join(t for t in texts if t) or "[SVG Image]"

        return [
            Document(text=extracted_text, metadata={"file_path": str(file_path)})
        ]


reader = ConfluenceReader(
    base_url="https://yoursite.atlassian.com/wiki",
    api_token="your_token",
    custom_parsers={FileType.SVG: SimpleSVGParser()},
)
```
85+
86+
See `examples/svg_parsing_examples.py` for more custom parser examples.
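
The text-extraction logic of such a parser can be checked without a Confluence connection. The sketch below uses only the standard library; `extract_svg_text` is a hypothetical helper written for this guide, not part of the package. It matches `<text>` elements by local name because SVG files normally declare the `http://www.w3.org/2000/svg` namespace, in which case a plain `.//text` lookup finds nothing.

```python
import xml.etree.ElementTree as ET


def extract_svg_text(svg_source: str) -> str:
    """Join the text of every <text> element in an SVG document string."""
    root = ET.fromstring(svg_source)
    chunks = []
    for elem in root.iter():
        # Strip any "{namespace}" prefix before comparing the tag name.
        if elem.tag.rsplit("}", 1)[-1] != "text":
            continue
        # itertext() includes text held in nested <tspan> elements;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            chunks.append(text)
    return " ".join(chunks)


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text x="0" y="10">Hello <tspan>SVG</tspan></text>'
    "<text>World</text>"
    "</svg>"
)
print(extract_svg_text(sample))  # prints "Hello SVG World"
```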
87+
88+
### Option 4: Filter Out SVG Attachments
89+
90+
If you want to explicitly skip SVG files without warnings:
91+
92+
```python
from llama_index.readers.confluence import ConfluenceReader


def attachment_filter(
    media_type: str, file_size: int, title: str
) -> tuple[bool, str]:
    # Returning False skips the attachment; the string is the reason
    if media_type == "image/svg+xml":
        return False, "SVG processing disabled"
    return True, ""


reader = ConfluenceReader(
    base_url="https://yoursite.atlassian.com/wiki",
    api_token="your_token",
    process_attachment_callback=attachment_filter,
)
```
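
If more than one media type needs to be excluded, the same callback shape can be generated from a set of blocked types. This is a sketch under the assumption that the callback receives `(media_type, file_size, title)` as in the example above; `make_media_type_filter` is a hypothetical helper, not something the package provides.

```python
from typing import Callable, Iterable, Tuple

AttachmentFilter = Callable[[str, int, str], Tuple[bool, str]]


def make_media_type_filter(blocked: Iterable[str]) -> AttachmentFilter:
    """Build a callback that rejects attachments with a blocked media type."""
    blocked_types = frozenset(blocked)

    def attachment_filter(
        media_type: str, file_size: int, title: str
    ) -> Tuple[bool, str]:
        if media_type in blocked_types:
            return False, f"{media_type} processing disabled"
        return True, ""

    return attachment_filter


# Skip SVG attachments only; everything else is processed.
skip_svg = make_media_type_filter({"image/svg+xml"})
```

The resulting `skip_svg` function could then be passed as `process_attachment_callback` in place of a hand-written filter.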
107+
108+
## Docker/Container Deployments
109+
110+
### Before (versions < 0.4.5)
111+
112+
```dockerfile
FROM python:3.11-slim

# Required system dependencies for pycairo
RUN apt-get update && apt-get install -y \
    gcc \
    python3-dev \
    libcairo2-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install llama-index-readers-confluence
```
124+
125+
### After (versions >= 0.4.5) - Without SVG Support
126+
127+
```dockerfile
FROM python:3.11-slim

# No system dependencies needed!
RUN pip install llama-index-readers-confluence
```
133+
134+
### After (versions >= 0.4.5) - With SVG Support
135+
136+
```dockerfile
FROM python:3.11-slim

# Only if you need SVG support
RUN apt-get update && apt-get install -y \
    gcc \
    python3-dev \
    libcairo2-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install 'llama-index-readers-confluence[svg]'
```
148+
149+
## FAQ
150+
151+
### Q: Will my existing code break?
152+
153+
**A:** Your existing code will keep running without changes, but its output can differ. If you relied on SVG processing and don't install the `[svg]` extra, SVG attachments are skipped with a warning rather than raising an error, so their text will be missing from the loaded documents.
154+
155+
### Q: How do I know if SVG dependencies are installed?
156+
157+
**A:** Check the logs. If you see warnings like "SVG processing skipped: Optional dependencies not installed", then SVG dependencies are not available.
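
A programmatic check is also possible before running the reader. The sketch below uses the standard library's `importlib.util.find_spec`; `is_module_available` is a hypothetical helper name. It only confirms that the `svglib` package can be located, which is an approximation: it does not prove that `pycairo` or other transitive dependencies load correctly.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if is_module_available("svglib"):
    print("svglib found: SVG attachments can be processed")
else:
    print("svglib missing: SVG attachments will be skipped")
```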
158+
159+
### Q: Can I use a different OCR engine for SVG?
160+
161+
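**A:** Yes. Use the custom parser approach (Option 3) and implement your own SVG-to-text conversion logic. You could use libraries like `cairosvg`, `pdf2image`, or pure XML parsing depending on your needs. Note that `cairosvg` also depends on the Cairo system library, so pure XML parsing is the option that avoids native dependencies entirely.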
162+
163+
### Q: Why was this change made?
164+
165+
**A:** The `pycairo` dependency (required by `svglib`) requires C compilation and system libraries (Cairo). This caused installation failures in:
166+
167+
- Docker containers based on slim images
168+
- CI/CD pipelines without build tools
169+
- Systems managed by users without admin rights
170+
- Environments where SVG support isn't needed
171+
172+
Making it optional allows the package to work everywhere while still supporting SVG for users who need it.
173+
174+
### Q: What if I encounter other issues?
175+
176+
**A:** Please file an issue on GitHub with:
177+
178+
1. Your Python version
179+
2. Your operating system
180+
3. Whether you installed with `[svg]` extra
181+
4. The full error message
182+
5. Output of `pip list` showing installed packages
183+
184+
## Testing Your Migration
185+
186+
After migrating, test your setup:
187+
188+
```python
import logging

from llama_index.readers.confluence import ConfluenceReader

# Enable logging to see SVG warnings
logging.basicConfig(level=logging.INFO)

reader = ConfluenceReader(
    base_url="https://yoursite.atlassian.com/wiki",
    api_token="your_token",
)

# Try loading data
documents = reader.load_data(space_key="MYSPACE", include_attachments=True)

# Check logs for any SVG-related warnings
print(f"Loaded {len(documents)} documents")
```
206+
207+
If you see "SVG processing skipped" warnings but didn't expect them, you may need to install the `[svg]` extra.
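
For automated checks (for example in CI), the warnings can be collected instead of read by eye. The following is a sketch that assumes the skip warning contains the text "SVG processing skipped" quoted in the FAQ and that the reader's log records propagate to the root logger; `SVGWarningCollector` is a hypothetical class, not part of the package.

```python
import logging
from typing import List


class SVGWarningCollector(logging.Handler):
    """Collect log messages that mention skipped SVG processing."""

    def __init__(self, marker: str = "SVG processing skipped") -> None:
        super().__init__(level=logging.WARNING)
        self.marker = marker
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        message = record.getMessage()
        if self.marker in message:
            self.messages.append(message)


collector = SVGWarningCollector()
# Attach to the root logger so records propagated from any module are seen.
logging.getLogger().addHandler(collector)

# ... call reader.load_data(...) here ...

if collector.messages:
    print(f"{len(collector.messages)} SVG attachment(s) were skipped")
```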

llama-index-integrations/readers/llama-index-readers-confluence/README.md

Lines changed: 24 additions & 1 deletion
@@ -51,6 +51,23 @@ include attachments, this is set to `False` by default, if set to `True` all att
5151
ConfluenceReader will extract the text from the attachments and add it to the Document object.
5252
Currently supported attachment types are: PDF, PNG, JPEG/JPG, SVG, Word and Excel.
5353

54+
### Optional Dependencies
55+
56+
**SVG Support**: SVG processing requires additional dependencies that can cause installation issues on some systems.
57+
To enable SVG attachment processing, install with the `svg` extra:
58+
59+
```bash
pip install 'llama-index-readers-confluence[svg]'
```
62+
63+
If SVG dependencies are not installed, SVG attachments will be skipped with a warning in the logs, but all other
64+
functionality will work normally. This allows the package to be installed on systems where the SVG dependencies
65+
(svglib and its transitive dependency pycairo) cannot be built.
66+
67+
**Migration Note for Existing Users**: If you were previously using SVG processing and want to continue doing so,
68+
you need to install the svg extra as shown above. Alternatively, you can provide a custom SVG parser using the
69+
`custom_parsers` parameter (see Advanced Configuration section and `examples/svg_parsing_examples.py` for details).
70+
5471
## Advanced Configuration
5572

5673
The ConfluenceReader supports several advanced configuration options for customizing the reading behavior:
@@ -98,7 +115,8 @@ confluence_parsers = {
98115
# ConfluenceFileType.CSV: CSVParser(),
99116
# ConfluenceFileType.SPREADSHEET: ExcelParser(),
100117
# ConfluenceFileType.MARKDOWN: MarkdownParser(),
101-
# ConfluenceFileType.TEXT: TextParser()
118+
# ConfluenceFileType.TEXT: TextParser(),
119+
# ConfluenceFileType.SVG: CustomSVGParser(), # Custom SVG parser to avoid pycairo issues
102120
}
103121

104122
reader = ConfluenceReader(
@@ -108,6 +126,10 @@ reader = ConfluenceReader(
108126
)
109127
```
110128

129+
For SVG parsing examples including alternatives to the built-in parser, see `examples/svg_parsing_examples.py`.
130+
131+
````
132+
111133
**Processing Callbacks**:
112134
113135
- `process_attachment_callback`: A callback function to control which attachments should be processed. The function receives the media type and file size as parameters and should return a tuple of `(should_process: bool, reason: str)`.
@@ -425,3 +447,4 @@ print(f"Processing completed. Total documents: {len(documents)}")
425447
```
426448
427449
This loader is designed to be used as a way to load data into [LlamaIndex](https://github.com/run-llama/llama_index/).
450+
````
