Commit 0fc8318

Update docs. Add NER extraction for common text anchors
1 parent 42d1750 commit 0fc8318

31 files changed, +3946 -279 lines changed

.bandit

Lines changed: 2 additions & 0 deletions
@@ -7,5 +7,7 @@ exclude_dirs:
   - ".git"
 
 # Allow hardcoded SECRET_KEY in development settings
+# Allow HuggingFace downloads without revision pinning for development
 skips:
   - "B105"
+  - "B615"

.env.example

Lines changed: 6 additions & 0 deletions
@@ -23,3 +23,9 @@ DJANGO_SUPERUSER_EMAIL=admin@localhost
 
 # Media Storage
 MEDIA_ROOT=/app/media
+
+# Ollama Configuration
+OLLAMA_HOST=the-area.local
+OLLAMA_PORT=11434
+OLLAMA_LLM_MODEL=aya:35b-23
+OLLAMA_EMBEDDING_MODEL=zylonai/multilingual-e5-large:latest
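
The four `OLLAMA_*` variables above are enough to locate the Ollama server and pick its models. How the application reads them is not part of this diff; the following is only a minimal sketch, assuming a hypothetical `OllamaSettings` helper and assumed fallback values (`localhost`, empty model names). Port 11434 is the value given in `.env.example`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaSettings:
    """Hypothetical container for the OLLAMA_* variables (not from the commit)."""

    host: str
    port: int
    llm_model: str
    embedding_model: str

    @property
    def base_url(self) -> str:
        # Ollama serves plain HTTP on the configured host and port.
        return f"http://{self.host}:{self.port}"

    @classmethod
    def from_env(cls, env=None):
        # Accept any mapping so the helper can be exercised without touching os.environ.
        env = os.environ if env is None else env
        return cls(
            host=env.get("OLLAMA_HOST", "localhost"),  # assumed fallback
            port=int(env.get("OLLAMA_PORT", "11434")),
            llm_model=env.get("OLLAMA_LLM_MODEL", ""),  # assumed fallback
            embedding_model=env.get("OLLAMA_EMBEDDING_MODEL", ""),  # assumed fallback
        )
```

Calling `OllamaSettings.from_env()` with no argument reads the process environment, which is what Docker Compose populates from the `--env-file` arguments.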

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -138,6 +138,8 @@ Thumbs.db
 media/
 staticfiles/
 static_root/
+models/
+training_*/
 
 # Docker
 .dockerignore

Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,12 @@ RUN apt-get update && apt-get install -y \
     tesseract-ocr-nld \
     libtesseract-dev \
     poppler-utils \
+    libgl1-mesa-glx \
+    libglib2.0-0 \
+    libsm6 \
+    libxext6 \
+    libxrender-dev \
+    libgomp1 \
     && rm -rf /var/lib/apt/lists/*
 
 # Copy requirements and install Python dependencies

Makefile

Lines changed: 64 additions & 4 deletions
@@ -144,17 +144,77 @@ test-tasks: ensure-containers ## Run only task tests in Docker
 
 
 ##@ Docker Commands
+# Resolve mDNS hostname for Ollama host
+resolve-ollama-host:
+	@echo "Resolving Ollama host..."
+	@if [ -f .env ]; then \
+	  OLLAMA_HOST=$$(grep "^OLLAMA_HOST=" .env 2>/dev/null | cut -d'=' -f2 | tr -d ' "'\'''); \
+	  if [ -n "$$OLLAMA_HOST" ] && echo "$$OLLAMA_HOST" | grep -q "\.local$$"; then \
+	    echo "Found mDNS hostname in .env: $$OLLAMA_HOST"; \
+	    if command -v avahi-resolve >/dev/null 2>&1; then \
+	      echo "Attempting to resolve $$OLLAMA_HOST using mDNS..."; \
+	      RESOLVED_IP=$$(avahi-resolve -4 -n "$$OLLAMA_HOST" 2>/dev/null | awk '{print $$2}' | head -1); \
+	      if [ -n "$$RESOLVED_IP" ] && [ "$$RESOLVED_IP" != "$$OLLAMA_HOST" ]; then \
+	        echo "✅ Resolved $$OLLAMA_HOST to $$RESOLVED_IP"; \
+	        echo "OLLAMA_HOST=$$RESOLVED_IP" > .env.ollama; \
+	        OLLAMA_PORT=$$(grep "^OLLAMA_PORT=" .env 2>/dev/null | cut -d'=' -f2 | tr -d ' "'\'''); \
+	        OLLAMA_EMBEDDING_MODEL=$$(grep "^OLLAMA_EMBEDDING_MODEL=" .env 2>/dev/null | cut -d'=' -f2 | tr -d ' "'\'''); \
+	        OLLAMA_LLM_MODEL=$$(grep "^OLLAMA_LLM_MODEL=" .env 2>/dev/null | cut -d'=' -f2 | tr -d ' "'\'''); \
+	        [ -n "$$OLLAMA_PORT" ] && echo "OLLAMA_PORT=$$OLLAMA_PORT" >> .env.ollama; \
+	        [ -n "$$OLLAMA_EMBEDDING_MODEL" ] && echo "OLLAMA_EMBEDDING_MODEL=$$OLLAMA_EMBEDDING_MODEL" >> .env.ollama; \
+	        [ -n "$$OLLAMA_LLM_MODEL" ] && echo "OLLAMA_LLM_MODEL=$$OLLAMA_LLM_MODEL" >> .env.ollama; \
+	      else \
+	        echo "⚠️ Could not resolve $$OLLAMA_HOST, using original configuration"; \
+	        rm -f .env.ollama; \
+	      fi; \
+	    else \
+	      echo "⚠️ avahi-resolve not available, using original configuration"; \
+	      rm -f .env.ollama; \
+	    fi; \
+	  else \
+	    echo "OLLAMA_HOST is not an mDNS hostname (.local), no resolution needed"; \
+	    rm -f .env.ollama; \
+	  fi; \
+	else \
+	  echo "No .env file found, skipping mDNS resolution"; \
+	  rm -f .env.ollama; \
+	fi
+
 build: ## Build Docker containers
 	@echo "$(YELLOW)🐳 Building Docker containers...$(NC)"
 	docker compose build
 
-up: ## Start all Docker services
+up: resolve-ollama-host ## Start all Docker services
 	@echo "$(YELLOW)🚀 Starting Docker services...$(NC)"
-	docker compose up -d
+	@ENV_FILES=""; \
+	if [ -f .env ]; then \
+	  ENV_FILES="--env-file .env"; \
+	fi; \
+	if [ -f .env.ollama ]; then \
+	  echo "Using dynamically resolved Ollama configuration"; \
+	  ENV_FILES="$$ENV_FILES --env-file .env.ollama"; \
+	fi; \
+	if [ -n "$$ENV_FILES" ]; then \
+	  docker compose $$ENV_FILES up -d; \
+	else \
+	  docker compose up -d; \
+	fi
 
-up-build: ## Build and start all Docker services
+up-build: resolve-ollama-host ## Build and start all Docker services
 	@echo "$(YELLOW)🚀 Building and starting Docker services...$(NC)"
-	docker compose up --build -d
+	@ENV_FILES=""; \
+	if [ -f .env ]; then \
+	  ENV_FILES="--env-file .env"; \
+	fi; \
+	if [ -f .env.ollama ]; then \
+	  echo "Using dynamically resolved Ollama configuration"; \
+	  ENV_FILES="$$ENV_FILES --env-file .env.ollama"; \
+	fi; \
+	if [ -n "$$ENV_FILES" ]; then \
+	  docker compose $$ENV_FILES up --build -d; \
+	else \
+	  docker compose up --build -d; \
+	fi
 
 down: ## Stop all Docker services
 	@echo "$(YELLOW)⏹️ Stopping Docker services...$(NC)"
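
The decision logic of the `resolve-ollama-host` target can be easier to follow outside of shell quoting. The sketch below restates it in Python for illustration only; it is not part of the commit. The function names are hypothetical, and the default resolver uses the system resolver through `socket.getaddrinfo` rather than `avahi-resolve`, so it only finds `.local` names on hosts whose resolver is set up for mDNS.

```python
import socket

# Keys copied into the override file after OLLAMA_HOST, in the Makefile's order.
PASSTHROUGH_KEYS = ("OLLAMA_PORT", "OLLAMA_EMBEDDING_MODEL", "OLLAMA_LLM_MODEL")


def read_env_value(env_text, key):
    """Mimic `grep "^KEY=" | cut -d'=' -f2 | tr -d` with spaces and quotes removed."""
    for line in env_text.splitlines():
        if line.startswith(key + "="):
            # Like `cut -f2`, keep only the text between the first and second '='.
            value = line.split("=")[1]
            return value.replace(" ", "").replace('"', "").replace("'", "")
    return ""


def resolve_ipv4(host):
    """Resolve host to an IPv4 address via the system resolver, or None on failure."""
    try:
        infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
    except OSError:
        return None
    return infos[0][4][0] if infos else None


def build_override(env_text, resolver=resolve_ipv4):
    """Return the text for .env.ollama, or None when no override should exist."""
    host = read_env_value(env_text, "OLLAMA_HOST")
    if not host or not host.endswith(".local"):
        return None  # not an mDNS hostname, nothing to resolve
    ip = resolver(host)
    if not ip or ip == host:
        return None  # resolution failed, keep the original configuration
    lines = [f"OLLAMA_HOST={ip}"]
    for key in PASSTHROUGH_KEYS:
        value = read_env_value(env_text, key)
        if value:
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

A `None` result corresponds to the branches where the Makefile runs `rm -f .env.ollama`. Because `up` and `up-build` pass `--env-file .env.ollama` after `--env-file .env`, the resolved address takes precedence over the `.local` name.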

README.md

Lines changed: 46 additions & 7 deletions
@@ -31,11 +31,30 @@ The demo processes a couple sample pages from a book about my family and extract
 
 ## Current Status
 
-OCR processing pipeline implemented: multi-format documents (PDF, JPG, PNG, TIFF), multi-language OCR (English/Dutch), batch upload, background processing with Celery.
-
-Uses Django admin interface to prototype and test business logic before building custom UI. This approach allows rapid iteration on data models and processing workflows.
-
-**Next**: AI-powered extraction to structured genealogy data.
+### OCR Processing Pipeline - Testing & Refinement
+- Multi-format document processing (PDF, JPG, PNG, TIFF) with Tesseract
+- Multi-language support (English/Dutch) for genealogical texts
+- Advanced rotation detection using computer vision techniques (Hough line detection, projection profiles)
+- Two-stage rotation correction: major angles (0°/90°/180°/270°) + fine-angle adjustments (±10°)
+- Batch upload functionality and background processing with Celery/Redis
+
+### AI-Powered Entity Extraction - Implemented
+- **Neural Network NER (Named Entity Recognition)**: Custom BERT-based model fine-tuned for genealogical entities
+- **Performance**: 96.84% F1 score (harmonic mean of precision and recall) across PERSON_NAME, DATE, PLACE, GENEALOGY_ID, FAMILY_GROUP entities
+- **Dual Extraction Pipeline**: Hybrid approach combining traditional regex patterns with neural network predictions
+- **Training Data Curation**: Django admin interface for manual refinement of genealogical anchor extractions
+
+### Text Processing & Data Standardization
+- **Generation-Aware Chunking**: Intelligent segmentation preserving genealogical document structure
+- **Date Standardization**: Multi-format Dutch/English date parsing ("15 maart 1654" → "1654-03-15")
+- **Genealogical ID Correction**: Systematic fixes for OCR errors in Roman numerals (IL→II, XIL→XII)
+- **Family Context Tracking**: Infers individual IDs from family group headers ("a. John" → "X.9.a")
+
+### Development Approach
+Uses Django admin interface to prototype and test business logic before building custom UI. This approach enables rapid iteration on data models and processing workflows while maintaining data quality through manual review capabilities.
+
+**Current Focus**: Optimizing OCR quality and refining neural network training data
+**Next Phase**: LLM integration for natural language queries and relationship inference
 
 ## Sample Data
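
The README's date standardization bullet gives one worked example, "15 maart 1654" → "1654-03-15". The project's parser is not shown in this part of the diff, so the following is only a sketch of how a day-month-year form could be normalized. The month table, the regular expression and the function name `standardize_date` are assumptions made for illustration, and only the "day month year" layout is handled.

```python
import datetime
import re

# Dutch and English month names mapped to month numbers (illustrative table).
MONTHS = {
    "januari": 1, "january": 1, "februari": 2, "february": 2,
    "maart": 3, "march": 3, "april": 4, "mei": 5, "may": 5,
    "juni": 6, "june": 6, "juli": 7, "july": 7,
    "augustus": 8, "august": 8, "september": 9,
    "oktober": 10, "october": 10, "november": 11, "december": 12,
}

# Matches "15 maart 1654": one or two digit day, month name, four digit year.
_DATE_RE = re.compile(r"\s*(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*")


def standardize_date(text):
    """Return an ISO 8601 date string, or None if the text is not recognized."""
    match = _DATE_RE.fullmatch(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        # datetime.date rejects impossible dates such as 31 februari.
        return datetime.date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None


print(standardize_date("15 maart 1654"))  # 1654-03-15
```

Returning `None` instead of raising keeps unparseable strings available for manual review, which fits the curation workflow the README describes.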

@@ -67,15 +86,35 @@ python manage.py runserver
 
 ## Usage
 
-Upload documents via Django admin → automatic OCR processing → review extracted text and confidence scores.
+**Document Processing Workflow:**
+1. Upload documents via Django admin interface
+2. Automatic OCR processing with rotation detection and correction
+3. Intelligent text chunking with genealogical structure preservation
+4. Dual entity extraction (regex + neural network NER)
+5. Review extracted text, confidence scores, and genealogical anchors
+6. Manual curation of training data for neural network refinement
+
+**Current Capabilities:**
+- Multi-page document OCR with confidence scoring
+- Genealogical entity recognition and extraction
+- Date standardization and genealogical ID correction
+- Visual comparison of extraction methods (regex vs. neural network)
+- Manual anchor curation for gold standard training data
 
 ## Development
 
 **Quality checks:** `make quality-gate` (linting, formatting, type checking, security, tests)
 
 **Tests:** `make test`
 
-**Architecture:** Django + PostgreSQL + Celery + Redis + Tesseract OCR
+**Architecture:** Django + PostgreSQL + Celery + Redis + Tesseract OCR + OpenCV + PyTorch
+
+**Key Technologies:**
+- **Computer Vision**: OpenCV for advanced rotation detection and correction
+- **Machine Learning**: PyTorch + Transformers (BERT) for genealogical Named Entity Recognition
+- **Data Storage**: PostgreSQL with custom ArrayField handling for genealogical anchors
+- **Background Processing**: Celery with Redis for scalable document processing
+- **OCR**: Tesseract with multi-language support and confidence scoring
 
 Run `make help` to see all available development commands.
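
The genealogical ID correction listed under current capabilities targets OCR misreads in Roman numerals, with the README citing IL→II and XIL→XII. The commit's actual rules are not visible here. One possible heuristic, sketched below with hypothetical function names, is to leave valid numerals alone and to swap `L` for `I` only when that turns an invalid token into a valid one, so that a legitimate `XL` (40) is not damaged.

```python
import re

# Standard-form Roman numerals up to 3999.
_ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_roman(token):
    """True for a non-empty, standard-form Roman numeral."""
    return bool(token) and _ROMAN_RE.fullmatch(token) is not None


def correct_roman(token):
    """Repair an OCR 'L' that should be 'I', only if the result is a valid numeral."""
    if is_roman(token):
        return token  # already valid, e.g. "XL" or "XII"
    candidate = token.replace("L", "I")
    return candidate if is_roman(candidate) else token


for raw in ("IL", "XIL", "XL"):
    print(raw, "->", correct_roman(raw))
```

Tokens that remain invalid after the swap are returned unchanged, which leaves them for the manual curation step rather than guessing.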
