A comprehensive system for extracting election data from 287 Form 20 PDFs across 36 districts in Maharashtra's VIDHANSABHA_2024 elections.
This system provides:
- Automated PDF classification into 3 tiers based on complexity
- Tiered extraction with appropriate methods for each PDF type
- Progress tracking with checkpoint and resume capabilities
- Quality control with validation and scoring
- Manual verification interface for corrections
- Comprehensive reporting at every stage
# Initialize tracking files
python scripts/progress_manager.py --init
# Check initial status
python scripts/progress_manager.py --status
python scripts/pdf_classifier.py --classify
# Start fresh extraction
python scripts/main_extractor.py --start
# Resume from last checkpoint
python scripts/main_extractor.py --resume
# Start from specific PDF
python scripts/main_extractor.py --from-pdf AC_216
# Check current status
python scripts/progress_manager.py --status
# Detailed district-wise breakdown
python scripts/progress_manager.py --status --detailed
# Interactive verification mode
python scripts/manual_verifier.py --interactive
# Check specific PDF
python scripts/manual_verifier.py --check AC_216
# Verify record count
python scripts/manual_verifier.py --verify-count AC_216:307
# Review all flagged PDFs
python scripts/manual_verifier.py --review-flagged
# Approve high-quality extractions in batch
python scripts/manual_verifier.py --approve-batch --min-confidence 0.95
form20/
├── README.md # This file
├── IMPLEMENTATION_PLAN.md # Detailed implementation plan
├── EXTRACTION_RULES.md # Extraction rules and standards
├── report.md # PDF parseability analysis
├── required_fields # List of fields to extract
│
├── scripts/ # Core extraction scripts
│ ├── main_extractor.py # Main orchestrator
│ ├── progress_manager.py # Progress tracking
│ ├── manual_verifier.py # Manual verification interface
│ └── [extractors...] # Tier-specific extractors
│
├── config/ # Configuration files
│ ├── extraction_config.json # Extraction settings
│ └── quality_thresholds.json # Validation thresholds
│
├── tracking/ # Progress and logs
│ ├── extraction_progress.json # Main progress file
│ ├── quality_metrics.json # Quality metrics
│ ├── error_log.json # Error tracking
│ └── manual_corrections.json # Manual corrections log
│
├── output/ # Extraction results
│ ├── extracted_data/ # Per-PDF extracted data
│ ├── validation_reports/ # Validation reports
│ └── consolidated_output.csv # Final consolidated data
│
└── VIDHANSABHA_2024/ # Source PDFs (287 files)
└── [36 district folders]
Tier 1 - Standard English Format (70% of PDFs)
- Direct text extraction
- High success rate (95%+)
- Fast processing
Tier 2 - Local Language Format (17% of PDFs)
- Unicode/Devanagari support
- Transliteration capabilities
- Medium complexity
Tier 3 - Scanned/Rotated Format (13% of PDFs)
- OCR with preprocessing
- Image rotation correction
- Highest complexity
- Automatic checkpoints every 10 PDFs
- Resume capability from any point
- Real-time dashboard showing progress
- Backup system for recovery
- Validation rules for data consistency
- Quality scoring (0-1 scale)
- Automatic flagging for review
- Mathematical verification of vote totals
- Interactive interface for corrections
- Side-by-side PDF viewing capability
- Batch approval for high-confidence results
- Correction logging for audit trail
# Generate quality report
python scripts/validator.py --quality-report
# Check specific PDF quality
python scripts/manual_verifier.py --check AC_216
# Reset failed PDF to retry
python scripts/progress_manager.py --reset AC_216
# Mark PDF as complete manually
python scripts/progress_manager.py --mark-complete AC_216:307:0.95
# Create named checkpoint
python scripts/progress_manager.py --checkpoint "after_district_pune"
# Automatic checkpoints created every 10 PDFs
# Export final consolidated CSV
python scripts/main_extractor.py --export-final
# Generate comprehensive report
python scripts/manual_verifier.py --report
During extraction, the system displays:
====================================
Form 20 Extraction Progress
====================================
Total PDFs: 287
Processed: 145 (50.5%)
Pending: 132
Failed: 8
Manual Review: 2
Current: AC_146 (Nagpur)
Tier: 1 (Standard)
Records Extracted: 298
Quality Score: 0.96
Recent Completions:
✓ AC_145: 312 records (Q: 0.98)
✓ AC_144: 289 records (Q: 0.94)
Estimated Time Remaining: 2h 15m
====================================
The system validates:
- Record count within expected range
- Vote totals mathematical consistency
- Required fields presence
- Data types correctness
- Duplicate detection in polling stations
- Retry failed PDFs up to 3 times
- Fallback to lower tier extraction
- Checkpoint restoration on crash
# Emergency stop
python scripts/main_extractor.py --emergency-stop
# Skip problematic PDF
python scripts/main_extractor.py --skip AC_123
# Force reprocess
python scripts/main_extractor.py --reprocess AC_123
Target Performance:
- Processing Rate: 15-20 PDFs/hour
- Success Rate: 95%+
- Data Accuracy: 98%+
- Field Completeness: 90%+
- Manual Intervention: <10%
1. Memory Issues
# Reduce parallel processing
# Edit config/extraction_config.json
# Set "parallel_processing": {"tier_1": 2}
2. OCR Failures
# Check Tesseract installation
tesseract --version
# Install language packs
sudo apt-get install tesseract-ocr-mar # Marathi
3. Progress File Corruption
# Restore from backup
cp backups/progress_checkpoint_latest.json tracking/extraction_progress.json
Edit config/extraction_config.json
:
{
"max_retries": 3,
"timeout_seconds": 300,
"batch_size": 10,
"quality_threshold": 0.85,
"parallel_processing": {
"tier_1": 4,
"tier_2": 2,
"tier_3": 1
}
}
- Python 3.8+
- 8GB RAM minimum (16GB recommended)
- 10GB free disk space
- PDF processing libraries (see requirements.txt)
- IMPLEMENTATION_PLAN.md - Detailed implementation phases
- EXTRACTION_RULES.md - Complete extraction ruleset
- report.md - Initial PDF analysis findings
- Always run progress_manager.py --init before first extraction
- Monitor error_log.json for systematic issues
- Create checkpoints before major operations
- Use manual verification for critical data
- Keep backups of tracking files
For issues or questions, refer to the documentation files or check the error logs in the tracking/
directory.
System designed for incremental, quality-controlled extraction of Maharashtra election Form 20 data with comprehensive progress tracking and manual verification capabilities.