Form 20 PDF Data Extraction System

A system for extracting election data from 287 Form 20 PDFs covering 36 districts in Maharashtra's 2024 Vidhan Sabha (state assembly) elections. The source PDFs live in the VIDHANSABHA_2024/ directory.

System Overview

This system provides:

Automated PDF classification into 3 tiers based on complexity
Tiered extraction with appropriate methods for each PDF type
Progress tracking with checkpoint and resume capabilities
Quality control with validation and scoring
Manual verification interface for corrections
Comprehensive reporting at every stage

Quick Start

1. Initialize the System

# Initialize tracking files
python scripts/progress_manager.py --init

# Check initial status
python scripts/progress_manager.py --status

2. Classify PDFs (optional; classification also runs automatically during extraction)

python scripts/pdf_classifier.py --classify

3. Start Extraction

# Start fresh extraction
python scripts/main_extractor.py --start

# Resume from last checkpoint
python scripts/main_extractor.py --resume

# Start from specific PDF
python scripts/main_extractor.py --from-pdf AC_216

4. Monitor Progress

# Check current status
python scripts/progress_manager.py --status

# Detailed district-wise breakdown
python scripts/progress_manager.py --status --detailed

5. Manual Verification

# Interactive verification mode
python scripts/manual_verifier.py --interactive

# Check specific PDF
python scripts/manual_verifier.py --check AC_216

# Verify record count
python scripts/manual_verifier.py --verify-count AC_216:307

# Review all flagged PDFs
python scripts/manual_verifier.py --review-flagged

# Approve high-quality extractions in batch
python scripts/manual_verifier.py --approve-batch --min-confidence 0.95

Directory Structure

form20/
├── README.md                    # This file
├── IMPLEMENTATION_PLAN.md       # Detailed implementation plan
├── EXTRACTION_RULES.md          # Extraction rules and standards
├── report.md                    # PDF parseability analysis
├── required_fields              # List of fields to extract
│
├── scripts/                     # Core extraction scripts
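│   ├── main_extractor.py       # Main orchestrator
│   ├── pdf_classifier.py       # PDF tier classification
│   ├── progress_manager.py     # Progress tracking
│   ├── validator.py            # Validation and quality reports
│   ├── manual_verifier.py      # Manual verification interface
│   └── [extractors...]         # Tier-specific extractors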
│
├── config/                      # Configuration files
│   ├── extraction_config.json  # Extraction settings
│   └── quality_thresholds.json # Validation thresholds
│
├── tracking/                    # Progress and logs
│   ├── extraction_progress.json # Main progress file
│   ├── quality_metrics.json    # Quality metrics
│   ├── error_log.json          # Error tracking
│   └── manual_corrections.json # Manual corrections log
│
├── output/                      # Extraction results
│   ├── extracted_data/         # Per-PDF extracted data
│   ├── validation_reports/     # Validation reports
│   └── consolidated_output.csv # Final consolidated data
│
└── VIDHANSABHA_2024/           # Source PDFs (287 files)
    └── [36 district folders]

Key Features

1. Tiered Extraction System

Tier 1 - Standard English Format (70% of PDFs)

Direct text extraction
High success rate (95%+)
Fast processing

Tier 2 - Local Language Format (17% of PDFs)

Unicode/Devanagari support
Transliteration capabilities
Medium complexity

Tier 3 - Scanned/Rotated Format (13% of PDFs)

OCR with preprocessing
Image rotation correction
Highest complexity
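
The README does not say how a PDF is assigned to a tier. As a minimal sketch only, the function below assumes the text layer of a page has already been extracted into a string, and applies two hypothetical heuristics: very little text suggests a scan (Tier 3), and a high share of Devanagari letters suggests a local-language form (Tier 2). The names `classify_tier`, `min_chars`, and `devanagari_cutoff`, and both threshold values, are illustrative assumptions rather than values taken from pdf_classifier.py.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return devanagari / len(letters)


def classify_tier(page_text: str, min_chars: int = 200,
                  devanagari_cutoff: float = 0.3) -> int:
    """Return 1, 2, or 3 using assumed heuristics (not the project's real rules)."""
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        # Almost no text layer: probably a scanned image that needs OCR.
        return 3
    if devanagari_ratio(stripped) >= devanagari_cutoff:
        # Mostly Devanagari letters: local-language format.
        return 2
    return 1
```

A real classifier would likely also look at page rotation and image content, which a text-only check cannot see.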

2. Progress Tracking

Automatic checkpoints every 10 PDFs
Resume capability from any point
Real-time dashboard showing progress
Backup system for recovery
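
The checkpoint-and-resume behaviour described above could look roughly like the following sketch. It assumes a hypothetical progress file with a single `completed` list; the real schema of tracking/extraction_progress.json is not documented here. The write goes to a temporary file first and is then renamed into place, so a crash mid-write does not leave a half-written progress file.

```python
import json
import os
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 10  # matches the interval stated above


def save_checkpoint(state: dict, path: Path) -> None:
    """Write state atomically: dump to a temp file, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)


def run(pdf_ids, extract, path: Path) -> dict:
    """Process pdf_ids in order, skipping any already recorded in the checkpoint."""
    if path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
    else:
        state = {"completed": []}  # hypothetical schema for this sketch
    done = set(state["completed"])
    for pdf_id in pdf_ids:
        if pdf_id in done:
            continue
        extract(pdf_id)  # may raise; work since the last checkpoint is then redone
        state["completed"].append(pdf_id)
        if len(state["completed"]) % CHECKPOINT_EVERY == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

With this design, a crash loses at most the PDFs processed since the last checkpoint, and they are simply extracted again on resume.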

3. Quality Control

Validation rules for data consistency
Quality scoring (0-1 scale)
Automatic flagging for review
Mathematical verification of vote totals

4. Manual Verification

Interactive interface for corrections
Side-by-side PDF viewing capability
Batch approval for high-confidence results
Correction logging for audit trail

Common Operations

Check Extraction Quality

# Generate quality report
python scripts/validator.py --quality-report

# Check specific PDF quality
python scripts/manual_verifier.py --check AC_216

Handle Failures

# Reset failed PDF to retry
python scripts/progress_manager.py --reset AC_216

# Mark PDF as complete manually
python scripts/progress_manager.py --mark-complete AC_216:307:0.95
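
The `--mark-complete` argument appears to pack three values into one colon-separated string. The README does not define the fields; judging from `--verify-count AC_216:307` and the 0-1 quality scale, this sketch assumes they are the PDF identifier, the record count, and the quality score. `parse_completion` and `Completion` are hypothetical names, not part of progress_manager.py.

```python
from typing import NamedTuple


class Completion(NamedTuple):
    pdf_id: str
    record_count: int
    quality_score: float


def parse_completion(arg: str) -> Completion:
    """Parse 'ID:COUNT:SCORE', rejecting malformed or out-of-range values."""
    parts = arg.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected ID:COUNT:SCORE, got {arg!r}")
    pdf_id, count_text, score_text = parts
    count = int(count_text)    # raises ValueError on non-integer input
    score = float(score_text)  # raises ValueError on non-numeric input
    if count < 0:
        raise ValueError("record count must be non-negative")
    if not 0.0 <= score <= 1.0:
        raise ValueError("quality score must be between 0 and 1")
    return Completion(pdf_id, count, score)
```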

Create Checkpoints

# Create named checkpoint
python scripts/progress_manager.py --checkpoint "after_district_pune"

# Automatic checkpoints created every 10 PDFs

Export Results

# Export final consolidated CSV
python scripts/main_extractor.py --export-final

# Generate comprehensive report
python scripts/manual_verifier.py --report

Monitoring Dashboard

During extraction, the system displays:

====================================
Form 20 Extraction Progress
====================================
Total PDFs: 287
Processed: 145 (50.5%)
Pending: 132
Failed: 8
Manual Review: 2

Current: AC_146 (Nagpur)
Tier: 1 (Standard)
Records Extracted: 298
Quality Score: 0.96

Recent Completions:
✓ AC_145: 312 records (Q: 0.98)
✓ AC_144: 289 records (Q: 0.94)

Estimated Time Remaining: 2h 15m
====================================

Validation Rules

The system validates:

Record count within expected range
Vote totals mathematical consistency
Required fields presence
Data types correctness
Duplicate detection in polling stations
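
As an illustration of how these checks might fit together, the sketch below validates a list of per-station records. The field names (`station_no`, `candidate_votes`, `total_valid_votes`) and the default expected range are assumptions made for this example; the actual field list lives in required_fields and the thresholds in config/quality_thresholds.json.

```python
from collections import Counter


def validate_records(records, expected_range=(1, 1000)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    low, high = expected_range
    if not low <= len(records) <= high:
        problems.append(f"record count {len(records)} outside {low}-{high}")

    required = ("station_no", "candidate_votes", "total_valid_votes")
    for index, rec in enumerate(records):
        missing = [name for name in required if name not in rec]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
            continue
        votes = rec["candidate_votes"]
        if not all(isinstance(v, int) and v >= 0 for v in votes):
            problems.append(f"row {index}: non-integer or negative vote count")
            continue
        # Mathematical consistency: per-candidate votes must add up to the stated total.
        if sum(votes) != rec["total_valid_votes"]:
            problems.append(
                f"row {index}: candidate votes sum to {sum(votes)}, "
                f"total says {rec['total_valid_votes']}"
            )

    # Duplicate detection: each polling station should appear once.
    counts = Counter(rec["station_no"] for rec in records if "station_no" in rec)
    for station, n in counts.items():
        if n > 1:
            problems.append(f"station {station} appears {n} times")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a PDF at once.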

Error Recovery

Automatic Recovery

Retry failed PDFs up to 3 times
Fallback to lower tier extraction
Checkpoint restoration on crash
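
One way to combine retries with tier fallback is sketched below. It assumes "up to 3 times" means three attempts per extractor, and it leaves the fallback order to the caller, since the README does not say which tier follows which. `extract_with_recovery` and the `(name, callable)` pairs are hypothetical, not the interface of main_extractor.py.

```python
def extract_with_recovery(pdf_id, extractors, max_retries=3):
    """Try each (name, extractor) pair in order, giving each up to max_retries attempts.

    Returns (name_of_extractor_that_worked, result). Raises RuntimeError if all fail.
    """
    errors = []
    for name, extractor in extractors:
        for attempt in range(1, max_retries + 1):
            try:
                return name, extractor(pdf_id)
            except Exception as exc:  # a real system would catch narrower error types
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError(f"{pdf_id}: all extractors failed ({'; '.join(errors)})")
```

The collected error messages end up in the final exception, which makes them easy to copy into an error log for later review.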

Manual Recovery

# Emergency stop
python scripts/main_extractor.py --emergency-stop

# Skip problematic PDF
python scripts/main_extractor.py --skip AC_123

# Force reprocess
python scripts/main_extractor.py --reprocess AC_123

Performance Metrics

Target Performance:

Processing Rate: 15-20 PDFs/hour
Success Rate: 95%+
Data Accuracy: 98%+
Field Completeness: 90%+
Manual Intervention: <10%

Troubleshooting

Common Issues

1. Memory Issues

# Reduce parallel processing
# Edit config/extraction_config.json
# Set "parallel_processing": {"tier_1": 2}

2. OCR Failures

# Check Tesseract installation
tesseract --version

# Install language packs
sudo apt-get install tesseract-ocr-mar  # Marathi

3. Progress File Corruption

# Restore from backup
cp backups/progress_checkpoint_latest.json tracking/extraction_progress.json

Advanced Configuration

Edit config/extraction_config.json:

{
  "max_retries": 3,
  "timeout_seconds": 300,
  "batch_size": 10,
  "quality_threshold": 0.85,
  "parallel_processing": {
    "tier_1": 4,
    "tier_2": 2,
    "tier_3": 1
  }
}
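
A partial config file, such as one that only lowers `tier_1` parallelism, can be merged over defaults as in the sketch below. The default values mirror the example above. `load_config` and `needs_review` are hypothetical helpers, and treating `quality_threshold` as the cut-off for flagging a PDF for review is an assumption, not documented behaviour.

```python
import json

DEFAULTS = {
    "max_retries": 3,
    "timeout_seconds": 300,
    "batch_size": 10,
    "quality_threshold": 0.85,
    "parallel_processing": {"tier_1": 4, "tier_2": 2, "tier_3": 1},
}


def load_config(text: str) -> dict:
    """Parse JSON config text and fill in any missing keys from DEFAULTS."""
    config = json.loads(text)
    merged = {**DEFAULTS, **config}
    # Merge the nested section separately so a partial override keeps the other tiers.
    merged["parallel_processing"] = {
        **DEFAULTS["parallel_processing"],
        **config.get("parallel_processing", {}),
    }
    return merged


def needs_review(quality_score: float, config: dict) -> bool:
    """Assumed rule: scores below quality_threshold are flagged for manual review."""
    return quality_score < config["quality_threshold"]
```

For example, `load_config('{"parallel_processing": {"tier_1": 2}}')` keeps the defaults for `tier_2` and `tier_3` while lowering `tier_1` to 2.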

System Requirements

Python 3.8+
8GB RAM minimum (16GB recommended)
10GB free disk space
PDF processing libraries (see requirements.txt)

Support Files

IMPLEMENTATION_PLAN.md - Detailed implementation phases
EXTRACTION_RULES.md - Complete extraction ruleset
report.md - Initial PDF analysis findings

Notes

Always run progress_manager.py --init before first extraction
Monitor error_log.json for systematic issues
Create checkpoints before major operations
Use manual verification for critical data
Keep backups of tracking files

Contact

For issues or questions, refer to the documentation files or check the error logs in the tracking/ directory.

The system is designed for incremental, quality-controlled extraction of Maharashtra election Form 20 data, with progress tracking and manual verification throughout.
