A sophisticated multi-stage AI-powered pipeline for analyzing and processing business books, specifically designed to extract actionable insights, frameworks, and implementation strategies from PDF documents.
This pipeline transforms a PDF business book into a comprehensive, consolidated analysis document with zero content loss and optimized organization. It uses multiple AI models (OpenAI's GPT-4.1, GPT-4.1-mini, and O3, plus Google's Gemini 2.5 Flash-Lite) at different processing stages to maximize quality and efficiency.
The pipeline consists of 7 distinct phases executed across 7 Python scripts:
Script: analyze_book.py
- Extracts text and images from PDF pages
- Creates incremental overview using Gemini 2.5 Flash-Lite (fastest) or GPT-4.1-mini
- Performs detailed page-by-page analysis using GPT-4.1
- Generates initial markdown document with actionable business insights
- Output: `book_analysis.md`, `book_analysis_state.json`, `extracted_pages/`
Script: analyze_semantic_content.py
- Identifies semantic boundaries and content chunks
- Creates content fingerprints for each chunk (frameworks, examples, action items)
- Output: `semantic_analysis_results.json`
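A content fingerprint of the kind described might look like the sketch below. This is purely illustrative: the pipeline derives fingerprints with an AI model rather than keyword heuristics, and the field names here are assumptions, not the script's actual schema.

```python
def make_fingerprint(chunk_id: str, text: str) -> dict:
    """Build a lightweight content fingerprint for one semantic chunk.
    Illustrative only: the real pipeline uses an AI model, and its
    JSON schema may differ from these hypothetical fields."""
    lowered = text.lower()
    return {
        "chunk_id": chunk_id,
        "has_framework": "framework" in lowered,
        "has_example": "example" in lowered or "case study" in lowered,
        "word_count": len(text.split()),
    }

fp = make_fingerprint("ch_01", "A pricing framework, with one worked example.")
# fp["has_framework"] and fp["has_example"] are both True
```

Fingerprints like this let the later similarity phase compare chunks without re-reading their full text.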
Script: run_similarity_analysis.py
- Analyzes semantic similarity between all content chunks
- Categorizes similarities as high (>80%), medium (50-80%), or low (<50%)
- Output: updated `semantic_analysis_results.json` with similarity data
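The three similarity tiers above can be sketched as a simple categorization step (a minimal illustration of the thresholds; the actual script's function names and chunk format may differ):

```python
def categorize_similarity(score: float) -> str:
    """Bucket a pairwise similarity score (0.0-1.0) into the
    pipeline's three consolidation tiers."""
    if score > 0.80:
        return "high"    # candidates for merging
    elif score >= 0.50:
        return "medium"  # grouped, unique elements preserved
    return "low"         # kept as standalone sections

# Example: tag each pair of chunks with its tier
pairs = [(("c1", "c2"), 0.91), (("c1", "c3"), 0.62), (("c2", "c3"), 0.18)]
tagged = {ids: categorize_similarity(s) for ids, s in pairs}
# tagged == {("c1", "c2"): "high", ("c1", "c3"): "medium", ("c2", "c3"): "low"}
```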
Script: create_consolidation_map.py
- Creates intelligent consolidation strategy based on similarity analysis
- Defines merge rules for high similarity content
- Groups medium similarity content while preserving unique elements
- Preserves low similarity content as standalone sections
- Output: `consolidation_map.json`
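The merge rules described above amount to mapping each similarity tier to an action. A minimal sketch (the real `consolidation_map.json` schema and the script's internals may differ):

```python
def build_consolidation_map(tagged_pairs: dict) -> list:
    """Derive a consolidation action per chunk pair from its similarity
    tier, mirroring the rules above: merge high-similarity content,
    group medium-similarity content while preserving unique elements,
    and keep low-similarity content standalone. Illustrative only."""
    actions = {
        "high": "merge",
        "medium": "group_preserve_unique",
        "low": "keep_standalone",
    }
    return [{"chunks": list(ids), "action": actions[tier]}
            for ids, tier in tagged_pairs.items()]

plan = build_consolidation_map({
    ("c1", "c2"): "high",
    ("c1", "c3"): "medium",
    ("c2", "c3"): "low",
})
# plan[0] == {"chunks": ["c1", "c2"], "action": "merge"}
```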
Script: consolidate_document.py
- Executes the consolidation plan using GPT-4.1
- Merges similar content while preserving all unique insights
- Creates professionally structured consolidated document
- Output: `book_analysis_consolidated.md`, `consolidation_report.json`
Script: finalize_with_o3.py
- Uses OpenAI's O3 model for final presentation polish
- Improves formatting, creates clean table of contents, fixes headers
- Maintains 100% content preservation while enhancing readability
- Output: `book_analysis_final.md`, `finalization_report.json`
Script: fact_check_analysis.py
- Validates final analysis against original book pages for factual accuracy
- Uses Gemini 2.5 Flash-Lite for efficient fact-checking (with OpenAI fallback)
- Identifies misrepresentations, omissions, and factual errors
- Output: `fact_check_control_log.json`, `fact_check_state.json`
- Python 3.8+
- OpenAI API key with access to GPT-4.1, GPT-4.1-mini, and O3 models
- Google API key (optional, for faster overview processing with Gemini)
1. Clone or download the pipeline files
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables by creating a `.env` file with:
   ```bash
   OPENAI_API_KEY=your-openai-api-key-here
   GOOGLE_API_KEY=your-google-api-key-here  # Optional but recommended
   ```
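Since a missing OpenAI key halts the pipeline ("OPENAI_API_KEY must be set"), it can help to validate the environment before starting a run. A hypothetical helper, not part of the pipeline's scripts:

```python
def check_required_keys(env: dict) -> list:
    """Return the names of required API keys missing from `env`.
    GOOGLE_API_KEY is optional (Gemini steps fall back to OpenAI),
    so only OPENAI_API_KEY is required. Hypothetical helper for
    illustration; the real scripts do their own checks."""
    required = ("OPENAI_API_KEY",)
    return [k for k in required if not env.get(k)]

check_required_keys({})                          # → ["OPENAI_API_KEY"]
check_required_keys({"OPENAI_API_KEY": "sk-…"})  # → []
```

In practice you would pass `dict(os.environ)` after loading the `.env` file.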
For a new book analysis:

```bash
# Phase 1: Extract and analyze the book
python analyze_book.py path/to/your/book.pdf

# Phase 2: Semantic content analysis
python analyze_semantic_content.py book_analysis.md

# Phase 3: Similarity analysis
python run_similarity_analysis.py

# Phase 4: Create consolidation mapping
python create_consolidation_map.py

# Phase 5: Consolidate document
python consolidate_document.py

# Phase 6: Final enhancement with O3
python finalize_with_o3.py

# Phase 7: Fact-check validation (internal testing only; still under heavy development)
python fact_check_analysis.py
```
Resume an interrupted analysis:

```bash
# The pipeline automatically resumes from the last processed page
python analyze_book.py path/to/book.pdf
```

Force re-extraction of pages:

```bash
python analyze_book.py path/to/book.pdf --start-page 1 --force-extract
```

Custom output filenames:

```bash
python finalize_with_o3.py book_analysis_consolidated.md my_final_output.md
```

Fact-check specific page ranges:

```bash
# Check only pages 50-100
python fact_check_analysis.py --start_page 50 --end_page 100

# Resume interrupted fact-checking
python fact_check_analysis.py

# Reset and start fresh
python fact_check_analysis.py --reset
```
- `book_analysis_final.md` - The final polished business analysis document
- `book_analysis_consolidated.md` - Consolidated document before final polish
- `book_analysis.md` - Initial analysis document from phase 1
- `semantic_analysis_results.json` - Semantic chunks and similarity data
- `consolidation_map.json` - Consolidation strategy and rules
- `book_analysis_state.json` - Analysis progress state (for resuming)
- `fact_check_state.json` - Fact-checking progress state (for resuming)
- `finalization_report.json` - Final enhancement process details
- `consolidation_report.json` - Consolidation process summary
- `fact_check_control_log.json` - Fact-checking results and discrepancy tracking
- `preservation_log.txt` - Content preservation verification
- `completion_log.txt` - Overall pipeline execution log
- `extracted_pages/` - Individual page images from the PDF
The pipeline strategically uses different AI models for optimal cost and quality:
- Gemini 2.5 Flash-Lite: Overview updates (fastest, cheapest)
- GPT-4.1-mini: Semantic analysis and similarity detection (cost-effective)
- GPT-4.1: Detailed page analysis and consolidation (high quality)
- O3: Final enhancement and reasoning (superior polish)
- Zero Content Loss: All unique insights, frameworks, and examples are preserved
- Similarity Detection: Intelligently identifies redundant content for consolidation
- Unique Element Tracking: Explicitly preserves distinctive value from each section
- Actionable Focus: Transforms book content into business implementation tools
- Clean Formatting: Professional markdown with clickable table of contents
- Logical Structure: Optimized flow and organization for reference use
- State Management: Automatically saves progress and resumes from interruptions
- Incremental Updates: Only processes new or changed content
- Error Recovery: Graceful fallbacks for API failures
"OPENAI_API_KEY must be set"
- Ensure your `.env` file contains a valid OpenAI API key
- Verify the key has access to the required models (GPT-4.1, O3)

"Required file not found"
- Run the previous phase scripts in order
- Check that output files from previous phases exist

"Empty response from API"
- Check your API quotas and rate limits
- Verify API keys are valid and have sufficient credits

PDF extraction fails
- Ensure the PDF is not password protected
- Check that pdf2image dependencies are properly installed
- On Linux: `sudo apt-get install poppler-utils`
- Use Google API key for faster overview processing
- Run on a machine with a stable, fast internet connection to speed up API calls
- Monitor API usage as the pipeline makes many calls
- Start with smaller sections to test the pipeline
The pipeline can be customized by modifying:
- Model selections in each script's constants
- Prompts for different analysis focuses
- Similarity thresholds in similarity analysis
- Consolidation strategies in mapping logic
For issues or questions about the pipeline:
- Check the completion logs for detailed error information
- Verify all dependencies are correctly installed
- Ensure API keys have sufficient credits and model access
- Review the troubleshooting section above
This pipeline is designed for business book analysis and educational purposes. Ensure you have proper rights to analyze any copyrighted materials.