A simple yet powerful web tool and REST API built with **Flask** that extracts text from **PDFs** or **images** using `Tesseract OCR` and `PyMuPDF`. It supports both **English** and **Hindi** languages (customizable).

jadeitservices/PDF-Image-Text-Extractor

PDF-Image-Text-Extractor

A simple yet powerful web tool and REST API built with Flask that extracts text from PDFs or images using Tesseract OCR and PyMuPDF. It supports both English and Hindi languages (customizable).


🚀 Features

  • Upload PDFs or images (JPG, PNG, TIFF, BMP, GIF) via browser
  • Extracts both digital text and scanned OCR text
  • Returns the result as JSON (REST API)
  • Beautiful Bootstrap web interface
  • Multilingual OCR (default: English + Hindi)
  • Works on Windows, macOS, and Linux
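The "digital text and scanned OCR text" feature can be pictured as a per-page fallback: read the embedded text layer first, and run OCR only when that layer is empty or nearly so. The sketch below is one possible shape for this, not the project's actual code. The names `needs_ocr`, `extract_pdf_text`, and the `MIN_CHARS` threshold are hypothetical, and rendering pages with PyMuPDF before passing them to Tesseract is an assumption.

```python
import io

MIN_CHARS = 10  # hypothetical threshold: fewer characters means "treat as scanned"


def needs_ocr(page_text, min_chars=MIN_CHARS):
    """Return True when a page's embedded text layer is too short to trust."""
    return len(page_text.strip()) < min_chars


def extract_pdf_text(path, lang="eng+hin", dpi=300):
    """Return one text string per page, using OCR only where needed."""
    # Third-party imports stay local so needs_ocr() works without them installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to an image and run Tesseract on it.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image, lang=lang)
            pages.append(text)
    return pages
```

The `lang="eng+hin"` string is Tesseract's syntax for combining language packs, which matches the default English + Hindi setup listed above.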

🧩 Requirements

Install the following tools before running the app:

1. Install Tesseract OCR

Windows

Download the installer from UB Mannheim Tesseract.

Default path:

C:\Program Files\Tesseract-OCR

Add this to app.py:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files\Tesseract-OCR\tessdata"

Verify the installation by listing the available languages:

tesseract --list-langs
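The same language check can be scripted. This is a minimal sketch that assumes `tesseract` is on the PATH and that the default languages map to the `eng` and `hin` language packs. The helper names `parse_langs` and `missing_langs` are hypothetical and not part of the app.

```python
import shutil
import subprocess

REQUIRED_LANGS = {"eng", "hin"}  # English + Hindi language packs


def parse_langs(output):
    """Parse `tesseract --list-langs` output, skipping the header line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def missing_langs(output, required=REQUIRED_LANGS):
    """Return the required language codes that are not installed, sorted."""
    return sorted(set(required) - set(parse_langs(output)))


if __name__ == "__main__":
    exe = shutil.which("tesseract")
    if exe is None:
        print("tesseract not found on PATH")
    else:
        result = subprocess.run([exe, "--list-langs"],
                                capture_output=True, text=True)
        # Some Tesseract versions print the list to stderr, so read both streams.
        combined = result.stdout + result.stderr
        print("missing languages:", missing_langs(combined) or "none")
```

If `hin` is reported as missing, the Hindi traineddata file needs to be added to the tessdata folder before Hindi OCR will work.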

2. Install Poppler

Windows

Download from https://blog.alivate.com.au/poppler-windows/

Extract it and set your Poppler path, e.g.:
C:\poppler-25.07.0\Library\bin

Then update the path in app.py:

POPPLER_PATH = r"C:\poppler-25.07.0\Library\bin"
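A wrong Poppler path usually only shows up when the first PDF is converted, so a quick check at startup can help. The sketch below assumes the app uses Poppler through a wrapper such as pdf2image, which calls the `pdftoppm` and `pdfinfo` executables. That is an assumption about app.py, and the helper name `missing_poppler_tools` is hypothetical.

```python
from pathlib import Path

POPPLER_PATH = r"C:\poppler-25.07.0\Library\bin"
REQUIRED_TOOLS = ("pdftoppm", "pdfinfo")  # assumed: the tools pdf2image invokes


def missing_poppler_tools(poppler_path, tools=REQUIRED_TOOLS):
    """Return the tools that are not present in the given Poppler bin folder."""
    folder = Path(poppler_path)
    missing = []
    for tool in tools:
        # Accept both the bare name (macOS/Linux) and the .exe name (Windows).
        candidates = (folder / tool, folder / (tool + ".exe"))
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(tool)
    return missing


if __name__ == "__main__":
    missing = missing_poppler_tools(POPPLER_PATH)
    if missing:
        print("Poppler tools not found in", POPPLER_PATH, ":", ", ".join(missing))
    else:
        print("Poppler looks correctly configured.")
```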


⚙️ Installation

1. Clone the repository:

git clone https://github.com/jadeitservices/ocr-json-api.git

cd ocr-json-api

2. Create a virtual environment:

python -m venv venv
venv\Scripts\activate     # on Windows
source venv/bin/activate  # on macOS/Linux

3. Install dependencies:

pip install -r requirements.txt

4. Run the Flask app:

python app.py

Open the web interface in your browser:

http://127.0.0.1:5000/


API Endpoint:

POST http://127.0.0.1:5000/api/ocr
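For orientation, an endpoint of this kind might look roughly like the sketch below. It is not the project's app.py. The form field name `file`, the JSON keys, the status codes, and the names `allowed_file`, `create_app`, and `extract_text` are all assumptions made for illustration. Only the route and the accepted file types come from this README.

```python
# Extensions taken from the Features list; "jpeg" and "tif" are assumed aliases.
ALLOWED_EXTENSIONS = {"pdf", "jpg", "jpeg", "png", "tiff", "tif", "bmp", "gif"}


def allowed_file(filename):
    """Return True when the filename has one of the supported extensions."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


def create_app(extract_text):
    """Build a Flask app; extract_text(data: bytes, filename: str) -> str is injected."""
    from flask import Flask, jsonify, request  # local import keeps allowed_file() standalone

    app = Flask(__name__)

    @app.route("/api/ocr", methods=["POST"])
    def ocr():
        upload = request.files.get("file")  # assumed form field name
        if upload is None or upload.filename == "":
            return jsonify(error="no file uploaded"), 400
        if not allowed_file(upload.filename):
            return jsonify(error="unsupported file type"), 415
        text = extract_text(upload.read(), upload.filename)
        return jsonify(filename=upload.filename, text=text)

    return app
```

Passing the extraction function in as an argument keeps the HTTP layer separate from the OCR logic, which makes the route easy to exercise with a stub.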

Example cURL:

curl -X POST -F "[email protected]" http://127.0.0.1:5000/api/ocr
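A rough Python equivalent of an upload request, using only the standard library, is sketched below. The form field name `file` is an assumption (check app.py for the real one), and the helper names `build_multipart` and `post_file` are hypothetical.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field, filename, data):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path, url="http://127.0.0.1:5000/api/ocr", field="file"):
    """Upload a local file to the OCR endpoint and return the decoded JSON reply."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(field, os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Flask app running locally, `post_file("scan.pdf")` would send the file and return whatever JSON object the API produces.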
