PDF-Image-Text-Extractor

A web tool and REST API built with Flask that extracts text from PDFs and images using Tesseract OCR and PyMuPDF. OCR runs in English and Hindi by default; the language set is customizable.

🚀 Features

Upload PDFs or images (JPG, PNG, TIFF, BMP, GIF) via browser
Extracts both digital text and scanned OCR text
Returns the result as JSON (REST API)
Beautiful Bootstrap web interface
Multilingual OCR (default: English + Hindi)
Works on Windows, macOS, and Linux
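As a rough illustration of how "digital text plus scanned OCR text" can be combined, the sketch below prefers a page's embedded text layer and falls back to OCR only when that layer looks empty. It is not the code in ocr_utils.py: the names `choose_text` and `extract_pdf`, the `min_chars` threshold, and the choice to render pages with PyMuPDF are all assumptions made for the example.

```python
def choose_text(digital_text, run_ocr, min_chars=20):
    """Prefer the embedded text layer; call run_ocr() only when it looks empty."""
    cleaned = digital_text.strip()
    if len(cleaned) >= min_chars:
        return {"source": "digital", "text": cleaned}
    return {"source": "ocr", "text": run_ocr().strip()}


def extract_pdf(path, lang="eng+hin"):
    """Per-page extraction with PyMuPDF, with OCR fallback through pytesseract."""
    # Third-party imports are kept local so choose_text() works without them.
    import io
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            def run_ocr(page=page):
                # Render the page to a PNG image and hand it to Tesseract.
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                return pytesseract.image_to_string(img, lang=lang)

            pages.append(choose_text(page.get_text(), run_ocr))
    return pages
```

Tesseract accepts several languages joined with `+`, so `"eng+hin"` corresponds to the English + Hindi default mentioned above.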

🧩 Requirements

Install the following tools before running the app:

1. Install Tesseract OCR

Windows

Download from UB Mannheim Tesseract

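Default install path: C:\Program Files\Tesseract-OCR\tesseract.exe

Add this to app.py (it relies on `import os` and `import pytesseract` at the top of the file):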

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
os.environ["TESSDATA_PREFIX"] = r"C:\Program Files\Tesseract-OCR\tessdata"

Verify the installation and the available languages (`eng` and `hin` should be listed):

tesseract --list-langs
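Because the hard-coded Windows path will not exist on macOS or Linux, one option is to look for the binary on PATH first and only then fall back to the Windows default. The sketch below does that and also checks the language list for `eng` and `hin`. The helper names `resolve_tesseract_cmd` and `parse_langs` are hypothetical and are not part of app.py.

```python
import os
import platform
import shutil
import subprocess

# Default location used in the Windows instructions above.
WINDOWS_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(windows_default=WINDOWS_TESSERACT):
    """Return a path to the tesseract binary, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if platform.system() == "Windows" and os.path.isfile(windows_default):
        return windows_default
    return None


def parse_langs(output):
    """Parse `tesseract --list-langs` output into a set of language codes."""
    lines = [line.strip() for line in output.splitlines()]
    return {line for line in lines if line and not line.lower().startswith("list of")}


cmd = resolve_tesseract_cmd()
if cmd is None:
    print("Tesseract not found; install it or add it to PATH.")
else:
    # Some Tesseract versions print the list to stderr, so read both streams.
    result = subprocess.run([cmd, "--list-langs"], capture_output=True, text=True)
    installed = parse_langs(result.stdout + result.stderr)
    missing = {"eng", "hin"} - installed
    print("Using Tesseract at:", cmd)
    print("Missing language data:", sorted(missing) if missing else "none")
    # In app.py the resolved path would be assigned as:
    # pytesseract.pytesseract.tesseract_cmd = cmd
```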

### 2. Install Poppler

Windows

Download from https://blog.alivate.com.au/poppler-windows/

Extract it and set your Poppler path, e.g.:
C:\poppler-25.07.0\Library\bin

Then update in app.py:

POPPLER_PATH = r"C:\poppler-25.07.0\Library\bin"
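A quick way to confirm the path is correct is to check that the folder actually contains Poppler's `pdftoppm` binary, which PDF-to-image libraries such as pdf2image call under the hood. This is a minimal sketch; whether app.py uses pdf2image is an assumption, and `poppler_available` is a hypothetical helper.

```python
import shutil

# Path taken from the example above; adjust it to your own extraction folder.
POPPLER_PATH = r"C:\poppler-25.07.0\Library\bin"


def poppler_available(poppler_path=None):
    """Return True if pdftoppm is found in poppler_path, or on PATH when None."""
    return shutil.which("pdftoppm", path=poppler_path) is not None


if poppler_available(POPPLER_PATH):
    print("Poppler found in POPPLER_PATH")
elif poppler_available():
    print("Poppler found on the system PATH")
else:
    print("Poppler not found; check POPPLER_PATH")
```

On macOS and Linux, Poppler installed through a package manager is normally on PATH already, so the second branch is the one expected to succeed there.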


⚙️ Installation

1. Clone the repository:

git clone https://github.com/jadeitservices/ocr-json-api.git

cd ocr-json-api

2. Create a virtual environment:

python -m venv venv
venv\Scripts\activate     # on Windows
source venv/bin/activate  # on macOS/Linux

3. Install dependencies:

pip install -r requirements.txt
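After installing, a short check like the following can confirm that the main libraries are importable inside the virtual environment. The module list is an assumption based on the tools named in this README (Flask, pytesseract, PyMuPDF) plus Pillow; requirements.txt remains the authoritative list.

```python
import importlib.util

# Import name -> package name. Assumed from the README, not read from requirements.txt.
EXPECTED_MODULES = {
    "flask": "Flask",
    "pytesseract": "pytesseract",
    "fitz": "PyMuPDF",
    "PIL": "Pillow",
}


def check_modules(modules=EXPECTED_MODULES):
    """Map each import name to True or False depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, found in check_modules().items():
    status = "ok" if found else "MISSING"
    print(f"{EXPECTED_MODULES[name]:<12} ({name}): {status}")
```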

4. Run the Flask app:

python app.py

Then open the web interface in your browser:

http://127.0.0.1:5000/


API Endpoint:

POST http://127.0.0.1:5000/api/ocr

Example cURL:

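curl -X POST -F "file=@/path/to/document.pdf" http://127.0.0.1:5000/api/ocr

Replace the path with your own PDF or image. The form field name (`file` here) must match the one app.py reads from the upload.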
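The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the form field name `file` is a guess at what app.py expects, and the helpers `build_multipart` and `post_file` are hypothetical. The response is decoded as JSON because the API is described as returning JSON; its exact keys are not documented here.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, data):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path, url="http://127.0.0.1:5000/api/ocr", field_name="file"):
    """Upload a PDF or image to the OCR endpoint and return the decoded JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(field_name, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the Flask app running locally:
# result = post_file("/path/to/document.pdf")
# print(json.dumps(result, indent=2, ensure_ascii=False))
```

Passing `ensure_ascii=False` when printing keeps Hindi text readable instead of showing escaped code points.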
