README.md

Current URL for app → https://table-extractor-johnnykfeng.streamlit.app/
Report of this project can be found here → https://johnnykfeng.github.io/Table-extraction/
Public Github repo → https://github.com/johnnykfeng/Table-Extractor-app
Video Demo → Loom

How-to-use

There are three sliders and one checkbox for adjusting the parameters of the ML model. Most of the time the default settings don't need to be changed. If adjustments are needed, here are some guidelines.

TD (table detection) threshold: Increase if the model labels non-table regions as tables. Decrease if the model misses some tables.
TSR (table structure recognition) threshold: Increase if too many rows/columns are found. Decrease if some rows/columns are not identified.
Crop padding: Increase if the edges of the table are cut off. Decrease if unwanted text outside the table is captured.
First row header: Check this box to treat the first row as the header, i.e. the column names. Otherwise, columns are numbered starting at 0.
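
To make the guidelines above concrete, here is a minimal sketch of how a confidence threshold and crop padding typically behave. It is not taken from the app's source: the `Detection` class, `filter_by_threshold`, `pad_box`, and all the numbers are hypothetical, and the app may implement these steps differently.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one model prediction."""
    label: str
    score: float  # model confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def filter_by_threshold(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d.score >= threshold]


def pad_box(box, padding, image_width, image_height):
    """Grow a box by `padding` pixels on every side, clamped to the image."""
    x_min, y_min, x_max, y_max = box
    return (
        max(0, x_min - padding),
        max(0, y_min - padding),
        min(image_width, x_max + padding),
        min(image_height, y_max + padding),
    )


# Made-up detections: one confident table and one doubtful region.
detections = [
    Detection("table", 0.97, (40, 60, 500, 300)),
    Detection("table", 0.55, (10, 400, 120, 430)),
]

strict = filter_by_threshold(detections, 0.9)   # drops the doubtful region
lenient = filter_by_threshold(detections, 0.5)  # keeps both regions
padded = pad_box(strict[0].box, 20, image_width=520, image_height=800)
```

A higher threshold discards low-confidence regions (fewer false tables, but more missed ones), while larger padding widens the crop and can pull in surrounding text.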

You can experiment with these settings using the pre-loaded samples in the dropdown menu. To use your own image, drag and drop it or browse your local files. At the moment, the app only accepts image files such as PNG or JPG. Click the Run table extractor button to start the process.

Go to the Download csv page in the left sidebar to download the extracted data.
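
As a rough illustration of what the First row header option means for the downloaded file, the sketch below writes extracted cells to CSV text using only the Python standard library. The `rows_to_csv` function and the sample cells are hypothetical and are not the app's actual export code.

```python
import csv
import io


def rows_to_csv(rows, first_row_header=True):
    """Serialize a list of rows to CSV text.

    With first_row_header=True the first row supplies the column names.
    Otherwise the columns are numbered starting at 0.
    """
    if not rows:
        return ""
    if first_row_header:
        header, body = rows[0], rows[1:]
    else:
        width = max(len(row) for row in rows)
        header, body = [str(i) for i in range(width)], rows
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(body)
    return buffer.getvalue()


# Made-up cells standing in for an extracted table.
cells = [["Name", "Qty"], ["Apple", "3"], ["Pear", "5"]]
with_header = rows_to_csv(cells, first_row_header=True)
numbered = rows_to_csv(cells, first_row_header=False)
```

With the option off, the original first row is kept as data under a numbered header line.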

Hope you enjoy the app :)

Background

Table extraction from documents using machine learning involves training algorithms to automatically identify and extract tables from a given document. This process can be challenging, as tables can come in various formats and layouts, and may be embedded within larger documents such as research papers, reports, or financial statements. The successful implementation of ML-based table extraction can save significant time and resources compared to manual extraction methods, especially for large or complex documents with multiple tables. However, the accuracy of table extraction can be affected by factors such as the quality and consistency of input data, as well as the complexity of the document layout.

A highly accurate model has been developed by a team at Microsoft [1]. They trained their DETR-based model (DETR, or Detection Transformer, was introduced in "End-to-End Object Detection with Transformers") on a very large dataset of approximately 1 million annotated tables. The original tables were scraped from the PubMed Central Open Access (PMCOA) database. The Microsoft team also formulated their own scoring criterion, Grid Table Similarity (GriTS), for assessing the accuracy of their model [2].

Project Status

2023-03-25 - A Streamlit app has been built around this work. The first prototype of the app has been deployed.
2023-05-01 - My Google API access has been restricted, so the app is down.
2023-05-15 - Fixed the problem and re-deployed the app with a new URL.

Future developments:

Support extracting multiple tables through parallel processes
Support PDF files and multi-page extraction
Detect header structures automatically for more complex tables
Further train the model via transfer learning to improve performance on hard cases

Resources

This project is made possible by https://github.com/microsoft/table-transformer
Thanks to Hugging Face for making the model accessible: https://huggingface.co/docs/transformers/model_doc/table-transformer
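
For readers who want to try the model outside the app, the following is a minimal sketch of table detection with the Hugging Face `transformers` library. It follows the pattern shown in the Table Transformer documentation linked above, but it is not this app's code: the `detect_tables` wrapper and its default threshold are assumptions. It requires `torch`, `transformers`, and `Pillow` (some `transformers` versions may also need `timm`), and it downloads the `microsoft/table-transformer-detection` checkpoint on first use.

```python
def detect_tables(image_path, threshold=0.9):
    """Return (score, box) pairs for tables detected in an image.

    Hypothetical helper. `threshold` plays a role similar to the
    TD threshold slider, though the app may apply it differently.
    """
    # Heavy dependencies are imported lazily, inside the function.
    import torch
    from PIL import Image
    from transformers import (
        AutoImageProcessor,
        TableTransformerForObjectDetection,
    )

    checkpoint = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Post-processing expects (height, width); PIL reports (width, height).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    # Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    return [
        (score.item(), [round(v, 1) for v in box.tolist()])
        for score, box in zip(result["scores"], result["boxes"])
    ]
```

Calling `detect_tables("samples/my_table.png")` (a hypothetical path) would return one entry per detected table.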

References

[1] "PubTables-1M: Towards comprehensive table extraction from unstructured documents".
[2] "GriTS: Grid table similarity metric for table structure recognition".
[3] "Aligning benchmark datasets for table structure recognition".

Contact

Created by John Feng.
Feel free to contact me at [email protected].
My website https://johnnykfeng.github.io/
