Data Harvesting from PDFs – CoreView

November 21, 2022

Data Harvesting from PDFs

By Sachin Kalaskar

Customer Challenges:

Extract information from complex, freeform PDFs of very large size (1,000+ pages each)
No fixed layouts
Tables represented as data or images

Result:

70% extraction accuracy, with exceptions identified and labeled
New PDF layouts supported with incremental efforts
Zero license cost – Open source solution
More than 10x improvement in data extraction speed

CoreView Solution:

Regex-based configurable parsing
Recursive algorithms to navigate n-level sections-subsections-body
Heuristics-based correlation
OCR-powered image table parsing
Modular data pipeline
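
As a rough illustration of the first two points, the sketch below keeps heading patterns in a configuration list and nests the matched headings recursively. It assumes the text lines have already been extracted from the PDF. The pattern list, the `Section` class, and all function names are hypothetical and are not taken from the CoreView implementation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical configuration: one regex per heading style.
# Supporting a new layout would mean adding or adjusting patterns here.
HEADING_PATTERNS = [
    re.compile(r"^(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>\S.*)$"),
]


@dataclass
class Section:
    number: str
    title: str
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def match_heading(line):
    """Return (number, title) if the line matches a configured pattern, else None."""
    for pattern in HEADING_PATTERNS:
        m = pattern.match(line.strip())
        if m:
            return m.group("num"), m.group("title")
    return None


def depth(number):
    """'1' -> 1, '1.1' -> 2, '1.1.1' -> 3."""
    return number.count(".") + 1


def build_children(lines, pos, parent, parent_depth):
    """Recursively attach lines[pos:] under parent; return the index where parsing stopped."""
    while pos < len(lines):
        heading = match_heading(lines[pos])
        if heading is None:
            # Plain text belongs to the section currently being built.
            if lines[pos].strip():
                parent.body.append(lines[pos].strip())
            pos += 1
            continue
        number, title = heading
        if depth(number) <= parent_depth:
            return pos  # sibling or ancestor heading: let the caller handle it
        child = Section(number, title)
        parent.children.append(child)
        pos = build_children(lines, pos + 1, child, depth(number))
    return pos


def parse_sections(lines):
    """Build a section tree under a synthetic root node."""
    root = Section("", "ROOT")
    build_children(lines, 0, root, 0)
    return root
```

Calling `parse_sections` on a list of lines returns a tree whose nesting follows the heading numbers. The recursion depth equals the section nesting depth, not the page count, so long documents are not a problem in that respect. A single numeric pattern like this one will misread body lines that happen to start with a number, which is where additional patterns or heuristics would be needed.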

Scope:

Parse PDFs to understand sections, headers, body, tables
Correlation of related data and sections across the document
Parsing tables in PDFs to understand and mine data
Multiple PDF documents with different layouts
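
The correlation item in this list could take many forms. One minimal sketch, assuming a simple word-overlap heuristic, links each table caption to the section title it most resembles and leaves weak matches unlinked so they can be labeled as exceptions. The function names and the 0.3 threshold are illustrative assumptions, not details from the project.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words used for the overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def correlate(captions, section_titles, threshold=0.3):
    """Map each caption to its best-matching section title, or None if no match is strong enough."""
    links = {}
    for caption in captions:
        best_title, best_score = None, 0.0
        for title in section_titles:
            score = jaccard(tokens(caption), tokens(title))
            if score > best_score:
                best_title, best_score = title, score
        # Hypothetical threshold: weak matches are kept as None (an exception to review).
        links[caption] = best_title if best_score >= threshold else None
    return links
```

Captions mapped to `None` are the cases a reviewer would look at, which fits the idea of extracting what can be matched confidently and flagging the rest.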
