Data Harvesting from PDFs
Customer Challenges:
- Extract information from complex, freeform PDFs of very large size (1000+ pages each)
- No fixed layouts
- Tables embedded either as text data or as images
Result:
- 70% extraction accuracy, with exceptions identified and labeled
- New PDF layouts supported with incremental effort
- Zero license cost – Open source solution
- More than 10x improvement in data extraction speed
CoreView Solution:
- Regex-based configurable parsing
- Recursive algorithms to navigate n-level sections-subsections-body
- Heuristics-based correlation
- OCR-powered parsing of image tables
- Modular data pipeline
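
As an illustration of how regex-based configurable parsing and recursive section navigation can fit together, here is a minimal Python sketch. It is not the CoreView implementation: the heading patterns, the `Section` class, and the `parse_sections` function are hypothetical names, and the input is assumed to be plain text lines already extracted from a PDF.

```python
import re
from dataclasses import dataclass, field

# Hypothetical configuration: one regex per heading level, outermost first.
# Supporting a new layout would mean supplying a different list.
# Note: a body line that happens to start with a number would also match.
HEADING_PATTERNS = [
    re.compile(r"^(\d+)\s+(.+)$"),            # level 1, e.g. "1 Overview"
    re.compile(r"^(\d+\.\d+)\s+(.+)$"),       # level 2, e.g. "1.1 Scope"
    re.compile(r"^(\d+\.\d+\.\d+)\s+(.+)$"),  # level 3, e.g. "1.1.1 Details"
]


@dataclass
class Section:
    number: str
    title: str
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def heading_level(line):
    """Return (level, number, title) if the line is a heading, else None."""
    for level, pattern in enumerate(HEADING_PATTERNS, start=1):
        match = pattern.match(line)
        if match:
            return level, match.group(1), match.group(2)
    return None


def parse_sections(lines, pos=0, level=1):
    """Recursively collect the sections at `level`, starting at lines[pos].

    Returns (sections, next_pos). The call returns as soon as it meets a
    heading shallower than `level`, handing control back to the parent.
    """
    sections = []
    while pos < len(lines):
        line = lines[pos].strip()
        hit = heading_level(line)
        if hit is None:
            # Body text belongs to the most recently opened section.
            if sections and line:
                sections[-1].body.append(line)
            pos += 1
        elif hit[0] == level:
            sections.append(Section(hit[1], hit[2]))
            pos += 1
        elif hit[0] > level:
            if sections:
                # Deeper heading: recurse and attach the result as children.
                children, pos = parse_sections(lines, pos, level + 1)
                sections[-1].children.extend(children)
            else:
                pos += 1  # deeper heading with no parent: skip it
        else:
            break  # shallower heading: let the caller handle it
    return sections, pos


sample = [
    "1 Overview", "Intro text.",
    "1.1 Scope", "Scope text.",
    "1.1.1 Details", "Deep text.",
    "1.2 Limits",
    "2 Results", "Result text.",
]
tree, _ = parse_sections(sample)
```

Because the heading patterns live in a list rather than in the parsing logic, the recursion itself does not change when a deeper level or a different numbering scheme is configured.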
Scope:
- Parse PDFs to identify sections, headers, body text, and tables
- Correlate related data and sections across the document
- Parse tables in PDFs to understand and mine their data
- Multiple PDF documents with different layouts
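
The correlation of related sections across a document could be approached in many ways, and the source does not say which heuristics were used. The Python sketch below shows two assumed ones: explicit cross-references such as "Section 2.1", and approximate title matches that tolerate small OCR errors. The section index, the `correlate` function, and the 0.8 threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical section index: section number -> title.
SECTIONS = {
    "2.1": "Payment Terms",
    "3.4": "Termination and Renewal",
    "5": "Liability",
}

# Matches explicit references such as "Section 2.1" or "section 5".
REF_PATTERN = re.compile(r"\b[Ss]ection\s+(\d+(?:\.\d+)*)")


def correlate(text, sections, min_ratio=0.8):
    """Return the sorted section numbers that `text` appears to refer to.

    Heuristic 1 (assumed): explicit numeric references.
    Heuristic 2 (assumed): a section title occurs in the text, allowing
    small differences, scored with difflib.SequenceMatcher.
    """
    found = {num for num in REF_PATTERN.findall(text) if num in sections}
    lowered = text.lower()
    for num, title in sections.items():
        needle = title.lower()
        # Slide a window of the title's length across the text and keep
        # the best similarity ratio seen.
        last_start = max(1, len(lowered) - len(needle) + 1)
        best = max(
            SequenceMatcher(None, needle, lowered[i:i + len(needle)]).ratio()
            for i in range(last_start)
        )
        if best >= min_ratio:
            found.add(num)
    return sorted(found)


explicit = correlate("Fees are due as described in Section 2.1.", SECTIONS)
fuzzy = correlate(
    "Either party may end the contract under the Terminatlon and Renewal clause.",
    SECTIONS,
)
```

The sliding-window comparison is quadratic in the worst case, so for 1000+ page documents it would likely be applied per paragraph rather than to the whole text at once.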