Data Harvesting from PDFs

Customer Challenges:

  • Extract information from complex, freeform PDFs of very large size (1000+ pages each)
  • No fixed layouts
  • Tables represented as data or images

Result:

  • 70% accurate extraction, with identified and labeled exceptions
  • New PDF layouts supported with incremental efforts
  • Zero license cost – Open source solution
  • More than 10x faster data extraction

CoreView Solution:

  • Regex-based configurable parsing
  • Recursive algorithms to navigate n-level sections, subsections and body text
  • Heuristics-based correlation
  • OCR-powered parsing of tables embedded as images
  • Modular data pipeline
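
To illustrate how regex-based configurable parsing and recursive section navigation can fit together, here is a minimal sketch in Python. It is not the CoreView implementation: the heading pattern, the `Section` class and the `parse_sections` function are all hypothetical names chosen for this example, and it assumes the PDF text has already been extracted into a list of lines with numbered headings such as "1.2 Scope".

```python
import re
from dataclasses import dataclass, field

# Hypothetical, configurable heading pattern: "1 Title", "1.2 Title", "1.2.3 Title".
# A different PDF layout would supply a different regex here.
HEADING_PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


@dataclass
class Section:
    number: str
    title: str
    body: list = field(default_factory=list)      # plain text lines
    children: list = field(default_factory=list)  # nested Section objects


def parse_sections(lines, pattern=HEADING_PATTERN):
    """Build a section tree of arbitrary depth from already-extracted text lines."""
    root = Section(number="", title="ROOT")
    _parse(lines, 0, root, 0, pattern)
    return root


def _parse(lines, i, parent, depth, pattern):
    """Consume the lines that belong to `parent`; return the index of the next unread line."""
    while i < len(lines):
        match = pattern.match(lines[i])
        if match:
            level = match.group(1).count(".") + 1
            if level <= depth:
                # Heading is a sibling or belongs to an ancestor: hand control back.
                return i
            child = Section(match.group(1), match.group(2).strip())
            parent.children.append(child)
            # Recurse: the child consumes its own body and subsections.
            i = _parse(lines, i + 1, child, level, pattern)
        else:
            if lines[i].strip():
                parent.body.append(lines[i].strip())
            i += 1
    return i


if __name__ == "__main__":
    sample = [
        "1 Overview", "Intro text",
        "1.1 Scope", "Scope text",
        "2 Results", "Result text",
    ]
    tree = parse_sections(sample)
    for top in tree.children:
        print(top.number, top.title, [c.title for c in top.children])
```

Note that this simple pattern would also treat a body line that starts with a number (for example "70 percent of records") as a heading; a real configuration would likely need stricter patterns or extra heuristics per layout.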

Scope:

  • Parse PDFs to understand sections, headers, body, tables
  • Correlate related data and sections across the document
  • Parse tables in PDFs to understand and mine their data
  • Multiple PDF documents with different layouts
