Research is to see what everybody else has seen and to think what nobody else has thought.
We were trying to build an NLP model to detect and extract data from differently structured PDF files and write it to Excel sheets in a well-defined questionnaire format. The PDF files contained all kinds of structures: simple tables, bulleted text, headers, footers, and data laid out as multidimensional matrices. To build the model we needed PDF files in sufficient quantity and variety, and that was our bottleneck: the shortage of data hurt the model's accuracy and left it unable to recognize certain objects in the PDF files. While searching for a way around this, we came across an API provided by the Adobe Document Cloud SDK. Adobe had recently launched this SDK to make various operations on PDFs easier. One of the operations the API provides is converting a PDF into Excel, and it is as simple as it sounds.
Adobe offered a free, rate-limited tier for exploring the API. We tried a few types of PDF files and, to our surprise, got Excel sheets that rendered complex PDF structures in the simple questionnaire format we expected. After that, we contacted the Adobe team for early access to the SDK and implemented the solution.
We don’t always have to chase high-tech trends to solve a problem; sometimes simple, thorough research can speed up the work and deliver great accuracy.
What are your experiences?