Complex Document Extraction

Accurately extracting text and image data from documents with unpredictable formats and layouts (PDF, Word, Excel, Web pages, e-mails) that lack an underlying technical structure such as XML or field identifiers has always been a challenge for conventional technologies, including other RPA (Robotic Process Automation) platforms. As a result, people must read each document and re-enter the data, which increases processing cost, time and errors.

Instaknow’s patented Artificial Intelligence processes millions of very complex documents to eliminate manual processing for Fortune 500 clients in Banking, Supply Chain, Health Care, Utilities, Pharmaceuticals, Law, Insurance and Government. All required data is accurately extracted and converted to XML for conventional processing.
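As a rough illustration of what "converted to XML" could look like downstream, the sketch below serializes a set of extracted values with Python's standard library. The element and attribute names (`document`, `field`, `name`, `type`) and the sample values are assumptions made for this example only; they are not Instaknow's actual output schema.

```python
import xml.etree.ElementTree as ET


def to_xml(document_type, fields):
    """Serialize extracted name/value pairs into a small XML string.

    The element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("document", {"type": document_type})
    for name, value in fields.items():
        child = ET.SubElement(root, "field", {"name": name})
        child.text = str(value)
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, as if extracted from a scanned tax return
xml_text = to_xml("tax_return", {"officer_name": "Jane Doe", "officer_city": "Trenton"})
print(xml_text)
```

Once the data is in a form like this, any conventional XML-aware system can consume it without knowing how the original document was laid out.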

Using human-eyeball-like scanning of each document’s layout, Instaknow can correctly decide which text is related to which header or label in that document, WITHOUT needing an underlying technical structure such as XML or field identifiers. Data can be laid out DIFFERENTLY in different documents. Instaknow can even accurately determine the values of checkboxes and radio buttons. If a human eyeball can find and isolate the data of interest, Instaknow can do it too, regardless of variations. Documents do NOT need to be in specific technical formats. They can be text documents or image/scan documents, with one or multiple pages. Sections within documents can appear in any order, and columns in tables can also appear in an unpredictable sequence!

For example, in the following scanned tax returns, the top return has space for two Officers while the bottom return can list up to four Officers. The column widths are also very different. These documents came in as scanned images and have no underlying XML, technical ids or predictable string sequences that would allow conventional data processing such as RPA (Robotic Process Automation). Only a person can detect the actual data layout and content, and that person has to manually re-enter it in another computer system or file for further processing. But manual processing of thousands of documents is expensive, slow and error-prone!

Instaknow can do the same processing automatically. Using its human-like Artificial Intelligence, Instaknow can be told to extract “‘Officer Name and address’ from the ‘Information about officers’ section”. That instruction (kept in a simple, user-friendly place such as an Excel spreadsheet) allows it to perform the following user-like steps:

Multiple sets of data can be extracted from the same document page together, separately or conditionally (e.g. “Extract Balance Sheet details from another page only if total revenue reported on the first page of the Tax Return is more than $100,000”). All exceptions (e.g. expected pages missing from a document) are routed to specified users for review.
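The conditional rule and the exception routing described above can be sketched in a few lines of Python. This is only a minimal illustration of the logic, assuming the page contents have already been extracted into dictionaries; the names `process_tax_return`, `first_page`, `balance_sheet` and `total_revenue` are hypothetical and do not describe Instaknow's configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    extracted: dict = field(default_factory=dict)
    exceptions: list = field(default_factory=list)  # items to route to a reviewer


def process_tax_return(pages, revenue_threshold=100_000):
    """Apply the conditional rule to already-extracted pages (hypothetical layout).

    pages maps a page name to a dict of the values found on that page.
    """
    result = Result()
    first = pages.get("first_page")
    if first is None or "total_revenue" not in first:
        result.exceptions.append("first_page or total_revenue missing")
        return result

    result.extracted["total_revenue"] = first["total_revenue"]

    # Only look for the Balance Sheet when revenue exceeds the threshold
    if first["total_revenue"] > revenue_threshold:
        balance = pages.get("balance_sheet")
        if balance is None:
            # An expected page is missing: flag it for manual review
            result.exceptions.append("balance_sheet page missing")
        else:
            result.extracted["balance_sheet"] = balance
    return result


# Hypothetical input: revenue above the threshold, Balance Sheet page absent
outcome = process_tax_return({"first_page": {"total_revenue": 250_000}})
print(outcome.exceptions)
```

In this sketch a missing page never stops processing silently; it ends up in `exceptions`, which stands in for the review queue mentioned in the text.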

Below is an example of accurate data extraction in spite of layout variations in HTML Web pages.

As can be seen, different fields appear at the same physical location or position in the field sequence for different companies. Since no XML or technical ids are present in this Web page, a conventional data extraction attempt (e.g. “Screen Scraping”) will fail, because it cannot tell whether the fourth field is “Registration date” or “Renewal date”. Instaknow DYNAMICALLY decides where the labels describing the data of interest are and accurately extracts the related data, regardless of unknown location or position variations.
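The difference between position-based and label-based extraction can be shown with a small Python sketch. It is a deliberately simplified illustration, not Instaknow's algorithm: it assumes each page has already been reduced to a list of (label, value) pairs, and the company records, field names and both helper functions are invented for the example.

```python
# Hypothetical records: (label, value) pairs in the order they appear on each page
company_a = [
    ("Company name", "Acme Corp"),
    ("Status", "Active"),
    ("State", "NJ"),
    ("Registration date", "2001-04-12"),
]
company_b = [
    ("Company name", "Globex LLC"),
    ("Status", "Active"),
    ("State", "NY"),
    ("Renewal date", "2020-09-30"),
    ("Registration date", "1998-01-05"),
]


def extract_by_position(fields, index):
    """Screen-scraping style: trust that the field is always at the same index."""
    return fields[index][1]


def extract_by_label(fields, label):
    """Find the value by its label, wherever that label appears."""
    wanted = label.strip().lower()
    for found_label, value in fields:
        if found_label.strip().lower() == wanted:
            return value
    return None  # label not present on this page


# The fourth field is the registration date for one company only
print(extract_by_position(company_a, 3))
print(extract_by_position(company_b, 3))
print(extract_by_label(company_b, "Registration date"))
```

For the second company, the positional lookup silently returns the renewal date, while the label lookup still finds the registration date in its new position.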