Automate Unpredictable Data
Accurately extracting text and image data from documents with unpredictable formats and layouts (PDF, Word, Excel, Web pages, e-mails) that lack an underlying technical structure such as XML or field identifiers has always been a challenge for conventional technologies, including other RPA (Robotic Process Automation) platforms. As a result, people must read each document and re-enter the data, which increases processing cost, time and errors.
Instaknow’s patented Artificial Intelligence processes millions of very complex documents to eliminate manual processing for Fortune 500 clients in Banking, Supply Chain, Health Care, Utilities, Pharmaceuticals, Law, Insurance and Government. All required data is accurately extracted and converted to XML for conventional processing.
Using human-eyeball-like scanning of each document’s layout, Instaknow can correctly decide which text is related to which header or label in that document, WITHOUT needing an underlying technical structure like XML or field identifiers. Data can be laid out DIFFERENTLY in different documents. Instaknow can even accurately determine the values of checkboxes and radio buttons. If a human eyeball can find and isolate the data of interest, Instaknow can do it too, regardless of variations. Documents do NOT need to be in specific technical formats. They can be text documents or image/scan documents, with one or multiple pages. Sections within documents can appear in any order, and table columns can appear in an unpredictable sequence!
For example, in the following scanned tax returns, the top return has space for two Officers while the bottom return can list up to four. The column widths are also very different. These documents came in as scanned images and have no underlying XML, technical ids or predictable string sequences that would allow conventional data processing such as RPA. Only a person can detect the actual data layout and content, and that person has to manually re-enter it in another computer system or file for further processing. But manual processing of thousands of documents is expensive, slow and error-prone!
Instaknow can do the same processing automatically. Using its human-like Artificial Intelligence, Instaknow can be told to extract “‘Officer Name and address’ from the ‘Information about officers’ section”. That instruction (kept in a simple, user-friendly place such as an Excel spreadsheet) allows it to perform the following user-like steps:
- Read each document. If the document is an image, it is converted to text using Optical Character Recognition (OCR).
- Decide whether the document is relevant for this data extraction (i.e. is it a tax return?).
- Find the appropriate page for this data extraction. Required data may be on different pages in different tax returns. Within the page, the data of interest may be at different vertical and horizontal locations in different documents.
- Find the section header “Information about Officers”. Alternatives can be provided for headers and labels, to handle different wording that means the same thing.
- Within the proximity of the dynamically found header, look for the label “Name and address”, then look to the left and right of the label to decide how far the “visual scope” of the label extends (i.e. which data below the label belongs to that column). This use of “white space” to decide label scope requires artificial intelligence and is beyond the capabilities of conventional technologies.
- Decide the vertical scope of the data part of that section using white space gaps and font prominence (e.g. bold or bigger fonts are more likely for headers and labels than for data).
- After isolating the data rectangle like this, extract the data and save it as XML for further automated downstream processing. Tables/grids of data from the document are correctly extracted as XML nodes, retaining the original data relationships.
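The steps above can be illustrated with a deliberately simplified sketch. It is not Instaknow's patented method: it assumes the OCR output is already available as fixed-width text lines, that labels on a row are separated by two or more spaces, and that a blank line ends the data block. It ignores font prominence entirely. All function names, the sample lines and the XML element names are hypothetical.

```python
import re
from xml.etree import ElementTree as ET


def find_line(lines, phrases, start=0):
    """Index of the first line at or after `start` containing any phrase, else -1."""
    for i in range(start, len(lines)):
        lowered = lines[i].lower()
        if any(p.lower() in lowered for p in phrases):
            return i
    return -1


def column_scope(label_line, label):
    """Horizontal span of `label`: from its own start to the start of the next label."""
    pos = label_line.lower().index(label.lower())
    # Assumption: labels on one row are separated by runs of two or more spaces.
    starts = [m.start() for m in re.finditer(r"\S+(?: \S+)*", label_line)]
    left = max(s for s in starts if s <= pos)
    later = [s for s in starts if s > pos]
    return left, (later[0] if later else None)


def extract_column(lines, header_alts, label_alts):
    """Find the header, then the label below it, then the data under that label."""
    header_row = find_line(lines, header_alts)
    if header_row < 0:
        return []
    label_row = find_line(lines, label_alts, header_row + 1)
    if label_row < 0:
        return []
    label = next(a for a in label_alts if a.lower() in lines[label_row].lower())
    left, right = column_scope(lines[label_row], label)
    values = []
    for line in lines[label_row + 1:]:
        if not line.strip():  # a blank line marks the end of the data block
            break
        cell = line[left:right].strip()
        if cell:
            values.append(cell)
    return values


def to_xml(values):
    """Wrap each extracted value in hypothetical Officer/NameAndAddress nodes."""
    root = ET.Element("Officers")
    for value in values:
        ET.SubElement(ET.SubElement(root, "Officer"), "NameAndAddress").text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample standing in for one OCR'd tax return page.
sample = [
    "U.S. Corporation Income Tax Return",
    "",
    "Information about Officers",
    f"{'Name and address':<28}{'Title':<13}Share",
    f"{'Jane Doe, 1 Main St':<28}{'President':<13}60%",
    f"{'John Roe, 2 Oak Ave':<28}{'Secretary':<13}40%",
    "",
    "Other information",
]
officers = extract_column(
    sample,
    ["Information about officers", "Officer information"],  # header alternatives
    ["Name and address"],
)
print(to_xml(officers))
```

Because the column boundaries are derived from the label row of each document rather than from fixed offsets, the same call works when the columns are wider, narrower or in a different order.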
Multiple sets of data can be extracted from the same document page together, separately or conditionally (e.g. “Extract Balance Sheet details from another page only if total revenue reported on the first page of the Tax Return is more than $100,000”). All exceptions (e.g. expected pages missing from the document) are routed to specified users for review.
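As a rough illustration of conditional extraction with exception routing, the sketch below models each instruction as a rule with an optional condition. The `Rule` class, the `run_rules` function, the document dictionary layout and the page names are all assumptions made for this example and do not describe Instaknow's actual configuration format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str                                          # what to extract
    page: str                                          # hypothetical page key
    condition: Optional[Callable[[dict], bool]] = None  # None means "always"


def run_rules(document, rules):
    """Apply each rule; skip it if its condition fails, queue it if its page is missing."""
    extracted, review_queue = {}, []
    facts = document["facts"]  # values already extracted from the first page
    for rule in rules:
        if rule.condition is not None and not rule.condition(facts):
            continue  # condition not met, nothing to extract
        if rule.page not in document["pages"]:
            review_queue.append(f"{rule.name}: page '{rule.page}' is missing")
            continue  # exception goes to a human reviewer
        extracted[rule.name] = document["pages"][rule.page]
    return extracted, review_queue


rules = [
    Rule(
        name="Balance Sheet details",
        page="balance_sheet",
        condition=lambda facts: facts["total_revenue"] > 100_000,
    ),
]

# Made-up document: revenue is above the threshold but the page is absent.
incomplete_return = {"facts": {"total_revenue": 250_000}, "pages": {"first": "..."}}
extracted, review_queue = run_rules(incomplete_return, rules)
print(extracted, review_queue)
```

Keeping the condition separate from the missing-page check means a return that does not meet the threshold is skipped silently, while one that meets it but lacks the page is sent for review.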
Below is an example of accurate data extraction in spite of layout variations in HTML Web pages.
As can be seen, different fields appear at the same physical location or position in the field sequence for different companies. Since there are no XML or technical ids present in this Web page, a conventional data extraction attempt (e.g. “Screen Scraping”) will fail, because it cannot tell whether the fourth field is “Registration date” or “Renewal date”. Instaknow DYNAMICALLY decides where the labels describing the data of interest are and accurately extracts the related data, regardless of unknown location or position variations.
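The difference between positional and label-driven extraction can be shown with a small sketch using Python's standard `html.parser`. It assumes, purely for illustration, that each label sits in a table cell immediately followed by a cell holding its value. The sample pages, company names and dates are invented, and this is not a description of how Instaknow itself works.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [c.strip() for c in parser.cells]


def value_for_label(html, label_alts):
    """Return the cell that follows the first cell matching any label alternative."""
    cells = cell_texts(html)
    wanted = {a.lower() for a in label_alts}
    for i, text in enumerate(cells[:-1]):
        if text.rstrip(":").strip().lower() in wanted:
            return cells[i + 1]
    return None


def row(label, value):
    return f"<tr><td>{label}:</td><td>{value}</td></tr>"


# Two made-up pages: the fourth field differs between the companies.
company_a = "<table>" + row("Name", "Acme Ltd") + row("Status", "Active") \
    + row("Type", "LLC") + row("Registration date", "2001-05-14") + "</table>"
company_b = "<table>" + row("Name", "Bolt Inc") + row("Status", "Active") \
    + row("Type", "Corp") + row("Renewal date", "2024-03-02") \
    + row("Registration date", "1998-03-02") + "</table>"

for page in (company_a, company_b):
    by_position = cell_texts(page)[7]  # value of the fourth field, whatever it is
    by_label = value_for_label(page, ["Registration date"])
    print(by_position, by_label)
```

For the second page the positional lookup returns the renewal date, while the label lookup still returns the registration date.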