How OCR has advanced and how you can use it.
There is an evil villain lurking in your company. This vile monster worships errors, loves to waste time and lives to drain money from your bottom line. Who is this creature? It’s the manual data entry monster. Who can save us? It’s a bird; it’s a plane; it’s Super OCR software. We know that regular OCR saves time and money by eliminating manual data entry from static forms. Super OCR has special powers that can capture fields on variable forms. The information you need could be in a different location on each form, each form could look different, and the field might even be called something different on each form.
Not Your Father’s OCR Software
Can you imagine a bank managing accounts without a computer? It is possible, but it would be slow, expensive, and less accurate. Why should people have to manually key forms into a computer? A computer that could read forms would be fast and accurate. It would apply the same business rules every time. It could do the work of five people. Those five people would be happier and their employer could assign them to tasks that are more productive.
Field-based OCR has been around for over twenty years, but in the past it could only read static forms where the form looked the same every time and the data was always in the same place. Past field-based OCR systems were designed to read handprint (ICR), machine print (OCR), checkmarks (OMR), and barcodes.
This worked for many types of static forms, such as tax returns, tests, and surveys, where large volumes of the same form were distributed and then returned to a central collection point. However, reading hand-printed forms is very difficult, requires redesign of the form, and requires substantial operator verification. While OCR of static forms helps automate a small percentage of an organization’s forms, it does not address the majority of documents, which are actually variable.
Similar but Not Identical
Invoices, utility bills, bills of lading, and insurance claim forms are all examples of documents representing a business transaction between two parties. To the sending party, the documents are identical but with different data. To the receiving party, these documents are similar, but not identical. The data that needs to be captured is the same, but it is in a different place on each document.
Documents that contain similar data can also be called semi-structured. Since these types of forms are generated by a machine, they usually have some structure and are more easily read by a machine. Specifically, the OCR software does not have to try to read handprint and it can concentrate on finding and reading the required fields which are machine printed. Because these types of forms are similar, it is possible to construct rules about where to find the data points.
The rules can be layered and the software can rely on the previous rule to decide on the next rule. Reading semi-structured documents has been helped by recent improvements in OCR software such as lower cost, better recognition, and new logic like automatic table reading. Lower cost means that advanced OCR is affordable for more companies. Better recognition means that OCR should be able to locate and read the characters for most fonts, point sizes, and printer types.
Since many documents contain tables, automatic table reading improves the OCR process significantly by reducing setup time and correctly identifying required fields that exist in tables.
How Super OCR Works
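The details of reading a variable form differ from one OCR vendor’s software to the next, but the process usually involves four basic steps: setting up the form (anchors, tables, and field formats), capturing an image, validating the data, and exporting the results.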
If you are going to navigate a form, you need some landmarks. Titles, logos, and lines make effective landmarks for a form. These landmarks are “anchors.” For example, the actual invoice number on a form is typically located somewhere to the right or underneath a word called “invoice number.” This could also be called “invoice #” or “invoice no.” or many other variations. By telling the software that there are 10 possible ways to say “invoice number” and that the actual invoice number is below or to the right, the software is able to find the actual invoice number. This process of mapping an anchor and a field together continues for all the required fields on the form.
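The anchor idea can be sketched in a few lines of Python. This is only an illustration, not any vendor's method: it assumes the OCR engine has already returned text fragments with page coordinates, and the `Fragment` class, the `find_field` function, and the tolerance values are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Fragment:
    """One piece of recognized text and the position of its top-left corner."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def find_field(fragments: List[Fragment], variants: Iterable[str],
               line_tol: float = 10.0, col_tol: float = 40.0) -> Optional[str]:
    """Return the text to the right of, or else below, the first matching anchor."""
    wanted = {v.lower() for v in variants}
    for anchor in fragments:
        if anchor.text.strip().lower() not in wanted:
            continue
        others = [f for f in fragments if f is not anchor]
        # Prefer the nearest fragment on the same line, to the right of the anchor.
        right = [f for f in others
                 if f.x > anchor.x and abs(f.y - anchor.y) <= line_tol]
        if right:
            return min(right, key=lambda f: f.x - anchor.x).text
        # Otherwise take the nearest fragment underneath the anchor.
        below = [f for f in others
                 if f.y > anchor.y and abs(f.x - anchor.x) <= col_tol]
        if below:
            return min(below, key=lambda f: f.y - anchor.y).text
    return None


INVOICE_NUMBER_ANCHORS = ["invoice number", "invoice #", "invoice no."]

page = [
    Fragment("Invoice No.", 400, 50),
    Fragment("INV-1042", 520, 52),
    Fragment("Date", 400, 80),
    Fragment("03/15/2024", 520, 81),
]
print(find_field(page, INVOICE_NUMBER_ANCHORS))
```

A real product would also have to cope with misread anchor text and with labels that sit directly under other labels, which this sketch ignores.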
Many forms have tables. If you can identify the header fields, most Super OCR software can automatically pick out the rows and columns. If you know the number of rows and columns beforehand (like on a utility bill), it is even simpler to set up for OCR. Once the OCR software has identified the header, rows, and columns, it can read each individual item in the table even if the number of rows and columns changes from form to form.
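As a rough illustration of how header fields can drive table reading, the Python sketch below assigns each recognized cell to the nearest header column and groups cells into rows by vertical position. The `read_table` function, the tuple layout, and the `row_tol` tolerance are assumptions made for this example; commercial table readers use considerably more sophisticated logic.

```python
def read_table(header, cells, row_tol=5.0):
    """Group OCR cells into rows keyed by column name.

    header: list of (column_name, x) pairs for the header fields.
    cells:  list of (text, x, y) tuples for the text found under the header.
    """
    rows = []
    for text, x, y in sorted(cells, key=lambda c: (c[2], c[1])):
        # The column is the header whose x position is closest to the cell.
        column = min(header, key=lambda h: abs(h[1] - x))[0]
        if rows and abs(rows[-1]["_y"] - y) <= row_tol:
            rows[-1][column] = text          # same line as the current row
        else:
            rows.append({"_y": y, column: text})  # start a new row
    # Drop the helper "_y" key before returning the rows.
    return [{k: v for k, v in row.items() if k != "_y"} for row in rows]


header = [("Qty", 50), ("Description", 120), ("Amount", 400)]
cells = [
    ("2", 52, 200), ("Widget", 121, 201), ("19.98", 398, 200),
    ("1", 51, 220), ("Gadget", 120, 221), ("5.00", 402, 220),
]
for row in read_table(header, cells):
    print(row)
```

Because rows are discovered from the cell positions rather than fixed in advance, the same code handles two line items or twenty.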
When we read something, we read better if it is “in context.” OCR software is the same way. Location and accuracy are better if you can tell the OCR software that the invoice date is always in a month/day/year format. If you tell it that the month is always two digits, the day is always two digits, and the year is always four digits, that will simplify the process. If you tell it that there are always either slashes or hyphens between the digits, you have given the OCR software a better shot at locating and reading that field. More constraints mean better accuracy and data that is more consistent.
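The date constraints described above translate naturally into a pattern. The following Python sketch assumes the recognized text is available as a plain string; the `find_invoice_date` function name is hypothetical. It accepts only a two-digit month, a two-digit day, and a four-digit year joined by matching slashes or hyphens, and it rejects impossible dates.

```python
import re
from datetime import date

# Two-digit month, two-digit day, four-digit year; the same separator
# (slash or hyphen) must appear in both positions.
DATE_PATTERN = re.compile(r"\b(\d{2})([/-])(\d{2})\2(\d{4})\b")


def find_invoice_date(text):
    """Return the first valid MM/DD/YYYY or MM-DD-YYYY date in text, or None."""
    for match in DATE_PATTERN.finditer(text):
        month, _sep, day, year = match.groups()
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            continue  # e.g. month 13: skip and keep looking
    return None


print(find_invoice_date("Invoice Date: 03/15/2024  Due: 04-14-2024"))
```

Each added constraint removes candidates: a string such as `3/5/24` never matches, and `13/45/2024` matches the pattern but fails the calendar check.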
Super OCR software just needs an image. Images can come from a scanner, fax machine, networked copier, or any other device that will create a TIFF or PDF image. These devices differ in speed, resolution, and feeder quality, but even inexpensive scanners can scan at 200 or 300 dpi, which is sufficient for OCR. Scanning devices are inexpensive enough that remote users can all have scanners and scan documents into a central site rather than mailing forms in.
Computers are great at consistently following rules. Computers don’t daydream and they don’t make exceptions. When we OCR forms, we want our data to come out as consistent and accurate as possible. Validation means not only checking data but improving data.
Checking data can be done manually by an operator viewing the image in question or automatically by the OCR software. For example, if a particular invoice was lacking an invoice date, the human operator could look at that image and make a decision. Alternatively, the software could look at the problem and be instructed to fax that invoice back to the company that sent it along with a note that said it was missing an invoice date.
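A minimal Python sketch of this checking step follows. The field names, the `route_invoice` function, and the action labels are hypothetical; the point is only that a missing field can be routed either to an operator or back to the sender by rule.

```python
# Hypothetical names for the fields every invoice must contain.
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "total")


def missing_fields(fields):
    """List the required fields that are absent or blank."""
    return [name for name in REQUIRED_FIELDS
            if not (fields.get(name) or "").strip()]


def route_invoice(fields, auto_return=False):
    """Decide what happens to an invoice after extraction.

    Returns (action, missing), where action is "accept",
    "operator_review", or "return_to_sender".
    """
    missing = missing_fields(fields)
    if not missing:
        return "accept", missing
    action = "return_to_sender" if auto_return else "operator_review"
    return action, missing


extracted = {"invoice_number": "INV-1042", "invoice_date": "", "total": "24.98"}
print(route_invoice(extracted))
print(route_invoice(extracted, auto_return=True))
```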
Improving data is typically done by the computer. For example, if the “ship via code” on the invoice is “Federal Express Priority,” but the back-end database requires “FEP,” then the OCR software can automatically change this field. Although not technically an OCR function, improving data is frequently done at the point of OCR because we want to perfect the data before the data is sent to a back-end application or database.
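The “ship via” example amounts to a lookup table. In the Python sketch below, the `SHIP_VIA_CODES` table and the `normalize_ship_via` function are illustrative assumptions; only the “Federal Express Priority” to “FEP” mapping comes from the example above.

```python
# Lookup from the wording printed on the invoice to the back-end code.
# Keys are lowercase with single spaces so minor OCR spacing and case
# differences still match.
SHIP_VIA_CODES = {
    "federal express priority": "FEP",
}


def normalize_ship_via(value):
    """Return the back-end code for a ship-via value, or None if unknown."""
    key = " ".join(value.lower().split())
    return SHIP_VIA_CODES.get(key)


print(normalize_ship_via("Federal Express  Priority"))
```

Returning `None` for an unrecognized value lets the validation step flag the field for review instead of passing unexpected text to the database.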
One of the magical things about Super OCR software is that it can export both data and images simultaneously. Many organizations have both an electronic content management system and a financial software system. If an invoice is received, the image needs to go to the ECM repository while the extracted data needs to be routed to the financial system for further processing. OCR software can send data and images to multiple systems in multiple formats. For example, it could send PDF files to the ECM repository and XML files to the financial system, all at the same time.
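As a small sketch of the data half of that export, the Python below builds an XML record that also names the image file it came from, using the standard library's `xml.etree.ElementTree`. The element names, the `image` attribute, and the `invoice_to_xml` function are assumptions for illustration; a real financial system would dictate its own schema, and delivering the PDF to the ECM repository is outside this example.

```python
import xml.etree.ElementTree as ET


def invoice_to_xml(fields, image_name):
    """Serialize extracted fields to XML, referencing the scanned image.

    Field names are used as element names, so they must be valid XML names.
    """
    root = ET.Element("invoice", attrib={"image": image_name})
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


fields = {"invoice_number": "INV-1042", "invoice_date": "2024-03-15", "ship_via": "FEP"}
print(invoice_to_xml(fields, "inv-1042.pdf"))
```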
Finding the Right Super OCR Project
For your first Super OCR project, look for forms where there is some structure, the characters are legible, and where you can identify most of the fields you need by a printed name. Poor quality is kryptonite for Super OCR. If you can’t read the printing on a form, it’s doubtful that the OCR software can pick it up. This happens frequently on faxed forms, carbon copies, and some forms printed on dot matrix printers. A complete lack of structure can also be a problem. If there are no landmarks on a form, it is going to be difficult for the OCR software to locate and then read the required fields.
Interestingly, Super OCR makes OCR of static forms even better, because it handles the variability inherent in real-life forms processing. For example, a tax form is printed, distributed, and then returned for scanning. The printer and the scanning introduce some variability. In static form OCR, you draw a box around an area on the image and expect the field to be somewhere inside the box. With Super OCR, you give the software a landmark and it finds the field exactly. Since the location is better, the accuracy is better.
Once you have found your project, you can choose to buy an OCR engine and build your own application or you can buy an off-the-shelf OCR program that can read variable forms. If it is possible to read your forms with an off-the-shelf program, this is usually the fastest and least expensive alternative. If your forms cannot be read adequately by an off-the-shelf program or if you require special features, then there are a number of OCR engines you can use.
Super OCR is a welcome evolution of OCR technology that provides better cost, recognition, and logic for forms that look different but have similar data points.
Hopefully this latest twist on OCR will help you save time and money on more of your paper-based transactions.
Mike Stuhley is the president of Formtran, Inc. and GoScan, Inc., software companies in the document imaging industry. He has over sixteen years of software industry experience, holding executive positions with Scantron and Cardiff Software, and is a frequent speaker at industry events.