Automating document processing – What are OCR and Barcodes?
OCR stands for Optical Character Recognition, which is the process of taking an image and converting it into text so that it can be edited and searched. There are two types of OCR you can use to help eliminate manual key entry: full text OCR and Zonal OCR. Neither is perfect, and accuracy depends heavily on image quality, among numerous other factors. Barcode recognition is much faster and more accurate than OCR, and it is widely used in scanning service bureaus both for capturing index data and for document separation, which is what makes batch scanning possible.
So let’s discuss the two different OCR functionalities.
Full Text OCR
Full text OCR takes the entire image and converts it to a text output. The output can be in one of several formats: plain text, formatted text or a searchable file. The main goal of full text OCR is typically searchability, and the results are usually placed into a backend ECM repository so that documents can be found by their content.
Sounds great, doesn’t it? Scan documents and let the software provide all the retrievable information one could ever want when searching for an image. It is the cure-all that does everything you need. Well, that’s what some might tell you, but it couldn’t be further from the truth. OCR is not perfect and does not yield 100% accuracy even with the most pristine document. Many things affect OCR accuracy, including image quality, fonts, dpi, pictures, word spacing and columns. Accuracy can be improved in many ways, but doing so can be expensive, and the results may still need human review. Because scanning service bureaus handle every type of document in many different formats, full text OCR on its own becomes a risk wherever accuracy requirements are strict. And if you cannot find even one document in an audit, well…
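To make the audit risk concrete, here is a minimal sketch of full text search over OCR output. It is only an illustration: the document IDs, the sample text and the functions build_index and search are all made up for this example, and a real ECM repository would use its own search engine. The third document contains a typical OCR misread, a lowercase "l" where a capital "I" was printed.

```python
import re
from collections import defaultdict

def build_index(ocr_pages):
    """Map each lowercase word to the set of document IDs whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical OCR output, keyed by a made-up document ID.
ocr_pages = {
    "doc-001": "Invoice 10442 from Acme Supply. Amount due: 1,250.00",
    "doc-002": "Purchase order for Acme Supply, 40 units",
    "doc-003": "lnvoice 10443 from Baker Ltd",  # OCR misread "I" as "l"
}

index = build_index(ocr_pages)
acme_invoices = search(index, "acme invoice")
all_invoices = search(index, "invoice")
print(acme_invoices)
print(all_invoices)
```

The search for "invoice" never returns doc-003, because the word the OCR engine produced is not the word that was printed. One misread character is enough to hide a document from a full text search.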
Zonal OCR
Zonal OCR is used to extract data from a particular region, or zone, of the scanned page and convert just that portion to text, or an index value. It is often used on AP invoices, applications, checks and similar documents, where you receive many forms of the same type with identical layouts and the text is in a specific place on each page. Software is used to design a template for the form so that it can find the zones you plan to extract data from. In its simplest form, Zonal OCR extracts printed data from one or more zones on the document, validates it using simple rules such as format, length and data mask, and populates the index fields with the result.
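The validation step can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular capture product works: the field names, the masks in FIELD_RULES and the function validate_zones are hypothetical, and the zone text is assumed to have been extracted already.

```python
import re

# Hypothetical rules per index field: a regex "data mask" plus a maximum length.
FIELD_RULES = {
    "invoice_number": {"mask": r"\d{5}", "max_len": 5},
    "due_date": {"mask": r"\d{2}/\d{2}/\d{4}", "max_len": 10},
    "amount": {"mask": r"\d{1,3}(,\d{3})*\.\d{2}", "max_len": 12},
}

def validate_zones(zone_text):
    """Split raw zone text into accepted index fields and values needing review."""
    index_fields, rejected = {}, {}
    for field, raw in zone_text.items():
        value = raw.strip()
        rule = FIELD_RULES[field]
        if len(value) <= rule["max_len"] and re.fullmatch(rule["mask"], value):
            index_fields[field] = value
        else:
            rejected[field] = value  # send to a person for manual correction
    return index_fields, rejected

# Made-up OCR results; the due date starts with the letter "O" instead of a zero.
zone_text = {
    "invoice_number": "10442",
    "due_date": "O3/15/2024",
    "amount": " 1,250.00 ",
}

index_fields, rejected = validate_zones(zone_text)
print(index_fields)
print(rejected)
```

Because the rules know what each field should look like, the misread due date is caught and routed to a person instead of being filed with a bad index value.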
For example, you work in Accounts Receivable, and your department typically files each invoice by Client Name, Due Date and Amount. A Zonal OCR template can be used to map the text found in those physical page locations to specific document properties. So, every time a new invoice is scanned into the system, it is automatically filed by Client Name, Due Date and Amount. These document properties can then be used to search for the document when it needs to be retrieved.
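As a rough sketch of what such a template does, assume the OCR engine reports each recognized word together with the pixel position of its top-left corner. The Word type, the zone coordinates in INVOICE_TEMPLATE and the function apply_template are all hypothetical and exist only for this example.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x: int  # left edge of the word, in pixels
    y: int  # top edge of the word, in pixels

# Hypothetical template: field name -> (left, top, right, bottom) in pixels.
INVOICE_TEMPLATE = {
    "client_name": (50, 100, 400, 140),
    "due_date": (450, 100, 700, 140),
    "amount": (450, 600, 700, 640),
}

def apply_template(words, template):
    """Collect the words whose top-left corner falls inside each zone."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
        # Simplification: order by row, then column; real text lines need fuzzier grouping.
        inside.sort(key=lambda w: (w.y, w.x))
        fields[name] = " ".join(w.text for w in inside)
    return fields

# Made-up OCR word positions for one scanned invoice.
words = [
    Word("Invoice", 50, 40),
    Word("Acme", 60, 110),
    Word("Supply", 130, 110),
    Word("03/15/2024", 460, 112),
    Word("1,250.00", 470, 610),
]

fields = apply_template(words, INVOICE_TEMPLATE)
print(fields)
```

The word "Invoice" at the top of the page is ignored because it sits outside every zone. This is also why Zonal OCR depends on identical layouts: if the form shifts, the text no longer falls inside the zones.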
Why is Zonal OCR so popular and so widely used in service bureaus and businesses? It is much faster and more accurate than full text OCR, and it provides specific, granular search criteria for indexing. Zonal OCR is an incredibly effective solution for applications that deal with repetitive paper forms, and for those it is generally preferred over full text OCR.
Capturing data accurately and consistently from difficult documents is a complex problem. You need to test any proposed solution extensively before you accept it. Otherwise, you may be disappointed with the results and come to the incorrect conclusion that OCR does not work.
Last but certainly not least is using Barcodes
Barcode recognition is the most efficient way to capture index data printed on documents. There are two types of barcode, 1D and 2D. Traditional 1D barcodes represent each character as a pattern of vertical bars and spaces, arranged horizontally across the paper. These linear barcodes become impractical when the number of characters exceeds 30. 2D barcodes represent data in small cells arranged both vertically and horizontally. They can accommodate several times the number of characters that a linear barcode can.
Some documents already have key information in barcode format on them. In many cases adding a barcode to a document is as simple as changing or adding a font. Adding barcodes to new documents is preferable, as all the index data is on the document at the time it is created and in a format that can be read with near 100% accuracy. Even more widely used are separator sheets for splitting files or documents, such as the “Patch T” sheet, which allow for large batch scanning. Patch codes are a simple relative of the barcode. There are six patch code patterns, named Patch 1, 2, 3, 4, 6 and T, and each one is a distinct arrangement of wide and narrow bars. Rather than carrying data, a patch code signals the scanning software to take an action, such as starting a new document, which is why patch codes are used for document separator sheets.
Barcode recognition can also be useful when you have documents with a variable number of pages that will all receive the same index values. If it is not possible to generate an indexed coversheet for these at the time they are created, a generic barcode coversheet can be used to separate the scanned images into multi-page files, one for each document. A second process can then be used to index these images one file at a time instead of one page at a time, greatly increasing throughput.
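The separation step might look something like the sketch below. It assumes the scanning software has already reported, for each page, the barcode value it found (or None); the SEPARATOR value, the page IDs and the function split_batch are made up for illustration.

```python
SEPARATOR = "SEPARATOR"  # hypothetical value encoded on the generic coversheet

def split_batch(pages):
    """Group scanned pages into documents, starting a new one at each coversheet.

    pages is a list of (page_id, barcode_value_or_None) tuples in scan order.
    """
    documents, current = [], None
    for page_id, barcode in pages:
        if barcode == SEPARATOR:
            # Start a new document; the coversheet itself is not kept.
            current = []
            documents.append(current)
        elif current is not None:
            current.append(page_id)
        else:
            # Pages scanned before any coversheet form their own document.
            current = [page_id]
            documents.append(current)
    # Drop empty documents caused by back-to-back coversheets.
    return [doc for doc in documents if doc]

# Made-up batch: two coversheets, three content pages.
batch = [
    ("p1", "SEPARATOR"),
    ("p2", None),
    ("p3", None),
    ("p4", "SEPARATOR"),
    ("p5", None),
]

documents = split_batch(batch)
print(documents)
```

Each inner list is one multi-page file, ready to be indexed once as a whole rather than page by page.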
Enough cannot be said about the effectiveness, accuracy and efficiency barcodes provide to the automation of scanning.
Chances are, full text OCR, Zonal OCR or barcode recognition will work for automating your scanning processes. If not, there is and always will be “Old Reliable” manual indexing.