OCR and the winds of time

OCR as time goes by: The first machine-readable font was developed for the American government 45 years ago. Much has changed in the world of OCR technology since then.

1968 was a revolutionary year – not just in the sense of political upheaval, but in the history of the computer as well. Douglas C. Engelbart invented the computer mouse, the precursor of the personal computer hit the market, and electronic data processing was slowly gaining in popularity and demand.

In those days there was only limited use for OCR. Computing power was so restricted (an amount we would consider laughable today) that usable results required standardized fonts with clearly distinguishable characters that a machine could read easily. The most recognizable such font can be seen in the image accompanying this article: the line of numbers indicating the check number is printed in the very first machine-readable font, called OCR-A. Special symbols are another essential element of OCR fonts; based on their appearance, the ones used here are called the hook, fork and chair. They provide important help to the machine reader, for example by marking the end of a particular unit of information.

Some 45 years and a quantum leap in technology later, OCR can not only read the vast majority of fonts, but also checks whether the text it reads makes sense in context, using integrated language recognition that generally relies on dictionaries. If in doubt, the program will correct itself instead of accepting “dou&t”.
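The dictionary check described above can be illustrated with a minimal Python sketch. It is not how any particular OCR engine works: the tiny word list, the function name `correct_word` and the similarity cutoff are all assumptions made for this example, and the matching uses the standard library's `difflib` rather than a real language model.

```python
import difflib

# Tiny stand-in word list (assumption); a real engine would use a full dictionary.
DICTIONARY = ["doubt", "double", "font", "scanner", "text"]

def correct_word(word, dictionary=DICTIONARY, cutoff=0.6):
    """Return the closest dictionary word, or the input unchanged if none is close."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    # get_close_matches ranks candidates by similarity and drops those below cutoff.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

if __name__ == "__main__":
    print(correct_word("dou&t"))
```

Here the misread “dou&t” is closer to "doubt" than to any other entry, so it is replaced; a string that resembles nothing in the list is passed through unchanged.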

Even today, the prerequisite for good results is that the scanner produces a high-resolution digital image of the original document. Scanners should therefore process documents at a resolution of at least 200 dpi. At lower resolutions each character is made up of too few pixels, which causes too many errors in the OCR process.
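To make the 200 dpi figure concrete, here is a small arithmetic sketch in Python. The function names are illustrative, the A4 page size and the 10-point type size are assumptions chosen for the example, and one point is taken as 1/72 inch.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # typographic point, assumed to be 1/72 inch

def page_pixels(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

def glyph_height_px(point_size, dpi):
    """Approximate height in pixels of type set at the given point size."""
    return point_size / POINTS_PER_INCH * dpi

if __name__ == "__main__":
    # Compare an A4 page (210 x 297 mm) and 10 pt type at several resolutions.
    for dpi in (100, 200, 300):
        print(dpi, page_pixels(210, 297, dpi), round(glyph_height_px(10, dpi), 1))
```

Under these assumptions, 10-point type is roughly 28 pixels tall at 200 dpi but only about 14 pixels tall at 100 dpi, which leaves far less detail to tell similar characters apart.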

OCR also still has difficulty with heavily shaded scans and faded text on thermal paper. The human eye can interpret the illegible parts and derive the overall meaning, an ability that (for now) exceeds the limits of software. In practice, if the source document looks hard to read, we recommend doing a trial conversion to a text file and checking the resulting text for legibility before the final archiving. You’ll also find it helpful to try this out with a number of different documents and fonts. Having said that, we cordially invite you to use our webPDF portal where you’ll find all of webPDF’s OCR features at your fingertips and with no obligation.
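One rough way to check a trial conversion automatically is to measure how many of the recognized words appear in a word list. The Python sketch below is only a heuristic under stated assumptions: the function name `dictionary_hit_rate` is hypothetical, the word list is supplied by the caller, and any threshold for "legible enough" has to be chosen by experiment.

```python
import string

def dictionary_hit_rate(text, dictionary):
    """Fraction of whitespace-separated tokens that are known words."""
    # Strip surrounding punctuation so "clear." still counts as "clear".
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)

if __name__ == "__main__":
    words = {"the", "text", "is", "clear"}
    print(dictionary_hit_rate("The text is clear.", words))
    print(dictionary_hit_rate("Th3 t&xt is cl3ar", words))
```

A low hit rate suggests that the scan should be repeated or inspected by hand before archiving; it does not replace reading a sample of the output yourself.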

Until a few years ago, text was output with no layout information: although it was recognized correctly, nothing was recorded about its position on the page. This is no longer a problem today thanks to the advanced hOCR standard, which allows information about the page composition and layout to be stored using XML-based (XHTML) tags, even without hooks and chairs. webPDF versions 5.0 and higher support this output format. You can select it manually or configure it as the standard by changing the default parameters.
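As an illustration of what hOCR layout information looks like, the following Python sketch reads word positions from a small hOCR fragment. It is generic hOCR handling with the standard library's `html.parser`, not webPDF's API. The sample markup is made up for this example, the names `HocrWordParser`, `parse_bbox` and `extract_words` are hypothetical, and the sketch assumes that word elements are `span` tags with class `ocrx_word` that contain no nested `span`.

```python
from html.parser import HTMLParser

def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 36 92 96 116; x_wconf 95'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None

class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes the word span contains no nested span elements.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False

def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words

# Made-up sample fragment for illustration only.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1654 2339'>
  <span class='ocr_line' title='bbox 36 92 580 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 200 116; x_wconf 93'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Each word comes back with its bounding box in pixel coordinates, which is the kind of position data that plain text output lacks.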

Typefaces such as OCR-A and OCR-B are anything but antiquated. You see them more and more often as stylistic elements in modern design, and they are a recurring feature in the work of vintage and retro designers in particular. And what’s wrong with using fonts that are more easily recognizable than others, despite tremendous technological advancements? After all, the purpose of OCR is to recognize characters with the fewest possible errors.