Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is the automated process of converting the words on an image into machine-readable text that can be searched, displayed online, and indexed. When applied to historical newspapers, OCR can be exceptionally challenging.

OCR Makes History Searchable

Advantage offers an affordable way to create a high volume of accessible digital images by applying an automated OCR process after scanning your community’s newspapers. Libraries and local communities can afford to procure access to decades of content as an enhancement to the preservation of their papers on microfilm. In addition to being keyword searchable, the digital archive can be indexed by City, County, State, Country, Title, Institution, Date, Page and more. As a result, the user has a searchable collection and content-browsing tool that is more practical and accessible to the community, and far more efficient, than film readers.

Although digitization coupled with automated OCR makes for a fantastic research tool, it must not be mistaken for a preservation tool. We are a preservation company first and foremost; everything begins and ends with the microfilming process. We transfer the fragile paper to film for long-term preservation and scan the microfilm to create digital files. These digital files are the fourth generation of the images. Three of the four generations of a newspaper page (printing, filming, and digitizing) can be separated by years, decades, or, in some rare instances, centuries. The newspaper industry in the United States has evolved considerably over the last 300 years. Each development in typesetting methods, printing processes, and paper stock created unique challenges in adapting the digital image of old newspapers to a searchable format. In short, the OCR quality (and therefore the searchability of the newspaper) is solely dependent on the condition of the source material. This does not reflect a technology problem, nor is it a problem with. If the words on a page are not recognized by the software, it is most likely the result of a series of problems that began as long as 300 years ago, well before the first computer was invented.

One method of measuring OCR efficiency is gauging how accurately it determines which words are on the printed page. This is normally expressed as the percentage of words on the page that are accurately “read” by the software. Of course, “reading” a word entails piecing it together letter by letter; therefore, OCR accuracy is sometimes measured as the percentage of letters that are accurately “read.” It is important to note, however, that these two measures of accuracy are fundamentally different. Word accuracy is, by definition, lower than letter accuracy, as it is effectively the joint accuracy, or joint probability, of the letters in the word. For example, OCR accuracy at the letter level for a document may be 98 percent. But the accuracy of a five-letter word in that same document is found by taking 0.98 to the fifth power, which is about 0.90, or roughly 90 percent.
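
The relationship between letter accuracy and word accuracy can be sketched in a few lines of Python. This is only an illustration, not a description of any particular OCR product: it assumes that every letter is recognized independently and with the same accuracy, which real scans do not strictly obey, and the function name `word_accuracy` is a hypothetical one chosen for this example.

```python
def word_accuracy(letter_accuracy: float, word_length: int) -> float:
    """Estimate the chance that every letter in a word is read correctly.

    Assumption (for illustration only): letter errors are independent
    and every letter has the same accuracy, so the word-level figure is
    the letter accuracy raised to the power of the word length.
    """
    if not 0.0 <= letter_accuracy <= 1.0:
        raise ValueError("letter_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return letter_accuracy ** word_length


if __name__ == "__main__":
    # Compare short and long words at the same 98 percent letter accuracy.
    for length in (1, 5, 10):
        estimate = word_accuracy(0.98, length)
        print(f"{length:>2}-letter word: {estimate:.1%}")
```

Under these assumptions, longer words fare worse than shorter ones at the same letter accuracy, which is why a collection can report a high letter-level figure and still miss a noticeable share of keyword searches.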

Most OCR software tools do not necessarily follow the logical arrangement of a newspaper’s multi-column, multi-sectioned layout. The software does, however, endeavor to identify zones with possible text so that OCR may be applied to these zones. There are two types of zones: graphic and text. Typically, the text zones are very accurately represented. At Advantage, we use 100 percent automation for both zoning and capture. The quality of the OCR is completely dependent upon the quality of the medium that is scanned, as is the case in every step preceding OCR.

The quality of the original image has several implications throughout an automated process. If the second- or third-generation images on the microfilm have deteriorated to any degree, the imperfections and poor image quality will interfere dramatically with the OCR process. Areas of text may be seen as a graphic, and the spacing of columns or even letters may lack consistency, which leads to a “ballooning” of the number of terms that are submitted for a search. For example, a simple imperfection can close a “c,” transforming a word like “cat” into “oat.”

When these problems are combined with unusual fonts, faded printing, shaded backgrounds, fragmented letters, skewed text, curved lines and bleed-through on the originals, OCR accuracy will be far less than 50 percent on most historical documents. We are able to manually correct OCR, but this would be a cost-prohibitive process for libraries given their budget constraints.

Despite these challenges, the benefits of OCR far outweigh the benefits of using microfilm for research or content-based inquiries. Microfilm provides many advantages for long-term archiving and preservation of content, but a quick search to find information, such as a name, is not among them. OCR eliminates the need to tirelessly inspect each page, and scour each word, to locate a mere tidbit of information within a microfilm reel. Furthermore, with microfilm, one must know the City, State, Title, and Date of the information sought in order to locate the reel in the first place. If the user is unable to find an item by conducting a search due to poor OCR returns, the digital images are still indexed by City, State, Title and Date. As a result, the user still has a content-browsing tool, and the process operates more efficiently than film readers.

 Call Today:  1-855-303-2727 

Libraries & Historical Societies

Combined, the Advantage Preservation team has been partnering with Libraries, Colleges, and Historical Societies to preserve their local history for over a century. We take great pride in converting local historical newspapers, record books, public records, and photos onto 35mm Silver Halide Microfilm to protect the valuable content from the ravages of time. Our Microfilm meets all ANSI/AIIM Standards for microfilm preservation, and we use only archival-quality 35mm Silver Halide produced in our Kodak Certified Lab for true 500+ year preservation!

 

Newspaper Publishers

We give Newspaper Publishers a clear Advantage! In partnering with Advantage Preservation, you will be working with a team that has over 50 years of combined experience in the Newspaper Microfilming Industry. We meet or exceed all ANSI/AIIM Standards for microfilm preservation. The Microfilming will be done at no cost to our Publishing Partners. The only cost you will have is shipping the papers to our facility in Cedar Rapids, Iowa. Your Master Negative copy can be stored at no additional charge in our state-of-the-art vault under your ownership.

State & Government Libraries

The Advantage Preservation team has been closely monitoring the painful condition and trends in the budgets of our state governments, and we are extremely conscious of the fact that one of the hardest-hit areas has been newspaper and historical preservation services in State Libraries. The cuts have been deep, and in many cases, these institutions have been left behind.

State Historical Societies

The Advantage Preservation team has been closely monitoring the painful condition and trends in the budgets of our state governments, and we are extremely conscious of the fact that one of the hardest-hit areas has been newspaper and historical preservation services in State Libraries and Historical Societies. The cuts have been deep, and in many cases, these institutions have been left behind. We are pleased to be working with State Historical Societies across the country to help ease the burden of their filming.

Educational Institutions

Educational institutions of all sizes, public and private, are burdened with more paperwork than ever before. Dealing with paper-based business processes while struggling to comply with regulations such as NCLB, FERPA and HIPAA is tough for any administrator. Converting paper and microfilm student files (e.g. attendance documents, report cards, transcripts) to secure digital files can help your organization provide instant record access from any education system.
