The Paper Problem:
With the newfound ability to assemble content and produce pages quickly and efficiently, existing papers increased their circulation numbers substantially. The demand for the paper on which to print the news inevitably grew as well. This created a need to use a less expensive paper stock to match the increased production rates.
This ushered in the era of “newsprint” paper. The inexpensive, machine-produced, non-archival paper consisted mainly of wood pulp. Paper made from mechanical pulp contains significant amounts of lignin, a major component in wood. In the presence of light and oxygen, lignin reacts to turn materials yellow, which is why newsprint and other mechanical-pulp papers yellow with age.
Newsprint was seen as a disposable medium. The readily available and inexpensive paper stock was used with the view that the newspaper would be discarded immediately after reading.
Much of the early paper made from wood pulp contained significant amounts of alum, a variety of aluminum sulfate salts that is significantly acidic. Alum was added to paper to assist in sizing and to make it somewhat water resistant so that ink did not run or spread uncontrollably. The amount of alum varied from batch to batch, which created inconsistencies in the printing process.
Early papermakers did not realize that the alum they liberally added to cure almost every product development issue would have detrimental effects.
In 1933, William Barrow published a paper on the acid paper problem. In later studies, Barrow tested paper from American books made between 1900 and 1949. He discovered that after forty years, the books had lost on average 96 percent of their original strength; after less than ten years, they had already lost 64 percent.
Manufacturing methods used after 1870 employed sulfuric acid for sizing and bleaching purposes. Earlier papermaking methods left the final product only mildly alkaline or even neutral. Such paper has maintained its strength for 300 to 800 years, despite sulfur dioxide and other air pollutants. Barrow’s 1933 article on the fragile state of wood pulp paper predicted that the life expectancy, or “LE,” of this paper was approximately 40–50 years. After that point, the paper would begin to show signs of natural decay, making it necessary to research new mediums on which to write and print. His recommendation for long-term preservation was to reformat newspapers to microfilm before decay from acid in the wood-pulp paper set in.
The microfilm process plays an integral role in the quality of the source material for OCR. Quality assurance is essential from the stage in which the document is microfilmed to the stage in which microfilm is stored.
Unfortunately, by the time that some of the most valuable newspaper pages were filmed for preservation purposes, they had already deteriorated to an illegible state. Even newspapers of great significance, such as the first newspaper published in Iowa, the Dubuque Visitor, were taped together. Large amounts of the text were eradicated by age, neglect, and inadequate storage. Diligent and careful filming cannot restore this mistreated source material. The image is hardly comprehensible to the human eye, let alone to a computer that must interpret the words on the page.
However, a remarkable amount of text was captured from this newspaper page, even though the “readable” portion of the complete OCR text file comes to less than 20 percent. There is value in what was extracted from the OCR output. The output for the whole page is as follows:
“Discount subsequent beset you are wanted where he that fault the the door and her beautiful over her face am exclaimed my child Barlow, gently repulsing her. Bessie, however, without pressed addressing self to the stranger in an energetic sort of away are but am bedroom do, persuade no, Martha, her kind, to step into confuses stranger characteristic half penetrated whispered young whom he commended to New ‘York, proper winch it seems by reason or art an; insane person. He serve look to me-like those who both will am more generous and I will tell you name Now is” it hot clasping and looking upward whole bright and rapturous, that he who is chosen and of God for cause of freedom,, friend of of my struggling to this little dwelling to find ?out and aid me? Choose impatient services is not our if she were, but there was no for farther explanation. As soon were the inner room. door. She seemed at but instantly am unknown sir, fade seems to ‘have’ that written on it; and ‘it shall to you. is. current of.an. some was They, would it so strange his very you I have some-port cellar. or two; for it; .and fellows, the ague and’ your Mrs. I have -would us filing. And I will serve you poor girl her faded, feverish you quietly, and I will see what can will wait patiently, is but one thing to yes with tears, and i moment found it his voice.. can of Mrs. thought so is the mecjves.t and the the most my good net cannot tell their sent object is to get to New-York as as possible, where I have business of the. I have staying days with. I would not feelings.pn she whisper; they are very judgment at all; indeed, ‘there are that have, to confide to them of my actions wills my dear young Robert resided. powerful, loved of Lafayette” Pressed , mounted his horse, off at Barlow gazed after him till the cavalcade, after God, he said, looking at the watch, must pledge if heaven me I will and leave, it, as fast property From”
Not all newspapers published before the implementation of microfilm as the standard preservation practice in the late 1930s and early 1940s were in such poor condition. Some publishers or libraries treated the hard copies with respect, and when they were filmed many years later, the quality was extraordinary.
The following issue of the Fairbanks Times proves that not all OCR output from the late 1800’s and the early 1900’s was poor. This front page from 1908 (“Cubs Win!”) shows that the image quality, clear boundaries of each article, clean typeset and well-preserved paper stock add up to quality text indexing.
Although it is not perfect, the OCR output is outstanding. This reflects what will be expected from newspapers filmed at a much later date. The clear “zones” that break up the columns give the OCR software a clear indication of where to start and stop. As a newspaper of the early 20th century, it represents the end of the “wall of text”: tight columns and dark, dense typesetting. The inexpensive paper stock, while fragile and of temporary use, allowed publishers to be more liberal with the use of white space and cleaner aesthetics. This type of layout enhanced machine readability.
The following section from the center of the page clearly shows an improvement. Aside from a few words that run together and were not indexed by a search engine, the intelligible, well-spaced text is among the best examples of OCR’s quality performance.
By 1938, microfilm was used for archival preservation by libraries and institutions across the country. Microfilm enabled libraries to greatly expand access to collections without putting rare, fragile or valuable items at risk of theft or damage. Microfilm is compact and has significantly lower storage costs than paper documents. Normally, 500 shots (or 1,000 individual newspaper pages) will fit on a single reel. When compared to filing paper, microfilm can reduce storage space requirements by up to 95 percent.
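The reel capacity quoted above lends itself to a quick estimate. The following is a minimal sketch in Python, assuming the figures in the text (about 500 shots per reel, two pages per shot). The function name and the example collection are hypothetical and chosen only for illustration.

```python
import math

# Figures taken from the text: about 500 shots per reel, two pages per shot.
SHOTS_PER_REEL = 500
PAGES_PER_SHOT = 2


def reels_needed(total_pages: int) -> int:
    """Return how many reels are needed to hold total_pages newspaper pages."""
    pages_per_reel = SHOTS_PER_REEL * PAGES_PER_SHOT
    return math.ceil(total_pages / pages_per_reel)


# Hypothetical collection: a 12-page daily paper, 365 issues a year, for 10 years.
hypothetical_pages = 12 * 365 * 10
print(reels_needed(hypothetical_pages))
```

Actual reel capacity depends on page size and reduction ratio, so the result is a rough planning figure only.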
However, the quality of filming varied significantly, sometimes creating an image so poor that it was nearly unusable.
In 1979, the ANSI/AIIM standards for microfilm were established. These standards have undergone several revisions, most recently in 2004. Before the standards were established, volunteers, students and county employees often were tasked with the microfilming duties.
Pictured here is the Dubuque Visitor (which, as mentioned, was the first paper printed in Iowa). In the lower right-hand corner, the photographer’s thumb is visible. Many of the images captured on microfilm before the 1980s were overexposed or underexposed, out of focus, misaligned, skewed or, as in this image, obscured by fingers or other foreign objects.
Since 1979, the quality of filming and the storage of the master microfilm reels have improved tremendously. If the newspaper was preserved to microfilm around the time of publication and after 1979, the quality of the image captured to film is quite good, and thus the digital image and OCR quality will be good as well.
Modern newspapers are printed and preserved to standards within months of each process, resulting in images like this front page from The Cedar Rapids Gazette in 2008. While many publishers use their pre- or post-press PDF as their vehicle for online digital viewing, most are still microfilmed for archival purposes. Those that do not use the full-color PDF to complement their archive undergo the same microfilm scanning process as do historical newspapers.
The OCR on these papers is nearly perfect.
Many other publications prior to 1979 treated their hardcopy collection with great care, and a quality image can still be captured on film from their collection, which is usually contained in bound volumes. Historical newspapers that are already on film, especially images captured prior to the introduction of ANSI/AIIM standards in 1979, typically suffer from faded print, shaded backgrounds, fragmented letters, touching/overlapping letters, skewed text, curved lines (which is very common in bound volumes) and “gutter shadow,” which is created by the binding between two pages when a book is opened.
Even if a decent image is captured from a historical newspaper, the quality of the OCR is completely dependent upon proper storage of the film. Even if the page was pristine and filmed to standards, the film may be damaged, scratched, or suffering from vinegar syndrome or redox. When low-quality film is used, it further reduces the likelihood of success in the OCR process.
Microfilm is a stable, archival form when properly processed and stored. Preservation standard microfilms use the silver halide process, which creates silver images in hard gelatin emulsion on a polyester base. With appropriate storage conditions, this film has a life expectancy of 500 years.
Unfortunately, until the late 1970’s, many of these collections were stored on shelves and exposed to humidity and temperature variations that caused instability in the gelatin used to bind the silver halide.
The differences in output between OCR applied to a greyscale image and OCR applied to a bi-tonal image are negligible. Although greyscale is presumed to increase OCR accuracy, studies show that, in general, OCR on a greyscale digitization is almost 5 percent worse than on the black and white image of the same page. With eight shades of grey used to recreate the image, words can get lost in the “noise” of the page, and the halftones can create pale or blurry text and poor contrast.
One method of measuring OCR efficiency is gauging how accurately it determines which words are on the printed page. This is normally expressed as the percentage of words on the page that are accurately “read” by the software. Of course, “reading” a word entails piecing it together letter by letter, as seen in tests conducted in earlier articles. Therefore, OCR accuracy is sometimes measured as the percentage of letters that are accurately “read.”
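As a rough illustration of the two measures, the sketch below compares a known-correct transcription against OCR output at both the word and the letter level. It is a simplified approximation that uses the sequence matcher in Python's standard library rather than any particular vendor's scoring method. The function names and the sample sentence are assumptions made for this example.

```python
from difflib import SequenceMatcher


def letter_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth characters that the OCR text matches, in order."""
    if not truth:
        return 1.0
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth)


def word_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words that the OCR text matches, in order."""
    truth_words = truth.split()
    ocr_words = ocr.split()
    if not truth_words:
        return 1.0
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


# Hypothetical sample: one letter is misread, so one word is wrong.
truth = "the cat sat on the mat"
ocr = "the oat sat on the mat"
print(f"letter accuracy: {letter_accuracy(truth, ocr):.3f}")
print(f"word accuracy:   {word_accuracy(truth, ocr):.3f}")
```

A single misread letter costs one character out of 22 but one word out of six, so the word-level figure drops much further than the letter-level figure.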
It is important to note, however, that these two measures of accuracy are fundamentally different. Word accuracy is, by definition, significantly lower than letter accuracy, as it is effectively the joint accuracy, or joint probability, of the letters in the word. For example, OCR accuracy at the letter level for a document may be 98 percent. But computing the accuracy of a five-letter word in that same document is done by taking 0.98 to the fifth power, which comes to about 0.904, or roughly 90 percent, assuming errors in individual letters are independent of one another.
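The relationship between the two measures can be sketched in a few lines of Python. This assumes that each letter is read correctly with the same probability and that letter errors are independent, which real scans only approximate. The function name is hypothetical.

```python
def expected_word_accuracy(letter_accuracy: float, word_length: int) -> float:
    """Probability that every letter in a word is read correctly,
    assuming independent letter errors."""
    return letter_accuracy ** word_length


# Show how the expected word accuracy falls as words get longer.
for length in (1, 5, 10):
    print(length, round(expected_word_accuracy(0.98, length), 3))
```

Under these assumptions, longer words are more likely to contain at least one misread letter, which is why names and other long search terms are missed more often than short ones.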
Most OCR software tools do not necessarily follow the logical arrangement of a newspaper’s multi-column, multi-sectioned layout. The software does, however, endeavor to identify zones with possible text so that OCR may be applied to these zones. There are two types of zones: graphic and text. Typically, the text zones are very accurately represented.
At the Advantage Companies, we use 100 percent automation for both zoning and capture. The quality of the OCR is completely dependent upon the quality of the medium that is scanned, as is the case in every step preceding OCR.
The quality of the original image has several implications throughout an automated process. If the second or third generation images on the microfilm have deteriorated to any degree, the imperfections and poor image quality will interfere dramatically with the OCR process. Areas of text may be seen as a graphic, and spacing of columns or even letters may lack consistency, which leads to a “ballooning” of the number of terms that are submitted for a search. For example, a simple imperfection can close a “c,” transforming a word like “cat” into “oat”.
When these problems are combined with the aforementioned unusual fonts, faded printing, shaded backgrounds, fragmented letters, skewed text, curved lines and bleedthrough on the originals, OCR accuracy will be far less than 50 percent on most historical documents.
We are able to manually correct OCR, but this would be a cost-prohibitive process for libraries given their budget constraints.
Despite these challenges, the benefits of OCR far outweigh the benefits of using microfilm for research or content-based inquiries. Microfilm provides many advantages for long-term archiving and preservation of content, but a quick search to find information, such as a name, is not among them.
OCR eliminates the need to tirelessly inspect each page, and scour each word, to locate a mere tidbit of information within a microfilm reel. Furthermore, one must know the City, State, Title, and Date of the information sought in order to first locate the reel.
If the user is unable to find an item by conducting a search due to poor OCR returns, the digital images are still indexed by City, State, Title and Date. As a result, the user has a content-browsing tool, and the process operates more efficiently than film readers.
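The browsing fallback described above can be pictured as a lookup keyed on the four index fields. The following is a minimal sketch in Python, not a description of any production system. The record layout, function names, place names, dates and file names are all hypothetical placeholders.

```python
from collections import defaultdict
from datetime import date

# Hypothetical page records: (city, state, title, issue date, image file).
records = [
    ("Sample City", "IA", "Sample Gazette", date(1908, 1, 1), "page_0001.tif"),
    ("Sample City", "IA", "Sample Gazette", date(1908, 1, 1), "page_0002.tif"),
    ("Sample City", "IA", "Sample Gazette", date(1908, 1, 2), "page_0003.tif"),
]


def build_browse_index(page_records):
    """Group page images under a (city, state, title, date) key."""
    index = defaultdict(list)
    for city, state, title, issue_date, image in page_records:
        index[(city, state, title, issue_date)].append(image)
    return index


def browse(index, city, state, title, issue_date):
    """Return the page images for one issue, or an empty list if none exist."""
    return index.get((city, state, title, issue_date), [])


index = build_browse_index(records)
print(browse(index, "Sample City", "IA", "Sample Gazette", date(1908, 1, 1)))
```

Because the lookup depends only on the catalog fields, it still works when the OCR text for a page is too poor to be searched.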
Although digitization coupled with automated OCR makes for a fantastic research tool, it must not be mistaken for a preservation tool. The maintenance of a digital collection is costly. Digitization also does not provide a permanent solution, as formats change, backwards compatibility is not guaranteed, and technology continues to advance. Higher-quality images can be created. OCR can be corrected manually. Extensive metadata can be collected. However, the question remains: Is it worth the investment?
Advantage aims to offer an affordable solution and ensure the production of a high volume of accessible, digital images. Libraries and local communities can afford to procure access to decades of content as an enhancement to the preservation of their papers on microfilm.
Microfilm is analog, or an actual image of the original data, and it does not require OCR to read. Unlike digital media, the format does not require software to decode the data; it is instantly comprehensible. All that is needed is a simple magnifying glass.