Making Old Newspapers Searchable: The History of Printing


Until the first part of the 19th century, composition and typesetting were accomplished by painstakingly placing each individual letter into a “galley” by hand. The galleys were then placed into a “chase” that was the size of the final page. Since each letter was placed individually, there was no method for standardization or uniformity. If there was a picture or a drawing in the newspaper, it was created primarily by using woodcuts or engravings. This was a very time-consuming and expensive process.

The actual print quality varied not only from paper to paper, but also from edition to edition. The type and consistency of the ink used, coupled with the quality of the rag or cotton paper, contributed to “word bleed”, splatter and other imperfections.

Composing sticks were later used, allowing the length of the lines and consequent width of the page or column to be set, with spaces and quadrats of different sizes being used to make up the exact width. This added uniformity and consistency to the page.

Around 1815, one of the first significant advances in composition emerged: the development of stereotyping. This method of pre-press production allowed for a whole page of type to be cast in a single mold so that a printing plate could be made from it. Until the invention of stereotype printing, type had to be reset if a second printing was to be made.

The invention of the stereotype led to the mass production of printing plates. Multiple copies could now be sent to other printers and newspapers, which allowed for the reproduction of larger numbers of identical images.

Paper was in high demand and costly at the time, which ultimately influenced the design and layout of this era. Newspapers were dense and dark, as publishers shrank type and eliminated illustration in an effort to get as much information as possible into a limited space.

The denseness of the typesetting of this era created yet another OCR challenge. Due to the tight fit, many words were combined into a single line of text without spacing to separate the words. Columns were not detected properly in many cases due to their proximity to one another. In addition to the gothic fonts commonly used, there was no way to adjust the spacing between characters in a proportional font, especially in relation to height. For instance, the S/F challenge still existed, but for new reasons. The “f” was now occasionally truncated at the top, creating a “t”. Tops of an “A” could be cut off, creating an “H” in the OCR process. A capital “R” could easily become either a “K” or an “H”. These are just a few examples.

This page of the Iowa News, published on June 3, 1837, was selected for testing due to its image clarity. Again, this is a testament to the quality of the paper stock used, and the condition in which this title was kept. The paper was carefully preserved to microfilm and the microfilm was stored properly. When this image was scanned to create the digital image, over 175 years had passed since it came off of the printing press.

In total, 6,214 “words” were captured on this page; however, the “words” counted include strings like this:

J ^ • O —–»VH I.V. MW tvuw. .v <i j, I’Mil I (U PVUIbiUII” IUI U 
i .v,e, | 1 walci*s edge. Ii was in the Solf, inexpert, rxhau&tcd, nnd cncumbercd as j for the public road leading from Wdpello’s old …1 ~r ‘ …………. r ‘ ” *’ village on tho Iowa to Du lluque, I fell iu

If we stripped out the extra spaces and “non-words”, this would remain:

1 edge. was in the inexpert as for the public road leading from old village on Iowa to I fell.
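
The stripping step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how non-words were identified, so the sketch assumes a simple dictionary lookup, and the names VOCABULARY, strip_non_words and loss_percent are hypothetical. The tiny vocabulary stands in for a full word list.

```python
import string

# Hypothetical mini-vocabulary for illustration only; a real run would
# load a full word list instead of this handful of entries.
VOCABULARY = {
    "edge", "was", "in", "the", "inexpert", "as", "for", "public", "road",
    "leading", "from", "old", "village", "on", "iowa", "to", "i", "fell",
}


def strip_non_words(ocr_text, vocabulary):
    """Return the OCR tokens that match a vocabulary word.

    Tokens are split on whitespace, and leading/trailing ASCII punctuation
    is removed before the (case-insensitive) lookup.
    """
    kept = []
    for token in ocr_text.split():
        core = token.strip(string.punctuation)
        if core and core.lower() in vocabulary:
            kept.append(core)
    return kept


def loss_percent(raw_count, kept_count):
    """Percentage of captured tokens discarded by the cleanup."""
    return 100.0 * (raw_count - kept_count) / raw_count


# A fragment of the OCR string quoted above.
sample = "walci*s edge. Ii was in the Solf, inexpert, rxhau&tcd, nnd cncumbercd as"
raw_count = len(sample.split())
kept = strip_non_words(sample, VOCABULARY)
print(" ".join(kept))
print(loss_percent(raw_count, len(kept)))
```

A lookup like this is deliberately strict: it discards near misses such as “nnd” rather than guessing at “and”, so it measures what was captured cleanly, not what a human reader could still make out.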

Although this OCR text output is an extreme example, it illustrates the difference in what is captured by the OCR software and what is legible. A better demonstration is derived from the following article on the same page:

America* Statu.uiy Marble.—We have authority for stating that Mr. Fcntherstonhaugh, U. Slates Geologist, Ims nsccitaincd the existence of sjine important dnposites of whito statury marble, tu tlio Cheioltce country, lie lias followed an ob-icuro ridge in the mountains six miles, consisting entirely of that v iluablo bubitance, hitherto only soon in tlio United Slates in thin beds, not exceed-ini’ u few Indies, lie reports ono of these dopu-siios ns equal to thnt of Masna-Carrara, In Italy, ivith which he is familinr. Marble of this kind lias been hitherto brought, nt a groat expense, from Italy. We trust this additional development of our mineral resourcos will be highly advantageous m tho linn arts, in tlio hands of our men ofgenlus. (Jreecc and Italy owe much of their celebrity in sculpture to tho iibundanco of statuary marble in thoso coantrics. We imagino that if Phidias and f’raxilcins had boenobligild lo import tlieir material fiom foreign countries, posterity would never iavo possessed the noblest examples of art, which their genius has boqucnlod to mankind.-[Nat.Int.

This very short article contains 172 counted “words”. If we were to clean the OCR to only include real words, we would be left with 126 words. That is a loss of about 27 percent on a newspaper page that is over 180 years old. The newspaper is also in remarkably good shape with a high-quality image. The majority of the OCR text is legible; most importantly, the names, no matter how unusual, were captured:

America Marble. We have authority for stating that Mr. Fcntherstonhaugh, U. Slates Geologist, the existence of important of marble, country, lie followed an ridge in the mountains six miles, consisting entirely of that hitherto only soon in United Slates in thin beds, not few Indies, lie reports of these equal to of Masna-Carrara, In Italy, which he is Marble of this kind been hitherto brought a expense, from Italy. We trust this additional development of our mineral will be highly advantageous arts, in hands of our men and Italy owe much of their celebrity in sculpture to of statuary marble in We that if Phidias and had import material foreign countries, posterity would never possessed the noblest examples of art, which their genius has to mankind.

The presses

Newspapers were originally printed on wood presses that were operated by turning a crank by hand. In 1814, the first mechanized printing press was introduced. This steam-powered machine could produce more than 1,000 pages per hour. The production potential was more than tripled in 1832 with the introduction of the first cylinder press. Cylinder presses were much faster than platen and hand presses and could print between 3,000 and 4,000 impressions per hour.

The rate at which newspaper pages were printed doubled again in 1844, when the first rotary press was invented; it could print up to 8,000 copies per hour. Photographs of this era could be reproduced on rotary presses via halftones, providing a higher resolution at a significantly lower cost compared to previous methods of illustration. Larger rotary presses, containing multiple machines, made printing large newspaper runs possible.

The age of mass production of newspapers had begun, and it coincided with the development of a new paper-making technique based on pulping wood. By 1870, wood pulp had completely replaced rags as the main ingredient in producing paper. This innovation allowed paper production to finally keep pace with the increased rate at which newspapers were printed.