Help:How to clean digitized texts

From PlantUse English

The basic principle is to produce reliable texts so that the reader does not need to go back to the original. If the original text is too difficult to read, it may be advisable to give two versions, one original and the other modernised.

Respect for the original text

  • respect for spelling, including typographical errors and what we may perceive as spelling errors (Wikisource allows spelling errors to be corrected, but this can have unanticipated consequences).
  • respect for character formatting (italics, bold).
  • use of the Unicode standard for texts which include diacritical marks or which are written in non-Latin alphabets.
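
As an illustration of the Unicode point, OCR output sometimes encodes an accented letter as a base letter followed by a combining mark, which looks identical on screen but does not match the precomposed character in searches. The sketch below assumes that normalising to NFC (composed form) is acceptable for the text at hand; the function name `to_nfc` is hypothetical. NFC leaves the long s (ſ) and the eszett (ß) untouched, so the original spelling is not altered.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C (composed characters)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (U+0301): five code points.
decomposed = "cafe\u0301"
# After normalisation the accent is part of a single character (U+00E9).
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))
```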


  • deletion of the hyphens which split a word at the end of a line (with the exception of words split at the end of a page);
  • standardisation of the use of the letters u/v and i/j, which were simple graphical variants;
  • replacement of the long s (ſ) by a normal s;
  • respect for the German eszett (ß).
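
Two of the points above are mechanical enough to be scripted. The following is a minimal sketch, not a prescribed procedure: it assumes the text is processed one page at a time, and the function names are hypothetical. It rejoins words split by a hyphen at the end of a line and replaces the long s, while leaving ß alone. A hyphen at the very end of the page is kept, since no letter follows it on that page. The u/v and i/j standardisation is deliberately not attempted, because it depends on the language and the word.

```python
import re

# A letter, a hyphen, a line break, then another letter: treated as one split word.
# [^\W\d_] matches any Unicode letter (a word character that is not a digit or "_").
SPLIT_WORD = re.compile(r"([^\W\d_])-\n([^\W\d_])")


def join_split_words(page: str) -> str:
    """Remove end-of-line hyphens inside a page and rejoin the two halves."""
    return SPLIT_WORD.sub(r"\1\2", page)


def replace_long_s(page: str) -> str:
    """Replace the long s (U+017F) by a normal s; the eszett is not touched."""
    return page.replace("\u017f", "s")


def clean_page(page: str) -> str:
    """Apply both steps to the text of a single page."""
    return replace_long_s(join_split_words(page))


print(clean_page("la connoiſ-\nſance des plantes\nStraße"))
```

This sketch cannot tell a split word from a genuine compound whose hyphen happens to fall at the end of a line, so the result still needs to be proofread against the original.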

Formatting pages

  • respect for the original structure of pages. Page numbers can be used as section titles, which makes it possible to cite pages precisely. They can also be put in square brackets.
  • in order to distinguish footnotes from the text, they may be separated by a continuous line (we suggest 12 long hyphens).
  • for encyclopedias, each article may be put on a separate page. This makes it easier to link to such pages.


In the introduction, it is best to state which edition the work is based on, what the source of the digitized document is, and what level of reliability you have reached. This last point is a delicate one, as it is the result of a compromise. Hunting down the last errors takes an unlimited amount of time, but conversely, an unreliable text will not be usable, or will force the user to correct it again. Wikisource distinguishes correctors from validators. A text is considered validated only if it has been read by a validator other than the corrector.

Technical aspects

Cleaning digitized texts is time-consuming. Think carefully before charging ahead.

  • choose the best version available. When a book is available on several platforms, they must be compared and tested, and the best one must be chosen. For Candolle's book, Origin of cultivated plants, for example, Google Books gives a bad version, Gallica a correct one and Madrid an excellent one.
  • choose the optimal download option. On the Madrid site, for example, the result is better if you download page by page than if you download in a batch (when reading with Acrobat).
  • choose the best software. Again for Candolle, the result is not as good with Aperçu (Preview on a French Macintosh) as with Acrobat (odd line breaks and tabs).
  • if you clean the text first in Microsoft Word, bear in mind that character formatting and line breaks will not be recognized by MediaWiki. Line breaks will nevertheless break lines in Edit mode, so it is useless to add or delete them. See also: Help:Convert a text formatted with a word-processor?
  • choose a larger font size, to make it easier to see the characters which OCR can confuse (e, c, o...). Choose a font which differentiates 1 (one), l (lowercase L) and I (capital i), which are also confused. Under MediaWiki, such characters can only be told apart once the page is published. You then have to copy the wrong words into a word processor in order to remember them in Edit mode.
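
Part of the hunt for confused characters can be supported by a small script that lists words worth checking by eye. The sketch below is only a set of assumed heuristics, not a complete list of OCR confusions, and the names are hypothetical. It flags words in which a letter touches a digit (such as "l850" for "1850") and words with a capital I after an unaccented lowercase letter (such as "cuItivated"). It will also flag legitimate forms such as "2nd", so every hit still has to be compared with the original.

```python
import re

# Assumed heuristics for the 1 / l / I confusion; adjust them to the text at hand.
SUSPICIOUS_PATTERNS = [
    re.compile(r"[^\W\d_]\d|\d[^\W\d_]"),  # a letter next to a digit
    re.compile(r"[a-z]I"),                 # capital I after a lowercase letter
]


def suspicious_words(text: str) -> list[str]:
    """Return the words that match at least one pattern, in order of appearance."""
    words = re.findall(r"\w+", text)
    return [w for w in words if any(p.search(w) for p in SUSPICIOUS_PATTERNS)]


print(suspicious_words("The cuItivated plants of l850 and 1850"))
```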

Read more

A French book on agronomy is being edited. It may be useful to read the procedure followed for it.