DocRead: an Automatic Generator of Recognition Systems for Structured Documents

Contact: Bertrand Coüasnon (Bertrand.Couasnon@irisa.fr)

DocRead is an automatic generator of recognition systems for structured documents. It was developed using the Dmos method. It consists of a compiler for the EPF language (in which a document can be described), a parser associated with this language, an early vision module (binarization and line segment detection), and a classifier capable of rejection. This generator allows fast adaptation to a new type of document: it is only necessary to define a new grammar in EPF that describes the new type of document. Then, if necessary, the classifier is retrained so that it can recognize new symbols. The new recognition system, adapted to the new structured document, is produced by compilation.
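To make two of the components above more concrete, here is a minimal Python sketch of a binarization step and of a classifier that can reject. It is not DocRead's implementation, which is not detailed here. The function names `binarize` and `classify_with_reject`, the fixed global threshold, and the score and margin thresholds are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Global-threshold binarization (an assumed, simplistic scheme).

    `gray` holds grey levels in 0..255; pixels darker than `threshold`
    are treated as ink (1), all others as background (0).
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def classify_with_reject(
    scores: Dict[str, float],
    min_score: float = 0.6,
    min_margin: float = 0.2,
) -> Optional[str]:
    """Return the best-scoring label, or None to signal a rejection.

    `scores` maps each candidate symbol label to a confidence in [0, 1],
    as produced by some underlying classifier (hypothetical here).
    The symbol is rejected when the best score is too low, or when it is
    too close to the runner-up to be trusted.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score < min_score or best_score - runner_up < min_margin:
        return None
    return best_label


if __name__ == "__main__":
    print(binarize([[0, 200], [130, 90]]))
    print(classify_with_reject({"quarter_note": 0.9, "half_note": 0.05}))
    print(classify_with_reject({"quarter_note": 0.5, "half_note": 0.45}))
```

Returning `None` instead of a forced label lets a caller, such as a grammar-driven parser, treat an uncertain symbol as unrecognized rather than propagate a likely error.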

With this generator, we have already produced several recognition systems for structured documents:

  • ScoRead: a prototype for musical score recognition;
  • MathRead: a prototype for mathematical formulae recognition;
  • TennisRead: a prototype for tennis court detection in video sequences;
  • TabRead: a prototype for table structure recognition;
  • DecRead: a prototype for the recognition of 19th-century naturalization decrees, which are fully handwritten forms;
  • FormuRead: software for the recognition of 19th-century military forms despite their deterioration. This software has been successfully tested on 88,954 pages from the Archives de la Mayenne and the Archives des Yvelines.
    Image of the form
    Detected structure