The system presented integrates rule-based and case-based reasoning for artifact recognition in Digital Publishing. In Variable Data Printing (VDP), human proofing can be prohibitive, since a job may contain millions of different instances, each of which may contain two types of artifacts: 1) evident defects, such as text overflow or overlapping; and 2) style-dependent artifacts, subtle defects that appear as inconsistencies with the original job design. We designed a Knowledge-Based Artifact Recognition tool for document segmentation, layout understanding, artifact detection, and document design quality assessment. Evaluation uses one instance of the VDP job, proofed by a human expert, as the reference against which the remaining instances are checked. The rule-based component applies fundamental rules of document design to document segmentation and layout understanding. Ambiguities in the design principles that the rule-based system does not cover are analyzed by case-based reasoning using the Nearest Neighbor Algorithm, in which features from previous jobs are used to detect artifacts and inconsistencies within the document layout. We used a subset of XSL-FO and assembled a set of 44 document samples. The system detected all the job layout changes, with an overall average accuracy of 84.56%; accuracy was highest for overlapping (92.82%) and lowest for lack of white space (66.7%).
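The case-based step described above can be illustrated with a minimal nearest-neighbor sketch. This is not the authors' implementation: the two features (`overlap_ratio`, `whitespace_ratio`), the stored cases, and the function name `nearest_neighbor` are hypothetical, chosen only to show how a new layout could be labeled by its closest previously seen case.

```python
import math

def nearest_neighbor(query, cases):
    """Return (label, distance) of the stored case closest to `query`.

    Distance is Euclidean; `cases` is a list of (feature_vector, label) pairs.
    """
    best_label, best_dist = None, math.inf
    for features, label in cases:
        dist = math.dist(query, features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Hypothetical case base built from previous jobs:
# (overlap_ratio, whitespace_ratio) -> label assigned by a human proofer.
CASES = [
    ((0.00, 0.30), "ok"),
    ((0.25, 0.28), "overlap"),
    ((0.00, 0.05), "lack-of-white-space"),
]

# Features measured on a new instance (assumed values for illustration).
label, dist = nearest_neighbor((0.20, 0.27), CASES)
print(label)
```

In practice the feature vector would need to be normalized so that no single feature dominates the distance; that choice is left open here.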
When designers develop a document layout, their objective is to convey a specific message and provoke a specific response from the audience. Design principles provide the foundation for identifying document components and the relations among them, so that implicit knowledge can be extracted from the layout. Variable Data Printing enables the production of personalized printing jobs, for which traditional proofing of every job instance can be infeasible. This paper describes a rule-based system that uses design principles to segment and understand document context. The system uses the design principles of repetition, proximity, alignment, similarity, and contrast as the foundation for its document segmentation and understanding strategy, which is closely tied to recognizing the artifacts produced when the constraints expressed in the document layout are violated. The tool has two main modules: the geometric analysis module and the design rule engine. The geometric analysis module extracts explicit knowledge from the data provided in the document. The design rule engine uses the information provided by the geometric analysis to establish logical units inside the document. We used a subset of XSL-FO sufficient for designing documents of adequate complexity. The system identifies components such as headers, paragraphs, lists, and images, and determines the relations between them, such as header-paragraph and header-list. The system provides accurate information about the geometric properties of the components, detects the elements of the documents, and identifies corresponding components between a proofed instance and the remaining instances in a Variable Data Printing job.
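As a rough illustration of how geometric analysis could feed a design rule, the sketch below pairs each header with the nearest left-aligned block beneath it (proximity and alignment) and flags overlapping boxes. Everything here is assumed for the example: the `Box` type, the `relate_headers` and `overlaps` helpers, the thresholds `max_gap` and `align_tol`, and the sample coordinates do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    name: str
    kind: str   # "header", "paragraph", "list", or "image"
    x: float    # left edge
    y: float    # top edge (y grows downward)
    w: float    # width
    h: float    # height

def overlaps(a, b):
    """True when the two rectangles share a region of positive area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)

def relate_headers(boxes, max_gap=20.0, align_tol=2.0):
    """Pair each header with the closest left-aligned block below it."""
    relations = []
    for head in (b for b in boxes if b.kind == "header"):
        candidates = [
            b for b in boxes
            if b.kind != "header"
            and abs(b.x - head.x) <= align_tol               # alignment
            and 0 <= b.y - (head.y + head.h) <= max_gap      # proximity
        ]
        if candidates:
            target = min(candidates, key=lambda b: b.y)
            relations.append((head.name, target.name, f"header-{target.kind}"))
    return relations

# Hypothetical page: a header, a paragraph under it, and an image that
# intrudes into the paragraph's area.
page = [
    Box("h1", "header", 50, 40, 300, 20),
    Box("p1", "paragraph", 50, 70, 300, 80),
    Box("img1", "image", 200, 140, 100, 100),
]

relations = relate_headers(page)
clashes = [(a.name, b.name)
           for i, a in enumerate(page)
           for b in page[i + 1:]
           if overlaps(a, b)]
print(relations, clashes)
```

Running the same checks on the proofed instance and on each remaining instance, then comparing the resulting relations, is one plausible way to spot layout changes between instances.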
Conference Committee Involvement (1)
16 January 2006 | San Jose, California, United States