Imaging text documents always introduces a certain amount of noise and other artifacts. The fidelity of electronic reproduction therefore depends heavily on the accuracy of noise removal algorithms. Current algorithms attempt to remove artifacts with filters based on convolution or other methods that are "invasive" with respect to the original representation of the text document. As a result, it is highly desirable to design noise removal algorithms that restore the image to the original representation of the text, removing only noise and added artifacts without blurring or tampering with font corners and edges. In this paper, we present a solution to this problem by designing a filter based on accurate statistics of text in its Matrix Frequency Representation, which was developed earlier by the authors.
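The edge-blurring behavior of convolution-based filtering can be seen in a minimal sketch. The 3x3 box (mean) filter below is a generic stand-in for the "invasive" convolution methods the abstract refers to, not the authors' filter; the glyph stroke is a toy example:

```python
import numpy as np

def mean_filter(img, k=3):
    """Naive k x k mean (box) convolution with edge replication."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# A sharp one-pixel vertical stroke, as at a glyph edge: 0 = background, 1 = ink.
img = np.zeros((7, 7))
img[:, 3] = 1.0

smoothed = mean_filter(img)
print(img[3])       # [0. 0. 0. 1. 0. 0. 0.]
print(smoothed[3])  # the hard 0/1 transition is spread over neighboring columns
```

Such a filter damps isolated speckle noise, but it damages the very feature text restoration must preserve: the stroke's contrast drops below 1 and its ink bleeds into adjacent background columns, i.e., corners and edges are blurred along with the noise.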
The advent of the Internet has opened a host of new and exciting questions in the science and mathematics of information organization and data mining. In particular, a highly ambitious promise of the Internet is to bring the bulk of human knowledge to everyone with access to a computer network, providing a democratic medium for sharing and communicating knowledge regardless of the language of communication. The development of knowledge sharing and communication via the transfer of digital files is the first crucial achievement in this direction. Nonetheless, available solutions to numerous ancillary problems remain far from satisfactory. Among such outstanding problems are the fundamental questions that have driven the emergence and rapid growth of the new field of Knowledge Engineering, namely, the classification of forms of data, their effective organization, the extraction of knowledge from massive distributed data sets, and the design of fast, effective search engines.

The precision of machine learning algorithms in the classification and recognition of image data (e.g., data scanned from books and other printed documents) is still far from human performance and speed on similar tasks. Discriminating the many forms of ASCII data from one another is not as difficult, in view of the emerging universal standards for file formats. Nonetheless, most past and relatively recent human knowledge has yet to be transformed and saved in such machine-readable formats. In particular, an outstanding problem in knowledge engineering is the organization and management, with precision comparable to human performance, of knowledge in the form of document images that broadly consist of text, images, or a blend of both. It has been shown that the effectiveness of OCR is intertwined with the success of language and font recognition.