Matrix frequency analysis and its applications to language classification of textual data for English and Hebrew
30 January 2003
The advent of the internet has opened a host of new and exciting questions in the science and mathematics of information organization and data mining. In particular, a highly ambitious promise of the internet is to bring the bulk of human knowledge to everyone with access to a computer network, providing a democratic medium for sharing and communicating knowledge regardless of the language of the communication. Sharing and communicating knowledge through the transfer of digital files is the first crucial achievement in this direction. Nonetheless, the available solutions to numerous ancillary problems remain far from satisfactory. Among these outstanding problems are the fundamental questions responsible for the emergence and rapid growth of the new field of Knowledge Engineering: classification of forms of data, their effective organization, extraction of knowledge from massive distributed data sets, and the design of fast, effective search engines. The precision of machine learning algorithms in the classification and recognition of image data (e.g., images scanned from books and other printed documents) is still far from human performance and speed in similar tasks. Discriminating among the many forms of ASCII data is less difficult, in view of the emerging universal standards for file formats. Nonetheless, most past and relatively recent human knowledge has yet to be transformed and saved in such machine-readable formats. In particular, an outstanding problem in knowledge engineering is the organization and management, with precision comparable to human performance, of knowledge in the form of document images that are broadly text, image, or a blend of both. Earlier work has shown that the effectiveness of OCR is intertwined with the success of language and font recognition.
© (2003) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Joseph Henry Uchill, Amir H. Assadi, "Matrix frequency analysis and its applications to language classification of textual data for English and Hebrew", Proc. SPIE 4793, Mathematics of Data/Image Coding, Compression, and Encryption V, with Applications, (30 January 2003); doi: 10.1117/12.454831; https://doi.org/10.1117/12.454831

