Application-relevant text data are valuable in many natural language applications. They can significantly improve
vocabulary selection and language modeling, which are widely used in automatic speech recognition, intelligent input
methods, and similar systems. In some situations, however, relevant data are hard to collect, and the scarcity of
application-relevant training text hampers these natural language processing tasks. In this paper, we propose an
approach that, starting from only a small set of application-specific text, combines unsupervised text clustering and
text retrieval techniques to find relevant text in a large, unorganized corpus and thereby adapt the training corpus
toward the application area of interest. To evaluate the relevance of the retrieved text, and hence the effectiveness
of our corpus adaptation approach, we train n-gram statistical language models on the retrieved text and test them on
the application-specific text. The language models trained on the ranked text bundles yield well-separated
perplexities on the application-specific text. Preliminary experiments on short message text and a large unorganized
corpus demonstrate the effectiveness of the proposed method.
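The perplexity-based evaluation described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the bigram order, add-one smoothing, whitespace tokenization, shared vocabulary, and all function names and example sentences are assumptions made for illustration. Each candidate text bundle trains its own language model, and the bundles are ranked by the perplexity their model assigns to the application-specific text (lower perplexity suggests higher relevance).

```python
import math
from collections import Counter

UNK = "<unk>"  # placeholder for out-of-vocabulary words (assumption)


def tokenize(sentence):
    # Naive whitespace tokenization with sentence boundary markers (assumption).
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def build_vocab(bundles):
    # Shared vocabulary over all bundles, so perplexities are comparable.
    vocab = {UNK}
    for sentences in bundles.values():
        for s in sentences:
            vocab.update(tokenize(s))
    return vocab


def train_bigram(sentences):
    # Count bigram histories and (history, word) pairs.
    history, pairs = Counter(), Counter()
    for s in sentences:
        toks = tokenize(s)
        history.update(toks[:-1])
        pairs.update(zip(toks[:-1], toks[1:]))
    return history, pairs


def perplexity(model, sentences, vocab):
    # Add-one smoothed bigram perplexity of `sentences` under `model`.
    history, pairs = model
    v = len(vocab)
    log_prob, n = 0.0, 0
    for s in sentences:
        toks = [t if t in vocab else UNK for t in tokenize(s)]
        for h, w in zip(toks[:-1], toks[1:]):
            log_prob += math.log((pairs[(h, w)] + 1) / (history[h] + v))
            n += 1
    return math.exp(-log_prob / n)


def rank_bundles(bundles, target):
    # Return (bundle name, perplexity) pairs, most relevant bundle first.
    vocab = build_vocab(bundles)
    scores = {
        name: perplexity(train_bigram(sents), target, vocab)
        for name, sents in bundles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy data, invented for illustration only.
bundles = {
    "chat": ["see you at lunch", "call me later ok", "are you free tonight"],
    "news": ["the parliament passed the budget",
             "markets fell after the announcement"],
}
target = ["are you free for lunch", "call me tonight"]

ranking = rank_bundles(bundles, target)
for name, ppl in ranking:
    print(f"{name}: perplexity {ppl:.2f}")
```

Sharing one vocabulary across all bundle models is a deliberate choice in this sketch: with per-bundle vocabularies, a model that maps most target words to the unknown token could obtain a misleadingly low perplexity.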

Extracting the logical structure of book documents is important for the automatic construction of electronic document databases. The table of contents of a book plays an important role in representing the overall logical structure and reference information of the book. In this paper, we propose a new method that extracts the hierarchical logical structure of book documents, together with the reference information, by combining the spatial and semantic information in the table of contents. Experimental results on various book documents demonstrate the effectiveness and robustness of the proposed approach.