One of the main problems in genomics is finding similarities between different species represented by DNA sequences. Dynamic programming algorithms (Needleman-Wunsch, Smith-Waterman) give a good measure of similarity, but are not efficient for large data sets. In this study we present a new heuristic algorithm based on common parts of reads. The approach can handle all types of sequencing errors: insertions, deletions and substitutions. The results of our algorithm are similar to those of other well-known tools. The presented algorithm is implemented in C++, uses the Boost libraries, and internally uses threads for parallel computing. The algorithm is part of the DNA assembler 'dnaasm'. Source code, a demo application and supplementary materials are available at the project homepage: http://dnaasm.sourceforge.net.
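To illustrate why the dynamic programming algorithms named above scale poorly, here is a minimal sketch of the Needleman-Wunsch global alignment score. It is the textbook baseline, not the heuristic implemented in dnaasm; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example. The table has one cell per pair of positions, so time grows with the product of the two sequence lengths.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so memory is O(len(b)) while time stays O(len(a) * len(b)).
    The scoring values are illustrative assumptions.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b (gaps only).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            deletion = prev[j] + gap       # a[i-1] aligned to a gap
            insertion = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr[j] = max(diagonal, deletion, insertion)
        prev = curr
    return prev[len(b)]
```

For two reads of 20 kbp each, this fills about 4 x 10^8 cells per pair, which is why heuristics that first look for shared parts of reads are attractive for large data sets.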
In this paper we consider the problem of detecting and recognizing widgets in screenshots of the graphical user interfaces (GUIs) of computer programs. This problem is fundamental in business process automation. The solution we propose is based on detecting GUI elements with the Canny edge operator and then recognizing the detected elements with classifiers: neural networks, random forests, XGBoost, and others.
The most significant differences between second- and third-generation sequencing are the length of reads and the percentage of errors. In the field of de novo DNA assembly there is a need for new effective algorithms, as those used for second-generation reads are highly ineffective or even unusable when applied to third-generation reads. In this article we propose a solution tailored for DNA assembly of reads from third-generation sequencers. The approach uses the overlap-layout-consensus (OLC) graph method and is composed of a number of algorithms focused on time and memory optimization. The proposed algorithm was implemented as a shared library and added as a new module to the 'dnaasm' de novo assembler. The implementation has been tested on simulated as well as real data. The results show improvements in speed and memory consumption in comparison with other de novo DNA assemblers.
Second-generation sequencing techniques opened the door to further research on a world scale, because the cost of DNA sequencing dropped significantly. However, second-generation sequencing technology has some drawbacks, mainly short read length. In 2017, new devices that use real-time sequencing started to become available. This approach, called "third-generation sequencing", achieves read lengths of 20 kbp and an error rate of about 15%. As a consequence, new DNA assemblers were developed. In this article we propose an implementation of an overlap graph-based de novo assembly algorithm for third-generation sequencing data. The proposed method involves graph algorithms and dynamic programming, optimized using a MinHash filter. The solution has been tested on both simulated and real bacterial data obtained from the Oxford Nanopore MinION sequencer. The algorithm is included in the "OLC" module of the dnaasm de novo assembler. The dnaasm application provides a command line interface as well as a web browser-based client. Source code, a demo web application and a Docker image are available at the dnaasm project web page: http://dnaasm.sourceforge.net.
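The idea behind a MinHash filter can be sketched as follows: each read is reduced to a short signature of minimum hash values over its k-mers, and the fraction of matching signature positions estimates the Jaccard similarity of the two k-mer sets, so that costly dynamic programming is run only on promising pairs. This is a generic illustration, not the dnaasm implementation; the function names, the k-mer length, the number of hash functions and the use of salted BLAKE2b as the hash family are all assumptions.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=8, num_hashes=64):
    """Signature = minimum hash value over all k-mers, one per hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

A filter built on this would compare signatures of read pairs and pass only those whose estimate exceeds a chosen threshold to the alignment step; the threshold itself would have to be tuned to the error rate of the reads.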
Second-generation sequencing methods produce high-quality short reads, which are assembled into contigs by DNA assemblers. Because the length of a single read is limited to 500 bp, it is very hard to assemble full genomes or full chromosomes. Generating longer contigs at a low sequencing cost is a main effort of computer scientists in this area. We propose to link contigs created from second-generation reads using reads from third-generation sequencers. Such reads have a length of 10-20 kbp. An existing implementation of this approach appears to be time and memory demanding for larger genomes. We developed an algorithm based on a Bloom filter and an extremely memory-efficient associative array. Our implementation considerably outperforms the previous one in terms of time and memory consumption. The presented algorithm, provided as a shared library, is part of the dnaasm de novo assembler. The library has been created using the C++ programming language and the Boost and Google Sparse Hash libraries. Both a web browser-based graphical user interface and a command line interface are provided. Source code, a demo web application and a Docker image are available at the dnaasm project web page: http://dnaasm.sourceforge.net. Our application has been tested on real data from bacterial, yeast and plant genomes.
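As a rough illustration of why a Bloom filter saves memory, the sketch below stores set membership in a fixed-size bit array instead of storing the items themselves. It is a generic example in Python, not the C++ code used in dnaasm; the class name, the default sizes and the salted BLAKE2b hash family are assumptions. Lookups can return false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Yield one bit position per hash function for the given string."""
        data = item.encode("ascii")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        """Set every bit position associated with the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all of the item's bit positions are set."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

In a contig-linking setting, k-mers taken from contigs could be added to such a filter so that k-mers from long reads are checked cheaply before any lookup in the larger associative array; that usage is an assumption about how the two structures might be combined.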
Many organisms, humans in particular, contain sections of the genome that can be present in a varying number of copies between individuals. This phenomenon is called copy number variation (CNV) and in many cases is associated with genetic diseases. However, the accuracy of CNV detection in the human genome is still low. We propose a new algorithm for common CNV detection based on artificial intelligence algorithms. We generalize the common CNV detection task to a classification problem. In this paper we present several classification models and compare their ability to detect common CNVs. The algorithm consists of three stages: counting depth of coverage in targets (whole exome sequencing), quality control of targets, and training the models. The trained models are then used to detect CNVs in a new sample. The proposed approach was tested, and the obtained CNV calls showed the correctness of our proposal. Our approach is designed to detect only common CNVs; for these, the results show that its sensitivity and specificity are higher than those of other algorithms. Rare CNVs are not discovered, but we plan to extend the presented approach to detect rare CNVs as well (based on anomaly detection algorithms). The presented approach could improve the accuracy of common CNV detection in the human genome. The described method could be useful in laboratories where a large annotated dataset of common CNVs exists. Moreover, to our knowledge, this is the first paper to show the use of artificial intelligence methods for the common CNV detection problem.
The development of next-generation sequencing opens the possibility of using sequencing in various plant studies, such as finding structural changes and small polymorphisms between and within species. Most analyses rely on genomic sequences, so it is crucial to use well-assembled genomes of high quality and completeness. Herein we compare commonly available programs for genome assembly with newly developed software, dnaasm. Assemblies were tested on cucumber (Cucumis sativus L.) lines obtained by in vitro regeneration (somaclones) and showing different phenotypes. The results show that the dnaasm assembler is a good tool for short-read assembly, yielding genomes of high quality and completeness.
Genome sequencing is the core of genomic research. With the development of NGS and the falling cost of the procedure, another bottleneck has emerged: genome assembly. Developing a proper tool for this task is essential, as the quality of the genome has an important impact on further research. Here we present a comparison of several de Bruijn assemblers tested on C. sativus genomic reads. The assessment shows that the newly developed software, dnaasm, provides better results in terms of quantity and quality. The number of generated sequences is lower by 5 to 33%, with an N50 up to two times higher. A quality check showed that the results generated by dnaasm are reliable. This provides us with a very strong base for future genomic analysis.
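For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of its standard definition follows: N50 is the largest length L such that contigs of length at least L together cover at least half of the total assembly length. The function name and the example lengths are hypothetical and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the total length.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids rounding issues
            return length
```

With hypothetical contig lengths of 80, 70, 50, 40, 30 and 20, the total is 290 and the two longest contigs already cover 150, so the N50 is 70. A higher N50 combined with fewer sequences indicates a less fragmented assembly.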
Katome is a new de novo sequence assembler written in the Rust programming language, designed with future parallelization of the algorithms and optimization of run time and memory usage in mind. The application uses new algorithms for the correct assembly of repetitive sequences. Performance and quality tests were performed on various data, comparing the new application to the 'dnaasm', 'ABySS' and 'Velvet' genome assemblers. Quality tests indicate that the new assembler creates more contigs than well-established solutions, but the contigs have better quality with regard to mismatches per 100 kbp and indels per 100 kbp. Additionally, benchmarks indicate that the Rust-based implementation outperforms the 'dnaasm', 'ABySS' and 'Velvet' assemblers, written in C++, in terms of assembly time. Lower memory usage in comparison to 'dnaasm' is also observed.
Next-generation sequencing techniques produce a large amount of sequencing data. Some parts of the genome are composed of repetitive DNA sequences, which are very problematic for existing genome assemblers. We propose a modification of the DNA assembly algorithm that uses the relative frequency of reads to properly reconstruct repetitive sequences. The new approach was implemented and tested; as a demonstration of the capability of our software, we present results for model organisms. A three-layer software architecture was selected for the new implementation, with the presentation layer, data processing layer, and data storage layer kept separate. Source code, a demo application with a web interface, and additional data are available at the project web page: http://dnaasm.sourceforge.net.