The U.S. Food and Drug Administration (FDA) has approved two digital pathology systems for primary diagnosis. These systems produce and consume whole slide images (WSIs) constructed from glass slides using advanced digital slide scanners. WSIs can greatly improve the workflow of pathologists by enabling novel image analytics software that automatically detects cellular and morphological features and diagnoses disease from histopathology slides. However, the gigabyte size of a WSI poses a serious challenge for the storage and retrieval of millions of WSIs. In this paper, we propose a system for scalable storage of WSIs and fast retrieval of image tiles using DRAM. A WSI is partitioned into tiles and sub-tiles using a combination of a space-filling curve, recursive partitioning, and Dewey numbering. The tiles and sub-tiles are then stored as a collection of key-value pairs in DRAM. During retrieval, a tile is fetched from DRAM using key-value lookups. Through performance evaluation on a 24-node cluster using 100 WSIs, we observed that, compared to Apache Spark, our system stored the 100 WSIs three times faster and accessed a single tile 1,000 times faster, achieving millisecond latency. Such fast access to tiles is highly desirable when developing deep learning-based image analytics solutions on millions of WSIs.
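The tile-keying idea in this abstract can be illustrated with a minimal sketch. It is not the system's actual implementation: the abstract does not name the space-filling curve or the key layout, so the sketch assumes a Z-order curve, a square grid of 2^depth by 2^depth tiles, and a hypothetical key format `wsi_id/q1.q2...`, where each Dewey component is the quadrant chosen at one level of recursive partitioning. A plain Python dictionary stands in for the DRAM-resident key-value store.

```python
from typing import Dict, Tuple


def dewey_key(row: int, col: int, depth: int) -> str:
    """Dewey-style label for tile (row, col) in a 2^depth x 2^depth grid.

    Assumption: quadrants are numbered 0-3 in Z-order (row bit, then column
    bit), and one component is emitted per level of recursive partitioning.
    """
    parts = []
    for level in range(depth - 1, -1, -1):
        r_bit = (row >> level) & 1
        c_bit = (col >> level) & 1
        parts.append(str((r_bit << 1) | c_bit))
    return ".".join(parts)


def store_tiles(store: Dict[str, bytes], wsi_id: str,
                tiles: Dict[Tuple[int, int], bytes], depth: int) -> None:
    """Write every tile of one WSI into the key-value store."""
    for (row, col), data in tiles.items():
        store[f"{wsi_id}/{dewey_key(row, col, depth)}"] = data


def fetch_tile(store: Dict[str, bytes], wsi_id: str,
               row: int, col: int, depth: int) -> bytes:
    """Retrieve one tile with a single key-value lookup."""
    return store[f"{wsi_id}/{dewey_key(row, col, depth)}"]


if __name__ == "__main__":
    kv: Dict[str, bytes] = {}  # stand-in for the in-memory store
    grid = {(r, c): f"tile-{r}-{c}".encode() for r in range(4) for c in range(4)}
    store_tiles(kv, "wsi_001", grid, depth=2)
    print(fetch_tile(kv, "wsi_001", 1, 2, depth=2))
```

Because the key is computed directly from the tile coordinates, retrieval in this sketch needs no scan or index traversal, only one lookup per tile.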
Whole slide images (WSIs) can greatly improve the workflow of pathologists by enabling software for automatic detection and analysis of cellular and morphological features. However, the gigabyte size of a WSI poses a serious challenge for scalable storage and fast retrieval, both of which are essential for next-generation image analytics. In this paper, we propose a system for scalable storage of WSIs and fast retrieval of image tiles using Apache Spark, a space-filling curve, and popular data storage formats. We investigate two schemes for storing the tiles of WSIs. In the first scheme, all the WSIs are stored in a single table, partitioned by certain table attributes for fast retrieval. In the second scheme, each WSI is stored in a separate table. The records in each table are sorted by the index values assigned by the space-filling curve. We also study two data storage formats for storing WSIs: Parquet and ORC (Optimized Row Columnar). Through performance evaluation on a 16-node cluster in CloudLab, we observed that ORC enables faster retrieval of tiles than Parquet and requires six times less storage space. We also observed that the two schemes for storing WSIs achieved comparable performance. On average, our system took 2 seconds to retrieve a single tile and less than 6 seconds to retrieve 8 tiles on up to 80 WSIs. We also report the tile retrieval performance of our system on Microsoft Azure to gain insight into how the underlying computing platform affects the performance of our system.
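The sorting step in this abstract can be sketched in a few lines. This is an illustration under stated assumptions rather than the system's code: the abstract does not say which space-filling curve is used, so a Z-order (Morton) curve is assumed, and sorted Python lists stand in for a Spark table stored in Parquet or ORC. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Optional, Tuple


def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of row and col into a Z-order index (assumed curve)."""
    idx = 0
    for i in range(bits):
        idx |= ((col >> i) & 1) << (2 * i)
        idx |= ((row >> i) & 1) << (2 * i + 1)
    return idx


def build_table(tiles: Iterable[Tuple[int, int, bytes]]) -> Tuple[List[int], List[bytes]]:
    """Return parallel lists of indices and payloads, sorted by curve index."""
    records = sorted(((morton_index(r, c), p) for r, c, p in tiles),
                     key=lambda rec: rec[0])
    return [k for k, _ in records], [p for _, p in records]


def lookup(keys: List[int], payloads: List[bytes],
           row: int, col: int) -> Optional[bytes]:
    """Binary-search the sorted indices for one tile; None if absent."""
    target = morton_index(row, col)
    pos = bisect_left(keys, target)
    if pos < len(keys) and keys[pos] == target:
        return payloads[pos]
    return None


if __name__ == "__main__":
    tiles = [(r, c, f"tile-{r}-{c}".encode()) for r in range(4) for c in range(4)]
    keys, payloads = build_table(tiles)
    print(lookup(keys, payloads, 1, 2))
```

Sorting by curve index keeps spatially adjacent tiles close together in the table, which is the usual motivation for ordering records this way when neighbouring tiles are often read together.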