Open Access
7 June 2023 cq100: a high-quality image dataset for color quantization research
M. Emre Celebi, María-Luisa Pérez-Delgado
Author Affiliations +

Color quantization (cq) is a classical image processing operation that reduces the number of distinct colors in a given image. Although the idea of cq dates back to the early 1970s, the first true cq algorithm, median-cut, was proposed later in 1980. Since then, hundreds of publications have investigated the topic of cq, proposing dozens of algorithms. A vast majority of these publications demonstrate their results on small datasets, containing a handful of images of mixed quality. Furthermore, the reproducibility of cq research is often limited due to the use of private test images or public test images with multiple non-identical copies on the World Wide Web or restrictive licenses. To address these problems, we curated a large, diverse, and high-quality dataset of 24-bit color images called cq100 and released it under a permissive license. We present an overview of cq100 and demonstrate its use in comparing cq algorithms.



24-bit color images have become commonplace since the turn of the millennium.1,2 These images typically contain hundreds of thousands of distinct colors, which complicate their display, storage, transmission, processing, and analysis. Color quantization (cq) is a common image processing operation that reduces the number of distinct colors in a given image. Figure 1 shows the pencils image (Ref. 3, cc0 license, 768×512  pixels) and its quantized versions with 4, 16, 64, and 256 colors obtained using the median-cut algorithm.4 It can be seen that the reproduction is reasonably accurate with only 64 colors and is nearly indistinguishable from its original with 256 colors.

Fig. 1

Pencils and its various quantized versions: (a) original (117,157 colors), (b) 4 colors, (c) 16 colors, (d) 64 colors, and (e) 256 colors.


cq is composed of two phases: color palette design and pixel mapping. In the former phase, a small set of colors, called the color palette, representing the input colors is selected, whereas, in the latter phase, each pixel in the input image is assigned to one of the palette colors. The purpose of cq is to reduce the number of distinct colors in a given image to a significantly smaller number with minimal distortion. Since natural images often contain a wide range of colors, faithful reproduction of such images with a small color palette is a challenging problem. In fact, cq can be characterized as a large-scale combinatorial optimization problem.5

There are several ways to classify cq algorithms. Image-independent algorithms design a universal color palette without regard to any particular input image, whereas image-dependent ones design a custom color palette based on the distribution of the colors in a given input image. Unsurprisingly, a vast majority of cq algorithms are image-dependent. Another classification scheme is based on the nature of the underlying clustering algorithm.6 Hierarchical algorithms recursively find nested clusters in a top-down (or divisive) or bottom-up (or agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data without imposing a hierarchical structure on the data.7 Early cq algorithms were mostly hierarchical, whereas modern cq algorithms tend to be partitional.8 Yet another classification scheme is based on whether or not the palette size is allowed to change during the cq process. Static algorithms assume that the palette size is a constant to be specified by the user in advance, whereas dynamic algorithms compute the palette size automatically at run-time. A vast majority of cq algorithms proposed to date are image-dependent and static. Therefore, in this paper, we focus primarily on such algorithms. (The extraction of relevant colors (i.e., colors that stand out to an observer9) from a color image, which is an important task in specific application domains such as fine arts9,10 and medicine,11,12 is outside the scope of this study.)

Many popular partitional algorithms are based on the local optimization of an objective function, which is typically nonsmooth and nonconvex with numerous local optima. Hence, such algorithms are often highly dependent on initialization13 and can easily get stuck in a poor local optimum. By contrast, metaheuristic-based algorithms are formulated based on global optimization and, thus, are less sensitive to initialization.

cq was a necessity in the past due to the limitations of the display hardware, many of which could not handle the number of colors that can be present in a typical 24-bit image. Over the past decades, 24-bit display hardware have become ubiquitous. However, cq is still used in many visual computing applications as a preprocessing step. Modern applications of cq include non-photorealistic rendering, image matting, image dehazing, image compression, color-to-grayscale conversion, image watermarking/steganography, image segmentation, content-based image retrieval, color analysis, saliency detection, and skin detection. For specific references, see Ref. 8.

Despite the over four decades of research on cq, there is currently no benchmark dataset on which cq algorithms can be developed, tested, and compared.8 Many cq studies employ a subset of the usc-sipi image database (Available at Ref. 14) or the Kodak lossless true color image suite (Available at Ref. 15). The usc-sipi dataset contains 14 scanned images of relatively poor quality (in modern standards). Furthermore, the copyright status of many of these images is unknown. The Kodak dataset contains 24 relatively high-quality images in the public domain. However, the dataset is neither sufficiently diverse nor sufficiently large to permit comprehensive evaluation of cq algorithms. (For example, in the usc-sipi dataset, three (21%) images (4.1.01, 4.1.03, and 4.1.04) are portrait photos, two images (4.1.04 and 4.1.02) show the same person, and two images (4.1.07 and 4.1.08) depict identical objects. On the other hand, one-third of the Kodak images portray water bodies. In addition, the dataset contains pairs of images featuring identical objects [i.e., kodim02 depicts a subregion of kodim01, whereas kodim06 and kodim11 depict the same boat captured in two separate yet similar scenes.]) To address these problems, we curated a large, diverse, and high-quality dataset of 24-bit color images called cq100 and released it under a permissive license.

The remainder of this paper is organized as follows. Section 2 gives a detailed description of our dataset. Section 3 demonstrates the use of our dataset in comparing cq algorithms. Finally, Sec. 4 concludes the paper and suggests future research directions.


Description of the Dataset

Our objective was to collect a large, diverse, and high-quality dataset of images and release it under a permissive license. We started by collecting over 100 images from three public sources:

  • Wikimedia Commons (Available at Ref. 16): A repository of over 88 million freely useable media files.

  • PxHere (Available at Ref. 3): a repository of over one million cc0 -licensed images.

  • Kodak lossless true color image suite: a collection of 24 Kodak photo cd images placed in the public domain by the Eastman Kodak Company.

During the image collection process, we paid particular attention to the diversity of content and license compatibility issues. Once we obtained a tentative dataset with 100+ images, we eliminated the images with outlying aspect ratios using the following iterative process. We first computed the mean (m) and standard deviation (s) of the aspect ratios (Ordinarily, the aspect ratio of an H×W image is given by W/H. However, to treat portrait and landscape images uniformly, we computed the aspect ratio as max(W/H,H/W).) of the current set of images and then eliminated those with outlying aspect ratios, that is, aspect ratios outside the range [m2s,m+2s]. We added a few more images to the set (from the aforementioned public sources) and then repeated the above outlier removal process until we were left with 100 images. The characteristics of this dataset were:

  • Sources: Wikimedia Commons (73), PxHere (23), and Kodak lossless true color image suite (4).

  • Categories: animals (18), food (17), miscellaneous (6), objects (23), people (8), places (11), plants (6), and vehicles (11).

  • Licenses: public domain (14), cc0 (25), cc by (7), and cc by-sa (54).

The images in our initial dataset had dimensions ranging from 768×512 to 6498×8123. To facilitate comparisons, it is desirable to have all images contain the same number of pixels. A simple way to achieve this goal is to resize all images to the same dimensions (and thus the same aspect ratio). To this end, we performed a grid search between the minimum (l=1.25) and maximum (u=1.64) aspect ratios in the dataset with a step size of (ul)/1000. The optimal aspect ratio minimizing the average relative deviation (Given an image, let a and a˜ be its aspect ratios before and after resizing, respectively. The relative deviation in the aspect ratio caused by the resizing operation is then ε=|a˜a|/a.) turned out to be 1.49973, which is very close to the standard 3:2 aspect ratio commonly used in 35-mm film photography and digital single-lens reflex camera (dslr) photography. Accordingly, we resized [We used the resize operator (with the default Lanczos resampling filter) of the convert program of ImageMagick 7.1.0-33 (available at Ref. 17)] all portrait and landscape images in our initial dataset respectively to 512×768 and 768×512 to achieve our target aspect ratio of 1.5. These resizing operations caused an average of only 5.35% relative distortion in the aspect ratio, which is negligible. Other than resizing, file renaming and file format conversion were the only modifications made to the original images. Most of the files originally had meaningless or long and overly descriptive names, some reaching 120 characters. Therefore, we renamed each file concisely to accurately reflect its contents. This renaming operation reduced the mean filename length from about 40 to about 12. As for file formats, 95 of the images were originally stored in the jpg format and 5 in the png format. We converted all images to the uncompressed binary ppm format. ppm is popular in visual computing because it is uncompressed, and ppm files are easy to read and write, owing to their extremely simple structure. As mentioned earlier, each image in cq100 has one of four types of licenses (For an overview of these licenses, visit Ref. 18.) (listed in decreasing permissiveness):

  • Public domain: no known copyright.

  • cc0 (Creative Commons zero 1.0 universal): no rights reserved.

  • cc by (Creative Commons attribution): others may share and adapt the work, even commercially, as long as they credit the copyright holder for the original work.

  • cc by-sa (Creative Commons attribution-share alike): others may share and adapt the work, even commercially, as long as they credit the copyright holder for the original work and license their new works under the same license as the original.

Since these licenses are mutually compatible, we released cq100 under the least permissive of them, namely the cc by-sa 4.0 license. Nevertheless, this license allows anyone to share and modify our dataset with the restrictions noted above. The metadata associated with each image include the following 16 attributes: original image filename, modified image filename, image category, source url, license, license url, author, author url, modifications made to the original image, additional notes about the original image, original image width, original image height, original image number of colors, modified image width, modified image height, and modified image number of colors.

To demonstrate the diversity of cq100, we show thumbnails for images 1 through 25, 26 through 50, 51 through 75, and 76 through 100 in Figs. 2Fig. 3Fig. 45, respectively. The numeric id around each thumbnail indicates the alphabetical order of the filename of the corresponding image, e.g., 1: adirondack_chairs; 2: astro_bodies,…, 99: wool_carder_bee; and 100: yasaka_pagoda. A more detailed view of a subset of cq100 is given in Fig. 6, which displays a sample image from each category. Note that for each image featured in Figs. 2Fig. 3Fig. 4Fig. 56, the original license is given in the respective caption.

Fig. 2

Thumbnails for images 1 through 25 (3, 5, 10, 18, and 21 are in the public domain; 6, 15, 16, and 23 are licensed under cc0; the others are licensed under cc by-sa).


Fig. 3

Thumbnails for images 26 through 50 (35, 39, 44, and 48 are in the public domain; 26, 28, 29, 36, 37, 41, 42, 47, and 49 are licensed under cc0; 34, 45, and 50 are licensed under cc by; and the others are licensed under cc by-sa)


Fig. 4

Thumbnails for images 51 through 75 (51, 60, 73, and 74 are in the public domain; 63, 64, and 66 are licensed under cc0; 67 is licensed under cc by; and the others are licensed under cc by-sa).


Fig. 5

Thumbnails for images 76 through 100 (79 is in the public domain; 76, 78, 82, 86, 88, 91, 92, 97, and 98 are licensed under cc0; 84, 89, and 96 are licensed under cc by; and the others are licensed under cc by-sa).


Fig. 6

Sample images from cq100: (a) red-eyed tree frog (animals), (b) fruits (food), (c) color checker (miscellaneous), (d) nylon cords (objects), (e) schoolgirls (people), (f) cosmic vista (places), (g) daisy bouquet (plants), and (h) sports bicycles (vehicles). Images (a), (e), and (f) are in the public domain; the others are licensed under cc0.


Figures 7 and 8 show box plots of the distribution of respectively the number of distinct colors and colorfulness19 per image for cq100, Kodak, and usc-sipi datasets. It can be seen that cq100 has a greater color diversity compared to the other two datasets.

Fig. 7

Box plot of the distribution of the number of distinct colors per image for cq100, Kodak, and usc-sipi datasets.


Fig. 8

Box plot of the distribution of the colorfulness per image for cq100, Kodak, and usc-sipi datasets



Demonstration of the Dataset

In this section, we demonstrate the use of our cq100 dataset on a cq task. First, we describe the experimental setup and present a brief qualitative assessment. Then, we conduct a more detailed quantitative assessment and discuss its implications.


Experimental Setup and Qualitative Assessment

We compare the following 21 cq algorithms (listed in chronological order) with respect to their effectiveness [An effective cq algorithm is one that produces minimal distortion, as quantified by a chosen image fidelity metric (see Sec. 3.2).]:

  • Median-cut algorithm4 (divisive hierarchical) (mc)

  • Popularity algorithm4 (partitional) (pop)

  • Octree algorithm20 (agglomerative hierarchical) (oct)

  • Wan et al.’s marginal variance minimization algorithm21 (divisive hierarchical) (wan)

  • Orchard and Bouman’s binary splitting algorithm22 (divisive hierarchical) (bs)

  • Wu’s variance minimization algorithm23 (divisive hierarchical) (wu)

  • Dekker’s self-organizing map algorithm24 (partitional) (som)

  • Split-and-merge algorithm25 (agglomerative hierarchical) (sam)

  • Batch k-means algorithm initialized using vcl26 (partitional) (wsm)

  • Fuzzy c-means algorithm initialized using wu27 (partitional) (wfcm)

  • Adaptive distributing units algorithm28 (partitional) (adu)

  • Variance-cut algorithm with Lloyd iterations29 (divisive hierarchical) (vcl)

  • Ant-tree algorithm30 (metaheuristic) (atcq)

  • Firefly algorithm combined with atcq31 (metaheuristic) (ffatcq)

  • Artificial bee colony combined with atcq32 (metaheuristic) (abcatcq)

  • Shuffled-frog leaping algorithm33 (metaheuristic) (sfla)

  • wu combined with atcq34 (metaheuristic) (wuatcq)

  • Particle swarm optimization combined with atcq35 (metaheuristic) (psoatcq)

  • bs combined with itatcq36 (metaheuristic) (bsitatcq)

  • Iterative atcq37 (metaheuristic) (itatcq)

  • Iterative online k-means algorithm38 (partitional) (iokm)

These algorithms represent the cq work published between 1980 and 2022. (For a modern survey of cq, see Ref. 8.) However, it should be noted that our objective is not to perform an exhaustive comparison of the cq algorithms published over the past 40 years but to demonstrate the use of the presented dataset in comparing several popular cq algorithms.

We execute each algorithm with the default parameter values suggested by its authors (see Table 1) and quantize each image in our dataset to 4, 16, 64, and 256 colors separately. The cases of 4 and 256 colors represent the two extremes. It can be argued that a typical natural image can hardly be quantized to fewer than 4 colors, and most images can be reproduced relatively accurately with no more than 256 colors.

Table 1

Parameter setting for each cq algorithm.

AlgorithmParameter setting
mc/pop/wan/wuNumber of bits: 5
octMaximum depth: 6
somSampling factor: 1
samInitial number of clusters: 20× number of colors
wsmDecimation factor: 2
Maximum number of iterations: 100
Convergence threshold: 0.001
wfcmDecimation factor: 2
Maximum number of iterations: 100
Convergence threshold: 0.001
Weighting exponent: 2
aduLearning rate: 0.015
vclDecimation factor: 2
Number of bits: 5
Number of iterations: 10
atcqα=0.25, 0.3, 0.35, and 0.4 for 4, 16, 64, and 256 colors, respectively.
ffatcqNumber of fireflies: 5
Number of iterations: 20
abcatcqNumber of food sources: 5
Number of iterations: 20
sflaNumber of frogs: 8
Number of memeplexes: 2
Number of iterations: 20
psoatcqNumber of particles: 5
Number of iterations: 20
bsitatcqNumber of iterations: 20
Iterations: 20

Figures 9Fig. 10Fig. 1112, respectively show the shopping bags, motorcycle, umbrellas, and common jezebel images quantized using three algorithms (in each figure, going from top to bottom, the subfigures are arranged in progressively decreasing quality). Given an input image, for each algorithm, we display the corresponding reduced-color output image and a grayscale error image that allows us to visualize the differences between the input and output. The error image is obtained by amplifying the pixelwise normalized Euclidean differences between the input and output by a factor of four and then negating them for better visualization (see below). Hence, the cleaner/lighter the error image, the better the reproduction of the input image.

Fig. 9

(a) Shopping bags (13,752 colors) and its various quantized versions (4 colors): (b) wfcm, (d) wuatcq, and (f) bsitatcq. Subfigures (c), (e), and (g) are the error images corresponding to subfigures (b), (d), and (f), respectively.


Fig. 10

(a) Motorcycle (212,020 colors) and its various quantized versions (16 colors): (b) psoatcq, (d) adu, and (f) bsitatcq. Subfigures (c), (e), and (g) are the error images corresponding to subfigures (b), (d), and (f), respectively.


Fig. 11

(a) Umbrellas (217,673 colors) and its various quantized versions (64 colors): (b) sfla, (d) abcatcq, and (f) wfcm. Subfigures (c), (e), and (g) are the error images corresponding to subfigures (b), (d), and (f), respectively.


Fig. 12

(a) Common jezebel (137,446 colors) and its various quantized versions (256 colors): (b) adu, (d) wsm, and (f) iokm. Subfigures (c), (e), and (g) are the error images corresponding to subfigures (b), (d), and (f), respectively.


Let IRGB and I˜RGB respectively denote the H×W original input and quantized output images in the standard rgb (srgb) color space.39 IRGB(r,c) and I˜RGB(r,c) are then three-dimensional vectors containing the rgb values of the pixel with (row, column) coordinate (r,c) in IRGB and I˜RGB, respectively (r{1,,H} and c{1,,W}). The corresponding pixel in the 8-bit error image E is then computed as

Eq. (1)

where ·2 denotes the Euclidean (2) norm. Note that the division by a 3 is necessary for the rgb -to-grayscale conversion, and pixels in E with negative values are clipped to zero.


Quantitative Assessment and Discussion

Quantitative assessment remains to be one of the least explored aspects of cq.8 Most cq studies employ pixelwise image fidelity metrics such as the mean squared error (mse), or its variants such as the peak signal-to-noise ratio (psnr), computed in the rgb color space. Recently, Ref. 40 investigated 25 fidelity metrics and concluded that mse computed in the cielab color space was among the best. Following their recommendation, we compute the mse metric as follows:

Eq. (2)

where ICIELAB and I˜CIELAB respectively denote the H×W original input and quantized output images in the cielab color space. Thus, MSE represents the average color distortion in the cielab color space with respect to the squared Euclidean (22) norm. Note that in the rgb -to- cielab conversion, we assume a working color space of srgb and a d65 reference white.

The presented cq100 dataset includes the 100 (true-color) input images and their metadata (as described in Sec. 2), 8400 (reduced-color) output images (21 cq algorithms ×100 input images ×{4,16,64,256} colors), and Microsoft Excel worksheets containing the mse for each input/output image combination. (Other image fidelity metrics can be computed over the provided input/output images.) By contrast, to the best of our knowledge, for the usc-sipi and Kodak datasets, there is no public repository containing the output images produced by a variety of algorithms and the corresponding mse values.

To determine if there are any statistically significant differences among the cq algorithms, we employ two nonparametric statistical tests:41 the Friedman test42 and the Iman-Davenport test.43 These tests are alternatives to the parametric two-way analysis of variance (anova) test. Their advantage over anova is that they do not require normality or homoscedasticity, assumptions that are often violated in machine learning or optimization studies.4448

Given B blocks (subjects) and T treatments (measurements), the null hypothesis (H0) of the Friedman test is that populations within a block are identical. The alternative hypothesis (H1) is that at least one treatment tends to yield larger (or smaller) values than at least one other treatment. The test statistic is computed as follows.49 In the first step, the observations within each block are ranked separately, so each block contains a separate set of T ranks. If ties occur, the tied observations are given the mean of the rank positions for which they are tied. If H0 is true, the ranks in each block should be randomly distributed over the columns (treatments). Otherwise, we expect a lack of randomness in this distribution. For example, if a particular treatment is better than the others, we expect small ranks to “favor” that column. In the second step, the ranks in each column are summed. If H0 is true, we expect the sums to be fairly close—so close that we can attribute differences to chance. Otherwise, we expect to see at least one difference between pairs of rank sums so large that we cannot reasonably attribute it to sampling variability. The test statistic is given as

Eq. (3)

where Rj (j{1,2,,T}) is the rank sum of the jth column. χr2 is approximately chi-square with (T1) degrees of freedom. H0 is rejected at the α level of significance if the value of (3) is greater than or equal to the critical chi-square value for (T1) degrees of freedom.

Reference 43 proposed the following alternative statistic

Eq. (4)

which is distributed according to the F-distribution with (T1) and (T1)(B1) degrees of freedom. Compared to χr2, this statistic is not only less conservative but also more accurate for small sample sizes.43

In this study, blocks and treatments correspond to images and cq algorithms, respectively. For each K{4,16,64,256}, our goal is to determine if at least one algorithm is significantly better than at least one other algorithm at the α=0.05 level of significance. If this is the case, we perform multiple comparison testing to determine which pairs of algorithms differ significantly. For this purpose, we employ the Bergmann-Hommel test50 (also at the α=0.05 level), a powerful multiple comparison test that has been used successfully in various machine learning studies.13,41,5153 (The power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis.) Bergmann-Hommel is a dynamic test that considers the logical relations among the hypotheses and is strictly more powerful54 than various alternative tests that control the familywise error rate (The familywise error rate is the probability of falsely rejecting at least one null hypothesis when performing multiple comparison tests.) such as Nemenyi,55 Holm,56 and Shaffer57 tests.

Table 2 gives the mean rank of each cq algorithm over the dataset for K{4,16,64,256} (lower ranks are better). The last column gives the mean of the (mean) ranks over the four K values. Unsurprisingly, the top ranks are occupied by partitional and metaheuristic-based algorithms. As mentioned earlier, the Bergmann-Hommel test is a powerful multiple comparison test. However, this power comes at the expense of a very high computational cost. More specifically, the test takes exponential time in the number of hypotheses;58 thus, it cannot handle more than ten algorithms, even on a high-performance cpu. To address this problem, we eliminate the patently inferior algorithms,48 namely all seven hierarchical algorithms (i.e., bs, mc, oct, sam, vcl, wan, and wu) and the five partitional or metaheuristic-based algorithms that rank in the bottom half (i.e., atcq, ffatcq, itatcq, pop, and som). Thus, the nonparametric statistical analyses below are conducted for the remaining nine algorithms (Even if the Bergmann-Hommel test could handle all 21 algorithms, it would have been extremely unwieldy to analyze the test results involving (212)=420 hypotheses, as opposed to (92)=36 hypotheses, and such an analysis would have more than likely yielded complex and uninterpretable rules.), namely abcatcq, adu, bsitatcq, iokm, psoatcq, sfla, wfcm, wsm, and wuatcq.

Table 2

Mean rank of each cq algorithm for K∈{4,16,64,256}.


For K=4 colors, both Friedman and Iman-Davenport tests detect a statistically significant difference in image fidelity (or effectiveness) among the cq algorithms: χr2(8)=158.990 with p=9.613e11 and Fr(8,792)=24.555 with p=6.675e34. The results of the Bergmann-Hommel test are given in column 2 of Table 3. For example, the test rejects the null hypothesis “abcatcq versus adu” (i.e., the two algorithms are equally effective). Since we know from Table 2 that abcatcq has a lower (or better) rank than adu for K=4, the rejection of the above hypothesis means abcatcq is more effective than adu for K=4 and the difference between the two algorithms is statistically significant at the α=0.05 level. The results of the Bergmann-Hommel test can be summarized succinctly as follows:

wfcm > {abcatcq, psoatcq, sfla, wsm} > {adu, bsitatcq, wuatcq},

where a notation such as {a, b} > c indicates that there is no statistically significant difference between algorithms a and b, and these two algorithms are significantly more effective than (or superior to) algorithm c. The above (summary) rule can be interpreted as follows:

  • wfcm is the best algorithm;

  • {adu, bsitatcq, wuatcq} is the worst group of algorithms; and

  • {abcatcq, psoatcq, sfla, wsm} is in between.

Table 3

Results of the Bergmann-Hommel test for K∈{4,16,64,256} (✓: rejected and ✗: not rejected).

Null hypothesisK=4K=16K=64K=256
abcatcq versus adu
abcatcq versus bsitatcq
abcatcq versus iokm
abcatcq versus psoatcq
abcatcq versus sfla
abcatcq versus wfcm
abcatcq versus wsm
abcatcq versus wuatcq
adu versus bsitatcq
adu versus iokm
adu versus psoatcq
adu versus sfla
adu versus wfcm
adu versus wsm
adu versus wuatcq
bsitatcq versus iokm
bsitatcq versus psoatcq
bsitatcq versus sfla
bsitatcq versus wfcm
bsitatcq versus wsm
bsitatcq versus wuatcq
iokm versus psoatcq
iokm versus sfla
iokm versus wfcm
iokm versus wsm
iokm versus wuatcq
psoatcq versus sfla
psoatcq versus wfcm
psoatcq versus wsm
psoatcq versus wuatcq
sfla versus wfcm
sfla versus wsm
sfla versus wuatcq
wfcm versus wsm
wfcm versus wuatcq
wsm versus wuatcq

Observe that the above summary does not include iokm. This is because, as Table 3 shows, while iokm is inferior to wfcm and superior to {adu, bsitatcq}, it cannot be included in group {abcatcq, psoatcq, sfla, wsm } since it is not superior to wuatcq. Hence, an alternative rule, including iokm but excluding wuatcq, is

wfcm > {abcatcq, iokm, psoatcq, sfla, wsm} > {adu, bsitatcq}.

For K=16 colors, both Friedman and Iman-Davenport tests detect a statistically significant difference in image fidelity (or effectiveness) among the cq algorithms: χr2(8)=203.771 with p=9.240e11 and Fr(8792)=33.835 with p=4.888e46. The results of the Bergmann-Hommel test are given in column 3 of Table 3. In this case, the results can be summarized by a single rule that covers all nine algorithms:

psoatcq > {abcatcq, iokm, sfla, wfcm, wsm} > {adu, bsitatcq, wuatcq}, which can be interpreted as follows:

  • psoatcq is the best algorithm;

  • {adu, bsitatcq, wuatcq} is the worst group of algorithms; and

  • {abcatcq, iokm, sfla, wfcm, wsm} is in between.

For K=64 colors, both Friedman and Iman-Davenport tests detect a statistically significant difference in image fidelity (or effectiveness) among the cq algorithms: χr2(8)=330.423 with p=1.416e10 and Fr(8,792)=69.663 with p=1.763e86. The results of the Bergmann-Hommel test are given in column 4 of Table 3. In this case, the results can be summarized by three alternative rules:

  • psoatcq > {adu, iokm, wsm} > {bsitatcq, wfcm, wuatcq};

  • {abcatcq, adu, iokm, wsm} > {bsitatcq, wfcm, wuatcq}; and

  • psoatcq > {adu, iokm, sfla} > {bsitatcq, wfcm, wuatcq}.

Observe that {bsitatcq, wfcm, wuatcq} is the worst group of algorithms in every case.

For K=256 colors, both Friedman and Iman-Davenport tests detect a statistically significant difference in image fidelity (or effectiveness) among the cq algorithms: χr2(8)=394.704 with p=1.872e10 and Fr(8,792)=96.413 with p=1.441e111. The results of the Bergmann-Hommel test are given in column 5 of Table 3. In this case, the results can be summarized by two alternative rules:

  • {adu, psoatcq, wsm} > iokm > {bsitatcq, sfla} > {wfcm, wuatcq} and

  • {abcatcq, psoatcq, wsm} > iokm > {bsitatcq, sfla} > {wfcm, wuatcq}

Observe that, in either case, {wfcm, wuatcq} is the worst group of algorithms; {bsitatcq, sfla} is the second-worst group; and iokm is superior to the preceding two groups.

General remarks with respect to effectiveness (recall that lower ranks are better):

  • abcatcq is always in the best or second-best group;

  • adu ranks progressively lower (This behavior is likely due to the constant learning rate used in the experiments, as suggested by Ref. 28. For better performance, this rate should probably be set as a function of K.) with increasing K (it is in the worst group for K{4,16} but the best or second-best group for K{64,256});

  • bsitatcq ranks progressively lower with increasing K (nevertheless, it is in the worst group for K{4,16,64} and the second-worst group for K=256);

  • iokm is always in the best or second-best group;

  • psoatcq is in the best group for K{16,64,256} and the second-best group for K=4;

  • sfla is in the second-best group for K{4,16,64} and the second-worst group for K=256;

  • wfcm ranks progressively higher (The lackluster performance of wfcm, which is a fuzzy generalization of wsm, was demonstrated earlier by Ref. 27 on a limited number of images.) with increasing K (it is the best algorithm for K=4 but in the worst group for K{64,256});

  • wsm ranks progressively lower with increasing K (it is always in the best or second-best group); and

  • wuatcq is always in the worst group.

Finally, for each K value, if we were to recommend a single algorithm based on effectiveness considerations, it would be: wfcm for K=4, psoatcq for K{16,64}, and adu for K=256.

It is important to emphasize that in our experiment, we compared these cq algorithms along a single dimension (mse in the cielab color space). There are many other criteria on which one can compare cq algorithms, including computational efficiency, simplicity (conceptual and implementation), and ease of use (e.g., as measured by the number of user-defined parameters). For example, while iokm is inferior to psoatcq based on effectiveness (as quantified by mse), it is vastly superior based on the other three criteria mentioned above. Therefore, there is no such thing as a universal cq algorithm, and the best algorithm depends highly on the application requirements.


Conclusions and Future Research Directions

In this paper, we presented cq100, a large, diverse, and high-quality dataset of 24-bit color images released under the cc by-sa 4.0 license. We also demonstrated how cq100 can be used to compare cq algorithms based on the popular mse metric (computed in the cielab color space). Future work includes a multivariate comparison of cq algorithms over cq100 based on several image fidelity metrics.

The use of cq100 is not restricted to cq. For example, it can also be used to develop, test, and compare filtering59 or segmentation60,61 algorithms. However, for some applications, the images in the dataset may need to be annotated by one or more human experts. For example, in a segmentation application, the annotation for a given image could include bounding boxes around objects of interest or a segmentation mask for the entire image. cq100 is significantly larger and more diverse than its two main competitors (i.e., usc-sipi and Kodak). Although the current size of the dataset appears to be adequate for its primary target application, which is an unsupervised learning task, it should be expanded if it is to be used for supervised learning tasks.

Data, Materials, and Code Availability

The cq100 dataset presented in this paper is available at Ref. 62. The dataset includes the 100 (true-color) input images and their metadata, 8400 (reduced-color) output images (21 cq algorithms ×100 input images ×{4,16,64,256} colors), and Microsoft Excel worksheets containing the mse for each input/output image combination.


This material is based upon work supported by the National Science Foundation (Grant No. OIA-1946391). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The authors gratefully acknowledge the following open-source software:

1. median-cut algorithm (gimp, Spencer Kimball, and Peter Mattis);

2. octree algorithm (ImageMagick, John Cristy);

3. marginal variance minimization algorithm (Utah Raster Toolkit and Craig E. Kolb);

4. binary splitting algorithm (Charles A. Bouman);

5. variance minimization algorithm (Xiaolin Wu);

6. self-organizing map algorithm (Anthony Dekker);

7. split-and-merge algorithm (Luc Brun);

8. multipletest software41 (Salvador García and Francisco Herrera); and

9. BoxPlotR software63 (Michaela Spitzer, Jan Wildenhain, Juri Rappsilber, and Mike Tyers).

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.



G. Sharma, M. J. Vrhel and H. J. Trussell, “Color imaging for multimedia,” Proc. IEEE, 86 (6), 1088 –1108 (1998). Google Scholar


R. Ramanath et al., “Color image processing pipeline,” IEEE Signal Process. Mag., 22 (1), 34 –43 ISPRE6 1053-5888 (2005). Google Scholar


P. Heckbert, “Color image quantization for frame buffer display,” ACM SIGGRAPH Comput. Graph., 16 (3), 297 –307 (1982). Google Scholar


X. Wu, “Color quantization by dynamic programming and principal analysis,” ACM Trans. Graph., 11 (4), 348 –372 ATGRDF 0730-0301 (1992). Google Scholar


A. K. Jain, M. N. Murty and P. J. Flynn, “Data clustering: a review,” ACM Comput. Surv., 31 (3), 264 –323 ACSUEY 0360-0300 (1999). Google Scholar


Partitional Clustering Algorithms, Springer( (2015). Google Scholar


M. E. Celebi, “Forty years of color quantization: a modern, algorithmic survey,” Artif. Intell. Rev., (2023). Google Scholar


J. L. Nieves et al., “Computing the relevant colors that describe the color palette of paintings,” Appl. Opt., 59 (6), 1732 –1740 APOPAI 0003-6935 (2020). Google Scholar


J. L. Nieves et al., “Psychophysical determination of the relevant colours that describe the colour palette of paintings,” J. Imaging, 7 (4), 72 (2021). Google Scholar


M. E. Celebi and A. Zornberg, “Automated quantification of clinically significant colors in dermoscopy images and its application to skin lesion classification,” IEEE Syst. J., 8 (3), 980 –984 (2014). Google Scholar


C. Barata et al., “Clinically inspired analysis of dermoscopy images using a generative model,” Comput. Vis. Image Underst., 151 124 –137 CVIUF4 1077-3142 (2016). Google Scholar


M. E. Celebi, H. Kingravi and P. A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm,” Expert Syst. Appl., 40 (1), 200 –210 (2013). Google Scholar


D. Hasler and S. E. Suesstrunk, “Measuring colorfulness in natural images,” Proc. SPIE, 5007 87 –95 (2003). Google Scholar


M. Gervautz, W. Purgathofer, “A simple method for color quantization: octree quantization,” New Trends in Computer Graphics, 219 –231 Springer( (1988). Google Scholar


S. J. Wan, P. Prusinkiewicz and S. K. M. Wong, “Variance-based color image quantization for frame buffer display,” Color Res. Appl., 15 52 –58 CREADU 0361-2317 (1990). Google Scholar


M. Orchard and C. Bouman, “Color quantization of images,” IEEE Trans. Signal Process., 39 (12), 2677 –2690 ITPRED 1053-587X (1991). Google Scholar


X. Wu, 126 –133 Academic Press( (1991). Google Scholar


A. Dekker, “Kohonen neural networks for optimal colour quantization,” Netw. Computat. Neural Syst., 5 (3), 351 –367 (1994). Google Scholar


L. Brun and M. Mokhtari, “Two high speed color quantization algorithms,” in Proc. 1st Int. Conf. Color in Graph. and Image Process., 116 –121 (2000). Google Scholar


M. E. Celebi, “Improving the performance of k-means for color quantization,” Image Vis. Comput., 29 (4), 260 –271 (2011). Google Scholar


Q. Wen and M. E. Celebi, “Hard vs. fuzzy c-means clustering for color quantization,” EURASIP J. Adv. Signal Process., 2011 (1), 118 –129 (2011). Google Scholar


M. E. Celebi, S. Hwang and Q. Wen, “Colour quantisation using the adaptive distributing units algorithm,” Imaging Sci. J., 62 (2), 80 –91 (2014). Google Scholar


M. E. Celebi, Q. Wen and S. Hwang, “An effective real-time color quantization method based on divisive hierarchical clustering,” J. Real-Time Image Process., 10 (2), 329 –344 (2015). Google Scholar


M.-L. Pérez-Delgado, “Colour quantization with ant-tree,” Appl. Soft Comput., 36 656 –669 (2015). Google Scholar


M.-L. Pérez-Delgado, “Artificial ants and fireflies can perform colour quantisation,” Appl. Soft Comput., 73 153 –177 (2018). Google Scholar


M.-L. Pérez-Delgado, “The color quantization problem solved by swarm-based operations,” Appl. Intell., 49 (7), 2482 –2514 (2019). Google Scholar


M.-L. Pérez-Delgado, “Color image quantization using the shuffled-frog leaping algorithm,” Eng. Appl. Artif. Intell., 79 142 –158 EAAIE6 0952-1976 (2019). Google Scholar


M.-L. Pérez-Delgado and J.-Á. Román-Gallego, “A hybrid color quantization algorithm that combines the greedy orthogonal bi-partitioning method with artificial ants,” IEEE Access, 7 128714 –128734 (2019). Google Scholar


M.-L. Pérez-Delgado, “Color quantization with particle swarm optimization and artificial ants,” Soft Comput., 24 (6), 4545 –4573 (2020). Google Scholar


M.-L. Pérez-Delgado, “A mixed method with effective color reduction,” Appl. Sci., 10 (21), 7819 (2020). Google Scholar


M.-L. Pérez-Delgado, “Revisiting the iterative ant-tree for color quantization algorithm,” J. Visual Commun. Image Represent., 78 103180 JVCRE7 1047-3203 (2021). Google Scholar


A. Abernathy and M. E. Celebi, “The incremental online k-means clustering algorithm and its application to color quantization,” Expert Syst. Appl., 207 117927 (2022). Google Scholar


M. Anderson et al., “Proposal for a standard default color space for the internet—sRGB,” in Proc. Color and Imaging Conf., 238 –245 (1996). Google Scholar


B. Ortiz-Jaramillo et al., “Evaluation of color differences in natural scene color images,” Signal Process. Image Commun., 71 128 –137 SPICEF 0923-5965 (2019). Google Scholar


S. García and F. Herrera, “An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons,” J. Mach. Learn. Res., 9 2677 –2694 (2008). Google Scholar


M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” J. Am. Stat. Assoc., 32 (200), 675 –701 (1937). Google Scholar


R. L. Iman and J. M. Davenport, “Approximations of the critical region of the friedman statistic,” Commun. Stat. Theory Methods, 9 (6), 571 –595 (1980). Google Scholar


J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., 7 1 –30 (2006). Google Scholar


J. Luengo, S. García and F. Herrera, “A study on the use of statistical tests for experimentation with neural networks: analysis of parametric test conditions and non-parametric tests,” Expert Syst. Appl., 36 (4), 7798 –7808 (2009). Google Scholar


S. García et al., “A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability,” Soft Comput., 13 959 –977 (2009). Google Scholar


S. García et al., “A study on the use of non-parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 special session on real parameter optimization,” J. Heuristics, 15 (6), 617 –644 (2009). Google Scholar


J. Carrasco et al., “Recent trends in the use of statistical tests for comparing swarm and evolutionary computing algorithms: practical guidelines and a critical review,” Swarm Evol. Computat., 54 100665 (2020). Google Scholar


W. W. Daniel, Applied Nonparametric Statistics, 2nd ed.PWS-KENT Publishing Company( (1990). Google Scholar


B. Bergmann, G. Hommel, , “Improvements of general multiple test procedures for redundant systems of hypotheses,” Multiple Hypotheses Testing, 100 –115 Springer( (1988). Google Scholar


J. Derrac et al., “A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms,” Swarm Evol. Computat., 1 (1), 3 –18 (2011). Google Scholar


R. M. O. Cruz, R. Sabourin and G. D. C. Cavalcanti, “Dynamic classifier selection: recent advances and perspectives,” Inf. Fusion, 41 195 –216 (2018). Google Scholar


U. Johansson et al., “Rule extraction with guarantees from regression models,” Pattern Recognit., 126 108554 (2022). Google Scholar


G. Hommel and G. Bernhard, “Bonferroni procedures for logically related hypotheses,” J. Stat. Plann. Inference, 82 (1–2), 119 –128 (1999). Google Scholar


P. B. Nemenyi, “Distribution-free multiple comparisons,” Princeton University, (1963). Google Scholar


S. Holm, “A simple sequentially rejective multiple test procedure,” Scand. J. Stat., 6 (2), 65 –70 SJSADG 0303-6898 (1979). Google Scholar


J. P. Shaffer, “Modified sequentially rejective multiple test procedures,” J. Am. Stat. Assoc., 81 (395), 826 –831 (1986). Google Scholar


G. Hommel and G. Bernhard, “A rapid algorithm and a computer program for multiple test procedures using logical structures of hypotheses,” Comput. Methods Programs Biomed., 43 (3–4), 213 –216 CMPBEK 0169-2607 (1994). Google Scholar


M. E. Celebi, H. A. Kingravi and Y. A. Aslandogan, “Nonlinear vector filtering for impulsive noise removal from color images,” J. Electron. Imaging, 16 (3), 033008 JEIME5 1017-9909 (2007). Google Scholar


Y. Deng and B. Manjunath, “Unsupervised segmentation of color-texture regions in images and video,” IEEE Trans. Pattern Anal. Mach. Intell., 23 (8), 800 –810 ITPIDJ 0162-8828 (2001). Google Scholar


M. Mignotte, “Segmentation by fusion of histogram-based k-means clusters in different color spaces,” IEEE Trans. Image Process., 17 (5), 780 –787 IIPRE4 1057-7149 (2008). Google Scholar


M. Spitzer et al., “BoxPlotR: a web tool for generation of box plots,” Nat. Methods, 11 121 –122 1548-7091 (2014). Google Scholar


M. Emre Celebi received his PhD in computer science and engineering from the University of Texas at Arlington, United States. Currently, he is a professor and the chair of the Department of Computer Science and Engineering at the University of Central Arkansas, United States. He has published 170+ articles on image processing/analysis and data mining. According to Google Scholar, his work has received more than 14,000 citations so far. He is a senior member of the ieee and a fellow of the spie.

María-Luisa Pérez-Delgado received her engineering degree in computer science from the University of Valladolid, Spain and a PhD from the University of Salamanca, Spain, where she is currently a professor. Her research interests include artificial intelligence, optimization, graph theory, and data mining. She has published several research articles and books in these areas.

© 2023 SPIE and IS&T
M. Emre Celebi and María-Luisa Pérez-Delgado "cq100: a high-quality image dataset for color quantization research," Journal of Electronic Imaging 32(3), 033019 (7 June 2023).
Received: 14 March 2023; Accepted: 11 May 2023; Published: 7 June 2023
Get copyright permission  Get copyright permission on Copyright Marketplace
RGB color model

Image processing

Image segmentation

Image quality


Machine learning

Image visualization


Back to Top