Since the early 90s a number of papers on 'robust' digital watermarking systems have been presented but none of them uses the same robustness criteria. This is not practical at all for comparison and slows down progress in this area. To address this issue, we present an evaluation procedure of image watermarking systems. First we identify all necessary parameters for proper benchmarking and investigate how to quantitatively describe the image degradation introduced by the watermarking process. For this, we show the weaknesses of usual image quality measures in the context watermarking and propose a novel measure adapted to the human visual system. Then we show how to efficiently evaluate the watermark performance in such a way that fair comparisons between different methods are possible. The usefulness of three graphs: 'attack vs. visual-quality,' 'bit-error vs. visual quality,' and 'bit-error vs. attack' are investigated. In addition the receiver operating characteristic (ROC) graphs are reviewed and proposed to describe statistical detection behavior of watermarking methods. Finally we review a number of attacks that any system should survive to be really useful and propose a benchmark and a set of different suitable images.