Many medical image perception tasks are tasks of interpretation. The clinician is asked to look at something and evaluate it. Is it broken? Is it infected? In another set of tasks, the clinician is asked to find something. Is there a stroke in this brain? Is there cancer in this breast? These are visual search tasks, where the location and even the presence of a finding are uncertain. The human “search engine” is, at once, powerful and constrained. It is useful to understand the way that the capabilities and limitations of human search interact with the demands of medical search tasks because that understanding can lead to improvements in vitally important medical image perception tasks.
HUMAN SEARCH IS GUIDED SEARCH
Imagine that you are looking for a 1977 penny in a pile of pennies. There is not much you can do except to direct your attention to each penny in turn, rejecting it when it turns out to have a different date and moving on to the next one. Since the date inscription is small, you will need to fixate each penny to get the numbers on the fovea. Voluntary fixations occur at a rate of about 4 per second, so your penny search will be constrained to be at least that slow. If the numbers on the pennies were big enough that you could read them without the need to foveate each one, the rate at which you could process the pennies would increase to something like 25-50 pennies per second. This tells you either that the “spotlight” of attention can be deployed separately from, and faster than, the eyes or, perhaps, that, under the right circumstances, more than one item can be processed in parallel during each fixation of the eyes 1.
If all of visual search were a succession of such “serial, self-terminating” searches, finding anything from the cat to an aneurysm would be a needle-in-the-haystack experience and it would be hard to get anything done. Even at 25-50 deployments of attention per second, it would take far too long to find your keys, your socks, or a nodule in the liver. Fortunately, the human “search engine” is smarter than that. It uses several sources of information to “guide” attention. Suppose, for example, you were still looking for that 1977 penny but now the coins are not all pennies. They are a mix of pennies, nickels, dimes, and quarters. You will still need to search through one coin after another. However, you will be able to use the size and color of the coins to restrict your attention to the pennies, avoiding the other coins 2. If that 1977 penny were the only penny in the pile, search would be trivial. When the target is defined by a unique, salient feature, the penny would “pop out” of the display 3.
BOTTOM-UP AND TOP-DOWN PROCESSING OF BASIC FEATURES
The penny example suggests that there is a set of features that are processed in parallel across the entire visual field 4. A single, unique item will tend to attract attention whether or not the observer is looking for that feature. This is an example of “bottom-up”, stimulus-driven guidance of attention. In Figure 1, the white, horizontal item on the left attracts attention in this bottom-up manner because its features are markedly different from those of its homogeneous neighbors. The same type of white horizontal item is much less salient on the right, in a more diverse neighborhood. It is this local, bottom-up salience that is captured by most “saliency” algorithms 5,6.
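Most saliency algorithms formalize this local-contrast idea in some fashion. As a deliberately minimal sketch (not any particular published model), salience at each location can be taken to be the difference between a feature value and the mean of its local neighborhood; a lone bright item among dim neighbors then wins:

```python
import numpy as np

def local_salience(feature_map, radius=1):
    """Bottom-up salience: each location's absolute difference
    from the mean of its (2*radius+1) x (2*radius+1) neighborhood."""
    h, w = feature_map.shape
    sal = np.zeros_like(feature_map, dtype=float)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            neighborhood = feature_map[i0:i1, j0:j1]
            sal[i, j] = abs(feature_map[i, j] - neighborhood.mean())
    return sal

# A single unique item in a homogeneous field is maximally salient.
field = np.zeros((5, 5))
field[2, 2] = 1.0
s = local_salience(field)
```

The same bright item dropped into a heterogeneous field would produce a much smaller contrast with its neighborhood mean, mirroring the left/right difference in Figure 1.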
Bottom-up saliency is useful up to a point but, if you think about a typical medical image, the clinically significant finding is not likely to be the most salient feature in the image – not if we define salience in terms of these local differences in basic features. In deliberate search tasks, when we have a target in mind, we configure our search engine in a top-down, user-driven manner to guide attention to candidate targets. Thus, in Figure 1, you can configure yourself to look for black and vertical. This will rapidly guide your attention to the black vertical item even if it is not the most salient item and has not attracted much attention bottom-up 7. As a medical example, in a search for lung nodules in CT, attention will be guided to small white objects.
Note that, in both of these examples, the target is defined not by a single feature but by a conjunction of features. Some items are black. Other items are vertical. You are looking for the conjunction of black and vertical, and the intersection of the set of black things and the set of vertical things is an excellent place to look for black vertical targets.
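The set logic of conjunction guidance can be written out directly. The items and their attribute values below are hypothetical, standing in for the display in Figure 1:

```python
# Hypothetical search display: each item carries one color feature
# and one orientation feature.
items = [
    {"id": 1, "color": "black", "orientation": "vertical"},
    {"id": 2, "color": "white", "orientation": "vertical"},
    {"id": 3, "color": "black", "orientation": "horizontal"},
    {"id": 4, "color": "white", "orientation": "horizontal"},
]

black = {it["id"] for it in items if it["color"] == "black"}
vertical = {it["id"] for it in items if it["orientation"] == "vertical"}

# Guidance to the conjunction: attention is restricted to the
# intersection of the black set and the vertical set.
candidates = black & vertical
```

Here only item 1 survives both filters, so attention need never visit the other three items, even though each shares one feature with the target.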
There is a limited set of attributes that can guide attention. You cannot direct your search engine to guide your attention to breast cancer. A trained radiologist can search for breast cancer, of course, but the guidance that limits the set of sensible places to deploy attention will be drawn from a limited vocabulary of attributes. There appear to be one to two dozen of these 8. Everyone agrees on attributes like color and motion. Other attributes, like “shininess”, have less experimental attestation 9, while candidates like faces or facial emotion remain controversial even after years of research 10,11.
IDIOSYNCRASIES OF THE HUMAN SEARCH ENGINE
The human search engine is not a search engine like Google that allows you to type anything into the search box. Not only is the set of attributes limited; the use of those attributes is limited as well.
One set of constraints is illustrated in Figure 2, where two types of target are shown in the “search box”: items with big and small parts on the left and small things with big parts on the right. Use your search engine to find each in turn. It turns out that the human search engine is limited to one feature value per attribute. That is, observers can search for the item that has the color feature “RED” and the size feature “BIG”. This yields the intersection of the sets of red and big items. However, if the observer tries to look for two size features (BIG and SMALL), that search seems to yield the union of the big and small items. In this case, that includes all of the dumbbell objects, and thus there is no guidance 12. However, the system is capable of searching for a feature of the whole object and a feature, in the same dimension, of a part of that object. Thus, the search for the small square objects with the larger, enclosed square parts is guided and should feel somewhat easier 13,14. (Did you find both examples of each target?)
Notice that the size terms used here are “big” and “small”. We seem to be able to talk to our search engine only in a very limited vocabulary. We can guide to big and small (but not medium-sized) 15. Orientation seems to be defined by the terms steep, shallow, left, and right 16. Color guidance is probably guidance by color categories like “red” and “blue”, not “rose red” or “610 nm” 17. This categorical color guidance has consequences for medical image perception. Think of color heat maps in, for example, PET images. The scale is continuous, but we are predisposed to see these maps categorically. For example, in a standard red-to-green heat map, we might see a red hot spot of a specific size and shape. The perceived hot spot would be quite different if the color mapping were changed, even though that change would not alter the underlying data.
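The heat-map point can be illustrated with a toy calculation: the same continuous values, binned into color categories at two different (hypothetical) boundaries, yield “red” regions of different extents even though the data never change:

```python
# Six hypothetical voxel intensities along a line through a "hot spot".
values = [0.30, 0.55, 0.62, 0.71, 0.80, 0.95]

def red_region(vals, red_threshold):
    """Indices that fall into the 'red' category under a given mapping.
    The threshold is a property of the colormap, not of the data."""
    return [i for i, v in enumerate(vals) if v >= red_threshold]

hot_a = red_region(values, 0.60)  # mapping A: "red" starts at 0.60
hot_b = red_region(values, 0.75)  # mapping B: "red" starts at 0.75
```

Under mapping A the perceived hot spot spans four locations; under mapping B it spans two. A viewer who sees the map categorically will report differently sized lesions from identical measurements.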
Even with this limited vocabulary of search terms, it is possible to guide search quite intelligently because, as a general rule, the world is not constituted like Figure 2. In the real world, knowing that you are looking for a big, red, shiny thing with a small yellow part is likely to substantially reduce the set of candidates. We know that it is possible to guide to many attributes at the same time 18. Moreover, visual search for objects is best if you show the observer the exact target object just before the search 19, suggesting that our search engine can translate a picture cue into an effective search template quickly and effectively.
MISSING THE MONKEY AND OTHER INCIDENTAL FINDINGS
There is a downside to effective guidance. In mechanical terms, guidance probably involves setting “weights” in the nervous system to boost the effectiveness of some feature (e.g. red) or some dimension (e.g. color) 20-22. That can make it less likely that another, incidental target will be found. We had observers searching for nodules in lung CT, so, we may presume, they had their search engines set for small and white. As a consequence, 84% of our expert radiologists (and 100% of non-radiologists) failed to report a gorilla the size of a matchbook that we had inserted across 5 slices in the last case (upper right of Figure 3) 23. Others have seen similar effects (like missing a missing clavicle 24). We were tracking our radiologists’ eye movements, so we could see that the gorilla was often fixated and still missed. Looking at an object is not quite the same thing as “seeing” it. Indeed, it is even possible to pick up a target and move it without noticing that it is the thing you are looking for 25.
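One common way to sketch this weight-setting idea (in the spirit of guided-search models, with purely illustrative numbers rather than fitted values) is a priority map built as a weighted sum of feature maps; locations that score poorly on the weighted features never rise to the top, which is how an incidental target can lose:

```python
import numpy as np

# Hypothetical 2x2 "images": evidence for each guiding feature at each location.
small_map = np.array([[0.9, 0.1],
                      [0.2, 0.8]])   # evidence for "small"
white_map = np.array([[0.8, 0.2],
                      [0.1, 0.9]])   # evidence for "white"

# Weights set top-down for a nodule search: boost "small" and "white",
# ignore everything else. (Illustrative values, not a model fit.)
w_small, w_white = 1.0, 1.0

priority = w_small * small_map + w_white * white_map

# Attention is deployed in order of priority. A location holding an
# incidental target that is neither small nor white scores low here
# and may never be attended, even if it is fixated.
```

In this toy map the upper-left and lower-right locations dominate; a large dark gorilla at the upper-right location would score near the bottom of the queue.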
When researchers have looked at the eye movements of expert radiologists and compared the results to those from novices, the most striking difference is that experts look in fewer places 26. Clearly, something is guiding the experts’ search, and it seems very unlikely that this guidance is produced by a better appreciation of the basic features of the targets. Indeed, guidance of attention by basic features will only get us part of the way toward explaining the efficiency of search in real-world scenes in general. An expert searching for breast cancer and a shopper searching for cucumbers are both guided by a set of “scene guidance” cues. These are not present in random arrays of items like those in Figures 1 and 2. However, they are present in most standard search tasks.
Consider the cucumber search. A shopper in the produce section will be aided by “syntactic” guidance – guidance based on the physical rules of the world. Cucumbers will be in a bin somewhere. They will not be floating in mid-air, because cucumbers just do not do that. Search will be aided by “semantic” guidance – guidance based on what we know about cucumbers beyond their status as physical objects. They are likely to be near the carrots and celery because the produce section is typically ordered by rules like “put salad vegetables near each other”. Other rules are perfectly possible (arrange by color or alphabetically), but that is not how it is typically done, and our shopper knows that. Finally, he might be aided by “episodic” guidance: the cucumbers were in this location last week, so they are probably in the same place today (“episodic” in the sense of episodic memory). Take the shopper to another store and the episodic guidance will fail. Take the shopper to another country and the semantic guidance might fail. Syntactic guidance should be reliable as long as the laws of gravity do not change.
The rules of feature guidance probably come with the system. You don’t have to learn to guide attention to color or size. On the other hand, semantic guidance, in particular, must be learned. There is no rule of nature that says that forks tend to lie near plates on a table. A significant part of search expertise, in tasks from radiology to satellite surveillance, must involve learning the contingencies that are reliable in landscapes from North Korea to CT colonography. In radiology, it is interesting to note the rapid rate of change in the “landscape” as the technology evolves. For example, the 2D world of the chest x-ray has evolved into the 3D stack of CT images representing a volume and not just a plane. The relative novelty of 3D volumetric image data gives us the chance to observe what might be thought of as the evolution of guidance.
The gorilla study, mentioned above, was part of just one case in a study that was actually directed at measuring the eye movements of radiologists as they searched for nodules in lung CT. Using an eye tracker, we could measure the eyes’ position in the XY plane while also registering “depth” by tracking the slice that was being viewed as the clinicians scrolled up and down through the stack of images of the lung 27. We found that radiologists fell into two groups. “Drillers” tended to keep their eyes fixated in one spot in the XY plane or, at least, in one quadrant of the lung, while they scrolled up and down through the stack. “Scanners” moved more slowly in depth while moving their eyes throughout all four quadrants of the current slice (quadrants are coded by color in Figure 4). We do not yet know if one of these methods is superior. It is possible, however, that scanning is the older style, having been all that could be done with a 2D chest x-ray. Drilling might be an adaptation to the new, 3D world.
THE GIST OF DISEASE?
One of the striking aspects of scene perception is the speed with which one can extract the “gist” of a scene. Tens of milliseconds of exposure are all that is required to know that a scene is natural or man-made, that it is navigable, or that it contains an animal 28,29. Interestingly, this gist signal is often quite global in nature. That is, an observer might have a reliable sensation that an animal is present but not actually know what animal it is or where in the image it is located 30. Experts sometimes report a similar phenomenon. A radiologist might have the sense that an image contains bad news for the patient before locating the actual problem. Kundel and Nodine 31 talk about an initial stage of “holistic” or “gestalt” processing of radiological images before the clinician gets down to actual search. We found that expert mammographers were significantly above chance in classifying a mammogram as normal or abnormal after just 250 msec of exposure 32. Technologists reading cervical cancer slides had similar abilities. In both cases, the experts were at chance in localizing the pathology. Apparently, they had become sensitive to a global signal in a very specific kind of scene: they could detect the “gist” of cancer at above-chance levels. Non-experts were at chance in these tasks.
WHEN IS IT TIME TO QUIT?
The various forms of guidance discussed here make it possible for humans to find what they are looking for quite effectively in many, if not most, cases. However, guidance does not solve (and may even complicate) one of the most fundamental search problems: When are you finished? If you are looking for your cell phone and you find it, the answer is straightforward. You are done. But suppose you do not find the phone? How long should you keep looking? Or suppose that you are looking for the best lemon in the supermarket or for all of the metastases in an abdominal CT? In these cases, the correct quitting time is not obvious. Guidance complicates matters because, with guidance, the correct answer isn’t “look through everything”. Perhaps you will look through all or most of the candidates above some guidance threshold. But if you are looking at a complex image like a mammogram, setting that threshold becomes the quitting problem. It is not at all obvious where to set the threshold and when to quit. Mammography is a good venue in which to worry about quitting time because, in screening mammography (as in baggage screening and a number of other important tasks), actual targets are very rare. Almost all cases end when the clinician decides to quit without a positive finding.
It turns out that the extreme rarity of targets in a task like breast cancer screening is, itself, a problem. You are more likely to miss a rare target than a common one 33,34. This seems to be true even if you are an expert radiologist. We took 100 mammograms, 50 positive cases and 50 negative, and introduced them into the normal workflow of a breast cancer screening practice at a slow rate of less than one per day. Under those low-prevalence conditions, radiologists missed 30% of the cancers. When we had radiologists read the same 100 cases in a single session outside the clinic (50% prevalence), they missed just 12% of the cancers 35. Prevalence modulated performance. The same thing happened with cervical cancer screening stimuli 36 and with airport baggage screeners 37. Like the gorilla experiment, these results do not reflect badly on radiologists (or airport screeners, for that matter). They tell us that the limits of the human search engine and of human decision-making processes apply to experts as well as to novices. We need to understand how humans perform search tasks if we are going to ask experts to do difficult, important search tasks and if we want them to do those tasks well.
BRING ON THE MACHINES?
Of course, the limits of human search engines would not be a problem if we could turn medical image perception tasks over to the computer; but we can’t – not yet, in most cases. Computer Aided Detection (CADe) systems are good, but they are not perfect, so, at present, they are partners with human observers, not replacements for those humans. One would think that a good radiologist plus a good CADe system would be markedly better than either alone. Curiously, that is not the case. On balance, CAD helps, but the improvement is modest when it is found at all 38,39. Radiologists do not appear to make optimal use of CAD signals. For instance, in one study, radiologists failed to act on 70% of correct positive CAD marks 40. The prevalence problem might be one reason. Suppose we screen 1000 women. There might be 3 cancers, and a good CADe system might mark them all. A really good system might also mark 10% of negative cases. Even a system that good, therefore, produces about 100 false positive marks in these 1000 cases. The radiologist gets 103 marks, three of them correctly marking cancer. That is not a positive predictive value designed to inspire confidence. Moreover, a CAD mark at one location may actually make it less likely that an observer will find a target at another location. That was our finding with non-radiologists and a simulated CAD task 41. This is related to a more general problem known as “satisfaction of search” 42-44 or “subsequent search misses” 45, where finding one target makes it less likely that you will find a second one in displays that contain multiple targets.
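The positive-predictive-value arithmetic in that example can be written out explicitly. The numbers come from the text; the function itself is just an illustrative sketch:

```python
def ppv(n_cases, n_positive, sensitivity, false_positive_rate):
    """Positive predictive value of a CAD mark, given prevalence,
    sensitivity, and the per-case false positive rate."""
    true_marks = n_positive * sensitivity
    false_marks = (n_cases - n_positive) * false_positive_rate
    return true_marks / (true_marks + false_marks)

# The example from the text: 3 cancers in 1000 cases, all marked
# (sensitivity 1.0), plus marks on 10% of the 997 negative cases.
p = ppv(1000, 3, 1.0, 0.10)   # roughly 3 correct marks out of ~103 total
```

Even with perfect sensitivity and an impressively low false positive rate, fewer than 3 marks in 100 are cancer at this prevalence, which helps explain why readers learn to discount the marks.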
Radiologists and visual attention researchers have a lot to say to each other. Radiologists perform remarkable feats of visual search on a daily basis. By trying to understand what they do, we can come closer to understanding the quotidian world of search in which we all live. At the same time, by understanding the capabilities and limitations of the human search engine, we may be able to identify pitfalls and opportunities in the world of medical image perception.
The work reviewed here was supported by grants from NIH (EY017001), the Office of Naval Research ONR MURI N000141010278, and Toshiba Medical Systems (BWH Agreement No. A203079). In addition, support was provided by NIH postdoctoral fellowships for Trafton Drew (1F32EB011959) and Karla Evans (1F32EY019819).