CAPTCHA – Why We Can Read It and Robots Can’t

This is a typical CAPTCHA security question. People can read the two words; no automated system can.

We’ve all seen it. Whether it’s there to keep automated spammers away from your blog comments or to make sure you are a real person who is registering for an account, at some point we’ve all had to deal with a graphic like the one above. It’s called CAPTCHA, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart. While there is some controversy over who invented it, the process was first patented in 1998 by Mark D. Lillibridge, Martin Abadi, Krishna Bharat, and Andrei Z. Broder at AltaVista.

Why is CAPTCHA so effective? Because even though it is relatively simple for you and me to read the obscured and distorted words in a graphic, so far no one has been able to program an automated system to do the same thing. Computers can be programmed to scan a picture of a page of printed text and read the words in the picture. However, when the words are obscured or distorted too much, the program doesn’t recognize them anymore. A human looking at the same picture can read the words, even when the most sophisticated automated system cannot.
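The failure mode can be caricatured in a few lines of Python. This is a deliberately crude sketch, not a real OCR system; real recognizers are far more flexible, but the failure is the same in kind: matching breaks down once the text is distorted past the program's model.

```python
# A crude caricature of rigid character recognition.

def rigid_match(image_text, template):
    """Succeed only on an exact character-for-character match."""
    return image_text == template

clean = "HELLO"
distorted = "H3LL0"  # CAPTCHA-style substitutions a human still reads at a glance

print(rigid_match(clean, "HELLO"))      # True
print(rigid_match(distorted, "HELLO"))  # False: the distortion defeats the matcher
```

A human reader shrugs off the substitutions; the rigid matcher cannot, and making the matcher more tolerant without also making it accept wrong answers is exactly the hard part.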

A team of scientists at the Salk Institute for Biological Studies is starting to reveal the amazing complexity behind our ability to interpret such images.

Scientists have known for a long time that there are several “levels” of interpretation involved in the human visual system. The light-sensitive cells in the eye send signals to the brain, and those signals are processed by brain cells called neurons. Some of the first neurons to process the signals coming from the eye are only able to see the information that comes from a tiny region in the area that is being examined. However, those neurons send their results to other neurons for further processing, and those neurons send their results “up the line” to another set of neurons. When the signals reach an area of the brain referred to as V4, the neurons there process the information so as to interpret the images coming from a large section of what is being viewed by the eyes. This kind of processing allows the brain to recognize shapes, like lines and curves.

Consider, for example, looking at the number “5.” The early neurons that process the visual information coming from the eyes don’t see the number at all. They just see tiny sections of the number. However, they send their processed information to other neurons, which do more processing and then send the information on to another set of neurons. When the information reaches the V4 section of the brain, the neurons there are tuned to recognize specific shapes from larger regions of what is being examined. One set of neurons, for example, might recognize the curve at the bottom of the “5.” Another set might recognize the vertical line that sits on top of the curve, and another set might recognize the horizontal line that sits on the vertical line. It’s actually a lot more complicated than that, but at least it gives you some idea of how all this works.
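The layered scheme described above can be sketched as a toy program. This is not a biological model; the bitmap and the detector rules are invented purely for illustration of how small local reports get combined into shape-level detections.

```python
# Toy hierarchy: pixel-level "cells" feed line-level "detectors".
# A 5x4 binary bitmap of the digit "5".
BITMAP = [
    "1111",
    "1000",
    "1111",
    "0001",
    "1111",
]

def local_cells(bitmap):
    """Level 1: each cell reports only whether its single pixel is lit."""
    return {(r, c): bitmap[r][c] == "1"
            for r in range(len(bitmap)) for c in range(len(bitmap[0]))}

def horizontal_line_detectors(cells, width=4):
    """Level 2: a detector fires when every cell in its row is lit."""
    rows = sorted({r for (r, _) in cells})
    return {r: all(cells[(r, c)] for c in range(width)) for r in rows}

cells = local_cells(BITMAP)
lines = horizontal_line_detectors(cells)
print([r for r, fired in lines.items() if fired])  # [0, 2, 4]: the three bars of the "5"
```

No level-1 cell "sees" a bar, let alone a "5"; only the combination at the next level does, which mirrors (very loosely) how signals are combined on the way up to V4.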

In two separate papers, the scientists investigated how the neurons in the V4 responded to simple shapes (like the lines and curve in a “5”) [1] and how they responded to images that contained more complicated shapes (like a natural scene) [2]. In the second paper, the authors start with this statement:

Although object recognition feels effortless, it is in fact a challenging computational problem. There are two important properties that any system that mediates robust object recognition must have. The first property is known as “invariance”: the ability of the system to respond similarly to different views of the same object. The second property is known as “selectivity.” Selectivity requires that systems’ components, such as neurons within the ventral visual stream, produce different responses to potentially quite similar objects (such as different faces) even when presented from similar viewpoints. It is straightforward to make detectors that are invariant but not selective or selective but not invariant. The difficulty lies in how to make detectors that are both selective and invariant.

In other words, to recognize objects, a sensor must respond the same to different views of the same object, but it must respond differently to similar (but distinct) objects, even when presented from the same view. It was thought that the brain took care of all this by the time the signals reached the V4 section of the brain – that the neurons in the V4 are already selective and invariant, able to recognize the curve of the “5” or the lines in the “5” no matter where they are in the visual field.
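The distinction can be made concrete with a toy example, using strings as stand-ins for images. Everything here is invented for illustration; it is not how the visual system, or any real recognizer, works.

```python
# Toy detectors over strings standing in for images.

def invariant_detector(target):
    """Fires on the target pattern anywhere in the image (shift-invariant)."""
    return lambda image: target in image

def positional_detector(target, pos):
    """Fires only when the target sits at one fixed position (not invariant)."""
    return lambda image: image[pos:pos + len(target)] == target

anywhere = invariant_detector("5")
at_start = positional_detector("5", 0)

print(anywhere("..5.."))  # True:  same object in a shifted view
print(at_start("..5.."))  # False: fails the shifted view (lacks invariance)
print(anywhere("..S.."))  # False: rejects a similar but distinct shape (selective)
```

With exact symbols, being both selective and invariant is trivial. Real views also distort the object – the CAPTCHA problem again – and that is where building a detector that is simultaneously selective and invariant becomes hard.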

The authors show that the V4 neurons are invariant when it comes to simple shapes (like the lines in the “5”), but they are not invariant when it comes to more complicated shapes (like the curve in the “5”). In order for the visual system to become invariant with the complicated shapes, more processing has to be done. This makes sense, of course, since it should be easier to process simple shapes than it is to process complicated shapes. However, until now, we haven’t appreciated the depth of the processing this requires.

So why can we interpret a CAPTCHA when the most sophisticated automated system that exists today cannot? Because our brains can process visual stimuli so that the information is both selective and invariant. Right now, the best of human technology cannot match such a feat. Of course, as we learn more about how the brain produces both selectivity and invariance, our automated systems will become better, because the designers of the automated systems will be able to copy (in a crude way) the design of the Ultimate Engineer.

REFERENCES

1. Anirvan S. Nandy, Tatyana O. Sharpee, John H. Reynolds, and Jude F. Mitchell, “The Fine Structure of Shape Tuning in Area V4,” Neuron 78(6):1102-1115, 2013.

2. T. O. Sharpee, M. Kouh, and J. H. Reynolds, “Trade-off between curvature tuning and position invariance in visual area V4,” Proceedings of the National Academy of Sciences of the United States of America, DOI:10.1073/pnas.1217479110, 2013.

10 Comments

  1. josiah says:

    It feels like there’s something missing here.
    We work with multiple roughly concentric levels of classification in what we see.
    You see a face, and you see that the face is Kathleen, and you see that it’s the face Kathleen has when she’s in a bad mood and might slap you.
    Sometimes we don’t need to recognize a particular person’s face, we just need to know what the generic face looks like. At other times you really do need to know one face from another, because if you kiss the wrong one you’re in for the slappy face!
    It seems that you not only need to recognize equality between two objects; you need to identify the similarities and differences so that you can respond appropriately.

    1. jlwile says:

      That’s an excellent point, Josiah. Somehow, I recognize my wife’s face, but I also recognize several different variations on that face, each of which indicates a mood or disposition. I assume this is higher up on the processing chain, after selectivity has been established.

  2. Winston Ewert says:

    Actually, we have automated systems which are pretty good at reading CAPTCHA images. You can actually see this because CAPTCHAs have gotten more and more difficult to read as time has gone on. That’s because the people making CAPTCHAs are constantly making them harder in order to defeat more and more sophisticated reading techniques.

    None of that is to say that our visual system isn’t way more impressive than anything we’ve come up with.

    1. jlwile says:

      I am not aware of any such systems, Winston. Some CAPTCHAs are very difficult to read, but others are not. As I understand it, they all protect against automated systems. The more complex ones are just assumed to be better protection for when automated systems are developed. I would think if there really are automated systems that can deal with CAPTCHAs, optical character recognition (OCR) systems would be significantly better than they are today! Can you give an example?

  3. Winston Ewert says:

    You can see a project for reading CAPTCHAs here: http://caca.zoy.org/wiki/PWNtcha. It manages to defeat many of the CAPTCHA engines out there.

    Comparing it with OCR is a bit of an apples-vs-oranges thing. When we are defeating a CAPTCHA, we are typically studying a particular site’s CAPTCHA, figuring out how it distorts the text, and tweaking the program to undo that. An OCR system has to take an image distorted in just about any way and figure out what it says.

    1. jlwile says:

      Thanks very much, Winston. I had no idea. The point you make is very important, however. In this case, they are programmed for particular kinds of CAPTCHAs, not for CAPTCHAs in general. That’s the hard part.

  4. josiah says:

    When I look at those captchas I cannot immediately read them. It takes me a moment to realize that there are two rough ovoids flipping black and white, at which point I mentally remove those ovoids and can see the text. If you flip those ovoids, you get back to something that would probably be within range of a good OCR program.

    I would speculate that something which seeks to read captchas wouldn’t just throw them into an OCR program and hope for the best. Instead the intelligent programmer would work out what distortions are being placed on the image and remove them first. You’d then get a program that is fairly good at solving a specialized type of captcha.

    1. jlwile says:

      Well, Josiah, it seems that you are exactly correct. Winston’s link demonstrates that! You certainly aren’t wasting that Oxford education!

  5. Luke says:

    CAPTCHA is not that effective, because human labor is being exploited to solve captchas for spam purposes. http://blog.minteye.com/2013/02/26/captcha-solving-human-labor/

    1. jlwile says:

      Wow, Luke. Spammers are relentless, aren’t they?