In my previous post, I described the rather amazing auditory abilities of human beings. Based on microsecond time-keeping and split-second computations, humans can localize sounds with impressive accuracy even in confusing auditory environments.
One of the fun things about psychology and neuroscience, however, is how full of contradictions we are as humans. For example, having (hopefully) impressed you with our outstanding powers of auditory localization, consider for a moment our stunning incompetence in this very same area.
For sounds originating from directly in front of us, we can localize the sound source to within 2-4 degrees. Even for sounds originating from off to the sides, we still do pretty well: within about 5-10 degrees. And yet many of us have experienced a stunning example of mis-localization: the ventriloquism effect.
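To get a feel for the microsecond timekeeping behind those numbers, here is a toy model of the interaural time difference (ITD): the tiny head start a sound gets at the nearer ear. The head width and speed of sound below are assumed round figures, and the formula is a deliberately simplified textbook approximation, not a full model of the head's acoustics.

```python
import math

# Simplified interaural-time-difference model: a sound at azimuth
# theta reaches the nearer ear earlier by roughly
# (ear separation / speed of sound) * sin(theta).
EAR_SEPARATION_M = 0.18   # assumed distance between the ears, in meters
SPEED_OF_SOUND = 343.0    # meters/second in air at room temperature

def itd_microseconds(azimuth_deg):
    """Time advantage of the nearer ear, in microseconds."""
    seconds = (EAR_SEPARATION_M / SPEED_OF_SOUND) * math.sin(math.radians(azimuth_deg))
    return seconds * 1e6

print(f"{itd_microseconds(3):.0f} microseconds")   # near the 2-4 degree limit in front
print(f"{itd_microseconds(90):.0f} microseconds")  # sound directly to one side
```

Even a sound all the way off to one side arrives only about half a millisecond earlier at the nearer ear, and a 3-degree offset in front yields a difference of a few dozen microseconds. That is the timing resolution the auditory system is working with.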
A ventriloquist, of course, is a performer who operates a puppet and also speaks for the puppet without moving his or her own lips. The ventriloquist may carry on a conversation with the puppet, speaking normally when upholding his or her part of the conversation, but hiding lip movements when voicing for the puppet. The illusion can be very real if the ventriloquist is skilled; the puppet’s voice is perceived to actually emanate from the puppet’s moving mouth. Indeed, the illusion is so compelling that we speak of the ventriloquist “throwing his voice”.
Depending on how close we are to the puppeteer, this may represent a fairly small localization error (but a noticeable one, since we perceive a difference between where the ventriloquist’s voice comes from when voicing the puppet versus speaking normally). Yet we are capable of making much more stunning errors in localization.
For example, I often show video clips in my classroom. Often the video clips feature scientists being interviewed. Without fail, both I and my students perceive the scientist’s words as emanating from his or her moving mouth. It’s the most natural perception in the world (voices come out of moving mouths, after all), yet a moment’s reflection would reveal the impossibility of this perception. The visual image of the scientist is being projected onto a screen. There is no talking person there, only the image of a talking person, and voices don’t emanate from projection screens.
Where is the voice coming from? In most classrooms, from speakers placed in the ceiling above the students’ heads. Given the impressive auditory localization abilities described in my last post, shouldn’t we quite naturally perceive the sound as emanating from above us, and not from the ghostly image in front of us?
We can, to be sure, but only after willfully shifting our attention from visual analysis to auditory analysis. Under most circumstances we won’t. We’re the ones “throwing voices” around – in this case from the ceiling to the front of the room! If the classroom situation isn’t relatable, consider a movie theater or a rock concert. Speakers – often large, often clearly visible – are located in the walls of the cinema or left and right of the performers on stage. But if we aren’t concentrating on sound localization per se, the sound will seem to be coming from the movie screen, or the performers in the middle of the stage.
Now this is an illusion; our brains are wrong to allow it to happen. And yet who can deny the adaptive value of the illusion? The puppet act is much more enjoyable if we imagine the puppet actually has its own voice, and the movie is much more understandable if we perceive the voices and sound effects emanating from the screen. The rock concert may be the best example of all: the sound is coming from the speakers, not the performers; but after all, it is the actions of the rock band that cause the speakers to make the sound in the first place. And so, like two wrongs making a right, our perception correctly shifts the sound back to its original source: the guitarist, the bassist, the drummer.
Psychologists call this visual capture because the image seems to have captured the sound, taking it from where it should be to where our visually-based interpretation needs it to be. Although our auditory localization skills are indeed impressive, our visual localization abilities are orders of magnitude superior. Given a conflict between visual and auditory information, our brains appear to prefer the visual interpretation.
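One common way researchers formalize this preference is reliability-weighted cue combination: the brain averages the two location estimates, weighting each by how precise (low-variance) that sense is. Because vision localizes far more precisely than audition, the combined estimate lands almost on top of the visual one. The sketch below illustrates the idea with hypothetical numbers; the specific locations and noise levels are invented for illustration.

```python
# Minimal sketch of reliability-weighted cue combination.
# Each cue's weight is its inverse variance (its reliability),
# normalized so the weights sum to 1.

def combine(x_vis, sigma_vis, x_aud, sigma_aud):
    """Fuse a visual and an auditory location estimate (degrees)."""
    w_vis = 1.0 / sigma_vis ** 2
    w_aud = 1.0 / sigma_aud ** 2
    return (w_vis * x_vis + w_aud * x_aud) / (w_vis + w_aud)

# Hypothetical numbers: the puppet's mouth at 0 degrees, seen very
# precisely (sigma ~1 degree); the ventriloquist's actual voice at
# 10 degrees, heard less precisely (sigma ~5 degrees).
perceived = combine(0.0, 1.0, 10.0, 5.0)
print(round(perceived, 2))  # ~0.38: the voice is "captured" by the puppet
```

Note that neither sense simply wins outright; the auditory cue still nudges the percept slightly. But with vision this much more reliable, the nudge is tiny, which is why the voice seems to come from the puppet's mouth.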
Interesting as the ventriloquism effect is, it is not the most astonishing influence of the visual system over our auditory perception. Vision can influence not just where a sound appears to be coming from, but also what the sound sounds like. The simplest example of this is the McGurk effect.
The following video clip is very short. We are going to watch the clip three different times, with three different sets of instructions.
Experiment 1. The first time through, watch the clip as you normally would, making sure: 1) the sound is working and is turned up on your computer and 2) you are watching the man as he speaks his nonsense syllables.
Most people report hearing something along the lines of “Da-da, da-da, da-da” or possibly “Tha-tha”. If you did not have that experience, run experiment 1 again paying careful attention to the instructions.
Experiment 2. Does he really say “Da-da”? Run the video clip again, but this time with your eyes closed.
Whoa! If you’re like me, you now hear something completely different – “Ba-ba, ba-ba, ba-ba.” Is YouTube playing tricks on us? Have they switched out video files somehow? The third experiment is the clincher.
Experiment 3. Run the video clip again. This time, you decide when to open and close your eyes. In other words, during the 5 second clip, open and close your eyes a couple of times at random.
Now you should have the uncanny sense that you are controlling the sound with your eyes.
What’s going on here? The man is filmed saying “Da-da, da-da, da-da,” but the video producers have removed the audio and replaced it with audio of the man saying “Ba-ba, ba-ba, ba-ba.” Although the “ba”s coming out of your speakers are clearly audible, your eyes don’t see the man’s lips coming together (as they must to make a b-sound). Instead, your eyes see lips open, and may catch his tongue meeting the roof of his mouth (as it must to make a d-sound).
Still, this seems very mysterious. With the ventriloquism effect, it makes sense that, given a conflict, the brain might choose the visual interpretation over the auditory interpretation given the superior spatial localization skills of the visual system. But here the conflict is over what phonemes (speech sounds) the man is making. It doesn’t seem to make sense that our visual system would be better at picking up speech sounds than our auditory system. If you’ve ever tried to make sense of what the newscaster on TV is saying with the sound on mute, it’s a pretty unreliable exercise. (Not that people can’t get pretty good at reading lips, which I mention so that I can post a Seinfeld clip:)
The truth is, though, the difference between “ba” and “da” is incredibly subtle. The consonants are two different ways of stopping the air flow through our mouths, each “releasing” into the same vowel sound. In other words, the sound waves generated by ba and da are extremely similar, but for a very quick difference at the very beginning. The fact that we perceive these sounds as being so very different, so clearly distinguishable, is not a reflection of how similar or different those sound waves are, but rather reflects a great deal of learning enhanced by visual input and linguistic context. (Consider how jumbled, muddled, and fast a foreign language sounds to our ears – the difficult-to-discriminate sounds of an unfamiliar language reveal how difficult parsing speech really is.)
In other words, vision can clarify differences in speech sounds. If you’ve ever listened to someone who speaks with a thick accent, or who has a cold, or who speaks very softly, you may find it considerably easier to “hear” what the person is saying if you are looking at the person as he or she talks. That is not to say we can’t distinguish speech sounds on the basis of audition alone, only that it tends to be much easier if we have corresponding visual input. We make more auditory interpretation errors on the phone than we do in person, for example.
The McGurk effect is stunning because it makes clear how wrong we can be, but of course, we hardly ever encounter situations where the auditory and visual stimulation are inconsistent with one another. Outside of ventriloquists and psychology experiments, vision’s influence on our auditory perception almost always improves our perception rather than leading it astray.
Does the reverse ever occur? Does audition ever dominate vision? As far as I know, auditory capture, which has certainly been demonstrated, tends to be more subtle. But movie sound effects offer a good example. When our hero tracks down the bad guy and throws a punch, we know the actors aren’t actually making contact with one another. But the sound effect of the punch landing certainly makes the confrontation seem more violent than it actually is.
The original Star Trek provides tons of great examples – because the fight scenes, visually, are often – um, shall we say – less than compelling. Without the sounds of shirts ripping and punches landing, it wouldn’t seem like a fight at all. When Kirk does his famous two handed chop move on some bad guy’s shoulders it doesn’t look that bad, but it sounds awful.
Michael Bach’s excellent illusions website also provides a different example – the motion bounce illusion – in which the trajectory of two balls seems to change depending on whether sound is present or not.
Vision can also dominate over our other senses in ways that seem almost paranormal – what we see can affect our very body image. But this is another long story, so look for it in my next post.
I could go on about this stuff all day, but – Kroykah!