Joe Clark: Media access

Accessibility research roundup

Updated 2001.07.15


Here’s an occasional Weblog giving you a quick overview and review of, and commentary on, some of the academic research concerning captioning and audio description.

Captioning

In this section, we focus exclusively on Carl Jensema of the Institute for Disabilities Research and Training, the international megastar when it comes to captioning research. What’s he written lately?

Carl Jensema, “Viewer reaction to different television captioning speeds”
American Annals of the Deaf, 143(4):318–324 (1998)

  • Abstract: Video segments captioned at different speeds were shown to a group of 578 people that included deaf, hard-of-hearing, and hearing viewers. Participants used a five-point scale to assess each segment’s caption speed. The “OK” speed, defined as the rate at which “caption speed is comfortable to me,” was found to be about 145 words per minute (WPM), very close to the 141 WPM mean rate actually found in television programs.... Participants adapted well to increasing caption speeds. Most apparently had little trouble with the captions until the rate was at least 170 WPM. Hearing people wanted slightly slower captions. However, this apparently related to how often they watched captioned television. Frequent viewers were comfortable with faster captions. Age and sex were not related to caption-speed preferences, nor was education, with the exception that people who had attended graduate school showed evidence that they might prefer slightly faster captions.
  • Without a doubt, this study represents the most useful hard data we’ve got about captioning. Simply put, this paper tells us how fast captions should be.
  • Interestingly, the “presentation rates” typically used by experienced U.S. captioners are also the ones the subjects of this experiment found the most comfortable. In other words, the Americans are generally doing it right, at least when it comes to caption speed (if not editing, placement, or correctness of transcription).
  • Subjects watched a set of custom-made 30-second videoclips with caption speeds of 96, 110, 126, 140, 156, 170, 186, and 200 WPM. They marked comfort levels on a five-point scale. The most comfortable reading speed would be given a score of 3. The captions were pop-on, not scrollup.
  • Most subjects found slow captions a bit too slow and fast captions a bit too fast. “A mean score of 3 would be associated with a caption speed between 140 and 156 WPM. By means of simple interpolation, an estimated ‘OK’ speed of 145 WPM is derived.”
  • Fun little factoid: Hearing subjects, most of whom did not have as much experience watching captioned TV as deaf or hard-of-hearing viewers, were the ones who were hitting the panic button during the very fastest segments. The hearing participants’ scores on the 186- and 200-WPM segments were the highest of any group on any segment. The average score for the 200-WPM segment, in fact, was over 4 out of 5 on the difficulty scale – the only score that high in the whole experiment. Jensema: “My basic conclusion is that the more hearing [that] people had, the slower they wanted the captions to be.”
  • Jensema provides evidence, based on surveying his subjects, that hearing people spend significantly less time watching captions than deaf or hard-of-hearing people. It is not exactly rocket science to conclude that experience makes you better at watching captions.
    • I would add a personal interpretation here. The entire closed-captioning system was invented because “surveys” done by NCI and PBS in the late 1970s showed that hearing people were quite opposed to open captioning. I’ve read every paper I could get my hands on since the mid-’80s on the topic of captioning, including the very earliest ones, and I’ve never seen the results of those earliest studies published.
    • In other words, we invented an entire technology on the basis of a single opinion poll.
    • If you ask hearing people today if they would accept open captions on their TV shows, they’d probably say no. If you then ask “How much time do you spend watching captioned TV at present?” I suspect you would find that those most resistant to captioning had watched it the least. Most would have never watched it at all.
    • If, moreover, you stopped hearing people in an electronics store, pointed to captions displayed on the bank of television sets, and asked them if they would accept open captions, the answer would again be a firm no. Who can blame them? Standing too far away from too many screens and attempting to concentrate on program dialogue and captions in a crowded, noisy retail store are more than enough to predispose people against captioning.
    • Yet if you asked a group of hearing people to watch all their TV (and home videos) with captions for two solid weeks, it is my submission that an unprecedented percentage of those respondents would agree to either keep captions turned on or to accept a certain amount of open-captioned programming.
    • Closed-captioning is in fact the correct access method in nearly all cases. But open-captioning can work, too. And closed-captioning is not as unpopular with hearing people as is assumed.
    • In any event, hearing people are now the majority audience of captioning, as I have shown elsewhere.
  • One must not interpret Jensema’s study as authorization to edit down fast television dialogue to the magical 145 WPM speed. “As caption speed increased, the respondents recognized this, but most seemed able to adjust and did not appear to consider the captions unacceptable,” even up to 170 WPM. “Only about 1% would consider 141 WPM ‘too fast.’ ” Irrespective of hearing status, caption viewers can keep up with most captioning in the real world.
  • It would be useful to conduct future research on the preferred “presentation rate” of scrollup captioning. For technical reasons, it’s possible to display scrollup captions far faster than anyone could reasonably read them over the span of an entire television program (300 WPM is easily attained). What are viewer preferences for:
    • live-display vs. stenocaptioning – in other words, captions that scroll up as complete, ready-made lines or appear word-by-word?
    • captions at bottom, top, or moving between bottom and top (as in many sportscasts)?
    • all-capitals vs. upper-and-lower-case captions?
    • short vs. long line lengths?
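
The “simple interpolation” Jensema used to derive the 145-WPM “OK” speed can be sketched in a few lines. The two mean comfort scores below are invented for illustration – the paper reports only the bracketing speeds (140 and 156 WPM) and the ≈145 WPM result:

```python
# Hypothetical illustration of Jensema's "simple interpolation":
# given mean comfort scores bracketing the ideal score of 3,
# estimate the speed at which the mean score would equal 3 exactly.

def interpolate_ok_speed(speed_lo, score_lo, speed_hi, score_hi, target=3.0):
    """Linear interpolation between two (speed, mean-score) points."""
    fraction = (target - score_lo) / (score_hi - score_lo)
    return speed_lo + fraction * (speed_hi - speed_lo)

# Invented example scores: 140 WPM rated a shade "too slow" (2.90),
# 156 WPM a shade "too fast" (3.22).
ok_speed = interpolate_ok_speed(140, 2.90, 156, 3.22)
print(round(ok_speed))  # 145 with these invented scores
```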

Carl Jensema, Ralph McCann, and Scott Ramsey, “Closed-captioned television presentation speed and vocabulary”
American Annals of the Deaf, 141(4):284–292 (1996)

  • Abstract: [...]Caption data were recorded from 205 television programs. Both roll-up and pop-on captions were analyzed. In the first part of the study, captions were edited to remove commercials and then processed by computer to get caption-speed data. Caption rates among program types varied considerably. The average caption speed for all programs was 141 words per minute, with program extremes of 74 and 231 words per minute. The second part of the study determined the amount of editing being done to program scripts. Ten-minute segments from two different shows in each of 13 program categories were analyzed by comparing the caption script to the program audio. The percentage of script edited out ranged from 0% (in instances of verbatim captioning) to 19%. In the third part of the study, commonly-used words in captioning and their frequency of appearance were analyzed. All words from all the programs in the study were combined into one large computer file. This file, which contained 834,726 words, was sorted and found to contain 16,102 unique words.
  • The study gives a quickie history lesson of U.S. captioning and makes a perfectly accurate but unsubstantiated remark: When it comes to editing captions as extensively as was done on The Captioned ABC News in the 1970s, “almost everyone now considers [that] overediting.”
  • Deaf viewers have asked for full access to the spoken words of a program. “Caption companies have tended to interpret this as meaning deaf people want straight verbatim captioning.” But how close to verbatim are captions in the real world?
  • For the study, researchers recorded 180 hours of television in many categories and 22 music videos. (Home videos were also included, but they were apparently matched against films shown on TV during the period of the study.)
  • Captions were downloaded into a computer file and matched against time criteria like total program time and appearance and disappearance of captions. Special consideration was given to roll-up (“scroll-up,” “scrollup”) captions, whose lifespans are harder to pin down than pop-on (“pop-up,” “popup”) captions.
  • The study looked at the words edited out of certain selected programs; the full corpus of words used in all programs and the list of unique words in that corpus; and caption speed.
  • “We found that roll-up captions generally present more words over a given period than pop-up captions (151 WPM vs. 138 WPM), and that roll-up captions are used for a wider range of audio speeds, from very slow (74 WPM) to very fast (231 WPM [!]).” It is not clear just where the “evaluation of audio speeds” came from or how it was done.
  • Many genres of programming “tended to cluster around the mean captioning speed of 141 WPM.”
  • Length of words had no bearing on difficulty of reading. Shows with very slow or very fast captions had essentially equal average word lengths.
  • Caption editing, if you look at Jensema’s numbers, is not a huge problem. A subsection of all the recorded programs was examined for degree of verbatim transcription; “the average was 94% captioned,” with a low of 81% (explained as an anomaly; the next one up was 87%) and a high of 100%. Unfortunately, the qualitative aspects of edited captions could not quite be considered.
    • I remember, back in the dawn of closed-captioning, that the ill-informed caption “editors” of NCI laboured to shave individual words off captions. The assumption was that reducing word length, even by a single word, made a caption easier to read.
      • This, of course, is ridiculous: With rare, specific exceptions, people do not read word-by-word. The eye pogos along the line in so-called saccades, and words are recognized more by shape and outline than letter-by-letter.
      • NCI would take this mania to extremes, turning copula verbs into contractions, even when attached to long noun phrases: “The prime minister of the United Kingdom’s expected to land in Washington within the hour.”
      • I rarely see egregiously inappropriate editing in American captions – those produced by the big names, at least. It is entirely commonplace in Canadian captioning to witness an all-out, indisputable butchering of the source text, particularly for programs captioned by inept neophytes working in broadcasters’ own in-house caption departments.
      • Yet it is quite likely that even those butchered captions provide more than 87% captioning, though the goal of equivalence of access is not met.
  • In analyzing the concordance or set of terms in the sample, “[j]ust 250 words accounted for more than 2/3 of all the words used in the captions.... [M]astery of fewer than 500 words will help a viewer to understand most of the vocabulary in any television program shown in the United States today.”
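
The third part of the study – sorting the full caption corpus by word frequency and asking how much of the running text the commonest words cover – can be sketched as follows. The corpus here is a toy stand-in for the study’s 834,726-word file:

```python
from collections import Counter

def coverage_stats(words, top_n=250):
    """Frequency-sort a caption corpus and report the number of unique
    words plus the share of running text the top_n words account for."""
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(top_n))
    return len(counts), covered / total

# Toy 19-word corpus standing in for the study's caption file.
corpus = ("the boat sank and the crew swam to the shore and "
          "the crew built a fire on the shore").split()
unique, share = coverage_stats(corpus, top_n=3)
print(unique, round(share, 2))  # 12 unique words; top 3 cover 47%
```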

Carl J. Jensema, Sameh El Sharkawy, Ramalinga Sarma Danturthi, Robert Burch, and David Hsu, “Eye-movement patterns of captioned-television viewers”
American Annals of the Deaf, 145(3):275–285 (2000)

  • Abstract: Eye movement of six subjects was recorded as they watched video segments with and without captions. It was found that the addition of captions to a video resulted in major changes in eye-movement patterns, with the viewing process becoming primarily a reading process. Further, although people viewing a specific video segment are likely to have similar eye-movement patterns, there are also distinct individual differences present in these patterns. For example, someone accustomed to speechreading may spend more time looking at an actor’s lips, while someone with poor English skills may spend more time reading the captions. Finally, there is some preliminary evidence to suggest that higher captioning speed results in more time spent reading captions than on a video segment.
  • Three deaf/hard-of-hearing and three hearing subjects sat in a special apparatus and watched brief captioned segments on a computer monitor. The apparatus followed the movements of the subjects’ eyes, tracking exactly where they looked.
  • Captioned and uncaptioned segments were watched without audio. The image content was more or less comparable within matched pairs of captioned and uncaptioned videoclips. Two additional segments, custom-made for the experiment, contained precisely 80-WPM and 220-WPM captioning.
  • For segments with no captions, eye movements tended to zip around the screen, with the exception of a Peter Jennings newscast, in which case eyes tended to focus on Jennings’ head (or thereabouts).
  • But for segments with captions, the preponderance of eye gaze dominated at the bottom of the screen. “The addition of captions apparently turns television viewing into a reading task, since the subjects spend most of their time looking at captions and much less time examining the picture.”
  • Rather interestingly, subjects were re-tested with the same videos a few days later. On the second viewing, subjects spent slightly more time looking at the picture than they had the first time. That was also true of the hearing subjects, who initially had less experience watching captions (“only the deaf subjects watched it regularly”).
  • For the two special segments, the slow-captioned video gave viewers more of a chance to watch the video, while the fast-captioned video forced nearly all the subjects’ attention to the captions.
  • “Viewers read the caption and then glance at the video action after they finish reading.” Testify!
    • Captions are alleged to be “distracting” to hearing viewers.
    • We have evidence here that viewers of all stripes tend to spend most of their time watching captions. As we see from this study, captions technically are distracting.
    • However, the purpose of captions is to be read. Undistracting captions are failed captions.
    • Moreover, we do not have any evidence of understanding of the program, or retention. Caption viewers will all tell you that, after you get the hang of reading captions, you don’t really miss the rest of the action. But you aren’t looking directly at it very often. Do we infer that peripheral vision comes into play? Based on my experience, that is clearly true. It’s just that we don’t have any experimental evidence. It is still possible for captioning detractors to claim that captions are “distracting” and force you to ignore the all-important main video.

Carl J. Jensema, Ramalinga Sarma Danturthi, and Robert Burch, “Time spent viewing captions on television programs”
American Annals of the Deaf, 145(5):464–468 (2000)

  • Abstract: The eye movements of 23 deaf subjects, ages 14 to 61 years, were recorded 30 times per second while the subjects watched four 2.5-minute captioned television programs. The eye-movement data were analyzed to determine the percentage of time each subject actually looked at the captions on the screen. It was found that subjects gazed at the captions 84% of the time, at the video picture 14% of the time, and off the video 2% of the time. Age, sex, and educational level appeared to have little influence on time spent viewing captions. When caption speed increased from the slowest speed (100... WPM) to the fastest speed (180 WPM), mean percentage of time spent gazing at captions increased only from 82% to 86%. A distinctive characteristic of the data was the considerable variation from subject to subject and also within subjects (from video to video) in regard to percentage of time spent gazing at captions.
  • Captioning “is a much more complicated process than it may seem and requires many decisions concerning timing and screen placement.”
  • Four silent custom videoclips, captioned at 100, 120, 140, 160, and 180 WPM, were used in the study.
  • All 23 subjects were deaf. On average, subjects spent 84% of their time looking at captions. Across caption speeds, the mean ranged only from 82% to 86% – near-uniformity, in other words.
  • Along with the previous study, we have ample evidence that the viewing of captioned programming involves spending most of your time looking at captions.
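
The percentage-of-time tabulation in these eye-tracking studies can be sketched as below, assuming each 30-per-second eye-tracker sample has already been classified by screen region. The region labels and the sample stream are invented for illustration:

```python
# Sketch of tabulating gaze time by screen region from classified
# eye-tracker samples. Region names are invented for illustration.

def gaze_shares(samples):
    """Return the share of samples falling in each of three regions."""
    total = len(samples)
    regions = ("caption", "picture", "off")
    return {r: sum(1 for s in samples if s == r) / total for r in regions}

# Ten invented samples: 8 on the captions, 1 on the picture, 1 off the
# video, mirroring the study's rough 84% / 14% / 2% split.
samples = ["caption"] * 8 + ["picture"] + ["off"]
shares = gaze_shares(samples)
print(shares["caption"])  # 0.8 for this invented stream
```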

Audio description

James Turner, “Some characteristics of audio description and the corresponding moving image”
Information Access in the Global Information Economy: Proceedings of the 61st ASIS [American Society for Information Science] Annual Meeting, 35:108–117 (1998)

  • Abstract: Just as closed-captioning adds visual information for the benefit of hearing-impaired television viewers, audio description is a technique which adds an audio track describing the images for the benefit of the visually-impaired. This research is concerned with reusing the texts produced for use by audio describers as a source for automatically deriving shot-level indexing for film and video products. A first step in studying the question of recycling audio-description text for this purpose is to identify the characteristics of it in described productions.... This paper proposes to flesh out those results by analyzing the characteristics of a few different kinds of described television productions, drawing some conclusions about the usefulness of the technique for purposes of automatically deriving shot-level indexing for moving-image materials.
  • Turner, whose interest lies in indexing motion pictures, examined three 27-minute segments of DVS-described programming (a Nova episode, Poirot, and Jurassic Park).
  • Using his own techniques, Turner indexed the contents of the descriptions and when they appeared.
  • The number of “episodes” (really, “instances”) of audio description in the 27-minute segments was 53, 107, and 197. That equates to 212, 428, and 788 descriptions in an equivalent two-hour program, representing up to six times as many as the example of live audio description I examined (165 instances). The low end of the scale, however, is comparable – 212 vs. 165. The extrapolated figures may, however, be excessive.
  • The number of shots accompanied by audio descriptions varied from 36% to 56%. But “in only 25.8% of cases is the audio description text spoken entirely during the shot it describes.” Many descriptions overlap scenes or anticipate scenes (sometimes two scenes away).
  • The research provides an extremely useful list of “types of information transmitted by the audio-description text”:
    1. Physical description of characters
    2. Facial and corporal expressions
    3. Clothing
    4. Occupation, roles of characters
    5. Information about the attitudes of characters
    6. Spatial relationships between characters
    7. Movement of characters
    8. Setting
    9. Temporal indicators
    10. Indicators of proportion
    11. Décor
    12. Lighting
    13. Action
    14. Appearance of titles
    15. Textual information included in the image
  • This study appears to be the first academic research into the contents and deployment of audio description on television. (Actually, Turner does have an earlier study on his site. Indeed, several papers are available there.)
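
The extrapolation from Turner’s 27-minute segments to an “equivalent two-hour program” works out to a ×4 multiplier – four commercial-free 27-minute segments (108 minutes of actual programming) fill a nominal two-hour broadcast slot. A quick sketch of the arithmetic, matching the figures above:

```python
# Extrapolating description-instance counts from 27-minute segments
# to a nominal two-hour broadcast slot (4 x 27 = 108 program minutes).

SEGMENTS_PER_TWO_HOUR_SLOT = 4

def extrapolate(instances_per_segment):
    """Scale a per-segment count up to a two-hour slot."""
    return instances_per_segment * SEGMENTS_PER_TWO_HOUR_SLOT

counts = [53, 107, 197]  # description instances per 27-minute segment
print([extrapolate(c) for c in counts])  # [212, 428, 788], as in the text
```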