Can we enlist AI to accelerate human-led work in alt text and audio description?

DESCRIPTION

To watch the recently released “All The Light We Cannot See” with audio descriptions “on” is a revelation, at least for a sighted person. The audio description uses words sparingly to augment the obvious soundscape and to call out subtle details anyone might easily miss. It’s art only a human team could produce (sorry, AI proponents), but it’s also expensive and time consuming. In that regard, producing alt text for images online and producing audio descriptions for video face the same challenge: how to do more and do it well. At Scribely and Max, the human-first approach is uppermost, but both are also exploring how AI and related tech can be narrowly channeled to speed up their vital work.

Speakers
- Caroline Desrosiers, Founder & CEO, Scribely
- Angela McIntosh, Senior Product Director for Accessibility, Warner Bros. Discovery
- Moderator: Thomas Reid, Host and Producer, Reid My Mind Radio

SESSION TRANSCRIPT

[MUSIC PLAYING]

THOMAS REID: Thank you, Alice, and thank you, Sight Tech Global. In order to begin this conversation covering both image and audio description, I’ll begin with my own personal image description. I’m a brown skinned Black man with a smooth shaven bald head, full, neat beard sprinkled with salt and pepper, and I’m wearing dark shades, and a burnt orange button up shirt and seated in my vocal booth, which has a beige colored background. My pronouns are he/him/his. I’d like to welcome our panelists, Angela McIntosh and Caroline Desrosiers, and ask you each to introduce yourselves. So we’ll start with Angela.

ANGELA MCINTOSH: Hi, my name is Angela McIntosh. I am a white woman with long, brown hair and some funky, yellow, beaded earrings, a black turtleneck, and a beige background behind me. And my pronouns are she/her/hers.

THOMAS REID: Caroline.

CAROLINE DESROSIERS: Hi, everyone. Caroline Desrosiers. Pronouns she/her. I’m a white woman with auburn hair and bangs wearing a crewneck sweater and sitting in front of a whimsical wallpaper background filled with trees.

THOMAS REID: Awesome. So let’s take a few minutes for each of our panelists to help set the stage for this conversation. So Caroline, I’ll start with you. Can you talk a bit about the current state of image descriptions on the web and the challenges associated in expanding that access?

CAROLINE DESROSIERS: Yeah. So as I’m sure we all know, millions of images are added to the web daily, maybe even hourly at this point. And most people know that Alt text is required at this point. But we’re not doing that great of a job at actually describing these images. We’re actually seeing a lot of bad Alt text out there on the web. And I define bad Alt text as either missing, inaccurate, or incomplete in some way. And it’s difficult to find statistics on this, but we do have the WebAIM Million study that tracks how we’re doing at Alt text on the top 1 million home pages over time. And for a while, we were doing better. Since 2019, we were seeing those rates go down and down and down. And unfortunately, now they’re going back up again. And we have, right now, 58% of home pages missing Alt text in 2023. And in terms of what’s happening in the Alt attribute, everything under the sun, right? I’ve seen it all. So there’s some very SEO-focused Alt text out there where it’s a string of keywords. We’re also seeing a lot of formulaic Alt text where either the adjacent text nearby is copied or it’s somehow a formula. This plus this plus this equals Alt text, which of course, is not descriptive. On social media, we’re seeing some AI generated Alt text. You might have heard maybe an image of this or an image from an account. So some form of just generic Alt text that’s added to the web there. And this is a real problem, because it creates barriers to information access which has tremendous impacts on research, education, just access to information in general, career opportunities, lots of– huge impact there. And then exclusion in other spaces. If we can’t access images that really impacts social lives and access to resources, buying products online, and then of course entertainment. So my company, Scribely, is really focused on, how can we fix this problem, how can we address this challenge? After all this time has gone by, we’re still seeing these issues with Alt text. 
So we’re really focused on, well, how can we use different approaches, especially involving humans, to improve Alt text on the web?

THOMAS REID: Awesome. Thank you. Angela, can you address the state of audio description and the challenges in not only expanding but also assuring quality?

ANGELA MCINTOSH: Sure. Audio description is one of my most favorite topics. A lot like Caroline, we spend our time thinking about good, better, and best ways to describe things and how we can do this at scale. I work at Warner Brothers Discovery, which is an umbrella media company that is home to brands like Warner Brothers Pictures, Warner Brothers Television, HBO, HBO Latin America. There’s Max. There’s Cartoon Network. There’s TNT Sports. There’s just any number of different content types. So as we think about where the most important places to invest in audio description, it has been sort of a wild last two years. We went from roughly 1,500 hours of audio description to 7,500 hours in the last two years. And it’s been absolutely a learning journey. There’s been a number of challenges. One, we love the idea of AI and what it will bring to us in the future. But audio description itself, it requires a human being to actually not just describe what’s on the screen but describe what’s on the screen that’s meaningful to the plot. What moves the story forward? What’s going to help someone understand the context of why a character is saying what they’re saying? And the technology is just not quite there yet to really build meaningful scripts for audio description. So we’re still absolutely in a human process of writing audio description scripts, casting voice talent, making sure that it’s real people. And a big thing that matters to us is that the storytelling in the audio description matches the level of the storytelling that’s happening on the screen. You know, Warner Brothers have been some of the best storytellers in the world for the last 100 years. And so we hold a really high bar for the audio description, and it has to match that standard. So that’s kind of where we are today.

THOMAS REID: All right, thank you. So I know the process of creating image descriptions is quite different based on the image, right? The level of detail, the familiarity with the subjects, and setting of the image can all require a significant amount of research. So, Caroline, can you discuss the process or workflow in creating image descriptions and highlight those areas where technology can be of help?

CAROLINE DESROSIERS: Sure. So I love this topic. I love talking about process. When it comes to Alt text, I like to think of it with we have a process for content and then we have a process for actual Alt text workflows. And when it comes to the content, it’s really important to make sure that Alt text is succinct and, as Angela said, meaningful, also relevant here, and, of course, contextual. That’s a critical piece. The image changes based on the context. And really good image description all comes down to pretty much the purpose of the image and the message to the audience. What information is that image conveying? Why is it there on the page? So we really do a lot of work before any writing happens at Scribely where we’re thinking about, why is this image here? What is the role that it plays? And then, of course, there’s also different approaches that you take for different content types. The approach that we’re taking for describing educational content is not the same approach that we’re taking for describing memes, GIFs, internet pop culture. That’s a very different tone and style. And we can really adjust that because we have human writers that specialize in this area. And when it comes to actual workflows, different challenges than creating the content. Because those are decided within organizations or if individuals are content creators and publishing on the web. And there are a few different ways to approach Alt text. Some are training internal teams and designating a group to actually write this content. And then others are hiring external vendors, because they found that they just don’t have the time and resources within their teams or they’re feeling this is a specialized task that needs external support. So based on those two approaches, there can be different workflow challenges. But I like to think about, how can we scale this, right? This is a big question when it comes to Alt text. How can we scale workflows? 
And a few ways we can do that are thinking about Alt text as being controlled at the source within a centralized repository, within an organizational ecosystem, or some sort of system, some place that we’re storing and managing our Alt text. We can also think about scaling Alt text by thinking about where we’re licensing images and getting Alt text from the source. So is it possible that the image providers that are licensing from or the photographers have already created Alt text that we could reference or that we could repurpose and use? So a lot of the work that I do is to think about, how can we establish these systems, how can we put the plumbing in place to make sure that we’re preserving, passing, and managing Alt text efficiently? And also setting up these standard operating procedures within content creation workflows that involve Alt text earlier on in the process so that it’s not an afterthought. And all of those kind of approaches where you’re thinking about how to improve and how can we establish systems, that ultimately results in creating better Alt text. So I think workflows are really interesting and great to explore. If you’re an organization who’s struggling to get Alt text done, it’s often– the culprit is often your workflows or your systems. So that may be worthwhile to actually dig into those.

THOMAS REID: Right. I want to move to a workflow-related problem that impacts the AD experience that we refer to as the passthrough problem. So this is a pretty common experience where a title has been described either when released in a theater, on a specific network, or even on a DVD but when streamed on a different network or broadcasted, the AD is missing. Angela, can you tell us what Max is doing to help eliminate this problem?

ANGELA MCINTOSH: Yeah, absolutely. So I’m actually going to start with a real human story here because it will help ground us in like the complexity of what’s actually happening in the background.

THOMAS REID: Yeah.

ANGELA MCINTOSH: About a year ago, we had a mom reach out to us in New Zealand who was trying to purchase the movie 42 with descriptions for her son. And 42 was described in the United States and on, at that time, HBO Max. And she was baffled as to why it wasn’t going through iTunes. And there’s a person on my team who absolutely was like, I’m going to get this kid this movie. And so he started to traverse internally, what systems deliver to Max? What systems deliver to other distribution endpoints within our company? And how can we make sure that audio description is a track that’s always included inside the company to the distribution endpoints? And that one exercise really was eye opening for us. Thomas speaks about the passthrough problem almost as if everybody knows about it. But there were a lot of sighted people in the company that were frankly– we just didn’t really understand what was happening. And so that was a big eye opener for us about a year ago. One thing that we have started doing is making sure that we’re cross– that we’re attaching audio description to the products and distribution endpoints that we own and are responsible for. But the larger challenge is when you license content from studio to studio, from streaming service to streaming service, what happens in the licensing deal is if it’s not explicitly stated that audio description should be included, then the operational team who delivers it might not have an actual order to deliver it, or they might have a note to deliver it and the receiver might say, well, that’s not in my technology spec to accept this particular track and so give me everything but that track. And I think, we care deeply about this at Max and certainly on my team. And I know that there are allies in each of these major media companies who care about this. 
What it’s really going to take for this problem to get solved is for each of those media companies to agree that we always pass accessible tracks between each other in every licensing deal every single time. And that’s not just audio description, but we really need to see that happen also for closed captions, for SDH, even for sign language assets, right? So we are doing a lot of talking within our team, but really the next and right step is for the industry itself to be talking together. And that’s one of the reasons we’re really excited to be here at this conference today. We’re hoping that some of our friends and colleagues are also listening to these kinds of messages and thinking about what the industry could do. I think just handling that in the licensing would be game changing for everybody.

THOMAS REID: Awesome. OK. We’ll move to AI in a bit, right? So we all here now. But ChatGPT, AI in general is the subject that everyone’s talking about today. Educators, entertainment industry, business in general are all trying to figure out its place in the industry. Angela, we’ll start with you. Could you talk a little bit about the role of AI in AD? And as an addition, while it’s not AI, I’m hoping that you can talk a little bit about the role you see text to speech playing in the production of audio description as well.

ANGELA MCINTOSH: Sure. So I will say that within audio description, for all the reasons that I mentioned earlier, we’re really careful about implementing AI. When you’re listening to a story that comes from one of our renowned storytellers, we don’t want to hear a synthesized voice. We don’t want to hear something described that’s not meaningful to the plot or that’s just sort of arbitrary noise, right? And so while we’re interested in exploring this in a kind of experimental way, we’re not really thinking about meaningful AI implementation of script writing or voice synthesis for audio description at this time. And that’s just because we care deeply about the customer experience. And we want to make sure that the technology is really right before we roll it out. There are things that AI does incredibly well, and there are things that it still really stumbles on. So when we’re thinking about innovations that would impact the blind and low-vision communities, that would really meaningfully enhance their experience, most of our research around AI has really been about the content discovery part of the app. It’s really been about, how can we leverage the voice technology that exists today and something like ChatGPT to allow customers to really discover content on their own? To be able to say things like, it would be very cool if I was co-viewing with my son and my son could say, hi, it’s me and mom in the room. We’re looking for what we should watch tonight. We kind of want to watch a comedy. And Max should be able to say back to you, well, I’ve got these three movies with audio description. And so those are the kinds of things that I think we’re really interested in. As we started to develop text to speech technology on our apps, we really felt like a visual user is getting so many more pictures and so much more mood, and tone, and color in their experience. 
We sort of said, what if we completely scrapped the visual experience and thought about what the audible experience equivalent would be if you were only navigating through audio? And that’s a place where AI can really shine for us. So we’re spending a lot of time thinking about voice interfaces and thinking about how we can interject the music of a theme song into that voice interface so that you have the same sense of mood and tone when you’re navigating the Max app. So, yeah, we think AI has a lot, a lot of potential in spaces like that where it can give very smart answers, it can serve up the right music, it can serve up the right recommendations, and really enhance the experience for customers.

THOMAS REID: OK, right. Caroline, how are you thinking about AI and its place in the work you’re doing at Scribely?

CAROLINE DESROSIERS: Yeah. So, of course, we have this problem, as I mentioned earlier, of millions of images being added to the web daily. And just looking at the amount of formulaic Alt text and shortcuts being taken out there, clearly, content publishers are looking for more solutions in this area. We need help with describing all of these images. With the generative AI descriptions, I’ve spent a great deal of time looking at and testing different types of images and how it does with them. And we’re definitely seeing it improving. But there is a lot of variability and some concerns from an Alt text perspective. There’s this delicate balance that we try to find when we’re writing Alt text where we don’t want to be too basic or generic, but we also don’t want to be interpretive. And what I’m seeing happen a lot of the time is that it’s one or the other. We’re not getting enough descriptive detail or it’s just going in the direction of being wildly interpretive and drawing conclusions based on what it can see, which can be very misleading if it’s not picking up the right visual details. Because then, it’s drawing conclusions from false information that’s not actually there. So these hallucinations, as we refer to them, at these full confidence levels feel a bit reckless to me and definitely a threat to the trust and accuracy of information on the web, which is ultimately very important. So my perspective on AI is that despite this technology, we still have an accountability to the images that we publish. And the Alt text absolutely needs to be right for accessibility purposes. But there are opportunities, and it’s exciting. And we should be paying attention to how this technology is advancing, and collecting data, and doing research on how it’s improving over time. There could be opportunities to speed up the writing process. I definitely see that. Or incorporating context, which is always a challenge.
How do we describe this image in this context, which may free up more bandwidth for human writers to actually focus on more complex imagery or content that we would consider high stakes, right? That this is very, very important to our brand or our message. So we could work in collaboration through a hybrid approach where we’re using generative AI for some images that we can really trust it to describe and not using it for others. So definitely paying attention to all of the advancements here, and how we can incorporate this technology into our work at Scribely.

THOMAS REID: OK. So when it comes to AI and technology in general, it’s understandable that many people become concerned that the technology will eliminate the human involvement. Yet, I see some real potential in AI enabling more participation from the blind community. Caroline, I will start with you. Can you take some time to share your thoughts on how your company can include more people who are blind or have low vision in the creation process and what steps you’re taking to make it happen?

CAROLINE DESROSIERS: Yeah. So first of all, we have our internal image guidelines that we– that are kind of like our central focus, what we use to create the philosophy, the approach behind Scribely’s image description work. And we are always looking for folks who are interested in different subject areas or who are experts in those areas from the blind and low-vision community to provide feedback to us on how we’re doing at describing those specific types of images. So if you’re really passionate about education or you’re looking at comic book art or something and really care about that improving, we’d love to hear from you and focus on those specific subject areas. When it comes to AI, Scribely is definitely in the territory of collecting data and doing research at this point on how these generative AI descriptions compare to human generated descriptions or some sort of combination between generative AI and human descriptions. And we’re looking at, what is the output? What do we like about the descriptions from each one of those categories? So I’d really love to hear from anyone who is interested in getting involved in that research, either as a consultant or as a participant in a study. And please, reach out to Scribely to get involved with that.

THOMAS REID: Awesome. Angela, same question. How is Max including more people who are blind or have low vision in the process?

ANGELA MCINTOSH: That’s a great question. From our perspective, there really can never be enough blind involvement in any given project. I think when we start out thinking about new features or evaluating the feature sets that we have, we start with rounds of research where we reach out to the blind community, we ask a structured set of questions, and then we design around what the community needs. In addition to research, we also have internal data teams of employees that have various diverse ability sets that can test for us. So that’s another way. I think in the future, we would love to have more consulting and more input. I think we also listen on social media. So you might be in a group that you think is just talking to other people, and there’s a bunch of us listening to hear what you’d like. So in any kind of public forum, please feel free to speak up. And then I’ll just echo what Caroline said. We struggle to get enough of that voice into the work. We don’t get a lot of feedback through customer service. We don’t get a lot of help requests. So any time you want something– I’m telling you LinkedIn– feel free to connect with me on LinkedIn. DM me at any time. We would love more involvement. And we’re happy to take questions, feedback, suggestions. Because there’s sort of an old mantra in the community that says, nothing for us without us. And it’s something that’s really a central tenet of our team. So we’d love to have more blind involvement.

THOMAS REID: OK. Awesome, awesome. Well, I think we hit everything there. So I want to thank Angela and Caroline for a great informative conversation. And as I understand it, you both are scheduled for Q&A sessions, so I hope our audience will take advantage of that opportunity to ask you all questions. You might see me there with a couple of questions, too. [LAUGHS] So I just want to say big thanks to you all and big thanks to Sight Tech Global for the invitation to participate and spreading such good information to the public. And I think that’s a fantastic thing. So I’m Thomas Reid, and I hope you all enjoy the rest of your conference.

[MUSIC PLAYING]
