DESCRIPTION
Generative AI offers the promise of large-scale web accessibility, yet its automated image descriptions often fall short in accuracy, context, and equity. This session explores AI's dual role as a powerful but imperfect alt text author, examining its strengths and weaknesses. We will present solutions for building resilient workflows through Human-in-the-Loop (HITL) strategies, moving beyond simple error correction to cultivate a virtuous alt text cycle where expert human input informs adaptive, context-aware AI. Join us to critically evaluate the AI-generated description process, champion quality-focused alt text solutions, and understand how integrating AI into your workflow—rather than replacing it—is essential for truly effective alt text outcomes.
Speakers
- Moderator: Caroline Desrosiers, Founder & CEO, Scribely
- Erin Coleman, Chief Product Officer, Scribely
SESSION TRANSCRIPT
[MUSIC PLAYING]
VOICEOVER: The Virtuous Alt-Text Cycle: Engineering Context and Quality in AI-Generated Alt-Text. Speakers: Erin Coleman, Chief Product Officer, Scribely. Moderator: Caroline Desrosiers, Founder and CEO, Scribely.
CAROLINE DESROSIERS: Hi, everyone. And first, a big thank you to Sight Tech Global for hosting this conference. We are so glad to have this platform to share important advancements in assistive technologies. My name is Caroline Desrosiers, and I am one of your speakers for this session, which is called The Virtuous Alt-Text Cycle: Engineering Context and Quality in AI-Generated Alt-Text.
And I’d like to introduce my co-speaker, Erin Coleman. Erin is the chief product officer at Scribely, where she leads the vision for technology solutions that combine innovation and human expertise to make images accessible to everyone. And she’s held previous roles at Google, Irrational Labs, and The Outcast Agency.
And as I mentioned, I’m Caroline Desrosiers, the founder and CEO of Scribely. And we partner with large organizations to produce premium quality image descriptions at scale. We’re developing solutions to embed image accessibility right into the fabric of their content workflows and their culture. And now, I’ll hand it over to Erin to kick off our presentation.
ERIN COLEMAN: Hello, everyone. We're thrilled to be here to talk about a powerful new alt-text author, generative AI. This session is about building resilient workflows that support AI as a writing partner for drafting high-quality, meaningful alt-text. We'll talk about moving beyond simply editing AI alt-text for errors and instead cultivating what we call a Virtuous Alt-Text Cycle, a system where expert human input makes the AI smarter, more adaptive, and more context-aware.
So generative AI offers this incredible potential for accessibility, right? But as stewards of good image description, we at Scribely know that actually achieving high-quality image descriptions is not an easy task. Producing these contextual descriptions at scale is incredibly difficult. We need to evaluate this new AI alt-text author in the accessibility space so we can understand how to use this powerful technology responsibly, and how to build a process and a workflow that prioritizes high-quality image access for all users, moving far beyond just a better-than-nothing end result. We need to push for more.
As an alt-text author, AI's strengths are obvious. It has the scale, it has the speed, and it has the breadth of knowledge. But these same strengths are actually the reason today's AI models struggle to generate quality alt-text. By default, an AI describes an image in complete isolation from its context. In cutting that corner, it unfortunately misses the true purpose of the image, or the reason why it exists on that specific page. And when we browse the web, we don't just see images in a vacuum, right? We are influenced by the surrounding text and headlines, and the images are part of the experience of the page. The AI just doesn't get that. On its own, it doesn't factor in the different types of context that give meaning to the images we're experiencing. We'll dig into three different forms of context later in this presentation, but at a very basic level, we as humans know that we need to understand context to interpret the why behind the description.
CAROLINE DESROSIERS: Exactly. Out-of-the-box AI models aren't designed to review the surrounding page. Because they can't see that content, they can't determine the intent. So what do they do? They default to a simple, literal description. They perfectly describe the what, the literal facts, but they completely miss the why. An AI might be an excellent researcher. It can identify objects in an image, but it can't understand why that image is important in the context it's in. It also doesn't know your brand voice, industry language, or the emotional tone it should adopt. This is all critical missing context. This brings us to the core issue. It's not just about what is in an image, but why it's there. We can't expect AI to succeed if its default process ignores that why.
ERIN COLEMAN: It's crucial that we understand exactly how these large language models work. The way they are programmed to function makes them more predictive than perceptive. When prompted to generate an image description, the out-of-the-box AI gets one initial hit of context to work with: the objects it can detect within the image and their relation to one another. After that, its language model takes over to do the writing. And that predictive writing model is exactly where hallucinations come from.
Look at the example on the slide. The AI sees a person, a bathing suit, a hat, and a towel. Its statistical writing model fires up and predicts beach. And because it's a predictive writer, it might actually add sand to the description even if, as in this case, the person is clearly sitting on a white surface. It predicts the sand into existence. And here's the most critical part. The AI then interprets the purpose of the image based on the predictive description it just wrote, including the parts it completely made up. It's building its why on a what that might be factually wrong.
This brings us to one of the most serious flaws of AI as an alt-text author, its potential to mirror and even amplify societal biases. This often leads to inequitable or stereotypical descriptions where the AI misrepresents people, defaults to narrow assumptions, or uses harmful loaded adjectives. And as this slide says, AI models are trained on the internet, which has essentially become this massive repository of human bias. It’s learning from all of our content, absorbing all of those preexisting stereotypes and harmful patterns that we’ve published. And do we like that about the internet? No, but it’s there.
So if we ever want to get to a place where AI-generated alt text is high quality and we can confidently implement it at scale, we must solve for this. We, as the human alt text managers, must know when and how to assert ourselves in the process. We have to be the ones who provide the equity check and the real world context the AI lacks.
CAROLINE DESROSIERS: So what do we do about this? How do we respond to these critical issues of ensuring accuracy and equity in image description writing? We need to teach the AI the why of an image. Its descriptions have to be focused by the context surrounding the images, not just the pixels. And this is where context engineering comes in.
So at Scribely, when our human writers start an image description project, we have a very well-defined process. We first collect all of the contextual information that we think we might need to produce quality descriptions. This may include things like brand style guides, creative briefs, and the page content. We're using our human ability to discern what information is needed for this particular project, and the context we need changes quite a lot from project to project.
So when we think about AI at Scribely, the big question on our minds is, can we programmatically replicate this human-centric process of collecting the relevant contextual information for an AI? And then a follow-up question, would the AI improve if it had all of the contextual information it needed? Is that even possible?
The answer today is we’re getting there. We’re still in the early phases. Context engineering needs more testing, more development, and more experts, but it is showing real promise, and we encourage all of you to start testing this. When your context is precise, relevant, and aligned with the visual facts, it becomes the most effective tool for grounding the AI in reality. It’s how we capture purpose and deliver it to the author, whether the author is human or an AI.
And let's be clear. Even when we get a better result from the AI, we still need humans to review the output. The end state is not to remove our involvement completely, and we still need to capture that feedback and put it back into the loop.
ERIN COLEMAN: So we’ve established that context is everything in alt text, but this term context can feel super vague. So what exactly do we mean by context? There are many different forms of context, but we’re going to break down three distinct actionable types, which we have up here on the slide.
And the first one is technical context. So think of this as the what of the image or the hard facts about the image. The second is content context. So this is the where, as in where does this image live on the web, on a page or in a document? And third is intentional context. This is the why or the purpose of the image. So let’s dive into each one of these.
Let's start with technical context. This is the image data that already exists in your systems. The AI cannot guess this; it must be told. This basic, foundational context, as the slide says, is what stops the AI from hallucinating objects that aren't actually present in the image.
Second is content context. This is the page itself as well as all the unstructured content surrounding the image on the page, like the page title. You have to build a workflow to systematically feed the surrounding story to the AI. This is what prevents the AI from misinterpreting what the page is actually about.
And finally, and most importantly, there is the intentional context, the intention or purpose of the image. This is the human knowledge and expertise that exists inside the content manager's head, right? It's the reason why the team ultimately chose this specific image for this specific page, and this is the authorial intent that we mentioned previously. It's the hardest form of context to capture, but it's actually the most critical part of preventing those generic, sometimes bland descriptions that we often see from AI outputs.
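To make those three types a bit more concrete, here is a minimal Python sketch of how they might be captured as a single record per image; the class and field names are illustrative assumptions, not a description of Scribely's actual tooling.

from dataclasses import dataclass, field

@dataclass
class ImageContext:
    """The three actionable forms of context for one image (illustrative only)."""
    # Technical context: the hard facts that already exist in your systems (the "what").
    technical: dict = field(default_factory=dict)
    # Content context: the page and the content surrounding the image (the "where").
    content: dict = field(default_factory=dict)
    # Intentional context: why this image was chosen for this page, supplied by a human (the "why").
    intention: str = ""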
CAROLINE DESROSIERS: So let’s walk through a good context engineering practice. It all begins with the first step, which is gathering all of your context. The AI needs a deep understanding of every image it’s going to describe, so we found it’s crucial to gather contextual data in four key areas.
So the first is structured and relational data. These would be all of the facts about what’s in the image. For example, what are the products featured in the image? What is the color being shown in the image?
The second is the content and experience. This is the purpose of the page itself. So for example, an image existing on a homepage is going to be different than an image existing on a landing page or a product page or in social media, and the experience is also all of the information that lives in close proximity to that image. So think of the header, think of the marketing blurb, and all of the other images perhaps that appear in a carousel-like experience.
Third is function and type. This is the image's role on the page. Some of this could be established programmatically; we could actually look at whether the image is a link and where that link takes us. It could also be that we're identifying the image as a brand image based on where it appears and what page it appears on. We're determining its role and thinking about how this image furthers the message of the surrounding content.
And the last form of contextual data is brand and equity, and this is where it gets really tricky, because your brand and equity information likely lives in documents on servers, in campaign briefs, or in creative guidelines. We're often not connecting this information to image description, but the good news is that you might be able to apply this form of context at a higher level, across all images in a campaign or across an entire business unit. And the reason you can do that is because this context doesn't necessarily change from image to image the way other forms of context tend to do.
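As a rough illustration of this first step, the four kinds of contextual data might be pulled together into one record before anything is sent to a model. The function and its arguments below are hypothetical placeholders, not an actual Scribely workflow.

def gather_context(facts, page, role, brand):
    """Step one: collect the four kinds of contextual data for one image (illustrative only)."""
    return {
        "structured": facts,  # 1. Structured and relational data: products, colors, what is in the image
        "content": page,      # 2. Content and experience: page type, heading, nearby marketing blurb
        "function": role,     # 3. Function and type: is it a link, a brand image, part of a carousel?
        "brand": brand,       # 4. Brand and equity: campaign-level voice and equity guidance
    }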
ERIN COLEMAN: Step two is to feed all that context to the AI. This is the most direct way to control the AI alt text output, and it's generally done in two main ways. First, through well-formed prompting. This is where we take all that context Caroline just listed and build it directly into your prompting method. Second, a more advanced method is fine-tuning. This involves training the AI model on your own associated, labeled metadata, which teaches it specific rules and allows it to generate far more accurate, grounded information.
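Here is a minimal sketch of the first method, folding the gathered context directly into the prompt. The prompt wording and the model call are assumptions for illustration; any vision-capable model could stand in, and the draft still goes to a human reviewer.

def build_prompt(context):
    """Build a context-rich instruction from the gathered record (illustrative only)."""
    return (
        "Write alt text for the attached image.\n"
        f"Known facts, do not contradict them: {context['structured']}\n"
        f"Where the image appears: {context['content']}\n"
        f"The image's role on the page: {context['function']}\n"
        f"Brand voice and equity guidance: {context['brand']}\n"
        "Describe only what is visually present, in one or two sentences, "
        "and focus on why the image matters in this context."
    )

# Usage sketch, with a placeholder model client:
# draft = vision_model.generate(image=image_bytes, prompt=build_prompt(context))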
CAROLINE DESROSIERS: Okay, so let’s ground all of this information that we’ve been talking about in an example, and we’re going to use the same image across three different forms of context to illustrate how the description changes. So on the screen, we have an image of a person in a stylish white trench coat holding a skateboard, and this is an e-commerce brand that we’re talking about. The trench coat is actually a product that this brand sells.
So now that we’ve established what’s in the image, the trench coat, we need to ask where does this image live? So this would be the content context that we talked about earlier, and in this case, we are on the homepage. This is the brand’s front door. It’s their first impression for customers, and the content surrounding this image includes a marketing blurb which we also have up here on the slide. So this marketing blurb relates to this image and informs our experience as an audience.
So let’s think about what is the intent. The goal here isn’t to sell this specific coat on the homepage. The goal is actually to showcase the overall brand feel and inspire you to click deeper into this collection, and the marketing blurb next to the image says things like “Look and feel confident. Go places in style. Browse our collections.” This image is doing that exact same job as that marketing blurb. It’s communicating a vibe that’s cool, confident, and versatile.
ERIN COLEMAN: Exactly. That context directly dictates the focus of the alt text. A generic out-of-the-box AI might just say "A person in a white coat holding a skateboard." That's the literal what. It completely misses the why, because we know the intent is to convey the brand style. A high-quality description must capture the feeling of the image, the confident person, the stylish monochromatic white outfit, the modern breezy feel. The alt text's job here is to communicate the brand, not just the product. Now keep this homepage context in mind, all that inspiration. Later, we'll use this exact same image in a different scenario.
CAROLINE DESROSIERS: So collecting all of this context for every single image can feel like a massive manual bottleneck, right? It feels like it might be impossible to scale, but the secret is that most of this data is already there. You are already capturing most forms of relevant context somewhere in your process. It's just living in different systems, a bit scattered right now, and the solution is to optimize this process by tapping into the tools you already have. If we organize this data, if we put in the work just through organization, we can help the AI function as a more focused and well-informed descriptive writer rather than an inexperienced, interpretive one that's just doing its best.
ERIN COLEMAN: Yes, truly effective AI alt text is about giving both humans and AI the right information. This brings us to knowing your tools and, as the slide shows, knowing where your context lives. For example, your CMS, or content management system, provides your most critical on-page context, the page title, H1, captions, et cetera, and it prevents the AI from missing the image's intent.
Similarly, your PIM, or product information management system, provides your structured, non-negotiable facts: the official product name, SKU, brand-approved colors, et cetera. And it prevents hallucinations.
And finally, your DAM, or digital asset management system, provides the image's canonical metadata, its single source of truth, like a campaign name or usage rights, and it helps keep the AI alt text from becoming generic, one-size-fits-all descriptions.
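As a sketch of where that context might come from in practice, each system answers a different question. The client objects and field names below are hypothetical placeholders, not real product APIs.

def context_from_your_tools(image_id, cms, pim, dam):
    """Pull context from the systems you already have (all field names are assumptions)."""
    page = cms.get_page_for_image(image_id)        # CMS: on-page context, so the AI doesn't miss intent
    product = pim.get_product_for_image(image_id)  # PIM: non-negotiable facts, so it doesn't hallucinate
    asset = dam.get_asset(image_id)                # DAM: canonical metadata, so output isn't one-size-fits-all
    return {
        "content": {"title": page["title"], "h1": page["h1"], "caption": page.get("caption")},
        "structured": {"name": product["name"], "sku": product["sku"], "colors": product["colors"]},
        "asset": {"campaign": asset["campaign"], "usage_rights": asset["usage_rights"]},
    }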
CAROLINE DESROSIERS: Okay. Now, let’s take that exact same image from earlier and drop it into a completely different context, and this time, that image appears on a product listing page for outerwear. So the image appears in a grid right next to other outerwear products, including a tan coat and a lime green jacket. The user’s intent here has completely changed. They are no longer just browsing for vibe. They are actively shopping. They are scanning, comparing, and trying to decide which item to investigate further.
So that old brand feel description that we created earlier for the homepage is suddenly not helpful in this scenario. It doesn't help the user differentiate or, just as importantly, visualize the difference between the outerwear products on this page. So when a user is on this page, we can imagine them asking themselves two key questions. Number one, "How does this jacket differ from the tan one or the green one?" Number two, "How would I wear this? What does it go with that I already own?" So a high-quality description here needs to serve both of those needs, comparison and imagination.
ERIN COLEMAN: With that in mind, the new alt text for this context would be, “An ankle length white cotton trench coat that is open at the front and has a breezy flowing fit. It is paired with a white fitted blouse with an angular hem and white wide-leg trousers.”
So let’s break down why that alt text description you just heard works in this context. That first sentence, “An ankle length white cotton trench coat.” This is for comparison. It immediately gives the user the core facts to tell it apart from the tan slouchy jacket or the lime green cropped blazer.
And the second sentence, which reads, “It is paired with a white fitted blouse with an angular hem and white wide-leg trousers.” That is for imagination. It helps the user see the full outfit and visualize it in their own wardrobe. They might be thinking, “Oh, I already have white wide-leg trousers. This trench coat would be the perfect addition.”
So this is a great example of how the exact same image requires completely different alt text. The context changed the role of the image from inspire to compare and style.
CAROLINE DESROSIERS: Okay, so now let’s pull all of these pieces together. This is how we build a resilient alt text workflow that takes the concept of human in the loop, as we’ve heard so many times, from something that’s corrective to something that’s influential.
To get high-quality AI alt text and to prevent the AI from struggling with context, inaccuracies, or harmful bias, we have to build a workflow that ensures quality at every step. As you can see on the slide, it looks like this:
First, ingest and contextualize. The human provides the image and all those relevant context types.
Second, first draft. The AI takes that input and generates the first pass.
Third, review and refine. This is the critical human step. The expert evaluates that draft, editing for quality, intent, and equity.
Fourth, approval and publication. The human-approved, high-quality alt text is published.
And finally, the feedback loop. The human edited version is fed back into the system. This is what creates the virtuous cycle. That data is now used to fine-tune your custom AI models, improve your prompt engineering, and build a powerful QA database.
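Put together, one pass through that cycle might look like the sketch below. Every helper here is a placeholder for your own tooling, not a reference implementation, and the review step is deliberately a person, not another model.

def virtuous_alt_text_cycle(image, context, reviewer, feedback_store):
    """One pass through the five-step workflow (all helpers are hypothetical placeholders)."""
    # 1. Ingest and contextualize: the human provides the image and all relevant context.
    prompt = build_prompt(context)
    # 2. First draft: the AI generates a first pass.
    draft = generate_draft(image, prompt)
    # 3. Review and refine: the expert edits for quality, intent, and equity.
    final = reviewer.review(image=image, context=context, draft=draft)
    # 4. Approval and publication: only the human-approved alt text ships.
    publish_alt_text(image, final)
    # 5. Feedback loop: the edited version feeds fine-tuning, prompt improvements, and a QA database.
    feedback_store.record(draft=draft, final=final, context=context)
    return final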
ERIN COLEMAN: And this workflow also solves a critical issue of ownership, which is a huge problem that organizations need to address with images. The model is influenced by human input in the first place. And when the workflow routes the image to the right person, the content expert, it empowers them with a higher-quality AI-generated draft that they can then adapt and finalize for publication.
CAROLINE DESROSIERS: Okay, so one final example, and we’re going to use that exact same image from earlier, but this time, it’s on the product detail page. So we’re past the brand vibes, we’re past the product detail comparisons, and the add to cart button is right there on the page. At this stage, the user’s intent is to inspect the images. They want specific detailed answers before they make that decision to buy the product.
And the text on the page often doesn’t help. It’s usually abstract marketing copy like “high fashion drama.” The images are doing the real work of showing the product from every angle. This is crucial not just to make the sale, but to prevent returns.
And this is precisely where we're seeing most brands fail. A Scribely study two years ago found that 98% of the top e-commerce product pages had missing or completely useless alt text. The most common error, found on 91% of pages, was auto-generated, formulaic text that was non-descriptive, something like "Product Title, Image 1 of 2." That is completely useless to a shopper and doesn't provide any specific details to inspect. And a few years later, we're still in the same place.
ERIN COLEMAN: So for our image in this context, the alt text must become the zoom-in. It needs to be hyper-specific. Instead of just trench coat, we need to discuss the semi-sheer cotton fabric, the blouse and sleeves that are loose on top and fitted at the wrist, the unstructured lapels, the high cowl neckline of the blouse, and the deep, flowing pleats of the trousers. This is the only way to give all your customers the facts and confidence they need to click Add to Cart.
CAROLINE DESROSIERS: So as we conclude, our core message is this. Generative AI is an incredibly powerful tool, but it is not a magic wand. Without human input to provide context, to ensure accuracy, and to champion equity, it will fail to scale. True innovation in accessibility isn’t just about automation. It’s about designing intelligent workflows that integrate AI as a partner. This means you cannot just set it and forget it.
So we must assume the role of context engineer and editor if we want to use AI. It is a powerful assistant, but you are the ultimate author of these descriptions. You are the one hitting Publish. So human input and oversight are completely non-negotiable.
And that’s about all the time we have for today. So thank you all so much for listening to this presentation. Our contact information is up on the screen. We genuinely mean it when we say we’d love to continue the conversation, so please don’t hesitate to reach out on LinkedIn or at the emails you see here. It’s been a real pleasure.
ERIN COLEMAN: Great. And those emails for everyone are erin, E-R-I-N, @scribely.com. That’s S-C-R-I-B-E-L-Y dot com, or caroline@scribely.com, C-A-R-O-L-I-N-E @scribely.com. And thank you to Vista Center, the producer of the Sight Tech Global Conference, and back to you.
[MUSIC PLAYING]
