[MUSIC PLAYING]

VOICEOVER: Holding AI Accountable: Benchmarking Accessibility With AIMAC. Speakers: Joe Devon, Chair and Founder, GAAD Foundation and A11Y Audits. Eamon McErlean, VP and Global Head of Digital Accessibility and Globalization, ServiceNow. Moderator: Karae Lisle, CEO, Vista Center.

KARAE LISLE: Well, hello, everyone. My name is Karae Lisle, and I'm the CEO of Vista Center. We're the executive producer of Sight Tech Global 2025. Our speakers today are going to talk about AIMAC, the AI Model Accessibility Checker, a wonderful benchmark for the accessibility of AI coding models. This one's going to get technical, which I know a lot of our audience is really excited about. So I'm going to turn the mic over to Joe Devon and let him begin our session. Thanks, Joe, for being here.

JOE DEVON: Thank you, Karae, and everybody at the Sight Tech Global team. We really appreciate this opportunity to share our benchmark with you. My name is Joe Devon, and I am co-founder of Global Accessibility Awareness Day and chair of the GAAD Foundation. With me is my frequent collaborator, Eamon McErlean. He's VP and Global Head of Digital Accessibility and Globalization for ServiceNow. Eamon, would you like to introduce yourself a bit more?

EAMON MCERLEAN: Sure. Thanks, Joe. As Joe mentioned, I lead digital accessibility and globalization at ServiceNow. I've been working in the accessibility space for over 15 years now, and I'm really passionate about making sure that as we build new technologies, particularly AI technologies, we're building them with accessibility in mind from the start. And that's really what AIMAC is all about.

JOE DEVON: Great. So let me start by giving everyone some context. We're in the middle of this incredible AI revolution, right? AI is being integrated into everything: our phones, our cars, our homes, our workplaces. And one of the most exciting applications of AI is in software development. We now have AI models that can write code and help developers build applications faster and more efficiently. But here's the problem: if these AI models are writing code, and that code isn't accessible, then we're essentially automating the creation of inaccessible software. We're taking accessibility problems and scaling them exponentially. And that's a huge concern for the disability community and for anyone who cares about digital inclusion. So Eamon and I, along with some other colleagues, decided we needed to do something about this. We needed a way to measure whether AI coding models are generating accessible code or not. And that's how AIMAC was born: the AI Model Accessibility Checker.

KARAE LISLE: That's such an important point about scaling accessibility problems. Can you explain a bit more about what AIMAC actually is and how it works?

JOE DEVON: Sure. AIMAC is a benchmark, essentially a test suite, that measures how well AI coding models perform when it comes to generating accessible code. We created a set of prompts that ask AI models to generate common web components: forms, navigation menus, modal dialogs, data tables, that sort of thing. Then we evaluate the code the AI generates to see whether it follows accessibility best practices. We look at things like: Does the code use proper semantic HTML? Does it have appropriate ARIA labels? Is it keyboard navigable? Does it have sufficient color contrast? Can it be used with a screen reader? All the things that make a website or application accessible to people with disabilities.
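To make the first of those checks concrete, here is a hypothetical before-and-after of the form-labeling pattern Joe describes. These snippets are illustrative sketches, not actual AIMAC test cases or captured model output.

```typescript
// Hypothetical before-and-after, not an actual AIMAC test case.

// BEFORE: typical AI-generated output. Placeholder text stands in for a
// label (screen readers don't reliably announce it, and it disappears as
// soon as the user types), and the "button" is a click-only div that the
// keyboard can't reach.
const inaccessibleForm = `
  <input type="text" placeholder="Email address">
  <div class="btn" onclick="submitForm()">Submit</div>
`;

// AFTER: a real <label> is programmatically associated with the input via
// for/id, and a native <button> is focusable and keyboard-operable for free.
const accessibleForm = `
  <label for="email">Email address</label>
  <input id="email" type="email" name="email" autocomplete="email">
  <button type="submit">Submit</button>
`;
```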
And what we found is that most AI models don't do very well on this benchmark. They generate code that looks fine visually, but it's often completely unusable for people who rely on assistive technologies.

EAMON MCERLEAN: Yeah, and I think it's important to understand why this is happening. AI models are trained on vast amounts of existing code from the internet. And unfortunately, the vast majority of code on the internet is not accessible. So the AI models are essentially learning bad habits. They're learning from examples of inaccessible code, and then they're reproducing those same accessibility issues in the code they generate. It's like learning to cook by only watching people who make bad food: you'd probably learn to make bad food too. The AI models are only as good as the data they're trained on.

KARAE LISLE: That makes sense. So what did you find when you tested these AI models? What were the results?

JOE DEVON: Well, the results were concerning but not entirely surprising. We tested several of the major AI coding models, things like GitHub Copilot, OpenAI's Codex, and others. And across the board, they struggled with accessibility. For example, when we asked the models to generate a form, many of them would create input fields without proper labels. Or they would create buttons that weren't keyboard accessible. Or they would use placeholder text instead of actual labels, which is a common accessibility mistake. When we asked them to create a modal dialog, most of them failed to implement proper focus management: when the modal opens, focus doesn't move to the modal, and when it closes, focus doesn't return to where it was. This makes the modal completely unusable for keyboard and screen reader users. And when we asked them to create data tables, many of them failed to include proper table headers or to associate data cells with their headers, which makes the tables very difficult or impossible for screen reader users to understand.

EAMON MCERLEAN: And I think what's particularly concerning is that these are not edge cases. These are common, everyday components that developers build all the time. Forms, modals, tables: these are fundamental building blocks of web applications. And if AI models can't generate accessible versions of these basic components, then we have a serious problem. Because increasingly, developers are relying on AI to help them write code. And if the AI is generating inaccessible code, and the developer doesn't know enough about accessibility to recognize the problems and fix them, then we end up with inaccessible applications. And that means people with disabilities can't use them.

KARAE LISLE: So what's the solution? How do we fix this?

JOE DEVON: Well, I think there are a few things that need to happen. First, AI companies need to prioritize accessibility in their training data and their models. They need to make sure they're training on examples of good, accessible code, not just any code they can find on the internet. Second, they need to fine-tune their models specifically for accessibility. Just like they fine-tune models for security or performance, they need to fine-tune for accessibility. And benchmarks like AIMAC can help with this; they provide a way to measure progress and identify areas that need improvement. Third, we need better tooling and guidance for developers. Even if AI models generate inaccessible code, developers need to be able to recognize the problems and fix them. So we need better accessibility linting tools, better testing tools, and better educational resources.
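For reference, the modal focus-management behavior Joe describes looks roughly like the following. This is a minimal sketch with illustrative names, not AIMAC code; a production dialog would also trap Tab inside the modal, close on Escape, and set aria-modal, as recommended by the WAI-ARIA Authoring Practices Guide.

```typescript
// Minimal sketch of the expected focus-management pattern, with
// illustrative names (not AIMAC code). Assumes the modal element has
// role="dialog", an accessible name, and tabindex="-1" as a fallback
// focus target.
let previouslyFocused: HTMLElement | null = null;

function openModal(modal: HTMLElement): void {
  // Remember where focus was so it can be restored on close.
  previouslyFocused = document.activeElement as HTMLElement | null;
  modal.removeAttribute("hidden");

  // Move focus into the dialog so keyboard and screen reader users
  // land on it immediately.
  const firstFocusable = modal.querySelector<HTMLElement>(
    'button, [href], input, select, textarea, [tabindex]:not([tabindex="-1"])'
  );
  (firstFocusable ?? modal).focus();
}

function closeModal(modal: HTMLElement): void {
  modal.setAttribute("hidden", "");
  // Return focus to the element the user was on before the modal opened.
  previouslyFocused?.focus();
  previouslyFocused = null;
}
```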
EAMON MCERLEAN: And I'd add to that: we need to change the culture around AI development. Right now, when companies build AI models, they focus on metrics like speed, accuracy, and efficiency. But accessibility needs to be part of that equation too. It needs to be a quality metric that's tracked and reported just like any other metric. At ServiceNow, we've been working on this. We've integrated accessibility checks into our AI development pipeline. When our AI models generate code, we automatically run accessibility tests on it. And if the code fails those tests, we flag it and work to improve it. It's not perfect yet, but it's a start.

KARAE LISLE: That's great to hear. Can you talk a bit more about how organizations can use AIMAC? Is it publicly available?

JOE DEVON: Yes, AIMAC is open source and publicly available. Anyone can use it. We've published it on GitHub, and we've documented the methodology so that people can understand how it works and even contribute to improving it. We designed it to be useful for several different audiences. For AI companies, it's a way to evaluate and improve their models. For enterprises that are deploying AI coding tools, it's a way to assess which tools generate more accessible code. For researchers, it's a benchmark they can use to study AI and accessibility. And for the disability community and accessibility advocates, it's a way to hold AI companies accountable.

EAMON MCERLEAN: And I think that accountability piece is really important. Because right now, there's not a lot of transparency around how well AI models perform on accessibility. Companies will talk about their models' performance on general coding benchmarks, but they rarely talk about accessibility specifically. AIMAC provides a standardized way to measure and compare accessibility performance across different models. And our hope is that by publishing these results, we can create some competitive pressure. If one AI model scores significantly better on AIMAC than another, that should matter to customers and users. It should be a factor in their decision-making. Just like you might choose one product over another because it's faster or more accurate, you should be able to choose based on which one generates more accessible code.

KARAE LISLE: That makes a lot of sense. What kind of response have you gotten from the AI community? Are companies engaging with this?

JOE DEVON: It's been mixed. Some companies have been very receptive. They understand the problem, and they're actively working to improve. We've had conversations with several major AI companies about how they can use AIMAC to evaluate and improve their models. And some of them have already started making changes based on the results. But other companies have been less responsive. I think some of them see accessibility as a lower priority compared to other features or capabilities. And that's concerning, because it means accessibility issues are going to persist unless there's more pressure to address them. That's why we think it's so important to have benchmarks like AIMAC that are public and transparent. It creates visibility. It makes it harder for companies to ignore accessibility issues.
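As a rough illustration of the kind of automated gate Eamon describes, here is one common way such a check is wired into a pipeline, using the open-source axe-core engine driven through Playwright. The URL, test name, and pass/fail policy are hypothetical; this is not ServiceNow's pipeline or AIMAC's actual implementation.

```typescript
// One common way to gate a build on automated accessibility checks:
// axe-core driven through Playwright. The page URL and pass/fail policy
// below are hypothetical; real pipelines vary.
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("AI-generated page has no detectable WCAG A/AA violations", async ({ page }) => {
  // Load the page containing the generated component under test.
  await page.goto("http://localhost:3000/generated-form"); // hypothetical URL

  // Run axe-core against the rendered DOM, scoped to WCAG 2.x A and AA rules.
  const results = await new AxeBuilder({ page })
    .withTags(["wcag2a", "wcag2aa"])
    .analyze();

  // Fail the build if any violations are detected.
  expect(results.violations).toEqual([]);
});
```

Run in CI, a test like this fails the build whenever generated markup regresses on machine-detectable criteria. As Joe notes below, automated checks catch only a fraction of accessibility issues, so a gate like this complements rather than replaces expert review.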
EAMON MCERLEAN: And I think we're starting to see the tide turn a bit. There's more awareness now about AI ethics and responsible AI, and accessibility is a key part of that. You can't claim to be building responsible AI if your AI is generating code that excludes people with disabilities. So I'm optimistic that as the conversation around responsible AI continues to grow, accessibility will become more of a priority. But we need to keep pushing. We need to keep measuring. We need to keep holding companies accountable.

KARAE LISLE: Let's talk about the technical details a bit. For the developers and technical folks in the audience, can you explain more about how AIMAC actually evaluates the code?

JOE DEVON: Sure. AIMAC uses a combination of automated testing and expert review. The automated testing checks for common accessibility issues that can be detected programmatically: missing alt text on images, missing labels on form fields, insufficient color contrast, that sort of thing. We use existing accessibility testing libraries and tools for this. But automated testing can only catch about 30 to 40 percent of accessibility issues. The rest require human judgment. So we also have accessibility experts manually review the generated code and test it with assistive technologies like screen readers. For each test case in AIMAC, we provide a detailed rubric that explains what makes the code accessible or inaccessible. This includes the specific WCAG success criteria that apply, best practices from the WAI-ARIA Authoring Practices Guide, and practical considerations for assistive technology users.

EAMON MCERLEAN: And I think what's valuable about this approach is that it's not just pass/fail. We're not just saying "this code is accessible" or "this code is not accessible." We're providing detailed feedback on what's working and what's not. We're identifying specific issues and suggesting how they could be fixed. This is really useful for AI companies because it gives them actionable information they can use to improve their models. It's not just a score; it's a roadmap for improvement.

KARAE LISLE: That's really helpful. What are some of the most common issues you've seen?

JOE DEVON: I'd say the most common issue is missing or improper labels. AI models often generate form fields without labels, or they use placeholder text as a substitute for labels, which doesn't work with screen readers. They also frequently forget to add labels to icon buttons, so screen reader users don't know what the buttons do. Another common issue is poor keyboard navigation. AI models often create interactive elements that can't be reached with the keyboard, or they create keyboard traps where users get stuck and can't navigate away. This is a huge problem for people who can't use a mouse. Focus management is another big issue, especially with dynamic content like modals and dropdowns. The models often don't properly manage where focus goes when something opens or closes, which makes these components very difficult to use with a keyboard or screen reader. And then there are color contrast issues. AI models often generate designs with insufficient contrast between text and background, which makes the text hard to read for people with low vision or color blindness.

EAMON MCERLEAN: I'd also add that AI models struggle with semantic HTML. They often use div and span elements for everything instead of proper semantic elements like button, nav, and header. That makes the page structure unclear for screen reader users and harder to navigate. And they struggle with complex widgets: things like date pickers, autocomplete fields, and tree views. These require sophisticated ARIA implementation, and AI models rarely get it right. They either implement ARIA incorrectly or they don't implement it at all.
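Here is a hypothetical before-and-after for the "div soup" and unlabeled icon-button issues Eamon describes; again, these are illustrative sketches rather than AIMAC test cases.

```typescript
// Hypothetical before-and-after, not an actual AIMAC test case.

// BEFORE: "div soup." Generic elements expose no role, name, or keyboard
// behavior to assistive technology, and the icon button has no accessible
// name at all.
const divSoup = `
  <div class="nav">
    <div class="nav-item" onclick="go('/home')">Home</div>
  </div>
  <div class="icon-btn" onclick="search()">🔍</div>
`;

// AFTER: native landmarks and controls expose role, name, and keyboard
// behavior without extra ARIA; aria-label gives the icon button a name
// that screen readers announce.
const semanticMarkup = `
  <nav aria-label="Main">
    <a href="/home">Home</a>
  </nav>
  <button type="button" aria-label="Search">🔍</button>
`;
```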
KARAE LISLE: So looking ahead, where do you see this going? What's next for AIMAC?

JOE DEVON: We're working on expanding AIMAC to cover more types of components and more complex scenarios. Right now, we focus on common web components, but we want to expand to mobile applications, desktop applications, and other platforms. We're also working on making AIMAC easier to integrate into development workflows. We want companies to be able to run AIMAC tests automatically as part of their CI/CD pipeline, just like they run unit tests or security scans. And we're working on building a community around AIMAC. We want to get more people involved in contributing test cases, reviewing results, and helping to improve the benchmark.

EAMON MCERLEAN: And I think, longer term, the goal is to make AIMAC not just a benchmark but a tool that actually helps AI models improve. We want to use the results from AIMAC to create training data that AI models can learn from. We want to create a feedback loop where the benchmark not only measures accessibility but actively improves it. We're also exploring the idea of certification. Could we create an AIMAC certification for AI models that meet a certain threshold of accessibility performance? That could be valuable for enterprises that want to make sure they're deploying AI tools that generate accessible code.

KARAE LISLE: Those are exciting directions. As we wrap up, what message do you want to leave the audience with?

JOE DEVON: I think the message is this: AI is incredibly powerful, and it has the potential to make software development faster and more efficient. But if we're not careful, it also has the potential to make software less accessible. And that's unacceptable. We need to be proactive about building accessibility into AI from the start. We can't just assume that AI models will magically learn to generate accessible code. We have to teach them. We have to measure their performance. We have to hold companies accountable. And that's what AIMAC is about. It's a tool for measurement and accountability. It's a way to shine a light on the accessibility performance of AI models and to drive improvement.

EAMON MCERLEAN: And I'd add: this is everyone's responsibility. It's not just up to AI companies or accessibility experts. If you're a developer using AI coding tools, you need to understand accessibility and be able to recognize when the AI generates inaccessible code. If you're a business leader deciding which AI tools to deploy, you need to ask about accessibility performance. If you're a user or an advocate, you need to speak up and demand accessible AI. We all have a role to play in making sure that as AI becomes more prevalent, it doesn't leave people with disabilities behind.

KARAE LISLE: Well, thank you both so much for this important work and for sharing it with us today. This has been a really illuminating conversation. I encourage everyone in the audience to check out AIMAC on GitHub and to think about how you can help ensure that AI is accessible. Thank you, Joe and Eamon.

JOE DEVON: Thank you, Karae.

EAMON MCERLEAN: Thanks so much.

[MUSIC PLAYING]