[MUSIC PLAYING]

VOICEOVER: AI Meets Human Insight: The Future of Visual Interpretation With Aira and Google DeepMind. Speakers: Geoffrey Peddle, CTO, Aira; Everette Bacon, Chief of Blindness Initiatives, Aira; Alistair Muldal, Senior Staff Research Scientist, Google DeepMind; Greg Wayne, Director in Research, Google DeepMind. Moderator: Troy Otillio, CEO, Aira.

TROY OTILLIO: Hello. I'm Troy Otillio, CEO of Aira. And I wanted to begin by recognizing Sight Tech Global for its leadership in exploring how rapid advances in technology, especially AI, can drive future barrier-free access for people who are blind or have low vision. And special thanks to Vista Center for the Blind for producing this outstanding event and continuing to champion inclusion through innovation. So, at Aira, we believe that access is a human right. And for nearly a decade, we've connected people who are blind or have low vision with professional human interpreters, trusted partners who turn visual information into independence. Today, that partnership between humans and technology is evolving once again through the power of artificial intelligence. This session, "Human plus AI: Building Trusted Intelligence for Access," explores how people and AI together are reshaping a more inclusive and accessible world. Joining me today are four incredible thought leaders: Geoffrey Peddle, Aira's CTO and the architect behind our technical vision; Everette Bacon, our chief of blindness initiatives and a powerful voice for lived experience and user-centered design; Greg Wayne, director in research at Google DeepMind, who has helped lead breakthrough research in artificial intelligence; and Alistair Muldal, senior staff research scientist at Google DeepMind, whose work on AI agents has helped define what's possible in intelligent systems.
Together, we'll explore Project Astra, a collaboration between Aira and Google DeepMind that combines AI with human oversight to create more reliable, more responsive, and more respectful access to visual information. So let's start at the beginning. Geoffrey, can you set the stage for us? What is Aira, and how did this partnership with Google DeepMind come to be?

GEOFFREY PEDDLE: Sure. So Aira is a visual interpreting service. We connect people who are blind or have low vision with trained professional agents who provide visual information through a smartphone app or smart glasses. When someone needs assistance, they open the app, and within seconds, they're connected to one of our agents who can see what their camera sees and provide real-time visual interpretation. We've been doing this for nearly a decade, and we've helped people with everything from reading mail to navigating airports to grocery shopping to identifying what's in their refrigerator. Now, the partnership with Google DeepMind really started in a very organic way. I actually knocked on my neighbor's door here in Mountain View, and it turned out he was a research scientist at Google DeepMind. And we started talking about what Aira does and what DeepMind does, and we realized there was this incredible opportunity to explore how AI could enhance the visual interpreting experience. Not replace the human element, but enhance it. And that conversation turned into a collaboration that eventually became Project Astra.

TROY OTILLIO: That's amazing. So literally knocking on a neighbor's door led to this incredible partnership. Greg, from your perspective at Google DeepMind, what made this collaboration interesting? What drew you to working with Aira?

GREG WAYNE: Well, I think there are a few things. First, Aira had this really clear use case and this really clear user base that we could work with.
A lot of times when you're doing AI research, you're working in a lab, you're working with synthetic datasets, you're trying to imagine how people might use your technology. But with Aira, we had this opportunity to work with real users, with real needs, in real-world scenarios. And that was incredibly valuable for us. The second thing was that Aira had this human-in-the-loop model that we found really fascinating. Because I think one of the challenges with AI, especially as these systems get more capable, is how do you ensure that they're reliable? How do you ensure that they're trustworthy? How do you ensure that when they make mistakes, there's a way to catch those mistakes and correct them? And Aira's model of having human interpreters who could work alongside the AI, who could verify its outputs, who could step in when needed, that provided a really interesting framework for thinking about how to deploy AI safely and responsibly.

TROY OTILLIO: That's a great point. Everette, you've been a user of assistive technology for many years. From your perspective, what was exciting about the possibility of bringing AI into the visual interpreting experience?

EVERETTE BACON: Well, I think the key word there is "possibility," right? Because I've been blind since I was six years old, and I've seen a lot of technologies come and go over the years. And I think what was exciting about this was not just the technology itself, but the approach. The fact that we were thinking about AI not as a replacement for human interpreters, but as a way to enhance and augment what they can do. Because here's the thing: human interpreters are incredible. They understand context. They understand nuance. They can make judgments about what information is relevant and what's not. They can adapt to different situations and different user needs. But they're also limited by their own human constraints, right? They can only process so much information at once.
They can only be in one place at one time. And so the question was, what if we could take the best of what AI has to offer—the speed, the scale, the ability to process vast amounts of visual information instantly—and combine that with the best of what human interpreters bring—the judgment, the empathy, the understanding of context? And that's what Project Astra is about. It's about creating this partnership between human and AI where each one makes the other better.

TROY OTILLIO: That's beautifully said. Alistair, can you talk a bit about the technology behind Project Astra? What makes it different from other AI vision systems that are out there?

ALISTAIR MULDAL: Sure. So Project Astra is built on what we call a multimodal AI agent. And what that means is that it's not just a system that looks at images and describes them. It's a system that can actually interact with the user, that can understand the user's goals, that can ask clarifying questions, that can take actions in the world to help the user accomplish what they're trying to do. So for example, if you ask it, "What's in front of me?" it's not just going to give you a generic description of the scene. It's going to think about why you're asking that question. Are you trying to navigate? Are you trying to find something specific? Are you just trying to get oriented in your environment? And based on that understanding, it's going to give you information that's actually useful for accomplishing your goal. The other thing that makes Astra different is that it's designed from the ground up to work in real time. So it's not just analyzing a single image. It's continuously processing the video stream from your camera. It's building up a model of the environment over time. It's tracking objects as they move. It's understanding spatial relationships. And it's doing all of this fast enough that it can provide information in the moment when you need it, not several seconds later.
And then the third thing, which I think is really important, is that Astra is designed to be proactive. So it's not just waiting for you to ask questions. It's actually paying attention to what's happening in the environment and alerting you to things that might be relevant. So if there's a car coming, it can warn you. If there's a sign that you might want to read, it can point it out. If you're looking for something and it appears in the frame, it can let you know.

TROY OTILLIO: That's fascinating. And I think that proactive element is really key because one of the challenges with traditional assistive technology is that it's very reactive. You have to know what to ask for. You have to know what questions to ask. Whereas with a proactive system, it can help you even when you don't know what you don't know. Geoffrey, how does this work in practice? How does the AI interact with the human interpreters?

GEOFFREY PEDDLE: So the way we've designed it is that the AI is essentially acting as an assistant to the human interpreter. When a user connects to an Aira call, the AI starts analyzing the video stream immediately. It's identifying objects, it's reading text, it's understanding the spatial layout of the environment. And all of that information is being fed to the human interpreter in real time. So the interpreter might see on their screen, "There's a door 10 feet ahead. There's a sign on the right that says 'Exit.' There's a person approaching from the left." And this gives the interpreter a huge advantage because they can process all of that information much faster than if they had to describe everything from scratch. They can focus on the things that are most relevant to what the user is trying to accomplish. But here's the key: the interpreter is always in control. They can choose to relay the AI's information directly to the user, or they can add context, or they can correct it if it's wrong, or they can ignore it entirely and rely on their own judgment.
So the AI is providing support, but the human is still making the decisions about what information is most important and how to communicate it.

TROY OTILLIO: And that's such an important point because I think one of the concerns people have about AI is this idea of, "Is it going to replace human jobs? Is it going to eliminate the need for human expertise?" And what we're seeing with Project Astra is that it's not about replacement. It's about augmentation. It's about making human interpreters more effective, more efficient, more capable. Everette, from a user perspective, what does this mean? How does it change the experience?

EVERETTE BACON: So I think there are a few ways it changes the experience. First, it makes the service faster. Because the AI can process visual information instantly, the interpreter doesn't have to spend time describing basic things. They can get right to the information that matters. So if I'm trying to find a specific product in a grocery store, instead of the interpreter having to describe every single item on the shelf, the AI can quickly scan and identify what I'm looking for. Second, it makes the service more proactive. Like Alistair mentioned, the AI can alert me to things that I might not even know to ask about. So if I'm walking down the street and there's a pothole ahead, the AI can warn me before I ask. Or if I'm in a store and there's a sale sign, the AI can point it out. And third, it makes the service more personalized. Because the AI can learn from my interactions over time. It can understand what kinds of information I typically want, how I like information to be presented, what my common tasks are. And it can use that to provide more relevant and useful information. But I want to emphasize something that Geoffrey said: the human is still in control. And that's critical. Because at the end of the day, I'm not just looking for information. I'm looking for understanding. I'm looking for context. I'm looking for judgment.
And those are things that, at least for now, humans are still better at providing than AI.

TROY OTILLIO: That's a great point. And I think it highlights something really important, which is that this isn't just about the technology. It's about how the technology is deployed. It's about the values that guide how it's used. Greg, from a research perspective, what have you learned from this collaboration about how to deploy AI responsibly?

GREG WAYNE: I think one of the biggest lessons is the importance of keeping humans in the loop, especially for high-stakes applications. When you're dealing with something like visual interpretation for someone who's blind, the stakes are high, right? If the AI makes a mistake, that could have real consequences for someone's safety or independence. And so having human interpreters who can verify the AI's outputs, who can catch mistakes, who can provide that additional layer of oversight, that's really valuable. The other thing we've learned is the importance of transparency. Users need to understand when they're getting information from AI versus from a human. They need to understand what the AI can and can't do. They need to understand its limitations. And so we've tried to design the system in a way that's transparent about when the AI is being used and how it's being used. And I think the third lesson is about the importance of user feedback. Because at the end of the day, the people who are using this technology are the experts on their own needs. And so we need to be constantly listening to them, constantly incorporating their feedback, constantly iterating and improving based on what they tell us about what's working and what's not.

TROY OTILLIO: Alistair, can you talk a bit about some of the technical challenges you faced in building this system? What were some of the hard problems you had to solve?

ALISTAIR MULDAL: Yeah, there were definitely a lot of challenges. I think one of the biggest was just the real-time constraint.
Like I mentioned earlier, this system needs to work in real time. It can't take several seconds to process an image and come up with a response. It needs to be fast enough that it can keep up with a video stream and provide information in the moment. And that's technically really challenging because these AI models are very computationally intensive. Another big challenge was robustness. Because this system needs to work in all kinds of different environments, in all kinds of different lighting conditions, with all kinds of different objects and scenes. And it needs to work reliably. It can't just work 90% of the time. It needs to work 99% of the time or better. And achieving that level of robustness is really hard. And then I think the third big challenge was figuring out how to make the AI's outputs useful. Because it's not enough for the AI to just describe what it sees. It needs to provide information that's actually helpful for accomplishing the user's goals. And that requires understanding context, understanding intent, understanding what information is relevant and what's not. And those are all things that we're still working on and improving.

TROY OTILLIO: Geoffrey, from Aira's perspective, how have your human interpreters responded to working with AI? Was there resistance? Was there excitement? What's been the reaction?

GEOFFREY PEDDLE: I think there was definitely some initial apprehension, which I think is natural when you're introducing new technology that's going to change the way people do their jobs. There were questions about, "Is this going to replace us? Is this going to eliminate our jobs?" And we were very clear from the beginning that that's not what this is about. This is about making you better at your job. This is about giving you tools to be more effective. And I think once our interpreters started using the system and seeing how it could help them, the reaction was really positive.
Because they could see that it was making their jobs easier in a lot of ways. Instead of having to describe every single detail of a scene, they could focus on the high-level information, the context, the judgment calls. The AI was handling a lot of the tedious, repetitive tasks, which freed them up to do the things that humans are uniquely good at. We also found that it helped with training new interpreters. Because the AI could provide a baseline of information that new interpreters could build on as they were learning the job. So instead of having to learn everything from scratch, they could start by learning how to work with the AI, how to verify its outputs, how to add value on top of what the AI provides.

TROY OTILLIO: That's fascinating. Everette, you mentioned earlier that you've seen a lot of assistive technologies come and go over the years. What makes you optimistic that this approach is going to have staying power?

EVERETTE BACON: I think a few things. First, it's solving a real problem. Visual access is one of the biggest challenges that blind people face, and this technology is directly addressing that in a practical, useful way. It's not just a cool demo. It's something that people can actually use in their everyday lives. Second, it's built on a foundation of human expertise. A lot of assistive technologies fail because they try to replace humans entirely with technology. And the technology is never quite good enough. There's always edge cases. There's always situations where it doesn't work. But this approach recognizes that humans and AI each have strengths, and it tries to combine them in a way that's better than either one alone. And third, I think it's because there's a real commitment to listening to users and iterating based on their feedback. We're not just building technology in a lab and then throwing it over the wall and hoping it works.
We're actively involving blind users in the development process, listening to what they need, incorporating their feedback, and constantly improving.

TROY OTILLIO: Greg, looking forward, where do you see this technology going? What's the next frontier?

GREG WAYNE: I think there are a few directions. One is just continued improvement in the AI's capabilities. As our models get better at understanding images, at understanding video, at understanding context, the quality of the visual interpretation is going to improve. We're going to be able to provide richer descriptions, more accurate information, better understanding of complex scenes. Another direction is personalization. Right now, the system provides the same information to everyone. But people have different needs, different preferences, different ways they like to receive information. And so I think there's a lot of opportunity to make the system more personalized, to learn from individual users and adapt to their specific needs. And then I think the other big direction is expanding beyond visual interpretation to other kinds of assistance. Because the same technology that can help someone understand what they're seeing can also help them navigate, can help them accomplish tasks, can help them interact with the world in all kinds of different ways. And so I think we're just scratching the surface of what's possible.

TROY OTILLIO: Alistair, do you want to add anything to that?

ALISTAIR MULDAL: Yeah, I think one thing I'm particularly excited about is the idea of the AI becoming more conversational. Right now, the interaction is fairly structured. You ask a question, you get an answer. But I think there's potential for it to become more of a dialogue, where the AI can ask clarifying questions, can negotiate with you about what information you need, can help you think through problems. And I think that could make it much more useful and much more natural to interact with.

TROY OTILLIO: That's exciting.
So as we start to wrap up, I want to give each of you a chance to share maybe one key takeaway or one message for the audience. What do you want people to understand about this work? Geoffrey, let's start with you.

GEOFFREY PEDDLE: I think my key takeaway would be that AI and humans are better together. We don't have to choose between AI and human expertise. We can have both. And when we design systems that leverage the strengths of each, we can create something that's better than either one alone. And I think that's true not just for visual interpretation, but for a lot of different applications.

TROY OTILLIO: Everette?

EVERETTE BACON: I would say that the most important thing is to keep users at the center of everything you do. Technology is a means to an end. The end is access, independence, and inclusion. And the only way to achieve that is to actually listen to the people who are going to be using your technology and let them guide what you build.

TROY OTILLIO: Greg?

GREG WAYNE: I think my message would be that AI has enormous potential to improve accessibility, but we have to be thoughtful about how we deploy it. We have to think about safety. We have to think about reliability. We have to think about transparency. We have to think about human oversight. And when we do those things, I think we can create technology that really makes a difference in people's lives.

TROY OTILLIO: And Alistair?

ALISTAIR MULDAL: I would just say that this is still early days. We're at the beginning of what's possible with AI for accessibility. And I'm really excited to see where it goes from here. But I think the key is to approach it with humility, to recognize that we don't have all the answers, and to keep learning and iterating based on what we discover.

TROY OTILLIO: Well, thank you all so much for joining me today and for sharing your insights. This has been a really fascinating conversation. And thank you to everyone in the audience for your time and attention.
We're excited about what the future holds, and we're committed to continuing this work to create more accessible and inclusive technology for everyone.

GEOFFREY PEDDLE: Thank you.

EVERETTE BACON: Thank you.

GREG WAYNE: Thank you.

ALISTAIR MULDAL: Thank you.

[MUSIC PLAYING]