[MUSIC PLAYING]

MATTHEW PANZARINO: Hello, everyone. Welcome. I'm glad to have you with us. My name is Matthew Panzarino. I'm the editor in chief of TechCrunch. And we have with us Sarah Herrlinger, senior director of Global Accessibility Initiatives at Apple, and also Jeff Bigham, research lead for AI and ML accessibility at Apple. They're here to talk to us a little bit about ML and AI, how those relate to accessibility features, how Apple is implementing them across its feature stack, and, of course, how those will benefit people long term. I'm going to throw it to you first, Sarah. If you wouldn't mind, just give us a little bit of an overview of how Apple's approach to accessibility design differs, perhaps, from industry norms, leading to more success in the aggregate in Apple's implementation of accessibility features across iOS, macOS, and that sort of thing.

SARAH HERRLINGER: Sure. And thank you very much for having us both here today. It's really a pleasure to be with you again. So to answer that question: at Apple, we've long felt that technology should respond to everyone's needs. And we really mean that as everyone. It's not just that some people should be able to use our tech, but everybody should. So our teams work relentlessly to build accessibility into everything that we make. And as a piece of that, we're always looking to make daily life easier for our users as well. And I think that's where our work in ML really comes in. Whether that's Sound Recognition, which we built to help the deaf and hard-of-hearing community understand when there might be a sound around them that they would otherwise miss, or the work we've done around image descriptions and Screen Recognition for the blind community, or even some of the work we've done around AssistiveTouch for Apple Watch to support those with limb differences. And I know we'll go into a little more depth on some of these. But I think they're all great examples of the way that we're using machine learning to support our communities in really new and innovative ways.

MATTHEW PANZARINO: Thank you. And Jeff, that's a good time to segue to ask you about the applications of ML and AI across Apple's accessibility efforts. Because obviously, these are additive technologies that can help in a lot of different ways, and we've seen them applied in a variety of Apple features. But specifically where it relates to accessibility, can you talk a little bit about how Apple is applying those practically?

JEFFREY BIGHAM: Yeah. So it's a great question. I think that ML is really important for accessibility for a variety of reasons. It's how we teach our devices to perceive the world more like humans do. It's also how we adapt our devices to deal with whatever context they're in or whatever abilities people have. And if you contrast this with traditional programming, where you explicitly define every single situation and input you might expect to see out in the world, and then you can very well determine the output, machine learning allows us to go beyond that and deal with situations we haven't seen, maybe abilities or ability combinations we haven't seen before, so that we can really support everyone, and we can provide access to information out in the world that's unpredictable.

MATTHEW PANZARINO: Yeah.
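Herrlinger's mention of Sound Recognition has a rough developer-facing counterpart: the public SoundAnalysis framework ships a built-in, on-device sound classifier as of iOS 15. The sketch below is illustrative only, not the implementation behind the system feature, and the audio file URL passed to it is a placeholder.

```swift
import Foundation
import SoundAnalysis

// Observer that receives classification results as the analysis proceeds.
final class SoundObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else { return }
        // Identifiers come from Apple's built-in classifier, e.g. "dog_bark".
        print("\(top.identifier) (confidence \(top.confidence))")
    }

    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Analysis failed: \(error)")
    }
}

// Classify the sounds in a recorded audio file, entirely on device (iOS 15+).
// The URL is a placeholder; any local audio file works.
func classifySounds(in url: URL) throws {
    let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
    let analyzer = try SNAudioFileAnalyzer(url: url)
    let observer = SoundObserver()
    try analyzer.add(request, withObserver: observer)
    analyzer.analyze()   // runs synchronously over the whole file
}
```

For live audio, the same request can instead be attached to an SNAudioStreamAnalyzer fed from a microphone tap.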
MATTHEW PANZARINO: And the adaptability of ML, I think, is probably one of the things that attracts coders and builders to it when it comes to accessibility features, because you cannot always predict what scenarios a user is going to be presented with. And this allows far more adaptability, as you said. I'm curious how some of those features work. I think iOS 15 launched a handful of features, or enhanced features that were already available, using ML. Can you talk a little bit about how it powers some of the latest features? Live Text is one example. Obviously, image recognition and VoiceOver with images. I think ML is being used heavily this year.

JEFFREY BIGHAM: Yeah. So we're really excited about all the things that came out this past year. One of the really big improvements, which you noticed, was with image descriptions. This is the feature we introduced last year, which provides an image description for any image across the iOS ecosystem, whether that's in Photos, or Safari, or elsewhere. And what we've really done this year is dive in to think, well, now that we've made this possible, how can we continue to iteratively improve the design of this feature so that we can make these descriptions really great? Designing text generation systems is really hard. The whole point of machine learning is that we don't know what input we're going to get, so how do you then control what output you're going to provide? And so something where accessibility is once again kind of leading the way in AI is figuring out how you work with something like text generation to influence the quality of the descriptions that are produced. And I think we're very much at the forefront of that. The image description, I think, is one way you get access to the rich visual information in photographs. But I think it's just the start. The other feature that we are really proud to have been able to release in the same domain this year is the image exploration feature, which allows users to interactively take agency in how they want to dive into photographs. So they can interactively explore. For instance, if you want to explore more about a person or the people in a photograph, and people are really important in the photographs that people take, you can understand: what hair color does someone have? Are they wearing glasses? This past year, we've all been wearing masks, so being able to understand if somebody's wearing a mask, that's really important. And then being able to understand the spatial relationships between the various things that are in a photograph. So understanding where the people are, how they relate to one another. If the TV is on top of the fireplace and you're standing next to it, it's kind of neat to be able to explore that. And that's something we've been able to enable this year.

MATTHEW PANZARINO: Yeah. A lot more contextual awareness of the objects in the photo, rather than just, hey, these things are in this image somewhere, picture it for yourself. Instead, you're able to paint a clearer picture: hey, we've got a person wearing a mask to the left of a person who's smiling, with a background of trees, for example. And are you working on doing this live as well as on recorded images? I understand there's a little bit of both going on.

JEFFREY BIGHAM: Yeah.
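The image descriptions and image exploration Bigham describes are produced by Apple's own models inside VoiceOver, but developers can do a rough, on-device approximation of that kind of analysis with the public Vision framework. A minimal sketch, with face detection standing in for the people analysis and coarse scene classification standing in for the description; it only illustrates the general idea, not the shipping feature.

```swift
import CoreGraphics
import Vision

// Approximate, on-device image analysis with public Vision requests.
func describe(_ image: CGImage) throws {
    let faces = VNDetectFaceRectanglesRequest()   // where are the people?
    let scene = VNClassifyImageRequest()          // what is in the frame?

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([faces, scene])

    // Spatial layout: normalized bounding boxes, origin at the bottom-left.
    for face in faces.results ?? [] {
        print("Face at \(face.boundingBox)")
    }

    // Keep only reasonably confident scene labels (the threshold is arbitrary).
    let labels = (scene.results ?? [])
        .filter { $0.confidence > 0.6 }
        .prefix(5)
        .map { $0.identifier }
    print("Scene labels: \(labels.joined(separator: ", "))")
}
```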
JEFFREY BIGHAM: Some of the same technologies can be used in a live situation with Magnifier, in addition to the photos that are already recorded. That's one of the powers of being able to do all of this on device and so quickly: you can really do it fast, in the contexts that matter to people.

MATTHEW PANZARINO: Right. And the latency of the cloud would not really allow a live camera view to give people an image of what's going on via Magnifier, for instance. Whereas doing it on device allows you to do that. So it's an enabling factor there.

JEFFREY BIGHAM: Right, exactly. And that's been really powerful for us. If I take a look back, thinking about the developments that we've had over the past 10 years, it's just remarkable that we've gotten to this point. I think 10 years ago, if you told me we would be able to do what we're able to do now, I would not have believed it. And if you said we could do it all on device, I just would not have believed it. So it's been incredible to see the progress.

MATTHEW PANZARINO: Yeah. I mean, audio description tracks have obviously been common in filmed entertainment for accessibility reasons for some time. And audio description tracks are a great way for people to experience the visuals of a movie if they can't experience them through their eyes. But having that for real life, so to speak, is a very refreshing, new, and kind of futuristic way of looking at this, and it certainly will, I think, make a big difference for people over time. It's a pretty incredible feature.

SARAH HERRLINGER: Absolutely. I think we're really only at the start of all of this. It's been amazing for us to see how many ways we're able to make visual elements auditory and provide that additional information for people. So it really is fantastic to know that, as Jeff was saying, 10 years ago we were really just at the start of all of this. And now the sky's the limit as to where we can go.

MATTHEW PANZARINO: So let's talk about building up data sets like this. Obviously, these ML models have to be trained on massive amounts of data before they can begin to recognize, on their own, so to speak, the patterns necessary to say, hey, this is a mask, this is a tree, et cetera. So when you're gathering data for these data sets, do you pay special attention to how the data is gathered? I've been reading some amazing studies and reports about training ML models, for instance, on images taken by blind people, so that they know, hey, if a blind person is taking an image, this is how it might look, and how it might look different from a picture taken by somebody who is sighted.

SARAH HERRLINGER: Yeah. I mean, I think when it comes to data sets, inclusion has to be intentional. It's important to make sure that as you're gathering your data, you're getting it from a really diverse group of individuals, so that you cover all of those different options. When we look at a product like AssistiveTouch on Apple Watch, this is a feature that was built to support individuals with limb differences who might be using the watch in a one-handed manner. And so when we worked to build this, we made sure that the data set we used included a lot of diverse use cases, including the fact that the same gestures we use for AssistiveTouch on Apple Watch are also used separately as gestures for VoiceOver users who use the watch.
And so as we built out that data set, it wasn't just about finding individuals who might be amputees, or other types of users who would use it in a one-handed way for physical or motor reasons, but also looking at how individuals who are members of the blind community have experiences that are one-handed. When you think about it, if you're someone who's moving through space using a cane, or you have your hand on the harness of a service dog, you're effectively in that same position of being a one-handed user. And so we really looked at trying to come up with gestures that might not even have been, initially, the simplest of gestures, but were the gestures you could use for an extended period of time. So we really looked at which gestures are best when you're going to do this in a repetitive way, over and over again. And then, how do we make sure our data set is robust enough that we know those gestures will be the most effective for every type of use case we can come up with?

MATTHEW PANZARINO: Yeah. That makes sense. And for those who aren't familiar with the feature, could you describe what's possible with AssistiveTouch for Apple Watch? Because this is a relatively new feature, obviously designed with ML and AI frameworks to build models for how people would interact with these gestures and how the watch could then recognize them. Could you explain what the feature does a little bit, and then how it does those things?

SARAH HERRLINGER: Yeah, sure. So what the feature does is, by using simple gestures like a pinch or a clench, you're able to navigate through all of the elements on the Apple Watch and do things from answering a phone call, to starting up a workout, to any of those other things one would do, but without ever touching the watch face itself. And we do this not just through machine learning, but also through some of the really cool elements on the watch, like the gyroscope and the other motion sensors, to be able to understand the muscle movements and tendon movements of your wrist and to power those clenches and pinches.

MATTHEW PANZARINO: Got it. So instead of interacting with the screen directly with your other hand, those with motor disabilities, or, as you mentioned, people whose other hand is occupied, like holding onto the leash of a guide dog, for instance, are able to operate most of the features of the watch using just the programmable gestures. Because I think there are several gestures you can choose from: hey, this is the one I use all the time, or this is the one that's repeatable and non-stressful for me personally.

SARAH HERRLINGER: Yep. So yes, using those defined gestures to choose how to do different things on the watch. And as you said, basically use it to its full extent, but without ever having to actually touch the device.

MATTHEW PANZARINO: Yeah. That's very cool. I want to dive into the Live Text feature for a bit, because I think this is one of those things that people look at as a curio at first. And certainly for a broader audience, like during the keynote, you're like, oh, cool, text will show up selectable on the screen. And maybe your average person might use it a handful of times.
But obviously, for a massive number of people, this is actually a pretty big deal, where the text in any image becomes searchable, selectable, and indexable. So can you talk a little bit about how you implemented that feature, first of all? And then we'll dive a little bit into the possibilities there.

JEFFREY BIGHAM: Yeah, sure. I mean, I think Live Text is really a fascinating place, where we see this kind of mainstream feature where many people want to interact with the text that's in their photos. And we've been able to, in various ways, detect text in photos for a long time, although the improvements necessary to make this work reliably, and fast, and on device have really come together in the last year or two. What I think is really fascinating about this, though, and it kind of illustrates another thing that I really like about Apple's approach here, is that this is a mainstream feature. Anybody could take a photo that has text in it, and they can interact with that in new ways. You can call the restaurant whose phone number you took the photo of. But it also integrates really well, and this is why it's so great to see how all these things fit together, with the accessibility features on iOS, where you can also have it spoken aloud to you. You can have it read via VoiceOver. And so I think this is an interesting place where we see a lot of different things coming together. We see Screen Recognition being able to understand the visuals in the graphical user interfaces on your device. We see image descriptions being able to describe any photograph, including some of the text in those photographs. And then we see Live Text making it so you can interact with that text. And I think it's really great, too, that we didn't just make it possible to access the text; we went that next step to being able to semantically understand how the pieces of that text relate to each other. So if it's a table of information, you don't hear just one long string of text. You actually hear the rows and columns, and you can navigate through them in a familiar way, as you would in any other table. And so having all this stuff come together I think is really, really exciting. It kind of hints at this future of, what's the grand unifying theory of these various features all coming together? And I'm really excited about that. And I think Live Text is just one example, a really powerful one, of how this plays out.

MATTHEW PANZARINO: Yeah. And it's also compatible with VoiceOver, right? Spoken content. So in addition to telling somebody, hey, here's the visual context of, say, a person in an image, or whatever, it can also give them the context of, there's a sign with the words, you know, Chateau Lafite on it. And that helps, I think, when navigating the world. And that navigation, that enabling aspect of it, I think is what's so exciting about Live Text specifically for me. It's impossible, I think, for most people to appreciate how much of the world is inaccessible if you cannot read text for one reason or another. It's a closed book. And I think the exciting part is applying ML to this, applying the skill sets needed to, say, interpret text at odd angles, in perspective, at distance, and with clarity, with different fonts, different sizing, all of that stuff. It's been pretty incredible. I've been experimenting with it a lot over the last few months, and it's been a pretty great thing.
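For developers who want to build on the same capability, the on-device text recognition underneath Live Text is broadly available through Vision's VNRecognizeTextRequest (and, at the UI level, through the system's text-interaction APIs). The sketch below is a minimal, illustrative use of the public Vision API, not Apple's Live Text implementation itself.

```swift
import CoreGraphics
import Vision

// On-device text recognition with Vision. Handles rotated, perspective-
// distorted, multi-font text, which is the hard part discussed above.
func recognizeText(in image: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate      // favor quality over speed
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Each observation is one recognized region; take its best candidate.
    return (request.results ?? []).compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}

// Usage: lines of text from, say, a photographed menu or sign.
// let lines = try recognizeText(in: photoOfMenu)
```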
MATTHEW PANZARINO: And did you have to do a lot of experimentation, IRL experimentation, with these things to get this model to even begin wrapping its head around the wide and varied applications of text in the real world?

JEFFREY BIGHAM: Yeah. I mean, that's been one of the biggest challenges if you look at text recognition work over the last few decades. The initial systems, you would code them for a particular font, and the text would have to be perfectly scanned from a perspective that you knew. Then eventually, the big innovation was that we could recognize multiple kinds of fonts. And then eventually we could do it from multiple perspectives. And over the last few years, we've been able to get systems, trained with machine learning, that can recognize text in this wide variety of different scenarios, as you're saying. And of course, there's lots and lots of iteration and work that goes into that behind the scenes to make it work well. That's one example with Live Text; for essentially all of the features, we're doing similar kinds of work. And one of the really important aspects of this is being able to iterate on the data sets, on the annotations you provide to them, and on the models you produce with them. At each point along the way, and then all the way back from the beginning, how do we keep improving and keep iterating to get to that magical design, that point where we're really proud of what we produce? And you see that with features like Live Text when we ship them.

MATTHEW PANZARINO: Yeah. And also this attitude of updating the models continuously, right? Because these are not static. They do get better over time. And they get better with data provided by testing, and with further advancements in the camera and the processor.

JEFFREY BIGHAM: Yeah, exactly. As I mentioned, we've been improving all of our features over time. And it's incredible to see how much better they've gotten. Image recognition, I feel like we had done incredible work when we shipped it, and now it's gotten even better. We've been able to adapt to new scenarios, to new kinds of things we've noticed, and to feedback from our users. And so that's obviously really exciting. That's something you wouldn't even imagine doing at this kind of pace with traditional software.

MATTHEW PANZARINO: And Sarah, have you noticed, and this goes for AssistiveTouch or for Live Text or other features, what have you noticed as far as user adoption and feedback from people who have begun to use the features so far?

SARAH HERRLINGER: Great feedback from customers. One of the things we're always trying to do is boost independence and help people get more from their devices. Technology should be something that pushes you forward, not something that holds you back. And so we're finding, among so many of our communities, that when we build these tools, it pushes them farther forward, to be able to do more in their lives and feel like they're able to do it with the dignity and respect that everyone else has.

MATTHEW PANZARINO: And obviously, none of this happens-- oh, I shouldn't say none, because we talked a little bit last year about how Apple is trying to make sure that features like VoiceOver work anywhere, regardless of any external input or effort.
But you have to balance that, of course, with developer advocacy, because you want developers building the enormous number of third-party applications that end up on iOS or macOS with accessibility in mind. So how do you balance the advancements in ML and AI, where obviously Apple has enormous resources available to it and is putting those resources into this very important arena, with developer advocacy to get developers to adopt the features that you're building out? Can you talk a little bit about how you do that? Because presumably, you don't want to offer developers a way out. You want to offer them a way in to adopting these features.

SARAH HERRLINGER: Yeah. From our perspective, we're always trying to ensure that developers build accessibility into their products. Because one of the things we know is, as much as users love our products, they also love the entire ecosystem that surrounds them. And so all of those third-party apps are important for every user to be able to take advantage of. So our developer relations team gets in really early when they're talking to new developers about the importance of accessibility. And we find ways to try and help developers do this. Whether that's building guidance around accessibility into our Human Interface Guidelines, which everybody gets, to kind of teach them what to think about, or having tools like the Accessibility Inspector in Xcode that give them immediate feedback, as they're building their apps, on where they might be able to improve their accessibility. But with that, we also look at the power of ML to kind of fill in some blanks. So when we think about the work we've done on Screen Recognition, we may be able to look at something like a slider and, through our machine learning model, determine, I think that's a slider. And so we'll be able to tell a member of the blind community, that next element on the screen is a slider, and here's how you can work with it. But at the end of the day, it's the developer who really knows their own app better than anybody else. So while we're trying to ensure that the blind community still has access to all of those many apps around the world, we really encourage developers, and work with them, to make sure they're putting the time and energy into the accessibility of their own apps, so that members of the blind community have that full experience in a way that's just a little bit better than what we're able to do with our machine learning models.

MATTHEW PANZARINO: Right. Because the contextual nuance of a particular app is not always evident to a model that's trained on apps in general, like, hey, this is a button, this is the way you navigate it. So you'd want to encourage the developers to add more specific, context-rich accessibility features to the app.

SARAH HERRLINGER: Yeah, absolutely. I mean, if you're a developer and you're pouring your heart and soul into developing your app, you do know it better than anyone else. And you do want everyone to have that magical experience of using what you've made. So when you think about accessibility early in your process and start building it in from the start, you're going to give everybody that magical experience. And we want to make sure everybody does it.

MATTHEW PANZARINO: Yeah. I mean, theoretically speaking, developers should know, for instance, what the most important action on the page is.
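Herrlinger's slider example maps directly onto the public UIAccessibility API: Screen Recognition can guess that a custom view behaves like a slider, but a developer can state it outright. A minimal UIKit sketch of a hypothetical custom control (the class and its volume model are illustrative, not from the interview):

```swift
import UIKit

// A hypothetical custom control drawn by hand, so VoiceOver can infer nothing
// about it on its own. The UIAccessibility properties below tell VoiceOver
// explicitly what Screen Recognition would otherwise have to guess.
final class VolumeSlider: UIView {
    var volume: Int = 5 {                            // 0...10, illustrative model
        didSet { accessibilityValue = "\(volume * 10) percent" }
    }

    override init(frame: CGRect) {
        super.init(frame: frame)
        isAccessibilityElement = true
        accessibilityLabel = "Volume"                // what it is
        accessibilityTraits = .adjustable            // how it behaves: a slider
        accessibilityValue = "\(volume * 10) percent"
    }

    required init?(coder: NSCoder) { fatalError("init(coder:) not supported") }

    // VoiceOver swipes up/down on an adjustable element call these.
    override func accessibilityIncrement() { volume = min(10, volume + 1) }
    override func accessibilityDecrement() { volume = max(0, volume - 1) }
}
```

With the adjustable trait set, VoiceOver users change the value with a swipe up or down, just as they would on a standard UISlider.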
MATTHEW PANZARINO: And perhaps that most important action is where you start your navigation, and you allow users to move forward through the navigation from there using VoiceOver, for instance, or other gesture-based commands if necessary. And that ends up being a lot more efficient than just starting at the first button on a page, wherever that may be. And those little affordances, little enhancements, and specific choices by the developer can greatly enhance accessibility, regardless of the raw feature set that Apple offers in the OS.

JEFFREY BIGHAM: Sure. For every developer, I think you should turn on VoiceOver and go through whatever the top five workflows of your app might be. Try them with VoiceOver and make sure that your navigation works properly, and that a blind user can actually get through that workflow and, I don't know, buy something, do something. Whatever it is your app does, let's make sure that everybody can take advantage of that.

MATTHEW PANZARINO: That's great. That's actionable advice for developers looking to build out these apps. And I think it's also actionable advice for anybody who's a PM, or who is involved in the development of a product, or anything like that: accessibility based on practical usage. Doing that kind of basic user testing can teach you a lot about how people might navigate using the tools provided. Regardless of how good the tools are, you have to use them to understand how they're going to be used in the field, so to speak. Let's talk a little bit about the future of ML and AI as it relates to accessibility at Apple. How do you see these features being additive over time?

JEFFREY BIGHAM: Yeah. I mean, I'm super excited about the future here. I think we'll continue to see developments and exciting new features, just like we've seen in the last year, and like we saw the year before that. One thing, though: given how far we've come, I think it might be tempting to think, oh, wow, we've come so far, there must not be that much more to do. But I actually think the next 10 years are going to be even more exciting than the last 10 years. One way to get a perspective on the future is to look back over the past. And if I look at what I was doing 10 years ago, I was developing an iPhone app called VizWiz. It allowed a VoiceOver user to take a picture, speak a question they'd like answered about it, and get an answer, pretty accurate, in a few seconds. The one hitch, though, is that the way we did that was we actually sent the questions and the photograph out to people on the internet to get those answers. It worked out pretty well. It's just that's not necessarily what we always want to do with our photographs.

MATTHEW PANZARINO: I remember that, by the way.

JEFFREY BIGHAM: Oh, really? Great. That's awesome. And you know, I talked to friends at the time. I thought, wow, people really seem to be responding to this. Can we maybe start to automate it? So I talked to some of my friends who were the top computer vision people in the world. I won't embarrass them by calling them out, but all of them thought, oh, no, that's far beyond what we would be able to do with machine learning or computer vision. And now, just a few years later, here we are at a point where we're doing something really similar. And we're doing it on device, privately. And that's really exciting.
And so over the next 5 or 10 years, I see a few different things coming together that I think are going to be even bigger than what we've seen so far. The first one is that now that we've figured out how to do this stuff, we have this really nice toolbox, and I think we're going to see people really go wild with it, figuring out and designing new features that use this set of tools. Again, what I'm really excited about over the last year is not that we enabled image descriptions across the whole device. We already did that. It's how we went in and really focused on, how do we make them compelling? How do we improve their design? And I think we're going to see that expanded out to a whole bunch of new things. I'm also really excited thinking, OK, now we have all of these features, and we keep introducing new ones. Screen Recognition for graphical user interfaces. Image descriptions for all the photographs. Live Text, People Detection, et cetera, et cetera. I mentioned earlier, what's the grand unifying theory of all of these? I think it's really exciting to think about how this stuff all comes together. Because they're doing similar kinds of things, how do we bring them together into an experience that just works for people in a way that's much more fluid? And then I think the last thing, related to what Sarah was talking about with AssistiveTouch, is that we have all of these ML-powered features for accessibility, whether that's AssistiveTouch, or Live Text, or Sound Recognition, all of these things. And I'm kind of excited about how they come together and stack. Sarah already mentioned this in one instance, where you can use AssistiveTouch, which was designed with a different kind of use case in mind, to control VoiceOver. So now you've got AssistiveTouch controlling VoiceOver; you've got a new input modality enabling the screen reader interaction. I think that's really exciting. You've also got Live Text combined with Speak Selection. So we've got a couple of different things happening there. And so what does this look like when everyone can personalize their device using all of these different features in combination, to really work for them and their abilities in their current context? I think it's super exciting. And there's all this great stuff that's going to come out of this.

MATTHEW PANZARINO: Yeah. And Sarah, do you have any thoughts on that?

SARAH HERRLINGER: Yeah. I mean, I agree with what Jeff's saying. I think the biggest thing about accessibility is that it's really about customization. Every person's use of their technology is unique, regardless of whether you self-identify as having a disability or not. So as we start to look at the leaps that will come in hardware, in software, and in what machine learning can do, we have a real opportunity to integrate these features more seamlessly into people's lives, so that they can take advantage of each one that works best for them. Or, as Jeff said, stack them, to use a lot of different things at once and all together, so that they can get the most out of their technology.

MATTHEW PANZARINO: That's great. Thank you. I mean, one of my favorite things about accessibility is that these augmentations the device provides to any sense literally benefit everyone.
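Bigham's stacking point, Live Text feeding Speak Selection, for example, has a rough developer-level analogue: run on-device text recognition and hand the result to the speech synthesizer. A hedged sketch that reuses the recognizeText(in:) helper from the earlier Live Text example; the system features themselves are wired together by iOS, not by code like this.

```swift
import AVFoundation
import CoreGraphics
import Vision

// A rough analogue of "Live Text plus Speak Selection": recognize the text in
// an image on device, then speak it aloud.
let synthesizer = AVSpeechSynthesizer()   // keep a strong reference while speaking

func speakText(in image: CGImage) throws {
    let lines = try recognizeText(in: image)   // helper sketched earlier
    guard !lines.isEmpty else { return }

    let utterance = AVSpeechUtterance(string: lines.joined(separator: ". "))
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```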
MATTHEW PANZARINO: And we can see the adoption of a variety of accessibility features by a broad, broad public, far beyond the audience they were originally designed for, as examples of this. People get the benefits of this regardless. They get the benefit of Live Text, obviously. VoiceOver can be useful in situations that aren't necessarily about blindness. And then you also get a lot of really great gesture-oriented research that has benefited people for years and years, and a lot of people love this. So that's one of my favorite things about it. And my second favorite thing is the fact that these augmentations are focused on allowing anybody with accessibility needs to not just reach parity, but grow with everyone else as a human, right? These augmentations can take us beyond zero. Because in the past, accessibility was very much thought of as, hey, how do we get people up to a neutral level? In reality, it's about expanding options for everyone. And that's the really exciting thing. I think it's the broadly applicable uses of accessibility, growing out of the dire need for it, that are the exciting part. So I'm excited to see how all of that develops, especially as Apple is building out these ML-driven features for the rest of the world. So thank you both very much. I really appreciate you taking the time. It's been a great discussion.

SARAH HERRLINGER: Thank you very much. Pleasure to be here.

[MUSIC PLAYING]