Seeing AI: What Happens When You Combine Computer Vision, LIDAR and Audio AR?
DESCRIPTION: The latest features in Microsoft’s Seeing AI app enable it to recognize things in the world and place them in 3D space. Items are literally announced from their position in the room; in other words, the word “chair” seems to emanate from the chair itself. Users can place virtual audio beacons on objects to track the location of the door, for example, and use the haptic proximity sensor to feel the outline of the room. All of this is made possible by combining the latest advances in AR, computer vision and the iPhone 12 Pro’s lidar sensor. And that’s only the start.
DEVIN COLDEWEY: Thank you, Will. And thank you, Saqib, for joining us. It’s always very interesting to have you on, and I’m always interested to hear what you have going on. First of all, I think I would like to have you introduce your work with Seeing AI, just for our viewers who may not be as familiar or who are not users of the app. So maybe you could just give us a sort of top-level look before we start diving into the details.
SAQIB SHAIKH: Absolutely. Thank you very much for having me. Seeing AI is an ongoing research project at Microsoft where we are looking at how do we bring the latest emerging technologies, such as artificial intelligence, to empower people with disabilities and, in particular here, the blind and low-vision community? So this could be using AI to recognize who and what is around you, to read text in real time. And Seeing AI is an iPhone app which you can get from the App Store.
DEVIN COLDEWEY: And you describe it as a research project. But really, it’s gone on long enough that now it’s become a real tool that lots and lots of people use. Do you think “research project” is still the correct term, or is it really becoming sort of a platform, an assistive app?
SAQIB SHAIKH: It’s really both. So Seeing AI as a mobile app has been around, you’re right, for a number of years now. We refer to it as a research project because very much what we’re still doing today is looking at what are the emerging technologies? What are the problems our customers have? And I kind of think of it like a conversation between the two sides. So understanding the problem, understanding the technology’s potential, and bringing them together to create brand new experiences.
But then you’re absolutely right, that we have a large number of users depending on this every day. It’s a very popular app amongst people who are blind and low vision and even people with other disabilities. So it’s, yeah, a tool that people depend on. But we’re still continuing to push the envelope to show the art of what’s possible with the next generation as well.
DEVIN COLDEWEY: Right. There’s always more to be done. And on that note, I’m curious how the development works. Like as you said, it is a conversation with the community. I’m curious how you decided on the core modules that are available. Obviously, some of them are very obviously useful like being able to read text or something like that. But how do you decide on the features and how to implement them with the involvement of the community?
SAQIB SHAIKH: Yeah. Many of the features you mentioned, like reading text, they’ve been there right from the beginning. I think it’s been four years since we launched, maybe as long as seven years since the first idea, the first line of code. But in that time, we keep looking at what are the challenges that people have? And there’s still so many aspects of daily life where we hear from customers that, hey, can you do anything to help with this, or this, or this?
And on the other side, it’s just matching that up with what is the capability of the technology. We’d love it if we could get real-time video descriptions, just as one example. And you know, that’s not there yet. But there are scientists working in that field and many more. There are other areas like– I don’t know which to pick– so many different challenges where we go to the scientists and say, hey, we hear from customers that this is something that would be great. And they’ll scratch their heads and say, wow, I never thought of that. And so we’re in the middle of those two worlds, bringing them together.
DEVIN COLDEWEY: Oh. That’s interesting. I always wonder with apps like these, with the people that you’re working with and learning from, do you ever find that they use the app in some sort of unexpected way, and you’d think like, oh, I can’t believe I didn’t think of that, and now we can move that into a full feature or something like that?
SAQIB SHAIKH: Totally.
DEVIN COLDEWEY: I love that. Well, it’s interesting just to hear– I would love to hear if there’s any kind of experience like that you’ve had.
SAQIB SHAIKH: There are so many of those examples where I would just think, I didn’t even know it could do that. But maybe my favorite of all of those is that the very first version of Seeing AI, going way back when, had the ability to recognize someone if you trained it to recognize their face.
And someone had done that with all the banknotes, to recognize the presidents’ faces. But they gave the names as the denominations, like $10 or $20, because the dollar bills are all the same color and all the same size. And so eventually, we ended up building a currency channel within Seeing AI specialized for that purpose. But the creativity of using face recognition to identify banknotes is something that stays with me today.
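The banknote trick is a nice instance of nearest-neighbor matching: a face-recognition system stores one embedding per user-labeled example and returns the closest match above a threshold. Seeing AI’s actual pipeline isn’t public, so this is only a toy sketch of the general idea, with made-up 3-element vectors standing in for real image embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def recognize(embedding, templates, threshold=0.8):
    """Return the label of the closest template, or None if no match.

    `templates` maps a user-chosen label (e.g. "$10") to a stored
    embedding, mirroring how a user might 'teach' faces -- or banknotes.
    """
    best_label, best_score = None, threshold
    for label, template in templates.items():
        score = cosine_similarity(embedding, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical embeddings standing in for real face-recognition features.
templates = {"$10": [1.0, 0.1, 0.0], "$20": [0.0, 1.0, 0.2]}
print(recognize([0.9, 0.15, 0.05], templates))  # "$10"
```

The same mechanism serves faces, banknotes, or anything else a user chooses to teach it, which is exactly the creativity the story illustrates.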
DEVIN COLDEWEY: That’s truly wonderful. They say necessity is the mother of invention. And like you mentioned, the bills are all the same size, like many other places. So it’s like, well, what can I do to solve this? The tools are there in the tech, but it hadn’t been implemented. And so they came up with this great solution. I think that that’s wonderful.
And I understand you have a few new features that you’ve been working on in the app since we last talked almost a year ago, I think. Can you tell us a little bit about some of the new additions? I believe there was some Lidar involved?
SAQIB SHAIKH: Yes. So a big part of what we’ve been doing with Seeing AI– we have different channels for different tasks, so that someone can specify what they’re most interested in knowing about. And we introduced the new world channel, which maybe is an unusual name. But it’s: what if you could recognize everything in the world and build up a model of the world? And so this is what we call audio augmented reality.
Now, augmented reality is a term that you might have come across in the tech press, et cetera, where you’ll be overlaying visuals over the real world. But for us, it’s about overlaying audio distributed in the real world. So in the world channel, you’ll be able to point the phone around, and it will announce things that it sees using your headphones and using spatial audio in the world itself.
So you’ll hear the word chair coming from the chair itself, which is really quite remarkable the first time you see this– hear this, I should say. And you can even place audio beacons in the world. So you might say, OK, I want to know where the door is. And you’ll place that audio beacon on the door. And even as you move your camera or look in different directions, the beacon will keep coming from the door. So these are just some very early steps where we’re using audio augmented reality to understand the world around you, to build up this 3D model of the world, and help you explore an unfamiliar space.
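A beacon that stays anchored to the door as you turn comes down to recomputing the sound’s direction relative to the listener each frame; the platform’s spatial-audio engine then renders it binaurally. As a rough illustration of just the geometry (the real app presumably relies on ARKit tracking and Apple’s spatial-audio APIs; nothing here is Seeing AI code):

```python
import math

def beacon_azimuth(listener_xy, listener_heading_deg, target_xy):
    """Angle (degrees) from the listener's facing direction to a beacon.

    0 = straight ahead, positive = to the right, negative = to the left.
    A spatial-audio engine takes a direction like this to make the word
    "chair" seem to come from the chair's position.
    """
    dx = target_xy[0] - listener_xy[0]
    dy = target_xy[1] - listener_xy[1]
    bearing = math.degrees(math.atan2(dx, dy))  # 0 deg = +y axis ("north")
    # Wrap into [-180, 180) relative to where the listener is facing.
    return (bearing - listener_heading_deg + 180) % 360 - 180

# A door 3 m ahead and 3 m to the right of a listener facing +y:
print(beacon_azimuth((0, 0), 0, (3, 3)))   # roughly 45: front-right
# After the listener turns 90 degrees right, the same door is front-left:
print(beacon_azimuth((0, 0), 90, (3, 3)))  # roughly -45
```

The key property is the second call: the door’s world position never changes, only the listener’s heading, so the beacon stays fixed to the door as the camera moves.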
DEVIN COLDEWEY: Well, that’s really interesting. I would love to try that out. And is the feature live right now, or is it still coming out?
SAQIB SHAIKH: Yeah. This early version is live right now on any device that has a Lidar. So we use the Lidar to sort of build up that world understanding of where things are, to match them up with what the camera sees. But this really truly is the first step in this space. And yeah, we have a bunch of really, really exciting experiments that we’re working on in the lab.
DEVIN COLDEWEY: I’m curious about the development of the Lidar feature and that audio augmented reality. I imagine it must have taken a lot of fine tuning and a lot of engineering work to sort of hit that right balance of being able to identify something quickly, and the cadence at which it should announce itself, and all that sort of thing. Was this feature long in testing?
SAQIB SHAIKH: Yeah. That’s a really good question, because unlike just recognizing a single image, here we want you to be able to explore something that you might not be looking towards with the camera. So we’re building up this 3D model. And you can get a summary of everything around you, even if you’re not looking at it.
And so we want to know, if you looked at something from a few different angles, that it’s always the same object. And how does that map into 3D space? So definitely, there was a whole bunch of testing and iteration involved.
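Recognizing that two detections from different angles are the same object is a data-association problem. The details of Seeing AI’s approach aren’t public; a minimal stand-in is to match each new camera detection against the world model by label and distance:

```python
def merge_detection(world_objects, label, position, radius=0.5):
    """Fold a new camera detection into a persistent 3D world model.

    If an object with the same label already exists within `radius`
    metres, treat it as a re-observation of the same object (seen from
    another angle) and average its position; otherwise add a new object.
    A toy stand-in for the real association problem.
    """
    for obj in world_objects:
        dx = obj["pos"][0] - position[0]
        dy = obj["pos"][1] - position[1]
        dz = obj["pos"][2] - position[2]
        if obj["label"] == label and (dx * dx + dy * dy + dz * dz) ** 0.5 < radius:
            obj["pos"] = tuple((a + b) / 2 for a, b in zip(obj["pos"], position))
            obj["sightings"] += 1
            return world_objects
    world_objects.append({"label": label, "pos": position, "sightings": 1})
    return world_objects

world = []
merge_detection(world, "chair", (1.0, 0.0, 2.0))
merge_detection(world, "chair", (1.1, 0.0, 2.1))  # same chair, new angle
merge_detection(world, "chair", (4.0, 0.0, 2.0))  # a second chair
print(len(world))  # 2
```

A model like this is also what makes the “summary of everything around you” possible: the list of objects persists even when the camera is pointing elsewhere.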
DEVIN COLDEWEY: Absolutely.
SAQIB SHAIKH: Of course, user testing too.
DEVIN COLDEWEY: I’m really curious to try it out. Unfortunately, I don’t have a Lidar-equipped phone. I guess this is one of those challenges, that you want to be able to get features to as many people as possible, but some of the features are going to require the latest ML chip, the latest sensing tech. How do you balance the need to get something to as many people as possible with the drive to embrace the latest technology that not everyone has yet?
SAQIB SHAIKH: Yeah. This is actually a challenge we’ve been very conscious of within the team, that we support devices that are a number of years old, back to iOS 10 right now. And so even if you don’t have that really fast GPU or the Lidar, Seeing AI will still work for you.
But then at the same time, we want to really push the hardware to its limits, so that if you have that capability, then more channels will be available to you, or we will be able to use more advanced models. Maybe we’ll be doing more in the Cloud on slower devices and things like this. So it’s how do we provide a good experience for everyone and then optimize that as the hardware allows?
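The tiering Saqib describes, a baseline that works everywhere plus richer paths unlocked by newer hardware, can be sketched as a simple capability table. The feature names and routing policy below are illustrative assumptions, not Seeing AI’s actual logic:

```python
def pick_execution_target(has_neural_engine, has_lidar, online):
    """Choose where each (hypothetical) feature runs for this device.

    Principle: a baseline that works offline on old phones, with more
    channels or bigger models available as hardware and connectivity allow.
    """
    plan = {"short_text": "on-device"}  # baseline: works everywhere, offline
    plan["scene_description"] = (
        "on-device" if has_neural_engine   # fast local model if possible
        else "cloud" if online             # fall back to bigger cloud model
        else "unavailable"
    )
    plan["world_channel"] = "on-device" if has_lidar else "unavailable"
    return plan

print(pick_execution_target(has_neural_engine=False, has_lidar=False, online=True))
# An older phone still gets text reading locally and scenes via the cloud.
```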
DEVIN COLDEWEY: Absolutely. And you mentioned, with the Lidar and the other imaging that goes on, that you’re building this 3D model of the world. Is this just the beginning of what you can do with this richer 3D data? Is there any other interesting stuff that you’re interested in doing or that you’ve hoped to do for the last few years?
SAQIB SHAIKH: So, so much more. And like I say, I wish I could talk more about some of our new experiments. But the way we may think about this is, like I mentioned, Seeing AI has been around for four years now, looking at sort of 2D, still images. And that’s really coming into its own. And it’s not perfect, even today. But we’ve made huge strides at Microsoft, but also in the industry.
Also, with wearables, we came up with a prototype four or five years ago of a wearable. And wearables are still not mainstream, but we’ve seen several companies coming out with interesting form factors. So I think audio AR, it’s just at the beginning now. And I think over the upcoming months, years, we’ll see more and more capabilities. Right now, what we have today is the ability to understand your environment in 3D. But we’re really looking at, how do you interact with the environment?
DEVIN COLDEWEY: That was actually my next question. Interacting with that seems like a unique challenge for someone who– normally, you would have to tap on the button on the screen or something like that, but it may not be possible for everyone who is using it who is visually impaired or has limited mobility. The normal options of the kind of VR, AR interface to interact with that rich 3D environment aren’t always possible. So that seems like a really tough nut to crack. Can you tell us a little about how you have designed the interactions and how you think they’re going to move forward?
SAQIB SHAIKH: Yes. That’s interesting– there’s interacting with the device, and there’s also interacting with the real world. So as you’re building up this world understanding, as the system is understanding what you’re doing in the world, we want to understand how you’re interacting with the world and, maybe through audio cues or other haptics, enable you to interact with it.
But then while you’re doing that, you also want to be able to interact with the device. And as you bring up, it’s not always convenient to be tapping or scrolling on a touch screen, even though someone who’s blind can do so very proficiently with a screen reader. So yeah, we’re continually looking at that.
Speech recognition is something we’re aware of, but it’s not always natural. I’ve personally had the situation where you ask your phone, “Where am I?” and someone next to you very helpfully tells you. And that’s a bit embarrassing. So it’s not as simple as just speaking to your device, but we’re definitely looking at different ways of inputting. And then also hands-free options, as I mentioned earlier. I really hope that wearables continue to develop. And we’re always looking at the different options on the market to see what is the mainstream wearable that we could adopt.
DEVIN COLDEWEY: Yeah. I was curious about whether you think that there are any particular sort of gadgets or wearable technology, any kind of like haptic technology that might be used specifically for this kind of purpose. I’m sure you keep a very close eye on that. I’m curious whether you think there’s any particular tech that is coming out that would be extra useful for users of Seeing AI.
SAQIB SHAIKH: There are so many exciting things going on. And we’re definitely interested in working with them– get in touch if you have a nice new gadget that could be useful. Things that are on our radar– definitely the mainstream AR headsets. A lot of them today, including Microsoft’s HoloLens, are more aimed at industrial applications. But I do think that consumer AR is going to come along.
But again, as with many fields, the blind community is an early adopter here, where maybe we are less interested in the screen that goes in a pair of glasses and more in the audio and the cameras to observe the world, and so on. So we’re looking at whatever comes out there and what accessories we could connect with Seeing AI.
DEVIN COLDEWEY: I’m curious about the sort of broader, looking forward vision for Seeing AI and for assistive agents in general. You told me you have this sort of more general and larger idea of having these assistive agents. But this is only the very beginning of that. So maybe you could tell us a little bit about what you envision over the next 5 to 10 years as these agents become more capable and as that rich 3D data starts being more accessible and the tools are more accessible to the people.
SAQIB SHAIKH: Yeah. Absolutely. So assistive agents is a vision I’ve had for some years, probably since the beginning of the thing that became Seeing AI. It’s this idea that, what if we had some kind of software system that understood each and every one of us and that could fill in the gaps between what we’re capable of and what we want to do in our environment?
And understanding that each and every one of us is different. That might be because of a permanent disability, or just a situation, or the context that you’re in. For someone who’s blind, that might look like– I sometimes describe it as a friend sitting on my shoulder, looking around, whispering in my ear: the equivalent of a sighted guide when no one else is around. And for someone who’s neurodiverse, or someone who’s simply in an unfamiliar country where they don’t speak the language, these are all situations where you wish there was someone who could help interpret your environment for you.
And so that’s a big vision for assistive agents, but really that’s where I see Seeing AI as just one step along that road to developing an AI that understands us, understands our needs, understands the world around us, and then fills in that gap.
DEVIN COLDEWEY: It really is a great idea, the assistive agent as a sort of all-purpose helper. Because there are so many different ways that we have that understanding gap between ourselves and the world around us. And for some people, like you said, it’s a permanent disability, or it could be something as short-lived as: this afternoon, I happen to be in a building I’ve never been in before.
I do love this idea that it literally could be anybody, and everybody will find some utility in it. I guess I would love to hear how you think we get there. I know that this will take a large amount of engineering, a large amount of data, a large amount of human involvement. As you noted at the very beginning, the community is the one that produces the most ideas here. So what do you think are the steps between here and that sort of expansive vision of assistive agents that you’ve just described?
SAQIB SHAIKH: I think a big part is just taking it step by step and solving real problems that real people have. And we can look at the industry at large and see that so many of the technologies that we depend on today have their origins in disability, in someone with a great need and someone with the ability to create coming together.
And we see that in the touch screen, the on-screen keyboard, speech synthesis and speech recognition, in the telephone. There’s so many more examples– text messaging. And all of these things come from the need and the ability to create. And I do see the disabled community as early adopters. So with assistive agents, I passionately believe that today, in solutions like Seeing AI, we are really exploring the state of the art, bringing these solutions to bear to solve real problems. But one day, they will be things that everyone just takes for granted.
DEVIN COLDEWEY: Certainly so. And as I mentioned, we’re going to need data for this. With AI and machine learning, you always need more data. What do you think we’re missing in that part of the equation? Because I’m always hearing about the data desert in this industry or that industry. What do you think we’re missing here, and what needs to be done to collect that data?
SAQIB SHAIKH: Yeah. I think we really need more diversity in our data and, in a way, also the ability for end users to do more teaching of the AI. So I think there are two sides to it. Right now, a lot of these systems are trained from large-scale data sets. And they’re not always representative of the real world. And they’re certainly not representative of people with disabilities.
So one thing that’s very much on our radar– and there are some research projects in this area– involves how we create open data sets which are fairly collected and which represent people with disabilities. So that’s one part of it: making sure we have diversity in our data sets so that the AI systems we create are fairer for everyone, including people with disabilities.
And then the other part of it is, even if you have large data sets, at the end of the day, we each have personal needs, and we have things around us which are unique to us. So how do you enable someone to teach the AI themselves? This is the idea of personalization of a machine learning model by an individual. And there’s interesting research going on in that area too. So yeah, I could talk much more about this, but those are some of the things we have our eyes on.
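One common way to let an end user “teach” a model without retraining it is nearest-centroid classification over frozen embeddings: each user-provided example simply updates a class centroid. This is a generic sketch of that personalization idea, not anything specific to Seeing AI:

```python
def teach(model, label, embedding):
    """Add one user-provided example; running sums per label are the 'model'."""
    total, count = model.get(label, ([0.0] * len(embedding), 0))
    total = [t + e for t, e in zip(total, embedding)]
    model[label] = (total, count + 1)

def predict(model, embedding):
    """Classify by the nearest class centroid among the user's own labels."""
    def dist(label):
        total, count = model[label]
        centroid = [t / count for t in total]
        return sum((c - e) ** 2 for c, e in zip(centroid, embedding))
    return min(model, key=dist)

# A user teaches two personal objects from a few examples each
# (2-D vectors stand in for real image embeddings):
model = {}
teach(model, "my keys", [0.9, 0.1])
teach(model, "my keys", [1.0, 0.0])
teach(model, "my mug", [0.0, 1.0])
print(predict(model, [0.8, 0.2]))  # "my keys"
```

The appeal of this design is that the heavy lifting stays in the pretrained embedding model, while the part the user teaches is tiny, private, and cheap to update on-device.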
DEVIN COLDEWEY: There are loads of things to look into. And I’m curious– the work that needs to be done, I’m sure, will be done by many companies, research institutions, universities, and individual researchers. Microsoft, of course, has been, in my opinion, generous in its research into these areas and in supporting the development of Seeing AI and other work along these lines.
Ultimately, it doesn’t seem like there is a profit motive attached to this. But that seems to be the way things move forward quickly in the tech industry: if there’s some path to productization of this sort of thing. Do you think that will figure in here– that we may see breakthroughs when suddenly you can start making money off an assistive agent like this?
SAQIB SHAIKH: Yeah. Today we definitely separate those two. So Seeing AI is a tech for good project that we are very much doing to help people out, to use the technology that we have for social impact. However, I really believe in this idea of inclusive design, where when you design for one, you can extend to many.
And we touched on this before, this idea that if you laser focus on solving the problems of one community, eventually, those are going to inspire solutions, which will have much broader applicability. And then it’s those broader solutions one day that may have that profit or become a solution for everyone.
DEVIN COLDEWEY: That’s true. And I’m trying not to imagine a future where there are different branded agents, but perhaps that kind of competition is also a powerful thing. Because I know that researchers at other companies and universities will look at what you’ve done and say, wow, it’s inspiring. It’s useful. It’s practical. People are interested. Maybe we aren’t going to make money on it this year. Who cares? We should get into that. I think that feeling of competition is also very important.
SAQIB SHAIKH: Yeah. And I very much see this as if we can push the state of the art, if we can inspire future work, then, really, everyone benefits by that.
DEVIN COLDEWEY: Do you think– just sort of off the top of your head, you mentioned that, of course, accessibility, designing for one, and extending to many, all these ideas of just designing for accessibility and for everybody. Is there anywhere that you think that we need to make some major leaps forward in the tech sector? I know that there are so many fronts. It’s not just making sure a website is screen reader accessible. There’s all kinds of stuff in UX design, in apps, and even just regulatory processes. I’m curious where you think there’s a lot of room to take a few steps forward.
SAQIB SHAIKH: Most of my work is in the space of AI emerging tech. However, I am also a user of accessibility solutions. And so definitely– for the developers in the audience, you definitely want to make sure that all your apps and websites are still accessible and follow all those guidelines.
There’s also some really interesting work– even side projects we have– on what it would be like if tools like Seeing AI came to the computer screen, to make solutions accessible when they weren’t otherwise. But of course, the best is if the developer makes things accessible.
But also, if I may dream again, big picture, if the AI can understand the applications and the user intent, I do look to some far off future where we’ll be able to converse with the technology in the way that we want to, and it will be able to understand the different apps and, again, let each of us communicate and receive information in the way that suits each of us. Because we all are unique individuals, unique human beings.
DEVIN COLDEWEY: I know I am. It seems like you are. So I think it’s a pretty good sample. You touched on something briefly that I wanted to ask about too, which is different levels of access. Obviously, there is a question of whether the AI runs locally or in the cloud. And you mentioned that maybe an older phone that can’t handle certain workloads will offload to the cloud. Or if it can, it will do it locally.
How do you manage that balance? Because it seems like there are many people around the world who may not have a reliable enough internet connection, or they want to be able to use this sort of app technology when they’re camping or something like that. How do you make that balance?
SAQIB SHAIKH: The best user experience is certainly when everything can run offline on device, and many parts of Seeing AI do. However, the Cloud has far more compute power. You can run far bigger AI models. And so, when possible, you can actually get much better results by going to the Cloud. So as you mentioned, you have to balance so many factors like availability of the internet, or GPUs on your phone, or battery life, and so much more.
That’s on the consumption side. When you’re looking at generating the models, the cloud is really exciting, because we’ve suddenly got this huge resource available– what some people have called the internet supercomputer– where you can train models with so much data. So training in the cloud and running on your phone is definitely the ideal when possible.
DEVIN COLDEWEY: Speaking of good connections– certainly we’ve all had good connections on our home Wi-Fi, which is where we’ve all worked for the last year and a half, two years now, with the pandemic. I’m curious how you think the work-from-home change has affected the needs of the community that you work with, the needs that something like Seeing AI addresses, and whether any new challenges have appeared over the last couple of years that were unexpected or perhaps productive.
SAQIB SHAIKH: Yeah. It’s changed the way all of us live. And there are some things which are less needed. For example, if someone is spending more time in an environment where there’s a lot of sighted family members, that might change the equation for them. But for someone else who’s spending a lot of time at home living by themselves, that’s a different factor.
So we’ve been talking to our customers, and it’s affected different people in different ways. There are new challenges: we’re getting a lot more deliveries, but we’re using a lot less cash. We’re not going to the office, but when we do, there’s social distancing and fewer people around. So yeah, we’re in touch with the community, understanding those trade-offs and the differences in this interesting world that we’re going through at the moment.
DEVIN COLDEWEY: I think we’re getting close to time here. But I want to just ask for your thoughts on where Seeing AI is going in the next couple of years, where you hope to take it in general. Not necessarily any specific features you’re looking to add, but what do you think is possible in the near term for you, and what are you excited to bring into the mix for all the users of Seeing AI out there?
SAQIB SHAIKH: In many ways, we’re continuing that conversation with the users, with the scientists, and seeing as ever, how do we bring the latest emerging technologies to bear on those problems? Audio augmented reality is a big part of what we’re working on right now, this idea of how do we do more with this world understanding? How do we let you query the world, interact with the world?
That’s definitely a big part of what’s coming up in the short term. But then that sort of takes us through to that assistive agent vision. How do we take incremental steps towards enabling the AI to understand the world, understand you, and empower you to do more?
DEVIN COLDEWEY: Well, it certainly is inspiring to hear you talk about this. I’m looking forward to what comes out next. I think that our time is up, however, so we must say goodbye for now. Saqib Shaikh, thank you so much for joining us here. It’s always super interesting. And thank you for everything that you do.
SAQIB SHAIKH: Thank you so much. It’s been a pleasure talking.