DESCRIPTION
Microsoft’s Saqib Shaikh co-founded the Seeing AI app, one of the first deployments of AI computer vision in an app to help blind users “narrate the world around you” by using AI to describe nearby people, text, and objects. Shaikh’s employer, Microsoft, is a leading investor in OpenAI, the organization that created the ground-breaking ChatGPT, a type of AI called “generative” because of its ability to “generate” novel content in response to questions. Today, Seeing AI can tell you there is a chair in the room because it is “trained” to identify a chair. With ChatGPT, Seeing AI might be able to answer a random question it was not specifically trained for, such as, “Is there a cat in the room?” The answers ChatGPT provides can be wondrous or wildly off-base “hallucinations,” in the lingo of AI specialists. Can generative AI’s quirky nature be tamed for accessibility?
Speakers
- Saqib Shaikh, Microsoft (Seeing AI)
- Devin Coldewey (moderator)
SESSION TRANSCRIPT
[MUSIC PLAYING]
DEVIN COLDEWEY: Thanks, Alice. And I’m happy to be here at Sight Tech Global. Saqib, thanks for joining me here today. Since we last talked, AI has become much more prominent in the tech world, though not necessarily all in ways that contribute to accessibility. Maybe you can tell me first what you’re working on, and what you’ve worked on. And then maybe you can tell me what you think are the more interesting aspects of the new AI ecosystem that we’re all taking part in right now.
SAQIB SHAIKH: Oh, thank you so much. It’s a pleasure to be here. As you say, it’s an incredibly exciting time, especially when you think about AI and accessibility. So at Microsoft, I lead the Seeing AI team, where we look at emerging technologies and how we leverage them to empower people with disabilities, especially people who are blind or have low vision, in their everyday lives. And we’ve been doing this for a number of years. We made an app, which seems to be pretty popular. But as we look at the latest generation of AI, a lot of those early visions and ideas that we had, all those years ago, it really feels like we’ve taken this big step forward as an industry towards realizing them. So I remember when I started this a few years back, I would think, could AI be like a sighted guide? Someone who knows you as a human, who understands the visual world, and can fill in that gap by describing what’s around you at the right time, and maybe you could have a conversation with them. Like you said, AI has come a long way this past year. And we’re busy building towards that future.
DEVIN COLDEWEY: It feels like– you touched on this. But it feels like there’s a very stark difference, though, between a technology like ChatGPT, which we’re all now familiar with and how it can be useful to a blind or low vision person, and technology like Seeing AI, which is really designed from the ground up for specific needs. As someone who both builds and uses these tools, can you tell me why that difference is important?
SAQIB SHAIKH: I think whenever you think about developing these kinds of technologies, nothing about us without us. So when you build with the community, as we do with Seeing AI, then you can have these really rich, natural experiences tailor-made for this community. So ChatGPT is incredibly useful for someone who’s blind. Same with all the Microsoft Copilot products, which are coming out. With Seeing AI, we’re all focused in on what you can do through the power of your smartphone camera and the AI based on that. So it’s very much about, OK, what can the camera see? What can it tell you? And increasingly, what could you ask of it? So it’s chat of a different sort.
DEVIN COLDEWEY: So maybe you can tell us a little about what the latest frontiers are on the side that you’re working on. What are the latest developments in what you can ask about an image, what you can extract from an image, and the utility that you can get out of that?
SAQIB SHAIKH: Yeah. So with this next generation of what we call generative AI, it means we can leverage these large models, which can give us so much richer information. Up until now, we’ve been able to give you descriptions of an image, which could be a sentence. And it might, for example, tell you that it’s a man and a boy in a park. But now, we can have way richer descriptions that might be a paragraph, or multiple paragraphs in some cases, which can go into the detail about, OK, here’s a man, and this is what he’s wearing, and this is more about what he’s doing. And this is the boy. And there are trees of this sort in the background. And oh, there’s this visible as well. And there are mountains further in the background. And it’s a level of detail that I, as a blind person, never knew existed, let alone had access to. And I do remember the first time I encountered this kind of description, I was just blown away. It’s like a whole new frontier. And that’s just the beginning. When you’re a creator and innovation keeps going, it’s like, OK, this is what we have now, but I can’t imagine where it’s going to go next. That’s probably the biggest step forward. But then you mentioned also this idea of, can we now talk to our smartphone, whether it’s a photo, or in the future, a video, and get even more information? So maybe it can have all this information about the world. But now, it can be a two-way conversation. So you can pick out the things that are important to you in this moment in time. Sometimes, it’s easier to show than tell. So here’s a picture that I took with my son when we were in the aquarium. And you can see a description, how rich it is. And also, you can ask questions about it.
SPEAKER 2: What do the starfish look like?
SPEAKER 3: The starfish in the image have various appearances. Some are bright orange with long slender arms, while others are larger with a more robust structure and a mottled pattern of orange, brown, and white. There is also a starfish with a pinkish red color and a bumpy texture on its surface. They are all submerged in water.
SPEAKER 2: How many starfish are there?
SPEAKER 3: There are at least 15 starfish visible in the image.
SPEAKER 2: Tell me more about the pink starfish.
SPEAKER 3: The pink starfish in the image has a textured surface covered in numerous small rounded protrusions. It is located towards the bottom left of the image, partially overlapping with an orange starfish. The pink starfish has five arms, typical of starfish, and is surrounded by various other starfish and sea creatures in different colors and sizes.
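For readers curious how this kind of describe-then-ask experience can be wired up, here is a minimal sketch. It is not Seeing AI’s actual implementation; it assumes the OpenAI Python SDK and a vision-capable chat model, and the model name and image path are placeholders.

```python
# Minimal sketch of describe-then-ask image Q&A with a vision-capable model.
# NOT Seeing AI's implementation; model name and image path are placeholders.
import base64
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Read a local photo and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def ask_about_image(image_path: str, question: str) -> str:
    """Send one question about one photo and return the model's answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    photo = "aquarium.jpg"  # placeholder path
    print(ask_about_image(photo, "Describe this photo in detail."))
    print(ask_about_image(photo, "How many starfish are visible?"))
```

A real assistant would also carry the conversation history across turns, so that follow-ups like “tell me more about the pink starfish” resolve against the earlier answers.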
DEVIN COLDEWEY: And speaking of what’s important to you at this moment in time, I know that Seeing AI has built up into a collection of specialized models. And this one is specifically for navigating a room. Or this one is specifically for identifying a label, these everyday tasks. Is that model of having these different specialized models going to stay important, do you think?
SAQIB SHAIKH: I think, yeah. I think it’s going to be a hybrid approach. You have these large models, which are, by their nature, slower, but incredibly powerful. But then we have developed, over time, interesting user experiences that work in real time to help you line up a good photo, or to tell you the things you need to know in milliseconds, not seconds. And that’s going to remain important. But then it’s going to be a hybrid approach, where those things are paired with the rich descriptions, the Q&A conversation, et cetera, that the large models will give us. And, again, that’s going to change, because I say all this, and it’s really just a moment in time, and everything is probably going to become faster, more powerful, and so forth, as is the way with our industry. And that’s what makes it so exciting.
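As a rough, hedged illustration of this hybrid approach, here is a sketch with hypothetical placeholder functions rather than anything from Seeing AI: a small, fast model gives instant cues, and the large model is called only when the user asks for a rich description.

```python
# Hedged sketch of a hybrid approach: a fast on-device pass for real-time
# feedback, and a slow large-model pass only when the user asks for detail.
# All functions are hypothetical placeholders, not Seeing AI internals.
import time


def fast_local_detect(frame) -> list[str]:
    """Stand-in for a small on-device model; must answer in milliseconds."""
    return ["person", "chair"]  # placeholder output


def rich_cloud_describe(frame) -> str:
    """Stand-in for a large generative model; slower but far more detailed."""
    time.sleep(0.5)  # simulate network and inference latency
    return "A man and a boy stand in a park with mountains in the background."


def handle_frame(frame, user_wants_detail: bool) -> str:
    if not user_wants_detail:
        # Real-time path: cheap, immediate cues for lining up a photo or navigating.
        return ", ".join(fast_local_detect(frame))
    # Rich path: invoked on demand, when seconds of latency are acceptable.
    return rich_cloud_describe(frame)


if __name__ == "__main__":
    frame = object()  # placeholder for a camera frame
    print(handle_frame(frame, user_wants_detail=False))  # instant cues
    print(handle_frame(frame, user_wants_detail=True))   # detailed description
```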
DEVIN COLDEWEY: I’m curious. You were describing this, and being able to describe things with much more detail. And I’m wondering how that might work in media, TV, and movies, paintings. There’s so much more detail there that isn’t always hit on in the audio descriptions or the scene descriptions. It seems like it’d be so cool to be able to pause the video or whatever. And say, like, oh, well. What’s going on right now? Where is the character standing? What does the landscape look like? My mind is just like fizzing with the possibilities for this technology.
SAQIB SHAIKH: Yeah, absolutely. At Microsoft, there’s this term Copilot, and this idea that you’ll have an AI to assist you with every app or every product that you use. And I sometimes say that for someone who’s blind, Seeing AI is your Copilot for life. So if you’re watching a movie, we’re busy prototyping in the lab, but imagine that we could describe the video that you’re watching. When you’re browsing the web, what does it mean for a blind person to have help with a web page, or with completing whatever task they’re doing in whatever app they’re using on the computer? So spanning media, digital life, maybe even games one day, and, of course, the physical world. So I think there are so many possibilities here, combining all the different bits of AI that we have, whether that be speech, or vision, or understanding text. And I think all of this is going to come together. And, of course, this is not next month, probably not even next year. But things are moving so fast at the moment. And we’re really re-examining what’s possible. There are all these things which we decided, oh, that’s years away, we’re not doing that, which we have to now re-examine and think, what if we could?
DEVIN COLDEWEY: It is a hopeful moment. But there are also new risks. One of the most prominent risks we’ve encountered with generative AI is that, obviously, it approximates knowledge rather than repeating it, or may hallucinate details in a scene. It didn’t in any of the scene descriptions that I’ve seen, but there’s always that risk. Are there new risks and tensions being introduced by the inclusion of generative AI or large language models in the accessibility world here?
SAQIB SHAIKH: Yes, absolutely. And I think it’s always been true that AI, by definition, is this probabilistic view of the world. And that’s even more true of generative AI. I think one of the challenges is that when it’s wrong, it’s very, very confidently and convincingly wrong. And that’s why we’re being a bit cautious and developing with the community. So we have a bunch of testers who are working with us to make sure that these solutions work in all different scenarios for different types of people. But we do find hallucinations to be a problem, and we’re figuring out the workarounds and so forth, so that even if it’s a bit less capable, we have more reliable solutions. And when you’re asking open-ended questions, it definitely matters, because you’re relying on the AI. And you want it to be robust and reliable, or to tell you accurately when it doesn’t know.
DEVIN COLDEWEY: Yeah. And maybe we can go a little further into that side of working with the community. One of the first things you told me in this interview just now was nothing about us without us. That’s, obviously, a foundational aspect of developing with the accessibility community and for all different abilities and disabilities. How do you do that? Well, maybe you can tell us about your model for working with the community first, and then we can talk about how that might change when you’re working with other models or other companies.
SAQIB SHAIKH: Yeah. I often think that we are this bridge. We talk to the people who are blind or have low vision, the people who we wish to serve, on the one side. And then we’re talking to scientists, researchers, external companies, academia, on the other. And then thinking, OK, we have these emerging technologies, these new techniques, new models. But it’s not about the technology. It’s about the needs of the individual. What are the challenges? And how do we bring that technology to bear? So that’s the early ideation or development stage. But then as you get to that more polished stage, it’s working with people so that we can identify the issues, understand, is this really solving the user need we set out to solve? And where are the potential risks and pitfalls? And make sure we catch them early on, so we can find new solutions and new workarounds.
DEVIN COLDEWEY: And is that something that you think– are other companies doing that adequately? OpenAI seems like a bunch of very smart people. But it’s not exactly– ChatGPT wasn’t built specifically with the blind and low vision community in mind, even though it’s, of course, very useful to everybody. Do you think that that’s something that needs to be worked on, not just by OpenAI, but by anybody who is pursuing AI as a general purpose technology?
SAQIB SHAIKH: I think human-centered solutions are key. And as an industry, we’ve come a long way in understanding that. I draw this distinction, in a way, between a platform and an experience. Much of what a platform company does is produce building blocks, which other people are going to build experiences out of. And at some point, the platform could, in theory, be for absolutely everyone. And then the experience could be for everyone, or it could be, like Seeing AI, very targeted. So I think there’s that aspect: when you’re building the experience, that’s the point where you really need to make sure it is going to serve your audience well. But then there are products at Microsoft and elsewhere where everyone is your audience, and that’s where the good practices of inclusive design come in, making sure that you’re testing with a diverse audience that represents the breadth of people, and you don’t exclude anyone by accident.
DEVIN COLDEWEY: You mentioned when we were talking before, how, obviously, these large language models are extremely smart. But they are also primarily text-based. And they’re working on the multi-modal side of things. But you mentioned that GPT-4 can do a lot of things. But it can’t really help you navigate a subway station or get around, find a shop in a mall, or identify ingredients of a can of soup or something like that. Are these always going to need a Seeing AI type solution? Or do you think that the general models will reach out and subsume these use cases as well?
SAQIB SHAIKH: So now, we do have the multi-modal GPT, amongst other models and other solutions. And I’ve got to say, GPT’s multi-modal capabilities are very impressive. But I still think we need faster models and more specialized models as well. Maybe over time, I do imagine that there’ll be other similar multi-modal models, and maybe some will run on device, and so on, and so forth. So things are going to change. But at the end of the day, the way you stitch together the models, the way you fine-tune the models or prompt the models, that’s what’s going to make the difference. Because, like you say, there might come a day when the AI can help you navigate and find the shop, find the platform in the subway station. We’re really not there now, partly because of the technical limitations of these models, and partly because it’s one thing to tell you what’s there; it’s quite another to have the end-to-end interactive real-time experience that’s not going to distract you from navigating with all your other capabilities. So I’m cautious to say it’s not possible today, but who knows what’s possible, even in a few months or a year or two.
DEVIN COLDEWEY: Yeah. It’s developing so fast. It’s almost scary. It’s not that scary, because the capabilities are so interesting and so helpful. So thinking, like, oh, well, we can’t do that today? Just wait a couple of months. There’ll be a new paper. There’ll be a new model. Obviously, you mentioned this before. Microsoft is hugely invested in this space. They’re partnering with OpenAI. They’re doing their own stuff. The Copilots and such, those are all coming out. They’re going to be in Windows. They’re going to be in Office. They’re going to be in all these productivity things. I think it’s interesting because all users will benefit from this. If you want to use this, it could be useful. But people who have trouble interacting with the visual GUI in the first place, where there are insufficient accessibility tools, it feels like they’ll have a supercharged situation. I’m hoping that that’s the case, anyway. How do you see that playing out? And is there any particular capability you’re looking forward to in the Copilot space?
SAQIB SHAIKH: Yeah. And this is, again, we’re still relatively early on in this stage of Copilots. But as someone who’s blind, I’m very excited by the possibility of having a Copilot that might tell me, OK, this document I’ve written, is it formatted well? And if it has knowledge of the visual appearance, then could it do that? These are just ideas, not necessarily things that are possible today. Or if you have huge spreadsheets with loads of data, can we get your Copilot to help you analyze all the data, or spot the trends, and so forth, or even to generate content, generating images or reports? These are things that everyone is going to be using. But I think for people with disabilities, there will be unique types of tasks which will be really helped by having your personal Copilot to help you accomplish them. And this is cool. Because when you have one user interface, there’s this assumption that the designer, the developer of the interface, knew exactly how everyone wants to use their product. But in this AI world, you can actually have different types of people, all with different needs, different styles of using the system, and the AI can answer their questions and provide the information in whatever way is right for them. And today, the fashionable way of doing that is via this chat interface. Who knows what it’ll be tomorrow.
DEVIN COLDEWEY: Yeah. Truly, nobody does. But, of course, it has to be guided– as you mentioned before, it has to be guided by the users, and by the community, and by feedback. As an early user of it, do you have any feedback on it? And do you expect that it will be guided by– supercharged itself by the feedback from blind and low-vision users?
SAQIB SHAIKH: Yeah. I absolutely hope so. So I’m enjoying it. I’m still at that stage where I’m exploring what’s possible. And I provide my feedback. But as a group, as a community, I’m sure we will have many tips and tricks and ideas for each other of how to use the systems, but then also for the creators, like Microsoft, that will influence the next generation.
DEVIN COLDEWEY: So tell me, we’ve talked a lot about what we’re hopeful about. Maybe you can get a little bit into what you feel most skeptical about: the claims, or the tools, that you think may not live up to the promises. Is there anything in particular that you’re worried about or more skeptical about?
SAQIB SHAIKH: I’ve got to say, I’m a little bit skeptical because there’s so much hyperbole at the moment. It’s like, wow, AI is going to change everything. And I meet scientists, who maybe a year ago were hard at work on this model or that technique, and now they’re like, oh, computer vision, it’s a solved problem. And I’m like, it’s not. As a blind person, I could list you the first 10 things I do in my day, and I can think of dozens of ways to make each and every one of those more productive and better. Are we going to say these large models are going to solve everything? Are we saying, oh, we’re approaching AGI, artificial general intelligence, and then it’s just going to solve everything? No. I still feel we’re a really long way off. So on the one hand, I’m so optimistic, but on the other hand, and this is just my lived experience, of course, there are so many users in the world with their own experiences. But if I just pick the example: I get out of the Uber, and I’m going to try and find the door to the building. And now, I’m going to find my way to the cafe and join the end of the queue, and know what to order, and go and take my tray and find an empty table with my friend and say hi. Is AI there today? It can probably do each of the building blocks. But until we get to this AGI future, which I think is much hyped, and which we’re probably still a ways away from, there’s still a lot to be done. So I am very optimistic, but skeptical of some of the hype.
DEVIN COLDEWEY: Yeah, absolutely.
SAQIB SHAIKH: I think there’s also a lovely thing, which is that equity is important. We have to remember that not everyone has access to super-fast internet, and the latest AI, and the latest everything. We want to make sure, and this has been a focus for us over this past year as well, that just the basics, being able to read well or recognize products, all those things, are done even better than before, and that we can get this out to more people. So there are so many people all around the world. And we’d love to bring generative AI to everyone. But even if we can bring some subset of this, I think we could help so many people in their daily lives. So just a few days ago, we launched Seeing AI on Android. And I’m very excited to think how that can get out onto much, much cheaper phones. And in the upcoming weeks, we’re going to be launching a whole bunch of new languages, to about 36, I believe. That’s the other side of what I’m doing: thinking about, OK, let’s really push the edge of what’s possible, but then let’s scale that benefit out to help as many people as possible. So it’s not so much that I’m skeptical of all the amazing new stuff. It’s that I want to make sure it’s available to everyone.
DEVIN COLDEWEY: Yeah, absolutely. That equity piece is super important. The languages, like, I think you mentioned something like 36 languages. That expands the scope to so many people who, I’m sure, would love to use the app, or use it in certain situations, and can’t because, well, they speak Portuguese. That’s just how it is. And now, it’s available. I don’t know if it’s available in Portuguese, but, hopefully, it will be soon. But that is such an expansion. Like you said, you don’t have to push at the very frontiers to enable more and more people. I saw a wonderful model that was being built by some Swiss students. Literally, all it does is, when you walk into a room, it helps you find an empty chair. There’s no other task. And it’s such a simple thing. But, of course, anybody should be able to do that. And they were just like, well, this is a common problem. Let’s solve that. We don’t need to worry about the cutting-edge stuff. Let’s just see if we can make everyday tasks easier. So we only have about a minute left. I’m curious. There’s one last thing I wanted to ask you about: self-driving cars. This is just one of these areas that I think is a cutting-edge technology, super interesting, super focused on accessibility in a way. But there’s been a lot of movement on it recently. And I’m curious where you think we’re going with this.
SAQIB SHAIKH: I’m just going to say, yes, please. [LAUGHING]
SAQIB SHAIKH: I really want this to happen. And in the same way as the work we do on Seeing AI, I definitely want that AI sighted companion as I’m walking around. And then I want to get in a self-driving car and be taken across town, too.
DEVIN COLDEWEY: And then you get out and do all the things that you mentioned. All the pieces will fit together. I think it’s a wonderful vision. And I think that you are accomplishing it with Seeing AI with the help of all these other technologies. Anything else? We’ve only just got a few seconds left. But is there anything else you want to talk about that’s coming out in the next week or two?
SAQIB SHAIKH: No. It’s really just that we’re building this one step at a time. And I think it’s a perfect time. AI is getting better. We’ve been thinking about wearables for years now, and wearables are entering the mainstream now. Personalization really helps us have human-centric AI. You put all this together, and this idea of the AI that just describes the world for you, we’re really closer than ever before. It’s an exciting time. And we’d love to hear from anyone, seeingai@microsoft.com.
DEVIN COLDEWEY: Well, thank you very much. And with that, I think we can wrap it up. Thank you, as always.
SAQIB SHAIKH: Thank you so much.
[MUSIC PLAYING]