Seeing AI: Where does Microsoft’s blockbuster app go from here?
With ever more powerful computer and data resources available in the cloud, Microsoft’s Seeing AI mobile app is destined to become a steadily better ally for anyone with vision challenges. Co-founder Saqib Shaikh leads the engineering team that is charting the app’s cloud-enabled future.
DEVIN COLDEWEY: Thank you, Saqib, for joining us. First of all, I think I’d like to start by getting a sense of the origins of Seeing AI. There’s been a lot of experiments in assistive computer vision and things like that over the years that I’ve seen, but very few I feel have actually shipped. So I’d like to hear from you what it was like taking this idea from the lab to a full-blown product.
SAQIB SHAIKH: Yeah, I’ve always been interested in artificial intelligence since I studied it back at university, and so always tinkering, making things. But it was all the way back in 2014, where there was a company-wide hackathon where you could spend a week working on anything you wanted to. And so I started talking to researchers across Microsoft, what were the bits of AI that were ready that we could put out there.
And you know, that project has many side projects, too. It ultimately was something quite different, but it was the beginning, the inklings that something could be possible. And you know, I kept working on this on the side. More and more people got interested. The next year, there were a bunch of colleagues in Silicon Valley who really brought a lot of power to the table.
And the technology kept getting better. And so this was still very much a side project, but eventually, it got the attention of the CEO, Satya Nadella, and in 2016, we put out our first prototype and were on stage with Satya at the Build conference. So that was an incredible moment. That was a real turning point to making this side project into a product.
DEVIN COLDEWEY: Absolutely. And I remember that moment, and I thought it was very interesting to see that on the main stage, as well. So could you tell me– obviously, you’ve worked on lots of products over the years. Can you tell me how the development process for this one has differed from some of the more conventional products that you’ve shipped?
SAQIB SHAIKH: On a personal level, this is something I’d wanted to do for years, but never thought the technology was possible for. So it’s a passion project. I kind of feel this is my life’s work now. And it doesn’t feel like a job, almost.
I think one of the big differences is working on something end-to-end with the community, because each of the developers, everyone on the team, we’re always looking at what are the customer challenges, what are the emerging technologies we can bring to the table from around Microsoft and beyond, in academia as well. So unlike other products I’ve worked on, for example Search or Personal Assistant or other sort of products at Microsoft, this is something that I’ve been deeply involved with from the beginning, very much like it was a hackathon, a side project with a startup-y type feel, and it’s really quite different to any of the other roles I had previously in my 10 years at Microsoft before starting Seeing AI.
DEVIN COLDEWEY: I’m glad you mentioned the community, because I’m always curious, with a product like this, it’s so dependent on the community and advocacy organizations and people who have organized themselves in order to produce the product that you want to produce. How did you involve the community and organizations that are also working on this kind of thing, or would like to help test that sort of thing?
SAQIB SHAIKH: Yeah, to your point, just something like this, you can’t do it in the lab in isolation. This is something where we have to continually be listening to the community, either indirectly through reading and listening to the various media, but then also very directly, too, whether that’s interacting on the internet or getting people into our lab for user testing or beta testing over the internet. You know, there’s so many different forms that takes.
And of course, prior to the pandemic, we were always at as many conferences as we could, having those in-person user studies and conversations and hallway chats. So in every opportunity, for me and the rest of the team, we’re looking at how can we learn from users, our customers, find out what the challenges they most want solving are.
And then, like I said before, we have so many computer vision and AI scientists around the company that we can then talk to to figure out what are the emerging technologies, the latest research. And then I feel our job is bringing those two worlds together, finding ways that we can leverage the emerging technology to solve these real problems and put the solutions out there, and then learn and iterate, yeah.
DEVIN COLDEWEY: Well, that dovetails exactly into what I was going to ask about, was the app itself. How did you choose the specific capabilities that are included in Seeing AI? I know the list is increasing, but how are you choosing and prioritizing those modules, and how did you design it for maximum accessibility?
SAQIB SHAIKH: In terms of choosing the functionality, again, it’s this intersection. What is technology capable of doing today, and then what are the needs that we hear about most frequently, whether that’s direct feedback or trends that we’re spotting in industry. And that’s very much how we prioritized our initial list. And then since then, once you have a product out there, you just hear what people wish it could do, what people have difficulty with.
And that is a key influencer as we choose the next generation. And then as I mentioned, the other side of this is what is the state-of-the-art technology capable of doing. So we’re looking at a few new areas based on new hardware, new software capabilities on modern platforms, and looking at how we leverage them to solve problems that we couldn’t till now.
DEVIN COLDEWEY: The software and hardware thing you mentioned, how much do you think we can credit to advances on the software side, you know, better machine learning training, better databases, and how much can we credit the improvements to hardware, whether it’s integrated stacks on the system on a chip or just better clock speeds? How much of the advances that we’ve seen are due to each side of that equation?
SAQIB SHAIKH: You really can’t separate them. I think the hardware, the software, and indeed the data which AI relies on, these are all the key essential parts of making today’s really powerful AI work. Without any one of these, it couldn’t happen. So you need the great compute in the cloud for training larger models. You need– like today’s iPhones, in particular, have really fast GPUs for running on device models.
And you have other elements of the hardware, like the sensors which enable us to do augmented reality type experiences. And I’m very excited by the new LiDAR that is in the iPhone 12. And then yeah, as I mentioned, data, that’s the other part, where there are these open data sets, and we’re working with universities and across Microsoft and externally to make sure that we have these new data sets available.
DEVIN COLDEWEY: I think that scene description is probably the newest and fanciest of the features you’ve included there. And that definitely falls into the combination of all these factors coming together. Can you tell me a little bit about how scene description came about, how you developed it, and why you decided that it was ready for primetime?
SAQIB SHAIKH: Scene, like the–
DEVIN COLDEWEY: The scene– sorry, the scene recognition.
SAQIB SHAIKH: Got it, yes. So that’s something that we had at the beginning. And from day one, the feeling was this is where the future is. And it’s still fairly rudimentary, but the technology took a big step forward these past few months with the researchers at [INAUDIBLE] Research generating, taking some of the state-of-the-art algorithms, and then making it even better.
And they got to the leaderboard on one of the challenges, which is the [INAUDIBLE] challenge. And so the model they’re able to generate for that is a big step forward from what we had before. And we worked with a team to put that in Seeing AI, but then we’re also continuing working with them to improve it even further for people who are blind.
DEVIN COLDEWEY: Yeah, it’s obviously– with something like that, with a very large model and a broad set of use cases, there’s always the question of whether to do it at the edge or in the cloud. And obviously, you guys have chosen to do it in the cloud. Can you tell me why that choice was made, and whether we can expect an offline capability like this in the future?
SAQIB SHAIKH: We’re always evaluating those things. So if you have something running on the edge, it means that you can get a much smoother user experience, potentially, because everything’s happening so fast, in near real time. And so we always prefer that where possible. But then we balance that with can we get much better accuracy if we were to go to the cloud, where you have the even greater computational power.
And so when we looked at something like image captioning, scene descriptions, we decided that, while we could compress a model and get it running on the phone, we felt the quality we would get in the cloud was good enough to warrant that roundtrip and the slight wait to have that done.
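The trade-off described in this answer can be sketched as a simple decision rule: prefer the on-device model for responsiveness, and accept the cloud round trip only when the accuracy gain is large enough and the wait stays tolerable. The following is a minimal illustration, not Seeing AI's actual logic; the `Backend` type, the `choose_backend` function, and every threshold and number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """A candidate place to run a model (all values here are hypothetical)."""
    name: str
    accuracy: float    # fraction of acceptable results, 0.0 to 1.0
    latency_ms: float  # typical end-to-end response time


def choose_backend(edge: Backend, cloud: Backend,
                   min_accuracy_gain: float = 0.05,
                   max_wait_ms: float = 2000.0) -> Backend:
    """Default to the edge; use the cloud only if it is clearly more
    accurate and the round trip stays within the acceptable wait."""
    gain = cloud.accuracy - edge.accuracy
    if gain >= min_accuracy_gain and cloud.latency_ms <= max_wait_ms:
        return cloud
    return edge


# Illustrative numbers only: a compressed on-device captioner versus a
# larger cloud model that is slower but noticeably more accurate.
edge_model = Backend("edge", accuracy=0.70, latency_ms=50.0)
cloud_model = Backend("cloud", accuracy=0.90, latency_ms=800.0)
print(choose_backend(edge_model, cloud_model).name)
```

With these made-up figures the rule picks the cloud, mirroring the choice described for scene descriptions; shrinking the accuracy gap or lengthening the round trip flips the decision back to the device.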
DEVIN COLDEWEY: One of the things that is actually done locally a lot of the time now is audio, both in natural language processing and in voice synthesis. I feel like we’ve had a lot of major advances in voice synthesis, but they haven’t really– it feels like they haven’t really trickled down, and we end up with kind of robotic-sounding voices. Can you tell me a little more about where that fits into the– where the new synthetic voices fit into your equation for better accessibility, and whether you plan on integrating improved synthetic voices into the app?
SAQIB SHAIKH: Yeah, that’s a key example of what I was just saying about when do you use the cloud versus on a device, on the edge? And with speech synthesis– excuse me. My lights are turned off. So I’m just going to wave my hand.
DEVIN COLDEWEY: [LAUGHS]
SAQIB SHAIKH: OK. So when you think about speech synthesis, there are different requirements. If you want something just read out loud, then in the cloud, there are some really, really high-quality– what we call neural text-to-speech voices. And they’re so much better than the voices we’ve been used to in years gone by which are more synthetic.
However, they’re really big and computationally heavy. So on the one hand, they’re not going to fit on an edge device, like a phone. But also, you could argue that for people who are blind, it’s preferable to have the smaller, snappier voices. Because their response time really matters. When you’re swiping through, when you’re tapping something, you really want it to speak intelligently at high speeds and to respond very, very quickly.
So it’s a different use case. So I do hope that the quality of those smaller on-device models– they’re going to get better over time, for sure. But today, if you want human-like speech, then those neural– synthesis in the cloud is definitely way better. But it’s just different use cases.
DEVIN COLDEWEY: Right. And as you sort of mentioned before, it’s important that the product reflects the use cases that it’s being put to. And that’s why you want to include the community so much. I’m curious about how much the product has been shaped, how much Seeing AI has been shaped by the feedback that you’ve received. Have you found that the testers have used it in unexpected ways and that’s led to new features or new priorities?
SAQIB SHAIKH: Definitely. We’re always interested in how people are using the product. And so often, that is surprising. We introduced personal identification a while ago. And at that moment in time, Seeing AI did not recognize currency.
And we found that, actually, some users had started to train the facial recognition system to recognize presidents on the US banknotes. So the president would be $10 or whichever. And that was such a creative use. And then we heard from other people as well. But that very much prompted us to add native recognition of currency bills.
DEVIN COLDEWEY: I think you mentioned this has been sort of a passion project for you. And you’ve personally contributed a lot of work to it. But it feels like not every product needs a personal champion like that. Like for Skype, or Office at Microsoft, those are just kind of– people work on them. Do you think that projects like Seeing AI need a personal champion like you to sort of survive and thrive in the corporate ecosystem?
SAQIB SHAIKH: I think there is this sort of life cycle of a product. I ultimately would hope that many of these accessibility features just become part of the mainstream. They get built in so that things which are– we pioneer. We put them out there for the first time. But then they get built into other products, both Microsoft but also across the industry.
However, I think those very early ideas need strong champions. And you do need someone to say, here’s an idea that we’re passionate about, that we really want to run with, and we’re going to involve our community much more heavily than you can when it goes to a larger audience. So I think there’s definitely room for both. And it really comes down to that lifecycle of an idea, the lifecycle of a product.
DEVIN COLDEWEY: And how did you find– or what did you find when you were sort of embarking on this work in 2015, 2016? What was the work that really had not been done? Because I know you mentioned, of course, Microsoft is really big into computer vision and natural-language processing already. I’m sure they had lots of interesting projects in the works.
But when you came to it and you were saying, well, we should make something like this, what was the work that really had not been done yet that you needed to embark on?
SAQIB SHAIKH: I think there was a lot of great science out there, a lot of research projects. However, I see our job as sort of being that intersection of AI and HCI. You really need this smooth user experience, which– well, first of all, you need the AI. But then you need to listen to the customers. What problem are you solving? And then that experience which enables someone to operate the AI in real time to accomplish the desired task.
So we found that there were a lot of people working on the algorithms and with great ideas on what the machine learning and the deep-learning models would be like, what they might do. But that’s all based on these sort of very academic metrics.
Then when you talk to users, you realize, oh, there’s a problem they want solving in their real, everyday lives. And so how do we productionize those models? Instead of taking a minute to run, they take milliseconds to run. And then once you’ve got this real-time stream of information, what are you going to speak? How are you going to– So I remember, in the old days, some of the questions were, how do you get someone who can’t see what’s in the camera frame to frame a good photo?
So, OK, we need some audio or spoken guidance to help with that. And once you’ve taken a photo, how do you convey the quality? And how do you know when to speak? Do you want to speak more or less? And these sort of twists on the machine learning, which you only do when you are developing the experiences alongside with the machine-learning algorithms. And the two must go hand in hand.
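The framing problem mentioned here can be illustrated with a small sketch: given where a detected object sits in the camera frame, produce a short spoken hint that helps the user center it. This is only an illustration under stated assumptions, not the app's implementation; the `framing_hint` function, the normalized bounding-box format, and the tolerance value are all hypothetical.

```python
def framing_hint(box, tolerance=0.15):
    """Return a short spoken hint for centering an object in the frame.

    `box` is (left, top, right, bottom) in normalized image coordinates,
    where (0, 0) is the top-left corner and (1, 1) the bottom-right.
    The format and the tolerance are assumptions made for this sketch.
    """
    left, top, right, bottom = box
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2

    hints = []
    # An object left of center is brought to the middle by moving the
    # camera left, and likewise for the other directions.
    if center_x < 0.5 - tolerance:
        hints.append("move left")
    elif center_x > 0.5 + tolerance:
        hints.append("move right")
    if center_y < 0.5 - tolerance:
        hints.append("move up")
    elif center_y > 0.5 + tolerance:
        hints.append("move down")

    # Say as little as possible once the object is close enough to center.
    return ", ".join(hints) if hints else "hold steady"


print(framing_hint((0.0, 0.4, 0.2, 0.6)))  # object near the left edge
```

A real experience would also have to decide how often to speak, which is the "more or less" question raised above; this sketch leaves that out.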
DEVIN COLDEWEY: Absolutely. And how do you think that companies can nurture this kind of work that, like you mentioned, is already going on in the companies? It’s going on in their R&D departments. But how can they nurture the kind of integration that needs to be done to turn these things from experiments into products or something useful that could be utilized by others?
SAQIB SHAIKH: I think it’s twofold. I think diversity of your workforce really, really matters. And amongst people with disabilities and the blind community in particular, there’s a much lower employment rate. And by fostering a more inclusive workplace and a more diverse workforce, you can just have– in this case, say, people who are blind– the more people there are in the company with different perspectives, different ideas, different needs, that’s naturally going to trickle into your products. So I think diversity really matters.
But then on top of that, it is embracing the grassroots efforts. So we have The Garage at Microsoft, or we have the annual hackathon. And that was really something that gave me the opportunity. These were always side projects, but it gave me the opportunity to bring my whole self, all my ideas, to work, rather than thinking, accessibility is just this little thing I do. It suddenly became, I’m going to spend this week and go big with the biggest idea I have around this.
So yeah, get the people into the organization. And then give everyone the ability to share their ideas and work on these sort of hacks to come up with new ideas. And as we’re seeing now, grow them into products with users, and to keep going and iterating.
DEVIN COLDEWEY: And what do you think are some of the big ideas that you think may be possible now or in the next year or two that weren’t possible three, or four, or five years ago?
SAQIB SHAIKH: With Seeing AI and this generation of technologies, there’s been a lot of work done on 2D computer vision, so recognizing photos. One of the things that we don’t yet have is recognition of a sequence of photos– in other words, video– so that you have a concept of what happened over time. So I don’t know that that’s there in the next year or two. But that’s something we’re always talking about. We’re always seeing, how do we push the state of the possible there?
The one that is much closer is augmented reality, which people often think about in terms of a very visual experience, like displaying holograms. But when I think of augmented reality, I’m really, really excited. And with Seeing AI, we’re just about to launch our first experience here where we can actually track the world around you. We can let you know about things in 3D. So it’s going from these 2D experiences to real-time 3D experiences.
And so you can hear in space where different objects are. You can use new sensors, such as the LiDAR that Apple recently put on their phones, to get an element of depth as well. So these new sensors, this 3D-world understanding, that is really, really exciting for the next year, for me.
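One way to picture "hearing in space where different objects are" is stereo panning: an object's direction sets the balance between the left and right ears, and its distance sets the loudness. The sketch below uses constant-power panning with a simple inverse-distance falloff. It is an assumption-laden illustration, not a description of how Seeing AI renders audio; `stereo_gains` and its parameters are hypothetical.

```python
import math


def stereo_gains(azimuth_deg, distance_m):
    """Return (left_gain, right_gain) for an object at a given direction.

    azimuth_deg: -90 (fully left) to +90 (fully right), 0 straight ahead.
    distance_m:  distance to the object, e.g. from a depth sensor.
    Constant-power panning keeps perceived loudness steady across the
    arc; loudness then falls off as 1/distance beyond one meter.
    """
    azimuth = max(-90.0, min(90.0, azimuth_deg))
    # Map -90..+90 degrees onto a pan angle of 0..pi/2.
    pan_angle = (azimuth / 90.0 + 1.0) * math.pi / 4.0
    attenuation = 1.0 / max(distance_m, 1.0)
    return (math.cos(pan_angle) * attenuation,
            math.sin(pan_angle) * attenuation)


left, right = stereo_gains(azimuth_deg=-45.0, distance_m=2.0)
print(f"left={left:.3f} right={right:.3f}")  # louder in the left ear
```

Production spatial audio typically relies on head-related transfer functions rather than plain panning, so this conveys only the basic idea of mapping position to sound.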
It’s fun to think about the future. But I suppose one way to think about this is by looking at the past. I often think that disability can be a driver for innovation. And we can look back at so many of the innovations that we’re depending on today, from speech assistance, speech recognition, text messages, on-screen keyboards, or the touch screen– they all have their origins in these challenging problems of someone with a disability and someone, an innovator, coming together in a partnership.
So a way of thinking of the future is to think, OK, what are the challenging problems we can solve today for people with disabilities? And at Microsoft, we’ve got this $25 million grant program in AI for accessibility. So we’re partnering with organizations who are innovating in this space.
And there’s so many great examples of end-user solutions. But a key part of this is also the open data sets. Because I think a key part of this future is, how can we make the data more representative of people, more inclusive of people with disabilities? Because it’s not only– the algorithms can be trained on these general-purpose data sets. But if you can have these inclusive data sets, that’s going to create more inclusive apps.
And when we look at the infrastructure required to make all this happen, like with Azure, we have the intelligent cloud with these really large-scale models being developed in a supercomputer. And we’ve seen with image captioning, with language understanding, computer vision, speech recognition, these really powerful models have been coming online over the past few years. And that’s going to increase.
But the other part of that is the [INAUDIBLE] where these internet-of-things devices are becoming much more powerful, and you have this network of sensors with AI on the edge. And if you put these together with 5G and fast ubiquitous networks in the middle, then I imagine this world where the AI is– where you have this sense network observing the world around you. And local intelligence connects to the cloud intelligence.
And so bringing it back around to Seeing AI or other applications for people with disabilities, I imagine, what if we had AI as a metaphorical friend on your shoulder, whispering in your ear, thinking, OK, what is happening in the world? What can I observe? What can the sense network tell me about the world? Build up this model.
And then build up the model of the individual and personalize it. So you’ve got this ability of– people are expected to train the system as well. And kind of bringing this all together so that like when you’re walking with a sighted friend or a family member, it knows what’s of interest to you. And it says, hey, this has changed, this is new.
And I think that’s a really exciting possibility. And yeah, it’s a big vision. We’re really excited to be working on this one step at a time.
DEVIN COLDEWEY: Certainly. And something else that is sort of peculiar to this moment in time is the pandemic. Which I have to ask, has it affected the development of the products you’ve been working on, the development and deployment of Seeing AI features? And do you think that it has changed the way that we think about our technology and interact with it?
SAQIB SHAIKH: Yes. This COVID-19 pandemic has impacted so many people and brought up so many challenges to so many around the world. And on a personal level, I feel so fortunate that the technology enables us to keep doing our work. But as I talk to the blind community and to colleagues around Microsoft, we’ve realized that people with disabilities are really being impacted by this.
And amongst Seeing AI users, we hear that there are so many new challenges. And so many of sort of the workarounds we’ve developed over the years simply do not apply in this new social-distanced world, whether that’s just getting ad-hoc help from someone, stopping someone in the street or the corridor, or holding onto an elbow.
And there are new challenges like, how do you social distance in elevators or when you’re walking around a store? Maybe you got human help before. Or going back to the workplace– there are so many of these new challenges.
And actually, in the same way that Seeing AI started as a hackathon, this year, I’ve been pulling together people from around the company to think, how do we tackle this in a responsible way, that we can really tackle some of these big problems which are affecting our community? So yes, this is a long-term thing. But I hope that everyone is safe out there. And necessity is the mother of innovation. So I hope there’s a lot of innovation that comes out of this, too.
DEVIN COLDEWEY: And let me just sort of drill down very slightly. Because I think there’s probably a lot of developers who are interested in this as well. How, specifically, have you managed to do the kind of testing to get the kind of feedback that you normally would get in the lab? How have you gone about getting that during these times? Because I imagine there are lots of people who are working on apps who just wish they could get even three or four people together to touch their app in real time. How have you been doing it?
SAQIB SHAIKH: Well, different techniques. So we’re working on quite soon releasing features involving the LiDAR on the iPhone 12. So the minute it came out, we were trying to find members of the community who have a LiDAR device who can help test and give feedback, and having those much more hands-on remote discussions.
But then you often want to see someone’s first experience there before they’ve read anything, before they’ve had to do anything. And so then we’ve been looking at, OK, how do you do screen sharing on a mobile phone remotely instead of just watching what someone’s doing? And elements like that.
So it’s very much an extension of– we’ve always done testing with people over the internet. It’s just that once that becomes your only way and you need people who’ve got the new hardware, then it becomes even more important.
DEVIN COLDEWEY: I suppose it has also highlighted a lot of the shortcomings that modern operating systems have and how inaccessible they can be to people. But I guess my last question would be, sort of along those lines– Microsoft, you’ve worked there for quite a while. How do you feel that Microsoft has changed its approach and its philosophy about accessibility and disabilities in the time that you’ve been there?
SAQIB SHAIKH: It’s changed a lot for the better. And it was always good, don’t get me wrong. But I think it’s really been quite impressive. For a number of years now, things have been much more built into the culture. There are systems and processes, but more than that– maybe 15 years ago. And back then, I don’t think that everyone knew about it. Well, definitely, people– not everyone knew about accessibility. But now I feel that there is just a much greater awareness.
And it’s much more just something that every team does by default. It’s no longer this thing we ought to do. It’s a thing that is our responsibility and that we do do. So there’s definitely a long way still to go. And I’m not going to say that things are perfect. But as I look across such a huge company, I’ve definitely seen big strides over the years.
DEVIN COLDEWEY: Absolutely. Well, I’m glad to hear it. And I think that that wraps up our time. So I just want to thank you for joining us today. This has been an extremely interesting conversation. And I want to thank you just personally for what you do. It seems very, very, very important to me.
SAQIB SHAIKH: Thank you very much. It’s been a great opportunity. I appreciate that.