Computer vision, AI and accessibility: What’s missing from this picture?

DESCRIPTION

For an AI to interpret the visual world on behalf of people who are blind or visually impaired, the AI needs to know what it’s looking at, and no less important, that it’s looking at the right thing. Mainstream computer vision databases don’t do that well — yet.

Speakers
- Danna Gurari, Assistant Professor, University of Colorado Boulder
- Patrick Clary, Product Lead, Accessibility Engineering, Google Research
- Cecily Morrison, Microsoft Research
- Moderator: Roberto Manduchi, Professor of Computer Science and Engineering, UC Santa Cruz
SESSION TRANSCRIPT

[MUSIC PLAYING]

ROBERTO MANDUCHI: So this is Roberto Manduchi, UC Santa Cruz. I’m the moderator of this fantastic Computer Vision and AI Panel. I have my dream team in front of me. And I am very excited by that. So we are talking about computer vision. I like to call that the art of teaching computers how to interpret images and details.

And of course, we have it pervasive nowadays. We can get images with a smartphone, maybe wear a camera. We have cameras on cars. We have images on the internet and social media, news, everywhere. Now in terms of applications of computer vision in access technology, I think that we have all become familiar with things like OCR thanks to applications like KNFB or Seeing AI.

So my question to you guys is what is forthcoming? What do we expect to see pretty soon? And I’m talking about both the low hanging apple and the moonshot in terms of applications of computer vision in accessibility. And I want to start with Patrick.

PATRICK CLARY: Yeah. Thank you, Roberto. I think there’s lots of promise here. In addition to traditional use cases like OCR, I think there’s these cases involving kind of not only understanding digital content, but also understanding elements of the physical world, so not only text that you might see on a piece of paper, but also text in the real world around you, which is pretty much everywhere.

Some other use cases that are on our team that are really of interest also are around understanding people, so understanding emotions that people may exhibit, facial expressions, descriptions of people, and things like that. And some other interesting use cases are around navigation, so helping someone get from point A to point B or for instance finding a gate in an airport that they want to go to.

So a lot of these have active research around them. But they’re pretty interesting use cases that data is currently being collected for. People are researching the right way to create models that can help detect some of the elements in these use cases and what the right UX and user flows might look like for these.

ROBERTO MANDUCHI: Very nice, exciting. So Cecily, what about from your team? What do you guys see coming?

CECILY MORRISON: Yeah, I think, as Patrick said, this view that we’re going to start to be able to understand the world. In fact, we can with a lot of the computer vision we do now start to really understand the world. And then the question is, how can we make that into a coherent user experience?

And the two things that we’ve been really focused on in our team is thinking about moving from a world where we have an image and then we have an interpretation of that image to an experience that as a blind person or a person with low vision, as you move through the world, the right information that you need is given to you that’s interpreted.

Now that’s, of course, very tricky, because each person is very individual. So what you might like and what someone else might like might be very different. You don’t want too much information. You just need the right information at the right time. But I think the real key is to move to a place where that information can be readily available in space, in time, personalized to you. And I think we’re not very close, but we’re not very far either.

ROBERTO MANDUCHI: Nice, the right information at the right time. I like that. Danna, you have been also working on exciting applications. What do you think?

DANNA GURARI: So we have been working on, I would say, three key applications. One, building off of Cecily’s statement about image descriptions, is describing images, but thinking about how to describe them appropriately for different situations.

For example, on a shopping website, someone might want to know for a shirt if it has a pocket and what kind of embroidery there is and what kind of colors are going in. And that information is not obvious to someone who’s blind who’s navigating the web.

And in another use case example of descriptions, people can want to know it for social media. So being more engaged in this kind of framework that, in modern days, we use for engaging with each other and building relationships. A second use case is still related to, for example, social media. And that is actually helping people to take high quality images.

Someone who is blind is going to have a hard time to come up with the right focus, the right lighting, the right perspective. And so designing algorithms that can tell someone how to navigate their camera so that image can be higher quality, if you will.

And that does matter to people who are trying to post their images to social media. And a third use case is called visual question answering, so not only describing an image, but actually answering a question about an image.

So for example, does my shirt match my pants in the morning would be a natural question you might ask, or when taking food out from the refrigerator, what flavor of yogurt is this, to know what you’re going to eat. So those are three use cases that I think are coming in the pipeline and hopefully coming to real users sooner rather than later.

ROBERTO MANDUCHI: Beautiful, beautiful, fantastic applications. Do you guys have a thought about how can we ensure that whatever we are going to build, whatever type of application, be it on a smartphone, be it on another device, can be used for everybody?

Everybody will be able to access that. You don’t have to pay a lot of money to get it. You don’t need to spend a lot of money to purchase something. You have access to all you need. You don’t need constant high bandwidth internet. Any thought about that? Who wants to pick it up? Cecily.

CECILY MORRISON: I think this is a really interesting challenge. And I know as someone coming from Microsoft, one of the challenges that we’ve taken on in the mainstream is making sure that whatever we build works across multiple devices.

And it’s certainly been something that we’ve been thinking about. How do you make an experience work on a phone that might also work on a wearable? It’s not an easy problem to solve. But it’s a very important problem to solve and certainly one that we’re thinking about.

ROBERTO MANDUCHI: Right, indeed. And of course, the fact that many things are now in a smartphone app makes it more accessible for everybody and, I would say, more equitable, provided everybody has a phone that they want. And I would say that probably the vast majority of people who are blind nowadays are using iPhones. Any thought about other segments in the world, where not everybody owns an iPhone 12?

PATRICK CLARY: Yeah, just a thought on there, I guess when we think of building for everyone, it’s important not to just think of the Western countries, but also elsewhere in the world. And we do see a lot of usage on iPhone in the Western countries. But we also see a lot more diverse platforms globally, especially in developing nations, like Indonesia, places in Africa, India.

And these devices may be a lot different than what we’re used to with iPhones. They might have less computing power. And the data usage among people in these countries is also different. They might not have access to Wi-Fi or reliable data connections. And so their usage patterns might differ. So developing for these users might require a more specific development effort than with Western users in Western countries.

CECILY MORRISON: And I would– Sorry. Go ahead, Danna.

ROBERTO MANDUCHI: Go ahead.

DANNA GURARI: This is Danna. I’m from UT Austin. And building off of Patrick’s statement about looking at other parts of the world beyond just the Western world, even the kinds of visual content people are looking at is different. So the types of food you might eat in Indonesia is not necessarily the food you’re going to eat in the United States.

And OCR needs to work for all of these different languages if you have text on the kind of food you have or anything you’re buying. Toilets might look different, for example, in Japan from the United States. And so not just the hardware and thinking about how you design efficient algorithms that can run without access to wi-fi, but also understanding what visual content is actually available in those cultures really matters.

CECILY MORRISON: I think building on what Danna says, some of the work that we’ve just been kicking off is really thinking about how we can learn these different kinds of data. So at the moment, a lot of our computer vision systems are built off of standardized data sets that have been built in high income countries.

And those data sets, ImageNet is probably one of the best known, look very, very different than the data sets that you might see in sort of the Dollar Street data set, which looks at data from other countries.

But there are lots of new techniques coming out, things that are sometimes referred to as low shot learning or meta learning, where users would be able to train their own systems on their own things. And that would mean that regardless of whether that system was working in Indonesia, or South Africa, or Brazil, or in the United Kingdom, that that would work for that user.

And I think that’s one way where we can start to think about diversity, diversity in the cultures and the things that we look at, but also diversity in, perhaps, vision level. So what a person who’s totally blind might find useful in terms of what they recognize might be very different than someone who has a much higher level of vision and is using that vision to do certain aspects of their daily life.
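As a rough illustration of the low shot learning idea mentioned above, here is a minimal sketch of nearest-prototype classification, one simple few-shot technique. It is not the method used by any of the panelists' teams. The function names `build_prototypes` and `recognize`, the labels, and the two-number feature vectors are all hypothetical; a real system would obtain its feature vectors from a pretrained image model.

```python
from math import dist


def build_prototypes(examples):
    """Average each label's feature vectors into one prototype per label."""
    grouped = {}
    for label, vector in examples:
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def recognize(prototypes, vector):
    """Return the label whose prototype is closest to the query vector."""
    return min(prototypes, key=lambda label: dist(prototypes[label], vector))


# Hypothetical hand-written features standing in for image embeddings.
# The user supplies only two examples per personal object.
my_examples = [
    ("my keys", [0.9, 0.1]),
    ("my keys", [0.8, 0.2]),
    ("my mug", [0.1, 0.9]),
    ("my mug", [0.2, 0.8]),
]

prototypes = build_prototypes(my_examples)
guess = recognize(prototypes, [0.85, 0.2])
print(guess)
```

The point of the design is that the user supplies only a handful of examples of their own objects, and nothing in the recognizer depends on which country or data set those objects came from.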

ROBERTO MANDUCHI: So very nice. So this is a good segue into what I want to talk about next. You mentioned individualization. You mentioned being able to adapt to each person’s abilities. So where I want to move now is what I call the how. So computer vision and AI are exploding. Everybody wants to do that. All kids want to do AI nowadays. I see that with my students. Tons of conferences, there’s tons of papers, algorithms tested and everything.

The question is, that it’s all good, but how do we move that from that to building an app or system, an application that will actually work, but not only will work, that will be useful. As an engineer, I know that we are technology driven. And the thing that I see constantly is that we tend to propose solutions in search of a problem.

We start from the technology, and then we say, I’m sure you can use that to solve all of your problems. That is not exactly how it works. And in fact, it takes a lot of time and humility to go and talk with the community, try to understand how people live, try to understand the diversity in the community.

So I wanted to ask, and maybe starting with Danna, what advice do you have for somebody, a person who is very well versed in computer vision and AI, who intends to build something that is really useful for somebody who has low vision or is blind? What is the right way to do that?

DANNA GURARI: So I have two tiers of answers. So all of it centers on understanding the target population. And I want to start off with an anecdote of how I came in with a solution to solve a problem and so how I was guilty to answer the first component.

So we built, as our team, a system that helps people shop online and learn more about the clothes on shopping websites. And we built this system, and then we went out and we did user studies with people who are blind and born blind, went blind later in life, and also people with low vision.

And at first the system worked really well with people who lost vision later in life, because the system spoke, as you and I might listen if we’re sighted, at this pace of speaking. But then I went and provided this system to a user who was very proficient with using technology and screen readers. And the pace at which he wanted to hear was incredibly fast, so fast that I can’t, as a human, understand it.

And that was a real wake up call to me not understanding the population and the diversity of needs. And so the first statement I would say to AI researchers, is if you can, go take part in a user study with the population, talk to these people, not even just a user study, make friends with them, get to know them and their real problems in their daily lives. So that’s one aspect.

But if you can’t do that, there are communities, such as this event, that are creating communities where you have a chance to interact and learn more and be told what are the pressing problems.

And so my team has actually created an annual workshop as part of the premier computer vision conference called CVPR, where we bring together people from industry and academia to come together and talk about what are the pressing problems and make progress on them. And so I would say if you are an AI researcher, also look for workshops like those, so you can go and learn from others who are doing these user studies, what are the pressing problems.

ROBERTO MANDUCHI: Very nice, thanks. Patrick, do you have any advice?

PATRICK CLARY: Yeah, I think that’s great advice that was mentioned. Yeah, I think it’s really easy to make some progress on the engineering side and then try to kind of fit that into a use case.

I think one thing that we’ve tried to do is just develop a really good understanding of these, what we call, activities of daily life, so common tasks that a user might need to complete throughout their day and then do just a lot of user research and surveys to assess where is technology meeting the needs of these activities, and where is it falling short? And so what are the opportunities for technology?

And then in addition to user studies, one thing we’ve been trying to do, which is fairly difficult actually, is have what we call a co-design process, where we not only engage people who are blind and low vision through user studies, but we bring them in as more or less equal partners during design.

So they help us conceptualize these opportunities and the goals of the product and the interface and really are seen as kind of like a participant in the product development process.

So that’s something that we found to be really valuable, although it’s extremely challenging, especially with the COVID situation going on now and having a hard time accessing people and having to set things up remotely like that. So it’s provided some difficulty. But I do think it’s just crucially important to involve these folks in the development process.

ROBERTO MANDUCHI: Beautiful, and COVID certainly does not help with anything that requires interpersonal relationships, user studies, focus groups, and the like. Cecily, any comments?

CECILY MORRISON: Sure. I think some of our work– I sit in Microsoft Research– we’re looking a little bit further out. And we’re doing something similar to Patrick, but perhaps even going one step further, which is that it’s our belief that the things that are five years out, it’s really hard for a researcher to imagine them. It’s very hard for a participant to imagine.

Many of the people that we work with, they’ve built their lives up, they’re successful in doing what they’re doing. And all of a sudden, you drop this technology and say, how would you use it? And it’s like, I’ve got strategies. I’ve got my way of being. But we’ve also found that these early adopters are also people who can push the vision the furthest once they have the technology in their hands.

So an approach that we’ve taken is to build out prototypes, put them out there, and see what people do with them, rather than trying to guess up front. A lot of what we’ve seen, and I can give an example from Project Tokyo that we’ve been working on, this is a visual agent technology that provides spatial sound about where people are in your 3D environment.

And we expected it would be a way to support people initiating conversations, working as social agency. We happened to give it to a young blind person. I wanted to call him a child, but he’s 12, so I guess he’s not quite a child anymore.

And we started to find him doing unexpected things with it. We started to find that he was using it to help focus his communication skills. He was using it to help guide who he didn’t want to talk to. He was using it to help monitor how he didn’t get in trouble in school. So all these things that would never have come up in a user study we can start to explore.

So I think these are side by side partnerships that really help us explore the future by not imagining the future, but actually working out and being the future through trying things out and imagining where it can go.

ROBERTO MANDUCHI: It’s very intriguing. It’s very interesting. You discovered things after the fact that you would not discover, even with your initial focus group. It makes total sense. Now I have a question for you guys that’s still related.

So it is not just a computer vision algorithm that makes the system usable or not. The whole user interaction thing, how do I communicate, how does the user speak to the system, how easily, et cetera. So again, it sounds like it’s not just the algorithm itself. Have you guys experienced that in your own work? Cecily, you’re nodding.

CECILY MORRISON: Absolutely. In some ways, I think sometimes the actual algorithm is the easiest part, at least within our colleagues. They’re like, I need to deliver you a person tracking, great, done, delivered. But actually most of what we’ve built in our system are what we call helper algorithms.

So if we want to identify a person, if you walk into a space, the easiest way to identify them is looking at their face. But if you can’t see who’s around you, you probably won’t be looking at their face. So if you’re a child, and everyone around you is an adult, you won’t be looking at their face.

So a lot of what we’ve built are algorithms that have helped guide the user towards things that they might be looking at. So we have multiple. So we might be looking at pose, and then from pose, we can then use audio cues to guide the user towards a person. Maybe they need to get closer for that person to be recognized. Maybe they need to go around towards that person’s face to be recognized.

So I would say that’s the vast majority of what we do, to build an experience and algorithms that enable that to happen. But I think once you’ve enabled that to happen, then you also have this problem: does it do it accurately? And again, a lot of our research has been sitting around how do we make that happen.

And that’s first happened in two ways. One is just a lot of thinking within what role does the technology play for the users who are using it? It can be easy to think it’s going to replace a particular ability. It’s like a medical device. We’ve seen this to say, if it’s for a blind person, it must be a medical device. It’s replacing their vision.

We’ve taken a view that actually many of the people that we work with have a lot of skills. So it’s not replacing any of those skills. It’s augmenting those skills. So how do we then make sure that we have a level of accuracy that can support that? And one way that we’ve done that is that we always provide information in multiple ways.

So for example, when we’re looking at people, we might provide the number of people plus an identification of those people. So if you see there are four people, but you only hear three names, you might automatically realize that that’s a problem, that something here is amiss.

So I might trust the system a little bit less, but not only trust the system– it’s not about trust or not trust– that is a step to say, I need to do something different if I’m going to get the information I need. And we can perhaps take the analogy for those people who are using vision, when you walk into a room, sometimes you can’t see everybody who is there. And you move closer to someone. You move around until you can see what you need to see.

And so a lot of what we’re doing in our experiences is helping people get that wider view, so that they can be confident in the information. Or they can make the choice, it doesn’t matter if there are four people or three people, because my interaction is not going to depend on that. So I can push that away for a moment until I need that information next.
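The idea of giving the same information in more than one way, a head count alongside the recognized names, can be sketched as follows. This is a minimal illustration under assumed inputs, not the implementation of any panelist's system: the function `summarize_people`, its parameters, and the example names are hypothetical.

```python
def summarize_people(detected_count, recognized_names):
    """Report the head count and the names, and flag any gap between them."""
    names = list(recognized_names)
    # People who were detected but could not be matched to a name.
    unidentified = max(detected_count - len(names), 0)
    noun = "person" if detected_count == 1 else "people"
    parts = [f"{detected_count} {noun} detected"]
    if names:
        parts.append("recognized: " + ", ".join(names))
    if unidentified:
        parts.append(f"{unidentified} not identified")
    return "; ".join(parts), unidentified > 0


# Four people are detected, but only three are recognized by name.
message, mismatch = summarize_people(4, ["Ana", "Ben", "Chloe"])
print(message)
```

The mismatch flag does not tell the user what to do. It only surfaces the gap, leaving the user to decide whether to move closer for more information or to set it aside.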

ROBERTO MANDUCHI: Very nice. I hear a lot of key words here, very interesting, individualization, trust in the system, and being able to make choices, not just the system dictating to you what to do. Danna and Patrick, any comments?

DANNA GURARI: Yeah, one thing that I have noticed from running individual user studies is that there is a training phase. So we build a system, and we come in and we teach someone, here is how you use our system. But everyone’s doing that. I’m sitting here from UT Austin. We have Microsoft. We have Google on the call. We’re all making these individual systems with different instructions.

And it feels nice in our little silos to say, we built a new tool. It’ll help you. But there’s such an overhead for every single tool for these people to have to learn how to use each one of them. And so I think one of the major challenges for the community of developers is how do we come up with standards, so that the population interfaces with one set of instructions and we all, as developers, conform to.

ROBERTO MANDUCHI: Nice. Nice.

PATRICK CLARY: Yeah. This is Patrick. Just kind of some additional thoughts on there, I also think when it comes to user experience, a lot of the output, the results of these products or these models, they’re provided to the user in the form of audio.

And I think that can be tricky sometimes, because of course these users use their ears to interpret the world around them a lot of times. And so we want to be really sensitive that we’re not providing extraneous information.

We want to keep our information through our audio channel really tight so that the information we provide is useful, and it doesn’t necessarily get in the way of a user trying to interpret audio cues from the environment around them, like hearing cars or other people in their room and things like that. So just thinking about interface design, that’s something else that we’ve tried to take a lot of care with.

ROBERTO MANDUCHI: Beautiful. The system must know between basis. Again, we need to be able to be in control and be able to function as we do. Fantastic. So the last topic of the day that I really want to touch on is we’re talking about AI, and I would argue that your AI system is as good as the data set you’ve been training on.

And a lot of people have been talking about bias in data sets. And here we are. We are talking about a particular type of community. And you can be sure that most of the image data sets the systems have trained on might not be completely representative of the images that this community would take or will be interested in.

And in fact, there are a few data sets that have been taken by blind people. And one of them is VizWiz. And Danna, you’ve been working on VizWiz quite a bit. Do you want to comment a little bit?

DANNA GURARI: Yes, so traditionally in AI, the way that we can put out a problem is we imagine a problem, and then we go make up a data set that resembles that problem, such as scrape a bunch of images from the internet, ask crowd workers to come and describe those images, and then today we’ve produced this image captioning task.

Or collect a bunch of images, ask crowd workers to make up questions about those images, and then go to crowd workers again to say, now answer those questions. That’s really valuable, because it puts out the question for the community. But it’s not valuable for real users yet, because we have yet to close the loop and actually connect to real users.

And so the challenge for a researcher who is really trying to promote AI is how to plug into the community of developers who are developing systems that are collecting data. And so I actually did develop a partnership with a professor in Human-Computer Interaction, where I developed an application that was used by over 10,000 users, where people submitted images with questions about those images.

And in partnership, that particular app had people opt in to say, yes, I’d like to share my data for future research and data set creation. And so then I jumped in and took that data and helped, with my team, package it for the community.

So I think the greatest challenge is figuring out how to develop the right partnerships. And I’d like to say that also this person who I partnered with was in academia. And it’s easier, I think, for academics to share the data than for companies.

And so I know, for example, I’ve talked with people at Microsoft and Amazon, and I know that there’s an interest in having data get shared. But there’s so many layers of issues with privacy. And if any piece of private data got leaked, it’s just not a risk that I’ve heard from people in industry that they’re willing to take right now.

ROBERTO MANDUCHI: That’s a good point. And we have two people from the industry next to us. So Patrick and Cecily, do you have any comment on that?

CECILY MORRISON: Sure. This is Cecily here. I think one of the initiatives at Microsoft is something called Microsoft AI for Accessibility. And one of the things that they’ve been doing is spending a lot of time trying to fund what I would call disability first data sets, so data sets that are collected by and for people with different disabilities, so that when these are then open sourced and made very visible to the research community, we start with applications that are already going to work for that disability, rather than try and say we built this application. Can we now make it work?

So a good example of that in computer vision is if we start with ImageNet data, that data looks different. Maybe there’s less blur. Maybe the framing is different. Then we try to then retroactively get people to take better images, or we try to then artificially adapt that data to retrain. Or we could start with data that people who are blind have taken and go the other way.

So for example, we’re working on a data set called Orbit, which enables people to collect objects of things that are of interest to them, which is then going to move on to the community to try and help the community really push for this idea of personalization.

That we can start to build algorithms that in the first round will maybe give us what we call personal object recognizers, something that will recognize something as very personal to you, whether it’s your computer, your wheelie bin, or maybe a landmark that’s really critical to you, but then go on to push that on to say that once we can personalize a small task, can we start personalizing the information we give you?

Can we start personalizing the experience that we give you? So if people are interested in contributing and being part of a disability first data set, you can try that at orbit.city.ac.uk.

ROBERTO MANDUCHI: That’s beautiful. Patrick, I’ll give you the last word, because we are about to close.

PATRICK CLARY: Yeah, so I think this has been great to hear. Yeah, I do think there’s a gap here where it comes to kind of this public corpus of data and maybe a concerted effort among some of these organizations and academia and companies to kind of provide this data and sanitize it to remove private information that might be there. I know that can be an issue too.

And these data sets are really a reflection of images that individuals have taken and labeled. And they might reflect the bias of these individuals also. So it’s really important to kind of assess any bias that might exist also in these data sets and these models. And that’s something we do here at Google that we’re learning how to do and we have a lot of efforts going on on how to evaluate what bias may exist and eliminate that.

But I think we’re also kind of in the early stages of this just from a research and technology standpoint. And there’s a lot of work ahead to create representative data sets that will be useful for these users and to do things like eliminating that bias. But it’s really important. And I’m hopeful for what the future holds.

ROBERTO MANDUCHI: Fantastic. Thank you so much. Thank you to all of the panel, Danna, Patrick, Cecily. I think this has been an exciting panel. I learned a lot. I hope everybody who has been listening has as well. So thank you guys again. And we can pass it on now back to Will Butler. Thank you everyone.

[MUSIC PLAYING]