Did Computer Vision AI Just Get Worse or Better?
DESCRIPTION: The ability of assistive tech devices to recognize objects, faces, and scenes relies on a type of AI called computer vision, which calls for building vast databases of images labeled by humans to train AI algorithms. A new technique called "one-shot learning" learns dramatically faster because the AI trains itself on images across the Internet. No human supervision needed. Is that a good idea?
BRIAN: Did computer vision AI just get worse or better? Moderator, Cecily Morrison, speaker, Danna Gurari.
CECILY MORRISON: Thank you very much, Brian, and welcome, Danna. Tell us a few words about yourself.
DANNA GURARI: Yeah. So I’ve been an assistant professor now for just over five years, and I’m currently in Boulder, Colorado, a beautiful place, University of Colorado Boulder, in particular. And I focus on both teaching and leading a research group. So insofar as teaching, I focus a lot on educating the next generation on how to build large scale data sets to train modern artificial intelligence algorithms. I also spend a lot of time teaching them how to design these algorithms that consume that data so they can go and do something impactful in the world and hopefully help society. And so complementing that teaching background, I also lead a group of about typically 10 students from undergraduate all the way up to PhD level students on designing machines that can see. And we call that discipline computer vision. And for the past– jeez– five years, seven years, we’ve focused predominantly on thinking about how do we design computer vision algorithms to be in the lives of people who are blind or visually impaired to help them overcome their visual challenges in their daily lives?
CECILY MORRISON: Such exciting work and very relevant to our audience here. Tell us, innovation in AI has been very rapid. We hear lots of new things coming out. But can you share with us the latest trends?
DANNA GURARI: Absolutely. I think it’s helpful in talking about the latest trends to understand where we’ve come from. So I mentioned in my teaching load that I teach students how to make large scale data sets to train modern artificial intelligence algorithms. The status quo from about the 2000s up until 2017 was a technique called supervised learning. And the idea was you would have humans label data. So let’s imagine we want a machine that can recognize cats in images. You would have humans go and look through a bunch of images and label which images have cats, which ones don’t. And with those labels, you could feed that to the algorithm, and the algorithm would learn patterns of what’s present when cats are present and what should not be there when cats are not present. And that was a trend up until 2017. The problem with that trend is that we rely on humans to label data to provide supervision to the algorithms. So if you think about the efforts, generally, we’re talking about labeling thousands, tens of thousands, hundreds of thousands, or even millions of examples. That was the status quo until 2017. And so reliance on humans was the bottleneck. It’s time-consuming. It’s costly. It’s relatively slow. Jump forward to 2017 and the Natural Language Processing community– so the community that thinks about how do we design algorithms that can analyze text. And they said, well, let’s not rely on the supervision. Let’s just rely on the data itself. And the idea that was proposed was let’s, for example, have machines read books. And we have all the text from these books. And all we’re going to do is we’re going to mask out single words, and we’re going to teach the machine how to predict the word that was masked out. And so all the text is available. We’re just going to randomly mask out different words and say to machine, predict what was there that I made disappear. A year ago, the computer vision community caught on to this trend and said, wow, this is brilliant. 
We can move beyond relying on humans to label data. We can take advantage of the mass repository of images and videos on the web to make machines that can see and do it with a similar approach. Given an image, we want to predict if it has a cat, mask out little blocks of the image that have cats, do that in the video, and have the machine learn how to predict the presence of these kinds of things with parts of that visual content masked out. And it turns out that this kind of training of algorithms leads to performance that far exceeds what we’ve ever seen before. So it’s really, really a revolutionary concept that has changed the landscape of what is possible because we just have so much more data from which algorithms can learn to recognize patterns.
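The masked-word training idea described above can be sketched in a few lines of Python. This is only a minimal illustration of how raw text becomes training examples without human labels, not the actual pipeline used by any particular model; the function name `make_masked_examples`, the `[MASK]` token, and the masking probability are all assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact symbol is an assumption


def make_masked_examples(text, mask_prob=0.15, seed=0):
    """Turn raw text into (masked_tokens, targets) with no human labels.

    targets maps each masked position to the word that was hidden there,
    which is what the model would be trained to predict.
    """
    rng = random.Random(seed)
    tokens = text.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the word from the model
            targets[i] = tok      # the text itself supplies the "label"
        else:
            masked.append(tok)
    return masked, targets


sentence = "the cat sat on the mat and looked out of the window"
masked, targets = make_masked_examples(sentence, mask_prob=0.3, seed=1)
print(" ".join(masked))
print(targets)
```

The same recipe carries over to images in spirit: instead of hiding words, blocks of pixels are hidden and the model is asked to predict what was there.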
CECILY MORRISON: Wow. And it seems quite new in the computer vision space. What do we call this new phenomenon?
DANNA GURARI: So the flavor of these algorithms is called transformers. And so that’s kind of the architecture that has been developed. And the approach for training these methods, these transformers, is called unsupervised learning, which is in stark contrast to what we had, which was supervised learning through human labels.
CECILY MORRISON: I get it. So the supervised bit is where we need to tell the computer what things are, and the unsupervised bit is where we’re trying to get the computer to figure it out. Did I get that right?
DANNA GURARI: Yep, with no human annotations at all. That’s the idea.
CECILY MORRISON: That is a– that’s a major transformation. So this sounds great, but how is this going to be useful for many of the people watching this conference who are blind or low vision or building technologies for people who are blind or low vision?
DANNA GURARI: Absolutely. So there are several different use cases we need to think about for our audience. One is where we have people who rely on their phones, smart glasses, whatever, in order to learn about their visual surroundings. So we know things like Aira, Be My Eyes. We have all these kinds of services that help people in their daily lives: learning what they’re going to eat for breakfast, choosing the clothes they want, crossing the street in a safe manner, and so on. And so we have this tremendous amount of data that’s being collected. But again, it’s costly for these companies to label it. And while the companies that I mentioned, such as Be My Eyes and Aira, are very powerful players in the space and are changing people’s lives, it takes a lot of money to get humans to label this data. And so what can be done is limited by getting the appropriate funds and by the time it takes to label. And so you can imagine these companies taking advantage of transformers, training them without any supervision, without any human labeling, on the data they have. And all of a sudden, we can imagine algorithms that are designed for data collected from our target audience. So rather than data that’s widely abundant, traditionally what’s available on the internet, which is generally what’s being used for training these algorithms, we can instead say, let’s train these algorithms on the data for our target audience. And for sure, when we focus on the data coming from our audience, we would imagine improved performance. So that’s one use case where I would imagine great advancements, and that would allow us to hopefully move away from exclusive reliance on humans to having more automation in the loop. And I imagine that’s helpful when we’re dealing, for example, with people’s private information, or when people just need really fast information and don’t want that human connection.
And then the other situation, I think, is just going to be an extension of what we’ve already been seeing. We have things like Microsoft providing captions for their PowerPoints and various productivity applications. And so continuing to see more alt text that’s automatically generated for things on the web, whether it’s products from Microsoft, Twitter, images, and so on.
CECILY MORRISON: So it’s a great concept that there are lots of things in the world that we can use these transformers to recognize, but how do we start to build systems that really put the user in the driving seat of those systems?
DANNA GURARI: Yes. So this is also a space where transformers can have an impact in thinking about how we do personalization. So one thing I did not mention is that these transformers are not being trained to perform one task. Their effectiveness comes from the fact that we train them to do five different tasks or 10 or 20 or 100 different tasks at once. So as an example, two use cases I’ve seen from our audience, people with some kind of vision impairment, are "give me a description of an image" and "answer a question about an image." And so the way modern transformers are being developed is that we are training them to do a huge number of tasks. And by training these types of architectures to do a huge number of tasks, they’re learning to be more like humans, who can generalize and do many things. So they’re not designed to just say, I recognize cats. They’re saying, I can describe when a cat is present. I can answer questions. I can count. The number of tasks they’re being taught to do is incredible. And that creates a broader level of intelligence that allows them to generalize what they learn to novel use cases. So maybe they learned what a horse looks like, and they’ve learned to recognize black and white. And so next time they see something that looks like a horse but has black and white stripes, they can say, ah, that’s a zebra, even though I’ve never been taught what a zebra is. And so being taught this huge number of different skills and learning to generalize what they’re learning, such as combining colors with detecting things like horses, will enable them to work well in novel scenarios for which they were not originally trained. That is called either zero-shot or few-shot learning, which means: take the knowledge you have, and without having to get a lot of labeled or unlabeled data, learn to generalize to these novel use cases.
And once we do that, we can start opening the door to personalization to individual people’s needs because now we don’t need to keep training a brand-new algorithm from scratch to meet each individual’s needs. We can just kind of provide what we call prompts in this space. And that’s a bit of a technical term that I’m going to gloss over without going into the details. But the idea is we can prompt these algorithms, without having to retrain them, to be able to generalize to novel use cases that would be specific to each one of the users. And so for me, I think that’s one of the exciting next steps that we’ll see in this community: how do we personalize to each individual’s needs? Because if you lose your sight when you’re born versus later in life, you maybe have different prior understandings about the world. And if you have some vision, that might change your needs versus if you have no residual vision. And so I certainly envision that we will see these transformers keep helping us move closer to this acknowledgment that really there are different needs that need to be met. And we don’t want one-size-fits-all algorithms that behave exactly the same for every user.
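The horse-and-zebra example can be made concrete with a toy sketch of attribute-based zero-shot classification. This is a deliberately simplified stand-in, not how transformers work internally: the attribute names, the class descriptions, and the function `zero_shot_classify` are all hypothetical, and a real system would learn such features rather than have them written by hand.

```python
# Hypothetical attributes a model might have learned to detect on its own.
ATTRIBUTES = ["horse_shaped", "striped", "black_and_white", "has_wings"]

# Each class is described by which attributes it has (1) or lacks (0).
CLASS_DESCRIPTIONS = {
    "horse": [1, 0, 0, 0],
    "zebra": [1, 1, 1, 0],   # never seen in training, only described
    "magpie": [0, 0, 1, 1],
}


def zero_shot_classify(detected, descriptions=CLASS_DESCRIPTIONS):
    """Pick the class whose description best matches the detected attributes."""
    def mismatches(desc):
        # Count positions where the description and the detection disagree.
        return sum(d != a for d, a in zip(desc, detected))
    return min(descriptions, key=lambda name: mismatches(descriptions[name]))


# Something horse-shaped with black and white stripes.
detected = [1, 1, 1, 0]
print(zero_shot_classify(detected))
```

The point of the sketch is that no zebra example was needed: skills learned separately (recognizing a horse shape, recognizing stripes and colors) combine to cover a new case.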
CECILY MORRISON: That makes a lot of sense because even if people are in a single community, it doesn’t mean they’re really the same. Let’s go a little bit deeper on these things. So the kinds of examples you’re giving are things like how it brings together knowledge about the visual world. So one of the things that we see within technologies in the blind community is that I have one app for navigating certain intersections. And I have one app for navigating maps. And I have one app for recognizing colors and another app for recognizing barcodes and another app for– and the list goes on, right? Most people I know are using multiple apps in a day to do these different visual tasks. So it sounds like well, a transformer is maybe one of those things that are going to bring those visual tasks together.
DANNA GURARI: Absolutely.
CECILY MORRISON: But then how do I know, how do I work with that system to get the particular visual task I need at that time? How do you see this happening? Is this something that’s going to happen by itself, or is this something where a user is going to become skilled at working with these systems to kind of get the visual information that they need?
DANNA GURARI: I think it’s going to be a combination. So as I mentioned, these transformers are really in their infancy. We’re starting to learn that there’s this ability to do personalization through these things called prompts. And so we don’t fully understand how to leverage these prompts with these kinds of architectures. What do I mean by a prompt? It means that you start with letting the algorithm know this is what I’m looking for. So you can imagine saying, here’s an image: color. And that would be the prompt you give to the machine, and the machine will say, ah, I see in here yellow with some black and green, OK? You could have a different prompt that says, is there a person in your view? In the case that we’re looking at me, the answer would be yes. We can have another prompt that says, describe. And it would say, oh, I see a woman who has shoulder-length brown hair, who’s wearing a black shirt, has headphones on, right? You can imagine these different prompts that you give, and the machine, through this widespread training on all these kinds of different tasks without supervision, will learn: when I see a pattern of this kind of prompt, such as the word "color" or a visual question or "caption," this is how I should respond. And so it’s similar to how you would prompt a child or an adult to do something. We would see a similar thing in these architectures. And so there’s going to be learning about how we come up with the appropriate prompts so that we are getting the responses we want. So that’s going to be a part of it. But there’s going to be a dance, because in that learning about these algorithms we’re going to learn about the struggles in prompting them correctly, and where they misbehave, and where they just do things we don’t want them to do. And so we’re going to have to redesign the underlying architectures.
And so there is going to be a dance back and forth that will continue at least for the next five years, I envision, because, again, this is a one- to two-year-old concept. And so we’re going to have to dig more and more into understanding what it is these machines understand, right? Like, what do they want us to say in order to respond the way we want them to? And how do we make sure they’re doing it ethically? So how do we make sure that they’re making responses that are in line with what we think they should be, as opposed to responding through a bias? Like, answer the question: what color is my banana? Most of the time when we see a banana, it might be yellow. But sometimes it’s brown, which means it’s rotten. And we want the machine to learn to say, I recognize the color of this banana, rather than just learning the pattern that whenever I see a picture of a banana, it should be yellow most of the time, so I’ll say yellow. So there’s going to be a lot of work also on just verifying that these algorithms are really responding based on the appropriate cues rather than just on biases. And there’s certainly going to be a lot more work around ethical development of these transformers.
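One simple way to check whether a model is leaning on this kind of bias is to compare it against a baseline that never looks at the image at all. The sketch below is an illustration under made-up data (17 yellow bananas and 3 brown ones); the function names and file names are hypothetical.

```python
from collections import Counter


def prior_only_answer(training_answers):
    """Return the most common training answer, ignoring the image entirely."""
    return Counter(training_answers).most_common(1)[0][0]


def accuracy(predict, examples):
    """Fraction of (image, answer) pairs where predict(image) equals answer."""
    correct = sum(predict(image) == answer for image, answer in examples)
    return correct / len(examples)


# Made-up data for illustration: 17 yellow bananas, 3 brown ones.
examples = [(f"banana_{i}.jpg", "yellow") for i in range(17)]
examples += [(f"banana_{i}.jpg", "brown") for i in range(17, 20)]

guess = prior_only_answer(answer for _, answer in examples)


def blind_model(image):
    """A 'model' that always gives the most common answer."""
    return guess


brown_only = [ex for ex in examples if ex[1] == "brown"]
print(accuracy(blind_model, examples))    # high overall, without looking
print(accuracy(blind_model, brown_only))  # fails on every brown banana
```

If a real model scores about the same as this blind baseline overall and fails on the rare cases, that is a hint it may be answering from the pattern rather than from the picture.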
CECILY MORRISON: There were a lot of nuggets that we can pull from that. So let’s dive in a little bit more to some of them. So you said five years. That’s not very long.
DANNA GURARI: [CHUCKLES] I mean–
CECILY MORRISON: What are you thinking? Tell me.
DANNA GURARI: So let me just give you a sense of the size of the communities that are working on this. When I started my PhD back in 2010, the computer vision research community, the number of people who were going to the mainstream conferences, was somewhere around, I think, 1,000 or 2,000 people. The last time pre-COVID that I went to one of these conferences, which I think was maybe in 2018, they sold out, and there were about 10,000 attendees. And the community is growing. We are seeing that across the board in all these various artificial intelligence communities. We are seeing a dramatic growth in how many people are coming onto this buzz. We are seeing this in the classrooms. My course has been the most popular course at my university. It maxes out. We have long waiting lists. And so we are just seeing a wild amount of people just diving in onto this trend. And so what’s happening is we have so many people around the world who are working on pushing this space forward. And when you have that amount of human effort coming in, for sure, I think you can see progress just moving so much faster. There is money, millions, billions of dollars going into this. Self-driving cars is an example, right? That industry alone is huge, and they are looking at this kind of technology. And we are seeing this across the board. So I think just the sheer amount of money and human effort focused on this kind of problem means that we can see progress faster. And that’s where my belief is. My belief also stems from having watched the advancements from the past decade. I mean, it’s astounding how fast we’re moving, but it’s because it’s been working and it’s becoming more and more popular. It’s a snowball effect. And so I don’t think that snowball is quite at a point where we’re slowing down yet.
CECILY MORRISON: Yeah. So this is moving really fast. We know that people have been thinking about some of the challenges, and that we have efforts going on in responsible AI that think about the kind of biases you were talking about with the banana. But how do these challenges change with these kinds of big models? I mean, you talked about them. They can do so many things. But I guess that also means they can do a lot of things maybe wrong.
DANNA GURARI: Yes. So many things wrong. So the challenge I described with the banana is, is the machine even looking at the image or the visual content? In that case, if you see a bunch of images and you get a sense of what the color yellow is, then you can just have a machine spit out, well, almost all the time, 85% of the time, I see a yellow banana. So I know 85% of the time I’m going to be correct when I see one. So I’m just going to say that because I’m usually right. And that’s a good way to gamble. So that’s what the machines are doing. And so there are ways we can discourage machines from those kinds of biases. One thing we can do is we can start to inspect the insides of the machines. And so one of the buzzwords I did not throw out about these transformers is that they have this module called attention, which is one of the keys that enables these architectures to work really well. But what it also means is that these machines are learning where to attend in images. And so we can say, are the machines looking where they should be to make a prediction? Or are they looking at surrounding evidence? And so we can do that. We can use that kind of attention to dig in and try to get a sense of, are they really reasoning in the way we think they should be? And if not, we can start to penalize these algorithms. We can start to put punishments when we train them for not looking where they should. That takes a little bit of supervision, so now we’re getting back into that tension we have where we are asking for humans to be involved in labeling so that we are ensuring a more responsible direction. But that’s going to happen. This is true for language, too. So let’s say you are trying to translate. I would imagine in our audience we have people who speak different languages and come from different countries. And you have some countries that have gender built into their languages and some that don’t.
And so a popular example, I believe, is if you translate from Turkish which I believe doesn’t have gender in it. If you translate it like something like the person is a doctor into English, you’ll see it says, he is a doctor in English. If you translate the person is a nurse from Turkish into English, you’ll see she is a nurse. So the assumption is she goes with nurse. He goes with doctor. That’s another kind of bias that we’re going– we are seeing from these kinds of models because of many reasons, including– the data that they’re looking at perhaps had systematic biases from history that are being fed forward. And so again, we have to create some sort of human intervention where we have checks. We say we know these kinds of biases are possible. And not even possible. They have happened. So let’s go through and systematically test our models to say, what are the biases that are present? And when we see certain ones, we go and retrain it to penalize having those kinds of biases. And for sure with our target population, there’s already published literature talking about the biases that are imparted on people with any kinds of impairments, any kind of disabilities. A lot of the literature that has been used in the self-supervised learning approach has built-in biases about people, for example, with mental illness. There’s just these generalizations that these models are learning about people with mental illness are more likely to go to jail and these really horrible kinds of perpetuated stereotypes that are often found in the material that’s being used to train models. And for sure, I imagine we’ll find these kinds of issues about really inappropriate generalizations about people with visual impairments. And so we’re going to want to understand what are those kinds of stereotypes that we might see from the data that these models are being trained on. How do we test against those? 
How do we verify that we are not putting systems out that embed these kinds of stereotypes? And that takes a lot of just engagement with the data as well as engagement with our target populations to uncover where could these systems go wrong.
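The idea of systematically testing a model for known biases can be sketched as a small audit loop. This is only an illustration: `audit_pronouns` and `first_pronoun` are hypothetical helper names, and `biased_stub` is a stand-in that hard-codes the doctor and nurse behavior described above rather than calling any real translation service.

```python
def first_pronoun(sentence):
    """Return 'he', 'she', or 'they' if one appears in the sentence, else None."""
    for word in sentence.lower().replace(".", " ").split():
        if word in {"he", "she", "they"}:
            return word
    return None


def audit_pronouns(translate, sources):
    """Map each occupation to the pronoun the translator chose for it."""
    return {job: first_pronoun(translate(text)) for job, text in sources.items()}


def biased_stub(text):
    """Stand-in translator that reproduces the bias described above."""
    canned = {
        "O bir doktor.": "He is a doctor.",
        "O bir hemşire.": "She is a nurse.",
    }
    return canned[text]


# Gender-neutral source sentences, one per occupation being tested.
sources = {"doctor": "O bir doktor.", "nurse": "O bir hemşire."}

report = audit_pronouns(biased_stub, sources)
print(report)

# Flag every occupation where a gendered pronoun was introduced.
flagged = {job for job, pronoun in report.items() if pronoun in {"he", "she"}}
print(sorted(flagged))
```

In practice the list of probes would come from engagement with the affected communities, since they are best placed to say which stereotypes need to be tested for.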
CECILY MORRISON: So as we’re looking towards the future– but actually, a lot of these simpler AI systems are out there for people to use now. What can they learn from some of the things you’re saying about just getting the best out of the systems that they have now?
DANNA GURARI: Yes. I think it’s good to be skeptical if you are using automated systems, even human-based systems. It’s good to have some sort of internal check and ability to do some sort of internal quality control. That’s not always possible. If you ask what color something is, then it’s not necessarily the case that, if you have no vision, you can make that kind of judgment. But certainly, maybe you can ask a few more questions to see whether this machine has internal consistency. So maybe you ask the machine the same thing in different ways. And you say, does the machine consistently arrive at the same outcome or not? And if it doesn’t, and that’s a critical question for your health, safety, whatever, then maybe you go find another option, a human-based option, because you can assess through multiple probes that the machine’s not behaving a certain way. Of course, that doesn’t feel like a viable solution for the long term. We can’t spend our lives probing and probing and probing forever. But certainly, up front we should do that. And then–
CECILY MORRISON: As an example of that, if I have a medicine– so maybe my app can recognize objects. So it’s recognized that it’s a medicine, but maybe I also then want to go check with my short text recognizer what actually it says on the label. Is that kind of what you mean where you’re using two different AI systems to make sure they’re giving you the same results?
DANNA GURARI: Thank you for the clarification. And I’m actually suggesting something different, which is we have a single model that’s going to be providing us the information. And so you can imagine saying to the system, tell me what this object is. If the answer is pill bottle, OK, we have some confidence. But then we don’t stop there. We say to the machine, is this a pill bottle? Does it say yes, or does it say no? Then, if we have confidence that it’s able to detect the right object, we might say, what dose am I supposed to take? And then it gives you an answer. It says, 5 milligrams. And you say, am I supposed to take a 5 milligram dose according to this? And so basically figuring out how to probe the machine with slightly different ways of trying to get the information. You could just say, give me a description of the image. Does it give you a description of this as a pill bottle from which you’re supposed to take 5 milligrams? And so basically probing it with slightly different variants of what you’re looking for and what you’re hearing to make sure there’s internal consistency in how it answers. Because if it’s not internally consistent, there’s a likelihood it’s just spewing out biased responses and that it doesn’t have logic that is working based on what you’re giving the machine. Does that clarify?
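The probing strategy described here, asking an open question and then asking the same model to confirm its own answer, can be sketched as follows. The function `is_consistent` and the two stand-in models are hypothetical; `ask` represents whatever question-answering system is in use, and no real API is assumed.

```python
def is_consistent(ask, open_question, confirm_template):
    """Ask an open question, then ask the model to confirm its own answer.

    Returns the first answer and whether the confirmation began with "yes".
    """
    answer = ask(open_question).strip().lower()
    confirmation = ask(confirm_template.format(answer=answer)).strip().lower()
    return answer, confirmation.startswith("yes")


def steady_model(question):
    """Stand-in model whose answers agree with each other."""
    q = question.lower()
    if q.startswith("what is this object"):
        return "pill bottle"
    if q.startswith("is this a pill bottle"):
        return "yes"
    return "unsure"


def flaky_model(question):
    """Stand-in model that contradicts its own first answer."""
    q = question.lower()
    if q.startswith("what is this object"):
        return "pill bottle"
    return "no"


for model in (steady_model, flaky_model):
    answer, agrees = is_consistent(model, "What is this object?", "Is this a {answer}?")
    print(model.__name__, answer, agrees)
```

A disagreement between the two probes does not say which answer is wrong, only that the system should not be trusted for that question, which is the cue to look for another option.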
CECILY MORRISON: That makes a lot of sense. Yeah, thanks for the example. And I’m sure that will be appreciated by the viewers who are making sense of and using these systems. In the few minutes that we have left, I want to close with a question, which is, you have a long history of improving computer vision for people who are blind or low vision. But which of the many contributions you’ve made do you value most?
DANNA GURARI: Yeah, great question. So I think I’m going to try to hit on three. I know you said the most, but I’m going to maybe spread that over to three series of events which have kind of reflected how my career has gone. When I started my work in this space, I really felt that as a single researcher, I really was not going to be able to change the world for this audience. And that was important to me. Yeah, I was raised with the idea that my mom would lose sight one day, and she had many eye issues, my whole family. We’ve always struggled. And so it was always really a personal drive to try to feel like I can make a difference for something that felt pretty close to home. And so feeling that I as one researcher would not be able to have a huge impact, I said, well, instead I’m going to do what is called making a data set challenge. So basically, I went out starting in 2015, and I found I was able to obtain data from people who are blind. And I was able to label that data to support a bunch of different tasks, such as answer the question about an image, describe the image with a caption, tell me if the contents of the image are recognizable or not. And if not, then we can prompt the photographer to take another picture. So basically, I went out, and I started to create all these what I call data set challenges. And I invited the broader computer vision community to come and compete on these challenges. So I said, I’m not going to develop the solutions. I’m going to develop the foundation and let everyone in the community come if they want and compete. And they have. We’ve had over 100 teams. So I’m guessing hundreds of people who have come and helped push forward the level of performance from these algorithms. And for me, that’s– I mean, there’s no words to express how awesome that feels to be a part of that kind of momentum to have helped facilitate that kind of growth. And it’s not just in the development of algorithms. 
It’s also the presence of the number of people who are familiar with the needs of people who are blind and low vision by accessing their data implicitly. They’re getting some sort of training about what kind of solutions are needed out there. So I’d say that’s the first level of work that I did that I’m most proud of. There is a series of things that followed from that initial effort. So in hosting these data set challenges, we would have announcements every year at the premier Computer Vision Conference, where, again, there’s 10,000 attendees that come in theory. And so we would host this workshop as part of that event, where we would announce here’s the winners for these challenges every year. We started that in 2018, and that was really cool because all of a sudden, we had people who are needing the work done, showing up and saying, here’s what I want from you. And we were creating kind of this dialogue. In that process, we also discovered there are major barriers to have an inclusion of people who are blind and low vision within the computer vision community and within the larger artificial intelligence community. It’s not just the computer vision community. And so just by trying to invite our audience, we found there’s issues with registration along the way. There was screen reader issues. Like, screen readers weren’t working on these websites for conferences. And we still know that. And so by discovering that, now, for the first time this year, we have created a new position for the conference called the Accessibility Chair, and I have a chance to now go in and help through support and funding from the leaders to start to make a more accessible community. And so I’m really– that’s just starting. I think many people in the audience will appreciate change takes a long time. But I’m really excited to be a part of this change to hopefully allow for this community to be more inclusive to people who are blind and low vision. 
And so hopefully, we can have a lot of attendance from this population going forward, where in the past, it’s just been not even thought about. And so I would say in closing, being a part of this community and helping create a more inclusive world has been the more recent– the aspect that I’m really excited about.
CECILY MORRISON: It’s so important that when we do these things that there’s nothing about us without us. So it’s great that you haven’t just stuck to the technical side but have embraced the whole of it. Well, I have learned a lot, and I hope the audience has as well. So thank you so very, very much both for what you gave us today and for your larger contributions to the community.
DANNA GURARI: It’s my pleasure. Thank you so much.