Our AI future is already here
DESCRIPTION: Whether it’s Alexa, Tesla or Facebook, AI is already deeply embedded in our daily lives. Few understand that better than Dr. Kai-Fu Lee, a scientist who developed the first speaker-independent, continuous speech recognition system as a Ph.D. student at Carnegie Mellon, led Google in China and held senior roles at Microsoft and Apple. Today, Dr. Lee runs Sinovation Ventures, a $2 billion fund based in China, is president of Sinovation’s Artificial Intelligence Institute and has 50 million followers on social media.
NED DESMOND: Well, thank you very much, Will. It’s an honor to be here today with Dr. Kai-Fu Lee, who is one of the world’s top experts as both a scientist and investor in the field of artificial intelligence. His book AI Superpowers: China, Silicon Valley, and the New World Order– I’ve got a copy here, actually– highly recommend– is a really lucid explanation of where AI is today, which is to say everywhere, and where AI is likely to take us.
One of the big insights for me in this book is that the current generation’s breakthrough in a type of AI called neural nets, sometimes referred to as deep learning, has enabled remarkable advances in areas like computer vision and natural language processing, both topics we’ve discussed quite a bit at this conference. Dr. Lee’s observation is that today’s AI capabilities are so great in this raw form that what’s needed now are the engineers and most importantly the data to make the most of all the possibilities. He believes China is a powerful competitor to the US in this regard, because of its depth in engineering, entrepreneurship, and perhaps most of all, data.
Dr. Lee, in your book you describe what you call the third wave of artificial intelligence as perception AI, which you describe, and I’m quoting here from your book, “as extending and expanding this power throughout our lived environment, digitizing the world around us through the proliferation of sensors and smart devices. These devices are turning our physical world into digital data that can be analyzed and optimized by deep learning algorithms.”
That description suggests there’s a near infinitude of data available for AI algorithms to digest and interpret the world. So here’s a question, I think, to start our conversation. What are the most amazing manifestations of the world of perception AI in the day-to-day lives of regular people, either here in the United States or in China?
KAI-FU LEE: Sure, thank you. It’s great to be invited to speak at this important conference. In terms of perception AI, you know, we have our six senses. And the most important ones for humans are our vision and hearing– that we can see and hear. And then of course there’s smell, taste, and others. But primarily we rely on these two.
And computers now can see and hear at the same level as people. So with speech recognition, machine translation, and object recognition, AI is now at about the same level as humans. And AI is improving rapidly, based on its ability to take a huge amount of data, whether it’s spoken language or recorded videos, to really train itself to do better and better. So over time, it will be a better see-er and hear-er than humans.
And what’s more is that AI can have a lot more senses that humans are unable to do beyond our six senses. Imagine the ability to see in the dark with technologies like a kind of light and other technologies. AI can actually do that. Like when we hold our iPhone up, it recognizes our faces, even if the room is quite dark. And that is using this technology. Sends little lights out and recognizes how far your nose and eyes are, and makes out a 3D picture of your face, even though it can hardly see you. The same can be done, and that’s beyond human capability.
And it can recognize temperature and warmth. In China today, the automatic temperature checking in the midst of the pandemic is largely done by invisible temperature sensors that recognize that a human has entered, based on motion and recognition of body shape, and then recognition of the temperature. And if there are any high temperatures– potential fever– it will raise an alarm. So the ability to recognize that, and of course, it can also do movement and humidity and many, many other things– vibrations. So all of these things summed together will lead to a future of AI that can sense and understand the environment better than people can.
NED DESMOND: That’s pretty remarkable. But where does all the data come from? I mean, this is a point you come back to over and over again in the book– that data is really the key to unlocking these powers of AI.
KAI-FU LEE: Right. Well, the data can come from the sensors, which can be placed in one’s home, office, or public spaces. Those devices, for example with temperature checking, are placed at the entry of buildings. There can be cameras, microphones, and other sensors placed. And there are also sensors in the future of autonomous vehicles, which largely figure out how to move by their ability to sense everything around them through cameras as well as what’s called LiDAR and other types of radar-like technologies that essentially sense the environment as well as reconstruct the three-dimensional picture around them.
So this will be everywhere. Autonomous vehicles are just one example. Smart cities, smart homes will all have sensors. Your Amazon Alexa has a number of sensors. Your phone obviously has many sensors. Your PC has a lot of sensors. So this will just proliferate.
When people talk about the IoT– internet of things– revolution, in some sense it is about having sensors everywhere gathering data. In the past, IoT never really took off, because no human could read all that data. But if you have AI at the back end reading the data, it can check for anomalies, regularities, issues, things to be concerned about, as well as integrate the data for AI or for decision making.
Another example is your Apple Watch or your Fitbit. Those are also sensors. They’re sensing your temperature and blood pressure and other body metrics that might suggest issues when they come up. So basically sensors everywhere.
NED DESMOND: Sensors everywhere. And then the data has to land somewhere. There’s a question of data ownership. How is the world taking shape in terms of who owns the data and can act on the data and build products on the data?
KAI-FU LEE: If you buy the sensor and place it in your office or home, then there’s probably some provision that data is owned by you. So your Fitbit data is owned by you, although Fitbit may aggregate all the data and estimate issues with your body based on numbers compared with other people like you. It could also be used in the future for things like prediction of a pandemic if there are a lot of abnormalities in particular regions.
So in some sense, that data is owned by you. It’s stored on the device’s server, and the user’s data may be used to learn some knowledge and make inferences, but not to be sold or given to other people, protecting the user’s privacy. That’s the current model.
Same as on your phone. Apple or Google has most of your data. Of course, the data is owned by you, but you have given an end user license to them the moment you unlock the phone and click on some box. So it’s kind of shared by the company and the individual.
And that makes some people uneasy, but that’s the current arrangement. It’s hard to imagine that working some other way, because if Apple or Google didn’t have all the data, then they wouldn’t be able to make their applications smart, and find bugs and find viruses– things like that. And if they ever betrayed the users’ trust, then there would be public opinion, as well as laws and regulations, that would get them in trouble. So for the time being, we are in the position where we own the data, but it’s licensed to the company, provided that they take good care of the data.
NED DESMOND: Right. Right. From an accessibility standpoint, perception AI is a very rich and promising concept. But for the most part, the applications you described are about your environment knowing something about you. Do you have a temperature? Who is this person? What is their shopping background in our store? That type of thing.
But for people who are blind in particular, one of the really desirable outcomes of this would be for them to understand the world around them themselves. In other words, where’s the bus stop? Where’s my phone? Who’s standing in front of me? What does that sign say?
With such an unlimited data picture becoming available in the world, when do you think it’s going to be possible for the devices to come into being that would really enable this? I mean, we have a good start on it in certain functionality on the Apple iPhone and on Android phones and purpose-built devices like OrCam. Amnon Shashua spoke at the event just recently.
But all of it depends on data as much as sensors and algorithms. And how do you see that picture coming together?
KAI-FU LEE: Sure. I think there are several major ways that this type of technology– perception AI– can help with the accessibility. One is obviously a conversational interface, and we can talk about that later.
You’re asking more about interpreting the environment and helping the user understand the environment in some succinct way. I think, assuming right now we’re talking about basically a blind person who needs to turn all of the environment around him or her into voice and speech, then I think the technology would require a high level of natural language understanding and contextual understanding. So that instead of reading everything in front of you– that would not be interesting. Your TV, radio, soda, road– that would not be interesting.
What would be interesting is be careful, there is a very fast car coming on your left. Or there is someone– John is waving at you. John, your friend, is waving at you. Or you are not going in the right direction towards your office. So that requires a very high level understanding of the user’s intents and the user’s friends, and also not just the objects in the environment, but what they are doing and what they mean by what they are doing.
So that level of understanding is still not quite there, but rapid progress is being made. So just around, I think, the recent months, Microsoft made an announcement that they have a research technology that can look at the picture and write a caption at the same level as humans. I think that’s an important step. Obviously, a caption describing a picture is not quite there with what I talked about for the environment, but it is a big step. Because seeing and understanding has always been very difficult.
I think the bad news is, I think, being able to really be personalized and contextual and smart and giving you just the right amount of information in not too many words is something that’s still not quite– we’re not there yet. And we don’t know how long it’s going to be. But it seems like it’ll be around a 5-to-10-year time horizon.
And the good news is that we’re seeing very rapid progress in AI technologies. That Microsoft breakthrough– probably five years ago, I wouldn’t have imagined that it would be possible. So we’re seeing AI having breakthrough results in beating people in machine translation, speech, understanding, recognition, object recognition and so on.
And we’re seeing autonomous vehicles starting to work. One could sort of infer that, hey, the moment autonomous vehicles can drive anywhere, that means they have a contextual understanding of what’s on the road, which means that same technology probably can be brought over for the purpose of accessibility. Since autonomous vehicles are about 5 to 10 years away, that’s probably the time frame we’re looking at.
And one last point is that the assumption we make here is that it’s interpreting a scene– an environment to speech. What if we had other forms of input? What if we have a new form of a dynamic Braille, or some kind of a combination of tactile and hearing and other interfaces that can give us a multi-channel input? That might be able to accelerate things.
NED DESMOND: Your example of the autonomous car is really fascinating, because when you think about it, there’s already some degree of self-driving capability in many cars on the road today. And they’re developing information from their sensors in real time.
But, of course, they’re not talking to a human being. They’re talking to a series of activities that go on in the car– brakes, accelerators, that type of thing. And that’s the difference, of course. How would these sensors and how would this information really be conveyed to a person who’s walking down the street at two miles an hour and just needs to know that there is an obstacle in their way, or something like this?
I mean, today we’re making remarkable advances in terms of– and you’ve referenced a few of these– the way that a machine can look at an image and tell you what’s in the image. Do you think there will be other more efficient ways to convey information– practical information– to people as they go about their lives if their sight is limited or non-existent?
KAI-FU LEE: Well, I know that visually impaired people rely on Braille today for reading. So tactile is the other potential input. I haven’t personally studied how good tactile input is for people, but I think Braille is one form of a static way to feel.
But one could imagine there could be a dynamic Braille. You can generate contours and things like that. So I would hope that at some point inputs to our fingers and our ears simultaneously might be able to do something useful. It’s kind of like when we had the iPhone– not for accessibility, but for us the fact that we can use multiple fingers and also look at, listen to, and manipulate windows and click on things is useful for input. So it ought to be possible in output.
I haven’t studied how to combine tactile and sound, but it seems like– because we know that voice is a very narrow bandwidth. We can only hear on the order of 150 words a minute. So that limits how much information can go into us. And if there can be other things, that would be helpful.
A much longer-term proposition, of course, would be computer-brain interaction. And that’s the kind of work that Elon Musk’s Neuralink is doing– that would hope someday to directly upload to our brain things related to the environment. And there’s also a dream that one could learn a foreign language by uploading it to our brain.
But I have to say that while he is a visionary, I think this is pretty distant. Most people in cognitive neurosciences would say this is not impossible, but probably more than 20 years away. So I wouldn’t count on getting that too soon.
NED DESMOND: Right. Right. Well, it’s hard enough to build an autonomous car. It’s even more difficult to imagine somehow transferring all of that sensory information directly into someone’s mind.
Let’s see. The area around speech-to-text and text-to-speech is an area where you’ve studied this and did a lot of pioneering work at Carnegie Mellon when you were studying and made some quite remarkable advances. And I think one of the issues today, which is getting a lot of attention and has been a subject at this event as well, is the interaction of people simply with information. The breakthroughs in OCR and in other areas have been remarkable in recent years, but even so, our ability to interact with machines by voice strikes me– just to put a provocative opinion out there– as not what it could be.
And you have a real connection to this question, because back in the early 90s you were at Apple under John Sculley, and you developed a program called Casper, which was designed to really make voice the operating system for an experience on an Apple computer. Could you tell us what you think the state of the art is today, and what consumers in general can expect, and perhaps blind people in particular, from a more voice-driven experience in relation to information and computer systems?
KAI-FU LEE: Yes. The direction that John Sculley and I referred to when we were interviewed on Good Morning America was that this would be a conversational interface. That if at some point a computer or a phone or an Alexa for that matter became so smart that it could primarily talk to you with a conversation, and that you don’t have to give it the specifics of how to do a task, but just what you want to accomplish, and it gets it done– kind of like a really smart personal assistant who knows you well.
So when you have such an assistant and you tell the assistant plan me a vacation to Hawaii this summer, then the assistant knows automatically how much time you have, how much vacation, what your budget is, where you like to go, whether you go with your spouse or whole family, and how to plan your day, how to find travel agents. And when it’s unsure, it can come back to you and say there is a discount that weekend, but you have to go for a week. Or there are three hotels, and there are different trade-offs.
So through a conversation, the agent can basically plan your vacation by asking a couple of key questions and get it done. Unlike today, if you were to operate a phone by voice, you would go to Expedia and then Explorer and click and compare. And go to various airlines and check the prices, and go to bookings.com. So you would be going to many websites one step at a time. And if someone who has accessibility challenges has to operate that website hopping– app hopping interface, it would just be unthinkable what it would take to plan a vacation.
But if we elevate all these apps to a level where they know you, and they know the apps, and they can just ask a few key questions and get the right vacation ready for you. Or another example in e-commerce would be give my wife flowers for our anniversary. You wouldn’t have to tell your assistant when the anniversary is, how to ship the flowers, that it is a disaster if it’s one day late, and also not good if it’s one day early, but not as bad. And how much you like to spend, what kind of flowers she likes, what you gave last year, and so on and so forth– what your credit card number is, what your address is. It should know all that.
So we need to elevate the whole computer repository of information– knowledge of you and knowledge of all the websites and apps– abstract it to a level so that it can interact with you with just a few sentences and get a task done. The task of getting flowers shouldn’t even require an interface. One sentence should be enough. Setting up a vacation might require 3 to 10 turns of conversation, and then you’re done.
When we can elevate the quality of a computer or a phone assistant to that level, that’s the kind of thing that will make a huge breakthrough in accessibility interface, making the interaction that could take hours into seconds. But the challenge that remains is how do we get to understand everything there is to know about you? And how do we get to know everything there is to know about every app? And the apps have to speak the same language and agree to cooperate, and that agent level interface has to be built.
So obviously Alexa, and Siri, and Microsoft’s Cortana and others would like to see that as their grand dream, but I think it’s going to take quite a while for the apps to agree to talk to each other in a standard language and without going to the user for further clarification and let an agent control them. So this work is going on.
There actually doesn’t need to be any more technology breakthroughs. The speech recognition is good enough. The language understanding is getting to be good enough, but the problem of gluing all these apps that are used to interacting with people– to have them talk to another app and have that app control their behavior in some canonical language– that work is a lot of work that app developers have to be willing to undertake. So it’s really just a complex problem that doesn’t require any more research, but just hard work to get it done.
NED DESMOND: Is this evolving in any interesting ways in China where there’s a benefit there, for instance, in the WeChat platform, which provides so many services, including third party services on a common platform– much more deeply integrated as I understand it, for instance, than Android or iOS are?
KAI-FU LEE: Yes. I think that’s a great, great point. That the super app that WeChat is– because if you look at my log on my iPhone, I probably spend 95% of my time inside WeChat. From a customer point of view, it’s kind of dangerous. From a monopoly maintenance point of view, it’s kind of dangerous. But from a convenience point of view, for me as a Chinese user, having this conversational or delegation interface inside WeChat would be the most natural thing. WeChat already knows much about me, including my travel, what I like to eat, when I go out, where I work, and knows more about me.
You could also say that an iPhone or a Google could be similarly capable, but I think it’s– I’m not aware that any of these three companies are making big efforts in this area. It’s something that will– you can imagine the amount of work you have to do to persuade all the app developers, some of whom consider themselves to be your competitor– why should Amazon cooperate with Google, because if they did, all the future users will talk directly to the Super Google assistant, through which Amazon gets called?
Well, what if someday Google develops its own e-commerce and starts routing queries to that e-commerce? So an Amazon CEO would be quite hesitant to collaborate with Google as the top of this new interface. And the reverse would be true if Amazon were to develop Alexa into it. So there are some natural competitive dynamics that make this rather difficult to accomplish. Which is unfortunate, because technically, research-wise, we are basically ready.
NED DESMOND: So it’s interesting. It’s a business problem that’s inhibiting what could be a revolutionary consumer breakthrough, and the technology is already here.
KAI-FU LEE: Yeah. So maybe a closed platform is what it takes, right? Apple is a more closed platform. Or a company that can really do it all. But, of course, then we have a big monopoly antitrust issue.
NED DESMOND: Well, Apple’s screen recognition technology for third party apps that are not accessible is an interesting teeny step in this direction, perhaps.
KAI-FU LEE: Yes. Yes, that would be a relatively benign way to get people making a small first step.
NED DESMOND: Well, my last question relates to the potential dangers of AI. A big part of your book is devoted to the likelihood that as the AI Industrial Revolution or the AI Revolution sweeps through economies that there will be large numbers of people who are unemployed and without much hope of gaining conventional employment as we think of it today. There’s a lot of discussion about that in the States as well.
And here there’s also a lot of concern about the bias that can be introduced into people’s day-to-day lives in policing and employment– in education, because of normative standards established in AI data sets that might not do justice to everybody that the AI encounters. Do you– I’m sure you’ve thought about this problem. Are there ways that architects of AI solutions should be thinking about this and trying to minimize this type of problem?
KAI-FU LEE: So on the issue of bias, it’s actually a problem that can be addressed. One could build a tool that would detect imbalance in the data sets. For example, when a particular company had hiring AI practices that were not friendly to women, it was a result of not having enough women in their training set. And also there was face recognition software that did poorly on African-Americans, again due to a lack of representation in the data set.
So the way to solve it is to build into the tools for training AI something that checks for balanced data sets. And when there is an issue, at least come up with a warning, so that the developer may go back and collect more data. Or maybe if it’s really seriously imbalanced, not even permitting the completion of the training– not allowing this product launch.
So that will, I think, be a good way to address it, but that’s not a substitute for training of all AI [INAUDIBLE] to be aware that their responsibility extends beyond writing the code. They are responsible for making sure there are balanced data sets so that there is not unfairness or bias in the resulting application.
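The kind of check Dr. Lee describes here, a training tool that warns on an imbalanced data set and refuses to proceed when the imbalance is severe, might be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `check_balance`, the `expected_groups` parameter, and the two thresholds are hypothetical and do not come from the talk or from any particular library, and simple group counts are only one of many possible ways to measure representation.

```python
from collections import Counter


class ImbalancedDatasetError(ValueError):
    """Raised when a group is so under-represented that training should not proceed."""


def check_balance(group_labels, expected_groups=(), warn_ratio=0.5, block_ratio=0.1):
    """Compare each group's example count with the largest group's count.

    Hypothetical thresholds (assumptions for this sketch, not from the talk):
      - below warn_ratio of the largest group: return a warning message
      - below block_ratio of the largest group: raise and block training
    """
    counts = Counter(group_labels)
    if sum(counts.values()) == 0:
        raise ValueError("no group labels supplied")
    # Groups we expect to see but that never appear count as zero examples.
    for group in expected_groups:
        counts.setdefault(group, 0)

    largest = max(counts.values())
    warnings = []
    for group, n in sorted(counts.items()):
        ratio = n / largest
        if ratio < block_ratio:
            raise ImbalancedDatasetError(
                f"group {group!r} has {n} examples ({ratio:.0%} of the largest group)"
            )
        if ratio < warn_ratio:
            warnings.append(
                f"group {group!r} has {n} examples ({ratio:.0%} of the largest group)"
            )
    return warnings


if __name__ == "__main__":
    # Toy labels: one group attribute per training example.
    labels = ["f"] * 30 + ["m"] * 70
    for message in check_balance(labels, expected_groups=["f", "m"]):
        print("WARNING:", message)
```

A training pipeline could call such a check before fitting a model, surface the warnings to the developer so more data can be collected, and let the exception stop the run in the seriously imbalanced case.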
On the question about displacement, AI, in fact, is very good at doing many of the routine tasks that people do [INAUDIBLE] tasks, such as back office processing, customer service, telemarketing, and also blue collar tasks, such as assembly line worker and waiters, waitresses, and many other such jobs. If we think about all the routine tasks and all the jobs that are done, probably nearly half of all the tasks that all of us do together are routine and are bound to be displaced.
And what needs to be done is that we need to make sure that people whose jobs are largely routine start to become aware of this issue and get training– help that they need to get skill sets in areas that cannot easily be displaced. And the skill sets that cannot be displaced go all the way from creativity, strategic thinking, analytical thinking, to compassion, empathy, human-to-human interaction, and the ability to build human trust.
And also to dexterity and dealing with unstructured and new environments. So there are many, many jobs available that AI cannot yet do, and AI will also create more jobs over time, as every technology has done. So while we do have a near-term challenge of retraining that is needed, over a longer period of time, if we get our education in order and train the right skill sets for the existing jobs, jobs that cannot be displaced, and also new jobs that AI will create, I think we’ll actually end up in a better place where our grandchildren will be liberated from ever having to do routine jobs.
NED DESMOND: That’s a very hopeful note to end on. Thank you very much, Dr. Lee. It’s been great to get your perspective on these vital questions about AI. So thank you, and back to Will Butler.