DESCRIPTION: As AI-based computer vision, voice recognition and natural language processing race ahead, the engineering challenge is to design devices that can perceive the physical world and communicate that information in a timely manner. Amnon Shashua’s OrCam MyEye is the most sophisticated effort yet to merge those technologies in a seamless experience on a dedicated device.
MATTHEW PANZARINO: Thank you very much, Will. I’m so happy to be here to talk to you today and to be talking to Professor Shashua about OrCam and the various products they have. And I certainly think that there’s an enormous opportunity here as, obviously, Professor Shashua does in this industry to leverage AI and leverage computer vision to help enable better processes and better accessibility for people with disabilities, people with vision loss, and people with hearing loss.
So I’m excited to talk about this today. And thank you very much for being here Professor Shashua.
AMNON SHASHUA: Thank you. Thank you, Matthew. It’s a pleasure.
MATTHEW PANZARINO: So I think that there’s a lot of ways to start off this conversation. But I think a basic one is why you see AI as an enormous opportunity to leverage here, to get significant advances for people with disabilities that could benefit from it.
AMNON SHASHUA: When we look at the AI, it’s playing out in two major domains. The first domain is machine vision, the ability to see and interpret the visual data that’s coming into the machine. The last five years have seen huge leaps in the ability of machines to interpret visual data and to reach performance levels rivaling human level perception.
The second one is understanding acoustics, understanding sound, voice to text and understanding text. This is called natural language processing or natural language understanding. This area has also seen huge leaps in the past five years. So now if you put both of them together and add to the mix also the rise of compute, the fact that silicon is becoming more and more dense in compute, and the power consumption is getting lower and lower, you get the ability to run things that are very compute intensive on a battery.
If you look at our smartphone, our smartphone today is like a supercomputer of 10 years ago. And it sits in our pocket. So if you put all these things together, it’s like stars aligning in a perfect line. Then you start seeing a huge value, especially for people who have disabilities.
So if you have difficulty to see, the machine can see for you, and then whisper into your ear what you are about to see. And then you can also communicate with the machine by talking to the machine, like we talk with Cortana, Alexa, and Siri.
So if the machine is looking at the visual field and you want to know what’s out there, rather than the machine telling you everything it sees and then over– giving too much information, you can start asking a question. What’s in front of me? Guide me to that place. I’m looking at the newspaper, read me titles, and then read me article number three or read me an article about Biden. Or you can start exploring the visual world because there is also a natural language processing interface.
So if you put those things together and put it on a wearable device, then it’s like you are emulating a human assistant. It’s like I have someone smart enough with proper hearing ability and the proper seeing ability standing beside me, and I’m communicating with this person. And this person is telling me about the world, is telling me about what I’m seeing, whom I’m seeing.
I can ask– I can ask this person, tell me when I see Matthew. So I’m going from place to place. And he hears Matthew, because I don’t see, right. Or I can tell him, guide me to the next empty seat. I’m going on a bus, and I’ll tell him guide me to the next empty seat. So this human assistant, will tell me, oh, here’s an empty seat.
So replace that human assistant with a computer and you can start appreciating the big value that you can get. And it all sits on a very, very small device because of this rise of compute, rise of data, rise of machine learning, of deep networks, natural language processing, natural language understanding, computer vision, all coming to a climax in the last five years. It makes it a huge opportunity for helping people.
MATTHEW PANZARINO: Yeah, and there’s been– certainly there’s a lot of components there. Obviously, you do have– you have the advances in natural language processing, but also semantic understanding of the world, as you mentioned. It’s one thing to say, hey, there’s text in the field, I’m going to automatically read that, which I believe we’ve seen some applications of across VoiceOver and across other technologies.
But certainly there is an advancement here in the way that it segments the world semantically to understand if, for instance, a paper is being held up closer to my face, I want to read that. I don’t want to read a sign in the distance, right? And that’s one of the advancements that I believe has been added to the OrCam camera recently, right?
AMNON SHASHUA: So when we started the OrCam camera, you know OrCam was founded in 2010. So we’re talking about at least four years before the rise of deep networks. But you know, we felt that computer vision was at a point in which we can do something useful. We did something useful in terms of Mobileye with the perception for driving assist, where the camera can do very, very useful things in avoiding accidents by detecting people, detecting cars.
We thought we can do the same thing, and let’s focus on recognizing products, recognizing money notes, recognizing faces. Faces came out a bit later, but mostly recognizing text. And ability to do OCR, Optical Character Recognition, in real time in the year 2010 was what was groundbreaking, right?
But then as we moved forward, the first thing we told ourselves, how would the user communicate with the device? So in 2010 the way we communicate with the device was with the hand gestures. So if I want to kind of trigger the device to give me information, I’ll point, the device sees my finger. Or if I want the device to stop whatever it’s doing, I’ll do something like this, so hand gestures.
And then later as we moved forward and technology progressed and compute, you know, silicon became more and more dense and algorithms became much more powerful, data became much more readily available, we said, OK, we can use natural language processing, natural language understanding in order to get a good interface between the human and the machine.
So now if I’m looking at the newspaper in 2010, it will read the entire newspaper. But now I can say, read me the headlines. And then I’ll say read me article number three. Or I can tell the device, you know, detect number 23. So say I’m walking along a corridor and I want room number 23, I can say, stop me when you see the number 23.
So I’m communicating with a device like I would communicate with a human assistant. Or say when I’m looking at a menu. In 2010, it will read the entire menu. Now I can tell OK, find desserts, so it will read me only the desserts. Or start from fish, it will read me something that’s starting from fish.
So the addition of first computer vision became much more powerful, so we can do much more than we did just like in 2010. But also the combination with natural language processing made that very, very powerful.
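The voice-driven reading described above (read me the headlines, read me article number three, find desserts, start from fish) can be sketched in a few lines of Python. This is only an illustration of the idea, not OrCam's software: it assumes the OCR stage has already split the page into headed sections, and the `Section` type, the `respond` function and the tiny command grammar are all hypothetical names invented for this sketch.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical spoken number words the sketch understands.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


@dataclass
class Section:
    """One block of recognized text (assumed output of an OCR stage)."""
    heading: str       # e.g. a headline or a menu category
    lines: List[str]   # body text found under that heading


def _read(sections: List[Section]) -> List[str]:
    """Flatten sections into the order they would be read aloud."""
    spoken = []
    for section in sections:
        spoken.append(section.heading)
        spoken.extend(section.lines)
    return spoken


def respond(command: str, sections: List[Section]) -> List[str]:
    """Return the lines to read aloud for an already-transcribed command."""
    text = command.lower().strip()
    if "headlines" in text:
        return [s.heading for s in sections]
    match = re.search(r"article (?:number )?(\w+)", text)
    if match:
        token = match.group(1)
        n = int(token) if token.isdigit() else NUMBER_WORDS.get(token, 0)
        return _read([sections[n - 1]]) if 1 <= n <= len(sections) else []
    match = re.match(r"(find|start from) (.+)", text)
    if match:
        verb, query = match.groups()
        for i, section in enumerate(sections):
            if query in section.heading.lower():
                # "find" reads one section; "start from" reads from there on.
                return _read(sections[i:] if verb == "start from" else [section])
        return []
    return _read(sections)  # no recognized command: read everything


# Made-up menu used only to exercise the sketch.
MENU = [
    Section("Starters", ["Soup"]),
    Section("Fish", ["Salmon", "Sea bass"]),
    Section("Desserts", ["Cake"]),
]
print(respond("find desserts", MENU))
print(respond("start from fish", MENU))
```

Without a recognized command the sketch falls back to reading everything, which mirrors the 2010 behavior described in the conversation. The commands only narrow what is read.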
MATTHEW PANZARINO: Yes, and I think there’s a huge key here, which you sort of touched on a little bit here and there. But I think it’s important to talk about head on, which is that all of this is being done via on-device processing with OrCam. And I believe that that’s huge for a lot of reasons.
One, it’s technologically impressive. But second, it is sometimes underappreciated how intimate accessibility devices can become to the user. You know, they’re letting them into their lives. They’re asking them very personal questions. They’re opening up their world in a visual and auditory sense with acoustic filtering and linguistic recognition.
And all of those, obviously, raise privacy questions. You know, if we’re basically saying this is an extension of my being, privacy and intimacy and closeness of that data becomes massively important. And I think a lot of the other accessible devices or accessibility devices on the market, much of them rely on cloud connectivity. And there will be single features here or there that are local on device.
So can you talk a little bit about that road to ensuring that you could do all of this on-device, on a battery powered unit, because I think that’s one of the most singularly impressive things about OrCam?
AMNON SHASHUA: I think early on we understood that this is the most critical challenge that we need to face. First, if we want a device that sees everything that you see and hears everything that you hear and processes it, if you send all the data to the cloud, you will be entering into huge privacy issues.
We did have two years ago kind of a testing of the water with such a device that was focused only on face recognition and we went through Kickstarter. We kind of shipped 1,000 devices that are focusing only on face detection where the information is being sent to the cloud, because there in the cloud you have a database of faces and so forth. And it hit huge privacy concerns. So we were stuck with those 1,000 devices that we shipped.
Processing private information in the cloud is very, very problematic, especially when we’re talking about visual information and auditory information. So it has to be– the intelligence has to be at the edge, has to be wearable. And this raises lots and lots of challenges. One is power consumption. Second is processing power. In the cloud you have unlimited processing power. And here you have to have a very small amount of processing power because of power consumption, because of size.
There’s also practical considerations. So even if I wanted to send something to the cloud, sending high resolution images to the cloud would not make the device work in real time as it should. Because the fact that everything is done on the device allows us real time processing, and provides much more value to the user.
So being able to cram everything into a device of this size– and this is kind of the device for the visually impaired– cram everything in a device of this size and have it work for a number of hours in full capacity by processing audio and vision is a huge, huge challenge that OrCam had to face early on.
And it pays back because people feel that there are no privacy issues. They can trust the device because nothing is being sent to the cloud. There’s no communication to the cloud, no communication whatsoever.
MATTHEW PANZARINO: Yeah, and I think that trust is important, obviously, when you’re saying, hey, we’re producing a device for you that does enter into your personal universe. Well, technologically speaking, you know, I think that the computer vision is somewhat understood by a larger percentage of the populace than it was, say, in 2010.
People are starting to see applications of this in their daily lives from the iPhone’s camera to Google’s Photos application, which automatically selects faces and themes for people, things like that, and visual search obviously in photo libraries. So they’re starting to see some applications of that in their regular life. And of course, the Microsoft Kinect had a lot to do with that, kind of exposing that to a mass audience.
But I think it’s much less so, so far with auditory signals. And so could you talk a little bit about the OrCam Hear and about the applications there and the development of that?
AMNON SHASHUA: So when you look at processing acoustics, you know, we took it into two directions. One is for the OrCam MyEye, for the visually impaired being able to communicate with the device. As I mentioned before, you can talk to the device.
The second area was taking and creating a new product line that will help people who have hearing loss, even mild hearing loss. And the idea is that there is an open problem, it’s called the cocktail party problem. It has been open for about five decades in the academic circles, which says when you are in a situation in which many people are talking at the same time and you are having a discussion with one specific individual, you kind of tune in to what that person is saying and you tune out everything else. And it’s kind of miraculous.
How do you do that? You basically follow the lip movement, you know, you do something quite miraculous. In the past two years, with the rise of deep networks and the ability to build architectures that can process acoustics and video together, one can do things that 10 years ago would be considered science fiction. And solving the cocktail party problem is one of them.
So the idea is that you have a black box, which is a network. It receives the acoustics and receives the video of the person that you are looking at. Now that acoustics would be the voice of the person that you are talking with, but also the other people talking, right, you’re at a cocktail party, you’re at a restaurant. You are having a discussion with the person in front of you. There is background noise of dishes clacking and also the background noise. The table beside you, there are other people talking as well.
And all that you want, you want to hear only the person that is in front of you, tune out everything else. So this network receives all this complicated acoustics, receives the video of the person that you are looking at, somehow follows the lip movements, and then extracts from the acoustic wave only the voice of the person that you are talking with and transfers that to your hearing aid. So it could be a hearing aid or it could be just an earphone.
It could be that you have a mild hearing loss. You don’t need hearing aids, but you would like, in this particular setting, a complicated acoustic setting of a restaurant, to put an Apple AirPod in your ear, and amplify only the voice of the person in front of you and tune out everything else. And this is something that is now possible to do from a technological point of view.
And we kind of capitalized on several things that OrCam was very good at. One is taking very, very advanced technology and putting it on a very, very small device, kind of advanced technology in terms of computer vision, in terms of natural language processing, in terms of deep nets running on a very, very small device. And taking our vast experience in computer vision and natural language processing and natural language understanding that we did for the visually impaired and building those architectures that concern the cocktail party problem.
And this is going to be the device, OrCam Hear. The idea is that, you know, you simply put it on your neck. And then the camera is facing the person that you are talking with. In this case, it’s just a normal earphone. I put the normal earphone in, and then I can tune in to the discussion of the person in front of me.
And what this device does is the camera here– and it has a number of microphones– the camera and the microphones together take the complex acoustics and the video of the person in front of you, you know, feeds it into a kind of miraculous, you know, black box neural network. And out of this neural network comes a new acoustic wave, which tunes out everything else except the voice of the person you are talking with.
MATTHEW PANZARINO: And how do you go about training a network like that?
AMNON SHASHUA: This was one of the big– one of the big challenges. It’s one of our trade secrets even. How do you go and train such a network. But what I would say is that one of the big advancements in the past two or three years is called the self-supervision.
Self-supervision is unlike the classical machine learning where data needs to be tagged. So say I want to recognize cars in an image, then I’ll go through a process of collecting training data and then tag every vehicle in that collection of images and feed this into a neural network telling it, here’s an image as an input and this is a car, right, I’m tagging it. This requires a lot of effort and limits scalability.
The past two or three years in machine learning have shown a lot of progress in self-supervision where you don’t need to tag anything, you don’t need to label– you don’t need to label anything. It started with language processing and then shifted to computer vision, unsupervised learning or self-supervision.
And we harnessed this new advancement so that we can then take the entire internet of clips, wherever clips are, and use it for training this neural network, so without tagging, without labeling anything. So being able to train such a network is really a big part of the technological ability of the company.
It’s not only putting everything on a small device. It’s being able to train it to begin with.
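OrCam's actual training method is described above as a trade secret, so the following is not it. It is a minimal Python sketch of one widely published way to get supervision for audio-visual speech separation without manual tags: mix the audio of two unrelated clips to simulate a noisy room, then ask the network to recover the clean audio of one clip given that clip's video. The `make_training_example` and `sample_examples` names, the clip representation and the gain parameter are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# A clip is (audio samples, video frame identifiers). In a real system these
# would be waveforms and face crops; plain lists keep the sketch runnable.
Clip = Tuple[List[float], List[str]]


def make_training_example(target: Clip, interferer: Clip,
                          interferer_gain: float = 1.0) -> Dict[str, list]:
    """Build one (input, label) pair with no human tagging.

    The label is simply the target clip's own clean audio, which is why
    this counts as self-supervision: the data labels itself.
    """
    target_audio, target_video = target
    other_audio, _ = interferer
    n = min(len(target_audio), len(other_audio))  # align lengths
    mixture = [target_audio[i] + interferer_gain * other_audio[i]
               for i in range(n)]
    return {
        "mixture": mixture,          # network input: the noisy "party"
        "video": target_video,       # network input: who to listen to
        "clean": target_audio[:n],   # training target: that speaker alone
    }


def sample_examples(clips: List[Clip], count: int,
                    rng: random.Random) -> List[Dict[str, list]]:
    """Draw random target/interferer pairs from an unlabeled clip pool."""
    examples = []
    for _ in range(count):
        target, interferer = rng.sample(clips, 2)
        examples.append(make_training_example(target, interferer))
    return examples


# Made-up clips, only to show the shapes involved.
clip_a: Clip = ([0.25, 0.5, 0.75], ["a_frame_0"])
clip_b: Clip = ([1.0, 1.0], ["b_frame_0"])
example = make_training_example(clip_a, clip_b, interferer_gain=0.5)
print(example["mixture"], example["clean"])
```

Because any two clips can be paired, the number of training examples grows roughly with the square of the clip pool, which is one reason an untagged pool of internet video is attractive for this kind of training.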
MATTHEW PANZARINO: And you know, to talk about hardware for a second, though, I mean, I think it would be interesting to hear your view. I mean, obviously with your first company or your previous company and then now OrCam, you’re dealing with on-device processing that uses specialized chips.
And I’m just kind of curious about how you view the entire industry of specialized chips specifically as it relates to AI. We’ve seen, obviously, the biggest AI and ML applications, obviously. But we’ve seen the biggest example of this being Apple launching a dedicated co-processor for processing ML more rapidly and more efficiently. But of course Nvidia is a big player in this space. You know, Intel is a big player in the space now.
I’m just kind of curious what your thoughts are on the industry at large. I mean, the advancement of specialized chips seems to be driving a lot of the innovation around these devices, specific devices, OrCam is one example, that take on tasks that require a lot of very specific actions versus the generalized processor universe of 10 years ago.
AMNON SHASHUA: Well, you know, it’s a spectrum. And you need to place yourself optimally in the spectrum. So on one end of the spectrum, you have a general purpose processor like a CPU, with ease of programming, very, very easy to program. So this will be the CPU. This would also be the bigger GPU with the CUDA programming language, very, very natural to code, to write code for this processor.
On the other side of the spectrum, you have very, very hard to program, very specialized architectures that are efficient in a very, very narrow domain. So it’ll be very efficient in running a specific type of a neural network. It will be much, much more efficient than a general purpose processor.
The problem with it is it’s very, very difficult to program, so it’s difficult to scale. And also because it’s not general purpose, say technology is evolving and technology is evolving rapidly. You know, every month you see a new architecture coming in. For example, you know, from 2014 to 2017 or so, convolutional networks were, you know, the definition of big networks. It was all around pattern recognition, computer vision, all sorts of network architectures around convolution.
And then 2017, the rise of language processing, a new type of architectures came in that are called transformers. They came out of Google first and OpenAI, and then there are thousands of academic papers today around those new architectures.
So the problem is that now if you have an archive– if you have a silicon that’s very, very specialized to a certain architecture and then technology is moving very, very fast, technology software is moving much faster than silicon. You can be in a problem.
So you really need to find a point in this spectrum in which you’re not too much specialized and you’re not too much general purpose. You’re not– it’s not so difficult to program, but on the other hand, it’s not very easy to program. So finding the right spot on this spectrum is kind of the Holy Grail.
MATTHEW PANZARINO: That system is pushing the boulder on an architecture that’s too narrow. But at the same time, you don’t want to lose the efficiencies of a general processing.
AMNON SHASHUA: Exactly. So for example at Mobileye, I specialized in building chips. And in that silicon, the system on chip has multiple different architectures. Some of them are very, very specialized to the neural networks. Others are less specialized. So kind of to find a better position in the spectrum of ease of programmability and compute density such that you can move with the times rather than being locked into a very, very specific architecture.
MATTHEW PANZARINO: And I’m curious too, so what do you view as like as, you know, nothing is ever complete, right? No journey is ever complete. But what do you– what would you view as a success point or a major sort of end game for OrCam in general? What are you looking to do?
AMNON SHASHUA: Well, you know, the end game for OrCam would be called AI as a companion. So if you look at our smartphone, this smartphone is a supercomputer. But the problem is that it’s in our pocket, so it doesn’t see and doesn’t hear. So if you have a device that sees and hears and has the computing capability of a strong machine, then if this device is aware of all that we are doing, shares all the experience, the visual experiences and the audible experiences we have throughout the day, and has intelligence, it can create huge value.
And the challenge here is to gradually define what that value is. So what OrCam set out to do is kind of to take society and pull it into layers, where each layer is where we can define very, very precise value to that layer, and then move forward. So we started with the blind and visually impaired because the value there is evident. It’s very, very clear, you don’t see, the device can see and will tell you what it sees, so this is a great help.
And then we move to the hearing impaired and the cocktail party problem, to create a much, much bigger– much, much stronger experience in terms of hearing help than what the normal hearing aids provide.
But then you can go much, much further than that. Since you understand– you’re aware of the people that you meet, the discussions that you have with the people, say, we’re talking and then I mention, OK, let’s schedule lunch next week, so the AI can do this automatically, without even letting me know. So next week, it can manage my calendar and put a meeting between you and me and also communicate with your calendar, with your AI, and do this seamlessly, because it was observing our discussion. It knew who you are because of face recognition. It has natural language understanding, so it knows what we are talking about and understands our intentions.
So the challenge here is how to define this value. And it’s difficult to define a value proposition that is good for everyone. This is why we’re doing this in stages. But we believe that as the AI progresses, as the compute density increases, as power consumption decreases, so that we can have such a device working for a full day, we can gradually define the value proposition going forward until we enable a value which is good for everyone, not only for people with disabilities, but for everyone. And this is really the end game for OrCam, wearable AI.
MATTHEW PANZARINO: Yeah, that makes sense. And you know, the one kind of example that we haven’t talked about yet is OrCam Read, right? So the Read device is not necessarily for people who are visually impaired, but it mentions that it can aid people who have reading issues like dyslexia, for instance. I’m kind of curious, like why that– it seems like it fits into this layering, you know, that you were talking about, building out use cases in sort of tranches.
But I’m kind of curious why that detour from that, you know, from sort of the visual and acoustic paths of the other devices.
AMNON SHASHUA: Well, the OrCam Read is not a detour. It’s really a natural extension of the MyEye. So the MyEye, you know, is like this. You know, it’s on eyeglasses. So if you’re blind or visually impaired, putting something on eyeglasses is kind of natural.
But if you are dyslexic or you have age-related difficulties in reading, it’s not necessarily that you want to have eyeglasses. You want to have maybe something that you can hold in your hand, point, press a button to take a picture of text, and then you can also communicate through audio. You can say, read me the headlines, just like you did with MyEye, or read me the entire text, or find the word Facebook and read me the text around Facebook, or start from the desserts in the menu.
Whatever we do with MyEye, you can do with this device that you hold in your hand. And then you open it up to a new part of society, not necessarily people who are visually impaired, but people who have difficulty in processing reading because of dyslexia, because of age-related exhaustion. There are all sorts of syndromes where over time people find it difficult to read text and would like help in reading text. And this simply opens up the market.
What we found out is that even people who are blind and visually impaired sometimes prefer a handheld device to a device that you clip onto eyeglasses, which is something that we did not anticipate. So the OrCam Read is also, you know, sometimes shipped to the blind and visually impaired.
MATTHEW PANZARINO: Got it. And it seems like there is applications there. Obviously, it’s the English language speaking is the first target. But it seems like there’s awesome opportunities there for translation, as well, right, if it recognizes it in one language and can translate it to another.
AMNON SHASHUA: Today machine translation running on the cloud is already so powerful, it would be much easier to read, to kind of decipher the text, send it to the cloud, translate it, and then send it back. So it doesn’t seem to be a high priority to put all that huge, you know, technology for doing machine translation on a very, very small device. In that case, simply send– because there is no privacy issue here, send a picture to translate it, and then send it back–
MATTHEW PANZARINO: Yeah, makes sense.
AMNON SHASHUA: –would be a much, much better route.
MATTHEW PANZARINO: Excellent. Well thank you so much. I really appreciate you taking this time to talk us through OrCam’s offerings. I think that this vision for sort of specialized AI in these instances makes a lot of sense.
AMNON SHASHUA: Thank you, Matthew. Have a good day.
MATTHEW PANZARINO: Thank you. Back to you, Will.