Vispero: The Engineering Experience of Adding a Voice Assistant to ZoomText and JAWS (sponsored breakout)
DESCRIPTION: Roxana and Sriram talk about their experiences adding a voice assistant to a mainstream Windows screen reader and magnifier. They explore the new input mechanic's benefits and limitations and the guideposts they used to create the initial command set. They also discuss the voice assistant's data and conversational privacy aspects and how Vispero is approaching them.
- Sriram Ramanathan, Senior Software Engineer, Vispero
- Roxana Fischer, UX Research Analyst, Vispero
ROBERT FRAWLEY: Well, hello everybody, and welcome to the engineering experience of adding a voice assistant to ZoomText and JAWS. My name is Robert Frawley, and on behalf of Sight Tech Global, I am so excited to have you join us. In today’s 30-minute session hosted by Vispero, you’ll hear from Sriram Ramanathan, senior software engineer at Vispero, and Roxana Fischer, UX research analyst at Vispero.
Before we begin, a couple of housekeeping items. The session is being recorded and will be available post-event. If you have a question, please use the Q&A box, and we will answer those at the end of the session. We will also try our best to make the raise-your-hand feature available so you can ask your question in person if needed.
For screen-reader users, the keyboard shortcut to raise and lower your hand is Option-Y on Mac and Alt-Y on PC. Once you do that, we’ll call your name, you’ll unmute yourself, and you can ask your question. And with that, please take it away, Sriram.
SHRIRAM RAMANATHAN: Hi, my name is Shriram Ramanathan. I’m a software engineer at Vispero, and I have Roxana Fischer here with me; she’s a UX research analyst. I want to warn you that neither of us is a native English speaker, so we will stumble a bit; please be kind. We are also not professional presenters. We spend most of our days in caves inside code mountain, so we stumble occasionally when we come outside, and if we appear unpolished, you know why.
So, with that, I’ll start with the popularity of voice assistants. Stating the obvious: voice assistants are getting more and more popular. What started as a novelty is becoming commonplace. Open any Black Friday flyer and you see a voice assistant for $79 or $99, and they now handle more tasks than they did before.
Not surprisingly, people with low vision and people who are blind have been early adopters of these voice assistants. They can just call out. They don’t have to learn new features or how to use them; they can just talk, so voice assistants have been pretty popular with blind users.
People have been using them to accomplish specific tasks: I want to send a message, I want to call somebody, I want to dictate a short message. What’s surprising, though, is that Windows has had a voice assistant, Cortana, but it’s not very popular. Microsoft has tried to encourage its use; you even get asked about it during setup, but it has not caught on.
So let’s look at what has historically ailed voice interaction on desktop systems. Microphone availability is just not assured on desk-based systems. Laptops have microphones, but in general there is inconsistent microphone availability, whereas on phones you’re always guaranteed a microphone. People are also accustomed to interacting with mouse and keyboard on a PC. Other input mechanics, like touch and joysticks, just haven’t taken off because people have been taught that everything is driven with mouse and keyboard.
The first generation of desktop voice recognition systems was also lacking. You had to train them: you would read a set of phrases to train the recognition system and then hope for the best. And because the training was often not enough, recognition quality was lacking, which led to very tepid experiences.
And this is the most important part: desktop interaction is very extensive and very complex. No two apps are similar in the way you navigate them. Take the example of changing settings in Settings or Control Panel: each settings page has a different layout. Word and Outlook have completely different ways of doing nearly everything, so people often struggle to create a mental model of what they can ask a voice assistant.
And as is often the case, they try once or twice, the voice assistant doesn’t do what they expected, they are now confused about what it can actually do, and they say, well, not worth my time. I give up.
Things have changed, though. There’s a newer generation of desktop users who are more comfortable with voice as an input mechanic. I’ll give you an example. I live in Florida, and there was a power outage, so I didn’t want to bake inside the house. I told my kids, let’s go out, and we came back after four hours. My first instinct was to see whether the power was back on by flipping the switch, but my son goes, Alexa, are you there?
Now, it would never have occurred to me to do that. So, the newer generation is getting a lot more comfortable using voice to drive anything and everything. This is another key factor: desktop-based voice recognition systems had issues because there was just not enough training data, but now, with online speech recognition services from Amazon, Google, and Microsoft, you don’t need to train. You get a zero-configuration, out-of-the-box experience, and that is pretty critical. You turn the switch on, and it’s available to you.
So I want to introduce Sharky and Zoomy, our voice assistants for controlling the screen reader and magnifier. Sharky works with JAWS, the most popular screen reader; Zoomy works with ZoomText, the most popular magnifier; and both work with Fusion, which combines the screen reader and the magnifier. Sharky is named after the shark from Jaws, which has historical connotations for the product, and Zoomy comes from the name ZoomText.
We decided to provide a simple mental model for the user to learn: you can only ask the voice assistant for what the screen reader and magnifier can do. And users often have a good idea of what a screen reader and a magnifier can do. Zoomy is designed to reduce the number of times you have to switch to the screen magnifier application; you will see Roxana demonstrate how it works for ZoomText in subsequent slides. Sharky is designed to reduce the cognitive load of having to remember and look up rarely used keystrokes.
One of our requirements was to coexist with other voice systems. We have users who also have physical disabilities, who often cannot type well and who use Dragon NaturallySpeaking and other voice systems, so we had to interact well with those too. Now I’ll hand it over to Roxana, who will talk about the experience of adding the voice assistant on the screen magnifier side. Roxana?
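[Editor's note: the "simple mental model" described above, where the assistant only handles what the screen reader and magnifier can already do, amounts to dispatching recognized intents over a closed set of existing actions. A toy sketch follows; all names and strings are hypothetical, not Vispero's implementation.]

```python
# Hypothetical sketch: a closed command set. Every voice intent maps to
# an action the screen reader or magnifier already exposes; anything
# outside that set is rejected. Action names are illustrative only.

ACTIONS = {
    "set zoom level": lambda level: f"zoom set to {level}",
    "invert colors": lambda: "colors inverted",
    "list comments": lambda: "comments dialog opened",
}

def dispatch(intent, *args):
    handler = ACTIONS.get(intent)
    if handler is None:
        # Out-of-scope request: the assistant only does what the
        # screen reader / magnifier can do.
        return "Sorry, I did not get that."
    return handler(*args)
```

Keeping the set closed is what makes the mental model learnable: users never have to guess what is in scope, because the scope is the product they already know.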
ROXANA FISCHER: Thank you. Before we started to design a model for a voice assistant, we reflected on the user and the issues with the current interaction. When a user uses a magnifier to enlarge the information on the screen by 2x, the user sees the content two times bigger, but the view is limited to just one fourth of the actual screen. This makes mouse movement challenging.
Moving by mouse to a certain element, like a button or a menu item, becomes at least two times farther. You can experience something similar if you zoom into a Word document and try to move to the end of the document: you have to scroll more than in the original view. There are also directional issues.
Imagine you zoom into a picture and try to find a certain element in it. If this element is out of your current view, you are not able to reach it by the shortest path, as that path is unknown to you. The screen magnifier user has to deal with this kind of navigation every time on the screen. Keystrokes to perform actions exist, but users are primarily mouse-driven. Keystrokes help to avoid searching for elements, but just like most desktop users, magnifier users try to stick with mouse interaction. Also, keystrokes take effort to learn and to remember.
Next slide, please. Thanks. So, taking these aspects into account, we designed the voice assistant for the screen magnifier with the aim of avoiding mouse movement.
With speech interaction, the user can skip this mouse navigation, expressing the intention directly and executing it without searching for a button. The user doesn’t need to switch to the magnification control panel to make changes and can therefore stay in their own application. For example, to activate a ZoomText feature, the user can simply ask for it by voice.
This means the mouse, and therefore the magnified image, can stay in the current application, like a web page. Also, the need to learn and memorize keystrokes disappears. Keystrokes would provide the same advantage of executing features directly, but they are hard to remember.
I would like to show two short videos. The first visualizes the interaction with ZoomText and the challenges a user faces. Can you please start it?
All right, let us look up the landscaping and nursery place in Clearwater. My backyard could use some attention. Mhm, this looks right. Oh, the color scheme on this website is not easy for me to read.
Let’s change the colors and make it a little bit bigger. That should work. The keystroke for invert colors was Caps Lock and– I don’t remember anymore, so I guess I’ll have to use the UI.
All right, let’s do it. So, going down, there’s a little green icon. Yeah, I think this is– yeah, here it is. So, zooming in a little bit, and click on the Color button. That should be fine. All right, let’s go back to our website and see how it looks.
Where exactly was I? I guess– yes, under this picture. Here’s the text. I could have zoomed in a little bit more, but let’s just try and read it. All right, next I’m going to show a demo with the same interaction, just using the voice assistant in addition to it.
Now let’s try it again with the voice assistant. Going to the landscaping and nursery place, Eden, in Clearwater, and selecting the website. The text on the website is still hard for me to read, so I’d like to adjust the color and the size. With the voice assistant, I can zoom in step by step, but I can also set the zoom to a certain level.
I think a zoom level of around three or four would work. Let’s do 3.5. Zoomy, set zoom level to 3.5.
COMPUTER VOICE: 3.5.
ROXANA FISCHER: That is already better, but I also prefer dark text on a white background, so let’s adjust the colors. Zoomy, activate color enhancement.
COMPUTER VOICE: Color enhancements enabled.
ROXANA FISCHER: The color enhancement looks good, but the heading disappeared. Let’s try another color scheme. Zoomy, select Invert Colors. Better. Now I can read the whole text, but actually it would be more convenient to hear it. I know ZoomText can do that, but I never learned the keystroke for it. Let’s try launching it by voice.
Zoomy, start the reader.
COMPUTER VOICE: Landscaping and nursery in Clearwater, FL. Creativity. Eden Nursery has been providing professional landscape services to our local homeowners in the Pasco, Hillsborough, and Pinellas Counties for more than–
ROXANA FISCHER: All right. So, by staying in the application, the user could see the change directly, and therefore also saw that the heading disappeared and could correct it. This is an overview of a selection of features we support. To name a few: Zoomy can toggle ZoomText settings, like activating color enhancement; it can launch ZoomText features like the reader, as we just saw; and it can read the clipboard or selected text to the user.
All right. Back to you, Shriram.
SHRIRAM RAMANATHAN: Thank you. So let’s talk about JAWS and try to understand how a typical JAWS user works. JAWS users are primarily keyboard-driven. They tend to remember many keystrokes for common tasks. I work with a lot of blind co-workers and blind engineers, and I’m impressed by their working memory. They often have a very good capacity to remember keystrokes and drive the computer with them alone.
Still, human memory capacity is limited. It’s difficult to remember keystrokes for less common tasks. Also, your working memory for keystrokes is shaped by when you became blind. If you were blind early in life, you tend to have built up a larger working memory for them, but if you became blind later in life, you tend to struggle with keystrokes.
It’s also made worse by the fact that we assign common tasks the simpler keystrokes. For example, you have say title, which is JAWS key plus T, and say font, which is JAWS key plus F. However, uncommon tasks which are important, just not as common, get long keystroke combinations.
You want to move to the bottom of a column? It’s Alt-Control-Shift-Down Arrow. You want to OCR a document? It’s Insert-Space, O, D: O for OCR, D for “document”. So there is a mental tax that you pay to remember uncommon actions, and it gets worse in some situations. For example, we have one for answering a Skype call: Insert-Space, Y, A.
You ask, why Y? Well, because Insert-Space-S is taken for speech actions. So we are constantly adding useful screen reader actions that would improve productivity, but they often go unused because of the hard-to-remember keystrokes, the mental tax people pay to remember them. So, just like the previous case, we’ll show you a video of the issue and then how the voice assistant can help.
Again, the screen reader benefits are the same: you can skip the process of looking up a keystroke and execute the feature with a speech utterance. You don’t have to remember cryptic keystroke combinations. So, here’s the video–
– I have to finish a research report for my studies with Paula. Let’s look at it.
– Research report on productivity [? has ?] comment insert subtitle here as comment.
– Hmm. Seems like she added some comments to it. I rarely use comments, and I don’t remember how I can read them, but I guess the JAWS command search could help me.
– Space. Search for JAWS commands [INAUDIBLE]. Announce comment. Alt plus Shift plus apostrophe [INAUDIBLE] move the prior comment. Shift plus N [INAUDIBLE] quick navigation keys are used. Move to the prior comment in the Word document.
– But that just works in the quick navigation mode. It means I have to switch modes before using it.
– Move to next comment. [INAUDIBLE] adding level three list comments. Control plus Shift plus apostrophe. Heading levels.
– That could work better. I’ll shift, apostrophe, and Control-Shift-Apostrophe.
– Escape. Research report that word. Edit.
– Let’s look at her comments.
– Reviewer’s comments dialog. I could write the abstract. [INAUDIBLE] We must add keywords about what kind of research we are doing here. We have to explain it more.
– I guess I’ll stop [? at first ?] [? one. ?]
– [? Blend ?] section two page one to text.
– Now the next one.
– Reviewer’s comments dialogue.
SHRIRAM RAMANATHAN: So you just saw the issues someone faces trying to review comments in Word. Now we will see the same example, but with the voice assistant.
– Let’s go back to my research report with Paula. She added comments to it. I don’t remember the keystrokes for listing or working with comments, but let’s see if Sharky can help me. Sharky, list all comments.
– Reviewer’s comments dialogue. List one list view. Do you have an idea for a good subtitle? [? Paula ?] Bauer [INAUDIBLE]. I could write the abstract. Paula Bauer [INAUDIBLE]. We must add keywords.
Paula Bauer [INAUDIBLE]. Next can you fix the references? Paula Bauer [INAUDIBLE]. We need more examples–
– I don’t have to look up and remember the keystrokes for it. This helps to concentrate on the document.
– We have to explain it more. Paula Bauer [INAUDIBLE] we must add keywords.
– Let’s start here.
– Blend section two [INAUDIBLE].
– Done, so next comment. Sharky, next comment.
– Comment. Can you fix the references? By Paula Bauer [INAUDIBLE].
– I don’t have to switch modes to navigate to the next comment if I use Sharky. I guess I have an idea for the subtitle. Let’s go to the first comment again. Sharky, go to first comment.
– Comment. Do you have an idea for a good subtitle? By Paula Bauer.
SHRIRAM RAMANATHAN: So, you just saw this with a voice assistant. Now, I do want to point out who this is useful for. If you’re a lawyer or a paralegal who lives by comments, it will serve you best to learn the keystrokes for this. However, if you’re writing an invite for a church potluck or an HOA meeting, you get comments on it, and you don’t use comments often, this is very useful.
Again, here’s a list of supported features. You can adjust the speech rate, change settings, list all open windows, list spelling errors, do internet navigation like going to the first heading, do Outlook-related tasks, et cetera. So I’ll now hand over again to Roxana, who will talk about how we approached the research process. Roxana?
ROXANA FISCHER: All right. Before designing a voice assistant, we reviewed existing research on voice assistants and the user group of people who are blind. Two research papers were really interesting: Reading Between the Guidelines and Siri Talks at You. They analyzed the current voice assistant landscape with regard to accessibility. In their research, they interviewed blind users and compared the findings with existing guidelines. The main conclusion was that blind users often found the speech output frustrating, slow, and verbose, but voice assistants still have high acceptance among blind users, even though their needs are not directly addressed in the official guidelines.
We also looked at similar research projects, but research in the area is limited. The two closest concepts are VERSE and Capti-Speak: VERSE, a smart speaker from Microsoft Research that includes screen reader navigation, and Capti-Speak, a voice-enabled web screen reader. The researchers in both projects concluded that the systems improve the interaction in general, but they could not replace keyboard interaction.
All the existing research focuses on blind users; there was no research on low-vision users. For our study, we concentrated on screen magnifier users. We created a prototype of Zoomy and sent it to six low-vision users, along with a task sheet. Afterwards, we followed up with a survey and interviews.
The goal was to evaluate the design of the voice assistant and the feature set of Zoomy, with a strong focus on what we could improve before going to production. So, what did we find out? The user study provided us with deep insights. We found that users are mostly fine with a couple of seconds of delay. As we use an online service to analyze the speech input, the system needs a few seconds to respond, similar to voice assistants like Google Assistant and Alexa.
Also, users preferred the wake word over the keyboard command. The research papers say that having a button is important for users; however, in our study, we saw that more people used the wake word than the keystroke. Another important finding concerned the natural language model for the voice assistant: the initial training set of example utterances was created by ZoomText engineers.
The study showed that this set missed some simple and short utterances. Users use complex sentences, but also short, one- or two-term utterances. For example, a user could say “turn on the color enhancement” but also just “invert colors.”
So we learned that you need utterance variations from actual users to create a reliable natural language model for prediction. For further model design, it is helpful to work closely with actual users to support a more natural interaction. Also, like the existing research on voice assistants for blind users, we came to the conclusion that voice interaction cannot replace keyboard or mouse interaction, but it can be helpful in certain situations.
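[Editor's note: the finding above, that engineer-written training utterances missed the short forms real users say, can be illustrated with a toy matcher. The intent names and phrase lists below are hypothetical, not Vispero's actual training data or NLU stack.]

```python
# Hypothetical sketch: one intent, many utterance variations. The long
# forms are what engineers wrote first; the short forms are what users
# in the study actually said. A real NLU service would generalize
# beyond exact matches, but the training-data lesson is the same.

TRAINING_UTTERANCES = {
    "invert_colors": [
        "turn on the color enhancement",   # long, engineer-written form
        "please invert the colors",
        "invert colors",                   # short, user-observed forms
        "invert",
    ],
    "set_zoom": [
        "set zoom level to 3.5",
        "zoom 3.5",
    ],
}

def match_intent(utterance):
    """Toy exact matcher over the normalized utterance."""
    normalized = utterance.strip().lower()
    for intent, phrases in TRAINING_UTTERANCES.items():
        if normalized in phrases:
            return intent
    return None
```

Without the short user-observed phrases in the training set, "invert colors" would fall through to `None`, which is exactly the miss the study surfaced.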
SHRIRAM RAMANATHAN: So, I want to talk about privacy now and how we deal with it. Like most voice assistants, we do local wake word recognition. That is, nothing the microphone hears goes to the online recognition service until the wake word is recognized. We also provide an option for users who are uncomfortable with the microphone listening continuously: you remember one keystroke, invoke the voice assistant with it, speak to it, and get your action done.
We do find that in practice, though, people prefer the wake word. We do not collect any IP addresses or personally identifiable information for voice, nor do we associate any personally identifiable information with the speech stream. This was a requirement for us because our user base tends to be conservative. We did not want to be able to record users or tie anything back to their actions.
All the metrics and speech logs that we get on the server side are anonymized. So, what can we see? We can see what a user asked for and what action we executed, but we never see who the user was. We use this to improve existing actions and to add new commands and actions.
There’s also a firewall between the engineering and production systems. Engineers cannot go dragnet fishing in live data to create new features. All the data is reviewed by the product owner, who decides what new features we want to build.
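[Editor's note: the privacy gating described above, where audio is examined locally and nothing leaves the machine until the wake word fires, can be sketched as follows. `detect_wake_word` and `recognize_online` are stand-ins, not Vispero's real APIs.]

```python
# Hypothetical sketch of local wake-word gating. Pre-wake-word frames
# are dropped on-device; only post-wake-word audio is streamed to the
# online recognition service, with no user identity attached.

def detect_wake_word(frame):
    # Stand-in for a local, on-device keyword spotter.
    return frame == b"sharky"

def recognize_online(frames):
    # Stand-in for the anonymized online speech service; it only ever
    # receives audio captured after the wake word.
    return "set zoom level to 3.5"

def process_microphone(stream):
    sent = []          # the only audio that leaves the machine
    awake = False
    for frame in stream:
        if not awake:
            awake = detect_wake_word(frame)  # local check; frame dropped
        else:
            sent.append(frame)               # streamed for recognition
    return (recognize_online(sent) if sent else None), sent
```

The key property is that `sent` never contains anything heard before the wake word, which is what lets the assistant listen continuously without continuously uploading.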
We support five languages: English, German, French, Dutch, and Spanish. Here’s a short video of it operating in German.
– Zoomy. [SPEAKING GERMAN]
– [SPEAKING GERMAN]
– Zoomy, [SPEAKING GERMAN]
– [SPEAKING GERMAN]
SHRIRAM RAMANATHAN: So, I want to talk about the limitations of the system, though really they are not limitations but reality. It was never meant to compete with existing interaction methods. This is not the first attempt at a voice assistant in the accessibility space, but every previous attempt tried to be the Holy Grail of input: they wanted to replace keyboard and mouse interaction. We don’t intend to do that.
We intend this to be a supplement to keyboard and mouse interaction. It will never be as fast as a keystroke: if you remember the keystroke, you get a response in milliseconds, whereas with the voice assistant, the response is measured in seconds. It’s most useful when you know the feature you want but just don’t use it often, so you use voice to get to it.
Feature discovery is still hard, and this is a problem with all voice assistants: you still don’t know everything that is possible with voice. We do have voice assistant help, where you can ask the assistant what it can do, but the answer can run long, and unless you’re used to listening to long audiobooks, feature discovery can still be hard.
It will not work on systems without an internet connection. We want zero-configuration, zero-training systems, so it will be hard to use in places like banks, which at times do not allow their systems to talk to the internet. It will always be a gimmick to some, especially power users. If you have been using JAWS for 20 years, you will probably not find this useful because you know most of the keystrokes. Our user study indicated the same.
In our study, most participants were very excited, except one power user who said, I can do things faster with the keyboard, why do I want this? But we believe this will be very helpful for new users, and even for power users who operate in a limited space and want to use features they know exist but just can’t remember the keystrokes for.
It’s not a personal assistant like Alexa or Siri. We will never support that range of actions, but that’s by design. We want this to be screen reader specific, because that’s the user base we know best. We understand screen reader users and magnifier users the best, so that’s where we can add value.
So, what does the future hold for the voice assistant? We want to learn from the anonymized speech logs to see what users want and add those commands. We want to add a UI that shows how the system interprets the voice commands you issue. We also have some Bluetooth headset limitations that are technology-related: many Bluetooth headsets cannot produce output while they are listening, so we want to work around that.
I want to end by saying that we have had a good response to the feature. People have found it not only exciting but useful. It’s now question-and-answer time. I see six questions, so I’ll go.
So the first question is from Craig Warren: what if the voice assistant mishears you? Well, you get a response saying, sorry, I did not get that, and you have to reissue the command.
Can you change the name of the assistant? Not yet, but we hear the request. Any name change will probably be limited to a set of possibilities, but right now you do have to say “Sharky” or “Zoomy”. Next, Christian Gonzalez.
Christian Gonzalez asks: I am the trainer for ZoomText and JAWS, but my organization has not yet updated to the 2021 version. I would like to have a list of voice commands, but I’m not able to find one online like I can for the short–