LLMs will result in voice becoming a 4th component of the GUI. This will be a bigger improvement to the UI than the switch from a BlackBerry keyboard to a touchscreen
I’ve been Tweeting a lot lately and writing on Commonstock about how LLMs are going to turn voice into a 4th component of the “GUI”. GUI - which stands for graphical user interface - is in quotes because voice doesn’t fit neatly under the acronym. Still, when people think of the GUI they think of the “mechanics with which humans interact with technology”, and that does fit.
The QWERTY keyboard was invented in the late 1800s. The mouse was invented in the 1960s. The computer monitor came into widespread use in the early 1970s. Essentially, humans have been interacting with technology in mostly the same way for the past fifty-plus years. The only major change worth noting is that we now use touch (primarily on smartphones) in place of a mouse.
Have you ever wondered why you can’t ask Siri to do things like “Check my bank balance”, or “Open my Southwest app and look for a plane ticket to Atlanta”?
Or, have you ever wondered why you can’t just ask Microsoft Excel to sort the table by column F instead of column J?
The answer is simple and it has nothing to do with the technological feasibility of connecting Siri to our apps (which is already possible and done in limited cases).
The answer is that the user experience of using voice to navigate and interact with apps would have been horrendous. If you have ever used Siri, Alexa, or anything of the sort, then you know how awful they are at understanding intent. While they can do simple things like tell you the weather or set a timer, anything more complex results in them essentially passing keywords from whatever you said into a search box and saying “Here’s what I found”.
Again, let me stress this point because it’s super important: the technology to use voice to navigate and interact with apps already exists; what has been missing is the ability of the digital assistant (e.g. Siri) to convert everyday language into an understanding of what we want (which I call “intent”).
This problem has now been solved. ChatGPT already understands intent nearly as well as humans do, and the new version coming out this year will be better still.
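To make “intent” concrete, here is a minimal sketch of what an assistant could do once an LLM sits in the middle. Everything in it is hypothetical: the llm_complete helper stands in for whatever LLM API gets used, and the JSON fields are ones I made up purely for illustration.

```python
# A minimal sketch of "intent extraction": ask an LLM to turn a spoken request
# into structured data an assistant could act on. llm_complete() is a
# placeholder for whatever LLM API you use; the intent schema is invented
# purely for illustration.
import json


def llm_complete(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


INTENT_PROMPT = (
    "Convert the user's request into JSON with the fields "
    '"app", "action", and "parameters". Respond with JSON only.\n\n'
    "Request: {request}"
)


def extract_intent(spoken_request: str) -> dict:
    """Return a structured intent the assistant could hand off to an app."""
    prompt = INTENT_PROMPT.format(request=spoken_request)
    return json.loads(llm_complete(prompt))


# "Open my Southwest app and look for a plane ticket to Atlanta" might come back as:
# {"app": "Southwest", "action": "search_flights",
#  "parameters": {"destination": "Atlanta"}}
```

The point is that the LLM does the hard part - mapping messy human phrasing onto a structure a regular program can execute - which is exactly the piece Siri and friends have been missing.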
When you click with a mouse the computer knows exactly what you want. You want to open the app you clicked on, or edit a line in a Word document, or move an image in a PowerPoint presentation.
The same is true with the keyboard. If you type in a web address and hit enter - you want to go to that webpage.
There is no ambiguity; the computer knows with certainty what you’re trying to do. That wasn’t true of voice commands - until recently.
Let me stress one more point: voice-to-text transcription technology exists today, and people use it all the time. Transcription error rates were already dropping fast, but ChatGPT has effectively dropped them to near zero (I described how in a previous post). Essentially, even if the transcriber garbles some words, the LLM is able to reverse engineer what the speaker was trying to say.
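To illustrate what that clean-up step could look like, here is another rough sketch. The prompt wording and the llm_complete placeholder (carried over from the sketch above) are assumptions of mine, not how Siri, ChatGPT, or any transcription product actually works under the hood.

```python
# Sketch of the transcript clean-up step: hand a garbled voice-to-text output
# to the LLM and ask it to reconstruct what the speaker most likely said.
# Reuses the hypothetical llm_complete() placeholder from the earlier sketch.
def clean_transcript(raw_transcript: str) -> str:
    prompt = (
        "The following text came from an imperfect speech-to-text system. "
        "Rewrite it as the sentence the speaker most likely intended, "
        "changing as little as possible:\n\n" + raw_transcript
    )
    return llm_complete(prompt)


# e.g. "open my south west app and look four a plain ticket to atlanta"
# should come back as "Open my Southwest app and look for a plane ticket to Atlanta."
```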
Digital assistants like Siri probably need to understand at least 90% of query intent for the user experience to be good enough to justify connecting them to everything. After many hours of using ChatGPT, I’m confident that it and the LLMs to come will easily clear this threshold.
This connection will take place sooner than most people think. This is what Microsoft CEO Satya Nadella had to say yesterday at their Bing/ChatGPT integration launch event (I pulled the text from Ben Thompson’s most recent Stratechery post):
The first few lines of the top paragraph are the ones I want to call attention to, specifically: “All computer interaction is going to be mediated with an agent helping you”. Satya uses the term “co-pilot”. Not coincidentally, Copilot is also the name of the Microsoft product that helps programmers write code.
Microsoft has already announced that LLM technology is getting integrated into the Office suite of products and Teams. They’ve also just announced that they will be releasing a new platform that lets any company build on top of ChatGPT’s underlying technology.
I think it won’t be long before people spend as much time using their voice to interact with technology as they do the mouse or keyboard - and not long after that before voice becomes the primary medium for communicating with technology.
I’ll finish the post with three interesting takeaways.
Siri, Hey Google, Alexa, Cortana, etc. will all be getting upgrades
As you can imagine, being able to navigate with your voice will be an incredible user experience. It will be a bigger improvement than the switch from a BlackBerry keyboard to a touchscreen. As sticky as some products (especially Apple’s) may be, no company will risk losing users by not offering this new way to interact.
Yesterday I ran across a company called Embra that is already building an app on top of ChatGPT for Macs that will let you query different apps on your computer. You will be able to ask it to do things like:
Search Google
Find text inside of a PDF (or summarize text from a PDF)
Search messages on Slack
Respond to emails
This app is text-based, but as I explained above, it’s only a matter of time before voice is connected.
That said, it is apps like this (in addition to competition from Google, Microsoft, etc.) that will force Apple to release their own digital assistant (or upgrade Siri) sooner rather than later.
It’s one thing to have a chat interface with apps; it’s quite a different thing for voice to be the medium. Voice is too personal an interface for Apple to cede to someone else. They won’t ban apps like Embra, but they will release their own version that effectively accomplishes the same thing.
The big question is how long it will take before the upgrade is technologically feasible. Response times will have to stay similar to what they are today even though the model working in the background will require far more compute.
If anyone has any insights as to the technological details required to make this happen please reach out and let me know.
People are going to spend a lot less time sitting down
Or, at the very least, they are going to spend much more time sitting in comfortable positions.
Keyboards and mice require a desk to rest on. They require a body in a specific position.
Yes, I am aware of standing desks. I have a standing desk and in the past even had a desk with a treadmill underneath so I could walk while working. Still…
Standing desks don’t solve the problem of spending extended periods of time in the same position. Further, many people cannot - in fact - “walk and chew gum” at the same time. For example, I know that there are certain tasks I can do standing up - like data entry or having a conversation on Zoom. What I cannot do while standing up are activities that require a great deal of thought - like writing or doing complicated math. I’m sure this is not unique to me.
Voice is going to enable us to work in whatever position we want for an increasing portion of our day as the technology improves.
There is another related takeaway here - I expect monitor sizes to increase. Monitors are the size they are because people are used to interacting with them up close. We’ll need bigger monitors if we’re going to see what’s on the screen from farther away.
AR/VR/XR headsets are going to become far more attractive
I believe voice may be the catalyst for extended reality headsets to gain significant market share at the expense of monitors. Already the Meta Quest Pro offers a user experience far superior to large monitors (at least for some uses and some people). I have a friend who spends hours a day programming with one, able to sit in bed or in a La-Z-Boy while he works. Still, he has to carry around a mouse and keyboard…
Apple is releasing a headset this year that is rumored to be far more comfortable than anything on the market today, in large part because they detached the battery to lessen the weight. Apple’s headset will also use cameras that watch your hands so you can use pinching motions to interact with apps. We’ll see how that goes, but one thing is certain - once voice becomes the primary input medium, the user experience will be WILD.
Imagine walking around your home office (or even outside) with a mixed-reality headset. You can see everything around you so there is no danger of tripping over anything, but right in front of you are giant monitors for you to work on. You might have one screen set to Teams so you can see your workmates, another screen turned into a giant whiteboard for collaboration, and a third screen connected to your newborn’s crib camera. No controller needed…