Peter Thiel once lamented: “We wanted flying cars and all we got was 140 characters”.
His point was that truly groundbreaking technological progress seemed to have slowed down dramatically. In the 20th century alone we got cars, planes, and electricity, put a man on the moon, and invented the internet. But - so his reasoning went - in the 21st century, capital (and time) was sucked up mostly by companies seeking new ways to entertain us (Facebook, Twitter, Netflix) or sell us things (e.g. Google ads).
Innovations around hard tech seemed to be dead or (best case) progressing at a snail’s pace.
With the benefit of hindsight this “progress doomerism” is demonstrably false, and for a fascinating reason that has to do with how we make use of the latest neural networks (specifically: those that are helping us figure out driverless cars, ChatGPT, generative AI and so on). Let me explain.
Consider the resources needed to create ChatGPT:

- Neural network architectures
- Data (written text)

Now consider the resources needed to create self-driving cars:

- Neural network architectures
- Data (video of humans driving)

Now consider the resources needed to create image generators:

- Neural network architectures
- Data (images)

Now consider the resources needed to create human-like voice generators:

- Neural network architectures
- Data (audio)

Finally, consider the resources that will be needed to create humanoid robots:

- Neural network architectures
- Data (multi-modal - e.g. video, language, audio, etc.)
Apologies for the repetition but it will help me make my point.
Artificial Intelligence (AI) is a technology platform in the same way that electricity and the internet were, and I expect we’ll find that its impact will be far greater. In order to create artificial intelligence we need better chips, better neural network architectures, and more data.
Social media created an enormous amount of demand for better chip technologies, as have video games and Google’s desire to show us more relevant advertising.
The same is true of advances in neural networks (henceforth NNs). Most of the investment into NNs has come from tech giants trying to entertain us and sell us things.
Similarly, as we have learned over the past five years, the size of our training data sets is half the battle in determining how effective the latest AI technologies (e.g. large language models) are.
Put differently - someone “wasting time” doomscrolling videos on Youtube is in fact someone creating demand for all of the things we need to build true artificial general intelligence (AGI).
Each minute they spend watching a video is pushing Google to build better chips and create more efficient architectures.
Each minute they spend watching a video is rewarding the creator who made the content - and hence incentivizing the creation of yet more content. Content that is in fact an incredibly valuable resource (audio, video and text data) which we will use to train artificial intelligences.
In one sense, creating and consuming content is analogous to mining the Earth for natural resources.
It’s sort of like Mr. Miyagi teaching the Karate Kid to “wax on, wax off”. The Karate Kid thought doing chores was a pointless activity, but the moment he started to fight he realized that all of the time he had invested in doing chores was well spent.
In this analogy Mr. Miyagi represents the all-knowing universe - which has always known that the fastest path to innovation is to satisfy the voracious demands of consumers.
The Karate Kid lamenting how pointless chores are represents the “progress doomers” who wish people would stop investing in new forms of entertainment and spend their time building hard technologies.
With the benefit of hindsight it seems plainly obvious that the fastest path to wondrous new technologies (humanoid robots, driverless cars) and even new paradigm-shifting technology platforms (e.g. artificial intelligence) was precisely the path we took.
The last technology we’ll ever need
The sum total invested into general purpose robots is likely less than $1 billion. I wasn’t able to find an exact figure, but based on the funding rounds of the various companies working on the problem, that is ballpark accurate. Contrast that against the amount that has been invested into developing other mechanical objects like cars and planes (or even tractors) - it is a rounding error.
This makes perfect sense, because until recently there was no point in building a general purpose robot. The “brain” piece hadn’t been solved. What good is a mechanical skeleton that can dexterously move about the world if it has no idea where to go or what to do?
The brain piece is nearly solved. The technology Tesla is using to teach its robots (cars) to drive is largely the same technology that you would use to teach a robot how to walk around a warehouse, pick up items, copy a human action (and so on).
Tesla’s latest full-self-driving (FSD) version 12 software is only 3,000 lines of code (down from 300,000 lines of code in FSD 11). The reason it is so short is that it essentially says the following:
“Watch what humans do - then do that”.
It doesn’t have specific lines of code that address how to handle intersections, roundabouts, red-lights, or what to do if you are driving next to a child on a bike who is wobbling back and forth.
It just tells the system to watch incomprehensibly large amounts of video data and then do what humans do.
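The “watch what humans do - then do that” recipe is, at its core, imitation learning. Here is a deliberately tiny sketch of the idea in Python. Everything in it is an illustrative stand-in - real FSD-style systems train large neural networks on video, not a nearest-neighbor lookup over four data points:

```python
# Toy sketch of imitation learning: mimic whatever a human did in the
# most similar recorded situation. Illustrative only - real systems
# replace this nearest-neighbor lookup with a trained neural network.

def nearest_demo_action(state, demonstrations):
    """Return the action a human took in the most similar recorded state."""
    closest = min(demonstrations, key=lambda d: abs(d[0] - state))
    return closest[1]

# Hypothetical demonstrations: (distance_to_stop_sign_m, speed_mph)
# pairs harvested from videos of human drivers.
demos = [(50, 30), (20, 15), (5, 3), (0, 0)]

# The learned "policy" has no explicit rule for stop signs - it just
# copies what the nearest human did.
print(nearest_demo_action(18, demos))  # prints 15
```

Note there is no line of code about stop signs, intersections, or braking - the slowing-down behavior emerges entirely from the recorded human examples, which is the point of the approach.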
Quick aside - Tesla discovered a funny statistic while training their FSD software: only 1 out of every 200 people comes to a full stop at a stop sign if there are no cars in front of them. Transport authorities have forced Tesla to override the new FSD’s natural tendency to roll through stop signs (as humans do), but people are petitioning them to remove this requirement. Oh what interesting times we live in!
This is why Tesla is now working on what they call Optimus - a humanoid robot. They realized that the underlying technology one would use to create self-driving cars is the same technology that you would use to train a robot to interact with the world in other ways / locations.
Remember - UNLIKE a car - a humanoid robot (or any general purpose robot capable of moving around the world) does not have to be perfect to be usable. A humanoid robot accidentally walking in the wrong direction or picking up the wrong item does not result in a deadly accident. So - in many ways - building a functional humanoid that can do things like help move boxes or pick and pack items in fulfillment centers is an easier task than building a self-driving car.
Rather than being trained on videos of people driving, these new robots might be trained by watching every video on Youtube.
I’d like to take a moment to thank everyone who has “wasted” countless hours watching videos for their contribution to the data pool and tools that will lead to the creation of AGI and thereafter a world of unimaginable abundance.
Tesla is not the only company to realize that we are just around the corner from having functional general purpose robots.
Now that it seems obvious the brain component will be largely solved within the next couple of years, I expect we will see a surge of funding (at least 10X) into robot skeletons over the next 12-24 months (this includes increased investment in internal projects like Optimus).
Obviously the first iterations of these robots will not be artificially generally intelligent - meaning they will not be human-equivalent. But human-equivalence may be closer than most people think. Once human-level intelligence has been created digitally and embodied in a humanoid form factor, we will have effectively invented the last thing we ever need to invent. The robots will take it from there…
How close might we be?
People (including me in the past) have often pointed out that a driverless car is still worse after being trained on hundreds of millions of hours of video than a teenager is after only 15 hours of practice. Doesn’t this imply that we are nowhere near reaching artificial general intelligence?
A teenager has experienced 131,400 hours of life by the age of 15 (365 X 24 X 15). However, each hour of human life incorporates a far higher data load than an hour of video watching - in large part because the human is taking in data multi-modally (e.g. sound, vision, language, touch, etc).
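The arithmetic above, and the comparison to video-trained driving models, can be checked in a couple of lines. The 300-million-hour figure below is a hypothetical stand-in for “hundreds of millions of hours”, not a number from any real system:

```python
# Back-of-envelope check of the numbers above (illustrative only).
hours_of_life = 365 * 24 * 15  # a 15-year-old's total hours of life
print(hours_of_life)           # prints 131400

# Suppose, purely for illustration, a driving model sees 300 million
# hours of video. The teenager gets by on roughly 2,300x fewer hours of
# raw experience - the essay's argument is that each human hour carries
# far more data, because it is taken in multi-modally.
video_hours = 300_000_000      # hypothetical stand-in figure
print(round(video_hours / hours_of_life))  # prints 2283
```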
Consider the data load of a child “experiencing” someone dropping a vase.
Inputs/learnings would include the following concepts:
Gravity - the vase falls
Value - dad yelling in the background that the vase was expensive
Pain - mom instructing everyone to stop walking until the glass shards had been picked up
Material science - the glass shatters
Cause and effect - associating the crash of breaking glass with the future danger of needing to avoid shards on the ground
Sound - that which fragile things make when they shatter
Language colloquialisms - hearing the older brother say “Well that’s just fucking great”, and learning that in this case the comment actually means the opposite of what he said
I could go on forever.
Our human ability to take in data multi-modally makes us far more efficient learners than computers. And virtually all of the information needed to learn to drive is something we have been taking in since the day we were born. The reality is that everyone essentially knows 99.99% of what they need to know to drive before they even start learning.
Consider the example of teaching a car what to do if it sees a child wobbling on a bicycle up ahead. Before it starts its training it has no concept of “wobbling” (e.g. the laws of physics). It doesn’t know that wobbling could lead to falling over. Now put yourself in the shoes of a driverless car. If your only method to learn about the concept of “wobbling” was to watch videos of people driving - how many hours would you need to watch to run across enough examples of “wobbling” to learn the concept? The answer is some very large number.
As another example, consider the situation of driving on the highway and seeing something in the road ahead. It takes a human only milliseconds to determine whether an object is likely to be light or heavy, and whether it is worth swerving around or plowing ahead. A human would immediately identify an empty cardboard box as something not worth swerving to avoid - they would simply run right over it. How many hours of driving video would it take to learn what “light” or “heavy” objects look like?
The first (pre-ChatGPT) iterations of large language models were so unimpressive that almost nobody had heard of them.
ChatGPT was so impressive that it became the fastest growing product of all time, reaching 100 million users in a matter of months. Even so, it still wasn’t able to perform as well as educated humans on things like the SAT, AP exams, the LSAT, the MCAT, and so on.
GPT-4 was released less than 6 months after ChatGPT, and is now capable of performing in the 90th percentile on most exams. It received a 5 (the highest score possible) on the AP Art History, Biology, Environmental Science, Macroeconomics, Microeconomics, Psychology, Statistics, US Government, and US History exams. It received a 4 on AP Physics 2, Calculus, Chemistry, and World History. It scored a 163 on the LSAT - putting it in the 88th percentile and well within range for entry into top law schools.
Here’s my point. To date, computers have successfully made up for their lack of common sense and general understanding of how the world works through their ability to process enormous amounts of information. But what will happen when computers are able to take in information multi-modally just as humans do? What will happen when large language models are combined with video, image, sound, and eventually touch? (Yes, folks are working on that as we speak.)
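Mechanically, “taking in information multi-modally” usually means giving each modality its own encoder and fusing the per-modality features into one joint representation that downstream reasoning operates on. The sketch below uses toy hand-written encoders purely as stand-ins for real neural encoders - every function and number here is hypothetical:

```python
# Minimal sketch of multi-modal fusion. Each encoder below is a toy
# stand-in for a real neural network encoder (vision, language, audio).

def encode_text(tokens):   # stand-in for a language-model encoder
    return [len(tokens), sum(len(t) for t in tokens)]

def encode_image(pixels):  # stand-in for a vision encoder
    return [sum(pixels) / len(pixels), max(pixels)]

def encode_audio(samples): # stand-in for an audio encoder
    return [sum(abs(s) for s in samples) / len(samples)]

def fuse(*feature_vectors):
    """Concatenate per-modality features into one joint representation."""
    joint = []
    for vec in feature_vectors:
        joint.extend(vec)
    return joint

joint = fuse(
    encode_text(["kid", "wobbling", "on", "a", "bike"]),
    encode_image([0.1, 0.9, 0.4]),
    encode_audio([-0.2, 0.5]),
)
print(len(joint))  # prints 5 - one joint feature vector across modalities
```

In real multi-modal systems the encoders are large neural networks and the fusion step is itself learned, but the shape of the idea - separate streams flowing into one shared representation - is the same.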
Revisit our example of learning what it means to wobble. Here’s how GPT-4 responded to the prompt: “What should a driver do if it sees a kid wobbling on a bicycle up ahead”:
Obviously GPT-4 understands exactly what it means to wobble.
Now imagine how game-changing it will be when machines can also use the entire corpus of text to learn how to interact with the physical world around them…
Today we have neural networks being trained on video (Tesla FSD), sound (ultra-realistic text-to-speech already exists, most people just haven’t run across it yet), images (Midjourney, Stable Diffusion, Adobe’s latest product, etc), and language (ChatGPT/GPT-4). What we had yet to see was someone integrate all of the above into a single product. Google has now done just that and plans to release it in December. It’s called Gemini. Nomadev did a great job of explaining what it is and why it will be so powerful in this article. Here’s a snippet introducing the product:
My intuition is that effective multi-modal integration is what will lead to unlocking true artificial general intelligence.
I don’t know how much of “common sense” and “general intelligence” comes from each input modality (sight, touch, sound, etc.), nor do I know whether all modalities are required to achieve AGI (I doubt it). But once we have robots moving around the world - picking things up and putting them down, watching humans interact - they will have all of the inputs we have.
AGI is closer than most people think…