Ian Sansavera, a software architect at a New York startup called Runway AI, typed a short description of what he wanted to see in a video. “A tranquil river in the forest,” he wrote.
Less than two minutes later, an experimental internet service generated a short video of a tranquil river in a forest. The river’s running water glistened in the sun as it cut between trees and ferns, turned a corner and splashed gently over rocks.
Runway, which is about to open its service to a small group of testers, is one of several companies building artificial intelligence technology that will soon let people generate videos simply by typing several words into a box on a computer screen.
They represent the next stage in an industry race — one that includes giants like Microsoft and Google as well as much smaller startups — to create new kinds of artificial intelligence systems that some believe could be the next big thing in technology, as important as web browsers or the iPhone.
The new video-generation systems could speed the work of moviemakers and other digital artists, while becoming a new and quick way to create hard-to-detect online misinformation, making it even harder to tell what’s real on the internet.
The systems are examples of what is known as generative AI, which can instantly create text, images and sounds. Another example is ChatGPT, the online chatbot made by a San Francisco startup, OpenAI, that stunned the tech industry with its abilities late last year.
Google and Meta, Facebook’s parent company, unveiled the first video-generation systems last year, but did not share them with the public because they were worried that the systems could eventually be used to spread disinformation with newfound speed and efficiency.
But Runway’s CEO, Cris Valenzuela, said he believed the technology was too important to keep in a research lab, despite its risks. “This is one of the single most impressive technologies we have built in the last hundred years,” he said. “You need to have people actually using it.”
The ability to edit and manipulate film and video is nothing new, of course. Filmmakers have been doing it for more than a century. In recent years, researchers and digital artists have been using various AI technologies and software programs to create and edit videos that are often called deepfake videos.
But systems like the one Runway has created could, in time, replace editing skills with the press of a button.
Runway’s technology generates videos from any short description. To start, you simply type a description much as you would type a quick note.
That works best if the scene has some action — but not too much action — something like “a rainy day in the big city” or “a dog with a mobile phone in the park.” Hit enter, and the system generates a video in a minute or two.
The technology can reproduce common images, like a cat sleeping on a rug. Or it can combine disparate concepts to generate videos that are strangely amusing, like a cow at a birthday party.
The videos are only four seconds long, and the footage is choppy and blurry if you look closely. Sometimes, the images are weird, distorted and disturbing. The system has a way of merging animals like dogs and cats with inanimate objects like balls and mobile phones. But given the right prompt, it produces videos that show where the technology is headed.
“At this point, if I see a high-resolution video, I am probably going to trust it,” said Phillip Isola, a professor at the Massachusetts Institute of Technology who specialises in AI. “But that will change pretty quickly.”
Like other generative AI technologies, Runway’s system learns by analysing digital data: in this case, photos, videos and captions describing what those images contain. By training this kind of technology on increasingly large amounts of data, researchers are confident they can rapidly improve and expand its skills. Soon, experts believe, such systems will generate professional-looking mini-movies, complete with music and dialogue.
It is difficult to define what the system creates currently. It’s not a photo. It’s not a cartoon. It’s a collection of a lot of pixels blended together to create a realistic video. The company plans to offer its technology with other tools that it believes will speed up the work of professional artists.
Several startups, including OpenAI, have released similar technology that can generate still images from short prompts like “photo of a teddy bear riding a skateboard in Times Square.” And the rapid advancement of AI-generated photos could suggest where the new video technology is going.
Last month, social media services were teeming with images of Pope Francis in a white Balenciaga puffer coat — surprisingly trendy attire for an 86-year-old pontiff. But the images were not real. A 31-year-old construction worker from Chicago had created the viral sensation using a popular AI tool called Midjourney.
Isola has spent years building and testing this kind of technology, first as a researcher at the University of California, Berkeley, and at OpenAI, and then as a professor at MIT. Still, he was fooled by the sharp, high-resolution but completely fake images of Pope Francis.
“There was a time when people would post deepfakes, and they wouldn’t fool me, because they were so outlandish or not very realistic,” he said. “Now, we can’t take any of the images we see on the internet at face value.”
Midjourney is one of many services that can generate realistic still images from a short prompt. Others include Stable Diffusion and DALL-E, an OpenAI technology that started this wave of photo generators when it was unveiled a year ago.
Midjourney relies on a neural network, which learns its skills by analysing enormous amounts of data. It looks for patterns as it combs through millions of digital images as well as text captions that describe what each image depicts.
When someone describes an image for the system, it generates a list of features that the image might include. One feature might be the curve at the top of a dog’s ear. Another might be the edge of a mobile phone. Then a second neural network, called a diffusion model, generates the pixels needed to render those features and gradually refines them into a coherent image.
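For readers who want a feel for that refinement step, here is a toy sketch in Python. It is not Midjourney's code or any real system's. It starts from random noise and removes a little of it at each step, which is the general loop shape of a diffusion sampler. The function name, the step size of 0.2 and the tiny four-pixel "image" are all assumptions made for illustration. The biggest shortcut is labelled in the comments: a real model uses a trained network to predict the noise, while this sketch cheats by comparing against a known target.

```python
import random

def toy_denoise(target, steps=50, seed=0):
    """Refine random noise toward `target`, one small step at a time.

    Hypothetical illustration only: `target` stands in for what a trained
    network would infer from the text description.
    """
    rng = random.Random(seed)
    # Start from pure noise, one random value per pixel.
    pixels = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        # Stand-in for the learned network. A real diffusion model
        # PREDICTS this noise; here it is computed from the known target.
        predicted_noise = [p - t for p, t in zip(pixels, target)]
        # Remove a fraction (assumed 0.2) of the predicted noise.
        pixels = [p - 0.2 * n for p, n in zip(pixels, predicted_noise)]
    return pixels

# A four-pixel "image": dark, mid-grey, bright, mid-grey.
target_image = [0.0, 0.5, 1.0, 0.5]
result = toy_denoise(target_image)
```

After 50 steps the values in `result` sit very close to `target_image`. Running fewer steps leaves more of the original noise behind, a rough analogue of a blurry, half-formed picture.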
Companies like Runway, which has roughly 40 employees and has raised $95.5 million, are using this technique to generate moving images. By analysing thousands of videos, their technology can learn to string many still images together in a similarly coherent way.
“A video is just a series of frames — still images — that are combined in a way that gives the illusion of movement,” Valenzuela said. “The trick lies in training a model that understands the relationship and consistency between each frame.”
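The consistency Valenzuela describes can be put in rough numerical terms. The Python sketch below is not Runway's method and does not train anything. It only measures how much the picture changes from one frame to the next, on the assumption that smooth footage changes a little at a time while choppy footage jumps around. The function names and the tiny single-brightness frames are hypothetical, chosen to keep the arithmetic easy to follow.

```python
from typing import List

# A grayscale frame: rows of pixel brightness values between 0 and 1.
Frame = List[List[float]]

def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute brightness change between two same-sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count

def temporal_consistency(video: List[Frame]) -> float:
    """Average change between consecutive frames; lower means smoother."""
    diffs = [frame_difference(video[i], video[i + 1])
             for i in range(len(video) - 1)]
    return sum(diffs) / len(diffs)

def flat_frame(brightness: float, width: int = 4, height: int = 2) -> Frame:
    """Hypothetical helper: a frame where every pixel has one brightness."""
    return [[brightness] * width for _ in range(height)]

# A scene that brightens steadily, versus the same frames out of order.
smooth = [flat_frame(b) for b in (0.0, 0.25, 0.5, 0.75)]
choppy = [flat_frame(b) for b in (0.0, 0.75, 0.25, 0.5)]

smooth_score = temporal_consistency(smooth)
choppy_score = temporal_consistency(choppy)
```

The two clips contain exactly the same frames, yet the shuffled one scores twice as high, which is the point of the quote: the individual images matter less than how each one relates to the next.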
Like early versions of tools such as DALL-E and Midjourney, the technology sometimes combines concepts and images in curious ways. If you ask for a teddy bear playing basketball, it might give you a kind of mutant stuffed animal with a basketball for a hand. If you ask for a dog with a mobile phone in the park, it might give you a mobile phone-wielding pup with an oddly human body.
But experts believe they can iron out the flaws as they train their systems on more and more data. They believe the technology will ultimately make video-creation as easy as writing a sentence.
“In the old days, to do anything remotely like this, you had to have a camera. You had to have props. You had to have a location. You had to have permission. You had to have money,” said Susan Bonser, an author and publisher in Pennsylvania who has been experimenting with early incarnations of generative video technology.
“You don’t have to have any of that now. You can just sit down and imagine it.”
- This article originally appeared in The New York Times