OpenAI collapses media reality with Sora, a photorealistic AI video generator

Snapshots of three videos generated with OpenAI’s Sora.

On Thursday, OpenAI announced Sora, a text-to-video AI model that can generate 60-second-long photorealistic HD videos from written descriptions. While it is only a research preview that we have not tested, it reportedly creates synthetic video (but no audio yet) with higher fidelity and consistency than any text-to-video model currently available. It’s also scaring people.

“It was nice meeting you all. Please tell your grandchildren about my videos and the lengths we went to to film them,” Wall Street Journal technology reporter Joanna Stern wrote on X.

“This could be the ‘holy shit’ moment of AI,” wrote Tom Warren of The Verge.

“Each of these videos is AI-generated, and if this doesn’t worry you at least a little bit, nothing will,” tweeted YouTube technology journalist Marques Brownlee.

For future reference, since this kind of panic will one day seem ridiculous: there is a generation of people who grew up believing that photorealistic video must be created by cameras. When video was faked (for example, in Hollywood movies), it took a lot of time, money, and effort, and the results weren’t perfect. That gave people a baseline level of comfort that what they were seeing remotely was probably true, or at least representative of some kind of underlying truth. Even when the kid jumped over the lava, there was at least one kid and one room.

The prompt that generated the video above: “A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.”

Technology like Sora eliminates that kind of media frame of reference. Soon, every photorealistic video you see online could be 100 percent fake in every way. Plus, every historical video you watch could also be fake. How to deal with this as a society and work around it while maintaining trust in remote communications is well beyond the scope of this article, but I tried my hand at offering some solutions back in 2020, when all the technology we’re seeing now seemed like a distant fantasy to most people.

In that article, I called the moment when truth and fiction in the media become indistinguishable the “cultural singularity.” It looks like OpenAI is on track to make that prediction come true a little sooner than we expected.

Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.

OpenAI has found that, like other AI models that use the transformer architecture, Sora scales with available compute. With much more powerful computers behind the scenes, AI video fidelity could improve considerably over time. In other words, this is the “worst” AI-generated video you will ever see. There’s no synchronized sound yet, but that could be addressed in future models.

How (we think) they did it

AI video synthesis has progressed in leaps and bounds over the past two years. We first covered text-to-video models in September 2022 with Meta’s Make-A-Video. A month later, Google showed off Imagen Video. And just 11 months ago, an AI-generated version of Will Smith eating spaghetti went viral. In May of last year, what was previously considered the front-runner in the text-to-video space, Runway Gen-2, helped create a fake beer commercial full of twisted monstrosities, generated in two-second increments. In earlier video-generation models, people morphed in and out of reality, limbs flowed together like spaghetti, and physics didn’t seem to matter.

Sora (which means “sky” in Japanese) appears to be something completely different. It is high resolution (1920×1080), can generate temporally consistent video (maintaining the same subject over time) lasting up to 60 seconds, and appears to follow text prompts with high fidelity. So how did OpenAI achieve it?

OpenAI does not typically share internal technical details with the press, so we are left to speculate based on expert theories and information provided to the public.

OpenAI says Sora is a diffusion model, much like DALL-E 3 and Stable Diffusion. It generates a video starting with noise and “gradually transforms it by removing the noise in many steps,” the company explains. It “recognizes” objects and concepts listed in the written prompt and pulls them out of the noise, so to speak, until a coherent series of video frames emerges.
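
To make that intuition concrete, here is a minimal, hypothetical sketch of reverse diffusion in Python. This is not OpenAI’s code; the `denoiser` network, its signature, and the naive update rule are all assumptions for illustration. The point is only the shape of the loop: start from pure noise and repeatedly subtract the noise the model predicts, conditioned on the prompt, until frames emerge.

```python
import torch

# Hypothetical sketch of reverse diffusion for video -- not OpenAI's code.
# `denoiser` stands in for a learned network that predicts the noise present
# in its input, conditioned on an embedding of the text prompt.
def generate_video(denoiser, prompt_embedding, num_frames=16,
                   height=64, width=64, steps=50):
    # Start from pure Gaussian noise: (frames, channels, height, width).
    x = torch.randn(num_frames, 3, height, width)
    for t in reversed(range(steps)):
        # Estimate the noise still mixed into x at step t.
        predicted_noise = denoiser(x, timestep=t, cond=prompt_embedding)
        # Remove a fraction of it. Real samplers (DDPM/DDIM) use carefully
        # derived schedules and may re-inject a little fresh noise each step.
        x = x - predicted_noise / steps
    return x  # ideally, a coherent stack of video frames
```

Production systems operate in a learned latent space and use principled noise schedules rather than this naive subtraction; the toy loop above just shows why many small denoising steps gradually turn static into a picture.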

Sora can generate videos all at once from a text prompt, extend existing videos, or generate videos from still images. It achieves temporal consistency by giving the model “foresight” of many frames at once, as OpenAI calls it, solving the problem of ensuring that a generated subject remains the same even if it temporarily falls out of view.

OpenAI represents video as collections of smaller groups of data called “patches,” which the company says are similar to tokens (fragments of a word) in GPT-4. “By unifying the way we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios,” the company writes.
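
As a rough illustration of what a spacetime “patch” might look like (the actual patch sizes and layout are not disclosed, so the numbers below are assumptions), a video tensor can be cut into small blocks spanning a few frames and a small spatial region, then flattened into a token-like sequence for a transformer:

```python
import torch

# Illustrative only: chop a video tensor into spacetime "patches" that a
# transformer can treat as a sequence of tokens, analogous to word fragments.
def video_to_patches(video, pt=4, ph=16, pw=16):
    # video: (frames, channels, height, width); dimensions are assumed to be
    # divisible by the patch sizes for simplicity.
    f, c, h, w = video.shape
    patches = (
        video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
             .permute(0, 3, 5, 1, 2, 4, 6)   # group values by patch position
             .reshape(-1, pt * c * ph * pw)  # one flat vector per patch
    )
    return patches  # shape: (num_patches, patch_dim)

clip = torch.randn(16, 3, 128, 128)          # dummy 16-frame clip
tokens = video_to_patches(clip)
print(tokens.shape)                          # torch.Size([256, 3072])
```

Because any clip, whatever its duration, resolution, or aspect ratio, reduces to such a sequence, one model can in principle be trained on all of them, which is the unification OpenAI is describing.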

An important tool in OpenAI’s bag of tricks is that its use of AI models is compounding: earlier models help create more complex ones. Sora follows prompts well because, like DALL-E 3, it uses synthetic captions that describe scenes in its training data, generated by another AI model such as GPT-4V. And the company isn’t stopping there. “Sora serves as a foundation for models that can understand and simulate the real world,” writes OpenAI, “a capability we believe will be an important milestone in achieving AGI.”
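
To show how such a recaptioning step might look in practice (this is our guess at the workflow, not OpenAI’s pipeline; the frame-sampling helper, prompt wording, and model choice are our own), one could sample frames from each training clip and ask a vision-language model like GPT-4V for a dense synthetic caption:

```python
import base64
import cv2  # OpenCV, used here only to grab frames from a clip
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_frame_as_base64(video_path, frame_index=0):
    """Grab one frame from a video file and return it as base64-encoded JPEG."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"could not read frame {frame_index} from {video_path}")
    ok, jpeg = cv2.imencode(".jpg", frame)
    return base64.b64encode(jpeg.tobytes()).decode("utf-8")

def synthetic_caption(video_path):
    """Ask a vision-language model for a dense, training-style caption."""
    image_b64 = sample_frame_as_base64(video_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this video frame in one detailed sentence, "
                         "including subjects, setting, camera style, and lighting."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=150,
    )
    return response.choices[0].message.content
```

A text-to-video model trained against such detailed captions has far more to latch onto at prompt time than the short, noisy descriptions that accompany most web video.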

One question on many people’s minds is what data OpenAI used to train Sora. OpenAI hasn’t revealed its dataset, but based on what people see in the results, the company may be using synthetic video data generated in a game engine in addition to real video sources (e.g., pulled from YouTube or licensed from stock video libraries). Nvidia’s Dr. Jim Fan, a specialist in training AI with synthetic data, wrote on X: “I wouldn’t be surprised if Sora is trained with a lot of synthetic data using Unreal Engine 5. It has to be that way!” However, until OpenAI confirms this, that’s just speculation.
