Sora (OpenAI): everything you need to know about AI that generates videos from text

Sora is OpenAI’s artificial intelligence model capable of transforming a text prompt into a video. Potentially a revolution in artistic creation, Sora raises many questions, which we strive to answer here.

video generated by Sora
Credit: OpenAI

After generating unparalleled enthusiasm with its ChatGPT text generator and its DALL-E image generator, OpenAI presented Sora, its video generator. As with its other platforms, it is an artificial intelligence-based tool, capable of creating content from a prompt in the form of text. Sora promises to revolutionize many creative uses on the internet and in other sectors. Here is what you need to know about it.

How does Sora work?

Sora is based, like the GPT models, on a transformer architecture. In a neural network, a transformer uses an attention mechanism to establish relationships between the components of a sequence, allowing it to transform an input sequence into an output sequence and generate a response to a prompt. This system makes it appear that the AI understands the question and reasons to produce a relevant answer, but no actual reasoning is involved: these are algorithms using mathematical representations to relate concepts to each other.

Where large language models (LLMs) use tokens in their operation, Sora uses what OpenAI calls "patches" (visual patches). This technique has already proven itself for models of visual data. Videos are transformed into patches by compression, and these patches then act as tokens: they can be used to reconstruct a video (or an image) using the transformer.

Sora patches
Credit: OpenAI
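To make the idea of spacetime patches concrete, here is a minimal NumPy sketch that cuts a video tensor into flattened patches that could serve as "visual tokens". The patch sizes are illustrative assumptions, not Sora's actual parameters, and real systems compress the video into a latent space first.

```python
import numpy as np

def video_to_patches(video, patch_size=(4, 16, 16)):
    """Split a video tensor (frames, height, width, channels) into
    flattened spacetime patches, the visual analogue of text tokens.
    Patch sizes here are illustrative, not Sora's real parameters."""
    t, h, w, c = video.shape
    pt, ph, pw = patch_size
    # Trim so each dimension divides evenly into whole patches.
    video = video[: t - t % pt, : h - h % ph, : w - w % pw]
    t, h, w, _ = video.shape
    patches = (
        video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
        .transpose(0, 2, 4, 1, 3, 5, 6)  # group patch blocks together
        .reshape(-1, pt * ph * pw * c)   # one row per spacetime patch
    )
    return patches

# A 16-frame, 64x64 RGB clip becomes a sequence of 64 patch tokens.
clip = np.random.rand(16, 64, 64, 3)
tokens = video_to_patches(clip)
print(tokens.shape)  # (64, 3072)
```

Each row of the result is one patch the transformer can attend over, just as it would attend over text tokens.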

"Sora is a diffusion model that generates a video by starting from something that looks like static noise and gradually transforming it by removing the noise over several steps," explains OpenAI. It is possible to create a video in one go from a single prompt, or to use multiple prompts to lengthen or correct the video as you go.

Sora noise
Credit: OpenAI
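The denoising process OpenAI describes can be sketched in a few lines. This toy loop starts from pure noise and removes a fraction of it at each step; the `fake_denoiser` is a stand-in assumption for the learned neural network, which in a real diffusion model predicts the noise to subtract.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(x, step, total):
    """Stand-in for the learned network: nudges the sample toward a
    target 'video'. A real model predicts the noise to remove."""
    target = np.zeros_like(x)  # placeholder for the clean video
    return x + (target - x) / (total - step)

def generate(shape=(8, 16, 16), steps=10):
    # Start from static-like noise, as OpenAI describes...
    x = rng.standard_normal(shape)
    # ...and progressively remove the noise over several steps.
    for step in range(steps):
        x = fake_denoiser(x, step, steps)
    return x

video = generate()
print(float(np.abs(video).max()))  # 0.0 -- fully converged to the target
```

Follow-up prompts map naturally onto this loop: conditioning the denoiser on new text steers where the remaining steps converge.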

The model uses the same recaptioning technique as DALL-E 3. This consists of generating very detailed, descriptive captions to build a rich visual training database. The model can thus draw from this database to comply more faithfully with the user's textual instructions in the generated video.

In addition to a text prompt, Sora supports processing instructions containing a still image. It then creates an animation based on the content of this image. The prompt can even supply a video, which Sora can extend or to which it can add missing scenes.

How long is a video generated by Sora?

For now, Sora can generate videos up to one minute long. This limit is due to the amount of resources needed to create a video that strictly respects the user's instructions and the desired visual style. OpenAI has not communicated the processing time needed to generate a video. Feedback from early users seems to indicate that it takes about an hour to create a one-minute video with Sora. Such a delay is a major weakness for the service, preventing users from iteratively correcting their videos with new prompts to refine them and obtain more relevant results.

How good is Sora’s image quality?

Sora generates videos at resolutions up to 1920 × 1080 pixels, i.e. Full HD. It can also produce videos in vertical format up to 1080 × 1920, and adapt to any aspect ratio. Unlike other services of this type, the frame rate of its videos has not been disclosed.

Sora is able to create ultra-realistic renderings, but also more abstract scenes, according to the requests described in the prompt. Artifacts and aberrations may appear in the image, and a phenomenon of hallucination can be observed, as with image generation in DALL-E. Errors in movement, as well as in interactions between characters or with the setting and objects, can also occur. But the first examples published by OpenAI are impressive, and it is conceivable that Sora could already be ready to generate advertising spots for broadcast on the internet or on television.

By OpenAI's own admission, Sora still needs improvement. "It may struggle to accurately simulate the physics of a complex scene and may not understand specific cases of cause and effect," admits the company. For example, if a person bites into a cookie, it may not show a bite mark. Handling broken glass is another difficulty encountered by OpenAI. The model can get confused by the spatial instructions in a prompt, mixing up left and right, for example. It may also struggle to follow directing instructions for a scene, such as a specific trajectory or camera angle.

Sora is, on the other hand, capable of creating scenes with precise details of the subject and the background, of expressing emotions, of respecting a visual style, of changing shots several times in a single video, and even of adopting a specific film format, such as 35 mm. 3D consistency is already mastered: Sora can generate videos with dynamic camera movement. "As the camera moves and rotates, the people and elements in the scene move coherently in three-dimensional space," we learn.

Similarly, OpenAI is pleased with Sora's performance in terms of temporal coherence throughout a video and object permanence. "Our model can preserve people, animals and objects even when they are hidden or leave the frame. It can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video," the company says.

How to try Sora?

Sora is only accessible to members of the OpenAI Red Teaming Network. This is a carefully selected group of users whose mission is to test the capabilities of the tool. The objective is to report technical, legal or ethical problems to OpenAI so that they can be resolved before a wider launch. The issue of deepfakes particularly worries publishers of video generation solutions, and safeguards must be put in place in this regard. Respect for copyright is another important issue to take into account.

“We are also granting access to a number of artists, designers and filmmakers for feedback on how to advance the model, so that it is as useful as possible to creative professionals”, also reports OpenAI. The company is sharing its progress and opening the door to Sora to a few people outside of OpenAI right now to get as much feedback as possible and improve its tool. We don’t yet know when Sora will be available to the general public, or in what form.

Will Sora be integrated into ChatGPT?

We do not currently know how OpenAI intends to distribute Sora to the general public. Judging by the company's recent strategic decisions, it is not certain that the tool will get its own user platform. DALL-E 2 no longer accepts new users on its own interface, and you now have to go through a paid or developer version of ChatGPT to access DALL-E 3. We can therefore imagine that, when it launches, Sora will be integrated directly into ChatGPT Plus. It is not certain that free access to Sora, even limited, will be offered at release.


What security measures are built into Sora?

Before Sora is made available to the general public, OpenAI has already announced a series of measures to reduce the risks of abuse of this powerful tool. The company is currently developing tools "to help detect misleading content", citing in particular a classification system capable of detecting a video generated by Sora. It is also specified that if the model were to be integrated into an OpenAI product in the future, the teams plan to include C2PA metadata. This open standard, already used for images generated by DALL-E 3, makes it possible to trace the origin of content and know whether or not it was created by an AI.

Sora will also benefit from security features already implemented in OpenAI's other services. A text classifier is planned, whose role is to check and reject prompts that violate OpenAI's usage policies. Prompts requesting extreme violence, sexual content, hateful imagery, the likeness of a celebrity, or the intellectual property of a third party are banned. In addition, image classifiers will examine the frames of each generated video to ensure that none violates these usage policies.

Who are Sora’s competitors?

After text and image generation models, the major players in the generative artificial intelligence sector are now working seriously on video generation models. Google, one of the main competitors of ChatGPT and GPT-4 with Gemini, is also positioning itself as a tough opponent in the field of video creation with Lumiere. Google Lumiere, also inaccessible to the general public, is currently limited to five-second videos. Its prompt can contain an image, not just text.

Among the digital heavyweights, Meta is also interested in the subject, notably with Emu Video, which allows you to create videos from a text-only prompt, an image-only prompt, or a combination of both. We can cite Gen-2, from Runway, which is capable of creating videos not only from text or images, but also from another video. Stable Video Diffusion and Pika are also among the serious contenders in this market.
