DALL-E: Using AI to Create Images from Text

Artificial intelligence has experienced constant innovation over the last decade. As computing power skyrockets and new techniques emerge, AI's seemingly unending potential is being realized more and more every day. In recent weeks, machine learning has been heavily discussed thanks to OpenAI's development of DALL-E (pronounced "doll-e").

Generating Unique Images

This model is fascinating in that it can generate a unique image when given a text prompt in English. In an example from OpenAI’s blog about the program, the model is asked to produce “an illustration of a baby daikon radish in a tutu walking a dog.” The model came back with this array of results:

Array of cartoon images of a radish in a tutu walking a dog generated by DALL-E when asked to produce “an illustration of a baby daikon radish in a tutu walking a dog.”

From https://openai.com/blog/dall-e/

The model clearly understood the prompt and created images with interesting art styles and few of the visual artifacts so common in AI-produced images.

Demonstrating Creativity

In another example, the team input the prompt “a female mannequin dressed in a black leather jacket and gold pleated skirt.” These were the results:

DALL-E-generated array of "a female mannequin dressed in a black leather jacket and gold pleated skirt."
From https://openai.com/blog/dall-e/

The most striking things about these images are their photorealism and what seems like creativity in the model's responses. Interestingly, each outfit is mostly unique and fits a different style.

What happens if it falls into the wrong hands?

OpenAI has not released the full model to the public, as there is real potential that a bad actor could use this technology for nefarious purposes. The model could, for example, be used to create fake images of a person in an incriminating situation or to spread false information about an event that never took place.

How does DALL-E work?

DALL-E is a transformer language model, meaning it "receives the text and image as a single stream of data containing 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively," according to the blog post mentioned previously.
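To make the quoted description concrete, here is a minimal sketch of that token layout in Python. The token counts follow the figures quoted above (256 text tokens plus 1024 image tokens, giving one 1280-token stream); the function names and everything else are illustrative assumptions, not OpenAI's actual implementation.

```python
# Hypothetical sketch of a DALL-E-style token stream: the caption's text
# tokens and the image's discrete tokens are concatenated into a single
# sequence, which a transformer then models autoregressively (each token
# predicted from all the tokens before it).

TEXT_TOKENS = 256    # caption, encoded and padded/truncated to 256 tokens
IMAGE_TOKENS = 1024  # e.g. a 32x32 grid of discrete image codes

def build_stream(text_ids, image_ids):
    """Concatenate text and image token IDs into one 1280-token stream."""
    assert len(text_ids) == TEXT_TOKENS
    assert len(image_ids) == IMAGE_TOKENS
    # Text comes first, so at generation time the model can condition the
    # image tokens on the full prompt and sample them one at a time.
    return list(text_ids) + list(image_ids)

# Dummy IDs just to show the layout; a real model would produce these
# with a text tokenizer and an image encoder.
stream = build_stream(list(range(256)), list(range(1024)))
assert len(stream) == TEXT_TOKENS + IMAGE_TOKENS  # 1280 tokens total
```

Because the stream is modeled left to right, generating an image amounts to fixing the 256 text positions and sampling the remaining 1024 image positions one by one.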

As more research goes into this form of model, we can only expect more innovation to come.

But… it is less than perfect

While the program is highly effective in generating creative, good-looking images from natural-language prompts, the model struggles with complex scenes where multiple characters interact.

An article on LessWrong highlights this weakness by asking DALL-E 2 to create "Captain America and Iron Man standing side by side."
Array of images with Captain America and Iron Man standing side by side
From https://www.lesswrong.com/posts/uKp6tBFStnsvrot5t/what-dall-e-2-can-and-cannot-do

While the images still look realistic, the model melds the features of both characters together such that neither looks quite right.

Although AI models such as DALL-E can feel hyper-futuristic and even terrifying, they are still just statistical tools, so they lack the basic concepts that human artists naturally possess.

Sometimes it gets key details wrong

DALL-E clearly doesn't have a concept of an individual or a character. It swaps the defining characteristics of the two characters throughout all the images; it's hard to even tell which character is which.

In another example from the same article, the author shows how DALL-E can produce English letters, but it doesn't quite know how to spell.

Shows female superhero with hands on hips with text: RANGE EMERENCY, EMERAGENCEY
From https://www.lesswrong.com/posts/uKp6tBFStnsvrot5t/what-dall-e-2-can-and-cannot-do

It's close and you can get the gist, but the results are comedic at best. The above image also shows how DALL-E can sometimes struggle with anatomy: the figures' hands are warped, and their faces look unrealistic even within the art style.

Remember that the model is just using math to produce images after extensive training. It doesn't share the human experience that we have, and it has no physical intuition.

What the future holds

As we develop more advanced AI with better techniques and more computational power, many of these problems will likely be resolved. With larger data sets, neural networks may learn to handle spelling or human anatomy.

The field of AI is constantly evolving. It’s likely only a matter of time until computer-generated art is truly indistinguishable from human-made art. It is exciting to watch the development of this technology, and very interesting to see where it takes us.
