Exploring the Phase Space of Stable Diffusion, Discovering Procedural Nonsense

Anytime I encounter a new technology, I like to knock around the phase space of its possible outputs a bit: see what it produces as you take it through the range of possible values for various settings or inputs. I take my inspiration for this from a photography project I distinctly remember encountering years ago, but which I can no longer find or recall the name of, which did this with a camera: taking multiple images of the same subject while stepping through possible f-stop and shutter speed values. If anyone recognizes this project, please let me know what it was!

I think I’m drawn to these phase space experiments because they help me get a concrete sense of what a technology does. I’m not always a great abstract learner; I have a clearer sense of what’s happening once I get my hands dirty and try stuff out a bit. That’s why I’ve been wanting to try this with one of the machine-learning text-to-image programs for a while now. These programs (which you’ve probably encountered in the form of DALL-E or one of its cousins) are fantastically hard to understand in the abstract, because they rely on hugely complex statistical manipulations to generate images from text.

The quality of images this software can produce has progressed almost unbelievably rapidly over the last year. For example, about a year ago, I asked the then-hot text-to-image generator (VQGAN+CLIP, I think it was called) to draw me “Professor Andrew Famiglietti of West Chester University” and got this:

I guess that’s vaguely humanoid….

Whereas the current hot image generator, Stable Diffusion (which is available free of charge and will run reasonably well even on my modest GTX 1060 graphics card) renders output for that prompt that looks like this:

It doesn’t know what I look like, but it understands what a person looks like… mostly

More importantly, at least insofar as my fascination with technological phase space is concerned, Stable Diffusion makes it easy to tweak a couple of settings that influence how it makes images.

(If you want to explore these settings yourself, I wrote a Google Colab for that. If you have a GPU at home, the Stable Diffusion Web UI will also do this with the Prompt X/Y feature.)

To understand what these settings are (at least in the vague way that I understand what they are) we have to quickly review how machine learning image generators, well, generate images. So far as I understand, they work by using a system that has been trained to recognize images on a vast set of image-caption pairs. That is, they learn what a “cat” looks like by seeing a very, very large number of images labelled “cat.” (For a great discussion of the Stable Diffusion training data, and some links to explore that data further, see this blog post by Andy Baio.) The image generator starts with random visual noise, uses the recognition system to detect which pieces of that noise look most like its prompt, and then iteratively modifies the image to increase the recognition score. You can see the process at work in this .gif, which shows the steps Stable Diffusion uses to draw a cat, basically sculpting the most “cat-like” pieces of noise into a more and more defined cat image:

Just take away the noise that’s not a cat! Simple!
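If you want to peek at those intermediate stages yourself, here’s a minimal sketch using the Hugging Face diffusers library. I’m assuming an older pipeline version that still supports the callback(step, timestep, latents) hook (newer releases replaced it with callback_on_step_end), and the 0.18215 latent scale is specific to the Stable Diffusion v1 models:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion v1.5 in half precision so it fits on a modest GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

frames = []

def save_intermediate(step, timestep, latents):
    # Decode the partially denoised latents back into pixel space so we can
    # watch the "cat-like" noise get sculpted into a cat.
    with torch.no_grad():
        images = pipe.vae.decode(latents / 0.18215).sample
    images = (images / 2 + 0.5).clamp(0, 1)
    images = images.cpu().permute(0, 2, 3, 1).float().numpy()
    frames.extend(pipe.numpy_to_pil(images))

# Grab a frame every five denoising steps.
pipe("a photograph of a cat", callback=save_intermediate, callback_steps=5)

for i, frame in enumerate(frames):
    frame.save(f"cat_step_{i:02d}.png")
```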

Stable Diffusion gives us access to two settings that let us guide this process:

Inference Steps sets the number of times the algorithm will repeat the process described above; in other words, how many iterative image “steps” it will take on the way to a final image.

Guidance Scale (or CFG, short for “classifier-free guidance”) determines how strictly the algorithm revises the image towards the given prompt. I’m honestly a little fuzzy about what this means under the hood, but higher values weight the prompt more heavily at each revision, which is said to make the algorithm interpret the prompt more “strictly.”
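In code, both of these settings are just arguments to the pipeline call. Here’s a minimal sketch, assuming the Hugging Face diffusers library, where they show up as num_inference_steps and guidance_scale:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# A fixed seed keeps the starting noise the same, so any change in the output
# comes from the two settings rather than from a fresh random canvas.
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    "a black and white photograph of a galloping horse",
    num_inference_steps=50,  # how many denoising steps to take
    guidance_scale=7.5,      # how strictly to steer toward the prompt
    generator=generator,
).images[0]
image.save("horse.png")
```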

So, what does the phase space of these two settings look like? Well, suppose we ask Stable Diffusion to draw us several versions of “a black and white photograph of a galloping horse” (as an homage to Muybridge’s “The Horse in Motion,” which does some phase-spacey work itself) using low, average, and high values for steps and guidance scale. If we arrange the nine resulting images in a grid, with low values on the upper left, guidance scale increasing from left to right, and steps increasing from top to bottom, we get this:

Upper left: Low Guidance, Low Steps; Upper right: High Guidance, Low Steps; Lower left: Low Guidance, High Steps; Lower right: High Guidance, High Steps (click image for larger version)

This gives us a rough sense of the space. Low guidance gives us vaguely horselike shapes, and low steps gives us a “sketchy,” unrefined image. Moderate guidance and steps (the recommended settings for “realistic” results) give us, well, a horse. Very high steps and guidance give us a horse with a LOT of detail (not all of the details really make sense, though) and a LOT of contrast (including an odd, glowing bit of light on the back). The presence of all four feet in this image is interesting, but as we’ll see, not entirely a predictable result of the settings. The other two corners, low guidance/high steps and high guidance/low steps, are perhaps the most interesting from a glitch art perspective. More on these in just a bit.
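Here is roughly how a grid like this can be assembled, again assuming the diffusers library, with a fixed seed so the only thing changing from cell to cell is the pair of settings. The same loop, run over finer-grained ranges, produces the much larger grid below (the specific low/average/high values here are illustrative, not canonical):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a black and white photograph of a galloping horse"
guidance_values = [1.0, 7.5, 16.0]  # low, average, high guidance scale
step_values = [5, 50, 100]          # low, average, high inference steps

size = 512  # Stable Diffusion v1's native output size
grid = Image.new("RGB", (size * len(guidance_values), size * len(step_values)))

for row, steps in enumerate(step_values):             # steps increase top to bottom
    for col, guidance in enumerate(guidance_values):  # guidance increases left to right
        # Re-seed for every cell so each image starts from the same noise.
        generator = torch.Generator("cuda").manual_seed(42)
        image = pipe(
            prompt,
            num_inference_steps=steps,
            guidance_scale=guidance,
            generator=generator,
        ).images[0]
        grid.paste(image, (col * size, row * size))

grid.save("horse_phase_space.png")
```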

If we invest a bit of time (and a month’s allotment of Google Colab compute credits) we can expand the above into a much larger grid, slowly incrementing over both Guidance Scale and Inference Steps from very low (Guidance Scale 0 and 3 Inference Steps) to very high (Guidance Scale 16 and 100 Inference Steps), a small step at a time.

The resulting grid looks like this. Again, low values for both steps and guidance scale are in the upper left, step values increase as you move down the image, and guidance scale values increase as you move from left to right.

Use link below for full resolution (WARNING: LARGE FILE)

You can grab the full size image here. Hopefully no one actually reads this or my hosting will melt.

Several interesting and informative features emerge from a scan of this large phase-space grid.

First, a few of the images are missing! As I’ve since learned, Stable Diffusion has a built-in algorithm that attempts to censor “NSFW content” (our era’s telling euphemism for obscenity). The somewhat oversensitive nature of this algorithm can be seen in how it triggers on some random frames, with nothing particularly suggestive in any of the surrounding images:

Not sure what’s obscene here, algorithm

I’ve since learned to disable the NSFW filter, but the method of action here is fascinating in itself. Basically, one machine learning system generates an image, then passes it to another machine learning system to see if the image is recognized as obscene. Of course, since generators are based on recognition systems, this does kind of suggest that someone could wire up the obscenity filter to create an obscenity generator, but that disturbing notion will be left as an exercise for the reader.
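For what it’s worth, in the diffusers pipeline the filter lives in a separate safety_checker model bolted onto the generator. A minimal sketch of turning it off, assuming that library (flagged images otherwise come back blacked out, along with a per-image nsfw_content_detected flag in the pipeline output):

```python
import torch
from diffusers import StableDiffusionPipeline

# Passing safety_checker=None skips loading the second, image-recognition model
# that screens the generator's output; diffusers will print a warning about it.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")

image = pipe("a black and white photograph of a galloping horse").images[0]
image.save("horse_unfiltered.png")
```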

Second, the features rendered by the algorithm are incredibly mutable and ephemeral. A few steps more or less, or a bit more or less “guidance,” can cause significant changes in the image. These changes don’t seem to follow an easily discernible pattern; instead, features may emerge for a range of settings and then disappear. Most notably, the horse’s missing legs come and go at various points in the sequence. Here a leg emerges for a single image in the step sequence at a moderate Guidance Scale, along with some motion blur (another idea the algorithm seems to occasionally toy with and then discard), before disappearing again:

Leg today, gone tomorrow!

At a very high Guidance Scale, the leg reappears more consistently, but this phase space experiment makes me doubt that the high guidance scale has made the image “better” or “more accurate.” The process of using Gaussian noise to draw an image seems to just riff on certain image features for a while and then drop them.

Finally, there are those two corners of the space I called interesting before: the bottom left, where the high-iteration, low-guidance images live, and the upper right, where the low-iteration, high-guidance images dwell.

The low-guidance, high-iteration images are nonsense, but oddly realistic nonsense. The algorithm draws a very solid, photorealistic picture of some totally impossible shapes. Take the comparison below, for example. With the slider all the way to the right, it shows the Guidance Scale 0 image at 50 iterations; all the way to the left is 100 iterations. The image is only subtly different, but seems more solid. The “scene” the algorithm has hallucinated (some sort of city street? A market?) seems to have more depth.

Slider right: Guidance Scale 0 / 50 Inference Steps. Slider left: Guidance Scale 0 / 100 Inference Steps.

The oddly human figure on the lower right of this image (which becomes incorporated into the front half of the horse as the algorithm is given stricter guidance) is also intriguing to me. We might dub these emergent human figures “The Guidance Scale Zero People,” and further experiments with Stable Diffusion suggest they are easy to generate.

Further examples coming soon to a Mastodon bot I’m building. As I was generating these, I also experimented with some prompts that asked the generator to create something other than a photograph: for example a line drawing, charcoal sketch, or painting. These tended to create loving renders of the technique (brush strokes, pencil lines) with subjects that seemed odd, fanciful, or even metaphorical.

I find these images somewhat evocative, despite the fact that I know just how little I really did to generate them. Basically, they are the result of a script I wrote that generates random image generation prompts from terms entered into a spreadsheet (modeled on the SSBot Twitter bot application by Zach Whalen). I gathered a bunch of those prompts, ran them at low Guidance Scale and high Inference Steps, and picked the ones that spoke to me.
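The full version lives in the Colab linked below, but the core of it is just combinatorial fill-in-the-blanks: keep columns of terms and glue random picks together into a prompt. A toy sketch of the idea (these short word lists are illustrative placeholders, not my actual spreadsheet):

```python
import random

# Toy term lists standing in for the spreadsheet columns.
media = ["a charcoal drawing of", "a line drawing of", "an oil painting of",
         "a black and white photograph of"]
subjects = ["a man", "a horse", "a lighthouse", "a crowded market"]
modifiers = ["in the rain", "at dusk", "seen from above", "dissolving into smoke"]

def random_prompt():
    # One pick from each column, glued together into a single prompt.
    return " ".join([random.choice(media),
                     random.choice(subjects),
                     random.choice(modifiers)])

for _ in range(5):
    print(random_prompt())
```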

You can try this process of random prompt generation out yourself with this Google Colab I wrote.

These images are compelling to me because they seem absurd in a pleasing way. It’s this automatic generation of absurdity (let’s call it Procedurally Generated Nonsense) that I find the most fascinating thing about AI image generation. In the late 19th and early 20th century, the technologies of “mechanical reproduction” made the creation of sensible texts all too easy. Legible text, clear “realistic” images, all became something easy to make and easy to copy via machine. For at least some artists, the response was to reject sense and embrace nonsense, sometimes leveraging the affordances of these same technologies to create images that were anything but “realistic.” Instead, they embraced the absurd, the garish, the nonsensical, the fantastic.

There is a way in which the AI image generator and its kin seem to stand this equation on its head. Yes, they produce nonsense, but they often produce compelling nonsense quickly, easily, almost thoughtlessly. As such, they effectively automate a domain of art embraced exactly because it seemed to resist earlier forms of automation.

I’m not sure I like that. I’m not sure where that goes. But, in the meantime, I can’t stop asking my little machine-mind to dream me more absurdity.

“A charcoal drawing of a man in the rain”
