50 Ways of Looking at the Same Prompt

Or maybe just one way of looking? Let’s see.

I decided to run a simple prompt based on one of my assignments through some LLMs multiple times and see what happened. In particular, I was interested in seeing what sorts of common patterns might emerge from the responses. In part, this was inspired by scoring student responses to this same assignment (which I permitted students to use ChatGPT to complete, so long as they acknowledged and documented their use) and noticing what seemed to be common patterns in the work submitted by students that had used ChatGPT.

What I Did

My method for this experiment was simple: I wrote a very basic Python script that submitted the same prompt to the ChatGPT (gpt-3.5-turbo) model via the API 50 times and saved each response to a text file, like so:

import openai

def chat_response(state):
    # Send the accumulated message chain to the chat completions endpoint
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=state
    )
    return response


for i in range(50):
    # One output file per generated article
    text_file = open("GPT_Onion/4article" + str(i) + ".txt", "w", encoding="utf-8")
    GPT_instructions = "Write an article for The Onion about something relevant to West Chester University students. This article must begin with a headline appropriate to an Onion article and be 200-300 words long"
    message_chain = [
        {"role": "system", "content": "You are a helpful assistant"},
    ]
    message_chain.append({"role": "user", "content": GPT_instructions})
    response = chat_response(message_chain)
    text_file.write((response['choices'][0]['message']['content']) + "\n")
    text_file.close()

The prompt was taken more or less verbatim from my assignment.

Why 50 responses? Because my first attempt tried to generate 100 responses and timed out halfway through! But seriously, I have no idea what a representative sample of this sort of output would be. If I were looking for patterns in a corpus of a thousand student responses, or ten thousand, or a million, there are statistical techniques that would let me choose a good representative sample (don’t ask me what these are off the top of my head, I just know they exist and I could find them if I needed them). But how big is the “sample” of latent LLM responses? How many responses could the machine generate? How do I know if the patterns I am seeing are representative of how the machine behaves or just a fluke random run of something I happened to stumble upon?

¯\_(ツ)_/¯

I was able to make 50 easily, and read through 50 in a reasonable span of time. There are a couple of patterns I think seem interesting, even at this small sample size. The others are worth thinking about in a sort of rough-and-ready way but aren’t conclusive.

I read through the 50 articles generated by ChatGPT and coded them for main topics. I also noted examples where the response returned seemed very similar to a prior response, and I noted what named people were in each response.

I then repeated the generation step using GPT-4 and quickly skimmed those responses for main topics and named people.
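Repeating the generation step required only swapping the model name in the helper function above (a sketch, assuming the standard "gpt-4" model identifier in the same API):

def chat_response(state):
    # Same helper as above, pointed at GPT-4 instead of gpt-3.5-turbo
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=state
    )
    return response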

This experiment cost me $1.70, the vast majority of that being the $1.43 I spent on GPT-4 API responses.

What Was in the Articles ChatGPT and GPT4 Wrote

The outputs from both ChatGPT and GPT4 seemed to show some repeated patterns in the content they produced. The content produced by GPT4, however, seemed somewhat less repetitive in terms of strict form, with repetition more frequently happening on a thematic level.

Just for fun, I went back to that old DH standby, the word cloud, and visualized the output I got from both LLMs. Here’s the result from the ChatGPT articles:
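For what it’s worth, generating each cloud takes only a few lines; here is a minimal sketch, assuming the wordcloud Python package and the article files written by the script above (the output filename is illustrative):

import glob
from wordcloud import WordCloud

# Gather all fifty generated articles into one string
text = ""
for path in glob.glob("GPT_Onion/4article*.txt"):
    with open(path, "r", encoding="utf-8") as f:
        text += f.read() + "\n"

# Render and save the cloud
cloud = WordCloud(width=800, height=600, background_color="white").generate(text)
cloud.to_file("GPT_Onion/chatgpt_wordcloud.png")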

As you can see from the word cloud above, ChatGPT seems to have a very particular idea of what a “common” surname for a student/faculty member in the US looks like. In addition to “Thompson” it liked “Stevenson,” “Johnson,” and “Watson.” In fact, of the 48 named people in the sample, 36 (75%) had some sort of “-son” surname.

The presence of the word “Time” in the word cloud probably reflects the frequent use of time travel as a comedic trope in the generated articles. Seven of fifty (14%) invoked the time-travel theme, according to my hand count. Twelve of fifty (24%) invoked the idea of “discovery” (also present in the word cloud), in which students either “discover” something obvious about campus (for example, an article ChatGPT titled “West Chester University Students Shocked to Discover Library Contains Actual Books”) or something unexpected (“West Chester University Student Discovers Multiverse in Local Laundromat Dryer”).

Not present on the word cloud is the theme of student laziness, which appeared in sixteen of the fifty ChatGPT articles (32%), by my count. This somewhat abstract theme was rarely explicitly invoked, but it clearly informs the humor of articles like “West Chester University Students Discover Time Travel Portal in Campus Library, Use it to Attend Classes From Their Dorm Rooms,” “West Chester University Students Request Permission to Skip All Classes and Still Graduate on Time,” and “West Chester University Student Discovers How to Freeze Time Between Classes, Uses Extra Time to Binge-watch Netflix Series.” (That first example is the trifecta: discovery, time travel, and laziness.)

At least four of the ChatGPT-generated articles were almost exact duplicates of one another, with extremely similar headlines and content. For example, the articles “West Chester University Student Discovers Time Travel, Uses Ability to Attend Zero 8 a.m. Classes” and “West Chester University Student Discovers Time Travel, Uses It to Avoid 8 a.m. Classes” share not only near-identical titles but similar opening sentences. The first opens:

West Chester, PA – In a groundbreaking development that has professors baffled and the administration scrambling, West Chester University student Derek Thompson has reportedly unlocked the secret to time travel, enabling him to avoid the dreaded early morning classes that plague his peers.

And the second begins:

West Chester, PA—In a groundbreaking discovery that has left the scientific community and West Chester University faculty scratching their heads, local student Max Thompson reportedly stumbled upon the secret of time travel—solely for the purpose of avoiding those dreaded 8 a.m. classes.

They then proceed with roughly equivalent paragraphs, similar quotes, etc. If these had been turned in by two students independently, I would have assumed plagiarism, either from each other or a common source.

GPT4, in contrast, was not nearly so formulaic. Here’s the word cloud!

GPT4 did not, for example, name every character “Thompson.” However, as the cloud above suggests, it did have an inordinate fondness for squirrels. Fourteen of fifty articles (28%) were about squirrels in some capacity (“WCU Squirrel Population Demands Representation in Student Government,” “Climate Crisis Hits West Chester University: Squirrels Reportedly Hoarding Cool Ranch Doritos,” “Local Squirrel Ascends to Presidency of West Chester University”). Many of these focused on the idea of a squirrel attaining a leadership position on campus.

The themes of discovery and student laziness were less prominent in this sample, but still present, with ten and seven examples respectively. GPT4 also wrote several (six) articles that satirized the high cost of college, a topic ChatGPT hadn’t engaged with. One, entitled “West Chester University Declares Majors Irrelevant; Students Now Just Majoring in Debt,” was particularly cutting. It imagines the university president (correctly identified by GPT4) “explaining, ‘We figured, why not prep our students for the most reliable outcome of their academic journey? Crushing financial burden.’” and notes that “The notorious ‘Rammy’ statue was promptly replaced with a huge cement dollar sign, and the radio station WCUR’s playlist was updated with ‘Bills, Bills, Bills’ by Destiny’s Child on a loop.”

There were no near duplicate articles in this sample. While two articles had almost identical headlines (“West Chester University Debuts New Major: Advanced Procrastination Studies” and “West Chester University Announces Revolutionary New Major: Advanced Procrastination Studies”) the underlying articles treated the theme presented in the title quite differently.

Using ChatGPT to Analyze ChatGPT

After hand-tagging themes in the articles generated, I wrote a script that fed the articles back into ChatGPT and asked it to extract titles, themes, and named people. I was curious to see how well the software would do at this analytic task.

import time

import openai

def chat_response(state):
    # Send the accumulated message chain to the chat completions endpoint
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=state
    )
    return response


outfile = open("GPT_Onion/4topics.csv", "a", encoding="utf-8")

for i in range(50):
    # Read each generated article back in and ask the model to tag it
    text_file = open("GPT_Onion/4article" + str(i) + ".txt", "r", encoding="utf-8")
    GPT_instructions = text_file.read()
    message_chain = [
        {"role": "system", "content": "You are a helpful agent that extracts headlines, main topics, and named people from short articles you are given. For each text you read, return the following separated by semi-colons: 1) the article's headline 2) a list of one to three word topics that describes the main themes of the article separated by commas 3)a list of named people found in the article, separated by commas. Only return this semi-colon separated list and nothing else. Base your response only on the text you are given."},
    ]
    message_chain.append({"role": "user", "content": GPT_instructions})
    response = chat_response(message_chain)
    outfile.write(str(i) + ";" + (response['choices'][0]['message']['content']) + "\n")
    text_file.close()
    print(i)
    time.sleep(2)  # brief pause between API calls
outfile.close()

The results here were really interesting. ChatGPT did a perfect job extracting titles (which were consistently marked) and also named people. I actually used its extracted names to compute the percentage of “-son” surnames above. Finding and extracting “named people” is a non-trivial data analysis task, and it absolutely nailed it on the first try with a very simple prompt. No hallucinations were observed in this run of 50 examples.
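Counting the “-son” surnames from those extracted names is then just a few lines over the semicolon-separated file the script writes; a rough sketch, assuming the model followed its formatting instructions:

import csv

total_names, son_names = 0, 0
with open("GPT_Onion/4topics.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter=";"):
        if len(row) < 4:
            continue  # skip rows where the model didn't return all three fields
        for name in row[3].split(","):
            name = name.strip()
            if not name:
                continue
            total_names += 1
            if name.split()[-1].lower().endswith("son"):
                son_names += 1
print(son_names, "of", total_names, "named people have a '-son' surname")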

The topics extracted were less satisfying, but they weren’t inaccurate. It often picked up on the theme of “discovery” that I had tagged, but not always: for example, it listed only “library, students, books” as topics for “West Chester University Students Shocked to Discover Library Contains Actual Books.” It never listed “laziness” as a topic. However, this more abstract topic was only really visible, even to me, after comparing multiple examples of articles, and I wasn’t able to have ChatGPT track all the articles at once without running out of context window.

What Does it All Mean?

Here’s the TL;DR:

Basically, it looks like multiple responses to a common prompt converge around common themes, both for ChatGPT and GPT4. Probably a little basic prompt engineering, perhaps even automated with mad-lib style prompt modifiers, would shake that up a bit; that’s something I want to test.
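Something like the following is what I have in mind; the modifier lists and the rewritten prompt are purely hypothetical examples, not ones I have tested.

import random

# Hypothetical mad-lib style modifiers for varying the base prompt
angles = ["campus parking", "dining hall food", "dorm life", "the football team"]
tones = ["deadpan", "breathless", "mock-outraged"]

base_prompt = ("Write an article for The Onion about {angle} at West Chester University. "
               "This article must begin with a headline appropriate to an Onion article, "
               "be written in a {tone} tone, and be 200-300 words long")

# Swap a randomized prompt into the generation loop above
GPT_instructions = base_prompt.format(angle=random.choice(angles), tone=random.choice(tones))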

From a pragmatic, teaching point of view, generating a sample of 20-50 responses from ChatGPT or GPT4 to see what themes it reaches for might be informative. Not that this would diagnose plagiarism all that conclusively, since the themes LLMs use are likely not so unlike those students might use (though in past student responses to this prompt, students were much less likely to write parodies of student laziness than the GPTs were). However, it might give you a sense of the “AI common sense” that you might then want to engage with/push back against/complicate/push past/etc.

From the point of view of understanding machine writing, it’s interesting to see the recurrence of themes, ideas, terms, and sometimes even whole structures in the responses generated. I’ll probably run off some further examples, especially in GPT4, to see if I get more “collisions” where the LLM generates very similar responses to the same prompt.

From the perspective of trying to understand where LLMs go next, I think the contrast between the somewhat formulaic (and rarely funny) “Onion Articles” generated by the LLM and its huge success doing content-processing work (like identifying named people and topics) is informative. I continue to think that LLMs' ability to process text will be more important than their ability to compose it in the long run.

Let’s Explore the Latent Space with Presidents

Ok, so, I had the AI image generator Stable Diffusion XL generate 100s of “selfies” of US Presidents. Let me explain.

But, before I even start on that, let me state that I don’t intend this as any sort of endorsement of AI image generators as a technique. I understand how problematic they are for artists. My goal here is to understand the tool, not to celebrate it (though I do sometimes find its glitchy output quite pleasing). One reason I chose US Presidents for this project is that, as public figures of the US government, at least the figures I’ll be representing here are already somewhat “public domain.”

Richard Milhous Nixon snaps a selfie with the little known 1970 Samsung Galaxy Mini

So, we know that image generators are able to do a fair amount of remix work, translating subjects from one style into another; that’s how you make something like Jodorowsky’s Tron. I was curious to learn more about this process of translation. How well, and how reliably, could an image generator take a subject that never appeared in a given genre and represent that subject in that genre? How would it respond when asked to represent a subject in an anachronistic genre? Would it matter if the subject asked for had many different representations in the training data or just a few? Which genre conventions would the system reach for to communicate the requested genre?

I also wanted to get beyond cherry-picking single images and get a slightly larger sample of images I could use to start to get a sense of trends. I was less interested in what one could, with effort and repetition, get the tool to do, and more in what its affordances were: what it would tend to encourage or favor “by default,” as it were.

So I decided to take a stab at making a run of many images using the recent XL version of the popular Stable Diffusion AI Image generator, mostly because it’s something I can download and run locally on my own machine, and because it’s incorporated into the Huggingface Diffusers library, which makes scripting with it easy enough for… well, an English Professor!

I decided to use US Presidents as subjects for the series, because they are a series of fairly well-known, well-documented people spanning 230-odd years of history. That meant I could pick a recent image genre and guess that most of them would not be represented in this genre in the training data (it’s not impossible some human artist’s take on “Abraham Lincoln taking a selfie” is in the data set, but “Franklin Pierce taking a selfie?” I doubt it). The system would have to translate them into it. At the same time, some Presidents have vastly more visual culture devoted to them than others, both because of relative fame and because recent presidents live in an era with more visual recording. I was curious to know if I could learn anything about how this difference in training data might influence the results I got from the generator. Would it be more adept at translating subjects it had more visual data about?

Also, the logic of “I’m looking for my keys here where the light is better” applies. A list of US presidents was easy to find online and drop into a CSV file for scripting.

I went with the “selfie” genre because we know it’s one that image generators can do fairly well. There have already been some great critiques of how image generators apply the cultural conventions of the “selfie” genre in anachronistic and culturally inappropriate ways. I was curious to see how the “selfie smile” and other selfie genre conventions might pop up in anachronistic images, and to look for patterns in how these genre conventions appeared.

A rough simulacrum of Dwight D. Eisenhower extends his arm to take a big smiling selfie…

So I ran off a series of 10 selfies each of all 44 unique presidents (sorry Grover Cleveland, no double dipping) using the prompt “A selfie taken by [President’s Name].” I also asked for “A portrait of [President’s Name]” using the same random seed, to see how that compared. I also asked for “An official photograph of [President’s Name] descending the stairs of Air Force One,” but that prompt mostly revealed that Stable Diffusion rather struggles to represent aircraft.

The fact that that isn’t a very good representation of Woodrow Wilson is the LEAST of this image’s problems.
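For the curious, the generation loop itself is short once the Diffusers library is installed. Here is a rough sketch: the CSV filename, its “name” column, and the output paths are my own illustrative choices, and it assumes the SDXL base checkpoint and a CUDA-capable GPU.

import csv
import torch
from diffusers import StableDiffusionXLPipeline

# Load the Stable Diffusion XL base model locally
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical CSV of president names with a "name" column
with open("presidents.csv", encoding="utf-8") as f:
    presidents = [row["name"] for row in csv.DictReader(f)]

for name in presidents:
    for i in range(10):
        # Re-use the seed for the matching "portrait" prompt so the two runs are comparable
        generator = torch.Generator("cuda").manual_seed(i)
        image = pipe("A selfie taken by " + name, generator=generator).images[0]
        image.save("selfies/" + name.replace(" ", "_") + "_" + str(i) + ".png")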

I’ve taken a first pass through the results, and while my sample size is still very small, I think I see some things I’d like to count up and look for trends in. I think I’ll do this slowly, one president a day for the next few months, and post what I see in each example on Bluesky/Mastodon as I go. In particular, I’m curious about a couple of trends I think I notice in the images.

First, I’m curious about how the media forms that Stable Diffusion associates with “selfie” seem to change over time. For example, for the first few US presidents, the usual result for “selfie” looks like a painting (with the exception of a few odd, photorealistic, hypermodern breakthroughs).

(Left: Typical painting style Washington “selfie” Right: Washington cosplay uncanny valley horror thing)

However, by the time you get to John Quincy Adams and Andrew Jackson, the “selfies” appear frequently as if they were early photographs (perhaps daguerreotypes) rather than paintings, while the “portraits” remain paintings. This despite the fact that (so far as I can tell from a bit of googling) only a handful of photographs were taken of either man, and those were taken very late in life.

A faux-photograph of Andrew Jackson.

Also, note the simulated wear at the corners of that image. There seems to be a lot of that in the various “selfies”: simulated wear and cracks, simulated tape holding them to simulated albums. The “portraits,” in contrast, tend to have frames. I’m curious to see if there are trends there. Does the machine simulate “age” in the images of older subjects, even when asked to simulate an anachronistic genre? It doesn’t always (see Washington above), so is there a trend in how frequently that happens?

Second, I’m curious to see how the anachronistic genre conventions of the selfie are applied across time. So, while fans of Calvin “Silent Cal” Coolidge will be thankful to see he has NOT been rendered with a “selfie smile”…

Sedate Coolidge is sedate

… some elements of “selfie style,” sometimes mashed up with period media artifacts, do break through, as in this image where Woodrow Wilson’s arm extends to the corner of the image frame, holding up a small, light, smartphone-sized camera that inexplicably also shoots black-and-white film with a noticeable grain and a depth of field measured in microns.

Daguerro-phone?

Or this one, where a phone is mashed up with period camera hardware to make some kind of dieselpunk accessory for a Harry Truman perhaps being played by Eugene Levy:

Honestly, a phone with that much lens might be cool…

At first glance it seems like these style moves become more common the closer you get to the present, even though they don’t really make sense until 2007 or so.

So, those are my first pass instincts. Going to take a closer look at each and do a count, see what I can see. Stay tuned on Bluesky and Mastodon.
