50 Ways of Looking at the Same Prompt

Or maybe just one way of looking? Let’s see.

I decided to run a simple prompt based on one of my assignments through some LLMs multiple times and see what happened. In particular, I was interested in seeing what common patterns might emerge from the responses. In part, this was inspired by scoring student responses to this same assignment (which I permitted students to use ChatGPT to complete, so long as they acknowledged and documented their use) and noticing what seemed to be common patterns in the work submitted by students who had used ChatGPT.

What I Did

My method for this experiment was simple: I wrote a very basic Python script that submitted the same prompt to the ChatGPT model via the API 50 times and saved each response to a text file, like so:

import openai

def chat_response(state):
    # Send the accumulated message chain to the API and return the full response
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=state
    )
    return response

GPT_instructions = "Write an article for The Onion about something relevant to West Chester University students. This article must begin with a headline appropriate to an Onion article and be 200-300 words long"

for i in range(50):
    # Start a fresh conversation each time so the 50 responses are independent
    message_chain = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": GPT_instructions},
    ]
    response = chat_response(message_chain)
    # Write each article to its own numbered text file
    with open("GPT_Onion/4article" + str(i) + ".txt", "w", encoding="utf-8") as text_file:
        text_file.write(response['choices'][0]['message']['content'] + "\n")

The prompt was taken more or less verbatim from my assignment.

Why 50 responses? Because my first attempt tried to generate 100 responses and timed out halfway through! But seriously, I have no idea what a representative sample of this sort of output would be. If I were looking for patterns in a corpus of a thousand student responses, or ten thousand, or a million, there are statistical techniques that would let me choose a good representative sample (don’t ask me what these are off the top of my head; I just know they exist and could find them if I needed them). But how big is the “sample” of latent LLM responses? How many responses could the machine generate? How do I know if the patterns I am seeing are representative of how the machine behaves, or just a fluke random run I happened to stumble upon?

¯\_(ツ)_/¯

I was able to make 50 easily, and read through 50 in a reasonable span of time. There are a couple of patterns I think seem interesting, even at this small sample size. The others are worth thinking about in a sort of rough-and-ready way but aren’t conclusive.
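Incidentally, the timeout that cut my first 100-article run short would be easy enough to guard against with a retry wrapper around the API call. A minimal sketch (the retry count and delays here are arbitrary guesses, not anything I tuned):

import time
import openai

def chat_response_with_retry(state, max_tries=3):
    # Retry the API call a few times, waiting longer after each failure
    for attempt in range(max_tries):
        try:
            return openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=state
            )
        except Exception:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            time.sleep(5 * (attempt + 1))  # crude linear backoff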

I read through the 50 articles generated by ChatGPT and coded them for main topics. I also noted cases where a response seemed very similar to a prior one, and recorded the named people in each response.

I then repeated the generation step using GPT-4 and quickly skimmed those responses for main topics and named people.

This experiment cost me $1.70, the vast majority of that being the $1.43 I spent on GPT-4 API responses.

What Was in the Articles ChatGPT and GPT-4 Wrote

The outputs from both ChatGPT and GPT-4 showed some repeated patterns. GPT-4’s articles, however, were somewhat less repetitive in terms of strict form, with repetition more often happening at the thematic level.

Just for fun, I went back to that old DH standby, the word cloud, and visualized the output I got from both LLMs. Clouds like these take only a few lines with the Python wordcloud package; here’s a minimal sketch of one way to make them (the styling choices are incidental):
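from wordcloud import WordCloud

# Concatenate all 50 generated articles into a single string
text = ""
for i in range(50):
    with open("GPT_Onion/4article" + str(i) + ".txt", "r", encoding="utf-8") as f:
        text += f.read() + "\n"

# Build the cloud and save it as an image
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("GPT_Onion/chatgpt_cloud.png")

Here’s the result from the ChatGPT articles: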

As you can see from the word cloud above, ChatGPT seems to have a very particular idea of what a “common” surname for a student or faculty member in the US looks like. In addition to “Thompson,” it liked “Stevenson,” “Johnson,” and “Watson.” In fact, of the 48 named people in the sample, 36 had some sort of “-son” surname.

The presence of the word “Time” in the word cloud probably reflects the frequent use of time travel as a comedic trope in the generated articles. Seven of fifty articles (14%) invoked the time-travel theme, according to my hand count. Twelve of fifty (24%) invoked the idea of “discovery” (also present in the word cloud), in which students either “discover” something obvious about campus (for example, an article ChatGPT titled “West Chester University Students Shocked to Discover Library Contains Actual Books”) or something unexpected (“West Chester University Student Discovers Multiverse in Local Laundromat Dryer”).

Not present on the word cloud is the theme of student laziness, which appeared in sixteen of the fifty ChatGPT articles (32%), by my count. This somewhat abstract theme was rarely explicitly invoked, but it clearly informs the humor of articles like “West Chester University Students Discover Time Travel Portal in Campus Library, Use it to Attend Classes From Their Dorm Rooms,” “West Chester University Students Request Permission to Skip All Classes and Still Graduate on Time,” and “West Chester University Student Discovers How to Freeze Time Between Classes, Uses Extra Time to Binge-watch Netflix Series.” (That first example is the trifecta: discovery, time travel, and laziness.)

At least four of the ChatGPT-generated articles were almost exact duplicates of one another, with extremely similar headlines and content: for example, “West Chester University Student Discovers Time Travel, Uses Ability to Attend Zero 8 a.m. Classes” and “West Chester University Student Discovers Time Travel, Uses It to Avoid 8 a.m. Classes.” Beyond the similar titles, the two also begin in nearly the same way. The first opens:

West Chester, PA – In a groundbreaking development that has professors baffled and the administration scrambling, West Chester University student Derek Thompson has reportedly unlocked the secret to time travel, enabling him to avoid the dreaded early morning classes that plague his peers.

And the second begins:

West Chester, PA—In a groundbreaking discovery that has left the scientific community and West Chester University faculty scratching their heads, local student Max Thompson reportedly stumbled upon the secret of time travel—solely for the purpose of avoiding those dreaded 8 a.m. classes.

They then proceed with roughly equivalent paragraphs, similar quotes, etc. If these had been turned in by two students independently, I would have assumed plagiarism, either from each other or a common source.
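I spotted these duplicates by eye, but flagging them automatically would be straightforward with a rough similarity check using Python’s standard-library difflib. A minimal sketch (the 0.6 threshold is an arbitrary guess, not something I calibrated):

from difflib import SequenceMatcher

# Load all 50 generated articles
articles = []
for i in range(50):
    with open("GPT_Onion/4article" + str(i) + ".txt", "r", encoding="utf-8") as f:
        articles.append(f.read())

# Flag any pair of articles whose texts are unusually similar
for a in range(len(articles)):
    for b in range(a + 1, len(articles)):
        ratio = SequenceMatcher(None, articles[a], articles[b]).ratio()
        if ratio > 0.6:  # arbitrary cutoff; tune by eyeballing the results
            print(f"Articles {a} and {b} look similar (ratio {ratio:.2f})")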

GPT-4, in contrast, was not nearly so formulaic. Here’s the word cloud!

It did not, for example, name every character “Thompson.” However, as the cloud above suggests, it did have an inordinate fondness for squirrels. Fourteen of fifty articles (28%) were about squirrels in some capacity (“WCU Squirrel Population Demands Representation in Student Government,” “Climate Crisis Hits West Chester University: Squirrels Reportedly Hoarding Cool Ranch Doritos,” “Local Squirrel Ascends to Presidency of West Chester University”). Many of these focused on the idea of a squirrel attaining a leadership position on campus.

The themes of discovery and student laziness were less prominent in this sample but still present, with ten and seven examples respectively. GPT-4 also wrote several (six) articles satirizing the high cost of college, a topic ChatGPT hadn’t engaged with. One, entitled “West Chester University Declares Majors Irrelevant; Students Now Just Majoring in Debt,” was particularly cutting. It imagines the university president (correctly identified by GPT-4) “explaining, ‘We figured, why not prep our students for the most reliable outcome of their academic journey? Crushing financial burden,’” and reports that “The notorious ‘Rammy’ statue was promptly replaced with a huge cement dollar sign, and the radio station WCUR’s playlist was updated with ‘Bills, Bills, Bills’ by Destiny’s Child on a loop.”

There were no near-duplicate articles in this sample. While two articles had almost identical headlines (“West Chester University Debuts New Major: Advanced Procrastination Studies” and “West Chester University Announces Revolutionary New Major: Advanced Procrastination Studies”), the underlying articles treated the theme in the title quite differently.

Using ChatGPT to Analyze ChatGPT

After hand-tagging themes in the generated articles, I wrote a script that fed the articles back into ChatGPT and asked it to extract titles, themes, and named people. I was curious to see how well the software would do at this analytic task.

import time
import openai

def chat_response(state):
    # Send the accumulated message chain to the API and return the full response
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=state
    )
    return response

outfile = open("GPT_Onion/4topics.csv", "a", encoding="utf-8")

for i in range(50):
    # Read back one of the articles generated earlier
    with open("GPT_Onion/4article" + str(i) + ".txt", "r", encoding="utf-8") as text_file:
        GPT_instructions = text_file.read()
    message_chain = [
        {"role": "system", "content": "You are a helpful agent that extracts headlines, main topics, and named people from short articles you are given. For each text you read, return the following separated by semi-colons: 1) the article's headline 2) a list of one to three word topics that describes the main themes of the article separated by commas 3)a list of named people found in the article, separated by commas. Only return this semi-colon separated list and nothing else. Base your response only on the text you are given."},
        {"role": "user", "content": GPT_instructions},
    ]
    response = chat_response(message_chain)
    # One semicolon-separated line per article: index;headline;topics;people
    outfile.write(str(i) + ";" + response['choices'][0]['message']['content'] + "\n")
    print(i)
    time.sleep(2)  # brief pause between calls to stay under rate limits
outfile.close()

The results here were really interesting. ChatGPT did a perfect job extracting titles (which were consistently marked) and named people. I actually used its extracted names to compute the percentage of “-son” surnames above. Finding and extracting “named people” is a non-trivial data-analysis task, and it absolutely nailed it on the first try with a very simple prompt. No hallucinations were observed in this run of 50 examples.
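Computing that surname percentage from the extraction output is then trivial. A minimal sketch, assuming the index;headline;topics;people format the script above writes out (and that the model followed the format on every line):

# Count "-son" surnames among the named people ChatGPT extracted
son_count = 0
total = 0
with open("GPT_Onion/4topics.csv", "r", encoding="utf-8") as f:
    for line in f:
        fields = line.strip().split(";")
        if len(fields) < 4:
            continue  # skip any malformed lines
        for person in fields[3].split(","):
            name = person.strip()
            if not name:
                continue
            total += 1
            if name.split()[-1].lower().endswith("son"):
                son_count += 1
print(f"{son_count} of {total} named people have a '-son' surname")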

The topics extracted were less satisfying, though they weren’t inaccurate. It often picked up on the theme of “discovery” that I had tagged, but not always. For example, it listed only “library, students, books” as topics for “West Chester University Students Shocked to Discover Library Contains Actual Books.” It never listed “laziness” as a topic. However, this more abstract theme was only really visible, even to me, after comparing multiple examples of articles, and I wasn’t able to have ChatGPT track all the articles at once without running out of context window.
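For reference, you can see why they won’t all fit by counting tokens with OpenAI’s tiktoken library. A minimal sketch (4,096 tokens being gpt-3.5-turbo’s context window at the time):

import tiktoken

# Tally tokens across all 50 articles to see how far over the limit they run
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
total_tokens = 0
for i in range(50):
    with open("GPT_Onion/4article" + str(i) + ".txt", "r", encoding="utf-8") as f:
        total_tokens += len(enc.encode(f.read()))
print(f"{total_tokens} tokens total vs. a 4,096-token context window")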

What Does It All Mean?

Here’s the TL;DR:

Basically, it looks like multiple responses to a common prompt converge around common themes, for both ChatGPT and GPT-4. A little basic prompt engineering, perhaps even automated with mad-lib-style prompt modifiers, would probably shake that up a bit; that’s something I want to test.
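By mad-lib-style modifiers I mean something like the following sketch, which bolts a randomly chosen constraint onto the base prompt before each API call (the modifier list is purely illustrative; I haven’t run this yet):

import random

base_prompt = ("Write an article for The Onion about something relevant to "
               "West Chester University students. This article must begin with a "
               "headline appropriate to an Onion article and be 200-300 words long.")

# Hypothetical modifiers meant to push the model off its favorite themes
modifiers = [
    " The article should be about campus dining.",
    " The article should involve a sports team.",
    " The article must not mention squirrels or time travel.",
    " Write it in the style of a breathless local TV news report.",
]

prompt = base_prompt + random.choice(modifiers)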

From a pragmatic, teaching point of view, generating a sample of 20-50 responses from ChatGPT or GPT-4 to see what themes it reaches for might be informative. Not that those themes would diagnose plagiarism all that conclusively, since the themes LLMs use are likely not so unlike those students might use (though in prior runs of this assignment, students were much less likely than the GPTs to write parodies of student laziness). However, it might give you a sense of the “AI common sense” that you might then want to engage with/push back against/complicate/push past/etc.

From the point of view of understanding machine writing, it’s interesting to see the recurrence of themes, ideas, terms, and sometimes even whole structures in the responses generated. I’ll probably run off some further examples, especially with GPT-4, to see if I get more “collisions” where the LLM generates very similar responses to the same prompt.

From the perspective of trying to understand where LLMs go next, I think the contrast between the somewhat formulaic (and rarely funny) “Onion articles” the LLMs generated and their huge success at content-processing work (like identifying named people and topics) is informative. I continue to think that LLMs’ ability to process text will be more important in the long run than their ability to compose it.
