AI Buzzwords – The Most Commonly Used Words by GPT (But Not by Humans)

In this article, we will explore the most commonly used buzzwords by AI models like GPT that are not typically used by humans.

For the first post of this blog, before turning to articles from reputable news outlets, medical publications, or theses, I wanted to study in depth the use of words that ‘scream AI.’

Here’s a link to a Reddit post that discusses it from a different perspective.

My interest naturally began with identifying corrupted texts that contained buzzwords. I came across a multitude of links, articles, and news pieces whose authors genuinely believed they had found the magic solution: the infamous list of buzzwords.

Basically, I wanted to find them on my own, as 90% of the articles I encountered openly stated that they had used GPT to find the words, and honestly, I didn’t find that very objective.

So I took this dataset and wrote a script to analyze word frequency.

| id | human_text | ai_text |
|---|---|---|
| cc902a20-27c4-4c18-8012-048a328206d1 | Also they feel more comfortable at home. Some school have.. | Therefore, when it comes to allowing students the option to attend classes from home, there are.. |
| 710f585e-5e98-42b8-81f6-265d7c934645 | Base in my experiences I’m growing, I try hard, and I try | As Emerson said, by going confidently in the.. |
| e4db6c43-7b6b-4385-9b67-04652c71df0c | Many people around the world have different character.. | Parents, for example, can have a major influence on a child’s moral compass by.. |

A few example rows from the dataset. The original dataset also contained the instructions given to the human and the AI. I might use those for other studies, but in this case they weren’t useful, as I focused the study on word frequency.

I’m attaching a table with the 10 words most frequently used by AI compared to humans. Below, you’ll find the complete .CSV file created from 100,000 rows of the dataset.

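| Word | AI Frequency | Human Frequency | World Frequency | Delta vs Human | AI Overuse Percentage (%) |
|---|---|---|---|---|---|
| drawbacks | 11622 | 0 | 1.62e-06 | 11622 | inf |
| invaluable | 7314 | 0 | 3.47e-06 | 7314 | inf |
| incredibly | 6456 | 0 | 2.4e-05 | 6456 | inf |
| foster | 6374 | 0 | 2.14e-05 | 6374 | inf |
| increasingly | 5568 | 0 | 2.4e-05 | 5568 | inf |
| problemsolving | 4924 | 0 | 2e-08 | 4924 | inf |
| detrimental | 4334 | 0 | 3.55e-06 | 4334 | inf |
| implications | 3706 | 0 | 1.26e-05 | 3706 | inf |
| digital | 3704 | 0 | 6.76e-05 | 3704 | inf |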

1. Word
This column contains the individual words that were analyzed. These are the unique words found in the AI-generated text (ai_text) after the text has been tokenized and filtered based on the length criteria (words with 4 or more letters if exclude_short_words is set to True).

2. AI Frequency
This column represents the frequency of each word in the AI-generated text. It shows how many times each word appeared across all the AI-generated examples within the selected sample size.

3. Human Frequency
This column represents the frequency of each word in the human-written text. It shows how many times each word appeared across all the human-written examples within the selected sample size.

4. World Frequency
This column provides the global frequency of each word based on a general corpus, using the word_frequency function from the wordfreq library. It represents how commonly each word is used in everyday language according to large-scale, real-world text data.

5. Delta vs Human
This column shows the difference between the AI Frequency and the Human Frequency for each word. It is calculated as AI Frequency - Human Frequency. A positive delta means the word is used more frequently by AI than by humans, while a negative delta indicates that the word is used more frequently by humans.

6. AI Overuse Percentage (%)
This column represents the percentage by which AI uses the word more frequently than humans. It is calculated using the formula ((AI Frequency - Human Frequency) / Human Frequency) * 100. If the word does not appear in the human text (i.e., Human Frequency is 0), this value is set to infinity (inf), indicating that the AI uses the word significantly more often than humans. A high percentage suggests that the word is much more common in AI-generated text compared to human-written text.
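The column definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the study: the function names `tokenize` and `compare_frequencies`, the punctuation-stripping rule (which would turn "problem-solving" into "problemsolving"), and the sample sentences are all assumptions. The World Frequency column is left out to keep the sketch free of third-party dependencies; it could be filled in with `word_frequency` from the wordfreq library.

```python
import math
import re
from collections import Counter


def tokenize(text, exclude_short_words=True, min_len=4):
    """Lowercase, strip everything except letters and whitespace, then split.

    Assumption: punctuation is simply removed, so hyphenated words collapse
    into one token (e.g. "problem-solving" -> "problemsolving").
    """
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    words = cleaned.split()
    if exclude_short_words:
        words = [w for w in words if len(w) >= min_len]
    return words


def compare_frequencies(ai_texts, human_texts, exclude_short_words=True):
    """Build one row per word found in the AI text, sorted by delta."""
    ai_counts = Counter()
    human_counts = Counter()
    for text in ai_texts:
        ai_counts.update(tokenize(text, exclude_short_words))
    for text in human_texts:
        human_counts.update(tokenize(text, exclude_short_words))

    rows = []
    for word, ai_freq in ai_counts.items():
        human_freq = human_counts.get(word, 0)
        delta = ai_freq - human_freq
        # Division by zero is replaced by infinity, as in the tables.
        overuse = math.inf if human_freq == 0 else delta / human_freq * 100
        rows.append({
            "Word": word,
            "AI Frequency": ai_freq,
            "Human Frequency": human_freq,
            "Delta vs Human": delta,
            "AI Overuse Percentage (%)": overuse,
        })
    rows.sort(key=lambda r: r["Delta vs Human"], reverse=True)
    return rows


# Hypothetical sample texts, only to exercise the functions.
rows = compare_frequencies(
    ["Problem-solving skills are invaluable.", "Drawbacks foster invaluable growth."],
    ["Maybe school is boring.", "I think skills are nice, maybe."],
)
for row in rows[:3]:
    print(row)
```

Because rows are built only from words that occur in the AI text, a word used exclusively by humans never appears in the output, which matches the description of the Word column.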

Among these, I’d like to point out the ones I find most intriguing. These are words that the AI used at least once, shown alongside how often humans used them:

| Word | AI Frequency | Human Frequency | World Frequency | Delta vs Human | AI Overuse Percentage (%) |
|---|---|---|---|---|---|
| love | 1686 | 22282 | 11.01 | -20596 | -92.43 |
| selfdirection | 42 | 0 | 0.0 | 42 | inf |
| selfawareness | 538 | 0 | 2.09e-08 | 538 | inf |
| selfmotivation | 330 | 0 | 0.0 | 330 | inf |
| idea | 8176 | 48592 | 3.49 | -40416 | -83.17 |
| laugh | 40 | 1582 | 4.57e-05 | -1542 | -97.47 |
| fail | 1276 | 22590 | 4.07e-05 | -21314 | -94.35 |
| think | 8356 | 157566 | 0.12 | -149210 | -94.70 |
| tired | 202 | 11280 | 5.13e-05 | -11078 | -98.21 |
| boring | 42 | 4522 | 2.69e-05 | -4480 | -99.07 |
| nice | 82 | 12766 | 3.54 | -12684 | -99.36 |
| everybody | 42 | 7454 | 6.92e-05 | -7412 | -99.44 |
| maybe | 40 | 21532 | 4.42 | -21492 | -99.81 |

Being human is amazing.

Here’s the study on the first 100k rows of the dataset: Word_frequencies_ai.csv (deprecated)


Edit of 28.08.2024:

I realized that the chosen dataset contained duplicates in both the AI and the human text columns. I removed the duplicates, first from one column and then from the other.
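A minimal sketch of that two-pass deduplication, assuming each record is a dictionary with the `human_text` and `ai_text` columns shown in the sample rows. The helper name `drop_duplicate_texts`, the choice to drop the whole row when a text repeats, and the toy records are illustrative assumptions, not the exact procedure used here.

```python
def drop_duplicate_texts(rows, column):
    """Keep only the first row for each distinct value in `column`."""
    seen = set()
    unique = []
    for row in rows:
        value = row[column]
        if value in seen:
            continue  # a row with this text was already kept
        seen.add(value)
        unique.append(row)
    return unique


# Hypothetical records, only to show the two passes.
records = [
    {"id": 1, "human_text": "a", "ai_text": "x"},
    {"id": 2, "human_text": "a", "ai_text": "y"},  # duplicate human_text
    {"id": 3, "human_text": "b", "ai_text": "x"},  # duplicate ai_text
    {"id": 4, "human_text": "c", "ai_text": "z"},
]

# First pass on one column, second pass on the other.
deduped = drop_duplicate_texts(records, "ai_text")
deduped = drop_duplicate_texts(deduped, "human_text")
print([row["id"] for row in deduped])
```

The order of the passes can change which rows survive, so it is worth fixing one order and keeping it when regenerating the dataset.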

I then created a new dataset based on the previous one, but containing only unique values:

https://huggingface.co/

After that, I split the process into 1,000,000 blocks (now just under 719,588).

combined_words.csv