For the first post of this blog, I wanted to study in depth (before presenting articles from reputable news outlets, medical publications, or theses) the use of words that ‘Scream AI.’
Here’s a link to a Reddit post that discusses it from a different perspective.
My interest naturally began with identifying corrupted texts that contained buzzwords. Indeed, I came across a multitude of links, articles, and news pieces that genuinely believed they had found the magic solution, the infamous list of buzzwords.
Basically, I wanted to find them on my own: 90% of the articles I encountered openly stated that they had used GPT to find the words, and honestly, I didn’t find that very objective.
So I took this dataset and wrote a script to analyze word frequency.
id | human_text | ai_text |
---|---|---|
cc902a20-27c4-4c18-8012-048a328206d1 | Also they feel more comfortable at home. Some school have.. | Therefore, when it comes to allowing students the option to attend classes from home, there are.. |
710f585e-5e98-42b8-81f6-265d7c934645 | Base in my experiences I’m growing, I try hard, and I try | As Emerson said, by going confidently in the.. |
e4db6c43-7b6b-4385-9b67-04652c71df0c | Many people around the world have different character.. | Parents, for example, can have a major influence on a child’s moral compass by.. |
An example of a few rows from the dataset. The original dataset also contained the instructions given to the human and the AI. I might use those for other studies, but in this case they weren’t useful, as I focused the study on word frequency.
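A minimal sketch of what the counting step of such a script might look like. This is not the original script: the function name `count_words` and the `min_length` parameter are made up for illustration, and stripping punctuation before splitting is an assumption (it would explain why a word like "problem-solving" shows up as "problemsolving" in the results).

```python
import re
from collections import Counter

def count_words(texts, exclude_short_words=True, min_length=4):
    """Count how often each word occurs across an iterable of texts.

    Hypothetical helper: name and parameters are assumptions, not the
    author's actual code.
    """
    counts = Counter()
    for text in texts:
        # Assumption: lowercase, drop punctuation (so "problem-solving"
        # becomes "problemsolving"), then split on whitespace.
        tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
        if exclude_short_words:
            # Keep only words with at least `min_length` letters.
            tokens = [t for t in tokens if len(t) >= min_length]
        counts.update(tokens)
    return counts

# Toy inputs standing in for the ai_text and human_text columns.
ai_counts = count_words(["Therefore, there are drawbacks. Drawbacks matter."])
human_counts = count_words(["Also they feel more comfortable at home."])
print(ai_counts.most_common(3))
```

Running the same function once over the `ai_text` column and once over the `human_text` column gives two counters that can be compared word by word.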
I’m attaching a table with the 10 words used most frequently by AI compared to humans. Below, you’ll find the complete .CSV file created from 100,000 rows of the dataset.
Word | AI Frequency | Human Frequency | World Frequency | Delta vs Human | AI Overuse Percentage (%) |
---|---|---|---|---|---|
drawbacks | 11622 | 0 | 1.62e-06 | 11622 | inf |
invaluable | 7314 | 0 | 3.47e-06 | 7314 | inf |
incredibly | 6456 | 0 | 2.4e-05 | 6456 | inf |
foster | 6374 | 0 | 2.14e-05 | 6374 | inf |
increasingly | 5568 | 0 | 2.4e-05 | 5568 | inf |
problemsolving | 4924 | 0 | 2e-08 | 4924 | inf |
detrimental | 4334 | 0 | 3.55e-06 | 4334 | inf |
implications | 3706 | 0 | 1.26e-05 | 3706 | inf |
digital | 3704 | 0 | 6.76e-05 | 3704 | inf |
1. Word
This column contains the individual words that were analyzed. These are the unique words found in the AI-generated text (`ai_text`) after the text has been tokenized and filtered based on the length criteria (words with 4 or more letters if `exclude_short_words` is set to `True`).
2. AI Frequency
This column represents the frequency of each word in the AI-generated text. It shows how many times each word appeared across all the AI-generated examples within the selected sample size.
3. Human Frequency
This column represents the frequency of each word in the human-written text. It shows how many times each word appeared across all the human-written examples within the selected sample size.
4. World Frequency
This column provides the global frequency of each word based on a general corpus, using the `word_frequency` function from the `wordfreq` library. It represents how commonly each word is used in everyday language according to large-scale, real-world text data.
5. Delta vs Human
This column shows the difference between the AI Frequency and the Human Frequency for each word. It is calculated as `AI Frequency - Human Frequency`. A positive delta means the word is used more frequently by AI than by humans, while a negative delta indicates that the word is used more frequently by humans.
6. AI Overuse Percentage (%)
This column represents the percentage by which AI uses the word more frequently than humans. It is calculated using the formula `((AI Frequency - Human Frequency) / Human Frequency) * 100`. If the word does not appear in the human text (i.e., `Human Frequency` is 0), this value is set to infinity (`inf`), indicating that the AI uses the word significantly more often than humans. A high percentage suggests that the word is much more common in AI-generated text compared to human-written text.

Among these, I’d like to point out the ones that I find the most intriguing. These are the words that AI used at least once compared to human usage:
Word | AI Frequency | Human Frequency | World Frequency | Delta vs Human | AI Overuse Percentage (%) |
---|---|---|---|---|---|
love | 1686 | 22282 | 11.01 | -20596 | -92.43 |
selfdirection | 42 | 0 | 0.0 | 42 | inf |
selfawareness | 538 | 0 | 2.09e-08 | 538 | inf |
selfmotivation | 330 | 0 | 0.0 | 330 | inf |
idea | 8176 | 48592 | 3.49 | -40416 | -83.17 |
laugh | 40 | 1582 | 4.57e-05 | -1542 | -97.47 |
fail | 1276 | 22590 | 4.07e-05 | -21314 | -94.35 |
think | 8356 | 157566 | 0.12 | -149210 | -94.70 |
tired | 202 | 11280 | 5.13e-05 | -11078 | -98.21 |
boring | 42 | 4522 | 2.69e-05 | -4480 | -99.07 |
nice | 82 | 12766 | 3.54 | -12684 | -99.36 |
everybody | 42 | 7454 | 6.92e-05 | -7412 | -99.44 |
maybe | 40 | 21532 | 4.42 | -21492 | -99.81 |
Being human is amazing.
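The Delta vs Human and AI Overuse Percentage columns can be recomputed from the two frequency columns alone. Here is a small sketch that follows the formulas as described in the column notes; the function names are hypothetical and the three rows are taken from the tables above.

```python
import math

def delta_vs_human(ai_freq, human_freq):
    """AI Frequency - Human Frequency."""
    return ai_freq - human_freq

def ai_overuse_percentage(ai_freq, human_freq):
    """((AI Frequency - Human Frequency) / Human Frequency) * 100.

    Returns infinity when the word never appears in the human text.
    """
    if human_freq == 0:
        return math.inf
    return (ai_freq - human_freq) / human_freq * 100

# (AI Frequency, Human Frequency) pairs copied from the tables.
rows = {"drawbacks": (11622, 0), "love": (1686, 22282), "maybe": (40, 21532)}
for word, (ai, human) in rows.items():
    print(word, delta_vs_human(ai, human), round(ai_overuse_percentage(ai, human), 2))
```

For "love" (1686 AI uses against 22282 human uses) this gives a delta of -20596 and an overuse of about -92.43%, meaning humans used the word far more often than the AI did.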
Here’s the study on the first 100k rows from the dataset Word_frequencies_ai.csv (deprecated)
Edit of 28.08.2024:
I realized that the chosen dataset contained duplicates in the cells for both the AI and the human text. I removed the duplicates, first from one category and then from the other.
I then created a new dataset based on the previous one but with unique values:
https://huggingface.co/
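A rough sketch of the two-pass deduplication described above, in plain Python. The helper `drop_duplicates` and the toy rows are illustrative only, and keeping the first occurrence of each text is an assumption about how the duplicates were resolved.

```python
def drop_duplicates(rows, column):
    """Keep only the first row for each distinct value in `column`."""
    seen = set()
    unique_rows = []
    for row in rows:
        value = row[column]
        if value not in seen:
            seen.add(value)
            unique_rows.append(row)
    return unique_rows

# Toy rows standing in for the real dataset.
rows = [
    {"human_text": "h1", "ai_text": "a1"},
    {"human_text": "h2", "ai_text": "a1"},  # duplicate AI text
    {"human_text": "h1", "ai_text": "a2"},  # duplicate human text
    {"human_text": "h3", "ai_text": "a3"},
]

# First pass on one category, second pass on the other.
deduped = drop_duplicates(drop_duplicates(rows, "ai_text"), "human_text")
print(len(deduped))
```

Note that the order of the two passes can change which rows survive, since a row removed in the first pass no longer blocks anything in the second.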
After that, I split the process into 1,000,000 blocks (now just under 719,588).