
How (Un)reliable Are AI Detectors? How False Positives Come About and How Best to Deal with Them


Artificial intelligence – the new technology is causing a stir, but also some uncertainty.  How can you be sure that the text you ordered really flowed from the pen of a genuine creative writer?  Are AI detectors the solution, and how should you interpret the results of these applications?  We have done some research for you.  In this blog, we will give you an insight into what goes on behind the scenes at Textbroker and explain what role AI detectors play in this.

 


How Do AI Detectors Work?

Honestly – when you first used such a tool, didn’t you assume that the percentage it returned told you how much of the text was created by AI? After all, AI detectors are promoted everywhere as identifying text written by ChatGPT or similar AI text generators. Unfortunately, that is not entirely correct. These tools work with probabilities, so all they really tell you is this: the checked text was created by an AI with a probability of x percent.

 

How would the tool know? The underlying principle is quite simple: AI detectors are themselves based on a language model, much like ChatGPT. The tool uses this model to estimate how likely it is that the checked text was created with the assistance of AI. The detector calculates which word is most likely to appear next at each point in the text. Since AI text generators use similar language models, AI-generated text produces a high number of matches. If the text was written by a human, on the other hand, the deviations should be large enough for the detector to classify it as manually created – at least in theory.
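To make the matching idea concrete, here is a minimal, hypothetical sketch – not any vendor’s actual algorithm. It assumes the Hugging Face transformers library and the openly available GPT-2 model; the function name and scoring are our own illustration:

```python
# Toy illustration of next-word matching, NOT any detector's real method.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top1_match_rate(text: str) -> float:
    """Fraction of tokens that are exactly the model's #1 next-token guess.

    AI-generated text tends to score higher here, because generators
    often pick high-probability continuations.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # The prediction at position i refers to the token at position i + 1.
    predicted = logits[0, :-1].argmax(dim=-1)
    actual = ids[0, 1:]
    return (predicted == actual).float().mean().item()

print(top1_match_rate("The family went to the train station to take the train."))
```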

AI detectors also check for spelling and grammatical errors. Textual errors are annoying, but they are human; AI text generators, on the other hand, rarely make spelling mistakes. A completely error-free text can therefore be an indicator of AI-generated content. Of course, this does not mean that an error-free text is automatically an AI product, or that errors are automatic proof of human work – neither a flawless text nor the occasional typo is conclusive on its own.

 

Creativity vs. Probabilities

Two additional factors that AI detectors look for are “perplexity” and “burstiness”. In both cases, a high score suggests that a human wrote the text.

 

Perplexity

 

The meaning of perplexity in text is well-illustrated by the following examples:



Low perplexity: “The family went to the train station in order to take the train.”

Higher perplexity: “The family went to the train station in order to pick up the friends from the train, and then immediately proceeded to the zoo together, riding e-scooters.”

The first sentence has a highly probable continuation, meaning it has low perplexity. The content follows logically, and the second half of the sentence is exactly what a reader would most likely expect. And that is exactly the point: AI very often generates sentences with foreseeable content, because it lacks the creativity with which our authors infuse a text.

 

Higher-perplexity text, on the other hand, looks more like the second sentence. There is a turn in the story that departs from the content of the first part. It is rather improbable that an AI would generate such a sentence, because an artificial intelligence almost always produces plausible sentences with predictable turns.
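For the technically curious, here is a minimal sketch of how perplexity can be computed with an open language model. It again assumes the Hugging Face transformers library and GPT-2 (an English model, so the numbers are purely illustrative); this is not any detector’s actual code:

```python
# Perplexity = exponentiated average negative log-likelihood per token.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

low = "The family went to the train station in order to take the train."
high = ("The family went to the train station in order to pick up the friends "
        "from the train, and then immediately proceeded to the zoo together, "
        "riding e-scooters.")
print(perplexity(low), perplexity(high))  # the second value will typically be higher
```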

 

Burstiness

Burstiness is the metric AI detectors use to rate variation between sentences, mostly in sentence length and structure. Artificial intelligence tends to construct sentences along similar patterns and acts repetitively: the sentences are usually of similar length, with little variation. Humans, by contrast, usually mix short and long sentences, switch between active and passive voice, and vary their sentence structures. Hence, a high burstiness score is indicative of human-generated text.
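One simple way to quantify this is the spread of sentence lengths. The following sketch is an illustrative heuristic of our own, not any detector’s actual metric:

```python
# Rough burstiness heuristic: variation of sentence lengths in a text.
import re
import statistics

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (in words).

    Higher values mean a mix of short and long sentences,
    which is more typical of human writing.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # too little text to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)

human_like = ("It rained. We stayed in, played three rounds of cards, argued "
              "about the rules, and finally made pancakes at midnight. Fun!")
ai_like = ("The weather was rainy that day. We decided to stay indoors. "
           "We played some card games together. We made pancakes later on.")
print(burstiness(human_like), burstiness(ai_like))
```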

 


AI Detectors are not Infallible

The notion that detector tools can identify AI-generated text is attractive at first glance. Unfortunately, the tools are not infallible, mainly because detector technology and AI text generators both continue to evolve. AI detector results are therefore not entirely reliable, and they often report so-called false positives: results that erroneously classify human-generated text as AI-generated.

 

For German-language texts, the estimated accuracy was approximately 60% in tests conducted in 2023. This figure covers both directions of error: AI-generated text that went undetected and human-generated text that the detector classified as AI-generated.

 

Potential Causes of False Positives

Perplexity and burstiness in particular can frequently lead to incorrect results. Certain types of text mandate a fixed structure or leave little room for creative phrasing. News articles and listicles, for example, have to adhere to a predictable structure, and company and service descriptions likewise rely on certain types of content and phrasing.

 

Texts in which legal requirements oblige the author to incorporate certain phrases or to refrain from clear statements are also prone to false positives. Take healthcare or financial texts, for instance: in these documents, authors have to avoid making efficacy claims. Consequently, they must use predictable phrasing and stock auxiliary phrases, which are commonly attributed to AI text generators.

 

Short texts tend to be problematic as well. They simply do not provide enough indicators for the tool to make an objective determination. For this reason, the results of AI detectors should be taken as an indication, but never as proof. The same basic principle applies to AI detectors as to all other tools: they were created by humans for humans, and they always require a user to verify and interpret the results!
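As a trivial first gate, you can refuse to even interpret a score for very short text. The sketch below uses the 255-character web-app minimum that Copyleaks itself cites (see its statement further down); the function is our own illustration:

```python
# Sanity check before trusting a detector score. The 255-character minimum
# is the web-app threshold Copyleaks states; adjust it for your tool.
MIN_CHARS = 255

def score_is_meaningful(text: str, minimum: int = MIN_CHARS) -> bool:
    """A detector score on text below the minimum length should be ignored."""
    return len(text.strip()) >= minimum

print(score_is_meaningful("Too short to judge."))  # False
```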

 


 

How Reliable are AI Detectors Really?

In order to understand precisely how false positives come about, let’s look at what the results of an AI detector actually represent. AI detectors do not provide an absolute answer as to whether a text was generated by an AI. Rather, they present a probability that a given text stems from a human or an AI. A value of 75% therefore does not mean that an AI such as ChatGPT generated 75% of the text. It merely means: according to this particular tool, the likelihood that some form of AI was used during the writing process is 75%.
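To spell out that distinction in code – purely illustrative, not any vendor’s API:

```python
# Purely illustrative: reading a detector score as a document-level
# confidence, not as a share of AI-written words.
def interpret(score_percent: float) -> str:
    p = score_percent / 100
    return (
        f"The tool is {p:.0%} confident that AI was involved somewhere in "
        f"the writing process, NOT that {p:.0%} of the words are AI-written."
    )

print(interpret(75))
```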

 

For more on how to interpret the results, see the blog article by Originality.ai on this subject. Copyleaks also provides a PDF with the most frequently asked questions about its AI detector, including good examples of what the Copyleaks tool counts as AI usage.

The providers of AI detector tools advertise high precision, frequently claiming hit rates of more than 90%. Conversely, this means that even the providers themselves do not consider their tools infallible. False positives – cases in which a human wrote a text but the tool, for a variety of reasons, suspects an AI author – can indeed occur. The providers emphasize this themselves in their FAQs and blog articles, as you will see further on. Of course, this is very annoying for any author who has taken great care to write a text without the help of an AI. We tested two of the most popular AI detector tools on the market and made many interesting discoveries along the way.

 

Our Testing: These are the Results

We ran four German sample texts through the detector tools: two category descriptions for a fictitious online coffee trader and two service descriptions for a fictitious locksmith service. One of each pair was written by a human, the other by an artificial intelligence. Why these particular text types? Because in our experience, text types with mandated content and wording are more prone to false positives than genres that give authors more creative freedom.

 

We deliberately selected German-language texts for the test. The rationale: many AI detector tools were originally developed for English and yield better results in that language (see, for instance, the Copyleaks FAQs). We wanted to test specifically whether the results are reliable for German texts as well. Additionally, we uploaded each of the four texts in different entry formats: once as a Word file and once via the text entry form on the detector’s website, each time both with and without HTML formatting.

 

Before going any further, let us be clear about one important caveat: we tested only a small number of texts. We therefore neither can nor want to make any statements about the statistical accuracy of these tools. The test is intended purely as a demonstration and as food for thought.

 

Testing Copyleaks

Copyleaks was the first tool that we tested:

 



| Method/Format of Upload | Text 1: Human (Category Text) | Text 2: Human (Service Description) | Text 3: AI (Category Text) | Text 4: AI (Service Description) |
|---|---|---|---|---|
| Word file upload, with HTML | 0% AI | 0% AI | 0% AI | 0% AI |
| Word file upload, without HTML | 0% AI | 100% AI | 100% AI | 100% AI |
| Text entry mask (copy & paste), with HTML | 0% AI | 0% AI | 0% AI | 0% AI |
| Text entry mask (copy & paste), without HTML | 100% AI | 100% AI | 100% AI | 100% AI |


The tool marked the first human text as 0% AI three times; only with direct entry without HTML was Copyleaks 100% certain that the text was AI-generated – which was wrong. The second text (also written by a human) was marked as definitely human only twice; for both uploads without HTML formatting, the tool wrongly identified it as 100% AI. Texts 3 and 4 showed exactly the same pattern – but since these two really are entirely AI-generated, there it was the 0% ratings for the HTML versions that were wrong.

 

One important observation: with both upload methods, the HTML formatting is converted to so-called “entities”, which appears to skew the results. So here is a first basic recommendation: if you want to have a text checked, do so without HTML formatting!
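If your source text contains HTML, you can strip it before submitting. Here is a minimal pre-processing sketch using only the Python standard library (the function name is our own, and the regex is deliberately simplistic):

```python
# Strip HTML tags and decode entities before submitting text to a detector,
# per the recommendation above. Crude sketch, not production-grade parsing.
import html
import re

def strip_html(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)       # drop tags
    text = html.unescape(text)                 # &amp; -> &, &ouml; -> ö, ...
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(strip_html("<p>Kaffee &amp; Espresso &ndash; frisch ger&ouml;stet</p>"))
# Kaffee & Espresso – frisch geröstet
```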

 

Testing Originality.ai

We repeated the exact same test with Originality.ai – more precisely, with its Multi-Language model, which is designed for languages other than English:

 



| Method/Format of Upload (Multi-Language model) | Text 1: Human (Category Text) | Text 2: Human (Service Description) | Text 3: AI (Category Text) | Text 4: AI (Service Description) |
|---|---|---|---|---|
| Word file upload, with HTML | 50% AI | 97% AI | 100% AI | 6% AI |
| Word file upload, without HTML | 50% AI | 99% AI | 100% AI | 100% AI |
| Text entry mask (copy & paste), with HTML | 50% AI | 97% AI | 100% AI | 6% AI |
| Text entry mask (copy & paste), without HTML | 51% AI | 100% AI | 100% AI | 100% AI |


 

The tool rated the first human-written text as AI-generated with a probability of 50 to 51%. The second text (also written by a human) was even marked as 97 to 100% AI. Originality.ai correctly identified the third text as 100% AI regardless of the upload method. For the fourth text, however, it rated the HTML-formatted versions as only 6% likely to be AI-generated; without HTML, the result was again a correct 100% AI.

 

Statements from the Operators of the Tools

In addition, we asked the operators of the AI detector tools Copyleaks and Originality.ai about the accuracy of their tools: under what conditions do they yield optimum results, what triggers false positives, and can writing assistants such as spell checkers and style checkers also influence the result? Both replied by email; their answers are summarized below.

 

Copyleaks

 

According to its provider, this AI detector delivers reliable assessments only for texts of a certain minimum length: 350 characters when using the browser add-on, 255 characters with the web app. Copyleaks attributes false positives to, among other factors, the use of additional text optimization tools. LanguageTool.org, for instance, offers a sentence rephrasing function alongside its spell checker. That function is an AI application, which AI detectors will identify as such. However, Copyleaks does not call this a false positive but a correct identification:

 

“While writing assistant tools have been using AI for a while, many platforms have evolved to use large language models (LLMs) for rewriting portions of content, which can lead to the text being flagged as AI which, technically speaking, is not a false positive but rather a correct detection of AI content.”

 

Language is another factor. According to Copyleaks, the detector works best with English text. Other languages such as German, French, and Italian are supported as well, but reliability there is not yet as high.

 

Copyleaks emphasizes that its internal testing on 1,000 English texts yielded no false positives in cases where the text had merely been corrected by a spell checker:

 

“To determine the threshold at which content edited by writing assistant tools gets flagged as AI, we performed a test using two AI-powered writing tools: Grammarly and the Copyleaks Writing Assistant.

1,000 random files from a public essay dataset of English-language text were collected for the test. The dataset is designed to be English only and does not contain AI. The essays were then edited using Copyleaks Writing Assistant and Grammarly. Here are the findings:

One thousand human-created files were run through the Copyleaks Writing Assistant, with each one averaging around 35% of changes made. These updated files were scanned through the Copyleaks AI Detector. All 1000 came back as human content.” (For more information, see the Copyleaks blog article “Do Writing Assistants Like Grammarly Get Flagged As AI?”)

 

However, when functions for improving sentence structure were applied, the detector identified 31.6% of the texts as AI-generated. Copyleaks lists the following as the main reasons for false positives: “While the Copyleaks AI Detector has a false positive rate of .2%, there is always the possibility of human-authored text being flagged as AI.  This can occur for several reasons:  The content was put through a writing assistant tool using genAI-powered features like GrammarlyGo, which will likely get flagged as AI.  The content was altered by a text spinner or a similar tool.  AI was used to create an outline or template.”
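It helps to translate that rate into absolute numbers. A quick back-of-the-envelope calculation of our own, using the 0.2% figure quoted above:

```python
# Back-of-the-envelope: what a 0.2% false positive rate means at scale.
human_texts = 10_000          # texts genuinely written by humans
false_positive_rate = 0.002   # the 0.2% rate Copyleaks quotes

expected_false_flags = human_texts * false_positive_rate
print(expected_false_flags)   # 20.0 -> about 20 human texts wrongly flagged
```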

 

Copyleaks has also recently published a blog post clarifying these issues: “How Does AI Detection Work?”. Detailed explanations can also be found in its PDF of the most frequently asked questions about the Copyleaks tool.

 

Originality.ai

 

The feedback from Originality.ai is similar and refers to its comprehensive Help Center post. In addition, the provider emphasizes that a score of 40% does not mean that AI generated 40% of the text:

 

“Our AI detector provides a probability that a piece of content was AI or Original (human-generated). It returns a confidence score.

60% Original and 40% AI means the model thinks the content is Original (human-written) and is 60% confident in its prediction.” (Source: Originality.ai, “AI Detection Score Meaning”)

 

So, the tool rates the probability that some type of AI was employed during the generation of the text.  That can also mean that the AI was merely used as a content planning tool or for spell-checking (Source: Originality.ai, “Most Common Reasons for False Positives With Originality”).  In one post about false positives, Originality.ai even goes as far as to say:  “When any amount of AI touches the content, it can cause the entire article to be flagged as AI.” (Source: Originality.ai, “AI Content Detector False Positives – Accused Of Using Chat GPT Or Other AI?”).

 


 


Why is Detecting AI Use in Text Generation so Important?

Many clients place great importance on receiving texts that are not AI-generated – with good reason. For one, text written by a real author is generally of higher quality and greater depth. Accuracy of content also matters, particularly with so-called YMYL topics (short for “Your Money Your Life”). With health-related, legal, or financial topics, the information has to be absolutely reliable and accurate. The subject-matter expertise and thorough research a human author can provide far exceed the capabilities of an AI.

 

And there is a concern that AI-generated text will rank lower in search engines. For these reasons, we at Textbroker give you a choice: when drafting your briefings, you can specify whether and to what extent you want to permit the use of artificial intelligence during text creation.

However, one important question remains: does Google even care whether a text was written by a human or generated by an AI? Google itself states that content quality is its highest priority:

 

“Our focus on the quality of content, rather than how content is produced, is a useful guide that has helped us deliver reliable, high quality results to users for years.” (Source:  Google Search’s guidance about AI-generated content)

 

The Internet giant emphasizes that the focus is on user-oriented content.  What matters is that your content provides added value to the users!

 

This brings up another question: can Google reliably identify AI content? Worth mentioning in this context is Google’s E-E-A-T update, in which the search engine operator expanded its guidelines for quality raters. E-E-A-T is short for “Experience”, “Expertise”, “Authoritativeness”, and “Trustworthiness”. Which brings us back to the added value mentioned above, or what Google calls “helpful content”: users should be able to rely on content being trustworthy, factually accurate, and unique, as well as being provided by a person with experience and expertise. Google ranks content that fulfills these criteria higher. This mainly concerns text written by real authors – or at least thoroughly and conscientiously edited by them.


In Conclusion: Our Suggestions for Handling False Positives

It is understandable that clients want text generated entirely without AI assistance, and we would like to ensure exactly that. For this reason, we check all texts with our very own detector tool. Whenever our AI checker flags a supposedly “suspicious” text, we conduct an additional manual check.

 

In doing so, we rely on our experience, our competent editorial team, and the fact that we have become very familiar with our authors’ writing styles over the years. When it comes to correctly assessing texts, we do not rely on an AI detector alone, but on a combination of several rating criteria and the know-how of experienced employees.
 
Nevertheless, it can of course happen that an external AI detector suspects the use of AI when you check your texts. As previously mentioned, suspected AI use does not automatically mean that the text was created by or with the assistance of artificial intelligence. If you explicitly require text that scores 0% AI in a tool of your choice, you can make corresponding arrangements with an author within the framework of a DirectOrder. And of course, we are available if you need assistance or have further questions on the subject.

For you as a client, this means: it is good that such tools exist – they can assist you and provide pointers. Yet you should always take their results with a grain of salt and check for possible causes of false positives:

 

  • Is the text even long enough for meaningful results?
  • Have you selected the right tool for your language?
  • Does the text require certain content, structures, and phrasing which could skew the results?
  • Might the author merely have applied spell- or grammar checkers?

 

If you take these aspects into consideration and question the results of AI detectors with some good common sense, you will certainly arrive at a sound assessment. And if you don’t? Then we will be here for you – with our knowledge, our experience, and our trusted freelancer community.

 

 

What has been your experience with AI detectors?  Share it with us in the comments section!

 
 

