Summary: Essays in English written by people from China were branded by text-analysis tools as being generated by artificial intelligence 61 per cent of the time

Tools designed to detect whether a body of English text was written by a human or by artificial intelligence show a bias against people whose first language isn’t English, frequently misidentifying their work as being created by an AI.

Text-generating AI models such as OpenAI’s ChatGPT and GPT-4 are being used by some students at schools and universities to create essays that they pass off as their own work. In response, many tools have been designed to spot patterns in text that reveal the work of AI.

James Zou at Stanford University in California and his colleagues have now tested whether the AI-detection tools worked equally well for all students. They did this by feeding a selection of essays to seven of the most popular AI-text detectors. “This is a pretty important problem,” says Zou.

In all, 161 essays, written in English, were given to each AI-detection service. Of those, 91 were written by people with English as a second language, obtained from a Chinese educational forum, and 70 were US college admission essays, mainly obtained from PrepScholar, an admissions preparation website.

Only about 5 per cent of the US-based essays were flagged as being written by AI, while, on average, 61 per cent of the Chinese essays were.

One reason for this may be a lack of “perplexity” in the language of the essays from China. Perplexity is a probabilistic measure of how varied the word choice is in a sample of writing, and one that detection tools often use to decide whether something is computer-generated. “If perplexity is high, that’s more likely to be human according to these detection algorithms,” says Zou.

This measure puts non-native speakers at a disadvantage, he says. “Very reasonably, they tend to use more common words. That’s why their texts are misclassified.”
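
To see why common word choices translate into low perplexity scores, here is a minimal sketch of how such a measure might be computed, assuming the Hugging Face transformers library and GPT-2 as the scoring model; the detectors in the study use their own, undisclosed models, so this is purely illustrative.

```python
# Minimal sketch: estimating the perplexity of a passage with GPT-2.
# Assumes the transformers and torch packages are installed; the detectors
# in the study rely on their own scoring models, so this is illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the same tokens as labels makes the model return the mean
        # negative log-likelihood of the text under its own predictions.
        out = model(enc.input_ids, labels=enc.input_ids)
    # Exponentiating the average loss gives the perplexity.
    return torch.exp(out.loss).item()

# Common, predictable wording tends to score lower than rarer,
# more idiosyncratic phrasing.
print(perplexity("The results of the study are very important."))
print(perplexity("The study's upshot lands with an unruly thud."))
```

A detector built on this idea would treat the lower-scoring passage as the one more likely to be machine-generated.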

Besart Kunushevci, CEO of Crossplag, which created one of the plagiarism checkers tested, says: “We acknowledge that our publicly available model has certain limitations.” However, the company is developing a new, enhanced model that minimises false positives, he says. “The model that researchers evaluated was based on our previous public version, which indeed had its constraints.”

Edward Tian, who developed GPTZero, another of the tools evaluated, recognises the language problem, saying his app is predominantly trained on English prose written by native speakers. “Notably, our detector has one of the lowest false positives,” he says, adding that his R&D team is already working on incorporating non-native English data and different languages.

Jon Gillham at Originality.AI says the study assessed a prior version of the company’s plagiarism checker. “Our model 1.4 improved both AI detection and reduced false positives,” he says.

A spokesperson for ZeroGPT, which was also tested, says it has no bias against non-English writing, but its model was trained on content from the internet, which is predominantly written in English. “It is a question of data availability for each language, as well as data accuracy,” they say.

OpenAI, Quill.org and Sapling didn’t respond to a request to comment on the paper’s findings about their tools.

The findings might not be as straightforward as they appear, though, because ghostwriters often produce essays to order for international students, says Thomas Lancaster at Imperial College London. “We don’t have any clear evidence that the [Chinese] essays sampled were written by somebody who has English as their second language,” he says.

Lancaster also says that international students who feel less confident in their written English might use grammar-checking tools, which could result in more false positives in AI checkers. “This will naturally see their essays being shaped towards a more standardised approach,” he says.

Zou and his colleagues did find one way to reduce the number of false positives. When the Chinese-origin essays were put into GPT-4 and the AI was asked to change word choices to make the essays sound more like they were written by a native English speaker, the likelihood of them raising concerns among AI-detection tools dropped to just over 10 per cent. “It’s pretty ironic,” says Zou.
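
The rewriting step the researchers describe could be reproduced with a short script against the OpenAI API. The sketch below assumes the official openai Python client (v1+), and the prompt wording is an illustrative guess rather than the exact instruction used in the study.

```python
# Illustrative sketch of the rewriting step described above, using the
# official openai Python client. The prompt wording is an assumption,
# not necessarily the exact instruction used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

essay = "..."  # placeholder: an essay originally written by a non-native English speaker

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Enhance the word choices in the following essay so it "
                   "sounds more like it was written by a native English "
                   "speaker, keeping the meaning unchanged:\n\n" + essay,
    }],
)

rewritten = response.choices[0].message.content
print(rewritten)
```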

Reference: arXiv, DOI: 10.48550/arXiv.2304.02819