A Microsoft-owned tool powered by artificial intelligence is designed to make life easier for programmers, but some developers say it may be repurposing some of the billions of lines of code it was trained on without permission.
The tool, called CoPilot, was released by GitHub, a Microsoft subsidiary that is used by millions of people to share source code and organise software projects. CoPilot uses powerful neural network tools developed by OpenAI to solve programming problems by scouring vast numbers of examples of existing solutions, both from GitHub and elsewhere, and learning how to create similar solutions. It then suggests code based on a human programmer’s work in progress or English descriptions of the functionality needed.
But sometimes CoPilot may directly plagiarise its training data, says Armin Ronacher at software company Sentry. He has found that it is possible to prompt CoPilot to suggest copyrighted code from the 1999 computer game Quake III Arena, complete with comments from the original programmer. This code cannot be reused without permission.
“You can definitely make it recite code that is almost entirely in the training set where there’s no originality happening,” says Ronacher, although he says it should be possible to adjust CoPilot so that it warns the user if code being output is very close to original work.
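The routine at the centre of these demonstrations is Quake III Arena's famous "fast inverse square root", which estimates 1/√x using a bit-level trick and a magic constant. The sketch below is a from-scratch re-implementation of that technique for illustration, not the original id Software code or Copilot's output:

```c
#include <stdint.h>
#include <string.h>

/* A re-implementation of the "fast inverse square root" technique
   popularised by Quake III Arena. It reinterprets the float's bits
   as an integer to get a cheap first guess at 1/sqrt(x), then
   refines it with one Newton-Raphson step. */
float fast_rsqrt(float x) {
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);           /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);          /* magic-constant initial guess */
    memcpy(&y, &i, sizeof y);           /* back to a float */

    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson refinement */
    return y;
}
```

Calling `fast_rsqrt(4.0f)` returns a value close to 0.5; with a single refinement step the result is typically within a fraction of a per cent of the true answer.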
A harder problem to solve is that many of the software projects CoPilot was trained on are released under free software licences, such as the General Public License (GPL), which permit derivative works only if those works are released under the same free terms. This doesn’t stop the code being used commercially – “free” in this context means that people are free to modify it – but it does mean that people using CoPilot could in theory be required to release the source code they create for others to use. The CoPilot website, however, says that copyright for the code it generates belongs to the programmer using it.
Some developers have already taken action to protect their work. Adrian Bowyer at RepRap, an open source 3D printer project, says CoPilot is a great idea but anything it creates must itself be open source. He has altered the wording of the licence under which he releases software: “If any part of RepRap covered by the GPL is used to train any AI, then all the products of that AI must be released under the GPL as free software.”
CoPilot may also raise security issues. One developer found that it suggested private API keys – secret credentials, unique to a particular user, that grant access to features of another piece of software or website. It is good practice to strip these keys from open source code before publishing it, but this step is sometimes overlooked.
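The good practice the paragraph describes is to keep credentials out of source files altogether, typically by reading them from the environment at runtime. The sketch below illustrates this; `EXAMPLE_API_KEY` is an invented variable name, not a real service's:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: load a secret from the environment at runtime
   instead of hardcoding it in source code that may be published.
   "EXAMPLE_API_KEY" is an invented environment-variable name. */
const char *get_api_key(void) {
    const char *key = getenv("EXAMPLE_API_KEY");
    if (key == NULL) {
        fprintf(stderr, "EXAMPLE_API_KEY is not set\n");
    }
    return key;
}
```

Because the key only ever exists in the running process's environment, nothing secret is committed to the repository for a tool like CoPilot to ingest.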
The tool’s website says that about 0.1 per cent of CoPilot’s code suggestions may contain “some snippets” of verbatim source code from the training set. The company also warns that CoPilot can output genuine personal data, such as phone numbers, email addresses or names, and that it may produce “biased, discriminatory, abusive, or offensive outputs” or code containing security flaws. It says that suggested code should be vetted and tested before use.
Neil Brown at UK law firm decoded.legal says that GitHub has made a “bald assertion” that it can analyse code in the way CoPilot does under the “fair use” defence to copyright infringement in the US, but that the position in the UK is less clear.
Yin Harn Lee at the University of Bristol, UK, agrees that CoPilot could be permissible under US copyright law, but says it is a grey area in the UK that needs to be tested in court. Fair use in the US is a broad, loosely defined and altogether more unpredictable defence, whereas UK copyright law tends to set out much more precisely what is and isn’t acceptable.
“I’m very keen for it to be tested, because I want to know,” says Harn Lee.
Such a legal challenge may be difficult, says Brown. “It wouldn’t surprise me if someone was willing to challenge them over it, but I doubt it would be someone with comparable resources to GitHub.”
Microsoft and OpenAI didn’t respond to requests for comment, while GitHub declined to comment.