Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content
... A group of researchers backed by the French government has released what is thought to be the largest AI training dataset composed entirely of text in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification to a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built differently from the AI industry’s contentious norm. ...
Today, Fairly Trained announced it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal tech consultancy startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.
The company’s cofounder, Jillian Bommarito, says the decision to train KL3M in this way stemmed from the company’s “risk-averse” clients like law firms. ...
Although the dataset is tiny (around 350 billion tokens, or units of data) compared to those compiled by OpenAI and others that have scraped the internet en masse, Bommarito says the KL3M model performed far better than expected, something she attributes to how carefully the data had been vetted beforehand. ...
On Wednesday, researchers released what they claim is the largest available AI dataset for language models composed purely of public domain content. Common Corpus, as it is called, is a collection of text roughly the same size as the data used to train OpenAI’s GPT-3 text generation model and has been posted to the open source AI platform Hugging Face. ...
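Since the story notes that Common Corpus has been posted to Hugging Face, here is a minimal sketch of how one might stream a few records with the `datasets` library. The dataset identifier and record fields below are assumptions for illustration, not details from the article; check the actual Hugging Face listing before relying on them.

```python
# A minimal sketch, assuming the corpus is published on Hugging Face under
# an identifier like "PleIAs/common_corpus" (an assumption, not confirmed
# by the article). Requires: pip install datasets
from datasets import load_dataset

# Stream instead of downloading: a GPT-3-scale corpus is far too large
# to pull down casually.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at a few records to see what fields each document carries.
for record in corpus.take(3):
    print(record)
```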
The Authors Guild, along with the actors’ and radio artists’ labor union SAG-AFTRA and a few other professional groups, was recently named an official supporter of Fairly Trained. ...
See the full story here: https://www.wired.com/story/proof-you-can-train-ai-without-slurping-copyrighted-content/