Wikipedia is a case study in human-machine interaction. Will AI affect it in any way?
By Irving Wladawsky-Berger
Wikipedia has been doing pretty well so far. Despite various ups and downs over the past 20 years, the English version of the online encyclopedia, the largest and most popular, has more than 6.7 million articles and a base of more than 118,000 active editors worldwide. It serves as the default information source for people, search engines, and digital assistants.
Will the emergence of generative AI change all that? In my opinion, no.
In early 2021, a Wikipedia editor saw the emergence of GPT-3, a forerunner of the new chatbots from OpenAI, as something resembling a funnel cloud on the horizon, writes author and journalist Jon Gertner in “Wikipedia’s Moment of Truth,” a recent NY Times Magazine piece. After trying out the new technology, this editor, a well-known Wikipedian who goes by the user name Barkeep49, realized it was unreliable. Into otherwise accurate and cogent responses, the bot would readily weave fabricated details (a bogus identity, a false academic citation). Nevertheless, he was confident in its potential.
Wikipedia has long been my go-to resource for researching subjects I’m interested in learning more about, and I frequently cite Wikipedia pages in the weekly blogs I’ve been writing since 2005.
More Than an Encyclopedia
Wikipedia has developed into much more than just an encyclopedia in the last ten years, serving as “a kind of factual netting that holds the entire digital world together.” Wikipedia frequently provides the data used by Google, Bing, Siri, Alexa, and other search engines and digital assistants to respond to queries from users.
With a current estimate of between 3 and 5 percent, Wikipedia has emerged as one of the major data sources for training Large Language Models (LLMs) and related chatbots. Wikipedia has had a significant impact on the digital world because of its abundance of high-quality, free, and well-curated data.
After using GPT-3 in 2021, Barkeep49 penned an essay titled “Death of Wikipedia,” in which he considered how Wikipedia might lose its top spot among English-language encyclopedias. “For some other encyclopedia to displace Wikipedia, it would have to match some of the advantages we have accrued over time. In particular, despite the fact that we have millions of articles, the ones that people care about the most are updated rather quickly.”
He went on to say that it was improbable that a replacement encyclopedia would share Wikipedia’s dedication to openness, pursuit of non-profit goals, and unrestricted use of its content under its highly permissive license. “Those values have benefited our reputation and generated enormous value for readers around the world, and I believe that if an encyclopedia didn’t share those values, the world would be a worse place.”
Is the End Near?
“I don’t believe it’s conceivable that humans will overtake us. Artificial intelligence, however, might. As I write this in January 2021, AI is already capable of producing some material rather competently, given its present exponential growth,” wrote Barkeep49. While hopeful in the near term, he feared that in the long run AI might replace Wikipedia’s human editors, just as Wikipedia had replaced Britannica.
According to the NYT Magazine piece, “Within the Wikipedia community, there is a cautious feeling of hope that AI, if handled properly, would help the organization improve rather than crash. But even if the editors triumphed in the short run, one had to wonder if the robots wouldn’t triumph in the end.”
Since Wikipedia’s founding in January 2001, there have been predictions that it will eventually come to an end. In advance of Wikipedia’s 20th anniversary, professor Joseph Reagle of Northeastern University wrote “The many (reported) deaths of Wikipedia,” a historical essay that examined how Wikipedia had been regularly expected to die over the previous two decades and how it had managed to adapt and persist.
Reagle observed that in Wikipedia’s early years, both its founders and its detractors exemplified three different future-oriented viewpoints. To start, they studied other encyclopedia projects to get a sense of what was doable, and found that even well-funded, well-established efforts, like Microsoft’s Encarta, had failed to build a viable online encyclopedia. Others assumed that Wikipedia’s extremely challenging first six months would set the pattern for the years that followed.
The one model no one anticipated, according to Reagle, was the exponential growth that characterized Wikipedia article creation until around 2007. Wikipedians set a goal of 100,000 entries, which would have been more than most print encyclopedias, and calculated that at 1,000 articles per month they could reach that goal in seven years.
As it turned out, the English Wikipedia attained two million entries in September 2007 – 20 times the initial projection.
By 2009, it was evident that the growth of new articles on the English Wikipedia had slowed or plateaued, and that the balance of activity was shifting toward seasoned editors rather than continuing to draw in new ones. From a peak of 53,000 editors in 2007, only about 30,000 editors were active in 2014. A 2015 NYT editorial titled “Can Wikipedia Survive?” questioned whether Wikipedia’s problems were being exacerbated by the explosive growth of smartphones, on which editing Wikipedia’s entries is more difficult than on desktops.
“But it seems that the number of editors who are actively editing has remained constant since 2014, never falling below 29,000, and that this pattern of rapid expansion and plateau is common for wikis,” noted Reagle.
He continued, “The platform and community have momentum that no alternative will substitute. The only forecast I’d dare for the next ten years is that Wikipedia will still be around. And by then, the Wikimedia Endowment, which was founded in 2016, ought to have surpassed its objective of raising $100 million to keep its initiatives running ‘in perpetuity.’ Undoubtedly, the English Wikipedia community will face difficulties and crises, but I don’t see anything particularly severe that would leave it reduced to a skeleton of static entries.”
As of September 2023, the English Wikipedia has over 6.7 million articles and over 118,000 active editors (editors who have made at least one edit in the last 30 days).
The Role of Humans
The NYT Magazine piece by Gertner also notes that “Wikipedia currently has versions in 334 languages and a total of more than 61 million articles. Even though it consistently ranks among the top ten most-visited websites in the world (with Google, YouTube, and Facebook typically at the top), it is the only member of that elite group to reject the profit motive.”
The fact that Wikipedia’s knowledge was developed by people, however, is its most important contribution to generative AI.
The new AI chatbots have typically ingested Wikipedia’s corpus, according to Gertner. “Wikipedia data and Wikipedia text, knowledge that has been compiled over years of painstaking work by human contributors, are embedded deep within their responses to queries.” He added that one conclusion from a conference call with several Wikipedia editors was obvious enough: they want a world in which information is produced by humans. But has that already been abandoned?
Ensuring that generative AI systems are trained on content meticulously developed by people is more than just anti-AI human idealism. It turns out that AI systems will invariably fail without human-generated training data. This phenomenon was described in detail, and given the label “model collapse,” in a research paper released in May 2023.
I found a more straightforward explanation of model collapse in the recent TechTarget article “Model collapse explained: How synthetic training data breaks AI”:
“Model collapse happens when new generative models degrade over time from training on AI-generated material. In this scenario, models begin to forget the true underlying data distribution, even though that distribution has not changed. This means the models start to lose knowledge of the less prevalent but nonetheless significant features of the data. As AI model generations advance, the outputs that the models produce become less and less varied.
“Generative AI models must be trained on human-produced data. When new models are trained on model-generated content, they develop irreversible defects: their outputs grow more uniform and ‘wrong.’ Model collapse was found to be unavoidable even under ideal learning conditions.
“Model collapse matters because generative AI has the potential to fundamentally alter the nature of digital content. Online messages are increasingly being created entirely or in part by AI tools. In general, this phenomenon has the potential to greatly increase data pollution.”
Even though it is easier than ever to produce massive amounts of text, the model-collapse result suggests that little of this data will be useful for training the next generation of AI models.
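The degradation described above can be illustrated with a toy experiment (a minimal sketch of the general idea, not the setup used in the research paper): fit a crude one-Gaussian “model” to data drawn from a distribution that has a rare but significant second mode, sample synthetic data from the fitted model, and repeat across generations. The rare mode, the “less prevalent but nonetheless significant feature,” vanishes almost immediately, and later generations trained only on synthetic data never recover it.

```python
import random
import statistics

random.seed(0)

def sample_real(n):
    # "Real" human-like data: a common mode near 0, plus a rare (5%)
    # but significant second mode near 8.
    return [random.gauss(0, 1) if random.random() < 0.95 else random.gauss(8, 1)
            for _ in range(n)]

def train_and_generate(data, n):
    # A toy "model": fit a single Gaussian to the training data,
    # then generate n synthetic samples from it. The single Gaussian
    # cannot represent the bimodal shape, so detail is lost.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

data = sample_real(10_000)
print(f"real data: tail fraction (x > 5) = "
      f"{sum(1 for x in data if x > 5) / len(data):.3f}")

# Each generation trains only on the previous generation's output.
for gen in range(1, 6):
    data = train_and_generate(data, 10_000)
    tail = sum(1 for x in data if x > 5) / len(data)
    print(f"generation {gen}: tail fraction (x > 5) = {tail:.3f}")
```

Running the sketch shows the fraction of samples in the rare mode dropping sharply after the first synthetic generation and staying depressed thereafter: the models have “forgotten” a real feature of the data even though the original distribution never changed.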