I have been writing and reading an awful lot about AI since GPT-4 came out in March 2023. There are so many moving parts now, and the story is changing weekly. One week we are convinced that ChatGPT is going to dominate the market, which could be a logical consequence of OpenAI having racked up $540M in losses in 2022 alone. Then Google, which seemed hopelessly behind, announced open access to Bard, which is based on Google’s LaMDA (Language Model for Dialogue Applications) and can do things that ChatGPT cannot, such as look up new information on the Internet. And there is a sprawling industry of open-source platforms (e.g., huggingface.co), new apps are being launched every week (often based on older versions of GPT), and there are many people like me who write lengthy diatribes about the dangers and failed promises of AI.
AI is today’s P1 topic, a topic that dominates public conversations for a while, such as 9/11 for many years after 2001, followed by the financial crisis of 2008, then Donald Trump from 2016 to 2020 and Covid from 2020 to 2021[i].
My fear of AI or AGI is still rather muted, and I have probably said everything that I can about regulation. My ongoing concern is that AI will make us collectively dumber and the world more boring.
AI Needs Human-Generated Content To Learn and Function
“Good artists copy, great artists steal”, Picasso is supposed to have said. These are the sources of LLaMA’s 1.4T token pre-training data[ii], Meta’s foundational large language model:
English CommonCrawl (67%): Removed non-English text and duplicated content. Only includes pages used as references in Wikipedia.
C4 (15%): A cleaned version of CommonCrawl. The same filters were applied.
GitHub (4.5%): Public GitHub dataset available on Google BigQuery.
Wikipedia (4.5%): From June-August 2022 period covering 20 languages.
Gutenberg and Books3 (4.5%): Both are book datasets.
ArXiv (2.5%): Scientific data.
StackExchange (2%): High-quality Q&As covering science and engineering topics.
The entire content was created by humans (with some machine input), many of them working with enormous dedication and often without getting paid. According to Wikipedia, there are currently 45,548,689 Wikipedia accounts, of which 125,862 have made at least one edit during the last month. Synthetic data generation to improve model learning is the latest big trend, but that data must still be derived from real-world data.
When it comes to copying and stealing, some LLMs are rumored to feed off each other now. Google denies this vehemently, but the rumors persist.
I have consulted Wikipedia over 150 times in the last 3 months, on topics ranging from AI to the Yamnaya culture, our violent ancestors from the Copper and Bronze Ages who migrated from the Caspian steppe to Western Europe from 3,300 to 2,600 BC. Why would people keep contributing to Wikipedia when nobody reads it anymore, because it’s more convenient to ask ChatGPT or Bard a question in natural language? Even proper citations, whether introduced voluntarily[iii] or by (European) law, will not keep people motivated to create original writing.
AI-Generated Content Will Be Increasingly Banal And People Will Like It
ChatGPT or DALL-E, the other AI system that can create realistic images and art from a description in natural language, are finding widespread use in novel ways:
In April, a couple from San Diego used ChatGPT to write their wedding vows
Screenwriters in L.A. are on strike to demand that AI bots not be used to replace them (among other items on the list)
AI is already moving from supporting journalists to replacing them entirely. By remixing information from across the internet, generative models are “messing with the fundamental unit of journalism”: the article. Instead of a single first draft of history, Mr Caswell says, the news may become “a sort of ‘soup’ of language that is experienced differently by different people”.
Scientists have listed ChatGPT as a co-author of publications, raising the question of where AI (LLMs) begins and where it ends
Sen. Richard Blumenthal delivered an AI-generated speech this week (as a warning about unregulated AI)
Muzak[iv] – functional background music that we know from elevators or retail stores – accounts for 7-10% of streaming revenue (this includes whale sounds that people use for sleep) and is increasingly generated by AI via copying of existing music. I wouldn’t be able to tell the difference between real and AI-generated muzak and it’s not my favorite genre, but it shows the broader trend.
Real music gets remixed with AI-generated voices of real artists and attracts millions of views on TikTok and some lawsuits
DALL-E and Stable Diffusion are used for image generation that often feeds off artists’ work in violation of their copyrights, and people think it’s jaw-dropping art.
You can probably see what I am driving at with my cultural pessimism: we will turn into readers of word soups, listening to machine-composed muzak and admiring cheesy AI-generated pictures while mistaking everything for high-class art. Our taste for creative works will gradually become as degenerated as our taste for food, now that generations have been raised on processed food and soda drinks.
People May Be Even Lonelier With AI Than They Already Are With Social Media
According to a recent study[v] with 1,649 participating adults from Norway, United Kingdom, USA, and Australia between November 2021 and January 2022, 30-50% of respondents reported loneliness and the number correlates with the use of social media.
AI, and especially chatbots like ChatGPT, may become addictive for people who crave social interaction (we all do), just as Instagram has transformed many relationships into a series of photo ops.
A journalist from the New York Times recently experimented with handing her mailbox to ChatGPT for a week. By the end of the experiment, the friendly but strangely impersonal tone of her emails made a colleague think she was going to murder him in his sleep.
AI Can Only Provide An Illusion Of Knowledge
As I wrote, the pocket calculator did not improve my math skills. Google Maps has not improved our knowledge of geography (which countries does Ukraine border on?). Our sense of orientation has been degraded by the use of GPS navigation, according to multiple studies.
The ability to retrieve AI-generated knowledge instantly via a mobile phone (the new battle stage for AI) will make everyone look smart, but it will deprive us of the ambition to dig a little deeper and gain true understanding.
Summarization is the most common use case for ChatGPT, and even Sam Altman (CEO of OpenAI) uses it almost exclusively for this purpose. According to an interview, his favorite use case for AI would be a copilot that works through his email, text messages, to-dos and calendar and takes care of his everyday duties.
There Is Always Hope
As much as humans crave social interaction, they value originality, authenticity, and provenance. We attribute value to a work – art, writing, furniture, cars – because of who created it, under what circumstances, when and where.
People will pay big money for a photograph that they could copy for free, if it’s an original print by the artist which makes them feel connected to the artist. A car that was owned by Steve McQueen or a watch that was owned by Paul Newman will sell for many multiples of identical artefacts, and it’s not just about bragging rights.
Scientific discoveries and truly original thought will continue to be the domain of humans. And the consumption of fast food or processed meat is declining - even here in the US - thanks to our desire for authenticity and the FDA.
[i] Felix Salmon, “The Phoenix Economy: Work, Life, and Money in the New Not Normal”, 2023
[ii] “A brief history of LLaMA models”, April 30, 2023, by Andrew (Sagio Development LLC)
[iii] Perplexity.ai, which is based on GPT-3.5, does a pretty good job at managing citations
[iv] A registered brand, owned by UMG, for background music that we generally refer to as muzak in the US
[v] Health Psychol Behav Med. 2023; “Associations between social media use and loneliness in a cross-national population: do motives for social media use matter?”, Tore Bonsaksen, Mary Ruffolo, Daicia Price, Janni Leung, Hilde Thygesen, Gary Lamph, Isaac Kabelenga, and Amy Østertun Geirdal