
Published November 22, 2023. Updated November 23, 2023

Exploring the effect and implications of generative AI in web content creation

The frontiers of artificial intelligence: from theory to practice in digital marketing and SEO

Generative AI in Web Content Creation

In many articles about generative AI, the question most often asked concerns the limits of AI.

This article will not go into the technological limits of AI, which remain unknown to date, but rather the possible limits of its use on the web and in content creation.

Content is ROI

The expression "Content is king" has been around for a long time in the world of digital marketing.

It means that content is the central pillar of a successful marketing and sales strategy. Without engaging content, all other components of the strategy will have less effect. This underscores the importance of producing quality content for a successful online presence.

Here are the main aspects of this notion:

  • Content adds value: useful, practical, entertaining or inspiring content attracts attention and builds loyalty. It's a key success factor.
  • Content helps SEO: rich, optimized content improves a website's SEO and visibility in search results.
  • Content feeds a content strategy: blog articles, product sheets, white papers, guides, webinars, podcasts... the possible formats are many.
  • Content affects every stage of the conversion funnel: it informs during the discovery phase, reassures during the evaluation phase, weighs in during the decision phase, guides after the purchase...
  • Content creates engagement: it allows you to interact with your audience, build long-term loyalty and develop your brand.
  • Content provides expertise: it positions a brand or company as a reference in its sector.

The Large Language Model (LLM) used to create content

Since 2022, ChatGPT has democratized the use of generative AI for text production. It quickly reached 100 million users. Its arrival was seen as a godsend for digital marketing, as content production was a bottleneck and a major cost for deploying online strategies.

Those in need of mass-produced content, and who called on the services of offshore copywriters, turned en masse to the multitude of automated copywriting tools that have sprung up. The best-known are Jasper, WriteSonic, WordHero, Rytr, etc.

AI-generated content is not penalized by Google

This practice is also encouraged by Google, which has decided not to penalize AI-generated texts. The Mountain View firm even wrote an article as early as February 8, 2023 saying:

"At Google, we've long believed in the power of AI to transform the ability to deliver useful information. In this article, we explain in more detail how AI-generated content fits in with our long-standing approach to delivering useful content to users in Google search.

Rewarding high-quality content, however it's produced" Source

A basic understanding of how these authoring tools work

Authoring tools such as those mentioned above or ChatGPT are specific, customized LLM applications. ChatGPT is optimized for conversational interactions, with capabilities and guidelines that make it suitable for a wide range of interactive applications. While a general LLM can be used for a variety of language processing tasks, ChatGPT is specifically honed to understand, participate in and maintain conversations with users.

Here are the steps involved in obtaining auto-generated text:

1. Dataset collection and preparation:

  • Collection: Text data is collected from various sources such as books, articles, websites and other written media.
  • Cleaning: Data is cleaned to remove irrelevant or inappropriate elements (e.g. racist content).
  • Formatting: Texts are formatted so that they can be understood by the model.

2. Model training:

  • Machine learning: LLMs are a type of neural network trained by machine learning, which lets the model extract statistical patterns from data rather than follow hand-written rules.
  • Data processing: The model processes textual data, learning linguistic structures, vocabulary, writing styles, etc.
  • Optimization: The model adjusts its internal parameters to minimize errors and improve its ability to predict or generate text.

3. Language understanding and analysis:

  • Query analysis: When a request is made to the LLM (for example, a question or a text generation query), it analyzes and interprets the request using the knowledge acquired during training.
  • Contextualization: The model takes into account the context of the request to provide an appropriate response.

4. Text generation :

  • Word prediction: The LLM generates a response by predicting the sequence of words that best meet the request, based on the patterns learned.
  • Sentence assembly: It assembles words into coherent, grammatically correct sentences, taking into account context and language structure.

5. Optimization and revision:

  • Adjustments: The model can adjust its answer according to feedback or additional corrections to improve accuracy or relevance.
  • Finalization: The generated response is finalized and presented to the user.
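The steps above can be sketched in drastically simplified form. The toy below is not an LLM but a bigram model: it "trains" by counting which word follows which in a tiny stand-in dataset, then "generates" by repeatedly predicting a next word from those counts. All names and the mini-corpus are illustrative inventions.

```python
import random
from collections import defaultdict, Counter

# Toy corpus standing in for the cleaned, formatted dataset (step 1).
corpus = (
    "content is king . quality content attracts readers . "
    "content helps seo . quality content builds loyalty ."
).split()

# "Training" (step 2): count which word follows which -- a bigram model,
# a drastic simplification of an LLM's learned parameters.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(prompt, n_words=5, seed=0):
    """Steps 3-4: start from the prompt and repeatedly predict
    the next word from the learned bigram frequencies."""
    rng = random.Random(seed)
    words = [prompt]
    for _ in range(n_words):
        candidates = bigrams.get(words[-1])
        if not candidates:
            break
        # Sample proportionally to observed frequency (word prediction).
        choices, weights = zip(*candidates.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("content"))
```

Note that the toy can only ever emit words it saw during training, which already hints at the dataset-diversity question discussed below.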

This is where the importance of the initial data - the so-called dataset - becomes apparent. Datasets serve as the knowledge base from which the LLM learns. The larger and more diversified the dataset, the more varieties of language and style the model can learn.

The quality of the dataset directly affects the accuracy of the model. A good dataset will produce more precise and relevant answers.

Will the snake bite its own tail?

This brings us to the question posed in this article. Before their democratization, generative AIs had datasets fed by almost exclusively human production.

Over the past year, the Internet has been flooded with texts, images and even videos produced by AIs. And not just the web: books, student dissertations, reports, press articles and more.

It's also reasonable to assume that humans will increasingly resort to AIs when they need to generate texts, and that human creations will therefore shrink both in number and in proportion. Increasingly, datasets will be fed by data that AIs themselves produced from their own earlier training data. This would lead to a form of inbreeding.

If we pursue the genetic analogy, this form of inbreeding could lead to an impoverishment of the content created, as the datasets become less and less diversified. The variety and richness of the generated language would then shrink.
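This impoverishment can be illustrated with a toy simulation (a hypothetical model, not a real LLM): a "model" that reproduces the distribution of its training corpus regenerates that corpus at each generation, and rare words that happen not to be sampled disappear forever. The vocabulary can only shrink.

```python
import random

rng = random.Random(42)

# Start from a "human" corpus with many distinct words (invented data).
corpus = [f"word{i}" for i in range(1000)]

# Each generation, a model trained on the corpus regenerates it by
# sampling from what it saw; words never sampled vanish for good.
diversity = []
for generation in range(10):
    diversity.append(len(set(corpus)))
    corpus = rng.choices(corpus, k=len(corpus))

print(diversity)  # distinct words per generation: never increases
```

This is a caricature of the "inbreeding" effect: since each generation can only contain words present in the previous one, diversity is monotonically non-increasing.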

This is where we see the importance of the different stages used to train and evaluate an AI. In an ideal case, the training data should be cleaned up beforehand, so as to train the model with varied and relevant data.

The other important factor is the evaluation of the LLM. To properly evaluate an AI model, test data must be varied and contain special cases. Only if the test database is correctly constructed will it be possible to see whether the model has been correctly trained. If not, the model may not perform well.

If an LLM is increasingly trained on AI-generated content, the variety of its training dataset may be impoverished. And if the model is trained for too long on such data, it risks memorizing it rather than generalizing to unseen data. This is known as "overfitting". Thanks to test data, this drop in performance can be detected and training stopped.

Is this a new phenomenon?

To understand how it might affect us in the future, it's useful to know whether engineers have already faced similar situations.

A recurring problem when training an artificial intelligence model is access to enough data. In many scenarios, the amount of data is too small, or the data is inaccessible for reasons of confidentiality, and so on. It is therefore common to have to think about how to increase the amount of data available to train models.

Some of the techniques employed are simplistic, such as duplicating the available data to obtain a larger quantity. Others are more complex, such as generating new data using GAN models.
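The simplistic technique, duplication, can be sketched in a few lines (invented toy data): an under-represented class is oversampled by copying until the classes balance. The point to notice is that no new information is created, only copies.

```python
import random

rng = random.Random(1)

# Imbalanced toy dataset: many "normal" examples, few "rare" ones.
normal = [f"normal_{i}" for i in range(95)]
rare = [f"rare_{i}" for i in range(5)]

# Naive augmentation: duplicate rare examples (sampling with
# replacement) until both classes are the same size.
augmented_rare = rng.choices(rare, k=len(normal))
balanced = normal + augmented_rare

# The augmented set is bigger, but its rare-class content is still
# drawn entirely from the original 5 examples.
print(len(balanced), len(set(augmented_rare)))
```

GANs, discussed next, are the more sophisticated alternative: instead of copying, they synthesize new examples that resemble the originals.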

What is a GAN?

GAN stands for Generative Adversarial Network. These networks operate in two parts:

  1. A generator, responsible for creating data based on the initial data available.
  2. A discriminator, responsible for distinguishing the initial data from the created data.

Its operation can be visualized as that of a counterfeit painter trying to fool an art expert.

Initially, the counterfeit painter will practice by imitating the original paintings to which he has access. Once he is able to paint correctly, he will stop practicing. The art expert will analyze both genuine and fake paintings. He will train himself to detect the fakes until he is able to detect most of them.

Once the forgeries are no longer confused with the originals, the painter goes back to work and practices painting better pictures until they are identical to the originals in the eyes of the expert. The expert in turn improves, and so on.

At the end of this process, the painter is able to create paintings very similar to the original data. This increases the amount of data available to train an AI model.
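The painter-and-expert loop can be sketched as a toy adversarial game. This is a hypothetical, drastically simplified stand-in for real GAN training (which uses neural networks and gradient descent): here the "painter" is just a number `mu` it shifts toward the real data, and the "expert" is a simple threshold rule.

```python
import random
import statistics

rng = random.Random(0)

# "Original paintings": real data drawn from a distribution the
# painter does not know (numbers centred around 5.0).
real = [rng.gauss(5.0, 1.0) for _ in range(200)]
real_mean = statistics.fmean(real)

mu = 0.0  # the painter's current style: where its fakes are centred

for round_ in range(50):
    fakes = [rng.gauss(mu, 1.0) for _ in range(200)]
    # The expert's rule: anything closer to the fake mean than to the
    # real mean is declared a fake.
    threshold = (statistics.fmean(fakes) + real_mean) / 2
    # Fakes landing above the threshold look real -- the expert is fooled.
    fooled = sum(f > threshold for f in fakes) / len(fakes)
    # The painter adjusts: the more fakes are caught, the harder it
    # moves its style toward the real data.
    mu += 0.2 * (1 - fooled) * (real_mean - mu)

print(round(mu, 2))  # ends close to the real mean
```

After enough rounds the painter's fakes are statistically close to the originals, which is exactly how GAN-generated data comes to be usable as extra training data.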

What does this have to do with our LLM and ChatGPT models?

We can draw a parallel between AIs trained with GAN-generated data and the phenomenon discussed earlier in the article. The data used to train future AIs will contain both data produced by humans and data generated by AI models.

On the face of it, you might think this poses no problem, since augmenting data this way is common practice in the scientific world. Yet there are consequences it is important to be aware of:

  1. First of all, AI productions are based on their training data. A concrete example is customs fraud. We have a data sample containing customs declarations, and we want to increase the number of fraudulent declarations to train our model. If we generate fraudulent declarations using GANs, the new data will contain the same "types and methods" of fraud as the original data. The AI will not invent new methods of customs fraud on its own, methods that might have been discovered by adding real data collected by customs services.

  2. A second consequence is the diversity of information contained in the data. Suppose we train an AI model on human-only data produced between 2010 and 2020, and assume the amount of data produced each year is similar, so that each year is represented equally.

    Now assume that in 2015, LLM models are created and trained on the data produced between 2010 and 2015. Thanks to these models, the amount of generated content explodes, and 70% of new content produced between 2015 and 2020 is AI-generated.

    The years 2010-2015 will then be over-represented compared to the years 2015-2020. In fact, the first half of the decade is represented by its human content plus all the AI content of the following years, since the AI was trained on that data. A new model trained in 2020 will see a majority of data rooted in the first half of the decade, so the corresponding connections in the model will be strengthened and weighted more heavily.
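The over-representation can be put into rough numbers using the scenario's own assumptions (constant human output of one "unit" per year, 70% of 2015-2020 content AI-generated from 2010-2015 data; all figures hypothetical):

```python
# Humans produce a constant 1 "unit" of content per year.
human_per_year = 1.0
human_2010_2015 = 5 * human_per_year
human_2015_2020 = 5 * human_per_year

# If human output is only 30% of the 2015-2020 total, the total is larger:
total_2015_2020 = human_2015_2020 / 0.3
ai_2015_2020 = total_2015_2020 - human_2015_2020  # echoes 2010-2015 patterns

# A model trained in 2020 on everything sees 2010-2015 patterns both
# directly and through the AI echoes:
weight_2010_2015 = human_2010_2015 + ai_2015_2020
weight_2015_2020 = human_2015_2020
share = weight_2010_2015 / (weight_2010_2015 + weight_2015_2020)
print(round(share, 2))  # ~0.77 instead of the 0.50 a balanced decade would give
```

Under these assumptions, roughly three quarters of what the 2020 model sees traces back to the first half of the decade, rather than the even split the raw calendar would suggest.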

Representativeness

It's important for artificial intelligence engineers to take into account all the consequences of using AI in the production of new content, to ensure that new ideas and advances in our society have a place.

This implies constant vigilance in the selection and renewal of datasets, to avoid an impoverishment of diversity and creativity in the content generated. It is essential to maintain a balance between human and AI contributions, to ensure that content reflects a wide range of perspectives and innovations.



Contributing editor:

Loïc Vansnick, Civil Engineer in Artificial Intelligence and webmarketer


