Published November 22, 2023. Updated November 23, 2023.
In many articles about generative AI, the question most often asked concerns the limits of AI.
This article will not address the technological limits of AI, which remain unknown to date, but rather the possible limits of its use on the web and in content creation.
The expression "Content is king" has been around for a long time in the world of digital marketing.
It means that content is the central pillar of a successful marketing and sales strategy. Without engaging content, all other components of the strategy will have less effect. This underscores the importance of producing quality content for a successful online presence.
Since 2022, ChatGPT has democratized the use of generative AI for text production. It quickly reached 100 million users. Its arrival was seen as a godsend for digital marketing, as content production was a bottleneck and a major cost for deploying online strategies.
Those in need of mass-produced content, and who called on the services of offshore copywriters, turned en masse to the multitude of automated copywriting tools that have sprung up. The best-known are Jasper, WriteSonic, WordHero, Rytr, etc.
This practice is also encouraged by Google, which has decided not to penalize AI-generated texts. The Mountain View firm even wrote an article as early as February 8, 2023 saying:
"At Google, we've long believed in the power of AI to transform the ability to deliver useful information. In this article, we explain in more detail how AI-generated content fits in with our long-standing approach to delivering useful content to users in Google Search. [...] Rewarding high-quality content, however it's produced." (Source)
Authoring tools such as those mentioned above or ChatGPT are specific, customized LLM applications. ChatGPT is optimized for conversational interactions, with capabilities and guidelines that make it suitable for a wide range of interactive applications. While a general LLM can be used for a variety of language processing tasks, ChatGPT is specifically honed to understand, participate in and maintain conversations with users.
In outline, obtaining auto-generated text involves the following steps: the prompt is broken into tokens, the model computes a probability distribution over the possible next tokens, one token is sampled from that distribution, and the process repeats until a stop condition is reached.
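That predict-then-sample loop can be sketched with a toy model. The transition table below is invented purely for illustration; a real LLM computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens.

```python
import random

# Toy next-token "model": a fixed table of transition probabilities.
# A real LLM computes these probabilities dynamically; these words
# and weights are invented for illustration.
TRANSITIONS = {
    "<start>": [("content", 0.6), ("quality", 0.4)],
    "content": [("is", 0.8), ("matters", 0.2)],
    "quality": [("is", 1.0)],
    "is": [("king", 0.7), ("key", 0.3)],
    "matters": [("<end>", 1.0)],
    "king": [("<end>", 1.0)],
    "key": [("<end>", 1.0)],
}

def generate(seed=0):
    """Repeat the predict-then-sample loop until the end token."""
    rng = random.Random(seed)
    token, output = "<start>", []
    while token != "<end>":
        candidates = TRANSITIONS[token]
        words = [w for w, _ in candidates]
        weights = [p for _, p in candidates]
        token = rng.choices(words, weights=weights, k=1)[0]
        if token != "<end>":
            output.append(token)
    return " ".join(output)
```

Every sentence this toy can produce is a recombination of what its "dataset" contained - which is exactly why the diversity of the underlying data matters.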
This is where the importance of the initial data - the so-called dataset - becomes apparent. Datasets serve as the knowledge base from which an LLM learns. The larger and more diversified the dataset, the more varieties of language and styles the model can learn from.
The quality of the dataset directly affects the accuracy of the model. A good dataset will produce more precise and relevant answers.
This brings us to the question posed in this article. Before their democratization, generative AIs had datasets fed almost exclusively by human production.
Over the past year, the Internet has literally been flooded with texts, images and even videos produced by AIs. Nor is it just the web: books, student dissertations, reports, press articles and other kinds of content are affected as well.
It's also reasonable to assume that humans will increasingly turn to AIs when they need to generate text, so that human creations will shrink as a share of the whole. Datasets will therefore increasingly be fed by data that AIs themselves produced, on the basis of the datasets available at the time that content was generated. This would lead to a form of inbreeding.
Thinking of it in genetic terms, this inbreeding could lead to an impoverishment of the content created, as the datasets become less and less diversified. The variety and richness of the generated language would then be limited.
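This impoverishment can be illustrated with a toy simulation - a sketch only, not a claim about any particular LLM. A Gaussian distribution is fitted to a small sample, new "content" is generated from the fit, the model is refitted on that content, and so on. Across generations, the spread of the distribution collapses:

```python
import random
import statistics

def simulate_collapse(generations=200, sample_size=20, seed=42):
    """Repeatedly fit a Gaussian to its own output and resample.

    Each generation trains on data produced by the previous model,
    mimicking AIs trained on AI-generated content. Estimation error
    compounds, and the fitted distribution's spread shrinks."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    stds = [sigma]
    for _ in range(generations):
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(data)       # refit mean on own output
        sigma = statistics.pstdev(data)   # refit spread (biased low)
        stds.append(sigma)
    return stds
```

Running `simulate_collapse()` shows the standard deviation dropping far below its starting value of 1.0: the model ends up generating an ever-narrower slice of what the original data contained.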
This is where we see the importance of the different stages used to train and evaluate an AI. In an ideal case, the training data should be cleaned up beforehand, so as to train the model with varied and relevant data.
The other important factor is the evaluation of the LLM. To properly evaluate an AI model, the test data must be varied and contain edge cases. Only if the test dataset is correctly constructed can we see whether the model has been properly trained. If not, the model may not perform well.
If an LLM is increasingly trained with AI-generated content, the variety of its training dataset may be impoverished. And if the model is trained for too long, there is a risk that it won't generalize well enough; this is known as "overfitting". Thanks to the test data, however, this drop in performance can be detected and training stopped.
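A minimal sketch of why held-out test data matters, using an invented toy dataset: a model that memorizes its training set scores perfectly on it, but the test set reveals that a simpler model generalizes better.

```python
import random

def make_data(n, rng):
    """Points from y = 2x + Gaussian noise: the pattern to learn."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))
    return data

def mse(model, data):
    """Mean squared error of a model over a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)

def memorizer(x):
    """Returns the y of the nearest training point: zero training
    error, but it has memorized the noise, not the pattern."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Closed-form least-squares fit of y = a * x.
a = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def linear(x):
    return a * x
```

On the training set, the memorizer's error is exactly zero; on the held-out test set, the simple linear fit beats it. Without the test set, the overfit would be invisible.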
To understand how it might affect us in the future, it's useful to know whether engineers have already faced similar situations.
A recurring problem when training an artificial intelligence model is access to sufficient data. In many scenarios, the amount of data is too small, or the data is inaccessible for reasons of confidentiality. It is therefore common to have to think about how to increase the amount of data available to train models.
Some of the techniques employed are simplistic, such as duplicating the available data to obtain a larger quantity. Others are more complex, such as generating new data using GAN models.
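The simplistic end of that spectrum can be sketched as follows, with invented numeric data: each sample is duplicated with a small random jitter to enlarge the dataset.

```python
import random

def augment(samples, copies=3, noise=0.05, seed=0):
    """Naive augmentation: jitter each numeric sample with small
    Gaussian noise to multiply the dataset size. A placeholder for
    richer techniques (GANs, paraphrasing, image transforms)."""
    rng = random.Random(seed)
    out = list(samples)  # keep the originals
    for _ in range(copies):
        out.extend(x + rng.gauss(0.0, noise) for x in samples)
    return out
```

Two original samples become eight: the originals plus three jittered copies of each. The new points carry no new information about the world, which is precisely the limitation the more complex techniques try to overcome.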
GAN stands for Generative Adversarial Network. These networks operate in two parts: a generator, which produces synthetic samples, and a discriminator, which tries to distinguish those samples from real data.
Its operation can be visualized as that of a counterfeit painter trying to fool an art expert.
Initially, the counterfeit painter will practice by imitating the original paintings to which he has access. Once he is able to paint correctly, he will stop practicing. The art expert will analyze both genuine and fake paintings. He will train himself to detect the fakes until he is able to detect most of them.
Once the forgeries are no longer confused with the originals, the painter goes back to work and practices painting better pictures until they are identical to the originals in the eyes of the expert. The expert in turn improves, and so on.
At the end of this process, the painter is able to create paintings very similar to the original data. This increases the amount of data available to train an AI model.
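The painter/expert loop can be sketched in one dimension - a toy, with all numbers invented for illustration. "Real paintings" are samples from a Gaussian centered at 3; the painter (generator) learns a single parameter mu, and the expert (discriminator) is a small logistic classifier. Each side ascends its own objective against the other:

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_toy_gan(steps=500, batch=8, lr=0.1, seed=1):
    """One-dimensional adversarial training.

    Real data ~ N(3, 1). The painter produces mu + noise and learns
    mu; the expert is the classifier sigmoid(w * (x - b)). Each side
    improves against the other, as in the analogy above."""
    rng = random.Random(seed)
    mu, w, b = 0.0, 1.0, 0.0
    history = []
    for _ in range(steps):
        reals = [rng.gauss(3.0, 1.0) for _ in range(batch)]
        fakes = [mu + rng.gauss(0.0, 1.0) for _ in range(batch)]
        # Expert step: ascend log D(real) + log(1 - D(fake)).
        gw = gb = 0.0
        for r, f in zip(reals, fakes):
            d_r, d_f = sigmoid(w * (r - b)), sigmoid(w * (f - b))
            gw += (1 - d_r) * (r - b) - d_f * (f - b)
            gb += (d_f - (1 - d_r)) * w
        w += lr * gw / batch
        b += lr * gb / batch
        # Painter step: ascend log D(fake) - move toward where the
        # expert is most inclined to say "genuine".
        gmu = sum((1 - sigmoid(w * (f - b))) * w for f in fakes) / batch
        mu += lr * gmu
        history.append(mu)
    # Average over the last steps to smooth the oscillation.
    return sum(history[-100:]) / 100
```

Starting from mu = 0, the painter's output drifts toward the real data's center at 3, pushed only by the expert's judgments - it never sees the real data directly.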
We can draw a parallel between AIs trained with GAN-generated data and the phenomenon discussed earlier in the article. The data used to train future AIs will contain data produced by humans as well as data generated by AI models.
On the face of it, you might think that this poses no problem, since it is common practice in the scientific world. Yet there are consequences that are important to be aware of.
It's important for artificial intelligence engineers to take into account all the consequences of using AI in the production of new content, to ensure that new ideas and advances in our society have a place.
This implies constant vigilance in the selection and renewal of datasets, to avoid an impoverishment of diversity and creativity in the content generated. It is essential to maintain a balance between human and AI contributions, to ensure that content reflects a wide range of perspectives and innovations.
Contributing editor:
Loïc Vansnick, Civil Engineer in Artificial Intelligence and webmarketer
Sources
What is a large language model (LLM)?