ChatGPT has recently highlighted the unsuspected powers of large language models in countless fields, including translation, copy-editing, note synthesis, and advertising text production.
OpSci has been using tools similar to ChatGPT since 2021 to analyze large corpora of texts, images, and soon videos. The main patterns of a corpus are detected automatically by a family of models called "Transformers" (the T in GPT). Major examples of Transformers include BERT (Bidirectional Encoder Representations from Transformers, a language model developed by Google in 2018) for text and CLIP (Contrastive Language-Image Pre-training, developed by OpenAI in 2021) for images.
These models place each publication in the corpus on a large semantic map. The "embeddings," a series of abstract semantic coordinates, describe the position of a publication relative to other publications. The closer two publications are on this map, the more likely they are to share a similar meaning. For textual corpora, OpSci commonly uses BERTopic, an application developed by Maarten Grootendorst since 2021. BERTopic groups these semantic coordinates into coherent regions corresponding to consistent themes, patterns, or topics.
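The idea of proximity on a semantic map can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions), and the simple threshold grouping is a stand-in chosen for brevity, not the clustering procedure BERTopic itself applies. The function names and the 0.9 threshold are assumptions as well.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, made up for this example.
embeddings = {
    "nuclear plant extension": [0.9, 0.1, 0.0],
    "new reactor funding": [0.8, 0.2, 0.1],
    "offshore wind farm": [0.1, 0.9, 0.2],
    "solar panel subsidies": [0.2, 0.8, 0.1],
}

def group_by_similarity(embeddings, threshold=0.9):
    """Greedy grouping: a publication joins the first group whose seed
    (its first member) is at least `threshold` similar, else starts a new group."""
    groups = []
    for title, vector in embeddings.items():
        for group in groups:
            seed_vector = embeddings[group[0]]
            if cosine_similarity(vector, seed_vector) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups

groups = group_by_similarity(embeddings)
for group in groups:
    print(group)
```

With these toy vectors, the two nuclear publications end up in one region and the two renewables publications in another, which is the behavior the semantic map is meant to capture.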
Compared to other classification methods, embedding-based approaches have a major advantage: they can work on multilingual corpora. As many as fifty different languages are supported in the “Multilingual” model used by BERTopic. Within a single language, BERT also copes better with informal forms of expression (jargon, SMS language, spelling mistakes) that occur frequently on social networks.
Applied to the public sphere, transformer models provide a "panoramic" view of public opinion. The most discussed topics are immediately visible, and so is a long tail of secondary or even emerging topics. The suggestions and groupings made by the model are evaluated and annotated by our team of analysts. In their final state, the classifications are inseparable from human input and fitted to the relevant domain of expertise.
OpSci has developed a large annotated model of debates on climate change and energy transition in France. It includes 345 discussion topics that cover both long-standing themes (the role of nuclear energy, investment in renewables...) and weak signals on the rise.
Beyond online platforms, this method changes how public opinion is observed more broadly. Since 2023, OpSci has been collaborating with the French polling institute Cluster17 to create AI-assisted surveys. Thanks to their understanding of syntax and sentence structure, Transformer models are able to identify recurring statements and language elements, whose popularity or persuasive effect can then be tested on a representative sample of the French population. Beyond the present, AI provides the keys to understanding and anticipating future developments in public opinion.
The models used by OpSci are similar to the tools already implemented in opaque ways by the big platforms. The recent opening of Twitter's recommendation algorithm shows that each tweet and account is analyzed by a BERT model: tweets on subjects similar to those a person regularly discusses are more likely to be suggested.
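The principle of suggesting content close to a person's usual subjects can be sketched as follows. This is an illustrative toy under stated assumptions, not Twitter's actual recommendation code: the two-dimensional vectors, the tweet names, and the choice of averaging a user's past tweets into a single profile vector are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]

# Hypothetical embeddings of tweets the user has already written or liked.
user_history = [[0.9, 0.1], [0.7, 0.3]]

# Hypothetical embeddings of candidate tweets to rank.
candidates = {
    "tweet_energy": [0.8, 0.2],
    "tweet_sports": [0.1, 0.9],
    "tweet_mixed": [0.5, 0.5],
}

# The user's interests are summarized as the average of past embeddings.
profile = mean_vector(user_history)

# Candidates closest to the profile come first.
ranked = sorted(
    candidates,
    key=lambda name: cosine(profile, candidates[name]),
    reverse=True,
)
print(ranked)
```

The candidate whose embedding points in the same direction as the user's history is ranked first, and the unrelated one last.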
Thanks to the expertise gained over nearly two years, OpSci also aims to inform professionals and the general public about artificial intelligence models. The techniques behind the new generative AI models are virtually the same as those used for corpus classification. They do, however, raise unprecedented issues in terms of reliability, data security, and social and economic impact.