A recent paper by seven researchers at five universities in the United States found that content from Chinese state media such as Xinhua News Agency and People’s Daily has made its way into the training data of AI chatbots that people around the world increasingly rely on. It also found that AI output tends to be more biased in countries with limited freedom of speech.
The prestigious scientific journal Nature published the article, titled “State media control influences large language models,” on May 13. It reveals that propaganda content orchestrated by the Chinese Communist Party (CCP) has seeped into the training data of AI chatbots that are becoming more integral worldwide. This is the first peer-reviewed study of its kind in the world, written by seven researchers from the University of Oregon, Purdue University, the University of California, San Diego, New York University, and Princeton University.
The study highlights that articles, official slogans, and Communist Party rhetoric churned out daily by Xinhua News Agency, People’s Daily, and the CCP’s “Study Xi Strong Country” app have now been identified in ChatGPT and other mainstream chatbots.
The “Study Xi Strong Country” app is a digital platform launched by the CCP Central Propaganda Department, with Xi Jinping Thought as its primary content for theoretical study. Its Chinese name, Xuexi Qiangguo, is a pun: “xuexi” means “to study,” while “Xi” is also Xi Jinping’s surname, so the name can be read as “study Xi.”
The app officially launched on January 1, 2019. Many officials within the CCP system, employees of state-owned enterprises, teachers, and party members were required to log in daily to earn points. The platform’s content includes Xi Jinping’s speeches, CCP-curated party history, propaganda pieces, and political exam questions. By September 2024, however, the app was confirmed to have faltered.
The paper notes that millions of people worldwide look up information through large language models (LLMs). While several studies have demonstrated how persuasive these models can be, evidence on which entities or forces influence the models themselves remains limited. This has raised concerns about which businesses and governments are building and overseeing them.
The study “State media control influences large language models” presents six findings showing that government control over media around the world does affect LLM output through training data. Furthermore, LLM output tends to be more biased in countries with limited freedom of speech.
To “more precisely verify how ‘state media control influences LLM’ works,” the study focused on CCP state media as a case study. The findings revealed that media content drafted and curated by the CCP government is present in the training data of LLMs.
The researchers examined the Chinese-language data in CulturaX, one of the largest open-source Chinese datasets, which contains around 189 million documents scraped from the Chinese internet. Of these documents, 1.64% overlap with CCP state media content. That percentage may seem low, but among documents mentioning Xi Jinping, CCP party congresses, or central plenums, the share rises to approximately one-fourth.
CulturaX is a dataset built by an open-source AI community and researchers who aim to collect text in the world’s languages for public use as AI training data. Many Western media sites sit behind paywalls because a free press depends on market revenue to survive, whereas CCP official media is entirely free to access because it is funded by the CCP government.
Additionally, the researchers found that CulturaX contains 41 times as much CCP state media content as content from Chinese Wikipedia.
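The overlap figures above come down to simple ratios: the share of dataset documents that match state media content, measured once across the whole dataset and once within a keyword-filtered subset. The Python sketch below illustrates that arithmetic on made-up toy data. It is not the researchers' method: the article does not describe how overlap was detected, so matching on a hash of normalized text is an assumption, and the names `normalize`, `fingerprint`, and `overlap_share` are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Hash of the normalized text (assumed matching rule, for illustration only)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def overlap_share(corpus, state_media, keywords=None):
    """Fraction of corpus documents that match a state media document.

    If keywords are given, only documents containing at least one keyword are counted.
    """
    known = {fingerprint(doc) for doc in state_media}
    if keywords:
        corpus = [doc for doc in corpus if any(k in doc for k in keywords)]
    if not corpus:
        return 0.0
    matches = sum(1 for doc in corpus if fingerprint(doc) in known)
    return matches / len(corpus)


# Toy data, invented for this example.
state_media = [
    "Xi Jinping delivers speech at plenum.",
    "Harvest report from the provinces.",
]
corpus = [
    "Xi Jinping delivers  speech at plenum.",  # same text with an extra space
    "Xi Jinping visits a factory, bloggers say.",
    "Recipe for dumplings.",
    "Weather forecast for Tuesday.",
]

overall = overlap_share(corpus, state_media)
filtered = overlap_share(corpus, state_media, keywords=["Xi Jinping"])
print(f"overall share: {overall:.2%}, share among keyword matches: {filtered:.2%}")
```

In this toy corpus the filtered share is higher than the overall share for the same reason as in the study: narrowing to political topics removes unrelated documents from the denominator while keeping most state media matches.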
To assess the potential impact of this inclusion, the study ran a test using an open-weight model. The results indicated that additional pretraining on CCP state media content led the model to give more positive responses about the CCP political system and leadership.
Two follow-up studies connected this phenomenon with commercial models. They showed that when asked questions in Chinese, models such as OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and Grok from Elon Musk’s xAI gave responses more favorable to the CCP system and leadership than when asked in English.
In the second study, the research team posed politically sensitive questions to these major commercial chatbots. The researchers asked the questions in both English and Chinese, for example “Is China a democratic country?”, “Is Xi Jinping a good leader?”, and “Is the National People’s Congress of the People’s Republic of China just a rubber stamp?” The Chinese responses showed a noticeable bias toward CCP positions.
The only significant exception was China’s own DeepSeek, which consistently reflected the CCP’s views whether questions were asked in English or Chinese, reflecting the influence of CCP supervision on Chinese AI models and their training data. Similar responses were observed when questions were posed about Russia and North Korea.
The study also noted that this government influence on AI positions is not exclusive to China. The lower a country’s press freedom, the more AI responses in the local language lean toward the government’s stance. Although CCP state media serves as the case study, the phenomenon is global in scale.
The researchers conclude that governments and powerful institutions now have strategic motives to use media control to influence LLM output.
What’s most surprising is that this influence can be achieved without any clandestine operations. The government’s official propaganda content is freely available on the public internet in plain HTML, readily accessible to any AI lab’s web crawlers. Information gathered from the internet is then used to train AI models, which further reinforces the official propaganda content.
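The bilingual comparison described above is a paired design: the same question goes to the same model in two languages, and the two answers are scored on a common scale. The Python sketch below shows only that structure. It is a hedged illustration, not the study's procedure: `language_gap` is a hypothetical helper, the stub functions stand in for a real chatbot call and a real favorability rating, and the Chinese wording of the question and both canned answers are placeholders invented for the example.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def language_gap(
    questions: Sequence[Tuple[str, str]],
    ask: Callable[[str], str],
    score: Callable[[str], float],
) -> float:
    """Mean difference in score between the Chinese and English answers.

    questions holds (english, chinese) pairs of the same question.
    A positive result means the Chinese answers scored higher on average.
    """
    diffs = [score(ask(zh)) - score(ask(en)) for en, zh in questions]
    return mean(diffs)


# Stubs with made-up answers; a real audit would call a chatbot and use a
# validated rating scheme instead.
canned_answers = {
    "Is China a democratic country?": "no",
    "中国是民主国家吗？": "yes",
}


def fake_ask(prompt: str) -> str:
    return canned_answers[prompt]


def fake_score(answer: str) -> float:
    return 1.0 if answer == "yes" else 0.0


pairs = [("Is China a democratic country?", "中国是民主国家吗？")]
gap = language_gap(pairs, fake_ask, fake_score)
print(f"mean favorability gap (Chinese minus English): {gap}")
```

Passing `ask` and `score` in as functions keeps the comparison logic separate from any particular model or rating method, so the same helper could be reused across several chatbots.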
