Last week, the Chinese lab DeepSeek unveiled R1-0528, an updated version of its R1 model. The release immediately sparked heated debate: Sam Paech, a developer from Melbourne, published evidence suggesting that the DeepSeek model may have been trained on output from Google Gemini, noting that it repeats words and expressions characteristic of Gemini. The creator of "SpeechMap" made a similar observation, pointing out that the "thoughts" R1-0528 generates while working closely resemble those of Gemini.
This is not the first time DeepSeek has been suspected of using competitors' data to train its models. Back in December of last year, developers noticed that one of the previous DeepSeek versions often identified itself as ChatGPT, which could indicate training on logs of conversations with that platform. OpenAI had previously reported detecting signs of so-called distillation, a technique in which a new model is trained on the outputs of more powerful systems, and linked that activity to DeepSeek. At the end of last year, Microsoft recorded large-scale data exfiltration through OpenAI developer accounts, which the company suspected were connected to DeepSeek.
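For readers unfamiliar with the term, distillation in this context usually means collecting a stronger "teacher" model's responses and fine-tuning a smaller "student" model to imitate them. The sketch below is purely illustrative of that idea; the `teacher_generate` placeholder and the JSONL format are hypothetical and are not attributed to DeepSeek, OpenAI, or any specific pipeline.

```python
# Illustrative sketch of output distillation: gather a stronger model's
# completions for a set of prompts and store them as supervised
# fine-tuning pairs for a smaller model.
import json


def teacher_generate(prompt: str) -> str:
    # Placeholder for a call to the teacher model's API (hypothetical);
    # in practice this would return the stronger model's completion.
    return f"<teacher completion for: {prompt}>"


def build_distillation_set(prompts, path="distill.jsonl"):
    # Each record pairs a prompt with the teacher's output; a student model
    # is then fine-tuned to reproduce these outputs.
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": teacher_generate(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    build_distillation_set(["Explain gradient descent in one paragraph."])
```

Detecting this kind of training after the fact typically relies on statistical traces, such as the student repeating the teacher's characteristic word choices, which is exactly the type of evidence cited in the R1-0528 discussion.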
In light of such accusations, leading AI market players are tightening security measures. Since April, OpenAI has required identity verification from organizations that want access to its advanced models, and China is not on the list of supported countries. Google and Anthropic have also begun introducing additional restrictions: both companies now summarize the reasoning traces their models produce, making it harder for competitors to train on that data.
Despite this, some industry experts do not rule out that DeepSeek may indeed have used Google Gemini data to build its model. Researcher Nathan Lambert noted that a company short on GPUs but well funded could plausibly generate large volumes of synthetic data from the best available models, which effectively amounts to additional compute for it.