Researchers from the Singapore University of Technology and Design and Tsinghua University have introduced LongWriter-Zero, a new AI model for generating texts of over 10,000 words. The model is trained with reinforcement learning alone, without artificially created examples in its training data. The developers built three dedicated reward models into the training pipeline, which score the length, writing quality, and structure of the generated text.
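As a rough illustration of how several reward models can be folded into a single training signal, the sketch below averages three normalized scores. The function name, the equal weighting, and the [0, 1] score range are assumptions for illustration, not the paper's actual implementation.

```python
def combined_reward(length_score: float, quality_score: float,
                    structure_score: float) -> float:
    """Combine three reward-model scores (each assumed in [0, 1])
    into a single scalar by simple averaging."""
    scores = [length_score, quality_score, structure_score]
    return sum(scores) / len(scores)

# Example: a response scored 0.8 on length, 0.6 on quality, 0.7 on structure
reward = combined_reward(0.8, 0.6, 0.7)
```

In practice the weighting between criteria could itself be tuned, but an equal-weight average is the simplest starting point.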
An important feature of LongWriter-Zero is its “think prompts”: before producing an answer, the model first drafts a plan of the text's structure and content. According to the team, this improves the coherence and logic of long responses. In Arena-Write tests, LongWriter-Zero's score rose from 700 to 1200 Elo points, and additional training on 30 billion words of high-quality text improved the model's performance further.
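A "think prompt" of this kind can be approximated with a simple template that asks the model to outline before writing. The exact wording below is a hypothetical illustration; the paper's actual prompts may differ.

```python
# Hypothetical "think prompt" template: the model is instructed to plan
# the structure and content of the text before writing the full answer.
THINK_PROMPT = (
    "Before writing your answer, think step by step:\n"
    "1. Outline the main sections of the text.\n"
    "2. Note the key points to cover in each section.\n"
    "Then write the complete text following your outline.\n\n"
    "Task: {task}"
)

def build_prompt(task: str) -> str:
    """Wrap a user task in the planning instructions."""
    return THINK_PROMPT.format(task=task)
```

The planning step costs extra tokens but, per the team's claim, pays off in the coherence of very long outputs.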
In comparisons, LongWriter-Zero outperformed models such as DeepSeek-R1 and Claude 4 Sonnet in both automatic benchmarks and human evaluations. LongWriter-Zero is built on the Qwen2.5-32B base model. During training, an advantage-averaging function balances the different text-quality criteria against each other.
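One plausible reading of advantage averaging is: normalize each reward model's scores within a group of sampled responses (as in group-relative schemes), then average the resulting per-model advantages for each response. The sketch below follows that reading; it is an assumption about the mechanism, not the authors' code.

```python
from statistics import mean, pstdev

def group_advantages(scores):
    """Normalize one reward model's scores within a group of sampled
    responses: subtract the group mean and divide by the std deviation."""
    mu = mean(scores)
    sigma = pstdev(scores) or 1.0  # avoid division by zero
    return [(s - mu) / sigma for s in scores]

def averaged_advantage(per_model_scores):
    """per_model_scores: one list of scores per reward model, each list
    covering the same group of responses. Advantages are computed per
    model, then averaged per response to balance the criteria."""
    per_model_adv = [group_advantages(s) for s in per_model_scores]
    n = len(per_model_adv[0])
    return [mean(adv[i] for adv in per_model_adv) for i in range(n)]

# Example: length and quality rewards for three sampled responses
adv = averaged_advantage([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

Because each criterion is normalized before averaging, no single reward model can dominate the update simply by having a larger score scale.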
The researchers identified two key issues with the reinforcement-learning approach. First, the model tends to repeat or rephrase passages to reach the target length, even when this adds nothing to the content. Second, the reward system encourages overuse of certain words that were scored more highly than others during training.
The developers noted that these behaviors may limit LongWriter-Zero's usefulness for producing high-quality texts in real-world settings. In their view, models trained with reinforcement learning do not always match genuine user expectations and often rely on superficial patterns rather than a deep understanding of the content.