## Recursive Language Models: Scaling Context through Symbolic Interaction

Recursive Language Models (RLMs) represent a fundamental shift in how large language models handle extensive input data by treating the prompt as an external object within a Read-Eval-Print Loop (REPL) environment. Traditional models suffer from a phenomenon known as context rot, where performance degrades steeply as prompts approach or exceed the context window limit; RLMs sidestep this by allowing the model to interact with the data programmatically. This approach enables the system to process contexts that are orders of magnitude larger than the model's native limit by using symbolic recursion to decompose, transform, and analyze the input. By offloading the prompt to the environment, the model can maintain high accuracy across millions of tokens without the loss of detail associated with standard summarization techniques.

*Arbitrarily long user prompts should not be fed into the neural network directly but should instead be treated as part of the environment.*

The RLM architecture relies on three specific design choices: providing a symbolic handle to the prompt, enabling programmatic sub-calls, and allowing the model to manage intermediate values within the REPL. Unlike standard context compaction or retrieval-augmented generation, RLMs do not force the model to summarize or forget early details to make room for new content; instead, the model writes Python code to perform regex searches and chunk the data. This framework allows the system to handle tasks with linear or even quadratic complexity relative to the input length, effectively bypassing the output token limits of the underlying transformer. The model can iteratively build up a final response by storing partial results in variables, ensuring that even extremely long outputs remain coherent and grounded in the source material.
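The three design choices above can be sketched as a minimal loop. This is an illustrative sketch, not the authors' implementation: the environment dict, the `rlm_step` function, and the stubbed `llm_call` sub-call are all hypothetical names introduced here, and a real RLM would have the model itself author this code inside the REPL.

```python
import re

def make_env(prompt: str) -> dict:
    # Design choice 1: the full prompt lives in the environment; the
    # model only holds a symbolic handle ("prompt"), never the raw text.
    return {"prompt": prompt}

def llm_call(sub_prompt: str) -> str:
    # Design choice 2: a recursive sub-call on a manageable sub-prompt.
    # Stubbed for illustration; a real RLM would invoke the model here.
    return f"[sub-answer over {len(sub_prompt)} chars]"

def rlm_step(env: dict, pattern: str) -> str:
    prompt = env["prompt"]
    # Inspect the prompt programmatically (regex search, not attention).
    hits = [m.start() for m in re.finditer(pattern, prompt)]
    # Chunk around each hit and launch one sub-call per chunk.
    partials = [
        llm_call("Summarize: " + prompt[max(0, i - 100): i + 100])
        for i in hits
    ]
    # Design choice 3: keep intermediate values in the environment
    # so later REPL steps can reuse them.
    env["partials"] = partials
    return "\n".join(partials)
```

Because only the small chunks around each regex hit are ever passed to a sub-call, the prompt itself can be arbitrarily long without ever entering the model's context window.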
Evaluation of the RLM framework used frontier models such as GPT-5 and Qwen3-Coder-480B across diverse benchmarks including S-NIAH, OOLONG, and BrowseComp-Plus. RLMs significantly outperformed vanilla models and common long-context scaffolds, often by double-digit percentage gains in accuracy and reasoning quality. For example, on the OOLONG-Pairs task, which requires quadratic reasoning over pairs of entries, vanilla GPT-5 failed almost entirely while the RLM version achieved a 58 percent F1 score. The system remained robust even at the 10-million-token scale while keeping inference costs comparable to those of standard models, suggesting that symbolic interaction scales better than raw attention over massive inputs.

*RLMs can successfully process inputs up to two orders of magnitude beyond model context windows.*

The researchers also showed that models can be fine-tuned to become natively recursive through a simple post-training recipe using only 1,000 filtered trajectories from the LongBenchPro benchmark. They created RLM-Qwen3-8B by training a small open model on trajectories from a larger teacher model, yielding a median performance improvement of 28.3 percent across all evaluation tasks. This fine-tuning focused on the model's ability to manipulate the REPL and decide when to launch recursive calls, rather than on teaching it new domain-specific knowledge. Analysis of RLM trajectories revealed emergent behaviors, such as using regular expressions to filter information and storing partial results in variables to stitch together long outputs programmatically without exceeding context limits. While RLMs offer significant performance gains, they introduce higher variance in inference cost and runtime because the number of sub-calls varies from task to task.
However, the median cost of RLM runs remained comparable to or lower than that of standard summarization agents, because an RLM selectively views slices of the context rather than ingesting the entire prompt at once. Future work may explore deeper levels of recursion, asynchronous sub-calls, and the integration of more sophisticated symbolic tools to further improve efficiency. This research provides a practical path for scaling context in long-horizon AI applications, emphasizing that the future of AI lies in how models symbolically interact with their environments to solve complex problems.