Paper review: SSRL - Self-Search Reinforcement Learning

LLMs are often used together with search engine calls, so it makes sense to teach them how to make these calls during training. Search engine API calls can be expensive, so people came up with the idea of simulating search engine responses with a language model. This is cheaper than the real thing if you use a small self-hosted model for the purpose.

This, however, is not the novelty of this paper. The idea was already known, for example from ZeroSearch: Incentivize the Search Capability of LLMs without Searching. ZeroSearch is about both reducing costs and controlling the quality of retrieved documents to make training more robust. The ZeroSearch authors fine-tuned their simulation model.

The authors of SSRL: Self-Search Reinforcement Learning dispense with this complexity: they don’t care about the content of retrieved documents at all. They mask the content of the search answers, that is, exclude it from loss calculations. They only care about the process: how to teach the model to use a search engine. They do this by structuring the text with think, search, and information tags. The model “thinks”, then searches, then processes the information retrieved, and the cycle repeats.
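To make the masking idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the tags are written as `<information>...</information>`, that the tags themselves are masked along with their content, and it works on characters as a stand-in for tokens. The names `loss_mask` and `masked_mean` are made up for illustration.

```python
import re

# Assumption: information spans are delimited by <information>...</information>.
INFO_SPAN = re.compile(r"<information>.*?</information>", re.DOTALL)


def loss_mask(text: str) -> list[int]:
    """Return one flag per character: 0 inside information spans, 1 elsewhere."""
    mask = [1] * len(text)
    for match in INFO_SPAN.finditer(text):
        for i in range(match.start(), match.end()):
            mask[i] = 0
    return mask


def masked_mean(losses: list[float], mask: list[int]) -> float:
    """Average the losses over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


# A toy rollout following the think -> search -> information -> think cycle.
rollout = (
    "<think>I need the author's birthplace.</think>"
    "<search>author birthplace</search>"
    "<information>Some retrieved passage.</information>"
    "<think>Now I can answer.</think>"
)
mask = loss_mask(rollout)
```

With a real tokenizer you would build the same mask over token positions instead of characters; the point is only that whatever sits in the information span contributes nothing to the loss.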

Somewhat surprisingly, self-search improves performance over regular Chain of Thought and basic RAG on a number of multi-hop QA benchmarks like HotpotQA. It’s about as good as, or slightly better than, ZeroSearch.

However, this seems to apply mostly to small models; already with 8B models the results are inconclusive. When they replace the simulation with a real search engine at test time, the results don’t necessarily improve. With small models it apparently works quite well, and it’s those small self-hosted models that you would use to cut costs. You also don’t need to fine-tune them: just plug in an off-the-shelf model.

FastML

Machine learning made easy
