FastML

Machine learning made easy

Paper review: FrugalGPT

Large language models are costly. In the paper we're about to review, a few researchers from Stanford present their idea of how to make them cheaper. Specifically, they talk about calling APIs from providers like OpenAI and others. They offer a few general strategies, like prompt adaptation and result caching, but the main thing they go into is using a cascade of models. The idea is simple: you arrange the models to call from the cheapest to the most expensive, and start with the cheapest. If the answer is acceptable, you stop; if not, you continue with the next.
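The cascade loop itself is a few lines of code. Here's a minimal sketch of the idea; `call_model` and `score_answer` are placeholders we made up, not the paper's actual code, and the model names and threshold are invented for illustration.

```python
# Hypothetical sketch of the FrugalGPT-style cascade: try models from
# cheapest to most expensive, stop at the first acceptable answer.

def call_model(model, question):
    # Placeholder for an API call to the given provider/model.
    return f"answer from {model} to: {question}"

def score_answer(question, answer):
    # Placeholder for the learned scorer (the paper trains a supervised
    # model on question-answer pairs); here just a dummy constant.
    return 0.9

def cascade(question, models, threshold):
    """Try models in order of increasing cost; return early when the
    scorer deems an answer good enough."""
    answer = None
    for model in models:
        answer = call_model(model, question)
        if score_answer(question, answer) >= threshold:
            return model, answer  # good enough: stop here, save money
    return models[-1], answer     # fall through to the priciest answer

model, answer = cascade("Will gold go up?", ["gpt-j", "gpt-3.5", "gpt-4"], 0.8)
```

The savings come from how often cheap models clear the threshold: every early return means the expensive models are never called at all.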

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

The obvious question here is how to decide if the answer is good enough. The authors' solution is to train an auxiliary supervised scoring model (DistilBERT) on question-answer pairs. The model outputs scores, and if the score is above a certain threshold, the system deems the answer satisfactory.

This elicits the next question: how do you choose the threshold? It's an optimization problem which we won't get into, because there are more important practical considerations to look at. Specifically, they experiment on three datasets: predicting changes in gold prices, legal classification, and general Q&A. They cascade three models on each, and these three models differ from dataset to dataset, though the cascade usually ends with GPT-4. The thresholds for accepting an answer also differ from dataset to dataset, and from model to model.
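To make the per-dataset, per-model thresholds concrete, here's a sketch of what such a configuration might look like. The dataset and model names are our own shorthand and the threshold values are made up for illustration, not the tuned values from the paper.

```python
# Per-(dataset, model) acceptance thresholds. All values here are invented
# for illustration; in the paper they come out of an optimization step.
THRESHOLDS = {
    ("gold_prices", "gpt-j"):    0.96,
    ("gold_prices", "gpt-3.5"):  0.92,
    ("legal", "gpt-3.5"):        0.88,
    ("qa", "gpt-3.5"):           0.90,
}

def acceptable(dataset, model, score, default=0.95):
    """Accept the answer if its score clears the threshold tuned for this
    (dataset, model) pair, falling back to a conservative default."""
    return score >= THRESHOLDS.get((dataset, model), default)
```

The point of the lookup table is the practical burden it implies: every new dataset, and every new model in the cascade, means another threshold to tune.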

At this point, you might have guessed where we're going with this: will it work in the real world? We would say, probably not. It depends on the application. If the users ask questions from a well-defined domain, maybe you could set up a system like this. For a general AI assistant, there are too many moving parts to pull it off. For example, it would be difficult to train a general-purpose scoring model.

Still, if someone handles a large volume of user queries, the broader idea of cascading models, or maybe routing queries to different models, might be promising.
