Large language models (LLMs) have shown impressive advancements in many complex tasks such as mathematical reasoning and program synthesis. Despite this progress, the ability of LLMs to effectively utilize tools, services, and applications remains limited. In order to address this gap, we first introduce Gorilla LLM, a finetuning recipe that enhances the ability of LLMs to use tools by invoking APIs. Gorilla also introduces abstract syntax tree (AST)-based metrics to evaluate API hallucination in LLMs. Further, recognizing that evaluating LLMs can be challenging, we develop OpenFunctions, a pre-trained model that does not require retraining and instead relies on retrieval-augmented generation (RAG) to surface relevant APIs. This system allows LLMs to access an updated repository of functions and services, improving their utility without the overhead of constant model retraining.
Complementing function calling, RAFT (Retrieval Augmented Fine Tuning) provides a recipe for embedding new domain-specific knowledge into models. By training LLMs to discern and utilize only relevant information from a set of retrieved documents, RAFT improves accuracy and reliability in "open-book" settings across various in-domain datasets.
Finally, to enable the autonomous execution of LLM-generated commands—which can be prone to errors—the Gorilla Execution Engine (GoEx) is a novel runtime system that enforces execution under least privilege by dynamically interpreting user intentions and also incorporates "undo" and "damage confinement" abstractions to mitigate risks. GoEx supports post-facto validation, allowing users to verify the correctness of actions after they are executed and to revert any undesired effects. GoEx enables LLMs to operate autonomously, significantly reducing the potential risks associated with their autonomous actions.
We believe that together, these developments—Gorilla, OpenFunctions, RAFT, and GoEx—are critical to unlocking the potential for LLM agents to interact with applications and services.