Skip to main content
Alice Hau
Machine Learning Engineer
View all authors

Finetuning Toolshim Models for Tool Calling

· 6 min read
Alice Hau
Machine Learning Engineer
Michael Neale
Principal Engineer

blog cover

Our recently published Goose benchmark revealed significant performance limitations in models where tool calling is not straightforwardly supported (e.g., Gemma3, Deepseek-r1, phi4). These models often fail to invoke tools at appropriate times or produce malformed or inconsistently formatted tool calls. With the most recent releases of Llama4 and Deepseek v3 (0324), we are again observing challenges with effective tool calling performance, even on these flagship openweight models.

Community-Inspired Benchmarking: The Goose Vibe Check

· 20 min read
Alice Hau
Machine Learning Engineer

blog cover

We've been measuring Goose's performance with various AI models, including a variety of popular open-source models that can run locally on consumer hardware (RTX 4080, Mac M-series). We understand that many in our community value a fully open-source, local experience without relying on cloud services.

This blog shares our findings comparing open-source models against their closed counterparts, highlighting both current performance gaps and paths for future improvement. Our benchmark is still in its early stages, but we wanted to release it as a starting point for distinguishing models that exhibit stronger agentic capabilities by their ability to pilot Goose (distinct from reasoning or other capabilities often captured in other popular benchmarks).