Finetuning Toolshim Models for Tool Calling

· 6 min read
Alice Hau
Machine Learning Engineer
Michael Neale
Principal Engineer

Our recently published Goose benchmark revealed significant performance limitations in models that lack straightforward tool calling support (e.g., Gemma3, Deepseek-r1, phi4). These models often fail to invoke tools at the appropriate times, or they produce malformed or inconsistently formatted tool calls. With the most recent releases of Llama4 and Deepseek v3 (0324), we are again seeing weak tool calling performance, even on these flagship open-weight models.