Ref-Based Tap Replaces Node ID Tap
Summary
Removed tapOnElementByNodeId entirely and made tap (ref-based) the sole tap-by-element tool across all execution paths — host-side MCP, inner agent, on-device instrumentation. The old node-ID approach used per-capture DFS indices that were inherently unstable; the ref-based approach uses content-hashed element identifiers that remain stable across screen captures.
The Problem
tapOnElementByNodeId accepted an integer nodeId — a position in a depth-first traversal of the view hierarchy. These IDs were:
- Unstable across captures. Any change to the hierarchy (a toast appearing, a list item recycling, an animation mid-frame) shifted every subsequent ID.
- Mismatched between host and device. The host-side agent captured its own view hierarchy to plan actions, then sent nodeId to the on-device agent for execution. But the on-device agent captured a separate hierarchy where DFS ordering differed, so the same nodeId pointed to a different element. This required a workaround (resolveToolForOnDevice) that re-resolved the node to coordinates, which was fragile and slow.
- Inconsistent across tree representations. ViewHierarchyTreeNode used pre-order DFS while TrailblazeNode used post-order DFS from AccessibilityNode, so IDs never aligned between the two trees.
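To make the instability concrete, here is a minimal Kotlin sketch. The Node class, the sample trees, and the preOrderIds / postOrderIds helpers are hypothetical stand-ins written for illustration; they are not the real ViewHierarchyTreeNode or TrailblazeNode code. The sketch numbers a tiny tree before and after a toast appears, and in both traversal orders.

```kotlin
// Hypothetical minimal tree; labels are assumed unique so they can serve as map keys.
data class Node(val label: String, val children: List<Node> = emptyList())

// Pre-order numbering: a node gets its ID before its children do.
fun preOrderIds(root: Node): Map<String, Int> {
    val ids = LinkedHashMap<String, Int>()
    fun visit(n: Node) {
        ids[n.label] = ids.size
        n.children.forEach { visit(it) }
    }
    visit(root)
    return ids
}

// Post-order numbering: a node gets its ID after all of its children.
fun postOrderIds(root: Node): Map<String, Int> {
    val ids = LinkedHashMap<String, Int>()
    fun visit(n: Node) {
        n.children.forEach { visit(it) }
        ids[n.label] = ids.size
    }
    visit(root)
    return ids
}

// Sample captures: the same screen before and after a toast is inserted.
val before = Node("root", listOf(Node("title"), Node("submit")))
val after = Node("root", listOf(Node("toast"), Node("title"), Node("submit")))
```

With this sample data, submit is ID 2 in the pre-order numbering of before, ID 3 once the toast is inserted, and ID 1 under post-order numbering of the unchanged tree. One integer can therefore point at three different elements depending on when and how the tree was walked.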
How Refs Work
The tap tool uses a short content-hashed ref (e.g., y778) instead of a positional index. The ref is generated by ElementRef.RefTracker in ElementRefSlug.kt:
identity = "${className}|${label}|${roundedX},${roundedY}"
hash = identity.hashCode()
ref = letter(hash % 26) + number((hash / 26) % 1000)
Key properties:
- Content-stable. The hash is derived from the element’s class name, display text, and center coordinates (rounded to 10px buckets). As long as the element doesn’t fundamentally change, the ref stays the same across captures.
- Position-tolerant. Rounding coordinates to the nearest 10px means minor layout shifts (scroll offsets, animation frames) don’t invalidate the ref.
- Collision-resistant. 26 letters × 1,000 numbers give 26,000 possible base values; collisions get letter suffixes (k42, k42b). More than enough for any single screen.
- Pre-applied to the tree. Refs are assigned during compact element list generation and stored on TrailblazeNode.ref. The tap tool matches against this pre-applied field; it does NOT recompute hashes, avoiding traversal-order mismatches.
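The properties above can be sketched in Kotlin as follows. This is an illustration under stated assumptions, not the code in ElementRefSlug.kt: the Element class, bucket, baseRef, and RefAssigner are hypothetical names, the use of Math.floorMod / Math.floorDiv to keep negative hash codes in range is an assumption (the pseudocode does not say how negatives are handled), and the suffix sequence beyond b is a guess.

```kotlin
import kotlin.math.roundToInt

// Hypothetical element shape; the real input type is not shown in this document.
data class Element(val className: String, val label: String, val centerX: Int, val centerY: Int)

// Round a coordinate to the nearest 10px bucket.
fun bucket(px: Int): Int = (px / 10.0).roundToInt() * 10

// Derive the base ref (one letter plus a number in 0..999) from the element's identity.
fun baseRef(e: Element): String {
    val identity = "${e.className}|${e.label}|${bucket(e.centerX)},${bucket(e.centerY)}"
    val hash = identity.hashCode()
    // Assumption: floorMod/floorDiv keep the results non-negative when hashCode() is negative.
    val letter = 'a' + Math.floorMod(hash, 26)
    val number = Math.floorMod(Math.floorDiv(hash, 26), 1000)
    return "$letter$number"
}

// Tracks refs handed out for one capture and disambiguates collisions with a letter suffix.
class RefAssigner {
    private val seen = HashMap<String, Int>()

    fun assign(e: Element): String {
        val base = baseRef(e)
        val count = seen[base] ?: 0
        seen[base] = count + 1
        // First occurrence keeps the bare ref; later ones get b, c, d, ...
        // (assumed scheme; this sketch does not handle more than 25 collisions).
        return if (count == 0) base else base + ('a' + count)
    }
}
```

Two elements with the same class and label whose centers fall into the same 10px bucket, for example (203, 598) and (198, 602), produce the same base ref, so the second one assigned in a capture receives the b suffix.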
What Changed
Removed
- TapOnElementByNodeIdTrailblazeTool.kt: deleted entirely. It was a DelegatingTrailblazeTool (never executed directly, only delegated to TapOnByElementSelector or TapOnPointTrailblazeTool), so nothing downstream depended on it.
- resolveToolForOnDevice() in TrailblazeMcpBridgeImpl: the workaround that re-resolved node IDs to coordinates for on-device execution. No longer needed since refs are stable across capture contexts.
- CoreTools.TAP_ON_ELEMENT_BY_NODE_ID constant.
- Framework-level exclusion of TapTrailblazeTool for the ANDROID_ONDEVICE_INSTRUMENTATION driver.
Updated
- TrailblazeToolSetCatalog: TapTrailblazeTool is now the only tap-by-element tool in the core set.
- DirectMcpAgent system prompt: instructs the LLM to use tap with ref IDs instead of tapOnElementByNodeId with node IDs.
- CoreTools: added TAP = "tap" constant, added to TAP_NAMES for classification.
Preserved
- OtherTrailblazeTool catches any tapOnElementByNodeId references in old logs or test fixtures as generic unknown tools. No dedicated class needed for backward compat.
Future Flexibility
The ref format is an implementation detail of ElementRef.RefTracker. If a constrained environment ever needs integer-based refs, the generation strategy can change without touching TapTrailblazeTool — it just matches it.ref == ref on whatever string the tracker produces. The current content-hash approach is strictly better than positional indices because stability comes from the element’s identity, not its position in a traversal.
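As a rough illustration of why the tool is indifferent to the ref format, the lookup can be pictured as a plain string comparison over the tree. UiNode, findByRef, and the sample screen below are hypothetical stand-ins, not the actual TrailblazeNode or TapTrailblazeTool code; the only detail taken from this document is that matching compares the pre-applied ref field against the requested string.

```kotlin
// Hypothetical stand-in for a tree node that already carries its ref.
data class UiNode(val ref: String?, val label: String, val children: List<UiNode> = emptyList())

// Depth-first search for the first node whose pre-applied ref equals the requested one.
// No hash is recomputed here, so the traversal order cannot change which node a ref means.
fun findByRef(root: UiNode, ref: String): UiNode? {
    if (root.ref == ref) return root
    for (child in root.children) {
        findByRef(child, ref)?.let { return it }
    }
    return null
}

// Sample tree mixing hash-style refs with an integer-style ref.
val screen = UiNode(
    ref = null,
    label = "root",
    children = listOf(
        UiNode("k42", "Submit"),
        UiNode("k42b", "Submit (duplicate)"),
        UiNode("17", "Cancel"),
    ),
)
```

Because the comparison is on opaque strings, findByRef(screen, "k42b") and findByRef(screen, "17") go through the same code path; only the component that produces the refs would need to change if the format did.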