Ref-Based Tap Replaces Node ID Tap

Summary

Removed tapOnElementByNodeId entirely and made tap (ref-based) the sole tap-by-element tool across all execution paths: host-side MCP, inner agent, and on-device instrumentation. The old node-ID approach used per-capture DFS indices that were inherently unstable; the ref-based approach uses content-hashed element identifiers that remain stable across screen captures.

The Problem

tapOnElementByNodeId accepted an integer nodeId — a position in a depth-first traversal of the view hierarchy. These IDs were:

  • Unstable across captures. Any change to the hierarchy (a toast appearing, a list item recycling, an animation mid-frame) shifted every subsequent ID.
  • Mismatched between host and device. The host-side agent captured its own view hierarchy to plan actions, then sent nodeId to the on-device agent for execution. But the on-device agent captured a separate hierarchy where DFS ordering differed, so the same nodeId pointed to a different element. This required a workaround (resolveToolForOnDevice) that re-resolved the node to coordinates — fragile and slow.
  • Inconsistent across tree representations. ViewHierarchyTreeNode used pre-order DFS while TrailblazeNode used post-order DFS from AccessibilityNode, so IDs never aligned between the two trees.
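The instability is easy to reproduce on a toy tree. The sketch below is illustrative only: Node, preOrderIds, postOrderIds, and sampleScreen are hypothetical names, not the real ViewHierarchyTreeNode or TrailblazeNode types. It shows how inserting a toast shifts the pre-order index of an unrelated button, and how pre-order and post-order numbering disagree on the same tree.

```kotlin
// Hypothetical minimal tree node, used only for this illustration.
data class Node(val label: String, val children: List<Node> = emptyList())

// Pre-order DFS: a node receives its index before its children do.
fun preOrderIds(root: Node): Map<String, Int> {
    val ids = LinkedHashMap<String, Int>()
    fun visit(node: Node) {
        ids[node.label] = ids.size
        node.children.forEach { visit(it) }
    }
    visit(root)
    return ids
}

// Post-order DFS: a node receives its index after all of its children.
fun postOrderIds(root: Node): Map<String, Int> {
    val ids = LinkedHashMap<String, Int>()
    fun visit(node: Node) {
        node.children.forEach { visit(it) }
        ids[node.label] = ids.size
    }
    visit(root)
    return ids
}

// Builds the same screen with or without a transient toast as the first child.
fun sampleScreen(withToast: Boolean): Node {
    val list = Node("list", listOf(Node("item1"), Node("item2")))
    val body = listOf(list, Node("submit"))
    val children = if (withToast) listOf(Node("toast")) + body else body
    return Node("root", children)
}

fun main() {
    println(preOrderIds(sampleScreen(withToast = false))["submit"])  // 4
    println(preOrderIds(sampleScreen(withToast = true))["submit"])   // 5
    println(postOrderIds(sampleScreen(withToast = false))["submit"]) // 3
}
```

The submit button never changed, yet its ID is 4, 5, or 3 depending on when and how the tree was walked.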

How Refs Work

The tap tool uses a short content-hashed ref (e.g., y778) instead of a positional index. The ref is generated by ElementRef.RefTracker in ElementRefSlug.kt:

identity = "${className}|${label}|${roundedX},${roundedY}"
hash     = identity.hashCode()
ref      = letter(hash % 26) + number((hash / 26) % 1000)

Key properties:

  • Content-stable. The hash is derived from the element’s class name, display text, and center coordinates (rounded to 10px buckets). As long as the element doesn’t fundamentally change, the ref stays the same across captures.
  • Position-tolerant. Rounding coordinates to the nearest 10px means minor layout shifts (scroll offsets, animation frames) don’t invalidate the ref, as long as the element’s center stays within the same 10px bucket.
  • Collision-resistant. 26 letters × 1,000 numbers give 26,000 possible base values; collisions get letter suffixes (k42, k42b). More than enough for any single screen.
  • Pre-applied to the tree. Refs are assigned during compact element list generation and stored on TrailblazeNode.ref. The tap tool matches against this pre-applied field — it does NOT recompute hashes, avoiding traversal-order mismatches.
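The generation scheme and the properties above can be sketched in Kotlin as follows. This is a minimal sketch, not the code in ElementRefSlug.kt: Element, bucket, and SketchRefTracker are hypothetical names, the use of floorMod/floorDiv to keep the letter and number non-negative when hashCode() is negative is an assumption, and so is the choice of one tracker instance per capture.

```kotlin
// Hypothetical stand-in for the fields that feed the identity string.
data class Element(val className: String, val label: String, val centerX: Int, val centerY: Int)

// Rounds a coordinate to the nearest multiple of 10 (e.g. 96 and 104 both become 100).
fun bucket(px: Int): Int = Math.floorDiv(px + 5, 10) * 10

// Assumed to be created fresh for each screen capture.
class SketchRefTracker {
    // Base ref -> how many elements in this capture have produced it so far.
    private val seen = HashMap<String, Int>()

    fun refFor(e: Element): String {
        val identity = "${e.className}|${e.label}|${bucket(e.centerX)},${bucket(e.centerY)}"
        val hash = identity.hashCode()
        // Assumption: floorMod/floorDiv, so a negative hash still maps to a..z and 0..999.
        val letter = 'a' + Math.floorMod(hash, 26)
        val number = Math.floorMod(Math.floorDiv(hash, 26), 1000)
        val base = "$letter$number"

        val count = (seen[base] ?: 0) + 1
        seen[base] = count
        // First occurrence keeps the bare ref; later ones get b, c, ... as a suffix.
        check(count <= 26) { "this sketch only supports 26 elements per base ref" }
        return if (count == 1) base else base + ('a' + (count - 1))
    }
}

fun main() {
    val submit = Element("Button", "Submit", 104, 200)
    val shifted = submit.copy(centerX = 96, centerY = 203) // same 10px buckets

    val first = SketchRefTracker().refFor(submit)
    val second = SketchRefTracker().refFor(shifted)
    println(first == second) // true: the small shift does not change the ref
}
```

Because the ref depends only on the identity string, two independent captures (for example host-side and on-device) derive the same ref for the same element without having to agree on traversal order.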

What Changed

Removed

  • TapOnElementByNodeIdTrailblazeTool.kt — deleted entirely. It was a DelegatingTrailblazeTool (never executed directly, only delegated to TapOnByElementSelector or TapOnPointTrailblazeTool), so nothing downstream depended on it.
  • resolveToolForOnDevice() in TrailblazeMcpBridgeImpl — the workaround that re-resolved node IDs to coordinates for on-device execution. No longer needed since refs are stable across capture contexts.
  • CoreTools.TAP_ON_ELEMENT_BY_NODE_ID constant.
  • Framework-level exclusion of TapTrailblazeTool for the ANDROID_ONDEVICE_INSTRUMENTATION driver.

Updated

  • TrailblazeToolSetCatalog: TapTrailblazeTool is now the only tap-by-element tool in the core set.
  • DirectMcpAgent system prompt — instructs the LLM to use tap with ref IDs instead of tapOnElementByNodeId with node IDs.
  • CoreTools — added TAP = "tap" constant, added to TAP_NAMES for classification.

Preserved

  • OtherTrailblazeTool catches any tapOnElementByNodeId references in old logs or test fixtures as generic unknown tools. No dedicated class needed for backward compat.

Future Flexibility

The ref format is an implementation detail of ElementRef.RefTracker. If a constrained environment ever needs integer-based refs, the generation strategy can change without touching TapTrailblazeTool — it just matches it.ref == ref on whatever string the tracker produces. The current content-hash approach is strictly better than positional indices because stability comes from the element’s identity, not its position in a traversal.
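To illustrate why the matching side is independent of the ref format, here is a minimal sketch of a lookup that compares only the pre-applied string. RefNode, findByRef, and sampleTree are hypothetical names for this example; the real TrailblazeNode and TapTrailblazeTool are assumed to carry more state than shown here.

```kotlin
// Hypothetical node shape: only the pre-applied ref and a label.
data class RefNode(val ref: String?, val label: String, val children: List<RefNode> = emptyList())

// Depth-first search for the first node whose stored ref equals the requested one.
// No hash is recomputed here; the string is compared as-is.
fun findByRef(root: RefNode, ref: String): RefNode? {
    if (root.ref == ref) return root
    for (child in root.children) {
        findByRef(child, ref)?.let { return it }
    }
    return null
}

// A small tree mixing a content-hashed ref and an integer-style ref.
fun sampleTree(): RefNode = RefNode(
    ref = null,
    label = "root",
    children = listOf(
        RefNode(ref = "y778", label = "Submit"),
        RefNode(ref = "17", label = "Cancel"),
    ),
)

fun main() {
    val tree = sampleTree()
    println(findByRef(tree, "y778")?.label) // Submit
    println(findByRef(tree, "17")?.label)   // Cancel
    println(findByRef(tree, "zzz"))         // null
}
```

The lookup behaves the same for "y778" and "17", which is the sense in which the generation strategy could change without touching the tap tool.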