Trailblaze Decision 021: AI Fallback

Context

A core value proposition of Trailblaze is that natural language is always the source of truth for test definitions. As described in Decision 002, trail recordings (.trail.yaml files) are an optimization—they capture successful executions as deterministic tool sequences that can replay without LLM involvement, reducing costs and ensuring consistency.

However, recordings are inherently tied to the application state at the time they were captured. When the application changes—a new onboarding popup appears, button text is updated, a feature flag changes the UI flow—recorded tool calls may fail. Rather than treating this as an immediate test failure, Trailblaze can leverage the natural language source of truth to attempt recovery.

This is AI Fallback: when recorded steps fail, Trailblaze falls back to AI interpretation of the natural language steps, allowing tests to navigate through UI inconsistencies and complete successfully.

Decision

Trailblaze implements AI Fallback as a configurable execution feature that re-interprets natural language steps when recorded tool calls fail, distinguishing these recoveries with a specific test result status.

Natural Language as Source of Truth

Every Trailblaze test is defined by natural language steps:

- prompts:
    - step: Launch the app and sign in with user@example.com
    - step: Navigate to Settings
    - step: Verify the account email is displayed

These steps represent the intent of the test. A recording captures one way to accomplish that intent:

- prompts:
    - step: Navigate to Settings
      recording:
        tools:
          - tapOnElementWithAccessibilityText:
              accessibilityText: Settings
          - waitForElementWithText:
              text: Account Settings

When the recording fails (e.g., the “Settings” button was renamed to “Preferences”), the natural language step “Navigate to Settings” still clearly describes what should happen. AI Fallback uses this to recover.

How AI Fallback Works

1. Recorded execution begins: Trailblaze executes the recorded tool calls for each step
2. Tool call fails: A tool call returns an error (element not found, assertion failed, timeout, etc.)
3. Fallback triggered: Instead of failing immediately, Trailblaze switches to AI mode for the current step
4. LLM interprets step: The natural language step is sent to the LLM, which analyzes the current screen state and determines the appropriate actions
5. Execution continues: If the LLM successfully completes the step, execution proceeds to the next step (which may continue in recorded or fallback mode depending on configuration)
6. Result marked: The test result is marked with a distinct status indicating AI Fallback was used
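
The sequence above can be sketched for a single step as follows. This is a minimal illustration, not Trailblaze's implementation: `Step`, `ToolCallError`, `run_step`, and the `interpret_with_llm` callback are hypothetical names, and the LLM is stubbed as a function that reports whether it completed the step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ToolCallError(Exception):
    """Hypothetical error raised by a recorded tool call (element not found, timeout, ...)."""


@dataclass
class Step:
    text: str  # the natural language step, e.g. "Navigate to Settings"
    recorded_tools: List[Callable[[], None]] = field(default_factory=list)


def run_step(step: Step, ai_fallback: bool,
             interpret_with_llm: Callable[[str], bool]) -> str:
    """Return 'recorded', 'fallback', or 'failed' for one step that has a recording."""
    try:
        for tool in step.recorded_tools:
            tool()  # replay each recorded tool call in order
        return "recorded"
    except ToolCallError:
        if not ai_fallback:
            return "failed"  # fallback disabled: the failure is final
        # Fallback: hand the natural language step to the (stubbed) LLM.
        return "fallback" if interpret_with_llm(step.text) else "failed"


def tap_settings() -> None:
    # Simulates the renamed button: the recorded selector no longer matches.
    raise ToolCallError("no element with accessibilityText 'Settings'")


step = Step("Navigate to Settings", [tap_settings])
llm_stub = lambda text: True  # pretend the LLM always completes the step
print(run_step(step, ai_fallback=True, interpret_with_llm=llm_stub))   # fallback
print(run_step(step, ai_fallback=False, interpret_with_llm=llm_stub))  # failed
```

The returned mode is what a caller would use to mark the overall result with a fallback-specific status.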

Configuration Options

AI Fallback can be enabled or disabled based on execution context:

| Configuration | Behavior |
|---|---|
| `aiFallback: enabled` | When recorded steps fail, fall back to AI interpretation |
| `aiFallback: disabled` | Recorded step failures immediately fail the test |

When to enable fallback:

- CI pipelines where test stability is prioritized over strict determinism
- Tests running against frequently-changing areas of the application
- Environments where minor UI inconsistencies are expected (e.g., feature flags, A/B tests)

When to disable fallback:

- Recording new trails (fallback would mask recording issues)
- Validating that recordings are up-to-date
- Performance-critical pipelines where LLM latency is unacceptable
- Debugging specific recording failures

Test Result Statuses

AI Fallback introduces a distinct test result status to provide visibility into how tests succeeded:

| Status | Description |
|---|---|
| `PASSED` | Test succeeded using recordings only (no AI involvement) |
| `PASSED_WITH_AI_FALLBACK` | Test succeeded, but one or more steps required AI fallback |
| `PASSED_AI_MODE` | Test ran entirely in AI mode (no recording or recording intentionally skipped) |
| `FAILED` | Test failed (even after AI fallback attempts, if enabled) |
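
One way to read the table is as a function from per-step outcomes to a test status. The sketch below assumes a hypothetical `StepOutcome` enum and `derive_status` helper; neither is part of Trailblaze. The table does not say how a test that mixes recorded steps with intentionally AI-driven steps is reported, so the final branch (reporting `PASSED`) is an assumption made only to keep the example total.

```python
from enum import Enum
from typing import Iterable


class StepOutcome(Enum):
    RECORDED = "recorded"  # recording replayed successfully
    FALLBACK = "fallback"  # recording failed, AI recovered the step
    AI = "ai"              # step ran in AI mode by design (no recording used)
    FAILED = "failed"      # step failed, with or without a fallback attempt


def derive_status(outcomes: Iterable[StepOutcome]) -> str:
    """Map per-step outcomes to one of the four statuses in the table."""
    outcomes = list(outcomes)
    if StepOutcome.FAILED in outcomes:
        return "FAILED"
    if StepOutcome.FALLBACK in outcomes:
        return "PASSED_WITH_AI_FALLBACK"
    if outcomes and all(o is StepOutcome.AI for o in outcomes):
        return "PASSED_AI_MODE"
    # Assumption: anything else (including a recorded/AI mix) is reported as PASSED.
    return "PASSED"


print(derive_status([StepOutcome.RECORDED, StepOutcome.FALLBACK]))  # PASSED_WITH_AI_FALLBACK
```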

The `PASSED_WITH_AI_FALLBACK` status is critical for several reasons:

- Recording staleness detection: A high rate of fallback-assisted passes indicates recordings need updating
- Pipeline health monitoring: Teams can track fallback usage over time and set thresholds
- Debugging context: When investigating test behavior, knowing fallback was used helps explain differences from expected execution
- Cost awareness: AI fallback incurs LLM costs; tracking helps with budget planning
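
As an illustration of threshold-based monitoring, the sketch below computes the share of recording-based passes that needed fallback. The function names and the 20% default threshold are hypothetical; the right threshold depends on the team and is not specified by this decision.

```python
from collections import Counter
from typing import Iterable


def fallback_rate(statuses: Iterable[str]) -> float:
    """Fraction of recording-based passes that needed AI fallback."""
    counts = Counter(statuses)
    recording_passes = counts["PASSED"] + counts["PASSED_WITH_AI_FALLBACK"]
    if recording_passes == 0:
        return 0.0  # nothing to measure yet
    return counts["PASSED_WITH_AI_FALLBACK"] / recording_passes


def recordings_look_stale(statuses: Iterable[str], threshold: float = 0.2) -> bool:
    """True when the fallback rate exceeds the (assumed) alerting threshold."""
    return fallback_rate(statuses) > threshold


runs = ["PASSED", "PASSED", "PASSED", "PASSED_WITH_AI_FALLBACK", "FAILED"]
print(fallback_rate(runs))          # 0.25
print(recordings_look_stale(runs))  # True
```

`FAILED` and `PASSED_AI_MODE` runs are left out of the denominator here because they say nothing about whether a recording replayed cleanly; a team could reasonably choose a different denominator.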

Interaction with Step-Level Recordability

As noted in Decision 002, individual steps can be marked `recordable: false` to always use AI interpretation:

- step: Verify the total matches the expected value
  recordable: false  # Always uses AI

AI Fallback is different—it applies to steps that have recordings but whose recordings fail at runtime. The two features are complementary:

- `recordable: false`: Intentionally always use AI (design decision)
- AI Fallback: Gracefully recover when recordings unexpectedly fail (resilience mechanism)
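
The distinction can be expressed as a small decision function. This is a sketch under assumptions: `execution_plan` and its return labels are hypothetical, and a step without any recording is assumed to run in AI mode.

```python
from typing import Optional, Sequence


def execution_plan(recordable: bool,
                   recording: Optional[Sequence[str]],
                   ai_fallback: bool) -> str:
    """Decide how a step is executed, given its flags and the run configuration."""
    if not recordable or not recording:
        return "ai"                # design decision (or no recording): always AI
    if ai_fallback:
        return "recorded_then_ai"  # replay first, recover with AI on failure
    return "recorded_only"         # replay only; a failure fails the test


# `recordable: false` wins even if a recording exists and fallback is disabled.
print(execution_plan(False, ["tapOnElementWithAccessibilityText"], ai_fallback=False))  # ai
```

Note that `ai_fallback` only matters on the second branch: it changes what happens after a recording fails, not whether the recording is tried.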

Fallback Scope and Continuation

When AI Fallback is triggered for a step:

- Step scope: The LLM re-interprets only the failing step, not the entire test
- Screen context: The LLM receives the current screen state (screenshot, view hierarchy)
- Continuation: After successful fallback, the next step attempts recorded execution first (if available)
- Cascading fallback: If subsequent recorded steps also fail, fallback is triggered for each independently

This step-by-step approach minimizes LLM usage while maximizing recovery opportunities.
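
To make the scoping rules concrete, the sketch below runs a whole trail and counts how often fallback is invoked. Each step is reduced to a pair of its text and a boolean saying whether its recording replays cleanly; that simplification, the `run_trail` name, and the choice to stop at the first unrecoverable step are all assumptions for illustration.

```python
from typing import Callable, List, Tuple

# (step text, does the recording replay cleanly?) stands in for real tool calls.
Trail = List[Tuple[str, bool]]


def run_trail(trail: Trail,
              llm_completes: Callable[[str], bool]) -> Tuple[List[str], int]:
    """Run each step recorded-first; return per-step modes and fallback count."""
    modes: List[str] = []
    fallback_invocations = 0
    for text, recording_ok in trail:
        if recording_ok:
            modes.append("recorded")  # no LLM involved for this step
            continue
        fallback_invocations += 1     # fallback is scoped to this step only
        if not llm_completes(text):
            modes.append("failed")
            break                     # assumption: remaining steps are not run
        modes.append("fallback")      # the next step tries its own recording again
    return modes, fallback_invocations


trail = [
    ("Launch the app", True),
    ("Navigate to Settings", False),  # stale recording
    ("Verify the account email is displayed", True),
    ("Sign out", False),              # a second, independent failure
]
print(run_trail(trail, llm_completes=lambda text: True))
# (['recorded', 'fallback', 'recorded', 'fallback'], 2)
```

Only the two stale steps reach the LLM; the steps between them replay from their recordings.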

Example Scenario

Consider a test with this step:

- step: Dismiss any promotional popups and navigate to the main screen
  recording:
    tools:
      - waitForElementWithText:
          text: Welcome to MyApp
      - tapOnElementWithAccessibilityText:
          accessibilityText: Home

Without AI Fallback: If a new “What’s New” popup appears before the Welcome screen, the `waitForElementWithText` call fails, and the test fails immediately.

With AI Fallback: The tool call fails, fallback is triggered, the LLM sees the “What’s New” popup, dismisses it, then proceeds to navigate to the main screen. The test passes with `PASSED_WITH_AI_FALLBACK` status.

Consequences

Positive:

- Tests are more resilient to minor UI changes, reducing flakiness
- Natural language remains the authoritative test definition, with recordings as an optimization
- Clear visibility into fallback usage enables informed decisions about recording maintenance
- Teams can balance determinism and resilience based on their specific needs
- Recordings can remain valid longer, reducing maintenance burden

Negative:

- AI Fallback incurs LLM costs when triggered
- Fallback-assisted passes may mask recordings that need updating if not monitored
- Execution time increases when fallback is triggered (LLM latency)
- Test behavior may vary slightly between recorded and fallback execution paths
- Requires monitoring and alerting on fallback rates to maintain recording health