How Duvo Recovers from Transient Problems
When a Run hits a problem that is likely temporary (a network timeout, an overloaded external service, a rate-limit response), Duvo automatically returns the Run to the queue and retries it. You do not need to do anything. The Run will appear as Running or briefly as Pending in your Runs List while the retry is in progress.
What counts as a transient problem:
- Connection timeouts and network errors (ETIMEDOUT, ECONNRESET, socket hang up)
- HTTP 429 Too Many Requests (rate limiting)
- HTTP 502, 503, 504 (gateway errors and temporary service unavailability)
- Brief overload on the Duvo infrastructure
When a Run Fails Permanently
Some failures cannot be recovered by retrying. Duvo marks the Run as Failed immediately in these cases:
| Failure type | What it means | What to do |
|---|---|---|
| Connection could not be resolved | A Connection the agent needs was disconnected or its credentials expired before the Run started | Go to Connections, re-authorize the affected Connection, then retry the Run |
| Run content is too large | The data being sent to the agent exceeds the system limit | Break the task into smaller batches in your AOP |
| Invalid input (HTTP 422) | The Run was started with data the agent cannot process | Fix the input data and start a new Run |
| Max retries exhausted | A transient error persisted through all automatic retry attempts | Check whether the external service is experiencing an outage; retry the Run once the service recovers |
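
The split between transient and permanent failures can be pictured as a bounded retry loop. The following is a minimal Python sketch, not Duvo's actual implementation: the `CallError` type, the `run_with_retries` helper, the attempt limit of 3, and the exponential backoff are all assumptions made for illustration. Only the lists of transient signals come from the text above.

```python
import time
from typing import Callable, Optional

# Signals the text above lists as transient. How Duvo classifies errors
# internally is not documented here; these sets only mirror that list.
TRANSIENT_HTTP = {429, 502, 503, 504}
TRANSIENT_NETWORK = {"ETIMEDOUT", "ECONNRESET", "socket hang up"}


class CallError(Exception):
    """Hypothetical error carrying an HTTP status or a network error code."""

    def __init__(self, status: Optional[int] = None, code: Optional[str] = None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def is_transient(err: CallError) -> bool:
    return err.status in TRANSIENT_HTTP or err.code in TRANSIENT_NETWORK


def run_with_retries(action: Callable[[], str], max_attempts: int = 3,
                     base_delay: float = 0.0) -> str:
    """Retry transient errors up to max_attempts; re-raise permanent ones at once.

    max_attempts and base_delay are assumed values; base_delay is 0 so the
    example runs instantly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except CallError as err:
            if not is_transient(err):
                raise  # e.g. HTTP 422: retrying cannot help
            if attempt == max_attempts:
                raise RuntimeError("Max retries exhausted") from err
            time.sleep(base_delay * 2 ** (attempt - 1))  # assumed exponential backoff
    raise ValueError("max_attempts must be at least 1")


# Usage: a call that fails twice with 503 and then succeeds.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise CallError(status=503)
    return "ok"

print(run_with_retries(flaky))  # succeeds on the third attempt
```

In this sketch an HTTP 422 surfaces immediately, while a persistent 503 ends as "Max retries exhausted", matching the last two rows of the table.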
Understanding Run Statuses
| Status | What is happening |
|---|---|
| Running | The agent is actively processing |
| Needs Input | The agent is waiting for a human response before it can continue |
| Postponed | A case has been deliberately scheduled for a later time |
| Completed | The agent finished and explicitly resolved the case or Run |
| Failed | The Run ended without completing — either an error occurred or the agent finished without marking the case as done |
| Stopped | A team member or the API stopped the Run manually |
AOP Patterns for Controlling Recovery Behavior
The agent follows your AOP instructions when deciding how to handle errors it encounters mid-run (as distinct from the system-level retry logic above, which handles failures before or during dispatch). Use these patterns to tell the agent how to behave when something goes wrong.
Retry an action a limited number of times
Fail fast on a specific error
Postpone and come back later
Use this when the condition needed to proceed may resolve on its own:
Escalate to a human instead of failing
Use HITL when you want a person to decide what to do rather than having the Run fail silently:
Skip Semantics: Skipped vs Failed vs Postponed
These three outcomes look similar but mean different things:
| Outcome | What triggers it | What happens next |
|---|---|---|
| Skipped step | The agent’s AOP includes a condition that was not met (e.g., “only send a notification if the amount is above $1,000”) | The Run continues; the skipped step simply does not execute |
| Failed | The Run ends without the agent completing, postponing, or handing over the case; or the agent explicitly marks the case as failed | The case moves to Failed status; no further processing until manually retried |
| Postponed | The agent explicitly postpones the case to a future time | The case moves to Postponed status and is automatically retried at the scheduled time |
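
The key point about a skipped step is that it is a step-level outcome: the Run keeps going. A small Python sketch of that behavior follows. The `Step` class and `run_steps` function are hypothetical names used only for illustration and do not describe how Duvo evaluates an AOP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    name: str
    # Optional guard; when it returns False the step is skipped, not failed.
    condition: Optional[Callable[[Dict], bool]] = None


def run_steps(steps: List[Step], case: Dict) -> List[Tuple[str, str]]:
    """Return (step name, 'executed' or 'skipped') for every step, in order."""
    log = []
    for step in steps:
        if step.condition is not None and not step.condition(case):
            log.append((step.name, "skipped"))
            continue  # a skipped step does not stop the Run
        log.append((step.name, "executed"))
    return log


steps = [
    Step("record invoice"),
    # Mirrors the example in the table: notify only above $1,000.
    Step("send notification", condition=lambda c: c["amount"] > 1000),
    Step("close case"),
]
print(run_steps(steps, {"amount": 250}))
# [('record invoice', 'executed'), ('send notification', 'skipped'), ('close case', 'executed')]
```

Failed and Postponed, by contrast, are outcomes of the whole case, so they would end this loop rather than continue it.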
When an operator rejects a HITL request
When an operator clicks Reject on an approval request, the agent receives a rejection signal. What happens next depends entirely on how your AOP handles it. Duvo does not automatically halt or retry the Run; your AOP must define the response. Common patterns:
When condition data is missing
If the agent reaches a conditional step but the data needed to evaluate the condition is absent:
Idempotency: Preventing Duplicate Actions on Retry
When an agent retries a step after an earlier attempt failed, there is a risk of performing an action twice: sending two emails, creating two records, charging a customer twice. Use these patterns to prevent that.
The check-before-write pattern
Always check whether the action has already been completed before performing it:
Tag actions with a reference ID
Use a stable identifier from your source data (PO number, order ID, ticket ID) to make each action uniquely identifiable:
Per-connection idempotency patterns
| Connection | Safe write pattern |
|---|---|
| Gmail | Before sending, search sent items for an email with the same subject and recipient. Send only if no match. |
| Google Sheets | Before appending a row, search the sheet for a row with the same identifier column. Append only if not found. |
| Snowflake | Use MERGE INTO ... WHEN NOT MATCHED THEN INSERT rather than plain INSERT to avoid duplicate rows. |
| Slack | When posting a follow-up, reply in an existing thread (thread_ts) rather than posting a new top-level message. |
| HubSpot | Use upsert (search by email or ID, then update or create) rather than creating a new contact each time. |
| Stripe | Pass the order reference or case ID as the idempotency key (the Idempotency-Key request header) when creating charges or payment intents. |
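
The two ideas above, check-before-write and tagging each action with a reference ID, combine naturally. Here is a minimal Python sketch under stated assumptions: the in-memory `ledger` set stands in for whatever lookup the Connection offers (a sent-items search, an identifier column, and so on), and `action_key`, `perform_once`, and the namespace string are hypothetical names, not part of Duvo or any Connection's API.

```python
import uuid
from typing import Callable, Set

# Hypothetical namespace so keys derived here do not collide with other uuid5 uses.
ACTION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "actions.example.invalid")


def action_key(reference_id: str, action: str) -> str:
    """Derive the same key every time for the same reference ID and action."""
    return str(uuid.uuid5(ACTION_NAMESPACE, f"{action}:{reference_id}"))


def perform_once(ledger: Set[str], reference_id: str, action: str,
                 do_action: Callable[[], None]) -> bool:
    """Check before write: run do_action only if its key is not in the ledger."""
    key = action_key(reference_id, action)
    if key in ledger:
        return False  # already done on an earlier attempt
    do_action()
    ledger.add(key)  # record only after the action succeeded
    return True


# Usage: the second call simulates a retry of the same step.
sent = []
ledger: Set[str] = set()
perform_once(ledger, "PO-1042", "send_confirmation", lambda: sent.append("PO-1042"))
perform_once(ledger, "PO-1042", "send_confirmation", lambda: sent.append("PO-1042"))
print(len(sent))  # 1
```

Because the key is derived from stable source data rather than generated per attempt, a retry produces the same key and the duplicate is detected. The check and the write are not atomic in this sketch, so where the target system offers a native mechanism (such as MERGE or an idempotency key), prefer it.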
Where to Look When Something Goes Wrong
Runs List
The Runs List (Past Runs in the sidebar) shows the status of every Run across all your assignments. Use the Needs attention quick filter to surface Failed and Needs Input Runs immediately. For each failed Run, the evaluation badge shows the severity of any issues found. Click the badge to see a breakdown of what went wrong and why.
Case timeline
For cases processed through Queue, open the case detail view to see the full processing history: which agent handled each stage, when it ran, and what the outcome was. If a case has been retried multiple times, each attempt appears in the timeline.
Reading a failure
Common failure patterns and what they indicate:
| Failure pattern | Likely cause | What to fix |
|---|---|---|
| Connection error early in the run | A Connection was disconnected or expired | Re-authorize the Connection |
| Timeout on an external API call | Rate limiting or the external service is slow | Add a retry instruction to your AOP or reduce batch size |
| Agent completed but case still Failed | AOP does not explicitly mark the case as completed | Add a completion step to your AOP |
| Same step fails on every attempt | A hard data error (missing field, wrong format) | Fix the input data or add a validation gate in the AOP |
| Evaluation shows repeated issues | The agent is consistently making the same mistake | Refine the AOP for the failing step |
Related
- Designing Human-in-the-Loop Workflows — Escalation patterns, fallbacks, and approval thresholds
- Queue — Case statuses, postpone behavior, and manual retry
- Runs List — Monitoring status across all agents