# Retries, Failures, and Skipped Steps

This guide explains how Duvo handles things that go wrong during a Job — what the system retries automatically, what counts as a permanent failure, how to write SOPs that control recovery behavior, and how to prevent duplicate actions when a Job runs more than once.

***

## How Duvo Recovers from Transient Problems

When a Job hits a problem that is likely temporary — a network timeout, an overloaded external service, a rate-limit response — Duvo automatically returns the Job to the queue and retries it. You do not need to do anything. The Job will appear as **Running** or briefly as **Pending** in your Jobs List while the retry is in progress.

**What counts as a transient problem:**

* Connection timeouts and network errors (ETIMEDOUT, ECONNRESET, socket hang up)
* HTTP 429 Too Many Requests (rate limiting)
* HTTP 502, 503, 504 (gateway errors and temporary service unavailability)
* Brief overload on the Duvo infrastructure
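
The first three categories above can be expressed as a simple classifier. This is a minimal sketch for illustration only: the function and constant names are assumptions, not Duvo's internal implementation, and internal infrastructure overload is left out because it is not visible to a caller.

```python
from typing import Optional

# Assumed groupings that mirror the list above; not Duvo's actual code.
TRANSIENT_HTTP_STATUSES = {429, 502, 503, 504}
TRANSIENT_NETWORK_ERRORS = {"ETIMEDOUT", "ECONNRESET", "socket hang up"}


def is_transient(http_status: Optional[int] = None,
                 error_code: Optional[str] = None) -> bool:
    """Return True if the failure matches one of the transient categories."""
    if http_status is not None and http_status in TRANSIENT_HTTP_STATUSES:
        return True
    if error_code is not None and error_code in TRANSIENT_NETWORK_ERRORS:
        return True
    return False
```

Anything the classifier does not recognize, such as an HTTP 422, is treated as non-transient and would not be retried in this sketch.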

**How retries work:**

Duvo uses exponential backoff: the wait between attempts grows after each failed attempt, up to a maximum of 60 seconds per wait. Duvo retries up to a configured maximum number of attempts. If the problem resolves within that window, which is usually the case for transient errors, the Job continues normally. If it does not, the Job is marked **Failed** with the failure type "Max retries exhausted" (see the next section).

You will not see a Failed status for Jobs that recovered through automatic retry.
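
The retry behavior described in this section can be sketched as follows. Only the 60-second cap comes from this guide; the 1-second base delay, the doubling factor, and the names `TransientError`, `backoff_delays`, and `run_with_retries` are assumptions made for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   cap_seconds: float = 60.0) -> List[float]:
    """Delay before each retry: doubles every attempt, never above the cap."""
    return [min(cap_seconds, base_seconds * 2 ** attempt)
            for attempt in range(max_attempts)]


def run_with_retries(operation: Callable[[], T], max_attempts: int,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is used up."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: the caller sees the failure
            sleep(delays[attempt])
    raise ValueError("max_attempts must be at least 1")
```

With these assumed defaults, `backoff_delays(8)` yields waits of 1, 2, 4, 8, 16, 32, 60, and 60 seconds. Passing `sleep` as a parameter keeps the sketch testable without real waiting.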

***

## When a Job Fails Permanently

Some failures cannot be recovered by retrying. Duvo marks the Job as **Failed** in the cases below: immediately for the first three, and after the final automatic retry attempt for the last.

| Failure type                         | What it means                                                                                        | What to do                                                                                            |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **Connection could not be resolved** | A Connection the assignment needs was disconnected or its credentials expired before the Job started | Go to Connections, re-authorize the affected Connection, then retry the Job                           |
| **Job content is too large**         | The data being sent to the assignment exceeds the system limit                                       | Break the task into smaller batches in your SOP                                                       |
| **Invalid input (HTTP 422)**         | The Job was started with data the assignment cannot process                                          | Fix the input data and start a new Job                                                                |
| **Max retries exhausted**            | A transient error persisted through all automatic retry attempts                                     | Check whether the external service is experiencing an outage; retry the Job once the service recovers |

You can manually retry a failed Job from the **Jobs List** — click the row to open the Job, then use the retry action. For case-based workflows, see [Retrying and Updating Cases](/assignment-features/case-queue.md#retrying-and-updating-cases).

***

## Understanding Job Statuses

| Status          | What is happening                                                                                                       |
| --------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Running**     | The assignment is actively processing                                                                                   |
| **Needs Input** | The assignment is waiting for a human response before it can continue                                                   |
| **Postponed**   | A case has been deliberately scheduled for a later time                                                                 |
| **Completed**   | The assignment finished and explicitly resolved the case or Job                                                         |
| **Failed**      | The Job ended without completing — either an error occurred or the assignment finished without marking the case as done |
| **Stopped**     | A team member or the API stopped the Job manually                                                                       |

A Job that ends without the assignment explicitly completing, postponing, or handing over the underlying case is automatically marked **Failed**. This means every SOP that processes cases should clearly instruct the assignment when and how to resolve each case.
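
The rule above can be written as a small predicate. This is a sketch under stated assumptions: the action names are invented for illustration and may not match how Duvo records case actions internally.

```python
from typing import Iterable

# Assumed action names; only these three resolve a case in this sketch.
RESOLVING_ACTIONS = {"complete", "postpone", "hand_over"}


def job_ends_failed(case_actions: Iterable[str]) -> bool:
    """Return True when no resolving action was taken before the Job ended."""
    return not any(action in RESOLVING_ACTIONS for action in case_actions)
```

For example, a run that only sent an email (`["send_email"]`) ends as Failed under this rule, while `["send_email", "complete"]` does not. This is why an SOP that does the work but never resolves the case still produces a Failed Job.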

***

## SOP Patterns for Controlling Recovery Behavior

The assignment follows your SOP instructions when deciding how to handle errors it encounters mid-run. This is separate from the system-level retry logic described above, which handles failures before or during dispatch. Use these patterns to tell the assignment how to behave when something goes wrong.

### Retry an action a limited number of times

```
If sending the email fails, wait 30 seconds and try again.
Retry up to 3 times before marking the case as failed with the reason.
```

### Fail fast on a specific error

```
If the customer record cannot be found in Salesforce, stop immediately.
Mark the case as failed with the note: "Customer not found — requires manual lookup."
Do not attempt to create any records.
```

### Postpone and come back later

Use this when the condition needed to proceed may resolve on its own:

```
If the report file is not yet available in Google Drive, postpone the case for 2 hours.
On the next attempt, check again before proceeding.
```

When the assignment postpones a case, the case status changes to **Postponed** and shows the scheduled retry time. At that time, Duvo automatically picks the case up again.

### Escalate to a human instead of failing

Use human-in-the-loop (HITL) review when you want a person to decide what to do rather than having the Job fail silently:

```
If the invoice total is outside the expected range ($0–$50,000), do not process it.
Request human review. Title: "Invoice out of range — [vendor] — $[amount]".
Description: include the full extracted invoice data and the specific value that triggered this check.
Wait for the human to either approve with corrections or reject.
```

See [Designing Human-in-the-Loop Workflows](/assignment-features/hitl-design.md) for guidance on choosing between approval gates and automatic failures.

***

## Skip Semantics: Skipped vs Failed vs Postponed

These three outcomes look similar but mean different things:

| Outcome          | What triggers it                                                                                                                            | What happens next                                                                         |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Skipped step** | The assignment's SOP includes a condition that was not met (e.g., "only send a notification if the amount is above $1,000")                 | The Job continues; the skipped step simply does not execute                               |
| **Failed**       | The Job ends without the assignment completing, postponing, or handing over the case; or the assignment explicitly marks the case as failed | The case moves to **Failed** status; no further processing until manually retried         |
| **Postponed**    | The assignment explicitly postpones the case to a future time                                                                               | The case moves to **Postponed** status and is automatically retried at the scheduled time |

### When an operator rejects a HITL request

When an operator clicks **Reject** on an approval request, the assignment receives a rejection signal. What happens next depends entirely on how your SOP handles it. Duvo does not automatically halt or retry the Job — your SOP must define the response.

Common patterns:

```
If the approval is rejected, ask the operator to choose: (a) revise and resubmit, or (b) cancel this item.
```

```
If rejected twice in a row for the same case, stop processing and mark the case as failed
with the note: "Halted after two rejections — requires manual review."
```

Without explicit instructions for the rejection path, the assignment may loop, stall, or make an unsafe assumption. Always define what a rejection means for your workflow.
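
To make the "two rejections" pattern concrete, here is a minimal sketch of the decision logic it describes. The decision strings, return values, and function name are hypothetical; in practice the assignment follows the SOP text rather than code like this.

```python
from typing import Iterable


def handle_review_decisions(decisions: Iterable[str],
                            max_consecutive_rejections: int = 2) -> str:
    """Walk operator decisions in order and return the resulting next step."""
    consecutive_rejections = 0
    for decision in decisions:
        if decision == "approve":
            return "continue"  # approval ends the review loop
        if decision == "reject":
            consecutive_rejections += 1
            if consecutive_rejections >= max_consecutive_rejections:
                return "mark_failed"  # halt and leave for manual review
        else:
            raise ValueError(f"Unknown decision: {decision!r}")
    return "awaiting_operator"  # no final decision yet
```

Every input sequence maps to exactly one defined next step, which is the property the rejection path in your SOP should have.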

### When condition data is missing

If the assignment reaches a conditional step but the data needed to evaluate the condition is absent, your SOP should say what to do. One option is to define a safe default:

```
If the vendor's payment terms are not specified in their record, treat them as Net 30
and continue. Add a note to the case: "Payment terms defaulted to Net 30 — verify with vendor."
```

Or escalate instead of guessing:

```
If the customer ID is missing from the email, request HITL before proceeding.
Do not guess the customer — an incorrect association will create a bad record.
```

***

## Idempotency: Preventing Duplicate Actions on Retry

When an assignment retries a step after an earlier attempt failed, there is a risk of performing an action twice — sending two emails, creating two records, charging a customer twice. Use these patterns to prevent that.

### The check-before-write pattern

Always check whether the action has already been completed before performing it:

```
Before creating the invoice in NetSuite, search for an existing invoice
with the same PO reference number and vendor. If one already exists, skip
creation and log: "Invoice already exists — [invoice ID]. Skipping."
Only create a new record if none is found.
```
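
The same check-before-write logic, sketched in code. `InvoiceStore` is a hypothetical in-memory stand-in for the external system, and the invoice ID format is invented for the example; neither reflects a real NetSuite API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class InvoiceStore:
    """In-memory stand-in for an external system (hypothetical)."""
    invoices: Dict[Tuple[str, str], str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def find(self, po_reference: str, vendor: str) -> Optional[str]:
        """Look up an existing invoice by PO reference and vendor."""
        return self.invoices.get((po_reference, vendor))

    def create_if_absent(self, po_reference: str, vendor: str) -> str:
        """Create the invoice only if no matching one exists."""
        existing = self.find(po_reference, vendor)
        if existing is not None:
            self.log.append(f"Invoice already exists: {existing}. Skipping.")
            return existing
        invoice_id = f"INV-{len(self.invoices) + 1:04d}"
        self.invoices[(po_reference, vendor)] = invoice_id
        return invoice_id
```

Calling `create_if_absent` twice with the same PO reference and vendor returns the same invoice ID and creates only one record. Note that a separate check followed by a write is not atomic: two runs executing at the same moment could both pass the check. Where the target system offers a unique constraint, an upsert, or an idempotency key, prefer that over a manual search.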

### Tag actions with a reference ID

Use a stable identifier from your source data (PO number, order ID, ticket ID) to make each action uniquely identifiable:

```
When sending the confirmation email, set the subject to include the order reference:
"Order confirmed — [PO-2024-0847]". If this email has already been sent (check the
sent items for a subject matching this reference), skip sending and log that the
confirmation was already delivered.
```

### Per-connection idempotency patterns

| Connection        | Safe write pattern                                                                                               |
| ----------------- | ---------------------------------------------------------------------------------------------------------------- |
| **Gmail**         | Before sending, search sent items for an email with the same subject and recipient. Send only if no match.       |
| **Google Sheets** | Before appending a row, search the sheet for a row with the same identifier column. Append only if not found.    |
| **Snowflake**     | Use `MERGE INTO ... WHEN NOT MATCHED THEN INSERT` rather than plain `INSERT` to avoid duplicate rows.            |
| **Slack**         | When posting a follow-up, reply in an existing thread (`thread_ts`) rather than posting a new top-level message. |
| **HubSpot**       | Use `upsert` (search by email or ID, then update or create) rather than creating a new contact each time.        |
| **Stripe**        | Pass the order reference or case ID in the `Idempotency-Key` header when creating charges or payment intents.    |
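
Where a connection accepts an idempotency key, the key must be identical on every retry of the same action. One way to get that, shown here as a sketch rather than a Duvo feature, is to derive the key deterministically from the case ID and the action name. The namespace value and the function name are assumptions.

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as it never changes.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def idempotency_key(case_id: str, action: str) -> str:
    """Derive a stable key so retries of the same action reuse the same value."""
    return str(uuid.uuid5(KEY_NAMESPACE, f"{action}:{case_id}"))
```

Because `uuid.uuid5` is a pure function of its inputs, a retried run computes the same key as the original attempt, while a different action on the same case gets a different key.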

***

## Where to Look When Something Goes Wrong

### Jobs List

The **Jobs List** (Past Jobs in the sidebar) shows the status of every Job across all your assignments. Use the **Needs attention** quick filter to surface Failed and Needs Input Jobs immediately.

For each failed Job, the evaluation badge shows the severity of any issues found. Click the badge to see a breakdown of what went wrong and why.

### Case timeline

For cases processed through Case Queue, open the case detail view to see the full processing history — which assignment handled each stage, when it ran, and what the outcome was. If a case has been retried multiple times, each attempt appears in the timeline.

### Run Debugger

The [Run Debugger](/admin-and-troubleshooting/run-debugger.md) (Admin section) lets you trace a specific Job step by step: every Connection call, every tool result, every decision point. Use this when you need to understand exactly what the assignment did and where it went wrong.

When investigating a retry situation, look at:

1. The most recent run for the Job — this shows the final outcome.
2. Any earlier runs for the same case — these show what the assignment attempted before the current one.
3. The step where the failure first occurred — this is where to focus your SOP changes.

### Reading a failure

Common failure patterns and what they indicate:

| What you see in the debugger               | Likely cause                                           | What to fix                                              |
| ------------------------------------------ | ------------------------------------------------------ | -------------------------------------------------------- |
| Connection error early in the run          | A Connection was disconnected or expired               | Re-authorize the Connection                              |
| Timeout on an external API call            | Rate limiting or the external service is slow          | Add a retry instruction to your SOP or reduce batch size |
| Assignment completed but case still Failed | SOP does not explicitly mark the case as completed     | Add a completion step to your SOP                        |
| Same step fails on every attempt           | A hard data error (missing field, wrong format)        | Fix the input data or add a validation gate in the SOP   |
| Evaluation shows repeated issues           | The assignment is consistently making the same mistake | Refine the SOP for the failing step                      |

***

## Related

* [Designing Human-in-the-Loop Workflows](/assignment-features/hitl-design.md) — Escalation patterns, fallbacks, and approval thresholds
* [Case Queue](/assignment-features/case-queue.md) — Case statuses, postpone behavior, and manual retry
* [Run Debugger](/admin-and-troubleshooting/run-debugger.md) — Step-by-step trace of a Job's execution
* [Jobs List](/running-assignments/jobs-list.md) — Monitoring status across all assignments
* [Common Issues](/admin-and-troubleshooting/common-issues.md) — Solutions for frequently encountered problems

