Why the First Week Is Different
During testing you controlled the inputs. In production, inputs are unpredictable: real emails, real files, real volumes. Assignments that worked perfectly in tests often surface new failure modes in the first week simply because real-world data is far more varied than your test cases. The goal of first-week monitoring is not to watch every Job in real time. It is to catch pattern changes early enough to act before they become problems, and to graduate off intensive monitoring once you have evidence that the assignment is stable.
Daily Checks: Day 1 to Day 7
Run through this list each day while the assignment is in its first week of production use. It takes about five to ten minutes.
Activity Inbox
Open the Activity Inbox and check for pending approval requests.
- Are there more pending requests than expected? A spike in Needs Input usually means the assignment is hitting an ambiguous case that the SOP does not resolve; see Spike in Needs Input below.
- Are requests sitting unanswered for more than a few hours? Unanswered requests pause the Jobs that created them. If your team is not seeing the notifications, adjust the escalation path in the SOP or your Slack/email notification setup.
Case Queue health (if applicable)
If the assignment uses Case Queue, open the Cases view and check the status breakdown for the last 24 hours.

| Metric | Healthy range | Watch for |
|---|---|---|
| Failed rate | Under 5% of cases processed | Any sudden spike above your baseline |
| Needs Input rate | Stable or declining over the week | Rising rate signals a SOP gap |
| Postponed rate | Stable | Rising rate may indicate an upstream system is slow |
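
The thresholds in the table can be checked with a few lines of arithmetic. The sketch below assumes you have copied the last 24 hours of case statuses into a plain Python list; the list, the function names, and the "Completed" status label are hypothetical and are not part of any product API.

```python
from collections import Counter


def case_status_rates(statuses):
    """Return each status's share of all cases as a fraction between 0 and 1."""
    total = len(statuses)
    if total == 0:
        return {}
    counts = Counter(statuses)
    return {status: count / total for status, count in counts.items()}


def failed_rate_is_healthy(statuses, threshold=0.05):
    """True when the Failed share is under the threshold (5% in the table above)."""
    return case_status_rates(statuses).get("Failed", 0.0) < threshold


# Hypothetical 24-hour sample: 100 cases, 3 of them Failed.
sample = (
    ["Completed"] * 90 + ["Failed"] * 3 + ["Needs Input"] * 5 + ["Postponed"] * 2
)
rates = case_status_rates(sample)
print(rates["Failed"])
print(failed_rate_is_healthy(sample))
```

Recording the same three rates each day gives you the baseline that the "Watch for" column refers to.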
Jobs List
Open the Jobs List (Past Jobs) and apply the Needs attention filter to see any Failed or Needs Input Jobs from the past 24 hours.
- For each failed Job, open it and note which step failed. Look for clustering: if five Jobs all fail at the same step, that is a SOP or Connection issue, not random noise.
- If a Job failed due to a Connection error, go to Connections, check whether the Connection is still authorized, and re-authorize if needed.
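
If you note the failing step for each Job as you go, the clustering check can be done mechanically. This is a minimal sketch that assumes hand-collected notes in a list of dicts; the `failed_step` field, the step names, and the `min_cluster` cutoff are illustrative assumptions.

```python
from collections import Counter


def failure_clusters(failed_jobs, min_cluster=3):
    """Return (step, count) pairs for steps where at least min_cluster Jobs failed."""
    counts = Counter(job["failed_step"] for job in failed_jobs)
    return [(step, n) for step, n in counts.most_common() if n >= min_cluster]


# Hypothetical notes taken while opening failed Jobs from the past 24 hours.
failed_jobs = [
    {"id": "job-1", "failed_step": "Update CRM record"},
    {"id": "job-2", "failed_step": "Update CRM record"},
    {"id": "job-3", "failed_step": "Parse email"},
    {"id": "job-4", "failed_step": "Update CRM record"},
    {"id": "job-5", "failed_step": "Update CRM record"},
    {"id": "job-6", "failed_step": "Update CRM record"},
]
print(failure_clusters(failed_jobs))
```

A step that appears in the result is the place to start; a step that failed only once is more likely noise.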
Cost per Job trend
In Team Insights, check the average cost per run for this assignment over the past 24 hours against your baseline from testing.
- A cost spike often means the assignment is making more tool calls than expected, which is common when the SOP does not give the assignment a clear stopping point and it keeps searching or retrying.
- A cost spike can also indicate a tool-call loop: the assignment calls a tool, gets an unexpected result, and tries again repeatedly. Open the high-cost Job from the Jobs List and scroll through the steps to find where calls are repeating.
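
The baseline comparison can be sketched as follows. It assumes you have written down per-Job costs by hand; the record layout, the Job ids, and the 1.5x tolerance are hypothetical choices, so pick a tolerance that matches how noisy your testing baseline was.

```python
from statistics import mean


def cost_spike(costs, baseline, tolerance=1.5):
    """Return (average, is_spike); is_spike is True when average > baseline * tolerance."""
    if not costs:
        return 0.0, False
    average = mean(costs)
    return average, average > baseline * tolerance


def high_cost_jobs(jobs, baseline, tolerance=1.5):
    """Return the ids of Jobs whose individual cost exceeds baseline * tolerance."""
    limit = baseline * tolerance
    return [job["id"] for job in jobs if job["cost"] > limit]


# Hypothetical costs for the past 24 hours, with a testing baseline of 0.10 per Job.
jobs = [
    {"id": "job-101", "cost": 0.10},
    {"id": "job-102", "cost": 0.12},
    {"id": "job-103", "cost": 0.50},
    {"id": "job-104", "cost": 0.48},
]
average, spiking = cost_spike([job["cost"] for job in jobs], baseline=0.10)
print(round(average, 2), spiking)
print(high_cost_jobs(jobs, baseline=0.10))
```

The ids returned by `high_cost_jobs` are the ones worth opening in the Jobs List to look for repeated calls.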
Output spot-check
Pick three to five completed Jobs at random and open them. Read the output the assignment produced.
- Does the output look correct? Would a human reviewer accept it?
- Are there patterns in what looks off? Edge cases cluster — if you find one, look for others like it.
Team Insights Signals to Watch
Open Team Insights and set the time period to Last 7 days. Use these signals to identify trends early.
Run volume vs expected
Compare the number of Jobs triggered against what you expected when you set up the schedule or trigger.
- Significantly fewer Jobs than expected may mean the trigger is not firing (check the trigger configuration) or Jobs are failing before they complete (check the Failed count).
- Significantly more Jobs than expected may mean the trigger is firing on events you did not intend. Check the trigger source configuration and consider adding a filter in the SOP.
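
"Significantly" is easier to apply consistently if you pick a tolerance in advance. The sketch below is one way to do that; the 25% band is an arbitrary assumption, not a product default.

```python
def volume_check(actual, expected, tolerance=0.25):
    """Classify run volume as 'low', 'high', or 'ok' relative to the expected count."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    ratio = actual / expected
    if ratio < 1 - tolerance:
        return "low"   # trigger may not be firing, or Jobs fail before completing
    if ratio > 1 + tolerance:
        return "high"  # trigger may be firing on unintended events
    return "ok"


# Hypothetical week: 100 Jobs expected from the schedule.
print(volume_check(40, 100))
print(volume_check(100, 100))
print(volume_check(180, 100))
```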
Source breakdown
The source breakdown shows where runs originate: manual, scheduled, or from a specific trigger.
- If the source mix changes unexpectedly (for example, a scheduled assignment suddenly shows a spike in manual runs), investigate whether team members are retriggering Jobs manually because they do not trust the automatic output.
Failure clusters
If the failure rate is above your baseline, look for clustering before making SOP changes.
- Single Connection failing repeatedly: the external service may be experiencing issues, or the Connection credentials may have expired. Check the third-party service status page and re-authorize the Connection if needed.
- Failures spread across multiple Connections: this usually points to a SOP logic issue — the assignment is attempting something in the wrong order or with the wrong data.
- Failures only on certain input types: your SOP needs to handle that input shape. Add a HITL gate to catch it while you refine the SOP.
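
The three cases above amount to a small triage rule. The sketch below encodes it under stated assumptions: the `connection` and `input_type` fields are hypothetical notes you would record per failed Job, and the 80% dominance cutoff is an arbitrary choice.

```python
from collections import Counter


def classify_failures(failed_jobs, dominance=0.8):
    """Suggest where to look first, based on what the failed Jobs have in common."""
    if not failed_jobs:
        return "no failures"
    total = len(failed_jobs)
    connection, count = Counter(j["connection"] for j in failed_jobs).most_common(1)[0]
    if count / total >= dominance:
        return f"check Connection: {connection}"
    input_type, count = Counter(j["input_type"] for j in failed_jobs).most_common(1)[0]
    if count / total >= dominance:
        return f"check SOP handling of input type: {input_type}"
    return "check SOP logic"


# Hypothetical failures: spread across Connections, but all on PDF inputs.
failed_jobs = [
    {"connection": "crm", "input_type": "pdf"},
    {"connection": "email", "input_type": "pdf"},
    {"connection": "storage", "input_type": "pdf"},
    {"connection": "crm", "input_type": "pdf"},
]
print(classify_failures(failed_jobs))
```

The output is only a starting point for investigation; it does not replace reading the failed steps themselves.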
HITL approval rate by branch
If the assignment has multiple approval gates, check whether one is generating significantly more requests than others. A single gate driving most of the Needs Input volume is usually the first SOP gap to fix.
Anomaly Response Patterns
Spike in Needs Input
What it means: The assignment is frequently reaching a decision point it is not confident about. It is asking for help rather than deciding autonomously.
How to respond:
- Open several recent Needs Input requests in the Activity Inbox and read them carefully.
- Identify what they have in common — the same type of question, the same step in the workflow, the same data condition.
- Update the SOP to handle that condition explicitly. Either give the assignment a rule to follow, or widen the HITL gate with a clearer decision framework so reviewers can respond consistently.
- If the condition genuinely should always require human approval, tighten the HITL gate description so reviewers understand what they are approving.
Spike in Failed Jobs
What it means: Jobs are ending without completing successfully. This requires immediate investigation.
How to respond:
- Open two or three failed Jobs from the Jobs List (Past Jobs in the sidebar) and read through the steps of each one to find where it failed.
- If they all fail at the same step, that step has a problem — a Connection issue, a data format mismatch, or a logic error in the SOP.
- If they fail at different steps, look for a common thread: the same Connection, the same input type, or the same time of day (which may indicate an external service outage).
- If the failure rate is high and you cannot immediately fix the root cause, consider pausing the assignment (disable the schedule or trigger) and communicating to your team while you investigate.
Cost spike
What it means: Individual Jobs are consuming significantly more than your baseline cost estimate.
How to respond:
- Open one of the high-cost Jobs from the Jobs List and scroll through the steps.
- Count how many tool calls the assignment made. A healthy Job typically makes a predictable number of calls. A high-cost Job often shows many repeated calls to the same tool — a sign of a loop.
- If you see a tool-call loop, update the SOP to give the assignment a clear exit condition. For example: “If you do not find the record after searching twice, stop and request HITL.”
- If the cost is high but the tool calls look reasonable, the inputs may be much larger than expected. Check whether you can batch the input or pre-filter it before the assignment processes it.
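
Counting repeated calls by eye gets tedious on a long Job. As a rough aid, the sketch below finds the longest run of back-to-back calls to the same tool in a list of tool names you have copied from a Job's steps. The tool names and the `max_repeats` limit are hypothetical; the limit of 2 mirrors the "searching twice" exit condition in the example above. It only detects consecutive repeats, not loops that alternate between tools.

```python
from itertools import groupby


def longest_repeat(tool_calls):
    """Return (tool, run_length) for the longest run of consecutive calls to one tool."""
    best_tool, best_length = None, 0
    for tool, run in groupby(tool_calls):
        length = len(list(run))
        if length > best_length:
            best_tool, best_length = tool, length
    return best_tool, best_length


def looks_like_loop(tool_calls, max_repeats=2):
    """True when any tool is called more than max_repeats times in a row."""
    return longest_repeat(tool_calls)[1] > max_repeats


# Hypothetical step sequence copied from a high-cost Job.
calls = ["search_records"] * 4 + ["request_hitl"]
print(longest_repeat(calls))
print(looks_like_loop(calls))
```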
Latency drift
What it means: Jobs are taking significantly longer to complete than during testing.
How to respond:
- Check whether the latency drift is consistent across all Jobs or isolated to specific runs.
- If isolated, the upstream system may have been slow at that time (check external service status pages or your own infrastructure logs).
- If consistent, compare the volume of data the assignment is processing in production versus testing. Real-world volumes are often larger. Update the SOP to process in smaller batches, or add a volume cap while you investigate.
- If Job duration is approaching your business process SLA, consider whether you need to adjust the schedule (run more frequently in smaller batches) or add a volume limit to the SOP.
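
The isolated-versus-consistent distinction can be made concrete with a median check. This sketch assumes you have noted Job durations in seconds; treating anything over 2x the testing baseline as slow is an arbitrary assumption.

```python
from statistics import median


def latency_drift(durations, baseline, factor=2.0):
    """Return 'none', 'isolated', or 'consistent' for a list of Job durations."""
    limit = baseline * factor
    if not any(d > limit for d in durations):
        return "none"
    # If the typical (median) Job is slow, the drift is not limited to a few runs.
    if median(durations) > limit:
        return "consistent"
    return "isolated"


# Hypothetical durations in seconds, against a 60-second testing baseline.
print(latency_drift([55, 62, 58, 400, 61], baseline=60))
print(latency_drift([150, 160, 170, 155], baseline=60))
```

An "isolated" result points toward checking upstream systems at the time of the slow run; a "consistent" result points toward data volume.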
Emergency Stop
If something is clearly going wrong and you need to stop the assignment immediately:
Disable the schedule or trigger
- Open the assignment.
- Click the Schedule button in the assignment header.
- Toggle the schedule off, or navigate to the Triggers tab and disable the trigger.
- Jobs already running will complete. No new Jobs will start.
Disconnect the Connection
- Go to Connections in the left sidebar.
- Find the Connection the assignment is using to write data.
- Disconnect it.
Revert to the last known-good revision
- Open the assignment.
- Use the revision selector in the builder toolbar to select the last known-good revision.
- Switch back to that version and re-enable the schedule or trigger.
- Monitor the next two or three Jobs manually before reducing your check frequency.
Post-Incident Template
When something goes wrong and you need to communicate or document it, use this template. To identify the root cause before filling it in: open one of the failed Jobs from the Jobs List and read the step where it failed. A Connection error usually points to an expired credential or an outage in the external service. Repeated failures on the same step with similar inputs point to a SOP logic gap. A failure on the very first step often means the input data was missing or malformed.
Graduating Off First-Week Mode
You can move from daily checks to a routine operational rhythm when all of the following are true for three consecutive days:
- Failed rate is under 5% and stable or declining.
- Needs Input rate is stable or declining.
- No cost or latency spikes.
- Output spot-checks look correct.
- No unresolved anomalies in the Activity Inbox.
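
The graduation criteria can be written down as a single check over your daily notes. This is a sketch under assumptions: the `DailyCheck` record and its field names are hypothetical, and "stable or declining" is interpreted here as "not higher than the previous day", which you may want to loosen.

```python
from dataclasses import dataclass


@dataclass
class DailyCheck:
    failed_rate: float            # fraction of Jobs that failed, 0 to 1
    needs_input_rate: float       # fraction of Jobs that ended in Needs Input
    cost_or_latency_spike: bool   # True if either kind of spike was seen that day
    spot_checks_ok: bool          # True if the sampled outputs looked correct
    unresolved_anomalies: int     # open items left in the Activity Inbox


def ready_to_graduate(days, max_failed_rate=0.05, window=3):
    """True when the most recent `window` days all meet the graduation criteria."""
    if len(days) < window:
        return False
    recent = days[-window:]
    for i, day in enumerate(recent):
        if day.failed_rate >= max_failed_rate:
            return False
        if day.cost_or_latency_spike or not day.spot_checks_ok:
            return False
        if day.unresolved_anomalies > 0:
            return False
        if i > 0:
            previous = recent[i - 1]
            if day.failed_rate > previous.failed_rate:
                return False
            if day.needs_input_rate > previous.needs_input_rate:
                return False
    return True


# Hypothetical notes for days 5 to 7.
days = [
    DailyCheck(0.04, 0.10, False, True, 0),
    DailyCheck(0.03, 0.10, False, True, 0),
    DailyCheck(0.03, 0.08, False, True, 0),
]
print(ready_to_graduate(days))
```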
Handoff to ongoing operational rhythm
When you graduate off first-week monitoring:
- Move to weekly rather than daily checks, using Team Insights to review the past seven days.
- Set up persistent alerts if your assignment uses a notification step in the SOP — for example, a Slack message after each run summarizing the outcome.
- Schedule a 30-day review with the assignment owner to reassess whether the SOP needs refinement based on accumulated production experience.
- Update the assignment runbook with anything you learned in the first week — owner contact, typical cost per Job, known edge cases, and the escalation path.
Related
- Promote to Production — The checklist and setup steps before going live
- Jobs List — Where to find failed and in-progress Jobs across all assignments
- Team Insights — Run volume, failure rates, and cost trends
- Activity Inbox — Where HITL requests land and how to respond
- Case Queue — Case statuses and failure rates for queue-based assignments
- Assignment Versions — Reviewing and reverting to a previous revision