# StillRunning, agent install playbook

StillRunning monitors scheduled jobs, cron, AI agent runs, and background scripts. Register a "monitor" (workflow), wire a one-line ping on completion, and StillRunning alerts you if the job stops running, runs too long, or costs too much.

YOU (the coding agent) do the discovery. You scan the user's project and machine with your own file and shell tools. StillRunning gives you the API and this playbook to create, reconcile, wire, and verify monitors. StillRunning does NOT scan the user's filesystem. You do.

## Auth

All calls below use a key-authed API at https://stillrunning.ai/api/v1.

```
Authorization: Bearer sr_live_xxxxxxxxxxxxxxxxxxxxxxxx
```

The user creates an installer key at https://stillrunning.ai/app/settings and puts it in their environment as STILLRUNNING_API_KEY. A default installer key can create, read, and scan monitors, and send connectivity test pings. It deliberately CANNOT reveal the ping URLs of monitors it didn't just create, and it cannot rotate tokens. Ask the user for the key; never invent one.

## Non-negotiable trust rules

1. Ping URLs are secrets. The API returns a monitor's ping URL ONLY in the response that creates that monitor. Never log it anywhere public, never put it in a commit. Treat it like a password.
2. A test ping is NOT health. Testing the endpoint proves the wiring is reachable; it does NOT mean the real job ran. A monitor only becomes healthy when a real production run pings it.
3. Re-running this flow reconciles, it never duplicates. The same job maps to the same monitor across runs (the server derives a stable id from source + path + command).
4. NEVER edit the user's crontab, launchd plist, systemd unit, GitHub Actions, Vercel config, shell scripts, or agent wrappers without first SHOWING the exact proposed changes and getting an explicit yes. This is a hard stop.

## The flow (follow in order)

1. INSPECT. Use your own tools to read the user's scheduled work: crontab -l, ~/Library/LaunchAgents and /Library/LaunchAgents (launchd), systemctl list-timers (systemd), .github/workflows/*.yml (GitHub Actions cron), vercel.json crons, and any custom watchdog/agent scripts.
2. LIST. Show the user every job you found: name, what runs it, and its schedule.
3. DRY-RUN RECONCILE. Send the discovered jobs to the reconcile endpoint with dryRun:true. This computes the diff (new / changed / unchanged / already-monitored / missing) WITHOUT writing anything. Show the user the result: "I found N jobs, X new, Y unchanged, Z missing."
4. SHOW PROPOSED CHANGES. List exactly which files and commands you will edit to add the pings.
5. ASK FOR CONFIRMATION. Hard stop. Do not edit anything until the user says yes.
6. APPLY. Re-send the same monitors with dryRun:false. Capture the pingUrl returned for each NEW monitor in the response. That is the only time the URL is returned.
7. WIRE. Add the ping to each job using the correct fail-path so a monitoring outage can NEVER break the user's job (see wiring patterns below). Ping on success; ping ?event=fail on failure.
8. TEST. For each monitor, send a connectivity test ping by id. The monitor moves to "endpoint verified, waiting for first real run". It is NOT healthy yet. That is correct.
9. REPORT. Tell the user that each monitor is wired and endpoint-verified, that it will go healthy on its first real run, and that it will alert if the job ever stops.

## Endpoints

Dry-run / apply many jobs at once (the main path):

```
POST /api/v1/install/reconcile
body: {
  "source": "crontab",                      // optional label for this install batch
  "dryRun": true,                           // true = compute diff, write nothing
  "monitors": [
    {
      "name": "Nightly DB backup",          // required, human label
      "schedule": "0 3 * * *",              // cron, @daily/@hourly, "6h"/"30m", or raw seconds
      "sourceType": "crontab",              // crontab|launchd|systemd|github-actions|vercel-cron|agent-script|...
      "sourcePath": "/etc/crontab",         // where the job is defined (helps derive a stable id)
      "command": "/usr/local/bin/backup.sh", // the command; hashed into the monitor's stable id
      "description": "pg_dump to S3"        // optional
    }
  ]
}
response: {
  "sessionId": "...",
  "dryRun": true,
  "created": [ { "externalId": "...", "name": "...", "id"?: "...", "pingUrl"?: "..." } ],
  "updated": [ { "externalId": "...", "name": "...", "id": "..." } ],
  "unchanged": [ { "externalId": "...", "name": "...", "id": "..." } ],
  "rejected": [ { "name": "...", "error": "why it was rejected" } ],
  "missingFromLatestScan": [ { "externalId": "...", "name": "...", "id": "..." } ]
}
```

Notes: id + pingUrl appear on created[] entries ONLY when dryRun:false. A monitor that already exists (matched by its derived id) comes back in updated[]/unchanged[] with NO pingUrl. That is the trust rule, not a bug. Max 100 monitors per call. Always surface rejected[] to the user.

Create a single monitor:

```
POST /api/v1/workflows
body: { "name": "...", "schedule": "0 3 * * *", "sourceType": "crontab", "sourcePath": "...", "command": "..." }
-> 201 with "pingUrl" on a brand-new monitor; an existing match returns 200 with NO pingUrl.
```

List monitors (never returns ping URLs):

```
GET /api/v1/workflows
```

Connectivity test a monitor by id (after wiring):

```
POST /api/v1/workflows/{id}/test-ping
-> { "id", "workflow", "test": true, "verificationState": "endpoint_tested", "message": "Endpoint verified. Waiting for first real run." }
```

Test pings never make a monitor healthy and never fire alerts.

## Wiring patterns (always isolate the ping so it can't fail the job)

crontab: ping on success, ping ?event=fail on failure. The `|| true` keeps a failed success ping from triggering the fail ping:

```
0 3 * * * if /usr/local/bin/backup.sh; then curl -fsS --max-time 10 "PING_URL" || true; else curl -fsS --max-time 10 "PING_URL?event=fail" || true; fi
```

shell script: trap ERR, ping fail, then a success ping at the end (pings can't break the job):

```bash
#!/usr/bin/env bash
set -euo pipefail
trap 'curl -fsS --max-time 10 "PING_URL?event=fail" || true' ERR

# ... the real work ...

curl -fsS --max-time 10 "PING_URL" || true
```

GitHub Actions: add two steps, both continue-on-error so the ping never fails the workflow:

```yaml
- name: ping success
  if: success()
  continue-on-error: true
  run: curl -fsS --max-time 10 "PING_URL"
- name: ping fail
  if: failure()
  continue-on-error: true
  run: curl -fsS --max-time 10 "PING_URL?event=fail"
```

Node: try/catch, swallow ping errors, rethrow the job's error:

```js
try {
  await job();
  await fetch(PING_URL).catch(() => {});
} catch (e) {
  await fetch(PING_URL + '?event=fail').catch(() => {});
  throw e;
}
```

Python: try/except, swallow ping errors, re-raise the job's error:

```python
import urllib.request

def ping(u):
    try:
        urllib.request.urlopen(u, timeout=10)
    except Exception:
        pass

try:
    job()
    ping(PING_URL)
except Exception:
    ping(PING_URL + "?event=fail")
    raise
```

Replace PING_URL with the URL returned for that monitor at creation. If you don't have the URL (the monitor already existed), tell the user: the URL is only shown when a monitor is created, so either re-create it through the flow, or reveal it from the dashboard.

## Convenience: the MCP server

There is also a local MCP server, "stillrunning-mcp", that wraps these primitives (plan, create, wiring snippet, test ping) for a guided flow. It is optional. This curl playbook works on its own. See https://stillrunning.ai/docs/mcp.

Full docs: https://stillrunning.ai/docs/agent-onboarding
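
The dry-run and apply steps of the flow (steps 3 and 6) can be sketched in Python as follows. This is a minimal sketch, not an official client: the helper names `build_reconcile_body`, `post_reconcile`, `summarize`, and `redacted` are hypothetical, and sending a `Content-Type: application/json` header is an assumption, since the playbook only specifies the `Authorization` header.

```python
import json
import os
import urllib.request

API_BASE = "https://stillrunning.ai/api/v1"  # base URL from the Auth section


def build_reconcile_body(jobs, dry_run=True, source="crontab"):
    """Build the reconcile request body from the jobs found during INSPECT."""
    if len(jobs) > 100:
        # The playbook states a limit of 100 monitors per call.
        raise ValueError("reconcile accepts at most 100 monitors per call")
    return {"source": source, "dryRun": dry_run, "monitors": list(jobs)}


def post_reconcile(body, api_key=None):
    """POST the body to /install/reconcile and return the parsed response.

    The key comes from STILLRUNNING_API_KEY; it is never invented here.
    The Content-Type header is an assumption, not taken from the playbook.
    """
    api_key = api_key or os.environ["STILLRUNNING_API_KEY"]
    req = urllib.request.Request(
        API_BASE + "/install/reconcile",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(result, found):
    """Format the step 3 summary line from a reconcile response."""
    return "I found {} jobs, {} new, {} unchanged, {} missing.".format(
        found,
        len(result.get("created", [])),
        len(result.get("unchanged", [])),
        len(result.get("missingFromLatestScan", [])),
    )


def redacted(result):
    """Return a copy of the response with pingUrl removed from created[].

    Ping URLs are secrets, so this copy is the one to print or log.
    """
    safe = dict(result)
    safe["created"] = [
        {k: v for k, v in m.items() if k != "pingUrl"}
        for m in result.get("created", [])
    ]
    return safe
```

A typical run would call `post_reconcile(build_reconcile_body(jobs, dry_run=True))`, show the user `summarize(...)` plus any `rejected` entries, and only after an explicit yes repeat the call with `dry_run=False`, keeping each returned `pingUrl` in memory for wiring and printing only the `redacted(...)` copy.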