# StillRunning: agent install playbook

StillRunning monitors scheduled work: cron jobs, AI agent runs, and background scripts. Register a
"monitor" (workflow), wire a one-line ping on completion, and StillRunning alerts you if the job
stops running, runs too long, or costs too much.

YOU (the coding agent) do the discovery. You scan the user's project and machine with your own file
and shell tools. StillRunning gives you the API and this playbook to create, reconcile, wire, and
verify monitors. StillRunning does NOT scan the user's filesystem. You do.

## Auth

All calls below go to the key-authenticated API at https://stillrunning.ai/api/v1. Send the key as
a bearer token on every request:

  Authorization: Bearer sr_live_xxxxxxxxxxxxxxxxxxxxxxxx

The user creates an installer key at https://stillrunning.ai/app/settings and puts it in their
environment as STILLRUNNING_API_KEY. A default installer key can create, read, and scan monitors,
and send connectivity test pings. It deliberately CANNOT reveal the ping URLs of monitors it
didn't just create, and cannot rotate tokens. Ask the user for the key; never invent one.
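
A minimal Python sketch of the auth rule above, using only the standard library. The helper names
(build_request, call) are hypothetical, and sending the body as JSON with a Content-Type header is
an assumption, not something this playbook specifies. The key is read from STILLRUNNING_API_KEY and
the helper refuses to continue without it rather than inventing one.

```python
import json
import os
import urllib.request

API_BASE = "https://stillrunning.ai/api/v1"  # base URL from this playbook


def build_request(method, path, body=None, env=os.environ):
    """Build (but do not send) an authenticated request to the StillRunning API."""
    key = env.get("STILLRUNNING_API_KEY")
    if not key:
        # Never invent a key: stop and ask the user to create an installer key.
        raise RuntimeError("STILLRUNNING_API_KEY is not set; ask the user for an installer key")
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(API_BASE + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {key}")
    if data is not None:
        # Assumption: the API accepts JSON request bodies.
        req.add_header("Content-Type", "application/json")
    return req


def call(method, path, body=None, timeout=30):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(method, path, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, call("GET", "/workflows") would list monitors, provided the key is set in the environment.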

## Non-negotiable trust rules

1. Ping URLs are secrets. The API returns a ping URL ONLY in the response that creates the
   monitor. Never log it anywhere public, never put it in a commit. Treat it like a password.
2. A test ping is NOT health. Testing the endpoint proves the wiring is reachable; it does NOT mean
   the real job ran. A monitor only becomes healthy when a real production run pings it.
3. Re-running this flow reconciles, it never duplicates. The same job maps to the same monitor
   across runs (the server derives a stable id from source + path + command).
4. NEVER edit the user's crontab, launchd plist, systemd unit, GitHub Actions, Vercel config,
   shell scripts, or agent wrappers without first SHOWING the exact proposed changes and getting an
   explicit yes. This is a hard stop.

## The flow (follow in order)

1. INSPECT. Use your own tools to read the user's scheduled work: crontab -l, ~/Library/LaunchAgents
   and /Library/LaunchAgents (launchd), systemctl list-timers (systemd), .github/workflows/*.yml
   (GitHub Actions cron), vercel.json crons, and any custom watchdog/agent scripts.
2. LIST. Show the user every job you found: name, what runs it, and its schedule.
3. DRY-RUN RECONCILE. Send the discovered jobs to the reconcile endpoint with dryRun:true. This
   computes the diff (new / changed / unchanged / already-monitored / missing) WITHOUT writing
   anything. Show the user the result: "I found N jobs, X new, Y unchanged, Z missing."
4. SHOW PROPOSED CHANGES. List exactly which files and commands you will edit to add the pings.
5. ASK FOR CONFIRMATION. Hard stop. Do not edit anything until the user says yes.
6. APPLY. Re-send the same monitors with dryRun:false. Capture the pingUrl returned for each NEW
   monitor in the response. That is the only time the URL is returned.
7. WIRE. Add the ping to each job using the correct fail-path so a monitoring outage can NEVER
   break the user's job (see wiring patterns below). Ping on success; ping ?event=fail on failure.
8. TEST. For each monitor, send a connectivity test ping by id. The monitor moves to
   "endpoint verified, waiting for first real run". It is NOT healthy yet. That is correct.
9. REPORT. Tell the user each monitor is wired and endpoint-verified, and that it will go healthy
   on its first real run, and alert if it ever stops.
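
Steps 1 and 2 can be sketched for the crontab case as follows. This is an illustration under stated
assumptions, not a complete parser: it expects user-crontab syntax (five time fields, then the
command, with no user column as in /etc/crontab), skips @reboot because it has no recurring
schedule, and derives a placeholder name from the command's basename. parse_crontab is a
hypothetical helper; the field names match the reconcile body described under Endpoints.

```python
import os
import re

# Lines such as MAILTO=ops@example.com set the cron environment; they are not jobs.
ENV_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")


def parse_crontab(text, source_path):
    """Turn `crontab -l` output into monitor entries for the reconcile call."""
    monitors = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ENV_ASSIGNMENT.match(line):
            continue
        if line.startswith("@"):
            parts = line.split(None, 1)
            # Skip malformed lines and @reboot (assumption: not a monitorable schedule).
            if len(parts) < 2 or parts[0] == "@reboot":
                continue
            schedule, command = parts
        else:
            fields = line.split(None, 5)
            if len(fields) < 6:
                continue  # not a "m h dom mon dow command" line
            schedule, command = " ".join(fields[:5]), fields[5]
        monitors.append({
            "name": os.path.basename(command.split()[0]),  # placeholder label
            "schedule": schedule,
            "sourceType": "crontab",
            "sourcePath": source_path,
            "command": command,
        })
    return monitors
```

Show the resulting list to the user (name, command, schedule) before sending anything to the API.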

## Endpoints

# Dry-run / apply many jobs at once (the main path):
POST /api/v1/install/reconcile
  body: {
    "source": "crontab",                      // optional label for this install batch
    "dryRun": true,                            // true = compute diff, write nothing
    "monitors": [
      {
        "name": "Nightly DB backup",           // required, human label
        "schedule": "0 3 * * *",               // cron, @daily/@hourly, "6h"/"30m", or raw seconds
        "sourceType": "crontab",               // crontab|launchd|systemd|github-actions|vercel-cron|agent-script|...
        "sourcePath": "/etc/crontab",          // where the job is defined (helps derive a stable id)
        "command": "/usr/local/bin/backup.sh", // the command; hashed into the monitor's stable id
        "description": "pg_dump to S3"          // optional
      }
    ]
  }
  response: {
    "sessionId": "...",
    "dryRun": true,
    "created":   [ { "externalId": "...", "name": "...", "id"?: "...", "pingUrl"?: "..." } ],
    "updated":   [ { "externalId": "...", "name": "...", "id": "..." } ],
    "unchanged": [ { "externalId": "...", "name": "...", "id": "..." } ],
    "rejected":  [ { "name": "...", "error": "why it was rejected" } ],
    "missingFromLatestScan": [ { "externalId": "...", "name": "...", "id": "..." } ]
  }
  Notes: id + pingUrl appear on created[] entries ONLY when dryRun:false. A monitor that already
  exists (matched by its derived id) comes back in updated[]/unchanged[] with NO pingUrl. That is
  the trust rule, not a bug. Max 100 monitors per call. Always surface rejected[] to the user.
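
A hedged sketch of building the reconcile body and turning the response into the summary shown to
the user. build_reconcile_body and summarize are hypothetical helpers. The sketch assumes every
submitted job lands in exactly one of created, updated, unchanged, or rejected, so the "found"
count is the sum of those four lists. It rejects oversized batches instead of splitting them,
because this playbook does not say how several calls interact with missingFromLatestScan.

```python
MAX_MONITORS_PER_CALL = 100  # limit stated in the notes above


def build_reconcile_body(monitors, source, dry_run=True):
    """Return the JSON body for POST /api/v1/install/reconcile."""
    if len(monitors) > MAX_MONITORS_PER_CALL:
        raise ValueError(
            f"{len(monitors)} monitors exceeds the {MAX_MONITORS_PER_CALL}-per-call limit"
        )
    for monitor in monitors:
        if not monitor.get("name"):
            raise ValueError("every monitor needs a name")
    # dry_run defaults to True so nothing is written before the user confirms.
    return {"source": source, "dryRun": dry_run, "monitors": list(monitors)}


def summarize(response):
    """Turn a reconcile response into the lines to show the user."""
    def count(key):
        return len(response.get(key, []))

    found = count("created") + count("updated") + count("unchanged") + count("rejected")
    lines = [
        f"I found {found} jobs: {count('created')} new, {count('updated')} changed, "
        f"{count('unchanged')} unchanged, {count('rejected')} rejected, "
        f"{count('missingFromLatestScan')} missing from this scan."
    ]
    # Always surface rejected[] entries with their reason.
    for item in response.get("rejected", []):
        lines.append(f"REJECTED {item.get('name', '?')}: {item.get('error', 'no reason given')}")
    return "\n".join(lines)
```

Send the same monitors list twice: first with dry_run=True to show the diff, then with
dry_run=False only after the user has said yes.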

# Create a single monitor:
POST /api/v1/workflows
  body: { "name": "...", "schedule": "0 3 * * *", "sourceType": "crontab", "sourcePath": "...", "command": "..." }
  -> 201 with "pingUrl" on a brand-new monitor; an existing match returns 200 with NO pingUrl.

# List monitors (never returns ping URLs):
GET /api/v1/workflows

# Connectivity test a monitor by id (after wiring):
POST /api/v1/workflows/<id>/test-ping
  -> { "id", "workflow", "test": true, "verificationState": "endpoint_tested",
       "message": "Endpoint verified. Waiting for first real run." }
  Test pings never make a monitor healthy and never fire alerts.
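
The test-ping call could be sketched like this. build_test_ping and describe_test_result are
hypothetical helpers, and sending an empty POST body is an assumption. The reporting helper only
ever says "endpoint verified", in line with the rule that a test ping is not health.

```python
import urllib.parse
import urllib.request

API_BASE = "https://stillrunning.ai/api/v1"  # base URL from this playbook


def build_test_ping(monitor_id, api_key):
    """Build the POST /workflows/<id>/test-ping request (not sent here)."""
    path = f"/workflows/{urllib.parse.quote(monitor_id, safe='')}/test-ping"
    # Assumption: the endpoint needs no request body, so an empty one is sent.
    req = urllib.request.Request(API_BASE + path, data=b"", method="POST")
    req.add_header("Authorization", f"Bearer {api_key}")
    return req


def describe_test_result(result):
    """Report a test-ping result without ever claiming the monitor is healthy."""
    label = result.get("id", "monitor")
    if result.get("test") is True and result.get("verificationState") == "endpoint_tested":
        return f"{label}: endpoint verified, waiting for first real run (not healthy yet)"
    return f"{label}: test ping did not verify the endpoint; re-check the wiring"
```

Pass the decoded JSON response to describe_test_result and relay its output in the REPORT step.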

## Wiring patterns (always isolate the ping so it can't fail the job)

# crontab: ping on success, ping ?event=fail on failure (a failed success ping must not read as a job failure):
0 3 * * * if /usr/local/bin/backup.sh; then curl -fsS --max-time 10 "PING_URL"; else curl -fsS --max-time 10 "PING_URL?event=fail"; fi

# shell script: trap ERR, ping fail, then a success ping at the end (pings can't break the job):
#!/usr/bin/env bash
set -Eeuo pipefail  # -E (errtrace) so the ERR trap also fires for failures inside functions
trap 'curl -fsS --max-time 10 "PING_URL?event=fail" || true' ERR
# ... the real work ...
curl -fsS --max-time 10 "PING_URL" || true

# GitHub Actions: add two steps, both continue-on-error so the ping never fails the workflow:
- name: ping success
  if: success()
  continue-on-error: true
  run: curl -fsS --max-time 10 "PING_URL"
- name: ping fail
  if: failure()
  continue-on-error: true
  run: curl -fsS --max-time 10 "PING_URL?event=fail"

# Node: try/catch, swallow ping errors, rethrow the job error:
try { await job() ; await fetch(PING_URL).catch(() => {}) }
catch (e) { await fetch(PING_URL + '?event=fail').catch(() => {}) ; throw e }

# Python: try/except, swallow ping errors, re-raise the job error:
import urllib.request
def ping(u):
    try:
        urllib.request.urlopen(u, timeout=10).close()
    except Exception:
        pass  # a monitoring outage must never break the job
try:
    job()
    ping(PING_URL)
except Exception:
    ping(PING_URL + "?event=fail")
    raise

Replace PING_URL with the URL returned for that monitor at creation. If you don't have the URL
(the monitor already existed), tell the user: the URL is only shown when a monitor is created, so
either re-create it through the flow, or reveal it from the dashboard.
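
One way to separate monitors you can wire now from those that need the user's help, sketched under
the assumption that the apply response has the shape shown under Endpoints. plan_wiring and
wiring_report are hypothetical helpers. The report deliberately contains names only, so a ping URL
never ends up in output or logs.

```python
def plan_wiring(apply_response):
    """Split an apply (dryRun:false) response into wire-ready and URL-less monitors."""
    ready = {}      # monitor id -> {"name", "pingUrl"}; pingUrl is a secret
    needs_url = []  # names of monitors that already existed (no pingUrl returned)
    for entry in apply_response.get("created", []):
        if entry.get("id") and entry.get("pingUrl"):
            ready[entry["id"]] = {"name": entry["name"], "pingUrl": entry["pingUrl"]}
    for key in ("updated", "unchanged"):
        for entry in apply_response.get(key, []):
            needs_url.append(entry["name"])
    return ready, needs_url


def wiring_report(ready, needs_url):
    """Build a user-facing report that never includes a ping URL."""
    lines = [f"{info['name']}: ping URL captured, ready to wire" for info in ready.values()]
    for name in needs_url:
        lines.append(
            f"{name}: already existed, so no ping URL was returned; "
            "re-create it through the flow or reveal it from the dashboard"
        )
    return "\n".join(lines)
```

Keep the ready mapping in memory only for as long as the wiring step needs it.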

## Convenience: the MCP server

There is also a local MCP server, "stillrunning-mcp", that wraps these primitives (plan, create,
wiring snippet, test ping) for a guided flow. It is optional. This curl playbook works on its own.
See https://stillrunning.ai/docs/mcp.

Full docs: https://stillrunning.ai/docs/agent-onboarding