OpenClaw Daily Ops Runbook: Gateway 18789 Health Checks, Staged Upgrades & Log Disk Planning (2026)

Three ops habits to commit to muscle memory before reading on:

Double-probe, don't misread I/O spikes (curl + lsof)

Run curl on port 18789 and confirm port ownership with lsof. Dashboard slowness is usually a disk-write spike, not a gateway fault.

Staged upgrades, 30-minute rollback window (≤ 30 min)

Freeze version tag → golden node first → then roll production. Restore the config tarball and the old plist together; a rollback should never exceed 30 minutes.

Separate logs from the root volume (logrotate, 7 days)

Log directories and the npm cache should be rotated or pruned independently. A silently full root disk is the leading cause of unexpected Gateway exits.

This post answers one question

Once the OpenClaw installer finishes and Gateway 18789 goes green for the first time, teams often enter a "works until it doesn't" phase: sporadic 401s, one node behaving differently after an upgrade, or the Gateway silently exiting because the root disk filled up. None of these are covered by the installation guide.

This post answers one question: how do you turn OpenClaw's daily operations into a standardised runbook that any team member can pick up and use to complete a health check or rollback in 15 minutes? Scope: Gateway health probes → staged upgrade strategy → rollback procedure → Tailscale / SSH transport choice → log disk planning. For installation and initial onboarding, see the OpenClaw 2026 Installation & Onboarding Guide.

1. Gateway 18789 Health Probes

Many teams' "health check" is a single curl call, and that's not enough. If launchd hasn't restarted the service, nothing is listening and curl fails fast with "connection refused". If a stale process holds the port, however, curl can hang until it times out, or get an answer from the wrong process. A proper probe validates both the HTTP response and the port owner.

Gateway double-probe (suitable for a cron job)

# 1. HTTP liveness: expect 200, alert on failure
curl -fsS --max-time 3 http://127.0.0.1:18789/health \
  && echo "gateway OK" || echo "GATEWAY FAIL"

# 2. Confirm port ownership: PID should belong to node/openclaw
lsof -nP -iTCP:18789 -sTCP:LISTEN

# 3. Recent error lines (adjust path to your install dir)
tail -n 50 ~/.openclaw/logs/gateway.log | grep -iE "error|warn|exit"
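
The first two probes above can be folded into one function that prints a single status word, which is easier for cron or a monitor to act on. This is a minimal sketch, not OpenClaw tooling: classify_gateway, probe_gateway and the status words are hypothetical, and the node/openclaw owner match assumes that is the command name lsof reports on your install.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and status words are illustrative only.

# Map the two probe results to one status word.
#   $1 = curl exit code (0 means /health answered without an HTTP error)
#   $2 = command name of the process listening on the port ("" if none)
classify_gateway() {
  local http_rc="$1" owner="$2"
  if [ -z "$owner" ]; then
    echo "DOWN"            # nothing is listening
  elif ! printf '%s\n' "$owner" | grep -qiE 'node|openclaw'; then
    echo "WRONG_OWNER"     # port held by an unexpected process
  elif [ "$http_rc" -ne 0 ]; then
    echo "UNHEALTHY"       # expected process, but /health failed or timed out
  else
    echo "OK"
  fi
}

# Run both probes against the local Gateway and print the status word.
probe_gateway() {
  local port="${1:-18789}" http_rc owner
  if curl -fsS --max-time 3 "http://127.0.0.1:${port}/health" >/dev/null 2>&1; then
    http_rc=0
  else
    http_rc=$?
  fi
  # Line 2 of lsof output is the first listener; column 1 is its command name.
  owner="$(lsof -nP -iTCP:"${port}" -sTCP:LISTEN 2>/dev/null | awk 'NR==2 {print $1}')"
  classify_gateway "$http_rc" "$owner"
}
```

A cron entry could then call probe_gateway and alert on anything other than OK. Keeping the classification separate from the probing makes the decision logic easy to check without a running Gateway.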

If curl returns 200 but the Dashboard is sluggish, check disk I/O first: iostat -d 1 5. A log-write spike is the most common false alarm. Keep a dedicated "golden node" on the same version as production so any health anomaly can be compared against a known-good baseline before you start changing config.

Port 18789 occupied by the wrong process

If lsof -iTCP:18789 returns a PID that isn't OpenClaw, run launchctl list | grep openclaw to check service status. A leftover process from an incomplete previous install may be holding the port. Killing the process without first unloading and removing the old plist means launchd will respawn it: immediately if the plist sets KeepAlive, otherwise at the next login.
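
Before killing a stray process it helps to know which plist files could bring it back. The sketch below lists plists whose contents mention a given string; find_plists is a hypothetical helper, and the directories you pass are an assumption about where a previous install may have left files.

```shell
# List every *.plist in the given directories whose contents mention <pattern>.
# Usage: find_plists <pattern> <dir>...
find_plists() {
  local pattern="$1" dir
  shift
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    # -l prints matching file names, -i ignores case, -s hides read errors
    grep -lis -- "$pattern" "$dir"/*.plist 2>/dev/null || true
  done
}
```

For example, find_plists openclaw ~/Library/LaunchAgents /Library/LaunchAgents prints the candidates. Unload each stale one with launchctl bootout before removing the file and killing the PID.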

2. Three Things Before Every Upgrade

OpenClaw ships updates regularly. Blindly following every release introduces two classes of risk: config-key changes that prevent Gateway from starting, and new versions that are incompatible with existing Channel configuration. Staged upgrades are the standard approach, but teams often interpret "staged" as "try it on one box first" while skipping version freezing and disk headroom checks.

Pre-upgrade checklist: the difference it makes
Step	Done: staged upgrade flow	Skipped: direct npm update
Version freeze (record current tag)	Roll back with `npm install -g openclaw@old-tag`	Old version unknown; forced tarball restore
Golden node first	Failures isolated to test node; production unaffected	All-or-nothing failure across every Channel
Root disk ≥ 15% free	npm install temp files have room to land	Install aborts mid-way; leaves a broken half-state

The version freeze requires three artefacts: current npm package tag (npm list -g openclaw), absolute plist path, and a config tarball. Store these in your team wiki or Notion — you won't have time to hunt for them during an incident.

Pre-upgrade snapshot (save to wiki or .upgrade-memo)

# Record current version
npm list -g openclaw --depth=0 > .upgrade-memo

# Snapshot current config
tar czf openclaw-config-$(date +%Y%m%d).tar.gz \
  ~/.openclaw/config/ \
  ~/Library/LaunchAgents/com.openclaw.gateway.plist

# Check disk headroom: "Used" should read ≤ 85%, i.e. ≥ 15% free
df -h / | awk 'NR==2 {print "Free:"$4, "Used:"$5}'
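
The df line above only prints the numbers; a pre-upgrade script usually wants a pass/fail answer. This is a minimal sketch under the assumption that df -P (POSIX output format) is available and that the filesystem name contains no spaces; free_pct and has_headroom are hypothetical names.

```shell
# Read `df -P` output on stdin and print the free percentage of the first
# filesystem listed (100 minus the Capacity column).
free_pct() {
  awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeed only if <mount> has at least <min_free_pct> percent free.
# Usage: has_headroom / 15
has_headroom() {
  local pct
  pct="$(df -P "$1" | free_pct)"
  [ "${pct:-0}" -ge "$2" ]
}
```

A guard such as has_headroom / 15 || { echo "not enough disk headroom"; exit 1; } can then sit at the top of the upgrade procedure.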

3. Rollback: tarball + plist Two-Step Restore

The most common reason rollbacks fail isn't the binary version; it's config-key incompatibility. A new version may have written new or changed keys into the config, and rolling back the binary while leaving that newer config in place can cause a Gateway startup error. The correct sequence is: bootout → restore tarball → downgrade binary → bootstrap.

Full rollback procedure (target: ≤ 30 min)

# Step 1: Stop the service
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist

# Step 2: Restore config tarball
tar xzf openclaw-config-20260604.tar.gz -C /

# Step 3: Downgrade the npm package
npm install -g openclaw@1.x.x   # replace with frozen version tag

# Step 4: Restart the service
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist

# Step 5: Verify (allow ~10 s to start)
sleep 10 && curl -fsS http://127.0.0.1:18789/health && echo "Rollback OK"
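
A fixed sleep 10 can be too short on a loaded node and wastes time on a fast one. One alternative is to poll until the health check passes; the sketch below is a generic retry helper (wait_until is a hypothetical name, and the attempt count and delay are arbitrary choices).

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Step 5 could then read: wait_until 10 2 curl -fsS --max-time 3 http://127.0.0.1:18789/health && echo "Rollback OK".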

Watch out for plist drift. Run bootout before restoring the tarball, so it unloads the service definition that is currently loaded (the new version's), and run bootstrap after the restore, so it loads the old plist. If the new installer rewrote ProgramArguments or placed its plist at a different path, the bootout and bootstrap commands need different plist paths, which is exactly why a diff of the old and new plist belongs in your upgrade-memo.
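
One way to produce that diff is to compare the plist inside the snapshot tarball with the one currently installed. This is a sketch under assumptions: plist_drift is a hypothetical helper, and the member path is whatever tar tzf shows for your archive (tar normally stores absolute paths without the leading slash).

```shell
# Compare the plist stored in a snapshot tarball with the one now installed.
# Exit status follows diff: 0 = identical, 1 = drifted.
# Usage: plist_drift <snapshot.tar.gz> <member-path-in-archive> <installed-plist>
plist_drift() {
  # -O writes the extracted member to stdout instead of to disk
  tar -xzOf "$1" "$2" | diff -u - "$3"
}
```

Run it after upgrading the golden node and paste any output into the upgrade-memo, so the rollback operator knows in advance whether the plist changed.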

4. Tailscale vs SSH/rsync: Which One When

The perennial team debate — "should we use Tailscale or direct SSH?" — is a false dichotomy. They optimise for different scenarios. Tailscale shines for multi-device, cross-region ambient access without firewall rules. SSH/rsync is the right choice for bulk file transfers where the bottleneck is egress bandwidth, not protocol overhead.

Tailscale vs SSH/rsync decision matrix
Dimension	Tailscale (WireGuard mesh)	SSH / rsync (direct encrypted transfer)
Best for	Watching Dashboard from any device, multi-member remote access, VPN control plane	CI artefact downloads, log archive pulls, scheduled TB-scale syncs
Latency	DERP relay adds ~50 ms Asia↔Canada; direct mode near bare RTT	Bare TCP, minimal overhead; requires open port or jump host
Bandwidth	~5% protocol overhead; fine for sustained low-throughput sessions	Use `--bwlimit` to avoid saturating Canada egress
Access control	Tailscale ACLs per device/user	SSH keys; `authorized_keys` needs periodic audits
Troubleshooting	ACL rules, DERP node, MagicDNS resolution	Routing, MTU, timeout and resume behaviour

Recommended hybrid: use Tailscale tailnet for everyday Dashboard access and ad-hoc ops; use SSH + rsync with --bwlimit=50000 (~50 MB/s cap) for periodic log and artefact pulls. Keep both paths live as mutual fallbacks.

Rate-limited rsync (protect Canada egress)

# Pull log archives from the Canada node at ≤ 50 MB/s; --partial keeps
# interrupted files so that re-running the command resumes them
rsync -avz --bwlimit=50000 --partial \
  user@canada-m4.hashvps.com:~/.openclaw/logs/archive/ \
  ./local-logs/

# Verify Tailscale is using a direct path (not a DERP relay);
# flags go before the hostname
tailscale ping --until-direct canada-m4
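
When picking a --bwlimit value it helps to estimate how long a pull will take at that cap. The arithmetic below is a sketch that assumes the rsync 3.x convention (a bare --bwlimit number is in units of 1024 bytes per second) and a link that sustains the cap; transfer_secs is a hypothetical helper.

```shell
# Estimate whole seconds needed to move <size_gib> GiB at a given --bwlimit.
# Usage: transfer_secs <size_gib> <bwlimit_value>
transfer_secs() {
  local size_gib="$1" bwlimit_kib="$2"
  # GiB -> KiB is a factor of 1024 * 1024; integer division rounds down
  echo $(( size_gib * 1024 * 1024 / bwlimit_kib ))
}
```

For instance, transfer_secs 100 50000 prints 2097, so a 100 GiB archive needs roughly 35 minutes at the cap used above.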

5. Log Disk Planning: logrotate Config and Partition Tips

A surprising share of unexpected Gateway exits traces back to a full root disk. By default OpenClaw writes logs to the user home directory, the same volume that holds the npm global cache. Add Xcode exports or CI artefact temp files and you can silently fill a 512 GB drive within weeks. logrotate (not periodic rm) is the right fix; macOS does not ship it, so install it first, for example with brew install logrotate.

~/.openclaw/logrotate.conf (adjust paths as needed)

/Users/mac/.openclaw/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    dateext
    copytruncate
}

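# Add to crontab (run at 03:00 daily):
# crontab -e, then add the line below. Homebrew typically installs logrotate under sbin:
# /opt/homebrew/sbin on Apple Silicon, /usr/local/sbin on Intel (confirm with `which logrotate`).
# -s keeps the state file somewhere the cron user can write.
# 0 3 * * * /opt/homebrew/sbin/logrotate -s /Users/mac/.openclaw/logrotate.state /Users/mac/.openclaw/logrotate.conf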
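
To size the log volume, the rotate 7 plus compress policy above can be turned into a rough disk budget. This is a back-of-envelope sketch: log_budget_mb is a hypothetical helper, and both the daily volume and the compression ratio are assumptions you should replace with measurements from your own node.

```shell
# Estimate steady-state disk use (MB) for one rotated log:
# the live file plus <rotate> compressed archives.
#   $1 = average MB written per day
#   $2 = rotate count
#   $3 = compressed size as a percentage of the original
log_budget_mb() {
  local daily_mb="$1" rotate="$2" ratio_pct="$3"
  echo $(( daily_mb + rotate * daily_mb * ratio_pct / 100 ))
}
```

With 500 MB written per day and archives compressing to 10% of their original size, log_budget_mb 500 7 10 prints 850. Before enabling the cron entry, logrotate -d followed by the config path does a dry run that reports what would be rotated without touching any files.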

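npm has no setting that caps the cache by size: cache-max is a deprecated, time-based option from older npm versions, so it will not hold the cache to 10 GB. Instead, measure the cache with du -sh "$(npm config get cache)" and run npm cache verify (which garbage-collects unneeded data) or npm cache clean --force after each upgrade. On a 1 TB node, consider moving ~/.openclaw/logs and the npm cache (npm config set cache <path>) to a separate volume, so that a spike in log growth cannot fill the root disk or take down the Gateway process.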
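
Whatever cleanup policy you settle on, it helps to measure the cache directory and alert when it passes a budget. The sketch below does that with du; dir_size_mb and over_budget are hypothetical helpers, and the 10240 MB figure in the usage comment is only an example budget.

```shell
# Print the size of a directory in whole MB (rounded down).
dir_size_mb() {
  du -sk "$1" | awk '{ print int($1 / 1024) }'
}

# Succeed (exit 0) when the directory is larger than <limit_mb>.
# Usage: over_budget "$(npm config get cache)" 10240
over_budget() {
  [ "$(dir_size_mb "$1")" -gt "$2" ]
}
```

A post-upgrade step might run over_budget "$(npm config get cache)" 10240 && npm cache clean --force, so the cache is only cleared when it has actually grown past the budget.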

6. FAQ: Six Common Ops Questions

Gateway 18789 health check returns non-200 — where do I look first?

Run lsof -nP -iTCP:18789 -sTCP:LISTEN to check whether anything is listening. No output means the process isn't running: check launchd with launchctl list | grep openclaw, then look at the last 50 lines of ~/.openclaw/logs/gateway.log. If the port is listening but HTTP returns 500, the most likely cause is a config-key error — compare against your pre-upgrade tarball.

After rollback, do I need to manually edit the plist path?

Only if the new version's installer modified ProgramArguments. If it was a pure npm version bump with no installer changes, restoring the tarball leaves the plist intact. If the binary path changed, update the plist's ProgramArguments entry to point at the old binary location before running bootstrap — that's why the pre-upgrade diff of old vs new plist belongs in your runbook.

Should multiple nodes upgrade simultaneously or one at a time?

Always one at a time. Upgrade the golden node first; observe Gateway health and Channel SLA for 24 hours. Then roll production nodes one at a time with at least 30 minutes between each. Simultaneous upgrades mean no comparison baseline if something goes wrong — every node fails at once with no known-good reference.
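
The one-at-a-time rule can be enforced by a small loop that stops at the first failure. This is a sketch only: roll_nodes is a hypothetical helper, and the per-node command you pass (upgrade plus health probe) is something you would write for your own environment.

```shell
# Run <command> <node> for each node in order, pausing <gap_seconds> between
# nodes and stopping at the first failure so later nodes stay untouched.
# Usage: roll_nodes <gap_seconds> <command> <node>...
roll_nodes() {
  local gap="$1" cmd="$2" node first=1
  shift 2
  for node in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$gap"          # observation window before the next node
    fi
    first=0
    if ! "$cmd" "$node"; then
      echo "halted at $node" >&2
      return 1
    fi
    echo "upgraded $node"
  done
}
```

With a 30-minute gap this would be called as roll_nodes 1800 upgrade_and_probe prod-1 prod-2 prod-3, where upgrade_and_probe and the node names are placeholders. Because the loop halts on failure, the remaining nodes stay on the old version as a comparison baseline.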

Dashboard showing 100% CPU — is it an OpenClaw bug?

Not necessarily. Run top -o cpu or open Activity Monitor to find the actual CPU-hungry PID. A healthy OpenClaw Gateway process should consume 1–5% CPU at steady state. If it's a node process spiking, check for queued Channel requests or abnormal log-write bursts. On a dedicated Mac mini, use renice to deprioritise Xcode build processes if they compete with OpenClaw.

Tailscale is connected, but I can't reach the Dashboard — why?

Check your Tailscale ACL policy to confirm port 18789 is allowed for that device (custom ACLs may block specific ports). Then verify MagicDNS resolution by grabbing the tailnet IP with tailscale ip -4 and hitting it directly: curl http://<tailnet-ip>:18789/health. If the IP works but the hostname doesn't, it's a DNS resolution issue, not a Gateway issue.

After logrotate runs, does the Gateway keep writing to the right file?

With copytruncate, logrotate copies the file then truncates the original in-place — the Gateway's open file descriptor remains valid throughout. If you use the default create mode instead, send the Gateway a SIGHUP signal or use a postrotate script to launchctl kickstart the service so it reopens the log handle. copytruncate is simpler and avoids any service interruption.
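
The copytruncate behaviour can be reproduced in a scratch directory without touching the real logs. The sketch below stands in for the Gateway with a shell file descriptor opened in append mode; whether the Gateway itself opens its log in append mode is an assumption worth verifying, and copytruncate_demo is a hypothetical name.

```shell
# Simulate a long-running writer plus a copytruncate rotation.
copytruncate_demo() {
  local dir
  dir="$(mktemp -d)"
  exec 3>>"$dir/gateway.log"                   # writer keeps this append-mode descriptor open
  echo "before rotation" >&3
  cp "$dir/gateway.log" "$dir/gateway.log.1"   # the "copy" half
  : > "$dir/gateway.log"                       # the "truncate" half
  echo "after rotation" >&3                    # same descriptor, no reopen
  exec 3>&-
  echo "rotated: $(cat "$dir/gateway.log.1")"
  echo "live: $(cat "$dir/gateway.log")"
  rm -rf "$dir"
}
```

Running copytruncate_demo shows the old line in the rotated copy and the new line in the live file. Two caveats apply to the real thing: lines written between the copy and the truncate are lost, and a writer that did not open the file in append mode keeps writing at its old offset, which leaves a sparse file.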