Three ops habits to commit to muscle memory before reading on:
- Double-probe, don't misread I/O spikes: run curl on port 18789 and confirm port ownership with lsof. Dashboard slowness is usually a disk-write spike, not a gateway fault. (curl + lsof)
- Staged upgrades, 30-minute rollback window: freeze version tag → golden node first → then roll production. Restore tarball config and old plist together; rollback should never exceed 30 minutes. (≤ 30 min)
- Separate logs from the root volume: log directories and npm cache should rotate independently. A silently full root disk is the leading cause of unexpected Gateway exits. (logrotate 7d)
This post answers one question
Once the OpenClaw installer finishes and Gateway 18789 goes green for the first time, teams often enter a "works until it doesn't" phase: sporadic 401s, one node behaving differently after an upgrade, or the Gateway silently exiting because the root disk filled up. None of these are covered by the installation guide.
This post answers one question: how do you turn OpenClaw's daily operations into a standardised runbook that any team member can pick up and complete a health check or rollback in 15 minutes? Scope: Gateway health probes → staged upgrade strategy → rollback procedure → Tailscale / SSH transport choice → log disk planning. For installation and initial onboarding, see OpenClaw 2026 Installation & Onboarding Guide.
1. Gateway 18789 Health Probes
Many teams' "health check" is a single curl call — that's not enough. When the port is occupied by a stale process or launchd hasn't correctly restarted the service, curl will hang rather than fail fast. A proper probe validates both the HTTP response and the port owner.
# 1. HTTP liveness: expect 200, alert on failure
curl -fsS --max-time 3 http://127.0.0.1:18789/health \
  && echo "gateway OK" || echo "GATEWAY FAIL"
# 2. Confirm port ownership: PID should belong to node/openclaw
lsof -nP -iTCP:18789 -sTCP:LISTEN
# 3. Recent error lines (adjust path to your install dir)
tail -n 50 ~/.openclaw/logs/gateway.log | grep -iE "error|warn|exit"
If curl returns 200 but the Dashboard is sluggish, check disk I/O first: iostat -d 1 5. A log-write spike is the most common false alarm. Keep a dedicated "golden node" on the same version as production so any health anomaly can be compared against a known-good baseline before you start changing config.
If lsof -iTCP:18789 returns a PID that isn't OpenClaw, run launchctl list | grep openclaw to check service status. A leftover process from an incomplete previous install may be holding the port. Killing the process without removing the old plist will cause launchd to respawn it on the next login.
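The two-part probe above can be folded into a single verdict. This is a minimal sketch, assuming the listener appears in lsof's COMMAND column as node or openclaw (following the ownership rule above); classify_probe and probe_gateway are hypothetical helper names, not part of OpenClaw.

```shell
# classify_probe HTTP_EXIT OWNER_CMD
#   HTTP_EXIT: exit status of the curl health call (0 = success)
#   OWNER_CMD: command name holding the port per lsof ("" if nothing listens)
classify_probe() {
  http_exit=$1
  owner=$2
  if [ -z "$owner" ]; then
    echo "DOWN: nothing listening"
  elif [ "$owner" != "node" ] && [ "$owner" != "openclaw" ]; then
    # Assumption: anything other than node/openclaw is a foreign process
    echo "FOREIGN: port held by $owner"
  elif [ "$http_exit" -ne 0 ]; then
    echo "UNHEALTHY: owner ok, HTTP failed"
  else
    echo "OK"
  fi
}

# probe_gateway [PORT]: runs both probes and prints one verdict line
probe_gateway() {
  port=${1:-18789}
  curl -fsS --max-time 3 "http://127.0.0.1:${port}/health" >/dev/null 2>&1 \
    && http_exit=0 || http_exit=$?
  # First data row of lsof output; column 1 is the command name
  owner=$(lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null | awk 'NR==2 {print $1}')
  classify_probe "$http_exit" "$owner"
}
```

Calling probe_gateway with no arguments checks port 18789. Keeping the decision logic in classify_probe separate from the commands makes the verdict rules easy to adjust if your install reports a different process name.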
2. Three Things Before Every Upgrade
OpenClaw ships updates regularly. Blindly following every release introduces two classes of risk: config-key changes that prevent Gateway from starting, and new versions that are incompatible with existing Channel configuration. Staged upgrades are the standard approach, but teams often interpret "staged" as "try it on one box first" while skipping version freezing and disk headroom checks.
| Step | Done (staged upgrade flow) | Skipped (direct npm update) |
|---|---|---|
| Version freeze (record current tag) | Roll back with npm install openclaw@old-tag | Old version unknown; forced tarball restore |
| Golden node first | Failures isolated to test node; production unaffected | All-or-nothing failure across every Channel |
| Root disk ≥ 15% free | npm install temp files have room to land | Install aborts mid-way; leaves a broken half-state |
The version freeze requires three artefacts: current npm package tag (npm list -g openclaw), absolute plist path, and a config tarball. Store these in your team wiki or Notion — you won't have time to hunt for them during an incident.
# Record current version
npm list -g openclaw --depth=0 > .upgrade-memo
# Snapshot current config
tar czf openclaw-config-$(date +%Y%m%d).tar.gz \
~/.openclaw/config/ \
~/Library/LaunchAgents/com.openclaw.gateway.plist
# Verify disk headroom ≥ 15%
df -h / | awk 'NR==2 {print "Free:"$4, "Used:"$5}'
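The df line above only prints the numbers; a gate that actually refuses to proceed below 15% free could look like the sketch below. It assumes the fifth column of df -P is the used-capacity percentage (true for the POSIX output format); the function names are hypothetical.

```shell
# free_pct_from_capacity "85%" prints 15
free_pct_from_capacity() {
  used=${1%\%}              # strip the trailing percent sign
  echo $((100 - used))
}

# has_headroom CAPACITY [MIN_FREE_PCT]: succeeds when free space >= minimum
has_headroom() {
  free=$(free_pct_from_capacity "$1")
  [ "$free" -ge "${2:-15}" ]
}

# check_root_headroom: reads the root volume and fails when under 15% free
check_root_headroom() {
  capacity=$(df -P / | awk 'NR==2 {print $5}')
  if has_headroom "$capacity" 15; then
    echo "disk OK: $(free_pct_from_capacity "$capacity")% free"
  else
    echo "disk LOW: $(free_pct_from_capacity "$capacity")% free, aborting upgrade" >&2
    return 1
  fi
}
```

A typical use is check_root_headroom && npm install -g openclaw@new-tag, where new-tag stands for whatever version you are moving to, so the install never starts on a nearly full disk.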
3. Rollback: tarball + plist Two-Step Restore
The most common reason rollbacks fail isn't the binary version — it's config-key incompatibility. A new version may have added required fields; rolling back the binary while leaving the new config in place will cause a Gateway startup error. The correct sequence is: bootout → restore tarball → downgrade binary → bootstrap.
# Step 1: Stop the service
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist
# Step 2: Restore config tarball
tar xzf openclaw-config-20260604.tar.gz -C /
# Step 3: Downgrade the npm package
npm install -g openclaw@1.x.x # replace with frozen version tag
# Step 4: Restart the service
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist
# Step 5: Verify (allow ~10 s to start)
sleep 10 && curl -fsS http://127.0.0.1:18789/health && echo "Rollback OK"
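The fixed sleep in Step 5 can be swapped for a short polling loop, which returns as soon as the Gateway answers and still fails cleanly if it never does. A minimal sketch; wait_until is a hypothetical helper, and the retry count and delay are illustrative values.

```shell
# wait_until TRIES DELAY CMD...
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Succeeds on the first successful run, fails if every attempt fails.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the pause after the final attempt
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For the rollback check, something like wait_until 10 2 curl -fsS --max-time 3 http://127.0.0.1:18789/health && echo "Rollback OK" polls for roughly 20 seconds before giving up.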
Watch out for plist path drift: if the new version's installer rewrote the ProgramArguments path, you'll need to bootout using the new plist path but bootstrap using the restored old plist. These paths may differ — which is exactly why diffing old and new plist before upgrading belongs in your upgrade-memo.
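One way to make that pre-upgrade diff routine is a small comparison helper. The sketch below assumes both plists are in XML form (a binary plist would first need converting, for example with plutil -convert xml1); plist_drift is a hypothetical name, and the file paths passed to it are up to you.

```shell
# plist_drift OLD NEW
# Prints "no drift" when the files are byte-identical,
# otherwise "DRIFT" followed by a unified diff.
plist_drift() {
  old=$1
  new=$2
  if cmp -s "$old" "$new"; then
    echo "no drift"
  else
    echo "DRIFT"
    # diff exits non-zero when files differ; that is expected here
    diff -u "$old" "$new" || true
  fi
}
```

Run it against the plist extracted from your config tarball and the one currently in ~/Library/LaunchAgents, and paste any diff into the upgrade memo.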
4. Tailscale vs SSH/rsync: Which One When
The perennial team debate — "should we use Tailscale or direct SSH?" — is a false dichotomy. They optimise for different scenarios. Tailscale shines for multi-device, cross-region ambient access without firewall rules. SSH/rsync is the right choice for bulk file transfers where the bottleneck is egress bandwidth, not protocol overhead.
| Dimension | Tailscale (WireGuard mesh) | SSH / rsync (direct encrypted transfer) |
|---|---|---|
| Best for | Watching Dashboard from any device, multi-member remote access, VPN control plane | CI artefact downloads, log archive pulls, scheduled TB-scale syncs |
| Latency | DERP relay adds ~50 ms Asia↔Canada; direct mode near bare RTT | Bare TCP, minimal overhead; requires open port or jump host |
| Bandwidth | ~5% protocol overhead; fine for sustained low-throughput sessions | Use --bwlimit to avoid saturating Canada egress |
| Access control | Tailscale ACLs per device/user | SSH keys; authorized_keys needs periodic audits |
| Troubleshooting | ACL rules, DERP node, MagicDNS resolution | Routing, MTU, timeout and resume behaviour |
Recommended hybrid: use Tailscale tailnet for everyday Dashboard access and ad-hoc ops; use SSH + rsync with --bwlimit=50000 (~50 MB/s cap) for periodic log and artefact pulls. Keep both paths live as mutual fallbacks.
# Pull log archives from Canada node at ≤ 50 MB/s, resume on disconnect
rsync -avz --bwlimit=50000 --partial \
  user@canada-m4.hashvps.com:~/.openclaw/logs/archive/ \
  ./local-logs/
# Verify Tailscale is using direct path (not DERP relay)
tailscale ping --until-direct canada-m4
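To sanity-check a bandwidth cap before scheduling a pull, a rough transfer-time estimate helps. The sketch below treats the --bwlimit value as KiB per second (rsync's unit when no suffix is given) and ignores compression and protocol overhead, so it is an upper-bound guess rather than a measurement; rsync_eta_minutes is a hypothetical helper.

```shell
# rsync_eta_minutes SIZE_MB BWLIMIT_KBPS
# Prints the whole minutes (rounded up) needed to move SIZE_MB mebibytes
# at a cap of BWLIMIT_KBPS KiB/s.
rsync_eta_minutes() {
  size_kb=$(( $1 * 1024 ))
  secs=$(( (size_kb + $2 - 1) / $2 ))   # ceiling division
  echo $(( (secs + 59) / 60 ))
}
```

For example, rsync_eta_minutes 100000 50000 estimates a pull of about 100 GB at the cap used above and prints 35.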
5. Log Disk Planning: logrotate Config and Partition Tips
A surprising share of Gateway unexpected exits trace back to a full root disk. By default OpenClaw writes logs to the user home directory — the same volume that holds the npm global cache. Add Xcode exports or CI artefact temp files and you can silently fill a 512 GB drive within weeks. logrotate (not periodic rm) is the right fix.
/Users/mac/.openclaw/logs/*.log {
daily
rotate 7
compress
missingok
notifempty
dateext
copytruncate
}
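# Add to crontab (run at 03:00 daily):
# crontab -e, then add (confirm the binary location with: command -v logrotate;
# Homebrew on Apple Silicon installs under /opt/homebrew rather than /usr/local):
# 0 3 * * * /usr/local/bin/logrotate /Users/mac/.openclaw/logrotate.conf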
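npm has no setting that caps the cache by size (cache-max controlled entry age in older npm releases and is deprecated in current ones), so keep the cache in check by running npm cache clean --force after each upgrade and npm cache verify periodically to garbage-collect unneeded data. On a 1 TB node, consider moving ~/.openclaw/logs and the npm cache to a separate volume, so that a spike in log growth cannot affect the root disk or the Gateway process.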
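Rotation limits how many log files accumulate, but a simple size budget catches growth that rotation does not, such as the npm cache. A minimal sketch using du; the helper names and any limits you pass in are illustrative assumptions.

```shell
# dir_size_kb DIR: prints the disk usage of DIR in KB
dir_size_kb() {
  du -sk "$1" | awk '{print $1}'
}

# over_limit DIR LIMIT_KB: succeeds when DIR uses more than LIMIT_KB
over_limit() {
  [ "$(dir_size_kb "$1")" -gt "$2" ]
}

# check_log_budget DIR LIMIT_KB: prints a status line, fails when over budget
check_log_budget() {
  dir=$1
  limit_kb=$2
  if over_limit "$dir" "$limit_kb"; then
    echo "OVER: $dir uses $(dir_size_kb "$dir") KB (limit $limit_kb KB)"
    return 1
  fi
  echo "ok: $dir within $limit_kb KB"
}
```

Something like check_log_budget ~/.openclaw/logs 5242880 (a 5 GB budget) could run from the same crontab as logrotate and feed whatever alerting you already use.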
6. FAQ: Six Common Ops Questions
Gateway 18789 health check returns non-200 — where do I look first?
Run lsof -nP -iTCP:18789 -sTCP:LISTEN to check whether anything is listening. No output means the process isn't running: check launchd with launchctl list | grep openclaw, then look at the last 50 lines of ~/.openclaw/logs/gateway.log. If the port is listening but HTTP returns 500, the most likely cause is a config-key error — compare against your pre-upgrade tarball.
After rollback, do I need to manually edit the plist path?
Only if the new version's installer modified ProgramArguments. If it was a pure npm version bump with no installer changes, restoring the tarball leaves the plist intact. If the binary path changed, update the plist's ProgramArguments entry to point at the old binary location before running bootstrap — that's why the pre-upgrade diff of old vs new plist belongs in your runbook.
Should multiple nodes upgrade simultaneously or one at a time?
Always one at a time. Upgrade the golden node first; observe Gateway health and Channel SLA for 24 hours. Then roll production nodes one at a time with at least 30 minutes between each. Simultaneous upgrades mean no comparison baseline if something goes wrong — every node fails at once with no known-good reference.
Dashboard showing 100% CPU — is it an OpenClaw bug?
Not necessarily. Run top -o cpu or open Activity Monitor to find the actual CPU-hungry PID. A healthy OpenClaw Gateway process should consume 1–5% CPU at steady state. If it's a node process spiking, check for queued Channel requests or abnormal log-write bursts. On a dedicated Mac mini, use renice to deprioritise Xcode build processes if they compete with OpenClaw.
Tailscale is connected, but I can't reach the Dashboard — why?
Check your Tailscale ACL policy to confirm port 18789 is allowed for that device (custom ACLs may block specific ports). Then verify MagicDNS resolution by grabbing the tailnet IP with tailscale ip -4 and hitting it directly: curl http://<tailnet-ip>:18789/health. If the IP works but the hostname doesn't, it's a DNS resolution issue, not a Gateway issue.
After logrotate runs, does the Gateway keep writing to the right file?
With copytruncate, logrotate copies the file and then truncates the original in place, so the Gateway's open file descriptor remains valid throughout. The trade-off is that lines written between the copy and the truncate can be lost. If you use the default create mode instead, the Gateway has to reopen its log after rotation: send it SIGHUP if it handles that signal, or use a postrotate script that runs launchctl kickstart -k on the service. copytruncate is simpler and avoids any service interruption.
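The copytruncate behaviour can be reproduced in a few lines of shell, which is a handy way to convince yourself before touching a live node. This sketch assumes the writer opens its log in append mode, as the demo does with file descriptor 3; whether the Gateway does so is an assumption worth verifying on your install.

```shell
# copytruncate_demo: simulates a writer that keeps its log open across rotation
copytruncate_demo() {
  dir=$(mktemp -d)
  log="$dir/gateway.log"
  exec 3>>"$log"            # writer opens the log once, in append mode
  echo "before rotate" >&3
  cp "$log" "$log.1"        # rotation step 1: copy
  : > "$log"                # rotation step 2: truncate in place
  echo "after rotate" >&3   # same descriptor, no reopen needed
  exec 3>&-
  echo "rotated=$(cat "$log.1")"
  echo "live=$(cat "$log")"
  rm -rf "$dir"
}
```

Running copytruncate_demo prints rotated=before rotate followed by live=after rotate: the old line lands in the rotated copy and the new line in the truncated live file.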
A runbook is only as reliable as the hardware underneath it
The 30-minute rollback window and "Gateway should use 1–5% CPU" assumptions hold on dedicated hardware — not on oversubscribed VMs. Hashvps Canada M4 Mac mini instances are bare-metal, never oversold: your launchd, SSH and npm cache behave identically to a local Mac, so a runbook tested on your laptop actually runs the same way in production. Apple Silicon's 4W idle power draw lets the Gateway run 24/7 without fan noise or unexpected thermal throttles. Gatekeeper + SIP + FileVault mean your Gateway tokens and Channel config stay protected without extra security-group work.
If you want to land this runbook on a node that behaves predictably and is auditable by your whole team, Hashvps cloud Mac mini M4 is the most direct starting point — explore plans and turn OpenClaw ops from "it works until it doesn't" into a production SLA.