When Should Mac M4 CI Split to Dual Nodes? Decision Model & Migration Runbook (2026)

In early June we parked Xcode CI and the OpenClaw Gateway on the same Canada M4, leaning on launchd Nice=-5 and capped compile concurrency to get through two weeks. By week three, product folks were pinging in IM: “Gateway is lagging again.” Build logs looked fine; curl health checks drifted from 50 ms to 400 ms+ during peak hours. We tuned concurrency, cleared DerivedData, even bumped to a 24 GB plan temporarily. The problem went from twice a day to once a day—not gone.

That was not a misconfiguration. It was the single-host resource model hitting its ceiling. Ports did not clash (20300 vs 18789), CPU cores were ample; the fight was over unified memory and swap. This piece picks up after colocated deployment and launchd tuning: when you must split to dual nodes, how to draw the topology, and how to cut traffic without blowing up Channels. For Apple’s view on memory and build processes, see Xcode build caching documentation; for mesh networking, Tailscale installation guide.

Bottom line first: the dividing line is resource isolation, not “buy another machine.”

Four hard signals—split if any one fires

The four signals: ≥50 builds/day, ≥3 concurrent full builds, a Gateway serving end users, or ≥3 Critical memory_pressure events per week. Once any one fires, more Nice tuning on one box only buys time.

Minimum dual node: build host + Gateway host

Build node keeps 24 GB+ and a large SSD; Gateway node runs fine on 16 GB with ~500 MB resident. Two M4s over Tailscale; expose only what the public internet needs.

Migration is state and DNS, not reinstall

Rsync the config directories, reuse the existing tokens, and cut over with Tailscale MagicDNS. The build host keeps running while the Gateway goes blue/green in under 15 minutes.


1. Why the colocated setup “suddenly” stops working

Most teams’ first cloud Mac does triple duty: run xcodebuild, host OpenClaw Gateway on 18789, and occasionally VNC in for keychain prompts. Early on, 16 GB feels generous. As branches multiply, UI tests run in parallel, and Channels graduate from internal trial to production IM, the memory curve shifts from sawtooth to plateau—ten minutes after a build finishes, compressed pages still hold 8 GB+, and the Gateway heap gets pushed into swap.

Same-box tuning (lower concurrency, raise Nice, flush caches) reduces peak overlap probability; it does not remove the peaks. When two peaks almost always overlap on the timeline—APAC daytime builds, North America evening IM rush—you are not optimizing parameters, you are betting on the scheduler. Apple Silicon unified memory has no discrete VRAM buffer; Xcode and the Node.js Gateway compete in the same physical pool. Once memory_pressure hits Warn, latency jitter is user-visible; for CI it means sporadic timeouts and signing-step failures.

Splitting is not a “scale-up myth.” It separates incompressible peaks: build spikes land only on the build node, and Gateway resident memory is never pushed out by xcodebuild and its compiler processes. That mirrors the “dedicated runner, no desktop mixing” logic in self-hosted macOS runners on cloud Mac; here you split Gateway from CI instead of runner from GUI session.

Operators often misread the symptom as “we need a faster chip.” M4 throughput was never the bottleneck in our case. The machine could compile quickly and serve HTTP; it could not do both under sustained overlap without swap. Monitoring that only tracks CPU utilization will look green while Gateway P95 latency tells a different story. If your dashboard has build success rate but not Gateway latency during compile windows, you will discover the ceiling in IM threads, not in Grafana.

2. Three topology tiers: colocated, dual-node, tri-node

Classify before you provision—avoid jumping straight to “add two more boxes”:

T0 · colocated: One M4 runs Xcode Server / self-hosted runner plus Gateway. Fits <30 builds/day, Gateway internal-only, occasional latency acceptable.
T1 · dual-node (this article): Node A is CI-only (24 GB+, large SSD); Node B is Gateway-only (16 GB is enough). Tailscale or datacenter LAN between them; Channels and Dashboard point at B, build triggers hit A.
T2 · tri-node: Split signing/upload or parallel test onto a third host on top of T1. Consider when >150 builds/day, or TestFlight upload must be hard-isolated from compile. Most small and mid teams stay on T1.

Asymmetric conclusion: the problem is not “M4 is too slow”—it is binding two services with different SLAs to one memory pool. Gateway wants 99.9% steady response; CI wants throughput. Colocation forces one SLA to compromise both.

T0 is a legitimate long-term choice for solo devs and two-person teams with predictable schedules. T1 is the inflection when either workload gets an external SLA or build frequency crosses the thresholds in section 3. T2 is for teams that already proved T1 and still see disk or signing contention—not a default reaction to first Gateway complaints.

Figure: minimum dual-node model, with build peaks and Gateway SLA physically isolated.

3. Colocated vs dual-node vs RAM upgrade: how to compare

Mac M4 CI scaling paths compared (2026)
Dimension	Keep tuning colocated (Nice / lower concurrency)	Single host to 32 GB (more RAM, same topology)	Split to dual-node (CI + Gateway on separate hosts)
Root cause addressed	Reduces peak overlap	Raises memory ceiling	Isolates two SLAs
Gateway latency	Still jitters with builds	Better, not eliminated	Can stay <100 ms during builds
Monthly cost (rough)	Lowest	Mid (upgrade delta)	Mid (+16 GB small host)
Ops complexity	Low	Low	Mid (second host, Tailscale)
Best fit stage	<30 builds/day	2-way concurrency, internal Gateway	≥50 builds/day or external Gateway

Split trigger checklist (any one row → consider T1)
Signal	Threshold	How to observe	False-positive guard
Build frequency	≥50/day	CI logs / runner counters	Exclude manual local Archive
Concurrent lanes	≥3 full builds	`xcodebuild -jobs` and queue depth	Light lint jobs do not count as a lane
Gateway audience	End users / 7×24 Channels	Product SLA requirements	Pure internal webhooks can wait
Memory pressure	≥3 Critical per week	`memory_pressure` logs	One-off leak—investigate first
Disk contention	>50 GB artifacts/month and high I/O wait	`df -h` + `iostat`	Clear archives before deciding
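
The four hard signals in the checklist can be turned into a small check to run before procurement. The sketch below is one way to do that in bash. The function name should_split and its argument order are hypothetical; the thresholds are the ones from the table. The disk-contention row is left out because it needs a judgment call on I/O wait.

```bash
# should_split BUILDS_PER_DAY CONCURRENT_FULL_BUILDS GATEWAY_EXTERNAL CRITICAL_PER_WEEK
# GATEWAY_EXTERNAL is 1 when the Gateway serves end users, 0 when internal-only.
# Prints "split:" followed by every signal that fired, or "stay" when none did.
should_split() {
  fired=""
  if [ "$1" -ge 50 ]; then fired="$fired build-frequency"; fi
  if [ "$2" -ge 3 ]; then fired="$fired concurrent-lanes"; fi
  if [ "$3" -eq 1 ]; then fired="$fired external-gateway"; fi
  if [ "$4" -ge 3 ]; then fired="$fired memory-pressure"; fi
  if [ -n "$fired" ]; then
    echo "split:$fired"
  else
    echo "stay"
  fi
}
```

Feeding it weekly averages, for example `should_split 35 2 1 0`, lists the signals that fired, so the two-week observation window produces a record rather than an impression.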

RAM upgrade does not replace a split

32 GB colocated fits “2 concurrent builds + internal Gateway trial.” If Gateway is already external-facing, more RAM turns three daily lag spikes into one—users still feel “it’s slow again.”

4. Scenario matrix: how to choose

Two-person team, <20 builds/day, Gateway for yourself only: Stay on T0; invest in colocated launchd tuning.
APAC-triggered builds, Gateway serving IM users: Go straight to T1. Under follow-the-sun, peaks almost always overlap—colocation loses.
GitHub Actions cloud runner live, Gateway just launched: Runner on A, Gateway on B; do not hang 18789 on the runner host.
TestFlight upload fighting compile for disk: T1 first; if still tight, add a signing/upload host (T2). That path has its own runbook—do not conflate with this Gateway split.
Budget allows only one machine: Prioritize Gateway SLA; batch builds at night or cut concurrency. That is a compromise, not a durable architecture.

When in doubt, run the trigger checklist for two weeks before procurement. One Critical memory event during a demo is enough to justify T1 emotionally; three in a week with external Channels is enough operationally. Internal-only Gateway with <30 builds/day rarely needs a second invoice.

5. Recommended stacks (composable)

Minimum dual-node stack: Hashvps Canada M4 24 GB (A, CI) + M4 16 GB (B, Gateway) + Tailscale tailnet + weekly openclaw doctor. Fits most teams upgrading from T0.
Hybrid CI stack: Self-hosted runner on A wired to GitHub Actions; Gateway + Channels on B. Orchestration stays in GitHub, execution on macOS, Gateway never absorbs build peaks. Aligns with “environment sovereignty” in 2027 macOS build market shift.
Observability stack: Gateway latency probe on B (curl every 5 s) + post-build hook on A snapshotting memory_pressure. Two weeks of data is usually enough to justify split ROI to management.
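
A minimal sketch of the observability stack, assuming the Gateway answers on /health as it does in the runbook. The helper names (probe_once, p95_of, snapshot_memory), the GATEWAY_URL variable, and the log file layout are hypothetical; curl, awk, sort, and memory_pressure are the only real tools involved.

```bash
# Default probe target; override GATEWAY_URL to probe through the tailnet name.
GATEWAY_URL="${GATEWAY_URL:-http://127.0.0.1:18789/health}"

# Node B: append one "epoch_seconds latency_seconds" sample to the log in $1.
# Failed or timed-out probes are recorded as "fail" instead of a number.
probe_once() {
  if latency=$(curl -o /dev/null -s --max-time 5 -w '%{time_total}' "$GATEWAY_URL"); then
    echo "$(date +%s) $latency" >> "$1"
  else
    echo "$(date +%s) fail" >> "$1"
  fi
}

# Print the nearest-rank 95th percentile of the successful samples in $1.
p95_of() {
  awk '$2 != "fail" { print $2 }' "$1" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      print v[int((NR * 95 + 99) / 100)]
    }'
}

# Node A: call from a post-build hook to append a raw memory snapshot to $1.
snapshot_memory() {
  { date; memory_pressure; } >> "$1"
}

# Example loop on node B (runs until interrupted):
#   while true; do probe_once "$HOME/gateway-latency.log"; sleep 5; done
```

Recording failures as a separate token keeps refused connections from showing up as suspiciously fast samples in the percentile.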

Stacks can layer: start with minimum dual-node, add observability before migration day, then wire GitHub Actions if you are not already on a self-hosted runner. Do not block the Gateway split on perfect CI hygiene—user-facing latency is the sharper pain.

6. Common mistakes before migration

Stop builds before moving Gateway: Build downtime is expensive; blue/green Gateway, keep CI running.
Fresh install instead of rsync state: Losing ~/.openclaw, tokens, and Channel pairings forces everyone to re-authenticate.
Two public IPs both exposing 18789: During cutover use Tailscale or private DNS—never let two Gateways accept the same Channels.
Delete Gateway dirs on build host too early: Keep the old Gateway process available inside the rollback window; change DNS only.
Ignore clock and certificates: NTP skew >30 s across hosts causes sporadic token validation failures; on migration day run sudo sntp -sS time.apple.com on both.
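
The clock check can be scripted as a coarse guard before cutover. This is a sketch under assumptions: the helper names are hypothetical, ssh access to the new host is assumed, and the comparison is only accurate to about a second because of the ssh round trip. That is enough to catch the 30 s threshold mentioned above.

```bash
# skew_seconds EPOCH_A EPOCH_B: print the absolute difference in seconds.
skew_seconds() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then diff=$(( 0 - diff )); fi
  echo "$diff"
}

# skew_ok EPOCH_A EPOCH_B [MAX_SECONDS]: succeed when the skew is within the limit.
# MAX_SECONDS defaults to 30, the threshold named in the list above.
skew_ok() {
  [ "$(skew_seconds "$1" "$2")" -le "${3:-30}" ]
}

# Example (hostname taken from the runbook; ssh access is assumed):
#   remote=$(ssh builder@mac-gw-02.tailnet-abc.ts.net date +%s)
#   skew_ok "$(date +%s)" "$remote" || echo "clock skew above 30 s, resync before cutover"
```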

The most expensive mistake we see is treating migration as “provision blank Mac, reinstall, hope.” Gateway state is small on disk but dense in coupling—Channel webhooks, device tokens, dashboard bookmarks. rsync plus DNS is faster and reversible. Treat the old host as hot standby for 48 hours, not landfill.

7. Runbook: seven steps to dual-node

Assumptions: original host mac-ci-01 runs CI+Gateway colocated; new host mac-gw-02 (16 GB) is Gateway-only. Tailscale already installed (see OpenClaw remote Mac ops runbook).

Step 1: Baseline snapshot (day before migration)

On the original host, record Gateway latency P50/P95, build count, and memory_pressure distribution for post-split comparison. Export openclaw status and screenshot the Channels list.

bash — baseline: Gateway latency and memory pressure

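# Sample Gateway latency 60 times, 5 s apart (about 5 minutes in total)
# curl reports time_total in seconds, so p50/p95 are printed in seconds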
for i in $(seq 1 60); do
  curl -o /dev/null -s -w "%{time_total}\n" http://127.0.0.1:18789/health
  sleep 5
done | sort -n | awk '{a[NR]=$1} END{print "p50="a[int(NR*0.5)],"p95="a[int(NR*0.95)]}'

memory_pressure
vm_stat | head -8

Step 2: Initialize new host and join Tailscale

On mac-gw-02: macOS updates, Homebrew, Node, Tailscale; confirm mutual ping with mac-ci-01 <5 ms. Do not run xcodebuild on this machine.
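
The 5 ms ping target can be checked mechanically rather than by eye. The sketch below parses the summary line that ping prints at the end of a run; the helper names are hypothetical, and it assumes the usual "min/avg/max" summary format.

```bash
# avg_rtt_ms: read ping output on stdin and print the average RTT in ms.
# Relies on the summary line "... min/avg/max/... = a/b/c/d ms".
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# rtt_below AVG_MS LIMIT_MS: succeed when AVG_MS is a value below LIMIT_MS.
# An empty AVG_MS (ping failed, no summary line) counts as a failure.
rtt_below() {
  awk -v a="$1" -v l="$2" 'BEGIN { if (a == "") exit 1; exit !(a + 0 < l + 0) }'
}

# Example, run on mac-gw-02:
#   avg=$(ping -c 5 mac-ci-01 | avg_rtt_ms)
#   rtt_below "$avg" 5 || echo "RTT to mac-ci-01 is ${avg:-unknown} ms, check the tailnet path"
```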

Step 3: rsync Gateway state (maintenance window starts)

bash — sync OpenClaw config from original host to Gateway host

# Run on the original host mac-ci-01; stop Gateway first to avoid split writes
sudo launchctl unload /Library/LaunchDaemons/com.openclaw.gateway.plist

rsync -avz --delete \
  ~/.openclaw/ \
  builder@mac-gw-02.tailnet-abc.ts.net:~/.openclaw/

# Copy the launchd plist
scp /Library/LaunchDaemons/com.openclaw.gateway.plist \
  builder@mac-gw-02.tailnet-abc.ts.net:/tmp/
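
Because tokens and Channel pairings live in the synced directory, it can be worth confirming that both copies match before starting Gateway on the new host. One possible approach, sketched with hypothetical helper names, is to compare checksum manifests built with the POSIX cksum tool.

```bash
# dir_manifest DIR: print "checksum size path" for every regular file, sorted by path.
dir_manifest() {
  ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}

# same_tree DIR_A DIR_B: succeed when both directories hold identical files.
same_tree() {
  [ "$(dir_manifest "$1")" = "$(dir_manifest "$2")" ]
}

# Example across hosts (hostname from the runbook; ssh access is assumed):
#   dir_manifest ~/.openclaw > /tmp/local.manifest
#   ssh builder@mac-gw-02.tailnet-abc.ts.net \
#     'cd ~/.openclaw && find . -type f -exec cksum {} + | sort -k 3' > /tmp/remote.manifest
#   diff /tmp/local.manifest /tmp/remote.manifest && echo "state matches"
```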

Step 4: Start Gateway on new host and verify locally

bash — load service and health check on mac-gw-02

sudo cp /tmp/com.openclaw.gateway.plist /Library/LaunchDaemons/
sudo launchctl load /Library/LaunchDaemons/com.openclaw.gateway.plist

openclaw doctor
curl -s http://127.0.0.1:18789/health
sudo lsof -iTCP:18789 -sTCP:LISTEN

Step 5: Cut traffic—Tailscale MagicDNS or reverse proxy

Point the team’s Gateway hostname (e.g. gateway.tailnet-abc.ts.net) at the new host; update mobile and Channels config to the new MagicDNS name. During cutover, do not restart Gateway on the original host.
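
After the name is repointed, a short retry loop against the team hostname shows when the cutover has taken effect. This is a sketch: wait_healthy is a hypothetical helper, and it assumes /health returns a 2xx status when the Gateway is ready.

```bash
# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a successful response or the attempts run out.
wait_healthy() {
  url=$1
  attempts=${2:-12}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors (4xx/5xx), not only on connection errors
    if curl -fsS --max-time 3 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Example, using the hostname from the step above:
#   wait_healthy http://gateway.tailnet-abc.ts.net:18789/health || echo "hold the cutover"
```

Running it from a client device rather than from the new host exercises DNS and the tailnet path as well as the Gateway process.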

Step 6: Lighten the build host

After Channels and Dashboard work on the new machine, unload Gateway launchd on the original host and return memory to CI. You can raise concurrent compile tasks back to 5–6 on a 24 GB host.
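
Before unloading, it helps to confirm that nothing is still connected to 18789 on the original host. A minimal sketch with hypothetical helper names, using the same lsof filters as the listen check in Step 4 but for established connections:

```bash
# established_count PORT: print how many TCP connections on PORT are established.
established_count() {
  # awk drops the lsof header line; tr strips the padding some wc builds add
  lsof -nP -iTCP:"$1" -sTCP:ESTABLISHED 2>/dev/null | awk 'NR > 1' | wc -l | tr -d ' '
}

# gateway_drained PORT: succeed when no established connections remain.
gateway_drained() {
  [ "$(established_count "$1")" -eq 0 ]
}

# Example on mac-ci-01 (sudo may be needed to see other users' sockets):
#   gateway_drained 18789 && echo "no clients left on 18789"
```

A single sample only describes one instant, so repeating the check a few times over several minutes gives a safer signal than one clean result.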

bash — remove Gateway from build host, restore CI concurrency

# On mac-ci-01: run only after confirming no traffic reaches 18789
sudo launchctl unload /Library/LaunchDaemons/com.openclaw.gateway.plist
sudo mv /Library/LaunchDaemons/com.openclaw.gateway.plist \
        /Library/LaunchDaemons/com.openclaw.gateway.plist.bak

defaults write com.apple.dt.Xcode \
  IDEBuildOperationMaxNumberOfConcurrentCompileTasks 6

Step 7: Observe 48 hours and keep rollback ready

Retain the .openclaw backup and plist.bak on the original host for seven days. If the new Gateway's P95 exceeds 200 ms or Channels drop, point DNS back, rename plist.bak to its original name, and launchctl load it; the CI queue stays untouched. For fresh Gateway install details, see OpenClaw remote Mac onboarding.

Success criteria after 48 hours: Gateway P95 at or below baseline, zero Channel auth regressions, build throughput equal or higher on A, and no Critical memory events on B during A’s peak compile window. Document the before/after numbers in the same thread where leadership approved the second host—it makes the next capacity conversation factual instead of emotional.
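
The rollback trigger and the success criteria can be combined into one verdict to record in that thread. The sketch below is an illustration, not a prescribed gate: cutover_verdict and its argument order are hypothetical, latencies are whole milliseconds, and "Channels drop" is approximated as any non-zero count of Channel auth regressions.

```bash
# cutover_verdict BASE_P95_MS NEW_P95_MS AUTH_REGRESSIONS BASE_BUILDS NEW_BUILDS CRITICAL_ON_B
# Prints "rollback", "success", or "keep observing".
cutover_verdict() {
  if [ "$2" -gt 200 ] || [ "$3" -gt 0 ]; then
    # Rollback trigger from Step 7: P95 above 200 ms or Channels dropping
    echo "rollback"
  elif [ "$2" -le "$1" ] && [ "$5" -ge "$4" ] && [ "$6" -eq 0 ]; then
    # All success criteria met: latency, build throughput, no Critical events on B
    echo "success"
  else
    echo "keep observing"
  fi
}
```

A "keep observing" result covers the middle ground, for example a P95 that is under 200 ms but still above the pre-split baseline.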

8. FAQ

Q1. I only have one 16 GB machine—can I split “logical roles” without buying a second host?

Not as a substitute for physical split. You can time-slice (night builds, daytime Gateway), but follow-the-sun or 7×24 Channels will hit the peak again. Logical separation only helps prove “latency recovers when Gateway moves” to justify procurement.

Q2. Must dual-node use Tailscale?

No, but strongly recommended. Same-cloud private network, self-hosted WireGuard, or SSH tunnels work; Tailscale wins on MagicDNS, ACLs, and low ops burden. Two Hashvps Canada M4s in-region usually see <2 ms RTT—enough for Gateway to call build webhooks on A.

Q3. How much more does dual-node cost vs single 32 GB?

Plan-dependent, but often close to “24 GB build + 16 GB small host” vs “single 32 GB.” The decisive math is provable Gateway SLA, not monthly rent alone. Teams with external users should price downtime, not just hardware delta.

Q4. If I use GitHub Actions hosted macOS runners, do I still need a self-hosted Mac?

Depends on environment sovereignty. Hosted macOS bills per minute—good for spikes. Self-hosted cloud Mac fits fixed >50 builds/day with keychain and DerivedData control. Gateway should still sit on its own node regardless of runner hosting model.

Q5. Fastest rollback if migration fails?

Point DNS back, then launchctl load the Gateway plist on the original host (restore it from plist.bak first if Step 6 already ran). Avoid large builds on rollback day; verify Gateway health and one Channel message before reopening CI floodgates. Keep duplicate state dirs until new metrics stabilize.

9. Conclusion

Splitting Mac M4 CI is not failure—it is the natural next stage after colocated tuning works: you proved the business needs builds and Gateway together, just not on one stick of RAM. Remember the four hard signals, isolate SLAs with T1 dual-node, blue/green with the seven-step runbook—the dividing line is resource isolation, not machine count.

If you are stuck in the “Gateway is lagging but builds cannot stop” window, add a 16 GB Gateway host before another round of Nice tweaks. Keep compiles on 24 GB in the cloud; let agents and users hold steady on 18789 elsewhere. That is the “near-production” topology small teams can afford in 2026.