Upgrades are where a lot of agent backup plans quietly reveal they were just file copying with better branding.
An OpenClaw upgrade can touch more than the app binary. It can change config shape, rewrite workspace files, break a skill path, invalidate cached tool state, or restart the gateway while memory is mid-write. If you only back up the obvious project folder, the agent might boot after restore and still be useless. It has files, but not the operating context that made it your agent.
The operator thesis is simple: a pre-upgrade backup is not a souvenir. It is a rollback contract. Before you upgrade OpenClaw, you need one known-good snapshot, one restore target, and one short validation path that proves the agent can return to work if the update makes a mess.
The pre-upgrade backup scope
A safe OpenClaw upgrade backup should capture the workspace as a working system, not a folder with timestamps.
Start with the workspace root and the files the agent edits directly. Then include memory files, skills, configs, cron or scheduled job definitions, local scripts, credentials (encrypted client-side before they are captured), agent identity files, and multi-agent routing state if more than one agent shares the machine.
The most common mistake is treating credentials and scheduler state as "environment" instead of backup data. That is how restores look clean in a directory listing and fail the first time the agent tries to call a tool or run a scheduled job.
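One way to keep credentials and scheduler state from being left out is to make the scope explicit and check it before capture. The sketch below assumes a hypothetical workspace layout; the category names and relative paths are illustrative, not OpenClaw defaults, so swap in whatever your workspace actually uses.

```python
from pathlib import Path

# Hypothetical categories and relative paths. Adjust to the real workspace layout.
REQUIRED_SCOPE = {
    "workspace": ["."],
    "memory": ["memory"],
    "skills": ["skills"],
    "configs": ["config"],
    "scheduled_jobs": ["cron"],
    "scripts": ["scripts"],
    "credentials_encrypted": ["credentials.enc"],
    "agent_identity": ["IDENTITY.md"],
}


def missing_scope(workspace_root, scope=REQUIRED_SCOPE):
    """Return the scope categories for which none of the expected paths exist."""
    root = Path(workspace_root)
    missing = []
    for category, rel_paths in scope.items():
        if not any((root / rel).exists() for rel in rel_paths):
            missing.append(category)
    return missing
```

An empty list means every category has at least one path on disk. It does not prove the contents are good, only that nothing was silently classified as "environment" and skipped.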
For production agents, set a pre-upgrade recovery point objective of 0 minutes. That means the snapshot happens immediately before the upgrade window, not earlier that morning. Set a recovery time objective of 30 minutes for a single-agent workspace and 2 hours for a multi-agent workspace. If those numbers sound aggressive, good. Upgrade rollback is supposed to be boring and fast.
Minimum parameters before you touch the upgrade
Use explicit parameters. Vibes do not restore agents.
Set snapshot age limit to 15 minutes at upgrade start: aim for zero, and recapture if the snapshot is older than that. Set memory flush wait to 60 seconds before capture. Set scheduled job pause window to 30 minutes if jobs can write during the upgrade. Set rollback decision deadline to 20 minutes after the first failed health check. Set minimum retained pre-upgrade snapshots to 2, one immediate and one from the previous known-good day.
Set config checksum coverage to 100% for files that control model routing, tool access, workspace paths, and scheduler behavior. Set credential verification to read-only checks only. Set restore drill cadence to monthly for production agents and before any major OpenClaw version jump. Set post-upgrade observation window to 24 hours before you even schedule the pre-upgrade snapshot for deletion. Set immutable hold to 7 days so that nothing, including the same broken environment, can delete its own escape hatch before then.
Those parameters are not pretty. They are useful. Pretty usually loses to a bad migration at 1am.
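Writing the parameters down as data makes them checkable instead of aspirational. This is a minimal sketch, not an OpenClaw or Keepmyclaw config format: `UpgradeParams` and `preflight_problems` are hypothetical names, and only two of the parameters are enforced here to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class UpgradeParams:
    # Defaults mirror the values listed in this section.
    snapshot_age_limit: timedelta = timedelta(minutes=15)
    memory_flush_wait: timedelta = timedelta(seconds=60)
    job_pause_window: timedelta = timedelta(minutes=30)
    rollback_decision_deadline: timedelta = timedelta(minutes=20)
    min_retained_snapshots: int = 2
    observation_window: timedelta = timedelta(hours=24)
    immutable_hold: timedelta = timedelta(days=7)


def preflight_problems(params, snapshot_taken_at, retained_snapshots, now=None):
    """Return reasons the upgrade should not start yet (empty list means go)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    age = now - snapshot_taken_at
    if age > params.snapshot_age_limit:
        problems.append(f"snapshot is {age} old, limit is {params.snapshot_age_limit}")
    if retained_snapshots < params.min_retained_snapshots:
        problems.append(
            f"only {retained_snapshots} pre-upgrade snapshot(s) retained, "
            f"need {params.min_retained_snapshots}"
        )
    return problems
```

Pass timezone-aware timestamps for `snapshot_taken_at`, since the default `now` is in UTC and Python refuses to subtract naive from aware datetimes.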
Failure scenario 1: memory survives, but the schema changes
The upgrade completes. OpenClaw starts. The agent answers basic prompts. Then it starts ignoring instructions it used yesterday.
This happens when memory files remain present but the upgrade changes how the agent reads them. A heading changes. A metadata field moves. A path convention gets updated. The content is still there, but the new runtime does not interpret it the same way.
Mitigation: capture a memory checksum and a memory shape check before the upgrade. Do not only count files. Record expected memory file count, key filenames, total byte size, last modified time, and a small set of identity assertions the agent should answer after restore. Keep at least one pre-upgrade snapshot for 7 days after the update, because this failure often shows up after a few sessions.
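The memory record described above can be sketched in a few lines. This assumes memory lives as ordinary files under one directory, which may not match your setup; `memory_shape` is a hypothetical helper, and the identity assertions still need a human-written list of questions and expected answers.

```python
import hashlib
from pathlib import Path


def memory_shape(memory_dir):
    """Summarize a memory directory: count, names, size, mtime, and one checksum."""
    root = Path(memory_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    total_bytes = 0
    latest_mtime = 0.0
    for path in files:
        data = path.read_bytes()
        # Hash the relative name and length along with the bytes so that
        # renames and moved boundaries between files change the checksum.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
        total_bytes += len(data)
        latest_mtime = max(latest_mtime, path.stat().st_mtime)
    return {
        "file_count": len(files),
        "filenames": [p.relative_to(root).as_posix() for p in files],
        "total_bytes": total_bytes,
        "last_modified": latest_mtime,
        "checksum": digest.hexdigest(),
    }
```

Store the result in the pre-upgrade manifest and compare it after a restore. A matching checksum proves the bytes came back. It does not prove the new runtime reads them the same way, which is what the identity assertions are for.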
Failure scenario 2: skills load from the wrong version
A skill update looks harmless until the agent uses it. The skill directory exists, but a dependency changed, a prompt section was overwritten, or a local convention disappeared. The agent still has a skill by that name. It just behaves differently.
This is worse than a missing skill because it hides. Missing files fail loudly. Wrong skills fail with confidence.
Mitigation: include skill directory checksums in the pre-upgrade manifest. Record skill count, skill names, modified timestamps, and the hash of each SKILL.md file. After upgrade, run one safe smoke test per critical skill. For example, list configuration, read a harmless file, or generate a dry-run plan. Do not test by letting the skill modify production. That is how you turn validation into the incident.
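Hashing each SKILL.md before and after the upgrade turns "it behaves differently" into a diff you can read. The sketch assumes each skill sits in its own directory containing a SKILL.md file; `skill_hashes` and `skill_drift` are hypothetical names, and dependencies outside SKILL.md would need their own checks.

```python
import hashlib
from pathlib import Path


def skill_hashes(skills_dir):
    """Map each skill directory (relative path) to the SHA-256 of its SKILL.md."""
    root = Path(skills_dir)
    return {
        path.parent.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("SKILL.md"))
    }


def skill_drift(before, after):
    """Compare two skill_hashes() results taken before and after the upgrade."""
    shared = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

Anything under "changed" is a candidate for a smoke test before the agent is trusted with it again. An upgrade may legitimately change a skill, so treat drift as a prompt to look, not as an automatic failure.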
Failure scenario 3: scheduled jobs keep the old assumptions
Scheduled jobs are easy to forget because they run in the background. The upgrade changes a path or environment variable. The agent comes back up, but the job that backs up memory, checks revenue, or syncs state starts failing silently.
You usually find this one late. The dashboard looks fine. Then three days later, no reports, no alerts, no fresh snapshots. Lovely little trap.
Mitigation: pause write-heavy scheduled jobs during the upgrade window, then resume them deliberately. Back up job prompts, schedules, enabled skills, delivery targets, script paths, and context dependencies. After upgrade, run one safe read-only scheduled job manually and verify output appears in the expected location. Set max scheduler verification delay to 15 minutes after upgrade completion.
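The 15-minute scheduler check is easy to automate if the job writes to a known file. A minimal sketch under that assumption: `scheduler_check` is a hypothetical helper, and it only looks at the output file's modification time, not at whether the content is correct.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_SCHEDULER_VERIFICATION_DELAY = timedelta(minutes=15)


def scheduler_check(output_path, upgrade_completed_at, now=None):
    """Return 'ok', 'pending', or 'failed' for one scheduled job's output file."""
    now = now or datetime.now(timezone.utc)
    path = Path(output_path)
    if path.exists():
        written_at = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if written_at >= upgrade_completed_at:
            return "ok"  # fresh output written after the upgrade finished
    if now - upgrade_completed_at <= MAX_SCHEDULER_VERIFICATION_DELAY:
        return "pending"  # still inside the verification window
    return "failed"  # no fresh output and the window has closed
```

A stale file from before the upgrade counts as no output at all, which is the point: the silent failure in this scenario is a job that used to work.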
Failure scenario 4: credentials restore but no longer bind
Keepmyclaw backs up credentials only after they are encrypted client-side, but restored credentials still have to bind to the upgraded runtime. A provider token may be valid, but the config points to an old path. A local auth helper may expect a different CLI version. A restored environment variable may be present but loaded by the wrong shell profile.
The agent looks healthy until it touches the outside world. Then every tool call becomes a permissions mystery.
Mitigation: never print secrets during validation. Verify presence, scope, age, and read-only access. Store encrypted credential blobs with enough metadata to know what should exist: provider name, auth method, file path, required environment variable names, and last verified time. Set credential rebind target to 60 minutes. If the agent cannot pass safe read-only auth checks inside that window, rollback beats improvising.
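The metadata above is enough to check binding without ever reading a secret back out. This sketch covers presence only; scope, age, and read-only access need provider-specific calls that are out of scope here. `CredentialRecord` and `binding_problems` are hypothetical names, not a Keepmyclaw schema.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class CredentialRecord:
    provider: str
    auth_method: str
    file_path: Optional[str] = None
    required_env_vars: List[str] = field(default_factory=list)


def binding_problems(record, environ=None):
    """Check that expected files and variables exist. Never returns secret values."""
    environ = os.environ if environ is None else environ
    problems = []
    if record.file_path and not Path(record.file_path).expanduser().is_file():
        problems.append(f"{record.provider}: expected credential file is missing")
    for name in record.required_env_vars:
        # Report only the variable name; the value is never read into a message.
        if not environ.get(name):
            problems.append(f"{record.provider}: environment variable {name} is unset or empty")
    return problems
```

Run it from the same shell or service context the agent uses. A variable that is present in your interactive terminal but missing from the agent's environment is exactly the wrong-profile failure described above.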
Failure scenario 5: multi-agent state crosses wires
Multi-agent setups make upgrades more interesting in the worst way. One agent may share scripts with another. One scheduler may call a shared helper. One workspace path may move while another stays put. After restore, the files are there, but the wrong agent owns the wrong state.
That is not recovery. That is identity theft with a nicer folder structure.
Mitigation: every pre-upgrade snapshot should include agent ID, workspace root, hostname, OpenClaw version, active model profile, skill count, scheduler job count, and memory checksum. Reject restores when the manifest does not match the intended agent unless the operator explicitly approves a migration restore. For shared scripts, record both the script checksum and the agents that depend on it.
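The manifest fields above map directly onto a small record, and the reject rule onto one function. Which fields must match the restore target is an assumption here (agent ID, workspace root, and hostname); the names `SnapshotManifest` and `restore_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotManifest:
    agent_id: str
    workspace_root: str
    hostname: str
    openclaw_version: str
    model_profile: str
    skill_count: int
    scheduler_job_count: int
    memory_checksum: str


def restore_allowed(manifest, target_agent_id, target_workspace_root,
                    target_hostname, migration_approved=False):
    """Return (allowed, mismatches). Mismatches block unless a migration is approved."""
    expected = {
        "agent_id": target_agent_id,
        "workspace_root": target_workspace_root,
        "hostname": target_hostname,
    }
    mismatches = [
        f"{name}: snapshot has {getattr(manifest, name)!r}, target is {value!r}"
        for name, value in expected.items()
        if getattr(manifest, name) != value
    ]
    return (not mismatches or migration_approved), mismatches
```

The mismatch list is returned even when a migration restore is approved, so the operator sees exactly what they are overriding.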
A clean upgrade runbook
Before the upgrade, stop new risky work. Let active writes finish. Flush memory. Pause write-heavy scheduled jobs. Capture the immediate pre-upgrade snapshot. Verify the snapshot decrypts locally. Check the manifest for workspace files, memory, skills, configs, scheduled jobs, encrypted credentials, scripts, and multi-agent state.
Then run the upgrade.
After the upgrade, do not celebrate because the process started. That bar is on the floor. Verify the agent identity. Check memory answers against the pre-upgrade assertions. Run safe skill smoke tests. Run one read-only scheduled job. Verify credential access with safe API calls. Check that the newest post-upgrade snapshot captures the new version cleanly.
If any critical check fails, use the rollback decision deadline. Do not spend two hours debugging while the known-good snapshot gets older and the agent writes more questionable state. Roll back, preserve logs, and try the upgrade again when the failure is understood.
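The deadline only works if something other than a tired operator enforces it. A minimal sketch of the decision, assuming you record when the first critical check failed; `rollback_decision` is a hypothetical helper and the 20-minute value comes from the parameters earlier in this post.

```python
from datetime import datetime, timedelta
from typing import Optional

ROLLBACK_DECISION_DEADLINE = timedelta(minutes=20)


def rollback_decision(first_failure_at: Optional[datetime],
                      all_checks_passing: bool,
                      now: datetime) -> str:
    """Return 'continue', 'debug', or 'rollback' for the current upgrade."""
    if first_failure_at is None or all_checks_passing:
        return "continue"  # nothing has failed, or the failure has been fixed
    if now - first_failure_at >= ROLLBACK_DECISION_DEADLINE:
        return "rollback"  # deadline reached: stop debugging, restore the snapshot
    return "debug"  # still inside the window for a quick fix
```

Call it on every health check pass. Once it returns "rollback", the remaining work is the runbook: roll back, preserve logs, and schedule another attempt.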
Where Keepmyclaw fits
Git can help with code. Local scripts can help with a folder. Neither is enough when the thing you are protecting is an agent runtime.
Keepmyclaw is built for the awkward part of OpenClaw recovery: the operating context. Workspace files, memory, skills, configs, scheduled jobs, encrypted credentials, and multi-agent setup state belong in one coherent recovery unit. Before an upgrade, that recovery unit is the difference between "we can roll back in 20 minutes" and "I guess we rebuild the agent from scattered notes."
Upgrades should make the agent better, not erase the weird little pile of context that made it useful. Back that pile up before you touch the shiny new version.