Are atomic backups coming in version 2.0?

The current method is somewhat troublesome. I use scp to periodically sync the .config directory to my main server. Since the .config directory is dynamic, atomicity cannot be ensured during the scp process. It’s highly likely that the node program will modify the contents of the .config directory during synchronization, resulting in a corrupted or incorrect backup. Ensuring atomic synchronization of the .config directory without stopping the service is challenging, making it difficult to guarantee a successful backup. Will this be addressed in version 2.0?

1 Like

You could compress the folder prior to transit.
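If it helps, here is a minimal Go sketch of that idea: packing the .config folder into a single .tar.gz before copying it off. The paths are placeholders, and note that this still doesn’t make the snapshot atomic while the node keeps writing.

```go
// Illustrative sketch only: writes srcDir into one gzip-compressed tar file
// that can then be copied with scp. Does NOT guarantee a consistent snapshot.
package backup

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

func archiveDir(srcDir, outPath string) error {
	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()

	gz := gzip.NewWriter(out)
	defer gz.Close()
	tw := tar.NewWriter(gz)
	defer tw.Close()

	return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		rel, err := filepath.Rel(srcDir, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
}
```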

There won’t be support for atomic backups in the v2.0 release, as the Pebble database doesn’t support this feature. @cassie mentioned earlier that there are plans to replace Pebble in the future, so this may change.

1 Like

A simple way to support this, which should also be easy to implement, is to add an RPC that copies all modified/new files to a provided path while holding a database write lock.
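For illustration, a minimal Go sketch of what the server side of such an RPC might look like. The type names, lock handle, and store layout below are assumptions, not the node’s actual code.

```go
// Hypothetical sketch: copy changed/new files to a local path while the
// database write lock is held, so the copied files are mutually consistent.
package backup

import (
	"io"
	"os"
	"path/filepath"
	"sync"
	"time"
)

type BackupService struct {
	dbWriteLock *sync.RWMutex // assumed handle to the store's write lock
	storePath   string        // e.g. the node's .config/store directory
}

// Backup copies every file under storePath modified since the given time
// into destPath. Writers are blocked only for the duration of the copy.
func (b *BackupService) Backup(destPath string, since time.Time) error {
	b.dbWriteLock.Lock()
	defer b.dbWriteLock.Unlock()

	return filepath.Walk(b.storePath, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || info.ModTime().Before(since) {
			return err
		}
		rel, err := filepath.Rel(b.storePath, path)
		if err != nil {
			return err
		}
		dst := filepath.Join(destPath, rel)
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		return copyFile(path, dst)
	})
}

func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```

Holding the write lock only for the duration of the copy is what keeps the snapshot consistent without taking the node offline.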

That must be avoided. You’d obviously be introducing an endpoint that lets others take your files.

If a local path is provided (or the RPC uses a backup path set in the config yaml file), then it should be perfectly safe.

An alternative would be an RPC that can be used to lock and unlock database writes, so that an external backup process can take the lock (waiting for outstanding write operations to complete), perform the filesystem syncing, and then unlock again. That would be unnecessary complexity, though, both in the implementation of the RPC and in its usage, and it would also cause the lock to be held for longer, resulting in a larger performance hit for the node.
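For comparison, a hedged sketch of that alternative from the external backup process’s point of view. The /lock-writes and /unlock-writes endpoints and the reliance on rsync are illustrative assumptions, not an existing node API.

```go
// Hypothetical sketch of the lock/unlock alternative: pause writes via an
// assumed RPC endpoint, sync the store with rsync, then release the lock.
package backup

import (
	"fmt"
	"net/http"
	"os/exec"
)

func backupWithLock(nodeAddr, storePath, destPath string) error {
	if _, err := http.Post(nodeAddr+"/lock-writes", "application/json", nil); err != nil {
		return fmt.Errorf("lock: %w", err)
	}
	// Always release the lock, even if the copy fails.
	defer http.Post(nodeAddr+"/unlock-writes", "application/json", nil)

	cmd := exec.Command("rsync", "-a", storePath+"/", destPath+"/")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("rsync: %w: %s", err, out)
	}
	return nil
}
```

Because the lock is held across the whole external sync rather than just an in-process file copy, the write stall is longer, which is the performance cost mentioned above.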

Either way, a simple solution like this should be implemented as soon as possible. Many people have already lost their rewards due to corrupt backups, and currently the only way to avoid the risk of data loss is to shut down the entire node before performing a backup, which has a much larger cost due to the time needed for the initial bootstrap.

1 Like

What’s the advantage of an RPC that holds a lock over stopping the service and copying the files?

1 Like

Using an RPC to lock the database for just long enough to sync the changed/new files will cause no downtime and almost no loss of rewards.

Nodes take considerable time to perform all startup/bootstrap procedures, and only once those have completed can they begin performing proofs and earning rewards.

I just restarted a small node to test, and it took 260 seconds before the workers started performing proofs. That means that if you want to back up a large node/cluster once an hour, your node will be down for about 1.73 hours every day (24 restarts × 260 s ≈ 6,240 s) and you will lose roughly 7.2% of rewards on the current version.

The loss of rewards will, however, be much worse with 2.0’s introduction of prover rings: nodes will be penalized for being offline, so a node that is regularly stopped will earn far fewer rewards even during the time that it is running.

I’m referring to the loss of rewards here since that’s what most node operators will care about, but you should also consider the stability/reliability effects on the network if nodes constantly have to be shut down to perform backups.

3 Likes

@cassie has said:

…you’ll still want to backup the store in 2.0, but the difference is so that if you need to relaunch the node you don’t lose prover priority slot.

You won’t lose rewards in 2.0 if you lose your store.

…if you lost your store as an active prover you’ll be extremely likely to be unable to recover your prover data in time for your next proof window, which would get you demoted in priority rank. Doesn’t mean loss of earned token like it currently would, but it will mean loss of future earnings.

…Downtime = loss of prover slot priority

And on a similar note:

I suspect there may be concern around the prover slot priority question, given there won’t be a grace period like in prior upgrades.

How fast should you reasonably upgrade if you were to run update checks for a service: every 30, 60, or 600 seconds?

1 Like

Missing a single proof interval is something that can happen for a variety of reasons, and is not a cause for concern – more details on this shortly, but the TL;DR is that it takes several failed proof intervals (well beyond the course of an upgrade, restart, etc.) before a prover ring position is lost.

5 Likes

That is good to know; however, I believe I didn’t make my question clear, so let me try again :grin::

At what frequency should one check for updates so as to keep a priority slot?

From your answer, it seems like there actually will be a grace period, but that period is severely truncated from 24 hours to something like a few minutes.

1 Like