I don’t know if there will be another delay, but it’s important to keep in mind that this is probably the most advanced decentralized protocol launch in history, and it’s run by volunteers with day jobs, so delays are part of the process.
this is in bad taste, disrespectful to cassie’s hard work, and these comments are not appreciated. if you don’t have the patience for the careful release of a very complicated step in advancing this network, i suggest you find the door.
certainly, all of us want 2.0 to launch asap. but if you understand even a fraction of the documentation provided, you know this is not to be rushed in any way, and the extra precaution taken is deeply appreciated.
qm
I understand your frustration. Projecting launch times accurately is a difficult thing to do – organizations at the scale of SpaceX still can’t do it 100% of the time, because unpredictable factors (e.g. weather, in the case of SpaceX) get in the way. Crypto projects have missed the mark many times as well (consider the Merge taking five years for Ethereum). It can be really disappointing to get hyped up and prepared for a specific date, only to see it meet another delay.
As @abo mentioned, this project is run by a small unpaid team, and it is hard to nail down ETAs when testing can turn up issues.
We are moving diligently to confirm everything is safe to launch. I appreciate your patience.
In preparation for tonight, bootstrap operators, please reach out to me ahead of 10pm PDT – there will be specific steps for y’all that we’ll have to run through as we begin the rollout. I’ll try to get everyone into one place, so if you don’t have Telegram and can install it, please do so. If not, send me a DM. If you don’t know whether you’re a bootstrap operator, you’re not a bootstrap operator, so relax and enjoy the launch steps until 2.0 is published.
Testing is wrapped up on testnet, and we’re seeing some really interesting numbers from it. We’re waiting until all tests are settled before publishing performance figures under optimal/worst-case conditions, but it’s impressive enough to say we will handily be the highest-throughput crypto network in existence.
Now, onto the main(net) events.
Tonight at 10pm PDT we will start the mainnet rollout process, but I want to clarify the stages of this process so people don’t expect tonight to instantly be “pull update, we’re in 2.0 now” – the upgrade has several moving pieces that will take on the order of days to complete, so I’m going to outline them below:
Stage 1: Bootstrap Operators
At about 10pm PDT, a new branch will open up on the repo – v2.0-bootstrap. This is not the full v2.0 release; you will be very disappointed trying to run it in the hopes of being early. It is essentially a stripped-down node client, as light and maintenance-free as possible, with the sole purpose of getting the new and existing bootstrap peers up and running to handle v2 traffic.
Once a sufficient number have come online, we will begin a series of confirmation tests, outlined below with the concern being tested and how failures will be handled (a rough sketch of the main tunables involved follows the list):
- Peer Exchange - Are the bootstraps properly pruning peers in a way that allows the mesh to build efficiently?
- Success: peers never stick around for more than three heartbeats before being pruned with peer exchange (PX).
- Failure: peers are taking longer to prune or are not being pruned.
- Cause/Resolution 1: bootstraps cannot handle the volume because the heartbeat isn’t frequent enough – increase the heartbeat frequency
- Cause/Resolution 2: bootstraps cannot handle the volume because the prune RPC messages become too large and have to be split, and even that is too much – add more bootstraps
- Cause/Resolution 3: bootstraps cannot handle the volume because the connection manager’s high watermark is too low for the volume and it’s spiking – increase the watermark as much as we safely can.
- Connection Management - Are the bootstraps appropriately handling connections such that they stay below the high watermark?
- Success: high watermark is never breached.
- Failure: high watermark is breached.
- Cause/Resolution 1: Same as Peer Exchange Cause/Resolution 3 – bootstraps cannot handle the volume because the connection manager’s high watermark is too low for the volume and it’s spiking – increase the watermark as much as we safely can.
- Cause/Resolution 2: bootstraps are not cleaning up dead connections – adjust the connection manager strategy to sort connections from high to low stream count, except that connections with 0 streams rank highest of all
- Routing Success - Are the messages successfully routing to all peers once the mesh has been built?
- Success: Spawn 100k light peers, 5 relays, wait for mesh stabilization, and send messages peer by peer, no message loss.
- Failure: Spawn 100k light peers, 5 relays, wait for mesh stabilization, and send messages peer by peer, some/all message loss.
- Cause/Resolution 1: message propagation is not making it within TTL because mesh density is not high enough – increase default density targets.
- Cause/Resolution 2: message propagation is not making it within TTL because gossip heartbeat is not fast enough – increase gossip heartbeat frequency
- Cause/Resolution 3: message propagation is dropping messages due to an issue unrelated to blossomsub tuning – review and address from trace logs
- Cause/Resolution 4: relays are oversaturated and cannot handle the workload – this would be surprising, but temporarily increase relays and retry, or enable the bootstrap gateway plan (bootstraps directly bandwidth-test a peer before accepting it) to remove support for relays, then retry the tests
- Upper Limits - What is the upper bound on connections before peers are not making it into the mesh or reaching the bootstraps?
- Success: Set bootstrap peers to different watermarks, find the breaking point
- Failure: The watermark cannot be set high enough to handle 10x current peer count.
- Cause/Resolution 1: Peers don’t make it into the mesh but do reach the bootstraps – same resolution as Peer Exchange Cause/Resolution 2 – add more bootstraps
- Cause/Resolution 2: Peers cannot reach the bootstraps because the highest watermark allowed is constantly saturated – add more bootstraps
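For context on the knobs referenced above (watermarks, heartbeat frequency, mesh density, PX), here is a minimal sketch of how the equivalent settings look in the upstream go-libp2p / go-libp2p-pubsub APIs. Quilibrium runs its own forked stack (blossomsub), so the names, defaults, and values below are illustrative assumptions rather than the actual node configuration.

```go
package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	// Connection manager: the low and high watermarks bound how many
	// connections a bootstrap keeps before trimming back down. These numbers
	// are placeholders, not Quilibrium's production values.
	cm, err := connmgr.NewConnManager(
		2048, // low watermark: trim back down to this many connections
		8192, // high watermark: start trimming once this is exceeded
		connmgr.WithGracePeriod(30*time.Second), // newly connected peers are briefly exempt
	)
	if err != nil {
		panic(err)
	}

	h, err := libp2p.New(libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}

	// Gossip parameters: the mesh density targets (D/Dlo/Dhi) and the heartbeat
	// interval are the tunables behind "increase default density targets" and
	// "increase the heartbeat frequency" above.
	params := pubsub.DefaultGossipSubParams()
	params.D = 8                                   // target mesh degree per topic
	params.Dlo = 6                                 // graft more peers below this
	params.Dhi = 12                                // prune peers above this
	params.HeartbeatInterval = 700 * time.Millisecond

	_, err = pubsub.NewGossipSub(context.Background(), h,
		pubsub.WithGossipSubParams(params),
		pubsub.WithPeerExchange(true), // hand off peers via PX when pruning
	)
	if err != nil {
		panic(err)
	}
}
```

The test plan above is essentially a search over parameters like these: raise the heartbeat frequency or watermarks when pruning lags, and add bootstraps when a single node’s watermark ceiling is reached.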
These test cases will take some time to run, and if test failures occur we will have to resolve and re-run. Expect this to take two to three days in the best case to fully work through, as bootstrap operators span many time zones. It is critical that these tests be run, as the bootstraps are fundamental to ensuring the network mesh builds successfully into the closest approximation of an optimal spanning tree.
The dashboard will be inaccurate during this stage, and we strongly recommend against restarting your node during this time as it will take far longer to find a bootstrap that can service your request (we’re leaving only one on for < 2.0 during this process).
Stage 2: 2.0 Release
At this point, 2.0 will be published. The network will start under a stasis lock that can only be unlocked, and the 2.0 genesis created, by a release-quorum-signed message. This means there is no advantage to running a node from source before the release is signed: the unlock message cannot be generated ahead of the release, and it is emitted on the network by each signer individually so that release signers cannot be uniquely advantaged either. An additional benefit of the stasis lock is that it gives the network a chance to fully build out the mesh before traffic starts flowing heavily, providing an opportunity to measure network health ahead of time and preventing early mesh thrashing from being disruptive during rollout. Once the stasis lock is lifted, prover slot enrollment is enabled and the network is effectively online.
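To make the stasis-lock mechanism concrete, here is a minimal, hypothetical sketch of the quorum check: each known release signer independently emits a signature over the unlock payload, and the lock lifts only once a threshold of distinct, valid signatures has been observed. The type, field names, and threshold below are illustrative assumptions, not the actual Quilibrium implementation.

```go
package stasis

import "crypto/ed25519"

// UnlockSignature is a hypothetical representation of one signer's
// individually emitted message over the release/unlock payload.
type UnlockSignature struct {
	SignerIndex int
	Signature   []byte
}

// quorumReached reports whether enough distinct, known release signers have
// produced valid signatures over the unlock payload to lift the stasis lock.
// signerKeys and threshold stand in for the real release quorum.
func quorumReached(
	signerKeys []ed25519.PublicKey,
	unlockPayload []byte,
	sigs []UnlockSignature,
	threshold int,
) bool {
	seen := make(map[int]bool)
	valid := 0
	for _, s := range sigs {
		if s.SignerIndex < 0 || s.SignerIndex >= len(signerKeys) {
			continue // unknown signer
		}
		if seen[s.SignerIndex] {
			continue // count each signer at most once
		}
		if ed25519.Verify(signerKeys[s.SignerIndex], unlockPayload, s.Signature) {
			seen[s.SignerIndex] = true
			valid++
		}
	}
	return valid >= threshold
}
```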
Evaluation Criteria:
- Mesh Building – does the network mesh stabilize?
- Success: The network sees minimal mesh rebuilding once the majority of nodes have come online.
- Failure: The network sees constant mesh rebuilding.
- Cause/Resolution 1: Some peers are crashing – identify crash failure and resolve
- Cause/Resolution 2: Some peers are deliberately thrashing the mesh – increase decay threshold for peer misbehavior (see the scoring sketch after this list)
- Prover Slot Enrollment – are we enrolling at the appropriate rate?
- Success: Prover slot enrollments appropriately shard out at the subshard threshold value
- Failure: Prover slot enrollments do not appropriately shard out at the subshard threshold value
- Cause/Resolution: Peers are missing prover slot updates, leading them to believe they should not enroll into a lower subshard – Invoke Alert, war room on resolution
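On the “decay threshold for peer misbehavior” point above: in the upstream go-libp2p-pubsub API this maps to the behaviour penalty and decay fields of the peer score parameters. The sketch below shows where those knobs live; the weights and thresholds are illustrative assumptions, and Quilibrium’s blossomsub fork may expose different names and defaults.

```go
package scoring

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// newScoredPubSub wires gossipsub up with peer scoring. All numeric values
// are placeholders, not Quilibrium's production settings.
func newScoredPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
	score := &pubsub.PeerScoreParams{
		// How often accumulated counters decay, and the value below which a
		// counter is treated as zero (the "decay threshold").
		DecayInterval: time.Second,
		DecayToZero:   0.01,

		// Penalise peers that repeatedly misbehave in the mesh.
		BehaviourPenaltyWeight:    -10,
		BehaviourPenaltyThreshold: 6,
		BehaviourPenaltyDecay:     0.9,

		AppSpecificScore: func(p peer.ID) float64 { return 0 },
		Topics:           map[string]*pubsub.TopicScoreParams{},
	}
	thresholds := &pubsub.PeerScoreThresholds{
		GossipThreshold:   -100, // stop emitting gossip to peers below this score
		PublishThreshold:  -200, // stop publishing to peers below this score
		GraylistThreshold: -400, // ignore traffic from peers below this score
	}
	return pubsub.NewGossipSub(ctx, h, pubsub.WithPeerScore(score, thresholds))
}
```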
Stage 3: Application Rollout
The bridge application must be deployed first in a paused state, so that we have a target address to mint the previously bridged/wrapped QUIL to when the token application is deployed. The token application will be deployed afterwards, with initial state populated to the reward addresses (or bridge address, as appropriate).
Stage 4: Unpausing the Bridge
The bridge application is unpaused, the Ethereum side of the bridge is resumed, and we are fully live with 2.0.
Followups
- LLTI snapshot will occur at stasis unlock time
- Q Console details will be outlined in a separate post
- SDK integration guide will be outlined in a separate post
This is so awesome! Cannot wait to see what this new era will bring. I feel like a kid on Christmas Eve!!
Bootstrap operators: if you have not joined the war room chat and updated to v2.0-bootstrap, please reach out – after today the non-updated bootstrap nodes will be removed.
Hi
I’m a node holder. I want to help the team and join as a bootstrap operator. Can I?
Quick update on Stage 1:
- Removing non-upgrading bootstrappers. If your multiaddr is in this list, please reach out:
- /dns/quil.dfcnodes.eu/udp/8336/quic-v1/p2p/QmQaFmbYVrKSwoen5UQdaqyDq4QhXfSSLDVnYpYD4SF9tX
- /ip4/148.251.9.90/udp/8336/quic-v1/p2p/QmRpKmQ1W83s6moBFpG6D6nrttkqdQSbdCJpvfxDVGcs38
- /ip4/35.232.113.144/udp/8336/quic-v1/p2p/QmWxkBc7a17ZsLHhszLyTvKsoHMKvKae2XwfQXymiU66md
- /ip4/34.87.85.78/udp/8336/quic-v1/p2p/QmTGguT5XhtvZZwTLnNQTN8Bg9eUm1THWEneXXHGhMDPrz
- /ip4/34.81.199.27/udp/8336/quic-v1/p2p/QmTMMKpzCKJCwrnUzNu6tNj4P1nL7hVqz251245wsVpGNg
- /ip4/34.143.255.235/udp/8336/quic-v1/p2p/QmeifsP6Kvq8A3yabQs6CBg7prSpDSqdee8P2BDQm9EpP8
- /ip4/34.34.125.238/udp/8336/quic-v1/p2p/QmZdSyBJLm9UiDaPZ4XDkgRGXUwPcHJCmKoH6fS9Qjyko4
- /ip4/34.80.245.52/udp/8336/quic-v1/p2p/QmNmbqobt82Vre5JxUGVNGEWn2HsztQQ1xfeg6mx7X5u3f
- Lower volume tests have concluded without issue in spite of the smaller number of bootstraps – mesh building is working very efficiently. We will really put this to the test next; so far, no configuration adjustments have been needed, even at blast rates of nodes emitting thousands of messages per second.
Next up is the peer test at 100k peers.
For folks who are interested in running a bootstrap, you can always feel free to submit a PR, but please be advised that bootstrap peers do not earn QUIL, consume a lot of bandwidth, and carry a high bar of responsibility to stay up to date for as long as we still use them.
Update: Things are moving along well. A few adjustments have been required, as expected, but overall it’s holding strong – stronger than we intended to test, as a bug in the node deployer kicked off more than twice the intended load and ramped up to over 200,000 peers. (RIP my infra billing)
Current state:
Peer Exchange - Are the bootstraps properly pruning peers in a way that allows the mesh to build efficiently? After adjustments, sustained at (accidentally) over 200,000 peers
Connection Management - Are the bootstraps appropriately handling connections such that they stay below the high watermark? Adjustments already made prior to test, also sustained at (accidentally) over 200,000 peers
Next up:
Routing Success - Are the messages successfully routing to all peers once the mesh has been built?
Upper Limits - What is the upper bound on connections before peers are not making it into the mesh or reaching the bootstraps?
Sorry, but… will 2.0 be released this week?
Hi Cassie. I would like to ask you a few questions on behalf of everyone. These are questions that everyone is very concerned about.
- Is the development team composed solely of Cassie, or are there multiple people working on different tasks?
- How many people are currently helping you with testing? Were they selected from the miners?
- The last question that everyone cares about: Can you provide a relatively accurate launch date? Will it be within August or by Christmas, so that miners can plan their electricity usage and mining equipment accordingly? Everyone hopes to arrange their resources properly.
You should post this as a separate topic as this topic is for the launch updates. Thanks
Update: Progress continues, with the routing stage completed – all in all, the performance of message routing has proven quite good in optimal conditions, even under the pressure of hundreds of millions of messages over test intervals.
Current state:
Peer Exchange - Are the bootstraps properly pruning peers in a way that allows the mesh to build efficiently? After adjustments, sustained at (accidentally) over 200,000 peers
Connection Management - Are the bootstraps appropriately handling connections such that they stay below the high watermark? Adjustments already made prior to test, also sustained at (accidentally) over 200,000 peers
Routing Success - Are the messages successfully routing to all peers once the mesh has been built? Tests passed, 100M message/s pressure test was successful under optimal settings.
In progress:
Upper Limits - What is the upper bound on connections before peers are not making it into the mesh or reaching the bootstraps?
Additionally, more of the inner components are being pieced out as separate libraries with more permissive licenses (MIT) as a potential boon for other protocols with similar needs:
- go-libp2p-stack: The go-libp2p-stack repo (GitHub - QuilibriumNetwork/go-libp2p-stack: Mirror of the forked go-libp2p stack used by Quilibrium) will be updated once again with the changes made from these tests so other networks seeing strain from gossipsub can consider a higher performance alternative.
- rpm: The mixnet library has been released (GitHub - QuilibriumNetwork/rpm: Mirror of the RPM implementation used by Quilibrium), we are already aware of another L1 that is investigating its use for solving MEV-adjacent problems.
- secure-channel: The secure pairwise and group broadcast channel libraries are the next to be published separately, which includes double/triple-ratchet.
regarding this, the answer has been posted in the separate thread: Hope Cassie can answer three questions - #2 by cassie
PSA: I’ve started deleting “when” type messages from this topic, as they aren’t productive. Updates are consistently posted and the release will happen when all the steps are completed. The best you can do to accelerate the process is to be supportive.
Update from @cassie:
We’re in the last stage of tests for bootstraps. I’m grinding through a few specific edge cases, testing the limits of how far the bootstraps can be pushed so they can safely limit themselves and not get knocked over by an attacker.
[S]tages 2, 3, and 4 will happen in the span of about a 24-hour period; it’s getting through this hurdle – making sure nobody gets sidelined by any kind of attack while trying to peer up during those stages – that’s critical.
We’re trying to make sure nodes have a proper, fair shot at joining prover rings, and part of that is making sure the bootstraps aren’t going to be encumbered by an attack scenario.
The number of bugs I’ve fixed throughout the libp2p stack in the past few weeks is going to be a net win for a lot of protocols.
Update from @cassie:
It’s much less noisy to give status updates at intervals where objectives are resolved rather than “still in progress”, but now that several major hurdles in the connection limits testing have been cleared, we’re nearing the end of the last bootstrap stage. Still more testing is needed, but one of the most tedious parts is done. For those interested in seeing the pain and suffering involved: perf grinding at the extremes · QuilibriumNetwork/ceremonyclient@6a7cbab · GitHub
Update on the status page from @cassie:
2.0 Upgrade: Mainnet
Bootstrap verification is nearing completion, rollout will begin this week – more granular time info will be provided as we near conclusion. Please visit https://discourse.quilibrium.com/ for more details of this process and incremental notes.
Update on the status page from @cassie:
Availability for all release signers has been confirmed for the range required to step through the final stages of mainnet deployment. Release signing and binary availability will begin 2024/09/04 at 4:00PM UTC, with the final stages of deployment and full stasis unlock at approximately 2024/09/05 at 4:00PM UTC.