Network partitions are indeed like a fork. I'll first review what happened in 1.4.18 that caused them, and then answer your specific questions.
Root Cause
When 1.4.18 landed, a lot of nodes dropped from the mesh: some because they were not behaving correctly, others because they were on hosting providers that disallow crypto, and the higher CPU usage in 1.4.18 finally raised a tripwire for those hosts. (We have repeatedly discouraged people from using hosts that disallow crypto.)
Downstream Impacts
During the initial week of 1.4.18, the massive dropoff of nodes produced a network partition, resulting in message propagation failures for the nodes that became isolated. Generally, you can think of the network structure as a mesh: if you take a large chunk out of the mesh all at once, two independent networks can operate for a time until the mesh heals. Because this wasn't a single chunk being removed, but rather many different hosting providers issuing sweeping bans, the entire first week of 1.4.18 was chaotic. A lot of reward info was initially missing from the rewards site, and the subsequent recalibration caused further issues, as we did not yet have a central channel for comms like we have now.
Immediate Response
To heal some of the mesh, we restored our DHT-only bootstrap nodes to full nodes, as many peers retain a connection to them when they answer on the main gossip bitmask. This had the unfortunate consequence of placing enormous memory pressure on those bootstraps, causing them to frequently restart as they ran out of memory handling the volume.
Follow-up
In 1.4.19, we applied several long-needed updates to our forks of libp2p and gossipsub, which helped with some of the above issues. As a consequence, message propagation has improved, but it is still nowhere close to the ideal state.
How does Quilibrium’s architecture address the impact of network partition attacks on transaction validation and recording?
Right now, the mode of operation is essentially all peers of the network communicating with all peers at all times. This was acceptable at the onset, when the network was small during our launch preparations, but it has continually surfaced new issues as time has gone on. It was also never an intended end state: the main topic all nodes subscribe to is not meant to be a constant channel for heavy traffic. Its current use in this way was intended to find exactly where things start to break down, and whether those issues will resurface in 2.0 when everything fans out into various bitmasks.
In the original whitepaper, and for 2.0, the intention is that the master bitmask sees very little traffic at all; instead, the individual core shard bitmasks of the network receive the bulk of it. These have far fewer expected peers, as each bitmask covers only a small slice of the network. That said, it is clear that gossipsub (and thus the current iteration of blossomsub) is susceptible to a form of DDoS through intentionally good behavior followed by heavy partitioning. Because a strong structural pattern exists in these core shards, we have removed the use of gossipsub altogether for the core shards, and will instead use an approach very tightly aligned with the structure of the network itself.
Specifically, what mechanisms are in place to mitigate and resolve these issues to ensure the system’s consistency and security?
In the context of double spend attacks, the mitigation strategy for core shards is a very simple model, similar to GHOST (greedy heaviest observed sub-tree), but with the weight of observations carried by priority slots: the first ring of provers receives the greatest quantity of rewards on a core shard, with a decay as additional provers fill subsequent rings. Any member can challenge a prover that misbehaves, and misbehavior is easily determined by the presence of conflicting frames. Because frames are unforgeably linked to a given prover, if a conflicting frame is produced, a member can provide both frames and evict the prover from the ring.
Bringing things back to the network partition context
If the prover ring were somehow partitioned evenly within the duration of a single global proof, such that the core shard is effectively forked, then when the network heals, the higher-scoring collection of frames is chosen; this scoring basis follows the time reel logic outlined in data_time_reel.go. If the prover ring were partitioned evenly for longer than the duration of a single global proof, the core shard would be halted until the partition heals or one fork becomes heavier.
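The heal-time rule above reduces to "heavier fork wins, equal weight halts." A minimal sketch, with `Fork` and its scalar `Weight` standing in for a collection of frames and the time-reel scoring in data_time_reel.go (the real metric is richer than a single number):

```go
package main

import "fmt"

// Fork stands in for one side's collection of frames observed during a
// partition. Weight is a stand-in for the time-reel score; the actual
// scoring in data_time_reel.go differs, this only sketches the rule.
type Fork struct {
	Head   string
	Weight uint64
}

// selectFork returns the heavier fork once the partition heals, or
// ok=false (shard halted) while the forks remain equally weighted.
func selectFork(a, b Fork) (Fork, bool) {
	switch {
	case a.Weight > b.Weight:
		return a, true
	case b.Weight > a.Weight:
		return b, true
	default:
		// Even split: halt until one side becomes heavier.
		return Fork{}, false
	}
}

func main() {
	_, ok := selectFork(Fork{"a", 10}, Fork{"b", 10})
	fmt.Println(ok) // false: shard halts on an even split
	winner, _ := selectFork(Fork{"a", 12}, Fork{"b", 10})
	fmt.Println(winner.Head) // a
}
```

Halting on an even split trades liveness for safety: the shard never records conflicting transactions on both sides of the partition.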