Boxing Day Anomaly

On December 26, 2021, for a brief period, almost 90% of the hashing power on the network was behind a mining pool that was operating on faulty code. The network crashed and was unable to resync to the only living chain, which contained a single faulty block that could not be properly validated. The community had two options: manually roll the network back to before the problem occurred, thereby losing 12 hours of transaction history, or accept the faulty block into the chain by way of a manual exception written into the node init() code that would explicitly accept the faulty block hash. The community was asked for input and opted for the exception. Technical details of what came to be known as the Boxing Day Anomaly are presented below.

Conditions

There were three preconditions necessary for this problem to come about:

Condition #1

A pool worker node running new mining code generated a Haiku (nonce) composed of a 10-byte Haiku-1 and a 13-byte Haiku-2. That nonce did not solve a block. The node's next attempt, with a 10-byte Haiku-1 and a 10-byte Haiku-2, did solve the block; however, the last 3 bytes of the previous attempt's Haiku-2 were still present in the new Haiku's buffer, and it happened that those extra garbage bytes actually helped solve the block.
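
A minimal sketch of the buffer-reuse bug in C is shown below. The names (build_nonce_buggy, build_nonce_fixed) and the 32-byte nonce buffer are illustrative assumptions, not the actual pool worker source. Each attempt writes its Haiku into the same fixed-size buffer; because the buggy version never clears the buffer first, a shorter Haiku-2 leaves stale bytes from the previous, longer Haiku-2 behind it.

 #include <stdio.h>
 #include <string.h>
 
 #define NONCE_LEN 32  /* illustrative fixed-size nonce buffer */
 
 static unsigned char nonce[NONCE_LEN];
 
 /* Buggy: writes only the bytes of the new Haiku, leaving whatever the
  * previous attempt wrote beyond that point untouched. */
 static void build_nonce_buggy(const unsigned char *h1, size_t len1,
                               const unsigned char *h2, size_t len2)
 {
     memcpy(nonce, h1, len1);
     memcpy(nonce + len1, h2, len2);
     /* BUG: bytes at offsets >= len1 + len2 keep their old values. */
 }
 
 /* Fixed: clear the whole buffer before writing the new Haiku. */
 static void build_nonce_fixed(const unsigned char *h1, size_t len1,
                               const unsigned char *h2, size_t len2)
 {
     memset(nonce, 0, NONCE_LEN);
     memcpy(nonce, h1, len1);
     memcpy(nonce + len1, h2, len2);
 }
 
 int main(void)
 {
     unsigned char h1a[10], h2a[13];  /* first attempt: 10 + 13 = 23 bytes  */
     unsigned char h1b[10], h2b[10];  /* second attempt: 10 + 10 = 20 bytes */
 
     memset(h1a, 'A', sizeof(h1a)); memset(h2a, 'B', sizeof(h2a));
     memset(h1b, 'C', sizeof(h1b)); memset(h2b, 'D', sizeof(h2b));
 
     build_nonce_buggy(h1a, sizeof(h1a), h2a, sizeof(h2a));
     build_nonce_buggy(h1b, sizeof(h1b), h2b, sizeof(h2b));
     printf("buggy: %c%c%c\n", nonce[20], nonce[21], nonce[22]);  /* BBB left over */
 
     build_nonce_fixed(h1b, sizeof(h1b), h2b, sizeof(h2b));
     printf("fixed: %d%d%d\n", nonce[20], nonce[21], nonce[22]);  /* 000 */
     return 0;
 }

In this incident those stale bytes happened to produce a hash that met the difficulty target, so the "solved" block carried a nonce that was not a cleanly zero-padded Haiku.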

Condition #2

The Java node that runs the mining pool has all of the same validation tools as the mainline C code. However, due to a minor oversight in its Haiku validation code, the Java node examines the nonce only to confirm that it contains a syntactically valid Haiku; once it finds a correctly matching Haiku frame, processing stops. The missing logical test was: "Validate that the remaining bytes of the nonce are zeroed out." This test is present in the C node logic and was an oversight in the Java code.
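
Conceptually, the missing test is a trailing-zero check like the one sketched below in C. The function name is hypothetical and this is not the actual C or Java node source; it only illustrates the rule that everything after the matched Haiku frame must be zero.

 #include <stdio.h>
 #include <stddef.h>
 
 /* Returns 1 if every byte of the nonce past the end of the matched
  * Haiku frame is zero, 0 otherwise. */
 static int trailing_bytes_zeroed(const unsigned char *nonce,
                                  size_t nonce_len, size_t haiku_end)
 {
     for (size_t i = haiku_end; i < nonce_len; i++)
         if (nonce[i] != 0)
             return 0;  /* stray garbage byte: reject the nonce */
     return 1;
 }
 
 int main(void)
 {
     unsigned char nonce[32] = {0};
     nonce[22] = 0x42;                                      /* leftover garbage */
     printf("%d\n", trailing_bytes_zeroed(nonce, 32, 20));  /* 0: reject nonce  */
     nonce[22] = 0;
     printf("%d\n", trailing_bytes_zeroed(nonce, 32, 20));  /* 1: accept nonce  */
     return 0;
 }

Because the Java node stopped as soon as it found a matching Haiku frame, a nonce carrying leftover garbage after the Haiku still passed its validation.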

Condition #3

Under normal conditions, if the network receives a faulty block from any node, it will ignore it and continue to mine. In this case, however, the Java node kept re-advertising the faulty block to the rest of the network, and while the other nodes initially ignored it, no other node solved a block. This is a consequence of too much hashing power being centralized behind the mining pool. The rest of the network continued to attempt to solve the block, but one by one the "big wait" timers for those nodes expired without them receiving a block update or solving a block. Over the course of about 3 hours they continued to ignore the Java node, but lacking any other block solves, they reset themselves until all nodes were at block 0. As the Java node continued to serve up the faulty block, the restarted nodes came online and continued to reject the Java node's chain, keeping them at block 0.
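
The reset behavior described above can be pictured roughly as follows. All names, structures, and the timeout value are hypothetical; this is only an illustration of the behavior described in this section, not the actual node source.

 #include <time.h>
 
 /* Hypothetical node state; not the actual Mochimo structures. */
 typedef struct {
     unsigned long block_number;  /* current chain height              */
     time_t last_progress;        /* time of last accepted block/solve */
 } node_state;
 
 #define BIG_WAIT 3600  /* placeholder timeout, in seconds */
 
 /* Called periodically from the node's main loop.  If nothing has
  * advanced the chain within the big-wait window, the node flushes its
  * chain state and restarts from block 0 to resynchronize with peers. */
 static void check_big_wait(node_state *n)
 {
     if (time(NULL) - n->last_progress > BIG_WAIT) {
         n->block_number = 0;            /* chain state flushed    */
         n->last_progress = time(NULL);  /* resync begins here ... */
     }
 }
 
 int main(void)
 {
     node_state n = { 123456UL, time(NULL) - (BIG_WAIT + 1) };
     check_big_wait(&n);          /* the timer has already expired... */
     return (int)n.block_number;  /* ...so the height is back to 0    */
 }

With the pool holding almost all of the hashing power and its block being rejected, every other node eventually hit this path and fell back to block 0.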

Impact

This issue caused no impact to end-user transactions. For a period of about 12 hours, the total number of nodes on the network dropped as low as 2, until a third node was brought online; after 12 hours a 4th node was brought online. The mining pool and its associated hashing power continued to mine throughout the incident.

Individual mining nodes were unable to mine as they could not find a quorum, and every node had flushed its chain state.
The pool worker mining code issue was identified and corrected.
Additional Java nodes were brought online to augment the network size.

Remedial Steps

We patched the Java node code to perform the related block validation test, ensuring that not only is the solving Haiku valid but also that the remainder of the nonce is properly zeroed out.
We patched the C node by hard-coding the block hash of the offending block into the code such that it will unconditionally accept this one block as valid, provided it has the correct hash (see the sketch after this list). The PoW WAS performed to solve the block; it just solved it in an unusual way. This was not pretty, but it was better than losing 12 hours of transaction processing.
We notified the community of the code patch.
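
The C node exception can be pictured as the sketch below. The names are hypothetical and the hash bytes are placeholders (the real patch embeds the actual hash of the anomalous block in the node's init()/validation path). The single anomalous block is accepted by hash match; every other block still receives full validation.

 #include <stdio.h>
 #include <string.h>
 
 #define HASHLEN 32  /* assumed 32-byte block hash */
 
 /* Placeholder bytes only; the real patch embeds the actual hash of the
  * anomalous block. */
 static const unsigned char BOXING_DAY_HASH[HASHLEN] = { 0x01, 0x02 /* ... */ };
 
 /* Stand-in for the node's normal full block validation; here it
  * pretends the anomalous block fails the normal checks. */
 static int validate_block_normally(const unsigned char *block_hash)
 {
     (void)block_hash;
     return 0;
 }
 
 /* Accept the single hard-coded block unconditionally (by hash match);
  * every other block goes through normal validation. */
 static int validate_block(const unsigned char *block_hash)
 {
     if (memcmp(block_hash, BOXING_DAY_HASH, HASHLEN) == 0)
         return 1;  /* the Boxing Day exception */
     return validate_block_normally(block_hash);
 }
 
 int main(void)
 {
     printf("%d\n", validate_block(BOXING_DAY_HASH));  /* 1: accepted */
     return 0;
 }

This trades a small, permanent special case in the validation path for keeping 12 hours of otherwise valid transaction history.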

Long Term Outlook

No architectural changes are required to the system to address this issue.
The anticipated block chain reboot associated with Mochimo 3.0 will cause this issue to be lost to history.
We may revisit the reset-if-no-updates-seen behavior and replace it with something that avoids a similar issue in the future, though this is not a high-priority change.