The Highly Optimistic Dev Blog #01: The Mystery of the Missing Message

A note from the editorial team:

There’s so much interesting work happening in the universe of Optimistic Ethereum that we often forget to take a moment to tell the world what we’ve been up to. We’ve heard from a lot of you that you’d like to get a more detailed view into the inner workings of Optimistic Ethereum and the people who help make it a reality. As a result, we’re launching a dev blog where the people working on Optimistic Ethereum write about the ideas and challenges they tackle on a daily basis. We aren’t enforcing any particular style or structure on these blog posts — you’ll get a raw look at the way each of us is thinking about our work at Optimism. We hope this will act as a little window into the Optimistic experience.

Welcome to the first edition of the Highly Optimistic Dev Blog! ☺️

Author: Kelvin Fichter

Summary and background

Noticing the symptoms

Engineering Lesson #1: user reports are a last line of defense

User reports should be a last line of defense when discovering issues. If you’re finding issues because of user reports, then you probably don’t have sufficient alerting in place. In this specific case, we should’ve been regularly submitting deposit transactions and monitoring those transactions for failures.
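
Concretely, that kind of alerting could be as simple as a little canary service: submit a tiny test deposit on L1, wait for it to show up on L2, and page someone if it never does. Here’s a rough sketch of the idea; the helper functions are hypothetical placeholders, not our actual tooling:

const CANARY_INTERVAL_MS = 15 * 60 * 1000; // submit a test deposit every 15 minutes
const L2_TIMEOUT_MS = 10 * 60 * 1000;      // how long we give the deposit to show up on L2

async function submitTestDeposit(): Promise<string> {
  // Real version: send a tiny deposit through the L1 bridge and return an identifier.
  return `canary-${Date.now()}`;
}

async function depositVisibleOnL2(_id: string): Promise<boolean> {
  // Real version: query the L2 node for the corresponding deposit transaction.
  return false;
}

function alertOncall(message: string): void {
  // Real version: page whoever is on call (PagerDuty, Slack webhook, etc.).
  console.error(`ALERT: ${message}`);
}

async function canaryLoop(): Promise<void> {
  while (true) {
    const id = await submitTestDeposit();
    const deadline = Date.now() + L2_TIMEOUT_MS;

    let arrived = false;
    while (Date.now() < deadline && !arrived) {
      arrived = await depositVisibleOnL2(id);
      if (!arrived) await new Promise((r) => setTimeout(r, 30_000)); // poll every 30s
    }

    if (!arrived) {
      alertOncall(`deposit ${id} was not reflected on L2 within ${L2_TIMEOUT_MS / 1000}s`);
    }
    await new Promise((r) => setTimeout(r, CANARY_INTERVAL_MS));
  }
}

// Run as a long-lived service: canaryLoop().catch((err) => alertOncall(String(err)));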

Starting the debugging process

  1. Deposits could be correctly initiated on L1
  2. Deposits were NOT correctly being reflected on L2

Point (1) meant that the bug likely wasn’t coming from our smart contracts (if the contracts were at fault, the deposit transactions would probably be reverting). Luckily, our software stack is relatively simple and there aren’t many places for a bug to hide. If the issue wasn’t in the contracts, then it was either in the L2 geth node OR in the process that carries transactions over to L2 geth (a tool known as the data transport layer, or DTL for short; it’s a terribly vague name and we’re accepting suggestions for alternatives 😂).

The DTL is a minimal piece of software that doesn’t often see failures like this. We made a few deposits of our own and found that the DTL was correctly discovering and indexing these deposits, so our suspicion turned to L2 geth. A quick skim through the logs didn’t reveal much other than the fact that our new deposit transactions were clearly not being executed on L2.

Fortunately, our luck turned when we found the following log lines:

Syncing enqueue transactions range start=2500 end=3545
...
Syncing enqueue transactions range start=2500 end=3546
...
Syncing enqueue transactions range start=2500 end=3547

These logs are emitted when L2 geth tries to execute a series of L1 ⇒ L2 deposit transactions. We’d expect the start value to increase over time because L2 geth should be executing one transaction and moving on to the next. The fact that the start value wasn't changing suggested that L2 geth was somehow getting stuck when trying to execute deposit 2500. Now we just had to find out what was going on with deposit 2500.
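
As an aside, “start isn’t advancing” is exactly the kind of condition you can catch with automated alerting instead of a log skim. A rough sketch of a log watcher, assuming the log format matches the excerpt above and something (journald, Loki, etc.) is feeding it lines:

const RANGE_RE = /Syncing enqueue transactions range start=(\d+) end=\d+/;
const STUCK_THRESHOLD_MS = 10 * 60 * 1000; // alert if start hasn't advanced in 10 minutes

let lastStart: number | null = null;
let lastProgressAt = Date.now();

export function observeLogLine(line: string, now: number = Date.now()): void {
  const match = RANGE_RE.exec(line);
  if (!match) return;

  const start = parseInt(match[1], 10);
  if (lastStart === null || start > lastStart) {
    // The range is advancing; remember where we are and reset the timer.
    lastStart = start;
    lastProgressAt = now;
  } else if (now - lastProgressAt > STUCK_THRESHOLD_MS) {
    console.error(`ALERT: enqueue sync appears stuck at start=${start}`);
  }
}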

Checking the DTL

GET https://url.of.dtl/enqueue/index/2500

What did we get back? null. Apparently the DTL just didn't have any record of this deposit.

So what about deposit 2501?

GET https://url.of.dtl/enqueue/index/2501

Not null! Yep, the DTL had a record for that. And it had a record for every other deposit after that. This was clearly the source of the initial issue. The DTL didn’t have a record of deposit 2500, so L2 geth couldn’t progress past deposit 2500, even though new deposits were still being accepted. Hmm...
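
Checking indices one at a time by hand gets old quickly, so a tiny script can walk a range of the DTL’s enqueue endpoint and report any gaps. A rough sketch (the /enqueue/index/:n route is the one used above; the base URL and range are placeholders, and this assumes Node 18+ for the built-in fetch):

// Walk the DTL's enqueue endpoint and report any missing indices.
async function findMissingEnqueues(baseUrl: string, from: number, to: number): Promise<number[]> {
  const missing: number[] = [];
  for (let i = from; i <= to; i++) {
    const res = await fetch(`${baseUrl}/enqueue/index/${i}`);
    const record = await res.json();
    if (record === null) {
      missing.push(i); // the DTL has no record of this deposit
    }
  }
  return missing;
}

// Usage (placeholder URL and an illustrative range around the stuck deposit):
// findMissingEnqueues("https://url.of.dtl", 2490, 2510).then(console.log);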

Finding the root cause

Lucky for us, the DTL is a relatively simple piece of software — it doesn’t do very much. It’s basically just a loop (sketched below) that:

  1. Looks up the last L1 block it’s in sync with
  2. Finds the next L1 block to sync to
  3. Finds, parses, and stores events within that block range
  4. Updates the last L1 block it’s in sync with
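
For the curious, a heavily simplified sketch of that loop might look something like this (the interfaces and names are illustrative, not the real DTL code):

interface EnqueueEvent {
  index: number;
  l1BlockNumber: number;
  data: string;
}

interface Database {
  getLastSyncedL1Block(): Promise<number>;
  storeEnqueues(events: EnqueueEvent[]): Promise<void>;
  setLastSyncedL1Block(block: number): Promise<void>;
}

interface L1Client {
  getBlockNumber(): Promise<number>;
  getEnqueueEvents(fromBlock: number, toBlock: number): Promise<EnqueueEvent[]>;
}

async function dtlSyncLoop(db: Database, l1: L1Client): Promise<void> {
  while (true) {
    // 1. Look up the last L1 block we're in sync with.
    const lastSynced = await db.getLastSyncedL1Block();

    // 2. Find the next L1 block to sync to (capped so each iteration stays bounded).
    const target = Math.min(await l1.getBlockNumber(), lastSynced + 2000);
    if (target <= lastSynced) {
      await new Promise((r) => setTimeout(r, 5000)); // fully caught up; wait for new blocks
      continue;
    }

    // 3. Find, parse, and store events within that block range.
    const events = await l1.getEnqueueEvents(lastSynced + 1, target);
    await db.storeEnqueues(events);

    // 4. Update the last L1 block we're in sync with.
    await db.setLastSyncedL1Block(target);
  }
}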

As a result, there just aren’t many things that can cause it to break like this. Nothing obvious stood out to us, so we ended up having to do a good bit of digging. We came up with three potential sources of the bug in order of decreasing likelihood.

Option 1: Maybe Kovan reorg’d

A quick look at Etherscan’s reorg list and we were back to square one. At the time of the incident, no reorgs had occurred for almost a month. Unless Etherscan was experiencing a large localized outage and falling out of sync with the rest of Kovan, it’s unlikely that there was any sort of 12+ block reorg that went unnoticed. On to the next one.

Option 2: Maybe we accidentally skipped a block

To figure out whether we’d skipped a block or not, we had to find a specific log line that would tell us if we’d scanned the L1 block that contained the deposit. In order to do that, we first needed to find the block in which the event was emitted. So we went back to Etherscan to search for the deposit:

The return value here is a struct of the form:

struct QueueElement {
    bytes32 transactionHash;
    uint40 timestamp;
    uint40 blockNumber;
}

The timestamp and blockNumber here are the timestamp and block number of the block in which this deposit was submitted. We had what we came for: this deposit was submitted in block 25371197. Now we just had to take a look through the DTL logs to (hopefully) find a log covering the block in question. Sure enough:

The targetL1Block in this log line matches the L1 block we were looking for. We definitely attempted to sync the block that contained the deposit. So why didn't we see anything?
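
For completeness, the same queue lookup can be scripted instead of clicked through on Etherscan. A rough sketch using ethers v5; the function name getQueueElement and the placeholder RPC URL and contract address are assumptions layered on top of the struct shown above:

import { ethers } from "ethers";

// Minimal ABI fragment for the queue lookup; the name and return shape are
// assumed from the struct above, not copied from our contracts.
const CTC_ABI = [
  "function getQueueElement(uint256 _index) view returns (tuple(bytes32 transactionHash, uint40 timestamp, uint40 blockNumber) _element)",
];

async function lookupDeposit(rpcUrl: string, ctcAddress: string, index: number) {
  const provider = new ethers.providers.JsonRpcProvider(rpcUrl);
  const ctc = new ethers.Contract(ctcAddress, CTC_ABI, provider);

  const element = await ctc.getQueueElement(index);
  // blockNumber is the L1 block in which the deposit was submitted.
  console.log(`deposit ${index} was submitted in L1 block ${element.blockNumber}`);
  return element;
}

// Usage (placeholder RPC URL and contract address):
// lookupDeposit("https://kovan.infura.io/v3/<key>", "0x...CanonicalTransactionChain", 2500);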

Option 3: Maybe Infura broke

Relatively confident in our DTL code, we started to look for potential reasons to blame someone other than ourselves (because it’s easier that way). We found that Infura was returning various 504 errors at around the same time that this error was triggered:

Interesting? Yes. Proof of a problem with Infura? Not so much. Maybe something went wrong at Infura on June 9th. Or maybe not. Who knows. Unfortunately, we did not store a record of our RPC requests and were therefore not able to debug this issue any further with logs alone.

Engineering Lesson #2: keep a record of your RPC requests.

If an Ethereum node is critical to your system, consider keeping a record of all RPC requests made to the node. Logs can always be deleted after some reasonable period of time to keep storage costs low.
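
One lightweight way to do this is a thin pass-through proxy that records every request/response pair before forwarding it to the upstream node. A rough sketch (file-based storage just to keep things simple, placeholder upstream URL, Node 18+ for the built-in fetch); a real setup might ship these records to a log pipeline instead:

import * as http from "http";
import * as fs from "fs";

// Upstream node URL and log file location are placeholders.
const UPSTREAM = process.env.UPSTREAM_RPC ?? "https://kovan.infura.io/v3/<key>";
const LOG_FILE = "rpc-requests.ndjson";

const server = http.createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", async () => {
    // Forward the JSON-RPC request to the real node.
    const upstream = await fetch(UPSTREAM, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body,
    });
    const responseText = await upstream.text();

    // Keep a durable record of what we asked and what we were told.
    const record = { at: new Date().toISOString(), request: body, response: responseText };
    fs.appendFileSync(LOG_FILE, JSON.stringify(record) + "\n");

    res.writeHead(upstream.status, { "content-type": "application/json" });
    res.end(responseText);
  });
});

// Point the DTL (or anything else) at http://localhost:8545 instead of the provider directly.
server.listen(8545);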

Engineering Lesson #3: Infura is not consensus.

Consider sending requests to multiple nodes or node providers simultaneously to cross-check the validity of incoming results. Consider hosting one or more of these nodes yourself. You absolutely should not rely on the assumption that your node provider will always return correct results. Trust, but verify.
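
As a sketch of what cross-checking might look like, here’s a toy helper (ethers v5, placeholder provider URLs) that runs the same eth_getLogs query against several providers and refuses to return anything if they disagree:

import { ethers } from "ethers";

// Placeholder endpoints: two hosted providers plus a node we run ourselves.
const PROVIDER_URLS = [
  "https://kovan.infura.io/v3/<key>",
  "https://eth-kovan.alchemyapi.io/v2/<key>",
  "http://my-own-node:8545",
];

async function crossCheckedGetLogs(filter: ethers.providers.Filter): Promise<ethers.providers.Log[]> {
  const results = await Promise.all(
    PROVIDER_URLS.map((url) => new ethers.providers.JsonRpcProvider(url).getLogs(filter))
  );

  // Compare results by a stable fingerprint; any mismatch is worth an alert.
  const fingerprints = results.map((logs) =>
    JSON.stringify(logs.map((log) => `${log.blockHash}:${log.logIndex}`))
  );
  if (new Set(fingerprints).size > 1) {
    throw new Error("providers disagree about this log range; investigate before trusting any of them");
  }

  return results[0];
}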

Solution: Learn, adapt, and move on.

After some discussion, we settled on a compromise:

  1. Resync the DTL to get the system back up.
  2. Add some more logic to the DTL so it can recover from missing events.
  3. Add a log line whenever this recovery logic is triggered so we can debug further in the future.

This strategy made the best of the situation. The DTL’s new recovery logic shielded us from the worst effects of the underlying problem. If the bug ever reared its head again, we’d get a convenient log line instead of a halted system. Win-win.
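
To give a flavor of what that recovery logic might look like (reusing the illustrative Database and L1Client interfaces from the loop sketch above, so this is a sketch rather than our actual fix):

// Called when the next expected enqueue index never showed up in the database.
async function recoverMissingEnqueue(
  db: Database,
  l1: L1Client,
  missingIndex: number,
  fromBlock: number,
  toBlock: number
): Promise<void> {
  // This log line is the whole point: if the bug ever happens again, we want an
  // obvious breadcrumb to debug from instead of a silently halted system.
  console.warn(
    `missing enqueue index=${missingIndex}; re-scanning L1 blocks ${fromBlock}-${toBlock}`
  );

  // Re-scan the range of L1 blocks that should have contained the event and
  // store anything we find this time around.
  const events = await l1.getEnqueueEvents(fromBlock, toBlock);
  await db.storeEnqueues(events);
}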

Engineering Lesson #4: root cause isn’t everything.

Sometimes it’s possible (or even preferable) to mitigate an issue before you fully understand it. You should still monitor for future occurrences of the issue. Consider brainstorming other effects that the unknown bug might have on your system.

Conclusion

I found this saga so interesting because it highlighted some of the dirtier aspects of software engineering. The process of building and maintaining software is never as straightforward as we’d all like it to be. Practical concerns can often come at the expense of doing things the “ideal” way. At the end of the day, good software engineering is more like a rough science than pure mathematics. You will need to deal with the messy situations that arise when people are building software for other people. All you can do is make the best of it: learn, adapt, and move on.

We’re hiring to build the future of Ethereum. If you liked what you read here, check out our job listings. Maybe you’ll find your next home. 💖
