Distributed-Systems

Cluster membership management on AWS

2025-02-07

Continuing from my previous post, I have now finished porting hedge to AWS. It’s a trimmed-down version for now; only the features directly related to cluster membership are ported. Instead of updating hedge directly, I decided to make a separate repo called hedge-cb (in keeping with the -cb theme), mainly because of CGO. I didn’t really fancy the idea of introducing CGO to hedge, as it could break a lot of the CI builds at work. I did, however, have to extract the shared protobuf definitions to a separate repo, which is a breaking change to hedge. But at least it’s only a version change (v2) as opposed to adding CGO.

Aws · Cgo · Clockbound · Cluster · Distributed-Systems · Ffi · Go · Golang · Hedge · Leader-Election · Memberlist · Programming · Software · Systems · Tech · Timesync · True-Time

2 minutes

Thoughts on Zig

2024-09-30

In my previous post about the new systems programming languages, I mentioned I was considering Zig as a potential complement to our main systems language, which is Go(lang). Well, for the past month or so, in my spare time, I tried writing something more substantial in it to understand the language better. For some time now, I’ve been “itching” to write something similar to HashiCorp’s memberlist library, but in a lower-level language for better performance, a smaller footprint, and minimal network load. Now, I’ve used memberlist before, and it is a superb piece of code, but I wanted something that supports a consistent leader across the whole fleet. It’s a requirement for a system I plan on building in the near future (more on this in a later post). My top choices were C, Rust, and Zig, and as I said, I took a liking to Zig due to its promised simplicity, so I wrote it in Zig. The project is called zgroup, and you can check it out on GitHub if you’re interested. It’s still similar to memberlist but with the added capability of electing a leader across the whole group. It uses both the SWIM protocol, which memberlist uses, and Raft’s leader election algorithm.

Distributed-Systems · Programming · Raft · Swim · Zig · Ziglang

2 minutes

Retries with backoff in distributed systems

2023-05-11

In a distributed system, where multiple processes communicate with each other over a network, failures are inevitable. Network partitions, hardware failures, and software bugs can all cause a request to fail. Retries with backoff are a critical technique to help mitigate these failures. A retry is simply another attempt at a failed request: the client sends it again, hoping that it will succeed the next time around. However, retrying immediately after a failure can be problematic. If the failure was caused by a temporary network issue, for example, retrying immediately will likely result in another failure. This is where backoff comes in.

Backoff · Distributed-Systems · Retries · Retry · Tech

3 minutes