System-Design

July 24, 2024

How Alphaus saves on costs by ‘stitching storage’

2024-07-24

One of Alphaus’ data processing pipelines ingests around 10TB of client financial data per day. The processing engine runs on GKE, with around 80-100 pods (depending on the week of the month) sharing the total workload. Each pod has around 10GB of memory and 30GB of attached storage. The consistency of this load allowed us to purchase enough Committed Use Discounts (CUDs) for the underlying VMs to save on compute costs. These pod resource limits are enough about 80% of the time. However, since late last year, some accounts have had datasets that are way, way beyond these limits, causing persistent OOMKilled events.
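For a rough sense of scale, the figures above work out to about 100 to 125 GB per pod per day if the load were spread evenly. Here is a minimal Python sketch of that arithmetic. It assumes decimal units (1 TB = 1000 GB) and a perfectly even split across pods, which the oversized accounts clearly break; the function name is hypothetical.

```python
def per_pod_daily_gb(total_tb_per_day: float, pods: int) -> float:
    """Average data volume per pod per day, in GB.

    Assumes 1 TB = 1000 GB and an even split across pods.
    Real accounts are skewed, so this is only an average.
    """
    return total_tb_per_day * 1000 / pods


# Pod counts taken from the 80-100 range quoted in the text.
for pods in (80, 100):
    print(f"{pods} pods: ~{per_pod_daily_gb(10, pods):.0f} GB per pod per day")
```

An average like this says nothing about the largest single dataset a pod has to handle, which is what the memory limit actually runs into.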

Infra · System-Design

3 minutes

Revisiting latency numbers

2024-06-28

“Back-of-the-envelope calculations”, “napkin math”, “latency numbers every programmer should know”: yes, those numbers that usually come up during system design interview questions. This came into my periphery again while looking at RDMA latency checks and benchmarks with P4d instances in AWS (using SoftRoCE). As an old-timer with (most likely) outdated ideas about system design-related latency numbers, and although I’m quite familiar with Jeff Dean’s “Numbers Everyone Should Know” approximations, I noticed that, in a jiffy, I’m still (unconsciously) subscribed to the idea that disk access is most definitely faster than network. Somewhere along the lines of L* cache > memory > disks > network. Come to think of it, I’m not really sure why. It’s probably because pre-cloud, I really didn’t have access to high-tier network bandwidth, so my experience with only crappy networks has been etched in my mind for the longest time. This usually shows when I do quick, back-of-my-head latency calculations for a system I’m about to design. Of course, in the end, benchmarking will have the last say when it comes to these numbers, but having a rough idea of the performance numbers pre-implementation is always helpful.
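As a rough illustration of that kind of napkin math, here is a minimal Python sketch using the commonly quoted (circa 2012) figures from that list. The dictionary keys and the helper name are made up for this example, and the values are order-of-magnitude approximations, not measurements from the RDMA benchmarks mentioned above.

```python
# Commonly quoted approximations, in nanoseconds.
# Treat these as orders of magnitude, not as benchmark results.
LATENCY_NS = {
    "l1_cache_ref": 0.5,
    "main_memory_ref": 100,
    "datacenter_round_trip": 500_000,  # round trip within the same datacenter
    "disk_seek": 10_000_000,           # spinning disk seek
}


def fastest_first(latencies: dict) -> list:
    """Return operation names ordered from lowest to highest latency."""
    return sorted(latencies, key=latencies.get)


print(" < ".join(fastest_first(LATENCY_NS)))

# How many datacenter round trips fit inside a single disk seek?
ratio = LATENCY_NS["disk_seek"] / LATENCY_NS["datacenter_round_trip"]
print(f"one disk seek ~ {ratio:.0f} datacenter round trips")
```

With these figures, a round trip inside a datacenter lands ahead of a spinning-disk seek, which is the opposite of the "disks before network" ordering. SSDs and cross-region links shift the picture again, so the table is only a starting point before benchmarking.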

Infra · Latency · System-Design

3 minutes