<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xml:lang="en" xmlns:atom="http://www.w3.org/2005/Atom"><channel><link href="https://flowerinthenight.com/atom.xml" rel="self" type="application/atom+xml"/><link href="https://flowerinthenight.com/" rel="alternate" type="text/html"/><link href="https://flowerinthenight.com/paige-search.json" rel="alternate" type="application/json"/><link href="https://flowerinthenight.com/rss.xml" rel="alternate" type="application/rss+xml"/><copyright>© Flowerinthenight, 2016-2025. All rights reserved.</copyright><description>Recent content</description><language>en</language><lastBuildDate>2025-12-02 00:00:00 +0900 JST</lastBuildDate><link>https://flowerinthenight.com/</link><managingEditor>root@flowerinthenight.com (Flowerinthenight)</managingEditor><title/><webMaster>root@flowerinthenight.com (Flowerinthenight)</webMaster><item><description><![CDATA[<br>
<p>This is a guide for my (future) self.</p>
<p>To set up <a href="https://filen.io/">Filen</a> as a systemd service, follow the guide below. As an example, we will use <code>user1</code> as the username.</p>
<p><strong>[1]</strong> Create a unit file under <code>/usr/lib/systemd/system/</code> for our service. Take note of the binary location and the mount point.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">cat &gt;/usr/lib/systemd/system/filen-mount.service <span class="s">&lt;&lt;EOL
</span></span></span><span class="line"><span class="cl"><span class="s">[Unit]
</span></span></span><span class="line"><span class="cl"><span class="s">Description=Filen CLI Mount Service
</span></span></span><span class="line"><span class="cl"><span class="s">After=network-online.target
</span></span></span><span class="line"><span class="cl"><span class="s">Wants=network-online.target
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">[Service]
</span></span></span><span class="line"><span class="cl"><span class="s">Type=simple
</span></span></span><span class="line"><span class="cl"><span class="s">ExecStart=/home/user1/.local/bin/filen mount /home/user1/filen/ --quiet
</span></span></span><span class="line"><span class="cl"><span class="s">Restart=on-failure
</span></span></span><span class="line"><span class="cl"><span class="s">RestartSec=5
</span></span></span><span class="line"><span class="cl"><span class="s">User=user1
</span></span></span><span class="line"><span class="cl"><span class="s">WorkingDirectory=/home/user1/
</span></span></span><span class="line"><span class="cl"><span class="s">
</span></span></span><span class="line"><span class="cl"><span class="s">[Install]
</span></span></span><span class="line"><span class="cl"><span class="s">WantedBy=multi-user.target
</span></span></span><span class="line"><span class="cl"><span class="s">EOL</span>
</span></span></code></pre></div></blockquote>
<p><strong>[2]</strong> Set the service to start on boot.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">systemctl daemon-reload
</span></span><span class="line"><span class="cl">systemctl <span class="nb">enable</span> filen-mount
</span></span><span class="line"><span class="cl">systemctl start filen-mount
</span></span></code></pre></div></blockquote>
<p>(You might need <code>sudo</code> for these commands.)</p>
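<p>To confirm everything worked, here is a quick sanity check (a sketch; <code>mountpoint</code> comes from util-linux, and the path matches the unit file above):</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"># Service should be active, and the directory should be a live mount:
systemctl status filen-mount --no-pager
mountpoint /home/user1/filen/

# Logs, in case something went wrong:
journalctl -u filen-mount -n 50 --no-pager
</code></pre></div></blockquote>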
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-12-02:/blog/2025-12-02-filen-systemd/</guid><link>https://flowerinthenight.com/blog/2025-12-02-filen-systemd/</link><pubDate>Tue, 02 Dec 2025 00:00:00 JST</pubDate><title>Mount Filen as a systemd service</title></item><item><description><![CDATA[<br>
<p>Introducing another side-project of mine: <a href="https://github.com/flowerinthenight/luna">Luna</a>.</p>
<p><strong>Luna</strong> is part of my ongoing familiarization with Rust. In previous blog entries, I&rsquo;ve mentioned that I <a href="/blog/2025-03-18-on-rust/">decided</a> to choose Rust as the additional non-GC systems programming language to complement our use of Go at <a href="https://alphaus.cloud/">work</a>. Since then, there have been several attempts to write some of our critical systems in Rust, but due to insufficient know-how, none have materialized yet.</p>
<p>Recently, however, one of our core data processing engines, <a href="/blog/2024-07-24-spillover-store/">Sapphire</a>, has been nearing its limits in terms of scale. Being mainly a data ingestion and processing engine, one of its bottlenecks is how fast it can query data from BigQuery. Now, BigQuery is actually quite performant as long as you stay on top of its scaling peculiarities. It is a bit pricey, though. So instead of upgrading BigQuery&rsquo;s compute capacity across the board, we started exploring the idea of caching some of the heavier query results; sort of a Redis for columnar SQL data.</p>
<p>That&rsquo;s where Luna comes in. Strictly speaking, it is <a href="https://duckdb.org/">DuckDB</a>&rsquo;s in-memory SQL capabilities that are being leveraged here; Luna is just the host process for an embedded, in-memory DuckDB instance. But that makes Luna a full-fledged SQL database server. We just have to add the must-haves and nice-to-haves expected of cache servers, such as an auth mechanism, distributed capabilities, scale, and some degree of high availability. And I think Rust is a good choice here: precise control of memory, with strong security defaults.</p>
<p>I&rsquo;ve decided to make it open source as well. I think this is good software to open-source; certainly not the first of its kind (there are already several out there), but I think the difference will be in governance and roadmap.</p>
<p>So, if this is right up your alley, check it out on <a href="https://github.com/flowerinthenight/luna">GitHub</a>. And contribute, if you can!</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-10-01:/blog/2025-10-01-luna-sql/</guid><link>https://flowerinthenight.com/blog/2025-10-01-luna-sql/</link><pubDate>Wed, 01 Oct 2025 00:00:00 JST</pubDate><title>Introducing Luna, an in-memory SQL cache</title></item><item><description><![CDATA[<br>
<p>When building, and especially testing, <a href="https://vortex.nightblue.io/">Vortex</a>, we use several cloud VMs with different kernel versions (usually LTS ones), plus a dedicated GKE test cluster. The trouble with eBPF, though, is that many of its features depend on the kernel version running on the target system. And maintaining multiple cloud VMs with different kernel versions is a tad expensive, and really not that straightforward.</p>
<p>And so, <a href="https://www.qemu.org/">QEMU</a> to the rescue. When it comes to emulators, QEMU doesn&rsquo;t really need an introduction. This blog is a guide (mostly to myself) on how to build a Debian-based image with a specific kernel version to be used as an eBPF testbed.</p>
<p>This guide assumes an Ubuntu-based system; a cloud VM with KVM enabled also works. First, let&rsquo;s install the dependencies.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ sudo apt update
</span></span><span class="line"><span class="cl">$ sudo apt install git make gcc flex bison libncurses-dev libelf-dev libssl-dev <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  debootstrap dwarves qemu-system -y
</span></span></code></pre></div></blockquote>
<p>Next, we clone the stable version of the Linux kernel. I&rsquo;ll be using <code>workdir</code> as the working directory.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ mkdir -p workdir/
</span></span><span class="line"><span class="cl">$ <span class="nb">cd</span> workdir/
</span></span><span class="line"><span class="cl">$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Checkout desired version (tag):</span>
</span></span><span class="line"><span class="cl">$ <span class="nb">cd</span> linux-stable/
</span></span><span class="line"><span class="cl">$ git checkout -b v6.6.102 v6.6.102
</span></span></code></pre></div></blockquote>
<p>After some trial and error, the following config seems to work for eBPF testing.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ make defconfig
</span></span><span class="line"><span class="cl">$ make kvm_guest.config
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Required for Debian Stretch and later:</span>
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_CONFIGFS_FS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_SECURITYFS y
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># BPF-related configs:</span>
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_BPF y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_BPF_SYSCALL y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_MODULES y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_BPF_EVENTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_PERF_EVENTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_HAVE_PERF_EVENTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_FUNCTION_TRACER y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_DEBUG_INFO y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_DEBUG_INFO_BTF y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_NET_CLS_BPF y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_NET_ACT_BPF y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_NET_SCH_INGRESS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_BPF_JIT y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_HAVE_BPF_JIT y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_CGROUP_BPF y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_KPROBES y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_HAVE_KPROBES y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_KPROBE_EVENTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_KPROBES_ON_FTRACE y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_UPROBES y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_UPROBE_EVENTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_ARCH_SUPPORTS_UPROBES y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_MMU y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_TRACEPOINTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_HAVE_SYSCALL_TRACEPOINTS y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_FTRACE y
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_FTRACE_SYSCALLS y
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">$ ./scripts/config --set-val CONFIG_CMDLINE_BOOL y
</span></span><span class="line"><span class="cl">$ <span class="nb">echo</span> <span class="s1">&#39;CONFIG_CMDLINE=&#34;net.ifnames=0&#34;&#39;</span> &gt;&gt; .config
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">$ make olddefconfig
</span></span></code></pre></div></blockquote>
<p>After setting up the build config, we can now build the kernel.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ make -j<span class="k">$(</span>nproc<span class="k">)</span>
</span></span></code></pre></div></blockquote>
<p><code>arch/x86/boot/bzImage</code> is now our newly-built kernel. Next, let&rsquo;s build a <a href="https://www.debian.org/releases/bullseye/">Debian-based (bullseye)</a> image for QEMU to boot with our custom kernel.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">cd</span> ../
</span></span><span class="line"><span class="cl">$ mkdir -p debian-bullseye/
</span></span><span class="line"><span class="cl">$ <span class="nb">cd</span> debian-bullseye/
</span></span><span class="line"><span class="cl">$ wget https://raw.githubusercontent.com/google/syzkaller/master/tools/create-image.sh
</span></span><span class="line"><span class="cl">$ chmod +x create-image.sh
</span></span><span class="line"><span class="cl">$ ./create-image.sh --feature full
</span></span></code></pre></div></blockquote>
<p><code>bullseye.img</code> is now our new Linux image. Let&rsquo;s boot it using QEMU, mapping local port 10021 to the VM&rsquo;s SSH port (22).</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">cd</span> ../
</span></span><span class="line"><span class="cl">$ qemu-system-x86_64 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -m 2G <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -smp <span class="m">2</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -kernel linux-stable/arch/x86/boot/bzImage <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -append <span class="s2">&#34;console=ttyS0 root=/dev/sda earlyprintk=serial net.ifnames=0&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -drive <span class="nv">file</span><span class="o">=</span>debian-bullseye/bullseye.img,format<span class="o">=</span>raw <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -net user,host<span class="o">=</span>10.0.2.10,hostfwd<span class="o">=</span>tcp:127.0.0.1:10021-:22 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -net nic,model<span class="o">=</span>e1000 <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -enable-kvm <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -nographic <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -pidfile vm.pid <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  2&gt;<span class="p">&amp;</span><span class="m">1</span> <span class="p">|</span> tee vm.log
</span></span></code></pre></div></blockquote>
<p>Using a separate terminal, we can now use <code>ssh</code> and <code>scp</code> to access the VM and copy files to it. Let&rsquo;s copy the <code>vortex-agent</code> binary to the home directory.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ scp -i debian-bullseye/bullseye.id_rsa -P <span class="m">10021</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -o <span class="s2">&#34;StrictHostKeyChecking no&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  <span class="nv">$VORTEX_AGENT_ROOT</span>/bin/vortex-agent <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  root@localhost:~/
</span></span></code></pre></div></blockquote>
<p>Finally, we can <code>ssh</code> to the VM.</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ ssh -i debian-bullseye/bullseye.id_rsa -p <span class="m">10021</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  -o <span class="s2">&#34;StrictHostKeyChecking no&#34;</span> root@localhost
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Then run the binary:</span>
</span></span><span class="line"><span class="cl">$ ./vortex-agent run --logtostderr
</span></span></code></pre></div></blockquote>
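<p>Once inside the VM, it may be worth confirming that the guest is actually running our custom kernel and that the eBPF prerequisites are in place (a quick sanity check; <code>/sys/kernel/btf/vmlinux</code> exists when <code>CONFIG_DEBUG_INFO_BTF=y</code>):</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"># Should report our custom kernel version (6.6.102 in this guide):
uname -r

# BTF blob consumed by CO-RE eBPF programs:
ls -lh /sys/kernel/btf/vmlinux

# Tracefs should be mounted for kprobe/uprobe events:
mount | grep tracefs
</code></pre></div></blockquote>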
<p>To shut down the VM, we can either run:</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ poweroff
</span></span></code></pre></div></blockquote>
<p>from within the VM, or kill the process from outside:</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">kill</span> <span class="o">[</span>-9<span class="o">]</span> <span class="k">$(</span>cat vm.pid<span class="k">)</span>
</span></span></code></pre></div></blockquote>
<br>
<p>Related blogs:</p>
<ol>
<li><a href="/blog/2025-08-21-on-building-vortex/">On building Vortex</a></li>
<li>This blog</li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-09-03:/blog/2025-09-03-ebpf-qemu-test/</guid><link>https://flowerinthenight.com/blog/2025-09-03-ebpf-qemu-test/</link><pubDate>Wed, 03 Sep 2025 00:00:00 JST</pubDate><title>Using QEMU for eBPF testing</title></item><item><description><![CDATA[<br>
<p>A quick update on what I&rsquo;ve been building for the last couple of weeks.</p>
<p>I’ve put together a small team at <a href="https://alphaus.cloud/">Alphaus</a> to explore <a href="https://ebpf.io/">eBPF</a> with a specific goal: building a tool for AI security and monitoring. The kernel-level visibility you get with eBPF is perfect for this, but the real work is in making it useful.</p>
<p>Our immediate focus has been on getting visibility into encrypted network traffic. We needed a way to inspect TLS traffic without messing with certificates or acting as a man-in-the-middle. The obvious path was using eBPF uprobes to hook user-space functions directly.</p>
<p>We&rsquo;re calling the tool <a href="https://vortex.nightblue.io/">Vortex</a> (thought it sounded cool). It looks something like this:</p>
<br>
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><a href="https://vortex.nightblue.io/">
<div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20250821-vortex-design-01.png"   style="height: auto; width: 90%"   >
</div>
</a>
</div>


<figcaption class="figure-caption mt-2 text-center" >Vortex design</figcaption>

</figure>
</div>
</div>

<br>
<p>And I&rsquo;m happy to report we&rsquo;ve had our first major successes. We can now reliably see plaintext traffic for apps built on Python (and other OpenSSL-based apps) by hooking into the <code>SSL_read[_ex]</code> and <code>SSL_write[_ex]</code> functions in the underlying OpenSSL library that Python&rsquo;s <code>ssl</code> module (and many other tools) use. This gives us the unencrypted buffer just before it&rsquo;s handed to the kernel for sending, or just after it&rsquo;s been received and decrypted.</p>
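<p>As an illustration of the general idea (not Vortex&rsquo;s actual code), a minimal <a href="https://github.com/bpftrace/bpftrace">bpftrace</a> one-liner can attach a uprobe to <code>SSL_write</code> and report plaintext sizes before encryption; the libssl path here is an assumption and varies by distro and version:</p>
<blockquote>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh">sudo bpftrace -e '
uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write {
  // SSL_write(ssl, buf, num): arg2 is the plaintext length.
  printf("pid %d (%s) wrote %d plaintext bytes\n", pid, comm, arg2);
}'
</code></pre></div></blockquote>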
<p>The same fundamental approach also works for NodeJS applications, although that part is still in progress. NodeJS ships its own TLS stack, but fortunately for us it uses the same function names, so we can attach to those as well and pull the plaintext data out.</p>
<p>This means we can now monitor the raw traffic of a huge range of services without any code instrumentation or network configuration changes.</p>
<p>Why are we doing this? The goal is to feed this traffic data into our monitoring agent. From there, we can do interesting work like detecting anomalous traffic patterns to (and from) AI models, identifying potential data exfiltration over encrypted channels, getting a clear map of what your AI services are actually talking to, etc.</p>
<p>It’s not trivial, of course. Dealing with function offsets, different library versions, and the nuances between runtimes is a challenge, but the core mechanism is solid.</p>
<p>More to share soon as we build out the next pieces.</p>
<blockquote>
<p><strong>Sidenote:</strong> This project actually got me excited about programming once again, as it involves tinkering with the low-level guts of the Linux kernel, which I love.</p></blockquote>
<p>The source code is <a href="https://github.com/flowerinthenight/vortex-agent/">here</a>, if you&rsquo;re interested.</p>
<br>
<p>Related blogs:</p>
<ol>
<li>This blog</li>
<li><a href="/blog/2025-09-03-ebpf-qemu-test/">Using QEMU for eBPF testing</a></li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-08-21:/blog/2025-08-21-on-building-vortex/</guid><link>https://flowerinthenight.com/blog/2025-08-21-on-building-vortex/</link><pubDate>Thu, 21 Aug 2025 00:00:00 JST</pubDate><title>On building Vortex</title></item><item><description><![CDATA[<br>
<p>Apologies for being silent for a while. Still alive here; still doing CTO work for Alphaus. At the same time, since about the start of the year, I had started another startup here in Tokyo with two other cofounders. Long story short: it didn&rsquo;t really pan out, at least for me. The startup is still moving forward, but with me no longer in it. I &ldquo;resigned&rdquo;, just recently actually, due to founder conflict. There was a lot going for it: it got accepted into the Antler Japan accelerator program (Batch 4) this year, and it received infra backing from Google Cloud worth about $100K. Anyway, it is what it is, and I have to move on.</p>
<p>To some extent, I&rsquo;m actually quite relieved that it didn&rsquo;t work out for me, as I was on the brink of burning out. I spent most of my non-work time (late nights, weekends, holidays) on that startup, to the point that it affected my health and personal relationships (not surprising, really). Continuing down that path would have been disastrous.</p>
<p>Now, I&rsquo;m working on another startup (again) in my spare time, and at my own pace. I&rsquo;m still with Alphaus during normal work hours, but I&rsquo;ve been building this new one slowly and steadily. It&rsquo;s still bootstrapped at the moment; I don&rsquo;t think it&rsquo;s a good idea, nor good timing, to entertain any form of outside investment at this point given its current state. My CEO knows about it, and he&rsquo;s very supportive.</p>
<p>I might talk about it at some point in the future but in the meantime, I&rsquo;ll continue building it in stealth mode.</p>
<p>Just so you know.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-06-15:/blog/2025-06-15-some-updates/</guid><link>https://flowerinthenight.com/blog/2025-06-15-some-updates/</link><pubDate>Sun, 15 Jun 2025 00:00:00 JST</pubDate><title>What’s been going on?</title></item><item><description><![CDATA[<p>This is a common phrase I come across in Rust forums and IRC rooms. And I got a taste of what it is after I finished porting <a href="https://github.com/flowerinthenight/hedge/">hedge</a> to Rust (called <a href="https://github.com/flowerinthenight/hedge-rs/">hedge-rs</a>, obviously). Coming from C/C++/Go/Zig, working with Rust&rsquo;s borrow checker takes some getting used to. Just to get the thing working, at least initially, I had to resort to bits of circumvention that feel like &ldquo;dirty&rdquo; tricks to me at the moment. I&rsquo;m sure I&rsquo;ll learn the more idiomatic ways of doing these in the (near) future as I write more Rust code. Here are some of those observations.</p>
<p>For context, <a href="https://github.com/flowerinthenight/hedge-rs/">hedge-rs</a> is a non-async, multithreaded code that runs on multiple nodes communicating with each other using an internal protocol on top of TCP.</p>
<h4 id="lots-of-cloneing">Lots of <code>.clone()</code>ing</h4>
<p>In Rust, accessing a variable from another thread isn&rsquo;t as straightforward. This doesn&rsquo;t compile.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">args</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span>::<span class="n">args</span><span class="p">().</span><span class="n">collect</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args1: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>The new thread is &ldquo;borrowing&rdquo; <code>args</code> from the calling function, which it could outlive. When <code>args</code> goes out of scope, the thread could be accessing a value that is no longer there.</p>
<p>We could &ldquo;move&rdquo; <code>args</code> into the new thread by doing this.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">args</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span>::<span class="n">args</span><span class="p">().</span><span class="n">collect</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// Move args to the new thread.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args1: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>This now compiles (<code>args</code> has been moved to the new thread), but then I can no longer do this.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">args</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span>::<span class="n">args</span><span class="p">().</span><span class="n">collect</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// Move args to the new thread.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args1: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// args was now moved to the other thread.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args2: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">[</span><span class="mi">2</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>Even if we argue that this code is correct because we only read the values and never modify them, it won&rsquo;t compile: <code>args</code> is moved into the closure, so the later use of <code>args</code> in <code>main</code> is an error. We can clone (or copy) the data instead. The following compiles.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">args</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">String</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span>::<span class="n">args</span><span class="p">().</span><span class="n">collect</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// Create a clone of args; move to new thread.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">args_clone</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">args</span><span class="p">.</span><span class="n">clone</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args1: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args_clone</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// args is still available to us here.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;args2: </span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">args</span><span class="p">[</span><span class="mi">2</span><span class="p">]);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>There are several of these in <a href="https://github.com/flowerinthenight/hedge-rs/">hedge-rs</a>. I don&rsquo;t see this as a big problem for small, stack-allocated values, but the extra allocation and copy might matter for big-ish structs or arrays/vectors.</p>
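<p>For reference, here is a complete, runnable version of the clone approach, with the <code>use</code> statements the snippets above elide, a <code>join</code> so the spawned thread finishes before <code>main</code> returns, and the whole vector printed so it doesn&rsquo;t panic when no arguments are passed:</p>

```rust
use std::env;
use std::thread;

fn main() {
    let args: Vec<String> = env::args().collect();

    // Clone before the move so the original stays usable.
    let args_clone = args.clone();
    let handle = thread::spawn(move || {
        // Only args_clone was moved into this closure.
        println!("in thread: {:?}", args_clone);
    });

    // args is still valid here because the clone, not the
    // original, was moved.
    println!("in main: {:?}", args);

    // Without this join, main may return before the spawned
    // thread gets to print anything.
    handle.join().unwrap();
}
```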
<h4 id="arcmutext-for-mutable-objects"><code>Arc&lt;Mutex&lt;T&gt;&gt;</code> for mutable objects</h4>
<p>To pass mutable objects around across threads, Rust provides <code>Arc&lt;Mutex&lt;T&gt;&gt;</code>: a mutex inside an &ldquo;Atomically Reference Counted&rdquo; pointer. This lets us mutate a shared value from multiple threads in a thread-safe manner. (The examples below assume <code>use std::sync::{Arc, Mutex};</code> and <code>use std::thread;</code>.)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="cp">#[derive(Debug)]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">struct</span> <span class="nc">X</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">val</span>: <span class="kt">i32</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">impl</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">fn</span> <span class="nf">inc</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="kt">i32</span> <span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Arc</span>::<span class="n">new</span><span class="p">(</span><span class="n">Mutex</span>::<span class="n">new</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">val</span>: <span class="mi">0</span><span class="w"> </span><span class="p">}));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">x_clone</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">clone</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;</span><span class="si">{:?}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">x_clone</span><span class="p">.</span><span class="n">lock</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">inc</span><span class="p">());</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c1">// mutex unlocks when out of scope
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;</span><span class="si">{:?}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">lock</span><span class="p">().</span><span class="n">unwrap</span><span class="p">().</span><span class="n">inc</span><span class="p">());</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// mutex unlocks when out of scope
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>In this code, <code>x_clone</code> looks like our previous example, but it isn&rsquo;t the same thing. For <code>Arc</code>s, cloning copies the pointer and increments the reference count; it does not clone the inner object. <code>x_clone</code> and <code>x</code> still point to the same <code>X</code> object, and <code>Arc</code> only cleans up the inner object once the reference count drops to zero.</p>
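<p>To make the shared-ownership point concrete, here is a small runnable sketch (not from the post) where a mutation done through the clone is visible through the original handle:</p>

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0));

    let counter_clone = Arc::clone(&counter);
    let handle = thread::spawn(move || {
        // Lock, mutate, then implicitly unlock when the guard drops.
        *counter_clone.lock().unwrap() += 1;
    });
    handle.join().unwrap();

    // Both handles point at the same inner value, so the
    // thread's increment is visible here.
    assert_eq!(*counter.lock().unwrap(), 1);
    println!("counter = {}", counter.lock().unwrap());
}
```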
<p>This is actually alright for me, albeit with the slightly odd ergonomics of <code>.clone()</code>, and of <code>.lock()</code> with its implicit unlock when the guard goes out of scope. What is slightly unexpected is that the same applies to atomics: I still have to wrap an atomic in an <code>Arc</code> to share it across threads. For example (assuming <code>use std::sync::atomic::{AtomicU32, Ordering};</code>),</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Arc</span>::<span class="n">new</span><span class="p">(</span><span class="n">AtomicU32</span>::<span class="n">new</span><span class="p">(</span><span class="mi">0</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">a_clone</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">clone</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">a_clone</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="n">Ordering</span>::<span class="n">Relaxed</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;</span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">a_clone</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">Ordering</span>::<span class="n">Acquire</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">a</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="n">Ordering</span>::<span class="n">Relaxed</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="fm">println!</span><span class="p">(</span><span class="s">&#34;</span><span class="si">{}</span><span class="s">&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">Ordering</span>::<span class="n">Acquire</span><span class="p">));</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>I feel like there shouldn&rsquo;t be a need for <code>Arc</code> here since the value is, you know, already atomic. To be fair, the <code>Arc</code> provides shared ownership rather than synchronization: <code>thread::spawn</code> requires a <code>&#39;static</code> closure, so it cannot simply borrow a local atomic.</p>
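<p>Scoped threads do remove the need for the <code>Arc</code> entirely: <code>std::thread::scope</code> (stable since Rust 1.63) guarantees that spawned threads finish before the scope returns, so they can borrow a local atomic directly. A runnable sketch:</p>

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

fn main() {
    let a = AtomicU32::new(0);

    // Scoped threads are joined before scope() returns, so the
    // closure may borrow `a`; no Arc, no move of ownership.
    thread::scope(|s| {
        s.spawn(|| {
            a.store(1, Ordering::Relaxed);
        });
    });

    // The scoped thread has finished by now.
    a.fetch_add(1, Ordering::Relaxed);
    assert_eq!(a.load(Ordering::Relaxed), 2);
    println!("{}", a.load(Ordering::Relaxed));
}
```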
<h4 id="using-vect-for-late-initialization-of-objects">Using <code>Vec&lt;T&gt;</code> for late initialization of objects</h4>
<p>This one definitely feels like a cheat to me. Say you have a struct field that you want to initialize later, like this <code>Lock</code> inside <code>Op</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Op</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">lock</span>: <span class="nc">Arc</span><span class="o">&lt;</span><span class="n">Mutex</span><span class="o">&lt;</span><span class="n">Lock</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">tx_worker</span>: <span class="nc">Sender</span><span class="o">&lt;</span><span class="n">WorkerCtrl</span><span class="o">&gt;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>Instantiating <code>Op</code> without setting up both <code>lock</code> and <code>tx_worker</code> up front is not that straightforward; Rust has no null, so something like this won&rsquo;t do:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="kd">let</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Op</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">lock</span>: <span class="o">???</span><span class="p">,</span><span class="w"> </span><span class="n">tx_worker</span>: <span class="o">???</span><span class="w"> </span><span class="p">};</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">// Initialize o.lock here.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="c1">// Initialize o.tx_worker here.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span></code></pre></div><p>What did I do? Use vectors instead.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">pub</span><span class="w"> </span><span class="k">struct</span> <span class="nc">Op</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">lock</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Arc</span><span class="o">&lt;</span><span class="n">Mutex</span><span class="o">&lt;</span><span class="n">Lock</span><span class="o">&gt;&gt;&gt;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">tx_worker</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Sender</span><span class="o">&lt;</span><span class="n">WorkerCtrl</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// So I can do this:
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">o</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Op</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">lock</span>: <span class="nc">vec</span><span class="o">!</span><span class="p">[],</span><span class="w"> </span><span class="n">tx_worker</span>: <span class="nc">vec</span><span class="o">!</span><span class="p">[]</span><span class="w"> </span><span class="p">};</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// Late initialization of lock.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="n">o</span><span class="p">.</span><span class="n">lock</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="fm">vec!</span><span class="p">[</span><span class="n">Arc</span>::<span class="n">new</span><span class="p">(</span><span class="n">Mutex</span>::<span class="n">new</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">LockBuilder</span>::<span class="n">new</span><span class="p">()</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">db</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">db</span><span class="p">.</span><span class="n">clone</span><span class="p">())</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">table</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">table</span><span class="p">.</span><span class="n">clone</span><span class="p">())</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">name</span><span class="p">(</span><span class="n">lock_name</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">id</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">id</span><span class="p">.</span><span class="n">clone</span><span class="p">())</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">lease_ms</span><span class="p">(</span><span class="n">lease_ms</span><span class="p">)</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">leader_tx</span><span class="p">(</span><span class="nb">Some</span><span class="p">(</span><span class="n">tx_ldr</span><span class="p">))</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">.</span><span class="n">build</span><span class="p">(),</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">))];</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// And late initialization of o.tx_worker.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="n">WorkerCtrl</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="n">WorkerCtrl</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">o</span><span class="p">.</span><span class="n">tx_worker</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="fm">vec!</span><span class="p">[</span><span class="n">tx</span><span class="p">.</span><span class="n">clone</span><span class="p">()];</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// I can then use lock like so:
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">lock</span><span class="p">.</span><span class="n">len</span><span class="p">()</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c1">// use o.lock[0]
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="c1">// Same with tx_worker.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="n">o</span><span class="p">.</span><span class="n">tx_worker</span><span class="p">.</span><span class="n">len</span><span class="p">()</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="c1">// use o.tx_worker[0]
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>It doesn&rsquo;t feel right, but it works.</p>
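<p>For what it&rsquo;s worth, the more conventional tool for late initialization in Rust is <code>Option&lt;T&gt;</code>. A minimal sketch, using a simplified stand-in for the <code>Lock</code> type above (the real one is built via <code>LockBuilder</code> and carries db/table/lease state):</p>

```rust
use std::sync::{Arc, Mutex};

// Simplified, hypothetical stand-in for the Lock type above.
struct Lock {
    name: String,
}

struct Op {
    lock: Option<Arc<Mutex<Lock>>>,
}

fn main() {
    // Instantiate with no lock yet; no Vec needed.
    let mut o = Op { lock: None };

    // Late initialization.
    o.lock = Some(Arc::new(Mutex::new(Lock {
        name: String::from("my-lock"),
    })));

    // Analogous to the len() > 0 check.
    if let Some(lock) = &o.lock {
        println!("lock name: {}", lock.lock().unwrap().name);
    }
}
```

<p>Whether this reads better than the <code>Vec</code> trick is a matter of taste, but <code>Option</code> makes the &ldquo;maybe not initialized yet&rdquo; state explicit in the type.</p>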
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-03-28:/blog/2025-03-28-more-on-rust/</guid><link>https://flowerinthenight.com/blog/2025-03-28-more-on-rust/</link><pubDate>Fri, 28 Mar 2025 00:00:00 JST</pubDate><title>Fighting Rust’s borrow checker</title></item><item><description><![CDATA[<p>Last year, I <a href="/blog/2024-08-20-thoughts-on-newer-system-languages/">explored</a> the possibility of adding a complementary systems programming language to our stack. And I mentioned that I was leaning more into <a href="https://ziglang.org/">Zig</a> as it resonates with me and my biases. Well, I spent a bit of my time last week porting <a href="https://github.com/flowerinthenight/spindle">spindle</a> to <a href="https://www.rust-lang.org/">Rust</a> (called <a href="https://github.com/flowerinthenight/spindle-rs">spindle-rs</a>). This blog is me sharing some of my initial impressions.</p>
<p>The first thing is, I quite like it, which, for me, is a very important criterion. Mind you, as a piece of software, <code>spindle-rs</code> is not really that complicated, so I&rsquo;m barely scratching the surface of what writing serious systems software in Rust is going to be like. Still, here are some impressions from that experience.</p>
<p>Tooling is pretty good. Toolchain installation is handled by <a href="https://github.com/rust-lang/rustup">rustup</a>, which allows easy switching between stable, nightly, and beta. <a href="https://github.com/rust-lang/rust-analyzer">rust-analyzer</a>, Rust&rsquo;s LSP server, works out of the box and behaves the way you&rsquo;d expect most of the time, albeit a tad slowly at times. <a href="https://github.com/rust-lang/rustfmt">rustfmt</a> is good, with decent defaults. And <a href="https://github.com/rust-lang/cargo">cargo</a>, Rust&rsquo;s package manager, is excellent. Coming from C and C++, and having experienced CMake, Meson, Ninja, and vcpkg, I find cargo, with its tight integration with Rust, a breath of fresh air, especially in its ease of use. As far as tooling is concerned, the overall development experience is top notch.</p>
<p>As a language, well, it takes some getting used to. The biggest friction for me comes from &ldquo;fighting the borrow checker&rdquo;, as Rustaceans put it. No surprise there, as it&rsquo;s what really makes Rust, Rust. My mental programming model, which translates pretty well between C, C++, Go, and Zig, doesn&rsquo;t map that well onto Rust&rsquo;s semantics. Not yet, at least. This is not a criticism, of course; programming in Rust just means doing it &ldquo;the Rust way&rdquo;, and that only comes with experience.</p>
<p>Syntax-wise, nothing really stood out to me; it feels similar to Zig. If I don&rsquo;t put much thought into the control flow, though, it&rsquo;s easy to end up with deeply nested code due to how pattern matching combines with <code>Result</code> and <code>Option</code>. Something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="line"><span class="cl"><span class="k">pub</span><span class="w"> </span><span class="k">fn</span> <span class="nf">run</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="bp">self</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="n">thread</span>::<span class="n">spawn</span><span class="p">(</span><span class="k">move</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="k">loop</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="c1">// First, see if already locked (could be us or somebody else).
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">            </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="n">DiffToken</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="n">DiffToken</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Err</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tx_main</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ProtoCtrl</span>::<span class="n">CheckLock</span><span class="p">(</span><span class="n">tx</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">continue</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">locked</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">false</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">match</span><span class="w"> </span><span class="n">rx</span><span class="p">.</span><span class="n">recv</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nb">Ok</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="na">&#39;single</span>: <span class="nc">loop</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="c1">// We are leader now.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="n">token</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">Ordering</span>::<span class="n">Acquire</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">token</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="k">break</span><span class="w"> </span><span class="nl">&#39;single</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="c1">// We&#39;re not leader now.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">diff</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="k">break</span><span class="w"> </span><span class="nl">&#39;single</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="k">break</span><span class="w"> </span><span class="nl">&#39;single</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="nb">Err</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=&gt;</span><span class="w"> </span><span class="k">continue</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">if</span><span class="w"> </span><span class="n">locked</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">continue</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="k">if</span><span class="w"> </span><span class="n">initial</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="c1">// Attempt first ever lock. The return commit timestamp will be our fencing
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                </span><span class="c1">// token. Only one node should be able to do this successfully.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                </span><span class="n">initial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">false</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tx_main</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ProtoCtrl</span>::<span class="n">InitialLock</span><span class="p">(</span><span class="n">tx</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rx</span><span class="p">.</span><span class="n">recv</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="o">-</span><span class="mi">1</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="n">token</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">Ordering</span>::<span class="n">Relaxed</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="c1">// For the succeeding lock attempts.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="n">Record</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="n">Record</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tx_main</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ProtoCtrl</span>::<span class="n">CurrentToken</span><span class="p">(</span><span class="n">tx</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rx</span><span class="p">.</span><span class="n">recv</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">token</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="k">continue</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="c1">// Attempt to grab the next lock. Multiple nodes could potentially
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                        </span><span class="c1">// do this successfully.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">update</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">false</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">token_up</span>: <span class="kt">i128</span> <span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">String</span>::<span class="n">new</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="fm">write!</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="s">&#34;{}_{}&#34;</span><span class="p">,</span><span class="w"> </span><span class="n">lock_name</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">.</span><span class="n">token</span><span class="p">).</span><span class="n">unwrap</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tx_main</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ProtoCtrl</span>::<span class="n">NextLockInsert</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">tx</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rx</span><span class="p">.</span><span class="n">recv</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                </span><span class="k">if</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                    </span><span class="n">update</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">true</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                    </span><span class="n">token_up</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">t</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="k">if</span><span class="w"> </span><span class="n">update</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="c1">// We got the lock. Attempt to update the current token
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                            </span><span class="c1">// to this commit timestamp.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                            </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">tx</span><span class="p">,</span><span class="w"> </span><span class="n">rx</span><span class="p">)</span>: <span class="p">(</span><span class="n">Sender</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">,</span><span class="w"> </span><span class="n">Receiver</span><span class="o">&lt;</span><span class="kt">i128</span><span class="o">&gt;</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">channel</span><span class="p">();</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">_</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tx_main</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ProtoCtrl</span>::<span class="n">NextLockUpdate</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">token</span>: <span class="nc">token_up</span><span class="p">,</span><span class="w"> </span><span class="n">tx</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rx</span><span class="p">.</span><span class="n">recv</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                    </span><span class="k">if</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                        </span><span class="c1">// Doesn&#39;t mean we&#39;re leader, yet.
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">                                        </span><span class="n">token</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="n">token_up</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">u64</span><span class="p">,</span><span class="w"> </span><span class="n">Ordering</span>::<span class="n">Relaxed</span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                                </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                    </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">                </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="p">}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">});</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="o">..</span><span class="p">.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>In Zig, it&rsquo;s easier to tame very deep nesting thanks to its support for labeled blocks; most blocks can be labeled, and you can do early returns with <code>break :label;</code>. Rust also supports labels but, as far as I&rsquo;m aware, only on loops and on blocks used as expressions (e.g. in <code>let</code> bindings).</p>
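<p>Since Rust 1.65, a plain block can also be labeled and exited early with <code>break 'label value</code>, which flattens nesting in much the same way. A minimal sketch (the flag bits and names here are made up for illustration, not taken from <code>spindle-rs</code>):</p>

```rust
fn classify(flags: u64) -> &'static str {
    // Labeled block (Rust 1.65+): `break 'done value` exits the block
    // early with a value, avoiding nested if/else pyramids.
    'done: {
        if flags & 0x1 == 0 {
            break 'done "disabled"; // early return from the block
        }
        if flags & 0x2 != 0 {
            break 'done "priority";
        }
        "normal"
    }
}

fn main() {
    println!("{}", classify(0x0)); // disabled
    println!("{}", classify(0x3)); // priority
}
```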
<p>For example, in Zig, you can do something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-zig" data-lang="zig"><span class="line"><span class="cl"><span class="n">label</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kr">const</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">msg</span><span class="p">.</span><span class="n">proto2</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x8000000000000000</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">63</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">on</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="p">)</span><span class="w"> </span><span class="k">break</span><span class="w"> </span><span class="o">:</span><span class="n">label</span><span class="p">;</span><span class="w"> </span><span class="c1">// early return
</span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kr">const</span><span class="w"> </span><span class="n">lmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">msg</span><span class="p">.</span><span class="n">proto2</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x7FFFFFFF80000000</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;&gt;</span><span class="w"> </span><span class="mi">31</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="kr">const</span><span class="w"> </span><span class="n">lmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">((</span><span class="n">msg</span><span class="p">.</span><span class="n">proto2</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="mh">0x700000007FFFFFFF</span><span class="p">))</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">1000</span><span class="p">;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nb">@atomicStore</span><span class="p">(</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="kt">u64</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="o">&amp;</span><span class="n">self</span><span class="p">.</span><span class="n">elex_tm_min</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">lmin</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="n">std</span><span class="p">.</span><span class="n">builtin</span><span class="p">.</span><span class="n">AtomicOrder</span><span class="p">.</span><span class="n">seq_cst</span><span class="p">,</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">);</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="p">...</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="p">}</span><span class="w">
</span></span></span></code></pre></div><p>And then, there&rsquo;s async Rust. The only experience I have with async programming (with <code>async</code>/<code>await</code> and its viral nature) was with C#, and quite frankly, I didn&rsquo;t really warm up to the idea. I understand the need for a fast, runtime-assisted, concurrent programming model, but coming from Go, async Rust feels like a totally different language. At least in Go, code for concurrent programming using goroutines and channels still looks and feels like normal Go code. Mixing concurrent with synchronous Go code doesn&rsquo;t feel &ldquo;weird&rdquo;. Not so with async Rust. When writing <code>spindle-rs</code>, I had to use the <a href="https://crates.io/crates/google-cloud-spanner">google-cloud-spanner</a> crate, which is async only. Since I didn&rsquo;t want to write <code>spindle-rs</code> in async, I had to &ldquo;bridge&rdquo; the async crate with my sync code. Mixing async with sync Rust code feels &ldquo;weird&rdquo;. At the moment, my understanding of how <a href="https://tokio.rs/">Tokio</a>&rsquo;s runtime works (Tokio seems to be &ldquo;the runtime&rdquo; to use when doing async Rust) is not deep enough for me to know whether it&rsquo;s even a good idea to mix the two. That said, this is, again, not a criticism of async Rust. I can even see myself using it when doing embedded programming.</p>
<p>The only real complaint I have right now is the slow compile times. I know this is actively being worked on, so I won&rsquo;t hold it against Rust too much for now. For all the criticism hurled at Go, it got compilation times right. Even Zig compiles much faster despite being the younger language. But I&rsquo;m sure Rust will get better in time.</p>
<p>So, going back to my choices: personally, I enjoy writing Zig more, so I will continue using it for personal projects, but I can&rsquo;t, in good conscience, push it to the company; not until it reaches 1.0 with a relatively mature ecosystem of libraries. Then again, we&rsquo;ll probably have established our Rust usage by that time anyway, so it wouldn&rsquo;t make sense to &ldquo;replace&rdquo; it once Zig becomes viable. So there you go. Rust it is. And I believe it&rsquo;s objectively the better choice.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-03-18:/blog/2025-03-18-on-rust/</guid><link>https://flowerinthenight.com/blog/2025-03-18-on-rust/</link><pubDate>Tue, 18 Mar 2025 00:00:00 JST</pubDate><title>Initial impressions on Rust</title></item><item><description><![CDATA[<p>Let me start with a disclaimer. These thoughts come from a startup perspective where funds and personnel are a bit scarce. There are a lot of generalizations and assumptions made here as well; take them with a pinch of salt.</p>
<p>In just a short span of time, the speed of improvement to LLMs, especially the mainstream ones, is nothing short of astonishing. The massive ones are getting smarter, faster, and more accurate. And even better, the open ones, such as Llama, DeepSeek, Gemma, Qwen, etc., are also catching up, which is a good thing, as I&rsquo;m more interested in them. And for enterprises looking into integrating LLMs into their internal workflows, or even products, there are so many options available now that it&rsquo;s quite confusing where to even start. I hope this blog will shed some light on some of that confusion.</p>
<p>One of the first things to consider is whether to go AIaaS (AI as a service, or external) or deploy internally. While I don&rsquo;t really have a problem with using Gemini, ChatGPT, or Sonnet for explorations, I think when internal data or knowledge-bases are involved, hybrid deployment is the way to go. Hybrid in this sense means hosting LLMs internally for more critical information (for obvious reasons), and using external LLMs for the rest. Ultimately, the decision of what to use, whether external, hybrid, or internal, will depend on your company&rsquo;s data privacy policies, governance, and regulations. I won&rsquo;t be digging further into going external as it&rsquo;s more about tracking and controlling what information is being included in the prompts than about deployment, so we treat it as we do any other API-based vendor integration. For internal deployment, on the other hand, we generally have two options: a) fine-tuning a closed LLM, and b) using open models.</p>
<p>Fine-tuning a closed LLM really depends on whether that feature is provided by the vendor. For example, OpenAI&rsquo;s ChatGPT and Google&rsquo;s Gemini can be fine-tuned with your own data, and you can use the fine-tuned version as your new LLM; the vendor will still host that LLM for you, and you use it as you would any other LLM: through its API. Now, you might argue that it&rsquo;s technically not an internal deployment, and you might be right. I only categorized it as such since in doing so, there&rsquo;s almost always a provision from the vendor effectively promising not to use your fine-tuning data as training data for their next LLM versions, so there&rsquo;s an assumption of privacy there; whether you trust them or not is up to you.</p>
<p>The second option, using open models, is really what you want. The idea is that, for enterprise use, you don&rsquo;t really need, say, ChatGPT&rsquo;s vast, generic knowledge of the world. You want an LLM that understands your hiring and onboarding policies instead of knowing the historical weather data of the Atlantic for the past decade; you want its inferencing (or &ldquo;thinking&rdquo;) capabilities applied to your internal data instead. In this route, you can either fine-tune an open model with your data, all hosted internally, or use an open model as is, and use <a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/">RAG</a> to augment, and <a href="https://techcommunity.microsoft.com/blog/azuredevcommunityblog/why-and-how-to-ground-a-large-language-models-using-your-data-rag/4152064">ground</a> it with your internal data, also hosted internally.</p>
<p>So, RAG? Or fine-tuning? I think doing both is the way to go. The general rule of thumb (advice I got from the Gemma Japan team) is fine-tune for static (or near-static) data, and RAG for more dynamic, always-changing data.</p>
<p>Considerations for fine-tuning are expertise and costs. To fine-tune an LLM, you need quality training/fine-tuning data. And since enterprise data are usually all over the place, and often not centralized, data preparation is actually one of the biggest hurdles. You probably need a team of data engineers, data scientists, and infra/ops personnel to pull this off. And the costs will involve the upfront costs of the actual fine-tuning (you need compute, both CPU and GPU/TPU), and the ongoing inferencing (or serving) costs, which will involve compute (CPU/GPU/TPU), storage, and networking.</p>
<p>Considerations for RAG are also expertise and costs. Setting up RAG-based workflows involves some important components: LLM routers, embeddings generators, and vector databases. You will most definitely be using multiple LLMs for multiple purposes: one for text summarization or generation, one for reasoning, another for research or coding, and so on. Each LLM will be deployed separately; it could be on a big VM, or a cluster of VMs. And since these clusters need GPUs, you&rsquo;d probably want a serverless, scale-to-zero environment, as GPUs will bear most of the costs in this layer. You might be able to get away with traditional auto-scaling clusters (with additional checks in place to scale down during idle time) but environments like Kubernetes with, say, <a href="https://keda.sh/">Keda</a> for scale-to-zero, or <a href="https://fly.io/">Fly</a>&rsquo;s GPUs (which I believe can scale to zero), or GCP&rsquo;s <a href="https://cloud.google.com/blog/products/application-development/run-your-ai-inference-applications-on-cloud-run-with-nvidia-gpus">Cloud Run</a>, etc., will be easier to manage. So with multiple LLM deployments, you&rsquo;d also want a router that will route input requests, or prompts, to the appropriate LLM target. You can do this traditionally, utilizing an LLM to facilitate the routing, or you could also do <a href="https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/">MoE (Mixture of Experts)</a> deployments, where the routing is done within the LLM itself. One issue, however, is that there aren&rsquo;t a lot of open MoE models yet; there are IBM&rsquo;s Granite, Mistral-MoE, and Qwen-MoE models (I&rsquo;m monitoring this space closely as well; I think there will be more improvements here in the near future).</p>
<p>Embeddings generators and vector databases are specific to RAG. To &ldquo;feed&rdquo; your enterprise data to an LLM (as opposed to fine-tuning), you need to generate &ldquo;embeddings&rdquo; for them first. Embeddings are vector representations (with semantic context) of your data. These embeddings will then be stored in a vector database for later use. The more data you have, or at least the more data you want an LLM to have access to, the more embeddings you will generate. How many you end up with will depend on the size of the context windows of the LLMs you choose. An example would be: 1 page of a document is 1 embedding; or, if an LLM&rsquo;s context window is smaller (looking at you, Gemma2), you could &ldquo;chunk&rdquo; your data into smaller pieces of defined length, where 1 chunk will be 1 embedding, and so on. Options for generators are plentiful; you can use the mainstream providers&rsquo; vector embedding APIs, such as OpenAI&rsquo;s Vector Embeddings API, GCP&rsquo;s Embeddings API, AWS&rsquo; Titan Text Embeddings, etc., or use open models, such as Word2Vec, LexVec, BERT, Chroma, etc., although you still need to host them internally. Choices for vector databases are also the same; you can use vendor-provided ones, or self-host open source ones. Note that these are still databases, so when self-hosting, you still need to tackle the headaches of deploying databases, even vendor-provided ones. Once these are in place, you do RAG by converting the input query, or prompt, to its embedding equivalent, doing a semantic/similarity/distance search in your vector database, mapping the resulting embeddings to the real data you have, and then using those data as context to (or part of) your final prompt to the target LLM. So, still an ops-heavy deployment, as you can see, notwithstanding the costs, both upfront and operational, that you will incur in deploying these components.</p>
<p>I will write something about cost estimations or simulations regarding these deployments, and some ideas on actual deployments as well, but that will be on a separate blog.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-02-25:/blog/2025-02-25-considerations-llm-integration/</guid><link>https://flowerinthenight.com/blog/2025-02-25-considerations-llm-integration/</link><pubDate>Tue, 25 Feb 2025 00:00:00 JST</pubDate><title>Considerations on deploying LLM-based workflows</title></item><item><description><![CDATA[<p>The ability to produce statically-linked binaries by default in Go is one of the many good reasons why I appreciate and use it. Being able to copy or move only a single file around across all sorts of locations without worrying too much about missing libraries or runtimes has definitely saved me a lot of annoyances many times over. However, working on <a href="https://github.com/flowerinthenight/hedge-cb">hedge-cb</a>, and therefore, CGO, for the past few days, reminded me once again how much of a good thing static binaries are. So I tried checking out whether I could still do static linking with CGO.</p>
<p>As I mentioned, even though Go produces static binaries by default, introducing CGO into the mix makes your binary dynamically-linked instead, as it will now require a C runtime, usually <code>glibc</code>, to be present on the target environment. That in itself is not really such a bad thing as <code>glibc</code> is almost always present in most Linux distributions used in production. The annoyance, however, starts when you introduce a dependency through CGO, in which case, you will need to make sure that that dependency&rsquo;s <code>lib***.so</code> is also present in your target environment, alongside the C runtime.</p>
<p>For a binary that depends on <a href="https://github.com/flowerinthenight/hedge-cb">hedge-cb</a>, and therefore, <a href="https://github.com/aws/clock-bound">ClockBound</a>, deploying it to, say, EC2 (it&rsquo;s AWS-native after all), requires the following to be present in the VM:</p>
<ul>
<li>the <a href="https://github.com/aws/clock-bound/tree/main/clock-bound-d">ClockBound daemon</a> running - our source for &ldquo;true time&rdquo;;</li>
<li><code>libclockbound.so</code> copied to <code>/usr/lib/</code> or <code>/usr/lib64/</code> - output when compiling <a href="https://github.com/aws/clock-bound/tree/main/clock-bound-ffi">ClockBound FFI</a>;</li>
<li>a C runtime - Amazon Linux comes with <code>glibc</code> preinstalled.</li>
</ul>
<p>There&rsquo;s no getting around the ClockBound daemon as it&rsquo;s functionally required. But I can do away with <code>libclockbound.so</code> and <code>glibc</code> altogether if the binary is statically built. I will share what I did in this blog, although this is not the only way to do it. Now, there&rsquo;s a fair bit of discussion on the interwebs about the pros and cons of static binaries when it comes to the C runtime; I&rsquo;m not going to rehash them here. For example purposes only, I went with <a href="https://musl.libc.org/">musl</a>, a lightweight alternative to <code>glibc</code>, and the <a href="https://ziglang.org/">Zig</a> toolchain. Zig here is probably optional as <code>musl</code> comes with <code>musl-gcc</code>, which can serve as my CGO compiler, but I was struggling to make it work. Zig (<code>zig cc</code> to be specific), on the other hand, was a breeze.</p>
<p>First, let&rsquo;s build ClockBound&rsquo;s FFI. A Rust environment is required here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">cd</span> /tmp/ <span class="o">&amp;&amp;</span> git clone https://github.com/aws/clock-bound
</span></span><span class="line"><span class="cl">$ <span class="nb">cd</span> clock-bound/clock-bound-ffi/
</span></span><span class="line"><span class="cl">$ rustup target add x86_64-unknown-linux-musl
</span></span><span class="line"><span class="cl">$ cargo build --release --target<span class="o">=</span>x86_64-unknown-linux-musl
</span></span></code></pre></div><p>The build outputs of note here are <code>libclockbound.so</code>, which we don&rsquo;t need, and <code>libclockbound.a</code>, which we do. Now, let&rsquo;s build <code>musl</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">cd</span> /tmp/ <span class="o">&amp;&amp;</span> wget https://musl.libc.org/releases/musl-1.2.5.tar.gz
</span></span><span class="line"><span class="cl">$ tar xvzf musl-1.2.5.tar.gz <span class="o">&amp;&amp;</span> <span class="nb">cd</span> musl-1.2.5/ <span class="o">&amp;&amp;</span> ./configure <span class="o">&amp;&amp;</span> make <span class="o">&amp;&amp;</span> sudo make install
</span></span></code></pre></div><p>Next, copy the FFI archive library to <code>musl</code>&rsquo;s install location.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ sudo cp -v /tmp/clock-bound/target/x86_64-unknown-linux-musl/release/libclockbound.a /usr/local/musl/lib/
</span></span></code></pre></div><p>Finally, build the binary using <code>zig cc</code> as our CGO compiler. Here, I&rsquo;m building <code>hedge-cb</code>&rsquo;s sample code.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ <span class="nb">cd</span> /tmp/ <span class="o">&amp;&amp;</span> git clone https://github.com/flowerinthenight/hedge-cb <span class="o">&amp;&amp;</span> <span class="nb">cd</span> hedge-cb/example/
</span></span><span class="line"><span class="cl">$ cp /tmp/clock-bound/clock-bound-ffi/include/clockbound.h .
</span></span><span class="line"><span class="cl">$ <span class="nv">CC</span><span class="o">=</span><span class="s2">&#34;zig cc -target x86_64-linux-musl -I. -L/usr/local/musl/lib -lunwind&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl"><span class="se"></span>  <span class="nv">GOOS</span><span class="o">=</span>linux <span class="nv">GOARCH</span><span class="o">=</span>amd64 go build -v --ldflags <span class="s1">&#39;-linkmode=external -extldflags=-static&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># Output should be a static binary.</span>
</span></span><span class="line"><span class="cl">$ ldd ./example
</span></span><span class="line"><span class="cl">      not a dynamic executable
</span></span></code></pre></div><p>So, yes, we can build static binaries even with CGO; but now, I&rsquo;m stuck with <code>musl</code>. I&rsquo;m sure <code>musl</code> is a fine piece of software, but I&rsquo;m not really that familiar with it. I&rsquo;m aware we use it at work since many of our containers in production use Alpine Linux as base. But would I exchange <code>glibc</code> for <code>musl</code> just for static binaries? I&rsquo;m not sure yet. Probably.</p>
<p>As always, tradeoffs.</p>
<br>
<p>Related blogs:</p>
<ol>
<li><a href="/blog/2025-01-22-clockbound-client-go/">AWS ClockBound client for Go</a></li>
<li><a href="/blog/2025-01-27-clockbound-client-go-update/">AWS ClockBound client for Go (update)</a></li>
<li><a href="/blog/2025-02-02-aws-dist-locking/">Distributed locking on AWS (ClockBound)</a></li>
<li><a href="/blog/2025-02-07-aws-cluster-membership/">Cluster membership management on AWS</a></li>
<li>This blog</li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-02-15:/blog/2025-02-15-cgo-static-linked-bin-musl/</guid><link>https://flowerinthenight.com/blog/2025-02-15-cgo-static-linked-bin-musl/</link><pubDate>Sat, 15 Feb 2025 00:00:00 JST</pubDate><title>Static-linked CGO binaries using musl and Zig</title></item><item><description><![CDATA[<p>Continuing from my previous <a href="/blog/2025-02-02-aws-dist-locking/">post</a>, I have now finished porting <a href="https://github.com/flowerinthenight/hedge">hedge</a> to AWS. It&rsquo;s a trimmed-down version for now; only the features directly related to cluster membership are ported. I decided to make a separate repo, called <a href="https://github.com/flowerinthenight/hedge-cb">hedge-cb</a> (in keeping with the <code>-cb</code> theme), instead of updating hedge directly. And it&rsquo;s mainly due to CGO. I didn&rsquo;t really fancy the idea of introducing CGO to hedge as it could break a lot of the CI builds at work. I had to extract the shared protobuf definitions to a separate <a href="https://github.com/flowerinthenight/hedge-proto">repo</a>, however, which is a breaking change to hedge. But at least it&rsquo;s only a version change (v2) as opposed to adding CGO.</p>
<p>What remains now in the short term is figuring out how to make this thing work on EKS. Since the ClockBound daemon is a requirement, EKS pods need to be able to read its shared memory segment backing file which is <code>/var/run/clockbound/shm</code>. I have to mount that file (or maybe the whole directory) to the pod. I&rsquo;m not sure how it will behave though; it&rsquo;s not just some regular file.</p>
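<p>My current thinking for the EKS side (untested; all names below are placeholders) is a plain <code>hostPath</code> mount so the pod can read the daemon&rsquo;s backing file:</p>

```yaml
# Sketch only: mount the ClockBound shared-memory directory (read-only)
# into a pod. Image and names are placeholders; the ClockBound daemon is
# assumed to be running on the node and writing /var/run/clockbound/shm.
apiVersion: v1
kind: Pod
metadata:
  name: hedge-cb-example
spec:
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      volumeMounts:
        - name: clockbound-shm
          mountPath: /var/run/clockbound
          readOnly: true
  volumes:
    - name: clockbound-shm
      hostPath:
        path: /var/run/clockbound
        type: Directory
```

Whether reading the segment through a <code>hostPath</code> mount behaves the same as on the host is exactly the open question above.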
<p>Anyway, as far as testing goes, I&rsquo;ve tried it with both single- and multi-zone AutoScaling Groups and it works as expected. Time will tell how stable it&rsquo;s going to be once deployed to some production environment with larger cluster sizes.</p>
<p>Feature-wise, my initial intention was to get the cluster membership management part working first. That includes the ability to track member nodes dynamically, member-listing, election of a leader node, the ability for any node to send both streaming and one-shot message(s) to the leader node, and broadcast-to-all mechanisms, both one-shot and streaming. And these are all working now. At the moment, I don&rsquo;t have any plans to support the remaining features like distributed semaphores, and storage (both K/V and spill-over). I can use other libraries for those should the need arise.</p>
<p>Finally, I&rsquo;m thinking of porting hedge to both Rust and Zig as well. Looking forward to that.</p>
<br>
<p>Related blogs:</p>
<ol>
<li><a href="/blog/2025-01-22-clockbound-client-go/">AWS ClockBound client for Go</a></li>
<li><a href="/blog/2025-01-27-clockbound-client-go-update/">AWS ClockBound client for Go (update)</a></li>
<li><a href="/blog/2025-02-02-aws-dist-locking/">Distributed locking on AWS (ClockBound)</a></li>
<li>This blog</li>
<li><a href="/blog/2025-02-15-cgo-static-linked-bin-musl/">Static-linked CGO binaries using musl and Zig</a></li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-02-07:/blog/2025-02-07-aws-cluster-membership/</guid><link>https://flowerinthenight.com/blog/2025-02-07-aws-cluster-membership/</link><pubDate>Fri, 07 Feb 2025 00:00:00 JST</pubDate><title>Cluster membership management on AWS</title></item><item><description><![CDATA[<p>After some testing time, I now have a working port of <a href="https://github.com/flowerinthenight/spindle">spindle</a> in AWS. There were some slight changes from the original library to account for some of the differences between Cloud Spanner and PostgreSQL, but not really by much. It&rsquo;s called <a href="https://github.com/flowerinthenight/spindle-cb">spindle-cb</a>, if you&rsquo;re interested. It&rsquo;s still half the battle though; I still have to port <a href="https://github.com/flowerinthenight/hedge">hedge</a> as well before I could really use it for some of the planned projects in my pipeline.</p>
<p>As far as lock services are concerned, spindle-cb (and spindle for that matter) is what you call a coarse-grained lock (as opposed to fine-grained). Most of my use cases of distributed locks are really around application-level orchestrations across multiple nodes. That means using a single lock (or two) within the duration of the application, instead of multiple locks for multiple objects/data, although that&rsquo;s not a limitation of the library by any means. With that said, I don&rsquo;t usually use spindle[-cb] directly, but through a cluster-aware wrapper, like, say, hedge.</p>
<p>At this point, I can&rsquo;t really vouch for spindle-cb&rsquo;s reliability, at least not yet. But since it&rsquo;s really just spindle with a different &ldquo;true time&rdquo; source, I have high hopes. And having used spindle in production for many years without any issues, I&rsquo;m optimistic.</p>
<br>
<p>Related blogs:</p>
<ol>
<li><a href="/blog/2025-01-22-clockbound-client-go/">AWS ClockBound client for Go</a></li>
<li><a href="/blog/2025-01-27-clockbound-client-go-update/">AWS ClockBound client for Go (update)</a></li>
<li>This blog post</li>
<li><a href="/blog/2025-02-07-aws-cluster-membership/">Cluster membership management on AWS</a></li>
<li><a href="/blog/2025-02-15-cgo-static-linked-bin-musl/">Static-linked CGO binaries using musl and Zig</a></li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-02-02:/blog/2025-02-02-aws-dist-locking/</guid><link>https://flowerinthenight.com/blog/2025-02-02-aws-dist-locking/</link><pubDate>Sun, 02 Feb 2025 00:00:00 JST</pubDate><title>Distributed locking on AWS (ClockBound)</title></item><item><description><![CDATA[<p>A week ago I published a short <a href="/blog/2025-01-22-clockbound-client-go/">blog</a> about <a href="https://github.com/flowerinthenight/clockbound-client-go">clockbound-client-go</a>. After some testing, turns out there&rsquo;s an issue in reading the actual time from ClockBound&rsquo;s shared memory segment; it seems to provide only the elapsed time since boot. However, using the Rust client and the FFI bindings produce the correct results. Either there is a problem in the code that reads the shared memory segment, or the SHM contents are wrong. Or, the contents are actually correct, but the expectation is for the implementing client to figure out the bounded time from the available values.</p>
<p>Either way, this is unusable for me at the moment. Instead of spending time debugging this, I wrote another client, called <a href="https://github.com/flowerinthenight/clockbound-ffi-go">clockbound-ffi-go</a>, utilizing the provided <a href="https://github.com/aws/clock-bound/tree/main/clock-bound-ffi">FFI</a> (meaning, requiring CGO). So far, it works well, as expected. I would have preferred not to use CGO but I need a working version as soon as possible. I&rsquo;ll come back to this when I have the time.</p>
<br>
<p>Related blogs:</p>
<ol>
<li><a href="/blog/2025-01-22-clockbound-client-go/">AWS ClockBound client for Go</a></li>
<li>This blog post</li>
<li><a href="/blog/2025-02-02-aws-dist-locking/">Distributed locking on AWS (ClockBound)</a></li>
<li><a href="/blog/2025-02-07-aws-cluster-membership/">Cluster membership management on AWS</a></li>
<li><a href="/blog/2025-02-15-cgo-static-linked-bin-musl/">Static-linked CGO binaries using musl and Zig</a></li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-01-27:/blog/2025-01-27-clockbound-client-go-update/</guid><link>https://flowerinthenight.com/blog/2025-01-27-clockbound-client-go-update/</link><pubDate>Mon, 27 Jan 2025 00:00:00 JST</pubDate><title>AWS ClockBound client for Go (update)</title></item><item><description><![CDATA[<p>I&rsquo;ve written a Go client for <a href="https://github.com/aws/clock-bound">AWS ClockBound</a> called <a href="https://github.com/flowerinthenight/clockbound-client-go">clockbound-client-go</a>. It uses the newer, shared memory segment protocol instead of the older, socket-based protocol. This is a prerequisite library needed to port <a href="https://github.com/flowerinthenight/spindle">spindle</a> (and maybe even <a href="https://github.com/flowerinthenight/hedge">hedge</a>) to AWS (for an upcoming project).</p>
<p>As great a tech as Google&rsquo;s <a href="https://cloud.google.com/spanner/docs/true-time-external-consistency">TrueTime</a> is, there is no available API for it. It is only through Spanner that spindle achieves its locking mechanisms with TrueTime. Surprisingly, it&rsquo;s also the cheapest way to do distributed locking that I&rsquo;ve tried so far, compared to the likes of Redis, Zookeeper, etcd, Consul, etc. Yes, I could do VPC peering between GCP and AWS but it is quite costly at the moment.</p>
<p>I&rsquo;ve heard of AWS&rsquo; <a href="https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/">TimeSync Service</a> before but only in passing. Now that they&rsquo;ve also released <a href="https://aws.amazon.com/blogs/database/introducing-amazon-aurora-dsql/">DSQL</a>, and having seen the papers and blogs about it, turns out that TimeSync is their equivalent to TrueTime, and that there is an API for it!</p>
<br>
<p>Related blogs:</p>
<ol>
<li>This blog</li>
<li><a href="/blog/2025-01-27-clockbound-client-go-update/">AWS ClockBound client for Go (update)</a></li>
<li><a href="/blog/2025-02-02-aws-dist-locking/">Distributed locking on AWS (ClockBound)</a></li>
<li><a href="/blog/2025-02-07-aws-cluster-membership/">Cluster membership management on AWS</a></li>
<li><a href="/blog/2025-02-15-cgo-static-linked-bin-musl/">Static-linked CGO binaries using musl and Zig</a></li>
</ol>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2025-01-22:/blog/2025-01-22-clockbound-client-go/</guid><link>https://flowerinthenight.com/blog/2025-01-22-clockbound-client-go/</link><pubDate>Wed, 22 Jan 2025 00:00:00 JST</pubDate><title>AWS ClockBound client for Go</title></item><item><description><![CDATA[
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20241228-know-your-target.png"    >
</div>
</div>


</figure>
</div>
</div>

<br>
<p>I remember ages ago, a friend asked me what my tips would be to transition from a typical programmer to a &ldquo;systems programmer&rdquo;. And I distinctly remember answering him along the lines of &ldquo;Y&rsquo;know what, I don&rsquo;t really know, maybe just get more experience?&rdquo; And now, being an engineering leader to some extent, I get this question once in a while, but even now I still don&rsquo;t have a good default answer to give. Writing this blog is, in part, me trying to remedy that.</p>
<p>Now, a bit of context, as this topic, I think, is a little vague, and you can tackle it from different perspectives. First, what is even a &ldquo;typical programmer&rdquo;? And what is a &ldquo;systems programmer&rdquo; anyway? I will try to narrow it down to my friend&rsquo;s perspective. He&rsquo;s a web developer, one you would consider &ldquo;full-stack&rdquo;. He likes his programming languages, frameworks, design patterns, and we talk a lot about these things, which are all very interesting. We can pin the &ldquo;typical programmer&rdquo; definition to that. Does it make sense if we categorize junior, associate, mid-level, and senior programmers into this group as well? Maybe. But I also have peers who have been seniors for many, many years, and still enjoy what they do (e.g. maintaining legacy systems), and have no intention or interest in going beyond that; it can be argued that they are also experts in their specific domains.</p>
<p>I also don&rsquo;t think my friend meant those developers who specifically do close-to-hardware systems, like embedded, drivers, OS, robotics, automotive, etc.; I think he refers to the IC path, such as [Senior] Staff+, [Senior] Principal Engineers, etc.</p>
<p>We can also phrase it controversially, like, &ldquo;How do I transition from a [insert language here|backend|frontend] programmer to an [expert generalist|staff+|principal] level programmer?&rdquo;. There&rsquo;s a lot of nuances to unpack in that line but let&rsquo;s just try to leave it like that and interpret it as you would so we can move on. I will also limit my discussions to the technical side of it; the social/political/soft-skills side of it is another topic I intend to explore in a separate blog post.</p>
<p>Aside from checking the literature regarding the topic, I also tried asking colleagues/peers what they think of this question. In no particular order, some interesting replies I got were:</p>
<ul>
<li>Read a lot of papers - I suppose it makes sense to some extent?</li>
<li>Get a master&rsquo;s or PhD - I&rsquo;ll leave this one for you to interpret. There&rsquo;s a spectrum here but I really admire PhD people with really deep levels of expertise in specific sciences as well as software in general.</li>
<li>Job hop (specifically between startups and big tech) - You could also do that, I suppose?</li>
<li>Contribute to OSS - not sure if this works; depends on the OSS in question I suppose?</li>
<li>Join an early-stage startup - if you&rsquo;re [un]lucky.</li>
</ul>
<p>Now, to the topic: my one-line answer would be the title of this blog: <strong>know your target</strong>. Let me explain.</p>
<p><strong>Know your target</strong></p>
<p>When I say &ldquo;target&rdquo;, I don&rsquo;t mean customers or users of your software, but your production target platform: mobile, browsers (if doing web development), architectures such as Intel&rsquo;s x86 (x64), RISC architectures such as ARM, MIPS, and PowerPC, the GPU, cloud and its internals, the network your software is running on, storage systems, data centers, mainframes, HPC systems, IoT, etc. With that said though, I don&rsquo;t really expect us to &ldquo;know&rdquo; all of these targets; we can go a step higher. Most of these targets are abstracted by the OS/software running on top of them: e.g. Linux, Unix, Windows, OSX, Android, IOS, browsers, the JVM, hypervisors, etc. Knowing these target OS/software is much more important than the language you use to program them. Actually, the language being used will be of lesser consequence to the system artifact itself compared to the target platform it will be running on. Therefore, I can further specify &ldquo;know your target&rdquo; to &ldquo;know your target OS&rdquo;.</p>
<p>You might say, &ldquo;Of course! Isn&rsquo;t that obvious/expected?&rdquo; You might be surprised how many programmers have no idea, or don&rsquo;t care, about the target OS their software will be running on. The facilities provided by the language (and the libraries/frameworks around the language ecosystem) are more than enough for them. So when you start talking about processes, threads, concurrency and synchronization at the OS level, or things like IPCs, memory models, I/O, the OS&rsquo; network stack, filesystems, virtualization, user/kernel mode, protection rings, context switching, etc., you&rsquo;ll get the impression that for them, these things are what solution architects and tech leads are there for.</p>
<p>Take Linux for example. It&rsquo;s only one of the &ldquo;targets&rdquo; in our list but Linux is a complex beast in and of itself. A programmer can probably reasonably pick a language up in weeks/months (maybe years for C++, Rust) but he/she might not really &ldquo;know&rdquo; Linux even until retirement. There are so many subsystems in Linux that learning all of them is probably an impossibility. One upside though is that your knowledge of Linux will more or less translate to other OSes without the need to start from scratch.</p>
<p>Understanding your target allows you to leverage what it can and can&rsquo;t do. Its capabilities will have a big influence on how you design your system. Of course, the language matters to some extent but you will be better off approaching it in terms of its capabilities and limitations within the confines of your target.</p>
<p>Your knowledge of targets will also help you better understand how complex systems are being designed: even distributed systems, or systems you&rsquo;re probably using now.</p>
<p>And lastly, debugging. Not language syntax debugging, but those pesky, difficult-to-reproduce, SLA-violating production bugs. I don&rsquo;t think I need to expound about this more.</p>
<p>So, interested in becoming a systems programmer? First, know your target. Your language of choice is just one of your tools to program your target.</p>
<p><strong>Closing</strong></p>
<p>I acknowledge that there are many opinions out there that don&rsquo;t necessarily align with this advice. For instance, a quick check of publicly available literature yields topics about architectural patterns, algorithmic thinking, FIRST and SOLID principles, systems thinking, formal verification, thinking in terms of tradeoffs (which I agree with, by the way), thinking in terms of costs, etc. These are all well and good, but I think that being deliberate about target thinking takes you a long way toward becoming a systems programmer.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-12-28:/blog/2024-12-28-know-your-target/</guid><link>https://flowerinthenight.com/blog/2024-12-28-know-your-target/</link><pubDate>Sat, 28 Dec 2024 00:00:00 JST</pubDate><title>Know your target</title></item><item><description><![CDATA[
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20241018-source-code-value.png"    >
</div>
</div>


</figure>
</div>
</div>

<br>
<p>I had a think about this topic when I saw a recent Ars Technica <a href="https://arstechnica.com/gadgets/2024/10/winamp-really-whips-open-source-coders-into-frenzy-with-its-source-release/">news article</a> about Winamp releasing their source code on GitHub and then deleting it after some backlash from the open source community. We also hear about instances of leaked source code, such as that of <a href="https://www.securityweek.com/new-york-times-responds-to-source-code-leak/">The New York Times</a>, <a href="https://thehackernews.com/2022/12/hackers-breach-oktas-github.html">Okta</a>, <a href="https://www.securityweek.com/intel-confirms-uefi-source-code-leak-security-experts-raise-concerns/">Intel&rsquo;s UEFI source</a>, <a href="https://www.securityweek.com/leaked-github-token-exposed-mercedes-source-code/">Mercedes Benz</a>, and <a href="https://www.bleepingcomputer.com/news/microsoft/lapsus-hackers-leak-37gb-of-microsofts-alleged-source-code/">Microsoft</a>, among others. Until now, I hadn&rsquo;t really put too much thought into this. I mean, from a business perspective, especially for software companies who earn money through code, how valuable do we think our source code really is?</p>
<p>Before I dig deeper though, I think the value lies on a spectrum, and there are valid arguments on both sides. Forget the threat actors&rsquo; goal of obtaining sensitive information such as keys, secrets, and credit card numbers from the source code, although one could argue that this information is also part of the source code itself (and you could also argue it&rsquo;s technically not), making it extremely valuable; I&rsquo;m generally referring to source code as a programming artifact, the text that contains the logic, and secondarily, the documentation that supports it. From this perspective, one expression of value would be whether it matters if the source code is public or not. Think of the Windows and Linux operating systems. One is closed source but the latter is open. Is Linux&rsquo;s source code more &ldquo;valuable&rdquo; because it&rsquo;s publicly available? Or is it the other way around?</p>
<p>With that said, I want to dig a little deeper into an argument that leans toward source code, as presented above, not really being that valuable: <a href="https://en.wikipedia.org/wiki/Peter_Naur">Peter Naur</a>&rsquo;s <a href="https://pages.cs.wisc.edu/~remzi/Naur.pdf">&ldquo;Programming as Theory Building&rdquo;</a>. Although the &ldquo;Backus-Naur Form (BNF)&rdquo; author didn&rsquo;t explicitly say so, his arguments are quite fascinating, and maybe we could learn a thing or two from them.</p>
<p>Peter suggests that programming should be regarded as an activity by which the programmer forms an insight, or theory, of how certain affairs of the world will be handled by, or supported by, code; a kind of mapping of these affairs, both characteristics and details, into program text and any additional documentation. This is in contrast to the more common notion that programming is the production of programmed solutions, including design and implementation, and certain other texts, such as specifications and documentation. It emphasizes that the knowledge possessed by the programmer by virtue of having the theory essentially transcends that which is recorded in the documented products. This transcendence shows up in at least three different areas:</p>
<ul>
<li>The programmer having the theory of the program can explain how the solution relates to the affairs of the world that it helps to handle;</li>
<li>The programmer having the theory of the program can explain why each part of the program is what it is, or is able to support the actual program text with a justification of some sort;</li>
<li>The programmer having the theory of the program is able to respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner.</li>
</ul>
<p>For the first point, he posits that during theory building, a large part of the world won&rsquo;t have a direct mapping to the code. That invisible contextual information can only be made relevant by someone who understands the program theory and its relation to the world. What is interesting is that he further argues that the code and its documentation are insufficient as carriers of the most essential part of any program, its theory. It is something that cannot conceivably be expressed in full; it is inextricably bound to human beings.</p>
<p>The second point talks about the technical decisions and tradeoffs made as to why a specific implementation was chosen. Other implementations would also have been correct, but the programmer with the theory can justify the decisions made at the time, decisions that come from intuition and experience. Here, major decisions such as architecture can be documented, but not everything. It is implied that, as the first point argues, it is impossible to document all these tradeoffs, and doing so probably wouldn&rsquo;t have much value anyway.</p>
<p>The final point, I think, is the most interesting as it involves program modification, which, I think we can all agree, is expected in software. And software modification is closely related to cost. This is the reason why onboarding new programmers takes time; new programmers need to understand the original theory. It is insufficient to only be familiar with the program text and other documentation. What is beneficial, or even required, is the opportunity to work in close contact with the programmer who holds the original theory. This is similar to the education problems of other activities, such as writing or learning a musical instrument. The student doing related activities under suitable supervision and guidance will be better off than the student who only has manuals and written instructions without teacher supervision.</p>
<p>One more point he emphasized is the idea of program life, death, and revival. The initial building of the program is the building of its theory. The program remains alive while the original programmer (or team) remains in charge and retains control of its modifications. Program death means the programmer (or team) who understands the theory is dissolved, or is not in charge anymore. Dead software can continue to provide value during its operation; the actual state of death only becomes visible when demands for modification cannot be answered or addressed anymore. Finally, revival simply means the rebuilding of its theory (not necessarily the original) by new programmers. This is why it is often faster to write new code than to modify existing code: writing new code means rebuilding the whole theory, but now it will be based on the intuition and perception of the world of the new programmer, without being beholden to the original theory.</p>
<p>Philosophy aside, what can we learn from these arguments? Here are some proposals:</p>
<ul>
<li>It is a good idea to retain the original programmers for as long as possible. (Perhaps easier said than done due to a lot of factors.)</li>
<li>Onboarding is slow and expensive. Prioritize giving new members access to the original programmers during the transition.</li>
<li>In general, keeping programmers longer is more cost effective than high turnover.</li>
<li>Documentation, while unable to capture the whole program theory, is still helpful in providing context, however minimal, in theory knowledge transfer.</li>
</ul>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-10-18:/blog/2024-10-18-source-code-value/</guid><link>https://flowerinthenight.com/blog/2024-10-18-source-code-value/</link><pubDate>Fri, 18 Oct 2024 00:00:00 JST</pubDate><title>How valuable is your source code?</title></item><item><description><![CDATA[
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20241001-tech-debt.png"    >
</div>
</div>


<figcaption class="figure-caption mt-2 text-center" >An LLM creating technical debt with the spelling.</figcaption>

</figure>
</div>
</div>

<br>
<p>In this post, I want to talk a little bit about technical debt. In a recent conversation I had with one of our engineers, we sort of inadvertently enumerated, off the top of our heads, some of the more pressing technical debts in our current infrastructure, which, now that I think about it, add up to quite a lot. But what is &ldquo;<strong>technical debt</strong>&rdquo;, really?</p>
<p><a href="https://www.techopedia.com/definition/27913/technical-debt">Techopedia</a> defines it as &ldquo;a programming concept that reflects the extra work required when developers choose an easy short-term solution over the best long-term approach. Like financial debt, it incurs interest in the form of increased maintenance costs and complexity over time.&rdquo; I think I can get behind this definition. And while there are <a href="https://www.investopedia.com/articles/pf/12/good-debt-bad-debt.asp">good debts and bad debts</a> in finance, I think technical debt is mostly associated with a feeling of frustration, or resignation. I don&rsquo;t think I&rsquo;ve ever come across a situation where technical debt is talked about as something akin to &ldquo;good debt&rdquo;. It&rsquo;s always &ldquo;bad code&rdquo;, &ldquo;bad design&rdquo;, &ldquo;bad choice of framework&rdquo;, etc. But I think this association is a little precarious. Developers might assume that the previous developers were just rubbish at their jobs, without considering the constraints and tradeoffs made at the time. Product people might think that developers use tech debt as an excuse to extend the delivery schedule by a month just so they can play around with the stuff they care about more than the feature they&rsquo;re being asked to build. Engineering leaders might assume that their developers are spending too much time debating patterns and the &ldquo;correct&rdquo; way of doing things instead of agreeing on tradeoffs and carrying on.</p>
<p>Another more interesting trap with that association is that if we just write good code, then there shouldn&rsquo;t be any tech debt. But in reality, even a well-engineered solution will eventually become tech debt over time, especially when the original implementers leave the company, bringing that undocumented, contextual knowledge with them. And things like changing technologies, abandoned open source solutions, software license changes, leadership changes, budget changes, even bit rot, can render an almost-perfect implementation a tech debt in no time.</p>
<p>One common thing I&rsquo;ve observed teams do to mitigate tech debt is to conduct tech debt-only sprints, or tech debt hackathons. While well-intentioned and practical, it sometimes encourages a culture of postponing good design upfront just to honor the initial, usually very tight, schedule, instead of negotiating and meeting in the middle. And while I&rsquo;ve experienced some success with this approach, refactors and maintenance work are rarely rewarded. They&rsquo;re invisible, thankless, boring, and soul-crushing, especially if the code is not your own. Now, I&rsquo;m speaking in general terms here; I know a bunch of people who prefer maintenance work to the endless pumping out of new features. And more power to them. But I think this is not uncommon. Companies rarely reward tech debt maintenance, especially when there are a lot of initiatives and new features that drive the business forward.</p>
<p>So how to deal with it then? Unfortunately, I don&rsquo;t really have a defined answer to this. At <a href="https://www.alphaus.cloud/">Alphaus</a>, I advise our engineers to include tech debt (if there&rsquo;s a need to deal with it) in their estimations, and think of it as just that: part of the solution. I don&rsquo;t encourage separate, isolated efforts just to tackle tech debt, although there are times when we just admit defeat, deal with it, and then move on. I like to think of it as how we deal with financial debt, really; we pay in chunks monthly until it&rsquo;s fully paid. It&rsquo;s just part of the monthly budget. And occasionally, you get some extra cash and so you pay a little bit more.</p>
<p>On a more systematic level though, tech debt is really part and parcel of software development, at least in the current state of things. There are schedules to be met, and customer requests to be fulfilled. Whatever method(s) one employs to formalize tech debt management, which I think really depends on the company size, some points need to be considered:</p>
<ul>
<li>Metrics for visibility and understanding of systems;</li>
<li>Ownership at the higher level;</li>
<li>Alignment (and agreement) on the value to the business if a specific tech debt is addressed;</li>
<li>Understanding of cost and associated risks when not addressed; and</li>
<li>Metrics to measure success (or completion).</li>
</ul>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-10-01:/blog/2024-10-01-tech-debt/</guid><link>https://flowerinthenight.com/blog/2024-10-01-tech-debt/</link><pubDate>Tue, 01 Oct 2024 00:00:00 JST</pubDate><title>On technical debt</title></item><item><description><![CDATA[<p>In my <a href="/blog/2024-08-20-thoughts-on-newer-system-languages/">previous post</a> about the new systems programming languages, I mentioned I was considering <a href="https://ziglang.org/">Zig</a> as a potential complementary systems language to our main one, which is Go(lang). Well, for the past month or so, in my spare time, I tried writing something more substantial in it to understand the language better. For some time now, I&rsquo;ve been &ldquo;itching&rdquo; to write something similar to Hashicorp&rsquo;s <a href="https://github.com/hashicorp/memberlist">memberlist</a> library, but in a lower-level language for performance, a smaller footprint, and minimal network load. Now, I&rsquo;ve <a href="/blog/2023-04-28-hedge-memberlist/">used</a> <code>memberlist</code> before, and it is a superb piece of code, but I wanted something that supports a consistent leader across the whole fleet. It&rsquo;s a requirement for a system I plan on building in the near future (more on this in a later post). My top choices were C, Rust, and Zig, and as I said, I took a liking to Zig due to its promised simplicity, so I wrote it in Zig. The project is called <code>zgroup</code>, and you can check it out on <a href="https://github.com/flowerinthenight/zgroup">GitHub</a> if you&rsquo;re interested. It&rsquo;s still similar to <code>memberlist</code> but with the added capability of electing a leader across the whole group. It uses both the <a href="https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf">SWIM Protocol</a>, which <code>memberlist</code> uses, and <a href="https://raft.github.io/raft.pdf">Raft</a>&rsquo;s leader election algorithm.</p>
<p>Anyway, so what do I think of Zig? I think it has a lot of promise. It still feels incomplete, which it is, especially the standard library, though that is of course expected of pre-v1.0 software. But coming from C and Go, I think Zig is the sweet spot for me between C++ and Rust. I don&rsquo;t think I will be introducing it to <a href="https://www.alphaus.cloud/">Alphaus</a> anytime soon though; maybe when it&rsquo;s tagged v1.0+, which I think is still years away.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-09-30:/blog/2024-09-30-on-zig/</guid><link>https://flowerinthenight.com/blog/2024-09-30-on-zig/</link><pubDate>Mon, 30 Sep 2024 00:00:00 JST</pubDate><title>Thoughts on Zig</title></item><item><description><![CDATA[<p><strong>TL;DR</strong>: To my fellow system builders, don&rsquo;t dismiss it but understand how it works. As the saying goes, &ldquo;a tool is only as good as the hands that wield it&rdquo;. As a software craftsman, your tools are important. And equally so, are your skills in using them effectively.</p>
<p>~~</p>
<p>There&rsquo;s no escaping GenAI nowadays, is there? I&rsquo;m sure you&rsquo;ve seen the full spectrum of its effects by now; from total naysayers to skeptics, to cautious optimists, to proponents and fanatics, to full-blown doom-bringers. In the cloud space, the big three cloud providers are &ldquo;all in&rdquo; on AI, as you can see in their headlines. Furthermore, there are thousands of AI-powered startups cropping up left, right, and center, with massive venture capital and valuations. If you&rsquo;re not in the thick of these things, it gives the feeling that if you don&rsquo;t invest in, or apply, AI in your business or products, it&rsquo;s just a matter of time before you&rsquo;re left behind and eventually relegated to the pages of failure in history. It&rsquo;s a scary thought. The <a href="https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt">FUD</a> is real.</p>
<p>According to Gartner&rsquo;s Hype Cycle for Artificial Intelligence, 2024, AI is now past the &ldquo;Peak of Inflated Expectations&rdquo;, although the hype about it still continues. So somewhere around here:</p>
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><a href="https://www.seldon.io/the-significance-of-ai-engineering-in-the-gartner-hype-cycle-for-artificial-intelligence-2024-report">
<div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240825-ai-hype.png"   style="height: auto; width: 90%"   >
</div>
</a>
</div>


</figure>
</div>
</div>

<p>That means the &ldquo;Trough of Disillusionment&rdquo;, or in other words, the time of it not being &ldquo;cool&rdquo; anymore, is just around the corner (could still be years though, who knows). That&rsquo;s the phase where you&rsquo;re going to get the &ldquo;You&rsquo;re still doing AI? The world has moved on to Quantum Computing now! Catch up, or you&rsquo;re gonna be left behind!&rdquo; sort of feedback.</p>
<p>Remember Web3? <a href="https://news.crunchbase.com/venture/global-funding-data-analysis-ai-eoy-2023/#Web3%20and%20consumer%20tumble">According to Crunchbase</a>, the uncontrolled hype of 2021 petered out dramatically, with funding declining from $28 billion to $7.6 billion (a 73% decline) in 2023. $7.6 billion is no slouch by any means, but you&rsquo;d better have a credible business model to survive that. With AI, you can already see signs of the brakes being applied, with <a href="https://www.goldmansachs.com/insights/top-of-mind/gen-ai-too-much-spend-too-little-benefit">reports</a> saying it might not perform well enough to justify the cost.</p>
<p>So what does that mean for us, system builders? I&rsquo;m not an economist, nor can I predict the future, but I can form my opinion around what&rsquo;s going on. Mind you, this is just my opinion; I&rsquo;m not even representing the opinion of my employer, <a href="https://www.alphaus.cloud/">Alphaus</a>. With that disclaimer out of the way, I think I&rsquo;m somewhere in the middle; cautiously optimistic about AI and its uses. AI has been compared to the internet revolution of the &rsquo;90s, and rightly so, as a lot of the intricacies of AI still feel like &ldquo;magic&rdquo; to most people. And I think I can understand this sentiment as I&rsquo;m in the generation that saw the rise of the internet and social media during their prime years, and a lot of it felt like magic as well. Regardless of the dot-com crash of the early 2000s, the internet, just like the smartphone, is now here to stay. We don&rsquo;t really think about them that much anymore; they&rsquo;re just, you know, part of our daily lives. And I think AI will be the same. It will find its place, whether niche or mainstream, in our daily lives in some form or another. Whether that&rsquo;s in my lifetime, I don&rsquo;t know.</p>
<p>We use AI at Alphaus. We use forecasting models to help our customers <a href="https://www.alphaus.cloud/en/ripple">forecast budgets and their future cloud spending</a> based on their historical data. We use LLMs as part of our customer support. Our engineers use Copilot to aid them in their day-to-day coding tasks. I use both ChatGPT and Gemini for notetaking, summarization, and translation. And I know our CEO, <a href="https://www.linkedin.com/in/hajimehirose/">Hajime Hirose</a>, uses a lot of LLMs as well. On top of that, I read a lot of the papers behind AI. It&rsquo;s something I enjoy doing. As the CTO of a startup, it is part of my job to assess technology and leverage it to further our business. And that includes AI, among other things.</p>
<p>If you&rsquo;re a software engineer who is currently employed, or looking for a job, or looking for something interesting to learn/build, or a technologist who is thinking of starting a company, I think it&rsquo;s wise to consider AI. Whether we like it or not, we are part of this corporate, capitalistic world, and AI is the current game. And we have to play the game to survive.</p>
<p>Do I think AI will replace developers? To some extent, yes. And not just developers, but other professions as well. But I think it will also create new jobs and positions at the same time, just like the internet and other transformative tech revolutions before it. I would imagine the industrial revolution replaced a lot of professions but also created completely new ones. Think assembly lines, steel-making, steam-powered automation and mass production, etc., which didn&rsquo;t really exist before. Anyway, think of it this way: software engineering is about solving problems, not just coding. Coding is probably the easiest part of our job. LLMs might be able to replace that bit but, without <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">sentience or consciousness</a>, the social/human part of our job is still on us.</p>
<p>Do I think GenAI will cause human extinction? Or replace us humans? Well, it&rsquo;s like asking a hunter-gatherer during the Stone Age if he thinks a jet engine would render his bow and arrow useless. He&rsquo;ll probably just reply, &ldquo;You mad? What are you on about?&rdquo;.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-08-25:/blog/2024-08-25-on-genai/</guid><link>https://flowerinthenight.com/blog/2024-08-25-on-genai/</link><pubDate>Sun, 25 Aug 2024 00:00:00 JST</pubDate><title>On Generative AI (GenAI)</title></item><item><description><![CDATA[<p>I&rsquo;m talking about the new crop of systems programming languages that advertise themselves as better replacements for C and/or C++: <a href="https://www.rust-lang.org/">Rust</a>, <a href="https://ziglang.org/">Zig</a>, <a href="https://dlang.org/">D</a>, <a href="https://odin-lang.org/">Odin</a>, <a href="https://nim-lang.org/">Nim</a>, etc. I&rsquo;m using the word &ldquo;new&rdquo; loosely here as Rust and Zig, for example, are almost a decade old now. The topic of systems programming has been on my radar (again) recently at work due to our attempts at improving the performance of some of the more critical parts of our stack. At <a href="https://www.alphaus.cloud/">Alphaus</a>, we use Go as our main programming language and as much as I like Go, there are still areas in our infrastructure that could be served better with non-GC languages.</p>
<p>This post will not be an X vs Y review, or at least, not intentionally. Programming languages, I think, are one of those things that developers tend to get attached to, alongside editors and OSes (or distros in the Linux world). &ldquo;Here be dragons,&rdquo; as they say. But I want to share my thoughts from the perspective of someone who can introduce a new programming language to teams, instead of someone who wants to make a case and convince somebody who can do so.</p>
<p>Before I joined, Alphaus was a PHP shop. Around early 2018, after about half a year, I introduced Go as our main programming language but only for new services; no rewrites. I didn&rsquo;t know Go at that time; nobody in the company did. But why Go? I could have chosen Java, or C#. Seven years on, I consider that a good decision. Did I know it would be? Of course not. Java probably would have been fine. But thinking about it now, I would attribute it mainly to Go&rsquo;s simplicity.</p>
<br>
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><a href="https://go.dev/">
<div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240820-golang.png"   style="height: auto; width: 90%"   >
</div>
</a>
</div>


<figcaption class="figure-caption mt-2 text-center" >From Go&rsquo;s homepage.</figcaption>

</figure>
</div>
</div>

<br>
<p>Before Alphaus, my background was embedded, low-level systems. Over the years of working with several programming languages, I&rsquo;ve come to appreciate the simplicity of C. Even though, experience-wise, I&rsquo;ve probably written more C++ than C, there&rsquo;s something in C&rsquo;s simplicity that I yearn for when working with more complex languages. And I think that translated to my choice of Go, really. It was the C equivalent in the high-level language world for me (apart from the fact that it&rsquo;s created by C people). So you can say there&rsquo;s definitely a bias there on my part. And I think it&rsquo;s true. We always say (in software engineering circles) &ldquo;use the right tool for the job&rdquo;, but the &ldquo;right&rdquo; part is the tricky bit. In my case, &ldquo;right&rdquo; was ultimately based on my own biases, my own preferences. Although in hindsight, it wasn&rsquo;t purely selfish. There was a big push for Go when going cloud native at that time as well. And Docker and Kubernetes were also written in Go, which I think also had an influence, as we based our stack on these tools too. And our product was in infrastructure (we pivoted later), so there&rsquo;s definitely an alignment there.</p>
<p>Go&rsquo;s simplicity, in my mind, also meant easy recruitment. I said nobody in the company knew Go at that time, but it wasn&rsquo;t really a big deal to me as I was quite sure the engineering team would pick it up and be productive with it in no time. The same goes for recruitment; I can hire non-Go engineers knowing that they can learn Go in about a week and then start contributing. With that said though, simplicity translating to quick productivity is not really the deciding factor for me. It only applies when you don&rsquo;t have people who know the language. When you have a team that is already proficient with, say, C++, they will be productive with it.</p>
<p>You see, I could go on enumerating the pros of Go by mentioning its merits as a language, but what I&rsquo;m really trying to say here is that the &ldquo;objective&rdquo; (or &ldquo;right&rdquo;) reasons why I should choose Go only had a slight influence on me. Even without all of these, I&rsquo;m sure I&rsquo;d still have ended up with Go, mainly because of my bias toward C. I&rsquo;m sure there are a lot of engineering leaders out there who are more objective or data-driven than me but again, programming languages are one of those things we get emotional about. And I&rsquo;m sure I&rsquo;m not alone in thinking this way.</p>
<p>Now back to the present. At the moment, I&rsquo;m mulling over introducing a systems language to the company. Not a replacement for Go, but a complement to Go. Of all the choices listed above, and based on the thought process I just laid out, it&rsquo;s obvious that my choice would be Zig. It&rsquo;s advertised as &ldquo;the better C&rdquo;, while Rust is &ldquo;the better C++&rdquo;. But I&rsquo;m not quite sure yet.</p>
<p>You could argue that it&rsquo;s impractical, maybe borderline irresponsible, of me to not choose Rust. It should be the &ldquo;right&rdquo; choice. It&rsquo;s memory-safe, more mature, already included in the Linux kernel, proven in production by some of the big names in tech, etc. And more importantly, some of the people I follow (and look up to) are also advocating for it.</p>
<br>
<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><a href="https://www.rust-lang.org/">
<div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240820-rust.png"   style="height: auto; width: 90%"   >
</div>
</a>
</div>


<figcaption class="figure-caption mt-2 text-center" >From Rust&rsquo;s homepage.</figcaption>

</figure>
</div>
</div>

<br>
<p>I know Rust a bit. I&rsquo;ve dabbled with it outside work, although I&rsquo;m not really proficient since I haven&rsquo;t written any production code in it. It is a complex language. Not C++-level complex, but still very complex. I believe it is &ldquo;the better C++&rdquo; and if I were still working in C++ now, I&rsquo;d choose it as well. But I&rsquo;m not. And my bias is stronger now than ever before due to C, and now, Go.</p>
<p>I&rsquo;m actually looking at Zig now. It resonates with me as it still feels &ldquo;C&rdquo;-ish, but a lot better. I like having explicit control over memory allocations. It&rsquo;s probably not as safe as Rust, but it provides mechanisms to make it much safer than C. And it&rsquo;s still a simple language overall. Having said that, no decisions yet. It&rsquo;s only been 2-3 months. And I&rsquo;m not in a hurry. I think the only real blocker at the moment is that it&rsquo;s not tagged 1.x yet.</p>
<br>




























    




<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><a href="https://ziglang.org/">
<div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240820-zig.png"   style="height: auto; width: 90%"   >
</div>
</a>
</div>


<figcaption class="figure-caption mt-2 text-center" >From Zig&rsquo;s homepage.</figcaption>

</figure>
</div>
</div>

<br>
<p>This post has been all over the place. In the end, you might not care what my choice will be or what justifications I have for it. But here&rsquo;s another way of looking at it: if you ever find yourself trying to convince your leader(s) to switch to your preferred programming language, basing your arguments purely on objective merits might not hold much water. You&rsquo;re probably fighting a losing battle from the start.</p>
<p>Or it could still work, you know. Who knows.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-08-20:/blog/2024-08-20-thoughts-on-newer-system-languages/</guid><link>https://flowerinthenight.com/blog/2024-08-20-thoughts-on-newer-system-languages/</link><pubDate>Tue, 20 Aug 2024 00:00:00 JST</pubDate><title>Thoughts on the newer systems programming languages</title></item><item><description><![CDATA[<p>One of <a href="https://www.alphaus.cloud/">Alphaus</a>&rsquo; data processing pipelines ingests around 10TB of client financial data per day. The processing engine is running on <a href="https://cloud.google.com/kubernetes-engine">GKE</a> with around 80-100 (depending on what week of the month) pods sharing the total workload. Each pod has around 10GB of memory and 30GB of attached storage. The consistency of this load allowed us to purchase enough <a href="https://cloud.google.com/docs/cuds">Committed Use Discounts (CUDs)</a> for the underlying VMs to save on compute costs.</p>
<p>These pod resource limits are enough about 80% of the time. However, since late last year, some of the accounts have datasets that are way, way beyond these limits, causing persistent <a href="https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-container-s-memory-limit"><strong>OOMKilled</strong></a> events.</p>
<p>Our first stop-gap solution was to increase the memory limit. The trouble was, even 20GB+ of memory wasn&rsquo;t enough for some of the input datasets. On top of this, GKE&rsquo;s cluster autoscaler also started provisioning VM sizes we don&rsquo;t have CUDs for. Suffice it to say, it increased our monthly cloud spend by about <strong>20%</strong> while delaying the overall processing time due to pods crashing (and restarting).</p>
<p>We tried other solutions. One was using local files, which required increasing the size of the attached storage. While cost-effective, the performance drop was significant, mainly because most of the datasets that were well within the memory limit were now moved to disk as well. We also tried offloading to the database we currently use, which turned out to be worse in terms of both performance and cost. We also tried our cache layer (named <a href="https://github.com/alphauslabs/jupiter">Jupiter</a>), which was very performant but prohibitively expensive.</p>
<p>Enter <strong>Spill-over Store (SoS)</strong>, our current solution. Inspired by <a href="https://ignite.apache.org/">Apache Ignite</a>&rsquo;s design, the idea is to stitch together the already available memory and storage across the running pods, providing an ad-hoc, on-demand storage for really big datasets.</p>




























    




<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/sos.png"   style="height: auto; width: 90%"   >
</div>
</div>


<figcaption class="figure-caption mt-2 text-center" >Illustration of hedge&rsquo;s Spill-over Store.</figcaption>

</figure>
</div>
</div>

<p>From the image above, the pod assigned to load a huge dataset will exhaust its local memory first, then &ldquo;spill over&rdquo; to its local disk, then to another pod&rsquo;s memory (using gRPC streaming), then to that pod&rsquo;s disk, and so on. Thus, our example 100GB dataset will utilize around 4 pods in total within the cluster.</p>
<p>This solution allowed us to revert to our original pod resource limits. Both disk and network performance are acceptable (we don&rsquo;t use GCP&rsquo;s Tier 1 networking) and still within our SLA, as the solution only applies to about 20% of the ingestion pipeline. The majority still uses local in-memory processing.</p>
<p>As Alphaus grows (and therefore ingests more and more data) and serves more clients, maybe we will eventually end up using <strong>Apache Ignite</strong> or some other off-the-shelf distributed solution, but as of now, <strong>SoS</strong> works. With that said, if you have a cost-effective (and better) product/solution in mind, please feel free to contact us. We&rsquo;d love to talk.</p>
<p>Finally, you can find <strong>SoS</strong>&rsquo;s implementation <a href="https://github.com/flowerinthenight/hedge/blob/main/sos.go">here</a> (if you&rsquo;re interested).</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-07-24:/blog/2024-07-24-spillover-store/</guid><link>https://flowerinthenight.com/blog/2024-07-24-spillover-store/</link><pubDate>Wed, 24 Jul 2024 00:00:00 JST</pubDate><title>How Alphaus saves on costs by ‘stitching storage’</title></item><item><description><![CDATA[<p>&ldquo;Back-of-the-envelope calculations&rdquo;, &ldquo;napkin math&rdquo;, &ldquo;latency numbers every programmer should know&rdquo; - yes, those numbers that usually come up in system design interviews. This came into my periphery again while looking at RDMA latency checks and benchmarks with P4d instances in AWS (using SoftRoCE). As an old-timer with (most likely) outdated ideas about system design latency numbers, and although I&rsquo;m quite familiar with Jeff Dean&rsquo;s <a href="https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf">&ldquo;Numbers Everyone Should Know&rdquo;</a> approximations, I noticed that, in a jiffy, I&rsquo;m still (unconsciously) subscribed to the idea that disk access is most definitely faster than network. Somewhere along the lines of L* cache &gt; memory &gt; disk &gt; network. Come to think of it, I&rsquo;m not really sure why. It&rsquo;s probably because, pre-cloud, I didn&rsquo;t have access to high-tier network bandwidth, so my experience with crappy networks has been etched in my mind for the longest time. This usually shows when I do quick, back-of-my-head latency calculations for a potential system design. Of course, in the end, benchmarking has the last say on these numbers, but having a rough idea of the performance numbers pre-implementation is always helpful.</p>




























    




<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240628-numbers-everyone-should-know.png"   style="height: auto; width: 70%"   >
</div>
</div>


<figcaption class="figure-caption mt-2 text-center" >From Jeff Dean&rsquo;s slides.</figcaption>

</figure>
</div>
</div>

<p>The idea of 50Gbps, 100Gbps, or even 200Gbps network speeds, and now P4d&rsquo;s advertised 400Gbps (yes, 400!), somehow didn&rsquo;t shake my internal insistence that SSDs, especially NVMe drives, even the M.2 ones, should be faster. Now, there might be enterprise/military/space/etc.-grade SSDs that I&rsquo;m not aware of, or don&rsquo;t have access to, with stupendous read speeds, but I&rsquo;ve never heard of one, so let&rsquo;s leave those for now. Some of the faster M.2 SSDs can reach ~15GB/s read speeds, which is quite impressive, but nowhere near P4d&rsquo;s numbers. I still can&rsquo;t wrap my head around it. It&rsquo;s the sort of figure you find listed on supercomputers with their custom InfiniBand interconnects.</p>
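<p>Putting those two numbers side by side makes the gap concrete. The napkin calculation below uses the advertised peaks only (400Gbps NIC vs. a ~15GB/s M.2 NVMe read); sustained real-world throughput will of course be lower on both sides.</p>

```go
package main

import "fmt"

// Napkin math: how long to move a 100 GB dataset over a 400 Gbps
// NIC versus reading it from a ~15 GB/s M.2 NVMe SSD. Advertised
// peak rates only, ignoring protocol overhead and queueing.
func transferSeconds(bytes, bytesPerSec float64) float64 {
	return bytes / bytesPerSec
}

func main() {
	const gb = 1e9
	dataset := 100 * gb
	nic := 400e9 / 8 // 400 Gbps -> 50 GB/s
	ssd := 15 * gb   // ~15 GB/s sequential read

	fmt.Printf("network: %.1fs\n", transferSeconds(dataset, nic)) // 2.0s
	fmt.Printf("ssd:     %.1fs\n", transferSeconds(dataset, ssd)) // 6.7s
}
```

<p>On paper, the NIC wins by more than 3x, which is exactly the inversion of the old L* cache &gt; memory &gt; disk &gt; network intuition.</p>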
<p>Anyway, after scouring the rabbit hole that is Reddit for discussions, debates (and fights) about the correctness of these numbers, I came across sirupsen&rsquo;s <a href="https://github.com/sirupsen/napkin-math">napkin-math</a> repo, which I think is a fascinating piece of information. If you&rsquo;re a system designer, you really should check it out.</p>




























    




<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20240628-numbers-net-ssd.png"   style="height: auto; width: 80%"   >
</div>
</div>


<figcaption class="figure-caption mt-2 text-center" >From napkin-math repo.</figcaption>

</figure>
</div>
</div>

<p>Again, it looks like the network can indeed be faster than disk access. I really need to recalibrate my internal latency numbers table to accommodate these more modern hardware capabilities. The good thing is, at least, this sort of information is now readily available anytime. With that said, I still believe system designers should be able to roughly determine a system&rsquo;s overall performance behavior pre-implementation using napkin-math latency calculations.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-06-28:/blog/2024-06-28-revisiting-latency-numbers/</guid><link>https://flowerinthenight.com/blog/2024-06-28-revisiting-latency-numbers/</link><pubDate>Fri, 28 Jun 2024 00:00:00 JST</pubDate><title>Revisiting latency numbers</title></item><item><description><![CDATA[<p>It was simply because the Jekyll theme I was using before, a modified version of <a href="https://github.com/gchauras/much-worse-jekyll-theme">much-worse-jekyll-theme</a>, doesn&rsquo;t build anymore. Actually, no, that&rsquo;s not quite right: I updated some of its <code>npm</code> dependencies, which caused it to stop building. It uses quite an old version of Ruby, as do most of its dependencies, so I&rsquo;ve been receiving a lot of vulnerability warning emails from GitHub. Instead of potentially spending a lot of time just fixing the build, I figured I&rsquo;d be better off moving to a ready-made, more modern theme. I knew about <a href="https://gohugo.io/">Hugo</a>, and a quick search of its free themes led me to Will Faught&rsquo;s <a href="https://github.com/willfaught/paige">Paige</a> theme, which I quite like for its very simple look.</p>
<p>The migration was actually quite quick. I only copied my previous <code>_posts/</code> folder into the new <code>content/blog/</code> folder and added some of the needed front matter which was simply done using good ol&rsquo; <code>grep</code> + <code>awk</code>.</p>
<p>While I really liked the look and feel of my old site, this new one is also quite good. And Netlify&rsquo;s support for Hugo is quite solid as well so less build headaches for me. Hopefully, for many years to come.</p>
<p>Thank you OSS and free services.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-06-27:/blog/2024-06-27-new-look/</guid><link>https://flowerinthenight.com/blog/2024-06-27-new-look/</link><pubDate>Thu, 27 Jun 2024 00:00:00 JST</pubDate><title>New site look</title></item><item><description><![CDATA[<p>I shared a simple piece of code for getting a process&rsquo; memory usage in Linux. It&rsquo;s called <a href="https://github.com/flowerinthenight/memx"><code>memx</code></a>. It&rsquo;s Linux-specific only as it reads the <a href="https://en.wikipedia.org/wiki/Proportional_set_size">proportional set size (PSS)</a> data from either <code>/proc/{pid}/smaps_rollup</code> (if present) or <code>/proc/{pid}/smaps</code> file. I&rsquo;ve used this piece of code many times at work. We use memory-mapped files extensively in some of our services and this is how we get more accurate results. Very useful in debugging <code>OOMKilled</code> events in k8s.</p>
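<p>The gist of the technique is just summing the <code>Pss:</code> entries from the smaps-format files. The minimal sketch below parses that format (function name and sample are mine, not <code>memx</code>&rsquo;s actual API); with <code>smaps_rollup</code> there is a single pre-summed entry, while plain <code>smaps</code> has one per mapping, so summing handles both.</p>

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumPssKB sums all "Pss:" entries from /proc/{pid}/smaps-style
// content and returns the total in kB. Lines look like:
//   Pss:          1024 kB
func sumPssKB(smaps string) (int64, error) {
	var total int64
	sc := bufio.NewScanner(strings.NewReader(smaps))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Pss:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			total += kb
		}
	}
	return total, sc.Err()
}

func main() {
	// In real use, read os.ReadFile("/proc/self/smaps_rollup")
	// (falling back to smaps) instead of this sample.
	sample := "Rss:  2048 kB\nPss:  1024 kB\nPss:  512 kB\n"
	kb, _ := sumPssKB(sample)
	fmt.Println(kb, "kB") // 1536 kB
}
```

<p>PSS divides each shared page&rsquo;s cost among the processes mapping it, which is why it gives saner numbers than RSS for memory-mapped files.</p>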
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-05-30:/blog/2024-05-30-memx-memory-usage-linux/</guid><link>https://flowerinthenight.com/blog/2024-05-30-memx-memory-usage-linux/</link><pubDate>Thu, 30 May 2024 00:00:00 JST</pubDate><title>memx - Get process’ memory usage (Linux)</title></item><item><description><![CDATA[<p>I posted a <a href="https://labs.alphaus.cloud/blog/2024/05/13/the-divs-model/">blog</a> introducing the <strong>DIVS</strong> model, the process we use at <a href="https://www.alphaus.cloud/">Alphaus</a>, the startup I work for. Check it out <a href="https://labs.alphaus.cloud/blog/2024/05/13/the-divs-model/">here</a>.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-05-13:/blog/2024-05-13-the-divs-model/</guid><link>https://flowerinthenight.com/blog/2024-05-13-the-divs-model/</link><pubDate>Mon, 13 May 2024 00:00:00 JST</pubDate><title>The DIVS model</title></item><item><description><![CDATA[<p>I recently uploaded a tool to GitHub that wraps the <code>kubectl get events -w</code> command for watching <code>OOMKilled</code> events in Kubernetes. It&rsquo;s called <a href="https://github.com/flowerinthenight/oomkill-watch"><code>oomkill-watch</code></a>. You can check out the code <a href="https://github.com/flowerinthenight/oomkill-watch">here</a>. You might find this useful.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2024-05-03:/blog/2024-05-03-oomkill-watch/</guid><link>https://flowerinthenight.com/blog/2024-05-03-oomkill-watch/</link><pubDate>Fri, 03 May 2024 00:00:00 JST</pubDate><title>oomkill-watch - A tool to watch OOMKilled events in k8s</title></item><item><description><![CDATA[



























    




<div class="paige-figure ">
<div class="align-items-center d-flex  h-100 justify-content-center ">
<figure class=" mb-0" >
<div class="d-flex justify-content-center text-center"><div class="paige-image">
<img   class="img-fluid "  crossorigin="anonymous"    referrerpolicy="no-referrer"  src="https://flowerinthenight.com/assets/20230809-ship-orgchart.png"    >
</div>
</div>


</figure>
</div>
</div>

<br>
<p>You&rsquo;ve probably heard the warning &ldquo;Don&rsquo;t ship the org chart&rdquo;, common in product development circles. I always thought of it as synonymous with <a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway&rsquo;s Law</a>, which states that organizations design systems that mirror their own communication structure. In my experience, I&rsquo;ve come to believe this is true. Whether you like it or not, it is an eventuality. At some point I thought that to be an effective solutions architect, you have to be an &ldquo;org architect&rdquo;. Or that in order to achieve a certain system architecture, you&rsquo;re better off rearranging your org structure than trying to influence cross-functional architects or senior engineering leadership to agree with you. It may work for a time, especially at the start, but eventually, as engineers come and go, the org-mirrored design will out. But I think there&rsquo;s more nuance to this than what&rsquo;s obvious.</p>
<p>There are several types of org structures, but I will cover only what&rsquo;s relevant to my situation, which is startups. This might be anecdotal, but among peers, the most common types fall under two broad classifications: bigger teams with fewer boundaries, and smaller, often siloed teams. Teams with fewer boundaries tend to move slower because more people need to communicate for alignment, but they produce a more coherent product. Smaller teams, on the other hand, move faster, but their outputs are more difficult to integrate, resulting in an inconsistent product experience. Personally, I don&rsquo;t have a strong opinion on which is better. I think there are phases in a startup where one works better than the other. But this requires you to understand what you want to build in the first place, and then build your org chart around that.</p>
<p>There&rsquo;s also the case of building your org structure around product lines (if you have multiple products) as opposed to functional teams covering multiple products. Product-based org structures might promote autonomy and quicker response to industry changes but can easily lead to system duplication (wasted resources). A functional structure, however, encourages specialization and focus but inhibits communication across boundaries.</p>
<p>You may argue, &ldquo;Does it really matter? Customers don&rsquo;t really care how your org is structured as long as your product is usable and has great UX.&rdquo; I would actually agree that this is a good baseline for structuring your org: several vertical teams per product, plus an overlapping horizontal team that ensures UX coherence, usability, and branding. I think this is what the <a href="https://www.hbs.edu/ris/Publication%20Files/16-124_7ae90679-0ce6-4d72-9e9d-828872c7af49.pdf">Mirroring Hypothesis</a> study (which confirms Conway&rsquo;s Law) refers to as partial mirroring, in which technological knowledge is invested, shared, and acquired beyond operational boundaries. &ldquo;API-first&rdquo; companies come to mind, as building contractual relationships (APIs) that support technical interdependency across boundaries seems to make partial mirroring work. And more often than not, your org structure will change multiple times during your product&rsquo;s lifetime, resulting in subsequent changes to your system&rsquo;s design that mirror the new org, and so on and so forth, so you might as well embrace this dynamism instead of fighting it.</p>
<p>Finally, the study also highlights collaboration patterns in the open source sphere that do not support the mirroring hypothesis (and thus, Conway&rsquo;s Law). My experience with OSS collaboration is mostly outside work so I&rsquo;m interested to know if there is a case for adopting OSS-style structure within the organization. I think that&rsquo;s a good topic for another day.</p>
<p>To conclude, instead of going with the warning &ldquo;don&rsquo;t ship the org chart&rdquo;, since you will ship your org chart anyway, be aware of it, understand it, and make sure it works in your favor.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2023-08-09:/blog/2023-08-09-cto-diaries-on-not-shipping-your-org-chart/</guid><link>https://flowerinthenight.com/blog/2023-08-09-cto-diaries-on-not-shipping-your-org-chart/</link><pubDate>Wed, 09 Aug 2023 00:00:00 JST</pubDate><title>CTO Diaries #4: On not shipping your org chart</title></item><item><description><![CDATA[<p>We just recently announced the public beta of our new product, <a href="https://lp.alphaus.cloud/">OCTO</a>. If you&rsquo;re interested, you can join our waiting list at <a href="https://lp.alphaus.cloud/waitlist">https://lp.alphaus.cloud/waitlist</a>.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2023-05-29:/blog/2023-05-29-announcing-octo/</guid><link>https://flowerinthenight.com/blog/2023-05-29-announcing-octo/</link><pubDate>Mon, 29 May 2023 00:00:00 JST</pubDate><title>Announcing our new product, OCTO</title></item><item><description><![CDATA[<p>In a distributed system, where multiple processes communicate with each other over a network, failures are inevitable. Network partitions, hardware failures, and software bugs can all cause a request to fail. Retries with backoff are a critical technique to help mitigate these failures.</p>
<p>Retries refer to the act of retrying a failed request. When a request fails, the client can retry the request, hoping that it will succeed the next time around. However, simply retrying the request immediately after a failure can be problematic. If the failure was caused by a temporary network issue, for example, retrying immediately will likely result in another failure. This is where backoff comes in.</p>
<p>Backoff refers to the practice of waiting a certain amount of time before retrying a failed request. The idea is to wait long enough for any temporary issues to resolve themselves before retrying. The amount of time to wait is typically increased with each retry, hence the term &ldquo;backoff.&rdquo; The idea is that if the request fails multiple times, the client will eventually back off enough to give the system a chance to recover.</p>
<p>There are several benefits to using retries with backoff in a distributed system. First, it can help reduce the impact of temporary failures. By waiting before retrying a request, the client can avoid bombarding the target with requests, which can exacerbate the problem. Second, it can help improve overall system availability. By retrying failed requests, the client can work around transient issues that might otherwise cause the entire system to fail.</p>
<p>There are several strategies for implementing retries with backoff. One common approach is exponential backoff, where the client waits an increasing amount of time between each retry. Another approach is jittered backoff, where the client adds a random amount of time to the wait period to avoid the so-called &ldquo;thundering herd&rdquo; problem, where multiple clients all retry at the same time.</p>
<p>Example 1:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span><span class="line"><span class="cl">    <span class="nx">backoffv4</span> <span class="s">&#34;github.com/cenkalti/backoff/v4&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kd">func</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kd">var</span> <span class="nx">n</span> <span class="kt">int</span>
</span></span><span class="line"><span class="cl">    <span class="nx">operation</span> <span class="o">:=</span> <span class="kd">func</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">n</span><span class="o">++</span>
</span></span><span class="line"><span class="cl">        <span class="nx">log</span><span class="p">.</span><span class="nf">Printf</span><span class="p">(</span><span class="s">&#34;n=%v\n&#34;</span><span class="p">,</span> <span class="nx">n</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nx">n</span> <span class="o">&gt;=</span> <span class="mi">10</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">nil</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nx">fmt</span><span class="p">.</span><span class="nf">Errorf</span><span class="p">(</span><span class="s">&#34;backoff&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nx">err</span> <span class="o">:=</span> <span class="nx">backoffv4</span><span class="p">.</span><span class="nf">Retry</span><span class="p">(</span><span class="nx">operation</span><span class="p">,</span> <span class="nx">backoffv4</span><span class="p">.</span><span class="nf">NewExponentialBackOff</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="nx">err</span> <span class="o">!=</span> <span class="kc">nil</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">log</span><span class="p">.</span><span class="nf">Println</span><span class="p">(</span><span class="s">&#34;final backoff failed&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Example 2 (my preference):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-go" data-lang="go"><span class="line"><span class="cl"><span class="kn">import</span> <span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="o">...</span>
</span></span><span class="line"><span class="cl">    <span class="nx">gaxv2</span> <span class="s">&#34;github.com/googleapis/gax-go/v2&#34;</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kd">func</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nx">bo</span> <span class="o">:=</span> <span class="nx">gaxv2</span><span class="p">.</span><span class="nx">Backoff</span><span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">Initial</span><span class="p">:</span> <span class="nx">time</span><span class="p">.</span><span class="nx">Second</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">        <span class="nx">Max</span><span class="p">:</span>     <span class="nx">time</span><span class="p">.</span><span class="nx">Minute</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="kd">var</span> <span class="nx">n</span> <span class="kt">int</span>
</span></span><span class="line"><span class="cl">    <span class="nx">operation</span> <span class="o">:=</span> <span class="kd">func</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">n</span><span class="o">++</span>
</span></span><span class="line"><span class="cl">        <span class="nx">log</span><span class="p">.</span><span class="nf">Printf</span><span class="p">(</span><span class="s">&#34;cnt=%v\n&#34;</span><span class="p">,</span> <span class="nx">n</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nx">n</span> <span class="o">&gt;=</span> <span class="mi">10</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">return</span> <span class="kc">nil</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nx">fmt</span><span class="p">.</span><span class="nf">Errorf</span><span class="p">(</span><span class="s">&#34;backoff&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">for</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nx">err</span> <span class="o">:=</span> <span class="nf">operation</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="nx">err</span> <span class="o">!=</span> <span class="kc">nil</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="nx">time</span><span class="p">.</span><span class="nf">Sleep</span><span class="p">(</span><span class="nx">bo</span><span class="p">.</span><span class="nf">Pause</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">            <span class="k">continue</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">break</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>In conclusion, retries with backoff are an important technique for improving the robustness and availability of distributed systems. By waiting before retrying failed requests, the client can help reduce the impact of temporary failures and improve overall system availability. There are several strategies for implementing retries with backoff, and choosing the right approach will depend on the specific requirements of the system.</p>
<p>Additional reading: <a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/">https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/</a></p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2023-05-11:/blog/2023-05-11-retries-backoff/</guid><link>https://flowerinthenight.com/blog/2023-05-11-retries-backoff/</link><pubDate>Thu, 11 May 2023 00:00:00 JST</pubDate><title>Retries with backoff in distributed systems</title></item><item><description><![CDATA[<p>Hey there,</p>
<p>I just posted a blog about gRPC <a href="https://labs.alphaus.cloud/blog/2023/05/03/blue-api/">here</a>. If gRPC and <code>grpc-gateway</code> are right up your alley, you might find it interesting.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2023-05-03:/blog/2023-05-03-alphaus-blue-api/</guid><link>https://flowerinthenight.com/blog/2023-05-03-alphaus-blue-api/</link><pubDate>Wed, 03 May 2023 00:00:00 JST</pubDate><title>Alphaus Blue API</title></item><item><description><![CDATA[<p>I recently came across the <a href="https://github.com/hashicorp/memberlist">hashicorp/memberlist</a> library while browsing GitHub and I thought it would be a good replacement for <a href="https://github.com/flowerinthenight/hedge">hedge</a>&rsquo;s internal member tracking logic. It seems to be widely used (thus more battle-tested) as well. I was quite excited as I always thought that hedge&rsquo;s equivalent logic is too barebones and untested outside of our use cases. It works just fine for its current intended purpose but I&rsquo;ve been hesitating to build on top of it until I can really say that it&rsquo;s stable enough. With memberlist, it might just be what I needed.</p>
<p>After about a month of testing, it didn&rsquo;t turn out well in the end. It is stable enough for deployments whose workloads aren&rsquo;t spiky (no frequent scaling up/down), or if I set min = max in the <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">HorizontalPodAutoscaler</a>. In those cases, memberlist tracks members consistently. Even better, it works across multiple deployments in the same namespace, which I thought was brilliant. For example, if I have a deployment <code>app1</code> set to 10 pods in the <code>default</code> namespace using memberlist&rsquo;s default port numbers, and then deploy another set, say <code>app2</code>, within the same namespace using the same ports, app1&rsquo;s memberlist tracks its 10 member pods just fine while app2&rsquo;s memberlist stays separate. But applied to my use case, with a minimum of 2 pods, a maximum of 150, and frequent scaling up/down depending on load, it can&rsquo;t seem to keep up. The potential for <a href="https://en.wikipedia.org/wiki/Byzantine_fault">Byzantine faults</a> is just too high: at a 50-pod scale, memberlist can end up with two groups of m pods and n pods where m+n=50. Very rarely, it can even split into 3 groups.</p>
<p>I am a little frustrated. I really wanted it to work; I even attempted to update memberlist to incorporate hedge&rsquo;s logic, but that was too much to take on with my current schedule. So for now, it&rsquo;s back to the old one. The current logic is fairly rudimentary: all members in the cluster/group send a liveness heartbeat to the leader, and the leader broadcasts the final list of members to all via hedge&rsquo;s broadcast mechanism. CPU usage between the two is fairly similar, depending on the sync timeout.</p>
<p>I&rsquo;ve been trying to improve hedge&rsquo;s member tracking system as I want to build a distributed in-memory cache within hedge itself. Most of the available ones are <a href="https://raft.github.io/">Raft</a>-based, and I still haven&rsquo;t figured out how to make Raft work in the same deployment configuration.</p>
<br>
]]></description><guid isPermaLink="false">tag:flowerinthenight.com,2023-04-28:/blog/2023-04-28-hedge-memberlist/</guid><link>https://flowerinthenight.com/blog/2023-04-28-hedge-memberlist/</link><pubDate>Fri, 28 Apr 2023 00:00:00 JST</pubDate><title>Attempt to replace hedge’s member tracking with hashicorp/memberlist</title></item></channel></rss>