Considerations on deploying LLM-based workflows

2025-02-25

AI · Deployment · GenAI · LLM · Programming · Software · Systems · Tech

6 minutes

Let me start with a disclaimer. These thoughts come from a startup perspective where funds and personnel are a bit scarce. There are a lot of generalizations and assumptions made here as well; take them with a pinch of salt.

In just a short span of time, the speed of improvements to LLMs, especially the mainstream ones, has been nothing short of astonishing. The massive ones are getting smarter, faster, and more accurate. Even better, the open ones, such as Llama, DeepSeek, Gemma, Qwen, etc., are also catching up, which is a good thing, as I'm more interested in them. And for enterprises looking into integrating LLMs into their internal workflows, or even products, the options available now are so many that it's quite confusing where to even start. I hope this blog will shed some light on that confusion.

One of the first things to consider is whether to go AIaaS (AI as a service, or external) or deploy internally. While I don't really have a problem with using Gemini, ChatGPT, or Sonnet for explorations, I think when internal data or knowledge bases are involved, hybrid deployment is the way to go. Hybrid in this sense means hosting LLMs internally for more critical information (for obvious reasons), and using external LLMs for the rest. Ultimately, the decision of what to use, whether external, hybrid, or internal, will depend on your company's data privacy policies, governance, and regulations. I won't be digging further into going external, as it's more about tracking and controlling what information is being included in the prompts than about deployment, so we treat it as we do any other API-based vendor integration. For internal deployment, on the other hand, we generally have two options: a) fine-tuning a closed LLM, and b) using open models.

Fine-tuning a closed LLM really depends on whether that feature is provided by the vendor. For example, OpenAI's ChatGPT and Google's Gemini can be fine-tuned with your own data, and you can use the fine-tuned version as your new LLM; the vendor will still host that LLM for you, and you use it as you would any other LLM, through its API. Now, you might argue that this is technically not an internal deployment, and you might be right. I only categorized it as such because there is almost always a provision from the vendor effectively promising not to use your fine-tuning data as training data for their next LLM versions, so there's an assumption of privacy there; whether you trust them or not is up to you.
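To make this concrete, here's a minimal sketch of what the vendor-hosted fine-tuning flow looks like with OpenAI's Python SDK. The file name, base model, and fine-tuned model ID are placeholders, and the set of supported base models changes over time, so check the vendor's current docs rather than copying this verbatim.

```python
# Minimal sketch of vendor-hosted fine-tuning via the OpenAI Python SDK (v1.x).
# File name, base model, and model IDs below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the fine-tuning dataset (JSONL of chat-formatted examples).
training_file = client.files.create(
    file=open("onboarding_policies.jsonl", "rb"),  # placeholder dataset
    purpose="fine-tune",
)

# 2. Kick off the fine-tuning job against a supported base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; use whatever base the vendor currently supports
)
print("fine-tune job:", job.id)

# 3. Once the job finishes, the vendor hosts the tuned model; call it like any other model.
#    The model ID below is a placeholder for what the completed job reports.
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",  # placeholder fine-tuned model ID
    messages=[{"role": "user", "content": "Summarize our onboarding policy for new hires."}],
)
print(response.choices[0].message.content)
```

The point being: the fine-tuned model never leaves the vendor's infrastructure; you just get a new model ID to call.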

The second option, using open models, is really what you want. The idea is that, for enterprise use, you don't really need, say, ChatGPT's vast, generic knowledge of the world. You want an LLM that understands your hiring and onboarding policies instead of knowing the historical weather data of the Atlantic for the past decade; you want its inferencing (or "thinking") capabilities applied to your internal data instead. In this route, you can either fine-tune an open model with your data, all hosted internally, or use an open model as is and use RAG to augment and ground it with your internal data, also hosted internally.

So, RAG? Or fine-tuning? I think doing both is the way to go. The general rule of thumb (advice I got from the Gemma Japan team) is to fine-tune for static (or near-static) data, and use RAG for more dynamic, always-changing data.

Considerations for fine-tuning are expertise and costs. To fine-tune an LLM, you need quality training/fine-tuning data. And since enterprise data are usually all over the place, and often not centralized, data preparation is actually one of the biggest hurdles. You will probably need a team of data engineers, data scientists, and infra/ops personnel to pull this off. The costs will involve the upfront cost of the actual fine-tuning (you need compute, both CPU and GPU/TPU), and the ongoing inferencing (or serving) costs, which will involve compute (CPU/GPU/TPU), storage, and networking.
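For a sense of what the fine-tuning step itself involves, here's a rough sketch of parameter-efficient fine-tuning (LoRA) of an open model using Hugging Face's transformers, peft, and trl libraries. The model name, dataset, and hyperparameters are placeholders, and trainer arguments differ between library versions, so treat this as the shape of the work rather than copy-paste code.

```python
# Rough sketch of LoRA fine-tuning of an open model with transformers + peft + trl.
# Model name, dataset path, and hyperparameters are placeholders; exact SFTTrainer
# arguments vary between trl versions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "google/gemma-2-2b-it"  # placeholder open model
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Your curated, centralized enterprise data -- usually the hardest part to produce.
# The JSONL must be in a format the trainer expects (e.g., a "text" or chat "messages" column).
dataset = load_dataset("json", data_files="hr_policies.jsonl", split="train")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections to adapt; model-specific
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(output_dir="gemma-hr-lora", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()
trainer.save_model("gemma-hr-lora")  # saves adapter weights; serve them merged or alongside the base model
```

Even this toy version assumes someone already cleaned and consolidated the data into one training file, which is exactly the hurdle mentioned above.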

Considerations for RAG are also expertise and costs. Setting up RAG-based workflows involves some important components: LLM routers, embeddings generators, and vector databases. You will most definitely be using multiple LLMs for multiple purposes: one for text summarization or generation, one for reasoning, another for research or coding, and so on. Each LLM will be deployed separately; it could be on a big VM, or a cluster of VMs. And since these clusters need GPUs, you'd probably want a serverless, scale-to-zero environment, as GPUs will bear most of the costs in this layer. You might be able to get away with traditional auto-scaling clusters (with additional checks in place to scale down during idle time), but environments like Kubernetes with, say, KEDA for scale-to-zero, or Fly's GPUs (which I believe can scale to zero), or GCP's Cloud Run, etc., will be easier to manage.

With multiple LLM deployments, you'd also want a router that routes input requests, or prompts, to the appropriate LLM target. You can do this traditionally, utilizing an LLM to facilitate the routing, or you could do MoE (Mixture of Experts) deployments, where the routing is done within the LLM itself. One issue, however, is that there aren't a lot of open MoE models yet; there are IBM's Granite, Mistral-MoE, and Qwen-MoE models (I'm monitoring this space closely as well; I think there will be more improvements here in the near future).
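Here's a minimal sketch of what an LLM-assisted router could look like: a small, cheap model classifies the incoming prompt, and the request is forwarded to the matching backend. All the endpoints and model names are hypothetical stand-ins for your own internal deployments (e.g., OpenAI-compatible servers running behind a scale-to-zero platform).

```python
# Minimal sketch of an LLM-assisted router: a small "router" model labels the prompt,
# and the request is forwarded to the matching backend deployment.
# All endpoints and model names are hypothetical placeholders.
import requests

# Hypothetical internal deployments, one per specialty.
BACKENDS = {
    "summarization": "http://llm-summarize.internal/v1/chat/completions",
    "reasoning":     "http://llm-reason.internal/v1/chat/completions",
    "coding":        "http://llm-code.internal/v1/chat/completions",
}
ROUTER_URL = "http://llm-router.internal/v1/chat/completions"  # small, cheap model

def classify(prompt: str) -> str:
    """Ask the router model which backend should handle the prompt."""
    labels = ", ".join(BACKENDS)
    resp = requests.post(ROUTER_URL, json={
        "model": "router-small",  # placeholder
        "messages": [
            {"role": "system", "content": f"Reply with exactly one label from: {labels}."},
            {"role": "user", "content": prompt},
        ],
    })
    label = resp.json()["choices"][0]["message"]["content"].strip().lower()
    return label if label in BACKENDS else "reasoning"  # fall back to a sane default

def route(prompt: str) -> str:
    """Send the prompt to the backend chosen by the router model."""
    url = BACKENDS[classify(prompt)]
    resp = requests.post(url, json={
        "model": "default",  # placeholder; depends on what the backend serves
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

print(route("Summarize our Q3 incident reports in three bullet points."))
```

The design choice worth noting is the fallback: if the router model returns something unexpected, you route to a general-purpose backend rather than failing the request.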

Embeddings generators and vector databases are specific to RAG. To "feed" your enterprise data to an LLM (as opposed to fine-tuning it in), you need to generate "embeddings" for the data first. Embeddings are vector representations (with semantic context) of your data. These embeddings are then stored in a vector database for later use. The more data you have, or at least the more data you want an LLM to have access to, the more embeddings you will generate. How many you end up with will depend on the size of the context windows of the LLMs you choose. For example, one page of a document could be one embedding, or, if an LLM's context window is smaller (looking at you, Gemma 2), you could "chunk" your data into smaller pieces of defined length, with one chunk becoming one embedding, and so on. Options for generators are plenty: you can use the mainstream providers' embedding APIs, such as OpenAI's Vector Embeddings API, GCP's Embeddings API, AWS' Titan Text Embeddings, etc., or use open models, such as Word2Vec, LexVec, BERT-based embedders, etc., although you will still need to host them internally. Choices for vector databases are similar: you can use vendor-provided ones, or self-host open-source ones such as Chroma. Note that these are still databases, so when self-hosting, you still need to tackle the usual headaches of deploying databases, even for vendor-provided ones that you run yourself.

Once these are in place, you do RAG by converting the input query, or prompt, to its embedding equivalent, doing a semantic/similarity/distance search against your vector database, mapping the resulting embeddings back to the real data you have, and then using that data as context for (or part of) your final prompt to the target LLM. So, still an ops-heavy deployment, as you can see, notwithstanding the costs, both upfront and operational, that you will incur in deploying these components.
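To tie the RAG pieces together, here's a bare-bones sketch of the flow: chunk the documents, embed the chunks, store the vectors, then at query time embed the prompt, retrieve the nearest chunks, and prepend them as context. The embedding model and chunk size are arbitrary choices for illustration, and the in-memory numpy "store" stands in for a real vector database.

```python
# Bare-bones sketch of the RAG flow: chunk -> embed -> store -> retrieve -> ground the prompt.
# The embedding model, corpus, and chunk size are placeholders; the numpy array stands in
# for a real vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-length chunking; 1 chunk -> 1 embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Index: embed every chunk of every internal document.
documents = {"onboarding.md": open("onboarding.md").read()}  # placeholder corpus
chunks = [c for doc in documents.values() for c in chunk(doc)]
index = embedder.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query and return the k most similar chunks (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # normalized vectors -> dot product equals cosine similarity
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Ground the prompt with retrieved context before sending it to the target LLM."""
    context = "\n---\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many days of onboarding training do new hires get?"))
```

In a real deployment the indexing side and the query side are separate services, and the similarity search is handed off to the vector database, but the shape of the flow is the same.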

I will write something about cost estimations or simulations for these deployments, and some ideas on actual deployments as well, but that will be in a separate blog post.