As many readers know, I’m always exploring new horizons. Recently, Large Language Models (LLMs) sparked my interest, especially with the rise of ChatGPT. With a background in Google Cloud Platform (GCP) data engineering, I saw a clear opportunity to apply my skills to LLMOps (Large Language Model Operations) on GCP.
Langchain, an open-source toolkit by Harrison Chase, has played a pivotal role in this journey. Using Langchain, I developed “Edmonbrain,” a personalised chatbot I use daily on Google Chat, Slack, and Discord. It processes my book, GitHub repositories, and whitepapers, creating a tailored AI experience grounded in my own data.
I believe we’re entering a transformative period. With LLMOps on GCP, I feel 20-50% more productive and expect these models will reshape the way we work.
And I’m just getting started: imagine if we all shared a custom resource, and our feedback, questions and dreams were part of the bot’s context?
If you buy me a drink I will happily talk at length about how OpenAI, Google, Meta and Microsoft are interacting within this space, what the future may hold and what we can already do today. To keep a tight scope for this blog post, though, I will focus on how I made Edmonbrain: a chatbot powered by LLMs that can talk about my own private data. On the wider LLM conversation I will for now only say that I think I’m 20-50% more productive in developing ideas thanks to them, and I believe the nature of work will change once that boost is applied more generally throughout the population. Naturally, Edmonbrain itself was very much LLM assisted, and it was amazing to work with.
Key objectives for effective LLMOps on GCP
To create a secure, adaptable, and scalable AI solution, my main goals included:
- Cost-efficiency: Implement a serverless, scale-to-zero architecture.
- Modularity: Enable easy swapping of models, databases, and UIs.
- Privacy: Ensure all data stays within GCP, avoiding third-party services.
- Scalability: Allow growth from zero to high demand seamlessly.
- User-friendly data input: Make it simple to add data sources like URLs, GitHub repos, and Google Drive.
- No authentication keys: Rely on Google’s metadata server for easy and secure access instead of distributing service account keys.
My aim was to build a “company-wide brain.” I envisioned an LLM bot that could access internal documents, emails, and chats to answer questions across an organisation. This bot would improve communication using familiar platforms like Google Chat or Slack.
Essential components to implement LLMOps on GCP
- Large Language Model (LLM) – the “brain” that responds to the text you send in. This is the magic that enables everything else: it supplies the intrinsic language understanding that was lacking in previous bots, and advanced LLMs can feel like conversing with a knowledgeable individual.
- Chat history – to make conversations natural, the LLM needs a short-term memory of what has happened. This doesn’t come for free (yet) in the APIs: you need to supply the chat history yourself and add it to the API calls you make to the LLM.
- Context – this is the piece generating much of the excitement at the moment. It means pasting a few examples or some supporting text into any question you ask, to help the LLM answer: a form of prompt engineering. If you ask “What is my name?”, tools such as Langchain can turn that prompt into “Answer below based on the context provided: What is my name? Context: Mark Edmondson is the creator of this tool”. It’s a simple idea, but with very powerful applications once you automate generating that context (a minimal sketch follows this list).
- Vectorstore – this is the hot new tech that enables the context above. You have a limited window of text you can add to prompts, although there is an arms race over which model can provide the largest context window (1 billion tokens?). Vectorstores hold the vectors of text embeddings used for similarity searches, so you can select the best context to add to the prompt to help its answers. The gif below from Matching Engine helps illustrate what they are doing: vectors are similar if they encode similar content, so for each new prompt you look for points close to existing embedded vectors, and the context returned is hopefully useful.
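To make the context and chat history ideas concrete, here is a minimal sketch of “context stuffing”. The template wording and hard-coded values are illustrative assumptions, not Edmonbrain’s actual code; in a real app the context would come back from a vectorstore similarity search and the history from the chat platform.

```python
# Minimal sketch of stuffing context and chat history into a prompt.
# Template wording and values are illustrative, not Edmonbrain's actual code.
from langchain.prompts import PromptTemplate

template = """Answer the question below based only on the context provided.

Chat history:
{chat_history}

Context:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["chat_history", "context", "question"],
    template=template,
)

filled = prompt.format(
    chat_history="Human: Hello\nAI: Hi, how can I help?",
    context="Mark Edmondson is the creator of this tool.",
    question="What is my name?",
)
print(filled)  # this assembled string is what actually gets sent to the LLM
```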
The first material I produced for LLMs on GCP was this slide deck on “Edmonbrain – building a brain” in June 2023. I have to mention the date, as it’s probably already out of date given the pace of updates at the moment. It introduces a few key components that I summarise below:
- UX – how you interact with the bot. Although there are lots of web-based chat bots out there, I avoided the rabbit hole of building a UI by instead building LLM-powered bots within the chat platforms I use daily, namely Discord, Slack and Google Chat. You get a lot of features for free, such as chat history, user management and mobile/desktop apps.
- Orchestration – you need something that strings all the above components together. Langchain excels here since it offers a consistent API, so you can swap between, say, the OpenAI and Vertex AI LLM models, or between vectorstores such as Supabase, Chroma or CloudSQL (a short sketch of this follows below).
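As a rough illustration of that consistent API, the sketch below swaps the model vendor behind a single helper. The function name, model choice and vendor strings are my own assumptions for the example; it presumes OPENAI_API_KEY is set for OpenAI and default GCP credentials for Vertex AI.

```python
# Illustrative only: swapping the LLM behind the same Langchain interface.
from langchain.chat_models import ChatOpenAI, ChatVertexAI

def get_llm(vendor: str = "vertex"):
    # Both classes implement the same chat model interface,
    # so the rest of the pipeline does not care which one is returned.
    if vendor == "openai":
        return ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    return ChatVertexAI(temperature=0)

llm = get_llm("vertex")
print(llm.predict("Say hello in one short sentence."))
```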
These components are represented by the data architecture diagrams below, which I’ve either made or saved from the web. Incidentally, @Langchain is also a great social media account to follow to keep up to date with all of the above, and was how I found out about these concepts.
The diagram above is from a comprehensive post called Emerging Architectures for LLM Applications by a16z.com, which was my first realisation that LLMOps is an emerging field I’d like to be involved with.
Below is a diagram I made that helped me break down what I needed for my project:
What I don’t cover in this blog are the tools or agents at the bottom. That’s my next step. But this diagram was the blueprint for what I built below.
Designing data architecture for LLMOps on GCP
On GCP, I used Cloud Run to host microservices, each handling a specific component:
- Google Chat App – this accepts HTTP requests from a Google Chat bot and parses the questions into a suitable format for the QA service. It also includes some slash-command processing to change behaviour, such as switching to a different LLM (Codey vs Bison). It receives the answers back from the QA service and sends them, formatted, back to the user.
- Slack and Discord – similarly, there are Slack and Discord Apps dealing with their own APIs but sending the same output as the Google Chat App to the QA service.
- Question/Answer (QA) App – this accepts questions and chat history and sends them to the LLM for answering. It runs Langchain to enable its various useful applications, such as ConversationalRetrievalChain, which calls the vectorstore for context before sending the prompt to the LLM.
- Embedding service – when a user issues a command (!saveurl ), or when it receives a PubSub message broadcast from a file landing in the attached Cloud Storage bucket, this service receives the raw file and sends it to the Unstructured service. The Unstructured service creates Langchain Documents() that can then be chunked and passed into embedding. The number of chunks per document can get large (1000s for a large PDF), so each chunk is sent separately in its own PubSub message, which scales the Cloud Run app up to meet demand and back down to 0 afterwards. This speeds up embedding A LOT (a sketch of this fan-out follows after this list). I also added some special parsers to improve usability: for example, if the URL starts with https://github.com it will attempt to clone that repo and embed every file within it, and if it starts with https://drive.google.com it will route the load through the Google Drive loaders.
- Unstructured service – you can call the Unstructured API with an API key if you want, but to keep documents private and within GCP you can easily host your own Unstructured instance using their pre-made Docker container. This accepts document parsing requests from the Embedding service.
- CloudSQL – the only non-serverless bit is CloudSQL running PostgreSQL, in order to use the pgvector extension (there is a video about this here). This database connects to Cloud Run via a private VPC connector, so there is no need for a public IP. I’m wondering when/if this should be switched out for AlloyDB (which has some built-in ML features) or Matching Engine (which has enterprise pricing, but may perform better for 10k+ documents).
- Cloud Storage – this is another way to feed the Embedding service, since you can link a PubSub notification to the Cloud Run endpoint. This is handy when lots of documents are added at once, or when you have existing data flows putting documents into Cloud Storage.
- Pub/Sub – this is the glue that binds the Apps together, and allows you to message queue big chunky embeddings and/or send data to different destinations. For example, each question/answer thread may themselves hold valuable training data, so those answers are also piped to BigQuery for use later on.
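Here is a sketch of the embedding fan-out mentioned above: each chunk becomes its own Pub/Sub message so Cloud Run can scale the work out. The project ID, topic name and payload fields are hypothetical placeholders, not Edmonbrain’s exact schema.

```python
# Sketch of the chunk fan-out: split Documents, publish one message per chunk.
import json
from google.cloud import pubsub_v1
from langchain.text_splitter import RecursiveCharacterTextSplitter

PROJECT_ID = "my-gcp-project"   # assumption: replace with your project
CHUNK_TOPIC = "embed-chunk"     # assumption: topic consumed by the embedding worker

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, CHUNK_TOPIC)

def publish_chunks(documents):
    """Split Langchain Documents into chunks and publish one Pub/Sub message
    per chunk, so the Cloud Run embedder can scale out and back to zero."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(documents)
    for chunk in chunks:
        payload = json.dumps(
            {"page_content": chunk.page_content, "metadata": chunk.metadata}
        ).encode("utf-8")
        publisher.publish(topic_path, data=payload)
    return len(chunks)
```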
Here is how it fits together, including the optional 3rd party options outside of GCP.
The diagram covers the various services that GCP provides. Cloud Run appears many times: each Cloud Run box is its own microservice, connected via PubSub and/or HTTP.
Code examples for LLM Cloud Run services
I have open-sourced Edmonbrain’s code on GitHub. For Google Chat, I use Pub/Sub to handle asynchronous responses and enhance message formatting with Cards.
Google Chat
The Google Chat service uses PubSub since you must reply within 30 seconds, so the Q&A service needed to be asynchronous in case generating an answer took longer than that (it’s usually under 10 seconds, but longer, more complicated answers do happen).
I generated a Google Chat Card for the output as it looks nice, although not for Codey, which only outputs plain text since its answers include backtick code blocks that don’t work with the Card format. Authentication is done using Cloud Run’s default service account, but downloading chat history appears to be available only with explicit opt-in by the user.
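To make the 30-second constraint concrete, here is a minimal sketch of what such an entry point could look like: acknowledge quickly, hand the question to Pub/Sub, and let the QA service post the real answer later. The field names follow Google Chat’s event JSON, but the Flask app, project ID and topic name are assumptions, not Edmonbrain’s actual handler.

```python
# Sketch: ack the Google Chat event fast, queue the question for async answering.
import json
from flask import Flask, request, jsonify
from google.cloud import pubsub_v1

app = Flask(__name__)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "qa-questions")  # assumption

@app.route("/", methods=["POST"])
def on_chat_event():
    event = request.get_json()
    if event.get("type") != "MESSAGE":
        return jsonify({})
    payload = {
        "question": event["message"]["text"],
        "space": event["space"]["name"],   # needed to post the answer back later
        "user": event["user"]["displayName"],
    }
    publisher.publish(topic_path, data=json.dumps(payload).encode("utf-8"))
    # Reply inside Google Chat's timeout; the full answer arrives asynchronously.
    return jsonify({"text": "Thinking..."})
```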
Discord’s API was easier to work with to get chat history and message events, and was my favourite compared to Slack and Google Chat.
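As a rough illustration of why Discord felt easier, the discord.py sketch below shows the two things mentioned: receiving message events and pulling recent chat history. The history limit and token handling are assumptions for the example.

```python
# Illustrative discord.py sketch: message events plus recent channel history.
import discord

intents = discord.Intents.default()
intents.message_content = True  # needed to read message text
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return
    # The last few channel messages become the chat history for the QA service
    history = [m.content async for m in message.channel.history(limit=10)]
    print(f"Question: {message.content}, history items: {len(history)}")

client.run("YOUR_DISCORD_BOT_TOKEN")  # assumption: supply the token via env/secret
```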
CloudSQL running PostgreSQL with pgvector
Langchain already has an existing PGVector connector, which I reused for CloudSQL. I can see a business in embedding documents at a given embedding size and selling them pre-computed.
Setting up the CloudSQL instance was simple enough. I opted for a private IP instance and used a serverless VPC connector as described at these links:
- https://cloud.google.com/sql/docs/postgres/connect-run#private-ip_1
- https://cloud.google.com/sql/docs/postgres/connect-instance-cloud-run
I then needed to specify the VPC connector when creating the Cloud Run service via the --vpc-connector flag, and the database was then reachable at its private IP (10.24.0.3 in the sketch below). I then constructed the PGVECTOR_CONNECTION_STRING and placed it in Secret Manager, accessible during Cloud Run start-up via:
--update-secrets=PGVECTOR_CONNECTION_STRING=PGVECTOR_CONNECTION_STRING:latest
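Below is a sketch of wiring Langchain’s PGVector store to CloudSQL over that private IP. The database name, user, password and collection name are placeholders I’ve made up for the example; in practice the finished connection string lives in Secret Manager and arrives as the PGVECTOR_CONNECTION_STRING environment variable.

```python
# Sketch: connecting Langchain's PGVector store to CloudSQL via the private IP.
import os
from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores.pgvector import PGVector

connection_string = os.environ.get(
    "PGVECTOR_CONNECTION_STRING",
    PGVector.connection_string_from_db_params(
        driver="psycopg2",
        host="10.24.0.3",          # CloudSQL private IP reachable via the VPC connector
        port=5432,
        database="langchain",      # assumption
        user="postgres",           # assumption
        password="change-me",      # assumption: keep this in Secret Manager, not source
    ),
)

vectorstore = PGVector(
    connection_string=connection_string,
    embedding_function=VertexAIEmbeddings(),
    collection_name="edmonbrain",  # assumption: one collection per namespace
)
```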
Hosting Unstructured Document Loading on Cloud Run
Unstructured is a great shortcut for parsing lots of different file formats into text that can work with embeddings. The easiest way to get started is to just call their API with your files, for which you need a free API key, but for private documents I hosted my own instance on a bigger Cloud Run instance, since parsing PDFs can get pretty CPU-intensive as it uses OCR techniques to read diagrams and the like.
Once a Cloud Run URL was available, it was just a matter of pointing the loader at your own URL endpoint rather than the public one:
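A minimal sketch of that, assuming Langchain’s UnstructuredAPIFileLoader can be pointed at a self-hosted instance via its url argument; the Cloud Run URL and file name are placeholders.

```python
# Sketch: use the self-hosted Unstructured container instead of the public API.
from langchain.document_loaders import UnstructuredAPIFileLoader

UNSTRUCTURED_URL = "https://unstructured-xxxxx-ew.a.run.app/general/v0/general"  # assumption

loader = UnstructuredAPIFileLoader(
    file_path="whitepaper.pdf",  # placeholder file
    url=UNSTRUCTURED_URL,        # your private Cloud Run endpoint, no API key needed
)
docs = loader.load()             # returns Langchain Document() objects ready for chunking
```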
Q&A Service
The LLM API calls, along with the additional context from your documents, are handled by Langchain (featured on TechRadar by Devoteam). The choice of vectorstore or LLM is made by a simple config file that looks at the namespace: for example, Edmonbrain running in Discord uses Supabase/OpenAI whilst GoogleBrain on Google Chat uses Vertex AI and CloudSQL. Langchain’s abstractions mean you can run the same chat operation across both:
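Here is a minimal sketch of that Q&A step: pick the LLM and vectorstore from config, then let ConversationalRetrievalChain add the retrieved context before calling the model. The config lookup, function name and namespaces are illustrative assumptions, not Edmonbrain’s exact implementation.

```python
# Sketch: namespace-driven choice of LLM and vectorstore behind one QA chain.
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI, ChatVertexAI

def make_qa_chain(namespace: str, vectorstores: dict):
    # e.g. {"edmonbrain": ("openai", supabase_store), "googlebrain": ("vertex", pgvector_store)}
    vendor, vectorstore = vectorstores[namespace]
    llm = ChatOpenAI(temperature=0) if vendor == "openai" else ChatVertexAI(temperature=0)
    return ConversationalRetrievalChain.from_llm(
        llm,
        retriever=vectorstore.as_retriever(),
        return_source_documents=True,
    )

# qa = make_qa_chain("googlebrain", vectorstores)
# result = qa({"question": "What is Edmonbrain?", "chat_history": []})
# print(result["answer"])
```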
I also plan to customise the prompt further, since this is an area where a lot of optimisation can happen. Eventually I also want to enable Agents, which trigger actions via code functions: you ask the LLM to create the variables for a function, run it, and feed the results back. This is an exciting, rapidly developing field.
Summary and future directions
While companies like Google release their own GenAI tools, building a custom solution has been invaluable for me. This infrastructure may eventually integrate with Google’s Gen App Builder for even greater flexibility. I’m also excited about LLMs operating independently and look forward to advancements in reducing AI hallucinations.
LLMOps has reshaped my workflow. It provides an AI “coding buddy” that handles syntax and frees me to focus on ideas. The future of LLMOps on GCP is bright, and I look forward to seeing what’s next. Please reach out if you’d like to discuss or collaborate on LLMOps projects.
This article was originally published on Mark Edmondson’s personal and professional blog at https://code.markedmondson.me/; this blog post is a representation of the original article’s content.
Optimise AI with LLMOps on GCP
Get in touch with our experts to learn how LLMOps on GCP can scale and streamline your AI models efficiently.