08 Apr 2025 Engineering GenAI Inference Pipelines: Transforming Prototypes into Production Systems
At Shine Solutions, we’re developing our AI Engineering capabilities to bridge the gap between prototypes and production-ready systems. Our recent collaboration with climate researchers has given us the opportunity to explore how we can do this whilst augmenting foundation models to deliver customised and trusted insights.
We’ve built a Retrieval-Augmented Generation (RAG) inference pipeline on AWS that can process concurrent, complex requests in a cost-effective, scalable and performant manner. RAG is a technique that enhances LLM responses by retrieving relevant external knowledge and including it with a custom prompt before generating a response. This article shares our learning journey and the architectural decisions we’ve been exploring along the way.
The climate misinformation detection system we are helping the researchers build identifies and debunks false claims about climate change. When presented with a statement that might contain misinformation, the system analyses it, checks it against scientific facts, and provides a clear, structured response that explains why the claim is misleading. The output includes verified facts supported by current research, making it easier for people to understand complex climate issues and recognise common misconceptions.
Debunking process and sample output
The debunking process the researchers developed follows these steps:
- Decide whether the statement is misinformation or not
- Classify the statement based on a taxonomy of misinformation and create a prompt
- Classify the statement based on the fallacy it contains and create a prompt
- Search the internet for up-to-date facts related to the statement and use them to create the first fact layer
- Pass the prompts created in earlier steps to the LLM to create the fallacy, myth and second fact layer
- Put it all together and return the results
The models provide a structured response to the misinformation in a Fact, Myth, Fallacy, Fact format designed by the researchers. For example:
Statement
“Coral reefs have recovered before so they will survive climate change.”
Fact
Coral reefs now face extreme stress from climate change, with predictions indicating that a rise of just 1.5°C may lead to near-annual bleaching by 2040. This level of environmental change differs significantly from past recoveries, highlighting serious concerns for their survival.
Myth
Coral reefs have previously been resilient and can endure climate change again.
Fallacy
This argument contains a slothful induction fallacy, as it disregards the significant differences between past recovery events and the current pace of climate change. Evidence shows that coral reefs are now under unprecedented stress, with a mere 1.5°C increase potentially leading to frequent bleaching events by 2040, which threatens their survival unlike any previous recovery period.
Fact
Coral reefs are under severe threat from climate change, with predictions suggesting that a 1.5°C rise could cause frequent bleaching by 2040, endangering their survival.
The Problem
The research team initially hosted their models on Hugging Face Spaces, an excellent platform for experimentation but one that presents limitations for production use cases. We saw this as an opportunity to develop our GenAI engineering skills by testing approaches to migrate these models to AWS infrastructure.
The Solution
To start with, what options are there for hosting a machine learning model on AWS? A quick search for “Hosting a machine learning model on AWS” is guaranteed to offer you a host of links that all point to AWS SageMaker as the prime option, but it is worth exploring more traditional (and cheaper) options, as we did.
SageMaker endpoints are convenient as AWS does most of the heavy lifting for you; however, convenience can come at a financial cost. In cases where the model development process is coupled with its usage in production, a full-service solution like SageMaker could work well. Our pipeline, on the other hand, consists of a combination of pre-trained models and external APIs.
We decided instead to experiment with developing our own inference pipeline using AWS Step Functions, Lambda functions, and containers. Our approach was to embed the in-house models in containers and serve them using a FastAPI server. This proved to be a robust solution for a RAG-based application.
We divided the process into small components that could run as discrete steps. These steps form the inference pipeline. Breaking the process down in this way allowed for more resilience and parallel processing.
The three steps of the pipeline are:
- Go/No-Go: Decides whether to run the rest of the pipeline. Available via an external API hosted by a private partner.
- Query Augmentation: Retrieves supporting information, classifies the misinformation, chooses a prompt template and generates prompts. The models used in this step are hosted internally.
- Generation using LLM integration: The remaining models use OpenAI to generate the result. OpenAI is accessed through an external paid API.
The following diagram summarises the pipeline:
[Pipeline diagram: Go/No-Go → Query Augmentation → Generation]
Go/No-Go Stage
The first phase of this RAG pipeline is the Go/No-Go stage. This step decides whether to run the rest of the pipeline. This is a departure from typical RAG implementations. Instead of immediately classifying the query into a type of misinformation, this phase evaluates whether the input query warrants further processing at all. By classifying queries as actionable or not, the pipeline avoids unnecessary computation for irrelevant or unsupported requests.
An example of misinformation
“Coral reefs have recovered before so they will survive climate change.”
An example of not misinformation
“I went for a run this morning.”
For a niche application like ours, it is worth adding a step to the pipeline that decides whether a request is worth processing at all. Most RAG applications are purpose-built for specific use cases, so this step filters out requests the system should not handle and stops the pipeline from becoming a costly general-purpose wrapper around a larger foundation model API.
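As an illustration, the check can be as simple as a single call to the partner’s classification API before any other processing happens. The sketch below assumes a hypothetical endpoint URL and response shape; the real partner API differs.
import requests

# Hypothetical endpoint standing in for the partner-hosted classification API
GO_NO_GO_URL = "https://example-partner-api.invalid/classify"

def is_actionable(statement: str) -> bool:
    """Return True if the statement looks like climate misinformation worth processing."""
    response = requests.post(GO_NO_GO_URL, json={"text": statement}, timeout=10)
    response.raise_for_status()
    return response.json().get("is_misinformation", False)

if not is_actionable("I went for a run this morning."):
    print("No-Go: skip the rest of the pipeline")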
Supporting Models
Three machine learning models are used to support this debunking solution.
The CARDS model categorises the type of misinformation according to a taxonomy of climate change myths. For example, our inquiry about coral reefs is classified as:
“3_2 Species/plants/reefs aren’t showing climate impacts/are benefiting from climate change”
The FLICC model categorises the underlying logical fallacy being employed by the misinformation. For example, the coral reefs inquiry is classified as using the “slothful induction” fallacy.
A third model, the all-MiniLM-L6-v2 sentence embedding model, calculates similarity between sentences.
Hosting and Calling Supporting Models
The models are hosted in a containerised environment, served using FastAPI, and deployed on AWS Elastic Container Service (ECS) for scalability and reliability.
The following code snippet shows how the FLICC model is called. As you can see, not a lot of code is required for the actual inference. This is similar for the other two models.
import os

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the FLICC fallacy classifier (defaults to the published Hugging Face model)
flicc_model = os.environ.get("FLICC", default="fzanartu/flicc")
tokenizer = AutoTokenizer.from_pretrained(flicc_model)
model = AutoModelForSequenceClassification.from_pretrained(flicc_model)

misinformation = "Coral reefs have recovered before so they will survive climate change."  # for example

# Tokenise the statement and run a single forward pass without tracking gradients
tokenized_text = tokenizer(misinformation, return_tensors="pt")
with torch.no_grad():
    logits = model(**tokenized_text).logits

# The highest-scoring class index corresponds to the detected fallacy
detected_fallacy_number = logits.argmax().item()
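To give a sense of the serving layer, the sketch below wraps the same inference code in a FastAPI endpoint. It is illustrative rather than our production service: the route name and response shape are assumptions, and the tokenizer and model are assumed to be loaded at start-up as in the snippet above.
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Inquiry(BaseModel):
    text: str

@app.post("/fallacy")
def classify_fallacy(inquiry: Inquiry):
    # Same inference as above, wrapped in an HTTP endpoint
    tokenized = tokenizer(inquiry.text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokenized).logits
    label_id = logits.argmax().item()
    # id2label maps the class index back to a human-readable fallacy name
    return {"fallacy": model.config.id2label[label_id]}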
Other Retrieval Mechanisms
In addition to the custom models, the solution relies on search engine results and local data stores.
Search Engine Agent
Search functionality is implemented using LangChain’s agents and tools. This significantly reduces the amount of code that needs to be written to implement a RAG.
The following example shows the agent’s processing as it conducts two web searches to find supporting information and summarise it. This is an example of RAG where the agent is given a tool, in this case a web search, which helps it find up-to-date information that may not have been available during training. The agent is given specific and detailed instructions in a template that teach it to use this process; see the ReAct research paper for more information. Each observation contains results from a number of web pages, shortened here for clarity.
Original Inquiry: Coral reefs have recovered before so they will survive climate change.
Action: web_search
Action Input: “impact of climate change on coral reefs recovery statistics”
Observation: The global HeatCRD (Heatwaves and Coral-Recovery Database) study sites…
Thought: I should summarise the findings regarding coral reefs’ vulnerability to climate change and the current challenges they face despite past recoveries.
Action: web_search
Action Input: “coral reefs climate change vulnerability stats recovery potential”
Observation: Coral reefs worldwide are under severe threat due to a combination of local stressors and escalating intensity and frequency of heatwaves…
Final Output: While coral reefs have recovered in the past, rising sea temperatures and frequent bleaching events significantly reduce their chances of survival, with many facing severe threats from climate change today.
As you can see, the RAG agent is capable of impressive reasoning that gives useful results.
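For readers who want to reproduce the pattern, a minimal agent along these lines can be assembled with LangChain’s classic ReAct-style agent API. The search tool (DuckDuckGo here), model choice and wording are illustrative assumptions, not our production configuration.
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Expose a web search as a tool the agent can choose to call
search = Tool(
    name="web_search",
    func=DuckDuckGoSearchRun().run,
    description="Search the web for up-to-date information about a claim.",
)

agent = initialize_agent(
    tools=[search],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,  # prints a Thought / Action / Observation trace like the one above
)

result = agent.run(
    "Find recent evidence about coral reef recovery under climate change and summarise it."
)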
Local Data Stores
We maintain several local data stores containing supporting information, such as examples that refute false climate change claims. After classifying an inquiry using the CARDS model, we retrieve corresponding claims and identify those most similar to the original inquiry. This similarity assessment relies on sentence embeddings: words are first converted to numerical tokens, which are then combined into unique sentence vectors. The embedding model transforms these vectors so that sentences with similar meanings have close numerical values, allowing us to calculate their semantic similarity and thus find related claims.
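A minimal sketch of that similarity lookup, using the all-MiniLM-L6-v2 model mentioned earlier, looks like the following; the stored claims are illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

inquiry = "Coral reefs have recovered before so they will survive climate change."
stored_claims = [
    "Coral reefs bounced back from past bleaching, so they will be fine.",
    "Climate has always changed, so current warming is nothing unusual.",
]

# Encode the inquiry and the stored claims into sentence vectors,
# then rank the claims by cosine similarity to the inquiry
inquiry_vec = embedder.encode(inquiry, convert_to_tensor=True)
claim_vecs = embedder.encode(stored_claims, convert_to_tensor=True)
scores = util.cos_sim(inquiry_vec, claim_vecs)[0]

most_similar_claim = stored_claims[scores.argmax().item()]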
Augmentation
Once all supporting information has been retrieved, augmented LLM prompts can be created. The prompts contain instructions for the LLM like “You are a senior climate analyst…”, as well as the supporting information such as the fallacy detected in the inquiry, and examples of misinformation along with their matching responses to guide the LLM as to what is required.
Each prompt is defined as a LangChain PromptTemplate in a Python file, so no extra processing is needed other than filling in the values with a call to the PromptTemplate.
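As a simplified illustration, an augmented prompt might be assembled like this; the template wording and variable names are placeholders rather than the researchers’ actual prompts.
from langchain.prompts import PromptTemplate

debunk_prompt = PromptTemplate(
    input_variables=["statement", "fallacy", "facts"],
    template=(
        "You are a senior climate analyst.\n"
        "The following statement uses the '{fallacy}' fallacy.\n"
        "Statement: {statement}\n"
        "Supporting facts: {facts}\n"
        "Write a structured Fact, Myth, Fallacy, Fact debunking response."
    ),
)

# Fill in the values retrieved and classified earlier in the pipeline
prompt_text = debunk_prompt.format(
    statement="Coral reefs have recovered before so they will survive climate change.",
    fallacy="slothful induction",
    facts="A 1.5°C rise may lead to near-annual bleaching by 2040.",
)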
Generation
Lastly, the prompts are sent to the LLM. We used OpenAI’s gpt-4o-mini, a smaller, cheaper version of gpt-4o that is sufficient for our purpose.
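A minimal sketch of the generation call with the OpenAI Python SDK is shown below; the augmented prompt is assumed to be the one built in the previous step.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a senior climate analyst."},
        {"role": "user", "content": prompt_text},  # augmented prompt from the previous step
    ],
    temperature=0.2,
)

debunking = response.choices[0].message.content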
It is important to note that the use of external APIs, like OpenAI’s, is acceptable in this instance as the solution is designed to process only publicly available, non-sensitive data. However, careful consideration must be given to data sensitivity and security when using external APIs in other contexts. Furthermore, the use of external APIs is suitable here given that the system is non-critical and asynchronous. For critical, synchronous applications, alternative approaches may be necessary.
Co-ordination and Request Handling
The pipeline consists of a variety of tasks, with different ways of interacting with the models and long-running inferences. AWS Step Functions provide an efficient way to manage the pipeline, since each stage uses different underlying technologies.
On receiving a debunking query from a consumer, the API endpoint handler (a Lambda function) generates a unique ID for the inquiry and triggers the Step Functions state machine. The ID is carried throughout the pipeline and is finally matched with the result saved in DynamoDB for later retrieval.
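A simplified sketch of that handler is shown below; the state machine ARN is assumed to arrive via an environment variable, and the payload shape is illustrative.
import json
import os
import uuid

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    inquiry_id = str(uuid.uuid4())  # unique ID carried through the pipeline
    body = json.loads(event.get("body", "{}"))

    # Kick off the Step Functions state machine with the inquiry and its ID
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=inquiry_id,
        input=json.dumps({"inquiry_id": inquiry_id, "statement": body.get("statement")}),
    )

    # Return the ID immediately; the consumer retrieves the result from DynamoDB later
    return {"statusCode": 202, "body": json.dumps({"inquiry_id": inquiry_id})}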
Conclusion
Our journey with this GenAI inference pipeline has been a valuable learning experience that demonstrates the potential of thoughtful system design when moving from a prototype to production. By experimenting with different AWS services, exploring model deployment strategies, and implementing various RAG techniques, we’re learning how to develop solutions that balance technical robustness with business value.
This project represents an important step in our ongoing efforts in this space. As generative AI continues to evolve, we’re building capabilities at Shine Solutions to effectively operationalize these powerful models in production environments. We’re actively exploring more sophisticated retrieval mechanisms, refining our prompt engineering approaches, and experimenting with different deployment strategies—all while carefully managing the complexity, cost, and performance considerations that come with bringing generative AI to production.