Skip to main content

3 posts tagged with "rust"

View All Tags

AI Cost Firewall: An OpenAI-Compatible Gateway That Cuts LLM Costs by 75%

· 9 min read
Founder of VCAL Project

Originally published on Dev.to on March 16, 2026.
Read the Dev.to version

Exact + semantic caching for AI applications


In today’s era of AI adoption, there is a distinct shift from integrating AI solutions into business processes to controlling the costs, be it the costs of a cloud solution, a local LLM deployment, or the cost of tokens spent in chatbots. If your solution includes repeated questions and uses an OpenAI-compatible model, and if you are looking for a simple, free and effective way to immediately cut your company’s daily token costs, there is one infrastructural solution that does it right out of the box.

AI Cost Firewall is a free open-source API gateway that decides which requests actually need to reach the LLM and which can be answered from previous results without additional token costs.

The gateway consists of a Rust-based firewall “decider”, a Redis database, a Qdrant vector store, Prometheus for metrics scraping, and Grafana for monitoring. All the tools are deployed with a single docker compose command and are available for use in less than a minute.

Once deployed, AI Cost Firewall sits transparently between your application and the LLM provider. Your chatbot, AI assistant, or internal automation continues to send requests exactly the same way as before with the only difference that the API endpoint now points to the firewall instead of directly to the model provider. The firewall then performs an instant check before deciding whether the request should actually reach the LLM and raise your monthly bill.


How the firewall reduces token costs

AI Cost Firewall eliminates unnecessary token spends using two layers of caching.

Exact match cache (Redis / Valkey)

The first step is an extremely fast exact request match check. Each incoming request is normalized and hashed. If an identical request was previously processed, the firewall immediately returns the stored response from Redis. This lookup takes microseconds and costs zero tokens. For workloads with frequent identical prompts such as customer support or internal documentation assistants this alone can already reduce a significant portion of LLM traffic.

Semantic cache (Qdrant)

The second layer addresses the case of semantic similarity: questions are similar but not identical.

For example:

User A: Provide a one-sentence explanation of what Kubernetes is.
User B: What is Kubernetes? Give me a one-sentence explanation.

Even though the wording differs, the semantic value and thus meaning of these questions is essentially the same (if you are interested in what semantics in AI is, have a look at my article From words to vectors: how semantics traveled from linguistics to Large Language Models).

To detect these situations, AI Cost Firewall uses a semantic vector search. Each request is embedded using a lightweight embedding model, and the resulting vector is compared against previously stored queries using Qdrant, a high-performance vector database designed specifically for this. If the similarity score exceeds a certain threshold, the firewall returns the previously generated answer instead of sending the request to the LLM again. In this way, a single LLM response can be reused dozens or even hundreds of times without extra tokens expense.

Forwarding only when necessary

If neither the exact cache nor the semantic cache contains a suitable answer, the firewall forwards the request to your upstream model provider. Besides being provided to the user, the returned response is then stored in both Redis and Qdrant for future reuse. The workflow therefore becomes (simplified):

Client → AI Cost Firewall

Redis check

Qdrant semantic check

(only if needed)

LLM API

The LLM is only called when a genuinely new question appears.

With this approach, the AI Cost Firewall does not only save the costs but also rockets the response time improving the users’ satisfaction (Customer Satisfaction Score, CSAT).


OpenAI compatible by design

One of the most practical aspects of AI Cost Firewall is that you do not have to touch your application to integrate it. What you do is you simply switch the base URL to the firewall’s endpoint:

client = OpenAI(
base_url="http://localhost:8080/v1"
)

From the application’s perspective, nothing changes. The same requests and responses flow through the system. However, now the firewall intelligently “decides” whether the model actually needs to be called and the money has to be spent.

This tool is compatible with:

  • OpenAI models
  • Azure OpenAI
  • local OpenAI-compatible servers
  • many hosted inference platforms

In other words, any system that already works with the OpenAI API can immediately benefit from cost reduction. And more than that, other models are going to be added soon by the project developers.


Observability built in

One of the integrated features of the AI Cost Firewall is its built-in monitoring. It consists of Prometheus for scraping the metrics and integrated Grafana Dashboard. Both services are launched automatically by docker compose using preconfigured Prometheus YAML and a prebuilt Grafana dashboard JSON, so the monitoring stack is ready immediately without any manual configuration.

Prometheus metrics allow you to track:

  • number of cache hits
  • semantic matches
  • forwarded requests
  • estimated cost savings
  • active requests

You can immediately visualize these metrics with the Grafana dashboard to see exactly how much the firewall is saving in real time (with a 5-second delay to be honest).


Why it works well

AI Cost Firewall works because it targets a structural characteristic present in almost every LLM application:

  • repeated user questions
  • overlapping knowledge queries
  • duplicated agent prompts

By caching responses and using semantic similarity search, the system converts repeated LLM calls into near-zero-cost lookups.

Why near-zero and not fully zero? Because semantic matching still requires generating embeddings for incoming queries. However, embedding costs are typically orders of magnitude lower than generating full LLM responses.

Another advantage of the firewall is its intentionally minimal architecture:

  • Rust firewall gateway
  • Redis for exact caching
  • Qdrant for semantic caching
  • Prometheus + Grafana for monitoring

This simplicity makes it easy to deploy, maintain, and scale.


When AI Cost Firewall is most effective

The biggest savings with AI Firewall occur in systems where similar questions appear frequently. You will immediately benefit from the AI Cost Firewall integration if your system includes any or several of the following components:

  • customer support chatbots
  • internal company knowledge assistants
  • documentation Q&A systems
  • developer copilots
  • AI help desks
  • AI Agents performing any of the above tasks

In these environments, the same core questions appear repeatedly across many users. Even when questions are phrased differently, the semantic cache can reuse the same answer multiple times.

Advanced users may also appreciate the integrated TTL (Time-to-Live) feature which allows you to set up the duration of the response kept in Redis’s memory before replaced with a newly generated one. The same feature for Qdrant is currently under development and will be introduced soon.


Try it in 60 seconds

If you want to see how AI Cost Firewall works, you can deploy the whole stack locally or on a small server in less than a minute.

Example:

git clone https://github.com/vcal-project/ai-firewall
cd ai-firewall
cp configs/ai-firewall.conf.example configs/ai-firewall.conf
nano configs/ai-firewall.conf # Replace the placeholders with your API keys
docker compose up -d

This launches:

  • AI Cost Firewall
  • Redis (exact cache)
  • Qdrant (semantic cache)
  • Prometheus (metrics scraping)
  • Grafana (monitoring dashboard)

Once the containers start, simply point your OpenAI client to:

http://localhost:8080/v1

Within seconds, the gateway is ready to accept OpenAI-compatible requests. Similar to Nginx and other infrastructure gateways, you only need to add your API keys to the configuration file. From that point on, every request automatically passes through the cost-saving pipeline while previous responses are stored in Redis and Qdrant for future reuse.

Because the gateway itself is stateless, multiple firewall instances can be deployed behind a load balancer, allowing the system to scale horizontally with growing traffic.

                         ┌───────────────────────┐
│ Clients │
│ Chatbots / Agents / │
│ Internal AI Apps │
└───────────┬───────────┘


┌───────────────────────┐
│ Load Balancer │
│ Nginx / HAProxy │
└───────────┬───────────┘

┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ AI Cost Firewall │ │ AI Cost Firewall │ │ AI Cost Firewall │
│ instance 1 │ │ instance 2 │ │ instance 3 │
└─────────┬────────┘ └─────────┬────────┘ └─────────┬────────┘
│ │ │
└─────────────┬───────┴───────────┬─────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Redis / Valkey │ │ Qdrant │
│ Exact cache │ │ Semantic cache │
└─────────┬────────┘ └─────────┬────────┘
│ │
└──────────┬───────────┘


┌───────────────────────┐
│ Upstream LLM API │
│ OpenAI / Azure / vLLM │
└───────────────────────┘


┌─────────────────────── Observability Stack ─────────────────┐
│ │
│ AI Cost Firewall metrics ─────► Prometheus ─────► Grafana │
│ │
└─────────────────────────────────────────────────────────────┘

Architecture of a horizontally scalable AI Cost Firewall deployment.


Conclusion

With the growing adoption of AI, it is sometimes painful to watch the steady increase of company expenses related to LLM tokens. Large Language Models are extremely powerful, but they can also become quite expensive when used at scale. It feels even more unfair when you realize that a significant portion of LLM traffic consists of repeated or semantically similar questions.

AI Cost Firewall addresses this inefficiency with a simple idea: do not send the same or similar question to the model again and again. Instead, reuse answers that were already generated for identical or semantically similar queries.

By combining exact caching with semantic similarity search, the firewall allows previously generated answers to be reused safely and efficiently. The result is lower token consumption, faster responses, and reduced infrastructure costs.

Because the gateway is OpenAI-compatible, integration requires only a small configuration change. No application refactoring is needed. If your system includes chatbots, knowledge assistants, developer copilots, or AI agents that answer recurring questions, AI Cost Firewall can reduce token usage by 30–75% immediately after deployment.

And since the entire stack runs with a single docker compose command, you can try it in minutes.

Sometimes the most effective optimization is not a new model, a larger GPU cluster, or a complex architecture.

Sometimes it is simply not paying twice for the same answer.


If you find this open-source project useful, a GitHub star will help the project grow.

https://github.com/vcal-project/ai-firewall

Beyond Vector Databases: The Case for Local Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 6, 2025.
Read the Medium.com version

Cover

When “intelligence” wastes cycles

Most teams building LLM-powered products eventually realize that a large portion of their API costs come not from new insights, but from repeated questions.

A support bot, an internal assistant, or an analytics copilot, all encounter thousands of near-identical queries:

“How do I pass the API key to the local model gateway?”
“Why is the dev database connection timing out?”
“How can I refresh the cache without restarting the service?”

Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.

What do we have as a result? Burned tokens, wasted latency, and duplicated reasoning.

Vector databases solved storage, not reuse

The industry's first instinct was to throw vector databases at the problem. They excel at persistent embeddings and semantic retrieval, but they were never built for reuse. What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state. In other words, they store knowledge, not memory.

Traditional vector databases follow a key:value paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore. A semantic cache, by contrast, treats embeddings as dynamic memory — governed by similarity, expiration, and adaptive retention. Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.

With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings. This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.

In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.

From data stores to memory layers

In my previous Dev.to article, I explained how we built VCAL, a Rust-based semantic cache that sits between your app and the LLM. Instead of persisting every vector, it memorizes embeddings for a short time, indexed by semantic similarity and metadata.

When a new query arrives, VCAL compares it to cached vectors. If it is close enough — a cache hit — the LLM call is skipped, and the stored answer is returned in milliseconds. Otherwise, the request proceeds normally, and the response is stored for future matches.

The design combines concepts from vector search and traditional caching systems, enhanced with features for resilience and monitoring:

  • HNSW index for ultra-fast approximate similarity search.
  • TTL and LRU eviction for automatic cache turnover.
  • Snapshotting for persistence between restarts.
  • Prometheus metrics for observability.

All of it runs on-prem, next to your model or gateway with no remote dependencies.

Why local caching changes the economics

Unlike vector databases, a local semantic cache has one simple purpose: avoid redundant reasoning. Each avoided LLM call translates directly into saved tokens, lower API bills, and shorter response times.

In real deployments we’ve seen:

  • 30–60 % reduction in LLM calls
  • Millisecond-level latency on repeated queries enabling near-real-time responsiveness
  • Predictable resource usage: no external round-trips, no cloud egress costs, and no multi-tenant contention

At scale, the more your users interact, the greater the savings become. Instead of paying per token for every repetition, you amortize prior reasoning across sessions and teams.

And because VCAL runs inside your private environment, all caching and embeddings stay under your control ensuring data privacy, compliance, and deterministic performance even in regulated industries.

A new layer in the AI stack

If you visualize the modern LLM stack, the simplified design looks like this:

User → Application → LLM Gateway → Model

or, if a RAG (Retrieval-Augmented Generation) framework is involved:

User → Application → Retriever → Vector DB → (context) → LLM Gateway → Model

Adding a semantic cache such as VCAL introduces this new dimension:

User → Application → Retriever → Vector DB → (context) → Semantic Cache → LLM Gateway → Model

Here, the cache checks whether a semantically equivalent query was already answered. If found, the response is returned instantly — skipping tokenization, embedding, and inference altogether. If not, the request continues as usual, and the new answer is stored for future reuse.

Vector databases still matter but they belong to the knowledge layer, not the inference path. What has been missing so far is a memory layer that prevents repeated reasoning altogether. Semantic caching fills the missing “memory” slot in between. It is not a replacement for RAG, it is a complement. While RAG injects context, caching avoids duplication.

Engineering for low latency

Achieving millisecond response times in semantic caching requires more than just a fast similarity search algorithm. It’s the result of careful coordination between data structures, memory layout, and concurrency control.

The cache can be implemented efficiently in systems programming languages such as Rust, using an HNSW-based index for approximate nearest-neighbor search. HNSW provides logarithmic-scale query complexity while maintaining accuracy for large collections of embeddings, making it suitable for workloads that reach millions of cached entries.

Low latency also depends on predictable memory management and lock-free or fine-grained synchronization between threads. Instead of allocating and freeing vectors dynamically, embeddings are often stored in preallocated arenas or memory-mapped regions to minimize fragmentation and system calls. Parallel workers can update the index or evaluate similarity thresholds concurrently, so that retrieval scales with the number of available cores.

In practice, a semantic cache can be deployed as a lightweight service beside an inference gateway, communicating over local HTTP or gRPC. It can also be embedded directly into an application process when minimal overhead is required, for example, within an agent runtime or API handler.

The bigger picture

Caching has always been an invisible driver of performance — from CPU registers that reuse instructions, to CDNs that reuse content, to databases that reuse queries.

Each generation of systems extends the notion of what can be reused. As language models enter production, we are witnessing a shift toward semantic reuse: reusing meaning rather than data. This enables systems to recall previous reasoning instead of repeating it — a step toward more efficient and sustainable AI infrastructure.

In this new layer of the AI stack, semantic caching becomes a form of reasoning memory: it stores the results of understanding, not just storage operations. Instead of recomputing the same insight across thousands of near-identical prompts, we can recall it instantly — with full control over latency, privacy, and cost.

Further reading

For readers interested in implementation details and open-source examples:


Thank you for reading! Semantic caching is still an emerging concept — every real-world use case helps shape how we think about efficient reasoning. Share yours in the comments if you’d like to join the conversation.

How I Created a Semantic Cache Library for AI

· 4 min read
Founder of VCAL Project

Originally published on Dev.to on October 27, 2025.
Read the Dev.to version

Cover

Have you ever wondered why LLM apps get slower and more expensive as they scale, even though 80% of user questions sound pretty similar? That’s exactly what got me thinking recently: why are we constantly asking the model the same thing?

That question led me down the rabbit hole of semantic caching, and eventually to building VCAL (Vector Cache-as-a-Library), an open-source project that helps AI apps remember what they’ve already answered.


The “Eureka!” Moment

It started while optimizing an internal support chatbot that ran on top of a local LLM. Logs showed hundreds of near-identical queries:

“How do I request access to the analytics dashboard?”
“Who approves dashboard access for my team?”
“My access to analytics was revoked — how do I get it back?”

Each one triggered a full LLM inference: embedding the query, generating a new answer, and consuming hundreds of tokens even though all three questions meant the same thing.

So I decided to create a simple library that would embed each question, compare it to what was submitted earlier, and if it’s similar enough, return the stored answer instead of generating an LLM response, all this before asking the model.

I wrote a prototype in Rust — for performance and reliability — and designed it as a small vcal-core open-source library that any app could embed.

The first version of VCAL could:

  • Store and search vector embeddings in RAM using HNSW graph indexing
  • Handle TTL and LRU evictions automatically
  • Save snapshots to disk so it could restart fast

Later came VCAL Server, a drop-in HTTP API version for teams that wanted to cache answers across multiple services while deploying it on-prem or in a cloud.

Screenshot: Grafana dashboard showing cache hits and cost saving


What It Feels Like to Use

Unlike a full vector database, VCAL isn’t designed for long-term storage or analytics. I didn’t want to build another vector database.
VCAL is intentionally lightweight. It is a fast, in-memory semantic cache optimized for repeated LLM queries.

Integrating VCAL takes minutes.
Instead of calling the model directly, you send your query to VCAL first.
If a similar question has been asked before — and the similarity threshold can be tuned — VCAL returns the answer from its cache in milliseconds. If it’s a new question, VCAL asks the LLM, stores the result, and returns it.
Next time, if a semantically similar question comes in, VCAL answers instantly.

It’s like adding a memory layer between your app and the model — lightweight, explainable, and under your full control.

Flow diagram: user → VCAL → LLM


Lessons Learned

  • LLMs love redundancy. Once you start caching semantically, you realize how often people repeat the same question with different words.
  • Caching semantics ≠ caching text. Cosine similarity and vector distances matter more than exact matches.
  • Performance scales beautifully. A well-tuned cache can handle thousands of lookups per second, even on modest hardware.
  • It scales big. A single VCAL Server instance can comfortably store and serve up to 10 million cached answers in memory, depending on embedding dimensions and hardware.

What’s Next

We’re now working on a licensing server, enterprise snapshot formats, and RAG-style extensions, so teams can use VCAL not just for Q&A caching, but as the foundation for private semantic memory.

If you’re building AI agents, support desks, or knowledge assistants, you’ll likely benefit from giving your system a brain that remembers.

You can explore more at vcal-project.com - try the free 30-day Trial Evaluation of VCAL Server or jump into the open-source vcal-core version on GitHub.


Thanks for reading!
If this resonates with you, please drop a comment. I’d love to hear how you’re approaching caching and optimization for AI apps.