Beyond Vector Databases: The Case for Local Semantic Caching

· 6 min read
Founder of VCAL Project

Originally published on Medium.com on November 6, 2025.

When “intelligence” wastes cycles

Most teams building LLM-powered products eventually realize that a large portion of their API costs come not from new insights, but from repeated questions.

A support bot, an internal assistant, or an analytics copilot all encounter thousands of near-identical queries:

“How do I pass the API key to the local model gateway?”
“Why is the dev database connection timing out?”
“How can I refresh the cache without restarting the service?”

Each of those prompts gets re-tokenized, re-embedded, and re-sent to an LLM even when the model has already answered an equivalent question a minute earlier.

The result: burned tokens, wasted latency, and duplicated reasoning.

Vector databases solved storage, not reuse

The industry's first instinct was to throw vector databases at the problem. They excel at persistent embeddings and semantic retrieval, but they were never built for reuse. What they lack are TTL policies, eviction strategies, and atomic snapshotting of in-flight state. In other words, they store knowledge, not memory.

Traditional vector databases follow a key:value paradigm: they persist embeddings indefinitely so they can be queried later, much like records in a datastore. A semantic cache, by contrast, treats embeddings as dynamic memory — governed by similarity, expiration, and adaptive retention. Its goal is not to archive information, but to avoid redundant reasoning across millions of semantically similar requests.

With a semantic cache such as VCAL, cached answers can stay valid for days or weeks, depending on data volatility and TTL settings. This moves caching from short-term repetition avoidance to long-horizon semantic reuse where reasoning itself becomes a reusable resource rather than a recurring cost.

In essence, VCAL bridges the gap between data retrieval and cognitive efficiency, turning past computation into future acceleration.

From data stores to memory layers

In my previous Dev.to article, I explained how we built VCAL, a Rust-based semantic cache that sits between your app and the LLM. Instead of persisting every vector, it memorizes embeddings for a short time, indexed by semantic similarity and metadata.

When a new query arrives, VCAL compares it to cached vectors. If it is close enough — a cache hit — the LLM call is skipped, and the stored answer is returned in milliseconds. Otherwise, the request proceeds normally, and the response is stored for future matches.
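The hit-or-miss decision above can be sketched in a few lines of Rust. This is a minimal linear scan over cached embeddings, not VCAL's actual HNSW-backed index; the `CacheEntry` type and the similarity threshold are hypothetical names for illustration:

```rust
/// One cached entry: the query's embedding plus the stored answer.
pub struct CacheEntry {
    pub embedding: Vec<f32>,
    pub answer: String,
}

/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return the cached answer closest to `query`, but only if it clears
/// the similarity threshold — i.e., a cache hit. Otherwise the caller
/// falls through to the LLM and stores the fresh answer.
pub fn lookup<'a>(entries: &'a [CacheEntry], query: &[f32], threshold: f32) -> Option<&'a str> {
    entries
        .iter()
        .map(|e| (cosine(&e.embedding, query), e))
        .filter(|(sim, _)| *sim >= threshold)
        .max_by(|(s1, _), (s2, _)| s1.partial_cmp(s2).unwrap())
        .map(|(_, e)| e.answer.as_str())
}
```

The threshold is the knob that trades recall for correctness: set it too low and the cache returns answers to questions that are merely related, not equivalent.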

The design combines concepts from vector search and traditional caching systems, enhanced with features for resilience and monitoring:

  • HNSW index for ultra-fast approximate similarity search.
  • TTL and LRU eviction for automatic cache turnover.
  • Snapshotting for persistence between restarts.
  • Prometheus metrics for observability.
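The TTL and LRU policies in the list above can be combined in one small structure. A minimal sketch, assuming lazy expiry on read and standard-library containers only — VCAL's real eviction logic is more involved:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A minimal TTL + LRU map: entries expire after `ttl`, and when the map
/// is full the least recently used entry is evicted to make room.
pub struct TtlLruCache {
    capacity: usize,
    ttl: Duration,
    entries: HashMap<String, (String, Instant)>, // key -> (value, last touch)
}

impl TtlLruCache {
    pub fn new(capacity: usize, ttl: Duration) -> Self {
        Self { capacity, ttl, entries: HashMap::new() }
    }

    pub fn get(&mut self, key: &str) -> Option<String> {
        let now = Instant::now();
        let expired = match self.entries.get_mut(key) {
            Some((value, touched)) => {
                if now.duration_since(*touched) < self.ttl {
                    *touched = now; // refresh recency on hit
                    return Some(value.clone());
                }
                true
            }
            None => false,
        };
        if expired {
            self.entries.remove(key); // drop the stale entry lazily
        }
        None
    }

    pub fn put(&mut self, key: String, value: String) {
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&key) {
            // Evict the least recently touched entry.
            if let Some(lru) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, touched))| *touched)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&lru);
            }
        }
        self.entries.insert(key, (value, Instant::now()));
    }
}
```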

All of it runs on-prem, next to your model or gateway, with no remote dependencies.

Why local caching changes the economics

Unlike vector databases, a local semantic cache has one simple purpose: avoid redundant reasoning. Each avoided LLM call translates directly into saved tokens, lower API bills, and shorter response times.

In real deployments we’ve seen:

  • 30–60% reduction in LLM calls
  • Millisecond-level latency on repeated queries, enabling near-real-time responsiveness
  • Predictable resource usage: no external round-trips, no cloud egress costs, and no multi-tenant contention

At scale, the more your users interact, the greater the savings become. Instead of paying per token for every repetition, you amortize prior reasoning across sessions and teams.
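A back-of-the-envelope calculation makes the economics concrete. All numbers here are hypothetical — one million monthly requests, $0.002 per call, a 40% hit rate (inside the 30–60% range above):

```rust
/// Estimated monthly LLM spend once a cache is in front: only misses
/// reach the model. `hit_rate` is the fraction of requests served from
/// cache, in the range 0.0..=1.0.
pub fn monthly_llm_cost(requests: u64, cost_per_call: f64, hit_rate: f64) -> f64 {
    requests as f64 * (1.0 - hit_rate) * cost_per_call
}
```

With those assumed numbers, spend drops from $2,000 to $1,200 a month; the saved $800 is prior reasoning being reused instead of repurchased.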

And because VCAL runs inside your private environment, all caching and embeddings stay under your control, ensuring data privacy, compliance, and deterministic performance even in regulated industries.

A new layer in the AI stack

If you visualize the modern LLM stack, the simplified design looks like this:

User → Application → LLM Gateway → Model

or, if a RAG (Retrieval-Augmented Generation) framework is involved:

User → Application → Retriever → Vector DB → (context) → LLM Gateway → Model

Adding a semantic cache such as VCAL introduces this new dimension:

User → Application → Retriever → Vector DB → (context) → Semantic Cache → LLM Gateway → Model

Here, the cache checks whether a semantically equivalent query was already answered. If found, the response is returned instantly — skipping tokenization, embedding, and inference altogether. If not, the request continues as usual, and the new answer is stored for future reuse.

Vector databases still matter, but they belong to the knowledge layer, not the inference path. What has been missing is a memory layer between the two that prevents repeated reasoning altogether; semantic caching fills that slot. It is not a replacement for RAG but a complement: RAG injects context, caching avoids duplication.

Engineering for low latency

Achieving millisecond response times in semantic caching requires more than just a fast similarity search algorithm. It’s the result of careful coordination between data structures, memory layout, and concurrency control.

The cache can be implemented efficiently in systems programming languages such as Rust, using an HNSW-based index for approximate nearest-neighbor search. HNSW provides logarithmic-scale query complexity while maintaining accuracy for large collections of embeddings, making it suitable for workloads that reach millions of cached entries.

Low latency also depends on predictable memory management and lock-free or fine-grained synchronization between threads. Instead of allocating and freeing vectors dynamically, embeddings are often stored in preallocated arenas or memory-mapped regions to minimize fragmentation and system calls. Parallel workers can update the index or evaluate similarity thresholds concurrently, so that retrieval scales with the number of available cores.
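The arena idea above can be sketched as one flat, preallocated buffer holding fixed-dimension embeddings back to back. This is an illustrative structure under those assumptions, not VCAL's allocator; the `EmbeddingArena` name is hypothetical:

```rust
/// A preallocated arena that stores fixed-dimension embeddings contiguously
/// in one buffer, avoiding a separate heap allocation per vector.
pub struct EmbeddingArena {
    dim: usize,
    data: Vec<f32>, // capacity reserved up front; slots are dim-sized windows
}

impl EmbeddingArena {
    pub fn with_capacity(dim: usize, max_vectors: usize) -> Self {
        Self { dim, data: Vec::with_capacity(dim * max_vectors) }
    }

    /// Append an embedding and return its slot index.
    pub fn push(&mut self, embedding: &[f32]) -> usize {
        assert_eq!(embedding.len(), self.dim);
        let slot = self.data.len() / self.dim;
        self.data.extend_from_slice(embedding);
        slot
    }

    /// Borrow the embedding stored in `slot` as a contiguous slice.
    pub fn get(&self, slot: usize) -> &[f32] {
        let start = slot * self.dim;
        &self.data[start..start + self.dim]
    }

    pub fn len(&self) -> usize {
        self.data.len() / self.dim
    }
}
```

Because slots are plain index offsets into one buffer, the similarity scan walks memory sequentially — friendly to the CPU cache — and no allocation or free happens on the hot path.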

In practice, a semantic cache can be deployed as a lightweight service beside an inference gateway, communicating over local HTTP or gRPC. It can also be embedded directly into an application process when minimal overhead is required, for example, within an agent runtime or API handler.

The bigger picture

Caching has always been an invisible driver of performance — from CPU caches that reuse recently fetched instructions and data, to CDNs that reuse content, to databases that reuse query results.

Each generation of systems extends the notion of what can be reused. As language models enter production, we are witnessing a shift toward semantic reuse: reusing meaning rather than data. This enables systems to recall previous reasoning instead of repeating it — a step toward more efficient and sustainable AI infrastructure.

In this new layer of the AI stack, semantic caching becomes a form of reasoning memory: it stores the results of understanding, not just raw data. Instead of recomputing the same insight across thousands of near-identical prompts, we can recall it instantly — with full control over latency, privacy, and cost.

Further reading

For readers interested in implementation details and open-source examples:


Thank you for reading! Semantic caching is still an emerging concept — every real-world use case helps shape how we think about efficient reasoning. Share yours in the comments if you’d like to join the conversation.