VCAL

Frequently Asked Questions

Last updated: 22 Sep 2025


VCAL is a semantic cache built on HNSW, delivering fast, local k-NN lookups that cut latency and avoid unnecessary LLM calls. By answering repeated queries from cache, it reduces token usage and makes API spend predictable.

This FAQ explains the “why” and “how” for CIOs, DevOps/MLOps engineers, and business leaders evaluating adoption.


General

What is VCAL?

VCAL (Vector Cache-as-a-Library) is a semantic cache built on HNSW. It accelerates LLM-backed apps by caching embeddings and answers locally. It reduces repeated API calls to providers like OpenAI/Anthropic or local LLMs, improving latency and lowering costs. The vcal-core library is a lightweight, in-process HNSW index; the VCAL server (Growth tier) exposes it over HTTP with metrics and operational features.
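As a sketch of the in-process flow (the type and method names below are illustrative assumptions, not the published vcal-core API; see the crate documentation for the real signatures):

    // Illustrative only: Index, Config, insert, and search are assumed names,
    // not the actual vcal-core API.
    let mut cache = Index::new(Config { dims: 768, m: 16, ef_search: 128 });
    cache.insert("prompt-hash-abc123", &embedding);   // keyed by ext_id
    let hits = cache.search(&query_embedding, 5);     // k-NN lookup, k = 5
    if let Some(hit) = hits.first().filter(|h| h.score >= 0.92) {
        // Cache hit: serve the stored answer instead of calling the LLM.
    }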

Who is VCAL for?

VCAL is built for everyone involved in running and scaling AI-powered products. CIOs and CTOs get predictable latency and lower cloud bills. DevOps and MLOps teams can deploy it on-prem or inside VPCs with full Prometheus/Grafana visibility (included in Growth and Enterprise tiers). Developers can use it as a lightweight HNSW library or as a simple HTTP API. Product and Finance teams benefit from clear cost-savings dashboards, while CEOs and CFOs see reduced API bills without the risk of vendor lock-in.

Is it a vector database?

VCAL is an embedded semantic cache, not a general-purpose vector DB. It’s designed for fast, local lookups and cost reduction, typically handling up to a few million vectors per instance. Think of it as a lightweight in-memory layer for embeddings, not a full analytical database.

Architecture & Features

What core features are included?

- k-NN search with HNSW (Cosine); /v1/search and /v1/batch_search (≤ 256 queries); see the example request below.
- Insert / Upsert / Delete by external ID (ext_id).
- Snapshots (save & load) for fast restarts.
- Eviction: TTL and LRU (by vector or byte caps).
- Prometheus metrics for requests, errors, active IDs, and latency.
- Config via TOML with environment overrides.
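For example, a single lookup against /v1/search might look like this (the JSON field names are illustrative assumptions; see the API reference for the exact schema):

    POST /v1/search
    Content-Type: application/json

    {"vector": [0.12, -0.03, 0.98, 0.05], "k": 5}

A matching response would return the nearest ext_id values with their similarity scores.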

How does snapshotting work?

The server can load an index snapshot on startup (via VCAL_SNAPSHOT or vcal.toml) and can save on demand using POST /v1/snapshot/save. Atomic writes are supported when enabled in the request.
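For instance, an on-demand save might look like this (the request fields here are assumptions based on the atomic-write option described above; consult the API reference for the exact schema):

    POST /v1/snapshot/save
    Content-Type: application/json

    {"path": "/var/lib/vcal/index.snap", "atomic": true}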

Does VCAL support deletes and updates?

Yes. Delete is soft (tombstone) and also removes the ID from the index if present; subsequent searches skip it. Upsert is implemented as delete-then-insert. All operations are idempotent.
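An illustrative call sequence follows. The endpoint paths are assumptions; only the delete-then-insert semantics come from the behavior described above:

    POST /v1/upsert   {"ext_id": "doc-42", "vector": [0.11, -0.04, 0.93, 0.35]}
    -- internally: tombstone and unlink any existing "doc-42", then insert fresh
    POST /v1/delete   {"ext_id": "doc-42"}
    -- repeating either call is safe: operations are idempotent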

What observability is included?

VCAL comes with production-grade observability out of the box. The server exposes /metrics in Prometheus text format, ready to plug into your dashboards. Key series track usage, performance, and data health, including:
- vcal_search_requests_total
- vcal_search_errors_total
- vcal_batch_search_queries_total
- vcal_active_ids
- vcal_snapshot_unixtime
- vcal_search_latency_seconds
Beyond raw counters, you get real-time insight into cache efficiency, snapshot freshness, and query latency. With Grafana dashboards you can monitor hit ratios, tokens saved, and cost reductions week over week, making it easy to prove ROI and catch anomalies early.
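Two example PromQL expressions over these series (the _bucket suffix assumes vcal_search_latency_seconds is exported as a Prometheus histogram):

    # Fraction of searches failing over the last 5 minutes
    rate(vcal_search_errors_total[5m]) / rate(vcal_search_requests_total[5m])

    # p95 search latency over the last 5 minutes
    histogram_quantile(0.95, sum by (le) (rate(vcal_search_latency_seconds_bucket[5m])))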

Security & Data

Is VCAL safe to run on-prem?

Yes. VCAL is designed to run entirely in your environment and makes no outbound calls. Authentication is not enabled by default, so deploy behind your reverse proxy, mTLS, or service-mesh policies. Data never leaves your environment unless you expose it.

What data does VCAL store?

Embeddings and associated external IDs you insert. If you enable snapshots, the index is serialized to disk at your chosen path. Prometheus metrics are emitted locally and remain within your network unless you export them.

How do we handle PII or sensitive data?

Many customers store only hashed or anonymized identifiers alongside vectors. If processing PII, follow your standard data minimization and retention policies. VCAL itself does not transmit data externally.

Performance & Scale

What scale does VCAL target?

Single instances are typically sized for up to a few million vectors (depending on hardware and dims). For larger needs, run multiple instances behind a load balancer or shard by key. Eviction keeps memory bounded via TTL and LRU.
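A minimal sketch of client-side sharding by key, assuming several identical VCAL instances behind your own routing; the hash scheme is illustrative and not part of VCAL itself:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Route an external ID to one of `num_shards` VCAL instances.
    fn shard_for(ext_id: &str, num_shards: u64) -> u64 {
        let mut hasher = DefaultHasher::new();
        ext_id.hash(&mut hasher);
        hasher.finish() % num_shards
    }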

What parameters matter?

- m (graph connectivity): try 16–32.
- ef_search (recall vs latency): try 64–256.
- Normalize embeddings for Cosine (unit vectors) to improve quality; see the sketch below.

For environment variable examples, see the documentation.
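A minimal sketch of unit-normalizing an embedding before insertion (plain Rust, independent of the VCAL API):

    /// Scale a vector to unit length so cosine similarity is well-behaved.
    fn normalize(v: &mut [f32]) {
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            for x in v.iter_mut() {
                *x /= norm;
            }
        }
    }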

Business & Pricing

How does VCAL save money?

By answering repeated or similar prompts locally, VCAL reduces external LLM calls. Teams commonly see lower latency and a meaningful reduction in token spend, and the metrics make cost avoidance visible (hit rate, saved calls). All in all, users get predictable performance and savings.
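As a purely hypothetical illustration: a service making 1,000,000 LLM calls a month at an average of $0.01 per call spends $10,000; a 30% cache hit rate avoids roughly 300,000 of those calls, about $3,000 a month, before counting the latency gains. Actual hit rates and per-call costs vary by workload.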

Is the core open-source?

Yes. vcal-core is Apache-2.0. The server adds operational features and is offered under a commercial EULA for enterprise use. This open-core model maximizes trust and adoption while funding continued development.

Can VCAL integrate into existing stacks?

Yes. Use vcal-core as a Rust library, or deploy the VCAL server with its HTTP API. Kubernetes deployment is supported.

Can we evaluate the VCAL server?

Yes — contact us for an evaluation license. We also support pilots and can assist with sizing, dashboards, and integration.

Roadmap & Support

What’s on the roadmap?

- Adapters (LangChain/LlamaIndex), WASM/WASI packaging.
- CLI tooling and richer snapshot formats.
- Optional authentication and tenancy features.

How do we get help?

For trials, pilots, or enterprise support, email sales@vcal-project.com. Security or privacy inquiries: privacy@vcal-project.com.