VCAL

Stop paying for the same LLM answer twice.

VCAL is an on-prem semantic cache for LLM queries: token-free, fast, and observable. It serves semantically similar queries from cache, cutting 30–60% of LLM calls and improving p95 latency.

Search p50 (library)
~88 µs
Token savings
30–60%
On-prem
Token-free
Observability
Prom/Grafana

Why VCAL

On-prem by design

Keep data in your VPC. No per-token billing. Snapshot locally, restore instantly.

Fast, tiny, embeddable

Rust HNSW core with Python bindings. Optional AVX2. WASM/Edge planned.

Observable

Prometheus metrics out-of-the-box, Grafana dashboards: hits/misses, p50/p95, tokens saved.
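With hit/miss counters exported, the cache hit rate is a one-line PromQL expression. The metric names below are illustrative placeholders, not VCAL's actual exporter output; substitute the names you find on the `/metrics` endpoint:

```promql
# Hypothetical metric names -- check the exporter's /metrics output for the real ones.
sum(rate(vcal_cache_hits_total[5m]))
  /
(sum(rate(vcal_cache_hits_total[5m])) + sum(rate(vcal_cache_misses_total[5m])))
```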

Use it in Python (30 seconds)

pip install vcal-core-py

# FAQ cache: avoid repeat LLM calls
from vcal_core_py import Index
from embeddings import embed   # your function, e.g. Ollama/OpenAI/HF

idx = Index(768, m=32, ef_search=256)     # 768-D vectors; HNSW params m / ef_search
idx.insert(embed("What is Rust?"), 1)     # ext-id 1

hits = idx.search(embed("What is Rust?"), 1)  # k=1 → [(id, distance)]
if hits and hits[0][1] < 0.15:            # distance threshold; tune per embedding model
    print("HIT → reuse answer")
else:
    print("MISS → call LLM, then cache")
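The MISS branch completes the pattern: call the LLM, store the answer, and reuse it next time. A minimal pure-Python stand-in for that flow (a flat cosine-distance scan instead of VCAL's HNSW index; `embed` and `llm` are placeholders you supply):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; 0.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

class SemanticCache:
    """Flat-scan stand-in illustrating the insert/search pattern above."""
    def __init__(self, threshold=0.15):
        self.threshold = threshold
        self.entries = []  # list of (vector, answer)

    def lookup(self, vec):
        best = min(self.entries,
                   key=lambda e: cosine_distance(vec, e[0]),
                   default=None)
        if best and cosine_distance(vec, best[0]) < self.threshold:
            return best[1]   # HIT: reuse the cached answer
        return None          # MISS: caller invokes the LLM

    def store(self, vec, answer):
        self.entries.append((vec, answer))

def answer(query, cache, embed, llm):
    vec = embed(query)
    cached = cache.lookup(vec)
    if cached is not None:
        return cached
    fresh = llm(query)       # expensive call happens only on a miss
    cache.store(vec, fresh)
    return fresh
```

Paraphrased queries that embed close together share one LLM call; the threshold plays the same role as the 0.15 distance check above.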

Search-only latency

128-D, k=1, 10k vecs, single thread

  • p50: ~88 µs
  • p95: ~244 µs

Measured with the open benchmark harness; numbers vary by hardware.
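For context, p50/p95 are percentiles over per-call wall-clock timings. A generic sketch of how such numbers are computed (timing a stand-in callable, not VCAL itself):

```python
import statistics
import time

def percentiles(fn, n=1000):
    """Time n calls to fn and return (p50, p95) in microseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e6)
    qs = statistics.quantiles(samples, n=100)  # 99 cut points
    return qs[49], qs[94]                      # 50th and 95th percentiles
```

Swap the stand-in for `idx.search(...)` to reproduce the benchmark shape on your own hardware.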

ROI estimator

Estimate your monthly token savings.

Example estimate: $1,600 / month
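The arithmetic behind such an estimate is simple: calls avoided × tokens per call × token price. All inputs below are assumptions to replace with your own numbers:

```python
def monthly_savings(calls_per_month, hit_rate, avg_tokens_per_call, usd_per_1k_tokens):
    """Dollar value of tokens not sent to the LLM because the cache answered."""
    saved_tokens = calls_per_month * hit_rate * avg_tokens_per_call
    return saved_tokens / 1000 * usd_per_1k_tokens

# Example inputs (assumptions): 2M calls/month, 40% hit rate,
# 2,000 tokens per call, $0.001 per 1k tokens
print(monthly_savings(2_000_000, 0.40, 2_000, 0.001))  # ≈ $1,600 / month
```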

Pricing

Core (OSS)

Rust library + Python wheels. Snapshots. Prom/Grafana.

Free

Developer License

Commercial embedding + priority support.

$2,000 / app / year

Enterprise (EULA Pro)

SSO/RBAC, multi-tenant snapshots, SLAs, OEM.

Contact sales

Join the pilot / get updates

Pilot access isn’t open yet. Join the waitlist and we’ll email you when slots open.

We’ll only use your email to contact you about pilot availability.