VCAL caches semantically similar queries on-prem: token-free, fast, and observable. Cut 30–60% of LLM calls and improve p95 latency.
Keep data in your VPC. No per-token billing. Snapshot locally, restore instantly.
Rust HNSW core with Python bindings. Optional AVX2. WASM/Edge planned.
Prometheus metrics out-of-the-box, Grafana dashboards: hits/misses, p50/p95, tokens saved.
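The hit/miss and tokens-saved counters behind those dashboards can be modeled in a few lines. A minimal stdlib-only sketch; VCAL exports Prometheus metrics natively, and the names and fields below are illustrative, not VCAL's actual metric names:

```python
# Minimal hit/miss bookkeeping sketch (stdlib only).
# Field names are illustrative assumptions, not VCAL's metric names.
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0

    def record_hit(self, tokens: int) -> None:
        # A hit means the cached answer was reused: count the tokens avoided.
        self.hits += 1
        self.tokens_saved += tokens

    def record_miss(self) -> None:
        self.misses += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In a real deployment these would be exported as Prometheus counters and scraped into Grafana.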
# FAQ cache: avoid repeat LLM calls
from vcal_core_py import Index
from embeddings import embed # your function, e.g. Ollama/OpenAI/HF
idx = Index(768, m=32, ef_search=256)
idx.insert(embed("What is Rust?"), 1) # ext-id 1
hits = idx.search(embed("What is Rust?"), 1) # [(id, distance)]
if hits and hits[0][1] < 0.15:  # distance threshold; tune per embedding model
    print("HIT → reuse answer")
else:
    print("MISS → call LLM, then cache")
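The hit/miss branch above can be wrapped into a reusable helper. A minimal sketch assuming the `Index` API shown (`insert`, `search` returning `[(id, distance)]`); the answer store and the `llm_fn` callback are hypothetical placeholders:

```python
# Sketch of a semantic-cache wrapper around the Index API shown above.
# The answers dict and llm_fn callback are hypothetical, not part of VCAL.
class SemanticCache:
    def __init__(self, index, embed_fn, threshold=0.15):
        self.index = index
        self.embed = embed_fn
        self.threshold = threshold  # distance cutoff; tune per embedding model
        self.answers = {}           # ext-id -> cached answer
        self.next_id = 1

    def get_or_call(self, query, llm_fn):
        vec = self.embed(query)
        hits = self.index.search(vec, 1)
        if hits and hits[0][1] < self.threshold:
            return self.answers[hits[0][0]]   # HIT: reuse cached answer
        answer = llm_fn(query)                # MISS: call the LLM once
        self.index.insert(vec, self.next_id)  # then cache for next time
        self.answers[self.next_id] = answer
        self.next_id += 1
        return answer
```

Repeated or near-duplicate queries then resolve from the cache without a second LLM call.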
Benchmark setup: 128-D vectors, k=1, 10k vectors, single thread.
Measured with the open benchmark harness. Your hardware may vary.
Estimate monthly token savings.
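The estimate is simple arithmetic. A sketch with illustrative inputs; every number below is an assumption, not a measured default:

```python
# Back-of-the-envelope token-savings estimate.
# All inputs are illustrative assumptions, not VCAL defaults.
def monthly_savings(calls_per_month, hit_rate, avg_tokens_per_call, usd_per_1k_tokens):
    saved_calls = calls_per_month * hit_rate          # calls served from cache
    saved_tokens = saved_calls * avg_tokens_per_call  # tokens never billed
    saved_usd = saved_tokens / 1000 * usd_per_1k_tokens
    return saved_tokens, saved_usd

tokens, usd = monthly_savings(1_000_000, 0.40, 800, 0.01)
# 1M calls/month at a 40% hit rate and 800 tokens/call:
# 320M tokens saved, about $3,200/month at $0.01 per 1k tokens
```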
Rust library + Python wheels. Snapshots. Prom/Grafana.
Free
Commercial embedding + priority support.
$2,000 / app / year
SSO/RBAC, multi-tenant snapshots, SLAs, OEM.
Contact sales
Pilot access isn’t open yet. Join the waitlist and we’ll email you when slots open.