
[Discussion] Semantic Caching for LLM/AI Gateways #10

Status: Open
Labels: discussion (To be discussed in community), enhancement (New feature or request)
sanjeewa-malalgoda opened this issue Nov 27, 2024 · 2 comments

Comments

sanjeewa-malalgoda commented Nov 27, 2024

Created this issue to discuss the semantic caching requirement at a high level. I would like to hear what the community thinks of this. Supporting caching at the gateway is not something everyone agrees on, so let's have a discussion here and see what the best approach to follow is.

Why Implement Semantic Caching for LLM/AI Gateways?
LLM calls are expensive and time-consuming, making a caching layer highly valuable for consumers of API gateways. Users expect caching capabilities at the gateway level to reduce backend load and improve response times. While most API gateways support request-based caching (based on request attributes), semantic caching extends this concept to the AI/LLM domain by focusing on the meaning of requests rather than their syntax.

Some LLMs provide built-in caching, but these implementations often consume additional resources on the LLM backend. By introducing semantic caching at the gateway level, we can provide vendor-agnostic caching that works independently of the LLM provider. This approach not only improves resource utilization but also supports multi-LLM backends, where a single API may route requests to different LLMs. Gateway-level caching enhances performance, offering faster responses and reducing backend processing costs.

Technical Considerations for Semantic Caching
In semantic caching, cache keys are determined by the meaning of a request rather than its exact syntax. This involves using embeddings, metadata, and context to capture differences in AI behavior and ensure accurate matches for similar requests.
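To make the idea concrete, here is a minimal sketch in Go of an embedding-based cache lookup. Everything in it (the Embedder interface, the entry layout, the similarity threshold field) is an illustrative assumption, not an agreed design for this project.

```go
// Minimal sketch of a semantic cache lookup. The Embedder interface,
// the entry layout, and the similarity threshold are illustrative
// assumptions only.
package semanticcache

import "math"

// Embedder turns a prompt into a fixed-length embedding vector,
// e.g. by calling an embedding model.
type Embedder interface {
	Embed(prompt string) ([]float64, error)
}

type entry struct {
	embedding []float64
	response  []byte // cached LLM response payload
}

type Cache struct {
	embedder  Embedder
	entries   []entry
	threshold float64 // minimum cosine similarity to count as a hit
}

// Lookup returns a cached response whose stored prompt is semantically
// close enough to the incoming prompt, or false on a miss.
func (c *Cache) Lookup(prompt string) ([]byte, bool, error) {
	vec, err := c.embedder.Embed(prompt)
	if err != nil {
		return nil, false, err
	}
	for _, e := range c.entries {
		if cosine(vec, e.embedding) >= c.threshold {
			return e.response, true, nil
		}
	}
	return nil, false, nil
}

// cosine computes cosine similarity; vectors are assumed to have equal length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

In practice the linear scan would be replaced by a vector-database query, which is what the storage list below is about.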

To implement semantic caching, the following storage components are typically needed (a rough composition sketch follows the list):

  • Vector Databases: Store and manage embeddings for semantic similarity searches.
  • In-Memory Stores (e.g., Redis): Handle fast lookups for associated metadata and cache entries.
  • Scalable Object Storage: Manage large payloads, such as LLM responses, efficiently.
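A hypothetical sketch of how these three layers could compose at the gateway; the interface names and methods below are assumptions for illustration, not an existing API in this repository.

```go
// Hypothetical composition of the three storage layers listed above.
package semanticcache

import "context"

// VectorStore performs similarity search over request embeddings
// (e.g. backed by a vector database).
type VectorStore interface {
	Upsert(ctx context.Context, id string, embedding []float64) error
	NearestID(ctx context.Context, embedding []float64, minScore float64) (id string, found bool, err error)
}

// MetadataStore (e.g. Redis) maps a cache ID to small, fast-to-read
// metadata such as model name, TTL, and the object-storage key.
type MetadataStore interface {
	Get(ctx context.Context, id string) (objectKey string, err error)
	Set(ctx context.Context, id, objectKey string) error
}

// ObjectStore holds the potentially large LLM response payloads.
type ObjectStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, payload []byte) error
}
```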
missBerg added the discussion and enhancement labels (Nov 27, 2024)
mathetake (Member) commented Dec 3, 2024

Let's discuss this API after the basic API definitions are in (like the transformation, token rate limit, and upstream auth stuff) - hopefully we can start discussing by the end of the year.

mathetake added a commit that referenced this issue Dec 5, 2024
This adds the skeleton API of LLMRoute and LLMBackend.
These two resources will be the foundation for future
iterations, such as authn/z, token-based rate limiting,
schema transformation, and more advanced features like #10.

Note: we may break the APIs as necessary until the initial release.

part of #13 

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Krishanx92 (Contributor) commented:

Design: #30
