
[Discussion] Semantic Caching for LLM/AI Gateways #10

Status: Open
Labels: discussion (To be discussed in community), enhancement (New feature or request)
sanjeewa-malalgoda opened this issue Nov 27, 2024 · 2 comments

Comments

sanjeewa-malalgoda commented Nov 27, 2024

Created this issue to discuss the semantic caching requirement at a high level. I would like to hear what the community thinks of this. Supporting caching at the gateway is not something everyone agrees on, so let's have a discussion here and see what the best approach to follow is.

Why Implement Semantic Caching for LLM/AI Gateways?
LLM calls are expensive and time-consuming, making a caching layer highly valuable for consumers of API gateways. Users expect caching capabilities at the gateway level to reduce backend load and improve response times. While most API gateways support request-based caching (based on request attributes), semantic caching extends this concept to the AI/LLM domain by focusing on the meaning of requests rather than their syntax.

Some LLMs provide built-in caching, but these implementations often consume additional resources on the LLM backend. By introducing semantic caching at the gateway level, we can provide vendor-agnostic caching that works independently of the LLM provider. This approach not only improves resource utilization but also supports multi-LLM backends, where a single API may route requests to different LLMs. Gateway-level caching enhances performance, offering faster responses and reducing backend processing costs.

Technical Considerations for Semantic Caching
In semantic caching, cache keys are determined by the meaning of a request rather than its exact syntax. This involves using embeddings, metadata, and context to capture differences in AI behavior and ensure accurate matches for similar requests.
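To make the idea concrete, here is a minimal sketch in Go of an embedding-based cache lookup. Everything in it (the Embedder interface, the entry layout, the similarity threshold field) is an illustrative assumption, not an agreed design for this project.

```go
// Minimal sketch of a semantic cache lookup. The Embedder interface,
// the entry layout, and the similarity threshold are illustrative
// assumptions only.
package semanticcache

import "math"

// Embedder turns a prompt into a fixed-length embedding vector,
// e.g. by calling an embedding model.
type Embedder interface {
	Embed(prompt string) ([]float64, error)
}

type entry struct {
	embedding []float64
	response  []byte // cached LLM response payload
}

type Cache struct {
	embedder  Embedder
	entries   []entry
	threshold float64 // minimum cosine similarity to count as a hit
}

// Lookup returns a cached response whose stored prompt is semantically
// close enough to the incoming prompt, or false on a miss.
func (c *Cache) Lookup(prompt string) ([]byte, bool, error) {
	vec, err := c.embedder.Embed(prompt)
	if err != nil {
		return nil, false, err
	}
	for _, e := range c.entries {
		if cosine(vec, e.embedding) >= c.threshold {
			return e.response, true, nil
		}
	}
	return nil, false, nil
}

// cosine computes cosine similarity; vectors are assumed to have equal length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

In practice the linear scan would be replaced by a vector-database query, which is what the storage list below is about.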

To implement semantic caching, the following storage components are typically needed (a rough composition sketch follows the list):

  • Vector Databases: Store and manage embeddings for semantic similarity searches.
  • In-Memory Stores (e.g., Redis): Handle fast lookups for associated metadata and cache entries.
  • Scalable Object Storage: Manage large payloads, such as LLM responses, efficiently.
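A hypothetical sketch of how these three layers could compose at the gateway; the interface names and methods below are assumptions for illustration, not an existing API in this repository.

```go
// Hypothetical composition of the three storage layers listed above.
package semanticcache

import "context"

// VectorStore performs similarity search over request embeddings
// (e.g. backed by a vector database).
type VectorStore interface {
	Upsert(ctx context.Context, id string, embedding []float64) error
	NearestID(ctx context.Context, embedding []float64, minScore float64) (id string, found bool, err error)
}

// MetadataStore (e.g. Redis) maps a cache ID to small, fast-to-read
// metadata such as model name, TTL, and the object-storage key.
type MetadataStore interface {
	Get(ctx context.Context, id string) (objectKey string, err error)
	Set(ctx context.Context, id, objectKey string) error
}

// ObjectStore holds the potentially large LLM response payloads.
type ObjectStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, payload []byte) error
}
```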
missBerg added the discussion and enhancement labels (Nov 27, 2024)
mathetake (Member) commented Dec 3, 2024

Let's discuss this API after the basic API definitions are in (like the transformation, token rate limit, and upstream auth stuff) - hopefully we can start discussing by the end of the year.

mathetake added a commit that referenced this issue Dec 5, 2024
This adds the skeleton API of LLMRoute and LLMBackend.
These two resources will be the foundation for future
iterations, such as authn/z, token-based rate limiting,
schema transformation, and more advanced features like #10.

Note: we may break the APIs as necessary until the initial release.

part of #13 

---------

Signed-off-by: Takeshi Yoneda <[email protected]>
Krishanx92 (Contributor) commented:

Design: #30
