Created this issue to discuss the Semantic Caching requirement at a high level. Would like to hear what the community thinks of this. Supporting caching at the gateway is not something everyone agrees on, so let's have a discussion here and see what the best approach to follow is.
Why Implement Semantic Caching for LLM/AI Gateways?
LLM calls are expensive and time-consuming, making a caching layer highly valuable for consumers of API gateways. Users expect caching capabilities at the gateway level to reduce backend load and improve response times. While most API gateways support request-based caching (based on request attributes), semantic caching extends this concept to the AI/LLM domain by focusing on the meaning of requests rather than their syntax.
Some LLM providers offer built-in caching, but these implementations often consume additional resources on the LLM backend. By introducing semantic caching at the gateway level, we can provide vendor-agnostic caching that works independently of the LLM provider. This approach not only improves resource utilization but also supports multi-LLM backends, where a single API may route requests to different LLMs. Gateway-level caching enhances performance, offering faster responses and reducing backend processing costs.
Technical Considerations for Semantic Caching
In semantic caching, cache keys are determined by the meaning of a request rather than its exact syntax. This involves using embeddings, metadata, and context to capture the intent of a request and match it against semantically similar earlier requests, so that paraphrased prompts can be served from the cache.
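To make the idea concrete, here is a minimal, illustrative Go sketch of a meaning-based lookup: prompts are compared by the cosine similarity of their embeddings against a threshold instead of by exact string match. The embeddings, the threshold value, and the in-memory linear scan are all assumptions for the example, not part of any existing gateway API; a real implementation would obtain embeddings from a model and delegate the nearest-neighbour search to a vector database.

```go
package main

import (
	"fmt"
	"math"
)

type cacheEntry struct {
	embedding []float64 // vector representation of the cached prompt
	response  string    // LLM response stored for reuse
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup scans cached entries and returns the response whose prompt
// embedding is closest to the query, provided it clears the threshold.
func lookup(entries []cacheEntry, query []float64, threshold float64) (string, bool) {
	bestScore := threshold
	var best string
	found := false
	for _, e := range entries {
		if s := cosine(e.embedding, query); s >= bestScore {
			bestScore, best, found = s, e.response, true
		}
	}
	return best, found
}

func main() {
	// Toy 3-dimensional embeddings; in practice these come from an embedding model.
	cache := []cacheEntry{
		{embedding: []float64{0.9, 0.1, 0.0}, response: "Paris is the capital of France."},
	}
	query := []float64{0.88, 0.12, 0.01} // a paraphrase of the cached prompt
	if resp, ok := lookup(cache, query, 0.95); ok {
		fmt.Println("cache hit:", resp)
	} else {
		fmt.Println("cache miss, forward to LLM backend")
	}
}
```

The threshold is the key tuning knob: too low and unrelated prompts collide, too high and paraphrases miss the cache.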
Storage layers typically needed to implement semantic caching (a rough sketch of how they could compose follows the list):
Vector Databases: Store and manage embeddings for semantic similarity searches.
In-Memory Stores (e.g., Redis): Handle fast lookups for associated metadata and cache entries.
Scalable Object Storage: Manage large payloads, such as LLM responses, efficiently.
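The sketch below shows one way these three layers could be wired together on the gateway side. The interface and type names (VectorIndex, MetadataStore, BlobStore, SemanticCache) are assumptions made up for this example, not APIs of any particular project; the point is only the division of responsibilities: the vector database answers "which cached prompt is closest?", the in-memory store holds small hot metadata such as the object key, and object storage holds the full response body.

```go
package cache

import "context"

// VectorIndex finds the semantically closest cached prompt
// (backed by a vector database).
type VectorIndex interface {
	// Nearest returns the id of the closest stored embedding and its
	// similarity score; ok is false when the index is empty.
	Nearest(ctx context.Context, embedding []float32) (id string, score float32, ok bool, err error)
	Insert(ctx context.Context, id string, embedding []float32) error
}

// MetadataStore holds small, hot records such as TTLs, model name,
// and the object-storage key of the full response (e.g. Redis).
type MetadataStore interface {
	Get(ctx context.Context, id string) (map[string]string, error)
	Set(ctx context.Context, id string, meta map[string]string) error
}

// BlobStore keeps the potentially large LLM response bodies
// (e.g. S3-compatible object storage).
type BlobStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, body []byte) error
}

// SemanticCache composes the three layers: find the nearest cached
// entry for an embedding, then fetch its metadata and response body.
type SemanticCache struct {
	Index     VectorIndex
	Meta      MetadataStore
	Blobs     BlobStore
	Threshold float32 // minimum similarity to count as a hit
}

func (c *SemanticCache) Lookup(ctx context.Context, embedding []float32) ([]byte, bool, error) {
	id, score, ok, err := c.Index.Nearest(ctx, embedding)
	if err != nil || !ok || score < c.Threshold {
		return nil, false, err
	}
	meta, err := c.Meta.Get(ctx, id)
	if err != nil {
		return nil, false, err
	}
	body, err := c.Blobs.Get(ctx, meta["response_key"])
	if err != nil {
		return nil, false, err
	}
	return body, true, nil
}
```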
let's discuss this API after the basic API definitions are in (like the ones for transformation, token rate limiting and upstream auth) - hopefully we can start discussing by the end of the year
This adds the skeleton API of LLMRoute and LLMBackend.
These two resources will be the foundation for future
iterations, such as authn/z, token-based rate limiting,
schema transformation and more advanced things like #10.
Note: we might/will break APIs before the initial release
if the need arises.
part of #13
---------
Signed-off-by: Takeshi Yoneda <[email protected]>