Stop sequence missed #32
Comments
Hi @Webifi, thanks for reporting! Please note that we can't always just truncate the new delta, since everything before the last token has already gone through the transformer and is remembered in its attention caches. That is, you can truncate it for the UI, but the model still sees the conversation history as if the tokens were not truncated. This is why this code requires the [...]. A proper fix would require adding [...]. That said, I don't think this always matters for LLM performance, since the LLM sees and understands that the response is finished anyway (since we put [...]
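To picture the point about attention caches, here is a toy sketch. It is not Petals code: `ToySession`, `cache`, `pending`, and the scripted token list are hypothetical names used only for illustration. It assumes, as described above, that every generated token except the last one has already been fed through the model and stored.

```python
class ToySession:
    """Toy stand-in for a server-side inference session (hypothetical)."""

    def __init__(self, prompt_tokens):
        # Stands in for the per-token attention key/value entries.
        self.cache = list(prompt_tokens)
        # The most recently sampled token; it has not been fed back yet.
        self.pending = None

    def generate(self, scripted_tokens):
        """Pretend to sample the given tokens one at a time."""
        out = []
        for tok in scripted_tokens:
            if self.pending is not None:
                # Producing the next token means the previous one has
                # already gone through the model and is now cached.
                self.cache.append(self.pending)
            self.pending = tok
            out.append(tok)
        return out


session = ToySession(["Hi"])
delta = session.generate(["###", "Human", ":"])  # like max_new_tokens = 3

# The client can cut the visible text right after the stop sequence...
shown = delta[: delta.index("###") + 1]

# ...but the session still remembers a token the user never saw.
print(shown)          # ['###']
print(session.cache)  # ['Hi', '###', 'Human']
```

In this toy, truncating `shown` changes only what the UI displays; `session.cache` still contains `Human`, so the next request in the same session continues from a history that includes it.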
I'll just deal with it client side for now, until a proper fix can be implemented.
@borzunov Could you expand on "remembered in its attention caches" a bit? I'm a little confused about the flow here. Say the web client requests generation of the following prompt using max_new_tokens of 2: [...] Now, the next new request to the WebSocket API is: [...] How do the attention caches play a role here, if at all?
Requests to api/v2/generate (websocket_api.py) can fail to detect the stop sequence in the generated response and continue generating well past it. The problem becomes even more apparent when max_new_tokens is greater than 1.
I have a change I can commit that works around this issue by scanning for the stop sequence in the last delta appended to the tail of the previous deltas, then stopping and returning a truncated new delta if the stop sequence is found.
If you'd like a PR for this, it will also require merging #31.
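A minimal sketch of the workaround described above, assuming plain-string deltas. The helper name `truncate_at_stop` and its return shape are hypothetical and are not taken from websocket_api.py. It also assumes the stop sequence itself is kept in the returned delta and only the text after it is cut.

```python
def truncate_at_stop(prev_text: str, delta: str, stop: str):
    """Return (delta_to_send, stopped) for a newly generated delta.

    prev_text is everything already sent to the client; the stop
    sequence may be split between prev_text and delta.
    """
    if not stop:
        return delta, False

    # Only the last len(stop) - 1 characters of the earlier text can be
    # part of a match that ends inside the new delta.
    tail = prev_text[-(len(stop) - 1):] if len(stop) > 1 else ""
    window = tail + delta

    idx = window.find(stop)
    if idx == -1:
        return delta, False

    # Translate the end of the match from window offsets to delta offsets.
    end = idx + len(stop) - len(tail)
    return delta[:end], True


# Stop sequence split across two deltas: "#" arrived earlier, "##" now.
print(truncate_at_stop("Hi there.#", "## Human: more", "###"))
# ('##', True)
```

Scanning only the tail plus the new delta keeps the check cheap on long responses while still catching a stop sequence that straddles a delta boundary.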