GitHub - ahmed-moubtahij/TokenHealer

What is token healing?

Token healing corrects the token boundary bias introduced by greedy tokenization. It trims the tail of the prompt and regrows it so that the prompt aligns with the model's tokenizer, which improves generation quality. The improvement is clearest with completion models.

Example: given a completion prompt that ends with a partial URL whose last character is :, the model might have seen the expected completion :// as a single token in training. However, the prompt's tail token : tells it that the next token is not //, so it looks for wrong completions. Such errors compound in auto-regressive language models.
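The boundary effect in the example above can be reproduced with a toy greedy longest-match tokenizer. This is only a sketch for intuition: the vocabulary and the `greedy_tokenize` helper are hypothetical and are not part of this repository or of transformers.

```python
# Toy vocabulary (an assumption for illustration); real tokenizers
# have tens of thousands of entries.
VOCAB = ["http", ":", "://", "//", "/", "www", "."]


def greedy_tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization: at each position, take the longest vocab entry."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no vocab entry matches at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


prompt_tokens = greedy_tokenize("http:")      # what the model sees as the prompt
full_tokens = greedy_tokenize("http://www")   # how the full text is tokenized

print(prompt_tokens)  # ['http', ':']
print(full_tokens)    # ['http', '://', 'www']
```

The prompt's token sequence is not a prefix of the full text's token sequence: the prompt ends with `:` while the full text uses `://`. A model trained on text tokenized the second way rarely sees `:` followed by `//`, which is the bias token healing targets.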

A more thorough explanation can be found in The Art of Prompt Design: Prompt Boundaries and Token Healing, by Scott Lundberg.

Installation

The only dependency is transformers. Install it directly with pip install transformers, or run pip install . from the repository root, which should pick it up from pyproject.toml.

Usage

from token_healing import TokenBoundaryHealer

prompt = 'The link is <a href="http:'

output = generate(prompt, completion_model, tokenizer)
# The link is <a href="http:&#47;&#47;www&#47;dailymail&#

# The model saw '://' as a single token in training. Seeing a prompt ending with `:` tells it that the
# next token is likely not `//`, because otherwise it would've seen `://`.
# Thus, it completes with a token other than `//`, in this case, `&`.

token_healer = TokenBoundaryHealer(completion_model, tokenizer)
healed_prompt = token_healer(prompt)
# The link is <a href="http://
healed_output = generate(healed_prompt, completion_model, tokenizer)
# The link is <a href="http://www.365doki.com/post/3699

See example.py for the full example.
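For intuition about the trim-and-regrow step, here is a minimal toy sketch. It is not this library's implementation. The vocabulary, the `TOY_SCORES` table (standing in for a language model's next-token preferences), and the `greedy_tokenize` and `heal` helpers are all hypothetical, and the sketch backs up by a single token only.

```python
# Toy vocabulary and scores (assumptions for illustration only).
VOCAB = ["http", ":", "://", "//", "/", "www", "."]
TOY_SCORES = {"://": 0.9, ":": 0.1}  # stand-in for a model's next-token scores


def greedy_tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no vocab entry matches at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


def heal(prompt, vocab=VOCAB, scores=TOY_SCORES):
    """Trim the tail token, then regrow it from vocab entries that extend it."""
    tokens = greedy_tokenize(prompt, vocab)
    tail = tokens[-1]
    # Candidates are vocab entries that start with the trimmed tail text.
    candidates = [t for t in vocab if t.startswith(tail)]
    if len(candidates) <= 1:
        return prompt  # the tail has no longer alternative; nothing to heal
    trimmed = "".join(tokens[:-1])
    # Regrow: pick the candidate the (toy) model scores highest.
    best = max(candidates, key=lambda t: scores.get(t, 0.0))
    return trimmed + best


print(heal("http:"))        # http://
print(heal("http://www."))  # http://www.
```

In a real setting, the scores would come from the model's logits restricted to the candidate tokens, and more than one tail token may need to be trimmed.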

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request.

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

Contact

@ahmed_moubtahij
