Quantization done for the given Models #107
This PR is with reference to Issue #106.

Quantization of each model was required to provide support for devices with less GPU VRAM. The files `backend/server.py` and `backend/Generator/main.py` are changed. `backend/server.py` now takes the following arguments (a rough sketch of these flags follows below):

- `--quantization`: whether the models are to be quantized or not. An `auto` option is also provided so that the user doesn't have to set it manually.
- `--gpu_id`: the target CUDA GPU on which the models are run. This is especially helpful on multi-GPU systems, where one can easily change which GPU a model runs on. An `auto` option selects the target GPU automatically.
- `--quantization_type`: the type of quantization to apply to the target models. Two types are supported, `4bit` and `8bit`; GPU usage differs between them, with `4bit` being the lowest. The default is `4bit`.

Earlier, without quantization, the project wouldn't run on a consumer 3050 Ti Mobile GPU. With quantization enabled, the model now loads and runs inference on that same GPU.
GPU load during inference with `4bit` quantization: (screenshot)

GPU load during inference with `8bit` quantization: (screenshot)

Both screenshots are from the same tasks; the difference in GPU memory usage comes from the different quantization configs (see the sketch below).
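For illustration, a rough sketch of how the `4bit` and `8bit` options could map to quantization configs, assuming the models are loaded through Hugging Face `transformers` with `bitsandbytes`; the helper name, model id, and device mapping here are hypothetical, and the actual loading code lives in `backend/Generator/main.py`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def build_quant_config(quantization_type: str) -> BitsAndBytesConfig:
    """Map the --quantization_type flag to a bitsandbytes config (illustrative only)."""
    if quantization_type == "4bit":
        # NF4 4-bit weights with fp16 compute: lowest VRAM footprint.
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    # 8-bit weights: more VRAM than 4-bit, closer to full precision.
    return BitsAndBytesConfig(load_in_8bit=True)

# Example: load a model onto the GPU selected via --gpu_id.
model = AutoModelForCausalLM.from_pretrained(
    "some/model-name",                           # placeholder model id
    quantization_config=build_quant_config("4bit"),
    device_map={"": 0},                          # e.g. GPU index parsed from --gpu_id
)
```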