
Quantization done for the given Models #107

Open
wants to merge 2 commits into base: main

Conversation

abhyudayhari

This PR is in reference to Issue #106.

Quantization of each model was required to support devices with less GPU VRAM. The files backend/server.py and backend/Generator/main.py were changed.

Now backend/server.py takes the following arguments:

--quantization: Whether the models are to be quantized. An auto option is also available, so the user does not have to set it manually.

--gpu_id: The target CUDA GPU on which the models are run. This is especially helpful on multi-GPU systems, where one can easily change which GPU the models run on. The auto option selects the target GPU automatically.

--quantization_type: The type of quantization to apply to the target models. Two types are supported, 4-bit and 8-bit; GPU usage differs between the two, with 4-bit being the lowest. The default is 4-bit. A sketch of how these flags could drive quantized loading follows below.
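The exact parsing and loading code lives in backend/server.py and backend/Generator/main.py; the snippet below is only a minimal sketch of how flags like these could be wired to bitsandbytes quantization through Hugging Face transformers. The option strings, the memory threshold, the model name, and the auto-GPU heuristic are illustrative assumptions, not the PR's actual implementation.

```python
# Illustrative sketch only: flag names match the PR, but the option strings,
# thresholds, model name, and loading code are assumptions for this example.
import argparse

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

parser = argparse.ArgumentParser()
parser.add_argument("--quantization", choices=["on", "off", "auto"], default="auto",
                    help="Quantize the models, or decide automatically.")
parser.add_argument("--gpu_id", default="auto",
                    help="CUDA device index to run the models on, or 'auto'.")
parser.add_argument("--quantization_type", choices=["4bit", "8bit"], default="4bit",
                    help="Precision used when quantization is enabled.")
args = parser.parse_args()

# 'auto' GPU selection: pick the CUDA device with the most free memory.
if args.gpu_id == "auto":
    free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
    gpu_id = max(range(len(free)), key=free.__getitem__)
else:
    gpu_id = int(args.gpu_id)

# 'auto' quantization: enable it when the chosen GPU has limited free memory
# (the 8 GB threshold here is an arbitrary illustrative value).
if args.quantization == "auto":
    quantize = torch.cuda.mem_get_info(gpu_id)[0] < 8 * 1024**3
else:
    quantize = args.quantization == "on"

# Build the bitsandbytes config for the requested precision.
if args.quantization_type == "4bit":
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
else:
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",                        # placeholder model name
    quantization_config=bnb_config if quantize else None,
    device_map={"": gpu_id},                         # pin all weights to the chosen GPU
)
```

The 4-bit path here uses NF4 with double quantization, which stores weights more compactly than 8-bit loading and is consistent with the lower memory usage in the 4-bit screenshot below.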

Earlier, without quantization, the project would not run on a consumer 3050 Ti Mobile GPU. Now, with quantization, the models are loaded and inference is run on that same GPU.

GPU load during inference with 4-bit quantization:

[Screenshot: GPU memory usage with 4-bit quantization]

GPU load during inference with 8-bit quantization:

[Screenshot: GPU memory usage with 8-bit quantization]

These measurements were taken for the same tasks; the difference in GPU memory comes from the different quantization configs.

@Roaster05
Contributor

@abhyudayhari good suggestion, we often faced the issue that our CUDA runs out of memory. Would this affect the time required and the quality of the questions generated? Also, in the snaps you attached there is only a slight decrease in memory usage; could we optimise it further?

@abhyudayhari
Author

abhyudayhari commented Jan 7, 2025

While starting the server it would take 1-2 seconds more, but inference latency is far lower than CPU inference, under 1 second. The quality of generation is not majorly compromised, because quantization at a high level just reduces the parameter size, and the difference in accuracy is only around 1-4%. For more optimization I'll have to look into it, because the quantization with the most size reduction is 4-bit, and with that we are already at around 2 GB of VRAM usage. The actual model size is 6.7 GB, so it has already been reduced to around 30% of the original. The other thing is that all the models are loaded onto the system when the server starts, which wastes VRAM; we could instead load each model only when it is needed at runtime (see the sketch below).
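That last point, loading models only when they are first needed rather than at server start-up, could look roughly like the sketch below. This is an illustrative pattern, assuming the models are loaded via Hugging Face transformers; the get_model helper and the cache are hypothetical names, not code from this PR.

```python
# Hypothetical lazy-loading helper: load each model the first time a request
# needs it instead of loading everything at server start-up.
from transformers import AutoModelForCausalLM

_model_cache = {}  # model name -> loaded (possibly quantized) model

def get_model(name, quantization_config=None):
    """Return a cached model, loading it on first use."""
    if name not in _model_cache:
        _model_cache[name] = AutoModelForCausalLM.from_pretrained(
            name,
            quantization_config=quantization_config,  # None = full precision
            device_map="auto",
        )
    return _model_cache[name]

# Example: the generator model is only pulled onto the GPU when the first
# request that needs it arrives, keeping start-up VRAM usage low.
# generator = get_model("some-org/some-7b-model", quantization_config=bnb_config)
```

A cache like this trades a one-time loading delay on the first request for a much smaller VRAM footprint at start-up, since unused models are never materialised.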
