
Quantization done for the given Models #107

Open
wants to merge 2 commits into base: main

Conversation

abhyudayhari

This PR is in reference to Issue #106.

Quantization of each model was required to support devices with less GPU VRAM. The files backend/server.py and backend/Generator/main.py were changed.

Now backend/server.py takes the following arguments:

--quantization: Whether the models are to be quantized. An auto option is also available, so the user does not have to set it manually.

--gpu_id: The target CUDA GPU on which the models are run. This is especially helpful on multi-GPU systems, where one can easily change which GPU the models run on. The auto option selects the target GPU automatically.

--quantization_type: The type of quantization to apply to the target models. Two types are supported, 4-bit and 8-bit; GPU usage differs between the two, with 4-bit being the lowest. The default is 4-bit. A sketch of how these flags could drive quantized loading follows below.
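The exact parsing and loading code lives in backend/server.py and backend/Generator/main.py; the snippet below is only a minimal sketch of how flags like these could be wired to bitsandbytes quantization through Hugging Face transformers. The option strings, the memory threshold, the model name, and the auto-GPU heuristic are illustrative assumptions, not the PR's actual implementation.

```python
# Illustrative sketch only: flag names match the PR, but the option strings,
# thresholds, model name, and loading code are assumptions for this example.
import argparse

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

parser = argparse.ArgumentParser()
parser.add_argument("--quantization", choices=["on", "off", "auto"], default="auto",
                    help="Quantize the models, or decide automatically.")
parser.add_argument("--gpu_id", default="auto",
                    help="CUDA device index to run the models on, or 'auto'.")
parser.add_argument("--quantization_type", choices=["4bit", "8bit"], default="4bit",
                    help="Precision used when quantization is enabled.")
args = parser.parse_args()

# 'auto' GPU selection: pick the CUDA device with the most free memory.
if args.gpu_id == "auto":
    free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
    gpu_id = max(range(len(free)), key=free.__getitem__)
else:
    gpu_id = int(args.gpu_id)

# 'auto' quantization: enable it when the chosen GPU has limited free memory
# (the 8 GB threshold here is an arbitrary illustrative value).
if args.quantization == "auto":
    quantize = torch.cuda.mem_get_info(gpu_id)[0] < 8 * 1024**3
else:
    quantize = args.quantization == "on"

# Build the bitsandbytes config for the requested precision.
if args.quantization_type == "4bit":
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
else:
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",                        # placeholder model name
    quantization_config=bnb_config if quantize else None,
    device_map={"": gpu_id},                         # pin all weights to the chosen GPU
)
```

The 4-bit path here uses NF4 with double quantization, which stores weights more compactly than 8-bit loading and is consistent with the lower memory usage in the 4-bit screenshot below.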

Earlier, without quantization, the project would not run on a consumer 3050 Ti Mobile GPU. Now, with quantization, the models are loaded and inference is run on that same GPU.

GPU load during inference with 4-bit quantization:

[Screenshot: GPU memory usage with 4-bit quantization]

GPU load during inference with 8-bit quantization:

[Screenshot: GPU memory usage with 8-bit quantization]

These measurements were taken for the same tasks; the difference in GPU memory comes from the different quantization configs.

@Roaster05
Contributor

@abhyudayhari good suggestion, we often faced the issue that our CUDA runs out of memory. Would this affect the time required and the quality of the questions generated? Also, in the snaps you attached there is only a slight decrease in memory usage; could we optimise it further?

@abhyudayhari
Author

abhyudayhari commented Jan 7, 2025

While starting the server it would take 1-2 seconds more, but inference latency is far lower than CPU inference, under 1 second. The quality of generation is not majorly compromised, because quantization at a high level just reduces the parameter size, and the difference in accuracy is only around 1-4%. For more optimization I'll have to look into it, because the quantization with the most size reduction is 4-bit, and with that we are already at around 2 GB of VRAM usage. The actual model size is 6.7 GB, so it has already been reduced to around 30% of the original. The other thing is that all the models are loaded onto the system when the server starts, which wastes VRAM; we could instead load each model only when it is needed at runtime (see the sketch below).
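That last point, loading models only when they are first needed rather than at server start-up, could look roughly like the sketch below. This is an illustrative pattern, assuming the models are loaded via Hugging Face transformers; the get_model helper and the cache are hypothetical names, not code from this PR.

```python
# Hypothetical lazy-loading helper: load each model the first time a request
# needs it instead of loading everything at server start-up.
from transformers import AutoModelForCausalLM

_model_cache = {}  # model name -> loaded (possibly quantized) model

def get_model(name, quantization_config=None):
    """Return a cached model, loading it on first use."""
    if name not in _model_cache:
        _model_cache[name] = AutoModelForCausalLM.from_pretrained(
            name,
            quantization_config=quantization_config,  # None = full precision
            device_map="auto",
        )
    return _model_cache[name]

# Example: the generator model is only pulled onto the GPU when the first
# request that needs it arrives, keeping start-up VRAM usage low.
# generator = get_model("some-org/some-7b-model", quantization_config=bnb_config)
```

A cache like this trades a one-time loading delay on the first request for a much smaller VRAM footprint at start-up, since unused models are never materialised.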
