Unleashing the Full Potential of Deep Learning Models: A Guide to Quantization Techniques

Amit Nikhade
6 min read · Jan 10, 2023

Precision at a fraction of the size: Experience the power of quantization for your deep learning models

Originally published on amitnikhade.com

Introduction

Model quantization reduces the numerical precision of a neural network's weights and activations, for example from 32-bit floating point to 8-bit integers. This shrinks the model’s memory footprint and computational cost, making it easier to deploy on resource-constrained hardware such as smartphones and edge devices. Because lower-precision values need less memory bandwidth and can use faster integer arithmetic, quantization can also shorten inference times on hardware that supports it. There are several approaches to quantization, including post-training quantization, quantization-aware training, and hybrid quantization. Overall, model quantization is a valuable tool for deploying large, complex models on a wide range of devices.
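To make "reducing the precision" concrete, here is a minimal sketch of one common scheme, affine (scale and zero-point) quantization to 8-bit integers, written in plain Python. The function names and the sample weights are made up for illustration; real frameworks add per-channel scales, calibration, and fused integer kernels on top of this idea.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers with an affine (scale + zero-point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # make sure 0.0 is exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [scale * (x - zero_point) for x in q]


weights = [-1.2, -0.4, 0.0, 0.3, 0.9, 2.0]  # hypothetical example values
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("integer codes:", q)
print("largest round-trip error:", max_error)
```

Each value now fits in one byte instead of four, and the price is a round-trip error bounded by roughly one quantization step (`scale`).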

When to use quantization

Model quantization is useful in situations where you need to deploy a deep learning model on a resource-constrained device, such as a mobile phone or an edge device. These devices often have limited memory and computational resources, making it difficult to run large, complex models. By quantizing the model, you can reduce the size of the model and the amount of resources required to run it, which makes it possible to deploy the model on these devices.

In addition to resource constraints, model quantization can also be useful in situations where you need to reduce the inference time of the model. By reducing the precision of the weights and activations, you can speed up the inference process, which can be important in real-time applications such as video streaming or online gaming.

Overall, model quantization is a powerful tool that can enable the deployment of large and complex models on resource-constrained devices, and can also be used to improve the performance of the model by reducing inference times.
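A quick back-of-the-envelope calculation helps decide whether quantization is worth it for a given device. The sketch below estimates the memory needed for the weights alone at different precisions; the parameter count is a hypothetical figure of roughly BERT-base size, and the estimate ignores activations and the small overhead of storing scales and zero points.

```python
# Bytes needed to store one value at each precision
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_parameters, dtype):
    """Approximate storage for the weights alone, in megabytes."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 1e6


num_parameters = 110_000_000  # hypothetical: roughly the size of BERT-base

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weight_memory_mb(num_parameters, dtype):.0f} MB")
# float32: 440 MB
# float16: 220 MB
# int8: 110 MB
```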

Benefits of quantization

The main benefits of quantizing a model are:

  1. Smaller model size: Storing weights in fewer bits reduces the memory and storage footprint. For example, 8-bit integers take a quarter of the space of 32-bit floats.
  2. Lower resource requirements: A smaller model needs less memory and compute to run, which makes deployment possible on mobile phones and edge devices.
  3. Faster inference: Lower-precision weights and activations can speed up inference, which matters in real-time applications such as video streaming or online gaming.

Drawbacks of quantization

Some potential drawbacks to using model quantization include:

  1. Reduced accuracy: Quantization involves reducing the precision of the weights and activations, which can lead to a loss of information and result in a less accurate model.
  2. Increased complexity: Implementing model quantization requires a thorough understanding of the model and the quantization process, which can be complex. There are various approaches to quantization, each with their own set of trade-offs and considerations, and choosing the right approach can be challenging.
  3. Limited support: Not all deep learning frameworks and hardware platforms support model quantization, which can make it difficult to deploy quantized models in some environments.
  4. Difficulty in fine-tuning: It can be difficult to fine-tune quantized models, as the reduced precision of the weights and activations can make it harder for the model to learn.

Overall, while model quantization can be a useful tool in certain situations, it is important to carefully evaluate the trade-offs and consider whether it is the right approach for your use case.
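The accuracy trade-off can be seen directly by measuring how much information is lost at different bit widths. This is a minimal sketch using symmetric quantization on randomly generated stand-in weights (not weights from a real model); the helper name is made up for illustration.

```python
import random


def quantization_error(values, num_bits):
    """Mean absolute error after a symmetric quantize/dequantize round trip."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    restored = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
# Stand-in for a layer's weights: 10,000 samples from a standard normal
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: quantization_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.4f}")
```

The error grows as the bit width shrinks, which is why 8-bit quantization is usually a safe starting point and lower bit widths need more care.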

Alternatives to model quantization

There are a few alternatives to model quantization that can be used to reduce the size and computational complexity of a deep learning model:

  1. Model pruning: Model pruning involves removing unnecessary connections and parameters from the model, which can lead to a smaller and more efficient model.
  2. Low-rank approximation: Low-rank approximation involves approximating a large matrix with a smaller matrix, which can lead to a reduction in the number of parameters and computational complexity of the model.
  3. Knowledge distillation: Knowledge distillation involves training a smaller model to mimic the behavior of a larger, more accurate model. This can be an effective way to reduce the size and complexity of the model while still maintaining good accuracy.
  4. Weight sharing: Weight sharing involves using the same set of weights for multiple parts of the model, which can reduce the number of parameters and computational complexity of the model.

Overall, these alternatives can be effective in certain situations, but it is important to carefully evaluate the trade-offs and choose the right approach for your use case.
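As an illustration of the first alternative, here is a minimal sketch of magnitude pruning, one common criterion for deciding which connections are "unnecessary": the weights with the smallest absolute values are set to zero. The function name and sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -1.2, 0.02, 0.6, -0.1]  # hypothetical values
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
# [0.8, 0.0, 0.3, 0.0, -1.2, 0.0, 0.6, 0.0]
```

Note that zeroed weights only save memory or compute when they are stored in a sparse format or removed in whole structures such as channels.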

There are two main approaches to quantizing a model:

  1. Post-training quantization: This involves quantizing a model after it has been trained, with no further training. The weights are converted to lower-precision integer values, and the activations are quantized either on the fly at inference time (dynamic quantization) or using ranges estimated ahead of time from a small calibration dataset (static quantization).
  2. Quantization-aware training: This involves training a model with quantization in mind. During training, the model’s weights and activations are quantized and immediately dequantized back to floating-point values in the forward pass, so the model learns to tolerate the rounding error, while gradients in the backward pass are computed as if the rounding were not there. This helps to ensure that the model is optimized for quantization and loses less accuracy when it is finally quantized.
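The difference between the two approaches can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (a single scalar weight, a fixed weight scale, made-up data), not how any framework implements it: the first part picks a scale from calibration samples without training, and the second part trains through a quantize/dequantize step while passing the gradient straight through the rounding.

```python
def calibrate_scale(samples, num_bits=8):
    """Post-training calibration: derive a scale from observed values."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(abs(s) for s in samples) / qmax


def fake_quantize(x, scale, num_bits=8):
    """Quantize then dequantize: the result is a float that carries the
    rounding error real integer inference would introduce."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


# Post-training quantization: calibrate on a few observed activations, no training
calibration_activations = [0.2, -1.5, 0.7, 3.1, -2.4]  # hypothetical samples
act_scale = calibrate_scale(calibration_activations)
quantized_activation = fake_quantize(1.0, act_scale)

# Quantization-aware training: fit y = w * x with a 4-bit weight
inputs = [1.0, 2.0, 3.0]
targets = [0.37 * x for x in inputs]  # the ideal weight, 0.37, is not on the 4-bit grid
w, w_scale, lr = 0.0, 0.1, 0.01
for _ in range(200):
    w_q = fake_quantize(w, w_scale, num_bits=4)  # forward pass sees the quantized weight
    # Straight-through estimator: treat d(w_q)/d(w) as 1 in the backward pass
    grad = sum(2 * (w_q * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
    w -= lr * grad

learned = fake_quantize(w, w_scale, num_bits=4)
print("calibrated activation scale:", act_scale)
print("weight the model ends up using:", learned)
```

The trained weight settles next to a grid point adjacent to 0.37, because the loss is always evaluated on the quantized value rather than on the underlying float.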

We’ll try implementing dynamic quantization on BERT

from transformers import BertTokenizer, BertModel
import torch
import os

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

# Replace every nn.Linear layer with a dynamically quantized int8 version
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def print_size_of_model(model):
    # Serialize the weights to a temporary file and report its size on disk
    torch.save(model.state_dict(), "temp.p")
    print('Size (MB):', os.path.getsize("temp.p")/1e6)
    os.remove('temp.p')

print_size_of_model(model)            # original FP32 model
print_size_of_model(quantized_model)  # dynamically quantized model

Quantization is especially useful for transformer-based models, which tend to be large. Deploying them can consume a lot of resources, so it is important to optimize and compress the model before deployment. You can use the code above to quantize BERT and other language models too.
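For intuition about what "dynamic" means here, the following is a conceptual sketch in plain Python of a single dot product inside a dynamically quantized linear layer. It is not PyTorch's internal implementation, and all names and values are made up: the point is only that the weights are quantized once ahead of time, while the scale for the activations is computed at inference time from the actual input.

```python
def quantize_symmetric(values, num_bits=8):
    """Return integer codes and the scale that maps them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # 1.0 for all-zero input
    return [round(v / scale) for v in values], scale


# Weights are quantized once, before any input is seen
weights = [0.5, -0.25, 0.75, 0.1]  # hypothetical weights of one output neuron
q_weights, w_scale = quantize_symmetric(weights)


def dynamic_linear(inputs):
    # The activation scale is computed on the fly from this particular input
    q_inputs, x_scale = quantize_symmetric(inputs)
    # Integer multiply-accumulate, followed by a single float rescale
    acc = sum(qw * qx for qw, qx in zip(q_weights, q_inputs))
    return acc * w_scale * x_scale


inputs = [1.0, 2.0, -1.0, 0.5]
exact = sum(w * x for w, x in zip(weights, inputs))
approx = dynamic_linear(inputs)
print("float result:", exact)
print("dynamically quantized result:", approx)
```

Because no activation ranges have to be collected in advance, this style of quantization needs no calibration data, which is part of why it is a convenient first thing to try.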

Quantization on T5 model

from fastT5 import (OnnxT5, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 't5-small'

# Step 1. Convert the Hugging Face T5 model to ONNX
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (Recommended) Quantize the converted model for faster inference and a smaller model size
quant_model_paths = quantize(onnx_model_paths)

Quantizing a T5 model has typically been a difficult task. fastT5 is a library that helps to overcome this problem: it converts the model to ONNX and quantizes the encoder and decoder to a smaller size.

Quantization using TFLite

import tensorflow as tf
import pathlib
from tensorflow import keras

# Load a trained Keras model and set up the TFLite converter
model = keras.models.load_model('/content/model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# With no representative dataset, Optimize.DEFAULT applies dynamic range
# quantization: the weights are stored as 8-bit integers
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model_quant = converter.convert()

# Write the quantized model to disk
tflite_models_dir = pathlib.Path("/content/l")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

tflite_model_quant_file = tflite_models_dir/"model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_model_quant)

TensorFlow Lite (TFLite) is TensorFlow's toolkit for converting large models into a compact format that runs on mobile and embedded devices. The conversion step can also quantize the model, as it does in the code above. PyTorch provides a comparable alternative, PyTorch Mobile.

Quantization using Optimum

from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the model from the Hugging Face Hub and export it to ONNX
# (newer Optimum releases use export=True in place of from_transformers=True)
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

quantizer = ORTQuantizer.from_pretrained(onnx_model)

# Dynamic quantization (is_static=False) targeting CPUs with AVX512-VNNI
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

model_quantized_path = quantizer.quantize(
    save_dir="path/to/output/model",
    quantization_config=dqconfig,
)

Optimum is an extension of the popular Transformers library for training and running deep learning models on targeted hardware as efficiently as possible. By applying optimization techniques such as pruning and quantization, and by taking advantage of specialized hardware like TPUs or GPUs, Optimum can improve the performance of models and decrease the computational requirements for both training and deployment.

Conclusion

There is not much to conclude: try the techniques above in your own project, and leave a comment on the performance improvement you notice after using them.

Thanks. Do visit: amitnikhade.com
