A 4-step guide to making a deployment-ready deep learning model

Amit Nikhade
5 min read · Mar 12, 2023

The model has been trained and dumped; what next?

Originally published on amitnikhade.com

Photo by Mikhail Nilov: https://www.pexels.com/photo/wood-light-landscape-person-6932594/

Deploying a deep learning model in production is a complex task that requires careful attention to both technical and practical details. You want to ensure that your model performs well and delivers accurate results, but you also need to consider other factors such as user experience, data privacy, and scalability.

On the technical side, you need to choose the right hardware and software infrastructure to support your model. Deep learning models can be large and computationally intensive, so you may need to use specialized hardware such as GPUs or TPUs to achieve the best performance. Additionally, you need to optimize your model to run efficiently by using techniques such as pruning and quantization.

But technical details are only part of the equation. You also need to think about practical considerations such as user experience and security. You want to make sure that your model is easy to use and integrate into your application while ensuring that user data is kept private and secure.

Here are some steps that will make a model ready for deployment.

It’s important to note that this article doesn’t cover all the aspects of model deployment. However, these steps can help address some common issues that can be problematic during the deployment process.

We’ll specifically be focusing on the following:

  1. Model size
  2. Model optimization

Let’s consider a use case: suppose you want to deploy a deep learning model trained on the Iris dataset. We’ll go through it step by step.

Training

The training we normally do for deep learning consists of building a model architecture, defining the optimizer, compiling the model, and fitting it. In our case, however, we’ll perform quantization-aware training.

Quantization-aware training (QAT) is a technique used in machine learning to train neural networks that can work on devices with limited resources, such as mobile phones or embedded systems. The process involves introducing quantization operations during training, which simulate the effects of mapping floating-point values to lower-precision integers. This helps to make the neural network more robust to the loss of precision that comes with quantization, resulting in models that can be deployed on low-resource devices without significant loss of accuracy. QAT is often combined with other optimization techniques to further reduce the computational requirements of neural networks, making them more efficient for deployment on resource-constrained devices.

Requirements:

!pip install onnx_tf tf2onnx onnxruntime tensorflow-model-optimization onnxoptimizer

Training code:

Quantization-aware training simulates the reduced precision of the model’s weights and activations during training, so that memory usage and inference time can later be cut without a significant loss of accuracy.

import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import tensorflow_model_optimization as tfmot

# Load the Iris dataset and split it into train and test sets.
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.2, random_state=42)

# One-hot encode the integer class labels.
encoder = OneHotEncoder()
y_train = encoder.fit_transform(y_train.reshape(-1, 1)).toarray()
y_test = encoder.transform(y_test.reshape(-1, 1)).toarray()

# A small fully connected classifier for the 4 Iris features and 3 classes.
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(4,), activation='relu'),
    keras.layers.Dense(3, activation='softmax')
])

# Wrap the model so that quantization is simulated during training (QAT).
quantize_model = tfmot.quantization.keras.quantize_model
model = quantize_model(model)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=5, verbose=1)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', test_acc)

Saving and loading the model:

# Saving without a file extension uses the TensorFlow SavedModel format.
model.save("path/to/model")
model = keras.models.load_model("path/to/model")
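
If you instead save the quantize-aware model in the HDF5 format (the .h5 path below is illustrative), the quantization wrappers are custom objects, and tensorflow_model_optimization’s quantize_scope helps Keras resolve them when loading. A minimal sketch under that assumption:

import tensorflow_model_optimization as tfmot
from tensorflow import keras

# Assumption: the model was saved as HDF5, e.g. model.save("path/to/model.h5").
# quantize_scope makes the QuantizeWrapper custom objects available to the loader.
with tfmot.quantization.keras.quantize_scope():
    loaded_model = keras.models.load_model("path/to/model.h5")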

Converting the TensorFlow model to tflite:

TFLite models are optimized for inference on mobile and embedded devices. The model size is reduced while keeping the accuracy almost the same.

# Convert the trained Keras model to TensorFlow Lite and write it to disk.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
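
As an aside, even without quantization-aware training, a similar size reduction can often be obtained with post-training quantization at conversion time by setting the converter’s optimizations flag. A minimal sketch (the output filename is illustrative, and this would normally be applied to a plain, non-QAT Keras model):

# Optional: post-training quantization during TFLite conversion,
# an alternative to quantization-aware training.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable default weight quantization
tflite_quant_model = converter.convert()
with open('model_ptq.tflite', 'wb') as f:
    f.write(tflite_quant_model)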

Converting the TFLite model to ONNX:

ONNX is a powerful tool for machine learning developers because it allows them to easily transfer and represent models between different tools and frameworks. This means that developers can use their preferred tools to create and train models, and then convert them into ONNX format to use in other tools.

In addition, ONNX simplifies the process of deploying models by providing a way to optimize and execute models on different hardware platforms. This allows the model to take advantage of the hardware it’s running on, which can result in faster and more efficient execution.

!python -m tf2onnx.convert --opset 16 --tflite /path-to-model/model.tflite --output model.onnx
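
After conversion, it is worth confirming that the exported graph is well-formed; the onnx package ships a structural checker for this. A small sketch, assuming the output path used above:

import onnx

# Load the converted model and run ONNX's structural validator.
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
print('ONNX model is well-formed')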

Quantizing the ONNX model:

Model quantization is a process that reduces the memory and computational requirements of machine learning models. This is done by using fewer bits to represent the model’s parameters. Quantization can be done after the model is trained or during training, and it results in a smaller model size that requires less computational power.

This technique is especially useful for deploying models on devices with limited resources, such as smartphones and embedded systems. Using a smaller model can lead to faster and more efficient execution and make it easier to deploy the model on different devices.

from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = '/path-to-model/model.onnx'
model_quant = 'model.quant.onnx'

# Dynamically quantize the model's weights to 8-bit integers and save the result.
quantize_dynamic(model_fp32, model_quant, weight_type=QuantType.QUInt8)
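
It is also a good idea to sanity-check that dynamic quantization has not changed the model’s behaviour much. A minimal sketch comparing the FP32 and quantized models on a single sample with onnxruntime (the sample values are arbitrary):

import numpy as np
import onnxruntime

sample = np.array([[6.2, 3.4, 5.4, 2.3]], dtype=np.float32)

# Run the same input through the FP32 and the quantized model and compare outputs.
for path in (model_fp32, model_quant):
    sess = onnxruntime.InferenceSession(path)
    input_name = sess.get_inputs()[0].name
    output = sess.run(None, {input_name: sample})[0]
    print(path, output)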

Optimizing the quantized ONNX model:

ONNX graph optimization makes deep learning models run faster and more efficiently by applying various optimization passes to the model. These optimizations make the model smaller, require less memory, and use less computational power, making it easier to deploy on devices with limited resources.

import onnx
import onnxoptimizer

# Apply onnxoptimizer's default graph-level passes to the quantized model.
original_model = onnx.load('/path-to-model/model.quant.onnx')
optimized_model = onnxoptimizer.optimize(original_model)
onnx.save(optimized_model, 'optimized_model.onnx')
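
Since the whole point of these steps is a smaller artifact, it helps to print the file sizes at each stage. A quick sketch with os.path.getsize, assuming the artifacts sit in the working directory under the names used above:

import os

# Compare on-disk sizes of the artifacts produced so far.
for path in ['model.tflite', 'model.onnx', 'model.quant.onnx', 'optimized_model.onnx']:
    if os.path.exists(path):
        print(f'{path}: {os.path.getsize(path) / 1024:.2f} KB')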

Performing inference:

import onnxruntime
import numpy as np

# Create an ONNX Runtime inference session for the optimized model.
model_path = '/path-to-model/optimized_model.onnx'
sess = onnxruntime.InferenceSession(model_path)

input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

# A single Iris sample: sepal length, sepal width, petal length, petal width.
input_data = np.array([[6.2, 3.4, 5.4, 2.3]], dtype=np.float32)

outputs = sess.run([output_name], {input_name: input_data})

predicted_class = np.argmax(outputs[0])

print('Predicted class:', predicted_class)

# The predicted class was 2, and it was correct.

After optimizing the model’s size and performance, I tested it on multiple samples and found that it still classifies them accurately. Despite reducing the model size from 39.54 KB to 6.98 KB, there was no drop in performance or accuracy, indicating that the optimization process was successful.
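
One way to reproduce this check is to run the whole test split through the optimized ONNX model and compare the predicted classes against the labels. A sketch, reusing X_test and y_test from the training script:

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession('optimized_model.onnx')
input_name = sess.get_inputs()[0].name

# Classify the held-out test samples one at a time
# (the exported graph may have a fixed batch size of 1).
predictions = []
for sample in X_test.astype(np.float32):
    output = sess.run(None, {input_name: sample.reshape(1, 4)})[0]
    predictions.append(np.argmax(output))

accuracy = np.mean(np.array(predictions) == np.argmax(y_test, axis=1))
print('ONNX test accuracy:', accuracy)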

Conclusion

To make our machine learning models easier and cheaper to deploy, we can try to make their size as small as possible without compromising their accuracy. By optimizing the model and reducing its size, we can make it more platform-friendly, which means it will be easier to deploy and won’t incur additional deployment costs.

In addition, a smaller model will require less computational resources, reducing the load on our servers and potentially lowering costs. By using ONNX to optimize the model, we can further reduce its inference time and make it run faster on CPUs.

This is particularly useful for larger models, where size can be a limiting factor for deployment and inference speed. However, for simpler models like the iris classification example, these steps may not be necessary since the model size is already small. Nonetheless, it is always a good practice to optimize the model size and performance for efficient deployment and inference.

