Swish Activation Function

Amit Nikhade
6 min read · Mar 6, 2023


A smooth, non-monotonic function that can be used in place of the commonly used ReLU activation function.

Originally published on amitnikhade.com

Introduction to Activation Functions

In machine learning and deep learning, an activation function is a mathematical function that is applied to the output of a neural network layer to introduce non-linearity into the network, allowing it to model complex relationships between inputs and outputs.

Popular activation functions include ReLU, sigmoid, tanh, and softmax. The choice of activation function can have a significant impact on the performance of a neural network, as some are better suited for specific types of problems.

Activation functions enable neural networks to learn complex and nuanced patterns in the data by introducing non-linearity into the network.
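
As a minimal illustration of what "applying an activation to a layer's output" means, here is a small NumPy sketch. The layer sizes, weights, and input are made up purely for the example:

import numpy as np

rng = np.random.default_rng(0)

# A toy layer: 4 inputs -> 3 units. Weights and input are random, just for illustration.
x = rng.normal(size=4)        # input vector
W = rng.normal(size=(3, 4))   # weight matrix
b = np.zeros(3)               # bias

z = W @ x + b                 # linear (affine) part of the layer
a = np.maximum(0.0, z)        # non-linearity: here ReLU, applied element-wise

print("pre-activation:", z)
print("after ReLU:    ", a)

Without the last step, stacking layers would just compose linear maps into another linear map; the activation is what lets depth add expressive power.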

What Is the Swish Activation Function (Also Known as SiLU, the Sigmoid Linear Unit)?

The Swish activation function is a mathematical formula that helps a deep learning model make sense of data. It was invented by a group of Google researchers in 2017 and has since shown promising results in many real-world applications.

The Swish function is similar to another function called ReLU, but it has a smoother shape that often translates into slightly better accuracy in deep models. Unlike ReLU, it also avoids a common problem known as "dying ReLU," which can cause issues in deep learning models.

The Swish activation function can be used in many different types of machine learning models, including those used for image recognition and speech processing. It’s still being studied to understand all of its capabilities and limitations, but so far, it’s looking very promising!

How Does Swish Differ from Other Activation Functions?

The Swish activation function is a key player in deep learning and sets itself apart from other activation functions in several ways.

Firstly, it is a nonlinear function like ReLU, sigmoid, and tanh, which enables neural networks to model complex relationships between inputs and outputs. Nonlinear functions are crucial for deep learning to work as they are able to capture and represent complex patterns.

Swish also has a distinctive shape compared to other commonly used activation functions such as ReLU. It looks like a smoothed version of ReLU: because the input is scaled by a sigmoid, the output increases gradually instead of switching abruptly at zero. This smoothness makes Swish more adaptable and efficient in handling a wide range of inputs.

Another attractive feature of Swish is that it remains reasonably cheap to compute. Each unit needs only a sigmoid and a multiplication, which is more work than ReLU's single comparison but still a small fraction of a network's overall cost.

Lastly, Swish is less likely to encounter the "dying ReLU" problem. This problem occurs when a ReLU neuron's input stays negative for an extended period: its output and its gradient are both zero, so the neuron stops learning and becomes permanently inactive. Swish, whose gradient is small but non-zero for negative inputs, avoids this failure mode and keeps neurons updating during training.
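
To make the gradient argument concrete, here is a small NumPy sketch (the sample input values are arbitrary) comparing the derivatives of ReLU and Swish at a few negative and positive points:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu_grad(x):
    # Derivative of ReLU: 0 for negative inputs, 1 for positive inputs
    # (undefined at exactly 0; treated as 0 here).
    return (x > 0).astype(float)

def swish_grad(x):
    # Derivative of Swish(x) = x * sigmoid(x):
    #   sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

x = np.array([-4.0, -2.0, -0.5, 0.5, 2.0])
print("ReLU gradients: ", relu_grad(x))   # exactly zero for every negative input
print("Swish gradients:", swish_grad(x))  # small but non-zero for negative inputs

A ReLU neuron whose inputs are all negative gets a gradient of exactly zero and cannot recover; the Swish gradients stay non-zero, so the neuron can still learn.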

In summary, Swish stands out as an effective and efficient activation function that can handle complex data with ease and avoids the common problems of other activation functions.

[Figure: the Swish activation function curve]

ReLU vs. Swish

Let's look more closely at how ReLU and Swish each process their input.

ReLU processes the input by setting all negative values to 0, while passing through all positive values. This means that if the input is negative, ReLU produces an output of 0, and if the input is positive, ReLU produces an output equal to the input. This can be represented mathematically as:

ReLU(x) = max(0, x)

For example, if the input is -3, ReLU produces an output of 0, and if the input is 5, ReLU produces an output of 5.

Swish, on the other hand, applies a sigmoid function to the input, which is a smooth curve that gradually increases from 0 to 1 as the input increases. This sigmoid value is then multiplied by the input, resulting in a smooth output that is unbounded above but bounded below (it never drops much past about -0.28). Mathematically, Swish can be represented as:

Swish(x) = x * sigmoid(x)

For example, if the input is -3, Swish produces a small negative output (about -0.14), and if the input is 5, Swish produces an output close to the input itself (about 4.97).
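
As a quick sanity check of those numbers, reusing the same definitions as the snippet in the Usage section below:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    return x * sigmoid(x)

for v in (-3.0, 5.0):
    print(f"x = {v:+.0f}: ReLU -> {relu(v):.4f}, Swish -> {swish(v):.4f}")
# x = -3: ReLU -> 0.0000, Swish -> -0.1423 (a small negative value)
# x = +5: ReLU -> 5.0000, Swish -> 4.9665  (close to the input itself)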

Swish is differentiable everywhere (including at zero, where ReLU has a kink), which can be an advantage when training deep neural networks with gradient-based methods like backpropagation. However, Swish is more computationally expensive than ReLU because it involves computing the sigmoid function.
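
If you want to convince yourself of the smoothness claim, the analytic derivative of Swish, sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)), can be checked against a central finite difference:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # Analytic derivative of x * sigmoid(x).
    s = sigmoid(x)
    return s + x * s * (1 - s)

x = np.linspace(-6, 6, 13)
eps = 1e-5
numeric = (swish(x + eps) - swish(x - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(numeric - swish_grad(x))))           # tiny (~1e-10): they agree everywhere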

In summary, ReLU sets negative values to 0, while Swish applies a smooth sigmoid function to the input. Swish’s smoothness and differentiability can be an advantage in training deep neural networks, but it comes at a higher computational cost.

When to use it?

When considering whether to use the Swish activation function in a deep learning model, several factors should be taken into account.

One factor is the complexity of the data being used. Swish can perform well on datasets with complex features or noise. If this type of data is being used, Swish could be a good choice.

Another factor is the depth of the neural network. As the number of layers in a network increases, the likelihood of the “dying ReLU” problem also increases. The smoother gradient of Swish can help prevent this issue and keep the neurons active.

However, it is important to note that using Swish may increase the training time of the neural network when compared to other activation functions. This may not be a significant issue for smaller networks, though.

Ultimately, the decision to use Swish depends on the specific requirements of the task at hand. Experimenting with different activation functions and architectures can help determine the best approach for your data and model.
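
If you decide to try it, most frameworks already ship Swish under one name or another. The sketch below uses PyTorch, where Swish (with the scaling factor fixed at 1) is exposed as torch.nn.SiLU in recent releases; the make_mlp helper and the layer sizes are just for this example:

import torch
import torch.nn as nn

# Two otherwise identical MLPs: one with ReLU, one with Swish (which PyTorch calls SiLU).
def make_mlp(activation: nn.Module) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(784, 256),
        activation,
        nn.Linear(256, 128),
        activation,
        nn.Linear(128, 10),
    )

relu_net = make_mlp(nn.ReLU())
swish_net = make_mlp(nn.SiLU())   # drop-in swap; everything else stays the same

x = torch.randn(32, 784)          # a dummy batch of 32 flattened inputs
print(swish_net(x).shape)         # torch.Size([32, 10])

Because the activation is the only thing that changes, comparing the two variants on your own data is a cheap experiment.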

Usage

Here is a simple code snippet that implements and plots the Swish function:

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    # Standard logistic sigmoid.
    return 1 / (1 + np.exp(-x))

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x * sigmoid(x)

x = np.linspace(-10, 10, 100)
y = swish(x)

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('Swish(x)')
plt.title('Swish Activation Function')
plt.grid(True)
plt.show()

Running the snippet plots the Swish curve; try adding ReLU to the same axes to compare the two shapes.

Drawbacks of Swish

Although the Swish activation function has become quite popular in recent years, it has a few drawbacks that are important to consider.

Firstly, the Swish function is more computationally expensive than simpler functions like the rectified linear unit (ReLU), meaning it can take longer to compute, especially on large models and datasets.
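
One rough way to see the cost difference on your own machine is to time both functions on a large NumPy array; the exact numbers will vary with hardware, array size, and library versions:

import timeit
import numpy as np

x = np.random.randn(1_000_000)

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    return x / (1.0 + np.exp(-x))   # same as x * sigmoid(x), written in one step

t_relu = timeit.timeit(lambda: relu(x), number=200)
t_swish = timeit.timeit(lambda: swish(x), number=200)
print(f"ReLU : {t_relu:.3f} s")
print(f"Swish: {t_swish:.3f} s")    # usually noticeably slower, mostly because of exp()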

Additionally, the Swish function can be unstable when used with very deep neural networks, leading to slower training or even divergent behavior. Since Swish is relatively new, there is also limited research on its effectiveness compared to other activation functions, and some studies suggest that it does not always outperform them.

The Swish function also does not eliminate the vanishing and exploding gradient problems, which can make training difficult. Moreover, because Swish is non-monotonic, its output actually decreases over a small range of negative inputs, which can make its behavior harder to reason about and optimize than that of monotonic activation functions.
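
That non-monotonic dip is easy to locate numerically; a short sketch that finds Swish's minimum on the negative axis:

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # x * sigmoid(x)

x = np.linspace(-6.0, 0.0, 10_000)
y = swish(x)
i = np.argmin(y)
print(f"Swish minimum near x = {x[i]:.3f}, value = {y[i]:.3f}")
# roughly x = -1.278, value = -0.278: the output dips below zero and then rises again,
# which is exactly the non-monotonic behaviour described above.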

Overall, while the Swish function may offer some advantages over other functions, it is important to carefully consider its limitations and challenges before using it in a neural network.

Conclusion

To sum up, the Swish activation function has its pros and cons. While it is a promising alternative to existing activation functions, it is still relatively new and research on its effectiveness is limited. It can be more computationally expensive, occasionally unstable during training, and it does not eliminate gradient problems. Additionally, it is non-monotonic, which can make it slightly more difficult to optimize.

Despite these limitations, the Swish function can still be useful in certain applications when combined with other techniques like regularization and batch normalization. It is important to carefully consider the benefits and drawbacks before using the Swish function in a neural network. As research in the field of deep learning progresses, we may gain a better understanding of when the Swish activation function can be most effective.
