As you learnt in the previous segment, the number of neurons in the input layer is determined by the input given to the network, and the number of neurons in the output layer is equal to the number of classes (for a classification task) or is one (for a regression task). Now, let’s take a look at some examples to understand the inputs and outputs of ANNs better.
Let’s get started by understanding the inputs and outputs of an ANN from Professor Srinivasaraghavan in the upcoming video.
The most important thing to note is that inputs can only be numeric. For different types of input data, you need to use different ways to convert the inputs into a numeric form. The most commonly used inputs for ANNs are as follows:
- Structured data: The type of data used in standard machine learning algorithms, with multiple features arranged in two dimensions so that it can be represented in a tabular format, can be used as input for training ANNs. Such data can be stored in CSV files, MAT files, Excel files, etc. This is highly convenient because the input to an ANN is usually a numeric feature vector, and structured data eases the process of feeding the input into the ANN, as in the sketch below.
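As a minimal sketch of this, tabular data from a CSV file can be loaded and converted into numeric feature vectors before being fed to a network. The file name and column names below are hypothetical:

import pandas as pd

# Hypothetical CSV file with numeric feature columns and a target column.
df = pd.read_csv("houses.csv")

# Each row of the feature matrix becomes one input vector for the ANN.
X = df.drop(columns=["price"]).to_numpy(dtype="float32")
y = df["price"].to_numpy(dtype="float32")

print(X.shape)  # (num_samples, num_features)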
- Text data: For text data, you can use a one-hot vector or a word embedding corresponding to each word. For example, in one-hot encoding, if the vocabulary size is |V|, you can represent the word wn as a one-hot vector of size |V| with ‘1’ at the nth element and zeros everywhere else. The problem with the one-hot representation is that the vocabulary size |V| is usually huge, in the tens of thousands at least; hence, it is often better to use word embeddings, which are a lower-dimensional representation of each word. The one-hot encoded array of the digits 0–9 will look as shown below.
import numpy as np

# NumPy has no built-in one_hot function; a minimal version (assumed
# here) indexes into an identity matrix: row i of np.eye(n) is the
# one-hot vector for label i.
def one_hot(labels):
    return np.eye(labels.max() + 1)[labels]

data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(data.shape)
one_hot(data)
(10,)
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
- Images: Images are naturally represented as arrays of numbers and can thus be fed into the network directly. These numbers are the raw pixels of an image (‘pixel’ is short for ‘picture element’). In an image, pixels are arranged in rows and columns (an array of pixel elements). The figure given below shows the image of a handwritten ‘zero’ from the MNIST data set (black and white) and its corresponding NumPy representation as an array of numbers. The pixel values are high where the intensity is high, i.e., where the color is bright, and low in the black regions, as shown below.
- Images (cont.): In a neural network, each pixel of the input image is a feature. For example, the image provided above is an 18 x 18 array; hence, it will be fed into the network as a vector of size 324. Note that the image given above is black and white (also called a grayscale image), so each pixel has only one ‘channel’. If it were a colored image, called an RGB (Red, Green and Blue) image, each pixel would have three channels, one each for red, green and blue. Hence, the number of neurons in the input layer would be 18 x 18 x 3 = 972. The three channels of an RGB image are shown below, followed by a small sketch of this flattening arithmetic.
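As a quick sketch of this arithmetic (using randomly generated pixel values in place of a real image), flattening the pixel array produces the input vector for the network:

import numpy as np

# A dummy 18 x 18 grayscale image: one channel per pixel.
gray = np.random.randint(0, 256, size=(18, 18))
print(gray.flatten().shape)  # (324,) -> 324 input neurons

# A dummy 18 x 18 RGB image: three channels per pixel.
rgb = np.random.randint(0, 256, size=(18, 18, 3))
print(rgb.flatten().shape)   # (972,) -> 18 x 18 x 3 = 972 input neurons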
- Speech: In the case of a speech/voice input, the basic unit of input is the phoneme, a distinct unit of speech in a language. The speech signal comes in the form of waves, and to convert these waves into numeric inputs, you need to use Fourier transforms (you do not need to worry about the details, as the specialised mathematics involved will not be covered in this course). The point to note is that the input after conversion must be numeric so that it can be fed into a neural network.
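Purely as a toy illustration of “waves in, numbers out” (real speech pipelines are more involved and, as noted above, outside the scope of this course), NumPy’s Fourier transform can turn a sampled waveform into numeric magnitudes:

import numpy as np

# A dummy one-second waveform sampled at 8 kHz: a 440 Hz sine wave.
t = np.linspace(0, 1, 8000, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

# The Fourier transform converts the wave into numeric frequency
# magnitudes, which could then be fed into a network.
magnitudes = np.abs(np.fft.rfft(wave))
print(magnitudes.shape)  # (4001,)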
Now that you have learnt how to feed input vectors into neural networks, let’s understand how the output layers are specified.
Depending on the nature of the given task, the outputs of neural networks can either be in the form of classes (if it is a classification problem) or numeric (if it is a regression problem).
One of the commonly used output functions is the softmax function for classification. Take a look at the graphical representation of the softmax function shown below.
A softmax output is similar to the output of a multiclass logistic function, which is commonly used to compute the probability of an input belonging to one of multiple classes. It is given by the following formula:
$$p_i = \frac{e^{w_i \cdot x'}}{\sum_{t=0}^{c-1} e^{w_t \cdot x'}}$$
where $c$ is the number of classes (i.e., neurons in the output layer), $x'$ is the input to the output layer, and the $w_i$’s are the weights associated with the inputs.
Suppose the output layer of a network has 3 neurons, all of which receive the same input $x'$ (coming from the previous layers in the network). The weights associated with them are represented as $w_0$, $w_1$ and $w_2$. In such a case, the probabilities of the input belonging to each of the classes are expressed as follows:
$$p_0 = \frac{e^{w_0 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'} + e^{w_2 \cdot x'}}, \quad p_1 = \frac{e^{w_1 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'} + e^{w_2 \cdot x'}}, \quad p_2 = \frac{e^{w_2 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'} + e^{w_2 \cdot x'}}$$
Also, it is evident from these expressions that $p_0 + p_1 + p_2 = 1$ and that each of $p_0$, $p_1$ and $p_2$ lies in $(0, 1)$.
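A quick numerical check of these properties, with made-up weights and input:

import numpy as np

# Made-up weight vectors for the three output neurons and a made-up
# input x' coming from the previous layers.
w = np.array([[0.5, -1.0],
              [1.5, 0.3],
              [-0.2, 0.8]])
x = np.array([1.0, 2.0])

logits = w @ x                              # w_i . x' for each neuron
p = np.exp(logits) / np.exp(logits).sum()   # softmax

print(p)        # each p_i lies in (0, 1)
print(p.sum())  # 1.0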
Now, try to answer the questions given below.
So, we have seen the softmax function as a commonly used output function in multiclass classification. Now, let’s understand how the softmax function translates to the sigmoid function in the special case of binary classification.
In the case of a sigmoid output, there is only one neuron in the output layer: if there are two classes with probabilities $p_0$ and $p_1$, we know that $p_0 + p_1 = 1$, so we only need to compute one of the two. In other words, the sigmoid function is just a special case of the softmax function (since binary classification is a special case of multiclass classification).
In fact, we can derive the sigmoid function from the softmax function, as shown below. Let’s assume that the softmax function has two neurons with the following outputs:
$$p_0 = \frac{e^{w_0 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'}}, \quad p_1 = \frac{e^{w_1 \cdot x'}}{e^{w_0 \cdot x'} + e^{w_1 \cdot x'}}$$
Consider only $p_1$ and divide both the numerator and the denominator by the numerator. We can then rewrite $p_1$ as:
$$p_1 = \frac{1}{1 + \dfrac{e^{w_0 \cdot x'}}{e^{w_1 \cdot x'}}} = \frac{1}{1 + e^{(w_0 - w_1) \cdot x'}}$$
Now, if we substitute $w = w_1 - w_0$, this becomes $p_1 = \frac{1}{1 + e^{-w \cdot x'}}$, which is exactly the sigmoid function. Voila!
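This can be verified numerically, with arbitrary values standing in for $w_0 \cdot x'$ and $w_1 \cdot x'$:

import numpy as np

z0, z1 = 0.4, 1.7  # arbitrary values of w0.x' and w1.x'

# p1 from the two-neuron softmax...
softmax_p1 = np.exp(z1) / (np.exp(z0) + np.exp(z1))

# ...equals the sigmoid of w.x' with w = w1 - w0.
sigmoid_p1 = 1 / (1 + np.exp(-(z1 - z0)))

print(np.isclose(softmax_p1, sigmoid_p1))  # True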
Now that you have understood how the output is obtained from the softmax function and how different types of inputs are fed into the ANN, let’s learn how to define inputs and outputs for image recognition on the famous MNIST data set for multiclass classification.
Note: There is a correction in the video below. The professor says 764 pixels instead of 784 pixels.
There are various problems you will face while trying to recognise handwritten text using an algorithm, including:
- Noise in the image
- The orientation of the text
- Non-uniformity in the spacing of text
- Non-uniformity in handwriting
The MNIST data set takes care of some of these problems, as the digits are written in a box. Now the only problem the network needs to handle is the non-uniformity in handwriting. Since the images in the MNIST data set are 28 x 28 pixels, the input layer has 784 neurons (each neuron takes one pixel as input), and the output layer has 10 neurons (each giving the probability of the input image belonging to one of the 10 classes). The image is classified into the class with the highest probability in the output layer.
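As a sketch of these dimensions (with random weights standing in for a trained network), the flow from a 28 x 28 image to a predicted class looks like this:

import numpy as np

# A dummy 28 x 28 MNIST-sized image flattened into 784 input features.
image = np.random.rand(28, 28)
x = image.flatten()  # shape (784,)

# Random weights standing in for a trained network mapping
# 784 inputs to 10 output neurons.
W = np.random.randn(10, 784)
logits = W @ x
p = np.exp(logits) / np.exp(logits).sum()  # softmax over the 10 classes

print(np.argmax(p))  # the class with the highest probability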
To revise what you have learnt in this segment: the softmax function is a commonly used output layer activation function for multiclass classification, of which the sigmoid is a special case. You also learnt how to feed input data into an ANN and obtain the output from it. In the next segment, we will move on to the building blocks of a neural network, which will help you understand the working of a neuron and how a network of neurons is built.