Creating A More Advanced Artificial Neural Network

The basic artificial neural network we created earlier is not very accurate. It did reasonably well at recognizing MNIST test images, but noticeably worse on real-world images. To improve its performance, we can make some adjustments to the model and the training data.

Model Layers

We will expand the number of layers to 15 (1 input, 13 hidden, 1 output). We will also use more advanced layers such as 2D convolution layers, which allow the model to learn spatial features and patterns in images, leading to better image classification or object detection performance.

  1. Input Layer – Defines the shape of the input data as a 28×28 grayscale image.
  2. Conv2D (32 filters, 3×3 kernel, ReLU activation) – Applies 32 convolutional filters to the input image, extracting basic features like edges and textures, then passes the result through the ReLU activation function.
  3. BatchNormalization – Normalizes the activations from the previous layer, ensuring the outputs have zero mean and unit variance, which stabilizes and accelerates training.
  4. MaxPooling2D (2×2 pool size) – Reduces the spatial dimensions of the feature map by downsampling, reducing the data size while still retaining the important information.
  5. Dropout (0.25) – Randomly drops 25% of the neurons during training to prevent overfitting by forcing the model to generalize better.
  6. Conv2D (64 filters, 3×3 kernel, ReLU activation) – Applies 64 filters.
  7. BatchNormalization – Normalizes the activations from the previous layer.
  8. MaxPooling2D (2×2 pool size) – Further downsamples the feature map.
  9. Dropout (0.25) – Randomly drops 25% of the neurons during training.
  10. Conv2D (128 filters, 3×3 kernel, ReLU activation) – Applies 128 filters.
  11. BatchNormalization – Normalizes the activations from the previous layer.
  12. Flatten – Converts the 2D feature map into a 1D vector, making it suitable for fully connected layers.
  13. Dense (128 neurons, ReLU activation) – A fully connected layer with 128 neurons, applying ReLU activation to learn higher-level abstractions from the flattened feature map.
  14. Dropout (0.25) – Randomly drops 25% of the neurons during training.
  15. Output Layer, Dense (10 neurons, Softmax activation) – The output layer with 10 neurons (one for each class), using softmax activation to generate a probability distribution over the 10 classes. The class with the highest probability is the predicted class.

This architecture is based on a model that is used to classify 10 classes of objects, so it should also work for the 10 classes of numeric symbols.
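To see how the image shrinks as it passes through these layers, here is a small sketch that traces the output shapes. It assumes the Keras defaults used in the code later on (no padding and stride 1 for Conv2D, non-overlapping windows for MaxPooling2D). The helper names are made up for illustration.

```python
def conv2d_valid(size, kernel=3):
    """Output height/width of a stride-1 convolution with no padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output height/width of non-overlapping pooling (floor division)."""
    return size // pool


def trace_shapes(size=28):
    """Return (layer name, output shape) pairs for the three conv blocks."""
    shapes = []
    for i, filters in enumerate((32, 64, 128), start=1):
        size = conv2d_valid(size)
        shapes.append((f"conv{i}", (size, size, filters)))
        if i < 3:  # the third conv block has no pooling layer
            size = max_pool(size)
            shapes.append((f"pool{i}", (size, size, filters)))
    shapes.append(("flatten", (size * size * 128,)))
    return shapes


for name, shape in trace_shapes():
    print(name, shape)
```

Under those assumptions the 28×28 input becomes 26×26, 13×13, 11×11, 5×5 and finally 3×3 with 128 channels, so the Flatten layer hands a vector of 1152 values to the Dense layer. BatchNormalization and Dropout do not change the shape, which is why they are left out of the trace.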

Activation Functions

Instead of the sigmoid activation function, we will use the Rectified Linear Unit (ReLU) activation function for the hidden layers.

Sigmoid is defined as:

\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{align*}

where:

\(
\begin{array}{ll}
\sigma(x) &\text{ denotes the sigmoid function,} \\
e &\text{ is the base of the natural logarithm (approximately 2.718),} \\
x &\text{ is the input to the function.}
\end{array}
\)

ReLU is defined as:

\begin{align*}
\text{ReLU}(x) &= \max(0, x)
\end{align*}

where:

\(
\begin{array}{ll}
\text{ReLU}(x) & \text{ denotes the Rectified Linear Unit function,} \\
x & \text{ is the input to the function.}
\end{array}
\)

Comparing the two definitions, ReLU is simpler and cheaper to compute: it needs only a comparison with zero, while sigmoid needs an exponential and a division.
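A quick way to get a feel for the difference is to compute both functions and their slopes for a few inputs. This is a plain-Python sketch for illustration only (Keras uses its own vectorized implementations), and the function names are ours.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), never larger than 0.25
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0


for x in (-5.0, 0.5, 5.0):
    print(x, round(sigmoid_grad(x), 4), relu_grad(x))
```

For large positive or negative inputs the sigmoid slope gets very close to zero, while the ReLU slope stays at 1 for any positive input. That is a second reason, besides cost, why ReLU is a common choice for the hidden layers of deeper networks.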

For the output layer, we use softmax:

\begin{equation}
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
\end{equation}

where:

\(
\begin{array}{l}
\text{$z_i$ is the raw score (logit) for the $i$-th class.} \\
\text{$K$ is the total number of classes.} \\
\end{array}
\)

The denominator is the sum of exponentials of all raw scores, which normalizes the output values so that they sum to 1. The output is the probability that the input belongs to class i. This is important for the loss function we will use later.
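The formula can be sketched in a few lines of plain Python. The only addition here is subtracting the largest score before exponentiating, a common trick that avoids overflow without changing the result. The logits in the example are made-up values.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
predicted = probs.index(max(probs))
print([round(p, 2) for p in probs], "predicted class:", predicted)
```

The three outputs come to roughly 0.66, 0.24 and 0.10, which sum to 1, and the predicted class is the one with the largest raw score.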

Optimizer

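Instead of Stochastic Gradient Descent (SGD), we use Adaptive Moment Estimation, or Adam. Adam keeps running averages of the gradient and of the squared gradient and uses them to adapt the step size for each parameter, which usually makes it converge faster than plain SGD.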
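To make the idea concrete, here is a minimal sketch of the Adam update rule applied to a single number. It minimizes the toy function (x − 3)², whose gradient is 2(x − 3). The function name, the toy problem and the learning rate of 0.1 are our own choices for illustration; the beta and epsilon values are the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0  # running average of the gradient (first moment)
    v = 0.0  # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


x_min = adam_minimize(lambda x: 2 * (x - 3.0), x=0.0)
print(x_min)
```

Starting from 0, the result ends up very close to 3, the minimum of the toy function. In the real model the same update is applied to every weight, each with its own m and v.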

Loss Function

Instead of using Mean Squared Error (MSE), which is better suited for regression, we use Categorical Cross-Entropy, which is better suited for multi-class classification tasks. This loss function requires a probability distribution as input, which is what the softmax activation function in the output layer generates.
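The difference between the two losses shows up most clearly when the model is confidently wrong. The following plain-Python sketch compares them on a made-up 3-class example. The small `eps` clamp is our own guard against log(0), not a value taken from Keras.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Only the term for the true class survives because y_true is one-hot
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


label = [0.0, 0.0, 1.0]                 # one-hot label: the true class is 2
confident_right = [0.05, 0.05, 0.90]    # hypothetical softmax outputs
confident_wrong = [0.90, 0.05, 0.05]

for name, pred in (("right", confident_right), ("wrong", confident_wrong)):
    print(name,
          round(categorical_cross_entropy(label, pred), 3),
          round(mean_squared_error(label, pred), 3))
```

Cross-entropy goes from about 0.105 for the good prediction to about 3.0 for the bad one, while MSE only goes from 0.005 to about 0.57. Cross-entropy grows without bound as the probability of the true class approaches zero, so confident mistakes are penalized much more heavily.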

Early Stopping

Previously we used 500 epochs, but that was just a number I picked arbitrarily to get higher accuracy from our basic model. It may be too many and risks overfitting, where the model fits the training dataset too closely and performs worse on new data. For best results, we can stop training when the validation loss is no longer improving.

To do this we will need to have validation data during training. We will be using the test data.
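The stopping rule itself is simple: remember the best validation loss seen so far and give up after a fixed number of epochs without improvement (the "patience"). Here is a plain-Python sketch of that logic, not the Keras implementation. The loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, one per epoch
history = [0.050, 0.030, 0.020, 0.021, 0.022, 0.020, 0.023, 0.025, 0.019]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=5)
print("stopped at epoch", stop_epoch, "best epoch was", best_epoch)
```

With a patience of 5, training stops at epoch 8 because nothing has beaten the loss from epoch 3. Restoring the weights from that best epoch is what the `restore_best_weights=True` option is for. Note that this run never reaches epoch 9, which would have been slightly better, so the patience value is a trade-off between training time and the chance of missing a late improvement.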

Learning Rate Scheduling

The learning rate is a parameter that determines the size of the steps the model takes during optimization to minimize the error (or loss) function. We will use a learning rate scheduler to reduce the learning rate dynamically during training, which may help convergence.

As with early stopping, to do this we will need to have validation data during training. We will be using the test data.
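The scheduler we will use reduces the learning rate whenever the validation loss has stalled for a few epochs. The sketch below shows the idea in plain Python. It is a simplification: the Keras ReduceLROnPlateau callback also has a minimum-improvement threshold, a cooldown and a lower bound on the rate, which are ignored here. The loss values are invented, and 0.001 is the default learning rate of the Keras Adam optimizer.

```python
def reduce_on_plateau(val_losses, initial_lr=0.001, factor=0.5, patience=3):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    rates = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # shrink the step size
                wait = 0      # and start counting again
        rates.append(lr)
    return rates


# Hypothetical validation losses, one per epoch
history = [0.050, 0.030, 0.031, 0.032, 0.033, 0.029, 0.030, 0.030, 0.030]
print(reduce_on_plateau(history))
```

In this example the rate is halved after epoch 5 and again after epoch 9. Halving 0.001 three times gives 0.000125, which is consistent with the `learning_rate: 1.2500e-04` shown in the training log further down.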

Data Augmentation

We augment the data by applying random transformations to the images such as rotation up to 10 degrees, up to 10% zoom, up to 10% horizontal shift, and up to 10% vertical shift. This helps the model’s generalization.
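As a rough illustration of one of these transformations, here is a plain-Python sketch of a random shift on an image stored as a list of rows. It is deliberately simplified: it moves the image by whole pixels and fills the uncovered area with zeros, whereas Keras's ImageDataGenerator works with fractional shifts and, by default, fills with the nearest pixel values. The function names are ours.

```python
import random


def shift_image(image, dx, dy, fill=0.0):
    """Shift a 2D list of pixels right by dx and down by dy, padding with fill."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = image[src_y][src_x]
    return shifted


def random_shift(image, max_fraction=0.1, rng=random):
    """Shift by up to max_fraction of the width and height in each direction."""
    height, width = len(image), len(image[0])
    max_dx = int(width * max_fraction)
    max_dy = int(height * max_fraction)
    return shift_image(image,
                       rng.randint(-max_dx, max_dx),
                       rng.randint(-max_dy, max_dy))


tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(shift_image(tiny, dx=1, dy=0))
```

For a 28×28 digit, a 10% limit means shifts of up to 2 whole pixels in this sketch. Each epoch the model sees slightly different versions of the same digits, so it cannot rely on a stroke always being at exactly the same position.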

Creating The Model

First, let’s install SciPy. SciPy (Scientific Python) is an open-source scientific computing library for Python that builds on NumPy and provides additional functionality for more advanced scientific, mathematical, and statistical operations. Keras’s ImageDataGenerator needs it to apply the image transformations used for data augmentation.

pip3 install scipy

Next, fire up your favorite editor and implement all of the enhancements.

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Check TensorFlow version
print("TensorFlow version:", tf.__version__)

# Build a neural network model
model = Sequential([
    Input(shape=(28, 28, 1)),  # Use Input to specify the shape

    Conv2D(32, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),

    Conv2D(64, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),

    Conv2D(128, (3, 3), activation='relu'),
    BatchNormalization(),
    Flatten(),

    Dense(128, activation='relu'),
    Dropout(0.25),
    Dense(10, activation='softmax')  # 10 units for 10 digit classes (0-9)
])

# Compile the model with optimizer, loss function, and evaluation metric
model.compile(optimizer='adam',  # Adaptive Moment Estimation as the optimization function
              loss='categorical_crossentropy',  # Categorical Crossentropy as the cost function
              metrics=['accuracy'])

# Load the MNIST dataset (handwritten digits)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocess data
# Reshaping the training and testing data to include the channel dimension (28x28x1)
x_train = x_train.reshape((x_train.shape[0], 28, 28, 1))
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1))

# Normalizing the data to a range of 0 to 1 by dividing by 255.0
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Convert the labels to one-hot encoding for multi-class classification
y_train = to_categorical(y_train, 10) # One-hot encode the training labels (10 classes)
y_test = to_categorical(y_test, 10) # One-hot encode the test labels (10 classes)

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1
)
datagen.fit(x_train)

# Callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)

# Train the model on the augmented training data for up to 30 epochs
# (early stopping may end training sooner)
model.fit(datagen.flow(x_train, y_train, batch_size=64),
          epochs=30,
          validation_data=(x_test, y_test),
          callbacks=[early_stopping, lr_scheduler])

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_loss)
print('Test accuracy:', test_acc)

# Save the trained model
try:
    model.save('model.keras')
    print("Model training complete and saved")
except Exception as e:
    print(f"Error saving model: {e}")

The file is also available on the GitHub repo.

python3 create_advanced_model.py

After a (long) while, early stopping ended the training and we got the evaluation results:

:
:
Epoch 27/30
938/938 ━━━━━━━━━━━━━━━━━━━━ 22s 24ms/step - accuracy: 0.9918 - loss: 0.0274 - val_accuracy: 0.9961 - val_loss: 0.0107 - learning_rate: 1.2500e-04
313/313 - 1s - 3ms/step - accuracy: 0.9967 - loss: 0.0107
Test loss: 0.01067555695772171
Test accuracy: 0.9966999888420105
Model training complete and saved

Looks promising! Running the recognizer using this model, we get:

Image: 0.jpeg
1/1 ???????????????????? 0s 132ms/step
Recognized Digit: 0
Image: 1.jpeg
1/1 ???????????????????? 0s 6ms/step
Recognized Digit: 1
Image: 2.jpeg
1/1 ???????????????????? 0s 6ms/step
Recognized Digit: 2
Image: 3.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 3
Image: 4.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 4
Image: 5.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 5
Image: 6.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 6
Image: 7.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 7
Image: 8.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 8
Image: 9.jpeg
1/1 ???????????????????? 0s 5ms/step
Recognized Digit: 9

Now that’s excellent performance! We should be able to use it in an application.