April 2, 2026 · Colin Jaffe · 6 min read

Building a Three-Layer Neural Network with Keras and TensorFlow

Build intelligent neural networks with TensorFlow and Keras

Three-Layer Neural Network Components

Input Layer

Flattens 28x28 pixel images into 784 values. Converts human-readable grid format into machine-readable list for optimal processing.

Hidden Layer

Dense layer with 128 neurons processing 784 inputs. Creates 100,352 weighted connections (784 × 128) that form the mysterious "black box" of pattern recognition.

Output Layer

10 neurons representing digits 0-9. Uses softmax activation to produce probability percentages that sum to 100% for final classification.

Neural Network Architecture by Numbers

784
input values from 28x28 image
128
neurons in hidden dense layer
100,352
weighted connections between layers
10
output neurons for digits 0-9

Building Your Keras Sequential Model

1

Create Sequential Model

Initialize a Keras sequential model that processes layers in order from input to output

2

Add Flatten Layer

Convert 28x28 image grid into single 784-value list with specified input shape

3

Configure Dense Layer

Add hidden layer with 128 neurons using ReLU activation for pattern recognition

4

Define Output Layer

Create final dense layer with 10 neurons and softmax activation for digit classification
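The four steps above can be sketched in a few lines. This is a minimal sketch assuming TensorFlow 2.x; the input shape is declared with `tf.keras.Input`, which plays the same role as passing `input_shape=(28, 28)` to the first layer:

```python
import tensorflow as tf

# Step 1: a Sequential model processes layers in order, input to output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),                   # expect 28x28 grayscale images
    tf.keras.layers.Flatten(),                        # step 2: grid -> 784-value vector
    tf.keras.layers.Dense(128, activation="relu"),    # step 3: hidden layer, 100,352 weights
    tf.keras.layers.Dense(10, activation="softmax"),  # step 4: one probability per digit
])

model.summary()  # prints each layer's output shape and parameter count
```

Calling `model.summary()` is a quick sanity check that the layer shapes and parameter counts match what you expect before moving on to compilation.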

Why Flatten Images for Neural Networks

Neural networks process images more efficiently as one-dimensional lists rather than two-dimensional grids. The computer doesn't need spatial relationships between pixels - it learns to weight each of the 784 individual values to recognize patterns.

"Black box" is the term for a system where something is happening inside that we can't really see into.
The hidden dense layer creates 100,352 weighted connections that effectively recognize patterns, but the specific weights remain opaque to humans even as they prove highly effective for the computer.

ReLU vs Traditional Activation Functions

Feature         | ReLU                | Sigmoid
Complexity      | Simple max function | Complex smooth curve
Performance     | Faster processing   | Slower computation
Negative values | Returns 0           | Maps to 0-1 range
Current usage   | Modern standard     | Legacy approach

Recommended: ReLU is preferred for its simplicity and speed; counterintuitively, the simpler function outperforms the traditional sigmoid.

Key Activation Functions Explained

ReLU Function

Returns maximum of input value or 0. Prevents negative confidence from decreasing other digit probabilities, ensuring only positive contributions to classification.

Softmax Function

Scales output values to 0-1 range where all probabilities sum to 100%. Converts raw neural network outputs into interpretable percentage confidence scores.

Example Output Layer Probability Distribution

Digit 0: 2%
Digit 1: 5%
Digit 2: 53%
Digit 3: 1%
Digit 4: 8%
Digit 5: 12%
Digit 6: 3%
Digit 7: 7%
Digit 8: 6%
Digit 9: 3%
Neural Network Architecture Complete

Your three-layer neural network is now built with proper input flattening, hidden layer processing, and probability-based output classification. The next step is compilation and training to make it functional.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Now we'll construct a concise yet sophisticated three-layer neural network, then dive deep into the architecture, functionality, and parameters that make it work. We're building a Keras sequential model—a straightforward architecture where data flows through each layer in order, making it ideal for beginners and many production applications.

Our first layer is the input layer, implemented as a TensorFlow Keras flatten operation. This layer takes our 28×28 pixel images and reshapes them into a format the network can process efficiently. The input_shape parameter of (28, 28) tells the network to expect square images of this dimension—a standard format for the MNIST handwritten digit dataset we're working with.

The second layer represents the heart of our network: a dense (fully connected) hidden layer with 128 neurons. This might seem substantial, but when you consider the computational complexity we're about to explore, you'll realize the true scale of what's happening here. Each neuron uses ReLU (Rectified Linear Unit) activation, a choice that's become the gold standard in modern neural networks for reasons we'll examine shortly.

Our final layer is the output layer—technically another dense layer, but functionally distinct in its purpose. It contains exactly 10 neurons because we're classifying 10 possible outcomes: digits 0 through 9. This layer employs softmax activation, which transforms raw neural outputs into probability distributions that sum to 1.0, giving us interpretable confidence scores for each digit class.

When we pass these layers to Keras, TensorFlow constructs the complete neural network architecture in milliseconds. But understanding what happens beneath the surface requires examining each component in detail.

Let's start with the flatten layer and why it's essential for our architecture. "Flattening" is a fundamental preprocessing step that converts multidimensional arrays into one-dimensional vectors. Our 28×28 image matrix becomes a single array of 784 values, maintaining the exact same pixel data in the same sequence.

While the 2D grid structure helps humans visualize and interpret images, neural networks operate more efficiently with linearized data. The network doesn't need to understand spatial relationships between adjacent pixels—instead, it learns to weight each of the 784 individual pixel values according to their importance for digit classification. This approach allows the network to discover patterns that might not be immediately obvious to human observers, often finding correlations across pixels that aren't spatially adjacent but are mathematically significant.
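Flattening can be seen directly with numpy. The sketch below uses a stand-in image whose pixel values equal their row-major index, so it's easy to check that the order is preserved exactly:

```python
import numpy as np

# Stand-in for one 28x28 grayscale image: each pixel's value is its
# row-major index, so flattening should reproduce 0, 1, 2, ... 783.
image = np.arange(28 * 28).reshape(28, 28)

flat = image.flatten()  # 2D grid -> 1D vector, same values in the same sequence

flat.shape                       # → (784,)
flat[1 * 28 + 1] == image[1, 1]  # pixel (row 1, col 1) lands at index 29 → True
```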


Machine learning systems excel at processing normalized, structured data formats. A simple array of numerical values is computationally optimal, allowing for efficient matrix operations that form the backbone of neural network calculations.

The second layer is where the real complexity—and power—emerges. This dense or hidden layer operates as what practitioners often call a "black box," a system whose internal workings are opaque even to its creators. Here's where the computational scale becomes impressive: our 784 input values connect to 128 neurons, creating 100,352 individual weighted connections (784 × 128).

Each connection has its own weight parameter that the network adjusts during training. The network analyzes patterns like "when pixel 247 has a high value and pixel 156 has a low value, there's a 73% correlation with the digit being a 5." These weights are learned through exposure to thousands of training examples, with the network gradually optimizing each connection to improve classification accuracy.

This is why even major tech companies like Google and Meta sometimes can't fully explain their neural network decisions. The models work exceptionally well—often achieving superhuman performance on specific tasks—but the reasoning involves hundreds of thousands or millions of weighted connections that don't translate into human-interpretable logic. Google's search algorithms, recommendation systems, and language models all operate on this principle: empirical effectiveness over explicable reasoning.

The output layer brings everything together with 10 neurons representing our digit classes. Each neuron receives weighted inputs from all 128 hidden layer neurons (another 1,280 connections), producing raw scores that indicate the network's confidence for each digit. Before we can interpret these scores, they pass through the softmax activation function, which normalizes them into a probability distribution.
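The connection counts in the last two paragraphs can be verified with plain arithmetic. Bias terms, which Keras also trains but the prose doesn't count, are included here for completeness:

```python
hidden_weights = 784 * 128   # flattened input -> hidden layer connections
output_weights = 128 * 10    # hidden layer -> output layer connections
biases = 128 + 10            # one bias per hidden and output neuron

total_params = hidden_weights + output_weights + biases
print(hidden_weights, output_weights, total_params)  # 100352 1280 101770
```

The total, 101,770, is what `model.count_params()` reports for this architecture.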

The result is a set of probabilities that sum to 1.0, answering questions like: "Is this a 3? 89% confident. An 8? 7% confident. A 5? 2% confident." The highest probability wins, but having access to the full distribution provides valuable information about the network's uncertainty and alternative hypotheses.
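Reading a prediction off such a distribution is a one-liner. The sketch below reuses the example distribution shown earlier, where Digit 2 holds 53% of the probability mass:

```python
import numpy as np

# The example output distribution from earlier, as fractions of 1.0
probs = np.array([0.02, 0.05, 0.53, 0.01, 0.08, 0.12, 0.03, 0.07, 0.06, 0.03])

predicted_digit = int(np.argmax(probs))     # index of the highest probability → 2
confidence = float(probs[predicted_digit])  # the winner's probability → 0.53
runner_up = int(np.argsort(probs)[-2])      # second-best hypothesis → digit 5
```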


Understanding activation functions is crucial for neural network design. ReLU (Rectified Linear Unit) has become the dominant choice in modern architectures, despite—or perhaps because of—its elegant simplicity. The function simply returns max(0, n): if the input is positive, it passes through unchanged; if negative, it becomes zero.

This seemingly basic operation solves several critical problems. First, it prevents negative activations from diminishing confidence in other classifications—a neuron that strongly indicates "this isn't a 5" doesn't reduce the probability of it being a 7. Second, ReLU addresses the vanishing gradient problem that plagued earlier activation functions like sigmoid and tanh, allowing networks to train more effectively across many layers.

ReLU's computational efficiency also matters at scale. Unlike sigmoid functions that require expensive exponential calculations, ReLU operations are nearly instantaneous. When you're processing millions of parameters across thousands of training iterations, this efficiency compounds significantly. Modern alternatives like Leaky ReLU, ELU, and Swish have emerged, but ReLU remains the reliable default for most applications.
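The function itself is a one-line numpy sketch, max(0, n) applied elementwise:

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: positive inputs pass through unchanged,
    # negative inputs are clamped to zero.
    return np.maximum(0, x)

relu(np.array([-2.0, -0.5, 0.0, 0.5, 3.0]))  # → [0., 0., 0., 0.5, 3.]
```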

Softmax activation serves a different but equally important role in the output layer. Raw neural network outputs can be any real number—positive, negative, large, or small. Softmax transforms these raw logits into a proper probability distribution where all values fall between 0 and 1 and sum to exactly 1.0.

The mathematical elegance of softmax lies in its ability to amplify differences between competing classes while maintaining probabilistic interpretation. A raw output of [2.1, 1.8, 0.3] becomes approximately [0.52, 0.39, 0.09] after softmax, clearly indicating the network's preference while preserving the relative confidence levels. This makes softmax indispensable for multi-class classification problems where you need both a decision and a confidence measure.
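A minimal numpy sketch of softmax; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the resulting probabilities unchanged:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability (cancels out after normalizing),
    # exponentiate, then divide by the total so the outputs sum to 1.
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()

probs = softmax(np.array([2.1, 1.8, 0.3]))
print(np.round(probs, 2))  # [0.52 0.39 0.09]
# probs.sum() is 1.0 up to floating-point rounding
```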

With our architecture defined and its components understood, we're ready to move beyond static structure into dynamic training. The next phase involves compiling the model with optimization algorithms and loss functions, then feeding it data to learn from—transforming our carefully designed but untrained network into a working digit classifier.


Key Takeaways

1. Sequential Keras models process layers in order: flatten input, dense hidden layer, and output classification layer for digit recognition
2. Flattening converts 28x28 image grids into 784-value lists because neural networks process one-dimensional data more efficiently than spatial grids
3. Hidden dense layers with 128 neurons create 100,352 weighted connections that form an opaque but effective pattern recognition system
4. The ReLU activation function simply returns the maximum of the input value and zero, preventing negative confidence from leaking into other classes while being faster to compute than the sigmoid
5. Output layers use softmax activation to scale 10 neurons into probabilities that sum to 100% for digit classification from 0-9
6. The black-box nature means even companies like Google can't fully explain how their neural network weights produce effective results
7. Neural networks learn to weight individual pixel values rather than spatial relationships, making flattened input optimal for machine processing
8. Building the network architecture is just the first step; compilation and training are required to make the model functional for digit recognition
