NN & Deep Learning

Deep learning is a branch of machine learning built on neural networks. Traditional machine learning works well when the data is small and relatively simple, but when we have a lot of data, deep learning does a better job of finding patterns.
Neural networks are loosely modeled on the human brain. Just as the brain uses many connected neurons to recognize things, a neural network uses layers of connected nodes to work out an answer.
In traditional machine learning, we usually give the model explicit features (like “height” or “color”) so it can learn. With neural networks, the model can learn the features by itself.
For example, if we give the computer a picture of a cat, we don’t have to tell it “look for whiskers” or “look for ears.” The network learns step by step:
  • First layers find simple shapes (lines, edges).
  • Middle layers combine them into parts (ears, eyes, tail).
  • Final layers put everything together to say, “This is a cat.”
The network learns through trial and error. If it makes a mistake, it adjusts its “weights” (connections) until it gets better at making the right choice.
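To make this concrete, here is a minimal sketch of that trial-and-error loop: a single sigmoid neuron trained on a made-up toy dataset (the data, learning rate, and epoch count are illustrative assumptions, not from any real example):

```python
import numpy as np

# Toy dataset (made up): 4 examples, 2 features, labels follow an AND-like rule
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # the "weights" (connections)
b = 0.0
lr = 0.5                 # learning rate: how big each correction step is

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    p = sigmoid(X @ w + b)             # forward pass: current guesses
    error = p - y                      # how wrong each guess is
    w -= lr * (X.T @ error) / len(y)   # adjust weights to reduce mistakes
    b -= lr * error.mean()

print(np.round(sigmoid(X @ w + b), 2))  # predictions move toward [0, 0, 0, 1]
```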
 
[Image: a neural network with input, hidden, and output layers classifying a cat vs. a dog]
This picture shows how a neural network can tell if an image is a cat or a dog.
  1. Input Layer (left side):
    The photo of the animal (a cat in this case) goes into the network.
  2. Hidden Layer 1:
    The network looks for small parts, like the cat’s ears, eyes, legs, or tail. It also checks for dog parts, like the dog’s ears or tail.
  3. Hidden Layer 2:
    These small parts are combined into bigger features, like the cat’s head and body or the dog’s head and body.
  4. Output Layer (right side):
    Finally, the network makes a decision:
      • If the features look more like a cat → it outputs Cat.
      • If they look more like a dog → it outputs Dog.
The green and orange lines show how information flows forward from one layer to the next. The network learns by adjusting these connections until it gets good at telling cats and dogs apart.
 
Comparison between Statistical Machine Learning (ML) and Deep Learning (DL):

| Aspect | Statistical Machine Learning | Deep Learning |
| --- | --- | --- |
| Data Requirement | Works well with small to medium-sized datasets. | Requires very large datasets to perform well. |
| Feature Engineering | Relies heavily on human-designed features (domain knowledge is important). | Learns features automatically from raw data (minimal manual feature engineering). |
| Complexity of Patterns | Suitable for simpler or structured data patterns (e.g., tabular data). | Excels at learning highly complex, nonlinear, and unstructured patterns (e.g., images, audio, text). |
| Interpretability | Models are usually easier to interpret and explain. | Models are often “black boxes” and harder to interpret. |
| Examples of Algorithms | Linear/Logistic Regression, SVM, Random Forest, k-NN. | CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), Transformers. |
 
Different types of deep learning architectures:

| Architecture | How It Works | Best For | Example Applications |
| --- | --- | --- | --- |
| Feed Forward Neural Network (FNN) | Information moves in one direction (input → hidden layers → output). No loops. | General-purpose tasks with structured/tabular data. | Credit scoring, basic regression/classification, simple recommendation systems. |
| Recurrent Neural Network (RNN) | Has loops, so it can remember previous inputs and process sequences over time. | Sequential data (time-dependent). | Text generation, speech recognition, stock price prediction. |
| Convolutional Neural Network (CNN) | Uses convolution filters to detect patterns (edges, textures, shapes) in data. | Image, video, and spatial data. | Image recognition (cats vs dogs), medical imaging, object detection, facial recognition. |
| Transformers | Uses self-attention to understand relationships in data without needing sequence-by-sequence processing. | Very large-scale sequence and language tasks. | Large Language Models (ChatGPT, BERT), machine translation, document summarization. |
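As a concrete example of the first row of the table, here is a minimal feed-forward network sketch in PyTorch (the layer sizes, 4-feature input, and random batch are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A small feed-forward network: input -> hidden layer -> output, no loops.
model = nn.Sequential(
    nn.Linear(4, 16),  # 4 input features (e.g., columns of a table)
    nn.ReLU(),         # nonlinearity between layers
    nn.Linear(16, 2),  # 2 output classes
)

x = torch.randn(8, 4)  # a batch of 8 made-up examples
logits = model(x)      # information flows forward only
print(logits.shape)    # torch.Size([8, 2])
```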
 

Activation functions in NN

  • It’s like a switch that decides how much a neuron should “fire.”
  • Without it, a neural network is just doing straight-line math.
  • With it, the network can bend and learn more complex patterns.
An activation function introduces nonlinearity, which is what lets the network model problems that a straight line cannot solve (see the sketch below).
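A quick way to see why this matters: without an activation function, stacking linear layers collapses into a single linear layer. The sketch below (plain NumPy, with random made-up weights) demonstrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # "layer 1" weights (made up)
W2 = rng.normal(size=(2, 3))   # "layer 2" weights (made up)
x = rng.normal(size=4)

# Two layers with no activation collapse into one linear map...
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True: no extra power gained

# ...but a nonlinearity (here ReLU) between them breaks the collapse.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(with_relu, one_layer))    # False (almost always)
```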

🔹 What is the sigmoid function?

The sigmoid takes any number (big, small, positive, or negative) and squeezes it between 0 and 1:
σ(x) = 1 / (1 + e^(−x))

[Image: the S-shaped sigmoid curve]
  • If the input is very negative → output is close to 0
  • If the input is very positive → output is close to 1
  • If the input is around 0 → output is 0.5

🔹 Why use sigmoid for binary classification?

Because binary problems are just yes/no, 0/1, true/false.
  • Sigmoid makes the model’s output look like a probability.
    • 0.9 → 90% chance it’s “yes”
    • 0.1 → 10% chance it’s “yes” (so likely “no”)
  • That’s why it’s perfect for binary classification.
The sigmoid is like a squasher. No matter what numbers the model calculates, sigmoid squeezes them into a range between 0 and 1, so we can treat the output as the probability of something being “yes” (class 1) or “no” (class 0).
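Here is a minimal sketch of that squashing behavior (the raw scores below are made-up examples):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Whatever raw scores the model computes...
scores = np.array([-5.0, -1.0, 0.0, 2.0, 6.0])

# ...sigmoid squeezes them into (0, 1), readable as P(class = 1).
print(np.round(sigmoid(scores), 3))
# [0.007 0.269 0.5   0.881 0.998]
```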
 
SoftMax
For multiclass classification, such as image or handwritten-digit recognition, where the target variable has more than two classes, the SoftMax activation function is used. It takes the inputs and produces a probability for each class; the class with the highest probability is taken as the prediction.
softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
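A minimal sketch of SoftMax in NumPy (the three-class scores are made up; subtracting the max is a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up raw scores for 3 classes
probs = softmax(scores)
print(np.round(probs, 3))  # [0.659 0.242 0.099]
print(probs.sum())         # ≈ 1.0: a valid probability distribution
print(probs.argmax())      # 0: the predicted class
```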
ReLU. This is the most widely used activation function for hidden layers. ReLU checks whether a value is > 0 and returns the value if true, or 0 if false. Because it is very fast to compute, it is the usual choice for hidden layers.
It is not recommended, however, when many input values are negative: these all get mapped to 0 and become “dead neurons” that never fire, which is why we use the Leaky ReLU function instead (sketched below).
ReLU(x) = max(0, x)

Leaky ReLU(x) = x if x > 0, else αx (for a small α such as 0.01)
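A minimal sketch of both functions in NumPy (the sample values and α = 0.01 are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)  # keep positives, zero out negatives

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope keeps negatives "alive"

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(leaky_relu(z))  # [-0.03  -0.005  0.     0.5    3.   ]
```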
 
 
 
Transformer Architecture
[Image: Transformer architecture diagram]
## Stage 1 of a Transformer: Tokenization → IDs → Embeddings → Positional Encoding

When you enter a sentence like **“The wine is amazing”**, the Transformer converts it into numerical vectors before any attention layers. The steps are:

### 1. Tokenization

The sentence is split into tokens.
- BERT: adds **[CLS]** at the beginning and **[SEP]** at the end.
- GPT: does *not* use CLS/SEP and usually splits words into smaller subword pieces (BPE).

Example (BERT): [CLS], "The", "wine", "is", "amazing", [SEP]

### 2. Convert Tokens to IDs

Each token is mapped to an integer using the model’s vocabulary:

- "The" → 101
- "wine" → 3482
- "is" → 2003
- "amazing" → 5934

These IDs represent the index of the token inside the vocabulary table.

### 3. Token Embeddings

Each token ID retrieves a learned embedding vector from the embedding matrix:
- BERT-base uses **768-dimensional** embeddings
- GPT-3 uses **≈12,288-dimensional** embeddings

These embeddings are not static; they are learned during training.

### 4. Positional Encoding

Transformers do not read text in order by themselves, so each position (0, 1, 2, 3, …) has a positional vector. The final vector for each token is:

**FinalEmbedding = TokenEmbedding + PositionalEncoding**

This gives the model both:
- token meaning
- token position (word order)

The resulting sequence of vectors is passed into the self-attention layers.

    "The wine is amazing"
             │
             ▼
        Tokenization
             │
             ▼
    ["The","wine","is","amazing"]
             │
             ▼
      Vocabulary Lookup
             │
             ▼
     [101, 3482, 2003, 5934]
             │
             ▼
    Token Embedding Lookup
             │
             ▼
    [Embedding vectors (768-d or ~12k-d)]
             │
             ▼
    Add Positional Encoding
             │
             ▼
    Final Input Vectors to Transformer

In short: when you input a sentence like “The wine is amazing”, the Transformer first tokenizes the text. BERT adds special tokens like [CLS] and [SEP], while GPT does not. Each token is then mapped to an integer ID using the model’s vocabulary. These IDs are looked up in a learned embedding matrix to produce high-dimensional vectors (768 for BERT, ~12K for GPT-3). Since Transformers do not inherently understand order, a positional encoding vector is added to each token embedding, giving the model both the meaning of each token and its position in the sequence. The resulting vectors are the input to the attention layers.
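The whole Stage 1 pipeline can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the whitespace tokenizer, the 4-word vocabulary, the random embedding matrix (a real model’s is learned), and the sinusoidal positional encoding from the original Transformer paper (BERT actually learns its position embeddings):

```python
import numpy as np

# Toy vocabulary and embedding matrix (random here; learned in a real model)
vocab = {"the": 0, "wine": 1, "is": 2, "amazing": 3}
d_model = 8  # 768 in BERT-base, ~12,288 in GPT-3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need"
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

tokens = "The wine is amazing".lower().split()  # 1. tokenization (toy version)
ids = [vocab[t] for t in tokens]                # 2. tokens -> integer IDs
token_emb = embedding_matrix[ids]               # 3. IDs -> embedding vectors
final = token_emb + positional_encoding(len(ids), d_model)  # 4. add position
print(final.shape)  # (4, 8): one vector per token, ready for self-attention
```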