Lesson 2: Mathematical Foundations for AI

Learning Objectives

By the end of this lesson, you will:

  • Master essential linear algebra concepts and operations
  • Understand probability theory and statistical concepts crucial for AI
  • Grasp calculus fundamentals needed for machine learning optimization
  • Learn information theory basics that underpin modern AI
  • Implement mathematical operations efficiently using NumPy
  • Apply these concepts to real AI problems

1. Linear Algebra: The Language of AI

Linear algebra is the mathematical foundation that allows computers to process and manipulate data efficiently. In AI, virtually every piece of data (images, text, audio) is represented as vectors and matrices.

1.1 Scalars, Vectors, and Matrices

Scalars

A scalar is a single number.

Examples: 5, -3.14, 0.5

Vectors

A vector is an ordered list of numbers (1-dimensional array).

Examples: 
v = [1, 2, 3]        (3-dimensional vector)
w = [0.5, -1.2, 4.7] (3-dimensional vector)

Geometric Interpretation: A vector represents a point in space or a direction with magnitude.

Matrices

A matrix is a 2-dimensional array of numbers.

Example:
A = [[1, 2, 3],
     [4, 5, 6]]      (2×3 matrix: 2 rows, 3 columns)

AI Context:

  • Images: Each pixel's color values form a matrix
  • Text: Word embeddings are vectors in high-dimensional space
  • Neural Networks: Weights are stored in matrices
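
These objects map directly onto NumPy arrays, the library used throughout the hands-on section of this lesson. A minimal sketch:

python
import numpy as np

scalar = 3.14                          # a single number
vector = np.array([1, 2, 3])           # 1-D array, shape (3,)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])         # 2-D array, shape (2, 3)

print(vector.shape)    # (3,)
print(matrix.shape)    # (2, 3)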

1.2 Essential Vector Operations

Vector Addition and Subtraction

v = [1, 2, 3]
w = [4, 5, 6]

v + w = [1+4, 2+5, 3+6] = [5, 7, 9]
v - w = [1-4, 2-5, 3-6] = [-3, -3, -3]

Scalar Multiplication

v = [1, 2, 3]
2 * v = [2*1, 2*2, 2*3] = [2, 4, 6]

Dot Product (Inner Product)

The dot product measures how similar two vectors are:

v · w = v₁w₁ + v₂w₂ + v₃w₃

Example:
v = [1, 2, 3]
w = [4, 5, 6]
v · w = 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32

Geometric Interpretation:

  • If dot product = 0: the vectors are perpendicular (orthogonal)
  • If dot product > 0: the angle between them is less than 90° (they point in broadly similar directions)
  • If dot product < 0: the angle between them is greater than 90° (they point in broadly opposite directions)
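
A common way to turn the dot product into a bounded similarity score is cosine similarity: the dot product divided by the product of the magnitudes, which equals the cosine of the angle between the vectors. A minimal sketch (the helper name cosine_similarity is just for illustration):

python
import numpy as np

def cosine_similarity(v, w):
    """Dot product normalized by magnitudes; ranges from -1 to 1."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
print(f"cos(angle) = {cosine_similarity(v, w):.4f}")    # ≈ 0.97: nearly the same direction
print(f"cos(angle) = {cosine_similarity(v, -w):.4f}")   # ≈ -0.97: nearly opposite directions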

Vector Magnitude (Length)

||v|| = √(v₁² + v₂² + v₃²)

Example:
v = [3, 4]
||v|| = √(3² + 4²) = √(9 + 16) = √25 = 5

1.3 Matrix Operations

Matrix Addition and Subtraction

A = [[1, 2],     B = [[5, 6],
     [3, 4]]          [7, 8]]

A + B = [[1+5, 2+6],
         [3+7, 4+8]]
      = [[6, 8],
         [10, 12]]

Matrix Multiplication

Key Rule: For A×B to be valid, the number of columns in A must equal the number of rows in B.

A (2×3) × B (3×2) = C (2×2)

A = [[1, 2, 3],     B = [[7, 8],
     [4, 5, 6]]          [9, 10],
                         [11, 12]]

C[i,j] = Σₖ A[i,k] × B[k,j]    (sum over the shared index k)

C = [[1*7+2*9+3*11, 1*8+2*10+3*12],
     [4*7+5*9+6*11, 4*8+5*10+6*12]]
  = [[58, 64],
     [139, 154]]
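
To make the summation rule concrete, here is a from-scratch sketch of C[i,j] = Σₖ A[i,k] × B[k,j] using plain Python loops, checked against NumPy (the same exercise appears in the practice assignment at the end of this lesson):

python
import numpy as np

def matmul(A, B):
    """Naive matrix multiplication: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    n_rows, inner = A.shape
    inner_b, n_cols = B.shape
    assert inner == inner_b, "columns of A must equal rows of B"
    C = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(inner):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])
print(matmul(A, B))                        # [[ 58.  64.] [139. 154.]]
print(np.allclose(matmul(A, B), A @ B))    # True: matches NumPy's implementation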

Matrix Transpose

Flip rows and columns:

A = [[1, 2, 3],     A^T = [[1, 4],
     [4, 5, 6]]            [2, 5],
                           [3, 6]]

Identity Matrix

A square matrix with 1s on the diagonal and 0s elsewhere:

I₃ = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1]]

Property: A × I = I × A = A (like multiplying by 1)

Matrix Inverse

For square matrix A, if A⁻¹ exists: A × A⁻¹ = A⁻¹ × A = I

AI Application: Solving systems of equations, optimization problems
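
As a sketch of the "solving systems of equations" application: in practice you rarely form A⁻¹ explicitly; NumPy's np.linalg.solve finds x with Ax = b directly and more stably. The 2×2 system below is just an illustrative example:

python
import numpy as np

# Solve the system:  2x + 1y = 5,  1x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

x = np.linalg.solve(A, b)      # preferred over np.linalg.inv(A) @ b
print(x)                       # [1. 3.]
print(np.allclose(A @ x, b))   # True: the solution satisfies the system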

1.4 Why Linear Algebra Matters in AI

  1. Data Representation: All data is stored as vectors/matrices
  2. Neural Networks: Weights and activations are matrices
  3. Image Processing: Images are matrices of pixel values
  4. Natural Language: Words are represented as vectors
  5. Optimization: Finding best model parameters uses linear algebra
  6. Dimensionality Reduction: PCA, t-SNE use matrix decomposition

2. Probability and Statistics: Handling Uncertainty

AI systems must make decisions under uncertainty. Probability theory provides the mathematical framework for this.

2.1 Basic Probability Concepts

Probability Definition

P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)    (when all outcomes are equally likely)

Range: 0 ≤ P(A) ≤ 1

  • P(A) = 0: Event never happens
  • P(A) = 1: Event always happens
  • P(A) = 0.5: Event happens half the time

Sample Space and Events

  • Sample Space (Ω): Set of all possible outcomes
  • Event: Subset of sample space

Example: Rolling a die

  • Sample Space: Ω = {1, 2, 3, 4, 5, 6}
  • Event A (rolling even): A = {2, 4, 6}
  • P(A) = 3/6 = 0.5

2.2 Fundamental Rules

Addition Rule

For events A and B:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

If A and B are mutually exclusive (can't happen together):

P(A ∪ B) = P(A) + P(B)

Multiplication Rule

For independent events A and B:

P(A ∩ B) = P(A) × P(B)

Conditional Probability

Probability of A given that B has occurred:

P(A|B) = P(A ∩ B) / P(B)

AI Application: "What's the probability this email is spam given it contains the word 'lottery'?"

2.3 Bayes' Theorem

One of the most important theorems in AI:

P(A|B) = P(B|A) × P(A) / P(B)

Components:

  • P(A|B): Posterior probability (what we want to find)
  • P(B|A): Likelihood (evidence given hypothesis)
  • P(A): Prior probability (initial belief)
  • P(B): Marginal probability (total probability of evidence)

Example: Medical Diagnosis

  • A: Patient has disease (1% of population)
  • B: Test is positive (test is 99% accurate)
P(Disease|Positive Test) = P(Positive|Disease) × P(Disease) / P(Positive)
                         = 0.99 × 0.01 / P(Positive)
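
Completing the arithmetic in code, under the assumption that "99% accurate" means a 99% true-positive rate and a 1% false-positive rate (a sketch; note the surprising result that the posterior is only 50%):

python
# Assumption for illustration: P(Positive | Disease) = 0.99 and
# P(Positive | No Disease) = 0.01
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.01

# Law of total probability gives the denominator P(Positive)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(f"P(Positive) = {p_positive:.4f}")                     # 0.0198
print(f"P(Disease | Positive) = {p_disease_given_pos:.3f}")  # 0.500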

2.4 Probability Distributions

Discrete Distributions

Bernoulli Distribution: Single trial with two outcomes

  • Examples: Coin flip, email spam/not spam
  • Parameters: p (probability of success)
  • P(X = 1) = p, P(X = 0) = 1-p

Binomial Distribution: Multiple independent Bernoulli trials

  • Examples: Number of heads in 10 coin flips
  • Parameters: n (trials), p (success probability)
  • P(X = k) = C(n,k) × p^k × (1-p)^(n-k)

Continuous Distributions

Uniform Distribution: All values equally likely in an interval

  • Example: Random number generator
  • Parameters: a (minimum), b (maximum)

Normal (Gaussian) Distribution: Bell-shaped curve

  • Examples: Heights, measurement errors
  • Parameters: μ (mean), σ² (variance)
  • Formula: f(x) = (1/√(2πσ²)) × e^(-(x-μ)²/(2σ²))

Properties of Normal Distribution:

  • 68% of data within 1 standard deviation
  • 95% of data within 2 standard deviations
  • 99.7% of data within 3 standard deviations
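
A quick check of those 68-95-99.7 percentages using SciPy's normal CDF (a sketch; the percentages are the same for any mean and standard deviation):

python
from scipy import stats

for k in (1, 2, 3):
    # Probability of landing within k standard deviations of the mean
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {p:.4f}")
# Within 1: 0.6827, within 2: 0.9545, within 3: 0.9973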

2.5 Statistical Measures

Central Tendency

  • Mean: μ = (Σx_i) / n
  • Median: Middle value when data is sorted
  • Mode: Most frequently occurring value

Variability

  • Variance: σ² = Σ(x_i - μ)² / n
  • Standard Deviation: σ = √(variance)
  • Range: Maximum - Minimum

Relationships

  • Covariance: Measures how two variables change together
  Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)]
  • Correlation: Normalized covariance (-1 to 1)
  ρ(X,Y) = Cov(X,Y) / (σ_X × σ_Y)
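
A short sketch of both measures on synthetic data (the variables and coefficients here are made up for illustration):

python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 1_000)
y = 2 * x + rng.normal(0, 0.5, 1_000)   # y moves with x, plus some noise

cov_xy = np.cov(x, y)[0, 1]             # off-diagonal entry of the covariance matrix
corr_xy = np.corrcoef(x, y)[0, 1]       # off-diagonal entry of the correlation matrix

print(f"Cov(X, Y) = {cov_xy:.3f}")
print(f"rho(X, Y) = {corr_xy:.3f}")     # close to +1: strong positive relationship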

3. Calculus: The Engine of Learning

Machine learning models improve by minimizing error functions. Calculus provides the tools to find these minimum points.

3.1 Derivatives: Rate of Change

Definition

The derivative measures how a function changes as its input changes:

f'(x) = lim(h→0) [f(x+h) - f(x)] / h

Common Derivatives

d/dx [c] = 0                    (constant)
d/dx [x^n] = n×x^(n-1)         (power rule)
d/dx [e^x] = e^x               (exponential)
d/dx [ln(x)] = 1/x             (natural log)
d/dx [sin(x)] = cos(x)         (sine)
d/dx [cos(x)] = -sin(x)        (cosine)

Chain Rule

For composite functions f(g(x)):

d/dx [f(g(x))] = f'(g(x)) × g'(x)

Example:

f(x) = (x² + 1)³
f'(x) = 3(x² + 1)² × 2x = 6x(x² + 1)²
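
A quick numerical sanity check of that chain-rule result, comparing the analytical derivative with a central-difference approximation (a sketch; the evaluation point and step size are arbitrary):

python
def f(x):
    return (x**2 + 1)**3

def f_prime(x):
    # Chain rule: 3(x^2 + 1)^2 * 2x = 6x(x^2 + 1)^2
    return 6 * x * (x**2 + 1)**2

x, h = 1.5, 1e-5
numerical = (f(x + h) - f(x - h)) / (2 * h)    # central difference
print(f"Analytical: {f_prime(x):.6f}")
print(f"Numerical:  {numerical:.6f}")          # the two should agree to several decimals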

3.2 Partial Derivatives

For functions with multiple variables f(x,y):

∂f/∂x = derivative with respect to x (treating y as constant)
∂f/∂y = derivative with respect to y (treating x as constant)

Example:

f(x,y) = x²y + 3xy²
∂f/∂x = 2xy + 3y²
∂f/∂y = x² + 6xy
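
The same finite-difference idea verifies partial derivatives: nudge one variable while holding the other fixed (a sketch using the example above, evaluated at the arbitrary point (1, 2)):

python
def f(x, y):
    return x**2 * y + 3 * x * y**2

x, y, h = 1.0, 2.0, 1e-5

# Analytical partials at (1, 2): ∂f/∂x = 2xy + 3y² = 16, ∂f/∂y = x² + 6xy = 13
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
print(f"∂f/∂x ≈ {df_dx:.4f}")   # ≈ 16
print(f"∂f/∂y ≈ {df_dy:.4f}")   # ≈ 13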

3.3 Gradients

The gradient ∇f is a vector of all partial derivatives:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Key Properties:

  • Points in direction of steepest increase
  • Magnitude indicates rate of increase
  • Used in optimization algorithms

3.4 Optimization: Finding Minima and Maxima

Critical Points

Where derivative equals zero: f'(x) = 0

Second Derivative Test

  • f''(x) > 0: Local minimum
  • f''(x) < 0: Local maximum
  • f''(x) = 0: Inconclusive

Gradient Descent Algorithm

To minimize function f(x):

1. Start with initial guess x₀
2. Repeat: x_{n+1} = x_n - α∇f(x_n)
3. Stop when gradient ≈ 0

Where α is the learning rate.

AI Application: This is how neural networks learn!


4. Information Theory: Measuring Information

Information theory quantifies how much information is contained in data and how efficiently it can be transmitted.

4.1 Information and Entropy

Information Content

The information content of an event with probability p:

I(x) = -log₂(p)  bits

Intuition: Rare events carry more information than common events.

Examples:

  • Certain event (p=1): I = -log₂(1) = 0 bits
  • Coin flip (p=0.5): I = -log₂(0.5) = 1 bit
  • Rare event (p=0.01): I = -log₂(0.01) ≈ 6.64 bits

Entropy

Average information content across all possible events:

H(X) = -Σ p(x) log₂ p(x)

Properties:

  • H(X) ≥ 0 (always non-negative)
  • H(X) = 0 when outcome is certain
  • H(X) is maximized when all outcomes are equally likely

Example: Fair coin

H(Coin) = -[0.5×log₂(0.5) + 0.5×log₂(0.5)] = 1 bit

4.2 Cross-Entropy

Measures the difference between two probability distributions:

H(p,q) = -Σ p(x) log q(x)

Where:

  • p(x): True distribution
  • q(x): Predicted distribution

AI Application: Cross-entropy is a common loss function in classification tasks.

4.3 Kullback-Leibler (KL) Divergence

Measures how one probability distribution differs from another:

D_KL(p||q) = Σ p(x) log(p(x)/q(x))

Properties:

  • D_KL(p||q) ≥ 0 (always non-negative)
  • D_KL(p||q) = 0 if and only if p = q
  • Not symmetric: D_KL(p||q) ≠ D_KL(q||p)

AI Applications:

  • Measuring model performance
  • Variational inference
  • Regularization in deep learning

5. Hands-On: Mathematical Operations with NumPy

5.1 Linear Algebra Operations

python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

# Set style for better plots
plt.style.use('seaborn-v0_8')
np.random.seed(42)

print("=== LINEAR ALGEBRA OPERATIONS ===\n")

# Create vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

print("Vectors:")
print(f"v1 = {v1}")
print(f"v2 = {v2}")

# Vector operations
print(f"\nVector addition: v1 + v2 = {v1 + v2}")
print(f"Vector subtraction: v1 - v2 = {v1 - v2}")
print(f"Scalar multiplication: 2 * v1 = {2 * v1}")

# Dot product
dot_product = np.dot(v1, v2)
print(f"Dot product: v1 · v2 = {dot_product}")

# Vector magnitude
magnitude_v1 = np.linalg.norm(v1)
magnitude_v2 = np.linalg.norm(v2)
print(f"Magnitude of v1: ||v1|| = {magnitude_v1:.3f}")
print(f"Magnitude of v2: ||v2|| = {magnitude_v2:.3f}")

# Angle between vectors
cos_angle = dot_product / (magnitude_v1 * magnitude_v2)
angle_rad = np.arccos(cos_angle)
angle_deg = np.degrees(angle_rad)
print(f"Angle between vectors: {angle_deg:.2f} degrees")

5.2 Matrix Operations

python
print("\n=== MATRIX OPERATIONS ===\n")

# Create matrices
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

print("Matrix A (2×3):")
print(A)
print("\nMatrix B (3×2):")
print(B)

# Matrix multiplication
C = np.dot(A, B)  # or A @ B
print(f"\nMatrix multiplication A × B (2×2):")
print(C)

# Matrix transpose
A_T = A.T
print(f"\nTranspose of A:")
print(A_T)

# Identity matrix
I = np.eye(3)
print(f"\n3×3 Identity matrix:")
print(I)

# Matrix properties
print(f"\nMatrix A shape: {A.shape}")
print(f"Matrix A rank: {np.linalg.matrix_rank(A)}")

# Square matrix operations
square_matrix = np.array([[2, 1],
                         [1, 2]])
print(f"\nSquare matrix:")
print(square_matrix)

# Determinant
det = np.linalg.det(square_matrix)
print(f"Determinant: {det}")

# Inverse (if determinant ≠ 0)
if not np.isclose(det, 0):  # guard against singular (non-invertible) matrices
    inv = np.linalg.inv(square_matrix)
    print(f"Inverse:")
    print(inv)
    
    # Verify: A × A⁻¹ = I
    verification = square_matrix @ inv
    print(f"Verification (A × A⁻¹):")
    print(verification)

5.3 Probability and Statistics

python
print("\n=== PROBABILITY AND STATISTICS ===\n")

# Generate random data
n_samples = 1000

# Normal distribution
normal_data = np.random.normal(loc=50, scale=10, size=n_samples)
print(f"Normal distribution (μ=50, σ=10):")
print(f"Sample mean: {np.mean(normal_data):.2f}")
print(f"Sample std: {np.std(normal_data):.2f}")

# Uniform distribution
uniform_data = np.random.uniform(low=0, high=100, size=n_samples)
print(f"\nUniform distribution (0-100):")
print(f"Sample mean: {np.mean(uniform_data):.2f}")
print(f"Sample std: {np.std(uniform_data):.2f}")

# Binomial distribution
binomial_data = np.random.binomial(n=10, p=0.3, size=n_samples)
print(f"\nBinomial distribution (n=10, p=0.3):")
print(f"Sample mean: {np.mean(binomial_data):.2f}")
print(f"Theoretical mean: {10 * 0.3}")

# Statistical measures
data = normal_data
print(f"\nStatistical measures for normal data:")
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Mode: {stats.mode(data.round())[0]:.2f}")
print(f"Variance: {np.var(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Range: {np.max(data) - np.min(data):.2f}")

# Percentiles
percentiles = [25, 50, 75, 95, 99]
for p in percentiles:
    value = np.percentile(data, p)
    print(f"{p}th percentile: {value:.2f}")

5.4 Visualization of Distributions

python
# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Normal distribution
axes[0,0].hist(normal_data, bins=50, alpha=0.7, density=True, color='blue')
x = np.linspace(normal_data.min(), normal_data.max(), 100)
y = stats.norm.pdf(x, loc=50, scale=10)
axes[0,0].plot(x, y, 'r-', linewidth=2, label='Theoretical PDF')
axes[0,0].set_title('Normal Distribution')
axes[0,0].legend()

# Uniform distribution
axes[0,1].hist(uniform_data, bins=50, alpha=0.7, density=True, color='green')
axes[0,1].axhline(y=1/100, color='red', linestyle='--', linewidth=2, label='Theoretical PDF')
axes[0,1].set_title('Uniform Distribution')
axes[0,1].legend()

# Binomial distribution
unique_vals, counts = np.unique(binomial_data, return_counts=True)
axes[0,2].bar(unique_vals, counts/n_samples, alpha=0.7, color='orange')
theoretical_probs = [stats.binom.pmf(k, 10, 0.3) for k in unique_vals]
axes[0,2].plot(unique_vals, theoretical_probs, 'ro-', label='Theoretical PMF')
axes[0,2].set_title('Binomial Distribution')
axes[0,2].legend()

# Q-Q plot for normality test
stats.probplot(normal_data, dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot (Normal Data)')

# Box plot
axes[1,1].boxplot([normal_data, uniform_data, binomial_data], 
                  labels=['Normal', 'Uniform', 'Binomial'])
axes[1,1].set_title('Box Plots Comparison')

# Correlation example
x_corr = np.random.normal(0, 1, 500)
y_corr = 2*x_corr + np.random.normal(0, 0.5, 500)  # y correlated with x
correlation = np.corrcoef(x_corr, y_corr)[0,1]

axes[1,2].scatter(x_corr, y_corr, alpha=0.6)
axes[1,2].set_xlabel('X')
axes[1,2].set_ylabel('Y')
axes[1,2].set_title(f'Correlation Example (r={correlation:.3f})')

# Add trend line
z = np.polyfit(x_corr, y_corr, 1)
p = np.poly1d(z)
axes[1,2].plot(x_corr, p(x_corr), "r--", alpha=0.8)

plt.tight_layout()
plt.show()

5.5 Calculus with NumPy

python
print("\n=== CALCULUS OPERATIONS ===\n")

# Numerical derivatives
def f(x):
    """Example function: f(x) = x^3 - 2x^2 + x + 1"""
    return x**3 - 2*x**2 + x + 1

def f_derivative(x):
    """Analytical derivative: f'(x) = 3x^2 - 4x + 1"""
    return 3*x**2 - 4*x + 1

# Numerical derivative approximation
def numerical_derivative(func, x, h=1e-5):
    """Compute numerical derivative using finite differences"""
    return (func(x + h) - func(x - h)) / (2 * h)

# Test points
x_test = np.array([0, 1, 2, 3])

print("Comparison of analytical vs numerical derivatives:")
print("x\tAnalytical\tNumerical\tError")
print("-" * 45)

for x in x_test:
    analytical = f_derivative(x)
    numerical = numerical_derivative(f, x)
    error = abs(analytical - numerical)
    print(f"{x}\t{analytical:.6f}\t{numerical:.6f}\t{error:.2e}")

# Gradient example (multivariable function)
def g(x, y):
    """Example function: g(x,y) = x^2 + 2xy + y^2"""
    return x**2 + 2*x*y + y**2

def gradient_g(x, y):
    """Analytical gradient: ∇g = [2x + 2y, 2x + 2y]"""
    return np.array([2*x + 2*y, 2*x + 2*y])

# Test gradient
x, y = 1, 2
grad = gradient_g(x, y)
print(f"\nGradient of g(1,2) = {grad}")

# Gradient descent example
def gradient_descent_1d(func, func_grad, x_start, learning_rate=0.1, num_iterations=100):
    """Simple 1D gradient descent"""
    x = x_start
    history = [x]
    
    for i in range(num_iterations):
        grad = func_grad(x)
        x = x - learning_rate * grad
        history.append(x)
        
        if abs(grad) < 1e-6:  # Convergence criterion
            break
    
    return x, history

# Find minimum of f(x) = (x-2)^2 + 1
def simple_func(x):
    return (x - 2)**2 + 1

def simple_func_grad(x):
    return 2 * (x - 2)

minimum, history = gradient_descent_1d(simple_func, simple_func_grad, x_start=5)
print(f"\nGradient descent result:")
print(f"Starting point: x = 5")
print(f"Found minimum at: x = {minimum:.6f}")
print(f"Function value at minimum: f({minimum:.6f}) = {simple_func(minimum):.6f}")
print(f"True minimum: x = 2, f(2) = 1")
print(f"Converged in {len(history)-1} iterations")

5.6 Information Theory Calculations

python
print("\n=== INFORMATION THEORY ===\n")

# Entropy calculation
def entropy(probabilities):
    """Calculate entropy of a probability distribution"""
    # Remove zero probabilities to avoid log(0)
    p = np.array(probabilities)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Examples
print("Entropy examples:")

# Fair coin
fair_coin = [0.5, 0.5]
print(f"Fair coin: H = {entropy(fair_coin):.3f} bits")

# Biased coin
biased_coin = [0.9, 0.1]
print(f"Biased coin (90%-10%): H = {entropy(biased_coin):.3f} bits")

# Fair die
fair_die = [1/6] * 6
print(f"Fair 6-sided die: H = {entropy(fair_die):.3f} bits")

# Uniform distribution over 8 outcomes
uniform_8 = [1/8] * 8
print(f"Uniform over 8 outcomes: H = {entropy(uniform_8):.3f} bits")

# Cross-entropy
def cross_entropy(true_dist, pred_dist):
    """Calculate cross-entropy between true and predicted distributions"""
    return -np.sum(true_dist * np.log2(pred_dist))

# KL divergence
def kl_divergence(p, q):
    """Calculate KL divergence D_KL(p||q)"""
    return np.sum(p * np.log2(p / q))

# Example: comparing distributions
true_dist = np.array([0.5, 0.3, 0.2])
pred_dist1 = np.array([0.4, 0.4, 0.2])  # Close to true
pred_dist2 = np.array([0.8, 0.1, 0.1])  # Far from true

print(f"\nCross-entropy and KL divergence:")
print(f"True distribution: {true_dist}")
print(f"Prediction 1: {pred_dist1}")
print(f"Prediction 2: {pred_dist2}")

ce1 = cross_entropy(true_dist, pred_dist1)
ce2 = cross_entropy(true_dist, pred_dist2)
kl1 = kl_divergence(true_dist, pred_dist1)
kl2 = kl_divergence(true_dist, pred_dist2)

print(f"\nCross-entropy with prediction 1: {ce1:.3f} bits")
print(f"Cross-entropy with prediction 2: {ce2:.3f} bits")
print(f"KL divergence with prediction 1: {kl1:.3f} bits")
print(f"KL divergence with prediction 2: {kl2:.3f} bits")

5.7 Putting It All Together: Simple Linear Regression

python
print("\n=== APPLYING MATH TO AI: LINEAR REGRESSION ===\n")

# Generate synthetic dataset
np.random.seed(42)
n_points = 100
true_slope = 2.5
true_intercept = 1.0
noise_std = 0.5

X = np.random.uniform(0, 10, n_points)
y = true_slope * X + true_intercept + np.random.normal(0, noise_std, n_points)

# Add bias term (for intercept)
X_with_bias = np.column_stack([np.ones(n_points), X])

# Analytical solution using linear algebra
# Normal equation: θ = (X^T X)^(-1) X^T y
XTX = X_with_bias.T @ X_with_bias
XTy = X_with_bias.T @ y
theta_analytical = np.linalg.inv(XTX) @ XTy

print(f"Analytical solution:")
print(f"Intercept: {theta_analytical[0]:.3f} (true: {true_intercept})")
print(f"Slope: {theta_analytical[1]:.3f} (true: {true_slope})")

# Gradient descent solution
def cost_function(theta, X, y):
    """Mean squared error cost function"""
    predictions = X @ theta
    errors = predictions - y
    return 0.5 * np.mean(errors**2)

def cost_gradient(theta, X, y):
    """Gradient of the cost function"""
    predictions = X @ theta
    errors = predictions - y
    return X.T @ errors / len(y)

# Gradient descent
theta_gd = np.random.randn(2)  # Random initialization
learning_rate = 0.01
num_iterations = 1000
cost_history = []

print(f"\nGradient descent:")
print(f"Initial parameters: {theta_gd}")

for i in range(num_iterations):
    cost = cost_function(theta_gd, X_with_bias, y)
    cost_history.append(cost)
    
    gradient = cost_gradient(theta_gd, X_with_bias, y)
    theta_gd = theta_gd - learning_rate * gradient
    
    if i % 200 == 0:
        print(f"Iteration {i}: Cost = {cost:.6f}")

print(f"Final parameters: {theta_gd}")
print(f"Intercept: {theta_gd[0]:.3f}")
print(f"Slope: {theta_gd[1]:.3f}")

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Data and regression lines
ax1.scatter(X, y, alpha=0.6, label='Data points')
x_line = np.linspace(X.min(), X.max(), 100)
y_analytical = theta_analytical[0] + theta_analytical[1] * x_line
y_gd = theta_gd[0] + theta_gd[1] * x_line
y_true = true_intercept + true_slope * x_line

ax1.plot(x_line, y_true, 'g--', linewidth=2, label='True relationship')
ax1.plot(x_line, y_analytical, 'r-', linewidth=2, label='Analytical solution')
ax1.plot(x_line, y_gd, 'b:', linewidth=2, label='Gradient descent')
ax1.set_xlabel('X')
ax1.set_ylabel('y')
ax1.set_title('Linear Regression Results')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Cost function convergence
ax2.plot(cost_history)
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Cost (MSE)')
ax2.set_title('Gradient Descent Convergence')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate final metrics
predictions_analytical = X_with_bias @ theta_analytical
predictions_gd = X_with_bias @ theta_gd

mse_analytical = np.mean((predictions_analytical - y)**2)
mse_gd = np.mean((predictions_gd - y)**2)
r2_analytical = 1 - mse_analytical / np.var(y)
r2_gd = 1 - mse_gd / np.var(y)

print(f"\nModel Performance:")
print(f"Analytical - MSE: {mse_analytical:.6f}, R²: {r2_analytical:.6f}")
print(f"Gradient Descent - MSE: {mse_gd:.6f}, R²: {r2_gd:.6f}")

6. Key Takeaways and Real-World Applications

What We Learned Today

  1. Linear Algebra: The mathematical language for representing and manipulating data
    • Vectors represent data points, images, word embeddings
    • Matrices store model parameters and transformations
    • Operations like dot products measure similarity
  2. Probability and Statistics: Tools for handling uncertainty and variability
    • Probability distributions model real-world phenomena
    • Bayes' theorem enables learning from evidence
    • Statistical measures quantify model performance
  3. Calculus: The engine that drives learning algorithms
    • Derivatives show how to improve model performance
    • Gradient descent finds optimal parameters
    • Chain rule enables training deep networks
  4. Information Theory: Measures the value of information
    • Entropy quantifies uncertainty and information content
    • Cross-entropy serves as a loss function for classification
    • KL divergence measures distribution differences

Real-World AI Applications

  1. Computer Vision:
    • Images as matrices, convolution as matrix multiplication
    • Probability distributions for object detection confidence
    • Gradient descent for training CNN weights
  2. Natural Language Processing:
    • Words as vectors in high-dimensional space
    • Probability models for language generation
    • Information theory for compression and encoding
  3. Recommendation Systems:
    • User preferences as vectors
    • Similarity measured by dot products
    • Probabilistic modeling of user behavior
  4. Reinforcement Learning:
    • Value functions optimized with calculus
    • Probability distributions over actions
    • Information theory for exploration strategies

Next Lesson Preview

In Lesson 3, we'll build on these mathematical foundations to explore:

  • How supervised learning algorithms work
  • Different types of loss functions and their mathematical properties
  • Optimization techniques beyond basic gradient descent
  • The bias-variance tradeoff
  • Regularization methods to prevent overfitting

Practice Assignment

  1. Linear Algebra Practice:
    • Implement matrix multiplication from scratch
    • Calculate eigenvectors and eigenvalues
    • Explore principal component analysis (PCA)
  2. Probability Experiments:
    • Simulate different probability distributions
    • Implement Bayes' theorem for a classification problem
    • Calculate confidence intervals
  3. Calculus Applications:
    • Find minima of different functions using gradient descent
    • Implement the chain rule for composite functions
    • Optimize a simple neural network by hand
  4. Information Theory:
    • Calculate entropy for different datasets
    • Implement a simple compression algorithm
    • Compare different loss functions mathematically

Additional Resources

  • Khan Academy: Linear Algebra, Statistics & Probability, Calculus
  • 3Blue1Brown: "Essence of Linear Algebra" and "Essence of Calculus" video series
  • MIT OpenCourseWare: 18.06 Linear Algebra, 6.041 Probabilistic Systems Analysis
  • Books: "Mathematics for Machine Learning" by Deisenroth, Faisal, and Ong

Ready for Lesson 3? The mathematical foundation you've built here will make everything that follows much clearer and more intuitive!
