By the end of this lesson, you will understand the core linear algebra, probability, calculus, and information theory concepts that underpin AI, and you will have applied each of them in Python with NumPy.
Linear algebra is the mathematical foundation that allows computers to process and manipulate data efficiently. Every piece of data in AI is represented as vectors and matrices.
A scalar is a single number.
Examples: 5, -3.14, 0.5
A vector is an ordered list of numbers (a 1-dimensional array).
Examples:
v = [1, 2, 3] (3-dimensional vector)
w = [0.5, -1.2, 4.7] (3-dimensional vector)
Geometric Interpretation: A vector represents a point in space or a direction with a magnitude.
A matrix is a 2-dimensional array of numbers.
Example:
A = [[1, 2, 3],
     [4, 5, 6]] (2×3 matrix: 2 rows, 3 columns)
AI Context: datasets, images, and model weights are all stored and manipulated as vectors and matrices, which is why these structures appear everywhere in AI.
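To make the three objects concrete, here is a minimal NumPy sketch (NumPy is the library used throughout the code section later in this lesson) showing a scalar, a vector, and a matrix together with their shapes; the particular values are arbitrary:
import numpy as np

scalar = 3.14                          # a single number
vector = np.array([1, 2, 3])           # 1-dimensional array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])         # 2-dimensional array

print(vector.shape)   # (3,)   -> 3 components
print(matrix.shape)   # (2, 3) -> 2 rows, 3 columns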
v = [1, 2, 3]
w = [4, 5, 6]
v + w = [1+4, 2+5, 3+6] = [5, 7, 9]
v - w = [1-4, 2-5, 3-6] = [-3, -3, -3]
Scalar multiplication scales every component of a vector:
v = [1, 2, 3]
2 * v = [2*1, 2*2, 2*3] = [2, 4, 6]
The dot product measures how similar two vectors are:
v · w = v₁w₁ + v₂w₂ + v₃w₃
Example:
v = [1, 2, 3]
w = [4, 5, 6]
v · w = 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
Geometric Interpretation: v · w = ||v|| × ||w|| × cos(θ), where θ is the angle between the vectors. A large positive dot product means the vectors point in similar directions; a dot product of zero means they are perpendicular.
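In AI this "similarity" reading of the dot product usually appears as cosine similarity: the dot product divided by the product of the magnitudes. A minimal sketch with the vectors above:
import numpy as np

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

# Cosine similarity: +1 means same direction, 0 means perpendicular, -1 means opposite
cos_sim = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(f"cos(θ) = {cos_sim:.3f}")   # ≈ 0.975, so the vectors point in very similar directions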
The magnitude (norm) of a vector measures its length:
||v|| = √(v₁² + v₂² + v₃²)
Example:
v = [3, 4]
||v|| = √(3² + 4²) = √(9 + 16) = √25 = 5
Matrices of the same shape are added element-wise:
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
A + B = [[1+5, 2+6],
         [3+7, 4+8]]
      = [[6, 8],
         [10, 12]]
Matrix multiplication is more restrictive. Key Rule: For A×B to be valid, the number of columns in A must equal the number of rows in B.
A (2×3) × B (3×2) = C (2×2)
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
Each entry of C is the dot product of a row of A with a column of B:
C[i,j] = Σ A[i,k] × B[k,j]
C = [[1*7+2*9+3*11, 1*8+2*10+3*12],
     [4*7+5*9+6*11, 4*8+5*10+6*12]]
  = [[58, 64],
     [139, 154]]
The transpose A^T flips rows and columns:
A = [[1, 2, 3],
     [4, 5, 6]]
A^T = [[1, 4],
       [2, 5],
       [3, 6]]
The identity matrix is a square matrix with 1s on the diagonal and 0s elsewhere:
I₃ = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1]]
Property: A × I = I × A = A (like multiplying by 1)
For a square matrix A, if the inverse A⁻¹ exists: A × A⁻¹ = A⁻¹ × A = I
AI Application: Solving systems of equations, optimization problems
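As a small illustration of the first application, the sketch below solves the system 2x + y = 5, x + 2y = 4 by writing it as A·x = b; np.linalg.solve is generally preferred over forming A⁻¹ explicitly because it is faster and numerically more stable:
import numpy as np

# The system 2x + y = 5, x + 2y = 4 written as A x = b
A = np.array([[2, 1],
              [1, 2]])
b = np.array([5, 4])

x = np.linalg.solve(A, b)   # equivalent to np.linalg.inv(A) @ b, but more stable
print(x)                    # [2. 1.]  ->  x = 2, y = 1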
AI systems must make decisions under uncertainty. Probability theory provides the mathematical framework for this.
P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
Range: 0 ≤ P(A) ≤ 1
Example: Rolling a fair six-sided die, P(rolling a 3) = 1/6 ≈ 0.167.
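The same probability can be estimated empirically by counting favorable outcomes in a simulation; a minimal sketch (the estimate will fluctuate slightly around 1/6 from run to run):
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)   # 100,000 simulated rolls of a fair die (values 1-6)

p_three = np.mean(rolls == 3)              # favorable outcomes / total outcomes
print(f"Estimated P(3): {p_three:.4f}  (theory: {1/6:.4f})")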
For events A and B:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
If A and B are mutually exclusive (they can't happen together):
P(A ∪ B) = P(A) + P(B)
For independent events A and B:
P(A ∩ B) = P(A) × P(B)
Conditional probability is the probability of A given that B has occurred:
P(A|B) = P(A ∩ B) / P(B)
AI Application: "What's the probability this email is spam given it contains the word 'lottery'?"
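A sketch of how that spam question reduces to the formula above, using made-up counts purely for illustration:
# Hypothetical counts from an email dataset (illustrative numbers only)
n_emails = 1000
n_lottery = 50             # emails containing the word "lottery"
n_spam_and_lottery = 45    # emails that are spam AND contain "lottery"

# P(spam | "lottery") = P(spam ∩ "lottery") / P("lottery")
p_lottery = n_lottery / n_emails
p_both = n_spam_and_lottery / n_emails
print(f"P(spam | 'lottery') = {p_both / p_lottery:.2f}")   # 0.90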
One of the most important theorems in AI:
P(A|B) = P(B|A) × P(A) / P(B)
Components: P(A) is the prior, P(B|A) is the likelihood, P(B) is the evidence, and P(A|B) is the posterior.
Example: Medical Diagnosis
P(Disease|Positive Test) = P(Positive|Disease) × P(Disease) / P(Positive)
= 0.99 × 0.01 / P(Positive)
where P(Positive) comes from the law of total probability: P(Positive) = P(Positive|Disease)P(Disease) + P(Positive|No Disease)P(No Disease). With only 1% prevalence, even a 99%-sensitive test leaves the posterior below 50% unless the false-positive rate is also under about 1%.
Common probability distributions:
Bernoulli Distribution: Single trial with two outcomes
Binomial Distribution: Multiple independent Bernoulli trials
Uniform Distribution: All values equally likely in an interval
Normal (Gaussian) Distribution: Bell-shaped curve
Properties of Normal Distribution: it is symmetric about its mean μ; roughly 68% of values fall within 1 standard deviation (σ) of the mean, about 95% within 2σ, and about 99.7% within 3σ.
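These percentages can be confirmed from the normal CDF with scipy.stats (the same module imported in the code section below):
from scipy import stats

# Probability mass within k standard deviations of the mean of a normal distribution
for k in [1, 2, 3]:
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k}σ: {p:.4f}")   # ≈ 0.6827, 0.9545, 0.9973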
Covariance measures how two variables vary together, and correlation rescales it to the range [-1, 1]:
Cov(X,Y) = E[(X - μ_X)(Y - μ_Y)]
ρ(X,Y) = Cov(X,Y) / (σ_X × σ_Y)
Machine learning models improve by minimizing error functions. Calculus provides the tools to find these minimum points.
The derivative measures how a function changes as its input changes:
f'(x) = lim(h→0) [f(x+h) - f(x)] / h
Common derivative rules:
d/dx [c] = 0 (constant)
d/dx [x^n] = n×x^(n-1) (power rule)
d/dx [e^x] = e^x (exponential)
d/dx [ln(x)] = 1/x (natural log)
d/dx [sin(x)] = cos(x) (sine)
d/dx [cos(x)] = -sin(x) (cosine)
The chain rule handles composite functions f(g(x)):
d/dx [f(g(x))] = f'(g(x)) × g'(x)
Example:
f(x) = (x² + 1)³
f'(x) = 3(x² + 1)² × 2x = 6x(x² + 1)²
For functions with multiple variables f(x,y):
∂f/∂x = derivative with respect to x (treating y as constant)
∂f/∂y = derivative with respect to y (treating x as constant)
Example:
f(x,y) = x²y + 3xy²
∂f/∂x = 2xy + 3y²
∂f/∂y = x² + 6xy
The gradient ∇f is a vector of all partial derivatives:
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
Key Properties: the gradient points in the direction of steepest increase of f, its magnitude is the rate of that increase, and stepping against the gradient decreases f fastest.
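As a quick sanity check on the partial derivatives above, a sketch that approximates the gradient of f(x,y) = x²y + 3xy² with central finite differences at the point (x, y) = (1, 2) and compares it to the analytical formulas:
def f(x, y):
    return x**2 * y + 3 * x * y**2

x, y, h = 1.0, 2.0, 1e-5
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # numerical ∂f/∂x
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # numerical ∂f/∂y

print(f"Numerical gradient:  [{df_dx:.3f}, {df_dy:.3f}]")          # ≈ [16, 13]
print(f"Analytical gradient: [{2*x*y + 3*y**2}, {x**2 + 6*x*y}]")   # [16.0, 13.0]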
Minima, maxima, and other critical points occur where the derivative equals zero: f'(x) = 0
To minimize function f(x):
1. Start with initial guess x₀
2. Repeat: x_{n+1} = x_n - α∇f(x_n)
3. Stop when the gradient ≈ 0
Here α is the learning rate, which controls the step size.
AI Application: This is how neural networks learn!
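As a tiny worked example (the function, learning rate, and starting point below are chosen only for illustration): minimizing f(x) = x², whose derivative is f'(x) = 2x, with α = 0.1 and x₀ = 1 gives
x₁ = x₀ - α·f'(x₀) = 1 - 0.1 × 2 = 0.8
x₂ = 0.8 - 0.1 × 1.6 = 0.64
x₃ = 0.64 - 0.1 × 1.28 = 0.512
Each step multiplies x by 0.8, so the iterates shrink steadily toward the true minimum at x = 0.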
Information theory quantifies how much information is contained in data and how efficiently it can be transmitted.
The information content of an event with probability p:
I(x) = -log₂(p) bits
Intuition: Rare events carry more information than common events.
Examples: a fair coin flip (p = 0.5) carries -log₂(0.5) = 1 bit of information; an event with probability 1/8 carries 3 bits; a certain event (p = 1) carries 0 bits.
Average information content across all possible events:
H(X) = -Σ p(x) log₂ p(x)
Properties: entropy is always ≥ 0, it equals 0 when the outcome is certain, and it is maximized when all outcomes are equally likely.
Example: Fair coin
H(Coin) = -[0.5×log₂(0.5) + 0.5×log₂(0.5)] = 1 bit
Cross-entropy measures the difference between two probability distributions:
H(p,q) = -Σ p(x) log q(x)
where p is the true distribution and q is the predicted (model) distribution.
AI Application: Cross-entropy is a common loss function in classification tasks.
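A minimal sketch of that use, with made-up numbers: the true label is one-hot encoded as p, the model's predicted probabilities are q, and the loss is small only when the model assigns high probability to the correct class:
import numpy as np

p = np.array([0, 1, 0])              # true label: class 1, one-hot encoded
q_good = np.array([0.1, 0.8, 0.1])   # prediction that favors the correct class
q_bad = np.array([0.6, 0.2, 0.2])    # prediction that favors the wrong class

def ce(p, q):
    return -np.sum(p * np.log2(q))   # only the true class's predicted probability matters

print(f"Loss for the good prediction: {ce(p, q_good):.3f} bits")   # ≈ 0.32
print(f"Loss for the bad prediction:  {ce(p, q_bad):.3f} bits")    # ≈ 2.32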
Measures how one probability distribution differs from another:
D_KL(p||q) = Σ p(x) log(p(x)/q(x))
Properties: D_KL(p||q) ≥ 0, it equals 0 only when p and q are identical, and it is not symmetric (D_KL(p||q) ≠ D_KL(q||p) in general).
AI Applications: measuring how far a model's predicted distribution is from the true data distribution, regularizing variational autoencoders, and detecting distribution shift between training and production data.
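A short sketch of the asymmetry property, reusing the same distributions that appear in the code section below:
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.8, 0.1, 0.1])

def kl(a, b):
    # D_KL(a||b) in bits; assumes neither distribution contains zeros
    return np.sum(a * np.log2(a / b))

print(f"D_KL(p||q) = {kl(p, q):.3f} bits")   # ≈ 0.34
print(f"D_KL(q||p) = {kl(q, p):.3f} bits")   # ≈ 0.28, a different value: KL is not symmetric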
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
# Set style for better plots
plt.style.use('seaborn-v0_8')
np.random.seed(42)
print("=== LINEAR ALGEBRA OPERATIONS ===\n")
# Create vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
print("Vectors:")
print(f"v1 = {v1}")
print(f"v2 = {v2}")
# Vector operations
print(f"\nVector addition: v1 + v2 = {v1 + v2}")
print(f"Vector subtraction: v1 - v2 = {v1 - v2}")
print(f"Scalar multiplication: 2 * v1 = {2 * v1}")
# Dot product
dot_product = np.dot(v1, v2)
print(f"Dot product: v1 · v2 = {dot_product}")
# Vector magnitude
magnitude_v1 = np.linalg.norm(v1)
magnitude_v2 = np.linalg.norm(v2)
print(f"Magnitude of v1: ||v1|| = {magnitude_v1:.3f}")
print(f"Magnitude of v2: ||v2|| = {magnitude_v2:.3f}")
# Angle between vectors
cos_angle = dot_product / (magnitude_v1 * magnitude_v2)
angle_rad = np.arccos(cos_angle)
angle_deg = np.degrees(angle_rad)
print(f"Angle between vectors: {angle_deg:.2f} degrees")
print("\n=== MATRIX OPERATIONS ===\n")
# Create matrices
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])
print("Matrix A (2×3):")
print(A)
print("\nMatrix B (3×2):")
print(B)
# Matrix multiplication
C = np.dot(A, B) # or A @ B
print(f"\nMatrix multiplication A × B (2×2):")
print(C)
# Matrix transpose
A_T = A.T
print(f"\nTranspose of A:")
print(A_T)
# Identity matrix
I = np.eye(3)
print(f"\n3×3 Identity matrix:")
print(I)
# Matrix properties
print(f"\nMatrix A shape: {A.shape}")
print(f"Matrix A rank: {np.linalg.matrix_rank(A)}")
# Square matrix operations
square_matrix = np.array([[2, 1],
                          [1, 2]])
print(f"\nSquare matrix:")
print(square_matrix)
# Determinant
det = np.linalg.det(square_matrix)
print(f"Determinant: {det}")
# Inverse (if determinant ≠ 0)
if det != 0:
    inv = np.linalg.inv(square_matrix)
    print(f"Inverse:")
    print(inv)
    # Verify: A × A⁻¹ = I
    verification = square_matrix @ inv
    print(f"Verification (A × A⁻¹):")
    print(verification)
print("\n=== PROBABILITY AND STATISTICS ===\n")
# Generate random data
n_samples = 1000
# Normal distribution
normal_data = np.random.normal(loc=50, scale=10, size=n_samples)
print(f"Normal distribution (μ=50, σ=10):")
print(f"Sample mean: {np.mean(normal_data):.2f}")
print(f"Sample std: {np.std(normal_data):.2f}")
# Uniform distribution
uniform_data = np.random.uniform(low=0, high=100, size=n_samples)
print(f"\nUniform distribution (0-100):")
print(f"Sample mean: {np.mean(uniform_data):.2f}")
print(f"Sample std: {np.std(uniform_data):.2f}")
# Binomial distribution
binomial_data = np.random.binomial(n=10, p=0.3, size=n_samples)
print(f"\nBinomial distribution (n=10, p=0.3):")
print(f"Sample mean: {np.mean(binomial_data):.2f}")
print(f"Theoretical mean: {10 * 0.3}")
# Statistical measures
data = normal_data
print(f"\nStatistical measures for normal data:")
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Mode: {stats.mode(data.round())[0]:.2f}")
print(f"Variance: {np.var(data):.2f}")
print(f"Standard deviation: {np.std(data):.2f}")
print(f"Range: {np.max(data) - np.min(data):.2f}")
# Percentiles
percentiles = [25, 50, 75, 95, 99]
for p in percentiles:
    value = np.percentile(data, p)
    print(f"{p}th percentile: {value:.2f}")
# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Normal distribution
axes[0,0].hist(normal_data, bins=50, alpha=0.7, density=True, color='blue')
x = np.linspace(normal_data.min(), normal_data.max(), 100)
y = stats.norm.pdf(x, loc=50, scale=10)
axes[0,0].plot(x, y, 'r-', linewidth=2, label='Theoretical PDF')
axes[0,0].set_title('Normal Distribution')
axes[0,0].legend()
# Uniform distribution
axes[0,1].hist(uniform_data, bins=50, alpha=0.7, density=True, color='green')
axes[0,1].axhline(y=1/100, color='red', linestyle='--', linewidth=2, label='Theoretical PDF')
axes[0,1].set_title('Uniform Distribution')
axes[0,1].legend()
# Binomial distribution
unique_vals, counts = np.unique(binomial_data, return_counts=True)
axes[0,2].bar(unique_vals, counts/n_samples, alpha=0.7, color='orange')
theoretical_probs = [stats.binom.pmf(k, 10, 0.3) for k in unique_vals]
axes[0,2].plot(unique_vals, theoretical_probs, 'ro-', label='Theoretical PMF')
axes[0,2].set_title('Binomial Distribution')
axes[0,2].legend()
# Q-Q plot for normality test
stats.probplot(normal_data, dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot (Normal Data)')
# Box plot
axes[1,1].boxplot([normal_data, uniform_data, binomial_data],
                  labels=['Normal', 'Uniform', 'Binomial'])
axes[1,1].set_title('Box Plots Comparison')
# Correlation example
x_corr = np.random.normal(0, 1, 500)
y_corr = 2*x_corr + np.random.normal(0, 0.5, 500) # y correlated with x
correlation = np.corrcoef(x_corr, y_corr)[0,1]
axes[1,2].scatter(x_corr, y_corr, alpha=0.6)
axes[1,2].set_xlabel('X')
axes[1,2].set_ylabel('Y')
axes[1,2].set_title(f'Correlation Example (r={correlation:.3f})')
# Add trend line
z = np.polyfit(x_corr, y_corr, 1)
p = np.poly1d(z)
axes[1,2].plot(x_corr, p(x_corr), "r--", alpha=0.8)
plt.tight_layout()
plt.show()
print("\n=== CALCULUS OPERATIONS ===\n")
# Numerical derivatives
def f(x):
    """Example function: f(x) = x^3 - 2x^2 + x + 1"""
    return x**3 - 2*x**2 + x + 1
def f_derivative(x):
    """Analytical derivative: f'(x) = 3x^2 - 4x + 1"""
    return 3*x**2 - 4*x + 1
# Numerical derivative approximation
def numerical_derivative(func, x, h=1e-5):
    """Compute numerical derivative using finite differences"""
    return (func(x + h) - func(x - h)) / (2 * h)
# Test points
x_test = np.array([0, 1, 2, 3])
print("Comparison of analytical vs numerical derivatives:")
print("x\tAnalytical\tNumerical\tError")
print("-" * 45)
for x in x_test:
    analytical = f_derivative(x)
    numerical = numerical_derivative(f, x)
    error = abs(analytical - numerical)
    print(f"{x}\t{analytical:.6f}\t{numerical:.6f}\t{error:.2e}")
# Gradient example (multivariable function)
def g(x, y):
    """Example function: g(x,y) = x^2 + 2xy + y^2"""
    return x**2 + 2*x*y + y**2
def gradient_g(x, y):
    """Analytical gradient: ∇g = [2x + 2y, 2x + 2y]"""
    return np.array([2*x + 2*y, 2*x + 2*y])
# Test gradient
x, y = 1, 2
grad = gradient_g(x, y)
print(f"\nGradient of g(1,2) = {grad}")
# Gradient descent example
def gradient_descent_1d(func, func_grad, x_start, learning_rate=0.1, num_iterations=100):
    """Simple 1D gradient descent"""
    x = x_start
    history = [x]
    for i in range(num_iterations):
        grad = func_grad(x)
        x = x - learning_rate * grad
        history.append(x)
        if abs(grad) < 1e-6:  # Convergence criterion
            break
    return x, history
# Find minimum of f(x) = (x-2)^2 + 1
def simple_func(x):
    return (x - 2)**2 + 1
def simple_func_grad(x):
    return 2 * (x - 2)
minimum, history = gradient_descent_1d(simple_func, simple_func_grad, x_start=5)
print(f"\nGradient descent result:")
print(f"Starting point: x = 5")
print(f"Found minimum at: x = {minimum:.6f}")
print(f"Function value at minimum: f({minimum:.6f}) = {simple_func(minimum):.6f}")
print(f"True minimum: x = 2, f(2) = 1")
print(f"Converged in {len(history)-1} iterations")
print("\n=== INFORMATION THEORY ===\n")
# Entropy calculation
def entropy(probabilities):
    """Calculate entropy of a probability distribution"""
    # Remove zero probabilities to avoid log(0)
    p = np.array(probabilities)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
# Examples
print("Entropy examples:")
# Fair coin
fair_coin = [0.5, 0.5]
print(f"Fair coin: H = {entropy(fair_coin):.3f} bits")
# Biased coin
biased_coin = [0.9, 0.1]
print(f"Biased coin (90%-10%): H = {entropy(biased_coin):.3f} bits")
# Fair die
fair_die = [1/6] * 6
print(f"Fair 6-sided die: H = {entropy(fair_die):.3f} bits")
# Uniform distribution over 8 outcomes
uniform_8 = [1/8] * 8
print(f"Uniform over 8 outcomes: H = {entropy(uniform_8):.3f} bits")
# Cross-entropy
def cross_entropy(true_dist, pred_dist):
    """Calculate cross-entropy between true and predicted distributions"""
    return -np.sum(true_dist * np.log2(pred_dist))
# KL divergence
def kl_divergence(p, q):
    """Calculate KL divergence D_KL(p||q)"""
    return np.sum(p * np.log2(p / q))
# Example: comparing distributions
true_dist = np.array([0.5, 0.3, 0.2])
pred_dist1 = np.array([0.4, 0.4, 0.2]) # Close to true
pred_dist2 = np.array([0.8, 0.1, 0.1]) # Far from true
print(f"\nCross-entropy and KL divergence:")
print(f"True distribution: {true_dist}")
print(f"Prediction 1: {pred_dist1}")
print(f"Prediction 2: {pred_dist2}")
ce1 = cross_entropy(true_dist, pred_dist1)
ce2 = cross_entropy(true_dist, pred_dist2)
kl1 = kl_divergence(true_dist, pred_dist1)
kl2 = kl_divergence(true_dist, pred_dist2)
print(f"\nCross-entropy with prediction 1: {ce1:.3f} bits")
print(f"Cross-entropy with prediction 2: {ce2:.3f} bits")
print(f"KL divergence with prediction 1: {kl1:.3f} bits")
print(f"KL divergence with prediction 2: {kl2:.3f} bits")
print("\n=== APPLYING MATH TO AI: LINEAR REGRESSION ===\n")
# Generate synthetic dataset
np.random.seed(42)
n_points = 100
true_slope = 2.5
true_intercept = 1.0
noise_std = 0.5
X = np.random.uniform(0, 10, n_points)
y = true_slope * X + true_intercept + np.random.normal(0, noise_std, n_points)
# Add bias term (for intercept)
X_with_bias = np.column_stack([np.ones(n_points), X])
# Analytical solution using linear algebra
# Normal equation: θ = (X^T X)^(-1) X^T y
XTX = X_with_bias.T @ X_with_bias
XTy = X_with_bias.T @ y
theta_analytical = np.linalg.inv(XTX) @ XTy
print(f"Analytical solution:")
print(f"Intercept: {theta_analytical[0]:.3f} (true: {true_intercept})")
print(f"Slope: {theta_analytical[1]:.3f} (true: {true_slope})")
# Gradient descent solution
def cost_function(theta, X, y):
    """Mean squared error cost function"""
    predictions = X @ theta
    errors = predictions - y
    return 0.5 * np.mean(errors**2)
def cost_gradient(theta, X, y):
    """Gradient of the cost function"""
    predictions = X @ theta
    errors = predictions - y
    return X.T @ errors / len(y)
# Gradient descent
theta_gd = np.random.randn(2) # Random initialization
learning_rate = 0.01
num_iterations = 1000
cost_history = []
print(f"\nGradient descent:")
print(f"Initial parameters: {theta_gd}")
for i in range(num_iterations):
    cost = cost_function(theta_gd, X_with_bias, y)
    cost_history.append(cost)
    gradient = cost_gradient(theta_gd, X_with_bias, y)
    theta_gd = theta_gd - learning_rate * gradient
    if i % 200 == 0:
        print(f"Iteration {i}: Cost = {cost:.6f}")
print(f"Final parameters: {theta_gd}")
print(f"Intercept: {theta_gd[0]:.3f}")
print(f"Slope: {theta_gd[1]:.3f}")
# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Data and regression lines
ax1.scatter(X, y, alpha=0.6, label='Data points')
x_line = np.linspace(X.min(), X.max(), 100)
y_analytical = theta_analytical[0] + theta_analytical[1] * x_line
y_gd = theta_gd[0] + theta_gd[1] * x_line
y_true = true_intercept + true_slope * x_line
ax1.plot(x_line, y_true, 'g--', linewidth=2, label='True relationship')
ax1.plot(x_line, y_analytical, 'r-', linewidth=2, label='Analytical solution')
ax1.plot(x_line, y_gd, 'b:', linewidth=2, label='Gradient descent')
ax1.set_xlabel('X')
ax1.set_ylabel('y')
ax1.set_title('Linear Regression Results')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Cost function convergence
ax2.plot(cost_history)
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Cost (MSE)')
ax2.set_title('Gradient Descent Convergence')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate final metrics
predictions_analytical = X_with_bias @ theta_analytical
predictions_gd = X_with_bias @ theta_gd
mse_analytical = np.mean((predictions_analytical - y)**2)
mse_gd = np.mean((predictions_gd - y)**2)
r2_analytical = 1 - mse_analytical / np.var(y)
r2_gd = 1 - mse_gd / np.var(y)
print(f"\nModel Performance:")
print(f"Analytical - MSE: {mse_analytical:.6f}, R²: {r2_analytical:.6f}")
print(f"Gradient Descent - MSE: {mse_gd:.6f}, R²: {r2_gd:.6f}")
In Lesson 3, we'll build directly on the mathematical foundations established here.
Ready for Lesson 3? The mathematical foundation you've built here will make everything that follows much clearer and more intuitive!