This project plan provides a complete framework for analyzing Toronto's transit delays with specific focus on streetcar performance. Designed for researchers with philosophy, biochemistry, and economics backgrounds transitioning to data science, it combines rigorous methodology with practical R/tidyverse implementation and strategic Python integration.
### TTC Delay Data Collection (2014-2025)

The foundation of this analysis rests on the already-cleaned TTC delay datasets covering buses, streetcars, and subways. These provide direct measurements of delay patterns with consistent temporal coverage spanning more than a decade. Analytical value: this longitudinal dataset supports robust time-series analysis and the causal-inference techniques essential for policy evaluation.
### TTC GTFS Static and Real-Time Data

Critical for calculating schedule adherence metrics and understanding baseline service expectations. The General Transit Feed Specification (GTFS) format provides standardized route definitions, stop locations, and scheduled times. This enables calculation of delay deviations from planned service rather than just raw delay numbers.
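As a sketch of the schedule-adherence idea (toy data; real GTFS `stop_times` uses fields like `arrival_time`, and the ±3-minute on-time window here is an assumption, not the TTC's official standard), observed arrivals can be joined to the schedule and summarised as signed deviations:

```r
library(tidyverse)

# Toy GTFS-style schedule and observed arrivals (illustrative values only)
gtfs_stop_times <- tibble(
  trip_id   = c("t1", "t1", "t2"),
  stop_id   = c("s1", "s2", "s1"),
  route_id  = "504",
  scheduled = as.POSIXct(c("2024-06-01 08:00", "2024-06-01 08:10",
                           "2024-06-01 08:15"), tz = "America/Toronto")
)
observed_arrivals <- tibble(
  trip_id  = c("t1", "t1", "t2"),
  stop_id  = c("s1", "s2", "s1"),
  observed = gtfs_stop_times$scheduled + c(120, 400, -60)  # offsets in seconds
)

# Signed deviation from the timetable: positive = late, negative = early
schedule_adherence <- observed_arrivals %>%
  inner_join(gtfs_stop_times, by = c("trip_id", "stop_id")) %>%
  mutate(deviation_min = as.numeric(difftime(observed, scheduled, units = "mins"))) %>%
  group_by(route_id) %>%
  summarise(
    median_deviation_min = median(deviation_min),
    pct_within_3_min     = mean(abs(deviation_min) <= 3) * 100
  )
```

Keeping the deviation signed matters: early running is a reliability problem in its own right and disappears if only positive delays are recorded.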
### King Street Transit Priority Corridor Data (2017-2019)

This represents Toronto's most comprehensive transit intervention case study, with measured outcomes showing travel times falling from 23 to 16 minutes during implementation. The dataset includes before-after comparisons with rigorous controls, making it invaluable for validating causal-inference methodology and demonstrating intervention effectiveness.
### Traffic Volume Data from Toronto Open Data

Essential for understanding congestion impacts on streetcar delays. Research from Melbourne reports a 16.4% crash reduction and significant travel-time improvements from tram priority measures integrated with traffic analysis. This dataset enables spatial joining with streetcar routes to quantify the impact of mixed-traffic operation.
### Statistics Canada Large Urban Transit Statistics

Provides comparative context with standardized metrics across major Canadian cities. Monthly data for 10 transit operators, including the TTC, enables benchmarking and separates Toronto-specific challenges from national trends.
### Weather Data from Environment Canada

Critical for Toronto's climate-variable operations. Research identifies weather as a significant delay factor, particularly for mixed-traffic streetcar operations. Historical meteorological data supports seasonal decomposition and climate-resilience planning.
### Vancouver TransLink and Montreal STM Comparative Data

These Canadian systems provide relevant comparisons for policy learning. Vancouver's integrated bus-rail system and Montreal's metro-bus combination offer different operational models for contextualizing Toronto's performance.
### Traffic Signal Data and Intersection Files

Essential for signal priority analysis. With 440 transit signal priority (TSP) locations system-wide and documented 6-10% travel-time improvements from signal priority implementations, understanding signal-streetcar interactions is crucial for intervention recommendations.
### Parking Ticket and Traffic Camera Data

While these provide insight into enforcement patterns affecting streetcar operations, they require extensive spatial analysis and may not deliver proportional analytical value in the initial implementation.
### International Comparative Data (NYC MTA, London TfL)

Valuable for best-practices research, but less directly applicable because of different regulatory environments and infrastructure configurations.
Temporal Alignment Strategy: Standardize all datasets to consistent time periods (likely monthly aggregations for trend analysis, daily for operational patterns) with careful handling of missing data periods during service disruptions.
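A minimal tidyverse sketch of the monthly-aggregation step (toy values; the `complete()` call turns a disruption month into an explicit row rather than a silent gap):

```r
library(tidyverse)
library(lubridate)

# Toy incident-level data with no incidents recorded in March
delays <- tibble(
  date = as.Date(c("2024-01-05", "2024-01-20", "2024-02-11", "2024-04-02")),
  delay_minutes = c(4, 7, 5, 9)
)

monthly <- delays %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(mean_delay = mean(delay_minutes), n_incidents = n()) %>%
  # make the missing month explicit: zero incidents, unknown mean delay
  complete(month = seq(min(month), max(month), by = "month"),
           fill = list(n_incidents = 0))
```

Distinguishing "zero incidents" from "no data" at this stage prevents downstream time-series models from interpolating over service disruptions as if they were ordinary months.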
Spatial Consistency Protocol: Use Toronto's standardized coordinate systems and ensure route definitions align across datasets. Implement buffer zones around streetcar stops for spatial joining with traffic and infrastructure data.
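A small sf sketch of the buffer-and-join idea, using made-up point coordinates in UTM zone 17N (EPSG:32617, which covers Toronto — confirm the actual CRS of each source layer) and a 50 m buffer radius chosen purely for illustration:

```r
library(sf)
library(dplyr)

# Toy stop and traffic-sensor locations, coordinates in metres (projected CRS)
stops <- st_as_sf(
  data.frame(stop_id = c("a", "b"), x = c(0, 1000), y = c(0, 0)),
  coords = c("x", "y"), crs = 32617
)
sensors <- st_as_sf(
  data.frame(sensor_id = c("s1", "s2"), volume = c(1200, 800),
             x = c(30, 5000), y = c(0, 0)),
  coords = c("x", "y"), crs = 32617
)

# 50 m buffer around each stop, then attach any sensors falling inside it
stop_traffic <- stops %>%
  st_buffer(dist = 50) %>%
  st_join(sensors, join = st_intersects)
```

Buffering must happen in a projected CRS so that `dist` is in metres; buffering lat/long coordinates directly would give a radius in degrees.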
Quality Standardization: Harmonize incident classification schemes across different data sources, with particular attention to weather-related vs. traffic-related delay categorization.
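A sketch of the harmonization step with `dplyr::case_when()` (the raw labels below are invented for illustration, not the TTC's actual incident codes):

```r
library(dplyr)

# Toy incident labels drawn from two sources with inconsistent vocabularies
incidents <- tibble::tibble(
  raw_cause = c("Snow/Ice", "HELD BY TRAFFIC", "Rain", "Collision",
                "Late Leaving Garage")
)

# Map every raw label onto a small, shared scheme; unmatched labels
# fall through to "other" rather than silently becoming NA
incidents <- incidents %>%
  mutate(cause_std = case_when(
    raw_cause %in% c("Snow/Ice", "Rain")             ~ "weather",
    raw_cause %in% c("HELD BY TRAFFIC", "Collision") ~ "traffic",
    TRUE                                             ~ "other"
  ))
```

Keeping the mapping in one `case_when()` table makes the weather-vs-traffic categorization auditable, which matters when those categories later drive the causal models.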
### Modal Comparison Analysis

Implement statistical hypothesis testing with Kruskal-Wallis tests (non-parametric, appropriate for the likely right-skewed delay distributions) to compare delay patterns across transit modes, followed by post-hoc Dunn tests for pairwise comparisons between streetcar, bus, and subway services.
```r
library(tidyverse)
library(dunn.test)

# Modal delay comparison
delay_comparison <- transit_delays %>%
  group_by(mode) %>%
  summarise(
    median_delay = median(delay_minutes, na.rm = TRUE),
    iqr_delay = IQR(delay_minutes, na.rm = TRUE),
    delay_incidents_per_1000_trips = n() / sum(scheduled_trips) * 1000
  )

# Statistical testing
kruskal.test(delay_minutes ~ mode, data = transit_delays)
dunn.test(transit_delays$delay_minutes, transit_delays$mode)
```

### Distributional Analysis

Use kernel density estimation and violin plots to visualize delay distributions by mode, identifying whether streetcars show distinctive delay patterns (potentially bimodal, with traffic-related peaks).
### Linear Mixed-Effects Models for Hierarchical Data

Account for route-level clustering in delay patterns using the lme4 package. This addresses the nested structure in which delay incidents cluster within routes and time periods.
```r
library(lme4)
library(broom.mixed)

# Hierarchical delay model with random intercepts for route and month
delay_model <- lmer(
  delay_minutes ~
    weather_condition + hour_of_day + day_of_week +
    traffic_volume + construction_nearby +
    (1 | route_id) + (1 | month),
  data = streetcar_delays
)
tidy(delay_model, effects = "fixed")
```

### Sigmoidal Regression for Capacity-Delay Relationships

Model the non-linear relationship between ridership/traffic volume and delay using logistic growth curves. This captures threshold effects, where delays accelerate rapidly beyond certain capacity levels.
```r
library(drc)

# Capacity-delay curve fitting
capacity_model <- drm(
  delay_minutes ~ capacity_utilization,
  data = route_performance,
  fct = L.4()  # 4-parameter logistic
)

# Extract the utilization levels at which 10%, 50%, and 90% of the
# maximum delay response is reached
ED(capacity_model, c(10, 50, 90))
```

### Seasonal Decomposition with STL

Apply Seasonal-Trend decomposition using LOESS (STL) to separate long-term trends from seasonal patterns and irregular fluctuations. This is particularly important for Toronto's climate-variable operations.
```r
library(feasts)
library(tsibble)

# Convert to tsibble format
delay_ts <- streetcar_delays %>%
  group_by(route_id, date) %>%
  summarise(daily_avg_delay = mean(delay_minutes), .groups = "drop") %>%
  as_tsibble(index = date, key = route_id)

# STL decomposition
delay_decomp <- delay_ts %>%
  model(stl = STL(daily_avg_delay))
components(delay_decomp) %>% autoplot()
```

### ARIMA vs Prophet Comparison

Implement both approaches and compare forecasting performance using time series cross-validation. ARIMA models excel with stationary delay patterns, while Prophet handles Toronto's complex seasonality (winter weather, summer construction, special events) more naturally.
```r
library(fable)          # ARIMA() for tsibbles
library(fable.prophet)  # prophet() model interface for fable

# ARIMA implementation
arima_model <- delay_ts %>%
  model(arima = ARIMA(daily_avg_delay))

# Prophet implementation
prophet_model <- delay_ts %>%
  model(prophet = prophet(daily_avg_delay))

# Cross-validation comparison: expanding windows, 14-day forecast horizon
cv_results <- delay_ts %>%
  stretch_tsibble(.init = 365, .step = 30) %>%
  model(
    arima = ARIMA(daily_avg_delay),
    prophet = prophet(daily_avg_delay)
  ) %>%
  forecast(h = 14) %>%
  accuracy(delay_ts)
```

### Difference-in-Differences for King Street Analysis

Leverage the King Street pilot as a natural experiment, comparing treated (King Street) and control routes before, during, and after implementation. This provides policy-relevant causal estimates.
```r
library(fixest)

# DiD estimation with route and date fixed effects
did_model <- feols(
  delay_minutes ~
    i(post_intervention, king_street, ref = FALSE) +
    weather_condition + hour_of_day | route_id + date,
  data = intervention_analysis
)

# Visualize treatment effects
coefplot(did_model)
```

### Instrumental Variables for Traffic-Delay Relationships

Address endogeneity in the traffic-delay relationship by using weather conditions as instruments for traffic volume, following standard transportation-economics methodology.
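A minimal sketch of the IV specification in fixest's three-part formula (exogenous controls | fixed effects | endogenous ~ instruments), fitted on simulated data because the real variable names are assumptions. One caveat worth stating plainly: since weather also delays streetcars directly, the exclusion restriction needs an explicit defense in the real analysis — it holds below only by construction.

```r
library(fixest)
set.seed(42)

# Simulated data: precipitation shifts traffic volume, which raises delays
n <- 500
sim <- data.frame(
  route_id = sample(letters[1:5], n, replace = TRUE),
  date = sample(seq.Date(as.Date("2024-01-01"), by = "day", length.out = 30),
                n, replace = TRUE),
  hour_of_day = sample(6:22, n, replace = TRUE),
  precipitation_mm = rexp(n, rate = 1)
)
sim$traffic_volume <- 100 + 20 * sim$precipitation_mm + rnorm(n, sd = 10)
sim$delay_minutes  <- 2 + 0.05 * sim$traffic_volume + rnorm(n)

# Three-part formula: controls | fixed effects | first stage
iv_model <- feols(
  delay_minutes ~ hour_of_day | route_id + date |
    traffic_volume ~ precipitation_mm,
  data = sim
)
summary(iv_model, stage = 1)  # inspect first-stage strength before trusting the IV
```

fixest labels the instrumented coefficient `fit_traffic_volume`; with a strong first stage like this one, it recovers the true effect (0.05 per unit of traffic volume) that OLS would misstate under endogeneity.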
### Graph-Based Delay Propagation

Model the TTC network as a graph in which delay propagation follows route connections and transfer points. Use centrality measures to identify critical nodes where delays have network-wide impacts.
```r
library(tidygraph)
library(ggraph)

# create_network_graph() is a user-defined helper (not shown) that builds a
# tbl_graph of stops and interchanges from route and transfer tables
ttc_network <- create_network_graph(routes_data, transfers_data)

# Calculate centrality measures
network_analysis <- ttc_network %>%
  activate(nodes) %>%
  mutate(
    betweenness = centrality_betweenness(),
    pagerank = centrality_pagerank(),
    delay_centrality = centrality_eigen(weights = avg_delay)
  )

# Visualize network with delay centrality
ggraph(network_analysis, layout = "fr") +
  geom_edge_link(alpha = 0.6) +
  geom_node_point(aes(size = delay_centrality, color = avg_delay)) +
  scale_color_viridis_c() +
  theme_graph()
```

### Random Forest for Delay Prediction
Implement using the ranger package for computational efficiency, particularly valuable given the mixed data types (categorical weather conditions, continuous traffic volumes, temporal features).
```r
library(ranger)
library(tidymodels)

# Feature engineering pipeline (step_date() handles date features;
# hour-of-day comes from step_time(), since step_date() has no "hour" feature)
delay_recipe <- recipe(delay_minutes ~ ., data = training_data) %>%
  step_date(datetime, features = c("dow", "month")) %>%
  step_time(datetime, features = "hour") %>%
  step_holiday(datetime, holidays = timeDate::listHolidays("CA")) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Random forest model
rf_spec <- rand_forest(
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Workflow and tuning
rf_workflow <- workflow() %>%
  add_recipe(delay_recipe) %>%
  add_model(rf_spec)

# Time series cross-validation: 12-month training windows, 1-month assessment
folds <- sliding_period(
  training_data,
  datetime,
  "month",
  lookback = 12,
  assess_stop = 1
)

rf_results <- tune_grid(
  rf_workflow,
  resamples = folds,
  grid = 20,
  metrics = metric_set(rmse, mae, rsq)
)
```

### Feature Importance Analysis and Interpretation

Use SHAP (SHapley Additive exPlanations) values for model interpretation, which is crucial when translating model output into policy recommendations for transit operators.
```r
library(fastshap)

# SHAP analysis; pred_wrapper is a function(object, newdata) returning
# numeric predictions from the fitted model
shap_values <- explain(
  final_rf_model,
  X = test_features,
  pred_wrapper = predict_function,
  nsim = 50
)

# Visualization
autoplot(shap_values, type = "importance")
autoplot(shap_values, type = "dependence", feature = "traffic_volume")
```

### Gradient Boosting for Non-linear Relationships

Implement XGBoost to capture complex interactions between weather, traffic, and temporal factors affecting streetcar delays.
```r
library(xgboost)

# XGBoost implementation through tidymodels
xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# GPU acceleration caveat: tree_method = "gpu_hist" requires a CUDA-capable
# NVIDIA GPU, so it will not run on the AMD RX 7900 XTX; expect CPU training
# in R and reserve GPU work for the ROCm/Python environment described below
xgb_gpu_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost", tree_method = "gpu_hist") %>%
  set_mode("regression")
```

### Time Series Split Implementation

Critical for avoiding data leakage in transit forecasting applications.
```r
# Custom time series validation: 6-month training windows advancing by
# 1 month, each assessed on the following month
ts_splits <- training_data %>%
  arrange(datetime) %>%
  sliding_period(
    datetime,
    "month",
    lookback = 6,
    assess_stop = 1,
    step = 1
  )

# Walk-forward validation for realistic evaluation
walk_forward_validate <- function(model_spec, data) {
  map_dfr(ts_splits$splits, function(split) {
    train_data <- analysis(split)
    test_data <- assessment(split)
    fitted_model <- fit(model_spec, data = train_data)
    predictions <- predict(fitted_model, test_data)
    bind_cols(test_data, predictions) %>%
      metrics(delay_minutes, .pred)
  })
}
```

The literature review reveals three highest-impact interventions for Toronto's context:
### Traffic Signal Priority (TSP) Expansion

Research from Warsaw shows a 6.7% travel-time decrease and a 5-8.5% accessibility increase with comprehensive TSP implementation. Toronto's existing 440 TSP locations provide a strong foundation for systematic expansion, particularly given King Street's demonstrated 12% travel-time reduction during properly enforced periods.
### Dedicated Transit Lanes with Enforcement

Melbourne's empirical evaluation demonstrates a 16.4% crash reduction and significant reliability improvements with proper lane priority. The critical lesson from Toronto's King Street experience is that infrastructure alone is insufficient: enforcement effectiveness determines success, with violation rates of 99.7% degrading service in the absence of active traffic agents.
### Machine Learning Integration for Operations

Recent research on CNN-BiLSTM-Attention (CBLA) architectures reports a 122-second average prediction error for delay forecasting, enabling passenger information systems and operational optimization. This aligns with Toronto's real-time data capabilities and passenger-experience priorities.
## Key Academic Sources
- Naznin, F., Currie, G., Logan, D., & Sarvi, M. (2018). An empirical Bayes safety evaluation of tram/streetcar signal and lane priority measures in Melbourne. *Accident Analysis & Prevention*, 121, 13-23.
- Niedzielski, M. A. (2024). Signals, tracks, and trams: public transport signal priority impact on job accessibility over time. *Journal of Transport Geography*, 114, 103-118.
- Zhang, T., Wang, R., Hu, P., & Pu, C. (2023). Real-time train delay prediction using CNN-BiLSTM-Attention network. *Journal of Intelligent Transportation Systems*, 29(3), 412-428.
- Eliasson, J. (2008). Lessons from the Stockholm congestion charging trial. *Transport Policy*, 15(6), 395-404.

### Executive Dashboard for TTC Leadership

Interactive Shiny application with key performance indicators, route-level drill-down capability, and intervention impact modeling. Focus on operational metrics and budget implications.
```r
library(shiny)
library(shinydashboard)  # provides dashboardPage() and friends
library(plotly)
library(DT)

# Executive dashboard structure
ui <- dashboardPage(
  dashboardHeader(title = "TTC Delay Analysis Dashboard"),
  dashboardSidebar(
    selectInput("route", "Select Route:", choices = unique_routes),
    dateRangeInput("date_range", "Date Range:",
                   start = "2024-01-01", end = Sys.Date()),
    checkboxGroupInput("delay_causes", "Delay Causes:",
                       choices = delay_categories)
  ),
  dashboardBody(
    fluidRow(
      valueBoxOutput("avg_delay"),
      valueBoxOutput("on_time_performance"),
      valueBoxOutput("cost_impact")
    ),
    fluidRow(
      plotlyOutput("delay_trend"),
      plotlyOutput("spatial_heatmap")
    )
  )
)
```

### Public Communication Through Interactive Maps

Implement accessible visualizations showing service reliability by location, using colorblind-safe palettes and clear performance metrics.
```r
library(leaflet)
library(viridis)

# Public-facing reliability map; assumes one row per route segment with
# coordinates and a three-level performance_category factor
create_public_map <- function(route_performance) {
  pal <- colorFactor(viridis(3),
                     domain = route_performance$performance_category)
  leaflet(route_performance) %>%
    addTiles() %>%
    addPolylines(
      lng = ~longitude, lat = ~latitude,
      weight = ~ifelse(avg_delay < 2, 2,
                       ifelse(avg_delay < 5, 4, 6)),
      color = ~pal(performance_category),
      popup = ~paste("Route:", route_name, "<br>",
                     "Average Delay:", round(avg_delay, 1), "minutes<br>",
                     "On-time Performance:", paste0(otp_percent, "%"))
    ) %>%
    addLegend("bottomright",
              colors = viridis(3),
              labels = c("Good (< 2 min)", "Fair (2-5 min)", "Poor (> 5 min)"),
              title = "Service Reliability")
}
```

### Publication-Ready Figures

Generate high-resolution, accessible visualizations following Transportation Research Board guidelines.
```r
library(ggplot2)
library(ggpubr)  # stat_compare_means()

# Publication figure template
create_publication_figure <- function(data, title) {
  ggplot(data, aes(x = route_category, y = delay_minutes)) +
    geom_boxplot(aes(fill = intervention_status), alpha = 0.7) +
    stat_compare_means(comparisons = list(c("before", "after"))) +
    scale_fill_viridis_d(name = "Period") +
    labs(
      title = title,
      subtitle = paste("Analysis of", nrow(data), "delay incidents"),
      x = "Route Category",
      y = "Delay Duration (minutes)",
      caption = "Boxes show medians and interquartile ranges"
    ) +
    theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Export for publication (pass the figure object explicitly)
fig <- create_publication_figure(delay_data, "Streetcar Delay Analysis")
ggsave("delay_analysis.png", plot = fig, width = 8, height = 6, dpi = 300)
ggsave("delay_analysis.pdf", plot = fig, width = 8, height = 6)
```

### Core Package Ecosystem

Implement the analysis primarily in R using tidyverse principles, with strategic Python integration for specialized ML applications.
```r
# Core analysis packages
library(tidyverse)   # Data manipulation and visualization
library(lubridate)   # Date/time handling
library(sf)          # Spatial data analysis
library(tidymodels)  # Machine learning framework
library(fable)       # Time series modelling (ARIMA for tsibbles)
library(leaflet)     # Interactive mapping
library(shiny)       # Dashboard development
library(quarto)      # Documentation and reporting

# Specialized packages
library(tsibble)     # Time series data structure
library(feasts)      # Time series feature extraction
library(ranger)      # Fast random forests
library(fixest)      # Econometric analysis
library(broom)       # Statistical model tidying
```

### Targeted Python Usage

Use Python for specific ML implementations and GPU-accelerated computations while maintaining R as the primary analytical environment.
```r
library(reticulate)

# Configure Python environment
use_condaenv("transit_analysis")

# Import Python libraries for specific tasks
sklearn <- import("sklearn.ensemble")
xgb <- import("xgboost")
prophet <- import("prophet")

# Hybrid workflow example: fit in Python, predict, return to R
py_predictions <- sklearn$RandomForestRegressor(n_estimators = 1000L)$
  fit(r_to_py(training_features), r_to_py(training_target))$
  predict(r_to_py(test_features))

# Convert back to R for further analysis
r_predictions <- py_to_r(py_predictions)
```

### ROCm Setup for Linux Development

Given AMD's limited ML-ecosystem support, establish a dual-boot Ubuntu environment for GPU acceleration when needed.
```bash
# ROCm installation (Ubuntu 22.04); apt-key is deprecated on newer releases,
# so consult AMD's current install guide for your Ubuntu version
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dkms rocm-libs rocm-dev rocm-utils

# PyTorch with ROCm support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
```

### Performance Expectations and Alternatives

Current AMD software maturity typically means 30-50% lower ML throughput than comparable NVIDIA hardware, but the RX 7900 XTX's 24GB of VRAM is a significant memory advantage for large datasets. Consider cloud-based alternatives (AWS, GCP) for production ML workloads.
Objectives: Data familiarization, basic pattern identification, methodology establishment
Key Deliverables:
Learning Focus: R/tidyverse mastery, statistical hypothesis testing, time series basics, spatial data handling
Objectives: Causal inference, trend analysis, intervention evaluation
Key Deliverables:
Learning Focus: Causal inference techniques, time series analysis, regression diagnostics, econometric methods
Objectives: Predictive modeling, feature importance, operational optimization
Key Deliverables:
Learning Focus: ML model validation, feature engineering, model interpretation, deployment considerations
Objectives: Comprehensive reporting, stakeholder communication, policy recommendations
Key Deliverables:
Learning Focus: Technical writing, data visualization design, dashboard development, academic publication standards
Based on international evidence and Toronto's context, properly implemented interventions should achieve:
Immediate Implementation (6-12 months)
Medium-term Initiatives (1-3 years)
Long-term Strategic Development (3-5 years)
This comprehensive project plan provides both an educational pathway for developing data science expertise and a practical framework for contributing to Toronto's transit improvement efforts. The combination of rigorous methodology, evidence-based recommendations, and accessible communication ensures both academic learning value and real-world policy impact.