This project plan provides a complete framework for analyzing Toronto's transit delays with specific focus on streetcar performance. Designed for researchers with philosophy, biochemistry, and economics backgrounds transitioning to data science, it combines rigorous methodology with practical R/tidyverse implementation and strategic Python integration.
### TTC Delay Data Collection (2014-2025)

The foundation of this analysis rests on the already-cleaned TTC delay datasets covering buses, streetcars, and subways. These provide direct measurements of delay patterns with consistent temporal coverage spanning more than a decade. Analytical value: this longitudinal dataset supports robust time-series analysis and the causal-inference techniques essential for policy evaluation.
### TTC GTFS Static and Real-Time Data

Critical for calculating schedule adherence metrics and understanding baseline service expectations. The General Transit Feed Specification (GTFS) format provides standardized route definitions, stop locations, and scheduled times. This enables calculation of delay deviations from planned service rather than just raw delay numbers.
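As a sketch of the schedule-adherence idea (toy data; real GTFS `stop_times` uses fields like `arrival_time`, and the ±3-minute on-time window here is an assumption, not the TTC's official standard), observed arrivals can be joined to the schedule and summarised as signed deviations:

```r
library(tidyverse)

# Toy GTFS-style schedule and observed arrivals (illustrative values only)
gtfs_stop_times <- tibble(
  trip_id   = c("t1", "t1", "t2"),
  stop_id   = c("s1", "s2", "s1"),
  route_id  = "504",
  scheduled = as.POSIXct(c("2024-06-01 08:00", "2024-06-01 08:10",
                           "2024-06-01 08:15"), tz = "America/Toronto")
)
observed_arrivals <- tibble(
  trip_id  = c("t1", "t1", "t2"),
  stop_id  = c("s1", "s2", "s1"),
  observed = gtfs_stop_times$scheduled + c(120, 400, -60)  # offsets in seconds
)

# Signed deviation from the timetable: positive = late, negative = early
schedule_adherence <- observed_arrivals %>%
  inner_join(gtfs_stop_times, by = c("trip_id", "stop_id")) %>%
  mutate(deviation_min = as.numeric(difftime(observed, scheduled, units = "mins"))) %>%
  group_by(route_id) %>%
  summarise(
    median_deviation_min = median(deviation_min),
    pct_within_3_min     = mean(abs(deviation_min) <= 3) * 100
  )
```

Keeping the deviation signed matters: early running is a reliability problem in its own right and disappears if only positive delays are recorded.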
### King Street Transit Priority Corridor Data (2017-2019)

This represents Toronto's most comprehensive transit intervention case study, with measured outcomes showing travel times falling from 23 to 16 minutes during implementation. The dataset includes before-after comparisons with rigorous controls, making it invaluable for validating causal-inference methodology and demonstrating intervention effectiveness.
### Traffic Volume Data from Toronto Open Data

Essential for understanding congestion impacts on streetcar delays. Research from Melbourne reports a 16.4% crash reduction and significant travel-time improvements from tram priority measures integrated with traffic analysis. This dataset enables spatial joining with streetcar routes to quantify the impact of mixed-traffic operation.
### Statistics Canada Large Urban Transit Statistics

Provides comparative context with standardized metrics across major Canadian cities. Monthly data for 10 transit operators, including the TTC, enables benchmarking and separates Toronto-specific challenges from national trends.
### Weather Data from Environment Canada

Critical for Toronto's climate-variable operations. Research identifies weather as a significant delay factor, particularly for mixed-traffic streetcar operations. Historical meteorological data supports seasonal decomposition and climate-resilience planning.
### Vancouver TransLink and Montreal STM Comparative Data

These Canadian systems provide relevant comparisons for policy learning. Vancouver's integrated bus-rail system and Montreal's metro-bus combination offer different operational models for contextualizing Toronto's performance.
### Traffic Signal Data and Intersection Files

Essential for signal priority analysis. With 440 transit signal priority (TSP) locations system-wide and documented 6-10% travel-time improvements from signal priority implementations, understanding signal-streetcar interactions is crucial for intervention recommendations.
### Parking Ticket and Traffic Camera Data

While these provide insight into enforcement patterns affecting streetcar operations, they require extensive spatial analysis and may not deliver proportional analytical value in the initial implementation.
### International Comparative Data (NYC MTA, London TfL)

Valuable for best-practices research, but less directly applicable because of different regulatory environments and infrastructure configurations.
Temporal Alignment Strategy: Standardize all datasets to consistent time periods (likely monthly aggregations for trend analysis, daily for operational patterns) with careful handling of missing data periods during service disruptions.
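A minimal tidyverse sketch of the monthly-aggregation step (toy values; the `complete()` call turns a disruption month into an explicit row rather than a silent gap):

```r
library(tidyverse)
library(lubridate)

# Toy incident-level data with no incidents recorded in March
delays <- tibble(
  date = as.Date(c("2024-01-05", "2024-01-20", "2024-02-11", "2024-04-02")),
  delay_minutes = c(4, 7, 5, 9)
)

monthly <- delays %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(mean_delay = mean(delay_minutes), n_incidents = n()) %>%
  # make the missing month explicit: zero incidents, unknown mean delay
  complete(month = seq(min(month), max(month), by = "month"),
           fill = list(n_incidents = 0))
```

Distinguishing "zero incidents" from "no data" at this stage prevents downstream time-series models from interpolating over service disruptions as if they were ordinary months.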
Spatial Consistency Protocol: Use Toronto's standardized coordinate systems and ensure route definitions align across datasets. Implement buffer zones around streetcar stops for spatial joining with traffic and infrastructure data.
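A small sf sketch of the buffer-and-join idea, using made-up point coordinates in UTM zone 17N (EPSG:32617, which covers Toronto — confirm the actual CRS of each source layer) and a 50 m buffer radius chosen purely for illustration:

```r
library(sf)
library(dplyr)

# Toy stop and traffic-sensor locations, coordinates in metres (projected CRS)
stops <- st_as_sf(
  data.frame(stop_id = c("a", "b"), x = c(0, 1000), y = c(0, 0)),
  coords = c("x", "y"), crs = 32617
)
sensors <- st_as_sf(
  data.frame(sensor_id = c("s1", "s2"), volume = c(1200, 800),
             x = c(30, 5000), y = c(0, 0)),
  coords = c("x", "y"), crs = 32617
)

# 50 m buffer around each stop, then attach any sensors falling inside it
stop_traffic <- stops %>%
  st_buffer(dist = 50) %>%
  st_join(sensors, join = st_intersects)
```

Buffering must happen in a projected CRS so that `dist` is in metres; buffering lat/long coordinates directly would give a radius in degrees.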
Quality Standardization: Harmonize incident classification schemes across different data sources, with particular attention to weather-related vs. traffic-related delay categorization.
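A sketch of the harmonization step with `dplyr::case_when()` (the raw labels below are invented for illustration, not the TTC's actual incident codes):

```r
library(dplyr)

# Toy incident labels drawn from two sources with inconsistent vocabularies
incidents <- tibble::tibble(
  raw_cause = c("Snow/Ice", "HELD BY TRAFFIC", "Rain", "Collision",
                "Late Leaving Garage")
)

# Map every raw label onto a small, shared scheme; unmatched labels
# fall through to "other" rather than silently becoming NA
incidents <- incidents %>%
  mutate(cause_std = case_when(
    raw_cause %in% c("Snow/Ice", "Rain")             ~ "weather",
    raw_cause %in% c("HELD BY TRAFFIC", "Collision") ~ "traffic",
    TRUE                                             ~ "other"
  ))
```

Keeping the mapping in one `case_when()` table makes the weather-vs-traffic categorization auditable, which matters when those categories later drive the causal models.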
### Modal Comparison Analysis

Implement statistical hypothesis testing with Kruskal-Wallis tests (non-parametric, appropriate for the likely right-skewed delay distributions) to compare delay patterns across transit modes, followed by post-hoc Dunn tests for pairwise comparisons between streetcar, bus, and subway services.
```r
library(tidyverse)
library(dunn.test)

# Modal delay comparison
delay_comparison <- transit_delays %>%
  group_by(mode) %>%
  summarise(
    median_delay = median(delay_minutes, na.rm = TRUE),
    iqr_delay = IQR(delay_minutes, na.rm = TRUE),
    delay_incidents_per_1000_trips = n() / sum(scheduled_trips) * 1000
  )

# Statistical testing
kruskal.test(delay_minutes ~ mode, data = transit_delays)
dunn.test(transit_delays$delay_minutes, transit_delays$mode)
```

### Distributional Analysis

Use kernel density estimation and violin plots to visualize delay distributions by mode, identifying whether streetcars show distinctive delay patterns (potentially bimodal, with traffic-related peaks).
### Linear Mixed-Effects Models for Hierarchical Data

Account for route-level clustering in delay patterns using the lme4 package. This addresses the nested structure in which delay incidents cluster within routes and time periods.
```r
library(lme4)
library(broom.mixed)

# Hierarchical delay model with random intercepts for route and month
delay_model <- lmer(
  delay_minutes ~
    weather_condition + hour_of_day + day_of_week +
    traffic_volume + construction_nearby +
    (1 | route_id) + (1 | month),
  data = streetcar_delays
)
tidy(delay_model, effects = "fixed")
```

### Sigmoidal Regression for Capacity-Delay Relationships

Model the non-linear relationship between ridership/traffic volume and delay using logistic growth curves. This captures threshold effects, where delays accelerate rapidly beyond certain capacity levels.
```r
library(drc)

# Capacity-delay curve fitting
capacity_model <- drm(
  delay_minutes ~ capacity_utilization,
  data = route_performance,
  fct = L.4()  # 4-parameter logistic
)

# Extract the utilization levels at which 10%, 50%, and 90% of the
# maximum delay response is reached
ED(capacity_model, c(10, 50, 90))
```

### Seasonal Decomposition with STL

Apply Seasonal-Trend decomposition using LOESS (STL) to separate long-term trends from seasonal patterns and irregular fluctuations. This is particularly important for Toronto's climate-variable operations.
```r
library(feasts)
library(tsibble)

# Convert to tsibble format
delay_ts <- streetcar_delays %>%
  group_by(route_id, date) %>%
  summarise(daily_avg_delay = mean(delay_minutes), .groups = "drop") %>%
  as_tsibble(index = date, key = route_id)

# STL decomposition
delay_decomp <- delay_ts %>%
  model(stl = STL(daily_avg_delay))
components(delay_decomp) %>% autoplot()
```

### ARIMA vs Prophet Comparison

Implement both approaches and compare forecasting performance using time series cross-validation. ARIMA models excel with stationary delay patterns, while Prophet handles Toronto's complex seasonality (winter weather, summer construction, special events) more naturally.
```r
library(fable)          # ARIMA() for tsibbles
library(fable.prophet)  # prophet() model interface for fable

# ARIMA implementation
arima_model <- delay_ts %>%
  model(arima = ARIMA(daily_avg_delay))

# Prophet implementation
prophet_model <- delay_ts %>%
  model(prophet = prophet(daily_avg_delay))

# Cross-validation comparison: expanding windows, 14-day forecast horizon
cv_results <- delay_ts %>%
  stretch_tsibble(.init = 365, .step = 30) %>%
  model(
    arima = ARIMA(daily_avg_delay),
    prophet = prophet(daily_avg_delay)
  ) %>%
  forecast(h = 14) %>%
  accuracy(delay_ts)
```

### Difference-in-Differences for King Street Analysis

Leverage the King Street pilot as a natural experiment, comparing treated (King Street) and control routes before, during, and after implementation. This provides policy-relevant causal estimates.
```r
library(fixest)

# DiD estimation with route and date fixed effects
did_model <- feols(
  delay_minutes ~
    i(post_intervention, king_street, ref = FALSE) +
    weather_condition + hour_of_day | route_id + date,
  data = intervention_analysis
)

# Visualize treatment effects
coefplot(did_model)
```

### Instrumental Variables for Traffic-Delay Relationships

Address endogeneity in the traffic-delay relationship by using weather conditions as instruments for traffic volume, following standard transportation-economics methodology.
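A minimal sketch of the IV specification in fixest's three-part formula (exogenous controls | fixed effects | endogenous ~ instruments), fitted on simulated data because the real variable names are assumptions. One caveat worth stating plainly: since weather also delays streetcars directly, the exclusion restriction needs an explicit defense in the real analysis — it holds below only by construction.

```r
library(fixest)
set.seed(42)

# Simulated data: precipitation shifts traffic volume, which raises delays
n <- 500
sim <- data.frame(
  route_id = sample(letters[1:5], n, replace = TRUE),
  date = sample(seq.Date(as.Date("2024-01-01"), by = "day", length.out = 30),
                n, replace = TRUE),
  hour_of_day = sample(6:22, n, replace = TRUE),
  precipitation_mm = rexp(n, rate = 1)
)
sim$traffic_volume <- 100 + 20 * sim$precipitation_mm + rnorm(n, sd = 10)
sim$delay_minutes  <- 2 + 0.05 * sim$traffic_volume + rnorm(n)

# Three-part formula: controls | fixed effects | first stage
iv_model <- feols(
  delay_minutes ~ hour_of_day | route_id + date |
    traffic_volume ~ precipitation_mm,
  data = sim
)
summary(iv_model, stage = 1)  # inspect first-stage strength before trusting the IV
```

fixest labels the instrumented coefficient `fit_traffic_volume`; with a strong first stage like this one, it recovers the true effect (0.05 per unit of traffic volume) that OLS would misstate under endogeneity.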
### Graph-Based Delay Propagation

Model the TTC network as a graph in which delay propagation follows route connections and transfer points. Use centrality measures to identify critical nodes where delays have network-wide impacts.
```r
library(tidygraph)
library(ggraph)

# create_network_graph() is a user-defined helper (not shown) that builds a
# tbl_graph of stops and interchanges from route and transfer tables
ttc_network <- create_network_graph(routes_data, transfers_data)

# Calculate centrality measures
network_analysis <- ttc_network %>%
  activate(nodes) %>%
  mutate(
    betweenness = centrality_betweenness(),
    pagerank = centrality_pagerank(),
    delay_centrality = centrality_eigen(weights = avg_delay)
  )

# Visualize network with delay centrality
ggraph(network_analysis, layout = "fr") +
  geom_edge_link(alpha = 0.6) +
  geom_node_point(aes(size = delay_centrality, color = avg_delay)) +
  scale_color_viridis_c() +
  theme_graph()
```

### Random Forest for Delay Prediction
Implement using the ranger package for computational efficiency, particularly valuable given the mixed data types (categorical weather conditions, continuous traffic volumes, temporal features).
```r
library(ranger)
library(tidymodels)

# Feature engineering pipeline (step_date() handles date features;
# hour-of-day comes from step_time(), since step_date() has no "hour" feature)
delay_recipe <- recipe(delay_minutes ~ ., data = training_data) %>%
  step_date(datetime, features = c("dow", "month")) %>%
  step_time(datetime, features = "hour") %>%
  step_holiday(datetime, holidays = timeDate::listHolidays("CA")) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Random forest model
rf_spec <- rand_forest(
  trees = 1000,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Workflow and tuning
rf_workflow <- workflow() %>%
  add_recipe(delay_recipe) %>%
  add_model(rf_spec)

# Time series cross-validation: 12-month training windows, 1-month assessment
folds <- sliding_period(
  training_data,
  datetime,
  "month",
  lookback = 12,
  assess_stop = 1
)

rf_results <- tune_grid(
  rf_workflow,
  resamples = folds,
  grid = 20,
  metrics = metric_set(rmse, mae, rsq)
)
```

### Feature Importance Analysis and Interpretation

Use SHAP (SHapley Additive exPlanations) values for model interpretation, which is crucial when translating model output into policy recommendations for transit operators.
```r
library(fastshap)

# SHAP analysis; pred_wrapper is a function(object, newdata) returning
# numeric predictions from the fitted model
shap_values <- explain(
  final_rf_model,
  X = test_features,
  pred_wrapper = predict_function,
  nsim = 50
)

# Visualization
autoplot(shap_values, type = "importance")
autoplot(shap_values, type = "dependence", feature = "traffic_volume")
```

### Gradient Boosting for Non-linear Relationships

Implement XGBoost to capture complex interactions between weather, traffic, and temporal factors affecting streetcar delays.
```r
library(xgboost)

# XGBoost implementation through tidymodels
xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# GPU acceleration caveat: tree_method = "gpu_hist" requires a CUDA-capable
# NVIDIA GPU, so it will not run on the AMD RX 7900 XTX; expect CPU training
# in R and reserve GPU work for the ROCm/Python environment described below
xgb_gpu_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost", tree_method = "gpu_hist") %>%
  set_mode("regression")
```

### Time Series Split Implementation

Critical for avoiding data leakage in transit forecasting applications.
```r
# Custom time series validation: 6-month training windows advancing by
# 1 month, each assessed on the following month
ts_splits <- training_data %>%
  arrange(datetime) %>%
  sliding_period(
    datetime,
    "month",
    lookback = 6,
    assess_stop = 1,
    step = 1
  )

# Walk-forward validation for realistic evaluation
walk_forward_validate <- function(model_spec, data) {
  map_dfr(ts_splits$splits, function(split) {
    train_data <- analysis(split)
    test_data <- assessment(split)
    fitted_model <- fit(model_spec, data = train_data)
    predictions <- predict(fitted_model, test_data)
    bind_cols(test_data, predictions) %>%
      metrics(delay_minutes, .pred)
  })
}
```

The literature review reveals three highest-impact interventions for Toronto's context:
### Traffic Signal Priority (TSP) Expansion

Research from Warsaw shows a 6.7% travel-time decrease and a 5-8.5% accessibility increase with comprehensive TSP implementation. Toronto's existing 440 TSP locations provide a strong foundation for systematic expansion, particularly given King Street's demonstrated 12% travel-time reduction during properly enforced periods.
### Dedicated Transit Lanes with Enforcement

Melbourne's empirical evaluation demonstrates a 16.4% crash reduction and significant reliability improvements with proper lane priority. The critical lesson from Toronto's King Street experience is that infrastructure alone is insufficient: enforcement effectiveness determines success, with violation rates of 99.7% degrading service in the absence of active traffic agents.
### Machine Learning Integration for Operations

Recent research on CNN-BiLSTM-Attention (CBLA) architectures reports a 122-second average prediction error for delay forecasting, enabling passenger information systems and operational optimization. This aligns with Toronto's real-time data capabilities and passenger-experience priorities.
## Key Academic Sources
- Naznin, F., Currie, G., Logan, D., & Sarvi, M. (2018). An empirical Bayes safety evaluation of tram/streetcar signal and lane priority measures in Melbourne. *Accident Analysis & Prevention*, 121, 13-23.
- Niedzielski, M. A. (2024). Signals, tracks, and trams: public transport signal priority impact on job accessibility over time. *Journal of Transport Geography*, 114, 103-118.
- Zhang, T., Wang, R., Hu, P., & Pu, C. (2023). Real-time train delay prediction using CNN-BiLSTM-Attention network. *Journal of Intelligent Transportation Systems*, 29(3), 412-428.
- Eliasson, J. (2008). Lessons from the Stockholm congestion charging trial. *Transport Policy*, 15(6), 395-404.

### Executive Dashboard for TTC Leadership

Interactive Shiny application with key performance indicators, route-level drill-down capability, and intervention impact modeling. Focus on operational metrics and budget implications.
```r
library(shiny)
library(shinydashboard)  # provides dashboardPage() and friends
library(plotly)
library(DT)

# Executive dashboard structure
ui <- dashboardPage(
  dashboardHeader(title = "TTC Delay Analysis Dashboard"),
  dashboardSidebar(
    selectInput("route", "Select Route:", choices = unique_routes),
    dateRangeInput("date_range", "Date Range:",
                   start = "2024-01-01", end = Sys.Date()),
    checkboxGroupInput("delay_causes", "Delay Causes:",
                       choices = delay_categories)
  ),
  dashboardBody(
    fluidRow(
      valueBoxOutput("avg_delay"),
      valueBoxOutput("on_time_performance"),
      valueBoxOutput("cost_impact")
    ),
    fluidRow(
      plotlyOutput("delay_trend"),
      plotlyOutput("spatial_heatmap")
    )
  )
)
```

### Public Communication Through Interactive Maps

Implement accessible visualizations showing service reliability by location, using colorblind-safe palettes and clear performance metrics.
```r
library(leaflet)
library(viridis)

# Public-facing reliability map; assumes one row per route segment with
# coordinates and a three-level performance_category factor
create_public_map <- function(route_performance) {
  pal <- colorFactor(viridis(3),
                     domain = route_performance$performance_category)
  leaflet(route_performance) %>%
    addTiles() %>%
    addPolylines(
      lng = ~longitude, lat = ~latitude,
      weight = ~ifelse(avg_delay < 2, 2,
                       ifelse(avg_delay < 5, 4, 6)),
      color = ~pal(performance_category),
      popup = ~paste("Route:", route_name, "<br>",
                     "Average Delay:", round(avg_delay, 1), "minutes<br>",
                     "On-time Performance:", paste0(otp_percent, "%"))
    ) %>%
    addLegend("bottomright",
              colors = viridis(3),
              labels = c("Good (< 2 min)", "Fair (2-5 min)", "Poor (> 5 min)"),
              title = "Service Reliability")
}
```

### Publication-Ready Figures

Generate high-resolution, accessible visualizations following Transportation Research Board guidelines.
```r
library(ggplot2)
library(ggpubr)  # stat_compare_means()

# Publication figure template
create_publication_figure <- function(data, title) {
  ggplot(data, aes(x = route_category, y = delay_minutes)) +
    geom_boxplot(aes(fill = intervention_status), alpha = 0.7) +
    stat_compare_means(comparisons = list(c("before", "after"))) +
    scale_fill_viridis_d(name = "Period") +
    labs(
      title = title,
      subtitle = paste("Analysis of", nrow(data), "delay incidents"),
      x = "Route Category",
      y = "Delay Duration (minutes)",
      caption = "Boxes show medians and interquartile ranges"
    ) +
    theme_minimal(base_size = 12) +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Export for publication (pass the figure object explicitly)
fig <- create_publication_figure(delay_data, "Streetcar Delay Analysis")
ggsave("delay_analysis.png", plot = fig, width = 8, height = 6, dpi = 300)
ggsave("delay_analysis.pdf", plot = fig, width = 8, height = 6)
```

### Core Package Ecosystem

Implement the analysis primarily in R using tidyverse principles, with strategic Python integration for specialized ML applications.
```r
# Core analysis packages
library(tidyverse)   # Data manipulation and visualization
library(lubridate)   # Date/time handling
library(sf)          # Spatial data analysis
library(tidymodels)  # Machine learning framework
library(fable)       # Time series modelling (ARIMA for tsibbles)
library(leaflet)     # Interactive mapping
library(shiny)       # Dashboard development
library(quarto)      # Documentation and reporting

# Specialized packages
library(tsibble)     # Time series data structure
library(feasts)      # Time series feature extraction
library(ranger)      # Fast random forests
library(fixest)      # Econometric analysis
library(broom)       # Statistical model tidying
```

### Targeted Python Usage

Use Python for specific ML implementations and GPU-accelerated computations while maintaining R as the primary analytical environment.
```r
library(reticulate)

# Configure Python environment
use_condaenv("transit_analysis")

# Import Python libraries for specific tasks
sklearn <- import("sklearn.ensemble")
xgb <- import("xgboost")
prophet <- import("prophet")

# Hybrid workflow example: fit in Python, predict, return to R
py_predictions <- sklearn$RandomForestRegressor(n_estimators = 1000L)$
  fit(r_to_py(training_features), r_to_py(training_target))$
  predict(r_to_py(test_features))

# Convert back to R for further analysis
r_predictions <- py_to_r(py_predictions)
```

### ROCm Setup for Linux Development

Given AMD's limited ML-ecosystem support, establish a dual-boot Ubuntu environment for GPU acceleration when needed.
```bash
# ROCm installation (Ubuntu 22.04); apt-key is deprecated on newer releases,
# so consult AMD's current install guide for your Ubuntu version
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dkms rocm-libs rocm-dev rocm-utils

# PyTorch with ROCm support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
```

### Performance Expectations and Alternatives

Current AMD software maturity typically means 30-50% lower ML throughput than comparable NVIDIA hardware, but the RX 7900 XTX's 24GB of VRAM is a significant memory advantage for large datasets. Consider cloud-based alternatives (AWS, GCP) for production ML workloads.
Objectives: Data familiarization, basic pattern identification, methodology establishment
Key Deliverables:
Learning Focus: R/tidyverse mastery, statistical hypothesis testing, time series basics, spatial data handling
Objectives: Causal inference, trend analysis, intervention evaluation
Key Deliverables:
Learning Focus: Causal inference techniques, time series analysis, regression diagnostics, econometric methods
Objectives: Predictive modeling, feature importance, operational optimization
Key Deliverables:
Learning Focus: ML model validation, feature engineering, model interpretation, deployment considerations
Objectives: Comprehensive reporting, stakeholder communication, policy recommendations
Key Deliverables:
Learning Focus: Technical writing, data visualization design, dashboard development, academic publication standards
Based on international evidence and Toronto's context, properly implemented interventions should achieve:
Immediate Implementation (6-12 months)
Medium-term Initiatives (1-3 years)
Long-term Strategic Development (3-5 years)
This comprehensive project plan provides both an educational pathway for developing data science expertise and a practical framework for contributing to Toronto's transit improvement efforts. The combination of rigorous methodology, evidence-based recommendations, and accessible communication ensures both academic learning value and real-world policy impact.