Azure Data Factory Practice Project: Healthcare Data Pipeline

Project Overview

Build a complete healthcare data processing pipeline that demonstrates how to orchestrate patient demographic and billing data through every stage, from ingestion to final output destinations.

Architecture Components

  • Source: Azure Blob Storage containers (simulating SFTP/external sources)
  • Processing: Azure Data Factory with Mapping Data Flows
  • Enrichment: Azure SQL Database + Azure Table Storage
  • Destinations: Multiple output containers and database tables

Setup Requirements

  1. Azure Storage Account (Standard LRS is fine; an ARM template sketch follows this list)
  2. Azure SQL Database (Basic tier sufficient)
  3. Azure Data Factory V2
  4. Azure Table Storage (part of storage account)
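
If you would rather script the setup, here is a minimal ARM template sketch. It assumes an existing resource group; the account name healthcarepipelinedemo is illustrative and must be globally unique, and only the raw-data container is shown (repeat the container resource for the others):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2022-09-01",
      "name": "healthcarepipelinedemo",
      "location": "[resourceGroup().location]",
      "sku": { "name": "Standard_LRS" },
      "kind": "StorageV2"
    },
    {
      "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
      "apiVersion": "2022-09-01",
      "name": "healthcarepipelinedemo/default/raw-data",
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', 'healthcarepipelinedemo')]"
      ]
    }
  ]
}
```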

Step-by-Step Implementation

Phase 1: Environment Setup (30 minutes)

1.1 Create Storage Account

  • Create storage account with containers:
    • raw-data (CSV, JSON, XML files)
    • hl7-messages (HL7 sample files)
    • processed-data (transformed outputs)
    • enriched-data (final outputs)
    • error-data (failed processing)

1.2 Create Azure SQL Database

  • Basic tier database
  • Create tables:
```sql
-- Patient lookup table
CREATE TABLE PatientLookup (
    PatientID VARCHAR(50) PRIMARY KEY,
    FirstName VARCHAR(100),
    LastName VARCHAR(100),
    DOB DATE,
    InsuranceProvider VARCHAR(100)
);

-- Billing codes reference
CREATE TABLE BillingCodes (
    Code VARCHAR(20) PRIMARY KEY,
    Description VARCHAR(200),
    Category VARCHAR(50),
    UnitPrice DECIMAL(10,2)
);

-- Processed records log
CREATE TABLE ProcessingLog (
    LogID INT IDENTITY(1,1) PRIMARY KEY,
    FileName VARCHAR(200),
    RecordsProcessed INT,
    ProcessingDate DATETIME,
    Status VARCHAR(50)
);
```

1.3 Create Azure Table Storage Tables

  • InsuranceValidation - for real-time insurance verification simulation
  • AuditTrail - for tracking data lineage (a sample entity is sketched below)
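
Every Azure Table entity is addressed by a PartitionKey and a RowKey; all other properties are schemaless. The InsuranceValidation entity below is a sketch, using the insurer as partition key and the patient ID as row key (the property names beyond the two keys are my own illustration):

```json
{
  "PartitionKey": "BlueCross",
  "RowKey": "P001",
  "PolicyNumber": "BC-448812",
  "CoverageActive": true,
  "EffectiveDate": "2023-06-01"
}
```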

Phase 2: Sample Data Creation (45 minutes)

2.1 Create Sample CSV Files

patients.csv

```csv
PatientID,FirstName,LastName,DOB,Gender,Phone,Email
P001,John,Smith,1985-03-15,M,555-0101,john.smith@email.com
P002,Jane,Doe,1990-07-22,F,555-0102,jane.doe@email.com
P003,Robert,Johnson,1978-11-08,M,555-0103,robert.j@email.com
```

billing.csv

```csv
BillingID,PatientID,ServiceDate,BillingCode,Quantity,ProviderID
B001,P001,2024-01-15,99213,1,PROV001
B002,P002,2024-01-16,99214,1,PROV002
B003,P001,2024-01-18,85025,2,PROV001
```

2.2 Create Sample JSON File

appointments.json

```json
[
  {
    "appointmentId": "A001",
    "patientId": "P001",
    "providerId": "PROV001",
    "appointmentDate": "2024-02-01T09:00:00",
    "status": "scheduled",
    "serviceType": "consultation"
  },
  {
    "appointmentId": "A002",
    "patientId": "P002",
    "providerId": "PROV002",
    "appointmentDate": "2024-02-02T14:30:00",
    "status": "completed",
    "serviceType": "follow-up"
  }
]
```

2.3 Create Sample XML File

insurance_claims.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Claims>
  <Claim>
    <ClaimID>C001</ClaimID>
    <PatientID>P001</PatientID>
    <ServiceDate>2024-01-15</ServiceDate>
    <Amount>150.00</Amount>
    <InsuranceProvider>BlueCross</InsuranceProvider>
    <Status>pending</Status>
  </Claim>
  <Claim>
    <ClaimID>C002</ClaimID>
    <PatientID>P002</PatientID>
    <ServiceDate>2024-01-16</ServiceDate>
    <Amount>200.00</Amount>
    <InsuranceProvider>Aetna</InsuranceProvider>
    <Status>approved</Status>
  </Claim>
</Claims>
```

2.4 Create Sample HL7 File

hl7_messages.txt

```
MSH|^~\&|EPIC|UCSF|CERNER|CPMC|20240115103000||ADT^A08|12345|P|2.5
EVN|A08|20240115103000
PID|1||P001^^^MRN||SMITH^JOHN^||19850315|M||||||||||123456789
PV1|1|I|ICU^101^01||||DOC123^JONES^ROBERT|||SUR||||A|||DOC123
```

Phase 3: Data Factory Pipeline Development (2-3 hours)

3.1 Create Linked Services

  • Azure Blob Storage: Connect to your storage account (sample definition after this list)
  • Azure SQL Database: Connect to your database
  • Azure Table Storage: Connect for enrichment lookups
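
For reference, a blob storage linked service definition takes roughly this shape. The name ls_blob_healthcare is illustrative, and in a real factory you would keep the account key in Azure Key Vault rather than inline:

```json
{
  "name": "ls_blob_healthcare",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    }
  }
}
```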

3.2 Create Datasets

  • Source datasets: CSV, JSON, XML, HL7 text files (a sample definition follows this list)
  • Sink datasets: Processed containers, SQL tables
  • Lookup datasets: Reference tables for enrichment
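
A delimited-text source dataset pointing at patients.csv would look something like this sketch; it references the illustrative linked service named above:

```json
{
  "name": "ds_patients_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "ls_blob_healthcare",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw-data",
        "fileName": "patients.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```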

3.3 Build Master Pipeline: ProcessHealthcareData

Pipeline Structure (a trimmed JSON skeleton follows the list):

  1. Validation Stage
    • Check file existence
    • Validate file formats
    • Log processing start
  2. Parallel Processing Branch
    • CSV Processing (Execute Pipeline)
    • JSON Processing (Execute Pipeline)
    • XML Processing (Execute Pipeline)
    • HL7 Processing (Execute Pipeline)
  3. Data Enrichment Stage
    • Lookup patient details from SQL
    • Validate insurance from Table Storage
    • Add billing code descriptions
  4. Final Aggregation
    • Combine processed data
    • Generate summary reports
    • Update processing logs
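
Trimmed to its first two activities, the master pipeline's JSON skeleton might read as follows; the validation details and remaining parallel branches are omitted, and the dataset name is the illustrative one from 3.2:

```json
{
  "name": "ProcessHealthcareData",
  "properties": {
    "activities": [
      {
        "name": "CheckRawFiles",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "ds_patients_csv", "type": "DatasetReference" },
          "fieldList": [ "exists", "size" ]
        }
      },
      {
        "name": "RunCSVBranch",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "CheckRawFiles", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "ProcessCSV", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```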

3.4 Create Individual Processing Pipelines

Pipeline: ProcessCSV (sketched in JSON after this list)

  • Copy Activity: Raw CSV → Staging area
  • Data Flow: Transform and validate CSV data
  • Lookup Activity: Enrich with patient data from SQL
  • Copy Activity: Write to processed container

Pipeline: ProcessJSON

  • Copy Activity: Raw JSON → Staging
  • Data Flow: Flatten JSON, transform dates
  • If Condition Activity: Route based on appointment status
  • Copy Activity: Write to appropriate destination

Pipeline: ProcessXML

  • Copy Activity: Raw XML → Staging
  • Data Flow: Parse XML, validate claim amounts
  • Lookup Activity: Insurance provider validation
  • Copy Activity: Output to claims processing container

Pipeline: ProcessHL7 (stored procedure step sketched after this list)

  • Copy Activity: Raw HL7 → Staging
  • Data Flow: Parse HL7 segments, extract patient data
  • Stored Procedure: Update patient demographics in SQL
  • Copy Activity: Archive processed HL7
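
The stored procedure step might be wired up as below. Both usp_UpsertPatientDemographics and the ls_sql_healthcare linked service are hypothetical names, and in practice the parameter values would come from upstream output rather than literals:

```json
{
  "name": "UpdatePatientDemographics",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "ls_sql_healthcare", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "usp_UpsertPatientDemographics",
    "storedProcedureParameters": {
      "PatientID": { "value": "P001", "type": "String" },
      "LastName": { "value": "SMITH", "type": "String" }
    }
  }
}
```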

Phase 4: Advanced Features Implementation (1-2 hours)

4.1 Mapping Data Flows

Create data flows for each data type:

CSV Data Flow: TransformPatientCSV (script sketch after this list)

  • Source: CSV file
  • Derived Column: Calculate age from DOB
  • Filter: Remove invalid records
  • Lookup: Join with billing codes
  • Sink: Output to processed container
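
Mapping Data Flows compile down to a script. Below is a heavily trimmed sketch of this flow's transformation chain; the sources, sinks, and dataset references that normally accompany scriptLines are omitted, and while monthsBetween and currentDate are real data flow expression functions, verify the age formula against your data:

```json
{
  "name": "TransformPatientCSV",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "scriptLines": [
        "source(output(PatientID as string, DOB as date), allowSchemaDrift: true) ~> srcPatients",
        "srcPatients derive(Age = toInteger(floor(monthsBetween(currentDate(), DOB) / 12))) ~> withAge",
        "withAge filter(!isNull(PatientID)) ~> validRows",
        "validRows sink(allowSchemaDrift: true) ~> sinkProcessed"
      ]
    }
  }
}
```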

JSON Data Flow: TransformAppointments

  • Source: JSON file
  • Flatten: Expand nested appointment data
  • Conditional Split: Route by appointment status
  • Aggregate: Count appointments by provider
  • Sink: Multiple outputs based on conditions

4.2 Triggers Setup

  • Schedule Trigger: Daily processing at 6 AM (JSON sketch after this list)
  • Storage event trigger: Process files as they arrive (fires on blob-created events)
  • Manual Trigger: For testing and ad-hoc runs
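
The daily 6 AM run translates to a schedule trigger along these lines; the start time and time zone are placeholders, so substitute your own:

```json
{
  "name": "trg_daily_6am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-03-01T06:00:00",
        "timeZone": "Pacific Standard Time",
        "schedule": { "hours": [ 6 ], "minutes": [ 0 ] }
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "ProcessHealthcareData", "type": "PipelineReference" }
      }
    ]
  }
}
```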

4.3 Error Handling & Monitoring

  • Failure paths: Handle processing failures with activity dependency conditions (Succeeded, Failed, Completed, Skipped); see the fragment after this list
  • Email notifications: On pipeline success/failure, via Azure Monitor alerts or a Logic App invoked from the pipeline
  • Custom logging: Track data lineage in Table Storage
  • Data validation: Implement data quality checks
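
Because Data Factory has no literal try-catch, failure handling hangs off dependency conditions: an activity runs only when its predecessor ends in one of the listed states. The fragment below routes a failed copy into a hypothetical usp_LogProcessingError stored procedure (activity inputs and outputs omitted for brevity):

```json
[
  {
    "name": "StagePatientsCSV",
    "type": "Copy",
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "sink": { "type": "DelimitedTextSink" }
    }
  },
  {
    "name": "LogFailure",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
      { "activity": "StagePatientsCSV", "dependencyConditions": [ "Failed" ] }
    ],
    "typeProperties": { "storedProcedureName": "usp_LogProcessingError" }
  }
]
```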

Phase 5: Testing & Validation (1 hour)

5.1 Unit Testing

  • Test each pipeline individually
  • Validate data transformations
  • Check error handling scenarios

5.2 Integration Testing

  • Run full master pipeline
  • Verify data end-to-end
  • Test trigger mechanisms

5.3 Performance Testing

  • Monitor pipeline execution times
  • Check resource utilization
  • Optimize bottlenecks

Key Skills Demonstrated

Technical Capabilities

  • Multi-format ingestion: CSV, JSON, XML, HL7
  • Complex orchestration: Master pipeline with sub-pipelines
  • Data transformation: Mapping Data Flows for complex logic
  • Data enrichment: SQL and Table Storage lookups
  • Error handling: Comprehensive exception management
  • Monitoring: Logging and alerting setup

Architecture Patterns

  • Batch processing: Scheduled and triggered execution
  • Data validation: Input validation and quality checks
  • Staging approach: Raw → Processed → Enriched flow
  • Parallel processing: Multiple data streams
  • Audit trail: Complete data lineage tracking

Interview Talking Points

  • Explain design decisions for each component
  • Discuss scalability considerations
  • Describe error handling strategies
  • Walk through data transformation logic
  • Demonstrate monitoring and troubleshooting approach

Estimated Time Investment

  • Setup: 30 minutes
  • Sample data creation: 45 minutes
  • Pipeline development: 2-3 hours
  • Advanced features: 1-2 hours
  • Testing: 1 hour
  • Total: approximately 5-7.5 hours

This project exercises all of the key capabilities outlined above while providing hands-on experience with Data Factory's core features.
