Azure Data Factory Practice Project: Healthcare Data Pipeline

Project Overview

Build a complete healthcare data processing pipeline that demonstrates how to orchestrate patient demographic and billing data through every stage, from ingestion to final output destinations.

Architecture Components

  • Source: Azure Blob Storage containers (simulating SFTP/external sources)
  • Processing: Azure Data Factory with Mapping Data Flows
  • Enrichment: Azure SQL Database + Azure Table Storage
  • Destinations: Multiple output containers and database tables

Setup Requirements

  1. Azure Storage Account (Standard LRS is fine; an ARM template sketch follows this list)
  2. Azure SQL Database (Basic tier sufficient)
  3. Azure Data Factory V2
  4. Azure Table Storage (part of storage account)
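
If you would rather script the setup, here is a minimal ARM template sketch. It assumes an existing resource group; the account name healthcarepipelinedemo is illustrative and must be globally unique, and only the raw-data container is shown (repeat the container resource for the others):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2022-09-01",
      "name": "healthcarepipelinedemo",
      "location": "[resourceGroup().location]",
      "sku": { "name": "Standard_LRS" },
      "kind": "StorageV2"
    },
    {
      "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
      "apiVersion": "2022-09-01",
      "name": "healthcarepipelinedemo/default/raw-data",
      "dependsOn": [
        "[resourceId('Microsoft.Storage/storageAccounts', 'healthcarepipelinedemo')]"
      ]
    }
  ]
}
```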

Step-by-Step Implementation

Phase 1: Environment Setup (30 minutes)

1.1 Create Storage Account

  • Create storage account with containers:
    • raw-data (CSV, JSON, XML files)
    • hl7-messages (HL7 sample files)
    • processed-data (transformed outputs)
    • enriched-data (final outputs)
    • error-data (failed processing)

1.2 Create Azure SQL Database

  • Basic tier database
  • Create tables:
```sql
-- Patient lookup table
CREATE TABLE PatientLookup (
    PatientID VARCHAR(50) PRIMARY KEY,
    FirstName VARCHAR(100),
    LastName VARCHAR(100),
    DOB DATE,
    InsuranceProvider VARCHAR(100)
);

-- Billing codes reference
CREATE TABLE BillingCodes (
    Code VARCHAR(20) PRIMARY KEY,
    Description VARCHAR(200),
    Category VARCHAR(50),
    UnitPrice DECIMAL(10,2)
);

-- Processed records log
CREATE TABLE ProcessingLog (
    LogID INT IDENTITY(1,1) PRIMARY KEY,
    FileName VARCHAR(200),
    RecordsProcessed INT,
    ProcessingDate DATETIME,
    Status VARCHAR(50)
);
```

1.3 Create Azure Table Storage Tables

  • InsuranceValidation - for real-time insurance verification simulation
  • AuditTrail - for tracking data lineage (a sample entity is sketched below)
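
Every Azure Table entity is addressed by a PartitionKey and a RowKey; all other properties are schemaless. The InsuranceValidation entity below is a sketch, using the insurer as partition key and the patient ID as row key (the property names beyond the two keys are my own illustration):

```json
{
  "PartitionKey": "BlueCross",
  "RowKey": "P001",
  "PolicyNumber": "BC-448812",
  "CoverageActive": true,
  "EffectiveDate": "2023-06-01"
}
```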

Phase 2: Sample Data Creation (45 minutes)

2.1 Create Sample CSV Files

patients.csv

```csv
PatientID,FirstName,LastName,DOB,Gender,Phone,Email
P001,John,Smith,1985-03-15,M,555-0101,john.smith@email.com
P002,Jane,Doe,1990-07-22,F,555-0102,jane.doe@email.com
P003,Robert,Johnson,1978-11-08,M,555-0103,robert.j@email.com
```

billing.csv

```csv
BillingID,PatientID,ServiceDate,BillingCode,Quantity,ProviderID
B001,P001,2024-01-15,99213,1,PROV001
B002,P002,2024-01-16,99214,1,PROV002
B003,P001,2024-01-18,85025,2,PROV001
```

2.2 Create Sample JSON File

appointments.json

```json
[
  {
    "appointmentId": "A001",
    "patientId": "P001",
    "providerId": "PROV001",
    "appointmentDate": "2024-02-01T09:00:00",
    "status": "scheduled",
    "serviceType": "consultation"
  },
  {
    "appointmentId": "A002",
    "patientId": "P002",
    "providerId": "PROV002",
    "appointmentDate": "2024-02-02T14:30:00",
    "status": "completed",
    "serviceType": "follow-up"
  }
]
```

2.3 Create Sample XML File

insurance_claims.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Claims>
  <Claim>
    <ClaimID>C001</ClaimID>
    <PatientID>P001</PatientID>
    <ServiceDate>2024-01-15</ServiceDate>
    <Amount>150.00</Amount>
    <InsuranceProvider>BlueCross</InsuranceProvider>
    <Status>pending</Status>
  </Claim>
  <Claim>
    <ClaimID>C002</ClaimID>
    <PatientID>P002</PatientID>
    <ServiceDate>2024-01-16</ServiceDate>
    <Amount>200.00</Amount>
    <InsuranceProvider>Aetna</InsuranceProvider>
    <Status>approved</Status>
  </Claim>
</Claims>
```

2.4 Create Sample HL7 File

hl7_messages.txt

```
MSH|^~\&|EPIC|UCSF|CERNER|CPMC|20240115103000||ADT^A08|12345|P|2.5
EVN|A08|20240115103000
PID|1||P001^^^MRN||SMITH^JOHN^||19850315|M||||||||||123456789
PV1|1|I|ICU^101^01||||DOC123^JONES^ROBERT|||SUR||||A|||DOC123
```

Phase 3: Data Factory Pipeline Development (2-3 hours)

3.1 Create Linked Services

  • Azure Blob Storage: Connect to your storage account (sample definition after this list)
  • Azure SQL Database: Connect to your database
  • Azure Table Storage: Connect for enrichment lookups
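
For reference, a blob storage linked service definition takes roughly this shape. The name ls_blob_healthcare is illustrative, and in a real factory you would keep the account key in Azure Key Vault rather than inline:

```json
{
  "name": "ls_blob_healthcare",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    }
  }
}
```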

3.2 Create Datasets

  • Source datasets: CSV, JSON, XML, HL7 text files (a sample definition follows this list)
  • Sink datasets: Processed containers, SQL tables
  • Lookup datasets: Reference tables for enrichment
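
A delimited-text source dataset pointing at patients.csv would look something like this sketch; it references the illustrative linked service named above:

```json
{
  "name": "ds_patients_csv",
  "properties": {
    "linkedServiceName": {
      "referenceName": "ls_blob_healthcare",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "raw-data",
        "fileName": "patients.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```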

3.3 Build Master Pipeline: ProcessHealthcareData

Pipeline Structure (a trimmed JSON skeleton follows the list):

  1. Validation Stage
    • Check file existence
    • Validate file formats
    • Log processing start
  2. Parallel Processing Branch
    • CSV Processing (Execute Pipeline)
    • JSON Processing (Execute Pipeline)
    • XML Processing (Execute Pipeline)
    • HL7 Processing (Execute Pipeline)
  3. Data Enrichment Stage
    • Lookup patient details from SQL
    • Validate insurance from Table Storage
    • Add billing code descriptions
  4. Final Aggregation
    • Combine processed data
    • Generate summary reports
    • Update processing logs
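
Trimmed to its first two activities, the master pipeline's JSON skeleton might read as follows; the validation details and remaining parallel branches are omitted, and the dataset name is the illustrative one from 3.2:

```json
{
  "name": "ProcessHealthcareData",
  "properties": {
    "activities": [
      {
        "name": "CheckRawFiles",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "ds_patients_csv", "type": "DatasetReference" },
          "fieldList": [ "exists", "size" ]
        }
      },
      {
        "name": "RunCSVBranch",
        "type": "ExecutePipeline",
        "dependsOn": [
          { "activity": "CheckRawFiles", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "pipeline": { "referenceName": "ProcessCSV", "type": "PipelineReference" },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```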

3.4 Create Individual Processing Pipelines

Pipeline: ProcessCSV (sketched in JSON after this list)

  • Copy Activity: Raw CSV → Staging area
  • Data Flow: Transform and validate CSV data
  • Lookup Activity: Enrich with patient data from SQL
  • Copy Activity: Write to processed container

Pipeline: ProcessJSON

  • Copy Activity: Raw JSON → Staging
  • Data Flow: Flatten JSON, transform dates
  • If Condition Activity: Route based on appointment status
  • Copy Activity: Write to appropriate destination

Pipeline: ProcessXML

  • Copy Activity: Raw XML → Staging
  • Data Flow: Parse XML, validate claim amounts
  • Lookup Activity: Insurance provider validation
  • Copy Activity: Output to claims processing container

Pipeline: ProcessHL7 (stored procedure step sketched after this list)

  • Copy Activity: Raw HL7 → Staging
  • Data Flow: Parse HL7 segments, extract patient data
  • Stored Procedure: Update patient demographics in SQL
  • Copy Activity: Archive processed HL7
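
The stored procedure step might be wired up as below. Both usp_UpsertPatientDemographics and the ls_sql_healthcare linked service are hypothetical names, and in practice the parameter values would come from upstream output rather than literals:

```json
{
  "name": "UpdatePatientDemographics",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "ls_sql_healthcare", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "usp_UpsertPatientDemographics",
    "storedProcedureParameters": {
      "PatientID": { "value": "P001", "type": "String" },
      "LastName": { "value": "SMITH", "type": "String" }
    }
  }
}
```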

Phase 4: Advanced Features Implementation (1-2 hours)

4.1 Mapping Data Flows

Create data flows for each data type:

CSV Data Flow: TransformPatientCSV (script sketch after this list)

  • Source: CSV file
  • Derived Column: Calculate age from DOB
  • Filter: Remove invalid records
  • Lookup: Join with billing codes
  • Sink: Output to processed container
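
Mapping Data Flows compile down to a script. Below is a heavily trimmed sketch of this flow's transformation chain; the sources, sinks, and dataset references that normally accompany scriptLines are omitted, and while monthsBetween and currentDate are real data flow expression functions, verify the age formula against your data:

```json
{
  "name": "TransformPatientCSV",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "scriptLines": [
        "source(output(PatientID as string, DOB as date), allowSchemaDrift: true) ~> srcPatients",
        "srcPatients derive(Age = toInteger(floor(monthsBetween(currentDate(), DOB) / 12))) ~> withAge",
        "withAge filter(!isNull(PatientID)) ~> validRows",
        "validRows sink(allowSchemaDrift: true) ~> sinkProcessed"
      ]
    }
  }
}
```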

JSON Data Flow: TransformAppointments

  • Source: JSON file
  • Flatten: Expand nested appointment data
  • Conditional Split: Route by appointment status
  • Aggregate: Count appointments by provider
  • Sink: Multiple outputs based on conditions

4.2 Triggers Setup

  • Schedule Trigger: Daily processing at 6 AM (JSON sketch after this list)
  • Storage event trigger: Process files as they arrive (fires on blob-created events)
  • Manual Trigger: For testing and ad-hoc runs
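
The daily 6 AM run translates to a schedule trigger along these lines; the start time and time zone are placeholders, so substitute your own:

```json
{
  "name": "trg_daily_6am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-03-01T06:00:00",
        "timeZone": "Pacific Standard Time",
        "schedule": { "hours": [ 6 ], "minutes": [ 0 ] }
      }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "ProcessHealthcareData", "type": "PipelineReference" }
      }
    ]
  }
}
```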

4.3 Error Handling & Monitoring

  • Failure paths: Handle processing failures with activity dependency conditions (Succeeded, Failed, Completed, Skipped); see the fragment after this list
  • Email notifications: On pipeline success/failure, via Azure Monitor alerts or a Logic App invoked from the pipeline
  • Custom logging: Track data lineage in Table Storage
  • Data validation: Implement data quality checks
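
Because Data Factory has no literal try-catch, failure handling hangs off dependency conditions: an activity runs only when its predecessor ends in one of the listed states. The fragment below routes a failed copy into a hypothetical usp_LogProcessingError stored procedure (activity inputs and outputs omitted for brevity):

```json
[
  {
    "name": "StagePatientsCSV",
    "type": "Copy",
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "sink": { "type": "DelimitedTextSink" }
    }
  },
  {
    "name": "LogFailure",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
      { "activity": "StagePatientsCSV", "dependencyConditions": [ "Failed" ] }
    ],
    "typeProperties": { "storedProcedureName": "usp_LogProcessingError" }
  }
]
```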

Phase 5: Testing & Validation (1 hour)

5.1 Unit Testing

  • Test each pipeline individually
  • Validate data transformations
  • Check error handling scenarios

5.2 Integration Testing

  • Run full master pipeline
  • Verify data end-to-end
  • Test trigger mechanisms

5.3 Performance Testing

  • Monitor pipeline execution times
  • Check resource utilization
  • Optimize bottlenecks

Key Skills Demonstrated

Technical Capabilities

  • Multi-format ingestion: CSV, JSON, XML, HL7
  • Complex orchestration: Master pipeline with sub-pipelines
  • Data transformation: Mapping Data Flows for complex logic
  • Data enrichment: SQL and Table Storage lookups
  • Error handling: Comprehensive exception management
  • Monitoring: Logging and alerting setup

Architecture Patterns

  • Batch processing: Scheduled and triggered execution
  • Data validation: Input validation and quality checks
  • Staging approach: Raw → Processed → Enriched flow
  • Parallel processing: Multiple data streams
  • Audit trail: Complete data lineage tracking

Interview Talking Points

  • Explain design decisions for each component
  • Discuss scalability considerations
  • Describe error handling strategies
  • Walk through data transformation logic
  • Demonstrate monitoring and troubleshooting approach

Estimated Time Investment

  • Setup: 30 minutes
  • Sample data creation: 45 minutes
  • Pipeline development: 2-3 hours
  • Advanced features: 1-2 hours
  • Testing: 1 hour
  • Total: approximately 5-7.5 hours

This project exercises all of the key capabilities outlined above while providing hands-on experience with Data Factory's core features.
