Build a complete healthcare data processing pipeline that demonstrates orchestrating patient demographic and billing data through various stages - from ingestion to final output destinations.
raw-data (CSV, JSON, XML files)hl7-messages (HL7 sample files)processed-data (transformed outputs)enriched-data (final outputs)error-data (failed processing)-- Patient lookup table
CREATE TABLE PatientLookup (
PatientID VARCHAR(50) PRIMARY KEY,
FirstName VARCHAR(100),
LastName VARCHAR(100),
DOB DATE,
InsuranceProvider VARCHAR(100)
);
-- Billing codes reference
CREATE TABLE BillingCodes (
Code VARCHAR(20) PRIMARY KEY,
Description VARCHAR(200),
Category VARCHAR(50),
UnitPrice DECIMAL(10,2)
);
-- Processed records log
CREATE TABLE ProcessingLog (
LogID INT IDENTITY(1,1) PRIMARY KEY,
FileName VARCHAR(200),
RecordsProcessed INT,
ProcessingDate DATETIME,
Status VARCHAR(50)
);InsuranceValidation - for real-time insurance verification simulationAuditTrail - for tracking data lineagepatients.csv
PatientID,FirstName,LastName,DOB,Gender,Phone,Email
P001,John,Smith,1985-03-15,M,555-0101,john.smith@email.com
P002,Jane,Doe,1990-07-22,F,555-0102,jane.doe@email.com
P003,Robert,Johnson,1978-11-08,M,555-0103,robert.j@email.combilling.csv
BillingID,PatientID,ServiceDate,BillingCode,Quantity,ProviderID
B001,P001,2024-01-15,99213,1,PROV001
B002,P002,2024-01-16,99214,1,PROV002
B003,P001,2024-01-18,85025,2,PROV001appointments.json
[
{
"appointmentId": "A001",
"patientId": "P001",
"providerId": "PROV001",
"appointmentDate": "2024-02-01T09:00:00",
"status": "scheduled",
"serviceType": "consultation"
},
{
"appointmentId": "A002",
"patientId": "P002",
"providerId": "PROV002",
"appointmentDate": "2024-02-02T14:30:00",
"status": "completed",
"serviceType": "follow-up"
}
]insurance_claims.xml
<?xml version="1.0" encoding="UTF-8"?>
<Claims>
<Claim>
<ClaimID>C001</ClaimID>
<PatientID>P001</PatientID>
<ServiceDate>2024-01-15</ServiceDate>
<Amount>150.00</Amount>
<InsuranceProvider>BlueCross</InsuranceProvider>
<Status>pending</Status>
</Claim>
<Claim>
<ClaimID>C002</ClaimID>
<PatientID>P002</PatientID>
<ServiceDate>2024-01-16</ServiceDate>
<Amount>200.00</Amount>
<InsuranceProvider>Aetna</InsuranceProvider>
<Status>approved</Status>
</Claim>
</Claims>hl7_messages.txt
MSH|^~\&|EPIC|UCSF|CERNER|CPMC|20240115103000||ADT^A08|12345|P|2.5
EVN|A08|20240115103000
PID|1||P001^^^MRN||SMITH^JOHN^||19850315|M||||||||||123456789
PV1|1|I|ICU^101^01||||DOC123^JONES^ROBERT|||SUR||||A|||DOC123ProcessHealthcareDataPipeline Structure:
Pipeline: ProcessCSV
Pipeline: ProcessJSON
Pipeline: ProcessXML
Pipeline: ProcessHL7
Create data flows for each data type:
CSV Data Flow: TransformPatientCSV
JSON Data Flow: TransformAppointments
This project covers all the key aspects mentioned in your description while providing hands-on experience with Data Factory's core capabilities.