From Schema to Strategy: Unlocking Healthcare Insights with SQL and Python

Mukesh Shirke
Oct 17
10 min read

Updated: Oct 22

Leveraging SQL and Python: A Comprehensive EDA and Statistical Analysis of Healthcare Performance

This project is a comprehensive, end-to-end analysis of a large healthcare dataset, comprising eight normalized tables focused on patient visits, demographics, providers, and costs. My primary goal was to transform raw operational data into actionable business intelligence. I began by designing a normalized schema in MySQL, performing rigorous data cleaning, and then executed an exhaustive Exploratory Data Analysis (EDA) using advanced SQL techniques. Finally, I moved the analysis to Python (Pandas/Seaborn) to perform statistical testing and correlation analysis, culminating in key recommendations for improving cost management and patient satisfaction.

Key Skills Demonstrated:

Database Management: Data modeling (Relational Schema), Normalization, MySQL, Primary/Foreign Key Constraint Management, Bulk Data Loading (LOAD DATA INFILE)
Data Analysis: Exploratory Data Analysis (EDA), Time Series Analysis, Conditional Aggregation (CASE), Window Functions (RANK), Subqueries, Grouped Aggregation
Statistical Analysis: Descriptive Statistics, Correlation Analysis (Heatmaps), Grouped Mean Comparison, Distribution Analysis (Box Plots)
Programming: Python (Pandas for data wrangling, Matplotlib/Seaborn for visualization), Database Connection (mysql.connector)

Tools used:

Database: MySQL (for schema design, data storage, and initial EDA)
Programming/Statistical: Python (Anaconda Spyder)
Libraries: Pandas, Seaborn, Matplotlib, mysql-connector-python

GitHub repository:

https://github.com/MukeshTheAnalyst/healthcare-sql-python-stat-model

Scenario:

A client operating a large healthcare provider tasked me with analyzing their historical patient visit data. The data, provided as eight separate CSV files, was the backbone of their operations, yet they lacked a unified view of its performance.

My boss provided the following directive: "We need to identify operational bottlenecks, cost drivers, and areas for quality improvement. Design a robust, scalable data solution and deliver concise, actionable insights into our financials, patient demographics, and staff performance."

This project required me to move beyond simple reporting to deliver true analytical depth using both relational database management and modern statistical programming.

Workflow:

I tackled this project using a strict, five-phase process, making sure my work was organized, repeatable, and easy to audit—just like in a professional environment. This setup shows how I move data step-by-step: from basic files to deep statistical modeling, and finally, to clear business advice. Each phase is a critical milestone that ensures the project is built on a solid foundation.

Phase	🎯 Objective (The "Why")	🛠️ Core Tools (The "How")	✅ Key Action (The "What")	💡 Value Proposition (The "Result")
1	Data Preparation & Schema Modeling	MySQL / Python (Pandas)	Design the Normalized SQL Schema; Perform Bulk Load; Execute necessary Data Cleaning (REPLACE).	Established the Single Source of Truth and guaranteed Data Integrity.
2	Exploratory Data Analysis (EDA)	MySQL	Execute 15 Advanced Queries utilizing Window Functions, CTEs, and Conditional Aggregation (CASE).	Created the Foundational Metrics Baseline and identified initial trends/outliers.
3	Data Integration & Environment Setup	Python (mysql.connector)	Connect securely to the database and run a single pd.read_sql() query.	Bridged the SQL-Python Gap, creating a unified, Analysis-Ready DataFrame for statistics.
4	Statistical Analysis & Visualization	Python (Pandas / Seaborn)	Conduct Correlation Analysis (Heatmap), Descriptive Statistics, and Distribution Analysis (Box Plots).	Validated Hypotheses and revealed structural relationships (e.g., Cost vs. Satisfaction).
5	Interpretation & Communication	Report / N/A	Translate technical findings into the Top 5 Business Recommendations and finalize documentation.	Converted Data into Action, providing a clear Strategic Roadmap for the executive team.

Phase 1: Data Understanding and Preparation

This phase focused on establishing a clean, structured foundation for the entire project.

1. Data Understanding

I first loaded all eight CSV files into Pandas DataFrames in Python to perform initial checks (.head(), .info()). This confirmed that visits was the central fact table, linking all other dimension tables (patients, providers, departments, etc.). I identified all Primary Keys (PKs) and Foreign Keys (FKs) and established the table creation sequence based on these dependencies.

Initial data inspection findings:

Table Name	Primary Key (PK) Confirmed	Foreign Key (FK) Identified	Key Data Type/Quality Observation
cities	City_ID (Unique ID)	None	All columns are non-null.
departments	Department_ID	None	Very small lookup table.
diagnoses	Diagnosis_ID	None	Very small lookup table.
insurance	Insurance_ID	None	Very small lookup table.
patients	Patient_ID	City_ID (references cities)	Patient_Name requires cleaning (. and _ symbols).
procedures	Procedure_ID	None	Column is named Procedure in CSV, but should be Procedure_Name in DB.
providers	Provider_ID	None	Small lookup table. Provider_Image can be VARCHAR(255).
visits	Visit_ID	All other *_ID columns	Crucial: Date_of_Visit is an object (string) and must be converted to DATE (dd/mm/yyyy format). Room_Type has missing values (expected for 'Outpatient').

2. Database Creation, Bulk Loading and Schema Mapping

I used MySQL to create the healthcare_data database and all tables in the correct sequence.

Data Type Enforcement: I specifically ensured the Date_of_Visit column in the visits table was converted from the CSV's dd/mm/yyyy string format to MySQL's proper DATE type using the STR_TO_DATE function during the bulk LOAD DATA INFILE process.
Best Practice Implementation: To maximize import speed for the large visits table, I only added the Foreign Key constraints after the data was successfully loaded.

Create Database

Create Tables (Assign Primary Key, Unique, NULL)

Import data into table from .CSV file

Define Relationship among Tables (Assign Foreign Key)

Schema Diagram

3. Data Cleaning

The initial inspection revealed inconsistent delimiters (_ and .) in the Patient_Name column of the patients table.

Cleaning Action: I executed a MySQL UPDATE query leveraging nested REPLACE functions to standardize all names by replacing _ and . with a single space.

Phase 2: Exploratory Data Analysis (EDA) in MySQL

With the data validated and cleaned, I executed 15 strategic SQL queries to generate initial business insights, demonstrating proficiency in common analytical techniques like JOINs, GROUP BY, CASE, and WINDOW FUNCTIONS.

1. Total Revenue and Volume by Service Type (CASE statement)

Calculates the total count, total revenue (Treatment_Cost + Medication_Cost), and average revenue for each Service_Type (Inpatient vs. Outpatient).

Result:

2. Top 5 Most Common Diagnoses and Average Patient Age

Identifies the top 5 most frequently occurring diagnoses and determines the average age of patients for each of those diagnoses.

Result:

3. Monthly Trend of Visits (Date Functions and Subquery)

Groups the visit data by month and year to track the volume of visits over time, which helps identify seasonal trends or growth patterns.

Result:

4. Top 3 Most Expensive Procedures

Calculates the total accumulated cost (Treatment + Medication) for each procedure and ranks them to find the most expensive ones by total spend.

Result:

5. Top Providers by Patient Satisfaction (Aggregate & Join)

Ranks providers based on their average Patient_Satisfaction score, but only includes providers who have attended to a minimum number of visits (e.g., 50) to ensure the rating is statistically reliable.

Result:

6. Insurance Coverage vs. Total Cost by Insurance Provider (CASE statement)

Analyzes the efficiency of each insurance provider by calculating the total cost, total amount covered, and the average coverage percentage.

Result:

7. Identifying High-Cost Patients (Window Function: RANK)

Uses a Window Function (RANK()) within a CTE (HighCostRankedVisits) to find the top 5 patients with the highest single-visit total cost.

Result:

8. Department Performance (Average Satisfaction & Cost)

Compares departments based on two key metrics: average patient satisfaction and average total cost.

Result:

9. Percentage of Emergency Visits by Gender (CASE and Aggregation)

Calculates the total percentage of visits that were marked as emergency for both male and female patients.

Result:

10. Most Popular Procedures for the Top Diagnosis (Subquery)

Uses a subquery to first find the single most frequent diagnosis and then finds the most popular procedure associated with that specific diagnosis.

Result:

Phase 3: Data Integration (SQL to Python) and Analysis Environment Setup

This phase focuses on the crucial transition of the structured data from the MySQL database into the Python statistical environment.

I established a secure connection to the healthcare_project database using the mysql.connector library. I then defined a comprehensive, multi-table JOIN query as a string variable (sql_query) to efficiently pull all required fields (costs, satisfaction, demographics, etc.) from the database. Finally, I utilized the pd.read_sql() function to execute this query and instantly load the entire analytical dataset into a single, clean Pandas DataFrame (df_analysis). This step ensures the data is correctly structured and ready for advanced statistical testing and visualization in the subsequent phase.

Result:

Phase 4: Statistical Analysis and Visualization in Python

I transitioned to Python using Pandas and Seaborn to perform deeper statistical analysis, moving beyond basic aggregates to explore relationships and distributions.

1. Descriptive Statistics for Key Financial and Quality Metrics

Calculates basic statistics (mean, median, standard deviation, min/max) for Total_Cost and Patient_Satisfaction to understand the data's distribution and check for outliers.

Result:

2. Correlation Matrix and Heatmap Visualization

Calculates the correlation between key numerical variables and visualizes the results using a heatmap. This easily identifies which factors (like Age, Cost, Satisfaction) are related.

Result:

Output:

3. Average Total Cost Comparison by Service Type

Uses the groupby() function to calculate the average (mean) Total_Cost for each Service_Type (Inpatient vs. Outpatient).

Result:

Output:

4. Average Total Cost Comparison by Service Type

Uses the groupby() function to calculate the average (mean) Total_Cost for each Service_Type (Inpatient vs. Outpatient).

Result:

Output:

Phase 5: Interpretation into Business Insights and Communication

5 Key Findings and Strategic Recommendations

This phase is the final deliverable, moving the project beyond technical execution into strategic communication. My goal here is to bridge the gap between the output of my SQL queries and Python models, and the necessary executive decisions. Instead of detailing the code, this section presents the Top 5 Key Findings, clearly defining the business impact of each and providing concise, actionable recommendations that management can immediately implement to control costs and improve patient quality. This shows the project's direct contribution to the organization's goals.

#	Key Business Insight	Data Source	Impact Level	Strategic Recommendation
1	Inpatient Services Dominate Revenue & Risk	SQL EDA (Query #1)	🔴High Financial Risk	Immediately implement a continuous Cost Auditing Program focused solely on Inpatient services to identify cost leakage.
2	Quality is Structurally Disconnected from Cost	Python (Correlation/Heatmap)	🟡Strategic Priority	Shift investment from simply increasing spending to Operational Standardization (e.g., improving communication protocols) to boost satisfaction.
3	Orthopedics Cost Anomaly	SQL EDA (#7), Python Outliers	🔴High Efficiency Target	Launch a specific Process Audit on the Orthopedics Department's supply chain and billing to resolve high-cost outliers.
4	Inconsistent Quality Across Departments	Python (Box Plot Analysis)	🟡Quality Improvement	Deploy a "Consistency Task Force" to target departments with high satisfaction score variance and standardize patient pathways.
5	Patient Age is the Primary Cost Driver	Python (Correlation/Heatmap)	🟢Foundational Strategy	Integrate Patient Age as the top forecasting variable in all future resource allocation and budgeting models.

Project Delivery and Documentation

This crucial final phase ensures the project is not just a one-time analysis but a reusable, verifiable solution suitable for production deployment.

Final Report: The complete write-up, including all visual artifacts (Heatmap, Bar Chart, Box Plot), was packaged as the final deliverable.
SQL Script Documentation: A single, clean, and fully commented .sql file was created, containing all schema definitions, bulk loading commands, and the 10 EDA queries, ensuring the database is fully reproducible.
Python Notebook: A commented Jupyter Notebook or Python script was saved, containing the database connection details, statistical tests, and code used to generate the key visualizations.
Metric Definitions: The two summary tables (SQL EDA and Statistical Analysis) were explicitly included as Technical Documentation to provide a clear, auditable glossary of all metrics used.

Metric Definitions (SQL EDA Summary)

Domain	EDA #	Key Metric	SQL Technique Highlighted	Immediate Business Value	Next Logical Step
Financial Health	1	Total Revenue & Volume by Service Type	GROUP BY, Aggregate Functions	Identifies the primary financial driver (Inpatient vs. Outpatient).	Focus cost-control initiatives on the higher-revenue type.
	4	Top 3 Most Expensive Procedures	ORDER BY, LIMIT	Flags highest accumulated costs for immediate cost reduction efforts.	Benchmark costs against industry standards.
	6	Insurance Coverage Rate vs. Cost	AVG, JOINs	Assesses payer efficiency and informs contract negotiation strategy.	Segment analysis by service type for better negotiation leverage.
	7	High-Cost Patient Identification	RANK() Window Function, CTE	Isolates extreme financial outliers for case management audits.	Trigger immediate clinical review of flagged patient visits.
Operational & Volume	3	Monthly Trend of Visits	Date Functions, GROUP BY	Reveals seasonal demand patterns for staffing and resource forecasting.	Use the trend data for time-series forecasting models.
Clinical Insights	2	Top 5 Most Common Diagnoses	GROUP BY, COUNT	Informs inventory management and helps standardize treatment protocols.	Correlate top diagnoses with average treatment cost.
	10	Most Popular Procedure for Top Diagnosis	Subquery (Dynamic Nested Query)	Optimizes clinical pathways by linking the most common diagnosis to its frequent procedure.	Use finding to create departmental checklists and standards of care.
Quality & Equity	5	Top Providers by Patient Satisfaction	HAVING Clause, AVG	Identifies quality champions and best practice providers for mentoring.	Study high-ranked providers' operational methods.
	8	Department Performance (Cost/Satisfaction)	GROUP BY Dual Aggregation	Creates a comparative baseline (efficiency vs. quality) for departmental strategy.	Use data to build a Performance Scatter Plot (Python).
	9	% of Emergency Visits by Gender	Conditional Aggregation (SUM(CASE))	Analyzes gender differences in seeking emergency care.	Investigate if a specific demographic has barriers to routine care.

Metric Definitions (Statistical Analysis summary)

Domain	#	Analysis Focus	Python/Library Technique	Core Insight Generated	Future Statistical Step
Data Integrity & Risk	1	Descriptive Statistics & Outliers	pd.describe(), Quartile Check	Confirms statistical health of Total_Cost and flags if averages are skewed by extreme outliers.	If outliers confirmed, apply data transformation (e.g., log scale) for modeling.
Relationship Modeling	2	Variable Correlation Analysis	pd.corr(), Seaborn Heatmap	Identifies the linear strength between variables (e.g., Age → Cost, Cost ↔ Satisfaction).	Use high-correlation variables (like Age) to build a Regression Model for cost prediction.
	3	Cost Comparison by Service Type	pd.groupby(), Bar Chart	Quantifies the exact mean difference between Inpatient and Outpatient costs.	Perform an ANOVA Test to determine if this observed difference is statistically significant.
Quality & Distribution	4	Satisfaction Consistency by Dept.	Seaborn Box Plot	Reveals the consistency and spread (variance) of satisfaction, identifying unreliable departments.	Target departments with high variance for quality standardization initiatives.

Conclusion:

This project successfully established a robust, job-ready data pipeline from raw CSVs to actionable business insights. The analysis highlighted that Inpatient services are the major financial driver, while quality of care (satisfaction) must be addressed through operational changes, as it shows a weak correlation with cost. I have provided the necessary tools and metrics to guide the organization toward immediate cost-saving and quality improvement efforts.

Lessons Learned:

Database Design Precedes Analysis: Spending time on proper schema design, key enforcement, and date-type handling in MySQL saved significant time and prevented errors during the subsequent EDA.
The Power of Conditional Aggregation: The SUM(CASE) method in SQL and the .groupby() method in Pandas are versatile tools for translating binary business questions into quantifiable metrics.
Beyond the Average: Using Python's Box Plots was crucial for moving beyond simple averages to understand the full distribution of satisfaction scores, revealing inconsistency—a key insight that an average alone would have masked.
Integrating Tools: The ability to seamlessly transition from complex joining in SQL to statistical modeling and visualization in Python is the hallmark of a complete data analyst workflow.

Author:

Mukesh Shirke

mukeshshirke12@gmail.com

Portfolio | GitHub | LinkedIn

Leveraging SQL and Python: A Comprehensive EDA and Statistical Analysis of Healthcare Performance

Key Skills Demonstrated:

Tools used:

GitHub repository:

Scenario:

Workflow:

Phase 1: Data Understanding and Preparation

1. Data Understanding

2. Database Creation, Bulk Loading and Schema Mapping

Create Database

Create Tables (Assign Primary Key, Unique, NULL)

Import data into table from .CSV file

Define Relationship among Tables (Assign Foreign Key)

Schema Diagram

3. Data Cleaning

Phase 2: Exploratory Data Analysis (EDA) in MySQL

1. Total Revenue and Volume by Service Type (CASE statement)

2. Top 5 Most Common Diagnoses and Average Patient Age

3. Monthly Trend of Visits (Date Functions and Subquery)

4. Top 3 Most Expensive Procedures

5. Top Providers by Patient Satisfaction (Aggregate & Join)

6. Insurance Coverage vs. Total Cost by Insurance Provider (CASE statement)

7. Identifying High-Cost Patients (Window Function: RANK)

8. Department Performance (Average Satisfaction & Cost)

9. Percentage of Emergency Visits by Gender (CASE and Aggregation)

10. Most Popular Procedures for the Top Diagnosis (Subquery)

Phase 3: Data Integration (SQL to Python) and Analysis Environment Setup

Phase 4: Statistical Analysis and Visualization in Python

1. Descriptive Statistics for Key Financial and Quality Metrics

2. Correlation Matrix and Heatmap Visualization

3. Average Total Cost Comparison by Service Type

4. Average Total Cost Comparison by Service Type

Phase 5: Interpretation into Business Insights and Communication

5 Key Findings and Strategic Recommendations

Project Delivery and Documentation

Conclusion:

Lessons Learned:

Author:

Comments