top of page

Cloud-Based ELT Pipeline with Snowflake via Fivetran

  • Writer: Mukesh Shirke
    Mukesh Shirke
  • Sep 1
  • 7 min read

Updated: Sep 2

World Life Expectancy Data Cleaning, Transformation, EDA & Insights


This project demonstrates the end-to-end design of an ELT pipeline using modern cloud tools. Data was ingested from Google Sheets into Snowflake using Fivetran, transformed and cleaned with SQL, and then explored through structured exploratory data analysis (EDA). The project highlights how raw, inconsistent health data can be converted into a reliable dataset to generate valuable insights about global life expectancy, GDP impact, and health disparities between developed and developing countries.


Key Skills Demonstrated:
  • Data Ingestion & Automation: Configuring Fivetran connector for Google Sheets → Snowflake.

  • SQL Data Cleaning: Handling duplicates, filling NULLs, and restructuring columns.

  • Data Transformation: Using window functions, subqueries, and conditional updates.

  • Exploratory Data Analysis (EDA): Identifying trends, correlations, and disparities.

  • Analytical Thinking: Deriving insights into socioeconomic and health indicators.


Tools used:
  • Google Sheets – Raw data source (world life expectancy dataset)

  • Fivetran – Automated ELT data pipeline setup

  • Snowflake – Cloud data warehouse for storage, transformation, and analysis

  • SQL – Data cleaning, transformation, and analysis queries



ree

Scenario:

I was tasked with building a scalable ELT workflow that could take semi-structured health data from Google Sheets, load it into Snowflake, and make it analytics-ready.

The challenges included:

  • Ingestion: Automating data movement from Google Drive to Snowflake without manual effort.

  • Data Quality: The dataset contained duplicates, missing status values (Developed/Developing), and NULL life expectancy figures.

  • Analysis: Extracting meaningful insights such as global trends in life expectancy, correlation with GDP, and differences between developed and developing nations.


ree

Step 1: Extract & Load Data (Google Sheets → Snowflake via Fivetran)

I ingested the raw world_life_expectancy dataset from Google Sheets (stored in Google Drive) into Snowflake using the Fivetran Google Sheets connector. After authenticating and configuring the connector, I ran the initial sync to load the dataset into a raw schema in Snowflake.

Once loaded, I validated the data by checking row counts, null values in key columns, and duplicate records (country + year). This ensured the data was correctly ingested and ready for cleaning and transformation.

This step established a reliable ELT pipeline:


  • Extract: Google Sheets via Fivetran

  • Load: Raw data into Snowflake

  • Transform: Cleaning & EDA performed later in Snowflake


ree

ree

ree

ree

Step 2: Data Cleaning (using SQL in Snowflake)

Now that the table is loaded in Snowflake, data cleaning can be performed.


1.      Arrange and View Table

During loading into Snowflake via Fivetran, column positions changed. The first step is to reorder them to match the original Google Sheet or the preferred layout.


ree


























Result:

ree

ree


2.      Duplicate Data - Identify and Delete

Although there is duplication in the data, it cannot be identified just by comparing rows directly because each row has a unique ROW_ID. Therefore, duplication needs to be analyzed based on specific columns rather than the row identifier.


Identify if there are any duplicate rows available or not


- Groups data by country and data_year.

- Creates a concatenated key (country + data_year) to identify duplicates.

- Counts how many times each key occurs.

- Filters results to only show rows where the count is greater than 1 (duplicates exist).


ree









Result:

ree

Identify ROW_ID in which data is duplicated


- Uses a subquery with ROW_NUMBER() to assign row numbers within each group of duplicate keys (country + data_year).

- Keeps ordering by the concatenated key.

- Filters to only show rows where Row_Num > 1 (these are duplicate rows beyond the first occurrence).

- Helps in identifying which specific ROW_IDs are duplicates.


ree









Result:

ree

Delete duplicate ROW_ID using subQuery


- Delete duplicate rows from the main table (world_life_expectancy_data).

- A subquery uses ROW_NUMBER() to rank rows within each duplicate group (country + data_year).

- Only rows with Row_Num > 1 are considered duplicates.

- Keeps the first occurrence and removes all extra duplicates.


ree











Cross-check if there are any duplicate rows left


ree








Result:

ree


3.      NULL values in Status column - Identify and Populate them

Identify if there are any NULL values in STATUS column


ree





Result:

ree

Logic:

  • Identify distinct values in the status column.

  • Map each country to its existing non-NULL status values.

  • For any country with NULL in status, fill in the value using the corresponding non-NULL entry for that country.


Check for DISTINCT entries in STATUS column


ree




Result:

ree

Check countries with 'Developing' status


ree





Result:

ree

Check countries with 'Developed' status


ree





Result:

ree

Update NULL entries in STATUS column for 'Developing' countries


- Updates rows in world_life_expectancy_data where STATUS is currently NULL.

- Uses a correlated subquery with EXISTS to check:

• If the same country already has at least one row marked as 'Developing'.

• Ensures that status in the subquery is not NULL.

- If such a row exists, the NULL status is updated to 'Developing'.

- Effectively fills missing STATUS values for countries known to be 'Developing'.


ree











Result:

ree

Update NULL entries in STATUS column for 'Developed' countries


- Updates rows in world_life_expectancy_data where STATUS is currently NULL.

- Uses a correlated subquery with EXISTS to check:

    • If the same country already has at least one row marked as 'Developed'.

    • Ensures that status in the subquery is not NULL.

- If such a row exists, the NULL status is updated to 'Developed'.

- Effectively fills missing STATUS values for countries known to be 'Developed'.


ree











Result:

ree

Cross-check if any NULL values left in STATUS column


ree





Result:

ree


4.      NULL values in Life_Expectancy column - Identify and Populate them

Identify if there are any NULL values in LIFE_EXPECTANCY column


ree





Result:

ree

Logic:

  • Identify the rows where life_expectancy is missing.

  • For each missing value, locate the previous and next year’s life_expectancy for the same country.

  • Calculate the average of the previous and next year’s values.

  • Fill the missing life_expectancy with the calculated average.


Update missing life_expectancy values by average of adjacent years


- Self-join the table:

    • t2 = previous year record

    • t3 = next year record

- Match rows based on country and year differences:

    • t2.data_year = t3.data_year - 2 (ensures a one-year gap on both sides)

    • t1.data_year = t2.data_year + 1 (target year sits between prev & next)

- Update only rows where 'life_expectancy' is NULL.

- Use ROUND(..., 1) to keep a single decimal place.


ree














Result:

ree

Cross-check if any NULL values left in LIFE_EXPECTANCY column


ree





Result:

ree

Comparing NULL entries in life_expectancy after filling them with the average of adjacent years.

Before:

ree

After:

ree

ree

Step 3: EDA - Exploratory Data Analysis (using SQL in Snowflake)

Now that the data is cleaned, we can proceed with Exploratory Data Analysis (EDA) to understand patterns, detect anomalies, test hypotheses, and summarize key insights.


1.      Find which countries improved life expectancy the most over time.

- For each country:

    • Take the minimum life expectancy (earliest year available)

    • Take the maximum life expectancy (latest year available)

    • Calculate the improvement as (MAX - MIN), rounded to 1 decimal place

- Exclude countries where life expectancy values are 0

- Finally, order countries by the biggest improvement (highest gains first)


ree











Result:

ree

Conclusion:

Developing countries such as Haiti, Zimbabwe, and Eritrea, which once had life expectancy below 50 years, showed significant progress over the past 15 years. For example, Haiti recorded the largest gain, with life expectancy rising by 28 years. In contrast, developed countries exhibited minimal improvement, as their life expectancy was already high and drastic changes were unlikely.



2.      Calculate the average life expectancy across all countries for each year

- For each year (data_year):

    • Compute the average life expectancy, rounded to 2 decimal places.

    • Exclude years where any country has life expectancy = 0 (to avoid invalid or missing data).

- Order the results chronologically by year.


ree








Result:

ree

Conclusion:

The global average life expectancy increased steadily from 66.75 years in 2007 to 71.62 years in 2022, reflecting a consistent upward trajectory over 15 years.



3.      Analyze the relationship between life expectancy and GDP

- For each country:

    • Calculate the average life expectancy across all available years (rounded to 1 decimal place).

    • Calculate the average GDP across all available years (rounded to 1 decimal place).

    • Exclude countries where either value is 0 (invalid data).


- Order the results by average GDP in descending order, so the wealthiest countries (highest GDP) appear first.


ree










Result:

ree

- Order the results by average GDP in ascending order, so the least wealthy countries (lowest GDP) appear first.


ree










Result:

ree


Compare life expectancy between high-GDP and low-GDP groups


- Classification:

• High GDP: GDP >= 1500

• Low GDP: GDP <= 1500


- For each group:

   • Count how many records (rows) fall into the group.

   • Calculate the average life expectancy for that group.


- The result shows overall counts and average life expectancy

- for high-GDP vs low-GDP countries, enabling a direct comparison.


ree







Result:

ree

Conclusion:

The analysis revealed a strong positive correlation between GDP and life expectancy. Countries with higher GDP generally enjoy longer life expectancy, while low-GDP countries lag behind.



4.      Analyze average health, financial and social indicators based on country status

Analyze average health indicators based on country status


ree










Result:

ree

Conclusion:

There is a clear disparity in health indicators between developed and developing countries, with developed nations benefitting from better healthcare infrastructure. Interestingly, polio prevalence was found to be higher in developed countries, largely due to aggressive eradication campaigns undertaken by developing nations over the past two decades.


Compare social and financial metrics across country statuses


ree








Result:

ree

Conclusion:

Developed countries consistently outperform developing countries in social and financial indicators, reflecting broader access to resources, education, and economic opportunities.


5.      Analyze average health, financial and social indicators based on country status

ree




Result:

ree

Conclusion:

Developed countries (79.2 yrs) live longer than developing ones (66.8 yrs).


6.      Rolling Adult Mortality Rate

ree







Result:

ree

Conclusion:

In the UK Life expectancy gradually increased, while adult mortality showed a steady decline over 15 years.


Limitations of the data:

This dataset does not include population figures, so mortality values cannot be normalized or compared against population size. At this stage, we can contact stakeholders and ask for more data (e.g., population, socioeconomic factors, healthcare access) to enable a more robust exploratory data analysis (EDA).


(Note: These queries represent sample EDA exercises created for portfolio purposes. The dataset allows for many more analyses, such as regional comparisons, trend forecasting, and correlations with socioeconomic indicators.)


This curated data can now be used for advanced visualization in Power BI or Tableau by connecting these tools directly to Snowflake through their built-in connectors.


ree

Conclusion:

The ELT project demonstrates how a structured pipeline can convert raw, inconsistent data into a clean, analytics-ready dataset. The final solution highlights my ability to manage the full data workflow — from automated ingestion to SQL-based transformation and insightful analysis. This project reflects my capability to deliver scalable, cloud-based data solutions that turn messy input into actionable insights.



ree

Lessons Learned:

  • Automating ingestion with Fivetran ensures repeatability and eliminates manual overhead


  • Systematic SQL cleaning (handling duplicates, NULLs, and data alignment) is essential before EDA


  • Using adjacent-year interpolation for missing life expectancy values preserved dataset continuity


  • ELT provides more flexibility than ETL in cloud environments, enabling faster iterations and leveraging compute at scale


  • A structured ELT pipeline makes exploratory and advanced analytics more reliable and scalable


ree

Author:

Mukesh Shirke

 
 
 

Comments


Mukesh_Shirke_Resume_Data_Analyst.pdf

bottom of page