Dec 15, 2025 - 12 MIN READ
Building a Serverless ETL Pipeline on AWS: From Raw Data to Interactive Dashboards

How I built a complete data analytics pipeline using AWS serverless services—processing CSV files, cleaning data, enabling SQL analytics, and serving insights through interactive dashboards—all for less than $2/month.

Peter Mangoro

In today's data-driven world, businesses need efficient ways to process, analyze, and visualize their data. As part of my cloud computing coursework, I set out to build a production-ready ETL (Extract, Transform, Load) pipeline that could handle e-commerce sales data from ingestion to visualization—all using serverless AWS services.

The result? A fully automated pipeline that processes CSV files, cleans and transforms data, stores it efficiently, and serves it through interactive dashboards—all while staying within AWS's free tier for most operations.

The Challenge

The goal was ambitious: create an end-to-end data pipeline that:

  • Automatically processes incoming sales data
  • Cleans and transforms raw CSV files
  • Stores data in an optimized format for analytics
  • Enables SQL queries on the processed data
  • Provides real-time visualizations through web dashboards
  • Scales automatically without managing servers
  • Costs less than $2/month

Architecture Overview

The pipeline follows a modern serverless architecture pattern:

CSV Upload → S3 Raw → Lambda Trigger → Glue ETL → S3 Clean (Parquet) 
    → Athena → QuickSight/Nuxt Dashboard → CloudFront (HTTPS)

Key Components

Data Ingestion: CSV files are uploaded to an S3 bucket, triggering an automated workflow.

ETL Processing: AWS Glue processes the data, cleaning inconsistencies, validating fields, and transforming it into a queryable format.

Storage: Cleaned data is stored in Parquet format (a columnar storage format) partitioned by date for efficient querying.

Analytics: Amazon Athena enables SQL queries directly on the Parquet files without loading data into a database.

Visualization: Two interfaces provide insights:

  • A custom Nuxt 4 dashboard with interactive charts
  • Amazon QuickSight dashboards for business intelligence

Delivery: CloudFront CDN serves the dashboard globally with HTTPS encryption.

The Journey: 8 Phases of Development

Phase 1: Infrastructure Setup

I started by creating the foundational AWS resources:

  • 5 S3 buckets for different data layers (raw, clean, dashboard, scripts, results)
  • IAM roles with least-privilege permissions
  • Bucket policies and CORS configurations

This phase taught me the importance of proper infrastructure planning and security from the start.

Phase 2: Building the ETL Pipeline

The heart of the system is the ETL process. I built:

Lambda Functions:

  • sales-data-generator: Generates sample CSV data with intentional quality issues for testing
  • sales-glue-trigger: Automatically starts ETL jobs when new files arrive
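The trigger function itself is short. Here's a minimal sketch of the idea (the job name and argument key are placeholders, not the project's exact values):

```python
GLUE_JOB_NAME = "sales-etl-job"  # hypothetical job name

def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

def lambda_handler(event, context):
    import boto3  # imported lazily so the parsing logic is testable without AWS
    glue = boto3.client("glue")
    started = []
    for bucket, key in extract_s3_objects(event):
        if not key.endswith(".csv"):
            continue  # ignore non-CSV objects
        run = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        started.append(run["JobRunId"])
    return {"started_runs": started}
```

Because the Lambda only forwards the S3 location to Glue, it stays well within the free tier even under heavy upload traffic.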

AWS Glue ETL Job: A PySpark script that:

  • Reads CSV files from the raw S3 bucket
  • Removes records with missing critical fields
  • Standardizes text fields (trimming, case normalization)
  • Fixes missing prices and amounts
  • Validates and filters invalid dates
  • Extracts date components for partitioning
  • Calculates derived fields (discount amounts, net amounts)
  • Writes cleaned data as Parquet files partitioned by year/month/day

Key Learning: The ETL job uses Glue job bookmarks to process only new data incrementally, avoiding reprocessing of existing files.
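The real job runs these transforms as PySpark DataFrame operations, but the per-record logic can be sketched in plain Python (column names and the discount formula here are illustrative, not the project's exact schema):

```python
from datetime import datetime

REQUIRED = ("order_id", "product", "quantity")  # assumed critical fields

def clean_row(row):
    """Simplified, row-level version of the Glue transforms.
    Returns a cleaned dict, or None if the record should be dropped."""
    # Remove records with missing critical fields
    if any(not row.get(f) for f in REQUIRED):
        return None
    # Standardize text fields (trim, case normalization, defaults)
    row["region"] = (row.get("region") or "UNKNOWN").strip().upper()
    row["status"] = (row.get("status") or "unknown").strip().lower()
    row["customer_name"] = (row.get("customer_name") or "Unknown").strip()
    # Fix a missing unit price from the total amount when possible
    qty = float(row["quantity"])
    if not row.get("unit_price") and row.get("total_amount"):
        row["unit_price"] = float(row["total_amount"]) / qty
    # Validate and filter invalid dates
    try:
        dt = datetime.strptime(row["order_date"], "%Y-%m-%d")
    except (KeyError, TypeError, ValueError):
        return None
    # Extract date components for partitioning
    row["year"], row["month"], row["day"] = dt.year, dt.month, dt.day
    # Calculate derived fields
    discount = float(row.get("discount", 0) or 0)
    gross = float(row["unit_price"]) * qty
    row["discount_amount"] = round(gross * discount, 2)
    row["net_amount"] = round(gross - row["discount_amount"], 2)
    return row
```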

Phase 3: Enabling SQL Analytics with Athena

I configured Amazon Athena to query the cleaned Parquet files:

  • Created a Glue Data Catalog database (sales_analytics)
  • Defined a table schema with partition projection (eliminating the need for manual partition management)
  • Wrote 12+ analytics queries covering:
    • Total revenue by month
    • Top products by revenue
    • Regional sales distribution
    • Category performance
    • Order status analysis

Performance Optimization: By partitioning data by date and using partition projection, Athena scans only the relevant partitions, reducing query time and cost.
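Partition projection is configured entirely in the table's DDL. A sketch of what such a definition can look like (bucket name, columns, and the year range are illustrative; the `sales_analytics` database matches the project):

```python
# Athena DDL with partition projection. With projection.* properties set,
# Athena computes partition locations from the template instead of
# consulting the Glue catalog, so no MSCK REPAIR TABLE is ever needed.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.sales (
    order_id   string,
    product    string,
    region     string,
    net_amount double
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://my-clean-bucket/clean/sales/'
TBLPROPERTIES (
    'projection.enabled'     = 'true',
    'projection.year.type'   = 'integer',
    'projection.year.range'  = '2024,2030',
    'projection.month.type'  = 'integer',
    'projection.month.range' = '1,12',
    'projection.day.type'    = 'integer',
    'projection.day.range'   = '1,31',
    'storage.location.template' =
        's3://my-clean-bucket/clean/sales/year=${year}/month=${month}/day=${day}'
)
"""
```

A query that filters on `year`, `month`, and `day` then touches only the matching S3 prefixes, which is what keeps the scanned-data volume (and the Athena bill) small.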

Phase 4: Building the Frontend Dashboard

I developed a modern web dashboard using:

  • Nuxt 4 with Vue 3 (static site generation)
  • Nuxt UI (Tailwind CSS-based component library)
  • Chart.js for data visualizations

The dashboard features:

  • Real-time analytics cards showing key metrics
  • Interactive charts (revenue trends, top products, regional sales, category performance)
  • Dark theme support
  • Auto-refresh every 5 minutes
  • Responsive design for mobile and desktop

Components Built:

  • StatsCard.vue - Key metrics display
  • RevenueChart.vue - Revenue trends over time
  • ProductChart.vue - Top products bar chart
  • RegionChart.vue - Regional sales doughnut chart
  • CategoryChart.vue - Category performance analysis

Phase 5: CloudFront Deployment

To make the dashboard production-ready, I:

  • Built the Nuxt application as a static site
  • Uploaded it to an S3 bucket configured for static website hosting
  • Created a CloudFront distribution with:
    • HTTPS-only access (HTTP redirects to HTTPS)
    • Global CDN for low latency
    • Custom error pages for SPA routing
    • Compression enabled
    • Optimized caching headers

Result: The dashboard is now accessible at https://d1du5nvx9lhzdt.cloudfront.net with global edge caching.
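The "optimized caching headers" step boils down to choosing a `Cache-Control` value per file at upload time. One possible policy, assuming Nuxt's default content-hashed `_nuxt/` asset directory:

```python
def cache_control_for(path):
    """Pick a Cache-Control header for a static asset.
    Hashed build assets are safe to cache forever (their names change on
    every deploy); HTML must revalidate so new deploys show up immediately."""
    if path.endswith(".html"):
        return "no-cache"
    if path.startswith("_nuxt/"):  # Nuxt emits content-hashed assets here
        return "public, max-age=31536000, immutable"
    return "public, max-age=3600"
```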

Phase 6: QuickSight Integration

I integrated Amazon QuickSight for advanced business intelligence:

  • Connected QuickSight to the Athena data source
  • Created datasets with SPICE (in-memory caching)
  • Configured automatic daily data refresh
  • Built interactive dashboards with filters and drill-downs

QuickSight provides a more powerful visualization tool for stakeholders who need deeper analytics capabilities.

Phase 7: API Gateway for Real-Time Data

To connect the frontend to live data, I created:

  • API Gateway REST API endpoint (/analytics)
  • Lambda function (analytics-api) that:
    • Executes parallel Athena queries
    • Aggregates revenue, product, region, and category data
    • Returns JSON responses for the frontend
    • Handles CORS headers

This enables the Nuxt dashboard to fetch real-time analytics data instead of using mock data.
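The response-shaping half of that Lambda can be sketched independently of the Athena calls (field names here are assumptions about what the dashboard consumes, not the project's exact contract):

```python
import json

CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",  # tighten to the CloudFront domain in production
    "Access-Control-Allow-Methods": "GET,OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
}

def build_response(revenue, products, regions, categories):
    """Shape aggregated query results into the JSON the dashboard expects.
    In the real function these arguments come from parallel Athena queries."""
    body = {
        "revenue_by_month": revenue,
        "top_products": products,
        "sales_by_region": regions,
        "category_performance": categories,
    }
    return {
        "statusCode": 200,
        "headers": {**CORS_HEADERS, "Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

Returning the CORS headers from the Lambda itself (in addition to the API Gateway configuration) is what lets the browser accept the cross-origin response.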

Phase 8: Automation Scripts

I developed automation scripts for:

  • QuickSight setup (data sources, datasets, refresh schedules)
  • Complete pipeline testing
  • Frontend deployment
  • Resource cleanup

Technical Highlights

Data Transformation

The ETL process handles common data quality issues:

  • Missing Values: Defaults missing customer names to "Unknown", calculates missing prices
  • Invalid Dates: Filters out invalid dates and standardizes date formats
  • Inconsistent Formatting: Normalizes text fields (uppercase regions, lowercase status)
  • Data Validation: Removes records with missing critical fields

Storage Optimization

Parquet Format: Columnar storage reduces file size by ~75% compared to CSV while enabling efficient column-based queries.

Partitioning: Data is partitioned by year/month/day, allowing Athena to scan only relevant partitions:

clean/sales/year=2025/month=12/day=15/data.parquet

Result: Faster queries and lower costs (Athena charges per TB scanned).
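The partition layout above is just a deterministic key prefix, which is what makes projection possible. A tiny helper illustrating the convention:

```python
from datetime import date

def partition_prefix(base, d):
    """Build the S3 key prefix for one day's partition (Hive-style layout)."""
    return f"{base}/year={d.year}/month={d.month}/day={d.day}"
```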

Serverless Benefits

  • No Infrastructure Management: All services are fully managed
  • Automatic Scaling: Handles traffic spikes without configuration
  • Cost-Effective: Pay only for what you use
  • High Availability: Built-in redundancy and failover

Cost Analysis

One of the project's goals was cost optimization. Here's the breakdown:

| Service     | Monthly Cost | Notes                            |
| ----------- | ------------ | -------------------------------- |
| S3 Storage  | $0           | Within free tier (5 GB)          |
| Lambda      | $0           | Within free tier (1M requests)   |
| Glue        | ~$1-2        | ~$0.03 per job run               |
| Athena      | $0           | Within free tier (10 GB scanned) |
| CloudFront  | $0           | Within free tier (1 TB transfer) |
| API Gateway | $0           | Within free tier (1M calls)      |
| QuickSight  | $0           | Standard edition (free)          |
| Total       | ~$1-2/month  | Mostly Glue job runs             |

Challenges and Solutions

Challenge 1: SPA Routing with CloudFront

Problem: CloudFront returned 404 errors for client-side routes.

Solution: Configured CloudFront error pages to redirect 404s to index.html, allowing Vue Router to handle routing.
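In boto3/API terms, that fix is a `CustomErrorResponses` entry on the distribution config, roughly like this (a fragment, not a full distribution definition):

```python
# CloudFront DistributionConfig fragment: serve index.html with a 200
# status for any 404, so Vue Router resolves the route client-side.
CUSTOM_ERROR_RESPONSES = {
    "Quantity": 1,
    "Items": [
        {
            "ErrorCode": 404,
            "ResponsePagePath": "/index.html",
            "ResponseCode": "200",
            "ErrorCachingMinTTL": 0,  # don't cache the error mapping
        }
    ],
}
```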

Challenge 2: Partition Management

Problem: Traditional Athena tables require running MSCK REPAIR TABLE after adding partitions.

Solution: Used partition projection, which automatically discovers partitions based on a pattern, eliminating manual maintenance.

Challenge 3: CORS Configuration

Problem: Frontend couldn't access API Gateway due to CORS errors.

Solution: Configured CORS headers in both API Gateway and Lambda responses.

Challenge 4: Glue Job Performance

Problem: Initial Glue jobs were slow and expensive.

Solution: Optimized by:

  • Using job bookmarks for incremental processing
  • Partitioning output data
  • Using appropriate worker types (G.1X for this workload)

Key Learnings

  1. Serverless Architecture: Building with serverless services eliminates infrastructure management overhead and enables automatic scaling.
  2. Data Format Matters: Parquet format provides significant storage and query performance benefits over CSV for analytics workloads.
  3. Partitioning Strategy: Proper partitioning is crucial for cost-effective analytics at scale.
  4. Event-Driven Design: Using S3 event notifications to trigger Lambda functions creates a truly automated pipeline.
  5. Security First: IAM roles with least-privilege access are essential for production systems.
  6. Cost Optimization: AWS free tiers make it possible to build and test production-like systems at minimal cost.

Results

  • Complete ETL Pipeline: Automated data processing from ingestion to analytics
  • 7 AWS Services Integrated: S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway
  • Production-Ready Dashboard: HTTPS-enabled, globally distributed
  • Cost-Effective: ~$1-2/month operational cost
  • Scalable: Handles growth automatically
  • Well-Documented: Comprehensive documentation for all phases

Live Demo

🌐 Dashboard: https://d1du5nvx9lhzdt.cloudfront.net

The dashboard is live and accessible via CloudFront CDN with HTTPS encryption.

Future Enhancements

Potential improvements for production use:

  • Real-Time Processing: Integrate Amazon Kinesis for streaming data
  • Machine Learning: Use SageMaker for sales predictions
  • Alerting: SNS notifications for data quality issues
  • Multi-Region: Cross-region replication for disaster recovery
  • Data Lake: Expand to handle multiple data sources
  • Advanced Analytics: Add time-series analysis and forecasting

Conclusion

This project demonstrated the power of AWS serverless services to build a complete data analytics pipeline. By leveraging managed services, I was able to focus on business logic rather than infrastructure management, resulting in a scalable, cost-effective solution.

The architecture pattern used here—S3 for storage, Lambda for event processing, Glue for ETL, Athena for analytics, and CloudFront for delivery—can be adapted for various data analytics use cases, from IoT sensor data to financial transactions.

Whether you're a data engineer, developer, or business analyst, understanding how these services work together opens up powerful possibilities for building modern data solutions.


Technical Stack

  • Cloud: AWS (S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway)
  • Frontend: Nuxt 4, Vue 3, Nuxt UI, Chart.js
  • ETL: AWS Glue (PySpark)
  • Analytics: Amazon Athena (SQL)
  • Visualization: QuickSight, Chart.js
  • Infrastructure: IAM, CloudWatch

Project Statistics

  • Total Files Created: 50+
  • AWS Resources: 15+
  • Lines of Code: 2000+
  • Documentation Pages: 10+
  • Queries Created: 12+
  • Components Built: 6+
  • Phases Completed: 8/8

This project was completed as part of a cloud computing course in December 2025. All code and documentation are available in the project repository.
