Serverless Solution to Automate S3-to-Tableau Data Pipelines

In today’s fast-paced digital landscape, building real-time dashboards for data-driven decision-making is more critical than ever. As data trends evolve rapidly, delays in updating dashboards can lead to missed opportunities and hinder timely insights. Tableau, a widely adopted analytics and dashboarding tool, offers native connectors to various cloud platforms, making it well-suited for real-time or near real-time data ingestion.
Today, we’ll explore a common scenario: application data arrives daily as CSV files, which are stored in date-based subfolders within an Amazon S3 bucket. As days progress, additional CSV files continue to accumulate in new folders. However, Tableau supports connecting only to a single CSV file within an S3 bucket, which poses a challenge:
How can we ensure Tableau accesses up-to-date data without manual consolidation?
To address this, we need a solution that automatically merges daily CSV data into a single, continually updated file or provides a structured way for Tableau to access and query this data without manual intervention.
This article outlines two fully serverless AWS-based solutions to automate the S3-to-Tableau data pipeline — ensuring Tableau receives consistent data updates and dashboards remain synchronized with the latest information.

Proposed Solutions:

  1. AWS Lambda + Master CSV

    A lightweight Lambda function merges incoming daily CSVs into a single rolling master.csv file that Tableau can directly import.
  2. AWS Glue Crawler + AWS Glue Data Catalog + Athena

    A managed AWS Glue Crawler automatically discovers new files and schemas, while Athena enables SQL-based querying that Tableau can leverage using its Athena connector.
Note: Throughout this article, replace placeholders with your actual S3 bucket name and prefixes such as:
  • YOUR_BUCKET_NAME
  • raw-data-prefix/
  • master-data-prefix/

Solution 1: AWS Lambda + Master CSV

Architecture Overview

S3 Event Trigger

  • Fires the Lambda function when new CSV files arrive under the specified prefix of the S3 bucket.

Lambda Function

  • Reads a state marker (last_processed.txt) to determine which folders have been processed.
  • Downloads the existing master.csv from S3 (if it exists) to seed the merge.
  • Streams and concatenates only new CSV files from unprocessed date folders.
  • Uploads the updated master.csv directly to S3 and updates the last_processed.txt marker.

Tableau

  • Connects directly to the static master.csv in S3 using the native S3 connector or as an external file source.

Prerequisites

  • AWS account with S3 and Lambda permissions.
  • IAM user or role with rights to create and manage Lambda functions, IAM roles, and S3 resources.

Step 1: Create the S3 Bucket & Folder Structure

In the AWS Console, navigate to the Amazon S3 service and click “Create bucket.” Name your bucket YOUR_BUCKET_NAME.
Once the bucket is created, set up the folder (prefix) structure:
  1. Inside the bucket, create a folder for raw data/daily data dumps: raw-data-prefix/.
  2. Within this folder, upload at least one subfolder named in YYYY-MM-DD/ format (e.g., 2025-07-10/), and place sample CSV files such as regionA.csv and regionB.csv inside it.
Next, create another folder at the root level of the same bucket to store the consolidated output:
  • master-data-prefix/ — this is where the merged master.csv will be stored.
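If you prefer to script this setup, the following is a minimal boto3 sketch of the same layout; the bucket name, prefixes, and sample file names are simply the placeholders used above.
import boto3

s3 = boto3.client('s3')
bucket = 'YOUR_BUCKET_NAME'

# Create the bucket (add a CreateBucketConfiguration for regions other than us-east-1).
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; zero-byte keys ending in "/" make the prefixes visible in the console.
for prefix in ['raw-data-prefix/2025-07-10/', 'master-data-prefix/']:
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the sample daily files into the date folder (local files assumed to exist).
for name in ['regionA.csv', 'regionB.csv']:
    s3.upload_file(name, bucket, f'raw-data-prefix/2025-07-10/{name}')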

Step 2: Create an IAM Role for Lambda

In the AWS Console, navigate to IAM → Roles, and click “Create role.”
This will guide you through a 3-step role creation process:
  1. Select trusted entity
    • Choose AWS service as the trusted entity type.
    • Select Lambda as the use case.
    • Click Next.
  2. Add permissions
    • Attach the managed policy: AWSLambdaBasicExecutionRole.
    • Click Next.
  3. Name, review, and create
    • Enter a name for your role, e.g., lambda-s3-master-role.
    • Click Create role.
Once the role is created:
  1. Open the Permissions tab of the newly created role and click “Add inline policy.”
  2. In the policy editor, switch to the JSON tab and paste the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}
3. Click Next, name the policy LambdaS3AccessPolicy, and click Create policy.
You have now successfully created an IAM Role with the necessary permissions to access the S3 bucket and perform read and write operations.
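If you’d rather script the role creation, here is a rough boto3 equivalent of the steps above; the role and policy names match the examples, and the trust policy is the standard Lambda service principal.
import json
import boto3

iam = boto3.client('iam')

# Trust policy allowing the Lambda service to assume the role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

# The S3 access policy shown above (list the bucket, read/write its objects).
s3_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {'Effect': 'Allow', 'Action': ['s3:ListBucket'],
         'Resource': 'arn:aws:s3:::YOUR_BUCKET_NAME'},
        {'Effect': 'Allow', 'Action': ['s3:GetObject', 's3:PutObject'],
         'Resource': 'arn:aws:s3:::YOUR_BUCKET_NAME/*'}
    ]
}

iam.create_role(RoleName='lambda-s3-master-role',
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName='lambda-s3-master-role',
                       PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole')
iam.put_role_policy(RoleName='lambda-s3-master-role',
                    PolicyName='LambdaS3AccessPolicy',
                    PolicyDocument=json.dumps(s3_policy))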

Step 3: Create the Lambda Function

  1. Navigate to Lambda service → Create function.
  2. Choose option “Author from scratch” to create the function.
  3. Add the following settings for the Lambda function:
    • Function name: merge-daily-csvs (or one of your choice)
    • Runtime: Python 3.9 (or higher).
  4. Under Permissions, select Use an existing role and choose the IAM role we created earlier, i.e., lambda-s3-master-role.
  5. Click Create function.

Step 4: Deploy the Lambda Code

  1. In your Lambda function’s Code tab, delete any default code.
  2. Copy and paste the complete Python code below into lambda_function.py:
import os
import boto3
import csv
from io import StringIO
from datetime import datetime, timezone

s3 = boto3.client('s3')

# ─── Configuration ──────────────────────────────────────────────
BUCKET        = os.environ.get('BUCKET', 'YOUR_BUCKET_NAME')
BASE_PREFIX   = os.environ.get('BASE_PREFIX', 'raw-data-prefix/')
MASTER_PREFIX = os.environ.get('MASTER_PREFIX', 'master-data-prefix/')
MASTER_KEY    = f"{MASTER_PREFIX}master.csv"
MARKER_KEY    = f"{MASTER_PREFIX}last_processed.txt"
# ────────────────────────────────────────────────────────────────

def lambda_handler(event, context):
    print("Lambda triggered")

    # 1️⃣ Load last processed marker
    try:
        resp = s3.get_object(Bucket=BUCKET, Key=MARKER_KEY)
        last_processed = resp['Body'].read().decode('utf-8').strip()
        print("Last processed date:", last_processed)
    except s3.exceptions.NoSuchKey:
        last_processed = "0000-00-00"
        print("No marker found; processing all folders")

    # 2️⃣ List all date folders under the raw data prefix
    paginator = s3.get_paginator('list_objects_v2')
    all_folders = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=BASE_PREFIX, Delimiter='/'):
        all_folders += [p['Prefix'] for p in page.get('CommonPrefixes', [])]

    folders_to_proc = [
        p for p in all_folders
        if p.rstrip('/').split('/')[-1] > last_processed
    ]
    if not folders_to_proc:
        print("✅ No new folders — nothing to do")
        return
    folders_to_proc.sort()
    print("Folders to process:", folders_to_proc)

    # 3️⃣ Start output buffer
    output = StringIO()
    writer = csv.writer(output)
    header_written = False

    try:
        resp = s3.get_object(Bucket=BUCKET, Key=MASTER_KEY)
        output.write(resp['Body'].read().decode('utf-8'))
        header_written = True
        print("✅ Seeded with existing master.csv")
    except s3.exceptions.NoSuchKey:
        print("ℹ️ No existing master.csv found — starting fresh")

    # 4️⃣ Process new files
    for prefix in folders_to_proc:
        print("→ Scanning folder:", prefix)
        result = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
        if 'Contents' not in result:
            print("⚠️ No files found in", prefix)
            continue

        for obj in result['Contents']:
            key = obj['Key']
            if not key.lower().endswith('.csv'):
                continue
            print("   Processing:", key)
            try:
                body = s3.get_object(Bucket=BUCKET, Key=key)['Body'].read().decode('utf-8')
                reader = csv.reader(StringIO(body))
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)
            except Exception as e:
                print("❌ Error reading", key, ":", str(e))

    # 5️⃣ Upload new master
    output.seek(0)  # rewind the buffer (getvalue() below returns the full contents regardless)
    s3.put_object(
        Bucket=BUCKET,
        Key=MASTER_KEY,
        Body=output.getvalue().encode('utf-8')
    )
    print(f"📄 Wrote up-to-date master to {MASTER_KEY}")

    # 6️⃣ Update marker
    newest_date = folders_to_proc[-1].rstrip('/').split('/')[-1]
    s3.put_object(
        Bucket=BUCKET,
        Key=MARKER_KEY,
        Body=newest_date.encode('utf-8')
    )
    print(f"📝 Updated marker to {newest_date}")
3. Click Deploy to deploy your code.
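The function reads its configuration from the BUCKET, BASE_PREFIX, and MASTER_PREFIX environment variables. You can set them under Configuration → Environment variables, or with a short boto3 call like the sketch below (the values shown are the placeholders used throughout this article).
import boto3

boto3.client('lambda').update_function_configuration(
    FunctionName='merge-daily-csvs',
    Environment={'Variables': {
        'BUCKET': 'YOUR_BUCKET_NAME',
        'BASE_PREFIX': 'raw-data-prefix/',
        'MASTER_PREFIX': 'master-data-prefix/'
    }}
)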

Step 5: Configure S3 Trigger

Next, configure a trigger for the Lambda function so that it runs automatically whenever the specified condition is met. This ensures the pipeline remains fully automated and eliminates the need for manual intervention.
  1. In Lambda function’s Configuration Tab → Triggers, click Add trigger and add the following settings:
    • Select S3 as the source and choose YOUR_BUCKET_NAME.
    • Event type: PUT.
    • Prefix: raw-data-prefix/, Suffix: .csv.
2. Click Add.
You have now set up an automated pipeline: whenever a new CSV file is uploaded to the S3 bucket, the Lambda function runs and seamlessly consolidates the data into a single master CSV file.
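The same trigger can also be wired up programmatically; the sketch below is a rough boto3 equivalent, with the region, account ID, and function ARN left as placeholders.
import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

bucket = 'YOUR_BUCKET_NAME'
function_arn = 'arn:aws:lambda:REGION:ACCOUNT_ID:function:merge-daily-csvs'

# Allow S3 to invoke the function.
lambda_client.add_permission(
    FunctionName='merge-daily-csvs',
    StatementId='s3-invoke-merge',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{bucket}'
)

# Fire on PUT events for .csv keys under raw-data-prefix/.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:Put'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'raw-data-prefix/'},
                {'Name': 'suffix', 'Value': '.csv'}
            ]}}
        }]
    }
)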

Step 6: Test and Verify

To test and verify your solution, follow these steps:
  1. Upload a new folder YYYY-MM-DD/ with CSV files under raw-data-prefix/.
  2. In CloudWatch Logs, check /aws/lambda/merge-daily-csvs for execution logs confirming the merge ran successfully.
  3. Confirm that master.csv and last_processed.txt now exist under master-data-prefix/ in the same S3 bucket.
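If you’d rather verify from a script, this small boto3 snippet uploads a new daily file and then reads the marker once the function has run (file and key names are the placeholders used above).
import boto3

s3 = boto3.client('s3')
bucket = 'YOUR_BUCKET_NAME'

# Upload a new daily file to fire the trigger (local regionA.csv assumed to exist).
s3.upload_file('regionA.csv', bucket, 'raw-data-prefix/2025-07-11/regionA.csv')

# After the Lambda finishes, the marker should point at the newest folder.
marker = s3.get_object(Bucket=bucket,
                       Key='master-data-prefix/last_processed.txt')['Body'].read().decode('utf-8')
print('Last processed folder:', marker)  # expect 2025-07-11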
With the first solution in place, it’s time to dive into the second serverless approach.

Solution 2: AWS Glue Crawler + Glue Data Catalog + Athena

Architecture Overview

  • AWS Glue Crawler: Scans S3 daily and catalogs CSV schema and partitions.
  • AWS Glue Data Catalog: Stores metadata and table definitions.
  • Amazon Athena: Executes SQL queries over the cataloged data.
  • Tableau: Connects via Athena ODBC/JDBC driver for live queries.

Prerequisites

  • AWS account with Glue, Athena, and S3 Bucket permissions.
  • IAM role for Glue with AWSGlueServiceRole and S3 read access.
  • An Athena workgroup (optional).
Here’s a step-by-step guide to configuring an AWS Glue Crawler to scan the daily CSV files arriving in the S3 bucket.

Step 1: Create the Glue Crawler

In the AWS Console, navigate to AWS Glue → Crawlers → Add crawler.
  1. Name your crawler:
    • Enter a unique name, such as “daily-s3-crawler”. Optionally, you may add a description to clarify the crawler’s purpose. Click Next.
  2. Configure the data source:
    • Choose S3 as the data store.
    • Set the path to point to your S3 bucket (e.g., s3://YOUR_BUCKET_NAME/raw-data-prefix/).
    • For crawler behavior, select “Crawl new sub-folders only” to ensure only newly added folders are processed, avoiding redundancy.
    • Once configured, add the data source and click Next.
  3. Set up security:
    • Choose an existing IAM role or create a new one that includes both AWSGlueServiceRole and S3 read permissions.
  4. Define output and schedule:
    • Choose to create a new Glue database (e.g., daily_data_db) or select an existing one.
    • Set the crawler schedule to either Run on demand or define a schedule (e.g., daily at 00:15 UTC).
  5. Review and complete:
    • Review all settings and click Finish.
Congratulations! Your crawler is now configured and ready to scan your S3 data.
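For those who prefer infrastructure-as-code, here is a rough boto3 equivalent of the crawler configuration above; the role ARN, account ID, and schedule are placeholders.
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='daily-s3-crawler',
    Role='arn:aws:iam::ACCOUNT_ID:role/YOUR_GLUE_ROLE',
    DatabaseName='daily_data_db',
    Targets={'S3Targets': [{'Path': 's3://YOUR_BUCKET_NAME/raw-data-prefix/'}]},
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},  # crawl new sub-folders only
    Schedule='cron(15 0 * * ? *)'  # daily at 00:15 UTC; omit to run on demand
)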

Step 2: Run the Crawler & Validate

Select your crawler and click Run crawler if you have configured it to run on demand.
Once the crawler completes its run, navigate to AWS Glue → Tables. You should now see a table (e.g., raw_data_prefix) that includes partitions organized by date. The table is typically named after the folder path where the daily CSV files are stored.
You’ll observe that the table includes data from all the CSV files present in your S3 bucket. In AWS Glue, partitions act like virtual folders — each partition corresponds to a sub-folder within your S3 bucket, allowing for efficient querying and organization based on those subdirectories.
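You can also inspect the catalog programmatically; the sketch below assumes the database and table names used in the examples above.
import boto3

glue = boto3.client('glue')

table = glue.get_table(DatabaseName='daily_data_db', Name='raw_data_prefix')['Table']
print([c['Name'] for c in table['StorageDescriptor']['Columns']])  # inferred CSV columns
print(table['PartitionKeys'])  # date folders usually surface as a string column, e.g. partition_0

partitions = glue.get_partitions(DatabaseName='daily_data_db', TableName='raw_data_prefix')
print([p['Values'] for p in partitions['Partitions']])  # one entry per YYYY-MM-DD sub-folder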

Step 3: Query Data in Athena

  1. Open Amazon Athena.
  2. In Query Editor, select the database daily_data_db.
  3. Run the following sample query to view the data for a given date. Note that for non-Hive-style date folders (e.g., 2025-07-10/), the crawler typically exposes the partition as a string column named partition_0:
SELECT *
FROM raw_data_prefix
WHERE partition_0 = CAST(current_date AS varchar);
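The same query can also be issued from Python with boto3, which is handy for smoke tests; the staging location is a placeholder.
import boto3

athena = boto3.client('athena')

athena.start_query_execution(
    QueryString="SELECT * FROM raw_data_prefix WHERE partition_0 = CAST(current_date AS varchar)",
    QueryExecutionContext={'Database': 'daily_data_db'},
    ResultConfiguration={'OutputLocation': 's3://YOUR_BUCKET_NAME/query-results/'}
)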

Step 4: Connect Tableau to Athena

  1. In Tableau Desktop, click Connect → Amazon Athena.
  2. Enter your AWS access keys, or use an IAM role if running on EC2.
  3. Select your region and the S3 staging directory (e.g., s3://YOUR_BUCKET_NAME/query-results/).
  4. Choose the database daily_data_db and your table.
  5. Build and publish dashboards that automatically reflect daily updates.
We’ve now walked through a detailed, step-by-step guide on how to consolidate daily data dumps — uploaded as CSV files into an Amazon S3 bucket — and make them accessible to Tableau without requiring any manual intervention. Both solutions presented offer a fully automated, serverless setup to keep your Tableau dashboards up-to-date in near real-time.
Next, let’s compare the two solutions in terms of their advantages, trade-offs, and associated costs.

Comparison: Pros, Cons & Costs

Aspect | Lambda + Master CSV | Glue Crawler + Athena
Latency | Near real-time (on CSV upload) | Depends on crawler schedule (e.g., hourly/daily)
Maintenance | Low (1 function) | Medium (monitor crawler runs and schema)
Monthly cost | ~$0.001 (Lambda) + S3 fees | $1–3 (Glue crawler) + Athena scanning ($0.02)
Scalability | Limited by Lambda memory (up to 10 GB) and 15-minute timeout | High (scales to TB/PB via Athena)
Schema handling | Manual (CSV header-driven) | Automatic schema inference and partitioning
Setup complexity | Low (6 steps) | Medium (4 components: Crawler, Catalog, Athena, Tableau)
Tableau integration | Direct S3 file import | Athena connector (ODBC/JDBC)

When to choose Lambda solution

  • You have small-to-medium daily CSVs (<100 MB/day).
  • You need immediate updates on file arrival.
  • You prefer minimal AWS services and nearly zero cost.

When to choose Glue + Athena solution

  • You expect schema evolution or large data volumes.
  • You prefer SQL querying and partition pruning.
  • You can tolerate slight delays based on crawler schedule.

Conclusion

Both solutions deliver a serverless pipeline to make daily S3 CSV dumps available in Tableau:
  • Lambda + Master CSV is ideal for ultra-simple, near real-time merging.
  • Glue Crawler + Athena offers robust schema management and SQL-based analytics at scale.
Select the pattern that best fits your data volume, freshness requirements, and operational overhead.
Happy Learning!!

References

Here are some resources to learn more about the services used in these solutions:
  1. Amazon S3
  2. AWS Lambda
  3. AWS Glue
  4. Amazon Athena
  5. Tableau
  6. Using Glue crawlers to populate the Data Catalog