A reusable, config-driven ingestion framework for streaming and batch data — built on AWS Parameter Store, Kinesis, Glue, and DynamoDB. One Lambda. Any schema. Zero code changes to add a new source.
A single reusable Lambda function handles both phone and PC streaming events. Each source has its own Kinesis Data Stream. The Lambda reads its input and output schemas from AWS Parameter Store at runtime — no hardcoded field names, no code changes to add a new stream source.
Raw JSON lands in S3 raw/, which triggers a second Lambda to update DynamoDB in real time. Glue then enriches the data: it resolves lat/long to city and state, and translates song, artist, album, and playlist IDs to human-readable names via RDS lookups and DynamoDB user profile lookups.
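The enrichment step can be pictured with a small plain-Python sketch. It is not the Glue job itself: the in-memory `SONGS` and `CITIES` tables stand in for the RDS lookups, the field names are hypothetical, and nearest-point matching is an assumed simplification of the lat/long resolution.

```python
# Hypothetical lookup tables standing in for RDS reference data.
SONGS = {"s1": {"title": "Example Song", "artist": "Example Artist"}}
CITIES = [
    ("Austin", "TX", 30.27, -97.74),
    ("Denver", "CO", 39.74, -104.99),
]


def nearest_city(lat, lon):
    """Pick the closest known city by squared coordinate distance (a simplification)."""
    name, state, _, _ = min(
        CITIES, key=lambda c: (c[2] - lat) ** 2 + (c[3] - lon) ** 2
    )
    return name, state


def enrich(event):
    """Replace IDs and coordinates with human-readable values."""
    song = SONGS.get(event["song_id"], {})
    city, state = nearest_city(event["lat"], event["lon"])
    return {
        "user_id": event["user_id"],
        "song": song.get("title"),      # None when the ID is unknown
        "artist": song.get("artist"),
        "city": city,
        "state": state,
    }
```

An unknown `song_id` yields `None` for the song and artist rather than raising, so a single bad reference does not stop a batch in this sketch.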
Employee and customer records live in RDS as fully normalized relational tables. A config-driven Lambda batch extractor reads the schemas from Parameter Store, extracts records, and writes them to S3 raw/.
Glue then flattens the relational structure — joining across address, phone, position, salary, and bonus tables — producing analytics-ready flat records in S3 curated/, which are then loaded into DynamoDB. Sensitive fields (SSN, CC numbers) are KMS-encrypted and excluded from all pipeline output schemas.
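As a rough illustration of the flattening step, the sketch below joins one employee row against two lookup tables and drops sensitive fields. The table contents, key names, and the `SENSITIVE` set are assumptions made for the example; the real job runs in Glue and joins more tables.

```python
# Hypothetical normalized tables, keyed by their primary keys.
EMPLOYEES = [
    {"emp_id": 1, "name": "Ada Example", "ssn": "000-00-0000",
     "address_id": 10, "position_id": 100},
]
ADDRESSES = {10: {"city": "Austin", "state": "TX"}}
POSITIONS = {100: {"title": "Engineer"}}

# Fields that must never appear in pipeline output.
SENSITIVE = {"ssn", "cc_number"}


def flatten(emp):
    """Join one employee row to its lookups and return a flat, safe record."""
    merged = dict(emp)
    merged.update(ADDRESSES.get(emp["address_id"], {}))
    merged.update(POSITIONS.get(emp["position_id"], {}))
    # Drop sensitive fields and the foreign keys that are no longer needed.
    drop = SENSITIVE | {"address_id", "position_id"}
    return {k: v for k, v in merged.items() if k not in drop}
```

Filtering after the join means a sensitive column in any joined table is still excluded from the flat record.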
This showcase uses synthetic data to demonstrate the pipeline. The Lambda producers generate realistic records — phone events with lat/long coordinates, PC events with browser types, employee records with relational lookups, and customer records with enrollment history.
The pipeline architecture, Parameter Store config pattern, Glue enrichment logic, and DynamoDB state management are all production-grade — the data source is simulated so the framework can be demonstrated without live RDS infrastructure cost.
Input and output schemas are stored as JSON in AWS Parameter Store — versioned, audited, and encrypted independently of code. The Lambda reads its schema path from an environment variable, loads the config at runtime, validates incoming records, and routes output — all without a single hardcoded field name.
Adding a new streaming source takes a Parameter Store entry and a new Kinesis trigger. Zero Lambda code changes.
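A minimal sketch of config-driven validation and routing follows. The JSON layout (`input`, `output`, `target_prefix`) and the type names are assumptions chosen for illustration, not the framework's actual schema format.

```python
import json

# Hypothetical schema config, shaped as it might be stored in Parameter Store.
SCHEMA_JSON = """
{
  "input": {"user_id": "str", "song_id": "str", "lat": "float", "lon": "float"},
  "output": ["user_id", "song_id", "lat", "lon"],
  "target_prefix": "raw/phone/"
}
"""

# Map config type names to Python types; floats also accept ints.
TYPE_MAP = {"str": str, "int": int, "float": (int, float)}


def validate(record, input_schema):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, type_name in input_schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], TYPE_MAP[type_name]):
            errors.append(f"bad type for {field}: expected {type_name}")
    return errors


def process(record, schema):
    """Validate a record, then keep only the configured output fields."""
    errors = validate(record, schema["input"])
    if errors:
        return {"status": "rejected", "errors": errors}
    body = {field: record[field] for field in schema["output"]}
    return {"status": "ok", "prefix": schema["target_prefix"], "body": body}
```

Every field name lives in the config, so pointing the same `process` function at a different schema handles a different source without touching the code.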
The curated S3 output answers a real business question: "What is the #1 song by region, age, and city?" — by enriching each raw streaming event with resolved song name, artist, city, state, and user age at Glue transform time. The curated dataset is analytics-ready for ML models, BI tools, or direct Athena queries without any further joins.
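Because each curated record already carries the resolved song and location, the "#1 song" question reduces to a grouped count. The sketch below shows the idea in plain Python over hypothetical records; in practice the same aggregation would more likely be an Athena query or a BI tool.

```python
from collections import Counter, defaultdict


def top_song_by(records, key):
    """Return the most-played song for each distinct value of `key`."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[key]][record["song"]] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


# Hypothetical curated records (field names are assumptions).
CURATED = [
    {"song": "Song A", "city": "Austin", "state": "TX"},
    {"song": "Song A", "city": "Austin", "state": "TX"},
    {"song": "Song B", "city": "Austin", "state": "TX"},
    {"song": "Song B", "city": "Denver", "state": "CO"},
]
```

Calling `top_song_by(CURATED, "city")` groups by city; passing `"state"` groups by state, with no join needed in either case.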
Input and output schemas live in Parameter Store — not in code. The Lambda resolves its schema path from an environment variable, caches the config on warm invocations, and validates every record against it at runtime.
Trigger individual pipelines or start all four simultaneously. Watch records flow through each stage in real time.
Records populate as each pipeline runs. Scroll to explore the synthetic dataset.
| User ID | Song | Artist | City | State | Device |
|---|---|---|---|---|---|

| User ID | Song | Artist | City | Browser |
|---|---|---|---|---|

| Emp ID | Name | Title | City | State |
|---|---|---|---|---|

| User ID | Name | City | State | Enrolled |
|---|---|---|---|---|