Metadata-Driven ELT Framework in dbt + Snowflake for REST API Integration

Learn how to build a dynamic ELT framework that reads from REST APIs using metadata tables to control endpoints, methods, and transformations—scalable and fully dbt-integrated.

DATA INTEGRATION & AUTOMATION

Kiran Yenugudhati

11/23/2024 · 2 min read

This blog walks through how to build a scalable, metadata-driven ELT framework to extract data from REST APIs into Snowflake using:

  • ❄️ Snowflake Snowpark Python functions for API calls

  • 🔐 External Access + Secrets for secure connectivity

  • 📄 Metadata tables to manage all API configurations

  • 🧱 dbt incremental models to automate ingestion and transformation

🚫 No external ingestion tools
🚫 No manual coding per endpoint
✅ Just metadata + modular components = fully reusable framework

💡 Why Metadata-Driven?

Traditional ingestion pipelines:

  • Require a dedicated Python script or DAG per endpoint

  • Involve hardcoding filters, auth, and pagination logic

  • Need rework for every change or new object

That doesn’t scale.

With a metadata-first approach, you can:

  • Reuse a single Snowpark function for all API calls

  • Maintain configs in Snowflake tables

  • Add or update integrations by inserting metadata

  • Let dbt handle incremental loads declaratively

And you leave behind:

  • Hardcoded URLs, headers, and filters

  • Separate scripts per API/module

  • Manual scheduling or orchestration

🧰 Framework Components

  • Metadata Tables (Snowflake): Configure all APIs, objects, and fields

  • Snowpark Python Function (UDF/UDTF): Calls REST APIs securely from Snowflake

  • External Access Integration: Controls internet access + auth

  • Secrets Management: Stores credentials securely

  • dbt Incremental Models: Ingest, filter, and transform API data

🧪 Metadata Table Structure

You manage everything via metadata stored in Snowflake. Here's a high-level summary:

  • CONFIG_DATA_SOURCES => Stores API base config (source name, credentials, connection type).

  • CONFIG_DATA_SOURCE_OBJECTS => Defines individual API modules/resources (e.g., Users, Invoices), filters, and active status.

  • CONFIG_DATA_SOURCE_OBJECT_FIELD_MAPPINGS => Describes fields, friendly names, data types, primary keys — used for dynamic SELECT clauses.

🛠️ Sample schemas and population examples coming soon.
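
In the meantime, here's a minimal sketch of what the three tables could look like. All column names below are assumptions for illustration, not the final schemas:

```sql
-- Illustrative schemas only; the column names are assumptions.
CREATE TABLE IF NOT EXISTS config_data_sources (
    source_name     VARCHAR,   -- e.g. 'ACME_CRM'
    base_url        VARCHAR,   -- API root, e.g. 'https://api.example.com/v2'
    connection_type VARCHAR,   -- e.g. 'REST'
    secret_name     VARCHAR    -- name of the Snowflake SECRET holding credentials
);

CREATE TABLE IF NOT EXISTS config_data_source_objects (
    source_name    VARCHAR,
    object_name    VARCHAR,    -- e.g. 'Users', 'Invoices'
    endpoint_path  VARCHAR,    -- e.g. '/invoices'
    object_filters VARCHAR,    -- default query-string filters
    is_active      BOOLEAN
);

CREATE TABLE IF NOT EXISTS config_data_source_object_field_mappings (
    source_name    VARCHAR,
    object_name    VARCHAR,
    field_name     VARCHAR,    -- raw JSON key
    friendly_name  VARCHAR,    -- column alias used in dynamic SELECT clauses
    data_type      VARCHAR,
    is_primary_key BOOLEAN
);

-- Onboarding a new endpoint then reduces to a single insert:
INSERT INTO config_data_source_objects
VALUES ('ACME_CRM', 'Invoices', '/invoices', 'status=posted', TRUE);
```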

🧠 Snowpark Function: Highlights

The core component is a Snowflake UDTF (User Defined Table Function) that:

  • Accepts parameters like OBJECT_NAME, OBJECT_FILTERS

  • Authenticates securely using stored credentials

  • Builds dynamic REST API URLs

  • Returns a stream of JSON results (VARIANT)

  • Logs API response codes and errors to Snowflake logs

✅ All built using Snowpark Python with requests, logging, and _snowflake.get_username_password() for credential retrieval
✅ Compatible with external access integrations and Snowflake secrets
✅ Flexible for future APIs or vendors
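
As a concrete sketch (not the production function), here's roughly what such a UDTF can look like. The integration name, secret name, and base URL are placeholders, and pagination/retry logic is omitted for brevity:

```sql
-- Sketch only: integration, secret, and URL names are assumed placeholders.
CREATE OR REPLACE FUNCTION fetch_api_data(object_name VARCHAR, object_filters VARCHAR)
RETURNS TABLE (payload VARIANT)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
HANDLER = 'ApiReader'
EXTERNAL_ACCESS_INTEGRATIONS = (api_access_integration)
SECRETS = ('cred' = api_credentials)
PACKAGES = ('requests')
AS
$$
import logging

import _snowflake  # Snowflake-provided module for reading attached SECRETS
import requests

logger = logging.getLogger("api_reader")

# In the full framework the base URL comes from CONFIG_DATA_SOURCES.
BASE_URL = "https://api.example.com/v2"


class ApiReader:
    def process(self, object_name: str, object_filters: str):
        # Pull the username/password pair from the secret bound as 'cred'.
        creds = _snowflake.get_username_password("cred")

        url = f"{BASE_URL}/{object_name.lower()}"
        params = dict(p.split("=", 1) for p in object_filters.split("&") if p)

        resp = requests.get(
            url,
            params=params,
            auth=(creds.username, creds.password),
            timeout=30,
        )
        # Response codes land in the Snowflake event table when logging is enabled.
        logger.info("GET %s -> %s", url, resp.status_code)
        resp.raise_for_status()

        # One VARIANT row per record in the response.
        for record in resp.json():
            yield (record,)
$$;
```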

⚙️ External Access Setup (Once Per API)

CREATE NETWORK RULE ...
CREATE SECRET ...
CREATE EXTERNAL ACCESS INTEGRATION ...

✅ Enables the Snowpark function to securely call external APIs from within Snowflake.
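
Spelled out, the one-time setup could look like the following; every object name and the host are placeholders:

```sql
-- One-time setup per API; all names and the host below are placeholders.
CREATE OR REPLACE NETWORK RULE api_network_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('api.example.com');

CREATE OR REPLACE SECRET api_credentials
  TYPE = PASSWORD
  USERNAME = 'service_account'
  PASSWORD = '<redacted>';

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION api_access_integration
  ALLOWED_NETWORK_RULES = (api_network_rule)
  ALLOWED_AUTHENTICATION_SECRETS = (api_credentials)
  ENABLED = TRUE;
```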

📥 Ingesting Data with dbt Incremental Models

Each API module is handled by a dbt model configured as:

  • materialized: incremental

  • Uses a macro that:

    • Reads metadata for the object

    • Calculates a dynamic last_modified filter

    • Calls the Snowpark UDTF with correct filters

    • Outputs structured rows into the RAW layer

  • Supports time zone shifts, DST, and deduplication

💡 unique_key, merge_exclude_columns, and field selections are all metadata-driven — no hardcoded SQL.

🛠️ dbt macro samples coming soon.
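
Until then, here's a minimal sketch of what such a model can look like, with the metadata-driven parts hardcoded for illustration. fetch_api_data is the UDTF sketched above, and the modified_since filter parameter is an assumption:

```sql
-- models/raw/raw_invoices.sql
-- Sketch only: in the real framework the column list, unique_key, and the
-- last-modified filter all come from the metadata tables via a macro.
{{
    config(
        materialized = 'incremental',
        unique_key = 'id',
        merge_exclude_columns = ['_loaded_at']
    )
}}

{% if is_incremental() %}
    {# Compute the high-water mark up front: the UDTF argument must be a
       constant expression, so it cannot embed a subquery. #}
    {% set max_ts = run_query(
        "select coalesce(max(last_modified), '1970-01-01')::string from " ~ this
    ).columns[0].values()[0] %}
{% endif %}

select
    payload:id::number                   as id,
    payload:last_modified::timestamp_ntz as last_modified,
    payload                              as raw_payload,
    current_timestamp()                  as _loaded_at
from table(
    fetch_api_data(
        'Invoices',
        {% if is_incremental() %} 'modified_since={{ max_ts }}' {% else %} '' {% endif %}
    )
)
```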

🎯 Benefits

  • No external tool required => Entirely Snowflake + dbt

  • Easy to extend => Add new API endpoints via metadata rows

  • Centralized config => Everything lives in Snowflake (auditable)

  • Incremental by design => Only fetches new/changed records

  • Secure => Uses built-in secrets + access controls

  • Declarative & testable => Follows dbt best practices

  • Scalable => Single pattern supports any API

🔍 Ideal Use Cases

  • REST API integrations across multiple applications

  • Partner or platform data (e.g., CRM, finance, booking)

  • External systems with complex or paginated endpoints

  • Projects that require no external orchestrator

🧩 Bonus: Future-Ready Features

  • Retry-on-failure and error JSON responses

  • Row-level logging with response metadata

  • Streamlit UI for managing object configs

  • dbt exposures and metrics built on top of clean layers

  • Usage tracking per API or module

📎 Artefacts

  • GitHub repo with full dbt + Snowpark implementation

  • Metadata table sample data + config UI (Streamlit optional)

  • Visual diagrams showing end-to-end flow

  • Audit log setup using Snowflake event table

  • Best practices for flattening and late-binding joins