Extracting Insights from Unstructured PDFs Using Snowflake Cortex LLM

Discover how to extract structured information from messy, unstructured PDF documents (like invoices, reports, or forms) using Snowflake’s Cortex LLM capabilities—no external tools or Python required. Ideal for operations, finance, and compliance use cases.


Kiran Yenugudhati

2/23/2025

This blog walks through how to extract structured data from unstructured documents like invoices, reports, or internal forms using Snowflake Cortex LLM. You’ll learn how to:

  • Process raw PDFs and extract useful content

  • Generate structured outputs (like JSON or table rows)

  • Use native LLMs in Snowflake — no external code or tools

  • Store results in Snowflake for reporting, automation, or compliance

🔍 Why This Matters

In nearly every business function — finance, legal, compliance, operations — valuable data is locked away in unstructured PDFs and emails:

  • Utility bills

  • Vendor invoices

  • Compliance reports

  • Contracts and policy documents

  • Internal forms or scanned records

Traditional extraction requires:

  • Python scripts

  • OCR engines

  • Manual copy/paste

Not anymore.
With Snowflake Cortex, you can now extract structured information from unstructured content entirely in SQL.

🧰 Prerequisites

To follow this approach, you’ll need:

  • Snowflake Enterprise Edition (with Cortex LLM enabled)

  • Access to a Snowflake stage to store/upload PDFs

  • Privileges to call Snowflake Cortex LLM functions (the SNOWFLAKE.CORTEX_USER database role)

  • Basic familiarity with Snowflake SQL

🛠️ Step-by-Step Guide

1. Upload PDF Files to Snowflake Stage

Upload your unstructured documents (e.g., PDFs) to a Snowflake stage.

This could be a monthly folder of utility invoices, legal reports, or any batch of documents you want to process.
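A minimal sketch of the upload step, assuming a hypothetical internal stage called utility_invoices and a placeholder local folder /tmp/invoices/2025. PUT runs from a client such as SnowSQL, and server-side encryption is set here because Cortex document functions expect it on internal stages.

```sql
-- Hypothetical stage name. The directory table lets you list staged files in SQL.
CREATE STAGE IF NOT EXISTS utility_invoices
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Placeholder local path. AUTO_COMPRESS = FALSE keeps the PDFs readable as PDFs.
PUT file:///tmp/invoices/2025/*.pdf @utility_invoices/2025
    AUTO_COMPRESS = FALSE;

-- Make the new files visible in the stage's directory table.
ALTER STAGE utility_invoices REFRESH;

-- Quick check that the upload landed.
LIST @utility_invoices/2025;
```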

2. Extract Raw Text from PDFs

If your PDFs contain extractable text (i.e., not scanned images), you can extract the raw content within Snowflake using UDFs or upload preprocessed text into a staging table.
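One way to do this without writing a UDF is the Cortex PARSE_DOCUMENT function. This is a swapped-in option rather than the UDF route mentioned above, and the stage name (utility_invoices) and table name (invoice_text) are assumptions made for illustration.

```sql
-- Hypothetical table: one row of raw text per staged PDF.
CREATE OR REPLACE TABLE invoice_text AS
SELECT
    relative_path AS file_name,
    -- PARSE_DOCUMENT returns an object; the extracted text is read from its "content" key.
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        '@utility_invoices',
        relative_path,
        {'mode': 'OCR'}
    ):content::STRING AS raw_text
FROM DIRECTORY(@utility_invoices)
WHERE relative_path ILIKE '%.pdf';
```

If table layout matters for your documents, 'LAYOUT' mode is worth trying in place of 'OCR'.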

3. Use Cortex LLM to Extract Structured Information

Use Snowflake Cortex LLM functions to extract fields from the document text. A sample prompt might look like:

“You are a document parser. Extract the following fields from this invoice: provider name, billing period, electricity usage (in kWh), gas usage (in MJ), total amount, due date. Return the result in JSON format.”
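Here is a rough sketch of how that prompt could be sent to a model with SNOWFLAKE.CORTEX.COMPLETE. It assumes a hypothetical table invoice_text(file_name, raw_text) holding the extracted text, and the model name is only an example, since model availability varies by region.

```sql
-- Hypothetical staging table for the raw model output.
CREATE OR REPLACE TABLE invoice_llm_raw AS
SELECT
    file_name,
    SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large2',  -- example model; swap in one enabled for your account
        CONCAT(
            'You are a document parser. Extract the following fields from this invoice: ',
            'provider name, billing period, electricity usage (in kWh), gas usage (in MJ), ',
            'total amount, due date. Return the result in JSON format.',
            '\n\nInvoice text:\n',
            raw_text
        )
    ) AS llm_response
FROM invoice_text;
```

The response comes back as a plain string, so it still needs to be parsed and validated before you rely on it.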

🛠️ Prompt templates and examples coming soon

🌱 Real-World Use Case: Sustainability Reporting from Utility PDFs

🔋 The Challenge

In Australia and globally, organisations receive monthly utility invoices in PDF format from energy providers. These documents contain:

  • Electricity usage (kWh)

  • Gas consumption (MJ or m³)

  • Site address

  • Billing period

  • Emissions intensity (optional)

  • Total amount

Manually entering these into spreadsheets is time-consuming and error-prone. The data is critical for:

  • Sustainability reporting (ESG, emissions tracking)

  • Finance audits and cost control

  • Automated alerting on abnormal consumption

✅ The LLM-Powered Solution

Using Snowflake Cortex:

  1. Upload invoices to a stage (e.g., @utility_invoices/2025)

  2. Extract text using Snowflake functions

  3. Run an LLM prompt to extract key values into structured JSON

  4. Store the results in a Snowflake table for downstream analytics

📊 Example Output Table
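One possible shape for such a table is sketched below. The table name, column names, and JSON keys are assumptions for illustration, not a fixed schema. The sketch also assumes a hypothetical staging table invoice_llm_raw(file_name, llm_response) and a prompt that asked for exactly these keys, with dates in YYYY-MM-DD format.

```sql
-- Hypothetical target table for the extracted fields.
CREATE TABLE IF NOT EXISTS utility_invoice_facts (
    file_name             STRING,
    provider_name         STRING,
    site_address          STRING,
    billing_period_start  DATE,
    billing_period_end    DATE,
    electricity_kwh       NUMBER(12, 2),
    gas_mj                NUMBER(12, 2),
    total_amount          NUMBER(12, 2),
    due_date              DATE,
    raw_response          VARIANT  -- kept for auditing and reprocessing
);

-- TRY_* functions return NULL instead of failing when the model output is malformed.
INSERT INTO utility_invoice_facts
SELECT
    file_name,
    parsed:provider_name::STRING,
    parsed:site_address::STRING,
    TRY_TO_DATE(parsed:billing_period_start::STRING),
    TRY_TO_DATE(parsed:billing_period_end::STRING),
    TRY_TO_NUMBER(parsed:electricity_kwh::STRING, 12, 2),
    TRY_TO_NUMBER(parsed:gas_mj::STRING, 12, 2),
    TRY_TO_NUMBER(parsed:total_amount::STRING, 12, 2),
    TRY_TO_DATE(parsed:due_date::STRING),
    parsed
FROM (
    SELECT file_name, TRY_PARSE_JSON(llm_response) AS parsed
    FROM invoice_llm_raw
)
WHERE parsed IS NOT NULL;
```

Rows where the model wrapped the JSON in extra text will parse to NULL and be skipped, so it is worth counting the rejected files after each run.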

These results can feed into:

  • Dashboards showing monthly energy use

  • Carbon calculators estimating emissions

  • Alerts for spikes in usage
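As an illustration of the alerting idea, the query below flags invoices whose electricity usage is well above the site's recent average. It assumes a hypothetical table utility_invoice_facts with site_address, billing_period_start, and electricity_kwh columns. The 1.5x threshold and three-invoice window are arbitrary starting points.

```sql
WITH usage_with_baseline AS (
    SELECT
        site_address,
        billing_period_start,
        electricity_kwh,
        -- Average of the previous three invoices for the same site.
        AVG(electricity_kwh) OVER (
            PARTITION BY site_address
            ORDER BY billing_period_start
            ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
        ) AS trailing_avg_kwh
    FROM utility_invoice_facts
)
SELECT
    site_address,
    billing_period_start,
    electricity_kwh,
    trailing_avg_kwh
FROM usage_with_baseline
WHERE trailing_avg_kwh IS NOT NULL
  AND electricity_kwh > 1.5 * trailing_avg_kwh
ORDER BY site_address, billing_period_start;
```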

🧠 Prompt Engineering Tips

  • Keep prompts concise and use bullet-style output

  • Ask for specific formats (e.g. "return as JSON with these keys:...")

  • Add instructions like “Ignore boilerplate or payment instructions”

  • Tune prompts separately for electricity vs gas vs water documents if needed
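These tips can be applied by keeping one prompt template per document type in a small lookup table. The sketch below is only one way to organise this. The table names, the doc_type column on invoice_text, the JSON keys, and the model name are all assumptions.

```sql
-- Hypothetical lookup table: one prompt template per document type.
CREATE TABLE IF NOT EXISTS prompt_templates (
    doc_type  STRING,
    template  STRING
);

INSERT INTO prompt_templates (doc_type, template) VALUES
    ('electricity',
     'Extract fields from this electricity invoice. Return only JSON with these keys: provider_name, site_address, billing_period_start, billing_period_end, electricity_kwh, total_amount, due_date. Use YYYY-MM-DD for dates and null for missing values. Ignore boilerplate or payment instructions.'),
    ('gas',
     'Extract fields from this gas invoice. Return only JSON with these keys: provider_name, site_address, billing_period_start, billing_period_end, gas_mj, total_amount, due_date. Use YYYY-MM-DD for dates and null for missing values. Ignore boilerplate or payment instructions.');

-- Assumes invoice_text carries a doc_type column alongside file_name and raw_text.
SELECT
    t.file_name,
    SNOWFLAKE.CORTEX.COMPLETE(
        'mistral-large2',  -- example model name
        CONCAT(p.template, '\n\nDocument text:\n', t.raw_text)
    ) AS llm_response
FROM invoice_text AS t
JOIN prompt_templates AS p
  ON p.doc_type = t.doc_type;
```

Keeping templates in a table also makes prompt changes easy to review and roll back.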

🔐 Security & Governance

  • All document data stays within Snowflake — no third-party LLM required

  • Role-based access ensures only authorized users can access sensitive data

  • Prompts and responses can be logged for auditability
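A sketch of how the access control and audit logging could look in practice. The role names, table names, and the prompt column on invoice_llm_raw are hypothetical, and the roles would also need USAGE on the relevant database and schema.

```sql
-- Hypothetical roles: one runs the extraction, one only reads the results.
CREATE ROLE IF NOT EXISTS invoice_processor;
CREATE ROLE IF NOT EXISTS invoice_analyst;

-- Allow the processing role to call Cortex functions and read staged PDFs.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE invoice_processor;
GRANT READ ON STAGE utility_invoices TO ROLE invoice_processor;

-- Analysts see only the structured output, not the source documents.
GRANT SELECT ON TABLE utility_invoice_facts TO ROLE invoice_analyst;

-- Hypothetical audit table recording what was asked and what came back.
CREATE TABLE IF NOT EXISTS llm_audit_log (
    file_name  STRING,
    prompt     STRING,
    response   STRING,
    run_by     STRING,
    run_at     TIMESTAMP_LTZ
);

-- Assumes invoice_llm_raw keeps the prompt next to each response.
INSERT INTO llm_audit_log
SELECT file_name, prompt, llm_response, CURRENT_USER(), CURRENT_TIMESTAMP()
FROM invoice_llm_raw;
```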

✅ Benefits of This Approach

  • Minimal setup: no Python scripts, separate OCR engines, or external services to maintain

  • Fast, scalable processing of high volumes of PDFs

  • Fully integrated with Snowflake + dashboards

  • Reproducible and secure: your data never leaves your data platform

📌 Conclusion

PDFs and unstructured documents don’t need to be a black hole anymore. With Snowflake Cortex LLM, you can:

  • Extract meaningful insights

  • Automate manual data collection

  • Power dashboards, alerts, and ESG reports

  • Enable finance and ops teams to move faster, smarter

Whether you're processing invoices, contracts, forms, or internal documents — this is your new superpower.

📎 Artefacts

  • Prompt templates for utilities, contracts, and finance docs

  • dbt model to automate ingestion + parsing pipeline

  • Streamlit UI to upload + preview LLM outputs

  • GitHub starter project for quick deployment