
by Hudhaifa Ahmed


I had to get answers from a long PDF. It wasn’t something I wanted to read through manually, so I set up a quick workflow in Databricks using PyPDF2, Spark, and Databricks Foundation Models. The whole thing runs inside a notebook, using our model serving endpoint to power the question-answering logic.

This setup follows a map-reduce-style pattern. First, you map over the document page by page, using the model to find relevant sections. Then you reduce by combining the useful parts and asking the actual question. It’s simple and effective, especially when you’re dealing with large documents.

All of this runs entirely in Databricks, with no third-party services involved.

What this uses

  • PyPDF2 to extract text from PDF files
  • Pandas to structure the data
  • Apache Spark for distributed querying
  • Databricks Foundation Models served via ai_query
  • Databricks notebooks to run the entire flow
  • A Databricks model serving endpoint (databricks-llama-4-maverick)

The next step is to package this into a full Databricks Lakehouse App so other users can run it without touching code.

Step 1: Extract the PDF content

We start by loading the PDF and pulling out the text, page by page.

Q/A with Databricks, pip and pandas
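
The snippet below is a minimal sketch of that step, assuming PyPDF2's PdfReader and one pandas row per page; the file path and column names are placeholders rather than the exact ones from the original notebook.

```python
# Minimal sketch: read the PDF page by page into a pandas DataFrame.
# The path and column names are placeholders.
import pandas as pd
from PyPDF2 import PdfReader

pdf_path = "/Volumes/main/default/docs/long_report.pdf"  # hypothetical location

reader = PdfReader(pdf_path)
pages = [
    {"page_number": i + 1, "text": page.extract_text() or ""}
    for i, page in enumerate(reader.pages)
]

# One row per page keeps the later map step simple.
pdf_df = pd.DataFrame(pages)
```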

Step 2: Move it to Spark

We convert the pandas DataFrame into a Spark DataFrame so we can query it efficiently.
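
Roughly, the conversion looks like this; the view name pdf_pages is an assumption that the later steps carry through.

```python
# Convert the pandas DataFrame to Spark and register a temp view so the
# next step can query it with SQL. The view name is a placeholder.
pages_sdf = spark.createDataFrame(pdf_df)
pages_sdf.createOrReplaceTempView("pdf_pages")

display(pages_sdf)  # optional: eyeball the per-page text in the notebook
```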

Step 3: Filter the document using an LLM (map step)

Next, we run a simple yes/no query on each page using the ai_query function. This asks the model whether each page contains relevant information.

This gives you a short list of potentially relevant pages. That’s your map step.

Q/A DBX: spark and pdfs
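
Sketched out, the map step might look like the following, with ai_query pointed at the databricks-llama-4-maverick endpoint. The question text and the yes/no prompt wording are illustrative, not the original notebook's.

```python
# Map step sketch: ask the model a yes/no relevance question per page.
question = "What were the key findings in this report?"  # hypothetical question

flags = spark.sql(f"""
    SELECT
      page_number,
      text,
      ai_query(
        'databricks-llama-4-maverick',
        CONCAT(
          'Answer with yes or no only. Is this page relevant to the question: ',
          '{question}',
          ' Page text: ',
          text
        )
      ) AS is_relevant
    FROM pdf_pages
""")

# Keep only the pages the model flagged as relevant.
relevant_pages = flags.filter("lower(trim(is_relevant)) LIKE 'yes%'")
```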

Step 4: Combine and get the final answer (reduce step)

Now take the filtered results, combine them into one chunk of context, and ask the model the actual question.

Q&A DBX: reduce
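
A rough version of the reduce step, reusing the placeholder names from the sketches above: collect the filtered pages, join them into one context string, and send a single final prompt through ai_query. The parameter-marker form of spark.sql is used here to avoid escaping the context inside the SQL string; it is available on recent Databricks Runtime versions.

```python
# Reduce step sketch: build one context block from the relevant pages
# and ask the model the actual question.
rows = relevant_pages.select("page_number", "text").orderBy("page_number").collect()
context = "\n\n".join(row["text"] for row in rows)

prompt = (
    "Using only the context below, answer the question.\n\n"
    f"Question: {question}\n\nContext:\n{context}"
)

answer = spark.sql(
    "SELECT ai_query('databricks-llama-4-maverick', :prompt) AS answer",
    args={"prompt": prompt},
).first()["answer"]

print(answer)
```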

Why this works

This setup avoids running the full document through the model all at once, which would break on longer texts due to token limits. Instead, it uses a lightweight filtering step to find the good parts, then passes just those into the model for a proper answer. It’s fast, accurate enough for most real use cases, and fully contained within Databricks. No need for external APIs, no special infrastructure. You can run this today using standard notebooks and a model serving endpoint.

Next steps

The notebook version works well for internal use, but it’s not very user-friendly yet. Here’s what’s coming next:

  1. Turn this into a Databricks Lakehouse App: add a file uploader and question input so anyone can use it through a UI.
  2. Support more file formats: add DOCX and plain text support alongside PDF.
  3. Use semantic search to improve filtering: replace the binary yes/no filter with vector similarity scoring for better relevance (a rough sketch follows this list).
  4. Track user feedback for continuous improvement: let users rate the answers to improve filtering and prompt tuning over time.
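
For the third item, here is a rough sketch of what a vector-similarity filter could look like, using the MLflow deployments client against a Databricks embedding endpoint. The endpoint name, response shape, and top-k cutoff are assumptions, not tested code.

```python
# Hypothetical semantic filter: score every page against the question with
# embeddings instead of a per-page yes/no LLM call. Endpoint name and
# response shape are assumptions; adapt to whatever your workspace serves.
import numpy as np
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def embed(texts, endpoint="databricks-gte-large-en"):
    response = client.predict(endpoint=endpoint, inputs={"input": texts})
    return np.array([item["embedding"] for item in response["data"]])

page_texts = pdf_df["text"].tolist()
page_vecs = embed(page_texts)
query_vec = embed([question])[0]

# Cosine similarity between the question and every page.
scores = page_vecs @ query_vec / (
    np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(query_vec)
)

# Take the best-scoring pages rather than relying on a binary yes/no filter.
top_k = 5
top_pages = pdf_df.iloc[np.argsort(scores)[::-1][:top_k]]
```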

Final thoughts

If you work in Databricks, you already have everything you need to build an end-to-end document QA system. You can parse PDFs, query them with Spark, and call Databricks Foundation Models for smart, scalable answers. This approach is ideal for technical docs, research papers, audit logs, or anywhere you need to search and summarize long documents without reading them line by line. The current notebook setup is simple but powerful. With a little UI, this becomes a lightweight app anyone on your team can use. That’s where we’re headed next.
