
by Hudhaifa Ahmed


I had to get answers from a long PDF. It wasn’t something I wanted to read through manually, so I set up a quick workflow in Databricks using PyPDF2, Spark, and Databricks Foundation Models. The whole thing runs inside a notebook, using our model serving endpoint to power the question-answering logic.

This setup follows a map-reduce-style pattern. First, you map over the document page by page, using the model to find relevant sections. Then you reduce by combining the useful parts and asking the actual question. It’s simple and effective, especially when you’re dealing with large documents.

All of this runs entirely in Databricks, with no third-party services involved.

What this uses

  • PyPDF2 to extract text from PDF files
  • Pandas to structure the data
  • Apache Spark for distributed querying
  • Databricks Foundation Models served via ai_query
  • Databricks notebooks to run the entire flow
  • A Databricks model serving endpoint (databricks-llama-4-maverick)

The next step is to package this into a full Databricks Lakehouse App so other users can run it without touching code.

Step 1: Extract the PDF content

We start by loading the PDF and pulling out the text, page by page.

Q/A with Databricks, pip and pandas
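
The snippet below is a minimal sketch of that step, assuming PyPDF2's PdfReader and one pandas row per page; the file path and column names are placeholders rather than the exact ones from the original notebook.

```python
# Minimal sketch: read the PDF page by page into a pandas DataFrame.
# The path and column names are placeholders.
import pandas as pd
from PyPDF2 import PdfReader

pdf_path = "/Volumes/main/default/docs/long_report.pdf"  # hypothetical location

reader = PdfReader(pdf_path)
pages = [
    {"page_number": i + 1, "text": page.extract_text() or ""}
    for i, page in enumerate(reader.pages)
]

# One row per page keeps the later map step simple.
pdf_df = pd.DataFrame(pages)
```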

Step 2: Move it to Spark

We convert the pandas DataFrame into a Spark DataFrame so we can query it efficiently.
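
Roughly, the conversion looks like this; the view name pdf_pages is an assumption that the later steps carry through.

```python
# Convert the pandas DataFrame to Spark and register a temp view so the
# next step can query it with SQL. The view name is a placeholder.
pages_sdf = spark.createDataFrame(pdf_df)
pages_sdf.createOrReplaceTempView("pdf_pages")

display(pages_sdf)  # optional: eyeball the per-page text in the notebook
```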

Step 3: Filter the document using an LLM (map step)

Next, we run a simple yes/no query on each page using the ai_query function. This asks the model whether each page contains relevant information.

This gives you a short list of potentially relevant pages. That’s your map step.

Q/A DBX: spark and pdfs
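
Sketched out, the map step might look like the following, with ai_query pointed at the databricks-llama-4-maverick endpoint. The question text and the yes/no prompt wording are illustrative, not the original notebook's.

```python
# Map step sketch: ask the model a yes/no relevance question per page.
question = "What were the key findings in this report?"  # hypothetical question

flags = spark.sql(f"""
    SELECT
      page_number,
      text,
      ai_query(
        'databricks-llama-4-maverick',
        CONCAT(
          'Answer with yes or no only. Is this page relevant to the question: ',
          '{question}',
          ' Page text: ',
          text
        )
      ) AS is_relevant
    FROM pdf_pages
""")

# Keep only the pages the model flagged as relevant.
relevant_pages = flags.filter("lower(trim(is_relevant)) LIKE 'yes%'")
```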

Step 4: Combine and get the final answer (reduce step)

Now take the filtered results, combine them into one chunk of context, and ask the model the actual question.

Q&A DBX: reduce
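
A rough version of the reduce step, reusing the placeholder names from the sketches above: collect the filtered pages, join them into one context string, and send a single final prompt through ai_query. The parameter-marker form of spark.sql is used here to avoid escaping the context inside the SQL string; it is available on recent Databricks Runtime versions.

```python
# Reduce step sketch: build one context block from the relevant pages
# and ask the model the actual question.
rows = relevant_pages.select("page_number", "text").orderBy("page_number").collect()
context = "\n\n".join(row["text"] for row in rows)

prompt = (
    "Using only the context below, answer the question.\n\n"
    f"Question: {question}\n\nContext:\n{context}"
)

answer = spark.sql(
    "SELECT ai_query('databricks-llama-4-maverick', :prompt) AS answer",
    args={"prompt": prompt},
).first()["answer"]

print(answer)
```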

Why this works

This setup avoids running the full document through the model all at once, which would break on longer texts due to token limits. Instead, it uses a lightweight filtering step to find the good parts, then passes just those into the model for a proper answer. It’s fast, accurate enough for most real use cases, and fully contained within Databricks. No need for external APIs, no special infrastructure. You can run this today using standard notebooks and a model serving endpoint.

Next steps

The notebook version works well for internal use, but it’s not very user-friendly yet. Here’s what’s coming next:

  1. Turn this into a Databricks Lakehouse App: add a file uploader and question input so anyone can use it through a UI.
  2. Support more file formats: add DOCX and plain text support alongside PDF.
  3. Use semantic search to improve filtering: replace the binary yes/no filter with vector similarity scoring for better relevance (a rough sketch follows this list).
  4. Track user feedback for continuous improvement: let users rate the answers to improve filtering and prompt tuning over time.
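
For the third item, here is a rough sketch of what a vector-similarity filter could look like, using the MLflow deployments client against a Databricks embedding endpoint. The endpoint name, response shape, and top-k cutoff are assumptions, not tested code.

```python
# Hypothetical semantic filter: score every page against the question with
# embeddings instead of a per-page yes/no LLM call. Endpoint name and
# response shape are assumptions; adapt to whatever your workspace serves.
import numpy as np
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def embed(texts, endpoint="databricks-gte-large-en"):
    response = client.predict(endpoint=endpoint, inputs={"input": texts})
    return np.array([item["embedding"] for item in response["data"]])

page_texts = pdf_df["text"].tolist()
page_vecs = embed(page_texts)
query_vec = embed([question])[0]

# Cosine similarity between the question and every page.
scores = page_vecs @ query_vec / (
    np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(query_vec)
)

# Take the best-scoring pages rather than relying on a binary yes/no filter.
top_k = 5
top_pages = pdf_df.iloc[np.argsort(scores)[::-1][:top_k]]
```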

Final thoughts

If you work in Databricks, you already have everything you need to build an end-to-end document QA system. You can parse PDFs, query them with Spark, and call Databricks Foundation Models for smart, scalable answers. This approach is ideal for technical docs, research papers, audit logs, or anywhere you need to search and summarize long documents without reading them line by line. The current notebook setup is simple but powerful. With a little UI, this becomes a lightweight app anyone on your team can use. That’s where we’re headed next.
