by Hudhaifa Ahmed

I had to get answers from a long PDF. It wasn’t something I wanted to read through manually, so I set up a quick workflow in Databricks using PyPDF2, Spark, and Databricks Foundation Models. The whole thing runs inside a notebook, using our model serving endpoint to power the question-answering logic.
This setup follows a map-reduce-style pattern. First, you map over the document page by page, using the model to find relevant sections. Then you reduce by combining the useful parts and asking the actual question. It’s simple and effective, especially when you’re dealing with large documents.
All of this runs entirely in Databricks, no third-party services involved.
What this uses
- PyPDF2 to extract text from PDF files
- Pandas to structure the data
- Apache Spark for distributed querying
- Databricks Foundation Models served via ai_query
- Databricks notebooks to run the entire flow
- A Databricks model serving endpoint (databricks-llama-4-maverick)
The next step is to package this into a full Databricks Lakehouse App so other users can run it without touching code.
Step 1: Extract the PDF content
We start by loading the PDF and pulling out the text, page by page.
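A minimal sketch of this step, using PyPDF2's `PdfReader` to pull text page by page and pandas to structure it. The file path and helper names here are illustrative, not from the original notebook:

```python
import pandas as pd


def extract_pages(pdf_path):
    """Return a list of (page_number, text) tuples, one per page."""
    from PyPDF2 import PdfReader  # import here so the sketch loads without PyPDF2

    reader = PdfReader(pdf_path)
    # extract_text() can return None on image-only pages, so fall back to ""
    return [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]


def pages_to_dataframe(pages):
    """Structure the extracted pages as a pandas DataFrame, ready for Spark."""
    return pd.DataFrame(pages, columns=["page", "text"])


# Usage in the notebook (path is an assumption):
# pages_df = pages_to_dataframe(extract_pages("/dbfs/tmp/long_document.pdf"))
# spark_df = spark.createDataFrame(pages_df)
```

Keeping one row per page is what makes the later map step easy: each page can be scored for relevance independently and in parallel.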
Why this works
This setup avoids running the full document through the model all at once, which would break on longer texts due to token limits. Instead, it uses a lightweight filtering step to find the good parts, then passes just those into the model for a proper answer. It’s fast, accurate enough for most real use cases, and fully contained within Databricks. No need for external APIs, no special infrastructure. You can run this today using standard notebooks and a model serving endpoint.
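The map and reduce steps above can be sketched as two prompt builders plus a distributed `ai_query` call. The prompt wording and the `pages` table name are assumptions; the endpoint name comes from the setup described earlier:

```python
ENDPOINT = "databricks-llama-4-maverick"  # the model serving endpoint


def filter_prompt(question, page_text):
    """Map step: ask the model whether a single page is relevant (yes/no)."""
    return (
        f"Question: {question}\n"
        f"Page text: {page_text}\n"
        "Answer only 'yes' if this page helps answer the question, otherwise 'no'."
    )


def answer_prompt(question, relevant_texts):
    """Reduce step: combine the relevant pages and ask the actual question."""
    context = "\n\n".join(relevant_texts)
    return (
        f"Using only this context:\n{context}\n\n"
        f"Answer the question: {question}"
    )


# In Databricks, the map step runs distributed over the pages table, e.g.:
# spark.sql(f"""
#     SELECT page, text,
#            ai_query('{ENDPOINT}',
#                     CONCAT('Is this page relevant to: <question>? ', text)) AS relevant
#     FROM pages
# """)
# The "yes" pages are then collected and passed into one final ai_query call
# built from answer_prompt().
```

Because the filter step only asks for yes/no per page, each model call stays small and well under the token limit, and Spark parallelizes the calls across pages for free.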
Next steps
The notebook version works well for internal use, but it’s not very user-friendly yet. Here’s what’s coming next:
- Turn this into a Databricks Lakehouse App: add a file uploader and question input so anyone can use it through a UI.
- Support more file formats: add DOCX and plain text support alongside PDF.
- Use semantic search to improve filtering: replace the binary yes/no filter with vector similarity scoring for better relevance.
- Track user feedback for continuous improvement: let users rate the answers to improve filtering and prompt tuning over time.
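The semantic-search idea in that list can be sketched without any model calls: embed the question and each page, then keep the pages with the highest cosine similarity instead of asking for a yes/no verdict. The vectors here are placeholders; in practice they would come from an embedding endpoint:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_pages(question_vec, page_vecs, k=3):
    """Keep the k pages whose embeddings are most similar to the question."""
    scored = sorted(
        page_vecs.items(),
        key=lambda kv: cosine_similarity(question_vec, kv[1]),
        reverse=True,
    )
    return [page for page, _ in scored[:k]]
```

Unlike the yes/no filter, this gives a ranked score per page, so the cutoff can be tuned (top-k or a similarity threshold) rather than trusting a binary model judgment.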
Final thoughts
If you work in Databricks, you already have everything you need to build an end-to-end document QA system. You can parse PDFs, query them with Spark, and call Databricks Foundation Models for smart, scalable answers. This approach is ideal for technical docs, research papers, audit logs, or anywhere you need to search and summarize long documents without reading them line by line. The current notebook setup is simple but powerful. With a little UI, this becomes a lightweight app anyone on your team can use. That’s where we’re headed next.