by Dr. Amir Sahraei

Governing GenAI in Production with MLflow: Build a Tracked AI Agent on Databricks

A hands-on walkthrough for building a tool-using AI agent on Databricks, with MLflow experiment tracking, LLM tracing, and automated quality evaluation built in from the start.

Deploying a GenAI application is one thing. Trusting it in production is another. When your LLM starts answering real questions from real users, a new set of challenges emerges: Is the model giving correct answers? How do you detect when quality degrades? How do you reproduce a specific response to debug a complaint?

These are not theoretical concerns. They are everyday realities of running AI in production. And this is exactly what MLflow was built to address.

In this post, we are going to build something concrete: a simple tool-using AI agent on Databricks. As we build it, we wire in MLflow tracking and tracing at every step. By the end, you will have a working agent with full observability built in from the start.

What We Are Building

Our agent is a simple IT knowledge assistant. It takes a user question, decides which tool to call, retrieves relevant information, and returns a grounded answer. The agent has two tools:

  • search_knowledge_base: searches a Mosaic AI Vector Search index of internal IT documentation
  • get_ticket_status: looks up an IT support ticket by ID

This is a realistic pattern that many enterprise teams build. It is small enough to understand fully, but real enough to demonstrate every governance concept we care about.

The full stack: Databricks notebook, LangChain for the agent framework, Databricks Foundation Model APIs for the LLM, and MLflow for tracking and tracing.

Step 1: Install Dependencies and Configure MLflow

Start by installing the required packages in your Databricks notebook. Databricks Runtime 14.3 LTS ML and above ships with MLflow preinstalled, and the --upgrade flag below pulls in a release recent enough for native LLM tracing. We also install langgraph because the latest version of the agent framework requires it.

%pip install langchain langchain-classic langchain-community databricks-langchain "mlflow>=2.13" langgraph --upgrade
%restart_python

Once the kernel restarts, bring in all imports at the top of your next cell so everything is available for the steps that follow.

import os
import mlflow
import pandas as pd

from databricks.vector_search.client import VectorSearchClient
from databricks_langchain import ChatDatabricks
from langchain_core.tools import tool
from langchain_classic.agents import AgentExecutor, create_react_agent
from langchain_classic import hub

Next, configure MLflow to point to the Databricks tracking server and enable automatic tracing for LangChain.

# Point MLflow to the Databricks tracking server
mlflow.set_tracking_uri("databricks")

# Replace with your own Databricks workspace experiment path
mlflow.set_experiment("/Users/your.email@yourcompany.com/it-agent-governance")

# Enable automatic tracing for all LangChain calls
mlflow.langchain.autolog()

Note: mlflow.langchain.autolog() captures a full trace for every LangChain call automatically. You do not need to add any manual instrumentation to your agent code to get complete traces in the MLflow UI.

Step 2: Create the Knowledge Base and Vector Search Index

Before defining the agent tools, we need a knowledge base to search. We create a small Delta table with sample IT documentation, enable Change Data Feed on it (required by Mosaic AI Vector Search), and then create a Delta Sync index that automatically embeds the content using a Databricks-hosted embedding model.

Replace the catalog and schema names with your own. A common convention is to use your personal catalog during development and a shared team catalog for production.

# === One-time setup: create source table and Vector Search index ===

# Step 1: Create a Delta table with sample IT knowledge base documents
# Replace 'your_catalog.your_schema' with your own catalog and schema
TARGET_TABLE = "your_catalog.your_schema.it_docs_knowledge_base"
INDEX_NAME = "your_catalog.your_schema.it_docs_knowledge_base_index"
ENDPOINT = "it-docs-endpoint"

sample_docs = [
    (1, "To reset your VPN credentials, open the self-service portal at it.company.com, click 'VPN Access', and select 'Reset Credentials'. You will receive a new configuration file via email within 5 minutes.", "vpn_guide.md"),
    (2, "If your laptop screen is flickering, first try updating your display drivers via the Software Center. If the issue persists, submit a ticket to Desktop Support with your device serial number.", "hardware_troubleshooting.md"),
    (3, "To request a new software license, go to the IT Service Catalog, select 'Software Request', fill in the business justification, and submit for manager approval. Standard licenses are provisioned within 24 hours.", "software_requests.md"),
    (4, "For Wi-Fi connectivity issues in the office, ensure you are connecting to 'Corp-Secure' (not 'Guest'). If problems persist, forget the network and reconnect using your AD credentials.", "wifi_guide.md"),
    (5, "Multi-factor authentication (MFA) can be reset by contacting the Security team via Slack #security-help or by calling the IT helpdesk at ext. 4400. Have your employee ID ready.", "mfa_guide.md"),
]

df = spark.createDataFrame(sample_docs, ["id", "content", "source"])
df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(TARGET_TABLE)

# Enable Change Data Feed (required for Delta Sync Vector Search indexes)
spark.sql(f"""
    ALTER TABLE {TARGET_TABLE}
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
print(f"✓ Source table '{TARGET_TABLE}' created with CDF enabled.")

# Step 2: Create the Vector Search endpoint (skip if it already exists)
client = VectorSearchClient()
try:
    client.get_endpoint(ENDPOINT)
    print(f"✓ Endpoint '{ENDPOINT}' already exists.")
except Exception:
    client.create_endpoint(name=ENDPOINT, endpoint_type="STANDARD")
    print(f"✓ Endpoint '{ENDPOINT}' created (may take a few minutes to provision).")

# Step 3: Create a Delta Sync index with Databricks-managed embeddings
try:
    client.get_index(endpoint_name=ENDPOINT, index_name=INDEX_NAME)
    print(f"✓ Index '{INDEX_NAME}' already exists.")
except Exception:
    client.create_delta_sync_index(
        endpoint_name=ENDPOINT,
        source_table_name=TARGET_TABLE,
        index_name=INDEX_NAME,
        pipeline_type="TRIGGERED",
        primary_key="id",
        embedding_source_column="content",
        embedding_model_endpoint_name="databricks-gte-large-en",
        columns_to_sync=["content", "source"]
    )
    print(f"✓ Index '{INDEX_NAME}' created.")
    print("  Note: the index needs a few minutes to sync before it can be queried.")

Tip: You only need to run this cell once. If you do re-run it, the try/except blocks detect the existing endpoint and index and skip creation, so no duplicates are created. Note that the source table is written with mode("overwrite"), so it is rebuilt on every run.
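The get-or-create pattern behind that tip can be sketched in plain Python. This is a minimal illustration, not the Vector Search client itself: FakeClient is a hypothetical in-memory stand-in, and LookupError is an assumed "not found" signal. The notebook above catches the broad Exception class instead, which also hides unrelated failures such as permission or network errors, so narrowing it to whatever error your client raises for a missing resource is worth considering.

```python
class FakeClient:
    """Hypothetical in-memory stand-in for a resource client."""

    def __init__(self):
        self.endpoints = {}
        self.create_calls = 0

    def get_endpoint(self, name):
        if name not in self.endpoints:
            raise LookupError(f"endpoint {name!r} not found")
        return self.endpoints[name]

    def create_endpoint(self, name):
        self.create_calls += 1
        self.endpoints[name] = {"name": name}
        return self.endpoints[name]


def get_or_create_endpoint(client, name):
    """Return (endpoint, created) without creating duplicates."""
    try:
        return client.get_endpoint(name), False
    except LookupError:
        # Only the "not found" case triggers creation; other errors propagate.
        return client.create_endpoint(name), True


client = FakeClient()
first = get_or_create_endpoint(client, "it-docs-endpoint")
second = get_or_create_endpoint(client, "it-docs-endpoint")
print(first[1], second[1], client.create_calls)
```

The second call finds the existing endpoint, so creation runs exactly once no matter how often the cell is executed.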

Step 3: Define the Agent Tools

We define two tools as Python functions decorated with LangChain's @tool decorator. Each tool has a clear docstring because the LLM uses the description at runtime to decide which tool to call for a given question.

@tool
def search_knowledge_base(query: str) -> str:
    """Search the internal IT knowledge base for documentation,
    procedures, and troubleshooting guides. Use this for general
    IT questions that do not reference a specific ticket ID."""
    client = VectorSearchClient()
    index = client.get_index(
        endpoint_name=ENDPOINT,
        index_name=INDEX_NAME
    )
    results = index.similarity_search(
        query_text=query,
        columns=["content", "source"],
        num_results=3
    )
    docs = results.get("result", {}).get("data_array", [])
    # Each row follows the requested column order: [content, source, ...]
    return "\n\n".join([f"Source: {d[1]}\n{d[0]}" for d in docs])


@tool
def get_ticket_status(ticket_id: str) -> str:
    """Retrieve the current status and details of an IT support
    ticket by its ID (format: INC-XXXXX). Use this when the user
    asks about a specific ticket."""
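    # In production, replace with a real API call to your ITSM system.
    # The field values below are illustrative placeholders.
    mock_tickets = {
        "INC-10042": {"title": "Sample ticket A", "status": "In Progress",
                      "priority": "High", "assigned_to": "Support agent A"},
        "INC-10055": {"title": "Sample ticket B", "status": "Resolved",
                      "priority": "Low", "assigned_to": "Support agent B"},
    }
    ticket = mock_tickets.get(ticket_id.upper())
    if not ticket:
        return f"Ticket {ticket_id} not found."
    return (f"Ticket {ticket_id}: {ticket['title']}\n"
            f"Status: {ticket['status']} | Priority: {ticket['priority']}\n"
            f"Assigned to: {ticket['assigned_to']}")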
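To see why the docstrings matter, it helps to look at what a model could be shown when choosing a tool. The sketch below is a plain-Python approximation, not LangChain's actual formatting: describe_tools is a hypothetical helper that turns each function's name, signature, and docstring into one line of a tool catalog. The two functions are undecorated stubs with shortened docstrings so the example runs without any dependencies.

```python
import inspect


def search_knowledge_base(query: str) -> str:
    """Search the internal IT knowledge base for documentation,
    procedures, and troubleshooting guides."""
    return ""


def get_ticket_status(ticket_id: str) -> str:
    """Retrieve the current status and details of an IT support
    ticket by its ID (format: INC-XXXXX)."""
    return ""


def describe_tools(functions):
    """Build one 'name(signature): description' line per tool function."""
    lines = []
    for fn in functions:
        # getdoc strips indentation; split/join collapses line breaks.
        doc = " ".join(inspect.getdoc(fn).split())
        lines.append(f"{fn.__name__}{inspect.signature(fn)}: {doc}")
    return "\n".join(lines)


tool_catalog = describe_tools([search_knowledge_base, get_ticket_status])
print(tool_catalog)
```

If a docstring is vague or missing, the catalog line carries almost no signal, and the model has little to go on when picking between tools.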

Step 4: Build the Agent

We assemble the agent using LangChain's ReAct pattern with the ChatDatabricks class, which routes LLM calls through Databricks Foundation Model APIs. Using a Databricks-hosted model means all inference stays inside your workspace and is automatically logged in Unity Catalog audit logs.

llm = ChatDatabricks(
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    temperature=0.1,
    max_tokens=1024
)

# Load a standard ReAct prompt template from LangChain Hub
prompt = hub.pull("hwchase17/react")
# Combine LLM, tools, and prompt into an agent
tools = [search_knowledge_base, get_ticket_status]
agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)
# Wrap with AgentExecutor for iteration and error handling
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5,
    handle_parsing_errors=True
)
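The max_iterations setting is easier to reason about with the loop written out. The following is a simplified stand-in for a ReAct-style executor, not LangChain's implementation: run_react_loop, the decide callback, and the tuple protocol it returns are all assumptions made for illustration. The decide function plays the role of the LLM.

```python
def run_react_loop(decide, tools, question, max_iterations=5):
    """Alternate between a decision step and tool calls until an answer appears.

    `decide` is a hypothetical stand-in for the LLM. It receives the question
    and the observations so far, and returns either ("final", answer) or
    ("call", tool_name, tool_input).
    """
    observations = []
    for _ in range(max_iterations):
        step = decide(question, observations)
        if step[0] == "final":
            return step[1]
        _, tool_name, tool_input = step
        tool = tools.get(tool_name)
        if tool is None:
            # Feed the mistake back so the next decision can correct it.
            observations.append(f"Unknown tool: {tool_name}")
            continue
        observations.append(tool(tool_input))
    return "Agent stopped: iteration limit reached."


def scripted_decide(question, observations):
    """Toy policy: look up the ticket once, then answer with the observation."""
    if not observations:
        return ("call", "get_ticket_status", "INC-10042")
    return ("final", observations[-1])


tools = {"get_ticket_status": lambda ticket_id: f"{ticket_id}: status unknown"}
answer = run_react_loop(scripted_decide, tools, "What is the status of INC-10042?")
stuck = run_react_loop(lambda q, obs: ("call", "missing", ""), tools, "loop?", max_iterations=3)
print(answer)
print(stuck)
```

The second call never produces a final answer, so the cap is what ends it. Without a limit, a model that keeps requesting tools would run, and bill, indefinitely.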

Step 5: Run the Agent Inside an MLflow Run

We wrap every agent invocation inside mlflow.start_run(). This records parameters, input and output artifacts, and metrics for every run. Because we enabled autolog earlier, the full LangChain trace is captured automatically inside the run without any extra instrumentation.

def run_agent_with_tracking(question: str, run_name: str = None):
    """Run the agent and log everything to MLflow."""

    with mlflow.start_run(run_name=run_name or "agent-run") as run:

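        # Record the configuration used in Step 4
        mlflow.log_params({
            "llm_endpoint": "databricks-meta-llama-3-3-70b-instruct",
            "temperature": 0.1,
            "max_tokens": 1024,
            "max_iterations": 5,
        })

        mlflow.log_text(question, "input_question.txt")

        # The ReAct prompt expects the question under the "input" key
        result = agent_executor.invoke({"input": question})
        answer = result["output"]

        mlflow.log_text(answer, "agent_answer.txt")
        mlflow.log_metric("answer_length", len(answer))

        print(f"Run ID: {run.info.run_id}")
        print(f"Answer: {answer}")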
        return answer, run.info.run_id


# Test the agent
answer, run_id = run_agent_with_tracking(
    question="What is the status of ticket INC-10042?",
    run_name="ticket-status-test"
)

What you will see: After running this, open the MLflow Experiments UI in your Databricks workspace. You will find the run with all logged parameters, the input and output text files, and a full LangChain trace showing every LLM call and tool invocation with latencies.

Step 6: Inspect Traces Programmatically

The MLflow UI gives you a visual trace explorer, but you can also retrieve traces programmatically. This is useful for offline analysis, building dashboards, or integrating quality checks into a CI pipeline.

# Retrieve the most recent traces for your experiment
experiment = mlflow.get_experiment_by_name(
    "/Users/your.email@yourcompany.com/it-agent-governance"
)

traces_df = mlflow.search_traces(
    locations=[experiment.experiment_id],
    order_by=["timestamp DESC"],
    max_results=5
)

if not traces_df.empty:
    for idx, row in traces_df.iterrows():
        print(f"Trace: {row['trace_id']}")
        print(f"State: {row['state']}")
        print(f"Duration: {row['execution_duration']} ms")
        print(f"Request: {row['request']}")
        print(f"Response: {row['response']}")
        print()
else:
    print("No traces found for this experiment.")

Tip: Each row in traces_df corresponds to one full agent invocation. You can filter by state ('OK' vs 'ERROR') to quickly find failed runs, or sort by execution_duration to identify slow queries worth optimising.
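As a rough sketch of that filtering and sorting, the example below works on plain dictionaries shaped like rows of traces_df, for instance as produced by traces_df.to_dict("records"). The trace IDs and durations are made up, and it assumes the state column holds the strings 'OK' and 'ERROR'. On the DataFrame itself, the equivalents would be traces_df[traces_df["state"] == "ERROR"] and traces_df.sort_values("execution_duration", ascending=False).

```python
# Hypothetical trace records, shaped like rows of traces_df.
trace_rows = [
    {"trace_id": "tr-1", "state": "OK", "execution_duration": 1200},
    {"trace_id": "tr-2", "state": "ERROR", "execution_duration": 300},
    {"trace_id": "tr-3", "state": "OK", "execution_duration": 4800},
]


def failed_traces(rows):
    """Keep only invocations that ended in an error state."""
    return [r for r in rows if r["state"] == "ERROR"]


def slowest_traces(rows, top_n=2):
    """Return the top_n rows with the longest execution duration."""
    return sorted(rows, key=lambda r: r["execution_duration"], reverse=True)[:top_n]


failed_ids = [r["trace_id"] for r in failed_traces(trace_rows)]
slow_ids = [r["trace_id"] for r in slowest_traces(trace_rows)]
print(failed_ids, slow_ids)
```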

Step 7: Automated Quality Evaluation

Manual review does not scale. Once you have a set of test questions with expected answers, you can run mlflow.evaluate() to automatically score your agent on answer relevance and correctness. We use a Databricks-hosted LLM as the judge model.

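# Golden dataset: questions paired with expected answers (taken from the Step 2 sample documents)
eval_data = pd.DataFrame({
    "inputs": ["How do I reset my VPN credentials?",
               "How do I reset my MFA?"],
    "ground_truth": [
        "Open the self-service portal at it.company.com, click 'VPN Access', and select 'Reset Credentials'.",
        "Contact the Security team via Slack #security-help or call the IT helpdesk at ext. 4400.",
    ],
})

# Wrap the agent as a callable for mlflow.evaluate()
def agent_fn(inputs: pd.DataFrame) -> list:
    return [
        agent_executor.invoke({"input": q})["output"]
        for q in inputs["inputs"]
    ]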

JUDGE_MODEL = "endpoints:/databricks-meta-llama-3-3-70b-instruct"

with mlflow.start_run(run_name="agent-evaluation"):
    results = mlflow.evaluate(
        model=agent_fn,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[
            mlflow.metrics.genai.answer_relevance(model=JUDGE_MODEL),
            mlflow.metrics.genai.answer_correctness(model=JUDGE_MODEL)
        ]
    )
    print(results.metrics)

Tip: Run this evaluation every time you change your prompt, swap the LLM, or update the knowledge base. Track the metric trends in MLflow over time to catch quality regressions before they reach production users.
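One way to turn those metric trends into an automatic check is a small regression gate that compares the latest scores against a stored baseline. The sketch below is an assumption-laden illustration: find_regressions is a hypothetical helper, the metric key names are assumed to match what results.metrics reports in your MLflow version, and the scores are invented for the example.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return metrics whose current value fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None:
            continue  # metric not produced in this run; nothing to compare
        if new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


# Invented scores; in practice `current` would come from results.metrics.
baseline = {"answer_relevance/v1/mean": 4.5, "answer_correctness/v1/mean": 4.0}
current = {"answer_relevance/v1/mean": 4.4, "answer_correctness/v1/mean": 3.2}

regressions = find_regressions(baseline, current, tolerance=0.25)
print(regressions)
```

In a CI job, a non-empty result could fail the build. The tolerance keeps ordinary judge-model noise from blocking every change.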

What You Now Have

Starting from a blank notebook, we now have a complete, governed GenAI agent on Databricks:

  • A working tool-using AI agent that answers IT questions and looks up ticket statuses using real Databricks services.
  • Full MLflow experiment tracking logging parameters, inputs, and outputs for every run.
  • Automatic LangChain tracing capturing every LLM call and tool invocation with latencies.
  • Programmatic trace retrieval enabling offline analysis and integration with CI pipelines.
  • Automated quality evaluation scoring answer relevance and correctness against a golden dataset.

Closing Thoughts

The key insight is that governance is not something you add after the fact. By wiring MLflow into your agent from the very first line of code, you get observability for free as you develop. By the time you are ready to deploy, you already have a quality baseline, a trace history, and a clear record of every experiment you ran.

This setup scales directly to production. Swap the mock ticket data for a real ITSM API, expand the knowledge base with your own documents, and the same tracking and evaluation framework continues to work without any changes. Start small, measure everything, and iterate with confidence.
