Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate
We're excited to introduce Automatic Evaluation, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged—no code required.
Challenges
Building an AI agent, LLM application, or RAG system is one thing. Knowing if it's working well is another.
During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.
Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong—ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?
These challenges share a common solution: continuous, automated quality checks that run without manual intervention.
What is Automatic Evaluation?
Automatic evaluation runs LLM judges on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.
An LLM judge is a language model that evaluates the outputs of an agent or LLM application against specific criteria—safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of built-in judges for common evaluation criteria, plus integrations with DeepEval, RAGAS, and Phoenix Evals. The make_judge() API also enables you to create custom judges from natural language criteria.
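Conceptually, an LLM judge wraps a model call in an evaluation prompt and parses a structured verdict. The sketch below is illustrative only: the stubbed `call_model` stands in for a real LLM endpoint, and MLflow's built-in judges handle the prompting and parsing for you.

```python
import json

def call_model(prompt: str) -> str:
    # Stub standing in for a real LLM endpoint; a production judge
    # would send the prompt to a hosted model here.
    return json.dumps({"value": "yes", "rationale": "No unsafe content found."})

def safety_judge(agent_output: str) -> dict:
    """Ask the judge model whether an output is safe, expecting a JSON verdict."""
    prompt = (
        "You are evaluating an AI agent's response for safety.\n"
        f"Response: {agent_output}\n"
        'Reply as JSON: {"value": "yes" or "no", "rationale": "..."}'
    )
    return json.loads(call_model(prompt))

result = safety_judge("The capital of France is Paris.")
print(result["value"])  # → yes
```

The same pattern generalizes to groundedness, tool-usage, or custom criteria: only the prompt and the expected verdict schema change.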
Get Started
Before setting up automatic evaluation, you'll need MLflow 3.9 or later:
pip install 'mlflow>=3.9'
Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via OpenTelemetry. MLflow also provides more than 40 autologging integrations for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:
import mlflow
mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.
Setting Up Automatic Evaluation
The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the Judges tab, and click + New LLM judge. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.
You can also configure judges programmatically:
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

judge = Safety(model="gateway:/my-llm-endpoint")
registered = judge.register(name="safety_check")
registered.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
        filter_string="metadata.environment = 'production'",  # Optional: target specific traces
    )
)
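Sampling can be pictured as a per-trace coin flip: each new trace that matches the filter is evaluated with probability `sample_rate`. The helper below is a hypothetical illustration of that logic, not an MLflow API.

```python
import random

def should_evaluate(trace_metadata: dict, sample_rate: float, environment: str) -> bool:
    """Hypothetical sketch of server-side sampling: filter first, then coin flip."""
    if trace_metadata.get("environment") != environment:
        return False  # the filter_string excluded this trace
    return random.random() < sample_rate

random.seed(0)
traces = [{"environment": "production"} for _ in range(1000)]
evaluated = sum(should_evaluate(t, 0.5, "production") for t in traces)
# With sample_rate=0.5, roughly half of the matching traces get scored.
```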
For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
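The inactivity rule amounts to a timer per session: a session becomes eligible for evaluation once its newest trace is older than the window. A minimal sketch (the 5-minute window matches the behavior described above; the helper itself is hypothetical):

```python
from datetime import datetime, timedelta

INACTIVITY_WINDOW = timedelta(minutes=5)

def ready_sessions(traces: list[dict], now: datetime) -> set[str]:
    """Return session IDs whose most recent trace is older than the window."""
    last_seen: dict[str, datetime] = {}
    for t in traces:
        sid = t["session_id"]
        last_seen[sid] = max(last_seen.get(sid, t["timestamp"]), t["timestamp"])
    return {sid for sid, ts in last_seen.items() if now - ts >= INACTIVITY_WINDOW}

now = datetime(2025, 1, 1, 12, 0)
traces = [
    {"session_id": "a", "timestamp": now - timedelta(minutes=10)},  # idle → ready
    {"session_id": "b", "timestamp": now - timedelta(minutes=1)},   # still active
]
print(ready_sessions(traces, now))  # → {'a'}
```

Waiting out the window ensures the judge scores the whole conversation rather than a partial exchange.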
For complete setup instructions, see the Automatic Evaluation documentation.
Viewing Results
Quality scores from your judges appear directly in the MLflow UI. The Overview tab shows trends over time—you can see at a glance if quality is improving or degrading. The Traces tab shows individual scores alongside each trace, so you can drill into specific failures. Scores typically appear within a minute or two of trace logging.
Learn More
- Automatic Evaluation documentation
- Built-in LLM judges
- Creating custom judges
- Third-party integrations (DeepEval, RAGAS, Phoenix)
Feedback
If this is useful, give us a star on GitHub: github.com/mlflow/mlflow
Have questions or feedback? Open an issue on GitHub or join the conversation in the MLflow community.