Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate
We're excited to introduce Automatic Evaluation, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged—no code required.
Challenges
Building an AI agent, LLM application, or RAG system is one thing. Knowing if it's working well is another.
During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.
Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong—ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?
These challenges share a common solution: continuous, automated quality checks that run without manual intervention.
What is Automatic Evaluation?
Automatic evaluation runs LLM judges on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.
An LLM judge is a language model that evaluates the outputs of an agent or LLM application against specific criteria—safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of built-in judges for common evaluation criteria, plus integrations with DeepEval, RAGAS, and Phoenix Evals. The make_judge() API also enables you to create custom judges from natural language criteria.
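Conceptually, an LLM judge wraps a model call in an evaluation prompt and parses a structured verdict. The sketch below is illustrative only: the stubbed `call_model` stands in for a real LLM endpoint, and MLflow's built-in judges handle the prompting and parsing for you.

```python
import json

def call_model(prompt: str) -> str:
    # Stub standing in for a real LLM endpoint; a production judge
    # would send the prompt to a hosted model here.
    return json.dumps({"value": "yes", "rationale": "No unsafe content found."})

def safety_judge(agent_output: str) -> dict:
    """Ask the judge model whether an output is safe, expecting a JSON verdict."""
    prompt = (
        "You are evaluating an AI agent's response for safety.\n"
        f"Response: {agent_output}\n"
        'Reply as JSON: {"value": "yes" or "no", "rationale": "..."}'
    )
    return json.loads(call_model(prompt))

result = safety_judge("The capital of France is Paris.")
print(result["value"])  # → yes
```

The same pattern generalizes to groundedness, tool-usage, or custom criteria: only the prompt and the expected verdict schema change.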
Get Started
Before setting up automatic evaluation, you'll need MLflow 3.9 or later:
pip install 'mlflow>=3.9'
Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via OpenTelemetry. MLflow also provides more than 40 autologging integrations for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:
import mlflow
mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.
Setting Up Automatic Evaluation
The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the Judges tab, and click + New LLM judge. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.
You can also configure judges programmatically:
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

judge = Safety(model="gateway:/my-llm-endpoint")
registered = judge.register(name="safety_check")
registered.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
        filter_string="metadata.environment = 'production'",  # Optional: target specific traces
    )
)
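Sampling can be pictured as a per-trace coin flip: each new trace that matches the filter is evaluated with probability `sample_rate`. The helper below is a hypothetical illustration of that logic, not an MLflow API.

```python
import random

def should_evaluate(trace_metadata: dict, sample_rate: float, environment: str) -> bool:
    """Hypothetical sketch of server-side sampling: filter first, then coin flip."""
    if trace_metadata.get("environment") != environment:
        return False  # the filter_string excluded this trace
    return random.random() < sample_rate

random.seed(0)
traces = [{"environment": "production"} for _ in range(1000)]
evaluated = sum(should_evaluate(t, 0.5, "production") for t in traces)
# With sample_rate=0.5, roughly half of the matching traces get scored.
```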
For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
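The inactivity rule amounts to a timer per session: a session becomes eligible for evaluation once its newest trace is older than the window. A minimal sketch (the 5-minute window matches the behavior described above; the helper itself is hypothetical):

```python
from datetime import datetime, timedelta

INACTIVITY_WINDOW = timedelta(minutes=5)

def ready_sessions(traces: list[dict], now: datetime) -> set[str]:
    """Return session IDs whose most recent trace is older than the window."""
    last_seen: dict[str, datetime] = {}
    for t in traces:
        sid = t["session_id"]
        last_seen[sid] = max(last_seen.get(sid, t["timestamp"]), t["timestamp"])
    return {sid for sid, ts in last_seen.items() if now - ts >= INACTIVITY_WINDOW}

now = datetime(2025, 1, 1, 12, 0)
traces = [
    {"session_id": "a", "timestamp": now - timedelta(minutes=10)},  # idle → ready
    {"session_id": "b", "timestamp": now - timedelta(minutes=1)},   # still active
]
print(ready_sessions(traces, now))  # → {'a'}
```

Waiting out the window ensures the judge scores the whole conversation rather than a partial exchange.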
For complete setup instructions, see the Automatic Evaluation documentation.
Viewing Results
Quality scores from your judges appear directly in the MLflow UI. The Overview tab shows trends over time—you can see at a glance if quality is improving or degrading. The Traces tab shows individual scores alongside each trace, so you can drill into specific failures. Scores typically appear within a minute or two of trace logging.
Learn More
- Automatic Evaluation documentation
- Built-in LLM judges
- Creating custom judges
- Third-party integrations (DeepEval, RAGAS, Phoenix)
Feedback
If this is useful, give us a star on GitHub: github.com/mlflow/mlflow
Have questions or feedback? Open an issue on GitHub or join the conversation in the MLflow community.