CI/CD for Evals: Running Prompt & Agent Regression Tests in GitHub Actions
Turn evals into a merge-blocking gate: seed datasets, deterministic checks, cost/latency budgets, and flaky-test triage. Includes ready-to-copy GitHub Actions YAML and a minimal CLI setup for local development and CI, using popular CLI-friendly tools.

What is CI/CD for AI Evals?

CI/CD for AI evals is the practice of automatically testing your AI prompts, models, and agents within your continuous integration and continuous delivery pipeline. It extends the familiar “code, test, deploy” loop to AI development, ensuring that any change to a prompt or agent doesn’t just work, but works correctly, consistently, and within budget before it ever reaches users.

Unlike traditional software, where a function given the same input reliably produces the same output, LLM-based systems can be non-deterministic. An “eval” (evaluation) is a specialized test that assesses the quality, safety, and performance of an AI’s output, creating a critical safety net against regressions.

How Does Automated AI Evaluation Work?

Automated AI evaluation integrates directly into your version control system, like GitHub, and runs a series of checks whenever a developer proposes a change. By adding this workflow to your pull requests, you create a merge-blocking gate that prevents quality degradation.

The process typically follows these steps:

  1. Trigger: A developer opens a pull request with a modified prompt or agent configuration.
  2. CI Job Starts: A platform like GitHub Actions automatically initiates a new workflow.
  3. Run Evals: The workflow executes a script that runs the new AI configuration against a predefined “golden dataset” of inputs.
  4. Assert on Outputs: The script compares the AI’s outputs against a set of assertions. These can include:
    • Deterministic checks: Does the output contain a required keyword? Is it valid JSON? Does it follow a specific format?
    • Semantic checks: Does the output’s meaning align with the expected answer? This often involves using another LLM to “grade” the result.
    • Safety checks: Does the output contain harmful content or leak private information?
  5. Check Performance Budgets: The workflow analyzes the cost (token usage) and latency of the AI’s responses, failing the check if they exceed preset budgets (see the configuration sketch after this list).
  6. Report Status: The results are reported back to the pull request as a “pass” or “fail” status. A failing check blocks the merge, prompting the developer to revise their changes.
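
As a concrete sketch of steps 4 and 5, here is what a minimal eval configuration could look like using promptfoo, the CLI tool used in the examples later in this article. The prompt, model, assertion values, and budget thresholds are illustrative assumptions to adapt to your own project.

# promptfoo.config.yaml (illustrative sketch)

prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o-mini

defaultTest:
  assert:
    # Performance budgets applied to every test case (step 5)
    - type: cost
      threshold: 0.002 # maximum estimated USD per response
    - type: latency
      threshold: 3000 # maximum milliseconds per response

tests:
  - vars:
      ticket: "My invoice was charged twice this month."
    assert:
      # Deterministic check (step 4)
      - type: icontains
        value: "invoice"
      # Semantic, model-graded check (step 4)
      - type: llm-rubric
        value: "Accurately summarizes a duplicate-charge billing issue."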

This automated loop ensures every change is rigorously vetted, allowing your team to iterate quickly and confidently.

Why Integrate Evals into Your CI/CD Pipeline?

Integrating evals into your CI/CD pipeline shifts quality control from a manual, post-deployment headache to an automated, proactive process. It’s about building a system that self-regulates quality, performance, and cost.

The key benefits of this approach are:

  • Preventing regressions: The primary goal is to catch issues early. An eval suite ensures that a prompt optimized for one use case doesn’t inadvertently break five others.
  • Controlling costs: LLM APIs are priced by the token. A small change to a prompt can have a huge impact on cost. Automated checks can flag a change that, for example, doubles the average token usage, preventing budget overruns.
  • Enforcing consistency: Ensure the AI’s tone, style, and output structure remain consistent and on-brand across all interactions.
  • Improving developer velocity: When developers trust the test suite to catch regressions, they can experiment and innovate more freely without fear of breaking the build.
  • Objective quality measurement: Evals provide a concrete, objective measure of quality that can be tracked over time, replacing subjective manual checks with data-driven insights.

Best Practices for Implementing CI/CD for Evals

Getting started with CI/CD for evals doesn’t have to be complicated. You can build a robust system using popular, CLI-friendly tools and a simple workflow configuration.

Start with a Golden Dataset

A golden dataset is a curated collection of inputs and their ideal outputs or evaluation criteria. This is the foundation of your regression testing. Start small with 10-20 high-priority examples that cover your most critical use cases and common edge cases. Store this dataset as a CSV or JSON file in your repository.
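
For instance, a golden dataset stored in a file such as evals/golden-dataset.json (the file name and field names here are illustrative) might look like the sketch below; most eval tools let you map your own fields to test inputs and grading criteria.

[
    {
        "input": "My invoice was charged twice this month.",
        "expected": "Acknowledges the duplicate charge and explains the refund process.",
        "tags": ["billing", "high-priority"]
    },
    {
        "input": "How do I reset my password?",
        "expected": "Points the user to the password reset flow without asking for their current password.",
        "tags": ["account", "edge-case"]
    }
]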

Create a Simple CLI Command

To ensure consistency between local development and CI, create a single command to run your evals. Many open-source tools (like promptfoo, llm-test, or lunary) can be configured to run from the command line.

For example, your package.json might contain a script like this:

{
    "scripts": {
        "test:evals": "promptfoo eval -c ./promptfoo.config.yaml"
    }
}

This allows a developer to run npm run test:evals on their machine to validate changes before pushing.
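
Locally, the command typically just needs your model provider’s API key available in the environment, for example:

# Run the eval suite locally before pushing
OPENAI_API_KEY=sk-... npm run test:evals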

Set Up a GitHub Actions Workflow

Now, use that same command in a GitHub Actions workflow to automate the process. Create a file named .github/workflows/evals.yml with the following configuration. This workflow triggers on every pull request, checks out the code, and runs the eval script.

# .github/workflows/evals.yml

name: "Run AI Evals"

on:
    pull_request:
        paths:
            - "prompts/**" # Reruns if a prompt file changes
            - "promptfoo.config.yaml" # Reruns if the eval config changes

jobs:
    evaluate:
        runs-on: ubuntu-latest
        steps:
            - name: "Checkout code"
              uses: actions/checkout@v4

            - name: "Set up Node.js"
              uses: actions/setup-node@v4
              with:
                  node-version: "20"

            - name: "Install dependencies"
              run: npm ci # clean, reproducible install from the lockfile

            - name: "Run prompt evaluations"
              id: prompt_eval
              env:
                  # Securely access your LLM API key
                  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
              run: npm run test:evals

            # You can add more steps here to, for example,
            # post a comment to the PR with the results summary.

This simple setup reports a pass or fail status on every pull request. To make it a true merge-blocking gate, add the evaluate job as a required status check in your repository’s branch protection (or ruleset) settings; once required, a failing eval run prevents the pull request from being merged.

Triage Flaky Tests

LLMs can be non-deterministic, which can lead to “flaky” tests that sometimes pass and sometimes fail without any code changes. To manage this, set the temperature parameter of your LLM to 0 for tests that require deterministic, factual outputs (this reduces variability, though most hosted models are still not perfectly deterministic). For tests that assess semantic meaning or style, use model-graded assertions, which are more flexible than exact-match comparisons.
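
As a sketch of both techniques in a promptfoo-style configuration fragment (the provider, questions, and rubric wording are illustrative):

# Pin temperature for deterministic-leaning outputs
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

tests:
  # Factual output: strict, deterministic assertion
  - vars:
      question: "What currency does the EU primarily use?"
    assert:
      - type: icontains
        value: "euro"
  # Stylistic output: flexible, model-graded assertion
  - vars:
      question: "Write a short, friendly onboarding welcome message."
    assert:
      - type: llm-rubric
        value: "Warm and concise, and does not promise unreleased features."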

How Kinde Helps

Building a reliable, production-grade AI application requires excellence on two fronts: the quality of the AI itself and the security of the user-facing application. While you focus on implementing robust CI/CD pipelines to ensure AI quality and prevent regressions, Kinde provides the critical infrastructure for secure authentication, user management, and authorization.

By handling the complexities of user sign-up, sign-in, and permissions, Kinde lets your team focus on the core AI functionality. For instance, you could use Kinde’s feature flags to roll out a newly tested AI agent to a specific subset of users, confident that your automated evals have already vetted its quality. This combination allows you to build sophisticated, secure AI products faster and with greater confidence.

For more on how to manage application features and user access, see the Kinde docs.
