Evaluating Spring AI Applications with Dokimos

If you’re building AI features with Spring AI, you’ve probably run into this question: how do you know your LLM responses are actually good? Unit testing a deterministic function is straightforward. Testing something that produces different output every time, and where “correct” is a spectrum rather than a binary — that’s a different problem entirely.

This is where LLM evaluation (evals) comes in, and Dokimos is a new open-source framework that brings structured evaluation to Java and Spring AI applications. Think of it as a testing framework specifically designed for the non-deterministic world of LLM outputs.

In this tutorial, we’ll build a RAG-based customer service assistant with Spring AI and then systematically evaluate it using Dokimos — checking for hallucinations, faithfulness to context, and overall answer quality.

Why you need evals

Traditional tests check that code produces an expected output. LLM applications break this model in several ways:

  • Non-deterministic outputs: the same prompt can yield different responses each time
  • Quality is multidimensional: an answer can be factually correct but unhelpful, or well-written but hallucinated
  • Failures are subtle: a model won’t throw an exception when it fabricates information — it will confidently present fiction as fact

Evals give you a systematic way to measure these quality dimensions across a dataset of test cases, and they integrate into your CI/CD pipeline so regressions don’t slip into production unnoticed.

Setting up the project

Start with a Spring Boot project and add the required dependencies. Dokimos is modular — dokimos-core provides the evaluation engine, dokimos-spring-ai adds Spring AI integration, and dokimos-junit enables dataset-driven parameterized tests.

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <!-- Dokimos evaluation framework -->
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-core</artifactId>
        <version>0.14.1</version>
    </dependency>
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-spring-ai</artifactId>
        <version>0.14.1</version>
    </dependency>
    <dependency>
        <groupId>dev.dokimos</groupId>
        <artifactId>dokimos-junit</artifactId>
        <version>0.14.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Configure your model in application.properties:

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o-mini
spring.ai.openai.chat.options.temperature=1.0
spring.ai.openai.embedding.options.model=text-embedding-3-small

Building the knowledge assistant

Our example application is a simple RAG pipeline: a customer service assistant that retrieves relevant documents from a vector store and uses them as context when answering questions. This is the kind of application where evals are critical — the model should only answer based on the retrieved context, not make things up.

First, set up the vector store with some sample documents:

@Configuration
public class VectorStoreConfig {

    @Bean
    public VectorStore vectorStore(EmbeddingModel embeddingModel) {
        SimpleVectorStore store = SimpleVectorStore.builder(embeddingModel).build();
        List<Document> documents = List.of(
            new Document(
                "Our return policy allows customers to return any product within "
                + "30 days of purchase for a full refund. Items must be in original "
                + "condition with tags attached. Refunds are processed within 5 "
                + "business days."
            ),
            new Document(
                "Premium members receive free shipping on all orders, 20% discount "
                + "on all products, early access to new releases, and priority "
                + "customer support. Premium membership costs $99 per year."
            ),
            new Document(
                "Standard shipping takes 5-7 business days. Express shipping is "
                + "available for an additional $12.99 and delivers within 2 business "
                + "days. Free shipping is available on orders over $50."
            )
        );
        store.add(documents);
        return store;
    }
}

Next, the assistant service that performs retrieval and generation:

@Service
public class KnowledgeAssistant {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public KnowledgeAssistant(ChatClient.Builder chatClientBuilder,
                              VectorStore vectorStore) {
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
    }

    public AssistantResponse answer(String question) {
        List<Document> retrievedDocs = vectorStore.similaritySearch(
            SearchRequest.builder()
                .query(question)
                .topK(3)
                .build()
        );

        String context = retrievedDocs.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n"));

        String systemPrompt = """
            You are a helpful customer service assistant. Answer the user's
            question based ONLY on the provided context. If the context does
            not contain enough information to answer the question, say so clearly.

            Context:
            %s
            """.formatted(context);

        String response = chatClient.prompt()
            .system(systemPrompt)
            .user(question)
            .call()
            .content();

        return new AssistantResponse(response, retrievedDocs);
    }

    public record AssistantResponse(
        String answer,
        List<Document> retrievedDocuments
    ) {}
}

Nothing unusual here — this is a standard RAG pattern. The interesting part is what comes next: evaluating whether the assistant actually does its job well.

Understanding Dokimos evaluators

Dokimos ships with several built-in evaluators that cover the most common quality dimensions for LLM applications. Each evaluator uses the LLM-as-judge pattern — it sends the output (and context) to a model that scores it against specific criteria.

The four evaluators most relevant here (we'll use the first three directly):

Evaluator                     What it measures                                  Threshold meaning
FaithfulnessEvaluator         Is the answer grounded in the provided context?   Higher = stricter (0.8 = 80% of claims must be supported)
HallucinationEvaluator        Does the answer contain fabricated information?   Lower = stricter (0.2 = max 20% hallucinated content)
LLMJudgeEvaluator             Custom criteria you define                        Higher = stricter (flexible)
ContextualRelevanceEvaluator  Did retrieval return relevant documents?          Higher = stricter
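The threshold direction matters when reading results: a faithfulness check passes when the score is at or above its threshold, while a hallucination check passes when the score is at or below it. As a standalone sketch (plain Java illustrating the idea, not the Dokimos API):

```java
public class ThresholdSketch {

    // Pass/fail depends on which direction is "stricter" for the metric.
    static boolean passes(double score, double threshold, boolean higherIsBetter) {
        return higherIsBetter ? score >= threshold : score <= threshold;
    }

    public static void main(String[] args) {
        // Faithfulness: 85% of claims supported vs. a 0.8 threshold -> pass
        System.out.println(passes(0.85, 0.8, true));   // true
        // Hallucination: 35% fabricated content vs. a 0.2 ceiling -> fail
        System.out.println(passes(0.35, 0.2, false));  // false
    }
}
```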

The key concept is the judge. A judge is an LLM that evaluates the output of another LLM. Dokimos provides SpringAiSupport.asJudge(chatModel) to turn any Spring AI ChatModel into a judge.
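Conceptually, a judge call boils down to sending the model a scoring prompt and parsing a number back. The sketch below is illustrative plain Java showing the shape of such a prompt; it is not Dokimos's actual internal template:

```java
public class JudgePromptSketch {

    // Builds a hypothetical LLM-as-judge scoring prompt. The wording here is
    // an illustration of the pattern, not what Dokimos sends under the hood.
    static String judgePrompt(String context, String answer) {
        return """
            Rate from 0.0 to 1.0 how well the ANSWER is supported by the CONTEXT.
            Respond with only the number.

            CONTEXT:
            %s

            ANSWER:
            %s
            """.formatted(context, answer);
    }

    public static void main(String[] args) {
        System.out.println(judgePrompt(
            "Returns are accepted within 30 days.",
            "You can return items within 30 days of purchase."));
    }
}
```

The judge model's numeric reply is then compared against the evaluator's threshold to decide pass or fail.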

Creating a reusable evaluator factory

Rather than configuring evaluators in every test, bundle them into a factory class:

public final class QAEvaluators {

    public static final String CONTEXT_KEY = "context";

    private QAEvaluators() {}

    public static List<Evaluator> standard(JudgeLM judge) {
        return List.of(
            FaithfulnessEvaluator.builder()
                .threshold(0.8)
                .judge(judge)
                .contextKey(CONTEXT_KEY)
                .includeReason(true)
                .build(),
            HallucinationEvaluator.builder()
                .threshold(0.2)
                .judge(judge)
                .contextKey(CONTEXT_KEY)
                .includeReason(true)
                .build(),
            LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("""
                    Evaluate the answer based on these criteria:
                    1. Does it directly address the user's question?
                    2. Is it clear and easy to understand?
                    3. Does it provide specific, actionable information?
                    4. Is it appropriately concise without missing key details?
                    """)
                .evaluationParams(List.of(
                    EvalTestCaseParam.INPUT,
                    EvalTestCaseParam.ACTUAL_OUTPUT
                ))
                .threshold(0.7)
                .judge(judge)
                .build()
        );
    }
}

The imports for this class come from:

import dev.dokimos.core.Evaluator;
import dev.dokimos.core.EvalTestCaseParam;
import dev.dokimos.core.JudgeLM;
import dev.dokimos.core.evaluators.FaithfulnessEvaluator;
import dev.dokimos.core.evaluators.HallucinationEvaluator;
import dev.dokimos.core.evaluators.LLMJudgeEvaluator;

Notice how FaithfulnessEvaluator and HallucinationEvaluator both require a contextKey — this tells them which field in the test case contains the retrieved context to evaluate against. The LLMJudgeEvaluator is the flexible option: you define your own evaluation criteria in natural language.

Creating a test dataset

Dokimos supports datasets in JSON, CSV, and JSONL formats, or you can build them programmatically. For a JSON dataset, create src/test/resources/datasets/qa-dataset.json:

[
  {
    "input": "What is your return policy?",
    "expectedOutput": "Customers can return products within 30 days for a full refund."
  },
  {
    "input": "What are the benefits of premium membership?",
    "expectedOutput": "Premium members get free shipping, 20% discount, early access, and priority support for $99/year."
  },
  {
    "input": "How long does standard shipping take?",
    "expectedOutput": "Standard shipping takes 5-7 business days."
  }
]
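Since Dokimos also accepts JSONL, the same cases can be written one JSON object per line, which is convenient for large datasets that grow over time. Assuming the field names mirror the JSON format above, the file would look like:

```
{"input": "What is your return policy?", "expectedOutput": "Customers can return products within 30 days for a full refund."}
{"input": "How long does standard shipping take?", "expectedOutput": "Standard shipping takes 5-7 business days."}
```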

You can also build datasets in code, which is useful for dynamic test cases:

Dataset dataset = Dataset.builder()
    .name("QA Dataset")
    .description("Customer service Q&A test cases")
    .addExample(Example.of(
        "What is your return policy?",
        "Customers can return products within 30 days for a full refund."))
    .addExample(Example.of(
        "How much is premium membership?",
        "Premium membership costs $99 per year."))
    .build();

Writing JUnit evaluation tests

Here’s where it all comes together. Dokimos integrates with JUnit 5’s parameterized test infrastructure through the @DatasetSource annotation:

@SpringBootTest
class KnowledgeAssistantEvaluationTest {

    @Autowired
    private KnowledgeAssistant assistant;

    @Autowired
    private ChatModel chatModel;

    private List<Evaluator> evaluators;

    @BeforeEach
    void setup() {
        JudgeLM judge = SpringAiSupport.asJudge(chatModel);
        evaluators = QAEvaluators.standard(judge);
    }

    @ParameterizedTest
    @DatasetSource("classpath:datasets/qa-dataset.json")
    void shouldProvideQualityAnswers(Example example) {
        var response = assistant.answer(example.input());
        List<String> contextTexts = response.retrievedDocuments().stream()
            .map(Document::getText)
            .toList();

        EvalTestCase testCase = EvalTestCase.builder()
            .input(example.input())
            .actualOutput(response.answer())
            .context(QAEvaluators.CONTEXT_KEY, contextTexts)
            .expectedOutput(example.expectedOutput())
            .build();

        Assertions.assertEval(testCase, evaluators);
    }
}

The flow for each test case is:

  1. Load the input question from the dataset
  2. Run it through the assistant to get a response and the retrieved documents
  3. Build an EvalTestCase with the input, actual output, expected output, and context
  4. Use Assertions.assertEval to run all evaluators — the test fails if any evaluator’s score falls below its threshold

Running experiments programmatically

For scenarios where you want to evaluate your entire dataset in one go and analyze aggregate results, Dokimos offers the Experiment API:

Task evaluationTask = example -> {
    var response = assistant.answer(example.input());
    List<String> contextTexts = response.retrievedDocuments().stream()
        .map(Document::getText)
        .toList();
    return Map.of(
        "output", response.answer(),
        "context", contextTexts
    );
};

ExperimentResult result = Experiment.builder()
    .name("Knowledge Assistant v1.0 Evaluation")
    .description("Evaluating the RAG-based knowledge assistant")
    .dataset(dataset)
    .task(evaluationTask)
    .evaluators(evaluators)
    .metadata("model", "gpt-4o-mini")
    .metadata("retrievalTopK", 3)
    .build()
    .run();

System.out.println("Total examples: " + result.totalCount());
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());
System.out.println("Pass rate: " + String.format("%.1f%%", result.passRate() * 100));

The ExperimentResult also supports multiple export formats, so you can generate reports for review:

result.exportJson(Path.of("results/eval-results.json"));
result.exportHtml(Path.of("results/eval-report.html"));
result.exportMarkdown(Path.of("results/eval-results.md"));
result.exportCsv(Path.of("results/eval-results.csv"));

Building custom evaluators

The built-in evaluators cover common scenarios, but you’ll often need domain-specific checks. Dokimos makes this straightforward by extending BaseEvaluator:

public class ResponseLengthEvaluator extends BaseEvaluator {

    private final int minWords;
    private final int maxWords;

    public ResponseLengthEvaluator(int minWords, int maxWords) {
        super("Response Length", 1.0,
              List.of(EvalTestCaseParam.ACTUAL_OUTPUT));
        this.minWords = minWords;
        this.maxWords = maxWords;
    }

    @Override
    protected EvalResult runEvaluation(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        int wordCount = output.split("\\s+").length;
        boolean withinBounds = wordCount >= minWords && wordCount <= maxWords;
        double score = withinBounds ? 1.0 : 0.0;
        String reason = String.format(
            "Response has %d words (expected %d-%d)",
            wordCount, minWords, maxWords);
        return EvalResult.builder()
            .name(name())
            .score(score)
            .threshold(threshold())
            .reason(reason)
            .build();
    }
}

This evaluator checks that responses stay within a word count range — useful for ensuring your assistant doesn’t produce one-word answers or rambling multi-paragraph responses. You can mix custom evaluators with built-in ones freely in both JUnit tests and experiments.
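The heart of that evaluator is ordinary string handling. One subtlety worth noting: a bare split("\\s+") reports one "word" for an empty string, so a more defensive version trims and guards that edge case. A standalone sketch:

```java
public class WordCountSketch {

    // Counts whitespace-separated words, guarding the empty-string edge case
    // that a bare split("\\s+") mishandles (it would report 1 word).
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("Refunds are processed within 5 business days."));  // 7
        System.out.println(wordCount("   "));                                            // 0
    }
}
```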

Tracking results over time

For long-running projects, Dokimos includes an optional server component that tracks evaluation results across runs. This lets you monitor quality trends and catch gradual degradation:

var reporter = DokimosServerReporter.builder()
    .serverUrl("http://localhost:8080")
    .projectName("knowledge-assistant")
    .build();

ExperimentResult result = Experiment.builder()
    .name("Knowledge Assistant v1.0")
    .dataset(dataset)
    .task(evaluationTask)
    .evaluators(evaluators)
    .reporter(reporter)
    .build()
    .run();

The server provides a web UI for viewing runs, comparing results across experiments, and debugging individual failures.

Key takeaways

Dokimos fills an important gap in the Java AI ecosystem. Where Python developers have had tools like DeepEval and Ragas for a while, Java teams building with Spring AI now have a native evaluation framework that fits their existing workflow.

The core ideas to take away:

  • Evals are not optional for production AI applications. If you can’t measure quality, you can’t maintain it.
  • Use multiple evaluators — faithfulness, hallucination, and custom quality checks each catch different failure modes.
  • Dataset-driven testing with @DatasetSource makes evals maintainable and repeatable.
  • Start simple — even a small dataset with the built-in evaluators will catch issues that manual testing misses.