SAINT (Service-level Integration Test Generation with Program Analysis and LLM-based Agents) is a novel white-box testing approach for service-level testing of enterprise Java applications. Enterprise applications are typically tested at multiple levels, with service-level testing playing an important role in validating application functionality.
Existing service-level testing tools, especially for RESTful APIs, often employ fuzzing and/or depend on OpenAPI specifications, which are not readily available in real-world enterprise codebases. Moreover, they fail to generate functional tests that exercise meaningful scenarios.
SAINT addresses these limitations by combining static analysis, large language models (LLMs), and LLM-based agents to automatically generate both endpoint-focused and scenario-based tests that are functional, effective, and developer-aligned.
This work has been accepted at ICSE 2026 (48th International Conference on Software Engineering).
Paper: SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents
- White-Box Testing: Leverages static program analysis to understand application internals
- No OpenAPI Required: Works directly with source code without requiring API specifications
- Dual Testing Modes:
  - Endpoint-Focused Tests: Maximize code and database interaction coverage for individual endpoints
  - Scenario-Based Tests: Generate realistic user workflows by extracting and refining application use cases
- LLM-Based Agents: Employs planning, action, and reflection phases for intelligent test generation
- High Coverage: Achieves high branch and line coverage with minimal variance
- Cost-Effective: Optimized LLM usage with support for multiple models (Mistral, Llama, Granite, DeepSeek, OpenAI)
- Endpoint Model: Captures syntactic and semantic information about service endpoints
- Operation Dependency Graph (ODG): Models inter-endpoint ordering constraints for scenario generation
- Inter-Parameter Dependencies (IPD): Identifies relationships between parameters across endpoints
- Value Constraints: Extracts valid parameter value ranges from code analysis
- Agentic Workflow: Iterative refinement through planning, execution, and reflection
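As a rough illustration of how an Operation Dependency Graph can constrain scenario ordering, the sketch below topologically sorts a few endpoints so that each operation runs only after the operations it depends on. The endpoint names and the dictionary encoding of the graph are assumptions made for this example; they are not SAINT's internal representation.

```python
from graphlib import TopologicalSorter

# Hypothetical ODG: each operation maps to the set of operations it depends on.
# The endpoint names are illustrative only.
odg = {
    "POST /owners": set(),
    "POST /owners/{ownerId}/pets": {"POST /owners"},
    "GET /owners/{ownerId}/pets/{petId}": {"POST /owners/{ownerId}/pets"},
    "DELETE /owners/{ownerId}": {"GET /owners/{ownerId}/pets/{petId}"},
}


def scenario_order(graph):
    """Return one operation order that respects every dependency edge."""
    return list(TopologicalSorter(graph).static_order())


order = scenario_order(odg)
for step, operation in enumerate(order, start=1):
    print(step, operation)
```

A scenario generator could use such an order to ensure, for instance, that an owner is created before a pet is attached to it.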
SAINT integrates multiple components in a sophisticated pipeline:
- Extracts API specifications from source code
- Builds control flow and data flow graphs
- Identifies endpoint signatures, parameters, and return types
- Analyzes database interactions and business logic
- Endpoint Model: Comprehensive representation of each API endpoint
- Operation Dependency Graph: Captures ordering constraints between operations
- Inter-Parameter Dependencies: Maps parameter relationships across endpoints
- Coverage Agent: Generates tests to maximize code coverage
- Fixing Agent: Repairs failing tests through root cause analysis
- Scenario Agent: Synthesizes realistic user scenarios
- Uses multiple specialized prompts for different generation tasks
- Runs generated tests against the application
- Collects coverage metrics (branch, line, database interaction)
- Provides feedback for iterative refinement
- Analyzes coverage gaps and test failures
- Guides agents to generate additional tests
- Iteratively refines tests until coverage goals are met
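The feedback loop described in the last few items can be sketched as a simple generate, execute, and reflect cycle. This is a minimal sketch under stated assumptions: `refine`, `generate`, `execute`, the branch identifiers, and the coverage goal are all hypothetical stand-ins, with plain functions taking the place of the LLM agents and the test runner.

```python
def refine(all_branches, generate, execute, goal=0.8, max_rounds=5):
    """Generate tests, run them, and feed coverage gaps back until the goal is met."""
    tests, covered = [], set()
    for round_no in range(1, max_rounds + 1):
        gaps = all_branches - covered      # reflection: what is still uncovered?
        new_tests = generate(gaps)         # planning/action: target the gaps
        covered |= execute(new_tests)      # execution: collect coverage
        tests.extend(new_tests)
        if len(covered) / len(all_branches) >= goal:
            break
    return tests, covered, round_no


# Stand-in for an agent: emit one test per gap, at most two per round.
def generate(gaps):
    return [f"test_{branch}" for branch in sorted(gaps)[:2]]


# Stand-in for the runner: each test covers the branch it was written for.
def execute(new_tests):
    return {name.removeprefix("test_") for name in new_tests}


branches = {"b1", "b2", "b3", "b4", "b5"}
tests, covered, rounds = refine(branches, generate, execute)
print(rounds, sorted(covered))
```

With these stubs the loop stops after the second round, once four of the five branches are covered and the 80% goal is reached.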
SAINT has been evaluated on 8 real-world Java applications, including a proprietary enterprise application, demonstrating consistent performance across different domains and LLMs.
| Application | Std. Dev. (Branch, Line) | Cost in $ (Mistral, o1) | LLM Calls | HTTP Requests | Time (min) |
|---|---|---|---|---|---|
| PetClinic | (5.2%, 1.3%) | 0.02, 4.8 | 437 | 211 | 13.4 |
| DayTrader | (0.3%, 0.4%) | 0.04, 3.7 | 180 | 179 | 32.4 |
| JPetStore | (1.6%, 0.2%) | 0.05, 5.1 | 218 | 269 | 12.2 |
| Restcountries | (1.6%, 0.1%) | 0.04, 4.3 | 229 | 422 | 21.7 |
| Feature-service | (0.7%, 0.5%) | 0.04, 11.4 | 389 | 185 | 14.7 |
| Genome-Nexus | (0.1%, 0.8%) | 0.06, 7.7 | 356 | 423 | 40.4 |
| LanguageTool | (0.1%, 0.1%) | 0.04, 4.1 | 27 | 13 | 1.5 |
| Application | Std. Dev. (Branch, Line) | Cost in $ (Mistral, o1) | LLM Calls | HTTP Requests | Time (min) |
|---|---|---|---|---|---|
| PetClinic | (7.6%, 1.4%) | 0.02, 2.3 | 128 | 15 | 3.7 |
| DayTrader | (2.4%, 5.1%) | 0.04, 4.6 | 213 | 33 | 10.1 |
| JPetStore | (0.0%, 2.6%) | 0.05, 6.4 | 187 | 24 | 7.2 |
| Restcountries | (3.8%, 3.4%) | 0.02, 3.2 | 148 | 19 | 5.9 |
| Feature-service | (1.8%, 2.3%) | 0.04, 4.2 | 136 | 29 | 5.2 |
| Genome-Nexus | (2.7%, 3.4%) | 0.05, 6.0 | 216 | 25 | 17.0 |
- Low Variance: Standard deviations are consistently low (< 8%), indicating stable and reproducible results
- Cost Efficiency: Using Mistral (open-source) costs $0.02-$0.06 per application, while OpenAI o1 costs $2.3-$11.4
- Time Efficiency: Most applications complete testing in under 20 minutes
- Scenario-Based Advantage: Scenario-based testing typically requires fewer HTTP requests and less time while maintaining coverage
- Model Flexibility: Evaluated with 5 different LLMs (Mistral Devstral, Llama-8, Granite-3.1-8, DeepSeek-R1, OpenAI o1)
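The time and request findings can be checked directly against the two tables. The snippet below copies the HTTP request counts and run times from them and counts how many runs finish in under 20 minutes, and how often the second table beats the first for the six applications that appear in both. It assumes the second table reports the scenario-based runs, which is how the request counts read; the variable names are illustrative.

```python
# (HTTP requests, minutes) per application, copied from the two tables above.
first_table = {
    "PetClinic": (211, 13.4), "DayTrader": (179, 32.4), "JPetStore": (269, 12.2),
    "Restcountries": (422, 21.7), "Feature-service": (185, 14.7),
    "Genome-Nexus": (423, 40.4), "LanguageTool": (13, 1.5),
}
second_table = {
    "PetClinic": (15, 3.7), "DayTrader": (33, 10.1), "JPetStore": (24, 7.2),
    "Restcountries": (19, 5.9), "Feature-service": (29, 5.2),
    "Genome-Nexus": (25, 17.0),
}

runs = list(first_table.values()) + list(second_table.values())
under_20 = sum(1 for _, minutes in runs if minutes < 20)

# Applications that appear in both tables.
shared = first_table.keys() & second_table.keys()
fewer_requests = sum(1 for app in shared if second_table[app][0] < first_table[app][0])
faster = sum(1 for app in shared if second_table[app][1] < first_table[app][1])

print(f"{under_20} of {len(runs)} runs finish in under 20 minutes")
print(f"{fewer_requests} of {len(shared)} shared applications use fewer HTTP requests")
print(f"{faster} of {len(shared)} shared applications take less time")
```

On these numbers, 10 of the 13 runs finish in under 20 minutes, and all six shared applications need fewer HTTP requests and less time in the second table.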
SAINT's effectiveness is validated through ablation studies on four key components:
| Component | Applications Tested | Impact on Coverage |
|---|---|---|
| Coverage Agent | 4 applications | Significant |
| Inter-Parameter Dependencies | 4 applications | Moderate |
| Operation Dependency Graph | 4 applications | High (scenarios) |
| Value Constraints | 4 applications | Moderate |
Results demonstrate that each component contributes meaningfully to overall test effectiveness.
Our evaluation addresses five key research questions:
- RQ1: How effective is SAINT in generating tests for individual endpoints?
- RQ2: How effective is SAINT in generating scenario-based tests?
- RQ3: How does SAINT perform with different LLMs?
- RQ4: What is the developer perception of SAINT-generated tests?
- RQ5: What is the contribution of each component in SAINT?
saint/
├── README.md                                  # This file
├── doc/                                       # Documentation and figures
│   ├── readme.md                              # Additional documentation
│   └── background.png                         # Developer survey background
├── RQ1+3/                                     # Individual endpoint testing results
│   ├── DayTrader/                             # Results per application
│   ├── Genome-nexus/
│   ├── JPetStore/
│   ├── Languagetool/
│   ├── PetClinic/
│   ├── RestCountries/
│   └── SAINT_PROMPT/                          # All prompt templates
│       ├── code-context-agent.jinja2
│       ├── coverage-agent-*.jinja2            # Coverage generation prompts
│       ├── duplicate-fixing-request.jinja2
│       ├── entire-scenario-test-generation.jinja2
│       ├── fixing-agent-*.jinja2              # Test fixing prompts
│       ├── login-pattern.jinja2
│       ├── parameter-values.jinja2
│       ├── scenario-*.jinja2                  # Scenario generation prompts
│       └── servlet_parameter_type.jinja2
├── RQ5/                                       # Ablation study results
│   ├── cov_agent/                             # Coverage agent ablation
│   ├── IPD/                                   # Inter-parameter dependencies ablation
│   ├── Partial_order/                         # ODG ablation
│   └── Value constraints/                     # Value constraints ablation
└── ASTER_Integration.pdf                      # Technical documentation
SAINT employs a sophisticated prompt engineering strategy with specialized templates for different tasks:
- Code Context Analysis
  - code-context-agent.jinja2: Analyze code context for test generation
- Coverage Generation Prompts
  - coverage-agent-generate-intra-endpoint-requests.jinja2: Generate requests for coverage
  - coverage-agent-ipd.jinja2: Handle inter-parameter dependencies
  - coverage-agent-parameter-type.jinja2: Determine parameter types
  - coverage-agent-parameter-values.jinja2: Generate parameter values
  - coverage-agent-progress.jinja2: Track coverage progress
  - coverage-agent-reflection.jinja2: Reflect on coverage gaps
  - coverage-agent-tool-choice.jinja2: Select appropriate tools
- Test Fixing Prompts
  - fixing-agent-generate-intra-endpoint-requests.jinja2: Fix request generation
  - fixing-agent-ipd.jinja2: Fix inter-parameter dependencies
  - fixing-agent-parameter-type.jinja2: Fix parameter types
  - fixing-agent-parameter-values.jinja2: Fix parameter values
  - fixing-agent-reflection.jinja2: Analyze test failures
  - fixing-agent-tool-choice.jinja2: Select fixing strategies
  - duplicate-fixing-request.jinja2: Handle duplicate tests
- Scenario Generation Prompts
  - entire-scenario-test-generation.jinja2: Generate complete scenario tests
  - scenario-assertion.jinja2: Generate test assertions
  - scenario-test-generation.jinja2: Convert scenarios to tests
- Dependency Analysis Prompts
  - parameter-values.jinja2: Generate constrained values
  - login-pattern.jinja2: Identify authentication patterns
  - servlet_parameter_type.jinja2: Determine servlet parameter types
Each prompt includes:
- Clear task description
- Input format specification
- Examples demonstrating expected output
- Output format constraints
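To make the four-part prompt structure concrete, the sketch below assembles a prompt from a task description, an input format, examples, and output constraints. The shipped templates are Jinja2 files; this example uses plain Python string assembly instead, and the `PromptSpec` class, the section headings, and the sample content are all hypothetical rather than taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    task: str                # clear task description
    input_format: str        # input format specification
    examples: list           # examples demonstrating expected output
    output_constraints: str  # output format constraints


def render(spec: PromptSpec) -> str:
    """Join the four sections in a fixed order, separated by blank lines."""
    sections = [
        "## Task\n" + spec.task,
        "## Input format\n" + spec.input_format,
        "## Examples\n" + "\n".join(f"- {example}" for example in spec.examples),
        "## Output format\n" + spec.output_constraints,
    ]
    return "\n\n".join(sections)


# Hypothetical content, for illustration only.
spec = PromptSpec(
    task="Propose values for the listed endpoint parameters.",
    input_format="A JSON object mapping parameter names to declared types.",
    examples=['{"ownerId": 1}', '{"ownerId": 42}'],
    output_constraints="Return a single JSON object and nothing else.",
)
prompt = render(spec)
print(prompt)
```

Keeping the sections in a fixed order makes prompts for different tasks easier to compare and review.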
We conducted a comprehensive developer survey to evaluate the quality and usefulness of SAINT-generated tests. The survey included:
- Participants: Professional developers with varying levels of experience
- Focus: Scenario-based test quality, readability, and maintainability
- Results: Strong endorsement of SAINT-generated tests
- Key Findings: Developers found the tests to be realistic, well-structured, and aligned with actual use cases
SAINT has been extensively evaluated on the following applications:
| Application | Domain | Endpoints | LOC | Description |
|---|---|---|---|---|
| PetClinic | Healthcare | 15 | ~5K | Veterinary clinic management system |
| DayTrader | Finance | 12 | ~8K | Stock trading benchmark application |
| JPetStore | E-commerce | 18 | ~6K | Pet store e-commerce platform |
| Restcountries | Information | 8 | ~3K | REST API for country information |
| Feature-service | DevOps | 10 | ~4K | Feature flag management service |
| Genome-Nexus | Bioinformatics | 25 | ~15K | Genomic annotation and analysis service |
| LanguageTool | NLP | 6 | ~10K | Grammar and style checking service |
| Enterprise App | Proprietary | 30+ | ~50K | Real-world enterprise application |
If you use SAINT in your research or project, please cite our ICSE 2026 paper:
@inproceedings{pan2026saint,
title={SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents},
author={Pan, Rangeet and Pavuluri, Raju and Huang, Ruikai and Krishna, Rahul and Stennett, Tyler and Orso, Alessandro and Sinha, Saurabh},
booktitle={2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE)},
year={2026}
}

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Note: This is a research prototype. For production use, please contact the authors for guidance and support.
Disclaimer: SAINT uses LLMs which may generate tests with varying quality. Always review and validate generated tests before deploying to production systems.

