On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

Abstract

The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising alternative to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of executing NL test cases directly with LLM agents. This paper explores this direction, with particular attention to the unsoundness of NL test cases and the consistency of their execution. NL test cases are inherently unsound because ambiguous instructions or unpredictable agent behaviour can produce false failures. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with specialised agents and guardrail mechanisms that dynamically verify the execution of each test step. We introduce measures to evaluate the capabilities of LLM agents in test execution and a measure to estimate the execution consistency of NL test cases. We also propose a definition of weak unsoundness that captures contexts where rare incorrect verdicts are tolerable. Our experimental evaluation with eight publicly available LLMs demonstrates the potential of LLM agents for GUI testing. In particular, Meta Llama 3.3 70B executes NL test cases at the 3-Sigma industrial quality level (mean accuracies greater than 98%) with high execution consistency.
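The paper's algorithm and measures are not reproduced in this abstract; the following is a minimal sketch of the ideas it names, assuming a hypothetical agent interface (`execute_step`, `verify_step`) and interpreting execution consistency as the share of repeated runs that agree with the majority verdict. All names are illustrative, not the authors' API.

```python
from collections import Counter
from typing import Callable, List

Verdict = str  # "pass" or "fail"

def run_test_case(steps: List[str],
                  execute_step: Callable[[str], str],
                  verify_step: Callable[[str, str], bool]) -> Verdict:
    """Execute NL test steps one by one; a guardrail check after each
    step verifies the observed GUI state before continuing."""
    for step in steps:
        observation = execute_step(step)        # LLM agent acts on the GUI
        if not verify_step(step, observation):  # guardrail verification
            return "fail"
    return "pass"

def execution_consistency(verdicts: List[Verdict]) -> float:
    """Estimate consistency as the fraction of repeated executions that
    agree with the majority verdict (1.0 = perfectly consistent)."""
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)
```

Under this reading, nine passing runs and one failing run of the same test case yield a consistency of 0.9.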

Type
Publication
Proceedings of the 21st International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2026, Benidorm, Spain, May 22-24, 2026