Securing AI-Generated Code with Runtime Verification
DOI:
https://doi.org/10.47363/JAICC/ICAIC2025/2025(4)30

Keywords:
Securing AI-Generated Code

Abstract
AI-driven code assistants (e.g., GitHub Copilot, ChatGPT) accelerate development but also surface security and compliance risks: generated code can replicate insecure idioms from training corpora, display unconventional control/data flows, and evade assumptions baked into static analysis. We propose formal runtime verification (RV) as a dedicated security layer for such untrusted code. The approach monitors executions against declarative specifications covering safety ("nothing bad happens"), liveness ("something good eventually happens"), and hyperproperties that capture relational behaviors (e.g., information flow and cross-run consistency). We introduce a compact property language with parameterized temporal
operators and a fragment of HyperLTL suited to on-the-fly checking, and a stream-based monitoring architecture that ingests typed events produced by compile-time instrumentation and selective dynamic hooks. We implement Copilot-RV, extending the Copilot framework with security predicates (taint carriers, privilege boundaries, resource budgets), sanitizer oracles, and trace-windowed evaluation to bound overhead. Contracts express API usage (memory safety, file/network boundaries, TLS certificate pinning and key-usage constraints), capability invariants, and declassification rules for logs and telemetry. Monitors are compiled to C and linked with the target, emitting verdicts and mitigation actions (terminate, isolate, roll back, quarantine outputs) immediately upon
detection of a bad prefix. Our evaluation spans 1,247 LLM-generated C programs drawn from parsing, networking, serialization, and
embedded-style routines. Vulnerabilities are seeded via 18 CWE families and adversarial prompt patterns. Runtime monitors detect
94.3% of injected flaws with 3.7% median runtime overhead and 2.1% false positives; median detection latency is sub-millisecond
for memory-safety and capability-violation guards. An ablation comparing naive dynamic sanitizers to RV semantics shows a +24%
relative recall gain at similar overheads. A hybrid static-dynamic pipeline, in which RV feedback guides targeted fuzzing and patch
suggestions, reduces manual verification effort by 73% relative to static analysis alone. We prove soundness and completeness
for safety/co-safety classes, with latency bounded by specification lookahead and resource usage linear in active predicates and
window size. Integration with DevSecOps is straightforward: specifications are versioned with code; policy packs are derived
from prompts/diffs; monitors gate merges in CI and run in production with budgeted sampling on hot paths. While RV remains
coverage-dependent and some hyperproperties require approximations, the results indicate that formal runtime monitors provide a
practical, mathematically grounded backstop for AI-generated code, narrowing the gap between rapid AI-assisted development and
the reliability demands of safety- and mission-critical software.
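
To make the abstract's specification style concrete, the sketch below shows a minimal Copilot monitor in the spirit of the resource-budget predicates described above. It is an illustration under stated assumptions, not the paper's actual policy pack: the event name heap_bytes, the handler on_budget_violation, and the 64 KiB budget are all hypothetical, and the API follows the open-source Copilot 3.x / copilot-c99 toolchain.

module BudgetMonitor where

import Language.Copilot
import Copilot.Compile.C99
import Prelude hiding ((<=), not)

-- Typed event stream fed by compile-time instrumentation:
-- bytes currently allocated by the generated code (name assumed).
heapBytes :: Stream Word32
heapBytes = extern "heap_bytes" Nothing

-- Safety predicate: the resource budget is never exceeded.
withinBudget :: Stream Bool
withinBudget = heapBytes <= 65536

-- On the first bad prefix, the generated C code calls the handler
-- on_budget_violation(heap_bytes), where mitigation (terminate,
-- isolate, roll back) would be implemented.
spec :: Spec
spec = trigger "on_budget_violation" (not withinBudget) [arg heapBytes]

-- Emit C sources to compile and link with the target program.
main :: IO ()
main = reify spec >>= compile "budget_monitor"

Because the trigger maps to an ordinary C function in the generated sources, the mitigation actions listed in the abstract can live entirely on the target side, with no Haskell runtime in the deployed system.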
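Parameterized, trace-windowed temporal operators of the kind the abstract mentions reduce to ordinary stream definitions in this setting. The rough sketch below (all event and handler names assumed) encodes a bounded-response check, "every request is acknowledged within 8 monitor steps", as a counter stream; overlapping requests are conflated, which a production specification would refine.

module BoundedResponse where

import Language.Copilot
import Copilot.Compile.C99
import Prelude hiding ((++), (||), (<=), (>), not)

-- Instrumentation events (names assumed).
request, ack :: Stream Bool
request = extern "request_sent" Nothing
ack     = extern "ack_received" Nothing

-- Steps elapsed since the oldest unanswered request (0 = none pending).
pending :: Stream Word32
pending = [0] ++ mux ack 0 (mux (request || pending > 0) (pending + 1) 0)

-- Bounded-response property, roughly F[<=8]: an ack arrives within 8 steps.
responds :: Stream Bool
responds = pending <= 8

spec :: Spec
spec = trigger "on_response_timeout" (not responds) [arg pending]

main :: IO ()
main = reify spec >>= compile "bounded_response"

The window parameter (here 8) bounds both detection latency and state, which matches the abstract's claim that resource usage stays linear in active predicates and window size.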
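The abstract does not spell out its HyperLTL fragment; a standard two-trace noninterference formula of the kind it alludes to, in LaTeX notation, is:

\forall \pi.\, \forall \pi'.\;
  \mathbf{G}\bigl(\mathit{in}^{L}_{\pi} = \mathit{in}^{L}_{\pi'}\bigr)
  \rightarrow
  \mathbf{G}\bigl(\mathit{out}^{L}_{\pi} = \mathit{out}^{L}_{\pi'}\bigr)

That is, any two executions that agree on low-security inputs must agree on low-security outputs. Checking such forall-forall formulas on the fly requires relating the current trace to stored or concurrently monitored traces, which is where the approximations acknowledged in the abstract come in.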
License
Copyright (c) 2025 Journal of Artificial Intelligence & Cloud Computing

This work is licensed under a Creative Commons Attribution 4.0 International License.