AI code security evaluation

Is the code written by AI
actually secure?

CodeGuard benchmarks code generated by AI coding tools against 132 real-world CVE scenarios, using Docker PoC dynamic validation to produce quantitative security scores. It also detects CVE vulnerability patterns introduced in GitHub PRs in real time, with zero false positives.

Request commercial license
# Evaluate security of gpt-4o generated code
$ codeguard evaluate --model gpt-4o
→ loading 132 CVE instances...
→ building Docker PoC images...
→ running dynamic validation...
 
✓ passed: 87 / 132 (65.9%)
× vulnerable: 45 / 132 (34.1%)
 
CWE-89 SQL Injection (12)
CWE-79 XSS (8)
CWE-22 Path Traversal (6)
 
Final Score: 65.9 / 100

Two core capabilities

Evaluation plus defense — a complete security loop for AI coding tools

Product A

Security evaluation of AI coding tools

Score code generated by tools like GPT-4o, Claude, Gemini and Copilot. 132 real-world CVE scenarios plus Docker PoC dynamic validation produce quantitative, reproducible, and comparable security reports.

  • Severity-weighted scoring — failing everything no longer gets a perfect score
  • AST-aware dependency retrieval, structured across 6 languages
  • HTML comparative reports for cross-model benchmarking
  • FastAPI evaluation service with Swagger docs
Product B

CVE regression detection

When a developer (or AI tool) opens a PR, CodeGuard automatically checks whether it introduces known CVE patterns. Every alert is validated by a Docker PoC — zero false positives. Results are written back as commit status and PR comments.

  • GitHub webhook integration — trigger in seconds
  • Multi-dimensional CVE index (repo / language / CWE / file)
  • Docker image cache plus incremental scans
  • PR comments include remediation suggestions
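The multi-dimensional CVE index above can be pictured as a set of per-dimension lookup tables intersected at query time. The sketch below is illustrative only (not CodeGuard's actual data structure); the class and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch: a tiny in-memory CVE index queryable along the
# dimensions named above (repo / language / CWE / file).
class CVEIndex:
    def __init__(self):
        # dimension name -> key -> set of CVE ids
        self._by_dim = defaultdict(lambda: defaultdict(set))

    def add(self, cve_id, repo, language, cwe, path):
        for dim, key in (("repo", repo), ("language", language),
                         ("cwe", cwe), ("file", path)):
            self._by_dim[dim][key].add(cve_id)

    def query(self, **dims):
        # Intersect the CVE sets for every requested dimension.
        sets = [self._by_dim[d][k] for d, k in dims.items()]
        return set.intersection(*sets) if sets else set()

idx = CVEIndex()
idx.add("CVE-2021-0001", "org/app", "php", "CWE-89", "db.php")
idx.add("CVE-2022-0002", "org/app", "php", "CWE-79", "view.php")
print(idx.query(repo="org/app", cwe="CWE-89"))  # {'CVE-2021-0001'}
```

Intersecting small per-dimension sets keeps a lookup fast even when only some dimensions are constrained.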

Key features

Real vulnerabilities, dynamic validation, zero false positives — the non-negotiables

🛠

Docker PoC dynamic validation

Every CVE instance ships with a dedicated Docker image, an executable PoC exploit, and functional tests. Not static analysis: the code actually runs, and we check whether the vulnerability is still exploitable.
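The verdict logic behind dynamic validation can be boiled down to two signals: do the functional tests still pass, and does the PoC exploit still succeed? The function below is an illustrative sketch, not CodeGuard's actual implementation.

```python
# Illustrative sketch of the dynamic-validation verdict: code that breaks
# its own functional tests can't be scored as secure, and code whose PoC
# exploit still succeeds is vulnerable.
def verdict(functional_tests_pass: bool, poc_exploit_succeeds: bool) -> str:
    if not functional_tests_pass:
        return "invalid"     # broken functionality, can't be scored
    return "vulnerable" if poc_exploit_succeeds else "passed"

print(verdict(True, False))   # passed
print(verdict(True, True))    # vulnerable
print(verdict(False, False))  # invalid
```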

🔬

AST-aware retrieval

AST-level code structure analysis layered on top of BM25 full-text search. Supports 6 languages (C/C++/Java/Python/PHP/JS) for precise context completion.
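The idea of layering AST structure on top of full-text search can be sketched in a few lines: boost files that actually *define* the queried symbol over files that merely mention it. This is a toy illustration under stated assumptions — Python's own `ast` module stands in for the six-language parser, and `source.count` stands in for BM25.

```python
import ast

# Hedged sketch: crude full-text score plus an AST-level boost for files
# that define the queried symbol (real BM25 and multi-language parsing
# omitted; all names here are illustrative).
def defined_names(source: str) -> set[str]:
    tree = ast.parse(source)
    return {n.name for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}

def score(source: str, query: str) -> float:
    text_score = source.count(query)          # stand-in for BM25
    ast_boost = 10.0 if query in defined_names(source) else 0.0
    return text_score + ast_boost

defines = "def run_query(sql):\n    return db.execute(sql)\n"
mentions = "# run_query is called elsewhere\nresult = run_query(q)\n"
print(score(defines, "run_query") > score(mentions, "run_query"))  # True
```

The structural signal is what pulls the defining file to the top even when a caller mentions the symbol more often.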

📊

Severity-weighted scoring

Fixes the original scoring bug where failing every case still yielded a perfect score. Critical vulnerabilities carry far more weight than Low-severity ones, so the score actually reflects risk.
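Severity weighting is easiest to see with numbers. The weights below are illustrative, not CodeGuard's exact values; the point is that an all-fail run scores 0, and a Low-severity pass barely moves the needle.

```python
# Illustrative weights (not CodeGuard's exact values): higher-severity
# failures cost more, so failing everything scores 0, not 100.
WEIGHTS = {"Critical": 10.0, "High": 5.0, "Medium": 2.0, "Low": 1.0}

def weighted_score(results: list[tuple[str, bool]]) -> float:
    """results: (severity, passed) per CVE instance; returns 0-100."""
    total = sum(WEIGHTS[sev] for sev, _ in results)
    earned = sum(WEIGHTS[sev] for sev, passed in results if passed)
    return 100.0 * earned / total

all_fail = [("Critical", False), ("High", False)]
print(weighted_score(all_fail))  # 0.0

mixed = [("Critical", False), ("Low", True)]
print(weighted_score(mixed))     # a Low pass barely moves the needle
```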

🤖

Multi-agent adapters

Supports Claude Code, Gemini CLI, OpenAI Codex, Aider and other mainstream agent code-generation frameworks with unified AgentMetrics behavior tracking.

💻

GitHub webhook integration

Point your webhook at POST /webhook/github and every PR triggers a scan. Results are pushed back as commit status plus PR comments — zero-touch integration for developers.
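Any receiver behind that endpoint should verify deliveries before acting on them. GitHub signs each webhook payload with your webhook secret and sends the HMAC-SHA256 digest in the `X-Hub-Signature-256` header; the check below uses only the standard library (the secret and body are made-up examples).

```python
import hmac
import hashlib

# Verify a GitHub webhook delivery: recompute the HMAC-SHA256 of the raw
# request body with the shared webhook secret and compare it (in constant
# time) against the X-Hub-Signature-256 header.
def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

secret = b"my-webhook-secret"        # hypothetical secret
body = b'{"action": "opened"}'       # hypothetical PR payload
good = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, good))               # True
print(verify_signature(secret, body, "sha256=deadbeef"))  # False
```

Rejecting unsigned or mis-signed deliveries is what lets a scan result safely gate a PR.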

Image cache + incremental scans

Docker images are built once and reused. Incremental scans track vulnerability file mtime — per-PR scan time drops from minutes to seconds.
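The mtime-tracking idea above can be sketched in a few lines: cache each vulnerability file's modification time and rescan only what changed since the last run. Function and variable names here are hypothetical.

```python
import os
import pathlib
import tempfile
import time

# Hedged sketch of mtime-based incremental scanning: remember each
# tracked file's modification time and return only the files that
# changed since the cache was last updated.
def changed_files(paths, mtime_cache):
    changed = []
    for p in paths:
        mtime = os.path.getmtime(p)
        if mtime_cache.get(p) != mtime:
            changed.append(p)
            mtime_cache[p] = mtime
    return changed

with tempfile.TemporaryDirectory() as d:
    f = pathlib.Path(d) / "db.php"
    f.write_text("<?php // v1")
    cache = {}
    print(len(changed_files([str(f)], cache)))  # 1 — first scan sees it
    print(len(changed_files([str(f)], cache)))  # 0 — unchanged, skipped
    os.utime(f, (time.time() + 10, time.time() + 10))  # simulate an edit
    print(len(changed_files([str(f)], cache)))  # 1 — rescanned
```

Skipping unchanged files is what collapses a per-PR scan from minutes to seconds.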

Evaluation dataset

132 real-world CVEs drawn from 51 production GitHub projects

132
CVE instances
51
Real GitHub projects
28
CWE types
7
Programming languages

Coverage

Languages: C(57) PHP(42) Java(16) Python(13) TS(2) JS(1) C++(1)
Severity: Critical(9) High(66) Medium(56) Low(1)
CWE coverage: OWASP Top 10 + CWE Top 25
Validation: Docker image + PoC exploit + functional tests

Use cases

A safety net for code in the AI programming era

🏢 Evaluating AI coding tools for procurement

Before buying an AI coding assistant, use CodeGuard to run a cross-vendor security benchmark. Make data-driven decisions: don't only look at code quality, look at the security floor.

💼 Model iteration for AI vendors

Run the security benchmark continuously while iterating. Ensure each new model release does not regress on security, with auto-generated comparative reports that quantify the change.

🛡 CI/CD security gates

Connect to a GitHub webhook and scan every PR for CVE patterns. With zero false positives, it can act as a hard CI gate — a real security checkpoint, not noise.

🔍 Security research and audits

A standardized benchmark for security teams to compare models, prompt strategies, and context scopes. All 132 scenarios are reproducible and extensible.

Put a lock on AI-generated code

Want to deploy an AI code security evaluation platform inside your company, or wire CVE regression detection into your GitHub workflow? Reach out for a commercial license and deployment plan.

Contact us

CodeGuard is built as a deep enhancement on top of the open-source Tencent/AICGSecEval (A.S.E) framework and is licensed under Apache-2.0. We thank Tencent Security Platform Department and the partner universities (Fudan, Peking, SJTU, Tsinghua, Zhejiang) for the original dataset and research contributions.

Explore more products