
perfecxion.ai

perfecXion

https://perfecxion.ai


Recent Activity

upvoted an article 2 days ago

AI Coding Assistants Keep Shipping Vulnerable Code -- Here's What We're Doing About It

reacted to scthornton's post with 👀 🚀 2 days ago

# SecureCode Dataset Family Update: 2,185 Security Examples, Framework-Specific Patterns, Clean Parquet Loading

Hey y'all,

Quick update on the SecureCode dataset family. We've restructured things and fixed several issues:

**What changed:**

- The datasets are now properly split into three repos: [unified](https://huggingface.co/datasets/scthornton/securecode) (2,185), [web](https://huggingface.co/datasets/scthornton/securecode-web) (1,378), [AI/ML](https://huggingface.co/datasets/scthornton/securecode-aiml) (750)
- All repos now use Parquet format -- `load_dataset()` just works, no deprecated loading scripts
- SecureCode Web now includes 219 framework-specific examples (Express, Django, Spring Boot, Flask, Rails, Laravel, ASP.NET Core, FastAPI, NestJS)
- Data cards have been corrected and split sizes fixed
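
For anyone who wants to try the Parquet loading, here is a minimal sketch. The repo IDs and counts are the ones listed above; the helper names (`load_securecode`, `total_rows`, `report`) are made up for illustration, and it assumes each repo loads with its default configuration. No split names are assumed, since the post doesn't list them.

```python
from typing import Mapping, Sized

# Repo IDs and example counts as stated in the post.
SECURECODE_REPOS = {
    "unified": ("scthornton/securecode", 2185),
    "web": ("scthornton/securecode-web", 1378),
    "aiml": ("scthornton/securecode-aiml", 750),
}


def load_securecode(name: str):
    """Load one SecureCode repo from the Hugging Face Hub.

    Assumption: the default configuration is enough; no config name is passed.
    """
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset

    repo_id, _ = SECURECODE_REPOS[name]
    # With no `split` argument, load_dataset returns a DatasetDict keyed by split.
    return load_dataset(repo_id)


def total_rows(splits: Mapping[str, Sized]) -> int:
    """Sum the number of examples across every split."""
    return sum(len(split) for split in splits.values())


def report() -> None:
    """Download each repo and print its row count next to the post's figure."""
    for name, (repo_id, expected) in SECURECODE_REPOS.items():
        found = total_rows(load_securecode(name))
        print(f"{repo_id}: {found} rows (post reports {expected})")
```

Calling `report()` needs network access and `pip install datasets`. Comparing the loaded totals against the numbers in the post is a quick way to check that the split sizes line up on your end.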

**Why it matters:**

With AI-generated code accounting for 60%+ of some codebases (Checkmarx 2025), security training data is more important than ever. Every example in SecureCode is grounded in a real CVE and structured as a 4-turn conversation that mirrors actual developer-AI workflows.

If you're working on code generation models, I'd love to hear how you're approaching the security angle. Are there vulnerability categories or frameworks you'd like to see covered?

Paper: [arxiv.org/abs/2512.18542](https://arxiv.org/abs/2512.18542)