We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power the next generation of Foundation Models (LLMs) from scratch. Developed at SKT AI LABS, this corpus is not just a collection of data; it’s a mission to decentralize high-grade AI training for regional languages and global knowledge.
💎 Key Highlights:
•• Massive Scale: Targeting a multi-terabyte architecture for 146T-level tokenization.
•• Pure Quality: Curated from 500+ Elite Sources
•• Structured for MoE: Perfectly sharded into 3.5GB standardized units (SKT-𝕻 series) for seamless distributed training.
🤝 Open for Collaboration!
We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design—let’s build the future together.
Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.
CodePlex served as Microsoft’s primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardizing on GitHub.
Key Stats:
- 5,043,730 files from 38,087 repositories - 3.6 GB compressed Parquet - 91 programming languages (Heavily featuring C#, ASP.NET, and C++) - Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages) - Includes platform-specific license metadata (Ms-PL, Ms-RL)