bndp (bndp)

liked a model 1 day ago

tiiuae/Falcon-Perception

Mask Generation • Updated 7 days ago • 7.16k • 89

liked a model 3 days ago

tiiuae/Falcon-OCR

Image-to-Text • Updated 9 days ago • 3.27k • 61

liked a model 5 days ago

prism-ml/Bonsai-8B-gguf

Text Generation • 8B • Updated 3 days ago • 68.2k • 540

liked a model 12 days ago

ibm-granite/granite-4.0-micro

Text Generation • Updated Nov 3, 2025 • 147k • 268

reacted to Shrijanagain's post with 🔥 17 days ago

Post

5566

We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power the next generation of Foundation Models (LLMs) from scratch.
Developed at SKT AI LABS, this corpus is not just a collection of data; it’s a mission to decentralize high-grade AI training for regional languages and global knowledge.

💎 Key Highlights:

• Massive Scale: Targeting a multi-terabyte architecture for 146T-level tokenization.

• Pure Quality: Curated from 500+ Elite Sources

• Structured for MoE: Perfectly sharded into 3.5GB standardized units (SKT-𝕻 series) for seamless distributed training.

🤝 Open for Collaboration!

We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture design—let’s build the future together.

Explore the Dataset on Hugging Face:

🔗 https://huggingface.co/datasets/Shrijanagain/SKT-OMNI-CORPUS-146T-V1

DSR -- 🔗 https://huggingface.co/datasets/Shrijanagain/SKT-DSRx10000

#AI #MachineLearning #OpenSource #IndicAI #SKTAILABS #LLM #BigData #HuggingFace #InnovationIndia

liked a Space 21 days ago

GGUF My Repo

🦙

1.91k

Create GGUF quantized model from a Hugging Face repo

liked a model 21 days ago

nvidia/Nemotron-Cascade-2-30B-A3B

Text Generation • 32B • Updated about 9 hours ago • 233k • 468

liked a dataset 21 days ago

nvidia/Nemotron-Cascade-2-SFT-Data

Viewer • Updated 21 days ago • 15.9M • 15.7k • 52

liked a model about 1 month ago

bartowski/Qwen_Qwen3.5-0.8B-GGUF

Image-Text-to-Text • 0.8B • Updated Mar 10 • 242k • 11

liked a model 2 months ago

Qwen/Qwen3-Coder-Next

Text Generation • 80B • Updated Feb 3 • 672k • 1.24k

reacted to nyuuzyou's post with 👍 2 months ago

Post

2740

🏛️ Microsoft CodePlex Archive Dataset - nyuuzyou/ms-codeplex-archive

Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.

CodePlex served as Microsoft’s primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardized on GitHub.

Key Stats:

- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)