URL Classifier v2 - Autoresearch (Multi-Domain)
Binary classifier that predicts whether a URL is a list page (A) or a detail page (B).
Trained on 26 diverse domains across e-commerce, recruitment, news, social, video, travel, education, and tech documentation, with significantly improved generalization over the v1 single-domain model.
Model Details
- Architecture: Custom transformer (Autoresearch framework)
- Parameters: ~161M
- Depth: 4 layers
- Model dim: 384
- Vocab: cl100k_base (100,277 tokens)
- Max seq len: 64
- Training: 30 min on RTX 4060 Laptop
- Training samples: 2,600 (A=1,300, B=1,300)
- Training accuracy: 100%
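
The vocabulary and maximum sequence length listed above imply a fixed-length input of 64 token ids per URL. The following is a minimal sketch of that preprocessing, assuming right-padding with a pad id of 0 and truncation from the right; neither choice is stated in this card. `pad_or_truncate` and `encode_url` are hypothetical helper names, not functions from `src/infer.py`.

```python
MAX_SEQ_LEN = 64  # from the model details above
PAD_ID = 0        # assumption: the actual pad id is not documented


def pad_or_truncate(token_ids, max_len=MAX_SEQ_LEN, pad_id=PAD_ID):
    """Return exactly max_len ids: cut extra ids, right-pad short inputs."""
    clipped = list(token_ids)[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def encode_url(url, max_len=MAX_SEQ_LEN):
    """Tokenize a URL with cl100k_base and fit it to the model's input length."""
    import tiktoken  # imported lazily so pad_or_truncate works without it

    enc = tiktoken.get_encoding("cl100k_base")
    return pad_or_truncate(enc.encode(url), max_len)
```

Calling `encode_url("https://example.com/product/123")` would return a list of 64 integers; URLs that tokenize to more than 64 ids lose their tail under this sketch.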
Supported Domains
| Category | Domains |
|---|---|
| E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
| Recruitment | Zhilian, BOSS, Lagou |
| News | Sina, NetEase, Tencent News, 36kr |
| Social | Zhihu, Douban, Xiaohongshu, Reddit |
| Video | YouTube, Bilibili |
| Travel | Ctrip, Qunar, Mafengwo |
| Education | icourse163, imooc |
| Tech Docs | GitHub, ReadTheDocs, MDN |
Usage
pip install torch tiktoken
python src/infer.py "https://example.com/product/123" # detail page
python src/infer.py "https://example.com/search?q=foo" # list page
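
To classify several URLs, the command above can be wrapped in a small script. This is only a sketch: it assumes `src/infer.py` takes a single URL argument as shown and prints its result to stdout. The output format is not documented here, so the raw text is returned unparsed. `build_command` and `classify_many` are hypothetical helper names.

```python
import subprocess
import sys


def build_command(url, script="src/infer.py"):
    """Build the same command line as the usage lines above, for one URL."""
    return [sys.executable, script, url]


def classify_many(urls, script="src/infer.py"):
    """Run the inference script once per URL and collect its raw stdout."""
    results = {}
    for url in urls:
        proc = subprocess.run(
            build_command(url, script),
            capture_output=True,
            text=True,
            check=True,  # raise if the script exits with an error
        )
        results[url] = proc.stdout.strip()
    return results
```

Each call starts a new Python process and presumably reloads the model, so this is convenient for a handful of URLs but slow for large batches.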
Class Labels
| Label | Meaning |
|---|---|
| 0 (A) | List page: search results, category pages, rankings |
| 1 (B) | Detail page: product page, article, profile, video |
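
The label table can be applied to model output along these lines. The sketch assumes the model yields two class scores ordered as in the table (index 0 for A, index 1 for B); the actual output format of `src/infer.py` is not specified in this card, and `decode_prediction` is a hypothetical helper.

```python
ID_TO_LABEL = {0: "A", 1: "B"}
LABEL_MEANING = {"A": "list page", "B": "detail page"}


def decode_prediction(scores):
    """Map two class scores (index 0 = A, index 1 = B) to a label and meaning."""
    if len(scores) != 2:
        raise ValueError("expected exactly two class scores")
    # Pick the index with the larger score; ties resolve to class 0.
    class_id = max(range(2), key=lambda i: scores[i])
    label = ID_TO_LABEL[class_id]
    return class_id, label, LABEL_MEANING[label]
```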
Limitations
- Bilibili ranking pages may be misclassified as detail pages
- Very short URLs or URL shorteners may have lower accuracy
- Third-party evaluation accuracy is only ~55%, far below the 100% training accuracy, which suggests the model overfits its 2,600 training samples; real-world labeled data is needed to improve it
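
One way to act on the short-URL limitation is to flag such inputs before trusting a prediction. The sketch below is illustrative only: the shortener host list and the 20-character cutoff are assumptions chosen for the example, not values from this model, and `needs_review` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumption: a small illustrative set of shortener hosts, not an exhaustive list.
SHORTENER_HOSTS = {"bit.ly", "t.co", "tinyurl.com"}
MIN_URL_LENGTH = 20  # assumption: arbitrary cutoff for "very short" URLs


def needs_review(url):
    """Return True for URL shorteners and very short URLs."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host in SHORTENER_HOSTS or len(url) < MIN_URL_LENGTH
```

URLs for which `needs_review` returns `True` could be routed to manual labeling or resolved to their final destination before classification.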