URL Classifier v2 — Autoresearch (Multi-Domain)

Binary classifier that predicts whether a URL is a list page (A) or a detail page (B).

Trained on 26 domains spanning e-commerce, recruitment, news, social, video, travel, education, and tech documentation. The multi-domain training set is intended to generalize better than the v1 single-domain model.

Model Details

Architecture: Custom transformer (Autoresearch framework)
Parameters: ~161M
Depth: 4 layers
Model dim: 384
Vocab: cl100k_base (100,277 tokens)
Max seq len: 64
Training time: 30 min on an RTX 4060 Laptop GPU
Training samples: 2,600 (A=1,300, B=1,300)
Training accuracy: 100% (measured on the training samples, not on held-out data)
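
The 64-token sequence limit means every URL has to be cut or padded to a fixed length before it reaches the model. The sketch below shows one plausible way to do that. The padding ID (0), right-side padding, and keeping the first 64 tokens are assumptions; the actual preprocessing in this repository is not documented here. The helper name `to_fixed_length` is hypothetical.

```python
from typing import List

MAX_SEQ_LEN = 64  # from the model details above
PAD_ID = 0        # assumption: the real padding ID is not documented


def to_fixed_length(token_ids: List[int],
                    max_len: int = MAX_SEQ_LEN,
                    pad_id: int = PAD_ID) -> List[int]:
    """Keep at most max_len tokens, then right-pad with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


if __name__ == "__main__":
    # In real use the IDs would come from tiktoken, for example:
    #   import tiktoken
    #   ids = tiktoken.get_encoding("cl100k_base").encode(url)
    ids = [1, 2, 3]  # stand-in token IDs for illustration
    print(len(to_fixed_length(ids)))
```

Truncating from the right keeps the scheme, host, and leading path segments, which may or may not be the part of the URL that matters most for this task.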

Supported Domains

Category	Domains
E-commerce	Amazon, JD, Taobao, Tmall, Pinduoduo
Recruitment	Zhilian, BOSS, Lagou
News	Sina, NetEase, Tencent News, 36kr
Social	Zhihu, Douban, Xiaohongshu, Reddit
Video	YouTube, Bilibili
Travel	Ctrip, Qunar, Mafengwo
Education	icourse163, imooc
Tech Docs	GitHub, ReadTheDocs, MDN

Usage

pip install torch tiktoken
python src/infer.py "https://example.com/product/123"   # detail page
python src/infer.py "https://example.com/search?q=foo"  # list page
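
To classify several URLs, the same command can be driven from Python. This is a minimal sketch that assumes only what the commands above show: `src/infer.py` takes one URL as its single argument. The output format of the script is not documented here, so the raw text is returned unparsed. The helper names `build_command` and `classify_many` are hypothetical.

```python
import subprocess
import sys
from typing import List


def build_command(url: str, script: str = "src/infer.py") -> List[str]:
    """Return the argv list for one inference call, mirroring the commands above."""
    return [sys.executable, script, url]


def classify_many(urls: List[str]) -> List[str]:
    """Run the script once per URL and return its raw stdout for each call."""
    results = []
    for url in urls:
        proc = subprocess.run(
            build_command(url),
            capture_output=True,  # collect stdout/stderr instead of printing
            text=True,            # decode output as str
            check=True,           # raise CalledProcessError on a non-zero exit
        )
        results.append(proc.stdout.strip())
    return results
```

Passing the URL as a list element rather than through a shell avoids quoting problems with characters such as `?` and `&`. Each call starts a new process and reloads the model, so this approach is only practical for small batches.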

Class Labels

Label	Meaning
0 (A)	List page — search results, category pages, rankings
1 (B)	Detail page — product page, article, profile, video
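
The label table can be applied to model output roughly as follows. This sketch assumes the model emits two scores per URL, with index 0 for class A and index 1 for class B; that ordering follows the table but is not confirmed by the source. The function name `decode_prediction` is hypothetical.

```python
from typing import Sequence, Tuple

# Mirrors the label table above.
LABELS = {0: "A (list page)", 1: "B (detail page)"}


def decode_prediction(scores: Sequence[float]) -> Tuple[int, str]:
    """Pick the higher-scoring class and return (label_id, label_name)."""
    if len(scores) != 2:
        raise ValueError("expected exactly two scores, one per class")
    # Ties go to class 0; this tie-breaking rule is an arbitrary choice.
    label_id = 0 if scores[0] >= scores[1] else 1
    return label_id, LABELS[label_id]
```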

Limitations

Bilibili ranking pages may be misclassified as detail pages
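Accuracy may be lower for very short URLs and for links produced by URL shorteners
Third-party evaluation accuracy is about 55%, far below the 100% training accuracy and only slightly above the 50% that random guessing would score on a balanced binary set; real-world labeled data is needed to close this gap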
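
When real-world labeled URLs become available, overall accuracy alone can hide which class is failing. The sketch below reports accuracy together with per-class recall, so a problem such as list pages being predicted as detail pages shows up as low list recall. The function name `evaluate` and the sample labels are hypothetical and are not evaluation data from this model.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy and per-class recall for labels 0 (list) and 1 (detail)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report = {"accuracy": correct / len(y_true)}
    for label, name in ((0, "list_recall"), (1, "detail_recall")):
        total = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        # Recall is undefined when the class is absent; report NaN in that case.
        report[name] = hits / total if total else float("nan")
    return report


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    print(evaluate([0, 0, 1, 1], [0, 1, 1, 1]))
```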
