URL Classifier v2 - Autoresearch (Multi-Domain)

Binary classifier that predicts whether a URL is a list page (A) or a detail page (B).

Trained on 26 domains spanning e-commerce, recruitment, news, social, video, travel, education, and tech documentation, which significantly improves generalization over the v1 single-domain model.

Model Details

  • Architecture: Custom transformer (Autoresearch framework)
  • Parameters: ~161M
  • Depth: 4 layers
  • Model dim: 384
  • Vocab: cl100k_base (100,277 tokens)
  • Max seq len: 64
  • Training: 30 min on RTX 4060 Laptop
  • Training samples: 2,600 (A=1,300, B=1,300)
  • Training accuracy: 100%

Supported Domains

Category      Domains
E-commerce    Amazon, JD, Taobao, Tmall, Pinduoduo
Recruitment   Zhilian, BOSS, Lagou
News          Sina, NetEase, Tencent News, 36kr
Social        Zhihu, Douban, Xiaohongshu, Reddit
Video         YouTube, Bilibili
Travel        Ctrip, Qunar, Mafengwo
Education     icourse163, imooc
Tech Docs     GitHub, ReadTheDocs, MDN

Usage

pip install torch tiktoken
python src/infer.py "https://example.com/product/123"   # detail page
python src/infer.py "https://example.com/search?q=foo"  # list page
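
The model details list a cl100k_base vocabulary and a maximum sequence length of 64, so a URL presumably has to be tokenized and then truncated or padded to 64 ids before it reaches the model. The sketch below illustrates that step only. `prepare_input`, `cl100k_encoder`, the `pad_id` value of 0, and the keep-the-first-64-tokens policy are assumptions for illustration and are not taken from src/infer.py.

```python
from typing import Callable, List

MAX_SEQ_LEN = 64  # "Max seq len" from the model details


def prepare_input(
    url: str,
    encode: Callable[[str], List[int]],
    max_len: int = MAX_SEQ_LEN,
    pad_id: int = 0,  # assumed padding id; the real script may differ
) -> List[int]:
    """Tokenize a URL and force the result to exactly max_len ids."""
    ids = list(encode(url.strip()))[:max_len]   # truncate long URLs
    ids += [pad_id] * (max_len - len(ids))      # right-pad short URLs
    return ids


def cl100k_encoder() -> Callable[[str], List[int]]:
    """Return the cl100k_base encode function (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so prepare_input works without it

    return tiktoken.get_encoding("cl100k_base").encode
```

With tiktoken installed, `prepare_input(url, cl100k_encoder())` would yield a fixed-length id list; any other callable that maps a string to a list of ints can be passed for testing.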

Class Labels

Label   Meaning
0 (A)   List page: search results, category pages, rankings
1 (B)   Detail page: product page, article, profile, video
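
As a small illustration of the label mapping, the sketch below turns a pair of class scores into a label index, letter, and meaning. It assumes the model emits one score per class in label order (index 0 for A, index 1 for B); that output shape and the name `decode_prediction` are hypothetical and not confirmed by the repository.

```python
from typing import Sequence, Tuple

# Label table from the model card: index -> (letter, meaning)
LABELS = {
    0: ("A", "list page"),
    1: ("B", "detail page"),
}


def decode_prediction(scores: Sequence[float]) -> Tuple[int, str, str]:
    """Pick the highest-scoring class and return (index, letter, meaning)."""
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    # argmax over the scores; ties resolve to the lower index
    idx = max(range(len(scores)), key=lambda i: scores[i])
    letter, meaning = LABELS[idx]
    return idx, letter, meaning
```

For example, `decode_prediction([0.2, 1.7])` returns `(1, "B", "detail page")`.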

Limitations

  • Bilibili ranking pages may be misclassified as detail pages
  • Very short URLs and links produced by URL shorteners may be classified less accurately
  • Third-party evaluation accuracy is about 55%, far below the 100% training accuracy, which suggests overfitting and leaves room for improvement with real-world labeled data
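
To measure how the classifier behaves on your own labeled URLs, a minimal evaluation helper could look like the sketch below. It reports overall accuracy and a 2x2 confusion matrix so that errors such as list pages predicted as detail pages become visible. `evaluate` and the `predict` callable are hypothetical; `predict` stands in for whatever wrapper you write around the model and must return 0 (list) or 1 (detail).

```python
from typing import Callable, Dict, Iterable, List, Tuple


def evaluate(
    samples: Iterable[Tuple[str, int]],
    predict: Callable[[str], int],
) -> Dict[str, object]:
    """Score predict() on (url, true_label) pairs with labels 0=list, 1=detail."""
    # confusion[true_label][predicted_label]
    confusion: List[List[int]] = [[0, 0], [0, 0]]
    for url, true_label in samples:
        confusion[true_label][predict(url)] += 1
    total = sum(sum(row) for row in confusion)
    correct = confusion[0][0] + confusion[1][1]
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "confusion": confusion}


if __name__ == "__main__":
    # Toy stand-in predictor (NOT the model): query string means list page.
    toy_predict = lambda url: 0 if "?" in url else 1
    data = [
        ("https://example.com/search?q=foo", 0),
        ("https://example.com/product/123", 1),
        ("https://example.com/ranking", 0),  # made-up URL for illustration
    ]
    print(evaluate(data, toy_predict))
```

The `confusion[0][1]` cell counts list pages predicted as detail pages, which is the error described for Bilibili ranking pages.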