
TABERTA: Structure-Aware Table Retrieval with Bi-Encoders

TABERTA is a structure-aware fine-tuning framework for learning dense representations of relational tables. Unlike approaches that flatten tables into unstructured text, TABERTA explicitly exposes schema structure and table content through controlled serialization views and retrieval-oriented training objectives.

The resulting table encoders support accurate and generalizable table retrieval across heterogeneous tasks, including:

  • ad-hoc dataset discovery,
  • question answering evidence selection,
  • fact verification,
  • and schema grounding for text-to-SQL.

This repository provides:

  • the TABERTA codebase (training, serialization, evaluation),
  • and 7 fine-tuned table encoders, released via Hugging Face.

Core Idea

Given a natural-language query \( q \) and a corpus of tables \( \mathcal{T} \), TABERTA learns an encoder \( E(\cdot) \) such that relevant tables are ranked highly under standard similarity search.
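
With normalized embeddings (as in the usage example below), relevance reduces to cosine, equivalently dot-product, similarity:

$$
s(q, t) = \cos\big(E(q),\, E(t)\big) = E(q)^{\top} E(t),
$$

and the top-\( k \) tables in \( \mathcal{T} \) by \( s(q, \cdot) \) are returned.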

Two design choices are central:

  1. Serialization View — how table structure and content are exposed to the encoder.
  2. Fine-Tuning Objective — how retrieval relevance is learned.

TABERTA systematically studies the interaction between these choices.


Table Serialization Views

TABERTA supports three complementary serialization strategies:

SchemaView

Encodes only schema-level information (table name, column names, types).

  • Emphasizes structural and semantic intent.
  • Robust to noise and large tables.
  • Best suited for ad-hoc table retrieval and dataset discovery.

RowView

Encodes individual rows paired with schema context.

  • Grounds semantics in concrete values.
  • Supports evidence-based retrieval.
  • Useful when relevance depends on specific tuples.

Hybrid / FullView

Combines schema information with sampled or aggregated table content.

  • Balances abstraction and grounding.
  • Most general and consistently effective across tasks.
  • Used as the default in cross-benchmark evaluation.
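
As a rough illustration (not the repository's actual serialization API; the helper names and separators below are hypothetical), the three views of the same table might look like this:

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# reflect the actual TABERTA serialization code.

def schema_view(name, columns):
    # SchemaView: table name plus "column (type)" pairs, no cell values.
    cols = ", ".join(f"{c} ({t})" for c, t in columns)
    return f"table: {name} | columns: {cols}"

def row_view(name, columns, row):
    # RowView: one row's values paired with their column names.
    cells = ", ".join(f"{c}: {v}" for (c, _), v in zip(columns, row))
    return f"table: {name} | row: {cells}"

def hybrid_view(name, columns, rows, n_sample=3):
    # Hybrid/FullView: schema followed by a small sample of rows.
    sampled = " || ".join(", ".join(str(v) for v in r) for r in rows[:n_sample])
    return f"{schema_view(name, columns)} | sample rows: {sampled}"

columns = [("city", "text"), ("population", "int")]
rows = [["Berlin", 3_700_000], ["Hamburg", 1_900_000]]
print(schema_view("eu_cities", columns))
print(row_view("eu_cities", columns, rows[0]))
print(hybrid_view("eu_cities", columns, rows))
```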

Released Models

All models are bi-encoders initialized from a sentence-transformer backbone and fine-tuned for table retrieval. They differ in supervision signal and training objective.

1. Pairwise Contrastive (P/N Pair)

Supervision: Supervised
Objective: Pairwise contrastive (positive vs. negative tables)
Serialization: SchemaView / Hybrid

Learns explicit relevance boundaries between matching and non-matching tables. Strong baseline when labeled query–table pairs are available.
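
A minimal sketch of how such a pairwise objective can be set up with sentence-transformers; the backbone checkpoint and the toy query–table pairs below are placeholders, not the actual TABERTA training data or configuration:

```python
# Pairwise contrastive sketch: label 1 = relevant pair, label 0 = irrelevant pair.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder backbone

train_examples = [
    InputExample(texts=["population of german cities",
                        "table: eu_cities | columns: city (text), population (int)"], label=1),
    InputExample(texts=["population of german cities",
                        "table: stock_prices | columns: ticker (text), close (float)"], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```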


2. Triplet Contrastive (TC)

Supervision: Supervised
Objective: Triplet loss (anchor, positive, negative)
Serialization: SchemaView / Hybrid

Encourages relative ranking rather than absolute separation. More stable than pairwise training in heterogeneous corpora.
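
A corresponding triplet sketch, again with a placeholder backbone and toy (anchor, positive, negative) strings rather than the actual training setup:

```python
# Triplet sketch: each example holds (anchor query, positive table, negative table).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder backbone

train_examples = [
    InputExample(texts=[
        "population of german cities",                                  # anchor (query)
        "table: eu_cities | columns: city (text), population (int)",    # positive table
        "table: stock_prices | columns: ticker (text), close (float)",  # negative table
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.TripletLoss(model=model)  # margin-based ranking of positive over negative

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```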


3. Optimized Triplet Contrastive (TC-opt)

Supervision: Supervised
Objective: Triplet loss with hard-negative mining
Serialization: SchemaView / Hybrid

Improves discrimination in large repositories where many tables are semantically close. Best suited for high-recall retrieval settings.
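
One simple way to realize hard-negative mining, sketched under the assumption that queries, serialized tables, and gold query–table pairs are available as plain Python lists (a hypothetical data layout, not TABERTA's): score all tables with the current encoder and take the highest-scoring non-relevant table as each triplet's negative.

```python
# Hard-negative mining sketch: pick the best-scoring non-gold table per query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder backbone

queries = ["population of german cities"]
tables = [
    "table: eu_cities | columns: city (text), population (int)",
    "table: world_capitals | columns: country (text), capital (text)",
    "table: stock_prices | columns: ticker (text), close (float)",
]
gold = {0: 0}  # query index -> index of its relevant table (hypothetical format)

q_emb = model.encode(queries, normalize_embeddings=True)
t_emb = model.encode(tables, normalize_embeddings=True)
scores = q_emb @ t_emb.T  # cosine similarity, since embeddings are normalized

triplets = []
for qi, pos in gold.items():
    ranking = np.argsort(-scores[qi])                       # tables sorted by score, best first
    hard_neg = next(ti for ti in ranking if ti != pos)      # highest-scoring non-gold table
    triplets.append((queries[qi], tables[pos], tables[hard_neg]))
```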


4. Self-Supervised Contrastive (SimCSE-style)

Supervision: Self-supervised
Objective: Contrastive learning via stochastic dropout views
Serialization: SchemaView / Hybrid

Does not require relevance labels. Captures structural regularities and semantic consistency across table representations.
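
A sketch of the SimCSE-style recipe with sentence-transformers: each serialized table is paired with itself, dropout inside the encoder yields two different views, and the other tables in the batch act as in-batch negatives (backbone and table strings are placeholders):

```python
# SimCSE-style sketch: positives are (table, table); dropout creates the two views.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder backbone

serialized_tables = [
    "table: eu_cities | columns: city (text), population (int)",
    "table: stock_prices | columns: ticker (text), close (float)",
]

train_examples = [InputExample(texts=[t, t]) for t in serialized_tables]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```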


5. Masked Language Modeling (MLM)

Supervision: Self-supervised
Objective: Token-level reconstruction
Serialization: FullView

Focuses on contextual encoding of table text. Useful as a pretraining signal, but on its own (without contrastive supervision) weaker for retrieval.
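
A sketch of MLM pretraining on serialized tables with Hugging Face transformers; the backbone, output directory, and example strings are placeholders, not the TABERTA configuration:

```python
# MLM pretraining sketch on serialized tables.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")       # placeholder backbone
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

serialized_tables = [
    "table: eu_cities | columns: city (text), population (int) | sample rows: Berlin, 3700000",
]
dataset = Dataset.from_dict({"text": serialized_tables}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=256), batched=True
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-tables", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```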


6. Hybrid (MLM → Contrastive)

Supervision: Self-supervised + Supervised
Objective: Two-stage (MLM pretraining followed by contrastive fine-tuning)
Serialization: Hybrid / FullView

Combines representation quality with retrieval alignment. Provides strong and stable performance across tasks.
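
A sketch of how the two stages can be chained: wrap the MLM-pretrained checkpoint as a bi-encoder with mean pooling, then run contrastive fine-tuning as in the sketches above (the checkpoint path and pooling choice are assumptions, not the released recipe):

```python
# Two-stage sketch: MLM-pretrained backbone -> bi-encoder -> contrastive fine-tuning.
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("mlm-tables")  # stage 1: MLM-pretrained checkpoint (placeholder path)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Stage 2: contrastive fine-tuning, e.g. with the pairwise or triplet setup shown above.
```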


7. Unified Hybrid (Recommended)

Supervision: Mixed
Objective: Retrieval-oriented contrastive fine-tuning
Serialization: Hybrid

A single encoder trained to generalize across:

  • table retrieval,
  • question answering evidence selection,
  • fact verification,
  • and schema grounding.

This is the default model used in the paper.


Usage

Load a TABERTA Encoder

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("enas88/taberta-hybrid")
```


## Encode Tables


```python
table_embeddings = model.encode(
    serialized_tables,
    normalize_embeddings=True,
    show_progress_bar=True
)
```


## Encode Queries and Retrieve

```python
k = 10  # number of tables to retrieve
query_embedding = model.encode(query, normalize_embeddings=True)
scores = table_embeddings @ query_embedding
top_k = scores.argsort()[-k:][::-1]  # indices of the k highest-scoring tables
```


## Citation

The full reference is TBA; a placeholder BibTeX entry:

```bibtex
@inproceedings{taberta,
  title     = {TABERTA: Structure-Aware Fine-Tuning of Bi-Encoders for Table Retrieval},
  author    = {…},
  booktitle = {…},
  year      = {2026}
}
```