arXiv 2025  ·  NLP / Table QA

CRAFT: Training-Free
Cascaded Retrieval
for Tabular QA

A modular, training-free cascaded retrieval framework for scalable tabular question answering — no fine-tuning required.

Adarsh Singh*1 · Kushal Raj Bhandari*2 · Jianxi Gao2 · Soham Dan†3 · Vivek Gupta†1
1 Arizona State University 2 Rensselaer Polytechnic Institute 3 Microsoft

* Equal contribution  ·  † Joint supervision

Table QA · Retrieval · LLMs · Training-Free

The Problem & The Solution

Table Question Answering (TQA) requires identifying the right table from a massive corpus before reasoning over its structure to derive answers. Existing methods like DTR and ColBERT are computationally expensive and need dataset-specific fine-tuning — limiting adaptability to new domains.

We propose CRAFT, a cascaded retrieval framework that chains off-the-shelf models in three progressive stages: sparse lexical retrieval (SPLADE) → dense semantic reranking (Sentence Transformer) → neural reranking (text-embedding-3). Table representations are enriched with LLM-generated titles and descriptions via Gemini 1.5 Flash.

CRAFT matches or outperforms SOTA fine-tuned retrievers on the NQ-Tables benchmark — without training a single parameter on the target dataset. End-to-end QA results with Mistral, LLaMA3, and Qwen demonstrate its effectiveness across diverse LLM backends.

📌 Full Abstract Details
CRAFT processes 169,898 tables through three retrieval stages, ultimately producing a compact top-k set for answer generation. Query expansion via sub-question decomposition maximizes semantic alignment. The Sentence Transformer operates on mini-tables (top 5 rows per table) to reduce context noise and token count. End-to-end evaluation uses 919 unique queries from the NQ-Tables benchmark, with F1 score as the primary metric.
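As a rough illustration of the mini-table idea, the sketch below (with the hypothetical helper serialize_mini_table and pandas as an assumed table representation) keeps only the title, description, headers, and top rows; in CRAFT the kept rows are the five ranked highest by a Sentence Transformer, which this sketch stands in for by simply taking the first five.

mini_table_sketch.py
import pandas as pd

def serialize_mini_table(df: pd.DataFrame, title: str, description: str, n_rows: int = 5) -> str:
    """Flatten a table into a compact text block: title, description, headers, top rows.

    CRAFT ranks rows with a Sentence Transformer before truncating; taking the
    first n_rows here is only a placeholder for that ranking step.
    """
    header = " | ".join(map(str, df.columns))
    rows = [" | ".join(map(str, r)) for r in df.head(n_rows).itertuples(index=False)]
    return "\n".join([f"Title: {title}", f"Description: {description}", header, *rows])

# Toy usage
toy = pd.DataFrame({"Country": ["France", "Japan"], "Capital": ["Paris", "Tokyo"]})
print(serialize_mini_table(toy, "National capitals", "Countries and their capital cities"))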
📊 169,898 tables in the NQ-Tables corpus
🎯 87.16 Recall@10 (strongest retrieval among compared methods)
🎯 96.84 Recall@50 (strongest deep recall)
🔧 Zero domain-specific training required

What Makes CRAFT Stand Out

A concise snapshot of the paper's main takeaways, performance gains, and practical advantages.

🪜
Three-Stage Retrieval Cascade

Sparse filtering, dense semantic reranking, and neural reranking work together to progressively narrow candidates without dataset-specific training.

🏆
Strong Retrieval Quality

The paper reports 41.13 Recall@1, 87.16 Recall@10, and 96.84 Recall@50 on NQ-Tables, exceeding fine-tuned baselines at deeper cutoffs.

🛡️
Robust to Query Changes

CRAFT remains stable under paraphrased questions, with an average recall change of only -0.04 compared with much larger drops for fine-tuned DTR models.

Efficient End-to-End QA

Using compact sub-table context reduces token usage by more than 70% while still improving answer generation with pretrained LLMs such as Mistral, LLaMA3, and Qwen.


What We Contribute

Five distinct advances that make CRAFT practical and competitive.

🚫🔧
Training-Free Pipeline

No fine-tuning on NQ-Tables or any target dataset. Purely off-the-shelf pretrained models chained together.

🪜
Multi-Stage Cascade

Three progressive retrieval stages — sparse → dense → neural — each refining candidates from the previous stage.

🏆
Strong Retrieval Performance

Outperforms THYME, DTR, BIBERT+SPLADE, and all other baselines at R@10 and R@50 without any training.

🛡️
Robust to Paraphrasing

Under query perturbation, CRAFT loses only ~0.04 avg. recall points, while fine-tuned DTR drops 8–12 points.

📈
Efficient & Scalable

Mini-table context reduces token count by 70%+, enabling cost-effective inference at scale across large corpora.


The CRAFT Pipeline

Progressive filtering from 169k tables down to a precise top-k for answer generation — each stage more expressive than the last.

CRAFT Architecture Overview
🔍 Preprocessing: Query & Table Enrichment (Gemini 1.5 Flash)
  • Sub-question decomposition
  • Table title generation
  • Table descriptions
  • Row ranking (Sentence Transformer)
  Input: 169,898 tables
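As a hedged sketch of this enrichment step, the snippet below calls Gemini 1.5 Flash through the google-generativeai SDK; the prompt wording and the enrich_table / decompose_query helpers are our assumptions, not the paper's exact prompts.

table_enrichment_sketch.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: standard SDK setup
gemini = genai.GenerativeModel("gemini-1.5-flash")

def enrich_table(serialized_table: str) -> str:
    """Ask Gemini 1.5 Flash for a concise title and one-sentence description of a table."""
    prompt = (
        "Given the following table, write a concise title and a one-sentence "
        "description of its contents.\n\n" + serialized_table
    )
    return gemini.generate_content(prompt).text

def decompose_query(question: str) -> str:
    """Sub-question decomposition: split a complex question into simpler sub-questions."""
    prompt = "Decompose this question into simpler sub-questions, one per line:\n" + question
    return gemini.generate_content(prompt).text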
🔡 Stage 1: Sparse Lexical Retrieval (SPLADE)
  • Sparse expansion model
  • Indexes title + description + headers + cells
  • Query + sub-questions as input
  • R@5000 = 99.59%
  Output: 5,000 candidates
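A minimal Stage 1 sketch follows; the paper names SPLADE but not a specific checkpoint, so naver/splade-cocondenser-ensembledistil is an assumption, and the max-pooled log(1 + ReLU(logits)) expansion is the standard SPLADE formulation rather than code from the paper.

splade_stage1_sketch.py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
mlm = AutoModelForMaskedLM.from_pretrained(CKPT).eval()

@torch.no_grad()
def splade_vector(text: str) -> torch.Tensor:
    """Vocabulary-sized sparse expansion: max over tokens of log(1 + ReLU(logits))."""
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    logits = mlm(**enc).logits                      # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))       # SPLADE activation
    mask = enc["attention_mask"].unsqueeze(-1)      # zero out padding positions
    return (weights * mask).max(dim=1).values.squeeze(0)

def splade_score(query_text: str, table_text: str) -> float:
    """Lexical match between a (sub-)query and a table's title + description + headers + cells."""
    return float(splade_vector(query_text) @ splade_vector(table_text))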
🧠 Stage 2: Dense Semantic Reranking (all-mpnet-base-v2)
  • Mini-table representation
  • Top 5 rows per table
  • Dense embeddings
  • R@1000 = 98.91%
  Output: 1,000 candidates
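The dense reranking stage can be approximated with the sentence-transformers library as below; mini_tables is a hypothetical list holding the Stage 1 survivors serialized as mini-tables (top 5 rows each).

dense_stage2_sketch.py
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def dense_rerank(query: str, mini_tables: list[str], keep: int = 1000) -> list[int]:
    """Re-rank Stage 1 candidates by cosine similarity of dense mini-table embeddings."""
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    t_emb = encoder.encode(mini_tables, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, t_emb)[0]          # (num_candidates,)
    return scores.argsort(descending=True)[:keep].tolist()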
Stage 3: Neural Reranking (text-embedding-3-small)
  • Deep semantic reranking
  • Applied to the top 100 mini-tables
  • R@1 = 41.13%
  • R@10 = 87.16%
  Output: top-k tables
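Stage 3 can be sketched with the OpenAI embeddings API as below; batching, caching, and the use of plain cosine similarity are our assumptions.

neural_stage3_sketch.py
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-3-small and L2-normalize the vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def neural_rerank(query: str, mini_tables: list[str], keep: int = 10) -> list[int]:
    """Final reranking of the strongest Stage 2 mini-tables down to a compact top-k."""
    q = embed([query])[0]
    scores = embed(mini_tables) @ q      # cosine similarity, since vectors are normalized
    return list(np.argsort(-scores)[:keep])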
💬 Generation: Answer Generation (Mistral / LLaMA3 / Qwen)
  • Context: title + headers + top-5 rows
  • n = 1, 3, 5 retrieved tables
  • Zero-shot & few-shot prompting
  • F1 metric evaluation
  Output: generated answer
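A rough zero-shot prompt assembly for this step is shown below; the template wording is our assumption, since the paper does not print its exact prompt.

qa_prompt_sketch.py
def build_qa_prompt(question: str, retrieved_tables: list[str]) -> str:
    """Assemble a zero-shot TQA prompt from the top-n retrieved mini-tables.

    Each entry in retrieved_tables is a serialized mini-table (title + headers +
    top-5 rows). Few-shot variants simply prepend worked examples to this prompt.
    """
    context = "\n\n".join(
        f"Table {i + 1}:\n{table}" for i, table in enumerate(retrieved_tables)
    )
    return (
        "Answer the question using only the tables below. "
        "Reply with a short answer span.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# The resulting prompt can be fed to any chat-style LLM
# (e.g., Mistral, LLaMA3, or Qwen) through its usual inference API.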
Progressive Filtering
  Input: 169,898 tables → Stage 1: 5,000 tables → Stage 2: 1,000 tables → Stage 3: top-k tables
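Tying the stage sketches above together, a hypothetical end-to-end retrieval driver might look like the following; the 5,000 and 1,000 cutoffs follow the paper, while the table dictionary layout ("full_text" and "mini" fields) is our assumption.

craft_cascade_sketch.py
def craft_retrieve(query: str, tables: list[dict], k: int = 10) -> list[dict]:
    """Chain the sketches above: sparse (SPLADE) -> dense (MPNet) -> neural (OpenAI)."""
    # Stage 1: sparse lexical filtering over the full table text
    q_vec = splade_vector(query)
    s1 = sorted(
        tables,
        key=lambda t: float(q_vec @ splade_vector(t["full_text"])),
        reverse=True,
    )[:5000]
    # Stage 2: dense semantic reranking on the mini-table representations
    s2 = [s1[i] for i in dense_rerank(query, [t["mini"] for t in s1], keep=1000)]
    # Stage 3: neural reranking down to the final top-k tables
    return [s2[i] for i in neural_rerank(query, [t["mini"] for t in s2], keep=k)]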

Retrieval Performance

CRAFT outperforms all fine-tuned retrievers at R@10 and R@50 on the NQ-Tables benchmark.

📊 NQ-Tables Retrieval Metrics
Model               R@1     R@10    R@50
Sparse
  BM25              18.49   36.94   52.61
  SPLADE            39.84   83.33   94.65
Dense
  DPR               45.32   85.84   95.44
  TAPAS             43.79   83.49   95.10
  DTR               32.62   75.86   89.77
  T-RAG*            46.07   85.40   95.03
Hybrid
  DHR               43.67   84.65   95.62
  BIBERT+SPLADE     45.62   86.72   95.62
  THYME             48.55   86.38   96.08
CRAFT (Ours, Training-Free)
  🏆 CRAFT          41.13   87.16   96.84
* All models except CRAFT are trained on NQ-Tables.
Recall@10 · Model Comparison
  BM25 36.94 · SPLADE 83.33 · DPR 85.84 · THYME ✦ 86.38 · CRAFT 🏆 87.16
  (✦ = trained on NQ-Tables; CRAFT requires no training)
Stage-wise Recall Progress (R@10)
  Stage 1: 72.90 → + Stage 2: 82.91 → + Stage 3: 87.16
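For reference, the recall figures above count a query as a hit when its gold table appears among the top k retrieved ids; a minimal scorer, assuming one gold table per query, is:

recall_at_k_sketch.py
def recall_at_k(ranked_ids: dict[str, list[str]], gold_id: dict[str, str], k: int) -> float:
    """Percentage of queries whose gold table id is among the top-k retrieved ids."""
    hits = sum(1 for q, ids in ranked_ids.items() if gold_id[q] in ids[:k])
    return 100.0 * hits / len(ranked_ids)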

F1 Performance Across LLMs

CRAFT paired with off-the-shelf LLMs consistently surpasses all fine-tuned baselines.

📊 F1 at n=3 Retrieved Tables
Comparison against THYME and other baselines (zero-shot)
  BIBERT                     33.16
  SPLADE                     37.17
  THYME                      39.16
  CRAFT (LLaMA3)             44.06
  CRAFT (Mistral)            46.04
  CRAFT + 5-shot (Mistral)   48.50
🔬 Larger LLMs (F1, n=5)
CRAFT with instruction-tuned larger models vs. dataset-specific baselines
  RAG (Lewis et al.)         39.67
  DPR-RAGE                   49.68
  LI-RAGE                    54.17
  CRAFT + Llama-3.1-70B      56.94
  CRAFT + Mistral-Small      57.14
💡
Key takeaway: CRAFT with Mistral-Small-Instruct (n=5) achieves F1 of 57.14, clearly surpassing LI-RAGE (54.17) — the previous best — which requires dataset-specific training and deeper retrieval. Few-shot prompting adds a modest but consistent boost, especially when more retrieved tables are included.
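The F1 scores above are token-overlap scores between generated and gold answers; a standard SQuAD-style scorer (our assumption, since the paper does not reproduce its implementation) looks like this.

token_f1_sketch.py
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)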

Why CRAFT Matters

Practical advantages that make CRAFT deployable in real-world scenarios.

🚫
No Fine-Tuning

Zero dataset-specific training. Deploy immediately on new domains without labeled examples.

🔌
Off-the-Shelf Models

Uses publicly available SPLADE, Sentence Transformers, and OpenAI embeddings. No proprietary stack needed.

📈
Scales to 170k+ Tables

Efficient cascaded filtering handles massive corpora without quadratic cost. Mini-tables cut token use by 70%.

🛡️
Query Robust

Only −0.04 avg recall drop under paraphrasing. Fine-tuned competitors fall 8–12 points on the same perturbations.



Honest Assessment

Areas where CRAFT has room to grow and open questions for future work.

⏱ Multi-Stage Latency

The three-stage cascade introduces additional inference steps compared to a single-model retriever. While each stage is lightweight, the sequential nature adds latency in real-time applications. Pipeline parallelization could address this in future work.

📋 Benchmark Scope

Evaluation is primarily conducted on the NQ-Tables dataset. Broader validation on other TQA benchmarks (OTT-QA, HybridQA, etc.) would better establish generalizability across diverse table formats and domains.

📖 Future Directions
Future work could explore: (1) Replacing SPLADE with an even lighter sparse model for the first stage; (2) Joint optimization of the pipeline without full fine-tuning (e.g., lightweight adapter tuning); (3) Extension to multi-hop table QA where multiple tables are needed to answer a single question; (4) Evaluation on non-English TQA datasets to assess multilingual robustness.

Cite This Work

If CRAFT is useful for your research, please consider citing our paper.

craft2025.bib
@article{craft2025,
  title   = {CRAFT: Training-Free Cascaded Retrieval for Tabular QA},
  author  = {Singh, Adarsh and Bhandari, Kushal Raj and Gao, Jianxi
           and Dan, Soham and Gupta, Vivek},
  year    = {2025},
  eprint  = {2505.14984},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}