arXiv 2025  ·  NLP / Table QA

CRAFT: Training-Free
Cascaded Retrieval
for Tabular QA

A modular, training-free cascaded retrieval framework for scalable tabular question answering — no fine-tuning required.

Adarsh Singh*1 · Kushal Raj Bhandari*2 · Jianxi Gao2 · Soham Dan†3 · Vivek Gupta†1
1 Arizona State University 2 Rensselaer Polytechnic Institute 3 Microsoft

* Equal contribution  ·  † Joint supervision

Table QA · Retrieval · LLMs · Training-Free

The Problem & The Solution

Table Question Answering (TQA) requires identifying the right table from a massive corpus before reasoning over its structure to derive answers. Existing methods like DTR and ColBERT are computationally expensive and need dataset-specific fine-tuning — limiting adaptability to new domains.

We propose CRAFT, a cascaded retrieval framework that chains off-the-shelf models in three progressive stages: sparse lexical retrieval (SPLADE) → dense semantic reranking (Sentence Transformer) → neural reranking (text-embedding-3). Table representations are enriched with LLM-generated titles and descriptions via Gemini 1.5 Flash.

CRAFT matches or outperforms SOTA fine-tuned retrievers on the NQ-Tables benchmark — without training a single parameter on the target dataset. End-to-end QA results with Mistral, LLaMA3, and Qwen demonstrate its effectiveness across diverse LLM backends.

📌 Full Abstract Details
CRAFT processes 169,898 tables through three retrieval stages, ultimately producing a compact top-k set for answer generation. Query expansion via sub-question decomposition maximizes semantic alignment. The Sentence Transformer operates on mini-tables (top 5 rows per table) to reduce context noise and token count. End-to-end evaluation uses 919 unique queries from the NQ-Tables benchmark, with F1 score as the primary metric.
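As a rough illustration of the mini-table idea, the sketch below (with the hypothetical helper serialize_mini_table and pandas as an assumed table representation) keeps only the title, description, headers, and top rows; in CRAFT the kept rows are the five ranked highest by a Sentence Transformer, which this sketch stands in for by simply taking the first five.

mini_table_sketch.py
import pandas as pd

def serialize_mini_table(df: pd.DataFrame, title: str, description: str, n_rows: int = 5) -> str:
    """Flatten a table into a compact text block: title, description, headers, top rows.

    CRAFT ranks rows with a Sentence Transformer before truncating; taking the
    first n_rows here is only a placeholder for that ranking step.
    """
    header = " | ".join(map(str, df.columns))
    rows = [" | ".join(map(str, r)) for r in df.head(n_rows).itertuples(index=False)]
    return "\n".join([f"Title: {title}", f"Description: {description}", header, *rows])

# Toy usage
toy = pd.DataFrame({"Country": ["France", "Japan"], "Capital": ["Paris", "Tokyo"]})
print(serialize_mini_table(toy, "National capitals", "Countries and their capital cities"))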
📊 169,898 tables in the NQ-Tables corpus
🎯 87.16 Recall@10 (strongest retrieval among compared methods)
🎯 96.84 Recall@50 (strongest deep recall)
🔧 Zero domain-specific training required

What Makes CRAFT Stand Out

A concise snapshot of the paper's main takeaways, performance gains, and practical advantages.

🪜
Three-Stage Retrieval Cascade

Sparse filtering, dense semantic reranking, and neural reranking work together to progressively narrow candidates without dataset-specific training.

🏆
Strong Retrieval Quality

The paper reports 41.13 Recall@1, 87.16 Recall@10, and 96.84 Recall@50 on NQ-Tables, exceeding fine-tuned baselines at deeper cutoffs.

🛡️
Robust to Query Changes

CRAFT remains stable under paraphrased questions, with an average recall change of only -0.04 compared with much larger drops for fine-tuned DTR models.

Efficient End-to-End QA

Using compact sub-table context reduces token usage by more than 70% while still improving answer generation with pretrained LLMs such as Mistral, LLaMA3, and Qwen.


What We Contribute

Five distinct advances that make CRAFT practical and competitive.

🚫🔧
Training-Free Pipeline

No fine-tuning on NQ-Tables or any target dataset. Purely off-the-shelf pretrained models chained together.

🪜
Multi-Stage Cascade

Three progressive retrieval stages — sparse → dense → neural — each refining candidates from the previous stage.

🏆
Strong Retrieval Performance

Outperforms THYME, DTR, BIBERT+SPLADE, and all other baselines at R@10 and R@50 without any training.

🛡️
Robust to Paraphrasing

Under query perturbation, CRAFT loses only ~0.04 avg. recall points, while fine-tuned DTR drops 8–12 points.

📈
Efficient & Scalable

Mini-table context reduces token count by 70%+, enabling cost-effective inference at scale across large corpora.


The CRAFT Pipeline

Progressive filtering from 169k tables down to a precise top-k for answer generation — each stage more expressive than the last.

CRAFT Architecture Overview
🔍 Preprocessing: Query & Table Enrichment (Gemini 1.5 Flash)
  • Sub-question decomposition
  • Table title generation
  • Table descriptions
  • Row ranking (Sentence Transformer)
  Input: 169,898 tables
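As a hedged sketch of this enrichment step, the snippet below calls Gemini 1.5 Flash through the google-generativeai SDK; the prompt wording and the enrich_table / decompose_query helpers are our assumptions, not the paper's exact prompts.

table_enrichment_sketch.py
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: standard SDK setup
gemini = genai.GenerativeModel("gemini-1.5-flash")

def enrich_table(serialized_table: str) -> str:
    """Ask Gemini 1.5 Flash for a concise title and one-sentence description of a table."""
    prompt = (
        "Given the following table, write a concise title and a one-sentence "
        "description of its contents.\n\n" + serialized_table
    )
    return gemini.generate_content(prompt).text

def decompose_query(question: str) -> str:
    """Sub-question decomposition: split a complex question into simpler sub-questions."""
    prompt = "Decompose this question into simpler sub-questions, one per line:\n" + question
    return gemini.generate_content(prompt).text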
🔡 Stage 1: Sparse Lexical Retrieval (SPLADE)
  • Sparse expansion model
  • Indexes title + description + headers + cells
  • Query + sub-questions as input
  • R@5000 = 99.59%
  Output: 5,000 candidates
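A minimal Stage 1 sketch follows; the paper names SPLADE but not a specific checkpoint, so naver/splade-cocondenser-ensembledistil is an assumption, and the max-pooled log(1 + ReLU(logits)) expansion is the standard SPLADE formulation rather than code from the paper.

splade_stage1_sketch.py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(CKPT)
mlm = AutoModelForMaskedLM.from_pretrained(CKPT).eval()

@torch.no_grad()
def splade_vector(text: str) -> torch.Tensor:
    """Vocabulary-sized sparse expansion: max over tokens of log(1 + ReLU(logits))."""
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    logits = mlm(**enc).logits                      # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))       # SPLADE activation
    mask = enc["attention_mask"].unsqueeze(-1)      # zero out padding positions
    return (weights * mask).max(dim=1).values.squeeze(0)

def splade_score(query_text: str, table_text: str) -> float:
    """Lexical match between a (sub-)query and a table's title + description + headers + cells."""
    return float(splade_vector(query_text) @ splade_vector(table_text))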
🧠 Stage 2: Dense Semantic Reranking (all-mpnet-base-v2)
  • Mini-table representation
  • Top 5 rows per table
  • Dense embeddings
  • R@1000 = 98.91%
  Output: 1,000 candidates
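The dense reranking stage can be approximated with the sentence-transformers library as below; mini_tables is a hypothetical list holding the Stage 1 survivors serialized as mini-tables (top 5 rows each).

dense_stage2_sketch.py
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def dense_rerank(query: str, mini_tables: list[str], keep: int = 1000) -> list[int]:
    """Re-rank Stage 1 candidates by cosine similarity of dense mini-table embeddings."""
    q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    t_emb = encoder.encode(mini_tables, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, t_emb)[0]          # (num_candidates,)
    return scores.argsort(descending=True)[:keep].tolist()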
Stage 3: Neural Reranking (text-embedding-3-small)
  • Deep semantic reranking
  • Applied to the top 100 mini-tables
  • R@1 = 41.13%
  • R@10 = 87.16%
  Output: top-k tables
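Stage 3 can be sketched with the OpenAI embeddings API as below; batching, caching, and the use of plain cosine similarity are our assumptions.

neural_stage3_sketch.py
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-3-small and L2-normalize the vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def neural_rerank(query: str, mini_tables: list[str], keep: int = 10) -> list[int]:
    """Final reranking of the strongest Stage 2 mini-tables down to a compact top-k."""
    q = embed([query])[0]
    scores = embed(mini_tables) @ q      # cosine similarity, since vectors are normalized
    return list(np.argsort(-scores)[:keep])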
💬 Generation: Answer Generation (Mistral / LLaMA3 / Qwen)
  • Context: title + headers + top-5 rows
  • n = 1, 3, 5 retrieved tables
  • Zero-shot & few-shot prompting
  • F1 metric evaluation
  Output: generated answer
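A rough zero-shot prompt assembly for this step is shown below; the template wording is our assumption, since the paper does not print its exact prompt.

qa_prompt_sketch.py
def build_qa_prompt(question: str, retrieved_tables: list[str]) -> str:
    """Assemble a zero-shot TQA prompt from the top-n retrieved mini-tables.

    Each entry in retrieved_tables is a serialized mini-table (title + headers +
    top-5 rows). Few-shot variants simply prepend worked examples to this prompt.
    """
    context = "\n\n".join(
        f"Table {i + 1}:\n{table}" for i, table in enumerate(retrieved_tables)
    )
    return (
        "Answer the question using only the tables below. "
        "Reply with a short answer span.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# The resulting prompt can be fed to any chat-style LLM
# (e.g., Mistral, LLaMA3, or Qwen) through its usual inference API.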
Progressive Filtering
  Input: 169,898 tables → Stage 1: 5,000 tables → Stage 2: 1,000 tables → Stage 3: top-k tables
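Tying the stage sketches above together, a hypothetical end-to-end retrieval driver might look like the following; the 5,000 and 1,000 cutoffs follow the paper, while the table dictionary layout ("full_text" and "mini" fields) is our assumption.

craft_cascade_sketch.py
def craft_retrieve(query: str, tables: list[dict], k: int = 10) -> list[dict]:
    """Chain the sketches above: sparse (SPLADE) -> dense (MPNet) -> neural (OpenAI)."""
    # Stage 1: sparse lexical filtering over the full table text
    q_vec = splade_vector(query)
    s1 = sorted(
        tables,
        key=lambda t: float(q_vec @ splade_vector(t["full_text"])),
        reverse=True,
    )[:5000]
    # Stage 2: dense semantic reranking on the mini-table representations
    s2 = [s1[i] for i in dense_rerank(query, [t["mini"] for t in s1], keep=1000)]
    # Stage 3: neural reranking down to the final top-k tables
    return [s2[i] for i in neural_rerank(query, [t["mini"] for t in s2], keep=k)]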

Retrieval Performance

CRAFT outperforms all fine-tuned retrievers at R@10 and R@50 on the NQ-Tables benchmark.

📊 NQ-Tables Retrieval Metrics
Model               R@1     R@10    R@50
Sparse
  BM25              18.49   36.94   52.61
  SPLADE            39.84   83.33   94.65
Dense
  DPR               45.32   85.84   95.44
  TAPAS             43.79   83.49   95.10
  DTR               32.62   75.86   89.77
  T-RAG*            46.07   85.40   95.03
Hybrid
  DHR               43.67   84.65   95.62
  BIBERT+SPLADE     45.62   86.72   95.62
  THYME             48.55   86.38   96.08
CRAFT (Ours, Training-Free)
  🏆 CRAFT          41.13   87.16   96.84
* All models except CRAFT are trained on NQ-Tables.
Recall@10 · Model Comparison
  BM25 36.94 · SPLADE 83.33 · DPR 85.84 · THYME ✦ 86.38 · CRAFT 🏆 87.16
  (✦ = trained on NQ-Tables; CRAFT requires no training)
Stage-wise Recall Progress (R@10)
  Stage 1: 72.90 → + Stage 2: 82.91 → + Stage 3: 87.16
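For reference, the recall figures above count a query as a hit when its gold table appears among the top k retrieved ids; a minimal scorer, assuming one gold table per query, is:

recall_at_k_sketch.py
def recall_at_k(ranked_ids: dict[str, list[str]], gold_id: dict[str, str], k: int) -> float:
    """Percentage of queries whose gold table id is among the top-k retrieved ids."""
    hits = sum(1 for q, ids in ranked_ids.items() if gold_id[q] in ids[:k])
    return 100.0 * hits / len(ranked_ids)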

F1 Performance Across LLMs

CRAFT paired with off-the-shelf LLMs consistently surpasses all fine-tuned baselines.

📊 F1 at n=3 Retrieved Tables
Comparison against THYME and other baselines (zero-shot)
  BIBERT                     33.16
  SPLADE                     37.17
  THYME                      39.16
  CRAFT (LLaMA3)             44.06
  CRAFT (Mistral)            46.04
  CRAFT + 5-shot (Mistral)   48.50
🔬 Larger LLMs (F1, n=5)
CRAFT with instruction-tuned larger models vs. dataset-specific baselines
  RAG (Lewis et al.)         39.67
  DPR-RAGE                   49.68
  LI-RAGE                    54.17
  CRAFT + Llama-3.1-70B      56.94
  CRAFT + Mistral-Small      57.14
💡
Key takeaway: CRAFT with Mistral-Small-Instruct (n=5) achieves F1 of 57.14, clearly surpassing LI-RAGE (54.17) — the previous best — which requires dataset-specific training and deeper retrieval. Few-shot prompting adds a modest but consistent boost, especially when more retrieved tables are included.
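The F1 scores above are token-overlap scores between generated and gold answers; a standard SQuAD-style scorer (our assumption, since the paper does not reproduce its implementation) looks like this.

token_f1_sketch.py
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)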

Why CRAFT Matters

Practical advantages that make CRAFT deployable in real-world scenarios.

🚫
No Fine-Tuning

Zero dataset-specific training. Deploy immediately on new domains without labeled examples.

🔌
Off-the-Shelf Models

Uses publicly available SPLADE, Sentence Transformers, and OpenAI embeddings. No proprietary stack needed.

📈
Scales to 170k+ Tables

Efficient cascaded filtering handles massive corpora without quadratic cost. Mini-tables cut token use by 70%.

🛡️
Query Robust

Only −0.04 avg recall drop under paraphrasing. Fine-tuned competitors fall 8–12 points on the same perturbations.



Honest Assessment

Areas where CRAFT has room to grow and open questions for future work.

⏱ Multi-Stage Latency

The three-stage cascade introduces additional inference steps compared to a single-model retriever. While each stage is lightweight, the sequential nature adds latency in real-time applications. Pipeline parallelization could address this in future work.

📋 Benchmark Scope

Evaluation is primarily conducted on the NQ-Tables dataset. Broader validation on other TQA benchmarks (OTT-QA, HybridQA, etc.) would better establish generalizability across diverse table formats and domains.

📖 Future Directions
Future work could explore: (1) Replacing SPLADE with an even lighter sparse model for the first stage; (2) Joint optimization of the pipeline without full fine-tuning (e.g., lightweight adapter tuning); (3) Extension to multi-hop table QA where multiple tables are needed to answer a single question; (4) Evaluation on non-English TQA datasets to assess multilingual robustness.

Cite This Work

If CRAFT is useful for your research, please consider citing our paper.

craft2025.bib
@article{craft2025,
  title   = {CRAFT: Training-Free Cascaded Retrieval for Tabular QA},
  author  = {Singh, Adarsh and Bhandari, Kushal Raj and Gao, Jianxi
           and Dan, Soham and Gupta, Vivek},
  year    = {2025},
  eprint  = {2505.14984},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}