FreshStack Leaderboard

Realistic Retrieval Benchmarking on Technical Documentation

FreshStack is a holistic framework for building realistic and challenging RAG benchmarks from community-asked questions and answers in niche, fast-growing domains. FreshStack evaluates retrieval models on five domains: LangChain, Yolo v7 & v8, Laravel 10 & 11, Angular 16, 17 & 18, and Godot 4. Metrics include alpha-nDCG@10, Coverage@20, and Recall@50.
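As a rough illustration of the diversity-aware metrics above, here is a minimal sketch of alpha-nDCG@k (Clarke et al., 2008) and a nugget-coverage measure in the spirit of Coverage@k. The data layout (`qrels` mapping each document to the set of answer nuggets it supports) and all names are illustrative assumptions, not FreshStack's actual evaluation code.

```python
# Sketch of alpha-nDCG@k and nugget Coverage@k.
# qrels: doc id -> set of nugget ids that document supports (illustrative layout,
# not FreshStack's real data format).
from math import log2

def alpha_dcg(ranking, qrels, alpha=0.5, k=10):
    """Discounted cumulative gain where a nugget's gain decays by
    (1 - alpha) each time it has already been covered higher in the ranking."""
    seen = {}  # nugget id -> number of times covered so far
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in qrels.get(doc, ()))
        for n in qrels.get(doc, ()):
            seen[n] = seen.get(n, 0) + 1
        score += gain / log2(rank + 1)
    return score

def alpha_ndcg(ranking, qrels, alpha=0.5, k=10):
    """Normalize by a greedily built ideal ranking (the exact ideal is NP-hard)."""
    remaining = [d for d in qrels if qrels[d]]
    seen, ideal = {}, []
    while remaining and len(ideal) < k:
        best = max(remaining,
                   key=lambda d: sum((1 - alpha) ** seen.get(n, 0)
                                     for n in qrels[d]))
        ideal.append(best)
        remaining.remove(best)
        for n in qrels[best]:
            seen[n] = seen.get(n, 0) + 1
    denom = alpha_dcg(ideal, qrels, alpha, k)
    return alpha_dcg(ranking, qrels, alpha, k) / denom if denom else 0.0

def coverage_at_k(ranking, qrels, k=20):
    """Fraction of all judged nuggets covered at least once in the top-k docs."""
    all_nuggets = set().union(*qrels.values()) if qrels else set()
    covered = set()
    for doc in ranking[:k]:
        covered |= qrels.get(doc, set())
    return len(covered) / len(all_nuggets) if all_nuggets else 0.0
```

A ranking that surfaces documents covering distinct nuggets early scores higher under alpha-nDCG than one that repeats the same nugget, which is the property that makes it a natural fit for multi-nugget technical questions.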

Paper Code Dataset Project Home

FreshStack Metrics vs. Model Parameters

Average scores across 5 domains vs. model parameter size; points are colored by model family.

FreshStack Metrics vs. Model Release Date

Average scores across 5 domains vs. model release date; points are colored by model family.

Cite FreshStack

@inproceedings{thakur2025freshstack,
  title={FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents},
  author={Nandan Thakur and Jimmy Lin and Sam Havens and Michael Carbin and Omar Khattab and Andrew Drozdov},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=54TTgXlS2U}
}