Yourbench: a new way to evaluate AI models using actual data.

Every AI model release includes charts touting how it outperformed its competitors on this benchmark test or that evaluation matrix. But those benchmarks typically test general capabilities, which makes it difficult for organizations to determine how well a model or large language model-based agent understands their specific needs. Yourbench is an open-source benchmarking tool that allows developers and enterprises to create their own benchmarks to test model performance against their internal data. Sumuk Shashidhar announced Yourbench, a tool developed by the Hugging Face evaluations team, on X, saying it allows for “custom benchmarking, and synthetic data generation based on ANY of your documents.” He added: “Yourbench lets you evaluate models on what matters to you.”
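A typical run of a tool like this is driven by a configuration file that points at your documents and the models you want to compare. The sketch below shows roughly what that could look like from Python; the config keys and the “yourbench run” command are illustrative assumptions rather than the project’s documented interface, so check the Yourbench repository for the real schema.

```python
# A minimal sketch of driving a Yourbench-style run from Python. The config
# keys and the "yourbench run" command are illustrative assumptions, not the
# project's documented interface -- check the Yourbench repo for the real schema.
import subprocess
import yaml

config = {
    "documents": ["./internal_docs/"],   # source files to benchmark against (assumed key)
    "models": [                          # candidate models to compare (assumed key)
        "Qwen/Qwen2.5-72B-Instruct",
        "google/gemini-2.0-flash",
    ],
    "pipeline": [                        # the stages described later in this article
        "ingestion",
        "chunking",
        "summarization",
        "question_generation",
    ],
}

with open("yourbench_config.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Launch the run (assumed CLI entry point).
subprocess.run(["yourbench", "run", "--config", "yourbench_config.yaml"], check=True)
```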

Creating custom evaluations

Hugging Face said in a paper that Yourbench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark “using minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.”
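That “preserving the relative model performance rankings” claim is easy to check mechanically: rank the models on the original benchmark and on the cheap replica, then compare the two orderings with a rank correlation. The scores below are invented purely for illustration:

```python
# Sanity check of the "rankings preserved" claim: compare model orderings on
# the original benchmark and on the replica. Scores here are made up solely
# to illustrate the idea.
from scipy.stats import spearmanr

mmlu_scores    = [0.82, 0.74, 0.69, 0.61]  # accuracy on the original MMLU subset
replica_scores = [0.55, 0.48, 0.44, 0.37]  # accuracy on the generated replica

rho, _ = spearmanr(mmlu_scores, replica_scores)
print(f"Spearman rank correlation: {rho:.2f}")  # 1.00 -> identical rankings
```

Absolute scores will usually differ on the synthetic questions; what matters for model selection is that the ordering holds, which is exactly what the rank correlation measures.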

Organizations need to pre-process their documents before Yourbench can work with them. This involves three stages:

  • Document Ingestion to “normalize” file formats.
  • Semantic Chunking to break the documents down to meet context window limits and focus the model’s attention (a rough sketch of this step appears below).
  • Document Summarization.

Next comes the question-and-answer generation process, which creates questions from the information in the documents. Users can then bring in their LLM of choice and see which model answers those questions best.
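To make the chunking stage concrete, here is a toy chunker, not Yourbench’s actual implementation, that splits a document into word-capped pieces while trying to keep paragraphs together:

```python
# A naive illustration of the chunking idea: split a long document into
# word-capped segments so each piece fits a model's context window.
# Yourbench's real semantic chunker is more sophisticated; this only
# sketches the concept. Note: a paragraph longer than max_words simply
# becomes its own oversized chunk in this toy version.
def chunk_document(text: str, max_words: int = 300) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in text.split("\n\n"):  # try to keep paragraphs intact
        words = paragraph.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "\n\n".join([
    "Paragraph one about the refund policy. " * 3,
    "Paragraph two about shipping terms. " * 3,
    "Paragraph three about warranty coverage. " * 3,
])
for i, chunk in enumerate(chunk_document(doc, max_words=20)):
    print(f"chunk {i}: {len(chunk.split())} words")
```

A production chunker would also split on sentence or topic boundaries rather than relying on word counts alone.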
Hugging Face tested Yourbench with the DeepSeek V3 and R1 models; Alibaba’s Qwen models, including the reasoning model Qwen QwQ; Mistral Large 2411 and Mistral Small 3.1; Llama 3.1 and Llama 3.3; Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3; GPT-4o, GPT-4o mini and o3-mini; and Claude 3.7 Sonnet and Claude 3.5 Haiku.

Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash “produce tremendous value for very very low costs.”
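Cost comparisons like this come down to simple per-token arithmetic. As an illustration only, with placeholder prices that are not Hugging Face’s or any provider’s actual rates, a benchmark run’s inference cost can be estimated like so:

```python
# Back-of-the-envelope inference cost for a benchmark run. The per-token
# prices are hypothetical placeholders, not any provider's actual rates.
PRICE_PER_M_INPUT = 0.10   # USD per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 0.40  # USD per million output tokens (assumed)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT + \
           (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# e.g. generating 2,000 questions at ~1,500 input and ~300 output tokens each
print(f"${run_cost(2_000 * 1_500, 2_000 * 300):.2f}")  # -> $0.54
```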

Compute limitations

However, creating custom LLM benchmarks from an organization’s documents comes at a cost: Yourbench is a computationally intensive program. Shashidhar said on X that the company is “adding capacity” as fast as it can.

Hugging Face runs several of its own GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench’s compute usage.

Benchmarking is not perfect

Benchmarks and other evaluation methods give users an idea of how well models perform, but they do not perfectly capture how those models will perform in day-to-day use.

Some have even voiced skepticism that benchmark tests reveal models’ real limitations, warning that they can lead to false conclusions about safety and performance. One study also cautioned that benchmarking agents could be “misleading.”

However, enterprises cannot avoid evaluating models now that there are so many choices on the market, and technology leaders must justify the rising cost of using AI models. This pressure has led to the development of different methods for testing model reliability and performance.

Google introduced FACTS Grounding to test a model’s accuracy in generating factual responses grounded in information from documents. Researchers from Yale University and Tsinghua University have developed self-invoking code benchmarks to help enterprises decide whether coding LLMs are right for them.