Validating The LLM Validators with Shreya

Paper Club · LLMs · Evals

Published: September 25, 2024
Modified: November 21, 2024

Paper Club | 2024 Week 39

We discuss “Who Validates the Validators?” (Shankar et al., 2024), covering the challenges of LLM validators and emphasizing the iterative nature of defining good evaluation criteria and aligning LLM judges to those criteria.

Paper Presentation

  • Eugene explained the core idea: evaluating LLM outputs requires looking at the data before setting evaluation criteria.
  • He outlined the EvalGen design, a workflow that assists developers in creating LLM evaluators (a minimal sketch follows this list). It involves:
    • Starting from the developer’s prompts for generation and evaluation
    • Generating candidate criteria based on the prompt
    • Running each criterion through the LLM as a pass/fail assertion
    • Grading sample outputs by hand
    • Checking the LLM’s alignment with the human grades
  • He discussed the results of using EvalGen, highlighting its advantages over previous methods like SPADE.
  • He presented the findings of a user study with industry practitioners, emphasizing the importance of feedback loops and user control.
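
To make the workflow concrete, here is a minimal sketch of the loop in Python. The call_llm stub, the criteria, the sample outputs, and the human labels are all hypothetical stand-ins; only the shape of the workflow follows the paper.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; a toy heuristic so the sketch runs."""
    return "pass" if "polite" not in prompt or "please" in prompt.lower() else "fail"

# Criteria generated from the task prompt (hardcoded here for illustration).
criteria = ["Response is polite", "Response answers the question"]

# Sample outputs to grade (hypothetical).
outputs = ["Please find the report attached.", "No."]

# Run each criterion through the LLM as a binary pass/fail assertion.
llm_grades = {
    c: [call_llm(f"Criterion: {c}\nOutput: {o}\nAnswer pass or fail.") == "pass"
        for o in outputs]
    for c in criteria
}

# Human grades on the same samples (hypothetical labels).
human_grades = [True, False]

# Check each criterion's alignment with the human grades.
for c, grades in llm_grades.items():
    agreement = sum(g == h for g, h in zip(grades, human_grades)) / len(outputs)
    print(f"{c}: {agreement:.0%} agreement with human grades")
```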

Q&A and Discussion

  • Attendees asked questions about:
    • The meaning of “alignment” in the context of LLM evaluation (see the metric sketch after this list).
    • The challenges of maintaining consistency in human grading.
    • The feasibility of using LLMs as validators in production.
  • Eugene clarified those points and shared his experience from the user study.
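
In the paper, “alignment” is quantified by comparing the evaluator’s pass/fail decisions against human grades. Below is a hedged sketch along the lines of the coverage and false-failure-rate metrics EvalGen reports; the grades are hypothetical.

```python
# Human grades: True means the human judged the output good.
human_pass = [True, True, False, False, True]
# The LLM evaluator's verdicts on the same outputs.
llm_pass = [True, False, False, True, True]

bad = [i for i, h in enumerate(human_pass) if not h]
good = [i for i, h in enumerate(human_pass) if h]

# Coverage: of the outputs humans failed, how many did the evaluator also fail?
coverage = sum(not llm_pass[i] for i in bad) / len(bad)

# False failure rate: of the outputs humans passed, how many were wrongly failed?
ffr = sum(not llm_pass[i] for i in good) / len(good)

print(f"coverage={coverage:.0%}, false failure rate={ffr:.0%}")
# -> coverage=50%, false failure rate=33%
```

A well-aligned evaluator maximizes coverage while keeping the false failure rate low.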

Eugene’s Prototype Demo

  • Eugene showcased a prototype he built to facilitate data labeling, evaluation, and prompt optimization.
  • He demonstrated the labeling mode, evaluation mode, and optimization mode of his tool.
  • Attendees were impressed and gave positive feedback.

Discussion with Shreya Shankar

  • Shreya joined the meeting and answered a question about incorporating natural language feedback in the EvalGen workflow.
  • She discussed how good task decomposition impacts evaluation, suggesting that developers evaluate each component separately and perform bottleneck analysis to find the weakest stage (sketched below).
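
A minimal sketch of that idea: score each stage of a decomposed pipeline separately, then treat the lowest-scoring stage as the bottleneck. The stage names and pass/fail results below are hypothetical.

```python
# Per-sample pass/fail results for each pipeline stage (hypothetical).
results = {
    "retrieve":  [True, True, True, False, True],
    "summarize": [True, False, True, False, True],
    "format":    [True, True, True, True, True],
}

# Evaluate each component in isolation.
pass_rates = {stage: sum(r) / len(r) for stage, r in results.items()}

# Bottleneck analysis: the weakest stage caps end-to-end quality.
bottleneck = min(pass_rates, key=pass_rates.get)

for stage, rate in pass_rates.items():
    print(f"{stage}: {rate:.0%}")
print(f"bottleneck: {bottleneck}")  # -> summarize (60%)
```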

Wrap Up

  • The meeting ended with a brief discussion of upcoming Paper Club sessions, potentially focusing on function calling and involving researchers from Berkeley.

Quotables

“When the data and the anecdotes disagree, I tend to trust the anecdotes.” - Jeff Bezos quoted by Eugene

Eugene invokes Bezos to argue for prioritizing individual data samples over aggregated metrics.

“They gave it a bad grade, not because the output is bad, but because they wanted to be consistent with their previous grades.”

This exposes a flaw in human labeling: anchoring bias. Graders drift toward consistency with their earlier judgments, which calls into question the reliability of even meticulously labeled datasets, especially when the initial criteria were flawed.