Paper Club | 2024 Week 39
We discuss the challenges of LLM validators, emphasizing the iterative nature of defining good evaluation criteria and aligning LLMs to those criteria.
Paper Presentation
- Eugene explained the core idea: evaluating LLMs requires looking at the data before setting criteria.
- He outlined the EvalGen design, a workflow to assist developers in creating LLM evaluators. It involves:
  - Prompting for generation and evaluation
  - Generating criteria based on the prompt
  - Running the criteria through the LLM
  - Grading sample data
  - Checking the LLM’s alignment with human grading (see the sketch after these notes)
- He discussed the results of using EvalGen, highlighting its advantages over previous methods like SPADE.
- He presented the findings of a user study with industry practitioners, emphasizing the importance of feedback loops and user control.
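To make the alignment step concrete, here is a minimal sketch of comparing an LLM grader’s pass/fail verdicts against human grades, criterion by criterion. The function name, data shapes, and the 0.8 threshold are illustrative assumptions, not EvalGen’s actual implementation.

```python
# Minimal sketch: per-criterion agreement between an LLM grader and human grades.
# Criteria whose graders agree with humans often enough are kept; the rest would
# be revised or re-prompted. All names and the threshold are illustrative.

def alignment(llm_grades: dict, human_grades: dict) -> dict:
    """Fraction of sampled outputs where the LLM grader and the human agree,
    computed per criterion. Both inputs map criterion -> {sample_id: bool}."""
    scores = {}
    for criterion, judgments in llm_grades.items():
        humans = human_grades[criterion]
        agree = sum(judgments[s] == humans[s] for s in humans)
        scores[criterion] = agree / len(humans)
    return scores

# Toy data: two criteria graded on three sampled outputs.
llm = {
    "is_concise":   {"s1": True, "s2": False, "s3": True},
    "cites_source": {"s1": True, "s2": True,  "s3": False},
}
human = {
    "is_concise":   {"s1": True, "s2": False, "s3": False},
    "cites_source": {"s1": True, "s2": True,  "s3": False},
}

scores = alignment(llm, human)
selected = [c for c, s in scores.items() if s >= 0.8]  # keep well-aligned criteria
print(scores)    # {'is_concise': 0.666..., 'cites_source': 1.0}
print(selected)  # ['cites_source']
```

Criteria that fall below the threshold would be reworded or their graders re-prompted, which is what makes the workflow iterative.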
Q&A and Discussion
- Attendees asked questions about:
  - The meaning of “alignment” in the context of LLM evaluation.
  - The challenges of maintaining consistency in human grading.
  - The feasibility of using LLMs as validators in production.
- Eugene clarified those points and shared his experience from the user study.
Eugene’s Prototype Demo
- Eugene showcased a prototype he built to facilitate data labeling, evaluation, and prompt optimization.
- He demonstrated the labeling mode, evaluation mode, and optimization mode of his tool.
- Attendees were impressed and gave positive feedback.
Discussion with Shreya Shankar
- Shreya joined the meeting and answered a question about incorporating natural language feedback in the EvalGen workflow.
- She discussed how good task decomposition impacts evaluation, suggesting evaluating each pipeline component separately and performing bottleneck analysis to find the weakest step (a rough sketch follows below).
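As a rough illustration of that bottleneck analysis, the sketch below scores each component of a hypothetical decomposed pipeline separately and flags the lowest-scoring step. The component names and pass/fail records are invented for illustration, not taken from the discussion.

```python
# Minimal sketch: evaluate each component of a decomposed pipeline separately,
# then flag the step with the lowest pass rate as the one to fix first.
per_component_results = {
    "retrieve_context": [True, True, False, True, True],
    "draft_answer":     [True, False, False, True, False],
    "format_citation":  [True, True, True, True, True],
}

pass_rates = {
    component: sum(results) / len(results)
    for component, results in per_component_results.items()
}
bottleneck = min(pass_rates, key=pass_rates.get)

print(pass_rates)  # {'retrieve_context': 0.8, 'draft_answer': 0.4, 'format_citation': 1.0}
print(bottleneck)  # 'draft_answer'
```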
Wrap Up
- The meeting ended with a brief discussion of upcoming Paper Club sessions, potentially focusing on function calling and involving researchers from Berkeley.
Quotables
“When the data and the anecdotes disagree, I tend to trust the anecdotes.” - Jeff Bezos, as quoted by Eugene
Eugene invoked Bezos to argue for prioritizing the inspection of individual data samples over aggregated metrics.
“They gave it a bad grade, not because the output is bad, but they wanted to be consistent with their previous grades.”
This exposes a flaw in human labeling: anchoring bias. It calls into question the reliability of even meticulously labeled datasets, especially when the initial criteria were flawed.