Evaluating o1-preview for prompt injection

AI In Action
LLMs
Evals
Author
Published

September 27, 2024

Modified

November 21, 2024

AI In Action | 2024 Week 39

Baruch iteratively tests different prompt versions using OpenAI’s new o1-preview model, analyzing results and refining the prompts to identify potential security risks.

Goal

Baruch aims to improve a prompt designed to detect prompt injections in other LLMs. He is using a dataset to evaluate the effectiveness of his prompts.

Process

  1. Initial Setup: He starts by running an existing prompt on a dataset and evaluating its performance using a Google Sheet to track results.
  2. Analysis and Iteration: He analyzes the results, identifies weaknesses in the prompt (e.g., being too permissive with certain types of injections), and uses LLMs (like OpenAI’s GPT models and Claude) to suggest improvements.
  3. Experimentation with Different Prompts: He experiments with various modifications to the prompt, including:
    • Adding specific instructions.
    • Incorporating chain-of-thought prompting to get reasoning behind the LLM’s classifications.
    • Trying different LLMs (GPT-4, Claude) for generating and evaluating prompts.
  4. Evaluation: He continuously evaluates the performance of each modified prompt by running it on the dataset and comparing its accuracy against previous versions. He looks for improvements and regressions in identifying prompt injections.
  5. Challenges: He encounters several challenges during the process, including:
    • Caching issues with Braintrust
    • Difficulty in analyzing the LLM’s outputs, particularly when only getting a binary classification (0 or 1) without explanations.
    • Finding the optimal way to prompt the LLMs for suggestions and improvements.

Key Takeaways

  • Prompt engineering for prompt injection detection is an iterative process. The process demonstrated that finding an effective prompt requires repeated refinement based on evaluation results, experimentation, and analysis.
  • LLMs can assist in prompt creation and improvement. Using GPT-4 to suggest improvements based on failure examples proved somewhat helpful, showcasing the potential of LLMs in prompt engineering.
  • Evaluating prompt effectiveness requires a robust dataset and clear metrics. The use of a dataset and Google Sheets to track performance highlighted the importance of systematic evaluation. BrainTrust was also employed, though its caching issues proved problematic.
  • Chain-of-thought prompting can provide insights, but isn’t guaranteed to improve accuracy. While attempts were made to use chain-of-thought to understand the LLM’s reasoning, it didn’t significantly improve the prompt’s performance in this case.
  • There’s a balance between prompt complexity and effectiveness. Experiments with different prompt lengths and complexities revealed that a longer, more detailed prompt performed well but wasn’t necessarily the best approach. Simpler prompts were also evaluated.
  • Prompt injection detection is a challenging problem. Even with iterative refinement and advanced prompting techniques, difficulties were still encountered, suggesting that robust prompt injection detection requires ongoing research and development.
  • Practical prompt engineering involves troubleshooting and working around limitations. Various technical issues were encountered, such as BrainTrust caching problems and OpenAI parameter adjustments, demonstrating the real-world challenges of working with LLMs.