June 14, 2024

Cracking the Code: Automated Prompt Optimization Part 1 - Insights from Industry Leaders

This article was originally published on Martian's blog.

TL;DR

At Martian, we are fortunate to work with many of the world's most advanced users of AI. We see the problems they face on the leading edge of AI and collaborate closely with them to overcome these challenges. In this first installment of a three-part series, we share our view of the future of prompt engineering, which we call Automated Prompt Optimization (APO). We summarize the challenges faced by leading AI companies including Mercor, G2, Copy.ai, Autobound, 6sense, Zelta AI, EDITED, Supernormal, and others; identify key issues like model variability, model drift, and "secret prompt handshakes"; and reveal innovative techniques used to address these challenges, including LLM observers, prompt co-pilots, and human-in-the-loop feedback systems for refining prompts. We conclude by inviting those interested in this problem area to collaborate with us on future research.

Introduction

Our expertise lies in Model Routing. We dynamically route every prompt to the optimal large language model (LLM) based on our customers' specific cost and performance needs. Many of the most advanced AI companies are implementing routing. By working alongside these AI leaders, we gain firsthand insight into the challenges they encounter and collaborate to solve them. This work places us at the leading edge of commercial AI projects.

The challenges we see are likely ones you are facing now or will encounter as you advance on your AI journey. We share this information to provide a glimpse into the future of Automated Prompt Optimization (APO) and to invite the broader AI community to collaborate with us on research in this area. If you are interested in participating, please reach out to us at contact@withmartian.com.

This article is the first of a three-part series on APO.

Part One: In this article we summarize the challenges faced by leading AI companies including Mercor, G2, Copy.ai, Autobound, 6sense, Zelta AI, EDITED, Supernormal, and others. We identify key issues like model variability, drift, and “secret prompt handshakes”. We reveal innovative techniques used to address these challenges, including LLM observers, prompt co-pilots, and human-in-the-loop feedback systems to refine prompts.  

Part Two: will focus on what industry practitioners are doing differently from research and academia. We will aim to drive collaboration that advances both research and real-world solutions in APO.

Part Three: will dive into Martian's research on automated prompt optimization. The goal is to introduce the concrete solutions we've found for these problems in our research and lay out where we intend to improve things further.

By starting with the key challenges in prompt engineering encountered by leading AI companies and the solutions they have implemented, we lay the groundwork for understanding the current state of APO. We then provide detailed interview summaries for each company, showcasing how they are addressing these issues today and their innovative ideas for the future. This sets the stage for our next article, where we will explore the cutting-edge solutions researchers are developing for prompt engineering.

Key Issues and Solutions in Prompt Engineering

Challenges Faced by Prompt Engineers

  1. Variability in Prompting Techniques Across Models
    • Model Differences: Each model has unique characteristics due to variations in architecture, training data, design philosophies, context window sizes, and multimodal capabilities. This diversity necessitates tailored prompts for each model.
    • Non-Plug-and-Play Integration: Integrating new models into an existing AI stack requires reconsidering, testing, and potentially rewriting each prompt, making the adoption of new models challenging and time-consuming.
  2. Model Drift
    • Model Point Updates Impact Prompts: Updates to models can alter how they interpret and respond to prompts. A prompt that works with one version might not work with a newer version, causing inconsistency and requiring continuous adjustments.
  3. Secret Prompt Handshakes
    • Hidden Optimization Tricks: Small changes in prompts can lead to significantly different outputs. These "secret handshakes" are often discovered through trial and error, making it challenging to predict and standardize effective prompting techniques.
    • A striking illustration of this phenomenon is outlined in a recent blog post by Anthropic. The post delves into the Anthropic research team's efforts to enhance the performance of Claude 2 on the Needle In A Haystack (NIAH) benchmark. This benchmark assesses a model's capacity to comprehend extensive contexts by strategically burying a sentence within large documents and testing the model's ability to find "the needle in a haystack" across a range of document sizes and needle depths.
    • In the blog post, the Anthropic team calls out how the simple addition of the sentence "Here is the most relevant sentence in the context:" to the prompt boosted Claude 2.1's score from 27% to 98% in the evaluation.
  4. End Users Prompting LLMs
    • Many companies enable end users to create their own prompts as part of the product experience, leading to usability issues. Allowing users to prompt LLMs directly can result in inconsistent results and quality control challenges.
  5. Best Practices Are a Black Box
    • Limited Understanding: The best practices for prompting are not always fully known, even by the model developers. Effective prompting strategies are often discovered through experimentation and shared within the AI community.
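The NIAH "secret handshake" above is easy to picture as code. The sketch below is a hypothetical harness, not Anthropic's actual benchmark: `build_haystack` buries a needle sentence at a chosen depth, `build_prompt` optionally appends the handshake sentence, and a crude substring check stands in for real answer grading. In a real run, `prompt` would be sent to an LLM API and the response scored.

```python
# Minimal, hypothetical Needle-In-A-Haystack (NIAH) harness sketch.
# A real evaluation would send `prompt` to an LLM API and grade the response.

def build_haystack(filler: str, needle: str, depth: float, size: int) -> str:
    """Bury `needle` at a relative `depth` (0.0-1.0) inside `size` lines of filler."""
    lines = [filler] * size
    lines.insert(int(depth * size), needle)
    return "\n".join(lines)

def build_prompt(haystack: str, question: str, use_handshake: bool) -> str:
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer:"
    if use_handshake:
        # The "secret handshake" Anthropic reported for Claude 2.1:
        prompt += " Here is the most relevant sentence in the context:"
    return prompt

def score(model_answer: str, needle: str) -> bool:
    """Crude pass/fail stand-in: did the model surface the buried sentence?"""
    return needle.lower() in model_answer.lower()

needle = "The best thing to do in San Francisco is eat a sandwich."
haystack = build_haystack("The grass is green. The sky is blue.", needle,
                          depth=0.5, size=100)
prompt = build_prompt(haystack, "What is the best thing to do in San Francisco?",
                      use_handshake=True)
```

Sweeping `depth` and `size` over a grid, with and without the handshake, is what produces the kind of score gap (27% vs. 98%) described above.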

Key Themes from the Interviews

  1. Tailoring Prompts for Diverse Models
    • Companies consistently highlighted the need to customize prompts for different models, each with unique requirements and behaviors. This customization, while essential for optimal performance, is seen as excessively labor-intensive.
  2. Managing Model Drift
    • Model updates and drift are common, frustrating, and time-consuming issues. Companies emphasized the ongoing effort needed to adjust prompts and maintain consistent outputs as models evolve. Some teams set up "evals", automated methods for measuring the quality of an LLM's output, to alert prompt engineers to degradations in performance, but relying on customers to complain remains the most prevalent approach. The level of frustration here suggests that a product dedicated solely to managing model drift could possibly stand on its own.
  3. End Users Prompting LLMs
    • Allowing users to prompt LLMs directly can result in inconsistent results and quality control challenges. Copy.ai has implemented an AI system that acts as an automated prompt engineer to help novice users achieve better results, proudly stating, “No PhD in prompt writing required!”
  4. Leveraging Simulation and Evaluation Tools
    • Advanced companies are using sophisticated tools and simulations to test and refine prompts. For example, Mercor uses AI models to simulate two-sided interviewer and candidate conversations, with an LLM judge evaluating the effectiveness of these conversations.
  5. Human-in-the-Loop Feedback Systems
    • Several companies use human-in-the-loop feedback systems to continuously improve prompts. This approach allows for real-time adjustments based on user feedback (thumbs up, thumbs down, and comments), enhancing the relevance and accuracy of AI outputs.
  6. Emergence of LLM Judges
    • An LLM judge uses one or more LLMs to evaluate the quality and effectiveness of prompts and model outputs. Typically associated with "evals" frameworks, the concept is being expanded by companies beyond the evaluation and monitoring of LLM outputs.
  7. LLM Prompt “Orchestrator”
    • Supernormal has implemented a prompt layer that acts as a quality monitor and orchestrator, checking if meeting notes include follow-up action items. If no action items are found, the tokens are not sent to the part of the prompt chain that extracts action items, improving latency and lowering costs.
  8. Automated Systems Transparency
    • Companies are eager to build trust and learn from their automated systems. A system that creates automated prompt improvements should share insights with prompt engineering and product management teams. These insights are so valuable that the team at G2 considers this area potentially the most exciting in APO.
  9. Downstream Success Signals as Feedback
    • Autobound: Focuses on using downstream success signals, such as the open and reply rates of the personalized emails its system creates, as feedback for prompt optimization.
  10. Recursive Iteration
    • G2: Emphasized an interest in research into auto-recursive iteration in refining prompts and improving performance.
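The orchestrator theme in particular reduces to a simple pattern: a cheap check decides whether an expensive prompt runs at all. The sketch below is illustrative only, not Supernormal's implementation; the keyword heuristic and both helper functions are stand-ins for what would in practice be a small classifier or a short LLM call followed by a larger extraction prompt.

```python
# Sketch of a prompt-layer "orchestrator" that gates an expensive chain step.
# The keyword heuristic stands in for a cheap classifier or short LLM call.

ACTION_CUES = ("follow up", "action item", "next step", "will send", "todo")

def has_action_items(notes: str) -> bool:
    """Cheap gate: do the meeting notes even hint at follow-up work?"""
    lowered = notes.lower()
    return any(cue in lowered for cue in ACTION_CUES)

def extract_action_items(notes: str) -> list[str]:
    """Placeholder for the expensive LLM extraction prompt."""
    return [line.strip("- ") for line in notes.splitlines()
            if line.lower().lstrip("- ").startswith(("follow up", "todo"))]

def orchestrate(notes: str) -> list[str]:
    # When the gate says "no action items", skip the extraction prompt
    # entirely: fewer tokens sent downstream means lower latency and cost.
    if not has_action_items(notes):
        return []
    return extract_action_items(notes)
```

The design choice worth noting is that the gate only has to be cheap and high-recall; false positives just cost one extra LLM call, while true negatives save the whole downstream chain.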

By understanding these key issues and the innovative solutions implemented by leading AI companies, we hope you can anticipate some of these challenges and begin planning to address them. At the same time, we are better prepared to engage research and academia to collaborate on future APO solutions.

Next, we summarize our company interviews, showcasing each company's specific issues, how they are addressing them today, and their innovative ideas for the future.

Company Interviews

Copy.ai

Company & AI Product Overview

Prompt engineering and prompt management are massive undertakings at copy.ai. Our platform is designed to power complete go-to-market strategies with AI for sales and marketing teams. It includes over 400 pre-built workflows, each containing multiple prompts, addressing numerous sales and marketing use cases. For example, workflows handle tasks such as "Conduct competitor analysis from G2 reviews" or "Build customer sales FAQs from product documents." Our platform operates with well over 2,000 out-of-the-box LLM prompts. Additionally, we have over 15 million users on our platform, many of whom create custom workflows, built with our no-code workflow builder, that house prompts they've written. When you do the math, our platform houses and executes a staggering number of prompts.

Prompt Engineering Environment

In our efforts to automate the improvement of prompts, our platform serves a diverse range of end users with varying levels of prompting experience. For many, copy.ai is their first experience with prompting an LLM. Naturally, we want all our users to achieve the best possible results on our platform. To assist with this, we’ve developed an AI system that processes our end-user prompts and operates behind the scenes within the copy.ai product. This helps users of all experience levels get better results and has proven highly effective. In our marketing materials, we proudly state, “No PhD in prompt writing required.”
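A system like this can be pictured as a rewrite step sitting between the end user and the model. The sketch below is purely illustrative and is not copy.ai's implementation: a deterministic template stands in for the LLM-powered rewriting, adding the role, audience, and format guidance that novice prompts typically lack.

```python
# Sketch of a behind-the-scenes "prompt co-pilot" that upgrades a raw
# end-user prompt before it reaches the LLM. A production system would
# use an LLM to do the rewriting; this template is a deterministic stand-in.

def copilot_rewrite(user_prompt: str, audience: str = "a general audience") -> str:
    """Wrap a raw user prompt with role, audience, and format guidance."""
    return (
        "You are an expert marketing copywriter.\n"
        f"Audience: {audience}.\n"
        f"Task: {user_prompt.strip()}\n"
        "Respond in clear, concise prose. If the task is ambiguous, "
        "state your assumptions before answering."
    )

improved = copilot_rewrite("write email about product launch", audience="IT buyers")
```

The point is that the end user's intent passes through unchanged; the co-pilot only supplies the scaffolding an experienced prompt engineer would have added by hand.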

Automated Prompt Optimization

Looking at a larger prompt engineering organization and the future of automated prompt optimization, you can imagine a unified prompting layer, designed with the needs of the engineering organization in mind, that interfaces effectively with multiple models. This system would contain a translation layer that has learned the unique prompting nuances that maximize each model's performance. By building this model-level prompting intelligence into a common infrastructure that handles each model's unique prompting requirements, full-fledged prompt engineers, and by extension end-user customers, can focus more on their use-case requirements and less on mastering the nuances of each model. This abstraction layer between the application and the LLM would allow for model and vendor flexibility and independence, resulting in better outcomes for users, prompt engineers, and AI development teams.
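One way to picture such a translation layer is a registry of per-model prompt adapters behind a single `translate` call. Everything in this sketch is hypothetical, the model names and the prompting "nuances" alike; it only illustrates the shape of the abstraction, in which learned per-model tweaks live in one place and the application works with a canonical prompt.

```python
# Sketch of a "unified prompting layer": one canonical prompt, per-model
# adapters that apply each model's learned prompting nuances. The model
# names and tweaks below are illustrative, not vendor guidance.

from typing import Callable

ADAPTERS: dict[str, Callable[[str], str]] = {}

def adapter(model_name: str):
    """Decorator that registers a prompt adapter for a given model."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        ADAPTERS[model_name] = fn
        return fn
    return register

@adapter("model-a")
def _model_a(prompt: str) -> str:
    # Hypothetical nuance: model A performs best with explicit tagged sections.
    return f"<instructions>\n{prompt}\n</instructions>"

@adapter("model-b")
def _model_b(prompt: str) -> str:
    # Hypothetical nuance: model B benefits from a leading role statement.
    return f"You are a careful assistant.\n\n{prompt}"

def translate(prompt: str, model_name: str) -> str:
    """Apply the model-specific adapter; fall back to the prompt unchanged."""
    return ADAPTERS.get(model_name, lambda p: p)(prompt)
```

With this shape, adding or swapping a model is a new adapter registration rather than a rewrite of every prompt in the application, which is the vendor independence the passage describes.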

-----------

For the rest of the company interviews, please visit Martian's blog.
