When faced with a complex reasoning problem, modern large language models (LLMs) often fail to take the correct approach. Problems that require multiple steps are especially likely to trip the model into an incorrect response.
For example, GPT-3 can't reliably tell you what x + 6 * y^5 is, where x is the sum of the squares of the individual digits of the release year of Miley Cyrus's "Bangerz" and y is the day-of-month portion of Harry Styles's birthday.
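(For reference: "Bangerz" came out in 2013, so x = 2^2 + 0^2 + 1^2 + 3^2 = 14, and Harry Styles was born on February 1st, so y = 1. The answer is 14 + 6 * 1^5 = 20.)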
Here's how to fix that.
First, we use Chain-of-Thought (CoT) prompting: forcing the completion to begin with "Let's think step by step" makes the model lay out explicit reasoning steps, which makes a correct answer more likely. This sometimes gives the right answer, but it's not reliable.
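The shape of the prompt and completion looks like this (borrowing the juggler example from Kojima et al.; the completion shown is illustrative, not actual model output):

```
Q: A juggler can juggle 16 balls. Half of the balls are golf balls,
   and half of the golf balls are blue. How many blue golf balls are
   there?
A: Let's think step by step. Half of 16 balls is 8 golf balls. Half
   of 8 golf balls is 4 blue golf balls. The answer is 4.
```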
To get the robust answer this important question deserves, we also need consensus: we sample the model multiple times and take whichever answer occurs most often.
To compute this, we first use an "extraction prompt" to get a final answer we can compare programmatically:
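Concretely, we append a second prompt to the model's rationale. The wording below follows Kojima et al.'s zero-shot setup; the exact string in the gist may differ:

```
{question}

Let's think step by step.
{model's chain-of-thought rationale}

Therefore, the answer (arabic numerals) is:
```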
Let's turn this into Python. Import the OpenAI library and grab your API key from an environment variable that you set outside the script. Define a quick helper function to make the API calls more concise.
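A minimal sketch of that setup, assuming the pre-1.0 `openai` library's Completion endpoint (the helper name `complete` and the model choice are mine, not necessarily the gist's):

```python
import os

import openai

# Read the API key from the environment; never hard-code it in the script.
openai.api_key = os.environ["OPENAI_API_KEY"]


def complete(prompt: str, temperature: float = 0.7) -> str:
    """Return one sampled completion for the given prompt."""
    response = openai.Completion.create(
        model="text-davinci-002",  # assumption: the GPT-3 model of the era
        prompt=prompt,
        temperature=temperature,
        max_tokens=256,
    )
    return response.choices[0].text
```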
Next, we turn "Let's think step by step" into Python: we take the completion of our extraction prompt, do some string parsing, and ideally find a number. If we don't, we simply try again, since the results are stochastic.
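One way that step might look, reusing the `complete` helper above (the prompt strings and parsing are my guesses at the gist's behavior, not copies of it):

```python
import re

COT_SUFFIX = "\n\nLet's think step by step."
EXTRACT_SUFFIX = "\n\nTherefore, the answer (arabic numerals) is:"


def cot_answer(question: str) -> int:
    """Sample one chain-of-thought rationale, then extract a numeric answer.

    Retries from scratch until the extraction completion parses as a
    number, since sampling at nonzero temperature is stochastic.
    """
    while True:
        rationale = complete(question + COT_SUFFIX)
        extraction = complete(
            question + COT_SUFFIX + rationale + EXTRACT_SUFFIX
        )
        match = re.search(r"-?\d+", extraction.replace(",", ""))
        if match:
            return int(match.group())
```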
The consensus answer is just the modal response, so that function is a one-liner. We define our question and run.
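As a sketch (the sample count of 10 is an arbitrary choice, not taken from the gist):

```python
from statistics import mode


def consensus_answer(question: str, n: int = 10) -> int:
    """Return the most common answer across n independent CoT samples."""
    return mode(cot_answer(question) for _ in range(n))


question = (
    "What is x + 6 * y^5, where x is the sum of the squares of the "
    'individual digits of the release year of Miley Cyrus\'s "Bangerz" '
    "and y is the day-of-month portion of Harry Styles's birthday?"
)
print(consensus_answer(question))
```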
The code for this is available here:
https://gist.github.com/goodside/454169131c93ab5c57a9cdfac0754028
Requires `pip install openai`. Tested on Python 3.10.
This is a simplified version of the technique used in the recent, anonymous paper "Large Language Models Can Self-Improve":
https://openreview.net/forum?id=NiEtU7blzN
In that paper, the authors show that a model fine-tuned to emit chain-of-thought rationales for answers found through consensus sampling improves without using any external labels as input. In this way, they achieve state-of-the-art-level accuracy.
The original paper on consensus (there called "self-consistency") is Wang et al. (2022): https://arxiv.org/abs/2203.11171
“Let’s think step by step” is from Kojima et al. (2022): https://arxiv.org/abs/2205.11916