The Facts About Harry Styles That GPT-3 Won't Tell You

Often when faced with a complex reasoning problem, modern large language models (LLMs) fail to understand the correct approach to take. In particular, problems that require multiple steps tend to confuse the model into producing an incorrect response.

For example, GPT-3 couldn't tell you what x + 6 * y^5 is, where x is the sum of the squares of the individual digits of the release year of Miley Cyrus's "Bangerz" and y is the day-of-month portion of Harry Styles's birthday.

Here's how to fix that.

First, we use Chain-of-Thought (CoT) prompting: Forcing the answer to begin with "Let's think step by step" makes the model lay out explicit reasoning steps, making it more likely to be correct. This gives the correct answer sometimes, but it's not reliable.
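Concretely, the zero-shot CoT prompt is nothing more than the question with the trigger phrase appended (the question below is a made-up stand-in):

```python
# Zero-shot CoT: append the trigger phrase so the completion begins with
# explicit reasoning steps. (Hypothetical question, for illustration only.)
question = "A farmer has 3 hens that each lay 2 eggs a day. How many eggs in a week?"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
print(cot_prompt)
```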

Run the same prompt a few times and you'll see it: occasionally an incorrect result, occasionally a correct one.

To get the robust answer this important question deserves, we also need consensus: We sample the model multiple times, and take whatever answer occurs most often.
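Mechanically, consensus (called "self-consistency" in the literature) is just a majority vote over the sampled final answers. A toy illustration with hypothetical samples:

```python
from statistics import mode

# Hypothetical final answers extracted from ten independent CoT samples.
# Individual samples vary, but the majority usually lands on the same value.
samples = ["42", "42", "38", "42", "41", "42", "42", "38", "42", "42"]
consensus = mode(samples)  # the most frequent answer
print(consensus)  # "42"
```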

To compute this, we first use an "extraction prompt" to get a final answer we can compare programmatically:

Let's turn this into Python. Import the OpenAI library and read your API key from an environment variable that you set outside of the script. Then define a quick helper function to make the API calls more concise.

import os
import re
from statistics import mode

import openai

try:
    openai.api_key = os.environ["OPENAI_API_KEY"]
except KeyError:
    raise RuntimeError("Please set the OPENAI_API_KEY environment variable.")

def complete(prompt: str, **kwargs):
    """Call the Completion API and return the stripped text of the first choice."""
    defaults = {"engine": "text-davinci-002"}
    kwargs = defaults | kwargs  # caller-supplied kwargs override the defaults
    response = openai.Completion.create(prompt=prompt, **kwargs)
    return response.choices[0].text.strip()

Next we turn "Let's think step by step" into Python. We take the completion of our extraction prompt, do some string parsing, and ideally we find a number. If we don't, we just try the prompt again — the results are stochastic.

def calculate_step_by_step(question: str, max_retries=3) -> str:
    """Answer a math question via chain-of-thought prompting.

    Retry until the result looks like a single number.
    """
    prompt = f"Q: {question}\nA: Let's think step by step."
    long_answer = complete(prompt, max_tokens=128, temperature=0.5)
    extraction_prompt = (
        f"{prompt} {long_answer}\nTherefore, the answer (Arabic numerals) is"
    )
    short_answer = complete(extraction_prompt, max_tokens=32, temperature=0)
    # Light clean-up: drop trailing periods and thousands separators, and keep
    # only the right-hand side if the model emitted something like "x = 20".
    short_answer = short_answer.strip().rstrip(".").replace(",", "").split("=")[-1]
    try:
        short_answer = re.findall(r"-?\d+\.?\d*", short_answer)[0]
    except IndexError:
        if max_retries > 0:
            return calculate_step_by_step(question, max_retries - 1)
        raise RuntimeError(f"Could not extract answer from '{short_answer}'")
    return short_answer

The consensus answer is just the modal response, so that function is a one-liner. We define our question, and run.

def answer_by_consensus(question: str, n=10) -> str:
    return mode(calculate_step_by_step(question) for _ in range(n))

EXAMPLE_QUESTION = (
    "What is x + 6 * y^5 where x is the sum of the squares of the individual digits "
    'of the release year of Miley Cyrus\'s "Bangerz" and y is the day-of-month '
    "portion of Harry Styles's birthday?"
)

if __name__ == "__main__":
    print("Q:", EXAMPLE_QUESTION)
    print("A:", answer_by_consensus(EXAMPLE_QUESTION))
In the output, we see the correct answer is 20. (Obviously.)
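You can verify the arithmetic without a language model: "Bangerz" was released in 2013, and Harry Styles was born on 1 February 1994, so y = 1.

```python
# x: sum of the squares of the digits of 2013 -> 4 + 0 + 1 + 9 = 14
x = sum(int(d) ** 2 for d in "2013")
# y: day-of-month of Harry Styles's birthday (1 February 1994)
y = 1
print(x + 6 * y ** 5)  # 20
```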

The code for this is available here: 

Requires `pip install openai`. Tested on Python 3.10.

This is a simplified version of the technique used in the recent, anonymous paper "Large Language Models Can Self-Improve":

In that paper, the authors show that models fine-tuned to emit chain-of-thought rationales for answers found through consistency sampling will improve without using any external labels as input. In this way, they achieve state-of-the-art-level accuracy.

The original paper on consensus is Wang et al. (2022):

“Let’s think step by step” is from Kojima et al. (2022):