CAPPr: Completion After Prompt Probability

Make your LLM pick from a list of choices. Or compute the probability of a
completion given a prompt, which can be useful on its own. Squeeze more out of
open source LLMs.
Usage
Use a GGUF model
from llama_cpp import Llama
from cappr.llama_cpp.classify import predict
model = Llama("./TinyLLama-v0.Q8_0.gguf", verbose=False)
prompt = """Gary told Spongebob a story:
There once was a man from Peru; who dreamed he was eating his shoe. He
woke with a fright, in the middle of the night, to find that his dream
had come true.
The moral of the story is to"""
completions = (
"look at the bright side",
"use your imagination",
"eat shoes",
)
pred = predict(prompt, completions, model)
print(pred)
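# prints the completion CAPPr thinks is most likely to follow the prompt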
See this page of the documentation for more info on using GGUF models.
Use a Hugging Face transformers model
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Which planet is closer to the Sun: Mercury or Earth?"
completions = ("Mercury", "Earth")
pred = predict(prompt, completions, model_and_tokenizer=(model, tokenizer))
print(pred)
See this page of the documentation for more info on using transformers models.
Cache instructions to save time
Many prompts start with the same set of instructions, e.g., a system prompt plus a
handful of example input-output pairs. Instead of repeatedly running the model on common
instructions, cache them so that future computations are faster.
Here's an example using cappr.huggingface.classify.cache_model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import cache_model, predict
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model_and_tokenizer = (model, tokenizer)
prompt_prefix = '''Instructions: complete the sequence.
Here are examples:
A, B, C => D
1, 2, 3 => 4
Complete this sequence:'''
prompts = ["X, Y =>", "10, 9, 8 =>"]
completions = ["7", "Z", "Hi"]
cached_model_and_tokenizer = cache_model(
    model_and_tokenizer, prompt_prefix
)
preds = predict(
    prompts, completions, cached_model_and_tokenizer
)
print(preds)
Compute token-level log-probabilities
Here's an example using cappr.huggingface.classify.log_probs_conditional.
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import log_probs_conditional
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompts = ["x y", "a b c"]
completions = ["z", "d e"]
log_probs_completions = log_probs_conditional(
    prompts, completions, model_and_tokenizer=(model, tokenizer)
)
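# log_probs_completions[i][j] is the list of token-level log-probabilities of
# completions[j] conditional on prompts[i]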
print(log_probs_completions[0])
print(log_probs_completions[1])
Efficiently aggregate these log-probabilities using
cappr.utils.classify.agg_log_probs.
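Here's a minimal sketch, continuing from the log_probs_completions computed
above; it assumes the nested list can be passed to agg_log_probs as-is.

from cappr.utils.classify import agg_log_probs

# Collapse each completion's token-level log-probs into a single score,
# one per (prompt, completion) pair
scores = agg_log_probs(log_probs_completions)
print(scores)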
For a slightly more advanced demo, see ./demos/huggingface/dpo.ipynb.
Extract the final answer from a step-by-step completion
Step-by-step and chain-of-thought prompts are highly effective ways to get an LLM to
"reason" about more complex tasks. But if you need a structured output, a step-by-step
completion is unwieldy. Use CAPPr to extract the final answer from these types of
completions, given a list of possible answers.
See this idea in action in the
documentation.
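Here's a minimal sketch of the idea, not the documented recipe; the
step-by-step text, the candidate answers, and the use of gpt2 below are
illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Pretend this text came from an earlier, free-form chain-of-thought completion
step_by_step_completion = (
    "First, 7 * 8 = 56. Adding 4 gives 60. So the final answer is"
)
possible_answers = ("56", "60", "64")

# Treat the step-by-step text as the prompt and the allowed answers as completions
answer = predict(
    step_by_step_completion,
    possible_answers,
    model_and_tokenizer=(model, tokenizer),
)
print(answer)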
Run in batches, predict probabilities
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict_proba
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompts = [
"Stephen Curry is a",
"Martina Navratilova was a",
"Dexter, from the TV Series Dexter's Laboratory, is a",
"LeBron James is a",
]
class_names = ("basketball player", "tennis player", "scientist")
prior = (1/6, 1/6, 2/3)
pred_probs = predict_proba(
    prompts=prompts,
    completions=class_names,
    model_and_tokenizer=(model, tokenizer),
    batch_size=2,
    prior=prior,
)
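# pred_probs has one row per prompt and one column per class name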
print(pred_probs.round(1))
pred_class_idxs = pred_probs.argmax(axis=-1)
preds = [class_names[pred_class_idx] for pred_class_idx in pred_class_idxs]
print(preds)
Run in batches, where each prompt has a different set of possible completions
Again, let's predict probabilities.
from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict_proba_examples
from cappr import Example
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
examples = [
    Example(
        prompt="Jodie Foster played",
        completions=("Clarice Starling", "Trinity in The Matrix"),
    ),
    Example(
        prompt="Batman, from Batman: The Animated Series, was played by",
        completions=("Pete Holmes", "Kevin Conroy", "Spongebob!"),
        prior=(1/3, 2/3, 0),
    ),
]
pred_probs = predict_proba_examples(
    examples, model_and_tokenizer=(model, tokenizer)
)
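# pred_probs is a list with one array of class probabilities per example,
# since examples can have different numbers of completions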
print([example_pred_probs.round(2) for example_pred_probs in pred_probs])
pred_class_idxs = [
    example_pred_probs.argmax() for example_pred_probs in pred_probs
]
preds = [
    example.completions[pred_class_idx]
    for example, pred_class_idx in zip(examples, pred_class_idxs)
]
print(preds)
See the demos for demonstrations of slightly harder classification tasks.
For CAPPr, GPTQ models are the most computationally performant. These models
are compatible with cappr.huggingface.classify. See this page of the documentation
for more info on using these models.
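Here's a rough sketch. The model name below is just one example GPTQ
checkpoint, and loading it assumes the relevant quantization dependencies
(e.g., optimum and auto-gptq) and a GPU are available.

from transformers import AutoModelForCausalLM, AutoTokenizer
from cappr.huggingface.classify import predict

# Example GPTQ checkpoint; swap in whichever quantized model you use
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

pred = predict(
    "Which planet is closer to the Sun: Mercury or Earth?",
    ("Mercury", "Earth"),
    model_and_tokenizer=(model, tokenizer),
)
print(pred)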
Documentation
https://cappr.readthedocs.io
Installation
See this page of the documentation.
Related work
See this page of the documentation.
Motivation
Reduce engineering complexity.
See this page of the documentation for more info.
Performance
Statistical performance
Computational performance
How it works
You input a prompt string, an end_of_prompt string (a single whitespace or the
empty string), and a set of candidate completion strings such that the string

{prompt}{end_of_prompt}{completion}

is a naturally flowing thought. CAPPr picks the completion which is most likely
to follow prompt by computing the

Completion
After
Prompt
Probability

as fleshed out in my question on Cross
Validated.
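As a rough illustration (not CAPPr's internals), here are the strings whose
completion part gets scored for the earlier Mercury/Earth example, assuming a
single-space end_of_prompt:

prompt = "Which planet is closer to the Sun: Mercury or Earth?"
end_of_prompt = " "
completions = ("Mercury", "Earth")

for completion in completions:
    # CAPPr scores the probability of the completion's tokens coming after
    # {prompt}{end_of_prompt}, then picks the highest-scoring completion
    print(repr(prompt + end_of_prompt + completion))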
Local development
See this page of the documentation.
Todo
I'm dumping todos here:
Code changes
Research experiments
Feel free to raise issues, of course.