Working with AI (Part 1): Understanding the Art of the Prompt

This original article was spotlighted in BetaKit’s “How to Tame a Large Language Model.”

If you have used any of the LLM (Large Language Model) solutions out there, you are probably aware of prompts in the form of chats/discussions. Experimenting with prompts in the creative sense (e.g. blogs, poems, stories) can be really fun and useful, because you might have an open-ended idea that you are trying to hone or might not have a strong opinion on.

When thinking about a task like generating code, there are often strong opinions and exact outcomes you are aiming for. These prompts tend to be less conversational and count on users to load large amounts of context into a single prompt to perform a task.

This article will dive into the mechanics of how an LLM reasons about a prompt and how adding context to a prompt tunes and influences an LLM’s output.

Normal vs. Advanced Prompt Interface Comparison

Temperature, Top P, and Top K

Most LLM chat bot interfaces are set up for highly generative tasks by default, because that is the most marketed and tangible use case for an introduction into this world. A certain amount of randomness and creativity when asking the same question repeatedly is sometimes very desirable, and you can then continue to chat to tweak the results.

When you need a more specific outcome, or when using a script or API (Application Programming Interface) instead of a chat interface, it is useful to look at the advanced settings available when submitting prompts to an LLM.

Advanced LLM Settings for AI Prompting

These settings have company-defined defaults (e.g. temperature is usually set between 0 and 1) in the normal chat interfaces. To get access to these and more advanced settings, most solutions offer alternate interfaces.
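When calling a model through an API rather than a chat UI, these settings travel with each request. As a minimal sketch of what that looks like (the model name and parameter names below are placeholders following common conventions; real providers differ in naming, defaults, and allowed ranges):

```python
import json

# Hypothetical request body for a completion-style API. The parameter names
# (temperature, top_p, top_k) follow common conventions, but check your
# provider's documentation for the exact names and ranges it accepts.
payload = {
    "model": "example-model",  # placeholder model name
    "prompt": "make a list of the best top 10 programming languages. Only answer in JSON",
    "temperature": 0.0,  # 0 = most deterministic, higher = more random
    "top_p": 1.0,        # nucleus sampling: share of probability mass to keep
    "top_k": 40,         # number of most-likely tokens to consider
}

body = json.dumps(payload)  # what would be POSTed to the provider's endpoint
```

The point is simply that these knobs become explicit, per-request fields instead of hidden chat-interface defaults.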

When speaking about prompt settings that affect the randomness of outputs, it can be tricky to distinguish between an LLM’s “hallucination” and its natural tendency to use probabilities to craft responses. While hallucinations are often blatant inaccuracies, like claiming the Earth is flat, the LLM might also simply be giving you the most likely answer based on its training data. This is especially apparent with open-ended prompts where there is no single correct answer.

Here is an example of an open-ended prompt:

make a list of the best top 10 programming languages. Only answer in JSON

{ "languages": string[] }

Temperature 0:

Run 1: ["Python", "JavaScript", "Java", "C#", "C++", "PHP", "Go", "Swift", "Kotlin", "Ruby"]

Run 2: ["Python", "JavaScript", "Java", "C#", "C++", "PHP", "Go", "Swift", "Kotlin", "Ruby"]

Run 3: ["Python", "JavaScript", "Java", "C#", "C++", "PHP", "Go", "Swift", "Kotlin", "Ruby"]

Temperature 1:

Run 1: ["Python", "JavaScript", "Java", "C++", "C#", "PHP", "Go", "Swift", "Kotlin", "Ruby"]

Run 2: ["Python", "JavaScript", "Java", "C#", "C++", "PHP", "C", "Swift", "Go", "Ruby"]

Run 3: ["Python", "JavaScript", "Java", "C#", "C++", "PHP", "Go", "Swift", "Kotlin", "Ruby"]

Temperature 2:

Run 1: ["JavaScript", "Python", "Java", "C#", "C++", "PHP", "C", "Go", "Swift", "Ruby"]

Run 2: ["Python", "JavaScript", "Java", "C++", "C#", "PHP", "Swift", "Go", "Ruby", "Kotlin"]

Run 3: ["JavaScript", "Python", "Java", "C#", "C++", "PHP", "Swift", "Go", "Kotlin", "R"]
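Since the prompt above constrains the model to JSON, it helps to validate the raw response before using it. A minimal sketch (the function name and strictness are my own choices, not part of any particular LLM SDK):

```python
import json

def parse_languages(raw: str) -> list:
    """Parse a model response expected to match {"languages": string[]}.

    Raises ValueError if the shape is wrong, so malformed completions
    fail fast instead of propagating bad data downstream.
    """
    data = json.loads(raw)
    langs = data.get("languages")
    if not isinstance(langs, list) or not all(isinstance(x, str) for x in langs):
        raise ValueError('response did not match {"languages": string[]}')
    return langs

# Example: a well-formed response parses cleanly.
parse_languages('{"languages": ["Python", "Go"]}')  # -> ["Python", "Go"]
```

Because higher temperatures vary the list contents between runs, asserting only on the shape (not the exact languages) keeps the check stable.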

A quick, and oversimplified, explanation of what is happening here: each token/word that an LLM selects for output has a probability score, and the LLM settings steer how tokens are selected from those scores. The scores are heavily affected by how an LLM has been trained, so it is hard to tell how a given model is scoring something, and it differs between LLMs. To get a glimpse of how this works in practice, I’ve found asking for open-ended lists to be a good example.

| Token | Probability | Temp – 0, Top P – 1, Top K – 1 | Temp – 1, Top P – 1, Top K – 1 | Temp – 2, Top P – 1, Top K – 1 |
|---|---|---|---|---|
| Python | 1.0 | 1 | 1 | 1 |
| JavaScript | 0.9 | 2 | 2 | 1 |
| Java | 0.85 | 3 | 3 | 1 |
| C# | 0.7 | 4 | 4 | 2 |
| C++ | 0.65 | 5 | 4 | 2 |
| PHP | 0.5 | 6 | 5 | 3 |
| Go | 0.3 | 7 | 6 | 4 |
| Swift | 0.2 | 8 | 6 | 4 |

(Probability scores are not calculated and are purely for illustrative purposes.)
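To see the mechanics behind a table like this, here is a toy sketch (with invented scores, just like the table above) of how temperature reshapes a token distribution: each score is divided by the temperature before a softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it.

```python
import math

def apply_temperature(scores, temperature):
    """Divide each token's score by the temperature, then softmax back to
    probabilities. Toy illustration only -- a real LLM does this over
    logits for its entire vocabulary, not a handful of words."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exp.values())
    return {tok: e / total for tok, e in exp.items()}

scores = {"Python": 0.0, "JavaScript": -0.5, "Java": -1.0}  # made-up scores
cool = apply_temperature(scores, 0.5)  # sharpened: "Python" dominates
hot = apply_temperature(scores, 2.0)   # flattened: tokens are nearly tied
```

With the sharpened distribution the top token wins almost every draw, which is why lower temperatures produce the same list run after run, while the flattened one lets lower-ranked tokens slip in.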

With this you can start to visualize the cause and effect of LLM settings, context, and the prompt itself. Generative and creative tasks (e.g. writing a haiku) can benefit from a higher temperature, while more succinct tasks (e.g. asking for the title of an article) can benefit from a lower temperature. Temperature, top-p, and top-k interact very directly, and the effect of top-p and top-k on the output becomes close to non-existent as the temperature approaches 0.
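To make the top-p and top-k side of that relationship concrete, here is a small sketch of how they truncate the candidate set before a token is drawn. This follows the common convention (top-k keeps the k most likely tokens; top-p keeps the smallest set whose cumulative probability reaches the threshold); exact semantics differ between implementations.

```python
def truncate_candidates(probs, top_k=None, top_p=None):
    """Return the tokens that survive top-k / top-p filtering, ordered by
    probability. Sampling would then happen only among the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:  # nucleus reached: stop here
                break
        ranked = kept
    return [token for token, _ in ranked]

probs = {"Python": 0.5, "JavaScript": 0.3, "Java": 0.15, "C#": 0.05}  # invented
truncate_candidates(probs, top_k=2)    # -> ["Python", "JavaScript"]
truncate_candidates(probs, top_p=0.9)  # -> ["Python", "JavaScript", "Java"]
```

At temperature 0 the model effectively takes the single most likely token anyway, which is why these filters stop mattering as temperature approaches 0.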

It is important to note that setting temperature to 0 does not always mean an answer will be deterministic. Given enough uncertainty in a prompt, there can be a point where the probability scores of the available tokens are so low, and so close together, that some randomness is still introduced.

This can be shown by taking the example prompt above, expanding the request to 35 languages, and inspecting the last 5 languages in the list.

Temperature 0:

Run 1: [ …, "Rust", "Julia", "Haskell", "Erlang", "Elixir"]

Run 2: [ …, "Haskell", "Erlang", "Elixir", "Clojure", "F#"]

What makes a good prompt?

Now that we can see how context and the settings affect each other, we can start to reason about what to change to get the most out of different types of LLM interactions.

Summarization Prompt

Document:
<Document Text>

Summarize this document.

Temperature 0: A more literal summary, with text taken directly from or very similar to the document.

Temperature 1: Summarization will be more verbose and written in more natural language. You begin to see things like lists and key points.

Temperature 2: A much longer summarization where the messaging can start to go off track, sometimes dipping into topics, words, or languages that are adjacent to the document.

Generation Prompt

Write me a short story in a few sentences.

Temperature 0: Same story every time.

Temperature 1: Different story, but you might start to notice some trends over time.

Temperature 2: Much larger range of subjects and settings. Sometimes it can have a wandering narrative.

Extraction Prompt

Article:
<Article describing life of a Bear>

Where did the Bear get transferred to?

Temperature 0: Answer worded in a consistent way every execution.

Temperature 1: Answer worded in different ways every execution.

Temperature 2: Answer worded in different ways and may contain extra context that is not relevant to the question.

These are some simple examples, but the number of scenarios and approaches when using LLMs is limited only by the creativity of the user, token windows, and how an LLM has been trained. It will be up to you to decide what settings work for your task and the LLM you’re using.

Conclusion

By understanding these mechanisms and experimenting with different settings, we can unlock the power of LLMs for tasks ranging from code generation to summarization and creative writing. We can fine-tune their responses to match specific needs, whether it’s concise code with minimal variation or a more free-flowing narrative.

This journey is just beginning. As LLMs continue to evolve, so too will the ability to craft increasingly nuanced and powerful prompts. So, continue experimenting and exploring what you can do with these valuable technologies.

This is Part 1 of Mantle’s Working with AI series. Read Part 2: Code Conversion.

Mantle: Company ownership, from the future

To learn more about Mantle, check out our website or book a demo.
