by Nelson Lemos de Sousa

From fundamentals to creativity tuning, learn how small tweaks make a big difference.
Imagine if just a few tweaks could transform your AI’s responses from robotic to remarkably natural. In the world of LLMs, mastering the settings behind the machine is not only about customisation; it’s about taking control of your output so it perfectly suits your needs.
Small Mathematical Detour
Before diving into the settings, let’s take a small detour. While we won’t cover the architecture or mathematics of the ghost behind the machine, it will be useful to introduce one small concept: the Softmax function.
In general, the output of a transformer-based LLM is a one-dimensional vector of raw scores (logits), even if its internal representations are multidimensional. In the case of a text-generation LLM, this vector contains one number for each token in the LLM’s “dictionary” of words.
But what information can we get from these values? Here is where the Softmax function comes into play.
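For a vector of raw scores $z$ (the logits), the Softmax is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$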

This function accentuates the relative differences between values and returns a normalised vector in which each value lies between 0 and 1 and all values sum to 1, forming a probability distribution.

By applying the Softmax, each of the values gets transformed into a probability, which, in the case of text generative LLMs, corresponds to the likelihood of that token being produced.

Key Points of Softmax:
- Converts raw scores into probabilities
- Normalises outputs to a 0–1 range
- Highlights the relative differences between raw scores
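To make this concrete, here is a minimal NumPy sketch of the Softmax; the logit values are made up for illustration:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])     # raw scores for three hypothetical tokens
print(softmax(logits))                 # -> ~[0.659 0.242 0.099]; sums to 1
```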
From here we can enter the realm of the settings per se.
Settings
Temperature
The Temperature is a value by which the logits are divided before the Softmax is applied, changing the shape of the resulting distribution.
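In formula form, with temperature $T$:

$$\mathrm{softmax}(z; T)_i = \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}}$$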

Temperature Effects:
- High Temperature: Increases entropy, flattens the distribution, encourages creative responses.
- Low Temperature: Reduces entropy, sharpens focus on the most likely outputs.
Altering the temperature alters the distribution’s entropy (“peakedness”) and, in an informal sense, its kurtosis (“tailedness”), with the temperature acting as an “attenuator”.
Higher temperatures lead to higher entropy, “flattening” the probability distribution and making unlikely tokens more common; conversely, lower temperatures lower the entropy, concentrating the probability distribution on the most probable outputs.
Tip
Lower the temperature for precise Q&A scenarios, and raise it for creative tasks.
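The effect is easy to see numerically. Here is a minimal NumPy sketch, again with made-up logits, comparing a low and a high temperature:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide the logits by the temperature before applying the Softmax."""
    scaled = logits / temperature
    shifted = scaled - np.max(scaled)  # numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.5))  # sharper:  ~[0.86 0.12 0.02]
print(softmax_with_temperature(logits, 2.0))  # flatter:  ~[0.50 0.30 0.19]
```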
Top P
Top Probability: a sampling technique applied after the Softmax. It samples only from the tokens that cumulatively account for the top P share of the probability mass. Lowering this value restricts the model to a smaller pool of tokens, reducing the variability of its responses.
Key Points:
- Limits token choices by cumulative probability.
- Lower values restrict variability; ideal when you need focused output.
Note
While both Top P and Temperature control randomness, adjusting one can often be sufficient. If you choose to modify both, do so cautiously, as their effects can overlap.
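To make the mechanics concrete, here is a minimal sketch of a Top P (nucleus) filter applied to an already-softmaxed distribution; the probabilities are made up for illustration:

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    zero out the rest, and renormalise."""
    order = np.argsort(probs)[::-1]              # indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # how many tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, 0.75))  # only the top two survive: [0.625 0.375 0. 0.]
```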

Top K
A close cousin of Top P, this setting keeps the top tokens by a fixed count rather than by cumulative probability.
Example
With K=10, it will only take the 10 highest-valued tokens into account when choosing the next token.
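And a minimal sketch of a Top K filter for comparison, again with made-up probabilities:

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k most probable tokens, zero out the rest, renormalise."""
    keep = np.argsort(probs)[::-1][:k]  # indices of the k highest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_filter(probs, 2))  # -> [0.625 0.375 0. 0.]
```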
Max Length
Maximum number of tokens that the LLM can generate.
Stop Sequences
A string or a list of strings that indicates a stopping point for the LLM. If any of these values is generated, the generation will stop. This can be exceedingly useful to stop XML or other structured data generation at a specific point.
Tip
You can use } to stop JSON generation at certain points, or a closing tag such as </tag> for XML.
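Conceptually, a stop sequence truncates the output at the first match; real APIs check for stops token by token during generation. A simplified sketch:

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut generated text at the earliest stop sequence (conceptual sketch)."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)  # keep the earliest stopping point
    return text[:cut]

generated = '{"name": "Ada"} and some trailing chatter'
print(truncate_at_stop(generated, ["}"]))  # -> {"name": "Ada"
```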
Frequency Penalty
A penalty applied to a token in proportion to the number of times it has previously appeared in the response and the prompt, thereby reducing repeated usage of the same words.
Tip
Increasing penalties might be a fun way to discover new words.
Presence Penalty
A penalty applied to a token if it has already been generated. In contrast to the proportional Frequency Penalty, this penalty is the same whether the token appears once or n times.
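One common formulation, similar to the one OpenAI documents for its API, subtracts both penalties directly from the logits before sampling. A small sketch with made-up values:

```python
import numpy as np

def apply_penalties(logits: np.ndarray, counts: np.ndarray,
                    frequency_penalty: float, presence_penalty: float) -> np.ndarray:
    """Penalise logits of tokens that already appeared.

    counts[i] = how many times token i has appeared so far."""
    penalised = logits.copy()
    penalised -= counts * frequency_penalty       # grows with each repetition
    penalised -= (counts > 0) * presence_penalty  # flat, one-off penalty
    return penalised

logits = np.array([3.0, 2.0, 1.0])
counts = np.array([4, 1, 0])  # token 0 appeared 4 times, token 1 once
print(apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=0.6))
# -> [0.4 0.9 1. ]  repeated tokens become less likely to be sampled again
```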
Reasoning Effort
Used in some newer Chain of Thought models, this value allows control over how much “effort” an LLM expends in its CoT reasoning steps.
It allows a simple trade-off between cost and speed on one hand and result quality on the other.
Example
OpenAI commonly allows low, medium, and high settings for its reasoning models, where high provides more reasoning steps before the final answer, increasing its accuracy but also its output length and cost.
Seed
This value is the seed for the pseudo-random sampling algorithm and allows for reproducible generation when set.
Context Length
The number of tokens that a model can consider at once when generating a response. When the input text exceeds this value, the LLM has to truncate parts of it, which is a common cause of the well-known lapses of memory in prolonged conversations with LLMs.
The maximum context length is fixed for a specific model or architecture and can’t be increased.
Tip
Lower total context length for faster results on small local models if complex answers aren’t needed.
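To tie these settings together, here is a sketch of a single request using the OpenAI Python SDK. The model name and every value are illustrative, and other providers expose similar knobs under similar names; note that OpenAI’s reasoning models take a separate reasoning_effort parameter instead of most of these sampling settings.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",    # illustrative choice; any chat model works
    messages=[{"role": "user", "content": "List three uses of the softmax function."}],
    temperature=0.2,        # low: focused, precise answers
    top_p=0.9,              # nucleus sampling cutoff
    max_tokens=256,         # Max Length: cap on generated tokens
    stop=["\n\n"],          # stop sequence(s)
    frequency_penalty=0.3,  # discourage repeated words
    presence_penalty=0.1,   # discourage reusing tokens at all
    seed=42,                # best-effort reproducibility
)
print(response.choices[0].message.content)
```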
Conclusion
In this blog post we focused on the most common settings available in several APIs. However, this is not an exhaustive list; a whole world of less common settings exists for specific models: from sampling methods like Mirostat and Tail-Free Sampling to additional types of repetition penalties, specific CPU and GPU settings for local models, and much more…
Keep Learning.