Foundations · primer 01

A short, honest answer to
"what is generative AI?"

Not a hype piece, not a sales pitch - a vendor-agnostic explanation of what a large language model is, what it does well, what it does badly, and the two binary deployment choices that shape every downstream conversation. Read this before your first vendor meeting.

The mechanism

A language model predicts the next token.

A large language model is a function that takes a sequence of text and predicts what comes next. Trained on vast amounts of text - books, articles, code, conversations, technical documentation - the model learns the statistical patterns of language well enough that its next-token guesses, chained together, produce paragraphs of coherent prose.

The model has no notion of truth. It has a notion of plausibility - what sequences of tokens were common in the data it was trained on. That distinction is the source of nearly every failure mode you'll encounter. The model produces output that sounds correct because it is statistically similar to correct output it was trained on, not because it has any mechanism for verifying that it is correct.

The artefact you actually deploy is the model's weights - a multi-gigabyte file containing the numerical parameters that define how it predicts the next token. The weights are the thing you store, the thing you load into memory at inference time, and the thing that makes the model the model. Different model "sizes" - 7-billion-parameter, 70-billion-parameter, 400-billion-parameter - refer to the count of those numerical parameters and roughly correlate with capability.

That is the whole mechanism. Everything else - chat assistants, retrieval-augmented generation, agentic workflows - is an architecture around a model that predicts the next token.

Where it works

Five tasks where current models are reliable.

"Reliable" here means: at quality high enough to use in production with appropriate guardrails. Not "always correct" - the failure rate is low enough that the value of the correct outputs exceeds the cost of catching the failures.

Reliable · 01

Drafting and summarising prose

Reports, emails, meeting notes, technical documentation, summaries of long documents. Output quality is high; verification is fast because a human can read it.

Reliable · 02

Translating between formats

Restructuring data, converting between markup formats, reshaping tabular data into prose, extracting structured data from unstructured input. The model is doing pattern transformation, which it does well.

Reliable · 03

Answering questions grounded in supplied text

When the answer is in a document the model is reading right now, the model is reliable. This is the entire premise of retrieval-augmented generation.

Reliable · 04

Code generation and code understanding

Particularly for common languages and well-documented frameworks. The model is reliable enough to be useful as a fast-typing junior developer; it is not reliable enough to be unsupervised.

Reliable · 05

Classification with clear taxonomies

Sorting incoming items into predefined categories. The model often outperforms hand-coded rule systems on the messy edge cases where the rules break.

Where it doesn't

Five tasks where current models are unreliable.

Each is a specific failure mode of next-token prediction. The model is not lying - it has no concept of truth to lie about. It is producing the most plausible continuation of the text it was given. Sometimes the most plausible continuation is wrong, and the model has no way of knowing.

Failure mode · 01

Stating facts the model was not given

A model will confidently produce dates, citations, names, and numbers that are wrong. The mechanism is the same one that produces correct output - pattern completion - but with no verification. Treat any factual claim from a model as a hypothesis until verified.

Failure mode · 02

Deterministic logic and exact arithmetic

Models are sometimes correct on these and sometimes not. They are not arithmetic engines and not logic engines. If your task needs a guaranteed-correct calculation, run a calculator.

Failure mode · 03

Reasoning beyond their training distribution

Asking a model to reason about scenarios it has no precedent for produces output that sounds plausible but cannot be relied on. This is the most subtle failure mode and the hardest to test for.

Failure mode · 04

Maintaining state across long sessions

The "memory" in a chat is just the prior turns being re-sent on each call. Long sessions saturate the context window, drift, and contradict their earlier turns. Treat memory as a deployment concern, not a model property.

Failure mode · 05

Anything safety-critical without a human in the loop

Medical decisions, legal rulings, financial executions, infrastructure operations. The model can be the assistant; it cannot be the decision-maker. The reliability profile does not allow it.

The two binary choices

Two decisions shape every deployment.

Open-weight or proprietary. Local or API. They are not orthogonal - most real deployments combine the same answer to both - but they are the two structural decisions that frame the rest of the architecture conversation.

Decision · 01

Open-weight vs proprietary

Open-weight

You can download the weights, run them on your hardware, fine-tune them, and audit what they do. Quality is approaching parity with proprietary at the top end and exceeding it on specific narrow tasks.

proprietary

You access the model through an API. The provider runs the inference; you pay per token; the model evolves on the provider's timeline. Top-end quality leads on broad, general tasks.

How to decide

For sensitive data, open-weight is the only architecture that meets sovereignty requirements. For non-sensitive workloads at startup speed, proprietary often wins on time-to-deployment.

Decision · 02

Local inference vs API

Local inference

The model runs on hardware you control. Data never leaves your perimeter. Latency is bounded by your hardware. Cost is capex plus electricity. Capacity is what you bought.

API

The model runs on the provider's hardware. Data is sent to them under their terms. Latency is bounded by the network and the provider's queue. Cost is per-token and visible. Capacity is "whatever they'll give you today."

How to decide

Same analysis as above plus a throughput / cost-curve dimension. For sustained workloads, local typically becomes cheaper above a usage threshold; the threshold depends on the model size and the per-token API price.

Implications

What this means for your deployment.

The reliability profile above tells you which workloads to attempt first. Drafting, summarising, and document-grounded Q&A are reliable enough to be production-worthy with reasonable guardrails. Anything that requires guaranteed-correct facts, exact arithmetic, or unsupervised decision-making is a multi-step engineering problem, not a model-choice problem.

The two binary choices tell you the deployment architecture. If the data is sensitive enough that sovereign deployment is non-negotiable, you are running an open-weight model on local infrastructure - full stop. If the data is non-sensitive and the workload is bursty, an API model is often the right starter call. Most organisations end up running both: sovereign for the workloads that demand it, API for everything else.

The next page in the foundations track - workload patterns - turns this binary into shapes you can recognise in your own organisation.

Read next

Now that you have the basics.

Vocabulary in place?

When you're ready to talk to a real partner about a real deployment, the routing form on the AI Solutions hub takes your situation and comes back with a partner shortlist that fits.

Next: workload patterns →

LM TEK d.o.o. · Pod Lipami 10 · 1218 Komenda · Slovenia

Get in touch

Partner with LM TEK

Request information

We will respond within two business days. Your details stay with LM TEK and are not shared with partners until you confirm the introduction.

Request a quote