
Foundations · primer 04

The terms in plain English.

Sixty-plus terms you will hear in any serious AI deployment conversation. Each one short. Each one stable-anchored - partners can deep-link from their own proposals to /glossary#fine-tuning and the link will hold.

Open in a second tab while reading the rest of the foundations track. Use Ctrl-F.


A

5 terms

Agent #

A model wrapped in a loop that can plan, call tools, observe results, and iterate towards a goal. The architectural pattern that turns "AI assistant" into "AI worker."
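
The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular framework: `fake_model` is a hypothetical stand-in for a real model call, and `add` is a hypothetical tool.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> Python function.
TOOLS: dict[str, Callable[..., object]] = {
    "add": lambda a, b: a + b,
}

def fake_model(goal: str, observations: list) -> dict:
    """Stand-in for a real model call (an assumption, not a real API).

    Requests one tool call, then finishes once it has seen a result.
    """
    if not observations:
        return {"action": "call_tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"action": "finish", "answer": f"The result is {observations[-1]}"}

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: list = []
    for _ in range(max_steps):                        # iterate towards the goal
        step = fake_model(goal, observations)         # plan the next step
        if step["action"] == "finish":
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # call the tool
        observations.append(result)                   # observe the result
    return "Stopped: step budget exhausted"

print(run_agent("add 2 and 3"))
```

The `max_steps` budget is a design choice worth keeping in any real loop: it bounds cost when the model never decides it is finished.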

Agentic workflow #

A multi-step task in which an agent (or several) decomposes the goal, executes steps using tools, and reasons about the result before continuing. See the workload-patterns primer for the deployment shape.

Alignment #

The discipline of training a model to behave in line with intended values - to refuse harmful requests, to hedge appropriately on uncertainty, to follow instructions reliably. RLHF and DPO are alignment techniques.

API model #

A model accessed over the network through a provider's service. The model itself runs on the provider's infrastructure; you pay per token of input and output. Contrast with a self-hosted open-weight model.

Attention #

The mechanism that lets a transformer model relate any token in a sequence to any other. Attention is what makes long-range coherence work - and what makes the context window expensive in compute and memory.
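
A minimal sketch of scaled dot-product attention in plain Python, assuming a single head and no masking (real models use many heads, masking, and optimised GPU kernels). It shows why cost grows with sequence length: every query is scored against every key.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # One score per (query, key) pair: n queries x n keys in total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two identical keys get equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [3.0]]))
```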

B

5 terms

Base model #

A pre-trained model that has not yet been fine-tuned for a specific use case. Base models complete text plausibly but do not respond to instructions in a chat-shaped way without further training.

Batch inference #

Running many inference requests through a model together rather than one at a time. Higher throughput, higher latency per individual request - the right choice for offline classification or extraction.

Benchmark #

A standardised test that scores model performance on a defined task. Useful for comparing models in the abstract; less useful for predicting how a model will perform on your specific data.

BF16 / FP16 / FP32 #

Numeric precision formats. FP32 is full single-precision; FP16 and BF16 are 16-bit formats that halve memory relative to FP32. FP16 gives up numerical range; BF16 keeps FP32's range but gives up precision. Much modern training and inference runs in BF16 by default.
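
The difference between the two 16-bit formats can be seen with the standard library alone. This is a sketch under one simplifying assumption: BF16 is approximated by truncating the FP32 encoding to its top 16 bits, whereas real hardware usually rounds.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf")  # x is beyond the FP16 range (largest value is 65504)

def to_bf16(x: float) -> float:
    """Approximate BF16 by keeping the top 16 bits of the FP32 encoding."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

print(to_fp16(70000.0))  # overflows: FP16 has a narrow range
print(to_bf16(70000.0))  # representable, but coarsely
print(to_fp16(0.1), to_bf16(0.1))  # FP16 is the closer of the two to 0.1
```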

Bias #

In an AI deployment context, the systematic skew in model outputs traceable to the training data, the labelling process, or the architecture. Bias is a property to be measured and managed, not eliminated.

C

4 terms

Chat model #

A base model that has been fine-tuned to respond to user messages in a conversational format, typically with instruction-tuning followed by alignment. The form most enterprise deployments use.

Classification #

A workload pattern: structured output (a category, a score, a flag) from unstructured input. See the workload-patterns primer.

Context window #

The maximum number of tokens a model can process in a single call - input plus output combined. Models with larger context windows can read longer documents but cost more in compute per call.

Continual pre-training #

Continuing the training of a base model on additional domain data before fine-tuning. Used to adapt a general model to specialised vocabulary or low-resource languages.

D

3 terms

Distillation #

Training a smaller "student" model to imitate a larger "teacher" model. The result is a model close to the teacher in capability but cheaper to run.

DPO #

Direct Preference Optimization. A simpler alternative to RLHF for aligning a model with human preferences. Same goal, less training infrastructure.

Drift #

The slow degradation of a deployed model's quality over time as the real-world data, user behaviour, or domain shifts away from what the model was trained on. Detected through ongoing evaluation, addressed through retraining or model refresh.

E

2 terms

Embedding #

A numeric vector representation of a piece of text such that similar texts have similar vectors. The mechanism behind semantic search and the retrieval step of RAG.
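
"Similar vectors" is usually measured with cosine similarity. A minimal sketch follows; the three-dimensional vectors are made up for illustration, whereas real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (invented values, not output of a real model).
invoice = [0.9, 0.1, 0.0]
bill = [0.8, 0.2, 0.1]
holiday = [0.0, 0.2, 0.9]

print(cosine_similarity(invoice, bill))     # close to 1: similar meaning
print(cosine_similarity(invoice, holiday))  # close to 0: unrelated
```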

Evaluation (eval) #

A repeatable test that scores model output against expected behaviour. The single most under-invested-in part of enterprise AI; without an eval, model swaps are hope.
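
An eval can start very small. The sketch below assumes exact-match scoring and uses hypothetical test cases and stand-in "models" (plain functions); real evals often need fuzzier scoring, but the shape is the same.

```python
def run_eval(model_fn, cases):
    """Return the fraction of cases where the output matches the expected answer."""
    passed = sum(1 for prompt, expected in cases
                 if model_fn(prompt).strip() == expected)
    return passed / len(cases)

# Hypothetical cases: (prompt, expected answer).
CASES = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

# Stand-ins for two candidate models (assumptions for illustration).
def model_a(prompt):
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")

def model_b(prompt):
    return {"2+2": "4", "capital of France": "Lyon", "3*3": "9"}.get(prompt, "")

print(run_eval(model_a, CASES), run_eval(model_b, CASES))
```

Running the same cases against both candidates is what turns a model swap from hope into a comparison.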

F

3 terms

Fine-tuning #

Training a base model further on additional data to adapt its behaviour to a specific task or domain. See the workload-patterns primer for when this is worth doing.

Foundation model #

A large model pre-trained on broad data, intended to be the starting point for many downstream applications. The term overlaps with "base model" but emphasises the role rather than the training stage.

Function calling #

A pattern where a model produces structured output describing a function to call rather than generating prose. The mechanism behind tool use in agentic workflows.
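
On the application side, function calling comes down to parsing the model's structured output and dispatching it. A minimal sketch, assuming a JSON shape with `name` and `arguments` keys (providers differ in the exact format) and a hypothetical `get_weather` tool:

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an external service."""
    return f"Sunny in {city}"

# Only functions listed here can be called by the model.
FUNCTIONS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse the model's structured output and call the named function."""
    call = json.loads(model_output)
    fn = FUNCTIONS.get(call["name"])
    if fn is None:
        raise ValueError(f"Unknown function: {call['name']}")
    return fn(**call["arguments"])

# What a model might emit instead of prose (illustrative format).
model_output = '{"name": "get_weather", "arguments": {"city": "Ljubljana"}}'
print(dispatch(model_output))
```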

G

3 terms

GPU #

Graphics Processing Unit. The hardware that does the parallel arithmetic that makes neural-network training and inference fast. The dominant cost driver in any non-trivial AI deployment.

Ground truth #

The labelled correct answers used to evaluate model output or to fine-tune a model. The quality of any evaluation is bounded by the quality of the ground truth.

Guardrails #

Programmatic checks applied to model input or output to enforce policy - content filtering, PII redaction, output schema validation, refusal-to-answer for restricted topics. Sit alongside the model rather than inside it.
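
Two of the checks named above, sketched with the standard library. The email pattern is deliberately simple and is an assumption for illustration; production PII redaction needs broader coverage than one regular expression.

```python
import json
import re

# Simplified email pattern (illustrative, not exhaustive).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    """Input/output guardrail: mask email addresses before text moves on."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def validate_output(raw: str, required_keys: set) -> dict:
    """Output guardrail: reject model output that is not the expected JSON shape."""
    data = json.loads(raw)
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return data

print(redact_emails("Contact jane.doe@example.com today"))
print(validate_output('{"category": "invoice", "score": 0.9}', {"category", "score"}))
```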

H

2 terms

Hallucination #

When a model produces output that is plausible but incorrect - fabricated citations, invented dates, confident wrong numbers. The native failure mode of next-token prediction.

Hybrid retrieval #

A retrieval approach that combines lexical matching (keywords) with semantic matching (embeddings). Often outperforms either alone for enterprise documents where exact terminology matters.
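
One common way to combine the two result lists is reciprocal rank fusion, sketched below. It is one option among several (weighted score blending is another), and the document ids and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list earn a larger share.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a keyword index and an embedding index.
lexical = ["doc_contract", "doc_policy", "doc_faq"]
semantic = ["doc_policy", "doc_memo", "doc_contract"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Documents that appear in both lists rise to the top, which is the behaviour that helps when exact terminology and meaning both matter.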

I

3 terms

Inference #

The act of running a trained model to produce output. Distinct from training (which produces the model). Inference is what costs you in production; training is what cost you to get there.

Instruction tuning #

Fine-tuning a base model on examples of instructions paired with appropriate responses. The training stage that turns a text-completion model into a chat-shaped one.

Interconnect #

The high-speed network between GPUs in a multi-GPU server (and between servers in a cluster). NVLink, InfiniBand, and NVSwitch are common interconnect technologies. The interconnect often determines training throughput.

J

1 term

Jailbreak #

A prompt crafted to make a model violate its alignment - produce restricted content, ignore guardrails, leak system instructions. An ongoing arms race between alignment training and adversarial prompting.

K

1 term

KV cache #

Key-Value cache. The intermediate state a transformer accumulates while generating tokens, kept in GPU memory to avoid recomputation. The KV cache often dominates GPU memory consumption at long context lengths.
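
A back-of-envelope estimate shows why. The sketch assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token; the model shape used is illustrative, not a specific product.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Keys plus values (the factor 2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative shape (an assumption): 32 layers, 32 KV heads, head dim 128,
# 16-bit values, one sequence of 4096 tokens.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

The size is linear in both sequence length and batch size, so long contexts served to many concurrent users multiply quickly.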

L

3 terms

Latency #

The time from request to response. In chat, often measured as time-to-first-token and tokens-per-second separately. Latency is bounded by hardware, model size, prompt length, and queue depth.

LLM #

Large Language Model. A neural network trained on vast text data to predict the next token in a sequence. The substrate beneath every workload pattern in enterprise AI.

LoRA #

Low-Rank Adaptation. A parameter-efficient fine-tuning technique that trains small low-rank adapter matrices while the base model's weights stay frozen. Faster, cheaper, and stackable - multiple LoRA adapters can be loaded against a single base model.
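
The saving is easy to quantify. For a weight matrix of shape d_out x d_in, a rank-r adapter trains two small matrices (d_out x r and r x d_in) instead of the full matrix. The layer size and rank below are illustrative assumptions.

```python
def full_params(d_out, d_in):
    """Trainable values when fine-tuning the whole weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable values in the two low-rank adapter matrices."""
    return rank * (d_out + d_in)

full = full_params(4096, 4096)          # 16,777,216
lora = lora_params(4096, 4096, rank=8)  # 65,536
print(full, lora, full // lora)         # the adapter is 256x smaller
```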

M

4 terms

MCP #

Model Context Protocol. An open standard for letting models discover and call external tools and data sources. Increasingly the way agentic deployments connect to enterprise systems.

MLOps #

The operational discipline around deploying, monitoring, and retraining ML models. The AI-deployment equivalent of DevOps. Mostly partner work - system integrators and AI consultancies, rarely the customer alone.

Multi-agent #

An agentic deployment with several agents collaborating, often with defined roles (planner, worker, critic). More complex than single-agent; the orchestration framework matters a lot.

Multimodal #

A model that handles more than one input modality - text plus images, text plus audio, text plus video. Increasingly the default for new models; the workload patterns extend to multimodal versions.

O

1 term

Open-weight model #

A model whose weights are published and downloadable, runnable on hardware you control. Llama, Mistral (including Mixtral), and Qwen are open-weight families. The architecture of choice for sovereign deployments.

P

6 terms

Parameters #

The numerical weights that define a neural network. "Parameter count" is the rough capacity proxy - a 70-billion-parameter model has 70 billion numbers. Bigger correlates with capability but also with inference cost.
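
Parameter count translates directly into memory for the weights alone. A rough sketch, ignoring the KV cache, activations, and runtime overhead, and using decimal gigabytes:

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(param_count, fmt):
    """Approximate memory for the weights only, in decimal GB."""
    return param_count * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(fmt, weight_memory_gb(70e9, fmt), "GB")
```

For a 70-billion-parameter model this gives 280 GB in FP32, 140 GB in BF16, 70 GB in INT8, and 35 GB in INT4.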

PoC #

Proof of Concept. A narrow, fast experiment to validate technical feasibility before committing to production development. See the PoC-to-production primer for what one is and isn't for.

Pre-training #

The first and most expensive stage of model training, in which a model learns general language patterns from a vast corpus. Done once, by the model maker, before any customer-specific work.

Prompt #

The input given to a model - often including a system instruction, conversation history, retrieved context, and the user's current message. Prompt engineering is the discipline of crafting these inputs for reliable output.

Prompt injection #

An attack in which malicious instructions are embedded in data the model reads (a document, a web page, a user message) to override the system prompt or guardrails. The dominant security concern in agentic deployments.

Proprietary model #

A model whose weights are not published. Accessed through the model maker's API. Top-end quality leads on broad tasks; sovereignty and cost-at-scale work against it.

Q

2 terms

Quantisation #

Reducing the numeric precision of a model's weights to save memory and speed up inference. INT8 and INT4 quantisation are common; quality cost varies by model and task.
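
A minimal sketch of the idea, assuming the simplest scheme (symmetric "absmax" INT8 with one scale for the whole list). Production methods typically quantise per channel or per group and are more careful about outliers.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127  # assumes not all weights are zero
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantise_int8(weights)
print(q)                     # [50, -127, 0, 100]
print(dequantise(q, scale))  # close to the originals; 0.003 collapses to 0.0
```

The small weight that rounds to zero is the quality cost in miniature: values below the step size are lost.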

QLoRA #

Quantised LoRA. Fine-tuning with LoRA adapters on top of a quantised base model. Often the cheapest path to fine-tuning a large model.

R

4 terms

RAG #

Retrieval-Augmented Generation. The pattern in which a retrieval system finds relevant documents, the model reads them, and the answer is grounded in your data. The most common pattern in enterprise AI.
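
The pattern can be sketched end to end without a model. The retriever here is a toy that ranks by word overlap (an assumption for illustration; real systems use embeddings or hybrid retrieval), and the documents and prompt wording are hypothetical.

```python
def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents):
    """Place the retrieved passages in the prompt so the answer is grounded in them."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

DOCS = [
    "Refunds are processed within 14 days.",
    "Our office is in Ljubljana.",
    "Refunds require a receipt.",
]
print(build_prompt("how are refunds processed", DOCS))
```

The resulting prompt is what gets sent to the model; the unrelated office document never reaches it.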

ReAct #

Reasoning + Acting. A loop in which an agent alternates between thinking step-by-step and calling tools. The simplest agentic pattern that works.

Red-teaming #

Deliberate adversarial testing of a model - trying to jailbreak it, extract training data, induce harmful output. A required step before production deployment of any non-trivial system.

RLHF #

Reinforcement Learning from Human Feedback. A training stage in which humans rate model outputs and the model is updated to prefer the highly-rated ones. The classic alignment technique; DPO is a simpler alternative.

S

3 terms

Sovereign AI #

A deployment where the data, the model, and the inference infrastructure are all under the deploying organisation's control or jurisdiction. See the private-AI page for the data classes that force this.

Streaming #

Returning model output token-by-token as it is produced rather than waiting for the full response. Improves perceived latency in chat; less relevant for batch workloads.

System prompt #

The high-level instruction that frames a model's behaviour for a session - its role, the rules it should follow, the tone to use. Distinct from user messages; handled differently by the model.

T

5 terms

Temperature #

A parameter that controls the randomness of model output. Low temperature (close to 0) produces near-deterministic, predictable output; higher temperatures produce more varied output. Set low for classification, higher for creative drafting.
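
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be greater than 0."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for illustration
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the top token
print(softmax_with_temperature(logits, 2.0))  # mass spread across all three
```

A temperature of exactly 0 would divide by zero here; implementations commonly treat it as "always pick the top token" instead.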

Token #

The unit a language model reads and produces. Usually a few characters or part of a word. "Tokens" are the unit of cost in API pricing and the unit of capacity in the context window.
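
Both uses of the token, cost and capacity, reduce to simple arithmetic. The prices and window size below are hypothetical placeholders, not any provider's actual figures.

```python
def call_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one API call when input and output tokens are priced separately."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

def fits_context(input_tokens, max_output_tokens, context_window):
    """Input and output share the same window, so budget for both."""
    return input_tokens + max_output_tokens <= context_window

# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
print(call_cost(2000, 500, 3.00, 15.00))  # 0.0135
print(fits_context(7000, 1500, 8192))     # False: 8500 tokens exceed the window
```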

Tokens per second #

Throughput metric. How many tokens a model produces in one second on a given hardware setup. The metric chat deployments live and die by.

Tool use #

A model invoking external functions or APIs as part of producing its response. The mechanism behind agentic workflows; usually implemented via function calling or MCP.

Transformer #

The neural network architecture underlying nearly every current large language model. The attention mechanism is what distinguishes it. The core design has been remarkably stable since 2017.

V

2 terms

Vector database #

A storage system optimised for finding embeddings similar to a query embedding. The "retrieval" half of RAG runs through one. Pinecone, Weaviate, Qdrant, and pgvector are common options.

VRAM #

Video RAM. The on-GPU memory where model weights, the KV cache, and activations live during inference. Insufficient VRAM is the most common reason a model fails to load on given hardware.

W

1 term

Weights #

The trained parameters of a neural network. The artefact you actually deploy. A model file is mostly its weights.
