Data readiness

Hardware can't fix a data problem.

Enterprise AI deployments fail at the data layer more often than at any other. The model is downstream of the architecture; the architecture is downstream of the data; the hardware is downstream of all three. This page is what you need to know about data readiness before you commit to a model, an architecture, or a hardware platform.


Why this is upstream

The data sets the ceiling.

A retrieval-augmented generation deployment cannot retrieve from a corpus that does not exist as a corpus. A fine-tuned model cannot learn from data that has no consistent labels. A classification system cannot reach 90% accuracy when 30% of the input data is malformed. The data layer is not a downstream concern - it is the ceiling above which no amount of model selection, prompt engineering, or hardware sizing can lift the deployment.

The most expensive moment to discover data problems is in production. The second most expensive is during a PoC, when fixing them means stopping the project. The cheapest moment - by an order of magnitude - is before any of that, in a structured assessment whose only job is to find them.

"Garbage in, garbage out" understates the problem. Bad data does not just produce bad AI. It produces confidently wrong AI: the model has no way of knowing the input is bad, so the output sounds as authoritative as it would on clean input. Preventing the cost of that confident wrongness in production is the point of the readiness work.

The five dimensions

Five questions a serious AI data layer can answer.

Each dimension comes with concrete questions. If your team can answer these for the dataset that the AI deployment is going to operate on, the data layer is in shape. If the answers are "we'll need to check" or "the person who knows that left," the readiness work is the project - and it has to happen before anything else.

Dimension

01

Quality

Are the values actually correct?

Accuracy, consistency, freshness. Whether duplicates have been identified and merged. Whether old data has been updated. Whether values that look like dates parse like dates and values that look like currencies are in the same currency. Quality is the dimension most often assumed and least often measured.

Questions you should be able to answer

→ What percentage of records have been validated against an authoritative source in the last 12 months?
→ How are duplicates identified, merged, or flagged?
→ How fresh is the data - and is the freshness consistent across the dataset?
→ Where are the known accuracy issues, and are they documented?
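As a rough illustration of how the quality questions above could be turned into numbers, the sketch below computes a duplicate rate, an unparseable-date rate, and a staleness rate over a handful of records. The field names (`customer_id`, `updated_at`), the ISO date format, and the 365-day freshness window are assumptions made for the example, not properties of any particular system.

```python
from datetime import date, datetime


def parse_iso_date(value):
    """Return a date if the value parses as YYYY-MM-DD, else None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def quality_report(records, today, max_age_days=365):
    """Summarise duplicates, unparseable dates, and stale records."""
    total = len(records)
    if total == 0:
        return {"duplicate_rate": 0.0, "unparseable_date_rate": 0.0, "stale_rate": 0.0}
    seen = set()
    duplicates = unparseable = stale = 0
    for rec in records:
        key = rec.get("customer_id")
        if key in seen:
            duplicates += 1          # same key seen earlier in the dataset
        seen.add(key)
        updated = parse_iso_date(rec.get("updated_at"))
        if updated is None:
            unparseable += 1         # looks like a date, does not parse like one
        elif (today - updated).days > max_age_days:
            stale += 1               # older than the assumed freshness window
    return {
        "duplicate_rate": duplicates / total,
        "unparseable_date_rate": unparseable / total,
        "stale_rate": stale / total,
    }


# Hypothetical sample records, invented for the example.
records = [
    {"customer_id": "C1", "updated_at": "2024-05-01"},
    {"customer_id": "C2", "updated_at": "01/05/2024"},  # wrong date format
    {"customer_id": "C1", "updated_at": "2024-05-01"},  # duplicate key
    {"customer_id": "C3", "updated_at": "2021-01-15"},  # stale
]
report = quality_report(records, today=date(2024, 6, 1))
```

A real assessment would match duplicates on more than one key and validate values against an authoritative source. The point here is only that each question can be answered with a measured rate rather than an assumption.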

Dimension

02

Completeness

Does the data cover the population the model will see?

Coverage, gaps, missing fields, sampling biases. A dataset that covers 80% of customers excellently and 20% sparsely will produce a model that performs excellently on 80% of customers and unpredictably on the rest. Completeness is the dimension that quietly determines whether AI is fair across your user base.

Questions you should be able to answer

→ What population does the dataset represent? What is excluded?
→ Which fields have systematic missing-data patterns, and why?
→ Are there sub-segments where coverage is materially worse than the average?
→ How is the dataset refreshed when the underlying population changes?
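One way to make the sub-segment question concrete is to measure how often required fields are populated, segment by segment. The sketch below does that with the standard library. The `segment`, `email`, and `postcode` field names and the sample records are hypothetical.

```python
from collections import defaultdict


def fill_rate_by_segment(records, segment_field, required_fields):
    """Share of required fields that are populated, per segment."""
    filled = defaultdict(int)
    expected = defaultdict(int)
    for rec in records:
        segment = rec.get(segment_field, "unknown")
        for name in required_fields:
            expected[segment] += 1
            if rec.get(name) not in (None, ""):
                filled[segment] += 1
    return {seg: filled[seg] / expected[seg] for seg in expected}


# Hypothetical sample: retail is well covered, wholesale is sparse.
records = [
    {"segment": "retail", "email": "a@example.com", "postcode": "1000"},
    {"segment": "retail", "email": "b@example.com", "postcode": "2000"},
    {"segment": "wholesale", "email": "", "postcode": "3000"},
    {"segment": "wholesale", "email": None, "postcode": None},
]
rates = fill_rate_by_segment(records, "segment", ["email", "postcode"])
```

A dataset-wide average would hide the gap this breakdown exposes: the two segments in the sample differ sharply even though the overall fill rate looks moderate.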

Dimension

03

Structure

Is the data parseable, joinable, and predictable?

Schemas, formats, normalisation, conventions. Whether the same entity is referenced consistently across sources. Whether free-text fields follow conventions or are free-for-alls. Whether nested structures are documented. Structure is the dimension that determines how much engineering work happens before the model sees anything.

Questions you should be able to answer

→ Is there a documented schema, and does the data actually conform to it?
→ Can the same entity be reliably linked across the systems where it appears?
→ How are free-text fields, mixed languages, or inconsistent formats handled?
→ What share of the data requires custom parsing versus standard tooling?
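The first question, whether the data actually conforms to its documented schema, lends itself to a simple check. The sketch below compares records against a schema expressed as a plain mapping of field name to Python type. The schema format, the field names, and the sample records are assumptions for illustration; production checks would more likely use a dedicated validation tool.

```python
def conformance_issues(record, schema):
    """List the ways a record deviates from a documented schema."""
    issues = []
    for name, expected_type in schema.items():
        if name not in record:
            issues.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            issues.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for name in record:
        if name not in schema:
            issues.append(f"undocumented field: {name}")
    return issues


def conformance_rate(records, schema):
    """Share of records with no deviations from the schema."""
    if not records:
        return 0.0
    clean = sum(1 for rec in records if not conformance_issues(rec, schema))
    return clean / len(records)


# Hypothetical documented schema and sample data.
schema = {"order_id": int, "amount": float, "currency": str}
records = [
    {"order_id": 1, "amount": 19.99, "currency": "EUR"},
    {"order_id": "2", "amount": 5.0, "currency": "EUR"},  # id stored as text
    {"order_id": 3, "amount": 7.5, "currency": "EUR", "flag_old": 1},  # undocumented field
    {"order_id": 4, "amount": 12.0},  # missing currency
]
rate = conformance_rate(records, schema)
```

The share of non-conforming records is also a first approximation of how much custom parsing the data will need before a model sees it.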

Dimension

04

Lineage

Can you trace any value back to its source?

Provenance, transformations, audit trail. Where did this value come from. How has it been transformed since then. Who touched it. When the regulator asks why the model produced a particular output, lineage is what lets you answer. Without it, the AI deployment is hard to defend after the fact.

Questions you should be able to answer

→ For any record, can you produce the source system and the path it took to get here?
→ Are transformations applied to the data documented and reproducible?
→ Is there an audit log of who has read, written, or modified the data?
→ How long is lineage data retained, and what regulatory regime sets that retention?
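To show what "the source system and the path it took to get here" might look like as data, the sketch below attaches a history of steps to a value and appends a step on every transformation. The class names, the system names (`crm_export`, `etl`), and the timestamps are hypothetical. Real lineage is usually captured by pipeline or catalogue tooling rather than by wrapping individual values.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineageStep:
    system: str      # where the step ran
    operation: str   # what was done to the value
    actor: str       # who or what performed it
    timestamp: str   # ISO 8601 string, kept simple for the sketch


@dataclass
class TrackedValue:
    value: object
    history: list = field(default_factory=list)

    def apply(self, func, system, operation, actor, timestamp):
        """Transform the value and record the step; the original is left unchanged."""
        step = LineageStep(system, operation, actor, timestamp)
        return TrackedValue(func(self.value), self.history + [step])

    def source(self):
        """Return the first recorded step, i.e. the originating system."""
        return self.history[0] if self.history else None


# Hypothetical path: a raw amount is extracted, then parsed into a number.
raw = TrackedValue(" 1,250.00 ").apply(
    lambda v: v, "crm_export", "extracted", "nightly_job", "2024-06-01T02:00:00Z"
)
clean = raw.apply(
    lambda v: float(v.strip().replace(",", "")),
    "etl", "parsed amount", "etl_service", "2024-06-01T02:05:00Z",
)
```

Here `clean.source().system` answers the "where did this come from" question and `clean.history` gives the full path, which is the kind of answer an auditor or regulator would ask for.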

Dimension

05

Classification

Does each data element have a sensitivity label?

PII, IP, regulated data, confidential, public. Who owns each class. Who is allowed to use each class for AI. Without classification, the conversation about whether AI can process this data is stuck at the door - nobody can give a clean answer about what is in scope, because nobody has labelled the data.

Questions you should be able to answer

→ Is every data element tagged with a sensitivity classification?
→ Is there a defined owner for each classification?
→ Is there a documented policy stating which classifications are permitted for AI processing?
→ Have the DPO and legal team signed off on the classification scheme?
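A documented policy of the kind these questions ask for can be small. The sketch below assumes a policy expressed as a mapping from sensitivity label to whether AI processing is permitted, and sorts fields into in scope, out of scope, and unresolved. The labels, the permitted values, and the field names are all hypothetical; an actual policy is a decision for the data owners, the DPO, and legal.

```python
# Hypothetical policy: which sensitivity labels may be used for AI processing.
AI_PERMITTED = {"public": True, "confidential": True, "pii": False, "regulated": False}


def ai_scope(field_labels, policy=AI_PERMITTED):
    """Split fields into in-scope, out-of-scope, and unresolved for AI use."""
    result = {"in_scope": [], "out_of_scope": [], "unresolved": []}
    for name, label in sorted(field_labels.items()):
        if label is None or label not in policy:
            result["unresolved"].append(name)    # unlabelled, or label not in policy
        elif policy[label]:
            result["in_scope"].append(name)
        else:
            result["out_of_scope"].append(name)
    return result


# Hypothetical field inventory with one unlabelled field.
field_labels = {
    "product_name": "public",
    "contract_terms": "confidential",
    "email": "pii",
    "status_2": None,
}
scope = ai_scope(field_labels)
```

Any field that lands in `unresolved` is exactly the situation described above: nobody can give a clean answer about scope until somebody labels it.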

Where projects break

Five failure modes that show up regularly.

These are the failure modes that an experienced data-readiness practitioner spots in the first week of an engagement. Each comes with an indicator - the early sign that the failure mode is present in your environment. If any of these indicators are familiar, the readiness work is real, and not optional.

Failure

01

Siloed data with no integration layer

The data exists, but it lives in five different business systems that do not talk to each other. Each system has its own schema, its own user, its own concept of "customer" or "transaction." Building an AI workload across them requires either an integration project that nobody budgeted or a workaround that quietly limits the deployment to a subset of the data.

Early indicator

"You can describe the dataset in a meeting, but nobody can produce it as one queryable thing."

Failure

02

Undocumented schemas

The columns have names. The columns have meaning. The meaning is in the head of the engineer who built the system, who left two years ago. Models cannot ask people what fields mean - they treat fields literally. An undocumented schema makes every fine-tuning, every RAG, every classification project a small archaeology project first.

Early indicator

"Field names like 'status_2', 'flag_old', 'do_not_use_anymore' are present in production data."

Failure

03

Classification gaps

Nobody can say with certainty which fields are PII, which are IP, which are regulated. The data probably contains all three classes, mixed. Without classification, the GDPR / IP / regulatory analysis cannot be done - and without that analysis, the AI deployment cannot legitimately move forward. Common in organisations that grew the data layer faster than the governance layer.

Early indicator

"When asked 'is this dataset cleared for AI use?', the answer is 'we'll need to check' - and stays that way for weeks."

Failure

04

No defined ownership

Data exists but does not have a named owner. Nobody can authorise its use for AI because nobody knows who has the authority to authorise. Decisions get pushed up the org chart until they hit someone with the authority but not the context. The decision that comes back is conservative, slow, or both.

Early indicator

"The 'data owner' question on any project plan template comes back blank or filled with 'TBD'."

Failure

05

No DPO or legal sign-off path

The data is technically ready and classified. But there is no defined path to get DPO, legal, or compliance sign-off for AI processing. The deployment hits a procedural wall that everyone agrees should not be a wall - but the wall remains because the procedure to take it down was never built.

Early indicator

"Three months in, the project is 'waiting for legal' - but legal has not been formally asked anything specific."

Engagement shape

What a readiness engagement looks like.

Four phases. The first two - assessment and remediation plan - are short and lead to a go / no-go decision. The third - remediation execution - is the real work, and its scope depends on what the assessment found. The fourth is the validation and handover that turns the data layer into a documented foundation.

Phase 01

2–4 weeks

Assessment

A data-engineering or AI-consultancy partner reviews the data layer against the five dimensions above. Output is a gap analysis with severity-ranked findings - and an honest read on whether the deployment can move forward, has to wait, or needs scope adjustment.

Phase 02

1–2 weeks

Remediation plan

For each gap, a defined remediation: what to fix, who fixes it, how long it takes, what it costs. The plan is a forecast, not yet a commitment - it lets the customer decide what to fund and what to defer.

Phase 03

4–24 weeks

Remediation execution

Variable. A classification cleanup might be two weeks. A schema documentation project across five systems might be six months. The plan above scopes this; the execution is real engineering work delivered by the partner with internal data-team participation.

Phase 04

1–2 weeks

Validation and handover

The remediated data layer is re-tested against the readiness dimensions. The findings are documented in a way that survives auditor review. The data layer becomes the documented foundation that the AI deployment then builds on.

Why we ask before we size

Hardware sizing depends on the data state.

Two RAG deployments with the same notional workload - same number of users, same queries per second, same response-time target - can size very differently depending on the data state. A clean, well-indexed corpus retrieves in one shot and feeds the model concise context. A messy, sparsely-indexed corpus needs multiple retrieval passes, longer context windows, and bigger models to compensate for retrieval misses. Same workload on paper; very different infrastructure footprint.
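The arithmetic behind that difference can be sketched in a few lines. The function below is a deliberately crude proxy, assuming each retrieval pass triggers one model call that reads the retrieved context and writes an answer. Every number is invented for illustration, and the sketch ignores batching, caching, and model size, so it is not a sizing method.

```python
def tokens_per_second(queries_per_second, retrieval_passes, context_tokens, output_tokens):
    """Rough token throughput the model must sustain for a RAG workload.

    Assumes one model call per retrieval pass, each reading context_tokens
    and writing output_tokens.
    """
    per_call = context_tokens + output_tokens
    return queries_per_second * retrieval_passes * per_call


# Same notional workload (5 queries per second), two hypothetical data states.
clean = tokens_per_second(queries_per_second=5, retrieval_passes=1,
                          context_tokens=2_000, output_tokens=300)
messy = tokens_per_second(queries_per_second=5, retrieval_passes=3,
                          context_tokens=6_000, output_tokens=300)
ratio = messy / clean
```

With these made-up inputs the messy-data case demands roughly eight times the token throughput of the clean-data case, before any move to a bigger model is counted.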

A fine-tuning workload over high-quality labelled data can use a smaller base model and converge in fewer epochs. A fine-tuning workload over noisier data needs a larger base model to extract whatever signal is there, more compute to converge, and more careful evaluation to know it converged. A workstation-class problem becomes a cluster-class problem, just because the data is not where it could be.

This is why the LM TEK conversation about hardware sizing starts with questions about data readiness. We do not undertake the data work. We do need to know its shape before we can size the hardware honestly. A platform sized for the messy-data version of a deployment is overspend if the data is in fact ready; a platform sized for the clean-data version is undersized if the data is not. The buying-guide stages cannot account for either outcome while the data state is unknown.

See the buying guide

Honest scoping

LM TEK does not do data engineering.

We are not a data-engineering firm and we are not pretending to be one. We do not run readiness assessments, we do not build pipelines, we do not write transformation logic, we do not classify data layers. That work belongs to data-engineering specialists, AI consultancies with a data practice, and larger system integrators with the right team in-house.

What we do is the hardware foundation that production AI deployments run on, once the data layer is in shape. The reason this page exists is that customers reaching us at the hardware conversation are best served when their data work is either complete or scoped - and the most useful thing we can do for them before that is to set out, clearly, what data readiness involves and which partners deliver it.

When you tell us your situation in the routing form, the data state is one of the questions we'll ask. If the data work is not yet done, the partner shortlist we recommend includes a data-engineering or readiness specialist alongside the system integrator and AI consultancy. We do not introduce hardware partners into a deployment whose data layer is unresolved - the result of doing so is overpriced hardware sitting next to a deployment that cannot run.

Read next

Where this connects.

The journey

Deploying AI in your business

Six phases from discovery to production. Data readiness is the cross-cutting concern that runs through Phases 1, 2, and 4.

Compliance

Private AI for sensitive data

When data classification matters most. Data readiness is the precondition for any defensible compliance posture.

Foundations

AI workload patterns

The five shapes most enterprise AI deployments take. Each pattern has its own data-readiness requirements; this page is the upstream prerequisite.

Tell us your data starting point.

Describe where the data layer is - what's in shape, what isn't, what you already know about the gaps. We'll route you to a partner who runs disciplined readiness work, and we'll wait at the hardware conversation until that work is complete.


How these relate

These four explainers look like competing stage-models, but they measure different things. Read them by their axis, not as one list viewed four ways.

The journey

Six phases of activity over time. The master timeline: the map everything else attaches to.

From PoC to Production

A zoom into the validation gate. Zooms into Phase 3 of the journey: the gate most projects stall at.

AI Infrastructure Buying Guide

Five stages of hardware investment size. A different lens: the scaling tail downstream of the PoC gate, not a 1:1 relabel of the phases.

Data Readiness for AI (you are here)

The upstream data precondition. The precondition all three depend on.


LM TEK d.o.o. · Pod Lipami 10 · 1218 Komenda · Slovenia