Garbage In, Garbage Out: Why a Robust Data Strategy is Non-Negotiable for AI in Real Estate

AI will not rescue poor data. In UK real estate, where value hinges on leases, planning, energy performance and local policy, the accuracy and auditability of data determine whether algorithms accelerate insight or institutionalise error. This paper explains why a “data-first” strategy is the precondition for “AI-first” ambitions, diagnoses the structural problems peculiar to property data, and sets out a pragmatic blueprint for improvement. It also examines the promise and pitfalls of synthetic data and illustrates, with UK-focused examples, how better data practice changes outcomes.

Why data quality decides AI outcomes in property

Real estate decisions are document-heavy, locally contingent and time-sensitive. A valuation or risk model is only as good as its inputs: if EPC ratings are misread, planning status is outdated, or lease clauses are missing, model sophistication is irrelevant. Unlike consumer tech, property markets lack a single canonical feed; useful signals live across HM Land Registry, local planning portals, EPC registers, lender term sheets, surveyor reports and building telemetry. A robust data strategy is therefore not hygiene; it is the competitive edge that makes AI defensible.

Deconstructing the data problem

The industry’s data issues are structural rather than incidental. Fragmentation is the first: public registers, broker platforms, property-management systems and IoT sensors operate in silos with inconsistent identifiers and licensing. Latency is the second: transaction records and many operational datasets arrive weeks or months after the fact, so naïve models are trained on yesterday’s regime. Inaccessibility compounds both: the most valuable information sits in PDFs such as leases, side letters and officer reports, where critical facts are text, not fields. Finally, standardisation is weak. Even apparently simple concepts such as Net Internal Area or “green clauses” vary by document set and practice, making portfolio-wide analysis unreliable unless terms are harmonised.

A fifth problem is ownership. Many firms cannot name the steward for a “source of truth” on core entities (asset, lease, counterparty, EPC, policy), leading to duplication, silent drift and disputes that surface only at investment committee.

The cost of poor data: financial, legal, reputational

Bad data scales faster with AI. An automated valuation model (AVM) trained on stale comparables and inconsistent floor-area definitions will misprice assets systematically; speed simply magnifies the error. Bias in historical records—coverage skew by postcode, asset type or tenant profile—propagates into models unless detected and mitigated, creating exposure under the Equality Act 2010 and undermining investor trust. Privacy lapses are common when lease or resident data are pooled without a lawful basis; under UK GDPR, “experimentation” is not a defence. Most waste, however, is operational: teams invest in models only to spend the majority of time cleaning inputs, while decisions revert to spreadsheets because outputs cannot be explained from the underlying records.

Failure case (illustrative).
A London office acquisition used an internal AVM for initial pricing. EPC data were scraped without deduplication; “proposed” ratings were mixed with “current”. The model over-weighted energy performance in two boroughs and overstated value uplift for retrofit. The error was discovered late, when the lender’s due diligence contradicted the assumption. Root cause analysis showed absent lineage, no field-level definitions, and no freshness checks on EPC snapshots.

A modern data blueprint that teams can live with

A workable strategy couples architecture with operating model. The backbone is a data platform that ingests, stores and serves both structured and unstructured sources with lineage preserved end-to-end. Around it sit four elements:

1) Clear data products and ownership. Treat core domains (assets, leases, parties, locations, EPCs, policies) as data products with named owners, published schemas, SLAs for freshness and quality, and contract-like guarantees (“schema contracts”) for downstream users. Ownership lives in the business, not only in IT.
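
By way of a minimal sketch (Python here; the domain, owner, field names and SLA are illustrative assumptions, not a prescribed standard), a schema contract can be a small, versioned definition that downstream code validates records against:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class SchemaContract:
    """Hypothetical contract published with a data product."""
    domain: str
    owner: str                 # a named business steward, not a team alias
    version: str
    freshness_sla: timedelta   # maximum acceptable age of a snapshot
    required_fields: dict      # field name -> expected Python type

LEASE_CONTRACT = SchemaContract(
    domain="lease",
    owner="Head of Asset Management",   # assumption: ownership sits in the business
    version="1.2.0",
    freshness_sla=timedelta(days=30),
    required_fields={
        "lease_id": str,
        "asset_id": str,
        "indexation_type": str,         # e.g. "CPI", "RPI", "fixed"
        "term_end": datetime,
        "nia_sqm": float,
    },
)

def violations(record: dict, contract: SchemaContract) -> list[str]:
    """Return human-readable breaches of the contract for one record."""
    problems = []
    for name, expected in contract.required_fields.items():
        if record.get(name) is None:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems
```

The value of the contract is less the code than the agreement it encodes: when the owner changes a field, downstream users find out before their models do.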

2) Document intelligence, not just tables. Build a repeatable pipeline that turns PDFs into facts with citations. Lease abstraction should store both the extracted field and the paragraph it came from; planning watchers should index official documents and expose paragraph-level retrieval. If a model cannot point to the clause behind a feature, expect committee pushback.
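
As a hedged sketch of the underlying record (field names and the example clause are invented for illustration), each abstracted value can carry a citation back to the document, page and paragraph it came from:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    """Pointer back to the source text an extracted value relies on."""
    document_id: str   # e.g. a lease or officer report in the document store
    page: int
    paragraph: int
    excerpt: str       # the verbatim clause or sentence relied upon

@dataclass(frozen=True)
class ExtractedFact:
    """One abstracted field plus the evidence behind it."""
    entity_id: str     # the lease or application the fact belongs to
    field_name: str    # e.g. "break_date", "service_charge_cap"
    value: str
    confidence: float  # confidence reported by the abstraction step
    citation: Citation

fact = ExtractedFact(
    entity_id="lease-0042",
    field_name="indexation_type",
    value="CPI, capped at 4%",
    confidence=0.93,
    citation=Citation(
        document_id="lease-0042.pdf",
        page=14,
        paragraph=3,
        excerpt="The rent shall be reviewed annually in line with CPI, subject to a cap of 4 per cent...",
    ),
)
```

Storing the excerpt alongside the value is what lets an investment paper show the clause, not just the number.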

3) A feature store and knowledge graph. Features used in models (e.g., distance to transport nodes under construction, indexation type, service-charge caps, flood score) should be computed once, catalogued and reused across use-cases. A lightweight knowledge graph links properties, leases, counterparties, policies and infrastructure so that queries resolve relationships rather than brittle joins.
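
As a rough illustration using the open-source networkx library (all identifiers are made up), entities become nodes, relationships become typed edges, and exposure questions are answered by walking relationships rather than writing brittle joins:

```python
import networkx as nx

g = nx.MultiDiGraph()

# Hypothetical entities drawn from the curated data products.
g.add_node("asset:ldn-001", kind="asset", nia_sqm=12_500)
g.add_node("lease:0042", kind="lease", indexation_type="CPI")
g.add_node("party:tenant-a", kind="counterparty")
g.add_node("policy:local-plan-2024", kind="policy")

# Typed relationships between them.
g.add_edge("lease:0042", "asset:ldn-001", relation="demises")
g.add_edge("party:tenant-a", "lease:0042", relation="holds")
g.add_edge("policy:local-plan-2024", "asset:ldn-001", relation="applies_to")

def parties_exposed_to(graph: nx.MultiDiGraph, policy: str) -> set[str]:
    """Which counterparties hold leases on assets a given policy applies to?"""
    exposed = set()
    for _, asset, data in graph.out_edges(policy, data=True):
        if data.get("relation") != "applies_to":
            continue
        for lease, _, d in graph.in_edges(asset, data=True):
            if d.get("relation") != "demises":
                continue
            for party, _, d2 in graph.in_edges(lease, data=True):
                if d2.get("relation") == "holds":
                    exposed.add(party)
    return exposed

print(parties_exposed_to(g, "policy:local-plan-2024"))  # {'party:tenant-a'}
```

In practice the graph sits over the feature store and data products rather than hand-entered nodes, but the query pattern is the point.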

4) Governance that travels with the data. Every dataset carries provenance, licence, refresh cadence, and quality tests (completeness, conformance, uniqueness, accuracy, timeliness). Monitoring surfaces breaches; change control manages schema evolution. Where personal data are processed, complete DPIAs and minimise access by design.

Quality you can measure

Quality cannot be wished into being; it must be measured and enforced.

  • Freshness/latency: time since last update for Land Registry snapshots, EPC pulls, planning indices and rent rolls.
  • Completeness/fill rate: coverage of core fields (e.g., indexation type present in ≥95% of leases in scope).
  • Conformance: schema and units (e.g., NIA units and measurement standard tagged and consistent).
  • Accuracy: cross-checks (e.g., floor area vs. rateable value ranges; EPC ratings vs. energy bills where lawful).
  • Uniqueness: duplicate detection for assets, leases and parties.
  • Stability: volatility of key features absent real-world change; unexpected swings indicate pipeline faults.

Dashboards that show these metrics by source and by asset class do more to improve modelling than an extra algorithmic tweak.
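
As a minimal sketch of how such checks might be computed on each refresh (pandas here; the field names, snapshot and thresholds are assumptions):

```python
from datetime import datetime, timezone
import pandas as pd

# Hypothetical lease snapshot; in practice this comes from the curated data product.
leases = pd.DataFrame({
    "lease_id": ["L1", "L2", "L2", "L3"],
    "indexation_type": ["CPI", "RPI", "RPI", None],
    "last_updated": pd.to_datetime(
        ["2025-01-10", "2024-11-02", "2024-11-02", "2025-01-08"], utc=True),
})

now = datetime.now(timezone.utc)
report = {
    # Freshness/latency: age of the oldest record in the snapshot, in days.
    "max_age_days": (now - leases["last_updated"].min()).days,
    # Completeness/fill rate: share of leases with an indexation type recorded.
    "indexation_fill_rate": float(leases["indexation_type"].notna().mean()),
    # Uniqueness: duplicated lease identifiers usually indicate an ingestion fault.
    "duplicate_lease_ids": int(leases["lease_id"].duplicated().sum()),
}
print(report)  # breaches of agreed thresholds (e.g. fill rate >= 0.95) feed the dashboard
```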

Synthetic data: power, with sharp edges

Synthetic data can help where real data are sparse, sensitive or missing. In property it is useful for three things: augmenting rare events (e.g., arrears spikes), enabling safe development and testing without exposing personal or contractual information, and stress-testing models under unobserved shocks (rate jumps, carbon pricing, abrupt policy shifts).

But synthetic data are not magic. Poorly trained generators replicate bias or invent spurious correlations; “membership inference” attacks can reveal whether real records were used to train the generator when privacy is not enforced; and models trained primarily on synthetic records may generalise poorly. Governance must therefore include:

  • Utility tests: “Train on synthetic, test on real” (TSTR) to verify that augmentation improves out-of-sample performance (a minimal sketch follows this list).
  • Privacy tests: disclosure risk assessment and bounds (e.g., differential privacy for tabular generators where appropriate).
  • Provenance & scope labels: synthetic records must be tagged and barred from certain downstream decisions (e.g., they cannot be the sole basis for pricing).
  • Domain oversight: subject-matter experts review distributions and relationships for plausibility before release.
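
A minimal TSTR sketch with scikit-learn, assuming a real table and a synthetic table with matching columns (the generator itself is out of scope here):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tstr_auc(real_X, real_y, synth_X, synth_y, seed=0):
    """Compare a model trained on synthetic data with one trained on real data,
    both scored on the same held-out slice of real data."""
    X_train, X_test, y_train, y_test = train_test_split(
        real_X, real_y, test_size=0.3, random_state=seed, stratify=real_y)

    real_model = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    synth_model = GradientBoostingClassifier(random_state=seed).fit(synth_X, synth_y)

    return {
        "train_real_test_real": roc_auc_score(y_test, real_model.predict_proba(X_test)[:, 1]),
        "train_synth_test_real": roc_auc_score(y_test, synth_model.predict_proba(X_test)[:, 1]),
    }
```

If the synthetic-trained score collapses relative to the real-trained baseline, the generator has not captured the relationships that matter and the augmentation should not be released.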

Worked example (stress).
A build-to-rent (BTR) operator builds arrears-risk models. Historical data cover mild inflation only. A synthetic module generates payment trajectories under CPI > 8% combined with energy-price volatility, preserving household-income mix and tenancy structures. TSTR shows improved calibration for high-stress scenarios; privacy tests confirm low disclosure risk. The model card notes that synthetic augmentation is used for scenario planning, not tenant-level decisions.
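
Purely as a toy illustration of the scenario-generation step (the distributions and coefficients below are invented for the sketch, not taken from any operator’s model):

```python
import numpy as np

rng = np.random.default_rng(42)

def arrears_paths(n_households: int, months: int, cpi: float, energy_vol: float) -> np.ndarray:
    """Simulate per-household monthly missed-payment indicators under a stressed regime."""
    income_strain = rng.normal(loc=cpi * 0.6, scale=energy_vol, size=(n_households, months))
    base_propensity = rng.beta(a=2.0, b=30.0, size=(n_households, 1))  # household mix
    p_arrears = np.clip(base_propensity + 0.02 * income_strain, 0.0, 1.0)
    return rng.binomial(n=1, p=p_arrears)  # 1 = missed payment in that month

stressed = arrears_paths(n_households=5_000, months=12, cpi=0.09, energy_vol=0.05)
print("simulated arrears rate under stress:", round(float(stressed.mean()), 3))
```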

UK-specific examples: how better data changes outcomes

Logistics valuation in outer London.
A fund’s AVM previously blended undated planning references and inconsistent NIA. After introducing a knowledge graph anchored to official planning documents, paragraph-level retrieval, and a floor-area standard tag, error bands narrowed and, more importantly, explanations stabilised. Investment papers now show “why” with citations rather than only numbers.

Planning watcher for Greater Manchester.
A developer’s data pipeline indexes officer reports and committee minutes, classifies stage (consultation, committee, adoption) and extracts conditions precedent. A style guide separates quoted policy from analyst interpretation. An incident where a press release was mistaken for adoption led to a source-prioritisation rule (statutory docs over media) and a “citation coverage” score in briefs.

PBSA pipeline scoring.
A purpose-built student accommodation (PBSA) investor harmonises definitions of room types and amenity bundles across operators, links them to mobility and university intake data, and measures fill rates and freshness. Scores now reflect comparable entities; portfolio exposure by driver (student mix, travel time, supply pipeline) replaces a cruder postcode proxy.

Operating model: who does what

Good data strategy is as much about people as pipelines. Create a small Data Council (Investment, Asset Management, Risk, Legal, Data) to set priorities and resolve trade-offs. Assign Data Stewards for each domain with time in their job plan, not as a hobby. Tie model releases to data readiness: if quality metrics breach thresholds or definitions are in flux, the model waits. Require model factsheets to reference dataset versions and quality scores; require decision logs to store the data snapshot and citations behind recommendations.

A 90-day roadmap that teams follow

Begin with a single, material decision where poor data currently slows or undermines outcomes, often lease analytics for diligence or a planning watcher for pipeline. In the first month, inventory sources, define domains and owners, agree field-level definitions and licences, and stand up basic lineage and freshness monitoring. In the second month, build a document-to-fact pipeline with citations and a small feature store for the immediate use-case; publish a first quality dashboard and set thresholds. In the third month, connect the model to the curated features, run regime-aware back-tests, and embed the artefacts (factsheet, data catalogue entry, quality metrics and decision log) into the investment paper. From there, scale horizontally: new use-cases inherit the plumbing.

Common pitfalls—and how to avoid them

Two traps recur. First, “platform first, semantics later”: buying tooling without agreeing definitions simply moves chaos into the cloud. Start with language, not logos. Second, “accuracy at any cost”: squeezing model error while ignoring freshness, coverage or explainability yields brittle systems and sceptical committees. Optimise holistically: quality, traceability and stability matter as much as point accuracy.

Conclusion

AI compounds whatever you feed it. In UK real estate, the firms that win will not be the ones with the flashiest models, but the ones that can show, clearly and repeatably, where each number came from, how fresh it is, and what it means. A modern data strategy built on owned domains, document-grounded facts, reusable features, measurable quality and disciplined governance turns “Garbage In, Garbage Out” into a design constraint rather than a post-mortem. Only on that foundation does AI become a reliable accelerator instead of a risk multiplier.
