Introduction: Why integrations are where retail tech programmes fail
Ask any CTO who has run a large retail technology programme what kept them up at night, and the answer is rarely the platform they chose. It is the spaces between platforms. The integration between the commerce engine and the OMS that drops orders under load. The sync between the ERP and the warehouse system that runs hours behind during peak trade. The customer data that lives in three different places because the CRM integration was never fully finished.
Platform selection gets the attention because it is the part with the impressive demos and the headline price tags. Integration gets the budget cuts because it looks like plumbing: necessary but unglamorous, and easy to deprioritise until something breaks. The result is that most retail technology programmes underinvest in integration architecture, understaff integration delivery, and discover the consequences in the first peak trading period after go-live.
This article covers the practical dimensions of building an integration strategy that holds together under operational conditions: architecture patterns, technology choices, vendor management, error handling, and the often-neglected question of who actually owns the spaces between systems.
The spaghetti problem: how integration debt accumulates
How it starts: a few simple connections
No retail business starts with a complex integration landscape. It starts with a commerce platform and an ERP, connected by a nightly export that someone built in a weekend. Then a new OMS gets added, and two more connections get built. Then a CRM, a marketing automation platform, a loyalty programme, a new warehouse system. Each connection gets built for the immediate need, in isolation, with no architecture pattern and no shared standards.
Five years later, a realistic mid-market retailer is running a commerce platform, an ERP, an OMS, a WMS, a CRM, a marketing platform, a loyalty programme, a product information management system, and a customer service tool. That is nine systems. At full point-to-point connectivity, n systems can have up to n(n-1)/2 direct connections, so nine systems can have up to 36. In practice, most retailers have somewhere between fifteen and thirty active integration paths, built at different times by different teams using different approaches, documented nowhere comprehensively, and understood end-to-end by nobody.
The symptoms of integration debt
Integration debt manifests in specific, operational ways. Data inconsistencies between systems: the product count in the catalogue differs from the product count in the OMS, and neither team is sure which is correct. Sync processes that run hours behind schedule under normal load and fall days behind during peak periods. Customer service teams who cannot answer basic questions about order status because the data they can see is not the data that matters.
The diagnostic that matters most is this: if something goes wrong in your integration layer at 10pm on a Friday night, how long before someone notices, how long before they know which system is responsible, and how long before they can fix it? For most businesses with accumulated integration debt, the honest answers to those three questions are uncomfortable.
The compounding effect
Integration debt compounds faster than most other forms of technical debt because every new system added to the landscape creates multiple new failure modes rather than one. A new platform with five integration points does not add five integration risks. It adds five integration risks multiplied by the fragility of the existing integration layer it is connecting to.
The practical consequence is that businesses with high integration debt find it increasingly expensive to add new systems, integrate acquisitions, or respond to new channel requirements. What should be a two-week integration project becomes a three-month archaeology exercise just to understand what the current state is.
Integration architecture patterns
Point-to-point: when it works and when it breaks
Point-to-point integration, where System A talks directly to System B, is the right choice at small scale. If you are running three or four connected systems with modest data volumes and simple transformation requirements, direct connections are simpler to build, easier to understand, and perfectly adequate for the job.
The conditions under which point-to-point becomes a liability are fairly predictable. When you exceed five or six connected systems, the number of potential integration paths starts to become unmanageable. When the same data needs to flow to multiple systems, maintaining consistency across multiple direct connections becomes fragile. When any connected system changes its API, you have to update every connection to it individually rather than updating a single integration point. These are not theoretical concerns. They are the operational reality of point-to-point integration at scale.
Hub-and-spoke: the workhorse pattern
Hub-and-spoke architecture routes all data through a central integration layer that handles message routing, data transformation, and error management. Instead of System A talking directly to Systems B, C, and D, it talks to the hub, which knows how to translate and route the message to the appropriate destinations.
This is the right default architecture for most mid-market retailers running more than five connected systems. The central layer provides a single place to observe what is happening across all integration flows, a single place to apply transformation logic, and a single point for error handling and retry management. The tradeoff is that the hub becomes a critical dependency: if it goes down, most integrations stop. This makes hub selection and operational resilience important decisions, not afterthoughts.
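To make the pattern concrete, here is a minimal sketch of a hub in Python. The system names, message types, and payload fields are illustrative rather than drawn from any particular middleware product; a real hub would add the error handling and monitoring described above.

```python
from typing import Callable

def send(destination: str, message: dict) -> None:
    """Placeholder for the transport layer: an HTTP call, a queue publish, etc."""
    print(f"-> {destination}: {message}")

class IntegrationHub:
    """Central point for routing, transformation, and (eventually) error handling."""

    def __init__(self) -> None:
        # (source system, message type) -> (destination systems, transform function)
        self._routes: dict[tuple[str, str], tuple[list[str], Callable[[dict], dict]]] = {}

    def register(self, source: str, message_type: str,
                 destinations: list[str],
                 transform: Callable[[dict], dict]) -> None:
        self._routes[(source, message_type)] = (destinations, transform)

    def dispatch(self, source: str, message_type: str, payload: dict) -> None:
        destinations, transform = self._routes[(source, message_type)]
        outbound = transform(payload)      # one place for transformation logic
        for destination in destinations:
            send(destination, outbound)    # one place for delivery, retry, logging

# Example: order-created messages from commerce fan out to the OMS and the CRM.
hub = IntegrationHub()
hub.register("commerce", "order.created", ["oms", "crm"],
             lambda order: {"order_ref": order["id"], "lines": order["items"]})
hub.dispatch("commerce", "order.created", {"id": "A-1001", "items": []})
```

The point of the sketch is structural: System A calls `dispatch` once, and everything else (who receives the message, in what shape) is the hub's concern, visible and changeable in a single place.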
Event-driven: decoupled and scalable
Event-driven integration uses published events rather than direct calls. When something happens in System A, it publishes an event to a message broker. Any system interested in that event subscribes and processes it independently. This creates looser coupling than hub-and-spoke: systems do not need to know about each other, only about the events they produce and consume.
Event-driven architecture is well-suited to high-volume, asynchronous flows where real-time processing matters but tight coupling is undesirable. Inventory updates across multiple channels, order status changes flowing to fulfilment and customer communication systems simultaneously, and real-time pricing updates are all good candidates. The operational maturity required to run an event-driven architecture effectively, with proper dead letter handling, event schema governance, and consumer group management, is meaningfully higher than for hub-and-spoke. This is not an architecture to adopt speculatively.
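As a sketch of the shape of the pattern, the broker below is reduced to an in-process dictionary of subscribers. In production that role is played by a real message broker such as Kafka or RabbitMQ, with the durability, dead letter handling, and consumer group management noted above; the event names and payloads here are illustrative.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """In-process stand-in for a message broker."""

    def __init__(self) -> None:
        self._subscribers: defaultdict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        # The publisher knows nothing about its consumers, only the event.
        for handler in self._subscribers[event_type]:
            handler(event)

bus = EventBus()
bus.subscribe("inventory.updated", lambda e: print("web channel sees:", e))
bus.subscribe("inventory.updated", lambda e: print("store stock view sees:", e))
bus.publish("inventory.updated", {"sku": "SKU-123", "on_hand": 42})
```

Note what the publisher does not do: it never names its consumers. Adding a third channel that needs inventory updates is a new subscription, not a change to the publishing system.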
Choosing the right pattern for your context
Most retailers will not run a single integration pattern across their entire landscape. The practical approach is hub-and-spoke for core commerce flows where visibility and reliability matter most, point-to-point for simple low-volume connections that do not justify middleware overhead, and event-driven selectively for the specific high-volume or real-time flows where its advantages are most pronounced.
The decision framework is straightforward: how many systems are you connecting, what are the data volumes and latency requirements, how mature is your team’s operational capability, and what is your budget for middleware? For businesses that cannot answer these questions confidently, hub-and-spoke with a well-chosen iPaaS is almost always the right starting point.
Middleware vs API-first: the technology layer
What middleware actually does
Middleware is not a luxury for large enterprises. It is the operational infrastructure that makes integration manageable at scale. Its core functions are message routing (getting data from System A to System B via the right path), data transformation (converting the format and structure of data between systems), error handling (detecting failures and routing them for investigation), retry logic (re-attempting failed operations safely), and monitoring (providing visibility into what is happening across all integration flows).
Without middleware, these functions get implemented inconsistently across individual integrations, or not at all. The result is an integration landscape where some connections have basic error handling and others silently discard failures, where monitoring requires checking multiple systems, and where diagnosing an incident requires a developer who remembers how each individual connection was built.
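To illustrate the difference, here is a transformation written so that failures cannot be discarded silently. The field names are hypothetical; the point is that a malformed record raises an error the middleware can log and route for investigation, rather than producing a partial record or vanishing.

```python
def transform_order(raw: dict) -> dict:
    """Convert a commerce order into a (hypothetical) OMS payload.

    A malformed record raises, so the caller can log the failure and
    route it for investigation instead of silently dropping it.
    """
    try:
        return {
            "order_ref": raw["id"],
            "currency": raw.get("currency", "GBP"),
            "lines": [{"sku": line["sku"], "qty": int(line["quantity"])}
                      for line in raw["items"]],
        }
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(
            f"order {raw.get('id', 'unknown')} failed transformation: {exc!r}"
        ) from exc
```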
iPaaS, enterprise middleware, or custom-built
For most mid-market retailers, an iPaaS platform is the right starting point. Options like Celigo, Workato, Boomi, and similar tools provide pre-built connectors for the most common retail systems, a visual development environment in which non-engineers can handle routine changes, built-in monitoring and alerting, and managed infrastructure. The build time and operational overhead are materially lower than for a custom middleware build.
Enterprise middleware platforms like MuleSoft are appropriate for large businesses with complex integration requirements, significant custom transformation logic, and large enough teams to justify the licensing and operational cost. They are over-engineered for most mid-market retailers and will consume budget that is better spent on the integrations themselves.
Custom-built middleware is appropriate in a narrow set of circumstances: when your transformation logic is sufficiently specialised that no iPaaS handles it well, when your data volumes are high enough to make per-transaction iPaaS pricing uneconomical, or when you have a strong engineering team that can maintain the middleware over the long term. Custom-built middleware that outlives the engineer who built it is one of the most common sources of integration debt. Build it only when you have a clear reason and a clear ownership plan.
API-first thinking
API-first means designing system interfaces around well-documented, versioned APIs from the start, rather than relying on database replication, file transfers, or proprietary vendor connectors as primary integration mechanisms. It is an orientation more than a technology choice.
The practical value of API-first thinking is reduced coupling between systems and more resilient integration design. An integration built on a published, versioned API is far more likely to survive the target system’s upgrades and changes than one built on a vendor-specific connector that breaks whenever the platform releases a major update. This is not a theoretical benefit. Retailers who have lived through a major platform upgrade while running a file-transfer-based integration know exactly what it costs.
Managing vendor dependencies
Vendor APIs are not all equal
Vendor API quality varies enormously, and the gap between what is promised in the sales process and what the API actually delivers is a consistent source of integration project overruns. Common issues include: rate limits that are tighter than the vendor disclosed, missing endpoints that force workarounds or manual processes, documentation that is either incomplete or out of date, breaking changes introduced without notice or adequate migration time, and bulk data operations that are either unsupported or prohibitively slow.
The way to discover these issues is to ask the right questions during vendor evaluation: request API documentation before you sign, ask for references specifically from businesses that have integrated the platform with your existing systems, and include API capability as an explicit evaluation criterion in your scoring model.
Protecting yourself from vendor API changes
Abstraction layers are the primary defence against vendor API volatility. Rather than building your integration logic directly against a vendor’s API endpoints, build against your own internal abstraction that the vendor-specific connector sits beneath. When the vendor changes their API, you update the connector. Your integration logic does not need to change.
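A minimal sketch of the idea, with a hypothetical vendor standing in for a real one: the integration logic depends only on an internal interface, and knowledge of the vendor's payload shape is confined to the adapter.

```python
from abc import ABC, abstractmethod

class OrderDestination(ABC):
    """The internal interface your integration logic is written against."""

    @abstractmethod
    def push_order(self, order: dict) -> None: ...

class AcmeOmsAdapter(OrderDestination):
    """Adapter for a hypothetical vendor API. When the vendor changes an
    endpoint or payload shape, only this class changes."""

    def push_order(self, order: dict) -> None:
        payload = {"externalRef": order["order_ref"],   # vendor-specific shape
                   "orderLines": order["lines"]}
        # http_client.post("https://api.acme-oms.example/v2/orders", json=payload)
        print("pushed to Acme OMS:", payload)

def submit_order(destination: OrderDestination, order: dict) -> None:
    destination.push_order(order)   # business logic never sees vendor detail

submit_order(AcmeOmsAdapter(), {"order_ref": "A-1001", "lines": []})
```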
Contract testing between your integration layer and vendor APIs provides early warning of breaking changes. Version pinning where vendors support it gives you control over when you absorb changes. Maintaining a direct communication channel with your vendor’s technical team, separate from the support queue, means you get advance notice of changes that affect your integrations rather than discovering them in production.
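A contract test can be as simple as a scheduled assertion run against the vendor's sandbox. The payload fields below are illustrative; the value is that the test fails before production does.

```python
def fetch_sample_order() -> dict:
    """Stand-in for a call against the vendor's sandbox environment."""
    return {"id": "A-1001", "items": [{"sku": "SKU-1", "quantity": 2}]}

def test_order_payload_contract() -> None:
    # Run on a schedule; a failure here is advance warning that the vendor
    # has changed the shape your integration depends on.
    order = fetch_sample_order()
    assert isinstance(order["id"], str)
    assert isinstance(order["items"], list)
    for line in order["items"]:
        assert {"sku", "quantity"} <= line.keys()

test_order_payload_contract()
```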
The vendor lock-in integration dimension
Every deep integration creates switching costs that go well beyond the platform licence. A business that has built custom integration logic against a vendor's proprietary API and data model has invested engineering effort, accumulated operational knowledge, and built operational process around that vendor's specific implementation. Replacing the vendor means replacing not just the platform but the integration layer, the transformation logic, and the operational procedures built around it.
Integration architecture should be designed with future flexibility in mind even when you have no current plan to change vendors. That means abstraction layers, standards-based data formats, and integration patterns that isolate your business logic from vendor-specific implementation details. This is not gold-plating. It is the difference between a future migration that costs three months and one that costs twelve.
Monitoring, error handling, and operational resilience
Why error handling is not an afterthought
Most integration failures do not happen at build time. They happen under production conditions, when data quality is different from what was tested, when load is higher than anticipated, when a dependent system has a brief outage, or when an edge case in the data triggers a transformation error that was never considered. The integrations that work reliably in production are the ones that were built with explicit handling for all of these conditions, not just the happy path.
The cost of inadequate error handling is not just technical. It is operational. An order that fails to flow from the commerce platform to the OMS becomes a customer service problem. A product update that fails to sync from the PIM to the search index becomes a data quality problem. A customer record that fails to update in the CRM becomes a marketing problem. Each failure generates operational overhead that is far more expensive than the engineering cost of building proper error handling from the start.
Building observable integrations
Observable integrations are integrations that tell you what they are doing without requiring a developer to investigate. At a minimum, this means: transaction-level logging that records what was sent, what was received, and what happened; alerting on failures, latency thresholds, and unexpected volumes; dashboards that give your operations team visibility into integration health; and data reconciliation checks that compare record counts and key metrics across connected systems.
The operational rule of thumb is that your operations team should be able to identify whether an integration is healthy or unhealthy without raising a support ticket. If the only way to know whether last night’s product sync worked correctly is to ask an engineer, your monitoring is inadequate.
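Of the checks listed above, data reconciliation is often the cheapest to add and the most frequently skipped. A minimal sketch, with the alerting channel left as a placeholder:

```python
def alert(message: str) -> None:
    """Placeholder for your alerting channel (pager, chat, email)."""
    print("ALERT:", message)

def reconcile_counts(flow: str, source_count: int, target_count: int,
                     tolerance: int = 0) -> None:
    """Compare record counts across two connected systems and alert on drift."""
    drift = abs(source_count - target_count)
    if drift > tolerance:
        alert(f"{flow}: counts differ by {drift} "
              f"(source={source_count}, target={target_count})")

# Example: nightly check that every catalogue product reached the OMS.
reconcile_counts("catalogue -> OMS product sync",
                 source_count=18204, target_count=18190)
```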
Retry logic, dead letter queues, and graceful degradation
Idempotent operations, where the same message can be processed multiple times without incorrect side effects, are a prerequisite for safe retry logic. An order that gets processed twice becomes a duplicate fulfilment problem. Idempotency prevents this by ensuring that re-processing a message that was already handled produces no change.
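A minimal sketch of the mechanism: each message carries a unique identifier, and the consumer records which identifiers it has already handled. The in-memory set below stands in for what must, in production, be a durable store.

```python
processed_ids: set[str] = set()   # in production: a durable store, not memory

def create_fulfilment(order: dict) -> None:
    print("fulfilment created for", order["order_ref"])

def handle_order_created(message: dict) -> None:
    # The broker may deliver the same message twice; checking the
    # idempotency key first means the side effect happens at most once.
    if message["message_id"] in processed_ids:
        return
    create_fulfilment(message["order"])
    processed_ids.add(message["message_id"])

message = {"message_id": "m-001", "order": {"order_ref": "A-1001"}}
handle_order_created(message)
handle_order_created(message)   # redelivery: safely ignored
```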
Dead letter queues capture messages that have failed all retry attempts, preserving them for investigation and manual remediation rather than discarding them silently. This is the difference between a failed integration that you can recover from and one that loses data permanently.
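A sketch combining bounded retries with a dead letter queue; the plain list standing in for the queue would be a durable queue or table in production:

```python
import time
from typing import Callable

dead_letters: list[dict] = []   # in production: a durable queue or table

def process_with_retry(message: dict, handler: Callable[[dict], None],
                       max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return
        except Exception as exc:
            if attempt == max_attempts:
                # Preserve the failure for investigation; never drop it.
                dead_letters.append({"message": message, "error": repr(exc)})
                return
            time.sleep(2 ** attempt)   # back off before the next attempt
```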
Circuit breakers prevent cascading failures by detecting when a downstream system is unresponsive and temporarily halting traffic to it rather than allowing the queue to grow unbounded. This is particularly important for synchronous integrations in customer-facing flows, where a slow downstream system can degrade the entire customer experience rather than just the affected integration.
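The mechanism is simple enough to sketch: count consecutive failures, stop calling the downstream system once a threshold is crossed, and allow a trial call after a cool-off period.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Halts calls to an unresponsive downstream system, then allows a
    trial call once a cool-off period has elapsed."""

    def __init__(self, failure_threshold: int = 5,
                 cool_off_seconds: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cool_off_seconds = cool_off_seconds
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cool_off_seconds:
                raise RuntimeError("circuit open: skipping downstream call")
            self.opened_at = None   # cool-off elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.consecutive_failures = 0
        return result
```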
The human side: who owns integrations
The ownership gap
The most common integration failure pattern is organisational, not technical. In most retail businesses, each team owns their system and is accountable for its performance. Nobody explicitly owns the connection between systems. When an order status is not updating correctly, the commerce team points at the OMS team, the OMS team points at the ERP team, and the ERP team points at the integration that “someone built two years ago.” The incident takes two days to diagnose and fix something that should have taken two hours, because the knowledge of how the integration works is distributed across three teams and nobody has explicit accountability.
This is not a failure of individual capability. It is a structural failure to assign ownership at the right level of granularity. Integration points are first-class operational components, and they need the same ownership accountability as the systems they connect.
Integration as a team capability
The organisational answer is either a dedicated integration team or a named integration owner with cross-system authority. The integration owner is accountable for: setting integration standards (patterns, error handling, documentation requirements), approving new integration designs before build, maintaining the middleware layer, and coordinating incident response when failures span multiple systems.
In a large business, this is a team of three to five people. In a mid-market business, it may be a single engineer with the right scope of authority and the right relationships across system-owning teams. The title matters less than the accountability. What must not happen is for integration ownership to be split between system-owning teams, because the gap between systems will always be at best a shared responsibility and at worst nobody’s problem.
Documentation and integration contracts
Every integration should have a documented contract covering: what data flows in each direction and in what format, how frequently and triggered by what event, what error handling behaviour is expected from each side, what the SLA is for the integration under normal and degraded conditions, and who is accountable for each end of the connection.
This documentation does not need to be elaborate. A one-page integration contract per connection, maintained in a central register, gives you everything you need to diagnose incidents, onboard new engineers, and manage vendor conversations when something goes wrong. The cost of creating this documentation at build time is trivial. The cost of not having it when you need it is not.
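As an illustration, a register entry can be as small as a handful of fields; the values below are invented for the example.

```python
# One entry per connection, kept in a central register. Values illustrative.
integration_contract = {
    "name": "commerce -> OMS order flow",
    "direction": "commerce pushes, OMS receives",
    "payload": "order.created JSON, schema v3",
    "trigger": "event-driven, on order placement",
    "frequency": "real time, expected end-to-end latency under 60 seconds",
    "error_handling": "three retries with backoff, then dead letter queue",
    "sla": "99.9% of orders delivered within 5 minutes during trading hours",
    "owners": {"source": "commerce team", "target": "fulfilment team"},
}
```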
A phased approach to integration maturity
Phase 1: Inventory and stabilise
The starting point for any integration improvement programme is a complete map of what exists. This sounds simple but is frequently challenging because in businesses with significant integration debt, the integration landscape is not documented comprehensively anywhere. The inventory needs to capture every system, every integration between systems, the technology it is built on, who built it, and when it was last touched.
Once you have the map, prioritise by risk. Which integrations carry the most critical data flows? Which ones are the least well-monitored or the most fragile? Start by adding monitoring and basic error handling to the most critical connections before attempting any architectural changes. The first priority is visibility. You cannot improve what you cannot observe.
Phase 2: Introduce a broker layer
With a stable understanding of the current landscape, the next step is migrating the most critical and most fragile integrations onto a hub-and-spoke architecture with proper middleware. Start with the integrations that support customer-facing flows, where failures have immediate commercial impact. Migrate point-to-point connections to the middleware layer, standardise the error handling and retry logic, and validate the monitoring setup before moving on.
Resist the temptation to migrate everything at once. A phased migration that keeps the most fragile integrations running, even imperfectly, while you build out the new architecture is lower risk than a big-bang cutover that disrupts multiple systems simultaneously.
Phase 3: Standardise and scale
Once the critical integrations are running on a stable middleware layer, establish the standards and governance that allow new systems to be connected predictably. This includes integration design patterns, API standards, documentation requirements, a change management process for integration modifications, and a review gate for new integration designs before they go into build.
The goal of Phase 3 is to change the default from “build it however seems fastest” to “build it to the standard, and here is what the standard looks like.” This prevents the next generation of integration debt from accumulating while you are still cleaning up the previous one.
Conclusion: Integration is a capability, not a project
Integration strategy is not a problem you solve and move on from. It is an operational capability that needs sustained investment, clear ownership, and continuous attention. The businesses that manage integration well treat it the way they treat security or reliability: as a foundational concern that runs alongside all delivery, not a project that runs before delivery and then finishes.
The practical indicators of integration maturity are operational, not architectural. Can your team tell whether integrations are healthy without asking an engineer? Do incidents get diagnosed in minutes rather than days? Can you add a new system to the landscape without triggering a forensic archaeology exercise on the existing connections? If the answers are yes, your integration capability is in reasonable shape. If the answers are no, the phased approach in this article gives you a path to get there.
What to read next
- Headless vs monolith: a practical decision framework for retailers addresses how your platform architecture choice shapes the integration complexity you are managing. The two articles should be read together.
- Why your CRM implementation failed and what to do about it covers a specific domain where integration failure is the most common cause of CRM underperformance.
- Vendor selection without the theatre is relevant when your integration strategy points to replacing or adding a middleware or iPaaS vendor.
Next steps
If your integration landscape has become a source of operational risk or delivery bottleneck, get in touch to discuss a structured assessment.