Case Study: Decoupling a Monolithic Inventory Sync
Problem
A legacy e-commerce platform relied on a single Node.js script to sync inventory levels between an on-premise MSSQL ERP and a cloud-based PostgreSQL storefront.
The script ran every 15 minutes via cron. As the SKU count grew past 500,000, execution time exceeded the 15-minute window and runs began to overlap. This led to database deadlocks, CPU spikes on the ERP, and eventually incorrect stock levels displayed to customers.
Constraints
- No modifications allowed to the legacy ERP: We had read-only access to the MSSQL views.
- Near real-time requirement: Business required stock updates within 2 minutes of a change.
- Hardware limitations: The on-premise gateway had only 4GB of RAM.
Architecture
We moved from a batch-pull model to a stream-based model with backpressure.
Old Flow:
Cron -> Select * from ERP -> Diff in Memory -> Bulk Update Postgres
New Flow:
- Change Detection Service (Poller): A lightweight Node.js service polls the ERP's LastModified column every 30 seconds and fetches only the rows changed since the last stored cursor (a minimal poller sketch follows this list).
- Message Queue (RabbitMQ): Changed SKUs are pushed to a queue.
- Worker Pool: 5 consumer workers pick up SKU updates.
  - They validate the data.
  - They apply complex pricing rules.
  - They perform an upsert into PostgreSQL.
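
For illustration, here is a minimal sketch of the poller loop, assuming the mssql and amqplib npm packages; the view name (InventoryChanges), queue name (inventory.updates), and connection settings are placeholders rather than the production values.

// poller.ts -- change-detection sketch (assumed packages: mssql, amqplib)
import sql from "mssql";
import amqp from "amqplib";

const POLL_INTERVAL_MS = 30_000;   // the 30-second polling cadence
const QUEUE = "inventory.updates"; // illustrative queue name

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch only rows changed since the stored cursor and publish them to the queue.
async function pollOnce(pool: sql.ConnectionPool, channel: amqp.Channel, since: Date): Promise<Date> {
  const result = await pool
    .request()
    .input("since", sql.DateTime2, since)
    .query(
      "SELECT sku, quantity, LastModified FROM InventoryChanges " +
      "WHERE LastModified > @since ORDER BY LastModified"
    );

  let cursor = since;
  for (const row of result.recordset) {
    // Persistent messages so pending updates survive a broker restart.
    channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify(row)), { persistent: true });
    cursor = row.LastModified;
  }
  return cursor; // the caller persists this cursor to disk (see Failure Scenarios)
}

async function main() {
  const pool = await sql.connect({ server: "erp-gateway", database: "ERP", user: "readonly", password: "***" });
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });

  let cursor = new Date(0); // in production, loaded from the on-disk checkpoint
  while (true) {
    cursor = await pollOnce(pool, channel, cursor);
    await sleep(POLL_INTERVAL_MS);
  }
}

main().catch(console.error);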
Database & Data Handling
We used raw SQL for all high-throughput operations to avoid TypeORM overhead.
-- Example of the upsert logic used in Postgres
-- $3 carries the source-side change timestamp (the ERP LastModified value)
INSERT INTO inventory (sku, quantity, last_synced)
VALUES ($1, $2, $3)
ON CONFLICT (sku)
DO UPDATE SET
    quantity = EXCLUDED.quantity,
    last_synced = EXCLUDED.last_synced
WHERE inventory.last_synced < EXCLUDED.last_synced;
Note the WHERE clause: because last_synced carries the source-side change timestamp rather than the processing time, a stale message that arrives late can never overwrite newer data.
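
As a sketch of what the raw-SQL path looks like inside a worker, assuming the pg (node-postgres) driver; the connection settings are placeholders and applyUpdate is a hypothetical helper, not a function from the original codebase.

// upsert-worker.ts -- applying the upsert with the pg driver (assumed)
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

const UPSERT_SQL = `
  INSERT INTO inventory (sku, quantity, last_synced)
  VALUES ($1, $2, $3)
  ON CONFLICT (sku)
  DO UPDATE SET
    quantity = EXCLUDED.quantity,
    last_synced = EXCLUDED.last_synced
  WHERE inventory.last_synced < EXCLUDED.last_synced`;

// Parameterized query: plain values in and out, no per-row model hydration.
export async function applyUpdate(sku: string, quantity: number, lastModified: Date): Promise<void> {
  await pool.query(UPSERT_SQL, [sku, quantity, lastModified]);
}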
Key Decisions
- RabbitMQ over Redis Pub/Sub: We needed durable queues. If the worker service crashed, we couldn't afford to lose inventory updates.
- Raw SQL over ORM: The initial TypeORM implementation used too much heap memory when hydrating thousands of model instances. Switching to pg-node reduced memory usage by 60%.
- Concurrency Limiting: We implemented a semaphore in the poller to ensure we never overloaded the legacy ERP, regardless of the backlog size (see the sketch after this list).
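
The semaphore needs nothing beyond plain promises. A minimal sketch follows; the limit of 3 concurrent ERP queries is illustrative, not the production setting.

// semaphore.ts -- caps concurrent queries against the legacy ERP
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(limit: number) {
    this.available = limit;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No slot free: wait until a running task releases one.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter
    } else {
      this.available++;
    }
  }
}

// Illustrative usage: never more than 3 ERP queries in flight, regardless of backlog.
const erpSemaphore = new Semaphore(3);

async function withErpSlot<T>(run: () => Promise<T>): Promise<T> {
  await erpSemaphore.acquire();
  try {
    return await run();
  } finally {
    erpSemaphore.release();
  }
}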
Failure Scenarios Handled
- ERP Downtime: The poller saves the last successful LastModified timestamp to disk. On restart, it resumes exactly where it left off.
- Bad Data: If a SKU update fails validation (e.g., negative price), it is moved to a Dead Letter Queue (DLQ) for manual inspection, preventing the queue from blocking (see the wiring sketch after this list).
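
RabbitMQ can do the dead-lettering itself once the main queue is declared with a dead-letter exchange: rejecting a message with requeue=false routes it to the DLQ automatically. A sketch with amqplib follows; the exchange and queue names and the validation rule are illustrative.

// dlq-setup.ts -- dead-letter wiring sketch (assumes amqplib; names are illustrative)
import amqp from "amqplib";

async function consumeWithDlq() {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();

  // Dead-letter exchange and queue for updates that fail validation.
  await channel.assertExchange("inventory.dlx", "fanout", { durable: true });
  await channel.assertQueue("inventory.updates.dlq", { durable: true });
  await channel.bindQueue("inventory.updates.dlq", "inventory.dlx", "");

  // Main queue: rejected messages go to the DLX instead of being requeued.
  await channel.assertQueue("inventory.updates", {
    durable: true,
    arguments: { "x-dead-letter-exchange": "inventory.dlx" },
  });

  await channel.consume("inventory.updates", async (msg) => {
    if (!msg) return;
    try {
      const update = JSON.parse(msg.content.toString());
      if (typeof update.sku !== "string" || update.price < 0) {
        throw new Error("failed validation"); // e.g., negative price
      }
      // ... apply pricing rules and run the upsert here ...
      channel.ack(msg);
    } catch {
      // requeue=false dead-letters the message for manual inspection
      channel.nack(msg, false, false);
    }
  });
}

consumeWithDlq().catch(console.error);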
Lessons Learned
- Observability First: We built the metrics dashboard (OpenTelemetry) after the first outage; it should have been built first, because until then we were flying blind on queue depths (a minimal queue-depth gauge is sketched after this list).
- Idempotency is cheap: Making the consumer idempotent saved us weeks of debugging when the queue delivered duplicate messages during a network partition.
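
For the queue-depth metric specifically, an observable gauge is enough. This sketch assumes the @opentelemetry/api package with an already-configured MeterProvider, and uses amqplib's checkQueue to supply the depth; the meter and metric names are illustrative.

// metrics.ts -- queue-depth gauge sketch (assumes @opentelemetry/api and amqplib)
import { metrics } from "@opentelemetry/api";
import amqp from "amqplib";

export function registerQueueDepthGauge(channel: amqp.Channel): void {
  const meter = metrics.getMeter("inventory-sync");
  const depthGauge = meter.createObservableGauge("rabbitmq.queue.depth", {
    description: "Messages waiting in the inventory update queue",
  });

  depthGauge.addCallback(async (result) => {
    // checkQueue reports messageCount without consuming anything
    const { messageCount } = await channel.checkQueue("inventory.updates");
    result.observe(messageCount, { queue: "inventory.updates" });
  });
}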