This is a fictional but realistic Solution Architecture Document for NorthWind Retail Ltd’s customer-facing e-commerce platform. It demonstrates the Architecture Description Standard at Recommended documentation depth — the expected level for a Tier 2 High Impact system handling PCI-DSS regulated payment data and peak sales volumes of £30M/day.
Fictional company: NorthWind Retail Ltd — a UK-based B2C retailer with 450 stores and £2.8bn annual turnover.
Fictional solution: NorthWind Online — a customer-facing e-commerce platform (web and mobile app) migrating from a legacy .NET monolith to a microservices architecture on AWS.
This SAD describes the architecture of NorthWind Online, the customer-facing e-commerce platform for NorthWind Retail Ltd. It replaces the legacy NW-Commerce .NET monolith with a cloud-native microservices platform hosted on AWS, supporting peak sales of £30M/day during seasonal events.
Scope boundary: Customer-facing web storefront (Next.js), mobile application back-end services, microservices domain (catalogue, basket, checkout, order, customer, search), data stores, payment integration, and supporting AWS infrastructure.
Out of scope: Warehouse management system (documented in APP-0214), in-store EPOS (APP-0088), marketing cloud platform (SaaS — vendor-managed), and the corporate SAP ERP (APP-0001).
Related documents: NorthWind Cloud Landing Zone SAD (APP-0750), PCI-DSS Scope Document (SEC-PCI-2025-03), Data Protection Impact Assessment (DPIA-2025-091), Digital Channels Strategy (STRAT-DGT-2025).
NorthWind Online is the primary digital sales channel for NorthWind Retail Ltd, serving approximately 12 million active customers across the UK via responsive web (www.northwind.co.uk) and native mobile applications (iOS and Android). The new platform replaces the legacy NW-Commerce .NET monolith — which has reached the limits of its scaling capacity and cannot reliably handle Black Friday and Boxing Day peaks — with a cloud-native microservices architecture on AWS.
The platform is built on Amazon EKS running Node.js microservices, fronted by a Next.js storefront (server-side rendered via Vercel-equivalent pattern on AWS), and backed by Amazon RDS Aurora PostgreSQL. Payments are processed via Stripe (tokenised at the browser via Stripe Elements), email via SendGrid, and customer behaviour events are captured via Segment CDP for downstream marketing analytics.
| Driver | Description | Priority |
| --- | --- | --- |
| Peak trading resilience | Legacy monolith failed twice in the Black Friday 2024 peak, losing an estimated £8.2M in sales over 3 hours; board directive to remediate before Black Friday 2026 | Critical |
| PCI-DSS compliance | Current platform is PCI-DSS v3.2.1 scoped at Level 1; v4.0 transition required by 31 March 2026 with tokenised payment flow to reduce scope | Critical |
| Digital growth strategy | Board target of 40% of group revenue online by 2028 (currently 22%); requires a platform able to deliver new customer experiences quickly | High |
| Legacy end-of-life | .NET Framework 4.7.2 and Windows Server 2016 reach end of extended support in 2026; Oracle Commerce has been unsupported since 2024 | High |
| Mobile channel growth | Mobile traffic has grown from 48% to 71% of sessions in 2 years; the current platform has no mobile-specific API surface, relying on scraped web views | High |
| Personalisation & CDP | Marketing team requires a real-time customer event stream for personalisation; the legacy platform cannot emit structured events | |
The legacy NW-Commerce platform was built in 2016 on Oracle Commerce 11 and .NET Framework 4.7.2, hosted on Windows Server 2016 virtual machines in NorthWind’s private data centre in Basingstoke. It serves the current £620M/year online turnover.
Key limitations:
Peak capacity: Vertical scaling limits reached at approximately 1,800 orders/minute; Black Friday 2024 demand peaked at 2,400 orders/minute and the platform failed for 3 hours 12 minutes, losing an estimated £8.2M in sales.
Release velocity: Full-regression release cycle of 6 weeks; any code change requires full platform deployment.
Mobile experience: No mobile-specific APIs; the iOS and Android apps scrape the responsive website HTML, which is brittle and slow.
Vendor support: Oracle Commerce 11 has been unsupported since 2024; there is no patch stream for security or functional issues.
Operational cost: Annual hosting, licensing and operational support totals £4.1M including 11 FTEs.
PCI-DSS scope: The entire application stack is in PCI-DSS scope because cardholder data enters the application server prior to tokenisation.
What is being retained: SAP ERP (integration via APIs), warehouse management system (APP-0214), loyalty platform (APP-0417).
What is being replaced: Oracle Commerce 11, .NET monolith, on-premises Windows Server hosting.
What is being decommissioned: NW-Commerce application servers (post 3-month parallel-run period).
Justification: NorthWind Online is the primary digital sales channel, contributing £620M/year currently (projected £1.1bn by 2028). Failure during peak trading periods would cause:
Direct revenue loss of up to £30M per day during peak trading (Black Friday, Boxing Day, Cyber Week)
Breach of PCI-DSS obligations if security controls fail, with potential fines and card scheme sanctions
UK GDPR breach notification obligations if customer PII is exposed
Reputational damage in a competitive retail market
Failure is not immediately life-safety critical (Tier 1 reserved for in-store point-of-sale and safety systems)
Scope reduced to SAQ A-EP via Stripe Elements tokenisation; PAN never traverses NorthWind systems. Network segmentation, encryption, audit logging and quarterly ASV scans still required
UK GDPR / Data Protection Act 2018
Mandatory — platform processes customer PII at scale
Application architecture: Customers access Next.js storefront and mobile apps via CloudFront. The API Gateway routes to Node.js microservices on EKS (catalogue, basket, checkout, order, customer, search). Microservices use Aurora PostgreSQL, OpenSearch and Redis. Payments go to Stripe, emails via SendGrid, events to Segment CDP.
```mermaid
graph TD
    R53[Route 53] --> CF[CloudFront + WAF + Shield]
    CF --> ALB[Application Load Balancer]
    subgraph Primary[eu-west-2 London - 2 AZs]
        subgraph Public[Public Subnets]
            ALB
            NAT[NAT Gateways]
        end
        subgraph Private[Private Subnets]
            EKS[EKS Node Groups]
            Aurora[Aurora PostgreSQL Multi-AZ]
            Redis[ElastiCache Redis]
            OS[OpenSearch]
        end
    end
    ALB --> EKS
    EKS --> Aurora
    EKS --> Redis
    EKS --> OS
    EKS --> NAT
    NAT --> Stripe[Stripe]
    NAT --> SG[SendGrid]
    subgraph DR[eu-west-1 Ireland - Pilot Light]
        AuroraDR[Aurora Global Replica]
        OSDR[OpenSearch Replica]
    end
    Aurora -- Global DB --> AuroraDR
```
Deployment architecture: CloudFront fronts the platform with WAF and Shield. An Application Load Balancer distributes to EKS node groups across two Availability Zones in eu-west-2 (London). Aurora PostgreSQL, ElastiCache Redis, and OpenSearch are Multi-AZ. DR is a pilot-light deployment in eu-west-1 Ireland.
eu-west-2 (London) — chosen primarily for UK customer proximity and data residency. AWS published carbon intensity for eu-west-2 is moderate; AWS met its goal of matching 100% of its electricity use with renewable energy in 2023. DR region eu-west-1 (Ireland) operates at lower carbon intensity than the AWS European average.
Non-production environments auto-shutdown out of hours
Yes — dev and staging EKS clusters scale to 1-2 system nodes overnight (19:00-07:00 weekdays) and weekends; non-prod Aurora paused via Lambda cron. ~£14k/year saving on non-prod compute.
Compute family chosen for performance-per-watt
Yes — Graviton3 (c7g/m7g) throughout; AWS published data shows ~60% better performance-per-watt vs equivalent x86. CloudFront and S3 reduce origin compute.
Auto-scaling configured to release capacity when idle
Yes — Karpenter consolidates underutilised nodes; HPA on CPU + custom queue-depth metrics; nodes scaled down within 5 minutes of becoming idle. Black Friday peak fleet (~24 nodes) scales back to 8 within 2 hours of peak passing.
DR strategy proportionate
Pilot-light (Aurora Global Database secondary + minimal EKS) chosen over warm standby. RTO 4 hours, RPO 1 minute. Hot active-active was rejected: unnecessary for the business RTO and would have ~50% additional always-on compute and replication carbon cost.
Encryption at rest (storage + field-level for selected columns), TLS in transit, access audited, 7-year retention
No cardholder primary account number (PAN) is stored. PAN is tokenised by Stripe Elements at the browser; NorthWind stores only the opaque Stripe payment method token. This keeps the platform out of full PCI-DSS scope (SAQ A-EP applies).
The DPIA identified a medium-risk processing activity (behavioural event capture for personalisation) which is mitigated by consent-gated event emission and a public-facing privacy portal where customers can view and manage their data.
Production customer data is tokenised into a masked dataset via a scheduled AWS Glue job for staging use. Names, addresses, emails and phone numbers are replaced with synthetic but realistic values derived from the Faker library. Test and dev environments use entirely synthetic data.
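The masking transform described above can be sketched as follows. This is a simplified Node.js illustration of the approach, not the actual Glue job (which uses the Faker library); a deterministic hash-based generator stands in so that the same input always produces the same mask and joins between masked tables still line up.

```typescript
// Simplified sketch of the staging-data masking transform. The production job
// runs in AWS Glue with Faker; a deterministic generator stands in here so
// the mapping is stable across runs (same input -> same mask).
import { createHash } from "crypto";

interface CustomerRow {
  id: string;
  name: string;
  email: string;
  phone: string;
}

// Derive a stable token from the real value so masked tables remain joinable
// without exposing the original PII.
function token(value: string, len = 8): string {
  return createHash("sha256").update(value).digest("hex").slice(0, len);
}

function maskCustomer(row: CustomerRow): CustomerRow {
  return {
    id: row.id, // surrogate key retained for referential integrity
    name: `Customer ${token(row.name)}`,
    email: `user-${token(row.email)}@example.invalid`,
    phone: "+44 7700 900000", // Ofcom-reserved range for fictitious numbers
  };
}
```

Because the token is derived from the source value, a customer appearing in several masked tables is masked identically in each, which keeps referential joins intact in staging.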
Yes — Aurora provides ACID transactions and foreign-key constraints; orders are reconciled nightly against SAP via a scheduled integrity job; discrepancies alert to Finance operations.
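The nightly integrity job's core comparison can be sketched as below. This is illustrative only; field names and the discrepancy categories are assumptions, and the real job reads from Aurora and the SAP order queue rather than in-memory arrays.

```typescript
// Illustrative sketch of the nightly order-reconciliation check: compare
// platform orders against the orders SAP acknowledged, reporting anything
// missing or mismatched on value. Names are assumptions.
interface OrderTotal {
  orderId: string;
  totalPence: number;
}

interface Discrepancy {
  orderId: string;
  reason: "missing-in-sap" | "amount-mismatch";
}

function reconcile(platform: OrderTotal[], sap: OrderTotal[]): Discrepancy[] {
  const sapById = new Map(sap.map((o) => [o.orderId, o.totalPence]));
  const issues: Discrepancy[] = [];
  for (const order of platform) {
    const sapTotal = sapById.get(order.orderId);
    if (sapTotal === undefined) {
      issues.push({ orderId: order.orderId, reason: "missing-in-sap" });
    } else if (sapTotal !== order.totalPence) {
      issues.push({ orderId: order.orderId, reason: "amount-mismatch" });
    }
  }
  return issues; // a non-empty result alerts Finance operations
}
```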
Yes (limited) — mobile apps cache the product catalogue and basket for offline browsing. No payment data or full PII (beyond display name) is cached. Mobile caches are encrypted at rest via platform keychain / keystore.
Yes — customer PII and order data must remain in the UK (eu-west-2 London). The DR region (eu-west-1 Ireland) contains only operational telemetry. Aurora Global Database is configured to replicate non-PII schemas only; customer PII tables are replicated via a filtered logical replication stream terminated at a UK-only subsystem. Segment is configured to use its EU data plane; Stripe processes data under UK and EU transfer safeguards (standard contractual clauses).
Customer order data 7 years (HMRC); browsing/clickstream 25 months (legitimate interest basis); inactive customer accounts archived after 3 years inactivity (PII deleted at 5 years); session data ≤ 24 hours. Lifecycle policies enforce automated expiry.
Older data tiered to cold/archive storage
Yes — order archives transition S3 Standard → Intelligent-Tiering → Glacier IR (90 days) → Glacier Deep Archive (1 year). Aurora cold tables exported to S3 quarterly. ~75% of historical data sits in archive tiers.
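The tiering thresholds stated above can be expressed as a simple age-to-tier rule. In production this is an S3 lifecycle configuration, not application code; the sketch below only encodes the two thresholds the document states (Glacier Instant Retrieval at 90 days, Deep Archive at 1 year).

```typescript
// Sketch of the stated archive-tiering policy: Glacier Instant Retrieval at
// 90 days, Glacier Deep Archive at 1 year. (The real mechanism is an S3
// lifecycle configuration.)
type Tier = "standard-or-intelligent-tiering" | "glacier-ir" | "glacier-deep-archive";

function tierForAge(ageDays: number): Tier {
  if (ageDays >= 365) return "glacier-deep-archive";
  if (ageDays >= 90) return "glacier-ir";
  return "standard-or-intelligent-tiering";
}
```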
Unused or duplicate replicas
Single Aurora primary + 1 DR replica; no standing read replicas (OpenSearch + ElastiCache absorb steady-state read load; auto-scaling adds read replicas only during peak events). Quarterly review of S3 buckets via AWS Trusted Advisor.
Compression applied
Brotli on HTTPS (~70% reduction on JSON catalogue payloads); WebP/AVIF for product images (CloudFront origin transformation); Parquet+Snappy for the nightly analytics exports.
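The Brotli reduction quoted above is easy to demonstrate with Node's built-in zlib bindings. The exact ratio depends on the payload; repetitive catalogue-style JSON compresses particularly well. The payload shape below is illustrative.

```typescript
// Demonstrates Brotli compression of a catalogue-style JSON payload, the
// technique behind the ~70% response-size reduction quoted above.
import { brotliCompressSync, brotliDecompressSync } from "zlib";

function compressPayload(obj: unknown): Buffer {
  return brotliCompressSync(Buffer.from(JSON.stringify(obj)));
}

function decompressPayload<T>(buf: Buffer): T {
  return JSON.parse(brotliDecompressSync(buf).toString("utf8")) as T;
}

// A repetitive catalogue-like payload to show the effect.
const catalogue = Array.from({ length: 200 }, (_, i) => ({
  sku: `NW-${1000 + i}`,
  name: `Product ${i}`,
  attributes: { colour: "navy", size: "M", material: "cotton" },
}));

const raw = Buffer.from(JSON.stringify(catalogue));
const compressed = compressPayload(catalogue);
// compressed.length is a small fraction of raw.length for this payload
```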
Cross-region replication justified
Aurora Global Database secondary required by DR RPO (1 min). Customer PII tables explicitly excluded from cross-region replication (sovereignty + reduced carbon cost).
Large data transfers off-peak
Nightly Snowflake export 02:00-04:00 UTC; weekly partner reconciliations Sunday 03:00 UTC; both align with low-carbon-intensity periods on the UK grid.
VPC with public, private and data subnets across 2 AZs; security groups per service; NACLs as secondary layer; Kubernetes network policies for pod-to-pod
| Attribute | Detail |
| --- | --- |
| Pre-conditions | Basket is valid; customer has provided delivery and billing details; Stripe Elements has loaded |
| Main Flow | 1. Customer enters card details into the Stripe Elements iframe; Stripe returns a payment method token to the browser. 2. Browser posts the token to the Checkout Service. 3. Checkout Service revalidates basket and price server-side. 4. Checkout Service calls Stripe PaymentIntent.confirm with the token. 5. Stripe performs a 3-D Secure challenge if required; the customer completes it in-browser. 6. On success, Checkout Service calls Order Service to create the order. 7. Order Service writes to Aurora and publishes OrderCreated to SQS. 8. Order Service triggers a transactional email via SendGrid. 9. Customer is redirected to the order-confirmation page. 10. SAP integration Lambda consumes SQS and creates the sales order in SAP. |
| Post-conditions | Customer sees confirmation; order visible in "My Orders"; SAP has the sales order; email sent; event emitted to Segment |
| Views Involved | Logical, Integration & Data Flow, Physical, Data, Security |
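Step 3 of the flow (server-side revalidation) is the security-significant step: the Checkout Service never trusts a client-submitted total, and the amount passed to the Stripe PaymentIntent is always recomputed from the server-side catalogue. A minimal sketch, with illustrative types and names (the Stripe call itself is omitted):

```typescript
// Sketch of checkout step 3: reprice the basket server-side before payment
// confirmation. Types and function names are illustrative.
interface BasketLine {
  sku: string;
  quantity: number;
}

type Catalogue = Map<string, number>; // sku -> unit price in pence

function repriceBasket(lines: BasketLine[], catalogue: Catalogue): number {
  let total = 0;
  for (const line of lines) {
    const unitPence = catalogue.get(line.sku);
    if (unitPence === undefined) throw new Error(`Unknown SKU ${line.sku}`);
    if (line.quantity < 1) throw new Error(`Invalid quantity for ${line.sku}`);
    total += unitPence * line.quantity;
  }
  return total;
}

// The amount confirmed with Stripe is the server-side total; the
// client-submitted total serves only as a sanity check.
function amountToCharge(lines: BasketLine[], catalogue: Catalogue, clientTotal: number): number {
  const serverTotal = repriceBasket(lines, catalogue);
  if (serverTotal !== clientTotal) {
    // Price drift (or tampering): surface to the customer rather than charging.
    throw new Error("Basket total changed; please review your order");
  }
  return serverTotal;
}
```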
UC-02: Black Friday Traffic Surge

| Attribute | Detail |
| --- | --- |
| Actor(s) | Retail customers (aggregate); SRE on-call |
| Trigger | 18:00 Black Friday launch; traffic surges from 200 to 2,400+ orders/min |
| Pre-conditions | Platform warmed; capacity plan executed; "freeze" window in force (no deployments) |
| Main Flow | 1. CloudFront absorbs cacheable product-detail traffic at the edge. 2. HPAs scale Catalogue and Search pods (6 to 24 pods within 90 seconds). 3. Karpenter provisions additional EKS nodes. 4. Aurora read-replica auto-scaling adds 2 replicas within 3 minutes. 5. WAF rate-based rules throttle abusive IPs. 6. P95 API latency rises to 240ms but remains within SLA; error rate held below 0.05%. 7. SRE on-call monitors the Datadog dashboard; no manual intervention required. |
| Post-conditions | Peak traffic absorbed without service degradation; post-event review captures metrics for next year |
| Views Involved | Logical, Physical, Performance |
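The pod scaling in step 2 follows the standard Kubernetes HPA algorithm: desired replicas = ceil(current replicas × current metric / target metric), clamped to the deployment's configured bounds. A sketch (the min/max of 6 and 24 match the figures in the flow; the target metric value is an assumption):

```typescript
// Standard Kubernetes HPA scaling formula, clamped to configured bounds.
// Bounds of 6-24 match the Catalogue deployment figures above.
function hpaDesiredReplicas(
  currentReplicas: number,
  currentMetric: number, // e.g. average CPU utilisation or queue depth
  targetMetric: number,
  minReplicas = 6,
  maxReplicas = 24,
): number {
  const desired = Math.ceil(currentReplicas * (currentMetric / targetMetric));
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}
```

For example, at 4x the target metric a 6-pod deployment scales straight to 24, the configured ceiling.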
UC-03: Customer Requests Right-to-Erasure

| Attribute | Detail |
| --- | --- |
| Actor(s) | Customer; DPO team |
| Trigger | Customer submits erasure request via privacy portal |
| Pre-conditions | Customer is authenticated; consent model supports erasure request |
| Main Flow | 1. Customer submits request via Customer Service portal. 2. Customer Service queues an erasure job (SQS erasure queue). 3. Erasure Lambda anonymises PII in Aurora (customer record retained as a pseudonymised placeholder for financial/order integrity); order records retain the statutory minimum for 7 years. 4. Cognito account deleted. 5. Segment is sent a user.delete call to purge behavioural events. 6. SendGrid suppression list updated. 7. Customer receives confirmation email. 8. DPO notified via dashboard; audit record retained. |
| Post-conditions | Customer PII removed within 30 days (UK GDPR statutory timeframe); audit trail preserved |
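The anonymisation in step 3 can be sketched as below: PII fields are overwritten while the customer row survives as a pseudonymised placeholder, so orders that reference the customer key remain valid. Field names are illustrative, not the actual schema.

```typescript
// Sketch of the erasure Lambda's anonymisation step. Field names are
// illustrative; the real schema lives in Aurora.
interface CustomerRecord {
  customerId: string;
  name: string | null;
  email: string | null;
  phone: string | null;
  erasedAt: string | null; // ISO timestamp once erasure has been applied
}

function anonymise(record: CustomerRecord, now: Date): CustomerRecord {
  return {
    customerId: record.customerId, // retained: order rows reference this key
    name: null,
    email: null,
    phone: null,
    erasedAt: now.toISOString(), // audit trail: when erasure was applied
  };
}
```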
ADR-001: PostgreSQL (Aurora) over MySQL for Transactional Store

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2025-08-05 |
| Context | The platform requires a relational database for catalogue, customer, order and promotion data. Both Aurora PostgreSQL and Aurora MySQL are approved under the Cloud Landing Zone Standard. |
| Decision | Use Amazon RDS Aurora PostgreSQL 15. |
| Alternatives Considered | Aurora MySQL: widely used at NorthWind but weaker JSON querying than PostgreSQL's JSONB for semi-structured product attributes; the team found the MySQL JSON functions awkward for catalogue filtering. DynamoDB: rejected because the data is strongly relational (customer -> orders -> order-lines) and multi-row ACID is a hard requirement for checkout. |
| Consequences | Positive: rich JSONB for flexible catalogue attributes; stronger CTE and window-function support for reporting; PostGIS available for a store-locator if later needed; excellent observability via pg_stat_statements. Negative: less internal familiarity than MySQL; training investment needed for the ops team (closed via a 3-day workshop). |
| Quality Attribute Tradeoffs | Performance: comparable (neutral). Maintainability: PostgreSQL has the richer ecosystem for our data model (positive). Operational Excellence: increased training cost (negative, one-off). |
ADR-002: Next.js SSR over Client-Only SPA for the Storefront

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2025-08-12 |
| Context | The storefront must be highly discoverable via search engines (organic search is 42% of customer acquisition) and must deliver first-paint quickly on cellular networks. |
| Decision | Use Next.js 14 with server-side rendering (SSR) for category, product and landing pages; use incremental static regeneration (ISR) for campaign pages; use client-side rendering only for the account area. |
| Alternatives Considered | Client-only SPA (React + Vite): simpler operationally but poor SEO, slower first contentful paint, and a heavy JavaScript bundle on mobile. Static site (Gatsby / Astro): good for marketing pages but cannot handle the dynamic, personalised storefront. |
| Consequences | Positive: strong SEO; improved Core Web Vitals (LCP improved from 3.1s to 1.4s in prototype); identical rendering for crawlers and users. Negative: additional server capacity for SSR (budget allocated); cache invalidation more complex than pure static. |
| Quality Attribute Tradeoffs | Performance: major improvement (positive). Cost: increased compute for SSR (negative, quantified and accepted). Reliability: SSR failure could impact page rendering — mitigated by graceful fallback to client-side hydration. |
ADR-003: Stripe Elements Tokenisation to Reduce PCI-DSS Scope

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2025-09-02 |
| Context | The legacy platform is in full PCI-DSS scope (SAQ D) because cardholder data enters application servers. This imposes substantial audit and remediation cost. The target is SAQ A-EP via client-side tokenisation. |
| Decision | Integrate Stripe Elements so that card data is entered into a Stripe-hosted iframe and never traverses NorthWind servers. Only opaque Stripe payment method tokens are stored. |
| Alternatives Considered | Direct card acceptance into Checkout Service: rejected — expands PCI scope to the entire platform. Stripe Checkout redirect: rejected — breaks the custom checkout UX the business requires. Alternative PSP (Adyen, Worldpay): evaluated; Stripe selected due to the existing group-wide contract and superior developer experience. |
| Consequences | Positive: SAQ A-EP scope achieved (annual audit cost reduced by an estimated £240k/year); reduced blast radius in the event of a storefront compromise. Negative: Stripe vendor lock-in is elevated (see R-002); a Stripe outage would halt all card payments. |
| Quality Attribute Tradeoffs | Security: major reduction in scope and risk (positive). Cost: lower audit cost (positive); Stripe transaction fees higher than some alternatives (negative, small). Reliability: additional SaaS dependency (negative, mitigated by fallback messaging during a Stripe outage). |
Is the application deployed across multiple hosting venues for continuity?
Yes — eu-west-2 (London) primary; eu-west-1 (Ireland) pilot-light DR
What is the DR strategy?
Pilot Light. DR region has Aurora Global Database secondary (continuous replication, 1-minute RPO), minimum EKS node group (2 nodes), pre-provisioned OpenSearch snapshot restore. Scaled up on failover.
Are there data sovereignty requirements affecting geographic choices?
Yes — PII must remain in UK (eu-west-2); DR carries only non-PII operational data; failover including PII requires DPO approval
Yes (confirmed) — Stripe SLA supports 10k TPS; SendGrid transactional sending limits raised to 500k/day by arrangement; SAP order queue sized for 5,000 orders/min peak (buffered via SQS)
Dependency details
SQS buffering protects against SAP slow-down; circuit breakers prevent cascade failure. OpenSearch indexing throttles to 2,000 docs/sec during peak reindex.
Component failures: Each microservice runs 3+ replicas across 2 AZs; Kubernetes reschedules failed pods; pod disruption budgets enforced.
Graceful degradation: If Stripe is unavailable, the storefront disables the “Pay now” button and surfaces a clear message with a “Notify me” option; no partial orders are created.
Circuit breakers: Stripe (open after 5 failures, half-open after 30s) and SAP (open after 3 failures, half-open after 60s); opossum library.
Health checks: Kubernetes liveness (/health/live, 10s), readiness (/health/ready, 5s, checks DB + Redis reachability).
Testing: Monthly chaos tests (AWS Fault Injection Service: AZ blackout, pod kill, latency injection); quarterly DR failover drill.
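The open/half-open behaviour of the circuit breakers described above can be sketched as follows. This is an illustration of the pattern, not the opossum library's implementation: the breaker opens after N consecutive failures, fails fast while open, and allows a single probe after the cooldown.

```typescript
// Minimal circuit-breaker sketch (illustrative; the platform uses opossum):
// open after N consecutive failures, fail fast while open, probe after
// cooldown. Time is passed in explicitly to keep the sketch testable.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number, // e.g. 5 for Stripe
    private readonly cooldownMs: number, // e.g. 30_000 for Stripe
  ) {}

  currentState(now: number): BreakerState {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // next call is a probe
    }
    return this.state;
  }

  call<T>(fn: () => T, now: number): T {
    if (this.currentState(now) === "open") {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = fn();
      this.state = "closed"; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = now;
        this.failures = 0;
      }
      throw err;
    }
  }
}
```

In production the wrapped function is async (a Stripe or SAP call) and timeouts count as failures; the state machine is the same.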
Monthly load tests at 2x current peak; quarterly peak-readiness tests at 3x current peak; soak test (72 hours at steady state) before each major release
Testing tools
k6 (Grafana Cloud) for load generation; Datadog for observation
Testing environment
Staging (production-mirror); read-only production smoke tests off-peak
Testing frequency
Monthly (standard); weekly in September/October prior to Black Friday
Will the current design scale to accommodate projected growth?
Yes for 3 years. At 5-year horizon, Aurora vertical scaling is the primary concern; assessment of sharding via Aurora Limitless Database scheduled for 2028 review.
Are there known seasonal or cyclical demand patterns?
Strongly seasonal. Black Friday week: 8x baseline; Christmas: 4x; January sale: 3x; Easter: 1.5x; payday (last working day): 1.3x. Capacity plan aligns with retail calendar.
Yes — detailed cost model produced using AWS Pricing Calculator and validated against 4 months of running-cost data in staging. Estimated annual opex is £800,000 (production + non-prod + SaaS). Reserved instance / Savings Plan commitment produces an approximately 22% saving versus pure on-demand.
Has the hosting location been chosen to reduce environmental impact?
Partially — eu-west-2 (London) was chosen primarily for data sovereignty; AWS London operates under AWS’s 100% renewable energy commitment achieved in 2023
What is the expected workload demand pattern?
Variable — strong UK business-hours pattern with extreme seasonal peaks (Black Friday, Christmas)
Multiple times daily for services (trunk-based with feature flags); fortnightly release-train for coordinated changes; freeze from 1 November to 31 December (peak trading)
Release process
Feature branch -> PR (automated tests + 1 approval) -> merge to main -> auto-deploy staging -> canary (5% for 15 min) -> full production via Argo Rollouts
Feature flags
LaunchDarkly used extensively for progressive roll-out, A/B testing, and kill switches
Yes — decommissioning runbook requires Terraform destroy + S3 emptying + KMS schedule deletion; CMDB Retired status only after Cost Explorer confirms zero spend for 30 days.
EKS: upgraded within 60 days of minor release; Aurora PostgreSQL: minor versions in monthly maintenance window; Node.js: LTS tracked, upgraded within 90 days
Certificate management
ACM for public TLS (auto-renewal); AWS Private CA for internal mTLS
Microservices are containerised (Helm charts); PostgreSQL is standard; data exportable; storefront (Next.js) portable to any Node.js host
Data portability
Aurora: pg_dump / logical replication; S3: standard APIs; Cognito: CSV export with password reset required
Vendor lock-in assessment
Low-Moderate overall. Primary concerns are Stripe (High — see R-002) and Cognito (Moderate — migration requires password reset cycle). All other components are standard and portable.
| ID | Risk | Category | Impact | Likelihood | Owner |
| --- | --- | --- | --- | --- | --- |
| R-005 | Mobile app store review delays or rejection blocks timely release | Delivery | Medium | Medium | Fred Bloggs |
| R-006 | AWS eu-west-2 regional outage during peak trading | Operational | Critical | Low | Sally Doe |
Risk response:

| ID | Mitigation Strategy | Mitigation Plan | Residual Risk | Last Assessed |
| --- | --- | --- | --- | --- |
| R-001 | Mitigate | Monthly load tests at 2x peak, quarterly at 3x; full game-day 4 weeks before Black Friday; peak-readiness sign-off gate; additional SRE on rota Nov-Dec | Low | 2026-03-01 |
| R-002 | Mitigate | Payment abstraction layer in Checkout Service isolates the Stripe SDK; documented 6-9 month migration plan to a secondary PSP; stored payment token strategy reviewed annually; Adyen considered for a dual-acquirer model from 2027 | | |
| | | Strict Content Security Policy (script-src allowlist); Subresource Integrity on all third-party scripts; Stripe Elements isolates card entry in the Stripe iframe; quarterly client-side security audit; tag-manager discipline enforced by Marketing | Medium | 2026-03-01 |
| R-005 | Mitigate | Early submission 4 weeks before the hard deadline; in-flight review with Apple / Google developer support; progressive web app (PWA) fallback if native store delays occur | Low | 2026-03-01 |
| R-006 | Accept (with mitigation) | Pilot-light DR in eu-west-1; RTO of 4 hours validated quarterly; customer-facing status page; accept 1-minute RPO | | |
Does the design materially change the organisation’s technology risk profile?
Yes — reduces PCI-DSS scope and operational risk by replacing unsupported legacy; introduces elevated SaaS dependency on Stripe. Net impact assessed as favourable by Risk & Controls (RC-2025-118).
This SAD was assessed at Recommended depth — the expected level for a Tier 2 High Impact regulated system. The scores below reflect a well-documented architecture proportionate to a B2C e-commerce platform with PCI-DSS and UK GDPR obligations.
| Section | Score (0-5) | Assessor | Date | Notes |
| --- | --- | --- | --- | --- |
| 1. Executive Summary | 5 | Design Authority | 2026-03-18 | Clear business drivers with priority, strategic alignment with reuse documented, current-state architecture complete, revenue impact quantified |
| 3.1 Logical View | 4 | Design Authority | 2026-03-18 | Full component decomposition, design patterns with rationale, vendor lock-in assessed. Service mesh detail could be deeper |
| 3.2 Integration & Data Flow | 4 | Design Authority | 2026-03-18 | All internal and external integrations documented with protocols and auth; customer-event tracking plan referenced externally |
| 3.3 Physical View | 4 | Design Authority | 2026-03-18 | Deployment, hosting, networking, environments fully documented; peak bandwidth characterised from real Black Friday telemetry |
| 3.4 Data View | 4 | Design Authority | 2026-03-18 | All data stores classified with retention and encryption; DPIA approved; sovereignty addressed with filtered replication. Field-level encryption detail at Recommended depth, not exemplary |
| 3.5 Security View | 4 | Design Authority | 2026-03-18 | STRIDE threat model with 6 named threats and mitigations; PCI-DSS scope-reduction strategy documented; identity models comprehensive |
| 3.6 Scenarios | 4 | Design Authority | 2026-03-18 | Three architecturally significant use cases; three ADRs with alternatives and tradeoffs |
| 4.1 Operational Excellence | 4 | Design Authority | 2026-03-18 | Datadog APM/Logs/RUM, PagerDuty on-call, peak-readiness drills. Runbook library noted but detail out of this document |
| 4.2 Reliability | 4 | Design Authority | 2026-03-18 | Multi-AZ with pilot-light DR, RTO/RPO validated via quarterly drills, fault tolerance with circuit breakers, immutable backups |