This is a fictional but realistic Solution Architecture Document for Stellar Platform, an Internal Developer Platform (IDP) at Stellar Engineering Ltd — a 400-engineer B2B SaaS company. It demonstrates the ADS standard at Recommended documentation depth, appropriate for a Tier 3 internal productivity platform with no direct customer impact.
The example is written in the language of modern platform engineering: Team Topologies, cognitive load, golden paths, paved roads, platform-as-a-product, and DevEx. Use it as a reference when writing your own SAD for an internal platform or developer experience initiative.
This SAD describes the architecture of Stellar Platform — a self-service Internal Developer Platform (IDP) that provides Stellar Engineering Ltd’s 60 stream-aligned product teams with golden paths for service creation, deployment, observability, and day-2 operations.
Scope boundary: The Backstage developer portal, the platform control plane (Crossplane, Terraform), the delivery plane (ArgoCD, Tekton), the observability stack (Prometheus, Grafana, Datadog), and the golden-path templates they expose. Also includes the GKE (primary) and EKS (secondary) Kubernetes fleets that host both the platform itself and the tenant (product-team) workloads it serves.
Out of scope: The individual product-team services that run on the platform (documented by their owning teams), the corporate identity provider (Okta, documented under APP-0008), and the customer-facing Stellar SaaS product (documented under APP-0100).
Related documents: Stellar Engineering Platform Strategy 2026-2028 (STRAT-0004), Platform-as-a-Product Operating Model (POL-0031), Stellar Cloud Landing Zone Standards (STD-0012), Information Security Policy (POL-0001).
Stellar Platform is an Internal Developer Platform (IDP) built on Backstage that offers Stellar’s 400 engineers a curated, self-service experience for the entire software delivery lifecycle. It exposes a small number of well-paved golden paths — opinionated templates and automation — that reduce the cognitive load on stream-aligned product teams and let them ship independently without having to reason about Kubernetes manifests, Terraform modules, IAM boundaries, or observability wiring.
The platform is architected as three loosely-coupled planes:
Portal plane: A Backstage instance acting as the single pane of glass for discovery, self-service actions, software catalogue, TechDocs, and scorecards.
Control plane: Crossplane-managed infrastructure abstractions, Terraform for everything Crossplane cannot model yet, GitHub as the source of truth, and Dagger for reusable CI pipelines.
Runtime plane: A federated fleet of Kubernetes clusters (GKE as primary, EKS as secondary), delivered via ArgoCD (GitOps) and Tekton (for build and security pipelines), with observability provided by a Prometheus + Grafana stack and Datadog for cross-cloud APM and incident workflow.
The platform is treated as a product. It has a product manager, a roadmap, user research cadence, and opt-in adoption — teams can route around it, but we design the paved road to be the path of least resistance.
| Driver | Description | Severity |
| --- | --- | --- |
| Lead time | Lead time for changes has stretched from 2 days to 9 days as the estate has grown; new service bootstrapping takes 3-6 weeks of coordination across SRE, Security, and Platform | Critical |
| Cognitive load | Product teams are carrying too many accidental responsibilities (clusters, pipelines, IAM, alerting) instead of focusing on customer value | High |
| Fragmentation | 14 different CI patterns, 6 Terraform module styles, 4 Kubernetes deployment approaches, and 3 competing observability stacks across teams | High |
| Reliability | Production incidents increasingly rooted in configuration drift, unclear ownership, and inconsistent runbooks; change failure rate at 18% (DORA high-performer threshold is 15%) | High |
| Security | Inconsistent supply-chain controls and secret handling across teams; audit findings in the SOC 2 Type II report | High |
| Cost | Cloud spend grew 42% YoY against 18% revenue growth; no unified FinOps view across teams | |
Stellar Engineering reached its current scale (400 engineers, 60 teams, ~850 services) without a deliberate platform strategy. The result is a high-cognitive-load environment for stream-aligned teams:
Manual service bootstrapping: New services take 3-6 weeks. The process spans 9 Jira tickets across SRE, Security, Networking, Platform, and Finance. Engineers cite this as their top frustration in the 2025 DevEx survey (Net DevEx Score: -18).
Jenkins monolith: A single 12-year-old Jenkins instance runs 2,400 jobs; >60% of incidents in the CI/CD domain originate here. The maintainer left in 2024 and no one fully understands the Groovy shared library.
Terraform sprawl: Each team maintains its own Terraform modules. Six competing approaches to VPC, IAM, and Kubernetes namespace provisioning exist.
Kubernetes fragmentation: Some teams deploy via Helm charts manually, some via ad hoc kubectl apply, a few via Flux. No consistent RBAC, no consistent resource-quota policy.
Observability silos: Three teams run their own Prometheus; others export straight to Datadog; some still use CloudWatch. Cross-service traces are unusable.
Documentation decay: Team wikis in Confluence are frequently out of date; new joiners spend their first 3-4 weeks “finding the right page”.
The DORA baseline (measured via manual sampling Q4 2025) sits in the medium performer band: deployment frequency weekly, lead time 9 days, change failure rate 18%, MTTR 8 hours.
No — the platform itself does not process customer financial, health, or payment data. Product services running on the platform may, but they remain individually accountable for their regulatory posture.
Stellar Platform logical architecture. The Portal plane (Backstage, CLI, docs) sits above the Control plane (Crossplane, Terraform, Dagger, GitHub as source of truth), which sits above the Runtime plane (ArgoCD, Tekton, GKE and EKS clusters) and the Observability plane (Prometheus, Grafana, OpenTelemetry, Datadog).
Primary developer journey — “Create a new service”:
```mermaid
sequenceDiagram
    participant Dev as Engineer
    participant BS as Backstage
    participant GH as GitHub
    participant TKN as Tekton
    participant CP as Crossplane
    participant ARGO as ArgoCD
    participant GKE as GKE Cluster
    participant DD as Datadog
    Dev->>BS: Choose golden-path template
    BS->>GH: Create repo (code + IaC)
    GH->>TKN: Trigger pipeline (push)
    TKN->>TKN: Build, SBOM, sign image
    TKN->>GH: Publish manifests to infra repo
    GH->>CP: Apply Crossplane claim
    CP->>GKE: Provision namespace + secrets
    GH->>ARGO: Sync new Application
    ARGO->>GKE: Deploy workload
    GKE->>DD: Emit metrics + traces
    BS->>Dev: "Service ready - see scorecard"
```
Developer journey for creating a new service. An engineer opens Backstage and picks a golden-path template; the Backstage Scaffolder creates a GitHub repository with CI and infrastructure as code, Crossplane provisions the cloud resources, ArgoCD deploys the resulting container image, and Datadog observability is auto-configured.
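The final "Service ready" signal depends on the new ArgoCD Application reaching a Synced and Healthy state. A minimal sketch of how a CLI or portal check might poll ArgoCD for that signal, using the standard /api/v1/applications endpoint; the host, Application name, and timeout are illustrative assumptions, not the production implementation:

```python
import time
import requests

ARGOCD_URL = "https://argocd.platform.stellar.internal"  # hypothetical host
APP_NAME = "payments-api"                                # hypothetical Application name

def wait_until_healthy(token: str, timeout_s: int = 1800) -> bool:
    """Poll the ArgoCD Application until it reports Synced + Healthy."""
    headers = {"Authorization": f"Bearer {token}"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(f"{ARGOCD_URL}/api/v1/applications/{APP_NAME}",
                            headers=headers, timeout=10)
        resp.raise_for_status()
        status = resp.json().get("status", {})
        sync = status.get("sync", {}).get("status")
        health = status.get("health", {}).get("status")
        if sync == "Synced" and health == "Healthy":
            return True
        time.sleep(15)  # ArgoCD reconciles continuously; no need to poll aggressively
    return False
```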
Secondary data flow — DORA telemetry:
Each Tekton pipeline run emits a CloudEvents-formatted event to a Pub/Sub topic.
A Dagger batch job (runs every 15 minutes) aggregates events into deployment, lead time, and CFR metrics per team.
Metrics land in Snowflake (PLATFORM.DORA schema) and are surfaced back into Backstage scorecards.
Weekly exec digest is generated from Snowflake via scheduled query.
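A minimal sketch of the 15-minute aggregation step, assuming a hypothetical event shape (team, service, timestamps, outcome) read from the Pub/Sub subscription; the field names and the in-memory rollup are illustrative, not the production schema written to PLATFORM.DORA:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class DeploymentEvent:
    team: str
    service: str
    commit_time: datetime      # first commit in the change
    deploy_time: datetime      # production rollout completed
    failed: bool               # rollout failed or was rolled back

def aggregate_dora(events: list[DeploymentEvent], window: timedelta) -> dict[str, dict]:
    """Roll per-deployment events up into team-level DORA metrics for one window."""
    per_team: dict[str, dict] = {}
    for team in {e.team for e in events}:
        team_events = [e for e in events if e.team == team]
        lead_times = [(e.deploy_time - e.commit_time).total_seconds() / 86400
                      for e in team_events]
        per_team[team] = {
            "deployments": len(team_events),
            "deploy_frequency_per_day": len(team_events) / (window.days or 1),
            "lead_time_days_p50": round(median(lead_times), 2),
            "change_failure_rate": round(
                sum(e.failed for e in team_events) / len(team_events), 3),
        }
    return per_team  # loaded into Snowflake by the batch job, surfaced via scorecards
```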
Stellar Platform deployment. GKE is primary across three regions for the portal, control plane, ArgoCD, Tekton and observability. EKS is secondary across two regions, running ArgoCD satellites and tenant workloads. GitHub, Datadog, Okta, and Snowflake are SaaS. HashiCorp Vault is self-hosted on GKE.
europe-west2 (London), us-east4, and asia-southeast1 were chosen for customer proximity. Each region operates under its cloud provider's carbon-neutral / 100% renewable matching commitments; europe-west2's published carbon intensity tracks the UK grid.
Non-production environments auto-shutdown: Yes — dev and integration-test GKE Autopilot clusters scale to zero outside business hours; non-prod databases (Cloud SQL) auto-paused; ~£18k/year saving on non-prod compute (referenced in 4.4 FinOps).
Compute family chosen for performance-per-watt: GKE Autopilot schedules onto Google's energy-efficient Tau machine families (T2D, and Arm-based T2A where workloads support it); EKS uses Graviton3 (c7g/m7g) where customer workloads tolerate ARM. Graviton3's up-to-60% better energy efficiency for the same performance is captured for backend services.
Auto-scaling configured to release capacity when idle: Yes — GKE Autopilot scales pods on resource demand; Karpenter on EKS consolidates within 5 minutes; the Backstage portal scales to two replicas overnight (down from a peak of eight).
DR strategy proportionate: Multi-region active-active for the data plane (delivery / artefact services), warm standby for the portal control plane. Hot active-active was rejected for the portal: it is not justified by the SLO (99.5%) and would add ~30% always-on compute plus PostgreSQL replication carbon cost.
The DPIA concluded that engineer telemetry (DORA, DevEx) is legitimate-interest processing of employee data. Engineers are informed via the engineering handbook; team-level aggregation is preferred over individual attribution.
No — no secrets, certificates, or customer data land on engineer workstations. The stellar CLI uses OIDC device-code flow with tokens in OS keychain (30-minute TTL).
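A sketch of what that device-code flow might look like inside the stellar CLI, assuming a standard OAuth 2.0 Device Authorization Grant (RFC 8628) against Okta; the issuer URL, client ID, scopes, and keychain entry name are placeholder assumptions, and the real CLI may differ:

```python
import time
import keyring   # stores the token in the OS keychain (Keychain / Credential Manager / Secret Service)
import requests

ISSUER = "https://stellar.okta.com/oauth2/default"   # hypothetical Okta authorisation server
CLIENT_ID = "stellar-cli"                            # hypothetical OAuth client

def login() -> None:
    # 1. Ask the authorisation server for a device code + user code
    dev = requests.post(f"{ISSUER}/v1/device/authorize",
                        data={"client_id": CLIENT_ID,
                              "scope": "openid profile offline_access"},
                        timeout=10).json()
    url = dev.get("verification_uri_complete") or dev["verification_uri"]
    print(f"Open {url} and confirm the code {dev['user_code']}")

    # 2. Poll the token endpoint until the engineer approves in the browser
    while True:
        time.sleep(dev.get("interval", 5))
        resp = requests.post(f"{ISSUER}/v1/token",
                             data={"client_id": CLIENT_ID,
                                   "device_code": dev["device_code"],
                                   "grant_type": "urn:ietf:params:oauth:grant-type:device_code"},
                             timeout=10)
        body = resp.json()
        if resp.ok:
            break
        if body.get("error") != "authorization_pending":
            raise RuntimeError(body.get("error_description", "device flow failed"))

    # 3. The short-lived token goes to the OS keychain, never to disk (30-minute TTL server-side)
    keyring.set_password("stellar-cli", "access_token", body["access_token"])
```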
Yes — UK customer-facing tenants’ metadata remains in europe-west2 / eu-west-2. Datadog data is routed to the EU site. Snowflake uses an EU deployment.
Build artefacts retained 30 days (latest 5 successful per repo retained indefinitely); container images expire on tag age (90 days for non-stable tags); audit logs 7 years (per Stellar audit policy); telemetry rolled up after 30 days. Lifecycle policies enforce automatic expiry.
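As an illustration of how the 30-day artefact expiry could be enforced on a Cloud Storage bucket with the google-cloud-storage client; the bucket name is hypothetical, and in practice such rules would be defined declaratively rather than imperatively:

```python
from google.cloud import storage

def enforce_artefact_expiry(bucket_name: str = "stellar-build-artefacts") -> None:
    """Attach a 30-day delete rule to a build-artefact bucket (illustrative sketch)."""
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    # Objects older than 30 days are removed automatically by GCS lifecycle management.
    bucket.add_lifecycle_delete_rule(age=30)
    bucket.patch()  # persist the updated lifecycle configuration
```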
Older data tiered to cold/archive storage: Yes — Cloud Storage / S3 lifecycle: artefacts transition Standard → Nearline → Coldline (90 days) → Archive (1 year). Datadog rolls metrics from raw to aggregated tiers automatically.
Unused or duplicate replicas: Single Cloud SQL primary + 1 read replica (justified by Backstage's read-heavy load); Snowflake reserves no idle warehouses (auto-suspend after 10 min). Quarterly orphan-bucket review via gcloud + AWS Trusted Advisor.
Compression applied: Brotli on Backstage HTTPS responses; gzip on artefact uploads to Cloud Storage; Parquet + Zstandard for DORA metric exports to Snowflake (see the sketch after this list).
Cross-region replication justified: Yes — multi-region active-active for the data plane is required by the platform SLO (99.9%). The portal control plane uses regional Cloud SQL replication only. No cross-cloud data replication beyond explicit pipelines.
Large data transfers off-peak: Nightly DORA metric ingest to Snowflake at 03:00 UTC; weekly Backstage analytics export Sunday 02:00 UTC. Aligned with low UK / EU grid carbon intensity.
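The Parquet + Zstandard export mentioned above might look like this pyarrow sketch; the column contents and output path are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def export_dora_metrics(rows: list[dict], path: str = "dora_metrics.parquet") -> None:
    """Write DORA metric rows as Zstandard-compressed Parquet for the Snowflake load."""
    table = pa.Table.from_pylist(rows)             # e.g. {"team": ..., "lead_time_days": ...}
    pq.write_table(table, path, compression="zstd")
```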
A lightweight STRIDE threat model has been produced (THREAT-1042-01). Top threats: (1) compromised Backstage instance as a super-power surface, (2) supply-chain injection at Tekton, (3) Crossplane as blast-radius amplifier across clouds.
UC-01: Engineer bootstraps a new service from a golden-path template
| Attribute | Detail |
| --- | --- |
| Actor(s) | Engineer on a stream-aligned product team |
| Trigger | New service needed to deliver a product increment |
| Pre-conditions | Engineer is authenticated; has membership of the owning team's Okta group |
| Main Flow | 1. Open Backstage, choose "Create new Go service" template. 2. Fill 6 fields (name, team, description, tier, region, data classification). 3. Scaffolder creates GitHub repo + infra repo with sensible defaults. 4. Tekton pipeline runs on first commit — builds, tests, generates SBOM, signs with cosign. 5. Crossplane provisions namespace, bucket, and service account. 6. ArgoCD deploys to staging automatically. 7. Datadog dashboard and SLO are auto-created. 8. Backstage scorecard shows green. |
| Post-conditions | Service is in staging, discoverable in catalogue, observable; total elapsed time target < 30 minutes |
| Views Involved | Logical, Integration & Data Flow, Physical, Security |
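For automation (for example bulk bootstrapping), the same golden path can in principle be invoked through Backstage's scaffolder API rather than the UI. A sketch, assuming the v2 tasks endpoint and a template registered in our catalogue; the exact path, payload shape, template name, and field names vary by Backstage version and template definition, so treat all of them as illustrative:

```python
import requests

BACKSTAGE_URL = "https://backstage.platform.stellar.internal"   # hypothetical portal URL

def bootstrap_service(token: str) -> str:
    """Kick off the 'Create new Go service' golden path and return the scaffolder task id."""
    payload = {
        # templateRef format and parameter names depend on our template definition
        "templateRef": "template:default/go-service-golden-path",
        "values": {
            "name": "payments-api",
            "team": "team-payments",
            "description": "Payments orchestration service",
            "tier": "tier-2",
            "region": "europe-west2",
            "dataClassification": "internal",
        },
    }
    resp = requests.post(f"{BACKSTAGE_URL}/api/scaffolder/v2/tasks",
                         json=payload,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["id"]
```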
UC-02: Engineer deploys a change to production via the golden path

| Attribute | Detail |
| --- | --- |
| Main Flow | 1. PR merged to main. 2. Tekton builds new image and pushes signed artefact. 3. A bot PR is raised against the infra repo bumping the image tag in the prod overlay. 4. Once approved and merged, ArgoCD detects drift and syncs to the target cluster. 5. Progressive delivery (Argo Rollouts, canary) shifts traffic 10% -> 50% -> 100% with SLO-based gating. 6. If the SLO burn rate exceeds threshold, automatic rollback. |
| Post-conditions | Change is live; DORA pipeline emits deployment event; scorecard updates |
| Views Involved | Logical, Integration, Physical, Security |
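The SLO gate in steps 5 and 6 of the flow above boils down to comparing the canary's observed error rate against the error budget implied by the SLO. A minimal sketch of that decision with a hypothetical burn-rate threshold; the production gate runs as an Argo Rollouts analysis against Datadog/Prometheus queries rather than inline code:

```python
def burn_rate(error_count: int, request_count: int, slo_target: float = 0.999) -> float:
    """How fast the canary is consuming error budget: 1.0 means exactly on budget."""
    if request_count == 0:
        return 0.0
    error_budget = 1.0 - slo_target              # e.g. 0.1% of requests may fail
    return (error_count / request_count) / error_budget

def gate(error_count: int, request_count: int, threshold: float = 2.0) -> str:
    """Promote the canary while burn rate stays under the threshold, otherwise roll back."""
    return "promote" if burn_rate(error_count, request_count) < threshold else "rollback"

# Example: 12 errors in 5,000 canary requests against a 99.9% SLO
# -> error rate 0.24%, burn rate 2.4x budget -> rollback
```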
UC-03: SRE responds to a platform incident (break-glass)
ADR-001: Adopt Backstage rather than build an in-house portal
| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2026-01-22 |
| Context | The platform needs a unified front-door. We considered three directions: build a bespoke portal, adopt Backstage, or buy a commercial IDP (Port.io, Cortex, OpsLevel). Our ambition is a deeply integrated, opinionated IDP and we expect to run it for 5+ years. |
| Decision | Adopt Backstage as the foundation of the portal plane. |
| Alternatives Considered | Build bespoke: Full control and perfect fit, but requires 4-6 engineer-years to reach catalogue parity; hiring and retention signal is weaker. Port.io / commercial IDP: Fast to stand up, strong out-of-the-box experience, but ongoing per-user SaaS cost at 400 engineers is material (~GBP 200k/year) and customisation of the core data model is limited. Backstage: CNCF incubating, large ecosystem (>300 plugins), portable catalogue model, healthy community, used by organisations at comparable scale (Spotify, American Airlines, Expedia). |
| Consequences | Positive: strong hiring signal; community velocity; deep extension points; OSS means no per-seat cost. Negative: TypeScript/Node.js operational stack introduced; upstream velocity is high, so we must track releases; initial plugin quality is variable. |
| Quality Attribute Tradeoffs | Operational excellence and cost (positive) vs. initial delivery speed (slightly negative — steeper initial curve than a SaaS IDP). |
ADR-002: ArgoCD for GitOps rather than Flux
| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2026-02-09 |
| Context | We need a GitOps engine to reconcile Kubernetes state across GKE and EKS. The two mature CNCF options are ArgoCD and Flux. |
| Decision | Use ArgoCD in HA mode as the primary delivery-plane engine. |
| Alternatives Considered | Flux: Lightweight, GitOps-toolkit-based, composable, lower resource footprint. Excellent for small deployments, but the UX for 850+ applications across 5 regions is weaker. ArgoCD: Rich UI suited to a developer-facing portal experience, Argo Rollouts integration for progressive delivery, ApplicationSets for template-driven fan-out, mature multi-cluster model. |
| Consequences | Positive: excellent developer UX; first-class progressive delivery; strong RBAC model. Negative: heavier resource footprint; the in-cluster UI is another attack surface (mitigated via IAP + OIDC). |
| Quality Attribute Tradeoffs | Operational excellence (positive) over small efficiency gains from Flux (minor negative). |
ADR-003: Multi-cloud (GKE primary, EKS secondary) from day one
| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2026-03-11 |
| Context | Two of our five largest customers contractually require workloads to run in AWS regions they already operate in. A third (regulated) requires GCP. Consolidating onto a single cloud would force a painful customer-facing negotiation. The platform is the leverage point: if the platform is cloud-agnostic, product teams inherit multi-cloud capability without new cognitive load. |
| Decision | Design Stellar Platform as multi-cloud from inception. GKE is the primary cloud for platform-plane workloads (lower operational cost for the control plane at our scale, Autopilot maturity). EKS is a peer runtime for tenant workloads requiring AWS presence. Crossplane provides a uniform abstraction over cloud resources. |
| Alternatives Considered | Single-cloud (GCP only): Simpler, cheaper to run, faster to deliver. Rejected because it forces commercial negotiation with AWS-bound customers. Single-cloud (AWS only): Similar trade-off in reverse. Cloud-agnostic from day one, deploy later: Architecturally tempting but creates a "second day" surprise; abstractions untested under load. |
| Consequences | Positive: strategic flexibility, customer alignment, reduced vendor lock-in. Negative: roughly 25% higher platform engineering cost; requires disciplined use of abstractions (no reaching directly for cloud-specific primitives outside agreed extension points). |
| Quality Attribute Tradeoffs | Reliability and strategic flexibility (positive) over cost optimisation (negative in the short term). |
GKE Autopilot handles platform pods; Karpenter handles EKS; ArgoCD application controller sharded by cluster; Backstage horizontal pod autoscaling on CPU and request latency
GitHub Enterprise Cloud scales with enterprise contract; Datadog contract sized for 3x current ingest; Okta has room for 2x workforce; Vault HA cluster sized for 10x current QPS
Yes — platform-plane components run in HA mode (>= 3 replicas across zones); ArgoCD and Crossplane reconcile continuously; circuit breakers on third-party calls (Datadog, GitHub); Backstage degrades gracefully if catalogue DB is read-only (serves cached data, self-service creation paused).
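The circuit breakers on third-party calls mentioned above follow the standard pattern: trip after consecutive failures, fail fast while open, retry after a cool-down. A minimal sketch of that pattern; the thresholds are illustrative and the real implementation lives in the platform's shared HTTP client rather than this form:

```python
import time

class CircuitBreaker:
    """Fail fast against a flaky dependency (e.g. the Datadog or GitHub API) instead of piling up retries."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when the circuit tripped, or None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping call to degraded dependency")
            self.opened_at = None           # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                   # a success closes the circuit again
        return result
```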
Cost deliberately balanced against strategic value: Yes — GKE Autopilot premium accepted in exchange for reduced SRE toil; Datadog retained (vs. full self-host) to avoid re-tooling cost; multi-cloud accepted as a strategic cost; spot/preemptible nodes for non-production; scale-to-zero in non-prod.
Run-rate cost modelled against benefit: Yes — modelled in FinOps tooling (Cloudability). Run cost of approximately GBP 350k/year (hosting + Datadog + Okta increments + incidental) versus an estimated opportunity cost of 15 engineer-years/year lost to platform-adjacent toil in the current state. Payback estimated at 11 months (illustrative arithmetic below).
Has the hosting location been chosen to reduce environmental impact? Partially — europe-west2 (London), us-east4, and asia-southeast1 are all chosen for customer proximity; each region is on a carbon-neutral / renewable power commitment from its respective cloud provider.
What is the expected workload demand pattern? Variable and predictable — heavier during engineering working hours across regions.
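To make the payback claim inspectable, the arithmetic runs roughly as follows. The loaded engineer-year cost and the one-off build investment below are purely illustrative assumptions, not figures from the approved business case:

```python
# Illustrative only: these inputs are assumptions, not the figures in the business case.
engineer_year_cost = 130_000          # hypothetical fully loaded cost per engineer-year (GBP)
toil_recovered_per_year = 15          # engineer-years/year of platform-adjacent toil removed
platform_run_cost = 350_000           # GBP/year (hosting + Datadog + Okta increments + incidental)
one_off_build_investment = 1_500_000  # hypothetical initial platform build cost (GBP)

gross_benefit = toil_recovered_per_year * engineer_year_cost       # 1,950,000 GBP/year
net_benefit_per_month = (gross_benefit - platform_run_cost) / 12   # ~133,000 GBP/month
payback_months = one_off_build_investment / net_benefit_per_month  # ~11 months
print(f"payback of roughly {payback_months:.1f} months")
```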
Migration pattern: Strangler Fig — platform stands up alongside the existing estate; teams migrate in waves.
Migration waves: Wave 0: platform team dogfoods (months 0-3). Wave 1: 5 volunteer teams (months 4-6). Wave 2: remaining teams opted in by directorate (months 7-18).
Data migration mode: Not applicable (no customer data in the platform); catalogue populated via GitHub scan.
End-user cutover: Phased by team; no forced cutover.
External system cutover: Phased — Jenkins retired per directorate once the last job migrates.
Maximum acceptable downtime: Hours (during migration windows), zero (steady state).
Rollback plan: Teams can revert to their prior CI or deployment pattern at any time during Wave 2; the platform team monitors adoption and DORA and escalates if a rollback trend emerges.
Acceptance criteria (Wave 1): 1. Five teams onboarded. 2. New-service lead time < 1 day. 3. Net DevEx score positive. 4. SLOs met.
GKE Autopilot non-prod clusters scale to zero out of hours; Cloud SQL non-prod auto-paused; AWS Config + GCP Org Policy alert FinOps if non-prod resources run > 24h without an exception tag (a sketch of that guard follows below).
Right-sizing review cadence: Quarterly via Cloudability + GCP Recommender + AWS Compute Optimizer. The last review (2026-Q1) downsized 4 EKS node groups and one Cloud SQL instance, recovering ~£42k/year.
Yes — decommissioning runbook requires Terraform destroy + bucket emptying + key destruction; CMDB Retired status only after both AWS Cost Explorer and GCP Billing confirm zero spend for 30 days.
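A sketch of the 24-hour non-prod guard on the AWS side, using boto3 to flag instances tagged non-prod that have been running for more than a day without an exception tag; the tag names and region are illustrative assumptions, and the real control is expressed as AWS Config / Org Policy rules rather than a script:

```python
from datetime import datetime, timedelta, timezone
import boto3

def long_running_nonprod_instances(region: str = "eu-west-2") -> list[str]:
    """Return ids of non-prod EC2 instances running >24h without an exception tag."""
    ec2 = boto3.client("ec2", region_name=region)
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    flagged = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[
        {"Name": "tag:environment", "Values": ["dev", "integration-test"]},  # illustrative tag scheme
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                if "finops-exception" not in tags and inst["LaunchTime"] < cutoff:
                    flagged.append(inst["InstanceId"])
    return flagged  # fed into the FinOps alert channel
```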
| ID | Description | Category | Impact | Likelihood | Owner |
| --- | --- | --- | --- | --- | --- |
| R-001 | Platform team becomes a bottleneck for feature requests from 60 teams | Operational | High | High | Jane Doe |
| R-002 | Golden paths become too restrictive and teams lose autonomy ("paved road fatigue") | Operational | High | Medium | Claire Doe |
| R-003 | Shadow platforms emerge — teams route around Stellar Platform, rebuilding parallel stacks | Operational | High | Medium | Tom Bloggs |
| R-004 | Backstage upstream velocity outpaces our ability to track; plugins break on version bumps | Technical | Medium | High | Tom Bloggs |
| R-005 | Multi-cloud abstractions leak, producing unpredictable behaviour between GKE and EKS | Technical | High | Medium | Tom Bloggs |
| R-006 | Compromise of the platform (ArgoCD, Crossplane) amplifies blast radius across all tenant workloads | Security | Critical | Low | Joe Bloggs |
| R-007 | Jenkins migration drags beyond 18 months; carrying cost of two systems becomes unsustainable | Delivery | Medium | Medium | Tom Bloggs |
| R-008 | Datadog vendor lock-in hardens as custom monitors proliferate | Commercial | Medium | Medium | Amir Bloggs |
| R-009 | DORA metrics misinterpreted as individual performance rather than system health | Operational | Medium | Medium | Jane Doe |
Risk response:
| ID | Mitigation Strategy | Mitigation Plan | Residual Risk | Last Assessed |
| --- | --- | --- | --- | --- |
| R-001 | Mitigate | Platform-as-a-product model with PM-owned roadmap; quarterly prioritisation with top-20 product teams; explicit "escape hatch" guidance so teams can self-serve outside the paved road; community-of-practice model for common contributions back into platform | Medium | 2026-04-10 |
| R-002 | Mitigate | Paved-road-with-opt-out philosophy baked in; quarterly DevEx surveys specifically ask about fit; template versioning so teams can pin and diverge if needed | Medium | 2026-04-10 |
| R-003 | Mitigate | Visibility through catalogue (anything in GitHub appears); Engineering Director engagement model to sponsor platform adoption; quarterly adoption review at senior leadership level | Medium | 2026-04-10 |
| R-004 | Mitigate | Track Backstage upstream actively; contribute upstream where we depend on behaviour; plugin acceptance tests in CI; monthly Backstage upgrade cadence | Medium | 2026-04-10 |
| R-005 | Mitigate | Clear composition contract per Crossplane resource; contract tests run on both clouds; ADR required before a new cloud-specific primitive is exposed; deliberate small exposure surface | Medium | 2026-04-10 |
| R-006 | Mitigate | Defence in depth: Sigstore admission, Falco runtime, signed Git, no shared credentials, Crossplane workload identity, annual red-team engagement, zero-standing-privilege model | Low | 2026-04-10 |
| R-007 | Mitigate | Migration wave plan with quarterly go/no-go; published Jenkins EOL date; clear "rehost first, refactor later" policy; dedicated migration squad | Medium | 2026-04-10 |
| R-008 | Mitigate | OpenTelemetry Collector as abstraction; dashboards-as-code via Terraform provider (portable); quarterly review of Datadog-specific usage | Medium | 2026-04-10 |
| R-009 | Mitigate | DORA only shown at team level; engineering handbook explicitly describes DORA as system-health signals; director-level coaching on psychologically safe use | | |
| Section | Score | Reviewed By | Date | Comments |
| --- | --- | --- | --- | --- |
| | | | | Strong business context; drivers, DORA baseline, and platform-as-a-product framing clear; strategic alignment to platform strategy is explicit |
| 3.1 Logical View | 4 | ARB Panel | 2026-04-18 | Three-plane decomposition, component ownership, design patterns, and lock-in assessment all documented |
| 3.2 Integration & Data Flow | 3 | ARB Panel | 2026-04-18 | All interfaces described with protocols and auth; developer-journey sequence diagram present; formal API contracts for DORA endpoint not yet published (tracked) |
| 3.3 Physical View | 3 | ARB Panel | 2026-04-18 | Multi-cloud topology and environment list complete; cross-cloud failover drill scheduled but not yet executed end-to-end |
| 3.4 Data View | 3 | ARB Panel | 2026-04-18 | Data stores classified, retention and encryption defined, DPIA complete; sovereignty addressed. Data-contract-style schemas between planes not formalised |
| 3.5 Security View | 4 | ARB Panel | 2026-04-18 | Zero-standing-privilege model, workload identity, Sigstore, Vault all covered; threat model produced; annual red-team committed |
| 3.6 Scenarios | 4 | ARB Panel | 2026-04-18 | Three strong use cases (bootstrap, deploy, break-glass); three ADRs with genuine alternatives and trade-offs |
| | | | | HA, multi-region warm standby, chaos monthly; cross-cloud DR rehearsal outstanding |
| 4.3 Performance | 3 | ARB Panel | 2026-04-18 | Targets explicit including DORA deltas; growth modelled to year 5; continuous synthetic load testing |
| 4.4 Cost Optimisation | 3 | ARB Panel | 2026-04-18 | Showback per team, FinOps review cadence; multi-cloud premium explicitly accepted and tracked |
| 4.5 Sustainability | 3 | ARB Panel | 2026-04-18 | Non-prod scale-to-zero; renewable-commitment regions; carbon dashboard planned for Phase 2 |
| 5. Lifecycle | 4 | ARB Panel | 2026-04-18 | Mature CI/CD and supply-chain posture; migration plan with 6 Rs applied to Jenkins estate; skill gaps named and mitigated |
| 6. Decision Making | 4 | ARB Panel | 2026-04-18 | Constraints, assumptions, and especially risks are well grounded in platform-engineering reality (bottleneck, paved-road fatigue, shadow IT, vendor lock-in) |
| Overall | 3 | ARB Panel | 2026-04-18 | Solid Tier 3 platform SAD at Recommended depth. Genuine platform-engineering thinking throughout. Lowest-scoring sections (3) are all known gaps with owners and plans: cross-cloud DR rehearsal, data contracts between planes, Phase-2 carbon dashboard. |