Hosted Stellar Relayer on GCP: Operator Deployment Guide

This guide covers deploying and operating the Stellar Relayer Channels service on GCP. The infrastructure runs on Cloud Run backed by Memorystore Redis, Pub/Sub for distributed job processing, and Cloud KMS for transaction signing, with optional Cloudflare Workers for API-key management and per-user rate limiting.

Work through the deployment steps in order; each step produces configuration or keys that later steps depend on. For the AWS deployment, see the AWS Operator Deployment Guide.


1. Architecture

The service connects several GCP-managed components into a single transaction processing pipeline. Understanding this layout helps with capacity planning and narrows the search space when diagnosing failures. Most operational issues map to one specific layer.

1.1. Cloud Architecture

| Component | GCP Service | Purpose |
|---|---|---|
| Edge gateway | Cloudflare Worker + KV (optional) | API-key issuance, rate limiting, usage tracking |
| Load balancer | External HTTPS LB + Google-managed cert | TLS termination, health-checked routing |
| Compute | Cloud Run v2 | Runs the relayer container with autoscaling |
| State | Memorystore Redis 7.2 | Transaction records, sequence counters, distributed locks |
| Queue | 8 Pub/Sub topics + subscriptions | Distributed transaction processing |
| Secrets | Secret Manager | API keys, admin secrets, encryption keys |
| Signing | Cloud KMS (EC_SIGN_ED25519) | Transaction signing for fund + channel accounts |
| Image registry | Artifact Registry (remote repo) | Proxies ECR Public image for Cloud Run |
| Networking | VPC + VPC Connector + Private Service Access | Private connectivity to Memorystore |

1.2. App Architecture (Channels Plugin Runtime)

1.3. Transaction Lifecycle

1.4. How Pub/Sub Queues Work

Eight topics with pull subscriptions handle the transaction pipeline. Pub/Sub has no native delayed delivery, so deferred jobs (retries with backoff) sit in Redis sorted sets until due, then get published to the topic. The topic only ever carries ready-to-process jobs; no dead-letter topics needed.
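The deferral pattern described above can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions, not the relayer's implementation: DeferredJobs, defer, pop_due, and backoff_seconds are hypothetical names, a heap stands in for the Redis sorted set, and the backoff base and cap are illustrative values.

```python
import heapq
from typing import List, Tuple


class DeferredJobs:
    """In-memory stand-in for a Redis sorted set scored by due time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def defer(self, job_id: str, due_at: float) -> None:
        # Comparable to adding a member whose score is its due timestamp.
        heapq.heappush(self._heap, (due_at, job_id))

    def pop_due(self, now: float) -> List[str]:
        # Return (and remove) every job whose due time has passed.
        # These are the jobs that would then be published to the topic.
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    # Exponential backoff with a ceiling; base and cap are assumptions.
    return min(cap, base * (2 ** attempt))
```

Because only due jobs ever leave the set, the topic itself never has to hold a message that is not ready to process.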

1.5. How Channels Works on Stellar

Every Stellar transaction has a source account with a monotonically increasing sequence number. Only one transaction per source account can be in-flight at a time; this is the constraint that caps parallel throughput on Stellar.

The Channels service works around it with a pool of dedicated source accounts (channel accounts). Each in-flight transaction acquires one channel account from the pool, uses its sequence number, and releases it after confirmation. A separate fund account holds the XLM balance. When submitting, the service wraps the channel-signed envelope in a fee-bump transaction, a Stellar primitive that lets a second account pay the network fee. Both accounts are backed by Cloud KMS ED25519 keys.

The pool size you provision in section 4.10 is your throughput ceiling. See section 12.1 for the sizing formula.
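To make the acquire/release cycle concrete, here is a minimal sketch of a channel pool. It is an illustration only: ChannelPool and its methods are hypothetical names, the real service coordinates the pool through Redis locks rather than process memory, and the channel-0001 naming simply mirrors the IDs used elsewhere in this guide.

```python
from collections import deque
from typing import Deque, Optional, Set


class ChannelPool:
    """Toy model: each in-flight transaction holds exactly one channel."""

    def __init__(self, size: int) -> None:
        self._free: Deque[str] = deque(
            f"channel-{i:04d}" for i in range(1, size + 1)
        )
        self._busy: Set[str] = set()

    def acquire(self) -> Optional[str]:
        # None models pool exhaustion: no free source account is available.
        if not self._free:
            return None
        channel = self._free.popleft()
        self._busy.add(channel)
        return channel

    def release(self, channel: str) -> None:
        # Called after confirmation; the channel can serve the next transaction.
        self._busy.remove(channel)
        self._free.append(channel)
```

With a pool of size N, the (N+1)th concurrent transaction has nothing to acquire, which is why the pool size acts as the throughput ceiling.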

1.6. Resource Sizing

Module defaults work for getting started. Raise them as traffic grows; the right-hand column shows the values the current GCP deployment runs with.

| Resource | Module default (prod) | Current GCP deployment |
|---|---|---|
| CPU | 1 vCPU | 4 vCPU |
| Memory | 2 Gi | 8 Gi |
| Min instances | 2 | 3 |
| Max instances | 10 | 20 |
| Redis tier | STANDARD_HA | STANDARD_HA |
| Redis memory | 5 GB | 5 GB |

The module auto-adjusts sizing by environment (prod vs everything else):

| Setting | prod | other |
|---|---|---|
| Min instances | 2 | 1 |
| Max instances | 10 | 4 |
| CPU always allocated | yes | no |
| Redis tier | STANDARD_HA | BASIC |
| Redis memory | 5 GB | 1 GB |
| LB deletion protection | on | off |
| Log retention | 30 days | 7 days |
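The prod-versus-everything-else switch in the table above amounts to a two-way lookup. The sketch below restates it in Python purely as an illustration; the real logic lives in the Terraform module, and sizing_defaults and the dictionary keys are hypothetical names.

```python
from typing import Any, Dict

# Values copied from the table above; key names are illustrative.
_PROD: Dict[str, Any] = {
    "min_instances": 2,
    "max_instances": 10,
    "cpu_always_allocated": True,
    "redis_tier": "STANDARD_HA",
    "redis_memory_gb": 5,
    "lb_deletion_protection": True,
    "log_retention_days": 30,
}
_OTHER: Dict[str, Any] = {
    "min_instances": 1,
    "max_instances": 4,
    "cpu_always_allocated": False,
    "redis_tier": "BASIC",
    "redis_memory_gb": 1,
    "lb_deletion_protection": False,
    "log_retention_days": 7,
}


def sizing_defaults(environment: str) -> Dict[str, Any]:
    """Return a copy of the defaults for the given environment name."""
    # Only the exact string "prod" selects the production column.
    return dict(_PROD if environment == "prod" else _OTHER)
```

Note that any environment name other than prod (stg, dev, and so on) gets the smaller defaults.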

2. Prerequisites

Gather everything in this section before running terraform apply. Missing any item will block either the initial deployment or the post-deploy bootstrap steps.

2.1. Accounts and Access

  • GCP project with billing enabled and permission to create Cloud Run services, Memorystore instances, Pub/Sub topics and subscriptions, Secret Manager secrets, Cloud KMS keyrings and keys, Compute Engine load balancers, VPC connectors, Artifact Registry repositories, and IAM role bindings.
  • Service account for Terraform with the specific predefined roles it needs. Do not grant the broad editor basic role (it subsumes far more than this stack requires and undermines least privilege). Grant: roles/resourcemanager.projectIamAdmin, roles/compute.networkAdmin, roles/cloudkms.admin, roles/pubsub.admin, roles/secretmanager.admin, roles/run.admin, roles/artifactregistry.admin, roles/iam.serviceAccountAdmin, and roles/iam.serviceAccountUser. If a first terraform apply fails on a missing permission, add the specific predefined role it names rather than falling back to editor.
  • Domain with DNS access (Route53, Cloud DNS, or other)
  • (Optional) Cloudflare account for the /gen API-key flow

2.2. Tooling

| Tool | Version | Why |
|---|---|---|
| Terraform | 1.5.0 or later | Module language constraints |
| Google provider | 5.0 or later, below 7.0 | Pinned in versions.tf |
| Cloudflare provider | ~> 5.0 | Required even when enable_cloudflare = false |
| gcloud CLI | recent stable | Auth, Artifact Registry, debugging |
| Node.js 18+ and pnpm 10+ | recent stable | Only if you modify the Channels plugin |

2.3. Stellar-Side Prerequisites

  • Soroban RPC access: at least two independent private providers from different operators recommended for mainnet. The public image ships with the default public RPC; you override it after deployment (see section 4.8).
  • XLM to fund the relayer's Stellar account and bootstrap channel accounts. Budget at least 250 XLM for 200 channel accounts plus the fund account.

2.4. Repos You'll Reference

| Repo | What it is |
|---|---|
| OpenZeppelin/relayer-channels-infra | This repo: Terraform modules + operator CLIs |
| OpenZeppelin/openzeppelin-relayer | The relayer application |
| OpenZeppelin/relayer-plugin-channels | Channels plugin (TypeScript) |

3. Environments

Run stg and prod as separate Terraform workspaces with isolated state:

| Env | Network | Working directory | Pub/Sub prefix | VPC connector CIDR |
|---|---|---|---|---|
| stg | testnet | for example, examples/gcp/ as shipped | relayer-testnet-stg- | 10.8.0.0/28 |
| prod | mainnet | for example, examples/gcp-prod/ (separate isolated workspace) | relayer-mainnet-prod- | 10.9.0.0/28 |

Use different CIDRs if both environments share a VPC. Resource names auto-suffix with -<environment> except for prod.
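The naming rule can be stated as a one-line function. This is only an illustration of the behaviour described above, with resource_name as a hypothetical helper; the module implements the rule in Terraform.

```python
def resource_name(base: str, environment: str) -> str:
    """Append -<environment> to a resource name, except in prod."""
    return base if environment == "prod" else f"{base}-{environment}"
```

For example, a base name of relayer-channels stays relayer-channels in prod and becomes relayer-channels-stg in staging.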


4. Deployment

Work through the steps below in order on a fresh deployment. Each step produces output or configuration that later steps depend on.

4.1. Authenticate

Prefer short-lived Application Default Credentials, which avoid creating a long-lived key file altogether:

gcloud auth application-default login

For CI/CD, prefer Workload Identity Federation over a downloaded key.

Only if your org blocks ADC login: create a service account key (IAM & Admin > Service Accounts > Keys) and point GOOGLE_APPLICATION_CREDENTIALS at it. This is a long-lived credential for a highly privileged Terraform service account, so treat it as the most sensitive secret in the deployment: store it outside the repo, rotate it on a schedule while it exists, and delete it as soon as the deployment is done (gcloud iam service-accounts keys delete).

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json"

4.2. Get the Module

Reference it directly from GitHub:

module "relayer_channels" {
  source = "git::https://github.com/OpenZeppelin/relayer-channels-infra.git//modules/gcp?ref=main"
  # ...
}

Or clone and use the examples:

git clone https://github.com/OpenZeppelin/relayer-channels-infra.git
cd relayer-channels-infra/examples/gcp       # stg
cd relayer-channels-infra/examples/gcp-prod   # prod

4.3. Configure the Terraform Backend

In versions.tf, configure remote state:

terraform {
  backend "gcs" {
    bucket = "your-org-terraform-state"
    prefix = "relayer-channels/prod"   # a path prefix, not a file name; state is stored as <prefix>/<workspace>.tfstate
  }
}

4.4. Create Your Tfvars

cp terraform.tfvars.example terraform.tfvars

Minimum config:

project_id      = "my-gcp-project"
region          = "us-east1"
environment     = "prod"
network         = "default"
subnetwork      = "default"
domain_name     = "channels.your-company.com"
stellar_network = "mainnet"
queue_backend   = "pubsub"
container_image = "us-east1-docker.pkg.dev/my-project/ecr-public/w5h5k2p1/openzeppelin-relayer-channels:mainnet-1.4.2"  # pin a version in prod; do not use mainnet-latest

# Secrets — never commit these
relayer_api_key        = ""  # set via TF_VAR_relayer_api_key
channels_admin_secret  = ""  # set via TF_VAR_channels_admin_secret
storage_encryption_key = ""  # set via TF_VAR_storage_encryption_key

Generate secrets:

export TF_VAR_relayer_api_key="$(uuidgen | tr '[:upper:]' '[:lower:]')"
export TF_VAR_channels_admin_secret="$(openssl rand -base64 32)"
export TF_VAR_webhook_signing_key="$(openssl rand -hex 32)"
export TF_VAR_storage_encryption_key="$(openssl rand -base64 32)"   # must be base64, not hex

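Because the values are produced by command substitution, shell history records only the command text (for example openssl rand -base64 32), not the generated secret. The exported values do live in the shell's environment, where every child process inherits them and where the same user or root can read them from /proc/<pid>/environ on a shared host. If you ever set one of these variables to a literal value instead, that value does land in history: set HISTCONTROL=ignorespace and prefix the command with a leading space, or disable history first (set +o history). Never paste the generated values into terraform.tfvars or commit them.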

4.5. Set Up Artifact Registry

Cloud Run can't pull from ECR Public directly. Set up a remote repo to proxy it:

  1. GCP Console > Artifact Registry > Create Repository
  2. Format: Docker, Mode: Remote, Source: Custom, URL: https://public.ecr.aws
  3. Name it ecr-public, pick your region

Then reference it in container_image in your tfvars (as shown in section 4.4).

Tag scheme: mainnet-<version> (pinned, use in prod), mainnet-latest (moves), testnet-<version>, testnet-latest.

The public image ships with mainnet.sorobanrpc.com as the default RPC. Override it with private providers after deployment (see section 4.8).

4.6. Deploy

terraform init
terraform plan -out plan.tfplan
terraform apply plan.tfplan

Takes 10–15 min. Memorystore creation is the slowest part.

Key outputs:

| Output | Used for |
|---|---|
| load_balancer_ip | DNS record creation |
| cloud_run_service_name | Service management |
| kms_signing_key_id | Signer creation |
| artifact_registry_url | Image pull path |

4.7. DNS and SSL

The Google-managed cert needs DNS pointing at the LB IP before it provisions.

Without Cloudflare:

  1. Create an A record: channels.your-company.com → <load_balancer_ip>
  2. Wait 15–60 min for cert to go ACTIVE

With Cloudflare:

  1. Create Cloudflare A record → LB IP (proxy OFF, grey cloud)
  2. Create Route53 A record → LB IP
  3. Wait for cert to go ACTIVE
  4. Change Route53 to CNAME → channels.your-company.com.cdn.cloudflare.net
  5. Turn Cloudflare proxy ON (orange cloud)

If the cert stays FAILED_NOT_VISIBLE for 30+ min, bump the cert name suffix in load-balancer.tf (e.g. -cert-v2 → -cert-v3) and re-apply. create_before_destroy swaps it without downtime.

4.8. Override RPC Endpoints

The public image uses the free public Soroban RPC, which rate-limits under load. After the service is healthy, override it with your private providers. This is a one-time call (the config persists in Redis).

curl -s \
  -H "Authorization: Bearer <your-relayer-api-key>" \
  -H "Content-Type: application/json" \
  -X PATCH https://channels.your-company.com/api/v1/networks/stellar:mainnet \
  -d '{
    "rpc_urls": [
      { "url": "https://your-primary-rpc.com/key", "weight": 100 },
      { "url": "https://your-secondary-rpc.com/key", "weight": 100 }
    ]
  }'

Verify:

curl -s -H "Authorization: Bearer <your-relayer-api-key>" \
  "https://channels.your-company.com/api/v1/networks?per_page=200" \
  | jq '.data[] | select(.id=="stellar:mainnet") | .rpc_urls'

Use at least two independent providers from different operators. The relayer load-balances by weight and rotates on failure.

You only need to re-run this after a RESET_STORAGE_ON_START=true restart, which wipes Redis. Normal restarts preserve it.
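The effect of the weight field and of failover can be pictured with a small selection function. This is a sketch of one plausible policy (weighted random choice among providers that are not paused), not the relayer's actual algorithm; pick_rpc and the paused set are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def pick_rpc(
    providers: List[Dict[str, object]],
    paused: Set[str],
    rng: random.Random,
) -> Optional[str]:
    """Pick one RPC URL by weight, skipping providers currently paused."""
    healthy = [p for p in providers if p["url"] not in paused and p["weight"] > 0]
    if not healthy:
        return None  # every provider is paused: nothing to route to
    urls = [p["url"] for p in healthy]
    weights = [p["weight"] for p in healthy]
    return rng.choices(urls, weights=weights, k=1)[0]
```

With two providers at equal weight, traffic splits roughly evenly; when one is paused, all traffic goes to the other, which is why a single-provider setup has no failover path.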

4.9. Create the Signer

ENV=mainnet API_KEY="$TF_VAR_relayer_api_key" \
GCP_SA_KEY_FILE="$HOME/path/to/sa-key.json" \
./scripts/gcp-kms-signer.sh

Then create the fund relayer via the relayer API:

curl -s -X POST https://channels.your-company.com/api/v1/relayers \
  -H "Authorization: Bearer $TF_VAR_relayer_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "channels-fund",
    "name": "channels-fund",
    "network": "mainnet",
    "signer_id": "<signer-id-from-above>",
    "network_type": "stellar",
    "paused": false,
    "policies": { "min_balance": 0, "fee_payment_strategy": "relayer" }
  }'

4.10. Bootstrap Channels

Size the pool before bootstrapping. Formula: min_pool = ceil(target_TPS × avg_settlement_seconds × 1.5). Stellar settlement averages 5–7 seconds. At 23 TPS sustained with 5-second settlement, that gives ceil(172.5) = 173 channels minimum; at 7 seconds it rises to 242. Use --dry-run to preview before committing.
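The sizing formula is easy to script when comparing scenarios. A minimal sketch follows; min_pool is a hypothetical helper name, and the 1.5 headroom factor is the one from the formula above.

```python
import math


def min_pool(target_tps: float, settlement_seconds: float, headroom: float = 1.5) -> int:
    """Minimum channel count: ceil(TPS x settlement seconds x headroom)."""
    return math.ceil(target_tps * settlement_seconds * headroom)
```

For example, min_pool(23, 5) evaluates ceil(172.5) and min_pool(23, 7) evaluates ceil(241.5), so sizing for the slow end of the settlement range needs noticeably more channels.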

Install the CLI from cli/ in this repo:

cd cli && bun install && bun run build
cd packages/oz-channels && bun link
cd ../oz-relayer && bun link

Set up a profile and bootstrap:

oz-channels profile init prod-mainnet

oz-channels bootstrap --to 200 --dry-run -p prod-mainnet   # preview
oz-channels bootstrap --to 200 -p prod-mainnet             # provision

4.10.1. Scaling Beyond ~100 Channels

When scaling the pool aggressively (e.g. 100 → 1000 channels), oz-channels bootstrap will start failing with TRY_AGAIN_LATER or tx_bad_seq errors from Horizon. This happens because every createAccount operation uses the fund relayer (channels-fund) as the transaction source, serializing all submissions on a single sequence number. Under high concurrency, Horizon rejects the overlapping submissions.

Use scripts/fund-new-channels.ts instead. It routes the transaction source through an existing funded channel account (e.g. channel-0001) while keeping the fund relayer as the operation source (so the treasury still pays). It also batches up to 100 createAccount ops per transaction, so a 100 → 1000 scale-up (900 new accounts) fits in 9 submissions.

npx tsx scripts/fund-new-channels.ts \
  --env mainnet \
  --api-key <key> \
  --source-relayer channel-0001 \
  --fund-relayer channels-fund \
  --from 101 --to 1000 \
  --starting-balance 2 \
  --report fund-report.json

The script is idempotent: it preflights every slot via the relayer API and Horizon and skips any account already funded onchain, so it is safe to re-run.
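The batching arithmetic can be checked with a short planner. This sketch only reproduces the slot-range maths for a --from/--to pair; plan_batches is a hypothetical helper and does not reflect how the script itself is written.

```python
from typing import List, Tuple


def plan_batches(first: int, last: int, max_ops: int = 100) -> List[Tuple[int, int]]:
    """Split slots first..last (inclusive) into ranges of at most max_ops."""
    batches: List[Tuple[int, int]] = []
    start = first
    while start <= last:
        end = min(start + max_ops - 1, last)
        batches.append((start, end))
        start = end + 1
    return batches
```

Calling plan_batches(101, 1000) yields nine ranges, from (101, 200) through (901, 1000), one per submitted transaction.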

4.11. Verify

curl -sS https://channels.your-company.com/api/v1/health
oz-channels smoke run -p prod-mainnet

A healthy service returns {"status":"ok"}. The smoke test submits a test transaction end-to-end and polls for confirmation; success prints a confirmed transaction ID. If it times out, check channel pool size and fund account balance before debugging further.


5. Configuration Reference

Most environment variables are managed by the Terraform module and should not be overridden without a specific reason. The tables below document what the module sets automatically and which values operators should tune for production scale.

5.1. Module-Managed Container Environment Variables

The Terraform module sets these. Do not override them unless you have a specific reason.

| Env var | Set to | Source |
|---|---|---|
| HOST | 0.0.0.0 | Module |
| STELLAR_NETWORK | var.stellar_network | Module |
| FUND_RELAYER_ID | var.fund_relayer_id | Module |
| API_KEY_HEADER | x-consumer-key | Module |
| REPOSITORY_STORAGE_TYPE | redis | Module |
| RESET_STORAGE_ON_START | false | Module |
| METRICS_ENABLED | true | Module |
| METRICS_PORT | 8081 | Module |
| LOG_FORMAT | json | Module |
| LOG_LEVEL | var.log_level | Module |
| REDIS_URL | redis://<memorystore-host>:<port> | Module |
| REDIS_READER_URL | redis://<read-endpoint>:<port> | Module |
| GCP_PROJECT_ID | var.project_id | Module |
| GCP_REGION | var.region | Module |
| DISTRIBUTED_MODE | var.distributed_mode | Module |
| QUEUE_BACKEND | var.queue_backend | Module |
| PUBSUB_TOPIC_PREFIX | relayer-{network}-{environment} | Module |
| PUBSUB_PROJECT_ID | var.project_id | Module |

5.2. Module-Managed Secrets

| Container env var | Secret Manager ID | Required? | Notes |
|---|---|---|---|
| API_KEY | {app_name}-relayer-api-key | Yes | Authenticates all API requests |
| PLUGIN_ADMIN_SECRET | {app_name}-channels-admin-secret | Yes | Required for channel management |
| WEBHOOK_SIGNING_KEY | {app_name}-webhook-signing-key | Optional | Only set if using webhook notifications |
| STORAGE_ENCRYPTION_KEY | {app_name}-storage-encryption-key | Optional | Encrypts data at rest in Redis. Strongly recommended for prod. Must be base64-encoded 32 bytes. |

Rotation procedure:

echo -n "new-value" | gcloud secrets versions add \
  relayer-channels-relayer-api-key --data-file=- \
  --project=your-project

gcloud run services update relayer-channels-service \
  --region=us-east1 --project=your-project \
  --update-labels="redeploy=$(date +%s)"

5.3. Production Reference Values

If you are targeting OpenZeppelin's reference scale (~2M+ tx/day), these are the env vars to tune:

container_environment = [
  { name = "BACKGROUND_WORKER_TRANSACTION_REQUEST_CONCURRENCY",                value = "200" },
  { name = "BACKGROUND_WORKER_TRANSACTION_SENDER_CONCURRENCY",                 value = "200" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_STELLAR_CONCURRENCY", value = "300" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_CONCURRENCY",         value = "1" },
  { name = "BACKGROUND_WORKER_TRANSACTION_STATUS_CHECKER_EVM_CONCURRENCY",     value = "1" },
  { name = "BACKGROUND_WORKER_NOTIFICATION_SENDER_CONCURRENCY",                value = "1" },
  { name = "BACKGROUND_WORKER_SOLANA_TOKEN_SWAP_REQUEST_CONCURRENCY",          value = "1" },
  { name = "BACKGROUND_WORKER_RELAYER_HEALTH_CHECK_CONCURRENCY",               value = "1" },
  { name = "RELAYER_CONCURRENCY_LIMIT",       value = "800" },
  { name = "PLUGIN_MAX_CONCURRENCY",          value = "8000" },
  { name = "MAX_CONNECTIONS",                  value = "4000" },
  { name = "REQUEST_TIMEOUT_SECONDS",          value = "60" },
  { name = "PLUGIN_POOL_REQUEST_TIMEOUT_SECS", value = "60" },
  { name = "PLUGIN_GLOBAL_TIMEOUT_MS",         value = "55000" },
  { name = "PLUGIN_POLLING_TIMEOUT_MS",        value = "45000" },
  { name = "RATE_LIMIT_REQUESTS_PER_SECOND",   value = "400" },
  { name = "REDIS_POOL_MAX_SIZE",              value = "3000" },
  { name = "REDIS_READER_POOL_MAX_SIZE",       value = "3000" },
  { name = "TRANSACTION_EXPIRATION_HOURS",     value = "0.1" },
  { name = "LIMITED_CONTRACTS",               value = "C<contract1>,C<contract2>" },
  { name = "CONTRACT_CAPACITY_RATIO",          value = "0.6" },
]

6. Cloudflare (Optional)

When enabled, a Cloudflare Worker handles API-key issuance (/gen), per-key rate limiting, and proxies requests to the LB with static-key injection.

enable_cloudflare     = true
cloudflare_zone_id    = "your-zone-id"
cloudflare_account_id = "your-account-id"
# Sensitive values below are set via TF_VAR_* env vars, not inline — see note.

The sensitive values (cloudflare_api_token, relayer_static_api_key, key_salt, cf_analytics_api_token) are marked sensitive and must not be committed in terraform.tfvars. Set them via environment variables, the same way as the core secrets:

export TF_VAR_cloudflare_api_token="$CLOUDFLARE_API_TOKEN"
export TF_VAR_relayer_static_api_key="$TF_VAR_relayer_api_key"   # must match relayer_api_key
export TF_VAR_key_salt="$(openssl rand -base64 32)"
export TF_VAR_cf_analytics_api_token="$CF_ANALYTICS_API_TOKEN"

relayer_static_api_key should match your relayer_api_key; the Worker swaps every user's Bearer token for this key upstream. key_salt is used to hash user keys before storing in KV.

6.1. Without Cloudflare

The /gen endpoint is not available; there's no self-service API-key generation. Callers authenticate directly with the relayer_api_key you configured. If you need per-user keys or rate limiting without Cloudflare, build that into your own API gateway layer in front of the load balancer.


7. Operations

Routine operations follow the same terraform apply workflow as the initial deployment. Stellar-specific operations (managing the channel pool, inspecting transactions) use the CLIs in cli/.

7.1. Deploys

To deploy a new version, update container_image in your tfvars and run terraform apply. Cloud Run creates a new revision and shifts traffic over automatically with no downtime.

7.2. Rollbacks

To roll back, set container_image to the previous version tag in your tfvars and run terraform apply.

7.3. Scaling

cpu                = "4"
memory             = "8Gi"
min_instance_count = 3
max_instance_count = 20

Run terraform apply to pick up the new limits. Cloud Run handles the transition without downtime.

7.4. Channel Pool

oz-channels bootstrap --from 201 --to 400 -p prod-mainnet   # grow the pool
oz-channels channels list -p prod-mainnet
oz-channels channels add channel-0050 -p prod-mainnet
oz-channels channels remove channel-0050 -p prod-mainnet

7.5. Transactions

oz-relayer tx show <tx-id> -r channels-fund -p prod-mainnet --json
oz-relayer tx list -r channels-fund --status pending -p prod-mainnet
oz-relayer relayer balance channels-fund -p prod-mainnet

8. Observability

The service emits structured JSON logs to Cloud Logging, Cloud Run request metrics, and Pub/Sub queue metrics. Set up the log-based metrics and alerting policies below before putting the service under production load.

8.1. Logs

Cloud Run streams structured JSON logs to Cloud Logging.

# Errors in the last hour
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service" AND severity>=ERROR' \
  --project=your-project --limit=20 --freshness=1h --format='value(textPayload)'

# Filter by tx ID
gcloud logging read 'resource.type="cloud_run_revision" AND textPayload:"<tx-id>"' \
  --project=your-project --limit=20 --freshness=1h

# Live tail
gcloud logging tail 'resource.type="cloud_run_revision" AND resource.labels.service_name="relayer-channels-service"' \
  --project=your-project

8.2. Cloud Run Metrics

Console > Cloud Run > Service > Metrics:

| Metric | Signal |
|---|---|
| container/cpu/utilization | >80% sustained → scale up |
| container/memory/utilization | >70% → risk of OOM |
| request_count by status | 5xx spikes |
| request_latencies | p95/p99 degradation |
| container/instance_count | autoscaling behavior |

8.3. Pub/Sub Metrics

Console > Pub/Sub > Subscription > Metrics:

| Metric | Signal |
|---|---|
| num_undelivered_messages | growing backlog → falling behind |
| oldest_unacked_message_age | >60s → workers stuck |
| pull_message_operation_count | confirms workers are active |

8.4. Memorystore Metrics

Console > Memorystore > Instance > Monitoring:

| Metric | Signal |
|---|---|
| CPU utilization | >75% sustained |
| Memory usage ratio | >70% |
| Connected clients | near limit |

8.5. Log-Based Metrics

Create in Cloud Logging > Log-based Metrics > Create Metric:

| Metric name | Filter | Purpose |
|---|---|---|
| relayer/errors | severity>=ERROR | Total error rate |
| relayer/pool_capacity | textPayload:"POOL_CAPACITY" | Pool exhaustion events |
| relayer/provider_paused | textPayload:"provider paused" | RPC failover events |

8.6. Alerting

Key alert policies to set up in Cloud Monitoring > Alerting:

| Alert | Condition | Severity |
|---|---|---|
| High error rate | >50 errors in 5 min | Critical |
| Cloud Run high CPU | >80% for 10 min | Warning |
| Cloud Run high memory | >70% for 10 min | Warning |
| Pub/Sub backlog | >5000 messages for 10 min | Warning |
| Pub/Sub old messages | >300s for 5 min | Critical |
| Pool exhaustion | POOL_CAPACITY log > 0 in 5 min | Critical |

8.7. Prometheus

The relayer exposes metrics at :8081/debug/metrics/scrape. Scrape with Google Cloud Managed Prometheus or your own Prometheus instance.

8.8. Stellar-Side Monitoring

GCP metrics reflect service health. These check the Stellar network side; monitor both.

Fund account balance:

oz-relayer relayer balance channels-fund -p prod-mainnet

Alert when balance drops below 50 XLM. A depleted fund account fails all fee-bumps silently.

Ledger close time: Stellar closes a ledger roughly every 5 seconds normally. Sustained close times above 10 seconds indicate network stress and inflate settlement latency beyond your pool sizing assumptions.

curl -sS "https://horizon.stellar.org/ledgers?order=desc&limit=5" | jq '._embedded.records[] | {sequence, closed_at}'
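To turn the closed_at values from the query above into a single number, average the gaps between consecutive ledgers. The sketch below assumes closed_at is an ISO 8601 UTC timestamp ending in Z; avg_close_seconds is a hypothetical helper name.

```python
from datetime import datetime
from typing import List


def avg_close_seconds(closed_at: List[str]) -> float:
    """Average seconds between consecutive ledger closes."""
    # fromisoformat on older Python versions does not accept a trailing "Z".
    times = sorted(
        datetime.fromisoformat(t.replace("Z", "+00:00")) for t in closed_at
    )
    if len(times) < 2:
        raise ValueError("need at least two ledgers to measure an interval")
    span = (times[-1] - times[0]).total_seconds()
    return span / (len(times) - 1)
```

A result near 5 matches normal operation; a sustained result above 10 is the network-stress signal described above.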

TRY_AGAIN_LATER in logs: Horizon is rejecting transactions due to fee competition. Raise MAX_FEE (see section 12.7). If it appears alongside provider paused, check RPC provider health first.

RPC provider health:

curl -sS -X POST <your-rpc-url> \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq .

9. Debugging

Almost every failure belongs to a specific layer. Identify the layer first, then pull the logs for that component.

A request that never returns a tx_id failed in the synchronous path (edge, LB, auth, fee budget, enqueue). A request that returned a tx_id but never confirmed failed in the async path (channel acquisition, build/simulate, sign, fee-bump, submit, status poll).

Pool exhaustion, sequence drift, and an RPC throttle all look like "transactions are failing" from the outside; each lives in a different layer and has a different fix.

| You have | Do this |
|---|---|
| Transaction ID | oz-relayer tx show <tx-id> -r channels-fund --json -p <env> |
| Error message | Search Cloud Logging: textPayload:"<error>" |
| "What's broken right now" | gcloud logging read ... AND severity>=ERROR |
| Stellar tx hash | Check Horizon, then find the relayer tx record |

Common log patterns:

| Pattern | Means |
|---|---|
| provider paused | RPC failover kicked in |
| POOL_CAPACITY | Channel pool exhausted; bootstrap more |
| LOCKED_CONFLICT | Two workers grabbed the same channel |
| TRY_AGAIN_LATER | Horizon throttling |
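When triaging a batch of exported log lines, the pattern table above can be applied mechanically. A minimal sketch, assuming plain substring matching is good enough; diagnose and LOG_PATTERNS are hypothetical names, and the meanings are copied from the table.

```python
from typing import Dict, List

LOG_PATTERNS: Dict[str, str] = {
    "provider paused": "RPC failover kicked in",
    "POOL_CAPACITY": "Channel pool exhausted; bootstrap more",
    "LOCKED_CONFLICT": "Two workers grabbed the same channel",
    "TRY_AGAIN_LATER": "Horizon throttling",
}


def diagnose(log_line: str) -> List[str]:
    """Return the meaning of every known pattern found in a log line."""
    return [
        meaning
        for pattern, meaning in LOG_PATTERNS.items()
        if pattern in log_line
    ]
```

A line that matches more than one pattern returns every match, which helps spot combinations such as TRY_AGAIN_LATER appearing alongside provider paused.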

9.1. Redis Inspection

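Connect from a VM in the same VPC. Use --scan (cursor-based SCAN) rather than KEYS, which blocks Redis while it walks the entire keyspace:

redis-cli -h <redis_host> -p 6379 --scan --pattern '*tx:*'
redis-cli -h <redis_host> -p 6379 GET "oz-relayer:relayer:channels-fund:tx:<tx-id>"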

10. Security

This section documents the security posture of the deployed infrastructure. Review it before go-live and consult it when rotating credentials or adjusting network ingress rules.

10.1. Secrets

All secrets are stored in Secret Manager, but are currently injected into Cloud Run as plain environment variables rather than secret_key_ref references (see Known Issues for the plan to switch). Consequence: anyone with run.viewer on the project (or who can run gcloud run services describe) can read the secret values directly from the revision config, including relayer_api_key, channels_admin_secret, and storage_encryption_key. Until secret_key_ref lands, restrict roles/run.viewer / roles/run.admin to the minimum set of principals and audit it.

10.2. Network Isolation

  • Cloud Run ingress: INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER in prod; INGRESS_TRAFFIC_ALL for testing.
  • Cloud Run egress: VPC Connector with PRIVATE_RANGES_ONLY. Private traffic goes through the VPC (to Memorystore); public traffic (Stellar RPC, KMS API) goes direct.
  • Memorystore: Private Service Access only, no public IP.
  • Pub/Sub: IAM-scoped per topic/subscription.

10.3. IAM

The Cloud Run SA ({app_name}-run) gets:

| Role | Scope |
|---|---|
| secretmanager.secretAccessor | per-secret |
| monitoring.metricWriter | project |
| logging.logWriter | project |
| monitoring.viewer | project |
| cloudkms.signerVerifier | per-key |
| cloudkms.publicKeyViewer | per-key |
| pubsub.publisher | per-topic |
| pubsub.subscriber | per-subscription |
| artifactregistry.reader | per-repo |

10.4. TLS

  • Load balancer: Google-managed SSL cert, HTTPS on 443, HTTP redirects to HTTPS.
  • Memorystore: transit encryption is disabled (see Known Issues). Private Service Access provides network-level isolation.
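  • Cloudflare to LB: set the Cloudflare zone SSL mode to "Full" so the hop from Cloudflare to the load balancer is also encrypted; "Full (strict)" additionally validates the origin certificate.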

10.5. Cloud KMS

Signing keys use the EC_SIGN_ED25519 algorithm at the SOFTWARE protection level. To rotate: provision a new key, register a new signer and relayer, fund the new onchain account, drain the old one, then retire it.


11. Post-Restart Checklist

If you ever restart with RESET_STORAGE_ON_START=true (which wipes Redis), you need to redo the following (the service will be up but non-functional until these are done):

  1. Re-create the signer: ./scripts/gcp-kms-signer.sh (section 4.9)
  2. Re-create the fund relayer: via the relayer API using the new signer ID
  3. Re-run the RPC override: the PATCH to /api/v1/networks/stellar:mainnet (section 4.8)
  4. Re-bootstrap channels: oz-channels bootstrap --to <N> -p <env> (section 4.10)
  5. Fund the fund relayer: if the onchain account was recreated, send XLM to the new address

Normal restarts and redeployments (without RESET_STORAGE_ON_START=true) preserve everything in Redis; none of the above is needed.


12. Gotchas

Common deployment and operational pitfalls, with fixes. Check here first when something does not behave as expected.

12.1. Channel Pool Exhaustion

min_pool = ceil(TPS × settlement_seconds × 1.5). At 23 TPS with 5s settlement: 173 channels minimum. Fix: oz-channels bootstrap --from <next> --to <new-total>.

12.2. SSL Cert Provisioning

Google needs DNS pointing at the LB IP before it issues the cert. With Cloudflare, turn proxy off first, wait for ACTIVE, then proxy back on. If the cert stays FAILED_NOT_VISIBLE for 30+ min, bump the cert name suffix in load-balancer.tf and re-apply (create_before_destroy swaps it without downtime).

12.3. VPC Connector CIDR Overlap

Each environment in the same VPC needs a different /28 CIDR range (e.g. 10.8.0.0/28 for stg, 10.9.0.0/28 for prod).
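A quick pre-flight check can catch an overlap before terraform apply does. The sketch below uses Python's standard ipaddress module; connector_cidrs_ok is a hypothetical helper, not part of the module or the CLIs.

```python
import ipaddress
from itertools import combinations
from typing import List


def connector_cidrs_ok(cidrs: List[str]) -> bool:
    """True if every range is a /28 and no two ranges overlap."""
    # ip_network raises ValueError for malformed input or set host bits.
    nets = [ipaddress.ip_network(c) for c in cidrs]
    if any(n.prefixlen != 28 for n in nets):
        return False
    return not any(a.overlaps(b) for a, b in combinations(nets, 2))
```

Passing the stg and prod ranges from this guide returns True; reusing the same range for both returns False.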

12.4. Private Service Access Shared Connection

A VPC can hold only one Private Service Access connection to servicenetworking.googleapis.com. If stg creates it first, prod's apply will fail unless update_on_creation_fail = true is set on the connection resource. The module handles this.

12.5. Pub/Sub Topic Prefix

PUBSUB_TOPIC_PREFIX must match what the image expects. Double-dash errors (relayer-mainnet-prod--) mean the prefix has a trailing dash the image doesn't expect. Adjust via container_environment if needed.

12.6. Encryption Key Format

storage_encryption_key must be base64-encoded 32 bytes (openssl rand -base64 32). A hex-encoded key fails with the misleading error "Invalid key length: expected 32 bytes, got 0", which does not mention the encoding.
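The format requirement can be checked locally before the value ever reaches Terraform. This is a minimal sketch of such a check, not the relayer's own validation; is_valid_storage_key is a hypothetical helper.

```python
import base64
import binascii


def is_valid_storage_key(value: str) -> bool:
    """True if value is strict base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32
```

Note that a 64-character hex string is itself made of valid base64 characters, so it decodes without error but to 48 bytes; the length check is what rejects it.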

12.7. Fee-Bump Tuning Under Congestion

MAX_FEE defaults to 1M stroops (0.1 XLM). Raise to 10M during network congestion. The plugin uses static fees with no automatic bumping on INSUFFICIENT_FEE.
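Fee settings are expressed in stroops, so a small converter helps when reasoning about budgets in XLM. One XLM is 10,000,000 stroops; the function names below are hypothetical.

```python
STROOPS_PER_XLM = 10_000_000


def stroops_to_xlm(stroops: int) -> float:
    """Convert a stroop amount to XLM."""
    return stroops / STROOPS_PER_XLM


def xlm_to_stroops(xlm: float) -> int:
    """Convert an XLM amount to the nearest whole number of stroops."""
    return round(xlm * STROOPS_PER_XLM)
```

The default MAX_FEE of 1,000,000 stroops is 0.1 XLM per transaction, and the congestion setting of 10,000,000 stroops is 1 XLM.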


13. Variables

Full variable reference for the Terraform module. Required variables must be set in your tfvars file; optional variables have defaults that the module adjusts automatically based on the environment value.

13.1. Required

| Name | Type | Description |
|---|---|---|
| project_id | string | GCP project ID |
| region | string | GCP region |
| environment | string | prod, stg, etc. (1–16 chars) |
| network | string | VPC network name or self_link |
| subnetwork | string | Subnet name or self_link |
| domain_name | string | FQDN for the service |
| container_image | string | Container image URI |
| relayer_api_key | string | Relayer API key (sensitive) |
| channels_admin_secret | string | Admin secret (sensitive) |

13.2. Optional: Core

| Name | Type | Default | Description |
|---|---|---|---|
| app_name | string | "relayer-channels" | Resource name prefix |
| name_suffix_environment | bool | true | Append -{env} to names (auto-off for prod) |
| labels | map(string) | {} | Labels for all resources |

13.3. Optional: Networking

| Name | Type | Default | Description |
|---|---|---|---|
| connector_machine_type | string | "e2-micro" | VPC connector machine type |
| connector_min_instances | number | 2 | Min connector instances |
| connector_max_instances | number | 3 | Max connector instances |
| connector_ip_cidr_range | string | "10.8.0.0/28" | CIDR for the VPC connector (/28, must not overlap) |

13.4. Optional: Container / Cloud Run

| Name | Type | Default | Description |
|---|---|---|---|
| container_port | number | 8080 | Listen port |
| cpu | string | "1" | CPU ("1", "2", "4") |
| memory | string | "2Gi" | Memory |
| min_instance_count | number | null | Auto: 2 prod, 1 other |
| max_instance_count | number | null | Auto: 10 prod, 4 other |
| cpu_always_allocated | bool | null | Auto: true prod |
| health_check_path | string | "/api/v1/health" | Probe path |
| container_environment | list(object) | [] | Additional env vars (user overrides win) |

13.5. Optional: Application

| Name | Type | Default | Description |
|---|---|---|---|
| stellar_network | string | "testnet" | mainnet or testnet |
| fund_relayer_id | string | "channels-fund" | Fund relayer ID |
| distributed_mode | bool | true | Enable distributed queue processing |
| queue_backend | string | "pubsub" | pubsub (recommended) or redis |
| log_level | string | "warn" | App log level |

13.6. Optional: Secrets

| Name | Type | Default | Description |
|---|---|---|---|
| webhook_signing_key | string | "" | Only set if using webhooks |
| storage_encryption_key | string | "" | Base64-encoded 32 bytes. Recommended for prod. |

13.7. Optional: Redis

| Name | Type | Default | Description |
|---|---|---|---|
| redis_tier | string | null | BASIC or STANDARD_HA (auto per env) |
| redis_memory_size_gb | number | null | Auto: 5 prod, 1 other |
| redis_version | string | "REDIS_7_2" | Redis version |

13.8. Optional: Cloudflare

| Name | Type | Default | Description |
|---|---|---|---|
| enable_cloudflare | bool | false | Enable Workers gateway |
| cloudflare_zone_id | string | "" | Required when Cloudflare is enabled |
| cloudflare_account_id | string | "" | Required when Cloudflare is enabled |
| relayer_static_api_key | string | "" | Static key injected by the Worker (sensitive) |
| key_salt | string | "" | Salt for hashing user keys in KV (sensitive) |
| gen_ip_rate_hour | number | 2 | Max /gen per IP per hour |
| relay_rpm_per_key | number | 60 | Max relay RPM per key |

13.9. Optional: Load Balancer

| Name | Type | Default | Description |
|---|---|---|---|
| lb_deletion_protection | bool | null | Auto: true prod |
| lb_log_sample_rate | number | 0 | Request log sampling (0 disables) |

See variables.tf for the full list including Cloud Functions and additional networking options.


14. Outputs

The module exposes these outputs for use in downstream Terraform modules or post-deployment scripts.

| Name | Description |
|---|---|
| cloud_run_service_name / cloud_run_service_uri | Service name and URL |
| load_balancer_ip | Static IP for DNS |
| redis_host / redis_port / redis_read_endpoint | Memorystore connection |
| pubsub_topics / pubsub_subscriptions | Queue resource names |
| kms_key_ring_name / kms_signing_key_name / kms_signing_key_id | Cloud KMS key info |
| artifact_registry_repository / artifact_registry_url | Artifact Registry info |
| secret_ids | Secret Manager IDs |
| cloudflare_worker_name | Worker name (null if disabled) |

15. Known Issues

Redis TLS disabled: the relayer binary doesn't support TLS for Redis connections. Memorystore is only reachable via Private Service Access (VPC peering), so traffic stays within Google's network.

Secrets as plain env vars: secrets are passed as Cloud Run env vars rather than Secret Manager secret_key_ref references. This works around a deployment issue; the plan is to switch to proper secret references.
