
GCP Cloud SQL Deep Dive

Cloud SQL is GCP's fully managed relational database service, supporting PostgreSQL, MySQL, and SQL Server. It handles patching, replication, backups, failover, and encryption while you focus on schema design and query optimization.

This guide covers Cloud SQL architecture from the storage engine internals through production-grade operational patterns.


1. Why Cloud SQL Exists: The Problem It Solves

The Database Operations Burden

Running a production database on bare metal or VMs requires:

  1. Hardware provisioning — Disks, RAID, memory sizing
  2. OS management — Kernel tuning, filesystem optimization
  3. Database installation — Binary installation, initialization
  4. High availability — Replication setup, failover automation
  5. Backup management — Scheduled backups, point-in-time recovery, backup testing
  6. Monitoring — Slow query logs, connection limits, disk utilization
  7. Security — SSL certificates, network isolation, audit logging
  8. Upgrades — Minor version patches, major version migrations
  9. Scaling — Vertical scaling (downtime), read replicas, connection pooling

Each of these can be a full-time job at scale. Cloud SQL automates most of them; capacity planning, connection pooling, and query tuning remain your responsibility.

Cloud SQL vs. Self-Managed vs. AlloyDB vs. Cloud Spanner

| Feature | Cloud SQL | Self-Managed (GCE) | AlloyDB | Cloud Spanner |
| --- | --- | --- | --- | --- |
| Management | Fully managed | You manage everything | Fully managed | Fully managed |
| Engine | PG, MySQL, SQL Server | Any | PostgreSQL-compatible | Proprietary (SQL) |
| Max storage | 64 TB | Unlimited | 64 TB | Unlimited |
| Max connections | ~4,000 (depends on tier) | Kernel limits | ~40,000 | Unlimited |
| HA | Regional (2 zones) | DIY | Regional | Multi-region |
| Read replicas | Up to 20 | DIY | Up to 20 | Built-in |
| Global distribution | No | DIY | No | Yes |
| ACID | Yes | Yes | Yes | Yes (external consistency) |
| Cost | $$ | $ (but ops cost) | $$$ | $$$$ |
| Best for | Standard OLTP | Full control needed | High-perf OLTP | Global OLTP |

2. Architecture Internals

Storage Architecture

Cloud SQL instances run on Compute Engine VMs with attached persistent disks.

Instance Tiers

| Tier | vCPUs | Memory | Max Storage | Max Connections (PG) | Use Case |
| --- | --- | --- | --- | --- | --- |
| db-f1-micro | Shared | 614 MB | 3 TB | ~25 | Development |
| db-g1-small | Shared | 1.7 GB | 3 TB | ~50 | Small staging |
| db-custom-1-3840 | 1 | 3.75 GB | 64 TB | ~100 | Small production |
| db-custom-2-7680 | 2 | 7.5 GB | 64 TB | ~200 | Medium production |
| db-custom-4-15360 | 4 | 15 GB | 64 TB | ~400 | Standard production |
| db-custom-8-30720 | 8 | 30 GB | 64 TB | ~800 | Large production |
| db-custom-16-61440 | 16 | 60 GB | 64 TB | ~1,600 | Heavy production |
| db-custom-32-122880 | 32 | 120 GB | 64 TB | ~3,200 | Enterprise |
| db-custom-96-368640 | 96 | 360 GB | 64 TB | ~4,000 | Maximum |

Connection Limits

PostgreSQL connection limits depend on memory:

$$\text{max\_connections} \approx \frac{\text{Memory (MB)}}{9.5}$$

Each connection consumes ~10MB of memory. For a 4-vCPU instance with 15GB RAM:

$$\text{max\_connections} \approx \frac{15360}{9.5} \approx 1{,}617$$

Cloud SQL caps this at a reasonable default (usually lower) to prevent OOM.

DANGER

Exhausting connections is one of the most common Cloud SQL production issues. Use connection pooling (PgBouncer or an application-level pool; the Cloud SQL Auth Proxy does not pool connections), and keep the combined pool size across all application instances below the instance's max_connections.
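The sizing rule can be sketched as plain arithmetic. This is a rough illustration that assumes the ~9.5 MB-per-connection estimate from this section; the function names and the `reservedForAdmin` margin are hypothetical, not part of any Cloud SQL API.

```typescript
// Rough upper bound on connections from instance memory (MB), using the
// ~9.5 MB-per-connection rule of thumb from this section.
function estimateMaxConnections(memoryMb: number): number {
  return Math.round(memoryMb / 9.5);
}

// Largest per-replica pool size that keeps the whole fleet under the limit.
// reservedForAdmin is an assumed safety margin (migrations, monitoring, psql).
function maxPoolSizePerReplica(
  connectionLimit: number,
  appReplicas: number,
  reservedForAdmin: number = 10,
): number {
  if (appReplicas <= 0) throw new Error("appReplicas must be positive");
  return Math.max(0, Math.floor((connectionLimit - reservedForAdmin) / appReplicas));
}

console.log(estimateMaxConnections(15360));  // 1617 for a 15 GB instance
console.log(maxPoolSizePerReplica(400, 12)); // 32 per replica with a 400 limit
```

The point of the second function is that the budget is shared: doubling the number of application replicas halves the pool size each one can safely use.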


3. High Availability

How HA Works

Cloud SQL HA uses regional instances with synchronous replication to a standby in a different zone.

Failover Characteristics

| Metric | Value | Notes |
| --- | --- | --- |
| Failover trigger | Health check failures | 3 consecutive failures |
| Failover time | 60-120 seconds | Depends on transaction volume |
| Data loss | Zero | Synchronous replication |
| IP change | No | VIP (Virtual IP) stays the same |
| Connection drop | Yes | Applications must reconnect |
| Cost | 2x instance cost | Standby is a full instance |

Handling Failover in Application Code

typescript
// db/resilient-connection.ts
import { Pool, PoolConfig } from 'pg';

function createResilientPool(config: PoolConfig): Pool {
  const pool = new Pool({
    ...config,
    // Connection settings for Cloud SQL HA
    max: 20,
    idleTimeoutMillis: 30000,
    connectionTimeoutMillis: 5000,
    // Important: allow connections to be recycled
    maxLifetimeSeconds: 1800, // 30 minutes (pg-pool expresses this option in seconds)
  });

  pool.on('error', (err) => {
    // Connection-level errors (connection dropped during failover)
    console.error('Pool connection error:', err.message);
    // Don't exit — the pool will create new connections
  });

  return pool;
}

// Retry wrapper for transient failures during failover.
// Only wrap idempotent operations: a retried write may run twice.
async function withRetry<T>(
  pool: Pool,
  operation: (client: import('pg').PoolClient) => Promise<T>,
  maxRetries: number = 3,
): Promise<T> {
  let lastError: unknown = new Error('withRetry: maxRetries must be at least 1');

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    let client: import('pg').PoolClient | undefined;
    try {
      // Acquire inside the try block: connect() itself fails while the
      // primary is unavailable, and that failure should be retried too
      client = await pool.connect();
      return await operation(client);
    } catch (error: any) {
      lastError = error;

      // Retry on connection errors (failover)
      const isTransient =
        error.code === 'ECONNRESET' ||
        error.code === 'ECONNREFUSED' ||
        error.code === '57P01' || // admin_shutdown
        error.code === '57P03' || // cannot_connect_now
        error.code === '08006' || // connection_failure
        error.code === '08003';   // connection_does_not_exist

      // Passing `true` destroys the client instead of returning a
      // possibly broken connection to the pool
      client?.release(isTransient);
      client = undefined;

      if (!isTransient || attempt === maxRetries - 1) {
        throw error;
      }

      console.warn(`Transient error (attempt ${attempt + 1}/${maxRetries}):`, error.message);
      // Exponential backoff
      await new Promise(resolve =>
        setTimeout(resolve, Math.min(1000 * Math.pow(2, attempt), 10000))
      );
    } finally {
      // Success path: hand the healthy client back to the pool
      client?.release();
    }
  }

  throw lastError;
}

// Usage
const pool = createResilientPool({
  host: '/cloudsql/project:region:instance', // Unix socket for Cloud SQL Proxy
  database: 'mydb',
  user: 'app',
  password: process.env.DB_PASSWORD,
});

const orders = await withRetry(pool, async (client) => {
  const result = await client.query(
    'SELECT * FROM orders WHERE customer_id = $1 ORDER BY created_at DESC LIMIT 10',
    [customerId]
  );
  return result.rows;
});
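One thing worth checking is whether the retry budget actually spans a failover. The sketch below reuses the backoff constants from the wrapper above (1 s base, 10 s cap); the helper names are illustrative. With the default of 3 attempts the wrapper sleeps for about 3 seconds in total, well short of the 60-120 second failover time listed in the table.

```typescript
// Delay before the retry that follows attempt `attempt` (0-based):
// exponential growth from baseMs, capped at capMs.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 10000): number {
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Total time spent sleeping across maxRetries attempts (no sleep after the last one).
function totalBackoffMs(maxRetries: number): number {
  let total = 0;
  for (let attempt = 0; attempt < maxRetries - 1; attempt++) {
    total += backoffDelayMs(attempt);
  }
  return total;
}

// Smallest number of attempts whose sleeps add up to at least windowMs.
function attemptsToCover(windowMs: number): number {
  let attempts = 1;
  while (totalBackoffMs(attempts) < windowMs) attempts++;
  return attempts;
}

console.log(totalBackoffMs(3));        // 3000 ms with the default maxRetries
console.log(attemptsToCover(120000));  // 16 attempts to span a 120 s failover
```

For request-serving paths it is usually more practical to fail fast and let the caller retry, and to reserve long retry budgets for background jobs.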

4. Cloud SQL Proxy

What It Is

The Cloud SQL Auth Proxy is a client-side proxy that handles:

  1. Authorization: uses IAM to authorize connections; with automatic IAM database authentication enabled, no database password is needed
  2. Encryption: TLS tunnel between the proxy and Cloud SQL
  3. Connection handling: forwards local connections to the instance; it does not pool them

Why Use It

| Without Proxy | With Proxy |
| --- | --- |
| Manage SSL certificates | Automatic TLS |
| Store and rotate passwords | IAM authentication |
| Whitelist IPs | No IP whitelisting needed |
| Configure connection strings | Connect to localhost |

Deployment Patterns

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-api
  namespace: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-api
  template:
    metadata:
      labels:
        app: order-api
    spec:
      serviceAccountName: order-api-ksa # Workload Identity
      containers:
        - name: order-api
          image: gcr.io/my-project/order-api:v1.0.0
          env:
            - name: DB_HOST
              value: "127.0.0.1"
            - name: DB_PORT
              value: "5432"
            - name: DB_NAME
              value: "orders"
            - name: DB_USER
              # IAM database user: service account email without ".gserviceaccount.com"
              value: "order-api@my-project.iam"
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"

        # Cloud SQL Proxy sidecar
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--structured-logs"
            - "--auto-iam-authn"  # Use IAM authentication
            - "--port=5432"
            - "my-project:us-central1:orders-db"
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
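On the application side, the container only needs to read the four `DB_*` variables set in the manifest. A minimal sketch of that step follows; `dbConfigFromEnv` and `DbConfig` are hypothetical names, and the function takes the environment as a parameter so it can be tested without a running proxy.

```typescript
interface DbConfig {
  host: string;
  port: number;
  database: string;
  user: string;
}

// Builds a connection config from the DB_* variables defined in the Deployment.
// No password field: with --auto-iam-authn the proxy supplies the IAM credential.
function dbConfigFromEnv(env: Record<string, string | undefined>): DbConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing environment variable ${name}`);
    return value;
  };

  const port = Number(required("DB_PORT"));
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error("DB_PORT must be a positive integer");
  }

  return {
    host: required("DB_HOST"),
    port,
    database: required("DB_NAME"),
    user: required("DB_USER"),
  };
}

const example = dbConfigFromEnv({
  DB_HOST: "127.0.0.1",
  DB_PORT: "5432",
  DB_NAME: "orders",
  DB_USER: "order-api",
});
console.log(example);
```

In the service itself the result would typically be passed straight to the `pg` pool constructor, with the process environment as the argument. Failing at startup on a missing variable is deliberate: it surfaces a misconfigured manifest before the first query does.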

Cloud Run (Built-in)

Cloud Run has built-in Cloud SQL connectivity — no proxy needed:

hcl
resource "google_cloud_run_v2_service" "api" {
  name     = "order-api"
  location = "us-central1"

  template {
    containers {
      image = "gcr.io/my-project/order-api:latest"

      env {
        name  = "DB_NAME"
        value = "orders"
      }
      env {
        name  = "DB_USER"
        value = "order-api"
      }
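      env {
        name  = "INSTANCE_CONNECTION_NAME"
        value = google_sql_database_instance.main.connection_name
      }

      # Mount the Cloud SQL volume so the socket appears under /cloudsql
      volume_mounts {
        name       = "cloudsql"
        mount_path = "/cloudsql"
      }
    }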

    # Built-in Cloud SQL connection
    volumes {
      name = "cloudsql"
      cloud_sql_instance {
        instances = [google_sql_database_instance.main.connection_name]
      }
    }
  }
}
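
typescript
// In Cloud Run, connect via Unix socket
import { Pool } from 'pg';

const pool = new Pool({
  host: `/cloudsql/${process.env.INSTANCE_CONNECTION_NAME}`,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  // The built-in socket secures the transport but does not log in for you.
  // DB_PASSWORD is assumed to be injected separately (e.g. from Secret Manager);
  // for an IAM database user, pass an OAuth2 access token as the password.
  password: process.env.DB_PASSWORD,
});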

5. Read Replicas

Architecture

Read Replica Characteristics

| Feature | Value |
| --- | --- |
| Max replicas | 20 per primary |
| Replication type | Asynchronous (eventual consistency) |
| Replication lag | Typically < 1 second (same region), 1-5s (cross-region) |
| Promotable | Yes (to standalone instance) |
| Cross-region | Yes |
| Cascading replicas | No |
| Independent scaling | Yes (can be different tier than primary) |

Replication Lag Formula

$$\text{Lag}_{\text{replica}} \approx \frac{\text{WAL Generation Rate}}{\text{Replay Bandwidth}} + \text{Network Latency}$$

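For a database generating 10MB/s of WAL, a replica that replays at 100MB/s, and 10ms of cross-region network latency, each second's worth of WAL (10MB) arrives and replays in:

$$\text{Lag} \approx \frac{10\,\text{MB}}{100\,\text{MB/s}} + 10\,\text{ms} = 100\,\text{ms} + 10\,\text{ms} = 110\,\text{ms}$$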
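The same estimate as a small function, useful for plugging in your own numbers. This is a back-of-the-envelope sketch of the formula above, not a measurement; the function names are illustrative. It also makes one consequence explicit: if WAL is generated faster than the replica can replay it, the lag never settles.

```typescript
// Steady-state lag estimate: time to replay one second's worth of WAL,
// plus the network latency between primary and replica.
function estimateReplicaLagMs(
  walMbPerSec: number,
  replayMbPerSec: number,
  networkLatencyMs: number,
): number {
  if (replayMbPerSec <= 0) throw new Error("replayMbPerSec must be positive");
  const replayMs = (walMbPerSec * 1000) / replayMbPerSec;
  return replayMs + networkLatencyMs;
}

// The estimate only holds while the replica replays faster than WAL is produced;
// otherwise the backlog (and the lag) grows without bound.
function replicaCanKeepUp(walMbPerSec: number, replayMbPerSec: number): boolean {
  return replayMbPerSec > walMbPerSec;
}

console.log(estimateReplicaLagMs(10, 100, 10)); // 110
console.log(replicaCanKeepUp(10, 100));         // true
```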

Read/Write Splitting

typescript
// db/read-write-split.ts
import { Pool } from 'pg';

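// Illustrative row and input shapes; adjust to the real orders schema
interface CreateOrderInput {
  customerId: string;
  total: number;
}

interface Order {
  id: string;
  customer_id: string;
  total: number;
  status: string;
  created_at: Date;
}

interface DatabasePools {
  writer: Pool;
  reader: Pool;
}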

function createDatabasePools(): DatabasePools {
  const writer = new Pool({
    host: process.env.DB_PRIMARY_HOST,
    database: process.env.DB_NAME,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    max: 10,
    ssl: { rejectUnauthorized: true },
  });

  const reader = new Pool({
    host: process.env.DB_REPLICA_HOST,
    database: process.env.DB_NAME,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    max: 20, // More read connections
    ssl: { rejectUnauthorized: true },
  });

  return { writer, reader };
}

class OrderRepository {
  constructor(private readonly pools: DatabasePools) {}

  // Writes go to primary
  async createOrder(order: CreateOrderInput): Promise<Order> {
    const result = await this.pools.writer.query(
      `INSERT INTO orders (customer_id, total, status)
       VALUES ($1, $2, 'pending')
       RETURNING *`,
      [order.customerId, order.total]
    );
    return result.rows[0];
  }

  // Reads can go to replica (acceptable staleness)
  async getRecentOrders(customerId: string): Promise<Order[]> {
    const result = await this.pools.reader.query(
      `SELECT * FROM orders
       WHERE customer_id = $1
       ORDER BY created_at DESC
       LIMIT 20`,
      [customerId]
    );
    return result.rows;
  }

  // Read-after-write must go to primary (consistency requirement)
  async getOrderById(orderId: string): Promise<Order | null> {
    const result = await this.pools.writer.query(
      'SELECT * FROM orders WHERE id = $1',
      [orderId]
    );
    return result.rows[0] ?? null;
  }
}

WARNING

Read replicas are eventually consistent. After a write to the primary, the replica may not have the data yet. For "read-your-own-writes" consistency, read from the primary immediately after writing. For non-critical reads (listing, search, analytics), replicas are ideal.
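One common way to apply this rule without hard-coding the target per method is to pin a user's reads to the primary for a short window after that user writes. The sketch below is an in-memory illustration; the class name, the 5-second default (chosen to cover the cross-region lag figure quoted earlier), and the injectable clock are all assumptions.

```typescript
type ReadTarget = "writer" | "reader";

// Routes reads to the primary for a short period after a write by the same key
// (for example a customer or session ID), and to the replica otherwise.
class ReadYourWritesRouter {
  private readonly lastWriteAt = new Map<string, number>();
  private readonly stickinessMs: number;
  private readonly now: () => number;

  constructor(stickinessMs: number = 5000, now: () => number = () => Date.now()) {
    this.stickinessMs = stickinessMs;
    this.now = now;
  }

  // Call after every successful write for this key.
  recordWrite(key: string): void {
    this.lastWriteAt.set(key, this.now());
  }

  // Decide which pool should serve the next read for this key.
  targetForRead(key: string): ReadTarget {
    const last = this.lastWriteAt.get(key);
    if (last === undefined) return "reader";
    if (this.now() - last < this.stickinessMs) return "writer";
    this.lastWriteAt.delete(key); // window elapsed; stop tracking this key
    return "reader";
  }
}

const router = new ReadYourWritesRouter();
router.recordWrite("customer-42");
console.log(router.targetForRead("customer-42")); // "writer" right after a write
console.log(router.targetForRead("customer-7"));  // "reader" for everyone else
```

Because the map lives in process memory, this only works when a user's requests land on the same application instance; with several replicas behind a load balancer the timestamp would need to travel in a cookie or a shared cache instead.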


6. Backup and Recovery

Automated Backups

Cloud SQL provides two backup types:

| Feature | Automated Backups | On-Demand Backups |
| --- | --- | --- |
| Frequency | Daily (configurable window) | Manual trigger |
| Retention | 1-365 days (default: 7) | Until deleted |
| Point-in-time recovery | Yes (with binary logging/WAL) | No |
| Cost | Included in instance cost | Storage cost only |
| Cross-region | Optional | No |

Point-in-Time Recovery (PITR)

PITR restores to any point within the retention window:

$$\text{Recovery Window} = [\text{Oldest Backup},\ \text{Now} - \text{Replication Lag}]$$
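Before starting a restore it helps to confirm that the target timestamp actually falls inside this window. A minimal sketch of that check, with hypothetical type and function names; the inputs are values you would read from the instance's backup list and monitoring, not something this code fetches.

```typescript
interface RecoveryWindow {
  oldestBackup: Date; // start of the retained window
  now: Date;          // current time
  lagMs: number;      // how far the recoverable point trails "now"
}

// True when `target` lies within [oldestBackup, now - lag], inclusive.
function isRecoverable(target: Date, window: RecoveryWindow): boolean {
  const earliest = window.oldestBackup.getTime();
  const latest = window.now.getTime() - window.lagMs;
  const t = target.getTime();
  return t >= earliest && t <= latest;
}

const recoveryWindow: RecoveryWindow = {
  oldestBackup: new Date("2024-01-01T02:00:00Z"),
  now: new Date("2024-01-08T12:00:00Z"),
  lagMs: 60_000,
};

console.log(isRecoverable(new Date("2024-01-05T09:30:00Z"), recoveryWindow)); // true
console.log(isRecoverable(new Date("2023-12-31T23:00:00Z"), recoveryWindow)); // false: before the oldest backup
```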

Terraform Backup Configuration

hcl
resource "google_sql_database_instance" "main" {
  name             = "orders-db"
  database_version = "POSTGRES_16"
  region           = "us-central1"
  project          = var.project_id

  settings {
    tier              = "db-custom-4-15360" # 4 vCPU, 15GB RAM
    availability_type = "REGIONAL"           # HA

    disk_size       = 100
    disk_type       = "PD_SSD"
    disk_autoresize = true
    disk_autoresize_limit = 500 # Max 500 GB

    backup_configuration {
      enabled                        = true
      start_time                     = "02:00" # 2 AM UTC
      point_in_time_recovery_enabled = true
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }

    maintenance_window {
      day          = 7  # Sunday
      hour         = 4  # 4 AM UTC
      update_track = "stable"
    }

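    ip_configuration {
      ipv4_enabled    = false           # No public IP
      private_network = google_compute_network.main.id
      # ssl_mode supersedes the legacy require_ssl flag; the two should not be
      # combined, since require_ssl = true conflicts with ENCRYPTED_ONLY
      ssl_mode        = "ENCRYPTED_ONLY"
    }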

    database_flags {
      name  = "log_min_duration_statement"
      value = "1000" # Log queries > 1 second
    }
    database_flags {
      name  = "max_connections"
      value = "400"
    }
    database_flags {
      name  = "shared_buffers"
      value = "491520" # 25% of 15 GB (3,840 MB) expressed in 8KB pages
    }
    database_flags {
      name  = "work_mem"
      value = "16384" # 16MB
    }

    insights_config {
      query_insights_enabled  = true
      query_string_length     = 4096
      record_application_tags = true
      record_client_address   = true
      query_plans_per_minute  = 5
    }

    user_labels = {
      environment = "production"
      team        = "platform"
      service     = "orders"
    }
  }

  deletion_protection = true
}

War Story

A SaaS company ran a migration that accidentally deleted 2 months of customer data from a table. They had Cloud SQL automated backups with PITR enabled. Recovery steps:

  1. Identified the exact timestamp of the DELETE statement from query logs
  2. Created a PITR clone of the database to 1 second before the DELETE (took ~15 minutes for their 200GB database)
  3. Extracted the deleted data from the clone using pg_dump with table filter
  4. Restored the data to production using COPY
  5. Total recovery time: 45 minutes, zero data loss

Without PITR, they would have lost everything since the last daily backup (up to 24 hours of data).


7. Performance Optimization

Query Insights

Cloud SQL Query Insights provides:

  • Top queries by CPU, latency, and I/O
  • Query plan visualization
  • Query tagging for application-level attribution
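
typescript
// Tag queries for Cloud SQL Insights attribution
import { Pool } from 'pg';

// Query Insights reads application tags from a sqlcommenter-style comment
// appended to the statement, e.g. /*action='listActive',route='%2Fapi%2Forders'*/
function toSqlComment(tags: Record<string, string>): string {
  const pairs = Object.keys(tags)
    .sort()
    .map((key) => `${key}='${encodeURIComponent(tags[key]).replace(/'/g, "\\'")}'`);
  return `/*${pairs.join(',')}*/`;
}

async function queryWithTags(
  pool: Pool,
  sql: string,
  params: any[],
  tags: { route: string; action: string },
): Promise<any[]> {
  // Tags travel with the statement itself, so nothing is interpolated into a
  // SET command and no session state leaks to the next user of the connection
  const result = await pool.query(`${sql} ${toSqlComment(tags)}`, params);
  return result.rows;
}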

// Usage
const orders = await queryWithTags(
  pool,
  'SELECT * FROM orders WHERE customer_id = $1 AND status = $2',
  [customerId, 'active'],
  { route: '/api/orders', action: 'listActive' }
);

Key PostgreSQL Tuning Parameters

| Parameter | Default | Recommended | Why |
| --- | --- | --- | --- |
| shared_buffers | 128MB | 25% of RAM | Buffer pool for caching data |
| effective_cache_size | 4GB | 75% of RAM | Planner hint for available cache |
| work_mem | 4MB | 16-64MB | Sort/hash operations per query |
| maintenance_work_mem | 64MB | 512MB-1GB | VACUUM, CREATE INDEX |
| random_page_cost | 4.0 | 1.1 (SSD) | Cost model for SSD storage |
| effective_io_concurrency | 1 | 200 (SSD) | Async I/O for SSD |
| max_parallel_workers_per_gather | 2 | 4 | Parallel query workers |
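The percentage-based recommendations have to be converted into the units the flags expect. PostgreSQL measures `shared_buffers` and `effective_cache_size` in 8KB pages when no unit is given, which is easy to get wrong by a factor of eight. A sketch of the conversion follows; the function names are illustrative, and the result should be treated as a starting point because Cloud SQL restricts the allowed range of some flags.

```typescript
const PAGE_SIZE_KB = 8;

// Converts megabytes to a count of 8KB pages.
function mbToPages(mb: number): number {
  return Math.floor((mb * 1024) / PAGE_SIZE_KB);
}

// Applies the 25% / 75% rules of thumb from the table to a given RAM size.
function recommendMemoryFlags(ramMb: number): { shared_buffers: number; effective_cache_size: number } {
  return {
    shared_buffers: mbToPages(ramMb * 0.25),
    effective_cache_size: mbToPages(ramMb * 0.75),
  };
}

// 15 GB instance (db-custom-4-15360)
console.log(recommendMemoryFlags(15360));
// { shared_buffers: 491520, effective_cache_size: 1474560 }
```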

Connection Pooling with PgBouncer

For applications that need more connections than Cloud SQL supports:

yaml
# PgBouncer deployment on GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
  namespace: database
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: pgbouncer
          image: edoburu/pgbouncer:1.22.0
          ports:
            - containerPort: 5432
          env:
            # Placeholder credentials; load the URL from a Secret in real deployments
            - name: DATABASE_URL
              value: "postgresql://app:password@127.0.0.1:5433/orders"
            - name: POOL_MODE
              value: "transaction"
            - name: DEFAULT_POOL_SIZE
              value: "50"
            - name: MAX_CLIENT_CONN
              value: "1000"
            - name: SERVER_IDLE_TIMEOUT
              value: "120"
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"

        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0
          args:
            - "--port=5433"
            - "my-project:us-central1:orders-db"

8. Security

Authentication Methods

| Method | Use Case | Recommendation |
| --- | --- | --- |
| Built-in users (password) | Legacy applications | Rotate regularly |
| IAM database authentication | GCP workloads | Preferred for GKE/Cloud Run |
| Cloud SQL Proxy + IAM | Any environment | Best security posture |
| SSL/TLS only | All connections | Always enforce |

IAM Database Authentication

hcl
# Create IAM database user
resource "google_sql_user" "iam_user" {
  name     = "order-api@my-project.iam"
  instance = google_sql_database_instance.main.name
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}

# Grant the service account login permission
resource "google_project_iam_member" "cloudsql_client" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.order_api.email}"
}

resource "google_project_iam_member" "cloudsql_instance_user" {
  project = var.project_id
  role    = "roles/cloudsql.instanceUser"
  member  = "serviceAccount:${google_service_account.order_api.email}"
}
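On PostgreSQL instances the database username for a service account is its email with the `.gserviceaccount.com` suffix removed, which is why the `google_sql_user` above is named `order-api@my-project.iam`. A small helper for deriving it in application code is sketched below; the function name is hypothetical.

```typescript
// Derives the PostgreSQL IAM database username from a service account email.
function iamDbUserForServiceAccount(email: string): string {
  const suffix = ".gserviceaccount.com";
  if (!email.endsWith(suffix)) {
    throw new Error(`Not a service account email: ${email}`);
  }
  return email.slice(0, -suffix.length);
}

console.log(iamDbUserForServiceAccount("order-api@my-project.iam.gserviceaccount.com"));
// "order-api@my-project.iam"
```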

Encryption

| Layer | Mechanism | Key Management |
| --- | --- | --- |
| Data at rest | AES-256 | Google-managed (default) or CMEK |
| Data in transit | TLS 1.2+ | Managed certificates |
| Backups | AES-256 | Same as instance |
| WAL logs | AES-256 | Same as instance |

9. Migration Strategies

Migrating to Cloud SQL

| Source | Method | Downtime |
| --- | --- | --- |
| Self-managed PostgreSQL | Database Migration Service (DMS) | Minutes (with CDC) |
| Amazon RDS | DMS or pg_dump/pg_restore | Depends on size |
| Azure Database | DMS or pg_dump/pg_restore | Depends on size |
| Another Cloud SQL instance | DMS or pg_dump/pg_restore | Minutes |

Zero-Downtime Migration Pattern


10. Cost Model

Pricing Components

| Component | Cost (us-central1) | Notes |
| --- | --- | --- |
| vCPU | $0.0413/hr | Per vCPU |
| Memory | $0.0070/hr/GB | Per GB |
| Storage (SSD) | $0.170/GB/month | Auto-growing |
| Storage (HDD) | $0.090/GB/month | Alternative |
| HA standby | 2x instance cost | Full replica |
| Backups | $0.080/GB/month | Above free tier |
| Network egress | Standard GCP rates | Cross-region |
| Read replicas | Same as instance | Full instance cost |

Cost Example

A production HA instance with 4 vCPU, 15GB RAM, 200GB SSD:

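$$
\begin{aligned}
\text{Primary} &= (4 \times \$0.0413 + 15 \times \$0.0070) \times 730 = \$197.25 \\
\text{HA Standby} &= \$197.25 \\
\text{Storage} &= 200 \times \$0.170 = \$34.00 \\
\text{Backups (50GB over free)} &= 50 \times \$0.080 = \$4.00 \\
\text{Total} &= \$432.50/\text{month}
\end{aligned}
$$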
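The same calculation as a reusable function, so other instance shapes can be priced quickly. This is a sketch that hard-codes the us-central1 rates from the pricing table and mirrors the example's simplifications (730 hours per month, storage counted once, compute doubled for HA); the names are illustrative and current list prices should be checked before relying on the result.

```typescript
const HOURS_PER_MONTH = 730;

// Rates copied from the pricing table in this section (us-central1).
const RATES = {
  vcpuPerHour: 0.0413,
  memoryGbPerHour: 0.007,
  ssdGbPerMonth: 0.17,
  backupGbPerMonth: 0.08,
};

interface InstanceSpec {
  vcpus: number;
  memoryGb: number;
  ssdGb: number;
  billableBackupGb: number; // backup storage above the free allowance
  ha: boolean;              // HA doubles the compute cost
}

function roundCents(amount: number): number {
  return Math.round(amount * 100) / 100;
}

function monthlyCostUsd(spec: InstanceSpec): number {
  const computePerInstance = roundCents(
    (spec.vcpus * RATES.vcpuPerHour + spec.memoryGb * RATES.memoryGbPerHour) * HOURS_PER_MONTH,
  );
  const compute = computePerInstance * (spec.ha ? 2 : 1);
  const storage = spec.ssdGb * RATES.ssdGbPerMonth;
  const backups = spec.billableBackupGb * RATES.backupGbPerMonth;
  return roundCents(compute + storage + backups);
}

console.log(monthlyCostUsd({ vcpus: 4, memoryGb: 15, ssdGb: 200, billableBackupGb: 50, ha: true }));
```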

11. Edge Cases and Failure Modes

| Issue | Cause | Mitigation |
| --- | --- | --- |
| Connection exhaustion | Too many application connections | PgBouncer, reduce pool size |
| Storage full | Disk autoresize disabled or limit reached | Enable autoresize, monitor |
| Replication lag spike | Heavy write load | Scale replica, reduce write volume |
| Maintenance window disruption | Auto-maintenance during peak | Set window to off-peak hours |
| Slow failover | Large uncommitted transactions | Keep transactions short |
| SSL certificate rotation | Annual cert rotation | Use Cloud SQL Proxy (handles automatically) |

12. Decision Framework

| Choose Cloud SQL When | Choose AlloyDB When | Choose Cloud Spanner When |
| --- | --- | --- |
| Standard OLTP workloads | High-performance OLTP | Global distribution needed |
| < 4,000 connections | > 4,000 connections | Unlimited horizontal scaling |
| Cost-sensitive | Performance-critical | Strong consistency across regions |
| Standard PostgreSQL features | PostgreSQL + column engine | 99.999% availability SLA required |
| < 10TB data | < 64TB data | Unlimited data size |
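The quantitative rows of this table can be read as a simple first-pass filter. The sketch below encodes only those thresholds (connections, data size, global distribution); the names are illustrative, and qualitative factors such as cost sensitivity or feature needs still require judgment.

```typescript
type DatabaseChoice = "Cloud SQL" | "AlloyDB" | "Cloud Spanner";

interface Requirements {
  globalDistribution: boolean; // multi-region writes needed
  peakConnections: number;
  dataTb: number;
}

// First-pass suggestion based on the thresholds in the decision table.
function suggestDatabase(req: Requirements): DatabaseChoice {
  if (req.globalDistribution || req.dataTb >= 64) return "Cloud Spanner";
  if (req.peakConnections > 4000 || req.dataTb >= 10) return "AlloyDB";
  return "Cloud SQL";
}

console.log(suggestDatabase({ globalDistribution: false, peakConnections: 800, dataTb: 2 }));   // "Cloud SQL"
console.log(suggestDatabase({ globalDistribution: false, peakConnections: 6000, dataTb: 2 }));  // "AlloyDB"
console.log(suggestDatabase({ globalDistribution: true, peakConnections: 800, dataTb: 2 }));    // "Cloud Spanner"
```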
