A Survey of Cloud-native Machine Learning Platforms: SageMaker, Vertex AI, and Azure ML

Cloud-native machine learning platforms such as SageMaker, Vertex AI, and Azure ML are about doing data science at serious scale without drowning in plumbing. Cloud platforms and managed ML services let small teams ship capabilities that only the largest companies could build in-house a decade ago.

Why Cloud-native ML Platforms Matter

Managed platforms let teams trade toil for velocity. Skill with one of the major clouds is effectively mandatory for senior data and ML roles today.

  • Use managed services until the boundary becomes painful.
  • Cost-model every long-running job before you scale it.
  • Isolate environments — dev, staging and prod are not negotiable.
  • Design for multi-region failure from day one for critical workloads.
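The cost-modelling bullet above can be sketched as a back-of-the-envelope calculator. The hourly rates below are illustrative assumptions, not current prices; always check your provider's pricing page before committing to a schedule.

```python
# Back-of-the-envelope cost model for a recurring long-running job.
# ASSUMPTION: the hourly rates here are made up for illustration only.

HOURLY_RATE_USD = {
    "gpu-training": 3.06,   # assumed rate for a single-GPU training instance
    "cpu-batch": 0.38,      # assumed rate for a general-purpose instance
}

def monthly_cost(instance_type: str, instances: int, hours_per_day: float) -> float:
    """Estimate the monthly (30-day) cost of a recurring job in USD."""
    rate = HOURLY_RATE_USD[instance_type]
    return round(rate * instances * hours_per_day * 30, 2)

# A nightly 4-hour batch job on two CPU instances:
print(monthly_cost("cpu-batch", instances=2, hours_per_day=4))      # 91.2
# An 8-hour/day GPU training loop on one instance:
print(monthly_cost("gpu-training", instances=1, hours_per_day=8))   # 734.4
```

Running this before launching the job makes the "cost-model every long-running job" rule concrete: a number you can defend in a planning meeting beats a surprise on the invoice.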

How Cloud-native ML Platforms Show Up in Practice

In a typical project, cloud-native ML platforms are combined with the rest of the Cloud & Platforms toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.

These platforms are the default deployment target for modern data and ML systems, from start-ups to regulated enterprises.


Code Examples (5 runnable snippets)

Copy any block into a file or notebook and run it end-to-end — each example stands alone.

Example 1: BigQuery aggregation with cost awareness

# Example 1: BigQuery aggregation with cost awareness
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

query = """
SELECT
    DATE(event_time)           AS day,
    COUNTIF(event = 'signup')  AS signups,
    COUNTIF(event = 'purchase') AS purchases,
    SAFE_DIVIDE(COUNTIF(event = 'purchase'),
                COUNTIF(event = 'signup'))  AS conversion
FROM `my_project.analytics.events`
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day;
"""

# Cap spend: the job fails fast if it would scan more than 10 GiB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)
job = client.query(query, job_config=job_config)
df  = job.to_dataframe()
print(df.head(10))
print(f"bytes scanned: {job.total_bytes_processed/1e9:.2f} GB")
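To turn the bytes-scanned figure above into a dollar estimate, a tiny helper like the following works. The on-demand rate used here is an assumption for illustration; pricing varies by region and changes over time, so check the current rate before relying on it.

```python
# Convert BigQuery bytes scanned into an estimated on-demand query cost.
# ASSUMPTION: an illustrative on-demand rate of $6.25 per TiB scanned;
# verify your region's current pricing before using this number.

ON_DEMAND_USD_PER_TIB = 6.25

def estimated_query_cost(bytes_processed: int) -> float:
    """Estimated USD cost of a query, given job.total_bytes_processed."""
    tib = bytes_processed / 2**40
    return round(tib * ON_DEMAND_USD_PER_TIB, 4)

print(estimated_query_cost(150 * 10**9))  # cost of a 150 GB scan
```

Pairing this with `maximum_bytes_billed` gives you both a forecast and a hard ceiling, which is usually enough cost governance for ad-hoc analytics.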

Example 2: Azure blob download with managed identity

# Example 2: Azure blob download with managed identity
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://mydatalake.blob.core.windows.net"
credential  = DefaultAzureCredential()
service     = BlobServiceClient(account_url, credential=credential)

container = service.get_container_client("raw-events")
for blob in container.list_blobs(name_starts_with="2026/04/"):
    print(blob.name, blob.size)
    client = container.get_blob_client(blob.name)
    with open(f"/tmp/{blob.name.rsplit('/', 1)[-1]}", "wb") as f:
        f.write(client.download_blob().readall())

Example 3: Kubernetes job manifest for a batch-scoring run

apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-scoring
  labels:
    app: risk-scorer
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: OnFailure
      serviceAccountName: risk-scorer-sa
      containers:
        - name: scorer
          image: ghcr.io/example/risk-scorer:1.14.0
          # Kubernetes does not pass args through a shell, so "$(date +%F)"
          # would reach the container literally; run the entrypoint via a
          # shell so the date is evaluated at run time. Replace `risk-scorer`
          # with the image's actual entrypoint binary.
          command: ["/bin/sh", "-c"]
          args: ["exec risk-scorer --date \"$(date +%F)\" --output s3://ml-outputs/"]
          resources:
            requests: { cpu: "1",   memory: "2Gi" }
            limits:   { cpu: "4",   memory: "8Gi" }
          env:
            - name: MODEL_URI
              value: "s3://ml-registry/risk/v3.2.1/model.joblib"
            - name: LOG_LEVEL
              value: "INFO"

Example 4: Terraform module for a managed Postgres database

terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.50" }
  }
}

resource "aws_db_subnet_group" "analytics" {
  name       = "analytics-db-subnets"
  subnet_ids = var.private_subnet_ids
}

resource "aws_db_instance" "analytics" {
  identifier                   = "analytics-warehouse"
  engine                       = "postgres"
  engine_version               = "16.2"
  instance_class               = "db.m6g.large"
  allocated_storage            = 100
  storage_type                 = "gp3"
  storage_encrypted            = true
  db_name                      = "analytics"
  username                     = var.db_username
  password                     = var.db_password
  db_subnet_group_name         = aws_db_subnet_group.analytics.name
  vpc_security_group_ids       = [aws_security_group.db.id]
  backup_retention_period      = 14
  deletion_protection          = true
  performance_insights_enabled = true
  tags = { Environment = "prod", Team = "data" }
}

output "db_endpoint" { value = aws_db_instance.analytics.endpoint }

Example 5: S3 upload with retries and listing

# Example 5: S3 upload with retries and listing
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)

bucket = "my-datalake-staging"
prefix = "exports/2026/"

s3.upload_file(
    "features.parquet", bucket, prefix + "features.parquet",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)

total = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        total += obj["Size"]
        print(obj["Key"], obj["Size"])
print(f"total bytes in {prefix}: {total:,}")