Securing ML Model Serving Infrastructure

Model serving infrastructure is a growing attack surface that most security teams overlook. From model poisoning to inference API abuse, here are the risks and how to address them.

Alex
Security Researcher
6 min read

Machine learning models are moving from research notebooks to production serving infrastructure at an accelerating pace. TensorFlow Serving, TorchServe, Triton Inference Server, Seldon Core, BentoML -- the model serving ecosystem has matured rapidly. But the security posture of most model serving deployments has not kept up.

The typical model serving setup involves a containerized inference server pulling model artifacts from a registry, exposing an API endpoint, and processing prediction requests at scale. Every component in this chain introduces security risks that most organizations have not assessed.

The Attack Surface

Model Artifacts

Model files are code. A TensorFlow SavedModel can contain arbitrary Python functions via tf.py_function. A PyTorch model loaded with pickle can execute arbitrary code during deserialization. ONNX models can include custom operators that run native code.
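
To make the pickle case concrete, here is a minimal, deliberately harmless sketch (the class name is invented for illustration) showing that unpickling invokes whatever callable the artifact specifies:

    import pickle

    class NotReallyAModel:
        # __reduce__ tells pickle how to rebuild this object; it may return
        # ANY callable plus arguments, and pickle invokes it during loading.
        def __reduce__(self):
            return (print, ("arbitrary code ran while loading this 'model'",))

    payload = pickle.dumps(NotReallyAModel())

    # Anything that unpickles this artifact executes the embedded call.
    pickle.loads(payload)  # a real attack would call something far worse than print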

This means that anyone who can modify a model artifact can achieve code execution on the inference server. The supply chain for model artifacts is usually far less controlled than the supply chain for application code:

  • Models are often stored in general-purpose object storage (S3, GCS) without integrity verification
  • Model registries like MLflow and Weights & Biases may not enforce signing or provenance verification
  • Models downloaded from public hubs (Hugging Face, TensorFlow Hub) are typically trusted implicitly
  • Model files are large binaries that are difficult to audit or scan

Inference APIs

Model serving APIs accept structured input data and return predictions. These APIs are vulnerable to several attack classes:

Input manipulation. Adversarial inputs designed to cause misclassification or model misbehavior. While this is primarily an ML security concern, it has infrastructure implications when adversarial inputs cause crashes, excessive resource consumption, or trigger unexpected code paths.

Denial of service. Inference requests consume significant compute resources, especially for large models. An attacker who can submit unlimited inference requests can exhaust GPU memory, CPU, or network bandwidth. Most model serving frameworks do not include sophisticated rate limiting.

Data exfiltration via inference. Through carefully crafted queries, attackers can extract information about the training data (membership inference) or reconstruct the model itself (model stealing). These attacks use the serving API as their primary channel.

Serving Framework Dependencies

Model serving frameworks have deep dependency trees that include ML libraries, HTTP servers, serialization utilities, and cloud SDKs. TorchServe, for example, depends on PyTorch, a Java runtime for its frontend, and an embedded HTTP server. Triton includes TensorRT, cuDNN, and protocol buffer libraries.

These dependencies are updated less frequently than typical web application dependencies because ML teams prioritize model compatibility over dependency freshness. A model serving deployment that was tested and validated with a specific set of library versions often runs unchanged for months, accumulating known vulnerabilities.

Container Images

Model serving containers are typically larger and more complex than application containers. A Triton Inference Server image includes CUDA libraries, ML frameworks, and system libraries totaling several gigabytes. The attack surface of these images is correspondingly large.

Many organizations build custom serving containers by layering their models and dependencies on top of base images provided by the serving framework. These base images may not be updated on the same cadence as the organization's other container images.

Securing the Model Supply Chain

Model Integrity Verification

Implement signing and verification for model artifacts:

  1. Generate a cryptographic hash of every model artifact when it is trained and validated
  2. Store the hash in a tamper-evident log alongside the model metadata
  3. Verify the hash when loading the model into the serving infrastructure
  4. Reject any model that fails verification
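
A minimal sketch of steps 1, 3, and 4 in Python, assuming the expected digest was recorded at training time and is available to the serving process (the paths and function names are placeholders, not tied to any particular framework):

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        # Stream the file so multi-gigabyte model artifacts never sit fully in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def load_model_verified(artifact_path: str, expected_digest: str) -> bytes:
        actual = sha256_of(artifact_path)                # step 3: verify before deserializing
        if actual != expected_digest:
            raise RuntimeError(                          # step 4: reject on any mismatch
                f"model {artifact_path} failed integrity check: "
                f"expected {expected_digest}, got {actual}"
            )
        with open(artifact_path, "rb") as f:
            return f.read()                              # hand the verified bytes to the real loader

    # The expected digest comes from the tamper-evident log written at training time, e.g.:
    # load_model_verified("models/fraud-v3.onnx", "9f86d081884c7d65...")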

Some organizations go further by signing models with GPG or Sigstore, tying model provenance to the training pipeline that produced them. This creates an auditable chain from training data through the training code to the deployed model.

Model Registry Access Control

Treat model registries with the same access control rigor as source code repositories:

  • Restrict who can publish models to the registry
  • Require review and approval before models are promoted to production
  • Maintain an audit log of all model uploads, modifications, and deployments
  • Separate development, staging, and production model registries

Safe Model Loading

Wherever possible, use model formats that do not allow arbitrary code execution. ONNX without custom operators, TensorFlow Lite, and Apple's CoreML format are safer than pickle-based PyTorch models or TensorFlow SavedModels with custom Python functions.

When pickle-based formats are unavoidable (as they often are with PyTorch), use a restricted unpickler that permits only an explicit allowlist of classes. The fickling library can analyze pickle files for suspicious code before loading.
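
One way to do this is the restricted-unpickler pattern from the Python standard library documentation; the allowlist below is purely illustrative and would need to match whatever classes your checkpoints actually reference. Recent PyTorch releases also support torch.load(path, weights_only=True), which limits loading to tensors and basic container types.

    import io
    import pickle

    # Only these (module, class) pairs may be constructed; everything else is rejected.
    ALLOWED = {
        ("collections", "OrderedDict"),
        ("torch._utils", "_rebuild_tensor_v2"),   # illustrative entry for PyTorch checkpoints
    }

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            if (module, name) in ALLOWED:
                return super().find_class(module, name)
            raise pickle.UnpicklingError(f"blocked unpickling of {module}.{name}")

    def restricted_loads(data: bytes):
        return RestrictedUnpickler(io.BytesIO(data)).load()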

Securing the Serving Infrastructure

Network Segmentation

Model serving infrastructure should be segmented from general application infrastructure. Inference APIs that are only consumed by internal services should not be exposed to the internet. Even for public-facing inference APIs, the model loading and management interfaces should be on a separate network segment.

Resource Limits

Configure resource limits for inference requests:

  • Request size limits to prevent oversized inputs that consume excessive memory
  • Timeout limits to prevent requests that trigger long-running computation
  • Concurrency limits to prevent resource exhaustion from parallel requests
  • GPU memory limits per model to prevent a single model from monopolizing GPU resources
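
Where the serving framework does not provide these controls, some can be approximated at the application layer. A sketch using FastAPI (the endpoint, limit values, and inference stub are illustrative) that enforces a request size limit, a concurrency cap, and a per-request timeout:

    import asyncio

    from fastapi import FastAPI, HTTPException, Request

    app = FastAPI()

    MAX_BODY_BYTES = 1 * 1024 * 1024          # reject oversized inputs
    MAX_CONCURRENT_REQUESTS = 8               # cap parallel inference work
    INFERENCE_TIMEOUT_SECONDS = 2.0           # bound long-running computation

    inference_slots = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    async def run_model(payload: bytes) -> dict:
        await asyncio.sleep(0.01)             # stand-in for the real inference call
        return {"prediction": "..."}

    @app.post("/v1/predict")
    async def predict(request: Request):
        body = await request.body()
        if len(body) > MAX_BODY_BYTES:
            raise HTTPException(status_code=413, detail="request body too large")

        async with inference_slots:           # concurrency limit
            try:
                return await asyncio.wait_for(
                    run_model(body), timeout=INFERENCE_TIMEOUT_SECONDS
                )
            except asyncio.TimeoutError:
                raise HTTPException(status_code=504, detail="inference timed out")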

Authentication and Authorization

Every inference API should require authentication, even for internal services. Implement:

  • API key or token-based authentication for all inference endpoints
  • Model-level authorization (not all clients should be able to query all models)
  • Rate limiting per client to prevent abuse and model stealing attacks
  • Request logging with client identification for audit and forensic purposes
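
A sketch of the first three items as a single request check that also returns the caller identity for request logging. It assumes a hard-coded client-to-model permission map and a fixed-window rate counter; a production deployment would keep keys in a secrets manager and counters in a shared store such as Redis:

    import time
    from collections import defaultdict

    # Hypothetical client registry: API key -> caller identity and the models it may query.
    CLIENTS = {
        "key-abc123": {"client": "checkout-service", "models": {"fraud-v3"}},
        "key-def456": {"client": "analytics-batch", "models": {"churn-v1", "fraud-v3"}},
    }

    RATE_LIMIT_PER_MINUTE = 600
    _request_counts = defaultdict(int)        # (client, minute window) -> request count

    def authorize_request(api_key: str, model_name: str) -> str:
        client = CLIENTS.get(api_key)
        if client is None:
            raise PermissionError("unknown API key")                           # authentication
        if model_name not in client["models"]:
            raise PermissionError(f"client not authorized for {model_name}")   # model-level authorization

        window = int(time.time() // 60)                                        # fixed one-minute window
        _request_counts[(client["client"], window)] += 1
        if _request_counts[(client["client"], window)] > RATE_LIMIT_PER_MINUTE:
            raise PermissionError("rate limit exceeded")                        # throttles abuse and model stealing

        return client["client"]               # identity to attach to the request log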

Container Hardening

Apply standard container hardening practices to model serving containers:

  • Run inference processes as non-root users
  • Use read-only file systems where possible
  • Drop unnecessary Linux capabilities
  • Scan container images for known vulnerabilities on a regular schedule
  • Use distroless or minimal base images when the serving framework supports them

Monitoring and Detection

Anomaly Detection for Inference Patterns

Monitor inference request patterns for anomalies that may indicate attack:

  • Sudden changes in request volume or distribution
  • Unusual input patterns that may indicate adversarial probing
  • Requests from unexpected client IPs or identities
  • Model loading events outside of scheduled deployment windows
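
Even a simple statistical baseline can surface the first item on this list. A sketch that flags a minute whose request count deviates sharply from a rolling window (the window size and threshold are illustrative):

    from collections import deque
    from statistics import mean, stdev

    class VolumeAnomalyDetector:
        """Flags per-minute request counts that deviate sharply from the recent baseline."""

        def __init__(self, window_minutes: int = 60, z_threshold: float = 4.0):
            self.history = deque(maxlen=window_minutes)
            self.z_threshold = z_threshold

        def observe(self, requests_this_minute: int) -> bool:
            anomalous = False
            if len(self.history) >= 10:                  # wait for enough baseline data
                mu = mean(self.history)
                sigma = stdev(self.history) or 1.0       # avoid division by zero on flat traffic
                z = abs(requests_this_minute - mu) / sigma
                anomalous = z > self.z_threshold
            self.history.append(requests_this_minute)
            return anomalous

    # detector = VolumeAnomalyDetector()
    # if detector.observe(current_minute_count):
    #     trigger_alert("inference request volume anomaly")   # hypothetical alerting hook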

Dependency Monitoring

Track the dependencies of your model serving infrastructure with the same rigor as application dependencies. Generate SBOMs for serving containers and monitor them for newly disclosed vulnerabilities.

This is particularly important because ML dependencies are updated less frequently, meaning known vulnerabilities persist longer in serving environments.

How Safeguard.sh Helps

Safeguard.sh extends supply chain security to ML model serving infrastructure by scanning serving container images, generating SBOMs for ML dependency stacks, and monitoring for vulnerabilities in frameworks like TorchServe, Triton, and TensorFlow Serving. Policy gates can enforce container image standards and dependency requirements before serving infrastructure is deployed, ensuring that the same security governance applied to application code extends to ML infrastructure.