AWS adds observability for generative inference in SageMaker AI

New feature centralizes metrics for tokens, GPUs, and autoscaling for generative AI workloads in production, according to AWS.

AWS has announced a new observability capability for Amazon SageMaker AI inference endpoints, aimed at generative AI workloads in production. According to the company, the feature was designed to reduce the manual effort of investigating separate metrics in CloudWatch and correlating latency issues with factors such as GPU saturation, KV cache exhaustion, or slow scaling operations.

According to AWS, the new feature tracks real-time performance metrics, including time to first token, inter-token latency, queue depth, and tokens per second. These indicators are displayed alongside infrastructure information, such as GPU health, inference component placement, and autoscaling behavior.
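For readers unfamiliar with these token-level metrics, the following is a minimal sketch of how they are commonly derived from the timestamps of a streamed response. It is not AWS's implementation: the `stream_metrics` function, the `StreamMetrics` type, and the exact definitions used (for example, measuring tokens per second over the whole request) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_s: float        # time to first token, in seconds
    mean_itl_s: float    # mean inter-token latency, in seconds
    tokens_per_s: float  # output tokens divided by total elapsed time


def stream_metrics(request_sent_s: float, token_times_s: List[float]) -> StreamMetrics:
    """Derive latency metrics from one streamed response (hypothetical helper).

    request_sent_s: time the request was sent.
    token_times_s: arrival time of each output token, in ascending order.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")

    # Time to first token: delay between sending the request and the first token.
    ttft = token_times_s[0] - request_sent_s

    # Inter-token latency: gaps between consecutive tokens, averaged.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0

    # Throughput measured over the full request, including the wait for the first token.
    elapsed = token_times_s[-1] - request_sent_s
    tokens_per_s = len(token_times_s) / elapsed if elapsed > 0 else 0.0

    return StreamMetrics(ttft, mean_itl, tokens_per_s)


# Four tokens arriving 0.5 s, 0.75 s, 1.0 s, and 1.25 s after the request.
example = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
```

In this sketch a rising time to first token with stable inter-token latency would point at queueing or prefill delays rather than slow decoding, which is the kind of distinction the combined view is meant to surface.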

The company states that the feature includes a pre-configured dashboard, called SageMaker AI Insights, within Amazon CloudWatch. According to AWS, the dashboard brings together in a single view data such as token latency, GPU usage, the number of inference component copies, scaling events, and cold start details. Native OpenTelemetry metrics are published automatically and require no additional instrumentation.

For teams already using other observability tools, AWS says that SageMaker AI inference can be connected to solutions like Grafana via a regional PromQL endpoint, with an option to import a pre-configured dashboard template. The company also says the capability can help teams diagnose degradation in time to first token, verify compliance across availability zones, and adjust autoscaling policies.

According to AWS, SageMaker AI inference observability is available in multiple regions, including US East, US West, Canada Central, Europe, Asia Pacific, and South America (São Paulo).