About AMD

Advanced Micro Devices (AMD) is a global semiconductor company that designs high-performance CPUs, GPUs, and AI accelerators. Its internal compute grids support large-scale Electronic Design Automation (EDA) and engineering workloads that require efficient resource scheduling and real-time operational visibility.

Challenge

AMD operates large internal compute grids supporting EDA workloads across multiple regions, requiring real-time visibility into job execution and resource utilization.

Limited visibility into grid resource usage

AMD’s grid infrastructure runs thousands of jobs across internal systems. Capturing resource usage data such as CPU and DRAM usage, job timing, and developer activity required a data platform capable of ingesting and querying metrics in real time.

High-volume operational metrics ingestion

Each grid job generates telemetry data used to track resource utilization and scheduling behavior. The monitoring platform needed to continuously ingest high volumes of data while maintaining low-latency access for analysis and troubleshooting.

Scalable infrastructure across multiple regions

As AMD expanded its EDA compute workloads, the monitoring system had to scale across multiple data centers and regions. The platform needed to support future growth while maintaining reliable performance and operational simplicity.

Solution

AMD deployed Aerospike as the data platform for its grid monitoring application, ingesting and storing job-level telemetry for real-time analysis across its compute infrastructure.

Database for grid monitoring platform

Aerospike serves as the data layer for AMD’s grid computing monitoring application, storing metrics such as CPU utilization, DRAM usage, job start time, and user information associated with compute workloads.
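
As a rough illustration of the kind of job-level record described above, the following Python sketch maps one telemetry sample to a key and a set of bins. AMD's actual schema is not published, so the class name, field names, namespace ("grid"), and set name ("job_metrics") are all assumptions made for this example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class JobMetrics:
    """One telemetry sample for a grid job (all field names are hypothetical)."""
    job_id: str
    user: str
    cpu_util_pct: float   # CPU utilization, in percent
    dram_mb: int          # DRAM usage, in megabytes
    start_time: datetime  # job start time (timezone-aware)


def to_record(sample, namespace="grid", set_name="job_metrics"):
    """Build a (key, bins) pair for one sample.

    The key follows the (namespace, set, user_key) tuple shape used by the
    Aerospike Python client; the namespace and set names are assumptions.
    """
    key = (namespace, set_name, sample.job_id)
    bins = asdict(sample)
    # Store the start time as epoch seconds so it is a plain integer bin.
    bins["start_time"] = int(sample.start_time.timestamp())
    return key, bins


sample = JobMetrics(
    job_id="job-001",
    user="alice",
    cpu_util_pct=87.5,
    dram_mb=4096,
    start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
key, bins = to_record(sample)
print(key)
print(bins)
```

With the Aerospike Python client, a pair like this would typically be written with `client.put(key, bins)`. The sketch stops short of connecting to a cluster so that it runs on its own.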

Real-time ingestion of operational metrics

The monitoring system continuously collects and stores telemetry from the grid environment. Aerospike enables rapid ingestion and retrieval of these metrics, allowing infrastructure teams to analyze grid activity and resource usage.

Scalable deployment for expanding grids

The initial deployment runs in a single AMD data center supporting internal workloads. The architecture is designed to scale as AMD expands monitoring capabilities to additional regions and compute environments.

"We selected Aerospike because it can handle large volumes of operational data with predictable performance. That reliability allows us to monitor compute jobs across the grid while maintaining the efficiency required for high-demand modern workloads."

Rajdeep Sengupta
AMD, Application & Systems Engineering

Results

Aerospike provides AMD with a scalable data platform for monitoring compute grid activity in real time, improving visibility into resource utilization while supporting expansion across its infrastructure.

Real-time visibility into grid workloads

Infrastructure teams can track operational metrics for compute jobs as they run, helping identify inefficiencies and monitor resource usage across EDA workloads.

Operational insight for job scheduling

The monitoring system provides insight into job execution patterns, enabling teams to analyze how resources are consumed and ensure compute jobs are scheduled efficiently.

Foundation for expanded monitoring infrastructure

With Aerospike deployed in production, AMD plans to extend the monitoring system to additional data centers and regions, supporting larger EDA workloads across its compute grid.