Spatial-Temporal and Hierarchical Analysis for HBM Errors

Posted on Oct 16, 2025

This note is for Wu, R., Zhou, S., Lu, J., Shen, Z., Xu, Z., Shu, J., Yang, K., Lin, F., & Zhang, Y. (2024). Removing obstacles before breaking through the memory wall: A close look at HBM errors in the field. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 851–867.

High-bandwidth memory (HBM): a promising technology for fundamentally overcoming the memory wall

it stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth
however, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes
the study covers over 460 million error events collected from 19 data centers, spanning over two years of deployment under a variety of services
HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical DRAM error prediction models ineffective for HBM
based on these findings, the paper designs and implements Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures

Introduction

most of the findings and implications made from DRAM error analyses cannot be directly applied to HBM

a large-scale dataset collected from 19 data centers over two years, which comprises

error logs: correctable errors (CE) and uncorrectable errors (UE)

perform a series of analyses

spatial analysis (spatial locality and structure analysis)
temporal analysis (CE storm)
power analysis
temperature analysis

their contributions

in-depth analyses for HBM errors collected from production clusters
lessons learned from unsuccessful attempts
- no clear correlation between the CE rate and the occurrence of UERs
- a CE-based predictor trained on historical CEs is also impractical for predicting the future occurrence of UERs
hierarchical failure prediction framework for HBM

Background

High Bandwidth Memory

HBM can be constructed via either a 4Hi stack (with four DRAM dies) or an 8Hi stack (with eight DRAM dies)
a DRAM die consists of a number of channels (CH). In the pseudo-channel mode, a channel can be further divided into two pseudo-channels (PS-CH), each of which is composed of several bank groups (BG)
a BG consists of four banks, each of which comprises multiple rows and columns
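The device hierarchy above can be written down as a small record type, which makes it easier to talk about "errors at the bank level" versus "errors at the row level". This is only a sketch: the class name `HbmCellAddress`, its field names, and the helper `group_by_bank` are hypothetical and do not come from the paper or from any vendor tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmCellAddress:
    """Location of one cell, from the coarsest level to the finest (names assumed)."""
    die: int             # DRAM die inside the 4Hi or 8Hi stack
    channel: int         # channel (CH) on that die
    pseudo_channel: int  # 0 or 1 in pseudo-channel mode
    bank_group: int      # bank group (BG) inside the pseudo-channel
    bank: int            # 0..3, since a BG has four banks
    row: int
    column: int

    def bank_key(self):
        """Everything above the row/column level, i.e. the identity of the bank."""
        return (self.die, self.channel, self.pseudo_channel,
                self.bank_group, self.bank)


def group_by_bank(addresses):
    """Map each bank to the set of (row, column) cells that reported errors."""
    cells = defaultdict(set)
    for addr in addresses:
        cells[addr.bank_key()].add((addr.row, addr.column))
    return dict(cells)


errors = [
    HbmCellAddress(0, 1, 0, 2, 3, row=10, column=7),
    HbmCellAddress(0, 1, 0, 2, 3, row=11, column=7),
    HbmCellAddress(1, 0, 1, 0, 0, row=5, column=2),
]
per_bank = group_by_bank(errors)
```

Grouping by `bank_key()` is the natural first step before any per-bank spatial analysis, because rows and columns only have meaning inside one bank.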

Terminology

error and fault
- a fault can be active (causing errors) or dormant (not causing errors)
failure
- DRAM failure is one of the major causes of server crashes
ECC (error correcting code)
- encode data to generate additional parity bits, so that errors can be identified and corrected
CE and UE
- CE refers to the errors within the correction capability of ECC and hence they can be successfully restored
- UE refers to those that exceed the correction capability of ECC
  - UER (Uncorrectable Error action Required)
  - UEO (Uncorrectable Error action Optional): usually discovered by the periodic memory scrubbing over HBM chips and does not affect the system runtime. A UEO can be handled by taking pages offline
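To make the CE/UE distinction concrete, here is a toy single-error-correcting, double-error-detecting code (an extended Hamming (8,4) code). It is an illustration only: the notes do not say which ECC the HBM devices in the study use, and real memory ECC works on much wider words. The function names `encode` and `decode` are made up for this sketch.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    for pos in range(1, 8):                 # overall parity stored in position 0
        code[0] ^= code[pos]
    return code


def decode(code):
    """Return (status, data): 'OK', 'CE' (corrected) or 'UE' (detected only)."""
    code = list(code)
    syndrome = 0
    overall = 0
    for pos in range(8):
        overall ^= code[pos]
        if pos > 0 and code[pos]:
            syndrome ^= pos
    if overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        code[syndrome] ^= 1   # syndrome 0 means the overall parity bit itself
        status = "CE"
    elif syndrome != 0:
        # Even number of flips with a non-zero syndrome: beyond correction.
        return "UE", None
    else:
        status = "OK"
    return status, [code[3], code[5], code[6], code[7]]


word = encode([1, 0, 1, 1])

one_flip = list(word)
one_flip[5] ^= 1

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1
```

One flipped bit is within the correction capability and comes back as a CE with the original data; two flipped bits are detected but cannot be repaired, which is the UE case.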

Data Collection

The HBM status information is periodically collected by the baseboard management controller (BMC)

two error logs: ErrLog_Cycle and ErrLog_Occurrence

Sensor information

the temperature log
the power log

Analysis

Dataset Overview

the row and cell levels show a different distribution, which is due to column failures being prevalent in HBM

Spatial Analysis

Finding 1. If an error occurs in one cell of a device level, there is a high probability of experiencing subsequent errors in another cell

Spatial locality. there exists a strong correlation between the error mode and the device levels for HBM

Finding 2. While errors are common across multiple cells in most device levels, only the bank exhibits errors in multiple components

hierarchical analysis

Finding 3. The effects of crosstalk in the HBM may result in data loss, specifically in the 7th, 15th, 23rd, and 31st columns of a bank

Finding 4. Lower SIDs exhibit a higher susceptibility to errors

error modes:

single-cell mode
two-cell mode
single-row mode
two-row mode
single-column mode
two-column mode
row-dominant mode
column-dominant mode
irregular mode
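These notes list the error modes without their exact definitions, so the following is only a rough sketch of how the six simple modes might be told apart from the set of erroneous cells in one bank. Every rule in `classify_simple_mode` is an assumption made for illustration, not the paper's definition. The row-dominant, column-dominant, and irregular modes would need thresholds that are not given here, so they all fall into `"unclassified"`.

```python
def classify_simple_mode(cells):
    """Classify (row, column) error cells of one bank under assumed rules."""
    unique = set(cells)
    if not unique:
        raise ValueError("no error cells given")
    rows = {r for r, _ in unique}
    cols = {c for _, c in unique}

    # Assumption: the cell count is checked first, so two cells that share
    # a row are still reported as two-cell rather than single-row.
    if len(unique) == 1:
        return "single-cell"
    if len(unique) == 2:
        return "two-cell"
    if len(rows) == 1:
        return "single-row"
    if len(cols) == 1:
        return "single-column"
    if len(rows) == 2 and len(cols) > 2:
        return "two-row"
    if len(cols) == 2 and len(rows) > 2:
        return "two-column"
    # Row-dominant, column-dominant and irregular are not distinguished here.
    return "unclassified"


column_fault = [(1, 7), (2, 7), (3, 7)]
mode = classify_simple_mode(column_fault)
```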

Temporal analysis

Analysis of Sensor Information

impact of power
impact of temperature

Unsuccessful Attempts

Attempt 1: CE Rate Indicator

if $N_{CE}$ exceeds a predefined threshold, consider that a UER may occur in the future

but there is only limited correlation between the CE rate and UERs
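A threshold rule of this kind can be sketched in a few lines. The notes do not say over which unit or time window $N_{CE}$ is counted, so this sketch assumes a per-bank count inside a fixed window. The function name `flag_by_ce_rate`, the event format, and the numbers in the example are all hypothetical.

```python
from collections import Counter


def flag_by_ce_rate(ce_events, window_start, window_end, threshold):
    """Return the units whose CE count in [window_start, window_end) exceeds threshold.

    ce_events: iterable of (timestamp, unit_id) pairs, one per correctable error.
    """
    n_ce = Counter(
        unit for ts, unit in ce_events if window_start <= ts < window_end
    )
    return {unit for unit, count in n_ce.items() if count > threshold}


# Made-up events: bank "A" has three CEs in the window, bank "B" has one.
events = [(1, "A"), (2, "A"), (3, "B"), (4, "A"), (50, "B")]
flagged = flag_by_ce_rate(events, window_start=0, window_end=10, threshold=2)
```

The rule only looks at how many CEs occurred, not where they are, which is one plausible reading of why it correlates poorly with UERs in this dataset.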

Attempt 2: CE-based Predictor

the results on the dataset are disappointing

the CE-based predictor frequently mislabels normal banks

Calchas Design

Feature Generation

component features
stack features
sensor features

Hierarchical Prediction

row-level predictor
column-level predictor
bank-level predictor
server-level predictor

Prediction Timing

period-based approach
event-driven approach
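The difference between the two timing approaches can be shown by computing when a predictor would be invoked under each. This is a sketch under assumptions: the notes give only the names of the two approaches, so the function names, the `min_gap` debounce parameter, and the timestamps are invented for illustration.

```python
def period_based_times(start, end, period):
    """Run the predictor every `period` time units in [start, end)."""
    times = []
    t = start
    while t < end:
        times.append(t)
        t += period
    return times


def event_driven_times(error_timestamps, min_gap=0):
    """Run the predictor when a new error arrives.

    min_gap (hypothetical) skips events that arrive less than min_gap
    after the previous prediction, to avoid re-running during a burst.
    """
    times = []
    for ts in sorted(error_timestamps):
        if not times or ts - times[-1] >= min_gap:
            times.append(ts)
    return times


periodic = period_based_times(start=0, end=10, period=5)
on_event = event_driven_times([7, 3, 3, 20], min_gap=1)
```

The period-based schedule runs whether or not anything happened, while the event-driven schedule reacts only to new errors and therefore stays idle during quiet periods.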

Discussion

Fault Tolerance

EC-based operator
Transparent migration

Limitations

collection challenges
generalizability

Related Work

HBM failure study
DRAM failure analysis
DRAM failure prediction
