Spatial-Temporal and Hierarchical Analysis for HBM Errors
High-bandwidth memory (HBM): a promising technology for fundamentally overcoming the memory wall
- it stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth
- however, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes
- the study covers over 460 million error events collected from 19 data centers, spanning over two years of deployment under a variety of services
- HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical prediction models designed for DRAM errors ineffective for HBM
- based on these findings, the paper designs and implements Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures
Introduction
most of the findings and implications made from DRAM error analyses cannot be directly applied to HBM
a large-scale dataset collected from 19 data centers over two years, which comprises
- error logs: correctable errors (CE) and uncorrectable errors (UE)
perform a series of analyses
- spatial analysis (spatial locality and structure analysis)
- temporal analysis (CE storm)
- power analysis
- temperature analysis
their contributions
- in-depth analyses for HBM errors collected from production clusters
- lessons learned from unsuccessful attempts
- no clear correlation between the CE rate and the occurrence of UERs
- the CE-based predictor trained on historical CEs is also impractical for predicting the future occurrence of UERs
- hierarchical failure prediction framework for HBM
Background
High Bandwidth Memory
- HBM can be constructed via either a 4Hi stack (with four DRAM dies) or an 8Hi stack (with eight DRAM dies)
- a DRAM die consists of a number of channels (CH). In pseudo-channel mode, a channel can be further divided into two pseudo-channels (PS-CH), each of which is composed of several bank groups (BG)
- a BG consists of four banks, each of which comprises multiple rows and columns
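The hierarchy above can be sketched as a small address record, which is convenient for grouping error events by device level. This is only an illustration: the field names (`sid`, `channel`, `pseudo_channel`, and so on) and the helper `group_by_bank` are assumptions made for this note, not identifiers from the paper or from any vendor log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmAddress:
    """Hypothetical location of one error event in the HBM hierarchy."""
    sid: int             # stack ID (assumed meaning of "SID")
    channel: int         # channel (CH)
    pseudo_channel: int  # pseudo-channel (PS-CH), 0 or 1 in pseudo-channel mode
    bank_group: int      # bank group (BG)
    bank: int            # bank inside the bank group (0..3)
    row: int
    column: int

    def bank_key(self):
        """Tuple that identifies the bank this address belongs to."""
        return (self.sid, self.channel, self.pseudo_channel,
                self.bank_group, self.bank)


def group_by_bank(addresses):
    """Map each bank key to the list of (row, column) cells that saw errors."""
    groups = {}
    for addr in addresses:
        groups.setdefault(addr.bank_key(), []).append((addr.row, addr.column))
    return groups
```

Grouping by a coarser level (for example, by pseudo-channel) only requires a shorter key tuple.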
Terminology
- error and fault
- a fault can be active (causing errors) or dormant (not causing errors)
- failure
- DRAM failure is one of the major causes of server crashes
- ECC (error correcting code)
- encode data to generate additional parity bits, so that errors can be identified and corrected
- CE and UE
- CE refers to the errors within the correction capability of ECC and hence they can be successfully restored
- UE refers to those that exceed the correction capability of ECC
- UER (Uncorrectable Error action Required)
- UEO (Uncorrectable Error action Optional): usually discovered by the periodic memory scrubbing over HBM chips and does not affect the system runtime. A UEO can be handled by taking the affected pages offline
Data Collection
The HBM status information is periodically collected by the baseboard management controller (BMC)
two error logs: ErrLog_Cycle and ErrLog_Occurrence
Sensor information
- the temperature log
- the power log
Analysis
Dataset Overview
- the row and cell levels show different distributions, because column failures are prevalent in HBM
Spatial Analysis
Finding 1. If an error occurs in one cell of a device level, there is a high probability of experiencing subsequent errors in another cell
- Spatial locality. there exists a strong correlation between the error mode and the device levels for HBM
Finding 2. While errors are common across multiple cells in most device levels, only the bank exhibits errors in multiple components
- hierarchical analysis
Finding 3. The effects of crosstalk in the HBM may result in data loss, specifically in the 7th, 15th, 23rd, and 31st columns of a bank
Finding 4. Lower SIDs exhibit a higher susceptibility to errors
error modes:
- single-cell mode
- two-cell mode
- single-row mode
- two-row mode
- single-column mode
- two-column mode
- row-dominant mode
- column-dominant mode
- irregular mode
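To make the list of error modes concrete, here is a minimal sketch that labels the set of erroneous cells observed in one bank. The notes do not record the paper's exact definitions, so every rule below is an assumption chosen for illustration, including the order in which the rules are checked and the `dominance` share used for the row-dominant and column-dominant modes.

```python
from collections import Counter


def classify_error_mode(cells, dominance=0.5):
    """Label a collection of (row, column) error cells from one bank.

    Assumed rules, checked in order (not the paper's definitions):
    1 cell -> single-cell, 2 cells -> two-cell, all cells in one
    row/column -> single-row/single-column, cells spread over exactly
    two rows/columns -> two-row/two-column, one row/column holding more
    than `dominance` of the cells -> row-/column-dominant, else irregular.
    """
    unique = set(cells)
    if not unique:
        raise ValueError("at least one error cell is required")
    if len(unique) == 1:
        return "single-cell"
    if len(unique) == 2:
        return "two-cell"

    row_counts = Counter(r for r, _ in unique)
    col_counts = Counter(c for _, c in unique)
    if len(row_counts) == 1:
        return "single-row"
    if len(col_counts) == 1:
        return "single-column"
    if len(row_counts) == 2:
        return "two-row"
    if len(col_counts) == 2:
        return "two-column"

    # Share of cells held by the most affected row / column.
    top_row_share = max(row_counts.values()) / len(unique)
    top_col_share = max(col_counts.values()) / len(unique)
    if top_row_share > dominance:
        return "row-dominant"
    if top_col_share > dominance:
        return "column-dominant"
    return "irregular"
```

Duplicate reports of the same cell are collapsed first, so repeated errors at one location still count as the single-cell mode under these assumed rules.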
Temporal analysis
Analysis of Sensor Information
- impact of power
- impact of temperature
Unsuccessful Attempts
Attempt 1: CE Rate Indicator
if $N_{CE}$ exceeds a predefined threshold, consider that a UER may occur in the future
but the correlation between the CE rate and UERs is limited
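The threshold rule of Attempt 1 can be sketched as follows. It assumes that $N_{CE}$ counts the CEs observed in a sliding time window; the window length, the threshold, and the function name `ce_rate_alarm` are hypothetical, and the sketch only illustrates the baseline that the authors found to be a weak indicator.

```python
def ce_rate_alarm(ce_timestamps, now, window_seconds, threshold):
    """Return True if the number of CEs in the recent window exceeds threshold.

    ce_timestamps: iterable of CE arrival times (seconds).
    The window is (now - window_seconds, now]; both the window and the
    threshold are assumed tuning knobs, not values from the paper.
    """
    n_ce = sum(1 for t in ce_timestamps if now - window_seconds < t <= now)
    return n_ce > threshold
```

For example, `ce_rate_alarm([95, 96, 97, 100], now=100, window_seconds=10, threshold=2)` raises an alarm because four CEs fall inside the window.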
Attempt 2: CE-based Predictor
the prediction results on the dataset are disappointing
the CE-based predictor frequently mislabels normal banks
Calchas Design
Feature Generation
- component features
- stack features
- sensor features
Hierarchical Prediction
- row-level predictor
- column-level predictor
- bank-level predictor
- server-level predictor
Prediction Timing
- period-based approach
- event-driven approach
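The two timing options differ in when a prediction is triggered. The sketch below contrasts them under simple assumptions: the period-based variant predicts at fixed intervals, while the event-driven variant predicts when a new error event arrives. The `min_gap` debounce parameter and both function names are hypothetical additions for illustration and are not taken from the paper.

```python
def period_based_times(start, end, period):
    """Prediction times at fixed intervals: start+period, start+2*period, ..."""
    if period <= 0:
        raise ValueError("period must be positive")
    times = []
    t = start + period
    while t <= end:
        times.append(t)
        t += period
    return times


def event_driven_times(event_times, min_gap=0):
    """Prediction times triggered by error events.

    An event triggers a prediction unless the previous trigger happened
    less than `min_gap` time units earlier (assumed debounce).
    """
    times = []
    last = None
    for t in sorted(event_times):
        if last is None or t - last >= min_gap:
            times.append(t)
            last = t
    return times
```

With a period-based schedule the cost is constant regardless of activity, whereas the event-driven schedule does no work on quiet devices but may fire often during a burst of errors, which is why a debounce is assumed here.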
Discussion
Fault Tolerance
- EC-based operator
- Transparent migration
Limitations
- collection challenges
- generalizability
Related Work
- HBM failure study
- DRAM failure analysis
- DRAM failure prediction