WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Spatial-Temporal and Hierarchical Analysis for HBM Errors

Tags: Spatial-Temporal, Hierarchical

This note is for Wu, R., Zhou, S., Lu, J., Shen, Z., Xu, Z., Shu, J., Yang, K., Lin, F., & Zhang, Y. (2024). Removing obstacles before breaking through the memory wall: A close look at HBM errors in the field. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 851–867.

High-bandwidth memory (HBM): a promising technology for fundamentally overcoming the memory wall

  • it stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth
  • however, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes
  • the study covers over 460 million error events collected from 19 data centers, spanning over two years of deployment under a variety of services
  • HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical prediction models designed for DRAM errors ineffective for HBM
  • based on these findings, the paper designs and implements Calchas, a hierarchical failure prediction framework for HBM, which integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures

Introduction

most of the findings and implications made from DRAM error analyses cannot be directly applied to HBM

a large-scale dataset collected from 19 data centers over two years, which comprises

  • error logs: correctable errors (CE) and uncorrectable errors (UE)

perform a series of analyses

  • spatial analysis (spatial locality and structure analysis)
  • temporal analysis (CE storm)
  • power analysis
  • temperature analysis

their contributions

  • in-depth analyses for HBM errors collected from production clusters
  • lessons learned from unsuccessful attempts
    • no clear correlation between the CE rate and the occurrence of UERs
    • a CE-based predictor trained on historical CEs is also impractical for predicting the future occurrence of UERs
  • hierarchical failure prediction framework for HBM

Background

High Bandwidth Memory

  • HBM can be constructed via either a 4Hi stack (with four DRAM dies) or an 8Hi stack (with eight DRAM dies)
  • a DRAM die consists of a number of channels (CH). In the pseudo-channel mode, a channel can be further divided into two pseudo-channels (PS-CH), each of which is composed of several bank groups (BG)
  • a BG consists of four banks, each of which comprises multiple rows and columns
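
To make the hierarchy above concrete, here is a minimal sketch of how an error location could be addressed and aggregated per device level. The class `CellAddress`, its field names, and the helper `errors_per_component` are my own illustration, not the paper's log format.

```python
from collections import Counter
from dataclasses import dataclass

# Device levels from the outermost to the bank, following the description above.
HIERARCHY = ("die", "channel", "pseudo_channel", "bank_group", "bank")


@dataclass(frozen=True)
class CellAddress:
    """Hypothetical address of one cell; field names are assumptions."""
    die: int
    channel: int
    pseudo_channel: int
    bank_group: int
    bank: int
    row: int
    column: int

    def key_at(self, level):
        """Return the address prefix that identifies the component at `level`."""
        depth = HIERARCHY.index(level) + 1
        return tuple(getattr(self, name) for name in HIERARCHY[:depth])


def errors_per_component(addresses, level):
    """Count error events per component at the given device level."""
    return Counter(addr.key_at(level) for addr in addresses)


# Made-up error locations, only for illustration.
errors = [
    CellAddress(0, 1, 0, 2, 3, row=100, column=7),
    CellAddress(0, 1, 0, 2, 3, row=101, column=7),
    CellAddress(0, 1, 1, 0, 0, row=5, column=12),
]
per_bank = errors_per_component(errors, "bank")
per_channel = errors_per_component(errors, "channel")
```

Here the first two errors fall in the same bank, while all three share one channel, which is the kind of roll-up a per-level analysis needs.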

Terminology

  • error and fault
    • a fault can be active (causing errors) or dormant (not causing errors)
  • failure
    • DRAM failure is one of the major causes of server crashes
  • ECC (error correcting code)
    • encodes data to generate additional parity bits, so that errors can be identified and corrected
  • CE and UE
    • CE refers to the errors within the correction capability of ECC and hence they can be successfully restored
    • UE refers to those that exceed the correction capability of ECC
      • UER (Uncorrectable Error action Required)
      • UEO (Uncorrectable Error action Optional): usually discovered by the periodic memory scrubbing over HBM chips and does not affect the system runtime. A UEO can be handled by taking the affected pages offline
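
As a toy illustration of the CE/UE distinction, the sketch below uses an extended Hamming (8,4) code, which corrects any single-bit error and detects any double-bit error. This is only a textbook example chosen by me; the note does not say which ECC scheme the HBM devices in the paper use, and real schemes operate on much wider words.

```python
def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity; indices 1..7 are the classic
    Hamming(7,4) positions with parity bits at 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data) where status is 'OK', 'CE' or 'UE'."""
    # The syndrome is the XOR of the positions of all set bits (0 if consistent).
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_mismatch = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_mismatch:
        status = "OK"
    elif parity_mismatch:
        # Odd number of flips: treat as a single-bit error and correct it.
        fixed[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped, not correctable.
        return "UE", None
    return status, [fixed[3], fixed[5], fixed[6], fixed[7]]


data = [1, 0, 1, 1]
word = encode(data)

one_flip = list(word)
one_flip[5] ^= 1

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1

clean_result = decode(word)
ce_result = decode(one_flip)
ue_result = decode(two_flips)
```

With one flipped bit the decoder recovers the original data (a CE); with two flipped bits it can only report that the word is corrupted (a UE).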

Data Collection

The HBM status information is periodically collected by the baseboard management controller (BMC)

two error logs: ErrLog_Cycle and ErrLog_Occurrence

Sensor information

  • the temperature log
  • the power log

Analysis

Dataset Overview

  • the row and cell levels show a different distribution, because column failures are prevalent in HBM

Spatial Analysis

Finding 1. If an error occurs in one cell of a device level, there is a high probability of experiencing subsequent errors in another cell

  • Spatial locality. there exists a strong correlation between the error mode and the device levels for HBM

Finding 2. While errors are common across multiple cells in most device levels, only the bank exhibits errors in multiple components

  • hierarchical analysis

Finding 3. The effects of crosstalk in the HBM may result in data loss, specifically in the 7th, 15th, 23rd, and 31st columns of a bank

Finding 4. Lower SIDs exhibit a higher susceptibility to errors

error modes:

  • single-cell mode
  • two-cell mode
  • single-row mode
  • two-row mode
  • single-column mode
  • two-column mode
  • row-dominant mode
  • column-dominant mode
  • irregular mode
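
The note does not record the paper's exact definitions of these modes, so the following is only a rough sketch of how the error cells of one bank might be mapped to a mode. The rule order, the tie-breaking, and the `dominant_share` cutoff are all my assumptions.

```python
from collections import Counter


def classify_error_mode(cells, dominant_share=0.5):
    """Classify the (row, column) error cells of one bank into an error mode.

    All rules below are assumed for illustration, not taken from the paper.
    """
    cells = set(cells)
    if not cells:
        raise ValueError("need at least one error cell")
    rows = Counter(r for r, _ in cells)
    cols = Counter(c for _, c in cells)

    # Modes defined by the number of distinct cells.
    if len(cells) == 1:
        return "single-cell"
    if len(cells) == 2:
        return "two-cell"
    # Modes defined by the number of distinct rows or columns (rows checked first).
    if len(rows) == 1:
        return "single-row"
    if len(cols) == 1:
        return "single-column"
    if len(rows) == 2:
        return "two-row"
    if len(cols) == 2:
        return "two-column"
    # Dominant modes: one row or column holds more than `dominant_share` of cells.
    top_row = max(rows.values()) / len(cells)
    top_col = max(cols.values()) / len(cells)
    if top_row > dominant_share and top_row >= top_col:
        return "row-dominant"
    if top_col > dominant_share:
        return "column-dominant"
    return "irregular"


mode_a = classify_error_mode([(3, 0), (3, 4), (3, 9)])
mode_b = classify_error_mode([(2, 0), (2, 1), (2, 2), (2, 3), (5, 9), (8, 11)])
mode_c = classify_error_mode([(0, 0), (1, 1), (2, 2)])
```

In this sketch `mode_a` is a single-row pattern, `mode_b` is row-dominant because four of six cells share row 2, and `mode_c` is irregular.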

Temporal analysis

Analysis of Sensor Information

  • impact of power
  • impact of temperature

Unsuccessful Attempts

Attempt 1: CE Rate Indicator

if $N_{CE}$ exceeds a predefined threshold, consider that a UER may occur in the future

but there is only limited correlation between the CE rate and UERs
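
A minimal sketch of such a threshold indicator, assuming $N_{CE}$ is counted over a sliding time window (the window, the function name, and the numbers are my assumptions, not values from the paper):

```python
def ce_rate_alarm(ce_timestamps, now, window, threshold):
    """Flag a possible future UER if the CE count in the recent window exceeds the threshold.

    ce_timestamps: times of past correctable errors (any consistent unit).
    window: length of the look-back window, in the same unit.
    """
    n_ce = sum(1 for t in ce_timestamps if now - window < t <= now)
    return n_ce > threshold


# Made-up CE timestamps in hours, only for illustration.
ce_times = [1, 2, 30, 31, 32, 33]
alarm_recent = ce_rate_alarm(ce_times, now=34, window=24, threshold=3)
alarm_early = ce_rate_alarm(ce_times, now=10, window=24, threshold=3)
```

The simplicity is the appeal of this indicator, but it can only work if the CE count actually carries signal about UERs, which is what the data does not support.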

Attempt 2: CE-based Predictor

its performance on the dataset is disappointing

the CE-based predictor frequently mislabels normal banks

Calchas Design

Feature Generation

  • component features
  • stack features
  • sensor features

Hierarchical Prediction

  • row-level predictor
  • column-level predictor
  • bank-level predictor
  • server-level predictor

Prediction Timing

  • period-based approach
  • event-driven approach
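
The note gives no details on the two timing strategies, so the sketch below only illustrates the general idea as I understand it: a period-based scheme invokes the predictor on a fixed schedule, while an event-driven scheme invokes it when a new error event arrives. The function names and the optional `min_gap` rate limit are hypothetical.

```python
def period_based_triggers(start, end, period):
    """Times at which the predictor runs on a fixed schedule within (start, end]."""
    times = []
    t = start + period
    while t <= end:
        times.append(t)
        t += period
    return times


def event_driven_triggers(error_times, min_gap=0):
    """Times at which the predictor runs in response to error events.

    min_gap (an assumed knob) suppresses triggers that follow the previous
    trigger too closely, e.g. during a burst of errors.
    """
    times = []
    for t in sorted(error_times):
        if not times or t - times[-1] >= min_gap:
            times.append(t)
    return times


periodic = period_based_triggers(start=0, end=10, period=4)
on_events = event_driven_triggers([9, 1, 2, 7], min_gap=3)
```

The trade-off this makes visible: the periodic schedule runs regardless of activity, while the event-driven one reacts only when errors occur and stays idle otherwise.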

Discussion

Fault Tolerance

  • EC-based operator
  • Transparent migration

Limitations

  • collection challenges
  • generalizability
  • HBM failure study
  • DRAM failure analysis
  • DRAM failure prediction
