WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Spatial-Temporal Analysis for DRAM Errors

Tags: Spatial-Temporal, Hierarchical

This note is for Yu, Q., Zhang, W., Cardoso, J., & Kao, O. (2023). Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 01–09.

exploring error bits for memory failure prediction:

  • memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects
  • existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits
  • however, error bit patterns have a strong correlation with the occurrence of uncorrectable errors (UEs)

the paper studies the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information

  • the analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence
  • on real-world datasets, the approach improves prediction performance by 15% in F1 score

Introduction

  • among hardware failures, DRAM (Dynamic Random Access Memory) failure is a major occurrence
  • DRAM failure is often accompanied by DRAM errors, i.e., correctable error (CE) and uncorrectable error (UE)
  • ML-based techniques have been leveraged for DRAM failure prediction, using CE information from a large-scale datacenter to predict UEs
  • these studies have effectively utilized the spatial distribution of correctable errors to enhance DRAM failure prediction

  • system-level workload indicators such as memory utilization, reads and writes have been applied for DRAM failure prediction
  • prior work demonstrated that workload metrics are less significant than other CE-related features
  • CE storm (numerous CEs occurring in a short period) and UEs are considered for predicting DRAM-caused node unavailability (DCNU), emphasizing the importance of spatio-temporal CE features
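A CE storm is described above only as "numerous CEs occurring in a short period". As a minimal sketch of how such a condition could be checked on a list of CE timestamps, the sliding window below flags a storm when enough CEs fall within one time span. The function name `has_ce_storm` and the thresholds `window_s` and `min_count` are hypothetical and are not taken from the paper.

```python
def has_ce_storm(ce_times, window_s=60.0, min_count=10):
    """Return True if at least `min_count` CEs fall within any span of `window_s` seconds.

    `window_s` and `min_count` are illustrative thresholds (assumptions),
    not values reported in the paper.
    """
    times = sorted(ce_times)
    start = 0
    for end, t in enumerate(times):
        # Advance the left edge until the window spans at most window_s seconds.
        while t - times[start] > window_s:
            start += 1
        # Number of CEs currently inside the window.
        if end - start + 1 >= min_count:
            return True
    return False
```

For example, ten CEs logged one second apart count as a storm under these defaults, while three CEs spaced 100 seconds apart do not.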

  • the paper presents an in-depth correlative analysis between CE and UE, specifically focusing on the spatio-temporal distribution of error bits
  • it also investigates latent patterns of error bits from CE to UE on the ECC of contemporary Intel servers

Background

Terminology

  • fault:
  • error: correctable error (CE) and uncorrectable error (UE)
    • sudden UE: a sudden UE typically has no CEs before it occurs
    • predictable UE: UEs that initially manifest as CEs but eventually escalate to UEs

Memory Organization and Access

Dataset

the dataset was obtained from the Baseboard Management Controller (BMC) of a large-scale datacenter, which includes

  • system configuration
  • machine check exception (MCE) log
  • memory events

they focus on DIMMs with CEs, excluding those with sudden UEs from the datasets due to a lack of prediction information

Problem Formulation and Performance Measures

at the present time $t$, an algorithm observes historical data from an observation window $\Delta t_d$ to predict failures within the prediction period $[t+\Delta t_l, t + \Delta t_l + \Delta t_p]$, where

  • $\Delta t_l$ is a minimum time interval between the prediction and the failure
  • $\Delta t_p$ denotes the prediction interval
  • lead prediction window $\Delta t_l \in (0, 3h]$ varies based on production use cases
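The windowing can be sketched in a few lines, taking $\Delta t_p$ as the length of the prediction interval. This is a rough illustration rather than the paper's labeling code: the class `Windows`, the function `label_sample`, and the choice to place the observation window directly before $t$ are assumptions, and all times are in hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Windows:
    t: float     # present time
    dt_d: float  # observation window length
    dt_l: float  # lead time between prediction and failure
    dt_p: float  # prediction interval length

    def observation(self):
        # Assumption: the observation window ends at the present time t.
        return (self.t - self.dt_d, self.t)

    def prediction(self):
        start = self.t + self.dt_l
        return (start, start + self.dt_p)


def label_sample(w, ue_times):
    """Return 1 if any UE timestamp falls inside the prediction period, else 0."""
    lo, hi = w.prediction()
    return int(any(lo <= u <= hi for u in ue_times))
```

With `Windows(t=100, dt_d=24, dt_l=3, dt_p=12)`, a UE at hour 110 yields a positive label, while a UE at hour 101 does not, because it arrives before the lead time has elapsed.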

VM Interruption Reduction Rate (VIRR)

  • previous work: cost-aware models to measure the benefits of memory failure prediction
  • $V_a$: average number of VMs in a server
  • $V = V_a(TP+FN)$: the number of VM interruptions without prediction, since every server that actually fails ($TP+FN$) interrupts its VMs
  • proactive VM live migration can move VMs without service interruption, but a notable fraction of VMs may still experience cold migration, which generally interrupts them
  • cold migration is a prevalent strategy for both VM relocation and maintenance; the proportion of such migrations is denoted $y_c$
  • $V_1' = V_ay_c(TP+FP)$: the number of VM interruptions arising from cold migrations initiated by positive failure predictions
  • on the other hand, every missed failure still causes interruptions, represented by $V_2' = V_a\cdot FN$
  • overall interruptions with the prediction algorithm in place sum to $V'=V_1' + V_2'$
\[VIRR = \frac{V - V'}{V} = \left(1 - \frac{y_c}{\text{precision}}\right)\cdot \text{recall}\]
  • in real-world production environments, $y_c$ retains a positive value
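As a quick check of the formula, the sketch below computes VIRR once from the confusion-matrix counts and once from precision and recall. The function names and the example numbers are illustrative, not from the paper.

```python
def virr(tp, fp, fn, y_c, v_a=1.0):
    """VIRR from raw counts: (V - V') / V."""
    v = v_a * (tp + fn)            # interruptions without prediction
    v1 = v_a * y_c * (tp + fp)     # cold migrations triggered by positive predictions
    v2 = v_a * fn                  # interruptions from missed failures
    return (v - (v1 + v2)) / v


def virr_from_pr(precision, recall, y_c):
    """Closed form: (1 - y_c / precision) * recall."""
    return (1.0 - y_c / precision) * recall
```

With hypothetical counts `tp=40, fp=10, fn=60` and `y_c=0.1`, precision is 0.8 and recall is 0.4, and both functions give a VIRR of 0.35. $V_a$ cancels in the ratio, and the closed form shows that VIRR is positive only when precision exceeds $y_c$.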

Temporal Risky CE Pattern Indicators

  • although Intel keeps the exact ECC algorithms confidential, it has provided some general information on error-bit patterns that are fully correctable, partially correctable, or potentially risky

introduce three temporal risky pattern indicators

  • R1: Risky_CE_Cnt
  • R2: Risky_Pattern_Cnt
  • R3: Max_Risky_Pattern_Cnt

Finding 1: The performance of an individual risky CE pattern is limited. However, a proper combination of risky CE pattern indicators can significantly improve the results, particularly precision

Correlative Analysis Between Uncorrectable Error and Various Factors

Correlative Analysis Between Error Bits and UE

Finding 2: In terms of UE occurrence, the total number of error bits exhibits weaker correlation compared to adjacent error bits. Even a small number of adjacent bits can lead to UE occurrence.

in addition to spatial correlative analysis of error bits in DQs and beats, they incorporate temporal information

note that error bits of UEs are typically unknown, since they are not correctable, typically leading to spatial distribution

Finding 3: both spatial and temporal error bits in DQs and beats play a significant role in distinguishing UE occurrences. This finding suggests that these features can serve as important indicators for UE prediction
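To make the contrast between total and adjacent error bits concrete, here is a minimal sketch that derives a few spatial features from the error bits of one CE. Both the representation (a list of `(dq, beat)` positions) and the definition of adjacency (consecutive DQ or beat indices) are assumptions made for illustration; the paper's exact feature definitions may differ.

```python
def longest_adjacent_run(indices):
    """Length of the longest run of consecutive integers in `indices`."""
    present = set(indices)
    best = 0
    for i in present:
        if i - 1 not in present:  # i is the first index of a run
            length = 1
            while i + length in present:
                length += 1
            best = max(best, length)
    return best


def error_bit_features(bits):
    """bits: iterable of (dq, beat) positions with an erroneous bit (assumed layout)."""
    bits = set(bits)
    dqs = {dq for dq, _ in bits}
    beats = {beat for _, beat in bits}
    return {
        "total_bits": len(bits),
        "dq_count": len(dqs),
        "beat_count": len(beats),
        "max_adjacent_dqs": longest_adjacent_run(dqs),
        "max_adjacent_beats": longest_adjacent_run(beats),
    }
```

Two CEs with the same `total_bits` can differ sharply in `max_adjacent_dqs`, which is the kind of distinction Finding 2 points to. Temporal variants could aggregate these values over the observation window.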

Correlative Analysis Between DRAM Faults and UE

  • cell fault: if the number of CEs repeated in the same cell reaches a predefined threshold $\theta_{cell}$
  • row fault, column fault
  • device fault
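The cell-fault rule above can be sketched as a simple count per address. The address layout `(bank, row, column)` is an assumption, and `row_fault_candidates` with its threshold `theta_row` is a hypothetical analogue of the cell rule, since the row, column and device criteria are not spelled out in this note.

```python
from collections import Counter


def repeated_locations(ces, key, theta):
    """Locations whose CE count reaches the threshold `theta`."""
    counts = Counter(key(ce) for ce in ces)
    return {loc for loc, n in counts.items() if n >= theta}


def cell_faults(ces, theta_cell):
    """Cell fault: CEs repeated in the same cell at least theta_cell times."""
    return repeated_locations(ces, lambda ce: ce, theta_cell)


def row_fault_candidates(ces, theta_row):
    # ASSUMPTION: a count-based row rule analogous to the cell rule;
    # the paper's actual row-fault criterion may differ.
    return repeated_locations(ces, lambda ce: (ce[0], ce[1]), theta_row)
```

For instance, with CEs at `(0, 5, 7)` three times, `(0, 5, 9)` once and `(1, 2, 3)` once, `cell_faults(ces, 3)` returns `{(0, 5, 7)}` and `row_fault_candidates(ces, 4)` returns `{(0, 5)}`.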

Finding 4: while high-level faults may have a higher likelihood of causing UEs, Row and Bank faults account for the majority of UEs in the system

Correlative Analysis Between System Configuration and UE

Finding 5: The UE rate varies across server age, manufacturers, data width, frequency and process, while they did not observe significant differences in the UE rate based on the capacity of the DIMM

Failure Prediction

  • Labeling method
  • Feature generation
    • static features
    • CE error rate
    • DQ-Beat Error Bits
    • Error bit patterns
    • Fault Counts
    • Memory Events

Result

feature engineering? is it possible to have more features? or directly handle the feature

Finding 6: the inclusion of error-bit features significantly enhances UE prediction performance, even with knowledge of the error-bit patterns. This suggests that the latent patterns of error bits can be predicted using spatio-temporal error-bit features.

lead time: the duration between the prediction time and the expected occurrence of a failure

