Add anomaly handler #1780

lzhangzz · 2024-06-14T11:20:41Z

Detecting and suppressing NaN/INF for debugging and robustness

Detect, suppress and report INF/NaN in the tensors
Fix invalid logits and report errors

USAGE

Opt in by setting the environment variable TM_ANOMALY_HANDLER=args...

ARGS

level - default: 0
- 0 - off
- 1 - embedding/lm_head/logits
- 2 - plus rmsnorm/residual/ffn_block/attn_block
- 3 - plus all other kernel outputs
nan - value used to replace NaNs, default: NaN
inf - value used to replace INF, default: INF
fallback - fallback token when there are INFs or NaNs in the logits, default: eos_id

For example, TM_ANOMALY_HANDLER=level=3,nan=0,inf=0 will:

Flush all NaN/INF values to 0 in all kernel outputs and count the number of anomalies
When NaN/INF is detected in the logits, set all logits for that sample to 0, except the fallback token, which is set to MAX_HALF, and set an error for the request
Log a summary of detected anomalies at WARNING level after each iteration
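
The flush-and-count and logits-fallback behaviour described above can be sketched on the host in plain C++. This is only an illustration, not the CUDA kernel from this PR. The names flush_anomalies, fix_logits, AnomalyCount and kMaxHalf are hypothetical, kMaxHalf assumes that MAX_HALF means the largest finite half-precision value (65504), and mapping a negative INF to -inf_val is also an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumption: MAX_HALF is the largest finite fp16 value.
constexpr float kMaxHalf = 65504.0f;

struct AnomalyCount {
    std::size_t nan_count = 0;
    std::size_t inf_count = 0;
};

// Replace NaN/INF in place and count how many of each were seen.
// Assumption: negative INF is replaced with -inf_val.
AnomalyCount flush_anomalies(std::vector<float>& data, float nan_val, float inf_val)
{
    AnomalyCount count;
    for (float& x : data) {
        if (std::isnan(x)) {
            ++count.nan_count;
            x = nan_val;
        }
        else if (std::isinf(x)) {
            ++count.inf_count;
            x = x > 0.f ? inf_val : -inf_val;
        }
    }
    return count;
}

// If any logit of one sample is NaN/INF, zero all logits of that sample and
// give the fallback token the largest value so that it is the one selected.
// Returns true when the logits were invalid, so the caller can flag the request.
bool fix_logits(std::vector<float>& logits, std::size_t fallback_id)
{
    const bool invalid = std::any_of(
        logits.begin(), logits.end(), [](float x) { return !std::isfinite(x); });
    if (!invalid) {
        return false;
    }
    std::fill(logits.begin(), logits.end(), 0.f);
    logits.at(fallback_id) = kMaxHalf;  // throws if fallback_id is out of range
    return true;
}
```

In this sketch the caller would accumulate the returned counts per iteration to build the logged summary, and would mark the request as failed whenever fix_logits returns true.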

NOTE

Level 1 is enough to suppress crashes caused by NaN/INF, but it cannot save the corrupted token
Levels 2 and 3, with proper NaN/INF replacement values, may suppress sporadic INFs and allow generation to continue smoothly
Levels 2 and 3 hurt performance because the number of launched kernels doubles and the anomaly-handling kernel is not optimized
Try level 3 first, then pick a suitable level based on the printed summary

src/turbomind/utils/anomaly_handler.cu

lvhan028 · 2024-06-17T05:40:42Z

+        auto x = static_cast<float>(data[i]);
+        if (isinf(x)) {
+            ++inf_count;
+            data[i] = x > 0.f ? pinf_val : ninf_val;


I was wondering what values are appropriate for pinf_val, ninf_val and nan_val.

lzhangzz added 3 commits June 14, 2024 10:18

add AnomalyHandler

81304c8

fix lint

b5c4700

fix ci build

92ec74a

lzhangzz mentioned this pull request Jun 14, 2024

[Bug] Many concurrent requests with --enable-prefix-caching AND --quant-policy 8 crashes with: CUDA runtime error: an illegal memory access was encountered /opt/lmdeploy/src/turbomind/utils/allocator.h:231 #1744

Closed

fix missing headers

267aa89

lvhan028 requested review from lvhan028 and irexyc June 17, 2024 03:37

lvhan028 added the improvement label Jun 17, 2024

lvhan028 reviewed Jun 17, 2024

src/turbomind/utils/anomaly_handler.cu

lvhan028 reviewed Jun 17, 2024

src/turbomind/utils/anomaly_handler.cu

lower the log level of setLevel

11a8713

lvhan028 reviewed Jun 17, 2024

lvhan028 approved these changes Jun 17, 2024

lvhan028 merged commit 5cbefe2 into InternLM:main Jun 17, 2024
9 checks passed
