🤔 RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output
🔥 NaN values during the PyTorch backward pass
This is the error I have run into most often since I started defining custom models, pooling layers, metrics, and loss functions. Honestly, these days I seem to see it far more often than CUDA OOM. The error means that the input passed to the LogSoftmax layer contains nan or inf values, so the computation cannot proceed. It is one of the trickiest problems to resolve while running deep learning experiments, because the cause is hard to pin down.

The reason the cause is so hard to track down is precisely that what we are doing is "deep learning". The error is almost always caused by an operator behaving in a way we did not intend, but there are far too many operators to debug one by one. On top of that, deep learning uses enormous matrices as inputs and outputs, so it is not easy to even notice that a nan or inf exists somewhere.

In my experience this error mostly occurred in custom-defined layers, and operators built on fractions, angles, square roots, or exponents were almost always the cause. For example, when computing cosine similarity, if a vector involved in the computation contains zero values the denominator becomes 0, the operation is undefined, nan is returned, and the error above is raised.
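Here is a minimal sketch of that failure mode with made-up toy values (not code from my actual project): a zero-norm embedding turns the cosine-similarity denominator into 0, and the resulting nan surfaces as the LogSoftmaxBackward0 error once the loss is backpropagated.

import torch
import torch.nn.functional as F

torch.autograd.set_detect_anomaly(True)   # make the backward pass fail loudly

# Toy embeddings: the first anchor vector is all zeros, so its L2 norm is 0
# and the cosine-similarity denominator becomes 0 -> 0/0 -> nan
emb_a = torch.tensor([[0.0, 0.0, 0.0],
                      [1.0, 2.0, 3.0]], requires_grad=True)
emb_b = torch.tensor([[0.5, 1.0, 1.5],
                      [2.0, 0.0, 1.0]], requires_grad=True)

numerator = emb_a @ emb_b.T
denominator = emb_a.norm(dim=-1, keepdim=True) * emb_b.norm(dim=-1).unsqueeze(0)
scores = numerator / denominator          # first row is nan

loss = F.cross_entropy(scores, torch.tensor([0, 1]))   # log-softmax receives the nan
loss.backward()   # RuntimeError: Function 'LogSoftmaxBackward0' returned nan values ...

The helper functions and modules below are what I ended up using to guard against this.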
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import CrossEntropyLoss
from transformers import AutoConfig
# cos_sim is assumed to be a pairwise cosine-similarity helper, e.g.
# sentence_transformers.util.cos_sim; its definition is not shown in this post
from sentence_transformers.util import cos_sim


def check_nan(x: torch.Tensor) -> bool:
    """ Check whether the tensor contains any NaN value """
    return bool(torch.isnan(x).any())


def zero_filtering(x: torch.Tensor) -> torch.Tensor:
    """
    Replace (near-)zero entries with a small eps value, similar to torch.clamp(),
    because the competition metric is cosine similarity:
    cosine similarity returns NaN when an input contains zero values
    """
    eps = 1e-4
    x[x <= eps] = eps
    return x


def nan_filtering(x: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """
    Replace NaN entries of an embedding with a small eps value,
    because the competition metric is cosine similarity and would otherwise return NaN
    """
    return torch.nan_to_num(x, nan=eps)
class CLIPGEMPooling(nn.Module):
    """
    Generalized Mean (GeM) Pooling for Natural Language Processing
    This is the CLIP version of GEMPooling, transferred from the NLP task code
    ViT does not use an attention mask, because every input image has the same shape
    Mean Pooling <= GEMPooling <= Max Pooling
    Because each token embedding is raised to the power p, GEMPooling acts like a weight
    toward more strongly activated tokens
    In the original paper they use p=3, but this class uses p=4, because torch does not
    support pow on negative-valued tensors with an odd exponent (only on non-negative values)
    """
    def __init__(self, auto_cfg: AutoConfig) -> None:
        super(CLIPGEMPooling, self).__init__()

    @staticmethod
    def forward(last_hidden_state: Tensor, p: int = 2) -> Tensor:
        """
        last_hidden_state.size: [batch_size, patches_sequence, hidden_size]
        1) raise last_hidden_state to the power p, then average over the sequence dimension
        2) raise sum_embeddings to the power 1/p
        """
        p_embeddings = zero_filtering(torch.pow(last_hidden_state, p))
        # Check for NaN values in the embedding after applying torch.pow
        if check_nan(p_embeddings):
            p_embeddings = nan_filtering(p_embeddings)
        sum_embeddings = torch.mean(p_embeddings, 1)
        gem_embeddings = zero_filtering(torch.pow(sum_embeddings, 1. / p))
        # Check for NaN values in the embedding after applying torch.pow
        if check_nan(gem_embeddings):
            gem_embeddings = nan_filtering(gem_embeddings)
        return gem_embeddings
class CLIPMultipleNegativeRankingLoss(nn.Module):
    """
    Multiple Negative Ranking Loss for the CLIP model
    The main concept is the same as the original, but adapted to fit models other than
    Sentence-Transformers
    If you use a larger batch size, you get more negative pairs for each anchor & positive pair
    Args:
        scale: the output of the similarity function is multiplied by this value
               => I don't know why this is needed
        similarity_fct: the distance metric to use, cosine similarity by default
    """
    def __init__(self, reduction: str, scale: float = 20.0, similarity_fct=cos_sim) -> None:
        super().__init__()
        self.reduction = reduction
        self.scale = scale
        self.similarity_fct = similarity_fct
        self.cross_entropy_loss = CrossEntropyLoss(reduction=self.reduction)

    def forward(self, embeddings_a: Tensor, embeddings_b: Tensor) -> Tensor:
        similarity_scores = zero_filtering(self.similarity_fct(embeddings_a, embeddings_b)) * self.scale
        # Check for NaN values in similarity_scores
        if check_nan(similarity_scores):
            similarity_scores = nan_filtering(similarity_scores)
        labels = torch.tensor(
            range(len(similarity_scores)),
            dtype=torch.long,
            device=similarity_scores.device,
        )
        return self.cross_entropy_loss(similarity_scores, labels)
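A rough usage sketch with assumed shapes (batch of 8, 50 patches/tokens, hidden size 768; the tensors are random placeholders, not real encoder outputs):

# Hypothetical inputs standing in for the last_hidden_state of an image and a text encoder
image_hidden = torch.randn(8, 50, 768)
text_hidden = torch.randn(8, 50, 768)

pooling = CLIPGEMPooling(auto_cfg=None)              # auto_cfg is unused in this sketch
criterion = CLIPMultipleNegativeRankingLoss(reduction='mean')

image_emb = pooling(image_hidden)                    # [8, 768]
text_emb = pooling(text_hidden)                      # [8, 768]

loss = criterion(image_emb, text_emb)                # anchor i is matched with positive i
print(loss)

Every other example in the batch acts as an in-batch negative, which is why a larger batch size gives more negatives per anchor.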
In my own case, I once had to apply sqrt() to two input matrices and then compute the cosine similarity between the individual elements of the two matrices. Values that were too small went into sqrt and caused underflow, which left zero values in the matrices, and without realizing it I spent a long time fighting the error above while computing the cosine similarity. What is more, to speed up the computation I was even using the torch.autocast class together with a grad scaler (float32 to float16).
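Here is a small sketch of that trap with made-up numbers: a value that is perfectly fine in float32 underflows to exactly zero once it is cast to float16 (as can happen under mixed precision), and the zero later turns a cosine-similarity denominator into 0, which yields nan.

import torch

# In float32 the tiny value survives the square root ...
x32 = torch.tensor([1e-10, 4.0])
print(torch.sqrt(x32))                    # tensor([1.0000e-05, 2.0000e+00])

# ... but cast to float16 it underflows to exactly zero
x16 = x32.to(torch.float16)               # simulating the half-precision cast
print(torch.sqrt(x16))                    # tensor([0., 2.], dtype=torch.float16)

# the zero-norm vector then makes the cosine-similarity denominator 0 -> nan
v = torch.sqrt(x16)[:1]
w = torch.tensor([1.0], dtype=torch.float16)
print((v @ w) / (v.norm() * w.norm()))    # tensor(nan, dtype=torch.float16)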
🖍️ How I solved it
If you, reading this post, are using sqrt or pow, I strongly recommend adding a suitable epsilon value before or after the operation as needed, as in the example code above, to prevent underflow. When choosing the epsilon, it should be enough to match it to the floating-point precision you are currently working with. With float32, most people seem to use something around 1e-6. Honestly, I still do not know exactly which value is best... Also, I have never seen inf show up because of overflow while running deep learning experiments.
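One way to pick a precision-appropriate epsilon (a suggestion on top of the rule of thumb above) is to read the limits of the dtype you are actually computing in from torch.finfo:

import torch

# Machine epsilon and the smallest positive normal number for each precision;
# any epsilon you add should sit comfortably above these limits
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, 'eps:', info.eps, 'smallest normal:', info.tiny)

# e.g. 1e-6 is fine for float32, but it is already below float16's smallest
# normal number (~6.1e-05), so mixed-precision code needs a larger epsilon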
On the other hand, if you add the epsilon uniformly before the problematic operation, even a very small value can distort the result significantly depending on the kind of operation. So I solved the problem by defining custom functions that apply the operation first, check whether the result contains NaN, Inf, or zero values, and add the epsilon only to the entries where that actually happened (check_nan, zero_filtering, and nan_filtering in the example code above).
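A toy comparison with made-up numbers shows why the order matters: sqrt amplifies an epsilon that is added up front, while patching afterwards leaves healthy entries untouched.

import torch

x = torch.tensor([0.0, 1e-9, 4.0])
eps = 1e-4

# (a) blanket epsilon BEFORE the operation: sqrt turns eps into 1e-2,
#     so near-zero entries are inflated by two orders of magnitude
print(torch.sqrt(x + eps))        # tensor([0.0100, 0.0100, 2.0000])

# (b) operation first, then patch only the broken entries (the approach above):
#     large values stay exact and zero entries become exactly eps
out = torch.sqrt(x)               # tensor([0.0000e+00, 3.1623e-05, 2.0000e+00])
out = zero_filtering(out)         # entries <= eps are replaced by eps
print(out)                        # tensor([1.0000e-04, 1.0000e-04, 2.0000e+00])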
By the way, if you call torch.autograd.set_detect_anomaly(True) at the beginning of your training loop, the run stops as soon as a NaN occurs and the line (forward operation) that produced the NaN is printed. Definitely make use of it.
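A minimal sketch of how to enable it (model, train_loader, and optimizer are placeholders for your own objects):

import torch

torch.autograd.set_detect_anomaly(True)   # enable once, before the training loop

for batch in train_loader:
    optimizer.zero_grad()
    loss = model(**batch)
    # with anomaly detection on, backward() raises e.g.
    # RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
    # together with a traceback of the forward op that produced the NaN
    loss.backward()
    optimizer.step()

Anomaly detection adds noticeable overhead, so it is best turned off again once the offending operation has been found.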