Solo "Learning k8s from Troubleshooting" Study Session, Part 2

This is Part 2, a continuation of the previous part.

Trouble 6 (ErrImagePull/ImagePullBackOff, Part 3)

# Checking the Events field shows ErrImagePull or ImagePullBackOff
$ k describe po echo

# First check (ErrImagePull)

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Warning  Failed     11s (x2 over 22s)  kubelet            Failed to pull image "gcrr.io/kubernetes-e2e-test-images/echoserver:2.2": rpc error: code = Unknown desc = Error response from daemon: Get "https://gcrr.io/v2/": dial tcp: lookup gcrr.io on 1.1.1.1:53: no such host
  Warning  Failed     11s (x2 over 22s)  kubelet            Error: ErrImagePull

--- snip ---

# Second check (ImagePullBackOff)

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Warning  Failed     19s (x3 over 54s)  kubelet            Failed to pull image "gcrr.io/kubernetes-e2e-test-images/echoserver:2.2": rpc error: code = Unknown desc = Error response from daemon: Get "https://gcrr.io/v2/": dial tcp: lookup gcrr.io on 1.1.1.1:53: no such host
  Warning  Failed     19s (x3 over 54s)  kubelet            Error: ErrImagePull
  Normal   BackOff    8s (x3 over 54s)   kubelet            Back-off pulling image "gcrr.io/kubernetes-e2e-test-images/echoserver:2.2"
  Warning  Failed     8s (x3 over 54s)   kubelet            Error: ImagePullBackOff

--- snip ---

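The Message shows that name resolution failed when the kubelet tried to reach the registry to pull the container image (`lookup gcrr.io ... no such host`). The cause is a typo in the registry host of the image name, so fix it to a hostname that exists. Besides typos, a similar error can occur when something is wrong on the network path from the Node to the DNS server.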
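
As a rough illustration of the fix, the manifest below is a minimal sketch of the Pod with the registry host corrected. The Pod name `echo` comes from the `describe` command above; the container name is hypothetical, and `gcr.io` is an assumption about which host was intended, since the original manifest is not shown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: echo
spec:
  containers:
    - name: echoserver   # hypothetical container name
      # Before: gcrr.io/kubernetes-e2e-test-images/echoserver:2.2 (host does not resolve)
      # After:  registry host corrected, assuming gcr.io was intended
      image: gcr.io/kubernetes-e2e-test-images/echoserver:2.2
```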

Trouble 7 (Pod is not scheduled onto a Node, Part 1)

# Check the Events field
$ k describe po example-trouble-7

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Warning  FailedScheduling  14m    default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.  

--- snip ---

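The Message shows that the Pod cannot be scheduled. There are several reasons a Pod may fail to be scheduled onto a Node, but here it says `Insufficient cpu`, which means no Node has enough unreserved CPU to satisfy the CPU the Pod requests. The Pod stays in the Pending state without being scheduled onto any Node.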

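`k describe node <node-name>` shows the Node's allocatable CPU in the `Allocatable` section and the total already requested by running Pods in the `Allocated resources` section.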
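
The scheduler compares the Pod's CPU request, not its actual usage, against what is left on the Node. The manifest below is a minimal sketch of where that request lives. The Pod name `example-trouble-7` comes from the command above; the container name, image, and the `500m` value are hypothetical, since the original manifest is not shown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-trouble-7
spec:
  containers:
    - name: app          # hypothetical container name
      image: nginx       # hypothetical image
      resources:
        requests:
          # Hypothetical value. It has to fit within the Node's Allocatable CPU
          # minus the CPU already requested by other Pods on that Node.
          cpu: "500m"
```

If the request is larger than the workload really needs, lowering it is one option; otherwise the cluster needs more CPU capacity.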

Trouble 8 (Pod is not scheduled onto a Node, Part 2)

# Check the Events field
$ k describe po trouble-8

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Warning  FailedScheduling  2m16s  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {CriticalAddonOnly: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

--- snip ---

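The Message contains `1 node(s) had untolerated taint {CriticalAddonOnly: true}.`, which means the Node carries the Taint `CriticalAddonOnly: true` and the Pod does not tolerate it, so the Pod cannot be scheduled. To schedule a Pod onto a tainted Node, add a Toleration to the Pod that tolerates that specific Taint.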

Note: the Toleration specified on the Pod must match the Taint set on the Node.
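
A Toleration for this Taint could look roughly like the sketch below. The key and value are taken from the event message; the effect `NoSchedule` is an assumption because the event does not show it (check the `Taints` line in `k describe node`), and the container name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trouble-8
spec:
  tolerations:
    - key: "CriticalAddonOnly"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"   # assumption: confirm the actual effect on the Node
  containers:
    - name: app              # hypothetical container name
      image: nginx           # hypothetical image
```

A Toleration only allows the Pod onto the tainted Node; it does not force the Pod to be placed there.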

Trouble 9 (Pod is not scheduled onto a Node, Part 3)

# Check the Events field
$ k describe po trouble-9

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Warning  FailedScheduling  11s   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

--- snip ---

The Message contains `1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }`, so a Taint is the cause. This particular Taint is added automatically when a Node is marked unschedulable.

# Check the Node
# STATUS is SchedulingDisabled (a state normally set by an administrator)
$ k get no
NAME                 STATUS                     ROLES           AGE   VERSION
kind-control-plane   Ready,SchedulingDisabled   control-plane   57m   v1.25.2

Run `k uncordon <node-name>` to clear it. Check STATUS again and confirm that SchedulingDisabled is gone.

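Note: in real operations, cordon is something an administrator sets intentionally, for example for Node maintenance. Do not uncordon a Node carelessly just because a Pod cannot be scheduled. Check with the administrator before performing any operation.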
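
For reference, the `SchedulingDisabled` status and the Taint in the event correspond to two fields on the Node object. The excerpt below is a trimmed sketch of what a cordoned Node looks like (for example via `k get no kind-control-plane -o yaml`); real output contains many more fields, and other Taints on the Node are omitted here.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: kind-control-plane
spec:
  # Set to true by cordon; uncordon clears it.
  unschedulable: true
  taints:
    # Added automatically while the Node is unschedulable.
    - key: node.kubernetes.io/unschedulable
      effect: NoSchedule
```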

Trouble 10 (A container in the Pod keeps restarting)

# CrashLoopBackOff (the container repeatedly starts and exits)
$ k get po trouble10-66675fcc5-cqvsd
NAME                        READY   STATUS             RESTARTS      AGE
trouble10-66675fcc5-cqvsd   0/1     CrashLoopBackOff   3 (32s ago)   82s

# Nginx is exiting abnormally (exit code 1)
$ k describe po trouble10-66675fcc5-cqvsd
Name:             trouble10-66675fcc5-cqvsd
--- snip ---

Containers:
  nginx:

--- snip ---

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 24 Apr 2023 14:20:21 +0900
      Finished:     Mon, 24 Apr 2023 14:20:21 +0900
    Ready:          False
    Restart Count:  4

--- snip ---

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Created    4m11s (x4 over 4m59s)  kubelet            Created container nginx
  Normal   Started    4m11s (x4 over 4m59s)  kubelet            Started container nginx
  Warning  BackOff    3m43s (x8 over 4m55s)  kubelet            Back-off restarting failed container
  Normal   Pulling    3m29s (x5 over 5m1s)   kubelet            Pulling image "nginx"

--- snip ---

This Pod was created by a Deployment, so its name has a random suffix. Select it by Pod label instead and check the logs.

# Check the Nginx logs
$ k logs -l app=nginx
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2023/04/24 05:24:36 [emerg] 1#1: open() "/etc/nginx/nginx.config" failed (2: No such file or directory)
nginx: [emerg] open() "/etc/nginx/nginx.config" failed (2: No such file or directory)

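The log starts with output from `docker-entrypoint.sh` (a shell script set as the ENTRYPOINT in the Dockerfile; it runs initialization steps and then executes the command the user specified). It ends with `Configuration complete; ready for start up`, so the initialization itself is fine. After that, nginx fails with an error because `/etc/nginx/nginx.config` does not exist.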

# Check the files inside the container image
$ d run --rm -it nginx ls -l /etc/nginx
total 28
drwxr-xr-x 2 root root 4096 Apr 12 04:42 conf.d
-rw-r--r-- 1 root root 1007 Mar 28 15:01 fastcgi_params
-rw-r--r-- 1 root root 5349 Mar 28 15:01 mime.types
lrwxrwxrwx 1 root root   22 Mar 28 16:49 modules -> /usr/lib/nginx/modules
-rw-r--r-- 1 root root  648 Mar 28 16:49 nginx.conf
-rw-r--r-- 1 root root  636 Mar 28 15:01 scgi_params
-rw-r--r-- 1 root root  664 Mar 28 15:01 uwsgi_params

The listing shows `nginx.conf`. The configuration file is not missing; the path was wrong and needs to be `/etc/nginx/nginx.conf`. Fix the path in the `args` of the manifest.
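
The corrected manifest might look roughly like the sketch below. The Deployment name `trouble10` is inferred from the Pod name, and the label `app=nginx`, container name `nginx`, and image `nginx` come from the output above. The exact `args` are an assumption, since the original manifest is not shown; only the path fix is taken from the text.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trouble10
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          # Hypothetical args: the original list is not shown in the post.
          args:
            - "nginx"
            - "-c"
            - "/etc/nginx/nginx.conf"   # was /etc/nginx/nginx.config
            - "-g"
            - "daemon off;"             # keep nginx in the foreground
```

Because `args` replaces the image's default command arguments, `-g "daemon off;"` is kept in this sketch so nginx stays in the foreground and the container does not exit right after start.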

(Continued in Part 3...)

harukin721

Mostly study notes 🔗 wantedly.com/id/harukin721
