Manual Repair with Cloud Check

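BOSH provides the Cloud Check (cck) CLI command for repairing IaaS resources. It is rarely needed in normal operation, but it is useful when an IaaS operation has failed in a way the Director cannot resolve without human intervention, or when the Resurrector (automatic repair) is not available.

The Resurrector only recovers VMs that are missing from the IaaS or whose Agent is not responding. cck checks for the same two conditions, but it does not resolve them automatically; instead it offers the operator a set of choices. cck also checks, for each deployment job instance, that the persistent disk exists and is correctly attached.

If the deployment manifest has been set with the bosh deployment command, simply run bosh cck. When no problems are found, the output looks like this: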

```
$ bosh cck

Performing cloud check...

Processing deployment manifest
------------------------------

Director task 622
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:00)
  Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:00)

  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)

Task 622 done

Started     2015-01-09 23:29:34 UTC
Finished    2015-01-09 23:29:34 UTC
Duration    00:00:00

Scan is complete, checking if any problems found...
No problems found
```
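
When cck output is saved to a log, the two scan summary lines carry the counts that matter. The following is a minimal Python sketch for reading them, assuming the summary format shown in the output above. `parse_scan_summary` and `has_problems` are hypothetical helper names and are not part of BOSH.

```python
import re

# Matches count/label pairs such as "1 OK" or "0 mount-info mismatch".
# A pair only counts if it is followed by "," or by ". Done", which is how
# the summary lines in the cck output above are punctuated (an assumption
# based on that sample, not on any BOSH specification).
PAIR = re.compile(r"(\d+) ([A-Za-z][A-Za-z -]*?)(?=,|\. Done)")


def parse_scan_summary(line):
    """Return {label: count} for one summary line, or {} if nothing matches."""
    return {label: int(count) for count, label in PAIR.findall(line)}


def has_problems(counts):
    """True if any category other than OK has a non-zero count."""
    return any(n > 0 for label, n in counts.items() if label != "OK")


if __name__ == "__main__":
    line = ("  Started scanning 1 vms > 0 OK, 0 unresponsive, 1 missing, "
            "0 unbound, 0 out of sync. Done (00:00:00)")
    counts = parse_scan_summary(line)
    print(counts, has_problems(counts))
```

Lines that are not summaries, such as the "Checking VM states" progress line, yield an empty dictionary and can simply be skipped.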

Problems

VM is missing

BOSH had created a VM, but someone then deleted it directly through the IaaS.

```
$ bosh cck

Performing cloud check...

Processing deployment manifest
------------------------------

Director task 623
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:10)
  Started scanning 1 vms > 0 OK, 0 unresponsive, 1 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:10)

  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)

Task 623 done

Started     2015-01-09 23:32:45 UTC
Finished    2015-01-09 23:32:56 UTC
Duration    00:00:11

Scan is complete, checking if any problems found...

Found 1 problem

Problem 1 of 1: VM with cloud ID `i-914c046a' missing.
  1. Skip for now
  2. Recreate VM
  3. Delete VM reference
Please choose a resolution [1 - 3]: 3

Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes

  1. VM with cloud ID `i-914c046a' missing.
     Delete VM reference

Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...

Director task 624
  Started applying problem resolutions > missing_vm 168: Delete VM reference. Done (00:00:00)

Task 624 done

Started     2015-01-09 23:33:20 UTC
Finished    2015-01-09 23:33:20 UTC
Duration    00:00:00
Cloudcheck is finished
```

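The operator is offered the following choices:

  1. Skip for now: the Director does not attempt to resolve the problem.
  2. Recreate VM: the Director recreates the VM, then deploys and starts the specified releases. cck does not wait for the release jobs to come up.
  3. Delete VM reference: the Director does not create a new VM, and running cck again no longer reports the VM as missing. Running bosh deploy afterwards backfills the missing VMs.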
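
To review a saved cck session in a script, the problem prompts can be pulled out of the text. This is a rough Python sketch, assuming the prompt layout shown in the transcript above ("Problem N of M:" followed by numbered options and a "Please choose a resolution" line). `parse_problems` is a hypothetical name, and the sketch only reads text; it does not drive the interactive prompt.

```python
import re

SAMPLE = """\
Problem 1 of 1: VM with cloud ID `i-914c046a' missing.
  1. Skip for now
  2. Recreate VM
  3. Delete VM reference
Please choose a resolution [1 - 3]: 3

  1. VM with cloud ID `i-914c046a' missing.
     Delete VM reference
"""


def parse_problems(text):
    """Return a list of {"description": str, "options": {number: label}}."""
    problems = []
    current = None  # the problem whose options are currently being read
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Problem (\d+) of (\d+): (.*)", line)
        option = re.match(r"(\d+)\. (.*)", line)
        if header:
            current = {"description": header.group(3), "options": {}}
            problems.append(current)
        elif option and current is not None:
            current["options"][int(option.group(1))] = option.group(2)
        elif line.startswith("Please choose a resolution"):
            # The prompt line closes the option list, so the numbered
            # confirmation summary further down is not mistaken for options.
            current = None
    return problems


if __name__ == "__main__":
    for problem in parse_problems(SAMPLE):
        print(problem["description"], problem["options"])
```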

VM is unresponsive

When an Agent stops responding to the Director, bosh vms reports that VM's agent as unresponsive agent:

```
$ bosh vms simple-deployment --details

Deployment `simple-deployment'

Director task 630

Task 630 done

+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| Job/index       | State              | Resource Pool | IPs | CID        | Agent ID                             | Resurrection |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| unknown/unknown | unresponsive agent |               |     | i-1db9ede6 | 59a30081-d63d-4c1b-80be-01fa681d8787 | active       |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+

VMs total: 1
```

Likewise, `bosh deploy` stops at the `Binding existing deployment` step:

```
$ bosh deploy

..snip...

Deploying

Deployment name: `tiny-dummy.yml'
Director name: `micro-idora'
Are you sure you want to deploy? (type 'yes' to continue): yes

Director task 631
  Started preparing deployment
  Started preparing deployment > Binding deployment. Done (00:00:00)
  Started preparing deployment > Binding releases. Done (00:00:00)
  Started preparing deployment > Binding existing deployment. Failed: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds (00:02:15)

Error 450002: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds

Task 631 error

For a more detailed error report, run: bosh task 631 --debug
```

Running bosh cck reports the unresponsive agent and offers resolutions:

```
$ bosh cck

Performing cloud check...

Processing deployment manifest
------------------------------

Director task 640
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:10)
  Started scanning 1 vms > 0 OK, 1 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:10)

  Started scanning 0 persistent disks
  Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
     Done scanning 0 persistent disks (00:00:00)

Task 640 done

Started   2015-01-09 23:33:45 UTC
Finished  2015-01-09 23:33:55 UTC
Duration  00:00:10

Scan is complete, checking if any problems found...

Found 1 problem

Problem 1 of 1: dummy/0 (i-914c046a) is not responding.
  1. Skip for now
  2. Reboot VM
  3. Recreate VM
  4. Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
Please choose a resolution [1 - 4]: 4

Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes

  1. dummy/0 (i-914c046a) is not responding.
     Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)

Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...

Director task 641
  Started applying problem resolutions > unresponsive_agent 168: Delete VM reference (...). Done (00:00:05)

Task 641 done

Started   2015-01-09 23:35:20 UTC
Finished  2015-01-09 23:35:25 UTC
Duration  00:00:05
Cloudcheck is finished
```

The following resolutions are offered:

  1. Skip for now
  2. Reboot VM
  3. Recreate VM
  4. Delete VM reference
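
Before running cck, it can help to list which VMs are affected. The sketch below, in Python, picks unresponsive agents out of a saved `bosh vms --details` table. It assumes the pipe-delimited layout and the `State` and `CID` column names seen in the sample output for this section; `parse_table` and `unresponsive_cids` are hypothetical helpers, not BOSH commands.

```python
SAMPLE = """\
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| Job/index       | State              | Resource Pool | IPs | CID        | Agent ID                             | Resurrection |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| unknown/unknown | unresponsive agent |               |     | i-1db9ede6 | 59a30081-d63d-4c1b-80be-01fa681d8787 | active       |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
"""


def parse_table(text):
    """Turn a pipe-delimited table into a list of {column: value} dicts."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip "+----+" borders and any surrounding text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" row names the columns
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def unresponsive_cids(text):
    """Return the cloud IDs of rows whose State is 'unresponsive agent'."""
    return [row["CID"] for row in parse_table(text)
            if row["State"] == "unresponsive agent"]


if __name__ == "__main__":
    print(unresponsive_cids(SAMPLE))
```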

Persistent disk is not attached

```
$ bosh cck

Performing cloud check...

Processing deployment manifest
------------------------------

Director task 656
  Started scanning 1 vms
  Started scanning 1 vms > Checking VM states. Done (00:00:00)
  Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
     Done scanning 1 vms (00:00:00)

  Started scanning 1 persistent disks
  Started scanning 1 persistent disks > Looking for inactive disks. Done (00:00:00)
  Started scanning 1 persistent disks > 0 OK, 0 missing, 0 inactive, 1 mount-info mismatch. Done (00:00:00)
     Done scanning 1 persistent disks (00:00:00)

Task 656 done

Started     2015-01-13 22:04:56 UTC
Finished    2015-01-13 22:04:56 UTC
Duration    00:00:00

Scan is complete, checking if any problems found...

Found 1 problem

Problem 1 of 1: Inconsistent mount information: Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4. However it is currently : Not mounted in any VM.
  1. Skip for now
  2. Reattach disk to instance
  3. Reattach disk and reboot instance
Please choose a resolution [1 - 3]: 2

Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes

  1. Inconsistent mount information: Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4. However it is currently : Not mounted in any VM
     Reattach disk to instance

Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...

Director task 657
  Started applying problem resolutions > mount_info_mismatch 23: Reattach disk to instance. Done (00:00:22)

Task 657 done

Started     2015-01-13 22:05:19 UTC
Finished    2015-01-13 22:05:41 UTC
Duration    00:00:22
Cloudcheck is finished
```

Resolutions:

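  1. Skip for now
  2. Reattach disk to instance: the disk is reattached, usually mounted at /var/vcap/store. The release jobs are not restarted when the disk is reattached.
  3. Reattach disk and reboot instance: the reboot lets the Agent mount the disk safely before the release jobs start. cck does not wait for the VM and the release jobs to come back up.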
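
The resolutions offered for the three problem types covered so far can be summarised as a small lookup table. This Python sketch is only an illustration: the keys are the problem type names that appear in the "applying problem resolutions" lines of the cck transcripts, the option labels are shortened from the prompts, and `RESOLUTIONS` and `resolution_for` are hypothetical names rather than any BOSH API.

```python
# Problem type -> resolutions, in the order cck numbers them (1-based).
# Labels are abbreviated; cck's own text for option 4 of unresponsive_agent
# also carries a warning about deleting the VM manually in the cloud.
RESOLUTIONS = {
    "missing_vm": [
        "Skip for now",
        "Recreate VM",
        "Delete VM reference",
    ],
    "unresponsive_agent": [
        "Skip for now",
        "Reboot VM",
        "Recreate VM",
        "Delete VM reference",
    ],
    "mount_info_mismatch": [
        "Skip for now",
        "Reattach disk to instance",
        "Reattach disk and reboot instance",
    ],
}


def resolution_for(problem_type, choice):
    """Map a 1-based menu choice to its label, rejecting out-of-range input."""
    options = RESOLUTIONS[problem_type]
    if not 1 <= choice <= len(options):
        raise ValueError(
            f"{problem_type}: choice must be between 1 and {len(options)}"
        )
    return options[choice - 1]


if __name__ == "__main__":
    print(resolution_for("mount_info_mismatch", 2))
```

A table like this can be handy in a runbook, for example to record which resolution was chosen for each problem during an incident.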

Persistent disk is missing

Some CPIs report a missing disk as a "persistent disk is not attached" problem; however, both reattach resolutions will fail because the persistent disk cannot be found. In other words, if reattaching fails, the disk is missing.
