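Manual Repair with Cloud Check
BOSH provides the Cloud Check CLI command (`bosh cck`) to repair the IaaS resources used by a deployment. It is rarely needed in normal operation, but it becomes essential when an IaaS operation fails and the Director cannot resolve the problem without human intervention, or when the Resurrector (automatic repair) is not available.
The Resurrector only recovers VMs that are missing from the IaaS or whose Agent has stopped responding. cck checks for the same two conditions, but instead of resolving them automatically it offers the operator a choice of resolutions. cck also checks, for each deployment job instance, that its persistent disk exists and is correctly attached.
Once the deployment has been selected with the `bosh deployment` command, simply run `bosh cck`. When no problems are found, the output looks like this: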
$ bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 622
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:00)
Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:00)
Started scanning 0 persistent disks
Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
Done scanning 0 persistent disks (00:00:00)
Task 622 done
Started 2015-01-09 23:29:34 UTC
Finished 2015-01-09 23:29:34 UTC
Duration 00:00:00
Scan is complete, checking if any problems found...
No problems found
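When cck output is captured in a script, the two summary lines (one for VMs, one for persistent disks) are the quickest place to see whether anything needs attention. The following is a minimal sketch, assuming the summary lines keep the format shown in the transcript above; `parse_summary` and `has_problems` are hypothetical helper names, not part of BOSH.

```python
import re

# Matches each "<count> <label>" pair in a cck summary line, e.g.
# "1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync."
# A label ends at the next comma or full stop.
PAIR = re.compile(r"(\d+) ([A-Za-z][A-Za-z\- ]*?)(?=,|\.)")


def parse_summary(line):
    """Return {label: count} for one scan summary line, or {} if none found."""
    # Only inspect the text after the last '>' so that the leading
    # "Started scanning 1 vms" part is never mistaken for a counter.
    tail = line.rsplit(">", 1)[-1]
    return {label: int(count) for count, label in PAIR.findall(tail)}


def has_problems(line):
    """True when any counter other than "OK" is non-zero."""
    counts = parse_summary(line)
    return any(n > 0 for label, n in counts.items() if label != "OK")
```

Feeding every line of a captured run through `has_problems` gives a simple yes/no signal. The interactive prompts that follow a failed scan still need an operator's decision.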
Problems
VM is missing
BOSH created a VM, but someone later deleted it directly through the IaaS, outside of BOSH.
$ bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 623
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:10)
Started scanning 1 vms > 0 OK, 0 unresponsive, 1 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:10)
Started scanning 0 persistent disks
Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
Done scanning 0 persistent disks (00:00:00)
Task 623 done
Started 2015-01-09 23:32:45 UTC
Finished 2015-01-09 23:32:56 UTC
Duration 00:00:11
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: VM with cloud ID `i-914c046a' missing.
1. Skip for now
2. Recreate VM
3. Delete VM reference
Please choose a resolution [1 - 3]: 3
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
1. VM with cloud ID `i-914c046a' missing.
Delete VM reference
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 624
Started applying problem resolutions > missing_vm 168: Delete VM reference. Done (00:00:00)
Task 624 done
Started 2015-01-09 23:33:20 UTC
Finished 2015-01-09 23:33:20 UTC
Duration 00:00:00
Cloudcheck is finished
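cck offers the operator the following resolutions:
- Skip for now: the Director does not try to resolve the problem.
- Recreate VM: the Director creates a new VM, deploys the specified releases, and starts them. cck does not wait for the release jobs to be running.
- Delete VM reference: the Director does not create a new VM. Running cck again no longer reports the VM as missing. A subsequent `bosh deploy` backfills the missing VMs.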
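To record which problems a cck run reported and which resolutions it offered, the interactive portion of the transcript can be parsed. This is a sketch only, assuming the `Problem N of M:` header, the numbered option lines, and the `Please choose a resolution` prompt appear as in the transcript above; `parse_problems` is a hypothetical name.

```python
import re

PROBLEM = re.compile(r"^Problem (\d+) of (\d+): (.+)$")
OPTION = re.compile(r"^(\d+)\. (.+)$")


def parse_problems(output):
    """Return a list of {"description": str, "options": {number: label}}."""
    problems = []
    collecting = False  # True while reading the option menu of a problem
    for raw in output.splitlines():
        line = raw.strip()
        header = PROBLEM.match(line)
        if header:
            problems.append({"description": header.group(3), "options": {}})
            collecting = True
            continue
        if line.startswith("Please choose a resolution"):
            # The confirmation list after the prompt is also numbered,
            # so stop collecting here to avoid mixing the two.
            collecting = False
            continue
        option = OPTION.match(line)
        if collecting and option:
            problems[-1]["options"][int(option.group(1))] = option.group(2)
    return problems
```

For the missing-VM transcript above, this yields one problem whose options are numbered 1 to 3, matching the menu shown to the operator.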
VM is not responsive
When a VM's Agent stops responding to the Director, `bosh vms` reports that VM's state as `unresponsive agent`:
```
$ bosh vms simple-deployment --details
Deployment `simple-deployment'
Director task 630
Task 630 done
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| Job/index       | State              | Resource Pool | IPs | CID        | Agent ID                             | Resurrection |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
| unknown/unknown | unresponsive agent |               |     | i-1db9ede6 | 59a30081-d63d-4c1b-80be-01fa681d8787 | active       |
+-----------------+--------------------+---------------+-----+------------+--------------------------------------+--------------+
VMs total: 1
```
Likewise, `bosh deploy` stops at the `Binding existing deployment` stage:
```
$ bosh deploy
..snip...
Deploying
Deployment name: `tiny-dummy.yml'
Director name: `micro-idora'
Are you sure you want to deploy? (type 'yes' to continue): yes
Director task 631
Started preparing deployment
Started preparing deployment > Binding deployment. Done (00:00:00)
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Failed: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds (00:02:15)
Error 450002: Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds
Task 631 error
For a more detailed error report, run: bosh task 631 --debug
```
$ bosh cck
Performing cloud check...
Processing deployment manifest
------------------------------
Director task 640
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:10)
Started scanning 1 vms > 0 OK, 1 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:10)
Started scanning 0 persistent disks
Started scanning 0 persistent disks > Looking for inactive disks. Done (00:00:00)
Started scanning 0 persistent disks > 0 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00)
Done scanning 0 persistent disks (00:00:00)
Task 640 done
Started 2015-01-09 23:33:45 UTC
Finished 2015-01-09 23:33:55 UTC
Duration 00:00:10
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: dummy/0 (i-914c046a) is not responding.
1. Skip for now
2. Reboot VM
3. Recreate VM
4. Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
Please choose a resolution [1 - 4]: 4
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
1. dummy/0 (i-914c046a) is not responding.
Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 641
Started applying problem resolutions > unresponsive_agent 168: Delete VM reference (...). Done (00:00:05)
Task 641 done
Started 2015-01-09 23:35:20 UTC
Finished 2015-01-09 23:35:25 UTC
Duration 00:00:05
Cloudcheck is finished
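cck offers the following resolutions:
- Skip for now
- Reboot VM
- Recreate VM
- Delete VM reference (forceful: you may need to delete the VM from the cloud manually to avoid IP conflicts)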
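The `get_state` timeout in the `bosh deploy` output above is a useful signal that cck should be run. A minimal sketch of detecting it in captured output follows, assuming the error text keeps the wording shown in that transcript; `find_agent_timeout` is a hypothetical helper, not a BOSH API.

```python
import re

# Matches e.g.:
#   Timed out sending `get_state' to 59a30081-d63d-4c1b-80be-01fa681d8787 after 45 seconds
TIMEOUT = re.compile(
    r"Timed out sending `(?P<method>\w+)' "
    r"to (?P<agent_id>[0-9a-f-]{36}) "
    r"after (?P<seconds>\d+) seconds"
)


def find_agent_timeout(output):
    """Return details of the first Agent timeout in the output, or None."""
    match = TIMEOUT.search(output)
    if match is None:
        return None
    return {
        "method": match.group("method"),
        "agent_id": match.group("agent_id"),
        "seconds": int(match.group("seconds")),
    }
```

The returned `agent_id` can be compared with the Agent ID column of `bosh vms --details` to confirm which VM is affected before choosing a resolution.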
Persistent disk is not attached
```
$ bosh cck
Performing cloud check...
Processing deployment manifest
Director task 656
Started scanning 1 vms
Started scanning 1 vms > Checking VM states. Done (00:00:00)
Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00)
Done scanning 1 vms (00:00:00)
Started scanning 1 persistent disks
Started scanning 1 persistent disks > Looking for inactive disks. Done (00:00:00)
Started scanning 1 persistent disks > 0 OK, 0 missing, 0 inactive, 1 mount-info mismatch. Done (00:00:00)
Done scanning 1 persistent disks (00:00:00)
Task 656 done
Started 2015-01-13 22:04:56 UTC
Finished 2015-01-13 22:04:56 UTC
Duration 00:00:00
Scan is complete, checking if any problems found...
Found 1 problem
Problem 1 of 1: Inconsistent mount information: Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4. However it is currently : Not mounted in any VM.
1. Skip for now
2. Reattach disk to instance
3. Reattach disk and reboot instance
Please choose a resolution [1 - 3]: 2
Below is the list of resolutions you've provided
Please make sure everything is fine and confirm your changes
1. Inconsistent mount information: Record shows that disk 'vol-549f071f' should be mounted on i-4fcd99b4. However it is currently : Not mounted in any VM
Reattach disk to instance
Apply resolutions? (type 'yes' to continue): yes
Applying resolutions...
Director task 657
Started applying problem resolutions > mount_info_mismatch 23: Reattach disk to instance. Done (00:00:22)
Task 657 done
Started 2015-01-13 22:05:19 UTC
Finished 2015-01-13 22:05:41 UTC
Duration 00:00:22
Cloudcheck is finished
```
Resolutions:
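- Skip for now.
- Reattach disk to instance: the disk is mounted at its usual location, `/var/vcap/store`. Release jobs are not restarted after the reattach.
- Reattach disk and reboot instance: rebooting lets the Agent mount the disk safely before the release jobs start. cck does not wait for the VM and its release jobs to come back up.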
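The three problem types covered so far, and the menu each one presents, can be summarized as a small lookup table. This is an illustrative sketch: the keys are the problem identifiers that appear in the "applying problem resolutions" lines of the transcripts (`missing_vm`, `unresponsive_agent`, `mount_info_mismatch`), the labels are shortened from the menus, and `RESOLUTIONS` and `menu_number` are hypothetical names.

```python
# Resolutions in the order cck lists them, per problem type.
RESOLUTIONS = {
    "missing_vm": [
        "Skip for now",
        "Recreate VM",
        "Delete VM reference",
    ],
    "unresponsive_agent": [
        "Skip for now",
        "Reboot VM",
        "Recreate VM",
        "Delete VM reference",  # shown with a "forceful" warning in cck
    ],
    "mount_info_mismatch": [
        "Skip for now",
        "Reattach disk to instance",
        "Reattach disk and reboot instance",
    ],
}


def menu_number(problem_type, resolution):
    """Return the 1-based menu number cck shows for a resolution.

    Raises KeyError for an unknown problem type and ValueError for a
    resolution that the problem type does not offer.
    """
    return RESOLUTIONS[problem_type].index(resolution) + 1
```

For example, `menu_number("mount_info_mismatch", "Reattach disk to instance")` gives the number entered at the prompt in the transcript above.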
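Persistent disk is missing
Some CPIs cannot detect a missing persistent disk and report it as a "Persistent disk is not attached" problem instead. In that case both reattach resolutions fail, because the disk cannot be found. In other words, if reattaching fails, the disk is most likely gone.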