Cause Analysis
PG placement is governed by the CRUSH ruleset, and my cluster currently has only the single default ruleset:
[root@ceph1:~]# ceph osd crush rule dump
[
    { "rule_id": 0,
      "rule_name": "replicated_ruleset",
      "ruleset": 0,
      "type": 1,
      "min_size": 1,
      "max_size": 10,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_firstn",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}]
Note that its type is "host", which means CRUSH will not choose two OSDs on the same host for one PG. In my environment, only the two OSDs on ceph1 are currently in, so CRUSH has no way to pick a new OSD to replace osd.3 for any of the affected PGs.
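You can confirm this constraint straight from the CRUSH hierarchy. A quick check (commands only; the output depends on the environment):
[root@ceph1:~]# ceph osd tree                  # shows which host bucket each OSD sits under
[root@ceph1:~]# ceph osd dump | grep '^osd'    # shows the up/down and in/out state of each OSD
With only the two OSDs on ceph1 in, a rule whose failure domain is "host" can place at most one replica per PG, which is exactly why the remapped PGs are stuck.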
Solution
Change the CRUSH ruleset's type from "host" to "osd" with the following steps, so that CRUSH is no longer restricted to distinct hosts when choosing OSDs for a PG.
[root@ceph1:~]# ceph osd getcrushmap -o crushmap_compiled_file
got crush map from osdmap epoch 77
[root@ceph1:~]# crushtool -d crushmap_compiled_file -o crushmap_decompiled_file
[root@ceph1:~]# vi crushmap_decompiled_file
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd    # change type from "host" to "osd"
        step emit
}
[root@ceph1:~]# crushtool -c crushmap_decompiled_file -o newcrushmap
[root@ceph1:~]# ceph osd setcrushmap -i newcrushmap
set crush map
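As an aside, the same effect can be achieved without hand-editing the crushmap: create a second rule with an osd-level failure domain and point each pool at it. This is a sketch only; the rule name replicated_osd is my own choice, and the pool name rbd and rule id 1 are assumptions, so take the real id from ceph osd crush rule dump:
[root@ceph1:~]# ceph osd crush rule create-simple replicated_osd default osd    # new rule, failure domain = osd
[root@ceph1:~]# ceph osd pool set rbd crush_ruleset 1                           # repeat for every pool
Editing the crushmap as above has the advantage of switching every pool that uses ruleset 0 in a single step.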
Once the new crushmap is injected, recovery resumes, and after a while the cluster returns to HEALTH_OK:
[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_WARN 256 pgs stuck unclean
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v493: 256 pgs, 4 pools, 285 MB data, 8 objects
            552 MB used, 9665 MB / 10217 MB avail
                 256 active+remapped
[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_WARN 137 pgs stuck unclean
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v494: 256 pgs, 4 pools, 285 MB data, 8 objects
            677 MB used, 9540 MB / 10217 MB avail
                 137 active+remapped
                 119 active+clean
  recovery io 34977 B/s, 0 objects/s
[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_OK
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v495: 256 pgs, 4 pools, 285 MB data, 8 objects
            679 MB used, 9538 MB / 10217 MB avail
                 256 active+clean
  recovery io 18499 kB/s, 0 objects/s
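One caveat: with the failure domain set to "osd", CRUSH may put both replicas of a PG on OSDs of the same host, so the cluster no longer survives the loss of an entire host. Once the OSDs on the other hosts are back in, it is worth reverting the rule through the same getcrushmap / decompile / edit / compile / setcrushmap sequence, changing the step back to:
step chooseleaf firstn 0 type host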