ceph如何修改 CRUSH ruleset?

将跟 osd.1 在同一个host 上的 osd.3 out,看看 Ceph 集群能不能继续恢复。Ceph 集群中部分 PG 再次进入 remapped 状态:[root@ceph1:~]# ceph -s cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71 health HEALTH_WARN 256 pgs stuck unclean monmap e2: 1 mons...显示全部

将跟 osd.1 在同一个host 上的 osd.3 out,看看 Ceph 集群能不能继续恢复。Ceph 集群中部分 PG 再次进入 remapped 状态:

[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_WARN 256 pgs stuck unclean
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e77: 4 osds: 4 up, 2 in
      pgmap v480: 256 pgs, 4 pools, 285 MB data, 8 objects
            625 MB used, 9592 MB / 10217 MB avail
                 256 active+remapped

运行 ceph pg 1.0 query 查看 PG 1.0 的状态:

"recovery_state": [
        { "name": "Started\\/Primary\\/Active",
          "enter_time": "2016-06-03 01:31:22.045434",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_targets": [],
              "waiting_on_backfill": [],
              "last_backfill_started": "0\\/\\/0\\/\\/-1",
              "backfill_info": { "begin": "0\\/\\/0\\/\\/-1",
                  "end": "0\\/\\/0\\/\\/-1",
                  "objects": []},
              "peer_backfill_info": [],
              "backfills_in_flight": [],
              "recovering": [],
              "pg_backend": { "pull_from_peer": [],
                  "pushing": []}},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2016-06-03 01:31:20.976290"}],

可见它已经开始 recovery 了,但是没完成。

收起
参与3

返回智诩的回答

智诩智诩存储工程师hanergy

原因分析

PG 的分布和 CRUSH ruleset 有关。我的集群当前只有一个默认的 ruleset:

[root@ceph1:~]# ceph osd crush rule dump
[
    { "rule_id": 0,
      "rule_name": "replicated_ruleset",
      "ruleset": 0,
      "type": 1,
      "min_size": 1,
      "max_size": 10,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_firstn",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}]

注意其 type 为 “host”,也就是说 CRUSH 不会为一个 PG 选择在同一个 host 上的两个 OSD。而我的环境中,目前只有 ceph1 上的两个 OSD 是in 的,因此,CRUSH 无法为所有的 PG 重新选择一个新的 OSD 来替代 osd.3.

解决办法

按照以下步骤,将 CRUSH ruleset 的 type 由 “host” 修改为 “osd”,使得 CRUSH 为 PG 选择 OSD 时不再局限于不同的 host。

[root@ceph1:~]# ceph osd getcrushmap -o crushmap_compiled_file
got crush map from osdmap epoch 77
[root@ceph1:~]# crushtool -d crushmap_compiled_file -o crushmap_decompiled_file
[root@ceph1:~]# vi crushmap_decompiled_file
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type osd #将 type 由 “host” 修改为 “osd”
        step emit
}
 
[root@ceph1:~]# crushtool -c crushmap_decompiled_file -o newcrushmap
[root@ceph1:~]# ceph osd setcrushmap -i newcrushmap
set crush map

以上命令执行完毕后,可以看到 recovery 过程继续进行,一段时间后,集群恢复 OK 状态。

[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_WARN 256 pgs stuck unclean
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v493: 256 pgs, 4 pools, 285 MB data, 8 objects
            552 MB used, 9665 MB / 10217 MB avail
                 256 active+remapped
[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_WARN 137 pgs stuck unclean
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v494: 256 pgs, 4 pools, 285 MB data, 8 objects
            677 MB used, 9540 MB / 10217 MB avail
                 137 active+remapped
                 119 active+clean
recovery io 34977 B/s, 0 objects/s
[root@ceph1:~]# ceph -s
    cluster 5ccdcb2d-961d-4dcb-a9ed-e8034c56cf71
     health HEALTH_OK
     monmap e2: 1 mons at {ceph1=192.168.56.102:6789/0}, election epoch 1, quorum 0 ceph1
     osdmap e80: 4 osds: 4 up, 2 in
      pgmap v495: 256 pgs, 4 pools, 285 MB data, 8 objects
            679 MB used, 9538 MB / 10217 MB avail
                 256 active+clean
recovery io 18499 kB/s, 0 objects/s
能源采矿 · 2020-01-16
浏览2761

回答者

智诩
智诩0512
存储工程师hanergy
擅长领域: 存储灾备服务器

智诩 最近回答过的问题

回答状态

  • 发布时间:2020-01-16
  • 关注会员:2 人
  • 回答浏览:2761
  • X社区推广