==========
MD Cluster
==========

Cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |
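
As the diagram implies, each node's bitmap slot (superblock plus its
continuation) occupies two 4k blocks, starting 8k into the device. A
minimal sketch of that offset arithmetic, purely illustrative of the
diagram rather than taken from the bitmap code::

  /* Illustrative only: byte offset of bitmap slot 'slot' under the
   * 4k-aligned layout above (md super at 4k, then one 8k region per
   * slot: "bm super + bits" plus its continuation block). */
  static inline unsigned long long bm_slot_offset(unsigned int slot)
  {
          return 8192ULL + slot * 2 * 4096ULL;    /* 8k, 16k, 24k, ... */
  }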

During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.
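
As pseudocode, the write path above might look as follows; all helper
names are invented for illustration and do not match the md source::

  /* Pseudocode sketch of the clustered write path described above.
   * Every helper is hypothetical; the real logic lives in the md
   * bitmap code and the raid1/raid10 write paths. */
  static void clustered_write(struct mddev *mddev, sector_t sector)
  {
          if (!bit_is_set(mddev, sector))
                  set_bit_and_sync(mddev, sector); /* set the bitmap bit  */
          write_to_all_mirrors(mddev, sector);     /* commit to mirrors   */
          schedule_bit_clear(mddev, sector);       /* clear after timeout */
  }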


2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

The bm_lockres protects individual node bitmaps. They are named in
the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
node joins the cluster, it acquires the lock in PW mode and holds it
for as long as the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem.
Since DLM starts node count from one and bitmap slots start from
zero, one is subtracted from the DLM slot number to arrive at the
bitmap slot number.
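
For example, the slot-to-name mapping could be expressed like this
(an illustrative helper, not taken from the md-cluster source)::

  #include <stdio.h>

  /* Illustrative helper: map a 1-based DLM slot number to its bitmap
   * lock resource name ("bitmap000" for node 1, "bitmap001" for
   * node 2, and so on). */
  static void bitmap_lockres_name(int dlm_slot, char *buf, size_t len)
  {
          snprintf(buf, len, "bitmap%03d", dlm_slot - 1);
  }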

The LVB of the bitmap lock for a particular node records the range
of sectors that are being re-synced by that node. No other
node may write to those sectors. This is used when a new node
joins the cluster.
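
The range stored in the LVB can be pictured as follows; the sketch is
modelled on struct resync_info in drivers/md/md-cluster.c and the
exact layout should be treated as an assumption::

  #include <linux/types.h>

  /* Sketch of the resync range carried in the bitmap lock's LVB. */
  struct resync_info {
          __le64 lo;      /* first sector of the range being re-synced */
          __le64 hi;      /* end of the range being re-synced          */
  };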

2.2 Message passing locks
-------------------------

Each node has to communicate with other nodes when starting or ending
resync, and for metadata superblock updates. This communication is
managed through three locks: "token", "message", and "ack", together
with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management
-------------------------

A single lock: "no-new-dev" is used to co-ordinate the addition of
new devices - this must be synchronized across the array.
Normally all nodes hold a concurrent-read lock on this resource.

3. Communication
================

Messages can be broadcast to all nodes, and the sender waits for all
other nodes to acknowledge the message before proceeding. Only one
message can be processed at a time.

3.1 Message Types
-----------------

There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

Informs other nodes that the metadata has
been updated, and the node must re-read the md superblock. This is
performed synchronously. It is primarily used to signal device
failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

Informs other nodes that a resync is initiated or
ended so that each node may suspend or resume the region. Each
RESYNCING message identifies a range of the devices that the
sending node is about to resync. This overrides any previous
notification from that node: only one range can be resynced at a
time per node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

Informs other nodes that a device is being added to
the array. The message contains an identifier for that device. See
below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

A failed or spare device is being removed from the
array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

A failed device is being re-activated - the assumption
is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

If a node is stopped locally but the bitmap
isn't clean, then another node is informed to take ownership of
the resync.
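
The payload that travels between nodes might be laid out roughly like
this; the sketch is modelled on struct cluster_msg in
drivers/md/md-cluster.c, and the exact fields should be treated as an
assumption rather than a reference::

  #include <linux/types.h>

  /* Sketch of a cluster message and its six types, matching the
   * subsections above (cf. drivers/md/md-cluster.c). */
  enum msg_type {
          METADATA_UPDATED = 0,
          RESYNCING,
          NEWDISK,
          REMOVE,
          RE_ADD,
          BITMAP_NEEDS_SYNC,
  };

  struct cluster_msg {
          __le32 type;            /* one of enum msg_type           */
          __le32 slot;            /* sender's slot number           */
          __le64 low;             /* RESYNCING: start of the range  */
          __le64 high;            /* RESYNCING: end of the range    */
          char uuid[16];          /* NEWDISK: device identifier     */
          __le32 raid_slot;       /* REMOVE/RE_ADD: device slot     */
  };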

3.2 Communication mechanism
---------------------------

The DLM LVB is used to communicate between the nodes of the cluster.
There are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^

The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.

3.2.2 message
^^^^^^^^^^^^^

The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

The resource; acquiring it means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to
communicate.

The algorithm is:

1. receive status - all nodes have concurrent-reader lock on "ack"::

        sender                  receiver                receiver
        "ack":CR                "ack":CR                "ack":CR

2. sender gets EX on "token",
   sender gets EX on "message"::

        sender                  receiver                receiver
        "token":EX              "ack":CR                "ack":CR
        "message":EX
        "ack":CR

   Sender checks that it still needs to send a message. Messages
   received or other events that happened while waiting for the
   "token" may have made this message inappropriate or redundant.

3. sender writes LVB

   sender down-converts "message" from EX to CW

   sender tries to get EX on "ack"

   ::

        [ wait until all receivers have *processed* the "message" ]

        [ triggered by bast of "ack" ]
        receiver get CR on "message"
        receiver read LVB
        receiver processes the message
        [ wait finish ]
        receiver releases "ack"
        receiver tries to get PR on "message"

        sender                  receiver                receiver
        "token":EX              "message":CR            "message":CR
        "message":CW
        "ack":EX

4. triggered by grant of EX on "ack" (indicating all receivers
   have processed message)

   sender down-converts "ack" from EX to CR

   sender releases "message"

   sender releases "token"

   ::

        receiver upconvert to PR on "message"
        receiver get CR of "ack"
        receiver release "message"

        sender                  receiver                receiver
        "ack":CR                "ack":CR                "ack":CR
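
Putting the sender's side of steps 2-4 together, a simplified sketch;
lock_resource(), downconvert(), write_lvb() and unlock_resource() are
hypothetical wrappers around the DLM calls, invented for illustration,
not the md-cluster API::

  /* Simplified sender-side sketch of steps 2-4 above; all helpers
   * are hypothetical stand-ins for the DLM locking calls. */
  static int send_cluster_msg(struct cluster_msg *msg)
  {
          lock_resource("token", DLM_LOCK_EX);   /* serialize senders        */
          lock_resource("message", DLM_LOCK_EX); /* own the message carrier  */
          write_lvb("message", msg);             /* publish the payload      */
          downconvert("message", DLM_LOCK_CW);   /* receivers may now get CR */
          lock_resource("ack", DLM_LOCK_EX);     /* granted only once every  */
                                                 /* receiver has dropped CR  */
          downconvert("ack", DLM_LOCK_CR);       /* back to the idle state   */
          unlock_resource("message");
          unlock_resource("token");
          return 0;
  }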


4. Handling Failures
====================

4.1 Node Failure
----------------

When a node fails, the DLM informs the cluster with the slot
number. The node starts a cluster recovery thread. The cluster
recovery thread:

- acquires the bitmap<number> lock of the failed node
- opens the bitmap
- reads the bitmap of the failed node
- copies the set bitmap to local node
- cleans the bitmap of the failed node
- releases bitmap<number> lock of the failed node
- initiates resync of the bitmap on the current node

md_check_recovery() is invoked within recover_bitmaps(); from there
md_check_recovery() calls metadata_update_start()/metadata_update_finish(),
which takes the communication lock via lock_comm(). This means that
while one node is resyncing, it blocks all other nodes from writing
anywhere on the array.

The resync process is the regular md resync. However, in a clustered
environment when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNCING with the (lo,hi) range of the area which needs to
be suspended. Each node maintains a suspend_list, which contains the
list of ranges which are currently suspended. On receiving RESYNCING,
the node adds the range to the suspend_list. Similarly, when the node
performing resync finishes, it sends RESYNCING with an empty range to
other nodes and other nodes remove the corresponding entry from the
suspend_list.

A helper function, ->area_resyncing(), can be used to check whether a
particular I/O range should be suspended or not.
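
For illustration, an overlap check in the spirit of ->area_resyncing();
struct suspend_info and the helper are sketches modelled on the
description above, not the actual md source::

  #include <linux/list.h>
  #include <linux/types.h>

  /* Sketch of a suspend_list entry: one suspended (lo,hi) range. */
  struct suspend_info {
          struct list_head list;
          sector_t lo, hi;
  };

  /* Returns true if [lo, hi) overlaps any suspended range. */
  static bool range_suspended(struct list_head *suspend_list,
                              sector_t lo, sector_t hi)
  {
          struct suspend_info *s;

          list_for_each_entry(s, suspend_list, list)
                  if (hi > s->lo && lo < s->hi)   /* ranges overlap */
                          return true;
          return false;
  }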

4.2 Device Failure
------------------

Device failures are handled and communicated with the metadata update
routine. When a node detects a device failure, it does not allow
any further writes to that device until the failure has been
acknowledged by all other nodes.

5. Adding a new Device
======================

For adding a new device, it is necessary that all nodes "see" the new
device to be added. For this, the following algorithm is used:

1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
2.  Node 1 sends a NEWDISK message with uuid and slot number
3.  Other nodes issue kobject_uevent_env with uuid and slot number
    (Steps 4,5 could be a udev rule)
4.  In userspace, the node searches for the disk, perhaps
    using blkid -t SUB_UUID=""
5.  Other nodes issue either of the following depending on whether
    the disk was found:
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
    disc.number set to slot number)
    ioctl(CLUSTERED_DISK_NACK)
6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
7.  Node 1 attempts EX lock on "no-new-dev"
8.  If node 1 gets the lock, it sends METADATA_UPDATED after
    unmarking the disk as SpareLocal
9.  If node 1 cannot get the "no-new-dev" lock, it fails the
    operation and sends METADATA_UPDATED.
10. Other nodes get the information whether a disk is added or not
    by the following METADATA_UPDATED.
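
A hedged userspace sketch of step 5's acknowledgement path; the field
and flag names follow <linux/raid/md_u.h> and <linux/raid/md_p.h>, but
treat the details as an assumption::

  #include <sys/ioctl.h>
  #include <linux/raid/md_u.h>   /* ADD_NEW_DISK, mdu_disk_info_t */
  #include <linux/raid/md_p.h>   /* MD_DISK_CANDIDATE             */

  /* Sketch of step 5: a receiving node acknowledges a NEWDISK
   * message once it has found the disk locally. */
  static int ack_new_disk(int md_fd, int slot, int major, int minor)
  {
          mdu_disk_info_t disc = {
                  .number = slot,         /* slot from the NEWDISK message */
                  .major  = major,
                  .minor  = minor,
                  .state  = 1 << MD_DISK_CANDIDATE,
          };

          return ioctl(md_fd, ADD_NEW_DISK, &disc);
  }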

6. Module interface
===================

There are 17 call-backs which the md core can make to the cluster
module. Understanding these can give a good overview of the whole
process.
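
The call-backs are collected in an operations table; an abridged
sketch follows, with the authoritative definition being
struct md_cluster_operations in drivers/md/md-cluster.h::

  /* Abridged sketch of the cluster operations table; see
   * drivers/md/md-cluster.h for the full set of 17 call-backs. */
  struct md_cluster_operations {
          int (*join)(struct mddev *mddev, int nodes);
          int (*leave)(struct mddev *mddev);
          int (*slot_number)(struct mddev *mddev);
          int (*resync_info_update)(struct mddev *mddev,
                                    sector_t lo, sector_t hi);
          int (*resync_start)(struct mddev *mddev);
          int (*resync_finish)(struct mddev *mddev);
          int (*metadata_update_start)(struct mddev *mddev);
          /* ... area_resyncing, new_disk_ack, remove_disk,
           * gather_bitmaps, lock_all_bitmaps and the rest ... */
  };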

6.1 join(nodes) and leave()
---------------------------

These are called when an array is started with a clustered bitmap,
and when the array is stopped. join() ensures the cluster is
available and initializes the various resources.
Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

Reports the slot number advised by the cluster infrastructure.
Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

This updates the resync range that is stored in the bitmap lock.
The starting point is updated as the resync progresses. The
end point is always the end of the array.
It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()
-----------------------------------

These are called when resync/recovery/reshape starts or stops.
They update the resyncing range in the bitmap lock and also
send a RESYNCING message. resync_start reports the whole
array as resyncing, resync_finish reports none of it.

resync_finish() also sends a BITMAP_NEEDS_SYNC message which
allows some other node to take over.

6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
--------------------------------------------------------------------------------

metadata_update_start is used to get exclusive access to
the metadata. If a change is still needed once that access is
gained, metadata_update_finish() will send a METADATA_UPDATED
message to all other nodes, otherwise metadata_update_cancel()
can be used to release the lock.
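
A hypothetical call pattern, purely to illustrate how the three pair
up (the return-value handling is an assumption)::

  /* Hypothetical call pattern for a clustered metadata update. */
  if (!metadata_update_start(mddev)) {            /* exclusive access */
          if (change_still_needed)
                  metadata_update_finish(mddev);  /* broadcast METADATA_UPDATED */
          else
                  metadata_update_cancel(mddev);  /* just drop the lock */
  }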

6.6 area_resyncing()
--------------------

This combines two elements of functionality.

Firstly, it will check if any node is currently resyncing
anything in a given range of sectors. If any resync is found,
then the caller will avoid writing or read-balancing in that
range.

Secondly, while node recovery is happening it reports that
all areas are resyncing for READ requests. This avoids races
between the cluster-filesystem and the cluster-RAID handling
a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

These are used to manage the new-disk protocol described above.
When a new device is added, add_new_disk_start() is called before
it is bound to the array and, if that succeeds, add_new_disk_finish()
is called once the device is fully added.

When a device is added in acknowledgement to a previous
request, or when the device is declared "unavailable",
new_disk_ack() is called.

6.8 remove_disk()
-----------------

This is called when a spare or failed device is removed from
the array. It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

This sends a RE_ADD message to all other nodes and then
gathers bitmap information from all bitmaps. This combined
bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

These are called when the bitmap is changed to none. If a node plans
to clear the cluster RAID's bitmap, it needs to make sure no other
nodes are using the array; this is achieved by taking all the bitmap
locks within the cluster, and the locks are unlocked accordingly
afterwards.

7. Unsupported features
=======================

There are some things which are not supported by cluster MD yet.

- change array_sectors.