.. SPDX-License-Identifier: GPL-2.0

Directory Entries
-----------------

In an ext4 filesystem, a directory is more or less a flat file that maps
an arbitrary byte string (usually ASCII) to an inode number on the
filesystem. There can be many directory entries across the filesystem
that reference the same inode number; these are known as hard links.
Because an entry stores only an inode number, hard links cannot
reference files on other filesystems. Directory entries are found by
reading the data block(s) associated with a directory file and searching
them for the particular directory entry that is desired.

Linear (Classic) Directories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, each directory lists its entries in an “almost-linear”
array. I write “almost” because it's not a linear array in the memory
sense: directory entries are not split across filesystem blocks.
Therefore, it is more accurate to say that a directory is a series of
data blocks and that each block contains a linear array of directory
entries. The end of each per-block array is signified by reaching the
end of the block; the last entry in the block has a record length that
takes it all the way to the end of the block. The end of the entire
directory is signified by reaching the end of the file. Unused
directory entries are signified by inode = 0. By default the filesystem
uses ``struct ext4_dir_entry_2`` for directory entries unless the
“filetype” feature flag is not set, in which case it uses
``struct ext4_dir_entry``.

The original directory entry format is ``struct ext4_dir_entry``, which
is at most 263 bytes long, though on disk you'll need to reference
``dirent.rec_len`` to know for sure.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - inode
     - Number of the inode that this directory entry points to.
   * - 0x4
     - \_\_le16
     - rec\_len
     - Length of this directory entry. Must be a multiple of 4.
   * - 0x6
     - \_\_le16
     - name\_len
     - Length of the file name.
   * - 0x8
     - char
     - name[EXT4\_NAME\_LEN]
     - File name.

Since file names cannot be longer than 255 bytes, the new directory
entry format shortens the name\_len field and uses the space for a file
type flag, probably to avoid having to load every inode during directory
tree traversal. This format is ``ext4_dir_entry_2``, which is at most
263 bytes long, though on disk you'll need to reference
``dirent.rec_len`` to know for sure.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - inode
     - Number of the inode that this directory entry points to.
   * - 0x4
     - \_\_le16
     - rec\_len
     - Length of this directory entry.
   * - 0x6
     - \_\_u8
     - name\_len
     - Length of the file name.
   * - 0x7
     - \_\_u8
     - file\_type
     - File type code, see ftype_ table below.
   * - 0x8
     - char
     - name[EXT4\_NAME\_LEN]
     - File name.

.. _ftype:

The directory file type is one of the following values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x0
     - Unknown.
   * - 0x1
     - Regular file.
   * - 0x2
     - Directory.
   * - 0x3
     - Character device file.
   * - 0x4
     - Block device file.
   * - 0x5
     - FIFO.
   * - 0x6
     - Socket.
   * - 0x7
     - Symbolic link.

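The ``ext4_dir_entry_2`` layout and the rec\_len chaining described above can be illustrated with a minimal Python sketch that walks one directory block. The helper names (``min_rec_len``, ``iter_dirents``, ``make_dirent``) and the 64-byte toy block are assumptions made for illustration; they are not kernel interfaces, and real blocks are much larger.

```python
import struct

# File type codes from the table above.
FILE_TYPES = {0: "unknown", 1: "regular", 2: "directory", 3: "chardev",
              4: "blockdev", 5: "fifo", 6: "socket", 7: "symlink"}


def min_rec_len(name_len):
    """Smallest rec_len for a name: 8-byte header plus name, rounded up to 4."""
    return (8 + name_len + 3) & ~3


def iter_dirents(block):
    """Yield (inode, name, file_type) for each in-use entry in one block."""
    offset = 0
    while offset + 8 <= len(block):
        inode, rec_len, name_len, ftype = struct.unpack_from("<IHBB", block, offset)
        if rec_len < 8 or rec_len % 4 or offset + rec_len > len(block):
            raise ValueError(f"corrupt rec_len {rec_len} at offset {offset}")
        if inode != 0:  # inode == 0 marks an unused entry
            name = block[offset + 8:offset + 8 + name_len]
            yield inode, name, FILE_TYPES.get(ftype, "invalid")
        offset += rec_len  # the last entry's rec_len reaches the end of the block


def make_dirent(inode, name, ftype, rec_len=None):
    """Hypothetical helper: build one on-disk entry for the demo below."""
    rec_len = rec_len or min_rec_len(len(name))
    header = struct.pack("<IHBB", inode, rec_len, len(name), ftype)
    return (header + name).ljust(rec_len, b"\0")


# A 64-byte toy "block": '.', '..', then one file whose rec_len reaches the end.
block = (make_dirent(2, b".", 2) + make_dirent(2, b"..", 2)
         + make_dirent(12, b"hello.txt", 1, rec_len=64 - 24))
entries = list(iter_dirents(block))
```

The walk never looks for a terminator entry; it stops when the accumulated record lengths reach the block size, which is why the final entry's rec\_len is padded out to the end of the block.
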
In order to add checksums to these classic directory blocks, a phony
``struct ext4_dir_entry`` is placed at the end of each leaf block to
hold the checksum. The directory entry is 12 bytes long. The inode
number and name\_len fields are set to zero to fool old software into
ignoring an apparently empty directory entry, and the checksum is stored
in the place where the name normally goes. The structure is
``struct ext4_dir_entry_tail``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - det\_reserved\_zero1
     - Inode number, which must be zero.
   * - 0x4
     - \_\_le16
     - det\_rec\_len
     - Length of this directory entry, which must be 12.
   * - 0x6
     - \_\_u8
     - det\_reserved\_zero2
     - Length of the file name, which must be zero.
   * - 0x7
     - \_\_u8
     - det\_reserved\_ft
     - File type, which must be 0xDE.
   * - 0x8
     - \_\_le32
     - det\_checksum
     - Directory leaf block checksum.

The leaf directory block checksum is calculated against the FS UUID, the
directory's inode number, the directory's inode generation number, and
the entire directory entry block up to (but not including) the fake
directory entry.

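As a rough sketch of how a reader might recognise the tail and work out which bytes the checksum covers, the following Python fragment checks the fixed fields of ``struct ext4_dir_entry_tail`` and lists the checksum inputs in the order the text names them. The function names are hypothetical, and the checksum function itself is deliberately left out because this section does not specify it; encoding the inode and generation numbers as little-endian 32-bit values is likewise an assumption.

```python
import struct

TAIL_SIZE = 12          # det_rec_len must be 12
TAIL_FILE_TYPE = 0xDE   # det_reserved_ft must be 0xDE


def split_checksummed_block(block):
    """Return (covered_bytes, stored_checksum), or None if there is no tail."""
    if len(block) < TAIL_SIZE:
        return None
    inode, rec_len, name_len, ftype, csum = struct.unpack_from(
        "<IHBBI", block, len(block) - TAIL_SIZE)
    if (inode, rec_len, name_len, ftype) != (0, TAIL_SIZE, 0, TAIL_FILE_TYPE):
        return None
    return block[:-TAIL_SIZE], csum


def checksum_inputs(fs_uuid, dir_ino, dir_gen, block):
    """List the byte strings that feed the checksum, in the order given above.

    Assumption: inode and generation numbers are fed as little-endian
    32-bit values. The checksum algorithm is not modelled here.
    """
    parts = split_checksummed_block(block)
    if parts is None:
        raise ValueError("block has no ext4_dir_entry_tail")
    covered, _stored = parts
    return [fs_uuid, struct.pack("<I", dir_ino), struct.pack("<I", dir_gen), covered]


# Toy 32-byte block: 20 bytes of entries followed by a tail holding 0x12345678.
body = bytes(20)
tail = struct.pack("<IHBBI", 0, 12, 0, 0xDE, 0x12345678)
covered, stored = split_checksummed_block(body + tail)
```

A block without a valid tail simply yields ``None`` here; what a real implementation does in that case is outside the scope of this sketch.
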
Hash Tree Directories
~~~~~~~~~~~~~~~~~~~~~

A linear array of directory entries isn't great for performance, so a
new feature was added to ext3 to provide a faster (but peculiar)
balanced tree keyed off a hash of the directory entry name. If the
EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
hashed btree (htree) to organize and find directory entries. For
backwards read-only compatibility with ext2, this tree is actually
hidden inside the directory file, masquerading as “empty” directory data
blocks! It was stated previously that unused directory entries are
signified by inode = 0; this is (ab)used, together with a record length
that covers the htree data, to fool the old linear-scan algorithm into
thinking that the rest of the directory block is empty so that it moves
on.

The root of the tree always lives in the first data block of the
directory. By ext2 custom, the '.' and '..' entries must appear at the
beginning of this first block, so they are put here as two
``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
the root node contains metadata about the tree and finally a hash->block
map to find nodes that are lower in the htree. If
``dx_root.info.indirect_levels`` is non-zero then the htree has two
levels; the data block pointed to by the root node's map is an interior
node, which is indexed by a minor hash. Interior nodes in this tree
contain a zeroed out ``struct ext4_dir_entry_2`` followed by a
minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear
array of all ``struct ext4_dir_entry_2``; all of these entries
(presumably) hash to the same value. If there is an overflow, the
entries simply overflow into the next leaf node, and the
least-significant bit of the hash (in the interior node map) that gets
us to this next leaf node is set.

To traverse the directory as an htree, the code calculates the hash of
the desired file name and uses it to find the corresponding block
number. If the tree is flat, the block is a linear array of directory
entries that can be searched; otherwise, the minor hash of the file name
is computed and used against this second block to find the corresponding
third block number. That third block number will be a linear array of
directory entries.

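The lookup against a single hash->block map can be sketched as a search over sorted hash boundaries. This is only an illustration under stated assumptions: the name hash is taken as already computed, the map is assumed to be sorted by hash, ``dx_lookup`` is a hypothetical name, and the overflow bit described earlier is ignored.

```python
from bisect import bisect_right


def dx_lookup(first_block, entries, target_hash):
    """Pick the block whose hash range contains target_hash.

    first_block is the block number stored in the map header, which covers
    the lowest hashes; entries is a list of (hash, block) pairs sorted by
    hash, each pair marking the start of a new range.
    """
    hashes = [h for h, _ in entries]
    i = bisect_right(hashes, target_hash)  # number of boundaries <= target_hash
    return first_block if i == 0 else entries[i - 1][1]


# Toy map: block 1 covers hashes below 0x40000000, block 2 the middle range,
# and block 3 everything from 0x80000000 upwards.
toy_map = [(0x40000000, 2), (0x80000000, 3)]
low = dx_lookup(1, toy_map, 0x00000010)
mid = dx_lookup(1, toy_map, 0x40000000)
high = dx_lookup(1, toy_map, 0xFFFFFFFF)
```

In a flat tree the returned block is already a leaf to scan linearly; with an interior level, the same kind of lookup would be repeated against the interior node's map.
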
To traverse the directory as a linear array (such as the old code does),
the code simply reads every data block in the directory. The blocks used
for the htree will appear to have no entries (aside from '.' and '..')
and so only the leaf nodes will appear to have any interesting content.

The root of the htree is in ``struct dx_root``, which is the full length
of a data block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - dot.inode
     - inode number of this directory.
   * - 0x4
     - \_\_le16
     - dot.rec\_len
     - Length of this record, 12.
   * - 0x6
     - u8
     - dot.name\_len
     - Length of the name, 1.
   * - 0x7
     - u8
     - dot.file\_type
     - File type of this entry, 0x2 (directory) (if the feature flag is set).
   * - 0x8
     - char
     - dot.name[4]
     - “.\\0\\0\\0”
   * - 0xC
     - \_\_le32
     - dotdot.inode
     - inode number of parent directory.
   * - 0x10
     - \_\_le16
     - dotdot.rec\_len
     - block\_size - 12. The record length is long enough to cover all htree
       data.
   * - 0x12
     - u8
     - dotdot.name\_len
     - Length of the name, 2.
   * - 0x13
     - u8
     - dotdot.file\_type
     - File type of this entry, 0x2 (directory) (if the feature flag is set).
   * - 0x14
     - char
     - dotdot\_name[4]
     - “..\\0\\0”
   * - 0x18
     - \_\_le32
     - struct dx\_root\_info.reserved\_zero
     - Zero.
   * - 0x1C
     - u8
     - struct dx\_root\_info.hash\_version
     - Hash type, see dirhash_ table below.
   * - 0x1D
     - u8
     - struct dx\_root\_info.info\_length
     - Length of the tree information, 0x8.
   * - 0x1E
     - u8
     - struct dx\_root\_info.indirect\_levels
     - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
       feature is set; cannot be larger than 2 otherwise.
   * - 0x1F
     - u8
     - struct dx\_root\_info.unused\_flags
     -
   * - 0x20
     - \_\_le16
     - limit
     - Maximum number of dx\_entries that can follow this header, plus 1 for
       the header itself.
   * - 0x22
     - \_\_le16
     - count
     - Actual number of dx\_entries that follow this header, plus 1 for the
       header itself.
   * - 0x24
     - \_\_le32
     - block
     - The block number (within the directory file) that goes with hash=0.
   * - 0x28
     - struct dx\_entry
     - entries[0]
     - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.

.. _dirhash:

The directory hash is one of the following values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x0
     - Legacy.
   * - 0x1
     - Half MD4.
   * - 0x2
     - Tea.
   * - 0x3
     - Legacy, unsigned.
   * - 0x4
     - Half MD4, unsigned.
   * - 0x5
     - Tea, unsigned.

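To make the ``struct dx_root`` offsets concrete, here is a small Python sketch that decodes the fields listed above from the first block of an htree directory. ``parse_dx_root`` and the dictionary keys are hypothetical names chosen for this example, and the 1024-byte block is hand-built test data rather than output from a real filesystem.

```python
import struct

# Hash type codes from the table above.
HASH_VERSIONS = {0: "legacy", 1: "half_md4", 2: "tea",
                 3: "legacy_unsigned", 4: "half_md4_unsigned", 5: "tea_unsigned"}


def parse_dx_root(block):
    """Decode the htree root fields at the offsets given in the table."""
    dot_inode, _dot_rec_len = struct.unpack_from("<IH", block, 0x0)
    dotdot_inode, _dotdot_rec_len = struct.unpack_from("<IH", block, 0xC)
    _zero, hash_version, info_length, indirect_levels, _flags = \
        struct.unpack_from("<IBBBB", block, 0x18)
    limit, count, first_block = struct.unpack_from("<HHI", block, 0x20)
    # count includes the header slot, so count - 1 dx_entry records follow.
    entries = [struct.unpack_from("<II", block, 0x28 + 8 * i)
               for i in range(count - 1)]
    return {
        "dir_inode": dot_inode,
        "parent_inode": dotdot_inode,
        "hash_version": HASH_VERSIONS.get(hash_version, "unknown"),
        "info_length": info_length,
        "indirect_levels": indirect_levels,
        "limit": limit,
        "count": count,
        "first_block": first_block,
        "entries": entries,  # list of (hash, block) pairs
    }


# Hand-built 1024-byte root block for illustration only.
raw = bytearray(1024)
struct.pack_into("<IHBB4s", raw, 0x0, 2, 12, 1, 2, b".")
struct.pack_into("<IHBB4s", raw, 0xC, 2, 1024 - 12, 2, 2, b"..")
struct.pack_into("<IBBBB", raw, 0x18, 0, 1, 8, 0, 0)   # half MD4, flat tree
struct.pack_into("<HHI", raw, 0x20, 124, 2, 1)         # limit, count, block
struct.pack_into("<II", raw, 0x28, 0x80000000, 2)      # one dx_entry
root = parse_dx_root(bytes(raw))
```
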
Interior nodes of an htree are recorded as ``struct dx_node``, which is
also the full length of a data block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - fake.inode
     - Zero, to make it look like this entry is not in use.
   * - 0x4
     - \_\_le16
     - fake.rec\_len
     - The size of the block, in order to hide all of the dx\_node data.
   * - 0x6
     - u8
     - name\_len
     - Zero. There is no name for this “unused” directory entry.
   * - 0x7
     - u8
     - file\_type
     - Zero. There is no file type for this “unused” directory entry.
   * - 0x8
     - \_\_le16
     - limit
     - Maximum number of dx\_entries that can follow this header, plus 1 for
       the header itself.
   * - 0xA
     - \_\_le16
     - count
     - Actual number of dx\_entries that follow this header, plus 1 for the
       header itself.
   * - 0xC
     - \_\_le32
     - block
     - The block number (within the directory file) that goes with the lowest
       hash value of this block. This value is stored in the parent block.
   * - 0x10
     - struct dx\_entry
     - entries[0]
     - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.

The hash maps that exist in both ``struct dx_root`` and
``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - hash
     - Hash code.
   * - 0x4
     - \_\_le32
     - block
     - Block number (within the directory file, not filesystem blocks) of the
       next node in the htree.

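An interior node can be decoded in much the same way as the root. The sketch below assumes the header fields are packed without padding, so that the two 16-bit ``limit`` and ``count`` fields at 0x8 are followed immediately by ``block`` at 0xC and the ``struct dx_entry`` array at 0x10; ``parse_dx_node`` is a hypothetical name and the block is hand-built.

```python
import struct


def parse_dx_node(block):
    """Decode an htree interior node: (limit, count, first_block, entries)."""
    inode, _rec_len, name_len, _file_type = struct.unpack_from("<IHBB", block, 0x0)
    if inode != 0 or name_len != 0:
        raise ValueError("block does not start with a fake empty dirent")
    limit, count, first_block = struct.unpack_from("<HHI", block, 0x8)
    # count includes the header slot, so count - 1 dx_entry records follow.
    entries = [struct.unpack_from("<II", block, 0x10 + 8 * i)
               for i in range(count - 1)]
    return limit, count, first_block, entries


# Hand-built 1024-byte interior node for illustration only.
raw = bytearray(1024)
struct.pack_into("<IHBB", raw, 0x0, 0, 1024, 0, 0)   # fake dirent hides the block
struct.pack_into("<HHI", raw, 0x8, 127, 3, 5)        # limit, count, block
struct.pack_into("<II", raw, 0x10, 0x55555554, 6)    # dx_entry: hash, block
struct.pack_into("<II", raw, 0x18, 0xAAAAAAAA, 7)
limit, count, first_block, entries = parse_dx_node(bytes(raw))
```

Each ``(hash, block)`` pair refers to a block number within the directory file, so a caller would still need to map it to a filesystem block before reading.
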
(If you think this is all quite clever and peculiar, so does the
author.)

If metadata checksums are enabled, the last 8 bytes of the directory
block (precisely the length of one dx\_entry) are used to store a
``struct dx_tail``, which contains the checksum. The ``limit`` and
``count`` entries in the dx\_root/dx\_node structures are adjusted as
necessary to fit the dx\_tail into the block. If there is no space for
the dx\_tail, the user is notified to run ``e2fsck -D`` to rebuild the
directory index (which will ensure that there's space for the checksum).
The dx\_tail structure is 8 bytes long and looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - u32
     - dt\_reserved
     - Zero.
   * - 0x4
     - \_\_le32
     - dt\_checksum
     - Checksum of the htree directory block.

The checksum is calculated against the FS UUID, the htree index header
(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
use, and the tail block (dx\_tail).
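
The effect of the dx\_tail on ``limit`` can be worked out from the sizes given in this section. The following sketch is arithmetic only, under the assumption that ``limit`` is simply the space after the fixed header divided by the 8-byte dx\_entry size, with one slot given up to the tail when checksums are enabled; ``dx_limit`` and the constant names are hypothetical.

```python
DX_ENTRY_SIZE = 8
DX_TAIL_SIZE = 8
DX_ROOT_HEADER = 0x20  # dot, dotdot and dx_root_info precede the map
DX_NODE_HEADER = 0x8   # the fake directory entry precedes the map


def dx_limit(block_size, header_size, metadata_csum):
    """Number of 8-byte map slots, counting the limit/count/block header slot."""
    space = block_size - header_size
    if metadata_csum:
        space -= DX_TAIL_SIZE  # the last 8 bytes hold struct dx_tail
    return space // DX_ENTRY_SIZE


root_plain = dx_limit(4096, DX_ROOT_HEADER, False)
root_csum = dx_limit(4096, DX_ROOT_HEADER, True)
node_plain = dx_limit(4096, DX_NODE_HEADER, False)
node_csum = dx_limit(4096, DX_NODE_HEADER, True)
```

With 4096-byte blocks this gives one fewer slot per index block once the tail is reserved, which is the adjustment to ``limit`` that the text describes.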