David Brazdil | 0f672f6 | 2019-12-10 10:32:29 +0000 | 1 | :orphan: |
| 2 | |
| 3 | Making Filesystems Exportable |
| 4 | ============================= |
| 5 | |
| 6 | Overview |
| 7 | -------- |
| 8 | |
| 9 | All filesystem operations require a dentry (or two) as a starting |
| 10 | point. Local applications have a reference-counted hold on suitable |
| 11 | dentries via open file descriptors or cwd/root. However, remote |
| 12 | applications that access a filesystem via a remote filesystem protocol |
| 13 | such as NFS may not be able to hold such a reference, and so need a |
| 14 | different way to refer to a particular dentry. As the alternative |
| 15 | form of reference needs to be stable across renames, truncates, and |
| 16 | server-reboot (among other things, though these tend to be the most |
| 17 | problematic), there is no simple answer like 'filename'. |
| 18 | |
| 19 | The mechanism discussed here allows each filesystem implementation to |
| 20 | specify how to generate an opaque (outside of the filesystem) byte |
| 21 | string for any dentry, and how to find an appropriate dentry for any |
| 22 | given opaque byte string. |
| 23 | This byte string will be called a "filehandle fragment" as it |
| 24 | corresponds to part of an NFS filehandle. |
| 25 | |
| 26 | A filesystem which supports the mapping between filehandle fragments |
| 27 | and dentries will be termed "exportable". |
| 28 | |
| 29 | |
| 30 | |
| 31 | Dcache Issues |
| 32 | ------------- |
| 33 | |
| 34 | The dcache normally contains a proper prefix of any given filesystem |
| 35 | tree. This means that if any filesystem object is in the dcache, then |
| 36 | all of the ancestors of that filesystem object are also in the dcache. |
| 37 | As normal access is by filename, this prefix is created naturally and |
| 38 | maintained easily (by each object maintaining a reference count on |
| 39 | its parent). |
| 40 | |
| 41 | However, when objects are included into the dcache by interpreting a |
| 42 | filehandle fragment, there is no automatic creation of a path prefix |
| 43 | for the object. This leads to two related but distinct features of |
| 44 | the dcache that are not needed for normal filesystem access. |
| 45 | |
| 46 | 1. The dcache must sometimes contain objects that are not part of the |
| 47 | proper prefix, i.e. that are not connected to the root. |
| 48 | 2. The dcache must be prepared for a newly found (via ->lookup) directory |
| 49 | to already have a (non-connected) dentry, and must be able to move |
| 50 | that dentry into place (based on the parent and name in the |
| 51 | ->lookup). This is particularly needed for directories as |
| 52 | it is a dcache invariant that directories only have one dentry. |
| 53 | |
| 54 | To implement these features, the dcache has: |
| 55 | |
| 56 | a. A dentry flag DCACHE_DISCONNECTED which is set on |
| 57 | any dentry that might not be part of the proper prefix. |
| 58 | This is set when anonymous dentries are created, and cleared when a |
| 59 | dentry is noticed to be a child of a dentry which is in the proper |
| 60 | prefix. If the refcount on a dentry with this flag set |
| 61 | becomes zero, the dentry is immediately discarded, rather than being |
| 62 | kept in the dcache. If a dentry that is not already in the dcache |
| 63 | is repeatedly accessed by filehandle (as NFSD might do), a new dentry |
| 64 | will be allocated for each access, and discarded at the end of |
| 65 | the access. |
| 66 | |
| 67 | Note that such a dentry can acquire children, a name, ancestors, etc. |
| 68 | without losing DCACHE_DISCONNECTED - that flag is only cleared when |
| 69 | the subtree is successfully reconnected to the root. Until then, dentries |
| 70 | in such a subtree are retained only as long as there are references; |
| 71 | refcount reaching zero means immediate eviction, same as for unhashed |
| 72 | dentries. That guarantees that we won't need to hunt them down upon |
| 73 | umount. |
| 74 | |
| 75 | b. A primitive for creation of secondary roots - d_obtain_root(inode). |
| 76 | Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the |
| 77 | per-superblock list (->s_roots), so they can be located at umount |
| 78 | time for eviction purposes. |
| 79 | |
| 80 | c. Helper routines to allocate anonymous dentries, and to help attach |
| 81 | loose directory dentries at lookup time. They are: |
| 82 | |
| 83 | d_obtain_alias(inode) will return a dentry for the given inode. |
| 84 | If the inode already has a dentry, one of those is returned. |
| 85 | |
| 86 | If it doesn't, a new anonymous (IS_ROOT and |
| 87 | DCACHE_DISCONNECTED) dentry is allocated and attached. |
| 88 | |
| 89 | In the case of a directory, care is taken that only one dentry |
| 90 | can ever be attached. |
| 91 | |
| 92 | d_splice_alias(inode, dentry) will introduce a new dentry into the tree; |
| 93 | either the passed-in dentry or a preexisting alias for the given inode |
| 94 | (such as an anonymous one created by d_obtain_alias), if appropriate. |
| 95 | It returns NULL when the passed-in dentry is used, following the calling |
| 96 | convention of ->lookup. |
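
The behaviour of the two helpers described above can be illustrated with a
small userspace model. This is only a sketch, not kernel code: the names
toy_inode, toy_dentry, toy_obtain_alias and toy_splice_alias are hypothetical,
each inode tracks at most one alias, and reference counting, hashing and
locking are left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dentry;

/* Hypothetical stand-in for an inode; tracks a single alias only. */
struct toy_inode {
	int is_dir;
	struct toy_dentry *alias;
};

/* Hypothetical stand-in for a dentry. */
struct toy_dentry {
	struct toy_inode *inode;	/* NULL means a negative dentry */
	struct toy_dentry *parent;	/* points to itself when IS_ROOT */
	int disconnected;		/* models DCACHE_DISCONNECTED */
};

/* Model of d_obtain_alias(): reuse an existing alias, else make an
 * anonymous (IS_ROOT and disconnected) dentry and attach it. */
struct toy_dentry *toy_obtain_alias(struct toy_inode *inode)
{
	struct toy_dentry *d;

	assert(inode);
	if (inode->alias)
		return inode->alias;
	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->inode = inode;
	d->parent = d;
	d->disconnected = 1;
	inode->alias = d;
	return d;
}

/* Model of d_splice_alias(): 'dentry' is the fresh dentry handed to
 * ->lookup, with its parent already set. Returns NULL when 'dentry'
 * itself is used, or the preexisting directory alias after moving it
 * under the new parent. Clearing the disconnected flag is a separate
 * step (once the subtree reaches the root) and is not modelled here. */
struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
				    struct toy_dentry *dentry)
{
	struct toy_dentry *alias;

	assert(dentry);
	if (!inode)
		return NULL;	/* negative lookup: dentry stays negative */
	alias = inode->alias;
	if (alias && inode->is_dir) {
		alias->parent = dentry->parent;	/* move into place */
		return alias;
	}
	dentry->inode = inode;
	if (!alias)
		inode->alias = dentry;
	return NULL;
}
```

In this model a directory reached first by filehandle and later by name keeps
its single dentry: the second path returns the anonymous dentry, now placed
under the parent supplied by the lookup.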
| 97 | |
| 98 | Filesystem Issues |
| 99 | ----------------- |
| 100 | |
| 101 | For a filesystem to be exportable it must: |
| 102 | |
| 103 | 1. provide the filehandle fragment routines described below. |
| 104 | 2. make sure that d_splice_alias is used rather than d_add |
| 105 | when ->lookup finds an inode for a given parent and name. |
| 106 | |
| 107 | If inode is NULL, d_splice_alias(inode, dentry) is equivalent to:: |
| 108 | |
| 109 | d_add(dentry, inode), NULL |
| 110 | |
| 111 | Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) |
| 112 | |
| 113 | Typically the ->lookup routine will simply end with a:: |
| 114 | |
| 115 | return d_splice_alias(inode, dentry); |
| 116 | } |
| 117 | |
| 118 | |
| 119 | |
| 120 | A filesystem implementation declares that instances of the filesystem |
| 121 | are exportable by setting the s_export_op field in the struct |
| 122 | super_block. This field must point to a "struct export_operations" |
| 123 | struct which has the following members: |
| 124 | |
| 125 | encode_fh (optional) |
| 126 | Takes a dentry and creates a filehandle fragment which can later be used |
| 127 | to find or create a dentry for the same object. The default |
| 128 | implementation creates a filehandle fragment that encodes a 32-bit inode |
| 129 | number and generation number for the inode encoded, and if necessary the |
| 130 | same information for the parent. |
| 131 | |
| 132 | fh_to_dentry (mandatory) |
| 133 | Given a filehandle fragment, this should find the implied object and |
| 134 | create a dentry for it (possibly with d_obtain_alias). |
| 135 | |
| 136 | fh_to_parent (optional but strongly recommended) |
| 137 | Given a filehandle fragment, this should find the parent of the |
| 138 | implied object and create a dentry for it (possibly with |
| 139 | d_obtain_alias). May fail if the filehandle fragment is too small. |
| 140 | |
| 141 | get_parent (optional but strongly recommended) |
| 142 | When given a dentry for a directory, this should return a dentry for |
| 143 | the parent. Quite possibly the parent dentry will have been allocated |
| 144 | by d_obtain_alias. The default get_parent function just returns an error |
| 145 | so any filehandle lookup that requires finding a parent will fail. |
| 146 | ->lookup("..") is *not* used as a default as it can leave ".." entries |
| 147 | in the dcache which are too messy to work with. |
| 148 | |
| 149 | get_name (optional) |
| 150 | When given a parent dentry and a child dentry, this should find a name |
| 151 | in the directory identified by the parent dentry, which leads to the |
| 152 | object identified by the child dentry. If no get_name function is |
| 153 | supplied, a default implementation is provided which iterates over the |
| 154 | directory (iterate_dir; vfs_readdir in older kernels) to find potential |
| 155 | names, and matches inode numbers to find the correct match. |
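
The split between mandatory and optional members can be pictured with a
reduced userspace model of the operations table. This is a sketch under
stated assumptions: toy_export_ops, toy_check_exportable, toy_get_parent and
the demo_* names are hypothetical, only two of the members are modelled, and
the error codes are choices made for the sketch rather than values taken from
this document.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry. */
struct toy_dentry {
	unsigned int ino;
};

/* Reduced model of the operations table: one mandatory member and
 * one optional member. */
struct toy_export_ops {
	struct toy_dentry *(*fh_to_dentry)(const unsigned int *fh,
					   int words, int type);
	struct toy_dentry *(*get_parent)(struct toy_dentry *child);
};

/* A table without fh_to_dentry cannot resolve any filehandle. */
int toy_check_exportable(const struct toy_export_ops *ops)
{
	if (!ops || !ops->fh_to_dentry)
		return -EINVAL;	/* error code is an assumption */
	return 0;
}

/* Default behaviour for a missing get_parent: report an error, so any
 * filehandle lookup that needs the parent fails. */
struct toy_dentry *toy_get_parent(const struct toy_export_ops *ops,
				  struct toy_dentry *child, int *err)
{
	assert(ops && child && err);
	if (!ops->get_parent) {
		*err = -EACCES;	/* error code is an assumption */
		return NULL;
	}
	*err = 0;
	return ops->get_parent(child);
}

/* A demo filesystem with two objects and no get_parent. */
static struct toy_dentry demo_root = { 1 };
static struct toy_dentry demo_file = { 2 };

static struct toy_dentry *demo_fh_to_dentry(const unsigned int *fh,
					    int words, int type)
{
	if (type != 1 || words < 1)
		return NULL;
	if (fh[0] == demo_root.ino)
		return &demo_root;
	if (fh[0] == demo_file.ino)
		return &demo_file;
	return NULL;
}

static const struct toy_export_ops demo_ops = {
	.fh_to_dentry = demo_fh_to_dentry,
	/* .get_parent left NULL: parent lookups will fail */
};
```

The demo filesystem is exportable in this model, but because it omits
get_parent any operation that has to walk upward from a disconnected
directory fails, which is why that member is strongly recommended.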
| 156 | |
| 157 | |
| 158 | A filehandle fragment consists of an array of 1 or more 4-byte words, |
| 159 | together with a one-byte "type". |
| 160 | The decoding routines (fh_to_dentry and fh_to_parent) should not depend |
| 161 | on the stated size that is passed to them. This size may be larger than |
| 162 | the original filehandle generated by encode_fh, in which case it will |
| 163 | have been padded with nuls. Rather, the encode_fh routine should choose |
| 164 | a "type" which indicates to the decoding routines how much of the |
| 165 | filehandle is valid, and how it should be interpreted. |
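
The role of the "type" byte can be shown with a small userspace sketch. It is
not the kernel's implementation: the type codes, struct toy_ids and the
toy_encode_fh/toy_decode_fh helpers are hypothetical names chosen for
illustration. The encoder picks a type that records how many words it wrote;
the decoder derives the valid length from that type and uses the stated size
only as a bounds check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes for this sketch (not the kernel's values). */
enum toy_fh_type {
	TOY_FH_INO_GEN = 1,		/* 2 words: ino, gen */
	TOY_FH_INO_GEN_PARENT = 2,	/* 4 words: ino, gen, parent ino, parent gen */
};

struct toy_ids {
	uint32_t ino, gen;
	uint32_t parent_ino, parent_gen;
	int has_parent;
};

/* On entry *len_words is the buffer capacity in 4-byte words; on
 * return it is the number of words needed. Returns the chosen type,
 * or -1 if the buffer is too small. */
int toy_encode_fh(const struct toy_ids *ids, uint32_t *fh, size_t *len_words)
{
	size_t need;

	assert(ids && fh && len_words);
	need = ids->has_parent ? 4 : 2;
	if (*len_words < need) {
		*len_words = need;
		return -1;
	}
	fh[0] = ids->ino;
	fh[1] = ids->gen;
	if (ids->has_parent) {
		fh[2] = ids->parent_ino;
		fh[3] = ids->parent_gen;
	}
	*len_words = need;
	return ids->has_parent ? TOY_FH_INO_GEN_PARENT : TOY_FH_INO_GEN;
}

/* The type, not stated_words, decides how many words are meaningful;
 * stated_words may include trailing padding. Returns 0 or -1. */
int toy_decode_fh(const uint32_t *fh, size_t stated_words, int type,
		  struct toy_ids *out)
{
	size_t valid;

	assert(fh && out);
	switch (type) {
	case TOY_FH_INO_GEN:
		valid = 2;
		break;
	case TOY_FH_INO_GEN_PARENT:
		valid = 4;
		break;
	default:
		return -1;	/* unknown type: cannot interpret */
	}
	if (stated_words < valid)
		return -1;	/* truncated fragment */
	out->ino = fh[0];
	out->gen = fh[1];
	out->has_parent = (valid == 4);
	out->parent_ino = out->has_parent ? fh[2] : 0;
	out->parent_gen = out->has_parent ? fh[3] : 0;
	return 0;
}
```

Decoding a two-word fragment that arrives padded to eight words still yields
the same inode and generation, because the padding lies beyond the length the
type implies.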