
SURVEY OF LOG STRUCTURED FILESYSTEMS FOR FLASH DEVICES WITH FOCUS ON F2FS

NAME: SAYANTAN CHATTERJEE M.TECH (I.T), SECOND YEAR ROLL NO: 30008012013 REG NO: 123000410092 OF 2012-2013 Guided By: Mr. Santanu Chatterjee

Contents:
Introduction
Flash Memory
Issues of Flash Memory File System
Log-Structured File System
Issues of Log-Structured File System
Flash Translation Layer
Flash Friendly File System
Design of On-Disk Layout
Inode Structure
Directory Structure
Solution of Wandering Tree Problem
F2FS Garbage Collection
F2FS Adaptive Write Policy
Threaded Log
Conclusion
References

Introduction: In recent years flash-based storage has become popular because of its low cost and high capacity. Flash devices have replaced the traditional EEPROM devices used in computers and serve as storage in digital devices such as digital cameras and mobile phones. Their structure and construction, however, differ considerably from those of conventional magnetic hard drives. As a result, traditional file systems such as NTFS and ext4 are inefficient on flash-based devices. A group of log-structured file systems is therefore used with flash memory. This document is a basic introduction to log-structured file systems, with special focus on one of them, the Flash-Friendly File System (F2FS).

Flash Memory: Flash devices are non-volatile memory devices: they retain their contents without power. They are built as either NOR or NAND devices. In flash memory an erased bit reads as 1. Writing data flips selected bits from 1 to 0, and erasing flips all 0 bits back to 1. Individual bytes cannot be erased; data is erased in units called erase-blocks. Because writing a byte may require changing one or more bits from 0 to 1, the whole block containing the byte must first be erased. The sequence of copying the contents of an erase-block to memory, erasing the block, and then writing the updated in-core contents back to it is called an erase-cycle. Every erase-block survives only a limited number of erase-cycles, after which it becomes unusable.
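The write and erase rules above can be illustrated with a small model. This is only a sketch: the class name EraseBlock, the block size of 8 bytes and the limit of 3 erase-cycles are assumptions chosen for illustration, and real devices have far larger blocks and limits.

```python
class EraseBlock:
    """Toy model of one erase-block (size and cycle limit are illustrative)."""

    def __init__(self, size=8, max_erase_cycles=3):
        self.cells = [0xFF] * size            # erased state: every bit is 1
        self.erase_count = 0
        self.max_erase_cycles = max_erase_cycles

    def program(self, offset, value):
        """Write a byte. Programming can only flip bits from 1 to 0."""
        current = self.cells[offset]
        if value & ~current & 0xFF:           # some bit would have to go 0 -> 1
            raise ValueError("needs a 0 -> 1 flip; erase the block first")
        self.cells[offset] = current & value

    def erase(self):
        """Reset every bit in the block to 1, consuming one erase-cycle."""
        if self.erase_count >= self.max_erase_cycles:
            raise RuntimeError("erase-block is worn out")
        self.cells = [0xFF] * len(self.cells)
        self.erase_count += 1

    def update(self, offset, value):
        """Erase-cycle: copy to memory, erase, write the modified copy back."""
        snapshot = list(self.cells)
        snapshot[offset] = value
        self.erase()
        for i, byte in enumerate(snapshot):
            self.program(i, byte)


block = EraseBlock()
block.program(0, 0b10101010)        # allowed: only 1 -> 0 flips
block.program(0, 0b10100000)        # allowed: clears two more bits
try:
    block.program(0, 0b11110000)    # would need 0 -> 1 flips
except ValueError as err:
    print("rejected:", err)
block.update(0, 0b11110000)         # works, but costs one erase-cycle
print(block.cells[0] == 0b11110000, block.erase_count)
```

Changing a single byte in place costs the whole block one of its limited erase-cycles, which is the cost a flash file system tries to avoid.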

This limitation has led to the unique requirements of a flash memory file system.

Issues of Flash Memory File System: The most important task of a flash memory file system is to extend the life of each erase-block as far as possible. On traditional hard drives data is often updated in place. The main sources of inefficiency in a hard disk are the rotational latency and seek time of the mechanical read/write head, so it makes sense to reduce head movement by updating in place. Flash devices have no moving parts, so rotational latency is not an issue. Moreover, frequently updating a particular erase-block quickly uses up its lifetime. Flash file systems therefore adopt a copy-on-write (COW) policy: applications reading the same block of data share a single in-core copy of the block, and writes to that block go to separate empty blocks. A second, related task is to choose blocks for writing so that no block wears out sooner than the others. This is called wear-leveling. As data changes, blocks become fragmented, holding a mix of current and obsolete data. Erase-blocks that still contain non-obsolete data are said to be live. Once no empty block remains on the device, the space held by obsolete data must be reclaimed. This is called garbage collection.

Log-Structured File System: A log-structured file system is a file system in which data and metadata are written sequentially to a circular buffer called a log. The design was first proposed in 1988 and implemented in 1992 by John K. Ousterhout and Mendel Rosenblum in the form of the Sprite file system.

Log-structured file systems were originally designed for conventional optical and magnetic media. Because all writes go sequentially to the end of the log, write throughput increases. Crash recovery also becomes simpler: on its next mount the file system does not need to walk all its data structures to fix inconsistencies, but can reconstruct its state from the last consistent point in the log. Another feature, snapshotting, makes old versions of data in the log accessible and nameable, which provides a form of versioning.

Issues of Log-Structured File System: One major issue of log-structured file systems is reclaiming free space once the tail end of the log is reached. In a plain log-structured file system (e.g. Sprite) this is done by advancing the tail past obsolete data; when live data is found, it is appended to the head of the log. To reduce the overhead of this garbage collection, most implementations avoid purely circular logs and divide their storage into segments. The head of the log simply advances into non-adjacent segments that are already free. If space is needed, the least-full segments are reclaimed first. This decreases the I/O load of the garbage collector, but becomes increasingly ineffective as the file system fills up and nears capacity. Another issue concerns metadata propagation and is known as the wandering-tree problem. In Linux-like systems every file has an inode that contains data about the file (metadata) together with the addresses of its data blocks, or of indirect blocks that hold further block addresses. Because a log-structured file system never overwrites a block in place, an updated data block lands at a new address, so the corresponding direct block, indirect blocks and inode must also be rewritten. This causes considerable overhead.
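The wandering-tree effect can be made concrete with a toy append-only log. The sketch below is illustrative only: the block layout, the helper names (append, create_file, update_data) and the three-level chain are assumptions, not the on-disk format of any real file system.

```python
log = []  # append-only log; a block's address is its index in this list


def append(block):
    """Append a block at the head of the log and return its address."""
    log.append(block)
    return len(log) - 1


def create_file(payload, pointer_levels):
    """Write a data block plus the chain of blocks that point to it.

    pointer_levels counts the inode and any direct/indirect blocks between
    the inode and the data, so it must be at least 1.
    """
    addr = append({"kind": "data", "payload": payload})
    for level in range(pointer_levels):
        kind = "inode" if level == pointer_levels - 1 else "index"
        addr = append({"kind": kind, "child": addr})
    return addr  # address of the inode


def update_data(inode_addr, payload):
    """Rewrite the data block; return (new inode address, blocks appended)."""
    before = len(log)
    kinds = []
    addr = inode_addr
    while log[addr]["kind"] != "data":
        kinds.append(log[addr]["kind"])
        addr = log[addr]["child"]
    new_addr = append({"kind": "data", "payload": payload})
    # Every block on the path stores a child address, so each one changes.
    for kind in reversed(kinds):
        new_addr = append({"kind": kind, "child": new_addr})
    return new_addr, len(log) - before


inode = create_file("v1", pointer_levels=3)   # inode -> indirect -> direct -> data
inode, appended = update_data(inode, "v2")
print(appended)   # 4: one data block plus three pointer blocks
```

A change to a single data block therefore costs four block writes in this example, and the cost grows with the depth of the index tree.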

Flash Translation Layer: USB sticks, which are a type of flash device, are generally formatted with file systems like NTFS and ext4. If these file systems were allowed direct access to the raw device, its lifetime would be severely compromised. Hence an intermediate layer, called the flash translation layer (FTL), sits between the disk drivers and the flash device. The main tasks of the FTL are address translation, wear-leveling and garbage collection. The higher-level VFS can treat the flash device as a regular block device, while the FTL maps all block addresses to internal flash addresses using techniques such as sector mapping, block mapping and hybrid mapping. Although convenient, an FTL is inefficient because of the extra level of translation it requires. Hence most flash file systems, such as JFFS2 and LogFS, do not use an FTL. However, as we will see, F2FS is an exception, as it delegates many operations to the underlying FTL.

Flash-Friendly File System: F2FS (Flash-Friendly File System) is a flash file system created by Kim Jaegeuk at Samsung for the Linux kernel. Samsung made the source code of F2FS available to the Linux kernel in 2012. It takes a log-structured approach. Although other log-structured flash file systems exist, F2FS has some distinctive features. First, it is flash-aware, i.e. it takes the characteristics of the flash device into account. Second, as mentioned above, it relies on the underlying FTL for some of its work. Third, it tries to solve the wandering-tree problem. Lastly, it tries to make garbage collection more efficient.

Design of On-Disk Layout:
Superblock | Checkpoint (CP) | Segment Information Table (SIT) | Node Address Table (NAT) | Segment Summary Area | Main Area

Main Area: The on-disk layout of F2FS is divided into the main area and the metadata area. All data blocks and inode blocks (including direct and indirect blocks) are stored in the main area. The main area consists of segments of 2 MB each, and each segment contains 512 blocks of 4 KB. Each segment is a log that is written consecutively from start to finish. When a segment fills up, another free segment is chosen. F2FS groups segments into sections. By default there is one segment per section, but this can be changed. At any time six sections are open for writing. Most modern flash devices are collections of multiple independent chips that can perform their own reads and writes in parallel. F2FS takes advantage of this by collecting sections into zones (again one section per zone by default) and aligning the zones that contain the six open sections with the independent read/write chips of the device. As we will see, this not only speeds up device access but also helps during garbage collection.

Metadata Area: The metadata area contains all the information needed to make sense of the main area. F2FS uses a shadow-copy mechanism when updating metadata. Each metadata area (CP, NAT, etc.) has two fixed locations, only one of which holds the current version of the data. An update is written to the location that is currently obsolete. This safeguards against metadata corruption if a crash occurs in the middle of an update. Updating fixed locations is of course risky on flash memory, so F2FS offloads the actual writing and wear-leveling to the underlying FTL. As the metadata is small, the overhead of going through the FTL is modest. The metadata is divided into the following parts:
1. Superblock: The superblock contains read-only data concerning the whole file system. Once the file system is created the superblock never changes. It is stored at the beginning of the partition and records, for example, how big the file system is, how big the segments, sections and zones are, and how much space has been allocated to the various parts of the metadata area.
2. Checkpoint: The checkpoint (CP) area contains the remaining, dynamic file system information, such as the amount of free space and the address of the segment to which data should be written next. After a crash the file system can recover part of its state almost immediately by reading the most recent information from the CP.
3. Segment Information Table: The SIT stores 74 bytes per segment. It mainly tracks which blocks are still in active use, so that a segment can be reused when it has no active blocks, or cleaned when its active block count gets low.
4. Node Address Table: The NAT is essentially an array that holds the address of every node block (inode, direct node and indirect node) stored in the main area. The address of a node block is found by using its node number as an index into the NAT. The NAT is what F2FS uses to solve the wandering-tree problem.
5. Segment Summary Area: It contains summary entries that record the owner of every data and node block stored in the main area.
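The array-like behaviour of the NAT described in item 4 can be sketched as follows. This is a simplified illustration: the class NodeAddressTable, its method names and the block addresses 1000 and 2048 are hypothetical and do not reflect the real on-disk NAT entry format.

```python
class NodeAddressTable:
    """Toy NAT: the index is a node ID, the value is that node's block address."""

    def __init__(self):
        self.block_addr = []

    def allocate(self, addr):
        """Register a newly written node block and return its node ID."""
        self.block_addr.append(addr)
        return len(self.block_addr) - 1

    def lookup(self, nid):
        """Translate a node ID into the node's current block address."""
        return self.block_addr[nid]

    def relocate(self, nid, new_addr):
        """Record that the node was rewritten at a new block address."""
        self.block_addr[nid] = new_addr


nat = NodeAddressTable()
direct_nid = nat.allocate(addr=1000)      # direct node first written at block 1000
inode = {"direct_node": direct_nid}       # the inode stores the node ID only
inode_before = dict(inode)

nat.relocate(direct_nid, new_addr=2048)   # direct node rewritten further along the log

print(inode == inode_before)              # True: the inode did not have to change
print(nat.lookup(inode["direct_node"]))   # 2048
```

Because parents refer to a node by its index in the table, moving the node changes one table entry and leaves every block that refers to it untouched.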

Inode Structure: F2FS has three types of node: inode, direct node and indirect node. F2FS assigns 4 KB to each inode block, which contains 923 data block indices, two direct node pointers, two indirect node pointers and one double indirect node pointer. One direct node block contains entries for 1018 data blocks, and one indirect node block likewise contains entries for 1018 node blocks. Thus one inode block covers: 4 KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) = 3.94 TB.

Directory Structure: A directory entry in Unix consists of a (file name, inode number) pair. The original Unix file system searched linearly through the directory entries to find a file, which is inefficient for the large directories of modern file systems. F2FS implements the directory structure with multi-level hash tables. Each level has a hash table with a dedicated number of hash buckets. The number of blocks and buckets is determined by the following formulas:
Number of blocks in level n = 2, if n < max_directory_hash_depth / 2; = 4, otherwise.
Number of buckets in level n = 2^n, if n < max_directory_hash_depth / 2; = 2^((max_directory_hash_depth / 2) - 1), otherwise.
When F2FS looks up a file name in a directory, it first calculates a hash value of the file name. It then scans the hash table in level 0 for the dentry consisting of the file name and its inode number. If the dentry is not found, F2FS scans the next hash table in level 1. In this way, F2FS scans the hash tables of each level incrementally from 1 to N. In each level F2FS needs to scan only one bucket, determined by the following equation:
Bucket number to scan in level n = (hash value) % (number of buckets in level n).
When a file is created, F2FS finds empty consecutive slots that can hold the file name. It searches for the empty slots in the hash tables of all levels from 1 to N, in the same way as the lookup operation.

Solution of Wandering Tree Problem: When the address of an inode is stored in a directory, or the address of an index block is stored in an inode or another index block, what is stored is not the block address but an offset into the NAT. The actual block address is stored in the NAT at that offset. When a data block is written, the node that points to it must still be updated and written. But writing that node only requires updating its NAT entry; the blocks that refer to the node keep the same NAT offset and need not be rewritten. The NAT is part of the metadata that uses two-location journaling (thus depending on the FTL for write-gathering) and so does not require further indexing. The introduction of the NAT thus solves the wandering-tree problem.

F2FS Garbage Collection: F2FS follows a dynamic cleaning policy: the cleaning algorithm it uses changes with the situation. Traditional garbage collection copies the live data of highly fragmented segments and compacts it into empty segments, freeing the space previously occupied by obsolete data. Three questions become important during garbage collection: when to start, how many blocks to free, and which segments to choose for cleaning.

The first two questions are easily answered. Normally garbage collection starts when the number of free segments falls below a certain threshold (typically tens of segments), and it continues to clean segments until the number of free segments rises above another threshold. The question of which segment to clean (called the victim segment) is not so simple. A greedy approach is to choose the segment with the fewest live blocks, in order to minimize the copying overhead. However, this has been found empirically not to be the best solution. If recently copied data becomes obsolete very quickly, the effort of copying it is wasted; it would have been better to wait a little longer and not copy that data at all. This leads to the concept of data temperature: fast-changing data is hot, whereas relatively stable data is cold. Victim selection then considers not only the cost of copying data but also the benefit of cleaning stable data. This is the cost-benefit approach to victim selection. The stability of a segment is judged by its age, which is derived from the most recent modification time of the data in the segment. How obsolete a segment is is known from the number of live blocks it holds. Both pieces of information are kept in the Segment Information Table. When selecting a victim segment, segment age is maximized and the number of live blocks is minimized. By default F2FS uses the cost-benefit approach, but the user can configure it to use the greedy approach.
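The difference between the two victim-selection policies can be sketched with a few made-up segments. The scoring function below uses the classic cost-benefit ratio from the Rosenblum and Ousterhout paper, (1 - u) * age / (1 + u), where u is the fraction of live blocks; it is an assumption for illustration and is not claimed to be the exact formula F2FS implements. The Segment fields and the sample numbers are likewise hypothetical.

```python
from dataclasses import dataclass

BLOCKS_PER_SEGMENT = 512


@dataclass
class Segment:
    number: int
    live_blocks: int       # blocks that would have to be copied
    last_modified: float   # time of the most recent change (arbitrary units)


def greedy_victim(segments):
    """Greedy policy: pick the segment with the fewest live blocks."""
    return min(segments, key=lambda s: s.live_blocks)


def cost_benefit_score(seg, now):
    """Classic LFS ratio: free space gained, weighted by age, over copy cost."""
    u = seg.live_blocks / BLOCKS_PER_SEGMENT
    age = now - seg.last_modified
    return (1 - u) * age / (1 + u)


def cost_benefit_victim(segments, now):
    """Cost-benefit policy: pick the segment with the highest score."""
    return max(segments, key=lambda s: cost_benefit_score(s, now))


segments = [
    Segment(0, live_blocks=100, last_modified=990.0),  # few live blocks, but hot
    Segment(1, live_blocks=200, last_modified=100.0),  # more live blocks, but cold
    Segment(2, live_blocks=500, last_modified=50.0),   # cold, but almost full
]
now = 1000.0

print(greedy_victim(segments).number)             # 0
print(cost_benefit_victim(segments, now).number)  # 1
```

The greedy policy picks segment 0 even though its remaining live data was modified very recently and may soon become obsolete on its own, while the cost-benefit policy prefers the older segment 1.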

F2FS Adaptive Write Policy: F2FS applies further optimizations to make garbage collection more efficient. The aim of ideal garbage collection is a bimodal distribution of data: most segments are filled with rarely changing cold data, and a few segments are always occupied by rapidly changing hot data. In that case live data rarely needs to be copied, because the hot segments quickly fill up with obsolete blocks. Achieving a bimodal distribution of data is not easy, however. F2FS divides blocks into node blocks and data blocks. Each of these categories is further divided into hot, warm and cold temperatures as follows:
Hot node: contains direct node blocks of directories.
Warm node: contains direct node blocks except hot node blocks.
Cold node: contains indirect node blocks.
Hot data: contains dentry blocks.
Warm data: contains data blocks except hot and cold data blocks.
Cold data: contains multimedia data or migrated data blocks.
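The six categories listed above can be expressed as a small classification function. This is a sketch only: the block kind strings, the keyword flags (of_directory, multimedia, migrated) and the helper write_block are hypothetical names, and how a real implementation detects multimedia or migrated data is not modelled here.

```python
from collections import defaultdict


def classify(kind, *, of_directory=False, multimedia=False, migrated=False):
    """Map a block to one of the six temperature categories listed above."""
    if kind == "direct_node":
        return "hot_node" if of_directory else "warm_node"
    if kind == "indirect_node":
        return "cold_node"
    if kind == "dentry":
        return "hot_data"
    if kind == "data":
        return "cold_data" if (multimedia or migrated) else "warm_data"
    raise ValueError(f"unknown block kind: {kind}")


open_logs = defaultdict(list)   # one open log per temperature category


def write_block(name, kind, **hints):
    """Append the block to the open log that matches its temperature."""
    log_name = classify(kind, **hints)
    open_logs[log_name].append(name)
    return log_name


write_block("directory direct node", "direct_node", of_directory=True)
write_block("file direct node", "direct_node")
write_block("file indirect node", "indirect_node")
write_block("dentry block", "dentry")
write_block("document page", "data")
write_block("video chunk", "data", multimedia=True)
write_block("block moved by cleaner", "data", migrated=True)

print(sorted(open_logs))
```

Blocks with similar expected lifetimes end up in the same log, so a segment tends to become obsolete as a whole rather than piecemeal.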

The above six types of data are stored in the six separate sections that are open for parallel writing. In this way F2FS achieves something approaching a bimodal distribution of data.

Threaded Log: Once the flash drive starts filling up, most blocks contain live data and garbage collection becomes very inefficient. In that case F2FS switches to threaded logging. No garbage collection is done; instead, each obsolete block maintains a pointer to the next obsolete block. F2FS uses these pointers, or threaded logs, to write data directly into the obsolete blocks. Though this technique is fast, it is not used more often because sequential data becomes fragmented and spatial locality is lost.

Conclusion: F2FS is a modern file system for flash devices that tries to tackle very common but persistent issues. It does so by taking advantage of the flash device itself, cooperating with the FTL and trying to make garbage collection as painless as possible. It serves as an excellent example of the decisions that have to be made in real-world engineering problems. Being a recent file system, it is still a work in progress, and there is scope for further improvements and modifications.

References:
1. Rosenblum, Mendel and Ousterhout, John K. (February 1992) "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10 Issue 1.
2. Jaegeuk, Kim (March 2013) Embedded Linux Conference 2013 - Flash Friendly File System [Online Video]. Available: https://www.youtube.com/watch?v=t_4_Ba7PSg4
3. Jaegeuk, Kim f2fs.txt [Online]. Available: https://github.com/torvalds/linux/blob/master/Documentation/filesystems/f2fs.txt
4. Brown, Neil (October 2012) An f2fs teardown [Online]. Available: http://lwn.net/Articles/518988/
5. Chung, Tae-Sung et al. (May 2009) A Survey of Flash Translation Layer. Journal of Systems Architecture: The EUROMICRO Journal, Vol. 55 Issue 5-6.
6. Woodhouse, David JFFS : The Journalling Flash File System [Online]. Available: http://sourceware.org/jffs2/jffs2-html/