
Distributed File System

Introduction
Definition:
— A distributed file system (DFS) implements a common file system that can be
shared by all autonomous computers in a distributed system.
— A DFS involves multiple users, multiple sites, and (possibly)
distributed storage of files.

— Benefits
— File sharing
— Uniform view of system from different clients
— Centralized administration

Goals
Network (Access) Transparency
— Users should be able to access files over a network as
easily as if the files were stored locally.
— Users should not have to know the physical location of a
file to access it.
— Transparency can be addressed through naming and file
mounting mechanisms.

Components of Access Transparency
— Location Transparency: file name doesn’t specify physical
location.
— Location Independence: files can be moved to a new physical
location with no need to change references to them. (A name is
independent of its addresses.)
— Location independence → location transparency, but the
reverse is not necessarily true.
— Most DFSs today:
— Support location transparency.
— Do NOT support migration (automatic movement of a
file from machine to machine).
Naming and Transparency
— The Andrew DFS as an example:
— Is location independent.
— Supports file mobility.
— Separation of FS and OS allows for diskless systems. These
cost less and are easier to upgrade, but their
performance is not as good.

— Naming schemes:
1. Files are named with a combination of host and local name.
— This guarantees a unique name, but it is neither location transparent
nor location independent.
— The same naming works on local and remote files. The DFS is a
loose collection of independent file systems.
2. Remote directories are mounted to local directories.
— So a local system seems to have a coherent directory
structure.
— The remote directories must be explicitly mounted. Once mounted, the
files are location transparent: their names do not reveal the server.
— SUN NFS is a good example of this technique.

3. A single global name structure spans all the files in the system.
— The DFS is built the same way as a local filesystem.
— Location independent.
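
The difference between the three naming schemes can be illustrated with a small sketch. The host names, paths, and helper functions below are hypothetical and chosen only for illustration; they do not come from any particular DFS.

```python
# Hypothetical sketch: one file, named under each of the three schemes.

def host_local_name(host: str, local_path: str) -> str:
    """Scheme 1: the host is part of the name, so the name reveals location."""
    return f"{host}:{local_path}"


def mounted_name(mount_point: str, remote_root: str, remote_path: str) -> str:
    """Scheme 2: a remote directory tree appears under a local mount point."""
    if not remote_path.startswith(remote_root):
        raise ValueError("path is not inside the mounted remote directory")
    return mount_point + remote_path[len(remote_root):]


def global_name(global_root: str, path: str) -> str:
    """Scheme 3: every machine sees the same name under one global root."""
    return global_root + path


# Scheme 1: moving the file from hostA to hostB changes its name.
before_move = host_local_name("hostA", "/export/docs/report.txt")
after_move = host_local_name("hostB", "/export/docs/report.txt")
name_changed = before_move != after_move

# Scheme 2: the client sees an ordinary local-looking path after mounting.
local_view = mounted_name("/mnt/docs", "/export/docs", "/export/docs/report.txt")

# Scheme 3: the same name is valid on every machine.
shared_view = global_name("/dfs", "/docs/report.txt")
```

Under scheme 1 every reference to the file has to change when the file moves; under schemes 2 and 3 the visible name can stay the same as long as the mapping behind it is updated.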

Goals
Availability:
Files should be easily and quickly accessible.
— The number of users, system failures, or other
consequences of distribution shouldn’t compromise the
availability.
— Addressed mainly through replication.

Introduction
Architectural options:
— Fully distributed: files distributed to all sites
— based on peer-to-peer technology
— Issues: performance, implementation complexity
— Client-server Model:
— File servers: dedicated sites that store files and perform storage
and retrieval operations
— Clients: the remaining sites use the servers to access files
— e.g. Sun Microsystems Network File System (NFS)

Client-Server Architecture
— One or more machines (file servers) manage the file system.
— Files are stored on disks at the servers.
— Requests for file operations are made from clients to the
servers.
— Client-server systems centralize storage and management;
P2P systems decentralize it.

[Figure: Architecture of a distributed file system, client-server model. Clients and servers, each with a cache, are connected by a communication network; the servers are attached to disks.]

[Figure: Typical data access in a client/file server architecture]
Distributed File Systems Services
Services provided by the distributed file system:
(1) Name server: maps (resolves) the names supplied by clients
to objects (files and directories)
— Takes place when a process attempts to access a file or directory for the
first time.
(2) Cache manager: improves performance through file caching
— Caching at the client: when a client references a file at the server:
• Copy of data brought from server to client machine
• Subsequent accesses done locally at the client
— Caching at the server:
• File saved in memory to reduce subsequent access time

* Issue: different cached copies can become inconsistent. Cache managers
(at server and clients) have to provide coordination.
Mechanisms used in distributed file systems
(1) Mounting
• The mount mechanism binds together several filename
spaces (collection of files and directories) into a single
hierarchically structured name space (Example: UNIX and
its derivatives)
• A name space ‘A’ can be mounted (bound) at an internal
node (mount point) of a name space ‘B’
• Implementation: the kernel maintains a mount table that maps
mount points to storage devices
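
A mount table lookup can be sketched as a longest-prefix match over mount points. This is a simplified illustration, not how any specific kernel implements it; the table contents and the `resolve_mount` helper are assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical mount table: mount point -> storage device or remote file system.
mount_table: Dict[str, str] = {
    "/": "local-disk0",
    "/usr": "local-disk1",
    "/usr/shared": "serverX:/export/shared",
}


def resolve_mount(path: str) -> Tuple[str, str]:
    """Return (device, path relative to the mount point).

    The deepest mount point that is a prefix of the path wins.
    """
    best = "/"
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best):
                best = mount_point
    relative = path[len(best):].lstrip("/")
    return mount_table[best], relative


remote_hit = resolve_mount("/usr/shared/a/b.txt")
local_hit = resolve_mount("/usr/bin/ls")
root_hit = resolve_mount("/usrlocal/x")  # "/usr" is not a path prefix here
```

Matching on whole path components (note the trailing `/` in the prefix test) keeps `/usrlocal` from being mistaken for something under `/usr`.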

• Location of mount information
a. Mount information maintained at clients
— Each client mounts every file system
— Different clients may not see the same filename space
— If files move to another server, every client needs to update its
mount table
— Example: SUN NFS
b. Mount information maintained at servers
— Every client sees the same filename space
— If files move to another server, mount info at server only
needs to change
— Example: Sprite File System

(2) Caching
— Improves file system performance by exploiting the locality
of reference.
— When client references a remote file, the file is cached in the
main memory of the server (server cache) and at the client
(client cache).
— When multiple clients modify shared (cached) data, cache
consistency becomes a problem.
— It is very difficult to implement a solution that guarantees
consistency.

Simple Distributed File System
[Figure: clients send Read (RPC) requests to the server, which has a cache, and the server returns the data.]
— Remote Disk: Reads and writes forwarded to server
— Use RPC to translate file system calls
— No local caching; caching is possible at the server side
— Advantage: Server provides completely consistent view of file system
to multiple clients
— Problems? Performance!
— Going over network is slower than going to local memory
— Lots of network traffic/not well pipelined
— Server can be a bottleneck
Use of caching to reduce network load
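[Figure: one client reads f1 over RPC, caches F1:V1, and its repeated read(f1) calls return V1 from the cache. A second client issues write(f1), which returns OK and leaves F1:V2 in its cache and at the server; its read(f1) returns V2 while the first client's cache still holds F1:V1.]
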
— Idea: use caching to reduce network load
— In practice: use a buffer cache at source and destination
— Advantage: if open/read/write/close can be done locally, no network
traffic is needed, so access is fast
— Problems:
— Failure:
— Client caches have data not committed at server
— Cache consistency!
— Client caches not consistent with server/each other
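
The trade-off can be reproduced with a toy model. The classes below are hypothetical and deliberately omit any invalidation, so they show the consistency problem rather than a solution to it.

```python
class Server:
    """Toy file server that counts how many requests cross the network."""

    def __init__(self):
        self.files = {}      # authoritative copies: name -> contents
        self.rpc_count = 0

    def read(self, name):
        self.rpc_count += 1
        return self.files[name]

    def write(self, name, data):
        self.rpc_count += 1
        self.files[name] = data


class CachingClient:
    """Client that caches reads and writes through, with no invalidation."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, name):
        if name not in self.cache:          # miss: one RPC to the server
            self.cache[name] = self.server.read(name)
        return self.cache[name]             # hit: no network traffic

    def write(self, name, data):
        self.cache[name] = data
        self.server.write(name, data)


server = Server()
server.files["f1"] = "V1"
client1, client2 = CachingClient(server), CachingClient(server)

first_reads = [client1.read("f1") for _ in range(4)]  # 1 RPC, then 3 cache hits
client2.write("f1", "V2")                             # server and client2 hold V2
stale_read = client1.read("f1")                       # client1 still sees V1
fresh_read = client2.read("f1")
```

Six operations cost only two RPCs, but `client1` keeps returning the old version because nothing tells it that its cached copy is out of date.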
(3) Hints
— Treat the cached data as hints, i.e. cached data may not be
completely accurate.
— Can be used by applications that can discover that the
cached data is invalid and can recover
— Example:
— After the name of a file is mapped to an address, that
address is stored as a hint in the cache.
— If the address later fails, it is purged from the cache
— The name server is consulted to provide the actual
location of the file and the cache is updated
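
The hint example above can be sketched as follows. The `NameServer`, `HintCache`, and `fetch` names are hypothetical, and file servers are modelled as plain dictionaries for brevity.

```python
class StaleAddress(Exception):
    """Raised when a file is not found at the address that was tried."""


class NameServer:
    def __init__(self, locations):
        self.locations = locations   # file name -> current address

    def lookup(self, name):
        return self.locations[name]


def fetch(storage, address, name):
    """Hypothetical remote fetch: fails if the file is not at that address."""
    try:
        return storage[address][name]
    except KeyError:
        raise StaleAddress(address) from None


class HintCache:
    def __init__(self, name_server, storage):
        self.name_server = name_server
        self.storage = storage
        self.hints = {}        # file name -> address, possibly out of date
        self.ns_lookups = 0    # how often the name server was consulted

    def _resolve(self, name):
        self.ns_lookups += 1
        self.hints[name] = self.name_server.lookup(name)
        return self.hints[name]

    def read(self, name):
        address = self.hints.get(name) or self._resolve(name)
        try:
            return fetch(self.storage, address, name)
        except StaleAddress:
            del self.hints[name]   # purge the bad hint
            return fetch(self.storage, self._resolve(name), name)


storage = {"s1": {"f": "data"}, "s2": {}}
name_server = NameServer({"f": "s1"})
cache = HintCache(name_server, storage)

r1 = cache.read("f")                         # first access: name server consulted
storage["s2"]["f"] = storage["s1"].pop("f")  # file moves from s1 to s2
name_server.locations["f"] = "s2"
r2 = cache.read("f")                         # hint fails, is purged and refreshed
r3 = cache.read("f")                         # refreshed hint works directly
```

The approach is safe only because a wrong hint is detectable: the fetch fails visibly, so the client can fall back to the name server.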

(4) Bulk data transfer
— Observations:
— Overhead introduced by protocols does not depend on the
amount of data transferred in one transaction.
— Most files are accessed in their entirety.
— Common practice: when client requests one block of data,
multiple consecutive blocks are transferred
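
A rough cost model shows why bulk transfer pays off when the per-request overhead is fixed. The cost constants below are made-up numbers for illustration, not measurements.

```python
# Assumed costs, in microseconds (illustrative values only).
PER_REQUEST_OVERHEAD_US = 1000   # fixed protocol cost per request
PER_BLOCK_US = 100               # cost of moving one block of data


def transfer_cost_us(num_blocks: int, blocks_per_request: int) -> int:
    """Total cost of fetching num_blocks when each request carries several blocks."""
    requests = -(-num_blocks // blocks_per_request)   # ceiling division
    return requests * PER_REQUEST_OVERHEAD_US + num_blocks * PER_BLOCK_US


one_block_per_request = transfer_cost_us(64, 1)
eight_blocks_per_request = transfer_cost_us(64, 8)
```

With these assumed numbers the data cost is identical in both cases; only the number of times the fixed overhead is paid changes, from 64 requests to 8.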

(5) Encryption
— Encryption is needed to provide security in distributed systems.
— Entities that need to communicate send a request to an authentication
server.
— The authentication server provides a key for the conversation.
Design Issues
1. Naming and name resolution
— Terminology
— Name: each object in a file system (file, directory) has a unique
name
— Name resolution: mapping a name to an object or multiple
objects (replication)
— Name space: collection of names with or without same
resolution mechanism
— Approaches to naming files in a distributed system
(a) Concatenate name of host to names of files on that host
— Advantage: unique filenames, simple resolution

— Disadvantages:
o Conflicts with network transparency
o Moving file to another host requires changing its name
and the applications using it

(b) Mount remote directories onto local directories
— Requires that the host of the remote directory is known
— After mounting, files are referenced in a location-transparent way (i.e.,
the file name does not reveal its location)

(c) Have a single global directory
— All files belong to a single name space
— Limitation: having unique system-wide filenames requires a
single computing facility or cooperating facilities
1. Naming and Name Resolution (cont.)
— Contexts
— Solve the problem of system-wide unique names, by
partitioning a name space into contexts (geographical,
organizational, etc.)
— Name resolution is done within that context.
— Interpretation may lead to another context.
— File Name = Context + Name local to context

— Name server
— Process that maps file names to objects (files, directories)
— Implementation options
— Single name server
o Simple implementation
o Reliability and performance issues
— Several name servers (on different hosts)
o Each server is responsible for a domain
o Example:
Client requests access to file ‘A/B/C’
Local name server looks up a table (in kernel)
Local name server points to a remote server for the ‘/B/C’ mapping
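
The multi-server example can be sketched as a chain of referrals. The `NameServer` class and its two tables are hypothetical simplifications: each server either knows the object for the remaining path or refers the rest of the path to the server responsible for that domain.

```python
class NameServer:
    """Hypothetical name server responsible for one domain of the name space."""

    def __init__(self, name):
        self.name = name
        self.objects = {}     # remaining path -> object held in this domain
        self.referrals = {}   # first path component -> server for that domain

    def resolve(self, path, hops=None):
        """Return (object, list of servers visited) for the given path."""
        hops = (hops or []) + [self.name]
        if path in self.objects:
            return self.objects[path], hops
        head, _, rest = path.partition("/")
        if head in self.referrals:
            # Hand the rest of the path to the server for that domain.
            return self.referrals[head].resolve(rest, hops)
        raise FileNotFoundError(path)


remote = NameServer("remote")
remote.objects["B/C"] = "<file object C>"

local = NameServer("local")
local.referrals["A"] = remote    # domain 'A' is served by the remote server

obj, hops = local.resolve("A/B/C")
```

Here the local server only needs to know who is responsible for ‘A’; the mapping for ‘B/C’ lives entirely at the remote server.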
2. Caching
— Caching at the client: Main memory vs. Disk
— Main memory: (+) Fast, (+) Works for diskless clients,
(-) Expensive memory, (-) Complex virtual memory
management.
— Disk: (+) Large files, (+) Simpler virtual memory management,
(-) Requires local disk.
— Cache consistency
— Server initiated
— Server informs cache managers when data in client caches is
stale.
— Client cache managers invalidate stale data or retrieve new data.
— Disadvantage: extensive communication
— Client initiated
— Cache managers at the clients validate data with server
before returning it to clients
— Disadvantage: extensive communication
— Prohibit file caching under concurrent-write sharing
— Several clients open a file, at least one of them for
writing
— Server informs all clients to purge that cached file
— Lock files under concurrent-write sharing (at least one
client opens the file for writing)
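
The server-initiated and client-initiated approaches can be compared in a small sketch. All class names are hypothetical; version numbers stand in for whatever freshness information a real system would use, and `messages` counts only the consistency traffic.

```python
class Server:
    def __init__(self):
        self.data = {}        # name -> (version, contents)
        self.callbacks = {}   # name -> set of clients to notify on change
        self.messages = 0     # consistency messages exchanged

    def read(self, name, client=None):
        if client is not None:            # remember who caches this file
            self.callbacks.setdefault(name, set()).add(client)
        return self.data[name]

    def version(self, name):
        self.messages += 1                # one validation round trip
        return self.data[name][0]

    def write(self, name, contents, writer=None):
        version = self.data.get(name, (0, None))[0] + 1
        self.data[name] = (version, contents)
        for client in self.callbacks.pop(name, set()):
            if client is not writer:
                self.messages += 1        # one invalidation message
                client.invalidate(name)


class ServerInitiatedClient:
    """Trusts its cache until the server says the data is stale."""

    def __init__(self, server):
        self.server, self.cache = server, {}

    def invalidate(self, name):
        self.cache.pop(name, None)

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.read(name, client=self)
        return self.cache[name][1]


class ClientInitiatedClient:
    """Validates its cached version with the server before every use."""

    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, name):
        cached = self.cache.get(name)
        if cached is None or cached[0] != self.server.version(name):
            self.cache[name] = self.server.read(name)
        return self.cache[name][1]


server = Server()
server.write("f", "v1")
a, b = ServerInitiatedClient(server), ClientInitiatedClient(server)
a.read("f")
b.read("f")
server.write("f", "v2")     # triggers one invalidation message to a
a_after = a.read("f")       # cache was purged, so a refetches
b_after = b.read("f")       # validation detects the new version
b_again = b.read("f")       # validation still costs a message
```

In this toy run the server-initiated client costs one message per change, while the client-initiated client pays one message per read even when nothing has changed.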

3. Writing policy
— Question: once a client writes into a file (and the local
cache), when should the modified cache be sent to the server?
— Options:
— Write-through: all writes at the clients are immediately
transferred to the servers
— Advantage: reliability
— Disadvantage: performance; it does not take advantage of
the cache for writes.

— Delayed writing: delay transfer to servers
— Advantages:
o Many writes take place (including intermediate results)
before a transfer, so fewer transfers are needed
o Some data may be deleted before it ever has to be transferred
— Disadvantage: reliability
— Delayed writing until file is closed at client
— For short open intervals, same as delayed writing
— For long intervals, reliability problems
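
The effect of the writing policy on traffic and reliability can be seen in a toy comparison of write-through and write-on-close. The classes and the `run` helper are hypothetical.

```python
class Server:
    def __init__(self):
        self.files = {}
        self.transfers = 0    # number of writes that reached the server

    def store(self, name, data):
        self.transfers += 1
        self.files[name] = data


class WriteThroughClient:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def write(self, name, data):
        self.cache[name] = data
        self.server.store(name, data)    # every write goes to the server at once

    def close(self, name):
        pass                             # nothing left to flush


class WriteOnCloseClient:
    def __init__(self, server):
        self.server, self.cache, self.dirty = server, {}, set()

    def write(self, name, data):
        self.cache[name] = data          # only the local cache is updated
        self.dirty.add(name)

    def close(self, name):
        if name in self.dirty:           # flush the final contents once
            self.server.store(name, self.cache[name])
            self.dirty.discard(name)


def run(client_cls):
    """Write five drafts, then close; report what the server saw."""
    server = Server()
    client = client_cls(server)
    for i in range(5):
        client.write("f", f"draft {i}")
    on_server_before_close = server.files.get("f")
    client.close("f")
    return server.transfers, on_server_before_close, server.files["f"]


write_through = run(WriteThroughClient)
write_on_close = run(WriteOnCloseClient)
```

Write-on-close sends one transfer instead of five, but until the close nothing has reached the server, so a client crash in that window would lose every draft.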

4. Availability
— Issue: what is the level of availability of files in a distributed
file system?
— Resolution: use replication to increase availability, i.e. many
copies (replicas) of files are maintained at different
sites/servers
— Replication issues:
— How to keep replicas consistent
— How to detect inconsistency among replicas

— Unit of replication
— File
— Group of files
a) Volume: group of all files of a user or group or all files
in a server
o Advantage: ease of implementation
o Disadvantage: wasteful, user may need only a subset
replicated
b) Primary pack vs. pack
o Primary pack: all files of a user
o Pack: subset of the primary pack. Each pack can receive a different
degree of replication
5. Scalability
— Issue: Can the design support a growing system?
— Example: server-initiated cache invalidation complexity and
load grow with size of system.
— Possible solutions:
— Do not provide cache invalidation service for read-only files.
— Provide design to allow users to share cached data.
— Design file servers for scalability: threads, SMPs, clusters

6. Semantics
— Expected semantics: a read will return data stored by the
latest write.
— Possible options:
— All reads and writes go through the server.
— Disadvantage: communication overhead
— Use of lock mechanism
— Disadvantage: file not always available

Stateful vs. Stateless Service:
Stateful: A server keeps track of information about client
requests.
— It tracks which files each client has open, along with
connection identifiers and server caches.
— Memory must be reclaimed when client closes file or
when client dies.

Stateless: Each client request provides complete information
needed by the server (e.g., filename, file offset).
— The server can maintain information on behalf of the
client, but it's not required.
— Useful things to keep include file info for the last N files
touched.
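
The contrast can be made concrete with two toy servers. Both classes and the file contents are hypothetical; a "crash" is modelled by simply creating a fresh server object.

```python
FILES = {"notes.txt": b"distributed file systems"}


class StatelessServer:
    """Each request carries everything the server needs."""

    def read(self, filename, offset, count):
        return FILES[filename][offset:offset + count]


class StatefulServer:
    """Remembers open files and the current offset of each one."""

    def __init__(self):
        self.open_files = {}   # descriptor -> [filename, offset]
        self.next_fd = 0

    def open(self, filename):
        fd = self.next_fd
        self.next_fd += 1
        self.open_files[fd] = [filename, 0]
        return fd

    def read(self, fd, count):
        filename, offset = self.open_files[fd]
        data = FILES[filename][offset:offset + count]
        self.open_files[fd][1] = offset + len(data)
        return data

    def close(self, fd):
        del self.open_files[fd]   # reclaim the per-client state


stateful = StatefulServer()
fd = stateful.open("notes.txt")
first_part = stateful.read(fd, 11)

stateful = StatefulServer()       # crash and restart: open-file table is gone
try:
    stateful.read(fd, 5)
    survived_restart = True
except KeyError:
    survived_restart = False

stateless = StatelessServer()
before = stateless.read("notes.txt", 12, 4)
stateless = StatelessServer()     # restart loses nothing the client relies on
after = stateless.read("notes.txt", 12, 4)
```

The stateless request is longer because it repeats the filename and offset every time, but the same request gives the same answer before and after a restart.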
Case Studies:
The Sun Network File System (NFS)

— Developed by Sun Microsystems to provide a distributed file system
independent of the hardware and operating system.
— Architecture
— Virtual File System (VFS):
File system interface that allows NFS to support different file systems.
— Requests for operation on remote files are routed by VFS to NFS
— Requests are sent to the VFS on the remote server using
— The remote procedure call (RPC), and
— The external data representation (XDR)
— VFS on the remote server initiates the file system operation locally.
— Vnode (Virtual Node):
— There is a network-wide vnode for every object in the file system (file or
directory), the equivalent of a UNIX inode.
— A vnode has a mount table, allowing any node to be a mount node.
[Figure: NFS architecture]
— Naming and location:
— Workstations are designated as clients or file servers.
— A client defines its own private file system by mounting a subdirectory of a
remote file system on its local file system.
— Each client maintains a table which maps the remote file directories to
servers.
— Mapping a filename to an object is done the first time a client references the
file. Example:
Filename: /A/B/C
— Assume ‘A’ corresponds to ‘vnode1’
— Look up on ‘vnode1/B’ returns ‘vnode2’ for ‘B’, where ‘vnode2’ indicates that the
object is on server ‘X’.
— Client asks server ‘X’ to look up ‘vnode2/C’.
— A ‘file handle’ is returned to the client by the server storing that file.
— Client uses the ‘file handle’ for all subsequent operations on that file.
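
The /A/B/C walk-through can be sketched as a component-by-component lookup. This is a loose model of the steps listed above, not the NFS protocol itself: the data structures, the `resolve` helper, and the tuple used as a file handle are all assumptions.

```python
class Server:
    """Hypothetical file server holding some directories."""

    def __init__(self, name):
        self.name = name
        self.dirs = {}   # directory vnode -> {component: child vnode}

    def lookup(self, dir_vnode, component):
        child = self.dirs[dir_vnode][component]
        return (self.name, child)        # stands in for a file handle


server_x = Server("X")
server_x.dirs["vnode2"] = {"C": "vnode3"}

# Client-side state (all names are illustrative).
root = {"A": "vnode1"}                       # 'A' corresponds to vnode1
local_dirs = {"vnode1": {"B": "vnode2"}}     # lookup of vnode1/B gives vnode2
remote_vnodes = {"vnode2": server_x}         # vnode2 lives on server X


def resolve(path):
    """Resolve a path one component at a time; return the last file handle."""
    parts = path.strip("/").split("/")
    vnode = root[parts[0]]
    handle = None                            # stays None if nothing was remote
    for component in parts[1:]:
        if vnode in remote_vnodes:
            # The directory is remote: ask its server to look up the component.
            handle = remote_vnodes[vnode].lookup(vnode, component)
            vnode = handle[1]
        else:
            vnode = local_dirs[vnode][component]
    return handle


file_handle = resolve("/A/B/C")
```

Once the client holds the handle, later operations can name the file by handle alone and skip the path walk.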
— Caching:
— Caching done in main memory of clients.
— Caching done for: file blocks, translation of filenames to vnodes, and attributes of files and
directories.
(1) Caching of file blocks
— Cached on demand with the timestamp of the file (when last modified on the server)
— Entire file cached, if under a certain size, with the timestamp of when it was last modified
— After a certain age, blocks have to be validated with the server
— Delayed writing policy: modified blocks are flushed to the server after a certain delay

(2) Caching of filenames to vnodes for remote directory names
— Speeds up the lookup procedure.
(3) Caching of file and directory attributes
— Updated when new attributes are received from the server, discarded after a certain time.
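
Timestamp-based validation of cached blocks can be sketched as below. The `max_age` value, the explicit `now` argument (used instead of a real clock so the example is deterministic), and the class names are assumptions; actual NFS clients use their own timeout rules.

```python
class Server:
    def __init__(self):
        self.files = {}   # name -> (last-modified time, data)

    def getattr(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name]


class Client:
    def __init__(self, server, max_age):
        self.server = server
        self.max_age = max_age
        self.cache = {}   # name -> [mtime, data, time of last validation]

    def read(self, name, now):
        entry = self.cache.get(name)
        if entry is not None and now - entry[2] < self.max_age:
            return entry[1]                # young enough: no server contact
        if entry is not None and self.server.getattr(name) == entry[0]:
            entry[2] = now                 # validated: unchanged on the server
            return entry[1]
        mtime, data = self.server.read(name)   # missing or stale: refetch
        self.cache[name] = [mtime, data, now]
        return data


server = Server()
server.files["f"] = (100, "old")
client = Client(server, max_age=3)

r1 = client.read("f", now=0)      # fetched and cached
server.files["f"] = (101, "new")  # file changes on the server
r2 = client.read("f", now=2)      # within max_age: stale data is returned
r3 = client.read("f", now=5)      # too old: validation fails, block refetched
```

The second read shows the window in which a client can see stale data; shortening `max_age` narrows the window at the price of more validation traffic.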

— Stateless Server
— Servers are stateless
— File access requests from clients contain all needed information (pointer position, etc.)
— Servers have no record of past requests.
— Simple recovery from crashes.