Linux Kernel Hackers' Guide

The Linux Kernel Hackers' Guide
The HyperNews Linux KHG Discussion Pages
Linux Kernel Hackers' Guide

Due to the fact that nearly every post to this site
recently has been either by rude cracker-wannabes
asking how to break into other people's systems or a
request for basic technical support, posting to the
KHG has been disabled, probably permanently. For
now, you can read old posts, but you cannot send
replies. In any case, there are now far better
resources available.
Go get the real thing!
Alessandro Rubini wrote Linux Device Drivers, which is what the KHG could have been (maybe) but isn't. If you
have a question and can't find the answer here, go get a copy of Linux Device Drivers and read it--chances are that
when you are done, you will not need to ask a question here.
Run, don't walk to get a copy of this book.
The Linux Kernel

Go read The Linux Kernel if you want an introduction to the Linux kernel that is better than the KHG. It is a great
complement to Linux Device Drivers. Read it.
Table of Contents
Tour of the Linux Kernel
This is a somewhat incomplete tour of the Linux Kernel, based on Linux 1.0.9 and the 1.1.x development
series. Most of it is still relevant.
Device Drivers
The most common Linux kernel programming task is writing a new device driver. The great majority of the
code in the kernel is new device drivers; between 1.2.13 and 2.0 the size of the source code more than
doubled, and most of that was from adding device drivers.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg.html (1 di 13) [08/03/2001 10.07.54]

Filesystems
Adding a filesystem to Linux doesn't have to involve magic...
Linux Memory Management
A few outdated documents, and one completely new one by David Miller on the Linux cache flush
architecture.
How System Calls Work on Linux/i86
Although this was written while Linux 0.99.2 was current, it still applies. A few filenames may need
updating. find is your friend--just respond with the changes and they will be added.
Other Sources of Information
The KHG is just one collection of information about the Linux kernel. There are others!
Membership and Subscription

At the bottom of the page, you will notice two hyperlinks (among several others): Subscribe and Members.
Using the KHG to its fullest involves these two hyperlinks, even though you are not required to be a member to
read these pages and post responses.
Membership
HyperNews membership is site-wide. That is, you only need to sign up and become a member once for the entire
KHG. It doesn't take much to be a member. Each member is identified by a unique name, which can either be a
nickname or an email address. We suggest using your email address; that way it will be unique and easy to
remember. On the other hand, you may want to choose a nickname if you expect to be changing your email
address at any time.
We also want your real name, email address, and home page (if you have one). You can give us your phone and
address if you want. You will be asked to choose a password. You can change any of these items at any time by
clicking on the Membership hyperlink again.
Subscription
Subscribing to a page puts you on a mailing list to be sent notification of any new responses to the page to which
you are subscribed. You subscribe separately to each page in which you are interested by clicking the
Subscription link on the page to which you want to subscribe. You are also subscribed, by default, to pages that
you write.
When you subscribe to a page, you subscribe to that page and all of its responses.
Contributing
Please respond to these pages if you have something to add. Think of posting a response rather like posting to an
email list, except that an editor might occasionally come along to clean things up and/or put them in the main
documents' bodies. So if you would post it to an email list in a similar discussion, it is probably appropriate to post
here.
In order to make reading these pages a pleasure for everyone, any incomprehensible, unrelated, outdated, abusive,
or other completely unnecessary post may be removed by an administrator. So if you have a message that would

be inappropriate on a mailing list, it's probably also inappropriate here.

The administrators have the final say on what's appropriate. We don't expect this to become an issue...
About the new KHG

The Linux Kernel Hackers' Guide has changed quite a bit since its original conception four years ago. I struggled
along with the help of many other hackers to produce a document that lived primarily on paper, and was intended
to document the kernel in much the same way that a program's user guide is intended to document the program for
users.
It was less successful than most user guides, for a number of reasons:
● I was working on it part time, and was otherwise busy.
● The Linux kernel is a moving target.
● I am not personally capable of documenting the entire Linux kernel.
● I became far too concerned with making the typesetting pretty, getting bogged down in details and making
the document typographically noisy at the same time.
I floundered around, trying to be helpful, and made at least one right decision: most of the people who needed to
read the old KHG needed to write device drivers, and the most fully-developed part of the KHG was the device
driver section.
There is a clear need for further development of the KHG, and it's clear that my making it a monolithic document
stood in the way of progress. The KHG is now a series of more or less independent web pages, with places for
readers to leave comments and corrections that can be incorporated in the document at the maintainer's
leisure--and are available to readers before they are incorporated.
The KHG is now completely web-based. There will be no official paper version. You need kernel source code
nearby to read the KHG anyway, and I want to shift the emphasis from officially documenting the Linux kernel to
being a learning resource about the Linux kernel--one that may well be useful to other people who want to
document one part or another of the Linux kernel more fully, as well as to people who just want to hack the
kernel.
Enjoy!
Copyright (C) 1996,1997 Michael K. Johnson, johnsonm@redhat.com
Messages
349. Loading shared objects - How? by Wesley Terpstra
342. How can I see the current kernel configuration? by Melwin
1. My mouse no work in X windows by alfonso santana
340. The crash(1M) command in Linux? by Dmitry
338. Where can I gen detailed info on VM86 by Sebastien Plante
335. How to print floating point numbers from the kernel? by pkunisetty@hotmail.com
333. PS/2 Mouse Operating in Remote Mode by Andrei Racz
331. basic module by vano0023@tc.umn.edu

329. How to check if the user is local? by jb@nicol.ml.org

328. Ldt & Privileges by Ganesh
326. skb queues by Rahul Singh
323. Page locking (for DMA) and process termination? by Espen Skoglund
322. SMP code by 97yadavm@scar.utoronto.ca
319. Porting GC: Difficulties with pthreads by Talin
314. Linux for "Besta - 88"? by Dmitry
1. MVME147 Linux by Edward Tulupnikov
313. /proc/locks by Marco Morandini
310. syscall by ppappu@lrc.di.epfl.ch
308. How to run a bigger kernel ? by Kyung D. Ryu
300. Linux Terminal Device Driver by Nils Appeldoorn
1. Terminal DD by Doug McNash
297. DMA to user allocated buffer ? by Chris Read
1. allocator-example in A.Rubini's book by Thomas Sefzick
293. Patching problems by Maryam
1. Untitled by welch@mcmail.com
290. Ethernet Collision by jerome bonnet
1. Ethernet collisions by Juha Laine
289. Segmentation in Linux by Andrew Sampson
288. How can the kernel copy directly data from one process to another process? by Jürgen Zeller
1. Use the /Proc file system by marty@twsu.campus.mci.net
286. Remapping Memory Buffer using vmalloc/vma_nopage by Brian W. Taylor
1. Fixed.... strncpy to blame by Brian W. Taylor
283. Does memory area assigned by "vmalloc()" get swapped to disk? by Saurabh Desai
1. Lock the pages in memory by balaji@ittc.ukans.edu
-> How about assigning a fixed size array...does it get swapped too? by saurabh desai
282. Creative Lab's DVD Encore by Brandon
274. TCP sliding window by Olivier
273. Packets and default route versus direct route by Steve Resnick
269. IPv6 description - QoS Implementation - 2 IP Queues by wehrle
2. See the kernel IPv4 implementation documentation by Juha Laine
268. writing to user file directly from kernel space, How can it be done? by Johan
267. how can i increase the number of processes running? by ElmerFudd
261. How do I change the amount of time a process is allowed before it is pre-empted? by

Escher@dn101aw.cse.eng.auburn.edu
260. Network device stops after a while by Andrew Ordin
1. Untitled by Andrew
259. Does MMAP work with Redhat 4.2? by Guy
1. Yes, it works just fine. by Michael K. Johnson
3. What about mprotect? by Sengan Baring-Gould
2. It Works! Thanks! by Guy
256. multitasking by Dennis J Perkins
1. Answer by David Welch
-> multitasking by Dennis J Perkins
-> answer by David Welch
247. linux on sparc by darrin hodges
241. How to call a function in user space from inside the kernel ? by Ronald Tonn
1. How to call a user routine from kernel mode by David Welch
240. Can I map kernel (device driver) memory into user space ? by Ronald Tonn
237. driver x_open,x_release work, x_ioctl,x_write don't by Carl Schwartz
1. Depmod Unresolved symbols? by Carl Schwartz
235. How to sleep for x jiffies? by Trent Piepho
1. Use add_timer/del_timer (in kernel/sched.c) by Amos Shapira
234. Adding code to the Linux Kernel by Patrick
1. /dev/random by Simon Green
231. MSG_WAITALL flag by Leonard Mosescu
230. possible bug in ipc/msg.c by Michael Adda
225. scheduler Question by Arne Spetzler
1. Untitled by Ovsov
-> thanks by arne spetzler
221. File Descriptor Passing? by The Llamatron
220. Linux SMP Scheduling by Angela
2. Finding definitions in the source by Felix Rauch
1. Re: Linux SMP Scheduling by Franky
217. Difference between ELF and kernel file by Thomas Prokosch
216. How kernel communicates with outside when it's started? by Xuan Cao
1. Printing to the kernel log by Thomas Prokosch
213. The way from a kernel hackers' idea to an "official" kernel? by Roger Schreiter
1. linux-kernel@vger.rutgers.edu by Michael K. Johnson

212. Curious about sleep_on_interruptible() in ancient kernels. by Colin Howell

208. Server crashes using 2.0.32 and SMP by Steve Resnick
1. Debugging server crash by Balaji Srinivasan
-> More Information by Steve Resnick
-> it should not have happenned... by Balaji Srinivasan
207. Signals ID definitions by Franky
206. the segment D000 is not visible by martinv2@ctima.uma.es
205. ICMP - Supressing Messages in 2.1.82 Kernel by Brent Johnson
1. Change /etc/syslog.conf by Balaji Srinivasan
203. Modem bits by Franky
1. Untitled by Kostya
200. I need some way to measure the time a process spend in READY QUEUE by Leandro Gelasi
197. How to make sockets work whilst my process is in kernel mode? by Mikhail Kourinny
193. Realtime Problem by Uwe Gaethke
1. SCHED_FIFO scheduling by Balaji Srinivasan
190. inodes by Ovsov
186. Difference between SOCK_RAW SOCK_PACKET by Chris Leung
1. SOCK_PACKET by Eddie Leung
185. Need additional termcap entries for TERM=linux by Karl Bullock
184. Question on Umount or sys_umount by teddy
183. Passing file descriptors to the kernel by Pradeep Gore
1. A way to "transform" a file descriptor into a struct file* in a user process by Lorenzo Cavallaro
181. Dead Man Timer by Jody Winston
179. raw sockets by lightman
178. a kernel-hacking newbie by Bradley Lawrence
2. A place to start.
1. Modems in general by Ian Carr-de Avelon
176. How to write CD-ROM Driver ? Any Source Code ? by Madhura Upadhya
174. Measuring the scheduler overhead by Jasleen Kaur
173. Where can I find the tcpdump or snoop in linux? by wangc@taurus
1. man which by trajek@j00nix.org
172. Timers don't work?? by Joshua Liew
1. Timers Work... by Balaji Srinivasan
171. problem of Linux's bridge code by wangc@taurus
170. Documention on writing kernel modules by Erik Nygren

168. How to display a clock on my console? by keco

167. Difference between SCO and Linux drivers. by M COTE
165. Changing the scheduler from round robin to shortest job first for kernel 2.0 and up by
royal_and_mary_harrell@msn.com
2. Improving the Scheduer by Lee Ingram
1. Improving the Scheduler : use QNX-like by Leandro Gelasi
1. Re: Changing the sched. from round robin to shortest job first for kernel 2.0 and up. by Pirasenna V.T.
164. meanings of file->private_data by ncuandre@ms14.hinet.net
162. /dev/signalprocess by flatmax
161. how to track VM page access sequence? by shawn
160. Whats the difference between dev_tint(dev) and mark_bh(NET_BH)? by Jaspreet Singh
159. PCI by mullerc@iname.com
1. RE: PCI by Armin A. Arbinger
158. Can I make syscall from inside a kernel module? by Shawn Chang
3. Re: Can I make syscall from inside a kernel module? by Massoud Asgharifard
1. Make a syscall despite of wrong fs!! by Mikhail Kourinny
2. code snip to make a sys_* call from a module by Pradeep Gore
1. Dont use system calls within kernel...(esp sys_mlock) by Balaji Srinivasan
157. Untitled by Steve Durst
154. RAW Sockets (Art)
153. use phy mem by WYB
151. HyperNews for RH Linux ? by Eigil Krogh Sorensen
1. Not really needed by Cameron
150. about raw ethernet frame: how to do it ? by crbild@smc.it
149. process table by Blaz Novak
148. Stream drivers by Nick Egorov
3. Streams drivers
1. Stream in Solaris by cai.yu@rdc.etc.ericsson.se
143. Xircom External Ethernet driver anywhere? by mike head
140. interruptible_sleep_on() too slow! by Bill Blackwell
1. wrong functions by Michael K. Johnson
139. creating a kernel relocatable module by Simon Kittle
138. Up to date serial console patches by Simon Green
136. Kernel-Level Support for Checkpointing on Linux? by Argenis R. Fernandez
1. Working on it. by Jan Rychter

135. Problem creating a new system call by sauru

3. How did the file /arch/i386/kernel/entry.S do its job by Wang Ju
2. system call returns "Bad Address". Why? by sauru
1. Re:return values by C.H.Gopinath
2. Re:return values by Sameer Shah
1. possible reason for segmentation fault
1. Creating a new sytem call: solution by C.H.Gopinath
2. problem with system call slot 167 by Todd Medlock
1. Kernel Debuggers for Linux by sauru
133. Resetting interface counters by Keith Dart
130. writing/accessing modules by Jones MB
1. Use a device driver and read()/write()/ioctl() by Michael K. Johnson
-> getting to the kernel's memory by Jones MB
-> use buffers! by Rubens
124. Help with CPU scheduler! by Lerris
1. Response to "Help with CPU scheduler!" by Jeremy Impson
-> Response to "Help with CPU scheduler!" (Redux) by Jeremy Impson
117. calling interupts from linux by John J. Binder
1. You can't by Michael K. Johnson
-> Calling BIOS interrupts from Linux kernel by Ian Collier
-> Possible, but takes work by Michael K. Johnson
-> VBE video driver by Ian Collier
-> VM86 mode at which abstraction level? by Michael K. Johnson
116. DVD-ROM and Linux? (sorry if it's off topic...) by Joel Hardy
3. DVD-ROM and linux by Yuqing_Deng@brown.edu
2. Response to DVD and Mpeg in Linux by Mike Corrieri
1. DVD Encryption by Mark Treiber
-> Untitled by Tim
-> DVD?
115. Kernel Makefile Configuration: how? by Simon Green
2. How to add a driver to the kernel ? by jacek Radajewski
1. See include/linux/autoconf.h by Balaji Srinivasan
113. Multiprocessor Linux by Davis Terrell
1. Building an SMP kernel by Michael K. Johnson
-> SMP and module versions by linux@catlimited.com

111. Improving event timers? by bodomo@hotmail.com

109. measuring time to load a virtual mem page from disk by kandr
94. using cli/sti() and save_flags/restore_flags() by george
93. Protected Mode by ac
2. 'Developers manual' from Intel(download)... by Mats Odman
1. Advanced 80386 Programming Techniques by Michael K. Johnson
92. DMA buffer sizes by nomrom@hotmail.com
2. DMA limits by Albert Cahalan <acahalan at cs.uml.edu>
1. Not page size, page order by Michael K. Johnson
91. Problem Getting the Kernel small enough by afish@tdcdesigncorps.com
2. Check it's the right file, zImage not vmlinux by Cameron
1. Usually easy, but.... by Ian Carr-de Avelon
89. How to create /proc/sys variables? by Orlando Cantieni
88. Linux for NeXT black? by Dale Amon
87. vremap() in kernel modules? by Liam Wickins
86. giveing compatiblity to win95 for ext2 partitions (for programmers forced to deal with both) by pharos
2. Well, What's the status of the Windows / Dos driver for Ext2? by Brock Lynn
1. Working on it! by ibaird
1. revision by ibarid
-> Untitled by Olaf
84. setsockopt() error when triying to use ipfwadm for masquerading by omaq@encomix.es
1. Re: masquerading by Charles Barrasso
83. reset the irq 0 timer after APM suspend by Dong Chen
1. Re: fixed, patch for kernel 2.0.30 by Dong Chen
77. Source Code in C for make Linux partitions. by Limbert Sanabria
1. Untitled by lolley
76. How can I "cheat" and change the IP address (src,dest) in the sent socket? by Rami
5. Transparent Proxy by Zygo Blaxell
4. Untitled by qwzhang@public2.bta.net.cn
3. Untitled by navin97@hotmail.com
2. Changing your IP address is easy, but... by Zygo Blaxell
1. You have to know a bit of C (if u wanna learn) ;) by Lorenzo Cavallaro
2. Untitled
1. Do it in the kernel by Michael K. Johnson
74. Where is the source file for accept() by kaixu@hocpa.ho.lucent.com

1. Here, in /usr/src/linux/net/socket.c by wumin@netchina.co.cn

72. How can I use RAW SOCKETS in UNIX? by Rami
1. Re: Raw sockets by genie@risq.belcaf.minsk.by
69. the KHG in spanish? by Jorge Alvarado Revatta
2. Si tenga preguntas, quisa yo pueda ayudarte. by KernelJock
3. Tengo una pregunta by riderghost@rocketmail.com
2. Español by LL2
1. No esta aqui! Pero... by Michael K. Johnson
67. How to get a Memory snapshot ? by Manuel Porras Brand
1. Why not to get a memory snapshot? by Jukka Santala
-> Why you would want to get a memory snapshot by Dave M.
66. resources hard limits by castejon@kbw350.chem.yale.edu
1. Setting resource limits by Jukka Santala
65. How to invalidate a chache page by Gerhard Uttenthaler
1. Read the rest of the KHG! by Michael K. Johnson
64. Where are the tunable parameters of the kernel? by demiguel@robot3.cps.unizar.es
1. Kernel tunable parameters by Jukka Santala
62. How can my device driver access data structures in user space? by Stephan Theil
1. Forced Cast data type by Wang Ju
61. Problem in doing RAW SOCKET Programming by anjali sharma
1. Problem with ICMP echo-request/reply by Raghavendra Bhat
59. Tunable Kernel Parameters? by dennyf@bmn.net
2. Increasing number of files in system by Simon Cooper
1. Increasing number of open files parameter by Simon Cooper
1. sysctl in Linux by Jukka Santala
1. Setting and getting kernel vars by kbrown@csuhayward.edu
58. ELF matters by Carlos Munoz
1. Information about ELF Internals by Pat Ekman
57. Droping Packets by Charles Barrasso
1. [Selectively] Droping Packets by Jose R. cordones
56. The /proc/profile by Charles Barrasso
1. readprofile systool by Jukka Santala
55. Can you block or ignore ICMP packets? by HackOps@nutnet.com
4. ICMP send rate limit / ignoring by Jukka Santala
1. Omission in earlier rate-limit... by Jukka Santala

-> Patch worked... by Jukka Santala

3. Using ipfwadm by Charles Barrasso
1. ipfwadm configuration utility by Sonny Parlin
1. Icmp.c and kernal ping replies by Don Thomas
52. encaps documentation by Kuang-chun Cheng
51. Mounting Caldrea OpenDOS formatted fs's by Trey Childs
49. finding the address that caused a SIGSEGV. by Ben Shelef
47. sti() called too late. by Erik Thiele
1. sti() called too late. by Gadi Oxman
38. Module Development Info? by Mark S. Mathews
1. Needed here too by ajay
2. Help needed here too! by ajay
35. Need quicker timer than 100 ms in kernel-module by Erik Thiele
1. 10 ms timer patch by Reinhold J. Gerharz
2. please send me 10 ms timer patch by Tolga Ayav
1. Please send me the patch by Jin Hwang
1. UTIME: Microsecond Resolution Timers by BalajiSrinivasan
34. Need help with finding the linked list of unacked sk_buffs in TCP by Vijay Gupta
31. Partition Type by Suman Ball
30. New document on exception handling by Michael K. Johnson
29. How to make paralelism in to the kernel? by Delian Dlechev
27. readv/writev & other sock funcs by Dave Wreski
25. I'd like to see the scheduler chapter by Tim Bird
1. Untitled by Vijay Gupta
3. Go ahead! by Michael K. Johnson
21. Unable to access KHG, port 8080 giving problem. by Srihari Nelakuditi
1. Get a proxy by Michael K. Johnson
20. proc fs docs? by David Woodruff
1. Examples code as documentation by Jeremy Impson
18. What is SOCK_RAW and how do I use it? by arkane
1. What raw sockets are for. by Cameron MacKinnon
15. Linux kernel debugging by yylai@hk.net
2. GDB for Linux by David Grothe
2. Another kernel debugging tool by David Hinds
2. Kernel debugging with breakpoints by Keith Owens

-> Need help for debugging by C.H.Gopinath

1. gdb debugging of kernel now available by David Grothe
1. Device debugging by alombard©iiic.ethz.ch
9. Realtime mods anyone? by bill duncan
7. Summary of Linux Real-Time Status by Markus Kuhn
6. Hard real-time now available by Michael K. Johnson
2. Shortcomings of RT-Linux by Balaji Srinivasan
1. Firm Realtime available by Balaji Srinivasan
5. found some hacks ?!? by Mayk Langer
2. I want to know how to hack Red Hat Linux Release 5.0 by Kevin
4. POSIX.4 scheduler by Peter Monta
1. cli()/sti() latency, hard numbers by Ingo Molnar
2. Realtime is already done(!) by Kai Harrekilde-Petersen
1. 100 ms real time should be easy by jeff millar
1. Real-Time Applications with Linux POSIX.4 Scheduling by P. Woolley
7. Why can't we incorporate new changes in linux kernel in KHG ? by Praveen Kumar Dwivedi
1. You can! by Michael K. Johnson
3. Kernel source code by Gabor J.Toth
1. The sounds of silence... by Gabor J.Toth
1. Breaking the silence :) by Kyle Ferrio
1. Scribbling in the margins by Michael K. Johnson
2. It requires thought... by Michael K. Johnson
2. Kernel source is already browsable online by Axel Boldt
2. Need easy way to download whole KHG
5. KHG being mirrored nightly for download! by Michael K. Johnson
2. postscript version of these documents? by Michael Stiller
1. Sure! by Michael K. Johnson
-> Not so Sure! by jeff millar
-> Enough already! by Michael K. Johnson
1. Mirror packages are available, but that's not really enough by Michael K. Johnson
4. Mirror whole KHG package, off line reading and Post to this site by Kim In-Sung
2. Untitled by Jim Van Zandt
1. That works. (using it now). Two tips: by Richard Braakman
2. Appears to be a bug in getwww, though... by Michael K. Johnson
-> Sucking up to the wrong site... ;) by Jukka Santala

1. Help make the new KHG a success by Michael K. Johnson

The Linux Kernel
Table of Contents
The Linux Kernel
● Title Page
● Preface
● Hardware Basics
● Software Basics
● Memory Management
● Processes
● Interprocess
Communication
Mechanisms This book is for Linux enthusiasts who want to know how the
Linux kernel works. It is not an internals manual. Rather it
● PCI
describes the principles and mechanisms that Linux uses; how and
● Interrupts and why the Linux kernel works the way that it does.
Interrupt Handling
Linux is a moving target; this book is based upon the current,
● Device Drivers stable, 2.0.33 sources as those are what most individuals and
● The File System companies are now using.
● Networks This book is freely distributable, you may copy and redistribute it under
● Kernel Mechanisms certain conditions. Please refer to the copyright and distribution
statement.
● Modules
● Processors Version 0.8-3
David A Rusling
● The Linux Kernel david.rusling@arm.com)
Sources
● Linux Data Structures
Table of Contents, Show Frames, No Frames
● Useful Web and FTP
© 1996-1999 David A Rusling copyright notice
Sites
● The LPD Manifesto
● The GNU General
Public License
● Glossary
David A Rusling
3 Foxglove Close,
Wokingham,
Berkshire RG41 3NF,
United Kingdom
http://ldp.iol.it/LDP/tlk/tlk.html (1 di 2) [08/03/2001 10.08.04]

The Linux Kernel
Show Frames, No Frames

© 1996-1999 David A
Rusling copyright notice
david.rusling@arm.com
http://ldp.iol.it/LDP/tlk/tlk.html (2 di 2) [08/03/2001 10.08.04]

The Linux Kernel
The Linux Kernel
This book is for Linux enthusiasts who want to know how the Linux kernel works. It is not an
internals manual. Rather it describes the principles and mechanisms that Linux uses; how and why
the Linux kernel works the way that it does.
Linux is a moving target; this book is based upon the current, stable, 2.0.33 sources as those are
what most individuals and companies are now using.
This book is freely distributable, you may copy and redistribute it under certain conditions. Please refer
to the copyright and distribution statement.
Version 0.8-3
David A Rusling
david.rusling@arm.com)

http://ldp.iol.it/LDP/tlk/tlk-title.html [08/03/2001 10.08.06]

The Linux Kernel (Copyright Notice)
Legal Notice
UNIX is a trademark of Univel.
Linux is a trademark of Linus Torvalds, and has no connection to UNIXTM or Univel.
Copyright © 1996,1997,1998,1999 David A Rusling
3 Foxglove Close, Wokingham, Berkshire RG41 3NF, UK
This book (``The Linux Kernel'') may be reproduced and distributed in whole or in part, without fee,
subject to the following conditions:
● The copyright notice above and this permission notice must be preserved complete on all complete
or partial copies.
● Any translation or derived work must be approved by the author in writing before distribution.
● If you distribute this work in part, instructions for obtaining the complete version of this manual
must be included, and a means for obtaining a complete version provided.
● Small portions may be reproduced as illustrations for reviews or quotes in other works without this
permission notice if proper citation is given.
Exceptions to these rules may be granted for academic purposes: Write to the author and ask. These
restrictions are here to protect us as authors, not to restrict you as learners and educators.
All source code in this document is placed under the GNU General Public License, available via
anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING. It is also reproduced in appendix
gpl.
http://ldp.iol.it/LDP/tlk/misc/copyright.html [08/03/2001 10.08.06]

The Linux Kernel: Table of Contents
Table of Contents
● Title Page
● Preface
● Hardware Basics
● Software Basics
● Memory Management
● Processes
● Interprocess Communication Mechanisms
● PCI
● Interrupts and Interrupt Handling
● Device Drivers
● The File System
● Networks
● Kernel Mechanisms
● Modules
● Processors
● The Linux Kernel Sources
● Linux Data Structures
● Useful Web and FTP Sites
● The LPD Manifesto
● The GNU General Public License
● Glossary
David A Rusling
3 Foxglove Close,
Wokingham,
Berkshire RG41 3NF,
United Kingdom
Show Frames, No Frames

http://ldp.iol.it/LDP/tlk/tlk-toc.html [08/03/2001 10.08.09]

Preface
Linux is a phenomenon of the Internet. Born out of the hobby project of a student it has grown to become
more popular than any other freely available operating system. To many Linux is an enigma. How can
something that is free be worthwhile? In a world dominated by a handful of large software corporations,
how can something that has been written by a bunch of ``hackers'' (sic) hope to compete? How can
software contributed to by many different people in many different countries around the world have a
hope of being stable and effective? Yet stable and effective it is and compete it does. Many Universities
and research establishments use it for their everyday computing needs. People are running it on their
home PCs and I would wager that most companies are using it somewhere even if they do not always
realize that they do. Linux is used to browse the web, host web sites, write theses, send electronic mail
and, as always with computers, to play games. Linux is emphatically not a toy; it is a fully developed and
professionally written operating system used by enthusiasts all over the world.
The roots of Linux can be traced back to the origins of Unix TM . In 1969, Ken Thompson of the Research
Group at Bell Laboratories began experimenting on a multi-user, multi-tasking operating system using an
otherwise idle PDP-7. He was soon joined by Dennis Richie and the two of them, along with other
members of the Research Group produced the early versions of Unix TM. Richie was strongly influenced
by an earlier project, MULTICS and the name Unix TM is itself a pun on the name MULTICS. Early
versions were written in assembly code, but the third version was rewritten in a new programming
language, C. C was designed and written by Richie expressly as a programming language for writing
operating systems. This rewrite allowed Unix TM to move onto the more powerful PDP-11/45 and 11/70
computers then being produced by DIGITAL. The rest, as they say, is history. Unix TM moved out of the
laboratory and into mainstream computing and soon most major computer manufacturers were producing
their own versions.
Linux was the solution to a simple need. The only software that Linus Torvalds, Linux's author and
principle maintainer was able to afford was Minix. Minix is a simple, Unix TM like, operating system
widely used as a teaching aid. Linus was less than impressed with its features, his solution was to write
his own software. He took Unix TM as his model as that was an operating system that he was familiar
with in his day to day student life. He started with an Intel 386 based PC and started to write. Progress
was rapid and, excited by this, Linus offered his efforts to other students via the emerging world wide
computer networks, then mainly used by the academic community. Others saw the software and started
contributing. Much of this new software was itself the solution to a problem that one of the contributors
had. Before long, Linux had become an operating system. It is important to note that Linux contains no
Unix TM code, it is a rewrite based on published POSIX standards. Linux is built with and uses a lot of
the GNU (GNU's Not Unix TM) software produced by the Free Software Foundation in Cambridge,
Massachusetts.
Most people use Linux as a simple tool, often just installing one of the many good CD ROM-based
distributions. A lot of Linux users use it to write applications or to run applications written by others.
Many Linux users read the HOWTOs1 avidly and feel both the thrill of success when some part of the
http://ldp.iol.it/LDP/tlk/intro/preface.html (1 di 6) [08/03/2001 10.08.12]

system has been correctly configured and the frustration of failure when it has not. A minority are bold
enough to write device drivers and offer kernel patches to Linus Torvalds, the creator and maintainer of
the Linux kernel. Linus accepts additions and modifications to the kernel sources from anyone,
anywhere. This might sound like a recipe for anarchy but Linus exercises strict quality control and
merges all new code into the kernel himself. At any one time though, there are only a handful of people
contributing sources to the Linux kernel.
The majority of Linux users do not look at how the operating system works, how it fits together. This is a
shame because looking at Linux is a very good way to learn more about how an operating system
functions. Not only is it well written, all the sources are freely available for you to look at. This is
because although the authors retain the copyrights to their software, they allow the sources to be freely
redistributable under the Free Software Foundation's GNU Public License. At first glance though, the
sources can be confusing; you will see directories called kernel, mm and net but what do they contain
and how does that code work? What is needed is a broader understanding of the overall structure and
aims of Linux. This, in short, is the aim of this book: to promote a clear understanding of how Linux, the
operating system, works. To provide a mind model that allows you to picture what is happening within
the system as you copy a file from one place to another or read electronic mail. I well remember the
excitement that I felt when I first realized just how an operating system actually worked. It is that
excitement that I want to pass on to the readers of this book.
My involvement with Linux started late in 1994 when I visited Jim Paradis who was working on a port of
Linux to the Alpha AXP processor based systems. I had worked for Digital Equipment Co. Limited since
1984, mostly in networks and communications and in 1992 I started working for the newly formed
Digital Semiconductor division. This division's goal was to enter fully into the merchant chip vendor
market and sell chips, and in particular the Alpha AXP range of microprocessors but also Alpha
AXP system boards outside of Digital. When I first heard about Linux I immediately saw an opportunity
to have fun. Jim's enthusiasm was catching and I started to help on the port. As I worked on this, I began
more and more to appreciate not only the operating system but also the community of engineers that
produces it.
However, Alpha AXP is only one of the many hardware platforms that Linux runs on. Most Linux
kernels are running on Intel processor based systems but a growing number of non-Intel Linux systems
are becoming more commonly available. Amongst these are Alpha AXP, ARM, MIPS, Sparc and
PowerPC. I could have written this book using any one of those platforms but my background and
technical experiences with Linux are with Linux on the Alpha AXP and, to a lesser extent on the ARM.
This is why this book sometimes uses non-Intel hardware as an example to illustrate some key point. It
must be noted that around 95% of the Linux kernel sources are common to all of the hardware platforms
that it runs on. Likewise, around 95% of this book is about the machine independent parts of the Linux
kernel.
Reader Profile
This book does not make any assumptions about the knowledge or experience of the reader. I believe that
interest in the subject matter will encourage a process of self education where neccessary. That said, a
degree of familiarity with computers, preferably the PC will help the reader derive real benefit from the
material, as will some knowledge of the C programming language.

Organisation of this Book
This book is not intended to be used as an internals manual for Linux. Instead it is an introduction to
operating systems in general and to Linux in particular. The chapters each follow my rule of ``working
from the general to the particular''. They first give an overview of the kernel subsystem that they are
describing before launching into its gory details.
I have deliberately not described the kernel's algorithms, its methods of doing things, in terms of
routine_X() calls routine_Y() which increments the foo field of the bar data structure. You
can read the code to find these things out. Whenever I need to understand a piece of code or describe it to
someone else I often start with drawing its data structures on the white-board. So, I have described many
of the relevant kernel data structures and their interrelationships in a fair amount of detail.
Each chapter is fairly independent, like the Linux kernel subsystem that they each describe. Sometimes,
though, there are linkages; for example you cannot describe a process without understanding how virtual
memory works.
The Hardware Basics chapter (Chapter hw-basics-chapter) gives a brief introduction to the modern PC.
An operating system has to work closely with the hardware system that acts as its foundations. The
operating system needs certain servit d that can only be provided by the hardware. In order to fully
understand the Linux operating system, you need to understand the basics of the underlying hardware.
The Software Basics chapter (Chapter sw-basics-chapter) introduc d basic software principles and looks
at assembly and C programing languages. It looks at the toold that are used to build an operating system
like Linux and it gives an overview of the aims and functions of an operating system.
The Memory Management chapter (Chapter mm-chapter) describes the way that Linux handles the
physical and virtual memory in the system.
The Processes chapter (Chapter processes-chapter) describes what a process is and how the Linux kernel
creates, manages and deletes the processes in the system.
Processes communicate with each other and with the kernel to coordinate their activities. Linux supports
a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but
Linux also supports the System V IPC mechanisms named after the Unix TM release in which they first
appeared. These interprocess communications mechanisms are described in Chapter IPC-chapter.
The Peripheral Component Interconnect (PCI) standard is now firmly established as the low cost, high
performance data bus for PCs. The PCI chapter (Chapter PCI-chapter) describes how the Linux kernel
initializes and uses PCI buses and devit d in the system.
The Interrupts and Interrupt Handling chapter (Chapter interrupt-chapter) looks at how the Linux kernel
handles interrupts. Whilst the kernel has generic mechanisms and interfat d for handling interrupts, some
of the interrupt handling details are hardware and architecture specific.
One of Linux's strengths is its support for the many available hardware devit d for the modern PC. The
Devit Drivers chapter (Chapter dd-chapter) describes how the Linux kernel controls the physical

devices in the system.
The File system chapter (Chapter filesystem-chapter) describes how the Linux kernel maintains the files
in the file systems that it supports. It describes the Virtual File System (VFS) and how the Linux kernel's
real file systems are supported.
Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of
the Internet or World Wide Web (WWW). Its developers and users use the web to exchange information
ideas, code and Linux itself is often used to support the networking needs of organizations. Chapter
networks-chapter describes how Linux supports the network protocols known collectively as TCP/IP.
The Kernel Mechanisms chapter (Chapter kernel-chapter) looks at some of the general tasks and
mechanisms that the Linux kernel needs to supply so that other parts of the kernel work effectively
together.
The Modules chapter (Chapter modules-chapter) describes how the Linux kernel can dynamically load
functions, for example file systems, only when they are needed.
The Processors chapter (Chapter processors-chapter) gives a brief description of some of the processors
that Linux has been ported to.
The Sources chapter (Chapter sources-chapter) describes where in the Linux kernel sources you should
start looking for particular kernel functions.
Conventions used in this Book

The following is a list of the typographical conventions used in this book.
serif fontidentifies commands or other text that is to be typed
literally by the user.
type font refers to data structures or fields
within data structures.
Throughout the text there references to pieces of code within the Linux kernel source tree (for example
the boxed margin note adjacent to this text ). These are given in case you wish to look at the source code
itself and all of the file references are relative to /usr/src/linux. Taking foo/bar.c as an
example, the full filename would be /usr/src/linux/foo/bar.c If you are running Linux (and
you should), then looking at the code is a worthwhile experience and you can use this book as an aid to
understanding the code and as a guide to its many data structures.
Trademarks
ARM is a trademark of ARM Holdings PLC.
Caldera, OpenLinux and the ``C'' logo are trademarks of Caldera, Inc.
Caldera OpenDOS 1997 Caldera, Inc.

DEC is a trademark of Digital Equipment Corporation.
DIGITAL is a trademark of Digital Equipment Corporation.
Linux is a trademark of Linus Torvalds.
Motif is a trademark of The Open System Foundation, Inc.
MSDOS is a trademark of Microsoft Corporation.
Red Hat, glint and the Red Hat logo are trademarks of Red Hat Software, Inc.
UNIX is a registered trademark of X/Open.
XFree86 is a trademark of XFree86 Project, Inc.
X Window System is a trademark of the X Consortium and the Massachusetts Institute of Technology.
The Author
I was born in 1957, a few weeks before Sputnik was launched, in the north of England. I first met Unix at
University, where a lecturer used it as an example when teaching the notions of kernels, scheduling and
other operating systems goodies. I loved using the newly delivered PDP-11 for my final year project.
After graduating (in 1982 with a First Class Honours degree in Computer Science) I worked for Prime
Computers (Primos) and then after a couple of years for Digital (VMS, Ultrix). At Digital I worked on
many things but for the last 5 years there, I worked for the semiconductor group on Alpha and
StrongARM evaluation boards. In 1998 I moved to ARM where I have a small group of engineers
writing low level firmware and porting operating systems. My children (Esther and Stephen) describe me
as a geek.
People often ask me about Linux at work and at home and I am only too happy to oblige. The more that I
use Linux in both my professional and personal life the more that I become a Linux zealot. You may note
that I use the term `zealot' and not `bigot'; I define a Linux zealot to be an enthusiast that recognizes that
there are other operating systems but prefers not to use them. As my wife, Gill, who uses Windows 95
once remarked `Ì never realized that we would have his and her operating systems''. For me, as an
engineer, Linux suits my needs perfectly. It is a superb, flexible and adaptable engineering tool that I use
at work and at home. Most freely available software easily builds on Linux and I can often simply
download pre-built executable files or install them from a CD ROM. What else could I use to learn to
program in C++, Perl or learn about Java for free?
Acknowledgements
I must thank the many people who have been kind enough to take the time to e-mail me with comments
about this book. I have attempted to incorporated those comments in each new version that I have
produced and I am more than happy to receive comments, however please note my new e-mail address.
A number of lecturers have written to me asking if they can use some or parts of this book in order to

teach computing. My answer is an emphatic yes; this is one use of the book that I particularly wanted.
Who knows, there may be another Linus Torvalds sat in the class.
Special thanks must go to John Rigby and Michael Bauer who gave me full, detailed review notes of the
whole book. Not an easy task. Alan Cox and Stephen Tweedie have patiently answered my questions -
thanks. I used Larry Ewing's penguins to brighten up the chapters a bit. Finally, thank you to Greg
Hankins for accepting this book into the Linux Documentation Project and onto their web site.
Footnotes:
1A HOWTO is just what it sounds like, a document describing how to do something. Many have been
written for Linux and all are very useful.
File translated from TEX by TTH, version 1.0.
Top of Chapter, Table of Contents, Show Frames, No Frames

© 1996-1999 David A Rusling copyright notice.

Chapter 1
Hardware Basics
An operating system has to work closely with the hardware system that acts as its foundations. The
operating system needs certain services that can only be provided by the hardware. In order to fully
understand the Linux operating system, you need to understand the basics of the underlying hardware.
This chapter gives a brief introduction to that hardware: the modern PC.
When the ``Popular Electronics'' magazine for January 1975 was printed with an illustration of the Altair
8080 on its front cover, a revolution started. The Altair 8080, named after the destination of an early Star
Trek episode, could be assembled by home electronics enthusiasts for a mere $397. With its Intel 8080
processor and 256 bytes of memory but no screen or keyboard it was puny by today's standards. Its
inventor, Ed Roberts, coined the term ``personal computer'' to describe his new invention, but the term
PC is now used to refer to almost any computer that you can pick up without needing help. By this
definition, even some of the very powerful Alpha AXP systems are PCs.
Enthusiastic hackers saw the Altair's potential and started to write software and build hardware for it. To
these early pioneers it represented freedom; the freedom from huge batch processing mainframe systems
run and guarded by an elite priesthood. Overnight fortunes were made by college dropouts fascinated by
this new phenomenon, a computer that you could have at home on your kitchen table. A lot of hardware
appeared, all different to some degree and software hackers were happy to write software for these new
machines. Paradoxically it was IBM who firmly cast the mould of the modern PC by announcing the
IBM PC in 1981 and shipping it to customers early in 1982. With its Intel 8088 processor, 64K of
memory (expandable to 256K), two floppy disks and an 80 character by 25 lines Colour Graphics
Adapter (CGA) it was not very powerful by today's standards but it sold well. It was followed, in 1983,
by the IBM PC-XT which had the luxury of a 10Mbyte hard drive. It was not long before IBM PC clones
were being produced by a host of companies such as Compaq and the architecture of the PC became a
de-facto standard. This de-facto standard helped a multitude of hardware companies to compete together
in a growing market which, happily for consumers, kept prices low. Many of the system architectural
features of these early PCs have carried over into the modern PC. For example, even the most powerful
Intel Pentium Pro based system starts running in the Intel 8086's addressing mode. When Linus Torvalds
started writing what was to become Linux, he picked the most plentiful and reasonably priced hardware,
an Intel 80386 PC.
http://ldp.iol.it/LDP/tlk/basics/hw.html (1 di 6) [08/03/2001 10.08.18]

Figure 1.1: A typical PC motherboard.
Looking at a PC from the outside, the most obvious components are a system box, a keyboard, a mouse
and a video monitor. On the front of the system box are some buttons, a little display showing some
numbers and a floppy drive. Most systems these days have a CD ROM and if you feel that you have to
protect your data, then there will also be a tape drive for backups. These devices are collectively known
as the peripherals.
Although the CPU is in overall control of the system, it is not the only intelligent device. All of the
peripheral controllers, for example the IDE controller, have some level of intelligence. Inside the PC
(Figure 1.1) you will see a motherboard containing the CPU or microprocessor, the memory and a
number of slots for the ISA or PCI peripheral controllers. Some of the controllers, for example the IDE
disk controller may be built directly onto the system board.

1.1 The CPU
The CPU, or rather microprocessor, is the heart of any computer system. The microprocessor calculates,
performs logical operations and manages data flows by reading instructions from memory and then
executing them. In the early days of computing the functional components of the microprocessor were
separate (and physically large) units. This is when the term Central Processing Unit was coined. The
modern microprocessor combines these components onto an integrated circuit etched onto a very small
piece of silicon. The terms CPU, microprocessor and processor are all used interchangeably in this book.
Microprocessors operate on binary data; that is data composed of ones and zeros.
These ones and zeros correspond to electrical switches being either on or off. Just as 42 is a decimal
number meaning ``4 10s and 2 units'', a binary number is a series of binary digits each one representing a
power of 2. In this context, a power means the number of times that a number is multiplied by itself. 10
to the power 1 ( 101 ) is 10, 10 to the power 2 ( 102 ) is 10x10, 103 is 10x10x10 and so on. Binary 0001 is
decimal 1, binary 0010 is decimal 2, binary 0011 is 3, binary 0100 is 4 and so on. So, 42 decimal is
101010 binary or (2 + 8 + 32 or 21 + 23 + 25 ). Rather than using binary to represent numbers in
computer programs, another base, hexadecimal is usually used.
In this base, each digital represents a power of 16. As decimal numbers only go from 0 to 9 the numbers
10 to 15 are represented as a single digit by the letters A, B, C, D, E and F. For example, hexadecimal E
is decimal 14 and hexadecimal 2A is decimal 42 (two 16s) + 10). Using the C programming language
notation (as I do throughout this book) hexadecimal numbers are prefaced by ``0x''; hexadecimal 2A is
written as 0x2A .
Microprocessors can perform arithmetic operations such as add, multiply and divide and logical
operations such as `ìs X greater than Y?''.
The processor's execution is governed by an external clock. This clock, the system clock, generates
regular clock pulses to the processor and, at each clock pulse, the processor does some work. For
example, a processor could execute an instruction every clock pulse. A processor's speed is described in
terms of the rate of the system clock ticks. A 100Mhz processor will receive 100,000,000 clock ticks
every second. It is misleading to describe the power of a CPU by its clock rate as different processors
perform different amounts of work per clock tick. However, all things being equal, a faster clock speed
means a more powerful processor. The instructions executed by the processor are very simple; for
example ``read the contents of memory at location X into register Y''. Registers are the microprocessor's
internal storage, used for storing data and performing operations on it. The operations performed may
cause the processor to stop what it is doing and jump to another instruction somewhere else in memory.
These tiny building blocks give the modern microprocessor almost limitless power as it can execute
millions or even billions of instructions a second.
The instructions have to be fetched from memory as they are executed. Instructions may themselves
reference data within memory and that data must be fetched from memory and saved there when
appropriate.
The size, number and type of register within a microprocessor is entirely dependent on its type. An Intel
4086 processor has a different register set to an Alpha AXP processor; for a start, the Intel's are 32 bits

wide and the Alpha AXP's are 64 bits wide. In general, though, any given processor will have a number
of general purpose registers and a smaller number of dedicated registers. Most processors have the
following special purpose, dedicated, registers:
Program Counter (PC)
This register contains the address of the next instruction to be executed. The contents of the PC are
automatically incremented each time an instruction is fetched,
Stack Pointer (SP)
Processors have to have access to large amounts of external read/write random access memory
(RAM) which facilitates temporary storage of data. The stack is a way of easily saving and
restoring temporary values in external memory. Usually, processors have special instructions
which allow you to push values onto the stack and to pop them off again later. The stack works on
a last in first out (LIFO) basis. In other words, if you push two values, x and y, onto a stack and
then pop a value off of the stack then you will get back the value of y.
Some processor's stacks grow upwards towards the top of memory whilst others grow downwards
towards the bottom, or base, of memory. Some processor's support both types, for example ARM.
Processor Status (PS)
Instructions may yield results; for example `ìs the content of register X greater than the content of
register Y?'' will yield true or false as a result. The PS register holds this and other information
about the current state of the processor. For example, most processors have at least two modes of
operation, kernel (or supervisor) and user. The PS register would hold information identifying the
current mode.
1.2 Memory
All systems have a memory hierarchy with memory at different speeds and sizes at different points in the
hierarchy. The fastest memory is known as cache memory and is what it sounds like - memory that is
used to temporarily hold, or cache, contents of the main memory. This sort of memory is very fast but
expensive, therefore most processors have a small amount of on-chip cache memory and more system
based (on-board) cache memory. Some processors have one cache to contain both instructions and data,
but others have two, one for instructions and the other for data. The Alpha AXP processor has two
internal memory caches; one for data (the D-Cache) and one for instructions (the I-Cache). The external
cache (or B-Cache) mixes the two together. Finally there is the main memory which relative to the
external cache memory is very slow. Relative to the on-CPU cache, main memory is positively crawling.
The cache and main memories must be kept in step (coherent). In other words, if a word of main memory
is held in one or more locations in cache, then the system must make sure that the contents of cache and
memory are the same. The job of cache coherency is done partially by the hardware and partially by the
operating system. This is also true for a number of major system tasks where the hardware and software
must cooperate closely to achieve their aims.

1.3 Buses
The individual components of the system board are interconnected by multiple connection systems
known as buses. The system bus is divided into three logical functions; the address bus, the data bus and
the control bus. The address bus specifies the memory locations (addresses) for the data transfers. The
data bus holds the data transfered. The data bus is bidirectional; it allows data to be read into the CPU
and written from the CPU. The control bus contains various lines used to route timing and control signals
throughout the system. Many flavours of bus exist, for example ISA and PCI buses are popular ways of
connecting peripherals to the system.
1.4 Controllers and Peripherals

Peripherals are real devices, such as graphics cards or disks controlled by controller chips on the system
board or on cards plugged into it. The IDE disks are controlled by the IDE controller chip and the SCSI
disks by the SCSI disk controller chips and so on. These controllers are connected to the CPU and to
each other by a variety of buses. Most systems built now use PCI and ISA buses to connect together the
main system components. The controllers are processors like the CPU itself, they can be viewed as
intelligent helpers to the CPU. The CPU is in overall control of the system.
All controllers are different, but they usually have registers which control them. Software running on the
CPU must be able to read and write those controlling registers. One register might contain status
describing an error. Another might be used for control purposes; changing the mode of the controller.
Each controller on a bus can be individually addressed by the CPU, this is so that the software device
driver can write to its registers and thus control it. The IDE ribbon is a good example, as it gives you the
ability to access each drive on the bus separately. Another good example is the PCI bus which allows
each device (for example a graphics card) to be accessed independently.
1.5 Address Spaces

The system bus connects the CPU with the main memory and is separate from the buses connecting the
CPU with the system's hardware peripherals. Collectively the memory space that the hardware
peripherals exist in is known as I/O space. I/O space may itself be further subdivided, but we will not
worry too much about that for the moment. The CPU can access both the system space memory and the
I/O space memory, whereas the controllers themselves can only access system memory indirectly and
then only with the help of the CPU. From the point of view of the device, say the floppy disk controller,
it will see only the address space that its control registers are in (ISA), and not the system memory.
Typically a CPU will have separate instructions for accessing the memory and I/O space. For example,
there might be an instruction that means ``read a byte from I/O address 0x3f0 into register X''. This is
exactly how the CPU controls the system's hardware peripherals, by reading and writing to their registers
in I/O space. Where in I/O space the common peripherals (IDE controller, serial port, floppy disk
controller and so on) have their registers has been set by convention over the years as the PC architecture
has developed. The I/O space address 0x3f0 just happens to be the address of one of the serial port's
(COM1) control registers.

There are times when controllers need to read or write large amounts of data directly to or from system
memory. For example when user data is being written to the hard disk. In this case, Direct Memory
Access (DMA) controllers are used to allow hardware peripherals to directly access system memory but
this access is under strict control and supervision of the CPU.
1.6 Timers
All operating systems need to know the time and so the modern PC includes a special peripheral called
the Real Time Clock (RTC). This provides two things: a reliable time of day and an accurate timing
interval. The RTC has its own battery so that it continues to run even when the PC is not powered on,
this is how your PC always ``knows'' the correct date and time. The interval timer allows the operating
system to accurately schedule essential work.


Chapter 2
Software Basics
A program is a set of computer instructions that perform a particular task. That program can be written in
assembler, a very low level computer language, or in a high level, machine independent language such as
the C programming language. An operating system is a special program which allows the user to run
applications such as spreadsheets and word processors. This chapter introduces basic programming
principles and gives an overview of the aims and functions of an operating system.
2.1 Computer Languages

2.1.1 Assembly Languages
The instructions that a CPU fetches from memory and executes are not at all understandable to human
beings. They are machine codes which tell the computer precisely what to do. The hexadecimal number
0x89E5 is an Intel 80486 instruction which copies the contents of the ESP register to the EBP register.
One of the first software tools invented for the earliest computers was an assembler, a program which
takes a human readable source file and assembles it into machine code. Assembly languages explicitly
handle registers and operations on data and they are specific to a particular microprocessor. The
assembly language for an Intel X86 microprocessor is very different to the assembly language for an
Alpha AXP microprocessor. The following Alpha AXP assembly code shows the sort of operations that a
program can perform:
ldr r16, (r15) ; Line 1

ldr r17, 4(r15) ; Line 2
beq r16,r17,100 ; Line 3
str r17, (r15) ; Line 4
100: ; Line 5
The first statement (on line 1) loads register 16 from the address held in register 15. The next instruction
loads register 17 from the next location in memory. Line 3 compares the contents of register 16 with that
of register 17 and, if they are equal, branches to label 100. If the registers do not contain the same value
then the program continues to line 4 where the contents of r17 are saved into memory. If the registers do
contain the same value then no data needs to be saved. Assembly level programs are tedious and tricky to
http://ldp.iol.it/LDP/tlk/basics/sw.html (1 di 7) [08/03/2001 10.08.23]

write and prone to errors. Very little of the Linux kernel is written in assembly language and those parts
that are are written only for efficiency and they are specific to particular microprocessors.
2.1.2 The C Programming Language and Compiler

Writing large programs in assembly language is a difficult and time consuming task. It is prone to error
and the resulting program is not portable, being tied to one particular processor family. It is far better to
use a machine independent language like C. C allows you to describe programs in terms of their logical
algorithms and the data that they operate on. Special programs called compilers read the C program and
translate it into assembly language, generating machine specific code from it. A good compiler can
generate assembly instructions that are very nearly as efficient as those written by a good assembly
programmer. Most of the Linux kernel is written in the C language. The following C fragment:
if (x != y)
x = y ;
performs exactly the same operations as the previous example assembly code. If the contents of the
variable x are not the same as the contents of variable y then the contents of y will be copied to x. C
code is organized into routines, each of which perform a task. Routines may return any value or data type
supported by C. Large programs like the Linux kernel comprise many separate C source modules each
with its own routines and data structures. These C source code modules group together logical functions
such as filesystem handling code.
C supports many types of variables, a variable is a location in memory which can be referenced by a
symbolic name. In the above C fragment x and y refer to locations in memory. The programmer does not
care where in memory the variables are put, it is the linker (see below) that has to worry about that. Some
variables contain different sorts of data, integer and floating point and others are pointers.
Pointers are variables that contain the address, the location in memory of other data. Consider a variable
called x, it might live in memory at address 0x80010000. You could have a pointer, called px, which
points at x. px might live at address 0x80010030. The value of px would be 0x80010000: the address of
the variable x.
C allows you to bundle together related variables into data structures. For example,
struct {
int i ;
char b ;
} my_struct ;
is a data structure called my_struct which contains two elements, an integer (32 bits of data storage)
called i and a character (8 bits of data) called b.

2.1.3 Linkers
Linkers are programs that link together several object modules and libraries to form a single, coherent,
program. Object modules are the machine code output from an assembler or compiler and contain
executable machine code and data together with information that allows the linker to combine the
modules together to form a program. For example one module might contain all of a program's database
functions and another module its command line argument handling functions. Linkers fix up references
between these object modules, where a routine or data structure referenced in one module actually exists
in another module. The Linux kernel is a single, large program linked together from its many constituent
object modules.
2.2 What is an Operating System?

Without software a computer is just a pile of electronics that gives off heat. If the hardware is the heart of
a computer then the software is its soul. An operating system is a collection of system programs which
allow the user to run application software. The operating system abstracts the real hardware of the system
and presents the system's users and its applications with a virtual machine. In a very real sense the
software provides the character of the system. Most PCs can run one or more operating systems and each
one can have a very different look and feel. Linux is made up of a number of functionally separate pieces
that, together, comprise the operating system. One obvious part of Linux is the kernel itself; but even that
would be useless without libraries or shells.
In order to start understanding what an operating system is, consider what happens when you type an
apparently simple command:
$ ls
Mail c images perl
docs tcl
$
The $ is a prompt put out by a login shell (in this case bash). This means that it is waiting for you, the
user, to type some command. Typing ls causes the keyboard driver to recognize that characters have been
typed. The keyboard driver passes them to the shell which processes that command by looking for an
executable image of the same name. It finds that image, in /bin/ls. Kernel services are called to pull
the ls executable image into virtual memory and start executing it. The ls image makes calls to the file
subsystem of the kernel to find out what files are available. The filesystem might make use of cached
filesystem information or use the disk device driver to read this information from the disk. It might even
cause a network driver to exchange information with a remote machine to find out details of remote files
that this system has access to (filesystems can be remotely mounted via the Networked File System or
NFS). Whichever way the information is located, ls writes that information out and the video driver
displays it on the screen.
All of the above seems rather complicated but it shows that even most simple commands reveal that an
operating system is in fact a co-operating set of functions that together give you, the user, a coherent

view of the system.
2.2.1 Memory management

With infinite resources, for example memory, many of the things that an operating system has to do
would be redundant. One of the basic tricks of any operating system is the ability to make a small amount
of physical memory behave like rather more memory. This apparently large memory is known as virtual
memory. The idea is that the software running in the system is fooled into believing that it is running in a
lot of memory. The system divides the memory into easily handled pages and swaps these pages onto a
hard disk as the system runs. The software does not notice because of another trick, multi-processing.
2.2.2 Processes
A process could be thought of as a program in action, each process is a separate entity that is running a
particular program. If you look at the processes on your Linux system, you will see that there are rather a
lot. For example, typing ps shows the following processes on my system:
$ ps
PID TTY STAT TIME COMMAND
158 pRe 1 0:00 -bash
174 pRe 1 0:00 sh /usr/X11R6/bin/startx
175 pRe 1 0:00 xinit /usr/X11R6/lib/X11/xinit/xinitrc --
178 pRe 1 N 0:00 bowman
182 pRe 1 N 0:01 rxvt -geometry 120x35 -fg white -bg black
184 pRe 1 < 0:00 xclock -bg grey -geometry -1500-1500 -padding 0
185 pRe 1 < 0:00 xload -bg grey -geometry -0-0 -label xload
187 pp6 1 9:26 /bin/bash
203 ppc 2 0:00 /bin/bash
1797 v06 1 0:00 /bin/bash
3056 pp6 3 < 0:02 emacs intro/introduction.tex
3270 pp6 3 0:00 ps
$
If my system had many CPUs then each process could (theoretically at least) run on a different CPU.
Unfortunately, there is only one so again the operating system resorts to trickery by running each process
in turn for a short period. This period of time is known as a time-slice. This trick is known as
multi-processing or scheduling and it fools each process into thinking that it is the only process.
Processes are protected from one another so that if one process crashes or malfunctions then it will not
affect any others. The operating system achieves this by giving each process a separate address space
which only they have access to.

2.2.3 Device drivers
Device drivers make up the major part of the Linux kernel. Like other parts of the operating system, they
operate in a highly privileged environment and can cause disaster if they get things wrong. Device
drivers control the interaction between the operating system and the hardware device that they are
controlling. For example, the filesystem makes use of a general block device interface when writing
blocks to an IDE disk. The driver takes care of the details and makes device specific things happen.
Device drivers are specific to the controller chip that they are driving which is why, for example, you
need the NCR810 SCSI driver if your system has an NCR810 SCSI controller.
2.2.4 The Filesystems

In Linux, as it is for Unix TM , the separate filesystems that the system may use are not accessed by
device identifiers (such as a drive number or a drive name) but instead they are combined into a single
hierarchical tree structure that represents the filesystem as a single entity. Linux adds each new
filesystem into this single filesystem tree as they are mounted onto a mount directory, for example
/mnt/cdrom. One of the most important features of Linux is its support for many different filesystems.
This makes it very flexible and well able to coexist with other operating systems. The most popular
filesystem for Linux is the EXT2 filesystem and this is the filesystem supported by most of the Linux
distributions.
A filesystem gives the user a sensible view of files and directories held on the hard disks of the system
regardless of the filesystem type or the characteristics of the underlying physical device. Linux
transparently supports many different filesystems (for example MS-DOS and EXT2) and presents all of
the mounted files and filesystems as one integrated virtual filesystem. So, in general, users and processes
do not need to know what sort of filesystem that any file is part of, they just use them.
The block device drivers hide the differences between the physical block device types (for example, IDE
and SCSI) and, so far as each filesystem is concerned, the physical devices are just linear collections of
blocks of data. The block sizes may vary between devices, for example 512 bytes is common for floppy
devices whereas 1024 bytes is common for IDE devices and, again, this is hidden from the users of the
system. An EXT2 filesystem looks the same no matter what device holds it.
2.3 Kernel Data Structures

The operating system must keep a lot of information about the current state of the system. As things
happen within the system these data structures must be changed to reflect the current reality. For
example, a new process might be created when a user logs onto the system. The kernel must create a data
structure representing the new process and link it with the data structures representing all of the other
processes in the system.
Mostly these data structures exist in physical memory and are accessible only by the kernel and its
subsystems. Data structures contain data and pointers; addresses of other data structures or the addresses
of routines. Taken all together, the data structures used by the Linux kernel can look very confusing.
Every data structure has a purpose and although some are used by several kernel subsystems, they are
more simple than they appear at first sight.

Understanding the Linux kernel hinges on understanding its data structures and the use that the various
functions within the Linux kernel makes of them. This book bases its description of the Linux kernel on
its data structures. It talks about each kernel subsystem in terms of its algorithms, its methods of getting
things done, and their usage of the kernel's data structures.
2.3.1 Linked Lists

Linux uses a number of software engineering techniques to link together its data structures. On a lot of
occasions it uses linked or chained data structures. If each data structure describes a single instance or
occurance of something, for example a process or a network device, the kernel must be able to find all of
the instances. In a linked list a root pointer contains the address of the first data structure, or element, in
the list and each data structure contains a pointer to the next element in the list. The last element's next
pointer would be 0 or NULL to show that it is the end of the list. In a doubly linked list each element
contains both a pointer to the next element in the list but also a pointer to the previous element in the list.
Using doubly linked lists makes it easier to add or remove elements from the middle of list although you
do need more memory accesses. This is a typical operating system trade off: memory accesses versus
CPU cycles.
2.3.2 Hash Tables

Linked lists are handy ways of tying data structures together but navigating linked lists can be inefficient.
If you were searching for a particular element, you might easily have to look at the whole list before you
find the one that you need. Linux uses another technique, hashing to get around this restriction. A hash
table is an array or vector of pointers. An array, or vector, is simply a set of things coming one after
another in memory. A bookshelf could be said to be an array of books. Arrays are accessed by an index,
the index is an offset into the array. Taking the bookshelf analogy a little further, you could describe each
book by its position on the shelf; you might ask for the 5th book.
A hash table is an array of pointers to data structures and its index is derived from information in those
data structures. If you had data structures describing the population of a village then you could use a
person's age as an index. To find a particular person's data you could use their age as an index into the
population hash table and then follow the pointer to the data structure containing the person's details.
Unfortunately many people in the village are likely to have the same age and so the hash table pointer
becomes a pointer to a chain or list of data structures each describing people of the same age. However,
searching these shorter chains is still faster than searching all of the data structures.
As a hash table speeds up access to commonly used data structures, Linux often uses hash tables to
implement caches. Caches are handy information that needs to be accessed quickly and are usually a
subset of the full set of information available. Data structures are put into a cache and kept there because
the kernel often accesses them. There is a drawback to caches in that they are more complex to use and
maintain than simple linked lists or hash tables. If the data structure can be found in the cache (this is
known as a cache hit, then all well and good. If it cannot then all of the relevant data structures must be
searched and, if the data structure exists at all, it must be added into the cache. In adding new data
structures into the cache an old cache entry may need discarding. Linux must decide which one to
discard, the danger being that the discarded data structure may be the next one that Linux needs.

2.3.3 Abstract Interfaces
The Linux kernel often abstracts its interfaces. An interface is a collection of routines and data structures
which operate in a particular way. For example all network device drivers have to provide certain
routines in which particular data structures are operated on. This way there can be generic layers of code
using the services (interfaces) of lower layers of specific code. The network layer is generic and it is
supported by device specific code that conforms to a standard interface.
Often these lower layers register themselves with the upper layer at boot time. This registration usually
involves adding a data structure to a linked list. For example each filesystem built into the kernel
registers itself with the kernel at boot time or, if you are using modules, when the filesystem is first used.
You can see which filesystems have registered themselves by looking at the file
/proc/filesystems. The registration data structure often includes pointers to functions. These are
the addresses of software functions that perform particular tasks. Again, using filesystem registration as
an example, the data structure that each filesystem passes to the Linux kernel as it registers includes the
address of a filesystem specfic routine which must be called whenever that filesystem is mounted.


Chapter 3
Memory Management
The memory management subsystem is one of the most important parts of the operating system. Since the early days of
computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to
overcome this limitation and the most successful of these is virtual memory. Virtual memory makes the system appear to have
more memory than it actually has by sharing it between competing processes as they need it.
Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides:
Large Address Spaces
The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual
memory can be many times larger than the physical memory in the system,
Protection
Each process in the system has its own virtual address space. These virtual address spaces are completely separate from
each other and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms
allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue
applications.
Memory Mapping
Memory mapping is used to map image and data files into a processes address space. In memory mapping, the contents of
a file are linked directly into the virtual address space of a process.
Fair Physical Memory Allocation
The memory management subsystem allows each running process in the system a fair share of the physical memory of the
system,
Shared Virtual Memory
Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need
processes to share memory. For example there could be several processes in the system running the bash command shell.
Rather than have several copies of bash, one in each processes virtual address space, it is better to have only one copy in
physical memory and all of the processes running bash share it. Dynamic libraries are another common example of
executing code shared between several processes.
Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with two or more processes
exchanging information via memory common to all of them. Linux supports the Unix TM System V shared memory IPC.
3.1 An Abstract Model of Virtual Memory
http://ldp.iol.it/LDP/tlk/mm/memory.html (1 di 16) [08/03/2001 10.08.28]

Figure 3.1: Abstract model of Virtual to Physical address mapping
Before considering the methods that Linux uses to support virtual memory it is useful to consider an abstract model that is not
cluttered by too much detail.
As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need
to fetch or store the contents of a location in memory. The processor then executes the instruction and moves onto the next
instruction in the program. In this way the processor is always accessing memory either to fetch instructions or to fetch and store
data.
In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are
converted into physical addresses by the processor based on information held in a set of tables maintained by the operating
system.
To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are
all the same size, they need not be but if they were not, the system would be very hard to administer. Linux on Alpha
AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number;
the page frame number (PFN).
In this paged model, a virtual address is composed of two parts; an offset and a virtual page frame number. If the page size is 4
Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the
processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate
the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do
this the processor uses page tables.
Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These
page tables map each processes virtual pages into physical pages in memory. This shows that process X's virtual page frame
number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped
into physical page frame number 4. Each entry in the theoretical page table contains the following information:
● Valid flag. This indicates if this page table entry is valid,
● The physical page frame number that this entry is describing,
● Access control information. This describes how the page may be used. Can it be written to? Does it contain executable
code?

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the
table (0 is the first element).
To translate a virtual address into a physical one, the processor must first work out the virtual addresses page frame number and
the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting.
Looking again at Figures 3.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in
process Y's virtual address space then the processor would translate that address into offset 0x194 into virtual page frame number
1.
The processor uses the virtual page frame number as an index into the processes page table to retrieve its page table entry. If the
page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid,
the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must
pass control to the operating system so that it can fix things up.
Just how the processor notifies the operating system that the correct process has attempted to access a virtual address for which
there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the
operating system is notified of the faulting virtual address and the reason for the page fault.
Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page
size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or
data that it needs.
Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts
at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.
By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any
order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas
virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual
page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of virtual memory do not have to
be present in physical memory in any particular order.
3.1.1 Demand Paging

As there is much less physical memory than virtual memory the operating system must be careful that it does not use the physical
memory inefficiently. One way to save physical memory is to only load virtual pages that are currently being used by the
executing program. For example, a database program may be run to query a database. In this case not all of the database needs to
be loaded into memory, just those data records that are being examined. If the database query is a search query then it does not
make sense to load the code from the database program that deals with adding new records. This technique of only loading
virtual pages into memory as they are accessed is known as demand paging.
When a process attempts to access a virtual address that is not currently in memory the processor cannot find a page table entry
for the virtual page referenced. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame
number 2 and so if process X attempts to read from an address within virtual page frame number 2 the processor cannot translate
the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.
If the faulting virtual address is invalid this means that the process has attempted to access a virtual address that it should not
have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the
operating system will terminate it, protecting the other processes in the system from this rogue process.
If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring
the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process
must wait quite a while until the page has been fetched. If there are other processes that could run then the operating system will
select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame
number is added to the processes page table. The process is then restarted at the machine instruction where the memory fault
occurred. This time the virtual memory access is made, the processor can make the virtual to physical address translation and so
the process continues to run.
Linux uses demand paging to load executable images into a processes virtual memory. Whenever a command is executed, the file
containing it is opened and its contents are mapped into the processes virtual memory. This is done by modifying the data

structures describing this processes memory map and is known as memory mapping. However, only the first part of the image is
actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and
Linux uses the processes memory map in order to determine which parts of the image to bring into memory for execution.
3.1.2 Swapping
If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating
system must make room for this page by discarding another page from physical memory.
If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does
not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory
from the image or data file.
However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at
a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file
called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the
operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.
If the algorithm used to decide which pages to discard or swap (the swap algorithm is not efficient then a condition known as
thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too
busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly
accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the
working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.
Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system.
This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is
accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for
swapping.
3.1.3 Shared Virtual Memory

Virtual memory makes it easy for several processes to share memory. All memory access are made via page tables and each
process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number
must appear in a page table entry in both of their page tables.
Figure 3.1 shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4
whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared
physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it.
3.1.4 Physical and Virtual Addressing Modes

It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where
the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address
mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to
perform any address translations in this mode. The Linux kernel is linked to run in physical address space.
The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several
areas and designates two of them as physically mapped addresses. This kernel address space is known as KSEG address space
and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition,
kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute
from address 0xfffffc0000310000.
3.1.5 Access Control

The page table entries also contain access control information. As the processor is already using the page table entry to map a
processes virtual address to a physical one, it can easily use the access control information to check that the process is not
accessing memory in a way that it should not.

There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing
executable code, is naturally read only memory; the operating system should not allow a process to write data over its executable
code. By contrast, pages containing data can be written to but attempts to execute that memory as instructions should fail. Most
processors have at least two modes of execution: kernel and user. You would not want kernel code executing by a user or kernel
data structures to be accessible except when the processor is running in kernel mode.
Figure 3.2: Alpha AXP Page Table Entry

The access control information is held in the PTE and is processor specific; figure 3.2 shows the PTE for Alpha AXP. The bit
fields have the following meanings:
V
Valid, if set this PTE is valid,
FOE
``Fault on Execute'', Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and
passes control to the operating system,
FOW
``Fault on Write'', as above but page fault on an attempt to write to this page,
FOR
``Fault on Read'', as above but page fault on an attempt to read from this page,
ASM
Address Space Match. This is used when the operating system wishes to clear only some of the entries from the
Translation Buffer,
KRE
Code running in kernel mode can read this page,
URE
Code running in user mode can read this page,
GH
Granularity hint used when mapping an entire block with a single Translation Buffer entry rather than many,
KWE
Code running in kernel mode can write to this page,
UWE
Code running in user mode can write to this page,

page frame number
For PTEs with the V bit set, this field contains the physical Page Frame Number (page frame number) for this PTE. For
invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file.
The following two bits are defined and used by Linux:
_PAGE_DIRTY
if set, the page needs to be written out to the swap file,
_PAGE_ACCESSED
Used by Linux to mark a page as having been accessed.
3.2 Caches
If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both
operating system and processor designers try hard to extract more performance from the system. Apart from making the
processors, memory and so on faster the best approach is to maintain caches of useful information and data that make some
operations faster. Linux uses a number of memory management related caches:
Buffer Cache
The buffer cache contains data buffers that are used by the block device drivers.
These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from
a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized
blocks of data. All hard disks are block devices.
The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of
data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not
need to be read from the physical block device, for example a hard disk, and access to it is much faster.
Page Cache
This is used to speed up access to images and data on disk.
It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As
pages are read into memory from disk, they are cached in the page cache.
Swap Cache
Only modified (or dirty) pages are saved in the swap file.
So long as these pages are not modified after they have been written to the swap file then the next time the page is
swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply
be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations.
Hardware Caches
One commonly implemented hardware cache is in the processor; a cache of Page Table Entries. In this case, the processor
does not always read the page table directly but instead caches translations for pages as it needs them. These are the
Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the
system.
When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one,
it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the
processor cannot find a matching TLB entry then it must get the operating system to help. It does this by signalling the
operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the
operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping.
When the exception has been cleared, the processor will make another attempt to translate the virtual address. This time it
will work because there is now a valid entry in the TLB for that address.
The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space
maintaining these caches and, if the caches become corrupted, the system will crash.

3.3 Linux Page Tables
Figure 3.3: Three Level Page Tables

Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page frame number of the next
level of Page Table. Figure 3.3 shows how a virtual address can be broken into a number of fields; each field providing an offset
into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level
field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level
of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is
found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page.
Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular
process. This way, the kernel does not need to know the format of the page table entries or how they are arranged.
This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of
page tables, and for Intel x86 processors, which have two levels of page tables.
3.4 Page Allocation and Deallocation

There are many demands on the physical pages in the system. For example, when an image is loaded into memory the operating
system needs to allocate pages. These will be freed when the image has finished executing and is unloaded. Another use for
physical pages is to hold kernel specific data structures such as the page tables themselves. The mechanisms and data structures
used for page allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual memory
subsystem.
All of the physical pages in the system are described by the mem_map data structure which is a list of mem_map_t
1 structures which is initialized at boot time. Each mem_map_t describes a single physical page in the system. Important fields
(so far as memory management is concerned) are:
count

This is a count of the number of users of this page. The count is greater than one when the page is shared between many
processes,
age
This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping,
map_nr
This is the physical page frame number that this mem_map_t describes.
The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is
supported by this mechanism and so far as the code is concerned, the size of the page and physical paging mechanisms used by
the processor are irrelevant.
Each element of free_area contains information about blocks of pages. The first element in the array describes single pages,
the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue
head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer
to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.
Figure free-area-figure shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has
2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56.
3.4.1 Page Allocation

Linux uses the Buddy algorithm 2 to effectively allocate and deallocate blocks of pages. The page allocation code
attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That
means that it can allocate a block 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to
grant this request (nr_free_pages > min_free_pages) the allocation code will search the free_area for a block of pages of the
size requested. Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For
example, element 2 of the array has a memory map that describes free and allocated blocks each of 4 pages long.
The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued
on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size
(which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched
or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until
there is a block of the right size. Because the blocks are each a power of 2 pages big then this breaking down process is easy as
you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is
returned to the caller.

Figure 3.4: The free_area data structure
For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4)
would be broken into two 2 page blocks. The first, starting at page frame number 4 would be returned to the caller as the
allocated pages and the second block, starting at page frame number 6 would be queued as a free block of 2 pages onto element 1
of the free_area array.
3.4.2 Page Deallocation

Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The
page deallocation code
recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy
combination of blocks into larger blocks.
Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is
combined with the newly freed block of pages to form a new free block of pages for the next size block of pages. Each time two
blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into
a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.
For example, in Figure 3.4, if page frame number 1 were to be freed, then that would be combined with the already free page
frame number 0 and queued onto element 1 of the free_area as a free block of size 2 pages.
3.5 Memory Mapping

When an image is executed, the contents of the executable image must be brought into the processes virtual address space. The
same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually
brought into physical memory, instead it is merely linked into the processes virtual memory. Then, as the parts of the program

are referenced by the running application, the image is brought into memory from the executable image. This linking of an image
into a processes virtual address space is known as memory mapping.
Figure 3.5: Areas of Virtual Memory

Every processes virtual memory is represented by an mm_struct data structure. This contains information about the image that
it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each
vm_area_struct data structure describes the start and end of the area of virtual memory, the processes access rights to that
memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating
this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process
has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This
operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image
into memory.
When an executable image is mapped into a processes virtual address a set of vm_area_struct data structures is generated.
Each vm_area_struct data structure represents a part of the executable image; the executable code, initialized data
(variables), unitialized data and so on. Linux supports a number of standard virtual memory operations and as the
vm_area_struct data structures are created, the correct set of virtual memory operations are associated with them.

3.6 Demand Paging
Once an executable image has been memory mapped into a processes virtual memory it can start to execute. As only the very
start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical
memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page
fault to Linux.
The page fault describes the virtual address where the page fault occurred and the type of memory access that caused.
Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching
through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an
AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual
address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the
process does not have a handler for that signal it will be terminated.
Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the
process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled
with a memory error.
Now that Linux has determined that the page fault is legal, it must deal with it.
Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk
somewhere. It does this by using the page table entry for this faulting virtual address.
If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha
AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field.
In this case the PFN field holds information about where in the swap (and which swap file) the page is being held. How pages in
the swap file are handled is described later in this chapter.
Not all vm_area_struct data structures have a set of virtual memory operations and even those that do may not have a
nopage operation. This is because by default Linux will fix up the access by allocating a new physical page and creating a valid
page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it.
The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the
required image page into physical memory.
However the required page is brought into physical memory, the processes page tables are updated. It may be necessary for
hardware specific actions to update those entries, particularly if the processor uses translation look aside buffers. Now that the
page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual
memory access.
3.7 The Linux Page Cache

Figure 3.6: The Linux Page Cache
The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these
pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of
pointers to mem_map_t data structures.
Each file in Linux is identified by a VFS inode data structure (described in Chapter filesystem-chapter) and each VFS inode
is unique and fully describes one and only one file. The index into the page table is derived from the file's VFS inode and the
offset into the file.
Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand
paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure
representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system
that holds the image. Linux allocates a physical page and reads the page from the file on disk.
If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is
accessing the pages in the file serially, the next page will be waiting in memory for the process.
Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer
needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages.
In this case Linux will reduce the size of the page cache.
3.8 Swapping Out and Discarding Pages

When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This
task falls to the kernel swap daemon (kswapd).
The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes have no virtual memory,
instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more
than merely swap pages out to the system's swap files. Its role is make sure that there are enough free pages in the system to keep
the memory management system operating efficiently.
The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer
to periodically expire.

Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses
two variables, free_pages_high and free_pages_low to decide if it should free some pages. So long as the number of free pages in
the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For
the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the
swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to
the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at
system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has
fallen below free_pages_high or worse still free_pages_low, the kernel swap daemon will try three ways to reduce the number of
physical pages being used by the system:
Reducing the size of the buffer and page caches,
Swapping out System V shared memory pages,
Swapping out and discarding pages.
If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages
before it next runs. Otherwise it will try to free 3 pages. Each of the above methods are tried in turn until enough pages have been
freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each
time it runs it will start trying to free pages using this last successful method.
After it has free sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon
freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual
time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between
checks.
3.8.1 Reducing the Size of the Page and Buffer Caches

The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache,
which contains pages of memory mapped files, may contain unneccessary pages that are filling up the system's memory.
Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded
buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires
no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful
side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages
from these caches is done fairly, all processes will suffer equally.
Every time the Kernel swap daemon tries to shrink these caches
it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The size of the
block of pages examined is higher if the kernel swap daemon is intensively swapping; that is if the number of free pages in the
system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is
examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the
minute hand of a clock, the whole mem_map page vector is examined a few pages at a time.
Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that
shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page
is not in either cache then the next page in the mem_map page vector is examined.
Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation
more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined.
If all the buffers are freed, then the pages that contain them are also be freed. If the examined page is in the Linux page cache, it
is removed from the page cache and freed.
When enough pages have been freed on this attempt then the kernel swap daemon will wait until the next time it is periodically
woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), then no page tables need
updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages.

3.8.2 Swapping Out System V Shared Memory Pages
System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual
memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail
in Chapter IPC-chapter. For now it is enough to say that each area of System V shared memory is described by a shmid_ds
data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of
virtual memory. The vm_area_struct data structures describe where in each processes virtual memory this area of System V
shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the
vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries
each of which describes the physical page that a shared virtual page maps to.
The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages.
. Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two
indices, the first is an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of
System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory.
As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of
the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the
page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon
finds the page table entry in each of the sharing processes page tables (by following a pointer from each vm_area_struct
data structure). If this processes page table entry for this page of System V shared memory is valid, it converts it into an invalid
but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V
shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for
this area of System V shared memory.
If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out
to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared
memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the
set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when
the page has to be brought back into physical memory.
3.8.3 Swapping Out and Discarding Pages

The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping.
Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or
discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot
be retrieved another way.
A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the
executable instructions of an image will never be modified by the image and so will never be written to the swap file. These
pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the
executable image.
Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas
which are not shared or locked.
Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of
pages.
Pages cannot be swapped or discarded if they are locked in memory.
The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the
Kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvinate on
access; the swap daemon only swaps out old pages. The default action when a page is first allocated, is to give it an initial age of
3. Each time it is touched, it's age is increased by 3 to a maximum of 20. Every time the Kernel swap daemon runs it ages pages,
decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information)
are stored in the swap_control data structure.

If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses
an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily
written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops
pointer in the vm_area_struct) and that method is used. Otherwise, the swap daemon will allocate a page in the swap file
and write the page out to that device.
The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is
in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used.
Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or
rather not dirty) pages can be discarded and put back into the free_area for re-use.
If enough of the swappable processes pages have been swapped out or discarded, the swap daemon will again sleep. The next
time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each processes
physical pages until the system is again in balance. This is much fairer than swapping out whole processes.
3.9 The Swap Cache

When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is
both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back
into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file
remains valid.
Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the
system. This is a page table entry for a swapped out page and describes which swap file the page is being held in together with its
location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file that has not
been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.
When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page,
it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last
read from the swap file.
The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information
which allow Linux to find the right swap file and the right page within that swap file.
3.10 Swapping Pages In

The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual
memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical
memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a
virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was
marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so
hands control back to the operating system describing as it does so the virtual address that faulted and the reason for the fault.
The format of this information and how the processor passes control to the operating system is processor specific.
The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of
virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for
this process until it finds the one containing the faulting virtual address. This is very time critical code and a processes
vm_area_struct data structures are so arranged as to make this search take as little time as possible.
Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of
virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on.
The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds
is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a
swapped out page is processor specific but all processors mark these pages as invalid and put the information neccessary to
locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into
physical memory.

At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page
has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the
area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for
this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled
as it requires special handling because the format of a swapped out System V shared page is a little different from that of an
ordinairy swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinairy page
that does not need to be specially handled.
It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap
file (and which swap file) is taken from the the invalid page table entry.
If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not
marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as
dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can
avoid the write of the page to its swap file because the page is already in the swap file.
If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap
cache and its page table entry is marked as both dirty and writable.
Footnotes:
1 Confusingly the structure is also known as the page structure.
2 Bibliography reference here


Chapter 5
Interprocess Communication
Mechanisms
Processes communicate with each other and with the kernel to coordinate their activities. Linux supports
a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them but
Linux also supports the System V IPC mechanisms named after the Unix TM release in which they first
appeared.
5.1 Signals
Signals are one of the oldest inter-process communication methods used by Unix TM systems. They are
used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard
interrupt or an error condition such as the process attempting to access a non-existent location in its
virtual memory. Signals are also used by the shells to signal job control commands to their child
processes.
There are a set of defined signals that the kernel can generate or that can be generated by other processes
in the system, provided that they have the correct privileges. You can list a system's set of signals using
the kill command (kill -l), on my Intel Linux box this gives:
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL

5) SIGTRAP 6) SIGIOT 7) SIGBUS 8) SIGFPE
9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2
13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD
18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN
22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO
30) SIGPWR
The numbers are different for an Alpha AXP Linux box. Processes can choose to ignore most of the
signals that are generated, with two notable exceptions: neither the SIGSTOP signal which causes a
process to halt its execution nor the SIGKILL signal which causes a process to exit can be ignored.
Otherwise though, a process can choose just how it wants to handle the various signals. Processes can
http://ldp.iol.it/LDP/tlk/ipc/ipc.html (1 di 11) [08/03/2001 10.08.32]

block the signals and, if they do not block them, they can either choose to handle them themselves or
allow the kernel to handle them. If the kernel handles the signals, it will do the default actions required
for this signal. For example, the default action when a process receives the SIGFPE (floating point
exception) signal is to core dump and then exit. Signals have no inherent relative priorities. If two signals
are generated for a process at the same time then they may be presented to the process or handled in any
order. Also there is no mechanism for handling multiple signals of the same kind. There is no way that a
process can tell if it received 1 or 42 SIGCONT signals.
Linux implements signals using information stored in the task_struct for the process. The number
of supported signals is limited to the word size of the processor. Processes with a word size of 32 bits can
have 32 signals whereas 64 bit processors like the Alpha AXP may have up to 64 signals. The currently
pending signals are kept in the signal field with a mask of blocked signals held in blocked. With the
exception of SIGSTOP and SIGKILL, all signals can be blocked. If a blocked signal is generated, it
remains pending until it is unblocked. Linux also holds information about how each process handles
every possible signal and this is held in an array of sigaction data structures pointed at by the
task_struct for each process. Amongst other things it contains either the address of a routine that
will handle the signal or a flag which tells Linux that the process either wishes to ignore this signal or let
the kernel handle the signal for it. The process modifies the default signal handling by making system
calls and these calls alter the sigaction for the appropriate signal as well as the blocked mask.
Not every process in the system can send signals to every other process, the kernel can and super users
can. Normal processes can only send signals to processes with the same uid and gid or to processes in the
same process group1. Signals are generated by setting the appropriate bit in the task_struct's
signal field. If the process has not blocked the signal and is waiting but interruptible (in state
Interruptible) then it is woken up by changing its state to Running and making sure that it is in the run
queue. That way the scheduler will consider it a candidate for running when the system next schedules. If
the default handling is needed, then Linux can optimize the handling of the signal. For example if the
signal SIGWINCH (the X window changed focus) and the default handler is being used then there is
nothing to be done.
Signals are not presented to the process immediately they are generated., they must wait until the process
is running again. Every time a process exits from a system call its signal and blocked fields are
checked and, if there are any unblocked signals, they can now be delivered. This might seem a very
unreliable method but every process in the system is making system calls, for example to write a
character to the terminal, all of the time. Processes can elect to wait for signals if they wish, they are
suspended in state Interruptible until a signal is presented. The Linux signal processing code looks at the
sigaction structure for each of the current unblocked signals.
If a signal's handler is set to the default action then the kernel will handle it. The SIGSTOP signal's
default handler will change the current process's state to Stopped and then run the scheduler to select a
new process to run. The default action for the SIGFPE signal will core dump the process and then cause
it to exit. Alternatively, the process may have specfied its own signal handler. This is a routine which
will be called whenever the signal is generated and the sigaction structure holds the address of this
routine. The kernel must call the process's signal handling routine and how this happens is processor
specific but all CPUs must cope with the fact that the current process is running in kernel mode and is
just about to return to the process that called the kernel or system routine in user mode. The problem is
solved by manipulating the stack and registers of the process. The process's program counter is set to the

address of its signal handling routine and the parameters to the routine are added to the call frame or
passed in registers. When the process resumes operation it appears as if the signal handling routine were
called normally.
Linux is POSIX compatible and so the process can specify which signals are blocked when a particular
signal handling routine is called. This means changing the blocked mask during the call to the
processes signal handler. The blocked mask must be returned to its original value when the signal
handling routine has finished. Therefore Linux adds a call to a tidy up routine which will restore the
original blocked mask onto the call stack of the signalled process. Linux also optimizes the case where
several signal handling routines need to be called by stacking them so that each time one handling
routine exits, the next one is called until the tidy up routine is called.
5.2 Pipes
The common Linux shells all allow redirection. For example
$ ls | pr | lpr
pipes the output from the ls command listing the directory's files into the standard input of the pr
command which paginates them. Finally the standard output from the pr command is piped into the
standard input of the lpr command which prints the results on the default printer. Pipes then are
unidirectional byte streams which connect the standard output from one process into the standard input of
another process. Neither process is aware of this redirection and behaves just as it would normally. It is
the shell which sets up these temporary pipes between the processes.

Figure 5.1: Pipes
In Linux, a pipe is implemented using two file data structures which both point at the same temporary
VFS inode which itself points at a physical page within memory. Figure 5.1 shows that each file data
structure contains pointers to different file operation routine vectors; one for writing to the pipe, the other
for reading from the pipe.
This hides the underlying differences from the generic system calls which read and write to ordinary
files. As the writing process writes to the pipe, bytes are copied into the shared data page and when the
reading process reads from the pipe, bytes are copied from the shared data page. Linux must synchronize
access to the pipe. It must make sure that the reader and the writer of the pipe are in step and to do this it
uses locks, wait queues and signals.
When the writer wants to write to the pipe it uses the standard write library functions. These all pass file
descriptors that are indices into the process's set of file data structures, each one representing an open
file or, as in this case, an open pipe. The Linux system call uses the write routine pointed at by the file

data structure describing this pipe. That write routine uses information held in the VFS inode
representing the pipe to manage the write request.
If there is enough room to write all of the bytes into the pipe and, so long as the pipe is not locked by its
reader, Linux locks it for the writer and copies the bytes to be written from the process's address space
into the shared data page. If the pipe is locked by the reader or if there is not enough room for the data
then the current process is made to sleep on the pipe inode's wait queue and the scheduler is called so that
another process can run. It is interruptible, so it can receive signals and it will be woken by the reader
when there is enough room for the write data or when the pipe is unlocked. When the data has been
written, the pipe's VFS inode is unlocked and any waiting readers sleeping on the inode's wait queue will
themselves be woken up.
Reading data from the pipe is a very similar process to writing to it.
Processes are allowed to do non-blocking reads (it depends on the mode in which they opened the file or
pipe) and, in this case, if there is no data to be read or if the pipe is locked, an error will be returned. This
means that the process can continue to run. The alternative is to wait on the pipe inode's wait queue until
the write process has finished. When both processes have finished with the pipe, the pipe inode is
discarded along with the shared data page.
Linux also supports named pipes, also known as FIFOs because pipes operate on a First In, First Out
principle. The first data written into the pipe is the first data read from the pipe. Unlike pipes, FIFOs are
not temporary objects, they are entities in the file system and can be created using the mkfifo command.
Processes are free to use a FIFO so long as they have appropriate access rights to it. The way that FIFOs
are opened is a little different from pipes. A pipe (its two file data structures, its VFS inode and the
shared data page) is created in one go whereas a FIFO already exists and is opened and closed by its
users. Linux must handle readers opening the FIFO before writers open it as well as readers reading
before any writers have written to it. That aside, FIFOs are handled almost exactly the same way as pipes
and they use the same data structures and operations.
5.3 Sockets
REVIEW NOTE: Add when networking chapter written.
5.3.1 System V IPC Mechanisms

Linux supports three types of interprocess communication mechanisms that first appeared in Unix
TM System V (1983). These are message queues, semaphores and shared memory. These System V IPC
mechanisms all share common authentication methods. Processes may access these resources only by
passing a unique reference identifier to the kernel via system calls. Access to these System V IPC objects
is checked using access permissions, much like accesses to files are checked. The access rights to the
System V IPC object is set by the creator of the object via system calls. The object's reference identifier
is used by each mechanism as an index into a table of resources. It is not a straight forward index but
requires some manipulation to generate the index.
All Linux data structures representing System V IPC objects in the system include an ipc_perm

structure which contains the owner and creator process's user and group identifiers. The access mode for
this object (owner, group and other) and the IPC object's key. The key is used as a way of locating the
System V IPC object's reference identifier. Two sets of keys are supported: public and private. If the key
is public then any process in the system, subject to rights checking, can find the reference identifier for
the System V IPC object. System V IPC objects can never be referenced with a key, only by their
reference identifier.
5.3.2 Message Queues

Message queues allow one or more processes to write messages, which will be read by one or more
reading processes. Linux maintains a list of message queues, the msgque vector; each element of which
points to a msqid_ds data structure that fully describes the message queue. When message queues are
created a new msqid_ds data structure is allocated from system memory and inserted into the vector.
Figure 5.2: System V IPC Message Queues

Each msqid_ds
data structure contains an ipc_perm data structure and pointers to the messages entered onto this
queue. In addition, Linux keeps queue modification times such as the last time that this queue was
written to and so on. The msqid_ds also contains two wait queues; one for the writers to the queue and
one for the readers of the message queue.
Each time a process attempts to write a message to the write queue its effective user and group identifiers

are compared with the mode in this queue's ipc_perm data structure. If the process can write to the
queue then the message may be copied from the process's address space into a msg
data structure and put at the end of this message queue. Each message is tagged with an application
specific type, agreed between the cooperating processes. However, there may be no room for the
message as Linux restricts the number and length of messages that can be written. In this case the process
will be added to this message queue's write wait queue and the scheduler will be called to select a new
process to run. It will be woken up when one or more messages have been read from this message queue.
Reading from the queue is a similar process. Again, the processes access rights to the write queue are
checked. A reading process may choose to either get the first message in the queue regardless of its type
or select messages with particular types. If no messages match this criteria the reading process will be
added to the message queue's read wait queue and the scheduler run. When a new message is written to
the queue this process will be woken up and run again.
5.3.3 Semaphores
In its simplest form a semaphore is a location in memory whose value can be tested and set by more than
one process. The test and set operation is, so far as each process is concerned, uninterruptible or atomic;
once started nothing can stop it. The result of the test and set operation is the addition of the current value
of the semaphore and the set value, which can be positive or negative. Depending on the result of the test
and set operation one process may have to sleep until the semphore's value is changed by another
process. Semaphores can be used to implement critical regions, areas of critical code that only one
process at a time should be executing.
Say you had many cooperating processes reading records from and writing records to a single data file.
You would want that file access to be strictly coordinated. You could use a semaphore with an initial
value of 1 and, around the file operating code, put two semaphore operations, the first to test and
decrement the semaphore's value and the second to test and increment it. The first process to access the
file would try to decrement the semaphore's value and it would succeed, the semaphore's value now
being 0. This process can now go ahead and use the data file but if another process wishing to use it now
tries to decrement the semaphore's value it would fail as the result would be -1. That process will be
suspended until the first process has finished with the data file. When the first process has finished with
the data file it will increment the semaphore's value, making it 1 again. Now the waiting process can be
woken and this time its attempt to increment the semaphore will succeed.

Figure 5.3: System V IPC Semaphores
System V IPC semaphore objects each describe a semaphore array and Linux uses the semid_ds
data structure to represent this. All of the semid_ds data structures in the system are pointed at by the
semary, a vector of pointers. There are sem_nsems in each semaphore array, each one described by a
sem data structure pointed at by sem_base. All of the processes that are allowed to manipulate the
semaphore array of a System V IPC semaphore object may make system calls that perform operations on
them. The system call can specify many operations and each operation is described by three inputs; the
semaphore index, the operation value and a set of flags. The semaphore index is an index into the
semaphore array and the operation value is a numerical value that will be added to the current value of
the semaphore. First Linux tests whether or not all of the operations would succeed. An operation will
succeed if the operation value added to the semaphore's current value would be greater than zero or if
both the operation value and the semaphore's current value are zero. If any of the semaphore operations
would fail Linux may suspend the process but only if the operation flags have not requested that the
system call is non-blocking. If the process is to be suspended then Linux must save the state of the
semaphore operations to be performed and put the current process onto a wait queue. It does this by
building a sem_queue data structure on the stack and filling it out. The new sem_queue data
structure is put at the end of this semaphore object's wait queue (using the sem_pending and
sem_pending_last pointers). The current process is put on the wait queue in the sem_queue data
structure (sleeper) and the scheduler called to choose another process to run.

If all of the semaphore operations would have succeeded and the current process does not need to be
suspended, Linux goes ahead and applies the operations to the appropriate members of the semaphore
array. Now Linux must check that any waiting, suspended, processes may now apply their semaphore
operations. It looks at each member of the operations pending queue (sem_pending) in turn, testing to
see if the semphore operations will succeed this time. If they will then it removes the sem_queue data
structure from the operations pending list and applies the semaphore operations to the semaphore array. It
wakes up the sleeping process making it available to be restarted the next time the scheduler runs. Linux
keeps looking through the pending list from the start until there is a pass where no semaphore operations
can be applied and so no more processes can be woken.
There is a problem with semaphores, deadlocks. These occur when one process has altered the
semaphores value as it enters a critical region but then fails to leave the critical region because it crashed
or was killed. Linux protects against this by maintaining lists of adjustments to the semaphore arrays.
The idea is that when these adjustments are applied, the semaphores will be put back to the state that they
were in before the a process's set of semaphore operations were applied. These adjustments are kept in
sem_undo data structures queued both on the semid_ds data structure and on the task_struct
data structure for the processes using these semaphore arrays.
Each individual semaphore operation may request that an adjustment be maintained. Linux will maintain
at most one sem_undo data structure per process for each semaphore array. If the requesting process
does not have one, then one is created when it is needed. The new sem_undo data structure is queued
both onto this process's task_struct data structure and onto the semaphore array's semid_ds data
structure. As operations are applied to the semphores in the semaphore array the negation of the
operation value is added to this semphore's entry in the adjustment array of this process's sem_undo
data structure. So, if the operation value is 2, then -2 is added to the adjustment entry for this semaphore.
When processes are deleted, as they exit Linux works through their set of sem_undo data structures
applying the adjustments to the semaphore arrays. If a semaphore set is deleted, the sem_undo data
structures are left queued on the process's task_struct but the semaphore array identifier is made
invalid. In this case the semaphore clean up code simply discards the sem_undo data structure.
5.3.4 Shared Memory

Shared memory allows one or more processes to communicate via memory that appears in all of their
virtual address spaces. The pages of the virtual memory is referenced by page table entries in each of the
sharing processes' page tables. It does not have to be at the same address in all of the processes' virtual
memory. As with all System V IPC objects, access to shared memory areas is controlled via keys and
access rights checking. Once the memory is being shared, there are no checks on how the processes are
using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access
to the memory.

Figure 5.4: System V IPC Shared Memory
Each newly created shared memory area is represented by a shmid_ds data structure. These are kept in
the shm_segs vector.
The shmid_ds data structure decribes how big the area of shared memory is, how many processes are
using it and information about how that shared memory is mapped into their address spaces. It is the
creator of the shared memory that controls the access permissions to that memory and whether its key is
public or private. If it has enough access rights it may also lock the shared memory into physical
memory.
Each process that wishes to share the memory must attach to that virtual memory via a system call. This
creates a new vm_area_struct data structure describing the shared memory for this process. The
process can choose where in its virtual address space the shared memory goes or it can let Linux choose
a free area large enough. The new vm_area_struct structure is put into the list of
vm_area_struct pointed at by the shmid_ds. The vm_next_shared and vm_prev_shared
pointers are used to link them together. The virtual memory is not actually created during the attach; it
happens when the first process attempts to access it.
The first time that a process accesses one of the pages of the shared virtual memory, a page fault will
occur. When Linux fixes up that page fault it finds the vm_area_struct data structure describing it.
This contains pointers to handler routines for this type of shared virtual memory. The shared memory
page fault handling code looks in the list of page table entries for this shmid_ds to see if one exists for
this page of the shared virtual memory. If it does not exist, it will allocate a physical page and create a

page table entry for it. As well as going into the current process's page tables, this entry is saved in the
shmid_ds. This means that when the next process that attempts to access this memory gets a page fault,
the shared memory fault handling code will use this newly created physical page for that process too. So,
the first process that accesses a page of the shared memory causes it to be created and thereafter access
by the other processes cause that page to be added into their virtual address spaces.
When processes no longer wish to share the virtual memory, they detach from it. So long as other
processes are still using the memory the detach only affects the current process. Its vm_area_struct
is removed from the shmid_ds data structure and deallocated. The current process's page tables are
updated to invalidate the area of virtual memory that it used to share. When the last process sharing the
memory detaches from it, the pages of the shared memory current in physical memory are freed, as is the
shmid_ds data structure for this shared memory.
Further complications arise when shared virtual memory is not locked into physical memory. In this case
the pages of the shared memory may be swapped out to the system's swap disk during periods of high
memory usage. How shared memory memory is swapped into and out of physical memory is described
in Chapter mm-chapter.
Footnotes:
1 REVIEW NOTE: Explain process groups.


Chapter 6
PCI
Peripheral Component Interconnect (PCI), as its name implies is a standard that describes how to connect the peripheral components
of a system together in a structured and controlled way. The standard describes the way that the system components are electrically
connected and the way that they should behave. This chapter looks at how the Linux kernel initializes the system's PCI buses and
devices.
Figure 6.1: Example PCI Based System

Figure 6.1 is a logical diagram of an example PCI based system. The PCI buses and PCI-PCI bridges are the glue connecting the
system components together; the CPU is connected to PCI bus 0, the primary PCI bus as is the video device. A special PCI device, a
PCI-PCI bridge connects the primary bus to the secondary PCI bus, PCI bus 1. In the jargon of the PCI specification, PCI bus 1 is
described as being downstream of the PCI-PCI bridge and PCI bus 0 is up-stream of the bridge. Connected to the secondary PCI bus
are the SCSI and ethernet devices for the system. Physically the bridge, secondary PCI bus and two devices would all be contained
on the same combination PCI card. The PCI-ISA bridge in the system supports older, legacy ISA devices and the diagram shows a
super I/O controller chip, which controls the keyboard, mouse and floppy. 1
6.1 PCI Address Spaces

The CPU and the PCI devices need to access memory that is shared between them. This memory is used by device drivers to control
the PCI devices and to pass information between them. Typically the shared memory contains control and status registers for the
device. These registers are used to control the device and to read its status. For example, the PCI SCSI device driver would read its
status register to find out if the SCSI device was ready to write a block of information to the SCSI disk. Or it might write to the
control register to start the device running after it has been turned on.
The CPU's system memory could be used for this shared memory but if it were, then every time a PCI device accessed memory, the
http://ldp.iol.it/LDP/tlk/dd/pci.html (1 di 12) [08/03/2001 10.08.40]

CPU would have to stall, waiting for the PCI device to finish. Access to memory is generally limited to one system component at a
time. This would slow the system down. It is also not a good idea to allow the system's peripheral devices to access main memory in
an uncontrolled way. This would be very dangerous; a rogue device could make the system very unstable.
Peripheral devices have their own memory spaces. The CPU can access these spaces but access by the devices into the system's
memory is very strictly controlled using DMA (Direct Memory Access) channels. ISA devices have access to two address spaces,
ISA I/O (Input/Output) and ISA memory. PCI has three; PCI I/O, PCI Memory and PCI Configuration space. All of these address
spaces are also accessible by the CPU with the the PCI I/O and PCI Memory address spaces being used by the device drivers and the
PCI Configuration space being used by the PCI initialization code within the Linux kernel.
The Alpha AXP processor does not have natural access to addresses spaces other than the system address space. It uses support
chipsets to access other address spaces such as PCI Configuration space. It uses a sparse address mapping scheme which steals part
of the large virtual address space and maps it to the PCI address spaces.
6.2 PCI Configuration Headers
Figure 6.2: The PCI Configuration Header

Every PCI device in the system, including the PCI-PCI bridges has a configuration data structure that is somewhere in the PCI
configuration address space. The PCI Configuration header allows the system to identify and control the device. Exactly where the
header is in the PCI Configuration address space depends on where in the PCI topology that device is. For example, a PCI video
card plugged into one PCI slot on the PC motherboard will have its configuration header at one location and if it is plugged into
another PCI slot then its header will appear in another location in PCI Configuration memory. This does not matter, for wherever the
PCI devices and bridges are the system will find and configure them using the status and configuration registers in their
configuration headers.

Typically, systems are designed so that every PCI slot has it's PCI Configuration Header in an offset that is related to its slot on the
board. So, for example, the first slot on the board might have its PCI Configuration at offset 0 and the second slot at offset 256 (all
headers are the same length, 256 bytes) and so on. A system specific hardware mechanism is defined so that the PCI configuration
code can attempt to examine all possible PCI Configuration Headers for a given PCI bus and know which devices are present and
which devices are absent simply by trying to read one of the fields in the header (usually the Vendor Identification field) and getting
some sort of error. The describes one possible error message as returning 0xFFFFFFFF when attempting to read the Vendor
Identification and Device Identification fields for an empty PCI slot.
Figure 6.2 shows the layout of the 256 byte PCI configuration header. It contains the following fields:
Vendor Identification
A unique number describing the originator of the PCI device. Digital's PCI Vendor Identification is 0x1011 and Intel's is
0x8086.
Device Identification
A unique number describing the device itself. For example, Digital's 21141 fast ethernet device has a device identification of
0x0009.
Status
This field gives the status of the device with the meaning of the bits of this field set by the standard. .
Command
By writing to this field the system controls the device, for example allowing the device to access PCI I/O memory,
Class Code
This identifies the type of device that this is. There are standard classes for every sort of device; video, SCSI and so on. The
class code for SCSI is 0x0100.
Base Address Registers
These registers are used to determine and allocate the type, amount and location of PCI I/O and PCI memory space that the
device can use.
Interrupt Pin
Four of the physical pins on the PCI card carry interrupts from the card to the PCI bus. The standard labels these as A, B, C
and D. The Interrupt Pin field describes which of these pins this PCI device uses. Generally it is hardwired for a pariticular
device. That is, every time the system boots, the device uses the same interrupt pin. This information allows the interrupt
handling subsystem to manage interrupts from this device,
Interrupt Line
The Interrupt Line field of the device's PCI Configuration header is used to pass an interrupt handle between the PCI
initialisation code, the device's driver and Linux's interrupt handling subsystem. The number written there is meaningless to
the the device driver but it allows the interrupt handler to correctly route an interrupt from the PCI device to the correct device
driver's interrupt handling code within the Linux operating system. See Chapter interrupt-chapter on page for details on how
Linux handles interrupts.
6.3 PCI I/O and PCI Memory Addresses

These two address spaces are used by the devices to communicate with their device drivers running in the Linux kernel on the CPU.
For example, the DECchip 21141 fast ethernet device maps its internal registers into PCI I/O space. Its Linux device driver then
reads and writes those registers to control the device. Video drivers typically use large amounts of PCI memory space to contain
video information.
Until the PCI system has been set up and the device's access to these address spaces has been turned on using the Command field in
the PCI Configuration header, nothing can access them. It should be noted that only the PCI configuration code reads and writes PCI
configuration addresses; the Linux device drivers only read and write PCI I/O and PCI memory addresses.

6.4 PCI-ISA Bridges
These bridges support legacy ISA devices by translating PCI I/O and PCI Memory space accesses into ISA I/O and ISA Memory
accesses. A lot of systems now sold contain several ISA bus slots and several PCI bus slots. Over time the need for this backwards
compatibility will dwindle and PCI only systems will be sold. Where in the ISA address spaces (I/O and Memory) the ISA devices
of the system have their registers was fixed in the dim mists of time by the early Intel 8080 based PCs. Even a $5000 Alpha
AXP based computer systems will have its ISA floppy controller at the same place in ISA I/O space as the first IBM PC. The PCI
specification copes with this by reserving the lower regions of the PCI I/O and PCI Memory address spaces for use by the ISA
peripherals in the system and using a single PCI-ISA bridge to translate any PCI memory accesses to those regions into ISA
accesses.
6.5 PCI-PCI Bridges

PCI-PCI bridges are special PCI devices that glue the PCI buses of the system together. Simple systems have a single PCI bus but
there is an electrical limit on the number of PCI devices that a single PCI bus can support. Using PCI-PCI bridges to add more PCI
buses allows the system to support many more PCI devices. This is particularly important for a high performance server. Of course,
Linux fully supports the use of PCI-PCI bridges.
6.5.1 PCI-PCI Bridges: PCI I/O and PCI Memory Windows

PCI-PCI bridges only pass a subset of PCI I/O and PCI memory read and write requests downstream. For example, in Figure 6.1 on
page pageref, the PCI-PCI bridge will only pass read and write addresses from PCI bus 0 to PCI bus 1 if they are for PCI I/O or PCI
memory addresses owned by either the SCSI or ethernet device; all other PCI I/O and memory addresses are ignored. This filtering
stops addresses propogating needlessly throughout the system. To do this, the PCI-PCI bridges must be programmed with a base and
limit for PCI I/O and PCI Memory space access that they have to pass from their primary bus onto their secondary bus. Once the
PCI-PCI Bridges in a system have been configured then so long as the Linux device drivers only access PCI I/O and PCI Memory
space via these windows, the PCI-PCI Bridges are invisible. This is an important feature that makes life easier for Linux PCI device
driver writers. However, it also makes PCI-PCI bridges somewhat tricky for Linux to configure as we shall see later on.
6.5.2 PCI-PCI Bridges: PCI Configuration Cycles and PCI Bus Numbering
Figure 6.3: Type 0 PCI Configuration Cycle
Figure 6.4: Type 1 PCI Configuration Cycle

So that the CPU's PCI initialization code can address devices that are not on the main PCI bus, there has to be a mechanism that
allows bridges to decide whether or not to pass Configuration cycles from their primary interface to their secondary interface. A
cycle is just an address as it appears on the PCI bus. The PCI specification defines two formats for the PCI Configuration addresses;
Type 0 and Type 1; these are shown in Figure 6.3 and Figure 6.4 respectively. Type 0 PCI Configuration cycles do not contain a
bus number and these are interpretted by all devices as being for PCI configuration addresses on this PCI bus. Bits 31:11 of the Type
0 configuraration cycles are treated as the device select field. One way to design a system is to have each bit select a different
device. In this case bit 11 would select the PCI device in slot 0, bit 12 would select the PCI device in slot 1 and so on. Another way
is to write the device's slot number directly into bits 31:11. Which mechanism is used in a system depends on the system's PCI
memory controller.
Type 1 PCI Configuration cycles contain a PCI bus number and this type of configuration cycle is ignored by all PCI devices except
the PCI-PCI bridges. All of the PCI-PCI Bridges seeing Type 1 configuration cycles may choose to pass them to the PCI buses

downstream of themselves. Whether the PCI-PCI Bridge ignores the Type 1 configuration cycle or passes it onto the downstream
PCI bus depends on how the PCI-PCI Bridge has been configured. Every PCI-PCI bridge has a primary bus interface number and a
secondary bus interface number. The primary bus interface being the one nearest the CPU and the secondary bus interface being the
one furthest away. Each PCI-PCI Bridge also has a subordinate bus number and this is the maximum bus number of all the PCI
buses that are bridged beyond the secondary bus interface. Or to put it another way, the subordinate bus number is the highest
numbered PCI bus downstream of the PCI-PCI bridge. When the PCI-PCI bridge sees a Type 1 PCI configuration cycle it does one
of the following things:
● Ignore it if the bus number specified is not in between the bridge's secondary bus number and subordinate bus number
(inclusive),
● Convert it to a Type 0 configuration command if the bus number specified matches the secondary bus number of the bridge,
● Pass it onto the secondary bus interface unchanged if the bus number specified is greater than the secondary bus number and
less than or equal to the subordinate bus number.
So, if we want to address Device 1 on bus 3 of the topology Figure pci-pci-config-eg-4 on page we must generate a Type 1
Configuration command from the CPU. Bridge1 passes this unchanged onto Bus 1. Bridge2 ignores it but Bridge3 converts it into a
Type 0 Configuration command and sends it out on Bus 3 where Device 1 responds to it.
It is up to each individual operating system to allocate bus numbers during PCI configuration but whatever the numbering scheme
used the following statement must be true for all of the PCI-PCI bridges in the system:
`Àll PCI buses located behind a PCI-PCI bridge must reside between the seondary bus number and the subordinate bus number
(inclusive).''
If this rule is broken then the PCI-PCI Bridges will not pass and translate Type 1 PCI configuration cycles correctly and the system
will fail to find and initialise the PCI devices in the system. To achieve this numbering scheme, Linux configures these special
devices in a particular order. Section pci-pci-bus-numbering on page describes Linux's PCI bridge and bus numbering scheme in
detail together with a worked example.
6.6 Linux PCI Initialization

The PCI initialisation code in Linux is broken into three logical parts:
PCI Device Driver
This pseudo-device driver searches the PCI system starting at Bus 0 and locates all PCI devices and bridges in the system. It
builds a linked list of data structures describing the topology of the system. Additionally, it numbers all of the bridges that it
finds.
PCI BIOS
This software layer provides the services described in bib-pci-bios-specification. Even though Alpha AXP does not have
BIOS services, there is equivalent code in the Linux kernel providing the same functions,
PCI Fixup
System specific fixup code tidies up the system specific loose ends of PCI initialization.
6.6.1 The Linux Kernel PCI Data Structures

Figure 6.5: Linux Kernel PCI Data Structures
As the Linux kernel initialises the PCI system it builds data structures mirroring the real PCI topology of the system. Figure 6.5
shows the relationships of the data structures that it would build for the example PCI system in Figure 6.1 on page pageref.
Each PCI device (including the PCI-PCI Bridges) is described by a pci_dev data structure. Each PCI bus is described by a
pci_bus data structure. The result is a tree structure of PCI buses each of which has a number of child PCI devices attached to it.
As a PCI bus can only be reached using a PCI-PCI Bridge (except the primary PCI bus, bus 0), each pci_bus contains a pointer to
the PCI device (the PCI-PCI Bridge) that it is accessed through. That PCI device is a child of the the PCI Bus's parent PCI bus.
Not shown in the Figure 6.5 is a pointer to all of the PCI devices in the system, pci_devices. All of the PCI devices in the
system have their pci_dev data structures queued onto this queue.. This queue is used by the Linux kernel to quickly find all of
the PCI devices in the system.

6.6.2 The PCI Device Driver
The PCI device driver is not really a device driver at all but a function of the operating system called at system initialisation time.
The PCI initialisation code must scan all of the PCI buses in the system looking for all PCI devices in the system (including PCI-PCI
bridge devices).
It uses the PCI BIOS code to find out if every possible slot in the current PCI bus that it is scanning is occupied. If the PCI slot is
occupied, it builds a pci_dev data structure describing the device and links into the list of known PCI devices (pointed at by
pci_devices).
The PCI initialisation code starts by scanning PCI Bus 0. It tries to read the Vendor Identification and Device Identification fields for
every possible PCI device in every possible PCI slot. When it finds an occupied slot it builds a pci_dev data structure describing
the device. All of the pci_dev data structures built by the PCI initialisation code (including all of the PCI-PCI Bridges) are linked
into a singly linked list; pci_devices.
If the PCI device that was found was a PCI-PCI bridge then a pci_bus data structure is built and linked into the tree of pci_bus
and pci_dev data structures pointed at by pci_root. The PCI initialisation code can tell if the PCI device is a PCI-PCI Bridge
because it has a class code of 0x060400. The Linux kernel then configures the PCI bus on the other (downstream) side of the
PCI-PCI Bridge that it has just found. If more PCI-PCI Bridges are found then these are also configured. This process is known as a
depthwise algorithm; the system's PCI topology is fully mapped depthwise before searching breadthwise. Looking at Figure 6.1 on
page pageref, Linux would configure PCI Bus 1 with its Ethernet and SCSI device before it configured the video device on PCI Bus
0.
As Linux searches for downstream PCI buses it must also configure the intervening PCI-PCI bridges' secondary and subordinate bus
numbers. This is described in detail in Section pci-pci-bus-numbering below.
Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers
Figure 6.6: Configuring a PCI System: Part 1

For PCI-PCI bridges to pass PCI I/O, PCI Memory or PCI Configuration address space reads and writes across them, they need to
know the following:
Primary Bus Number
The bus number immediately upstream of the PCI-PCI Bridge,
Secondary Bus Number
The bus number immediately downstream of the PCI-PCI Bridge,
Subordinate Bus Number
The highest bus number of all of the buses that can be reached downstream of the bridge.
PCI I/O and PCI Memory Windows
The window base and size for PCI I/O address space and PCI Memory address space for all addresses downstream of the
PCI-PCI Bridge.
The problem is that at the time when you wish to configure any given PCI-PCI bridge you do not know the subordinate bus number
for that bridge. You do not know if there are further PCI-PCI bridges downstream and if you did, you do not know what numbers
will be assigned to them. The answer is to use a depthwise recursive algorithm and scan each bus for any PCI-PCI bridges assigning
them numbers as they are found. As each PCI-PCI bridge is found and its secondary bus numbered, assign it a temporary
subordinate number of 0xFF and scan and assign numbers to all PCI-PCI bridges downstream of it. This all seems complicated but
the worked example below makes this process clearer.
PCI-PCI Bridge Numbering: Step 1
Taking the topology in Figure 6.6, the first bridge the scan would find is Bridge1. The PCI bus downstream of Bridge1 would
be numbered as 1 and Bridge1 assigned a secondary bus number of 1 and a temporary subordinate bus number of 0xFF. This
means that all Type 1 PCI Configuration addresses specifying a PCI bus number of 1 or higher would be passed across
Bridge1 and onto PCI Bus 1. They would be translated into Type 0 Configuration cycles if they have a bus number of 1 but
left untranslated for all other bus numbers. This is exactly what the Linux PCI initialisation code needs to do in order to go
and scan PCI Bus 1.

Linux uses a depthwise algorithm and so the initialisation code goes on to scan PCI Bus 1. Here it finds PCI-PCI Bridge2.
There are no further PCI-PCI bridges beyond PCI-PCI Bridge2, so it is assigned a subordinate bus number of 2 which matches
the number assigned to its secondary interface. Figure 6.7 shows how the buses and PCI-PCI bridges are numbered at this
point.

The PCI initialisation code returns to scanning PCI Bus 1 and finds another PCI-PCI bridge, Bridge3. It is assigned 1 as its
primary bus interface number, 3 as its secondary bus interface number and 0xFF as its subordinate bus number. Figure 6.8 on
page pageref shows how the system is configured now. Type 1 PCI configuration cycles with a bus number of 1, 2 or 3 wil be
correctly delivered to the appropriate PCI buses.

Linux starts scanning PCI Bus 3, downstream of PCI-PCI Bridge3. PCI Bus 3 has another PCI-PCI bridge (Bridge4) on it, it is
assigned 3 as its primary bus number and 4 as its secondary bus number. It is the last bridge on this branch and so it is
assigned a subordinate bus interface number of 4. The initialisation code returns to PCI-PCI Bridge3 and assigns it a
subordinate bus number of 4. Finally, the PCI initialisation code can assign 4 as the subordinate bus number for PCI-PCI
Bridge1. Figure 6.9 on page pageref shows the final bus numbers.
6.6.3 PCI BIOS Functions

The PCI BIOS functions are a series of standard routines which are common across all platforms. For example, they are the same for
both Intel and Alpha AXP based systems. They allow the CPU controlled access to all of the PCI address spaces.
Only Linux kernel code and device drivers may use them.
6.6.4 PCI Fixup

The PCI fixup code for Alpha AXP does rather more than that for Intel (which basically does nothing).
For Intel based systems the system BIOS, which ran at boot time, has already fully configured the PCI system. This leaves Linux
with little to do other than map that configuration. For non-Intel based systems further configuration needs to happen to:
● Allocate PCI I/O and PCI Memory space to each device,
● Configure the PCI I/O and PCI Memory address windows for each PCI-PCI bridge in the system,
● Generate Interrupt Line values for the devices; these control interrupt handling for the device.
The next subsections describe how that code works.
Finding Out How Much PCI I/O and PCI Memory Space a Device Needs

Each PCI device found is queried to find out how much PCI I/O and PCI Memory address space it requires. To do this, each Base
Address Register has all 1's written to it and then read. The device will return 0's in the don't-care address bits, effectively specifying
the address space required.
Figure 6.10: PCI Configuration Header: Base Address Registers

There are two basic types of Base Address Register, the first indicates within which address space the devices registers must reside;
either PCI I/O or PCI Memory space. This is indicated by Bit 0 of the register. Figure 6.10 shows the two forms of the Base
Address Register for PCI Memory and for PCI I/O.
To find out just how much of each address space a given Base Address Register is requesting, you write all 1s into the register and
then read it back. The device will specify zeros in the don't care address bits, effectively specifying the address space required. This
design implies that all address spaces used are a power of two and are naturally aligned.
For example when you initialize the DECChip 21142 PCI Fast Ethernet device, it tells you that it needs 0x100 bytes of space of
either PCI I/O or PCI Memory. The initialization code allocates it space. The moment that it allocates space, the 21142's control and
status registers can be seen at those addresses.
Allocating PCI I/O and PCI Memory to PCI-PCI Bridges and Devices
Like all memory the PCI I/O and PCI memory spaces are finite, and to some extent scarce. The PCI Fixup code for non-Intel
systems (and the BIOS code for Intel systems) has to allocate each device the amount of memory that it is requesting in an efficient
manner. Both PCI I/O and PCI Memory must be allocated to a device in a naturally aligned way. For example, if a device asks for
0xB0 of PCI I/O space then it must be aligned on an address that is a multiple of 0xB0. In addition to this, the PCI I/O and PCI
Memory bases for any given bridge must be aligned on 4K and on 1Mbyte boundaries respectively. Given that the address spaces
for downstream devices must lie within all of the upstream PCI-PCI Bridge's memory ranges for any given device, it is a somewhat
difficult problem to allocate space efficiently.
The algorithm that Linux uses relies on each device described by the bus/device tree built by the PCI Device Driver being allocated
address space in ascending PCI I/O memory order. Again a recursive algorithm is used to walk the pci_bus and pci_dev data
structures built by the PCI initialisation code. Starting at the root PCI bus (pointed at by pci_root) the BIOS fixup code:
● Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively,
● For every device on the current bus (in ascending PCI I/O memory needs),
❍ allocates it space in PCI I/O and/or PCI Memory,
❍ moves on the global PCI I/O and Memory bases by the appropriate amounts,

❍ enables the device's use of PCI I/O and PCI Memory,
● Allocates space recursively to all of the buses downstream of the current bus. Note that this will change the global PCI I/O
and Memory bases,
● Aligns the current global PCI I/O and Memory bases on 4K and 1 Mbyte boundaries respectively and in doing so figure out
the size and base of PCI I/O and PCI Memory windows required by the current PCI-PCI bridge,
● Programs the PCI-PCI bridge that links to this bus with its PCI I/O and PCI Memory bases and limits,
● Turns on bridging of PCI I/O and PCI Memory accesses in the PCI-PCI Bridge. This means that if any PCI I/O or PCI
Memory addresses seen on the Bridge's primary PCI bus that are within its PCI I/O and PCI Memory address windows will be
bridged onto its secondary PCI bus.
Taking the PCI system in Figure 6.1 on page pageref as our example the PCI Fixup code would set up the system in the following
way:
Align the PCI bases
PCI I/O is 0x4000 and PCI Memory is 0x100000. This allows the PCI-ISA bridges to translate all addresses below these into
ISA address cycles,
The Video Device
This is asking for 0x200000 of PCI Memory and so we allocate it that amount starting at the current PCI Memory base of
0x200000 as it has to be naturally aligned to the size requested. The PCI Memory base is moved to 0x400000 and the PCI I/O
base remains at 0x4000.
The PCI-PCI Bridge
We now cross the PCI-PCI Bridge and allocate PCI memory there, note that we do not need to align the bases as they are
already correctly aligned:
The Ethernet Device
This is asking for 0xB0 bytes of both PCI I/O and PCI Memory space. It gets allocated PCI I/O at 0x4000 and PCI
Memory at 0x400000. The PCI Memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.
The SCSI Device
This is asking for 0x1000 PCI Memory and so it is allocated it at 0x401000 after it has been naturally aligned. The PCI
I/O base is still 0x40B0 and the PCI Memory base has been moved to 0x402000.
The PCI-PCI Bridge's PCI I/O and Memory Windows
We now return to the bridge and set its PCI I/O window at between 0x4000 and 0x40B0 and it's PCI Memory window at
between 0x400000 and 0x402000. This means that the PCI-PCI Bridge will ignore the PCI Memory accesses for the video
device and pass them on if they are for the ethernet or SCSI devices.
Footnotes:
1 For example?


Chapter 7
Interrupts and Interrupt Handling
This chapter looks at how interrupts are handled by the Linux kernel. Whilst the kernel has generic
mechanisms and interfaces for handling interrupts, most of the interrupt handling details are architecture
specific.
Figure 7.1: A Logical Diagram of Interrupt Routing

Linux uses a lot of different pieces of hardware to perform many different tasks. The video device drives
http://ldp.iol.it/LDP/tlk/dd/interrupts.html (1 di 6) [08/03/2001 10.08.45]

the monitor, the IDE device drives the disks and so on. You could drive these devices synchronously,
that is you could send a request for some operation (say writing a block of memory out to disk) and then
wait for the operation to complete. That method, although it would work, is very inefficient and the
operating system would spend a lot of time ``busy doing nothing'' as it waited for each operation to
complete. A better, more efficient, way is to make the request and then do other, more useful work and
later be interrupted by the device when it has finished the request. With this scheme, there may be many
outstanding requests to the devices in the system all happening at the same time.
There has to be some hardware support for the devices to interrupt whatever the CPU is doing. Most, if
not all, general purpose processors such as the Alpha AXP use a similar method. Some of the physical
pins of the CPU are wired such that changing the voltage (for example changing it from +5v to -5v)
causes the CPU to stop what it is doing and to start executing special code to handle the interruption; the
interrupt handling code. One of these pins might be connected to an interval timer and receive an
interrupt every 1000th of a second, others may be connected to the other devices in the system, such as
the SCSI controller.
Systems often use an interrupt controller to group the device interrupts together before passing on the
signal to a single interrupt pin on the CPU. This saves interrupt pins on the CPU and also gives flexibility
when designing systems. The interrupt controller has mask and status registers that control the interrupts.
Setting the bits in the mask register enables and disables interrupts and the status register returns the
currently active interrupts in the system.
Some of the interrupts in the system may be hard-wired, for example, the real time clock's interval timer
may be permanently connected to pin 3 on the interrupt controller. However, what some of the pins are
connected to may be determined by what controller card is plugged into a particular ISA or PCI slot. For
example, pin 4 on the interrupt controller may be connected to PCI slot number 0 which might one day
have an ethernet card in it but the next have a SCSI controller in it. The bottom line is that each system
has its own interrupt routing mechanisms and the operating system must be flexible enough to cope.
Most modern general purpose microprocessors handle the interrupts the same way. When a hardware
interrupt occurs the CPU stops executing the instructions that it was executing and jumps to a location in
memory that either contains the interrupt handling code or an instruction branching to the interrupt
handling code. This code usually operates in a special mode for the CPU, interrupt mode, and, normally,
no other interrupts can happen in this mode. There are exceptions though; some CPUs rank the interrupts
in priority and higher level interrupts may happen. This means that the first level interrupt handling code
must be very carefully written and it often has its own stack, which it uses to store the CPU's execution
state (all of the CPU's normal registers and context) before it goes off and handles the interrupt. Some
CPUs have a special set of registers that only exist in interrupt mode, and the interrupt code can use these
registers to do most of the context saving it needs to do.
When the interrupt has been handled, the CPU's state is restored and the interrupt is dismissed. The CPU
will then continue to doing whatever it was doing before being interrupted. It is important that the
interrupt processing code is as efficient as possible and that the operating system does not block
interrupts too often or for too long.

7.1 Programmable Interrupt Controllers
Systems designers are free to use whatever interrupt architecture they wish but IBM PCs use the Intel
82C59A-2 CMOS Programmable Interrupt Controller or its derivatives. This controller has been around
since the dawn of the PC and it is programmable with its registers being at well known locations in the
ISA address space. Even very modern support logic chip sets keep equivalent registers in the same place
in ISA memory. Non-Intel based systems such as Alpha AXP based PCs are free from these architectural
constraints and so often use different interrupt controllers.
If the ISA device driver has successfully found its IRQ number then it can now request control of it as
normal.
PCI based systems are much more dynamic than ISA based systems. The interrupt pin that an ISA device
uses is often set using jumpers on the hardware device and fixed in the device driver. On the other hand,
PCI devices have their interrupts allocated by the PCI BIOS or the PCI subsystem as PCI is initialized
when the system boots. Each PCI device may use one of four interrupt pins, A, B, C or D. This was fixed
when the device was built and most devices default to interrupt on pin A. The PCI interrupt lines A, B, C
and D for each PCI slot are routed to the interrupt controller. So, Pin A from PCI slot 4 might be routed
to pin 6 of the interrupt controller, pin B of PCI slot 4 to pin 7 of the interrupt controller and so on.
How the PCI interrupts are routed is entirely system specific and there must be some set up code which
understands this PCI interrupt routing topology. On Intel based PCs this is the system BIOS code that
runs at boot time but for system's without BIOS (for example Alpha AXP based systems) the Linux
kernel does this setup.
The PCI set up code writes the pin number of the interrupt controller into the PCI configuration header
for each device. It determines the interrupt pin (or IRQ) number using its knowledge of the PCI interrupt
routing topology together with the devices PCI slot number and which PCI interrupt pin that it is using.
The interrupt pin that a device uses is fixed and is kept in a field in the PCI configuration header for this
device. It writes this information into the interrupt line field that is reserved for this purpose. When the
device driver runs, it reads this information and uses it to request control of the interrupt from the Linux
kernel.
There may be many PCI interrupt sources in the system, for example when PCI-PCI bridges are used.
The number of interrupt sources may exceed the number of pins on the system's programmable interrupt
controllers. In this case, PCI devices may share interrupts, one pin on the interrupt controller taking
interrupts from more than one PCI device. Linux supports this by allowing the first requestor of an
interrupt source declare whether it may be shared. Sharing interrupts results in several irqaction data
structures being pointed at by one entry in the irq_action vector vector. When a shared interrupt
happens, Linux will call all of the interrupt handlers for that source. Any device driver that can share
interrupts (which should be all PCI device drivers) must be prepared to have its interrupt handler called
when there is no interrupt to be serviced.
7.3 Interrupt Handling

Figure 7.2: Linux Interrupt Handling Data Structures
One of the principal tasks of Linux's interrupt handling subsystem is to route the interrupts to the right
pieces of interrupt handling code. This code must understand the interrupt topology of the system. If, for
example, the floppy controller interrupts on pin 6 1 of the interrupt controller then it must recognize the
interrupt as from the floppy and route it to the floppy device driver's interrupt handling code. Linux uses
a set of pointers to data structures containing the addresses of the routines that handle the system's
interrupts. These routines belong to the device drivers for the devices in the system and it is the
responsibility of each device driver to request the interrupt that it wants when the driver is initialized.
Figure 7.2 shows that irq_action is a vector of pointers to the irqaction data structure. Each
irqaction data structure contains information about the handler for this interrupt, including the
address of the interrupt handling routine. As the number of interrupts and how they are handled varies
between architectures and, sometimes, between systems, the Linux interrupt handling code is architecture
specific. This means that the size of the irq_action vector vector varies depending on the number
of interrupt sources that there are.
When the interrupt happens, Linux must first determine its source by reading the interrupt status register
of the system's programmable interrupt controllers. It then translates that source into an offset into the
irq_action vector vector. So, for example, an interrupt on pin 6 of the interrupt controller from
the floppy controller would be translated into the seventh pointer in the vector of interrupt handlers. If
there is not an interrupt handler for the interrupt that occurred then the Linux kernel will log an error,
otherwise it will call into the interrupt handling routines for all of the irqaction data structures for
this interrupt source.
When the device driver's interrupt handling routine is called by the Linux kernel it must efficiently work
out why it was interrupted and respond. To find the cause of the interrupt the device driver would read
the status register of the device that interrupted. The device may be reporting an error or that a requested

operation has completed. For example the floppy controller may be reporting that it has completed the
positioning of the floppy's read head over the correct sector on the floppy disk. Once the reason for the
interrupt has been determined, the device driver may need to do more work. If it does, the Linux kernel
has mechanisms that allow it to postpone that work until later. This avoids the CPU spending too much
time in interrupt mode. See the Device Driver chapter (Chapter dd-chapter) for more details.
REVIEW NOTE: Fast and slow interrupts, are these an Intel thing?
Footnotes:
1 Actually, the floppy controller is one of the fixed interrupts in a PC system as, by convention, the
floppy controller is always wired to interrupt 6.


Chapter 8
Device Drivers
One of the purposes of an operating system is to hide the peculiarities of the system's hardware devices from its users.
For example the Virtual File System presents a uniform view of the mounted filesystems irrespective of the underlying
physical devices. This chapter describes how the Linux kernel manages the physical devices in the system.
The CPU is not the only intelligent device in the system, every physical device has its own hardware controller. The
keyboard, mouse and serial ports are controlled by a SuperIO chip, the IDE disks by an IDE controller, SCSI disks by a
SCSI controller and so on. Each hardware controller has its own control and status registers (CSRs) and these differ
between devices. The CSRs for an Adaptec 2940 SCSI controller are completely different from those of an NCR 810
SCSI controller. The CSRs are used to start and stop the device, to initialize it and to diagnose any problems with it.
Instead of putting code to manage the hardware controllers in the system into every application, the code is kept in the
Linux kernel. The software that handles or manages a hardware controller is known as a device driver. The Linux
kernel device drivers are, essentially, a shared library of privileged, memory resident, low level hardware handling
routines. It is Linux's device drivers that handle the peculiarities of the devices they are managing.
One of the basic features of is that it abstracts the handling of devices. All hardware devices look like regular files;
they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files.
Every device in the system is represented by a device special file, for example the first IDE disk in the system is
represented by /dev/hda. For block (disk) and character devices, these device special files are created by the mknod
command and they describe the device using major and minor device numbers. Network devices are also represented
by device special files but they are created by Linux as it finds and initializes the network controllers in the system. All
devices controlled by the same device driver have a common major device number. The minor device numbers are
used to distinguish between different devices and their controllers, for example each partition on the primary IDE disk
has a different minor device number. So, /dev/hda2, the second partition of the primary IDE disk has a major
number of 3 and a minor number of 2. Linux maps the device special file passed in system calls (say to mount a file
system on a block device) to the device's device driver using the major device number and a number of system tables,
for example the character device table, chrdevs .
Linux supports three types of hardware device: character, block and network. Character devices are read and written
directly without buffering, for example the system's serial ports /dev/cua0 and /dev/cua1. Block devices can
only be written to and read from in multiples of the block size, typically 512 or 1024 bytes. Block devices are accessed
via the buffer cache and may be randomly accessed, that is to say, any block can be read or written no matter where it
is on the device. Block devices can be accessed via their device special file but more commonly they are accessed via
the file system. Only a block device can support a mounted file system. Network devices are accessed via the BSD
socket interface and the networking subsytems described in the Networking chapter (Chapter network-chapter).
There are many different device drivers in the Linux kernel (that is one of Linux's strengths) but they all share some
common attributes:
kernel code
Device drivers are part of the kernel and, like other code within the kernel, if they go wrong they can seriously
http://ldp.iol.it/LDP/tlk/dd/drivers.html (1 di 15) [08/03/2001 10.08.53]

damage the system. A badly written driver may even crash the system, possibly corrupting file systems and
losing data,
Kernel interfaces
Device drivers must provide a standard interface to the Linux kernel or to the subsystem that they are part of. For
example, the terminal driver provides a file I/O interface to the Linux kernel and a SCSI device driver provides a
SCSI device interface to the SCSI subsystem which, in turn, provides both file I/O and buffer cache interfaces to
the kernel.
Kernel mechanisms and services
Device drivers make use of standard kernel services such as memory allocation, interrupt delivery and wait
queues to operate,
Loadable
Most of the Linux device drivers can be loaded on demand as kernel modules when they are needed and
unloaded when they are no longer being used. This makes the kernel very adaptable and efficient with the
system's resources,
Configurable
Linux device drivers can be built into the kernel. Which devices are built is configurable when the kernel is
compiled,
Dynamic
As the system boots and each device driver is initialized it looks for the hardware devices that it is controlling. It
does not matter if the device being controlled by a particular device driver does not exist. In this case the device
driver is simply redundant and causes no harm apart from occupying a little of the system's memory.
8.1 Polling and Interrupts

Each time the device is given a command, for example ``move the read head to sector 42 of the floppy disk'' the device
driver has a choice as to how it finds out that the command has completed. The device drivers can either poll the device
or they can use interrupts.
Polling the device usually means reading its status register every so often until the device's status changes to indicate
that it has completed the request. As a device driver is part of the kernel it would be disasterous if a driver were to poll
as nothing else in the kernel would run until the device had completed the request. Instead polling device drivers use
system timers to have the kernel call a routine within the device driver at some later time. This timer routine would
check the status of the command and this is exactly how Linux's floppy driver works. Polling by means of timers is at
best approximate, a much more efficient method is to use interrupts.
An interrupt driven device driver is one where the hardware device being controlled will raise a hardware interrupt
whenever it needs to be serviced. For example, an ethernet device driver would interrupt whenever it receives an
ethernet packet from the network. The Linux kernel needs to be able to deliver the interrupt from the hardware device
to the correct device driver. This is achieved by the device driver registering its usage of the interrupt with the kernel. It
registers the address of an interrupt handling routine and the interrupt number that it wishes to own. You can see which
interrupts are being used by the device drivers, as well as how many of each type of interrupts there have been, by
looking at /proc/interrupts:
0: 727432 timer
1: 20534 keyboard
2: 0 cascade
3: 79691 + serial
4: 28258 + serial
5: 1 sound blaster

11: 20868 + aic7xxx
13: 1 math error
14: 247 + ide0
15: 170 + ide1
This requesting of interrupt resources is done at driver initialization time. Some of the interrupts in the system are
fixed, this is a legacy of the IBM PC's architecture. So, for example, the floppy disk controller always uses interrupt 6.
Other interrupts, for example the interrupts from PCI devices are dynamically allocated at boot time. In this case the
device driver must first discover the interrupt number (IRQ) of the device that it is controlling before it requests
ownership of that interrupt. For PCI interrupts Linux supports standard PCI BIOS callbacks to determine information
about the devices in the system, including their IRQ numbers.
How an interrupt is delivered to the CPU itself is architecture dependent but on most architectures the interrupt is
delivered in a special mode that stops other interrupts from happening in the system. A device driver should do as little
as possible in its interrupt handling routine so that the Linux kernel can dismiss the interrupt and return to what it was
doing before it was interrupted. Device drivers that need to do a lot of work as a result of receiving an interrupt can use
the kernel's bottom half handlers or task queues to queue routines to be called later on.
8.2 Direct Memory Access (DMA)

Using interrupts driven device drivers to transfer data to or from hardware devices works well when the amount of data
is reasonably low. For example a 9600 baud modem can transfer approximately one character every millisecond
(1/1000 'th second). If the interrupt latency, the amount of time that it takes between the hardware device raising the
interrupt and the device driver's interrupt handling routine being called, is low (say 2 milliseconds) then the overall
system impact of the data transfer is very low. The 9600 baud modem data transfer would only take 0.002% of the
CPU's processing time. For high speed devices, such as hard disk controllers or ethernet devices the data transfer rate is
a lot higher. A SCSI device can transfer up to 40 Mbytes of information per second.
Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data
to or from the system's memory without the intervention of the processor. A PC's ISA DMA controller has 8 DMA
channels of which 7 are available for use by the device drivers. Each DMA channel has associated with it a 16 bit
address register and a 16 bit count register. To initiate a data transfer the device driver sets up the DMA channel's
address and count registers together with the direction of the data transfer, read or write. It then tells the device that it
may start the DMA when it wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is
taking place the CPU is free to do other things.
Device drivers have to be careful when using DMA. First of all the DMA controller knows nothing of virtual memory,
it only has access to the physical memory in the system. Therefore the memory that is being DMA'd to or from must be
a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a
process. You can however lock the process's physical pages into memory, preventing them from being swapped out to
the swap device during a DMA operation. Secondly, the DMA controller cannot access the whole of physical memory.
The DMA channel's address register represents the first 16 bits of the DMA address, the next 8 bits come from the page
register. This means that DMA requests are limited to the bottom 16 Mbytes of memory.
DMA channels are scarce resources, there are only 7 of them, and they cannot be shared between device drivers. Just
like interrupts, the device driver must be able to work out which DMA channel it should use. Like interrupts, some
devices have a fixed DMA channel. The floppy device, for example, always uses DMA channel 2. Sometimes the
DMA channel for a device can be set by jumpers; a number of ethernet devices use this technique. The more flexible
devices can be told (via their CSRs) which DMA channels to use and, in this case, the device driver can simply pick a
free DMA channel to use.
Linux tracks the usage of the DMA channels using a vector of dma_chan data structures (one per DMA channel). The
dma_chan data structure contains just two fields, a pointer to a string describing the owner of the DMA channel and a

flag indicating if the DMA channel is allocated or not. It is this vector of dma_chan data structures that is printed
when you cat /proc/dma.
8.3 Memory
Device drivers have to be careful when using memory. As they are part of the Linux kernel they cannot use virtual
memory. Each time a device driver runs, maybe as an interrupt is received or as a bottom half or task queue handler is
scheduled, the current process may change. The device driver cannot rely on a particular process running even if it
is doing work on its behalf. Like the rest of the kernel, device drivers use data structures to keep track of the device that
it is controlling. These data structures can be statically allocated, part of the device driver's code, but that would be
wasteful as it makes the kernel larger than it need be. Most device drivers allocate kernel, non-paged, memory to hold
their data.
Linux provides kernel memory allocation and deallocation routines and it is these that the device drivers use. Kernel
memory is allocated in chunks that are powers of 2. For example 128 or 512 bytes, even if the device driver asks for
less. The number of bytes that the device driver requests is rounded up to the next block size boundary. This makes
kernel memory deallocation easier as the smaller free blocks can be recombined into bigger blocks.
It may be that Linux needs to do quite a lot of extra work when the kernel memory is requested. If the amount of free
memory is low, physical pages may need to be discarded or written to the swap device. Normally, Linux would
suspend the requestor, putting the process onto a wait queue until there is enough physical memory. Not all device
drivers (or indeed Linux kernel code) may want this to happen and so the kernel memory allocation routines can be
requested to fail if they cannot immediately allocate memory. If the device driver wishes to DMA to or from the
allocated memory it can also specify that the memory is DMA'able. This way it is the Linux kernel that needs to
understand what constitutes DMA'able memory for this system, and not the device driver.
8.4 Interfacing Device Drivers with the Kernel

The Linux kernel must be able to interact with them in standard ways. Each class of device driver, character, block and
network, provides common interfaces that the kernel uses when requesting services from them. These common
interfaces mean that the kernel can treat often very different devices and their device drivers absolutely the same. For
example, SCSI and IDE disks behave very differently but the Linux kernel uses the same interface to both of them.
Linux is very dynamic, every time a Linux kernel boots it may encounter different physical devices and thus need
different device drivers. Linux allows you to include device drivers at kernel build time via its configuration scripts.
When these drivers are initialized at boot time they may not discover any hardware to control. Other drivers can be
loaded as kernel modules when they are needed. To cope with this dynamic nature of device drivers, device drivers
register themselves with the kernel as they are initialized. Linux maintains tables of registered device drivers as part of
its interfaces with them. These tables include pointers to routines and information that support the interface with that
class of devices.
8.4.1 Character Devices

Figure 8.1: Character Devices
Character devices, the simplest of Linux's devices, are accessed as files, applications use standard system calls to open
them, read from them, write to them and close them exactly as if the device were a file. This is true even if the device is
a modem being used by the PPP daemon to connect a Linux system onto a network. As a character device is initialized
its device driver registers itself with the Linux kernel by adding an entry into the chrdevs vector of
device_struct data structures. The device's major device identifier (for example 4 for the tty device) is used as
an index into this vector. The major device identifier for a device is fixed.
Each entry in the chrdevs vector, a device_struct data structure contains two elements; a pointer to the name of
the registered device driver and a pointer to a block of file operations. This block of file operations is itself the
addresses of routines within the device character device driver each of which handles specific file operations such as
open, read, write and close. The contents of /proc/devices for character devices is taken from the chrdevs
vector.
When a character special file representing a character device (for example /dev/cua0) is opened the kernel must set
things up so that the correct character device driver's file operation routines will be called. Just like an ordinairy file or
directory, each device special file is represented by a VFS inode . The VFS inode for a character special file, indeed for
all device special files, contains both the major and minor identifiers for the device. This VFS inode was created by the
underlying filesystem, for example EXT2, from information in the real filesystem when the device special file's name
was looked up.
Each VFS inode has associated with it a set of file operations and these are different depending on the filesystem object
that the inode represents. Whenever a VFS inode representing a character special file is created, its file operations are
set to the default character device operations
. This has only one file operation, the open file operation. When the character special file is opened by an application
the generic open file operation uses the device's major identifier as an index into the chrdevs vector to retrieve the
file operations block for this particular device. It also sets up the file data structure describing this character special
file, making its file operations pointer point to those of the device driver. Thereafter all of the applications file
operations will be mapped to calls to the character devices set of file operations.

8.4.2 Block Devices
Block devices also support being accessed like files. The mechanisms used to provide the correct set of file operations
for the opened block special file are very much the same as for character devices. Linux maintains the set of registered
block devices as the blkdevs vector. It, like the chrdevs vector, is indexed using the device's major device number.
Its entries are also device_struct data structures. Unlike character devices, there are classes of block devices.
SCSI devices are one such class and IDE devices are another. It is the class that registers itself with the Linux kernel
and provides file operations to the kernel. The device drivers for a class of block device provide class specific
interfaces to the class. So, for example, a SCSI device driver has to provide interfaces to the SCSI subsystem which the
SCSI subsystem uses to provide file operations for this device to the kernel.
Every block device driver must provide an interface to the buffer cache as well as the normal file operations interface.
Each block device driver fills in its entry in the blk_dev vector
of blk_dev_struct data structures . The index into this vector is, again, the device's major number. The
blk_dev_struct data structure consists of the address of a request routine and a pointer to a list of request data
structures, each one representing a request from the buffer cache for the driver to read or write a block of data.
Figure 8.2: Buffer Cache Block Device Requests

Each time the buffer cache wishes to read or write a block of data to or from a registered device it adds a request
data structure onto its blk_dev_struct. Figure 8.2 shows that each request has a pointer to one or more
buffer_head data structures, each one a request to read or write a block of data. The buffer_head structures are
locked (by the buffer cache) and there may be a process waiting on the block operation to this buffer to complete. Each
request structure is allocated from a static list, the all_requests list. If the request is being added to an empty
request list, the driver's request function is called to start processing the request queue. Otherwise the driver will simply
process every request on the request list.
Once the device driver has completed a request it must remove each of the buffer_head structures from the
request structure, mark them as up to date and unlock them. This unlocking of the buffer_head will wake up

any process that has been sleeping waiting for the block operation to complete. An example of this would be where a
file name is being resolved and the EXT2 filesystem must read the block of data that contains the next EXT2 directory
entry from the block device that holds the filesystem. The process sleeps on the buffer_head that will contain the
directory entry until the device driver wakes it up. The request data structure is marked as free so that it can be used
in another block request.
8.5 Hard Disks

Disk drives provide a more permanent method for storing data, keeping it on spinning disk platters. To write data, a
tiny head magnetizes minute particles on the platter's surface. The data is read by a head, which can detect whether a
particular minute particle is magnetized.
A disk drive consists of one or more platters, each made of finely polished glass or ceramic composites and coated
with a fine layer of iron oxide. The platters are attached to a central spindle and spin at a constant speed that can vary
between 3000 and 10,000 RPM depending on the model. Compare this to a floppy disk which only spins at 360 RPM.
The disk's read/write heads are responsible for reading and writing data and there is a pair for each platter, one head for
each surface. The read/write heads do not physically touch the surface of the platters, instead they float on a very thin
(10 millionths of an inch) cushion of air. The read/write heads are moved across the surface of the platters by an
actuator. All of the read/write heads are attached together, they all move across the surfaces of the platters together.
Each surface of the platter is divided into narrow, concentric circles called tracks. Track 0 is the outermost track and
the highest numbered track is the track closest to the central spindle. A cylinder is the set of all tracks with the same
number. So all of the 5th tracks from each side of every platter in the disk is known as cylinder 5. As the number of
cylinders is the same as the number of tracks, you often see disk geometries described in terms of cylinders. Each track
is divided into sectors. A sector is the smallest unit of data that can be written to or read from a hard disk and it is also
the disk's block size. A common sector size is 512 bytes and the sector size was set when the disk was formatted,
usually when the disk is manufactured.
A disk is usually described by its geometry, the number of cylinders, heads and sectors. For example, at boot time
Linux describes one of my IDE disks as:
hdb: Conner Peripherals 540MB - CFS540A, 516MB w/64kB Cache, CHS=1050/16/63

This means that it has 1050 cylinders (tracks), 16 heads (8 platters) and 63 sectors per track. With a sector, or block,
size of 512 bytes this gives the disk a storage capacity of 529200 bytes. This does not match the disk's stated capacity
of 516 Mbytes as some of the sectors are used for disk partitioning information. Some disks automatically find bad
sectors and re-index the disk to work around them.
Hard disks can be further subdivided into partitions. A partition is a large group of sectors allocated for a particular
purpose. Partitioning a disk allows the disk to be used by several operating system or for several purposes. A lot of
Linux systems have a single disk with three partitions; one containing a DOS filesystem, another an EXT2 filesystem
and a third for the swap partition. The partitions of a hard disk are described by a partition table; each entry describing
where the partition starts and ends in terms of heads, sectors and cylinder numbers. For DOS formatted disks, those
formatted by fdisk, there are four primary disk partitions. Not all four entries in the partition table have to be used.
There are three types of partition supported by fdisk, primary, extended and logical. Extended partitions are not real
partitions at all, they contain any number of logical parititions. Extended and logical partitions were invented as a way
around the limit of four primary partitions. The following is the output from fdisk for a disk containing two primary
partitions:
Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders

Units = cylinders of 2048 * 512 bytes

Device Boot Begin Start End Blocks Id System
/dev/sda1 1 1 478 489456 83 Linux native
/dev/sda2 479 479 510 32768 82 Linux swap
Expert command (m for help): p
Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders
Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID

1 00 1 1 0 63 32 477 32 978912 83
2 00 0 1 478 63 32 509 978944 65536 82
3 00 0 0 0 0 0 0 0 0 00
4 00 0 0 0 0 0 0 0 0 00
This shows that the first partition starts at cylinder or track 0, head 1 and sector 1 and extends to include cylinder 477,
sector 32 and head 63. As there are 32 sectors in a track and 64 read/write heads, this partition is a whole number of
cylinders in size. fdisk alligns partitions on cylinder boundaries by default. It starts at the outermost cylinder (0) and
extends inwards, towards the spindle, for 478 cylinders. The second partition, the swap partition, starts at the next
cylinder (478) and extends to the innermost cylinder of the disk.
Figure 8.3: Linked list of disks

During initialization Linux maps the topology of the hard disks in the system. It finds out how many hard disks there
are and of what type. Additionally, Linux discovers how the individual disks have been partitioned. This is all
represented by a list of gendisk data structures pointed at by the gendisk_head list pointer. As each disk
subsystem, for example IDE, is initialized it generates gendisk data structures representing the disks that it finds. It
does this at the same time as it registers its file operations and adds its entry into the blk_dev data structure. Each
gendisk data structure has a unique major device number and these match the major numbers of the block special
devices. For example, the SCSI disk subsystem creates a single gendisk entry (``sd'') with a major number of 8,
the major number of all SCSI disk devices. Figure 8.3 shows two gendisk entries, the first one for the SCSI disk
subsystem and the second for an IDE disk controller. This is ide0, the primary IDE controller.
Although the disk subsystems build the gendisk entries during their initialization they are only used by Linux during
partition checking. Instead, each disk subsystem maintains its own data structures which allow it to map device special

major and minor device numbers to partitions within physical disks. Whenever a block device is read from or written
to, either via the buffer cache or file operations, the kernel directs the operation to the appropriate device using the
major device number found in its block special device file (for example /dev/sda2). It is the individual device driver
or subsystem that maps the minor device number to the real physical device.
8.5.1 IDE Disks

The most common disks used in Linux systems today are Integrated Disk Electronic or IDE disks. IDE is a disk
interface rather than an I/O bus like SCSI. Each IDE controller can support up to two disks, one the master disk and the
other the slave disk. The master and slave functions are usually set by jumpers on the disk. The first IDE controller in
the system is known as the primary IDE controller, the next the secondary controller and so on. IDE can manage about
3.3 Mbytes per second of data transfer to or from the disk and the maximum IDE disk size is 538Mbytes. Extended
IDE, or EIDE, has raised the disk size to a maximum of 8.6 Gbytes and the data transfer rate up to 16.6 Mbytes per
second. IDE and EIDE disks are cheaper than SCSI disks and most modern PCs contain one or more on board IDE
controllers.
Linux names IDE disks in the order in which it finds their controllers. The master disk on the primary controller is
/dev/hda and the slave disk is /dev/hdb. /dev/hdc is the master disk on the secondary IDE controller. The IDE
subsystem registers IDE controllers and not disks with the Linux kernel. The major identifier for the primary IDE
controller is 3 and is 22 for the secondary IDE controller. This means that if a system has two IDE controllers there will
be entries for the IDE subsystem at indices at 3 and 22 in the blk_dev and blkdevs vectors. The block special files
for IDE disks reflect this numbering, disks /dev/hda and /dev/hdb, both connected to the primary IDE controller,
have a major identifier of 3. Any file or buffer cache operations for the IDE subsystem operations on these block
special files will be directed to the IDE subsystem as the kernel uses the major identifier as an index. When the request
is made, it is up to the IDE subsystem to work out which IDE disk the request is for. To do this the IDE subsystem uses
the minor device number from the device special identifier, this contains information that allows it to direct the request
to the correct partition of the correct disk. The device identifier for /dev/hdb, the slave IDE drive on the primary
IDE controller is (3,64). The device identifier for the first partition of that disk (/dev/hdb1) is (3,65).
8.5.2 Initializing the IDE Subsystem

IDE disks have been around for much of the IBM PC's history. Throughout this time the interface to these devices has
changed. This makes the initialization of the IDE subsystem more complex than it might at first appear.
The maximum number of IDE controllers that Linux can support is 4. Each controller is represented by an
ide_hwif_t data structure in the ide_hwifs vector. Each ide_hwif_t data structure contains two
ide_drive_t data structures, one per possible supported master and slave IDE drive. During the initializing of the
IDE subsystem, Linux first looks to see if there is information about the disks present in the system's CMOS memory.
This is battery backed memory that does not lose its contents when the PC is powered off. This CMOS memory is
actually in the system's real time clock device which always runs no matter if your PC is on or off. The CMOS memory
locations are set up by the system's BIOS and tell Linux what IDE controllers and drives have been found. Linux
retrieves the found disk's geometry from BIOS and uses the information to set up the ide_hwif_t data structure for
this drive. More modern PCs use PCI chipsets such as Intel's 82430 VX chipset which includes a PCI EIDE controller.
The IDE subsystem uses PCI BIOS callbacks to locate the PCI (E)IDE controllers in the system. It then calls PCI
specific interrogation routines for those chipsets that are present.
Once each IDE interface or controller has been discovered, its ide_hwif_t is set up to reflect the controllers and
attached disks. During operation the IDE driver writes commands to IDE command registers that exist in the I/O
memory space. The default I/O address for the primary IDE controller's control and status registers is 0x1F0 - 0x1F7.
These addresses were set by convention in the early days of the IBM PC. The IDE driver registers each controller with
the Linux block buffer cache and VFS, adding it to the blk_dev and blkdevs vectors respectively. The IDE drive
will also request control of the appropriate interrupt. Again these interrupts are set by convention to be 14 for the
primary IDE controller and 15 for the secondary IDE controller. However, they like all IDE details, can be overridden

by command line options to the kernel. The IDE driver also adds a gendisk entry into the list of gendisk's
discovered during boot for each IDE controller found. This list will later be used to discover the partition tables of all
of the hard disks found at boot time. The partition checking code understands that IDE controllers may each control
two IDE disks.
8.5.3 SCSI Disks

The SCSI (Small Computer System Interface) bus is an efficient peer-to-peer data bus that supports up to eight devices
per bus, including one or more hosts. Each device has to have a unique identifier and this is usually set by jumpers on
the disks. Data can be transfered synchronously or asynchronously between any two devices on the bus and with 32 bit
wide data transfers up to 40 Mbytes per second are possible. The SCSI bus transfers both data and state information
between devices, and a single transaction between an initiator and a target can involve up to eight distinct phases. You
can tell the current phase of a SCSI bus from five signals from the bus. The eight phases are:
BUS FREE
No device has control of the bus and there are no transactions currently happening,
ARBITRATION
A SCSI device has attempted to get control of the SCSI bus, it does this by asserting its SCSI identifer onto the
address pins. The highest number SCSI identifier wins.
SELECTION
When a device has succeeded in getting control of the SCSI bus through arbitration it must now signal the target
of this SCSI request that it wants to send a command to it. It does this by asserting the SCSI identifier of the
target on the address pins.
RESELECTION
SCSI devices may disconnect during the processing of a request. The target may then reselect the initiator. Not
all SCSI devices support this phase.
COMMAND
6,10 or 12 bytes of command can be transfered from the initiator to the target,
DATA IN, DATA OUT
During these phases data is transfered between the initiator and the target,
STATUS
This phase is entered after completion of all commands and allows the target to send a status byte indicating
success or failure to the initiator,
MESSAGE IN, MESSAGE OUT
Additional information is transfered between the initiator and the target.
The Linux SCSI subsystem is made up of two basic elements, each of which is represented by data structures:
host
A SCSI host is a physical piece of hardware, a SCSI controller. The NCR810 PCI SCSI controller is an example
of a SCSI host. If a Linux system has more than one SCSI controller of the same type, each instance will be
represented by a separate SCSI host. This means that a SCSI device driver may control more than one instance
of its controller. SCSI hosts are almost always the initiator of SCSI commands.
Device
The most common set of SCSI device is a SCSI disk but the SCSI standard supports several more types; tape,
CD-ROM and also a generic SCSI device. SCSI devices are almost always the targets of SCSI commands. These
devices must be treated differently, for example with removable media such as CD-ROMs or tapes, Linux needs
to detect if the media was removed. The different disk types have different major device numbers, allowing
Linux to direct block device requests to the appropriate SCSI type.

Initializing the SCSI Subsystem
Initializing the SCSI subsystem is quite complex, reflecting the dynamic nature of SCSI buses and their devices. Linux
initializes the SCSI subsystem at boot time; it finds the SCSI controllers (known as SCSI hosts) in the system and then
probes each of their SCSI buses finding all of their devices. It then initializes those devices and makes them available
to the rest of the Linux kernel via the normal file and buffer cache block device operations. This initialization is done in
four phases:
First, Linux finds out which of the SCSI host adapters, or controllers, that were built into the kernel at kernel build time
have hardware to control. Each built in SCSI host has a Scsi_Host_Template entry in the
builtin_scsi_hosts vector The Scsi_Host_Template data structure contains pointers to routines that carry
out SCSI host specific actions such as detecting what SCSI devices are attached to this SCSI host. These routines are
called by the SCSI subsystem as it configures itself and they are part of the SCSI device driver supporting this host
type. Each detected SCSI host, those for which there are real SCSI devices attached, has its Scsi_Host_Template
data structure added to the scsi_hosts list of active SCSI hosts. Each instance of a detected host type is represented
by a Scsi_Host data structure held in the scsi_hostlist list. For example a system with two NCR810 PCI
SCSI controllers would have two Scsi_Host entries in the list, one per controller. Each Scsi_Host points at the
Scsi_Host_Template representing its device driver.

Figure 8.4: SCSI Data Structures
Now that every SCSI host has been discovered, the SCSI subsystem must find out what SCSI devices are attached to
each host's bus. SCSI devices are numbered between 0 and 7 inclusively, each device's number or SCSI identifier being
unique on the SCSI bus to which it is attached. SCSI identifiers are usually set by jumpers on the device. The SCSI
initialization code finds each SCSI device on a SCSI bus by sending it a TEST_UNIT_READY command. When a
device responds, its identification is read by sending it an ENQUIRY command. This gives Linux the vendor's name
and the device's model and revision names. SCSI commands are represented by a Scsi_Cmnd data structure and these
are passed to the device driver for this SCSI host by calling the device driver routines within its
Scsi_Host_Template data structure. Every SCSI device that is found is represented by a Scsi_Device data
structure, each of which points to its parent Scsi_Host. All of the Scsi_Device data structures are added to the
scsi_devices list. Figure 8.4 shows how the main data structures relate to one another.
There are four SCSI device types: disk, tape, CD and generic. Each of these SCSI types are individually registered with
the kernel as different major block device types. However they will only register themselves if one or more of a given
SCSI device type has been found. Each SCSI type, for example SCSI disk, maintains its own tables of devices. It uses
these tables to direct kernel block operations (file or buffer cache) to the correct device driver or SCSI host. Each SCSI
type is represented by a Scsi_Device_Template data structure. This contains information about this type of SCSI
device and the addresses of routines to perform various tasks. The SCSI subsystem uses these templates to call the
SCSI type routines for each type of SCSI device. In other words, if the SCSI subsystem wishes to attach a SCSI disk
device it will call the SCSI disk type attach routine. The Scsi_Type_Template data structures are added to the
scsi_devicelist list if one or more SCSI devices of that type have been detected.
The final phase of the SCSI subsystem initialization is to call the finish functions for each registered
Scsi_Device_Template. For the SCSI disk type this spins up all of the SCSI disks that were found and then
records their disk geometry. It also adds the gendisk data structure representing all SCSI disks to the linked list of
disks shown in Figure 8.3.
Delivering Block Device Requests
Once Linux has initialized the SCSI subsystem, the SCSI devices may be used. Each active SCSI device type registers
itself with the kernel so that Linux can direct block device requests to it. There can be buffer cache requests via
blk_dev or file operations via blkdevs. Taking a SCSI disk driver that has one or more EXT2 filesystem partitions
as an example, how do kernel buffer requests get directed to the right SCSI disk when one of its EXT2 partitions is
mounted?
Each request to read or write a block of data to or from a SCSI disk partition results in a new request structure being
added to the SCSI disks current_request list in the blk_dev vector. If the request list is being processed, the
buffer cache need not do anything else; otherwise it must nudge the SCSI disk subsystem to go and process its request
queue. Each SCSI disk in the system is represented by a Scsi_Disk data structure. These are kept in the
rscsi_disks vector that is indexed using part of the SCSI disk partition's minor device number. For exmaple,
/dev/sdb1 has a major number of 8 and a minor number of 17; this generates an index of 1. Each Scsi_Disk data
structure contains a pointer to the Scsi_Device data structure representing this device. That in turn points at the
Scsi_Host data structure which `òwns'' it. The request data structures from the buffer cache are translated into
Scsi_Cmd structures describing the SCSI command that needs to be sent to the SCSI device and this is queued onto
the Scsi_Host structure representing this device. These will be processed by the individual SCSI device driver once
the appropriate data blocks have been read or written.

8.6 Network Devices
A network device is, so far as Linux's network subsystem is concerned, an entity that sends and receives packets of
data. This is normally a physical device such as an ethernet card. Some network devices though are software only such
as the loopback device which is used for sending data to yourself. Each network device is represented by a device
data structure. Network device drivers register the devices that they control with Linux during network initialization at
kernel boot time. The device data structure contains information about the device and the addresses of functions that
allow the various supported network protocols to use the device's services. These functions are mostly concerned with
transmitting data using the network device. The device uses standard networking support mechanisms to pass received
data up to the appropriate protocol layer. All network data (packets) transmitted and received are represented by
sk_buff data structures, these are flexible data structures that allow network protocol headers to be easily added and
removed. How the network protocol layers use the network devices, how they pass data back and forth using
sk_buff data structures is described in detail in the Networks chapter (Chapter networks-chapter). This chapter
concentrates on the device data structure and on how network devices are discovered and initialized.
The device data structure contains information about the network device:
Name
Unlike block and character devices which have their device special files created using the mknod command,
network device special files appear spontaniously as the system's network devices are discovered and initialized.
Their names are standard, each name representing the type of device that it is. Multiple devices of the same type
are numbered upwards from 0. Thus the ethernet devices are known as /dev/eth0,/dev/eth1,/dev/eth2
and so on. Some common network devices are:
/dev/ethN Ethernet devices
/dev/slN SLIP devices
/dev/pppN PPP devices
/dev/lo Loopback devices
Bus Information
This is information that the device driver needs in order to control the device. The irq number is the interrupt that
this device is using. The base address is the address of any of the device's control and status registers in I/O
memory. The DMA channel is the DMA channel number that this network device is using. All of this
information is set at boot time as the device is initialized.
Interface Flags
These describe the characteristics and abilities of the network device:
IFF_UP Interface is up and running,
IFF_BROADCAST Broadcast address in device is valid
IFF_DEBUG Device debugging turned on
IFF_LOOPBACK This is a loopback device
IFF_POINTTOPOINT This is point to point link (SLIP and PPP)
IFF_NOTRAILERS No network trailers
IFF_RUNNING Resources allocated
IFF_NOARP Does not support ARP protocol
IFF_PROMISC Device in promiscuous receive mode, it will receive
all packets no matter who they are addressed to
IFF_ALLMULTI Receive all IP multicast frames
IFF_MULTICAST Can receive IP multicast frames
Protocol Information

Each device describes how it may be used by the network protocool layers:
mtu
The size of the largest packet that this network can transmit not including any link layer headers that it
needs to add. This maximum is used by the protocol layers, for example IP, to select suitable packet sizes
to send.
Family
The family indicates the protocol family that the device can support. The family for all Linux network
devices is AF_INET, the Internet address family.
Type
The hardware interface type describes the media that this network device is attached to. There are many
different types of media that Linux network devices support. These include Ethernet, X.25, Token Ring,
Slip, PPP and Apple Localtalk.
Addresses
The device data structure holds a number of addresses that are relevent to this network device, including
its IP addresses.
Packet Queue
This is the queue of sk_buff packets queued waiting to be transmitted on this network device,
Support Functions
Each device provides a standard set of routines that protocol layers call as part of their interface to this device's
link layer. These include setup and frame transmit routines as well as routines to add standard frame headers and
collect statistics. These statistics can be seen using the ifconfig command.
8.6.1 Initializing Network Devices

Network device drivers can, like other Linux device drivers, be built into the Linux kernel. Each potential network
device is represented by a device data structure within the network device list pointed at by dev_base list pointer.
The network layers call one of a number of network device service routines whose addresses are held in the device
data structure if they need device specific work performing. Initially though, each device data structure holds only
the address of an initialization or probe routine.
There are two problems to be solved for network device drivers. Firstly, not all of the network device drivers built into
the Linux kernel will have devices to control. Secondly, the ethernet devices in the system are always called
/dev/eth0, /dev/eth1 and so on, no matter what their underlying device drivers are. The problem of ``missing''
network devices is easily solved. As the initialization routine for each network device is called, it returns a status
indicating whether or not it located an instance of the controller that it is driving. If the driver could not find any
devices, its entry in the device list pointed at by dev_base is removed. If the driver could find a device it fills out
the rest of the device data structure with information about the device and the addresses of the support functions
within the network device driver.
The second problem, that of dynamically assigning ethernet devices to the standard /dev/ethN device special files is
solved more elegantly. There are eight standard entries in the devices list; one for eth0, eth1 and so on to eth7. The
initialization routine is the same for all of them, it tries each ethernet device driver built into the kernel in turn until one
finds a device. When the driver finds its ethernet device it fills out the ethN device data structure, which it now
owns. It is also at this time that the network device driver initializes the physical hardware that it is controlling and
works out which IRQ it is using, which DMA channel (if any) and so on. A driver may find several instances of the
network device that it is controlling and, in this case, it will take over several of the /dev/ethN device data
structures. Once all eight standard /dev/ethN have been allocated, no more ethernet devices will be probed for.


Chapter 9
The File system
This chapter describes how the Linux kernel maintains the files in the file systems that it supports. It describes
the Virtual File System (VFS) and explains how the Linux kernel's real file systems are supported.
One of the most important features of Linux is its support for many different file systems. This makes it very
flexible and well able to coexist with many other operating systems. At the time of writing, Linux supports 15 file
systems; ext, ext2, xia, minix, umsdos, msdos, vfat, proc, smb, ncp, iso9660, sysv, hpfs, affs
and ufs, and no doubt, over time more will be added.
In Linux, as it is for Unix TM, the separate file systems the system may use are not accessed by device identifiers
(such as a drive number or a drive name) but instead they are combined into a single hierarchical tree structure
that represents the file system as one whole single entity. Linux adds each new file system into this single file
system tree as it is mounted. All file systems, of whatever type, are mounted onto a directory and the files of the
mounted file system cover up the existing contents of that directory. This directory is known as the mount
directory or mount point. When the file system is unmounted, the mount directory's own files are once again
revealed.
When disks are initialized (using fdisk, say) they have a partition structure imposed on them that divides the
physical disk into a number of logical partitions. Each partition may hold a single file system, for example an
EXT2 file system. File systems organize files into logical hierarchical structures with directories, soft links and so
on held in blocks on physical devices. Devices that can contain file systems are known as block devices. The IDE
disk partition /dev/hda1, the first partition of the first IDE disk drive in the system, is a block device. The
Linux file systems regard these block devices as simply linear collections of blocks, they do not know or care
about the underlying physical disk's geometry. It is the task of each block device driver to map a request to read a
particular block of its device into terms meaningful to its device; the particular track, sector and cylinder of its
hard disk where the block is kept. A file system has to look, feel and operate in the same way no matter what
device is holding it. Moreover, using Linux's file systems, it does not matter (at least to the system user) that
these different file systems are on different physical media controlled by different hardware controllers. The file
system might not even be on the local system, it could just as well be a disk remotely mounted over a network
link. Consider the following example where a Linux system has its root file system on a SCSI disk:
A E boot etc lib opt tmp usr

C F cdrom fd proc root var sbin
D bin dev home mnt lost+found
Neither the users nor the programs that operate on the files themselves need know that /C is in fact a mounted
VFAT file system that is on the first IDE disk in the system. In the example (which is actually my home Linux
system), /E is the master IDE disk on the second IDE controller. It does not matter either that the first IDE
http://ldp.iol.it/LDP/tlk/fs/filesystem.html (1 di 20) [08/03/2001 10.08.58]

controller is a PCI controller and that the second is an ISA controller which also controls the IDE CDROM. I can
dial into the network where I work using a modem and the PPP network protocol using a modem and in this case
I can remotely mount my Alpha AXP Linux system's file systems on /mnt/remote.
The files in a file system are collections of data; the file holding the sources to this chapter is an ASCII file called
filesystems.tex. A file system not only holds the data that is contained within the files of the file system
but also the structure of the file system. It holds all of the information that Linux users and processes see as files,
directories soft links, file protection information and so on. Moreover it must hold that information safely and
securely, the basic integrity of the operating system depends on its file systems. Nobody would use an operating
system that randomly lost data and files1.
Minix, the first file system that Linux had is rather restrictive and lacking in performance.
Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames) and the maximum file
size is 64MBytes. 64Mbytes might at first glance seem large enough but large file sizes are necessary to hold
even modest databases. The first file system designed specifically for Linux, the Extended File system, or EXT,
was introduced in April 1992 and cured a lot of the problems but it was still felt to lack performance.
So, in 1993, the Second Extended File system, or EXT2, was added.
It is this file system that is described in detail later on in this chapter.
An important development took place when the EXT file system was added into Linux. The real file systems
were separated from the operating system and system services by an interface layer known as the Virtual File
system, or VFS.
VFS allows Linux to support many, often very different, file systems, each presenting a common software
interface to the VFS. All of the details of the Linux file systems are translated by software so that all file systems
appear identical to the rest of the Linux kernel and to programs running in the system. Linux's Virtual File system
layer allows you to transparently mount the many different file systems at the same time.
The Linux Virtual File system is implemented so that access to its files is as fast and efficient as possible. It must
also make sure controllfiles and their data are kept correctly. These two requirements can be ontodds with each
other. The Linux VFS caches information in memory from each file system as it is mounted and used. A lot of
care must be taken to update the file system correctly as data within these caches is modified aslfiles and
directories are created, written to and deleted. If you could see the file system's data structures within the running
kernel, you would be oble to see data blocks being read and written by the file system. Data structures, describing
rollfiles and directories being accessed would be created and destroyed and all rolltime the device drivers would
be working away, fetching and saving data. The most important of these caches is the Buffer Cache, which is
integrated into the way controllindividual file systems access their underlying block devices. As blocks are
accessed they are put into the Buffer Cache and kept on various queues depending on their states. The Buffer
Cache not only caches data buffers, it also helps manage the asynchronous interface with the block device
drivers.
9.1 The Second Extended File system (EXT2)

Figure 9.1: Physical Layout of the EXT2 File system
The Second Extended File system was devised (by Rémy Card) as an extensible and powerful file system for
Linux. It is also the most successful file system so far in the Linux community and is the basis for all of the
currently shipping Linux distributions.
The EXT2 file system, like a lot of the file systems, is built on the premise that the data held in files is kept in
data blocks. These data blocks are all of the same length and, although that length can vary between different
EXT2 file systems the block size of a particular EXT2 file system is set when it is created (using mke2fs). Every
file's size is rounded up to an integral number of blocks. If the block size is 1024 bytes, then a file of 1025 bytes
will occupy two 1024 byte blocks. Unfortunately this means that on average you waste half a block per file.
Usually in computing you trade off CPU usage for memory and disk space utilisation. In this case Linux, along
with most operating systems, trades off a relatively inefficient disk usage in order to reduce the workload on the
CPU. Not all of the blocks in the file system hold data, some must be used to contain the information that
describes the structure of the file system. EXT2 defines the file system topology by describing each file in the
system with an inode data structure. An inode describes which blocks the data within a file occupies as well as
the access rights of the file, the file's modification times and the type of the file. Every file in the EXT2 file
system is described by a single inode and each inode has a single unique number identifying it. The inodes for the
file system are all kept together in inode tables. EXT2 directories are simply special files (themselves described
by inodes) which contain pointers to the inodes of their directory entries.
Figure 9.1 shows the layout of the EXT2 file system as occupying a series of blocks in a block structured device.
So far as each file system is concerned, block devices are just a series of blocks that can be read and written. A
file system does not need to concern itself with where on the physical media a block should be put, that is the job
of the device's driver. Whenever a file system needs to read information or data from the block device containing
it, it requests that its supporting device driver reads an integral number of blocks. The EXT2 file system divides
the logical partition that it occupies into Block Groups.
Each group duplicates information critical to the integrity of the file system as well as holding real files and
directories as blocks of information and data. This duplication is neccessary should a disaster occur and the file
system need recovering. The subsections describe in more detail the contents of each Block Group.

9.1.1 The EXT2 Inode
Figure 9.2: EXT2 Inode

In the EXT2 file system, the inode is the basic building block; every file and directory in the file system is
described by one and only one inode. The EXT2 inodes for each Block Group are kept in the inode table together
with a bitmap that allows the system to keep track of allocated and unallocated inodes. Figure 9.2 shows the
format of an EXT2 inode, amongst other information, it contains the following fields:
mode
This holds two pieces of information; what this inode describes and the permissions that users have to it.
For EXT2, an inode can describe one of file, directory, symbolic link, block device, character device or
FIFO.
Owner Information
The user and group identifiers of the owners of this file or directory. This allows the file system to
correctly allow the right sort of accesses,
Size
The size of the file in bytes,
Timestamps
The time that the inode was created and the last time that it was modified,

Datablocks
Pointers to the blocks that contain the data that this inode is describing. The first twelve are pointers to the
physical blocks containing the data described by this inode and the last three pointers contain more and
more levels of indirection. For example, the double indirect blocks pointer points at a block of pointers to
blocks of pointers to data blocks. This means that files less than or equal to twelve data blocks in length are
more quickly accessed than larger files.
You should note that EXT2 inodes can describe special device files. These are not real files but handles that
programs can use to access devices. All of the device files in /dev are there to allow programs to access Linux's
devices. For example the mount program takes as an argument the device file that it wishes to mount.
9.1.2 The EXT2 Superblock

The Superblock contains a description of the basic size and shape of this file system. The information within it
allows the file system manager to use and maintain the file system. Usually only the Superblock in Block Group
0 is read when the file system is mounted but each Block Group contains a duplicate copy in case of file system
corruption. Amongst other information it holds the:
Magic Number
This allows the mounting software to check that this is indeed the Superblock for an EXT2 file system. For
the current version of EXT2 this is 0xEF53.
Revision Level
The major and minor revision levels allow the mounting code to determine whether or not this file system
supports features that are only available in particular revisions of the file system. There are also feature
compatibility fields which help the mounting code to determine which new features can safely be used on
this file system,
Mount Count and Maximum Mount Count
Together these allow the system to determine if the file system should be fully checked. The mount count
is incremented each time the file system is mounted and when it equals the maximum mount count the
warning message ``maximal mount count reached, running e2fsck is recommended'' is displayed,
Block Group Number
The Block Group number that holds this copy of the Superblock,
Block Size
The size of the block for this file system in bytes, for example 1024 bytes,
Blocks per Group
The number of blocks in a group. Like the block size this is fixed when the file system is created,
Free Blocks
The number of free blocks in the file system,
Free Inodes
The number of free Inodes in the file system,
First Inode
This is the inode number of the first inode in the file system. The first inode in an EXT2 root file system
would be the directory entry for the '/' directory.

9.1.3 The EXT2 Group Descriptor
Each Block Group has a data structure describing it. Like the Superblock, all the group descriptors for all of the
Block Groups are duplicated in each Block Group in case of file system corruption.
Each Group Descriptor contains the following information:
Blocks Bitmap
The block number of the block allocation bitmap for this Block Group. This is used during block allocation
and deallocation,
Inode Bitmap
The block number of the inode allocation bitmap for this Block Group. This is used during inode allocation
and deallocation,
Inode Table
The block number of the starting block for the inode table for this Block Group. Each inode is represented
by the EXT2 inode data structure described below.
Free blocks count, Free Inodes count, Used directory count
The group descriptors are placed on after another and together they make the group descriptor table. Each Blocks
Group contains the entire table of group descriptors after its copy of the Superblock. Only the first copy (in Block
Group 0) is actually used by the EXT2 file system. The other copies are there, like the copies of the Superblock,
in case the main copy is corrupted.
9.1.4 EXT2 Directories

Figure 9.3: EXT2 Directory
In the EXT2 file system, directories are special files that are used to create and hold access paths to the files in
the file system. Figure 9.3 shows the layout of a directory entry in memory.
A directory file is a list of directory entries, each one containing the following information:
inode
The inode for this directory entry. This is an index into the array of inodes held in the Inode Table of the
Block Group. In figure 9.3, the directory entry for the file called file has a reference to inode number
i1,
name length
The length of this directory entry in bytes,
name
The name of this directory entry.
The first two entries for every directory are always the standard ``.'' and ``..'' entries meaning ``this directory'' and
``the parent directory'' respectively.

9.1.5 Finding a File in an EXT2 File System
A Linux filename has the same format as all Unix TM filenames have. It is a series of directory names separated
by forward slashes (``/'') and ending in the file's name. One example filename would be
/home/rusling/.cshrc where /home and /rusling are directory names and the file's name is .cshrc.
Like all other Unix TM systems, Linux does not care about the format of the filename itself; it can be any length
and consist of any of the printable characters. To find the inode representing this file within an EXT2 file system
the system must parse the filename a directory at a time until we get to the file itself.
The first inode we need is the inode for the root of the file system and we find its number in the file system's
superblock. To read an EXT2 inode we must look for it in the inode table of the appropriate Block Group. If, for
example, the root inode number is 42, then we need the 42nd inode from the inode table of Block Group 0. The
root inode is for an EXT2 directory, in other words the mode of the root inode describes it as a directory and it's
data blocks contain EXT2 directory entries.
home is just one of the many directory entries and this directory entry gives us the number of the inode
describing the /home directory. We have to read this directory (by first reading its inode and then reading the
directory entries from the data blocks described by its inode) to find the rusling entry which gives us the
number of the inode describing the /home/rusling directory. Finally we read the directory entries pointed at
by the inode describing the /home/rusling directory to find the inode number of the .cshrc file and from
this we get the data blocks containing the information in the file.
9.1.6 Changing the Size of a File in an EXT2 File System

One common problem with a file system is its tendency to fragment. The blocks that hold the file's data get
spread all over the file system and this makes sequentially accessing the data blocks of a file more and more
inefficient the further apart the data blocks are. The EXT2 file system tries to overcome this by allocating the
new blocks for a file physically close to its current data blocks or at least in the same Block Group as its current
data blocks. Only when this fails does it allocate data blocks in another Block Group.
Whenever a process attempts to write data into a file the Linux file system checks to see if the data has gone off
the end of the file's last allocated block. If it has, then it must allocate a new data block for this file. Until the
allocation is complete, the process cannot run; it must wait for the file system to allocate a new data block and
write the rest of the data to it before it can continue. The first thing that the EXT2 block allocation routines do is
to lock the EXT2 Superblock for this file system. Allocating and deallocating changes fields within the
superblock, and the Linux file system cannot allow more than one process to do this at the same time. If another
process needs to allocate more data blocks, it will have to wait until this process has finished. Processes waiting
for the superblock are suspended, unable to run, until control of the superblock is relinquished by its current user.
Access to the superblock is granted on a first come, first served basis and once a process has control of the
superblock, it keeps control until it has finished. Having locked the superblock, the process checks that there are
enough free blocks left in this file system. If there are not enough free blocks, then this attempt to allocate more
will fail and the process will relinquish control of this file system's superblock.
If there are enough free blocks in the file system, the process tries to allocate one.
If the EXT2 file system has been built to preallocate data blocks then we may be able to take one of those. The
preallocated blocks do not actually exist, they are just reserved within the allocated block bitmap. The VFS inode
representing the file that we are trying to allocate a new data block for has two EXT2 specific fields,
prealloc_block and prealloc_count, which are the block number of the first preallocated data block
and how many of them there are, respectively. If there were no preallocated blocks or block preallocation is not
enabled, the EXT2 file system must allocate a new block. The EXT2 file system first looks to see if the data

block after the last data block in the file is free. Logically, this is the most efficient block to allocate as it makes
sequential accesses much quicker. If this block is not free, then the search widens and it looks for a data block
within 64 blocks of the of the ideal block. This block, although not ideal is at least fairly close and within the
same Block Group as the other data blocks belonging to this file.
If even that block is not free, the process starts looking in all of the other Block Groups in turn until it finds some
free blocks. The block allocation code looks for a cluster of eight free data blocks somewhere in one of the Block
Groups. If it cannot find eight together, it will settle for less. If block preallocation is wanted and enabled it will
update prealloc_block and prealloc_count accordingly.
Wherever it finds the free block, the block allocation code updates the Block Group's block bitmap and allocates
a data buffer in the buffer cache. That data buffer is uniquely identified by the file system's supporting device
identifier and the block number of the allocated block. The data in the buffer is zero'd and the buffer is marked as
``dirty'' to show that it's contents have not been written to the physical disk. Finally, the superblock itself is
marked as ``dirty'' to show that it has been changed and it is unlocked. If there were any processes waiting for the
superblock, the first one in the queue is allowed to run again and will gain exclusive control of the superblock for
its file operations. The process's data is written to the new data block and, if that data block is filled, the entire
process is repeated and another data block allocated.
9.2 The Virtual File System (VFS)

Figure 9.4: A Logical Diagram of the Virtual File System
Figure 9.4 shows the relationship between the Linux kernel's Virtual File System and it's real file systems. The
virtual file system must manage all of the different file systems that are mounted at any given time. To do this it
maintains data structures that describe the whole (virtual) file system and the real, mounted, file systems.
Rather confusingly, the VFS describes the system's files in terms of superblocks and inodes in much the same
way as the EXT2 file system uses superblocks and inodes. Like the EXT2 inodes, the VFS inodes describe files
and directories within the system; the contents and topology of the Virtual File System. From now on, to avoid
confusion, I will write about VFS inodes and VFS superblocks to distinquish them from EXT2 inodes and
superblocks.
As each file system is initialised, it registers itself with the VFS. This happens as the operating system initialises
itself at system boot time. The real file systems are either built into the kernel itself or are built as loadable
modules. File System modules are loaded as the system needs them, so, for example, if the VFAT file system is
implemented as a kernel module, then it is only loaded when a VFAT file system is mounted. When a block
device based file system is mounted, and this includes the root file system, the VFS must read its superblock.
Each file system type's superblock read routine must work out the file system's topology and map that
information onto a VFS superblock data structure. The VFS keeps a list of the mounted file systems in the system
together with their VFS superblocks. Each VFS superblock contains information and pointers to routines that
perform particular functions. So, for example, the superblock representing a mounted EXT2 file system contains
a pointer to the EXT2 specific inode reading routine. This EXT2 inode read routine, like all of the file system
specific inode read routines, fills out the fields in a VFS inode. Each VFS superblock contains a pointer to the
first VFS inode on the file system. For the root file system, this is the inode that represents the ``/'' directory.
This mapping of information is very efficient for the EXT2 file system but moderately less so for other file
systems.
As the system's processes access directories and files, system routines are called that traverse the VFS inodes in
the system.
For example, typing ls for a directory or cat for a file cause the the Virtual File System to search through the
VFS inodes that represent the file system. As every file and directory on the system is represented by a VFS
inode, then a number of inodes will be being repeatedly accessed. These inodes are kept in the inode cache which
makes access to them quicker. If an inode is not in the inode cache, then a file system specific routine must be
called in order to read the appropriate inode. The action of reading the inode causes it to be put into the inode
cache and further accesses to the inode keep it in the cache. The less used VFS inodes get removed from the
cache.
All of the Linux file systems use a common buffer cache to cache data buffers from the underlying devices to
help speed up access by all of the file systems to the physical devices holding the file systems.
This buffer cache is independent of the file systems and is integrated into the mechanisms that the Linux kernel
uses to allocate and read and write data buffers. It has the distinct advantage of making the Linux file systems
independent from the underlying media and from the device drivers that support them. All block structured
devices register themselves with the Linux kernel and present a uniform, block based, usually asynchronous
interface. Even relatively complex block devices such as SCSI devices do this. As the real file systems read data
from the underlying physical disks, this results in requests to the block device drivers to read physical blocks
from the device that they control. Integrated into this block device interface is the buffer cache. As blocks are
read by the file systems they are saved in the global buffer cache shared by all of the file systems and the Linux

kernel. Buffers within it are identified by their block number and a unique identifier for the device that read it.
So, if the same data is needed often, it will be retrieved from the buffer cache rather than read from the disk,
which would take somewhat longer. Some devices support read ahead where data blocks are speculatively read
just in case they are needed.
The VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be quickly
found.
As an experiment, try listing a directory that you have not listed recently. The first time you list it, you may
notice a slight pause but the second time you list its contents the result is immediate. The directory cache does not
store the inodes for the directories itself; these should be in the inode cache, the directory cache simply stores the
mapping between the full directory names and their inode numbers.
9.2.1 The VFS Superblock

Every mounted file system is represented by a VFS superblock; amongst other information, the VFS superblock
contains the:
Device
This is the device identifier for the block device that this file system is contained in. For example,
/dev/hda1, the first IDE hard disk in the system has a device identifier of 0x301,
Inode pointers
The mounted inode pointer points at the first inode in this file system. The covered inode pointer
points at the inode representing the directory that this file system is mounted on. The root file system's
VFS superblock does not have a covered pointer,
Blocksize
The block size in bytes of this file system, for example 1024 bytes,
Superblock operations
A pointer to a set of superblock routines for this file system. Amongst other things, these routines are used
by the VFS to read and write inodes and superblocks.
File System type
A pointer to the mounted file system's file_system_type data structure,
File System specific
A pointer to information needed by this file system,
9.2.2 The VFS Inode

Like the EXT2 file system, every file, directory and so on in the VFS is represented by one and only one VFS
inode.
The information in each VFS inode is built from information in the underlying file system by file system specific
routines. VFS inodes exist only in the kernel's memory and are kept in the VFS inode cache as long as they are
useful to the system. Amongst other information, VFS inodes contain the following fields:
device
This is the device identifer of the device holding the file or whatever that this VFS inode represents,
inode number
This is the number of the inode and is unique within this file system. The combination of device and

inode number is unique within the Virtual File System,
mode
Like EXT2 this field describes what this VFS inode represents as well as access rights to it,
user ids
The owner identifiers,
times
The creation, modification and write times,
block size
The size of a block for this file in bytes, for example 1024 bytes,
inode operations
A pointer to a block of routine addresses. These routines are specific to the file system and they perform
operations for this inode, for example, truncate the file that is represented by this inode.
count
The number of system components currently using this VFS inode. A count of zero means that the inode is
free to be discarded or reused,
lock
This field is used to lock the VFS inode, for example, when it is being read from the file system,
dirty
Indicates whether this VFS inode has been written to, if so the underlying file system will need modifying,
file system specific information
9.2.3 Registering the File Systems
Figure 9.5: Registered File Systems

When you build the Linux kernel you are asked if you want each of the supported file systems. When the kernel
is built, the file system startup code contains calls to the initialisation routines of all of the built in file systems.
Linux file systems may also be built as modules and, in this case, they may be demand loaded as they are needed
or loaded by hand using insmod. Whenever a file system module is loaded it registers itself with the kernel and
unregisters itself when it is unloaded. Each file system's initialisation routine registers itself with the Virtual File
System and is represented by a file_system_type data structure which contains the name of the file system
and a pointer to its VFS superblock read routine. Figure 9.5 shows that the file_system_type data
structures are put into a list pointed at by the file_systems pointer. Each file_system_type data
structure contains the following information:
Superblock read routine

This routine is called by the VFS when an instance of the file system is mounted,
File System name
The name of this file system, for example ext2,
Device needed
Does this file system need a device to support? Not all file system need a device to hold them. The /proc
file system, for example, does not require a block device,
You can see which file systems are registered by looking in at /proc/filesystems. For example:
ext2
nodev proc
iso9660
9.2.4 Mounting a File System

When the superuser attempts to mount a file system, the Linux kernel must first validate the arguments passed in
the system call. Although mount does some basic checking, it does not know which file systems this kernel has
been built to support or that the proposed mount point actually exists. Consider the following mount command:
$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom

This mount command will pass the kernel three pieces of information; the name of the file system, the physical
block device that contains the file system and, thirdly, where in the existing file system topology the new file
system is to be mounted.
The first thing that the Virtual File System must do is to find the file system.
To do this it searches through the list of known file systems by looking at each file_system_type data
structure in the list pointed at by file_systems.
If it finds a matching name it now knows that this file system type is supported by this kernel and it has the
address of the file system specific routine for reading this file system's superblock. If it cannot find a matching
file system name then all is not lost if the kernel is built to demand load kernel modules (see Chapter
modules-chapter). In this case the kernel will request that the kernel daemon loads the appropriate file system
module before continuing as before.
Next if the physical device passed by mount is not already mounted, it must find the VFS inode of the directory
that is to be the new file system's mount point. This VFS inode may be in the inode cache or it might have to be
read from the block device supporting the file system of the mount point. Once the inode has been found it is
checked to see that it is a directory and that there is not already some other file system mounted there. The same
directory cannot be used as a mount point for more than one file system.
At this point the VFS mount code must allocate a VFS superblock and pass it the mount information to the
superblock read routine for this file system. All of the system's VFS superblocks are kept in the super_blocks
vector of super_block data structures and one must be allocated for this mount. The superblock read routine
must fill out the VFS superblock fields based on information that it reads from the physical device. For the EXT2
file system this mapping or translation of information is quite easy, it simply reads the EXT2 superblock and fills
out the VFS superblock from there. For other file systems, such as the MS DOS file system, it is not quite such an
easy task. Whatever the file system, filling out the VFS superblock means that the file system must read whatever
describes it from the block device that supports it. If the block device cannot be read from or if it does not contain

this type of file system then the mount command will fail.
Figure 9.6: A Mounted File System

Each mounted file system is described by a vfsmount data structure; see figure 9.6. These are queued on a list
pointed at by vfsmntlist.
Another pointer, vfsmnttail points at the last entry in the list and the mru_vfsmnt pointer points at the
most recently used file system. Each vfsmount structure contains the device number of the block device
holding the file system, the directory where this file system is mounted and a pointer to the VFS superblock
allocated when this file system was mounted. In turn the VFS superblock points at the file_system_type
data structure for this sort of file system and to the root inode for this file system. This inode is kept resident in
the VFS inode cache all of the time that this file system is loaded.
9.2.5 Finding a File in the Virtual File System

To find the VFS inode of a file in the Virtual File System, VFS must resolve the name a directory at a time,
looking up the VFS inode representing each of the intermediate directories in the name. Each directory lookup
involves calling the file system specific lookup whose address is held in the VFS inode representing the parent
directory. This works because we always have the VFS inode of the root of each file system available and
pointed at by the VFS superblock for that system. Each time an inode is looked up by the real file system it
checks the directory cache for the directory. If there is no entry in the directory cache, the real file system gets the
VFS inode either from the underlying file system or from the inode cache.

9.2.6 Creating a File in the Virtual File System
9.2.7 Unmounting a File System
The workshop manual for my MG usually describes assembly as the reverse of disassembly and the reverse is
more or less true for unmounting a file system.
A file system cannot be unmounted if something in the system is using one of its files. So, for example, you
cannot umount /mnt/cdrom if a process is using that directory or any of its children. If anything is using the
file system to be unmounted there may be VFS inodes from it in the VFS inode cache, and the code checks for
this by looking through the list of inodes looking for inodes owned by the device that this file system occupies. If
the VFS superblock for the mounted file system is dirty, that is it has been modified, then it must be written back
to the file system on disk. Once it has been written to disk, the memory occupied by the VFS superblock is
returned to the kernel's free pool of memory. Finally the vfsmount data structure for this mount is unlinked
from vfsmntlist and freed.
9.2.8 The VFS Inode Cache

As the mounted file systems are navigated, their VFS inodes are being continually read and, in some cases,
written. The Virtual File System maintains an inode cache to speed up accesses to all of the mounted file systems.
Every time a VFS inode is read from the inode cache the system saves an access to a physical device.
The VFS inode cache is implmented as a hash table whose entries are pointers to lists of VFS inodes that have the
same hash value. The hash value of an inode is calculated from its inode number and from the device identifier
for the underlying physical device containing the file system. Whenever the Virtual File System needs to access
an inode, it first looks in the VFS inode cache. To find an inode in the cache, the system first calculates its hash
value and then uses it as an index into the inode hash table. This gives it a pointer to a list of inodes with the same
hash value. It then reads each inode in turn until it finds one with both the same inode number and the same
device identifier as the one that it is searching for.
If it can find the inode in the cache, its count is incremented to show that it has another user and the file system
access continues. Otherwise a free VFS inode must be found so that the file system can read the inode from
memory. VFS has a number of choices about how to get a free inode. If the system may allocate more VFS
inodes then this is what it does; it allocates kernel pages and breaks them up into new, free, inodes and puts them
into the inode list. All of the system's VFS inodes are in a list pointed at by first_inode as well as in the
inode hash table. If the system already has all of the inodes that it is allowed to have, it must find an inode that is
a good candidate to be reused. Good candidates are inodes with a usage count of zero; this indicates that the
system is not currently using them. Really important VFS inodes, for example the root inodes of file systems
always have a usage count greater than zero and so are never candidates for reuse. Once a candidate for reuse has
been located it is cleaned up. The VFS inode might be dirty and in this case it needs to be written back to the file
system or it might be locked and in this case the system must wait for it to be unlocked before continuing. The
candidate VFS inode must be cleaned up before it can be reused.
However the new VFS inode is found, a file system specific routine must be called to fill it out from information
read from the underlying real file system. Whilst it is being filled out, the new VFS inode has a usage count of
one and is locked so that nothing else accesses it until it contains valid information.
To get the VFS inode that is actually needed, the file system may need to access several other inodes. This
happens when you read a directory; only the inode for the final directory is needed but the inodes for the
intermediate directories must also be read. As the VFS inode cache is used and filled up, the less used inodes will
be discarded and the more used inodes will remain in the cache.

9.2.9 The Directory Cache
To speed up accesses to commonly used directories, the VFS maintains a cache of directory entries.
As directories are looked up by the real file systems their details are added into the directory cache. The next time
the same directory is looked up, for example to list it or open a file within it, then it will be found in the directory
cache. Only short directory entries (up to 15 characters long) are cached but this is reasonable as the shorter
directory names are the most commonly used ones. For example, /usr/X11R6/bin is very commonly
accessed when the X server is running.
The directory cache consists of a hash table, each entry of which points at a list of directory cache entries that
have the same hash value. The hash function uses the device number of the device holding the file system and the
directory's name to calculate the offset, or index, into the hash table. It allows cached directory entries to be
quickly found. It is no use having a cache when lookups within the cache take too long to find entries, or even not
to find them.
In an effort to keep the caches valid and up to date the VFS keeps lists of Least Recently Used (LRU) directory
cache entries. When a directory entry is first put into the cache, which is when it is first looked up, it is added
onto the end of the first level LRU list. In a full cache this will displace an existing entry from the front of the
LRU list. As the directory entry is accessed again it is promoted to the back of the second LRU cache list. Again,
this may displace a cached level two directory entry at the front of the level two LRU cache list. This displacing
of entries at the front of the level one and level two LRU lists is fine. The only reason that entries are at the front
of the lists is that they have not been recently accessed. If they had, they would be nearer the back of the lists.
The entries in the second level LRU cache list are safer than entries in the level one LRU cache list. This is the
intention as these entries have not only been looked up but also they have been repeatedly referenced.
REVIEW NOTE: Do we need a diagram for this?
9.3 The Buffer Cache

Figure 9.7: The Buffer Cache
As the mounted file systems are used they generate a lot of requests to the block devices to read and write data
blocks. All block data read and write requests are given to the device drivers in the form of buffer_head data
structures via standard kernel routine calls. These give all of the information that the block device drivers need;
the device identifier uniquely identifies the device and the block number tells the driver which block to read. All
block devices are viewed as linear collections of blocks of the same size. To speed up access to the physical
block devices, Linux maintains a cache of block buffers. All of the block buffers in the system are kept
somewhere in this buffer cache, even the new, unused buffers. This cache is shared between all of the physical
block devices; at any one time there are many block buffers in the cache, belonging to any one of the system's
block devices and often in many different states. If valid data is available from the buffer cache this saves the
system an access to a physical device. Any block buffer that has been used to read data from a block device or to
write data to it goes into the buffer cache. Over time it may be removed from the cache to make way for a more
deserving buffer or it may remain in the cache as it is frequently accessed.
Block buffers within the cache are uniquely identfied by the owning device identifier and the block number of the
buffer. The buffer cache is composed of two functional parts. The first part is the lists of free block buffers. There
is one list per supported buffer size and the system's free block buffers are queued onto these lists when they are
first created or when they have been discarded. The currently supported buffer sizes are 512, 1024, 2048, 4096
and 8192 bytes. The second functional part is the cache itself. This is a hash table which is a vector of pointers to
chains of buffers that have the same hash index. The hash index is generated from the owning device identifier
and the block number of the data block. Figure 9.7 shows the hash table together with a few entries. Block
buffers are either in one of the free lists or they are in the buffer cache. When they are in the buffer cache they are
also queued onto Least Recently Used (LRU) lists. There is an LRU list for each buffer type and these are used
by the system to perform work on buffers of a type, for example, writing buffers with new data in them out to

disk. The buffer's type reflects its state and Linux currently supports the following types:
clean
Unused, new buffers,
locked
Buffers that are locked, waiting to be written,
dirty
Dirty buffers. These contain new, valid data, and will be written but so far have not been scheduled to
write,
shared
Shared buffers,
unshared
Buffers that were once shared but which are now not shared,
Whenever a file system needs to read a buffer from its underlying physical device, it trys to get a block from the
buffer cache. If it cannot get a buffer from the buffer cache, then it will get a clean one from the appropriate sized
free list and this new buffer will go into the buffer cache. If the buffer that it needed is in the buffer cache, then it
may or may not be up to date. If it is not up to date or if it is a new block buffer, the file system must request that
the device driver read the appropriate block of data from the disk.
Like all caches, the buffer cache must be maintained so that it runs efficiently and fairly allocates cache entries
between the block devices using the buffer cache. Linux uses the bdflush
kernel daemon to perform a lot of housekeeping duties on the cache but some happen automatically as a result of
the cache being used.
9.3.1 The bdflush Kernel Daemon

The bdflush kernel daemon is a simple kernel daemon that provides a dynamic response to the system having
too many dirty buffers; buffers that contain data that must be written out to disk at some time. It is started as a
kernel thread at system startup time and, rather confusingly, it calls itself ``kflushd'' and that is the name that you
will see if you use the ps command to show the processes in the system. Mostly this daemon sleeps waiting for
the number of dirty buffers in the system to grow too large. As buffers are allocated and discarded the number of
dirty buffers in the system is checked. If there are too many as a percentage of the total number of buffers in the
system then bdflush is woken up. The default threshold is 60% but, if the system is desperate for buffers,
bdflush will be woken up anyway. This value can be seen and changed using the update command:
# update -d
bdflush version 1.4

0: 60 Max fraction of LRU list to examine for dirty blocks
1: 500 Max number of dirty blocks to write each time bdflush activated
2: 64 Num of clean buffers to be loaded onto free list by refill_freelist
3: 256 Dirty block threshold for activating bdflush in refill_freelist
4: 15 Percentage of cache to scan for free clusters
5: 3000 Time for data buffers to age before flushing
6: 500 Time for non-data (dir, bitmap, etc) buffers to age before flushing

7: 1884 Time buffer cache load average constant
8: 2 LAV ratio (used to determine threshold for buffer fratricide).
All of the dirty buffers are linked into the BUF_DIRTY LRU list whenever they are made dirty by having data
written to them and bdflush tries to write a reasonable number of them out to their owning disks. Again this
number can be seen and controlled by the update command and the default is 500 (see above).
9.3.2 The update Process

The update command is more than just a command; it is also a daemon. When run as superuser (during system
initialisation) it will periodically flush all of the older dirty buffers out to disk. It does this by calling a system
service routine
that does more or less the same thing as bdflush. Whenever a dirty buffer is finished with, it is tagged with the
system time that it should be written out to its owning disk. Every time that update runs it looks at all of the dirty
buffers in the system looking for ones with an expired flush time. Every expired buffer is written out to disk.
9.4 The /proc File System

The /proc file system really shows the power of the Linux Virtual File System. It does not really exist (yet
another of Linux's conjuring tricks), neither the /proc directory nor its subdirectories and its files actually exist.
So how can you cat /proc/devices? The /proc file system, like a real file system, registers itself with the
Virtual File System. However, when the VFS makes calls to it requesting inodes as its files and directories are
opened, the /proc file system creates those files and directories from information within the kernel. For
example, the kernel's /proc/devices file is generated from the kernel's data structures describing its devices.
The /proc file system presents a user readable window into the kernel's inner workings. Several Linux
subsystems, such as Linux kernel modules described in chapter modules-chapter, create entries in the the /proc
file system.
9.5 Device Special Files

Linux, like all versions of Unix TM presents its hardware devices as special files. So, for example, /dev/null is the
null device. A device file does not use any data space in the file system, it is only an access point to the device
driver. The EXT2 file system and the Linux VFS both implement device files as special types of inode. There are
two types of device file; character and block special files. Within the kernel itself, the device drivers implement
file semantices: you can open them, close them and so on. Character devices allow I/O operations in character
mode and block devices require that all I/O is via the buffer cache. When an I/O request is made to a device file,
it is forwarded to the appropriate device driver within the system. Often this is not a real device driver but a
pseudo-device driver for some subsystem such as the SCSI device driver layer. Device files are referenced by a
major number, which identifies the device type, and a minor type, which identifies the unit, or instance of that
major type. For example, the IDE disks on the first IDE controller in the system have a major number of 3 and
the first partition of an IDE disk would have a minor number of 1. So, ls -l of /dev/hda1 gives:
$ brw-rw---- 1 root disk 3, 1 Nov 24 15:09 /dev/hda1

Within the kernel, every device is uniquely described by a kdev_t data type, this is two bytes long, the first byte
containing the minor device number and the second byte holding the major device number.

The IDE device above is held within the kernel as 0x0301. An EXT2 inode that represents a block or character
device keeps the device's major and minor numbers in its first direct block pointer. When it is read by the VFS,
the VFS inode data structure representing it has its i_rdev field set to the correct device identifier.
Footnotes:
1Well, not knowingly, although I have been bitten by operating systems with more lawyers than Linux has
developers


Chapter 11
Kernel Mechanisms
This chapter describes some of the general tasks and mechanisms that the Linux kernel needs to supply
so that other parts of the kernel work effectively together.
11.1 Bottom Half Handling
Figure 11.1: Bottom Half Handling Data Structures

There are often times in a kernel when you do not want to do work at this moment. A good example of
this is during interrupt processing. When the interrupt was asserted, the processor stopped what it was
doing and the operating system delivered the interrupt to the appropriate device driver. Device drivers
should not spend too much time handling interrupts as, during this time, nothing else in the system can
run. There is often some work that could just as well be done later on. Linux's bottom half handlers were
invented so that device drivers and other parts of the Linux kernel could queue work to be done later on.
Figure 11.1 shows the kernel data structures associated with bottom half handling.
There can be up to 32 different bottom half handlers; bh_base is a vector of pointers to each of the
http://ldp.iol.it/LDP/tlk/kernel/kernel.html (1 di 7) [08/03/2001 10.09.00]

kernel's bottom half handling routines. bh_active and bh_mask have their bits set according to what
handlers have been installed and are active. If bit N of bh_mask is set then the Nth element of
bh_base contains the address of a bottom half routine. If bit N of bh_active is set then the N'th
bottom half handler routine should be called as soon as the scheduler deems reasonable. These indices
are statically defined; the timer bottom half handler is the highest priority (index 0), the console bottom
half handler is next in priority (index 1) and so on. Typically the bottom half handling routines have lists
of tasks associated with them. For example, the immediate bottom half handler works its way through the
immediate tasks queue (tq_immediate) which contains tasks that need to be performed immediately.
Some of the kernel's bottom half handers are device specific, but others are more generic:
TIMER
This handler is marked as active each time the system's periodic timer interrupts and is used to
drive the kernel's timer queue mechanisms,
CONSOLE
This handler is used to process console messages,
TQUEUE
This handler is used to process tty messages,
NET
This handler handles general network processing,
IMMEDIATE
This is a generic handler used by several device drivers to queue work to be done later.
Whenever a device driver, or some other part of the kernel, needs to schedule work to be done later, it
adds work to the appropriate system queue, for example the timer queue, and then signals the kernel that
some bottom half handling needs to be done. It does this by setting the appropriate bit in bh_active.
Bit 8 is set if the driver has queued something on the immediate queue and wishes the immediate bottom
half handler to run and process it. The bh_active bitmask is checked at the end of each system call,
just before control is returned to the calling process. If it has any bits set, the bottom half handler routines
that are active are called. Bit 0 is checked first, then 1 and so on until bit 31.
The bit in bh_active is cleared as each bottom half handling routine is called. bh_active is
transient; it only has meaning between calls to the scheduler and is a way of not calling bottom half
handling routines when there is no work for them to do.
11.2 Task Queues

Figure 11.2: A Task Queue
Task queues are the kernel's way of deferring work until later. Linux has a generic mechanism for
queuing work on queues and for processing them later.
Task queues are often used in conjunction with bottom half handlers; the timer task queue is processed
when the timer queue bottom half handler runs. A task queue is a simple data structure, see figure 11.2
which consists of a singly linked list of tq_struct data structures each of which contains the address
of a routine and a pointer to some data.
The routine will be called when the element on the task queue is processed and it will be passed a pointer
to the data.
Anything in the kernel, for example a device driver, can create and use task queues but there are three
task queues created and managed by the kernel:
timer
This queue is used to queue work that will be done as soon after the next system clock tick as is
possible. Each clock tick, this queue is checked to see if it contains any entries and, if it does, the
timer queue bottom half handler is made active. The timer queue bottom half handler is processed,
along with all the other bottom half handlers, when the scheduler next runs. This queue should not
be confused with system timers, which are a much more sophisticated mechanism.
immediate
This queue is also processed when the scheduler processes the active bottom half handlers. The
immediate bottom half handler is not as high in priority as the timer queue bottom half handler and
so these tasks will be run later.
scheduler
This task queue is processed directly by the scheduler. It is used to support other task queues in the
system and, in this case, the task to be run will be a routine that processes a task queue, say for a
device driver.
When task queues are processed, the pointer to the first element in the queue is removed from the queue
and replaced with a null pointer. In fact, this removal is an atomic operation, one that cannot be
interrupted. Then each element in the queue has its handling routine called in turn. The elements in the
queue are often statically allocated data. However there is no inherent mechanism for discarding
allocated memory. The task queue processing routine simply moves onto the next element in the list. It is
the job of the task itself to ensure that it properly cleans up any allocated kernel memory.
11.3 Timers

Figure 11.3: System Timers
An operating system needs to be able to schedule an activity sometime in the future. A mechanism is
needed whereby activities can be scheduled to run at some relatively precise time. Any microprocessor
that wishes to support an operating system must have a programmable interval timer that periodically
interrupts the processor. This periodic interrupt is known as a system clock tick and it acts like a
metronome, orchestrating the system's activities.
Linux has a very simple view of what time it is; it measures time in clock ticks since the system booted.
All system times are based on this measurement, which is known as jiffies after the globally
available variable of the same name.
Linux has two types of system timers, both queue routines to be called at some system time but they are
slightly different in their implementations. Figure 11.3 shows both mechanisms.
The first, the old timer mechanism, has a static array of 32 pointers to timer_struct data structures
and a mask of active timers, timer_active.
Where the timers go in the timer table is statically defined (rather like the bottom half handler table
bh_base). Entries are added into this table mostly at system initialization time. The second, newer,
mechanism uses a linked list of timer_list data structures held in ascending expiry time order.

Both methods use the time in jiffies as an expiry time so that a timer that wished to run in 5s would
have to convert 5s to units of jiffies and add that to the current system time to get the system time in
jiffies when the timer should expire. Every system clock tick the timer bottom half handler is
marked as active so that the when the scheduler next runs, the timer queues will be processed. The timer
bottom half handler processes both types of system timer. For the old system timers the
timer_active bit mask is check for bits that are set.
If the expiry time for an active timer has expired (expiry time is less than the current system jiffies),
its timer routine is called and its active bit is cleared. For new system timers, the entries in the linked list
of timer_list data structures are checked.
Every expired timer is removed from the list and its routine is called. The new timer mechanism has the
advantage of being able to pass an argument to the timer routine.
11.4 Wait Queues

There are many times when a process must wait for a system resource. For example a process may need
the VFS inode describing a directory in the file system and that inode may not be in the buffer cache. In
this case the process must wait for that inode to be fetched from the physical media containing the file
system before it can carry on.
wait_queue
*task
*next
Figure 11.4: Wait Queue

The Linux kernel uses a simple data structure, a wait queue (see figure 11.4),
which consists of a pointer to the processes task_struct and a pointer to the next element in the wait
queue.
When processes are added to the end of a wait queue they can either be interruptible or uninterruptible.
Interruptible processes may be interrupted by events such as timers expiring or signals being delivered
whilst they are waiting on a wait queue. The waiting processes state will reflect this and either be
INTERRUPTIBLE or UNINTERRUPTIBLE. As this process can not now continue to run, the scheduler
is run and, when it selects a new process to run, the waiting process will be suspended. 1
When the wait queue is processed, the state of every process in the wait queue is set to RUNNING. If the
process has been removed from the run queue, it is put back onto the run queue. The next time the
scheduler runs, the processes that are on the wait queue are now candidates to be run as they are now no
longer waiting. When a process on the wait queue is scheduled the first thing that it will do is remove
itself from the wait queue. Wait queues can be used to synchronize access to system resources and they
are used by Linux in its implementation of semaphores (see below).

11.5 Buzz Locks
These are better known as spin locks and they are a primitive way of protecting a data structure or piece
of code. They only allow one process at a time to be within a critical region of code. They are used in
Linux to restrict access to fields in data structures, using a single integer field as a lock. Each process
wishing to enter the region attempts to change the lock's initial value from 0 to 1. If its current value is 1,
the process tries again, spinning in a tight loop of code. The access to the memory location holding the
lock must be atomic, the action of reading its value, checking that it is 0 and then changing it to 1 cannot
be interrupted by any other process. Most CPU architectures provide support for this via special
instructions but you can also implement buzz locks using uncached main memory.
When the owning process leaves the critical region of code it decrements the buzz lock, returning its
value to 0. Any processes spinning on the lock will now read it as 0, the first one to do this will
increment it to 1 and enter the critical region.
11.6 Semaphores
Semaphores are used to protect critical regions of code or data structures. Remember that each access of
a critical piece of data such as a VFS inode describing a directory is made by kernel code running on
behalf of a process. It would be very dangerous to allow one process to alter a critical data structure that
is being used by another process. One way to achieve this would be to use a buzz lock around the critical
piece of data is being accessed but this is a simplistic approach that would not give very good system
performance. Instead Linux uses semaphores to allow just one process at a time to access critical regions
of code and data; all other processes wishing to access this resource will be made to wait until it becomes
free. The waiting processes are suspended, other processes in the system can continue to run as normal.
A Linux semaphore data structure contains the following information:
count
This field keeps track of the count of processes wishing to use this resource. A positive value
means that the resource is available. A negative or zero value means that processes are waiting for
it. An initial value of 1 means that one and only one process at a time can use this resource. When
processes want this resource they decrement the count and when they have finished with this
resource they increment the count,
waking
This is the count of processes waiting for this resource which is also the number of process waiting
to be woken up when this resource becomes free,
wait queue
When processes are waiting for this resource they are put onto this wait queue,
lock
A buzz lock used when accessing the waking field.
Suppose the initial count for a semaphore is 1, the first process to come along will see that the count is
positive and decrement it by 1, making it 0. The process now `òwns'' the critical piece of code or

resource that is being protected by the semaphore. When the process leaves the critical region it
increments the semphore's count. The most optimal case is where there are no other processes contending
for ownership of the critical region. Linux has implemented semaphores to work efficiently for this, the
most common, case.
If another process wishes to enter the critical region whilst it is owned by a process it too will decrement
the count. As the count is now negative (-1) the process cannot enter the critical region. Instead it must
wait until the owning process exits it. Linux makes the waiting process sleep until the owning process
wakes it on exiting the critical region. The waiting process adds itself to the semaphore's wait queue and
sits in a loop checking the value of the waking field and calling the scheduler until waking is
non-zero.
The owner of the critical region increments the semaphore's count and if it is less than or equal to zero
then there are processes sleeping, waiting for this resource. In the optimal case the semaphore's count
would have been returned to its initial value of 1 and no further work would be neccessary. The owning
process increments the waking counter and wakes up the process sleeping on the semaphore's wait queue.
When the waiting process wakes up, the waking counter is now 1 and it knows that it it may now enter
the critical region. It decrements the waking counter, returning it to a value of zero, and continues. All
access to the waking field of semaphore are protected by a buzz lock using the semaphore's lock.
Footnotes:
1 REVIEW NOTE: What is to stop a task in state INTERRUPTIBLE being made to run the next time
the scheduler runs? Processes in a wait queue should never run until they are woken up.


Chapter 12
Modules
This chapter describes how the Linux kernel can dynamically load functions, for example filesystems,
only when they are needed.
Linux is a monolithic kernel; that is, it is one, single, large program where all the functional components
of the kernel have access to all of its internal data structures and routines. The alternative is to have a
micro-kernel structure where the functional pieces of the kernel are broken out into separate units with
strict communication mechanisms between them. This makes adding new components into the kernel via
the configuration process rather time consuming. Say you wanted to use a SCSI driver for an NCR 810
SCSI and you had not built it into the kernel. You would have to configure and then build a new kernel
before you could use the NCR 810. There is an alternative, Linux allows you to dynamically load and
unload components of the operating system as you need them. Linux modules are lumps of code that can
be dynamically linked into the kernel at any point after the system has booted. They can be unlinked
from the kernel and removed when they are no longer needed. Mostly Linux kernel modules are device
drivers, pseudo-device drivers such as network drivers, or file-systems.
You can either load and unload Linux kernel modules explicitly using the insmod and rmmod
commands or the kernel itself can demand that the kernel daemon (kerneld) loads and unloads the
modules as they are needed.
Dynamically loading code as it is needed is attractive as it keeps the kernel size to a minimum and makes
the kernel very flexible. My current Intel kernel uses modules extensively and is only 406Kbytes long. I
only occasionally use VFAT file systems and so I build my Linux kernel to automatically load the VFAT
file system module as I mount a VFAT partition. When I have unmounted the VFAT partition the system
detects that I no longer need the VFAT file system module and removes it from the system. Modules can
also be useful for trying out new kernel code without having to rebuild and reboot the kernel every time
you try it out. Nothing, though, is for free and there is a slight performance and memory penalty
associated with kernel modules. There is a little more code that a loadable module must provide and this
and the extra data structures take a little more memory. There is also a level of indirection introduced that
makes accesses of kernel resources slightly less efficient for modules.
Once a Linux module has been loaded it is as much a part of the kernel as any normal kernel code. It has
the same rights and responsibilities as any kernel code; in other words, Linux kernel modules can crash
the kernel just like all kernel code or device drivers can.
http://ldp.iol.it/LDP/tlk/modules/modules.html (1 di 6) [08/03/2001 10.09.02]

So that modules can use the kernel resources that they need, they must be able to find them. Say a
module needs to call kmalloc(), the kernel memory allocation routine. At the time that it is built, a
module does not know where in memory kmalloc() is, so when the module is loaded, the kernel must
fix up all of the module's references to kmalloc() before the module can work. The kernel keeps a list
of all of the kernel's resources in the kernel symbol table so that it can resolve references to those
resources from the modules as they are loaded. Linux allows module stacking, this is where one module
requires the services of another module. For example, the VFAT file system module requires the services
of the FAT file system module as the VFAT file system is more or less a set of extensions to the FAT file
system. One module requiring services or resources from another module is very similar to the situation
where a module requires services and resources from the kernel itself. Only here the required services are
in another, previously loaded module. As each module is loaded, the kernel modifies the kernel symbol
table, adding to it all of the resources or symbols exported by the newly loaded module. This means that,
when the next module is loaded, it has access to the services of the already loaded modules.
When an attempt is made to unload a module, the kernel needs to know that the module is unused and it
needs some way of notifying the module that it is about to be unloaded. That way the module will be able
to free up any system resources that it has allocated, for example kernel memory or interrupts, before it is
removed from the kernel. When the module is unloaded, the kernel removes any symbols that that
module exported into the kernel symbol table.
Apart from the ability of a loaded module to crash the operating system by being badly written, it
presents another danger. What happens if you load a module built for an earlier or later kernel than the
one that you are now running? This may cause a problem if, say, the module makes a call to a kernel
routine and supplies the wrong arguments. The kernel can optionally protect against this by making
rigorous version checks on the module as it is loaded.
12.1 Loading a Module

Figure 12.1: The List of Kernel Modules
There are two ways that a kernel module can be loaded. The first way is to use the insmod command to
manually insert the it into the kernel.
The second, and much more clever way, is to load the module as it is needed; this is known as demand
loading.
When the kernel discovers the need for a module, for example when the user mounts a file system that is
not in the kernel, the kernel will request that the kernel daemon (kerneld) attempts to load the
appropriate module.
The kernel daemon is a normal user process albeit with super user privileges. When it is started up,
usually at system boot time, it opens up an Inter-Process Communication (IPC) channel to the kernel.
This link is used by the kernel to send messages to the kerneld asking for various tasks to be
performed.

Kerneld's major function is to load and unload kernel modules but it is also capable of other tasks such
as starting up the PPP link over serial line when it is needed and closing it down when it is not.
Kerneld does not perform these tasks itself, it runs the neccessary programs such as insmod to do the
work. Kerneld is just an agent of the kernel, scheduling work on its behalf.
The insmod utility must find the requested kernel module that it is to load. Demand loaded kernel
modules are normally kept in /lib/modules/kernel-version. The kernel modules are linked
object files just like other programs in the system except that they are linked as a relocatable images.
That is, images that are not linked to run from a particular address. They can be either a.out or elf
format object files. insmod makes a privileged system call to find the kernel's exported symbols.
These are kept in pairs containing the symbol's name and its value, for example its address. The kernel's
exported symbol table is held in the first module data structure in the list of modules maintained by the
kernel and pointed at by the module_list pointer.
Only specifically entered symbols are added into the table, which is built when the kernel is compiled
and linked, not every symbol in the kernel is exported to its modules. An example symbol is
``request_irq'' which is the kernel routine that must be called when a driver wishes to take
control of a particular system interrupt. In my current kernel, this has a value of 0x0010cd30. You can
easily see the exported kernel symbols and their values by looking at /proc/ksyms or by using the
ksyms utility. The ksyms utility can either show you all of the exported kernel symbols or only those
symbols exported by loaded modules. insmod reads the module into its virtual memory and fixes up its
unresolved references to kernel routines and resources using the exported symbols from the kernel. This
fixing up takes the form of patching the module image in memory. insmod physically writes the address
of the symbol into the appropriate place in the module.
When insmod has fixed up the module's references to exported kernel symbols, it asks the kernel for
enough space to hold the new kernel, again using a privileged system call. The kernel allocates a new
module data structure and enough kernel memory to hold the new module and puts it at the end of the
kernel modules list. The new module is marked as UNINITIALIZED.
Figure 12.1 shows the list of kernel modules after two modules, VFAT and VFAT have been loaded into
the kernel. Not shown in the diagram is the first module on the list, which is a pseudo-module that is only
there to hold the kernel's exported symbol table. You can use the command lsmod to list all of the loaded
kernel modules and their interdependencies. lsmod simply reformats /proc/modules which is built
from the list of kernel module data structures. The memory that the kernel allocates for it is mapped
into the insmod process's address space so that it can access it. insmod copies the module into the
allocated space and relocates it so that it will run from the kernel address that it has been allocated. This
must happen as the module cannot expect to be loaded at the same address twice let alone into the same
address in two different Linux systems. Again, this relocation involves patching the module image with
the appropriate addresses.
The new module also exports symbols to the kernel and insmod builds a table of these exported images.
Every kernel module must contain module initialization and module cleanup routines and these symbols
are deliberately not exported but insmod must know the addresses of them so that it can pass them to the
kernel. All being well, insmod is now ready to initialize the module and it makes a privileged system
call passing the kernel the addresses of the module's initialization and cleanup routines.

When a new module is added into the kernel, it must update the kernel's set of symbols and modify the
modules that are being used by the new module. Modules that have other modules dependent on them
must maintain a list of references at the end of their symbol table and pointed at by their module data
structure. Figure 12.1 shows that the VFAT file system module is dependent on the FAT file system
module. So, the FAT module contains a reference to the VFAT module; the reference was added when the
VFAT module was loaded. The kernel calls the modules initialization routine and, if it is successful it
carries on installing the module. The module's cleanup routine address is stored in it's module data
structure and it will be called by the kernel when that module is unloaded. Finally, the module's state is
set to RUNNING.
12.2 Unloading a Module

Modules can be removed using the rmmod command but demand loaded modules are automatically
removed from the system by kerneld when they are no longer being used. Every time its idle timer
expires, kerneld makes a system call requesting that all unused demand loaded modules are removed
from the system. The timer's value is set when you start kerneld; my kerneld checks every 180
seconds. So, for example, if you mount an iso9660 CD ROM and your iso9660 filesystem is a
loadable module, then shortly after the CD ROM is unmounted, the iso9660 module will be removed
from the kernel.
A module cannot be unloaded so long as other components of the kernel are depending on it. For
example, you cannot unload the VFAT module if you have one or more VFAT file systems mounted. If
you look at the output of lsmod, you will see that each module has a count associated with it. For
example:
Module: #pages: Used by:

msdos 5 1
vfat 4 1 (autoclean)
fat 6 [vfat msdos] 2 (autoclean)
The count is the number of kernel entities that are dependent on this module. In the above example, the
vfat and msdos modules are both dependent on the fat module and so it has a count of 2. Both the
vfat and msdos modules have 1 dependent, which is a mounted file system. If I were to load another
VFAT file system then the vfat module's count would become 2. A module's count is held in the first
longword of its image.
This field is slightly overloaded as it also holds the AUTOCLEAN and VISITED flags. Both of these
flags are used for demand loaded modules. These modules are marked as AUTOCLEAN so that the system
can recognize which ones it may automatically unload. The VISITED flag marks the module as in use
by one or more other system components; it is set whenever another component makes use of the
module. Each time the system is asked by kerneld to remove unused demand loaded modules it looks
through all of the modules in the system for likely candidates. It only looks at modules marked as
AUTOCLEAN and in the state RUNNING. If the candidate has its VISITED flag cleared then it will
remove the module, otherwise it will clear the VISITED flag and go on to look at the next module in the
system.

Assuming that a module can be unloaded, its cleanup routine is called to allow it to free up the kernel
resources that it has allocated.
The module data structure is marked as DELETED and it is unlinked from the list of kernel modules.
Any other modules that it is dependent on have their reference lists modified so that they no longer have
it as a dependent. All of the kernel memory that the module needed is deallocated.


Chapter 13
Processors
Linux runs on a number of processors; this chapter gives a brief outline of each of them.
13.1 X86
TBD
13.2 ARM
The ARM processor implements a low power, high performance 32 bit RISC architecture. It is being
widely used in embedded devices such as mobile phones and PDAs (Personal Data Assistants). It has 31
32 bit registers with 16 visible in any mode. Its instructions are simple load and store instructions (load a
value from memory, perform an operation and store the result back into memory). One interesting feature
it has is that every instruction is conditional. For example, you can test the value of a register and, until
you next test for the same condition, you can conditionally execute instructions as and when you like.
Another interesting feature is that you can perform arithmetic and shift operations on values as you load
them. It operates in several modes, including a system mode that can be entered from user mode via a
SWI (software interrupt).
It is a synthasisable core and ARM (the company) does not itself manufacture processors. Instead the
ARM partners (companies such as Intel or LSI for example) implement the ARM architecture in silicon.
It allows other processors to be tightly coupled via a co-processor interface and it has several memory
management unit variations. These range from simple memory protection schemes to complex page
hierarchies.
13.3 Alpha AXP Processor

The Alpha AXP architecture is a 64-bit load/store RISC architecture designed with speed in mind. All
registers are 64 bits in length; 32 integer registers and 32 floating point registers. Integer register 31 and
floating point register 31 are used for null operations. A read from them generates a zero value and a
write to them has no effect. All instructions are 32 bits long and memory operations are either reads or
http://ldp.iol.it/LDP/tlk/processors/processors.html (1 di 2) [08/03/2001 10.09.03]

writes. The architecture allows different implementations so long as the implementations follow the
architecture.
There are no instructions that operate directly on values stored in memory; all data manipulation is done
between registers. So, if you want to increment a counter in memory, you first read it into a register, then
modify it and write it out. The instructions only interact with each other by one instruction writing to a
register or memory location and another register reading that register or memory location. One
interesting feature of Alpha AXP is that there are instructions that can generate flags, such as testing if
two registers are equal, the result is not stored in a processor status register, but is instead stored in a third
register. This may seem strange at first, but removing this dependency from a status register means that it
is much easier to build a CPU which can issue multiple instructions every cycle. Instructions on
unrelated registers do not have to wait for each other to execute as they would if there were a single
status register. The lack of direct operations on memory and the large number of registers also help issue
multiple instructions.
The Alpha AXP architecture uses a set of subroutines, called privileged architecture library code
(PALcode). PALcode is specific to the operating system, the CPU implementation of the Alpha
AXP architecture and to the system hardware. These subroutines provide operating system primitives for
context switching, interrupts, exceptions and memory management. These subroutines can be invoked by
hardware or by CALL_PAL instructions. PALcode is written in standard Alpha AXP assembler with
some implementation specific extensions to provide direct access to low level hardware functions, for
example internal processor registers. PALcode is executed in PALmode, a privileged mode that stops
some system events happening and allows the PALcode complete control of the physical system
hardware.

http://ldp.iol.it/LDP/tlk/processors/processors.html (2 di 2) [08/03/2001 10.09.03]

Chapter 14
The Linux Kernel Sources
This chapter describes where in the Linux kernel sources you should start looking for particular kernel
functions.
This book does not depend on a knowledge of the 'C' programming language or require that you have the
Linux kernel sources available in order to understand how the Linux kernel works. That said, it is a
fruitful exercise to look at the kernel sources to get an in-depth understanding of the Linux operating
system. This chapter gives an overview of the kernel sources; how they are arranged and where you
might start to look for particular code.
Where to Get The Linux Kernel Sources

All of the major Linux distributions ( Craftworks, Debian, Slackware, Red Hat etcetera) include the
kernel sources in them. Usually the Linux kernel that got installed on your Linux system was built from
those sources. By their very nature these sources tend to be a little out of date so you may want to get the
latest sources from one of the web sites mentioned in chapter www-appendix. They are kept on
ftp://ftp.cs.helsinki.fi and all of the other web sites shadow them. This makes the Helsinki
web site the most up to date, but sites like MIT and Sunsite are never very far behind.
If you do not have access to the web, there are many CD ROM vendors who offer snapshots of the
world's major web sites at a very reasonable cost. Some even offer a subscription service with quarterly
or even monthly updates. Your local Linux User Group is also a good source of sources.
The Linux kernel sources have a very simple numbering system. Any even number kernel (for example
2.0.30) is a stable, released, kernel and any odd numbered kernel (for example 2.1.42 is a
development kernel. This book is based on the stable 2.0.30 source tree. Development kernels have all
of the latest features and support all of the latest devices. Although they can be unstable, which may not
be exactly what you want it, is important that the Linux community tries the latest kernels. That way they
are tested for the whole community. Remember that it is always worth backing up your system
thoroughly if you do try out non-production kernels.
Changes to the kernel sources are distributed as patch files. The patch utility is used to apply a series of
edits to a set of source files. So, for example, if you have the 2.0.29 kernel source tree and you wanted to
move to the 2.0.30 source tree, you would obtain the 2.0.30 patch file and apply the patches (edits) to that
http://ldp.iol.it/LDP/tlk/sources/sources.html (1 di 5) [08/03/2001 10.09.04]

source tree:
$ cd /usr/src/linux
$ patch -p1 < patch-2.0.30
This saves copying whole source trees, perhaps over slow serial connections. A good source of kernel
patches (official and unofficial) is the http://www.linuxhq.com web site.
How The Kernel Sources Are Arranged

At the very top level of the source tree /usr/src/linux you will see a number of directories:
arch
The arch subdirectory contains all of the architecture specific kernel code. It has further
subdirectories, one per supported architecture, for example i386 and alpha.
include
The include subdirectory contains most of the include files needed to build the kernel code. It
too has further subdirectories including one for every architecture supported. The include/asm
subdirectory is a soft link to the real include directory needed for this architecture, for example
include/asm-i386. To change architectures you need to edit the kernel makefile and rerun
the Linux kernel configuration program.
init
This directory contains the initialization code for the kernel and it is a very good place to start
looking at how the kernel works.
mm
This directory contains all of the memory management code. The architecture specific memory
management code lives down in arch/*/mm/, for example arch/i386/mm/fault.c.
drivers
All of the system's device drivers live in this directory. They are further sub-divided into classes of
device driver, for example block.
ipc
This directory contains the kernels inter-process communications code.
modules
This is simply a directory used to hold built modules.
fs
All of the file system code. This is further sub-divided into directories, one per supported file
system, for example vfat and ext2.
kernel
The main kernel code. Again, the architecture specific kernel code is in arch/*/kernel.
net

The kernel's networking code.
lib
This directory contains the kernel's library code. The architecture specific library code can be
found in arch/*/lib/.
scripts
This directory contains the scripts (for example awk and tk scripts) that are used when the kernel
is configured.
Where to Start Looking

A large complex program like the Linux kernel can be rather daunting to look at. It is rather like a large
ball of string with no end showing. Looking at one part of the kernel often leads to looking at several
other related files and before long you have forgotten what you were looking for. The next subsections
give you a hint as to where in the source tree the best place to look is for a given subject.
System Startup and Initialization

On an Intel based system, the kernel starts when either loadlin.exe or LILO has loaded the kernel into
memory and passed control to it. Look in arch/i386/kernel/head.S for this part. Head.S does
some architecture specific setup and then jumps to the main() routine in init/main.c.
Memory Management
This code is mostly in mm but the architecture specific code is in arch/*/mm. The page fault handling
code is in mm/memory.c and the memory mapping and page cache code is in mm/filemap.c. The
buffer cache is implemented in mm/buffer.c and the swap cache in mm/swap_state.c and
mm/swapfile.c.
Kernel
Most of the relevent generic code is in kernel with the architecture specific code in
arch/*/kernel. The scheduler is in kernel/sched.c and the fork code is in kernel/fork.c.
The bottom half handling code is in include/linux/interrupt.h. The task_struct data
structure can be found in include/linux/sched.h.
PCI
The PCI pseudo driver is in drivers/pci/pci.c with the system wide definitions in
include/linux/pci.h. Each architecture has some specific PCI BIOS code, Alpha AXP's is in
arch/alpha/kernel/bios32.c.

Interprocess Communication
This is all in ipc. All System V IPC objects include an ipc_perm data structure and this can be found
in include/linux/ipc.h. System V messages are implemented in ipc/msg.c, shared memory in
ipc/shm.c and semaphores in ipc/sem.c. Pipes are implemented in ipc/pipe.c.
Interrupt Handling
The kernel's interrupt handling code is almost all microprocessor (and often platform) specific. The Intel
interrupt handling code is in arch/i386/kernel/irq.c and its definitions in
include/asm-i386/irq.h.
Device Drivers
Most of the lines of the Linux kernel's source code are in its device drivers. All of Linux's device driver
sources are held in drivers but these are further broken out by type:
/block
block device drivers such as ide (in ide.c). If you want to look at how all of the devices that
could possibly contain file systems are initialized then you should look at device_setup() in
drivers/block/genhd.c. It not only initializes the hard disks but also the network as you
need a network to mount nfs file systems. Block devices include both IDE and SCSI based
devices.
/char
This the place to look for character based devices such as ttys, serial ports and mice.
/cdrom
All of the CDROM code for Linux. It is here that the special CDROM devices (such as
Soundblaster CDROM) can be found. Note that the ide CD driver is ide-cd.c in
drivers/block and that the SCSI CD driver is in scsi.c in drivers/scsi.
/pci
This are the sources for the PCI pseudo-driver. A good place to look at how the PCI subsystem is
mapped and initialized. The Alpha AXP PCI fixup code is also worth looking at in
arch/alpha/kernel/bios32.c.
/scsi
This is where to find all of the SCSI code as well as all of the drivers for the scsi devices supported
by Linux.
/net
This is where to look to find the network device drivers such as the DECChip 21040 PCI ethernet
driver which is in tulip.c.
/sound
This is where all of the sound card drivers are.

File Systems
The sources for the EXT2 file system are all in the fs/ext2/ directory with data structure definitions
in include/linux/ext2_fs.h, ext2_fs_i.h and ext2_fs_sb.h. The Virtual File System
data structures are described in include/linux/fs.h and the code is in fs/*. The buffer cache is
implemented in fs/buffer.c along with the update kernel daemon.
Network
The networking code is kept in net with most of the include files in include/net. The BSD socket
code is in net/socket.c and the IP version 4 INET socket code is in net/ipv4/af_inet.c. The
generic protocol support code (including the sk_buff handling routines) is in net/core with the
TCP/IP networking code in net/ipv4. The network device drivers are in drivers/net.
Modules
The kernel module code is partially in the kernel and partially in the modules package. The kernel code
is all in kernel/modules.c with the data structures and kernel demon kerneld messages in
include/linux/module.h and include/linux/kerneld.h respectively. You may want to
look at the structure of an ELF object file in include/linux/elf.h.


Chapter 4
Processes
This chapter describes what a process is and how the Linux kernel creates, manages and deletes the processes in the system.
Processes carry out tasks within the operating system. A program is a set of machine code instructions and data stored in an executable image
on disk and is, as such, a passive entity; a process can be thought of as a computer program in action.
It is a dynamic entity, constantly changing as the machine code instructions are executed by the processor. As well as the program's
instructions and data, the process also includes the program counter and all of the CPU's registers as well as the process stacks containing
temporary data such as routine parameters, return addresses and saved variables. The current executing program, or process, includes all of
the current activity in the microprocessor. Linux is a multiprocessing operating system. Processes are separate tasks each with their own
rights and responsibilities. If one process crashes it will not cause another process in the system to crash. Each individual process runs in its
own virtual address space and is not capable of interacting with another process except through secure, kernel managed mechanisms.
During the lifetime of a process it will use many system resources. It will use the CPUs in the system to run its instructions and the system's
physical memory to hold it and its data. It will open and use files within the filesystems and may directly or indirectly use the physical
devices in the system. Linux must keep track of the process itself and of the system resources that it has so that it can manage it and the other
processes in the system fairly. It would not be fair to the other processes in the system if one process monopolized most of the system's
physical memory or its CPUs.
The most precious resource in the system is the CPU, usually there is only one. Linux is a multiprocessing operating system, its objective is to
have a process running on each CPU in the system at all times, to maximize CPU utilization. If there are more processes than CPUs (and
there usually are), the rest of the processes must wait before a CPU becomes free until they can be run. Multiprocessing is a simple idea; a
process is executed until it must wait, usually for some system resource; when it has this resource, it may run again. In a uniprocessing
system, for example DOS, the CPU would simply sit idle and the waiting time would be wasted. In a multiprocessing system many processes
are kept in memory at the same time. Whenever a process has to wait the operating system takes the CPU away from that process and gives it
to another, more deserving process. It is the scheduler which chooses which is the most appropriate process to run next and Linux uses a
number of scheduling strategies to ensure fairness.
Linux supports a number of different executable file formats, ELF is one, Java is another and these must be managed transparently as must
the processes use of the system's shared libraries.
4.1 Linux Processes

So that Linux can manage the processes in the system, each process is represented by a task_struct data structure (task and process are
terms that Linux uses interchangeably). The task vector is an array of pointers to every task_struct data structure in the system.
This means that the maximum number of processes in the system is limited by the size of the task vector; by default it has 512 entries. As
processes are created, a new task_struct is allocated from system memory and added into the task vector. To make it easy to find, the
current, running, process is pointed to by the current pointer.
As well as the normal type of process, Linux supports real time processes. These processes have to react very quickly to external events
(hence the term ``real time'') and they are treated differently from normal user processes by the scheduler. Although the task_struct data
structure is quite large and complex, but its fields can be divided into a number of functional areas:
State
As a process executes it changes state according to its circumstances. Linux processes have the following states: 1
Running
The process is either running (it is the current process in the system) or it is ready to run (it is waiting to be assigned to one of the
system's CPUs).
Waiting
The process is waiting for an event or for a resource. Linux differentiates between two types of waiting process; interruptible
http://ldp.iol.it/LDP/tlk/kernel/processes.html (1 di 11) [08/03/2001 10.09.09]

and uninterruptible. Interruptible waiting processes can be interrupted by signals whereas uninterruptible waiting processes are
waiting directly on hardware conditions and cannot be interrupted under any circumstances.
Stopped
The process has been stopped, usually by receiving a signal. A process that is being debugged can be in a stopped state.
Zombie
This is a halted process which, for some reason, still has a task_struct data structure in the task vector. It is what it sounds
like, a dead process.
Scheduling Information
The scheduler needs this information in order to fairly decide which process in the system most deserves to run,
Identifiers
Every process in the system has a process identifier. The process identifier is not an index into the task vector, it is simply a number.
Each process also has User and group identifiers, these are used to control this processes access to the files and devices in the system,
Inter-Process Communication
Linux supports the classic Unix TM IPC mechanisms of signals, pipes and semaphores and also the System V IPC mechanisms of
shared memory, semaphores and message queues. The IPC mechanisms supported by Linux are described in Chapter IPC-chapter.
Links
In a Linux system no process is independent of any other process. Every process in the system, except the initial process has a parent
process. New processes are not created, they are copied, or rather cloned from previous processes. Every task_struct representing
a process keeps pointers to its parent process and to its siblings (those processes with the same parent process) as well as to its own
child processes. You can see the family relationship between the running processes in a Linux system using the pstree command:
init(1)-+-crond(98)
|-emacs(387)
|-gpm(146)
|-inetd(110)
|-kerneld(18)
|-kflushd(2)
|-klogd(87)
|-kswapd(3)
|-login(160)---bash(192)---emacs(225)
|-lpd(121)
|-mingetty(161)
|-mingetty(162)
|-mingetty(163)
|-mingetty(164)
|-login(403)---bash(404)---pstree(594)
|-sendmail(134)
|-syslogd(78)
`-update(166)
Additionally all of the processes in the system are held in a doubly linked list whose root is the init processes task_struct data
structure. This list allows the Linux kernel to look at every process in the system. It needs to do this to provide support for commands
such as ps or kill.
Times and Timers
The kernel keeps track of a processes creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the
kernel updates the amount of time in jiffies that the current process has spent in system and in user mode. Linux also supports
process specific interval timers, processes can use system calls to set up timers to send signals to themselves when the timers expire.
These timers can be single-shot or periodic timers.
File system
Processes can open and close files as they wish and the processes task_struct contains pointers to descriptors for each open file as
well as pointers to two VFS inodes. Each VFS inode uniquely describes a file or directory within a file system and also provides a
uniform interface to the underlying file systems. How file systems are supported under Linux is described in Chapter
filesystem-chapter. The first is to the root of the process (its home directory) and the second is to its current or pwd directory. pwd is
derived from the Unix TM command pwd, print working directory. These two VFS inodes have their count fields incremented to
show that one or more processes are referencing them. This is why you cannot delete the directory that a process has as its pwd
directory set to, or for that matter one of its sub-directories.

Virtual memory
Most processes have some virtual memory (kernel threads and daemons do not) and the Linux kernel must track how that virtual
memory is mapped onto the system's physical memory.
Processor Specific Context
A process could be thought of as the sum total of the system's current state. Whenever a process is running it is using the processor's
registers, stacks and so on. This is the processes context and, when a process is suspended, all of that CPU specific context must be
saved in the task_struct for the process. When a process is restarted by the scheduler its context is restored from here.
4.2 Identifiers
Linux, like all Unix TM uses user and group identifiers to check for access rights to files and images in the system. All of the files in a Linux
system have ownerships and permissions, these permissions describe what access the system's users have to that file or directory. Basic
permissions are read, write and execute and are assigned to three classes of user; the owner of the file, processes belonging to a particular
group and all of the processes in the system. Each class of user can have different permissions, for example a file could have permissions
which allow its owner to read and write it, the file's group to read it and for all other processes in the system to have no access at all.
REVIEW NOTE: Expand and give the bit assignments (777).
Groups are Linux's way of assigning privileges to files and directories for a group of users rather than to a single user or to all processes in the
system. You might, for example, create a group for all of the users in a software project and arrange it so that only they could read and write
the source code for the project. A process can belong to several groups (a maximum of 32 is the default) and these are held in the groups
vector in the task_struct for each process. So long as a file has access rights for one of the groups that a process belongs to then that
process will have appropriate group access rights to that file.
There are four pairs of process and group identifiers held in a processes task_struct:
uid, gid
The user identifier and group identifier of the user that the process is running on behalf of,
effective uid and gid
There are some programs which change the uid and gid from that of the executing process into their own (held as attributes in the VFS
inode describing the executable image). These programs are known as setuid programs and they are useful because it is a way of
restricting accesses to services, particularly those that run on behalf of someone else, for example a network daemon. The effective uid
and gid are those from the setuid program and the uid and gid remain as they were. The kernel checks the effective uid and gid
whenever it checks for privilege rights.
file system uid and gid
These are normally the same as the effective uid and gid and are used when checking file system access rights. They are needed for
NFS mounted filesystems where the user mode NFS server needs to access files as if it were a particular process. In this case only the
file system uid and gid are changed (not the effective uid and gid). This avoids a situation where malicious users could send a kill
signal to the NFS server. Kill signals are delivered to processes with a particular effective uid and gid.
saved uid and gid
These are mandated by the POSIX standard and are used by programs which change the processes uid and gid via system calls. They
are used to save the real uid and gid during the time that the original uid and gid have been changed.
4.3 Scheduling
All processes run partially in user mode and partially in system mode. How these modes are supported by the underlying hardware differs but
generally there is a secure mechanism for getting from user mode into system mode and back again. User mode has far less privileges than
system mode. Each time a process makes a system call it swaps from user mode to system mode and continues executing. At this point the
kernel is executing on behalf of the process. In Linux, processes do not preempt the current, running process, they cannot stop it from running
so that they can run. Each process decides to relinquish the CPU that it is running on when it has to wait for some system event. For example,
a process may have to wait for a character to be read from a file. This waiting happens within the system call, in system mode; the process
used a library function to open and read the file and it, in turn made system calls to read bytes from the open file. In this case the waiting
process will be suspended and another, more deserving process will be chosen to run.
Processes are always making system calls and so may often need to wait. Even so, if a process executes until it waits then it still might use a
disproportionate amount of CPU time and so Linux uses pre-emptive scheduling. In this scheme, each process is allowed to run for a small
amount of time, 200ms, and, when this time has expired another process is selected to run and the original process is made to wait for a little
while until it can run again. This small amount of time is known as a time-slice.
It is the scheduler that must select the most deserving process to run out of all of the runnable processes in the system.

A runnable process is one which is waiting only for a CPU to run on. Linux uses a reasonably simple priority based scheduling algorithm to
choose between the current processes in the system. When it has chosen a new process to run it saves the state of the current process, the
processor specific registers and other context being saved in the processes task_struct data structure. It then restores the state of the new
process (again this is processor specific) to run and gives control of the system to that process. For the scheduler to fairly allocate CPU time
between the runnable processes in the system it keeps information in the task_struct for each process:
policy
This is the scheduling policy that will be applied to this process. There are two types of Linux process, normal and real time. Real time
processes have a higher priority than all of the other processes. If there is a real time process ready to run, it will always run first. Real
time processes may have two types of policy, round robin and first in first out. In round robin scheduling, each runnable real time
process is run in turn and in first in, first out scheduling each runnable process is run in the order that it is in on the run queue and that
order is never changed.
priority
This is the priority that the scheduler will give to this process. It is also the amount of time (in jiffies) that this process will run for
when it is allowed to run. You can alter the priority of a process by means of system calls and the renice command.
rt_priority
Linux supports real time processes and these are scheduled to have a higher priority than all of the other non-real time processes in
system. This field allows the scheduler to give each real time process a relative priority. The priority of a real time processes can be
altered using system calls.
counter
This is the amount of time (in jiffies) that this process is allowed to run for. It is set to priority when the process is first run
and is decremented each clock tick.
The scheduler is run from several places within the kernel. It is run after putting the current process onto a wait queue and it may also be run
at the end of a system call, just before a process is returned to process mode from system mode. One reason that it might need to run is
because the system timer has just set the current processes counter to zero. Each time the scheduler is run it does the following:
kernel work
The scheduler runs the bottom half handlers and processes the scheduler task queue. These lightweight kernel threads are described in
detail in chapter kernel-chapter.
Current process
The current process must be processed before another process can be selected to run.
If the scheduling policy of the current processes is round robin then it is put onto the back of the run queue.
If the task is INTERRUPTIBLE and it has received a signal since the last time it was scheduled then its state becomes RUNNING.
If the current process has timed out, then its state becomes RUNNING.
If the current process is RUNNING then it will remain in that state.
Processes that were neither RUNNING nor INTERRUPTIBLE are removed from the run queue. This means that they will not be
considered for running when the scheduler looks for the most deserving process to run.
Process selection
The scheduler looks through the processes on the run queue looking for the most deserving process to run. If there are any real time
processes (those with a real time scheduling policy) then those will get a higher weighting than ordinary processes. The weight for a
normal process is its counter but for a real time process it is counter plus 1000. This means that if there are any runnable real time
processes in the system then these will always be run before any normal runnable processes. The current process, which has consumed
some of its time-slice (its counter has been decremented) is at a disadvantage if there are other processes with equal priority in the
system; that is as it should be. If several processes have the same priority, the one nearest the front of the run queue is chosen. The
current process will get put onto the back of the run queue. In a balanced system with many processes of the same priority, each one
will run in turn. This is known as Round Robin scheduling. However, as processes wait for resources, their run order tends to get
moved around.
Swap processes
If the most deserving process to run is not the current process, then the current process must be suspended and the new one made to
run. When a process is running it is using the registers and physical memory of the CPU and of the system. Each time it calls a routine
it passes its arguments in registers and may stack saved values such as the address to return to in the calling routine. So, when the
scheduler is running it is running in the context of the current process. It will be in a privileged mode, kernel mode, but it is still the
current process that is running. When that process comes to be suspended, all of its machine state, including the program counter (PC)
and all of the processor's registers, must be saved in the processes task_struct data structure. Then, all of the machine state for the
new process must be loaded. This is a system dependent operation, no CPUs do this in quite the same way but there is usually some

hardware assistance for this act.
This swapping of process context takes place at the end of the scheduler. The saved context for the previous process is, therefore, a
snapshot of the hardware context of the system as it was for this process at the end of the scheduler. Equally, when the context of the
new process is loaded, it too will be a snapshot of the way things were at the end of the scheduler, including this processes program
counter and register contents.
If the previous process or the new current process uses virtual memory then the system's page table entries may need to be updated.
Again, this action is architecture specific. Processors like the Alpha AXP, which use Translation Look-aside Tables or cached Page
Table Entries, must flush those cached table entries that belonged to the previous process.
4.3.1 Scheduling in Multiprocessor Systems

Systems with multiple CPUs are reasonably rare in the Linux world but a lot of work has already gone into making Linux an SMP
(Symmetric Multi-Processing) operating system. That is, one that is capable of evenly balancing work between the CPUs in the system.
Nowhere is this balancing of work more apparent than in the scheduler.
In a multiprocessor system, hopefully, all of the processors are busily running processes. Each will run the scheduler separately as its current
process exhausts its time-slice or has to wait for a system resource. The first thing to notice about an SMP system is that there is not just one
idle process in the system. In a single processor system the idle process is the first task in the task vector, in an SMP system there is one idle
process per CPU, and you could have more than one idle CPU. Additionally there is one current process per CPU, so SMP systems must keep
track of the current and idle processes for each processor.
In an SMP system each process's task_struct contains the number of the processor that it is currently running on (processor) and its
processor number of the last processor that it ran on (last_processor). There is no reason why a process should not run on a different
CPU each time it is selected to run but Linux can restrict a process to one or more processors in the system using the processor_mask. If
bit N is set, then this process can run on processor N. When the scheduler is choosing a new process to run it will not consider one that does
not have the appropriate bit set for the current processor's number in its processor_mask. The scheduler also gives a slight advantage to a
process that last ran on the current processor because there is often a performance overhead when moving a process to a different processor.
4.4 Files

Figure 4.1: A Process's Files
Figure 4.1 shows that there are two data structures that describe file system specific information for each process in the system. The first, the
fs_struct
contains pointers to this process's VFS inodes and its umask. The umask is the default mode that new files will be created in, and it can be
changed via system calls.
The second data structure, the files_struct, contains information about all of the files that this process is currently using. Programs read
from standard input and write to standard output. Any error messages should go to standard error. These may be files, terminal input/output
or a real device but so far as the program is concerned they are all treated as files. Every file has its own descriptor and the files_struct
contains pointers to up to 256 file data structures, each one describing a file being used by this process. The f_mode field describes what
mode the file has been created in; read only, read and write or write only. f_pos holds the position in the file where the next read or write
operation will occur. f_inode points at the VFS inode describing the file and f_ops is a pointer to a vector of routine addresses; one for
each function that you might wish to perform on a file. There is, for example, a write data function. This abstraction of the interface is very
powerful and allows Linux to support a wide variety of file types. In Linux, pipes are implemented using this mechanism as we shall see later.
Every time a file is opened, one of the free file pointers in the files_struct is used to point to the new file structure. Linux
processes expect three file descriptors to be open when they start. These are known as standard input, standard output and standard error and
they are usually inherited from the creating parent process. All accesses to files are via standard system calls which pass or return file
descriptors. These descriptors are indices into the process's fd vector, so standard input, standard output and standard error have file
descriptors 0, 1 and 2. Each access to the file uses the file data structure's file operation routines to together with the VFS inode to achieve
its needs.
4.5 Virtual Memory

A process's virtual memory contains executable code and data from many sources. First there is the program image that is loaded; for
example a command like ls. This command, like all executable images, is composed of both executable code and data. The image file
contains all of the information neccessary to load the executable code and associated program data into the virtual memory of the process.
Secondly, processses can allocate (virtual) memory to use during their processing, say to hold the contents of files that it is reading. This
newly allocated, virtual, memory needs to be linked into the process's existing virtual memory so that it can be used. Thirdly, Linux processes
use libraries of commonly useful code, for example file handling routines. It does not make sense that each process has its own copy of the
library, Linux uses shared libraries that can be used by several running processes at the same time. The code and the data from these shared
libraries must be linked into this process's virtual address space and also into the virtual address space of the other processes sharing the
library.
In any given time period a process will not have used all of the code and data contained within its virtual memory. It could contain code that
is only used during certain situations, such as during initialization or to process a particular event. It may only have used some of the routines
from its shared libraries. It would be wasteful to load all of this code and data into physical memory where it would lie unused. Multiply this
wastage by the number of processes in the system and the system would run very inefficiently. Instead, Linux uses a technique called demand
paging where the virtual memory of a process is brought into physical memory only when a process attempts to use it. So, instead of loading
the code and data into physical memory straight away, the Linux kernel alters the process's page table, marking the virtual areas as existing
but not in memory. When the process attempts to acccess the code or data the system hardware will generate a page fault and hand control to
the Linux kernel to fix things up. Therefore, for every area of virtual memory in the process's address space Linux needs to know where that
virtual memory comes from and how to get it into memory so that it can fix up these page faults.

Figure 4.2: A Process's Virtual Memory
The Linux kernel needs to manage all of these areas of virtual memory and the contents of each process's virtual memory is described by a
mm_struct data structure pointed at from its task_struct. The process's mm_struct
data structure also contains information about the loaded executable image and a pointer to the process's page tables. It contains pointers to a
list of vm_area_struct data structures, each
representing an area of virtual memory within this process.
This linked list is in ascending virtual memory order, figure 4.2 shows the layout in virtual memory of a simple process together with the
kernel data structures managing it. As those areas of virtual memory are from several sources, Linux abstracts the interface by having the
vm_area_struct point to a set of virtual memory handling routines (via vm_ops). This way all of the process's virtual memory can be
handled in a consistent way no matter how the underlying services managing that memory differ. For example there is a routine that will be
called when the process attempts to access the memory and it does not exist, this is how page faults are handled.
The process's set of vm_area_struct data structures is accessed repeatedly by the Linux kernel as it creates new areas of virtual memory
for the process and as it fixes up references to virtual memory not in the system's physical memory. This makes the time that it takes to find
the correct vm_area_struct critical to the performance of the system. To speed up this access, Linux also arranges the
vm_area_struct data structures into an AVL (Adelson-Velskii and Landis) tree. This tree is arranged so that each vm_area_struct
(or node) has a left and a right pointer to its neighbouring vm_area_struct structure. The left pointer points to node with a lower starting
virtual address and the right pointer points to a node with a higher starting virtual address. To find the correct node, Linux goes to the root of
the tree and follows each node's left and right pointers until it finds the right vm_area_struct. Of course, nothing is for free and inserting
a new vm_area_struct into this tree takes additional processing time.
When a process allocates virtual memory, Linux does not actually reserve physical memory for the process. Instead, it describes the virtual
memory by creating a new vm_area_struct data structure. This is linked into the process's list of virtual memory. When the process
attempts to write to a virtual address within that new virtual memory region then the system will page fault. The processor will attempt to
decode the virtual address, but as there are no Page Table Entries for any of this memory, it will give up and raise a page fault exception,
leaving the Linux kernel to fix things up. Linux looks to see if the virtual address referenced is in the current process's virtual address space.

If it is, Linux creates the appropriate PTEs and allocates a physical page of memory for this process. The code or data may need to be brought
into that physical page from the filesystem or from the swap disk. The process can then be restarted at the instruction that caused the page
fault and, this time as the memory physically exists, it may continue.
4.6 Creating a Process

When the system starts up it is running in kernel mode and there is, in a sense, only one process, the initial process. Like all processes, the
initial process has a machine state represented by stacks, registers and so on. These will be saved in the initial process's task_struct data
structure when other processes in the system are created and run. At the end of system initialization, the initial process starts up a kernel
thread (called init) and then sits in an idle loop doing nothing. Whenever there is nothing else to do the scheduler will run this, idle,
process. The idle process's task_struct is the only one that is not dynamically allocated, it is statically defined at kernel build time and is,
rather confusingly, called init_task.
The init kernel thread or process has a process identifier of 1 as it is the system's first real process. It does some initial setting up of the
system (such as opening the system console and mounting the root file system) and then executes the system initialization program. This is
one of /etc/init, /bin/init or /sbin/init depending on your system. The init program uses /etc/inittab as a script file to
create new processes within the system. These new processes may themselves go on to create new processes. For example the getty process
may create a login process when a user attempts to login. All of the processes in the system are descended from the init kernel thread.
New processes are created by cloning old processes, or rather by cloning the current process. A new task is created by a system call (fork or
clone)
and the cloning happens within the kernel in kernel mode. At the end of the system call there is a new process waiting to run once the
scheduler chooses it. A new task_struct data structure is allocated from the system's physical memory with one or more physical pages
for the cloned process's stacks (user and kernel). A new process identifier may be created, one that is unique within the set of process
identifiers in the system. However, it is perfectly reasonable for the cloned process to keep its parents process identifier. The new
task_struct is entered into the task vector and the contents of the old (current) process's task_struct are copied into the cloned
task_struct.
When cloning processes Linux allows the two processes to share resources rather than have two separate copies. This applies to the process's
files, signal handlers and virtual memory. When the resources are to be shared their respective count fields are incremented so that Linux
will not deallocate these resources until both processes have finished using them. So, for example, if the cloned process is to share virtual
memory, its task_struct will contain a pointer to the mm_struct of the original process and that mm_struct has its count field
incremented to show the number of current processes sharing it.
Cloning a process's virtual memory is rather tricky. A new set of vm_area_struct data structures must be generated together with their
owning mm_struct data structure and the cloned process's page tables. None of the process's virtual memory is copied at this point. That
would be a rather difficult and lengthy task for some of that virtual memory would be in physical memory, some in the executable image that
the process is currently executing and possibly some would be in the swap file. Instead Linux uses a technique called ``copy on write'' which
means that virtual memory will only be copied when one of the two processes tries to write to it. Any virtual memory that is not written to,
even if it can be, will be shared between the two processes without any harm occuring. The read only memory, for example the executable
code, will always be shared. For ``copy on write'' to work, the writeable areas have their page table entries marked as read only and the
vm_area_struct data structures describing them are marked as ``copy on write''. When one of the processes attempts to write to this
virtual memory a page fault will occur. It is at this point that Linux will make a copy of the memory and fix up the two processes' page tables
and virtual memory data structures.
4.7 Times and Timers

The kernel keeps track of a process's creation time as well as the CPU time that it consumes during its lifetime. Each clock tick, the kernel
updates the amount of time in jiffies that the current process has spent in system and in user mode.
In addition to these accounting timers, Linux supports process specific interval timers.
A process can use these timers to send itself various signals each time that they expire. Three sorts of interval timers are supported:
Real
the timer ticks in real time, and when the timer has expired, the process is sent a SIGALRM signal.
Virtual
This timer only ticks when the process is running and when it expires it sends a SIGVTALRM signal.
Profile
This timer ticks both when the process is running and when the system is executing on behalf of the process itself. SIGPROF is
signalled when it expires.

One or all of the interval timers may be running and Linux keeps all of the neccessary information in the process's task_struct data
structure. System calls can be made to set up these interval timers and to start them, stop them and read their current values. The virtual and
profile timers are handled the same way.
Every clock tick the current process's interval timers are decremented and, if they have expired, the appropriate signal is sent.
Real time interval timers are a little different and for these Linux uses the timer mechanism described in Chapter kernel-chapter. Each
process has its own timer_list data structure and, when the real interval timer is running, this is queued on the system timer list. When
the timer expires the timer bottom half handler removes it from the queue and calls the interval timer handler.
This generates the SIGALRM signal and restarts the interval timer, adding it back into the system timer queue.
4.8 Executing Programs

In Linux, as in Unix TM, programs and commands are normally executed by a command interpreter. A command interpreter is a user process
like any other process and is called a shell 2.
There are many shells in Linux, some of the most popular are sh, bash and tcsh. With the exception of a few built in commands, such as
cd and pwd, a command is an executable binary file. For each command entered, the shell searches the directories in the process's search
path, held in the PATH environment variable, for an executable image with a matching name. If the file is found it is loaded and executed.
The shell clones itself using the fork mechanism described above and then the new child process replaces the binary image that it was
executing, the shell, with the contents of the executable image file just found. Normally the shell waits for the command to complete, or
rather for the child process to exit. You can cause the shell to run again by pushing the child process to the background by typing
control-Z, which causes a SIGSTOP signal to be sent to the child process, stopping it. You then use the shell command bg to push it into
a background, the shell sends it a SIGCONT signal to restart it, where it will stay until either it ends or it needs to do terminal input or output.
An executable file can have many formats or even be a script file. Script files have to be recognized and the appropriate interpreter run to
handle them; for example /bin/sh interprets shell scripts. Executable object files contain executable code and data together with enough
information to allow the operating system to load them into memory and execute them. The most commonly used object file format used by
Linux is ELF but, in theory, Linux is flexible enough to handle almost any object file format.
Figure 4.3: Registered Binary Formats

As with file systems, the binary formats supported by Linux are either built into the kernel at kernel build time or available to be loaded as
modules. The kernel keeps a list of supported binary formats (see figure 4.3) and when an attempt is made to execute a file, each binary
format is tried in turn until one works.
Commonly supported Linux binary formats are a.out and ELF. Executable files do not have to be read completely into memory, a
technique known as demand loading is used. As each part of the executable image is used by a process it is brought into memory. Unused
parts of the image may be discarded from memory.
4.8.1 ELF
The ELF (Executable and Linkable Format) object file format, designed by the Unix System Laboratories, is now firmly established as the
most commonly used format in Linux. Whilst there is a slight performance overhead when compared with other object file formats such as
ECOFF and a.out, ELF is felt to be more flexible. ELF executable files contain executable code, sometimes refered to as text, and data.
Tables within the executable image describe how the program should be placed into the process's virtual memory. Statically linked images are
built by the linker (ld), or link editor, into one single image containing all of the code and data needed to run this image. The image also
specifies the layout in memory of this image and the address in the image of the first code to execute.

Figure 4.4: ELF Executable File Format
Figure 4.4 shows the layout of a statically linked ELF executable image.
It is a simple C program that prints ``hello world'' and then exits. The header describes it as an ELF image with two physical headers
(e_phnum is 2) starting 52 bytes (e_phoff) from the start of the image file. The first physical header describes the executable code in the
image. It goes at virtual address 0x8048000 and there is 65532 bytes of it. This is because it is a statically linked image which contains all of
the library code for the printf() call to output ``hello world''. The entry point for the image, the first instruction for the program, is not at
the start of the image but at virtual address 0x8048090 (e_entry). The code starts immediately after the second physical header. This
physical header describes the data for the program and is to be loaded into virtual memory at address 0x8059BB8. This data is both readable
and writeable. You will notice that the size of the data in the file is 2200 bytes (p_filesz) whereas its size in memory is 4248 bytes. This
because the first 2200 bytes contain pre-initialized data and the next 2048 bytes contain data that will be initialized by the executing code.
When Linux loads an ELF executable image into the process's virtual address space, it does not actually load the image.
It sets up the virtual memory data structures, the process's vm_area_struct tree and its page tables. When the program is executed page
faults will cause the program's code and data to be fetched into physical memory. Unused portions of the program will never be loaded into
memory. Once the ELF binary format loader is satisfied that the image is a valid ELF executable image it flushes the process's current
executable image from its virtual memory. As this process is a cloned image (all processes are) this, old, image is the program that the parent
process was executing, for example the command interpreter shell such as bash. This flushing of the old executable image discards the old
virtual memory data structures and resets the process's page tables. It also clears away any signal handlers that were set up and closes any
files that are open. At the end of the flush the process is ready for the new executable image. No matter what format the executable image is,
the same information gets set up in the process's mm_struct. There are pointers to the start and end of the image's code and data. These
values are found as the ELF executable images physical headers are read and the sections of the program that they describe are mapped into
the process's virtual address space. That is also when the vm_area_struct data structures are set up and the process's page tables are
modified. The mm_struct data structure also contains pointers to the parameters to be passed to the program and to this process's
environment variables.
ELF Shared Libraries

A dynamically linked image, on the other hand, does not contain all of the code and data required to run. Some of it is held in shared libraries
that are linked into the image at run time. The ELF shared library's tables are also used by the dynamic linker when the shared library is
linked into the image at run time. Linux uses several dynamic linkers, ld.so.1, libc.so.1 and ld-linux.so.1, all to be found in
/lib. The libraries contain commonly used code such as language subroutines. Without dynamic linking, all programs would need their own
copy of the these libraries and would need far more disk space and virtual memory. In dynamic linking, information is included in the ELF
image's tables for every library routine referenced. The information indicates to the dynamic linker how to locate the library routine and link
it into the program's address space.
REVIEW NOTE: Do I need more detail here, worked example?
4.8.2 Script Files

Script files are executables that need an interpreter to run them. There are a wide variety of interpreters available for Linux; for example
wish, perl and command shells such as tcsh. Linux uses the standard Unux TM convention of having the first line of a script file contain the
name of the interpreter. So, a typical script file would start:
#!/usr/bin/wish
The script binary loader tries to find the intepreter for the script.
It does this by attempting to open the executable file that is named in the first line of the script. If it can open it, it has a pointer to its VFS
inode and it can go ahead and have it interpret the script file. The name of the script file becomes argument zero (the first argument) and all of
the other arguments move up one place (the original first argument becomes the new second argument and so on). Loading the interpreter is
done in the same way as Linux loads all of its executable files. Linux tries each binary format in turn until one works. This means that you
could in theory stack several interpreters and binary formats making the Linux binary format handler a very flexible piece of software.
Footnotes:
1 REVIEW NOTE: I left out SWAPPING because it does not appear to be used.
2 Think of a nut the kernel is the edible bit in the middle and the shell goes around it, providing an interface.


Chapter 10
Networks
Networking and Linux are terms that are almost synonymous. In a very real sense Linux is a product of the Internet or World Wide Web
(WWW). Its developers and users use the web to exchange information ideas, code, and Linux itself is often used to support the
networking needs of organizations. This chapter describes how Linux supports the network protocols known collectively as TCP/IP.
The TCP/IP protocols were designed to support communications between computers connected to the ARPANET, an American
research network funded by the US government. The ARPANET pioneered networking concepts such as packet switching and protocol
layering where one protocol uses the services of another. ARPANET was retired in 1988 but its successors (NSF1 NET and the Internet)
have grown even larger. What is now known as the World Wide Web grew from the ARPANET and is itself supported by the TCP/IP
protocols. Unix TM was extensively used on the ARPANET and the first released networking version of Unix TM was 4.3 BSD. Linux's
networking implementation is modeled on 4.3 BSD in that it supports BSD sockets (with some extensions) and the full range of TCP/IP
networking. This programming interface was chosen because of its popularity and to help applications be portable between Linux and
other Unix TM platforms.
10.1 An Overview of TCP/IP Networking

This section gives an overview of the main principles of TCP/IP networking. It is not meant to be an exhaustive description, for that I
suggest that you read . In an IP network every machine is assigned an IP address, this is a 32 bit number that uniquely identifies the
machine. The WWW is a very large, and growing, IP network and every machine that is connected to it has to have a unique IP address
assigned to it. IP addresses are represented by four numbers separated by dots, for example, 16.42.0.9. This IP address is actually in
two parts, the network address and the host address. The sizes of these parts may vary (there are several classes of IP addresses) but
using 16.42.0.9 as an example, the network address would be 16.42 and the host address 0.9. The host address is further
subdivided into a subnetwork and a host address. Again, using 16.42.0.9 as an example, the subnetwork address would be
16.42.0 and the host address 16.42.0.9. This subdivision of the IP address allows organizations to subdivide their networks. For
example, 16.42 could be the network address of the ACME Computer Company; 16.42.0 would be subnet 0 and 16.42.1 would
be subnet 1. These subnets might be in separate buildings, perhaps connected by leased telephone lines or even microwave links. IP
addresses are assigned by the network administrator and having IP subnetworks is a good way of distributing the administration of the
network. IP subnet administrators are free to allocate IP addresses within their IP subnetworks.
Generally though, IP addresses are somewhat hard to remember. Names are much easier. linux.acme.com is much easier to
remember than 16.42.0.9 but there must be some mechanism to convert the network names into an IP address. These names can be
statically specified in the /etc/hosts file or Linux can ask a Distributed Name Server (DNS server) to resolve the name for it. In this
case the local host must know the IP address of one or more DNS servers and these are specified in /etc/resolv.conf.
Whenever you connect to another machine, say when reading a web page, its IP address is used to exchange data with that machine.
This data is contained in IP packets each of which have an IP header containing the IP addresses of the source and destination machine's
IP addresses, a checksum and other useful information. The checksum is derived from the data in the IP packet and allows the receiver
of IP packets to tell if the IP packet was corrupted during transmission, perhaps by a noisy telephone line. The data transmitted by an
application may have been broken down into smaller packets which are easier to handle. The size of the IP data packets varies
depending on the connection media; ethernet packets are generally bigger than PPP packets. The destination host must reassemble the
data packets before giving the data to the receiving application. You can see this fragmentation and reassembly of data graphically if
you access a web page containing a lot of graphical images via a moderately slow serial link.
Hosts connected to the same IP subnet can send IP packets directly to each other, all other IP packets will be sent to a special host, a
gateway. Gateways (or routers) are connected to more than one IP subnet and they will resend IP packets received on one subnet, but
destined for another onwards. For example, if subnets 16.42.1.0 and 16.42.0.0 are connected together by a gateway then any
packets sent from subnet 0 to subnet 1 would have to be directed to the gateway so that it could route them. The local host builds up
routing tables which allow it to route IP packets to the correct machine. For every IP destination there is an entry in the routing tables
http://ldp.iol.it/LDP/tlk/net/net.html (1 di 13) [08/03/2001 10.09.13]

which tells Linux which host to send IP packets to in order that they reach their destination. These routing tables are dynamic and
change over time as applications use the network and as the network topology changes.
Figure 10.1: TCP/IP Protocol Layers

The IP protocol is a transport layer that is used by other protocols to carry their data. The Transmission Control Protocol (TCP) is a
reliable end to end protocol that uses IP to transmit and receive its own packets. Just as IP packets have their own header, TCP has its
own header. TCP is a connection based protocol where two networking applications are connected by a single, virtual connection even
though there may be many subnetworks, gateways and routers between them. TCP reliably transmits and receives data between the two
applications and guarantees that there will be no lost or duplicated data. When TCP transmits its packet using IP, the data contained
within the IP packet is the TCP packet itself. The IP layer on each communicating host is responsible for transmitting and receiving IP
packets. User Datagram Protocol (UDP) also uses the IP layer to transport its packets, unlike TCP, UDP is not a reliable protocol but
offers a datagram service. This use of IP by other protocols means that when IP packets are received the receiving IP layer must know
which upper protocol layer to give the data contained in this IP packet to. To facilitate this every IP packet header has a byte containing
a protocol identifier. When TCP asks the IP layer to transmit an IP packet , that IP packet's header states that it contains a TCP packet.
The receiving IP layer uses that protocol identifier to decide which layer to pass the received data up to, in this case the TCP layer.
When applications communicate via TCP/IP they must specify not only the target's IP address but also the port address of the
application. A port address uniquely identifies an application and standard network applications use standard port addresses; for
example, web servers use port 80. These registered port addresses can be seen in /etc/services.
This layering of protocols does not stop with TCP, UDP and IP. The IP protocol layer itself uses many different physical media to
transport IP packets to other IP hosts. These media may themselves add their own protocol headers. One such example is the ethernet
layer, but PPP and SLIP are others. An ethernet network allows many hosts to be simultaneously connected to a single physical cable.
Every transmitted ethernet frame can be seen by all connected hosts and so every ethernet device has a unique address. Any ethernet
frame transmitted to that address will be received by the addressed host but ignored by all the other hosts connected to the network.
These unique addresses are built into each ethernet device when they are manufactured and it is usually kept in an SROM2 on the
ethernet card. Ethernet addresses are 6 bytes long, an example would be 08-00-2b-00-49-A4. Some ethernet addresses are
reserved for multicast purposes and ethernet frames sent with these destination addresses will be received by all hosts on the network.
As ethernet frames can carry many different protocols (as data) they, like IP packets, contain a protocol identifier in their headers. This
allows the ethernet layer to correctly receive IP packets and to pass them onto the IP layer.
In order to send an IP packet via a multi-connection protocol such as ethernet, the IP layer must find the ethernet address of the IP host.
This is because IP addresses are simply an addressing concept, the ethernet devices themselves have their own physical addresses. IP
addresses on the other hand can be assigned and reassigned by network administrators at will but the network hardware responds only to
ethernet frames with its own physical address or to special multicast addresses which all machines must receive. Linux uses the Address
Resolution Protocol (or ARP) to allow machines to translate IP addresses into real hardware addresses such as ethernet addresses. A

host wishing to know the hardware address associated with an IP address sends an ARP request packet containing the IP address that it
wishes translating to all nodes on the network by sending it to a multicast address. The target host that owns the IP address, responds
with an ARP reply that contains its physical hardware address. ARP is not just restricted to ethernet devices, it can resolve IP addresses
for other physical media, for example FDDI. Those network devices that cannot ARP are marked so that Linux does not attempt to
ARP. There is also the reverse function, Reverse ARP or RARP, which translates phsyical network addresses into IP addresses. This is
used by gateways, which respond to ARP requests on behalf of IP addresses that are in the remote network.
10.2 The Linux TCP/IP Networking Layers
Figure 10.2: Linux Networking Layers

Just like the network protocols themselves, Figure 10.2 shows that Linux implements the internet protocol address family as a series of
connected layers of software. BSD sockets are supported by a generic socket management software concerned only with BSD sockets.
Supporting this is the INET socket layer, this manages the communication end points for the IP based protocols TCP and UDP. UDP
(User Datagram Protocol) is a connectionless protocol whereas TCP (Transmission Control Protocol) is a reliable end to end protocol.
When UDP packets are transmitted, Linux neither knows nor cares if they arrive safely at their destination. TCP packets are numbered
and both ends of the TCP connection make sure that transmitted data is received correctly. The IP layer contains code implementing the
Internet Protocol. This code prepends IP headers to transmitted data and understands how to route incoming IP packets to either the

TCP or UDP layers. Underneath the IP layer, supporting all of Linux's networking are the network devices, for example PPP and
ethernet. Network devices do not always represent physical devices; some like the loopback device are purely software devices. Unlike
standard Linux devices that are created via the mknod command, network devices appear only if the underlying software has found and
initialized them. You will only see /dev/eth0 when you have built a kernel with the appropriate ethernet device driver in it. The
ARP protocol sits between the IP layer and the protocols that support ARPing for addresses.
10.3 The BSD Socket Interface

This is a general interface which not only supports various forms of networking but is also an inter-process communications
mechanism. A socket describes one end of a communications link, two communicating processes would each have a socket describing
their end of the communication link between them. Sockets could be thought of as a special case of pipes but, unlike pipes, sockets have
no limit on the amount of data that they can contain. Linux supports several classes of socket and these are known as address families.
This is because each class has its own method of addressing its communications. Linux supports the following socket address families
or domains:
UNIX Unix domain sockets,
Care is taken to ensure that data packets in transit are correctly dealt with.
The exact meaning of operations on a BSD socket depends on its underlying address family. Setting up TCP/IP connections is very
different from setting up an amateur radio X.25 connection. Like the virtual filesystem, Linux abstracts the socket interface with the
BSD socket layer being concerned with the BSD socket interface to the application programs which is in turn supported by independent
address family specific software. At kernel initialization time, the address families built into the kernel register themselves with the
BSD socket interface. Later on, as applications create and use BSD sockets, an association is made between the BSD socket and its
supporting address family. This association is made via cross-linking data structures and tables of address family specific support
routines. For example there is an address family specific socket creation routine which the BSD socket interface uses when an
application creates a new socket.
When the kernel is configured, a number of address families and protocols are built into the protocols vector. Each is represented by
its name, for example `ÌNET'' and the address of its initialization routine. When the socket interface is initialized at boot time each
protocol's initialization routine is called. For the socket address families this results in them registering a set of protocol operations. This
is a set of routines, each of which performs a a particular operation specific to that address family. The registered protocol operations
are kept in the pops vector, a vector of pointers to proto_ops data structures.
The proto_ops data structure consists of the address family type and a set of pointers to socket operation routines specific to a
particular address family. The pops vector is indexed by the address family identifier, for example the Internet address family identifier
(AF_INET is 2).

Figure 10.3: Linux BSD Socket Data Structures
10.4 The INET Socket Layer

The INET socket layer supports the internet address family which contains the TCP/IP protocols. As discussed above, these protocols
are layered, one protocol using the services of another. Linux's TCP/IP code and data structures reflect this layering. Its interface with
the BSD socket layer is through the set of Internet address family socket operations which it registers with the BSD socket layer during
network initialization. These are kept in the pops vector along with the other registered address families. The BSD socket layer calls
the INET layer socket support routines from the registered INET proto_ops data structure to perform work for it. For example a
BSD socket create request that gives the address family as INET will use the underlying INET socket create function. The BSD socket
layer passes the socket data structure representing the BSD socket to the INET layer in each of these operations. Rather than clutter
the BSD socket wiht TCP/IP specific information, the INET socket layer uses its own data structure, the sock which it links to the
BSD socket data structure. This linkage can be seen in Figure 10.3. It links the sock data structure to the BSD socket data
structure using the data pointer in the BSD socket. This means that subsequent INET socket calls can easily retrieve the sock data
structure. The sock data structure's protocol operations pointer is also set up at creation time and it depends on the protocol requested.
If TCP is requested, then the sock data structure's protocol operations pointer will point to the set of TCP protocol operations needed
for a TCP connection.
10.4.1 Creating a BSD Socket

The system call to create a new socket passes identifiers for its address family, socket type and protocol.
Firstly the requested address family is used to search the pops vector for a matching address family. It may be that a particular address
family is implemented as a kernel module and, in this case, the kerneld daemon must load the module before we can continue. A new
socket data structure is allocated to represent the BSD socket. Actually the socket data structure is physically part of the VFS
inode data structure and allocating a socket really means allocating a VFS inode. This may seem strange unless you consider that
sockets can be operated on in just the same way that ordinairy files can. As all files are represented by a VFS inode data structure,
then in order to support file operations, BSD sockets must also be represented by a VFS inode data structure.
The newly created BSD socket data structure contains a pointer to the address family specific socket routines and this is set to the
proto_ops data structure retrieved from the pops vector. Its type is set to the sccket type requested; one of SOCK_STREAM,
SOCK_DGRAM and so on. The address family specific creation routine is called using the address kept in the proto_ops data
structure.
A free file descriptor is allocated from the current processes fd vector and the file data structure that it points at is initialized. This
includes setting the file operations pointer to point to the set of BSD socket file operations supported by the BSD socket interface. Any
future operations will be directed to the socket interface and it will in turn pass them to the supporting address family by calling its
address family operation routines.
10.4.2 Binding an Address to an INET BSD Socket

In order to be able to listen for incoming internet connection requests, each server must create an INET BSD socket and bind its address
to it. The bind operation is mostly handled within the INET socket layer with some support from the underlying TCP and UDP protocol
layers. The socket having an address bound to cannot be being used for any other communication. This means that the socket's state
must be TCP_CLOSE. The sockaddr pass to the bind operation contains the IP address to be bound to and, optionally, a port number.
Normally the IP address bound to would be one that has been assigned to a network device that supports the INET address family and
whose interface is up and able to be used. You can see which network interfaces are currently active in the system by using the ifconfig
command. The IP address may also be the IP broadcast address of either all 1's or all 0's. These are special addresses that mean ``send to
everybody''3. The IP address could also be specified as any IP address if the machine is acting as a transparent proxy or firewall, but
only processes with superuser privileges can bind to any IP address. The IP address bound to is saved in the sock data structure in the
recv_addr and saddr fields. These are used in hash lookups and as the sending IP address respectively. The port number is optional
and if it is not specified the supporting network is asked for a free one. By convention, port numbers less than 1024 cannot be used by
processes without superuser privileges. If the underlying network does allocate a port number it always allocates ones greater than 1024.

As packets are being received by the underlying network devices they must be routed to the correct INET and BSD sockets so that they
can be processed. For this reason UDP and TCP maintain hash tables which are used to lookup the addresses within incoming IP
messages and direct them to the correct socket/sock pair. TCP is a connection oriented protocol and so there is more information
involved in processing TCP packets than there is in processing UDP packets.
UDP maintains a hash table of allocated UDP ports, the udp_hash table. This consists of pointers to sock data structures indexed by
a hash function based on the port number. As the UDP hash table is much smaller than the number of permissible port numbers
(udp_hash is only 128 or UDP_HTABLE_SIZE entries long) some entries in the table point to a chain of sock data structures linked
together using each sock's next pointer.
TCP is much more complex as it maintains several hash tables. However, TCP does not actually add the binding sock data stucture
into its hash tables during the bind operation, it merely checks that the port number requested is not currently being used. The sock
data structure is added to TCP's hash tables during the listen operation.
REVIEW NOTE: What about the route entered?
10.4.3 Making a Connection on an INET BSD Socket

Once a socket has been created and, provided it has not been used to listen for inbound connection requests, it can be used to make
outbound connection requests. For connectionless protocols like UDP this socket operation does not do a whole lot but for connection
orientated protocols like TCP it involves building a virtual circuit between two applications.
An outbound connection can only be made on an INET BSD socket that is in the right state; that is to say one that does not already have
a connection established and one that is not being used for listening for inbound connections. This means that the BSD socket data
structure must be in state SS_UNCONNECTED. The UDP protocol does not establish virtual connections between applications, any
messages sent are datagrams, one off messages that may or may not reach their destinations. It does, however, support the connect BSD
socket operation. A connection operation on a UDP INET BSD socket simply sets up the addresses of the remote application; its IP
address and its IP port number. Additionally it sets up a cache of the routing table entry so that UDP packets sent on this BSD socket do
not need to check the routing database again (unless this route becomes invalid). The cached routing information is pointed at from the
ip_route_cache pointer in the INET sock data structure. If no addressing information is given, this cached routing and IP
addressing information will be automatically be used for messages sent using this BSD socket. UDP moves the sock's state to
TCP_ESTABLISHED.
For a connect operation on a TCP BSD socket, TCP must build a TCP message containing the connection information and send it to IP
destination given. The TCP message contains information about the connection, a unique starting message sequence number, the
maximum sized message that can be managed by the initiating host, the transmit and receive window size and so on. Within TCP all
messages are numbered and the initial sequence number is used as the first message number. Linux chooses a reasonably random value
to avoid malicious protocol attacks. Every message transmitted by one end of the TCP connection and successfully received by the
other is acknowledged to say that it arrived successfully and uncorrupted. Unacknowledges messages will be retransmitted. The
transmit and receive window size is the number of outstanding messages that there can be without an acknowledgement being sent. The
maximum message size is based on the network device that is being used at the initiating end of the request. If the receiving end's
network device supports smaller maximum message sizes then the connection will use the minimum of the two. The application making
the outbound TCP connection request must now wait for a response from the target application to accept or reject the connection
request. As the TCP sock is now expecting incoming messages, it is added to the tcp_listening_hash so that incoming TCP
messages can be directed to this sock data structure. TCP also starts timers so that the outbound connection request can be timed out if
the target application does not respond to the request.
10.4.4 Listening on an INET BSD Socket

Once a socket has had an address bound to it, it may listen for incoming connection requests specifying the bound addresses. A network
application can listen on a socket without first binding an address to it; in this case the INET socket layer finds an unused port number
(for this protocol) and automatically binds it to the socket. The listen socket function moves the socket into state TCP_LISTEN and
does any network specific work needed to allow incoming connections.
For UDP sockets, changing the socket's state is enough but TCP now adds the socket's sock data structure into two hash tables as it is
now active. These are the tcp_bound_hash table and the tcp_listening_hash. Both are indexed via a hash function based on
the IP port number.
Whenever an incoming TCP connection request is received for an active listening socket, TCP builds a new sock data structure to
represent it. This sock data structure will become the bottom half of the TCP connection when it is eventually accepted. It also clones
the incoming sk_buff containing the connection request and queues it onto the receive_queue for the listening sock data

structure. The clone sk_buff contains a pointer to the newly created sock data structure.
10.4.5 Accepting Connection Requests

UDP does not support the concept of connections, accepting INET socket connection requests only applies to the TCP protocol as an
accept operation on a listening socket causes a new socket data structure to be cloned from the original listening socket. The accept
operation is then passed to the supporting protocol layer, in this case INET to accept any incoming connection requests. The INET
protocol layer will fail the accept operation if the underlying protocol, say UDP, does not support connections. Otherwise the accept
operation is passed through to the real protocol, in this case TCP. The accept operation can be either blocking or non-blocking. In the
non-blocking case if there are no incoming connections to accept, the accept operation will fail and the newly created socket data
structure will be thrown away. In the blocking case the network application performing the accept operation will be added to a wait
queue and then suspended until a TCP connection request is received. Once a connection request has been received the sk_buff
containing the request is discarded and the sock data structure is returned to the INET socket layer where it is linked to the new
socket data structure created earlier. The file descriptor (fd) number of the new socket is returned to the network application, and
the application can then use that file descriptor in socket operations on the newly created INET BSD socket.
10.5 The IP Layer

10.5.1 Socket Buffers
One of the problems of having many layers of network protocols, each one using the services of another, is that each protocol needs to
add protocol headers and tails to data as it is transmitted and to remove them as it processes received data. This make passing data
buffers between the protocols difficult as each layer needs to find where its particular protocol headers and tails are. One solution is to
copy buffers at each layer but that would be inefficient. Instead, Linux uses socket buffers or sk_buffs to pass data between the
protocol layers and the network device drivers. sk_buffs contain pointer and length fields that allow each protocol layer to
manipulate the application data via standard functions or ``methods''.
Figure 10.4: The Socket Buffer (sk_buff)

Figure 10.4 shows the sk_buff data structure; each sk_buff has a block of data associated with it. The sk_buff has four data
pointers, which are used to manipulate and manage the socket buffer's data:

head
points to the start of the data area in memory. This is fixed when the sk_buff and its associated data block is allocated,
data
points at the current start of the protocol data. This pointer varies depending on the protocol layer that currently owns the
sk_buff,
tail
points at the current end of the protocol data. Again, this pointer varies depending on the owning protocol layer,
end
points at the end of the data area in memory. This is fixed when the sk_buff is allocated.
There are two length fields len and truesize, which describe the length of the current protocol packet and the total size of the data
buffer respectively. The sk_buff handling code provides standard mechanisms for adding and removing protocol headers and tails to
the application data. These safely manipulate the data, tail and len fields in the sk_buff:
push
This moves the data pointer towards the start of the data area and increments the len field. This is used when adding data or
protocol headers to the start of the data to be transmitted,
pull
This moves the data pointer away from the start, towards the end of the data area and decrements the len field. This is used
when removing data or protocol headers from the start of the data that has been received,
put
This moves the tail pointer towards the end of the data area and increments the len field. This is used when adding data or
protocol information to the end of the data to be transmitted,
trim
This moves the tail pointer towards the start of the data area and decrements the len field. This is used when removing data or
protocol tails from the received packet.
The sk_buff data structure also contains pointers that are used as it is stored in doubly linked circular lists of sk_buff's during
processing. There are generic sk_buff routines for adding sk_buffs to the front and back of these lists and for removing them.
10.5.2 Receiving IP Packets

Chapter dd-chapter described how Linux's network drivers built are into the kernel and initialized. This results in a series of device
data structures linked together in the dev_base list. Each device data structure describes its device and provides a set of callback
routines that the network protocol layers call when they need the network driver to perform work. These functions are mostly concerned
with transmitting data and with the network device's addresses. When a network device receives packets from its network it must
convert the received data into sk_buff data structures. These received sk_buff's are added onto the backlog queue by the
network drivers as they are received.
If the backlog queue grows too large, then the received sk_buff's are discarded. The network bottom half is flagged as ready to run
as there is work to do.
When the network bottom half handler is run by the scheduler it processes any network packets waiting to be transmitted before
processing the backlog queue of sk_buff's determining which protocol layer to pass the received packets to.
As the Linux networking layers were initialized, each protocol registered itself by adding a packet_type data structure onto either
the ptype_all list or into the ptype_base hash table. The packet_type data structure contains the protocol type, a pointer to a
network device, a pointer to the protocol's receive data processing routine and, finally, a pointer to the next packet_type data
structure in the list or hash chain. The ptype_all chain is used to snoop all packets being received from any network device and is
not normally used. The ptype_base hash table is hashed by protocol identifier and is used to decide which protocol should receive
the incoming network packet. The network bottom half matches the protocol types of incoming sk_buff's against one or more of the
packet_type entries in either table. The protocol may match more than one entry, for example when snooping all network traffic,
and in this case the sk_buff will be cloned. The sk_buff is passed to the matching protocol's handling routine.

10.5.3 Sending IP Packets
Packets are transmitted by applications exchanging data or else they are generated by the network protocols as they support established
connections or connections being established. Whichever way the data is generated, an sk_buff is built to contain the data and
various headers are added by the protocol layers as it passes through them.
The sk_buff needs to be passed to a network device to be transmitted. First though the protocol, for example IP, needs to decide
which network device to use. This depends on the best route for the packet. For computers connected by modem to a single network,
say via the PPP protocol, the routing choice is easy. The packet should either be sent to the local host via the loopback device or to the
gateway at the end of the PPP modem connection. For computers connected to an ethernet the choices are harder as there are many
computers connected to the network.
For every IP packet transmitted, IP uses the routing tables to resolve the route for the destination IP address. Each IP destination
successfully looked up in the routing tables returns a rtable
data structure describing the route to use. This includes the source IP address to use, the address of the network device data structure
and, sometimes, a prebuilt hardware header. This hardware header is network device specific and contains the source and destination
physical addresses and other media specific information. If the network device is an ethernet device, the hardware header would be as
shown in Figure 10.1 and the source and destination addresses would be physical ethernet addresses. The hardware header is cached
with the route because it must be appended to each IP packet transmitted on this route and constructing it takes time. The hardware
header may contain physical addresses that have to be resolved using the ARP protocol. In this case the outgoing packet is stalled until
the address has been resolved. Once it has been resolved and the hardware header built, the hardware header is cached so that future IP
packets sent using this interface do not have to ARP.
10.5.4 Data Fragmentation

Every network device has a maximum packet size and it cannot transmit or receive a data packet bigger than this. The IP protocol
allows for this and will fragment data into smaller units to fit into the packet size that the network device can handle. The IP protocol
header includes a fragment field which contains a flag and the fragment offset.
When an IP packet is ready to be transmited,
IP finds the network device to send the IP packet out on. This device is found from the IP routing tables. Each device has a field
describing its maximum transfer unit (in bytes), this is the mtu field. If the device's mtu is smaller than the packet size of the IP packet
that is waiting to be transmitted, then the IP packet must be broken down into smaller (mtu sized) fragments. Each fragment is
represented by an sk_buff; its IP header marked to show that it is a fragment and what offset into the data this IP packet contains. The
last packet is marked as being the last IP fragment. If, during the fragmentation, IP cannot allocate an sk_buff, the transmit will fail.
Receiving IP fragments is a little more difficult than sending them because the IP fragments can be received in any order and they must
all be received before they can be reassembled. Each time an IP packet is received it is checked to see if it is an IP fragment. The first
time that the fragment of a message is received, IP creates a new ipq data structure, and this is linked into the ipqueue list of IP
fragments awaiting recombination. As more IP fragments are received, the correct ipq data structure is found and a new ipfrag data
structure is created to describe this fragment. Each ipq data structure uniquely describes a fragmented IP receive frame with its source
and destination IP addresses, the upper layer protocol identifier and the identifier for this IP frame. When all of the fragments have been
received, they are combined into a single sk_buff and passed up to the next protocol level to be processed. Each ipq contains a timer
that is restarted each time a valid fragment is received. If this timer expires, the ipq data structure and its ipfrag's are dismantled and
the message is presumed to have been lost in transit. It is then up to the higher level protocols to retransmit the message.
10.6 The Address Resolution Protocol (ARP)

The Address Resolution Protocol's role is to provide translations of IP addresses into physical hardware addresses such as ethernet
addresses. IP needs this translation just before it passes the data (in the form of an sk_buff) to the device driver for transmission.
It performs various checks to see if this device needs a hardware header and, if it does, if the hardware header for the packet needs to be
rebuilt. Linux caches hardware headers to avoid frequent rebuilding of them. If the hardware header needs rebuilding, it calls the device
specific hardware header rebuilding routine. All ethernet devices use the same generic header rebuilding routine
which in turn uses the ARP services to translate the destination IP address into a physical address.
The ARP protocol itself is very simple and consists of two message types, an ARP request and an ARP reply. The ARP request contains
the IP address that needs translating and the reply (hopefully) contains the translated IP address, the hardware address. The ARP request

is broadcast to all hosts connected to the network, so, for an ethernet network, all of the machines connected to the ethernet will see the
ARP request. The machine that owns the IP address in the request will respond to the ARP request with an ARP reply containing its
own physical address.
The ARP protocol layer in Linux is built around a table of arp_table data structures which each describe an IP to physical address
translation. These entries are created as IP addresses need to be translated and removed as they become stale over time. Each
arp_table data structure has the following fields:
last used the time that this ARP entry was last used,
last updated the time that this ARP entry was last updated,
flags these describe this entry's state, if it is complete and so on,
IP address The IP address that this entry describes
hardware address The translated hardware address
hardware header This is a pointer to a cached hardware header,
timer This is a timer_list entry used to time out ARP requests
that do not get a response,
retries The number of times that this ARP request has been
retried,
sk_buff queue List of sk_buff entries waiting for this IP address
to be resolved
The ARP table consists of a table of pointers (the arp_tables vector) to chains of arp_table entries. The entries are cached to
speed up access to them, each entry is found by taking the last two bytes of its IP address to generate an index into the table and then
following the chain of entries until the correct one is found. Linux also caches prebuilt hardware headers off the arp_table entries in
the form of hh_cache data structures.
When an IP address translation is requested and there is no corresponding arp_table entry, ARP must send an ARP request message.
It creates a new arp_table entry in the table and queues the sk_buff containing the network packet that needs the address
translation on the sk_buff queue of the new entry. It sends out an ARP request and sets the ARP expiry timer running. If there is no
response then ARP will retry the request a number of times and if there is still no response ARP will remove the arp_table entry.
Any sk_buff data structures queued waiting for the IP address to be translated will be notified and it is up to the protocol layer that is
transmitting them to cope with this failure. UDP does not care about lost packets but TCP will attempt to retransmit on an established
TCP link. If the owner of the IP address responds with its hardware address, the arp_table entry is marked as complete and any
queued sk_buff's will be removed from the queue and will go on to be transmitted. The hardware address is written into the hardware
header of each sk_buff.
The ARP protocol layer must also respond to ARP requests that specfy its IP address. It registers its protocol type (ETH_P_ARP),
generating a packet_type data structure. This means that it will be passed all ARP packets that are received by the network devices.
As well as ARP replies, this includes ARP requests. It generates an ARP reply using the hardware address kept in the receiving device's
device data structure.
Network topologies can change over time and IP addresses can be reassigned to different hardware addresses. For example, some dial
up services assign an IP address as each connection is established. In order that the ARP table contains up to date entries, ARP runs a
periodic timer which looks through all of the arp_table entries to see which have timed out. It is very careful not to remove entries
that contain one or more cached hardware headers. Removing these entries is dangerous as other data structures rely on them. Some
arp_table entries are permanent and these are marked so that they will not be deallocated. The ARP table cannot be allowed to grow
too large; each arp_table entry consumes some kernel memory. Whenever the a new entry needs to be allocated and the ARP table
has reached its maximum size the table is pruned by searching out the oldest entries and removing them.
10.7 IP Routing
The IP routing function determines where to send IP packets destined for a particular IP address. There are many choices to be made
when transmitting IP packets. Can the destination be reached at all? If it can be reached, which network device should be used to
transmit it? If there is more than one network device that could be used to reach the destination, which is the better one? The IP routing
database maintains information that gives answers to these questions. There are two databases, the most important being the Forwarding
Information Database. This is an exhaustive list of known IP destinations and their best routes. A smaller and much faster database, the
route cache is used for quick lookups of routes for IP destinations. Like all caches, it must contain only the frequently accessed routes;
its contents are derived from the Forwarding Information Database.

Routes are added and deleted via IOCTL requests to the BSD socket interface. These are passed onto the protocol to process. The INET
protocol layer only allows processes with superuser privileges to add and delete IP routes. These routes can be fixed or they can be
dynamic and change over time. Most systems use fixed routes unless they themselves are routers. Routers run routing protocols which
constantly check on the availability of routes to all known IP destinations. Systems that are not routers are known as end systems. The
routing protocols are implemented as daemons, for example GATED, and they also add and delete routes via the IOCTL BSD socket
interface.
10.7.1 The Route Cache

Whenever an IP route is looked up, the route cache is first checked for a matching route. If there is no matching route in the route cache
the Forwarding Information Database is searched for a route. If no route can be found there, the IP packet will fail to be sent and the
application notified. If a route is in the Forwarding Information Database and not in the route cache, then a new entry is generated and
added into the route cache for this route. The route cache is a table (ip_rt_hash_table) that contains pointers to chains of
rtable data structures. The index into the route table is a hash function based on the least significant two bytes of the IP address.
These are the two bytes most likely to be different between destinations and provide the best spread of hash values. Each rtable entry
contains information about the route; the destination IP address, the network device to use to reach that IP address, the maximum size
of message that can be used and so on. It also has a reference count, a usage count and a timestamp of the last time that they were used
(in jiffies). The reference count is incremented each time the route is used to show the number of network connections using this
route. It is decremented as applications stop using the route. The usage count is incremented each time the route is looked up and is used
to order the rtable entry in its chain of hash entries. The last used timestamp for all of the entries in the route cache is periodically
checked to see if the rtable is too old
. If the route has not been recently used, it is discarded from the route cache. If routes are kept in the route cache they are ordered so that
the most used entries are at the front of the hash chains. This means that finding them will be quicker when routes are looked up.
10.7.2 The Forwarding Information Database

Figure 10.5: The Forwarding Information Database
The forwarding information database (shown in Figure 10.5 contains IP's view of the routes available to this system at this time. It is
quite a complicated data structure and, although it is reasonably efficiently arranged, it is not a quick database to consult. In particular it
would be very slow to look up destinations in this database for every IP packet transmitted. This is the reason that the route cache exists:
to speed up IP packet transmission using known good routes. The route cache is derived from the forwarding database and represents its
commonly used entries.
Each IP subnet is represented by a fib_zone data structure. All of these are pointed at from the fib_zones hash table. The hash
index is derived from the IP subnet mask. All routes to the same subnet are described by pairs of fib_node and fib_info data
structures queued onto the fz_list of each fib_zone data structure. If the number of routes in this subnet grows large, a hash table
is generated to make finding the fib_node data structures easier.
Several routes may exist to the same IP subnet and these routes can go through one of several gateways. The IP routing layer does not
allow more than one route to a subnet using the same gateway. In other words, if there are several routes to a subnet, then each route is
guaranteed to use a different gateway. Associated with each route is its metric. This is a measure of how advantagious this route is. A
route's metric is, essentially, the number of IP subnets that it must hop across before it reaches the destination subnet. The higher the
metric, the worse the route.
Footnotes:
1 National Science Foundation
2 Synchronous Read Only Memory
3 duh? What used for?


Chapter 16
Useful Web and FTP Sites
The following World Wide Web and ftp sites are useful:
http://www.azstarnet.com/ [\tilde]axplinux
This is David Mosberger-Tang's Alpha AXP Linux web site and it is the place to go for all of the
Alpha AXP HOWTOs. It also has a large number of pointers to Linux and Alpha AXP specific
information such as CPU data sheets.
http://www.redhat.com/
Red Hat's web site. This has a lot of useful pointers.
ftp://sunsite.unc.edu
This is the major site for a lot of free software. The Linux specific software is held in pub/Linux.
http://www.intel.com
Intel's web site and a good place to look for Intel chip information.
http://www.ssc.com/lj/index.html
The Linux Journal is a very good Linux magazine and well worth the yearly subscription for its
excellent articles.
http://www.blackdown.org/java-linux.html
This is the primary site for information on Java on Linux.
ftp://tsx-11.mit.edu/ [\tilde]ftp/pub/linux
MIT's Linux ftp site.
ftp://ftp.cs.helsinki.fi/pub/Software/Linux/Kernel
Linus's kernel sources.
http://www.linux.org.uk
The UK Linux User Group.
http://sunsite.unc.edu/mdw/linux.html
Home page for the Linux Documentation Project,
http://www.digital.com
Digital Equipment Corporation's main web page.
http://altavista.digital.com
http://ldp.iol.it/LDP/tlk/appendices/www.html (1 di 2) [08/03/2001 10.09.15]

DIGITAL's Altavista search engine. A very good place to search for information within the web
and news groups.
http://www.linuxhq.com
The Linux HQ web site holds up to date official and unoffical patches as well as advice and web
pointers that help you get the best set of kernel sources possible for your system.
http://www.amd.com
The AMD web site.
http://www.cyrix.com
Cyrix's web site.
http://www.arm.com
ARM's web site.

http://ldp.iol.it/LDP/tlk/appendices/www.html (2 di 2) [08/03/2001 10.09.15]

Chapter 15
Linux Data Structures
This appendix lists the major data structures that Linux uses and which are described in this book. They have been edited
slightly to fit the paper.
block_dev_struct
block_dev_struct data structures are used to register block devices as available for use by the buffer cache. They are
held together in the blk_dev vector.
struct blk_dev_struct {
void (*request_fn)(void);
struct request * current_request;
struct request plug;
struct tq_struct plug_tq;
};
buffer_head
The buffer_head data structure holds information about a block buffer in the buffer cache.
/* bh state bits */
#define BH_Uptodate 0 /* 1 if the buffer contains valid data */
#define BH_Dirty 1 /* 1 if the buffer is dirty */
#define BH_Lock 2 /* 1 if the buffer is locked */
#define BH_Req 3 /* 0 if the buffer has been invalidated */
#define BH_Touched 4 /* 1 if the buffer has been touched (aging) */
#define BH_Has_aged 5 /* 1 if the buffer has been aged (aging) */
#define BH_Protected 6 /* 1 if the buffer is protected */
#define BH_FreeOnIO 7 /* 1 to discard the buffer_head after IO */
struct buffer_head {
/* First cache line: */
unsigned long b_blocknr; /* block number */
kdev_t b_dev; /* device (B_FREE = free) */
kdev_t b_rdev; /* Real device */
unsigned long b_rsector; /* Real buffer location on disk */
struct buffer_head *b_next; /* Hash queue list */
struct buffer_head *b_this_page; /* circular list of buffers in one
page */
http://ldp.iol.it/LDP/tlk/ds/ds.html (1 di 18) [08/03/2001 10.09.17]

/* Second cache line: */
unsigned long b_state; /* buffer state bitmap (above) */
struct buffer_head *b_next_free;
unsigned int b_count; /* users using this block */
unsigned long b_size; /* block size */
/* Non-performance-critical data follows. */

char *b_data; /* pointer to data block */
unsigned int b_list; /* List that this buffer appears */
unsigned long b_flushtime; /* Time when this (dirty) buffer
* should be written */
unsigned long b_lru_time; /* Time when this buffer was
* last used. */
struct wait_queue *b_wait;
struct buffer_head *b_prev; /* doubly linked hash list */
struct buffer_head *b_prev_free; /* doubly linked list of buffers */
struct buffer_head *b_reqnext; /* request queue */
};
device
Every network device in the system is represented by a device data structure.
struct device
{
/*
* This is the first field of the "visible" part of this structure
* (i.e. as seen by users in the "Space.c" file). It is the name
* the interface.
*/
char *name;
/* I/O specific fields */

unsigned long rmem_end; /* shmem "recv" end */
unsigned long rmem_start; /* shmem "recv" start */
unsigned long mem_end; /* shared mem end */
unsigned long mem_start; /* shared mem start */
unsigned long base_addr; /* device I/O address */
unsigned char irq; /* device IRQ number */
/* Low-level status flags. */

volatile unsigned char start, /* start an operation */
interrupt; /* interrupt arrived */
unsigned long tbusy; /* transmitter busy */
struct device *next;
/* The device initialization function. Called only once. */

int (*init)(struct device *dev);
/* Some hardware also needs these fields, but they are not part of
the usual set specified in Space.c. */
unsigned char if_port; /* Selectable AUI,TP, */

unsigned char dma; /* DMA channel */
struct enet_statistics* (*get_stats)(struct device *dev);
/*
* This marks the end of the "visible" part of the structure. All
* fields hereafter are internal to the system, and may change at
* will (read: may be cleaned up at will).
*/
/* These may be needed for future network-power-down code. */

unsigned long trans_start; /* Time (jiffies) of
last transmit */
unsigned long last_rx; /* Time of last Rx */
unsigned short flags; /* interface flags (BSD)*/
unsigned short family; /* address family ID */
unsigned short metric; /* routing metric */
unsigned short mtu; /* MTU value */
unsigned short type; /* hardware type */
unsigned short hard_header_len; /* hardware hdr len */
void *priv; /* private data */
/* Interface address info. */

unsigned char broadcast[MAX_ADDR_LEN];
unsigned char pad;
unsigned char dev_addr[MAX_ADDR_LEN];
unsigned char addr_len; /* hardware addr len */
unsigned long pa_addr; /* protocol address */
unsigned long pa_brdaddr; /* protocol broadcast addr*/
unsigned long pa_dstaddr; /* protocol P-P other addr*/
unsigned long pa_mask; /* protocol netmask */
unsigned short pa_alen; /* protocol address len */
struct dev_mc_list *mc_list; /* M'cast mac addrs */

int mc_count; /* No installed mcasts */
struct ip_mc_list *ip_mc_list; /* IP m'cast filter chain */

__u32 tx_queue_len; /* Max frames per queue */
/* For load balancing driver pair support */

unsigned long pkt_queue; /* Packets queued */
struct device *slave; /* Slave device */
struct net_alias_info *alias_info; /* main dev alias info */
struct net_alias *my_alias; /* alias devs */
/* Pointer to the interface buffers. */

struct sk_buff_head buffs[DEV_NUMBUFFS];
/* Pointers to interface service routines. */

int (*open)(struct device *dev);
int (*stop)(struct device *dev);
int (*hard_start_xmit) (struct sk_buff *skb,
struct device *dev);

int (*hard_header) (struct sk_buff *skb,
struct device *dev,
unsigned short type,
void *daddr,
void *saddr,
unsigned len);
int (*rebuild_header)(void *eth,
struct device *dev,
unsigned long raddr,
struct sk_buff *skb);
void (*set_multicast_list)(struct device *dev);
int (*set_mac_address)(struct device *dev,
void *addr);
int (*do_ioctl)(struct device *dev,
struct ifreq *ifr,
int cmd);
int (*set_config)(struct device *dev,
struct ifmap *map);
void (*header_cache_bind)(struct hh_cache **hhp,
struct device *dev,
unsigned short htype,
__u32 daddr);
void (*header_cache_update)(struct hh_cache *hh,
struct device *dev,
unsigned char * haddr);
int (*change_mtu)(struct device *dev,
int new_mtu);
struct iw_statistics* (*get_wireless_stats)(struct device *dev);
};
device_struct
device_struct data structures are used to register character and block devices (they hold its name and the set of file
operations that can be used for this device). Each valid member of the chrdevs and blkdevs vectors represents a
character or block device respectively.
struct device_struct {
const char * name;
struct file_operations * fops;
};
file
Each open file, socket etcetera is represented by a file data structure.
struct file {
mode_t f_mode;
loff_t f_pos;
unsigned short f_flags;
unsigned short f_count;
unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
struct file *f_next, *f_prev;

int f_owner; /* pid or -pgrp where SIGIO should be sent */
struct inode * f_inode;
struct file_operations * f_op;
unsigned long f_version;
void *private_data; /* needed for tty driver, and maybe others */
};
files_struct
The files_struct data structure describes the files that a process has open.
struct files_struct {
int count;
fd_set close_on_exec;
fd_set open_fds;
struct file * fd[NR_OPEN];
};
fs_struct
struct fs_struct {
int count;
unsigned short umask;
struct inode * root, * pwd;
};
gendisk
The gendisk data structure holds information about a hard disk. They are used during initialization when the disks are
found and then probed for partitions.
struct hd_struct {
long start_sect;
long nr_sects;
};
struct gendisk {
int major; /* major number of driver */
const char *major_name; /* name of major driver */
int minor_shift; /* number of times minor is shifted to
get real minor */
int max_p; /* maximum partitions per device */
int max_nr; /* maximum number of real devices */
void (*init)(struct gendisk *);

/* Initialization called before we
do our thing */
struct hd_struct *part; /* partition table */
int *sizes; /* device size in blocks, copied to
blk_size[] */
int nr_real; /* number of real devices */

void *real_devices; /* internal use */
struct gendisk *next;
};
inode
The VFS inode data structure holds information about a file or directory on disk.
struct inode {
kdev_t i_dev;
unsigned long i_ino;
umode_t i_mode;
nlink_t i_nlink;
uid_t i_uid;
gid_t i_gid;
kdev_t i_rdev;
off_t i_size;
time_t i_atime;
time_t i_mtime;
time_t i_ctime;
unsigned long i_blksize;
unsigned long i_blocks;
unsigned long i_version;
unsigned long i_nrpages;
struct semaphore i_sem;
struct inode_operations *i_op;
struct super_block *i_sb;
struct wait_queue *i_wait;
struct file_lock *i_flock;
struct vm_area_struct *i_mmap;
struct page *i_pages;
struct dquot *i_dquot[MAXQUOTAS];
struct inode *i_next, *i_prev;
struct inode *i_hash_next, *i_hash_prev;
struct inode *i_bound_to, *i_bound_by;
struct inode *i_mount;
unsigned short i_count;
unsigned short i_flags;
unsigned char i_lock;
unsigned char i_dirt;
unsigned char i_pipe;
unsigned char i_sock;
unsigned char i_seek;
unsigned char i_update;
unsigned short i_writecount;
union {
struct pipe_inode_info pipe_i;
struct minix_inode_info minix_i;
struct ext_inode_info ext_i;
struct ext2_inode_info ext2_i;
struct hpfs_inode_info hpfs_i;
struct msdos_inode_info msdos_i;

struct umsdos_inode_info umsdos_i;
struct iso_inode_info isofs_i;
struct nfs_inode_info nfs_i;
struct xiafs_inode_info xiafs_i;
struct sysv_inode_info sysv_i;
struct affs_inode_info affs_i;
struct ufs_inode_info ufs_i;
struct socket socket_i;
void *generic_ip;
} u;
};
ipc_perm
The ipc_perm data structure describes the access permissions of a System V IPC object .
struct ipc_perm
{
key_t key;
ushort uid; /* owner euid and egid */
ushort gid;
ushort cuid; /* creator euid and egid */
ushort cgid;
ushort mode; /* access modes see mode flags below */
ushort seq; /* sequence number */
};
irqaction
The irqaction data structure is used to describe the system's interrupt handlers.
struct irqaction {
void (*handler)(int, void *, struct pt_regs *);
unsigned long flags;
unsigned long mask;
const char *name;
void *dev_id;
struct irqaction *next;
};
linux_binfmt
Each binary file format that Linux understands is represented by a linux_binfmt data structure.
struct linux_binfmt {
struct linux_binfmt * next;
long *use_count;
int (*load_binary)(struct linux_binprm *, struct pt_regs * regs);
int (*load_shlib)(int fd);
int (*core_dump)(long signr, struct pt_regs * regs);
};

mem_map_t
The mem_map_t data structure (also known as page) is used to hold information about each page of physical memory.
typedef struct page {

/* these must be first (free area handling) */
struct page *next;
struct page *prev;
struct inode *inode;
unsigned long offset;
struct page *next_hash;
atomic_t count;
unsigned flags; /* atomic flags, some possibly
updated asynchronously */
unsigned dirty:16,
age:8;
struct wait_queue *wait;
struct page *prev_hash;
struct buffer_head *buffers;
unsigned long swap_unlock_entry;
unsigned long map_nr; /* page->map_nr == page - mem_map */
} mem_map_t;
mm_struct
The mm_struct data structure is used to describe the virtual memory of a task or process.
struct mm_struct {
int count;
pgd_t * pgd;
unsigned long context;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack, start_mmap;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
struct vm_area_struct * mmap;
struct vm_area_struct * mmap_avl;
struct semaphore mmap_sem;
};
pci_bus
Every PCI bus in the system is represented by a pci_bus data structure.
struct pci_bus {
struct pci_bus *parent; /* parent bus this bridge is on */
struct pci_bus *children; /* chain of P2P bridges on this bus */
struct pci_bus *next; /* chain of all PCI buses */
struct pci_dev *self; /* bridge device as seen by parent */

struct pci_dev *devices; /* devices behind this bridge */

void *sysdata; /* hook for sys-specific extension */
unsigned char number; /* bus number */

unsigned char primary; /* number of primary bridge */
unsigned char secondary; /* number of secondary bridge */
unsigned char subordinate; /* max number of subordinate buses */
};
pci_dev
Every PCI device in the system, including PCI-PCI and PCI-ISA bridge devices is represented by a pci_dev data
structure.
/*
* There is one pci_dev structure for each slot-number/function-number
* combination:
*/
struct pci_dev {
struct pci_bus *bus; /* bus this device is on */
struct pci_dev *sibling; /* next device on this bus */
struct pci_dev *next; /* chain of all devices */
void *sysdata; /* hook for sys-specific extension */
unsigned int devfn; /* encoded device & function index */

unsigned short vendor;
unsigned short device;
unsigned int class; /* 3 bytes: (base,sub,prog-if) */
unsigned int master : 1; /* set if device is master capable */
/*
* In theory, the irq level can be read from configuration
* space and all would be fine. However, old PCI chips don't
* support these registers and return 0 instead. For example,
* the Vision864-P rev 0 chip can uses INTA, but returns 0 in
* the interrupt line and pin registers. pci_init()
* initializes this field with the value at PCI_INTERRUPT_LINE
* and it is the job of pcibios_fixup() to change it if
* necessary. The field must not be 0 unless the device
* cannot generate interrupts at all.
*/
unsigned char irq; /* irq generated by this device */
};
request
request data structures are used to make requests to the block devices in the system. The requests are always to read or
write blocks of data to or from the buffer cache.
struct request {
volatile int rq_status;
#define RQ_INACTIVE (-1)

#define RQ_ACTIVE 1
#define RQ_SCSI_BUSY 0xffff
#define RQ_SCSI_DONE 0xfffe
#define RQ_SCSI_DISCONNECTING 0xffe0
kdev_t rq_dev;
int cmd; /* READ or WRITE */
int errors;
unsigned long sector;
unsigned long nr_sectors;
unsigned long current_nr_sectors;
char * buffer;
struct semaphore * sem;
struct buffer_head * bh;
struct buffer_head * bhtail;
struct request * next;
};
rtable
Each rtable data structure holds information about the route to take in order to send packets to an IP host. rtable data
structures are used within the IP route cache.
struct rtable
{
struct rtable *rt_next;
__u32 rt_dst;
__u32 rt_src;
__u32 rt_gateway;
atomic_t rt_refcnt;
atomic_t rt_use;
unsigned long rt_window;
atomic_t rt_lastuse;
struct hh_cache *rt_hh;
struct device *rt_dev;
unsigned short rt_flags;
unsigned short rt_mtu;
unsigned short rt_irtt;
unsigned char rt_tos;
};
semaphore
Semaphores are used to protect critical data structures and regions of code. y
struct semaphore {
int count;
int waking;
int lock ; /* to make waking testing atomic */
struct wait_queue *wait;
};

sk_buff
The sk_buff data structure is used to describe network data as it moves between the layers of protocol.
struct sk_buff
{
struct sk_buff *next; /* Next buffer in list */
struct sk_buff *prev; /* Previous buffer in list */
struct sk_buff_head *list; /* List we are on */
int magic_debug_cookie;
struct sk_buff *link3; /* Link for IP protocol level buffer chains */
struct sock *sk; /* Socket we are owned by */
unsigned long when; /* used to compute rtt's */
struct timeval stamp; /* Time we arrived */
struct device *dev; /* Device we arrived on/are leaving by */
union
{
struct tcphdr *th;
struct ethhdr *eth;
struct iphdr *iph;
struct udphdr *uh;
unsigned char *raw;
/* for passing file handles in a unix domain socket */
void *filp;
} h;
union
{
/* As yet incomplete physical layer views */
unsigned char *raw;
struct ethhdr *ethernet;
} mac;
struct iphdr *ip_hdr; /* For IPPROTO_RAW */

unsigned long len; /* Length of actual data */
unsigned long csum; /* Checksum */
__u32 saddr; /* IP source address */
__u32 daddr; /* IP target address */
__u32 raddr; /* IP next hop address */
__u32 seq; /* TCP sequence number */
__u32 end_seq; /* seq [+ fin] [+ syn] + datalen */
__u32 ack_seq; /* TCP ack sequence number */
unsigned char proto_priv[16];
volatile char acked, /* Are we acked ? */
used, /* Are we in use ? */
free, /* How to free this buffer */
arp; /* Has IP/ARP resolution finished */
unsigned char tries, /* Times tried */
lock, /* Are we locked ? */
localroute, /* Local routing asserted for this frame */
pkt_type, /* Packet class */
pkt_bridged, /* Tracker for bridging */
ip_summed; /* Driver fed us an IP checksum */

#define PACKET_HOST 0 /* To us */
#define PACKET_BROADCAST 1 /* To all */
#define PACKET_MULTICAST 2 /* To group */
#define PACKET_OTHERHOST 3 /* To someone else */
unsigned short users; /* User count - see datagram.c,tcp.c */
unsigned short protocol; /* Packet protocol from driver. */
unsigned int truesize; /* Buffer size */
atomic_t count; /* reference count */
struct sk_buff *data_skb; /* Link to the actual data skb */
unsigned char *head; /* Head of buffer */
unsigned char *data; /* Data head pointer */
unsigned char *tail; /* Tail pointer */
unsigned char *end; /* End pointer */
void (*destructor)(struct sk_buff *); /* Destruct function */
__u16 redirport; /* Redirect port */
};
sock
Each sock data structure holds protocol specific information about a BSD socket. For example, for an INET (Internet
Address Domain) socket this data structure would hold all of the TCP/IP and UDP/IP specific information.
struct sock
{
/* This must be first. */
struct sock *sklist_next;
struct sock *sklist_prev;
struct options *opt;

atomic_t wmem_alloc;
atomic_t rmem_alloc;
unsigned long allocation; /* Allocation mode */
__u32 write_seq;
__u32 sent_seq;
__u32 acked_seq;
__u32 copied_seq;
__u32 rcv_ack_seq;
unsigned short rcv_ack_cnt; /* count of same ack */
__u32 window_seq;
__u32 fin_seq;
__u32 urg_seq;
__u32 urg_data;
__u32 syn_seq;
int users;
/* user count */
/*
* Not all are volatile, but some are, so we
* might as well say they all are.
*/
volatile char dead,
urginline,
intr,
blog,
done,

reuse,
keepopen,
linger,
delay_acks,
destroy,
ack_timed,
no_check,
zapped,
broadcast,
nonagle,
bsdism;
unsigned long lingertime;
int proc;
struct sock *next;

struct sock **pprev;
struct sock *bind_next;
struct sock **bind_pprev;
struct sock *pair;
int hashent;
struct sock *prev;
struct sk_buff *volatile send_head;
struct sk_buff *volatile send_next;
struct sk_buff *volatile send_tail;
struct sk_buff_head back_log;
struct sk_buff *partial;
struct timer_list partial_timer;
long retransmits;
struct sk_buff_head write_queue,
receive_queue;
struct proto *prot;
struct wait_queue **sleep;
__u32 daddr;
__u32 saddr; /* Sending source */
__u32 rcv_saddr; /* Bound address */
unsigned short max_unacked;
unsigned short window;
__u32 lastwin_seq; /* sequence number when we last
updated the window we offer */
__u32 high_seq; /* sequence number when we did
current fast retransmit */
volatile unsigned long ato; /* ack timeout */
volatile unsigned long lrcvtime; /* jiffies at last data rcv */
volatile unsigned long idletime; /* jiffies at last rcv */
unsigned int bytes_rcv;
/*
* mss is min(mtu, max_window)
*/
unsigned short mtu; /* mss negotiated in the syn's */
volatile unsigned short mss; /* current eff. mss - can change */
volatile unsigned short user_mss; /* mss requested by user in ioctl */
volatile unsigned short max_window;
unsigned long window_clamp;

unsigned int ssthresh;
unsigned short num;
volatile unsigned short cong_window;
volatile unsigned short cong_count;
volatile unsigned short packets_out;
volatile unsigned short shutdown;
volatile unsigned long rtt;
volatile unsigned long mdev;
volatile unsigned long rto;
volatile unsigned short backoff;

int err, err_soft; /* Soft holds errors that don't
cause failure but are the cause
of a persistent failure not
just 'timed out' */
unsigned char protocol;
volatile unsigned char state;
unsigned char ack_backlog;
unsigned char max_ack_backlog;
unsigned char priority;
unsigned char debug;
int rcvbuf;
int sndbuf;
unsigned short type;
unsigned char localroute; /* Route locally only */
/*
* This is where all the private (optional) areas that don't
* overlap will eventually live.
*/
union
{
struct unix_opt af_unix;
#if defined(CONFIG_ATALK) || defined(CONFIG_ATALK_MODULE)
struct atalk_sock af_at;
#endif
#if defined(CONFIG_IPX) || defined(CONFIG_IPX_MODULE)
struct ipx_opt af_ipx;
#endif
#ifdef CONFIG_INET
struct inet_packet_opt af_packet;
#ifdef CONFIG_NUTCP
struct tcp_opt af_tcp;
#endif
#endif
} protinfo;
/*
* IP 'private area'
*/
int ip_ttl; /* TTL setting */
int ip_tos; /* TOS */
struct tcphdr dummy_th;
struct timer_list keepalive_timer; /* TCP keepalive hack */
struct timer_list retransmit_timer; /* TCP retransmit timer */

struct timer_list delack_timer; /* TCP delayed ack timer */
int ip_xmit_timeout; /* Why the timeout is running */
struct rtable *ip_route_cache; /* Cached output route */
unsigned char ip_hdrincl; /* Include headers ? */
#ifdef CONFIG_IP_MULTICAST
int ip_mc_ttl; /* Multicasting TTL */
int ip_mc_loop; /* Loopback */
char ip_mc_name[MAX_ADDR_LEN]; /* Multicast device name */
struct ip_mc_socklist *ip_mc_list; /* Group array */
#endif
/*
* This part is used for the timeout functions (timer.c).
*/
int timeout; /* What are we waiting for? */
struct timer_list timer; /* This is the TIME_WAIT/receive
* timer when we are doing IP
*/
struct timeval stamp;
/*
* Identd
*/
struct socket *socket;
/*
* Callbacks
*/
void (*state_change)(struct sock *sk);
void (*data_ready)(struct sock *sk,int bytes);
void (*write_space)(struct sock *sk);
void (*error_report)(struct sock *sk);
};
socket
Each socket data structure holds information about a BSD socket. It does not exist independently; it is, instead, part of
the VFS inode data structure.
struct socket {
short type; /* SOCK_STREAM, ... */
socket_state state;
long flags;
struct proto_ops *ops; /* protocols do most everything */
void *data; /* protocol data */
struct socket *conn; /* server socket connected to */
struct socket *iconn; /* incomplete client conn.s */
struct socket *next;
struct wait_queue **wait; /* ptr to place to wait on */
struct inode *inode;
struct fasync_struct *fasync_list; /* Asynchronous wake up list */
struct file *file; /* File back pointer for gc */
};

task_struct
Each task_struct data structure describes a process or task in the system.
struct task_struct {
/* these are hardcoded - don't touch */
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
long counter;
long priority;
unsigned long signal;
unsigned long blocked; /* bitmap of masked signals */
unsigned long flags; /* per process flags, defined below */
int errno;
long debugreg[8]; /* Hardware debugging registers */
struct exec_domain *exec_domain;
/* various fields */
struct linux_binfmt *binfmt;
struct task_struct *next_task, *prev_task;
struct task_struct *next_run, *prev_run;
unsigned long saved_kernel_stack;
unsigned long kernel_stack_page;
int exit_code, exit_signal;
/* ??? */
unsigned long personality;
int dumpable:1;
int did_exec:1;
int pid;
int pgrp;
int tty_old_pgrp;
int session;
/* boolean value for session group leader */
int leader;
int groups[NGROUPS];
/*
* pointers to (original) parent process, youngest child, younger sibling,
* older sibling, respectively. (p->father can be replaced with
* p->p_pptr->pid)
*/
struct task_struct *p_opptr, *p_pptr, *p_cptr,
*p_ysptr, *p_osptr;
struct wait_queue *wait_chldexit;
unsigned short uid,euid,suid,fsuid;
unsigned short gid,egid,sgid,fsgid;
unsigned long timeout, policy, rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
long utime, stime, cutime, cstime, start_time;
/* mm fault and swap info: this can arguably be seen as either
mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
unsigned long swap_address;

unsigned long old_maj_flt; /* old value of maj_flt */
unsigned long dec_flt; /* page fault count of the last time */
unsigned long swap_cnt; /* number of pages to swap on next pass */
/* limits */
struct rlimit rlim[RLIM_NLIMITS];
unsigned short used_math;
char comm[16];
/* file system info */
int link_count;
struct tty_struct *tty; /* NULL if no tty */
/* ipc stuff */
struct sem_undo *semundo;
struct sem_queue *semsleeping;
/* ldt for this task - used by Wine. If NULL, default_ldt is used */
struct desc_struct *ldt;
/* tss for this task */
struct thread_struct tss;
/* filesystem information */
struct fs_struct *fs;
/* open file information */
struct files_struct *files;
/* memory management info */
struct mm_struct *mm;
/* signal handlers */
struct signal_struct *sig;
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth; /* Lock depth.
We can context switch in and out
of holding a syscall kernel lock... */
#endif
};
timer_list
timer_list data structure's are used to implement real time timers for processes.
struct timer_list {
struct timer_list *next;
struct timer_list *prev;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
};
tq_struct
Each task queue (tq_struct) data structure holds information about work that has been queued. This is usually a task
needed by a device driver but which does not have to be done immediately.
struct tq_struct {

struct tq_struct *next; /* linked list of active bh's */
int sync; /* must be initialized to zero */
void (*routine)(void *); /* function to call */
void *data; /* argument to function */
};
vm_area_struct
Each vm_area_struct data structure describes an area of virtual memory for a process.
struct vm_area_struct {
struct mm_struct * vm_mm; /* VM area parameters */
unsigned long vm_start;
unsigned long vm_end;
pgprot_t vm_page_prot;
unsigned short vm_flags;
/* AVL tree of VM areas per task, sorted by address */
short vm_avl_height;
struct vm_area_struct * vm_avl_left;
struct vm_area_struct * vm_avl_right;
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct * vm_next;
/* for areas with inode, the circular list inode->i_mmap */
/* for shm areas, the circular list of attaches */
/* otherwise unused */
struct vm_area_struct * vm_next_share;
struct vm_area_struct * vm_prev_share;
/* more */
struct vm_operations_struct * vm_ops;
unsigned long vm_offset;
struct inode * vm_inode;
unsigned long vm_pte; /* shared mem */
};


Chapter 17
Linux Documentation Project Manifesto
This is the Linux Documentation Project ``Manifesto''

Last Revision 21 September 1998, by Michael K. Johnson
This file describes the goals and current status of the Linux Documentation Project, including names of
projects, volunteers, FTP sites, and so on.
17.1 Overview
The Linux Documentation Project is working on developing good, reliable docs for the Linux operating
system. The overall goal of the LDP is to collaborate in taking care of all of the issues of Linux
documentation, ranging from online docs (man pages, texinfo docs, and so on) to printed manuals
covering topics such as installing, using, and running Linux. The LDP is essentially a loose team of
volunteers with little central organization; anyone who is interested in helping is welcome to join in the
effort. We feel that working together and agreeing on the direction and scope of Linux documentation is
the best way to go, to reduce problems with conflicting efforts-two people writing two books on the same
aspect of Linux wastes someone's time along the way.
The LDP is set out to produce the canonical set of Linux online and printed documentation. Because our
docs will be freely available (like software licensed under the terms of the GNU GPL) and distributed on
the net, we are able to easily update the documentation to stay on top of the many changes in the Linux
world. If you are interested in publishing any of the LDP works, see the section ``Publishing LDP
Manuals'', below.
17.2 Getting Involved

Send mail to linux-howto@metalab.unc.edu
Of course, you'll also need to get in touch with the coordinator of whatever LDP projects you're
interested in working on; see the next section.
http://ldp.iol.it/LDP/tlk/appendices/LDP-Manifesto.html (1 di 4) [08/03/2001 10.09.20]

17.3 Current Projects
For a list of current projects, see the LDP Homepage at
http://sunsite.unc.edu/LDP/ldp.html. The best way to get involved with one of these
projects is to pick up the current version of the manual and send revisions, editions, or suggestions to the
coordinator. You probably want to coordinate with the author before sending revisions so that you know
you are working together.
17.4 FTP sites for LDP works

LDP works can be found on sunsite.unc.edu in the directory /pub/Linux/docs. LDP manuals are found in
/pub/Linux/docs/LDP, HOWTOs and other documentation found in /pub/Linux/docs/HOWTO.
17.5 Documentation Conventions

Here are the conventions that are currently used by LDP manuals. If you are interested in writing another
manual using different conventions, please let us know of your plans first.
The man pages - the Unix standard for online manuals - are created with the Unix standard nroff man (or
BSD mdoc) macros.
The guides - full books produced by the LDP - have historically been done in LaTeX, as their primary
goal has been to printed documentation. However, guide authors have been moving towards SGML with
the DocBook DTD, because it allows them to create more different kinds of output, both printed and
on-line. If you use LaTeX, we have a style file you can use to keep your printed look consistent with
other LDP documents, and we suggest that you use it.
The HOWTO documents are all required to be in SGML format. Currently, they use the linuxdoc DTD,
which is quite simple. There is a move afoot to switch to the DocBook DTD over time.
LDP documents must be freely redistributable without fees paid to the authors. It is not required that the
text be modifiable, but it is encouraged. You can come up with your own license terms that satisfy this
constraint, or you can use a previously prepared license. The LDP provides a boilerplate license that you
can use, some people like to use the GPL, and others write their own.
The copyright for each manual should be in the name of the head writer or coordinator for the project.
``The Linux Documentation Project'' isn't a formal entity and shouldn't be used to copyright the docs.
17.6 Copyright and License

Here is a ``boilerplate'' license you may apply to your work. It has not been reviewed by a lawyer; feel
free to have your own lawyer review it (or your modification of it) for its applicability to your own
desires. Remember that in order for your document to be part of the LDP, you must allow unlimited
reproduction and distribution without fee.

This manual may be reproduced and distributed in whole or in part, without fee, subject to the following
conditions:
● The copyright notice above and this permission notice must be preserved complete on all complete
or partial copies.
● Any translation or derived work must be approved by the author in writing before distribution.
● If you distribute this work in part, instructions for obtaining the complete version of this manual
must be included, and a means for obtaining a complete version provided.
● Small portions may be reproduced as illustrations for reviews or quotes in other works without this
permission notice if proper citation is given.
Exceptions to these rules may be granted for academic purposes: Write to the author and ask. These
restrictions are here to protect us as authors, not to restrict you as learners and educators.
All source code in this document is placed under the GNU General Public License, available via
anonymous FTP from prep.ai.mit.edu:/pub/gnu/COPYING.
17.7 Publishing LDP Manuals

If you're a publishing company interested in distributing any of the LDP manuals, read on.
By the license requirements given previously, anyone is allowed to publish and distribute verbatim
copies of the Linux Documentation Project manuals. You don't need our explicit permission for this.
However, if you would like to distribute a translation or derivative work based on any of the LDP
manuals, you may need to obtain permission from the author, in writing, before doing so, if the license
requires that.
You may, of course, sell the LDP manuals for profit. We encourage you to do so. Keep in mind,
however, that because the LDP manuals are freely distributable, anyone may photocopy or distribute
printed copies free of charge, if they wish to do so.
We do not require to be paid royalties for any profit earned from selling LDP manuals. However, we
would like to suggest that if you do sell LDP manuals for profit, that you either offer the author royalties,
or donate a portion of your earnings to the author, the LDP as a whole, or to the Linux development
community. You may also wish to send one or more free copies of the LDP manuals that you are
distributing to the authors. Your show of support for the LDP and the Linux community will be very
much appreciated.
We would like to be informed of any plans to publish or distribute LDP manuals, just so we know how
they're becoming available. If you are publishing or planning to publish any LDP manuals, please send
mail to ldp-l@linux.org.au. It's nice to know who's doing what.
We encourage Linux software distributors to distribute the LDP manuals (such as the Installation and
Getting Started Guide) with their software. The LDP manuals are intended to be used as the öfficial"
Linux documentation, and we are glad to see mail-order distributors bundling the LDP manuals with the
software. As the LDP manuals mature, hopefully they will fulfill this goal more and more adequately.



Chapter 18
The GNU General Public License
Printed below is the GNU General Public License (the GPL or copyleft), under which Linux is licensed.
It is reproduced here to clear up some of the confusion about Linux's copyright status-Linux is not
shareware, and it is not in the public domain. The bulk of the Linux kernel is copyright © 1993 by Linus
Torvalds, and other software and parts of the kernel are copyrighted by their authors. Thus, Linux is
copyrighted, however, you may redistribute it under the terms of the GPL printed below.
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 675 Mass Ave, Cambridge, MA 02139, USA
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is
not allowed.
18.1 Preamble
The licenses for most software are designed to take away your freedom to share and change it. By
contrast, the GNU General Public License is intended to guarantee your freedom to share and change
free software-to make sure the software is free for all its users. This General Public License applies to
most of the Free Software Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by the GNU Library General Public
License instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are
designed to make sure that you have the freedom to distribute copies of free software (and charge for this
service if you wish), that you receive source code or can get it if you want it, that you can change the
http://ldp.iol.it/LDP/tlk/appendices/gpl.html (1 di 7) [08/03/2001 10.09.21]

software or use pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask
you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute
copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the
recipients all the rights that you have. You must make sure that they, too, receive or can get the source
code. And you must show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which
gives you legal permission to copy, distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain that everyone understands that there
is no warranty for this free software. If the software is modified by someone else and passed on, we want
its recipients to know that what they have is not the original, so that any problems introduced by others
will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that
redistributors of a free program will individually obtain patent licenses, in effect making the program
proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free
use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
18.2 Terms and Conditions for Copying,

Distribution, and Modification
1. [0.] This License applies to any program or other work which contains a notice placed by the
copyright holder saying it may be distributed under the terms of this General Public License. The
``Program'', below, refers to any such program or work, and a ``work based on the Program''
means either the Program or any derivative work under copyright law: that is to say, a work
containing the Program or a portion of it, either verbatim or with modifications and/or translated
into another language. (Hereinafter, translation is included without limitation in the term
``modification''.) Each licensee is addressed as ``you''.
Activities other than copying, distribution and modification are not covered by this License; they
are outside its scope. The act of running the Program is not restricted, and the output from the
Program is covered only if its contents constitute a work based on the Program (independent of
having been made by running the Program). Whether that is true depends on what the Program
does.
2. [1.] You may copy and distribute verbatim copies of the Program's source code as you receive it,
in any medium, provided that you conspicuously and appropriately publish on each copy an
appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this
License and to the absence of any warranty; and give any other recipients of the Program a copy of
this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer
warranty protection in exchange for a fee.
3. [2.] You may modify your copy or copies of the Program or any portion of it, thus forming a work
based on the Program, and copy and distribute such modifications or work under the terms of
Section 1 above, provided that you also meet all of these conditions:
. [a.] You must cause the modified files to carry prominent notices stating that you changed
the files and the date of any change.
b. [b.] You must cause any work that you distribute or publish, that in whole or in part contains
or is derived from the Program or any part thereof, to be licensed as a whole at no charge to
all third parties under the terms of this License.
c. [c.] If the modified program normally reads commands interactively when run, you must
cause it, when started running for such interactive use in the most ordinary way, to print or
display an announcement including an appropriate copyright notice and a notice that there is
no warranty (or else, saying that you provide a warranty) and that users may redistribute the
program under these conditions, and telling the user how to view a copy of this License.
(Exception: if the Program itself is interactive but does not normally print such an
announcement, your work based on the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If identifiable sections of that work are
not derived from the Program, and can be reasonably considered independent and separate works
in themselves, then this License, and its terms, do not apply to those sections when you distribute
them as separate works. But when you distribute the same sections as part of a whole which is a
work based on the Program, the distribution of the whole must be on the terms of this License,
whose permissions for other licensees extend to the entire whole, and thus to each and every part
regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights to work written
entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program with the Program (or with
a work based on the Program) on a volume of a storage or distribution medium does not bring the
other work under the scope of this License.
4. [3.] You may copy and distribute the Program (or a work based on it, under Section 2) in object
code or executable form under the terms of Sections 1 and 2 above provided that you also do one
of the following:
. [a.] Accompany it with the complete corresponding machine-readable source code, which
must be distributed under the terms of Sections 1 and 2 above on a medium customarily
used for software interchange; or,
b. [b.] Accompany it with a written offer, valid for at least three years, to give any third party,
for a charge no more than your cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be distributed under the terms
of Sections 1 and 2 above on a medium customarily used for software interchange; or,
c. [c.] Accompany it with the information you received as to the offer to distribute
corresponding source code. (This alternative is allowed only for noncommercial distribution

and only if you received the program in object code or executable form with such an offer,
in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it.
For an executable work, complete source code means all the source code for all modules it
contains, plus any associated interface definition files, plus the scripts used to control compilation
and installation of the executable. However, as a special exception, the source code distributed
need not include anything that is normally distributed (in either source or binary form) with the
major components (compiler, kernel, and so on) of the operating system on which the executable
runs, unless that component itself accompanies the executable.
If distribution of executable or object code is made by offering access to copy from a designated
place, then offering equivalent access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not compelled to copy the source
along with the object code.
5. [4.] You may not copy, modify, sublicense, or distribute the Program except as expressly provided
under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License. However, parties who have
received copies, or rights, from you under this License will not have their licenses terminated so
long as such parties remain in full compliance.
6. [5.] You are not required to accept this License, since you have not signed it. However, nothing
else grants you permission to modify or distribute the Program or its derivative works. These
actions are prohibited by law if you do not accept this License. Therefore, by modifying or
distributing the Program (or any work based on the Program), you indicate your acceptance of this
License to do so, and all its terms and conditions for copying, distributing or modifying the
Program or works based on it.
7. [6.] Each time you redistribute the Program (or any work based on the Program), the recipient
automatically receives a license from the original licensor to copy, distribute or modify the
Program subject to these terms and conditions. You may not impose any further restrictions on the
recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance
by third parties to this License.
8. [7.] If, as a consequence of a court judgment or allegation of patent infringement or for any other
reason (not limited to patent issues), conditions are imposed on you (whether by court order,
agreement or otherwise) that contradict the conditions of this License, they do not excuse you from
the conditions of this License. If you cannot distribute so as to satisfy simultaneously your
obligations under this License and any other pertinent obligations, then as a consequence you may
not distribute the Program at all. For example, if a patent license would not permit royalty-free
redistribution of the Program by all those who receive copies directly or indirectly through you,
then the only way you could satisfy both it and this License would be to refrain entirely from
distribution of the Program.
If any portion of this section is held invalid or unenforceable under any particular circumstance,
the balance of the section is intended to apply and the section as a whole is intended to apply in
other circumstances.
It is not the purpose of this section to induce you to infringe any patents or other property right

claims or to contest validity of any such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is implemented by public license
practices. Many people have made generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that system; it is up to the author/donor
to decide if he or she is willing to distribute software through any other system and a licensee
cannot impose that choice.
This section is intended to make thoroughly clear what is believed to be a consequence of the rest
of this License.
9. [8.] If the distribution and/or use of the Program is restricted in certain countries either by patents
or by copyrighted interfaces, the original copyright holder who places the Program under this
License may add an explicit geographical distribution limitation excluding those countries, so that
distribution is permitted only in or among countries not thus excluded. In such case, this License
incorporates the limitation as if written in the body of this License.
10. [9.] The Free Software Foundation may publish revised and/or new versions of the General Public
License from time to time. Such new versions will be similar in spirit to the present version, but
may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies a version number
of this License which applies to it and `àny later version'', you have the option of following the
terms and conditions either of that version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of this License, you may choose any
version ever published by the Free Software Foundation.
11. [10.] If you wish to incorporate parts of the Program into other free programs whose distribution
conditions are different, write to the author to ask for permission. For software which is
copyrighted by the Free Software Foundation, write to the Free Software Foundation; we
sometimes make exceptions for this. Our decision will be guided by the two goals of preserving
the free status of all derivatives of our free software and of promoting the sharing and reuse of
software generally.
NO WARRANTY
12. [11.] BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO

WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE
LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS
AND/OR OTHER PARTIES PROVIDE THE PROGRAM `ÀS IS'' WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND
PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR
CORRECTION.

13. [12.] IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY
MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE
TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
END OF TERMS AND CONDITIONS
18.3 Appendix: How to Apply These Terms to Your

New Programs
If you develop a new program, and you want it to be of the greatest possible use to the public, the best
way to achieve this is to make it free software which everyone can redistribute and change under these
terms.
To do so, attach the following notices to the program. It is safest to attach them to the start of each source
file to most effectively convey the exclusion of warranty; and each file should have at least the
``copyright'' line and a pointer to where the full notice is found.
Copyright © 19yy This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this
program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge,
MA 02139, USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision comes with ABSOLUTELY
NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate parts of the General
Public License. Of course, the commands you use may be called something other than `show w' and
`show c'; they could even be mouse-clicks or menu items-whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a
``copyright disclaimer'' for the program, if necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision'
(which makes passes at compilers) written by James Hacker.
, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program into proprietary programs. If
your program is a subroutine library, you may consider it more useful to permit linking proprietary
applications with the library. If this is what you want to do, use the GNU Library General Public License
instead of this License.


Direct Memory Access.
ELF
Executable and Linkable Format. This object file format designed by the Unix System
Laboratories is now firmly established as the most commonly used format in Linux.
EIDE
Extended IDE.
Executable image
A structured file containing machine instructions and data. This file can be loaded into a process's
virtual memory and executed. See also program.
Function
A piece of software that performs an action. For example, returning the bigger of two numbers.
IDE
Integrated Disk Electronics.
Image
See executable image.
IP
Internet Protocol.
IPC
Interprocess Communiction.
Interface
A standard way of calling routines and passing data structures. For example, the interface between
two layers of code might be expressed in terms of routines that pass and return a particular data
structure. Linux's VFS is a good example of an interface.
IRQ
Interrupt Request Queue.
ISA
Industry Standard Architecture. This is a standard, although now rather dated, data bus interface
for system components such as floppy disk drivers.
Kernel Module
A dynamically loaded kernel function such as a filesystem or a device driver.
Kilobyte
A thousand bytes of data, often written as Kbyte,
Megabyte
A million bytes of data, often written as Mbyte,
Microprocessor
A very integrated CPU. Most modern CPUs are Microprocessors.
Module
A file containing CPU instructions in the form of either assembly language instructions or a high
http://ldp.iol.it/LDP/tlk/appendices/glossary.html (2 di 4) [08/03/2001 10.09.22]

level language like C.
Object file
A file containing machine code and data that has not yet been linked with other object files or
libraries to become an executable image.
Page
Physical memory is divided up into equal sized pages.
Pointer
A location in memory that contains the address of another location in memory,
Process
This is an entity which can execute programs. A process could be thought of as a program in
action.
Processor
Short for Microprocessor, equivalent to CPU.
PCI
Peripheral Component Interconnect. A standard describing how the peripheral components of a
computer system may be connected together.
Peripheral
An intelligent processor that does work on behalf of the system's CPU. For example, an IDE
controller chip,
Program
A coherent set of CPU instructions that performs a task, such as printing ``hello world''. See also
executable image.
Protocol
A protocol is a networking language used to transfer application data between two cooperating
processes or network layers.
Register
A location within a chip, used to store information or instructions.
Register File
The set of registers in a processor.
RISC
Reduced Instruction Set Computer. The opposite of CISC, that is a processor with a small number
of assembly instructions, each of which performs simple operations. The ARM and Alpha
processors are both RISC architectures.
Routine
Similar to a function except that, strictly speaking, routines do not return values.
SCSI
Small Computer Systems Interface.
Shell

This is a program which acts as an interface between the operating system and a human user. Also
called a command shell, the most commonly used shell in Linux is the bash shell.
SMP
Symmetrical multiprocessing. Systems with more than one processor which fairly share the work
amongst those processors.
Socket
A socket represents one end of a network connection, Linux supports the BSD Socket interface.
Software
CPU instructions (both assembler and high level languages like C) and data. Mostly
interchangable with Program.
System V
A variant of Unix TM produced in 1983, which included, amongst other things, System V IPC
mechanisms.
TCP
Transmission Control Protocol.
Task Queue
A mechanism for deferring work in the Linux kernel.
UDP
User Datagram Protocol.
Virtual memory
A hardware and software mechanism for making the physical memory in a system appear larger
than it actually is.


Tour of the Linux kernel source

By Alessandro Rubini, rubini@pop.systemy.it
This chapter tries to explain the Linux source code in an orderly manner, trying to help the reader to
achieve a good understanding of how the source code is laid out and how the most relevant unix features
are implemented. The target is to help the experienced C programmer who is not accustomed to Linux in
getting familiar with the overall Linux design. That's why the chosen entry point for the kernel tour is the
kernel own entry point: system boot.
A good understanding of C language is required to understand this material, as well as some familiarity
with both Unix concepts and the PC architecture. However, no C code will appear in this chapter, but
rather pointers to the actual code. The finest issues of kernel design are explained in other chapters of this
guide, while this chapter tends to remain an informal overview.
Any pathname for files referenced in this chapter is referred to the main source-tree directory, usually
/usr/src/linux.
Most of the information reported here is taken from the source code of Linux release 1.0.
Nonetheless, references to later versions are provided at times. Any paragraph within the tour with the
image in front of it is meant to underline changes the kernel has undergone after the 1.0 release. If
no such paragraph is present, then no changes occurred up to release 1.0.9-1.1.76.
Sometimes a paragraph like this occurs in the text. It is a pointer to the right sources to get more
information on the subject just covered. Needless to say, the source is the primary source.
Booting the system
When the PC is powered up, the 80x86 processor finds itself in real mode and executes the code at
address 0xFFFF0, which corresponds to a ROM-BIOS address. The PC BIOS performs some tests on the
system and initializes the interrupt vector at physical address 0. After that it loads the first sector of a
bootable device to 0x7C00, and jumps to it. The device is usually the floppy or the hard drive. The
preceding description is quite a simplified one, but it's all that's needed to understand the kernel initial
workings.
The very first part of the Linux kernel is written in 8086 assembly language (boot/bootsect.S). When run,
it moves itself to absolute address 0x90000, loads the next 2 kBytes of code from the boot device to
address 0x90200, and the rest of the kernel to address 0x10000. The message ``Loading...'' is
displayed during system load. Control is then passed to the code in boot/Setup.S, another real-mode
assembly source.
The setup portion identifies some features of the host system and the type of vga board. If requested to, it
asks the user to choose the video mode for the console. It then moves the whole system from address
0x10000 to address 0x1000, enters protected mode and jumps to the rest of the system (at 0x1000).
The next step is kernel decompression. The code at 0x1000 comes from zBoot/head.S which initializes
http://ldp.iol.it/LDP/khg/HyperNews/get/tour/tour.html (1 di 11) [08/03/2001 10.09.26]

registers and invokes decompress_kernel(), which in turn is made up of zBoot/inflate.c,

zBoot/unzip.c and zBoot/misc.c. The decompressed data goes to address 0x100000 (1 Meg), and this is
the main reason why Linux can't run with less than 2 megs ram. [It's been done in 1 MB with
uncompressed kernels; see Memory Savers--ED]
Encapsulation of the kernel in a gzip file is accomplished by Makefile and utilities in the zBoot
directory. They are interesting files to look at.
Kernel release 1.1.75 moved the boot and zBoot directories down to arch/i386/boot. This change is
meant to allow true kernel builds for different architectures. Nonetheless, I'll stick to i386-specific
information.
Decompressed code is executed at address 0x1010000 [Maybe I've lost track of physical addresses,
here, as I don't know very well gas source code], where all the 32-bit setup is accomplished: IDT,
GDT and LDT are loaded, the processor and coprocessor are identified, and paging is setup; eventually,
the routine start_kernel is invoked. The source for the above operations is in boot/head.S. It is
probably the trickiest code in the whole kernel.
Note that if an error occurs during any of the preceding steps, the computer will lockup. The OS can't
deal with errors when it isn't yet fully operative.
start_kernel() resides in init/main.c, and never returns. Anything from now on is coded in C
language, left aside interrupt management and system call enter/leave (well, most of the macros embed
assembly code, too).
Spinning the wheel
After dealing with all the tricky questions, start_kernel() initializes all the parts of the kernel,
specifically:
● Sets the memory bounds and calls paging_init().
● Initializes the traps, IRQ channels and scheduling.
● Parses the command line.
● If requested to, allocates a profiling buffer.
● Initializes all the device drivers and disk buffering, as well as other minor parts.
● Calibrates the delay loop (computes the ``BogoMips'' number).
● Checks if interrupt 16 works with the coprocessor.
Finally, the kernel is ready to move_to_user_mode(), in order to fork the init process, whose
code is in the same source file. Process number 0 then, the so-called idle task, keeps running in an
infinite idle loop.
The init process tries to execute /etc/init, or /bin/init, or /sbin/init.
If none of them succeeds, code is provided to execute ``/bin/sh /etc/rc'' and fork a root shell on the first
terminal. This code dates back to Linux 0.01, when the OS was made by the kernel alone, and no login
process was available.

After exec()ing the init program from one of the standard places (let's assume we have one of them),
the kernel has no direct control on the program flow. Its role, from now on is to provide processes with
system calls, as well as servicing asynchronous events (such as hardware interrupts). Multitasking has
been setup, and it is now init which manages multiuser access by fork()ing system daemons and login
processes.
Being the kernel in charge of providing services, the tour will proceed by looking at those services (the
``system calls''), as well as by providing general ideas about the underlying data structures and code
organization.
How the kernel sees a process
From the kernel point of view, a process is an entry in the process table. Nothing more.
The process table, then, is one of the most important data structures within the system, together with the
memory-management tables and the buffer cache. The individual item in the process table is the
task_struct structure, quite a huge one, defined in include/linux/sched.h. Within the
task_struct both low-level and high-level information is kept--ranging from the copy of some
hardware registers to the inode of the working directory for the process.
The process table is both an array and a double-linked list, as well as a tree. The physical implementation
is a static array of pointers, whose length is NR_TASKS, a constant defined in include/linux/tasks.h, and
each structure resides in a reserved memory page. The list structure is achieved through the pointers
next_task and prev_task, while the tree structure is quite complex and will not be described here.
You may wish to change NR_TASKS from the default vaue of 128, but be sure to have proper
dependency files to force recompilation of all the source files involved.
After booting is over, the kernel is always working on behalf of one of the processes, and the global
variable current, a pointer to a task_struct item, is used to record the running one. current is
only changed by the scheduler, in kernel/sched.c. When, however, all procecces must be looked at, the
macro for_each_task is used. It is conderably faster than a sequential scan of the array, when the
system is lightly loaded.
A process is always running in either `ùser mode'' or ``kernel mode''. The main body of a user program
is executed in user mode and system calls are executed in kernel mode. The stack used by the process in
the two execution modes is different--a conventional stack segment is used for user mode, while a
fixed-size stack (one page, owned by the process) is used in kernel mode. The kernel stack page is never
swapped out, because it must be available whenever a system call is entered.
System calls, within the kernel, exist as C language functions, their òfficial' name being prefixed by
`sys_'. A system call named, for example, burnout invokes the kernel function sys_burnout().
The system call mechanism is described in chapter 3 of this guide. Looking at for_each_task
and SET_LINKS, in include/linux/sched.h can help understanding the list and tree structures in the
process table.
Creating and destroying processes

A unix system creates a process though the fork() system call, and process termination is performed
either by exit() or by receiving a signal. The Linux implementation for them resides in kernel/fork.c
and kernel/exit.c.
Forking is easy, and fork.c is short and ready understandable. Its main task is filling the data structure for
the new process. Relevant steps, apart from filling fields, are:
● getting a free page to hold the task_struct
● finding an empty process slot (find_empty_process())
● getting another free page for the kernel_stack_page
● copying the father's LDT to the child
● duplicating mmap information of the father
sys_fork() also manages file descriptors and inodes.

The 1.0 kernel offers some vestigial support to threading, and the fork() system call shows some
hints to that. Kernel threads is work-in-progress outside the mainstream kernel.
Exiting from a process is trickier, because the parent process must be notified about any child who exits.
Moreover, a process can exit by being kill()ed by another process (these are Unix features). The file
exit.c is therefore the home of sys_kill() and the vairious flavours of sys_wait(), in addition to
sys_exit().
The code belonging to exit.c is not described here--it is not that interesting. It deals with a lot of details in
order to leave the system in a consistent state. The POSIX standard, then, is quite demanding about
signals, and it must be dealt with.
Executing programs
After fork()ing, two copies of the same program are running. One of them usually exec()s another
program. The exec() system call must locate the binary image of the executable file, load and run it.
The word `load' doesn't necessarily mean ``copy in memory the binary image'', as Linux supports
demand loading.
The Linux implementation of exec() supports different binary formats. This is accomplished through
the linux_binfmt structure, which embeds two pointers to functions--one to load the executable and
the other to load the library, each binary format representing both the executable and the library. Loading
of shared libraries is implemented in the same source file as exec() is, but let's stick to exec() itself.
The Unix systems provide the programmer with six flavours of the exec() function. All but one of
them can be implemented as library functions, and theLinux kernel implements sys_execve() alone.
It performs quite a simple task: loading the head of the executable, and trying to execute it. If the first
two bytes are ``#!'', then the first line is parsed and an interpreter is invoked, otherwise the registered
binary formats are sequentially tried.
The native Linux format is supported directly within fs/exec.c, and the relevant functions are
load_aout_binary and load_aout_library. As for the binaries, the function loading an
`à.out'' executable ends up either in mmap()ing the disk file, or in calling read_exec(). The former
way uses the Linux demand loading mechanism to fault-in program pages when they're accessed, while

the latter way is used when memory mapping is not supported by the host filesystem (for example the
``msdos'' filesystem).
Late 1.1 kernels embed a revised msdos filesystem, which supports mmap(). Moreover, the
struct linux_binfmt is a linked list rather than an array, to allow loading a new binary format as
a kernel module. Finally, the structure itself has been extended to access format-related core-dump
routines.
Accessing filesystems
It is well known that the filesystem is the most basic resource in a Unix system, so basic and ubiquitous
that it needs a more handy name--I'll stick to the standard practice of calling it simply ``fs''.
I'll assume the reader already knows the basic Unix fs ideas--access permissions, inodes, the superblock,
mounting and umounting. Those concepts are well explained by smarter authors than me within the
standard Unix literature, so I won't duplicate their efforts and I'll stick to Linux specific issues.
While the first Unices used to support a single fs type, whose structure was widespread in the whole
kernel, today's practice is to use a standardized interface between the kernel and the fs, in order to ease
data interchange across architectures. Linux itself provides a standardized layer to pass information
between the kernel and each fs module. This interface layer is called VFS, for ``virtual filesystem''.
Filesystem code is therefore split into two layers: the upper layer is concerned with the management of
kernel tables and data structures, while the lower layer is made up of the set of fs-dependent functions,
and is invoked through the VFS data structures. All the fs-independent material resides in the fs/*.c files.
They address the following issues:
● Managing the buffer chache (buffer.c);
● Responding to the fcntl() and ioctl() system calls (fcntl.c and ioctl.c);
● Mapping pipes and fifos on inodes and buffers (fifo.c, pipe.c);
● Managing file- and inode-tables (file_table.c, inode.c);
● Locking and unlocking files and records (locks.c);
● Mapping names to inodes (namei.c, open.c);
● Implementing the tricky select() function (select.c);
● Providing information (stat.c);
● mounting and umounting filesystems (super.c);
● exec()ing executables and dumping cores (exec.c);
● Loading the various binary formats (bin_fmt*.c, as outlined above).
The VFS interface, then, consists of a set of relatively high-level operations which are invoked from the
fs-independent code and are actually performed by each filesystem type. The most relevant structures are
inode_operations and file_operations, though they're not alone: other structures exist as
well. All of them are defined within include/linux/fs.h.
The kernel entry point to the actual file system is the structure file_system_type. An array of
file_system_types is embodied within fs/filesystems.c and it is referenced whenever a mount is

issued. The function read_super for the relevant fs type is then in charge of filling a struct
super_block item, which in turn embeds a struct super_operations and a struct
type_sb_info. The former provides pointers to generic fs operations for the current fs-type, the latter
embeds specific information for the fs-type.
The array of filesystem types has been turned in a linked list, to allow loading new fs types as
kernel modules. The function (un-)register_filesystem is coded within fs/super.c.
Quick Anatomy of a Filesystem Type
The role of a filesystem type is to perform the low-level tasks used to map the relatively high level VFS
operations on the physical media (disks, network or whatever). The VFS interface is flexible enough to
allow support for both conventional Unix filesystems and exotic situations such as the msdos and umsdos
types.
Each fs-type is made up of the following items, in addition to its own directory:
● An entry in the file_systems[] array (fs/filesystems.c);
● The superblock include file (include/linux/type_fs_sb.h);
● The inode include file (include/linux/type_fs_i.h);
● The generic own include file (include/linux/type_fs.h});
● Two #include lines within include/linux/fs.h, as well as the entries in struct

super_block and struct inode.
The own directory for the fs type contains all the real code, responsible of inode and data management.
The chapter about procfs in this guide uncovers all the details about low-level code and VFS
interface for that fs type. Source code in fs/procfs is quite understandable after reading the chapter.
We'll now look at the internal workings of the VFS mechanism, and the minix filesystem source is used
as a working example. I chose the minix type because it is small but complete; moreover, any other fs
type in Linux derives from the minix one. The ext2 type, the de-facto standard in recent Linux
installations, is much more complex than that and its exploration is left as an exercise for the smart
reader.
When a minix-fs is mounted, minix_read_super fills the super_block structure with data read
from the mounted device. The s_op field of the structure will then hold a pointer to minix_sops,
which is used by the generic filesystem code to dispatch superblock operations.
Chaining the newly mounted fs in the global system tree relies on the following data items (assuming sb
is the super_block structure and dir_i points to the inode for the mount point):
● sb->s_mounted points to the root-dir inode of the mounted filesystem (MINIX_ROOT_INO);
● dir_i->i_mount holds sb->s_mounted;
● sb->s_covered holds dir_i
Umounting will eventually be performed by do_umount, which in turn invokes minix_put_super.

Whenever a file is accessed, minix_read_inode comes into play; it fills the system-wide inode

structure with fields coming form minix_inode. The inode->i_op field is filled according to
inode->i_mode and it is responsible for any further operation on the file. The source for the minix
functions just described are to be found in fs/minix/inode.c.
The inode_operations structure is used to dispatch inode operations (you guessed it) to the fs-type
specific kernel functions; the first entry in the structure is a pointer to a file_operations item,
which is the data-management equivalent of i_op. The minix fs-type allows three instances of
inode-operation sets (for direcotries, for files and for symbolic links) and two instances of file-operation
sets (symlinks don't need one).
Directory operations (minix_readdir alone) are to be found in fs/minix/dir.c; file operations (read
and write) appear within fs/minix/file.c and symlink operations (reading and following the link) in
fs/minix/symlink.c.
The rest of the minix directory implements the following tasks:
● bitmap.c manages allocation and freeing of inodes and blocks (the ext2 fs, otherwise, has two
different source files);
● fsynk.c is responsible for the fsync() system calls--it manages direct, indirect and double
indirect blocks (I assume you know about them, it's common Unix knowledge);
● namei.c embeds all the name-related inode operations, such as creating and destroying nodes,
renaming and linking;
● truncate.c performs truncation of files.
The console driver
Being the main I/O device on most Linux boxes, the console driver deserves some attention. The source
code related to the console, as well as the other character drivers, is to be found in drivers/char, and we'll
use this very directory as our referenece point when naming files.
Console initialization is performed by the function tty_init(), in tty_io.c. This function is only
concerned in getting major device numbers and calling the init function for each device set.
con_init(), then is the one related to the console, and resides in console.c.
Initialization of the console has changed quite a lot during 1.1 evolution. console_init() has
been detatched from tty_init(), and is called directly by ../../main.c. The virtual consoles are now
dynamically allocated, and quite a good deal of code has changed. So, I'll skip the details of initialization,
allocation and such.
How file operations are dispatched to the console
This paragraph is quite low-level, and can be happily skipped over.

Needless to say, a Unix device is accessed though the filesystem. This paragraph details all steps from
the device file to the actual console functions. Moreover, the following information is extracted from the
1.1.73 source code, and it may be slightly different from the 1.0 source.
When a device inode is opened, the function chrdev_open() (or blkdev_open(), but we'll stich
to character devices) in ../../fs/devices.c gets executed. This function is reached by means of the structure

def_chr_fops, which in turn is referenced by chrdev_inode_operations, used by all the

filesystem types (see the previous section about filesystems).
chrdev_open takes care of specifying the device operations by substituting the device specific
file_operations table in the current filp and calls the specific open(). Device specific tables
are kept in the array chrdevs[], indexed by the majour device number, and filled by the same
../../fs/devices.c.
If the device is a tty one (aren't we aiming at the console?), we come to the tty drivers, whose functions
are in tty_io.c, indexed by tty_fops. Thus, tty_open() calls init_dev(), which allocates
any data structure needed by the device, based on the minor device number.
The minor number is also used to retrieve the actual driver for the device, which has been registered
through tty_register_driver(). The driver, then, is still another structure used to dispatch
computation, just like file_ops; it is concerned with writing and controlling the device. The last data
structure used in managing a tty is the line discipline, described later. The line discipline for the console
(and any other tty device) is set by initialize_tty_struct(), invoked by init_dev.
Everything we touched in this paragraph is device-independent. The only console-specific particular is
that console.c, has registered its own driver during con_init(). The line discipline, on the contrary, in
independent of the device.
The tty_driver structure is fully explained within <linux/tty_driver.h>.

The above information has been extracted from 1.1.73 source code. It isn't unlikely for your kernel
to be somewhat different (``This information is subject to change without notice'').
Writing to the console
When a console device is written to, the function con_write gets invoked. This function manages all
the control characters and escape sequences used to provide applications with complete screen
management. The escape sequences implemented are those of the vt102 terminal; This means that your
environment should say TERM=vt102 when you are telnetting to a non-Linux host; the best choice for
local activities, however, is TERM=console because the Linux console offers a superset of vt102
functionality.
con_write(), thus, is mostly made up of nested switch statements, used to handle a finite state
automaton interpreting escape sequences one character at a time. When in normal mode, the character
being printed is written directly to the video memory, using the current attr-ibute. Within console.c, all
the fields of struct vc are made accessible through macros, so any reference to (for example) attr,
does actually refer to the field in the structure vc_cons[currcons], as long as currcons is the
number of the console being referred to.
Actually, vc_cons in newer kernels is no longer an array of structures , it now is an array of
pointers whose contents are kmalloc()ed. The use of macros greatly simplified changing the
approach, because much of the code didn't need to be rewritten.
Actual mapping and unmapping of the console memory to screen is performed by the functions
set_scrmem() (which copies data from the console buffer to video memory) and get_scrmem

(which copies back data to the console buffer). The private buffer of the current console is physically
mapped on the actual video RAM, in order to minimize the number of data transfers. This means that
get- and set-_scrmem() are static to console.c and are called only during a console switch.
Reading the console
Reading the console is accomplished through the line-discipline. The default (and unique) line discipline
in Linux is called tty_ldisc_N_TTY. The line discipline is what ``disciplines input through a line''. It
is another function table (we're used to the approach, aren't we?), which is concerned with reading the
device. With the help of termios flags, the line discipline is what controls input from the tty: raw,
cbreak and cooked mode; select(); ioctl() and so on.
The read function in the line discipline is called read_chan(), which reads the tty buffer
independently of whence it came from. The reason is that character arrival through a tty is managed by
asynchronous hardware interrupts.
The line discipline N_TTY is to be found in the same tty_io.c, though later kernels use a different
n_tty.c source file.
The lowest level of console input is part of keyboard management, and thus it is handled within
keyboard.c, in the function keyboard_interrupt().
Keyboard management
Keyboard management is quite a nightmare. It is confined to the file keyboard.c, which is full of
hexadecimal numbers to represent the various keycodes appearing in keyboards of different
manifacturers.
I won't dig in keyboard.c, because no relevant information is there to the kernel hacker.
For those readers who are really interested in the Linux keyboard, the best approach to keyboard.c
is from the last line upward. Lowest level details occur mainly in the first half of the file.
Switching the current console
The current console is switched through invocation of the function change_console(), which
resides in tty_io.c and is invoked by both keyboard.c and vt.c (the former switches console in response to
keypresses, the latter when a program requests it by invoking an ioctl() call).
The actual switching process is performed in two steps, and the function
complete_change_console() takes care of the second part of it. Splitting the switch is meant to
complete the task after a possible handshake with the process controlling the tty we're leaving. If the
console is not under process control, change_console() calls complete_change_console()
by itself. Process intervertion is needed to successfully switch from a graphic console to a text one and
viceversa, and the X server (for example) is the controlling process of its own graphic console.
The selection mechanism
``selection'' is the cut and paste facility for the Linux text consoles. The mechanism is mainly

handled by a user-level process, which can be instantiated by either selection or gpm. The user-level
program uses ioctl() on the console to tell the kernel to highlight a region of the screen. The selected
text, then, is copied to a selection buffer. The buffer is a static entity in console.c. Pasting text is
accomplished by `manually' pushing characters in the tty input queue. The whole selection mechanism is
protected by #ifdef so users can disable it during kernel configuration to save a few kilobytes of ram.
Selection is a very-low-level facility, and its workings are hidden from any other kernel activity. This
means that most #ifdef's simply deals with removing the highlight before the screen is modified in any
way.
Newer kernels feature improved code for selection, and the mouse pointer can be highlighted
independently of the selected text (1.1.32 and later). Moreover, from 1.1.73 onward a dynamic buffer is
used for selected text rather than a static one, making the kernel 4kB smaller.
ioctl()ling the device
The ioctl() system call is the entry point for user processes to control the behaviour of device files.
Ioctl management is spawned by ../../fs/ioctl.c, where the real sys_ioctl() resides. The standard
ioctl requests are performed right there, other file-related requests are processed by file_ioctl()
(same source file), while any other request is dispatches to the device-specific ioctl() function.
The ioctl material for console devices resides in vt.c, because the console driver dispatches ioctl
requests to vt_ioctl().
The information above refer to 1.1.7x. The 1.0 kernel doesn't have the ``driver'' table, and
vt_ioctl() is pointed to directly by the file_operations() table.
Ioctl material is quite confused, indeed. Some requests are related to the device, and some are related to
the line discipline. I'll try to summarize things for the 1.0 and the 1.1.7x kernels. Anything happened in
between.
The 1.1.7x series features the following approach: tty_ioctl.c implements only line discipline requests
(namely n_tty_ioctl(), which is the only n_tty function outside of n_tty.c), while the
file_operations field points to tty_ioctl() in tty_io.c. If the request number is not resolved
by tty_ioctl(), it is passed along to tty->driver.ioctl or, if it fails, to
tty->ldisc.ioctl. Driver-related stuff for the console it to be found in vt.c, while line discipline
material is in tty_ioctl.c.
In the 1.0 kernel, tty_ioctl() is in tty_ioctl.c and is pointed to by generic tty file_operations.
Unresolved requests are passed along to the specific ioctl function or to the line-discipline code, in a way
similar to 1.1.7x.
Note that in both cases, the TIOCLINUX request is in the device-independent code. This implies that the
console selection can be set by ioctlling any tty (set_selection() always operates on the
foreground console), and this is a security hole. It is also a good reason to switch to a newer kernel,
where the problem is fixed by only allowing the superuser to handle the selection.
A variety of requests can be issued to the console device, and the best way to know about them is to
browse the source file vt.c.

Copyright (C) 1994 Alessandro Rubini, rubini@pop.systemy.it
Messages
8. access a file from module by proy018@avellano.usal.es
7. Which head.S? by Johnie Stafford
1. Untitled by benschop@eb.ele.tue.nl
6. STREAMS and Linux by Venkatesha Murthy G.
1. Re: STREAMS and LINUX by Vineet Sharma
5. Do you still need to run update ? by Chris Ebenezer
4. Do you still need to run bdflush? by Steve Dunham
1. Already answered... by Michael K. Johnson
3. Kernel Configuration and Makefile Structure by Steffen Moeller
1. Editing services available... by Michael K. Johnson
2. Kernel configuration by Venkatesha Murthy G.
2. Re: Kernel threads by Paul Gortmaker
1. More on usage of kernel threads. by David S. Miller
1. kernel startup code by Alan Cox
1. Untitled by Karapetyants Vladimir Vladimirovitch

access a file from module
access a file from module

Forum: Tour of the Linux kernel source
Date: Thu, 08 May 1997 12:06:47 GMT
From: <proy018@avellano.usal.es>
I need to access a file from a module
http://ldp.iol.it/LDP/khg/HyperNews/get/tour/tour/8.html [08/03/2001 10.09.27]

Which head.S?
Which head.S?
Keywords: head.S
Date: Sat, 20 Jul 1996 00:57:09 GMT
From: Johnie Stafford <jms@pobox.com>
In the "Tour of the Linux kernel source" section there is

reference to boot/head.S. I did a find on the source, this
is a list of the "head.S"'s in the source:
./arch/i386/boot/compressed/head.S
./arch/i386/kernel/head.S
./arch/alpha/boot/head.S
./arch/alpha/kernel/head.S
./arch/sparc/kernel/head.S
./arch/mips/kernel/head.S
./arch/ppc/kernel/head.S
./arch/ppc/boot/compressed/head.S
./arch/m68k/kernel/head.S
Obviously, there is a different one for each architecture.

But which version for the i386 architecture is being
refered to here, and what's the difference?
Johnie
Messages
1. Untitled by benschop@eb.ele.tue.nl

Untitled
Untitled
Re: Which head.S? (Johnie Stafford)
Keywords: head.S
Date: Tue, 23 Jul 1996 07:38:08 GMT
From: <benschop@eb.ele.tue.nl>
The file arch/i386/kernel/head.S is linked with the uncompressed kernel. If the kernel is not
compressed this is the only head.S used. In a compressed kernel, all 32 bit objects from the kernel,
including the above mentioned head.o are compressed and the compressed data is lumped together in
the file piggy.o. Now the file arch/i386/boot/compressed/head.S comes into play. This and the
decompressor and piggy.o form a new 32-bit object.
http://ldp.iol.it/LDP/khg/HyperNews/get/tour/tour/7/1.html [08/03/2001 10.09.28]

STREAMS and Linux
STREAMS and Linux

Keywords: STREAMS devices drivers
Date: Mon, 15 Jul 1996 12:01:50 GMT
From: Venkatesha Murthy G. <gvmt@csa.iisc.ernet.in>
Hi all,
Correct me if i am wrong, but Linux doesn't have any STREAMS devices or drivers as of now. But as
Ritchie's paper explains, they are flexible and can find use in a lot of places where piplelined
processing is involved - net drivers for instance. Anything being done/planned in that direction?
Venkatesha Murthy (gvmt@csa.iisc.ernet.in)
Messages
1. Re: STREAMS and LINUX by Vineet Sharma

Re: STREAMS and LINUX
Re: STREAMS and LINUX

Re: STREAMS and Linux (Venkatesha Murthy G.)
Keywords: STREAMS devices drivers
Date: Thu, 10 Apr 1997 15:49:04 GMT
From: Vineet Sharma <vsharma@hss.hns.com>
Go to ftp.gcom.com/pub/linux/src/ and pick up the Lis-2.0.25.tar.gz package.

Do you still need to run update ?
Do you still need to run update ?

Date: Tue, 25 Jun 1996 13:59:41 GMT
From: Chris Ebenezer <ncfee@wmin.ac.uk>
The docs on this daemon state that it is one of a pair of two daemons - bdflush/update - which manage
disk buffers. In the latest (2.0.x) kernels starting up update does not have the effect of also starting up
bdflush. So is update still needed ?

Do you still need to run bdflush?
Do you still need to run bdflush?

Date: Mon, 27 May 1996 18:55:44 GMT
From: Steve Dunham <dunham@gdl.msu.edu>
The recent 1.3.x kernels add a kernel thread named (kflushd) What does this do? Does it replace the
functionality of the user program 'bdflush'?
Messages
1. Already answered... by Michael K. Johnson

Already answered...
Already answered...
Re: Do you still need to run bdflush? (Steve Dunham)
Keywords: kflushd, searching
Date: Mon, 27 May 1996 19:42:00 GMT
From: Michael K. Johnson <johnsonm@redhat.com>
It looks like I'll eventually have to add search capability to the KHG. It will be a while before I have
time, though.
Your questions is already answered in a response elsewhere in this document; see Kernel threads,
posted by Paul Gortmaker.

Re: Kernel threads
Re: Kernel threads

Keywords: kernel threads
Date: Wed, 22 May 1996 16:51:58 GMT
From: Paul Gortmaker <gpg109@rsphy1.anu.edu.au>
The above mentions that v1.0 has "early" support for threads, which can use a bit of an update. They
are fully functional and in use in late v1.3.x kernels. For example the internal bdflush daemon used to
be started by a non-returning syscall in all the v1.2.x kernels, but as of around v1.3.4x or so, I made it
into an internal thread, and dispensed with the reliance on the user space syscall to launch the thing.
This is now what is seen as "kflushd" or process #2 on all recent kernels. Since then, other threads
such as "kflushd" and multiple "nfsiod" processes have taken advantage of the same functionality.
Paul.
Messages
1. More on usage of kernel threads. by David S. Miller

Kernel Configuration and Makefile Structure
Kernel Configuration and Makefile Structure

Keywords: configuration makefile
Date: Wed, 22 May 1996 17:34:39 GMT
From: Steffen Moeller <steffen@di.unito.it>
I'm missing a description of the Makefile mechanism and the principle of the configuration. Or is this
too trivial for a Hacker's Guide? I do not think so since
● it's a nice introduction,
● all hackers have to understand it and
● it's a good place to put hyperlinks to the real stuff in this guide.
If there's some positive feedback I'd like to start on this myself, but I'd need some help - at least for the
language.
Steffen
Messages
1. Editing services available... by Michael K. Johnson
2. Kernel configuration by Venkatesha Murthy G.

Editing services available...
Editing services available...

Re: Kernel Configuration and Makefile Structure (Steffen Moeller)
Keywords: configuration makefile
Date: Thu, 23 May 1996 17:00:42 GMT
This is certainly not too trivial a topic for the KHG. If you are willing to tackle it, feel free. If someone
else wants to work on it, that's fine too.
If by "...but I'd need some help - at least for the language" you mean that you would like someone to
edit your piece, you can send it to me for editing. If I feel that it needs more work before being added,
I'll send it back for revision, hopefully with helpful comments... :-)

Kernel configuration
Kernel configuration
Re: Kernel Configuration and Makefile Structure (Steffen Moeller)
Keywords: configuration
Date: Thu, 11 Jul 1996 12:30:00 GMT
From: Venkatesha Murthy G. <gvmt@csa.iisc.ernet.in>
I really haven't *understood* kernel configutarion but i can tell you what i do when i want to add a
config option. I first edit arch/i386/config.in and add a line that looks like
bool 'whatever explanation' CONFIG_WHATEVER default
this is supposed to mean that CONFIG_WHATEVER is a boolean taking values y or n. When you
'make config' you'll get something like
'whatever explanation (CONFIG_WHATEVER) [default]'
and you type in y or n. Now this automagically #defines CONFIG_WHATEVER in
<linux/autoconf.h>. Code that is specefic to the configuration can now be enclosed in #ifdef
CONFIG_WHATEVER ... #endif so it will be compiled in only when configured. If you want any
more explanation than can be given on one line, you can have a set of 'comment ...." lines before the
'bool ....' line and that will be displayed for you during configuration.
I don't know if you'll find it useful but still .....
Venkatesha Murthy (gvmt@csa.iisc.ernet.in)

More on usage of kernel threads.
More on usage of kernel threads.

Re: Re: Kernel threads (Paul Gortmaker)
Keywords: kernel threads asynchronous faults
Date: Fri, 24 May 1996 05:33:03 GMT
From: David S. Miller <davem@caip.rutgers.edu>
As an another addendum the AP+ multicomputer port

(actually it is a part of the generic Sparc kernel sources)
uses kernel threads to solve the problem of servicing a
true fault from interrupt space. The kernel thread is called
asyncd(), the AP+ multicomputer takes interrupts when one
cell on the machine does a dma access to another cell and the
page is not present or otherwise needs to be faulted in or
whatever. The interrupt handler adds this fault to a queue
of faults to service and wakes up the async daemon which runs
with real time priority much like the other linux kernel
daemons. Poof, solution to the classic interrupt context
limitation problem. ;-)
Later, David S. Miller (davem@caip.rutgers.edu)

kernel startup code
kernel startup code

Keywords: SMP start_kernel()
Date: Fri, 17 May 1996 10:48:00 GMT
From: Alan Cox <unknown>
The intel startup code and start_kernel() is partly used for SMP startup as the intel MP design starts the
secondary CPU's in real mode. In addition to make it more fun you can only pass one piece of
information - the address (page boundary) that the processor is made to boot at. The SMP kernel writes
a trampoline routine at the base of a page it allocates for the stack of each CPU. The secondary
processors (or AP's as Intel calls them for Application Processors) load their SS:SP based on the code
segment enter protected mode and jump into the 32bit kernel startup.
The kernel startup for the SMP kernel in start_kernel() calls a few startup routines for the architecture
and then waits for the boot processor to complete initialisation. At this point it starts running an idle
thread and is schedulable.
Messages
1. Untitled by Karapetyants Vladimir Vladimirovitch

HyperNews Redirect
Broken URL: http://www.redhat.com:8080/HyperNews/get/tour/tour/1/1.html

Try: http://www.redhat.com:8080/HyperNews/get/tour/tour/1.html

Device Drivers
Device Drivers
If you choose to write a device driver, you must take everything written here as a guide, and no more. I
cannot guarantee that this chapter will be free of errors, and I cannot guarantee that you will not damage your
computer, even if you follow these instructions exactly. It is highly unlikely that you will damage it, but I
cannot guarantee against it. There is only one `ìnfallible'' direction I can give you: Back up! Back up before
you test your new device driver, or you may regret it later.
What is a Device Driver?
What is this ``device driver'' stuff anyway? Here's a very short introduction to the concept.
User-space device drivers
It's not always necessary to write a ``real'' device driver. Sometimes you just need to know how to
write code that runs as a normal user process and still accesses hardware.
Device Driver Basics
Assuming that you need to write a ``real'' device driver, there are some things that you need to know
regardless of what type of driver you are writing. In fact, you may need to learn what type of driver
you ought to write...
Character Device Drivers
This section includes details specific to character device drivers, and assumes that you know
everything in the previous section.
TTY drivers
This section hasn't been written yet. TTY drivers are character devices that interface with the kernel's
generic TTY support, and they require more than just a standard character device interface. I'd
appreciate it if someone would write up how to attach a character device driver to the generic TTY
layer and submit it to me for inclusion in this guide.
Block Device Drivers
This section includes details specific to block device drivers (suprise!)
Writing a SCSI Device Driver
This is a technical paper written by Rik Faith at the University of North Carolina.
Network Device Drivers
Alan Cox gives an introduction to the network layer, including device drivers.
Supporting Functions
Many functions are useful to all sorts of drivers. Here is a summary of quite a few of them.
Translating Addresses in Kernel Space
An edited version of a post of Linus Torvalds to the linux-kernel mailing list about how to correctly
deal with translating memory references when writing kernel source code such as device drivers.
Kernel-Level Exception Handling
An edited version of a post of Joerg Pommnitz to the linux-kernel mailing list about how the new
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices.html (1 di 3) [08/03/2001 10.09.40]

Device Drivers
(Linux 2.1.8) exception mechanism works.
Other sources of information
Quite a few other references are also available on the topic of writing Linux device drivers by now. I put up
some (slightly outdated by now, but still worth reading, I think) notes for a talk I gave in May 1995 entitled
Writing Linux Device Drivers, which is specifically oriented at character devices implemented as kernel
runtime-loadable modules.
Linux Journal has had a long-running series of articles called Kernel Korner which, despite the wacky
name, has had quite a bit of useful information on it. Some of the articles from that column may be available
on the web; most of them are available for purchase as back issues. One particularly useful series of articles,
which focussed in far more detail than my 30 minute talk on the subject of kernel runtime-loadable modules,
was in issues 23, 24, 25, 26, and 28. They were written by Alessandro Rubini and Georg v. Zezschwitz. Issue
29 is slated (as of this writing) to have an article on writing network device drivers, written by Alan Cox.
Issues 9, 10, and 11 have a series that I wrote on block device drivers.
Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.
Messages
22. DMA to user space by Marcel Boosten
21. How a device driver can driver his device by Kim yeonseop
1. Untitled
20. memcpy error? by Edgar Vonk
19. Unable to handle kernel paging request - error by Edgar Vonk
17. _syscallX() Macros by Tom Howley
16. MediaMagic Sound Card DSP-16. How to run in Linux. by Robert Hinson
15. What does mark_bh() do? by Erik Petersen
1. Untitled by Praveen Dwivedi
14. 3D Acceleration by jamesbat@innotts.co.uk
13. Device Drivers: /dev/radio... by Matthew Kirkwood
12. Does anybody know why kernel wakes my driver up without apparant reasons? by David van Leeuwen
11. Getting a DMA buffer aligned with 64k boundaries by Juan de La Figuera Bayon
10. Hardware Interface I/O Access by Terry Moore
1. You are somewhat confused... by Michael K. Johnson
9. Is Anybody know something about SIS 496 IDE chipset? by Alexander
7. Vertical Retrace Interrupt - I need to use it by Brynn Rogers
1. Your choice... by Michael K. Johnson
6. help working with skb structures by arkane

Device Drivers
5. Interrupt Sharing ? by Frieder Löffler

1. Interrupt sharing-possible by Vladimir Myslik
-> Interrupt sharing - How to do with Network Drivers? by Frieder Löffler
-> Interrupt sharing 101 by Christophe Beauregard
4. Device Driver notification of "Linux going down" by Stan Troeh
1. Through application which has opened the device by Michael K. Johnson
2. Device Driver notification of "Linux going down" by Marko Kohtala
3. Is waitv honored? by Michael K. Johnson
2. PCI Driver by Flavia Donno
1. There is linux-2.0/drivers/pci/pci.c by Hasdi
1. Re: Network Device Drivers by Paul Gortmaker
1. Re: Network Device Drivers by Neal Tucker
1. network driver info by Neal Tucker
-> Network Driver Desprately Needed by Paul Atkinson
2. Transmit function by Joerg Schorr
1. Re: Transmit function by Paul Gortmaker
-> Skbuff by Joerg Schorr


Making hardware work is tedious. To write to a hard disk, for example, requires that you write magic
numbers in magic places, wait for the hard drive to say that it is ready to receive data, and then feed it the
data it wants, very carefully. To write to a floppy disk is even harder, and requires that the program
supervise the floppy disk drive almost constantly while it is running.
Instead of putting code in each application you write to control each device, you share the code between
applications. To make sure that that code is not compromised, you protect it from users and normal
programs that use it. If you do it right, you will be able to add and remove devices from your system
without changing your applications at all. Furthermore, you need to be able to load your program into
memory and run it, which the operating system also does. So an operating system is essentially a
priviledged, general, sharable library of low-level hardware and memory and process control functions
and routines.
All versions of Unix have an abstract way of reading and writing devices. By making the devices act as
much as possible like regular files, the same calls (read(), write(), etc.) can be used for devices and
files. Within the kernel, there are a set of functions, registered with the filesystem, which are called to
handle requests to do I/O on ``device special files,'' which are those which represent devices. (See
mknod(1,2) for an explanation of how to make these files.)
All devices controlled by the same device driver are given the same major number, and of those with
the same major number, different devices are distinguished by different minor numbers. (This is not
strictly true, but it is close enough. If you understand where it is not true, you don't need to read this
section, and if you don't but want to learn, read the code for the tty devices, which uses up 2 major
numbers, and may use a third and possibly fourth by the time you read this. Also, the ``misc'' major
device supports many minor devices that only need a few minor numbers; we'll get to that later.)
This chapter explains how to write any type of Linux device driver that you might need to, including
character, block, SCSI, and network drivers. It explains what functions you need to write, how to
initialize your drivers and obtain memory for them efficiently, and what function are built in to Linux to
make your job easier.
Creating device drivers for Linux is easier than you might think. It merely involves writing a few
functions and registering them with the Virtual Filesystem Switch (VFS), so that when the proper device
special files are accessed, the VFS can call your functions.
However, a word of warning is due here: Writing a device driver is writing a part of the Linux kernel.
This means that your driver runs with kernel permissions, and can do anything it wants to: write to any
memory, reformat your hard drive, damage your monitor or video card, or even break your dishes, if
your dishwasher is controlled by your computer. Be careful.
Also, your driver will run in kernel mode, and the Linux kernel, like most Unix kernels, is
non-pre-emptible. This means that if you driver takes a long time to work without giving other programs
a chance to work, your computer will appear to ``freeze'' when your driver is running. Normal user-mode
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/whatis.html (1 di 2) [08/03/2001 10.09.41]

pre-emptive scheduling does not apply to your driver.

Messages
1. Question ? by Rose Merone
-> Not yet... by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/whatis.html (2 di 2) [08/03/2001 10.09.41]

Question ?
Question ?
Forum: What is a Device Driver?
Date: Mon, 24 Mar 1997 08:39:09 GMT
From: Rose Merone <unknown>
D'ya have a book that covers all about device driver management in Linux ?
Messages
1. Not yet... by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/whatis/1.html [08/03/2001 10.09.42]

Not yet...
Not yet...
Forum: What is a Device Driver?
Re: Question ? (Rose Merone)
Date: Mon, 21 Apr 1997 14:00:19 GMT
Alessandro Rubini is writing a book about writing device drivers for O'Reilly. See
http://www.ora.com/catalog/linuxdrive/ and http://www.ora.com/catalog/linuxdrive/desc.html
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/whatis/1/1.html [08/03/2001 10.09.43]


It is not always necessary to write a device driver for a device, especially in applications where no two
applications will compete for the device. The most useful example of this is a memory-mapped device, but
you can also do this with devices in I/O space (devices accessed with inb() and outb(), etc.). If your
process is running as superuser (root), you can use the mmap() call to map some of your process memory
to actual memory locations, by mmap()'ing a section of /dev/mem. When you have done this mapping, it is
pretty easy to write and read from real memory addresses just as you would read and write any variables.
If your driver needs to respond to interrupts, then you really need to be working in kernel space, and need
to write a real device driver, as there is no good way at this time to deliver interrupts to user processes.
Although the DOSEMU project has created something called the SIG (Silly Interrupt Generator) which
allows interrupts to be posted to user processes (I believe through the use of signals), the SIG is not
particularly fast, and should be thought of as a last resort for things like DOSEMU.
An interrupt is an asyncronous notification posted by the hardware to alert the device driver of some
condition. You have likely dealt with ÌRQ's when setting up your hardware; an IRQ is an `Ìnterrupt
ReQuest line,'' which is triggered when the device wants to talk to the driver. This may be because it has
data to give to the drive, or because it is now ready to receive data, or because of some other `èxceptional
condition'' that the driver needs to know about. It is similar to user-level processes receiving a signal, so
similar that the same sigaction structure is used in the kernel to deal with interrupts as is used in
user-level programs to deal with signals. Where the user-level has its signals delivered to it by the kernel,
the kernel has interrupt delivered to it by hardware.
If your driver must be accessible to multiple processes at once, and/or manage contention for a resource,
then you also need to write a real device driver at the kernel level, and a user-space device driver will not
be sufficient or even possible.
Example: vgalib
A good example of a user-space driver is the vgalib library. The standard read() and write() calls
are really inadequate for writing a really fast graphics driver, and so instead there is a library which acts
conceptually like a device driver, but runs in user space. Any processes which use it must run setuid root,
because it uses the ioperm() system call. It is possible for a process that is not setuid root to write to
/dev/mem if you have a group mem or kmem which is allowed write permission to /dev/mem and the
process is properly setgid, but only a process running as root can execute the ioperm() call.
There are several I/O ports associated with VGA graphics. vgalib creates symbolic names for this with
#define statements, and then issues the ioperm() call like this to make it possible for the process to
read and write directly from and to those ports:
if (ioperm(CRT_IC, 1, 1)) {
printf("VGAlib: can't get I/O permissions \n");
exit (-1);
}
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/fake.html (1 di 3) [08/03/2001 10.09.46]

ioperm(CRT_IM, 1, 1);
ioperm(ATT_IW, 1, 1);
[...]
It only needs to do error checking once, because the only reason for the ioperm() call to fail is that it is
not being called by the superuser, and this status is not going to change.
After making this call, the process is allowed to use inb and outb machine instructions, but only on
the specified ports. These instructions can be accessed without writing directly in assembly by including ,
but will only work if you compile with optimization on, by giving the -O? to gcc. Read
<linux/asm.h> for details.
After arranging for port I/O, vgalib arranges for writing directly to kernel memory with the following
code:
/* open /dev/mem */
if ((mem_fd = open("/dev/mem", O_RDWR) ) < 0) {
printf("VGAlib: can't open /dev/mem \n");
exit (-1);
}
/* mmap graphics memory */

if ((graph_mem = malloc(GRAPH_SIZE + (PAGE_SIZE-1))) == NULL) {
printf("VGAlib: allocation error \n");
exit (-1);
}
if ((unsigned long)graph_mem % PAGE_SIZE)
graph_mem += PAGE_SIZE - ((unsigned long)graph_mem % PAGE_SIZE);
graph_mem = (unsigned char *)mmap(
(caddr_t)graph_mem,
GRAPH_SIZE,
PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_FIXED,
mem_fd,
GRAPH_BASE
);
if ((long)graph_mem < 0) {
printf("VGAlib: mmap error \n");
exit (-1);
}
It first opens /dev/mem, then allocates memory enough so that the mapping can be done on a page (4 KB)
boundary, and then attempts the map. GRAPH_SIZE is the size of VGA memory, and GRAPH_BASE is the
first address of VGA memory in /dev/mem. Then by writing to the address that is returned by mmap(), the
process is actually writing to screen memory.
Example: mouse conversion

If you want a driver that acts a bit more like a kernel-level driver, but does not live in kernel space, you can
also make a fifo, or named pipe. This usually lives in the /dev/ directory (although it doesn't need to) and
acts substantially like a device once set up. However, fifo's are one-directional only--they have one reader
and one writer.
For instance, it used to be that if you had a PS/2-style mouse, and wanted to run XFree86, you had to create
a fifo called /dev/mouse, and run a program called mconv which read PS/2 mouse ``droppings'' from
/dev/psaux, and wrote the equivalent microsoft-style ``droppings'' to /dev/mouse. Then XFree86 would read
the ``droppings'' from /dev/mouse, and it would be as if there were a microsoft mouse connected to
/dev/mouse. Even though XFree86 is now able to read PS/2 style ``droppings'', the concepts in this example
still stand. (If you have a better example, I'd be glad to see it.)
The evil instruction
Don't use the cli() instruction. It's possible to use it as root to disable interrupts, and one particular
program used to used to use it--the clock program. However, this kills SMP machines. If you need to use
cli(), you need a kernel-space driver, and a user-space driver will only cause grief as more and more
Linux users use SMP machines.
Copyright (C) 1992, 1993, 1994, 1995, 1996 Michael K. Johnson, johnsonm@redhat.com.
Messages
1. What is SMP?
-> SMP: Two Definitions? by Reinhold J. Gerharz
-> Only one definition for Linux... by Michael K. Johnson

What is SMP?
What is SMP?
Forum: User-space device drivers
Keywords: SMP
Date: Mon, 16 Dec 1996 00:22:27 GMT
From: <unknown>
It might not be appropriate to ask, but it'd be real nice to
know what SMP means. I never saw cli() instruction do any
harm to any Linux machine I've met.
Messages
1. SMP: Two Definitions? by Reinhold J. Gerharz
-> Only one definition for Linux... by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/fake/1.html [08/03/2001 10.09.47]

SMP: Two Definitions?
SMP: Two Definitions?

Re: What is SMP?
Keywords: SMP
Date: Thu, 09 Jan 1997 03:18:21 GMT
From: Reinhold J. Gerharz <rgerharz@erols.com>
I thought SMP meant "symetric multi-processing," a technology where two or more processors share
equal access to memory, device I/O, and interrupts. Ideally one would expect a 100 percent
improvement in processing performance for each additional processor, but in reality only 80-90
percent is achieved.
However, I have discovered that to some people, SMP means "shared-memory multi-processing." This
technology allows multiple processors to run user programs, but one processor reserves interrupt and
I/O handling for itself. This is traditionally called "asymetric multi-processing," and I have tentatively
concluded that only "marketing types" would use this terminology to confuse potential customers.
Messages
1. Only one definition for Linux... by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/fake/1/1.html [08/03/2001 10.09.47]

Only one definition for Linux...
Only one definition for Linux...

Re: What is SMP?
Re: SMP: Two Definitions? (Reinhold J. Gerharz)
Keywords: SMP
Date: Mon, 13 Jan 1997 14:26:44 GMT
In the Linux world, SMP really does mean symmetric multi-processing. Currently, there's a lock
around the whole kernel so that only one CPU can be in kernel mode at once, but all the CPUs can run
in kernel mode at different times.
As you add more CPU's to an SMP system, the amount of extra performance you get out of each
additional CPU decreases, until at some point it actually decreases performance to add another CPU.
Most systems simply don't support enough CPUs to get a negative marginal performance gain, so that
usually isn't an issue.
Also, because Linux uses a single lock, the current kernels degrade more quickly as you add more
CPUs than a multiple-lock system would for I/O-bound tasks. CPU-bound tasks, on the other hand,
work very well with a single lock around the kernel.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/fake/1/1/1.html [08/03/2001 10.09.47]


We will assume that you decide that you do not wish to write a user-space device, and would rather implement your
device in the kernel. You will probably be writing writing two files, a .c file and a .h file, and possibly modifying
other files as well, as will be described below. We will refer to your files as foo.c and foo.h, and your driver will be the
foo driver.
Namespace
One of the first things you will need to do, before writing any code, is to name your device. This name should be a
short (probably two or three character) string. For instance, the parallel device is the ``lp'' device, the floppies are the
``fd'' devices, and SCSI disks are the ``sd'' devices. As you write your driver, you will give your functions names
prefixed with your chosen string to avoid any namespace confusion. We will call your prefix foo, and give your
functions names like foo_read(), foo_write(), etc.
Allocating memory
Memory allocation in the kernel is a little different from memory allocation in normal user-level programs. Instead of
having a malloc() capable of delivering almost unlimited amounts of memory, there is a kmalloc() function that
is a bit different:
● Memory is provided in pieces whose size is a power of 2, except that pieces larger than 128 bytes are allocated in
blocks whose size is a power of 2 minus some small amount for overhead. You can request any odd size, but
memory will not be used any more efficiently if you request a 31-byte piece than it will if you request a 32 byte
piece. Also, there is a limit to the amount of memory that can be allocated, which is currently 131056 bytes.
● kmalloc() takes a second argument, the priority. This is used as an argument to the get_free_page()
function, where it is used to determine when to return. The usual priority is GFP_KERNEL. If it may be called
from within an interrupt, use GFP_ATOMIC and be truly prepared for it to fail (don't panic). This is because if
you specify GFP_KERNEL, kmalloc() may sleep, which cannot be done on an interrupt. The other option is
GFP_BUFFER, which is used only when the kernel is allocating buffer space, and never in device drivers.
To free memory allocated with kmalloc(), use one of two functions: kfree() or kfree_s(). These differ from
free() in a few ways as well:
● kfree() is a macro which calls kfree_s() and acts like the standard free() outside the kernel.
● If you know what size object you are freeing, you can speed things up by calling kfree_s() directly. It takes
two arguments: the first is the pointer that you are freeing, as in the single argument to kfree(), and the
second is the size of the object being freed.
See Supporting Functions for more information on kmalloc(), kfree(), and other useful functions.
Be gentle when you use kmalloc. Use only what you have to. Remember that kernel memory is unswappable, and thus
allocating extra memory in the kernel is a far worse thing to do in the kernel than in a user-level program. Take only
what you need, and free it when you are done, unless you are going to use it right away again.
Character vs. block devices
There are two main types of devices under all Unix systems, character and block devices. Character devices are those
for which no buffering is performed, and block devices are those which are accessed through a cache. Block devices
must be random access, but character devices are not required to be, though some are. Filesystems can only be mounted
if they are on block devices.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/basics.html (1 di 9) [08/03/2001 10.09.49]

Character devices are read from and written to with two function: foo_read() and foo_write(). The read()
and write() calls do not return until the operation is complete. By contrast, block devices do not even implement the
read() and write() functions, and instead have a function which has historically been called the ``strategy
routine.'' Reads and writes are done through the buffer cache mechanism by the generic functions bread(),
breada(), and bwrite(). These functions go through the buffer cache, and so may or may not actually call the
strategy routine, depending on whether or not the block requested is in the buffer cache (for reads) or on whether or not
the buffer cache is full (for writes). A request may be asyncronous: breada() can request the strategy routine to
schedule reads that have not been asked for, and to do it asyncronously, in the background, in the hopes that they will
be needed later.
The sources for character devices are kept in drivers/char/, and the sources for block devices are kept in drivers/block/.
They have similar interfaces, and are very much alike, except for reading and writing. Because of the difference in
reading and writing, initialization is different, as block devices have to register a strategy routine, which is registered in
a different way than the foo_read() and foo_write() routines of a character device driver. Specifics are dealt
with in Character Device Initialization and Block Device Initialization.
Interrupts vs. Polling
Hardware is slow. That is, in the time it takes to get information from your average device, the CPU could be off doing
something far more useful than waiting for a busy but slow device. So to keep from having to busy-wait all the time,
interrupts are provided which can interrupt whatever is happening so that the operating system can do some task and
return to what it was doing without losing information. In an ideal world, all devices would probably work by using
interrupts. However, on a PC or clone, there are only a few interrupts available for use by your peripherals, so some
drivers have to poll the hardware: ask the hardware if it is ready to transfer data yet. This unfortunately wastes time, but
it sometimes needs to be done.
Some hardware (like memory-mapped displays) is as fast as the rest of the machine, and does not generate output
asyncronously, so an interrupt-driven driver would be rather silly, even if interrupts were provided.
In Linux, many of the drivers are interrupt-driven, but some are not, and at least one can be either, and can be switched
back and forth at runtime. For instance, the lp device (the parallel port driver) normally polls the printer to see if the
printer is ready to accept output, and if the printer stays in a not ready phase for too long, the driver will sleep for a
while, and try again later. This improves system performance. However, if you have a parallel card that supplies an
interrupt, the driver will utilize that, which will usually make performance even better.
There are some important programming differences between interrupt-driven drivers and polling drivers. To understand
this difference, you have to understand a little bit of how system calls work under Unix. The kernel is not a separate
task under Unix. Rather, it is as if each process has a copy of the kernel. When a process executes a system call, it does
not transfer control to another process, but rather, the process changes execution modes, and is said to be `ìn kernel
mode.'' In this mode, it executes kernel code which is trusted to be safe.
In kernel mode, the process can still access the user-space memory that it was previously executing in, which is done
through a set of macros: get_fs_*() and memcpy_fromfs() read user-space memory, and put_fs_*() and
memcpy_tofs() write to user-space memory. Because the process is still running, but in a different mode, there is
no question of where in memory to put the data, or where to get it from. However, when an interrupt occurs, any
process might currently be running, so these macros cannot be used--if they are, they will either write over random
memory space of the running process or cause the kernel to panic.
Instead, when scheduling the interrupt, a driver must also provide temporary space in which to put the information, and
then sleep. When the interrupt-driven part of the driver has filled up that temporary space, it wakes up the process,
which copies the information from that temporary space into the process' user space and returns. In a block device
driver, this temporary space is automatically provided by the buffer cache mechanism, but in a character device driver,
the driver is responsible for allocating it itself.

The sleep-wakeup mechanism
[Begin by giving a general description of how sleeping is used and what it does. This should mention things like
all processes sleeping on an event are woken at once, and then they contend for the event again, etc...]
Perhaps the best way to try to understand the Linux sleep-wakeup mechanism is to read the source for the
__sleep_on() function, used to implement both the sleep_on() and interruptible_sleep_on() calls.
static inline void __sleep_on(struct wait_queue **p, int state)

{
struct wait_queue wait = { current, NULL };
if (!p)
return;
if (current == task[0])
panic("task[0] trying to sleep");
current->state = state;
add_wait_queue(p, &wait);
save_flags(flags);
sti();
schedule();
remove_wait_queue(p, &wait);
restore_flags(flags);
}
A wait_queue is a circular list of pointers to task structures, defined in <linux/wait.h> to be
struct wait_queue {
struct task_struct * task;
struct wait_queue * next;
};
state is either TASK_INTERRUPTIBLE or TASK_UNINTERUPTIBLE, depending on whether or not the sleep
should be interruptable by such things as system calls. In general, the sleep should be interruptible if the device is a
slow one; one which can block indefinitely, including terminals and network devices or pseudodevices.
add_wait_queue() turns off interrupts, if they were enabled, and adds the new struct wait_queue declared
at the beginning of the function to the list p. It then recovers the original interrupt state (enabled or disabled), and
returns.
save_flags() is a macro which saves the process flags in its argument. This is done to preserve the previous state
of the interrupt enable flag. This way, the restore_flags() later can restore the interrupt state, whether it was
enabled or disabled. sti() then allows interrupts to occur, and schedule() finds a new process to run, and
switches to it. Schedule will not choose this process to run again until the state is changed to TASK_RUNNING by
wake_up() called on the same wait queue, p, or conceivably by something else.
The process then removes itself from the wait_queue, restores the orginal interrupt condition with
restore_flags(), and returns.
Whenever contention for a resource might occur, there needs to be a pointer to a wait_queue associated with that
resource. Then, whenever contention does occur, each process that finds itself locked out of access to the resource
sleeps on that resource's wait_queue. When any process is finished using a resource for which there is a
wait_queue, it should wake up and processes that might be sleeping on that wait_queue, probably by calling

wake_up(), or possibly wake_up_interruptible().

If you don't understand why a process might want to sleep, or want more details on when and how to structure this
sleeping, I urge you to buy one of the operating systems textbooks listed in the Annotated Bibliography and look up
mutual exclusion and deadlock.
More advanced sleeping
If the sleep_on()/wake_up() mechanism in Linux does not satisfy your device driver needs, you can code your
own versions of sleep_on() and wake_up() that fit your needs. For an example of this, look at the serial device
driver (drivers/char/serial.c) in function block_til_ready(), where quite a bit has to be done between the
add_wait_queue() and the schedule().
The VFS
The Virtual Filesystem Switch, or VFS, is the mechanism which allows Linux to mount many different filesystems at
the same time. In the first versions of Linux, all filesystem access went straight into routines which understood the
minix filesystem. To make it possible for other filesystems to be written, filesystem calls had to pass through a layer
of indirection which would switch the call to the routine for the correct filesystem. This was done by some generic code
which can handle generic cases and a structure of pointers to functions which handle specific cases. One structure is of
interest to the device driver writer; the file_operations structure.
From /usr/include/linux/fs.h:
struct file_operations {
int (*lseek) (struct inode *, struct file *, off_t, int);
int (*read) (struct inode *, struct file *, char *, int);
int (*write) (struct inode *, struct file *, char *, int);
int (*readdir) (struct inode *, struct file *, struct dirent *, int count);
int (*select) (struct inode *, struct file *, int, select_table *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned int);
int (*mmap) (struct inode *, struct file *, unsigned long, size_t, int,
unsigned long);
int (*open) (struct inode *, struct file *);
void (*release) (struct inode *, struct file *);
};
Essentially, this structure constitutes a parital list of the functions that you may have to write to create your driver.
This section details the actions and requirements of the functions in the file_operations structure. It documents
all the arguments that these functions take. [It should also detail all the defaults, and cover more carefully the
possible return values.]
The lseek() function
This function is called when the system call lseek() is called on the device special file representing your device. An
understanding of what the system call lseek() does should be sufficient to explain this function, which moves to the
desired offset. It takes these four arguments:
struct inode * inode
Pointer to the inode structure for this device.
struct file * file
Pointer to the file structure for this device.
off_t offset

Offset from origin to move to.

int origin
0 = take the offset from absolute offset 0 (the beginning).
1 = take the offset from the current position.
2 = take the offset from the end.
lseek() returns -errno on error, or the absolute position (>= 0) after the lseek.
If there is no lseek(), the kernel will take the default action, which is to modify the file->f_pos element. For an
origin of 2, the default action is to return -EINVAL if file->f_inode is NULL, otherwise it sets
file->f_pos to file->f_inode->i_size + offset. Because of this, if lseek() should return an error for
your device, you must write an lseek() function which returns that error.
The read() and write() functions
The read and write functions read and write a character string to the device. If there is no read() or write()
function in the file_operations structure registered with the kernel, and the device is a character device,
read() or write() system calls, respectively, will return -EINVAL. If the device is a block device, these functions
should not be implemented, as the VFS will route requests through the buffer cache, which will call your strategy
routine. The read and write functions take these arguments:
This is a pointer to the inode of the device special file which was accessed. From this, you can do several things,
based on the struct inode declaration about 100 lines into /usr/include/linux/fs.h. For instance, you can find
the minor number of the file by this construction: unsigned int minor = MINOR(inode->i_rdev);
The definition of the MINOR macro is in , as are many other useful definitions. Read fs.h and a few device
drivers for more details, and see Supporting Functions for a short description. inode->i_mode can be used to
find the mode of the file, and there are macros available for this, as well.
struct file * file
Pointer to file structure for this device.
char * buf
This is a buffer of characters to read or write. It is located in user-space memory, and therefore must be accessed
using the get_fs*(), put_fs*(), and memcpy*fs() macros detailed in Supporting Functions.
User-space memory is inaccessible during an interrupt, so if your driver is interrupt driven, you will have to copy
the contents of your buffer into a queue.
int count
This is a count of characters in buf to be read or written. It is the size of buf, and is how you know that you
have reached the end of buf, as buf is not guaranteed to be null-terminated.
The readdir() function
This function is another artifact of file_operations being used for implementing filesystems as well as device
drivers. Do not implement it. The kernel will return -ENOTDIR if the system call readdir() is called on your
device special file.
The select() function
The select() function is generally most useful with character devices. It is usually used to multiplex reads without
polling--the application calls the select() system call, giving it a list of file descriptors to watch, and the kernel
reports back to the program on which file descriptor has woken it up. It is also used as a timer. However, the
select() function in your device driver is not directly called by the system call select(), and so the

file_operations select() only needs to do a few things. Its arguments are:

struct file * file
int sel_type
The select type to perform:
SEL_IN read
SEL_OUT write
SEL_EX exception
select_table * wait
If wait is not NULL and there is no error condition caused by the select, select() should put the process to
sleep, and arrange to be woken up when the device becomes ready, usually through an interrupt. If wait is
NULL, then the driver should quickly see if the device is ready, and return even if it is not. The
select_wait() function does this already.
If the calling program wants to wait until one of the devices upon which it is selecting becomes available for the
operation it is interested in, the process will have to be put to sleep until one of those operations becomes available.
This does not require use of a sleep_on*() function, however. Instead the select_wait() function is used.
(See Supporting Functions for the definition of the select_wait() function). The sleep state that
select_wait() will cause is the same as that of sleep_on_interruptible(), and, in fact,
wake_up_interruptible() is used to wake up the process.
However, select_wait() will not make the process go to sleep right away. It returns directly, and the select()
function you wrote should then return. The process isn't put to sleep until the system call sys_select(), which
originall called your select() function, uses the information given to it by the select_wait() function to put
the process to sleep. select_wait() adds the process to the wait queue, but do_select() (called from
sys_select()) actually puts the process to sleep by changing the process state to TASK_INTERRUPTIBLE and
calling schedule().
The first argument to select_wait() is the same wait_queue that should be used for a sleep_on(), and the
second is the select_table that was passed to your select() function.
After having explained all this in excruciating detail, here are two rules to follow:
1. Call select_wait() if the device is not ready, and return 0.
2. Return 1 if the device is ready.
If you provide a select() function, do not provide timeouts by setting current->timeout, as the select()
mechanism uses current->timeout, and the two methods cannot co-exist, as there is only one timeout for each
process. Instead, consider using a timer to provide timeouts. See the description of the add_timer() function in
Supporting Functions for details.
The ioctl() function
The ioctl() function processes ioctl calls. The structure of your ioctl() function will be: first error checking,
then one giant (possibly nested) switch statement to handle all possible ioctls. The ioctl number is passed as cmd, and
the argument to the ioctl is passed as arg. It is good to have an understanding of how ioctls ought to work before
making them up. If you are not sure about your ioctls, do not feel ashamed to ask someone knowledgeable about it, for
a few reasons: you may not even need an ioctl for your purpose, and if you do need an ioctl, there may be a better way

to do it than what you have thought of. Since ioctls are the least regular part of the device interface, it takes perhaps the
most work to get this part right. Take the time and energy you need to get it right.
The first thing you need to do is look in Documentation/ioctl-number.txt, read it, and pick an unused number. Then go
from there.
struct file * file
unsigned int cmd
This is the ioctl command. It is generally used as the switch variable for a case statement.
unsigned int arg
This is the argument to the command. This is user defined. Since this is the same size as a (void *), this can
be used as a pointer to user space, accessed through the fs register as usual.
Returns:
-errno on error
Every other return is user-defined.
If the ioctl() slot in the file_operations structure is not filled in, the VFS will return -EINVAL. However, in
all cases, if cmd is one of FIOCLEX, FIONCLEX, FIONBIO, or FIOASYNC, default processing will be done:
FIOCLEX (0x5451)
Sets the close-on-exec bit.
FIONCLEX (0x5450)
Clears the close-on-exec bit.
FIONBIO (0x5421)
If arg is non-zero, set O_NONBLOCK, otherwise clear O_NONBLOCK.
FIOASYNC (0x5452)
If arg is non-zero, set O_SYNC, otherwise clear O_SYNC. O_SYNC is not yet implemented, but it is
documented here and parsed in the kernel for completeness.
Note that you have to avoid these four numbers when creating your own ioctls, since if they conflict, the VFS ioctl
code will interpret them as being one of these four, and act appropriately, causing a very hard-to-track-down bug.
The mmap() function

Pointer to inode structure for device.
struct file * file
Pointer to file structure for device.
unsigned long addr
Beginning of address in main memory to mmap() into.
size_t len
Length of memory to mmap().
int prot
One of:
PROT_READ region can be read.

PROT_WRITE region can be written.

PROT_EXEC region can be executed.
PROT_NONE region cannot be accessed.
unsigned long off
Offset in the file to mmap() from. This address in the file will be mapped to address addr.
The open() and release() functions

Pointer to inode structure for device.
struct file * file
Pointer to file structure for device.
open() is called when a device special files is opened. It is the policy mechanism responsible for ensuring
consistency. If only one process is allowed to open the device at once, open() should lock the device, using whatever
locking mechanism is appropriate, usually setting a bit in some state variable to mark it as busy. If a process already is
using the device (if the busy bit is already set) then open() should return -EBUSY. If more than one process may
open the device, this function is responsible to set up any necessary queues that would not be set up in write(). If no
such device exists, open() should return -ENODEV to indicate this. Return 0 on success.
release() is called only when the process closes its last open file descriptor on the files. [I am not sure this is true;
it might be called on every close.] If devices have been marked as busy, release() should unset the busy bits if
appropriate. If you need to clean up kmalloc()'ed queues or reset devices to preserve their sanity, this is the place to
do it. If no release() function is defined, none is called.
The init() function
This function is not actually included in the file_operations structure, but you are required to implement it,
because it is this function that registers the file_operations structure with the VFS in the first place--without this
function, the VFS could not route any requests to the driver. This function is called when the kernel first boots and is
configuring itself. The init function then detects all devices. You will have to call your init() function from the
correct place: for a character device, this is chr_dev_init() in drivers/char/mem.c.
While the init() function runs, it registers your driver by calling the proper registration function. For character
devices, this is register_chrdev(). (See Supporting Functions for more information on the registration
functions.) register_chrdev() takes three arguments: the major device number (an int), the ``name'' of the device
(a string), and the address of the device_fops file_operations structure.
When this is done, and a character or block special file is accessed, the VFS filesystem switch automagically routes the
call, whatever it is, to the proper function, if a function exists. If the function does not exist, the VFS routines take some
default action.
The init() function usually displays some information about the driver, and usually reports all hardware found. All
reporting is done via the printk() function.
Messages
1. using XX_select() for device without interrupts by Elwood Downey
2. found reason for select() problem

3. Why do VFS functions get both structs inode and file? by Reinhold J. Gerharz

Here is a list of many of the most common supporting functions available to the device driver writer. If
you find other supporting functions that are useful, please point them out to me. I know this is not a
complete list, but I hope it is a helpful one.
add_request()
static void add_request(struct blk_dev_struct *dev, struct request *

req)
This is a static function in ll_rw_block.c, and cannot be called by other code. However, an understanding
of this function, as well as an understanding of ll_rw_block(), may help you understand the strategy
routine.
If the device that the request is for has an empty request queue, the request is put on the queue and the
strategy routine is called. Otherwise, the proper place in the queue is chosen and the request is inserted in
the queue, maintaining proper order by insertion sort.
Proper order (the elevator algorithm) is defined as:
1. Reads come before writes.
2. Lower minor numbers come before higher minor numbers.
3. Lower block numbers come before higher block numbers.
The elevator algorithm is implemented by the macro IN_ORDER(), which is defined in
drivers/block/blk.h [This may have changed somewhat recently, but it shouldn't matter to the driver
writer anyway...]
Defined in: drivers/block/ll_rw_block.c
See also: make_request(), ll_rw_block().
add_timer()
void add_timer(struct timer_list * timer)

#include <linux/timer.h>
Installs the timer structures in the list timer in the timer list.
The timer_list structure is defined by:
struct timer_list {
struct timer_list *next;
struct timer_list *prev;
unsigned long expires;
unsigned long data;
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/reference.html (1 di 14) [08/03/2001 10.09.52]

void (*function)(unsigned long);

};
In order to call add_timer(), you need to allocate a timer_list structure, and then call
init_timer(), passing it a pointer to your timer_list. It will nullify the next and prev
elements, which is the correct initialization. If necessary, you can allocate multiple timer_list
structures, and link them into a list. Do make sure that you properly initialize all the unused pointers to
NULL, or the timer code may get very confused.
For each struct in your list, you set three variables:
expires
The number of jiffies (100ths of a second in Linux/86; thousandths or so in Linux/Alpha) after
which to time out.
function
Kernel-space function to run after timeout has occured.
data
Passed as the argument to function when function is called.
Having created this list, you give a pointer to the first (usually the only) element of the list as the
argument to add_timer(). Having passed that pointer, keep a copy of the pointer handy, because you
will need to use it to modify the elements of the list (to set a new timeout when you need a function
called again, to change the function to be called, or to change the data that is passed to the function) and
to delete the timer, if necessary.
Note: This is not process-specific. Therefore, if you want to wake a certain process at a timeout, you will
have to use the sleep and wake primitives. The functions that you install through this mechanism will run
in the same context that interrupt handlers run in.
Defined in: kernel/sched.c
See also: timer_table in include/linux/timer.h, init_timer(), del_timer().
cli()
#define cli() __asm__ __volatile__ ("cli"::)

#include <asm/system.h>
Prevents interrupts from being acknowledged. cli stands for ``CLear Interrupt enable''.
See also: sti()
del_timer
void del_timer(struct timer_list * timer)

#include <linux/timer.h>
Deletes the timer structures in the list timer in the timer list.
The timer list that you delete must be the address of a timer list you have earlier installed with

add_timer(). Once you have called del_timer() to delete the timer from the kernel timer list,
you may deallocate the memory used in the timer_list structures, as it is no longer referenced by the
kernel timer list.
See also: timer_table in include/linux/timer.h, init_timer(), add_timer().
end_request()
static void end_request(int uptodate)

#include "blk.h"
Called when a request has been satisfied or aborted. Takes one argument:
uptodate
If not equal to 0, means that the request has been satisfied.
If equal to 0, means that the request has not been satisfied.
If the request was satisfied (uptodate != 0), end_request() maintains the request list, unlocks
the buffer, and may arrange for the scheduler to be run at the next convenient time (need_resched =
1; this is implicit in wake_up(), and is not explicitly part of end_request()), before waking up all
processes sleeping on the wait_for_request event, which is slept on in make_request(),
ll_rw_page(), and ll_rw_swap_file().
Note: This function is a static function, defined in drivers/block/blk.h for every non-SCSI device that
includes blk.h. (SCSI devices do this differently; the high-level SCSI code itself provides this
functionality to the low-level device-specific SCSI device drivers.) It includes several defines dependent
on static device information, such as the device number. This is marginally faster than a more generic
normal C function.
Defined in: kernel/blk_drv/blk.h
See also: ll_rw_block(), add_request(), make_request().
free_irq()
void free_irq(unsigned int irq)

#include <linux/sched.h>
Frees an irq previously aquired with request_irq() or irqaction(). Takes one argument:
irq
interrupt level to free.
Defined in: kernel/irq.c
See also: request_irq(), irqaction().
get_user()
#define get_user(ptr)
((__typeof__(*(ptr)))__get_user((ptr),sizeof(*(ptr))))

#include <asm/segment.h>
Allows a driver to access data in user space, which is in a different segment than the kernel. Derives the
type of the argument and the return type automatically. This means that you have to use types
correctly. Shoddy typing will simply fail to work.
Note: these functions may cause implicit I/O, if the memory being accessed has been swapped out,
and therefore pre-emption may occur at this point. Do not include these functions in critical sections of
your code even if the critical sections are protected by cli()/sti() pairs, because that implicit I/O
will violate the integrity of your cli()/sti() pair. If you need to get at user-space memory, copy it to
kernel-space memory before you enter your critical section.
These functions take one argument:
addr
Address to get data from.
Returns:
Data at that offset in user space.
Defined in: include/asm/segment.h
See also: memcpy_*fs(), put_user(), cli(), sti().
inb(), inb_p()
inline unsigned int inb(unsigned short port)

inline unsigned int inb_p(unsigned short port)
#include <asm/io.h>
Reads a byte from a port. inb() goes as fast as it can, while inb_p() pauses before returning. Some
devices are happier if you don't read from them as fast as possible. Both functions take one argument:
port
Port to read byte from.
Returns:
The byte is returned in the low byte of the 32-bit integer, and the 3 high bytes are unused, and may
be garbage.
Defined in: include/asm/io.h
See also: outb(), outb_p().
init_timer()
Inline function for initializing timer_list structures for use with add_timer().
Defined in: include/linux/timer.h
See also: add_timer().
irqaction()

int irqaction(unsigned int irq, struct sigaction *new)

Hardware interrupts are really a lot like signals. Therefore, it makes sense to be able to register an
interrupt like a signal. The sa_restorer() field of the struct sigaction is not used, but
otherwise it is the same. The int argument to the sa.handler() function may mean different things,
depending on whether or not the IRQ is installed with the SA_INTERRUPT flag. If it is not installed
with the SA_INTERRUPT flag, then the argument passed to the handler is a pointer to a register
structure, and if it is installed with the SA_INTERRUPT flag, then the argument passed is the number of
the IRQ. For an example of handler set to use the SA_INTERRUPT flag, look at how
rs_interrupt() is installed in drivers/char/serial.c
The SA_INTERRUPT flag is used to determine whether or not the interrupt should be a ``fast'' interrupt.
Normally, upon return from the interrupt, need_resched, a global flag, is checked. If it is set (!= 0),
then schedule() is run, which may schedule another process to run. They are also run with all other
interrupts still enabled. However, by setting the sigaction structure member sa_flags to
SA_INTERRUPT, ``fast'' interrupts are chosen, which leave out some processing, and very specifically
do not call schedule().
irqaction() takes two arguments:
irq
The number of the IRQ the driver wishes to acquire.
new
A pointer to a sigaction struct.
Returns:
-EBUSY if the interrupt has already been acquired,
-EINVAL if sa.handler() is NULL,
0 on success.
See also: request_irq(), free_irq()
IS_*(inode)
IS_RDONLY(inode) ((inode)->i_flags & MS_RDONLY)

IS_NOSUID(inode) ((inode)->i_flags & MS_NOSUID)
IS_NODEV(inode) ((inode)->i_flags & MS_NODEV)
IS_NOEXEC(inode) ((inode)->i_flags & MS_NOEXEC)
IS_SYNC(inode) ((inode)->i_flags & MS_SYNC)
#include <linux/fs.h>
These five test to see if the inode is on a filesystem mounted the corresponding flag.
kfree*()
#define kfree(x) kfree_s((x), 0)

void kfree_s(void * obj, int size)

#include <linux/malloc.h>
Free memory previously allocated with kmalloc(). There are two possible arguments:
obj
Pointer to kernel memory to free.
size
To speed this up, if you know the size, use kfree_s() and provide the correct size. This way,
the kernel memory allocator knows which bucket cache the object belongs to, and doesn't have to
search all of the buckets. (For more details on this terminology, read mm/kmalloc.c.)
[kfree_s() may be obsolete now.]
Defined in: mm/kmalloc.c, include/linux/malloc.h
See also: kmalloc().
kmalloc()
void * kmalloc(unsigned int len, int priority)

#include <linux/kernel.h>
kmalloc() used to be limited to 4096 bytes. It is now limited to 131056 bytes ((32*4096)-16) on
Linux/Intel, and twice that on platforms such as Alpha with 8Kb pages. Buckets, which used to be all
exact powers of 2, are now a power of 2 minus some small number, except for numbers less than or equal
to 128. For more details, see the implementation in mm/kmalloc.c.
kmalloc() takes two arguments:
len
Length of memory to allocate. If the maximum is exceeded, kmalloc will log an error message of
``kmalloc of too large a block (%d bytes).'' and return NULL.
priority
GFP_KERNEL or GFP_ATOMIC. If GFP_KERNEL is chosen, kmalloc() may sleep, allowing
pre-emption to occur. This is the normal way of calling kmalloc(). However, there are cases
where it is better to return immediately if no pages are available, without attempting to sleep to
find one. One of the places in which this is true is in the swapping code, because it could cause
race conditions, and another in the networking code, where things can happen at much faster speed
that things could be handled by swapping to disk to make space for giving the networking code
more memory. The most important reason for using GFP_ATOMIC is if it is being called from an
interrupt, when you cannot sleep, and cannot receive other interrupts.
Returns:
NULL on failure.
Pointer to allocated memory on success.
Defined in: mm/kmalloc.c
See also: kfree()

ll_rw_block()
void ll_rw_block(int rw, int nr, struct buffer_head *bh[])

No device driver will ever call this code: it is called only through the buffer cache. However, an
understanding of this function may help you understand the function of the strategy routine.
After sanity checking, if there are no pending requests on the device's request queue, ll_rw_block()
``plugs'' the queue so that the requests don't go out until all the requests are in the queue, sorted by the
elevator algorithm. make_request() is then called for each request. If the queue had to be plugged,
then the strategy routine for that device is not active, and it is called, with interrupts disabled. It is the
responsibility of the strategy routine to re-enable interrupts.
Defined in: devices/block/ll_rw_block.c
See also: make_request(), add_request().
MAJOR()
#define MAJOR(a) (((unsigned)(a))>>8)

This takes a 16 bit device number and gives the associated major number by shifting off the minor
number.
See also: MINOR().
make_request()
static void make_request(int major, int rw, struct buffer_head *bh)

This is a static function in ll_rw_block.c, and cannot be called by other code. However, an understanding
of this function, as well as an understanding of ll_rw_block(), may help you understand the strategy
routine.
make_request() first checks to see if the request is readahead or writeahead and the buffer is locked.
If so, it simply ignores the request and returns. Otherwise, it locks the buffer and, except for SCSI
devices, checks to make sure that write requests don't fill the queue, as read requests should take
precedence.
If no spaces are available in the queue, and the request is neither readahead nor writeahead,
make_request() sleeps on the event wait_for_request, and tries again when woken. When a
space in the queue is found, the request information is filled in and add_request() is called to
actually add the request to the queue. Defined in: devices/block/ll_rw_block.c
See also: add_request(), ll_rw_block().
MINOR()
#define MINOR(a) ((a)&0xff)

This takes a 16 bit device number and gives the associated minor number by masking off the major
number.
See also: MAJOR().
memcpy_*fs()
inline void memcpy_tofs(void * to, const void * from, unsigned long n)

inline void memcpy_fromfs(void * to, const void * from, unsigned long
n)
Copies memory between user space and kernel space in chunks larger than one byte, word, or long. Be
very careful to get the order of the arguments right!
your code, even if the critical sections are protected by cli()/sti() pairs, because implicit I/O will
violate the cli() protection. If you need to get at user-space memory, copy it to kernel-space memory
before you enter your critical section.
These functions take three arguments:
to
Address to copy data to.
from
Address to copy data from.
n
Number of bytes to copy.
Defined in: include/asm/segment.h
See also: get_user(), put_user(), cli(), sti().
outb(), outb_p()
inline void outb(char value, unsigned short port)

inline void outb_p(char value, unsigned short port)
#include <asm/io.h>
Writes a byte to a port. outb() goes as fast as it can, while outb_p() pauses before returning. Some
devices are happier if you don't write to them as fast as possible. Both functions take two arguments:
value
The byte to write.
port
Port to write byte to.

Defined in: include/asm/io.h

See also: inb(), inb_p().
printk()
int printk(const char* fmt, ...)

printk() is a version of printf() for the kernel, with some restrictions. It cannot handle floats, and
has a few other limitations, which are documented in kernel/vsprintf.c. It takes a variable number of
arguments:
fmt
Format string, printf() style.
...
The rest of the arguments, printf() style.
Returns:
Number of bytes written.
Note: printk() may cause implicit I/O, if the memory being accessed has been swapped out, and
therefore pre-emption may occur at this point. Also, printk() will set the interrupt enable flag, so
never use it in code protected by cli(). Because it causes I/O, it is not safe to use in protected code
anyway, even it if didn't set the interrupt enable flag.
Defined in: kernel/printk.c.
put_user()
#define put_user(x,ptr) __put_user((unsigned

long)(x),(ptr),sizeof(*(ptr)))
Allows a driver to write data in user space, which is in a different segment than the kernel. Derives the
type of the arguments and the storage size automatically. This means that you have to use types
correctly. Shoddy typing will simply fail to work.
your code even if the critical sections are protected by cli()/sti() pairs, because that implicit I/O
will violate the integrity of your cli()/sti() pair. If you need to get at user-space memory, copy it to
kernel-space memory before you enter your critical section.
These functions take two arguments:
val
Value to write
addr

Address to write data to.

Defined in: asm/segment.h
See also: memcpy_*fs(), get_user(), cli(), sti().
register_*dev()
int register_chrdev(unsigned int major, const char *name, struct

file_operations *fops)
int register_blkdev(unsigned int major, const char *name, struct
file_operations *fops)
#include <linux/errno.h>
Registers a device with the kernel, letting the kernel check to make sure that no other driver has already
grabbed the same major number. Takes three arguments:
major
Major number of device being registered.
name
Unique string identifying driver. Used in the output for the /proc/devices file.
fops
Pointer to a file_operations structure for that device. This must not be NULL, or the kernel
will panic later.
Returns:
-EINVAL if major is >= MAX_CHRDEV or MAX_BLKDEV (defined in ), for character or block
devices, respectively.
-EBUSY if major device number has already been allocated.
0 on success.
Defined in: fs/devices.c
See also: unregister_*dev()
request_irq()
int request_irq(unsigned int irq, void (*handler)(int), unsigned long

flags, const char *device)
Request an IRQ from the kernel, and install an IRQ interrupt handler if successful. Takes four arguments:
irq
The IRQ being requested.
handler
The handler to be called when the IRQ occurs. The argument to the handler function will be the

number of the IRQ that it was invoked to handle.

flags
Set to SA_INTERRUPT to request a ``fast'' interrupt or 0 to request a normal, ``slow'' one.
device
A string containing the name of the device driver, device.
Returns:
-EINVAL if irq > 15 or handler = NULL.
-EBUSY if irq is already allocated.
0 on success.
If you need more functionality in your interrupt handling, use the irqaction() function. This uses
most of the capabilities of the sigaction structure to provide interrupt services similar to to the signal
services provided by sigaction() to user-level programs.
See also: free_irq(), irqaction().
select_wait()
inline void select_wait(struct wait_queue **wait_address, select_table

*p)
Add a process to the proper select_wait queue. This function takes two arguments:
wait_address
Address of a wait_queue pointer to add to the circular list of waits.
p
p is NULL, select_wait does nothing, otherwise the current process is put to sleep. This
should be the select_table *wait variable that was passed to your select() function.
Defined in: linux/sched.h
See also: *sleep_on(), wake_up*()
*sleep_on()
void sleep_on(struct wait_queue ** p)

void interruptible_sleep_on(struct wait_queue ** p)
Sleep on an event, putting a wait_queue entry in the list so that the process can be woken on that
event. sleep_on() goes into an uninteruptible sleep: The only way the process can run is to be woken
by wake_up(). interruptible_sleep_on() goes into an interruptible sleep that can be woken
by signals and process timeouts will cause the process to wake up. A call to
wake_up_interruptible() is necessary to wake up the process and allow it to continue running
where it left off. Both take one argument:

p
Pointer to a proper wait_queue structure that records the information needed to wake the
process.
See also: select_wait(), wake_up*().
sti()
#define sti() __asm__ __volatile__ ("sti"::)

#include <asm/system.h>
Allows interrupts to be acknowledged. sti stands for ``SeT Interrupt enable''.
Defined in: asm/system.h
See also: cli().
sys_get*()
int sys_getpid(void)
int sys_getuid(void)
int sys_getgid(void)
int sys_geteuid(void)
int sys_getegid(void)
int sys_getppid(void)
int sys_getpgrp(void)
These system calls may be used to get the information described in the table below, or the information
can be extracted directly from the process table, like this:
foo = current->pid;
pid Process ID
uid User ID
gid Group ID
euid Effective user ID
egid Effective group ID
ppid Process ID of process' parent process
pgid Group ID of process' parent process
The system calls should not be used because they are slower and take more space. Because of this, they
are no longer exported as symbols throughout the whole kernel.
unregister_*dev()

int unregister_chrdev(unsigned int major, const char *name)

int unregister_blkdev(unsigned int major, const char *name)
Removes the registration for a device device with the kernel, letting the kernel give the major number to
some other device. Takes two arguments:
major
Major number of device being registered. Must be the same number given to
register_*dev().
name
Unique string identifying driver. Must be the same number given to register_*dev().
Returns:
-EINVAL if major is >= MAX_CHRDEV or MAX_BLKDEV (defined in <linux/fs.h>), for
character or block devices, respectively, or if there have not been file operations registered for
major device major, or if name is not the same name that the device was registered with.
0 on success.
Defined in: fs/devices.c
See also: register_*dev()
wake_up*()
void wake_up(struct wait_queue ** p)

void wake_up_interruptible(struct wait_queue ** p)
Wakes up a process that has been put to sleep by the matching *sleep_on() function. wake_up()
can be used to wake up tasks in a queue where the tasks may be in a TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE state, while wake_up_interruptible() will only wake up tasks in a
TASK_INTERRUPTIBLE state, and will be insignificantly faster than wake_up() on queues that have
only interruptible tasks. These take one argument:
q
Pointer to the wait_queue structure of the process to be woken.
Note that wake_up() does not switch tasks, it only makes processes that are woken up runnable, so
that the next time schedule() is called, they will be candidates to run.
See also: select_wait(), *sleep_on()
Messages
14. down/up() - semaphores; set/clear/test_bit() by Erez Strauss

13. Bug in printk description! by Theodore Ts'o

12. File access within a device driver? by Paul Osborn
11. man pages for reguest_region() and release_region() (?) by mharrison@i-55.com
10. Can register_*dev() assign an unused major number? by rgerharz@erols.com
1. Register_*dev() can assign an unused major number. by Reinhold J. Gerharz
9. memcpy_*fs(): which way is "fs"? by Reinhold J. Gerharz
1. memcpy_tofs() and memcpy_fromfs() by David Hinds
8. init_wait_queue() by Michael K. Johnson
7. request_irq(...,void *dev_id) by Robert Wilhelm
1. dev_id seems to be for IRQ sharing by Steven Hunyady
6. udelay should be mentioned by Klaus Lindemann
5. vprintk would be nice... by Robert Baruch
1. RE: vprintk would be nice...
4. add_timer function errata? by Tim Ferguson
1. add_timer function errata by Tom Bjorkholm
3. Very short waits by Kenn Humborg
2. Add the kill_xxx() family to Supporting functions? by Burkhard Kohl
1. Allocating large amount of memory by Michael K. Johnson
1. bigphysarea for Linux 2.0? by Greg Hager


Initialization
Besides functions defined by the file_operations structure, there is at least one other function that you will have to
write, the foo_init() function. You will have to change chr_dev_init() in drivers/char/mem.c to call your
foo_init() function.
foo_init() should first call register_chrdev() to register itself and avoid device number contention.
register_chrdev() takes three arguments:
int major
This is the major number which the driver wishes to allocate.
char *name
This is the symbolic name of the driver. This is used, among other things, to report the driver's name in the /proc
filesystem.
struct file_operations *f_ops
This is the address of your file_operations structure.
Returns:
0 if no other character device has registered with the same major number.
non-0 if the call fails, presumably because another character device has already allocated that major number.
Generally, the foo_init() routine will then attempt to detect the hardware that it is supposed to be driving. It should make
sure that all necessary data structures are filled out for all present hardware, and have some way of ensuring that non-present
hardware does not get accessed. [Detail different ways of doing this. In particular, document the request_* and related
functions.]
Interrupts vs. Polling
In a polling driver, the foo_read() and foo_write() functions are pretty easy to write. Here is an example of
foo_write():
static int foo_write(struct inode * inode, struct file * file, char * buf, int count)
{
unsigned int minor = MINOR(inode->i_rdev);
char ret;
while (count > 0) {

ret = foo_write_byte(minor);
if (ret < 0) {
foo_handle_error(WRITE, ret, minor);
continue;
}
buf++ = ret; count--
}
return count;
}
foo_write_byte() and foo_handle_error() are either functions defined elsewhere in foo.c or pseudocode. WRITE
would be a constant or #define.
It should be clear from this example how to code the foo_read() function as well.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/char.html (1 di 3) [08/03/2001 10.09.54]

Interrupt-driven drivers are a little more difficult. Here is an example of a foo_write() that is interrupt-driven:
static int foo_write(struct inode * inode, struct file * file, char * buf, int count)
{
unsigned int minor = MINOR(inode->i_rdev);
unsigned long copy_size;
unsigned long total_bytes_written = 0;
unsigned long bytes_written;
struct foo_struct *foo = &foo_table[minor];
do {
copy_size = (count <= FOO_BUFFER_SIZE ? count : FOO_BUFFER_SIZE);
memcpy_fromfs(foo->foo_buffer, buf, copy_size);
while (copy_size) {
/* initiate interrupts */
if (some_error_has_occured) {
/* handle error condition */
}
current->timeout = jiffies + FOO_INTERRUPT_TIMEOUT;

/* set timeout in case an interrupt has been missed */
interruptible_sleep_on(&foo->foo_wait_queue);
bytes_written = foo->bytes_xfered;
foo->bytes_written = 0;
if (current->signal & ~current->blocked) {
if (total_bytes_written + bytes_written)
return total_bytes_written + bytes_written;
else
return -EINTR; /* nothing was written, system
call was interrupted, try again */
}
}
total_bytes_written += bytes_written;
buf += bytes_written;
count -= bytes_written;
} while (count > 0);
return total_bytes_written;
}
static void foo_interrupt(int irq)

{
struct foo_struct *foo = &foo_table[foo_irq[irq]];
/* Here, do whatever actions ought to be taken on an interrupt.

Look at a flag in foo_table to know whether you ought to be
reading or writing. */
/* Increment foo->bytes_xfered by however many characters were

read or written */

if (buffer too full/empty)

wake_up_interruptible(&foo->foo_wait_queue);
}
Again, a foo_read() function is written analagously. foo_table[] is an array of structures, each of which has several
members, some of which are foo_wait_queue and bytes_xfered, which can be used for both reading and writing.
foo_irq[] is an array of 16 integers, and is used for looking up which entry in foo_table[] is associated with the irq
generated and reported to the foo_interrupt() function.
To tell the interrupt-handling code to call foo_interrupt(), you need to use either request_irq() or
irqaction(). This is either done when foo_open() is called, or if you want to keep things simple, when foo_init()
is called. request_irq() is the simpler of the two, and works rather like an old-style signal handler. It takes two
arguments: the first is the number of the irq you are requesting, and the second is a pointer to your interrupt handler, which
must take an integer argument (the irq that was generated) and have a return type of void. request_irq() returns
-EINVAL if irq > 15 or if the pointer to the interrupt handler is NULL, -EBUSY if that interrupt has already been taken, or 0
on success.
irqaction() works rather like the user-level sigaction(), and in fact reuses the sigaction structure. The
sa_restorer() field of the sigaction structure is not used, but everything else is the same. See the entry for
irqaction() in Supporting Functions, for further information about irqaction().
Messages
3. release() method called when close is called
2. return value of foo_write(...) by My name here
1. TTY drivers by Daniel Taylor
1. Is anything in the works? If not ... by Andrew Manison


[Note: This has not been updated since changes were made in the block device interface to support
block device loadable modules. The changes shouldn't make it impossible for you to apply any of
this...]
To mount a filesystem on a device, it must be a block device driven by a block device driver. This means
that the device must be a random access device, not a stream device. In other words, you must be able to
seek to any location on the physical device at any time.
You do not provide read() and write() routines for a block device. Instead, your driver uses
block_read() and block_write(), which are generic functions, provided by the VFS, which will
call the strategy routine, or request() function, which you write in place of read() and write()
for your driver. This strategy routine is also called by the buffer cache, which is called by the VFS
routines, which is how normal files on normal filesystems are read and written.
Requests for I/O are given by the buffer cache to a routine called ll_rw_block(), which constructs
lists of requests ordered by an elevator algorithm, which sorts the lists to make accesses faster and more
efficient. It, in turn, calls your request() function to actually do the I/O.
Note that although SCSI disks and CDROMs are considered block devices, they are handled specially (as
are all SCSI devices). Refer to Writing a SCSI Driver for details. (Although SCSI disks and CDROMs
are block devices, SCSI tapes, like other tapes, are generally character devices.)
Initialization
Initialization of block devices is a bit more complex than initialization of character devices, especially as
some `ìnitialization'' has to be done at compile time. There is also a register_blkdev() call that
corresponds to the character device register_chrdev() call, which the driver must call to say that
it is present, working, and active.
The file blk.h
At the top of your driver code, after all other included header files, you need to write two lines of code:
#define MAJOR_NR DEVICE_MAJOR

#include "blk.h"
where DEVICE_MAJOR is the major number of your device. drivers/block/blk.h requires the use of the
MAJOR_NR define to set up many other defines and macros for your driver.
Now you need to edit blk.h. Under #ifdef MAJOR_NR, there is a section of defines that are
conditionally included for certain major numbers, protected by #elif (MAJOR_NR ==
DEVICE_MAJOR). At the end of this list, you will add another section for your driver. In that section,
the following lines are required:
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/block.html (1 di 3) [08/03/2001 10.09.55]

#define DEVICE_NAME "device"

#define DEVICE_REQUEST do_dev_request
#define DEVICE_ON(device) /* usually blank, see below */
#define DEVICE_OFF(device) /* usually blank, see below */
#define DEVICE_NR(device) (MINOR(device))
DEVICE_NAME is simply the device name. See the other entries in blk.h for examples.
DEVICE_REQUEST is your strategy routine, which will do all the I/O on the device. See The Strategy
Routine for more details on the strategy routine.
DEVICE_ON and DEVICE_OFF are for devices that need to be turned on and off, like floppies. In fact,
the floppy driver is currently the only device driver which uses these defines.
DEVICE_NR(device) is used to determine the number of the physical device from the minor device
number. For instance, in the hd driver, since the second hard drive starts at minor 64,
DEVICE_NR(device) is defined to be (MINOR(device)>>6).
If your driver is interrupt-driven, you will also set
#define DEVICE_INTR do_dev

which will become a variable automatically defined and used by the remainder of blk.h, specifically by
the SET_INTR() and CLEAR_INTR macros.
You might also consider setting these defines:
#define DEVICE_TIMEOUT DEV_TIMER

#define TIMEOUT_VALUE n
where n is the number of jiffies (clock ticks; hundredths of a second on Linux/386; thousandths or so on
Linux/Alpha) to time out after if no interrupt is received. These are used if your device can become
``stuck'': a condition where the driver waits indefinitely for an interrupt that will never arrive. If you
define these, they will automatically be used in SET_INTR to make your driver time out. Of course,
your driver will have to be able to handle the possibility of being timed out by a timer.
Recognizing PC standard partitions
[Inspect the routines in genhd.c and include detailed, correct instructions on how to use them to
allow your device to use the standard dos partitioning scheme. By now, bsd disklabel and sun's
SMD labelling are also supported, and I still haven't gotten around to documenting this. Shame on
me--but people seem to have been able to figure it out anyway :-)]
The Buffer Cache
[Here, it should be explained briefly how ll_rw_block() is called, about getblk() and
bread() and breada() and bwrite(), etc. A real explanation of the buffer cache is reserved
for the VFS reference section. Jean-Marc Lugrin wrote one, but I can't find him now.]

The Strategy Routine
All reading and writing of blocks is done through the strategy routine. This routine takes no arguments
and returns nothing, but it knows where to find a list of requests for I/O (CURRENT, defined by default as
blk_dev[MAJOR_NR].current_request), and knows how to get data from the device into the
blocks. It is called with interrupts disabled so as to avoid race conditions, and is responsible for turning
on interrupts with a call to sti() before returning.
The strategy routine first calls the INIT_REQUEST macro, which makes sure that requests are really on
the request list and does some other sanity checking. add_request() will have already sorted the
requests in the proper order according to the elevator algorithm (using an insertion sort, as it is called
once for every request), so the strategy routine ``merely'' has to satisfy the request, call
end_request(1), which will take the request off the list, and then if there is still another request on
the list, satisfy it and call end_request(1), until there are no more requests on the list, at which time
it returns.
If the driver is interrupt-driven, the strategy routine need only schedule the first request to occur, and
have the interrupt-handler call end_request(1) and the call the strategy routine again, in order to
schedule the next request. If the driver is not interrupt-driven, the strategy routine may not return until all
I/O is complete.
If for some reason I/O fails permanently on the current request, end_request(0) must be called to
destroy the request.
A request may be for a read or write. The driver determines whether a request is for a read or write by
examining CURRENT->cmd. If CURRENT->cmd == READ, the request is for a read, and if
CURRENT->cmd == WRITE, the request is for a write. If the device has seperate interrupt routines for
handling reads and writes, SET_INTR(n) must be called to assure that the proper interrupt routine will
be called.
[Here I need to include samples of both a polled strategy routine and an interrupt-driven one. The
interrupt-driven one should provide seperate read and write interrupt routines to show the use of
SET_INTR.]
Messages
1. non-block-cached block device? by Neal Tucker
2. Shall I explain elevator algorithm (+sawtooth etc) by Michael De La Rue

Annotated Bibliography
This annotated bibliography covers books on operating system theory as well as different kinds of
programming in a Unix environment. The price marked may or may not be an exact price, but should be
close enough for government work. If you have a book that you think should go in the bibliography,
please write a short review of it and send all the necessary information (title, author, publisher,
ISBN, and approximate price) and the review to johnsonm@redhat.com
The Design of the UNIX Operating System
Author: Maurice J. Bach

Publisher: Prentice Hall, 1986
ISBN: 0-13-201799-7
Price: $65.00
This is one of the books that Linus used to design Linux. It is a description of the data structures used in
the System V kernel. Many of the names of the important functions in the Linux source come from this
book, and are named after the algorithms presented here. For instance, if you can't quite figure out what
exactly getblk(), brelse(), bread(), breada(), and bwrite() are, chapter 3 explains very well.
While most of the algorithms are similar or the same, a few differences are worth noting:
● The Linux buffer cache is dynamically resized, so the algorithm for dealing with getting new
buffers is a bit different. Therefore the above referenced explanation of getblk() is a little different
than the getblk() in Linux.
● Linux does not currently use streams, and if/when streams are implemented for Linux, they are
likely to have somewhat different semantics.
● The semantics and calling structure for device drivers is different. The concept is similar, and the
chapter on device drivers is still worth reading, but for details on the device driver structures, the
KHG is the proper reference.
● The memory management algorithms are somewhat different.
There are other small differences as well, but a good understanding of this text will help you understand
the Linux source.
Advanced Programming in the UNIX Environment
Author: W. Richard Stevens

Publisher: Addison Wesley, 1992
ISBN: 0-201-56317-7
Price: $50.00
This excellent tome covers the stuff you really have to know to write real Unix programs. It includes a
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib.html (1 di 5) [08/03/2001 10.09.56]

discussion of the various standards for Unix implementations, including POSIX, X/Open XPG3, and
FIPS, and concentrates on two implementations, SVR4 and pre-release 4.4 BSD, which it refers to as
4.3+BSD. The book concentrates heavily on application and fairly complete specification, and notes
which features relate to which standards and releases.
The chapters include: Unix Standardization and Implementations, File I/O, Files and Directories,
Standard I/O Library, System Data Files and Information, The Environment of a Unix Process, Process
Control, Process Relationships, Signals, Terminal I/O, Advanced I/O (non-blocking, streams, async,
memory-mapped, etc.), Daemon Processes, Interprocess Communication, Advanced Interprocess
Communication, and some example applications, including chapters on A Database Library,
Commmunicating with a PostScript Printer, A Modem Dialer, and then a seemingly misplaced final
chapter on Pseudo Terminals.
I have found that this book makes it possible for me to write useable programs for Unix. It will help you
achieve POSIX compliance in ways that won't break SVR4 or BSD, as a general rule. This book will
save you ten times its cost in frustration.
Advanced 80386 Programming Techniques
Author: James L. Turley

Publisher: Osborne McGraw-Hill, 1988
ISBN: 0-07-881342-5
Price: $22.95
This book covers the 80386 quite well, without touching on any other hardware. Some code samples are
included. All major features are covered, as are many of the concepts needed. The chapters of this book
are: Basics, Memory Segmentation, Privilege Levels, Paging, Multitasking, Communicating Among
Tasks, Handling Faults and Interrupts, 80286 Emulation, 8086 Emulation, Debugging, The 80387
Numeric Processor Extension, Programming for Performance, Reset and Real Mode, Hardware, and a
few appendices, including tables of the memory management structures as a handy reference.
The author has a good writing style: If you are technically minded, you will find yourself caught up just
reading this book. One strong feature of this book for Linux is that the author is very careful not to
explain how to do things under DOS, nor how to deal with particular hardware. In fact, the only times he
mentions DOS and PC-compatible hardware are in the introduction, where he promises never to mention
them again.
The C Programming Language, second edition
Author: Brian W. Kernighan and Dennis M. Ritchie

ISBN: 0-13-110362-8 (paper) 0-13-110370-9 (hard)
Price: $35.00
The C programming bible. Includes a C tutorial, Unix interface reference, C reference, and standard
library reference.

You program in C, you buy this book. It's that simple.
Operating Systems: Design and Implementation
Author: Andrew S. Tanenbaum

ISBN: 0-13-637406-9
Price: $50.00
This book, while a little simplistic in spots, and missing some important ideas, is a fairly clear exposition
of what it takes to write an operating system. Half the book is taken up with the source code to a Unix
clone called Minix, which is based on a microkernel, unlike Linux, which sports a monolithic design. It
has been said that Minix shows that it is possible to to write a microkernel-based Unix, but does not
adequately explain why one would do so.
Linux was originally intended to be a free Minix replacement (Linus' Minix, Linus tells us). In fact, it
was originally to be binary-compatible with Minix-386. Minix-386 was the development environment
under which Linux was bootstrapped. No Minix code is in Linux, but vesitiges of this heritage live on in
such things as the minix filesystem in Linux.
However, this book might still prove worthwhile for those who want a basic explanation of OS concepts,
as Tanenbaum's explanations of the basic concepts remain some of the clearer (and more entertaining, if
you like to be entertained) available. Unfortunately, basic is the key work here, as many things such as
virtual memory are not covered at all.
Modern Operating Systems
Author: Andrew S. Tanenbaum

ISBN: 0-13-588187-0
Price: $51.75
The first half of this book is a rewrite of Tanenbaum's earlier Operating Systems, but this book covers
several things that the earlier book missed, including such things as virtual memory. Minix is not
included, but overviews of MS-DOS and several distributed systems are. This book is probably more
useful to someone who wants to do something with his or her knowlege than Tanenbaum's earlier
Operating Systems: Design and Implementation. Some clue as to the reason may be found in the title...
However, what DOS is doing in a book on modern operating systems, many have failed to discover.
Operating Systems
Author: William Stallings

Publisher: Macmillan, 1992 (800-548-9939)
ISBN: 0-02-415481-4
Price: No one at Macmillan could find one...

A very thorough text on operating systems, this book gives more in-depth coverage of the topics covered
in Tannebaum's books, and covers more topics, in a much brisker style. This book covers all the major
topics that you would need to know to build an operating system, and does so in a clear way. The author
uses examples from three major systems, comparing and contrasting them: Unix, OS/2, and MVS. With
each topic covered, these example systems are used to clarify the points and provide an example of an
implementation.
Topics covered in Operating Systems include threads, real-time systems, multiprocessor scheduling,
distributed systems, process migration, and security, as well as the standard topics like memory
management and scheduling. The section on distributed processing appears to be up-to-date, and I found
it very helpful.
UNIX Network Programming
Author: W. Richard Stevens

ISBN: 0-13-949876-1
Price: $48.75
This book covers several kinds of networking under Unix, and provides very thorough references to the
forms of networking that it does not cover directly. It covers TCP/IP and XNS most heavily, and fairly
exhaustively describes how all the calls work. It also has a description and sample code using System V's
TLI, and pretty complete coverage of System V IPC. This book contains a lot of source code examples to
get you started, and many useful proceedures. One example is code to provide useable semaphores, based
on the partially broken implementation that System V provides.
Programming in the UNIX environment
Author: Brian W. Kernighan and Robert Pike

ISBN: 0-13-937699 (hardcover) 0-13-937681-X (paperback)
Price: ?
Writing UNIX Device Drivers
Author: George Pajari

Publisher: Addison Wesley, 1992
ISBN: 0-201-52374-4
Price: $32.95
This book is written by the President and founder of Driver Design Labs, a company which specializes in
the development of Unix device drivers. This book is an excellent introduction to the sometimes wacky
world of device driver design. The four basic types of drivers (character, block, tty, STREAMS) are first
discussed briefly. Many full examples of device drivers of all types are given, starting with the simplest

and progressing in complexity. All examples are of drivers which deal with Unix on PC-compatible
hardware.
Chapters include: Character Drivers I: A Test Data Generator Character Drivers II: An A/D Converter
Character Drivers III: A Line Printer Block Drivers I: A Test Data Generator Block Drivers II: A RAM
Disk Driver Block Drivers III: A SCSI Disk Driver Character Drivers IV: The Raw Disk Driver Terminal
Drivers I: The COM1 Port Character Drivers V: A Tape Drive STREAMS Drivers I: A Loop-Back
Driver STREAMS Drivers II: The COM1 Port (Revisited) Driver Installation Zen and the Art of Device
Driver Writing
Although many of the calls used in the book are not Linux-compatible, the general idea is there, and
many of the ideas map directly into Linux.
Copyright (C) 1992, 1993, 1996 Michael K. Johnson, johnsonm@redhat.com.
Messages
1. Please replace K&R reference by Harbison/Steele by Markus Kuhn
1. Replace, no; supplement, yes by Michael K. Johnson
-> Right you are Mike! by rohit patil
2. 80386 book is apparently out of print now by Austin Donnelly
1. Very unfortunate by Michael K. Johnson
3. Linux Kernel Internals-> Kernel MM IPC fs drivers net modules by Alex Stewart

using XX_select() for device without interrupts

Forum: Device Driver Basics
Keywords: select interrupts polling sleeping
Date: Thu, 25 Jul 1996 14:59:48 GMT
From: Elwood Downey <ecdowney@noao.edu>
Hello;
I have need for a select() entry point in my driver but my

device is not using interrupts so I'm not sure how to have
the os call my select() to let me poll the device. I think I must use a timer,
with select_wait(), so the system will call my select() until my device becomes
active. The trouble is I can not seem to get the timer work. The entire system
hangs _solid_ whenever it gets activated.
Below is my select() code. I use wake_up_interruptible() as the function the

timer will call in the future to just make this process runnable again, in lieu
of calling it from an interrupt service routine. A few specific questions:
1) my driver permits several processes to have the device open at once. Am I

correct in assuming that if this general approach works I will need a
separate timer_list and wait_queue for each open process instance?
2) In no examples do I ever see the wait_queue pointer ever _set_ to point at

an actual wait_queue instance. Is this correct?
Any comments would be greatly appreciated. Rememeber, the only real goal here
is some way to get the os to call us occasionally to let us poll the device,
but the device is not using interrupts.
Thank you in advance;
Elwood Downey
static int
pc39_select (struct inode *inode, struct file *file, int sel_type,
select_table *wait)
{
static struct timer_list pc39_tl;
static struct wait_queue *pc39_wq;
switch (sel_type) {
case SEL_EX:
return (0); /* never any exceptions */
case SEL_IN:
if (IBF())
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/basics/1.html (1 di 2) [08/03/2001 10.09.57]

return (1);
break;
case SEL_OUT:
if (TBE())
return (1);
break;
}
/* nothing ready -- set timer to try again later if necessary */

if (wait) {
init_timer (&pc39_tl);
pc39_tl.expires = PC39_SELTO;
pc39_tl.function = (void(*)(unsigned long))wake_up_interruptible;
pc39_tl.data = (unsigned long) &pc39_wq;
add_timer (&pc39_tl);
select_wait (&pc39_wq, wait);
}
return (0);
}
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/basics/1.html (2 di 2) [08/03/2001 10.09.57]

found reason for select() problem
found reason for select() problem

Keywords: select add_timer() del_timer()
Date: Wed, 13 Nov 1996 14:45:59 GMT
From: <unknown>
Hello again;
Evidently not many folks read this -- no responses after

4 months -- so I'll answer my own question :-)
There were several problems with the original approach. These

were all discovered through trial-and-error so I suppose
there might still be other theoretical problems but at least
now everthing seems to work.
1) call del_timer(&pc39_tl) before starting a new one.

2) always call select_wait (&pc39_wq, wait), not just when
wait != 0.
3) pc39_tl.expires is the jiffy to wake up on, not the number
of elapsed jiffies as it says in the KHG. so, it should be:
pc39_tl.expires = jiffies + PC39_SELTO;
Hope this helps someone else someday. If this is getting

too hard to follow, I'll be happy to send you the whole
driver.
Elwood Downey
ecdowney@noao.edu
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/basics/2.html [08/03/2001 10.09.58]

Why do VFS functions get both structs inode and file?
Why do VFS functions get both structs inode and

file?
Date: Thu, 09 Jan 1997 05:47:10 GMT
It appears that "struct file" contains a "struct inode *", yet both are passed to the VFS functions. Why
not simply pass "struct file *" alone?
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/basics/3.html [08/03/2001 10.09.59]

release() method called when close is called
release() method called when close is called

Forum: Character Device Drivers
Keywords: release method close fclose
Date: Sat, 26 Apr 1997 03:07:04 GMT
From: <unknown>
I just finished a character device driver and I it appears that when fclose()
is called on the device the release() method is called as well even if the device has
been opened multiple times.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/char/3.html [08/03/2001 10.09.59]

return value of foo_write(...)
return value of foo_write(...)

Keywords: return values
Date: Fri, 25 Apr 1997 21:42:46 GMT
From: My name here <wicksr@swami.indy.tce.com>
In this section I noticed the example foo_write function returns 0 all the time. If I do this as well with
my driver and do this:
echo "test" > /dev/foo_drv
the foo_write () function gets called indefinately. Furthermore, I have noticed that from the source of
serial.c (from /usr/src/linux-2.0.0/drivers/char) always returns the number of characters transmitted.
Do you have a typo?
Also, why isn't there a DEFINITIVE list of return values for all functions? This is a bit confusing, but
still much better than programming under NT :).
Thanks for the documentation anyhow!
-Rich

TTY drivers
TTY drivers
Keywords: serial tty section
Date: Fri, 27 Sep 1996 18:48:12 GMT
From: Daniel Taylor <danielt@dgii.com>
It is noted in several places that there is no section for serial drivers, and yet in this new medium there
is not even a pointer to get started from. As the number of these drivers is increasing, even a bodiless
section of the KHG would be useful, it can be entirely filled online.
Messages
1. Is anything in the works? If not ... by Andrew Manison

Is anything in the works? If not ...
Is anything in the works? If not ...

Re: TTY drivers (Daniel Taylor)
Keywords: serial tty section
Date: Fri, 13 Dec 1996 04:28:33 GMT
From: Andrew Manison <amanison@america.net>
I am in the process of writing a device driver for an intelligent multiport serial I/O controller. I am
willing to write a section on tty drivers for the KHG if no-one else is. Let me know!
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/char/1/1.html [08/03/2001 10.10.03]


Copyright (C) 1993 Rickard E. Faith (faith@cs.unc.edu).
Written at the University of North Carolina, 1993, for COMP-291. The information contained herein
comes with ABSOLUTELY NO WARRANTY.
All rights reserved. Permission is granted to make and distribute verbatim copies of this paper provided
the copyright notice and this permission notice are preserved on all copies.
This is (with the author's explicit permission) a modified copy of the original document. If you wish to reproduce this document, you are advised to get the original
version by ftp from ftp://ftp.cs.unc.edu/pub/users/faith/papers/scsi.paper.tar.gz
[Note that this document has not been revised since its copyright date of 1993. Most things still
apply, but some of the facts like the list of currently supported SCSI host adaptors are rather out
of date by now.]
Why You Want to Write a SCSI Driver
Currently, the Linux kernel contains drivers for the following SCSI host adapters: Adaptec 1542,
Adaptec 1740, Future Domain TMC-1660/TMC-1680, Seagate ST-01/ST-02, UltraStor 14F, and
Western Digital WD-7000. You may want to write your own driver for an unsupported host adapter. You
may also want to re-write or update one of the existing drivers.
What is SCSI?
The foreword to the SCSI-2 standard draft [ANS] gives a succinct definition of the Small Computer
System Interface and briefly explains how SCSI-2 is related to SCSI-1 and CCS:
The SCSI protocol is designed to provide an efficient peer-to-peer I/O bus with up to 8
devices, including one or more hosts. Data may be transferred asynchronously at rates that
only depend on device implementation and cable length. Synchronous data transfers are
supported at rates up to 10 mega-transfers per second. With the 32 bit wide data transfer
option, data rates of up to 40 megabytes per second are possible.
SCSI-2 includes command sets for magnetic and optical disks, tapes, printers, processors,
CD-ROMs, scanners, medium changers, and communications devices.
In 1985, when the first SCSI standard was being finalized as an American National
Standard, several manufacturers approached the X3T9.2 Task Group. They wanted to
increase the mandatory requirements of SCSI and to define further features for direct-access
devices. Rather than delay the SCSI standard, X3T9.2 formed an ad hoc group to develop a
working paper that was eventually called the Common Command Set (CCS). Many disk
products were designed using this working paper in conjunction with the SCSI standard.
In parallel with the development of the CCS working paper, X3T9.2 began work on an
enhanced SCSI standard which was named SCSI-2. SCSI-2 included the results of the CCS
working paper and extended them to all device types. It also added caching commands,
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/scsi.html (1 di 19) [08/03/2001 10.10.08]

performance enhancement features, and other functions that X3T9.2 deemed worthwhile.
While SCSI-2 has gone well beyond the original SCSI standard (now referred to as SCSI-1),
it retains a high degree of compatibility with SCSI-1 devices.
SCSI phases
The ``SCSI bus'' transfers data and state information between interconnected SCSI devices. A single
transaction between an `ìnitiator'' and a ``target'' can involve up to 8 distinct ``phases.'' These phases are
almost entirely determined by the target (e.g., the hard disk drive). The current phase can be determined
from an examination of five SCSI bus signals, as shown in this table [LXT91, p. 57].
-SEL -BSY -MSG -C/D -I/O PHASE
HI HI ? ? ? BUS FREE
HI LO ? ? ? ARBITRATION
I I&T ? ? ? SELECTION
T I&T ? ? ? RESELECTION
HI LO HI HI HI DATA OUT
HI LO HI HI LO DATA IN
HI LO HI LO HI COMMAND
HI LO HI LO LO STATUS
HI LO LO LO HI MESSAGE OUT
HI LO LO LO LO MESSAGE IN
I = Initiator Asserts, T = Target Asserts, ? = HI or LO
Some controllers (notably the inexpensive Seagate controller) require direct manipulation of the SCSI
bus--other controllers automatically handle these low-level details. Each of the eight phases will be
described in detail.
BUS FREE Phase
The BUS FREE phase indicates that the SCSI bus is idle and is not currently being used.
ARBITRATION Phase
The ARBITRATION phase is entered when a SCSI device attempts to gain control of the SCSI
bus. Arbitration can start only if the bus was previously in the BUS FREE phase. During
arbitration, the arbitrating device asserts its SCSI ID on the DATA BUS. For example, if the
arbitrating device's SCSI ID is 2, then the device will assert 0x04. If multiple devices attempt
simultaneous arbitration, the device with the highest SCSI ID will win. Although ARBITRATION
is optional in the SCSI-1 standard, it is a required phase in the SCSI-2 standard.
SELECTION Phase
After ARBITRATION, the arbitrating device (now called the initiator) asserts the SCSI ID of the
target on the DATA BUS. The target, if present, will acknowledge the selection by raising the
-BSY line. This line remains active as long as the target is connected to the initiator.
RESELECTION Phase

The SCSI protocol allows a device to disconnect from the bus while processing a request. When
the device is ready, it reconnects to the host adapter. The RESELECTION phase is identical to the
SELECTION phase, with the exception that it is used by the disconnected target to reconnect to
the original initiator. Drivers which do not currently support RESELECTION do not allow the
SCSI target to disconnect. RESELECTION should be supported by all drivers, however, so that
multiple SCSI devices can simultaneously process commands. This allows dramatically increased
throughput due to interleaved I/O requests.
COMMAND Phase
During this phase, 6, 10, or 12 bytes of command information are transferred from the initiator to
the target.
DATA OUT and DATA IN Phases
During these phases, data are transferred between the initiator and the target. For example, the
DATA OUT phase transfers data from the host adapter to the disk drive. The DATA IN phase
transfers data from the disk drive to the host adapter. If the SCSI command does not require data
transfer, then neither phase is entered.
STATUS Phase
This phase is entered after completion of all commands, and allows the target to send a status byte
to the initiator. There are nine valid status bytes, as shown in the table below [ANS, p. 77]. Note
that since bits 1-5 (bit 0 is the least significant bit) are used for the status code (the other bits are
reserved), the status byte should be masked with 0x3e before being examined.
Value* Status
0x00 GOOD
0x02 CHECK CONDITION
0x04 CONDITION MET
0x08 BUSY
0x10 INTERMEDIATE
0x14 INTERMEDIATE-CONDITION MET
0x18 RESERVATION CONFLICT
0x22 COMMAND TERMINATED
0x28 QUEUE FULL
*After masking with 0x3e
The meanings of the three most important status codes are outlined below:
GOOD
The operation completed successfully.
CHECK CONDITION
An error occurred. The REQUEST SENSE command should be used to find out more
information about the error (see SCSI Commands).
BUSY

The device was unable to accept a command. This may occur during a self-test or shortly
after power-up.
MESSAGE OUT and MESSAGE IN Phases
Additional information is transferred between the target and the initiator. This information may
regard the status of an outstanding command, or may be a request for a change of protocol.
Multiple MESSAGE IN and MESSAGE OUT phases may occur during a single SCSI transaction.
If RESELECTION is supported, the driver must be able to correctly process the SAVE DATA
POINTERS, RESTORE POINTERS, and DISCONNECT messages. Although required by the
SCSI-2 standard, some devices do not automatically send a SAVE DATA POINTERS message
prior to a DISCONNECT message.
SCSI Commands
Each SCSI command is 6, 10, or 12 bytes long. The following commands must be well understood by a
SCSI driver developer.
REQUEST SENSE
Whenever a command returns a CHECK CONDITION status, the high-level Linux SCSI code
automatically obtains more information about the error by executing the REQUEST SENSE. This
command returns a sense key and a sense code (called the `àdditional sense code,'' or ASC, in the
SCSI-2 standard [ANS]). Some SCSI devices may also report an `àdditional sense code qualifier''
(ASCQ). The 16 possible sense keys are described in the next table. For information on the ASC
and ASCQ, please refer to the SCSI standard [ANS] or to a SCSI device technical manual.
Sense Key Description
0x00 NO SENSE
0x01 RECOVERED ERROR
0x02 NOT READY
0x03 MEDIUM ERROR
0x04 HARDWARE ERROR
0x05 ILLEGAL REQUEST
0x06 UNIT ATTENTION
0x07 DATA PROTECT
0x08 BLANK CHECK
0x09 (Vendor specific error)
0x0a COPY ABORTED
0x0b ABORTED COMMAND
0x0c EQUAL
0x0d VOLUME OVERFLOW
0x0e MISCOMPARE
0x0f RESERVED

TEST UNIT READY

This command is used to test the target's status. If the target can accept a medium-access command
(e.g., a READ or a WRITE), the command returns with a GOOD status. Otherwise, the command
returns with a CHECK CONDITION status and a sense key of NOT READY. This response
usually indicates that the target is completing power-on self-tests.
INQUIRY
This command returns the target's make, model, and device type. The high-level Linux code uses
this command to differentiate among magnetic disks, optical disks, and tape drives (the high-level
code currently does not support printers, processors, or juke boxes).
READ and WRITE
These commands are used to transfer data from and to the target. You should be sure your driver
can support simpler commands, such as TEST UNIT READY and INQUIRY, before attempting to
use the READ and WRITE commands.
Getting Started
The author of a low-level device driver will need to have an understanding of how interruptions are
handled by the kernel. At minimum, the kernel functions that disable (cli()) and enable (sti())
interruptions should be understood. The scheduling functions (e.g., schedule(), sleepon(), and
wakeup()) may also be needed by some drivers. A detailed explanation of these functions can be found
in Supporting Functions.
Before You Begin: Gathering Tools
Before you begin to write a SCSI driver for Linux, you will need to obtain several resources.
The most important is a bootable Linux system--preferably one which boots from an IDE, RLL, or MFM
hard disk. During the development of your new SCSI driver, you will rebuild the kernel and reboot your
system many times. Programming errors may result in the destruction of data on your SCSI drive and on
your non-SCSI drive. Back up your system before you begin.
The installed Linux system can be quite minimal: the GCC compiler distribution (including libraries and
the binary utilities), an editor, and the kernel source are all you need. Additional tools like od,
hexdump, and less will be quite helpful. All of these tools will fit on an inexpensive 20-30~MB hard
disk. (A used 20 MB MFM hard disk and controller should cost less than US$100.)
Documentation is essential. At minimum, you will need a technical manual for your host adapter. Since
Linux is freely distributable, and since you (ideally) want to distribute your source code freely, avoid
non-disclosure agreements (NDA). Most NDA's will prohibit you from releasing your source code--you
might be allowed to release an object file containing your driver, but this is simply not acceptable in the
Linux community at this time.
A manual that explains the SCSI standard will be helpful. Usually the technical manual for your disk
drive will be sufficient, but a copy of the SCSI standard will often be helpful. (The October 17, 1991,
draft of the SCSI-2 standard document is available via anonymous ftp from sunsite.unc.edu in
/pub/Linux/development/scsi-2.tar.Z, and is available for purchase from Global

Engineering Documents (2805 McGaw, Irvine, CA 92714), (800)-854-7179 or (714)-261-1455. Please

refer to document X3.131-199X. In early 1993, the manual cost US$60--70.)
Before you start, make hard copies of hosts.h, scsi.h, and one of the existing drivers in the Linux
kernel. These will prove to be useful references while you write your driver.
The Linux SCSI Interface
The high-level SCSI interface in the Linux kernel manages all of the interaction between the kernel and
the low-level SCSI device driver. Because of this layered design, a low-level SCSI driver need only
provide a few basic services to the high-level code. The author of a low-level driver does not need to
understand the intricacies of the kernel I/O system and, hence, can write a low-level driver in a relatively
short amount of time.
Two main structures (Scsi_Host and Scsi_Cmnd) are used to communicate between the high-level
code and the low-level code. The next two sections provide detailed information about these structures
and the requirements of the low-level driver.
The Scsi_Host Structure
The Scsi_Host structure serves to describe the low-level driver to the high-level code. Usually, this
description is placed in the device driver's header file in a C preprocessor definition:
#define FDOMAIN_16X0 { "Future Domain TMC-16x0", \

fdomain_16x0_detect, \
fdomain_16x0_info, \
fdomain_16x0_command, \
fdomain_16x0_queue, \
fdomain_16x0_abort, \
fdomain_16x0_reset, \
NULL, \
fdomain_16x0_biosparam, \
1, 6, 64, 1 ,0, 0}
#endif
The Scsi_Host structure is presented next. Each of the fields will be explained in detail later in this
section.
typedef struct
{
char *name;
int (* detect)(int);
const char *(* info)(void);
int (* queuecommand)(Scsi_Cmnd *,
void (*done)(Scsi_Cmnd *));
int (* command)(Scsi_Cmnd *);
int (* abort)(Scsi_Cmnd *, int);

int (* reset)(void);
int (* slave_attach)(int, int);
int (* bios_param)(int, int, int []);
int can_queue;
int this_id;
short unsigned int sg_tablesize;
short cmd_per_lun;
unsigned present:1;
unsigned unchecked_isa_dma:1;
} Scsi_Host;
Variables in the Scsi_Host structure
In general, the variables in the Scsi_Host structure are not used until after the detect() function
(see section detect()) is called. Therefore, any variables which cannot be assigned before host
adapter detection should be assigned during detection. This situation might occur, for example, if a single
driver provided support for several host adapters with very similar characteristics. Some of the
parameters in the Scsi_Host structure might then depend on the specific host adapter detected.
name
name holds a pointer to a short description of the SCSI host adapter.
can_queue
can_queue holds the number of outstanding commands the host adapter can process. Unless
RESELECTION is supported by the driver and the driver is interrupt-driven, (some of the early Linux
drivers were not interrupt driven and, consequently, had very poor performance) this variable should be
set to 1.
this_id
Most host adapters have a specific SCSI ID assigned to them. This SCSI ID, usually 6 or 7, is used for
RESELECTION. The this_id variable holds the host adapter's SCSI ID. If the host adapter does not
have an assigned SCSI ID, this variable should be set to -1 (in this case, RESELECTION cannot be
supported).
sg_tablesize
The high-level code supports ``scatter-gather,'' a method of increasing SCSI throughput by combining
many small SCSI requests into a few large SCSI requests. Since most SCSI disk drives are formatted
with 1:1 interleave, (``1:1 interleave'' means that all of the sectors in a single track appear consecutively
on the disk surface) the time required to perform the SCSI ARBITRATION and SELECTION phases is
longer than the rotational latency time between sectors. (This may be an over-simplification. On older
devices, the actual command processing can be significant. Further, there is a great deal of layered
overhead in the kernel: the high-level SCSI code, the buffering code, and the file-system code all
contribute to poor SCSI performance.) Therefore, only one SCSI request can be processed per disk

revolution, resulting in a throughput of about 50 kilobytes per second. When scatter-gather is supported,
however, average throughput is usually over 500 kilobytes per second.
The sg_tablesize variable holds the maximum allowable number of requests in the scatter-gather
list. If the driver does not support scatter-gather, this variable should be set to SG_NONE. If the driver
can support an unlimited number of grouped requests, this variable should be set to SG_ALL. Some
drivers will use the host adapter to manage the scatter-gather list and may need to limit sg_tablesize
to the number that the host adapter hardware supports. For example, some Adaptec host adapters require
a limit of 16.
cmd_per_lun
The SCSI standard supports the notion of ``linked commands.'' Linked commands allow several
commands to be queued consecutively to a single SCSI device. The cmd_per_lun variable specifies
the number of linked commands allowed. This variable should be set to 1 if command linking is not
supported. At this time, however, the high-level SCSI code will not take advantage of this feature.
Linked commands are fundamentally different from multiple outstanding commands (as described by the
can_queue variable). Linked commands always go to the same SCSI target and do not necessarily
involve a RESELECTION phase. Further, linked commands eliminate the ARBITRATION,
SELECTION, and MESSAGE OUT phases on all commands after the first one in the set. In contrast,
multiple outstanding commands may be sent to an arbitrary SCSI target, and require the
ARBITRATION, SELECTION, MESSAGE OUT, and RESELECTION phases.
present
The present bit is set (by the high-level code) if the host adapter is detected.
unchecked_isa_dma
Some host adapters use Direct Memory Access (DMA) to read and write blocks of data directly from or
to the computer's main memory. Linux is a virtual memory operating system that can use more than 16
MB of physical memory. Unfortunately, on machines using the ISA bus (the so-called `Ìndustry
Standard Architecture'' bus was introduced with the IBM PC/XT and IBM PC/AT computers), DMA is
limited to the low 16 MB of physical memory.
If the unchecked_isa_dma bit is set, the high-level code will provide data buffers which are
guaranteed to be in the low 16 MB of the physical address space. Drivers written for host adapters that do
not use DMA should set this bit to zero. Drivers specific to EISA bus (the `Èxtended Industry Standard
Architecture'' bus is a non-proprietary 32-bit bus for 386 and i486 machines) machines should also set
this bit to zero, since EISA bus machines allow unrestricted DMA access.
Functions in the Scsi_Host Structure
detect()
The detect() function's only argument is the ``host number,'' an index into the scsi_hosts
variable (an array of type struct Scsi_Host). The detect() function should return a non-zero

value if the host adapter is detected, and should return zero otherwise.
Host adapter detection must be done carefully. Usually the process begins by looking in the ROM area
for the ``BIOS signature'' of the host adapter. On PC/AT-compatible computers, the use of the address
space between 0xc0000 and 0xfffff is fairly well defined. For example, the video BIOS on most
machines starts at 0xc0000 and the hard disk BIOS, if present, starts at 0xc8000. When a
PC/AT-compatible computer boots, every 2-kilobyte block from 0xc0000 to 0xf8000 is examined for
the 2-byte signature (0x55aa) which indicates that a valid BIOS extension is present [Nor85].
The BIOS signature usually consists of a series of bytes that uniquely identifies the BIOS. For example,
one Future Domain BIOS signature is the string
FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89

found exactly five bytes from the start of the BIOS block.
After the BIOS signature is found, it is safe to test for the presence of a functioning host adapter in more
specific ways. Since the BIOS signatures are hard-coded in the kernel, the release of a new BIOS can
cause the driver to mysteriously fail. Further, people who use the SCSI adapter exclusively for Linux
may want to disable the BIOS to speed boot time. For these reasons, if the adapter can be detected safely
without examining the BIOS, then that alternative method should be used.
Usually, each host adapter has a series of I/O port addresses which are used for communications.
Sometimes these addresses will be hard coded into the driver, forcing all Linux users who have this host
adapter to use a specific set of I/O port addresses. Other drivers are more flexible, and find the current
I/O port address by scanning all possible port addresses. Usually each host adapter will allow 3 or 4 sets
of addresses, which are selectable via hardware jumpers on the host adapter card.
After the I/O port addresses are found, the host adapter can be interrogated to confirm that it is, indeed,
the expected host adapter. These tests are host adapter specific, but commonly include methods to
determine the BIOS base address (which can then be compared to the BIOS address found during the
BIOS signature search) or to verify a unique identification number associated with the board. For MCA
bus (the ``Micro-Channel Architecture'' bus is IBM's proprietary 32 bit bus for 386 and i486 machines)
machines, each type of board is given a unique identification number which no other manufacturer can
use--several Future Domain host adapters, for example, also use this number as a unique identifier on
ISA bus machines. Other methods of verifying the host adapter existence and function will be available
to the programmer.
Requesting the IRQ
After detection, the detect() routine must request any needed interrupt or DMA channels from the
kernel. There are 16 interrupt channels, labeled IRQ 0 through IRQ 15. The kernel provides two methods
for setting up an IRQ handler: irqaction() and request_irq().
The request_irq() function takes two parameters, the IRQ number and a pointer to the handler
routine. It then sets up a default sigaction structure and calls irqaction(). The code (Linux
0.99.7 kernel source code, linux/kernel/irq.c) for the request_irq() function is shown
below. I will limit my discussion to the more general irqaction() function.

int request_irq( unsigned int irq, void (*handler)( int ) )

{
struct sigaction sa;
sa.sa_handler = handler;
sa.sa_flags = 0;
sa.sa_mask = 0;
sa.sa_restorer = NULL;
return irqaction( irq, &sa );
}
The declaration (Linux 0.99.5 kernel source code, linux/kernel/irq.c) for the irqaction()
function is
int irqaction( unsigned int irq, struct sigaction *new )

where the first parameter, irq, is the number of the IRQ that is being requested, and the second
parameter, new, is a structure with the definition (Linux 0.99.5 kernel source code,
linux/include/linux/signal.h) shown here:
struct sigaction
{
__sighandler_t sa_handler;
sigset_t sa_mask;
int sa_flags;
void (*sa_restorer)(void);
};
In this structure, sa_handler should point to your interrupt handler routine, which should have a
definition similar to the following:
void fdomain_16x0_intr( int irq )

where irq will be the number of the IRQ which caused the interrupt handler routine to be invoked.
The sa_mask variable is used as an internal flag by the irqaction() routine. Traditionally, this
variable is set to zero prior to calling irqaction().
The sa_flags variable can be set to zero or to SA_INTERRUPT. If zero is selected, the interrupt
handler will run with other interrupts enabled, and will return via the signal-handling return functions.
This option is recommended for relatively slow IRQ's, such as those associated with the keyboard and
timer interrupts. If SA_INTERRUPT is selected, the handler will be called with interrupts disabled and
return will avoid the signal-handling return functions. SA_INTERRUPT selects ``fast'' IRQ handler
invocation routines, and is recommended for interrupt driven hard disk routines. The interrupt handler
should turn interrupts on as soon as possible, however, so that other interrupts can be processed.
The sa_restorer variable is not currently used, and is traditionally set to NULL.

The request_irq() and irqaction() functions will return zero if the IRQ was successfully
assigned to the specified interrupt handler routine. Non-zero result codes may be interpreted as follows:
-EINVAL
Either the IRQ requested was larger than 15, or a NULL pointer was passed instead of a valid
pointer to the interrupt handler routine.
-EBUSY
The IRQ requested has already been allocated to another interrupt handler. This situation should
never occur, and is reasonable cause for a call to panic().
The kernel uses an Intel `ìnterrupt gate'' to set up IRQ handler routines requested via the
irqaction() function. The Intel i486 manual [Int90, p. 9-11] explains the interrupt gate as follows:
Interrupts using... interrupt gates... cause the TF flag [trap flag] to be cleared after its current
value is saved on the stack as part of the saved contents of the EFLAGS register. In so
doing, the processor prevents instruction tracing from affecting interrupt response. A
subsequent IRET [interrupt return] instruction restores the TF flag to the value in the saved
contents of the EFLAGS register on the stack.
... An interrupt which uses an interrupt gate clears the IF flag [interrupt-enable flag], which
prevents other interrupts from interfering with the current interrupt handler. A subsequent
IRET instruction restores the IF flag to the value in the saved contents of the EFLAGS
register on the stack.
Requesting the DMA channel
Some SCSI host adapters use DMA to access large blocks of data in memory. Since the CPU does not
have to deal with the individual DMA requests, data transfers are faster than CPU-mediated transfers and
allow the CPU to do other useful work during a block transfer (assuming interrupts are enabled).
The host adapter will use a specific DMA channel. This DMA channel will be determined by the
detect() function and requested from the kernel with the request_dma() function. This function
takes the DMA channel number as its only parameter and returns zero if the DMA channel was
successfully allocated. Non-zero results may be interpreted as follows:
-EINVAL
The DMA channel number requested was larger than 7.
-EBUSY
The requested DMA channel has already been allocated. This is a very serious situation, and will
probably cause any SCSI requests to fail. It is worthy of a call to panic().
info()
The info() function merely returns a pointer to a static area containing a brief description of the
low-level driver. This description, which is similar to that pointed to by the name variable, will be
printed at boot time.
queuecommand()

The queuecommand() function sets up the host adapter for processing a SCSI command and then
returns. When the command is finished, the done() function is called with the Scsi_Cmnd structure
pointer as a parameter. This allows the SCSI command to be executed in an interrupt-driven fashion.
Before returning, the queuecommand() function must do several things:
1. Save the pointer to the Scsi_Cmnd structure.
2. Save the pointer to the done() function in the scsi_done() function pointer in the
Scsi_Cmnd structure. See section done() for more information.
3. Set up the special Scsi_Cmnd variables required by the driver. See section The Scsi_Cmnd
Structure for detailed information on the Scsi_Cmnd structure.
4. Start the SCSI command. For an advanced host adapter, this may be as simple as sending the
command to a host adapter ``mailbox.'' For less advanced host adapters, the ARBITRATION
phase is manually started.
The queuecommand() function is called only if the can_queue variable (see section can_queue)
is non-zero. Otherwise the command() function is used for all SCSI requests. The queuecommand()
DID_BAD_TARGET
The SCSI ID of the target was the same as the SCSI ID of the host adapter.
DID_ABORT
The high-level code called the low-level abort() function (see section abort()).
DID_PARITY
A SCSI PARITY error was detected.
DID_ERROR
An error occurred which lacks a more appropriate error code (for example, an internal host
adapter error).
DID_RESET
The high-level code called the low-level reset() function (see section reset()).
DID_BAD_INTR
An unexpected interrupt occurred and there is no appropriate way to handle this interrupt.
Note that returning DID_BUS_BUSY will force the command to be retried, whereas returning
DID_NO_CONNECT will abort the command.
Byte 3 (MSB)
This byte is for a high-level return code, and should be left as zero by the low-level code.
Current low-level drivers do not uniformly (or correctly) implement error reporting, so it may be better to
consult scsi.c to determine exactly how errors should be reported, rather than exploring existing drivers.
command()
The command() function processes a SCSI command and returns when the command is finished. When
the original SCSI code was written, interrupt-driven drivers were not supported. The old drivers are
much less efficient (in terms of response time and latency) than the current interrupt-driven drivers, but
are also much easier to write. For new drivers, this command can be replaced with a call to the
queuecommand() function, as demonstrated here. (Linux 0.99.5 kernel,
linux/kernel/blk_drv/scsi/aha1542.c, written by Tommy Thorn.)
static volatile int internal_done_flag = 0;

static volatile int internal_done_errcode = 0;
static void internal_done( Scsi_Cmnd *SCpnt )
{
internal_done_errcode = SCpnt->result;
++internal_done_flag;
}
int aha1542_command( Scsi_Cmnd *SCpnt )

{
aha1542_queuecommand( SCpnt, internal_done );

while (!internal_done_flag);
internal_done_flag = 0;
return internal_done_errcode;
}
The return value is the same as the result variable in the Scsi_Cmnd structure. Please see sections
done() and The Scsi_Cmnd Structure for more details.
abort()
The high-level SCSI code handles all timeouts. This frees the low-level driver from having to do timing,
and permits different timeout periods to be used for different devices (e.g., the timeout for a SCSI tape
drive is nearly infinite, whereas the timeout for a SCSI disk drive is relatively short).
The abort() function is used to request that the currently outstanding SCSI command, indicated by the
Scsi_Cmnd pointer, be aborted. After setting the result variable in the Scsi_Cmnd structure, the
abort() function returns zero. If code, the second parameter to the abort() function, is zero, then
result should be set to DID_ABORT. Otherwise, result shoudl be set equal to code. If code is
not zero, it is usually DID_TIME_OUT or DID_RESET.
Currently, none of the low-level drivers is able to correctly abort a SCSI command. The initiator should
request (by asserting the -ATN line) that the target enter a MESSAGE OUT phase. Then, the initiator
should send an ABORT message to the target.
reset()
The reset() function is used to reset the SCSI bus. After a SCSI bus reset, any executing command
should fail with a DID_RESET result code (see section done()).
Currently, none of the low-level drivers handles resets correctly. To correctly reset a SCSI command, the
initiator should request (by asserting the -ATN line) that the target enter a MESSAGE OUT phase. Then,
the initiator should send a BUS DEVICE RESET message to the target. It may also be necessary to
initiate a SCSI RESET by asserting the -RST line, which will cause all target devices to be reset. After a
reset, it may be necessary to renegotiate a synchronous communications protocol with the targets.
slave_attach()
The slave_attach() function is not currently implemented. This function would be used to
negotiate synchronous communications between the host adapter and the target drive. This negotiation
requires an exchange of a pair of SYNCHRONOUS DATA TRANSFER REQUEST messages between
the initiator and the target. This exchange should occur under the following conditions [LXT91]:
A SCSI device that supports synchronous data transfer recognizes it has not communicated
with the other SCSI device since receiving the last ``hard'' RESET.
A SCSI device that supports synchronous data transfer recognizes it has not communicated
with the other SCSI device since receiving a BUS DEVICE RESET message.
bios_param()

Linux supports the MS-DOS (MS-DOS is a registered trademark of Microsoft Corporation) hard disk
partitioning system. Each disk contains a ``partition table'' which defines how the disk is divided into
logical sections. Interpretation of this partition table requires information about the size of the disk in
terms of cylinders, heads, and sectors per cylinder. SCSI disks, however, hide their physical geometry
and are accessed logically as a contiguous list of sectors. Therefore, in order to be compatible with
MS-DOS, the SCSI host adapter will ``lie'' about its geometry. The physical geometry of the SCSI disk,
while available, is seldom used as the ``logical geometry.'' (The reasons for this involve archaic and
arbitrary limitations imposed by MS-DOS.)
Linux needs to determine the ``logical geometry'' so that it can correctly modify and interpret the
partition table. Unfortunately, there is no standard method for converting between physical and logical
geometry. Hence, the bios_param() function was introduced in an attempt to provide access to the
host adapter geometry information.
The size parameter is the size of the disk in sectors. Some host adapters use a deterministic formula
based on this number to calculate the logical geometry of the drive. Other host adapters store geometry
information in tables which the driver can access. To facilitate this access, the dev parameter contains
the drive's device number. Two macros are defined in linux/fs.h which will help to interpret this
value: MAJOR(dev) is the device's major number, and MINOR(dev) is the device's minor number.
These are the same major and minor device numbers used by the standard Linux mknod command to
create the device in the /dev directory. The info parameter points to an array of three integers that the
bios_param() function will fill in before returning:
info[0]
Number of heads
info[1]
Number of sectors per cylinder
info[2]
Number of cylinders
The information in info is not the physical geometry of the drive, but only a logical geometry that is
identical to the logical geometry used by MS-DOS to access the drive. The distinction between physical
and logical geometry cannot be overstressed.
The Scsi_Cmnd Structure
The Scsi_Cmnd structure, (Linux 0.99.7 kernel, linux/kernel/blk_drv/scsi/scsi.h) as shown below, is

used by the high-level code to specify a SCSI command for execution by the low-level code. Many
variables in the Scsi_Cmnd structure can be ignored by the low-level device driver--other variables,
however, are extremely important.
typedef struct scsi_cmnd

{
int host;
unsigned char target,
lun,

index;
struct scsi_cmnd *next,
*prev;
unsigned char cmnd[10];

unsigned request_bufflen;
void *request_buffer;
unsigned char data_cmnd[10];

unsigned short use_sg;
unsigned short sglist_len;
unsigned bufflen;
void *buffer;
struct request request;

unsigned char sense_buffer[16];
int retries;
int allowed;
int timeout_per_command,
timeout_total,
timeout;
unsigned char internal_timeout;
unsigned flags;
void (*scsi_done)(struct scsi_cmnd *);

void (*done)(struct scsi_cmnd *);
Scsi_Pointer SCp;
unsigned char *host_scribble;
int result;
} Scsi_Cmnd;
Reserved Areas
Informative Variables
host is an index into the scsi_hosts array.

target stores the SCSI ID of the target of the SCSI command. This information is important if multiple
outstanding commands or multiple commands per target are supported.
cmnd is an array of bytes which hold the actual SCSI command. These bytes should be sent to the SCSI
target during the COMMAND phase. cmnd[0] is the SCSI command code. The COMMAND_SIZE
macro, defined in scsi.h, can be used to determine the length of the current SCSI command.
result is used to store the result code from the SCSI request. Please see section done() for more

information about this variable. This variable must be correctly set before the low-level routines return.
The Scatter-Gather List
use_sg contains a count of the number of pieces in the scatter-gather chain. If use_sg is zero, then
request_buffer points to the data buffer for the SCSI command, and request_bufflen is the
length of this buffer in bytes. Otherwise, request_buffer points to an array of scatterlist
structures, and use_sg will indicate how many such structures are in the array. The use of
request_buffer is non-intuitive and confusing.
Each element of the scatterlist array contains an address and a length component. If the
unchecked_isa_dma flag in the Scsi_Host structure is set to 1 (see section
unchecked_isa_dma for more information on DMA transfers), the address is guaranteed to be
within the first 16 MB of physical memory. Large amounts of data will be processed by a single SCSI
command. The length of these data will be equal to the sum of the lengths of all the buffers pointed to by
the scatterlist array.
Scratch Areas
Depending on the capabilities and requirements of the host adapter, the scatter-gather list can be handled
in a variety of ways. To support multiple methods, several scratch areas are provided for the exclusive
use of the low-level driver.
The scsi_done() Pointer
This pointer should be set to the done() function pointer in the queuecommand() function (see
section queuecommand() for more information). There are no other uses for this pointer.
The host_scribble Pointer
The high-level code supplies a pair of memory allocation functions, scsi_malloc() and
scsi_free(), which are guaranteed to return memory in the first 16 MB of physical memory. This
memory is, therefore, suitable for use with DMA. The amount of memory allocated per request must be a
multiple of 512 bytes, and must be less than or equal to 4096 bytes. The total amount of memory
available via scsi_malloc() is a complex function of the Scsi_Host structure variables
sg_tablesize, cmd_per_lun, and unchecked_isa_dma.
The host_scribble pointer is available to point to a region of memory allocated with
scsi_malloc(). The low-level SCSI driver is responsible for managing this pointer and its associated
memory, and should free the area when it is no longer needed.
The Scsi_Pointer Structure
The SCp variable, a structure of type Scsi_Pointer, is described here:
typedef struct scsi_pointer

{

char *ptr; /* data pointer */

int this_residual; /* left in this buffer */
struct scatterlist *buffer; /* which buffer */
int buffers_residual; /* how many buffers left */
volatile int Status;

volatile int Message;
volatile int have_data_in;
volatile int sent_command;
volatile int phase;
} Scsi_Pointer;
The variables in this structure can be used in any way necessary in the low-level driver. Typically,
buffer points to the current entry in the scatterlist, buffers_residual counts the number
of entries remaining in the scatterlist, ptr is used as a pointer into the buffer, and
this_residual counts the characters remaining in the transfer. Some host adapters require support of
this detail of interaction--others can completely ignore this structure.
The second set of variables provide convenient locations to store SCSI status information and various
pointers and flags.
Acknowledgements
Thanks to Drew Eckhardt, Michael K. Johnson, Karin Boes, Devesh Bhatnagar, and Doug Hoffman for
reading early versions of this paper and for providing many helpful comments. Special thanks to my
official COMP-291 (Professional Writing in Computer Science) ``readers,'' Professors Peter Calingaert
and Raj Kumar Singh.
Bibliography
[ANS]
Draft Proposed American National Standard for Information Systems: Small Computer System
Interface-2 (SCSI-2). (X3T9.2/86-109, revision 10h, October 17, 1991).
[Int90]
Intel. i486 Processor Programmer's Reference Manual. Intel/McGraw-Hiull, 1990.
[LXT91]
LXT SCSI Products: Specification and OEM Technical Manual, 1991.
[Nor85]
Peter Norton. The Peter Norton Programmer's Guide to the IBM PC. Bellevue, Washington:
Microsoft Press, 1985.
Messages
1. Writing a SCSI Device Driver by rohit patil


non-block-cached block device?
non-block-cached block device?

Forum: Block Device Drivers
Keywords: block device cache
Date: Thu, 30 May 1996 11:26:41 GMT
From: Neal Tucker <ntucker@adobe.com>
I have a question/idea regarding the block device interface...
First, some premises upon which my idea relies

1) All block device access goes through the block cache.
2) Filesystems must be mounted from block devices.
3) A block device read which is not a cache hit always puts
the calling process to sleep, which means that even if the
IO completes quickly (ie with a RAM disk), the process still
has to wait to be scheduled again.
So...
It seems to me that these three things could lead to very
poor RAM disk performance, which leads me to suggest that
it might be a advantageous to allow block devices which do
not go through the block cache.
I can see three possible reasons this isn't a good idea:

1) With the current design, it would be really hard to do.
2) It doesn't make enough of a difference that people care.
3) I'm completely wrong.
What do people think?
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/block/1.html [08/03/2001 10.10.09]

Shall I explain elevator algorithm (+sawtooth etc)
Shall I explain elevator algorithm (+sawtooth etc)

Forum: Block Device Drivers
Keywords: block device elevator sawtooth minimum algorithm
Date: Sat, 10 Aug 1996 11:12:11 GMT
From: Michael De La Rue <miked@ed.ac.uk>
I just wrote a response about it to the kernel list, so would a discussion of the elevator algorithm, and
sawtooth algorithm (plus mention of minimum movement) be appreciated if I get it checked over by
`someone who knows?'
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/block/2.html [08/03/2001 10.10.10]


Forum: Writing a SCSI Device Driver
Keywords: Good work!
Date: Thu, 02 Jan 1997 04:10:44 GMT
From: rohit patil <rohit@techie.com>
hi!
this is superb stuff. thanks. will let you know more after
i go thro' it. good work :)
-rohit.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/scsi/1.html [08/03/2001 10.10.11]

Network Buffers And Memory Management

Reprinted with permission of Linux Journal, from issue 29, September 1996. Some changes
have been made to accomodate the web. This article was originally written for the Kernel
Korner column. The Kernel Korner series has included many other articles of interest to
Linux kernel hackers, as well.
by Alan Cox
The Linux operating system implements the industry-standard Berkeley socket API, which has its origins
in the BSD unix developments (4.2/4.3/4.4 BSD). In this article, we will look at the way the memory
management and buffering is implemented for network layers and network device drivers under the
existing Linux kernel, as well as explain how and why some things have changed over time.
Core Concepts
The networking layer tries to be fairly object-oriented in its design, as indeed is much of the Linux
kernel. The core structure of the networking code goes back to the initial networking and socket
implementations by Ross Biro and Orest Zborowski respectively. The key objects are:
Device or Interface:
A network interface represents a thing which sends and receives packets. This is normally
interface code for a physical device like an ethernet card. However some devices are software only
such as the loopback device which is used for sending data to yourself.
Protocol:
Each protocol is effectively a different language of networking. Some protocols exist purely
because vendors chose to use proprietary networking schemes, others are designed for special
purposes. Within the Linux kernel each protocol is a seperate module of code which provides
services to the socket layer.
Socket:
So called from the notion of plugs and sockets. A socket is a connection in the networking that
provides unix file I/O and exists to the user program as a file descriptor. In the kernel each socket
is a pair of structures that represent the high level socket interface and low level protocol interface.
sk_buff:
All the buffers used by the networking layers are sk_buffs. The control for these is provided by
core low-level library routines available to the whole of the networking. sk_buffs provide the
general buffering and flow control facilities needed by network protocols.
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro.html (1 di 19) [08/03/2001 10.10.18]

Implementation of sk_buffs
The primary goal of the sk_buff routines is to provide a consistent and efficient buffer handling
method for all of the network layers, and by being consistent to make it possible to provide higher level
sk_buff and socket handling facilities to all the protocols.
An sk_buff is a control structure with a block of memory attached. There are two primary sets of
functions provided in the sk_buff library. Firstly routines to manipulate doubly linked lists of
sk_buffs, secondly functions for controlling the attached memory. The buffers are held on linked lists
optimised for the common network operations of append to end and remove from start. As so much of
the networking functionality occurs during interrupts these routines are written to be atomic. The small
extra overhead this causes is well worth the pain it saves in bug hunting.
We use the list operations to manage groups of packets as they arrive from the network, and as we send
them to the physical interfaces. We use the memory manipulation routines for handling the contents of
packets in a standardised and efficient manner.
At its most basic level, a list of buffers is managed using functions like this:
void append_frame(char *buf, int len)

{
struct sk_buff *skb=alloc_skb(len, GFP_ATOMIC);
if(skb==NULL)
my_dropped++;
else
{
skb_put(skb,len);
memcpy(skb->data,data,len);
skb_append(&my_list, skb);
}
}
void process_queue(void)
{
struct sk_buff *skb;
while((skb=skb_dequeue(&my_list))!=NULL)
{
process_data(skb);
kfree_skb(skb, FREE_READ);
}
}
These two fairly simplistic pieces of code actually demonstrate the receive packet mechanism quite
accurately. The append_frame() function is similar to the code called from an interrupt by a device
driver receiving a packet, and process_frame() is similar to the code called to feed data into the
protocols. If you go and look in net/core/dev.c at netif_rx() and net_bh(), you will see that they
manage buffers similarly. They are far more complex, as they have to feed packets to the right protocol

and manage flow control, but the basic operations are the same. This is just as true if you look at buffers
going from the protocol code to a user application.
The example also shows the use of one of the data control functions, skb_put(). Here it is used to
reserve space in the buffer for the data we wish to pass down.
Let's look at append_frame(). The alloc_skb() fucntion obtains a buffer of len bytes (Figure
1), which consists of:
● 0 bytes of room at the head of the buffer
● 0 bytes of data, and
● len bytes of room at the end of the data.
The skb_put() function (Figure 4) grows the data area upwards in memory through the free space at
the buffer end and thus reserves space for the memcpy(). Many network operations used in sending add
to the start of the frame each time in order to add headers to packets, so the skb_push() function
(Figure 5) is provided to allow you to move the start of the data frame down through memory, providing
enough space has been reserved to leave room for doing this.
Immediately after a buffer has been allocated, all the available room is at the end. A further function
named skb_reserve() (Figure 2) can be called before data is added allows you to specify that some
of the room should be at the beginning. Thus, many sending routines start with something like:
skb=alloc_skb(len+headspace, GFP_KERNEL);
skb_reserve(skb, headspace);
skb_put(skb,len);
memcpy_fromfs(skb->data,data,len);
pass_to_m_protocol(skb);
In systems such as BSD unix you don't need to know in advance how much space you will need as it uses
chains of small buffers (mbufs) for its network buffers. Linux chooses to use linear buffers and save
space in advance (often wasting a few bytes to allow for the worst case) because linear buffers make
many other things much faster.
Now to return to the list functions. Linux provides the following operations:
● skb_dequeue() takes the first buffer from a list. If the list is empty a NULL pointer is returned.
This is used to pull buffers off queues. The buffers are added with the routines
skb_queue_head() and skb_queue_tail().
● skb_queue_head() places a buffer at the start of a list. As with all the list operations, it is
atomic.
● skb_queue_tail() places a buffer at the end of a list, which is the most commonly used
function. Almost all the queues are handled with one set of routines queueing data with this
function and another set removing items from the same queues with skb_dequeue().
● skb_unlink() removes a buffer from whatever list it was on. The buffer is not freed, merely
removed from the list. To make some operations easier, you need not know what list the buffer is
on, and you can always call skb_unlink() on a buffer which is not in a list. This enables

network code to pull a buffer out of use even when the network protocol has no idea who is
currently using it. A seperate locking mechanism is provided so device drivers do not find
someone removing a buffer they are using at that moment.
● Some more complex protocols like TCP keep frames in order and re-order their input as data is
received. Two functions, skb_insert() and skb_append(), exist to allow users to place
sk_buffs before or after a specific buffer in a list.
● alloc_skb() creates a new sk_buff and initialises it. The returned buffer is ready to use but
does assume you will fill in a few fields to indicate how the buffer should be freed. Normally this
is skb->free=1. A buffer can be told not to be freed when kfree_skb() (see below) is
called.
● kfree_skb() releases a buffer, and if skb->sk is set it lowers the memory use counts of the
socket (sk). It is up tothe socket and protocol-level routines to have incremented these counts and
to avoid freeing a socket with outstanding buffers. The memory counts are very important, as the
kernel networking layers need to know how much memory is tied up by each connection in order
to prevent remote machines or local processes from using too much memory.
● skb_clone() makes a copy of an sk_buff but does not copy the data area, which must be
considered read only.
● For some things a copy of the data is needed for editing, and skb_copy() provides the same
facilities but also copies the data (and thus has a much higher overhead).
Figure 1: After alloc_skb
Figure 2: After skb_reserve
Figure 3: An sk_buff containing data

Figure 4: After skb_put has been called on the buffer
Figure 5: After an skb_push has occured on the previous buffer
Figure 6: Network device data flow
Higher Level Support Routines

The semantics of allocating and queueing buffers for sockets also involve flow control rules and for
sending a whole list of interactions with signals and optional settings such as non blocking. Two routines
are designed to make this easy for most protocols.
The sock_queue_rcv_skb() function is used to handle incoming data flow control and is normally
used in the form:
sk=my_find_socket(whatever);
if(sock_queue_rcv_skb(sk,skb)==-1)
{
myproto_stats.dropped++;
kfree_skb(skb,FREE_READ);
return;
}
This function uses the socket read queue counters to prevent vast amounts of data being queued to a
socket. After a limit is hit, data is discarded. It is up to the application to read fast enough, or as in TCP,

for the protocol to do flow control over the network. TCP actually tells the sending machine to shut up
when it can no longer queue data.
On the sending side, sock_alloc_send_skb() handles signal handling, the non blocking flag, and
all the semantics of blocking until there is space in the send queue so you cannot tie up all of memory
with data queued for a slow interface. Many protocol send routines have this function doing almost all
the work:
skb=sock_alloc_send_skb(sk,....)
if(skb==NULL)
return -err;
skb->sk=sk;
skb_reserve(skb, headroom);
skb_put(skb,len);
memcpy(skb->data, data, len);
protocol_do_something(skb);
Most of this we have met before. The very important line is skb->sk=sk. The
sock_alloc_send_skb() has charged the memory for the buffer to the socket. By setting
skb->sk we tell the kernel that whoever does a kfree_skb() on the buffer should cause the socket
to be credited the memory for the buffer. Thus when a device has sent a buffer and frees it the user will
be able to send more.
Network Devices
All Linux network devices follow the same interface although many functions available in that interface
will not be needed for all devices. An object oriented mentality is used and each device is an object with
a series of methods that are filled into a structure. Each method is called with the device itself as the first
argument. This is done to get around the lack of the C++ concept of this within the C language.
The file drivers/net/skeleton.c contains the skeleton of a network device driver. View or print a copy
from a recent kernel and follow along throughout the rest of the article.
Basic Structure

Each network device deals entirely in the transmission of network buffers from the protocols to the
physical media, and in receiving and decoding the responses the hardware generates. Incoming frames
are turned into network buffers, identified by protocol and delivered to netif_rx(). This function
then passes the frames off to the protocol layer for further processing.
Each device provides a set of additional methods for the handling of stopping, starting, control and
physical encapsulation of packets. These and all the other control information are collected together in
the device structures that are used to manage each device.
Naming
All Linux network devices have a unique name. This is not in any way related to the file system names
devices may have, and indeed network devices do not normally have a filesystem representation,
although you may create a device which is tied to device drivers. Traditionally the name indicates only
the type of a device rather than its maker. Multiple devices of the same type are numbered upwards from
0. Thus ethernet devices are known as `èth0'', `èth1'', `èth2'' etc. The naming scheme is important as it
allows users to write programs or system configuration in terms of `àn ethernet card'' rather than
worrying about the manufacturer of the board and forcing reconfiguration if a board is changed.
The following names are currently used for generic devices:
ethn
Ethernet controllers, both 10 and 100Mb/second
trn

Token ring devices.

sln
SLIP devices. Also used in AX.25 KISS mode.
pppn
PPP devices both asynchronous and synchronous.
plipn
PLIP units. The number matches the printer port.
tunln
IPIP encapsulated tunnels
nrn
NetROM virtual devices
isdnn
ISDN interfaces handled by isdn4linux. (*)
dummyn
Null devices
lo
The loopback device
(*) At least one ISDN interface is an ethernet impersonator, that is the Sonix PC/Volante driver.
Therefore, it uses an `èth'' device name as it behaves in all aspects as if it was ethernet rather than ISDN.
If possible, a new device should pick a name that reflects existing practice. When you are adding a whole
new physical layer type you should look for other people working on such a project and use a common
naming scheme.
Certain physical layers present multiple logical interfaces over one media. Both ATM and Frame Relay
have this property, as does multi-drop KISS in the amateur radio environment. Under such circumstances
a driver needs to exist for each active channel. The Linux networking code is structured in such a way as
to make this managable without excessive additional code, and the name registration scheme allows you
to create and remove interfaces almost at will as channels come into and out of existance. The proposed
convention for such names is still under some discussion, as the simple scheme of ``sl0a'', ``sl0b'', "sl0c"
works for basic devices like multidrop KISS, but does not cope with multiple frame relay connections
where a virtual channel may be moved across physical boards.
Registering A Device
Each device is created by filling in a struct device object and passing it to the
register_netdev(struct device *) call. This links your device structure into the kernel
network device tables. As the structure you pass in is used by the kernel, you must not free this until you
have unloaded the device with void unregister_netdev(struct device *) calls. These
calls are normally done at boot time, or module load and unload.
The kernel will not object if you create multiple devices with the same name, it will break. Therefore, if

your driver is a loadable module you should use the struct device *dev_get(const char
*name) call to ensure the name is not already in use. If it is in use, you should fail or pick another
name. You may not use unregister_netdev() to unregister the other device with the name if you
discover a clash!
A typical code sequence for registration is:
int register_my_device(void)
{
int i=0;
for(i=0;i<100;i++)
{
sprintf(mydevice.name,"mydev%d",i);
if(dev_get(mydevice.name)==NULL)
{
if(register_netdev(&mydevice)!=0)
return -EIO;
return 0;
}
}
printk("100 mydevs loaded. Unable to load more.\n");
return -ENFILE;
}
The Device Structure

All the generic information and methods for each network device are kept in the device structure. To
create a device you need to fill most of these in. This section covers how they should be set up.
Naming
First, the name field holds the device name. This is a string pointer to a name in the formats discussed
previously. It may also be " " (four spaces), in which case the kernel will automatically assign an ethn
name to it. This is a special feature that is best not used. After Linux 2.0, we intend to change to a simple
support function of the form dev_make_name("eth").
Bus Interface Parameters

The next block of parameters are used to maintain the location of a device within the device address
spaces of the architecture. The irq field holds the interrupt (IRQ) the device is using. This is normally
set at boot, or by the initialization function. If an interrupt is not used, not currently known, or not
assigned, the value zero should be used. The interrupt can be set in a variety of fashions. The auto-irq
facilities of the kernel may be used to probe for the device interrupt, or the interrupt may be set when
loading the network module. Network drivers normally use a global int called irq for this so that users
can load the module with insmod mydevice irq=5 style commands. Finally, the IRQ may be set
dynamically from the ifconfig command. This causes a call to your device that will be discussed later on.

The base_addr field is the base I/O space address the device resides at. If the device uses no I/O
locations or is running on a system with no I/O space concept this field should be zero. When this is user
settable, it is normally set by a global variable called io. The interface I/O address may also be set with
ifconfig.
Two hardware shared memory ranges are defined for things like ISA bus shared memory ethernet cards.
For current purposes, the rmem_start and rmem_end fields are obsolete and should be loaded with 0.
The mem_start and mem_end addresses should be loaded with the start and end of the shared
memory block used by this device. If no shared memory block is used, then the value 0 should be stored.
Those devices that allow the user to specify this parameter use a global variable called mem to set the
memory base, and set the mem_end appropriately themselves.
The dma variable holds the DMA channel in use by the device. Linux allows DMA (like interrupts) to be
automatically probed. If no DMA channel is used, or the DMA channel is not yet set, the value 0 is used.
This may have to change, since the latest PC boards allow ISA bus DMA channel 0 to be used by
hardware boards and do not just tie it to memory refresh. If the user can set the DMA channel the global
variable dma is used.
It is important to realise that the physical information is provided for control and user viewing (as well as
the driver's internal functions), and does not register these areas to prevent them being reused. Thus the
device driver must also allocate and register the I/O, DMA and interrupt lines it wishes to use, using the
same kernel functions as any other device driver. [See the recent Kernel Korner articles on writing a
character device driver in issues 23, 24, 25, 26, and 28 of Linux Journal.]
The if_port field holds the physical media type for multi-media devices such as combo ethernet
boards.
Protocol Layer Variables

In order for the network protocol layers to perform in a sensible manner, the device has to provide a set
of capability flags and variables. These are also maintained in the device structure.
The mtu is the largest payload that can be sent over this interface (that is, the largest packet size not
including any bottom layer headers that the device itself will provide). This is used by the protocol layers
such as IP to select suitable packet sizes to send. There are minimums imposed by each protocol. A
device is not usable for IPX without a 576 byte frame size or higher. IP needs at least 72 bytes, and does
not perform sensibly below about 200 bytes. It is up to the protocol layers to decide whether to
co-operate with your device.
The family is always set to AF_INET and indicates the protocol family the device is using. Linux
allows a device to be using multiple protocol families at once, and maintains this information solely to
look more like the standard BSD networking API.
The interface hardware type (type) field is taken from a table of physical media types. The values used
by the ARP protocol (see RFC1700) are used for those media supporting ARP and additional values are
assigned for other physical layers. New values are added when neccessary both to the kernel and to
net-tools which is the package containing programs like ifconfig that need to be able to decode this field.

The fields defined as of Linux pre2.0.5 are:

From RFC1700:
ARPHRD_NETROM
NET/ROM(tm) devices.
ARPHRD_ETHER
10 and 100Mbit/second ethernet.
ARPHRD_EETHER
Experimental Ethernet (not used).
ARPHRD_AX25
AX.25 level 2 interfaces.
ARPHRD_PRONET
PROnet token ring (not used).
ARPHRD_CHAOS
ChaosNET (not used).
ARPHRD_IEE802
802.2 networks notably token ring.
ARPHRD_ARCNET
ARCnet interfaces.
ARPHRD_DLCI
Frame Relay DLCI.
Defined by Linux:
ARPHRD_SLIP
Serial Line IP protocol
ARPHRD_CSLIP
SLIP with VJ header compression
ARPHRD_SLIP6
6bit encoded SLIP
ARPHRD_CSLIP6
6bit encoded header compressed SLIP
ARPHRD_ADAPT
SLIP interface in adaptive mode
ARPHRD_PPP
PPP interfaces (async and sync)
ARPHRD_TUNNEL
IPIP tunnels
ARPHRD_TUNNEL6
IPv6 over IP tunnels

ARPHRD_FRAD
Frame Relay Access Device.
ARPHRD_SKIP
SKIP encryption tunnel.
ARPHRD_LOOPBACK
Loopback device.
ARPHRD_LOCALTLK
Localtalk apple networking device.
ARPHRD_METRICOM
Metricom Radio Network.
Those interfaces marked unused are defined types but without any current support on the existing
net-tools. The Linux kernel provides additional generic support routines for devices using ethernet and
token ring.
The pa_addr field is used to hold the IP address when the interface is up. Interfaces should start down
with this variable clear. pa_brdaddr is used to hold the configured broadcast address, pa_dstaddr
the target of a point to point link and pa_mask the IP netmask of the interface. All of these can be
initialised to zero. The pa_alen field holds the length of an address (in our case an IP address), this
should be initialised to 4.
Link Layer Variables

The hard_header_len is the number of bytes the device desires at the start of a network buffer it is
passed. It does not have to be the number of bytes of physical header that will be added, although this is
normal. A device can use this to provide itself a scratchpad at the start of each buffer.
In the 1.2.x series kernels, the skb->data pointer will point to the buffer start and you must avoid
sending your scratchpad yourself. This also means for devices with variable length headers you will need
to allocate max_size+1 bytes and keep a length byte at the start so you know where the header really
begins (the header should be contiguous with the data). Linux 1.3.x makes life much simpler and ensures
you will have at least as much room as you asked free at the start of the buffer. It is up to you to use
skb_push() appropriately as was discussed in the section on networking buffers.
The physical media addresses (if any) are maintained in dev_addr and broadcast respectively.
These are byte arrays and addresses smaller than the size of the array are stored starting from the left.
The addr_len field is used to hold the length of a hardware address. With many media there is no
hardware address, and this should be set to zero. For some other interfaces the address must be set by a
user program. The ifconfig tool permits the setting of an interface hardware address. In this case it need
not be set initially, but the open code should take care not to allow a device to start transmitting without
an address being set.

Flags
A set of flags are used to maintain the interface properties. Some of these are ``compatibility'' items and
as such not directly useful. The flags are:
IFF_UP
The interface is currently active. In Linux, the IFF_RUNNING and IFF_UP flags are basically
handled as a pair. They exist as two items for compatibility reasons. When an interface is not
marked as IFF_UP it may be removed. Unlike BSD, an interface that does not have IFF_UP set
will never receive packets.
IFF_BROADCAST
The interface has broadcast capability. There will be a valid IP address stored in the device
addresses.
IFF_DEBUG
Available to indicate debugging is desired. Not currently used.
IFF_LOOPBACK
The loopback interface (lo) is the only interface that has this flag set. Setting it on other interfaces
is neither defined nor a very good idea.
IFF_POINTOPOINT
The interface is a point to point link (such as SLIP or PPP). There is no broadcast capability as
such. The remote point to point address in the device structure is valid. A point to point link has no
netmask or broadcast normally, but this can be enabled if needed.
IFF_NOTRAILERS
More of a prehistoric than a historic compatibility flag. Not used.
IFF_RUNNING
See IFF_UP
IFF_NOARP
The interface does not perform ARP queries. Such an interface must have either a static table of
address conversions or no need to perform mappings. The NetROM interface is a good example of
this. Here all entries are hand configured as the NetROM protocol cannot do ARP queries.
IFF_PROMISC
The interface if it is possible will hear all packets on the network. This is typically used for
network monitoring although it may also be used for bridging. One or two interfaces like the
AX.25 interfaces are always in promiscuous mode.
IFF_ALLMULTI
Receive all multicast packets. An interface that cannot perform this operation but can receive all
packets will go into promiscuous mode when asked to perform this task.
IFF_MULTICAST
Indicate that the interface supports multicast IP traffic. This is not the same as supporting a
physical multicast. AX.25 for example supports IP multicast using physical broadcast. Point to
point protocols such as SLIP generally support IP multicast.

The Packet Queue

Packets are queued for an interface by the kernel protocol code. Within each device, buffs[] is an
array of packet queues for each kernel priority level. These are maintained entirely by the kernel code,
but must be initialised by the device itself on boot up. The intialisation code used is:
int ct=0;
while(ct<DEV_NUMBUFFS)
{
skb_queue_head_init(&dev->buffs[ct]);
ct++;
}
All other fields should be initialised to 0.
The device gets to select the queue length it wants by setting the field dev->tx_queue_len to the
maximum number of frames the kernel should queue for the device. Typically this is around 100 for
ethernet and 10 for serial lines. A device can modify this dynamically, although its effect will lag the
change slightly.
Network Device Methods

Each network device has to provide a set of actual functions (methods) for the basic low level operations.
It should also provide a set of support functions that interface the protocol layer to the protocol
requirements of the link layer it is providing.
Setup
The init method is called when the device is initialised and registered with the system. It should perform
any low level verification and checking needed, and return an error code if the device is not present,
areas cannot be registered or it is otherwise unable to proceed. If the init method returns an error the
register_netdev() call returns the error code and the device is not created.
Frame Transmission
All devices must provide a transmit function. It is possible for a device to exist that cannot transmit. In
this case the device needs a transmit function that simply frees the buffer it is passed. The dummy device
has exactly this functionality on transmit.
The dev->hard_start_xmit() function is called and provides the driver with its own device
pointer and network buffer (an sk_buff) to transmit. If your device is unable to accept the buffer, it
should return 1 and set dev->tbusy to a non-zero value. This will queue the buffer and it may be
retried again later, although there is no guarantee that the buffer will be retried. If the protocol layer
decides to free the buffer the driver has rejected, then it will not be offered back to the device. If the
device knows the buffer cannot be transmitted in the near future, for example due to bad congestion, it
can call dev_kfree_skb() to dump the buffer and return 0 indicating the buffer is processed.

If there is room the buffer should be processed. The buffer handed down already contains all the headers,
including link layer headers, neccessary and need only be actually loaded into the hardware for
transmission. In addition, the buffer is locked. This means that the device driver has absolute ownership
of the buffer until it chooses to relinquish it. The contents of an sk_buff remain read-only, except that
you are guaranteed that the next/previous pointers are free so you can use the sk_buff list primitives to
build internal chains of buffers.
When the buffer has been loaded into the hardware, or in the case of some DMA driven devices, when
the hardware has indicated transmission complete, the driver must release the buffer. This is done by
calling dev_kfree_skb(skb, FREE_WRITE). As soon as this call is made, the sk_buff in
question may spontaneously disappear and the device driver thus should not reference it again.
Frame Headers
It is neccessary for the high level protocols to append low level headers to each frame before queueing it
for transmission. It is also clearly undesirable that the protocol know in advance how to append low level
headers for all possible frame types. Thus the protocol layer calls down to the device with a buffer that
has at least dev->hard_header_len bytes free at the start of the buffer. It is then up to the network
device to correctly call skb_push() and to put the header on the packet in its
dev->hard_header() method. Devices with no link layer header, such as SLIP, may have this
method specified as NULL.
The method is invoked giving the buffer concerned, the device's own pointers, its protocol identity,
pointers to the source and destination hardware addresses, and the length of the packet to be sent. As the
routine may be called before the protocol layers are fully assembled, it is vital that the method use the
length parameter, not the buffer length.
The source address may be NULL to mean `ùse the default address of this device'', and the destination
may be NULL to mean `ùnknown''. If as a result of an unknown destination the header may not be
completed, the space should be allocated and any bytes that can be filled in should be filled in. This
facility is currently only used by IP when ARP processing must take place. The function must then return
the negative of the bytes of header added. If the header is completely built it must return the number of
bytes of header added.
When a header cannot be completed the protocol layers will attempt to resolve the address neccessary.
When this occurs, the dev->rebuild_header() method is called with the address at which the
header is located, the device in question, the destination IP address, and the network buffer pointer. If the
device is able to resolve the address by whatever means available (normally ARP), then it fills in the
physical address and returns 1. If the header cannot be resolved, it returns 0 and the buffer will be retried
the next time the protocol layer has reason to believe resolution will be possible.
Reception
There is no receive method in a network device, because it is the device that invokes processing of such
events. With a typical device, an interrupt notifies the handler that a completed packet is ready for
reception. The device allocates a buffer of suitable size with dev_alloc_skb() and places the bytes
from the hardware into the buffer. Next, the device driver analyses the frame to decide the packet type.

The driver sets skb->dev to the device that received the frame. It sets skb->protocol to the
protocol the frame represents so that the frame can be given to the correct protocol layer. The link layer
header pointer is stored in skb->mac.raw and the link layer header removed with skb_pull() so
that the protocols need not be aware of it. Finally, to keep the link and protocol isolated, the device driver
must set skb->pkt_type to one of the following:
PACKET_BROADCAST
Link layer broadcast.
PACKET_MULTICAST
Link layer multicast.
PACKET_SELF
Frame to us.
PACKET_OTHERHOST
Frame to another single host.
This last type is normally reported as a result of an interface running in promiscuous mode.
Finally, the device driver invokes netif_rx() to pass the buffer up to the protocol layer. The buffer is
queued for processing by the networking protocols after the interrupt handler returns. Deferring the
processing in this fashion dramatically reduces the time interrupts are disabled and improves overall
responsiveness. Once netif_rx() is called, the buffer ceases to be property of the device driver and
may not be altered or referred to again.
Flow control on received packets is applied at two levels by the protocols. Firstly a maximum amount of
data may be outstanding for netif_rx() to process. Secondly each socket on the system has a queue
which limits the amount of pending data. Thus all flow control is applied by the protocol layers. On the
transmit side a per device variable dev->tx_queue_len is used as a queue length limiter. The size of
the queue is normally 100 frames, which is enough that the queue will be kept well filled when sending a
lot of data over fast links. On a slow link such as slip link, the queue is normally set to about 10 frames,
as sending even 10 frames is several seconds of queued data.
One piece of magic that is done for reception with most existing device, and one you should implement if
possible, is to reserve the neccessary bytes at the head of the buffer to land the IP header on a long word
boundary. The existing ethernet drivers thus do:
skb=dev_alloc_skb(length+2);
if(skb==NULL)
return;
skb_reserve(skb,2);
/* then 14 bytes of ethernet hardware header */
to align IP headers on a 16 byte boundary, which is also the start of a cache line and helps give
performance improvments. On the Sparc or DEC Alpha these improvements are very noticable.

Optional Functionality
Each device has the option of providing additional functions and facilities to the protocol layers. Not
implementing these functions will cause a degradation in service available via the interface but not
prevent operation. These operations split into two categories--configuration and activation/shutdown.
Activation And Shutdown

When a device is activated (that is, the flag IFF_UP is set) the dev->open() method is invoked if the
device has provided one. This permits the device to take any action such as enabling the interface that are
needed when the interface is to be used. An error return from this function causes the device to stay down
and causes the user request to activate the device to fail with the error returned by dev->open()
The second use of this function is with any device loaded as a module. Here it is neccessary to prevent a
device being unloaded while it is open. Thus the MOD_INC_USE_COUNT macro must be used within
the open method.
The dev->close() method is invoked when the device is configured down and should shut off the
hardware in such a way as to minimise machine load (for example by disabling the interface or its ability
to generate interrupts). It can also be used to allow a module device to be unloaded now that it is down.
The rest of the kernel is structured in such a way that when a device is closed, all references to it by
pointer are removed. This ensures that the device may safely be unloaded from a running system. The
close method is not permitted to fail.
Configuration And Statistics

A set of functions provide the ability to query and to set operating parameters. The first and most basic of
these is a get_stats routine which when called returns a struct enet_statistics block for the
interface. This allows user programs such as ifconfig to see the loading on the interface and any problem
frames logged. Not providing this will lead to no statistics being available.
The dev->set_mac_address() function is called whenever a superuser process issues an ioctl of
type SIOCSIFHWADDR to change the physical address of a device. For many devices this is not
meaningful and for others not supported. If so leave this functiom pointer as NULL. Some devices can
only perform a physical address change if the interface is taken down. For these check IFF_UP and if set
then return -EBUSY.
The dev->set_config() function is called by the SIOCSIFMAP function when a user enters a
command like ifconfig eth0 irq 11. It passes an ifmap structure containing the desired I/O
and other interface parameters. For most interfaces this is not useful and you can return NULL.
Finally, the dev->do_ioctl() call is invoked whenever an ioctl in the range SIOCDEVPRIVATE to
SIOCDEVPRIVATE+15 is used on your interface. All these ioctl calls take a struct ifreq. This is
copied into kernel space before your handler is called and copied back at the end. For maximum
flexibility any user may make these calls and it is up to your code to check for superuser status when
appropriate. For example the PLIP driver uses these to set parallel port time out speeds to allow a user to
tune the plip device for their machine.

Multicasting
Certain physical media types such as ethernet support multicast frames at the physical layer. A multicast
frame is heard by a group, but not all, hosts on the network, rather than going from one host to another.
The capabilities of ethernet cards are fairly variable. Most fall into one of three categories:
1. No multicast filters. The card either receives all multicasts or none of them. Such cards can be a
nuisance on a network with a lot of multicast traffic such as group video conferences.
2. Hash filters. A table is loaded onto the card giving a mask of entries that we wish to hear multicast
for. This filters out some of the unwanted multicasts but not all.
3. Perfect filters. Most cards that support perfect filters combine this option with 1 or 2 above. This is
done because the perfect filter often has a length limit of 8 or 16 entries.
It is especially important that ethernet interfaces are programmed to support multicasting. Several
ethernet protocols (notably Appletalk and IP multicast) rely on ethernet multicasting. Fortunately, most
of the work is done by the kernel for you (see net/core/dev_mcast.c).
The kernel support code maintains lists of physical addresses your interface should be allowing for
multicast. The device driver may return frames matching more than the requested list of multicasts if it is
not able to do perfect filtering.
Whenever the list of multicast addresses changes the device drivers dev->set_multicast_list()
function is invoked. The driver can then reload its physical tables. Typically this looks something like:
if(dev->flags&IFF_PROMISC)
SetToHearAllPackets();
else if(dev->flags&IFF_ALLMULTI)
SetToHearAllMulticasts();
else
{
if(dev->mc_count<16)
{
LoadAddressList(dev->mc_list);
SetToHearList();
}
else
SetToHearAllMulticasts();
}
There are a small number of cards that can only do unicast or promiscuous mode. In this case the driver,
when presented with a request for multicasts has to go promiscuous. If this is done, the driver must itself
also set the IFF_PROMISC flag in dev->flags.
In order to aid driver writer the multicast list is kept valid at all times. This simplifies many drivers, as a
reset from error condition in a driver often has to reload the multicast address lists.

Ethernet Support Routines

Ethernet is probably the most common physical interface type that is handled. The kernel provides a set
of general purpose ethernet support routines that such drivers can use.
eth_header() is the standard ethernet handler for the dev->hard_header routine, and can be
used in any ethernet driver. Combined with eth_rebuild_header() for the rebuild routine it
provides all the ARP lookup required to put ethernet headers on IP packets.
The eth_type_trans() routine expects to be fed a raw ethernet packet. It analyses the headers and
sets skb->pkt_type and skb->mac itself as well as returning the suggested value for
skb->protocol. This routine is normally called from the ethernet driver receive interrupt handler to
classify packets.
eth_copy_and_sum(), the final ethernet support routine, is quite internally complex but offers
significant performance improvements for memory mapped cards. It provides the support to copy and
checksum data from the card into an sk_buff in a single pass. This single pass through memory almost
eliminates the cost of checksum computation when used and can really help IP throughput.
Alan Cox has been working on Linux since version 0.95, when he installed it in order to do
further work on the AberMUD game. He now manages the Linux Networking, SMP, and
Linux/8086 projects and hasn't done any work on AberMUD since November 1993.
Messages
2. Question on alloc_skb() by Joern Wohlrab
1. Re: Question on alloc_skb() by Erik Petersen
1. Question on network interfaces by Vijay Gupta
2. Re: Question on network interfaces by Pedro Roque
1. Untitled
1. Finding net info by Alan Cox

Question on alloc_skb()
Question on alloc_skb()
Forum: Network Buffers And Memory Management
Keywords: network interface
Date: Mon, 31 Mar 1997 18:16:47 GMT
From: Joern Wohlrab <unknown>
Hi,
As I read in that introduction a network device driver
has to alloc a sk_buff in it's ISR and this has to happen
atomically, isn't it? Well my experiences are that unfortunately
it often happens that alloc_skb returns NULL. So my idea was
I alloc a few sk_buff's (with GFP_KERNEL flag) in the device
drivers open function. The device driver would organize these
sk_buff's as ring. Else the device driver must forbid the
higher layer to free the sk_buff's. Is this possible just
by setting the lock flag inside sk_buff?
The only problem with this scheme is when the user process
doesn't read frequently from the socket the device driver
overrides unread sk_buff data again and the packet order would
be destroyed. But let's assume we don't care. So is this plan
possible at all?
Thank you very much.
--
Joern Wohlrab
Messages
1. Re: Question on alloc_skb() by Erik Petersen
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro/2.html [08/03/2001 10.10.19]

Re: Question on alloc_skb()
Re: Question on alloc_skb()

Re: Question on alloc_skb() (Joern Wohlrab)
Date: Tue, 08 Apr 1997 04:42:56 GMT
From: Erik Petersen <erik@spellcast.com>
Hmm.... I'm not sure why you would want to do this personally. If alloc_skb() returns NULL, there is
no memory to allocate the block you want. Normally you would report the squeeze and drop the data.
When you pass the skb to netif_rx you are esentially saying "here you go". You cant expect to reclaim
the buffer as it will eventually be freed.
If you must make a best effort to deliver the data regardless of the memory situation at the time the
data is received (the interrupt handler), I would create an skb list when the driver loads. Then during
the interrupt, try to alloc_skb and if it fails, stuff the data in one of the pre-alloced buffers. Then the
next time an interrupt occurs, try to replenish the buffer pool.
If you get to a point where both the alloc_skb fails and the buffer pool is empty, you're pretty much
screwed anyways.
This would solve short-term squeeze situations but if you are that tight for memory, you might want to
just printk a message saying "Get more memory cheapskate" or words to that effect.
If you're smart about it, you could balance the buffer pool at interrupt time to ensure you have enough
to do the job. If you're using a good number of the buffers continuously, you might want to
dynamically increase the number of buffers in the pool. If you aren't, you could reduce the number
dynamically.
Just a thought.
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro/2/1.html [08/03/2001 10.10.20]

Question on network interfaces

Date: Sun, 19 May 1996 17:46:31 GMT
From: Vijay Gupta <vijay@crhc.uiuc.edu>
Hi,
I am working on some networking code at the kernel level
using version 1.3.71. I had the following problems :
How to find :
(1) the interfaces which are available :
(e.g. in the case of a router, you will have many
interfaces, one interface for each of its IP addresses).
Many times you have both a SLIP as well as an Ethernet
interface to your router computer.
(2) the status of the interface :
whether the interface is up or down.
There is a structure called "struct ifnet" which is used in include/linux/route.h. struct ifnet has
information like the name of the interface (e.g. le0 or sl0), as well as the status of the interface
(whether up or down).
But I could not find the definition of this structure anywhere in version 1.3.71.
Older versions of linux had the definition of this structure available (struct ifnet also occurs in BSD
code). But now I am unable to find its definition or use. Is there a substitute for that structure ?
If there is no substitute, is it the case that the information about the available interfaces cannot be
obtained ?
Thank you very much,
Vijay Gupta
(Email : vijay@crhc.uiuc.edu)
Messages
2. Re: Question on network interfaces by Pedro Roque
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro/1.html (1 di 2) [08/03/2001 10.10.22]

1. Untitled
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro/1.html (2 di 2) [08/03/2001 10.10.22]

Re: Question on network interfaces
Re: Question on network interfaces

Re: Question on network interfaces (Vijay Gupta)
Date: Wed, 28 Aug 1996 15:33:15 GMT
From: Pedro Roque <roque@di.fc.ul.pt>
If you want to scan the interface list from kernel space you can do something like:
{
struct device *dev;
for (dev = dev_base; dev != NULL; dev = dev->next)

{
/* your code here */
/* example */
if (dev->family == AF_INET)
{
/* this is an inet device */
}
if (dev->type == ARPHRD_ETHER)
{
/* this is ethernet */
}
}
}
./Pedro.

Untitled
Untitled
Date: Tue, 21 May 1996 23:00:03 GMT
From: <unknown>
I more or less managed to get the answers to the above questions by winding through start_kernel ->
init -> ifconfig -> ....
Thanks,
Vijay
Messages

Finding net info
Finding net info

Date: Thu, 30 May 1996 16:12:51 GMT
From: Alan Cox <alan.cox@linux.org>
For general scanning there is both the BSD ioctl (which is a pain as you must guess the largest size), or
/proc/net/dev (just cat it). For the state of an interface you use a struct ifreq filled in and do
SIOCGIFFLAGS and test IFF_RUNNING and IFF_UP
http://ldp.iol.it/LDP/khg/HyperNews/get/net/net-intro/1/1/1.html [08/03/2001 10.10.29]

down/up() - semaphores; set/clear/test_bit()
down/up() - semaphores; set/clear/test_bit()

Forum: Supporting Functions
Date: Tue, 25 Mar 1997 17:38:15 GMT
From: Erez Strauss <unknown>
The following features are almost not documented (AFAIK). semaphore locking with down() up()
functions and the usage of them. The bit operations set_bit() clear_bit() and test_bit() are also missing
usage information. Those functions are important for drivers programmers that should take care about
SMP/resource locking. Please email me <erez@newplaces.com> refrences if you know about.
The KHG is missing an example section. Each function in the Linux kernel should have an example
page in the KGH.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/reference/14.html [08/03/2001 10.10.30]

Bug in printk description!
Bug in printk description!

Date: Wed, 19 Feb 1997 01:43:48 GMT
From: Theodore Ts'o <tytso@mit.edu>
The printk description states that (and I quote):

``printk() may cause implicit I/O, if the memory being accessed has been swapped out, and therefore
pre-emption may occur at this point. Also, printk() will set the interrupt enable flag, so never use it in
code protected by cli(). Because it causes I/O, it is not safe to use in protected code anyway, even it if
didn't set the interrupt enable flag.''
This is wrong! First of all, printk accesses kernel memory, which is never swapped out. Hence, there is
no risk of causing implicit I/O. Secondly, printk doesn't use sti(); it uses save_flags()/restore_flags(),
so it's safe to use it in an interrupt routine (although it will do horrible things to your interrupt latency,
so you obviously only use it for debugging).

File access within a device driver?
File access within a device driver?

Keywords: file access device driver
Date: Wed, 22 Jan 1997 10:51:25 GMT
From: Paul Osborn <pao20@cam.ac.uk>
I have a device driver which locates a custom ISA card in I/O space, and then needs to download a 6kb
configuration file to an FPGA on the card.
Which functions should I use to read the datafile? Can stdio.h functions be used, or must special
functions be used within the kernel?

man pages for reguest_region() and release_region() (?)
man pages for reguest_region() and

release_region() (?)
Keywords: release request
Date: Mon, 20 Jan 1997 16:15:26 GMT
From: <mharrison@i-55.com>
helo,
Recently I read a series of articles
on writing device drivers in the
Linux Journal. The author mentions
two functions: release_region(),
request_region(). So far I have
been unlucky in finding man-pages
for these functions.
any clues or hints would be most

appreciated
cheers
Mike

Can register_*dev() assign an unused major number?
Can register_*dev() assign an unused major

number?
Date: Thu, 09 Jan 1997 06:32:55 GMT
From: <rgerharz@erols.com>
If you call register_*dev() with major=0, will it return and allocate an unused major number? If so,
will it do this for modules, also?
Messages
1. Register_*dev() can assign an unused major number. by Reinhold J. Gerharz

Register_*dev() can assign an unused major number.
Register_*dev() can assign an unused major

number.
Re: Can register_*dev() assign an unused major number?
Keywords: register_chrdev major device
Date: Mon, 03 Feb 1997 17:48:13 GMT
If the first parameter to register_chrdev() is zero (0), register_chrdev() will attempt to return an unused
major device number. If it returns <0, then the return value is an error code.
(Moderator: Please delete this paragraph and replace my previous message, above, with this one.)
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/reference/10/1.html [08/03/2001 10.10.32]

memcpy_*fs(): which way is "fs"?
memcpy_*fs(): which way is "fs"?

Keywords: USER KERNEL SPACE MEMORY COPY
Date: Thu, 09 Jan 1997 05:00:55 GMT
memcpy_*fs()
inline void memcpy_tofs(void * to, const void * from, unsigned long n)
inline void memcpy_fromfs(void * to, const void * from, unsigned long n)
It is not clear which way the copy occurs. Does "from" mean user space, or kernel space. Contrarily,
does "to" mean kernel space or user space?
Assuming the "tofs" and "fromfs" refer to the Frame Segment register, can one assume it always points
to user space? How does this carry over to other architectures? Do they have Frame Segment registers?
Messages
1. memcpy_tofs() and memcpy_fromfs() by David Hinds

memcpy_tofs() and memcpy_fromfs()
memcpy_tofs() and memcpy_fromfs()

Re: memcpy_*fs(): which way is "fs"? (Reinhold J. Gerharz)
Keywords: USER KERNEL SPACE MEMORY COPY
Date: Mon, 13 Jan 1997 22:35:53 GMT
From: David Hinds <dhinds@hyper.stanford.edu>
In older versions of the Linux kernel, the i386 FS segment register pointed to user space. So,
memcpy_tofs meant to user space, and memcpy_fromfs meant from user space. On other platforms,
these did the right thing despite the non-existence of an FS register. These calls are deprecated in
current kernels, however, and new code should use copy_from_user() and copy_to_user().

init_wait_queue()
init_wait_queue()
Date: Tue, 19 Nov 1996 17:14:17 GMT
Before calling sleep_on() or wake_up() on a wait queue, you must initialize it with the
init_wait_queue() function.

request_irq(...,void *dev_id)
request_irq(...,void *dev_id)
Keywords: request_irq
Date: Tue, 29 Oct 1996 14:54:25 GMT
From: Robert Wilhelm <robert@physiol.med.tu-muenchen.de>
request_irg() and free_irq() seem to take a new parameter in Linux 2.0.x. What is the magic behind
this?
Messages
1. dev_id seems to be for IRQ sharing by Steven Hunyady

dev_id seems to be for IRQ sharing
dev_id seems to be for IRQ sharing

Re: request_irq(...,void *dev_id) (Robert Wilhelm)
Keywords: request_irq dev_id IRQ-sharing
Date: Tue, 08 Apr 1997 02:11:34 GMT
From: Steven Hunyady <hunyady@kestrel.nmt.edu>
Look in Don Becker's 3c59x.c net driver. Apparently, IRQ sharing amongst like (or dissimilar?) cards
developed progressively in the kernel, and this driver, usable in several major kernel versions, shows
this ongoing adaptation. Most other device drivers have not yet allowed for multiple use of IRQ lines,
hence they simply put "NULL" for this fifth parameter in the function request_irq() and the second in
free_irq().

udelay should be mentioned
udelay should be mentioned

Keywords: udelay
Date: Tue, 22 Oct 1996 13:45:41 GMT
From: Klaus Lindemann <lindeman@nbi.dk>
Hi
I think that the function udelay() should be mentioned in this section, since it is not possible to use
delay in kernel modules (or at least that how I understood it).
Regards
Klaus Lindemann

vprintk would be nice...
vprintk would be nice...

Keywords: printk
Date: Mon, 21 Oct 1996 18:58:25 GMT
From: Robert Baruch <baruch@oramp.com>
I wish there were a function analagous to vprintf except for

the kernel -- vprintk.
Messages
1. RE: vprintk would be nice...

RE: vprintk would be nice...
RE: vprintk would be nice...

Re: vprintk would be nice... (Robert Baruch)
Keywords: printk
Date: Thu, 09 Jan 1997 05:19:03 GMT
From: <unknown>
What's wrong with using sprintf()? I do.

add_timer function errata?
add_timer function errata?

Date: Mon, 07 Oct 1996 09:45:17 GMT
From: Tim Ferguson <timf@dgs.monash.edu.au>
It seems that when using the add_timer function in newer versions of the kernel (2.0.0+), the èxpires'
variable in the timer_list struct is the time rather than the length of time before the timer will be
processed. To be backward compatible with older versions of linux, you need to do something like:
if the old version was:
timer.expires = TIME_LENGTH;
new version would be:
timer.expires = jiffies + TIME_LENGTH;
where TIME_LENGTH is the time in 1/100'ths of a second.
Could anyone tell me if they found this also to be the case, and if so, could the Linux hackers guide
please be updated.
thanks,
Tim.
Messages
1. add_timer function errata by Tom Bjorkholm

add_timer function errata
add_timer function errata

Re: add_timer function errata? (Tim Ferguson)
Date: Mon, 17 Feb 1997 17:42:33 GMT
From: Tom Bjorkholm <tomb@mydata.se>
Tim,
You are correct.... or at least I have the same experience as you have. The time you should give is
"jiffies + TIMEOUT"
Could someone fix this in the original documentation.
/Tom Bjorkholm

Very short waits
Very short waits

Keywords: short timer jiffies sleep wait
Date: Mon, 23 Sep 1996 20:02:38 GMT
From: Kenn Humborg <kenn@wombat.ie>
Is there any way to wait for less than a jiffy without spinning and tying up the CPU?
I'm trying to implement a key-click and kd_mksound can't make sounds shorter than 10ms.
Thanks
Kenn Humborg kenn@wombat.ie

Add the kill_xxx() family to Supporting functions?
Add the kill_xxx() family to Supporting functions?

Keywords: kill_xxx(), signaling
Date: Sun, 22 Sep 1996 15:11:54 GMT
From: Burkhard Kohl <b.kohl@ipn-b.comlink.apc.org>
For the development of a char driver I needed functionality to signal an interrupt to the process in user
space. The KHG does not give any hint how to do that. Finally, after quite some browsing through
kernel sources I came across the kill_xxxx() family in exit.c.
I found kill_pg() and kill_proc() widely used in a couple of char drivers. Another one is kill_fasync()
which is mostly used by mouse drivers.
After some hacking I managed to use kill_proc() for my purpose. But I still don't know how to handle
the priv parameter correctly. Obviously 0 means without and 1 with certain (what?) priviledges.
I have no idea what kill_fasync is used for.
Wouldn't it be nice to have the kill_xxxx() family described in the KHG? Michael, what do you think?
Anyone willing to take this? I could do the stubs if someone who really knows will do the annotation.
Any comment, thoughts and flames are welcome.
Burkhard.
P.S. My email address is:
b.kohl@ipn-b.comlink.apc.org

Allocating large amount of memory
Allocating large amount of memory

Keywords: memory allocation
Date: Mon, 03 Jun 1996 22:25:40 GMT
Matt Welsh has designed a solution to the need for very large areas of continuous physical areas of
memory, which is specifically necessary for some DMA needs. If you need it, pick up a copy of
bigphysarea, which should work with most modern kernels.
Messages
1. bigphysarea for Linux 2.0? by Greg Hager

bigphysarea for Linux 2.0?
bigphysarea for Linux 2.0?

Re: Allocating large amount of memory (Michael K. Johnson)
Keywords: memory allocation Linux 2.0
Date: Wed, 24 Jul 1996 08:47:41 GMT
From: Greg Hager <hager@cs.yale.edu>
I acquired the bigphsyarea patch (for a digitizer driver that I am writing), but unfortunately patch -p0
fails on Linux 2.0. Has anyone modifed the patch for 2.0.
Greg


From a message from Linus Torvalds to the linux-kernel mailing list of 27 Sep 1996, edited.
I'll take this opportunity to tell all device driver writers about the ugly secrets of portability. Things are actually
worse than just physical and virtual addresses.
The aha1542 is a bus-master device, and [a patch posted to the linux-kernel list] makes the driver give the
controller the physical address of the buffers, which is correct on x86, because all bus master devices see the
physical memory mappings directly.
However, on many setups, there are actually three different ways of looking at memory addresses, and in this case
we actually want the third, the so-called "bus address".
Essentially, the three ways of addressing memory are (this is "real memory", i.e. normal RAM; see later about
other details):
● CPU untranslated. This is the "physical" address, ie physical address 0 is what the CPU sees when it drives
zeroes on the memory bus.
● CPU translated address. This is the "virtual" address, and is completely internal to the CPU itself with the
CPU doing the appropriate translations into "CPU untranslated".
● Bus address. This is the address of memory as seen by OTHER devices, not the CPU. Now, in theory there
could be many different bus addresses, with each device seeing memory in some device-specific way, but
happily most hardware designers aren't actually actively trying to make things any more complex than
necessary, so you can assume that all external hardware sees the memory the same way.
Now, on normal PC's, the bus address is exactly the same as the physical address, and things are very simple
indeed. However, they are that simple because the memory and the devices share the same address space, and that
is not generally necessarily true on other PCI/ISA setups.
Now, just as an example, on the PReP (PowerPC Reference Platform), the CPU sees a memory map something
like this (this is from memory):
0-2GB
"real memory"
2GB-3GB
"system IO" (ie inb/out type accesses on x86)
3GB-4GB
"IO memory" (ie shared memory over the IO bus)
Now, that looks simple enough. However, when you look at the same thing from the viewpoint of the devices, you
have the reverse, and the physical memory address 0 actually shows up as address 2GB for any IO master.
So when the CPU wants any bus master to write to physical memory 0, it has to give the master address
0x80000000 as the memory address.
So, for example, depending on how the kernel is actually mapped on the PPC, you can end up with a setup like
this:
physical address:
0
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/addrxlate.html (1 di 4) [08/03/2001 10.10.38]

virtual address:
0xC0000000
bus address:
0x80000000
where all the addresses actually point to the same thing, it's just seen through different translations.
Similarly, on the alpha, the normal translation is
physical address:
0
virtual address:
0xfffffc0000000000
bus address:
0x40000000
(but there are also alpha's where the physical address and the bus address are the same).
Anyway, the way to look up all these translations, you do:
#include <asm/io.h>
phys_addr = virt_to_phys(virt_addr);
virt_addr = phys_to_virt(phys_addr);
bus_addr = virt_to_bus(virt_addr);
virt_addr = bus_to_virt(bus_addr);
Now, when do you need these?
You want the virtual address when you are actually going to access that pointer from the kernel. So you can have
something like this (from the aha1542 driver):
/*
* this is the hardware "mailbox" we use to communicate with
* the controller. The controller sees this directly.
*/
struct mailbox {
__u32 status;
__u32 bufstart;
__u32 buflen;
..
} mbox;
unsigned char * retbuffer;
/* get the address from the controller */

retbuffer = bus_to_virt(mbox.bufstart);
switch (retbuffer[0]) {
case STATUS_OK:
...
On the other hand, you want the bus address when you have a buffer that you want to give to the controller:

/* ask the controller to read the sense status into "sense_buffer" */

mbox.bufstart = virt_to_bus(&sense_buffer);
mbox.buflen = sizeof(sense_buffer);
mbox.status = 0;
notify_controller(&mbox);
And you generally never want to use the physical address, because you can't use that from the CPU (the CPU only
uses translated virtual addresses), and you can't use it from the bus master.
So why do we care about the physical address at all? We do need the physical address in some cases, it's just not
very often in normal code. The physical address is needed if you use memory mappings, for example, because the
remap_page_range() mm function wants the physical address of the memory to be remapped (the memory
management layer doesn't know about devices outside the CPU, so it shouldn't need to know about "bus addresses"
etc).
NOTE NOTE NOTE! The above is only one part of the whole equation. The above only talks about "real
memory", i.e. CPU memory, i.e. RAM.
There is a completely different type of memory too, and that's the "shared memory" on the PCI or ISA bus. That's
generally not RAM (although in the case of a video graphics card it can be normal DRAM that is just used for a
frame buffer), but can be things like a packet buffer in a network card etc.
This memory is called "PCI memory" or "shared memory" or "IO memory" or whatever, and there is only one way
to access it: the readb/writeb and related functions. You should never take the address of such memory,
because there is really nothing you can do with such an address: it's not conceptually in the same memory space as
"real memory" at all, so you cannot just dereference a pointer. (Sadly, on x86 it is in the same memory space, so on
x86 it actually works to just deference a pointer, but it's not portable).
For such memory, you can do things like
Reading:
/*
* read first 32 bits from ISA memory at 0xC0000, aka
* C000:0000 in DOS terms
*/
unsigned int signature = readl(0xC0000);
Remapping and writing:
/*
* remap framebuffer PCI memory area at 0xFC000000,
* size 1MB, so that we can access it: We can directly
* access only the 640k-1MB area, so anything else
* has to be remapped.
*/
char * baseptr = ioremap(0xFC000000, 1024*1024);
/* write a 'A' to the offset 10 of the area */

writeb('A',baseptr+10);
/* unmap when we unload the driver */

iounmap(baseptr);
Copying and clearing:
/* get the 6-byte ethernet address at ISA address E000:0040 */

memcpy_fromio(kernel_buffer, 0xE0040, 6);
/* write a packet to the driver */
memcpy_toio(0xE1000, skb->data, skb->len);
/* clear the frame buffer */
memset_io(0xA0000, 0, 0x10000);
Ok, that just about covers the basics of accessing IO portably. Questions? Comments? You may think that all the
above is overly complex, but one day you might find yourself with a 500MHz alpha in front of you, and then you'll
be happy that your driver works ;)
Note that kernel versions 2.0.x (and earlier) mistakenly called ioremap() "vremap()". ioremap() is the
proper name, but I didn't think straight when I wrote it originally. People who have to support both can do
something like:
/* support old naming sillyness */

#if LINUX_VERSION_CODE < 0x020100
#define ioremap vremap
#define iounmap vfree
#endif
at the top of their source files, and then they can use the right names even on 2.0.x systems.
And the above sounds worse than it really is. Most real drivers really don't do all that complex things (or rather: the
complexity is not so much in the actual IO accesses as in error handling and timeouts etc). It's generally not hard to
fix drivers, and in many cases the code actually looks better afterwards:
unsigned long signature = *(unsigned int *) 0xC0000;

vs.
unsigned long signature = readl(0xC0000);

I think the second version actually is more readable, no?
Linus


From a message from Joerg Pommnitz to the linux-kernel mailing list of 11 Nov 1996, edited.
According to Linus Torvalds:
People interested in low-level scary stuff should take a look at the uaccess.h files for x86 or alpha, and be
ready to spend some time just figuring out what it all does ;)
I am, and I did.
Kernel-level exception handling in Linux 2.1.8

When a process runs in kernel mode, it often has to access user mode memory whose address has been passed by an
untrusted program. To protect itself, the kernel has to verify this address.
In older versions of Linux, this was done with the
int verify_area(int type, const void * addr, unsigned long size)

function.
This function verified, that the memory area starting at address addr and of size size was accessible for the operation
specified in type (read or write). To do this, verify_read had to look up the virtual memory area (vma) that contained
the address addr. In the normal case (correctly working program), this test was successful. It only failed for the (hopefully)
rare, buggy program. In some kernel profiling tests, this normally unneeded verification used up a considerable amount of
time.
To overcome this situation, Linus decided to let the virtual memory hardware present in every Linux capable CPU handle
this test.
How does this work?

Whenever the kernel tries to access an address that is currently not accessible, the CPU generates a page fault exception and
calls the page fault handler
void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/i386/mm/fault.c. The parameters on the stack are set up by the low level assembly glue in arch/i386/kernel/entry.S.
The parameter regs is a pointer to the saved registers on the stack, error_code contains a reason code for the
exception.
do_page_fault first obtains the unaccessible address from the CPU control register CR2. If the address is within the
virtual address space of the process, the fault probably occured, because the page was not swapped in, write protected or
something similiar. However, we are interested in the other case: the address is not valid, there is no vma that contains this
address. In this case, the kernel jumps to the bad_area label.
There it uses the address of the instruction that caused the exception (i.e. regs->eip) to find an address where the
excecution can continue (fixup). If this search is successful, the fault handler modifies the return address (again
regs->eip) and returns. The execution will continue at the address in fixup.
Where does fixup point to?
Since we jump to the the contents of fixup, fixup obviously points to executable code. This code is hidden inside the user
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/exceptions.html (1 di 6) [08/03/2001 10.10.39]

access macros. I have picked the get_user macro defined in include/asm/uaccess.h as an example. The definition is
somewhat hard to follow, so lets peek at the code generated by the preprocessor and the compiler. I selected the get_user
call in drivers/char/console.c for a detailed examination.
The original code in console.c line 1405:
get_user(c, buf);
The preprocessor output (edited to become somewhat readable):
(
{
long __gu_err = - 14 , __gu_val = 0;
const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
(((sizeof(*(buf))) <= 0xC0000000UL) &&
((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
do {
__gu_err = 0;
switch ((sizeof(*(buf)))) {
case 1:
__asm__ __volatile__(
"1: mov" "b" " %2,%" "b" "1\n"
"2:\n"
".section .fixup,\"ax\"\n"
"3: movl %3,%0\n"
" xor" "b" " %" "b" "1,%" "b" "1\n"
" jmp 2b\n"
".section __ex_table,\"a\"\n"
" .align 4\n"
" .long 1b,3b\n"
".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct
__large_struct *)
( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
break;
case 2:
"1: mov" "w" " %2,%" "w" "1\n"
"2:\n"
"3: movl %3,%0\n"
" xor" "w" " %" "w" "1,%" "w" "1\n"
" jmp 2b\n"
" .align 4\n"
" .long 1b,3b\n"
".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct
__large_struct *)
( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
break;
case 4:
"1: mov" "l" " %2,%" "" "1\n"
"2:\n"

"3: movl %3,%0\n"
" xor" "l" " %" "" "1,%" "" "1\n"
" jmp 2b\n"
" .align 4\n" " .long 1b,3b\n"
".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct
__large_struct *)
( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
break;
default:
(__gu_val) = __get_user_bad();
}
} while (0) ;
((c)) = (__typeof__(*((buf))))__gu_val;
__gu_err;
}
);
WOW! Black GCC/assembly magic. This is impossible to follow, so lets see what code gcc generates:
xorl %edx,%edx
movl current_set,%eax
cmpl $24,788(%eax)
je .L1424
cmpl $-1073741825,64(%esp)
ja .L1423
.L1424:
movl %edx,%eax
movl 64(%esp),%ebx
#APP
1: movb (%ebx),%dl /* this is the actual user access */
2:
.section .fixup,"ax"
3: movl $-14,%eax
xorb %dl,%dl
jmp 2b
.section __ex_table,"a"
.align 4
.long 1b,3b
.text
#NO_APP
.L1423:
movzbl %dl,%esi
The optimizer does a good job and gives us something we can actually understand. Can we? The actual user access is quite
obvious. Thanks to the unified address space we can just access the address in user memory. But what does the .section
stuff do?
To understand this we have to look at the final kernel:
$ objdump --section-headers vmlinux
vmlinux: file format elf32-i386

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00098f40 c0100000 c0100000 00001000 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4
CONTENTS, ALLOC, LOAD, DATA
5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2
ALLOC
6 .comment 00000ec4 00000000 00000000 000ba748 2**0
CONTENTS, READONLY
7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0
CONTENTS, READONLY
There are obviously 2 non standard ELF sections in the generated object file. But first we want to find out what happened to
our code in the final kernel executable:
$ objdump --disassemble --section=.text vmlinux
c017e785 <do_con_write+c1> xorl %edx,%edx

c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax
c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax)
c017e793 <do_con_write+cf> je c017e79f <do_con_write+db>
c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1)
c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3>
c017e79f <do_con_write+db> movl %edx,%eax
c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx
c017e7a5 <do_con_write+e1> movb (%ebx),%dl
c017e7a7 <do_con_write+e3> movzbl %dl,%esi
The whole user memory access is reduced to 10 x86 machine instructions. The instructions bracketed in the .section
directives are not longer in the normal execution path. They are located in a different section of the executable file:
$ objdump --disassemble --section=.fixup vmlinux
c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax

c0199ffa <.fixup+10ba> xorb %dl,%dl
c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
And finally:
$ objdump --full-contents --section=__ex_table vmlinux
c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................

c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
or in human readable byte order:
c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................

c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................

this is the interesting part!

c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................
What happened? The assembly directives
.section .fixup,"ax"
told the assembler to move the following code to the specified sections in the ELF object file. So the instructions
3: movl $-14,%eax
xorb %dl,%dl
jmp 2b
ended up in the .fixup section of the object file and the addresses
.long 1b,3b
ended up in the __ex_table section of the object file. 1b and 3b are local labels. The local label 1b (1b stands for next
label 1 backward) is the address of the instruction that might fault. In our case, the address of the label 1b is c017e7a5:
the original assembly code:
1: movb (%ebx),%dl
and linked in vmlinux:
The local label 3 (backwards again) is the address of the code to handle the fault, in our case the actual value is c0199ff5:
the original assembly code:
3: movl $-14,%eax
and linked in vmlinux:
c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
The assembly code
.align 4
.long 1b,3b
becomes the value pair
c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................

^this is ^this is
1b 3b
c017e7a5,c0199ff5 in the exception table of the kernel.
In order for the function search_exception_table to find the exception table in the __ex_table section, it uses a
linker feature: whenever the linker sees a section whose entire name is a valid C identifier, it creates the symbols
__start_section and __stop_section delimiting the extents of the section. So search_exception_table
brackets its search by __start___ex_table and __stop___ex_table
Exception handling in action
So, what actually happens if a fault from kernel mode with no suitable vma occurs?
1. access to invalid address:

2. MMU generates exception

3. CPU calls do_page_fault
4. do_page_fault calls search_exception_table (regs->eip == c017e7a5);
5. search_exception_table looks up the address c017e7a5 in the exception table (i.e. the contents of the ELF
section __ex_table and returns the address of the associated fault handle code c0199ff5.
6. do_page_fault modifies its own return address to point to the fault handle code and returns.
7. execution continues in the fault handling code.
8. 8a) EAX becomes -EFAULT (== -14)
8b) DL becomes zero (the value we "read" from user space)
8c) execution continues at local label 2 (address of the instruction immediately after the faulting user access).
The steps 8a to 8c in a certain way emulate the faulting instruction.
That's it, mostly. If you look at our example, you might ask why we set EAX to -EFAULT in the exception handler code.
Well, the get_user macro actually returns a value: 0, if the user access was successful, -EFAULT on failure. Our original
code did not test this return value, however the inline assembly code in get_user tries to return -EFAULT. GCC selected
EAX to return this value.
Joerg Pommnitz | joerg@raleigh.ibm.com | Never attribute to malloc

Mobile/Wireless | Dept UMRA | that which can be adequately
Tel:(919)254-6397 | Office B502/E117 | explained by stupidity.

DMA to user space
DMA to user space

Forum: Device Drivers
Date: Wed, 11 Jun 1997 09:23:28 GMT
From: Marcel Boosten <Marcel.Boosten@cern.ch>
Hello,
I'm developing a device driver for a PCI board meant for

high performance communication. Interaction with the board
is possible via DMA. In order to get optimal performance
I need to do DMA directly to user space.
QUESTION:
How do I implement DMA to user space?
SUBQUESTIONS:
In "The Linux Kernel", David A Rusling writes the following:
"Device drivers have to be careful when using DMA. First
of all the DMA controller knows nothing of virtual memory,
it only has access to the physical memory in the system.
Therefore the memory that is being DMA'd to or from must
be a contiguous block of physical memory. This means that
you cannot DMA directly into the virtual address space of
a process. YOU CAN HOWEVER LOCK THE PROCESSES PHYSICAL
PAGES INTO MEMORY, PREVENTING THEM FROM BEING SWAPPED OUT
TO THE SWAP DEVICE DURING A DMA OPERATION. Secondly, the
DMA controller cannot access the whole of physical memory.
The DMA channel's address register represents the first 16
bits of the DMA address, the next 8 bits come from the page
register. This means that DMA requests are limited to the
bottom 16 Mbytes of memory."
[see http://www.linuxhq.com/guides/TLK/node87.html]
Reading this the following subquestions arise:

- How can one lock specific process pages?
- How can one obtain the physical address of
the pages involved?
- How can one ensure that the pages involved
are DMA-able (below 16Mb)?
- Is it possible to obtain a continues block of
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/22.html (1 di 2) [08/03/2001 10.10.39]

DMA to user space
physical memory in user space?
Greetings,
Marcel

How a device driver can driver his device
How a device driver can driver his device

Keywords: device driver
Date: Sat, 31 May 1997 07:31:17 GMT
From: Kim yeonseop <javakys@hyowon.pusan.ac.kr>
Hi.
I am a beginner for device driver on linux.

I wrote , 'zero.c' which is introduced to
'http://www.redhat.com/~johnsonm/devices.html',
a sample driver to test on my system .
But I don't know how to drive 'zero device' by this device driver.
Please help me.
Thanks for your response.
Kim yeonseop : javakys@hyowon.pusan.ac.kr
Messages
1. Untitled
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/21.html [08/03/2001 10.10.39]

Untitled
Untitled
Re: How a device driver can driver his device (Kim yeonseop)
Date: Thu, 05 Jun 1997 02:08:25 GMT
From: <unknown>
What do you mean to "drive the driver"? more clearly.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/21/1.html [08/03/2001 10.10.40]

memcpy error?
memcpy error?
Keywords: memcpy verify_area
Date: Wed, 21 May 1997 14:33:34 GMT
From: Edgar Vonk <edgar@it.et.tudelft.nl>
I am using memcpy in a device driver to copy data between to buffers in kernel space (one is a DMA
buffer) and I keep getting segmentation faults I can't explain.
I changed the driver since and now it copies the DMA buffer directly into user space with
memcpy_tofs (and verify_area) and this seems to work just fine.
Anyone know why? Does this have to do with the memcpy faults under heavy system load? I saw a
discussion and a kernel patch about this somewhere.
thanks,
(running i586-linux-2.0.30-RedHat4.1)

Unable to handle kernel paging request - error
Unable to handle kernel paging request - error

Keywords: kernel paging
Date: Wed, 14 May 1997 16:15:40 GMT
From: Edgar Vonk <edgar@it.et.tudelft.nl>
Hai,
just a simple question. What does the "Unable to handle kernel paging request at virtual address ..."
usually indicate?
Does this mean a memory allocation problem, or just a memory addressing problem. Also, why does it
come back with a virtual address and not a physical one? Does this mean it is doing something in user
space?
I am writing a device driver for a Data Acquisition Card, but haven't got a clue what the bug in my
code is.
cheers,
(Running i586-Linux-2.0.30-RedHat4.1)

_syscallX() Macros
_syscallX() Macros
Date: Wed, 26 Mar 1997 23:07:31 GMT
From: Tom Howley <unknown>
Is it possible to use _syscallX macros in loadable device drivers. I first of all have had problems with
"errno: wrong version or undefined".It seems to be defined in linux/lib/errno.c. I want to be able to use
the system calls signal, getitimer and setitimer in my driver Does anybody know how I can get a
_syscall() macro to work in my loadable device driver??
Any advice would be much appreciated.
Tom.

MediaMagic Sound Card DSP-16. How to run in Linux.
MediaMagic Sound Card DSP-16. How to run in Linux.

Keywords: MediaMagic DSP16
Date: Tue, 18 Mar 1997 04:25:37 GMT
From: Robert Hinson <oppie@afn.org>
I am looking for a way to run the MediaMagic Sound Card DSP-16 under Linux RedHat
4.0?
I would very much appreciate it. Or how to set it up with the current drivers. I
know
it is SoundBlaster and SoundBlaster Pro Compatible, but I don't know how to make it
work.
I would like some help. My e-mail address is oppie@afn.org since I don't read this.

What does mark_bh() do?
What does mark_bh() do?

Keywords: network drivers interrupt mark_bh
Date: Wed, 12 Mar 1997 01:42:49 GMT
From: Erik Petersen <erik@spellcast.com>
Can someone expain when and how I use mark_bh(). I am assuming from general knowledge that it
mark the end of the interrupt service routine and allows a context switch in following code.
Here is why I want to know. I have a network driver in which it would be advantagous to be able to
sleep during code initiated by an interrupt. For example a piece of data is received by the device which
is passed to a kernel daemon via a character device inode and a select call. I then want to wait for the
daemon to respond or timeout.
The question is, if I call mark_bh(NET_BH) IMMEDIATE_BH?? before I sleep, can I sleep or do I
Aiee...Killing Interrupt handler, Idle task may not sleep?
mark_bh doesn't seem to be explained anywhere but is used by many net drivers for reasons I don't
understand. Is there somewhere I can look for this information?
My only obvious alternative at this point is to create a request queue of some sort and respond to
activity on the character device. The problem is that I can't really continue transferring data until I get
a response from the daemon.
Any thoughts?
Erik Petersen.
Messages
1. Untitled by Praveen Dwivedi

Untitled
Untitled
Re: What does mark_bh() do? (Erik Petersen)
Keywords: network drivers interrupt mark_bh
Date: Fri, 14 Mar 1997 08:28:12 GMT
From: Praveen Dwivedi <pkd@sequent.com>
I am not an expert on Linux kernel but here is what my

hacking wisdom says.
mark_bh marks the bottom half of some hardware interrupts.

An example would be timer interrupt which comes
100 times a second. Generally what happens is that you
do minimum stuff in actual handler and call mark_bh()
which takes care of updating lots of time related system
stuff. The reason why this is the preferred way to do
things is because you want to have actual interrupt handler
as small as possible so as to avoid losing further interrupts.
Look at the code in do_timer. It may help.
-pkd

3D Acceleration
3D Acceleration
Keywords: 3D acceleration driver
Date: Sat, 08 Mar 1997 18:04:25 GMT
From: <jamesbat@innotts.co.uk>
How would I go about making a driver for the Apocalypse 3D please Email reply

Device Drivers: /dev/radio...
Device Drivers: /dev/radio...

Keywords: device /dev radio
Date: Fri, 07 Mar 1997 00:56:50 GMT
From: Matthew Kirkwood <weejock@ferret.lmh.ox.ac.uk>
Hi,
I intend to write (when my radio card arrives in a couple of days) a driver for /dev/radio.
I have already obtained reasonable information for this, which is all fair enough, but I have not yet
seen anything along the lines of "/dev/* device creating for the inept...". Should I create a document
explaining this? (/dev/radio, as I envisage it, would be a mostly ioctl based thing, depending upon
hardware support....)
Thanks, and keep up the hacking, Matthew.

Does anybody know why kernel wakes my driver up without apparant reasons?
Does anybody know why kernel wakes my driver

up without apparant reasons?
Keywords: wake_up interrupt time_out
Date: Wed, 26 Feb 1997 17:02:31 GMT
From: David van Leeuwen <david@tm.tno.nl>
Hi, i've written a device driver for a cdrom device. It's old. I know. But i keep getting compaints that it
doesn't work reliably.
It used to work OK in the old 1.3.fourties. Since more modern kernel version, it tended to break more
often. Read errors...
I spent days tracking down the bug, it appeared that the driver was woken without an interrupt
occurring, or my own time-out causing the wake-up. I was stymied.
Now i posted a message similar to this to the kernel list half a year ago. But i wasn't capable of reading
the list (sorry) because i use my e-mail address at work. Apparently, there was some short reaction that
my go_to_sleep routine should do something like
while(!my_interrupt_woke_me)
sleep_on(&wait)
Why is this? Why does the kernel wake me up if i didn't ask for it (i.e., no interrupt occured and no
time-out occurred)
I found out that the sleep_on() could immediately wakeup (i.e., not go to sleep) for many times in a
row. I had to hack around by trying to go to sleep up to 100 times, but i am not charmed by the hack.
Does it have to do with the (new?) macros DEVICE_TIMEOUT and TIMEOUT_VALUE that i've
_not_ defined (because i wrote it in the KHG 0.5 days...).
Thanks,
---david (david@tm.tno.nl)

Getting a DMA buffer aligned with 64k boundaries
Getting a DMA buffer aligned with 64k boundaries

Date: Sun, 17 Nov 1996 01:25:45 GMT
From: Juan de La Figuera Bayon <juan@hobbes.fmc.uam.es>
I'm writing a device driver for a Data Translation DT2821 adquisition card. It includes DMA (and I
have already worked with it under MSDOS). The polled modes for DA and AD conversion already
work. But for the DMA, I need to ask for a buffer which can be up to 128k in size (ok, I usually ask
for less than 256 words in my aplication). And it should be aligned with 64k boundaries. I suppose it is
something pretty obvious, but it is my first try at device driver programming under Linux. Any help
would be appreciated.

Hardware Interface I/O Access
Hardware Interface I/O Access

Keywords: I/O
Date: Mon, 07 Oct 1996 12:36:40 GMT
From: Terry Moore <tmoore@solbrn.dseg.ti.com>
I need to write a driver using inb() and outb().

I am struggling with compiling a simple test program
to test these fuctions.
(1.) If I use gcc -o -DMODULE -D__KERNEL__ -c myfile.c
I do not understand how to use the resulting file
created -DMODULE.
(2.) If I use gcc -o tst tst.c the following fail appears.
undefined reference to __inbc
undefeined reference to __inb
What I expected was a executable program to run from the command line.
Thanks Terry M. tmoore@solbrn.dseg.ti.com
Messages
1. You are somewhat confused... by Michael K. Johnson

You are somewhat confused...
You are somewhat confused...

Re: Hardware Interface I/O Access (Terry Moore)
Keywords: I/O
Date: Mon, 14 Oct 1996 22:16:16 GMT
You've got two things mixed up--user level drivers and kernel loadable modules. An executable
program is what you want, not a module, so don't define MODULE. Just compile your executable with
-O and the undefined references should go away.

Is Anybody know something about SIS 496 IDE chipset?
Is Anybody know something about SIS 496 IDE

chipset?
Date: Fri, 27 Sep 1996 13:13:21 GMT
From: Alexander <avenco@online.ru>
I use SIS 496 (E)IDE controler chipset. Linux 2.0.21 doesn't support it. Is anybody know about it? I
need technical informationd about the chipset for writig driver. e-mail : avenco@online.ru
alex
Messages

Vertical Retrace Interrupt - I need to use it
Vertical Retrace Interrupt - I need to use it

Date: Wed, 04 Sep 1996 06:39:27 GMT
From: Brynn Rogers <brynn@wwa.com>
I am writing an application that provides new images to the screen every vertical refresh. (Think of it
as an animation)
As I understand it, I need to write a device driver to hook the vertical retrace interrupt (whatever
interrupt your graphics card generates), and to install a new colormap so the next image is cleanly
flipped in. (I don't need many colors, but I need lots of images).
I have been devouring all information (and donuts) I can get my hands on, and still am a little bit
clueless as to how I should go about this. What I am really confused about is this: Should I have a
device that my animation program opens and then uses ioctls to talk to, Just have the driver wake my
process and signal it, or Something much better that somebody will clue me in on.
The driver only needs to know a few things, like which image planes are ready and the ID's? of the
colormaps to use for which planes, and which screen or GC or whatever it needs.
Brynn
Messages
1. Your choice... by Michael K. Johnson

Your choice...
Your choice...
Re: Vertical Retrace Interrupt - I need to use it (Brynn Rogers)
Date: Sun, 29 Sep 1996 20:44:18 GMT
What I am really confused about is this: Should I have a device that my animation
program opens and then uses ioctls to talk to, Just have the driver wake my process and
signal it, or Something much better that somebody will clue me in on.
You are quite right that you need a device driver. If you can, I recommend avoiding using ioctls; if you
can use the write() method to take data from the application and the read() method to give data
back to the application (remember that those names are user-space-centric), I would recommend that
you do it that way. It doesn't sound to me like a case in which ioctl()'s would be the cleanest
solution.

help working with skb structures
help working with skb structures

Date: Thu, 29 Aug 1996 15:44:32 GMT
From: arkane <cat@iol.unh.edu>
I am working on interfacing directly with the networking device drivers on my linux box. I have
tracked down the functions for transmitting ( dev_queue_xmit() ) packets down to the driver level.
What I need to do is bypass the socket interface without destroying it ... So that I can transmit my own
packets or my own design down to the wire ( I am using this for my job of testing new networking
hardware -- RMON probes mostly ) so I need to be able to create both good and bad packets with most
any kind of data contained inside as RMON-2 will be able to pick apart a packet and identify its
contents.
We can build the packets, but we can't get them to the wire through standard means. I think that this
can be accomplished with the dev_queue_xmit() function. Question is: in the sk_buff structure what
do I need to set up specifically so that dev_queue_xmit() and the driver will simply pass my data to the
hardware without building the standard headers required by ethernet and other network types? I'll
worry about that, and if I make a mistake I will clean up the mess. Any help is appreciated.
TIA cat@iol.unh.edu

Interrupt Sharing ?
Interrupt Sharing ?
Keywords: Interrupt sharing, PCI, Plug%0Aamp;Play
Date: Tue, 11 Jun 1996 16:09:00 GMT
From: Frieder Löffler <floeff@mathematik.uni-stuttgart.de>
I wonder if interrupt sharing is an issue for the Linux kernel. I currently have a machine with 2 PCI
Plug&Play devices that choose the same irq (an HP-Vectra onboard SCSI controller and a HP J2585
100-VG-AnyLan card).
It seems that there is no way to use such configurations at the moment?
Frieder
Messages
1. Interrupt sharing-possible by Vladimir Myslik
-> Interrupt sharing - How to do with Network Drivers? by Frieder Löffler

Interrupt sharing-possible
Interrupt sharing-possible
Re: Interrupt Sharing ? (Frieder Löffler)
Date: Thu, 11 Jul 1996 02:24:57 GMT
From: Vladimir Myslik <xmyslik@cslab.felk.cvut.cz>
Linux kernel has support for shared interrupt usage. It has a list of routines (func.) that are called when
an HW intr arises. On the interrupt arrival, the routines in the list are circularily called in the order in
which the devices ISRs were hooked onto this chain.
So, if your SCSI generates int#11 and your ethernet card the same irq, and the bus really notices CPU
about them, linux should have no problems.
However, the ISA and IMHO PCI devices have problems with sharing one IRQ line per several
physical cards (devices). The devices should had been designed with open collector or with 3-state
IRQ lines with transition to IRQ active only during the interrupt generation(log. 0/1), instead of sitting
on the irq line.
So, a user wanting to find out whether it's possible to share one irq line, should set both the cards to it,
make either of them generate interrupt (packet arrival,seek on disk) and look at the /proc/interrupts
statistics, whether the appropriate number incremented or not.
Got all from usenet&kernel sources, don't blame me.
Messages
1. Interrupt sharing - How to do with Network Drivers? by Frieder Löffler

Interrupt sharing - How to do with Network Drivers?
Interrupt sharing - How to do with Network

Drivers?
Date: Thu, 11 Jul 1996 09:02:56 GMT
From: Frieder Löffler <floeff@mathematik.uni-stuttgart.de>
Hi,
you are right - as I noticed in the AM53C974 SCSI driver, some drivers seem to be designed to share
interrupts. But I cannot see at the moment how I can implement interrupt sharing in the networking
drivers. Maybe someone could explain how this can be done - for example by adding some lines of
code to skeleton.c ?
Right now, I can't see how I am supposed to register the interrupt handler routine for the second driver.
Thanks, Frieder
Messages
1. Interrupt sharing 101 by Christophe Beauregard
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/5/1/1.html [08/03/2001 10.10.47]

Interrupt sharing 101

Re: Interrupt sharing - How to do with Network Drivers? (Frieder Löffler)
Date: Wed, 28 Aug 1996 16:48:01 GMT
From: Christophe Beauregard <chrisb@truespectra.com>
I guess this would be a handy thing to have in the knowledge base...

The key thing to sharing an interrupt is to make sure that you have separate context information for
each instance of the driver. That is, no static global variables. For most network drivers you just use
the ``struct device* dev'' for the context.
Pass this to request_irq() as the last argument:
request_irq( irq, interrupt_handler, SA_SHIRQ, "MyDevice",
dev );
Note that the SA_INTERRUPT flag is significant here, since you can't share an IRQ if one driver uses
fast interrupts and the other uses slow interrupts. This is a bug, IMHO, since long chains of interrupt
handlers may alter the timing such that processing is no longer ``fast''. A better behaviour would be to
just implicitly change to slow interrupts when more than one device is on the IRQ (and change back
when the device is released down to one fast handler, of course).
Then your interrupt handler looks something like this:
static void interrupt_handler( int irq, void* dev_id, ...) {
struct device* dev = (struct device*) dev_id;
if( dev == NULL ) {
ASSERT( 0 ); /* stupid programmer error. Either
we passed a NULL dev to request_irq
or someone screwed up irq.c */
}
/* query the device to see if it caused the interrupt */
if( !(inb(something)&something_else) ) {
/* nope, not us - normally we'd call this a spurious
interrupt, but it might belong to another device. */
return;
}
/* now, using the dev structure, service the interrupt */
...
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/5/1/1/1.html (1 di 2) [08/03/2001 10.10.47]

/* tell the hardware device we're done (IMPORTANT)

If this isn't done, the device will continue to hold
the IRQ line high, and we go into a nasty interrupt
loop. Some devices might do this implicitly in the
interrupt processing (i.e. by emptying a buffer) */
outb(something, something);
}
Because you have a separate ``struct device*'' for each instance of the card, multiple cards can share
the same IRQ. Of course, they can also share the IRQ with other card, assuming they all Do The Right
Thing.
You can usually modify an existing device to do shared IRQs by simply finding the part of the code
where it spews out a spurious interrupt message and replacing that with a `return' statement, adding
SA_SHIRQ to the request_irq call, and removing references to irq2dev_map[]. I've had no problems
doing this for drivers including drivers/char/psaux.c, drivers/net/tulip.c, drivers/scsi/aicxxx7.c and
most of the MCA drivers.
c.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/5/1/1/1.html (2 di 2) [08/03/2001 10.10.47]

Device Driver notification of "Linux going down"

Keywords: device drivers shutdown modules init watchdog
Date: Tue, 06 Aug 1996 20:21:06 GMT
From: Stan Troeh <stan@forthrt.com>
We have written a character device driver for FORTHRIGHT's PC WATCHDOG SYSTEM. We find,
however, in developing a generic "watch" application (tells the hardware that Linux is still healthy)
that we can't detect when Linux is intentionally shutting down. If Linux succeeds in going down and
back up within the default 2 minute window, everything is transparent and no problem occurs.
However we find that at times when the tolerance is set tighter or the user delays making a LILO
selection, etc., that the hardware performs a physical PC Reset while Linux is doing coming up (after a
shutdown -r).
Is there a "shutting down" call made to the drivers? We have not found one, but have found a place
where it could be added in ReadItab() or InitMain() of init.c. But that doesn't seem closely related to
device driver module management.
Suggestions for better ways to package the driver are welcome. We would also be willing to work on a
"generic" solution (such as a device driver __halt() routine) if there is interest in this approach.
Messages
1. Through application which has opened the device by Michael K. Johnson
2. Device Driver notification of "Linux going down" by Marko Kohtala

Through application which has opened the device
Through application which has opened the device

Re: Device Driver notification of "Linux going down" (Stan Troeh)
Keywords: device drivers shutdown modules init watchdog
Date: Wed, 14 Aug 1996 04:55:38 GMT
In order to shut down a device, have a user-level application have it opened, and when it is sent
SIGTERM by init (or, presumably, any other process), close the device or alert it of the shutdown in
some other way.


Re: Device Driver notification of "Linux going down" (Stan Troeh)
Keywords: device drivers shutdown modules init watchdog notifier
Date: Mon, 03 Mar 1997 09:33:14 GMT
From: Marko Kohtala <Marko.Kohtala@ntc.nokia.com>
In 2.1.x kernels there is a boot_notifier_list. See in include/linux/notifier.h and kernel/sys.c.

Is waitv honored?
Is waitv honored?
Keywords: waitv VT
Date: Sun, 07 Jul 1996 02:18:18 GMT
The vt_mode structure in /usr/include/linux/vt.h has a member called waitv that doesn't seem to be
used. That is, drivers/char/vt.c examines and sets it, and drivers/char/tty_io.c resets it when the
terminal is reset, but nothing else seems to be done with it.
I'm guessing that it exists because the SVR4 VT code has a structure member of the same name, and
that the only reason it is set and reset is for compatibility with apps written for SVR4. Am I right?

PCI Driver
PCI Driver
Keywords: A PCI Driver ???
Date: Wed, 12 Jun 1996 17:04:44 GMT
From: Flavia Donno <flavia@galileo.pi.infn.it>
Probably this is not the right place for this question, but please ... Answer to me! Has anyone written a
PCI driver for Lynux ? Any example ? Documentation ?
Thank you in advance.
Flavia
Messages
1. There is linux-2.0/drivers/pci/pci.c by Hasdi

There is linux-2.0/drivers/pci/pci.c
There is linux-2.0/drivers/pci/pci.c
Re: PCI Driver (Flavia Donno)
Keywords: A PCI Driver ???
Date: Thu, 13 Jun 1996 19:38:02 GMT
From: Hasdi <hasdi@engin.umich.edu>
The subject says it all.

I don't know why pci.c is the only file in the pci directory. I thought there are lots of pci drivers. Is
there something about pci that every kernel should know about?

Re: Network Device Drivers

Keywords: network driver prototype functions
Date: Wed, 22 May 1996 16:23:09 GMT
From: Paul Gortmaker <gpg109@rsphy1.anu.edu.au>
> I don't know anything about this topic. The kernel source
> includes a skeleton.c file that can get you started.
> Someone has promised to write this section, so check back
> sometime...
Hrrm, who was that? (just curious...)
Somebody asked me about a year or so ago as to what the basics

of a net driver would look like. I haven't seen Alan's article
in linux journal, so this may be useless in comparison.
Regardless, here it is anyway.
Paul.
------------------------------
1) Probe:
called at boot to check for existence of card. Best if it
can check un-obtrsively by reading from memory, etc. Can
also read from i/o ports. Writing to i/o ports in a probe
is *not* allowed as it may kill another device.
Some device initialization is usually done here (allocating
i/o space, IRQs,filling in the dev->??? fields etc.)
2) Interrupt handler:
Called by the kernel when the card posts an interrupt.
This has the job of determining why the card posted
an interrupt, and acting accordingly. Usual interrupt
conditions are data to be rec'd, transmit completed,
error conditions being reported.
3) Transmit function
Linked to dev->hard_start_xmit() and is called by the
kernel when there is some data that the kernel wants
to put out over the device. This puts the data onto

the card and triggers the transmit.
4) Receive function
Called by the interrupt handler when the card reports
that there is data on the card. It pulls the data off
the card, packages it into a sk_buff and lets the
kernel know the data is there for it by doing a
netif_rx(sk_buff)
5) Open function
linked to dev->open and called by the networking layers
when somebody does "ifconfig <device_name> up" -- this
puts the device on line and enables it for Rx/Tx of
data.
Someday, perhaps I will have the time to write a proper

document on the subject.... Naaaaahhhhhh.
Messages
1. Re: Network Device Drivers by Neal Tucker
2. Transmit function by Joerg Schorr


Re: Re: Network Device Drivers (Paul Gortmaker)
Keywords: network driver functions
Date: Thu, 30 May 1996 10:42:09 GMT
Paul Gortmaker says:
>Somebody asked me about a year or so ago as to what the basics

>of a net driver would look like.
>
>1) Probe:
> called at boot to check for existence of card. Best if it
> can check un-obtrsively by reading from memory, etc. Can
> also read from i/o ports. Writing to i/o ports in a probe
> is *not* allowed as it may kill another device.
> Some device initialization is usually done here (allocating
> i/o space, IRQs,filling in the dev->??? fields etc.)
You must be the guy that wrote that part of the ethernet HOWTO. :-)
I've just recently been looking at the network device driver interface, and I read your stuff and this part
confused me, since all the code I was looking at (dummy, loopback, slip..) refers to this as "init",
rather than "probe". (which may sound a bit nit picky, but there were other routines called "probe" that
I studied for a while, thinking they were the important ones (they turned out to be used for module
initialization only) :-).
But on to my real reason for writing... One thing that I think would be helpful to people trying to write
a network driver for the first time is a description of how this is all hooked into the kernel. I've found
plenty of examples of what the actual driver code needs to do, (lots of some_driver.c files, including
skeleton.c, which is usually what people point to), but no explanation of how to get it called.
Basically what it comes down to is an explanation of Space.c, which doesn't do very much, but is a bit
funny looking to a first-timer. Now that I understand it, it seems a bit obvious, but back when I was
going mad trying to figure out why my driver didn't execute, it would have been really nice to have it
all spelled out.
So once it's done, I will submit a description. If you'd like, check out a start at
http://fester.axis.net/~linux/454.html. Make sure to let me and/or the rest of the world what you think.
-Neal Tucker
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/1.html (1 di 2) [08/03/2001 10.10.51]

Messages
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/1.html (2 di 2) [08/03/2001 10.10.51]

network driver info
network driver info

Re: Re: Network Device Drivers (Neal Tucker)
Keywords: network driver functions
Date: Sat, 15 Jun 1996 03:33:21 GMT
Earlier, I posted a pointer to a bit of info on network device drivers, and the site that the web page is
on is going away, so I am including what was there here...
How a Network Device Gets Added to the Kernel
There is a global variable called dev_base which points to a linked list of "device" structures. Each
record represents a network device, and contains a pointer to the device driver's initialization function.
The initialization function is the first code from the driver to ever get executed, and is responsible for
setting up the hooks to the other driver code.
At boot time, the function device_setup (drivers/block/genhd.c) calls a function called
net_dev_init (net/core/dev.c) which walks through the linked list pointed to by dev_base,
calling each device's init function. If the init indicates failure (by returning a nonzero result),
net_dev_init removes the device from the linked list and continues on.
This brings up the question of how the devices get added to the linked list of devices before any of
their code is executed. That is accomplished by a clever piece of C preprocessor work in
drivers/net/Space.c. This file has the static declarations for each device's "device" struct, including the
pointer to the next device in the list. How can we define these links statically without knowing which
devices are going to be included? Here's how it's done (from drivers/net/Space.c):
#define NEXT_DEV NULL
#if defined(CONFIG_SLIP)
static struct device slip_dev =
{
device name and some other info goes here
...
NEXT_DEV, /* <- link to previously listed */
/* device struct (NULL here) */
slip_init, /* <- pointer to init function */
};
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/1/1.html (1 di 4) [08/03/2001 10.10.51]

network driver info
#undef NEXT_DEV
#define NEXT_DEV (&slip_dev)
#endif
#if defined(CONFIG_PPP)
static struct device ppp_dev =
{
...
/* device struct, which is now *
/* defined as &slip_dev */
ppp_init, /* <- pointer to init function */
};
#undef NEXT_DEV
#define NEXT_DEV (&ppp_dev)
#endif
struct device loopback_dev =

{
...
/* device struct, which is now */
/* defined as &ppp_dev */
loopback_init, /* <- pointer to init function */
};
/* And finally, the head of the list, which points */

/* to the most recently defined device struct, */
/* loopback_dev. This (dev_base) is the pointer the */
/* kernel uses to access all the devices. */
struct device *dev_base = &loopback_dev;
There is a constant, NEXT_DEV, defined to always point at the last device record declared. When each
device record gets declared, it puts the value of NEXT_DEV in itself as the "next" pointer and then
redefines NEXT_DEV to point to itself. This is how the linked list is built. Note that NEXT_DEV starts
out NULL so that the first device structure is the end of the list, and at the end, the global dev_base,
which is the head of the list, gets the value of the last device structure.
Ethernet devices

network driver info
Ethernet devices are a bit of a special case in how they get called at initialization time, probably due to
the fact that there are so many different types of ethernet devices that we'd like to be able to refer to
them by just calling them ethernet devices (ie "eth0", "eth1", etc), rather than calling them by name (ie
"NE2000", "3C509", etc).
In the linked list mentioned above, there is a single entry for all ethernet devices, whose initialization
function is set to the function ethif_probe (also defined in drivers/net/Space.c). This function
simply calls each ethernet device's init function until it finds one that succeeds. This is done with a
huge expression made up of the ANDed results of the calls to the initialization functions (note that
with the ethernet devices, the init function is conventionally called xxx_probe). Here is an abridged
version of that function:
static int ethif_probe(struct device *dev)

{
u_long base_addr = dev->base_addr;
if ((base_addr == 0xffe0) || (base_addr == 1))

return 1;
if (1 /* note start of expression here */

#ifdef CONFIG_DGRS
&& dgrs_probe(dev)
#endif
#ifdef CONFIG_VORTEX
&& tc59x_probe(dev)
#endif
#ifdef CONFIG_NE2000
&& ne_probe(dev)
#endif
&& 1 ) { /* end of expression here */
return 1;
}
return 0;
}
The result is that the if statement bails out as false if any of the probe calls returns zero (success),
and only one ethernet card is initialized and used, no matter how many drivers you have installed. For
the drivers that aren't installed, the #ifdef removes the code completely, and the expression gets a
bit smaller. The implications of this scheme are that supporting multiple ethernet cards is now a
special case, and requires providing command line parameters to the kernel which cause
ethif_probe to be executed multiple times.
Messages

network driver info
1. Network Driver Desprately Needed by Paul Atkinson

Network Driver Desprately Needed
Network Driver Desprately Needed

Re: Re: Network Device Drivers (Neal Tucker)
Re: network driver info (Neal Tucker)
Keywords: device drivers network compaq tlan thunderlan netflex
Date: Tue, 06 May 1997 20:56:36 GMT
From: Paul Atkinson <patkinson@aerotek.co.uk>
I have looked everywhere for a Compaq Netflex 100BaseT network card device driver/patch and have
come up with nothing :( I wouldn't know where to start to make my own (I have a hard enough time
recompiling the kernel!). If anyone would like to fill a void in Linux Hardware Compatibility it would
be very much appreciated. The card is based on a T1 ThunderLAN chip.
Many thanks
Paul.
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/1/1/1.html [08/03/2001 10.10.52]

Transmit function
Transmit function
Date: Fri, 31 May 1996 20:55:37 GMT
From: Joerg Schorr <jschorr@studi.epfl.ch>
> 3) Transmit function

> Linked to dev->hard_start_xmit() and is called by the
> kernel when there is some data that the kernel wants
> to put out over the device. This puts the data onto
> the card and triggers the transmit.
Well, i'm having to some work with network on linux, and i also
noticed this part for transmit; but the PC i am working on, uses
a WD80x3 card (using the wd.c driver), and as it seems the transmit function
is wd_block_output; but where are between the dev->hard_start_xmit
and the wd_block_ouptut??
I haven't it out for the moment.
Messages

Re: Transmit function
Re: Transmit function

Re: Transmit function (Joerg Schorr)
Date: Fri, 31 May 1996 23:55:18 GMT
From: Paul Gortmaker <unknown>
The wd driver is not a complete driver by itself. It uses the code in 8390.c to do most of the work. The
function ei_transmit() in 8390.c is what is linked to dev->hard_start_xmit(), and then ei_transmit will
call ei_block_output() which in this case is pointing at wd_block_output().
Paul.
Messages
1. Skbuff by Joerg Schorr
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/2/1.html [08/03/2001 10.10.52]

Skbuff
Skbuff
Re: Transmit function (Joerg Schorr)
Re: Re: Transmit function (Paul Gortmaker)
Date: Thu, 06 Jun 1996 19:39:48 GMT
From: Joerg Schorr <jschorr@studi.epfl.ch>
In the wd_block_output (in wd.c) function, there is a moment

where the buf (which is skb->data) is copied to the shared
memory of the ethercard (if I understood it right). But I
didn't found out when the message and headers (ip and udp for
the case I am interested in) are copied in skb->data??
Also: what is exactly in skb->data?? Is there more than the

message and the headers?? Also the rest of the skbuffer??
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/devices/1/2/1/1.html [08/03/2001 10.10.53]

Filesystems
Filesystems
There has been very little documentation so far regarding writing filesystems for Linux. Let's change
that...
A tour of the Linux VFS
Before you can consider writing a filesystem for Linux, you need to have at least a vague
understanding of how the Linux Virtual Filesystem Switch operates. The basic principles are fairly
simple.
Design and Implementation of the Second Extended Filesystem
The three main authors and designers of ext2fs have written an excellent paper on it, and have
contributed it to the KHG. This paper was first published in the Proceedings of the First Dutch
International Symposium on Linux, ISBN 90-367-0385-9.
Filesystem Tutorial
It may be a long time before I get around to writing a tutorial on how to write a Linux filesystem;
after all, I've never done it. Would someone else like to help?
Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com.
Messages
12. Need /proc info by Kai Xu
11. Where to find libext2 sources? by Mark Salter
1. Nevermind... by Mark Salter
10. New File System by Vamsi Krishna
9. Partition? by Wilfredo Lugo Beauchamp
1. Please be more specific.... by Theodore Ts'o
8. Need documentation on userfs implementation by Natchu Vishnu Priya
7. ext2fs tools by Wilfredo Lugo Beauchamp
1. Where to find e2fsprogs by Theodore Ts'o
6. libext2fs documentation by James Beckett
1. libext2fs documentation by Theodore Ts'o
5. proc filesystem by Praveen Krishnan
1. man proc by Michael K. Johnson
4. Need NFS documentation by Ermelindo Mauriello
3. Even more ext2 documentation! by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/fs.html (1 di 2) [08/03/2001 10.10.53]

Filesystems
2. More ext2 documentation by Michael K. Johnson

1. Ext2 paper by Theodore Ts'o
1. Done. by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/fs.html (2 di 2) [08/03/2001 10.10.53]


I'm not an expert on this topic. I've never written a filesystem from scratch; I've only worked on the proc
filesystem, and I didn't do much real filesystem hacking there, only extensions to what was already there.
So if you see any mistakes or ommissions here (there have got to be ommissions in a piece this short on a topic
this large), please respond, in order to let me fix them and let other people know about them.
In Linux, all files are accessed through the Virtual Filesystem Switch, or VFS. This is a layer of code which implements
generic filesystem actions and vectors requests to the correct specific code to handle the request. Two main types of code
modules take advantage of the VFS services, device drivers and filesystems. Because device drivers are covered elsewhere in
the KHG, we won't cover them explicitly here. This tour will focus on filesystems. Because the VFS doesn't exist in a vacuum,
we'll show its relationship with the favorite Linux filesystem, the ext2 filesystem.
One warning: without a decent understanding of the system calls that have to do with files, you are not likely to be able to
make heads or tails of filesystems. Most of the VFS and most of the code in a normal Linux filesystem is pretty directly related
to completing normal system calls, and you will not be able to understand how the rest of the system works without
understanding the system calls on which it is based.
Where to find the code

The source code for the VFS is in the fs/ subdirectory of the Linux kernel source, along with a few other related pieces, such as
the buffer cache and code to deal with each executable file format. Each specific filesystem is kept in a lower subdirectory; for
example, the ext2 filesystem source code is kept in fs/ext2/.
This table gives the names of the files in the fs/ subdirectory and explains the basic purpose of each one. The middle column,
labeled system, is supposed to show to which major subsystem the file is (mainly) dedicated. EXE means that it is used for
recognizing and loading executable files. DEV means that is for device driver support. BUF means buffer cache. VFS means
that it is a part of the VFS, and delegates some functionality to filesystem-specific code. VFSg means that this code is
completely generic and never delegates part of its operation to specific filesystem code (that I noticed, anyway) and which you
shouldn't have to worry about while writing a filesystem.
Filename system Purpose
binfmt_aout.c EXE Recognize and execute old-style a.out executables.
binfmt_elf.c EXE Recognize and execute new ELF executables
binfmt_java.c EXE Recognize and execute Java apps and applets
binfmt_script.c EXE Recognize and execute #!-style scripts
block_dev.c DEV Generic read(), write(), and fsync() functions for block devices.
buffer.c BUF The buffer cache, which caches blocks read from block devices.
dcache.c VFS The directory cache, which caches directory name lookups.
devices.c DEV Generic device support functions, such as registries.
dquot.c VFS Generic disk quota support.
exec.c VFSg Generic executable support. Calls functions in the binfmt_* files.
fcntl.c VFSg fcntl() handling.
fifo.c VFSg fifo handling.
file_table.c VFSg Dynamically-extensible list of open files on the system.
filesystems.c VFS All compiled-in filesystems are initialized from here by calling init_name_fs().
inode.c VFSg Dynamically-extensible list of open inodes on the system.
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/vfstour.html (1 di 6) [08/03/2001 10.10.54]

ioctl.c VFS First-stage handling for ioctl's; passes handling to the filesystem or device driver if necessary.
locks.c VFSg Support for fcntl() locking, flock() locking, and manadatory locking.
namei.c VFS Fills in the inode, given a pathname. Implements several name-related system calls.
noquot.c VFS No quotas: optimization to avoid #ifdef's in dquot.c
open.c VFS Lots of system calls including (surprise) open(), close(), and vhangup().
pipe.c VFSg Pipes.
read_write.c VFS read(), write(), readv(), writev(), lseek().
readdir.c VFS Several different interfaces for reading directories.
select.c VFS The guts of the select() system call
stat.c VFS stat() and readlink() support.
super.c VFS Superblock support, filesystem registry, mount()/umount().
Attaching a filesystem to the kernel

If you look at the code in any filesystem for init_name_fs(), you will find that it probably contains about one line of
code. For instance, in the ext2fs, it looks like this (from fs/ext2/super.c):
int init_ext2_fs(void)
{
return register_filesystem(&ext2_fs_type);
}
All it does is register the filesystem with the registry kept in fs/super.c. ext2_fs_type is a pretty simple structure:
static struct file_system_type ext2_fs_type = {

ext2_read_super, "ext2", 1, NULL
};
The ext2_read_super entry is a pointer to a function which allows a filesystem to be mounted (among other things; more
later). "ext2" is the name of the filesystem type, which is used (when you type mount ... -t ext2) to determine
which filesystem to use to mount a device. The 1 says that it needs a device to be mounted on (unlike the proc filesyste or a
network filesystem), and the NULL is required to fill up space that will be used to keep a linked list of filesystem types in the
filesystem registry, kept in (look it up in the table!) fs/super.c.
It's possible for a filesystem to support more than one type of filesystem. For instance, in fs/sysv/inode.c, three possible
filesystem types are supported by one filesystem, with this code:
static struct file_system_type sysv_fs_type[3] = {

{sysv_read_super, "xenix", 1, NULL},
{sysv_read_super, "sysv", 1, NULL},
{sysv_read_super, "coherent", 1, NULL}
};
int init_sysv_fs(void)
{
int i;
int ouch;
for (i = 0; i < 3; i++) {

if ((ouch = register_filesystem(&sysv_fs_type[i])) != 0)
return ouch;
}

return ouch;
}
Connecting the filesystem to a disk

The rest of the communication between the filesystem code and the kernel doesn't happen until a device bearing that type of
file system is mounted. When you mount a device containing an ext2 file system, ext2_read_super() is called. If it
succeeds in reading the superblock and is able to mount the filesystem, it fills in the super_block structure with
information that includes a pointer to a structure called super_operations, which contains pointers to functions which do
common operations related to superblocks; in this case, pointers to functions specific to ext2.
A superblock is the block that defines an entire filesystem on a device. It is sometimes mythical, as in the case of the DOS
filesystem--that is, the filesystem may or may not actually have a block on disk that is the real superblock. If not, it has to
make something up. Operations that pertain to the filesystem as a whole (as opposed to individual files) are considered
superblock operations. The super_operations structure contains pointers to functions which manipulate inodes, the
superblock, and which refer to or change the status of the filesystem as a whole (statfs() and remount()).
You have probably noticed that there are a lot of pointers, and especially pointers to functions, here. The good news is that all
the messy pointer work is done; that's the VFS's job. All the author for the filesystem needs to do is fill in (usually static)
structures with pointers to functions, and pass pointers to those structures back to the VFS so it can get at the filesystem and
the files.
For example, the super_operations structure looks like this (from <linux/fs.h>):
struct super_operations {
void (*read_inode) (struct inode *);
int (*notify_change) (struct inode *, struct iattr *);
void (*write_inode) (struct inode *);
void (*put_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
void (*statfs) (struct super_block *, struct statfs *, int);
int (*remount_fs) (struct super_block *, int *, char *);
};
That's the VFS part. Here's the much simpler declaration of the ext2 instance of that structure, in fs/ext2/super.c:
static struct super_operations ext2_sops = {

ext2_read_inode,
NULL,
ext2_write_inode,
ext2_put_inode,
ext2_put_super,
ext2_write_super,
ext2_statfs,
ext2_remount
};
First, notice that an unneeded entry has simply been set to NULL. That's pretty normal Linux behavior; whenever there is a
sensible default behavior of a function pointer, and that sensible default is what you want, you are almost sure to be able to
provide a NULL pointer and get the default painlessly. Second, notice how simple and clean the declaration is. All the painful
stuff like sb->s_op->write_super(sb); s hidden in the VFS implementation.
The details of how the filesystem actually reads and writes the blocks, including the superblock, from and to the disk will be
covered in a different section. There will actually be (I hope) two descriptions--a simple, functional one in a section on how to
write filesystems, and a more detailed one in a tour through the buffer cache. For now, assume that it is done by magic...

Mounting a filesystem
When a filesystem is mounted (which file is in charge of mounting a filesystem? Look at the table above, and find that it is
fs/super.c. You might want to follow along in fs/super.c), do_umount() calls read_super, which ends up calling (in the case
of the ext2 filesystem), ext2_read_super(), which returns the superblock. That superblock includes a pointer to that
structure of pointers to functions that we see in the definition of ext2_sops above. It also includes a lot of other data; you
can look at the definition of struct super_block in include/linux/fs.h if you like.
Finding a file
Once a filesystem is mounted, it is possible to access files on that filesystem. There are two main steps here: looking up the
name to find what inode it points to, and then accessing the inode.
When the VFS is looking at a name, it includes a path. Unless the filename is absolute (it starts with a / character), it is
relative to the current directory of the process that made the system call that included a path. It uses filesystem-specific code to
look up files on the filesystems specified. It takes the path name one component (filename components are separated with /
characters) at a time, and looks it up. If it is a directory, then the next component is looked up in the directory returned by the
previous lookup. Every component which is looked up, whether it is a file or a directory, returns an inode number which
uniquely identifies it, and by which its contents are accessed.
If the file turns out to be a symbolic link to another file, then the VFS starts over with the new name which is retrieved from
the symbolic link. In order to prevent infinite recursion, there's a limit on the depth of symlinks; the kernel will only follow so
many symlinks in a row before giving up.
When the VFS and the filesystem together have resolved a name into an inode number (that's the namei() function in
namei.c), then the inode can be accessed. The iget() function finds and returns the inode specified by an inode number.
The iput() function is later used to release access to the inode. It is kind of like malloc() and free(), except that more
than one process may hold an inode open at once, and a reference count is maintained to know when it's free and when it's not.
The integer file handle which is passed back to the application code is an offset into a file table for that process. That file table
slot holds the inode number that was looked up with the namei() function until the file is closed or the process terminates.
So whenever a process does anything to a ``file'' using a file handle, it is really manipulating the inode in question.
inode Operations
That inode number and inode structure have to come from somewhere, and the VFS can't make them up on it's own. They
have to come from the filesystem. So how does the VFS look up the name in the filesystem and get an inode back?
It starts at the beginning of the path name and looks up the inode of the first directory in the path. Then it uses that inode to
look up the next directory in the path. When it reachs the end, it has found the inode of the file or directory it is trying to look
up. But since it needs an inode to get started, how does it get started with the first lookup? There is an inode pointer kept in
the superblock called s_mounted which points at an inode structure for the filesystem. This inode is allocated when the
filesystem is mounted and de-allocated when the filesystem is unmounted. Normally, as in the ext2 filesystem, the
s_mounted inode is the inode of the root directory of the filesystem. From there, all the other inodes can be looked up.
Each inode includes a pointer to a structure of pointers to functions. Sound familiar? This is the inode_operations
structure. One of the elements of that structure is called lookup(), and it is used to look up another inode on the same
filesystem. In general, a filesystem has only one lookup() function that is the same in every inode on the filesystem, but it is
possible to have several different lookup() functions and assign them as appropriate for the filesystem; the proc filesystem
does this because different directories in the proc filesystem have different purposes. The inode_operations structure
looks like this (defined, like most everything we are looking at, in <linux/fs.h>):
struct inode_operations {
struct file_operations * default_file_ops;
int (*create) (struct inode *,const char *,int,int,struct inode **);
int (*lookup) (struct inode *,const char *,int,struct inode **);
int (*link) (struct inode *,struct inode *,const char *,int);

int (*unlink) (struct inode *,const char *,int);

int (*symlink) (struct inode *,const char *,int,const char *);
int (*mkdir) (struct inode *,const char *,int,int);
int (*rmdir) (struct inode *,const char *,int);
int (*mknod) (struct inode *,const char *,int,int,int);
int (*rename) (struct inode *,const char *,int,struct inode *,const char
*,int);
int (*readlink) (struct inode *,char *,int);
int (*follow_link) (struct inode *,struct inode *,int,int,struct inode
**);
int (*readpage) (struct inode *, struct page *);
int (*writepage) (struct inode *, struct page *);
int (*bmap) (struct inode *,int);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int);
int (*smap) (struct inode *,int);
};
Most of these functions map directly to system calls.
In the ext2 filesystem, directories, files, and symlinks have different inode_operations (this is normal). The file
fs/ext2/dir.c contains ext2_dir_inode_operations, the file fs/ext2/file.c contains
ext2_file_inode_operations, and the file fs/ext2/symlink.c contains ext2_symlink_inode_operations.
There are many system calls related to files (and directories) which aren't accounted for in the inode_operations
structure; those are found in the file_operations structure. The file_operations structure is the same one used
when writing device drivers and contains operations that work specifically on files, rather than inodes:
struct file_operations {
int (*lseek) (struct inode *, struct file *, off_t, int);
int (*read) (struct inode *, struct file *, char *, int);
int (*write) (struct inode *, struct file *, const char *, int);
int (*readdir) (struct inode *, struct file *, void *, filldir_t);
int (*select) (struct inode *, struct file *, int, select_table *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
void (*release) (struct inode *, struct file *);
int (*fsync) (struct inode *, struct file *);
int (*fasync) (struct inode *, struct file *, int);
int (*check_media_change) (kdev_t dev);
int (*revalidate) (kdev_t dev);
};
There are also a few functions which aren't directly related to system calls--and where they don't apply, they can simply be set
to NULL.
Summary
The role of the VFS is:
● Keep track of available filesystem types.
● Associate (and disassociate) devices with instances of the appropriate filesystem.
● Do any reasonable generic processing for operations involving files.
● When filesystem-specific operations become necessary, vector them to the filesystem in charge of the file, directory, or
inode in question.

The interaction between the VFS and specific filesystem types occurs through two main data structures, the super_block
structure and the inode structure, and their associated data structures, including super_operations,
inode_operations, file_operations, and others, which are kept in the include file <linux/fs.h>.
Therefore, the role of a specific filesystem code is to provide a superblock for each filesystem mounted and a unique inode for
each file on the filesystem, and to provide code to carry out actions specific to filesystems and files that are requested by
system calls and sorted out by the VFS.
Messages
1. A couple of comments and corrections by Jeremy Fitzhardinge

A couple of comments and corrections

Forum: A tour of the Linux VFS
Keywords: comments, correction, implementation details
Date: Thu, 23 May 1996 03:20:42 GMT
From: Jeremy Fitzhardinge <jeremy@zip.com.au>
A minor point: binfmt_java only deals with Java class files,

not JavaScript (which is unrelated to Java in all but name).
In general it looks OK, but it doesn't really talk about the

distinction between file operation and inode operations (or
the distinction between struct file and struct inode).
Inodes represent entities on disks: there's one (and only one)

for each file, directory, symlink, device and FIFO on the
filesystem.
When a user process does an open() syscall, it does a couple

of things:
- it looks up the inode for the pathname
- it puts the inode into the inode cache
- it allocates a new file structure
- it allocates a filedescriptor slot, and points it
it at the file structure
The number of the file descriptor slot is what is returned.
The distinction is subtle but important. Inodes have no

notion of "current offset": that's in the file structure,
so one inode can have many file structures, each at a
different offset and mode (RO/WO/RW/append). If you use
the dup() syscall, you can have multiple file descriptors
pointing to the same file structure (so they share their
offsets and open modes).
What this means for a filesystem is that you rarely need

to implement the open() file operation, unless you actually
care about specific file open and close operations. However,
you must always implement iget() operations.
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/vfstour/1.html (1 di 2) [08/03/2001 10.10.55]

I guess there's lots more that can go here:

- interaction with the VM system and page cache
- coping with dynamic filesystems (filesystems who's
contents change of their own accord)
- structure of a Unix filesystem (inodes/names/links)
- what order to do things in, and how the various
VFS calls interact
- documenting the args of the VFS interface functions
I know this is supposed to be documentation, but the kernel

source is a good place to look...
[Plug mode] A good way of finding out about how Linux

filesystems work is Userfs, which allows you to write user
processes which implement filesystems. It has a readme
which talks in some detail about how to implement filesystems,
which is somewhat applicable to writing kernel resident
filesystems too. It's available at
tsx-11.mit.edu:pub/linux/ALPHA/userfs. The structure of
the userfs kernel module is also a pretty good guide for
a skeletal filesystem.
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/vfstour/1.html (2 di 2) [08/03/2001 10.10.55]

Design and Implementation of the Second

Extended Filesystem
Rémy Card, Laboratoire MASI--Institut Blaise Pascal, E-Mail: card@masi.ibp.fr, and
Theodore Ts'o, Massachussets Institute of Technology, E-Mail: tytso@mit.edu, and
Stephen Tweedie, University of Edinburgh, E-Mail: sct@dcs.ed.ac.uk
Introduction
Linux is a Unix-like operating system, which runs on PC-386 computers. It was implemented first as
extension to the Minix operating system [Tanenbaum 1987] and its first versions included support for the
Minix filesystem only. The Minix filesystem contains two serious limitations: block addresses are stored
in 16 bit integers, thus the maximal filesystem size is restricted to 64 mega bytes, and directories contain
fixed-size entries and the maximal file name is 14 characters.
We have designed and implemented two new filesystems that are included in the standard Linux kernel.
These filesystems, called `Èxtended File System'' (Ext fs) and ``Second Extended File System'' (Ext2 fs)
raise the limitations and add new features.
In this paper, we describe the history of Linux filesystems. We briefly introduce the fundamental
concepts implemented in Unix filesystems. We present the implementation of the Virtual File System
layer in Linux and we detail the Second Extended File System kernel code and user mode tools. Last, we
present performance measurements made on Linux and BSD filesystems and we conclude with the
current status of Ext2fs and the future directions.
History of Linux filesystems

In its very early days, Linux was cross-developed under the Minix operating system. It was easier to
share disks between the two systems than to design a new filesystem, so Linus Torvalds decided to
implement support for the Minix filesystem in Linux. The Minix filesystem was an efficient and
relatively bug-free piece of software.
However, the restrictions in the design of the Minix filesystem were too limiting, so people started
thinking and working on the implementation of new filesystems in Linux.
In order to ease the addition of new filesystems into the Linux kernel, a Virtual File System (VFS) layer
was developed. The VFS layer was initially written by Chris Provenzano, and later rewritten by Linus
Torvalds before it was integrated into the Linux kernel. It is described in The Virtual File System.
After the integration of the VFS in the kernel, a new filesystem, called the `Èxtended File System'' was
implemented in April 1992 and added to Linux 0.96c. This new filesystem removed the two big Minix
limitations: its maximal size was 2 giga bytes and the maximal file name size was 255 characters. It was
an improvement over the Minix filesystem but some problems were still present in it. There was no
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/ext2intro.html (1 di 14) [08/03/2001 10.10.58]

support for the separate access, inode modification, and data modification timestamps. The filesystem
used linked lists to keep track of free blocks and inodes and this produced bad performances: as the
filesystem was used, the lists became unsorted and the filesystem became fragmented.
As a response to these problems, two new filesytems were released in Alpha version in January 1993: the
Xia filesystem and the Second Extended File System. The Xia filesystem was heavily based on the Minix
filesystem kernel code and only added a few improvements over this filesystem. Basically, it provided
long file names, support for bigger partitions and support for the three timestamps. On the other hand,
Ext2fs was based on the Extfs code with many reorganizations and many improvements. It had been
designed with evolution in mind and contained space for future improvements. It will be described with
more details in The Second Extended File System
When the two new filesystems were first released, they provided essentially the same features. Due to its
minimal design, Xia fs was more stable than Ext2fs. As the filesystems were used more widely, bugs
were fixed in Ext2fs and lots of improvements and new features were integrated. Ext2fs is now very
stable and has become the de-facto standard Linux filesystem.
This table contains a summary of the features provided by the different filesystems:
Minix FS Ext FS Ext2 FS Xia FS
Max FS size 64 MB 2 GB 4 TB 2 GB
Max file size 64 MB 2 GB 2 GB 64 MB
Max file name 16/30 c 255 c 255 c 248 c
3 times support No No Yes Yes
Extensible No No Yes No
Var. block size No No Yes No
Maintained Yes No Yes ?
Basic File System Concepts

Every Linux filesystem implements a basic set of common concepts derivated from the Unix operating
system [Bach 1986] files are represented by inodes, directories are simply files containing a list of entries
and devices can be accessed by requesting I/O on special files.
Inodes
Each file is represented by a structure, called an inode. Each inode contains the description of the file:
file type, access rights, owners, timestamps, size, pointers to data blocks. The addresses of data blocks
allocated to a file are stored in its inode. When a user requests an I/O operation on the file, the kernel
code converts the current offset to a block number, uses this number as an index in the block addresses
table and reads or writes the physical block. This figure represents the structure of an inode:

Directories
Directories are structured in a hierarchical tree. Each directory can contain files and subdirectories.
Directories are implemented as a special type of files. Actually, a directory is a file containing a list of
entries. Each entry contains an inode number and a file name. When a process uses a pathname, the
kernel code searchs in the directories to find the corresponding inode number. After the name has been
converted to an inode number, the inode is loaded into memory and is used by subsequent requests.
This figure represents a directory:
Links
Unix filesystems implement the concept of link. Several names can be associated with a inode. The inode

contains a field containing the number associated with the file. Adding a link simply consists in creating
a directory entry, where the inode number points to the inode, and in incrementing the links count in the
inode. When a link is deleted, i.e. when one uses the rm command to remove a filename, the kernel
decrements the links count and deallocates the inode if this count becomes zero.
This type of link is called a hard link and can only be used within a single filesystem: it is impossible to
create cross-filesystem hard links. Moreover, hard links can only point on files: a directory hard link
cannot be created to prevent the apparition of a cycle in the directory tree.
Another kind of links exists in most Unix filesystems. Symbolic links are simply files which contain a
filename. When the kernel encounters a symbolic link during a pathname to inode conversion, it replaces
the name of the link by its contents, i.e. the name of the target file, and restarts the pathname
interpretation. Since a symbolic link does not point to an inode, it is possible to create cross-filesystems
symbolic links. Symbolic links can point to any type of file, even on nonexistent files. Symbolic links are
very useful because they don't have the limitations associated to hard links. However, they use some disk
space, allocated for their inode and their data blocks, and cause an overhead in the pathname to inode
conversion because the kernel has to restart the name interpretation when it encounters a symbolic link.
Device special files
In Unix-like operating systems, devices can be accessed via special files. A device special file does not
use any space on the filesystem. It is only an access point to the device driver.
Two types of special files exist: character and block special files. The former allows I/O operations in
character mode while the later requires data to be written in block mode via the buffer cache functions.
When an I/O request is made on a special file, it is forwarded to a (pseudo) device driver. A special file is
referenced by a major number, which identifies the device type, and a minor number, which identifies the
unit.
The Virtual File System

Principle
The Linux kernel contains a Virtual File System layer which is used during system calls acting on files.
The VFS is an indirection layer which handles the file oriented system calls and calls the necessary
functions in the physical filesystem code to do the I/O.
This indirection mechanism is frequently used in Unix-like operating systems to ease the integration and
the use of several filesystem types [Kleiman 1986, Seltzer et al. 1993].
When a process issues a file oriented system call, the kernel calls a function contained in the VFS. This
function handles the structure independent manipulations and redirects the call to a function contained in
the physical filesystem code, which is responsible for handling the structure dependent operations.
Filesystem code uses the buffer cache functions to request I/O on devices. This scheme is illustrated in

this figure:
The VFS structure
The VFS defines a set of functions that every filesystem has to implement. This interface is made up of a
set of operations associated to three kinds of objects: filesystems, inodes, and open files.
The VFS knows about filesystem types supported in the kernel. It uses a table defined during the kernel
configuration. Each entry in this table describes a filesystem type: it contains the name of the filesystem
type and a pointer on a function called during the mount operation. When a filesystem is to be mounted,
the appropriate mount function is called. This function is responsible for reading the superblock from the
disk, initializing its internal variables, and returning a mounted filesystem descriptor to the VFS. After
the filesystem is mounted, the VFS functions can use this descriptor to access the physical filesystem
routines.
A mounted filesystem descriptor contains several kinds of data: informations that are common to every

filesystem types, pointers to functions provided by the physical filesystem kernel code, and private data
maintained by the physical filesystem code. The function pointers contained in the filesystem descriptors
allow the VFS to access the filesystem internal routines.
Two other types of descriptors are used by the VFS: an inode descriptor and an open file descriptor. Each
descriptor contains informations related to files in use and a set of operations provided by the physical
filesystem code. While the inode descriptor contains pointers to functions that can be used to act on any
file (e.g. create, unlink), the file descriptors contains pointer to functions which can only act on
open files (e.g. read, write).
The Second Extended File System

Motivations
The Second Extended File System has been designed and implemented to fix some problems present in
the first Extended File System. Our goal was to provide a powerful filesystem, which implements Unix
file semantics and offers advanced features.
Of course, we wanted to Ext2fs to have excellent performance. We also wanted to provide a very robust
filesystem in order to reduce the risk of data loss in intensive use. Last, but not least, Ext2fs had to
include provision for extensions to allow users to benefit from new features without reformatting their
filesystem.
``Standard'' Ext2fs features
The Ext2fs supports standard Unix file types: regular files, directories, device special files and symbolic
links.
Ext2fs is able to manage filesystems created on really big partitions. While the original kernel code
restricted the maximal filesystem size to 2 GB, recent work in the VFS layer have raised this limit to 4
TB. Thus, it is now possible to use big disks without the need of creating many partitions.
Ext2fs provides long file names. It uses variable length directory entries. The maximal file name size is
255 characters. This limit could be extended to 1012 if needed.
Ext2fs reserves some blocks for the super user (root). Normally, 5% of the blocks are reserved. This
allows the administrator to recover easily from situations where user processes fill up filesystems.
`Àdvanced'' Ext2fs features
In addition to the standard Unix features, Ext2fs supports some extensions which are not usually present
in Unix filesystems.
File attributes allow the users to modify the kernel behavior when acting on a set of files. One can set
attributes on a file or on a directory. In the later case, new files created in the directory inherit these
attributes.
BSD or System V Release 4 semantics can be selected at mount time. A mount option allows the
administrator to choose the file creation semantics. On a filesystem mounted with BSD semantics, files

are created with the same group id as their parent directory. System V semantics are a bit more complex:
if a directory has the setgid bit set, new files inherit the group id of the directory and subdirectories
inherit the group id and the setgid bit; in the other case, files and subdirectories are created with the
primary group id of the calling process.
BSD-like synchronous updates can be used in Ext2fs. A mount option allows the administrator to request
that metadata (inodes, bitmap blocks, indirect blocks and directory blocks) be written synchronously on
the disk when they are modified. This can be useful to maintain a strict metadata consistency but this
leads to poor performances. Actually, this feature is not normally used, since in addition to the
performance loss associated with using synchronous updates of the metadata, it can cause corruption in
the user data which will not be flagged by the filesystem checker.
Ext2fs allows the administrator to choose the logical block size when creating the filesystem. Block sizes
can typically be 1024, 2048 and 4096 bytes. Using big block sizes can speed up I/O since fewer I/O
requests, and thus fewer disk head seeks, need to be done to access a file. On the other hand, big blocks
waste more disk space: on the average, the last block allocated to a file is only half full, so as blocks get
bigger, more space is wasted in the last block of each file. In addition, most of the advantages of larger
block sizes are obtained by Ext2 filesystem's preallocation techniques (see section Performance
optimizations.
Ext2fs implements fast symbolic links. A fast symbolic link does not use any data block on the
filesystem. The target name is not stored in a data block but in the inode itself. This policy can save some
disk space (no data block needs to be allocated) and speeds up link operations (there is no need to read a
data block when accessing such a link). Of course, the space available in the inode is limited so not every
link can be implemented as a fast symbolic link. The maximal size of the target name in a fast symbolic
link is 60 characters. We plan to extend this scheme to small files in the near future.
Ext2fs keeps track of the filesystem state. A special field in the superblock is used by the kernel code to
indicate the status of the file system. When a filesystem is mounted in read/write mode, its state is set to
``Not Clean''. When it is unmounted or remounted in read-only mode, its state is reset to ``Clean''. At
boot time, the filesystem checker uses this information to decide if a filesystem must be checked. The
kernel code also records errors in this field. When an inconsistency is detected by the kernel code, the
filesystem is marked as `Èrroneous''. The filesystem checker tests this to force the check of the
filesystem regardless of its apparently clean state.
Always skipping filesystem checks may sometimes be dangerous, so Ext2fs provides two ways to force
checks at regular intervals. A mount counter is maintained in the superblock. Each time the filesystem is
mounted in read/write mode, this counter is incremented. When it reaches a maximal value (also
recorded in the superblock), the filesystem checker forces the check even if the filesystem is ``Clean''. A
last check time and a maximal check interval are also maintained in the superblock. These two fields
allow the administrator to request periodical checks. When the maximal check interval has been reached,
the checker ignores the filesystem state and forces a filesystem check. Ext2fs offers tools to tune the
filesystem behavior. The tune2fs program can be used to modify:
● the error behavior. When an inconsistency is detected by the kernel code, the filesystem is marked
as `Èrroneous'' and one of the three following actions can be done: continue normal execution,
remount the filesystem in read-only mode to avoid corrupting the filesystem, make the kernel
panic and reboot to run the filesystem checker.

● the maximal mount count.

● the maximal check interval.
● the number of logical blocks reserved for the super user.
Mount options can also be used to change the kernel error behavior.
An attribute allows the users to request secure deletion on files. When such a file is deleted, random data
is written in the disk blocks previously allocated to the file. This prevents malicious people from gaining
access to the previous content of the file by using a disk editor.
Last, new types of files inspired from the 4.4 BSD filesystem have recently been added to Ext2fs.
Immutable files can only be read: nobody can write or delete them. This can be used to protect sensitive
configuration files. Append-only files can be opened in write mode but data is always appended at the
end of the file. Like immutable files, they cannot be deleted or renamed. This is especially useful for log
files which can only grow.
Physical Structure
The physical structure of Ext2 filesystems has been strongly influenced by the layout of the BSD
filesystem [McKusick et al. 1984]. A filesystem is made up of block groups. Block groups are analogous
to BSD FFS's cylinder groups. However, block groups are not tied to the physical layout of the blocks on
the disk, since modern drives tend to be optimized for sequential access and hide their physical geometry
to the operating system.
The physical structure of a filesystem is represented in this table:
Boot Block Block ... Block
Sector Group 1 Group 2 ... Group N
Each block group contains a redundant copy of crucial filesystem control informations (superblock and
the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a
piece of the inode table, and data blocks). The structure of a block group is represented in this table:
Super FS Block Inode Inode Data
Block descriptors Bitmap Bitmap Table Blocks
Using block groups is a big win in terms of reliability: since the control structures are replicated in each
block group, it is easy to recover from a filesystem where the superblock has been corrupted. This
structure also helps to get good performances: by reducing the distance between the inode table and the
data blocks, it is possible to reduce the disk head seeks during I/O on files.
In Ext2fs, directories are managed as linked lists of variable length entries. Each entry contains the inode
number, the entry length, the file name and its length. By using variable length entries, it is possible to
implement long file names without wasting disk space in directories. The structure of a directory entry is
shown in this table:
inode number entry length name length filename

As an example, The next table represents the structure of a directory containing three files: file1,
long_file_name, and f2:
i1 16 05 file1
i2 40 14 long_file_name
i3 12 02 f2
Performance optimizations
The Ext2fs kernel code contains many performance optimizations, which tend to improve I/O speed
when reading and writing files.
Ext2fs takes advantage of the buffer cache management by performing readaheads: when a block has to
be read, the kernel code requests the I/O on several contiguous blocks. This way, it tries to ensure that the
next block to read will already be loaded into the buffer cache. Readaheads are normally performed
during sequential reads on files and Ext2fs extends them to directory reads, either explicit reads
(readdir(2) calls) or implicit ones (namei kernel directory lookup).
Ext2fs also contains many allocation optimizations. Block groups are used to cluster together related
inodes and data: the kernel code always tries to allocate data blocks for a file in the same group as its
inode. This is intended to reduce the disk head seeks made when the kernel reads an inode and its data
blocks.
When writing data to a file, Ext2fs preallocates up to 8 adjacent blocks when allocating a new block.
Preallocation hit rates are around 75% even on very full filesystems. This preallocation achieves good
write performances under heavy load. It also allows contiguous blocks to be allocated to files, thus it
speeds up the future sequential reads.
These two allocation optimizations produce a very good locality of:
● related files through block groups
● related blocks through the 8 bits clustering of block allocations.
The Ext2fs library

To allow user mode programs to manipulate the control structures of an Ext2 filesystem, the libext2fs
library was developed. This library provides routines which can be used to examine and modify the data
of an Ext2 filesystem, by accessing the filesystem directly through the physical device.
The Ext2fs library was designed to allow maximal code reuse through the use of software abstraction
techniques. For example, several different iterators are provided. A program can simply pass in a
function to ext2fs_block_interate(), which will be called for each block in an inode. Another
iterator function allows an user-provided function to be called for each file in a directory.
Many of the Ext2fs utilities (mke2fs, e2fsck, tune2fs, dumpe2fs, and debugfs) use the Ext2fs
library. This greatly simplifies the maintainance of these utilities, since any changes to reflect new
features in the Ext2 filesystem format need only be made in one place--in the Ext2fs library. This code

reuse also results in smaller binaries, since the Ext2fs library can be built as a shared library image.
Because the interfaces of the Ext2fs library are so abstract and general, new programs which require
direct access to the Ext2fs filesystem can very easily be written. For example, the Ext2fs library was used
during the port of the 4.4BSD dump and restore backup utilities. Very few changes were needed to adapt
these tools to Linux: only a few filesystem dependent functions had to be replaced by calls to the Ext2fs
library.
The Ext2fs library provides access to several classes of operations. The first class are the
filesystem-oriented operations. A program can open and close a filesystem, read and write the bitmaps,
and create a new filesystem on the disk. Functions are also available to manipulate the filesystem's bad
blocks list.
The second class of operations affect directories. A caller of the Ext2fs library can create and expand
directories, as well as add and remove directory entries. Functions are also provided to both resolve a
pathname to an inode number, and to determine a pathname of an inode given its inode number.
The final class of operations are oriented around inodes. It is possible to scan the inode table, read and
write inodes, and scan through all of the blocks in an inode. Allocation and deallocation routines are also
available and allow user mode programs to allocate and free blocks and inodes.
The Ext2fs tools

Powerful management tools have been developed for Ext2fs. These utilities are used to create, modify,
and correct any inconsistencies in Ext2 filesystems. The mke2fs program is used to initialize a partition
to contain an empty Ext2 filesystem.
The tune2fs program can be used to modify the filesystem parameters. As explained in section
`Àdvanced'' Ext2fs features, it can change the error behavior, the maximal mount count, the maximal
check interval, and the number of logical blocks reserved for the super user.
The most interesting tool is probably the filesystem checker. E2fsck is intended to repair filesystem
inconsistencies after an unclean shutdown of the system. The original version of e2fsck was based on
Linus Torvald's fsck program for the Minix filesystem. However, the current version of e2fsck was
rewritten from scratch, using the Ext2fs library, and is much faster and can correct more filesystem
inconsistencies than the original version.
The e2fsck program is designed to run as quickly as possible. Since filesystem checkers tend to be disk
bound, this was done by optimizing the algorithms used by e2fsck so that filesystem structures are not
repeatedly accessed from the disk. In addition, the order in which inodes and directories are checked are
sorted by block number to reduce the amount of time in disk seeks. Many of these ideas were originally
explored by [Bina and Emrath 1989] although they have since been further refined by the authors.
In pass 1, e2fsck iterates over all of the inodes in the filesystem and performs checks over each inode
as an unconnected object in the filesystem. That is, these checks do not require any cross-checks to other
filesystem objects. Examples of such checks include making sure the file mode is legal, and that all of the
blocks in the inode are valid block numbers. During pass 1, bitmaps indicating which blocks and inodes
are in use are compiled.

If e2fsck notices data blocks which are claimed by more than one inode, it invokes passes 1B through
1D to resolve these conflicts, either by cloning the shared blocks so that each inode has its own copy of
the shared block, or by deallocating one or more of the inodes.
Pass 1 takes the longest time to execute, since all of the inodes have to be read into memory and checked.
To reduce the I/O time necessary in future passes, critical filesystem information is cached in memory.
The most important example of this technique is the location on disk of all of the directory blocks on the
filesystem. This obviates the need to re-read the directory inodes structures during pass 2 to obtain this
information.
Pass 2 checks directories as unconnected objects. Since directory entries do not span disk blocks, each
directory block can be checked individually without reference to other directory blocks. This allows
e2fsck to sort all of the directory blocks by block number, and check directory blocks in ascending
order, thus decreasing disk seek time. The directory blocks are checked to make sure that the directory
entries are valid, and contain references to inode numbers which are in use (as determined by pass 1).
For the first directory block in each directory inode, the `.' and `..' entries are checked to make sure they
exist, and that the inode number for the `.' entry matches the current directory. (The inode number for the
`..' entry is not checked until pass 3.)
Pass 2 also caches information concerning the parent directory in which each directory is linked. (If a
directory is referenced by more than one directory, the second reference of the directory is treated as an
illegal hard link, and it is removed).
It is noteworthy to note that at the end of pass 2, nearly all of the disk I/O which e2fsck needs to
perform is complete. Information required by passes 3, 4 and 5 are cached in memory; hence, the
remaining passes of e2fsck are largely CPU bound, and take less than 5-10% of the total running time
of e2fsck.
In pass 3, the directory connectivity is checked. E2fsck traces the path of each directory back to the
root, using information that was cached during pass 2. At this time, the `..' entry for each directory is also
checked to make sure it is valid. Any directories which can not be traced back to the root are linked to the
/lost+found directory.
In pass 4, e2fsck checks the reference counts for all inodes, by iterating over all the inodes and
comparing the link counts (which were cached in pass 1) against internal counters computed during
passes 2 and 3. Any undeleted files with a zero link count is also linked to the /lost+found directory
during this pass.
Finally, in pass 5, e2fsck checks the validity of the filesystem summary information. It compares the
block and inode bitmaps which were constructed during the previous passes against the actual bitmaps on
the filesystem, and corrects the on-disk copies if necessary.
The filesystem debugger is another useful tool. Debugfs is a powerful program which can be used to
examine and change the state of a filesystem. Basically, it provides an interactive interface to the Ext2fs
library: commands typed by the user are translated into calls to the library routines.
Debugfs can be used to examine the internal structures of a filesystem, manually repair a corrupted
filesystem, or create test cases for e2fsck. Unfortunately, this program can be dangerous if it is used by

people who do not know what they are doing; it is very easy to destroy a filesystem with this tool. For
this reason, debugfs opens filesytems for read-only access by default. The user must explicitly specify
the -w flag in order to use debugfs to open a filesystem for read/wite access.
Performance Measurements
Description of the benchmarks
We have run benchmarks to measure filesystem performances. Benchmarks have been made on a
middle-end PC, based on a i486DX2 processor, using 16 MB of memory and two 420 MB IDE disks.
The tests were run on Ext2 fs and Xia fs (Linux 1.1.62) and on the BSD Fast filesystem in asynchronous
and synchronous mode (FreeBSD 2.0 Alpha--based on the 4.4BSD Lite distribution).
We have run two different benchmarks. The Bonnie benchmark tests I/O speed on a big file--the file size
was set to 60 MB during the tests. It writes data to the file using character based I/O, rewrites the
contents of the whole file, writes data using block based I/O, reads the file using character I/O and block
I/O, and seeks into the file. The Andrew Benchmark was developed at Carneggie Mellon University and
has been used at the University of Berkeley to benchmark BSD FFS and LFS. It runs in five phases: it
creates a directory hierarchy, makes a copy of the data, recursively examine the status of every file,
examine every byte of every file, and compile several of the files.
Results of the Bonnie benchmark
The results of the Bonnie benchmark are presented in this table:

Char Write Block Write Rewrite Char Read Block Read
(KB/s) (KB/s) (KB/s) (KB/s) (KB/s)
BSD Async 710 684 401 721 888
BSD Sync 699 677 400 710 878
Ext2 fs 452 1237 536 397 1033
Xia fs 440 704 380 366 895
The results are very good in block oriented I/O: Ext2 fs outperforms other filesystems. This is clearly a
benefit of the optimizations included in the allocation routines. Writes are fast because data is written in
cluster mode. Reads are fast because contiguous blocks have been allocated to the file. Thus there is no
head seek between two reads and the readahead optimizations can be fully used.
On the other hand, performance is better in the FreeBSD operating system in character oriented I/O. This
is probably due to the fact that FreeBSD and Linux do not use the same stdio routines in their respective
C libraries. It seems that FreeBSD has a more optimized character I/O library and its performance is
better.
Results of the Andrew benchmark
The results of the Andrew benchmark are presented in this table:

P1 Create P2 Copy P3 Stat P4 Grep P5 Compile

(ms) (ms) (ms) (ms) (ms)
BSD Async 2203 7391 6319 17466 75314
BSD Sync 2330 7732 6317 17499 75681
Ext2 fs 790 4791 7235 11685 63210
Xia fs 934 5402 8400 12912 66997
The results of the two first passes show that Linux benefits from its asynchronous metadata I/O. In passes
1 and 2, directories and files are created and BSD synchronously writes inodes and directory entries.
There is an anomaly, though: even in asynchronous mode, the performance under BSD is poor. We
suspect that the asynchronous support under FreeBSD is not fully implemented.
In pass 3, the Linux and BSD times are very similar. This is a big progress against the same benchmark
run six months ago. While BSD used to outperform Linux by a factor of 3 in this test, the addition of a
file name cache in the VFS has fixed this performance problem.
In passes 4 and 5, Linux is faster than FreeBSD mainly because it uses an unified buffer cache
management. The buffer cache space can grow when needed and use more memory than the one in
FreeBSD, which uses a fixed size buffer cache. Comparison of the Ext2fs and Xiafs results shows that
the optimizations included in Ext2fs are really useful: the performance gain between Ext2fs and Xiafs is
around 5-10%.
Conclusion
The Second Extended File System is probably the most widely used filesystem in the Linux community.
It provides standard Unix file semantics and advanced features. Moreover, thanks to the optimizations
included in the kernel code, it is robust and offers excellent performance.
Since Ext2fs has been designed with evolution in mind, it contains hooks that can be used to add new
features. Some people are working on extensions to the current filesystem: access control lists
conforming to the Posix semantics [IEEE 1992], undelete, and on-the-fly file compression.
Ext2fs was first developed and integrated in the Linux kernel and is now actively being ported to other
operating systems. An Ext2fs server running on top of the GNU Hurd has been implemented. People are
also working on an Ext2fs port in the LITES server, running on top of the Mach microkernel [Accetta et
al. 1986], and in the VSTa operating system. Last, but not least, Ext2fs is an important part of the Masix
operating system [Card et al. 1993], currently under development by one of the authors.
Acknowledgments
The Ext2fs kernel code and tools have been written mostly by the authors of this paper. Some other
people have also contributed to the development of Ext2fs either by suggesting new features or by
sending patches. We want to thank these contributors for their help.

References
[Accetta et al. 1986] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M.
Young. Mach: A New Kernel Foundation For UNIX Development. In Proceedings of the USENIX 1986
Summer Conference, June 1986.
[Bach 1986] M. Bach. The Design of the UNIX Operating System. Prentice Hall, 1986.
[Bina and Emrath 1989] E. Bina and P. Emrath. A Faster fsck for BSD Unix. In Proceedings of the
USENIX Winter Conference, January 1989.
[Card et al. 1993] R. Card, E. Commelin, S. Dayras, and F. Mével. The MASIX Multi-Server Operating
System. In OSF Workshop on Microkernel Technology for Distributed Systems, June 1993.
[IEEE 1992] SECURITY INTERFACE for the Portable Operating System Interface for Computer
Environments - Draft 13. Institute of Electrical and Electronics Engineers, Inc, 1992.
[Kleiman 1986] S. Kleiman. Vnodes: An Architecture for Multiple File System Types in Sun UNIX. In
Proceedings of the Summer USENIX Conference, pages 260--269, June 1986.
[McKusick et al. 1984] M. McKusick, W. Joy, S. Leffler, and R. Fabry. A Fast File System for UNIX.
ACM Transactions on Computer Systems, 2(3):181--197, August 1984.
[Seltzer et al. 1993] M. Seltzer, K. Bostic, M. McKusick, and C. Staelin. An Implementation of a
Log-Structured File System for UNIX. In Proceedings of the USENIX Winter Conference, January 1993.
[Tanenbaum 1987] A. Tanenbaum. Operating Systems: Design and Implementation. Prentice Hall, 1987.

Need /proc info
Need /proc info

Forum: Filesystems
Keywords: linux filesystem /proc
Date: Wed, 04 Jun 1997 15:37:10 GMT
From: Kai Xu <kai@caip.rutgers.edu>
Hi,
Where can I find the detailed write-ups about the /proc file system in Linux. I want to know how it
works and how it is implemented(not just the source code, I have it already)
Please reply to kai@caip.rutgers.edu , if you can.
Thanks.
Kai Xu
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/fs/12.html [08/03/2001 10.10.58]

Where to find libext2 sources?
Where to find libext2 sources?

Forum: Filesystems
Date: Fri, 09 May 1997 14:27:59 GMT
From: Mark Salter <marks@qms.com>
The subject says it all.
Messages
1. Nevermind... by Mark Salter

Nevermind...
Nevermind...
Forum: Filesystems
Re: Where to find libext2 sources? (Mark Salter)
Date: Fri, 09 May 1997 19:34:50 GMT
From: Mark Salter <unknown>
I found it. I didn't realize it was part of e2fsprogs.
http://ldp.iol.it/LDP/khg/HyperNews/get/fs/fs/11/1.html [08/03/2001 10.10.59]

New File System
New File System

Forum: Filesystems
Keywords: linux file system file_write memcpy fragment
Date: Fri, 18 Apr 1997 05:15:59 GMT
From: Vamsi Krishna <vamsi@cslab.uky.edu>
Hi,
I am implementing a new file system, in Linux. I have
borrowed lot of code from Minix for this purpose. I had a
problem while implementing the function for file_write. I had
to modify this function a lot for supporting fragments. I
tried to use all these functions ( memcpy_tofs, memcpy_fromfs,
memcpy, bcopy, memmove) to copy the contents of the last
block into one or fragments. When I used memcpy_tofs or
memcpy_fromfs, I got segmentation faults. And when I used
the other functions, it only used to copy the first fragment
properly and not the remaining ones. Could anyone please help
me in this regard.
Thank you Vamsi

Partition?
Partition?
Forum: Filesystems
Keywords: Partition
Date: Wed, 26 Mar 1997 03:05:49 GMT
From: Wilfredo Lugo Beauchamp <ak47@amadeus.upr.clu.edu>
Hi, I'm working in some Linux ext2 filesystem

applications and I need to read an inode. The problem is
that the function iget(.....) needs the superblock. Is it
possible get the superblock without the partition info? Are
there any functions that return the current partition?
Thanks for your attention
Wilfredo Lugo
Messages
1. Please be more specific.... by Theodore Ts'o

Please be more specific....
Please be more specific....

Forum: Filesystems
Re: Partition? (Wilfredo Lugo Beauchamp)
Keywords: ext2 filesystem
Date: Wed, 26 Mar 1997 04:28:53 GMT
It's not clear from your question whether you are trying to write a user mode application, or trying to
write kernel code. Parts of your question imply that you're writing user-mode code, but iget() is a
kernel routine which isn't available to user-mode programs.
Why don't you be a bit more specific about what you're trying to do, and perhaps we can help you out.
Assuming that you're writing a user-mode application, are you trying to read the filesystem directly
using the device file, and using direct I/O to the device? Or are you trying to get some information
from a filesystem that is already mounted?
Why do you need to read an inode? What are you trying to do with it?

Need documentation on userfs implementation
Need documentation on userfs implementation

Forum: Filesystems
Keywords: userfs ftpfs
Date: Thu, 13 Feb 1997 10:08:41 GMT
From: Natchu Vishnu Priya <vishnu@cs.iitm.ernet.in>
I need documentation on userfs, in linux.. I seem to be able to find only an alpha version of userfs
available and the ftpfs in that does not work.
A list of function, etc.. on the lines of those on the vfs, might be helpful.
-vishnu

ext2fs tools
ext2fs tools
Forum: Filesystems
Keywords: ext2fs tools
Date: Wed, 05 Feb 1997 00:42:12 GMT
From: Wilfredo Lugo Beauchamp <ak47@amadeus.upr.clu.edu>
Hi, I would like to know where I can find the set of

tools e2fsprogs. I'm working on an undelete for Linux for
my Operating Systems course. I read a document prepared by
Mr. Linus Tauro of the University of California and he used
these tools.
Wilfredo Lugo
Messages
1. Where to find e2fsprogs by Theodore Ts'o

Where to find e2fsprogs
Where to find e2fsprogs

Forum: Filesystems
Re: ext2fs tools (Wilfredo Lugo Beauchamp)
Keywords: ext2fs tools
Date: Wed, 05 Feb 1997 15:46:52 GMT
The website for for the e2fsprogs package can be found at
http://web.mit.edu/tytso/www/linux/e2fsprogs.html
You can also ftp it from tsx-11.mit.edu, in the /pub/linux/packages/ext2fs directory.

libext2fs documentation
Forum: Filesystems
Keywords: libext2fs filesystem
Date: Wed, 15 Jan 1997 18:07:22 GMT
From: James Beckett <jmb@isltd.insignia.com>
After a repartition and (win95) reformat I find I didn't save away all the data I wanted from an ext2 fs,
so I've spent a morning grovelling through the source and figuring out the structure. (I think I can get
the data back, only the first block group got overwritten by format)
Now I find that libext2fs exists.. is there any documentation on how to use it, and how much does it
depend on the filesystem being intact? Can it be told to use a backup superblock? I discovered that
mount(8) can be given an option to do so, but the utilities (e2fsck, debugfs etc) don't seem to, so is it
some limitation of libext2fs?
Messages
1. libext2fs documentation by Theodore Ts'o

Forum: Filesystems
Re: libext2fs documentation (James Beckett)
Keywords: libext2fs filesystem
Date: Wed, 22 Jan 1997 23:20:07 GMT
No, there currently isnt any documentation on the libext2fs library. The library is relatively well
structured internally, and so most people who have looked at it haven't had *too* much trouble
figuring it out.
It would be nice to have some documentation on it, though, and I am soliciting volunteers who would
be willing to do a first pass documentation on it; I'm definitely willing to work with someone who is
interested in doing that sort of tech writing.
As for your question, you can absolutely tell it to use a backup superblock, just take a look at the
source code for the function signature for the ext2fs_open() function. One of the arguments is
"superblock", and that's the block number for the superblock. You're right that debugfs currently
doesn't have a method for opening the filesystem with a backup superblock. E2fsck most certainly
does have a way to do this, though, and it's documented in the man page. Try using "e2fsck -b 8193".

proc filesystem
proc filesystem
Forum: Filesystems
Keywords: proc filesystem
Date: Fri, 18 Oct 1996 21:18:40 GMT
From: Praveen Krishnan <praveen@kurinji.iitm.ernet.in>
Hello,
Isn't there any documentation on the /proc filesystem ? There was a chapter on it in the earlier versions
of KHG but i dont see it here. Maybe i've missed it. if so ,could someone please direct me to where i
would be able to get some programming related info on the /proc filesystem.
Thanx a lot
- praveen
Messages
1. man proc by Michael K. Johnson

man proc
man proc
Forum: Filesystems
Re: proc filesystem (Praveen Krishnan)
Keywords: proc filesystem
Date: Thu, 24 Oct 1996 23:20:10 GMT
man proc
The proc chapter was removed because the man page was more complete and more up-to-date.

Need NFS documentation
Need NFS documentation

Forum: Filesystems
Keywords: NFS
Date: Fri, 27 Sep 1996 11:22:14 GMT
From: Ermelindo Mauriello <ermmau@ikonos.dia.unisa.it>
I'm working on a Transparent Cryptographic Filesystem for Linux based on the NFS concept.
Documentation about this filesystem can be found ad mikonos.dia.unisa.it or www.globenet.it.
Actually this filesystem is out of the kernel but the project is to push it into the system using nfs
module. So I need documentation about NFS module. Where can I find it ? Thanks in advance to all
will help me. ermmau@mikonos.dia.unisa.it

Even more ext2 documentation!
Even more ext2 documentation!

Forum: Filesystems
Keywords: info
Date: Wed, 12 Jun 1996 16:51:45 GMT
Even more documentation on ext2fs is available. The ext2ed (available via ftp from tsx-11.mit.edu in
/pub/linux/packages/ext2fs) contains a set of detailed papers on ext2fs, including an overview, a
design document, and a users guide for ext2ed.
Also, an Analysis of the Ext2fs structure is available.

More ext2 documentation
More ext2 documentation

Forum: Filesystems
Keywords: ext2fs
Date: Mon, 03 Jun 1996 22:15:17 GMT
Remy Card recently announced that he has made postscript versions of the slides which he prepared
for the 3rd International Linux Conference in Berlin available for ftp at
tsx-11.mit.edu:/pub/linux/packages/ext2fs/slides/berlin96
One talk was on quota management for ext2fs, and the other was on the implementation of POSIX.6
ACL's for ext2fs. Four sets of slides are available: two-up slides on quotas, one-up slides on quotas,
two-up slides on ACL's, and one-up slides on ACL's.

Ext2 paper
Ext2 paper
Forum: Filesystems
Date: Wed, 29 May 1996 21:02:45 GMT
At one point, the ext2 paper which Remy, Stephen and I wrote was supposed to be going into the
KHG. It was written for the Amsterdam Linux conference 1-2 years ago, but we got copyright
clearance so that it could be included in the KHG. However, it seems that it never did get included into
the KHG. Does anyone know what happened with that? I no longer have a copy of the original TeX,
but Remy (as the primary author) should. I think it would be a very valuable addition to the KHG.
Messages
1. Done. by Michael K. Johnson

Done.
Done.
Forum: Filesystems
Re: Ext2 paper (Theodore Ts'o)
Date: Wed, 12 Jun 1996 16:11:41 GMT
See above! Thanks, Ted et. al.


The Linux Cache Flush Architecture
David Miller wrote this document explaining how Linux tries to flush caches optimally, and more
importantly, how people porting Linux can write code to use the architecture.
This chapter is rather old; it was originally written when Linux was only a year old, and was
updated a year later. The first section, on Linux's memory management code, is out of date by
now, but may still provide some sort of understanding of the basic structure that will help you
navigate through more recent kernels. The second section, an overview of 80386 memory
management, is still mostly applicable; there are a few assumptions that should not get in your way
in general.
80386 Memory Management
Linux's memory management was originally conceived for Intel's 80386 processor, which has
fairly rich and relatively easy-to-use memory management features. During the port to the Alpha,
Linux's memory management was abstracted in a way that has been successfully applied to many
different processors, including the memory management units (MMU's) that are supplied with the
386, Alpha, Sparc (there are several different MMUs for the Sparc; most are supported), Motorola
68K, PowerPC, ARM, and MIPS CPUs.
Copyright (C) 1996 Michael K. Johnson, johnsonm@redhat.com
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/memory.html [08/03/2001 10.11.04]

1. The Players
The TLB.
This is more of a virtual entity than a strict model as far as the Linux flush architecture is concerned. The only
characteristics it has is:
1. It keeps track of process/kernel mappings in some way, whether in software or hardware.
2. Architecture specific code may need to be notified when the kernel has changed a process/kernel mapping.
The cache.
This entity is essentially "memory state" as the flush architecture views it. In general it has the following properties:
1. It will always hold copies of data which will be viewed as uptodate by the local processor.
2. Its proper functioning may be related to the TLB and process/kernel page mappings in some way, that is to
say they may depend upon each other.
3. It may, in a virtually cached configuration, cause aliasing problems if one physical page is mapped at the
same time to two virtual pages, and due to to the bits of an address used to index the cache line, the same
piece of data can end up residing in the cache twice, allowing inconsistancies to result.
4. Devices and DMA may or may not be able to see the most up to date copy of a piece of data which resides in
the cache of the local processor.
5. Currently, it is assumed that coherence in a multiprocessor environment is maintained by the cache/memory
subsystem. That is to say, when one processor requests a datum on the memory bus and another processor has
a more uptodate copy, by whatever means the requestor will get the uptodate copy owned by the other
processor.
(NOTE: SMP architectures without hardware cache coherence mechanisms are indeed possible, the current flush
architecture does not handle this currently. If at at some point a Linux port to some system where this is an issue occurrs, I
will add the necessary hooks. But it will not be pretty.)
2. What the flush architecture cares about

1. At all times the memory management hardware's view of a set of process/kernel mappings will be consistant with
that of the kernel page tables.
2. If the memory management kernel code makes a modification to a user process page, by modifying the data via the
kernel-space alias of the underlying physical page, the user thread of control will see the right data before it is
allowed to continue execution, regardless of the cache architecture and/or semantics.
3. In general, when address space state is changed (on the generic kernel memory management code's behalf only) the
appropriate flush architecture hook will be called describing that state change in full.
3. What the flush architecture does not care about

1. DMA/Driver coherency. This includes DMA mappings (in the sense of MMU mappings) and cache/DMA datum
consistency. These sorts of issues have no buisness in the flush architecture, see below how they should be handled.
2. Split Instruction/Data cache consistancy with respect to modifications made to the process instruction space
performed by the signal dispatch code. Again see below on how this should be handled in another way.
4. The interfaces for the flush architecture and how to implement them
In general all of the routines described below will be called with the following sequence:
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/flush.html (1 di 7) [08/03/2001 10.11.05]

flush_cache_foo(...);
modify_address_space();
flush_tlb_foo(...);
The logic here is:
1. It may be illegal in a given architecture for a piece of cache data to exist when no mapping for that data exists,
therefore the flush must occur before the change is made.
2. It is possible for a given MMU/TLB architecture to perform a hardware table walk of the kernel page tables.
Therefore the TLB flush is done after the page tables have been changed so that afterwards the hardware can only
load in the new copy of the page table information to the TLB.
void flush_cache_all(void);
void flush_tlb_all(void);
These routines are to notify the architecture specific code that a change has been made to the kernel address space
mappings, which means that the mappings of every process has effectively changed.
An implementation shall:
1. Eliminate all cache entries which are valid at this point in time when flush_cache_all is invoked. This applies
to virtual cache architectures. If the cache is write-back in nature, this routine shall commit the cache data to
memory before invalidating each entry. For physical caches, no action need be performed since physical mappings
have no bearing on address space translations.
2. For flush_tlb_all, all TLB mappings for the kernel address space should be made consistant with the OS page
tables by whatever means necessary. Note that with an architecture that possesses the notion of "MMU/TLB
contexts" it may be necessary to perform this synchronization in every "active" MMU/TLB context.
void flush_cache_mm(struct mm_struct *mm);

void flush_tlb_mm(struct mm_struct *mm);
These routines notify the system that the entire address space described by the mm_struct passed is changing. Please
take note of two things in particular:
1. The mm_struct is the unit of mmu/tlb real estate as far as the flush architecture is concerned. In particular, an
mm_struct may map to one or many tasks or none!
2. This "address space" change is considered to be occurring in user space only. It is therefore safe for code to avoid
flushing kernel tlb/cache entries if that is possible for efficiency.
1. For flush_cache_mm, whatever entries could exist in a virtual cache for the address space described by
mm_struct are to be invalidated.
2. For flush_tlb_mm, the tlb/mmu hardware is to be placed in a state where it will see the (now current) kernel
page table entries for the address space described by the mm_struct.
flush_cache_range(struct mm_struct *mm, unsigned long start,

unsigned long end);
flush_tlb_range(struct mm_struct *mm, unsigned long start,
unsigned long end);
A change to a particular range of user addresses in the address space described by the mm_struct passed is occurring.
The two notes above for flush_*_mm() concerning the mm_struct passed apply here as well.
1. For flush_cache_range, on a virtually cached system, all cache entries which are valid for the range start to
end in the address space described by the mm_struct are to be invalidated.
2. For flush_tlb_range, whatever actions necessary to cause the MMU/TLB hardware to not contain stale

translations are to be performed. This means that whatever translations are in the kernel page tables in the range
start to end in the address space described by the mm_struct are to be what the memory mangement hardware
will see from this point forward, by whatever means.
void flush_cache_page(struct vm_area_struct *vma, unsigned long address);

void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
A change to a single page at address within user space to the address space described by the vm_area_struct
passed is occurring. An implementation, if need be, can get at the assosciated mm_struct for this address space via
vma->vm_mm. The VMA is passed for convenience so that an implementation can inspect vma->vm_flags. This way
in an implementation where the instruction and data spaces are not unified, one can check to see if VM_EXEC is set in
vma->vm_flags to possibly avoid flushing the instruction space, for example.
The two notes above for flush_*_mm() concerning the mm_struct (passed indirectly via vma->vm_mm) apply here
as well.
1. For flush_cache_range, on a virtually cached system, all cache entries which are valid for the page at
address in the address space described by the VMA are to be invalidated.
2. For flush_tlb_range, whatever actions necessary to cause the MMU/TLB hardware to not contain stale
translations are to be performed. This means that whatever translations are in the kernel page tables for the page at
address in the address space described by the VMA passed are to be what the memory mangement hardware will
see from this point forward, by whatever means.
void flush_page_to_ram(unsigned long page);

This is the ugly duckling. But its semantics are necessary on so many architectures that I needed to add it to the flush
architecture for Linux.
Briefly, when (as one example) the kernel services a COW fault, it uses the aliased mappings of all physical memory in
kernel space to perform the copy of the page in question to a new page. This presents a problem for virtually indexed
caches which are write-back in nature. In this case, the kernel touches two physical pages in kernel space. The code
sequence being described here essentially looks like:
do_wp_page()
{
[ ... ]
copy_cow_page(old_page,new_page);
flush_page_to_ram(old_page);
flush_page_to_ram(new_page);
flush_cache_page(vma, address);
modify_address_space();
free_page(old_page);
flush_tlb_page(vma, address);
[ ... ]
}
(Some of the actual code has been simplified for example purposes.)
Consider a virtually indexed cache which is write-back. At the point in time at which the copy of the page occurs to the
kernel space aliases, it is possible for the user space view of the original page to be in the caches (at the user's address, ie.
where the fault is occurring). The page copy can bring this data (for the old page) into the caches. It will also place the
data (at the new kernel aliased mapping of the page) being copied to into the cache, and for write back caches this data
will be dirty or modified in the cache.
In such a case main memory will not see the most recent copy of the data. The caches are stupid, so for the new page we

are giving to the user, without forcing the cached data at the kernel alias to main memory the process will see the old
contents of the page (ie. whatever garbage was there before the copy done by COW processing above).
A concrete example of what was just described:
Consider a process which shares a page, read-only with another task (or many) at virtual address 0x2000 in user space.
And for example purposes let us say that this virtual address maps to physical page 0x14000.
Virtual Pages
task 1 --------------
| 0x00000000 |
--------------
| 0x00001000 | Physical Pages
-------------- --------------
| 0x00002000 | --\ | 0x00000000 |
-------------- \ --------------
\ | ... |
task 2 -------------- \ --------------
| 0x00000000 | |----> | 0x00014000 |
-------------- / --------------
| 0x00001000 | / | ... |
-------------- / --------------
| 0x00002000 | --/
--------------
If task 2 tries to write to the read-only page at address 0x2000 we will get a fault and eventually end up at the code
fragment shown above in do_wp_page().
The kernel will get a new page for task2, let us say this is physical page 0x26000, and let us also say that the kernel alias
mappings for physical pages 0x14000 and 0x26000 can reside in the two unique cache lines at the same time based upon
the line indexing scheme of this cache.
The page contents get copied from the kernel mappings for physical page 0x14000 to the ones for physical page 0x26000.
At this point in time, on a write-back virtually indexed cache architecture we have a potential inconsistancy. The new data
copied into physical page 0x26000 is not necessary in main memory at this point, in fact it could be all in the cache only at
the kernel alias of the physical address. Also, the (non-modified, ie. clean) data for the original (old) page is in the cache at
the kernel alias for physical page 0x14000, this can produce an inconsistancy later on, so to be safe it is best to be
eliminate the cached copies of this data as well.
Let us say we did not write back the data for the page at 0x26000 and we let it just stay there. We would return to task 2
(who has this new page now mapped in at virtual address 0x2000), he would complete his write, then he would read some
other piece of data in this new page (i.e. expecting the contents that existed there beforehand). At this point in time if the
data is left in the cache at the kernel alias for the new physical page, the user will get whatever was in main memory
before the copy for his read. This can lead to disasterous results.
Therefore an architecture shall:
On virtually indexed cache architectures, do whatever is necessary to make main memory consistant with the
cached copy of the kernel space page passed.
NOTE: It is actually necessary for this routine to invalidate lines in a virtual cache which is not write-back in nature. To
see why this is really necessary, replay the above example with task 1 and 2, but this time fork() yet another task 3
before the COW faults occur, consider the contents of the caches in both kernel and user space if the following sequence
occurrs in exact succession:
1. task 1 reads some the page at 0x2000
2. task 2 COW faults the page at 0x2000

3. task 2 performs his writes to the new page at 0x2000

4. task 3 COW faults the page at 0x2000
Even on a non-writeback virtually indexed cache, task 3 can see inconsistant data after the COW fault if
flush_page_to_ram does not invalidate the kernel aliased physical page from the cache.
void update_mmu_cache(struct vm_area_struct *vma,

unsigned long address, pte_t pte);
Although not strictly part of the flush architecture, on certain architectures some critical operations and checks need to be
performed here for things to work out properly and for the system to remain consistant.
In particular, for virtually indexed caches this routine must check to see that the new mapping being added by the current
page fault does not add an "bad alias" to user space.
A "bad alias" is defined as two or more mappings (at least one of which is writable) to two or more virtual pages which all
translate to the same exact physical page, and due to the indexing algorithm of the cache can also reside in unique and
mutually exclusive cache lines.
If such a "bad alias" is detected an implementation needs to resolve this inconsistancy some how, one solution is to walk
through all of the mappings and change the page tables to make these pages as "non-cacheable" if the hardware allows
such a thing.
The checks for this are very simple, all an implementation needs to do essentially is:
if((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))

check_for_potential_bad_aliases();
So for the common case (shared writable mappings are extremely rare) only one comparison is needed for systems with
virtually indexed caches.
5. Implications for SMP

Depending upon the architecture certain amends may be needed to allow the flush architecture to work on an SMP system.
The main concern is whether one of the above flush operations cause the entire system to be globally see the flush, or the
flush is only guarenteed to be seen by the local processor.
In the latter case a cross calling mechanism is needed. The current two SMP systems supported under Linux (Intel and
Sparc) use inter-processor interrupts to "broadcast" the flush operation and cause it to run locally on all processors if
necessary.
As an example, on sun4m Sparc systems all processers in the system must execute the flush request to guarentee
consistancy across the entire system. However, on sun4d Sparc machines, TLB flushes performed on the local processor
are broadcast over the system bus by the hardware and therefore a cross call is not necessary.
6. Implications for context based MMU/CACHE architectures

The entire idea behind the concept of MMU and cache context facilities is to allow many address spaces to share the
cache/mmu resources on the cpu.
To take full advantage of such a facility, and still maintain coherency as described above, requires some extra
consideration from the implementor.
The issues involved will vary greatly from one implementation to another, at least this has been the experience of the
author. But in particular some of the issues are likely to be:
1. The relationship of kernel space mappings to user space ones, as far as contexts are concerned. On some systems

kernel mappings have a "global" attribute, in that the hardware does not concern itself with context information
when a translation is made which has this attribute. Therefore one flush (in any context) of a kernel cache/mmu
mapping could be sufficient.
However it is possible in other implementations for the kernel to share the context key assosciated with a particular
address space. It may be necessary in such a case to walk into all contexts which are currently valid and perform the
complete flush in each one for a kernel address space flush.
2. The cost of per-context flushes can become a key issue, especially with respect to the TLB. For example, if a tlb
flush is needed on a large range of addresses (or an entire address space) it may be more prudent to allocate and
assign a new mmu context to this process for the sake of efficiency.
7. How to handle what the flush architecture does not do, with examples
The flush architecture just described make no amends for device/DMA coherency with cached data. It also has no
provisions for any mapping strategies necessary for DMA and devices should that be necessary on a certain machine
Linux is ported to. Such issues are none of the flush architectures buisness.
Such issues are most cleanly dealt with at the device driver level. The author is convinced of this after his experiance with
a common set of Sparc device drivers which needed to all function correctly on more than a handfull of cache/mmu and
bus architecrures in the same kernel.
In fact this implementation is more efficient because the driver knows exactly when DMA needs to see consistant data or
when DMA is going to create an inconsistancy which must be resolved. Any attempt to reach this level of efficiency via
hooks added to the generic kernel memory management code would be complex and if anything very unclean.
As an example, consider on the Sparc how DMA buffers are handled. When a device driver must perform DMA to/from
either a single buffer or a scatter list of many buffers it uses a set of abstract routines:
char *(*mmu_get_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);

void (*mmu_get_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
void (*mmu_release_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);
void (*mmu_release_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
void (*mmu_map_dma_area)(unsigned long addr, int len);
Essentially the mmu_get_* routines are passed a pointer or a set pointers and size specifications to areas in kernel space
for which DMA will occur, they return a DMA capable address (i.e. one which can be loaded into the DMA controller for
the transfer). When the driver is done with the DMA and the transfer has completed the mmu_release_* routines must
be called with the DMA'able address(es) so that the resources can be freed (if necessary) and cache flushes can be
performed (if necessary).
The final routine is there for drivers which need to have a block of DMA memory for a long period of time, for example a
networking driver would use this for a pool transmit and receive buffers.
The final argument is a Sparc specific entity which allows the machine level code to perform the mapping if DMA
mappings are setup on a per-BUS basis.
8. Open issues
There seems to be some very stupid cache architectures out there which want to cause trouble when an alias is placed into
the cache (even a safe one where none of the aliased cache entries are writable!). Of note is the MIPS R4000 which will
give an exception when such a situation occurs, these can occur when COW processing is happing in the current
implementation. On most chips which do something stupid like this, the exception handler can flush the entries in the
cache being complained about and all is well. The author is mostly concerned about the cost of these exceptions during
COW processing and the effects this will have for system performance. Perhaps a new flush is neccessary, which would
be performed before the page copy in COW fault processing, which essentially is to flush a user space page if not doing so

would cause the trouble just described.

There has been heated talk lately about adding page flipping facilities for very intelligent networking hardware. It may be
necessary to extend the flush architecture to provide the interfaces and facilities necessary for these changes to the
networking code.
And by all means, the flush architecture is always subject to improvements and changes to handle new issues or new
hardware which presents a problem that was to this point unknown.
David S. Miller
davem@caip.rutgers.edu

Linux Memory Management Overview

[Note: This overview of Linux's Memory Management is several years old. Linux's MM has gone
through a nearly complete rewrite since this was written. However, if you can't understand the
Linux MM code, reading this and understanding that this documents the predecessor to the
current MM code may help you out.]
The Linux memory manager implements demand paging with a copy-on-write strategy relying on the
386's paging support. A process acquires its page tables from its parent (during a fork()) with the
entries marked as read-only or swapped. Then, if the process tries to write to that memory space, and the
page is a copy-on-write page, it is copied, and the page is marked read-write. An exec() results in the
reading in of a page or so from the executable. The process then faults in any other pages it needs.
Each process has a page directory which means it can access 1 KB of page tables pointing to 1 MB of 4
KB pages which is 4 GB of memory. A process' page directory is initialized during a fork by
copy_page_tables(). The idle process has its page directory initialized during the initialization
sequence.
Each user process has a local descriptor table that contains a code segment and data-stack segment.
These user segments extend from 0 to 3 GB (0xc0000000). In user space, linear addresses and logical
addresses are identical.
On the 80386, linear address run from 0GB to 4GB. A linear address points to a particular memory
location within this space. A linear address is not a physical address--it is a virtual address. A logical
address consists of a selector and an offset. The selector points to a segment and the offset tells how far
into that segment the address is located)
The kernel code and data segments are priveleged segments defined in the global descriptor table and
extend from 3 GB to 4 GB. The swapper page directory (swapper_page_dir is set up so that logical
addresses and physical addresses are identical in kernel space.
The space above 3 GB appears in a process' page directory as pointers to kernel page tables. This space is
invisible to the process in user mode but the mapping becomes relevant when privileged mode is entered,
for example, to handle a system call. Supervisor mode is entered within the context of the current process
so address translation occurs with respect to the process' page directory but using kernel segments. This
is identically the mapping produced by using the swapper_pg_dir and kernel segments as both page
directories use the same page tables in this space. Only task[0] (the idle task, sometimes called the
swapper task for historical reasons, even though it has nothing to do with swapping in the Linux
implementation) uses the swapper_pg_dir directly.
● The user process' segment_base = 0x00, page_dir private to the process.
● user process makes a system call: segment_base=0xc0000000 page_dir = same user

page_dir.
● swapper_pg_dir contains a mapping for all physical pages from 0xc0000000 to 0xc0000000 +
end_mem, so the first 768 entries in swapper_pg_dir are 0's, and then there are 4 or more that
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/linuxmm.html (1 di 10) [08/03/2001 10.11.06]

point to kernel page tables.

● The user page directories have the same entries as swapper_pg_dir above 768. The first 768
entries map the user space.
The upshot is that whenever the linear address is above 0xc0000000 everything uses the same kernel
page tables.
The user stack sits at the top of the user data segment and grows down. The kernel stack is not a pretty
data structure or segment that I can point to with a ``yon lies the kernel stack.'' A
kernel_stack_frame (a page) is associated with each newly created process and is used whenever
the kernel operates within the context of that process. Bad things would happen if the kernel stack were
to grow below its current stack frame. [Where is the kernel stack put? I know that there is one for
every process, but where is it stored when it's not being used?]
User pages can be stolen or swapped. A user page is one that is mapped below 3 GB in a user page table.
This region does not contain page directories or page tables. Only dirty pages are swapped.
Minor alterations are needed in some places (tests for process memory limits comes to mind) to provide
support for programmer defined segments.
[There is now a modify_ldt() system call used by dosemu, Wine, TWIN, and Wabi to create
arbitrary segments.]
Physical memory
Here is a map of physical memory before any user processes are executed. The column on the left gives
the starting address of the item, numbers in italics are approximate. The column in the middle names the
item(s). The column on the far right gives the relevant routine or variable name or explains the entry.
0x110000 FREE memory_end or high_memory
mem_map mem_init()
inode_table inode_init()
device data device_init()*
0x100000 more pg_tables paging_init()
0x0A0000 RESERVED
0x060000 FREE
low_memory_start
0x006000 kernel code + data
floppy_track_buffer
bad_pg_table used by page_fault_handlers to kill processes
bad_page gracefully when out of memory.
0x002000 pg0 the first kernel page table.
0x001000 swapper_pg_dir the kernel page directory.
0x000000 null page

*device-inits that acquire memory are(main.c): profil_buffer, con_init, psaux_init,

rd_init, scsi_dev_init.
Note that all memory not marked as FREE is RESERVED (mem_init). RESERVED pages belong to
the kernel and are never freed or swapped.
A user process' view of memory

0xc0000000 The invisible kernel reserved
initial stack
room for stack growth 4 pages
0x60000000 shared libraries
brk unused
malloc memory
end_data uninitialized data
end_code initialized data
0x00000000 text
Both the code segment and data segment extend all the way from 0x00 to 3 GB. Currently the page fault
handler do_wp_page checks to ensure that a process does not write to its code space. However, by
catching the SEGV signal, it is possible to write to code space, causing a copy-on-write to occur. The
handler do_no_page ensures that any new pages the process acquires belong to either the executable, a
shared library, the stack, or lie within the brk value.
A user process can reset its brk value by calling sbrk(). This is what malloc() does when it needs
to. The text and data portions are allocated on separate pages unless one chooses the -N compiler option.
Shared library load addresses are currently taken from the shared image itself. The address is between 1.5
GB and 3 GB, except in special cases.
User process Memory Allocation
swappable shareable
a few code pages Y Y
a few data pages Y N?
stack Y N
pg_dir N N
code/data page_table N N
stack page_table N N
task_struct N N
kernel_stack_frame N N
shlib page_table N N

a few shlib pages Y Y?

[What do the question marks mean? Do they mean that they might go either way, or that you are
not sure?]
The stack, shlibs and data are too far removed from each other to be spanned by one page table. All
kernel page_tables are shared by all processes so they are not in the list. Only dirty pages are
swapped. Clean pages are stolen so the process can read them back in from the executable if it likes.
Mostly only clean pages are shared. A dirty page ends up shared across a fork until the parent or child
chooses to write to it again.
Memory Management data in the process table

Here is a summary of some of the data kept in the process table which is used for memory managment:
Process memory limits
ulong start_code, end_code, end_data, brk, start_stack;
Page fault counting
ulong min_flt, maj_flt, cmin_flt, cmaj_flt
Local descriptor table
struct desc_struct ldt[32] is the local descriptor table for task.
rss
number of resident pages.
swappable
if 0, then process's pages will not be swapped.
kernel_stack_page
pointer to page allocated in fork.
saved_kernel_stack
V86 mode stuff
struct tss
❍ Stack segments
esp0
kernel stack pointer (kernel_stack_page)
ss0
kernel stack segment (0x10)
esp1
= ss1 = esp2 = ss2 = 0
unused privelege levels.
❍ Segment selectors: ds = es = fs = gs = ss = 0x17, cs = 0x0f
All point to segments in the current ldt[].
❍ cr3: points to the page directory for this process.

❍ ldt: _LDT(n) selector for current task's LDT.
Memory initialization
In start_kernel() (main.c) there are 3 variables related to memory initialization:
memory_start starts out at 1 MB. Updated by device initialization.
memory_end end of physical memory: 8 MB, 16 MB, or whatever.
low_memory_start end of the kernel code and data that is loaded initially.
Each device init typically takes memory_start and returns an updated value if it allocates space at
memory_start (by simply grabbing it). paging_init() initializes the page tables in the {\tt
swapper_pg_dir} (starting at 0xc0000000) to cover all of the physical memory from memory_start to
memory_end. Actually the first 4 MB is done in startup_32 (head.S). memory_start is
incremented if any new page_tables are added. The first page is zeroed to trap null pointer references
in the kernel.
In sched_init() the ldt and tss descriptors for task[0] are set in the GDT, and loaded into the
TR and LDTR (the only time it's done explicitly). A trap gate (0x80) is set up for system_call().
The nested task flag is turned off in preparation for entering user mode. The timer is turned on. The
task_struct for task[0] appears in its entirety in <linux/sched.h>.
mem_map is then constructed by mem_init() to reflect the current usage of physical pages. This is the
state reflected in the physical memory map of the previous section.
Then Linux moves into user mode with an iret after pushing the current ss, esp, etc. Of course the
user segments for task[0] are mapped right over the kernel segments so execution continues exactly
where it left off.
task[0]:
pg_dir
= swapper_pg_dir which means the the only addresses mapped are in the range 3 GB to 3 GB
+ high_memory.
LDT[1]
= user code, base=0xc0000000, size = 640K
LDT[2]
= user data, base=0xc0000000, size = 640K
The first exec() sets the LDT entries for task[1] to the user values of base = 0x0, limit =
TASK_SIZE = 0xc0000000. Thereafter, no process sees the kernel segments while in user mode.
Processes and the Memory Manager
Memory-related work done by fork():

● Memory allocation

❍ 1 page for the task_struct.

❍ 1 page for the kernel stack.
❍ 1 for the pg_dir and some for pg_tables (copy_page_tables)
● Other changes
❍ ss0 set to kernel stack segment (0x10) to be sure?
❍ esp0 set to top of the newly allocated kernel_stack_page
❍ cr3 set by copy_page_tables() to point to newly allocated page directory.
❍ ldt = _LDT(task_nr) creates new ldt descriptor.
❍ descriptors set in gdt for new tss and ldt[].
❍ The remaining registers are inherited from parent.
The processes end up sharing their code and data segments (although they have separate local desctriptor
tables, the entries point to the same segments). The stack and data pages will be copied when the parent
or child writes to them (copy-on-write).
Memory-related work done by exec():
● memory allocation
❍ 1 page for exec header entire file for omagic
❍ 1 page or more for stack (MAX_ARG_PAGES)
● clear_page_tables() used to remove old pages.
● change_ldt() sets the descriptors in the new LDT[]
● ldt[1] = code base=0x00, limit=TASK_SIZE
● ldt[2] = data base=0x00, limit=TASK_SIZE

These segments are DPL=3, P=1, S=1, G=1. type=a (code) or 2 (data)
● Up to MAX_ARG_PAGES dirty pages of argv and envp are allocated and stashed at the top of the
data segment for the newly created user stack.
● Set the instruction pointer of the caller eip = ex.a_entry
● Set the stack pointer of the caller to the stack just created (esp = stack pointer) These will be
popped off the stack when the caller resumes.
● update memory limits
end_code = ex.a_text
end_data = end_code + ex.a_data
brk = end_data + ex.a_bss
Interrupts and traps are handled within the context of the current task. In particular, the page directory of
the current process is used in address translation. The segments, however, are kernel segments so that all
linear addresses point into kernel memory. For example, assume a user process invokes a system call and
the kernel wants to access a variable at address 0x01. The linear address is 0xc0000001 (using kernel
segments) and the physical address is 0x01. The later is because the process' page directory maps this
range exactly as page_pg_dir.
The kernel space (0xc0000000 + high_memory) is mapped by the kernel page tables which are

themselves part of the RESERVED memory. They are therefore shared by all processes. During a fork
copy_page_tables() treats RESERVED page tables differently. It sets pointers in the process page
directories to point to kernel page tables and does not actually allocate new page tables as it does
normally. As an example the kernel_stack_page (which sits somewhere in the kernel space) does
not need an associated page_table allocated in the process' pg_dir to map it.
The interrupt instruction sets the stack pointer and stack segment from the privilege 0 values saved in the
tss of the current task. Note that the kernel stack is a really fragmented object--it's not a single object, but
rather a bunch of stack frames each allocated when a process is created, and released when it exits. The
kernel stack should never grow so rapidly within a process context that it extends below the current
frame.
Acquiring and Freeing Memory: Paging Policy

[Note: swapping has also been massively changed in recent kernels, with the ``kswap'' changes.]
When any kernel routine wants memory it ends up calling get_free_page(). This is at a lower level
than kmalloc() (in fact kmalloc() uses get_free_page() when it needs more memory).
get_free_page() takes one parameter, a priority. Possible values are GFP_BUFFER,
GFP_KERNEL, GFP_NFS, and GFP_ATOMIC. It takes a page off of the free_page_list, updates
mem_map, zeroes the page and returns the physical address of the page (note that kmalloc() returns a
physical address. The logic of the mm depends on the identity map between logical and physical
addresses).
That itself is simple enough. The problem, of course, is that the free_page_list may be empty. If
you did not request an atomic operation, at this stage, you enter into the realm of page stealing which
we'll go into in a moment. As a last resort (and for atomic requests) a page is torn off from the
secondary_page_list (as you may have guessed, when pages are freed, the
secondary_page_list gets filled up first).
The actual manipulation of the page_lists and mem_map occurs in this mysterious macro called
REMOVE_FROM_MEM_QUEUE() which you probably never want to look into. Suffice it to say that
interrupts are disabled. [I think that this should be explained here. It is not that hard...]
Now back to the page stealing bit. get_free_page() calls try_to_free_page() which
repeatedly calls shrink_buffers() and swap_out() in that order until it is successful in freeing a
page. The priority is increased on each successive iteration so that these two routines run through their
page stealing loops more often.
Here's one run through swap_out():
● Run through the process table and get a swappable task, say, Q.
● Find a user page table (not RESERVED) in Q's space.
● For each page in the table try_to_swap_out(page).
● Quit when a page is freed.
Note that swap_out() (called by try_to_free_page()) maintains static variables so it may

resume the search where it left off on the previous call.

try_to_swap_out() scans the page tables of all user processes and enforces the stealing policy:
1. Do not fiddle with RESERVED pages.
2. Age the page if it is marked accessed (1 bit).
3. Don't tamper with recently acquired pages (last_free_pages[]).
4. Leave dirty pages with map_counts > 1 alone.
5. Decrement the map_count of clean pages.
6. Free clean pages if they are unmapped.
7. Swap dirty pages with a map_count of 1.
Of these actions, 6 and 7 will stop the process as they result in the actual freeing of a physical page.
Action 5 results in one of the processes losing an unshared clean page that was not accessed recently
(decrement Q->rss) which is not all that bad, but the cumulative effects of a few iterations can slow
down a process considerably. At present, there are 6 iterations, so a page shared by 6 processes can get
stolen if it is clean.
Page table entries are updated and the TLB invalidated.
The actual work of freeing the page is done by free_page(), the complement of
get_free_page(). It ignores RESERVED pages, updates mem_map, then frees the page and updates
the page_lists if it is unmapped. For swapping (in 6 above), write_swap_page() gets called and
does nothing remarkable from the memory management perspective.
The details of shrink_buffers() would take us too far afield. Essentially it looks for free buffers,
then writes out dirty buffers, then goes at busy buffers and calls free_page() when its able to free all
the buffers on a page.
Note that page directories and page tables along with RESERVED pages do not get swapped, stolen or
aged. They are mapped in the process page directory through reserved page tables. They are freed only
on exit from the process.
The page fault handlers

When a process is created via fork, it starts out with a page directory and a page or so of the executable.
So the page fault handler is the source of most of a processes' memory.
The page fault handler do_page_fault() retrieves the faulting address from the register cr2. The
error code (retrieved in sys_call.S) differentiates user/supervisor access and the reason for the fault--write
protection or a missing page. The former is handled by do_wp_page() and the latter by
do_no_page().
If the faulting address is greater than TASK_SIZE the process receives a SIGKILL. [Why this check?
This can only happen in kernel mode because of segment level protection.]
These routines have some subtleties as they can get called from an interrupt. You can't assume that it is
the ``current'' task that is executing.
do_no_page() handles three possible situations:

1. The page is swapped.

2. The page belongs to the executable or a shared library.
3. The page is missing--a data page that has not been allocated.
In all cases get_empty_pgtable() is called first to ensure the existence of a page table that covers
the faulting address. In case 3 get_empty_page() is called to provide a page at the required address
and in case of the swapped page, swap_in() is called.
In case 2, the handler calls share_page() to see if the page is shareable with some other process. If
that fails it reads in the page from the executable or library (It repeats the call to share_page() in
case another process did the same meanwhile). Any portion of the page beyond the brk value is zeroed.
A page read in from the disk is counted as a major fault (maj_flt). This happens with a swap_in()
or when it is read from the executable or a library. Other cases are deemed minor faults (min_flt).
When a shareable page is found, it is write-protected. A process that writes to a shared page will then
have to go through do_wp_page() which does the copy-on-write.
do_wp_page() does the following:
● Send SIGSEGV if any user process is writing to current code_space.
● If the old page is not shared then just unprotect it.

Else get_free_page() and copy_page(). The page acquires the dirty flag from the old
page. Decrement the map count of the old page.
Paging
Paging is swapping on a page basis rather than by entire processes. We will use swapping here to refer to
paging, since Linux only pages, and does not swap, and people are more used to the word ``swap'' than
``page.'' Kernel pages are never swapped. Clean pages are also not written to swap. They are freed and
reloaded when required. The swapper maintains a single bit of aging info in the PAGE_ACCESSED bit of
the page table entries. [What are the maintainance details? How is it used?]
Linux supports multiple swap files or devices which may be turned on or off by the swapon and swapoff
system calls. Each swapfile or device is described by a struct swap_info_struct (swap.c).
static struct swap_info_struct {

struct inode * swap_file;
unsigned int swap_device;
unsigned char * swap_map;
char * swap_lockmap;
int lowest_bit;
int highest_bit;
} swap_info[MAX_SWAPFILES];
The flags field (SWP_USED or SWP_WRITEOK) is used to control access to the swap files. When
SWP_WRITEOK is off space will not be allocated in that file. This is used by swapoff when it tries to

unuse a file. When swapon adds a new swap file it sets SWP_USED. A static variable nr_swapfiles
stores the number of currently active swap files. The fields lowest_bit and highest_bit bound
the free region in the swap file and are used to speed up the search for free swap space.
The user program mkswap initializes a swap device or file. The first page contains a signature
(`SWAP-SPACE') in the last 10 bytes, and holds a bitmap. Initially 0's in the bitmap signal bad pages. A
`1' in the bitmap means the corresponding page is free. This page is never allocated so the initialization
needs to be done just once.
The syscall swapon() is called by the user program swapon typically from /etc/rc. A couple of pages of
memory are allocated for swap_map and swap_lockmap.
swap_map holds a byte for each page in the swapfile. It is initialized from the bitmap to contain a 0 for
available pages and 128 for unusable pages. It is used to maintain a count of swap requests on each page
in the swap file. swap_lockmap holds a bit for each page that is used to ensure mutual exclusion when
reading or writing swap files.
When a page of memory is to be swapped out an index to the swap location is obtained by a call to
get_swap_page(). This index is then stored in bits 1-31 of the page table entry so the swapped page
may be located by the page fault handler, do_no_page() when needed.
The upper 7 bits of the index give the swapfile (or device) and the lower 24 bits give the page number on
that device. That makes as many as 128 swapfiles, each with room for about 64 GB, but the space
overhead due to the swap_map would be large. Instead the swapfile size is limited to 16 MB, because
the swap_map then takes 1 page.
The function swap_duplicate() is used by copy_page_tables() to let a child process inherit
swapped pages during a fork. It just increments the count maintained in swap_map for that page. Each
process will swap in a separate copy of the page when it accesses it.
swap_free() decrements the count maintained in swap_map. When the count drops to 0 the page
can be reallocated by get_swap_page(). It is called each time a swapped page is read into memory
(swap_in()) or when a page is to be discarded (free_one_table(), etc.).
Copyright (C) 1992, 1993 Krishna Balasubramanian and Douglas Johnson
Messages


A logical address specified in an instruction is first translated to a linear address by the segmenting
hardware. This linear address is then translated to a physical address by the paging unit.
Paging on the 386
There are two levels of indirection in address translation by the paging unit. A page directory contains
pointers to 1024 page tables. Each page table contains pointers to 1024 pages. The register CR3 contains
the physical base address of the page directory and is stored as part of the TSS in the task_struct
and is therefore loaded on each task switch.
A 32-bit Linear address is divided as follows:
31 ...... 22 21 ...... 12 11 ...... 0
DIR TABLE OFFSET
Physical address is then computed (in hardware) as:

CR3 + DIR points to the table_base.
table_base + TABLE points to the page_base.
physical_address = page_base + OFFSET
Page directories (page tables) are page aligned so the lower 12 bits are used to store useful information
about the page table (page) pointed to by the entry.
Format for Page directory and Page table entries:
31 ...... 12 11 .. 9 8 7 6 5 4 3 2 1 0
ADDRESS OS 0 0 D A 0 0 U/S R/W P
D 1 means page is dirty (undefined for page directory entry).

R/W 0 means readonly for user.
U/S 1 means user page.
P 1 means page is present in memory.
A 1 means page has been accessed (set to 0 by aging).
OS bits can be used for LRU etc, and are defined by the OS.
The corresponding definitions for Linux are in .

When a page is swapped, bits 1-31 of the page table entry are used to mark where a page is stored in
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/80386mm.html (1 di 7) [08/03/2001 10.11.08]

swap (bit 0 must be 0).

Paging is enabled by setting the highest bit in CR0. [in head.S?] At each stage of the address translation
access permissions are verified and pages not present in memory and protection violations result in page
faults. The fault handler (in memory.c) then either brings in a new page or unwriteprotects a page or does
whatever needs to be done.
Page Fault handling Information
● The register CR2 contains the linear address that caused the last page fault.
● Page Fault Error Code (16 bits):
bit cleared set
0 page not present page level protection
1 fault due to read fault due to write
2 supervisor mode user mode
The rest are undefined. These are extracted in sys_call.S.

The Translation Lookaside Buffer (TLB) is a hardware cache for physical addresses of the most recently
used virtual addresses. When a virtual address is translated the 386 first looks in the TLB to see if the
information it needs is available. If not, it has to make a couple of memory references to get at the page
directory and then the page table before it can actually get at the page. Three physical memory references
for address translation for every logical memory reference would kill the system, hence the TLB.
The TLB is flushed if CR3 loaded or by task switch that changes CR0. It is explicitly flushed in Linux by
calling invalidate() which just reloads CR3.
Segments in the 80386
Segment registers are used in address translation to generate a linear address from a logical (virtual)
address.
linear_address = segment_base + logical_address
The linear address is then translated into a physical address by the paging hardware.
Each segment in the system is described by a 8 byte segment descriptor which contains all pertinent
information (base, limit, type, privilege).
The segments are:
Regular segments
❍ code and data segments
System segments
❍ (TSS) task state segments
❍ (LDT) local descriptor tables
Characteristics of system segments

● System segments are task specific.

● There is a Task State Segment (TSS) associated with each task in the system. It contains the
tss_struct (sched.h). The size of the segment is that of the tss_struct excluding the
i387_union (232 bytes). It contains all the information necessary to restart the task.
● The LDT's contain regular segment descriptors that are private to a task. In Linux there is one LDT
per task. There is room for 32 descriptors in the linux task_struct. The normal LDT generated
by Linux has a size of 24 bytes, hence room for only 3 entries as above. Its contents are:
LDT[0]
Null (mandatory)
LDT[1]
user code segment descriptor.
LDT[2]
user data/stack segment descriptor.
● The user segments all have base=0x00 so that the linear address is the same as the logical address.
To keep track of all these segments, the 386 uses a global descriptor table (GDT) that is setup in memory
by the system (located by the GDT register). The GDT contains a segment descriptors for each task state
segment, each local descriptor tablet and also regular segments. The Linux GDT contains just two
normal segment entries:
● GDT[0] is the null descriptor.
● GDT[1] is the kernel code segment descriptor.
● GDT[2] is the kernel data/stack segment descriptor.
The rest of the GDT is filled with TSS and LDT system descriptors:
● GDT[3] ???
● GDT[4] = TSS0, GDT[5] = LDT0,
● GDT[6] = TSS1, GDT[7] = LDT1
● ... etc. ...
LDT[n] != LDTn
LDT[n] = the nth descriptor in the LDT of the current task.
LDTn = a descriptor in the GDT for the LDT of the nth task.
The kernel segments have base 0xc0000000 which is where the kernel lives in the linear view. Before a
segment can be used, the contents of the descriptor for that segment must be loaded into the segment
register. The 386 has a complex set of criteria regarding access to segments so you can't simply load a
descriptor into a segment register. Also these segment registers have programmer invisible portions. The
visible portion is what is usually called a segment register: cs, ds, es, fs, gs, and ss.
The programmer loads one of these registers with a 16-bit value called a selector. The selector uniquely
identifies a segment descriptor in one of the tables. Access is validated and the corresponding descriptor
loaded by the hardware.

Currently Linux largely ignores the (overly?) complex segment level protection afforded by the 386. It is
biased towards the paging hardware and the associated page level protection. The segment level rules
that apply to user processes are
1. A process cannot directly access the kernel data or code segments
2. There is always limit checking but given that every user segment goes from 0x00 to 0xc0000000 it
is unlikely to apply. [This has changed, and needs updating, please.]
Selectors in the 80386
A segment selector is loaded into a segment register (cs, ds, etc.) to select one of the regular segments in
the system as the one addressed via that segment register.
Segment selector Format:
15 ...... 3 2 1 0
index TI RPL
TI Table indicator:
0 means selector indexes into GDT
1 means selector indexes into LDT
RPL Privelege level. Linux uses only two privelege levels.
0 means kernel
3 means user
Examples:
Kernel code segment
TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
User data segment
TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])
Selectors used in Linux:
TI index RPL selector segment
0 1 0 0x08 kernel code GDT[1]
0 2 0 0x10 kernel data/stack GDT[2]
0 3 0 ??? ??? GDT[3]
1 1 3 0x0F user code LDT[1]
1 2 3 0x17 user data/stack LDT[2]
Selectors for system segments are not to be loaded directly into segment registers. Instead one must load
the TR or LDTR.
On entry into syscall:
● ds and es are set to the kernel data segment (0x10)

● fs is set to the user data segment (0x17) and is used to access data pointed to by arguments to the
system call.
● The stack segment and pointer are automatically set to ss0 and esp0 by the interrupt and the old
values restored when the syscall returns.
Segment descriptors
There is a segment descriptor used to describe each segment in the system. There are regular descriptors
and system descriptors. Here's a descriptor in all its glory. The strange format is essentially to maintain
compatibility with the 286. Note that it takes 8 bytes.
63-54 55 54 53 52 51-48 47 46 45 44-40 39-16 15-0
Base Limit Segment Base Segment Limit
G D R U P DPL S TYPE
31-24 19-16 23-0 15-0
Explanation:
R reserved (0)
DPL 0 means kernel, 3 means user
G 1 means 4K granularity (Always set in Linux)
D 1 means default operand size 32bits
U programmer definable
P 1 means present in physical memory
S 0 means system segment, 1 means normal code or data segment.
Type There are many possibilities. Interpreted differently for system and normal descriptors.
Linux system descriptors:

TSS: P=1, DPL=0, S=0, type=9, limit = 231 room for 1 tss_struct.
LDT: P=1, DPL=0, S=0, type=2, limit = 23 room for 3 segment descriptors.
The base is set during fork(). There is a TSS and LDT for each task.
Linux regular kernel descriptors: (head.S)
code: P=1, DPL=0, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x3ffff
data: P=1, DPL=0, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x3ffff
The LDT for task[0] contains: (sched.h)
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x9f
Format of segment register: (Only the selector is programmer visible)

16-bit 32-bit 32-bit
selector physical base addr segment limit attributes
The invisible portion of the segment register is more conveniently viewed in terms of the format used in
the descriptor table entries that the programmer sets up. The descriptor tables have registers associated
with them that are used to locate them in memory. The GDTR (and IDTR) are initialized at startup once
the tables are defined. The LDTR is loaded on each task switch.
Format of GDTR (and IDTR):
32-bits 16-bits
Linear base addr table limit
The TR and LDTR are loaded from the GDT and so have the format of the other segment registers. The
task register (TR) contains the descriptor for the currently executing task's TSS. The execution of a jump
to a TSS selector causes the state to be saved in the old TSS, the TR is loaded with the new descriptor
and the registers are restored from the new TSS. This is the process used by schedule to switch to various
user tasks. Note that the field tss_struct.ldt contains a selector for the LDT of that task. It is used
to load the LDTR. (sched.h)
Macros used in setting up descriptors
Some assembler macros are defined in sched.h and system.h to ease access and setting of descriptors.
Each TSS entry and LDT entry takes 8 bytes.
Manipulating GDT system descriptors:
_TSS(n), _LDT(n)
These provide the index into the GDT for the n'th task.
_LDT(n) is stored in the the ldt field of the tss_struct by fork.
_set_tssldt_desc(n, addr, limit, type)
ulong *n points to the GDT entry to set (see fork.c).
The segment base (TSS or LDT) is set to 0xc0000000 + addr.
Specific instances of the above are, where ltype refers to the byte containing P, DPL, S and type:
set_ldt_desc(n, addr) ltype = 0x82
P=1, DPL=0, S=0, type=2 means LDT entry.
limit = 23 => room for 3 segment descriptors.
set_tss_desc(n, addr) ltype = 0x89
P=1, DPL=0, S=0, type = 9, means available 80386 TSS limit = 231 room for 1 tss_struct.
load_TR(n),
load_ldt(n) load descriptors for task number n into the task register and ldt register.
ulong get_base (struct desc_struct ldt)
gets the base from a descriptor.

ulong get_limit (ulong segment)

gets the limit (size) from a segment selector.
Returns the size of the segment in bytes.
set_base(struct desc_struct ldt, ulong base),
set_limit(struct desc_struct ldt, ulong limit)
Will set the base and limit for descriptors (4K granular segments).
The limit here is actually the size in bytes of the segment.
_set_seg_desc(gate_addr, type, dpl, base, limit)
Default values 0x00408000 => D=1, P=1, G=0
Present, operation size is 32 bit and max size is 1M.
gate_addr must be a (ulong *)
Copyright (C) 1992, 1993 Krishna Balasubramanian and Douglas Johnson
Messages
1. paging initialization, doc update by droux@cs.unm.edu
2. User Code and Data Segment no longer in LDT. by Lennart Benschop

paging initialization, doc update
paging initialization, doc update

Forum: 80386 Memory Management
Keywords: paging, initialization, x86, CR0
Date: Mon, 03 Jun 1996 20:46:15 GMT
From: <droux@cs.unm.edu>
From the x86 memory management doc:
> Paging is enabled by setting the highest bit in CR0.

> [in head.S?]
This is correct, the editor note can be suppressed.

CR0 initialisation for paging is performed by
"setup_paging" which is implemented in (1.2.13)
arch/i386/kernel/head.S
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/80386mm/1.html [08/03/2001 10.11.08]

User Code and Data Segment no longer in LDT.
User Code and Data Segment no longer in LDT.

Forum: 80386 Memory Management
Date: Tue, 23 Jul 1996 09:39:45 GMT
From: Lennart Benschop <benschop@eb.ele.tue.nl>
The user code and data segments of a process are no longer in the LDT, but in the GDT instead. The
code and data segment of each process starts at linear address 0 anyway, only the physical address is
different (different page directory =CR3)
Processes still have an LDT, this can be used by certain applications (WINE).
In very early versions of Linux, user space was restricted to 64 MB and there were a maximum of 64
processes (including process 0, which had the kernel in its user space). Back then each process had a
different linear address. making a total of 4GB. There was only one page directory, and there were
per-process code and data segments, included in the LDT. This (somewhat elegant) scheme was
abandoned to allow more than 64 processes and a per process virtual address space of more than
64MB. That's why certain kernels had they suer code and data segments in the LDT, though they were
in fact the same segments for all processes.
http://ldp.iol.it/LDP/khg/HyperNews/get/memory/80386mm/2.html [08/03/2001 10.11.09]


This section covers first the mechanisms provided by the 386 for handling system calls, and then shows
how Linux uses those mechanisms. This is not a reference to the individual system calls: There are very
many of them, new ones are added occasionally, and they are documented in man pages that should be
on your Linux system.
What Does the 386 Provide?

The 386 recognizes two event classes: exceptions and interrupts. Both cause a forced context switch to
new a procedure or task. Interrupts can occur at unexpected times during the execution of a program and
are used to respond to signals from hardware. Exceptions are caused by the execution of instructions.
Two sources of interrupts are recognized by the 386: Maskable interrupts and Nonmaskable interrupts.
Two sources of exceptions are recognized by the 386: Processor detected exceptions and programmed
exceptions.
Each interrupt or exception has a number, which is referred to by the 386 literature as the vector. The
NMI interrupt and the processor detected exceptions have been assigned vectors in the range 0 through
31, inclusive. The vectors for maskable interrupts are determined by the hardware. External interrupt
controllers put the vector on the bus during the interrupt-acknowledge cycle. Any vector in the range 32
through 255, inclusive, can be used for maskable interrupts or programmed exceptions. Here is a listing
of all the possible interrupts and exceptions:
0 divide error
1 debug exception
2 NMI interrupt
3 Breakpoint
4 INTO-detected Overflow
5 BOUND range exceeded
6 Invalid opcode
7 coprocessor not available
8 double fault
9 coprocessor segment overrun
10 invalid task state segment
11 segment not present
12 stack fault
13 general protection
14 page fault
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86.html (1 di 5) [08/03/2001 10.11.10]

15 reserved
16 coprocessor error
17-31 reserved
32-255 maskable interrupts
The priority of simultaneous interrupts and exceptions is:

HIGHEST Faults except debug faults
. Trap instructions INTO, INT n, INT 3
. Debug traps for this instruction
. Debug traps for next instruction
. NMI interrupt
LOWEST INTR interrupt
How Linux Uses Interrupts and Exceptions

Under Linux the execution of a system call is invoked by a maskable interrupt or exception class
transfer, caused by the instruction int 0x80. We use vector 0x80 to transfer control to the kernel. This
interrupt vector is initialized during system startup, along with other important vectors like the system
clock vector.
iBCS2 requries an lcall 0,7 instruction, which Linux can send to the iBCS2 compatibility module
appropriate if an iBCS2-compliant binary is being executed. In fact, Linux will assume that an
iBCS2-compliant binary is being executed if an lcall 0,7 call is executed, and will automatically
switch modes.
As of version 0.99.2 of Linux, there are 116 system calls. Documentation for these can be found in the
man (2) pages. When a user invokes a system call, execution flow is as follows:
● Each call is vectored through a stub in libc. Each call within the libc library is generally a
syscallX() macro, where X is the number of parameters used by the actual routine. Some
system calls are more complex then others because of variable length argument lists, but even
these complex system calls must use the same entry point: they just have more parameter setup
overhead. Examples of a complex system call include open() and ioctl().
● Each syscall macro expands to an assembly routine which sets up the calling stack frame and calls
_system_call() through an interrupt, via the instruction int $0x80
For example, the setuid system call is coded as
_syscall1(int,setuid,uid_t,uid);
which will expand to:
_setuid:
subl $4,%exp
pushl %ebx
movzwl 12(%esp),%eax

movl %eax,4(%esp)
movl $23,%eax
movl 4(%esp),%ebx
int $0x80
movl %eax,%edx
testl %edx,%edx
jge L2
negl %edx
movl %edx,_errno
movl $-1,%eax
popl %ebx
addl $4,%esp
ret
L2:
movl %edx,%eax
popl %ebx
addl $4,%esp
ret
The macro definition for the syscallX() macros can be found in /usr/include/linux/unistd.h,
and the user-space system call library code can be found in /usr/src/libc/syscall/
● At this point no system code for the call has been executed. Not until the int $0x80 is executed
does the call transfer to the kernel entry point _system_call(). This entry point is the same
for all system calls. It is responsible for saving all registers, checking to make sure a valid system
call was invoked and then ultimately transfering control to the actual system call code via the
offsets in the _sys_call_table. It is also responsible for calling
_ret_from_sys_call() when the system call has been completed, but before returning to
user space.
Actual code for system_call entry point can be found in /usr/src/linux/kernel/sys_call.S Actual
code for many of the system calls can be found in /usr/src/linux/kernel/sys.c, and the rest are found
elsewhere. find is your friend.
● After the system call has executed, _ret_from_sys_call() is called. It checks to see if the
scheduler should be run, and if so, calls it.
● Upon return from the system call, the syscallX() macro code checks for a negative return
value, and if there is one, puts a positive copy of the return value in the global variable _errno,
so that it can be accessed by code like perror().
How Linux Initializes the system call vectors

The startup_32() code found in /usr/src/linux/boot/head.S starts everything off by calling
setup_idt(). This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries. No interrupt
entry points are actually loaded by this routine, as that is done only after paging has been enabled and the
kernel has been moved to 0xC0000000. An IDT has 256 entries, each 4 bytes long, for a total of 1024
bytes. When start_kernel() (found in /usr/src/linux/init/main.c) is called it invokes
trap_init() (found in /usr/src/linux/kernel/traps.c). trap_init() sets up the IDT via the macro

set_trap_gate() (found in /usr/include/asm/system.h). trap_init() initializes the interrupt

descriptor table as shown here:
0 divide_error
1 debug
2 nmi
3 int3
4 overflow
5 bounds
6 invalid_op
7 device_not_available
8 double_fault
9 coprocessor_segment_overrun
10 invalid_TSS
11 segment_not_present
12 stack_segment
13 general_protection
14 page_fault
15 reserved
16 coprocessor_error
17 alignment_check
18-48 reserved
At this point the interrupt vector for the system calls is not set up. It is initialized by sched_init()
(found in /usr/src/linux/kernel/sched.c). A call to set_system_gate (0x80, &system_call)
sets interrupt 0x80 to be a vector to the system_call() entry point.
How to Add Your Own System Calls

1. Create a directory under the /usr/src/linux/ directory to hold your code.
2. Put any include files in /usr/include/sys/ and /usr/include/linux/.
3. Add the relocatable module produced by the link of your new kernel code to the ARCHIVES and
the subdirectory to the SUBDIRS lines of the top level Makefile. See fs/Makefile, target fs.o for an
example.
4. Add a #define __NR_xx to unistd.h to assign a call number for your system call, where xx,
the index, is something descriptive relating to your system call. It will be used to set up the vector
through sys_call_table to invoke you code.
5. Add an entry point for your system call to the sys_call_table in sys.h. It should match the
index (xx) that you assigned in the previous step. The NR_syscalls variable will be
recalculated automatically.

6. Modify any kernel code in kernel/fs/mm/, etc. to take into account the environment needed to
support your new code.
7. Run make from the top level to produce the new kernel incorporating your new code.
At this point, you will have to either add a syscall to your libraries, or use the proper _syscalln()
macro in your user program for your programs to access the new system call. The 386DX
Microprocessor Programmer's Reference Manual is a helpful reference, as is James Turley's Advanced
80386 Programming Techniques. See the Annotated Bibliography.
Copyright (C) 1993, 1996 Michael K. Johnson, johnsonm@redhat.com.

Copyright (C) 1993 Stanley Scalsky
Messages
4. wrong file for system_call code by Tim Bird
3. would be nice to explain syscall macros by Tim Bird
2. wrong file for syscallX() macro by Tim Bird
1. the directory /usr/src/libc/syscall/ by vijay gupta
1. ...no longer exists. by Michael K. Johnson
-> the solution to the problem by Vijay Gupta

Please replace K&R reference by Harbison/Steele
Please replace K&R reference by Harbison/Steele

Forum: Annotated Bibliography
Keywords: C, textbook, K%0Aamp;R replacement, reference manual
Date: Sun, 19 May 1996 12:06:58 GMT
From: Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
I suggest that you replace the K&R reference by the following much better and more up-to-date one. I
have never touched my K&R again since I bought the following book. If you don't want to throw K&R
out, please add at least this reference and add to the K&R review that this old book does not cover the
full ISO C run-time library (e.g. the wide character and locale support is missing almost completely)
nor the 1994 C language extensions.
C - A Reference Manual, fourth edition
Author: Samuel P. Harbison and Guy L. Steele Jr.
ISBN: 0-13-326232-4 (hard) 0-13-326224-3 (paper)
Pages: 455
Price: ??.?? USD, 73.80 DEM
This book is an authoritative reference manual that provides a complete and precise description of the
C language and the run-time library. It also teaches a C programming style that emphasizes
correctness, portability, and maintainability. If you program in C, you want to have this book on your
desk, even if you are already a C expert. The authors have been members of the ANSI/ISO C standards
committee.
The Harbison/Steele has by now taken over the role of being the C bible from the traditional C book
by Kernighan/Ritchie. In contrast to K&R, the Harbison/Steele covers the full ISO C standard,
including the 1994 extensions. It also covers the old K&R C language as well as C++ compatibility
issues. Especially the description of the standard C library, which every C programmer needs for daily
reference, is considerably more complete and precise than the one found in appendix B of K&R.
Messages
1. Replace, no; supplement, yes by Michael K. Johnson
-> Right you are Mike! by rohit patil
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib/1.html [08/03/2001 10.11.11]

Replace, no; supplement, yes
Replace, no; supplement, yes

Re: Please replace K&R reference by Harbison/Steele (Markus Kuhn)
Keywords: C, textbook, reference manual
Date: Sun, 19 May 1996 15:27:40 GMT
Since this is the kernel hackers' guide, and since the kernel doesn't use the run-time library, the fact
that Harbison and Steele document the run-time library fully is rather irrelevant for the purposes of this
document, even though it is relevant to C programmers in general.
I agree that H&S should be included, but not that K&R should be excluded.
I happen to like K&R and find it easy to read and look things up in; I occasionally supplement it with
an annotated (sometimes poorly, IMHO) copy of the ANSI standard published by Osborne/McGraw
Hill, ISBN 0-07-881952-90. Even if (like me) you ignore the annotation, it's still cheaper than an
official copy of the standard, or was last time I checked. I should probably add it to the bibliography,
along with a pointer to the GNU C documentation, since the linux kernel does use a few GNU C
extensions.
Messages
1. Right you are Mike! by rohit patil
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib/1/1.html [08/03/2001 10.11.11]

Right you are Mike!
Right you are Mike!

Re: Please replace K&R reference by Harbison/Steele (Markus Kuhn)
Re: Replace, no; supplement, yes (Michael K. Johnson)
Keywords: C, textbook, reference manual
Date: Thu, 02 Jan 1997 04:15:15 GMT
From: rohit patil <rohit@techie.com>
Yup! Can't replace K&R. Supplement, maybe :)
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib/1/1/1.html [08/03/2001 10.11.11]

http://
Very unfortunate
Very unfortunate
Re: 80386 book is apparently out of print now (Austin Donnelly)
Keywords: 80386 programming, out of print
Date: Sun, 26 May 1996 17:45:43 GMT
Osborne McGraw-Hill may be bringing the book in and out of print. When I got my copy in 1992, it
was out of print but still available in bookstores, so if it just recently went out of print, they probably
do infrequent printings and enough noise from potential readers may be sufficient to convince them to
bring the book out of retirement.
Write to Osborne McGraw-Hill and let them know you want to buy a copy. Bookstores don't tell them
when one person goes in and it's out of print, but publishers do sometimes listen when plenty of
potential readers write to them. Their address is (or was when the book was published...)
Osborne McGraw-Hill
2600 Tenth Street
Berkeley, California 94710
U.S.A.
Good luck!
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib/2/1.html [08/03/2001 10.11.12]

Linux Kernel Internals-> Kernel MM IPC fs drivers net modules
Linux Kernel Internals-> Kernel MM IPC fs drivers

net modules
Keywords: Must have
Date: Tue, 15 Oct 1996 15:43:14 GMT
From: Alex Stewart <Alex@auriga.rose.brandeis.edu>
Linux Kernel Internals ISBN 0-201-87741-4 Addison Wesley Longman 1996

Beck,Boehme,Dziadzka(!),Kunitz,Magnus,Verworner $45 at Borders
A guide to the kernel code, indispensible to someone like me, who has to make hardware work with
Linux, but is unfamiliar with OS details outside of windows/dos. Supprisingly easy to read given the
subject matter. This book talks you through the timer and scheduler code and fast and slow H/W
interrrupt handlers for instance, so I can see what things like get/setitimer will do for me. Based on
1.2.13 and 1.3.x.
http://ldp.iol.it/LDP/khg/HyperNews/get/bib/bib/3.html [08/03/2001 10.11.12]

wrong file for system_call code
wrong file for system_call code

Forum: How System Calls Work on Linux/i86
Keywords: syscall assembly error
Date: Fri, 13 Sep 1996 01:44:29 GMT
From: Tim Bird <tbird@caldera.com>
This page contains the sentence "Actual code for system_call entry point can be found in
/usr/src/linux/kernel/sys_call.S"
This should read: Actual code for the system_call entry point (for the intel architecture) can be found
in /usr/src/linux/arch/i386/kernel/entry.S
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/4.html [08/03/2001 10.11.13]

would be nice to explain syscall macros

Keywords: sycall
Date: Fri, 13 Sep 1996 01:37:11 GMT
The syscall macros are a little dense to decipher. It took me a while to determine how the macro
syscall1(int,setuid,uid_t,uid) expanded into the assembly code shown.
It might be nice to show the macro, and explain a little about how it gets expanded.
Here is the source for the _syscall1 macro
#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name),"b" ((long)(arg1))); \
if (__res >= 0) \
return (type) __res; \
errno = -__res; \
return -1; \
}
When expanded, this become the code
int setuid(uid_t uid)
{
long __res;
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_setuid), "b" ((long)(uid)));
if (__res >= 0 )
return (int) __res;
errno = -__res;
return -1;
}
It's pretty easy to see how the cleanup code converts
into assembly, but the setup code eluded me until
I figured out the following:
"=a" (__res) means the result comes back in %eax
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/3.html (1 di 2) [08/03/2001 10.11.13]

"0" (__NR_setuid) means put the system call number

into %eax on entry
"b" ((long)(uid) means put the first argument
into %ebx on entry
syscallX macros that use additional parameters use %ecx, %edx, %esi, and %edi to hold additional
values passed through the call.
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/3.html (2 di 2) [08/03/2001 10.11.13]

wrong file for syscallX() macro
wrong file for syscallX() macro

Date: Fri, 13 Sep 1996 01:25:44 GMT
This page contains the sentence:

" The macro definition for the syscallX() macros can be found
in /usr/include/linux/unistd.h"
Actually, this file containts the system call numbers. The macros for system call generation are located
in the file /usr/include/asm/unistd.h

the directory /usr/src/libc/syscall/
the directory /usr/src/libc/syscall/

Keywords: system call
Date: Sat, 18 May 1996 03:30:26 GMT
From: vijay gupta <vijay@crhc.uiuc.edu>
Hi,
The directory /usr/src/libc/syscall/ is essential
for hacking the libc code in order to add a new
system call.
Unfortunately, I have been unable to find this
directory from
libc-5.3.9.tar.gz from ftp://sunsite.unc.edu/pub/Linux/GCC/
or from libc-4.X.
Does anyone have any ideas on this ?
Thank you very much,
Vijay Gupta
(Email : vijay@crhc.uiuc.edu)
Messages
1. ...no longer exists. by Michael K. Johnson
-> the solution to the problem by Vijay Gupta

...no longer exists.
...no longer exists.

Re: the directory /usr/src/libc/syscall/ (vijay gupta)
Keywords: system call libc
Date: Sat, 18 May 1996 12:52:48 GMT
The /usr/src/libc/syscall/ directory no longer exists.

Instead, they files you need to modify are in the libc/sysdeps/linux/ directory. If your system call only
works on one architecture, then you need to use the architecture-dependent subdirectories i386 and
m68k (at present; that will soon expand to at least sparc, and maybe other platforms).
If you need to know more, please ask more...
Messages
1. the solution to the problem by Vijay Gupta
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/1/1.html [08/03/2001 10.11.14]

the solution to the problem

Re: the directory /usr/src/libc/syscall/ (vijay gupta)
Re: ...no longer exists. (Michael K. Johnson)
Keywords: system call libc
Date: Tue, 21 May 1996 23:05:39 GMT
Hi everybody,
Thanks to the two people who replied to me on
this. The solution to the problem is as follows :
the khg seems to be wrong in assuming there was a directory syscall in the C library. Instead, there is a directory
sysdeps/linux, which contains, among others, socketpair.c, which defines the function
int
socketpair(int family, int type, int protocol, int sockvec[2])
{
unsigned long args[4];
args[0] = family;
args[1] = type;
args[2] = protocol;
args[3] = (unsigned long)sockvec;
return socketcall(SYS_SOCKETPAIR, args);
}
If you look at /usr/src/linux/net/socket.c, you will find a good match with that code. The socketcall function then is
not defined by a C macro, but by an assembler macro in __socketcall.S:
SYSCALL__ (socketcall, 2)
ret
Please note that the socket system calls are special because of that level
of indirection. The wait(2) function is declared as
#ifdef __SVR4_I386_ABI_L1__
#define wait4 __wait4
#else
static inline
_syscall4(__pid_t,wait4,__pid_t,pid,__WAIT_STATUS_DEFN,status,int,options,struc
t
rusage *,ru)
#endif
__pid_t
__wait(__WAIT_STATUS_DEFN wait_stat)
{
return wait4(WAIT_ANY, wait_stat, 0, NULL);
}
(so it is actually wait(3) in Linux, with wait4(2) being the system call).
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/1/1/1.html (1 di 2) [08/03/2001 10.11.15]

------------------------
Thanks again,
Vijay
http://ldp.iol.it/LDP/khg/HyperNews/get/syscall/syscall86/1/1/1.html (2 di 2) [08/03/2001 10.11.15]


Other sources specifically about writing device drivers.
Randy Bentson recently wrote an interesting book called Inside Linux. It has some information on basic
operating system theory, some that is specifically related to Linux, and occasional parts that aren't really
related to Linux at all (such as a discussion of the Georgia Tech shell). ISBN 0-916151-89-1, published
by Specialized System Consultants
Inline Assembly with DJGPP really applies to any version of GCC on a 386, and some of it is generic
GCC inline assembly. Definitely required reading for anyone who wants to do inline assembly with
Linux and GCC.
The Annotated Bibliography mentions plenty of books out that don't have ``Linux'' in the title which may
be useful to Linux programmers. Especially if you are new to kernel programming, you may do well to
pick up one of the textbooks recommended in the bibliography.
Messages
8. Linux Pgrogrammers Guide (LPG) by Federico Lucifredi
7. TTY documentation by Michael De La Rue
1. In the queue... by Michael K. Johnson
1. TTY documentation by Eugene Kanter
2. Untitled by Yusuf Motiwala
5. The vger linux mail list archives by Drew Puch
4. German book on Linux Kernel Programming by Jochen Hein
1. English version of Linux Kernel Internals by Naoshad Eduljee
-> Book Review? by Josh Du"Bois
-> Thumbed through it by Brian J. Murrell
3. Multi-architecture support by Michael K. Johnson
1. Linux Architecture-Specific Kernel Interfaces by Drew Puch
2. Analysis of the Ext2fs structure by Michael K. Johnson
1. To add more sources of information: by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/other.html [08/03/2001 10.11.15]

Linux Pgrogrammers Guide (LPG)
Linux Pgrogrammers Guide (LPG)

Forum: Other Sources of Information
Keywords: Linux Programming
Date: Fri, 18 Apr 1997 04:58:44 GMT
From: Federico Lucifredi <lucifred@cs.bc.edu>
Does anybody know what happened to the
LPG
? it doesn't seem to have been updated beyond v 0.4 (3.95)
http://ldp.iol.it/LDP/khg/HyperNews/get/other/8.html [08/03/2001 10.11.16]

TTY documentation
Messages
1. In the queue... by Michael K. Johnson
2. Untitled by Yusuf Motiwala

In the queue...
In the queue...
Re: TTY documentation (Michael De La Rue)
Keywords: TTY TeX Documentation Kernel Device Driver
Date: Wed, 31 Jul 1996 15:47:36 GMT
The authors have agreed to have it included in the KHG.

I have a copy of the article, and when I have time to set it in HTML, it will be added.
Thanks much!
Messages
http://ldp.iol.it/LDP/khg/HyperNews/get/other/7/1.html [08/03/2001 10.11.18]

TTY documentation
TTY documentation
Re: In the queue... (Michael K. Johnson)
Date: Wed, 23 Oct 1996 16:34:05 GMT
From: Eugene Kanter <eugene.kanter@ab.com>
May I have at least plain text version of TTY document?

Thanks.
http://ldp.iol.it/LDP/khg/HyperNews/get/other/7/1/1.html [08/03/2001 10.11.18]

Untitled
Untitled
Date: Sat, 29 Mar 1997 07:24:16 GMT
From: Yusuf Motiwala <yusuf@scientist.com>
Can you please mail me tty documentation.

Regards,
Yusuf
ymotiwala@hss.hns.com

The vger linux mail list archives
The vger linux mail list archives

Keywords: mailing list
Date: Mon, 27 May 1996 17:51:08 GMT
From: Drew Puch <aapuch@eos.ncsu.edu>
Good place to see if the question you are about to ask are already answered vger mail list for linux
topics

German book on Linux Kernel Programming
German book on Linux Kernel Programming

Keywords: German book in Linux Kernel Hacking
Date: Mon, 27 May 1996 13:19:59 GMT
From: Jochen Hein <jochen.hein@informatik.tu-clausthal.de>
There is a german book on "Linux-Kernel-Programmierung"

Algorithmen und Strukturen der Version 1.2
Michael Beck, Harald Böhme, Mirko Dziadzka, Ulrich Kunitz,
Robert Magnus, Dirk Verworner
Addison-Wesley, Germany, ISBN 3-89319-939-x, DEM 79,90
Covers Release 1.2 in detail; 500 pages.
I don't know, if there's an english translation.
Messages
1. English version of Linux Kernel Internals by Naoshad Eduljee
-> Book Review? by Josh Du"Bois

English version of Linux Kernel Internals

Re: German book on Linux Kernel Programming (Jochen Hein)
Date: Sun, 02 Jun 1996 11:05:31 GMT
From: Naoshad Eduljee <naoshad@pacific.net.sg>
The English version of the german book on Linux Kernel Programming is published by Addison
Wesley. Here is the text of the mail I recieved from Addison Wesley when I enquired about the book :
"LINUX Kernal Internals" is priced at $38.68 but will not be available until early June 1996.
Ordering information:
The Book Express will gladly ship your order to any international location. Orders can be prepaid by a
valid credit card or a check drawn on a US bank. Orders are shipped to international locations via Air
Printed Matter Registered with an estimated delivery time of eight business days from our warehouse
in Indiana, USA. Charges for this service are $15.00 for the first book, $8.00 for each additional book
on the order.
You may order by mail: Addison-Wesley Book Express
One Jacob Way
Reading, MA 01867
by phone within the US: 1-800-824-7799
outside the US: 1-617-944-7273 extension 2188
or by fax: 1-617-944-7273
When ordering by fax, please include the title or book number, quantity of each book, credit card
number and expiration date, as well as the appropriate shipping address. Please do not send credit card
information via the internet; use the fax number listed above for prompt service.
If you need further ordering assistance or title information, please let us know.
Sincerely, ADDISON-WESLEY BOOK EXPRESS
Messages
1. Book Review? by Josh Du"Bois
http://ldp.iol.it/LDP/khg/HyperNews/get/other/4/1.html (1 di 2) [08/03/2001 10.11.21]

http://ldp.iol.it/LDP/khg/HyperNews/get/other/4/1.html (2 di 2) [08/03/2001 10.11.21]

Book Review?
Book Review?
Re: English version of Linux Kernel Internals (Naoshad Eduljee)
Date: Fri, 12 Jul 1996 23:50:34 GMT
From: Josh Du"Bois <duboisj@is.com>
Has anyone read the english version of this book? I'd love go get my hands on a good linux
kernel-hacking guide. If anyone has read this and has comments please post them here or email me at
duboisj@is.com. If I don't hear that it's worthless, or if it takes a while for anyone to respond, I'll try
and pick up a copy and read it myself/post a review here.
Naoshad Eduljee - thanks for the tip,
Josh.
-------------
duboisj@is.com
Messages
1. Thumbed through it by Brian J. Murrell
http://ldp.iol.it/LDP/khg/HyperNews/get/other/4/1/1.html [08/03/2001 10.11.22]

Thumbed through it
Thumbed through it
Re: English version of Linux Kernel Internals (Naoshad Eduljee)
Re: Book Review? (Josh Du"Bois)
Date: Thu, 16 Jan 1997 05:18:02 GMT
From: Brian J. Murrell <brian@ilinx.com>
I thumbed through it today at the bookstore. I was particularly interested in how a driver uses a handle
in the proc filesystem to write information to a process willing to read, like kmsg does. In my
thumbing I did not really get my answer.
The book looked decent but what really disappointed me was that despite it's being a 1996 release, it
only covers version 1.2 kernels. I now realize that this is because it was a translation from another
book. :-(
I really would like to see an updated version of this book! It would definately be on my bookshelf if it
got updated.
b.
http://ldp.iol.it/LDP/khg/HyperNews/get/other/4/1/1/1.html [08/03/2001 10.11.22]

Multi-architecture support
Multi-architecture support
Date: Thu, 23 May 1996 15:45:37 GMT
Michael Hohmuth, of TU Dresden, wrote a new document on Linux's multiple architecture support. A
PostScript version is also available.
Messages
1. Linux Architecture-Specific Kernel Interfaces by Drew Puch

Linux Architecture-Specific Kernel Interfaces
Linux Architecture-Specific Kernel Interfaces

Re: Multi-architecture support (Michael K. Johnson)
Keywords: instructions kernel header file intro
Date: Mon, 27 May 1996 17:38:34 GMT
From: Drew Puch <aapuch@eos.ncsu.edu>
Here is some kernel info by header files.

Thanks goes out to Michael Hohmuth of TU Dresden, Dept. of Computer Science, OS Group
document corresponds to Linux 1.3.78++

Analysis of the Ext2fs structure
Analysis of the Ext2fs structure

Date: Sat, 18 May 1996 00:45:42 GMT
There's not much available on filesystems yet, but Analysis of the Ext2fs structure, by
Louis-Dominique Dubeau, is worth visiting.

To add more sources of information:
To add more sources of information:

Keywords: instructions
Date: Wed, 15 May 1996 15:53:28 GMT
In order to add another source to this page, simple respond to the page and mention the source. You
can, if you like, simply type in the URL as your response--just click on the Respond button, enter the
title of the web page you are connecting to in the Title box, click on the URL radiobutton for the
format, and then type the URL into the large text window entitled Enter your response here:.
That is all that is required. Just click the Preview Response button, and then if it looks right, submit it
by clicking on the Post your Response button.
If you want to be notified of further changes made to this page, you can subscribe to it. Subscribing
makes you a member, with special privileges, and also puts you on a mailing list. Click on the
Membership item at the bottom. Members can also edit their posts if they want to make changes later.
Also, the more members there are, the more motivated I will be to maintain this new version of the
KHG... :-)
If you aren't subscribed, you should probably leave your name and email address, and possibly home
page if you have one.
At some point, the URL may be moved from the response list into the body of the article. If that
sentence didn't make sense to you, you can safely ignore it.
Thank you for your help!
michaelkjohnson

Loading shared objects - How?
Loading shared objects - How?

Forum: The Linux Kernel Hackers' Guide
Date: Wed, 12 Aug 1998 19:31:19 GMT
From: Wesley Terpstra <terpstra@unixg.ubc.ca>
Does anyone know where I could find a good document about how shared objects are bound to an ELF
executable before runtime? I would like to be able to import symbols from a .so file at runtime based
on user input and call the imported symbol (a function). I suspect gdb must do this since it loads
shared libraries for programs one debugs and allows one to call the imported functions. I hope to do
this as portably as possible. Can anyone out there recommend a document?
Thanks.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/349.html [08/03/2001 10.11.24]

How can I see the current kernel configuration?
How can I see the current kernel configuration?

Date: Sun, 09 Aug 1998 10:32:18 GMT
From: Melwin <tsmelwin@hotmail.com>
Hi all,
I need help on how to see my current kernel configuration.
Thanks Melwin

My mouse no work in X windows
My mouse no work in X windows

Re: How can I see the current kernel configuration? (Melwin)
Date: Tue, 11 Aug 1998 23:04:48 GMT
From: alfonso santana <alfonsosantana@hotmail.com>
I have a serial mouse in Com1 but I can´t move my mouse in X windows (the cursor don´t move. I
tried with mouseconfig, xf86config, XF86Setup, i killed ps of mouse, i used ls -l /dev/mouse and i got
/dev/mouse --->/dev/cua0 but not work, i tried with many protocols, but nothing. Please help me.
Thanks
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/342/1.html [08/03/2001 10.11.25]

The crash(1M) command in Linux?
The crash(1M) command in Linux?

Keywords: kernel, crash
Date: Fri, 07 Aug 1998 16:29:12 GMT
From: Dmitry <defanov@romance.iki.rssi.ru>
Hi, all!
I know, that there is the crash(1M) command in System V. Is there something like crash(1M) in
Linux?
And how to get adresses of kernel's tabeles & structuries (for instance, process table or u-area)?
Thanks!
Dmitry

Where can I gen detailed info on VM86
Where can I gen detailed info on VM86

Keywords: vm86
Date: Thu, 06 Aug 1998 14:53:02 GMT
From: Sebastien Plante <sebasp@cae.ca>
I have a project of emulating 8086 card processor. I think that I can do it under Linux by using VM86.
Where can I get enough information on VM86 to be able to use it ?

How to print floating point numbers from the kernel?
How to print floating point numbers from the

kernel?
Date: Tue, 04 Aug 1998 16:51:29 GMT
From: <pkunisetty@hotmail.com>
I want to print floating point numbers from kernel module. printk is working fine for integers but not
working for floating point numbers. Is there any otherway to print the floating point numbers? Thanks.

PS/2 Mouse Operating in Remote Mode
PS/2 Mouse Operating in Remote Mode

Date: Fri, 31 Jul 1998 20:00:10 GMT
From: Andrei Racz <andreir@worldnet.att.net>
I am experimenting with a PS/2 mouse in remote operation - which is, requesting for a
pointing packet and then waiting for it. No requests - no packets. Usually, the mouse
operates in stream mode, sending continuosly.
I started with psaux.c code; first I added a timer which would fire the callback
which in turn would send the request.
I ended in hanging the machine - with some message ... iddle task could not sleep and
then AIEE ...scheduling in interrupt.
Trying with a task queue (tq_timer/tq_scheduler) did not help either.
I have limited experience with Linux.
I would appreciate some advice on this matter.
Regards, Andrei Racz

basic module
basic module
Date: Wed, 29 Jul 1998 06:55:21 GMT
From: <vano0023@tc.umn.edu>
what's wrong with this code? It will not print out current->pid
#define MODULE
#include <linux/module.h>
extern struct task_struct *current;
int init_module() {printk("<0>Process command %s pid %i",current->comm, current->pid); return 0;}
void cleanup_module() {printk("<0>Goodbye\n"); }

How to check if the user is local?
How to check if the user is local?

Date: Mon, 27 Jul 1998 16:39:18 GMT
From: <jb@nicol.ml.org>
Access to some resources should be limited for local users only (starting Xserver, access to diskette).
I wrote program that walks through /proc/*/stat files and checks if the tty field is between 1024 and
1087. If process has pseudoterminal it checks sid, ppid, sid.. etc. If it find process that is a deamon or
has other terminal than vconsole or pseudoterm it tells that it is remote user.
Is it a save way?

Ldt & Privileges
Ldt & Privileges

Keywords: LDT Memory Privilege
Date: Fri, 24 Jul 1998 15:17:57 GMT
From: Ganesh <ganesh@cs.sunysb.edu>
Hi,
I need some help with something related to modify_ldt system call
which was added to Linux. I would greatly appreciate your help.
I am experimenting with a new protection mechanism.

I want to push a user process to privilege level 2 in Linux( by
adding a system call) . If I do this, at the second level of
protection checks in the CPU (ie. at the paging level), the user
process would map to supervisor privileges.This is because x86 maps
0,1,2 to supervisor and 3 to user privilges at the paging level(that
is what I understood from the manual. Please correct me if I am
wrong). Will the process (at PL 2) be able to write to kernel pages
since the protection check would go through at the page level?
If so, I guess I can prevent it at the segment level by adding a

check to modify_ldt code to figure out whether the process is making
a pointer to a kernel segment. Is this correct? Anyway, the process
wont be actually able to reload LDTR or change the actual LD Table
directly without a system call(sys_modify_ldt). Or is there some
roundabout way in which a process at privilge level 2 can somehow
make an entry in LDT/access the kernel pages?
Again, any help would be greatly appreciated. Thanks a lot.

skb queues
skb queues
Date: Wed, 22 Jul 1998 01:19:19 GMT
From: Rahul Singh <rahul_sg@hotmail.com>
I was trying out some stuff that deals with creating Q of the sk_buffs as they are passed from routines
in ip_output.c to dev_queue_xmit() in /net/core/dev.c . Using sk_buff_head to do the Q-ing and
timer_list to control the rate at which skbs are passed from my routine to dev_queue_xmit().
The code is able to control the rate at which skbs are passed to dev_queue_xmit() but seems to have a
few bugs.
The error msgs that I encountered are "killing of interrupt handlers" and "kfree-ing of memory
allocated not using kalloc" (when I try to have an upper bound on the Q size).
It would be great if someone could give me a clue about the possible bugs.
Thanks.

Page locking (for DMA) and process termination?

Keywords: dma, page locking, process termination
Date: Tue, 21 Jul 1998 17:40:58 GMT
From: Espen Skoglund <espensk@stud.cs.uit.no>
I have some questions concerning locking of pages belonging to a user-level process.

The scenario I have is as follows: I have implemented a driver for a PCI device (as a module). All
processes that wants to access the device will have to do an open on it. When the device-file is opened,
some of the device's memory is mapped into the user-level application. Communiction between the
application and the device either goes through this buffer, or through the in-kernel module (via ioctl).
The device is also able to initiate a DMA transfer all by itself to or from the application's memory.
To be able to do this DMA transfer I will have to pin some pages to memory, do some vitual to
physical mapping, and also some scatter-gather mechanism. I am somehow able to cope with all this.
The problem that I am concerned with however, is the case when a DMA operation is going on (or
about to be started), and the process that is the destination or source of the DMA transfer dies. What is
the best way to make sure that the pages get pinned in memory until the device driver receives a
release from the dying process? When this happens, the driver will be able to pause the termination of
the process if a dangerous DMA transfer is in progress. When the DMA transfer has finished, it may
then free the pinned pages and continue termination.
From what I've seen of the process termination code (I'm doing this in a 2.0.30 kernel), the memory
mapings are freed before the open files are released (this rules out the obvious solution). I've thought
of two other solutions:
1) Using a dummy "ghost thread" associated with the module.
All locking of user-level pages are also made sharable, and
shared with this thread. During the termination of the
process, the pinned memoryis not freed (I think) beacause
the memory is also shared with this thread.
2) Making the pages "reserved" instead of locked. From what
I've seen of the kernel code, reserved pages are not freed
by the free_pte() function. They are however freed by the
forget_pte() function which is called by the
zeromap_page_range() function --- something which is only
possible to do after doing an open on /dev/mem (right?).
So, marking pages as reserved should probably work ok. I
am however reluctant to do this since I suspect I am
abusing the whole conecpt of reserved pages.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/323.html (1 di 2) [08/03/2001 10.11.28]

Are there any other ways to accomplish what I am trying to do, or have I misinterpreted the whole
kernel-code --- overlooked an amazingly simple fact? (I guess this is a fairly easy thing to do ---
misrinterpreting the code I mean. It is afterall not what I would considered a well
documented/commented piece of software.)
eSk

SMP code
SMP code
Keywords: SMP code
Date: Mon, 20 Jul 1998 20:05:11 GMT
From: <97yadavm@scar.utoronto.ca>
Is anyone out there know a good source of explanation of the Linux SMP code?
I am writing an OS and after reading the Intel MP spec, after hearing all the problems with SMP on
Linux, I bet there is a little more to it. If anyone is an expert in this area and wouldn't mind chatting for
a bit, it'd be much appriciated.
Thanks!!!!!

Porting GC: Difficulties with pthreads
Porting GC: Difficulties with pthreads

Date: Fri, 17 Jul 1998 00:34:11 GMT
From: Talin <Talin@ACM.org>
I'm attempting to get Hans Boehm's gc to run under Linux. (GC is a conservative, incremental,
generational garbage collector for C and C++ programs). Apparently it has been ported to older
versions of Linux, but the port appears broken. Searching around the web I notice that one or two other
people have attempted to get this thing working without success. It's tantalizing because the bloody
thing almost works. One major problem has to do with pthread support. GC needs to be able to stop a
thread an examine it's stack for potential pointers, and there's no defined way in the pthreads API to do
this. On SunOS, gc uses the /proc primitives to accomplish this task, unfortunately the Linux /proc
lacks the ability to stop a process. Under Irix, it uses an evil hack -- it sends a signal to the pthread, and
has the pthread wait on a condition variable inside the signal handler! Needless to say, this method is
to be avoided if at all possible. Unfortunately, the author of gc says that this is unavoidable due to the
limitations of the pthreads API. Does anyone have any ideas for how to go about suspending a thread
and getting a copy of it's register set under Linux?

Linux for "Besta - 88"?
Linux for "Besta - 88"?

Keywords: Besta, Sapsan
Date: Mon, 13 Jul 1998 17:02:17 GMT
From: Dmitry <defanov@romance.iki.rssi.ru.>
Is there anybody, who knows about Linux for Besta-88 workstation. It is based on MVME147 board,
but has its own design.

MVME147 Linux
MVME147 Linux
Re: Linux for "Besta - 88"? (Dmitry)
Keywords: Besta, Sapsan
Date: Thu, 16 Jul 1998 11:51:12 GMT
From: Edward Tulupnikov <allin1@allin1.ml.org>
Looks like http://www.mindspring.com/~chaos/linux147/ has some info on the issue. Maybe that'll be
helpful.

/proc/locks
/proc/locks
Keywords: /proc/locks
Date: Mon, 13 Jul 1998 12:15:29 GMT
From: Marco Morandini <marc2@vedyac.aero.polimi.it>
Can you give me informations about the /proc/locks file (what it is, its format etc....)? I was not able to
find them in man pages etc...
Hoping this is the right forum.
Thanks in advance, Marco Morandini

syscall
syscall
Date: Wed, 08 Jul 1998 14:00:51 GMT
From: <ppappu@lrc.di.epfl.ch>
What is the syscall mechanism for calling system calls when

there is no library interface for the system call.
eg how do u call the
get_kernel_syms system call.

How to run a bigger kernel ?
How to run a bigger kernel ?

Keywords: kernel size
Date: Sat, 04 Jul 1998 05:13:31 GMT
From: Kyung D. Ryu <kdryu@cs.umd.edu>
I added some features on Linux 2.0.32. When I modified just a couple of lines and
rebuilt kernel, I was able to reboot and run new kernel.
So, I put a couple of functions in the source files and rebuilt it. The kernel size
got a bit bigger than last one. (kernel size: original:446281, a few chage: 446289,
more change: 446664)
It crashed when it was rebooted and attemped to uncompress the new kernel giving
message "ran out of input data...".
So, I guess the kernel size does matter. How can I make this bigger new kernel run ?
Any parameter to change ?
Thanks in advance

Linux Terminal Device Driver
Linux Terminal Device Driver

Date: Wed, 24 Jun 1998 12:39:01 GMT
From: Nils Appeldoorn <appeldoo@hio.hen.nl>
Hello,
I'm a student from Holland and have received the following assignment:
Write a paper about the Linux Terminaldriver. Explain how it handles
all the interrupts, how the datastructure looks, what the functionality
of its parts is, etcetera, etcetera.
The problem is, we can't get a clear overview of all the needed source files.
We've found /usr/src/linux/drivers/char/tty_io.c but that's probably not
the only one, and we cannot figure it out. It's a little bit fuzzy, for
starters like me.
If you can help me just a little bit, I would appreciate it! Any help,
whatsoever, is good!
Thanks anyway,
Nils Appeldoorn

Terminal DD
Terminal DD
Re: Linux Terminal Device Driver (Nils Appeldoorn)
Date: Tue, 30 Jun 1998 13:22:00 GMT
From: Doug McNash <dmcnash@computone.com>
Quickly:
serial.c - is the device driver for bare UARTS (8250-16550) others are present for various cards like
stallion, digi, specialix et.al. but you probably can't get the tech doc for those. This is the interface with
the hardware.
n_tty.c is the code for the line discipline which does the processing of the input/output stream as well
as some control function. This is the interface between the "user" and the driver.
tty_io.c and tty_ioctl.c provide some common support functions
tty.h, termio[s].h, termbits.h, ioctl.h, serial.h contain the structure definitions and defines.

DMA to user allocated buffer ?
DMA to user allocated buffer ?

Keywords: DMA user buffer physio
Date: Mon, 22 Jun 1998 17:07:54 GMT
From: Chris Read <support@active-silicon.com>
I am porting an application from Solaris and NT to Linux
I need to DMA fairly large ( >1 MByte ) data areas to a user

assigned buffer (assigned using malloc). I thus need to either
1) Lock down the pages manually in the driver and generate

a physical scatter/gather table for the DMA (I assume that
the malloc'ed pages will not be contiguous in physical memory)
2) Remap the user buffer into physically contiguous memory,

without changing the user virtual mapping (i.e. same user
virtual address)
3) Implement a Unix (Solaris) like physio routine to perform

the DMA in chunks, akin to the read entry point uoi->buf
mappings.
Any pointers or ideas ?

allocator-example in A.Rubini's book
allocator-example in A.Rubini's book

Re: DMA to user allocated buffer ? (Chris Read)
Keywords: DMA user buffer physio
Date: Tue, 07 Jul 1998 08:19:04 GMT
From: Thomas Sefzick <t.sefzick@fz-juelich.de>
Have a look into the examples from A.Rubini's 'Linux Device Drivers' book, downloadable from the
website advertised on top of the KHG page. Could the 'allocator' code in
ftp/v2.1/misc-modules/allocator.[ch] and ftp/v2.1/misc-modules/README.allocator be a solution to
your problem?

Patching problems
Patching problems
Date: Wed, 17 Jun 1998 19:27:17 GMT
From: Maryam <mshsaint@direct.ca>
Hi Every one
I currently have kernel v2.0.27 running in my computer and would like to
patch the rtlinux and do some experience on it.
So what I have done so far is I downloaded the patches from this page,
unzipped it and stored it in a disk.
While listing the files in windows I got these directory names: rtinux-0.5
and underneath this directory was kernel_patch and etc...
For mounting the files I used the following command in Linux:
mount -t msdos /dev/fd0 /mnt
But after the mounting in /mnt I had different garbled file names such as
rtlinu~1 & kernel~1. As Michael said I should use exts filesystems which
now I am using the msdos ones.
But the problem is how? Is there anyway I can convert the files from msdos
to ext2 in Linux? Which command should I use?
"Please, I am stuck here :)"
Another problem was, I tried to patch those files to the kernel and it
started asking me the question : File patch to,
which files should i mention, or is it another problem because of the
original problem?
Have a great day
Thanks in advance,
Maryam Moghaddas
Tel/ Fax : (604) 925-4683
* Plan for Today, Tomorrow comes Next

Untitled
Untitled
Re: Patching problems (Maryam)
Date: Wed, 01 Jul 1998 17:22:27 GMT
From: <welch@mcmail.com>
You need to mount a windows formatted disk with long filenames using the vfat filesystem

Ethernet Collision
Ethernet Collision
Keywords: ethernet collision packets sniffing
Date: Fri, 12 Jun 1998 14:19:27 GMT
From: jerome bonnet <bonnet@cran.esstin.u-nancy.fr>
Is there any way in the Linux network architecture that a program in the user space can get extended
network statistics from an ethernet driver for example ? I would like to have information on collisions
on the ethernet bus... If I can get collision timestamps this would be great. It is to be used to retrive
network statistics, in both terms of useful trafic (that one tcpdump does it well) and bus occupation
time (that one tcpdump and snifing device do not do it !)...
Cordialement,
Jerome bonnet PhD. Student.

Ethernet collisions
Ethernet collisions
Re: Ethernet Collision (jerome bonnet)
Keywords: ethernet collision packets sniffing
Date: Thu, 25 Jun 1998 14:34:35 GMT
From: Juha Laine <james@cs.tut.fi>
Hi.
You can see the network device specific collisions e.g. from
the proc file system (at least with kernels 2.0.34 and
2.1.105). Try yours - just 'cat /proc/net/dev' .
I do not know how you could get notices with time-stamps,

though...
Mabye there is some kind of hack/patch lying somewhere just

waiting for to be found ? :*)
Cheers !

Segmentation in Linux
Segmentation in Linux
Date: Sun, 07 Jun 1998 06:45:00 GMT
From: Andrew Sampson <sampson@wantree.com.au>
From what I know of Linux, the address space on an Intel machine is

split into 3GB for the user and 1GB for the kernel. What I would
like to know is how the operating system handles interaction between
the two address spaces. I know that a 48-bit pointer is needed, etc,
but not how its handled within the C source code of the kernel...and
I haven't been able to find anything on it in kernel sources.
If anyone could tell me how this is done I would be very happy.
Thanks,
Andrew Sampson (aspiring programmer)

How can the kernel copy directly data from one process to another process?
How can the kernel copy directly data from one

process to another process?
Keywords: direct copy, user space, kernel space
Date: Sat, 06 Jun 1998 20:57:05 GMT
From: Jürgen Zeller <zeller@rupert.franken.de>
Hi,
in my first kernel-related project, i want to copy data in the kernel from one process to another.
I have the start point/size of the user-level buffers, but i found no way to do a _direct_ copy.
The copy takes place in a write() call of a character device driver, the source/size is the write
buffer/size, the destinition is a other process, currently blocking in its read method.
Of course, i could do a kmalloc, copy to kernel, wake up the read, copy from the kmalloc'd area, kfree
the area, but ... that is too much overhead.
Any hints?
Bye,
Jürgen

Use the /Proc file system
Use the /Proc file system

Re: How can the kernel copy directly data from one process to another process? (Jürgen Zeller)
Keywords: direct copy, user space, kernel space
Date: Mon, 22 Jun 1998 16:15:21 GMT
From: <marty@twsu.campus.mci.net>
The /proc file system uses a directory for each pid. Do a man 5 proc on linux machine to find out
more? Bye, Marty.

Remapping Memory Buffer using vmalloc/vma_nopage
Remapping Memory Buffer using

vmalloc/vma_nopage
Keywords: mmap, vmalloc, DMA, and Interrupts
Date: Wed, 03 Jun 1998 16:43:15 GMT
From: Brian W. Taylor <taylor@lowell.edu>
Hello All!
Here is an interesting/odd problem that has arisen while trying to setup a large buffer of memory
allocated by a kernel driver to be remapped into user space. The driver is for a CCD camera that is
DMA and Interrupt driven system and I am able to get good consistant images using "memcpy_tofs()".
What I would like to do is to have a large buffer that can be remapped to user space so that the data
can be transferred via the network while the CCD is reading out. The camera DMAs a line at a
time(1712 bytes) to a kmalloc'ed buffer of 2048 bytes and is copied into the remappable buffer when
the problem occurs. Using two different methods I have come up with some really strange results.
The Problem:
When I readout a full frame(~1.3MB of integers), if the data is realatively uniform there is no problem.
But if the data is not uniform some of the lines will transfer fine but most will end up with zeros filling
up some or all of the values in the line. This will happen no matter how many lines are readout at a
time.
The Method:
I am using the 2.0.33 kernel, initially with Matt Welsh's bigphysarea and recently using vmalloc and
the example of remapping virtual memory example in Alessandro Rubini's "Linux Device Drivers".
From what I have been able to determine the values are good until the copy from the DMA buffer into
the remapped buffer.
I am also locking the application memory using the driver and using SCHED_FIFO for priority
scheduling. The driver functions very well until I start trying to use the remapped memory.
Any Ideas????
Thanks In Advance,
Brian W. Taylor

Fixed.... strncpy to blame
Fixed.... strncpy to blame

Re: Remapping Memory Buffer using vmalloc/vma_nopage (Brian W. Taylor)
Keywords: mmap, vmalloc, DMA, and Interrupts
Date: Thu, 11 Jun 1998 20:42:17 GMT
From: Brian W. Taylor <taylor@lowell.edu>
Hello All,
Well, I was able to solve the problem by setting a loop to transfer the data byte by byte. I had been
using "strncpy(map_buffer, dma_buffer, dma_count)" to transfer the data.
Now why?
Why was this a problem and why did it behave so strangly? When the data was fairly uniform there
were no problems. But, when the was a discouniuity in the date the transfer would latch the rest of the
dma transfer data to 0?? Very Odd.
strncpy is defined in the kernel and should be usable with vmalloc if anything else.....
Any Ideas??? I would really like to understand the mechanism inside the kernel that caused this
problem.
Thanks
Brian

Does memory area assigned by "vmalloc()" get swapped to disk?
Does memory area assigned by "vmalloc()" get

swapped to disk?
Keywords: disk swapping
Date: Sun, 31 May 1998 22:27:59 GMT
From: Saurabh Desai <sauru@juno.com>
Hi Friends !!
I am developing an Informed Prefetching and Caching system as part of the kernel in which it is very
important that the memory assigned to hold the data does not get swapped to disk. Without any
thought to swapping I used vmalloc() to assign memory to the data structures of the Informed
prefetching system. I then realized that the final prefetching system was inefficient in terms of speed
when the space assigned to the data structures using vmalloc() was in excess of 10KB. I later realized
that most probably the pages assigned using vmalloc() were getting swapped to the disk and when I
referenced them, they were brought from the disk rather than memory.
Is there any way to make sure that the assigned pages remain in memory until I release them?
I would appreciate any help in this matter.
Thanx Saurabh Desai
Messages
1. Lock the pages in memory by balaji@ittc.ukans.edu
-> How about assigning a fixed size array...does it get swapped too? by saurabh desai

Lock the pages in memory
Lock the pages in memory

Re: Does memory area assigned by "vmalloc()" get swapped to disk? (Saurabh Desai)
Date: Mon, 01 Jun 1998 17:11:06 GMT
From: <balaji@ittc.ukans.edu>
Hi, To make sure a set of pages dont get swapped out to disk just lock them in. Look at the
implementation of mlock to find out how thats done...(If you are smart enough you can call the
sys_mlock within the kernel) balaji PS: This may not be what you wanted bcos of certain
constraints...if so let me know balaji
Messages
1. How about assigning a fixed size array...does it get swapped too? by saurabh desai

How about assigning a fixed size array...does it get swapped too?
How about assigning a fixed size array...does it

get swapped too?
Re: Does memory area assigned by "vmalloc()" get swapped to disk? (Saurabh Desai)
Re: Lock the pages in memory
Date: Thu, 04 Jun 1998 15:27:38 GMT
From: saurabh desai <sauru@juno.com>
I am sure mlock() works fine..and I will also use sys_mlock() in my kernel code if I have to but the
problem with it is that my kernel code becomes part of any user process as it is a Prefetching system
and mlock() as seen from its implementation checks for the super user privileges suser(). So probably
a user process trying to prefetch is going to be deprived of this request.
I was thinking if the fixed size array (e.g. buffer[100]) gets swapped to disk. I am sure not but just
wanted to make sure. For my prefetching system I don't need a whole lot of memory. probably about
2-3K max. Hence I can afford to assign a static array.
thanx pal.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/283/1/1.html [08/03/2001 10.11.36]

Creative Lab's DVD Encore
Creative Lab's DVD Encore

Keywords: DVD Encore drivers creativelabs Linux
Date: Tue, 26 May 1998 22:01:02 GMT
From: Brandon <kool@goodnet.com>
Hi. I'm totally new to Linux. A friend of mine told me how interesting and engrossing it was so I
grabbed it off the net and installed it on my system. One problem... I have Creative Lab's DVD Encore
and I haven't found any drivers for it to play movies in X Windows. Does anyone have these drivers?
If so, please let me know and I'll be happy to receive them. If not, I have a friend who's willing to write
the drivers. So in any case, please let me know.
Brandon (kool@goodnet.com)

TCP sliding window
TCP sliding window

Keywords: TCP, sliding window
Date: Fri, 15 May 1998 07:26:07 GMT
From: Olivier <aussage@imag.fr>
hello,
I would like to make a trace of the size of the TCP sliding window during a connexion (to see how the
size change), but I can't find where in the kernel this variable is. Please, do you have any clue on how I
can print the size of the sliding window ?
Many thanks in advance for your help,
Olivier.

Packets and default route versus direct route

Date: Thu, 14 May 1998 12:28:23 GMT
From: Steve Resnick <steve@ducksfeet.com>
Hi
I have a machine with two ethernet cards, two class C networks and roughly 50 IP aliases on various devices.
The two class C networks are distinctly different, i.e., the MSB of the network address is different, and both
use a 24 bit netmask.
for the sake of argument:
eth0 is setup on net 1: 192.168.98.0 eth1 is setup on net 2 192.168.99.0
The default route is to our router on eth0, the address is 192.168.98.10
So far, so good.
We bill our customers based on traffic usage and I wrote a libpcap based package to track network usage and
calculate aggregates for 5 minute periods and flush this data to disk. I originally wrote this on a Sun machine
running solaris 2.5.1.
This worked rather well and I was able to account for all traffic by walking through the ethernet and tcp/ip
headers to find the data size.
I rewrote this package to run under 2.0.33 and now I have an odd problem: Packets sent to a particular address
all use the same address on the return path.
If, from a different machine on our network at 192.168.98.15, I ping, with record route, to an address on the
machine in question, I see:
PING 192.168.98.42 (192.168.98.42): 56 data bytes
64 bytes from 192.168.98.42: icmp_seq=0 ttl=64 time=1.3 ms
RR: 192.168.98.15
192.168.98.42
192.168.98.10
192.168.98.15
And if I traceroute from the machine in question to another machine on our local network, and that address is
on net 2, it still goes out over net 1:
traceroute -n 192.168.99.36
traceroute: Warning: Multiple interfaces found; using 192.168.98.10 @ eth0
traceroute to 192.168.99.36 (192.168.99.36), 30 hops max, 40 byte packets
1 192.168.99.36 0.704 ms 0.606 ms 0.604 ms
So, the problem here is that I cannot track the traffic generated by by a particular website, since the source

address of all outbound traffic is not the address of the website, but rather the primary address on eth0
(192.168.98.10)
I have checked /proc/sys/net/ipv4/ip_forwarding and the value is 0, so I am assuming ip_forwarding is turned
off.
Is there a way to make this work properly, that is to say, if I request data from a particular address the address
used on the sending is correct as well?
What else am I missing or should I be looking for?
TIA, Steve

IPv6 description - QoS Implementation - 2 IP Queues
IPv6 description - QoS Implementation - 2 IP

Queues
Keywords: IPv6 Networking Kernel
Date: Thu, 07 May 1998 17:16:31 GMT
From: <wehrle>
Hello, I want to implement a QoS support for IPv6. It is a proposal, which is in discussion at the IETF.
For the implementation, I have to split the sending queue of IPv6 Packets into two queues. And also, I
have to implement a packet classifier. I am sitting now 1 week over the 2.1.98 kernel and searching for
the way, a IPv6 packet would take through the kernel of a linux software router. I do not know, where
the packets are inqueued into the IPv6 Sending Queue, and where they were dequeued by the devices.
I need to know this, because I have to change the behaviour of this functions. Can anybody help me ?
Does anybody know a (very good) description of the ipv6 implementation, their data structures, ....
Many thanks
Ciao Klaus

See the kernel IPv4 implementation documentation
See the kernel IPv4 implementation

documentation
Re: IPv6 description - QoS Implementation - 2 IP Queues
Keywords: IPv6 Networking Kernel
Date: Thu, 25 Jun 1998 14:03:43 GMT
From: Juha Laine <james@cs.tut.fi>
Hi.
If you haven't read the "The Linux Kernel" guide networking

sections, they might give you a hint.
They are NOT IPv6 specific, but I assume that the implementation
is not so different from the IPv4. The data structures will
differ but I think that operations are quite similar...
"The Linux Kernel" guide is located on the LDP homepages.

[ http://sunsite.unc.edu/mdw/ ]
Hope this helps.

writing to user file directly from kernel space, How can it be done?
writing to user file directly from kernel space,

How can it be done?
Keywords: driver write memory file help
Date: Wed, 06 May 1998 08:54:05 GMT
From: Johan <d3sunden@dtek.chalmers.se>
Hello, I am collecting data at a high rate into a circular buffer residing in my driver. I want this data to
be written to disk or network. To avoid unnecessary copying from kernel space to user space to kernel
space again I want a application to open a file or socket and then do a ioctl to my driver which I then
want to start performing write's from the buffer to the file/socket using the file operation write in the
file block of the opened file. Unfortenaly for me I get a page fault when I try f_op->write(...). My
buffer is in the kernel whith a kernel address but the write wants a virtual address (I guess).
How can I get around this? Other ideas?
I use Linux 2.0.33
Thanks in advance/ Johan

how can i increase the number of processes running?
how can i increase the number of processes

running?
Keywords: proc
Date: Wed, 06 May 1998 08:30:39 GMT
From: ElmerFudd <elfuddo@hotmail.com>
i can't seem to get more than 240 processes running (programs seg fault and nasty stuff like that) i
have been looking into it and i think it must have something to do with a limit on /proc? (i'm running
2.1.33 and libc 5.4.17)
Please email me at elfuddo@hotmail.com if you are able to shed some light on this problem, thanks

How do I change the amount of time a process is allowed before it is pre-empted?
How do I change the amount of time a process is

allowed before it is pre-empted?
Date: Tue, 21 Apr 1998 14:20:10 GMT
From: <Escher@dn101aw.cse.eng.auburn.edu>
I need help badly....I need to know how to change this field, where it is located, etc.

Network device stops after a while
Network device stops after a while

Keywords: network device, ioctl
Date: Tue, 21 Apr 1998 13:00:25 GMT
From: Andrew Ordin <ordin@dialup.ptt.ru>
I have a strange problem with a network device driver i have written. The device runs under IP for a
while and then kernel stops passing ioctl()s down to the driver. The ioctl() is used by a process to
communicate with the device.
Anybody knows where the dog is buried?

Untitled
Untitled
Re: Network device stops after a while (Andrew Ordin)
Keywords: network device, ioctl
Date: Sat, 25 Apr 1998 13:21:35 GMT
From: Andrew <waddington@usa.net>
Precisely which dog is this???

Does MMAP work with Redhat 4.2?
Does MMAP work with Redhat 4.2?

Keywords: MMAP memory error
Date: Tue, 21 Apr 1998 10:54:04 GMT
From: Guy <theguy@ionet.net>
Hi,
I'm getting an error with MMAP on Redhat 4.2... The same code worked just fine when compiled on a
Slakware 1.2.13 system...
The code actually compiles just fine, but I get a runtime error which says, "Segmentation Fault (Core
Dumped)" when I try to access the memory -- I later found that MMAP was returning a "EINVAL"
error...
Does the MMAP function work on Redhat 4.2?
What are the differences between Redhat 4.2 and Slakware 1.2.13 with respect to the MMAP
function?
I've verified that my start and length values are appropriate (I even tried setting my start value to zero,
but that didn't seem to work). How do I find a PAGESIZE boundary? How do I "allign" this?
thanks for your help,
Guy
(theguy@ionet.net)
Messages
1. Yes, it works just fine. by Michael K. Johnson
-> It Works! Thanks! by Guy

Yes, it works just fine.
Yes, it works just fine.

Re: Does MMAP work with Redhat 4.2? (Guy)
Date: Tue, 21 Apr 1998 12:12:15 GMT
mmap() works fine on a Red Hat 4.2 system, which ships a perfectly standard Linux kernel. A great
deal of the source code included in the distribution uses mmap() explicitly, and every dynamically
loaded program uses mmap() implicitly in the dynamic loader.
I suggest "man mmap" as a start. Notice the reference at the bottom of the page to getpagesize(), and
consider the modulus operator (%). That should help you out... :-)
You'll also find sample code that uses mmap() at
http://www.redhat.com/~johnsonm/lad/src/map-cat.c.html Other sample source you might find useful
is at http://www.redhat.com/~johnsonm/lad/src/
Messages
2. It Works! Thanks! by Guy

It Works! Thanks!
It Works! Thanks!
Re: Yes, it works just fine. (Michael K. Johnson)
Date: Tue, 21 Apr 1998 14:13:58 GMT
From: Guy <theguy@ionet.net>
Thanks!

What about mprotect?

Re: Yes, it works just fine. (Michael K. Johnson)
Date: Fri, 17 Jul 1998 04:48:53 GMT
From: Sengan Baring-Gould <unknown>
If I use the example given by man mprotect on Redhat 4.2, the
program does not run:
./mprotect
Couldn't mprotect: Invalid argument
Why?
Thanks,
Sengan
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
int
main(void)
{
char *p;
char c;
/* Allocate a buffer; it will have the default

protection of PROT_READ|PROT_WRITE. */
p = malloc(1024);
if (!p) {
perror("Couldn't malloc(1024)");
exit(errno);
}
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/259/1/3.html (1 di 2) [08/03/2001 10.11.47]

c = p[666]; /* Read; ok */
p[666] = 42; /* Write; ok */
/* Mark the buffer read-only. */

if (mprotect(p, 1024, PROT_READ)) {
perror("Couldn't mprotect");
exit(errno);
}
c = p[666]; /* Read; ok */
p[666] = 42; /* Write; program dies on SIGSEGV */
exit(0);
}

multitasking
multitasking
Keywords: multitasking
Date: Fri, 17 Apr 1998 17:01:10 GMT
From: Dennis J Perkins <dperkins@btigate.com>
I'm starting to learn about how the kernel works. Does Linux use cooperative or preemptive
multitasking? I know that the scheduler is called when returning from a system call, but does this mean
that Linux uses cooperative multitasking, since a timer interrupt does not force a context xwitch?

Answer
Answer
Re: multitasking (Dennis J Perkins)
Date: Mon, 20 Apr 1998 15:32:10 GMT
From: David Welch <welch@mcmail.com>
Both. User level processes can preempted in user mode but code running in kernel mode (either by
using a system call or by a dedicated kernel thread) is only preempted when it chooses to give up
control

multitasking
multitasking
Re: Answer (David Welch)
Date: Tue, 21 Apr 1998 17:41:27 GMT
From: Dennis J Perkins <dperkins@btigate.com>
So, preemptive multitasking does not mean that a process stops running, that is, it is no longer current,
as soon as do_timer decrements its priority to the point where it no longer has the highest priority? It
continues running until it returns from a system call?

answer
answer
Re: Answer (David Welch)
Date: Tue, 28 Apr 1998 12:59:48 GMT
Not quite. If a process is executing a system call and interrupts are disabled then it will not be
preempted (because the timer interrupt can't happen). Since interrupts are automatically disabled on
entry to a system call and are only reenabled when the process will have to wait for a long time, most
of time a process won't be preempted inside the kernel. However it can be preempted at any time when
running in user mode.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/256/1/1/1.html [08/03/2001 10.11.50]

linux on sparc
linux on sparc
Keywords: sparc kernel linux
Date: Mon, 06 Apr 1998 10:52:01 GMT
From: darrin hodges <darrin.hodges@qmi.com.au>
has anybody been able to build a recent (2.0.33) kernel on a sparc, it seems that many of the defines
are missing, eg GFP_IO is defined in the i386 & alpha include tree, yet not in the sparc tree. I cant
seem to find much about linux-sparc, im familiar enough with i386 assembly and arch, but not sparc
arch. if anybody has any clues, share with me? :)

How to call a function in user space from inside the kernel ?
How to call a function in user space from inside

the kernel ?
Date: Thu, 02 Apr 1998 16:03:21 GMT
From: Ronald Tonn <tonn@infotech.tu-chemnitz.de>
Is there a way to call a function inside an application from a device driver (in kernel mode)?
I've come across this problem while working on an my ATM network driver. When opening a channel
the ATM application calls the driver with a pointer to a receive function that should be called
whenever data is received on that channel.
If I use this pointer (which points into user space) inside the driver as shown below the whole systems
crashes appearently because I'm trying to execute a user program in kernel mode.
...
receive_ptr(buffer, buffer_size);
...
Does anyone know how to fix this problem?
Thanks,
Ronald

How to call a user routine from kernel mode
How to call a user routine from kernel mode

Re: How to call a function in user space from inside the kernel ? (Ronald Tonn)
Date: Thu, 09 Apr 1998 21:11:48 GMT
There are several possible solutions

i) On kernels before 2.1.xx the kernel uses a different code segment to user programs. To call user
space would require a far call. The functions get_user and put_user similarly use a segment override to
access user space. Calls are far more complicated because (for protection) the processor won't allow a
user routine to return to a more priveleged code segment. The best way round this would be to use a
similar method to signal handlers i.e. setup a routine on the stack which on return from the signal
handler executes a system call which returns to kernel mode. See arch/i386/kernel/signal.c for more
information.
The problem is further complicated because on the i386 executing a system call will restore the default
kernel stack. The vm86 routines allocate a new kernel stack so when a GPF happens in vm86 mode
they can restore the original stack and return to the caller process. You may be able to use a similar
method.
ii) On kernels after 2.1.xx the kernel segment has the same base as user space. It should be possible to
directly call a user routine and have it return normally. However the user routine will have normal
kernel priveleges (a big security hole!).
iii) Use kerneld. If the desired routine can be written as a seperate program then kerneld can be called
from kernel mode to execute it. An example is calling the request-routine script when a network
connection is attempted.
iv) If the calling process is multithreaded then you should be able to use the ipc semaphores from
kernel mode to signal to another thread to execute the routine.

Can I map kernel (device driver) memory into user space ?
Can I map kernel (device driver) memory into

user space ?
Date: Thu, 02 Apr 1998 15:28:36 GMT
From: Ronald Tonn <tonn@infotech.tu-chemnitz.de>
I'm trying to find out a way to map memory that has been allocated by a device driver (of course in
kernel mode) into user space so that applications can have access to it.
This issue has come up while writing an ATM network driver. If send or receive buffers (allocated by
the driver) could be passed directly to an application that makes use of the driver, data could be passed
with a single-copy operation. e.g the application requests a send buffer from the driver, copies the data
into it and then forces the driver to transmit the buffer.
Right now I'm using copy_from(to)_user in order to get the data from a user allocated buffer into the
driver allocated send buffer and vice versa.
Any help would be much appreciated.
Thanks,
Ronald

driver x_open,x_release work, x_ioctl,x_write don't
driver x_open,x_release work, x_ioctl,x_write don't

Date: Tue, 31 Mar 1998 19:41:58 GMT
From: Carl Schwartz <schwcarl@e-z.net>
Using RedHat 5.0 and following KHG I performed the following steps in developing a
driver 'x':
1) created x.o with gcc -O -DMODULE -D__KERNEL__
2) created /dev/x crw-r--r-- 126 0
3) insmod ./x.o
4) lsmod listed it but ksyms did not
5) user root: fildef = open('/dev.x',O_RDWR); (fildef = 17)
6) user root:ioctl(fildef,_IOW(126,1,sizeof(daud)),(int)&daud)
returns -1 as well as does all other ioctl's and write's
I try from user app and do not print "printk's".
7) rmmod removes it OK. It seems that 'open' and 'release'
are the only functions that work (perform as expected and
"printk's" work).
I copied device file_operations, ioctl and write parameter lists from KHG, basically replacing 'foo' with 'x'.
I copied 'x.o' to /lib/modules/2.0.31/char and added 'alias char-major-126 x' to conf.module. Depmod -a does not add it to
modules.dep and Modprobe doesn't know that 'x' exists.
Messages
1. Depmod Unresolved symbols? by Carl Schwartz

Depmod Unresolved symbols?
Depmod Unresolved symbols?

Re: driver x_open,x_release work, x_ioctl,x_write don't (Carl Schwartz)
Date: Tue, 31 Mar 1998 22:50:12 GMT
From: Carl Schwartz <schwcarl@e-z.net>
Sorry, I forgot to click '?' on original message. One of my original problems that I just discovered is
that I had put 'x.o' under /lib/modules/2.0.31/ 'char' instead of 'misc'. Relating 'modules' to 'drivers', I
guess I had created the 'char' subdirectory which Modprobe apparently does not search. My remaining
problem seems to be that Depmod -a does not resolve the symbols, but does go ahead and put 'x' in the
modules.dep. I need to determine what, if not Depmod, updates the symbol table. The symbols
returned by depmod -e are all found in ksyms and some of them work just fine in the 'cleanup', 'init'
and 'open' functions which work OK. I haven't found the *.c source to help in figuring out how things
fit together for the module utilities, just the *.o files.

How to sleep for x jiffies?
How to sleep for x jiffies?

Date: Thu, 26 Mar 1998 17:02:51 GMT
From: Trent Piepho <xyzzy@u.washington.edu>
I'd like to have a delay in a device driver of X jiffies, where X = 1 on intel and 8 on alpha. What's a
clean and friendly way to do this? udelay(8333)?

Use add_timer/del_timer (in kernel/sched.c)
Use add_timer/del_timer (in kernel/sched.c)

Re: How to sleep for x jiffies? (Trent Piepho)
Date: Tue, 23 Jun 1998 22:27:06 GMT
From: Amos Shapira <amos-khg@gezernet.co.il>
At least for 2.1.xx, use the *_timer function family as defined in kernel/sched.c.
The "Linux Internals" book (not quite recommanded, better than nothing from a quick glance)
mentioens only these functions, but I see other functions with interesting names:
mod_timer which seems to change an existing timer in-place
detach_timer which seems to do what del_timer does (actually it's used by del_timer) without clearing
the timer_list 'next' and 'prev'.
There are a few interesting comments in include/linux/timer.h which describe a few more functions.
One thing I learned from the book is that when you call add_timer, you specify the absolut time in
jiffies (i.e. for one sec say 'jiffies + (1*HZ)')

Adding code to the Linux Kernel
Adding code to the Linux Kernel

Date: Wed, 25 Mar 1998 18:04:48 GMT
From: <barry@skynet.csn.ul.ie>
Hi all,
I am currently attempting my first kernel hack. I am adding code into the IP
layer to randomly drop packets.
I have two questions:
1) I want to be able to include a header file called random.h (usr/src/linux/include/net). This is so that I can generate random
numbers. When I include this in my own C file, parse errors appear everywhere. Is this the right place to include this header
file?
2) I would eventually like to get a random number generated based on seeding it from the current system time. This is so
that I can get a new random number on each call to ip_input. Is there a way around not being able to include standard lib
files in the kernel ( ie time.h, stdlib.h etc)?
Also it there any documentation on adding such modules to the kernel. It would greatly simplify my task.
Thanks
Patrick

/dev/random
/dev/random
Re: Adding code to the Linux Kernel
Date: Sun, 29 Mar 1998 11:20:52 GMT
From: Simon Green <sgreen@emunet.com.au>
Recent kernels (at least 2.0.31 and later) include a device file called /dev/random (no. 1,8). You can
read from this device file to get as many random bytes as you need.
Just read in bytes a pair/quad at a time and typecast to int.
This way, you don't need to keep reseeding the generator.
Simon

MSG_WAITALL flag
MSG_WAITALL flag
Date: Sun, 22 Mar 1998 16:38:16 GMT
From: Leonard Mosescu <lemo@lmct.sfos.ro>
Why Linux doesn't support MSG_WAITALL flag in the recv() call ?

possible bug in ipc/msg.c
possible bug in ipc/msg.c

Keywords: ipc bug msg
Date: Sat, 21 Mar 1998 04:57:44 GMT
From: Michael Adda <m_photon@usa.net>
hi
first, i hope that this is the right place, :-> ,
since i an not sure about the 'finding' ...
i need an advice. i am currently reading the kernel's code
systematiclay, and i believe i stumbled into a bug in ipc/msg.c
lines 326,329. i am talking about kernel 2.0.30-2.0.33 ( which i am
working with ) and not about the development kernels... please read
the relevent code ...
since we are no longer ( between this lines ) in atomic operations,
someone can suspend are in say line 326, recieve the current message
( the one we have nmsg as pointer to ) and leave us with pointer to
garbage...
i belive that we should put lines 326-329 in cli/restoreflags() pair
after checking that the message is valid via the pointer flag ( not
IPC_UNUSED/IPC_NOOID ).
i hope that i am not bothering you for nothing...
i have a possible patch.
thank you for your time

Michael ( m_photon@usa.net )

scheduler Question
scheduler Question
Keywords: scheduler sleep priority
Date: Sat, 07 Mar 1998 12:12:06 GMT
From: Arne Spetzler <arne@smooth.netzservice.de>
Hi!
For my diploma i'am modifying the Linux Kernel to support ACL's. At some places the Kernel has to
sleep on interal structures and after wake up the process has to run as soon as possible to minimize
delay (e.g. the acl-header-cache). As i know the traditional approach (Maurice J. Bach: The Design of
the UNIX Operating System) is to set the process to a fixed (and high) "sleep priority" on which the
process will run after wake up.
But i couldn't find the related code in the Linux kernel (2.0.33).
Does anybody know how Linux doing this??
thanks for every help
Arne
Messages
1. Untitled by Ovsov

Untitled
Untitled
Re: scheduler Question (Arne Spetzler)
Date: Sun, 08 Mar 1998 00:46:24 GMT
From: Ovsov <ovsov@bveb.belpak.minsk.by>
Actually Linux's wake_up_process does nothing but changing a process' state from
TASK_(UN)INTERRUPTIBLE to TASK_RUNNING and then it may be chosen for running the next
time schedule() is invoked depending on its priority (counter-parameter in task_struct for the processes
scheduled under OTHER-policy and rt_proirity-parameter for real-time processes scheduled under
FIFO or RobinRound (RR)).
Best regards. Kostya.

thanks
thanks
Re: scheduler Question (Arne Spetzler)
Re: Untitled (Ovsov)
Date: Sat, 16 May 1998 06:48:05 GMT
From: arne spetzler <unknown>
Hmmm. Thats exactly what i found.
So i think there is no chance to give my proccesse a higher priority in order to run quicker :-(
arne

File Descriptor Passing?
File Descriptor Passing?

Date: Fri, 27 Feb 1998 06:58:32 GMT
From: The Llamatron <mc2@geocities.com>
Quick questions: Has file descriptor passing been implemented yet, which kernel version to use, and
which method (SYSV or Berkeley)?
Thanks bunches and bunches!

Linux SMP Scheduling
Linux SMP Scheduling

Date: Thu, 26 Feb 1998 06:50:08 GMT
From: Angela <okchan@cs.hku.hk>
Dear sir, i am a student working on a project about Linux SMP scheduling.

i would be very grateful if you could kindly help me answering the following questions:
From the source code:"sched.c" - what is the constant "NO_PROC_ID" used for? and how to
determine its value?
From the source code: "fork.c" - what is the use of the integer "lock_depth" and what are the meaning
of the values it can have(eg. 1)?
Regards, Angela

Finding definitions in the source
Finding definitions in the source

Re: Linux SMP Scheduling (Angela)
Date: Mon, 16 Mar 1998 16:34:54 GMT
From: Felix Rauch <rauch@inf.ethz.ch>
The definition of NO_PROC_ID can be fount by, e.g., the following command:
find . -name "*.[ch]" -print | xargs grep NO_PROC_ID | grep #define
(assuming you are in the Linux-source-directory).
This will show that NO_PROC_ID is defined as follows:
#define NO_PROC_ID 0xFF

Re: Linux SMP Scheduling
Re: Linux SMP Scheduling

Re: Linux SMP Scheduling (Angela)
Keywords: SMP Linux
Date: Fri, 27 Feb 1998 13:35:31 GMT
From: Franky <genie@ucsd.com>
Hello:
It seems to me that NO_PROC_ID is just an indicator that no processes is scheduled

for execution on this processor(i.e. it is free to be used for running next task).
As of lock_depth - I did not find anything usefull about it - I am not so good at

Linux SMP kernel features, sorry :)))

Difference between ELF and kernel file
Difference between ELF and kernel file

Keywords: ELF binary kernel gcc output
Date: Tue, 17 Feb 1998 22:43:54 GMT
From: Thomas Prokosch <Thomas.Prokosch@jk.uni-linz.ac.at>
I would like to know the difference between a normal ELF executable and a kernel image. Of course I
know that the ELF file format contains pointers and different sections and a kernel file should be
rather a binary image only, but how are both related to each other? Both, ELF and kernel image, are
output by gcc, how do I convert an ELF binary to a kernel image?

How kernel communicates with outside when it's started?
How kernel communicates with outside when it's

started?
Keywords: kernel,file system,
Date: Mon, 16 Feb 1998 23:03:17 GMT
From: Xuan Cao <cao@cobalt.chem.uidaho.edu>
I am writing some kernel modules that is supposed to executed after the init process has been started.
The purpose of these modules is to write some kernel data out to a normal file. But I don't know what
kinds of functions are available for those modules. Some functions such as open(), read(), write()
which worked before init process don't work in those kernel modules after init process is started. I
guess the main problem might be the kernel don't have control over the system after init process is
started. But how can I get some of the data out of kernel to a file by using the kernel modules?
The problem is can or how I use some functions (file operations, IPC, etc) in a kernel module which is
only executed after the system has started and is running normally.
Thanks in advance!

Printing to the kernel log
Printing to the kernel log

Re: How kernel communicates with outside when it's started? (Xuan Cao)
Keywords: kernel,file system,
Date: Tue, 24 Feb 1998 20:59:16 GMT
From: Thomas Prokosch <Thomas.Prokosch@jk.uni-linz.ac.at>
During my recent studies of the Linux kernel I came across a function "fprintk" which prints kernel
error messages in the approprate logfile. Perhaps you could use this?

The way from a kernel hackers' idea to an "official" kernel?
The way from a kernel hackers' idea to an

"official" kernel?
Keywords: generic driver major number
Date: Fri, 13 Feb 1998 12:51:42 GMT
From: Roger Schreiter <schreite@helena.physik.uni-stuttgart.de>
We (see below) want to write a generic MLC driver (MLC stands for multiple logical channel). MLC
is perhaps not as important as SCSI, but has a similar structure - HP thinks about making it an IEEE
standard.
Is it the right way for us to discuss our ideas about how to build the new driver in the appropriate KHG
list? Where do we have to send the code when ready, for it will be part of future stable kernels? Who
decides, which major numbers and so on will be assigned to our new type of hardware driver?
Perhaps some KHG readers have already discovered on the Linux Parallel Port Home Page
(http://www.torque.net/linux-pp.html) the new link to the "Linux driver for the HP officejet" project
(http://www.ifs.physik.uni-stuttgart.de/Personal/RSchreiter/hpoj/). The HP officejet communicates via
MLC. We think further MLC capable devices will follow, so it will be useful to write a generic MLC
driver.
Messages
1. linux-kernel@vger.rutgers.edu by Michael K. Johnson

linux-kernel@vger.rutgers.edu
linux-kernel@vger.rutgers.edu
Re: The way from a kernel hackers' idea to an "official" kernel? (Roger Schreiter)
Keywords: generic driver major number
Date: Fri, 13 Feb 1998 16:41:17 GMT
You need to subscribe to linux-kernel@vger.rutgers.edu (send email with BODY, not subject, of
subscribe linux-kernel to majordomo@vger.rutgers.edu.
Discuss your ideas there.
Lastly, see Writing a parport Device Driver.

Writing a parport Device Driver

What parport does
One purpose of parport is to allow multiple device drivers to use the same parallel port. It does this by sitting in-between the
port hardware and the parallel port device drivers. When a driver wants to talk to its parallel port device, it calls a function to
"claim" the port, and "releases" the port when it is done.
Another thing that parport does is provide a layer of abstraction from the hardware, so that device drivers can be
architecture-independent in that they don't need to know which style of parallel port they are using (those currently supported
are PC-style, Archimedes, and Sun Ultra/AX architecture).
Interface to parport
Finding a port
To obtain a pointer to a linked list of parport structures, use the parport_enumerate function. This returns a pointer to a
struct parport, in which the member next points to the next one in the list, or is NULL at the end of the list.
This structure looks like (from linux/include/linux/parport.h):
/* A parallel port */
struct parport {
unsigned long base; /* base address */
unsigned int size; /* IO extent */
char *name;
int irq; /* interrupt (or -1 for none) */
int dma;
unsigned int modes;
struct pardevice *devices;

struct pardevice *cad; /* port owner */
struct pardevice *lurker;
struct parport *next;

unsigned int flags;
struct parport_dir pdir;

struct parport_device_info probe_info;
struct parport_operations *ops;

void *private_data; /* for lowlevel driver */
};
Device registration
The next thing to do is to register a device on each port that you want to use. This is done with the
parport_register_device function, which returns a pointer to a struct pardevice, which you will need in
order to use the port.
This structure looks like (again, from linux/include/linux/parport.h):
http://ldp.iol.it/LDP/khg/HyperNews/get/devices/parport.html (1 di 5) [08/03/2001 10.12.04]

/* A parallel port device */

struct pardevice {
char *name;
struct parport *port;
int (*preempt)(void *);
void (*wakeup)(void *);
void *private;
void (*irq_func)(int, void *, struct pt_regs *);
int flags;
struct pardevice *next;
struct pardevice *prev;
struct parport_state *state; /* saved status over preemption */
};
There are two types of driver that can be registered: "transient" and "lurking". A lurking driver is one that wants to have the
port whenever no-one else has it. PLIP is an example of this. A transient driver is one that only needs to use the parallel port
occasionally, and for short periods of time (the printer driver and Zip driver are good examples).
Claiming the port
To claim the port, use parport_claim, passing it a pointer to the struct pardevice obtained at device registration.
If parport_claim returns zero, the port is yours, otherwise you will have to try again later.
A good way of doing this is to register a "wakeup" function: when a device driver releases the port, other device drivers that
are registered on that port have their "wakeup" functions called, and the first one to claim the port gets it. If the parport claim
fails, you can go to sleep; when the parport is free again, your wakeup function can wake you up again. For example, declare
a global wait queue for each possible port that a device could be on:
static struct wait_queue * wait_q[MAX_MY_DEVICES];
The wakeup function looks like:
void my_wakeup (void * my_stuff)
{
/* this is our chance to grab the parport */
struct wait_queue ** wait_q_pointer = (struct wait_queue **) my_stuff;
if (!waitqueue_active (wait_q_pointer))
return; /* parport has messed up if we get here */
/* claim the parport */

if (parport_claim (wait_q_pointer)))
return; /* Shouldn't happen */
wake_up(wait_q_pointer);
}
Then, in the initialisation code, do something like:
struct pardevice * pd[MAX_MY_DEVICES];
int my_driver_init (void)

{
struct parport * pp = parport_enumerate ();
int count = 0;

while (pp) { /* for each port */
/* set up the wait queue */

init_waitqueue (&wait_q[count]);
/* register a device */
pd[count] = parport_register_device (pp, "Me",
/* preemption function */ my_preempt,
/* wakeup function */ my_wakeup,
/* interrupt function */ my_interrupt,
/* this driver is transient */ PARPORT_DEV_TRAN,
/* private data */ &wait_q[count]);
/* try initialising the device */

if (init_my_device (count) == ERROR)
/* failed, so unregister */
parport_unregister_device (pd[count]);
else if (++count == MAX_MY_DEVICES)
/* can't handle any more devices */
break;
}
Then a typical thing to do to obtain access to the port would be:
if (parport_claim (pd[n]))
/* someone else had it */
sleep_on (&wait_q[n]); /* will wake up when wakeup */
/* function called */
/* (do stuff with the port here) */
/* finished with the port now */

parport_release (pd[n]);
Using the port
Operations on the parallel port can be carried out using functions provided by the parport interface:
struct parport_operations {
void (*write_data)(struct parport *, unsigned int);
unsigned int (*read_data)(struct parport *);
void (*write_control)(struct parport *, unsigned int);
unsigned int (*read_control)(struct parport *);
unsigned int (*frob_control)(struct parport *, unsigned int mask, unsigned
int val);
void (*write_econtrol)(struct parport *, unsigned int);
unsigned int (*read_econtrol)(struct parport *);
unsigned int (*frob_econtrol)(struct parport *, unsigned int mask, unsigned
int val);
void (*write_status)(struct parport *, unsigned int);
unsigned int (*read_status)(struct parport *);
void (*write_fifo)(struct parport *, unsigned int);
unsigned int (*read_fifo)(struct parport *);

void (*change_mode)(struct parport *, int);
void (*release_resources)(struct parport *);

int (*claim_resources)(struct parport *);
unsigned int (*epp_write_block)(struct parport *, void *, unsigned int);

unsigned int (*epp_read_block)(struct parport *, void *, unsigned int);
unsigned int (*ecp_write_block)(struct parport *, void *, unsigned int, void

(*fn)(struct parport *, void *, unsigned int), void *);
unsigned int (*ecp_read_block)(struct parport *, void *, unsigned int, void
(*fn)(struct parport *, void *, unsigned int), void *);
void (*save_state)(struct parport *, struct parport_state *);

void (*restore_state)(struct parport *, struct parport_state *);
void (*enable_irq)(struct parport *);

void (*disable_irq)(struct parport *);
int (*examine_irq)(struct parport *);
void (*inc_use_count)(void);
void (*dec_use_count)(void);
};
However, for generic operations, the following macros should be used (architecture-specific parport implementations may
redefine them to avoid function call overheads):
/* Generic operations vector through the dispatch table. */

#define parport_write_data(p,x) (p)->ops->write_data(p,x)
#define parport_read_data(p) (p)->ops->read_data(p)
#define parport_write_control(p,x) (p)->ops->write_control(p,x)
#define parport_read_control(p) (p)->ops->read_control(p)
#define parport_frob_control(p,m,v) (p)->ops->frob_control(p,m,v)
#define parport_write_econtrol(p,x) (p)->ops->write_econtrol(p,x)
#define parport_read_econtrol(p) (p)->ops->read_econtrol(p)
#define parport_frob_econtrol(p,m,v) (p)->ops->frob_econtrol(p,m,v)
#define parport_write_status(p,v) (p)->ops->write_status(p,v)
#define parport_read_status(p) (p)->ops->read_status(p)
#define parport_write_fifo(p,v) (p)->ops->write_fifo(p,v)
#define parport_read_fifo(p) (p)->ops->read_fifo(p)
#define parport_change_mode(p,m) (p)->ops->change_mode(p,m)
#define parport_release_resources(p) (p)->ops->release_resources(p)
#define parport_claim_resources(p) (p)->ops->claim_resources(p)
Releasing the port
When you have finished the sequence of operations on the port that you wanted to do, use release_parport to let any
other devices that there may be have a go.
Unregistering the device
If you decide that you don't want to use the port after all (perhaps the device that you wanted to talk to isn't there), use
parport_unregister_device.

Something to bear in mind: interrupts
Parallel port devices cannot share interrupts. The parport code shares a parallel port among different devices by means of
scheduling - only one device has access to the port at any one time. If a device (a printer, say) is going to generate an
interrupt, it could do it when some other driver (like the Zip driver) has the port rather than the printer driver. That would
lead to the interrupt being missed altogether. For this reason, drivers should poll their devices unless there are no other
drivers using that port. To see how to do this, you might like to take a look at the printer driver.

Curious about sleep_on_interruptible() in ancient kernels.
Curious about sleep_on_interruptible() in ancient

kernels.
Keywords: sleep_on wake_up signal history
Date: Thu, 12 Feb 1998 21:19:36 GMT
From: Colin Howell <howell@cs.stanford.edu>
I've recently become interested in the design and history of the Linux kernel, so I decided to start at the
beginning. That's right, 0.01. :-)
Of course, I realize such old code may be very buggy or incomplete, but that's part of what makes it
interesting.
Anyway, I noticed a peculiarity in the scheduler function sleep_on_interruptible(). Before kernel
version 1.0, both forms of sleep_on() just waited on a task_struct pointer variable, setting that variable
to point to the last caller of sleep_on(). wake_up() would thus only directly awaken the last caller of
sleep_on(); that caller was responsible for waking up the previous caller (a pointer to which it had
saved), and so on down the chain. There was no explicit wait queue, only what you might call an
"implicit wait stack".
For uninterruptible waits, this will do, but if the wait is interruptible, there seems to be a problem.
Suppose one of the tasks waiting on the pointer variable gets a signal. It then is awakened by the
signal-handling code in the scheduler. This tasks will then awaken *all* the tasks waiting on the
pointer variable, just like wake_up() would. (If the task is at the top of the "wait stack", the behavior is
just like a call to wake_up(). If the task is somewhere else in the wait stack, it wakes up the task at the
top, puts itself back to sleep, and waits to be awakened. Then things proceed as with wake_up().) Of
course, you have to do things this way with such an implementation, since there's no practical way of
unlinking a task from the middle of such a wait stack. But the behavior still seems odd.
Maybe I'm mistaken about the intended behavior of sleep_on_interruptible(), but I thought a signal
was only supposed to wake the receiving task, not all the tasks. Am I wrong? It certainly seems to
work this way in versions 1.0 and later, which used a conventional wait queue approach.
Mind you, the old kernel code still works, because tasks which are awakened always seem to recheck
the condition they are waiting for before proceeding; if it hasn't occurred, they go back to sleep again.
But it seems inefficient to wake all the tasks simply because one got a signal, when they will all just
call sleep_on_interruptible() again.
Comments? Is this a valid criticism and a reason the kernel was changed, or am I just confused about
sleep_on()? And, assuming the criticism is valid, why did they wait until version 1.0 to make the
change?

Curious about sleep_on_interruptible() in ancient kernels.
P.S. Is there any archive or record of early discussions about the kernel design? The oldest thing I can
find is an archive of the linux-kernel mailing list which only dates back to the summer of 1995, four
years after the project started.

Server crashes using 2.0.32 and SMP

Keywords: SMP Crash Help!
Date: Tue, 03 Feb 1998 18:31:59 GMT
Greetings,
I have been having some problems with server crashes. On two occasions I was able to have personnel
at the co-location facility, where my server lives, look at the console immediately after a crash.
The kernel version running was 2.0.32 w/ SMP support on a dual Pentium Pro box.
When the server would crash, a message would be continuously displayed on the console (but not in
the syslog):
Aiee: scheduling in interrupt: 0012BBD1
A search of the sources found that this condition was tested for in /usr/src/linux/sched.c on line 396
and the message printed on line 497.
It would appear that an interrupt was encountered during the schedule() operation. This would be a bad
thing. (It's not nice to re-enter the scheduler via an interrupt)
Since the address being printed is, presumably, the return address after the schedule call, and is
consistent, I am assuming that the scheduler is being re-entered while servicing some sort of interrupt
from within the same ISR.
First, are my assumptions even close to reality?
Secondly, is this a "known" issue with the 2.0.32 kernel. I understand there have been some changes in
the kernel SMP code between 2.0.32 and 2.0.33 so I am wondering if upgrading the kernel will fix
this.
Thirdly, does this indicate some sort of hardware failure and if so, how can I trace this back to the
device in question.
Finally, I am open to suggestions for other ideas and/or options here.
As always, any help is appreciated. Most suggestions taken seriously :)
Thanks, in advace,
Steve

Messages
1. Debugging server crash by Balaji Srinivasan

Debugging server crash
Debugging server crash

Re: Server crashes using 2.0.32 and SMP (Steve Resnick)
Date: Tue, 03 Feb 1998 19:21:37 GMT
From: Balaji Srinivasan <balaji@hegel.ittc.ukans.edu>
To get more information on where exactly schedule was called within the interrupt handler, compile
the kernel with the -g flag (also remove the -fomit-frame-pointer flag). These options can be set via the
CFLAGS definition in the main Makefile of the kernel.
The address that is printed seems to be bogus. Recompile the kernel with debugging enabled (-g) and
see what address it prints out. You can then check out what function the address actually is in by using
gdb on the vmlinux file.
Also you might just try upgrading to 2.0.33
Hope this helps balaji

More Information
More Information
Re: Debugging server crash (Balaji Srinivasan)
Date: Thu, 05 Feb 1998 19:22:49 GMT
Ok ... A little more detail here:

I moved to 2.0.33, and build a kernel with SMP support, removing the -fomit_frame_pointer flag and
added -g.
My server crashed this morning with the same error message, the address being somewhat different,
but close enough to the original that I suspect the address to be the same function, in relation to 2.0.32
The scheduler was re-entered (apparently) from __wait_on_buffer in /usr/src/linux/fs/buffer.c, line
121.
I also read elswhere that enhanced RTC support is needed for the kernel, which had not been added
before.
Could this be a cause of the crashes?
Cheers,
Steve

it should not have happenned...
it should not have happenned...

Re: Debugging server crash (Balaji Srinivasan)
Re: More Information (Steve Resnick)
Date: Sun, 08 Feb 1998 23:33:44 GMT
As far as i know, wait_on_buffer should not be called from within the interrupt context. (since it
explicitly calls schedule). Along with the crash, there should be a call trace that gets printed out. With
that you might be able to find out what exactly called __wait_on_buffer
Hope this helps.
balaji
PS: I doubt if RTC has anything to do with the crash...
(maybe im wrong)

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/208/1/1/www.ducksfeet.com/steve
<HTML>
<HEAD>
<TITLE>HyperNews Redirect</TITLE>
<LINK rel="owner" href="mailto:">
</HEAD>
<BODY>
Broken URL:
http://www.redhat.com:8080/HyperNews/get/khg/208/1/1/www.ducksfeet.com/steveTry:
<A HREF="../../1.html">http://www.redhat.com:8080/HyperNews/get/khg/208/1/1.html</A> 
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/208/1/1/www.ducksfeet.com/steve [08/03/2001 10.12.08]

Signals ID definitions
Signals ID definitions
Date: Mon, 02 Feb 1998 13:54:56 GMT
From: Franky <genie@risq.belcaf.minsk.by>
I found recently that signal IDs, defined in src/include/asm-i386/signal.h are confused:

#define SIGABRT 6
#define SIGIOT 6
is that the right approach?

the segment D000 is not visible
the segment D000 is not visible

Date: Mon, 02 Feb 1998 13:00:01 GMT
From: <martinv2@ctima.uma.es>
Hi all
I am working on the design of ISA board for a PC. It has
96 Kb of ROM with addresses between C800:0 and D000:FFFF. The board works ok in a
pentium 75 but it does not work in more recent systems.
The reason is that the segment D000 is not visible.
Does anibody know what is going on and how to solve it ?
Thanks in advance.
Manuel J. Martin Universidad de Málaga.

ICMP - Supressing Messages in 2.1.82 Kernel
ICMP - Supressing Messages in 2.1.82 Kernel

Keywords: ICMP request IPV4 broadcast
Date: Fri, 30 Jan 1998 20:26:25 GMT
From: Brent Johnson <brent@foxtrot.dyn.ml.org>
On our local network we are using a terminal server which uses something called "Win-S" - it
intentionally sends an ICMP (ping) request to the broadcast address - because thats what its supposed
to do (i dont know the details of why).
Because of this I keep getting a message like: ipv4: (1 messages suppresed. Flood?) xxx.xxx.xxx.xxx
send an invalid ICMP error to a broadcast.
How can I stop this error message from being displayed?
Messages
1. Change /etc/syslog.conf by Balaji Srinivasan

Change /etc/syslog.conf
Change /etc/syslog.conf
Re: ICMP - Supressing Messages in 2.1.82 Kernel (Brent Johnson)
Keywords: ICMP request IPV4 broadcast
Date: Sat, 31 Jan 1998 00:56:29 GMT
The kernel message is printed with KERN_CRIT level. so modify your /etc/syslog.conf to log
KERN_CRIT messages to /var/adm/kernel or some other file. Note that in that way you will
potentially loose a lot of important messages...
If you want to fix this particular message then just modify your kernel (file net/ipv4/icmp.c line 683)
to printk(KERN_INFO "%s sent an invalid ICMP error to...); Dont try this if you dont want to mess
around with the kernel. balaji

Modem bits
Modem bits
Date: Fri, 30 Jan 1998 09:32:12 GMT
From: Franky <Franky>
Hello, all:
In my current project I should obtain the current status of a modem - CD, CTS, DTR, etc. Could
anyone give me a hint of how to do that?

Untitled
Untitled
Re: Modem bits (Franky)
Date: Sat, 31 Jan 1998 18:57:59 GMT
From: Kostya <ovsov@bveb.belpak.minsk.by>
It's quite simple. I think the code will explain it the best.
/***************************/
#include <termios.h>
int status ;
ioctl (fd,TIOCMGET,&status) ; // fd - already opened device.
if (status & TIOCM_DTR)
printf ("DTR is asserted\n"); // for example
/*****************************/
Info about '#defines' for modem signals in /usr/src/linux/include/asm/termios.h
Best regards.

I need some way to measure the time a process spend in READY QUEUE
I need some way to measure the time a process

spend in READY QUEUE
Keywords: kernel, scheduler, ready queue
Date: Tue, 27 Jan 1998 17:00:46 GMT
From: Leandro Gelasi <gelaslean@sunto.ing.unisi.it>
Hi everybody!
I need some way to get an accurate measurement of the time a

process spend in the scheduler's "ready queue" during his life
Any help will be appreciated.
Leandro Gelasi

How to make sockets work whilst my process is in kernel mode?
How to make sockets work whilst my process is

in kernel mode?
Keywords: socket, kernel
Date: Fri, 23 Jan 1998 14:10:05 GMT
From: Mikhail Kourinny <misio@kurin.kharkov.ua>
Hi all, I've written a kernel module that communicates with a remote device via TCP. It works fine
while data is passed at less than ~4K per write. Otherwise I receive a "Broken pipe" mistake :( Looks
like kernel internal buffers are not being empied while I'm in kernel mode. I tried
setsockopt(TCP_NODELAY), sleeping { current->state = INTERRUPTIBLE; current->timeout =
jiffies + 10; schedule();} with no success.
GODS! Where is my mistake? How should I make sockets to work? I'd be happy being pointed to
some TFM :)
There should be probably less tricky ways to solve my problem, but I just wish to complete this
approach.
Thanks in advance,
Mikhail

Realtime Problem
Realtime Problem
Date: Wed, 14 Jan 1998 20:51:59 GMT
From: Uwe Gaethke <Uwe.Gaethke@t-online.de>
Hi!
I do have a problem with the realtime capabilities of Linux. I wrote a loadable driver which uses the
RealTimeClock to generate periodic interrupt at a frequency of 1024Hz. With each interrupt the driver
increments an internal counter. The 'read' function of the driver returns simply that counter.
Next I wrote a realtime task (with the highest priority) which read that RTC continuously. This tasks
checks if the read counter is incremented by one between two read calls. If not, it prints an error
message which includes also the time between the last two read calls.
What I expect is: 1. The task calls 'read', 2. The RTC generates an Interrupt, 3. The task returns from
'read'.
If this is done fast enough (less than 1ms) every interrupt will get through to the realtime task.
And here is my surprise: Everything worked as expected until I called 'free' or 'ps' from a different
shell. At this time the task seem to loose interrupts for almost exact 10ms (i.e. 10 to 11 interrupts). It
seems that the scheduling is blocked for one tick.
Does anybody know why this happens?
Thanks for your help,
Uwe
Messages
1. SCHED_FIFO scheduling by Balaji Srinivasan

SCHED_FIFO scheduling
SCHED_FIFO scheduling
Re: Realtime Problem (Uwe Gaethke)
Date: Thu, 15 Jan 1998 00:41:30 GMT
I have a question regarding your initial query.

In your mail you said that the expected sequence of events is: 1: The task calls 'read' 2: RTC generates
an interrupt 3: The task returns from read.
I dont understand why the task should wait till the rtc generates an interrupt. Am i missing something
here?
As far as your query goes: What might be happening is that ps/free might be waking up some other
SCHED_FIFO (which i guess is what you are using) scheduled process (some kernel threads are
scheduled using SCHED_FIFO). This might schedule that process in instead of yours.
If you need predictable performance then you might try using KURT: KU Real-Time Linux
(http://hegel.ittc.ukans.edu/projects/kurt)
I hope this helps balaji

inodes
inodes
Keywords: inodes' locks
Date: Tue, 13 Jan 1998 22:49:14 GMT
From: Ovsov <ovsov@bveb.belpak.minsk.by>
What if after wait_on_inode () but before inode->lock = 1 in the inode.c module some hardware
interrupt comes and any other process will be scheduled that wants to use the same inode ???

Difference between SOCK_RAW SOCK_PACKET
Difference between SOCK_RAW SOCK_PACKET

Date: Wed, 07 Jan 1998 14:23:15 GMT
From: Chris Leung <eg_lch@stu.ust.hk>
We are doing a project that required us to change some

chatacteristics of the TCP protocols and/or add a buffer
into the TCP layer also. It seems to us that the SOCK_RAW or
SOCK_PACKET are good choice to override the TCP or even
IP level
Unfortunately, we cannot find any documents or sample programs
about SOCK_RAW or SOCK_PACKET... :~(
Please help us out ~~~~~~

Using SOCK_PACKET to gain full access to Ethernet
SOCK_PACKET
Re: Difference between SOCK_RAW SOCK_PACKET (Chris Leung)
Keywords: SOCK_PACKET
Date: Wed, 10 Jun 1998 18:01:01 GMT
From: Eddie Leung <edleung@uclink4.berkeley.edu>
Body-URL: http://www.senie.com/dan/technology/sock_packet.html
f78
Using the SOCK_PACKET mechanism in Linux

To Gain Complete Control of an Ethernet Interface
Daniel Senie
Amaranth Networks, Inc.
I have put together this web page in response to many queries from multiple people. Rather than
continue to write individual responses, I have put together this page to explain what I was trying to do,
and how I got it to work.
First, some background. To simulate software that was intended to run on a different (and not yet
built) platform, I needed a convenient way to exercise the code against live networks. I first tried using
a Solaris system, using the DLPI driver. This allowed me to do most things, but failed when I needed
to be able to set the source Ethernet MAC address. The Solaris DLPI driver provides no way to
override the hardware on a per-packet basis.
Next, I started looking at mechanisms in Linux. The mechanism that seemed to fit the best was
SOCK_PACKET, which is used by tcpdump among other things. To Make this work for me, though,
it was necessary to keep the Linux machine from doing anything on the interface, other than letting my
programs at it.
How To Do It
This information and these instructions work for RedHat Linux 4.2 with a 2.0.30 kernel. I expect
they'll work fine on a 2.0.32 kernel as well, and with other Linux distributions. I have heard that a
better mechanism for providing this facility is coming in a newer kernel. If or when I get more
information on that, I'll see about adding another page on that.
First, the interface needs to be told NOT to run ARP. Promiscuous mode should be enabled if you
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/186/1.html (1 di 4) [08/03/2001 10.12.15]

need to hear everything on the wire.:
ifconfig eth1 -ARP PROMISC UP 10.1.1.1

Then tell the Linux stack it's not supposed to see any of the traffic to or from this port:
ipfwadm -O -a deny -P all -S 0/0 -D 0/0 -W eth1

ipfwadm -I -a deny -P all -S 0/0 -D 0/0 -W eth1
In the program, you need to do several things. First, the socket call:
s = socket(AF_INET, SOCK_PACKET, htons(0x0003));

to get the socket set up.
Next I bind the specific Ethernet NIC I want:
struct sockaddr myaddr;
memset(&myaddr, '\0', sizeof(myaddr));

myaddr.sa_family = AF_INET;
strcpy(myaddr.sa_data, "eth1"); /* or whatever device */
r = bind(s, &myaddr, sizeof(struct sockaddr));

and check the return code for any errors.
Now, when you want to send or receive, this socket is bound to the proper device. One word of
caution, though, ALWAYS check the received packets to be sure you got them on the right device.
There's a race condition between making the socket call and the bind call where you'll get all packets
from ALL interfaces... not what you want!
So, to send a packet:
struct sockaddr from;

int fromlen;
memset(&from, '\0', sizeof(from));

from.sa_family = AF_INET;
strcpy(from.sa_data, "eth1"); /* or whatever device */
fromlen = sizeof(from);
r = sendto(s, msg, msglen, 0, &from, fromlen);

and check the return code. Note that msg is the pointer to the packet, starting with the MAC header. Be
sure you put the proper source MAC address into your packets! Also, msglen is the length of the
packet including the MAC header, but not including the CRC (which I do not worry about, but the

hardware does supply).

Receive is pretty similar:
struct sockaddr from;

int fromlen;
fromlen = sizeof(from);
r = recvfrom(s, msg, 2048, 0, &from, &fromlen);

if (r == -1)
{
/* deal with error */
}
if
5fc
(strcmp(from.sa_data, "eth1") != 0)
{
/* not from the interface we wanted, discard */
}
if r == -1, you have an error. If r > 0, then r is the length of the received packet. The strcmp ensures the
packet came from the right interface.
If you want to receive for MAC addresses other than the one the board has in it, use promiscuous
mode. To get the mac address from your program, there's an ioctl call SIOCGIFHWADDR. In the
return from that call is also the hardware type, so you can ensure it's Ethernet. Another call,
SIOCGIFMTU will tell you the MTU of the interface.
Caveats
● Do not use this methodology on your primary Ethernet interface. Instead, install a second (and
if needed, third) NIC card for use in this way. I've successfully used 5 NIC cards in one
machine, 1 under the control of Linux, the rest bypassed to my programs.
● Be VERY sure you set up the ipfwadm commands. Failure to do so will make a huge mess,
likely causing networking problems for other hosts on your lan.
If you found this information helpful and useful, please let me know. If you require further information
or assistance in this area, this can be arranged. For consultation beyond simple questions, Amaranth
Networks, Inc. can provide advice, services and information for a fee.
Copyright © 1998, Amaranth Networks, Inc.
All Rights Reserved
0


Need additional termcap entries for TERM=linux
Need additional termcap entries for TERM=linux

Date: Mon, 05 Jan 1998 07:47:29 GMT
From: Karl Bullock <karl@dixie-net.com>
I have an application that needs the graphics characters for the default G1 character set, but I can't find
the escape sequences for the G1 set anywhere. Anyone know where I can find the full G1
specification?

Question on Umount or sys_umount
Question on Umount or sys_umount

Keywords: umount sys_umount do_umount
Date: Wed, 31 Dec 1997 14:40:07 GMT
From: teddy <teddy@gti.net>
the program umount.c unmounts devices.

This program calls a function called umount(...)
this function calls the kernel function do_umount
located in super.c in kernel.
=================================
I can not find the function umount!
Where is it?
FYI: There is a kernel function called sys_umount,

the comments in this section say it is "umount"!
If this is true how does sys_umount get associated with umount?

===============================
email = teddy@gti.net

Passing file descriptors to the kernel
Passing file descriptors to the kernel

Keywords: file descriptor userfs pipe module
Date: Tue, 30 Dec 1997 18:54:12 GMT
From: Pradeep Gore <pradeepg@corelcomputer.com>
I am porting a filesystem(the userfs0.9.4.2) source from the linux x86 version to another for the arm
processor. The userfs filesystem has a kernel module that accepts a pair of file descriptors passed to it
in a user defined structure in the mount function. The kernel module receives this structure in the
read_super function of the file_system_type struct.(defined in fs.h).
now the problem.. The file descriptors passed are invalid, the values are correct but when the kernel
module code tries to convert the file descriptor to a file pointer,it fails.
struct file* fdtofp(int fd)
{
struct file* fp;
fp = current->files->fd[fd];
if (fp->f_inode == NULL)
{
return NULL; // fp->f_inode turns out to be null and the
// function returns here
}
}
This code works perfectly on the x86 linux but not on the arm version.The descriptors are created
successfully using the pipe function.
can anyone explain what could be going wrong?How can i debug? Is there a way to convert a file
descriptor to a struct file* in a user process?
thanks for any help,
Pradeep Gore

A way to "transform" a file descriptor into a struct file* in a user process
A way to "transform" a file descriptor into a struct

file* in a user process
Re: Passing file descriptors to the kernel (Pradeep Gore)
Keywords: file descriptor userfs pipe module
Date: Wed, 14 Jan 1998 23:07:42 GMT
From: Lorenzo Cavallaro <lc529863@silab.dsi.unimi.it>
Well, I don't know if I can help you and

maybe this'll be a stupid answer, but
about finding a way to transform a file descriptor into
a struct file*, try:
struct file *fdopen(int filedes, char *mode)
Where filedes is your file descriptor (Geeeee :) )

and mode can be either "w" or "r" (write/read mode)
This functin returns upon successfull a struct file*, else

a NULL pointer;
Let you also look on the man page;

This should work in a user process.

Dead Man Timer
Dead Man Timer

Date: Tue, 23 Dec 1997 16:39:31 GMT
From: Jody Winston <jody@sccsi.com>
I have a device driver that needs a dead man timer to go off and invoke a function if and only if the
time since the last interrupt is greater than a given value. Should I use interruptible_sleep_on to
implement the dead man timer?

raw sockets
raw sockets
Date: Thu, 11 Dec 1997 05:02:52 GMT
From: lightman <lightman2@hotmail.com>
I'm using Raw sockets and have sent away a TCP packet with the SYN flag on. I get the
SYN|ACK response and responds to it. But the kernel beats me to it, and sends away a
RST packet before my ACK response to the SYN|ACK. How can you stop the kernel from
responding with a RST to the SYN|ACK? I use two seperate sockets, one for
transmitting
and one for receiving.

a kernel-hacking newbie
a kernel-hacking newbie
Keywords: Thinkpad IBM Mwave newbie
Date: Thu, 11 Dec 1997 04:52:17 GMT
From: Bradley Lawrence <cecil@niagara.com>
I've never hacked the Linux kernel before, and I'm not very experienced in C++ at all, but I am a man
on a mission.
I've got a Thinkpad 760E, and the modem simply will not work in Linux. It's a nasty little peice of
hardware, IBM's 'Mwave' soundcard/modem combo... and I've finally gotten tired of waiting for
someone to come out with support for it, so I'm going to try and do it myself... I probably won't get
anywhere, but I'm going to try.
The only source of information I have right now is the the Win95 driver, and since I don't know
assembly that's not very informative at all. I'm trying to get IBM to give me some kind of specs for it,
but so far I've gotten nowhere.
The more I think about it the more I realize I'm never going to get anywhere. But I'm desperate. Linux
without a modem is rather pointless for me, and after spending $3,999 on this computer, I don't have
the change lying around to buy an external modem...
So anyway, my question is ... where do I start? I know bits and peices of C++, and have never played
with the kernel code before. Thanks a lot. Sorry, but I really honestly have no clue how to learn about
this subject.

A place to start.
A place to start.
Re: a kernel-hacking newbie (Bradley Lawrence)
Keywords: starting newbie
Date: Fri, 20 Mar 1998 22:40:03 GMT
From: <unknown>
A good place to start would be getting a subscription to the Linux Journal.
http://www.linuxjournal.com. They have informative articles on driver programming and kernel
hacking. Also you should by a book on C, not C++. All the code I've seen for the kernel is C and
assembly. Also you really shouldn't have to know too much assembly to write a driver for your
modem. I seriously doubt that IBM is going to give you too much info about the modem, but best of
luck.
Les Thompson

Modems in general
Modems in general
Re: a kernel-hacking newbie (Bradley Lawrence)
Keywords: Thinkpad IBM Mwave newbie
Date: Thu, 15 Jan 1998 09:03:11 GMT
From: Ian Carr-de Avelon <ian@emit.com.pl>
I don't know this modem but as an ISP I deal with lots of others.
99.9% of modems deal with the Hayes commands internally. A typical
attack on this would look like:
Fire up computer in an OS for which you have all the support
software. In this case 95.
Link to the modem with a terminal program eg Hyperterm
Type:
ATZ
See how it says:
OK
Note all the settings.
Now we need to get that is Linux.

Move to linux with Linloader. Connect to the modem with a Linux
terminal software eg kermit. Take as an example a modem which was
com2 under Win95
set line /dev/ttyS1

set speed 115200 (assuming new hardware here)
c
Now try the ATZ again. If it does not work I suggest you either give
up and swap the modem, or try to get a contract to write the driver.
It will be a big job.
If it does work you can start reading the HOWTOs about PPP get
mgetty for remote login etc.
You can try going directly to Linux with LILO. If that does not
work, but after Win95 start did work, the driver does some kind of
PnP initialisation and you had better keep Win95
Ian

Modems in general

How to write CD-ROM Driver ? Any Source Code ?
How to write CD-ROM Driver ? Any Source Code

?
Keywords: CD-ROM Device Driver Source code
Date: Mon, 08 Dec 1997 06:53:49 GMT
From: Madhura Upadhya <mupadhya@in.oracle.com>
Please any one can specify how to write CD-ROM Driver on linux. Is there any ready made reference
source code available ?. How exactly it works ?

Measuring the scheduler overhead
Measuring the scheduler overhead

Keywords: scheduler overhead
Date: Thu, 04 Dec 1997 22:34:31 GMT
From: Jasleen Kaur <jks@cs.utexas.edu>
Hi,
I am implementing a different scheduling policy in the

scheduler of Linux 2.0.0.
I need to measure the time taken for choosing the next

task to be scheduled, i.e., the actual time spent in the
schedule() function, without the actual context switch
having taken place. Since I clear interrupts in schedule(),
how do I measure the time? Does 'jiffies' still contain the
actual time? Or is the time asynchronously updated in some
other variable while the timer interrupt is disabled?
Thanks,
Jasleen Kaur

Where can I find the tcpdump or snoop in linux?
Where can I find the tcpdump or snoop in linux?

Keywords: tcpdump snoop
Date: Thu, 04 Dec 1997 02:19:11 GMT
From: <wangc@taurus>
Hi ,in some other Unix systems ,I can find very useful tools such as tcpdump and snoop , but in Linux
, how can I get them?

man which
man which
Re: Where can I find the tcpdump or snoop in linux?
Keywords: tcpdump snoop
Date: Mon, 25 May 1998 00:36:31 GMT
From: <trajek@j00nix.org>
[root@localhost /root]# which tcpdump /usr/sbin/tcpdump [root@localhost /root]#

Timers don't work??
Timers don't work??

Keywords: timer
Date: Tue, 02 Dec 1997 21:18:09 GMT
From: Joshua Liew <jliew@pacific.net.sg>
I've been playing around with the add_timer and del_timer functions and can't seem to get it to work.
Say if I want to execute a IRQ routine using a 10sec timer for 10 times, only the first time there is a 10
sec delay, but subsiquent interrupts are simultaneous.
static struct timer_list timer = { NULL, NULL, INTERVAL, 0L, &irq };
main() { add_timer(&timer); }
irq() { del_timer(&timer); timer.expires = INTERVAL; add_timer(&timer); printk(KERN_DEBUG
"Hello."); }
Is there any documentation on timers? Appreciate all the help I can get. I have a project that needs this
and I'm stuck.
Thanks.
Messages
1. Timers Work... by Balaji Srinivasan

Timers Work...
Timers Work...
Re: Timers don't work?? (Joshua Liew)
Keywords: timer
Date: Wed, 03 Dec 1997 01:25:06 GMT
The mistake with your code is that you need to

update INTERVAL every time you add_timer.
The expires field in timer_list is an absolute
time not a relative time.
for example:
irq() {
timer.expires = jiffies + INTERVAL;
add_timer(&timer);
}

problem of Linux's bridge code
problem of Linux's bridge code

Keywords: bridge code IEEE 802.1d
Date: Tue, 02 Dec 1997 09:17:55 GMT
From: <wangc@taurus>
In the Linux's bridge source code ,it refers to the IEEE 802.1d specification section
4.9.1. where can I get this documentation?
thanks!

Documention on writing kernel modules

Keywords: modules insmod
Date: Mon, 01 Dec 1997 20:22:08 GMT
From: Erik Nygren <nygren@mit.edu>
As far as I can tell, this site doesn't yet have information on how to write a loadable kernel module
(although there are quite a few queries from people asking about how to do it). After looking around, I
found that the insmod/modules.doc and insmod/HOWTO-modularize files in the modules-2.0.0.tar.gz
package contained a fairly good description of some of the things you need to do when writing a
kernel module. The insmod/drv_hello.c and insmod/Makefile files in that package provide an example
character device driver that can be built as a module.
It would be nice if these files (or the relevant contents) could get incorporated into the KHG at some
point.
To summarize, it looks like modules should be built with the following compiler options (at least, this
is the way the Makefile for drv_hello.o goes):
-O6 -pipe -fomit-frame-pointer -Wall -DMODULE -D__KERNEL__ -DLINUX
Of the above, it seems likely that the key arguments are the -DMODULE -D__KERNEL__ and
possibly the -O6 are actually needed. If you want to build your module with versioning support, add
the following options:
-DMODVERSIONS -include /usr/include/linux/modversions.h
From the examples and docs, it looks like modules should be in the form:
#include /* must be first! */

#include /* optional? */
#include /* ? */
/* More includes and the body of the driver go here. */
/* Note that any time a function returns a resource to the kernel

* (for example, after a open),
* call the MOD_INC_USE_COUNT; macro. Whenever the
* kernel releases the resource, call MOD_DEC_USE_COUNT;.

* This prevents the module from getting removed

* while other parts of the kernel still have
* references to its resources.
*/
#ifdef MODULE /* section containing module-specific code */
int
init_module(void)
{
/* Module initialization code.
* Registers drivers, symbols, handlers, etc. */
return 0; /* on success */
}
void
cleanup_module(void)
{
/* Do cleanup and unregister anything that was
* registered in init_module. */
}
#endif
Again, see the documentation scattered through the modules-2.0.0 package (and also presumably
through the newer modutils-2.1.x packages) for more detailed information.

How to display a clock on my console?
How to display a clock on my console?

Keywords: memory clock
Date: Fri, 28 Nov 1997 09:43:27 GMT
From: keco <hu@billow.ml.org>
Hi:
i want to display a clock on my console under text mode,

Can i access the video memory directly?
i try many method,but failed,how should i do?

thank you every much!
##I cant often read "kernel hack guide" :( ,please mail

to me. eamil: hu@billow.ml.org

Difference between SCO and Linux drivers.
Difference between SCO and Linux drivers.

Date: Thu, 27 Nov 1997 09:42:04 GMT
From: M COTE <ndg@mcii.fr>
I've a driver that was developped under SCO Unix and i'me wonderring if it is easy to
port under Linux.
Can somebody help me ???.
Thanks ...
M COTE (ndg@mcii.fr)

Changing the scheduler from round robin to shortest job first for kernel 2.0 and up
Changing the scheduler from round robin to

shortest job first for kernel 2.0 and up
Date: Wed, 26 Nov 1997 10:39:26 GMT
From: <royal_and_mary_harrell@msn.com>
How do you recode the kernel from round robin to shortest job first in later versions of Linux. I'm
using Red Hat 4.2.? I need this for a operating systems class and for better understanding of Linux.

Improving the Scheduer
Improving the Scheduer

Re: Changing the scheduler from round robin to shortest job first for kernel 2.0 and up
Keywords: scheduler linux
Date: Fri, 06 Mar 1998 10:23:30 GMT
From: Lee Ingram <lee@ingram.clara.net>
How could I improve the scheduler. Please HELP
Messages
1. Improving the Scheduler : use QNX-like by Leandro Gelasi

Improving the Scheduler : use QNX-like
Improving the Scheduler : use QNX-like

Re: Improving the Scheduer (Lee Ingram)
Keywords: scheduler linux QNX
Date: Fri, 06 Mar 1998 15:13:41 GMT
From: Leandro Gelasi <gelaslean@sunto.ing.unisi.it>
HI!
There are some Linux modified scheduler on Internet.
The best one for general purpouse application is QNX-like scheduler patch from Adam McKnee
(amckee@poboxes.com). It is a patch for kernel 2.0.32.
I am testing the performance of this scheduler for a University exam and the preliminary results says it
is better than standard one .
It would grant good interactive performance under heavy CPU load.
You can find the patch at www.linuxhq.com or by contacting the autor.
Hope this helps
Leandro

Re: Changing the sched. from round robin to shortest job first for kernel 2.0 and up.
Re: Changing the sched. from round robin to

shortest job first for kernel 2.0 and up.
Date: Mon, 05 Jan 1998 03:53:12 GMT
From: Pirasenna V.T. <piras@pspl.co.in>
Hello,
You must have got a lot of replies, but I am just sending
my views about it. It get ur idea that you wanna better
understanding of the OS. But I really wonder whether the
system will behave sane if you do the changes. You have
not told abt ur M/c. If it is a IBM compatible, then
you should go to the source code and try to grep on the
kernel data structures that you know, like, GDT, LDT, etc,.
Pl. tell me if you get some results, thanx,

Piras
piras@pspl.co.in

meanings of file->private_data
meanings of file->private_data
Date: Mon, 24 Nov 1997 11:09:40 GMT
From: <ncuandre@ms14.hinet.net>
Could anyone tell me what's the meaning of private_data in the structure file (file->private_data), and
under what condition should I use it.

/dev/signalprocess
/dev/signalprocess
Keywords: signal process device
Date: Mon, 24 Nov 1997 00:19:50 GMT
From: <flatmax>
Has anyone started on /dev/signalprocess ?

how to track VM page access sequence?
how to track VM page access sequence?

Keywords: Virtual Memory, page access
Date: Sat, 22 Nov 1997 02:26:54 GMT
From: shawn <shawnc@cs.berkeley.edu>
Does anyone know how one can keep track of VM page access sequence numbers of some application
programs? One way I thought was to mark each virtual page as protected at allocation time, so an
access to such a page will result a page fault, which is much easier to record. However, I couldn't find
any kernel functions that will lock only one virtual page. The functions I found were just marking an
entire virtual memory area as not readable, not writable, etc. I am trying to create some user
application program page access profiles. Any hints are greatly appreciated.
Shawn

Whats the difference between dev_tint(dev) and mark_bh(NET_BH)?
Whats the difference between dev_tint(dev) and

mark_bh(NET_BH)?
Date: Fri, 21 Nov 1997 14:39:17 GMT
From: Jaspreet Singh <jaspreet@sangoma.com>
Hi,
I don't know what the difference between the two is. It looks as if "net_bh()" calls devtint ( and i guess doing
mark_bh(NET_BH) calls net_bh() ) .
My driver calls dev_tint again and again due to heavy traffic and as a result the kernel crashes with the error:
"release:kernel stack corruption".
However when I replace that call with a mark_bh(NET_BH) then my kernel doesn't
crash.
If someone could shed some light as to what the difference between the two is that would be great.
Thanks
Jaspreet Singh

PCI
PCI
Keywords: PCI fault
Date: Wed, 19 Nov 1997 21:43:04 GMT
From: <mullerc@iname.com>
i try to write a device driver (module) for a card on the pci bus but i get segmentation fault in my x86
kernel (2.0.30) when i try to access the pci bus (memory) at 0xE4102000 for example. why?
Messages
1. RE: PCI by Armin A. Arbinger

RE: PCI
RE: PCI
Re: PCI
Keywords: PCI fault
Date: Thu, 20 Nov 1997 03:43:44 GMT
From: Armin A. Arbinger <armin.arbinger@bonn.netsurf.de>
did you make an ioperm() on the memory you want to access?

Can I make syscall from inside a kernel module?
Can I make syscall from inside a kernel module?

Keywords: syscall, module, lock physical page
Date: Tue, 18 Nov 1997 21:48:35 GMT
From: Shawn Chang <shawnc@cs.berkeley.edu>
I have a silly question about syscalls. I wrote a kernel module for user controlled page allocation,
which needs to call some kernel function to lock some physical pages allocated, so they won't get
swapped out. After searching in the kernel source, I only found sys_mlock() in mm/mlock.c seem to
be a good function for my purpose. But sys_mlock() is not a directly exported kernel symbol, so my
module can't call it directly. Then I found that one of the pre-defined syscalls is actually mlock, so I
was thinking if I could make a syscall from inside my kernel module. Is it possible to do that?
Otherwise, how do I export sys_mlock() so my kernel module will be able to call it?
Another related question. My kernel module is accessed by user application through syscall(164,...).
Suppose I want to access some functions in the kernel module from some original kernel functions,
such as do_page_faults() in fault.c. What should I do?
Thanks in advance for any answers.
Shawn

Re: Can I make syscall from inside a kernel module?
Re: Can I make syscall from inside a kernel

module?
Re: Can I make syscall from inside a kernel module? (Shawn Chang)
Date: Sat, 17 Jan 1998 07:39:25 GMT
From: Massoud Asgharifard <asghari@ce.sharif.ac.ir>
hi, Well, no. when you are doing a syscall from user space, the syscall parameters are in segment
register fs (assuming x86). The functions memcpy_fromfs and memcpy_tofs are called in kernel to
retrieve the parameters for kernel function. but module code is kernel code anyway, and can't do that.
(Segmentation fault....) (try rewriting the syscall, but reference kernel memory for parameters.) If you
want to call kernel-internal-functions from your module code, (which is not exported normally) you
should register it into file linux/kernel/ksyms.c this will export that function's name and insmod will
install your module. Sorry for typos and mistakes, if any.

Make a syscall despite of wrong fs!!
Make a syscall despite of wrong fs!!

Re: Re: Can I make syscall from inside a kernel module? (Massoud Asgharifard)
Keywords: syscall, module, fs
Date: Fri, 23 Jan 1998 12:00:16 GMT
From: Mikhail Kourinny <misio@kurin.kharkov.ua>
Yes, despite of fs pointing to wrong place. There are functions somewhere in the kernel named put_fs
and get_fs. Using these functions you should place KERNEL_DS in fs. And after syscall restore it,
certainly. You can avoid it if you don't pass user space pointers a paramters, IMHO.
Regards, Mikhail

code snip to make a sys_* call from a module
code snip to make a sys_* call from a module

Date: Thu, 08 Jan 1998 22:35:21 GMT
From: Pradeep Gore <pradeepg@corelcomputer.com>
As an example, here is a code snip that make a sys_read call from a

kernel module.
// module.c
....
#include <sys/syscall.h>
extern long sys_call_table[]; // arch/i386/kernel/entry.S
int (*sys_read)(unsigned int fd, char *buf, int count);

// pointer to sys_read. linux/fs/read_write.c
....
int init_module(void)
{
...
sys_read = sys_call_table[SYS_read];
//now you can use sys_read
sys_read(...);
...
}

Dont use system calls within kernel...(esp sys_mlock)
Dont use system calls within kernel...(esp

sys_mlock)
Date: Wed, 26 Nov 1997 13:16:17 GMT
If you want to use a system call within a kernel module then export the system call using
EXPORT_SYMBOL macro.
A better solution would be to use mlock in the user space before entering the kernel (ie. write a
wrapper function for your entry point that would lock pages in for you before it enters the kernel) This
in my opinion is a cleaner solution than exporting sys_mlock.
In addition since sys_mlock acts on the current process it might not have desirable effects in certain
cases. Hope this helps balaji

Untitled
Untitled
Keywords: fake source IP address
Date: Tue, 18 Nov 1997 19:51:23 GMT
From: Steve Durst <sdurst@rl.af.mil>
This is a follow-up to a question in June, about how to "cheat" and change the outgoing IP source
address.
I'm trying to do that too, but I only want to change packets belonging to particular user-level processes
(e.g. telnet). So I'm going to set up a table that both the kernel and a user-side daemon can write to,
then invoke the daemon to run whatever process I want. The daemon will get the PID and the desired
fake source IP address and write it to the table.
The appropriate function (I think it's ip_build_xmit() ) will read the table and change only those
packets sent by the processes listed in the table. Right now I'm using printk() lines to debug this thing.
Question: HOW do you find the PID associated with a given packet? I tried current->pid but
apparently it's not reliable... While some outgoing packets occur when current->pid does reflect the
correct process, other times outgoing packets known to be associated with, say, telnet, occur with the
current->pid indicating, say, syslogd.
Shouldn't the PID be accessible through an sk_buff? The packet had to come from somewhere, and
incoming packets have to be delivered to the right processes eventually. Right?
-Steve

RAW Sockets
RAW Sockets
Keywords: raw sockets
Date: Tue, 18 Nov 1997 19:35:46 GMT
From: Art <art@falt.deep.ru>
Hi All :)
Where can I find the BIG documentstion of RAW SOCKETS?
Pls HELP.
Art

use phy mem
use phy mem

Keywords: vremap()?
Date: Wed, 12 Nov 1997 07:57:22 GMT
From: WYB <tit@public.bta.net.cn>
My pc has 64M memory, now I want to map the

63-64M mem into kernel, so I call:
vremap( 63*1024*1024, 1024*1024 ) ;
1.If this block of memory has been occupied
by others, does vremap() return fail. Otherwise
I will crash the system.
2.After vremap(), does this block of memory marked
alloced, and other mem request does not get any page
of this block?
3.Are there any way to know which phy mem is free?

HyperNews for RH Linux ?
HyperNews for RH Linux ?

Date: Mon, 10 Nov 1997 22:00:25 GMT
From: Eigil Krogh Sorensen <eks@aar-vki.dk>
Are there RPMs with HyperNews for RH Linux ?

Not really needed
Not really needed

Re: HyperNews for RH Linux ? (Eigil Krogh Sorensen)
Date: Fri, 16 Jan 1998 22:10:52 GMT
From: Cameron <cls@greems.org>
Hypernews is written entirely in Perl, and contains nothing Linux-specific. Most of the installation is
configuration via forms, not appropriate to pre-packaging in an RPM or .deb file.
The one thing that threw me the first time I installed it was that all of Hypernews' data are owned by
the Web server user, not by your user account, not by root. If you try to own any of it yourself it just
makes a security and permissions mess. That is why you must use the setup and edit-article forms.
Don't even try the command line version of setup. (Well, it might work if you su - www first...)
Cameron

about raw ethernet frame: how to do it ?
about raw ethernet frame: how to do it ?

Date: Fri, 07 Nov 1997 14:27:19 GMT
From: <crbild@smc.it>
Hi, how can i send raw frames on ethernet devices ? and/or in other network devices ? thanks
Roberto Favaro

process table
process table
Keywords: task process
Date: Thu, 06 Nov 1997 21:15:27 GMT
From: Blaz Novak <blaz.novak@guest.arnes.si>
Hi! Could someone please tell me if it is possible for a user level program to get address of kernels
task struct(process table)?
Blaz
blaz.novak@guest.arnes.si bjt@gw2.s-gimb.lj.edus.si blaz.novak@usa.net

Stream drivers
Stream drivers
Keywords: stream driver
Date: Thu, 06 Nov 1997 17:56:25 GMT
From: Nick Egorov <nic@item.ru>
Who knows somthing about stream drivers in LiNUX ? Maybe this is something about Net Drivers ?
Thank you ! Nick.

Streams drivers
Streams drivers
Re: Stream drivers (Nick Egorov)
Date: Fri, 13 Feb 1998 06:47:19 GMT
From: <unknown>
The STREAMS package for Linux is available at the site www.gcom.com
Regards,
anand.

Stream in Solaris
Stream in Solaris
Re: Stream drivers (Nick Egorov)
Date: Mon, 12 Jan 1998 00:33:19 GMT
From: <cai.yu@rdc.etc.ericsson.se>
Hi :
I have info. about stream in Solaris .
Streams Programming Guide http://locutus.sas.upenn.edu:8888/ Streams protocols and drivers
http://home.eznet.net/~herbert/streams_ppt_notes.htm
Now I do a project about Streams , so I want a example about it ,I have no
more experience of linux . If you have some example , please forward to me .
Best regards

XircL631xternal Ethernet driver anywhere?
XircL631xternal Ethernet driver anywhere?

Keywords: network xircL63adapter driver device
Date: Mon, 03 Nov 1997 22:52:27 GMT
From: mike head <bc80267@binghamton.edu>
Are there any device drivers availible for the XircL631xternal Etheret adapter ee-10bu. I am looking
into writing a windows driver for this3adapter and need to get some technical info on it. (I'd also like to
use it to hook a IPIP network to my main server, rather then buy a new internal card).
bc80267@binghamton.edu

interruptible_sleep_on() too slow!
interruptible_sleep_on() too slow!

Keywords: device driver interrupts interruptible_sleep_on()
Date: Fri, 31 Oct 1997 06:37:26 GMT
From: Bill Blackwell <wjb@mit.edu>
Hi,
I'm in the process of writing a driver for a data
acquisition board. I'm having some difficulties setting up
the interrupt handler. Here's the problem:
I first write a byte to a register on the board which initiates a data conversion (when data is ready to be
read, an interrupt is generated). The next line of code is an interruptible_sleep_on() call. On some
occasions, the A/D board generates an interrupt BEFORE the i_s_o() call is complete, so the task is
never put to sleep and added to the queue (I hope I have that right, I'm very new to this stuff...). When
the companion wake_up_interruptible() call is made at the end of the interrupt handler routine, the
program stalls, since there is nothing to be awakened.
Is there a fix for this?
Thanks for reading this - sorry if this is excruciatingly trivial.

wrong functions
wrong functions
Re: interruptible_sleep_on() too slow! (Bill Blackwell)
Keywords: device driver interrupts interruptible_sleep_on()
Date: Fri, 31 Oct 1997 13:39:12 GMT
Getting around this common race condition without disabling interrupts is one of Linux's features.
You need to add the function to the wait queue before you write to the board and cause the interrupt to
occur. Look at kernel/sched.c at the definition of interruptible_sleep_on(): You want to do
something like this:
current->state = TASK_INTERRUPTIBLE;
add_wait_queue(this_entry, &wait);
trigger_interrupt_from_board();
schedule();
remove_wait_queue(this_entry, &wait);
Linux's serial.c uses this trick.

creating a kernel relocatable module
creating a kernel relocatable module

Keywords: kernel relocatable module
Date: Wed, 29 Oct 1997 18:41:01 GMT
From: Simon Kittle <simon@nfm.co.uk>
is there any tutorial/documentation for writing relocatable modules for the kernel or just stuff on the
structure of it. I have been able to get a sort of bare bones module that does nothing loaded and
unloaded (the code was just stripped down from some other device driver module) but I cant get new
functions, I supose they would be syscalls to work. If I just want to add a few syscalls and not deal
with any hardware, how do I "register" them so the kernel knows to use them. When I wrote one such
function just to try and return one char, I wrote program to test it but could not get it linked
any hep much apprieciated.
tanks - Simon Kittle

Up to date serial console patches
Up to date serial console patches

Keywords: serial console
Date: Mon, 27 Oct 1997 01:06:54 GMT
I've been looking for up-to-date patches for a serial console for Linux. I know most of you will be
wondering why I'm bothering... after the difficulties I've had finding any info, *I'm* starting to wonder
too.
Anyway, if anyone can tell me where some info on serial consoles can be found, please let me know.
The latest links I can find are about July '95, they point to non-existant pages, and they're for kernel
v1.3 anyway (I currently run 2.1.56).

Kernel-Level Support for Checkpointing on Linux?
Kernel-Level Support for Checkpointing on

Linux?
Keywords: checkpointing, kernel
Date: Sat, 11 Oct 1997 21:57:32 GMT
From: Argenis R. Fernandez <argenis@usa.net>
Has anybody heard of somebody doing this?

Working on it.
Working on it.
Re: Kernel-Level Support for Checkpointing on Linux? (Argenis R. Fernandez)
Keywords: checkpointing, kernel
Date: Sun, 16 Nov 1997 14:40:14 GMT
From: Jan Rychter <jwr@icm.edu.pl>
I'm working on it. I should have something ready in about three weeks. Expect basic process
checkpointing. Open files will be restored, network connections for obvious reasons will not. This
greatly limits the use of checkpointing.
Also, I probably won't even try to do process trees nor any form of IPC stuff in the first approach.
Than can be worked on later.
And BTW, I have two questions right away:
1. Do the VFS inode numbers change after boot ? (e.g. can I just store the inode info for open files?)
2. Is there any way to map inode numbers back to full path names ? (needed for migration, or if (1) is
not true)
--J.

Problem creating a new system call

Date: Sat, 11 Oct 1997 15:10:13 GMT
From: <sdesai@titan.fullerton.edu>
Thanks for your time, friend. I have a small request for you. My question is a little descriptive, so if you can,
please read it completely. I would appreciate it. Thanks.
Just to learn how to generate a system call, I created a simple system call that basically sets and resets the value
of a global parameter "PREFETCH".
**************** SYSTEM CALL code ********************************** int PREFETCH = 0; /*
this is a global variable */
int initialize (int nr_hints)
{
printk ("prefetch routine to be initialized\n");
if (PREFETCH == 1)
return 0;
PREFETCH = 1;
return 1;
}
int terminate ()
{
PREFETCH = 0;
return 1;
}
asmlinkage int sys_prefetch (int mode, int nr_hints)
{
printk ("prefetch system call called\n");
if (mode >= 0)
return initialize (nr_hints);
else
return terminate ();
}
**************** SYSTEM CALL code END**********************************
I included this code in /usr/src/linux/fs/buffer.c I then added the following line to arch/i386/kernel/entry.S
.long SYMBOL_NAME (sys_prefetch) /* 166 */
and changed
.space (NR_syscalls - 166)*4
to
.space (NR_syscalls - 167)*4
I then added the following line to include/asm/unistd.h

#define __NR_prefetch 166

To execute the sys_prefetch system call, I wrote a prefetch.c file with the following code.
************************code to call sys_prefetch***************** #include <linux/unistd.h> _syscall2
(int, prefetch, int, mode, int, nr_hints)
void main()
{
(few declarations and statements)
return_value = prefetch(1, 100); /* initialize */
printf ("%d", return_value);
}
******************************************************************
This code compiles and runs but always returns a -1 value and does not
even print the messages on the screen that I inserted using printk() in the
system call code in buffer.c
Since the messages are not getting printed, I have no way to know if the system call is getting called AT ALL
!!!
Thanks for reading it. If you have any insights into the problem, please let me know.
Thanks again. saurabh desai <sdesai@ecs.fullerton.edu>

How did the file /arch/i386/kernel/entry.S do its job
How did the file /arch/i386/kernel/entry.S do its job

Re: Problem creating a new system call
Date: Sun, 01 Mar 1998 04:18:04 GMT
From: Wang Ju <wangju@envst-1.ict.ac.cn>
Hi,All
I am a newer to KHG, I am sorry to ask
this trival problem.
while adding a system call, one need to edit the system_call_table to
add an entry,
what is it do with file entry.S and How?
Thanks

system call returns "Bad Address". Why?
system call returns "Bad Address". Why?

Date: Tue, 14 Oct 1997 20:44:22 GMT
thanks for your attention.

When I try to call my new system call, it always returns -1. I then check the error message using
perror( ). The error message is always "Bad Address".
Could anybody please tell me what can cause a system call to return "Bad Address" error. All my
variables are initialized and defined.
Thanks in advance for any insights from you.
saurabh desai <sdesai@ecs.fullerton.edu>

Re:return values
Re:return values
Re: system call returns "Bad Address". Why?
Date: Wed, 15 Oct 1997 18:04:38 GMT
From: C.H.Gopinath <gopich@cse.iitb.ernet.in>
I created the following call it is working fine, but i don't know about that bad
address.
sys call function looks like this.
int sys_print_data(char *s1,char *s2,int flag,int size1,int size2)

{
/* size1 and size2 are strlen of s1 and s2 respectively */
if (flag) {
sys_write(1,s1,size1);
return 1;
else {
sys_write(1,s2,size2);
return 2;
}
}
This is working fine. But i have another problem. I wrote a function int
string_len(char *s), which will return the
length of the string as follows.
int string_length(char *s)

{
int len=0;
while(*(s+len))
len++;
return len;
}
i am calling this in the sys_print_data instead of passing size1 and size2. Exactly
at this call it is saying
segmentation fault and dumping all the registers.
i also used the standard strlen call(linux/string.h).

It is also doing the same.
Can any body please clarify this.
Thanx in advance,
Gopinath

Re:return values

Re:return values
Re:return values
Re: Re:return values (C.H.Gopinath)
Date: Mon, 22 Dec 1997 08:41:22 GMT
From: Sameer Shah <ssameer@novell.com>
You cannot do the string_length because you are trying

to access a location that resides in the user space.
When switching to kernel mode, the data segment register
is changed to a location inside the kernel. But to allow
for such operations the kernel maintains address of
user's data segment in some other register (FS).
To access any string or some indirection data, you
have to actually copy that inside the kernel and then
you can go on with the normal strcpy functions. There
are few functions, I don't exactly recall their names
but with names like copy_fs_to_kernel, copy_kernel_to_fs
which allow you to copy between user and kernel spaces.
Just look at the implementation of some system call where

entire structures are passed (through a pointer to the
structure) e.g. ioctl() and you may need to do something
similar.
Hope this helps,

Sameer

possible reason for segmentation fault
possible reason for segmentation fault

Re: Re:return values (C.H.Gopinath)
Date: Thu, 16 Oct 1997 18:52:04 GMT
From: <unknown>
I tried to implement the system call, the way you did. That is
passing 2 strings s1, s2 and finding out their lengths. It did
give me segmentation fault.
I even tried to just print the strings within sys_print_data ()
using printk() as well as sys_write(), it did the same thing.
The message it gave was that the kernel was unable to
do paging at virtual address xxxxx..
I suppose there must be another way to pass strings to
the system call, but I don't know at this point. If I do in the
future, I will let you know.
saurabh desai.

Creating a new sytem call: solution

Date: Sun, 12 Oct 1997 06:32:03 GMT
The problem is using printk, it won't print on the stdout,

it will be stored in the buffers. If you want the data
to be displayed on the stdout use
sys_write(1,ptr,len);
where ptr is the string to be displayed and len is its length.
Using this you can check your sys call is created or not.
But regarding assigning a value to a global variable or local

variable i don't know. I am also struggling for the past
one week. I tried to assign and print a string in the system
call. It is compiling without any problem butwhen i try to
execute that, at the assignment it is giving
General Protection:000
and then dumping all the register values with Segmentation
Fault.
Can cny body explain why it is happeing like this.
By the by is there any kernel debugging tool for 4.2 kernel,

if so can anybody please give pointers for that.
Thanx in advance,
--
C.H.Gopinath
gopich@cse.iitb.ernet.in
Messages
1. Kernel Debuggers for Linux by sauru


Kernel Debuggers for Linux
Kernel Debuggers for Linux

Re: Creating a new sytem call: solution (C.H.Gopinath)
Date: Sun, 12 Oct 1997 13:22:06 GMT
First of all, I think that the segmentation fault you are getting
must be because of your system call code. You may want
to check it for any offending pointers.
There are debuggers for the Linux kernel. They are as
follows.
(1) xkgdb :- this is the debugger that allows you to debug the
---------
kernel by putting the break points. It was developed by
John Heidemann <johnh@isi.edu>. It was later revised by
Keith Owens <kaos@ocs.com.au>. The latest version of
xkgdb is available for kernel 2.1.55 which is an experimental
kernel (risk of crashing). If you want you can obtain it from
<http://sunsite.unc.edu/pub/Linux/kernel/v2.1/>.
(2) kitrace :- This debugger traces the system calls. It is
----------
very useful. It was developed by Geoff <geoff@fmg.cs.ucla.edu>
You can obtain it from :
<http://ficus-www.cs.ucla.edu/ficus-members/geoff/kitrace.html>
This debugger will run smoothly on your Red Hat 4.2; kernel
2.0.30 kernel.
Hope this info becomes useful. good luck saurabh desai <sdesai@titan.fullerton.edu>

problem with system call slot 167

Re: Creating a new sytem call: solution (C.H.Gopinath)
Keywords: printk SYMBOL_NAME sys_call_table entry.S unistd.h
Date: Thu, 22 Jan 1998 17:18:44 GMT
From: Todd Medlock <medlota@nu.com>
I am using linux v2.0.33. In

/usr/src/linux/arch/i386/kernel/entry.S
SYMBOL_NAME slots 164,165,166 are as follows:
.long 0,0
.long SYMBOL_NAME(sys_vm86)
I found that I could not add a new system call at 167. When
I did, it was called by something else for who knows what
reason. I know this because the only thing in my new system
call was a printk statement (which displays whenever the new
system call is called). With the system call at 167 I would
receive unwanted printk messages at boot time, at shutdown
time, and when I executed ifconfig! Hence, I put the
following at 167 and put my new system call at 168.
.long 0
That seems to have made everything work! Another strange

thing is that with my system call at 167 the insmod
function reports "unresolved symbol" messages and will not
install modules?? My guess is that one of the modutil
modules is using the 167 slot! Anyone have any ideas?
Regarding printk:
It is my understanding that printk messages will appear on

the console (I am assuming you are at a console if you are
modifying system calls and generating the kernel ) as long
as your message level is less then your log level. I use
<0> for testing to be sure of having the lowest message level.
For example:

printk("<0>syscallname: entering my test syscall\n");
This works for me.

Resetting interface counters
Resetting interface counters

Keywords: network interface counters
Date: Fri, 03 Oct 1997 22:10:44 GMT
From: Keith Dart <kdart@cisco.com>
Is there an ioctl or some other way to reset the network device counters (as shown in /proc/net/dev) to
zero?
TIA, Keith Dart

writing/accessing modules
writing/accessing modules
Keywords: module
Date: Wed, 01 Oct 1997 12:43:36 GMT
From: Jones MB <jonesmb@ziplink.net>
I am writing a module whose main purpose is to allow a user app to change the values of some
variables in the kernel's memory area. Using the modules in /usr/src/linux/drivers/net/ as a starting
point, I have been able to create the module. I can insmod and rmmod it successfully (configreed via
printk's to syslog). I am now looking for a way for the user level application to be able to access the
module. I searched high and low for info on how to do this with no success. Any pointers in the right
direction are most welcome.
Thanks
Messages
1. Use a device driver and read()/write()/ioctl() by Michael K. Johnson

Use a device driver and read()/write()/ioctl()
Use a device driver and read()/write()/ioctl()

Re: writing/accessing modules (Jones MB)
Keywords: module
Date: Wed, 01 Oct 1997 14:12:17 GMT
There's a whole huge section of the KHG on writing device drivers. Register a character device driver
and use either read()/write() or ioctl() to communicate between the user-level app and your module.

getting to the kernel's memory
getting to the kernel's memory

Re: writing/accessing modules (Jones MB)
Re: Use a device driver and read()/write()/ioctl() (Michael K. Johnson)
Keywords: module memory
Date: Fri, 03 Oct 1997 16:11:37 GMT
From: Jones MB <jonesmb@ziplink.net>
I have now been able to open and read/write to the module via a user level app. The point of the user
app/module combination was to allow some variables to be changed in the kernel. These are variables
which are created and malloc'ed when the module is loaded, so till the module comes up they do not
exist. Now the part of the kernel that will use these variables will not compile as at compile time it
does not know of the variables, so compiling fails. Is there a way to get around this?
JonesMB

use buffers!
use buffers!
Re: writing/accessirg modules (Jones MB)
Re: Use a device driver and read()/write()/ioctl() (Michael K. Johnson)
Re: getting to the kernel's memory (Jones MB)
Keywords: module memory
Date: Fri, 27 Feb 1998 13:27:39 GMT
From: Rubens <mytsplick@hotmail.com>
If you use a module with read() and write() functions, use the buffers that each functions has.
Example: when you write to the module, the read() function is called. read() stores the received data in
your buffer. Don't create and allocate variables, tranfer your data through buffers. If you have
questions, please send e-mail.
RUBENS

Help with CPU scheduler!
Help with CPU scheduler!

Keywords: scheduler
Date: Sun, 28 Sep 1997 05:45:53 GMT
From: Lerris <lerris@bigfoot.com>
Please help me if you can. I am doing a report on CPU scheduling in Linux. I have a copy of Linux
Kernel Internals, but that does not help me very much because I am just now learning C. If you have a
web page with a nice overview, or would be willing to provide one yourself, I would be eternally
grateful!

Response to "Help with CPU scheduler!"
Response to "Help with CPU scheduler!"

Re: Help with CPU scheduler! (Lerris)
Keywords: scheduler
Date: Mon, 29 Sep 1997 19:34:43 GMT
From: Jeremy Impson <jdimpson@syr.edu>
I wrote something for Linux kernel 2.0.30. I'm not sure if it has changed in 2.1.x, or will in 2.0.31.
But, it should still be helpful. Keep in mind that it is unedited, and no one (but myself) has checked it
for accuracy. I actually wrote it for inclusion here in the KHG, but haven't gotten around to submitting
it. It can be found at http://camelot.syr.edu/linux/scheduler.html Mr. michaelkjohnson, if you are
interested, please feel free to copy it from that location to the KHG, or, if necessary, edit it to your
heart's content. Just let me know when and if you do it. Thanks! --Jeremy Impson

Response to "Help with CPU scheduler!" (Redux)
Response to "Help with CPU scheduler!" (Redux)

Re: Help with CPU scheduler! (Lerris)
Re: Response to "Help with CPU scheduler!" (Jeremy Impson)
Keywords: scheduler
Date: Wed, 22 Apr 1998 00:52:45 GMT
The URL in my previous message has changed. For now, it is available at

http://source.syr.edu/~jdimpson/camelot/linux/scheduler.html
It may change again, after I graduate, start my new job, and get a new Web acccount.
--Jeremy

calling interupts from linux
calling interupts from linux

Keywords: interrupts callable from C.
Date: Thu, 18 Sep 1997 14:55:46 GMT
From: John J. Binder <binder@cs.berkeley.edu>
I'm trying to figure out how to run irq 0x10 from gcc so as to interact with the video card directly. I
believe it will have to be done with inline assembler.
The general question is "How do you make interupts work from gcc.
To say get the video mode (coded for with 0x0f in register ah when irq 0x10 is called) I tried a
fragment like:
___________
int ans;
__asm__ __volatile__ (
"movb $0x0F,%%ah\n\t" \
"int $0x10\n\t" \
"movl %%eax,ans\n\t" \
:"=memory" (ans) \
:
:"ax"
);
printf( "ans='%d'\n",(int) ans);
_____________
But it gives a segmentation fault.
Permissions? ioperm?? whats the answer?
Thanks John
Messages
1. You can't by Michael K. Johnson

You can't
You can't
Re: calling interupts from linux (John J. Binder)
Date: Thu, 18 Sep 1997 15:09:43 GMT
If you had read the KHG, you would have discovered that you can only access interrupts from kernel
code, not from user-level code. Please read what has already been written before you ask questions.

Calling BIOS interrupts from Linux kernel
Calling BIOS interrupts from Linux kernel

Re: You can't (Michael K. Johnson)
Date: Thu, 25 Sep 1997 06:45:59 GMT
From: Ian Collier <imc@comlab.ox.ac.uk>
John J. Binder asks:

I'm trying to figure out how to run irq 0x10 from gcc so as to interact with the video card directly.
Michael K. Johnson replies:
If you had read the KHG, you would have discovered that you can only access interrupts from kernel
code, not from user-level code.
May I take it then that it is possible from kernel code?
I have toured the KHG and found no mention of calling software interrupts. Receiving hardware
interrupts is of course covered, but this is different.
It would be nice to be able to write a device driver for VBE2 display devices. This would give the
application programmer access to the display, similarly to svgalib but with higher resolutions and
more colours. It would also allow an X server to be written for those devices which are currently
unsupported, including NeoMagic graphics adapters found in many laptops (including mine - hence
my interest in this subject).
The VBE2 interface requires one to call ìnt 0x10' with, among other things, the real-mode address of
a block of memory in ES:DI.
An alternative to the device driver would be to implement a system call like int86x() to emulate
software interrupts in real mode and export the address mapping functions to user level code.
imc.
Messages
1. Possible, but takes work by Michael K. Johnson

Possible, but takes work

Re: Calling BIOS interrupts from Linux kernel (Ian Collier)
Date: Thu, 25 Sep 1997 20:27:48 GMT
I'm sorry, I didn't notice that he was talking about software interrupts in the BIOS. Not paying close
enough attention...
In order to call anything in the BIOS, you need to put the processor in VM86 mode. From user mode,
it is possible to do that in vm86 mode if you map the BIOS into the process's memory. dosemu does
this in order to use the video bios. However, you won't be able to just call int 0x10. Linux reprograms
the interrupt controller on boot. You can put code in to see what slot 0x10 points at and save that
pointer and call it directly.
From kernel mode, you can look at how the APM bios calls are made in the file
/usr/src/linux/drivers/char/apm_bios.c and copy how it is done there. Even in kernel mode, you need to
get a pointer rather than blindly call int 0x10. And for 16-bit bioses, you need segment/offset, then use
them to do an lcall. See arch/i386/boot/setup.S and arch/i386/kernel/setup.c for how these kinds of
parameters get passed into the kernel on startup. The are recorded before the kernel goes into protected
mode at all.
Note that this is completely dependent on not booting with a protected-mode boot utility that starts up
the kernel already in 32-bit mode. Several such utilities exist, but they aren't much used at this point by
Linux folks.
However, calling the BIOS is slow. It's fine for the apm stuff that doesn't need to be
high-performance, but I wouldn't touch it for a video driver. Every call involves saving the processor
state, changing processor mode, calling the bios, changing processor mode, and restoring the processor
state. Assuming that you are calling into the kernel to do this, that's really an extra set of context
switches. If you are doing it in a user-level library, you have device contention to deal with, as well as
security issues, since you certainly need to be root to do this.
If I were in your shoes, I would try to use this interface only to set up a framebuffer, and then have the
X server write to memory which is mmap()ed to that framebuffer. That will probably be faster than
thunking through a 16-bit bios call level for every screen operation... There's a generic Linux
framebuffer interface that is used on Linux/m68k, Linux/PPC, Linux/SPARC, and I think other
platforms as well. You can start looking at that by reading /usr/src/linux/drivers/char/fbmem.c; I don't
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/117/1/1/1.html (1 di 2) [08/03/2001 10.13.05]

know the interface in any detail and can't help you beyond that.
If the bios is a 32-bit bios, you can skip saving state; that won't be such a problem. But since it wants a
real mode address for a block in memory, I doubt that's the case.
Good luck, but I won't be able to be much more help than this.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/117/1/1/1.html (2 di 2) [08/03/2001 10.13.05]

VBE video driver
VBE video driver

Re: Possible, but takes work (Michael K. Johnson)
Date: Fri, 26 Sep 1997 14:56:15 GMT
From: Ian Collier <imc@comlab.ox.ac.uk>
Thanks for your informative answer. I wonder if you can point me at any docs on VM86 mode.
Anyway, the idea was to create a device which represents the video memory that an application can
just mmap and write to. This won't need to go through the BIOS so no performance problems there.
VBE is supposed to provide a linear frame buffer without the need to do banking, and returns the
physical address of the frame buffer in one of the query functions that you call when you want to set
the video mode.
In order to change video modes, the application would (probably) do an ioctl on the device file. This
would need to go through the BIOS, but it will only be called a few times per application anyway.
There may be a separate device file where one can read and write the palette registers. VBE2 can give
you a protected-mode address to call to do this instead of going via the interrupt, so performance
should be acceptable. You also get a protected-mode interface for changing the viewport.
The application will of course require r/w permissions on the devices involved. The best way of doing
this might be to arrange that the devices get chowned when one logs in on the console (although this
will hinder silly tricks like switching from an X session to another virtual console and letting someone
else log in and also run an X session).
imc
Messages
1. VM86 mode at which abstraction level? by Michael K. Johnson
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/117/1/1/1/1.html [08/03/2001 10.13.05]

VM86 mode at which abstraction level?
VM86 mode at which abstraction level?

Re: Possible, but takes work (Michael K. Johnson)
Re: VBE video driver (Ian Collier)
Date: Fri, 26 Sep 1997 15:53:09 GMT
If you mean VM86 mode as implemented by the I386 and higher processors, you want James Turley's
Advanced 80386 Programming Techniques. Unfortunately, it is out of print. Fortunately, he has given
me permission to scan it in and put it on the web. Unfortunately, that's a slow process, nowhere near
completed.
If you mean to ask how a user-space program can use Linux's vm86() syscall, use "man vm86". You
may find a use to modify a process's ldt, in which case you will want to read "man modify_ldt". Those
man pages may be slightly obsolete -- check them against recent dosemu and/or Wine source code.
It seems clear to me from your description that your job should be relatively easy to do as a kernel
device driver, for two reasons:
1. You can make calls to some address from protected mode.
2. It sets up a framebuffer, and Linux already has a standard way to set up framebuffers through
ioctl's, as I already mentioned.
Given those two considerations, you shouldn't have to know anything about vm86 mode at all.
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/117/1/1/1/1/1.html [08/03/2001 10.13.06]

DVD-ROM and Linux? (sorry if it's off topic...)
DVD-ROM and Linux? (sorry if it's off topic...)

Keywords: DVD DVD-ROM MPEG ISO9660
Date: Wed, 17 Sep 1997 19:22:47 GMT
From: Joel Hardy <deeng@inficad.com>
Please forgive me (and don't flame me too hard) if this has

already been discussed or if there's someplace better to discuss
this. I know Linux should support any SCSI or IDE DVD-ROM drive, but
so far (at least to my knowledge), it'll only act like a CD-ROM
drive. I really don't know much about the DVD standard (if anybody
knows, please point me to some documentation!), but my guess is that
DVD-ROM discs (I think I saw Walnut Creek selling 4.7 gigs of stuff
on one disc) would probably just be the same ISO9660 standard, so
there wouldn't be any need to support anything extra with that. My
question is this: is there any support planned to read the MPEG
streams and whatever else a DVD player can get off of a DVD movie
disc? Does anybody know anything about the format that a DVD movie
is stored in? Does anybody even know if this'd be best implemented
as a new filesystem or something completely different? I'd really
love to get this working, but I'm a newbie to kernel programming, so
if there's anybody else out there with similar goals (and especially
some information about this!), please contact me!
-Joel Hardy (deeng@inficad.com)

DVD-ROM and linux
DVD-ROM and linux

Re: DVD-ROM and Linux? (sorry if it's off topic...) (Joel Hardy)
Date: Tue, 28 Jul 1998 14:44:16 GMT
From: <Yuqing_Deng@brown.edu>
We need to impelement UDF file system to read DVD-ROM on linux. There is a UDF project, check
out the URL
http://www.netcom.ca/~aem/udf/
What I understand is that, the encryt algorithms should only be impelemented on hardware according
to the DVD standard. That might acctually make life easier. We only have to write a drive for the
decoder card.

Response to DVD and Mpeg in Linux
Response to DVD and Mpeg in Linux

Date: Sun, 12 Apr 1998 09:30:24 GMT
From: Mike Corrieri <mc@imagine-software.com>
Well...
To the best of my understanding we could IF we could get the encryption software. But it could not be
under the GPL. It would have to be a 4sale application for Linux.
DVD movie roms have a special copy encryption, that is specific to an area the DVD is sold in. Pretty
scary, heh?
So, if you bought your unit in Europe, it would not work in the usa.
Anyways, if we could get the encryption software, under development OEM license, it would conflict
with the GPL. We would not be able to make the code public.
I would like to see an answer to this myself, having to continue running Win 95 ONLY FOR DVD
MOVIES!

DVD Encryption
DVD Encryption
Re: Response to DVD and Mpeg in Linux (Mike Corrieri)
Date: Tue, 26 May 1998 15:09:08 GMT
From: Mark Treiber <mrtreibe@engmail.uwaterloo.ca>
I checked the creative site and for their player you have to add the country code when its installed and
then its permanent. I'm assuming that they are talking about the drive so if the drive is already setup,
reading from it should be okay without worrying about the encryption.
Untitled
Untitled
Re: DVD Encryption (Mark Treiber)
Date: Mon, 01 Jun 1998 11:09:26 GMT
From: Tim <ice@space.net.au>
Yes, I believe the creative dvd bundled mpeg1&2 decoder card performs the decryption. It is possible
(but probably not legal) to change the region supported by the card many many times by writing some
info to a flashrom on the card. DVD under linux is possible, however am not sure if it is 100% legal. I
am not sure if DVD-ROMS support two different access methods - normal and dvd .

DVD?
DVD?
Re: DVD Encryption (Mark Treiber)
Re: Untitled (Tim)
Date: Thu, 16 Jul 1998 23:51:42 GMT
From: <unknown>
It's time we hacked it all up otherwise we're pretty much stuck with win98 ;( If
anyone needs a box to test on I got one ;) -FireBall

Kernel Makefile Configuration: how?
Kernel Makefile Configuration: how?

Keywords: Makefile, configuration
Date: Sun, 14 Sep 1997 03:41:59 GMT
I'm aware of the make config/menuconfig/xconfig etc., but a program

I am writing for Uni will require that I am able to directly
configure the Makefile(s)...
Can anyone give me a general overview of where all the

CONFIG_blah_blah_blahs go? I had a look at the Makefile, but it
wasn't very instructive. For example, in drivers/net/Makefile:
.
.
.
ifeq($(CONFIG_NE2000),y)
L_OBJS += ne.o
.
.
.
Where does CONFIG_NE2000 get defined? I'd appreciate it if someone

could tell me the general rule.
Thanks

How to add a driver to the kernel ?
How to add a driver to the kernel ?

Re: Kernel Makefile Configuration: how? (Simon Green)
Keywords: Makefile, configuration, drivers, epic100
Date: Mon, 22 Dec 1997 07:08:52 GMT
From: jacek Radajewski <jacek@usq.edu.au>
Hi All,
I am in the process of rebuilding our beowulf cluster system and have to include
support for channel bonding and the epic100 SMC card. So far the
epic100.o module works fine, but I need to compile epic100.c into the
kernel. (I need a monolithic kernel to boot clients of a floppy disk and
mount / via NFS). Anyway, I put an antry in .config "CONFIG_EPIC100=y",
in drivers/net I put :
ifeq ($(CONFIG_EPIC100),y)
L_OBJS += epic100.o
endif
in the Makefile and compiled the kernel. (epic100.o did compile)
I added an entry in /etc/lilo.conf :
append="root=/dev/hda4 mem=256MB ether=0,0,eth0 ether=0,0,eth1

ether=0,0,eth2"
and ran lilo
"I have 2 SMC cards and 3C900 for the outside world"
When I rebooted the machine only 3C900 was detected, but I have no
problems loading the module.
Questions.
1. What did I do wrong in the kernel configuration ?

2. Is there any documentation available on how to add drivers to the
kernel ?

See include/linux/autoconf.h
See include/linux/autoconf.h
Re: Kernel Makefile Configuration: how? (Simon Green)
Keywords: Makefile, configuration
Date: Mon, 13 Oct 1997 12:05:56 GMT
From: Balaji Srinivasan <BalajiSrinivasan>
When you run make config (or its siblings) it creates a file in include/linux directory. This file
(autoconf.h) is included in include/linux/config.h in all the required C files...
For the makefile the place that these config options are specified is in the .config file in the
TOPLEVEL directory.
Hope this helps balaji

Multiprocessor Linux
Multiprocessor Linux
Keywords: SMP multiprocessor
Date: Tue, 09 Sep 1997 19:18:33 GMT
From: Davis Terrell <caddy@csh.rit.edu>
If anyone could tell me how or point to information on setting up Linux 2.x (RedHat 4.2) for SMP
support I would be very grateful... thanks...

Building an SMP kernel
Building an SMP kernel

Re: Multiprocessor Linux (Davis Terrell)
Date: Wed, 10 Sep 1997 11:03:27 GMT
1. Get the latest 2.0.x kernel sources from ftp.kernel.org. I recommend waiting for 2.0.31, which
should be out soon (as of this writing) and fixes some deadlocks in SMP.
2. Unpack it in /usr/src
3. Edit the Makefile and uncomment the SMP = 1 line near the top.
4. make config and choose your configuration.
5. make clean; make dep
6. make zImage; make modules
7. Move the kernel into place, make modules_install
8. Run lilo and reboot. Keep a known-working kernel around to revert to
Note: You must make new modules for your SMP kernel. Loading modules that are built for a
non-SMP kernel into an SMP kernel (and vice versa) breaks systems horribly.

SMP and module versions
SMP and module versions

Re: Multiprocessor Linux (Davis Terrell)
Re: Building an SMP kernel (Michael K. Johnson)
Date: Tue, 28 Oct 1997 21:20:40 GMT
From: <linux@catlimited.com>
I have not had any trouble configuring and compiling the kernel and modules for SMP, but I cannot
get it to boot properly. It complains about the module versions and won't load them. There must be
some aspect of the config I am missing. Can anyone tell me what I am missing here?
Thanks Bruce

Improving event timers?
Improving event timers?

Keywords: event timers improvement
Date: Fri, 05 Sep 1997 16:15:40 GMT
From: <bodomo@hotmail.com>
I want to evaluate my new implementation of event timers, and

compare its performance with the one I have (2.0.29) for different
usage patterns.
Three questions about add_timer, init_timer, del_timer:
- I want to know how existing apps (x,tcp/ip,ppp,latex, etc) use

event timers. Can I just trace some syscall using strace (which
ones?)? The alternative would be instrumenting the kernel to keep
track of the calls, that would take me more time coding.
- are these primitives C library wrappers around system calls that

do the same, or does the C part implement more functionality on top
of the kernel part?
- which library contains the primitives for C? gcc can't compile

because the linker can't find add_timer, ... I tried to find where
they are with "nm", but they don't seem to be in any file in
my/usr/lib. Should I download another library from some internet
site?
Thanks a lot,
Guillermo
bodomo@hotmail.com

measuring time to load a virtual mem page from disk
measuring time to load a virtual mem page from

disk
Keywords: virtual memory linux timing time
Date: Tue, 02 Sep 1997 13:23:33 GMT
From: <kandr@giasmd01.vsnl.net.in>
Is it possible to track the number of page-faults that occured during the course of a Linux session? I
want all kinds of intricate details like how long it took from the time of the page fault occuring to the
time the system recovers by loading the page from disk. (I know the the 'time' command can give the
number of page-faults but I also need to know the total time taken to service the page faults.)
If a readymade utility is not available, would anyone please suggest ways in which to modify the
kernel so that I can collect these statistics.
Thanks in advance. And cheers to the KHG!
-- Ranganathan <kandr@giasmd01.vsnl.net.in>

using cli/sti() and save_flags/restore_flags()
using cli/sti() and save_flags/restore_flags()

Date: Wed, 30 Jul 1997 21:43:57 GMT
From: george <georgel@cs.ucla.edu>
snippets of kernel code look like the following

save_flags(flags);
cli();
...
does restore_flags() somehow magically call sti()?
the answer: yes.
why/how?
remember that interrupt enable is _also_ a flag. thus restoring it would take care of everything. be
careful _not_ to simply call sti(). sti() would enable interrupts, even if they were disabled to start with.
in this case, you have just messed things up, as the example below illustrates
foo(){
cli();
...
sti();
}
bar(){
cli();
foo(); /* must not mess with cli() setting */
this_must_have_ints_off();
sti();
foo();
}
this was in response to a query i had about their use and there wasn't anything in the khg about this.
hence, i am adding this knowledge so others may know.

Protected Mode
Protected Mode
Date: Thu, 24 Jul 1997 02:10:25 GMT
From: ac <unknown>
Please need information about protected mode. I've been looking for books but they only give you
routines to enter and leave from pm. Is there any good reference? such as information about
TSS,LDT,etc. I don't want an introductory text,but I'm wondering if there exists some book about this
(if posible with example sources)
thanks
ac
Messages
1. Advanced 80386 Programming Techniques by Michael K. Johnson


Re: Protected Mode (ac)
Date: Thu, 24 Jul 1997 03:22:21 GMT
The book you really want is Advanced 80386 Programming Techniques. Unfortunately, it is out of
print.
Fortunately, the author, Jim Turley, has expressed interest in getting the book on the web. That will be
a somewhat long process, but at some point it should actually be available. When it is available, there
will be a link to it in the annotated bibliography. In the meantime, you'll just have to browse your local
bookstore for useful books on the subject.

'Developers manual' from Intel(download)...
'Developers manual' from Intel(download)...

Re: Protected Mode (ac)
Date: Wed, 15 Oct 1997 21:22:56 GMT
From: Mats Odman <mats-odm@dsv.su.se>
If you have the time, there are "Pentium Processor Family Developers Manual, vol3 Architectur and
programming manual" available for download from Intel, only problem is: it's about 1000 pages to
print :-)

DMA buffer sizes
DMA buffer sizes

Date: Tue, 22 Jul 1997 15:19:29 GMT
From: <nomrom@hotmail.com>
I want to allocate a DMA buffer larger than 128*1024. Is it possible to configure the kernel to allow
bigger DMA buffer sizes? I've attempted to increase the PAGE_SIZE to 8192 but that crashes the
system.

DMA limits
DMA limits
Re: DMA buffer sizes
Keywords: DMA limit
Date: Sat, 26 Jul 1997 02:16:07 GMT
From: Albert Cahalan <acahalan at cs.uml.edu> <unknown>
> I want to allocate a DMA buffer larger than 128*1024.
> Is it possible to configure the kernel to allow bigger
> DMA buffer sizes? I've attempted to increase the PAGE_SIZE
> to 8192 but that crashes the system.
I hope that is not PC hardware! The ISA bus has a hard limit of 128 kB (16-bit DMA) or 64 kB (8-bit).
Even 128 kB is hard though, because memory fragmentation makes it unlikely that you can allocate a
contiguous 128 kB chunk under the 16 MB ISA DMA limit (or elsewhere).

Not page size, page order
Not page size, page order

Re: DMA buffer sizes
Date: Thu, 24 Jul 1997 03:25:43 GMT
You want to change the largest order of pages available from 5 to 6 -- that will give you twice as large
regions.

Problem Getting the Kernel small enough
Problem Getting the Kernel small enough

Keywords: Kernel too large
Date: Mon, 21 Jul 1997 00:54:30 GMT
From: <afish@tdcdesigncorps.com>
I've been trying to get my Kernel small enough to boot from a floppy (zImage and
bzImage). Removed all unecessary drivers, etc.. however I still can't get it on a
floppy. I know its possible anyone have any global ideas about what exactly I am
ignorant about. By the way I've read all the Howto's and tried it on both 2.0.xx and
2.1.xx kernels and still get the same problem.

Check it's the right file, zImage not vmlinux
Check it's the right file, zImage not vmlinux

Re: Problem Getting the Kernel small enough
Date: Fri, 16 Jan 1998 22:39:25 GMT
From: Cameron <cls@greens.org>
I get questions like that every week, from people making LILO floppies. The most common problem is
that they are trying to install
/usr/src/linux/vmlinux
on their floppy.
That's the wrong file! It's too big!
The file you probably wanted was
/usr/src/linux/arch/i386/boot/zImage

Usually easy, but....
Usually easy, but....

Re: Problem Getting the Kernel small enough
Date: Thu, 15 Jan 1998 08:32:12 GMT
From: Ian Carr-de Avelon <ian@emit.com.pl>
Normally it is easy. A typical zImage is 300-400kb so fits fine on a 1.4MB floppy. Obviously it
depends on what you include, but normally you can get a whole compressed file system on there as
well and run Boot/Root. Like in the HOWTO of the same name. How big is your kernel and floppy
Ian

How to create /proc/sys variables?
How to create /proc/sys variables?

Date: Fri, 11 Jul 1997 11:08:50 GMT
From: Orlando Cantieni <ocantien@itr.ch>
Hi there
I'im going to do a kernel hack. I plan to insert a static variable (that is: static int) into the modified
kernel. The value of it should be tunable by an external program.
Now, How to do so ? I think best way is to create a variable into the /proc/sys mirror and tune it by
echo. But how can I create such a file ? Does sysctl() do something ?
Thanks for any help
Orla
-- Orlando Cantieni .. ITR Rapperswil ocantien@itr.ch .. www.itr.ch/~ocantien
"ERROR: Volume Earth is full up to 99%. Please delete some users."

Linux for NeXT black?
Linux for NeXT black?

Keywords: port NeXT
Date: Thu, 03 Jul 1997 02:49:45 GMT
From: Dale Amon <amon@gpl.com>
Could some one email me some good starting points for the Linux system porting bootstrapping
process? I'm looking at what it would take to do a port to NeXT black.

giveing compatiblity to win95 for ext2 partitions (for programmers forced to deal with both)
giveing compatiblity to win95 for ext2 partitions (for

programmers forced to deal with both)
Date: Sat, 28 Jun 1997 01:05:46 GMT
From: pharos <noonie@toledolink.com>
well working on a deviced driver to use linux partitions is not easy, I wanted to
make ext2 partitions look like a cdrom so I can use mscdex to give it a drive letter.
I would appreciate any low level info on ext2 partitions or if something like this
extists, for me to be told so I don't waste my time coding it ;P
hows this sound?
devicehigh=c:\pharos\ext2.sys -d:mscd0002 -p=/dev/hda8

Well, What's the status of the Windows / Dos driver for Ext2?
Well, What's the status of the Windows / Dos driver for

Ext2?
Re: giveing compatiblity to win95 for ext2 partitions (for programmers forced to deal with both) (pharos)
Keywords: win95 ext2 driver
Date: Thu, 14 May 1998 16:57:15 GMT
From: Brock Lynn <brock@cyberdude.com>
Just wondering what the status is. I can't seem to get to the address:
http://www.globalxs.nl/home/p/pvs/. Looking to see if there has been an update, but can't get there.
Though a few months ago I had someone on irc.linpeople.org go there and send me the proggie via
email. It didn't work so great, and every time I mounted an ext2, and then opened a few directories as
win95 folders Win95 would go down in flames.
I tried the ext2tools and they work quite well, but very awkward.
Anyluck with the IFS, or mscdex ???
Microsoft is such a turd... Why not open up their standards so more people can develop for it. Is MS
afraid that someone smarter than they are will come along and actually improve upon their
standards??? (probably so :)
Well, what's the word?
Brock Lynn brock@cyberdude.com

Working on it!
Working on it!
Keywords: ext2 for Win95, WinNT, and maybe DOS
Date: Tue, 05 Aug 1997 15:26:48 GMT
From: ibaird <topdawg@grfn.org>
I'm currently working on a driver for Windows 95 and Windows NT that

will run under the IFS (Installable File System) for Win32. I'm at
the stage right now where I can successfully read in the superblock
and the group descriptors of a test ext2 partition in my concept
testing program and am about ready to attempt to read the root
directory. In a few days (hopefully) I'm going to be able to read
some files from the drive. Write access will looks like it will take
longer (because of the allocation algorithms and other bull).

revision
revision
Re: Working on it! (ibaird)
Date: Fri, 08 Aug 1997 13:51:47 GMT
From: ibarid <topdawg@grfn.org>
Since I first wrote my message I've discovered Microsoft wants about

$1,000 for any info regarding their installable file system. Does
anyone know anything about it?

Untitled
Untitled
Re: Working on it! (ibaird)
Re: revision (ibarid)
Date: Fri, 14 Nov 1997 05:33:58 GMT
From: Olaf <in5y003@public.rrz.uni-hamburg.de>
I have tried to find information about the IFS before, actually for doing the same thing (an ext2fs
driver for Win95, WinNT and DOS). These are the best references I found:
- Inside the Windows 95 File System by Stan Mitchell,
O'Reilly in May '97
Gives many details (esp. on VFAT and IFSMgr) and includes
Multimon, a tool to sniff on INTs, VXDs and so on.
- Systems Programming for Windows 95 by Walter Oney,
Microsoft Press in '96 (he claims to have no affiliation
with MS other that they published his book).
Is more general than the above but has a very interesting
chapter about the IFS. Describes how to write a VXD.
There are other interesting books about DOS/Windows internals, especially those written by Andrew
Schulman and Geoff Chapell. (Write me if you need references).
I stopped working on this due to lack of time. But my approach started out as this: I was looking at the
ext2tools, a package for DOS providing rudimetary ext2 access through special commands (like e2dir,
e2cat and so on), without providing a drive letter. They were build from a snapshot of the ext2fs kernel
sources, glued together with a library doing regular expressions (for filename matching) and getting a
pointer to the partition through an environment variable. The disk accesses were done via the plain
BIOS IRQ 13.
I wanted to make all of this into a drive letter based approach and wanted to put together the current
ext2fs from the linux kernel, get a VXD running and answer IFS requests.
Someone else seems to have a read-only version of this running now. You should perhaps contact
Peter van Sebille or read his page at http://www.globalxs.nl/home/p/pvs/. You can find the driver there
as well.

setsockopt() error when triying to use ipfwadm for masquerading
setsockopt() error when triying to use ipfwadm for

masquerading
Date: Sat, 21 Jun 1997 20:05:37 GMT
From: <omaq@encomix.es>
Hi, I´m having such as a nasty problem...setsockopt(), I´d like my LAN in internet
masquerading all the 192.168.1.X directions as the real one on my providers side. I
have compiled 2.0.27 with IP forwarding and IP firewalling....that´s correct ?????
Thanks in advenced for any clue.

Re: masquerading
Re: masquerading
Re: setsockopt() error when triying to use ipfwadm for masquerading
Keywords: masquerade
Date: Mon, 23 Jun 1997 13:57:46 GMT
From: Charles Barrasso <charles@blitz.com>
I am assuming when you compiled the kernel you tured on masquerading too. Right?
you can check by doing
ls /proc/net/ and you should see
ip_masquerade if you don't then that could be why. I have never gotten that error so if it is not that I
don't konw what to tell you.
Hope this helps,
Charles

reset the irq 0 timer after APM suspend

Keywords: timer, APM
Date: Fri, 20 Jun 1997 21:34:04 GMT
From: Dong Chen <chen@ctp.mit.edu>
Hello,
On my AST J50 (P133) notebook, the timer on irq 0 resets itself from
interrupting at 100 Hz to 18.3 Hz (DOS default) after a suspend/resume.
What is the right way to re-initialize it back to 100 Hz without a reboot?

Thanks,
Dong Chen
chen@ctp.mit.edu
---------------------------------------------------------------------
BTW, I tried to modify
linux/drivers/char/apm_bios.c
in function suspend() at line 639 (kernel 2.0.30) from
err = apm_set_power_state(APM_STATE_SUSPEND);
if (err)
apm_error("suspend", err);
set_time();
to
if (err)
save_flags(flags);
cli();
/* set the clock to 100 Hz */
outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */
outb_p(LATCH & 0xff , 0x40); /* LSB */
outb(LATCH >> 8 , 0x40); /* MSB */
set_time();
But this does not work. All programs crash after suspend() is called.


Re: fixed, patch for kernel 2.0.30

Re: reset the irq 0 timer after APM suspend (Dong Chen)
Keywords: timer, APM
Date: Fri, 27 Jun 1997 15:44:46 GMT
From: Dong Chen <chen@ctp.mit.edu>
(This is the message I sent to the linux-kernel mailing list)
Hi,
This is a patch for "drivers/char/apm_bios.c", it fixes the following

problems:
(1) On some notebooks (AST J series, for example), the timer on interrupt 0
is reset to DOS default: 18 Hz. This patch re-initialize it to 100 Hz.
Thanks to Pavel (pavel@Elf.mj.gts.cz) for pointing out to me that I should
add some delays after the outb_p() and outb() calls.
(2) The clock is not correctly restored after a standby().
There are still some problems with not getting the correct time after APM
suspend or standby, namely before the first suspend() or standby()
call, if the clock is already slowed by CPU_IDLE call, then the estimate
time zone "clock_cmos_diff" would be wrong. Ideally, "clock_cmos_diff"
should be setup at boot time after the time zone is set. But that
will require changing code other than "apm_bios.c". Also, APM will not
correct for the change between daylight savings time and normal time.
Dong Chen
chen@ctp.mit.edu
---------------------------CUT HERE-------------------------------------
--- drivers/char/apm_bios.c.orig Mon May 26 11:05:15 1997
+++ drivers/char/apm_bios.c Tue Jun 24 12:09:06 1997
@@ -73,6 +73,18 @@
#include <linux/miscdevice.h>
#include <linux/apm_bios.h>
+/*
+ * INIT_TIMER_AFTER_SUSPEND: define to re-initialize the interrupt 0 timer
+ * to 100 Hz after a suspend.
+ */
+#define INIT_TIMER_AFTER_SUSPEND

+
+#ifdef INIT_TIMER_AFTER_SUSPEND
+#include <linux/timex.h>
+#include <asm/io.h>
+#include <linux/delay.h>
+#endif
+ static struct symbol_table apm_syms = {
#include <linux/symtab_begin.h>
X(apm_register_callback),
@@ -627,28 +639,53 @@
int err;
- /* Estimate time zone so that set_time can

- update the clock */
- save_flags(flags);
- clock_cmos_diff = -get_cmos_time();
- cli();
- clock_cmos_diff += CURRENT_TIME;
- got_clock_diff = 1;
- restore_flags(flags);
+ if (!got_clock_diff) {
+ /* Estimate time zone */
+ save_flags(flags);
+ clock_cmos_diff = -get_cmos_time();
+ cli();
+ clock_cmos_diff += CURRENT_TIME;
+ got_clock_diff = 1;
+ restore_flags(flags);
+ }
if (err)
+
+#ifdef INIT_TIMER_AFTER_SUSPEND
+ cli();
+ /* set the clock to 100 Hz */
+ outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */
+ udelay(10);
+ outb_p(LATCH & 0xff , 0x40); /* LSB */
+ udelay(10);
+ outb(LATCH >> 8 , 0x40); /* MSB */
+ udelay(10);
+#endif
+
set_time();
}
static void standby(void)

{
+ unsigned long flags;
int err;
+ if (!got_clock_diff) {
+ /* Estimate time zone */
+ save_flags(flags);
+ clock_cmos_diff = -get_cmos_time();
+ cli();
+ clock_cmos_diff += CURRENT_TIME;
+ got_clock_diff = 1;
+ restore_flags(flags);
+ }
+
err = apm_set_power_state(APM_STATE_STANDBY);
if (err)
apm_error("standby", err);
+ set_time();
}
static apm_event_t get_event(void)

Source Code in C for make Linux partitions.
Source Code in C for make Linux partitions.

Keywords: source code, partition
Date: Mon, 09 Jun 1997 07:24:11 GMT
From: Limbert Sanabria <limbert@bo.net>
Where I can get the Source Code in C for make Linux partitions?, Any idea about where, I can get
information about this?
Please reply cc to limb@hotmail.com or limbert@bo.net
Thanks.
Limbert
Messages
1. Untitled by lolley

Untitled
Untitled
Re: Source Code in C for make Linux partitions. (Limbert Sanabria)
Keywords: source code, partition
Date: Tue, 10 Jun 1997 03:09:08 GMT
From: Wu Min <wumin@sunny.bjnet.edu.cn>
I think you can see fdisk's source code. Maybe helpful.

How can I "cheat" and change the IP address (src,dest) in the sent socket?
How can I "cheat" and change the IP address

(src,dest) in the sent socket?
Date: Sat, 07 Jun 1997 17:18:36 GMT
From: Rami <s2623500@cs.technion.ac.il>
Hi, Me and my partner are trying to implement a LocalDirector (as a project in network course), so we
want to change the IP address in the socket to be sent to a local server from the LocalDirector.
Thanks
Messages
1. Do it in the kernel by Michael K. Johnson

Do it in the kernel
Do it in the kernel
Re: How can I "cheat" and change the IP address (src,dest) in the sent socket? (Rami)
Date: Sat, 14 Jun 1997 01:31:25 GMT
The only way to do this in user space is to do something like diald, where a program talks SLIP (or
you might choose PPP) to the kernel over a pty or two, and routes traffic back and forth through itself,
making modifications.
The more reasonable way to do this is to put it in the generic network filtering. You can either do
simple rewrites with the existing firewall tools or write your own firewall modules and drop them into
the stack. That way you can give yourself the option of making arbitrary modifications to packets on
their way in and/or out of the system.
Read Network Buffers And Memory Management first to learn about how the networking stack
works, then read the ipfwadm code and the relevant kernel code. Good luck.

Transparent Proxy
Transparent Proxy
Keywords: ip network address transparent proxy masquerade
Date: Mon, 22 Jun 1998 22:16:01 GMT
From: Zygo Blaxell <zblaxell@furryterror.org>
Linux Transparent proxy support (part of the firewalling stuff) is designed to do exactly this.
There are basically two "halves" to transparent proxy:
1. You can bind to any address you like, instead of choosing from the addresses of interfaces on
the machine.
2. You can collect SYN packets (generated by clients doing connect) on a port of your choice.
You can do a getsockname to find out what address+port number the client thinks it
connected to, and there are more fields in the "from" parameter of recvfrom that you can use
to find out where a datagram was destined.
So if you want to connect to a server while pretending to have some other IP address, you simply do a
bind system call on the socket before connecting. The address you bind to is the address you want
to appear to be. This is just like doing a bind with a specific IP address or port number when you
want a specific network interface or when you want a port number below 1024 for rcmd-based
services, except that now you specify an IP address other than your own.
If you're doing UDP, then you might want to do this with the sendto and recvfrom system calls,
in which case the source address is specified in the second 8 bytes of the socket address for the
destination address in sendto and vice-versa for the source address in recvfrom.
Put another way, when you do a sendto, you put the destination address in the "to" parameter as
usual, but you also put the desired source address (which is not the "usual" one) in the "to" parameter
+ 8 bytes. Note that you must OR in MSG_PROXY to the flags parameter for sendto/recvfrom.
Note that in order to use any of the transparent proxy features you must be root. Generally this is
most useful when the host doing transparent proxy is a gateway or router of some kind, because
impersonating host A when connecting to host B will only work if host B will normally try to send
packets to host A through your host.

Untitled
Untitled
Date: Tue, 17 Mar 1998 08:56:35 GMT
From: <qwzhang@public2.bta.net.cn>

Untitled
Untitled
Date: Sun, 14 Dec 1997 11:35:47 GMT
From: <navin97@hotmail.com>
how to change IP number? when I chat in IRC, I don't want somone know where am I from? then,
wanna change IP number... can I? could you please let's me know?... thanks!

Changing your IP address is easy, but...
Changing your IP address is easy, but...

Re: Untitled
Keywords: ip firewall router socket network address
Date: Mon, 22 Jun 1998 21:58:15 GMT
From: Zygo Blaxell <zblaxell@furryterror.org>
If you change your IP address on your client's socket to an "anonymous" IP address (one on a different
physical subnet than your own assigned address), you will not be able to receive replies sent to that IP
address unless you also manipulate the routing tables of all of the routers between the IRC server and
your "anonymous" client. You probably can't do that, so it's not actually useful to know how.
Note that if your machine is physically on an ethernet segment with a subnet, you could just change
your machine's IP address to a different address within the same subnet, which would obscure your
identity with that of another user on the same subnet (i.e. "they" will know what company or ISP
you're from but not which particular user, unless they have some other information to identify you).
Cable modems are good for this, as they often have little security or accounting and lots of spare
addresses to choose from.
Kids, don't try this at home. People who can afford lawyers get seriously offended if you steal their
vacant IP addresses.

You have to know a bit of C (if u wanna learn) ;)
You have to know a bit of C (if u wanna learn) ;)

Re: Untitled
Date: Thu, 15 Jan 1998 00:12:45 GMT
From: Lorenzo Cavallaro <lc529863@silab.dsi.unimi.it>
Yo,
You have either to write your own client IRC, or to modify
the source Code
one of the most famous client IRC: ircII (I dunno the actual
release ... maybe 2.9.x).
If u dont have ircII, well it's not hard to find it on the

Net ...
For further info, check out www.2600.com and look for the
FAQ (even old ones)
Bye

Untitled
Untitled
Date: Thu, 10 Jul 1997 01:53:57 GMT
From: <unknown>

Where is the source file for accept()
Where is the source file for accept()

Keywords: Networking Systemcall
Date: Wed, 04 Jun 1997 19:12:37 GMT
From: <kaixu@hocpa.ho.lucent.com>
Can anybody point me to the Linux source file that contains the implementation of accept() system call
? Thanx
Messages
1. Here, in /usr/src/linux/net/socket.c by wumin@netchina.co.cn

Here, in /usr/src/linux/net/socket.c
Here, in /usr/src/linux/net/socket.c
Re: Where is the source file for accept()
Keywords: Networking Systemcall
Date: Thu, 05 Jun 1997 02:04:39 GMT
From: Wu Min <wumin@netchina.co.cn>
asmlinkage int sys_accept(.....

How can I use RAW SOCKETS in UNIX?
How can I use RAW SOCKETS in UNIX?

Date: Mon, 02 Jun 1997 15:34:59 GMT
From: Rami <s2623500@cs.technion.ac.il>
Messages
1. Re: Raw sockets by genie@risq.belcaf.minsk.by

Re: Raw sockets
Re: Raw sockets

Re: How can I use RAW SOCKETS in UNIX? (Rami)
Date: Wed, 04 Jun 1997 13:15:33 GMT
From: <genie@risq.belcaf.minsk.by>
To use RAW sockets in Unix it it mandatory that one be a root . To create RAW socket just write:
s=socket(AF_INET,SOCK_RAW,<protocol>). Then you can do anything you want with it (sending,
receiving). However you have to perform all necessary operations, according to the protocol you use
(create headers (IP+TCP(UDP,ICMP,...)) and make all neccessary negotiations (TCP:
SYN->ACK->....RST....ACK....).

the KHG in spanish?
the KHG in spanish?

Date: Tue, 27 May 1997 03:28:04 GMT
From: Jorge Alvarado Revatta <j_alvarado@geocities.com>
where it's ?
Jorge Alvarado Revata
Universidad Nacional de San Marcos Lima Peru
Messages
1. No esta aqui! Pero... by Michael K. Johnson

No esta aqui! Pero...
No esta aqui! Pero...

Re: the KHG in spanish? (Jorge Alvarado Revatta)
Date: Sat, 14 Jun 1997 01:42:51 GMT
There is no one translating the KHG into Spanish. However, if you want to start, I'll be glad to put a
pointer into the "Other Sources of Information" section, or even at the top level. Just let me know...

Si tenga preguntas, quisa yo pueda ayudarte.
Si tenga preguntas, quisa yo pueda ayudarte.

Keywords: spanish translation
Date: Tue, 09 Sep 1997 03:18:41 GMT
From: <KernelJock>
Yo tengo poco experiencia en traduscas technicas, pero

necesito practicarla. Hace mucho tiempo que lo he
practicado.
Si tenga preguntas, quisa yo pueda ayudarle.
ciao

Tengo una pregunta
Tengo una pregunta

Re: Si tenga preguntas, quisa yo pueda ayudarte.
Date: Thu, 04 Jun 1998 16:05:56 GMT
From: <riderghost@rocketmail.com>
puedo echar a un cerdo de Internet, desde mi computadora?

Español
Español
Re: Si tenga preguntas, quisa yo pueda ayudarte.
Date: Thu, 09 Oct 1997 21:38:05 GMT
From: LL2 <jluis@itzcoatl.fi-c.unam.mx>
Si bajo la información.. ¿Cómo puedo utilizarla?

LLL

How to get a Memory snapshot ?
How to get a Memory snapshot ?

Date: Sat, 24 May 1997 14:42:34 GMT
From: Manuel Porras Brand <guporras@calvin.univalle.edu.co>
Hi
In my course of Operating Systems my final consists of the following:
If i'm using Linux as OS and suddenly the power shuts down, how can i now what was in the system
runnig at that moment (process, jobs, users, etc ). Maybe getting a snapshot of the memory blocks
where Linux puts this info. If this is the solution , how can i do it? If not, which one could it be ? Or
where in the web can i find docs or any information about this ?
Thanks.
PS : Sorry for my English.
Messages
1. Why not to get a memory snapshot? by Jukka Santala

Why not to get a memory snapshot?
Why not to get a memory snapshot?

Re: How to get a Memory snapshot ? (Manuel Porras Brand)
Keywords: UPS memory-image
Date: Fri, 30 May 1997 21:37:28 GMT
From: Jukka Santala <e75644@uwasa.fi>
Altough I normally disagree with study assignments "done" thru Usenet etc. (check any resource on
netiquette etc. :P) that assignment sounds weird enough to warrant a quick note.
That being that "technically", there's little you can do once the power goes down... the first warnign
you get is, well, err, when the power goes down and the processor stops executing instructions ;)
Incidentally (and luckily, for many) there's a thing called Uninterruptable Power Supply, or UPS for
short, which can warn the operating-system once the power goes bad, and feed emergency power to
the system until it has managed a controlled shut-down or the power goes up again.
Ofcourse, if you "simply" need to know _what_ the OS was doing when the power went down, that
may be a bit of an overkill. This is how we finalyl get into slightly kernel-related stuff. One approach
might be to let kernel log all program executions/forks into syslog thru syslogd - however, since all
disk-access is cached this will be slightly out of date if the computer just suddenly gets switched off. A
bit of clever coding might be used to avoid the cache in writing (Or does syslog already do that?
Would make sense) or another solution might be to use battery-backed bubble-ram or something
similiar with very short access times if one is worried about performance.
But shortly put, memory-images once the power actually goes down are out of question ;) However, if
the issue is simply about kernel bug-tracking... well, that's another issue (and more appropriate for this
place, I might add :P) indeed.

Why you would want to get a memory snapshot
Why you would want to get a memory snapshot

Re: How to get a Memory snapshot ? (Manuel Porras Brand)
Re: Why not to get a memory snapshot? (Jukka Santala)
Keywords: UPS memory-image, Power-Fail Interrupt, power-fail recovery
Date: Tue, 11 Aug 1998 17:07:18 GMT
From: Dave M. <DGMDGM@INAME.COM>
If I understand the question correctly, one reason you would want to get a memory snapshot before a
power failure would be in the case where you are designing a system that does not have a UPS but
does have a Power Fail Interrupt.
This is a concept that was used back in the mini-computer days. When the power supply sensed a drop
in input voltage it would generate an interrupt that would notify the system that power was about to be
lost. The power supply was designed to continue to provide power for a number of milliseconds
(typically 60-120) after the loss of line power. This would give the OS just enough time to stash
memory and the state of all registers in an image that could be retrieved upon power-up. When power
was restored, this image could be used to restart the system where it left off. It could also be analyzed
during the boot process and used to direct recovery operations upon restart.
The need for this type of power-fail restart may not be immediately obvious to a PC user but if you are
using your computer to perform some type of machine control or instrumentation monitoring, where
power-fail recovery is critical, then knowing what the OS was doing at the time of power loss is very
important.
Dave M.

resources hard limits
resources hard limits

Date: Thu, 22 May 1997 19:29:43 GMT
From: <castejon@kbw350.chem.yale.edu>
How the resources limits (filesize, cpu,etc) can be changed?
Messages
1. Setting resource limits by Jukka Santala

Setting resource limits
Setting resource limits

Re: resources hard limits
Keywords: getrlimit setrlimit limit ulimit hard resource limits
Date: Fri, 30 May 1997 21:59:26 GMT
Do try man limit and man ulimit; well, my current installation has neither manpage, but at least
ulimit's in /usr/bin by default. limit is a built-in command in some shells for the same purpose. As for
soft/hard limits, ulimit -S vs. ulimit -H ought to do that. Incidentally, if you specify neither, both
should get changed at the same time. csh is an excemption; here you need to use limit -h for hard
limits. As for C programs, use getrlimit() and setrlimit() calls - but then you really ought to get the
manpages, and I have no clue what this has to do with kernel hacking... *shakes head* ;)

How to invalidate a chache page
How to invalidate a chache page

Keywords: Chache Problem
Date: Wed, 21 May 1997 17:47:36 GMT
From: Gerhard Uttenthaler <u7x62bt@sun1.lrz-muenchen.de>
Hello everybody,
I have a problem with a char device driver, which I think is related to a chaching problem.
In my interrupt routine where I receive data from my device I have to handle a specific protocol
depending on my device.
So I have to poll a specific address until a change at this address occurs. Something like this:
do{
status = readb(0xc0000);
timeout++;
}while(!(status & 0x80) && (timeout < 2000));
Sometimes I get timeouts, which can only occur if I read chache memory which is not up to date. The
device is fast and works pretty well under DOS and Win311.
The situation is quite similiar on write access, where I must be sure to write directly to the device, not
into the chache waiting for a flush.
The question is: Does anyone know how to invalidate the chache on my address or how to tell the
chache not to chache my device at all???
Thank you very much!
G. Uttenthaler u7x62bt@sun1.lrz-muenchen.de
Messages
1. Read the rest of the KHG! by Michael K. Johnson

Read the rest of the KHG!
Read the rest of the KHG!

Re: How to invalidate a chache page (Gerhard Uttenthaler)
Keywords: Cache Problem
Date: Sat, 14 Jun 1997 01:49:44 GMT
You clearly want to read The Linux Cache Flush Architecture

Where are the tunable parameters of the kernel?
Where are the tunable parameters of the kernel?

Date: Thu, 15 May 1997 08:10:29 GMT
From: <demiguel@robot3.cps.unizar.es>
I need to find the tunable parameters of the linux kernel, bue i cant find them (filesystems, processes,
swapping,...)
I would like to know if there is any files such as mtune, or dtune is System V where you can change
the main tunable parameters.
Thank you everybody!!
Messages
1. Kernel tunable parameters by Jukka Santala

Kernel tunable parameters
Kernel tunable parameters

Re: Where are the tunable parameters of the kernel?
Date: Thu, 15 May 1997 14:47:31 GMT
Try out rdev (man rdev if you have it installed; must be on sunsite etc. if not). In short rdev sets root
device, swapdev sets swapping, ramsize sets ramdisk-size, rootflags for root-fs mounting parameters
and vidmode for ... well, you guessed it ;)

How can my device driver access data structures in user space?
How can my device driver access data structures

in user space?
Date: Tue, 06 May 1997 15:58:47 GMT
From: Stephan Theil <theil@zarm.uni-bremen.de>
My device driver needs to access data structs in user space. The functions get/put_user are the only
way - I think - to transfer data between kernel and user space. But both function only allow
manipulating of scalar types. How can I get/put data structs from/to user space???
Thanx!
Stephan

Forced Cast data type
Forced Cast data type

Re: How can my device driver access data structures in user space? (Stephan Theil)
Date: Fri, 05 Jun 1998 02:45:35 GMT
From: Wang Ju <wangju@ics.ict.ac.cn>
Since the device driver know the type of data structure, you can cast the point to user space to the
structure you needed.

Problem in doing RAW SOCKET Programming

Keywords: Client server program using Raw sockets.
Date: Wed, 30 Apr 1997 15:39:27 GMT
From: anjali sharma <asharm@acadcomp.cmp.ilstu.edu>
I have to write a client server program using raw socket. I have written the code for client as well as server but when ever I run it my server
hangs up. So I have to reboot the server. I think there is problem with my send and receive. I am sending the code for server. Hope you
would be able to help me.
@@@@@@@@@@@@@@@@@@@@@@@ code @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
u_short portbase = 0; long time();
#define qlen 6
#define protocol "raw"
#ifdef REALLY_RAW
#define FIX(x) htons(x)
#else
#define FIX(x) (x)
#endif
main(int argc, char **argv)
{
int msock, ssock;
int alen;
char buf[] = "asdfgh";
char recv_buffer[20];
struct servent *pse;
struct protoent *ppe;
struct sockaddr_in dst;
struct hostent *hp;
struct ip *ip = (struct ip *)buf;
struct icmp *icmp = (struct icmp *)(ip +1);
int s, type, dstL;
int q, bind1, lis;
int sockopt;
int on = 1, address;
int offset;
int sendbuff;
int n;
bzero((char *)&dst, sizeof(dst));
dst.sin_family = AF_INET;
dst.sin_port = 6000;
ppe = getprotobyname("raw");
setbuf(stdout,NULL);
s = socket(AF_INET, SOCK_RAW, 0);
printf("\n%d value of s in servsock",s);
if (s < 0)

printf("\nCann't creat socket");
setbuf(stdout,NULL);
sockopt = setsockopt(s, 0, IP_HDRINCL, &on, sizeof(on));
printf("\n%d value of sockopt", sockopt);
if (sockopt < 0)
exit(0);
if(( hp = gethostbyname(argv[1])) == NULL){
if(ip->ip_dst.s_addr = inet_addr(argv[1]) == -1)
printf("\nERROR: UNKNOWN HOST");
}
else
bcopy(hp->h_addr_list[0], &ip->ip_dst.s_addr,
hp->h_length);
printf("\nSending to %s\n", inet_ntoa(ip->ip_dst));
ip->ip_v = 4;
fflush(stdin);
ip->ip_hl = sizeof *ip >> 2;
ip>ip_tos = 0;
ip->ip_len = sizeof buf;
ip->ip_id = htons(4321);
ip->ip_off = 0;
ip->ip_ttl = 255;
ip->ip_p = 1;
ip->ip_sum = 0;
ip->ip_src.s_addr = 0;
dst.sin_addr = ip->ip_dst;
dst.sin_family = AF_INET;
icmp->icmp_type = ICMP_ECHO;
icmp->icmp_code = 0;
sendbuff = sendto(s, buf, sizeof buf, 0, (struct sockaddr
*) &dst, sizeof dst);
if(sendbuff < 0)
printf(" ERROR sending ");
if ( sendbuff != sizeof buf)
printf("ERROR packet size");
printf("\n buf is %s value of send is %d ", buf, sendbuff);
dstL = sizeof dst;
n = recvfrom(s, recv_buffer, sizeof(recv_buffer), 0,
(struct sockaddr *) &dst,&dstL);
printf("recv buffer is%s value of n is %d\n", recv_buffer,n); close(s); exit(0); }
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Problem with ICMP echo-request/reply

Re: Problem in doing RAW SOCKET Programming (anjali sharma)
Keywords: Client server program using Raw sockets.
Date: Thu, 09 Jul 1998 12:17:09 GMT
From: Raghavendra Bhat <raghu@swec.xko.dec.com>
I have the following program to simulate "ping". I am sending an ICMP echo

request to a specified host and expect the host to respond with ICMP echo
response. I understand from TCP/IP books that the kernel of the target host
will send the "echo response" as soon as it receives the "echo request". But
"recvfrom()" in my program waits forever since it doesn't receive any
"echo response". While "myping" is waiting in recvfrom(), if I do a
"ping" to the target host from my host, "myping" comes out successfully
with correct packet size also. This means that IF my current host receives
the "echo response", it is properly delivering the packet to my program. That
is,
there should not be any problem with my recvfrom(). But I am not able to make
out why my targethost's kernel is not responding to my "echo request". Could
it be because my "echo request" is not reaching the target machine or the
target machine is not able to identify the received packet as "echo request"?
Can anybody tell me the reason for this? What could be the possible reasons
and
what is the solution?
Thanks in advance,
Raghu
___________________________________________________________________________
/* myping.c */
/*
* This program simulates the "ping" program. But it doesn't bother about
* checksum, unique sequence id, etc.
*/
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netdb.h>
#include <netinet/in_systm.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>

/* My own ICMP structure */

struct myicmp
{
unsigned char icmp_type;
unsigned char icmp_code;
unsigned short icmp_cksum;
unsigned short icmp_id1;
unsigned short icmp_seq1;
};
/* argv[1] can be either hostname or IP address */

main(int argc, char *argv[])
{
struct sockaddr_in address;
u_char sendpack[32];
struct myicmp *icp;
int Nbytes, sock , status, buf_len;
struct sockaddr_in Current_Sockaddr;
char buffer[100];
struct hostent *host;
if (argc != 2)
{
printf("usage : %s <hostname|IP address>\n", argv[0]);
exit(1);
}
if ((host=gethostbyname((const char*)argv[1])) == NULL)

{
if ((address.sin_addr.s_addr = inet_addr(argv[1])) == -1)
{
printf("%s: unknown host\n", argv[1]);
exit(1);
}
}
else
{
bcopy(host->h_addr_list[0], &(address.sin_addr.s_addr),
host->h_length);
}
memset(sendpack, 0x0, sizeof(sendpack));

memset(buffer, 0x0, sizeof(buffer));
icp=(struct myicmp*)sendpack;
icp->icmp_type=ICMP_ECHO;
icp->icmp_code=0;
icp->icmp_seq1=1; /* any abritrary sequence number */
icp->icmp_id1=123; /* any arbitrary id */

address.sin_family = AF_INET;
/* 1 is for ICMP protocol : from /etc/protocols */

sock = socket(AF_INET,SOCK_RAW, 1);
buf_len = sizeof(buffer);
Nbytes= sendto(sock, (const void *)sendpack, sizeof(sendpack), 0,
(struct sockaddr *)&address,sizeof(address));
printf ("Data is sent to %s\n", inet_ntoa(address.sin_addr));

printf ("No of bytes sent = %d\n", Nbytes);
buf_len = sizeof(Current_Sockaddr );
Nbytes= recvfrom(sock, buffer, buf_len, 0,

(struct sockaddr *)&Current_Sockaddr, &buf_len );
printf ("Data received from %s\n", inet_ntoa(Current_Sockaddr.sin_addr));
/* get the sequence-id : First 20 bytes are for IP header :

21t byte is the sequence-id */
icp=(struct myicmp*)&buffer[20];
printf("Received sequence id is %d\n", icp->icmp_seq1);
printf ("No of bytes recd = %d\n", Nbytes);

}
________________________________________________________________________

Tunable Kernel Parameters?
Tunable Kernel Parameters?

Keywords: kernel tuning, file handles, Bus error
Date: Sat, 19 Apr 1997 13:07:40 GMT
From: <dennyf@bmn.net>
In a recent comp.os.linux. article I saw a reference to tunable kernel parameters, which I think may
help me with a problem I'm having with expireover on a Debian 1.2, Inn 1.4 unoff4 news server.
The gist of the article was that the other fellow's problem may have been caused by running out of file
handles, and that file handles were a compile time option.
I'm runnming 2.0.27, 64mb ram, raid 0 on three 2gb drives for /var/spool/news.
News expires fine, but expireover causes errors like:
no memory for named no memory for expireover
then finally:
Bus error
At this point serveral drivers usually die or go zombie and the system has to be restarted.
Monitoring the system with top while expireover is running shows plenty of available memory, with
hardly any swap being used out of the 64mb swap space. By available memory, I mean there is about
30 odd mb in the buffer cache.
I've been looking for some docs on tuning kernel parameters like file handles, and can't seem to find it.
Can someone please point me in the right direction?
Any other suggestions would be greatly appreciated.
Thanks,
Denny Fox, dennyf@bmn.net
Messages
1. sysctl in Linux by Jukka Santala

sysctl in Linux
sysctl in Linux
Re: Tunable Kernel Parameters?
Date: Thu, 15 May 1997 15:14:33 GMT
The util you're looking for is sysctl (Or a system-call by the same name). However, as far as I know
this isn't quite fully implemented in Linux as of yet (I just saw it on a "wishlist" for 2.2 kernels).
Certainly I haven't been able to find anything meaningful on sysctl in Linux, so perhaps that post you
was referring to was abotu the *BSD's which I seem to remember use sysctl rather heavily. Ofcourse
there's the option to choose whether or not to compile sysctl in to the kernel at least on 2.1.37; if
anybody knows for sure if working sysctl utils can be had anywhere, drop a line.
However, on the grand scale, I don't think sysctl would do it- Linux (Ok, again, at least in 2.1.37)
comes with all defaults compiled to 1024fd's, and changing that would require at least increasing
_FD_SETSIZE, NR_FILE and NR_OPEN. You'd have to change these in the kernel headers and
recompile everything to get any changes of it working anyway.
On the grand scale, though, I somehow doubt this is your problem - running out of file descriptors
rarely results in kernel crashes.

Increasing number of files in system
Increasing number of files in system

Date: Sat, 25 Oct 1997 11:19:38 GMT
From: Simon Cooper <simon.cooper@snc-net.demon.co.uk>
Increasing number of open file handles in kernel.

I have successfully increased the number of files on my system (Linux 2.0.30) by editing the following
file:
/usr/src/linux/include/linux/fs.h
I increased NR_OPEN to 1024, NB. Don't make it any higher than this as it will break other code in
the kernel and your new kernel wont boot !
I also increased NR_INODE to 5000, and NR_FILE to 4096.
I then rebuilt the Kernel (and modules for good measure) and installed the new image.
Note: Its always a good idea to have your original Kernel image available. I installed a boot block in
my MBR using lilo, and added an entry for my original kernel I named vmlinux.safe.
For my root system on /dev/hda2 I added the following lines to my existing /etc/lilo.conf
image=/vmlinux.safe image=linuxS root=/dev/hda2
Messages
1. Increasing number of open files parameter by Simon Cooper

Increasing number of open files parameter
Increasing number of open files parameter

Re: Increasing number of files in system (Simon Cooper)
Date: Sat, 25 Oct 1997 11:25:44 GMT
From: Simon Cooper <simon.cooper@snc-net.demon.co.uk>
Sorry for the typo, my (Linux loader) lilo.conf entries are:

image=/vmlinux.safe [Newline] label=LinuxS [Newline] root=/dev/hda2 [Newline]

Setting and getting kernel vars
Setting and getting kernel vars

Re: sysctl in Linux (Jukka Santala)
Keywords: kernel vars
Date: Thu, 16 Jul 1998 22:57:00 GMT
From: <kbrown@csuhayward.edu>
If sysctl is not implemented properly yet, then what is a good method for setting and getting kernel
variables?
I used sysctl to access some varibale I added under NetBSD and I was hoping to do the same now that
I'm using Linux instead.
Has anyone done this successfully?

ELF matters
ELF matters
Date: Tue, 15 Apr 1997 09:09:51 GMT
From: Carlos Munoz <cmunoz@dsic.upv.es>
Hi again!!
Something bizarre is happening in my Linux 1.2.13:
I have created a silly program that executes just after a fork() for the init is done in the function
start_kernel():
.........
if (!fork())
init();
..........
void init() { ..... THIS IS MY CHANGE:
if (!fork())
execve("silly", NULL, NULL);
.....
}
If the format of the executable silly is ELF then everything works okay, but if I use the old a.out
(compiled with gcc silly.c -o silly -b i486-linuxaout) the program doesn't execute at all and I cannot
find it anywhere (top and ps -x fail to do it).
Am I going nuts or are Linux developers trying to move us to ELF?
Secondly, I would like to disable demand loading in ELF for some real-time tasks which I don't want
them to produce page faults while they're actually running. Is it possible?
Thirdly, can you tell me where can I find information about ELF internals? I get sick when I try to
understand the function load_elf_binary!
Thanks a lot in advance!!
CARLOS AKA SLACKER
Messages
1. Information about ELF Internals by Pat Ekman

ELF matters

Information about ELF Internals
Information about ELF Internals

Re: ELF matters (Carlos Munoz)
Keywords: ELF format
Date: Thu, 08 May 1997 18:52:42 GMT
From: Pat Ekman <ekman@cs.byu.edu>
Documentation for ELF internals can be found at sunsite.unc.edu (or a mirror), in

/pub/Linux/GCC/ELF.doc.tar.gz.

Droping Packets
Droping Packets
Keywords: packets
Date: Mon, 31 Mar 1997 01:27:55 GMT
Is there any way to tell the kernel to drop packets from certian hosts. Like drop packets from all the
hosts in a file? If not I would be verry greatfull to the person who writes that program if it could be
written.
charles
Messages
1. [Selectively] Droping Packets by Jose R. cordones

[Selectively] Droping Packets
[Selectively] Droping Packets

Re: Droping Packets (Charles Barrasso)
Keywords: ICMP packet Firewall
Date: Fri, 11 Apr 1997 19:38:11 GMT
From: Jose R. cordones <cord2403@cslab.engr.ccny.cuny.edu>
Sure you can.

Look into compiling support for "Firewall," etc. code in the kernel. Then you get the "ipfwadm" (IP
Firewall Admin.) package (available wherever fine free software is sold^H^H^H^Hgiven away.)
You then add rules on which traffic to allow your host to accept, which to reject (implies that the host
attempting a connection receives feedback in the form of a ICMP error message) and which to ignore
(no ICMP error sent.)
If such a machine is additionally forwarding traffic between several networks, then the marketing
people call this a "Firewall." But you can also be using it just to protect the host itself.
There are other solutions that are not so low-level such as the tcpd daemon and configuring well any
daemon/service you run on your machine, something to which no firewall can be a substitute.
Cheers,
José R. Cordones <cord2403@cslab.engr.ccny.cuny.edu> http://www.engr.ccny.cuny.edu/~cordones

The /proc/profile
The /proc/profile
Keywords: /proc/profile, kernel hacking
Date: Mon, 31 Mar 1997 00:15:34 GMT
I recently upgraded the kernel to 2.0.29 and included kernel hacking support. Now I have a
/proc/profile file that I want to read. Supposedly it contains info on the kernel. I know I need to have
softwhere to read what is in the file. where would I get that? Also what else can I do now that I have
kernel Hacking support?
thanks charles
Messages
1. readprofile systool by Jukka Santala

readprofile systool
readprofile systool
Re: The /proc/profile (Charles Barrasso)
Keywords: /proc/profile, kernel hacking
Date: Thu, 15 May 1997 15:24:43 GMT
RTFM ;) If you enter '?' at the prompt (or menu-choice) about kernel-hacking support, it explains all it
does for now is allow you to get profiles on where exactly the kernel is spending it's time into the /proc
- filesystem. Further it's noted that to read that information you need a tool called readprofile, which is
available from ftp://sunsite.unc.edu/pub/Linux/kernel/readprofile-2.0.tar.gz (What a surprise;). Using
the mirror-site closest to you is preferable.
It's actually pretty useful for profiling the kernel for optimization purposes, however worth
remembering is that since the kernel-source is heavily optimized, there's no direct connection between
the results and the actual code in the way one could expect (Ie. a lot of functions get inlined etc; if the
entry you're looking for doesn't show even on readprofile -a, it's probably made part of the calling
function(s)).

Can you block or ignore ICMP packets?
Can you block or ignore ICMP packets?

Keywords: ICMP ping Internet echo flood
Date: Sat, 22 Mar 1997 02:46:28 GMT
From: <HackOps@nutnet.com>
I would like to know if you could block or ignore ICMP packets. If there is no way to block reception,
then is there a way to prevent the kernel from replying? The reason is that I want to prevent other
people from abusing the "ping" command and flooding me with ICMP packets. In particular, I want to
block only when there is an excessive amount of packets being recieved. (ie. 25+ in a 10 sec period)
Messages
4. ICMP send rate limit / ignoring by Jukka Santala
3. Using ipfwadm by Charles Barrasso
1. Icmp.c and kernal ping replies by Don Thomas

ICMP send rate limit / ignoring
ICMP send rate limit / ignoring

Re: Can you block or ignore ICMP packets?
Date: Thu, 15 May 1997 14:30:48 GMT
While adding that #define CONFIG_IP_IGNORE_ECHO_REQUESTS into linux/net/ipv4/icmp.c will

work fine for now, I'd suggest putting it into the configuration-headers so it doesn't tangle up with
further patches, or, should that define later move into different file(s), lose it's efficiency. This is also
the easiest way to make sure all future versions of the kernel you compile get that setting defined.
Unfortunately, I'm not quite sure where you can stick it without messing up the kernel autoconfig ;) If
anybody has any input on this, it would be most welcome.
Meanwhile, if you're worried that ignoring _all_ echo-requests may be a bit too rough move, there's a
way to make the kernel ignore them selectively. This is available at least in the 2.1.X series,
unfortunately I don't know if it's elsewhere.
While browsing the net earlier I came upon a site with cross- referenced kernel sources for all major
Linux distributions, so I thought I'd check it out from there, but naturally I didn't save the URL
anywhere, typical, so if somebody knows that site I'd appreciate to know too ;)
But back on track... so how do you make that selective ignore? Simple, first make sure
CONFIG_NO_ICMP_LIMIT _isn't_ defined - don't worry how, it won't be ;) Next, in
linux/net/ipv4/icmp.c go to the end of the file where there is a table of ICMP definitions - the first
entry is after /* ECHO REPLY (0) */ This is, incidentally, what you need to change. Change the
NULL on that line to &xrl_generic. So what does that do? I suggest you look at the source and try to
figure that out yourself - it's not that hard, and allows you better diddle with it. (However, the
limit-code seems pretty inefficient to me, and is no use against spoofed ICMP-floods, so I suggest
relying on it with caution)
Messages

Omission in earlier rate-limit...
Omission in earlier rate-limit...

Re: ICMP send rate limit / ignoring (Jukka Santala)
Date: Thu, 15 May 1997 22:44:33 GMT
Oops, what a mistake. I missed the fact that icmp_send()

isn't actually used for replying to ICMP_ECHO_REQUEST's etc.
so no matter how you change the table in question, none
of the replies are going to be limited... so what you need
to do is add a call to the check in question to icmp_reply()
as well, which is something that can already be called real
kernel hacking. Here's how I'm doing it; however...
1) I haven't yet rebooted with this code... wish me luck ;)
2) Am I missing something? ping -f and ping -l get mostly ignored
Here's the bit of code, in icmp_reply() right at the beginning (after local varable
definitions) :
#ifndef CONFIG_NO_ICMP_LIMIT
if(!xrlim_allow(icmp_param->icmph.type, skb->nh.iph->saddr))
return;
#endif
I'll let you know how my tests with the thing proceed ;)
(Sorry for bad formatting, I managed to break my PPP thingy playing around with
filedescriptors, it seems, and this remote lynx doesn't quite handle text-fields
properly, it seems... :P)
Messages
1. Patch worked... by Jukka Santala

Patch worked...
Patch worked...
Re: ICMP send rate limit / ignoring (Jukka Santala)
Re: Omission in earlier rate-limit... (Jukka Santala)
Date: Thu, 15 May 1997 23:04:28 GMT
Just a quick note to report success on that patch ;) Now, doing ping -l 1000 -c 1000 host (Not
suggested to test willy-nilly; very effective flood where supported;) only replies to 30 first
ping-packets, ignoring the rest (Before the patch I got about 180 replies - does similiar code to tune
already exist elsewhere?). Another ping-flood right before earns only two replies, though (Is this
correct?). A normal ping with one-second delay goes thu with 0% packet loss. I'd be interested to hear
results if anybody dares to try this patching out on a "real" configuration. (I have a very limited PPP
account, basically conducting tests over local loopback - oh, and by the way, that PPP breakage wasn't
because of my filehandle playing, it was because I had removed resolv.conf for who knows what
reasons... increasing fd's up to 4k seem to have worked without problem at least for now;)

Using ipfwadm
Using ipfwadm
Date: Sun, 11 May 1997 21:50:25 GMT
If you compile the kernel with FireWall support then you could do:
ipfwadm -I -P icmp -a reject(or deny)
that would make it so your computer wouldn't reply to the pings from any host.
But lets say that you wanted to be able to be pinged by brigia.blitz.com but not by anyone else well
then you would
ipfwadm -I -P icmp -a accept -S brigia.blitz.com ipfwadm -I -P icmp -a reject(or deny)
make sure you put the accepts first then the deny's or rejects.
Hope this wasn't too confusing,
Charles

Icmp.c and kernal ping replies
Icmp.c and kernal ping replies

Date: Sat, 22 Mar 1997 15:55:26 GMT
From: Don Thomas <dgt@multiline.com.au>
Hi,
I'm not sure if you can ignore ICMP requests, but I have been able to modify icmp.c to stop the kernal
replying to ping requests. This halves the amount of traffic if you are flood pinged, plus the person
pinging you, may well believe that you are down because of the absence of replies. I added this to
icmp.c in the /usr/src/linux/net/ipv4 directory, and then re-compiled. Seems to work okay on 2.0.28.
#define CONFIG_IP_IGNORE_ECHO_REQUESTS
Regards,
Don

ipfwadm configuration utility
ipfwadm configuration utility

Re: Using ipfwadm (Charles Barrasso)
Keywords: IPFWADM
Date: Mon, 20 Jul 1998 22:44:31 GMT
From: Sonny Parlin <sonny@ejj.net>
*FREE SOFTWARE*
I have written an ipfwadm GUI configuration utility. It's GUI via Netscape... it creates a shell script to
be used as a firewall based on the criteria you choose during the configuration. It can also install the
firewall rules, uninstall them, check firewall status, and watch network traffic from the masqueraded
connections. If anyone is interested in this, Check out:
http://www.ejj.net/~sonny/fwconfig/fwconfig.html
Sonny

encaps documentation
encaps documentation
Keywords: encaps doc docs
Date: Thu, 13 Mar 1997 11:28:50 GMT
From: Kuang-chun Cheng <kccheng@hycppc01.fnal.gov>
Hi,
Could someone tell me where can find the documentation
or man pages of command 'encaps' which used in elf kernel
building. I knew it came from binutils but just can't find
any word about in binutils' info page. Thanks.
Kuang-chun Cheng
kccheng@hycppc01.fnal.gov

Mounting Caldrea OpenDOS formatted fs's
Mounting Caldrea OpenDOS formatted fs's

Keywords: filesystems Caldera OpenDOS mount
Date: Mon, 17 Feb 1997 18:17:19 GMT
From: Trey Childs <tchilds@paraengr.com>
Anyone have a patch to allow mounting an OpenDOS formatted partition as 'msdos'? I think it's
complaining about a signature that it expects. If so, that sounds easy. I took a look in the filesystems
code and didn't see anything like it.
Thanks, Trey Childs

finding the address that caused a SIGSEGV.
finding the address that caused a SIGSEGV.

Keywords: SIGSEGV, signal, handler
Date: Sun, 09 Feb 1997 08:46:09 GMT
From: Ben Shelef <meekg@pobox.com>
one way, of course, is to backtrack up the stack, beyond the sigreturn frame. for that, i'd need the stack
layout. but this address should also be available from the kernel - does anyone know where to find it?
thanks, ben.

sti() called too late.

Keywords: cli() IRQ rtc realtime timer lost
Date: Sat, 18 Jan 1997 11:00:12 GMT
From: Erik Thiele <erik@unterland.de>
Hello
i read the /dev/rtc (Realtime clock) driver, mutated it and wrote my own little module, that get's called
8192 times per second via SA_INTERRUPT (highest possible priority) IRQ 8. ok. if harddisk
interrupts occur, some timer interrupts get lost. this is also mentioned in the original rtc driver. this
happens, because cli is "active" too long. WHY doesn't the harddisk driver do a sti() as early as
possible ???
i don't think this is good code.
:)
byebye Erik
Messages
1. sti() called too late. by Gadi Oxman


Re: sti() called too late. (Erik Thiele)
Keywords: cli() IRQ rtc realtime timer lost
Date: Fri, 24 Jan 1997 19:07:40 GMT
From: Gadi Oxman <gadio@netvision.net.il>
Unfortunately, some combinations of chipsets, IDE

interfaces and drives will not work reliably when
interrupts are allowed to occur while we are servicing
a disk interrupt.
The IDE driver can optionally unmask interrupts during

the data transfer, by using "hdparm -u1 /dev/hdx".
This is a typical problem; on the one hand, we want to

achieve the best possible performance, but once we do
this, we might find out that we are no longer compatible
with marginal hardware which could have worked if only
we didn't try to push it so hard.
Gadi Oxman
gadio@netvision.net.il

Module Development Info?
Module Development Info?

Date: Wed, 18 Dec 1996 04:12:10 GMT
From: Mark S. Mathews <mark@absoval.com>
Is there a 'single source' for information about developing modules?

Thanks in Advance! MSM
Messages
1. Needed here too by ajay
2. Help needed here too! by ajay

Needed here too
Needed here too

Re: Module Development Info? (Mark S. Mathews)
Date: Tue, 11 Feb 1997 09:48:15 GMT
From: ajay <aftek@giaspn01.vsnl.net.in>
I also need information about module development and the only help I can get are from the module
programs given with the source. Also need kernel functions and I don't have any info other than the
ksyms file. Thanx !!

Help needed here too!
Help needed here too!

Re: Module Development Info? (Mark S. Mathews)
Date: Tue, 11 Feb 1997 09:52:25 GMT
From: ajay <aftek@giaspn01.vsnl.net.in>
I also need information about module development and the only help I can get are from the module
programs given with the source. Also need kernel functions and I don't have any info other than the
ksyms file. Thanx !!

Need quicker timer than 100 ms in kernel-module
Need quicker timer than 100 ms in kernel-module

Date: Fri, 06 Dec 1996 19:14:31 GMT
From: Erik Thiele <erik@unterland.de>
hello
i am just writing a kernel module (2.0.27kernel), that shall turn engines.. :) you know, those electronic
engines that have 4 statuses, and by selecting 1 2 3 4 1 2 3 4 etc. they turn. i just don't know the
english word... :)
i use a little external hardware that needs to be triggered in order to let the engine perform a step. the
module uses an add_timer(...) to be called 100 times per second. this means 50 pulses per second. this
means 50 steps of the engine per second. this is not enough. :-( so. as i read above there are many
real-time approaches to linux, but they aren't yet really implemented ??? :) well somebody told me
about /dev/rtc i could use this one. but i still want to use a kernel module for my purpose, so:
Is it possible to use open() write() read() close() ioctl() in a kernel-module or aren't those libc functions
overloaded? (i want to use /dev/rtc IN my kernel module)
byebye&thanks :) Erik
Messages
1. 10 ms timer patch by Reinhold J. Gerharz
-> Please send me the patch by Jin Hwang

10 ms timer patch
10 ms timer patch
Re: Need quicker timer than 100 ms in kernel-module (Erik Thiele)
Keywords: TIMER
Date: Thu, 09 Jan 1997 03:55:29 GMT
The "little engines" are call stepper-motors.

I have a kernel patch that allows a module to insert a high-priority hook function on the timer
interrupt. It's very simple, actually. I can email it, as I don't have a web or ftp site of my own.
Messages
1. Please send me the patch by Jin Hwang

Please send me the patch
Please send me the patch

Re: 10 ms timer patch (Reinhold J. Gerharz)
Keywords: TIMER
Date: Fri, 10 Jan 1997 22:43:18 GMT
From: Jin Hwang <hwang@ilt.com>
I have been looking for a way to use timer interrupt (30 msec) in module. I would appreciate if you
send me that patch or tell me how to do it. I am using kernel 1.2.3. my e-mail address is
hwang@ilt.com

please send me 10 ms timer patch
please send me 10 ms timer patch

Keywords: TIMER
Date: Sat, 29 Nov 1997 15:08:26 GMT
From: Tolga Ayav <enko@raksnet.com.tr>
I am working on PC based control systems and I really need to use timer interrupt as fast as possible. I
would be pleased if you could send me your patch. Thank you.

UTIME: Microsecond Resolution Timers
UTIME: Microsecond Resolution Timers

Re: Please send me the patch (Jin Hwang)
Keywords: TIMER Microsecond Resolution
Date: Mon, 13 Oct 1997 12:01:30 GMT
From: <BalajiSrinivasan>
We at the University of Kansas have been working on providing facility to add microsecond resolution
timers to linux. Please see http://hegel.ittc.ukans.edu/projects/utime for more details. Balaji

Need help with finding the linked list of unacked sk_buffs in TCP
Need help with finding the linked list of unacked

sk_buffs in TCP
Keywords: sk_buff linked list
Date: Thu, 05 Dec 1996 02:40:41 GMT
Hi,
I have been trying to build a protocol
similar to TCP within the Linux kernel for a
mobile computing environment.
For that, I need to look at the list of
unacknowledged sk_buffs in order to change their
headers so that I can send them through a new
interface.
I have tried sk->send_head, sk->send_tail
but without any success. In spite of there being
around 36-37 unacknowledged segments, there is
still nothing which I get by following the next
or prev pointers of send_head and send_tail.
Can anyone help me tell which other place
I can look into ?
Any help will be greatly appreciated. If
you can also cc the reply to vsgupta@cs.uiuc.edu,
then that would be great.
Thanks a lot, Vijay

Partition Type
Partition Type
Date: Tue, 12 Nov 1996 21:46:59 GMT
From: Suman Ball <sb241@columbia.edu>
Does anyone know which file in the kernel source contains partition type information? It appears that
windows95 has a different partion type number for extended and logical partitions, so I need to add it
to see whether I can read them. Thanks, Suman.

New document on exception handling
New document on exception handling

Date: Mon, 11 Nov 1996 23:16:40 GMT
Joerg Pommnitz wrote a nice document on Kernel-Level Exception Handling which he posted to the
linux-kernel@vger.rutgers.edu list, and which he has graciously allowed me to format in HTML and
include here in the KHG.
Thanks, Joerg!

How to make paralelism in to the kernel?
How to make paralelism in to the kernel?

Date: Wed, 30 Oct 1996 06:28:27 GMT
From: Delian Dlechev <delian@wfpa.acad.bg>
I need to know how to make paralel programs in to the kernel. May be I need to make modules? Or
any process like kerneld and nfsiod?

readv/writev & other sock funcs
readv/writev & other sock funcs

Keywords: socket programming readv
Date: Mon, 21 Oct 1996 19:58:50 GMT
From: Dave Wreski <dwreski@ultrix.ramapo.edu>
I'm having a problem writing code to make use of readv() and the iovec struct. I'm pretty sure I'm
doing it correctly, as I have spent countless hours troubleshooting (seasoned newbie here)
Should I post to linux-c-programming, or is it ok to ask kernel/C/socket questions here?
I'm using 2.0.20 or so, and I have one writev() that writes two iovec structs to a socket. The readv() on
the other end requires two reads (well, readv's), to gather the data, and it doesn't seem to even be
placed back together in my header properly.
Any ideas on pointers, or directions? I have most of Stevens' books, and have scoured them. They
seem to be more interested in fd passing, which I don't need. (should I be using sendmsg()/recvmsg()?)
Thanks so much, Dave Wreski dwreski@ultrix.ramapo.edu

I'd like to see the scheduler chapter
I'd like to see the scheduler chapter

Date: Fri, 13 Sep 1996 02:29:56 GMT
The alpha 0.6 KHG included a chapter on the scheduler.

I suppose it's mostly out of date, but it would be nice to have it posted so we could start updating it.
I apologize in advance if it's here somewhere and I just didn't see it.
Messages
1. Untitled by Vijay Gupta
3. Go ahead! by Michael K. Johnson

Untitled
Untitled
Re: I'd like to see the scheduler chapter (Tim Bird)
Date: Fri, 13 Sep 1996 09:19:00 GMT
You can refer to pages 46-49 of

Linux Kernel Internals (English version).
Authors M. Beck, H. Bohme, M. Dziadzka et. al.
Publisher Addison Wesley
Year 1996
Hope this helps,

Vijay

Go ahead!
Go ahead!
Re: I'd like to see the scheduler chapter (Tim Bird)
Date: Sun, 29 Sep 1996 21:11:38 GMT
If someone will research the topic sufficiently and write up a basic docoument, I'll put it up as a base
document to which to attach comments and questions.

Unable to access KHG, port 8080 giving problem.
Unable to access KHG, port 8080 giving problem.

Keywords: KHG port 80
Date: Sun, 18 Aug 1996 22:11:48 GMT
From: Srihari Nelakuditi <srihari@cs.umn.edu>
Hi,
I am not able to access KHG from my work place. I guess packets with port number greater than 1024
are filtered. So could you please suggest a mirror KHG site which serves on the regular port 80.
Thanks a Lot,
Srihari.
Messages
1. Get a proxy by Michael K. Johnson

Get a proxy
Get a proxy
Re: Unable to access KHG, port 8080 giving problem. (Srihari Nelakuditi)
Keywords: KHG port 80
Date: Thu, 29 Aug 1996 19:47:55 GMT
Either have your workplace add a WWW proxy, or access the KHG from home.

proc fs docs?
proc fs docs?
Keywords: proc
Date: Thu, 15 Aug 1996 07:53:40 GMT
From: David Woodruff <david.woodruff@tellurian.com.au>
Anyone know where to get information on how to write programs which use the proc fs? Or should I
find and use sample code?
ta muchly, dave

Examples code as documentation
Examples code as documentation

Re: proc fs docs? (David Woodruff)
Keywords: proc
Date: Sun, 07 Dec 1997 02:43:13 GMT
I know this question was asked over two years ago, but I have a partial answer...
At http://web.syr.edu/~jdimpson/proj/fib-0.1.tgz you can get the source code to a loadable module that
creates a file called /proc/fib. Writing a number to it will seed it, reading from it will read the the
fibonacci value of that seed.
I wrote it as an exercise in understanding how to read and write to /proc files. Reade
http://web.syr.edu/~jdimpson/proj/fib-0.1/README for more details.
--Jeremy

What is SOCK_RAW and how do I use it?
What is SOCK_RAW and how do I use it?

Keywords: SOCK_RAW socket
Date: Tue, 06 Aug 1996 20:02:18 GMT
From: arkane <cat@iol.unh.edu>
What is SOCK_RAW type of socket. I know it requires root access but I don't know how to use it or
what it does.
TIA
Messages
1. What raw sockets are for. by Cameron MacKinnon

What raw sockets are for.
What raw sockets are for.

Re: What is SOCK_RAW and how do I use it? (arkane)
Keywords: SOCK_RAW socket
Date: Thu, 08 Aug 1996 14:59:11 GMT
From: Cameron MacKinnon <mackin@interlog.com>
Well, there are several types of sockets: TCP and UDP go over the wire formatted as TCP or UDP
packets, unix-domain sockets don't generally go over the wire (they're used for interprocess
communication). These are some of the built-in socket types that the kernel understands (i.e. it will
handle the connection management stuff at the front of each of these packet types). Raw sockets are
used to generate/receive packets of a type that the kernel doesn't explicitly support.
An easy example that you're probably familiar with is PING. Ping works by sending out an ICMP
(internet control message protocol - another IP protocol distinct from TCP or UDP) echo packet. The
kernel has built-in code to respond to echo/ping packets; it has to in order to comply with the TCP/IP
spec. It doesn't have code to generate these packets, because it isn't required. So, rather than create
another system call with associated code in the kernel to accomplish this, the "ping packet generator"
is a program in user space. It formats an ICMP echo packet and sends it out over a SOCK_RAW,
waiting for a response. That's why ping runs as set-uid root.

Linux kernel debugging
Linux kernel debugging

Keywords: Linux kernel debugging
Date: Sat, 29 Jun 1996 09:52:11 GMT
From: <yylai@hk.net>
Does anyone get experiences, standard methods/setup, tips etc. in debugging the Linux kernel ?¡@I
think it is a great topic that can be added to the Linux Hackers' guide.
Jeremy Y.Y.Lai
Messages
1. Device debugging by alombard©iiic.ethz.ch
2. GDB for Linux by David Grothe

Device debugging
Device debugging
Re: Linux kernel debugging
Date: Mon, 19 Aug 1996 10:12:27 GMT
From: <alombard©iiic.ethz.ch>
I have the same problem. I need to debug a network driver, but I can't figure out how to do it. It would
be nice if I could make it write a kind of log file. Is that possible?

GDB for Linux
GDB for Linux

Date: Fri, 20 Sep 1996 16:26:04 GMT
From: David Grothe <dave@gcom.com>
I have GDB running between two Linux boxes with a serial interface cable between them. I can set
breakpoints, single step and do source level debugging in the kernel from one machine to the other.
Unfortunately I have not been able to make this generally available for two reasons. 1) I have been
unable to do any Linux work at all for several months due to other pressing needs. 2) I am a beginner
at the Linux kernel and toolsets and do not yet know how to use the "patch" facility.
I hope to be able to get this info out there before too long.
Messages

gdb debugging of kernel now available
gdb debugging of kernel now available

Re: GDB for Linux (David Grothe)
Date: Mon, 09 Dec 1996 23:40:07 GMT
From: David Grothe <dave@gcom.com>
A package that implements kernel debugging between two machines using gdb is now available.
Connect to ftp.gcom.com and download /pub/linux/src/gdbstub/gdbstub-2.0.24.tar.gz. This package is
intended for the 2.0.24 kernel but will fit pretty easily into other (older) kernels as well.

Another kernel debugging tool
Another kernel debugging tool

Date: Fri, 20 Dec 1996 21:04:49 GMT
From: David Hinds <dhinds@hyper.stanford.edu>
I wrote a tool that lets you run gdb on the same system as the kernel you're debugging. It supports
viewing and modifying kernel data structures, viewing stack traces for processes in the kernel,
interpreting trap reports, and calling kernel functions. It isn't as flexible as a remote debugger; in
particular, there are no breakpoints. But I've still found it to be very useful, and if you don't have a
spare system to use for remote debugging, it is the next best thing.
You can find the "kdebug" package at <ftp://hyper.stanford.edu/pub/pcmcia/extras/kdebug-1.6.tar.gz>.

Kernel debugging with breakpoints
Kernel debugging with breakpoints

Re: Another kernel debugging tool (David Hinds)
Date: Wed, 17 Sep 1997 11:59:33 GMT
From: Keith Owens <kaos@ocs.com.au>
John Heidemann <johnh@isi.edu> modified kdebug to support single-stepping and breakpointing,

calling it xkdebug. In turn I have upgraded John's code from 1.3.30 to 2.1.55 and added code to detect
some deadlock conditions and automatically avoid them.
ftp://ftp.ocs.com.au/pub/xkdebug-2.1.55.tgz

Need help for debugging

Re: Another kernel debugging tool (David Hinds)
Re: Kernel debugging with breakpoints (Keith Owens)
Date: Thu, 23 Oct 1997 11:05:51 GMT
hi,
I have downloaded the xkdebug_for_2.1.55. I tried to install it. It has
generated the Makefile.rej.
I am new to linux. So can u please suggest me how to deal with that file.
By the by I am using redhat-4.2 kernel 2.0.30. Can I use this
debugger or not?
Here i am giveing the rej file.
***************
*** 88,97 ****
# standard CFLAGS
#
- CFLAGS = -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
ifdef CONFIG_CPP
CFLAGS := $(CFLAGS) -x c++
endif
ifdef SMP
--- 88,103 ----
# standard CFLAGS
#
+ CFLAGS = -Wall -Wstrict-prototypes -O2
ifdef CONFIG_CPP
CFLAGS := $(CFLAGS) -x c++
+ endif
+
+ ifdef CONFIG_XKDEBUG
+ CFLAGS := $(CFLAGS) -g
+ else
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/15/2/2/2/1.html (1 di 2) [08/03/2001 10.14.28]

+ CFLAGS := $(CFLAGS) -fomit-frame-pointer

endif
ifdef SMP
_XKDEBUG
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/15/2/2/2/1.html (2 di 2) [08/03/2001 10.14.28]

Realtime mods anyone?
Realtime mods anyone?

Keywords: realtime round-robin task scheduling
Date: Sat, 01 Jun 1996 19:20:04 GMT
From: bill duncan <bduncan@beachnet.org>
I have a requirement for a realtime system doing process control, and I'd like to see if Linux can do it.
I believe that the timing constraints are relaxed enough that Linux can do it straight out of the box, but
wonder if anyone else has done enhancements for realtime.
The timing constraints are less than 100ms response times for a few external events. Since it will be a
single purpose machine, and I will configure it without swap, I doubt that there will be a problem
anyway. Nevertheless, if there are mods out there for the scheduling algorithms (like round-robin
instead of the Unix-style socialist policy scheduling) I'd appreciate finding out.
Thanks.
--
bill duncan, bduncan@beachnet.org
Messages
1. 100 ms real time should be easy by jeff millar
2. Realtime is already done(!) by Kai Harrekilde-Petersen
4. POSIX.4 scheduler by Peter Monta
5. found some hacks ?!? by Mayk Langer
6. Hard real-time now available by Michael K. Johnson
7. Summary of Linux Real-Time Status by Markus Kuhn

100 ms real time should be easy
100 ms real time should be easy

Re: Realtime mods anyone? (bill duncan)
Date: Mon, 03 Jun 1996 03:07:35 GMT
From: jeff millar <jeff@wa1hco.mv.com>
Linux 1.3.(some high numbers) kernels have fairly good real time performance. Applications can use
POSIX real time scheduling with absolute priorities higher than any process. I ran the realtime test
programs associated with some program (don't remember the name) for POSIX realtime process
testing and noted that the longest time that the kernal locked out the realtime application never
exceeded 135 microseconds on my Pentium 100. I assume this means that the longest kernel call tested
didn't exceed that number...some other cases might go longer.
I would like to run a test where a realtime process ran on a precision timed interrupt at the same time
the overall Linux kernel performed it full range of functions. This realtime process sole job would be
to measure interrupt latency and histogram them, probably through the /proc filesyste. My learning
curve for this task would be quite steep but if someone would like to take on this task for a little
education, I'd be interested in the results.
jeff

Realtime is already done(!)
Realtime is already done(!)

Date: Tue, 04 Jun 1996 12:46:00 GMT
From: Kai Harrekilde-Petersen <khp@dolphinics.no>
I remember that someone implemented a POSIX.4 (aka Real-Time) scheduler for Linux, perhaps a
year ago. However, I don't remember who. You probably need to grep through the collected kernel
mailing list archives to find it.
Kai

POSIX.4 scheduler
POSIX.4 scheduler
Date: Tue, 09 Jul 1996 00:47:18 GMT
From: Peter Monta <pmonta@gi.com>
The author of the POSIX.4 scheduler mods is Markus Kuhn; the archives or dejanews will have the
announcements and performance utilities. I assume everything made it into the 2.0 kernel.
I did have occasion to compare the dispatch latency with real (microsecond-resolution) hardware
timers. Once it's running under SCHED_FIFO and everything is locked down, latency is quite stable,
though there were a few spikes up to a few milliseconds. I think this might have been some network
code.
In general I don't think there's heavy emphasis on the part of kernel-driver authors to be careful about
disabling interrupts for a long time. Your mileage will depend on what mix of kernel code gets run.
Some sort of monitoring is a very good idea; I'm told the Pentium has a cycle counter on-chip, which
is ideal.
Cheers, Peter Monta
Messages

cli()/sti() latency, hard numbers
cli()/sti() latency, hard numbers

Re: POSIX.4 scheduler (Peter Monta)
Date: Sat, 17 Aug 1996 10:45:38 GMT
From: Ingo Molnar <mingo@pc5829.hil.siemens.at>
i've done some timings on cli()/sti() latency, on IP basis. Most parts of the kernel are OK, they have
less than 100 usecs of max latency. There is one thing why device driver writers take care of cli()/sti()
latencies, it's the serial interrupt. If the latency is too high, then we loose serial data quite easily. Some
hard data: on a 100 MHz Neptun dual CPU system, hardware interrupt latency is 10+-1 usecs, typical
cli()/sti() latencies are on the order of 10 usecs. Some code like the IDE driver has latencies up to 100
usecs, occasionally higher. The IDE driver latency can be minimized by using the hdparm utility:
multiple mode and irq masking should be turned off.

found some hacks ?!?
found some hacks ?!?

Date: Wed, 24 Jul 1996 18:04:53 GMT
From: Mayk Langer <langer@vsys.de>
i have found some hack files on
http://www.dur.ac.uk/~dph0www2/martini/WFS/linux-rtx.html
i dont know how to use the real-timer from simple programs
please send email with ideas

Hard real-time now available
Hard real-time now available

Date: Sun, 29 Sep 1996 20:51:50 GMT
Victor Yodaiken has announced RT-Linux, a hard-real-time-capable Linux-based OS.

Summary of Linux Real-Time Status
Summary of Linux Real-Time Status

Keywords: real-time round-robin task scheduling POSIX.1b SCHED_FIFO SCHED_RR
Date: Tue, 01 Oct 1996 17:32:30 GMT
From: Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
A detailed summary of the POSIX.1b real-time support system call interface, the current implementation status
of POSIX.1b under Linux, a discussion of various real-time related problems, and many links to related
ressources can be found in
ftp://ftp.informatik.uni-erlangen.de/local/cip/mskuhn/misc/linux-posix.1b
If you look for information about real-time applications under Linux, start there! Feel free to copy information
from there into the Linux KHG.

Shortcomings of RT-Linux
Shortcomings of RT-Linux
Re: Hard real-time now available (Michael K. Johnson)
Keywords: realtime round-robin task scheduling RT-Linux KURT KU Real-Time Linux
Date: Sun, 08 Feb 1998 23:25:41 GMT
Though Victor Yodaikens RT-Linux is great for developing hard real-time applications using Linux, it
does not allow real-time tasks to use any of Linux's features (like networking, etc...) To write a
real-time application that uses Linux's features, you need to split it into two parts. A part that does not
need such features (the real-time part) and a part that needs to use these features (the non-real-time
part). These two parts can communicate by using FIFOs (i think). Note that the non real-time part is
not given any real-time guarantees. If the real-time application cannot be split into two parts, then you
cannot use RT-Linux. See http://hegel.ittc.ukans.edu/projects/kurt for further details. balaji

Firm Realtime available
Firm Realtime available

Re: Hard real-time now available (Michael K. Johnson)
Date: Mon, 13 Oct 1997 11:57:46 GMT
We at the university of kansas have been working on a firm realtime linux. This version of realtime
linux allows you to use standard linux features in realtime tasks and trades off some deadline
guarantees. For more information see http://hegel.ittc.ukans.edu/projects/kurt balaji

I want to know how to hack Red Hat Linux Release 5.0
I want to know how to hack Red Hat Linux

Release 5.0
Re: found some hacks ?!? (Mayk Langer)
Keywords: Red Hat Linux hacking hack hacker
Date: Wed, 01 Jul 1998 06:01:59 GMT
From: Kevin <hack_the_earth@hotmail.com>
Please help
anyone that know how to hack Red Hat Linux release 5.0 please E-mail me (Kevin) and tell me how,
i would really like your help. hack_the_earth@hotmail.com thanks

Real-Time Applications with Linux POSIX.4 Scheduling
Real-Time Applications with Linux POSIX.4

Scheduling
Re: 100 ms real time should be easy (jeff millar)
Keywords: realtime FIFO scheduling
Date: Wed, 22 Oct 1997 18:08:55 GMT
From: P. Woolley <woolley@cs.york.ac.uk>
I have recently been working on system to implement Real-Time computer control using Linux 1.2.13
with the POSIX.4 (should that be .1b ?) FIFO scheduler.
When I run my RT progs. I also have the usual system running doing various things, and have not
experieneced any unpredictability down to sampling intervals of 10ms. This includes doing things like
AD/DA device driver access.
I have also, though less deterministically, had intervals as low as 500us to 600us going. (I have been
using multiple processes, but in a single process (threaded maybe) fast, deterministic intervals should
go fine?)

Why can't we incorporate new changes in linux kernel in KHG ?
Why can't we incorporate new changes in linux

kernel in KHG ?
Keywords: KHG should reflect current changes in the linux kernel
Date: Fri, 24 May 1996 11:05:22 GMT
From: Praveen Kumar Dwivedi <pkd@wipro.wipsys.soft.net>
I have read version 0.6 completely but so much has changed in the 1.3.xx kernels that for experienced
linux kernel hackers KHG no longer is all that useful.
I thinks this is high time we should do a complete overhaul
of KHG so that it reflects current changes in the linux kernel
especially in the areas such as memory management.
One last thing I am grateful to all those who contributed
to KHG.
Messages
1. You can! by Michael K. Johnson

You can!
You can!
Re: Why can't we incorporate new changes in linux kernel in KHG ? (Praveen Kumar Dwivedi)
Keywords: KHG should reflect current changes in the linux kernel
Date: Sun, 26 May 1996 15:48:35 GMT
I have read version 0.6 completely but so much has changed in the 1.3.xx kernels that for
experienced linux kernel hackers KHG no longer is all that useful.
Was the KHG ever really useful for experienced Linux kernel hackers? I would guess not.
I thinks this is high time we should do a complete overhaul of KHG so that it reflects
current changes in the linux kernel especially in the areas such as memory management.
I'm glad you phrased it that way. We should indeed do a complete overhaul of the KHG. Here's how to
do this: as you find disparities, please post them as responses. Those responses will be used to update
the KHG. Without those responses, the updates will not happen except as I happen to notice the
disparities, which doesn't happen much. Without your help, the KHG will remain hopelessly mired in
the past.
If you look at the pages, you will see that kind readers have already started this process. To everyone
who has contributed a change, fix, elucidation, update, or whatever, thank you very much! Keep up the
good work, everyone, and we'll have a document worth reading.

Kernel source code
Kernel source code

Date: Sat, 18 May 1996 15:45:45 GMT
From: Gabor J.Toth <jtoth@princeton.edu>
I like the new setup of KHG a lot! One thing I miss is that the source of the kernel is not available
along with KHG. I often browse the net from work where I don't have linux available and I guess I'm
not alone. Putting a version of the kernel (the latest my be a good idea but pretty much any would do)
onto an ftp server and making a link to it from least the main page (but eventually from all relevant
pages) might be of great help for many.
Comments, anyone?
Messages
1. The sounds of silence... by Gabor J.Toth
2. Kernel source is already browsable online by Axel Boldt

The sounds of silence...
The sounds of silence...

Re: Kernel source code (Gabor J.Toth)
Date: Thu, 23 May 1996 01:23:35 GMT
From: Gabor J.Toth <jtoth@princeton.edu>
I didn't expect to get very much response but, boy, I thought I would get some. Does this silence mean
that
* everybody likes the idea and has nothing to add;
* everybody found the idea dumb and boring and didn't care to
comment;
* no one could figure out what I meant?
Comments, this time?
Messages

Breaking the silence :)
Breaking the silence :)

Re: The sounds of silence... (Gabor J.Toth)
Keywords: source links hardcopy
Date: Thu, 23 May 1996 17:24:37 GMT
From: Kyle Ferrio <kylef@engin.umich.edu>
Hypertext links to germane sections of the kernel source would be great, especially for those like me
who are just starting to hack a path through the woods.
Links to the code would also go a long way toward convincing me that a fully on-line, interactive khg
is a Good Thing. At present, I still prefer having pulp and carbon. It's just too hard to scribble in the
margins of a Web page. :)
Respectfully Submitted,
Kyle Ferrio
Messages

Scribbling in the margins
Scribbling in the margins

Re: Breaking the silence :) (Kyle Ferrio)
Keywords: source links hardcopy
Date: Sun, 26 May 1996 15:32:04 GMT
You wrote:
It's just too hard to scribble in the margins of a Web page. :)
You just scribbled in the margin. The difference is that here other people can benefit from your
scriblings.

It requires thought...
It requires thought...
Date: Sun, 26 May 1996 15:37:49 GMT
I didn't respond because I wanted to think about how it could be done, and done in such a way that it
would actually be helpful...
I just saw a pointer to cxref posted to comp.os.linux.announce. I don't think that we would have to use
its extra documentation features to use its cross-referencing features. I would appreciate it if someone
would retrieve and evaluate cxref for this purpose.

http://ldp.iol.it/LDP/khg/HyperNews/get/khg/3/1/1/www-personal.engin.umich.edu/~kylef
<HTML>
<HEAD>
<TITLE>HyperNews Redirect</TITLE>
<LINK rel="owner" href="mailto:">
</HEAD>
<BODY>
Broken URL:
http://www.redhat.com:8080/HyperNews/get/khg/3/1/1/www-personal.engin.umich.edu/~kylefTry:
<A HREF="../../1.html">http://www.redhat.com:8080/HyperNews/get/khg/3/1/1.html</A> 
http://ldp.iol.it/LDP/khg/HyperNews/get/khg/3/1/1/www-personal.engin.umich.edu/~kylef [08/03/2001 10.14.38]

Kernel source is already browsable online
Kernel source is already browsable online

Keywords: Linux kernel source code browsing WWW online
Date: Wed, 16 Oct 1996 22:03:05 GMT
From: Axel Boldt <boldt@math.ucsb.edu>
There's already an excellent online Linux kernel source code browser at

http://sunsite.unc.edu/linux-source/
I think we should have a pointer to it somewhere near the top of the KHG.
Thanks,
Axel

Need easy way to download whole KHG
Need easy way to download whole KHG

Keywords: administrivia
Date: Sat, 18 May 1996 07:04:31 GMT
From: <unknown>
Those of us who have slow dialup links, and who consequently
use suck (or equivalent) to download news quickly and read
it offline, would similarly like to be able to download the
KHG and read it offline. It would be nice if there were a
packaged version to make this simpler. (Doing a tree
traversal with Lynx is no fun at all if one needs to be
sure one has obtained the whole thing.)
Messages
1. Mirror packages are available, but that's not really enough by Michael K. Johnson
1. Anyone willing to put the whole lot up for FTP by Richard Braakman
-> Probably... by Michael K. Johnson
1. Pointer to an HTTP mirror package by Amos Shapira
2. postscript version of these documents? by Michael Stiller
3. an iterator query might not be hard... by Mark Eichin
1. Might be harder here... by Michael K. Johnson

Mirror packages are available, but that's not really enough
Mirror packages are available, but that's not really

enough
Re: Need easy way to download whole KHG
Date: Sat, 18 May 1996 15:41:36 GMT
I also have a slow dialup link; I use a 14.4 modem and I'm currently about 20 network hops away from
the home site. So I understand how annoying it can be.
Nevertheless, I'm not going to offer a tar file of the KHG at this time. The primary reason for going to
a web-based presentation was to make it interactive, and pushing it off-line would make it less
interactive.
For now, you can use web mirroring packages to create your own local mirror and read from there.
Several such packages are available, and since I don't use any of them, I don't have details like names:
can someone else who knows please post details?
For the future, it would be nice if someone is interested enough in mirroring to write scripts for the
KHG that allow you to read it off-line, but also allow you to respond. However, I will only give my
imprimatur to a system which:
● Mirrors everything, including responses. The responses are just as much a part of the new KHG
as the bodies of the articles.
● Provides a reasonable method of posting responses from the off-line state; preferably it will
queue them up, and the process of "mirroring" the KHG will involve two steps:
1. Post any queued responses back to the original site
2. Mirror the whole thing again.
They need to happen in that order, obviously.
If anyone is interested in working on such a system, they are welcome to. I won't be working on it; I
have too many other things on my plate and I wouldn't use it, so I'd be a lousy choice for building it.
The person who writes the system really ought to be someone who will use it...
I would include a pointer to those scripts from within the KHG if anyone wrote them to my
satisfaction.
Messages

Mirror packages are available, but that's not really enough
1. Pointer to an HTTP mirror package by Amos Shapira

Mirror whole KHG package, off line reading and Post to this site
Mirror whole KHG package, off line reading and

Post to this site
Re: Mirror packages are available, but that's not really enough (Michael K. Johnson)
Date: Sun, 02 Mar 1997 18:00:02 GMT
From: Kim In-Sung <kisskiss@soback.kornet.nm.kr>
(My English is not perfect. Sorry)

Try this package
ftp://sunsite.unc.edu/pub/Linux/Incoming/getwww-1.4.tar.gz
I write this program. :-P
I mirror KHG, open this KHG locally and post this article locally. But this article is posted KHG
original site.
Getwww translate absolute URL to relative links between local files. and don't touch <FORM
ACTION...>.
So you can get KHG in batch mode and read local file reading mode in Netscape and post your article
to this site.
KHG HTML links are some mingle(is it correct? I mean "not simple" Hmmmm... maybe "complex
links") so you can use Getwww options like this
getwww http://www.redhat.com:8080/HyperNews/khg.html \
-S embed frame outline show=all admin Response \
-D HyperNews/SECURED HyperNews/thread.pl \
HyperNews/edit-subscribe.pl
-l login:passwd
-S means don't get files have [string] in file name
the same html file are included normal mode, frame mode,
embed mode, admin mode .....
So such URL is useless.
-D means don't get files in [dir] directory
files in those directory are not meaning in local mode.
-l Getwww uuencode your password and user name, but Redhat
Web server not accept such simple encoding, so this option
is useless, but I use this option for your understanding.

Mirror whole KHG package, off line reading and Post to this site
Next time you want get new KHG version, use this option
getwww http://www.redhat.com:8080/HyperNews/khg.html \
-S embed frame outline show=all admin Response \
-D HyperNews/SECURED HyperNews/thread.pl \
HyperNews/edit-subscribe.pl
-l login:passwd \
-i
With -i option, Getwww get updated files from KHG site.
I write README file for Getwww in Korean, I found someone translate it in English. Maybe you can
help me.
Thanks
Messages

Untitled
Untitled
Re: Mirror whole KHG package, off line reading and Post to this site (Kim In-Sung)
Keywords: getwww mirror
Date: Sun, 08 Jun 1997 01:08:53 GMT
From: Jim Van Zandt <jrv@vanzandt.mv.com>
The sunsite archives have apparently been reorganized. The getwww application has been moved to:
ftp://sunsite.unc.edu/pub/Linux/apps/www/mirroring/getwww-1.4.tar.gz

That works. (using it now). Two tips:
That works. (using it now). Two tips:

Date: Wed, 19 Mar 1997 00:49:48 GMT
From: Richard Braakman <dark@xs4all.nl>
First, the package is no longer in incoming. I found it in

ftp://sunsite.unc.edu/pub/Linux/apps/www/getwww-1.4.tar.gz (Not hard to find, but I thought I'd save
people the trouble) Second, it's a good idea to put ~johnsonm/index.html among the URLs to avoid. It
contains further links to scanned books, and getwww will happily suck down 2 MB of prose :-) I
haven't tried the "upgrade" flag yet.
Messages

Anyone willing to put the whole lot up for FTP
Anyone willing to put the whole lot up for FTP

Re: That works. (using it now). Two tips: (Richard Braakman)
Date: Wed, 19 Mar 1997 00:49:48 GMT
From: Richard Braakman <dark@xs4all.nl>
Is anyone interested in putting all of their mirror up for FTP in one file, to save load on the Hypernews
servers and speed up the downloading a bit?
Messages
1. Probably... by Michael K. Johnson

Probably...
Probably...
Re: Anyone willing to put the whole lot up for FTP (Richard Braakman)
Date: Tue, 08 Apr 1997 00:39:17 GMT
I'll probably do that at some point. I only ask that if anyone else wants to do this, they make AT
LEAST a nightly mirror so that downloaders are up-to-date.

Pointer to an HTTP mirror package
Pointer to an HTTP mirror package

Date: Sun, 19 May 1996 08:16:33 GMT
From: Amos Shapira <amoss@cs.huji.ac.il>
A little dig into the archives of comp.lang.perl provided the following link:
http://www.cs.trinity.edu/~nyarrow/MirrorTools/
This is a beta release, and I haven't run it yet, but it looks promising.

postscript version of these documents?
postscript version of these documents?

Date: Wed, 22 May 1996 21:33:33 GMT
From: Michael Stiller <michael@toyland.ping.de>
like the other 'old' khg, a postscript version would be good, it's nice to have a carbon copy at hand if
you try to figure things out.
Messages

Sure!
Sure!
Re: postscript version of these documents? (Michael Stiller)
Date: Sun, 26 May 1996 16:58:49 GMT
Most browsers allow you to print the page you are looking at. Just print the page that you are
interested in, and any responses to that page the bear on what you are doing. That way, you will get the
latest possible information; it will be better than the old KHG where you could hardly know if the
information you were reading was two, three, or four years old.
Messages
1. Not so Sure! by jeff millar

Not so Sure!
Not so Sure!
Re: Sure! (Michael K. Johnson)
Date: Mon, 03 Jun 1996 02:52:39 GMT
From: jeff millar <jeff@wa1hco.mv.com>
You suggested printing each page...sound painful to me given

several hundred pages. Having an assembled downloadable
format as an option seems desireable.
jeff
Messages
1. Enough already! by Michael K. Johnson

Enough already!
Enough already!
Re: Sure! (Michael K. Johnson)
Re: Not so Sure! (jeff millar)
Date: Mon, 03 Jun 1996 18:13:16 GMT
You suggested printing each page...sound painful to me given several hundred pages.
Web page, not paper page.
● Only print the pages you want to read right away.
● If you want to print all responses with the page, click on the All item offered under ``[Embed
Depth: 1 2 3 All]'' above the response list on each HyperNews page and see what you get.
● There aren't hundreds of web pages here, and if there ever are, that would equal many thousands
of paper pages, and would be yet another argument against making it easy to print the whole
thing out.
End of discussion. (Yes, that's a hint...)
Messages

an iterator query might not be hard...
an iterator query might not be hard...

Keywords: server administrivia
Date: Fri, 24 May 1996 23:14:34 GMT
From: Mark Eichin <eichin@kitten.gen.ma.us>
Something I implemented in kittenweb (still just something I'm playing

with, not widely published yet) was a "report" query that let you grab
the tree at some point and have the server hand you everything below
it. It also did some useful massaging of links (expecting that it
would be in print form.) Kittenweb is inspired by wikiwikiweb,
http://c2.com/ and is also written in perl.
If the sources to the server used here are available, I'd take a look
and see how hard it would be to add...
_Mark_ <eichin@kitten.gen.ma.us>
Messages
1. Might be harder here... by Michael K. Johnson

Might be harder here...
Might be harder here...

Re: an iterator query might not be hard... (Mark Eichin)
Keywords: server administrivia
Date: Sun, 26 May 1996 16:30:17 GMT
...a "report" query that let you grab the tree at some point and have the server hand you
everything below it.
That would be harder with the KHG because each page is assembled on-request by CGI scripts. We
might want to add the indexed kernel source to the KHG at some point, so it might be nice to be able
to choose whether or not to include that in the response to the request. :-)
So I don't think it is a trivial operation.
If the sources to the server used here are available, I'd take a look and see how hard it
would be to add...
HyperNews feel very free to take a look at it and implement similar functionality for it. I'm sure that it
would be useful for far more than just the KHG; HyperNews is used to run a lot of other sites as well.

KHG being mirrored nightly for download!
KHG being mirrored nightly for download!

Date: Sat, 14 Jun 1997 15:10:06 GMT
Please download ftp://ftp.redhat.com/johnsonm/khg.tar.gz and try it out. There are a few broken links
in it, but it seems to work. You'll need a connection to post, but not to read. Thanks to Kim In-Sung for
getwww which has enabled this!

Appears to be a bug in getwww, though...
Appears to be a bug in getwww, though...

Date: Sat, 14 Jun 1997 14:51:50 GMT
getwww doesn't seem to understand different port numbers. It would be fine if there were a
configuration option that said "do follow links that are on the same site but have different port
numbers" or "don't follow links that are on the same site but have different port numbers", but getwww
doesn't understand either...
When getting the KHG from port 8080, getwww sees absolute links without a port number specified,
and assumes that they should come from port 8080 instead of port 80.
There's another bug, but I don't know whether its a but in getwww or in my. Richard says that "it's a
good idea to put ~johnsonm/index.html among the URLs to avoid" but I can't make it actually avoid
that, and I've tried a lot of command-line arguments by now. Has anyone made that work? I managed
to make it not suck down my home page by explicitly telling the server on port 8080 not to serve
public_html pages, but that means that the link to the device driver paper it still tries to download and
leaves as a broken link (because of the port number bug). It would make more sense for it to leave it as
a remote link, I'd think.

Sucking up to the wrong site... ;)
Sucking up to the wrong site... ;)

Re: Appears to be a bug in getwww, though... (Michael K. Johnson)
Date: Tue, 17 Jun 1997 14:44:11 GMT
I haven't been able to avoid the ~johnsonm homepage either - I'm not sure what's the reason, either
there could be some mesh with absolute WWW-pagenames or then I just don't know how to
quote/escape the character properly... hmm, I'm not sure I tried that, reating it as a mask with \~...
But anyway, I avoided that page using another, altough a bit rough method - I ignored index.html
pages altogether. This is because to the best I'm able to tell, none of the actual stuff on these pages uses
that filename. This should also keep the thing from running rampant on any other possible future
index.html references.

Help make the new KHG a success
Help make the new KHG a success

Keywords: contributing
Date: Wed, 15 May 1996 16:08:04 GMT
I can't write and maintain the KHG by myself. In order to make this a success, I need help from
readers:
● If you see something wrong, respond with a correction.
● If you see something missing, either respond to say that it is missing, or respond with the
missing piece.
● If something is unclear, try to find out what it is saying and post a clarification. Otherwise, post
a request for clarification.
● If you can answer a request for clarification, do so.
● In order to encourage me to continue providing this service, subscribe to the pages in which
you are most interested.
Responses do not have to be perfect, or in perfect english. Think of this like a mailing list, not a book.
When I incorporate responses into the the main text (body) of an article, I'll edit it. I'm good at editing,
and I enjoy doing it, especially when I don't have a deadline.
If the KHG remains a simply personal effort, it will become less and less relevant. With your help, it
can become more and more worth reading.
Thanks!
Messages

Linux Kernel Hackers&#39; Guide

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Linux Kernel Hackers&#39; Guide

Încărcat de

Drepturi de autor:

Formate disponibile

The Linux Kernel Hackers' Guide

The HyperNews Linux KHG Discussion Pages