INFORMATION TECHNOLOGY

& BIOINFORMATICS

BSCBT-604

This SIM has been prepared exclusively under the guidance of Punjab Technical University (PTU) and
reviewed by experts and approved by the concerned statutory Board of Studies (BOS). It conforms to the
syllabi and contents as approved by the BOS of PTU.

Copyright Manish Gupta, 2012


Reviewer: Dr. Ankur Diwan
No part of this publication which is material protected by this copyright notice may be reproduced or transmitted or
utilized or stored in any form or by any means now known or hereinafter invented, electronic, digital or mechanical,
including photocopying, scanning, recording or by any information storage or retrieval system, without prior written
permission from the publisher.
Information contained in this book has been published by Excel Books Pvt. Ltd. and has been obtained by its authors from
sources believed to be reliable and are correct to the best of their knowledge. However, the publisher and its author shall in
no event be liable for any errors, omissions or damages arising out of use of this information and specifically disclaim any
implied warranties of merchantability or fitness for any particular use.
Published by Anurag Jain for Excel Books Pvt. Ltd., A-45, Naraina, Phase-I, New Delhi-110 028
Tel: 47144444 email: eb@excelbooks.com

PTU DEP SYLLABI-BOOK MAPPING TABLE


BSCBT-604 INFORMATION TECHNOLOGY & BIOINFORMATICS
Syllabi
Overview of computers. Operating system and networking concept
(Windows, UNIX, Linux) Basics of internet and websites.

MS-Office.

Data structure and Algorithms.

Bioinformatics Internet Applications

Mapping in Book
Unit 1: Overview of
Computers
(Page 1-20)
Unit 2: Introduction of
MS Word
(Page 21-52)
Unit 3: Operating Systems
(Page 53-73)
Unit 4: Bioinformatics
Internet Applications
(Page 75-97)

Contents
UNIT 1

OVERVIEW OF COMPUTERS

Introduction
Computer System
Basic Ideas and Terms
Components of a Computer System
Basic Architecture
Computer Organisation
Data Representation
Performance Factors
Summary
Keywords
Review Questions
Further Readings
UNIT 2

INTRODUCTION OF MS WORD

Introduction
New in Microsoft Word 2000
Starting of MS Word
Components of the MS Word Window
Start Working with Word Document
File-related Operations
Opening a New File
Save a Document
Reversing and Reapplying Commands
Summary
Keywords
Review Questions
Further Readings
UNIT 3

OPERATING SYSTEMS
Introduction
Computer Hardware and OS Interaction
Single User Single Processing System
Multiprogramming Operating System
Architecture and Design of OS
Interface Design and Implementation
Window Systems based on PCs

Summary
Keywords
Review Questions
Further Readings
UNIT 4

BIOINFORMATICS INTERNET APPLICATIONS


Introduction
A Protein Sequence with Ends Indicated
Visualisation of Structure Information
Statistics
Sequencing and Assembling Genome using Computer
Initial Fragment Assembly
Summary
Keywords
Review Questions
Further Readings

Unit 1 Overview of
Computers

Overview of Computers

Notes

Unit Structure
Introduction

Computer System

Basic Ideas and Terms

Components of a Computer System

Basic Architecture

Computer Organisation

Data Representation

Performance Factors

Summary

Keywords

Review Questions

Further Readings

Learning Objectives
At the conclusion of this unit, you will be able to:
Define the Computer System

Understand the Components of a Computer System

Know the performance factors and data representation

Introduction
When we speak about computers, what exactly are we referring to? Many of us tend
to define a computer as a computing machine, like a calculator. However, this
definition ignores most of what a computer can do. In layman's terms, a
computer can be defined as a machine that generates information
from the data that is fed into it. The application areas of
computers are virtually unlimited, and we find computers in every aspect of our lives. From a
simple operation such as playing a video game to more complicated applications such as
weather forecasting, computers are found everywhere. Let us take a simple example of
a person who needs to purchase a can of juice from a supermarket.
He walks into the supermarket, picks up a can of juice and proceeds to the cash
counter. The counter person scans the code printed on the label to generate a
bill. This scanning of the code is computerised. The man pays his bill with his credit
card and walks out of the supermarket. He has just used a computer, which will transfer the
cost of the can of juice from his bank account to the supermarket. The man then
crosses the street and enters the office of his travel agent. He tells the agent that
he plans to take a vacation and inquires about the places he could go. The
agent turns to his computer, presses a couple of keys and immediately gets a list of
prospective places. The agent has just used a database application of the
computer. The man selects a place and confirms his travel. The agent again turns to
his computer and moments later hands him the air tickets to that place. The agent
actually connected to a computer that made the reservation. The man then happily
comes to his office and decides to inform his wife about the vacation. He therefore
sends an e-mail to his wife. The man has used a network application of the computer.
Numerous examples of this kind can be cited. As technology advances, new
application domains for the computer are created every day. It is just a
matter of time before life without computers cannot be imagined.

Computer System
A computer can simply be defined as a machine that generates
information from the data that is fed into it. The question that arises at this
point is: how does the computer actually generate the output? Certain
components of a computer do the job. At this stage, we can define a computer
as a device that manipulates the data fed into it to generate a desired output according
to a set of instructions given by the user. This definition clearly marks the
difference between a computer and a calculator. In simple terms, a calculator can be
said to be a subset of a computer. Figure 1.1 shows a computer along with some
peripheral devices.

Figure 1.1: A Computer


The computer consists of a set of hardware which, when coupled with a program, can
be turned into a tool for some specific purpose. The hardware refers to various
components such as the input and output devices, memory, storage, processor, etc.
They perform various functions, from accepting inputs to actually executing
the instructions from the user. The user gives the instructions in the form of programs.

Types of Computer
Computers can be categorized based on their size and design. Modern computers
vary in size from those that fill an entire room to those small enough
to fit on your thumbnail with room to spare. As a general rule, the larger
the system, the greater its processing speed, storage, cost and ability to handle
a number of peripheral devices. Size also affects the number of users that
can work on the system simultaneously. At the lowest end of the size scale is the
microcomputer. Microcomputers are small devices that can be used to perform dedicated tasks like
scanning the code of a can of juice. The more familiar personal computer is a kind of
microcomputer. A typical microcomputer is shown in Figure 1.2.

Figure 1.2: A laptop


2 Self-Instructional Material

Next in the line is the minicomputer as shown in Figure 1.3. These are also small
general-purpose computers having the capability to serve a number of users
simultaneously. They are generally more powerful and expensive than the
microcomputers. In size, they range from desktops to a size of a small file cabinet.

Figure 1.3: A Minicomputer


The mainframe computers are much bigger in size and offer very high processing
speed and storage capacity. Finally come the supercomputers, which are the fastest and
the most expensive systems in the world. They are typically used for complex
scientific operations like weather forecasting, statistical analysis, etc. Figures 1.4 and
1.5 show a mainframe and a supercomputer respectively.

Figure 1.4: Mainframe Computer

Figure 1.5: Some Supercomputers


Computers can also be classified according to their design. Most of today's computers
follow the design that was formulated by John von Neumann and others in the
mid-1940s. This design is called the von Neumann architecture. It had a
single control unit, primary storage and arithmetic and logic unit in the processor unit. It
interpreted and processed the data as a single sequential stream. A single channel was
available to transfer data from the storage, which limited the speed at
which the computer could operate. This design was therefore followed by newer
designs in which there are additional storage, control and arithmetic-logic sections.

This enabled the simultaneous processing of a number of instructions. In essence, they
were multiprocessor systems. The von Neumann design is based on three essential
concepts:
1. A single read-write memory stores the data and the instructions.
2. The type of the data contained in the memory is of no regard and it can be
   addressed by location.
3. Execution occurs from one instruction to another i.e. sequentially.
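
The three concepts above can be illustrated with a toy stored-program machine. The sketch below is only an illustration written for this unit: the instruction names (LOAD, ADD, STORE, HALT), the memory layout and the function name run are assumptions made for the example and do not describe any real machine.

```python
# Toy illustration of the three von Neumann concepts (all names are hypothetical).

def run(memory):
    """Execute the program held in `memory`, starting at address 0."""
    accumulator = 0   # single working register
    pc = 0            # program counter: address of the next instruction
    while True:
        opcode, operand = memory[pc]   # fetch: instructions live in the same memory as data
        pc += 1                        # sequential execution: go on to the next cell
        if opcode == "LOAD":
            accumulator = memory[operand]    # data is addressed by location
        elif opcode == "ADD":
            accumulator += memory[operand]
        elif opcode == "STORE":
            memory[operand] = accumulator
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# One read-write memory holds the instructions (cells 0-3) and the data (cells 4-6).
memory = [
    ("LOAD", 4),     # 0: accumulator <- memory[4]
    ("ADD", 5),      # 1: accumulator <- accumulator + memory[5]
    ("STORE", 6),    # 2: memory[6] <- accumulator
    ("HALT", None),  # 3: stop
    2,               # 4: first operand
    3,               # 5: second operand
    0,               # 6: the result is written here
]

result = run(memory)
print(result[6])  # 5
```

Because the program and its operands share one list, the same cell could in principle be read as data or as an instruction, which is the point of the second concept.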

Generation of Computers
Computer production started in the 1940s when the first electronic computer was
created. Since then, improvements and enhancements in the field of electronics
have had considerable influence on the design of computers, leading to what are known
today as the generations of computers.

First Generation
Dr. John Vincent Atanasoff and Clifford Berry created the first electronic computer.
They called it the Atanasoff-Berry Computer or ABC. The ABC used vacuum tubes
for storage and arithmetic and logical functions. This work was noticed by John W.
Mauchly, who in 1940-41 teamed up with J. Presper Eckert Jr. and organised the
construction of the ENIAC. The ENIAC, shown in Figure 1.6, was the first
general-purpose computer to be put fully into operation.

Figure 1.6: ENIAC


Using vacuum tubes (Figure 1.7), the ENIAC could perform 300 multiplications per
second. However, it weighed 30 tons and occupied the space of a three-bedroom
house, which was its major disadvantage. In the mid-1940s, John von
Neumann published a paper in which he introduced the concept of the stored program and the use of the
binary number system for computation. This idea was incorporated into a
new computer called the EDVAC and then the EDSAC. Later, the UNIVAC,
shown in Figure 1.8, came into existence.

Figure 1.7: Vacuum Tubes

Figure 1.8: UNIVAC

Second Generation
The main disadvantages of the first generation computers were that the vacuum
tubes, owing to their short life, had to be replaced frequently and that they generated a lot
of heat. These computers also took up a lot of space, and programming them was a tedious
task because programs had to be written in machine language. In the 1950s, these
disadvantages led to the creation of computers that were much smaller and faster.
In addition, programming these computers was easier because they
supported high-level programming languages. These languages were more
English-like and easier to understand. The computers in this generation used solid-state
components such as the transistors developed at Bell Laboratories. Some
computers of this generation are the LEO Mark III, ATLAS and the IBM 7000 series,
shown in Figure 1.9.

Figure 1.9: IBM 7000 Series Computer

Third Generation
The second-generation computers were well suited to either scientific or
non-scientific applications, but not both. Thus, in 1964, IBM announced the System/360
family of mainframes, in which each processor had a large set of built-in instructions.
Some of these instructions could be used effectively for scientific calculation while the
others were more suited for record-keeping applications. The computers in this
generation used the technology of Integrated Circuits (IC). Since the ICs were small in
size, there was a further reduction in size of these computers. A typical IC is shown in
Figure 1.10.

Figure 1.10: Integrated Circuit (IC)

Fourth Generation
As the technology advanced, the size of the ICs reduced and more and more
components could be packed into smaller chips. They were called Large Scale
Integration (LSI) and Very Large Scale Integration (VLSI) chips, as in Figure 1.11.
Since the computers in this generation used these chips, their size was greatly
reduced. The speed at which they operated increased and their cost decreased. The
computers of today are said to be of the fourth generation.

Figure 1.11: VLSI Chip

Fifth Generation
It is predicted that by the early 21st century, computers will be able to behave like humans,
making interaction more human-like. They would be able to think and act on their
own. This situation is depicted in the motion picture Terminator II, where
the computers act on their own based on their own judgment.

Basic Ideas and Terms


Data
Data is a name given to the facts that are supplied to the computer. It is then
processed to obtain the desired output. In simple terms, data can be defined as the
raw form of information. Typical data may not make sense to the user. It is only after
processing that the data is transformed into something that is useful to the user. Thus,
it can be said that data is different from information. As an example, in the operation
to add two numbers A and B, the values A and B and the add operator (+) are data.

Program
We know that the computer is a digital device. It is thus capable of understanding
digital signals. These signals are generated based on certain instructions that the user
feeds into the computer. A program can be defined as a collection of such
instructions.

A simple example of a program to add two numbers is given below:

Input (A)
Input (B)
Add A to B
Assign value to C
In the steps given above, Input asks the user to enter the value of A and B. The
computer then adds the value of A to the value of B and assigns the result to C. These
four instructions are collectively called a program.
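
The four instructions can be written in a real programming language. The following is a minimal sketch in Python; the function name add_program and the read_value parameter are assumptions introduced for this illustration, and the two inputs are supplied from a list so that the example runs without a keyboard.

```python
def add_program(read_value):
    """Run the four-step program; `read_value` supplies one input each time it is called."""
    a = read_value()   # Input (A)
    b = read_value()   # Input (B)
    total = a + b      # Add A to B
    c = total          # Assign value to C
    return c

# Feed the program the values 2 and 3 in place of keyboard input.
values = iter([2, 3])
c = add_program(lambda: next(values))
print(c)  # 5
```

To read the numbers from the keyboard instead, the same function could be called as add_program(lambda: float(input("Enter a number: "))).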

Information
Information can be described as a more useful and intelligible form of data. A program
operates on the data in a specified format and transforms it into information. For
example, the bill that is produced after the user feeds in the data is the information.

Hardware
Hardware is the term used to define all the electronic and mechanical components
found inside a computer system. These components are activated as required to
execute the program. For example, the counter person scans the code of the
commodity and gets a printout of the bill. The scanner and the printer are two of the
many hardware components that are used in the process.

Software
Suppose the counter person uses a set of programs to generate the bill. These
programs are in turn executed by a set of underlying programs. These sets of
programs are called software. Software can be broadly divided into two categories:
application software and system software. Application software is the one that is
created to cater to a specific task, for instance, generation of bills. System software is
that which runs the application software. It also provides an interface between the
application software and the hardware thus enabling the application software to
access the hardware units. It can therefore be said that system software runs the
application software along with the hardware. The operating system is a very good
example of system software.
Example: Consider a situation where a man needs to purchase three cans of mango
juice. The man goes into a supermarket, picks up three cans of juice and brings them to
the counter, where the counter person scans the code printed on the cans. After scanning,
let us assume that the computer produces a result that reads D20567, J002A, 3,
$5.00, $15.00, $15.00, 11/11/05. The counter person may find this difficult to understand
(unless he is used to it). This is called data. Now if the counter person uses a program
that takes this data as input and generates a bill in the following format, the
data is said to be transformed into information. The entire process is illustrated in Figure 1.12.
Bill Number: D20567
Date of purchase: 11th Nov 2005

S. No.   Item Code   Item Name               Quantity   Rate    Amount
1.       J002A       Daily Mango Juice Can   3.00       $5.00   $15.00
Total                                                           $15.00

This bill is said to be information.
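
The step from data to information in this example can be sketched as a small program. The sketch below is an assumption-laden illustration, not the billing program the text refers to: the field order of the raw record is inferred from the bill, the ITEM_NAMES lookup table and the function name format_bill are hypothetical, and the date is assumed to be in day/month/year form.

```python
from datetime import datetime

# Hypothetical lookup from item code to item name (the raw record has no name field).
ITEM_NAMES = {"J002A": "Daily Mango Juice Can"}

def format_bill(raw):
    """Turn one raw comma-separated record (data) into a readable bill (information)."""
    fields = [field.strip() for field in raw.split(",")]
    # Assumed field order: bill number, item code, quantity, rate, amount, total, date.
    bill_no, item_code, quantity, rate, amount, total, date = fields
    purchased = datetime.strptime(date, "%d/%m/%y")   # assumed day/month/year
    lines = [
        f"Bill Number: {bill_no}",
        f"Date of purchase: {purchased.strftime('%d %b %Y')}",
        f"1. {item_code}  {ITEM_NAMES.get(item_code, 'Unknown item')}  "
        f"Quantity: {float(quantity):.2f}  Rate: {rate}  Amount: {amount}",
        f"Total: {total}",
    ]
    return "\n".join(lines)

raw = "D20567, J002A, 3, $5.00,$15.00, $15.00, 11/11/05"
bill = format_bill(raw)
print(bill)
```

The raw string and the printed bill carry the same facts; only the second is arranged so that a person can read it at a glance.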

Figure 1.12: Data, Program, Information, Hardware and Software

Components of a Computer System


The basic architecture of a computer system can be divided into four major
components each doing a separate but inter-linked function. Various hardware
devices realize these components. The components are listed below:
1. Input
2. Storage
3. Processing
4. Output.

An input component in a computer system is concerned with the data that is fed in.
With the help of these components, data is actually entered into the computer. Several
input devices can be used to enter data. These devices use a variety of technologies,
ranging from key presses to voice. The input components have evolved from
the switches that were used in first generation computers to modern,
sophisticated voice recognition systems. The keyboard and the mouse are good
examples of input components. Figure 1.13 shows a few input devices.

Figure 1.13: Input Devices

While data is being processed, it needs to be stored in the computer. It
is also necessary to store the programs that will manipulate the data in order to
generate the output, which will also be stored. The storage components deal with the
storage of programs and data. Storage is of two kinds: primary and secondary.
Primary storage components are volatile. That is, they can store data only as long as they
are supplied with electricity. They are used for temporary storage of data. Secondary
storage components, on the other hand, can store data permanently. Two typical
storage components are shown in Figure 1.14.

Figure 1.14: Storage Components


The computer system, as mentioned earlier, does the job as specified by some
programs. How does the computer do it? To execute the instructions that are given in
a program, the computer has a hardware unit called the processor. The processor
therefore can be thought of as the heart of the computer. Figure 1.15 shows the
Pentium III processor. The overall performance of the computer system is mostly
based on the performance of the processor. Therefore, there is a constant attempt to
optimise the usage of the processor.

Figure 1.15: The Pentium III Processor


The computer, after processing the data, generates an output. This output can be in
the form of plain text, printouts, graphs, etc. For this purpose, the output devices are
used.
Various types of output devices are available in the market today. They use a variety
of technologies to generate specific types of outputs. For instance, plotters are used for
outputting engineering and other precision drawings, printers are used to generate
letters, bills, etc. and the Video Display Unit (VDU) is used to view the output. Figure
1.16 shows a few output components.

Figure 1.16: Output Components

Basic Architecture
The basic architecture of a computer system is explained with the help of Figure 1.17.

Figure 1.17: Basic Architecture of a Computer System


In Figure 1.17, we can clearly bring out the four major components. The input
components are denoted by Input. They are used to give the inputs in the form of
program and data to the computer, which are stored temporarily in the primary
storage and if necessary they are also stored permanently in the secondary storage.
The primary and the secondary storage are the storage components. The next step is
to carry the inputs to the ALU (Arithmetic and Logic Unit) where the data is
processed according to the program fed. The ALU sends the processed output back to
the primary storage from where, if necessary, it is sent to the secondary storage. The
output is also sent to the output devices, marked as Output. These devices are the
output components. All these units work in unison and are interdependent.
For instance, the ALU will have no data to manipulate if the data is neither fed in
with the help of the input devices nor already stored in the storage components.
Thus, it becomes essential to have a central unit that governs the functioning of these
devices. For this purpose, a Control Unit is present. The major function of the
control unit is thus to provide an environment in which all the other units can function,
interacting and exchanging information with each other.

Computer Organisation
A computer must have a means of getting information from the outside world and must
be able to communicate results to the external world. Programs and data must
be entered into computer memory for processing, and the results obtained from
computations must be recorded or displayed for the user. The most familiar method
of entering information into a computer is a typewriter-like keyboard that
allows a person to enter alphanumeric information directly. Every time a key is
pressed, the terminal sends a binary coded character to the computer. When input
information is transferred to the processor via a keyboard, the processor is idle
most of the time while waiting for the information to arrive. To use a computer
efficiently, large amounts of programs and data must be prepared in advance and
transferred to a storage medium such as a disk. The information on the disk is then transferred
into computer memory at a rapid rate. Results of programs are likewise transferred to
high-speed storage, from which they can later be transferred to an output device.

Peripheral Devices
Devices that are under the direct control of the computer are said to be
connected online. These devices are designed to read information into or out of the memory
unit when the CPU gives a command. Input or output devices connected to the
computer are also called peripherals. Among the most common peripherals are
keyboards, display units and printers. Magnetic disks are peripherals that provide
auxiliary storage for the system.

Other input and output devices are digital incremental plotters, optical and magnetic
character readers, analog-to-digital converters, etc. Not all input comes from people,
and not all output is intended for people. Computers are used to control various
processes in real time, such as machine tooling, assembly line procedures, and
chemical and industrial processes. For such applications, a method must be provided
for sensing status conditions in the process and sending control signals to the process
being controlled.

Input Devices
Input devices provide an interface between the users and the machine for inputting
data and instructions. One of the most common examples is the keyboard. Data can be
input in many other forms: audio, visual, graphical, etc.
Some common input devices are listed below:
1. Keyboard
2. Mouse
3. Voice data entry
4. Joystick
5. Light pen
6. Scanner
7. Secondary storage devices such as floppy disks, magnetic tapes, etc.

The data in any form is first digitized, i.e. converted into binary form, by the input
device before being fed to the Central Processing Unit (CPU).
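
As a rough illustration of what converting to binary form means for typed text, the sketch below turns each character into an 8-bit pattern. It assumes the character's ASCII/Unicode code number is used as the code, which is a simplification: a real keyboard sends its own key codes that are translated later. The function name to_binary_code is hypothetical.

```python
def to_binary_code(text):
    """Return the 8-bit binary pattern of each character's code number."""
    # ord() gives the character's code number; format(..., "08b") writes it
    # as a binary string padded with leading zeros to 8 bits.
    return [format(ord(ch), "08b") for ch in text]

print(to_binary_code("A"))   # ['01000001']
print(to_binary_code("Hi"))  # ['01001000', '01101001']
```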

Output Devices
Like the Input devices, the Output devices also provide an interface between the user
and the machine. A common example is the visual display unit (monitor) of a
personal computer. The output unit receives the data from the CPU in the form of
binary bits. This is then converted into a desired form (graphical, audio, visual, etc.)
understandable by the user.
Some common output devices are:
(i) Visual Display Unit (Monitor)
(ii) Printers
(iii) Speakers
(iv) Secondary Storage Devices
The input and output units are collectively referred to as peripherals.

Memory
The memory system is at the heart of a computer system. It is the memory system that
makes a computer what it is. The input data, the instructions necessary to manipulate
the input data and the output data are all stored in the memory.

The memory unit is an essential part of any digital computer because a computer can process
data only if it is stored somewhere in its memory. For example, if a computer has to
compute f(x) = sin x for a given value of x, then x is first stored in memory,
and a routine is then called that contains the program that calculates the sine of
the given x. Memory is thus an indispensable component of a computer.

Since these memory devices are generally silicon chips containing several thousand
memory cells, a single memory is not adequate. Several types of
memory are therefore used within the same computer system.

Storage Technologies
Various storage technologies have been developed that make use of the bi-stable properties
of different objects. The most popularly employed technologies are described briefly below:

Electrical Storage
These storage devices use the electronic charges (-ve and +ve) for the data and/or
instruction storage. RAM, ROM, PROM and a host of other fast primary memories
rely on this technology. Though they are fast, they have small capacity and high cost.

Magnetic Storage
Magnetic polarizations (North and South) of a magnetic substance are exploited for
data and/or instruction storage in these types of memories. Most large-capacity
and relatively cheap storage devices, such as floppy disks and hard
disks, fall into this category.

Optical Storage
Optical storage devices use the fact that a light source (usually a laser beam) can burn
holes in a disk that can be read back by reversing the source direction. CD-ROM
disks employ this technology for data storage. The capacity of such devices is very high
and the cost relatively low.

Main Memory
Main memory is also known as primary memory. It is faster than secondary memory, and the CPU
communicates with it directly. Main memory contains all the data that is currently
being processed by the CPU. Its cost is higher than that of secondary memory because
the production of high-speed memory employs sophisticated design techniques.

Secondary Memory
Apart from main memory there is secondary memory, which is slower than
main memory and is used to provide a backup. It is also called auxiliary memory.
Main memory gathers the data currently required for processing, and the CPU uses this data.

Cache Memory
This is the smallest and fastest memory component in memory hierarchy of a digital
computer. It increases the speed of processing. It is placed between the main memory
and CPU.

It stores data in advance for processing by the CPU. In this way it increases the flow
of data to the CPU, which is inherently fast.
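
The benefit of a small, fast memory placed in front of main memory can be sketched by counting how often a read is served from the cache. This is a simplified model under stated assumptions: the class name CachedMemory, the capacity of two entries and the first-in-first-out eviction rule are all choices made for the illustration, and the model keeps recently read values rather than fetching data in advance as real caches may also do.

```python
class CachedMemory:
    """Main memory with a tiny cache in front of it (illustrative model only)."""

    def __init__(self, main_memory, capacity=2):
        self.main_memory = main_memory
        self.capacity = capacity
        self.cache = {}    # address -> value; dicts keep insertion order (Python 3.7+)
        self.hits = 0      # reads served from the fast cache
        self.misses = 0    # reads that had to go to the slower main memory

    def read(self, address):
        if address in self.cache:
            self.hits += 1
            return self.cache[address]
        self.misses += 1
        value = self.main_memory[address]
        if len(self.cache) >= self.capacity:
            oldest = next(iter(self.cache))   # first entry inserted
            del self.cache[oldest]            # evict it to make room (FIFO)
        self.cache[address] = value
        return value

memory = CachedMemory(main_memory=[10, 20, 30, 40], capacity=2)
for address in [0, 1, 0, 0, 2, 1]:
    memory.read(address)

print(memory.hits, memory.misses)  # 3 3
```

Half of the six reads never reach main memory, which is how a cache keeps a fast CPU supplied with data.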

Other Types of Memory


There are two types of memories:
1. Random Access Memory
2. Sequential Access Memory

The basic difference between the two is that in random access memory,
access to a particular memory location does not depend on its position in any sequence, so
the access time is small and the same for every location. In sequential access memory, the time
to access a particular data item depends on the location where it is stored.
For example, if a data item is stored at XX40F, then with sequential access all the locations from XX00
to XX40F will be accessed, whereas with random access each access takes the same amount of time.
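
The difference can be made concrete by counting how many locations are touched before the wanted one is reached. The sketch below is a simplified model: the function names are hypothetical, and the address 0x40F stands in for the low digits of the XX40F example, with the XX prefix left out as an assumption.

```python
def sequential_access_steps(address, start=0):
    """Locations visited when every location from `start` up to `address` must be passed."""
    return address - start + 1

def random_access_steps(address):
    """Any location is reached directly, so the cost does not depend on the address."""
    return 1

for address in (0x00, 0x0F, 0x40F):
    print(hex(address),
          sequential_access_steps(address),
          random_access_steps(address))
# 0x0 1 1
# 0xf 16 1
# 0x40f 1040 1
```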

Types of RAM
RAMs are of two types:

Static RAM: No refresh signal is required. The data stored is lost as soon as
power is switched off.

Dynamic RAM: The data stored may be lost even when power is on, so
refresh signals must be given to maintain the data.

Data Representation
The computer that we know is a digital device, which means that digital signals are
used for its functioning. One property of digital signals is that they have
discrete voltage levels, as shown in Figure 1.18. This property
proves to be very useful for representing data in the computer. Why is this so? The answer
lies in the fact that there are many electrical and electronic devices
that can be in one of two possible states. For example, a simple switch can
be either on or off, and a bulb or a Light Emitting Diode can be glowing or not glowing.
Representing data therefore becomes very easy with these devices. At this point,
it becomes necessary to understand how the computer represents data. As we will see
shortly, the computer uses two values to represent data. These two values are called
bits, which stands for Binary Digits. They take the values 0 and 1. John von Neumann
suggested this convention. Since there are only two values with which data can be
represented in a computer, at the lowest level two discrete voltage values are used to
represent the bits. Thus digital signals, with the property mentioned above, proved
to be ideal for computer systems.

Figure 1.18: Digital Signal


Nearly all of our arithmetic operations are carried out with decimal numbers. The
computer, on the other hand, does not use this numbering system. As mentioned
earlier, the computer does its computational jobs with binary numbers, which are based
on the number 2, for a number of reasons. The most important of them are that
operations with binary numbers are more precise and that they are
faster, because hardware circuitry is available for certain binary operations.
Four numbering systems are in common use: the decimal
numbering system, the binary numbering system, the octal numbering system and
the hexadecimal numbering system. The decimal numbering system uses base 10 (thus the
prefix deci). The binary numbering system uses base 2, the octal numbering system uses base 8 and
the hexadecimal numbering system uses base 16; thus they have the prefixes bi, octa and
hexadeci respectively. We said that each numbering system uses a base value. What do we
understand by a base value? The base of a numbering system is the number of distinct
digits used to represent numerical values in that system.
Thus decimal, being base 10, requires ten digits, 0 to 9, to represent any numerical value.
Similarly, the binary numbering system requires two digits, 0 and 1, to represent any
numerical value. The following sections will describe three common numbering
systems.
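
The role of the base can be shown by writing the same value in each of the four systems and converting it back. The sketch below is a minimal illustration; the function name to_decimal is an assumption, and it relies on Python's built-in int(text, base) to turn a single digit character into its value.

```python
def to_decimal(digits, base):
    """Return the value of the digit string `digits` read in the given base."""
    value = 0
    for ch in digits:
        # int(ch, base) gives the value of one digit and raises ValueError
        # if the digit does not belong to this base (for example '8' in octal).
        value = value * base + int(ch, base)
    return value

# The same quantity, ten, written in four numbering systems.
for digits, base in [("10", 10), ("1010", 2), ("12", 8), ("A", 16)]:
    print(digits, "in base", base, "=", to_decimal(digits, base))
```

Each step multiplies the running value by the base before adding the next digit, which is why a digit's position determines its weight.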

Binary Numbering System


Computers use this numbering system for their computations. As explained
above, it has its own advantages. The computer follows three steps to complete an
arithmetic operation:
1. It converts the numeric data input to its corresponding binary equivalent.
2. Performs the desired arithmetic operation in binary.
3. Converts the result back to its corresponding decimal equivalent and outputs the
   result.
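
The three steps can be traced for a single addition. The sketch below is an illustration only: it handles non-negative whole numbers, works on strings of 0s and 1s so that each step is visible, and the function names add_binary and add_decimal_inputs are assumptions. Real hardware performs the second step with circuits rather than with string handling.

```python
def add_binary(a_bits, b_bits):
    """Add two binary strings digit by digit, carrying as in pencil-and-paper addition."""
    width = max(len(a_bits), len(b_bits))
    a_bits = a_bits.zfill(width)   # pad the shorter number with leading zeros
    b_bits = b_bits.zfill(width)
    carry = 0
    result = []
    # Work from the rightmost digit to the leftmost.
    for x, y in zip(reversed(a_bits), reversed(b_bits)):
        total = int(x) + int(y) + carry
        result.append(str(total % 2))   # digit written in this position
        carry = total // 2              # carry passed to the next position
    if carry:
        result.append("1")
    return "".join(reversed(result))

def add_decimal_inputs(a, b):
    a_bits = format(a, "b")                 # step 1: decimal input to binary
    b_bits = format(b, "b")
    sum_bits = add_binary(a_bits, b_bits)   # step 2: arithmetic in binary
    return a_bits, b_bits, sum_bits, int(sum_bits, 2)   # step 3: back to decimal

print(add_decimal_inputs(5, 3))  # ('101', '11', '1000', 8)
```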

When compared with decimal numbering system, the binary numbering system
differs in the number of digits used for a numeric value representation. The decimal
system uses ten digits, namely 0 to 9, whereas the binary system uses only two digits,
0 and 1. Table 1.1 below gives the decimal digits and its equivalent binary value.

Table 1.1: Binary Equivalent of Decimal Numbers
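
As a rough check on the values summarized in Table 1.1, the binary equivalents of the
ten decimal digits can be generated with a short Python sketch. The function name
decimal_digit_table and the four-bit width are assumptions made for illustration;
they are not taken from the table itself.

```python
def decimal_digit_table() -> list[tuple[int, str]]:
    """Pair every decimal digit 0-9 with its binary equivalent.

    Each binary value is padded to four bits, which is enough to hold
    the largest decimal digit (9 = 1001).
    """
    return [(digit, format(digit, "04b")) for digit in range(10)]


# Print one row per decimal digit: the digit, then its binary equivalent.
for digit, bits in decimal_digit_table():
    print(digit, bits)
```

The built-in format function with the "b" format code performs the conversion; the
hand methods for doing the same conversion are described under Conversion Techniques.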

In Table 1.1, we have seen the binary equivalents of decimal numbers. The question
that now arises is how to convert a decimal number into its binary equivalent and
vice versa or, more generally, how to convert a number in a particular base to its
equivalent in another base. There are conversion techniques available to accomplish
this task. When dealing with binary numbers, two more terms need to be understood:
MSB (Most Significant Bit) and LSB (Least Significant Bit). These two bits play a very
important role in many other aspects of computing, such as address calculation and
bus optimization. The MSB is the digit that occurs at the leftmost position in a
binary number. Similarly, the LSB is the digit that occurs at the rightmost position
in a binary number. Figure 1.19 shows the MSB and LSB in a binary number.

Figure 1.19
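
The definitions of MSB and LSB can be illustrated with a minimal Python sketch. It
assumes that the binary number is written as a string of 0s and 1s with the most
significant bit first; the function name msb_and_lsb is hypothetical.

```python
def msb_and_lsb(bits: str) -> tuple[str, str]:
    """Return (MSB, LSB) of a binary number written as a string.

    The MSB is the leftmost digit and the LSB is the rightmost digit.
    """
    if not bits or any(ch not in "01" for ch in bits):
        raise ValueError("expected a non-empty string of 0s and 1s")
    return bits[0], bits[-1]


msb, lsb = msb_and_lsb("1110")
print("MSB =", msb, "LSB =", lsb)
```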

Octal Number System


The octal number system uses 8 digits, 0 through 7. It is used by some computers, and
octal numbers can be obtained by using the conversion techniques. There is a direct
relationship between the octal and binary number systems because the base 8 is a
power of the base 2. Since 8 = 2³, the relationship is 3:1; that is, every digit of
the octal number system can be represented by a three-digit binary number. Table 1.2
gives the binary code for every octal digit.
Using this table, a method for conversion can also be formulated. Consider the
following examples.
Table 1.2: Decimal, Binary and Octal numbers

Example: To convert 11100₂ to its equivalent octal code. The binary number 11100₂ can
be grouped in sets of three digits, starting from the right, as 011₂ and 100₂. From
Table 1.2, binary code 011 corresponds to octal digit 3 and binary code 100
corresponds to octal digit 4.
Therefore, 11100₂ in octal becomes 34₈.

Example: To convert 1101110₂ to its equivalent octal code. The binary number 1101110₂
can be grouped in sets of three digits as 001₂, 101₂ and 110₂. From Table 1.2, binary
code 001 corresponds to octal digit 1, binary code 101 corresponds to octal digit 5 and
binary code 110 corresponds to octal digit 6.
Therefore, 1101110₂ in octal becomes 156₈.
Similarly, the reverse of this method can be applied to obtain the binary code from an
octal number.
Example: To convert 523₈ to its equivalent binary code. From Table 1.2 we see that the
octal digit 5 corresponds to 101 in binary, octal digit 2 corresponds to 010 in binary
and octal digit 3 corresponds to 011 in binary. Thus, 523₈ in binary is 101010011₂.
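
The grouping method used in the examples above can be sketched in Python as follows.
This is only an illustration: the function names binary_to_octal and octal_to_binary
are hypothetical, and numbers are assumed to be passed as strings of digits.

```python
def binary_to_octal(bits: str) -> str:
    """Convert a binary string to octal by grouping digits in threes."""
    # Pad with zeros on the left so the length is a multiple of 3.
    padded = bits.zfill((len(bits) + 2) // 3 * 3)
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    # Each three-digit group maps to exactly one octal digit.
    return "".join(str(int(group, 2)) for group in groups)


def octal_to_binary(octal: str) -> str:
    """Convert an octal string to binary, three binary digits per octal digit."""
    return "".join(format(int(digit, 8), "03b") for digit in octal)


print(binary_to_octal("11100"))    # 34
print(binary_to_octal("1101110"))  # 156
print(octal_to_binary("523"))      # 101010011
```

Note that octal_to_binary always emits three binary digits per octal digit, so the
result may begin with zeros (for example, octal 1 gives 001).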

Hexadecimal Number System


The hexadecimal numbering system has a base value of 16. This means that in the
hexadecimal system there are 16 digits, which can be used to represent any numeric
value. Table 1.3 gives these 16 digits and their decimal equivalents.
Table 1.3: Hexadecimal Digits and their Decimal Equivalent

The difference between the hexadecimal system and the octal or binary system is that
letters of the alphabet (A to F) are also used to represent numeric values, because
our standard numbering system has only 10 digits, that is, 0 to 9.
For conversion, the techniques described in section 1.4.5 can be effectively used. As
in the case of the octal numbering system, a relationship exists between the
hexadecimal and binary numbering systems. The number 16 is a power of 2 (16 = 2⁴).
Therefore, the relationship is 4:1. This means that a group of 4 binary digits can be
used to represent a hexadecimal digit. Table 1.4 shows this relationship, similar to
that of the octal system.
Table 1.4: Decimal, Binary and Hexadecimal Numbers

Using Table 1.4, a conversion technique can be devised as explained in the following
example.

Example: To convert 1101101₂ to its hexadecimal equivalent. 1101101₂ can be broken
down into two groups of four digits as 0110₂ and 1101₂. From Table 1.4, their
hexadecimal equivalents are 6₁₆ and D₁₆ respectively. Therefore, the hexadecimal
equivalent of 1101101₂ is 6D₁₆.
The reverse of this technique can be applied to determine the binary equivalent of a
hexadecimal number.
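
The same grouping idea, with groups of four binary digits, can be sketched in Python.
The names HEX_DIGITS, binary_to_hex and hex_to_binary are assumptions made for this
illustration, and numbers are again passed as strings.

```python
HEX_DIGITS = "0123456789ABCDEF"


def binary_to_hex(bits: str) -> str:
    """Convert a binary string to hexadecimal by grouping digits in fours."""
    # Pad with zeros on the left so the length is a multiple of 4.
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    # Each four-digit group maps to exactly one hexadecimal digit.
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)


def hex_to_binary(hex_number: str) -> str:
    """Convert a hexadecimal string to binary, four binary digits per hex digit."""
    return "".join(format(int(digit, 16), "04b") for digit in hex_number)


print(binary_to_hex("1101101"))  # 6D
print(hex_to_binary("6D"))       # 01101101
```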

Conversion Techniques
There are two methods used most frequently to convert a number from one base to
another. They are called the Remainder Method and the Expansion Method, as explained
below:
1. Remainder Method: This method can be used to convert a decimal number to its
equivalent value in any other base. The following steps are to be followed to carry
out the conversion with the remainder method. Let us assume that the number 14
is to be converted to its binary equivalent. The required base therefore is 2.
(i) Divide the number by the base and note the remainder.
(ii) Divide the quotient by the base and note the remainder.
(iii) Repeat step (ii) until the quotient cannot be divided further, that is, until
the quotient becomes smaller than the base.
(iv) The converted number is the final undivided quotient followed by the sequence
of remainders, starting from the last one generated.
These steps are explained with examples.
Example: To convert decimal number 14 to its binary equivalent.
Step 1: 14 divided by 2; Quotient = 7; Remainder = 0
Step 2: 7 divided by 2; Quotient = 3; Remainder = 1
Step 3: 3 divided by 2; Quotient = 1; Remainder = 1
The final quotient is 1 and the remainders, read from last to first, are 1, 1 and 0.
The binary number therefore becomes 1110₂.
Example: To convert decimal number 14 to its octal equivalent.
Step 1: 14 divided by 8; Quotient = 1; Remainder = 6
The octal number therefore becomes 16₈.
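
The remainder method can be sketched in Python as below. The sketch follows the steps
as stated above (divide until the quotient is smaller than the base, then prefix the
final quotient to the remainders read in reverse). The function name decimal_to_base
and the limit of base 16 are assumptions made for illustration.

```python
DIGITS = "0123456789ABCDEF"


def decimal_to_base(number: int, base: int) -> str:
    """Convert a non-negative decimal integer to the given base (2 to 16)."""
    if number < 0 or not 2 <= base <= 16:
        raise ValueError("number must be >= 0 and base must be between 2 and 16")
    remainders = []
    quotient = number
    # Steps (i)-(iii): keep dividing while the quotient can still be divided.
    while quotient >= base:
        quotient, remainder = divmod(quotient, base)
        remainders.append(DIGITS[remainder])
    # Step (iv): final quotient first, then the remainders from last to first.
    return DIGITS[quotient] + "".join(reversed(remainders))


print(decimal_to_base(14, 2))  # 1110
print(decimal_to_base(14, 8))  # 16
```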

2. Expansion Method: This method can be applied to convert a number in any base to
its equivalent in base 10. To understand how this method is carried out, let us
convert the binary number 1001 to its equivalent decimal value.
Example:
1001₂ = 1 × 2³ + 0 × 2² + 0 × 2¹ + 1 × 2⁰
= 8 + 0 + 0 + 1 = 9
The following observations are to be made from the above example:
(a) Each digit of the original number multiplies one term of the expansion. The
leftmost digit, 1, multiplies the leftmost term, followed by the other digits
0, 0 and 1, exactly in the order in which the digits are placed in the original
binary number.
(b) During expansion, the base of the number is raised to a power that starts at
0 for the rightmost digit of the binary number and is incremented by one for
every digit to its left.
(c) The result obtained by applying this method is a decimal number.
Using the above guidelines, any binary number can be converted to its
equivalent decimal number.
Example:
1110₂ = 1 × 2³ + 1 × 2² + 1 × 2¹ + 0 × 2⁰
= 8 + 4 + 2 + 0
= 14
Example:
11010₂ = 1 × 2⁴ + 1 × 2³ + 0 × 2² + 1 × 2¹ + 0 × 2⁰
= 16 + 8 + 0 + 2 + 0
= 26
To convert a number of any other base to base 10, a minor modification has to be
made to the steps above. For instance, if an octal number has to be converted
to its decimal equivalent, the base 2 in the above steps has to be changed to 8.
Example:
1101₈ = 1 × 8³ + 1 × 8² + 0 × 8¹ + 1 × 8⁰
= 512 + 64 + 0 + 1 = 577
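
The expansion method can be sketched in Python as follows. The function name
to_decimal is hypothetical, and the number is assumed to be given as a string of
digits (0 to 9 and A to F) together with its base.

```python
def to_decimal(digits: str, base: int) -> int:
    """Convert a number written in the given base (2 to 16) to decimal."""
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    total = 0
    # The rightmost digit is multiplied by base**0, the next by base**1, and so on.
    for power, char in enumerate(reversed(digits)):
        value = int(char, 16)  # numeric value of a single digit, 0-9 or A-F
        if value >= base:
            raise ValueError(f"digit {char!r} is not valid in base {base}")
        total += value * base ** power
    return total


print(to_decimal("1001", 2))   # 9
print(to_decimal("11010", 2))  # 26
print(to_decimal("1101", 8))   # 577
```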

Performance Factors
In today's world of easily available technology, it is not difficult for anyone to get
a computer system. Be it an organisation or an individual, everyone is becoming
increasingly dependent on computers. The capability of these machines to solve
almost any type of problem and to make any process more efficient has made them
very popular. Over time, the one thing that has grown exponentially in these systems
is their performance.
Performance is what really makes these computers sell. In this part we will briefly
cover certain factors that help us judge the performance of a computer.

MIPS
MIPS stands for million instructions per second. It measures the number of machine
instructions a machine can execute in one second, so it serves as a measure of the
computer's speed and power. However, there is no standard way of measuring MIPS,
since different instructions take different amounts of time. For example, Pentium-based
systems run at 100 MIPS.
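
A MIPS rating is usually worked out as the number of instructions executed divided by
the execution time in seconds, divided again by one million. The sketch below shows
the arithmetic; the function name mips and the sample figures are hypothetical and
are not measurements of any real machine.

```python
def mips(instruction_count: int, execution_time_seconds: float) -> float:
    """Return millions of instructions executed per second."""
    if execution_time_seconds <= 0:
        raise ValueError("execution time must be positive")
    return instruction_count / (execution_time_seconds * 1_000_000)


# Hypothetical program: 200 million instructions executed in 2 seconds.
print(mips(200_000_000, 2.0))  # 100.0
```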

Clock Speed
Clock speed is the rate at which the microprocessor executes instructions. Every
computer has an internal clock that regulates the rate at which instructions are
executed and synchronizes the other computer components. Clock speed is measured in
megahertz (MHz).

Bus Architecture

Performance also depends upon the bus architecture of a given system. The size (width)
of the bus determines how much data can be transferred through the bus at one time.
For example, a system can have a 16-bit or a 32-bit bus. Other things being equal, a
32-bit bus moves twice as much data per transfer as a 16-bit bus and is therefore
faster.

FLOPS
FLOPS stands for floating-point operations per second. It is a benchmark measurement
of the speed of the microprocessor. Any operation that involves fractional numbers is
called a floating-point operation.

Student Activity
1. Convert a binary number 1001 to its equivalent decimal value.
2. Write down at least 15 shortcut keys available in each of Excel, Word and
PowerPoint.

Summary
The computer can be defined as a digital electronic device that manipulates data. The
evolution of computers is explained in terms of generations of computers.
Computers in each generation have better capabilities and features than those in the
previous generations. There are four components of the computer: Input, Storage,
Processing and Output. The basic architecture of the computer has modules that
realize these components. Besides the everyday decimal system, we discussed three
number systems: the binary, octal and hexadecimal number systems. The computer
follows the binary number system.

Keywords
CPU: Central Processing Unit.
RAM: Random Access Memory.
MIPS: Million Instructions Per Second.
FLOPS: Floating Point Operations Per Second.

Review Questions
1. Differentiate between minicomputer and microcomputer.
2. What do you understand by the fourth generation of computer systems?
3. The basic architecture of a computer system can be divided into four major
components. What are they?
4. What is the difference between primary and secondary storage devices?
5. List the three steps a computer follows to complete an arithmetic operation.

Further Readings
Hamacher, V., et al., Computer Organization, 4th ed., McGraw Hill, 1996
Hennessy, J.L., Patterson, D.A., Computer Organization and Design: The
Hardware/Software Interface, Morgan Kaufmann, 1994
Hennessy, J.L., Patterson, D.A., Computer Architecture: A Quantitative Approach,
Morgan Kaufmann, 1996
Mano, M.M., Computer System and Architecture, Prentice Hall of India, New Delhi, 1994
Stallings, W., Computer Organization and Architecture, 2nd ed., Prentice Hall of India,
New Delhi

Unit 2 Introduction of MS Word

Unit Structure
• Introduction
• New in Microsoft Word 2000
• Starting of MS Word
• Components of the MS Word Window
• Start Working with Word Document
• File-related Operations
• Opening a New File
• Save a Document
• Reversing and Reapplying Commands
• Summary
• Keywords
• Review Questions
• Further Readings

Learning Objectives
At the conclusion of this unit, you will be able to:
• Know the functionalities of Word 2000
• Explain the new features in Word 2000
• Discuss Word 2000 with the Web

Introduction
Microsoft Word 2000 is a highly sophisticated word-processing application included in
Microsoft's Office 2000 suite. It is the newest version of MS-Word available at this
time. Among others, the following are the functionalities of MS-Word:
• Efficient mode of text editing
• Cut, Copy and Paste
• Undo and Redo
• Search and Replace Text
• Justification, Indentation, etc.
• Pagination
• Spell Checking
• Importing/Exporting Text
• Mail Merging
• Tables
• Print previewing of text
• A variety of font styles and font sizes
• Graphical Drawing
• Document Template
• Document Wizard

MS-Word can be used to write:
• Letters
• Theses
• Newsletters
• Resumes
• Applications
• Books
• World Wide Web Pages

We will discuss its features and working, as well as the new features of Microsoft
Word 2000, for those readers who are already familiar with earlier versions of
MS-Word (i.e. MS-Word 95 and 97).

New in Microsoft Word 2000


Compared to Word 97, Word 2000 offers profound changes in a few important areas,
and minor refinements in many other areas. The major changes are as follows:
Better Web editing support. Previous editions of MS-Word were limited in their
capabilities to create and save Web pages. For instance, in Word 97, users were
required to work in a separate Web Page Authoring Environment that did not
support many accustomed features. In contrast, Word 2000 users can use nearly every
feature of the program, confident that they can convert back and forth between HTML
and binary DOC formats without trouble and that their Web documents will display
properly as long as they use a recent browser such as Internet Explorer or Netscape
Navigator.
You can actually choose to use HTML as your standard document format, without
sacrificing the sophisticated document formatting features you've come to expect.
New Web-based collaboration tools. Word and Office now include tools you can use
with intranets to schedule meetings and hold Web discussions about your documents,
thereby arriving at consensus more quickly.
Better tools for managing Word. In Word 2000 and Office 2000, Microsoft has focused
on making software easier to administer. The installation process is highly
customizable, and you can choose to install many features on first use. This means
the features aren't actually installed until the first time the user wants to run them.
MS-Word then searches for the appropriate files in the original installation location
and installs the features automatically. As a result, for many users, Word (and Office)
will typically require significantly less hard drive space.
Both individual users and administrators will benefit from MS-Word's new Detect
and Repair feature, which enables MS-Word to fix damaged files by itself. Microsoft's
new Custom Installation Wizard makes it much easier to customize installations
across a network, and the Office Profile Wizard makes it much easier to standardize
settings and distribute those settings company-wide.
Better user personalization. Because many users find MS-Word's interface quite
complex, MS-Word 2000 introduces personalized toolbars and menus. When you first
display MS-Word, you see abbreviated menus that contain only those commands
Microsoft expects you to use most. If you select a menu and pause for a moment, the
remaining commands appear. As you use additional commands, they become part of
the set that always appears. Word also displays abbreviated versions of the Standard
and Formatting toolbars on the same line, including only those buttons Microsoft
expects you to use most. You can add buttons directly from the toolbar, rather than
use MS-Word's traditional customisation tools.
Better support for international users and documents. For the first time, Word is built
on a single code base. A separate language resources component customizes
MS-Word's user interface for different languages. With the right language resource
files, you can change your copy of Word to display a foreign language user interface.
Perhaps more valuable, MS-Word now permits you to edit and proof documents in
multiple languages.

• An improved help system based on Microsoft's new HTML Help technology, and
new Office Assistant characters, such as Rocky the dog.
• Somewhat smarter IntelliSense automated features, including an improved
AutoCorrect feature that fixes many more spelling mistakes automatically.
• A new Collect and Paste feature (and Clipboard toolbar) that makes it easier to copy
multiple elements into the Clipboard and paste them together into one location.
• Improved spelling dictionary, thesaurus, and grammar checking tools.
• New Open and Save dialog boxes that make it easier to access and store
documents quickly.
• More sophisticated table formatting capabilities, including diagonal lines, and
nested tables for Web pages.
• Click and Type, which enables you to double-click anywhere on a page and type
there, even if there's no existing text anywhere in sight.
• A new Themes feature that enables users to change the entire look of a Web (or
other) document quickly.
• More flexible printing features, including easy zooming to different paper sizes
and printing multiple pages on a single sheet.
• VBScript scripting for building Web pages with automated features.
• Somewhat more effective protection against macro viruses, including support for
authenticated trusted sources, but still no built-in virus detection features.
• Finally, according to Microsoft, Word and the rest of Office are now fully Y2K
compliant.

Word 2000 and the Web


MS-Word 2000 represents the next evolutionary step in incorporating Web technology
into an Office product, and Microsoft has done a lot of work under the hood to tune
up the Web capabilities of this product. Some of the new Web-related capabilities in
Word 2000 include:
• A more robust translation to Web pages
• An improved wizard for building Web pages or even small Web sites
• An option to build Web pages within Word, without switching to a separate
working environment
• A Web Page Preview, similar to Print Preview

By far the most important change is the improvement to the HTML format. Microsoft
has cleverly incorporated a wide variety of new Web technologies and languages into
its Web page format. This new alphabet soup of technologies includes:
• XML (Extensible Markup Language)
• CSS (Cascading Style Sheets)
• VML (Vector Markup Language)
• JavaScript and VBScript (script support)

You don't need to know much about how to use any of these technologies to
build or edit Web pages in Word 2000. The main effect of adding these technologies is
to enhance the capabilities of browsers to display the data, improve formatting and
increase the scope of graphical object types that can be included in Web pages.
What this means is that, in many situations, it doesn't matter anymore whether you
save a document as a .doc binary file or a Web page. Either file format looks the same
in both MS-Word 2000 and a properly equipped browser. All the information
normally contained in the .doc binary file format is also included in the Web page
format and vice versa. This interchangeability of file formats is called round tripping
by Microsoft. You can use any Web page created in Word 2000 to completely
regenerate the binary .doc format. This was not possible with Word 97.
In MS-Word 2000, creating a Web page is no different from creating a Word
document. You do not need to open a special environment.

Starting of MS Word
Microsoft Word is a Windows-based word-processing application. It can be started on
a computer where it is already installed as follows:
• Click on the Start button on the Taskbar.
• Then click on the Programs option in the Start menu.
• After that, click on the Microsoft Word option, identified by its icon.

This command will launch Microsoft Word 2000 on your computer, which will
have a typical look as shown below:

The Microsoft Word window opens with a new document.

Components of the MS Word Window


Components of the Microsoft Word window are:
• Title or Caption Bar
• Menu Bar
• Tool Bars
• Ruler
• Cursor
• Status Bar
• Scroll Bars
• Document Navigator

Function of the Various Components of Microsoft Word Window
1. Title Bar: The title bar shows the name of the document and is situated at the top
of the application window.
2. Menu Bar: The menu bar contains the various commands, grouped under topics, that
perform specific tasks. The menu bar is located under the title bar.
3. Toolbar(s): A toolbar is a group of graphical shortcut buttons for executing menu
options/commands in an easier and faster way. Toolbars generally appear below the
menu bar but can be placed anywhere, as we shall see later.
4. Ruler: The window is supplied with a horizontal ruler along the top of the
document and a vertical ruler along the left. Rulers can be used to set margins and
indents easily, and they also provide measurements for page formatting.
5. Cursor: The cursor is the MS-Word pointer, which tells where in the document the
action that you choose will appear or take effect. The cursor can be moved and
placed anywhere in the document using a pointing device such as a mouse.
6. Status Bar: This bar displays the position of the cursor, the status of some
important keys of the keyboard, the message for a toolbar button when the mouse
points to it, the message for a menu option when it is selected or pointed to by
the user, and other relevant information. It is located at the bottom of the
window.
7. Scroll Bars: Scroll bars are sliders that can be moved using the mouse. As the
scroll bar is moved, the window pans through the document, exposing different
regions of the document. There are two types of scroll bars:
(i) Horizontal Scrollbar
(ii) Vertical Scrollbar
The horizontal scroll bar is placed at the bottom of the document while the vertical
scroll bar is placed at the right.
8. View Buttons: View buttons are shortcuts to the various views in the View menu,
placed adjacent to the horizontal scroll bar. These buttons select different ways
in which the document can be viewed, as we shall see later.
9. Document Navigator: The Document Navigator allows navigating the document by
different types of objects and is activated by clicking the ball-type button on
the vertical scroll bar.
10. Office Assistant: The Office Assistant provides online help and real-time tips
while you work with MS-Word.

Start Working with Word Document


When you open Word, a blank document opens. You can start typing in it directly.
But for performing various other activities, you need to learn about the different
menus in the menu bar and the options under these menus. You can select a menu with
the mouse, or you can select it through its hotkey. The hotkey of each menu is shown
as an underlined letter in its name. Press Alt together with the underlined letter
(unless specified otherwise) to open the menu. For example, if you wish to select the
File menu, press Alt+F, or press Alt+V to select the View menu. Before going on to
the menus, let's see how to move within the document using various keys.

Movement of the cursor using keyboard

Movement of the Cursor                        Keys
Left or right one character                   ← or →
Left or right one word                        Ctrl+← or Ctrl+→
Up or down one line                           ↑ or ↓
Up or down one paragraph                      Ctrl+↑ or Ctrl+↓
To the start or end of a line                 Home or End
Up or down one screen                         Page Up or Page Down
To the top or bottom of the current screen    Alt+Ctrl+Page Up or Alt+Ctrl+Page Down
To the start or end of the document           Ctrl+Home or Ctrl+End

File-related Operations
File (or document) related operations can be done through the File menu. The
different file operations are:
• Opening a new file
• Opening an existing file
• Saving an opened file
• Closing an opened file
• Examining the print preview of an opened file
• Printing an opened file
• Setting the page setup of the file
• Saving a different copy of a file
• Exiting from Word

The File menu is shown in the following MS-Word window.

Opening a New File


Click on the File menu.
Click on the New option, or press Ctrl+N on the keyboard, or click the New icon on
the toolbar.
Select Blank Document from the General tab of the dialog box and then press the
OK button.

Note that many types of pre-designed documents are available in the dialog
window above.

Blank Document
Start with a blank document when you want to create a traditional printed document.

Web Page
Use a Web page when you want to display the document's contents on an
intranet or the Internet in a Web browser. A Web page opens in Web layout view.
Web pages are saved in HTML format, i.e. as a file with the .htm (or .html) extension.

E-mail Messages
If you use Outlook 2000 or Outlook Express, use an e-mail message when you want to
compose and send a message or a document to others directly from Word. An e-mail
message includes an e-mail envelope toolbar so that you can fill in the recipient
names and subject of the message, set message properties, and then send the message.

Templates
Use a template when you want to reuse boilerplate text, custom toolbars, macros,
shortcut keys, styles, and AutoText entries.

Save a Document
To save a document:
Click on the File->Save option
OR
Press Ctrl+S
OR
Press the Save tool on the standard toolbar

If you are saving the file for the first time, the Save As window will appear.
Choose the appropriate folder from the Save in combo box.
Write the proper file name in the File name text box.
Then press the Save button.
Note that if you close an opened document without saving its latest content,
MS-Word duly prompts you with the options of saving or not saving the document.
You can select the appropriate action from these options as well.

The Save and Save As options both save a document. The difference between them is
that the Save As command allows the user to save a file under a different name and
format, whereas the Save option saves the document under the same name and format
with which it was first saved.

Opening of an Existing Document


To open an existing document:
Select the Open option from the File menu
OR
Press Ctrl+O
OR
Click on the Open tool on the standard toolbar

The Open dialog box will then appear, as shown in the figure.
1. Select the appropriate folder from the Look in combo box.
2. Select the required file from the file window
   Or
   write the required file's name in the File name box.
3. Click on the Open button on the right-hand side
   Or
   press Enter.

Closing of Document
To close an already opened document, just choose the Close option from the File menu.
Keep in mind that only the current window or document will close, since
Microsoft Word works in an MDI (Multiple Document Interface) environment, unlike
Notepad, which works in SDI (Single Document Interface).

Save as Web Page


The Save as Web Page option saves the Word document in Web page format, i.e. in
HTML format with the extension .htm. You can then view it in a Web browser.

Version
A document can be saved in different versions with the Version option in the File
menu. The steps for saving different versions of a Word document are as follows:
1. Click on the Version option of the File menu; the following Versions window
will appear.
2. Click on the Versions window; the following window will appear.
3. Write the comments for the version, for example Ist version or IInd version.
4. Then press OK.
Make some changes in the document and repeat steps 1 to 4. These steps will
save your document in different versions. To see the difference between the
versions, do the following:
Again choose the Version option from the File menu.
This time the window will look like:

Here you can see the two different versions of the same document. The last version,
i.e. the IInd version, is your current version. To compare:
1. Click on Ist version
2. Click on Open

Now you can see two windows, one for each version, and you can compare the text in
the two documents.

Web Page Preview


Web Page Preview invokes your default Web browser, for example Internet
Explorer. The document is shown in the browser in Web page format.

Page Setup
From the Page Setup option you can set up the page layout (margins, etc.). To use
the Page Setup option, perform the following steps:
1. Click on the Page Setup option in the File menu; the following Page Setup
window will appear.
2. Adjust the different margins or apply different options from the Margins tab,
where:
(i) In Top, enter the distance you want between the top of the page and
the top of the first line on the page.
(ii) In Bottom, enter the distance you want between the bottom of the
page and the bottom of the last line on the page.
(iii) In Left, enter the distance you want between the left edge of the page
and the left edge of unindented lines.
(iv) In Right, enter the distance you want between the right edge of the
page and the right end of a line with no right indent.
(v) In Gutter, enter the amount of extra space you want to add to the
margin for binding. Word adds the extra space to the left margin of all pages
if you clear the Mirror margins check box, or to the inside margin of all pages
if you select the Mirror margins check box.
(vi) In Header, under the From edge frame, enter the distance you want from
the top edge of the paper to the top edge of the header. If the Header setting is
larger than the Top setting, Word prints the body text below the header.
(vii) In Footer, under the From edge frame, enter the distance you want from the
bottom edge of the paper to the bottom edge of the footer. If the Footer setting
is larger than the Bottom setting, Word stops printing the body text above the
footer.
(viii) Select the Mirror margins check box to adjust the left and right margins so
that, when you print on both sides of the page, the inside margins of facing pages
are the same width and the outside margins are the same width.
(ix) Select the 2 pages per sheet check box to print the second page of a document
on the same sheet as the first page. This check box is used when the printed page
is folded in half with the two pages on the inside. The outer margins of the page
will be the same width, and the inner margins will be the same width.
(x) In the Apply to list box, click the portion of the document to which you want
to apply the current settings in the Page Setup dialog box. The options in this
list box include Whole document, This point forward, etc., which can be chosen
according to the situation.
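
The effect of the margin and gutter settings on the space left for body text can be
worked out with simple arithmetic. The Python sketch below is only an illustration:
the class name PageSetup, its field names and the sample measurements (in inches) are
assumptions, and it ignores the header and footer adjustments described in (vi) and
(vii).

```python
from dataclasses import dataclass


@dataclass
class PageSetup:
    """Hypothetical container for page measurements, all in the same unit."""
    paper_width: float
    paper_height: float
    top: float
    bottom: float
    left: float
    right: float
    gutter: float = 0.0

    def text_width(self) -> float:
        # The gutter is extra binding space taken out of the line width.
        return self.paper_width - self.left - self.right - self.gutter

    def text_height(self) -> float:
        return self.paper_height - self.top - self.bottom


# Sample values: an 8.5 x 11 inch page with a 0.5 inch gutter.
setup = PageSetup(paper_width=8.5, paper_height=11.0,
                  top=1.0, bottom=1.0, left=1.25, right=1.25, gutter=0.5)
print(setup.text_width(), setup.text_height())  # 5.5 9.0
```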
From the Paper size tab you can set the length and width of the page. When you
click on the Paper size tab the following window will appear.

In the above window you can adjust the following:
1. From the Paper size list box you can select one of the predefined paper sizes.
2. From the Width and Height text boxes a custom paper size can be defined by
adjusting the height and width of the paper.
3. Select the orientation of the paper, Portrait or Landscape, from the Orientation
frame. Portrait orientation is length-wise while Landscape orientation is
width-wise on the page.

From the Paper source tab you can select the source of paper, that is, from where
the paper is fed into the printer. Clicking on the Paper source tab displays the
following window:

In the above window you can adjust the source of the first page and of the other pages.

Print Option
To take a printout, select the Print option from the File menu. The window given
below will appear.


You can set various options before taking a printout.

1. From the Name combo box you can select the printer, if more than one printer is
installed.
2. From the Page range frame you can select the range of pages to print, i.e. all
pages, the current page, or the specific pages you require.
3. From the Print what option you can choose which part of the document you
want to print, i.e. the whole document, its comments, or anything else.
4. From the Print option you can select which pages to print, i.e. all pages, odd
pages, or even pages.
5. You can choose the number of copies from the Number of copies option under the
Copies frame.
6. From the Pages per sheet option under the Zoom frame you can select the number
of pages of the document that you want to print on each sheet of paper.
7. From the Scale to paper size option you can select the paper size on which you
want to print the document. For example, you can specify that a B4-size document
be printed on A4-size paper by decreasing the size of the font and the graphics.
This feature is similar to the reduce/enlarge feature on a photocopy machine.
8. Selecting the Collate check box prints the copies of the document in proper
binding order.

Send To Option
From the Send To option you can send the document to various recipients
(applications or otherwise) through technologies such as email, fax, etc.
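
The effect of the Collate check box described above can be illustrated with a small sketch. This is only a model of the printing order, not how Word itself is implemented; the function name print_order and its parameters are hypothetical.

```python
def print_order(pages, copies, collate=True):
    """Return the page numbers in the order they would leave the printer.

    Hypothetical helper: 'pages' is the number of pages in the document,
    'copies' is the number of copies requested.
    """
    if collate:
        # One complete copy after another (ready for binding).
        return [page for _ in range(copies) for page in range(1, pages + 1)]
    # All copies of page 1, then all copies of page 2, and so on.
    return [page for page in range(1, pages + 1) for _ in range(copies)]


print(print_order(3, 2, collate=True))   # [1, 2, 3, 1, 2, 3]
print(print_order(3, 2, collate=False))  # [1, 1, 2, 2, 3, 3]
```

With collation on, each copy comes out complete, which is why it is described as the proper binding order.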

Properties Option
From the Properties option in the File menu you can set the various properties of the
document. When you click on the Properties option you will see the following
window:


From this window you can set or see the various properties of the document. You can
set the author name, company name, title of the document and various other
properties. From the other tabs you can see further properties of the document,
such as the following.
1. From the General tab you can see the type of document, the file size, the path
where the file is saved, the date the document was created, the date it was last
modified, etc.
2. From the Statistics tab you can see various statistics of the document, such as the
number of words, number of characters, number of lines, etc.


Below the Properties option there is a list of the most recently opened or created
files. The number of files listed can be set by the user. By default it is 4.
All the necessary editing commands have been grouped into the Edit menu or Edit
toolbar. See the Edit menu given below.

Block Operations
There may be situations when you have to perform a task on a single character.
That is very simple: just put the cursor on the character and perform the task.
But consider the situation when you have to perform the same action on a
group of words. Let's take the example of deleting a whole paragraph. To delete the
paragraph you can press the <Del> key repeatedly until the text of the entire
paragraph is deleted. However, as you will realise, this is a very tedious way of
deleting multiple characters. A better solution is to make a block of the paragraph by
selecting it and then delete it in one go. Any general method of text selection may be
employed to select the text block. The selected block appears in reverse video (white
text on a black background). While a block is selected, if a character key is pressed
the entire selection is replaced by the character keyed in. If the <Del> key is pressed
instead, the selected block is deleted without being replaced by any character.
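
The two behaviours of a selected block can be sketched in a few lines of Python. This is a simplified model that treats the document as a plain string and the selection as a pair of positions; the function names type_key and press_delete are hypothetical.

```python
def type_key(text, start, end, key):
    """Replace the selected block text[start:end] with the typed character."""
    return text[:start] + key + text[end:]


def press_delete(text, start, end):
    """Delete the selected block text[start:end] without inserting anything."""
    return text[:start] + text[end:]


line = "The quick brown fox"
# Select the word "quick": characters 4 up to (but not including) 9.
print(type_key(line, 4, 9, "X"))   # The X brown fox
print(press_delete(line, 4, 9))    # The  brown fox
```

Either way the whole block is handled in one operation, instead of one key press per character.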

Making Block of the Text


Making a block of text is also known as selecting the text. Text can be selected
in the following ways.

1. Selecting a word
2. Selecting a line
3. Selecting the complete paragraph
4. Selecting the whole document

To select a word, perform the following steps:

1. Place the mouse pointer on the word.
2. Double-click the word.

Or

1. Place the cursor before the word.
2. Press the left mouse button and drag to the end of the word.

Or

Place the cursor before the word and then press Ctrl+Shift+Right Arrow.

To select a line

1. Move the mouse pointer to the selection bar. There the mouse pointer changes
to an arrow pointing to the right, opposite to its usual direction.
2. Click the left mouse button once.

Or

1. Place the cursor in front of the first character of the line.
2. Press Shift+End.

[Figure: Selection bar]

To select a paragraph

Move the mouse pointer to the selection bar and double-click. The entire paragraph
will be selected.
Or
Place the cursor before the first character of the paragraph
Press Ctrl+Shift+Down Arrow

Reversing and Reapplying Commands


Sometimes, after making certain changes, you may want to cancel those changes.
MS-Word provides a command for undoing whatever was done in the previous step.

1. Undo

Click on the Undo option under the Edit menu
Or
Click on the Undo button on the Standard toolbar
Or
Press Ctrl+Z.

The Undo list displays the recent actions that MS-Word can undo. You
can select from this list the action to be undone.
Similarly, after undoing certain changes, you may want to reapply the
action that was undone. MS-Word provides a command for redoing whatever was
undone in the previous step.

2. Redo

Reapplying the last action that was undone is known as Redo. To perform a
redo you can
Click on the Redo option under the Edit menu
Or
Click on the Redo button on the Standard toolbar
Or
Press Ctrl+Y
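
A common way to model Undo and Redo is with two stacks: one holding earlier states that can be restored, the other holding states that were undone and can be reapplied. The sketch below assumes that model; it is not a description of how MS-Word stores its history, and the class name History is hypothetical.

```python
class History:
    """Minimal undo/redo model that stores whole-text snapshots."""

    def __init__(self, text=""):
        self.text = text
        self._undo_stack = []  # states we can go back to
        self._redo_stack = []  # states we can reapply after an undo

    def edit(self, new_text):
        """Record a change; a fresh edit discards anything left to redo."""
        self._undo_stack.append(self.text)
        self._redo_stack.clear()
        self.text = new_text

    def undo(self):
        if self._undo_stack:
            self._redo_stack.append(self.text)
            self.text = self._undo_stack.pop()

    def redo(self):
        if self._redo_stack:
            self._undo_stack.append(self.text)
            self.text = self._redo_stack.pop()


doc = History("Dear Sir")
doc.edit("Dear Sir,")
doc.undo()
print(doc.text)  # Dear Sir
doc.redo()
print(doc.text)  # Dear Sir,
```

Note that Redo is only meaningful immediately after an Undo: as soon as a new edit is made, the redo stack is emptied.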

Cut and Paste


The Cut operation removes the selected object or objects from the present document
and puts them into a memory area called the clipboard. The objects may be text, a
picture, a table, or virtually anything else that is available in the document. Paste
means copying the contents of the clipboard to the specified location in the present
document.

To cut text or an object


Select the text(s) or object(s). Text and objects can be selected in various ways:

Any amount of text: Drag over the text.
A word: Double-click the word.
A graphic: Click the graphic.
A line of text: Move the pointer to the left of the line until it changes to a right-pointing arrow, and then click.
Multiple lines of text: Move the pointer to the left of the lines until it changes to a right-pointing arrow, and then drag up or down.
A sentence: Hold down CTRL, and then click anywhere in the sentence.
A paragraph: Move the pointer to the left of the paragraph until it changes to a right-pointing arrow, and then double-click. Or triple-click anywhere in the paragraph.
Multiple paragraphs: Move the pointer to the left of the paragraphs until it changes to a right-pointing arrow, and then double-click and drag up or down.
A large block of text: Click at the start of the selection, scroll to the end of the selection, and then hold down SHIFT and click.
An entire document: Move the pointer to the left of any document text until it changes to a right-pointing arrow, and then triple-click.
Headers and footers: In normal view, click Header and Footer on the View menu; in print layout view, double-click the dimmed header or footer text. Move the pointer to the left of the header or footer until it changes to a right-pointing arrow, and then triple-click.
Comments, footnotes, and endnotes: Click in the pane, move the pointer to the left of the text until it changes to a right-pointing arrow, and then triple-click.
A vertical block of text (except within a table cell): Hold down ALT, and then drag over the text.


Select Cut from the Edit menu
Or
Click on the Cut button on the Standard toolbar
Or
Press Ctrl+X
Or
Move the mouse pointer onto the selected text, right-click, and select Cut, as shown
in the figure.

To paste the selection


Move the cursor to the location where the text is to be pasted
Select Paste from the Edit menu
Or
Click the Paste button on the Standard toolbar
Or
Press Ctrl+V
Or
Click the right mouse button and select Paste from the context menu, as shown in
the figure.

Copy and Paste


Copying means duplicating the contents of the document at some other desired place.
The procedure for copying text is almost the same as that for moving text, with the
difference that cutting removes the object(s) from the original place whereas
copying leaves them as they are.
To copy a particular text
Select the text or make a block of the text
Select Copy from the Edit menu
Or
Click on the Copy button on the Standard toolbar
Or
Press Ctrl+C
Or
Move the mouse pointer onto the selected text, right-click, and select Copy, as shown
in the figure.

The Paste operation is the same as described in the Cut and Paste section.
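
The difference between Cut and Copy, and the role of the clipboard, can be summarised in a short sketch. It assumes a simplified model in which the document is a string and the clipboard is just a variable; the function names cut, copy and paste are hypothetical.

```python
def cut(text, start, end):
    """Remove text[start:end] and return (remaining text, clipboard)."""
    return text[:start] + text[end:], text[start:end]


def copy(text, start, end):
    """Leave the text unchanged and return (text, clipboard)."""
    return text, text[start:end]


def paste(text, position, clipboard):
    """Insert the clipboard contents at the given position."""
    return text[:position] + clipboard + text[position:]


original = "one two three"

# Cut "two " (characters 4 to 8) and paste it at the start: the text is moved.
remaining, clipboard = cut(original, 4, 8)
print(paste(remaining, 0, clipboard))   # two one three

# Copy the same block and paste it at the start: the text is duplicated.
unchanged, clipboard = copy(original, 4, 8)
print(paste(unchanged, 0, clipboard))   # two one two three
```

In both cases Paste behaves identically; only the state of the original text differs.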

Find, Replace and Go To Options


Find
Sometimes, while working with a document, you need to find a particular text, a
format, a special character, a page number, a section number, a comment, etc. Find
and its associated commands allow you to do just that.

1. To find a particular text

(i) On the Edit menu, click Find. The Find and Replace window will pop up.
(ii) In the Find what box, enter the text that you want to search for.
(iii) Select the direction of the search from the Search list box.
(iv) Select any other options that you want from the following:
Match case: To find only text whose upper-case and lower-case letters match
those typed. With this option on, f will not match F.
Find whole words only: To find only occurrences that form a word by themselves
and are not part of another word.
Use wildcards: To use the wildcard characters (? or *) in the Find what
text box. These two characters stand for any character in a comparison, hence
they are called wildcards. Whereas ? matches any one character, * matches
a string of characters.
Sounds like: To find words that sound similar but are spelled differently. For
example, hair, heir, hear and hare sound similar to here.
Find all word forms: To find all grammatical forms of the word. For example,
on entering the word eat it also finds ate, eaten and eating.
(v) For Help on an option, click the question mark and then click the option.
(vi) Click Find Next to proceed.
(vii) MS-Word starts searching in the specified direction from the current cursor
position and stops at the first match found. If you want to continue searching
the rest of the document, click the Find Next button again.
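
The Match case, Find whole words only and Use wildcards options described above can be approximated with regular expressions. The sketch below uses Python's standard re module and is only an illustration of the ideas; it is not how MS-Word implements Find, and the function name build_pattern and its parameters are hypothetical. In particular, it lets the options be combined freely, which is an assumption of this sketch.

```python
import re


def build_pattern(find_what, match_case=False, whole_words=False,
                  use_wildcards=False):
    """Translate Find options into a compiled regular expression."""
    if use_wildcards:
        parts = []
        for ch in find_what:
            if ch == "?":
                parts.append(".")      # ? stands for any one character
            elif ch == "*":
                parts.append(".*?")    # * stands for a string of characters
            else:
                parts.append(re.escape(ch))
        body = "".join(parts)
    else:
        body = re.escape(find_what)
    if whole_words:
        body = r"\b" + body + r"\b"    # must not be part of a larger word
    flags = 0 if match_case else re.IGNORECASE
    return re.compile(body, flags)


print(build_pattern("f", match_case=True).search("F"))                  # None
print(build_pattern("eat", whole_words=True).findall("eat eaten great"))  # ['eat']
print(build_pattern("h?t", use_wildcards=True).findall("hat hot hut heat"))  # ['hat', 'hot', 'hut']
```

The Sounds like and Find all word forms options need linguistic knowledge (pronunciation and grammar) and are therefore left out of this sketch.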
2. To find a particular text with a particular format

i. On the Edit menu, click Find. The Find and Replace window will pop up on the
screen.
ii. Do one of the following: To search for text with specific formatting, enter the
text in the Find what box.
To search for specific formatting only, delete any text in the Find what box.
iii. If you don't see the Format button, click More.
iv. If you want to clear the specified formatting, click No Formatting.
v. Click Format, and then select the formats you want.
vi. Click Find Next.

Replace
If you have to replace a word in the document with another word, you can use the
Find and Replace command to do that. Find has been discussed above; now it is the
turn of Replace.

1. On the Edit menu, click Replace. You will see the following window.
2. In the Find what box, enter the text that you want to search for.
3. In the Replace with box, enter the replacement text.
4. Select any other options that you want.
5. For Help on an option, click the question mark and then click the option.
6. Click Find Next, Replace, or Replace All.
7. To cancel a search in progress, press ESC.


Go to
To go to a particular location or a particular item, use the Go To option under the
Edit menu. The steps are as follows.
1. Click on the Go To option under the Edit menu. The following window will
appear on the screen.
2. From the Go to what combo box, select the kind of item you want to navigate by.
3. Enter the parameter, such as the page number or the name of the bookmark.
4. Click on Previous or Next depending upon the direction you want to go.

You can also use the Document Navigator to move around the document. The browse
methods on the Document Navigator include
1. Go To method
2. Find and Replace method

The Document Navigator can be invoked by clicking the 3-D ball on the vertical
scrollbar.

The browse objects on the Document Navigator include

Edits: Pointer moves to the next and previous of the last three edits.
Heading: Pointer moves to the next and previous headings.
Graphics: Pointer moves to the next and previous graphics.
Tables: Pointer moves to the next and previous tables.
Fields: Pointer moves to the next and previous fields.
Endnotes: Pointer moves to the next and previous endnotes.
Footnotes: Pointer moves to the next and previous footnotes.
Comments: Pointer moves to the next and previous comments.
Sections: Pointer moves to the next and previous sections.
Pages: Pointer moves to the next and previous pages.

MS-Word provides a wide variety of views for seeing your document in different ways.
Various commands under the View menu facilitate this. To access the View menu,
click on View or press Alt+V on the keyboard.

MS-Word 2000 provides the following options to view a document in different styles.
1. Normal
2. Web Layout
3. Print Layout
4. Outline
5. Full Screen

Normal View
In Normal view only the horizontal ruler is displayed, not both the horizontal
and vertical rulers. It does not display the margin areas of the page; that is why you
cannot see the headers and footers.


The advantage of Normal view is that you can differentiate between a soft page
break (one that MS-Word inserts when the text flows beyond one page), which
appears as a horizontal line running across the page, and a hard page break (one
that you insert to end the page before it is full), which appears as a dotted horizontal
line with the words Page Break, as shown below.

[Figure: Soft Page Break and Hard Page Break]

Web Layout view


With the help of Web Layout view you can see the document the way it will
appear when opened in a web browser. Clicking on Web Layout under the View
menu sets the Web Layout view as shown below.

In this layout too, only the horizontal ruler is shown rather than both rulers, as in
Normal view. But no page breaks are displayed in Web Layout view. The whole
document looks as if it were a single page.

Print Layout
This is the default view. Print Layout view shows the document as it will
appear on the hard copy when printed. It includes both the horizontal and the vertical
ruler to tell you the exact position of your text or picture in the document. It also
shows all four margins (top, bottom, left, and right) as well as the text you have typed
in the header or footer section of the document, displayed in light gray. To change
the view to Print Layout, click on the Print Layout option under the View menu.

As you can see, in Print Layout view each page looks like a separate
page of a notebook.


Outline view


Outline view displays the contents of your document in a traditional outline format,
with text indented beneath headings in a hierarchical structure. In this view, you can
display headings alone, or any level of detail beneath the headings that you wish. The
figure given below shows a document in Outline view. Notice that in this view there
is an additional toolbar, the Outlining toolbar, that enables you to open and close
headings to reveal more or less detail and to promote or demote headings to change
their position in the outline hierarchy.
1. To change to Outline view, select Outline from the View menu.
2. Move your mouse pointer to the plus symbol to the left of the main heading. The
pointer changes to a four-way arrow.
3. Click the plus symbol to select this heading and all of its subheadings.
4. Choose the Collapse button from the Outlining toolbar. All the subheadings
disappear. They have temporarily been collapsed, or hidden from view. The
wavy line under the heading indicates that there are collapsed headings underneath it.
5. Choose the Expand button. The subheadings will appear again.
6. To restyle a heading by demoting or promoting it one level, select it and
then click the Promote or Demote button on the Outlining toolbar. Then click the
appropriate heading style. For example, if you demote a heading formatted with
the Heading 2 style, Word reformats it with the Heading 3 style.
7. To move a heading (along with all the subheadings and body text it contains) to a
new location in the document, drag its plus sign. As you drag, a horizontal line
indicates where the heading will appear. When the line is in the right place,
release the mouse button.
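
The effect of the Promote and Demote buttons on heading styles can be sketched as simple arithmetic on the heading level. The sketch assumes the built-in styles Heading 1 to Heading 9 and, as a simplification, leaves a heading unchanged when it is already at the top or bottom level; the function names promote and demote are hypothetical.

```python
TOP_LEVEL, BOTTOM_LEVEL = 1, 9  # Heading 1 is the highest level


def _level(style):
    """Extract the number from a style name such as 'Heading 2'."""
    return int(style.split()[1])


def demote(style):
    """Move a heading one level down, e.g. Heading 2 -> Heading 3."""
    return f"Heading {min(_level(style) + 1, BOTTOM_LEVEL)}"


def promote(style):
    """Move a heading one level up, e.g. Heading 2 -> Heading 1."""
    return f"Heading {max(_level(style) - 1, TOP_LEVEL)}"


print(demote("Heading 2"))   # Heading 3
print(promote("Heading 2"))  # Heading 1
```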

[Figure: Outlining toolbar with the Promote, Demote, Expand, Collapse and Show Heading buttons]


Full Screen View


If you like working in a completely spick and span environment, you'll like the Full
Screen view. To switch to this view, choose the Full Screen option from the View
menu. Your document enlarges to cover the entire desktop. The title bar, menu bar,
and toolbars in the MS-Word window are temporarily hidden to give you as much
room as possible to see your text. If you want to issue a menu command, point to the
thin gray line running across the top of your screen to slide the menu bar into view.
The Full Screen view of a document is given below.

Zooming the Views


The Zoom feature of MS-Word lets the user increase or decrease the size of the
display to make the text easily visible. The zoom percentage can be set between 10%
and 500% of full size.
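
The zoom range and its effect on the displayed size can be illustrated with a short sketch. The limits of 10% and 500% come from the text above; raising an error for values outside the range is an assumption of this sketch, and the function names set_zoom and displayed_size are hypothetical.

```python
MIN_ZOOM, MAX_ZOOM = 10, 500  # percent of full size


def set_zoom(percent):
    """Accept a zoom percentage only if it lies within the allowed range."""
    if not MIN_ZOOM <= percent <= MAX_ZOOM:
        raise ValueError(f"zoom must be between {MIN_ZOOM}% and {MAX_ZOOM}%")
    return percent


def displayed_size(actual_size, percent):
    """Size on screen of an item whose real size is actual_size."""
    return actual_size * set_zoom(percent) / 100


print(displayed_size(12, 150))  # 18.0 (12-point text shown at 150%)
```

Zooming changes only how large the document appears on the screen; the printed size of the text is not affected.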
To set the zoom percentage, either the Zoom box on the Standard toolbar or the Zoom
dialog box can be used.
To zoom the view using the Zoom box on the Standard toolbar:

1. Display the Standard toolbar if it isn't visible, using the Toolbars command from
the View menu.
2. Click the drop-down control of the Zoom box to display a list of zoom percentages.
3. Select a percentage from the list or type a different percentage in the Zoom box,
as shown in the figure below.


The four options at the bottom of the Zoom list also come in handy. They
automatically adjust your document's magnification by just the right amount to
display the full width of the page (Page Width), the width of the text only (Text
Width), the entire page (Whole Page), and two entire pages (Two Pages). In Normal
view, the Text Width, Whole Page, and Two Pages options are not available.

Working with Toolbar


Toolbars are a faster way to issue a command. MS-Word comes with 16 toolbars. By
default it displays only two of them, the Standard toolbar and the Formatting toolbar.
You can display any of the other fourteen toolbars or hide any toolbar.
To display or hide a toolbar:
1. Click on the View menu.
2. Select the Toolbars option.
3. This will display all the toolbar names. Toolbars that are currently displayed have
check marks in front of them; those that are hidden do not have any check mark.
4. Click the name of a toolbar to place a check mark against it if you want to display it.
5. Click the name again to remove the check mark if you want to hide the toolbar.


Header and Footer


Headers and footers are text or graphics that appear on each page of the document.
You can create headers and footers that include text or graphics (for example, a page
number, the date, a company logo, the document's title or file name, or the author's
name) that are usually printed at the top or bottom of each page in a document. A
header is printed in the top margin; a footer is printed in the bottom margin.
You can use the same header and footer throughout a document or change the header
and footer for part of the document. For example, use a unique header or footer on the
first page, or leave the header or footer off the first page. You can also use different
headers and footers on odd and even pages or for part of a document.
To create a header or footer:

1. Select the Header and Footer option from the View menu. The Header and Footer
toolbar is displayed, the header area is activated, and the document's text color
changes to light gray.
2. Type the text in the header area.
3. If you want to create a footer, click the Switch Between Header and Footer button
on the toolbar to make the footer area active.
4. Type the text in the footer area.
5. Apply any formatting that you like to the header or footer of the document.
6. Click on the Close button to return to the document.


Student Activity


1. Fill in the blanks:
(a) Headers and footers are typically used in ............... documents.
(b) The ............... feature lets us increase or decrease the size of the
display to make the text easily visible.
(c) If you want to create a footer, click the ............... to make the footer area
active.
(d) If you like working in a completely spick and span environment,
you'll like the ............... view.
(e) ............... layout is the default view layout.
2. Explain Office 2000 and its suites.
3. Write down at least 15 shortcut keys available in each of Excel, Word and
PowerPoint.

Summary
Microsoft Office 2000, a successor to Microsoft Office 97, was designed as a fully
32-bit and Y2K-compliant version to match Windows 2000 features. All the Office
2000 applications have OLE 2 capability, which allows data to be moved automatically
between various programs. All the packages available in Office 2000 have some
shortcut keys.

Keywords
Office 2000: Y2K compliant application software
Word 2000: Word processing software that comes as a part of Office 2000
Excel 2000: Spreadsheet software
PowerPoint 2000: Presentation Software

Review Questions
1. Describe all the views of the document in short.
2. Define the role of header and footer in a Word document.
3. How do you add or hide a toolbar?
4. What are the potential benefits of Microsoft Office over its contemporaries?

Further Readings
Dharminder Kumar, Management Information Systems, Excel Books, New Delhi.
Dhiraj Sharma, Foundations of IT, Excel Books, New Delhi.
Bhuwanesh Jha, Elements of basic computing, Khanna Publication.
Chee wong lee, Fundamentals of Office 2000, China Publishing.
Rajiv Gupta, Foundations of Office 2000, Rajasthan Publishers.
Narender, Singh and Naruka, Office 2000 in 7 Days, Jalandhar Publishing House.


Unit 3 Operating Systems

Unit Structure

Introduction

Computer Hardware and OS Interaction

Single User Single Processing System

Multiprogramming Operating System

Architecture and Design of OS

Interface Design and Implementation

Window Systems based on PCs

Summary

Keywords

Review Questions

Further Readings

Learning Objectives
At the conclusion of this unit, you will be able to:

Recognize the need for an operating system

Explain the relation between computer hardware and operating systems

List and describe the operating system functions

Identify different components that make an operating system

Describe different classes of operating systems

Explain various architectures of operating systems

Address the issues related to operating systems interface design

Address the performance related issues of operating systems

Introduction
Operating systems are so ubiquitous in computer operations that one hardly notices
their presence. Most likely you have already interacted with one or more
operating systems. Names like DOS, UNIX, etc. should not be unknown to you.
These are the names of very popular operating systems.
From a very simple standpoint, it can be stated that a computer cannot become
operational without an operating system, hence the name. An operating system is
simply a very complex computer program. You will learn about various issues related
to operating systems in this unit.

Computer Hardware and OS Interaction


Try to recall everything that happens after you switch on a computer and before you
start working on it. In a typical personal computer scenario, this is what happens.
Some information appears on the screen. This is followed by a memory count.
The keyboard, disk drives, printers and other similar devices are verified for
proper operation. These activities occur whenever the computer is switched
on or reset. There may be some additional activities on some machines. These
activities are called power-on routines. Why do these activities always happen? You
will learn about it elsewhere in this unit.
You know that a computer does not do anything unless properly instructed. Thus, for
each of the above power-on activities too, the computer must have instructions.
These instructions are stored in a non-volatile memory, usually a ROM. The CPU of
the computer takes one instruction from this ROM and executes it before taking the
next instruction. Since ROMs are of finite size, they can store only a few kilobytes of
instructions. One by one the CPU executes these instructions. Once these instructions
are over, the CPU must obtain further instructions from somewhere else.
Usually further instructions are stored on a secondary storage device such as a hard
disk, floppy disk or CD-ROM. These instructions are collectively known as the
operating system, and their primary function is to provide an environment in which
users may execute their own instructions.
Once the operating system is loaded into the main memory, the CPU starts executing
its instructions. Operating systems run in an infinite loop, each time taking
instructions in the form of commands or programs from the users and executing them
in the order received. This loop continues until either the user terminates it deliberately
by shutting the system down or something goes wrong during the operation.
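
The command loop described above can be sketched as a small simulation. This is a toy model rather than a real operating system: commands come from a prepared list instead of a keyboard, the available programs are ordinary Python functions, and the names run_operating_system, shutdown and the sample commands are hypothetical.

```python
def run_operating_system(command_source, programs):
    """Take commands one at a time and execute them until shut down."""
    commands = iter(command_source)
    log = []
    while True:                          # the "infinite" loop
        try:
            command = next(commands)     # wait for the next user command
        except StopIteration:
            # Something went wrong: the input ended without a shutdown.
            log.append("input ended unexpectedly")
            break
        if command == "shutdown":        # deliberate termination by the user
            log.append("system halted")
            break
        program = programs.get(command)
        if program is None:
            log.append(f"unknown command: {command}")
        else:
            log.append(program())        # execute the user's program
    return log


programs = {"hello": lambda: "hello, user", "add": lambda: str(2 + 3)}
print(run_operating_system(["hello", "dir", "add", "shutdown", "hello"], programs))
# ['hello, user', 'unknown command: dir', '5', 'system halted']
```

Notice that the final hello command is never executed, because the loop has already ended at shutdown.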
Please note that a user almost never interacts with the hardware directly and that a lot
depends on the operating system loaded and running on a computer. For all practical
purposes, as far as the users are concerned, a computer is nothing more than the
operating system controlling it. In order to get the most out of a computer, therefore,
a deep understanding of the operating system is a must.
An operating system is the most important program in a computer system. This is the
one program that runs all the time, as long as the computer is operational, and exits
only when the computer is shut down.
In general, however, there is no completely adequate definition of an operating
system. Operating systems exist because they are a reasonable way to solve the
problem of creating a usable computing system.
The fundamental goal of computer systems is to execute user programs and to make
solving user problems easier. The hardware of a computer is equipped with extremely
capable resources: memory, CPU, I/O devices, etc. All these hardware units interact
with each other in a well-defined manner. Bare hardware is not enough to solve a
problem. Application programs are developed for the purpose, and these require
certain common operations, such as those controlling the I/O devices. The common
functions of controlling and allocating resources are then brought together into one
piece of software: the operating system.
It is easier to define operating systems by their functions, i.e., by what they do, than by
what they are. The primary goal of an operating system is to make the computer
easier for the users to operate. Operating systems exist because it is supposed to be
easier to compute with them than without them. This view is particularly
clear when you look at operating systems for small personal computers.
Efficient operation of the computer system is a secondary goal of an operating system.
This goal is particularly important for large, shared multi-user systems. These systems
are typically expensive, so it is desirable to make them as efficient as possible.
Operating systems and computer architecture have had a great deal of influence on
each other. Operating systems were developed to facilitate the use of the hardware.
As operating systems were designed and used, it became obvious that changes in the
design of the hardware could simplify them.


Operating systems are the programs that make computers operational, hence the
name. Without an operating system, the hardware of a computer is just an inactive
electronic machine, possessing great computational power but doing nothing for the
user. All it can do is execute a fixed number of instructions stored in its internal
memory (ROM: Read Only Memory) each time you switch the power on, and
nothing else.


Operating systems are programs (fairly complex ones) that act as an interface between
the user and the computer hardware. They sit between the user and the hardware of
the computer, providing an operational environment to the users and application
programs. For a user, therefore, a computer is nothing but the operating system
running on it. It is an extended machine.
Users do not interact with the hardware of a computer directly but through the
services offered by the operating system. This is because the language that users
employ is different from that of the hardware. Whereas users prefer to use natural
language or near-natural language for interaction, the hardware uses machine
language. It is the operating system that does the necessary translation back and forth
and lets the user interact with the hardware. The operating system speaks the user's
language on the one hand and machine language on the other. It takes instructions in
the form of commands from the user, translates them into machine-understandable
instructions, gets these instructions executed by the CPU, and translates the result
back into user-understandable form.
A user can interact with a computer only if he/she understands the language of the
resident operating system. You cannot interact with a computer running the UNIX
operating system, for instance, if you do not know the UNIX language or UNIX
commands. A UNIX user can always interact with a computer running the UNIX
operating system, no matter what type of computer it is. Thus, for a user, the operating
system itself is the machine: an extended machine, as shown in Figure 3.1.

[Figure showing, from top to bottom: user 1, user 2, user 3, ..., user n; application programs (compiler, text editor, database, ...); system calls; shell; operating system; computer hardware (CPU, memory, I/O)]

Figure 3.1: Extended-machine View of Operating System


Operating Systems are the Computer's Resource Managers


The computer hardware is made up of physical electronic devices, viz. memory,
microprocessor, magnetic disks and the like. These functional components are
referred to as resources available to computers for carrying out their computations.
All the hardware units interact with each other in terms of electric signals (i.e. voltage
and current) usually coded into binary format (i.e. 0 and 1) in digital computers, in a
very complex way.
In order to interact with the computer hardware and get a computational job executed
by it, the job needs to be translated into this binary form, called machine language. Thus,
the instructions and data of the job must be converted into some binary form, which
then must be stored into the computer's main memory. The CPU must be directed at
this point, to execute the instructions loaded in the memory. A computer, being a
machine after all, does not do anything by itself. Which resource is to be allocated to
which program, when and how, is decided by the operating system in such a way that
the resources are utilized optimally and efficiently.

Functions
As has been stated earlier, the prime function of an operating system is to provide an
environment for the execution of users' programs. Besides, the operating system also
provides certain services to programs and to the users of those programs to enhance
the primary function of program execution in various ways.
The specific services provided differ from one operating system to another, but there
are some common classes that we can identify. These operating system services are
provided for the convenience of the programmer, to make the programming task
easier. Some of these services are listed below:
1. Program execution services
2. I/O operation services
3. File system services
4. Communication services
5. Error detection services
6. Accounting services
7. Protection services

Components
An operating system performs a large number of functions. Each function is carried out
by a component of the operating system called a subsystem. The typical
components of an operating system are:

1. Process management sub-system
2. Memory management sub-system
3. File management sub-system
4. I/O system management sub-system
5. Secondary storage management sub-system
6. Network management sub-system
7. Protection sub-system
8. User-interface sub-system

Classification

The variations and differences in the nature of different operating systems may give
the impression that all operating systems are completely different from each other. But
this is not true. All operating systems contain the same components, whose
functionalities are almost the same. For instance, all operating systems perform
the functions of storage management, process management, protection of users from
one another, etc. The procedures and methods used to perform these
functions may differ, but the fundamental concepts behind these techniques
are the same. Operating systems, in general, perform similar functions but may
have distinguishing features. Therefore, they can be classified into different categories
on different bases. Let us quickly look at the different types of operating systems.

Single User Single Processing System


The simplest of all computer systems is a single-user, single-processor system. It has
a single processor, runs a single program and interacts with a single user at a time.
The operating system for this system is very simple to design and implement.
However, the CPU is not utilized to its full potential, because it sits idle for most of
the time (Figure 3.2).

[Diagram showing the user, the application program, the operating system and the hardware]
Figure 3.2: Single User Single Processor System


In this configuration, all the computing resources are available to the user all the time.
Therefore, the operating system has very simple responsibilities. A representative example
of this category of operating system is MS-DOS.

Batch Processing Systems


The main function of a batch processing system is to move automatically from executing one
job to the next job in the batch (Figure 3.3). The main idea behind a batch processing
system is to reduce the intervention of the operator during the processing or execution
of jobs by the computer. All functions of a batch processing system are carried out by
the batch monitor. The batch monitor permanently resides in the low end of the main
store. The current job of the batch is executed in the remaining storage
area. In other words, the batch monitor is responsible for controlling the entire
environment of the system operation. The batch monitor accepts batch initiation
commands from the operator, processes each job, and performs job
termination and batch termination.
In a batch processing system, we generally make use of the term turnaround time.
It is defined as the time from when a user job is submitted to the time when its output is
given back to the user. This time includes the batch formation time, time taken to
execute a batch, time taken to print results and the time required to physically sort the
printed outputs that belong to different jobs. As the printing and sorting of the results
is done for all the jobs of a batch together, the turnaround time for a job becomes a
function of the execution time requirement of all jobs in the batch. You can reduce the
turnaround time for different jobs by recording the jobs on faster input/output media
like magnetic tape or disk surfaces. It takes very little time to read a record from these
media. For instance, it takes roughly five milliseconds for a magnetic tape and
about one millisecond for a fast fixed-head disk, in comparison to a card reader or
printer that takes around 50-100 milliseconds. Thus, using a disk or tape
reduces the amount of time the central processor has to wait for an input/output
operation to finish before resuming processing. This reduces the time taken to
process a job, which in turn brings down the turnaround times for all the
jobs in the batch.

[Diagram: a batch of jobs/tasks passes to the operating system, which runs on the hardware]
Figure 3.3
Another term that is commonly used in a batch processing system is job scheduling.
Job scheduling is the process of sequencing jobs so that they can be executed on the
processor. By default, jobs are taken up on a first-come-first-served (FCFS)
basis, because of the sequential nature of the batch: the batch monitor always
starts the next job in the batch. However, in exceptional cases, you could also arrange
the different jobs in the batch depending upon the priority of each job. Sequencing
jobs according to some criterion requires scheduling the jobs at the time of creating or
executing a batch. On the basis of the relative importance of jobs, priorities could
be set for each batch of jobs. Several batches could be formed on the same priority criteria,
so that the batch having the highest priority is run earlier than
other batches. This would give better turnaround service to the selected jobs.
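
The difference between FCFS and priority sequencing can be sketched as follows. The job list, the convention that a lower number means higher priority, and the helper `completion_times` are all assumptions made for this example.

```python
# Each job: (name, arrival order, priority, execution time). Values are made up.
jobs = [("J1", 1, 2, 30), ("J2", 2, 1, 10), ("J3", 3, 3, 20)]

def completion_times(ordered_jobs):
    """Return {job name: time at which the job finishes} for a given order."""
    clock, finished = 0, {}
    for name, _arrival, _priority, run_time in ordered_jobs:
        clock += run_time
        finished[name] = clock
    return finished

# First come, first served: sort by arrival order.
fcfs = sorted(jobs, key=lambda j: j[1])
# Priority: lower number runs first (assumption); arrival order breaks ties.
by_priority = sorted(jobs, key=lambda j: (j[2], j[1]))

print(completion_times(fcfs))         # {'J1': 30, 'J2': 40, 'J3': 60}
print(completion_times(by_priority))  # {'J2': 10, 'J1': 40, 'J3': 60}
```

The high-priority job J2 finishes at time 10 instead of 40, at the cost of delaying J1.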
Now, we discuss the concept of storage management. At any point of time, the main
store of the computer is shared by the batch monitor program and the current user job
of a batch. The big question is how much storage has to be
kept for the monitor program and how much has to be provided for the user jobs of a
batch. If too much main storage is given to the monitor, the user
programs will not get enough storage. Therefore, an overlay structure has to be
devised so that sections of monitor code that are not needed together do not occupy storage
simultaneously.
Next we will discuss the concept of sharing and protection. The efficiency of
utilization of a computer system is judged by its ability to share the system's
hardware and software resources amongst its users. Whenever the idea of sharing the
system resources comes to mind, certain doubts also arise about the fairness and
security of the system. Every user wants all his reasonable requests to be
taken care of, and wants no intentional or unintentional act of another user to interfere
with his data. A batch processing system guarantees the fulfillment of these user
requirements. All the user jobs are performed one after the other. There is no
simultaneous execution of more than one job at a time. So, all the system resources
like storage, I/O devices, central processing unit, etc. are shared sequentially or serially.
This is how sharing of resources is enforced on a batch processing system. Now arises
the question of protection. Even though all the jobs are processed sequentially, this too
can lead to loss of security or protection. Let us suppose that there are two users, A
and B. User A creates a file of his own. User B deletes the file created by User A. There
are many other similar instances that can occur. So, the files
and other data of all the users should be protected against unauthorized usage. In
order to avoid such loss of protection, each user is bound by certain rules and
regulations. These take the form of a set of control statements, which every user is
required to follow.

Multiprogramming Operating System


The objective of a multiprogramming operating system is to increase the system
utilization efficiency. The batch processing system tries to reduce the CPU idle time
caused by operator interaction. However, it cannot reduce the idle time due to I/O
operations. So, when some I/O is being performed by the currently executing job of a
batch, the CPU sits idle without any work to do. The multiprogramming
operating system tries to eliminate such idle times by providing multiple
computational tasks for the CPU to perform. This is achieved by keeping multiple
jobs in the main store. When the job that is currently executing on the CPU
needs some I/O, the CPU passes the request over to the I/O processor. While
the I/O operation is being carried out, the CPU is free to carry out some other job. Because
the jobs are independent of one another, the CPU and I/O activities can proceed
independently of each other. If the jobs were not independent, overlapping them could lead to
time-dependent errors.
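
The gain from overlapping CPU work with I/O can be estimated with a small simulation. This is a simplified sketch: each job is an assumed list of (CPU time, I/O time) pairs, jobs are independent, and each job's I/O is assumed to run on its own device so that I/O operations of different jobs never wait for one another. The function names are invented for this example.

```python
def batch_time(jobs):
    """One job at a time: the CPU waits during every I/O operation."""
    return sum(cpu + io for job in jobs for cpu, io in job)

def multiprogrammed_time(jobs):
    """Several jobs in memory: while one job does I/O, the CPU runs another."""
    clock = 0                       # current CPU time
    ready_at = [0] * len(jobs)      # when each job can next use the CPU
    next_burst = [0] * len(jobs)    # index of each job's next (cpu, io) pair
    finish = 0
    while True:
        pending = [i for i in range(len(jobs)) if next_burst[i] < len(jobs[i])]
        if not pending:
            return finish
        i = min(pending, key=lambda k: ready_at[k])
        clock = max(clock, ready_at[i])     # CPU idles only if nothing is ready
        cpu, io = jobs[i][next_burst[i]]
        clock += cpu
        ready_at[i] = clock + io            # I/O proceeds without the CPU
        next_burst[i] += 1
        finish = max(finish, ready_at[i])

# Two made-up jobs, each with two (CPU, I/O) phases.
jobs = [[(2, 8), (2, 8)], [(3, 5), (3, 5)]]
print(batch_time(jobs))            # 36
print(multiprogrammed_time(jobs))  # 20
```

With these figures the same work finishes in 20 time units instead of 36, because the CPU executes the second job while the first is waiting for its I/O.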
Some of the most popular multiprogramming operating systems are UNIX, VMS,
Windows NT, etc.

Time Sharing or Multitasking System


Time sharing, or multitasking, is a logical extension of multiprogramming. Multiple
jobs are executed by the CPU switching between them. The switching is so frequent
that each user thinks that the CPU is executing only his program.
An interactive, or hands-on, computer system provides on-line communication
between the user and the system. The user gives instructions to the operating system
or to a program directly, and receives an immediate response.
Usually, a keyboard is used to provide input, and a display screen (such as a cathode-ray tube (CRT), or monitor) is used to provide output. When the operating system
finishes the execution of one command, it seeks the next control statement not from a
card reader, but rather from the user's keyboard. The user gives a command, waits for
the response, and decides on the next command, based on the result of the previous
one. The user can easily experiment, and can see results immediately. Most systems
have an interactive text editor for entering programs, and an interactive debugger for
assisting in debugging programs.
If users are to be able to access both data and code conveniently, an on-line file system
must be available. A file is a collection of related information defined by its creator.
Commonly, files represent programs (both source and object forms) and data. Data
files may be numeric, alphabetic, or alphanumeric. Files may be freeform, such as text
files, or may be rigidly formatted. In general, a file is a sequence of bits, bytes, lines, or
records whose meaning is defined by its creator and user. The operating system
implements the abstract concept of a file by managing mass-storage devices, such as
tapes and disks. Files are normally organized into logical clusters, or directories,
which make them easier to locate and access. Since multiple users have access to files,
it is desirable to control by whom and in what ways files may be accessed. Batch
systems are appropriate for executing large jobs that need little interaction. The user
can submit jobs and return later for the results; it is not necessary for the user to wait
while the job is processed.
Interactive jobs tend to be composed of many short actions, where the results of the
next command may be unpredictable. The user submits the command and then waits
for the results. Accordingly, the response time should be short, on the order of seconds
at most.
An interactive system is used when a short response time is required. Early
computers with a single user were interactive systems. That is, the entire system was
at the immediate disposal of the programmer/operator. This situation allowed the
programmer great flexibility and freedom in program testing and development. But,
as we saw, this arrangement resulted in substantial idle time while the CPU waited
for some action to be taken by the programmer/operator. Because of the high cost of
these early computers, idle CPU time was undesirable. Batch operating systems were
developed to avoid this problem. Batch systems improved system utilization for the
owners of the computer systems.
Time-sharing systems were developed to provide interactive use of a computer
system at a reasonable cost. A time-shared operating system uses CPU scheduling
and multiprogramming to provide each user with a small portion of a time-shared
computer.
Each user has at least one separate program in memory. A program that is loaded into
memory and is executing is commonly referred to as a process. When a process
executes, it typically executes for only a short time before it either finishes or needs to
perform I/O. I/O may be interactive; that is, output is to a display for the user and
input is from a user keyboard.
Since interactive I/O typically runs at people speeds, it may take a long time to
complete. Input, for example, may be bounded by the user's typing speed; five
characters per second is fairly fast for people, but is incredibly slow for computers.
Rather than let the CPU sit idle when this interactive input takes place, the operating
system will rapidly switch the CPU to the program of some other user.
A time-shared operating system allows the many users to share the computer
simultaneously. Since each action or command in a time-shared system tends to be
short, only a little CPU time is needed for each user. As the system switches rapidly
from one user to the next, each user is given the impression that she has her own
computer, whereas actually one computer is being shared among many users.
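
The rapid switching between users can be sketched with a round-robin scheduler. Round-robin with a fixed time slice (quantum) is one common policy; the text above does not name a specific policy, so treat the policy, the function `round_robin` and the burst lengths as assumptions.

```python
from collections import deque

def round_robin(bursts, quantum):
    """Return the sequence of (user, slice length) pieces given the CPU."""
    queue = deque(bursts.items())
    timeline = []
    while queue:
        user, remaining = queue.popleft()
        used = min(quantum, remaining)
        timeline.append((user, used))
        if remaining > used:
            # Not finished: go to the back of the queue and wait for another turn.
            queue.append((user, remaining - used))
    return timeline

print(round_robin({"A": 5, "B": 2, "C": 4}, quantum=2))
# [('A', 2), ('B', 2), ('C', 2), ('A', 2), ('C', 2), ('A', 1)]
```

No user waits long for a turn, which is what gives each of them the impression of having the computer to themselves.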
The idea of time-sharing was demonstrated as early as 1960, but since time-shared
systems are difficult and expensive to build, they did not become common until the
early 1970s. As the popularity of time-sharing has grown, researchers have attempted
to merge batch and time-shared systems. Many computer systems that were
designed as primarily batch systems have been modified to create a time-sharing
subsystem. For example, IBM's OS/360, a batch system, was modified to support the
time-sharing option (TSO). At the same time, time-sharing systems have often added
a batch subsystem. Today, most systems provide both batch processing and time
sharing, although their basic design and use tend to be of one type or the other.
Time-sharing operating systems are even more complex than multiprogrammed
operating systems. As in multiprogramming, several jobs must be kept
simultaneously in memory, which requires some form of memory management and
protection. So that a reasonable response time can be obtained, jobs may have to be
swapped in and out of main memory.
Many universities and businesses have large numbers of workstations tied together
with local-area networks. As PCs gain more sophisticated hardware and software, the
line dividing PCs from workstations is blurring.

Parallel Systems
Most systems to date are single-processor systems; that is, they have only one main
CPU. However, there is a trend toward multiprocessor systems. Such systems have
more than one processor in close communication, sharing the computer bus, the clock,
and sometimes memory and peripheral devices. These systems are referred to as
tightly coupled systems.
There are several reasons for building such systems. One advantage is increased
throughput. By increasing the number of processors, we hope to get more work done
in a shorter period of time. The speed-up ratio with n processors is not n, however,
but rather is less than n. When multiple processors cooperate on a task, a certain
amount of overhead is incurred in keeping all the parts working correctly. This
overhead, plus contention for shared resources, lowers the expected gain from
additional processors. Similarly, a group of n programmers working closely together
does not result in n times the amount of work being accomplished.
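
One standard way to see why the speed-up falls short of n is Amdahl's law, which the text above does not name. It assumes that only a fraction of the work can be spread across processors, with the rest (coordination included) running serially. The 90 percent parallel fraction below is an arbitrary illustrative value.

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's law: speed-up on n processors when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(n, 0.9), 2))
# 1 1.0
# 2 1.82
# 4 3.08
# 8 4.71
```

Under this model, eight processors deliver less than five times the throughput of one, and the gap widens as more processors are added.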
Multiprocessors can also save money compared to multiple single systems because
the processors can share peripherals, cabinets, and power supplies. If several
programs are to operate on the same set of data, it is cheaper to store those data on
one disk and to have all the processors share them, rather than to have many
computers with local disks and many copies of the data.
Another reason for multiprocessor systems is that they increase reliability. If functions
can be distributed properly among several processors, then the failure of one
processor will not halt the system, but rather will only slow it down. If we have
10 processors and one fails, then each of the remaining nine processors must pick up a
share of the work of the failed processor. Thus, the entire system runs only 10 percent
slower, rather than failing altogether.
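
The arithmetic behind this example can be checked directly. The sketch assumes identical processors and work that divides evenly among them; the function names are invented for illustration.

```python
def capacity_after_failures(total, failed):
    """Fraction of the original processing capacity still available."""
    return (total - failed) / total

def extra_share_per_survivor(total, failed):
    """Extra work each surviving processor takes on, relative to its own load."""
    return failed / (total - failed)

print(capacity_after_failures(10, 1))             # 0.9
print(round(extra_share_per_survivor(10, 1), 3))  # 0.111
```

With one of ten processors lost, 90 percent of the capacity remains (the 10 percent slowdown quoted above), and each of the nine survivors carries about 11 percent more work than before.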
This ability to continue providing service proportional to the level of surviving
hardware is called graceful degradation. Systems that are designed for graceful
degradation are also called fault-tolerant.
Continued operation in the presence of failures requires a mechanism to allow the
failure to be detected, diagnosed, and corrected (if possible). The Tandem system uses
both hardware and software duplication to ensure continued operation despite faults.
The system consists of two identical processors, each with its own local memory. The
processors are connected by a bus. One processor is the primary, and the other is the
backup. Two copies are kept of each process; one on the primary machine and the
other on the backup. At fixed checkpoints in the execution of the system, the state
information of each job (including a copy of the memory image) is copied from the
primary machine to the backup. If a failure is detected, the backup copy is activated,
and is restarted from the most recent checkpoint.
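
The checkpoint-and-restart idea can be sketched in a few lines. This is a toy model, not the actual Tandem design: the classes `Primary` and `Backup`, the checkpoint interval of three steps, and the state being a small dictionary are all assumptions made for illustration.

```python
import copy

class Primary:
    """Does the work; its state is what must survive a failure."""
    def __init__(self):
        self.state = {"step": 0, "total": 0}

    def work(self):
        self.state["step"] += 1
        self.state["total"] += self.state["step"]

class Backup:
    """Holds the most recent checkpoint and can resume from it."""
    def __init__(self):
        self.checkpoint = None

    def save(self, state):
        self.checkpoint = copy.deepcopy(state)

    def take_over(self):
        resumed = Primary()
        resumed.state = copy.deepcopy(self.checkpoint)
        return resumed

primary, backup = Primary(), Backup()
for step in range(1, 8):
    primary.work()
    if step % 3 == 0:              # fixed checkpoint interval (assumed)
        backup.save(primary.state)

# Suppose the primary fails now, after step 7; the backup resumes from step 6.
survivor = backup.take_over()
print(survivor.state)   # {'step': 6, 'total': 21}
```

The work done since the last checkpoint (step 7 here) is lost and must be repeated, which is the price paid for not copying the state after every single step.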
This solution is obviously an expensive one, since there is considerable hardware
duplication. The most common multiple-processor systems now use the symmetric
multiprocessing model, in which each processor runs an identical copy of the
operating system, and the copies communicate with one another as needed. Some
systems use asymmetric multiprocessing, in which each processor is assigned a specific
task. A master processor controls the system; the other processors either look to the
master for instruction or have predefined tasks. This scheme defines a master-slave
relationship. The master processor schedules and allocates work to the slave
processors.
An example of the symmetric multiprocessing system is Encore's version of UNIX for
the Multimax computer. This computer can be configured to employ dozens of
processors, all running a copy of UNIX. The benefit of this model is that many
processes can run at once (N processes if there are N CPUs) without causing a
deterioration of performance. However, we must carefully control I/O to ensure that
data reach the appropriate processor. Also, since the CPUs are separate, one may be
sitting idle while another is overloaded, resulting in inefficiencies. To avoid these
inefficiencies, the processors can share certain data structures. A multiprocessor
system of this form will allow jobs and resources to be shared dynamically among the
various processors, and can lower the variance among the processors. However, such a
system must be written carefully.
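
Sharing a data structure to balance the load can be sketched with a single job queue from which every processor takes its next job. In this illustration Python threads stand in for processors, and the function `run_smp` and the squaring "work" are invented placeholders.

```python
import queue
import threading

def run_smp(jobs, n_workers):
    """Run jobs on n_workers threads that all pull from one shared queue."""
    shared = queue.Queue()
    for job in jobs:
        shared.put(job)
    done = {i: [] for i in range(n_workers)}   # results kept per worker

    def worker(worker_id):
        while True:
            try:
                job = shared.get_nowait()      # take the next available job
            except queue.Empty:
                return                         # nothing left: this worker stops
            done[worker_id].append(job * job)  # stand-in for real work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

results = run_smp(range(10), n_workers=3)
print(sorted(value for values in results.values() for value in values))
```

Because no job is bound to a particular worker in advance, a worker that finishes early simply takes more jobs instead of sitting idle. The queue itself must be safe for concurrent access, which is one reason such a system has to be written carefully.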
Asymmetric multiprocessing is more common in extremely large systems, where one
of the most time-consuming activities is simply processing I/O. In older batch
systems, small processors, located at some distance from the main CPU, were used to
run card readers and line printers and to transfer these jobs to and from the main
computer. These locations are called remote-job-entry (RJE) sites. In a time-sharing
system, a main I/O activity is processing the I/O of characters between the terminals
and the computer. If the main CPU must be interrupted for every character for every
terminal, it may spend all its time simply processing characters. So that this situation
is avoided, most systems have a separate front-end processor that handles all the
terminal I/O.
For example, a large IBM system might use an IBM Series/1 minicomputer as a front-end. The front-end acts as a buffer between the terminals and the main CPU, allowing
the main CPU to handle lines and blocks of characters, instead of individual
characters. Such systems suffer from decreased reliability through increased
specialization. It is important to recognize that the difference between symmetric and
asymmetric multiprocessing may be the result of either hardware or software.
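
The buffering role of a front-end processor can be sketched as follows. The class `FrontEnd`, its `deliver` callback and the typed text are hypothetical; the point is only that the main CPU is disturbed once per line rather than once per character.

```python
class FrontEnd:
    """Collects characters and passes only complete lines to the main CPU."""
    def __init__(self, deliver):
        self.deliver = deliver     # called once per finished line
        self.buffer = []

    def key_pressed(self, ch):
        if ch == "\n":
            self.deliver("".join(self.buffer))
            self.buffer = []
        else:
            self.buffer.append(ch)

lines_seen = []                    # what the main CPU receives
front_end = FrontEnd(deliver=lines_seen.append)
typed = "ls -l\ncat notes\n"
for ch in typed:
    front_end.key_pressed(ch)

print(len(typed), "characters typed,", len(lines_seen), "interruptions of the main CPU")
print(lines_seen)   # ['ls -l', 'cat notes']
```

Sixteen keystrokes reach the front-end, but the main CPU is interrupted only twice.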
Special hardware may exist to differentiate the multiple processors, or the software
may be written to allow only one master and multiple slaves. For instance, Sun's
operating system SunOS Version 4 provides asymmetric multiprocessing, whereas
Version 5 (Solaris 2) is symmetric. As microprocessors become less expensive and
more powerful, additional operating system functions are off-loaded to slave processors, or back-ends.
For example, it is fairly easy to add a microprocessor with its own memory to manage
a disk system. The microprocessor could receive a sequence of requests from the main
CPU and implement its own disk queue and scheduling algorithm. This arrangement
relieves the main CPU of the overhead of disk scheduling. PCs contain a
microprocessor in the keyboard to convert the key strokes into codes to be sent to the
CPU. In fact, this use of microprocessors has become so common that it is no longer
considered multiprocessing.

Distributed Systems
A recent trend in computer systems is to distribute computation among several
processors. In contrast to the tightly coupled systems, the processors do not share
memory or a clock. Instead, each processor has its own memory and clock. The
processors communicate with one another through various communication lines,
such as high-speed buses or telephone lines.
These systems are usually referred to as loosely coupled systems, or distributed
systems. The processors in a distributed system may vary in size and function. They
may include small microprocessors, workstations, minicomputers and large general-purpose computer systems. These processors are referred to by a number of different
names, such as sites, nodes, computers, and so on, depending on the context in which
they are mentioned.

There are a variety of reasons for building distributed systems, the major ones being:
• Resource sharing: If a number of different sites (with different capabilities) are
connected to one another, then a user at one site may be able to use the resources
available at another. For example, a user at site A may be using a laser printer
available only at site B. Meanwhile, a user at B may access a file that resides at A.
In general, resource sharing in a distributed system provides mechanisms for
sharing files at remote sites, processing information in a distributed database,
printing files at remote sites, using remote specialized hardware devices (such as
a high-speed array processor), and performing other operations.

• Computation speedup: If a particular computation can be partitioned into a
number of subcomputations that can run concurrently, then a distributed system
may allow us to distribute the computation among the various sites to run that
computation concurrently. In addition, if a particular site is currently overloaded
with jobs, some of them may be moved to other, lightly loaded, sites. This
movement of jobs is called load sharing.

• Reliability: If one site fails in a distributed system, the remaining sites can
potentially continue operating. If the system is composed of a number of large
autonomous installations (that is, general-purpose computers), the failure of one
of them should not affect the rest. If, on the other hand, the system is composed of
a number of small machines, each of which is responsible for some crucial system
function (such as terminal character I/O or the file system), then a single failure
may effectively halt the operation of the whole system. In general, if sufficient
redundancy exists in the system (in both hardware and data), the system can
continue with its operation, even if some of its sites have failed.

• Communication: There are many instances in which programs need to exchange
data with one another on one system. Window systems are one example, since
they frequently share data or transfer data between displays. When many sites are
connected to one another by a communication network, the processes at different
sites have the opportunity to exchange information. Users may initiate file
transfers or communicate with one another via electronic mail. A user can send
mail to another user at the same site or at a different site.
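
The load sharing mentioned under computation speedup can be sketched as a simple rebalancing step. The site names, the job lists, the `threshold` parameter and the policy of always moving a job to the least loaded site are assumptions made for illustration; real distributed systems use more elaborate policies.

```python
def share_load(sites, threshold):
    """Move jobs from sites above the threshold to the least loaded site."""
    sites = {name: list(jobs) for name, jobs in sites.items()}  # work on a copy
    for name in sites:
        while len(sites[name]) > threshold:
            target = min(sites, key=lambda s: len(sites[s]))
            if len(sites[target]) + 1 >= len(sites[name]):
                break   # moving a job would not improve the balance
            sites[target].append(sites[name].pop())
    return sites

before = {"A": ["j1", "j2", "j3", "j4", "j5"], "B": ["j6"], "C": []}
after = share_load(before, threshold=2)
print({name: len(jobs) for name, jobs in after.items()})  # {'A': 2, 'B': 2, 'C': 2}
```

Site A starts with five jobs while C has none; after the move, each site holds two jobs and no job has been lost.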

Real Time Systems


Another form of a special-purpose operating system is the real-time system. A real-time
system is used when there are rigid time requirements on the operation of a processor
or the flow of data, and thus is often used as a control device in a dedicated application.
Sensors bring data to the computer. The computer must analyze the data and possibly
adjust controls to modify the sensor inputs. Systems that control scientific
experiments, medical imaging systems, industrial control systems, and some display
systems are real-time systems. Also included are some automobile-engine fuel-injection systems, home-appliance controllers and weapon systems.
A real-time operating system has well-defined, fixed time constraints. Processing
must be done within the defined constraints, or the system will fail. For instance, it
would not do for a robot arm to be instructed to halt after it had smashed into the car
it was building. A real-time system is considered to function correctly only if it
returns the correct result within its time constraints. Contrast this requirement with a
time-sharing system, where it is desirable (but not mandatory) to respond quickly, or
to a batch system, where there may be no time constraints at all.
There are two flavors of real-time systems. A hard real-time system guarantees that
critical tasks complete on time. This goal requires that all delays in the system be
bounded, from the retrieval of stored data to the time that it takes the operating
system to finish any request made of it. Such time constraints dictate the facilities that
are available in hard real-time systems. Secondary storage of any sort is usually
limited or missing, with data instead being stored in short-term memory, or in Read-only Memory (ROM). ROM is located on nonvolatile storage devices that retain their
contents even in the case of electric outage; most other types of memory are volatile.
Most advanced operating-system features are absent too, since they tend to separate
the user further from the hardware, and that separation results in uncertainty about
the amount of time an operation will take. For instance, virtual memory is almost
never found on real-time systems. Therefore, hard real-time systems conflict with the
operation of time-sharing systems, and the two cannot be mixed. Since none of the
existing general-purpose operating systems support hard real-time functionality, we
do not concern ourselves with this type of system in this text.
A less restrictive type of real-time system is a soft real-time system, where a critical
real-time task gets priority over other tasks, and retains that priority until it completes.

Architecture and Design of OS


An operating system is a large and complex system. As such it must be engineered
carefully if it is to function properly and to be modified easily. A common approach is
to partition the intended tasks into smaller components, rather than have one
monolithic system. Each of these modules should be a well-defined portion of the
system, with carefully defined inputs, outputs, and function. We have already
discussed briefly the common components of operating systems. In this section, we
discuss the way that these components are interconnected and melded into a kernel.

Monolithic Architecture
There are numerous commercial systems that do not have a well-defined structure.
This architecture is referred to as monolithic architecture because of the lack of any
identifiable structure. Frequently, such operating systems started as small, simple,
and limited systems, and then grew beyond their original scope. MS-DOS is an
example of such a system. It was originally designed and implemented by a few
people who had no idea that it would become so popular. It was written to provide
the most functionality in the least space, because of the limited hardware on which it
ran, so it was not divided into modules carefully. Figure 3.4 shows its structure.

Figure 3.4: MS DOS Layer Structure


In MS-DOS, the interfaces and levels of functionality are not well separated. For
instance, application programs are able to access the basic I/O routines to write
directly to the display and disk drives. Such freedom leaves MS-DOS vulnerable to
errant (or malicious) programs, causing entire system crashes when user programs
fail. Of course, MS-DOS was also limited by the hardware of its era. Because the Intel
8088 for which it was written provides no dual mode and no hardware protection, the
designers of MS-DOS had no choice but to leave the base hardware accessible.

Layered Architecture
New versions of operating systems are designed to use more advanced hardware.
Given proper hardware-support, operating systems may be broken into smaller, more
appropriate pieces than those allowed by the original MS-DOS or UNIX. The
operating system can then retain much greater control over the computer and the
applications that make use of that computer. Implementers have more freedom to
make changes to the inner workings of the system. Familiar techniques are used to aid
in the creation of modular operating systems. Under the top-down approach, the
overall functionality and features can be determined and separated into components.
Information hiding is also important, leaving programmers free to implement the
low-level routines as they see fit, provided that the external interface of the routine
stays unchanged and the routine itself performs the advertised task.


The modularization of a system can be done in many ways; the most appealing is the
layered approach, which consists of breaking the operating system into a number of
layers (levels), each built on top of lower layers. The bottom layer (layer 0) is the
hardware; the highest (layer N) is the user interface.
An operating-system layer is an implementation of an abstract object that is the
encapsulation of data, and operations that can manipulate those data. A typical
operating-system layer, say layer M, is depicted in Figure 3.5.

Figure 3.5: An Operating System Layer


It consists of some data structures and a set of routines that can be invoked by higher-level layers. Layer M, in turn, can invoke operations on lower-level layers.
The main advantage of the layered approach is modularity. The layers are selected
such that each uses functions (operations) and services of only lower level layers. This
approach simplifies debugging and system verification. The first layer can be
debugged without any concern for the rest of the system, because, by definition, it
uses only the basic hardware (which is assumed correct) to implement its functions.
Once the first layer is debugged, its correct functioning can be assumed while the
second layer is worked on, and so on. If an error is found during the debugging of a
particular layer, we know that the error must be on that layer, because the layers
below it are already debugged. Thus, the design and implementation of the system is
simplified when the system is broken down into layers.
Each layer is implemented using only those operations provided by lower-level
layers. A layer does not need to know how these operations are implemented; it needs
to know only what these operations do. Hence, each layer hides the existence of
certain data structures, operations, and hardware from higher-level layers.
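The layering rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the code of any real operating system: the class names Hardware, CpuScheduler and MemoryManager and their methods are assumptions made for the example. Each layer holds a reference only to the layer directly below it and keeps its own data structures private.

```python
class Hardware:
    """Layer 0: the bare machine (hypothetical stand-in)."""

    def __init__(self):
        self.ticks = 0  # counts how many instructions reached the hardware

    def execute(self, instruction):
        self.ticks += 1
        return f"hw:{instruction}"


class CpuScheduler:
    """Layer 1: uses only the operations offered by layer 0."""

    def __init__(self, hardware):
        self._hw = hardware

    def run(self, instruction):
        return self._hw.execute(instruction)


class MemoryManager:
    """Layer 2: uses only the operations offered by layer 1."""

    def __init__(self, scheduler):
        self._below = scheduler
        self._pages = {}  # hidden data structure, invisible to higher layers

    def store(self, page, value):
        self._pages[page] = value
        return self._below.run(f"store page {page}")

    def load(self, page):
        self._below.run(f"load page {page}")
        return self._pages[page]


# Build the stack from the bottom up, as it would be debugged.
hardware = Hardware()
memory = MemoryManager(CpuScheduler(hardware))
memory.store(7, "data")
print(memory.load(7), hardware.ticks)
```

A caller of MemoryManager never touches the page table or the hardware directly, so either could be reimplemented without changing the caller, provided store and load keep doing what they advertise.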
The layered approach to design was first used in the THE operating system at the Technische
Hogeschool Eindhoven. The system was defined in six layers, as shown in Figure 3.6.
The bottom layer was the hardware. The next layer implemented CPU scheduling.
The next layer implemented memory management; the memory-management scheme
was virtual memory. Layer 3 contained the device driver for the operator's console.
Because it and I/O buffering (level 4) were placed above memory management, the
device buffers could be placed in virtual memory. The I/O buffering was also above
the operator's console, so that I/O error conditions could be output to the operator's
console.
layer 5: user programs
layer 4: buffering for input and output devices
layer 3: operator-console device driver
layer 2: memory management
layer 1: CPU scheduling
layer 0: hardware

Figure 3.6: The Layer Structure


This approach can be used in many ways. For example, the Venus system was also
designed using a layered approach. The lower layers (0 to 4), dealing with CPU
scheduling and memory management, were then put into microcode. This decision
provided the advantages of additional speed of execution and a clearly defined
interface between the microcoded layers and the higher layers.
The major difficulty with the layered approach involves the appropriate definition of
the various layers. Because a layer can use only those layers that are at a lower level,
careful planning is necessary. For example, the device driver for the backing store
(disk space used by virtual-memory algorithms) must be at a level lower than that of
the memory-management routines, because memory management requires the ability
to use the backing store.
Other requirements may not be so obvious. The backing-store driver would normally
be above the CPU scheduler, because the driver may need to wait for I/O and the
CPU can be rescheduled during this time. However, on a large system, the CPU
scheduler may have more information about all the active processes than can fit in
memory. Therefore, this information may need to be swapped in and out of memory,
requiring the backing-store driver routine to be below the CPU scheduler.
A final problem with layered implementations is that they tend to be less efficient
than other types. For instance, for a user program to execute an I/O operation, it
executes a system call which is trapped to the I/O layer, which calls the memory-management layer, through to the CPU scheduling layer, and finally to the hardware.
At each layer, the parameters may be modified, data may need to be passed, and so
on. Each layer adds overhead to the system call and the net result is a system call that
takes longer than it would on a nonlayered system.
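The overhead argument above can be made concrete with a small sketch. The Layer class and the build_stack helper below are hypothetical names introduced for illustration; the model simply records every layer a request passes through on its way to the hardware.

```python
class Layer:
    """One layer that forwards each request to the layer below it."""

    def __init__(self, name, below=None):
        self.name = name
        self.below = below

    def call(self, request, trace):
        trace.append(self.name)  # each crossing is one unit of overhead
        if self.below is None:
            return f"done:{request}"
        return self.below.call(request, trace)


def build_stack(names_bottom_to_top):
    """Return the top layer of a stack built from the given names."""
    layer = None
    for name in names_bottom_to_top:
        layer = Layer(name, layer)
    return layer


layered = build_stack(["hardware", "CPU scheduling", "memory management", "I/O"])
flat = build_stack(["hardware"])

layered_trace, flat_trace = [], []
layered.call("read block", layered_trace)
flat.call("read block", flat_trace)
print(len(layered_trace), len(flat_trace))
```

In this toy model the layered request crosses four layers where the nonlayered one crosses a single layer. Real systems pay for each crossing in parameter copying and call overhead, which is the cost the paragraph describes.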
These limitations have caused a small backlash against layering in recent years. Fewer
layers with more functionality are being designed, providing most of the advantages
of modularized code while avoiding the difficult problems of layer definition and
interaction. For instance, OS/2 is a descendant of MS-DOS that adds multitasking and
dual-mode operation, as well as other new features.
Because of this added complexity and the more powerful hardware for which OS/2
was designed, the system was implemented in a more layered fashion. Contrast the
MS-DOS structure to that of OS/2. It should be clear that, from both the system-design and implementation standpoints, OS/2 has the advantage. For instance, direct
user access to low-level facilities is not allowed, providing the operating system with
more control over the hardware and more knowledge of which resources each user
program is using.


As a further example, consider the history of Windows NT. The first release had a
very layer-oriented organization. However, this version suffered low performance
compared to that of Windows 95. Windows NT 4.0 redressed some of these
performance issues by moving layers from user space to kernel space and more
closely integrating them.

Virtual Machine Architecture


Conceptually, a computer system is made up of layers. The hardware is the lowest
level in all such systems. The kernel running at the next level uses the hardware
instructions to create a set of system calls for use by outer layers. The systems
programs above the kernel are therefore able to use either system calls or hardware
instructions, and in some ways these programs do not differentiate between these
two. Thus, although they are accessed differently, they both provide functionality that
the program can use to create even more advanced functions. System programs, in
turn, treat the hardware and the system calls as though they both are at the same
level.
Some systems carry this scheme even a step further by allowing the system programs
to be called easily by the application programs. As before, although the system
programs are at a level higher than that of the other routines, the application
programs may view everything under them in the hierarchy as though the latter were
part of the machine itself. This layered approach is taken to its logical conclusion in
the concept of a virtual machine. The VM operating system for IBM systems is the
best example of the virtual-machine concept, because IBM pioneered the work in this
area.
By using CPU scheduling and virtual-memory techniques, an operating system can
create the illusion of multiple processes, each executing on its own processor with its
own (virtual) memory. Of course, normally, the process has additional features, such
as system calls and a file system, which are not provided by the bare hardware. The
virtual-machine approach, on the other hand, does not provide any additional
function, but rather provides an interface that is identical to the underlying bare
hardware. Each process is provided with a (virtual) copy of the underlying computer.
The resources of the physical computer are shared to create the virtual machines. CPU
scheduling can be used to share the CPU and to create the appearance that users have
their own processor. Spooling and a file system can provide virtual card readers and
virtual line printers. A normal user timesharing terminal provides the function of the
virtual machine operator's console.
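How CPU scheduling creates the appearance of one processor per user can be sketched with a round-robin loop. This is a minimal illustration and not how IBM's VM is implemented: virtual_machine and run_round_robin are hypothetical names, and each "machine" is just a Python generator that yields once per step of work.

```python
from collections import deque


def virtual_machine(name, steps):
    """A stand-in for a virtual machine: yields once per unit of work."""
    for i in range(steps):
        yield f"{name}: step {i}"


def run_round_robin(machines):
    """Give each machine one step in turn until all have finished."""
    ready = deque(machines)
    log = []
    while ready:
        vm = ready.popleft()
        try:
            log.append(next(vm))  # run one time slice on the single real CPU
            ready.append(vm)      # then go to the back of the queue
        except StopIteration:
            pass                  # this machine has finished
    return log


for line in run_round_robin([virtual_machine("A", 2), virtual_machine("B", 1)]):
    print(line)
```

Each generator runs as if it had the processor to itself, while the scheduler interleaves them on one physical CPU.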
A major difficulty with the virtual-machine approach involves disk systems. Suppose
that the physical machine has three disk drives but wants to support seven virtual
machines. Clearly, it cannot allocate a disk drive to each virtual machine. Remember
that the virtual-machine software itself will need substantial disk space to provide
virtual memory and spooling. The solution is to provide virtual disks, which are
identical in all respects except size; these are termed minidisks in IBM's VM operating
system. The system implements each minidisk by allocating as many tracks as the
minidisk needs on the physical disks. Obviously, the sum of the sizes of all minidisks
must be less than the actual amount of physical disk space available.
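The minidisk scheme can be sketched as a simple track allocator. The class name MinidiskAllocator, the track counts and the sizes below are assumptions chosen to mirror the example of three physical drives and seven virtual machines; they are not taken from IBM's VM.

```python
class MinidiskAllocator:
    """Carve virtual disks (minidisks) out of the tracks of physical disks."""

    def __init__(self, tracks_per_disk):
        # Every free track is identified by (physical disk number, track number).
        self._free = [(disk, track)
                      for disk, count in enumerate(tracks_per_disk)
                      for track in range(count)]
        self.minidisks = {}

    def free_tracks(self):
        return len(self._free)

    def allocate(self, name, tracks_needed):
        """Give a minidisk as many tracks as it needs, if they exist."""
        if tracks_needed > len(self._free):
            raise ValueError("not enough physical disk space")
        self.minidisks[name] = self._free[:tracks_needed]
        self._free = self._free[tracks_needed:]
        return self.minidisks[name]


# Three physical drives of 100 tracks each, shared by seven virtual machines.
allocator = MinidiskAllocator([100, 100, 100])
for n in range(7):
    allocator.allocate(f"vm{n}", 40)
print(allocator.free_tracks())
```

Seven minidisks of 40 tracks use 280 of the 300 physical tracks. A further request for 40 tracks is refused, which is the constraint stated above: the minidisks together cannot exceed the physical space.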
Users thus are given their own virtual machine. They can then run any of the
operating systems or software packages that are available on the underlying machine.
For the IBM VM system, a user normally runs CMS, a single-user interactive
operating system. The virtual-machine software is concerned with multiprogramming
multiple virtual machines onto a physical machine, but does not need to consider any
user-support software. This arrangement may provide a useful partitioning of the
problem of designing a multiuser interactive system into two smaller pieces.

Exokernel Architecture
Operating systems define the interface between applications and physical resources.
Unfortunately, this interface can significantly limit the performance and
implementation freedom of applications. Traditionally, operating systems hide
information about machine resources behind high-level abstractions such as
processes, files, address spaces and interprocess communication. These abstractions
define a virtual machine on which applications execute; their implementation cannot
be replaced or modified by untrusted applications.
Hardcoding the implementations of these abstractions is inappropriate for three main
reasons:
• it denies applications the advantages of domain-specific optimizations,
• it discourages changes to the implementations of existing abstractions, and
• it restricts the flexibility of application builders, since new abstractions can only
be added by awkward emulation on top of existing ones (if they can be added at
all).

These problems can be solved through application-level resource management, in
which traditional operating system abstractions, such as virtual memory (VM) and
interprocess communication (IPC), are implemented entirely at application level by
untrusted software. In this architecture, a minimal kernel, called an exokernel,
securely multiplexes available hardware resources. Library operating systems,
working above the exokernel interface, implement higher-level abstractions.
Application writers select libraries or implement their own. New implementations of
library operating systems are incorporated by simply relinking application
executables. Applications can benefit greatly from having more control over how
machine resources are used to implement higher-level abstractions. The high cost of
general-purpose virtual memory primitives reduces the performance of persistent
stores, garbage collectors, and distributed shared memory systems. Application-level
control over file caching can reduce application running time considerably.
Application-specific virtual memory policies can increase application performance.
Inappropriate file-system implementation decisions can have a dramatic impact
on the performance of databases. Exceptions can be made an order of magnitude
faster by deferring signal handling to applications.
To provide applications control over machine resources, an exokernel defines a low-level interface. The exokernel architecture is founded on and motivated by a single,
simple, and old observation that the lower the level of a primitive, the more efficiently
it can be implemented, and the more latitude it grants to implementors of higher-level
abstractions.
To provide an interface that is as low-level as possible (ideally, just the hardware
interface), an exokernel designer has a single overriding goal of separating protection
from management. For instance, an exokernel should protect frame buffers without
understanding windowing systems and disks without understanding file systems.
One approach is to give each application its own virtual machine. Virtual machines
can have severe performance penalties. Therefore, an exokernel uses a different
approach - it exports hardware resources rather than emulating them, which allows
an efficient and simple implementation.


An exokernel employs three techniques to export resources securely:
• by using secure bindings, applications can securely bind to machine resources and handle events
• by using visible resource revocation, applications participate in a resource revocation protocol
• by using an abort protocol, an exokernel can break secure bindings of uncooperative applications by force
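The three techniques listed above can be sketched as follows. This is a toy model under stated assumptions, not the interface of any actual exokernel: the classes Exokernel and App and the callbacks on_revoke and on_abort are hypothetical, and memory pages stand in for hardware resources in general.

```python
class App:
    """An untrusted application (hypothetical), cooperative or not."""

    def __init__(self, name, cooperative=True):
        self.name = name
        self.cooperative = cooperative
        self.lost = []  # pages taken away by force

    def on_revoke(self, page):
        # Visible revocation: the application is asked to give the page back.
        return self.cooperative

    def on_abort(self, page):
        self.lost.append(page)


class Exokernel:
    """Protects pages without knowing what applications use them for."""

    def __init__(self, num_pages):
        self._owner = {page: None for page in range(num_pages)}

    def bind(self, app, page):
        """Secure binding: ownership is recorded once at bind time."""
        if self._owner[page] is not None:
            raise PermissionError(f"page {page} is already bound")
        self._owner[page] = app

    def access(self, app, page):
        """Each access needs only a cheap ownership check."""
        if self._owner[page] is not app:
            raise PermissionError(f"{app.name} does not own page {page}")
        return f"page {page} accessed by {app.name}"

    def revoke(self, page):
        """Ask the owner to release the page; abort the binding if it refuses."""
        app = self._owner[page]
        if app is None:
            return "free"
        self._owner[page] = None
        if app.on_revoke(page):
            return "released"
        app.on_abort(page)  # abort protocol: the binding is broken by force
        return "aborted"


kernel = Exokernel(num_pages=2)
good, bad = App("good"), App("bad", cooperative=False)
kernel.bind(good, 0)
kernel.bind(bad, 1)
print(kernel.revoke(0), kernel.revoke(1))
```

The kernel in this sketch never interprets what is stored in a page. It only tracks who owns it, which is the separation of protection from management described earlier.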


The advantages of exokernel systems include, among others:
• exokernels can be made efficient due to the limited number of simple primitives
they must provide
• low-level secure multiplexing of hardware resources can be provided with low
overhead
• traditional abstractions, such as VM and IPC, can be implemented efficiently
at application level, where they can be easily extended, specialized, or replaced
• applications can create special-purpose implementations of abstractions, tailored
to their functionality and performance needs

Finally, many of the hardware resources in microkernel systems, such as the network,
screen, and disk, are encapsulated in heavyweight servers that cannot be bypassed or
tailored to application-specific needs. These heavyweight servers can be viewed as
fixed kernel subsystems that run in user-space.

Client-Server Architecture
This architecture treats each component of the operating system as a client or a
server. Every system operation is carried out by one component requesting one or
more system services from one or more other components. The servers respond to the requests
and the intended action takes place in this manner.
Among other advantages, this architecture makes the whole system
highly modular and hence easy to maintain and modify.
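A minimal sketch of this request and response pattern is given below. The names ServiceRegistry, memory_server, file_server and program_loader are illustrative assumptions. In a real system the requests would be messages passed between separate processes rather than direct function calls.

```python
class ServiceRegistry:
    """Routes each request to the server registered for that service."""

    def __init__(self):
        self._servers = {}

    def register(self, service, server):
        self._servers[service] = server

    def request(self, service, **params):
        if service not in self._servers:
            raise LookupError(f"no server for {service!r}")
        return self._servers[service](**params)


def memory_server(size):
    """Server component: grants memory (hypothetical behaviour)."""
    return {"granted_bytes": size}


def file_server(path):
    """Server component: opens files (hypothetical behaviour)."""
    return {"path": path, "status": "opened"}


def program_loader(registry, path, size):
    """Client component: needs two services to carry out one operation."""
    memory = registry.request("memory", size=size)
    handle = registry.request("file", path=path)
    return memory["granted_bytes"], handle["status"]


registry = ServiceRegistry()
registry.register("memory", memory_server)
registry.register("file", file_server)
print(program_loader(registry, "demo.exe", 4096))
```

Because the client depends only on service names, a server can be replaced by registering a different one under the same name, which is the modularity benefit mentioned above.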

Interface Design and Implementation


Programmers are the people who produce software for themselves or others. The
software can be broadly categorized as system software or application
software. System software comprises operating systems, compilers, etc., and its
programmers need low-level access to system calls. Application software comprises programs such
as spreadsheets, databases, management information systems, etc. An application
programmer usually works in a high-level language like C, COBOL, etc. or with a
high-level tool such as a database language or a spreadsheet macro.
An operational user is concerned with the operation and management of computer
facilities for other users. This category includes installation managers, mainframe computer
operators and system engineers who are responsible for system efficiency,
software installation, etc. The operational user is also responsible for performing
functions like creating new directories, deleting old files, taking backups, checking
free disk space, checking for viruses on the computer system, etc. Such users are not
concerned with the functioning of application programs or the interpretation of the
content of data files.
An end-user is a person who can perform only those tasks which an application can
achieve. The end-user applies the software to some problem area. These kinds of users
do not know anything about the internal code of programs. They are only concerned
with the data entry programs. This class of users includes the clerical staff. Simplicity
and user-friendliness are two very important features that an end-user program
should support.
Depending on the varied needs of the users, the interface has to comply with all the
user requirements. Therefore, there are various types of interfaces available for
different types of users. The common types of interface are:
1. System calls
2. Command line interface
3. Graphical user interface

We have already covered system calls in previous units. We shall now discuss the
other two, which are terminal I/O mechanisms.

Command Line Interface


This kind of interface allows the user to interact with the operating system
through commands given at the command prompt. Such interfaces are a bit
difficult to use. The exact command and its syntax have to be remembered for
proper execution.
MS-DOS also supports batch files, which are similar to the shell scripts of UNIX. The batch
files of MS-DOS do not offer very many facilities.
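The idea of a command prompt and a batch file can be sketched with a toy interpreter. The commands echo and add and the functions execute and run_batch are assumptions made for this example; they do not reproduce MS-DOS or any UNIX shell.

```python
import shlex


def cmd_echo(args):
    return " ".join(args)


def cmd_add(args):
    return str(sum(int(a) for a in args))


# The interpreter only understands the exact command names listed here.
COMMANDS = {"echo": cmd_echo, "add": cmd_add}


def execute(line):
    """Parse one command line and run the matching command."""
    parts = shlex.split(line)
    if not parts:
        return ""
    name, args = parts[0], parts[1:]
    if name not in COMMANDS:
        return f"Bad command: {name}"
    return COMMANDS[name](args)


def run_batch(lines):
    """A batch file is simply a stored list of command lines run in order."""
    return [execute(line) for line in lines]


print(run_batch(["echo hello world", "add 2 3", "ehco oops"]))
```

The misspelt third command is rejected, which illustrates why the exact command name and syntax must be remembered when using such an interface.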
In a graphical user interface (GUI), the main central area of a window is used to display and interact with the user data, which can
either be textual or graphical in nature. The window will show only a part of the data
that is being processed e.g. a page of a word processed document, etc. You can easily
move around the window with the help of mouse. The general cursor shape of a
mouse is an arrow but the cursor can change shape depending upon its location on
the screen and the particular task in hand. The vertical slider allows you to move up
or down in the window. The horizontal sliders help you in moving towards the right
or the left side of the window. The title bar generally shows the name of the program
that is being used. It also shows the name of the currently open file. The minimise
button is used to reduce the window to a small icon and places it on the taskbar. The
maximise button is used to expand the window so that it occupies the whole screen.
The close button is used for closing the window.
The window can also be re-positioned and re-sized. A window can be moved by
pointing to the title bar, pressing the mouse button, dragging the cursor to a new
position and then releasing the button. You can resize a window by moving the mouse
pointer to the boundary of the window. The pointer will take the shape of a double-headed arrow. As soon as the double-headed arrow appears on the screen, press the
mouse button and drag it in the required direction. This resizes the window.
GUIs use a different naming convention. A file is called a document and a directory
is called a folder. The GUI systems that are most commonly used these days are
Microsoft's Windows and the X Window System used in UNIX environments. Let us
discuss them briefly.

Graphical User Interface (X Windows)


X Windows was developed at the Massachusetts Institute of Technology in 1984. The
researchers there were attempting to support a number of different
graphical workstations which were to be used with a large number of very different
computers and operating systems. X Windows was designed so that
the communication between the displays and the computer would depend on
the simple transmission of character-based messages and not on any complex encoded
protocol. Hence, X Windows could work on any network that was capable of carrying
simple character stream communication. There is a large variety of X-based GUI
systems which are very different from each other. This is because X Windows
does not follow any particular style or format. It only provides a means of producing
GUI systems.
The X system is based on a client-server model. The application programs form the
clients, which require graphical display and input facilities. These facilities are
provided by the servers. The communication between the client and the server
happens through messages which are carried in a standard protocol. The client
and the server may exist as separate processes on one system or they may exist on
separate computers linked over a network. The X system is entirely machine-independent. The client application is not concerned with the internals of the
target display terminal it is using. The application concerns itself with a logical or
virtual terminal. The X system must map the requests made by the client onto
the actual hardware. In X Windows, a single server terminal is capable of handling
many applications at one time.
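The division of labour between client applications and the display server can be sketched as below. The message format (CREATE and TITLE requests written as plain text) and the class DisplayServer are inventions for this illustration only; they are not the actual X protocol, which defines its own request encoding.

```python
class DisplayServer:
    """Owns the display and serves requests from many client applications."""

    def __init__(self):
        self.windows = {}  # (client id, window id) -> window properties

    def handle(self, client_id, message):
        fields = message.split()
        if not fields:
            return "ERROR empty request"
        op, args = fields[0], fields[1:]
        if op == "CREATE" and len(args) == 3:
            window_id, width, height = args
            self.windows[(client_id, window_id)] = {
                "size": (int(width), int(height)),
                "title": "",
            }
            return "OK"
        if op == "TITLE" and len(args) >= 2:
            key = (client_id, args[0])
            if key not in self.windows:
                return "ERROR no such window"
            self.windows[key]["title"] = " ".join(args[1:])
            return "OK"
        return f"ERROR unknown request {op}"


server = DisplayServer()
# Two independent clients share the same display server.
print(server.handle("editor", "CREATE w1 640 480"))
print(server.handle("clock", "CREATE w1 100 100"))
print(server.handle("editor", "TITLE w1 My Document"))
```

Neither client knows anything about the display hardware. Each describes what it wants on a logical terminal and the server maps the request onto the real screen.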


In GUI systems, a window manager takes care of the size, location, movement, etc. of
windows. In X Windows, a window manager functions as an ordinary client
application program.
The facilities provided by the X Windows library (Xlib) are at a low level. Therefore, a
considerable amount of coding is required to produce useful applications. The
applications that are coded at this level do not always provide uniformity in the user
interface. To address these problems, the programmer uses a higher-level
tool called an X Toolkit. An X Toolkit comprises two parts. One is a set of functions
called intrinsics, which reside above the Xlib functions. The second part
is an additional set of tools called widgets. The widgets are provided as
separate products such as OPEN LOOK from Unix International. Widgets include menus,
slide bars, icons, buttons, etc.

Window Systems based on PCs


These days, there are two patterns of interface available in the market. One is
the Windows 3.1 version and the other is the Windows 95/NT version. The two differ
from each other considerably. Windows 3.1 contains the Program Manager and
program group facilities, which are absent in Windows 95. A program group
allows the user to gather different programs together and organise them into logical
sets. The same facility is provided by the Start button in the case of Windows 95. On
clicking the Start button, a pop-up menu containing many options appears on the screen.

Performance Measurement and Monitoring


Operating systems are the main resource managers in a computer system. Therefore,
a poorly designed operating system or a poorly designed application program may
not be able to harness the computing capabilities of a computer system. In other words,
even though you may have very capable computer hardware, the realized
performance in terms of speed of execution may be poor.
Operating systems therefore constantly need to measure their performance and
monitor their operations with a view to improving efficiency and making optimal use of
resources. At the same time they must have mechanisms in place to detect, avoid and
handle exceptional computing conditions should they occur.
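Measurement of this kind can be sketched at application level with Python's standard time.perf_counter function. The Monitor class and its method names are assumptions for the example. An operating system gathers comparable figures internally, for instance CPU time per process.

```python
import time
from statistics import mean


class Monitor:
    """Collects execution times so that slow operations can be identified."""

    def __init__(self):
        self.samples = {}  # label -> list of elapsed times in seconds

    def measure(self, label, func, *args):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        self.samples.setdefault(label, []).append(elapsed)
        return result

    def report(self):
        return {label: {"count": len(times), "mean_seconds": mean(times)}
                for label, times in self.samples.items()}


monitor = Monitor()
for _ in range(3):
    monitor.measure("sum", sum, range(100_000))
print(monitor.report())
```

The measured times vary from run to run and from machine to machine, so the report is useful for comparing operations rather than as an absolute figure.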

Student Activity
1. What are the advantages of GUI over character-based user interface?
2. How is application program interface different from user interface?


Summary
An operating system is the most important program in a computer system. This is one
program that runs all the time, as long as the computer is operational and exits only
when the computer is shut down. Operating systems are the programs that make
computers operational, hence the name. Operating systems are a computer's resource
managers. Job scheduling is the process of sequencing jobs so that they can be executed
on the processor. The main advantage of the layered approach is modularity. The
layers are selected such that each uses functions (operations) and services of only
lower level layers. System calls provide the interface between a process and the
operating system. These calls are generally available as assembly-language
instructions, and are usually listed in the manuals used by assembly-language
programmers.

Keywords
Operating System: A set of complex programs that makes a computer operational by
providing an environment in which users can use the power of the computer.
Batch Processing: A mode of data processing in which a number of tasks are lined up
and the entire task set is submitted to the computer in one go.
Multiprogramming: A mode of operation in which multiple programs share
the resources of the computer, appearing to execute simultaneously.
Time Sharing: A mode of operation in which the CPU is shared between multiple
programs, each getting a share of CPU time in turn.
Parallel System: A system that is capable of executing a number of programs
in parallel.
Multiprocessor System: A computer system that has an array of a number of
processors.
Distributed System: A computer system in which computation tasks are distributed
among several processors.
Real Time System: A special-purpose operating system in which there are rigid time
requirements on the operation of a processor or the flow of data.
Monolithic Architecture: An operating system architecture that lacks any identifiable
structure.
Command Line Interface: A kind of operating system interface that allows the user to
interact with the operating system through commands given at the command prompt.

Review Questions

1. What is the main advantage of layered approach to operating system design?
Explain.
2. What are the requirements for a virtual memory architecture?
3. How does a distributed system enhance resource sharing?
4. Multiprogramming is essentially a sequential execution of programs. Comment.
5. What are the constraints of a real time system?
6. Operating system acts as resource manager. What resources does it manage?
7. Discuss the inconveniences faced by a user interacting with a computer system
without an operating system.
8. What are the benefits of multiprogramming?
9. What are the characteristics of real time operating systems?
10. Application programs interact with operating systems through system calls. Is
there any other method of interaction between the two?

Further Readings
I.A Dhotre, Operating System, Technical Publications Office 2000 and its Applications


Narender, Singh and Naruka, Office 2000 in 7 Days, Jalandhar Publishing House.


Unit 4 Bioinformatics Internet Applications

Unit Structure
• Introduction
• A Protein Sequence with Ends Indicated
• Visualisation of Structure Information
• Statistics
• Sequencing and Assembling Genome using Computer
• Initial Fragment Assembly
• Summary
• Keywords
• Review Questions
• Further Readings

Learning Objectives
At the conclusion of this unit, you will be able to:
• Visualize structural information
• Create sequence database
• Search a database for similar new sequences
• Understand the multiple alignment and database searching

Introduction
The most important use of the Internet in bioinformatics is access to biological databases, which
consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil)
and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or
amino acids represents a particular gene or protein (or section thereof), respectively.
Sequences are represented in shorthand, using single-letter designations. This
decreases the space necessary to store information and increases processing speed for
analysis.
5' ACGAGCAGCTACGCACTACGATCG 3'
3' TGCTCGTCGATGCGTGATGCTAGC 5'
A nucleotide sequence
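The relationship between the two strands shown above can be expressed in a few lines of Python. This is a minimal sketch: the function names complement and reverse_complement are chosen for the example, and only the four DNA letters A, C, G and T are handled.

```python
# Watson-Crick base pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the paired base for every position, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ACGAGCAGCTACGCACTACGATCG"
print(complement(top))          # the lower strand as drawn, read 3' to 5'
print(reverse_complement(top))  # the same strand read 5' to 3'
```

Because one strand determines the other, databases store only a single strand of each sequence.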
N SDFHKJSDHFKDHGLKDSKJG C

A Protein Sequence with Ends Indicated


While most biological databases contain nucleotide and protein sequence information,
there are also databases which include taxonomic information such as the structural
and biochemical characteristics of organisms. The power and ease of using sequence
information has, however, made it the method of choice in modern analysis.


In the last three decades, contributions from the fields of biology and chemistry have
facilitated an increase in the speed of sequencing genes and proteins. The advent of
cloning technology allowed foreign DNA sequences to be easily introduced into
bacteria. In this way, rapid mass production of particular DNA sequences, a necessary
prelude to sequence determination, became possible. Oligonucleotide synthesis
provided researchers with the ability to construct short fragments of DNA with
sequences of their own choosing. These oligonucleotides could then be used in
probing vast libraries of DNA to extract genes containing that sequence.
Alternatively, these DNA fragments could also be used in polymerase chain reactions
to amplify existing DNA sequences or to modify these sequences. With these
techniques in place, progress in biological research increased exponentially.

Figure 4.1: Growth of the GenBank Database


For researchers to benefit from all this information, however, two additional things
were required: (1) ready access to the collected pool of sequence information and (2) a
way to extract from the pool sequences of interest to a given researcher. Simply
collecting, by hand, all necessary sequence information of interest to a given project
from published journal articles quickly became a formidable task. After collection, the
organisation and analysis of this data still remained. It could take weeks to months for
a researcher to search sequences by hand in order to find related genes or proteins.
Computer technology has provided the obvious solution to this problem. Not only
can computers be used to store and organise sequence information into databases, but
they can also be used to analyse sequence data rapidly. The evolution of computing
power and storage capacity has, so far, been able to outpace the increase in sequence
information being created. Theoretical scientists have derived new and sophisticated
algorithms which allow sequences to be readily compared using probability theories.
These comparisons become the basis for determining gene function, developing
phylogenetic relationships and simulating protein models. The physical linking of a
vast array of computers in the 1970's provided a few biologists with ready access to
the expanding pool of sequence information. This web of connections, now known as
the Internet, has evolved and expanded so that nearly everyone has access to this
information and the tools necessary to analyze it.

Visualisation of Structure Information


Database Similarity Searching
This section takes a first look at the problem of identifying those sequences in a
sequence database that are similar to a given sequence. This task arises, e.g., when a
gene has been newly sequenced and one wants to determine whether a related
sequence already exists in a database. Generally, two settings can be distinguished.
The starting point for the search may either be a single sequence with the goal of
identifying its relatives, or a family of sequences with the goal of identifying further
members of that family. Searching a database needs to be both fast and sensitive, but the two
objectives counteract each other. Fast methods have been developed primarily for
searching with a single sequence and this shall be the topic of this section.

When searching a database with a newly determined DNA or amino acid sequence (the
so-called query sequence), the user will typically not know whether an expected
similarity spans the entire query or just part of it. Likewise, the user will not know
whether the match extends along the full length of some database sequence or only part
of it. Therefore, one needs to look for a local alignment between the query and any
sequence in the database. This immediately suggests applying the Smith-Waterman
algorithm to each database sequence. One should take care, though, to apply a fairly
stringent gap penalty so that the algorithm focuses on the regions that really match.
After sorting the resulting scores, the top-scoring database sequences are the
candidates one is interested in.
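The search procedure just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a linear gap penalty, a plain match/mismatch score instead of a substitution matrix, and a made-up three-sequence "database". The function names and all scoring values are hypothetical.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-3):
    """Best local alignment score (linear gap penalty, illustrative scores)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every database sequence and sort by score."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database; seqA contains the query exactly.
database = {"seqA": "TTTACGTACGTTT", "seqB": "GGGGGGGG", "seqC": "CCACGTCC"}
ranking = rank_database("ACGTACG", database)
print(ranking)
```

The running time grows with the product of the query length and the total database length, which is why the heuristic programs discussed next try to avoid filling in the whole matrix.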
Several implementations of this procedure are available, most prominently the
SSEARCH program from the FASTA package. There exist versions of this program that
are tuned for speed, such as the one due to Phil Green, one that runs especially fast on
SUN computers, and one by Geoff Barton. Depending on the implementation, computer
and database size, a search with such a program will take on the order of several
minutes.
The motivation behind the development of other database search programs has been
to emulate the Smith-Waterman algorithm's ability to discern related sequences as
closely as possible while performing the job in much less time. To this end, one usually
assumes that any good alignment one wishes to identify contains, in particular, some
stretch of ungapped similarity. Furthermore, this stretch will tend to contain a certain
number of identically matching residues and not only conservative replacements. Based
on these assumptions, most heuristic programs rely on identifying a well-matching core
and then extending it or combining several such cores. With hindsight, the different
developments in this area can be classified according to a traditional distinction in
computer science: one preprocesses either the query or the text (i.e., the database).
Preprocessing means that the string is represented in a different form that allows
faster answers to particular questions, e.g., whether it contains a certain subword.

Searching of Database for Similar New Sequence


Fasta
The FASTA program sets a size k for k-tuple subwords. The program then looks for
diagonals in the comparison matrix between query and search sequence along which
many k-tuples match. This can be done very quickly based on a preprocessed list of k-tuples contained in the query sequence. The set of k-tuples can be identified with an
array whose length corresponds to the number of possible tuples of size k. This array
is linked to the indices where the particular k-tuples occur in the query sequence.
Note that a matching k-tuple at index i in the query and at index j in the database
sequence can be attributed to a diagonal by subtracting the one index from the other.
Therefore, when inspecting a new sequence for similarity, one walks along this
sequence inspecting each k-tuple. For each of them, one looks up the indices where it
occurs in the query, computes the index-difference to identify the diagonal and
increases a counter for this diagonal. After inspecting the search sequence in this way,
a diagonal with a high count is likely to contain a well-matching region. In terms of
the execution time, this procedure is only linear in the length of the database sequence
and can easily be iterated for a whole database. Of course this rough outline needs to
be adapted to focus on regions on diagonals where the match density is high and link
nearby good diagonals into alignments.
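The diagonal-counting step can be sketched as follows. This is a minimal illustration of the idea described above, not the FASTA program's actual code: it uses a Python dictionary in place of the fixed-size array of all possible k-tuples, and the function names and example sequences are hypothetical.

```python
from collections import defaultdict


def index_query(query, k):
    """Preprocess the query: map each k-tuple to the indices where it occurs."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    return positions


def diagonal_counts(query, target, k=2):
    """Count matching k-tuples on each diagonal (target index minus query index)."""
    positions = index_query(query, k)
    counts = defaultdict(int)
    for j in range(len(target) - k + 1):          # one pass along the database sequence
        for i in positions.get(target[j:j + k], ()):
            counts[j - i] += 1                    # index difference identifies the diagonal
    return dict(counts)


# Hypothetical example: the query occurs in the target shifted by 3 positions.
counts = diagonal_counts("HEAGAWGHEE", "PAWHEAGAWGHEE", k=2)
best_diagonal = max(counts, key=counts.get)
print(best_diagonal, counts[best_diagonal])
```

Only one pass over the database sequence is needed, which reflects the linear running time mentioned in the text.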
Example taken from GCG package.

Description
FASTA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any
group of sequences of the same type (nucleic acid or protein) as the query sequence.
In the first step of this search, the comparison can be viewed as a set of dot plots, with
the query as the vertical sequence and the group of sequences to which the query is
being compared as the different horizontal sequences. This first step finds the
registers of comparison (diagonals) having the largest number of short perfect
matches (words) for each comparison. In the second step, these "best" regions are
rescored using a scoring matrix that allows conservative replacements, ambiguity
symbols, and runs of identities shorter than the size of a word. In the third step, the
program checks to see if some of these initial highest-scoring diagonals can be joined
together. Finally, the search set sequences with the highest scores are aligned to the
query sequence for display.

What is a Word?
A word is any short sequence (n-mer or k-tuple) where you have set n to some small
integer less than or equal to six. The word GGATGG is one of the 4,096 possible
words of length six that can be created from an alphabet consisting of the four letters
G, A, T, and C. The word QL is one of the 400 possible words of length two that you
can make with the 20 letters of the amino acid alphabet.
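The word counts quoted above follow from the rule that an alphabet of a letters gives a^n words of length n. A short check in Python (the helper names are hypothetical):

```python
from itertools import product

DNA = "GATC"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def count_words(alphabet, n):
    """Number of distinct words of length n over the alphabet."""
    return len(alphabet) ** n


def all_words(alphabet, n):
    """Generate every word of length n over the alphabet."""
    return ("".join(letters) for letters in product(alphabet, repeat=n))


print(count_words(DNA, 6))          # words of length six over G, A, T, C
print(count_words(AMINO_ACIDS, 2))  # words of length two over the amino acids
```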
Example: Here is a session using FastA to identify sequences in the PIR protein
sequence database that are similar to a human globin protein sequence:
% fasta
FASTA with what query sequence? ggamma.pep
Removing terminal * from query sequence...
Begin (* 1 *) ?
End (* 147 *) ?
Search for query in what sequence(s) (* PIR:* *) ?
What word size (* 2 *) ?
Don't show scores whose E() value exceeds: (* 10.0 *):
What should I call the output file (* ggamma.fasta *) ?
       1 Sequences        105 aa searched   PIR1:CCHU
     501 Sequences     93,217 aa searched   PIR1:IHQFT

CPU time used:
       Database scan: 0:01:14.3
Post-scan processing: 0:00:06.2
      Total CPU time: 0:01:20.6
         Output File: ggamma.fasta
%

Output
The output from FastA is a list file, and is suitable for input to any GCG program that
allows indirect file specifications.
Here is some of the output file:

!!SEQUENCE_LIST 1.0
(Peptide) FASTA of: ggamma.pep from: 1 to: 147 September 25, 1998 11:18
TRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270 and of: gamma.seq check:
6474 from: 2393 to: 2615 and of: gamma.seq check: 6474 from: 3502 to: 3630 generated
symbols 1 to: 148.
Human fetal beta globins G and A gamma from Shen, Slightom and Smithies, Cell 26;
191-203. . . .
TO: PIR:* Sequences: 109,075 Symbols: 34,814,664 Word Size: 2

Databases Searched
NBRF, Release 57.0, Released on 30Jun1998, Formatted on 18Aug1998
Scoring matrix: GenRunData:Blosum50.Cmp
Variable pamfactor used
Gap creation penalty: 12 Gap extension penalty: 2

Histogram Key
Each histogram symbol represents 179 search set sequences
Each inset symbol represents 17 search set sequences
z-scores computed from opt scores
z-score    obs    exp
           (=)    (*)
< 20       863      0:
  22                0:
  24                0:
  26                2:*
  28        14     25:*
  30        81    149:*
  32       306    577:== *
  34      1045   1564:====== *
  36      2925   3213:=================*
  38      5368   5310:=============================*
  40      7971   7407:============================= ========*===
  42      9957   9054:================================= =*=====
  44     10706   9987:====================================*====
  46     10069  10172:======================== = ===============*
  48      9611   9739:======================================= =*
  50      8595   8887:= ====================================*
  52      7636   7813:====================================== *
  54      6559   6674:=====================================*
  56      5262   5574:============================== *
  58      4590   4576:=========================*
  60      3638   3707:====================*
  62      2916   2972:================*
  64      2320   2364:=============*
  66      1907   1868:==========*
  68      1368   1469:========*
  70      1122   1152:======*
  72       837    900:=====*
  74       631    702:===*
  76       483    546:===*
  78       349    424:==*
  80       299    330:=*
  82       213    252:=*
  84       132    200:=*
  86       112    155:*
  88        87    120:*
  90        74     93:*
  92        47     72:*   :=== *
  94        27     55:*   :== *
  96        29     43:*   :==*
  98        25     33:*   :=*
 100        24     26:*   :=*
 102               20:*   :=*
 104        11     15:*   :*
 106               12:*   :*
 108                9:*   :*
 110                7:*   :*
 112                6:*   :*
 114                4:*   :*
 116                3:*   :*
 118                3:*   :*
>120       829       :*====   :*================= ===============

Joining threshold: 36, opt. threshold: 24, opt. width: 16, reg.-scaled
The best scores are:                        init1 initn   opt    z-sc E(108291)..

PIR1:HGCZG
! hemoglobin gamma-G chain - chimpanzee       971   971   971  1145.0  6.2e-57
PIR1:I37025
! hemoglobin gamma-G chain - gorilla          971   971   971  1145.0  6.2e-57
PIR1:HGHUG
! hemoglobin gamma-G chain - human            971   971   971  1145.0  6.2e-57

End of List

ggamma.pep
PIR1:HGCZG
P1;HGCZG - hemoglobin gamma-G chain - chimpanzee
N;Alternate names: hemoglobin gamma-1 chain
C;Species: Pan troglodytes (chimpanzee)
C;Date: 31-May-1996 #sequence_revision 21-Jan-1997 #text_change 14-Nov-1997
C;Accession: I36939; I61853
R;Slightom, J.L.; Chang, L.Y.; Koop, B.F.; Goodman, M. . . .
SCORES Init1: 971 Initn: 971 Opt: 971 z-score: 1145.0 E(): 6.2e-57
Smith-Waterman score: 971; 100.0% identity in 147 aa overlap
                   10        20        30        40        50        60
ggamma.pep MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
HGCZG      MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
                   10        20        30        40        50        60

                   70        80        90       100       110       120
ggamma.pep VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
HGCZG      VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
                   70        80        90       100       110       120

                  130       140
ggamma.pep KEFTPEVQASWQKMVTGVASALSSRYH
HGCZG      KEFTPEVQASWQKMVTGVASALSSRYH
                  130       140


! Distributed over 1 thread.


! Start time: Fri Sep 25 11:15:55 1998
! Completion time: Fri Sep 25 11:30:38 1998
! CPU time used:
! Database scan: 0:01:14.3
! Post-scan processing: 0:00:06.2
! Total CPU time: 0:01:20.6
! Output File: ggamma.fasta

What is the Output?


The first part of the output file contains a histogram showing the distribution of the
z-scores between the query and search set sequences. The histogram is composed of
bins of size 2 that are labeled according to the higher score for that bin (the leftmost
column of the histogram). For example, the bin labeled 24 stores the number of
sequence pairs that had scores of 23 or 24.
The next two columns of the histogram list the number of z-scores that fell within
each bin. The second column lists the number of z-scores observed in the search and
the third column lists the number of z-scores that were expected.
The body of the histogram displays a graphical representation of the score
distributions. Equal signs (=) indicate the number of scores of that magnitude that
were observed during the search, while asterisks (*) plot the number of scores of that
magnitude that were expected.
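The binning rule can be illustrated with a small sketch. It assumes that a bin of size 2 labeled L collects every z-score greater than L - 2 and up to L, which matches the "23 or 24" example for whole-number scores. The function names and the sample scores are hypothetical.

```python
import math
from collections import Counter


def bin_label(z_score, bin_size=2):
    """Label of the bin holding z_score: the highest score that bin can contain."""
    return math.ceil(z_score / bin_size) * bin_size


def histogram(z_scores, bin_size=2):
    """Count how many z-scores fall into each labeled bin."""
    return Counter(bin_label(z, bin_size) for z in z_scores)


observed = histogram([23, 24, 24, 25, 26, 31.5])  # made-up z-scores
for label in sorted(observed):
    print(label, observed[label], "=" * observed[label])
```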
At the bottom of the histogram is a list of some of the parameters pertaining to the
search. Below the histogram, FastA displays a listing of the best scores. Strand:- after
the sequence name in this list indicates that the match was found between the search
set sequence and the reverse complement of the query sequence.
Following the list of best scores, FastA displays the alignments of the regions of best
overlap between the query and search sequences. /rev following the query sequence
name indicates that the search sequence is aligned with the reverse complement of the
query sequence.
This program displays only the region of overlap between the two aligned sequences
(plus some residues on either side of the region to provide context for the alignment)
unless you use -SHOWall. The display of identities and conservative replacements
between the aligned sequences depends on the value of -MARKx. By default (-MARKx=3), the pipe character (|) is used to denote identities and the colon (:) to
denote conservative replacements.

Blast
The other widely used program to search a database is called BLAST (Basic Local
Alignment Search Tool). BLAST follows a similar scheme in that it relies on a core
similarity, although with less emphasis on the occurrence of exact matches. This
program also aims at identifying core similarities for later extension. The core
similarity is defined by a window with a certain match density on DNA or with an
amino acid similarity score above some threshold for proteins. Independent of the
exact definition of the core similarity, BLAST rests on the precomputation of all
strings which are, in the given sense, similar to any position in the query. The resulting
list may be on the order of a thousand or more words long, each of which, if detected in
a database, gives rise to a core similarity. In BLAST nomenclature, this set of strings is
called the neighborhood of the query. The code to generate this neighborhood is in
fact exceedingly fast.
Given the neighborhood, a so-called finite automaton is used to detect occurrences in
the database of any string from the neighborhood. This automaton is a program,
constructed on the fly and specifically for the particular word neighborhood that has
been computed for a query. Upon reading through a database of sequences, the
automaton is given an additional letter at a time and decides whether the string that
ends in this letter is part of the neighborhood. If so, BLAST attempts to extend the
similarity around the neighborhood and if this is successful, it reports a match.
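The scanning step can be imitated with a much simpler structure than a finite automaton. The sketch below swaps in a Python set lookup over a sliding window, which answers the same question (does the word ending at this letter belong to the neighborhood?) but is not how BLAST itself is implemented. The function name, the example sequence and the neighborhood are hypothetical.

```python
def scan_for_word_hits(database_sequence, neighborhood, word_size=3):
    """Report (start index, word) for every neighborhood word found in the sequence."""
    hits = []
    # Read one additional letter at a time and test the word that ends at it.
    for end in range(word_size, len(database_sequence) + 1):
        word = database_sequence[end - word_size:end]
        if word in neighborhood:
            hits.append((end - word_size, word))
    return hits


hits = scan_for_word_hits("MKPEGLLPQGA", {"PQG", "PEG", "PRG"})
print(hits)
```

Each hit would then be the seed for an extension attempt, as described above.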
As with FASTA, BLAST has also been adapted to connect good diagonals and
report local alignments with gaps. BLAST additionally converts the database file into
its own format to allow for faster reading. This makes it somewhat unwieldy to use in
a local installation unless someone takes care of the installation. FASTA, on the other
hand, is slower but easier to use. There exist excellent web servers that offer these
programs, in particular at the NCBI, where BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
can be used on up-to-date DNA and protein databases.

Figure 4.2: Blast (Schematic)


Progress in computational speed using either specially designed or massively parallel
hardware has led to the availability of extremely fast versions of the Smith-Waterman
algorithm. The EBI, among other institutions, is offering a service where this
algorithm is executed on a massively parallel computer resulting in search times of a
few seconds. Companies like Compugen or Paracel have developed special hardware
to do this job. With the availability of EST sequences, it has become very important to
match DNA sequence with protein sequence in such a way that a possible translation
is maintained throughout the alignment. Both the FASTA and the BLAST packages
contain programs for this and related tasks. When coding DNA is compared to proteins, gaps
are inserted in such a way as to maintain a reading frame. Likewise, a protein
sequence can be searched versus a DNA sequence database, and DNA can be
searched versus DNA, too.

Steps in the Blast Algorithm


1. The query sequence is filtered to remove low-complexity regions.

2. A list of words of length 3 in the query protein sequence is made (length 11-12 for
   DNA sequences).

3. Words are evaluated for matches with any other combination of 3 amino acids,
   using the BLOSUM62 scoring matrix as default. Matches of PQG to PEG would
   score 15, to PRG 14, to PSG 13 and to PQA 12.

4. For DNA words, a match score of +5 and a mismatch score of -4 are used,
   corresponding to the changes expected in sequences separated by a PAM distance
   of 40.

5. A cutoff score T, called the neighborhood word score threshold, is selected to reduce
   the number of matches.

6. The above procedure is repeated for each 3-letter word in the query sequence. For
   a sequence of length 250 amino acids, the total number of words to search for is
   approximately 50 x 250 = 12,500.

7. The words are organized into an efficient search tree for comparing them rapidly to
   the database sequences.

8. Each database sequence is scanned for an exact match to one of the 50 high-scoring
   amino acid words corresponding to the first query sequence position.

9. In Blast2 or gapped Blast, short matched regions called HSPs (high-scoring
   segment pairs) lying on the same diagonal and within a certain distance of each
   other are extended in each direction as long as the score keeps rising.

10. HSPs of score greater than a cutoff score S are kept.

11. In earlier versions of Blast and some of the later ones, the statistical significance of
    each HSP score is determined. If two or more HSP regions are found, thereby
    providing additional evidence that the query and database sequences are related,
    these scores are combined to form a combined score.

12. In Blast 2, a local gapped alignment of the sequences is made and the significance
    of the score is determined.
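Steps 3 and 5 above can be made concrete with a small sketch. It scores candidate words against the query word PQG using only the handful of BLOSUM62 entries needed for the example in step 3; a real program would use the full matrix and would enumerate all 8,000 three-letter words. The threshold value T = 13 and the function names are assumptions chosen for illustration.

```python
# Partial BLOSUM62 table: only the pairs needed for this example (symmetric).
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}


def word_score(query_word, candidate):
    """Sum of substitution scores over aligned positions of two equal-length words."""
    total = 0
    for a, b in zip(query_word, candidate):
        score = BLOSUM62_SUBSET.get((a, b), BLOSUM62_SUBSET.get((b, a)))
        if score is None:
            raise KeyError(f"pair {a}/{b} is not in the illustrative subset")
        total += score
    return total


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words whose score reaches the threshold T."""
    kept = {}
    for word in candidates:
        score = word_score(query_word, word)
        if score >= threshold:
            kept[word] = score
    return kept


candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
neighbors = neighborhood("PQG", candidates, threshold=13)  # T = 13 is hypothetical
print(neighbors)
```

With this threshold PQA (score 12) is dropped, while the four higher-scoring words stay in the neighborhood.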

Blast 2 Statistics
The probability p of observing a score S equal to or greater than x is given by the
equation
$$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$$
where
$$u = \frac{\ln(K m' n')}{\lambda}$$
and where K and $\lambda$ are parameters that are calculated by Blast for the amino acid
substitution scoring, n' is the effective length of the query sequence and m' is the
effective length of the database sequence.
The expect value E, the number of database sequences not related to the query but which by
chance would give a score x with the query sequence, is given by
$$E = 1 - e^{-Dp}$$
where D is calculated as the length of the database divided by m,
and $E \approx Dp$ for small p, as in the Fasta calculation.
Specifically, p is the average probability for a Poisson distribution of scores where
0, 1, 2, 3, ... scores can be found, and $E = 1 - e^{-Dp}$ is the probable number of sequences
giving the score. E is roughly significant at 0.02-0.05.
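The two formulas of this section can be evaluated numerically as a sanity check. The sketch below follows the equations as they are stated here; the parameter values for K, lambda, the effective lengths and D are invented for illustration and are not taken from any real search.

```python
import math


def score_probability(x, K, lam, m_eff, n_eff):
    """p(S >= x) = 1 - exp(-exp(-lam * (x - u))), with u = ln(K * m' * n') / lam."""
    u = math.log(K * m_eff * n_eff) / lam
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (x - u)))


def expect_value(p, D):
    """E = 1 - exp(-D * p), which is close to D * p when p is small."""
    return -math.expm1(-D * p)


# Hypothetical parameters for illustration only.
p = score_probability(60, K=0.13, lam=0.32, m_eff=300, n_eff=250)
E = expect_value(p, D=1000)
print(p, E, 1000 * p)
```

Raising the score x lowers p exponentially, and for small p the printed E and D * p values are nearly the same, as the text states.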

Blast Programs
Program   Query sequence            Database                  Type of alignment
Blastp    protein                   protein                   gapped
Blastn    nucleic acid              nucleic acid              gapped
Blastx    translated nucleic acid   protein                   each frame gapped
Tblastn   protein                   translated nucleic acid   each frame gapped
Tblastx   translated nucleic acid   translated nucleic acid   ungapped

BLAST Output and interpreting the results:


The BLAST programs all provide information in roughly the same format. First comes
(A) an introduction to the program; (B) a series of one-line descriptions of matching
database sequences; (C) the actual sequence alignments; and finally the parameters
and other statistics gathered during the search.

Sample tblastx output is presented below.


TBLASTX 2.1.2 [Oct-19-2000]
Query= t3s23o
(500 letters)
Database: em_pln
98,350 sequences; 292,516,774 total letters
Searching.................................................done
                                                                    Score
Sequences producing significant alignments:                         (bits) Value
ATT5C2 AL138664 Arabidopsis thaliana DNA chromosome 3, BAC clon... 148 3e-35
AC006420 AC006420 Arabidopsis thaliana chromosome II section 36... 143 7e-34
AB025639 AB025639 Arabidopsis thaliana genomic DNA, chromosome ... 143 1e-33
AC002392 AC002392 Arabidopsis thaliana chromosome II section 11... 142 1e-33
AC005315 AC005315 Arabidopsis thaliana chromosome II section 16... 142 1e-33
AC005170 AC005170 Arabidopsis thaliana chromosome II section 13... 142 2e-33
AC005560 AC005560 Arabidopsis thaliana chromosome II section 3 ... 142 2e-33
AB010077 AB010077 Arabidopsis thaliana genomic DNA, chromosome ... 141 3e-33
AB013908 AB013908 Cannabis sativa gene for reverse transcriptas... 140 7e-33
AB010694 AB010694 Arabidopsis thaliana genomic DNA, chromosome ... 139 9e-33
ATT22K7 AL138641 Arabidopsis thaliana DNA chromosome 3, BAC clo... 139 9e-33
AC012463 AC012463 Genomic sequence for Arabidopsis thaliana BAC... 139 1e-32
AF058914 AF058914 Arabidopsis thaliana BAC F21E10, complete seq... 138 3e-32
AC005724 AC005724 Arabidopsis thaliana chromosome II section 10... 136 8e-32
AB028617 AB028617 Arabidopsis thaliana genomic DNA, chromosome ... 136 8e-32
AC005489 AC005489 Genomic sequence for Arabidopsis thaliana BAC... 136 1e-31
AB028607 AB028607 Arabidopsis thaliana genomic DNA, chromosome ... 135 2e-31
AP000599 AP000599 Arabidopsis thaliana genomic DNA, chromosome ... 135 2e-31
ATT6I14 AL391710 Arabidopsis thaliana DNA chromosome 5, BAC clo... 134 3e-31
ATF7K15 AL353871 Arabidopsis thaliana DNA chromosome 3, BAC clo... 134 3e-31
ATT22N19 AL163572 Arabidopsis thaliana DNA chromosome 5, BAC cl... 134 3e-31
AC009606 AC009606 Arabidopsis thaliana chromosome III BAC F22F7... 134 4e-31
AC006067 AC006067 Arabidopsis thaliana chromosome II section 83... 133 6e-31
ATF26K9 AL162651 Arabidopsis thaliana DNA chromosome 3, BAC clo... 133 8e-31
ATAC107 AC000107 Genomic sequence for Arabidopsis thaliana BAC ... 133 1e-30
ATT10O8 AL161746 Arabidopsis thaliana DNA chromosome 5, BAC clo... 133 1e-30
AC005970 AC005970 Arabidopsis thaliana chromosome II section 32... 133 1e-30
AB011485 AB011485 Arabidopsis thaliana genomic DNA, chromosome ... 133 1e-30
AF069298 AF069298 Arabidopsis thaliana BAC T14P8. 131 3e-30
ATAC2330 AC002330 Arabidopsis thaliana BAC T10P11 from chromoso... 131 3e-30
ATCHRIV6 AL161494 Arabidopsis thaliana DNA chromosome 4, contig... 131 3e-30
AC068901 AC068901 Arabidopsis thaliana chromosome 1 BAC F1O3 ge... 131 3e-30
AB018120 AB018120 Arabidopsis thaliana genomic DNA, chromosome ... 131 4e-30
AC007505 AC007505 Arabidopsis thaliana chromosome I BAC F28L22 ... 131 4e-30
AC020646 AC020646 Genomic sequence for Arabidopsis thaliana BAC... 131 4e-30
ATT18B22 AL138652 Arabidopsis thaliana DNA chromosome 3, BAC cl... 130 5e-30
>ATT5C2 AL138664 Arabidopsis thaliana DNA chromosome 3, BAC clone T5C2
Length = 103098

Score = 40.9 bits (83), Expect = 0.006
Identities = 28/68 (41%), Positives = 32/68 (46%)
Frame = +1 / +1

Query: 235 TKATWFLLIIKGMMDLNLAVRILEIIFKNTIQPAMGLYSRILLA*SPLGIKGTTVLLKPL 414
TKA W L + G +L I E+ T+Q MG S I LA LGIK V L P
Sbjct: 8290 TKADWLLAMKLGSNNLRRFAIIFEMTL*ITLQQEMGR*SFISLASCFLGIKAKIVELIPF 8469

Query: 415 GI*PVLKK 438
G P KK
Sbjct: 8470 GKNPFTKK 8493

Score = 54.7 bits (113), Expect = 4e-07
Identities = 36/79 (45%), Positives = 42/79 (52%)
Frame = +1 / -1

Query: 238 KATWFLLIIKGMMDLNLAVRILEIIFKNTIQPAMGLYSRILLA*SPLGIKGTTVLLKPLG 417
KA WF + G + L+ RI +I T+Q MGL S I +A GIK V L PLG
Sbjct: 45126 KADWFGEMNLGRILLSRFARIFDINLY*TLQQEMGL*SFITIASFFFGIKANMVELTPLG 44947

Query: 418 I*PVLKKSLTAAHTSFCII 474
PVLKK TA TS II
Sbjct: 44946 RKPVLKKD*TATTTSL*II 44890

Score = 31.8 bits (63), Expect = 3.2
Identities = 21/56 (37%), Positives = 26/56 (45%)
Frame = +1 / -3

Query: 295 RILEIIFKNTIQPAMGLYSRILLA*SPLGIKGTTVLLKPLGI*PVLKKSLTAAHTS 462
RI EII T+Q +G S I + S GIK V+ L P KK T + S
Sbjct: 84802 RIFEIILYPTLQRLIGRNSVIRVGRSVFGIKHILVVFNRLSKTPSCKKEFTKNNKS 84635

Score = 134 bits (288), Expect = 3e-31
Identities = 58/162 (35%), Positives = 94/162 (57%)
Frame = -1 / -2

Query: 488 KSP*HIMQNDVCAAVRDFFRTGQIPKGFNKTVVPLIPKGDHAKSIREYRPIAGCIVFLKI 309
KS I+ + A++ FF G +PKG N T++ LIPK AK +++YRPI+ C V K+
Sbjct: 8543 KSTCDIIGTEFTIAIQSFFVKGFLPKGINSTILALIPKKQEAKEMKDYRPISCCNVIYKV 8364

Query: 308 ISKILTARLRSIMPFIINKNQVAFVVGQDIHNHFHLAQELIKGYDRKSGTLRCMFQIDLQ 129
ISKI+ RL+ ++P I NQ AFV + + + LA EL+K Y + + + RC ++D+
Sbjct: 8363 ISKIIANRLKLLLPNFIASNQSAFVKDRLLIENLLLATELVKDYHKDTISARCAIKVDIS 8184

Query: 128 KAYDMVHWDALEGIMKEXGFPSLFVSRIMNLVNTVSYTFKIN 3
KA+D V L+ + F +F+ I + T S++ ++N
Sbjct: 8183 KAFDSVQCSILQNTLSAMNFSPIFIHWITLCITTASFSVQVN 8058

Score = 148 bits (317), Expect = 3e-35
Identities = 65/157 (41%), Positives = 97/157 (61%)
Frame = -1 / +2

Query: 473 IMQNDVCAAVRDFFRTGQIPKGFNKTVVPLIPKGDHAKSIREYRPIAGCIVFLKIISKIL 294
I+ DV AV+ FF+TG +PKG N T++ LIPK A +++YRPI+ C V K+ISKIL
Sbjct: 44891 IIHKDVVVAVQSFFKTGFLPKGVNSTILALIPKKKEAMVMKDYRPISCCNVQYKLISKIL 45070

Query: 293 TARLRSIMPFIINKNQVAFVVGQDIHNHFHLAQELIKGYDRKSGTLRCMFQIDLQKAYDM 114
RL+SI+P I+ NQ AF+ + + + LA E+IK Y + S + RC +ID+ KA+D
Sbjct: 45071 ANRLKSILPKFISPNQSAFIKDRLLMENLLLATEVIKDYHKDSVSPRCAMKIDISKAFDS 45250

Query: 113 VHWDALEGIMKEXGFPSLFVSRIMNLVNTVSYTFKIN 3
V W L ++ P +V+ I V T S++ ++N
Sbjct: 45251 VQWFFLLNTLRALDIPEQYVNWIQKCVTTASFSVQVN 45361

Examine the Alignment Scores and Statistics


The raw score "S" of the alignment is usually calculated by summing the scores for
each letter-to-letter and letter-to-null position in the alignment. Scores for each
position of an alignment are derived from a substitution matrix; the most popular of
these are the BLOSUM and PAM matrices. Unlike the raw score, the bit score
accounts for the type of scoring system used, and is therefore more informative. The
bit score is calculated from the raw score by normalizing with the statistical variables
that define a given scoring system. Therefore, bit scores from different alignments,
even those employing different scoring matrices, can be compared. The higher the
score, the better the alignment, but the significance of an alignment cannot be
deduced from the score alone.
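The normalization mentioned above is usually written as S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters of the scoring system. The sketch below applies that formula. The values lambda = 0.318 and K = 0.134 are an assumption (figures commonly quoted for ungapped BLOSUM62 scoring), not values reported in the sample output, so the results should only land close to the bit scores printed there.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw alignment score S to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Assumed parameters for ungapped BLOSUM62 scoring (see lead-in).
LAMBDA, K = 0.318, 0.134

for raw in (317, 113, 83):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Under these assumed parameters, the raw scores 317 and 83 from the sample tblastx output come out near the 148 and 40.9 bits shown next to them.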
A position at which a letter is paired with a null is called a gap. Gap scores are
negative. Since a single mutational event may cause the insertion or deletion of more
than one residue, the presence of a gap is frequently ascribed more significance than
the length of the gap. Hence the gap is penalized heavily, whereas a lesser penalty is
ascribed to each subsequent residue in the gap. There is no widely accepted theory for
selecting gap costs.
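This two-part gap cost can be written as a short function. The sketch assumes the convention "creation penalty plus extension penalty times gap length" and borrows the values 12 and 2 from the FASTA output shown earlier in this unit; other programs charge the first gapped residue differently, so the exact formula here is an assumption.

```python
def affine_gap_penalty(length, creation=12, extension=2):
    """Cost of one gap of the given length: a fixed creation cost plus a per-residue cost."""
    if length <= 0:
        return 0
    return creation + extension * length


one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
print(one_long_gap, five_short_gaps)
```

One gap of five residues costs far less than five separate one-residue gaps, which reflects the idea that a single mutational event may insert or delete several residues at once.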

Statistics
Local alignments with no gaps are referred to as High-scoring Segment Pairs (HSPs). The
number of random HSPs with a score equal to or greater than S is described by a Poisson
distribution, and the probability of finding at least one such HSP by chance is the P
value associated with the score S. Highly significant scores have P values close to zero.
For gapped alignments, the significance of a given alignment with score S is represented
by the E (expect) value, the expected number of chance alignments with a score of S or
better. This can be evaluated by looking at alignment scores generated using mock
databases of random sequence of comparable length and composition. The E value
decreases exponentially as the score S that is assigned to a match between two sequences
increases. The E value reflects the size of the database and the scoring system in use.
At very low E values, the E and P values are nearly identical.

Multiple Alignments and Database Searching


Information about which residues are conserved, and thus important for a particular
family, is crucial not only for the multiple alignment of a set of sequences. This
information is also very valuable for identifying related sequences in a database.
Thus, a multitude of methods have been developed that aim at identifying sequences in
a database which are related to a given family.
Historically, the first such method introduced the profiles described above in the
context of multiple sequence alignment. As in that application, profiles help in
emphasizing conserved regions in a database search. Thus, a sequence that matches
the query profile in a conserved region will receive a higher score than a database
sequence matching only in a divergent part of an alignment. This feature is of
enormous help in distinguishing truly related sequences.
In an algorithmic sense, profile searching simply uses the dynamic programming
alignment algorithm for aligning a sequence to a profile on each sequence in the
database. Of course, this is computationally quite demanding and much slower than
the heuristic database search algorithms like BLAST or FASTA. Typically, the
multiple alignment underlying the profile will describe a conserved domain which
one expects to find within a database sequence. Therefore, in this context it is
important that end gaps should not be penalized. Gap penalties for profile matching
frequently vary along the profile in order to reflect the existence of gaps within the
underlying multiple alignment. Through this mechanism, one attempts to
preferentially include new gaps in regions where gaps have been observed already.
However, different suggestions exist as to the choice and derivation method for these
gap penalties (Bucher, Taylor).
In 1993, Hidden Markov Models were introduced for the purpose of identifying
family members in a database. Hidden Markov Models are a class of mathematical
models well-suited for describing the relevant parameters in matching a given
multiple alignment against a database. For HMMs, there exist automatic learning
algorithms that adapt the parameters of the HMM for best identification of family
members. Thus, they offer, in particular, a solution to the question of gap penalty
settings along a profile. Sequences are matched to HMMs in much the same way as
they are aligned to a profile although the interpretation of the procedure is different.
The HMM is thought of as producing sequences by going through different states.
Aligning a sequence to an HMM amounts to delineating the series of states that is
most likely to have produced the sequence.
Based on this interpretation, one can make an interesting distinction between the
optimal alignment of a sequence with a Hidden Markov Model and the computation
of the probability that a model has produced a particular sequence. The optimal
alignment is computed with an algorithm exactly analogous to the dynamic
programming algorithm, maximising the probability that a series of states has
given rise to a particular sequence. In contrast, in the absence of knowledge of the
correct path of states, the probability that a model has given rise to a particular
sequence should rather be computed as the sum over the different series of states that
could have produced the sequence. This interpretation leads to a summation over all
paths instead of the choice of the best one. Practically, little is known about the
difference in performance between the two approaches.
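The difference between the two computations is a single operation: the best-path calculation takes a maximum over predecessor states, while the total-probability calculation takes a sum. The sketch below shows both on a toy two-state model. It is deliberately simplified (no match, insert and delete states as in a real profile HMM, no end state), and every state name and probability in it is invented for illustration.

```python
def best_path_and_total(sequence, states, start, trans, emit):
    """Return (probability of the single best state path, probability summed over all paths)."""
    best = {s: start[s] * emit[s][sequence[0]] for s in states}
    total = dict(best)
    for symbol in sequence[1:]:
        # Best path: keep only the most probable way of reaching each state.
        best = {s: max(best[r] * trans[r][s] for r in states) * emit[s][symbol]
                for s in states}
        # All paths: add up every way of reaching each state.
        total = {s: sum(total[r] * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return max(best.values()), sum(total.values())


# Hypothetical toy model over a two-letter alphabet.
states = ("conserved", "variable")
start = {"conserved": 0.5, "variable": 0.5}
trans = {"conserved": {"conserved": 0.8, "variable": 0.2},
         "variable": {"conserved": 0.3, "variable": 0.7}}
emit = {"conserved": {"A": 0.9, "C": 0.1},
        "variable": {"A": 0.5, "C": 0.5}}

best_path_prob, total_prob = best_path_and_total("AACA", states, start, trans, emit)
print(best_path_prob, total_prob)
```

The summed probability can never be smaller than the best-path probability, since the best path is one of the terms in the sum.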

Sequencing and Assembling Genome using Computer
Dideoxy chain-termination sequencing depends on synthetic DNA primer sequences
to initiate the reaction. These primers must match a portion of the template whose
sequence we are trying to determine. This gives us a 'chicken and egg' problem of
needing to know a bit of the template sequence before we can read more of it.
One way to start sequencing an unknown sequence is to make a recombinant clone,
putting the unknown insert into a vector of known sequence. Then primers from the
vector can be used to begin reading the sequence of the insert. Once a portion of the
new insert sequence is known, we can use that to design a new primer to let us read
further. This process can be repeated until the whole insert is sequenced. This 'primer
walking' process is inherently sequential, since each step must be completed before
the results can be used to design the primer for the next step. Shotgun sequencing is
an approach that lets us run large numbers of reactions in parallel, rather than in
series. Rather than using primer walking through one large insert, we randomly
fragment the insert to create a library of smaller fragments. A large number of these
clones are chosen at random, and sequenced in parallel using primers matching the
vector. The sequencing results are then 'assembled' on the computer into a contiguous
sequence of overlapping fragments. This approach essentially trades much of the
laborious laboratory work for a puzzle to be solved on the computer, and turns out to
be much faster than pure primer walking.
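The "puzzle to be solved on the computer" can be illustrated with a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-prefix overlap. This is a simplified sketch, not the method used by the Staden package or any production assembler: it assumes error-free reads, all from the same strand, and no repeated regions. The function names and the example sequence are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # remaining contigs do not overlap
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical insert and three overlapping random fragments, given out of order.
insert = "ATGGCGTACGTTAGCCTAGGA"
reads = [insert[10:21], insert[0:9], insert[5:14]]
print(greedy_assemble(reads))
```

If a region is not covered by any read, the loop stops with more than one contig; this is the situation in which custom primers are designed to walk across the gap.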
Technologies: This exercise uses the Staden sequence data management software for
assembly of reads into contigs. Using the sequence simulation module, students must
design custom sequencing primers for primer walking through regions not
adequately covered by random clones, and to resolve ambiguities.
In dideoxy chain-termination sequencing, a synthetic DNA primer is used to start the
process of copying a template sequence into a set of labelled products. Each of these
products ends with a particular base, depending on the sequence of the template, and
has a particular size, depending on the position of the base in the template. Ordering
these products by size (using gel electrophoresis) lets us determine the order of bases
in the template. Unfortunately, we often need to determine DNA sequences that are
longer than we can read in a single sequencing reaction. This means we need to collect
and organize the results of many reactions into larger contiguous sequences in a
process called sequence assembly. Note that the Sanger method is a type of primer
extension reaction. Other important types of primer extension reactions include
cDNA synthesis, labelling of hybridization probes, and the Polymerase Chain
Reaction (PCR).
Old-fashioned water pumps needed to be wet to make a good seal. Wetting the pump was
called 'priming the pump'; you always had to be sure you kept a bit of
water available to prime the pump, so you could pump more water later.
Oligonucleotide primers are conceptually similar in that you need to know some of
the target sequence so that you can make a primer to read more of the sequence.
Say you have a cloned template of 10,000 base pairs that you need to sequence. If you
can read about 1,000 bp in a single sequencing reaction, you will need to perform
about 10 reactions in series to read the whole thing. The first reaction can use a primer
site on the cloning vector. After you get the results from that reaction, you can use
them to design a new primer to read 1,000 bases further. If it takes one day to
run a sequencing reaction, study the results, and design a primer, and another two
days to have the new primer made, each step takes three days, so it will take you
about a month to sequence the whole 10,000 bases.
Now, instead of just one universal primer to read in from one end, consider how
much you could speed up the process if you read in from both ends at the same time.
With automated machines, it probably doesn't take much longer to do two reactions
than to do one, since we can do them in parallel at the same time. So reading in from
both ends, we could sequence the insert in about 15 days.
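The arithmetic behind these estimates can be written out as a short sketch. The figures (10,000 bp insert, 1,000 bp per read, one day per reaction plus two days for primer synthesis) come from the text; the function name `sequencing_days` is hypothetical, and the model simply charges three days for every round, as the text does.

```python
import math


def sequencing_days(insert_bp, read_bp, days_per_round, reads_per_round=1):
    """Days needed when each round must finish before the next can start."""
    reads_needed = math.ceil(insert_bp / read_bp)
    rounds = math.ceil(reads_needed / reads_per_round)
    return rounds * days_per_round


DAYS_PER_ROUND = 1 + 2  # one day to run and analyse, two days to make the primer

one_end = sequencing_days(10_000, 1_000, DAYS_PER_ROUND)
both_ends = sequencing_days(10_000, 1_000, DAYS_PER_ROUND, reads_per_round=2)
print(one_end, both_ends)  # 30 15
```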
It might occur to you that the process would be even faster if we could break up the
template into smaller fragments, and sequence each of them from both ends. For
example, if we could break it into five pieces of 2,000 bases, and read 1,000 bases in
from each end, we could obtain all of the sequence in one set of reactions, using only
universal primers. This would take one day in our scenario, assuming we can run 10
reactions at once. The trouble is, we don't have a good way to break the template up
into five non-overlapping pieces of 2,000 base pairs. Maybe if we had a good
restriction map, we could invest some time in cloning smaller pieces, but that
would take a considerable amount of work. The good news is that the speedup from
sequencing many small clones in parallel using universal primers is so great that it
can be well worthwhile to do so, even if we have to use clones representing
overlapping pieces, and we end up sequencing some parts multiple times. In fact, we
do have good methods of generating random fragments from a large piece of DNA.
It also turns out that sequencing the same part several times from different clones, at
different distances from the primer and on both strands, can let us determine the
sequence more accurately. As with most experimental results, the interpretation of
sequencing reactions often leaves us with some ambiguity. For example, sometimes
peaks blur together on an electropherogram, and though we can tell there are several
'A's in a row, it can be difficult to tell whether there are five or six. Ideally, peaks would
be spaced evenly, but in fact, peak separations can vary a bit due to the way the
strands of DNA fold during electrophoresis. Since peaks tend to get shorter and
broader farther from the primer, the signal to noise ratio drops, until eventually we
can't read with confidence. Since peak migration is influenced by folding of DNA
strands, artifacts due to strand folding are likely to be different on the two strands.
This means that if sequence from both strands agrees, we can be fairly confident that
it is correct.
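Checking whether reads from the two strands agree can be sketched as follows. This is only an illustration that assumes the two reads cover exactly the same region and have the same length; the function names and the example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def strand_disagreements(forward_read, reverse_read):
    """Positions (on the forward strand) where the two strands disagree."""
    flipped = reverse_complement(reverse_read)
    if len(flipped) != len(forward_read):
        raise ValueError("this sketch expects reads covering the same region")
    return [i for i, (a, b) in enumerate(zip(forward_read.upper(), flipped))
            if a != b]


print(strand_disagreements("GATTACA", "TGTAATC"))  # [] : both strands agree
print(strand_disagreements("GATTACA", "TGTAGTC"))  # [2] : one base to re-check
```

An empty list corresponds to the case described above, where agreement between the strands gives us confidence in the sequence; any reported position is a place to go back and inspect the traces.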

Initial Fragment Assembly


The first part of the shotgun sequencing experiment has been done for you. Twenty-five
clones were selected from a random insert library and sequenced on both ends
with universal primers. The 25 sequences selected for study are stored in a zip
file (shotgun.zip). Follow the steps below to work with them.
1. Unzip the shotgun.zip file. In the new "shotgun" directory, double click the file
named "config.pg4". This will launch the Pregap4 program.
2. Click the "Files to Process" tab.
3. Click the "Add files" button. Find the shotgun directory. Change "Files of type..."
to "SCF". Select all 25 SCF files (Control-A), and click "Open".
4. Click the "Configure Modules" tab.
5. Check to see that the following modules are selected (you can read the help
documentation to see what these modules do):
6. Check that the following values are set:
7. Click "Run". When it finishes, close Pregap4. Look in the "shotgun" directory again.
You should find a new file named "HIV.0.aux": double click this file to launch Gap4.
8. Two windows will open, the main Gap4 window and the "Contig Selector".
In the main Gap4 window, select "View: Template Display". This opens a "Show
templates" dialog box. Be sure the "all contigs" radio button is selected and the
"Templates" and "Readings" checkboxes are checked. Click "OK" (see Figure 4.3:
"original assembly").
The graphical template display shows how the 25 sequencing "reads" have been
assembled into 7 contiguous blocks ("contigs"). The "reads" are arrows, and the
lines between them are templates. Note that there are two reads per template.
Each template was sequenced on each end using universal primers that read into
the insert from the plasmid cloning vector. Two ends read from the same
template are called a "read pair".
All of our templates are about 1200 bases long, plus or minus about 200 bases, and
the read pairs both read in from the ends of the template. We can rearrange the
contigs so that the display of read pairs is consistent with these facts.
Note that the two contigs at the right end of Figure 4.3 are pointing out from their
templates, rather than in. Right-click on the contig lines at the bottom of the
Templates Display window to bring up a context menu that lets you
"Complement" the contig. Figure 4.4 ("complemented contigs") shows what they
should look like when you're done.
Notice the templates drawn in yellow. Each has one read in one contig, and the
other read in a different contig. Later in the exercise we will sequence the middle
parts of these templates, which will let us join the contigs that these templates
span. But first, note that some of the contigs do not have templates that would let
us connect them. We must go back to the clone library and sequence more clones.
Save your Gap4 database with a new version number (version 1) by choosing
"File: Copy database" from the main window's top menu and entering "1" in the
box marked "New version character". Exit Gap4.

9. Sequencing Additional Clones
Open the Cybertory sequence simulator in a separate web page.
The sequence fragments you have already assembled are from clones 1 through
25. We must have sequence from at least five more clones, so teams will be
assigned clones to sequence, starting with "HIVsubclone026". Each clone should
be sequenced using both "forward" and "reverse" primers. Copy and paste the
appropriate primer sequence into the "Primer sequence" box. Be sure to
appropriately select "forward" or "reverse" from the "primer strand" pull down
list. Since these are universal primers, select "vector (step 1)" from the "priming
site" pull-down list.
The default reaction conditions should work for the universal primers. Be sure to
enter the names of everyone on your team into the "user name" box. This will help
us to troubleshoot your reactions if they do not work as expected. After you click
"run sequencing reaction", it will take six or seven seconds for your virtual
sequence trace to be created. Note the name of the result file: it will be something
like "HIVsubclone026-p1t.scf". Be sure you have the correct subclone. Note the
letter following the dash: it will be "p" if you told the program this primer was on
the forward strand, and "q" for the reverse. The number "1" indicates that you said
you were using a universal primer. Be sure these items are correct, and save the
trace file.
Once all groups have sequenced their assigned clones, we will collect them and
give everyone a copy of all the traces to use in the next round of assembly.
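The naming convention for result files described in this step can be checked with a small script before the traces are collected. The sketch below is only an illustration: the function name `parse_trace_name` is hypothetical, and the pattern is inferred from the single example name given in the text ("HIVsubclone026-p1t.scf"). The meaning of the final letter is not explained in the text, so it is kept as an uninterpreted suffix.

```python
import re

# clone name, dash, strand letter (p or q), primer number, optional suffix, ".scf"
TRACE_NAME = re.compile(
    r"^(?P<clone>[A-Za-z0-9]+)-(?P<strand>[pq])(?P<primer>\d+)"
    r"(?P<suffix>[A-Za-z]*)\.scf$"
)


def parse_trace_name(filename):
    """Split a simulator trace file name into its labelled parts."""
    match = TRACE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected trace file name: {filename!r}")
    return {
        "clone": match["clone"],
        "strand": "forward" if match["strand"] == "p" else "reverse",
        "universal_primer": match["primer"] == "1",
        "suffix": match["suffix"],  # not interpreted in this sketch
    }


print(parse_trace_name("HIVsubclone026-p1t.scf"))
```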
10. Adding New Traces to the Assembly
Copy the new sequence traces into your project folder. Open Pregap4 again by
double clicking the "config.pg4" file. Select the "Files to process" tab, and click the
"Add files" button. Select the new SCF files, starting with "HIVsubclone026-p1t.scf".
Be sure to set "Files of type" to "SCF"!
On the "Configure modules" tab, click on the "Gap4 shotgun assembly" module.
Enter "1" in the "Gap4 database version" field, and select the "Append to existing
database" radio button. Click the "Run" button in the lower left corner of the
window. When the program reports "processing finished", close Pregap4.
Now open Gap4 again, but this time by double-clicking the file "HIV.1.aux". This
is the new version of the database where Pregap4 put the latest trace data. Open
the Templates Display window ("View: Template Display", "OK"). It should
resemble Figure 4.5.
Note that several of the templates do not seem to be displayed correctly. All the
templates in our subclone library have inserts of roughly the same size (1200 +/- 200 bp).
Since each should have been sequenced from both ends using the
forward and reverse universal primers, there should be an arrow representing the
sequencing read from each end pointing in toward the middle of the insert.
Because some of our templates are drawn much too long, and not all of the
arrows point in from the ends, we need to rearrange the contigs so they are
consistent with what we know about our templates and sequence reads.
As we saw earlier, clicking on a contig with the right mouse button brings up a
context menu that lets you complement the contig. This will change the direction
of all the read arrows in that contig. Click on a contig using the middle mouse
button to drag it left or right to a new position (if your mouse has a wheel, it will
probably work as a middle mouse button, too). You may have to click a few times
in slightly different spots to grab the contig line successfully.
Use these operations to rearrange the contigs until all templates are drawn about
the right length, with one read coming in from each end, as in Figure 4.6.
Note that the templates drawn in grey or dark grey all cross boundaries between
contigs. The next part of the exercise will be to use custom primers to sequence
the middle parts of some of these templates, to see if we can obtain enough
sequence to join some of our contigs together. For example, the contigs named
"HIVsubclone010-q1t" and "HIVsubclone021-p1t" would presumably be joined if
we had better sequence from the middle part of clone 9, 19, or 21. (Point the
mouse at a contig, template, or read to see its name. Each contig is named after
the leftmost read that it contains.)
11. Sequencing Clones that Connect Contigs
At this point, the sequence is assembled into 7 contigs, which means that there are
6 boundaries between contigs. Each group of students will be assigned one of
these boundaries, and will do additional sequencing reactions to try to get enough
information to join the contigs.

1. Divide clones among groups of students

Group    Clones
1        9, 19 or 21
2        29
3        30 or 29
4        18, 24 or 30
5        26 or 28
6        15, 16, 27

2. Primer walking to join contigs
(i) Design custom sequencing primers to sequence the regions of these templates
that span across contigs.
(ii) Use them in simulated sequencing reactions. Check your traces in the trace
viewer (Trev) to be sure they worked (if not, you may need to check your
primer design or adjust your reaction conditions).
(iii) Submit your reads to the web site (the instructor will demonstrate).
(iv) Once all teams have submitted their results, each group should download the
trace files and add them to their own assembly.
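The rule used in steps 8 and 10 to judge whether a template is drawn correctly (an insert of about 1200 +/- 200 bases, with one read pointing in from each end) can be expressed as a small check. This is a sketch under stated assumptions, not part of Gap4: the `Read` class, its fields, and the coordinates in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Read:
    start: int     # leftmost position of the read in the display
    end: int       # rightmost position of the read (exclusive)
    forward: bool  # True if the read arrow points to the right


def read_pair_is_consistent(a, b, expected=1200, tolerance=200):
    """True if the two reads point inward and span a plausible template."""
    left, right = sorted((a, b), key=lambda read: read.start)
    points_inward = left.forward and not right.forward
    template_length = right.end - left.start
    return points_inward and abs(template_length - expected) <= tolerance


good = read_pair_is_consistent(Read(0, 500, True), Read(700, 1200, False))
outward = read_pair_is_consistent(Read(0, 500, False), Read(700, 1200, True))
too_long = read_pair_is_consistent(Read(0, 500, True), Read(4500, 5000, False))
print(good, outward, too_long)  # True False False
```

A pair that points outward suggests that one contig needs to be complemented, and a template that is drawn much too long suggests that the contigs holding its two reads are in the wrong relative positions.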

Finishing and Editing the Sequence


At this point, all the templates should be joined into a single contig. On the Template
Display window, choose "View: Quality Plot", then select the contig. A color-coded
display will be drawn below the contig line; see the online help for an explanation of
the colors. Blocks of green and blue represent areas that have been sequenced on only
one strand. If time permits, we may run additional sequencing reactions to sequence
the other strand in these regions.
Editing is the process of resolving inconsistencies among different reads, usually by
going back to the traces and deciding what sequence is most consistent with the
experimental results. Gap4 is extremely helpful in this process. Click on a problem
area in the Quality Plot, and the corresponding sequences will be opened in the
Contig Editor. Click on the consensus sequence in the contig editor and the original
traces will be displayed together so you can compare them and decide which
sequence to believe. Edit the sequences in the Contig Editor to record your choices.
Given time, you should be able to resolve all of the ambiguities to determine a
high-quality sequence for the entire target gene.
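The idea of finding problem areas by comparing overlapping reads can be sketched with a simple majority vote. Gap4 itself works from the traces and their quality values; the sketch below only assumes reads that have already been aligned and padded with "-" where a read gives no information, and the function name `consensus` and the example reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads, pad="-"):
    """Majority-vote consensus plus the columns where reads disagree."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    sequence, problems = [], []
    for position, column in enumerate(zip(*aligned_reads)):
        counts = Counter(base for base in column if base != pad)
        ranked = counts.most_common()
        if not ranked:                       # no read covers this column
            sequence.append("N")
            problems.append(position)
            continue
        if len(ranked) > 1:                  # reads disagree here
            problems.append(position)
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        sequence.append("N" if tie else ranked[0][0])
    return "".join(sequence), problems


reads = ["ACGTAC--",
         "-CGTTCGA",
         "--GTACGA"]
print(consensus(reads))  # ('ACGTACGA', [4])
```

The reported positions play the role of the problem areas in the Quality Plot: they are the places where an editor would go back to the traces before deciding which base to believe.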

Figure 4.3: Original Assembly of Sequencing Reads into Contigs by Gap4

Figure 4.4: The two Rightmost Contigs have been Complemented so that the Read
Pairs Point toward each Other (into the Template)

Figure 4.5: New Traces Added to the Assembly

Figure 4.6: Contigs have been Rearranged so Templates are within the Expected
Size and have Reads Coming in from both Ends

Although shotgun sequencing was the most highly developed technique for
sequencing genomes from about 1995 to 2005, other technologies have since emerged,
collectively called next-generation sequencing. These technologies produce shorter reads
(anywhere from 25 to 500 bp) but many hundreds of thousands or millions of reads in a
relatively short time (on the order of a day). This results in high coverage, but the
assembly process is much more computationally costly. These technologies outperform
shotgun sequencing in the volume of data they produce and in the comparatively short
time it takes to sequence a whole genome. The major disadvantage is that the accuracy
of individual reads is usually lower (although this is compensated for by the high coverage).
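Coverage has a simple definition: the total number of bases read divided by the length of the genome. The sketch below works through one case with made-up numbers (one million reads of 100 bp from a 5 Mbp genome). The estimate of the uncovered fraction, exp(-coverage), is the standard approximation for reads placed uniformly at random and is an idealisation, not a property of any particular instrument.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of times each base of the genome is read."""
    return num_reads * read_length / genome_length


def expected_uncovered_fraction(depth):
    """Approximate fraction of bases covered by no read at all,
    assuming reads start at uniformly random positions."""
    return math.exp(-depth)


depth = coverage(num_reads=1_000_000, read_length=100, genome_length=5_000_000)
print(depth)                               # 20.0
print(expected_uncovered_fraction(depth))  # a very small number
print(expected_uncovered_fraction(1.0))    # about 0.37 at 1x coverage
```

With each base read about twenty times, an occasional error in one read is outvoted by the other reads covering the same position, which is how high coverage compensates for lower accuracy.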

Some Other Methods of Genome Sequencing and Assembly
Eulerian path: Eulerian path approaches are based on early attempts to sequence
genomes through a technique called sequencing by hybridization. In this technique,
instead of generating a set of reads, scientists identified all strings of length k (k-mers)
contained in the original genome. While this experimental method did not produce a
viable alternative to Sanger sequencing, it led to the development of an elegant
approach to sequence assembly. This approach, also based on a graph-theoretic
model, breaks up each read into a collection of overlapping k-mers. Each k-mer is
represented in a graph as an edge connecting two nodes corresponding to its k-1 bp
prefix and suffix respectively. It is easy to see that, in the graph containing the
information obtained from all the reads, a solution to the assembly problem corresponds
to a path in the graph that uses every edge exactly once, that is, an Eulerian path.
One advantage of the
Eulerian approach is that repeats are immediately recognizable while in an overlap
graph they are more difficult to identify.
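The construction described above can be sketched in a few lines. This is a minimal illustration, assuming error-free reads from one strand and a sequence in which no (k-1)-mer is repeated, so that the Eulerian path is unique; each distinct k-mer is used once. The function names and the example reads are hypothetical, and the path is found with the standard Hierholzer procedure.

```python
from collections import defaultdict


def kmer_edges(reads, k):
    """One edge per distinct k-mer: (k-1)-base prefix -> (k-1)-base suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)


def eulerian_path(edges):
    """Return a list of nodes that walks every edge exactly once."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for source, target in edges:
        graph[source].append(target)
        balance[source] += 1
        balance[target] -= 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("the graph has no Eulerian path")
    return path


def spell(path):
    """Turn a walk over (k-1)-mers back into a DNA sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGTAC", "CGTACGTTA", "CGTTAGC"]
print(spell(eulerian_path(kmer_edges(reads, k=5))))  # ATGGCGTACGTTAGC
```

When a (k-1)-mer occurs more than once in the genome, its node has several incoming and outgoing edges and more than one Eulerian path may exist, which is how repeats show up directly in the graph.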
Align-layout-consensus: As more and more genomes become available in public
databases, it is increasingly the case that a completed genome exists that is closely
related to the genome being assembled. The assembly problem thus becomes easier as
the relative placement of reads can be inferred from their alignment to the related
genome (or reference), in a process called comparative assembly. Thus, the overlap
stage of assembly (often one of the most computationally intensive assembly tasks) is
replaced by an alignment step. The layout stage is also greatly simplified due to the
additional constraints provided by the alignment to the reference.
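A toy version of comparative assembly is sketched below: each read is placed at the position on the reference where it has the fewest mismatches, and the sorted placements give the layout directly. Real tools use proper alignment with gaps and indexing; this sketch slides each read along the reference and counts mismatches only. The function names, the `max_mismatches` threshold, and the sequences are hypothetical.

```python
def best_placement(reference, read, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped match, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def comparative_layout(reference, reads, max_mismatches=2):
    """Order reads by where they align on the related (reference) genome."""
    layout = []
    for read in reads:
        hit = best_placement(reference, read, max_mismatches)
        if hit is not None:
            layout.append((hit[0], read))
    return sorted(layout)


reference = "ATGGCGTACGTTAGC"
reads = ["CGTTAGC", "ATGGCGTAC", "CGTACGATA"]  # last read differs at one base
print(comparative_layout(reference, reads))
# [(0, 'ATGGCGTAC'), (4, 'CGTACGATA'), (8, 'CGTTAGC')]
```

No read is ever compared with another read here, which is the sense in which the overlap stage is replaced by an alignment step.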
BAC-by-BAC (hierarchical) sequencing: In order to avoid some of the complexity
involved in assembling large genomes, scientists developed a hierarchical approach.

First, the genome is broken up into a collection of large fragments (between 40 and
200 kbp), which are cloned into Bacterial Artificial Chromosomes (BACs). The location
of each BAC along the genome is then mapped using specialized laboratory experiments. A minimal
tiling path of BACs is chosen such that each base in the genome is covered by at least
one BAC, and the overlap between BACs is minimized. Each BAC is then sequenced
through the standard shotgun method, the resulting assemblies being combined into
an assembly for each chromosome using the information provided by the tiling paths.

Figure 4.7: BAC-by-BAC Approach (The Long Lines Represent Individual BACs)
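Choosing a tiling path from mapped BACs can be sketched as an interval-covering problem. The greedy rule below (among the BACs that start within the region already covered, take the one that reaches furthest) yields the smallest number of BACs, which serves here as a stand-in for minimizing overlap. The BAC names, coordinates (in kbp), and the function name are hypothetical.

```python
def minimal_tiling_path(bacs, genome_length):
    """bacs maps a BAC name to its (start, end) position on the genome."""
    by_start = sorted(bacs.items(), key=lambda item: item[1][0])
    path, covered, i = [], 0, 0
    while covered < genome_length:
        best = None
        # Consider every BAC that begins inside the region covered so far.
        while i < len(by_start) and by_start[i][1][0] <= covered:
            name, (start, end) = by_start[i]
            if best is None or end > best[1]:
                best = (name, end)
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"no BAC covers position {covered}")
        path.append(best[0])
        covered = best[1]
    return path


bacs = {
    "BAC1": (0, 150),
    "BAC2": (100, 250),
    "BAC3": (140, 300),
    "BAC4": (280, 450),
    "BAC5": (200, 320),
}
print(minimal_tiling_path(bacs, genome_length=450))  # ['BAC1', 'BAC3', 'BAC4']
```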

Student Activity
1. Briefly describe how shotgun sequencing experiments and data analysis
are used to produce one continuous DNA sequence.
2. Write down the different techniques for DNA sequencing.

Summary
The motivation behind the development of other database search programs has been
to emulate the Smith-Waterman algorithm's ability to discern related sequences as
closely as possible while at the same time performing the job in much less time. The
other widely used program to search a database is called BLAST (Basic Local
Alignment Search Tool). This program also aims at identifying core similarities for
later extension. The core similarity is defined by a window with a certain match
density on DNA or with an amino acid similarity score above some threshold for
proteins. Upon reading through a database of sequences, the automaton is given an
additional letter at a time and decides whether the string that ends in this letter is part
of the neighborhood. Dideoxy chain-termination sequencing depends on synthetic
DNA primer sequences to initiate the reaction. These primers must match a portion of
the template whose sequence we are trying to determine. Although shotgun
sequencing was the most highly developed technique for sequencing genomes from
about 1995 to 2005, other technologies have surfaced, called next-generation sequencing.

Keywords
Modularity: Ensures that, for the particular task at hand, the data will be collected
and stored in an appropriate manner which differs greatly from one level of activity
(simply gathering the raw data) to another (storing analyzed data) and from one type
of high-throughput system to another. ... The best system is one that employs
integration at those levels where it is an advantage but maintains enough modularity
to ensure that (1) there are no major compromises regarding how any one type of data
is handled and (2) all the key elements in a researcher's information system can be
adjusted or updated independently.
Non-redundant Databases: Researchers at the National Center for Biotechnology
Information (NCBI) coined the term nr database (nonredundant database) to refer
to a database in which the obviously redundant entries have been merged. These
entries are typically those that are 100%, character-by-character identical, and
algorithms exist that can remove such redundancy. Although such a
database has less redundancy than a primary database, a substantial amount of
redundancy remains, and it can be removed only by a curator using scientific
judgment.

Codon: The set of three nucleotides along a strand of mRNA that determines (or codes for)
the amino acid placement during protein synthesis. There are 64 (4 x 4 x 4) possible
arrangements of these three nucleotides (triplet codes) available for protein synthesis.
BEAUTY (BLAST Enhanced Alignment Utility): A tool developed at Baylor College of
Medicine (Worley et al. 1995) which uses BLAST to search several custom databases
and incorporates sequence family information, location of conserved domains, and
information about any annotated sites or domains directly into the BLAST query
results.
BLAST: Basic Local Alignment Search Tool. A program for searching biosequence
databases which was developed and is maintained by a group at the National Center
for Biotechnology Information (NCBI). There are several versions of BLAST: BLASTP,
which searches a protein database; BLASTN, which searches a nucleotide database;
TBLASTN, which searches for a protein sequence in a nucleotide database by
translating the nucleotide sequences in all 6 reading frames; BLASTX, which searches
a protein database with a nucleotide query by translating the query in all 6
reading frames; gapped BLAST; and PSI-BLAST. BLAST locates patches of regional
similarity instead of calculating the best overall alignment using gaps. The program
then uses a scoring matrix to rank these matches as positive, negative or zero. If the
initial match is scored highly, the search is expanded in both directions until the
ranking score falls off.
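Two of the keywords above, the codon and translation in all 6 reading frames, can be illustrated together. The sketch below builds the standard genetic code table (written with DNA letters, T in place of U) and translates a sequence in three frames on each strand. It is an illustration of the idea only and is not taken from the BLAST programs; the function names and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translations in frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(len(CODON_TABLE))  # 64 possible codons
for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```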

Review Questions
1. Determine the amino acid composition of the encoded polypeptide. Are all 20
amino acids present? Do certain amino acids predominate? Compare the
composition to the frequency observed in proteins on average.
2. What are the benefits of Bioinformatics?
3. What are the techniques used in the Human Genome Project?
4. What was the reason behind the development of Bioinformatics?
5. What is BLAST?

Further Readings
Andreas D. Baxevanis, B.F. Francis Ouellette and Dharminder Kumar, Bioinformatics: a
practical guide to the analysis of genes and proteins. John Wiley & Sons, Inc., New York.
2001.
Jin Xiong, Essential Bioinformatics, Cambridge University Press, 2006.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang,
Zheng Zhang, Webb Miller, and David J. Lipman, Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs, 1997.


PUNJAB TECHNICAL UNIVERSITY


LADOWALI ROAD, JALANDHAR
INTERNAL ASSIGNMENT
TOTAL MARKS: 25
NOTE: Attempt any 5 questions. All questions carry 5 Marks.

Q 1. Differentiate between minicomputer and microcomputer.
Q 2. What is the difference between primary and secondary storage devices?
Q 3. List the three steps a computer follows to complete an arithmetic operation.
Q 4. Describe all the views of the document in short.
Q 5. How to add or hide a toolbar?
Q 6. What are the requirements for a virtual memory architecture?
Q 7. How does a distributed system enhance resource sharing?
Q 8. What are the benefits of multiprogramming?
Q 9. What are the benefits of Bioinformatics?
Q 10. What is BLAST?

PUNJAB TECHNICAL UNIVERSITY


LADOWALI ROAD, JALANDHAR
ASSIGNMENT SHEET
(To be attached with each Assignment)
______________________________________________________________________________________
Full Name of Student:____________________________________________________________________
(First Name)
(Last Name)
Registration Number:
Course:__________ Sem.:________ Subject of Assignment:_____________________________________
Date of Submission of Assignment:

(Question Response Record-To be completed by student)


S.No.    Question Number Responded    On Page Number of Assignment    Marks

1
2
3
4
5
6
7
8
9
10
Total Marks:_____________/25
Remarks by Evaluator:___________________________________________________________________
______________________________________________________________________________________
Note: Please ensure that your Correct Registration Number is mentioned on the Assignment Sheet.
Signature of the Evaluator
Signature of the student

Name of the Evaluator

Date:_______________

Date:_______________
