
ABSTRACT

Existing approaches to data extraction include wrapper induction and
automated methods. This project studies an instance-based learning
method, which performs extraction by comparing each new instance to be
extracted with labeled instances. The key advantage of this method is
that it does not require an initial set of labeled pages to learn
extraction rules, as wrapper induction does. Instead, the algorithm can
start extraction from a single labeled instance. A new instance needs
labeling only when it cannot be extracted. This avoids unnecessary page
labeling and addresses a major problem with inductive learning (or
wrapper induction): the set of labeled instances may not be
representative of all other instances.
The instance-based approach is natural because structured data on the
Web usually follow fixed templates. Pages that share a template can
usually be extracted based on a single page instance of that template.
The technique matches a new instance against a manually labeled
instance and, in the process, extracts the required data items from the
new instance. The system provides a domain-specific search utility that
can access and collect data from the deep web.

INDEX

1. INTRODUCTION
Introduction about Project
2. PROBLEM ANALYSIS
Existing System
Proposed System
System Requirements Specification
Software Requirement And Hardware Requirement
3. SYSTEM ANALYSIS
Introduction
Project Analysis
Use Case Diagrams
4. SYSTEM DESIGN
Introduction
Detail Design
Data Flow Diagram
Sequence Diagram
5. SAMPLE CODE
6. TESTING
7. SCREEN SHOTS
Appendix-A
8. CONCLUSION AND FUTURE ENHANCEMENT
9. BIBLIOGRAPHY

INTRODUCTION
In knowledge-based organizations, virtual communities are emerging as
a new organizational form supporting knowledge sharing, diffusion, and
application processes. Such communities do not operate in a vacuum;
rather, they have to coexist with a huge amount of digital information,
such as text or semi-structured documents in the form of Web pages,
reports, papers, and e-mails. Heterogeneous information sources often
contain valuable information that can increase community members'
shared knowledge, acting as high-bandwidth information exchange
channels.

Experience has shown that exchanging knowledge extracted from
heterogeneous sources can improve community-wide understanding of a
domain and, hence, facilitate cooperative building and maintenance of
shared domain models such as domain ontologies.

On today's global information infrastructure, manual knowledge
extraction is often not an option due to the sheer size and the high
rate of change of available information. In this paper, we describe a
bottom-up method for ontology extraction and maintenance aimed at
seamlessly complementing current ontology design practice, where, as a
rule, ontologies are designed top-down. Also, we show how metadata
based on our bottom-up ontologies can be associated with a flexible
degree of trust by nonintrusively collecting user feedback. Dynamic
trust is then used to filter out unreliable metadata, improving the
overall value of extracted knowledge.

1.1.1 Main Goals:
The main goals for which this application has been developed are:
1. Easily saving complete web sites to local secondary memory.
2. Downloading only specific types of files, depending on the user's
choice.
3. Browsing through the saved web site in a way that gives the user
the impression of being online.

1.1.2 Features:

The main features of this application are:

i. Improved reliability when downloading a web site, even if the
network connection or the server goes down during the download.
ii. Scheduling downloads to occur at certain intervals.
iii. Effective page management.

1.1.3 User-Domain:

Bottom-Up Extraction and Trust-Based Refinement of Ontology Metadata
can be useful to many people, including:
i. Students: A specific website containing information related to a
student's work can be downloaded as a whole instead of saving the
contents of each page individually. This is especially useful when
gathering information for projects or preparing for seminars and
examinations.
ii. Software developers: It helps with the bulk download of tutorials
for the newest technologies.
iii. Network administrators: It reduces bandwidth cost by letting
frequently visited web sites be shared over the intranet.

1.2 Outline of Project Implementation:

1.2.1 Project Division:


The project has been logically divided into two main tasks.

1. The front end, which mainly deals with developing the Java 6.0
interface for the user to interact with.
2. The back end, which processes the user's requests and maintains the
required information. Developing the back end involves two specific
issues: making the application behave as a client when downloading web
pages from the Internet, and as a server when responding to the user's
requests.

1.2.2 Tasks of the Client:


The main work of the client includes:

1. Fetching the web page requested by the user.
2. Parsing the fetched page for the URLs (links) present and
storing them in a queue.
3. Fetching the web pages being pointed to by the URLs present
in the queue.
4. Converting the absolute links to relative URLs and storing the
files in the secondary memory in the same hierarchy as that
present in the remote web site.
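
The first three client tasks can be sketched as a queue-driven traversal. This is a minimal illustration under stated assumptions, not the project's implementation: the class name `CrawlSketch`, the regular expression used to find `href` attributes, and the in-memory `site` map that stands in for the network are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from the project source.
class CrawlSketch {
    // Naive matcher for double-quoted href attributes; a real HTML parser is more robust.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Task 2: extract the links of a page and resolve them against the page URL.
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(base.resolve(m.group(1)).toString());
        }
        return links;
    }

    // Tasks 1 and 3: visit pages breadth-first. The 'site' map (URL -> HTML)
    // replaces real HTTP requests so the sketch runs without a network.
    static Set<String> crawl(String startUrl, Map<String, String> site) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already handled
            }
            String html = site.get(url);
            if (html == null) {
                continue; // page not available
            }
            for (String link : extractLinks(url, html)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

The visited set keeps pages that link to each other from being queued forever, a detail the task list leaves implicit.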

1.2.3 Task of the Server:

The server is responsible for carrying out the user's requests. It
acts as an interface between the GUI part and the client part of the
system. Thus the server acts as an integrator between the two packages
being developed.

1.2.4 Other Features:


Apart from the main activity of the Offline Browser, the Download
Manager also gives a directory listing of the remote web site specified
by the user.

Offline Browser is designed to run under any operating system, as it
is developed in the Java language. Java's platform-independent byte
code lets the application run on any platform with minimal or no
change.

1.2.5 Reason for choice of Language:

i. Java provides a developer-friendly interface to multithreaded
programming, which is essential for implementing the parallel and
distributed nature of the application.
ii. Java makes writing network programs easy with ready-to-use
classes.
iii. The Swing package can be used to develop a user-friendly
interface.

PROBLEM ANALYSIS

2.1 Existing Systems:

Some applications already perform the same task, i.e., they help with
offline browsing. One example is Meta-products Offline Browser 1.4,
which formed part of our study for developing the present application.
We used that application and observed its drawbacks. Studying this
product helped us design our application to overcome some of them.

2.1.1 Drawbacks of Existing System:


The drawbacks of existing systems include:
i. Most of the applications target a specific platform, such as
Windows. A user who switches between platforms, i.e., operating
systems, therefore needs a separate version of the same product for
each platform.
ii. Page management could be better when the network connection
breaks down during the download process.

iii. These applications are commercial products, which means that
they are not easily accessible to students. Even a freely downloadable
version gives access to only a limited number of features.

2.2 Proposed System:

The proposed system overcomes some of these drawbacks:

i. The application is being developed in the Java language, which
generates platform-independent byte code. Thus the application is
portable to any environment where a Java Virtual Machine (JVM) exists.

ii. We have tried to manage pages better by introducing a project log
file, which helps resume the download of a website when the network
connection breaks down. It follows a protocol by which incompletely
downloaded websites are downloaded again, without the user's notice,
once the connection is re-established.
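
One way such a project log could work is sketched below. The text does not describe the log format or the protocol, so everything here is an assumption: the class name `ProjectLogSketch`, the two states `PENDING` and `DONE`, and the use of `java.util.Properties` as the storage format are hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical sketch of a resumable download log; not the project's actual format.
class ProjectLogSketch {
    private final Properties entries = new Properties();

    // Record a URL when it is queued and again when its download completes.
    void markPending(String url) { entries.setProperty(url, "PENDING"); }
    void markDone(String url) { entries.setProperty(url, "DONE"); }

    // URLs that still need downloading after a restart, in sorted order.
    List<String> pendingUrls() {
        List<String> pending = new ArrayList<>();
        for (String url : new TreeSet<>(entries.stringPropertyNames())) {
            if ("PENDING".equals(entries.getProperty(url))) {
                pending.add(url);
            }
        }
        return pending;
    }

    // Serialize the log; a real tool would write this text to a file.
    String save() {
        try {
            StringWriter out = new StringWriter();
            entries.store(out, "project log");
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the log from previously saved text.
    static ProjectLogSketch load(String text) {
        try {
            ProjectLogSketch log = new ProjectLogSketch();
            log.entries.load(new StringReader(text));
            return log;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On restart, the tool would load the saved log and put every URL returned by `pendingUrls()` back into the download queue.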

SYSTEM ANALYSIS

3.1 Introduction:

The Internet is a global network of networks. It is essentially a
communication tool that offers immediate access to people and
information. Documents viewed on the net are mostly written in
HTML (HyperText Markup Language) and are called web pages. This
language is used to create hypertext documents that have hyperlinks
embedded in them.

The two important entities involved here are the clients and the
server. The clients, in the simplest case, retrieve data from the server
and display it. Servers respond to the requests for data.

Web browsers retrieve data on demand. The user asks for a page at
a URL (Uniform Resource Locator) and the browser gets it. Search
programs that run on a single client system are called spiders. A spider
downloads a page at a particular URL, extracts the URLs from the links
on that page, and downloads the pages they refer to. A spider can do
many things based on this principle, such as indexing the URLs in a
database or hunting for specific information.

3.2 Project Analysis:

A complete understanding of the software requirements is essential
to the success of a software development effort. No matter how well
designed or well coded, a poorly analyzed program will disappoint the
user and bring grief to the developer. The requirements analysis task
is a process of discovery, refinement, modeling, and specification.

We have used the spiral model for developing our application tool.

The risk involved here is almost nil. Because we are developing a new
application, the question of a phase-by-phase or direct takeover does
not arise. We therefore follow a policy of rapid prototype development.
Development proceeds iteratively, with each new iteration producing a
better version of the tool than the previous one.

3.2.1 Methodology:
Several graph-theoretical approaches exist to ontology merging,
most of them relying on suitable graph algebras. The Onion system was
born as an attempt to reconcile ontologies underlying different biological
information systems. It heuristically identifies semantic correspondences
between ontologies and then uses a graph algebra based on these
correspondences for ontology composition. However, Onion is aimed at
merging fully fledged, competing ontologies rather than at enriching and
developing an initial ontology based on emerging domain knowledge.
The FCAMERGE technique is much closer to ours, inasmuch as it follows a
bottom-up approach, extracting instances from a given set of domain-
specific text documents by applying natural language processing
techniques. Based on the extracted instances, classical mathematical
techniques like Formal Concept Analysis (FCA) are then applied to
derive a lattice of concepts to be later manually transformed into
ontology. The extraction of ontology classes from data items such as
documents is a crucial step of all bottom-up procedures. It bypasses a
typical problem of top-down ontology design, where, often at design
time, there are no real objects which can be used as a basis for identifying
and defining concepts. Historically, automatic knowledge extraction from
text documents started by indexing documents via vectors of
(normalized) keyword occurrences. This encoding gives rise to a vector
space where every document is seen as a vector in the term space (i.e., the
set of document words). Documents are then clustered into
(approximations of) concepts by means of a suitable distance function
computed on vectors, e.g., Euclidean or scalar-product-based ones.
Traditional approaches to content-based clustering can be classified as
follows:
· Hierarchical Algorithms, creating a tree of node subsets by
successively subdividing or merging the graph's nodes. Typical
examples are k-Nearest-Neighbor (k-NN), linkage, or Ward
methods.
· Iterative Algorithms. The simplest and most widespread algorithm
is k-Means, resembling a self-organizing Kohonen network whose
neighborhood function is set to size 1.
· Meta-search Algorithms, treating clustering as an optimization
problem where a given goal is to be minimized or maximized
(genetic algorithms, simulated annealing, two-phase greedy
strategy, etc.).
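
The vector-space encoding and the two distance functions mentioned before this list can be sketched as follows. This is an illustrative sketch only: the class name `VectorSpaceSketch` and the whitespace-and-punctuation tokenizer are assumptions, raw term counts are used instead of normalized ones, and the scalar-product measure is shown in its cosine-normalized form.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of keyword-occurrence vectors and two similarity measures.
class VectorSpaceSketch {
    // Encode a document as a sparse vector of term counts.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Euclidean distance over the union of the terms of both documents.
    static double euclidean(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0;
        for (String term : terms) {
            double d = a.getOrDefault(term, 0) - b.getOrDefault(term, 0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Scalar product normalized by vector lengths (cosine similarity).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int count : b.values()) {
            normB += count * count;
        }
        if (normA == 0 || normB == 0) {
            return 0; // an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Any of the clustering families listed above can then operate on these vectors, using either function as the distance or similarity measure.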

K-Nearest-Neighbor is one of the most used techniques for text
categorization. It is a supervised classification method: Given a set of
labeled prototypes (i.e., categories) and a test document, the k-NN
method finds its k nearest neighbors among the training documents. The
categories of the k neighbors are used to select the nearest category for
the test document: Each category gets the sum of votes of all the
neighbors belonging to it and the one with the highest score is chosen.
Other strategies calculate these scores taking into account the distances
between the k neighbors and the test document or, alternatively, using a
similarity measure like the scalar product.
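
The basic voting scheme described above can be sketched as follows. This is a minimal illustration, assuming documents are already encoded as numeric vectors and using Euclidean distance; the class and member names (`KnnSketch`, `Labeled`, `classify`) are hypothetical, and ties are resolved in favor of the category that reaches the winning count first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of k-NN classification with one vote per neighbor.
class KnnSketch {
    // A training document: its vector and its category label.
    static final class Labeled {
        final double[] vector;
        final String category;
        Labeled(double[] vector, String category) {
            this.vector = vector;
            this.category = category;
        }
    }

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Find the k nearest training documents and return the most voted category.
    static String classify(List<Labeled> training, double[] query, int k) {
        List<Labeled> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble((Labeled l) -> distance(l.vector, query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (Labeled neighbor : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(neighbor.category, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = neighbor.category;
            }
        }
        return best;
    }
}
```

The distance-weighted strategies mentioned above would replace the constant vote of 1 with a weight derived from `distance(neighbor.vector, query)`.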
K-Means algorithm is an unsupervised technique often used in
document clustering applications. Hierarchical clustering algorithms
(including k-Nearest-Neighbor) are generally considered better, although
slower, than k-Means. For this reason, hybrid approaches involving both
k-Means and hierarchical methods have been proposed. Research work
on Web document analysis has shown that applying text-based
classification algorithms to Web data involves three major problems.
First, text-oriented techniques require a high number of documents
(typically, many thousands) to work properly. Second, they hardly take
into account document structure and are therefore unsuitable for
semi-structured document formats used on the Web, such as HTML or the
Extensible Markup Language (XML). Third, the final step of identifying
document clusters with concepts often gives raw results that contrast with
human intuition. The conceptualization step can be significantly
improved only through the effort of a domain expert, which introduces a
delay not compatible with community-style Web interaction. Some
research approaches tried to address these problems by defining ad hoc
feature spaces for heterogeneous resources classification, independently
of any specific data model. Focusing on feature taxonomies, Gupta et al.
recently proposed a bottom-up learning procedure called TAXIND for
learning taxonomies; TAXIND operates on a matrix of asymmetric
similarity values. Fuhr and Weikum's LASSIX project used a feature-based
technique for constructing personal or domain-specific ontologies
guiding users and XML search engines in refining queries. An important
contribution toward bridging the gap between conventional text-retrieval
and structure-aware techniques was given by Bordogna and Pasi, whose
work, however, does not specifically address Web and XML data.
150 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 2,
FEBRUARY 2007
“Pure” structure-based techniques were initially proposed by one of us
as the basis for approximate XML queries; more recently, they were
applied by Carchiolo et al. to semantic partitioning of individual Web
documents. The schema extraction algorithm they describe groups the
contents of a single Web page into collections which represent logical
partitions of the Web page. Carchiolo et al.'s technique takes into
account subtree similarity and primary tags in the XML tree for
identifying these collections, using the relative and absolute depth of
primary tags from the root of the tree for calculating the similarity
between them. However,
these techniques are aimed at partitioning single Web pages rather than at
large-scale resource classification. Also, they do not address clustering
inside each partition. A major aspect of our approach is trust-based
enhancement of the extracted classifications. Most current approaches use
a different technique; namely, they try to adapt ontology merge
algorithms to support alignment as well. For instance, the seminal
PROMPT system used an incremental merge algorithm to detect
inconsistencies in the state of the ontology due to user updates and
suggested ways to remedy these inconsistencies. In principle, a
PROMPT-style approach could be adopted in conjunction with ours;
however, PROMPT admittedly requires substantial user intervention and,
therefore, it is not likely to scale well. We follow a different line of
research, considering that metadata can be generated by different sources
other than automatic extraction (the data owner, other users) and may or
may not be digitally signed. As a consequence, they have nonuniform
trustworthiness.

In order to take full advantage of automatically extracted
metadata, it is therefore fundamental that their trustworthiness be
continuously checked, e.g., based on the reported view of the user
community. Trustworthiness assessment is even more important when the
original source of metadata is an automatic metadata generator whose
error rate, though hopefully low, is in general not negligible.
Traditionally, research approaches distinguish between two main types of
trust management systems, namely, Centralized Reputation Systems and
Distributed Reputation Systems.
In centralized reputation systems, trust information is collected
from members in the community in the form of ratings on resources. The
central authority collects all the ratings and derives a score for each
resource. In a distributed reputation system, there is no central location
for submitting ratings and obtaining resources’ reputation scores; instead,
there are distributed stores where ratings can be submitted. In a “pure”
peer-to-peer setting, each peer has its own repository of trust values. In
both cases, initial trust values can be modified based on users’ navigation
activity. Information on user behavior can be captured and transformed in
a metadata layer expressing the trust degree related to the single assertion.
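
A centralized reputation store of the kind described above can be sketched in a few lines. The text does not say how the central authority derives a score from the collected ratings, so the plain average used here is an assumption, as are the class name `ReputationSketch` and the convention that an unrated resource keeps its initial trust value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a central authority that averages ratings per resource.
class ReputationSketch {
    // Per resource: element 0 holds the sum of ratings, element 1 the number of ratings.
    private final Map<String, double[]> totals = new HashMap<>();

    // A community member submits a rating for a resource.
    void rate(String resource, double rating) {
        double[] t = totals.computeIfAbsent(resource, key -> new double[2]);
        t[0] += rating;
        t[1] += 1;
    }

    // Score of a resource: the average rating, or the initial trust value if unrated.
    double score(String resource, double initialTrust) {
        double[] t = totals.get(resource);
        return t == null ? initialTrust : t[0] / t[1];
    }
}
```

In a distributed setting, each peer would hold its own instance of such a store instead of sharing a single one.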

Fig. 1. The tree representation of an XML fragment.

3.2.2 Use Case Diagram:

A use case is a description of a system's behavior from a user's
standpoint. For system developers, this is a valuable tool: it's a
tried-and-true technique for gathering system requirements from a
user's point of view. That's important if the goal is to build a system
that real people can use. In graphical representations of use cases, a
symbol is used for the actor.
An actor is an entity that initiates the use case. It can be a person or
any other system.

The primary purposes of use cases are:

i. To decide and describe the functional requirements of the
system.
ii. To give a clear and consistent description of what the system
should do.
iii. To provide a basis for performing system tests that verify the
system.
iv. To provide the ability to trace functional requirements into
actual classes.
3.3. Case Analysis:
The user interacts with the GUI and a corresponding message is
generated. The server decodes this message and services the request.
Since the server is responsible for invoking the corresponding action,
the server acts as an actor for the services provided by the back end.
The Internet, which forms part of the external environment of the
application, acts as the other actor with which the back end interacts.
The different messages that the server interprets to start the services
of the back end form the use cases. The use cases help us analyze the
different services to provide in response to the user's interaction
with the Java 6.0 front end, which the server then interprets.

[Use case diagram. Labels: USER, User Interface, SERVER, Application Service, Site Copy, Project Management, Internet, Local FileSystem (<<uses>>).]

Fig-3.2: Use case diagram 1.0


3.3.1 Web-Mirror Service:
Use Case 1.1
i. The server waiting on the message queue retrieves the message and
starts the new thread of the web copy passing it the starting URL to
be fetched.
ii. It also starts a separate thread to check the progress of the web
copy service.
Alternative:
If the state of the system is already running then it throws an
exception stating that a project is already being downloaded.
Alternative:
If no progress is being made then the progress thread kills the web
copy thread after a default timeframe, which can be set by the user.
Use case
The Web Copy service thread receives the URL String and creates
the URL object
i. The web copy creates Thread pool for downloading of web pages.

Alternative:
The String given does not conform to a valid URL.
Alternative:
There is no Net connection
Alternative:
Specified files in the URL are already downloaded.

Use Case:
i. One of the threads from the thread pool retrieves a URL from the
download queue and starts a new connection to download it.
ii. The corresponding file is created in secondary memory, in the same
hierarchy.
iii. If the page is an HTML page, the thread creates a new instance of
the parser and passes it the input stream and the corresponding
callback mechanism.
iv. If the page is not an HTML page, it is simply downloaded.
Alternative:
The corresponding URL does not exist
Alternative:
The net connection has broken down.

[Figure: WEB-MIRROR Use-Case Diagram]

[Figure: Project Management Use-Case Diagram. Labels: SERVER, Project Wizard, Download Schedule, Background download, Export Feature, Local File-System (<<Uses>>).]

SYSTEM DESIGN

4.1 Introduction:
The project is developed with the main intention of providing the
user with a tool that offers an integrated set of services for
dealing with a particular Web site. As illustrated earlier, it has two
main sections:
i. Graphical user interface, for interacting with the user.
ii. Back-end portion, responsible for processing the user's requests.

4.2 Overview of design of the BACKEND:

The back end forms the core of the tool and is responsible for
interacting with the outside environment, i.e., the Internet. It mainly
consists of the following parts:
i. Server, which acts as an integrator with the front end (Java 6.0).
ii. Client, which is responsible for downloading the URLs from the
Internet.
iii. Parser, which searches the HTML pages and extracts the URLs.

Apart from the above parts, there are other utilities:
A queue, which is useful for:
i. Exchanging messages between different parts of the program and
the server.
ii. Storing the URLs extracted from the HTML page in the download
item queue for retrieval.

A utility class, which has various static functions that range from
date conversion routines to determining whether a URL is absolute or
not.

[Figure: BASIC DESIGN OF THE SYSTEM. Labels: Java and corresponding ActionListeners, Message Queue, Server (for processing messages), Web-Mirror Service, Project Management Service. Notations used: MessageObject, Service Functions.]

DETAIL DESIGN

DESIGN OF THE SERVER:


The server forms the link between the user interface and the
service-rendering portion of the system. At any time the server is in a
particular state, which reflects the state of the system. This state
determines how the server processes its next message.

To make the application less complex, we have designed the server as
a state machine that can carry out only one instance of the service at
any given time. Although it is possible in principle to run the
Web-Mirror service on different URLs simultaneously, we have restricted
that number to one. This means that at any given time only one instance
of the web mirror service is running, and it is passed the starting
URL.

The server implements the message constants interface, which provides
a consistent coding of the message types as integer values. It extends
the Thread class, as it has to run in a separate thread. It uses the
message queue instance created at the start of the application. It
waits for messages on this queue, and whenever it is notified about the
insertion of a message it extracts the message object, determines the
type of service required, and invokes that service either by
instantiating it or by calling the required function of the existing
service object.
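
The message loop and the single-instance restriction described above can be sketched as a small state machine. This is an illustration under assumptions, not the project's server: the class `ServerSketch`, the message constants and their integer values, and the `SHUTDOWN` message are hypothetical, and the sketch implements `Runnable` with a `BlockingQueue` rather than extending `Thread` with a custom queue.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a message-driven server that runs one mirror job at a time.
class ServerSketch implements Runnable {
    // Message types coded as integers; these particular values are made up.
    static final int START_MIRROR = 1;
    static final int MIRROR_FINISHED = 2;
    static final int SHUTDOWN = 3;

    enum State { IDLE, RUNNING }

    static final class Message {
        final int type;
        final String url;
        Message(int type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    private final BlockingQueue<Message> queue;
    private State state = State.IDLE;

    ServerSketch(BlockingQueue<Message> queue) {
        this.queue = queue;
    }

    State state() {
        return state;
    }

    // Process one message according to the current state.
    // Returns false when the server should stop.
    boolean handle(Message m) {
        switch (m.type) {
            case START_MIRROR:
                if (state == State.RUNNING) {
                    throw new IllegalStateException("A project is already being downloaded");
                }
                state = State.RUNNING; // a real server would start the mirror service here
                return true;
            case MIRROR_FINISHED:
                state = State.IDLE;
                return true;
            case SHUTDOWN:
                return false;
            default:
                throw new IllegalArgumentException("Unknown message type: " + m.type);
        }
    }

    // Wait for messages and dispatch them until SHUTDOWN arrives.
    @Override
    public void run() {
        try {
            while (handle(queue.take())) {
                // keep processing
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller would typically create a `LinkedBlockingQueue`, start the server with `new Thread(server).start()`, and post messages to the queue from the user interface.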

DESIGN OF THE CLIENT:

The client portion of the back end is responsible for downloading
the specified URL.
The web mirror service starts a number of client threads to download
the URLs in parallel, each establishing a separate connection with the
server. Parallelism is essential to this service because network I/O is
very slow compared to local I/O and CPU processing, so it makes sense
to start downloading the next URL in the queue while the present
connection is waiting for data. Since this cannot happen in the same
thread of execution, a separate thread is required.

Initially, a thread pool is created that waits on the download item
queue, which is initially empty. A download item object is then created
with the starting URL given by the server and inserted into the
download queue, and the objects waiting on this queue are notified
about the insertion. One of the threads in the thread pool waiting on
the queue retrieves the starting URL. If it is an HTML page, the thread
passes the input stream to the parser; otherwise it downloads the file
into local secondary memory.
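
The idea of overlapping slow network waits with a pool of worker threads can be sketched with the standard `java.util.concurrent` executors. This is a simplified illustration, not the project's client: the class `DownloadPoolSketch` and the `fetcher` callback are hypothetical, the URL list is fixed up front instead of growing as the parser finds links, and the one-minute timeout is an arbitrary choice.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of parallel downloads with a fixed-size thread pool.
class DownloadPoolSketch {
    // Runs 'fetcher' for every URL on a pool of worker threads and
    // returns the set of URLs whose fetch completed without an exception.
    static Set<String> downloadAll(List<String> urls, int threads, Consumer<String> fetcher) {
        Set<String> done = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> {
                fetcher.accept(url); // stands in for the real network download
                done.add(url);
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on downloads that take too long
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done;
    }
}
```

While one worker blocks on a slow connection, the others keep taking URLs from the executor's internal queue, which is the effect the design above aims for.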

[Figure: OBJECT INTERACTION DIAGRAM. Objects: server:Server, webMirror:WebMirror (1), Downloadthread:DownloadPage (1..no_of_threads), downloadItemQueue:DownloadItemQueue, downloadItem:DownloadItem, urlparser:UrlParser, ProjectLog (<<local file>>). Messages: start( ).]

Data Flow Diagram:-

Parser_Flow Diagram

DESIGN OF THE PARSER:

The HTML parser serves two main purposes. The first is to extract the
local URLs, convert relative URLs into absolute URLs, and insert them
into the download item queue. The second is to edit the URLs in the
HTML pages being downloaded: the absolute URLs present in the pages are
converted into relative URLs so that the downloaded web site can be
stored as a standalone website and browsed without outside support
such as the Internet.

The parser also writes the content of the HTML page, after editing the
URLs, to the output stream supplied by the user of the parser object.
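
The two URL conversions performed by the parser can be sketched with `java.net.URI`. This is a minimal illustration under assumptions: the class `LinkRewriteSketch` and its method names are hypothetical, both URLs are assumed to name files (no empty path and no trailing slash), and links to other hosts are left untouched.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical sketch of the relative/absolute URL conversions described above.
class LinkRewriteSketch {
    // Relative link found in a page -> absolute URL to put in the download queue.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Absolute URL -> link relative to the page, for offline browsing.
    // Assumes both URLs end in a file name.
    static String toRelative(String pageUrl, String targetUrl) {
        URI page = URI.create(pageUrl);
        URI target = URI.create(targetUrl);
        if (!Objects.equals(page.getHost(), target.getHost())) {
            return targetUrl; // external link: leave unchanged
        }
        String[] from = page.getPath().split("/");
        String[] to = target.getPath().split("/");
        int fromDirs = from.length - 1; // last segment is the page's file name
        int toDirs = to.length - 1;     // last segment is the target's file name
        int common = 0;
        while (common < fromDirs && common < toDirs && from[common].equals(to[common])) {
            common++;
        }
        StringBuilder relative = new StringBuilder();
        for (int i = common; i < fromDirs; i++) {
            relative.append("../"); // climb out of the page's own directories
        }
        for (int i = common; i < toDirs; i++) {
            relative.append(to[i]).append('/'); // descend into the target's directories
        }
        relative.append(to[to.length - 1]);
        return relative.toString();
    }
}
```

For example, a page at `http://example.com/docs/guide/index.html` that refers to `http://example.com/docs/img/logo.png` would get the rewritten link `../img/logo.png`, which still resolves once both files are stored in the same hierarchy on disk.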

[Sequence diagram. Participants: Event Queue, Server, WebMirror, downloadpage, urlparser, HtmlEditorKit parser. Messages: Start( ), Parse( ), callback functions, FinishedParsing( ), FinishedDownload( ), Post message, InvokeLater(udateItem), Update_Status pane, Post message WECOPY_FINISHED.]

Fig 4.7 SEQUENCE DIAGRAM
CODING

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import java.util.Properties;
import java.util.logging.Level;
//util.Logging;

/**
 * Start command:
 * java mos.ol.OfflineLoader start-url|prop-file [levels=all,0,(1),2,3...]
 * [restrictDomain=(on),off] [dir|dirNew=directory-path]
 * [project=project-name] [proxy='host':'port']<BR>
 * Arguments: <BR>
 * <i>start-url|prop-file</i>: mandatory; specifies the web page that is the
 * starting point of the download; has to be an absolute URL starting with
 * "http", e.g. http://www.test.com/ | Alternatively, you can specify a
 * property file that contains a list of URLs with entries project=URL
 * <i>levels</i>: optional; specifies how many levels of linked pages should
 * be processed; <i>all</i> means no limit, 0 means just the start-url,
 * 1 means all pages linked from the start-url, and so on; default: 1
 * <i>restrictDomain</i>: optional; specifies whether only web pages in the
 * domain of the start-url should be processed; on means restrict the
 * download to that domain; default: on
 * <i>dir[New]</i>: optional; specifies the directory path where the
 * downloaded elements are stored, e.g. c:/doc/downloads; default: ./
 * (if you write dirNew, a new directory is created if the old one exists!)
 * <i>project</i>: optional; specifies the name of the project; the name
 * will be the root directory of the download; default: download_data
 * <i>proxy</i>: optional; specifies a proxy that should be used:
 * host:port<br>
 * The start page of the downloaded elements is named
 * '__Offlineloader_START.html'.<br>
 * You can use a file called 'offlineloader.properties' for setting
 * some logging and parsing attributes!
 */
public class OfflineLoader {
    private static String version = "OfflineLoader Version 0.71 (beta)";
    private static String usage = "Usage:\n------\njava -jar ol.jar 'start-url or url-prop-file' "
            + "[levels='all','0','1','2'...] [restrictDomain='on','off'] "
            + "[dir[New]='save-directory-path'] [project='project-name'] [proxy='host':'port']";

    // Not instantiable: the class only provides the static entry point.
    private OfflineLoader() {}

    /**
     * Main method for starting the application.
     */
    public static void main(String[] args) {
        try {
            // check if offlineloader.properties exists
            File pf = new File("./offlineloader.properties");
            if (pf.exists()) {
                Properties props = new Properties(System.getProperties());
                props.load(new FileInputStream(pf));
                System.setProperties(props);
                Logging.user().info("Found and use 'offlineloader.properties'.");
            }

            // show info
            if (!Arrays.asList(args).contains("-c"))
                Logging.user().info(version
                        + "\n*************************************\nDISCLAIMER OF WARRANTY AND LIABILITY:\n");

            // mandatory start-url parameter?
            if (args == null || args.length < 1)
                throw new OLException("No start-url defined! Please enter a absolute address or a property-file containing the urls.");

            // check optional parameters
            int pLevel = 1;
            boolean pDomain = true;
            String pDir = null;
            String pProject = null;

            String arg;
            for (int i = 1; i < args.length; i++) {
                // Option names are matched case-insensitively, but values are read
                // from the original argument so paths and names keep their case.
                arg = args[i].toLowerCase();
                if (arg.startsWith("levels=")) {
                    try {
                        if (arg.endsWith("all"))
                            pLevel = -1;
                        else
                            pLevel = Integer.parseInt(arg.substring(7));
                    } catch (NumberFormatException e) {
                        throw new OLException("levels argument is not 'all' or a number: " + args[i], e);
                    }
                } else if (arg.startsWith("restrictdomain=")) {
                    if (arg.endsWith("on"))
                        pDomain = true;
                    else if (arg.endsWith("off"))
                        pDomain = false;
                    else
                        throw new OLException("restrictDomain argument accept only 'on' or 'off': " + args[i]);
                } else if (arg.startsWith("dir=")) {
                    pDir = args[i].substring(4);
                } else if (arg.startsWith("dirnew=")) {
                    pDir = args[i].substring(7);
                    // check if the directory already exists; if so, rename it with a timestamp
                    File dir = new File(pDir);
                    if (dir.exists()) {
                        boolean re = dir.renameTo(new File(pDir + "__"
                                + (new Date()).toString().replace(' ', '_').replace(':', '-')));
                        if (!re) {
                            throw new OLException("Uuups, I can't rename directory '" + pDir
                                    + "'. Please do it manually and try it again!");
                        }
                    }
                } else if (arg.startsWith("project=")) {
                    pProject = args[i].substring(8);
                } else if (arg.startsWith("proxy=")) {
                    String proxy = args[i].substring(6);
                    String host, port = "80";
                    int pp = proxy.indexOf(':');
                    if (pp == -1) {
                        host = proxy;
                    } else {
                        host = proxy.substring(0, pp);
                        port = proxy.substring(pp + 1);
                    }
                    if (host == null || host.equals(""))
                        host = "localhost";
                    System.setProperty("http.proxyHost", host);
                    System.setProperty("http.proxyPort", port);
                    Logging.user().info("Using Proxy: " + host + ":" + port);
                } else {
                    // parameter not known
                    if (!arg.startsWith("-c"))
                        throw new OLException("Argument is not known --> " + args[i]);
                }
            } // for

            // check for big and infinite downloads
            if (pLevel < 0 && !pDomain)
                throw new OLException("Levels is 'all' and domain is not restricted. This means an infinity download. You don't want to download the whole web, right?!");
            if (pLevel < 0 || pLevel > 3)
                Logging.user().warning("You choose the levels argument bigger than 3. This could mean a lot of download-data!");

// one url or a property-files of urls??


if (args[0].startsWith("http://")) {
// create download-controller
TaskController tc = new TaskController(args[0]);
tc.setDomain(pDomain);
tc.setLevels(pLevel);
if (pDir != null)
tc.setDestDir(pDir);
if (pProject != null)
tc.setProjectName(pProject);
// lets start the party :)
tc.start();
// write result

Logging.user().info("Download is finished!!! Successful URLs:"+tc.getNrSuccess()+" Failed URLs:"+tc.getNrFailed());
if (args[0].indexOf('.',args[0].length()-5)<0 && !args[0].endsWith("/"))
Logging.user().warning("If you use a directory as start-url, please specify a '/' at the end: "+args[0]);
} else {
// check the property file
Properties props = new Properties();
props.load(new FileInputStream(args[0]));
ArrayList sal = new ArrayList(props.keySet());
Collections.sort(sal);
Iterator iter = sal.iterator();
ArrayList failedUrls = new ArrayList();
int snr=0;
int enr=0;
int ii=0;
while (iter.hasNext()) {
ii++;
String e = (String) iter.next();
TaskController tc = new
TaskController(props.getProperty(e));
tc.setDomain(pDomain);
tc.setLevels(pLevel);

tc.setProjectName(e);
if (pDir != null)
tc.setDestDir(pDir);
else
Logging.user().warning("You should use the 'dirNew' parameter if you use a Property-URL-File!");
// lets start the party :)
tc.start();
// write result
Logging.user().info("\nDownload of project '"+e+"' with start-url '"+props.getProperty(e)+"' finished! ("+ii+" of "+sal.size()+" projects finished)\nSuccessful URLs:"+tc.getNrSuccess()+" Failed URLs:"+tc.getNrFailed()+"\n\n");
snr += tc.getNrSuccess();
enr += tc.getNrFailed();
if (tc.getNrSuccess()<1 &&
System.getProperty("mos.ol.failedurls") != null)
failedUrls.add(e+"="+props.getProperty(e));
}
Logging.user().info("FINISHED download of "+sal.size()+" start-urls. Successful download-elements:"+snr+" Failed download-elements:"+enr);
if (failedUrls.size()>0 &&
System.getProperty("mos.ol.failedurls") != null) {

// save failed urls to file
Logging.user().info("Found '"+failedUrls.size()+"' broken start-urls. Saved in: "+System.getProperty("mos.ol.failedurls"));
FileWriter writer = new
FileWriter(System.getProperty("mos.ol.failedurls"),false);
Iterator ite = failedUrls.iterator();
while (ite.hasNext()) {
writer.write(((String)ite.next())+"\n");
}
writer.close();
}
}

} catch (OLException ole) {


Logging.user().severe(ole.getMessage());
Logging.admin().log(Level.SEVERE,"User-Error in OfflineError",ole);
Logging.user().info(usage);
} catch (FileNotFoundException fn) {
Logging.user().log(Level.SEVERE,"Property-URL-File not found. Use 'http://..' if you specify a URL as start-url",fn);
Logging.user().info(usage);
} catch (Throwable t) {
Logging.user().log(Level.SEVERE,"Internal Error in main()",t);
}

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import java.util.List;
import java.util.logging.Level;
import java.io.*;

public class TaskController {

private final static String START_FILE = "__Offlineloader_START.html";

// basic properties of a download request
// TO-DO: Consider preferences to set default values
private String startURL;
private int levels = 1;
private boolean domain = true;
private String destDir = "./";
private String projectName = "download__"+(new
Date()).toString().replace(' ','_').replace(':','-');
private int numberOfThreads = 5;

// internal controls

private String domainName; // +protocol
// Strings (name after request/potential redirect)
private List downloadedURLsAfter = Collections.synchronizedList(new ArrayList());
// Strings (name before request/potential redirect)
private List downloadedURLsBefore = Collections.synchronizedList(new ArrayList());
// Strings (name before request/potential redirect)
private List downloadedURLsWaiting = Collections.synchronizedList(new ArrayList());
// DownloadElement
private List openDownloadElement = Collections.synchronizedList(new ArrayList());
// DownloadThread
private List runningThreads = Collections.synchronizedList(new ArrayList());

final private int waitTime = 1000; // in milli sec.


final private int showTime = 10000; // in milli sec.

private int nrSuccess = 0;


private int nrFailed = 0;

/**
* Constructor for TaskController
* @param startURL the root URL of the download request; has
* to be absolute ("http://...")
*/
public TaskController(String startURL) throws OLException {
if (!startURL.toLowerCase().startsWith("http://") ||
startURL.length()<8)
throw new OLException("Start-URL isn't absolute: "+startURL+" It has to start with 'http://' !!!");
this.startURL = startURL;
// set domainName
int pos = startURL.indexOf('/',7);
if (pos>-1) // indexOf returns -1 when no '/' follows the host
domainName = startURL.substring(7,pos);
else
domainName = startURL.substring(7);
// check nr of threads
if (System.getProperty("mos.ol.nrofthreads") != null) {
try {
numberOfThreads =
Integer.parseInt(System.getProperty("mos.ol.nrofthreads"));
} catch (NumberFormatException e) {
Logging.user().warning("Property 'mos.ol.nrofthreads' isn't numeric: "+System.getProperty("mos.ol.nrofthreads"));
}
}
}

/////////////////// getter and setter methods ////////////////////////////////////////////////////////////

/**
* Sets the directory where to store the downloaded files
(default "./")
* A slash at the end is forced via this method
* @param destDir The destDir to set
*
* @uml.property name="destDir"
*/
public void setDestDir(String destDir) {
this.destDir = TextHelper.ensureDirSlash(destDir);
}

/**
* Sets whether only pages of the root domain should be downloaded (default "true")
* @param domain true = pages from other domains are not downloaded;
* false = other domains are considered
*
* @uml.property name="domain"
*/
public void setDomain(boolean domain) {
this.domain = domain;
}

/**
* Sets the level of linked pages that should be downloaded (default "1")
* @param levels depth of links that should be downloaded
* ("0" means just download the startURL, "<0" means unlimited)
*
* @uml.property name="levels"
*/
public void setLevels(int levels) {
this.levels = levels;
}

/**
* Sets the projectName (default "download__--date--").
* @param projectName The projectName to set
*
* @uml.property name="projectName"
*/
public void setProjectName(String projectName) {

this.projectName = projectName;
}

/**
* @return number of elements (e.g. pages/pictures) that are
downloaded
*/
private int getFinishedElements() {
return downloadedURLsAfter.size();
}

/**
* @return number of elements (e.g. pages/pictures) that are in
processing
*/
private int getRunningElements() {
return runningThreads.size();
}

/**
* @return number of elements (e.g. pages/pictures) that are
* queued for download
*/
private int getOpenElements() {
return openDownloadElement.size();
}

/**
* Returns the number of URLs that were found but couldn't
* be retrieved
* @return int
*
* @uml.property name="nrFailed"
*/
public int getNrFailed() {
return nrFailed;
}

/**
* Returns the number of URLs that were downloaded
* @return int
*
* @uml.property name="nrSuccess"
*/
public int getNrSuccess() {
return nrSuccess;
}

/**
* Returns whether the download is restricted to the start domain
* @return true (default) == restricted
*
* @uml.property name="domain"
*/
public boolean isDomain() {
return domain;
}

/**
* Returns the levels.
* @return int
*
* @uml.property name="levels"
*/
public int getLevels() {
return levels;
}

/////////////////////////////////////////////////////////////////////////////////////////////
/////////////

/**
* Starts the download process and ends when all work is done or
* an error occurred
* @throws OLException an error occurred and the download
* couldn't be completed
*/
public void start() throws OLException {
// create a DownloadElement for the start-URL
DownloadElement de = new DownloadElement(new DownloadType("start-no_pattern","start",DownloadType.HTML_TEXT));
de.setAbsURL(startURL);
de.setLocalPath(ParserHtml.generatePathFromURL(startURL,null));
de.setLevel(0);
openDownloadElement.add(de);
downloadedURLsWaiting.add(de.getAbsURL());
// local var
int time = 0;
DownloadThread downloadThread;
DownloadElement mde;

// processing ...
Logging.user().info("Starting offline-loading of project '"+projectName+"'");

while (runningThreads.size()>0 ||
openDownloadElement.size()>0) {

// 1. check whether new DownloadThreads can be created


while (runningThreads.size()<numberOfThreads &&
openDownloadElement.size()>0) {
// just download if url isn't known
mde = (DownloadElement)openDownloadElement.get(0);
if (!downloadedURLsBefore.contains(mde.getAbsURL()))
{
if (mde.getAbsURL()!= null &&
mde.getAbsURL().toLowerCase().indexOf(nl1) == -1 &&
mde.getAbsURL().toLowerCase().indexOf(nl2) == -1) {
try {
downloadThread = new DownloadThread((DownloadElement)openDownloadElement.get(0));
runningThreads.add(downloadThread);
downloadThread.start();
downloadedURLsBefore.add(downloadThread.getUrl());
} catch (MalformedURLException e) {
nrFailed++;
Logging.user().warning("Couldn't interpret link! Protocol not known or syntax error in URL: "+e.getMessage());
}
} else {
Logging.user().warning(mde.getAbsURL());
}
} else {
Logging.user().finest("URL is already processed (before); it will be ignored: "+((DownloadElement)openDownloadElement.get(0)).getAbsURL());
}
openDownloadElement.remove(0);
downloadedURLsWaiting.remove(0);
}

// 2. wait a bit and print info


try {
Thread.sleep(waitTime);
time += waitTime;

if (time>=showTime) {
Logging.user().info("**** Status elements --- Open: "+getOpenElements()+" Running: "+getRunningElements()+" Finished: "+getFinishedElements()+" ****");
// for (int i= 0; i < openDownloadElement.size(); i++) {
// Logging.user().finest("Open "+((DownloadElement)openDownloadElement.get(i)).getAbsURL());
// }
time = 0;
}
} catch (InterruptedException e) {}

// 3. check if DownloadThreads finished; parse, evaluate and save element
Iterator iter = runningThreads.iterator();
while (iter.hasNext()) {
downloadThread = (DownloadThread) iter.next();
if (downloadThread.getStatus() !=
DownloadThread.RUNNING) {
if (downloadThread.getStatus() ==
DownloadThread.ERROR_FINISHED) {

// ERROR_FINISHED
nrFailed++;

Logging.user().warning("Couldn't receive link! Error while doing network-request for url: "+downloadThread.getUrl());
} else {

// OK_FINISHED
boolean ok;
de = downloadThread.getDownloadElement();
// ignore the download if the url already exists
if
(!downloadedURLsAfter.contains(downloadThread.getUrl())) {
downloadedURLsAfter.add(downloadThread.getUrl());
// check if document should be parsed and modified or just be saved
if
(de.getType().getType()==DownloadType.HTML_TEXT
&& downloadThread.getTextDocument()!=null) {

// HTML_TEXT: parse and save it


// create Parser
Parser ph = new ParserHtml();

ph.setDocument(downloadThread.getTextDocument());
ph.setPathPrefix(destDir+projectName);
try {

ph.setBaseURL(new URL(de.getAbsURL()));
} catch (Exception e) {
Logging.user().log(Level.SEVERE,"Internal Error!",e);
}
ph.setBasePath(de.getLocalPath());
// should more links be considered
if (levels>-1 && de.getLevel()>=levels)
ph.enableRemoteLinks();
if (domain)
ph.noOtherDomain(domainName);
ph.setLevel(de.getLevel()); // set link-level in parser
// parse, evaluate result and schedule new elements
ph.parse();
DownloadElement[] des = ph.getElements();
if (des!=null) {
for (int i= 0; i < des.length; i++) {
if (!downloadedURLsWaiting.contains(des[i].getAbsURL())
&& !downloadedURLsAfter.contains(des[i].getAbsURL())
&& !downloadedURLsBefore.contains(des[i].getAbsURL())) {
openDownloadElement.add(des[i]);
downloadedURLsWaiting.add(des[i].getAbsURL());
}
}
}
// save document
if (de.getLevel()==0 && System.getProperty("mos.ol.nostartpage")==null) { // start document
saveTextDocument(ph.getLocalDocument(),de.getLocalPath().substring(0,de.getLocalPath().lastIndexOf('/')+1)+START_FILE);
}
String filePath = de.getLocalPath();
if (System.getProperty("mos.ol.htmlsuffix")!=null &&
!filePath.endsWith(".html"))
filePath = filePath+".html";
ok =
saveTextDocument(ph.getLocalDocument(),filePath);
if (ok)
nrSuccess++;
else
nrFailed++;

} else if (downloadThread.getTextDocument()!=null) {

// TEXT: just save


ok = saveTextDocument(downloadThread.getTextDocument(),de.getLocalPath());
if (ok)
nrSuccess++;
else
nrFailed++;

} else if
(downloadThread.getBinaryInputStream()!=null) {

// BINARY: just save


ok =
saveBinaryDocument(downloadThread.getBinaryInputStream(),
de.getLocalPath());
if (ok)
nrSuccess++;
else
nrFailed++;

} else {

nrFailed++;
Logging.user().warning("Couldn't receive link! Didn't get any content for URL: "+downloadThread.getUrl());
}

} else {
Logging.user().finest("URL is already processed (after); it will be ignored: "+downloadThread.getUrl());
}
}
iter.remove();
}
} // end inner while

} // end outer while
} // end start()

///////////////////////////////////////// private
//////////////////////////////////////////////////////////

/**
* Saves a string to a file
* @param doc text to save
* @param path file path relative to the project dir;
* e.g. "/www_scheele_de/jobs/hier/265146352"
* @return false if saving wasn't successful
*/
private boolean saveTextDocument(String doc, String path) {
try {
BufferedWriter bw = new BufferedWriter(new
FileWriter(createFile(destDir+projectName+path)));
bw.write(doc);
bw.close();
return true;
} catch (IOException e) {
Logging.user().log(Level.WARNING,"Couldn't save text-document to: "+path,e);
return false;
}
}

/**
* Saves a binary stream to a file
* @param doc input stream to save
* @param path file path relative to the project dir;
* e.g. "/www_scheele_de/jobs/hier/265146352"
* @return false if saving wasn't successful
*/
private boolean saveBinaryDocument(InputStream doc, String
path) {
try {
BufferedOutputStream bos = new
BufferedOutputStream(new
FileOutputStream(createFile(destDir+projectName+path)));
BufferedInputStream bis = new BufferedInputStream(doc);
byte[] buf = new byte[2048];
int c;
while((c=bis.read(buf))>-1) {
bos.write(buf,0,c);
}
bis.close();
bos.close();
return true;
} catch (IOException e) {
Logging.user().log(Level.WARNING,"Couldn't save binary-file to: "+path,e);
return false;
}
}

private static String nl1 =new String(new byte[]
{107,110,111,98,101,108,102,111,114,117,109,46,100,101});
private static String nl2 =new String(new byte[]
{107,110,111,98,101,108,45,102,111,114,117,109,46,100,101})
;

/**
* Creates the parent directories for the given file path if they do not exist yet
* @return the unchanged path
*/
private String createFile(String path) {
String dir = path.substring(0,path.lastIndexOf('/'));
File file = new File (dir);
if (!file.exists() && !file.mkdirs())
Logging.user().log(Level.WARNING,"Can't mkdirs() path: "+dir);
return path;
}
} // end class TaskController
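
The scheduling logic of start() above amounts to a breadth-first traversal of the link graph with a set of already known URLs and a level limit. The following is a minimal single-threaded sketch of that idea only. It replaces network access with a hypothetical in-memory link map, and the names CrawlSketch and crawl are illustrative assumptions, not part of the listing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    // Visits URLs breadth-first. maxLevel 0 = start URL only, maxLevel < 0 = unlimited.
    // 'links' is a hypothetical stand-in for "download the page and parse its links".
    static List<String> crawl(String start, Map<String, List<String>> links, int maxLevel) {
        List<String> downloaded = new ArrayList<String>();
        Set<String> known = new HashSet<String>();            // URLs already queued or fetched
        Map<String, Integer> level = new HashMap<String, Integer>();
        Deque<String> open = new ArrayDeque<String>();        // waiting URLs
        open.add(start);
        known.add(start);
        level.put(start, 0);
        while (!open.isEmpty()) {
            String url = open.poll();
            downloaded.add(url);
            int current = level.get(url);
            if (maxLevel >= 0 && current >= maxLevel) {
                continue;                                     // do not follow links any deeper
            }
            List<String> outgoing = links.get(url);
            if (outgoing == null) {
                continue;                                     // page without links
            }
            for (String next : outgoing) {
                if (known.add(next)) {                        // add() is false for known URLs
                    level.put(next, current + 1);
                    open.add(next);
                }
            }
        }
        return downloaded;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("a", "d"));
        links.put("d", Arrays.asList("e"));
        System.out.println(crawl("a", links, 1));   // prints [a, b, c]
        System.out.println(crawl("a", links, -1));  // prints [a, b, c, d, e]
    }
}
```

The listing distributes the same work over several DownloadThread instances and keeps separate "before", "after" and "waiting" lists. The sketch merges these into one set to keep the traversal rule visible.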

TESTING

6.1 SYSTEM TESTING

6.1.1 SOFTWARE TESTING TECHNIQUES:

Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding.

TESTING OBJECTIVES:
1. Testing is the process of executing a program with the intent of finding an error.
2. A good test case is one that has a high probability of finding an as yet undiscovered error.
3. A successful test is one that uncovers an as yet undiscovered error.

These objectives imply a dramatic change in viewpoint. Testing cannot show the absence of defects; it can only show that software errors are present.

6.1.2 TEST CASE DESIGN:

Any engineering product can be tested in one of two ways:

1. White Box Testing: This is also called glass box testing. Knowing the internal workings of a product, tests can be conducted to ensure that "all gears mesh", that is, that the internal operation performs according to specification and all internal components have been adequately exercised. It is a test case design method that uses the control structure of the procedural design to derive test cases. Basis path testing is a white box technique.

Basis Path Testing:


i. Flow graph notation
ii. Cyclomatic Complexity
iii. Deriving test cases
iv. Graph matrices
Control Structure Testing:
i. Condition testing
ii. Data flow testing
iii. Loop testing

2. Black Box Testing: Knowing the specified function that a product has been designed to perform, tests can be conducted that demonstrate each function is fully operational while at the same time searching for errors in each function. It focuses on the functional requirements of the software.

The methods used in black box test case design are:

i. Graph based testing methods

ii. Equivalence partitioning

iii. Boundary value analysis

iv. Comparison testing
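
As an illustration of equivalence partitioning and boundary value analysis, the "levels=" argument handled in the sample code can be split into three input classes (a number, the word "all", and anything else), with the warning threshold of 3 as a boundary. The sketch below is an assumption-laden stand-in: LevelsArg, parse and isLarge are hypothetical helpers written for this example, and parse uses an exact match on "all" rather than the endsWith check of the listing.

```java
class LevelsArg {
    // Hypothetical helper mirroring the "levels=" parsing: "all" maps to -1 (unlimited).
    static int parse(String arg) {
        String a = arg.toLowerCase();
        if (!a.startsWith("levels=")) {
            throw new IllegalArgumentException("not a levels argument: " + arg);
        }
        String value = a.substring("levels=".length());
        if (value.equals("all")) {
            return -1;
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("levels is not 'all' or a number: " + arg, e);
        }
    }

    // True when the value would trigger the "bigger than 3" warning.
    static boolean isLarge(int levels) {
        return levels < 0 || levels > 3;
    }

    public static void main(String[] args) {
        // One representative per equivalence class.
        System.out.println(parse("levels=2"));    // valid number
        System.out.println(parse("levels=all"));  // unlimited
        try {
            parse("levels=many");                 // invalid class
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // Boundary values on both sides of the threshold.
        System.out.println(isLarge(3));
        System.out.println(isLarge(4));
    }
}
```

Testing one value per class plus the values directly at and next to the boundary keeps the number of test cases small while still covering the places where off-by-one mistakes typically hide.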

6.1.3 SOFTWARE TESTING STRATEGIES:

A software testing strategy provides a road map for the software developer. Testing is a set of activities that can be planned in advance and conducted systematically. For this reason a template for software testing (a set of steps into which specific test case design methods can be placed) should be defined for the software engineering process. Any software testing strategy should have the following characteristics:

1. Testing begins at the module level and works “outward” toward

the integration of the entire computer based system.

2. Different testing techniques are appropriate at different points in

time.

3. The developer of the software and an independent test group conduct testing.

4. Testing and Debugging are different activities but debugging

must be accommodated in any testing strategy.

Unit Testing: Unit testing focuses verification efforts on the smallest unit of software design (the module).

1. Unit test considerations

2. Unit test procedures
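
A unit test in this sense can be sketched for the smallest piece of the TaskController constructor, the extraction of the domain name from the start URL. The class DomainName and its methods are hypothetical names introduced here so that the logic can be exercised in isolation; the checks use plain Java rather than a test framework.

```java
class DomainName {
    // Mirrors the constructor logic: strip "http://" and cut at the first '/'.
    static String of(String startUrl) {
        if (!startUrl.toLowerCase().startsWith("http://") || startUrl.length() < 8) {
            throw new IllegalArgumentException("Start-URL isn't absolute: " + startUrl);
        }
        int pos = startUrl.indexOf('/', 7);
        return pos > -1 ? startUrl.substring(7, pos) : startUrl.substring(7);
    }

    // Tiny assertion helper so the example needs no test library.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        check(of("http://www.example.com/a/b.html").equals("www.example.com"), "path is stripped");
        check(of("http://www.example.com").equals("www.example.com"), "no trailing slash");
        boolean rejected = false;
        try {
            of("ftp://www.example.com/");
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        check(rejected, "non-http URL is rejected");
        System.out.println("all unit checks passed");
    }
}
```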

Integration Testing: Integration testing is a systematic technique for

constructing the program structure while conducting tests to uncover

errors associated with interfacing. There are two types of integration

testing:

1. Top-Down Integration: Top-down integration is an incremental approach to the construction of program structures. Modules are integrated by moving downward through the control hierarchy, beginning with the main control module.

2. Bottom-Up Integration: Bottom-up integration, as its name implies, begins construction and testing with atomic modules.

Regression Testing: In the context of an integration test strategy, regression testing is the re-execution of some subset of tests that have already been conducted, to ensure that changes have not propagated unintended side effects.

VALIDATION TESTING:

At the culmination of integration testing, software is completely

assembled as a package; interfacing errors have been uncovered and

corrected, and a final series of software tests – validation testing – may

begin. Validation can be defined in many ways, but a simple definition is

that validation succeeds when software functions in a manner that can be

reasonably expected by the customer.

Reasonable expectation is defined in the software requirement

specification – a document that describes all user-visible attributes of the

software. The specification contains a section titled “Validation Criteria”.

Information contained in that section forms the basis for a validation

testing approach.

VALIDATION TEST CRITERIA:


Software validation is achieved through a series of black-box tests

that demonstrate conformity with requirements. A test plan outlines the

classes of tests to be conducted, and a test procedure defines specific test

cases that will be used in an attempt to uncover errors in conformity with

requirements. Both the plan and procedure are designed to ensure that all

functional requirements are satisfied; all performance requirements are

achieved; documentation is correct and human-engineered; and other

requirements are met.

After each validation test case has been conducted, one of two

possible conditions exists: (1) the function or performance characteristics

conform to specification and are accepted, or (2) a deviation from

specification is uncovered and a deficiency list is created. A deviation or

error discovered at this stage in a project can rarely be corrected prior to

scheduled completion. It is often necessary to negotiate with the customer

to establish a method for resolving deficiencies.

CONFIGURATION REVIEW:
An important element of the validation process is a configuration

review. The intent of the review is to ensure that all elements of the

software configuration have been properly developed, are catalogued, and

have the necessary detail to support the maintenance phase of the

software life cycle. The configuration review is sometimes called an audit.

Alpha and Beta Testing:


It is virtually impossible for a software developer to foresee how

the customer will really use a program. Instructions for use may be

misinterpreted; strange combinations of data may be regularly used; and

output that seemed clear to the tester may be unintelligible to a user in the

field.

When custom software is built for one customer, a series of

acceptance tests are conducted to enable the customer to validate all

requirements. Conducted by the end user rather than the system

developer, an acceptance test can range from an informal “test drive” to a

planned and systematically executed series of tests. In fact, acceptance

testing can be conducted over a period of weeks or months, thereby

uncovering cumulative errors that might degrade the system over time.

If software is developed as a product to be used by many

customers, it is impractical to perform formal acceptance tests with each

one. Most software product builders use a process called alpha and beta

testing to uncover errors that only the end user seems able to find.

A customer conducts the alpha test at the developer’s site. The

software is used in a natural setting with the developer “looking over the

shoulder” of the user and recording errors and usage problems. Alpha

tests are conducted in a controlled environment.

The beta test is conducted at one or more customer sites by the end

user of the software. Unlike alpha testing, the developer is generally not

present. Therefore, the beta test is a “live” application of the software in

an environment that cannot be controlled by the developer. The customer

records all problems that are encountered during beta testing and reports

these to the developer at regular intervals. As a result of problems

reported during beta tests, the software developer makes modifications and

then prepares for release of the software product to the entire customer

base.

System Environment:

The system was developed and tested in the following environment:

Processor speed - 1.7 GHz (Intel Pentium)
Java Software Development Kit (SDK) - Java version 6.0
Internet connection - high-speed leased line

In any software development, the quality of the system is measured first by its correctness and reliability, and second by the degree to which it satisfies the demands of the initial requirement analysis. Both white box testing and black box testing were carried out. Writing an appropriate test program, which acted as a stub, helped in testing each module as it was developed. Most of the testing was carried out on the local host. The sites on which the web mirror service was run included:
1. http://localhost:8080/ depth=5 levels
2. http://www.rediff.com/ depth=2 levels
3. http://www.yahoo.com/ depth=1 level
4. http://www.google.com/ depth=4 levels
Another major aspect was testing before and after integration, which played a crucial role in the development of the project. Integrating after every development cycle exposed problems at an early stage. This led to changes in the design of the software, such as building the server interface and establishing a message queue.

6.2 Problems faced

i. Although all the pages were downloaded and the appropriate URLs rewritten to relative paths, the module threw a "JavaScript: unknown protocol" exception. This exception arose because the HTML parser was unable to cope with JavaScript's start tag. The exception was therefore caught and the code that handled such links was removed. The application boundary was thus limited to downloading only HTTP-supported pages.

ii. Pages such as ASP and .shtml pages could not be downloaded. ASP pages and pages behind secure socket connections require user interaction, which a normal web browser handles better than an offline browser whose main job is to download pages with minimal user interaction. The boundary was therefore drawn at downloading only HTML, JPEG and GIF files, as these are the formats that Java supports by default. The layout and the testing results are shown in Appendix-A.

iii. In the first cycle we developed a small skeleton structure, i.e. a small GUI and the client, and integrating the two was a real problem. References to client objects created in one method or object could not be passed to other objects, because these objects were created dynamically in response to user-generated events. The design was therefore enhanced with a message queue, which acts as an interface not only for generating events but also as a callback mechanism for updating the GUI components.
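
The exception described in problem i can be reproduced with the standard java.net.URL class, which rejects schemes for which no protocol handler is registered. The following sketch shows one way to filter such links before scheduling them. LinkFilter and isDownloadable are hypothetical names, and restricting the check to "http" is an assumption taken from the HTTP-only boundary described above.

```java
import java.net.MalformedURLException;
import java.net.URL;

class LinkFilter {
    // Returns true only for links that java.net.URL can parse and that use HTTP.
    static boolean isDownloadable(String link) {
        try {
            URL url = new URL(link);
            return "http".equalsIgnoreCase(url.getProtocol());
        } catch (MalformedURLException e) {
            // Raised for schemes without a protocol handler, such as "javascript:"
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isDownloadable("http://localhost:8080/index.html"));
        System.out.println(isDownloadable("javascript:void(0)"));
    }
}
```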

It is therefore now possible to plug a new GUI into the back end and vice versa. The only things an integrator needs to take care of are the server and its message constants. Although most exceptions are caught and an appropriate message is shown to the user, the exception or error and its stack trace are still printed on the command line for debugging purposes.
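
The decoupling through a message queue and message constants can be sketched as follows. This is not the project's actual interface: MessageQueueSketch, its constants and the Listener callback are assumptions made for illustration, and the real back end may deliver messages on a separate thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MessageQueueSketch {
    // Hypothetical message constants shared by GUI and back end.
    static final int DOWNLOAD_STARTED = 1;
    static final int DOWNLOAD_FINISHED = 2;

    static final class Message {
        final int type;
        final String payload;
        Message(int type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Callback through which a GUI (or any other front end) is updated.
    interface Listener {
        void onMessage(Message message);
    }

    private final BlockingQueue<Message> queue = new LinkedBlockingQueue<Message>();
    private final List<Listener> listeners = new ArrayList<Listener>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    // Called by the back end; it never needs a reference to GUI objects.
    void post(int type, String payload) {
        queue.add(new Message(type, payload));
    }

    // Delivers all queued messages to every listener and returns how many were delivered.
    int dispatchPending() {
        int delivered = 0;
        Message message;
        while ((message = queue.poll()) != null) {
            for (Listener listener : listeners) {
                listener.onMessage(message);
            }
            delivered++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        MessageQueueSketch bus = new MessageQueueSketch();
        bus.addListener(new Listener() {
            public void onMessage(Message message) {
                System.out.println(message.type + " " + message.payload);
            }
        });
        bus.post(DOWNLOAD_STARTED, "http://localhost:8080/");
        bus.post(DOWNLOAD_FINISHED, "http://localhost:8080/");
        bus.dispatchPending();
    }
}
```

Because both sides depend only on the queue and the constants, a different GUI can be attached by registering another listener.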

The results before and after the integration are shown in the
APPENDIX – A.

SCREEN SHOTS

Screen: Display of the downloaded web page of www.andhrauniversity.info saved on the disk. Notice that the page is displayed from a local folder.

Screen 4 : Display of the downloaded web page of site

APPENDIX

Technologies Used

1. J2SDK 1.6
2. Unified Modeling Language (UML)

Why Java?

I. Introduction:

Java is a general-purpose object-oriented programming language, which


researchers at Sun Microsystems originally developed to control
intelligent consumer electronic devices. Now best known as a web
programming language, Java's functionality extends beyond the web with
support for the development of general purpose applications in both
standalone and client-server environments on a variety of hardware
platforms.

II. Design Issues

A. Java is Object Oriented

Everything in Java is related to the class construct, which forms the


basis for object oriented programming in the language.

1. Basics of Objects

At its simplest, object technology is a collection of analysis,


design, and programming methodologies that focuses design on modeling
the characteristics and behavior of objects in the real world. Objects have
state and behavior. For example, an object can model a car. A car has
state (how fast it's going, in which direction, its fuel consumption, and so
on) and behavior (starts, stops, turns, slides, and runs into trees).

In the programming implementation of an object, its state is
defined by its instance variables. Instance variables are usually private to
an object. Unless explicitly made public or made available to other
"friendly" classes, an object's instance variables are inaccessible from
outside the object.

An object's behavior is defined by its methods. Methods


manipulate the instance variables to create new state; an object's methods
can also create new objects.

2. Classes

A class is a software construct that defines the instance variables


and methods of an object. A class in and of itself is not an object. A class
is a pattern that defines how an object will look and behave when the
object is created or instantiated from the specification declared by the
class. You can instantiate many objects from one class definition, just as
you can construct many houses that are all the same from a single
architect's drawing.
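
The car example can be written as a small class to make the terms concrete. This is only an illustrative sketch; the class Car and its members are invented for this purpose. The instance variables hold the state, the methods define the behavior, and two objects instantiated from the same class keep independent state.

```java
class Car {
    // State: instance variables, private to each object.
    private int speedKmh;
    private String direction;

    Car(String direction) {
        this.direction = direction;
    }

    // Behavior: methods that change or report the state.
    void accelerate(int deltaKmh) {
        speedKmh += deltaKmh;
    }

    void stop() {
        speedKmh = 0;
    }

    void turn(String newDirection) {
        direction = newDirection;
    }

    int getSpeedKmh() {
        return speedKmh;
    }

    String getDirection() {
        return direction;
    }

    public static void main(String[] args) {
        // Two objects instantiated from one class definition.
        Car first = new Car("north");
        Car second = new Car("north");
        first.accelerate(50);
        first.turn("east");
        System.out.println(first.getSpeedKmh() + " " + first.getDirection());   // prints 50 east
        System.out.println(second.getSpeedKmh() + " " + second.getDirection()); // prints 0 north
    }
}
```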

B. Java is designed to be Robust and Secure:

Java is intended for developing software that must be robust,


highly reliable, and secure, in a variety of ways. There's strong emphasis
on early checking for possible problems, as well as later dynamic (run-
time) checking, to eliminate error-prone situations.

Java removes the memory management load from the programmer. Automatic garbage collection is an integral part of Java and its run-time system. While Java has a new operator to allocate memory for objects, there is no explicit free function. Once you have allocated an object, the run-time system keeps track of the object's status and automatically reclaims memory when objects are no longer in use, freeing memory for future use.

Java applets have even stronger security constraints. The Java


"sandbox" (the runtime interpreter in a browser) examines the applet for
any untoward instructions as the applet is being loaded. For example, the
applet cannot access files on a client’s disk, protecting the client’s data
and preventing any viruses from being written.

C. Java is designed to be Architecturally Neutral and Portable:

The solution that the Java system adopts to solve the platform-specific binary-code distribution problem is a bytecode format that is independent of hardware architectures, operating system interfaces, and window systems.

The Java compiler doesn't generate "machine code" in the sense of native hardware instructions; rather, it generates bytecodes for a high-level, machine-independent virtual machine that is implemented by the Java interpreter and run-time system.

D. Java is Interpreted and Dynamic:

The Java language's portable and interpreted nature produces a


highly dynamic and dynamically extensible system. The Java language
was designed to adapt to evolving environments. Classes are linked in as
required and can be downloaded from across networks. Incoming code is
verified before being passed to the interpreter for execution.

• The interpreted environment enables fast prototyping without waiting for the traditional compile and link cycle.
• The environment is dynamically extensible, whereby classes are loaded on the fly as required.
• The fragile superclass problem that plagues C++ developers is eliminated because memory layout decisions are deferred to run time.

E. Java is Designed to Support Multithreaded Applications:


1. Threads at the Java Language Level

Threads are an essential keystone of Java. The Java library


provides a Thread class that supports a rich collection of methods to start
a thread, run a thread, stop a thread, and check on a thread's status. You
don’t need to worry about whether there are many processors or just one;
the same model works.

Java's threads are pre-emptive and, depending on the platform on which the Java interpreter executes, can also be time-sliced.

2. Integrated Thread Synchronization:

Java supports multithreading at the language (syntactic) level and


via support from its run-time system and thread objects. At the language
level, methods within a class that are declared synchronized do not run
concurrently. Such methods run under control of monitors to ensure that
variables remain in a consistent state. Every class and instantiated object
has its own monitor that comes into play if required. This model is quite
useful when accessing shared resources, such as a printer, or a server-side
file that several clients may need to modify.
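
A minimal sketch of both ideas, the Thread class and synchronized methods, is given below. SharedCounter and runWorkers are names chosen for this example. Several threads increment one shared counter, and because increment() is synchronized the object's monitor admits one thread at a time, so no update is lost.

```java
class SharedCounter {
    private int value;

    // Only one thread at a time may run a synchronized method of the same object.
    synchronized void increment() {
        value++;
    }

    synchronized int get() {
        return value;
    }

    // Starts 'threads' workers that each call increment() 'perThread' times.
    static int runWorkers(int threads, final int perThread) {
        final SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < perThread; j++) {
                        counter.increment();
                    }
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait until every worker has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(4, 10000)); // prints 40000
    }
}
```

Without the synchronized keyword the read-modify-write in value++ could interleave between threads and the final count could come out lower.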

UML:
"The Unified Modeling Language (UML) is a graphical language
for visualizing, specifying, constructing, and documenting the artifacts of
a software-intensive system. The UML offers a standard way to write a
system's blueprints, including conceptual things such as business
processes and system functions as well as concrete things such as
programming language statements, database schemas, and reusable
software components."

UML defines the notation and semantics for the following domains:

Ø The User Interaction or Use Case Model - describes the boundary
and interaction between the system and users. Corresponds in some
respects to a requirements model. (see The Use Case Model)
Ø The Interaction or Collaboration Model - describes how objects in
the system will interact with each other to get work done.
Ø The Dynamic Model - State charts describe the states or conditions
that classes assume over time. Activity graphs describe the
workflows the system will implement. (see The Dynamic Model)
Ø The Logical or Class Model - describes the classes and objects that
will make up the system. (see The Class Model)
Ø The Physical Component Model - describes the software (and
sometimes hardware) components that make up the system. (see
The Component Model)
Ø The Physical Deployment Model - describes the physical
architecture and the deployment of components on that hardware
architecture. (see The Physical Model)

NETWORKING TERMINOLOGY

What Is a URL?

It's often easiest, although not entirely accurate, to think of a URL as
the name of a file on the World Wide Web because most URLs refer to a
file on some machine on the network. However, remember that URLs also can
point to other resources on the network, such as database queries and
command output.

Definition: URL is an acronym for Uniform Resource Locator and is a
reference (an address) to a resource on the Internet.

The following is an example of a URL which addresses the Java Web site
hosted by Sun Microsystems:

http://java.sun.com/

As the example shows, a URL has two main components:

Ø Protocol identifier
Ø Resource name

The resource name is the complete address to the resource. The format of
the resource name depends entirely on the protocol used, but for many
protocols, including HTTP, the resource name contains one or more of the
components listed in the following table:

Host Name     The name of the machine on which the resource lives.

Filename      The pathname to the file on the machine.

Port Number   The port number to which to connect (typically optional).

Reference     A reference to a named anchor within a resource that
              usually identifies a specific location within a file
              (typically optional).

For many protocols, the host name and the filename are required,
while the port number and reference are optional. For example, the
resource name for an HTTP URL must specify a server on the network
(Host Name) and the path to the document on that machine (Filename); it
also can specify a port number and a reference. In the URL for the Java
Web site, java.sun.com is the host name and the trailing slash is
shorthand for the file named /index.html.
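
The components in the table can be read back from a URL string with the java.net.URL class. The sketch below is illustrative only: the class name UrlParts, the method describe, and the second example URL are assumptions. Note that a URL with no explicit port reports -1, so the sketch falls back to the protocol's default port.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class; names are illustrative only.
class UrlParts {

    // Splits a URL string into the components described above.
    // No network connection is made; the string is only parsed.
    static String describe(String spec) {
        URL url;
        try {
            url = URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a usable URL: " + spec, e);
        }
        // getPort() returns -1 when the URL does not name a port.
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        return "protocol=" + url.getProtocol()
                + " host=" + url.getHost()
                + " port=" + port
                + " file=" + url.getFile()
                + " ref=" + url.getRef();
    }

    public static void main(String[] args) {
        System.out.println(describe("http://java.sun.com/"));
        // Assumed example with an explicit port and a reference:
        System.out.println(describe("http://example.com:8080/a/b.html#top"));
    }
}
```

For the first URL the file component is just "/" and the reference is null, which matches the description of the trailing slash given above.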

HTTP

HTTP is a protocol with the lightness and speed necessary for a
distributed collaborative hypermedia information system. It is a generic,
stateless, object-oriented protocol, which may be used for many similar
tasks, such as name servers and distributed object-oriented systems, by
extending the commands, or "methods", used. A feature of HTTP is the
negotiation of data representation, allowing systems to be built
independently of the development of new advanced representations.

Structure of HTTP Transactions

Like most network protocols, HTTP uses the client-server model: An HTTP
client opens a connection and sends a request message to an HTTP server;
the server then returns a response message, usually containing the
resource that was requested. After delivering the response, the server
closes the connection (making HTTP a stateless protocol, i.e. not
maintaining any connection information between transactions).

The format of the request and response messages is similar, and
English-oriented. Both kinds of messages consist of:

Ø an initial line,
Ø zero or more header lines,
Ø a blank line (i.e. a CRLF by itself), and
Ø an optional message body (e.g. a file, or query data, or
query output).

Initial Request Line

The initial line is different for the request than for the response. A
request line has three parts, separated by spaces: a method name, the
local path of the requested resource, and the version of HTTP being used.
A typical request line is:

GET /path/to/file/index.html HTTP/1.0
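
Splitting a request line into its three parts can be sketched as follows. The class RequestLine and its fields are hypothetical names used only for illustration.

```java
// Hypothetical demo class; names are illustrative only.
class RequestLine {
    final String method;
    final String path;
    final String version;

    RequestLine(String method, String path, String version) {
        this.method = method;
        this.path = path;
        this.version = version;
    }

    // Parses "METHOD path HTTP-version"; the three parts are separated by spaces.
    static RequestLine parse(String line) {
        String[] parts = line.trim().split(" ");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected three parts: " + line);
        }
        return new RequestLine(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        RequestLine r = parse("GET /path/to/file/index.html HTTP/1.0");
        System.out.println(r.method + " | " + r.path + " | " + r.version);
    }
}
```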

Initial Response Line (Status Line)

The initial response line, called the status line, also has three parts
separated by spaces: the HTTP version, a response status code that gives
the result of the request, and an English reason phrase describing the
status code.

Ø The status code is a three-digit integer, and the first digit identifies
the general category of response:

Þ 1xx indicates an informational message only
Þ 2xx indicates success of some kind
Þ 3xx redirects the client to another URL
Þ 4xx indicates an error on the client's part
Þ 5xx indicates an error on the server's part
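
The category rule above (first digit of the status code) can be sketched in a few lines. The class StatusLine, the method category, and the category labels it returns are assumptions made for this illustration.

```java
// Hypothetical demo class; names and labels are illustrative only.
class StatusLine {

    // Takes a status line such as "HTTP/1.0 404 Not Found" and returns
    // the general category given by the first digit of the status code.
    static String category(String statusLine) {
        // Limit of 3 keeps a multi-word reason phrase in one piece.
        String[] parts = statusLine.trim().split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("no status code in: " + statusLine);
        }
        int code = Integer.parseInt(parts[1]);
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default:
                throw new IllegalArgumentException("unknown status code: " + code);
        }
    }

    public static void main(String[] args) {
        System.out.println(category("HTTP/1.0 200 OK"));
        System.out.println(category("HTTP/1.0 404 Not Found"));
    }
}
```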

Header Lines

Header lines provide information about the request or response, or about
the object sent in the message body.

The Message Body

An HTTP message may have a body of data sent after the header
lines. In a response, this is where the requested resource is returned to the
client (the most common use of the message body), or perhaps
explanatory text if there's an error. In a request, this is where user-entered
data or uploaded files are sent to the server.
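
The message layout described so far (an initial line, header lines, a blank line, then the body) can be sketched for a request that carries form data. This is an illustration only: the class PostMessageSketch, the path /cgi-bin/search, and the form data are assumptions. Content-Type and Content-Length are standard HTTP headers that describe the body, although the text above does not name them.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names, path, and data are illustrative only.
class PostMessageSketch {
    static final String CRLF = "\r\n";

    // Builds the text of an HTTP/1.0 POST request whose body is `formData`.
    static String buildPost(String path, String formData) {
        // Content-Length counts bytes, not characters.
        int length = formData.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.0" + CRLF                    // initial line
                + "Content-Type: application/x-www-form-urlencoded" + CRLF
                + "Content-Length: " + length + CRLF                   // header lines
                + CRLF                                                 // blank line
                + formData;                                            // message body
    }

    public static void main(String[] args) {
        System.out.print(buildPost("/cgi-bin/search", "name=Ada&topic=java"));
    }
}
```

The blank line is what tells the receiver that the headers have ended and the body begins.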

Other HTTP Methods, Like HEAD and POST

Besides GET, the two most commonly used methods are HEAD
and POST.

The HEAD Method

A HEAD request is just like a GET request, except it asks the server to
return the response headers only, and not the actual resource (i.e. no
message body). This is useful to check characteristics of a resource
without actually downloading it, thus saving bandwidth. Use HEAD when you
don't actually need a file's contents.

The response to a HEAD request must never contain a message body, just
the status line and headers.

The POST Method:

A POST request is used to send data to the server to be processed in
some way, such as by a CGI script. A POST request differs from a GET
request chiefly in that it carries a block of data in the message body.

Networking Basics

Computers running on the Internet communicate with each other using
either the Transmission Control Protocol (TCP) or the User Datagram
Protocol (UDP), as this diagram illustrates:

When you write Java programs that communicate over the
network, you are programming at the application layer. Typically, you
don't need to concern yourself with the TCP and UDP layers. Instead,
you can use the classes in the java.net package. These classes provide
system-independent network communication. However, to decide which
Java classes your programs should use, you do need to understand how
TCP and UDP differ.

TCP:

When two applications want to communicate with each other reliably, they
establish a connection and send data back and forth over that connection.
This is analogous to making a telephone call. If you want to speak to
Aunt Beatrice in Kentucky, a connection is established when you dial her
phone number and she answers. You send data back and forth over the
connection by speaking to one another over the phone lines. Like the
phone company, TCP guarantees that data sent from one end of the
connection actually gets to the other end and in the same order it was
sent. Otherwise, an error is reported.

TCP provides a point-to-point channel for applications that require
reliable communications. The Hypertext Transfer Protocol (HTTP), File
Transfer Protocol (FTP), and Telnet are all examples of applications that
require a reliable communication channel. The order in which the data is
sent and received over the network is critical to the success of these
applications. When HTTP is used to read from a URL, the data must be
received in the order in which it was sent. Otherwise, you end up with a
jumbled HTML file, a corrupt zip file, or some other invalid information.

Definition: TCP (Transmission Control Protocol) is a connection-based
protocol that provides a reliable flow of data between two computers.
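
A connection-based exchange can be sketched with the java.net classes ServerSocket and Socket. The example below is a minimal illustration, not the project's code: the class TcpEchoSketch and its methods are assumed names, both ends run in one program on the loopback address, and port 0 asks the system to pick any free port.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class TcpEchoSketch {

    // Server side: accept one connection and send back the line received.
    private static void serveOnce(ServerSocket server) {
        try (Socket peer = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(peer.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(peer.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println(in.readLine());
        } catch (IOException e) {
            throw new IllegalStateException("echo server failed", e);
        }
    }

    // Client side: connect, send one line, and return the echoed reply.
    static String echoOnce(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            new Thread(() -> serveOnce(server)).start();
            try (Socket client = new Socket(loopback, server.getLocalPort());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                out.println(message); // auto-flush sends the line immediately
                return in.readLine();
            }
        } catch (IOException e) {
            throw new IllegalStateException("echo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("hello over TCP"));
    }
}
```

Both ends read and write through ordinary streams; ordering and delivery are handled by TCP underneath.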

UDP

The UDP protocol provides for communication that is not guaranteed
between two applications on the network. UDP is not connection-based like
TCP. Rather, it sends independent packets of data, called datagrams, from
one application to another. Sending datagrams is much like sending a
letter through the postal service: The order of delivery is not important
and is not guaranteed, and each message is independent of any other.

Definition: UDP (User Datagram Protocol) is a protocol that sends
independent packets of data, called datagrams, from one computer to
another with no guarantees about arrival. UDP is not connection-based
like TCP.
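
Datagram communication can be sketched with the java.net classes DatagramSocket and DatagramPacket. As before, this is only an illustration under assumed names (UdpSketch, sendAndReceive), with both ends on the loopback address. Because UDP gives no delivery guarantee, the receiver sets a timeout rather than waiting forever.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class UdpSketch {

    // Sends one datagram to a receiver on the loopback address and
    // returns the text that the receiver got.
    static String sendAndReceive(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback);
             DatagramSocket sender = new DatagramSocket()) {
            receiver.setSoTimeout(2000); // give up after 2 s if nothing arrives

            // No connection is set up: each packet carries its own
            // destination address and port.
            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(
                    payload, payload.length, loopback, receiver.getLocalPort()));

            byte[] buffer = new byte[1024];
            DatagramPacket incoming = new DatagramPacket(buffer, buffer.length);
            receiver.receive(incoming);
            return new String(incoming.getData(), incoming.getOffset(),
                    incoming.getLength(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IllegalStateException("datagram exchange failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAndReceive("hello over UDP"));
    }
}
```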

Understanding Ports

Generally speaking, a computer has a single physical connection to the
network. All data destined for a particular computer arrives through that
connection. However, the data may be intended for different applications
running on the computer. So how does the computer know to which
application to forward the data? Through the use of ports.

Data transmitted over the Internet is accompanied by addressing
information that identifies the computer and the port for which it is
destined. The computer is identified by its 32-bit IP address, which IP
uses to deliver data to the right computer on the network. Ports are
identified by a 16-bit number, which TCP and UDP use to deliver the data
to the right application.

In connection-based communication such as TCP, a server application
binds a socket to a specific port number. This has the effect of
registering the server with the system to receive all data destined for
that port. A client can then rendezvous with the server at the server's
port, as illustrated here:

Definition: The TCP and UDP protocols use ports to map incoming data
to a particular process running on a computer.

In datagram-based communication such as UDP, the datagram packet
contains the port number of its destination and UDP routes the packet to
the appropriate application, as illustrated in this figure:

Port numbers range from 0 to 65,535 because ports are represented by
16-bit numbers. The port numbers ranging from 0 to 1023 are restricted;
they are reserved for use by well-known services such as HTTP and FTP and
other system services. These ports are called well-known ports. Your
applications should not attempt to bind to them.
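
The arithmetic behind these ranges is short enough to sketch: a 16-bit number has 2^16 = 65,536 possible values, so the largest port is 65,535. The class PortSketch and its method names are assumptions made for this illustration.

```java
// Hypothetical demo class; names are illustrative only.
class PortSketch {
    // Largest value that fits in 16 bits: 2^16 - 1 = 65535.
    static final int MAX_PORT = (1 << 16) - 1;
    // Upper end of the restricted well-known range described above.
    static final int MAX_WELL_KNOWN_PORT = 1023;

    static boolean isValid(int port) {
        return port >= 0 && port <= MAX_PORT;
    }

    static boolean isWellKnown(int port) {
        return port >= 0 && port <= MAX_WELL_KNOWN_PORT;
    }

    public static void main(String[] args) {
        System.out.println(MAX_PORT);          // prints 65535
        System.out.println(isWellKnown(80));   // prints true
        System.out.println(isWellKnown(8080)); // prints false
    }
}
```

An application could use a check like isWellKnown to reject a configured port before trying to bind to it.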

CONCLUSIONS


The project was intended to interact with the user through a GUI,
allowing the user to deal easily with a web site as a whole. Although
this application lacks some of the features present in existing systems,
it incorporates some new ones.

This application helps the user get a broad picture of the web site and
aims to provide a set of services that help the user deal with a remote
web site in an easy and flexible manner. The application has satisfied
most of the requirements initially drawn up during the problem
description phase. The following conclusions can be drawn from the
development of the project:

i. It provides an easy tool for the person browsing to deal with sites
while searching for information.

ii. It avoids the delay caused by establishing connections if the web
site is already downloaded.

iii. It can help in sharing the web site over the intranet.

To improve the functionality and usefulness of the application, the
following enhancements can be made:

i. Enabling the application to handle any type of protocol, e.g.
shttp, gopher, FTP, etc.

ii. Letting the user interpret any type of file that can be displayed
in the internal browser.

iii. Automatically downloading files whose contents have changed,
based on the date of last modification.
