Existing approaches to data extraction include wrapper induction and
automated methods. In this project, an instance-based learning method,
which performs extraction by comparing each new instance to be
extracted with labeled instances, is studied. The key advantage of this
method is that it does not require an initial set of labeled pages to learn
extraction rules, as wrapper induction does. Instead, the algorithm can
start extraction from a single labeled instance; a new instance needs
labeling only when it cannot be extracted. This avoids unnecessary page
labeling and thereby addresses a major problem with inductive learning (or
wrapper induction), i.e., that the set of labeled instances may not be
representative of all other instances.
The instance-based approach is very natural because
structured data on the Web usually follow fixed templates. Pages of
the same template can usually be extracted based on a single page
instance of the template. This technique matches a new instance with
a manually labeled instance and, in the process, extracts the required data
items from the new instance. The system provides a domain-specific
search utility, which can access and collect data from the deep Web.
INDEX
1. INTRODUCTION
Introduction about the Project
2. PROBLEM ANALYSIS
Existing System
Proposed System
System Requirements Specification
Software Requirements and Hardware Requirements
3. SYSTEM ANALYSIS
Introduction
Project Analysis
Use Case Diagrams
4. SYSTEM DESIGN
Introduction
Detail Design
Data Flow Diagram
Sequence Diagram
5. SAMPLE CODE
6. TESTING
7. SCREENSHOTS
Appendix-A
8. CONCLUSION AND FUTURE ENHANCEMENT
9. BIBLIOGRAPHY
INTRODUCTION
In the scenario of knowledge-based organizations, virtual
communities are emerging as a new organizational form supporting
knowledge sharing, diffusion, and application processes. Such
communities do not operate in a vacuum; rather, they have to coexist with
a huge amount of digital information, such as text or semi-structured
documents in the form of Web pages, reports, papers, and e-mails.
Heterogeneous information sources often contain valuable
information that can increase the community members' shared knowledge,
acting as high-bandwidth information-exchange channels.
1.1.1 Main Goals:
The main goals for which this application has been developed are:
1. Easily saving complete web sites to local auxiliary memory.
2. Downloading only specific types of files, depending upon the user's
choice.
3. Browsing through the web site so as to give users the impression
that they are connected online.
1.1.2 Features:
1.1.3 User-Domain:
ii. Software developers: it helps in the bulk download of tutorials
for the newest technologies.
1. Fetching the web page requested by the user.
2. Parsing the fetched page for the URLs (links) present and
storing them in a queue.
3. Fetching the web pages pointed to by the URLs present
in the queue.
4. Converting the absolute links to relative URLs and storing the
files in secondary memory in the same hierarchy as that
present on the remote web site.
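Steps 2 and 3 above can be sketched in Java as follows; the regex-based link extraction, the class name CrawlSketch, and the sample URLs are illustrative assumptions only (the actual tool uses an HTML parser rather than a regular expression):

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {
    // Matches href="..." attributes; a simplification for illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Parses an already fetched page for links and stores them in a queue,
    // resolving relative links against the page's own URL so they can be
    // fetched in turn.
    public static Queue<String> extractLinks(String baseUrl, String html) throws Exception {
        Queue<String> queue = new ArrayDeque<>();
        URL base = new URL(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            queue.add(new URL(base, m.group(1)).toString());
        }
        return queue;
    }

    public static void main(String[] args) throws Exception {
        String html = "<a href=\"page2.html\">next</a> <a href=\"http://other.com/x\">ext</a>";
        Queue<String> q = extractLinks("http://www.test.com/index.html", html);
        System.out.println(q);
    }
}
```

Each URL pulled from this queue would then be fetched in turn, repeating the cycle.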
The server is responsible for carrying out the requests of the
user. It acts as an interface between the GUI part and the client
part of the system; thus the server acts as an integrator between the two
packages being developed.
i. Java provides a developer-friendly interface to multithreaded
programming, which is essential to implement the parallel and
distributed nature of the application.
ii. Java makes writing network programs easy with ready-to-use
classes.
iii. The Swing package can be used to develop a user-friendly
interface.
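The multithreaded, queue-driven download described in point (i) can be sketched as follows; the class and method names are invented for illustration and are not taken from the project code:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerPoolSketch {
    // Drains the queue with nThreads parallel workers and returns how many
    // items were processed. In the real tool each item would be a URL to fetch.
    public static int drain(ConcurrentLinkedQueue<String> queue, int nThreads)
            throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        Runnable worker = () -> {
            // Poll until the shared queue is empty; a real worker would
            // download the page here.
            while (queue.poll() != null) {
                processed.incrementAndGet();
            }
        };
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(worker);
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
        q.add("http://www.test.com/a.html");
        q.add("http://www.test.com/b.html");
        System.out.println(drain(q, 2)); // prints 2
    }
}
```

Several workers can safely share one queue because the concurrent collection handles the synchronization.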
PROBLEM ANALYSIS
2.1 Existing Systems:
There are already some applications that perform the same task, i.e.,
help in offline browsing, for example Meta-products Offline Browser
1.4, which formed a part of our study for developing the present
application. We used that application and were able to observe the
drawbacks of the system; this study helped us design our application
to overcome some of those drawbacks.
2.2 Proposed System:
SYSTEM ANALYSIS
3.1 Introduction:
The two important entities involved here are the clients and
the server. The clients, in the simplest case, retrieve data from the server
and display it. Servers respond to requests for data.
Web browsers retrieve data on demand: the user asks for a page at
a URL (Uniform Resource Locator) and the browser gets it. Search
programs that run on a single client system are called spiders. A spider
downloads a page at a particular URL, extracts the URLs from the links
on that page, downloads the pages referred to, and so on. Many tasks are
based on this principle, such as indexing the URLs in a database or
hunting for specific information.
We have used the spiral model for developing our application tool.
3.2.1 Methodology:
Several graph-theoretical approaches exist to ontology merging,
most of them relying on suitable graph algebras. The Onion system was
born as an attempt to reconcile ontologies underlying different biological
information systems. It heuristically identifies semantic correspondences
between ontologies and then uses a graph algebra based on these
correspondences for ontology composition. However, Onion is aimed at
merging fully fledged, competing ontologies rather than at enriching and
developing an initial ontology based on emerging domain knowledge.
The FCAMERGE technique is much closer to ours, inasmuch as it follows a
bottom-up approach, extracting instances from a given set of domain-
specific text documents by applying natural language processing
techniques. Based on the extracted instances, classical mathematical
techniques like Formal Concept Analysis (FCA) are then applied to
derive a lattice of concepts to be later manually transformed into an
ontology. The extraction of ontology classes from data items such as
documents is a crucial step of all bottom-up procedures. It bypasses a
typical problem of top-down ontology design, where, often at design
time, there are no real objects which can be used as a basis for identifying
and defining concepts. Historically, automatic knowledge extraction from
text documents started by indexing documents via vectors of
(normalized) keyword occurrences. This encoding gives rise to a vector
space where every document is seen as a vector in the term space (i.e., the
set of document words). Documents are then clustered into
(approximations of) concepts by means of a suitable distance function
computed on vectors, e.g., Euclidean or scalar-product-based ones.
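A scalar-product-based similarity on such keyword-occurrence vectors can be sketched as follows; the class name, the toy documents, and the three-term space are invented for the example:

```java
public class VectorSpaceSketch {
    // Cosine similarity between two documents encoded as vectors of
    // (normalized) keyword occurrences: the scalar product of the vectors
    // divided by the product of their lengths.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Two toy documents over a three-term space.
        double[] d1 = {1.0, 0.0, 1.0};
        double[] d2 = {1.0, 0.0, 1.0};
        double[] d3 = {0.0, 1.0, 0.0};
        System.out.println(cosine(d1, d2)); // identical direction -> 1.0
        System.out.println(cosine(d1, d3)); // no shared terms -> 0.0
    }
}
```

Documents whose similarity exceeds some threshold would then be placed in the same cluster.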
Traditional approaches to content-based clustering can be classified as
follows:
· Hierarchical Algorithms, creating a tree of node subsets by
successively subdividing or merging the graph's nodes. Typical
examples are k-Nearest-Neighbor (k-NN), linkage, or Ward
methods.
· Iterative Algorithms. The simplest and most widespread algorithm
is k-Means, resembling a self-organizing Kohonen network whose
neighborhood function is set to size 1.
· Metaheuristic Algorithms, treating clustering as an optimization
problem where a given goal is to be minimized or maximized
(genetic algorithms, simulated annealing, two-phase greedy
strategy, etc.).
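The k-Means algorithm named above can be sketched in one dimension as follows; this is a simplified illustration, and the sample points and starting centroids are invented:

```java
import java.util.Arrays;

public class KMeansSketch {
    // One-dimensional k-Means: assign each point to its nearest centroid,
    // then recompute each centroid as the mean of its assigned points,
    // and repeat for a fixed number of iterations.
    public static double[] kMeans(double[] points, double[] centroids, int iters) {
        int k = centroids.length;
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] cs = kMeans(pts, new double[]{0.0, 10.0}, 10);
        System.out.println(Arrays.toString(cs)); // prints [1.0, 9.0]
    }
}
```

In document clustering, the same loop runs over term vectors with a vector distance instead of the 1-D absolute difference.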
between the k neighbors and the test document or, alternatively, using a
similarity measure like the scalar product.
K-Means algorithm is an unsupervised technique often used in
document clustering applications. Hierarchical clustering algorithms
(including k-Nearest-Neighbor) are generally considered better, although
slower, than k-Means. For this reason, hybrid approaches involving both
k-Means and hierarchical methods have been proposed. Research work
on Web document analysis has shown that applying text-based
classification algorithms to Web data involves three major problems.
First, text-oriented techniques require a high number of documents
(typically, many thousands) to work properly. Second, they hardly take
into account document structure and are therefore unsuitable for semi-structured
document formats used on the Web, such as HTML or the
eXtensible Markup Language (XML). Third, the final step of identifying
document clusters with concepts often gives raw results that contrast with
human intuition. The conceptualization step can be significantly
improved only through the effort of a domain expert, which introduces a
delay not compatible with community-style Web interaction. Some
research approaches tried to address these problems by defining ad hoc
feature spaces for heterogeneous resource classification, independently
of any specific data model. Focusing on feature taxonomies, Gupta et al.
recently proposed a bottom-up learning procedure called TAXIND for
learning taxonomies; TAXIND operates on a matrix of asymmetric
similarity values. Fuhr and Weikum’s LASSIX project used a feature-
based technique for constructing personal or domain- specific ontologies
guiding users and XML search engines in refining queries. An important
contribution toward bridging the gap between conventional text-retrieval
and structure-aware techniques was given by Bordogna and Pasi, whose
work, however, does not specifically address Web and XML data. “Pure”
structure-based techniques were initially proposed by one of us (IEEE
Transactions on Knowledge and Data Engineering, vol. 19, no. 2,
February 2007) as the basis for approximate XML queries; more
recently, they were applied by Carchiolo et al. to semantic partitioning
of individual Web documents.
The schema extraction algorithm described by groups the contents of a
single Web page into collections which represent logical partitions of the
Web page. Carchiolo et al.’s technique takes into account subtree
similarity and primary tags in the XML tree for identifying these
collections, using the relative and absolute depth of primary tags from the
root of the tree for calculating the similarity between them. However,
these techniques are aimed at partitioning single Web pages rather than at
large-scale resource classification. Also, they do not address clustering
inside each partition. A major aspect of our approach is trust-based
enhancement of the extracted classifications. Most current approaches use
a different technique; namely, they try to adapt ontology merge
algorithms to support alignment as well. For instance, the seminal
PROMPT system used an incremental merge algorithm to detect
inconsistencies in the state of the ontology due to user updates and
suggested ways to remedy these inconsistencies. In principle, a
PROMPT-style approach could be adopted in conjunction with ours;
however, PROMPT admittedly requires substantial user intervention and,
therefore, it is not likely to scale well. We follow a different line of
research, considering that metadata can be generated by different sources
other than automatic extraction (the data owner, other users) and may or
may not be digitally signed. As a consequence, they have nonuniform
trustworthiness.
In order to take full advantage of automatically extracted
metadata, it is therefore fundamental that their trustworthiness be
continuously checked, e.g., based on the reported view of the user
community. Trustworthiness assessment is even more important when the
original source of metadata is an automatic metadata generator whose
error rate, though hopefully low, is in general not negligible.
Traditionally, research approaches distinguish between two main types of
trust management systems, namely, Centralized Reputation Systems and
Distributed Reputation Systems.
In centralized reputation systems, trust information is collected
from members in the community in the form of ratings on resources. The
central authority collects all the ratings and derives a score for each
resource. In a distributed reputation system, there is no central location
for submitting ratings and obtaining resources’ reputation scores; instead,
there are distributed stores where ratings can be submitted. In a “pure”
peer-to-peer setting, each peer has its own repository of trust values. In
both cases, initial trust values can be modified based on users’ navigation
activity. Information on user behavior can be captured and transformed in
a metadata layer expressing the trust degree related to the single assertion.
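A centralized reputation store of the kind described above can be sketched as follows; the class name, the neutral default of 0.5, and the mean-of-ratings scoring rule are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReputationSketch {
    // Central store mapping a resource id to the ratings submitted for it.
    private final Map<String, List<Double>> ratings = new HashMap<>();

    // A member of the community submits a rating for a resource.
    public void rate(String resource, double score) {
        ratings.computeIfAbsent(resource, r -> new ArrayList<>()).add(score);
    }

    // The central authority derives a reputation score as the mean rating;
    // unrated resources get a neutral default of 0.5.
    public double score(String resource) {
        List<Double> rs = ratings.get(resource);
        if (rs == null || rs.isEmpty()) return 0.5;
        double sum = 0;
        for (double r : rs) sum += r;
        return sum / rs.size();
    }

    public static void main(String[] args) {
        ReputationSketch store = new ReputationSketch();
        store.rate("doc42", 1.0);
        store.rate("doc42", 0.6);
        System.out.println(store.score("doc42")); // prints 0.8
    }
}
```

In a distributed reputation system each peer would hold such a map locally instead of submitting to one central store.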
Fig. 1. The tree representation of an XML fragment.
i. To decide and describe the functional requirements of the
system.
ii. To give a clear and consistent description of what the system
should do.
iii. To provide a basis for performing system tests that verify the
system.
iv. To provide the ability to trace functional requirements into
actual classes.
3.3 Use-Case Analysis:
The user interacts with the GUI and corresponding message is
generated. The Server decodes this message. The server services the
request generated. Since the server is responsible for the corresponding
action to be invoked, the Server acts as an actor for the services provided
by the backend. The Internet, which forms a part of the external
environment of the application, acts as the other actor with which the
backend interacts. The different messages, which the server interprets to
start the services of the back end, form the use cases. The use cases help
us in analyzing the different services to provide in response to the user's
interaction with the GUI (built with Java 6.0), which are then interpreted
by the server.
[Use-case diagram: the User <<uses>> the Local FileSystem.]
[Use-case diagram: the SERVER interface and Project Management.]
Alternative:
The string given does not conform to a valid URL.
Alternative:
There is no net connection.
Alternative:
The specified files in the URL are already downloaded.
Use Case:
i. One of the threads from the thread pool retrieves a URL from the
download queue and starts a new connection to download it.
ii. A corresponding file in the same hierarchy is created in
secondary memory.
iii. If the page is an HTML page, a new instance of the parser is
created, and the input stream is passed to the parser and its
corresponding callback mechanism.
iv. If the page is other than an HTML page, it is simply downloaded.
Alternative:
The corresponding URL does not exist.
Alternative:
The net connection has broken down.
WEB-MIRROR Use-Case Diagram
[Use-case diagram: the SERVER uses the Local File System; use cases: Project Wizard, Download Schedule, Export Feature.]
SYSTEM DESIGN
4.1 Introduction:
The project is developed with the main intention of providing the
user with a tool that offers an integrated set of services when
dealing with a particular Web site. As illustrated earlier, it has two main
sections:
A Graphical User Interface for interacting with the user.
A backend portion responsible for processing the user's requests.
Apart from the above parts we also have other utilities, which are:
A Queue, which is useful for
i. Exchanging messages between different parts of the program and
the server.
ii. Storing the URLs extracted from the HTML page in the download
item queue for retrieval.
A Utility class, which has various static functions that range from
date-conversion routines to determining whether a URL is absolute or
not.
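A utility function of the kind just mentioned can be sketched as follows (the class and method names are illustrative, not taken from the project code):

```java
public class UtilitySketch {
    // Returns true if the URL is absolute, i.e., starts with a scheme
    // such as "http://"; relative links like "img/logo.gif" return false.
    public static boolean isAbsoluteURL(String url) {
        String lower = url.toLowerCase();
        return lower.startsWith("http://") || lower.startsWith("https://")
                || lower.startsWith("ftp://");
    }

    public static void main(String[] args) {
        System.out.println(isAbsoluteURL("http://www.test.com/index.html")); // prints true
        System.out.println(isAbsoluteURL("images/logo.gif"));                // prints false
    }
}
```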
[Architecture diagram: the GUI (Java components and their corresponding ActionListeners) posts MessageObjects to a MESSAGE QUEUE; the SERVER processes the messages and invokes the service functions of the WEB-MIRROR SERVICE and the PROJECT MANAGEMENT SERVICE.]
DETAIL DESIGN
DESIGN OF THE CLIENT:
[Class diagram: server:Server starts webMirror:WebMirror, which starts 1..no_of_threads downloadThread:DownloadPage instances; these use downloadItemQueue:DownloadItemQueue, downloadItem:DownloadItem, urlparser:UrlParser, and a <<local file>> ProjectLog.]
Data Flow Diagram:-
Parser Flow Diagram
DESIGN OF THE PARSER:
The HTML parser is used for serving two main purposes. One is to
extract the local URLs and convert the relative URLs into absolute URLs
and insert them into the download item queue. The second purpose is to
edit the URLs in the html pages being downloaded. This is done in order
to convert the absolute URLs present in the html pages into relative
URLs so that the downloaded web site can be stored as a stand alone
website without requiring any outside support such as the Internet to
browse through the information.
The parser also stores the content of the html page after editing the
URLs into the output stream supplied by the user of the parser object.
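The absolute-to-relative URL conversion can be sketched with java.net.URI as follows; this is a minimal illustration where the class name and sample URLs are invented, and URI.relativize covers only the common case where the target lies under the base directory:

```java
import java.net.URI;

public class RelativizeSketch {
    // Converts an absolute URL into a URL relative to the directory of the
    // page being edited, so the saved site can be browsed offline.
    // java.net.URI.relativize returns a relative URI only when the target
    // is under the base, which is the usual case for a mirrored site.
    public static String toRelative(String baseDir, String absUrl) {
        URI base = URI.create(baseDir);
        URI target = URI.create(absUrl);
        return base.relativize(target).toString();
    }

    public static void main(String[] args) {
        String rel = toRelative("http://www.test.com/docs/",
                                "http://www.test.com/docs/img/logo.gif");
        System.out.println(rel); // prints img/logo.gif
    }
}
```

A link left absolute would force the saved page to reach out to the Internet; the relative form keeps it pointing into the local mirror.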
Fig 4.7 Sequence Diagram
[Sequence diagram: the Server posts messages to the Event Queue; WebMirror starts the DownloadPage threads with start(), which invoke the HTMLEditorKit parser via parse(); callback functions such as finishedParsing() and finishedDownload() post status messages, which update the status pane through invokeLater(updateItem), ending with the WECOPY_FINISHED message.]
CODING
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import java.util.Properties;
import java.util.logging.Level;
//util.Logging;
/**
 * Start-Command:
 * java mos.ol.OfflineLoader start-url|prop-file [levels=all,0,(1),2,3...]
 * [restrictDomain=(on),off] [dir|dirNew=directory-path]
 * [project=project-name] [proxy='host':'port']<BR>
 * Arguments: <BR>
 * <i>start-url|prop-file</i>: mandatory; specifies the web-page starting
 * point of the download; has to be an absolute URL starting with "http",
 * e.g. http://www.test.com/. Alternatively you can specify a property
 * file that contains a list of URLs with entries project=URL
 * <i>levels</i>: optional; specifies how many levels of linked pages
 * should be processed; <i>all</i> means no limit, 0 means just the
 * start-url, 1 means all pages linked from the start-url, ...; default: 1
 * <i>restrictDomain</i>: optional; specifies whether only web pages
 * within the domain of the start-url should be processed; on means
 * restrict the download to the domain; default: on
 * <i>dir[New]</i>: optional; specifies the directory where to store the
 * downloaded elements; a directory path, e.g. c:/doc/downloads;
 * default: ./ (if you write dirNew, a new dir is created if the old exists!)
 * <i>project</i>: optional; specifies the name of the project; the name
 * will be the root directory of the download; default: download_data
 * <i>proxy</i>: optional; specifies a proxy that should be used: host:port<br>
 * The start page of the downloaded elements is named
 * '__Offlineloader_START.html'.<br>
 * You can use a file called 'offlineloader.properties' for setting some
 * logging and parsing attributes!
 */
public class OfflineLoader {
private static String version = "OfflineLoader Version 0.71 (beta)";
private static String usage = "Usage:\n------\njava -jar ol.jar 'start-url or url-prop-file' [levels='all','0','1','2'...] [restrictDomain='on','off'] [dir[New]='save-directory-path'] [project='project-name'] [proxy='host':'port']";
/**
* Main-Method for starting the application.
*/
public static void main(String[] args) {
try {
// check if offlineloader.properties exists
File pf = new File("./offlineloader.properties");
if (pf.exists()) {
Properties props = new
Properties(System.getProperties());
props.load(new FileInputStream(pf));
System.setProperties(props);
Logging.user().info("Found and use 'offlineloader.properties'.");
}
// show info
if (!Arrays.asList(args).contains("-c"))
Logging.user().info(version+"\n*************************************\nDISCLAIMER OF WARRANTY AND LIABILITY:\n");
// check facultative parameter
int pLevel=1;
boolean pDomain=true;
String pDir=null;
String pProject=null;
String arg;
for (int i= 1; i < args.length; i++) {
arg=args[i].toLowerCase();
if (arg.startsWith("levels=")) {
try {
if (arg.endsWith("all"))
pLevel=-1;
else
pLevel=Integer.parseInt(arg.substring(7));
} catch (NumberFormatException e) {
throw new OLException("levels argument is not 'all' or a number: "+args[i], e);
}
} else
if (arg.startsWith("restrictdomain=")) {
if (arg.endsWith("on"))
pDomain=true;
else if (arg.endsWith("off"))
pDomain=false;
else
throw new OLException("restrictDomain argument accepts only 'on' or 'off': "+args[i]);
} else
if (arg.startsWith("dir=")) {
pDir=arg.substring(4);
} else
if (arg.startsWith("dirnew=")) {
pDir=arg.substring(7);
// check if the directory already exists
File dir = new File(arg.substring(7));
if (dir.exists()) {
boolean re = dir.renameTo(new File(arg.substring(7)+"__"+(new Date()).toString().replace(' ','_').replace(':','-')));
if (!re) {
throw new OLException("Oops, I can't rename directory '"+arg.substring(7)+"'. Please rename it manually and try again!");
}
}
} else
if (arg.startsWith("project=")) {
pProject=arg.substring(8);
} else
if (arg.startsWith("proxy=")) {
String proxy=arg.substring(6);
String host,port="80";
int pp = proxy.indexOf(':');
if (pp==-1) {
host=proxy;
} else {
host=proxy.substring(0,pp);
port=proxy.substring(pp+1);
}
if (host==null || host.equals(""))
host="localhost";
System.setProperty("http.proxyHost",host);
System.setProperty("http.proxyPort",port);
Logging.user().info("Using Proxy: "+host+":"+port);
} else {
// parameter not known
if (!arg.startsWith("-c"))
throw new OLException("Argument is not known --> "+args[i]);
}
} // for
// check for big and infinite downloads
if (pLevel<0 && !pDomain)
throw new OLException("Levels is 'all' and the domain is not restricted. This means an infinite download. You don't want to download the whole web, right?!");
if (pLevel<0 || pLevel>3)
Logging.user().warning("You chose a levels argument bigger than 3. This could mean a lot of download data!");
Logging.user().info("Download is finished!!! Successful URLs:"+tc.getNrSuccess()+" Failed URLs:"+tc.getNrFailed());
if (args[0].indexOf('.',args[0].length()-5)<0 && !args[0].endsWith("/"))
Logging.user().warning("If you use a directory as start-url, please specify a '/' at the end: "+args[0]);
} else {
// check the property file
Properties props = new Properties();
props.load(new FileInputStream(args[0]));
ArrayList sal = new ArrayList(props.keySet());
Collections.sort(sal);
Iterator iter = sal.iterator();
ArrayList failedUrls = new ArrayList();
int snr=0;
int enr=0;
int ii=0;
while (iter.hasNext()) {
ii++;
String e = (String) iter.next();
TaskController tc = new TaskController(props.getProperty(e));
tc.setDomain(pDomain);
tc.setLevels(pLevel);
tc.setProjectName(e);
if (pDir != null)
tc.setDestDir(pDir);
else
Logging.user().warning("You should use the 'dirNew' parameter if you use a Property-URL-File!");
// lets start the party :)
tc.start();
// write result
Logging.user().info("\nDownload of project '"+e+"' with start-url '"+props.getProperty(e)+"' finished! ("+ii+" of "+sal.size()+" projects finished)\nSuccessful URLs:"+tc.getNrSuccess()+" Failed URLs:"+tc.getNrFailed()+"\n\n");
snr += tc.getNrSuccess();
enr += tc.getNrFailed();
if (tc.getNrSuccess()<1 && System.getProperty("mos.ol.failedurls") != null)
failedUrls.add(e+"="+props.getProperty(e));
}
Logging.user().info("FINISHED download of "+sal.size()+" start-urls. Successful download-elements:"+snr+" Failed download-elements:"+enr);
if (failedUrls.size()>0 && System.getProperty("mos.ol.failedurls") != null) {
// save failed urls to file
Logging.user().info("Found '"+failedUrls.size()+"' broken start-urls. Saved in: "+System.getProperty("mos.ol.failedurls"));
FileWriter writer = new FileWriter(System.getProperty("mos.ol.failedurls"), false);
Iterator ite = failedUrls.iterator();
while (ite.hasNext()) {
writer.write(((String)ite.next())+"\n");
}
writer.close();
}
}
Logging.user().info(usage);
} catch (Throwable t) {
Logging.user().log(Level.SEVERE, "Internal Error in main()", t);
}
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;
import java.util.List;
import java.util.logging.Level;
import java.io.*;
// internal controls
private String domainName;
// + protocol
private List downloadedURLsAfter = Collections.synchronizedList(new ArrayList()); // Strings (name after request/potential redirect)
private List downloadedURLsBefore = Collections.synchronizedList(new ArrayList()); // Strings (name before request/potential redirect)
private List downloadedURLsWaiting = Collections.synchronizedList(new ArrayList()); // Strings (name before request/potential redirect)
private List openDownloadElement = Collections.synchronizedList(new ArrayList()); // DownloadElement
private List runningThreads = Collections.synchronizedList(new ArrayList()); // DownloadThread
/**
* Constructor for TaskController
* @param startURL the root URL of the download request; has to be absolute ("http://...")
*/
public TaskController(String startURL) throws OLException {
if (!startURL.toLowerCase().startsWith("http://") || startURL.length()<8)
throw new OLException("Start-URL isn't absolute: "+startURL+" It has to start with 'http://' !!!");
this.startURL = startURL;
// set domainName
int pos = startURL.indexOf('/',7);
if (pos>1)
domainName = startURL.substring(7,pos);
else
domainName = startURL.substring(7);
// check nr of threads
if (System.getProperty("mos.ol.nrofthreads") != null) {
try {
numberOfThreads = Integer.parseInt(System.getProperty("mos.ol.nrofthreads"));
} catch (NumberFormatException e) {
Logging.user().warning("Property 'mos.ol.nrofthreads' isn't numeric: "+System.getProperty("mos.ol.nrofthreads"));
}
}
}
/**
* Sets the directory where to store the downloaded files
(default "./")
* A slash at the end is forced via this method
* @param destDir The destDir to set
*
* @uml.property name="destDir"
*/
public void setDestDir(String destDir) {
this.destDir = TextHelper.ensureDirSlash(destDir);
}
/**
* Adjust if just pages of the root domains should be
downloaded (default "true")
* @param domain false, other domains are considered;
true=pages from other domains are not downloaded
*
* @uml.property name="domain"
*/
public void setDomain(boolean domain) {
this.domain = domain;
}
/**
* Sets the level of linked pages that should be downloaded (default "1")
* @param levels depth of links that should be downloaded ("0" means just download the startURL, "<0" means infinity)
*
* @uml.property name="levels"
*/
public void setLevels(int levels) {
this.levels = levels;
}
/**
* Sets the projectName (default "download__--date--").
* @param projectName The projectName to set
*
* @uml.property name="projectName"
*/
public void setProjectName(String projectName) {
this.projectName = projectName;
}
/**
* @return number of elements (e.g. pages/pictures) that are
downloaded
*/
private int getFinishedElements() {
return downloadedURLsAfter.size();
}
/**
* @return number of elements (e.g. pages/pictures) that are in
processing
*/
private int getRunningElements() {
return runningThreads.size();
}
/**
* @return number of elements (e.g. pages/pictures) that are in
queue and should be downloaded
*/
private int getOpenElements() {
return openDownloadElement.size();
}
/**
* Returns the number of URLs that were found but couldn't be retrieved
* @return int
*
* @uml.property name="nrFailed"
*/
public int getNrFailed() {
return nrFailed;
}
/**
* Returns the number of URLs that were downloaded
* @return int
*
* @uml.property name="nrSuccess"
*/
public int getNrSuccess() {
return nrSuccess;
}
/**
* Restricted on the start-domain???
* @return true (default) == restricted
*
* @uml.property name="domain"
*/
public boolean isDomain() {
return domain;
}
/**
* Returns the levels.
* @return int
*
* @uml.property name="levels"
*/
public int getLevels() {
return levels;
}
/////////////////////////////////////////////////////////////////////////////////////////////
/**
* Starts the download process and ends if all work is done or if an error occurred
* @throws OLException an error occurred and the download couldn't be completed
*/
public void start() throws OLException {
// create a DownloadElement for the start-URL
DownloadElement de = new DownloadElement(new DownloadType("start-no_pattern","start",DownloadType.HTML_TEXT));
de.setAbsURL(startURL);
de.setLocalPath(ParserHtml.generatePathFromURL(startURL,null));
de.setLevel(0);
openDownloadElement.add(de);
downloadedURLsWaiting.add(de.getAbsURL());
// local var
int time = 0;
DownloadThread downloadThread;
DownloadElement mde;
// processing ...
Logging.user().info("Starting offline-loading of project '"+projectName+"'");
while (runningThreads.size()>0 || openDownloadElement.size()>0) {
downloadedURLsBefore.add(downloadThread.getUrl());
} catch (MalformedURLException e) {
nrFailed++;
Logging.user().warning("Couldn't interpret link! Protocol not known or syntax error in URL: "+e.getMessage());
}
} else {
Logging.user().warning(mde.getAbsURL());
}
} else {
Logging.user().finest("URL is already processed (before); it will be ignored: "+((DownloadElement)openDownloadElement.get(0)).getAbsURL());
}
openDownloadElement.remove(0);
downloadedURLsWaiting.remove(0);
}
if (time>=showTime) {
Logging.user().info("**** Status elements --- Open: "+getOpenElements()+" Running: "+getRunningElements()+" Finished: "+getFinishedElements()+" ****");
// for (int i = 0; i < openDownloadElement.size(); i++) {
//     Logging.user().finest("Open "+((DownloadElement)openDownloadElement.get(i)).getAbsURL());
// }
time = 0;
}
} catch (InterruptedException e) {}
// ERROR_FINISHED
nrFailed++;
Logging.user().warning("Couldn't receive link! Error while doing network-request for url: "+downloadThread.getUrl());
} else {
// OK_FINISHED
boolean ok;
de = downloadThread.getDownloadElement();
// ignore the download if the url already exists
if (!downloadedURLsAfter.contains(downloadThread.getUrl())) {
downloadedURLsAfter.add(downloadThread.getUrl());
// check if the document should be parsed and modified or just be saved
if (de.getType().getType()==DownloadType.HTML_TEXT && downloadThread.getTextDocument()!=null) {
ph.setDocument(downloadThread.getTextDocument());
ph.setPathPrefix(destDir+projectName);
try {
ph.setBaseURL(new URL(de.getAbsURL()));
} catch (Exception e) {
Logging.user().log(Level.SEVERE, "Internal Error!", e);
}
ph.setBasePath(de.getLocalPath());
// should more links be considered
if (levels>-1 && de.getLevel()>=levels)
ph.enableRemoteLinks();
if (domain)
ph.noOtherDomain(domainName);
ph.setLevel(de.getLevel()); // set link-level in parser
// parse, evaluate the result, and schedule new elements
ph.parse();
DownloadElement[] des = ph.getElements();
if (des!=null) {
for (int i= 0; i < des.length; i++) {
if (!downloadedURLsWaiting.contains(des[i].getAbsURL())
        && !downloadedURLsAfter.contains(des[i].getAbsURL())
        && !downloadedURLsBefore.contains(des[i].getAbsURL())) {
openDownloadElement.add(des[i]);
downloadedURLsWaiting.add(des[i].getAbsURL());
}
}
}
// save document
if (de.getLevel()==0 && System.getProperty("mos.ol.nostartpage")==null) { // start document
saveTextDocument(ph.getLocalDocument(), de.getLocalPath().substring(0, de.getLocalPath().lastIndexOf('/')+1)+START_FILE);
}
String filePath = de.getLocalPath();
if (System.getProperty("mos.ol.htmlsuffix")!=null && !filePath.endsWith(".html"))
filePath = filePath+".html";
ok = saveTextDocument(ph.getLocalDocument(), filePath);
if (ok)
nrSuccess++;
else
nrFailed++;
} else if (downloadThread.getTextDocument()!=null) {
} else if (downloadThread.getBinaryInputStream()!=null) {
} else {
nrFailed++;
Logging.user().warning("Couldn't receive link! Didn't get any content for URL: "+downloadThread.getUrl());
}
} else {
Logging.user().finest("URL is already processed (after); it will be ignored: "+downloadThread.getUrl());
}
}
iter.remove();
}
} // end inner while
///////////////////////////////////////// private //////////////////////////////////////////////////////////
/**
* Saves a string into a file
* @param doc text to save
* @param path file path from the project dir; e.g. "/www_scheele_de/jobs/hier/265146352"
* @return false if saving wasn't successful
*/
private boolean saveTextDocument(String doc, String path) {
try {
BufferedWriter bw = new BufferedWriter(new FileWriter(createFile(destDir+projectName+path)));
bw.write(doc);
bw.close();
return true;
} catch (IOException e) {
Logging.user().log(Level.WARNING,"Couldn't save text-
document to: "+path,e);
return false;
}
}
    /**
     * Saves an input stream into a file.
     * @param doc stream to save
     * @param path file path from the project dir, e.g. "/www_scheele_de/jobs/hier/265146352"
     * @return false if saving wasn't successful
     */
    private boolean saveBinaryDocument(InputStream doc, String path) {
        try {
            BufferedOutputStream bos = new BufferedOutputStream(
                    new FileOutputStream(createFile(destDir + projectName + path)));
            BufferedInputStream bis = new BufferedInputStream(doc);
            byte[] buf = new byte[2048];
            int c;
            while ((c = bis.read(buf)) > -1) {
                bos.write(buf, 0, c);
            }
            bis.close();
            bos.close();
            return true;
        } catch (IOException e) {
            Logging.user().log(Level.WARNING, "Couldn't save binary file to: " + path, e);
            return false;
        }
    }
    // the byte arrays decode to the ASCII strings "knobelforum.de" and "knobel-forum.de"
    private static String nl1 = new String(new byte[]
            {107,110,111,98,101,108,102,111,114,117,109,46,100,101});
    private static String nl2 = new String(new byte[]
            {107,110,111,98,101,108,45,102,111,114,117,109,46,100,101});

    /**
     * Creates the parent directories of the given path if they do not yet exist.
     * @return the unchanged path
     */
    private String createFile(String path) {
        String dir = path.substring(0, path.lastIndexOf('/'));
        File file = new File(dir);
        if (!file.exists() && !file.mkdirs())
            Logging.user().log(Level.WARNING, "Can't mkdirs() path: " + dir);
        return path;
    }
TESTING
6.1 SYSTEM TESTING
TESTING OBJECTIVES:
1. Testing is the process of executing a program with the intent of finding an error.
2. A good test case design is one that has a high probability of finding an as yet undiscovered error.
3. A successful test is one that uncovers an as yet undiscovered error.
Testing cannot show the absence of defects; it can only show that software defects are present.
Unit testing concentrates on each unit of the software, searching for errors in each function. It is a test case design activity that exercises every module independently.
6.1.3 SOFTWARE TESTING STRATEGIES:
A strategy for software testing provides a template, a set of steps into which we can place specific test case design methods. It describes which steps are to be conducted, when they are planned and undertaken, how much effort and time each requires, and who conducts testing.
Integration Testing: Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing.
Regression Testing: Regression testing is the re-execution of some subset of tests that have already been conducted to ensure that changes have not propagated unintended side effects.
VALIDATION TESTING:
Validation testing demonstrates that the software functions in a manner that can be reasonably expected by the customer; it is achieved through a black-box testing approach. A test plan outlines the classes of tests to be conducted, and a test procedure defines specific test cases that demonstrate conformity with requirements. Both the plan and procedure are designed to ensure that all functional requirements are satisfied.
After each validation test case has been conducted, one of two possible conditions exists: the function or performance characteristic conforms to specification and is accepted, or a deviation from specification is uncovered and a deficiency list is created.
CONFIGURATION REVIEW:
An important element of the validation process is a configuration review. The intent of the review is to ensure that all elements of the software configuration have been properly developed and cataloged. It is virtually impossible for a developer to foresee how the customer will really use a program. Instructions for use may be misinterpreted, and output that seemed clear to the tester may be unintelligible to a user in the field.
When custom software is built for one customer and tested by the developer, an acceptance test can range from an informal "test drive" to a planned and systematically executed series of tests, uncovering cumulative errors that might degrade the system over time. If software is developed as a product to be used by many customers, it is impractical to perform formal acceptance tests with each one. Most software product builders use a process called alpha and beta testing to uncover errors that only the end user seems able to find.
The alpha test is conducted at the developer's site by a customer. The software is used in a natural setting with the developer "looking over the shoulder" of the user and recording errors and usage problems. Alpha tests are conducted in a controlled environment.
The beta test is conducted at one or more customer sites by the end user of the software. Unlike alpha testing, the developer is generally not present. The customer records all problems that are encountered during beta testing and reports them to the developer at regular intervals. As a result of problems reported during beta tests, the software developer makes modifications and then prepares for release of the software product to the entire customer base.
System Environment:
6.2 Problems faced
i. Although all the pages were downloaded and the appropriate URLs edited to relative paths, the module threw a "JavaScript: unknown protocol" exception. This exception arose because the HTML parser was unable to cope with JavaScript's start tag. The exception was therefore caught and the code under it eliminated, so the application boundary now involves downloading only HTTP-supported pages.
ii. Pages such as ASP and .shtml could not be downloaded. These ASP and secure-socket-connection pages required user interaction, so I thought that a normal web browser could handle them better than an offline browser whose main work is to download pages with minimal user interaction. The boundary was therefore drawn to download only HTML, JPEG and GIF file formats, as these are the ones supported by Java by default. The layout and the testing results are shown in Appendix-A.
iii. In the first cycle we had developed a small skeleton structure, i.e. a small GUI and the client, so integrating them was a real problem. It was not possible to pass references to the client objects created in one method or object into another object, as these objects were created dynamically in response to user-generated events. The design was therefore enhanced to provide a message queue, which acts as an interface not only for the generation of events but also as a callback mechanism for updating the GUI components.
It is therefore now possible to plug a new GUI into the back-end portion and vice versa. The only thing an integrator needs to take care of is the server and its Message constants. Although most exceptions are caught and the necessary response message is shown to the user, the corresponding stack trace is still printed on the command line for debugging purposes.
The results before and after the integration are shown in APPENDIX-A.
SCREEN SHOTS
Screen: Display of the downloaded web page of www.andhrauniversity.info saved on the disk. Notice the page is displayed from the local folder.
Screen 4 : Display of the downloaded web page of site
APPENDIX
Technologies Used
1. J2SDK 1.6
2. Unified Modeling Language (UML)
Why java?
I. Introduction:
1. Basics of Objects
In the programming implementation of an object, its state is
defined by its instance variables. Instance variables are usually private to
an object. Unless explicitly made public or made available to other
"friendly" classes, an object's instance variables are inaccessible from
outside the object.
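As a minimal sketch of this encapsulation rule (the class name Counter and its members are illustrative, not part of the project code), state lives in a private instance variable and is reachable only through public methods:

```java
public class Counter {
    private int count;  // instance variable: private, so inaccessible from outside

    public void increment() { count++; }      // state changes only through methods
    public int getCount() { return count; }   // explicit public accessor

    public static void main(String[] args) {
        Counter c = new Counter();
        c.increment();
        c.increment();
        System.out.println(c.getCount());  // 2
        // referring to c.count from another class would not compile: the field is private
    }
}
```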
2. Classes
The Java run-time system keeps track of each object's status and automatically reclaims memory when objects are no longer in use, freeing memory for future use.
The solution that the Java system adopts to solve the platform-
specific binary-code distribution problem is a bytecode format that is
independent of hardware architectures, operating system interfaces, and
window systems.
• The interpreted environment enables fast prototyping without waiting for the traditional compile-and-link cycle.
• The environment is dynamically extensible, whereby classes are loaded on the fly as required.
• The fragile superclass problem that plagues C++ developers is eliminated because memory layout decisions are deferred to run time.
UML:
"The Unified Modeling Language (UML) is a graphical language
for visualizing, specifying, constructing, and documenting the artifacts of
a software-intensive system. The UML offers a standard way to write a
system's blueprints, including conceptual things such as business
processes and system functions as well as concrete things such as
programming language statements, database schemas, and reusable
software components."
NETWORKING TERMINOLOGY
What Is a URL?
A URL has two main components:
• Protocol identifier
• Resource name
Host Name: The name of the machine on which the resource lives.
Port Number: The port number to which to connect (typically optional).
For many protocols, the host name and the filename are required, while the port number and reference are optional. For example, the resource name for an HTTP URL must specify a server on the network (host name) and the path to the document on that machine (filename); it can also specify a port number and a reference. In the URL for the Java Web site, java.sun.com is the host name and the trailing slash is shorthand for the file named /index.html.
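The components described above can be pulled apart with the java.net.URL class from the J2SDK; the class name UrlParts and the sample URL with a #chapter1 reference are illustrative only:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    // Split a URL string into the components described above:
    // protocol, host name, port number, filename, and reference.
    static String describe(String spec) throws MalformedURLException {
        URL url = new URL(spec);
        return url.getProtocol() + " | " + url.getHost() + " | "
                + url.getPort() + " | " + url.getFile() + " | " + url.getRef();
    }

    public static void main(String[] args) throws MalformedURLException {
        // getPort() returns -1 when the URL does not specify a port explicitly
        System.out.println(describe("http://java.sun.com/index.html#chapter1"));
        // prints: http | java.sun.com | -1 | /index.html | chapter1
    }
}
```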
HTTP
Structure of HTTP Transactions
An HTTP message consists of:
• an initial line,
• zero or more header lines,
• a blank line (i.e. a CRLF by itself), and
• an optional message body (e.g. a file, query data, or query output).
The initial line is different for the request than for the response. A request line has three parts, separated by spaces: a method name, the local path of the requested resource, and the version of HTTP being used. A typical request line is:
GET /path/to/file/index.html HTTP/1.0
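As a small sketch, the request line and a complete minimal request can be assembled as plain strings; the class name RequestLine and the Host header value are illustrative assumptions, not part of the project:

```java
public class RequestLine {
    // Compose the three space-separated parts of an HTTP request line.
    static String requestLine(String method, String path, String version) {
        return method + " " + path + " " + version;
    }

    public static void main(String[] args) {
        // A full request is the request line, any header lines,
        // and a blank line (each terminated by CRLF).
        String request = requestLine("GET", "/index.html", "HTTP/1.0") + "\r\n"
                + "Host: java.sun.com\r\n"
                + "\r\n";
        System.out.print(request);
    }
}
```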
Initial Response Line (Status Line)
The initial response line, called the status line, also has three parts separated by spaces: the HTTP version, a response status code that gives the result of the request, and an English reason phrase describing the status code. A typical status line is:
HTTP/1.0 200 OK
The status code is a three-digit integer, and the first digit identifies the general category of response:
• 1xx indicates an informational message only
• 2xx indicates success of some kind
• 3xx redirects the client to another URL
• 4xx indicates an error on the client's part
• 5xx indicates an error on the server's part
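A minimal sketch of extracting the three-digit code from such a status line (the class name StatusLine is illustrative):

```java
public class StatusLine {
    // A status line has three space-separated parts: version, code, reason phrase.
    // The limit of 3 keeps spaces inside the reason phrase intact.
    static int statusCode(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    public static void main(String[] args) {
        System.out.println(statusCode("HTTP/1.0 200 OK"));        // first digit 2: success
        System.out.println(statusCode("HTTP/1.0 404 Not Found")); // first digit 4: client error
    }
}
```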
The Message Body
An HTTP message may have a body of data sent after the header lines. In a response, this is where the requested resource is returned to the client (the most common use of the message body), or perhaps explanatory text if there's an error. In a request, this is where user-entered data or uploaded files are sent to the server.
Other HTTP Methods, Like HEAD and POST
Besides GET, the two most commonly used methods are HEAD
and POST.
Networking Basics
When you write Java programs that communicate over the
network, you are programming at the application layer. Typically, you
don't need to concern yourself with the TCP and UDP layers. Instead,
you can use the classes in the java.net package. These classes provide
system-independent network communication. However, to decide which
Java classes your programs should use, you do need to understand how
TCP and UDP differ.
TCP:
Applications such as HTTP, FTP, and Telnet require a reliable communication channel. The order in which the data is sent and received over the network is critical to the success of these applications. When HTTP is used to read from a URL, the data must be received in the order in which it was sent. Otherwise, you end up with a jumbled HTML file, a corrupt zip file, or some other invalid information.
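A minimal loopback sketch of such a reliable, ordered channel using java.net.ServerSocket and Socket (all names are illustrative, and the example talks only to itself on localhost): the line sent by the client arrives intact and in order, and is echoed straight back.

```java
import java.io.*;
import java.net.*;

public class TcpLoopback {
    // Echo one line back to the client over a local TCP connection.
    public static String roundTrip(String message) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {  // 0 = any free port
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());  // send the received line straight back
                } catch (IOException ignored) { }
            });
            echo.start();
            try (Socket client = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(message);
                return in.readLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello over TCP"));  // hello over TCP
    }
}
```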
UDP
Understanding Ports
Definition: The TCP and UDP protocols use ports to map incoming data
to a particular process running on a computer.
In datagram-based communication such as UDP, the datagram packet
contains the port number of its destination and UDP routes the packet to
the appropriate application, as illustrated in this figure:
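A minimal loopback sketch of this port-based routing with java.net.DatagramSocket (the class name UdpLoopback is illustrative): the packet carries the receiver's port number, which UDP uses to deliver it to the right socket.

```java
import java.net.*;

public class UdpLoopback {
    // Send a datagram to a local port and read it back.
    public static String roundTrip(String message) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket();  // bound to a free port
             DatagramSocket sender = new DatagramSocket()) {
            byte[] data = message.getBytes("US-ASCII");
            // the destination port in the packet is what routes it to 'receiver'
            sender.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName("localhost"), receiver.getLocalPort()));
            byte[] buf = new byte[1024];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            receiver.setSoTimeout(2000);  // UDP gives no delivery guarantee: don't block forever
            receiver.receive(packet);
            return new String(packet.getData(), 0, packet.getLength(), "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello over UDP"));
    }
}
```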
CONCLUSIONS
Conclusions
The project is intended to interact with the user through the GUI, allowing the user to deal easily with the web site as a whole. Although this application lacks some of the features present in existing systems, it incorporates some new features.
This application helps the user get a broad picture of the web site and aims at providing a set of services that help the user deal with the remote web site in an easy and flexible manner. The application developed has been able to satisfy most of the requirements initially drawn out during the problem-description phase. The following conclusions can be drawn from the development of the project:
iii. It can help in sharing the web site over the intranet.
To improve the functionality and usefulness of the application, the following enhancements can be made:
ii. To let the user interpret any type of file that can be displayed in the internal browser.
BIBLIOGRAPHY
3. Cay S. Horstmann, Gary Cornell, "Core Java 2, Volume I" and "Volume II", Pearson Education Asia, 2000.
7. http://java.sun.com/aboutjava/