September, 2003
Table of Contents

Background
SinglePoint Market Research Portal Overview
Custom Content Integration
Search Technology Architecture History
Search Technology Architecture Overview
    Documents
    Metadata and Metatags
    Data Model
    Data Collection and Web Crawling
    Search Service Description
    Query Database
    Automatic Classification
    Northern Light Taxonomy
        Sample Taxonomy: Aerospace
        Sample Taxonomy: Construction
    Indexing
    Query Service
    Searching
    Relevancy Ranking
    Custom Search Folders
    Session Management
    Security
    Alerts
Applications Development and Hosting
Northern Light Technology Awards
Background
Northern Light was founded in 1996 with four objectives: (i) unify all of the best content in the world into one database, (ii) build search technology that allows searchers to easily find the most relevant (not just the most) information within that database, (iii) build search technology that works for both the first-time untrained user and the information professional, and (iv) create a set of tools that allow Northern Light to build and operate custom information solutions for businesses that use these capabilities.

This mission was formed from the following observations:
- Most interesting questions have relevant content from many sources, i.e., the Web, news feeds, licensed research, journal archives, and internal corporate information.
- Given the penetration of Internet technology on corporate networks, there is no longer any technical or distribution barrier to making all digital information available from any desktop computer.
- Unstructured information is overwhelmingly the most common type. It is impossible in any large search application to know in advance what the organization, metatags, or document structure will be.
- People in all walks of life are search engine literate and use search engines to access information on a daily basis. There is no other corporate application that requires less training or support than a search application.
- In particular, the key problem for search engine information retrieval is that of producing a precise set of relevant documents: fewer good documents, not more useless ones.
To accomplish its mission, Northern Light set out to meet the following goals:
- Build a continually growing stable of content integration technology that would allow Northern Light to create databases of greatly diverse content,
- Make use of the best existing technology and develop new technology for highly scalable and precise searching,
- Develop highly scalable automated classification and related technologies to use pre- and post-search because even the best query interpretation and relevancy ranking are frequently inadequate to answer an information need expressed in one or a few words against a database of a billion documents or more, and
- Become extraordinarily proficient at dealing with unstructured data, bringing accessibility, usability, organization, and classification to arbitrarily large and diverse content sets from thousands of sources.

Copyright 2003, Northern Light Group, LLC, All rights reserved
NorthernLight.com formally launched in August 1997 and was the first Internet search engine to offer access to Web, published, and internal corporate content in a single database.
Organization: All sources are indexed and subject-classified to a consistent standard.
User interface: One login, one user interface, one search, one results list.
Seat management: Enforce access privileges by user or group to sources, groups of sources, or individual documents.
Outsourced turnkey solution: Northern Light develops, hosts, and maintains the SinglePoint to help keep overhead down and minimize the impact on your internal IT resources.
Security: Private database and private Web server via secure API, VPN, or T1. User names and passwords can be used, as well as IP validation. Security is customized to meet specific client requirements, and integration with corporate network security systems is available.
Index and classify the content, creating the comprehensive multi-vendor index for our customer to search.
Load the content every day, create the index, and serve the end-user queries. Northern Light disposes of our copy of the vendor's content (or of internal content) a few days after the load process is complete. (We do maintain a copy for a few days so we can re-index a recent load if there turns out to be a problem of any sort.) Note that a SinglePoint database index is unique to a specific customer, facilitating security and usability.
Once the comprehensive multi-vendor/multi-source database index has been built, end-users may query from UIs provided on their intranet or from a UI we can provide as a private website. Results returned are from all the content in the database across all of the licensed vendors, consistently relevance ranked and indexed. When an end-user wants a document, he or she clicks on the link just as in any other search engine. We then instantly, transparently, and automatically authenticate that the end-user has rights to the document, log the end-user in to the service of the vendor in question, fetch the document, and put the document in the browser window of the end-user. The transaction system of the vendor records the event as if the end-user had logged in. Below is a diagram of the content integration process.

[Diagram: content integration process between Northern Light and the customer. Labels: User Interface; Internal Repository; Internal Mkt. Research; Investment reports; Industry Web sites; Newswires & journals; Mkt. Research vendor; Full-text content; Computer generated data; Reports; Trash]
These information management products reflect the company's core competencies in search technology, classification and taxonomy development, and integration of diverse content and federated search. Northern Light typically offers these services on an outsourced, ASP basis. However, the Northern Light search technology is also available as licensed software for in-house customer use on Solaris and Linux platforms.

All of Northern Light's solutions derive from the original vision expressed in our Web search engine of one database that could provide access to all the world's useful information. All of our solutions share these characteristics:
- Scale to gigantic numbers of documents.
- Speed to efficiently search such large databases.
- Precision, or relevance ranking, to make the large databases useful.
- Classification to automatically organize the body of unstructured content in useful ways.
Data Model
The standard metadata used for Custom Search Folders (subject, type, source, language and region) is treated specially in a number of ways. The possible values for each of these fields have been defined and comprise a taxonomy, or set of possible values, for that field/domain. These taxonomies are all hierarchical, though they contain many cross-references; a given value may have more than one parent because each taxonomy is actually a directed acyclic graph.

The subject taxonomy contains approximately 17,000 values (referred to as nodes), starting at the top level with broad categories such as "humanities", and sometimes going eight or more levels deep in certain areas to provide very specific subject values such as "works of W.H. Auden" or "robotics".

The type field refers to the kind of document: an article (the default and most populated type), a review (with more specific typing of book review and others), an editorial, a letter, a report, something for sale, etc.

The source field refers to where the document came from, and is either a Web source of some kind (e.g., a Web site, or possibly a higher-level source node such as "all commercial sites") or a Special Collection source, typically a single journal or book title at the lowest level (e.g., The Economist or the Boston Herald) or, again, a higher-level aggregate (e.g., "journals and magazines", "news articles", etc.).

The language field is the predominant language(s) of the document, currently one of English, French, German, Spanish, Italian and "unknown" (i.e., some other language).

The region field specifies a location or locations referenced in the document: a city, country, geographic region, etc.

For type, language and subject, the metadata value(s) attempt to capture what the document is really about (or substantially written in, in the case of language). Multiple values are possible but these are intended to represent true multi-subject documents.
In the case of region, the model is slightly different; a document will be tagged with any and all regions that can possibly be identified with a document. The difference between region and these other fields is the way they are used in searching.

This multi-dimensional model is in contrast to a single-dimensional model that must rely on repetition within a single domain in order to achieve comprehensive document descriptions. For example, in a single-dimensional design, values like "reviews" or "biographies" could be repeated under all or a very large number of subject values. As the amount of metadata increases, the single (or few) domain model becomes increasingly complex and unwieldy. The Northern Light multi-dimensional model, however, can maintain multiple taxonomies easily and simply and can class and organize documents against them.
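The multi-dimensional model can be illustrated with a minimal Python sketch. The class, the helper function, and the sample node names are hypothetical; only the five field names and the directed-acyclic-graph shape of each taxonomy come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One value in a taxonomy; it may have several parents (a DAG, not a tree)."""
    name: str
    parents: list["Node"] = field(default_factory=list)


def ancestors(node: Node) -> set[str]:
    """Collect the names of every node reachable by following parent links."""
    seen: set[str] = set()
    stack = list(node.parents)
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen.add(current.name)
            stack.extend(current.parents)
    return seen


# Hypothetical fragment of a subject taxonomy with a cross-reference.
technology = Node("technology")
engineering = Node("engineering", [technology])
computing = Node("computing", [technology])
robotics = Node("robotics", [engineering, computing])  # two parents

# Each document carries an independent set of values per dimension,
# so "review" never has to be repeated under every subject.
document = {
    "subject": {"robotics"},
    "type": {"review"},
    "source": {"journals and magazines"},
    "language": {"English"},
    "region": {"Japan", "Germany"},
}
```

Because the type value lives in its own dimension, adding a new subject node does not require adding matching "review" or "biography" children beneath it.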
index, of course, where we got it from so that the document can be retrieved if an end-user desires to read it after selecting it from a results list.
Automatic Classification
To deliver automated classification against a huge and heterogeneous data set, the Northern Light technology uses our own classification taxonomies for subject, type (e.g., article, review, FAQ, job listing, etc.) and other document attributes, drawing on existing taxonomies and supplementing them to provide comprehensive coverage for a wide range of users. An automatic system has also been built that uses multiple strategies (e.g., pattern extraction from training documents, co-location analysis, and structural elements) for classifying documents for a given attribute. Both the taxonomies and the automated system have been in production and supporting end users since August 1997, have classified over a billion documents, and are continually being refined to deliver more comprehensive and precise classification and better operational performance. At this point, Northern Light's automatic classification is still the only system to ever automatically subject classify the World Wide Web.

Performance levels have been achieved by fully divorcing the logical classification models from their practical implementation and creating data structures appropriate for rapid classification of documents against the large but relatively stable taxonomies, patterns, and rules that are the basis of the classification process.

One primary use of classification information (i.e., metadata) at Northern Light Integration today is to organize the results (through Custom Search Folders) of a search by appropriate attribute values. This facilitates rapid navigation and some level of automatic query refinement, while allowing more expert users to limit their search initially by some appropriate attribute value. Metadata is also used as one factor (among many) in relevancy ranking.
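The idea of separating the logical model (patterns per taxonomy node) from a lookup structure built for speed can be sketched as follows. This is an illustration under stated assumptions, not the production classifier: the pattern table, the threshold, and all function names are hypothetical, and simple term matching stands in for the multiple strategies mentioned above.

```python
from collections import Counter, defaultdict

# Hypothetical training output: characteristic terms per taxonomy node.
NODE_PATTERNS = {
    "Jet engines": {"turbofan", "turbojet", "afterburner"},
    "Helicopters": {"rotor", "helicopter", "autorotation"},
}


def build_lookup(node_patterns):
    """Invert node -> terms into term -> nodes, built once and reused."""
    lookup = defaultdict(set)
    for node, terms in node_patterns.items():
        for term in terms:
            lookup[term].add(node)
    return lookup


def classify(text, lookup, min_hits=2):
    """Return nodes whose patterns match at least min_hits tokens, best first."""
    hits = Counter()
    for token in text.lower().split():
        for node in lookup.get(token, ()):
            hits[node] += 1
    return [node for node, count in hits.most_common() if count >= min_hits]


LOOKUP = build_lookup(NODE_PATTERNS)
```

Because the taxonomy and patterns change slowly, the inverted lookup can be rebuilt rarely while each document is classified with one pass over its tokens.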
Below are two examples of taxonomy branches: one for aerospace technology and industry, the other for construction technology and industry.

Sample Taxonomy: Aerospace

The node identifier is the ID#, and the @ sign indicates inclusion by reference of other branches of the taxonomy.

Aviation & space technology ID#18340
Aerodynamics ID#18341
Aeronautics ID#18342
Flight control & navigation ID#18368
Aeronomy ID#39560
Aerospace communications equipment ID#18367
Aerospace materials ID#18344
Air traffic control ID#38332
Aircraft design & construction ID#18346
Aircraft engines & motors ID#18356
Aerospace propulsion ID#18395
Jet engines ID#18357
Rocket engines ID#18359
Aviation instrumentation ID#18382
Commercial aircraft design ID#18347
Flight simulators ID#17478
Flight testing ID#18369
Gliders ID#18351
Helicopters ID#18352
Homebuilts & ultralights ID#18349
Hot air balloons ID#18350
Landing gear ID#18383
Military aircraft ID#18353
Seaplanes ID#18354
Small planes ID#18355
Astronautics ID#18363
Space systems ID#18412
Astrophysics ID#18366
@Celestial mechanics (ID#14437) ID#13928
Aviation ground facilities ID#18372
Airport planning & design ID#18373
Military aircraft ground facilities ID#18375
Spacecraft ground facilities ID#18377
Avionics ID#37848
Civil aviation ID#37768
@Flight control & navigation (ID#14428) ID#18368
History of aviation & space technology ID#18378
History of aviation ID#18379
History of space flight ID#18380
@National Aeronautics & Space Administration (NASA) (ID#14427) ID#10135
@Remote sensing (ID#14429) ID#13629
Satellite technology ID#18402
Communications satellites ID#18404
Space stations ID#18411
MIR space station ID#36642
Space travel & exploration ID#18413
Space colonization ID#18405
Spacecraft & Space missions ID#18406
Apollo space missions ID#18407
Gemini space mission ID#18408
Manned spacecraft ID#39606
Mercury space missions ID#18409
Space Shuttle ID#18410
Space launch vehicles & equipment ID#39607
Space probes ID#18389
Space safety ID#39608
Unmanned spacecraft ID#39609
Viking mission to Mars ID#29129
@Telescopes (ID#14432) ID#13198

Sample Taxonomy: Construction

Architectural engineering ID#18323
@Architectural design (ID#40386) ID#358
Building acoustics ID#18324
@Construction engineering (ID#14424) ID#18511
@Construction management (ID#14425) ID#18512
Heating, ventilation & air conditioning ID#18331
@Air conditioners & fans (ID#41168) ID#14698
@Home furnaces (ID#43116) ID#37672
Lighting & electrical systems ID#18333
Commercial lighting ID#14840
Exterior lighting ID#14767
@Lighting design (ID#42055)
@Structural engineering (ID#14426)
Architectural services ID#4574
Architectural drafting ID#4576
House plans ID#37844
@Landscape architecture (ID#40544)
Lighting design ID#27266
Asbestos ID#5305
@Asbestos exposure (ID#40679)
@Asbestos removal (ID#40570)
Civil engineering ID#18509
Bridge engineering ID#18510
Construction engineering ID#18511
Building standards & codes ID#39565
Construction automation ID#18513
Construction management ID#18512
Construction safety ID#6194
@Dams, canals & waterways (ID#14473)
Earthworks engineering ID#18518
Fire technology ID#28435
Combustion & flammability ID#28439
Fire investigation ID#28448
Fire prevention ID#28450
Fire safety systems ID#19295
Fire suppression ID#28441
Geotechnical engineering ID#18520
Earthquake engineering ID#39468
Geo-environmental systems ID#13643
Geosynthetics ID#13644
Hydraulic engineering ID#18524
Coast & Harbor engineering ID#18525
Flood control ID#18527
@Hydraulic cement (ID#43096)
@Hydraulic fluids (ID#43328)
@Hydraulic machinery (ID#41488)
Hydraulic structures ID#18528
Aqueduct engineering ID#37887
Dams, canals & waterways ID#18530
Reservoir engineering ID#18532
Irrigation & drainage ID#13032
Sediment transport ID#18535
Surface water runoff ID#18536
Lighthouses ID#18540
@Ocean engineering (ID#14472)
Structural engineering ID#18547
@Mechanical behavior of materials (ID#14474)
Structural concrete ID#18551
Structural steel ID#18554
Surveying ID#18555
@Geographic information systems (GIS) (ID#14475)
Photogrammetry ID#18560
@Remote sensing (ID#14476)
Transportation engineering ID#18565
@Automotive engineering (ID#41482)
Electric vehicles ID#36914
Emission control ID#18569
High-speed ground transportation ID#18570
Highways, roads & pavements ID#18522
Intelligent transportation systems ID#18574
Marine transportation ID#19488
Pipeline transportation ID#39656
Railroad engineering ID#18544
Transportation planning ID#18575
Transportation safety ID#39610
@National Transportation Safety Board (NTSB) (ID#40938)
@Urban transportation (ID#41339)
Tunnel engineering ID#18577
Construction industry ID#26320
Building contractor services ID#4725
Building materials ID#39453
@Carpentry & woodworking (ID#41390)
@Construction machinery (ID#41486)
@Driveway coating & construction (ID#42066)
@Electrician services (ID#41176)
@Fences & stone walls (ID#40582)
Hand & power tools ID#14756
@Home improvement centers (ID#43309)
@Insulation services (ID#41174)
@Landscaping services (ID#41177)
Nonresidential Construction ID#39628
Plumbers & plumbing supplies ID#14751
@Bathroom fixtures & accessories (ID#41169)
@Pool construction & maintenance services (ID#41179)
Residential construction ID#39449
@Roofing services (ID#41180)
@Septic systems (ID#42067)
@Underwater construction & Habitats (ID#43333)
@Water well drilling (ID#42047)
Facilities management ID#6270
Floor laying, refinishing & resurfacing ID#27352
Heating & Ventilation industry ID#26835
@Heating, ventilation & air conditioning (ID#41445)
House painting & wall covering services ID#14789
Industrial equipment & Heavy machinery industry ID#26341
@Farm equipment & Supplies industry (ID#41904)
@Manufacturing equipment & machinery (ID#41485)
@Turbomachinery (ID#41508)
Lighting industry ID#26349
Electrician services ID#14784
Electrical supplies ID#14744
Electrical testing & inspection ID#18674
Electrical wiring ID#18675
@Lighting & electrical systems (ID#41446)
@Lamps & light fixtures (ID#40577)
@Lighting design (ID#42054)
Laminated wood ID#27394
Particle board ID#27398
Plywood & veneer ID#5068
Pressure treated wood ID#27401
Sheet metal ID#5124
Wire & Cable products ID#39463
Aluminum & aluminum products ID#5112
Copper & copper products ID#5116
Iron industry ID#26357
Steel industry ID#26358
Stainless steel ID#5125
Paint & paint supplies ID#14748
Property developers ID#5232
Rock mechanics ID#39569
Soil science & technology ID#18312
Erosion ID#18313
Fertilizers ID#18316
Chemical fertilizers ID#17611
Organic fertilizer ID#17612
Soil cultivation ID#18315
Soil pollution ID#13029
Soil remediation ID#37889
Stone, clay, glass & concrete product industry ID#26373
Cement ID#5306
Hydraulic cement ID#36456
@Ceramics & Pottery (ID#40405) ID#797
Concrete ID#5307
Concrete block & brick ID#36460
Ready-mixed concrete ID#36461
Cut stone & stone products ID#5308
Granite ID#36462
Limestone ID#36463
Marble ID#36464
@Memorials & Grave stones (ID#42060) ID#27319
Earthenware ID#5309
Glass products ID#5311
Automobile glass ID#36478
Flat glass ID#36465
Glass containers ID#36466
@Mirrors (ID#40565) ID#4879
Pressed & blown glass ID#5315
@Sand & Gravel (ID#41514) ID#19469
Vitreous china ID#36475
Windows & doors ID#4998
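For readers who want to process listings like the samples above, a single entry string can be split into its parts with a short Python sketch. The function and key names are hypothetical. The listing does not define the parenthesized number on @ entries, so it is kept here under the assumed name `reference_id`; in the aerospace sample, the trailing number on "@Flight control & navigation" (18368) matches the ID# of the node it points to.

```python
import re

# Optional "@", a name, an optional "(ID#n)", and an optional trailing "ID#n".
ENTRY = re.compile(
    r"^(?P<ref>@)?(?P<name>.+?)"
    r"(?:\s+\(ID#(?P<ref_id>\d+)\))?"
    r"(?:\s+ID#(?P<node_id>\d+))?$"
)


def parse_entry(entry):
    """Split one taxonomy entry into name, reference flag, and identifiers."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {entry!r}")

    def to_int(value):
        return int(value) if value is not None else None

    return {
        "name": match.group("name"),
        "is_reference": match.group("ref") is not None,
        "reference_id": to_int(match.group("ref_id")),  # assumed meaning
        "node_id": to_int(match.group("node_id")),
    }
```

For example, `parse_entry("@Lighting design (ID#42055)")` yields a reference with no trailing node identifier, which is how several entries in the construction sample appear.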
Indexing
During indexing, all of the terms in the documents and metadata fields are extracted and indexed into appropriate index structures; searches can be resolved by using these structures and without having to refer to the original documents. This process is exhaustive and comprehensive; all visible terms in the document display object and all appropriate metadata values are indexed. There are no stop words or special characters that are not indexed, and there is no practical cutoff in terms of document length at which point indexing stops.

All search terms (including those inside quoted phrases) are viewed as nouns and transformed (stemmed) automatically to their common singular form during indexing (and at query time). This allows a search on a singular or plural noun to find occurrences of either. The stemming rules are fairly simple and do not handle most irregular forms, which tend to occur for very common words not generally useful in searching.

All terms are indexed as all lower-case letters. In addition, all terms containing at least one instance of both upper- and lower-case are also indexed in a special case-sensitive index. This allows queries to find all instances of a term regardless of case; query terms are also translated to all lower-case for initial query resolution. This also allows a match in case-sensitivity with a query term to be used as a relevancy factor.

The above rules are English-language dependent. However, given the symmetry with which they are applied at indexing and query time, they generally preserve appropriate search processing for all languages. Double-byte language support, which requires the licensing of a third-party language processing module from Teragram, includes language-sensitive stemming and other sorts of processing. Numeric tokens (or tokens of mixed letters and numbers) are indexed as text.
Proximity information can be represented in indices in various ways, allowing either very fast access of short phrases, or complete and precise (but slower) access of phrases of unlimited length.
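The indexing rules above can be sketched in Python as a positional inverted index. The stemming rules, the tokenizer, and every function name are assumptions chosen for illustration; the actual rules and index structures are not specified in this paper. Storing mixed-case tokens verbatim (unstemmed) in the case-sensitive index is also an assumption.

```python
import re
from collections import defaultdict


def stem(term):
    """Very rough singularization; a stand-in for the unspecified rules."""
    if term.endswith("ies") and len(term) > 4:
        return term[:-3] + "y"
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term


def build_index(docs):
    """Return (lower-case index, case-sensitive index): term -> doc -> positions."""
    lower_index = defaultdict(lambda: defaultdict(list))
    case_index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # No stop words: every token, including numbers, is indexed as text.
        for pos, token in enumerate(re.findall(r"\w+", text)):
            lower_index[stem(token.lower())][doc_id].append(pos)
            # Tokens with both upper- and lower-case letters are also kept verbatim.
            if token != token.lower() and token != token.upper():
                case_index[token][doc_id].append(pos)
    return lower_index, case_index


def phrase_match(index, doc_id, first, second):
    """True if `second` occurs immediately after `first` in the document."""
    first_pos = index.get(stem(first.lower()), {}).get(doc_id, [])
    second_pos = set(index.get(stem(second.lower()), {}).get(doc_id, []))
    return any(p + 1 in second_pos for p in first_pos)
```

Keeping positions per term is one simple way to resolve phrases of any length from the index alone; a production system might add a dedicated structure for short phrases to trade space for speed.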
Query Service
Finished databases are connected to Northern Light's network by the Query Listener (QL). The QL accepts queries from external clients (such as a Web server) and passes them to the Query Server (QS). The QS translates the search syntax and other parameters, queries the database indices, and returns the appropriate citation information and metadata. The Query Listener is also responsible for identifying itself by broadcasting, via UDP, a database identifier and load information. Clients use this information to select a listener appropriate to their mission.
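How a client might use the broadcast information can be sketched as follows. The payload layout (JSON with `database_id`, `host`, `port`, and `load` keys), the meaning of the load figure, and all names are hypothetical; this paper states only that a database identifier and load information are broadcast via UDP.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    database_id: str
    host: str
    port: int
    load: float  # assumed convention: lower means less busy


def parse_announcement(datagram: bytes) -> Announcement:
    """Decode one broadcast payload; the JSON layout is assumed, not documented."""
    fields = json.loads(datagram.decode("utf-8"))
    return Announcement(
        fields["database_id"], fields["host"],
        int(fields["port"]), float(fields["load"]),
    )


def choose_listener(announcements, database_id):
    """Pick the least-loaded listener serving the requested database, or None."""
    candidates = [a for a in announcements if a.database_id == database_id]
    return min(candidates, key=lambda a: a.load, default=None)
```

In a real deployment the datagrams would arrive on a UDP socket and stale announcements would need to expire; both are omitted here to keep the selection logic visible.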
Searching
Nearly all search fields on all search forms accept and process searches in the same way. The query interpretation algorithm proceeds as follows:

1. If the query is a well-formed Boolean expression, it is rigorously interpreted as such. Boolean expressions can contain AND, OR, NOT, simple terms (words), quoted phrases, wildcards and parentheses including sub-expressions, and may contain an unlimited amount of nesting. In addition, a Boolean expression may itself contain any number of fielded sub-expressions that specify a search against a particular metadata field, e.g., (lawsuit or sue) and title: microsoft or netscape. By default, terms are searched against the text field, which includes all full text and all document metadata. This field may also be specified by use of the text: keyword. Search terms may also include one or more trailing or multi-character wildcards (indicated by *) or single-character wildcards (%) as long as there are at least four non-wildcard characters before the first wildcard, e.g., rachm%inof*. Search fields are available to the user through appropriate fields on search forms or through the field: syntax. Other special fields include relational operators that can be used for date fields, sort: date (a reverse chronological sort), or sort: relevancy (the default).

2. If a well-formed Boolean expression is not found and the query is more than a specified length (currently 12 terms), a statistical query evaluation process is used. This only requires the presence of a single term in the query for a document to appear on the results list, but makes use of all terms or phrases appearing in a document to determine the best documents. This statistical evaluation can also be forced on a query of any length by preceding the query with the pseudo-field like:, e.g., like: side effects of antidepressants and sedatives

3. If neither of the above two conditions is met, then a query with any use of the + or - operators common among Internet search engines will be interpreted according to generally accepted rules. The rules are that any term or quoted phrase immediately preceded by a + must be in a document for it to appear on the results list, and any term or quoted phrase preceded by a - cannot be in any document for it to appear on the results list. Other terms in the query are considered desirable but not required.

4. If the query does not meet any of the above criteria, a fuzzy search is performed. This does an implicit AND of most content-bearing words (or what are generally non-content-bearing words if those are the only query terms) but uses all terms entered for relevancy ranking purposes. Some limited natural language analysis is also performed on terms, such as recognition of the word "not".

All query terms are presumed to be nouns and are translated, if necessary, to their singular form using fairly simple algorithms. This allows a match against either a singular or plural form, since all document terms are similarly converted to singular form during indexing. Query terms are also translated to lower case in order to be able to match any form of the word in any document; all document terms are similarly converted to lower case at indexing time. Mixed-case terms are searched against a special mixed-case index to provide information about case-sensitive matches for relevancy ranking.
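The four-way decision described above can be sketched as a small dispatcher in Python. This is a simplification under stated assumptions: real Boolean well-formedness would need a parser, so an operator word plus balanced parentheses stands in for it here, and the mode names and function names are hypothetical. The like: prefix is checked first because it forces statistical evaluation on a query of any length, even one containing "and".

```python
import re

STATISTICAL_THRESHOLD = 12  # "currently 12 terms" per the text
OPERATORS = {"and", "or", "not"}


def balanced(query):
    """Check that parentheses are balanced and never close before opening."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def wildcard_ok(term):
    """A wildcard term needs four or more literal characters before the first * or %."""
    match = re.search(r"[*%]", term)
    return match is None or match.start() >= 4


def interpret(query):
    """Return which of the four interpretation modes would handle the query."""
    stripped = query.strip()
    if stripped.lower().startswith("like:"):
        return "statistical"
    terms = re.findall(r"[^\s()]+", stripped)
    lowered = [t.lower() for t in terms]
    # Crude stand-in for Boolean parsing: an operator plus balanced parentheses.
    if any(t in OPERATORS for t in lowered) and balanced(stripped):
        return "boolean"
    if len(terms) > STATISTICAL_THRESHOLD:
        return "statistical"
    if any(t[0] in "+-" and len(t) > 1 for t in terms):
        return "plus_minus"
    return "fuzzy"
```

Under these assumptions, the paper's own examples route as expected: the fielded lawsuit query is treated as Boolean, and the like: query about antidepressants is evaluated statistically.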
Relevancy Ranking
One of the strengths of Northern Light's technology is its advanced relevancy ranking algorithms. These not only provide a novel approach to ranking but are based on highly optimized index structures and algorithms that allow Northern Light Search and Content Integration to perform significant relevancy ranking operations on a very large database. Ranking takes into account several different factors, each of which contributes weight to a document's overall relevancy score and to its eventual placement in the results list. A maximum theoretical relevancy score is calculated for every query, and displayed relevancy scores represent a simple transformation to a 1-99% range of the actual document score as compared to the maximum theoretical score. These factors include the following:
- Number of occurrences of matching terms (term frequency factor, or TF).
- Relative frequency of those terms in the entire database (term inverse document frequency, or IDF).
- Implicit phrase recognition.
- Location of matching terms and phrases in document metadata.
- Number and authority of external sites linking to this document (applies to Web documents only).
- Date of the document (all other things being equal, more recent documents are considered to be more relevant than older documents).
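To make the TF, IDF, and 1-99% display ideas concrete, here is a minimal Python sketch covering only those two factors. The formulas are textbook-style assumptions, not Northern Light's algorithms: the smoothed logarithmic IDF, the cap on term frequency used to define a maximum theoretical score, and the linear mapping are all hypothetical choices.

```python
import math


def idf(term, doc_freq, total_docs):
    """Rarer terms score higher; the smoothing is a common textbook choice."""
    return math.log((1 + total_docs) / (1 + doc_freq.get(term, 0))) + 1.0


def raw_score(query_terms, term_counts, doc_freq, total_docs, tf_cap=5):
    """Sum of capped term frequency times IDF over the query terms."""
    return sum(
        min(term_counts.get(t, 0), tf_cap) * idf(t, doc_freq, total_docs)
        for t in query_terms
    )


def max_score(query_terms, doc_freq, total_docs, tf_cap=5):
    """Theoretical best: every query term present at the capped frequency."""
    return sum(tf_cap * idf(t, doc_freq, total_docs) for t in query_terms)


def display_score(score, maximum):
    """Map score/maximum onto the 1-99% range shown to users."""
    if maximum <= 0:
        return 1
    return max(1, min(99, round(1 + 98 * score / maximum)))
```

A fuller model would add weighted terms for phrase matches, metadata location, link authority, and document date to both the raw and the maximum score before the final mapping.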
Session Management
Northern Light's service uses a proprietary session state management system to store user state between page requests or other transactions. Each user is given a cookie with a unique token that contains no outwardly useful information. The user's browser transmits the token in the headers of each request, and the Northern Light software uses the token to retrieve session data.
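The token-and-lookup pattern can be sketched in Python as follows. This is an in-memory illustration only; the class name, the cookie name "session", and the storage layout are hypothetical, since the proprietary system is not described beyond the behavior above.

```python
import secrets


class SessionStore:
    """In-memory stand-in for a server-side session state system."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        """Issue an opaque random token and an empty state record for it."""
        token = secrets.token_urlsafe(32)
        self._sessions[token] = {}
        return token

    def load(self, cookie_header, cookie_name="session"):
        """Find the token in a Cookie header and return its state, or None."""
        for part in cookie_header.split(";"):
            name, _, value = part.strip().partition("=")
            if name == cookie_name:
                return self._sessions.get(value)
        return None
```

Because the token is random and all state stays on the server, nothing useful can be read from the cookie itself, which matches the property described above.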
Security
Northern Light can invoke a variety of security solutions at the option of the customer. If a username/password scheme is desired, we provide an administrative user interface for managing the passwords. IP validation can be used in lieu of username/password security, or in addition to it. Secure HTTPS protocols are customarily used, with Verisign certificates ensuring the validity of connections, or leased T1 lines can be used for extreme security. In our 7-year history, Northern Light has never experienced a single hacker intrusion into customer applications or our network.
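IP validation used alone or alongside a password check can be sketched with Python's standard `ipaddress` module. The function names, the `mode` values, and the sample networks are hypothetical; this illustrates the two options described above rather than any actual Northern Light configuration.

```python
import ipaddress


def ip_allowed(client_ip, allowed_networks):
    """True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in allowed_networks)


def authorize(client_ip, password_ok, allowed_networks, mode="ip_and_password"):
    """Apply IP validation alone ("ip_only") or together with a password check."""
    if mode == "ip_only":
        return ip_allowed(client_ip, allowed_networks)
    return password_ok and ip_allowed(client_ip, allowed_networks)
```

The sample call `authorize("192.0.2.15", False, ["192.0.2.0/24"], mode="ip_only")` would admit a user on the customer's network without a password prompt.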
Alerts
Northern Light offers users and enterprise customers the ability to save any search and have it run automatically whenever the database referenced by the search is updated. At that time, an e-mail is sent to the registered owner of the alert if and only if there are new documents in the database that meet the search criteria; the e-mail message contains a link to just these new results. The system further keeps track of when a user has actually accessed these new results so that, if a user receives a string of alert e-mails before being able to see any of them, accessing any one of them will provide all the new results since the user's last access; it is unnecessary to cycle through the alert messages one at a time in order to see all new results.
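The alert behavior can be sketched in Python as a small state machine. The class, its method names, and the use of a predicate function in place of a saved search are hypothetical simplifications of what is described above.

```python
class Alert:
    """A saved search that accumulates new matches until the owner views them."""

    def __init__(self, matches):
        self.matches = matches  # hypothetical predicate: document -> bool
        self.pending = []       # new matching documents not yet viewed

    def on_database_update(self, new_documents):
        """Run the saved search on new documents; return True if an e-mail is due."""
        fresh = [doc for doc in new_documents if self.matches(doc)]
        self.pending.extend(fresh)
        return bool(fresh)

    def open_results(self):
        """Following any alert link returns everything new since the last access."""
        results, self.pending = self.pending, []
        return results
```

Keeping one pending list per alert, rather than one per e-mail, is what lets a user who ignored several messages catch up with a single click.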