
Northern Light SinglePoint Market Research Portal Overview White Paper
September, 2003
One Broadway, 14th Floor, Cambridge MA 02142 617-242-5960
Copyright 2003, Northern Light Group, LLC, All Rights Reserved
Table of Contents
Background
SinglePoint Market Research Portal Overview
Custom Content Integration
Search Technology Architecture History
Search Technology Architecture Overview
    Documents
    Metadata and Metatags
    Data Model
    Data Collection and Web Crawling
    Search Service Description
    Query Database
    Automatic Classification
    Northern Light Taxonomy
        Sample Taxonomy: Aerospace
        Sample Taxonomy: Construction
    Indexing
    Query Service
    Searching
    Relevancy Ranking
    Custom Search Folders
    Session Management
    Security
    Alerts
Applications Development and Hosting
Northern Light Technology Awards
Background
Northern Light was founded in 1996 with the objectives to: (i) unify all of the best content in the world into one database, (ii) build search technology that allows searchers to easily find the most relevant (not just the most) information within that database, (iii) build search technology that works for both the first-time, untrained user and the information professional, and (iv) create a set of tools to allow Northern Light to build and operate custom information solutions for businesses that utilize these capabilities.

This mission was formed from the following observations:
- Most interesting questions have relevant content from many sources, i.e., the Web, news feeds, licensed research, journal archives, and internal corporate information.
- Given the penetration of Internet technology on corporate networks, there is no longer any technical or distribution barrier to making all digital information available from any desktop computer.
- Unstructured information is overwhelmingly the most common type. It is impossible in any large search application to know in advance what the organization, metatags, or document structure will be.
- People in all walks of life are search engine literate and use search engines to access information on a daily basis. No other corporate application requires less training or support than a search application.
- In particular, the key problem for search engine information retrieval is that of producing a precise set of relevant documents: fewer good documents, not more useless ones.
To accomplish its mission, Northern Light set out to meet the following goals:
- Build a continually growing stable of content integration technology that would allow Northern Light to create databases of greatly diverse content,
- Make use of the best existing technology and develop new technology for highly scalable and precise searching,
- Develop highly scalable automated classification and related technologies to use pre- and post-search, because even the best query interpretation and relevancy ranking are frequently inadequate to answer an information need expressed in one or a few words against a database of a billion documents or more, and
- Become extraordinarily proficient at dealing with unstructured data, bringing accessibility, usability, organization, and classification to arbitrarily large and diverse content sets from thousands of sources.
NorthernLight.com formally launched in August 1997 and was the first Internet search engine to offer access to Web, published, and internal corporate content in a single database.
SinglePoint Market Research Portal Overview

Most large companies license studies, analysis, and commentary from dozens of market research firms, trade journals, periodicals, equity analyst investment reports, and newswires. When there is a need to learn about a market, a product, or a competitor, searching all the licensed sources is a major problem. How does the researcher, who may be a marketing, sales, or product development professional rather than a librarian, search dozens of market research and other sources? Does he or she log in dozens of times with dozens of different user names and passwords? Learn dozens of user interfaces on dozens of different Web sites? Run dozens of searches? Manually collate dozens of results lists?

Another challenge confronting organizations that license content from many sources is the utilization of that content. Most of us have become very impatient with search processes as a result of using Internet search engines like Google. As end-users, we know that it is technically possible to have all the relevant information in a single database. Individuals will simply not persevere in searching several sources. The most popular and obvious one will be searched, then maybe one more; on rare occasions, three sources might be searched. The fourth through the hundredth licensed services are rarely consulted.

Northern Light SinglePoint offers a more efficient and effective way to use all of a company's licensed content: one login, one search, and one integrated results list from all licensed sources. Companies can even include search of internal market research and studies so users can access relevant material produced within the organization while simultaneously looking at external sources. Using SinglePoint, it is also possible to include Web content in the database. (Is any search today complete without consulting relevant Web sources?)
For example, vertical search of competitors' websites, trade association websites, and government regulatory agency websites is highly relevant to many marketing and product planning decisions. The outcome: complete research from many sources in the same time it now takes to search just one source. How much is the time of every professional employee in an organization worth? Also, the utilization of licensed content soars when a SinglePoint application is deployed. Sources are used based on their relevance, not on how familiar the staff is with them.

The salient features of SinglePoint include:
- Content: All licensed, internal, and Internet content from any source, in any format, from any organization anywhere in the world. Whatever the sources are, we can put them into one integrated index.
- Organization: All sources indexed and subject classified to a consistent standard.
- User interface: One login, one user interface, one search, one results list.
- Seat management: Enforce access privileges by user or group to sources, groups of sources, or individual documents.
- Outsourced turnkey solution: Northern Light develops, hosts, and maintains the SinglePoint application to help keep overhead down and minimize impact on your internal IT resources.
- Security: Private database and private Web server via secure API, VPN, or T1. User names and passwords can be used, as well as IP validation. Security is customized to meet specific client requirements, and integration with corporate network security systems is available.
Custom Content Integration

The heart of a SinglePoint application is custom content integration. It is not unusual to integrate dozens, even scores, of sources into a SinglePoint market research portal. The integration normally involves these steps:
1. Determine the licensing arrangements of our customer with each of the sources. Northern Light has to understand whether the content is available on an enterprise-wide flat rate, is limited to a certain number of seats, or is limited in some other way. Most research vendors offer a wide variety of options in their offerings, and Northern Light has to understand which options have been selected by our customer.
2. Work with the market research vendor to understand the structure of the vendor's content and network, and to determine how to acquire the content to build our customer's index. Vendors today most often use an HTTP-based extranet that we can crawl (with the vendor's help getting through the firewall). Alternatively, FTP file transfers are a common technique for acquiring the content. Northern Light will then set up automated processes to acquire exactly what our customer has licensed from each vendor.
3. Write filters to convert the vendor's content to the Northern Light load format. This involves determining how to capture the vendor's metadata (or metatags) so that the metadata can be included in the index (and hence be available to select, sort, and filter results).
4. Determine the login required for the vendor to fulfill document requests, and set up our transaction system to transmit documents from the vendor site to the browsers of our customer's end-users.
5. Determine if any internal content is intended for the database. This can be internal market studies, MS PowerPoint presentations, or locally held copies of licensed market research. If internal content is to be included, crawling or FTP file transfer procedures must be arranged. For small volumes of internal content, Northern Light's Automated Submission and Publishing (ASAP) system can be used to easily publish documents to the database.
6. Determine if any Web content is desired for the database. Popular choices include competitors' websites, trade association websites, and government regulatory agency websites. Subscription websites and e-journals can be included in the database, e.g., trade journal sites or industry publication sites.
7. Index and classify the content, creating the comprehensive multi-vendor index for our customer to search.
8. Load the content every day, create the index, and serve the end-user queries. Northern Light disposes of its copy of the vendor's content (or of internal content) a few days after the load process is complete. (We do maintain a copy for a few days so we can re-index a recent load if there turns out to be a problem of any sort.) Note that a SinglePoint database index is unique to a specific customer, facilitating security and usability.
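The filter-writing step can be illustrated with a short sketch. This is only a hedged example in Python, not the actual Northern Light load format: the `LoadRecord` type, the `vendor_filter` function, and the vendor field names (`link`, `doc_title`, `pub_date`, `analyst`, `doc_type`) are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LoadRecord:
    """Hypothetical normalized record: where the document lives plus its metadata."""
    url: str
    title: str
    published: date
    source: str
    metadata: dict[str, list[str]] = field(default_factory=dict)


def vendor_filter(raw: dict[str, str], source: str) -> LoadRecord:
    """Map one vendor-specific record (hypothetical field names) to the common format."""
    year, month, day = (int(part) for part in raw["pub_date"].split("-"))
    extra = {
        # Vendor metadata is kept as multi-valued fields so that it can later
        # be used to select, sort, and filter results.
        "author": [name.strip() for name in raw.get("analyst", "").split(";") if name.strip()],
        "type": [raw.get("doc_type", "report")],
    }
    return LoadRecord(
        url=raw["link"],
        title=raw["doc_title"].strip(),
        published=date(year, month, day),
        source=source,
        metadata=extra,
    )


record = vendor_filter(
    {
        "link": "https://vendor.example/reports/123",
        "doc_title": "  Storage Market Outlook ",
        "pub_date": "2003-06-30",
        "analyst": "A. Smith; B. Jones",
    },
    source="Example Research Vendor",
)
print(record.title, record.published, record.metadata["author"])
```

In this sketch one filter function would be written per vendor, while everything downstream (classification, indexing) sees only the common record shape.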
Once the comprehensive multi-vendor/multi-source database index has been built, end-users may query from UIs provided on their intranet or from a UI we can provide as a private website. Results returned are from all the content in the database across all of the licensed vendors, consistently relevance ranked and indexed. When an end-user wants a document, he or she clicks on the link, just as with any other search engine. We then instantly, transparently, and automatically authenticate that the end-user has rights to the document, log the end-user in to the service of the vendor in question, fetch the document, and put the document in the browser window of the end-user. The transaction system of the vendor records the event as if the end-user had logged in. Below is a diagram of the content integration process.

[Figure: content integration process. Labels: Northern Light; Customer; Integrated Crawler; Index; User Interface; Trash; Internal Repository; Internal Mkt. Research; Investment reports; Industry Web sites; Newswires & journals; Mkt. Research vendor; Full-text content; Reports; Computer generated data]
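The rights check that precedes a document fetch can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of Northern Light's transaction system: the tables `USER_GROUPS` and `SOURCE_ACCESS` and the functions `can_open` and `request_document` are hypothetical.

```python
# Hypothetical entitlement tables: which groups a user belongs to, and which
# users or groups may open documents from each licensed source.
USER_GROUPS = {"alice": {"marketing"}, "bob": {"engineering"}}
SOURCE_ACCESS = {
    "vendor-a": {"marketing"},             # group-level license
    "vendor-b": {"alice", "engineering"},  # one named seat plus a group
}


def can_open(user: str, source: str) -> bool:
    """Return True if the user, or any group the user belongs to, is entitled to the source."""
    allowed = SOURCE_ACCESS.get(source, set())
    principals = {user} | USER_GROUPS.get(user, set())
    return bool(principals & allowed)


def request_document(user: str, source: str, url: str) -> str:
    """Check rights first; only then would the vendor login and fetch take place."""
    if not can_open(user, source):
        return "denied"
    # A real system would log in to the vendor service on the user's behalf
    # and pass the document to the browser; here we only report the decision.
    return f"fetch {url} for {user}"


print(request_document("alice", "vendor-b", "https://vendor.example/doc/1"))
print(request_document("bob", "vendor-a", "https://vendor.example/doc/2"))
```

Keeping the entitlement decision separate from the fetch means the same check could serve both seat-limited and enterprise-wide licenses.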
Search Technology Architecture History

Northern Light's initial strategy was to provide a service to site visitors that would generate revenue from advertising and from sales of a licensed business information library, known as the Special Collection, of over 7,000 journal and periodical sources. In order to offer this service, Northern Light had to master the Web and hundreds of publisher formats, integrate all of the content into one database, normalize it from an indexing and classification viewpoint, and search it all with one query. The total database approached 400 million documents, and it may still be the largest business research library ever assembled.

The service (the search engine then available at www.NorthernLight.com) was warmly received by existing consumers of high-quality, fee-based information: professionals in corporations, educational institutions, and governments. As a result, a marketing effort was directed toward the enterprise market through high-volume sales to organizations. Northern Light's unique focus on both Web and non-Web data, coupled with its ability to sell non-Web documents, made it of interest to organizations that wanted to be directly involved in offering their own data or services. As a result, the initial strategy was combined with a gateway partnership component that typically paired Northern Light with other organizations in offering co-branded, specialized, Internet-based search sites.

Northern Light's enterprise strategy now includes a suite of customized search solutions based on the Northern Light technology platform that creates value for a large range of organizations and situations. These solutions include:
- Custom intranet portals that feature search of customized content sets of licensed, internal, and Web sources,
- Search-based services for extranets,
- Hosting and sale of archival documents for publishers,
- Search-of-site for Web-based services, and
- Full-scale custom information products (e.g., for the U.S. government), etc.
These information management products reflect the company's core competencies in search technology, classification and taxonomy development, and integration of diverse content and federated search. Northern Light typically offers these services on an outsourced, ASP basis. However, the Northern Light search technology is also available as licensed software for in-house customer use on Solaris and Linux platforms.

All of Northern Light's solutions derive from the original vision expressed in our Web search engine of one database that could provide access to all the world's useful information. All of our solutions share these characteristics:
- Scale to gigantic numbers of documents.
- Speed to efficiently search such large databases.
- Precision, or relevance ranking, to make the large databases useful.
- Classification to automatically organize the body of unstructured content in useful ways.
Search Technology Architecture Overview

Documents
The basic unit of the Northern Light database is a document. Documents are most often text based, although to the database they appear simply as objects with a URL, so they could be any media. Each document is viewed as a multi-dimensional object that may have one or more values for a number of different fields, attributes, dimensions, and/or domains (terms used interchangeably). One such special field is the display object: the viewable document itself, generally stored in the form in which it was delivered to Northern Light. The display object is generally not retained by Northern Light unless there is an arrangement to do so. Most often, the display object resides where it was crawled, and the URL in the Northern Light database points to it there.
Metadata and Metatags

Other fields, or the values for all other fields (title, author, creation date, subject, source, etc.), are generally referred to as metadata: data about the document. The term metadata sometimes applies to the actual values a document may have for each of these fields, as well as to the named fields themselves. Certain metadata is required for any document, including a display object, title, etc. Metadata is also used for Custom Search Folders and other classification-based browsing and searching, and includes key attributes such as subject, type, source, language, and region. This metadata is sometimes present in the document itself (in which case the metadata might be called metatags), and sometimes it is generated by the Northern Light technology's autoclassification capabilities. Unique metadata of custom data types, e.g., internal documents, may be captured by the content loading filter so that it will always be present for use in classification and searching. A document can have multiple values for a single field, e.g., multiple subjects or multiple authors. All metadata is represented in the database index, which makes it available for filtering, organizing, and sorting at query time.
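A small sketch may help make multi-valued metadata concrete. It is an illustration only, written in Python with hypothetical sample documents and helper names (`DOCS`, `filter_by`, `sort_by_date`); it shows fields that hold lists of values and how such fields can be used to filter and sort at query time.

```python
from datetime import date

# Each document maps a field name to a list of values, because a field such
# as "subject" or "author" may hold several values. Sample data is invented.
DOCS = [
    {"title": ["Jet engine market review"], "author": ["Lee", "Ortiz"],
     "subject": ["Jet engines", "Aerospace propulsion"], "created": [date(2003, 5, 1)]},
    {"title": ["Concrete pricing trends"], "author": ["Khan"],
     "subject": ["Concrete"], "created": [date(2003, 7, 15)]},
    {"title": ["Rocket propulsion outlook"], "author": ["Ortiz"],
     "subject": ["Rocket engines", "Aerospace propulsion"], "created": [date(2002, 11, 3)]},
]


def filter_by(docs, field_name, value):
    """Keep documents where any value of the given field equals `value`."""
    return [d for d in docs if value in d.get(field_name, [])]


def sort_by_date(docs, newest_first=True):
    """Order documents by their first creation date."""
    return sorted(docs, key=lambda d: d["created"][0], reverse=newest_first)


hits = sort_by_date(filter_by(DOCS, "subject", "Aerospace propulsion"))
print([d["title"][0] for d in hits])
```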
Data Model
The standard metadata used for Custom Search Folders (subject, type, source, language, and region) is treated specially in a number of ways. The possible values for each of these fields have been defined and comprise a taxonomy, or set of possible values, for that field/domain. These taxonomies are all hierarchical, though they contain many cross-references; a given value may have more than one parent because each taxonomy is actually a directed acyclic graph.

The subject taxonomy contains approximately 17,000 values (referred to as nodes), starting at the top level with broad categories such as "humanities" and sometimes going eight or more levels deep in certain areas to provide very specific subject values such as "works of W.H. Auden" or "robotics". The type field refers to the kind of document: an article (the default and most populated type), a review (with more specific typing of book review and others), an editorial, a letter, a report, something for sale, etc. The source field refers to where the document came from, and is either a Web source of some kind (e.g., a Web site, or possibly a higher-level source node such as all commercial sites) or a Special Collection source, typically a single journal or book title at the lowest level (e.g., The Economist or the Boston Herald) or, again, a higher-level aggregate (e.g., journals and magazines, news articles, etc.). The language field is the predominant language(s) of the document, currently one of English, French, German, Spanish, Italian, and unknown (i.e., some other language). The region field specifies a location or locations referenced in the document: a city, country, geographic region, etc.

For type, language, and subject, the metadata value(s) attempt to capture what the document is really about (or substantially written in, in the case of language). Multiple values are possible, but these are intended to represent true multi-subject documents. In the case of region, the model is slightly different; a document will be tagged with any and all regions that can possibly be identified with the document. The difference between region and these other fields is the way they are used in searching.

This multi-dimensional model is in contrast to a single-dimensional model that must rely on repetition within a single domain in order to achieve comprehensive document descriptions. For example, in a single-dimensional design, values like "reviews" or "biographies" could be repeated under all or a very large number of subject values. As the amount of metadata increases, the single (or few) domain model becomes increasingly complex and unwieldy. The Northern Light multi-dimensional model, however, can maintain multiple taxonomies easily and simply and can class and organize documents against them.
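The statement that a taxonomy value may have more than one parent can be illustrated with a small directed acyclic graph. The sketch below is an assumption-laden example in Python: the edges in `PARENTS` are invented for illustration and do not reproduce the actual Northern Light taxonomy, and the helper names `ancestors` and `matches` are hypothetical.

```python
# Child -> parents. A node may list more than one parent, so the structure is
# a directed acyclic graph rather than a strict tree. The edges below are
# illustrative only and do not reproduce the actual Northern Light taxonomy.
PARENTS = {
    "Technology": [],
    "Business": [],
    "Aviation & space technology": ["Technology"],
    "Construction industry": ["Business"],
    "Civil engineering": ["Technology"],
    "Lighting design": ["Civil engineering", "Construction industry"],
}


def ancestors(node: str) -> set[str]:
    """Collect every node reachable by following parent links upward."""
    seen: set[str] = set()
    stack = list(PARENTS[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def matches(doc_subjects: list[str], query_subject: str) -> bool:
    """A document matches a subject if it is tagged with it or with any descendant."""
    return any(s == query_subject or query_subject in ancestors(s) for s in doc_subjects)


print(sorted(ancestors("Lighting design")))
print(matches(["Lighting design"], "Business"))
```

Because a node is stored once and reached through several parents, a value does not have to be repeated under each branch in which it is relevant.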
Data Collection and Web Crawling

Using SinglePoint, any content in any format located on any computer anywhere in the world in the possession of any organization can be put into the database. From a technical viewpoint, data flows into the database via crawling (if the content is on an HTTP platform such as the Web or an intranet), licensed feeds, or FTP file transfer. Data is converted to a standard Northern Light format that captures the document itself plus associated metadata, including title, date, and anything else that the customer wants to have captured, such as document type. Since data typically arrives in non-HTML format, part of this conversion involves changing the document text itself (often in tagged ASCII, SGML, or other formats) into HTML. Northern Light has, to date, converted over 200 different data formats, including the older tif-wrapped PDF and documents rendered as images. (Images are processed with programmatic OCR to make the content available for indexing and full-text search.)

In the case of certain third-party content licensed specifically for one or more SinglePoint implementations (such as market research content from vendors such as Gartner, IDC, or Forrester), Northern Light keeps the content only as long as it takes to create the necessary indexes. Once the content has been completely indexed, it is discarded, making it impossible for Northern Light to re-create the full text of these content sources.
Search Service Description

Query Database
Once data is placed in a standard Northern Light format, it is loaded into a Northern Light database. "Loading" may be a misleading term because the content does not actually reside in the Northern Light database. Loading refers to subject classifying the documents and indexing them. For SinglePoint applications, the index of classified documents is unique to an individual corporate customer. End-users send queries to the index, and the results lists are generated from the index. The content itself is not touched during querying. Indeed, for many applications Northern Light actually disposes of its copy of the content after it is loaded, recording in the index, of course, where the content was obtained so that the document can be retrieved if an end-user wants to read it after selecting it from a results list.
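The idea of answering queries from the index alone, while keeping only a pointer to where each document lives, can be sketched as below. This is a toy illustration in Python, not the Northern Light index format; the class name `PointerIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class PointerIndex:
    """Toy index that keeps terms and a URL per document, but not the content."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.locations = {}               # document id -> URL where it was obtained

    def load(self, doc_id, url, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.locations[doc_id] = url
        # `text` is not stored; after loading, the caller may discard its copy.

    def search(self, query):
        """Return URLs of documents containing every query term."""
        term_sets = [self.postings.get(t, set()) for t in query.lower().split()]
        if not term_sets:
            return []
        ids = set.intersection(*term_sets)
        return [self.locations[i] for i in sorted(ids)]


index = PointerIndex()
index.load(1, "https://vendor.example/a", "Storage market forecast")
index.load(2, "https://vendor.example/b", "Server market share")
print(index.search("market forecast"))
```

A results list built this way carries only URLs, so reading a document still requires fetching it from its original location.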
Automatic Classification
To deliver automated classification against a huge and heterogeneous data set, the Northern Light technology uses our own classification taxonomies for subject, type (e.g., article, review, FAQ, job listing, etc.), and other document attributes, drawing on existing taxonomies and supplementing them to provide comprehensive coverage for a wide range of users. An automatic system has also been built that uses multiple strategies (e.g., pattern extraction from training documents, co-location analysis, and structural elements) for classifying documents for a given attribute. Both the taxonomies and the automated system have been in production and supporting end users since August 1997, have classified over a billion documents, and are continually being refined to deliver more comprehensive and precise classification and better operational performance. At this point, Northern Light's automatic classification is still the only system ever to automatically subject classify the World Wide Web.

Performance levels have been achieved by fully divorcing the logical classification models from their practical implementation and creating data structures appropriate for rapid classification of documents against the large but relatively stable taxonomies, patterns, and rules that are the basis of the classification process. One primary use of classification information (i.e., metadata) at Northern Light Integration today is to organize the results of a search (through Custom Search Folders) by appropriate attribute values. This facilitates rapid navigation and some level of automatic query refinement, while allowing more expert users to limit their search initially by some appropriate attribute value. Metadata is also used as one factor (among many) in relevancy ranking.
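Organizing search results by attribute value, as described above for Custom Search Folders, can be sketched as a simple grouping step. This is a hedged illustration in Python only: the sample `RESULTS`, the function `build_folders`, and the "(unclassified)" fallback folder are assumptions made for the example, not part of the Northern Light product.

```python
from collections import defaultdict

# Invented sample results; each carries multi-valued metadata fields.
RESULTS = [
    {"title": "Jet engine market review", "type": ["report"], "source": ["Vendor A"]},
    {"title": "Engine maker posts record quarter", "type": ["article"], "source": ["Newswire B"]},
    {"title": "Turbine reliability study", "type": ["report", "review"], "source": ["Vendor A"]},
]


def build_folders(results, attribute):
    """Group result titles by each value they carry for one metadata attribute."""
    folders = defaultdict(list)
    for doc in results:
        # Documents lacking the attribute go to a fallback folder (an assumption).
        for value in doc.get(attribute, ["(unclassified)"]):
            folders[value].append(doc["title"])
    return dict(folders)


by_type = build_folders(RESULTS, "type")
print(by_type)
```

A document with several values for the attribute appears in several folders, which mirrors the multi-valued metadata model.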
Northern Light Taxonomy

Subject classification has been designed to classify a document to the one subject, or the small number of subjects, from our 17,000+-term subject taxonomy that the document is truly about (vs. classifying to all subjects that occur in the document). The system can today subject classify approximately 25% of random Web documents (about what human editors are able to do) at accuracy rates of 90-95%, based on user/customer appraisals. These coverage and accuracy rates are significantly better for non-Web documents. Classification coverage and accuracy have been realized by continually engineering and extending both known and novel technologies in light of specifically identified problems.
Below are two examples of taxonomy branches: one for aerospace technology and industry, the other for construction technology and industry. The node identifier is the ID#, and the @ sign indicates inclusion by reference of other branches of the taxonomy.

Sample Taxonomy: Aerospace
Aviation & space technology ID#18340
Aerodynamics ID#18341
Aeronautics ID#18342
Flight control & navigation ID#18368
Aeronomy ID#39560
Aerospace communications equipment ID#18367
Aerospace materials ID#18344
Air traffic control ID#38332
Aircraft design & construction ID#18346
Aircraft engines & motors ID#18356
Aerospace propulsion ID#18395
Jet engines ID#18357
Rocket engines ID#18359
Aviation instrumentation ID#18382
Commercial aircraft design ID#18347
Flight simulators ID#17478
Flight testing ID#18369
Gliders ID#18351
Helicopters ID#18352
Homebuilts & ultralights ID#18349
Hot air balloons ID#18350
Landing gear ID#18383
Military aircraft ID#18353
Seaplanes ID#18354
Small planes ID#18355
Astronautics ID#18363
Space systems ID#18412
Astrophysics ID#18366
@Celestial mechanics (ID#14437) ID#13928
Aviation ground facilities ID#18372
Airport planning & design ID#18373
Military aircraft ground facilities ID#18375
Spacecraft ground facilities ID#18377
Avionics ID#37848
Civil aviation ID#37768
@Flight control & navigation (ID#14428) ID#18368
History of aviation & space technology ID#18378
History of aviation ID#18379
History of space flight ID#18380
@National Aeronautics & Space Administration (NASA) (ID#14427) ID#10135
@Remote sensing (ID#14429) ID#13629
Satellite technology ID#18402
Communications satellites ID#18404
Space stations ID#18411
MIR space station ID#36642
Space travel & exploration ID#18413
Space colonization ID#18405
Spacecraft & Space missions ID#18406
Apollo space missions ID#18407
Gemini space mission ID#18408
Manned spacecraft ID#39606
Mercury space missions ID#18409
Space Shuttle ID#18410
Space launch vehicles & equipment ID#39607
Space probes ID#18389
Space safety ID#39608
Unmanned spacecraft ID#39609
Viking mission to Mars ID#29129
@Telescopes (ID#14432) ID#13198

Sample Taxonomy: Construction
Architectural engineering ID#18323
@Architectural design (ID#40386) ID#358
Building acoustics ID#18324
@Construction engineering (ID#14424) ID#18511
@Construction management (ID#14425) ID#18512
Heating, ventilation & air conditioning ID#18331
@Air conditioners & fans (ID#41168) ID#14698
@Home furnaces (ID#43116) ID#37672
Lighting & electrical systems ID#18333 Commercial lighting ID#14840 Exterior lighting ID#14767 @Lighting design (ID#42055) @Structural engineering (ID#14426) Architectural services ID#4574 Architectural drafting ID#4576 House plans ID#37844 @Landscape architecture (ID#40544) Lighting design ID#27266 Asbestos ID#5305 @Asbestos exposure (ID#40679) @Asbestos removal (ID#40570) Civil engineering ID#18509 Bridge engineering ID#18510 Construction engineering ID#18511 Building standards & codes ID#39565 Construction automation ID#18513 Construction management ID#18512 Construction safety ID#6194 @Dams, canals & waterways (ID#14473) Earthworks engineering ID#18518 Fire technology ID#28435 Combustion & flammability ID#28439 Fire investigation ID#28448 Fire prevention ID#28450 Fire safety systems ID#19295 Fire suppression ID#28441 Geotechnical engineering ID#18520 Earthquake engineering ID#39468 Geo-environmental systems ID#13643 Geosynthetics ID#13644 Hydraulic engineering ID#18524 Coast & Harbor engineering ID#18525 Flood control ID#18527 @Hydraulic cement (ID#43096) @Hydraulic fluids (ID#43328)
17
@Hydraulic machinery (ID#41488) Hydraulic structures ID#18528 Aqueduct engineering ID#37887 Dams, canals & waterways ID#18530 Reservoir engineering ID#18532 Irrigation & drainage ID#13032 Sediment transport ID#18535 Surface water runoff ID#18536 Lighthouses ID#18540 @Ocean engineering (ID#14472) Structural engineering ID#18547 @Mechanical behavior of materials (ID#14474) Structural concrete ID#18551 Structural steel ID#18554 Surveying ID#18555 @Geographic information systems (GIS) (ID#14475) Photogrammetry ID#18560 @Remote sensing (ID#14476) Transportation engineering ID#18565 @Automotive engineering (ID#41482) Electric vehicles ID#36914 Emission control ID#18569 High-speed ground transportation ID#18570 Highways, roads & pavements ID#18522 Intelligent transportation systems ID#18574 Marine transportation ID#19488 Pipeline transportation ID#39656 Railroad engineering ID#18544 Transportation planning ID#18575 Transportation safety ID#39610 @National Transportation Safety Board (NTSB) (ID#40938) @Urban transportation (ID#41339) Tunnel engineering ID#18577 Construction industry ID#26320 Building contractor services ID#4725 Building materials ID#39453 @Carpentry & woodworking (ID#41390)
18
@Construction machinery (ID#41486) @Driveway coating & construction (ID#42066) @Electrician services (ID#41176) @Fences & stone walls (ID#40582) Hand & power tools ID#14756 @Home improvement centers (ID#43309) @Insulation services (ID#41174) @Landscaping services (ID#41177) Nonresidential Construction ID#39628 Plumbers & plumbing supplies ID#14751 @Bathroom fixtures & accessories (ID#41169) @Pool construction & maintenance services (ID#41179) Residential construction ID#39449 @Roofing services (ID#41180) @Septic systems (ID#42067) @Underwater construction & Habitats (ID#43333) @Water well drilling (ID#42047) Facilities management ID#6270 Floor laying, refinishing & resurfacing ID#27352 Heating & Ventilation industry ID#26835 @Heating, ventilation & air conditioning (ID#41445) House painting & wall covering services ID#14789 Industrial equipment & Heavy machinery industry ID#26341 @Farm equipment & Supplies industry (ID#41904) @Manufacturing equipment & machinery (ID#41485) @Turbomachinery (ID#41508) Lighting industry ID#26349 Electrician services ID#14784 Electrical supplies ID#14744 Electrical testing & inspection ID#18674 Electrical wiring ID#18675 @Lighting & electrical systems (ID#41446) @Lamps & light fixtures (ID#40577) @Lighting design (ID#42054) Laminated wood ID#27394 Particle board ID#27398 Plywood & veneer ID#5068
19
Pressure treated wood ID#27401 Sheet metal ID#5124 Wire & Cable products ID#39463 Aluminum & aluminum products ID#5112 Copper & copper products ID#5116 Iron industry ID#26357 Steel industry ID#26358 Stainless steel ID#5125 Paint & paint supplies ID#14748 Property developers ID#5232 Rock mechanics ID#39569 Soil science & technology ID#18312 Erosion ID#18313 Fertilizers ID#18316 Chemical fertilizers ID#17611 Organic fertilizer ID#17612 Soil cultivation ID#18315 Soil pollution ID#13029 Soil remediation ID#37889 Stone, clay, glass & concrete product industry ID#26373 Cement ID#5306 Hydraulic cement ID#36456 @Ceramics & Pottery (ID#40405) ID#797 Concrete ID#5307 Concrete block & brick ID#36460 Ready-mixed concrete ID#36461 Cut stone & stone products ID#5308 Granite ID#36462 Limestone ID#36463 Marble ID#36464 @Memorials & Grave stones (ID#42060) ID#27319 Earthenware ID#5309 Glass products ID#5311 Automobile glass ID#36478 Flat glass ID#36465 Glass containers ID#36466 @Mirrors (ID#40565) ID#4879
Pressed & blown glass ID#5315 @Sand & Gravel (ID#41514) ID#19469 Vitreous china ID#36475 Windows & doors ID#4998
Indexing
During indexing, all of the terms in the documents and metadata fields are extracted and indexed into appropriate index structures; searches can then be resolved from these structures without referring to the original documents. This process is exhaustive: all visible terms in the document display object and all appropriate metadata values are indexed. No stop words or special characters are excluded from indexing, and there is no practical cutoff in document length at which indexing stops.
All search terms (including those inside quoted phrases) are treated as nouns and transformed (stemmed) automatically to their common singular form during indexing and again at query time. This allows a search on a singular or plural noun to find occurrences of either. The stemming rules are fairly simple and do not handle most irregular forms, which tend to occur in very common words that are not generally useful in searching.
All terms are indexed in lower case. In addition, all terms containing at least one upper-case and one lower-case letter are also indexed in a special case-sensitive index. Because query terms are also translated to lower case for initial query resolution, a query finds all instances of a term regardless of case, while a case-sensitive match with a query term can still be used as a relevancy factor.
The above rules are English-language dependent. However, given the symmetry with which they are applied at indexing and query time, they generally preserve appropriate search processing for all languages. Double-byte language support, which requires licensing a third-party language processing module from Teragram, includes language-sensitive stemming and other language-specific processing. Numeric tokens (and tokens that mix letters and numbers) are indexed as text.
Proximity information can be represented in indices in various ways, allowing either very fast access of short phrases, or complete and precise (but slower) access of phrases of unlimited length.
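The indexing rules described above can be illustrated with a minimal sketch. It is not Northern Light's implementation: the function names, the tokenizing pattern, and the two plural-stripping rules are assumptions chosen only to show singular-form stemming, the all-lower-case index, and the separate case-sensitive index for mixed-case terms.

```python
import re
from collections import defaultdict

def to_singular(term):
    """Deliberately simple plural stripping (assumed rules); irregular forms are not handled."""
    if len(term) > 3 and term.endswith("ies"):
        return term[:-3] + "y"          # "industries" -> "industry"
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]                # "windows" -> "window"
    return term                         # "glass" stays "glass"

def build_indices(docs):
    """docs maps a document id to its text; returns (lower_index, case_index)."""
    lower_index = defaultdict(set)   # every term, lower-cased and stemmed; no stop words
    case_index = defaultdict(set)    # only terms containing both upper and lower case
    for doc_id, text in docs.items():
        # Numeric and mixed letter/number tokens are indexed as ordinary text.
        for token in re.findall(r"[A-Za-z0-9]+", text):
            lower_index[to_singular(token.lower())].add(doc_id)
            has_upper = any(c.isupper() for c in token)
            has_lower = any(c.islower() for c in token)
            if has_upper and has_lower:
                case_index[to_singular(token)].add(doc_id)
    return lower_index, case_index
```

With these assumed rules, a document containing "Windows" and another containing "window" both appear under the lower-case entry "window", while only the first appears under the case-sensitive entry "Window".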
Query Service
Finished databases are connected to the Northern Light network by the Query Listener (QL). The QL accepts queries from external clients (such as a Web server) and passes them to the Query Server (QS). The QS translates the search syntax and other parameters, queries the database indices, and returns the appropriate citation information and metadata. The Query Listener is also responsible for identifying itself by broadcasting, via UDP, a database identifier and load information. Clients use this information to select a listener appropriate to their mission.
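As an illustration of how a client might use the broadcast information, the sketch below parses announcements and picks the least-loaded listener for a given database. The JSON payload layout, the field names, and the "lowest load wins" policy are assumptions made for the example; the actual wire format and selection rules are not described in this paper.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Announcement:
    database_id: str   # which database this listener serves
    load: float        # reported load; lower means less busy (assumed convention)
    address: tuple     # (host, port) the datagram came from

def parse_announcement(payload, sender):
    """Decode one broadcast datagram (hypothetical JSON format)."""
    msg = json.loads(payload.decode("utf-8"))
    return Announcement(msg["database_id"], float(msg["load"]), sender)

def choose_listener(announcements, database_id):
    """Return the least-loaded listener for database_id, or None if none announced it."""
    candidates = [a for a in announcements if a.database_id == database_id]
    return min(candidates, key=lambda a: a.load) if candidates else None
```

In a live client the payload and sender would come from a UDP socket's recvfrom() call; here they are passed in directly so the selection logic can be read on its own.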
Searching
Nearly all search fields on all search forms accept and process searches in the same way. The query interpretation algorithm proceeds as follows:
1. If the query is a well-formed Boolean expression, it is rigorously interpreted as such. Boolean expressions can contain AND, OR, NOT, simple terms (words), quoted phrases, wildcards, and parenthesized sub-expressions nested to any depth. In addition, a Boolean expression may itself contain any number of fielded sub-expressions that specify a search against a particular metadata field, e.g., (lawsuit or sue) and title: microsoft or netscape. By default, terms are searched against the text field, which includes all full text and all document metadata. This field may also be specified by use of the text: keyword. Search terms may also include one or more trailing or multi-character wildcards (indicated by *) or single-character wildcards (%) as long as there are at least four non-wildcard characters before the first wildcard, e.g., rachm%inof*. Search fields are available to the user through appropriate fields on search forms or through the field: syntax. Other special syntax includes relational operators that can be used for date fields, sort: date (a reverse chronological sort), and sort: relevancy (the default).
2. If a well-formed Boolean expression is not found and the query is more than a specified length (currently 12 terms), a statistical query evaluation process is used. This requires only that a single query term be present in a document for it to appear on the results list, but makes use of all terms or phrases appearing in a document to determine the best documents. This statistical evaluation can also be forced on a query of any length by preceding the query with the pseudo-field like:, e.g., like: side effects of antidepressants and sedatives
3. If neither of the above two conditions is met, then a query with any use of the + or - operators common among Internet search engines will be interpreted according to generally accepted rules. The rules are that any term or quoted phrase immediately preceded by a + must be in a document for it to appear on the results list, and any term or quoted phrase preceded by a - cannot be in any document that appears on the results list. Other terms in the query are considered desirable but not required.
4. If the query does not meet any of the above criteria, a fuzzy search is performed. This does an implicit AND of most content-bearing words (or of what are generally non-content-bearing words if those are the only query terms) but uses all terms entered for relevancy ranking purposes. Some limited natural language analysis is also performed on terms, such as recognition of the word not.
All query terms are presumed to be nouns and are translated, if necessary, to their singular form using fairly simple algorithms. This allows a match against either a singular or plural form, since all document terms are similarly converted to singular form during indexing. Query terms are also translated to lower case in order to match any form of the word in any document; all document terms are similarly converted to lower case at indexing time. Mixed-case terms are searched against a special mixed-case index to provide information about case-sensitive matches for relevancy ranking.
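The four-way decision described above can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the looks_boolean check is a crude stand-in for a real Boolean parser, the tokenizer and the mode names are hypothetical, and the like: prefix is tested first because the text says it forces statistical evaluation on a query of any length.

```python
import re

OPERATORS = {"and", "or", "not"}
STATISTICAL_THRESHOLD = 12   # "currently 12 terms" per the text

def tokenize(query):
    # Quoted phrases stay as one token; parentheses become their own tokens.
    return re.findall(r'"[^"]*"|[()]|[^\s()]+', query)

def looks_boolean(tokens):
    """Crude well-formedness check (assumption): balanced parentheses, at least
    one operator, and AND/OR never leading, trailing, or following an operator."""
    depth, has_operator, prev = 0, False, None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        is_last = i == len(tokens) - 1
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
        elif low in ("and", "or"):
            if prev is None or prev == "(" or prev.lower() in OPERATORS or is_last:
                return False
            has_operator = True
        elif low == "not":
            if is_last:
                return False
            has_operator = True
        prev = tok
    return depth == 0 and has_operator

def classify_query(query):
    """Return 'boolean', 'statistical', 'plus_minus', or 'fuzzy'."""
    q = query.strip()
    if q.lower().startswith("like:"):
        return "statistical"                      # forced statistical evaluation
    tokens = tokenize(q)
    if looks_boolean(tokens):
        return "boolean"                          # step 1
    terms = [t for t in tokens if t not in ("(", ")")]
    if len(terms) > STATISTICAL_THRESHOLD:
        return "statistical"                      # step 2
    if any(len(t) > 1 and t[0] in "+-" for t in terms):
        return "plus_minus"                       # step 3
    return "fuzzy"                                # step 4
```

For example, classify_query("+jaguar -car speed") falls through the Boolean and length checks and is handled by the +/- rules, while a plain three-word query ends up in the fuzzy search.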
Relevancy Ranking
One of the strengths of Northern Light's technology is its advanced relevancy ranking algorithms. These not only provide a novel approach to ranking but are based on highly optimized index structures and algorithms that allow Northern Light Search and Content Integration to perform significant relevancy ranking operations on a very large database. Ranking takes into account several different factors, each of which contributes weight to a document's overall relevancy score and to its eventual placement in the results list. These factors include the following:
Number of occurrences of matching terms (term frequency factor, or TF)
Relative frequency of those terms in the entire database (term inverse document frequency, or IDF)
Implicit phrase recognition
Location of matching terms and phrases in document metadata
Number and authority of external sites linking to this document (applies to Web documents only)
Date of the document (all other things being equal, more recent documents are considered to be more relevant than older documents)
Classification metadata
Document length
Presence of any detectable spam
A maximum theoretical relevancy score is calculated for every query, and displayed relevancy scores represent a simple transformation to a 1-99% range of the actual document score as compared to the maximum theoretical score.
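Two of the ideas in this section lend themselves to a short sketch: the TF and IDF factors, and the mapping of a document score onto a displayed 1-99% value. The formulas below are assumptions, not Northern Light's algorithm: tf_idf uses the common textbook form (term count times the log of total documents over documents containing the term), and displayed_score assumes a linear scaling against the maximum theoretical score, clamped to the 1-99 range.

```python
import math

def tf_idf(term_count, docs_with_term, total_docs):
    """Textbook TF-IDF weight for one term in one document (assumed formula)."""
    if term_count == 0 or docs_with_term == 0:
        return 0.0
    return term_count * math.log(total_docs / docs_with_term)

def displayed_score(doc_score, max_score):
    """Map a raw score to a 1-99% display value relative to the query's
    maximum theoretical score (linear scaling and clamping are assumptions)."""
    if max_score <= 0:
        return 1
    pct = round(100 * doc_score / max_score)
    return max(1, min(99, pct))
```

Under these assumptions a document that reaches the maximum theoretical score is displayed as 99%, and a document with a very low score is displayed as 1% rather than 0%.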
Custom Search Folders

For any result set containing more than 25 results, Northern Light determines a set of Custom Search Folders (CSFs) before returning the results to the user. To do this, Northern Light examines the metadata values of the documents on the results list, uses those values to determine candidate CSFs, weighs each of the candidates, and then displays the top-weighted CSFs. The weight of a CSF is determined by a number of rules, each of which contributes a different value to its overall weight. Certain rules assign weights based on how many documents are in the CSF, or how many of its documents rank high on the results list. Other rules assign values based on how different the CSF is from other candidate CSFs, or based on the more exact nature of the metadata values involved.
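A minimal sketch of the folder-weighting idea follows. It implements only the two simplest rules named above (folder size and how many of a folder's documents rank high); the specific weights, the high-rank cutoff, and the function name are hypothetical, and the rules about folder distinctness and metadata exactness are omitted.

```python
from collections import defaultdict

MIN_RESULTS = 25   # folders are built only for result sets larger than this

def custom_search_folders(results, top_n=3, high_rank_cutoff=10):
    """results: list ordered by relevancy, each entry a list of metadata values
    (for example, taxonomy subjects) for one document. Returns folder names."""
    if len(results) <= MIN_RESULTS:
        return []
    weights = defaultdict(float)
    for rank, values in enumerate(results):
        for value in set(values):
            weights[value] += 1.0            # rule: number of documents in the folder
            if rank < high_rank_cutoff:
                weights[value] += 0.5        # rule: bonus for highly ranked documents
    # Highest weight first; ties are broken alphabetically for a stable order.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]
```

A production system would combine more rules than this, but the shape is the same: derive candidate folders from the metadata on the results list, score each candidate, and show only the top-weighted ones.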
Session Management
The Northern Light service uses a proprietary session state management system to store user state between page requests or other transactions. Each user is given a cookie with a unique token that contains no outwardly useful information. The user's browser transmits the token in the headers of each request, and the Northern Light software uses the token to retrieve session data.
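The opaque-token pattern described above can be sketched as follows. The proprietary system itself is not documented here, so the class name, the in-memory dictionary, and the token length are assumptions; the point is only that the cookie value is random and carries no user information, while all state lives on the server.

```python
import secrets

class SessionStore:
    """Server-side session data keyed by an opaque random token (illustrative only)."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        # The token is random, so it reveals nothing about the user or the session.
        token = secrets.token_urlsafe(32)
        self._sessions[token] = {}
        return token

    def get(self, token):
        """Return the session data for a token sent in a request header, or None."""
        return self._sessions.get(token)
```

The token returned by create() would be set as the cookie value; on each later request the server calls get() with the token it receives and reads or updates the returned dictionary.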
Security
Northern Light can invoke a variety of security solutions at the option of the customer. If a username/password scheme is desired, we provide an administrative user interface for managing the passwords. IP validation can be used in lieu of username/password security, or in addition to it. Secure HTTPS connections are customarily used, with Verisign certificates ensuring the validity of connections, or leased T1 lines can be used for extreme security. In our 7-year history, Northern Light has never experienced a single hacker intrusion into customer applications or our network.
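IP validation of the kind mentioned above is commonly implemented as a check of the client address against a list of customer networks. The sketch below uses Python's standard ipaddress module; the function name and the idea of storing the allowed networks in CIDR notation are assumptions for illustration, not a description of Northern Light's mechanism.

```python
import ipaddress

def ip_allowed(client_ip, allowed_networks):
    """Return True if client_ip falls inside any of the customer's allowed
    networks, given in CIDR notation such as "192.0.2.0/24"."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)
```

Such a check can stand alone or run before a username/password prompt, matching the "in lieu of, or in addition to" options described above.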
Alerts
Northern Light offers users and enterprise customers the ability to save any search and have it run automatically whenever the database referenced by the search is updated. At that time, an e-mail is sent to the registered owner of the alert if and only if there are new documents in the database that meet the search criteria; the e-mail message contains a link to just these new results. The system further keeps track of when a user has actually accessed these new results so that, if a user receives a string of alert e-mails before being able to see any of them, accessing any one of them will provide all the new results since the user's last access; it is unnecessary to cycle through the alert messages one at a time in order to see all new results.
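The alert behavior described above depends on tracking two separate positions: what has already been e-mailed and what the user has actually seen. The sketch below models this with document sequence numbers; the class, its method names, and the use of sequence numbers are assumptions made for the example.

```python
class SavedSearchAlert:
    """Illustrative model of a saved-search alert (not the production design)."""

    def __init__(self, matches):
        self.matches = matches     # callable(text) -> bool, standing in for the saved search
        self.last_notified = 0     # highest sequence number already e-mailed
        self.last_accessed = 0     # highest sequence number the user has seen

    def on_database_update(self, docs):
        """docs maps sequence number -> text. Returns the new matching sequence
        numbers to link from an e-mail, or None when no e-mail should be sent."""
        new = sorted(s for s, text in docs.items()
                     if s > self.last_notified and self.matches(text))
        if not new:
            return None
        self.last_notified = new[-1]
        return new

    def on_access(self, docs):
        """Following any alert link returns every match since the last access."""
        unseen = sorted(s for s, text in docs.items()
                        if s > self.last_accessed and self.matches(text))
        if unseen:
            self.last_accessed = unseen[-1]
        return unseen
```

If two updates each trigger an e-mail before the user looks at either, a single call to on_access() returns the matches from both updates, which is the "no need to cycle through the messages" behavior described above.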
Applications Development and Hosting

Northern Light offers complete services for producing, or letting customers produce, custom search applications to be run either in ASP mode, in-house at a customer site, or some combination of the two. These can include documented APIs for searching, customized results lists, alerts and other capabilities (usually through XML interfaces). Northern Light also has a dedicated applications group, efficient tools for rapid development of user interfaces and, behind it all, a 7x24 secure operations facility.
Northern Light Technology Awards

Top 100, eContent magazine, December 2001
Best of the Web, US News and World Report, October 2001
Top 100 Companies That Matter, KMWorld magazine, September 2001
Editors' Choice, PC Magazine, November 2000
Best of the Web, Forbes magazine, September 2000
Web Business Award for Online Excellence, CIO magazine, July 2000
Best Online Business/Professional Service, Software & Information Industry Association, March 2000
Best Online Research Product, Software & Information Industry Association, March 2000
Best Online Information Service, Software & Information Industry Association, March 2000
Editors' Choice, PC Magazine, September 1999, 1998, 1997