Lucene: an open source text search engine API. High-performance, full-featured, pure Java.
Pengjy@262.net
................page 2 ................ Agenda
Overview
APIs
How does a Search Engine Work
Features
For Chinese characters
................page 3 ................ Overview
An Apache Jakarta Project
High-performance, full-featured, open source text search engine APIs
Easy to use, fast to build your own search engine
................page 4 ................ Overview
Version 1.2 RC4
Applications using Lucene: WebSearch, Jive Forums, RockyNewsgroup.org
................page 5 ................ APIs
org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Tokens. A TokenStream is composed by applying TokenFilters to the output of a Tokenizer. A few simple implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer (which uses JavaCC).
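As a rough illustration (not from the original slides), assuming the classic Lucene 1.x analysis API, walking a TokenStream might look like this; the class name, field name, and sample text are invented:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {                       // hypothetical class name
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            // tokenStream() turns a Reader into an enumeration of Tokens.
            TokenStream stream = analyzer.tokenStream("body", new StringReader("The Quick Brown Fox"));
            Token token;
            // In the classic API, next() returns null once the stream is exhausted.
            while ((token = stream.next()) != null) {
                System.out.println(token.termText());
            }
            stream.close();
        }
    }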
................page 6 ~9................ APIs
org.apache.lucene.document provides a simple Document class. A document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
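For example, building a document might look like this (a fragment, assuming the classic Field factory methods; the field names and file path are invented):

    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Keyword: stored and indexed as a single term, not analyzed.
    doc.add(Field.Keyword("path", "/docs/federalist.txt"));
    // Text with a String value: analyzed, indexed, and stored.
    doc.add(Field.Text("title", "The Federalist Papers"));
    // Text with a Reader value: analyzed and indexed, but not stored.
    doc.add(Field.Text("body", new FileReader("/docs/federalist.txt")));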
org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
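Indexing the document above could then be sketched like this (the index path is only an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // The boolean flag: true creates a new index, false opens an existing one.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    writer.addDocument(doc);
    writer.optimize();   // merge segments for faster searching (optional)
    writer.close();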
org.apache.lucene.queryParser uses JavaCC to implement a QueryParser, which parses user-typed query strings into Query objects.
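For instance (a sketch, assuming the old static parse() method that takes the query string, a default field, and an Analyzer):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // "body" is the default field for unqualified terms; a ParseException is thrown on bad syntax.
    Query query = QueryParser.parse("author:hamilton AND constitution", "body", new StandardAnalyzer());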
org.apache.lucene.search provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into Hits. IndexSearcher implements search over a single IndexReader.
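Running the parsed query might look like this (a sketch, assuming the classic Hits API; the index path and field name are examples):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    IndexSearcher searcher = new IndexSearcher("/tmp/index");
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        // Each hit carries a relevance score and the stored fields of the document.
        System.out.println(hits.score(i) + "  " + hits.doc(i).get("title"));
    }
    searcher.close();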
org.apache.lucene.store defines an abstract class for storing persistent data, the Directory: a collection of named files written by an OutputStream and read by an InputStream. Two implementations are provided: FSDirectory, which uses a file system directory to store files, and RAMDirectory, which implements files as memory-resident data structures.
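A sketch of the two implementations (assuming the classic getDirectory() factory; the path is an example):

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // File-system backed storage; the boolean says whether to create (erase) the directory.
    Directory fsDir = FSDirectory.getDirectory("/tmp/index", false);
    // Memory-resident storage, handy for tests or small transient indices.
    Directory ramDir = new RAMDirectory();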
org.apache.lucene.util contains a few handy data structures, e.g., BitVector and PriorityQueue.
................page 10 ................ How does a Search Engine Work
Create indices
input --> analyzer (tokenize) --> filters --> tokens --> indices
................page 11 ~ 14 ................ How does a Search Engine Work
Store indices
Rather than maintaining a single index, Lucene builds multiple index segments. For each new document indexed, Lucene creates a new index segment. It merges small segments with larger ones -- this keeps the total number of segments small, so searches remain fast.
To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones -- after any active readers have closed them.
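As a hedged sketch, assuming the classic Lucene 1.x IndexWriter where the merge policy is exposed as a public mergeFactor field (the value shown is only an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), false);
    // How many segments accumulate before they are merged into a larger one.
    // Larger values favor indexing throughput; smaller values favor search speed.
    writer.mergeFactor = 10;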
A Lucene index segment consists of several files:
- a dictionary index, containing one entry for each 100 entries in the dictionary
- a dictionary, containing one entry for each unique word
- a postings file, containing an entry for each posting
Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file.
Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.
Incremental indexing
Incremental indexing makes it easy to add documents to an existing index. Lucene supports both incremental and batch indexing.
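In the classic API, the difference comes down to the create flag passed to IndexWriter (a sketch; the path is an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Batch build: create a brand-new index, erasing any existing one at that path.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
    // ... add many documents ...
    writer.close();

    // Later, incremental update: reopen the existing index and append new documents.
    writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), false);
    writer.addDocument(doc);
    writer.close();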
Data sources
Lucene allows developers to deliver documents to the indexer through a String or a java.io.Reader, permitting the data source to be abstracted from the data. However, with this approach, the developer must supply the appropriate readers for the data.
Features
Indexing control
Some search engines can automatically crawl through a directory tree or a Website to find documents to index. Since Lucene operates primarily in incremental mode, it lets the application find and retrieve documents.
File formats
Lucene supports a filter mechanism, which offers a simple approach to indexing word processing documents, SGML documents, and other file formats.
Content tagging
Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which field(s) to search. This permits semantically richer queries like "author contains 'Hamilton' AND body contains 'Constitution'".
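That example query could be built programmatically roughly like this (a sketch using the classic BooleanQuery.add(query, required, prohibited) signature; terms are lower-cased because StandardAnalyzer lower-cases indexed text):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // author contains 'hamilton' AND body contains 'constitution'
    BooleanQuery both = new BooleanQuery();
    both.add(new TermQuery(new Term("author", "hamilton")), true, false);     // required, not prohibited
    both.add(new TermQuery(new Term("body", "constitution")), true, false);   // required, not prohibited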
Stop-word processing
Search engines will not index certain words, called stop words, such as "a", "and", and "the". Lucene handles stop words with the more general Analyzer mechanism, and provides the StopAnalyzer class, which eliminates stop words from the input stream.
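For example (a sketch; the classic StopAnalyzer also accepts a caller-supplied stop-word list):

    import org.apache.lucene.analysis.StopAnalyzer;

    // Built-in English stop-word list.
    StopAnalyzer defaults = new StopAnalyzer();
    // Or drop a custom set of words from the token stream.
    StopAnalyzer custom = new StopAnalyzer(new String[] { "a", "and", "the" });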
Query features
Lucene supports a wide range of query features, including Boolean queries, and queries can:
- return a "relevance" score with each hit
- handle adjacency or proximity queries -- "search followed by engine" or "Knicks near Celtics" (see the sketch below)
- search on single keywords
- search multiple indexes at once and merge the results to give a meaningful relevance score
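A proximity query such as "search followed by engine" can be expressed as a PhraseQuery with slop (a sketch; the field name and slop value are examples):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    PhraseQuery nearby = new PhraseQuery();
    nearby.add(new Term("body", "search"));
    nearby.add(new Term("body", "engine"));
    // Allow up to two intervening positions between the two terms.
    nearby.setSlop(2);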
However, Lucene does not support the valuable "Soundex", or "sounds like," query.
Concurrency
Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
Non-English support
As Lucene preprocesses the input stream through the Analyzer class provided by the developer, it is possible to perform language-specific filtering.
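As a purely illustrative sketch (not from the slides), a per-character analyzer for Chinese text could be written against the classic Analyzer/TokenStream API; the class name and the one-token-per-ideograph strategy are assumptions, chosen because Chinese text has no word delimiters for a whitespace tokenizer to split on:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class PerCharacterAnalyzer extends Analyzer {     // hypothetical
        public TokenStream tokenStream(String fieldName, final Reader reader) {
            return new TokenStream() {
                private int offset = 0;
                public Token next() throws IOException {
                    int c;
                    while ((c = reader.read()) != -1) {
                        int start = offset++;
                        // Emit each CJK ideograph as its own token; skip everything else in this sketch.
                        if (Character.getType((char) c) == Character.OTHER_LETTER) {
                            return new Token(String.valueOf((char) c), start, start + 1);
                        }
                    }
                    return null;   // end of stream
                }
                public void close() throws IOException {
                    reader.close();
                }
            };
        }
    }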
................page 23 ................ For Chinese characters
JavaCC -- the Java Compiler Compiler.
With JavaCC you can build complex compilers for languages such as Java or C++, or write tools that parse Java source code and perform automatic analysis or transformation tasks. Grammars are written in EBNF (Extended Backus-Naur Form).
................page 24 ................ For Chinese characters