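By default, Lucene sorts query results by score, with higher scores first. The score calculation is fairly complex, but two factors that affect it are easy to demonstrate:
- The longer a Field's value (the more terms it contains), the lower the score.
- A Boost can be used to raise the score of a specific Document or Field.
The tests below use a JUnit TestCase together with RAMDirectory.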
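import java.io.IOException;

import junit.framework.Assert;
import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BoostNormsTestCase extends TestCase {
    private static final Version VERSION = Version.LUCENE_36;
    private static final String F_TITLE = "title";
    private static final String F_BODY = "body";

    // In-memory index; JUnit creates a fresh instance for every test method.
    private Directory directory = new RAMDirectory();
    private IndexWriter writer;
    private IndexReader reader;
    private IndexSearcher searcher;

    // Builds a Document with a single "title" field and a document-level boost.
    private Document createDocument(Index indexType, String value, float boost) {
        Document doc = new Document();
        doc.add(new Field(BoostNormsTestCase.F_TITLE, value, Field.Store.YES,
                indexType));
        doc.setBoost(boost);
        return doc;
    }

    // Builds a stored Field with a field-level boost.
    private Field createField(Index indexType, String name, String value,
            float boost) {
        Field field = new Field(name, value, Field.Store.YES, indexType);
        field.setBoost(boost);
        return field;
    }

    // OpenMode.CREATE starts from an empty index each time.
    private IndexWriter createWriter() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        IndexWriterConfig config = new IndexWriterConfig(
                BoostNormsTestCase.VERSION, new StandardAnalyzer(
                        BoostNormsTestCase.VERSION));
        config.setOpenMode(OpenMode.CREATE);
        return new IndexWriter(this.directory, config);
    }

    private IndexSearcher createSearcher() throws CorruptIndexException,
            IOException {
        this.reader = IndexReader.open(this.directory);
        return new IndexSearcher(this.reader);
    }

    private void closeWriter() {
        if (this.writer != null) {
            try {
                this.writer.close();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Close the searcher first, then the reader it was built on.
    private void closeSearcher() {
        if (this.searcher != null) {
            try {
                this.searcher.close();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (this.reader != null) {
            try {
                this.reader.close();
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // The test methods shown in the following sections go here.
}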
Field length affects the score
public void testFieldLengthBoost() {
    System.out.println("testFieldLengthBoost...");
    try {
        // index: three titles of different lengths, all with boost 1
        this.writer = this.createWriter();
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Lucene in action", 1F));
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Spring in action, Manning", 1F));
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Hibernate in action 2e, Manning Publication", 1F));
        this.closeWriter();
        // search: "action" appears in every title
        this.searcher = this.createSearcher();
        TopDocs results = this.searcher.search(new TermQuery(new Term(
                BoostNormsTestCase.F_TITLE, "action")), 10);
        // output: title and score, highest score first
        Document doc;
        for (ScoreDoc sdoc : results.scoreDocs) {
            doc = this.searcher.doc(sdoc.doc);
            System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
                    + sdoc.score);
        }
    }
    catch (IOException e) {
        Assert.fail(e.getMessage());
    }
    finally {
        this.closeWriter();
        this.closeSearcher();
    }
}
Three Documents of different lengths are indexed and then queried with a keyword they all share, giving the following results.
// Same length
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// Different lengths
Lucene in action - 0.4451987
Spring in action, Manning - 0.35615897
Hibernate in action 2e, Manning Publication - 0.3116391
// Same length; the more often the keyword occurs, the higher the score
Action in action - 0.5254995
Lucene in action - 0.37158427
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action, Manning - 0.71231794
Hibernate in action 2e, Manning Publication - 0.71231794
The longer the field, the lower the score, and vice versa. The number of times the keyword occurs also affects the score.
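The numbers above can be reproduced by hand. The following is a minimal sketch, assuming the Lucene 3.x default similarity, where the score of a single-term query reduces to tf × idf × norm, with tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)) and norm = 1 / sqrt(number of terms in the field). The class `ScoreSketch` and its method names are made up for this illustration, and `quantize` is only a simplified stand-in for the precision lost when Lucene stores the norm in a single byte.

```java
public class ScoreSketch {
    // Keep only the sign, the exponent and the top two mantissa bits.
    // Assumption: this imitates the rounding of Lucene's one-byte norm
    // for mid-range values; it is not Lucene's actual encoder.
    static float quantize(float f) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(f) & 0xFFE00000);
    }

    // idf = 1 + ln(numDocs / (docFreq + 1))
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Score of a single-term query: tf * idf * norm.
    // With norms disabled the norm factor is simply 1.
    static float score(int freq, int numTerms, int docFreq, int numDocs,
            boolean norms) {
        float tf = (float) Math.sqrt(freq);
        float norm = norms
                ? quantize((float) (1.0 / Math.sqrt(numTerms)))
                : 1.0f;
        return tf * idf(docFreq, numDocs) * norm;
    }

    public static void main(String[] args) {
        // "in" is a stop word for StandardAnalyzer, so
        // "Lucene in action" is indexed as 2 terms.
        System.out.println(score(1, 2, 3, 3, true));   // about 0.445
        System.out.println(score(1, 3, 3, 3, true));   // about 0.356
        System.out.println(score(1, 5, 3, 3, true));   // about 0.312
        // "Action in action": keyword occurs twice, index holds 2 documents.
        System.out.println(score(2, 2, 2, 2, true));   // about 0.525
        // ANALYZED_NO_NORMS: field length no longer matters.
        System.out.println(score(1, 5, 3, 3, false));  // about 0.712
    }
}
```

Because the stored norm is so coarse, fields whose lengths are close to each other can end up with exactly the same score.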
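Field.Index has five options:
- NO - not indexed.
- ANALYZED - tokenized, then indexed.
- NOT_ANALYZED - indexed without tokenizing.
- ANALYZED_NO_NORMS - tokenized, then indexed, with norms disabled.
- NOT_ANALYZED_NO_NORMS - indexed without tokenizing, with norms disabled.
A field that is not tokenized can only be found by an exact match. For example, if "Lucene in action" is indexed without tokenizing, only "Lucene in action" finds it; neither "Lucene" nor "action" does.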
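The difference between a tokenized and a non-tokenized field can be pictured with plain Java sets. This is only a rough simulation and does not call Lucene: the class `AnalyzedVsNotAnalyzed`, the tiny stop-word list and the splitting rule are assumptions made for this sketch, whereas a real analyzer has its own rules.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AnalyzedVsNotAnalyzed {
    // Hypothetical stop-word list, just for this sketch.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("in", "the", "a"));

    // Roughly what an ANALYZED field does: lowercase the value, split it
    // on non-alphanumeric characters and drop stop words.
    static Set<String> analyzed(String value) {
        Set<String> terms = new HashSet<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A NOT_ANALYZED field is indexed as one term: the whole value, unchanged.
    static Set<String> notAnalyzed(String value) {
        return Collections.singleton(value);
    }

    public static void main(String[] args) {
        String title = "Lucene in action";
        // A term query for "action" matches the tokenized field...
        System.out.println(analyzed(title).contains("action"));              // true
        // ...but not the non-tokenized one,
        System.out.println(notAnalyzed(title).contains("action"));           // false
        // which only matches the complete original value.
        System.out.println(notAnalyzed(title).contains("Lucene in action")); // true
    }
}
```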
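Disabling norms only affects how results are ranked (boost values and field length are no longer taken into account); it does not change which documents match. That is why all the scores in the test above became identical after switching to Field.Index.ANALYZED_NO_NORMS.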
Document Boost
public void testBoostDocument() {
    System.out.println("testBoostDocument...");
    try {
        // index: same title length, different document boosts
        this.writer = this.createWriter();
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Lucene in action", 0.5F));
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Spring in action", 1.5F));
        this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                "Hibernate in action", 1F));
        this.closeWriter();
        // search
        this.searcher = this.createSearcher();
        TopDocs results = this.searcher.search(new TermQuery(new Term(
                BoostNormsTestCase.F_TITLE, "action")), 10);
        // output: title and score, highest score first
        Document doc;
        for (ScoreDoc sdoc : results.scoreDocs) {
            doc = this.searcher.doc(sdoc.doc);
            System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
                    + sdoc.score);
        }
    }
    catch (IOException e) {
        Assert.fail(e.getMessage());
    }
    finally {
        this.closeWriter();
        this.closeSearcher();
    }
}
Setting different boost values on the Documents gives the following results.
// Same boost value (default is 1)
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// Different boost values
Spring in action - 0.71231794
Hibernate in action - 0.4451987
Lucene in action - 0.22259936
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794
The higher the boost, the higher the score, and vice versa.
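The boosted scores are not simply 0.5 or 1.5 times the unboosted ones. A likely explanation, assuming the Lucene 3.x default similarity, is that the boost is multiplied into the length norm at index time and the product is then stored in a single byte, which loses precision. The sketch below shows only that norm value; `BoostNormSketch` and its method names are hypothetical, and `quantize` is a simplified imitation of the one-byte encoding rather than Lucene's own code.

```java
public class BoostNormSketch {
    // Keep only the sign, the exponent and the top two mantissa bits.
    // Assumption: imitates the rounding of the one-byte norm for
    // mid-range values; it is not Lucene's actual encoder.
    static float quantize(float f) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(f) & 0xFFE00000);
    }

    // Norm as stored in the index: document boost * field boost
    // * 1/sqrt(number of terms), then reduced to the coarse precision.
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return quantize((float) (docBoost * fieldBoost / Math.sqrt(numTerms)));
    }

    public static void main(String[] args) {
        // Each title is indexed as 2 terms ("in" is a stop word).
        System.out.println(norm(0.5f, 1f, 2)); // 0.3125
        System.out.println(norm(1.0f, 1f, 2)); // 0.625
        System.out.println(norm(1.5f, 1f, 2)); // 1.0
    }
}
```

Multiplying these norms by the idf of 0.71231794 seen in the ANALYZED_NO_NORMS output gives roughly 0.223, 0.445 and 0.712, which matches the scores listed above. A field boost enters the same product, so it is subject to the same rounding.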
Field Boost
public void testBoostField() {
    System.out.println("testBoostField...");
    try {
        // index: title fields keep boost 1, body fields get different boosts
        this.writer = this.createWriter();
        Document doc = new Document();
        doc.add(this.createField(Field.Index.ANALYZED,
                BoostNormsTestCase.F_TITLE, "Lucene in action", 1F));
        doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
                "Lucene in action, Manning", 0.5F));
        this.writer.addDocument(doc);
        doc = new Document();
        doc.add(this.createField(Field.Index.ANALYZED,
                BoostNormsTestCase.F_TITLE, "Spring in action", 1F));
        doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
                "Spring in action, Manning", 1F));
        this.writer.addDocument(doc);
        doc = new Document();
        doc.add(this.createField(Field.Index.ANALYZED,
                BoostNormsTestCase.F_TITLE, "Hibernate in action", 1F));
        doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
                "Hibernate in action, Manning", 1.5F));
        this.writer.addDocument(doc);
        this.closeWriter();
        // search: query the boosted body field
        this.searcher = this.createSearcher();
        TopDocs results = this.searcher.search(new TermQuery(new Term(
                BoostNormsTestCase.F_BODY, "manning")), 10);
        // output: title and score, highest score first
        for (ScoreDoc sdoc : results.scoreDocs) {
            doc = this.searcher.doc(sdoc.doc);
            System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
                    + sdoc.score);
        }
    }
    catch (IOException e) {
        Assert.fail(e.getMessage());
    }
    finally {
        this.closeWriter();
        this.closeSearcher();
    }
}
Setting different boost values on the Fields gives the following results.
// Same boost value (default is 1)
Lucene in action - 0.35615897
Spring in action - 0.35615897
Hibernate in action - 0.35615897
// Different boost values
Hibernate in action - 0.53423846
Spring in action - 0.35615897
Lucene in action - 0.17807949
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794
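The higher the boost, the higher the score, and vice versa.
There is only one reason to disable norms: searches then use somewhat less memory, and how much is saved depends on the size of the index. Once norms have been used, disabling them requires rebuilding the whole index; otherwise, as long as even one field was indexed with norms, searches will keep using norms.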
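To get a feel for the memory involved, here is a back-of-the-envelope sketch. It assumes that Lucene 3.x holds one byte per document for each field indexed with norms while searching; the class name and the document and field counts are made up for the example.

```java
public class NormsMemorySketch {
    // Assumption: one byte per document for each field that has norms.
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10 million documents, 5 fields with norms.
        long bytes = normsBytes(10000000L, 5);
        System.out.println(bytes + " bytes"); // 50000000 bytes
    }
}
```

For a small index the saving is negligible; it only starts to matter with many documents or many indexed fields.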