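By default, Lucene sorts search results by score, with the highest score first. The score calculation is fairly complex, but two index-time factors that influence it are:
- The longer a Field's value (string), the lower its score.
- A boost can be used to raise the score of a specific Document or Field.
The tests below use a TestCase together with a RAMDirectory.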
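    import java.io.IOException;

    import junit.framework.Assert;
    import junit.framework.TestCase;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class BoostNormsTestCase extends TestCase {

        private static final Version VERSION = Version.LUCENE_36;
        private static final String F_TITLE = "title";
        private static final String F_BODY = "body";

        // In-memory index, so nothing is written to disk
        private Directory directory = new RAMDirectory();
        private IndexWriter writer;
        private IndexReader reader;
        private IndexSearcher searcher;

        // Builds a Document with a single title field and a document-level boost
        private Document createDocument(Index indexType, String value, float boost) {
            Document doc = new Document();
            doc.add(new Field(BoostNormsTestCase.F_TITLE, value, Field.Store.YES, indexType));
            doc.setBoost(boost);
            return doc;
        }

        // Builds a stored Field with a field-level boost
        private Field createField(Index indexType, String name, String value, float boost) {
            Field field = new Field(name, value, Field.Store.YES, indexType);
            field.setBoost(boost);
            return field;
        }

        // OpenMode.CREATE starts from an empty index on every test
        private IndexWriter createWriter() throws CorruptIndexException,
                LockObtainFailedException, IOException {
            IndexWriterConfig config = new IndexWriterConfig(
                    BoostNormsTestCase.VERSION,
                    new StandardAnalyzer(BoostNormsTestCase.VERSION));
            config.setOpenMode(OpenMode.CREATE);
            return new IndexWriter(this.directory, config);
        }

        private IndexSearcher createSearcher() throws CorruptIndexException, IOException {
            this.reader = IndexReader.open(this.directory);
            return new IndexSearcher(this.reader);
        }

        private void closeWriter() {
            if (this.writer != null) {
                try {
                    this.writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        private void closeSearcher() {
            if (this.reader != null) {
                try {
                    this.reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (this.searcher != null) {
                try {
                    this.searcher.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        // The test methods in the following sections go inside this class.
    }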
Field length affects the score
    public void testFieldLengthBoost() {
        System.out.println("testFieldLengthBoost...");
        try {
            // index: three titles of different lengths, all with the default boost
            this.writer = this.createWriter();
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Lucene in action", 1F));
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Spring in action, Manning", 1F));
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Hibernate in action 2e, Manning Publication", 1F));
            this.closeWriter();

            // search: a term that all three titles share
            this.searcher = this.createSearcher();
            TopDocs results = this.searcher.search(new TermQuery(new Term(
                    BoostNormsTestCase.F_TITLE, "action")), 10);

            // output: title and score of each hit
            Document doc;
            for (ScoreDoc sdoc : results.scoreDocs) {
                doc = this.searcher.doc(sdoc.doc);
                System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - " + sdoc.score);
            }
        } catch (IOException e) {
            Assert.fail(e.getMessage());
        } finally {
            this.closeWriter();
            this.closeSearcher();
        }
    }

The test indexes three Documents of different lengths and then searches for a keyword they all share, which gives the following results.
    // Same length
    Lucene in action - 0.4451987
    Spring in action - 0.4451987
    Hibernate in action - 0.4451987

    // Different lengths
    Lucene in action - 0.4451987
    Spring in action, Manning - 0.35615897
    Hibernate in action 2e, Manning Publication - 0.3116391

    // Same length; the more often the keyword occurs, the higher the score
    Action in action - 0.5254995
    Lucene in action - 0.37158427

    // Switched to Field.Index.ANALYZED_NO_NORMS
    Lucene in action - 0.71231794
    Spring in action, Manning - 0.71231794
    Hibernate in action 2e, Manning Publication - 0.71231794

The longer the field, the lower the score, and the shorter the field, the higher the score. The number of times the keyword occurs also affects the score.
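The numbers above can be roughly reproduced by hand. The sketch below is plain Java, not a call into Lucene, and it rests on several assumptions: a single-term query scores as sqrt(term frequency) × idf × norm, idf is 1 + ln(numDocs / (docFreq + 1)), the length norm is 1 / sqrt(number of indexed terms), the norm is rounded down to one byte of precision, and StandardAnalyzer drops "in" as a stop word. The class and method names are hypothetical.

```java
public class LengthNormSketch {

    // Round a norm down to the one-byte precision Lucene 3.x is understood to
    // store (about four steps per power of two). Re-implemented here as an
    // assumption; this is not Lucene's own code.
    static float quantize(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;            // drop the low 21 bits of the float
        int zero = (63 - 15) << 3;         // offset of the smallest stored value
        if (small <= zero) {
            // zero, negative or too small: 0 or the smallest positive value
            return bits <= 0 ? 0f : Float.intBitsToFloat((zero + 1) << 21);
        }
        if (small >= zero + 0x100) {
            small = zero + 0xFF;           // clamp to the largest stored value
        }
        return Float.intBitsToFloat(small << 21);
    }

    // Length norm: boost / sqrt(number of indexed terms), then quantized
    static float norm(int numTerms, float boost) {
        return quantize(boost * (float) (1.0 / Math.sqrt(numTerms)));
    }

    // Inverse document frequency: rarer terms weigh more
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Score of a single-term query against one document
    static float score(int termFreq, int docFreq, int numDocs, float norm) {
        return (float) Math.sqrt(termFreq) * idf(docFreq, numDocs) * norm;
    }

    public static void main(String[] args) {
        // Three documents, all containing "action" once
        System.out.println("2 terms: " + score(1, 3, 3, norm(2, 1f)));
        System.out.println("3 terms: " + score(1, 3, 3, norm(3, 1f)));
        System.out.println("5 terms: " + score(1, 3, 3, norm(5, 1f)));
        // Norms disabled: the norm is treated as 1, only idf remains
        System.out.println("no norms: " + score(1, 3, 3, 1f));
        // Two documents; "Action in action" contains the term twice
        System.out.println("tf = 2: " + score(2, 2, 2, norm(2, 1f)));
    }
}
```

Under these assumptions the two-, three- and five-term titles come out at about 0.445, 0.356 and 0.312, and the no-norms case at about 0.712, which agrees with the test output.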
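Field.Index has five options:
- NO - the field is not indexed.
- ANALYZED - the value is tokenized and then indexed.
- NOT_ANALYZED - the value is indexed as a single term, without tokenizing.
- ANALYZED_NO_NORMS - the value is tokenized and indexed, but norms are disabled.
- NOT_ANALYZED_NO_NORMS - the value is indexed as a single term, and norms are disabled.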
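A field that is not tokenized can only be found by an exact match. For example, if "Lucene in action" is indexed without tokenizing, only the query "Lucene in action" finds it; "Lucene" or "action" alone does not.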
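To make the difference concrete, here is a small plain-Java sketch of which terms end up in the index in each case. It does not use Lucene: `analyzedTerms` is a hypothetical, much simplified stand-in for StandardAnalyzer (lower-casing, splitting on non-alphanumerics, and a tiny stop-word list), and `notAnalyzedTerms` simply keeps the whole value as one term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzedVsNotAnalyzed {

    // Tiny illustrative stop-word list; a real analyzer's list is longer
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("in", "a", "the", "of"));

    // Simplified stand-in for ANALYZED: one term per remaining token
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Stand-in for NOT_ANALYZED: the whole value is a single term
    static List<String> notAnalyzedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Lucene in action";
        System.out.println(analyzedTerms(title));                        // [lucene, action]
        System.out.println(notAnalyzedTerms(title));                     // [Lucene in action]
        System.out.println(analyzedTerms(title).contains("action"));     // true
        System.out.println(notAnalyzedTerms(title).contains("action"));  // false
    }
}
```

A term query can only match a term that is actually in the index, which is why the untokenized field needs the full original string, including its original casing.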
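Disabling norms only changes how results are ranked (index-time boosts and field length are no longer taken into account); it does not change which documents match. That is why every score in the test above became identical after switching to Field.Index.ANALYZED_NO_NORMS.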
Document Boost
    public void testBoostDocument() {
        System.out.println("testBoostDocument...");
        try {
            // index: same field length, different document boosts
            this.writer = this.createWriter();
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Lucene in action", 0.5F));
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Spring in action", 1.5F));
            this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
                    "Hibernate in action", 1F));
            this.closeWriter();

            // search
            this.searcher = this.createSearcher();
            TopDocs results = this.searcher.search(new TermQuery(new Term(
                    BoostNormsTestCase.F_TITLE, "action")), 10);

            // output: title and score of each hit
            Document doc;
            for (ScoreDoc sdoc : results.scoreDocs) {
                doc = this.searcher.doc(sdoc.doc);
                System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - " + sdoc.score);
            }
        } catch (IOException e) {
            Assert.fail(e.getMessage());
        } finally {
            this.closeWriter();
            this.closeSearcher();
        }
    }

Setting a different boost on each Document gives the following results.
    // Same boost (the default is 1)
    Lucene in action - 0.4451987
    Spring in action - 0.4451987
    Hibernate in action - 0.4451987

    // Different boosts
    Spring in action - 0.71231794
    Hibernate in action - 0.4451987
    Lucene in action - 0.22259936

    // Switched to Field.Index.ANALYZED_NO_NORMS
    Lucene in action - 0.71231794
    Spring in action - 0.71231794
    Hibernate in action - 0.71231794

The higher the boost, the higher the score, and vice versa.
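The scores do not scale exactly with the boost (1.5 versus 1 gives 0.712 versus 0.445, not a factor of 1.5). One plausible explanation, sketched below in plain Java, is that the document boost, the field boost and the length norm are multiplied into a single norm that is then stored with only one byte of precision. The rounding helper is a re-implementation under that assumption, not Lucene's own code, and the class and method names are hypothetical.

```java
public class BoostNormSketch {

    // Round down to the assumed one-byte norm precision
    // (about four steps per power of two).
    static float quantize(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;            // drop the low 21 bits of the float
        int zero = (63 - 15) << 3;         // offset of the smallest stored value
        if (small <= zero) {
            return bits <= 0 ? 0f : Float.intBitsToFloat((zero + 1) << 21);
        }
        if (small >= zero + 0x100) {
            small = zero + 0xFF;           // clamp to the largest stored value
        }
        return Float.intBitsToFloat(small << 21);
    }

    // Assumed norm: docBoost * fieldBoost / sqrt(number of indexed terms)
    static float norm(int numTerms, float docBoost, float fieldBoost) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return quantize(docBoost * fieldBoost * lengthNorm);
    }

    public static void main(String[] args) {
        // Two indexed terms ("in" assumed removed as a stop word)
        System.out.println(norm(2, 0.5f, 1f));   // boost 0.5
        System.out.println(norm(2, 1.0f, 1f));   // default boost
        System.out.println(norm(2, 1.5f, 1f));   // boost 1.5
        // A small boost can vanish in the rounding
        System.out.println(norm(2, 1.05f, 1f) == norm(2, 1.0f, 1f));
    }
}
```

Multiplying these norms by the idf of about 0.712 gives values close to the scores listed above. The last line illustrates a practical consequence of the assumed rounding: boosts that differ only slightly may end up stored as the same norm.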
Field Boost
    public void testBoostField() {
        System.out.println("testBoostField...");
        try {
            // index: body fields of equal length, each with a different field boost
            this.writer = this.createWriter();

            Document doc = new Document();
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_TITLE, "Lucene in action", 1F));
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_BODY, "Lucene in action, Manning", 0.5F));
            this.writer.addDocument(doc);

            doc = new Document();
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_TITLE, "Spring in action", 1F));
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_BODY, "Spring in action, Manning", 1F));
            this.writer.addDocument(doc);

            doc = new Document();
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_TITLE, "Hibernate in action", 1F));
            doc.add(this.createField(Field.Index.ANALYZED,
                    BoostNormsTestCase.F_BODY, "Hibernate in action, Manning", 1.5F));
            this.writer.addDocument(doc);

            this.closeWriter();

            // search the boosted body field
            this.searcher = this.createSearcher();
            TopDocs results = this.searcher.search(new TermQuery(new Term(
                    BoostNormsTestCase.F_BODY, "manning")), 10);

            // output: title and score of each hit
            for (ScoreDoc sdoc : results.scoreDocs) {
                doc = this.searcher.doc(sdoc.doc);
                System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - " + sdoc.score);
            }
        } catch (IOException e) {
            Assert.fail(e.getMessage());
        } finally {
            this.closeWriter();
            this.closeSearcher();
        }
    }

Setting a different boost on each Field gives the following results.
    // Same boost (the default is 1)
    Lucene in action - 0.35615897
    Spring in action - 0.35615897
    Hibernate in action - 0.35615897

    // Different boosts
    Hibernate in action - 0.53423846
    Spring in action - 0.35615897
    Lucene in action - 0.17807949

    // Switched to Field.Index.ANALYZED_NO_NORMS
    Lucene in action - 0.71231794
    Spring in action - 0.71231794
    Hibernate in action - 0.71231794

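The higher the boost, the higher the score, and vice versa.

There is only one reason to disable norms: searching uses less memory. How much memory norms take depends on the size of the index. If a field has already been indexed with norms and you later want to disable them, the whole index has to be rebuilt; otherwise, as long as norms are still in use for even one field, the whole query will still use norms.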
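For a rough sense of the memory involved, the sketch below assumes that norms cost one byte per document for every indexed field that keeps them, which is how Lucene 3.x is generally described. The class and method names are hypothetical and the figures are illustrative, not measurements.

```java
public class NormsMemorySketch {

    // Assumption: one byte per document for each field that keeps norms
    static long normsBytes(int maxDoc, int fieldsWithNorms) {
        return (long) maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // 10 million documents, 5 fields with norms
        long bytes = normsBytes(10000000, 5);
        System.out.println(bytes + " bytes"); // 50000000 bytes
    }
}
```

On a small index the saving is negligible, so disabling norms mostly pays off for large indexes with many indexed fields whose ranking does not depend on boosts or field length.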