Java Artisan / Neil Chan: Boost and Norms in Lucene 3.6.0

This note grew out of two unfamiliar terms: "Norms" and "Boost".

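By default Lucene sorts query results by score, highest first. The score calculation is fairly complex, but two of the factors that influence it are:

The longer a Field's value (string), the lower the score.
A Boost can be used to raise the score of a particular Document or Field.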

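Together these two factors are known as "Norms": the Field and Document Boost values and the field length (its number of terms), all recorded when the index is built, affect the score.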
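To make the definition concrete, here is a minimal sketch of how a norm could be computed. It assumes the default behaviour as I understand it: the norm is the Document Boost times the Field Boost times 1/sqrt(number of terms), and it is then stored at very low precision. The `quantize` helper is my own approximation of that storage step (keep the sign, the exponent and the top two mantissa bits), not Lucene's actual code, and the class and method names are hypothetical.

```java
public class NormSketch {

  /** Length factor: shrinks as the field gets more terms. */
  public static float lengthNorm(int numTerms) {
    return (float) (1.0 / Math.sqrt(numTerms));
  }

  /**
   * Assumed model of the low-precision storage: keep the float's sign,
   * exponent and top two mantissa bits, and drop everything else.
   */
  public static float quantize(float value) {
    return Float.intBitsToFloat(Float.floatToRawIntBits(value) & 0xFFE00000);
  }

  /** Norm as it would be read back at search time under these assumptions. */
  public static float storedNorm(float docBoost, float fieldBoost, int numTerms) {
    return quantize(docBoost * fieldBoost * lengthNorm(numTerms));
  }

  public static void main(String[] args) {
    // Compare the exact length factor with the value after precision loss.
    for (int n : new int[] {2, 3, 5}) {
      System.out.println(n + " terms: raw=" + lengthNorm(n)
          + " stored=" + storedNorm(1F, 1F, n));
    }
  }
}
```

Under this model a longer field always gets a norm that is no larger than a shorter one's, and a Boost simply scales the value before it is rounded down.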

The tests use a TestCase together with a RAMDirectory. The test methods shown in the sections below belong inside this class.

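import java.io.IOException;

import junit.framework.Assert;
import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BoostNormsTestCase extends TestCase {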

  private static final Version VERSION = Version.LUCENE_36;
  private static final String F_TITLE = "title";
  private static final String F_BODY = "body";
  private Directory directory = new RAMDirectory();
  private IndexWriter writer;
  private IndexReader reader;
  private IndexSearcher searcher;

  private Document createDocument(Index indexType, String value, float boost) {
    Document doc = new Document();
    doc.add(new Field(BoostNormsTestCase.F_TITLE, value, Field.Store.YES,
        indexType));
    doc.setBoost(boost);
    return doc;
  }

  private Field createField(Index indexType, String name, String value,
      float boost) {
    Field field = new Field(name, value, Field.Store.YES, indexType);
    field.setBoost(boost);
    return field;
  }

  private IndexWriter createWriter() throws CorruptIndexException,
      LockObtainFailedException, IOException {
    IndexWriterConfig config = new IndexWriterConfig(
        BoostNormsTestCase.VERSION, new StandardAnalyzer(
            BoostNormsTestCase.VERSION));
    config.setOpenMode(OpenMode.CREATE);
    return new IndexWriter(this.directory, config);
  }

  private IndexSearcher createSearcher() throws CorruptIndexException,
      IOException {
    this.reader = IndexReader.open(this.directory);
    return new IndexSearcher(this.reader);
  }

  private void closeWriter() {
    if (this.writer != null) {
      try {
        this.writer.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

  private void closeSearcher() {
    if (this.reader != null) {
      try {
        this.reader.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
    if (this.searcher != null) {
      try {
        this.searcher.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}

Field length affects the score

  public void testFieldLengthBoost() {
    System.out.println("testFieldLengthBoost...");
    try {

      // index
      this.writer = this.createWriter();
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Lucene in action", 1F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Spring in action, Manning", 1F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Hibernate in action 2e, Manning Publication", 1F));
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_TITLE, "action")), 10);

      // output
      Document doc;
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }

Create three Documents of different lengths, then query with a keyword they all share. The results are as follows.

// Same length
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// Different lengths
Lucene in action - 0.4451987
Spring in action, Manning - 0.35615897
Hibernate in action 2e, Manning Publication - 0.3116391
// Same length; the more often the keyword occurs, the higher the score
Action in action - 0.5254995
Lucene in action - 0.37158427
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action, Manning - 0.71231794
Hibernate in action 2e, Manning Publication - 0.71231794

The longer the field, the lower the score, and vice versa. The number of times the keyword occurs also affects the score.
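The scores listed above can be approximated by hand. The sketch below assumes that, for a single-term query, the score reduces to tf × idf × norm, with tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)); that is my reading of the default scoring, not something the test itself proves. The norm literals are also assumptions: 0.625, 0.5 and 0.4375 are what 1/sqrt(2), 1/sqrt(3) and 1/sqrt(5) become if norms are stored at very low precision, and the term counts assume the analyzer drops "in" as a stop word. The class and method names are hypothetical.

```java
public class TermScoreSketch {

  /** idf as assumed here: 1 + ln(numDocs / (docFreq + 1)). */
  public static float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  /** tf as assumed here: square root of the term's frequency in the field. */
  public static float tf(int freq) {
    return (float) Math.sqrt(freq);
  }

  /** Score of a single-term query against one document. */
  public static float score(int freq, int docFreq, int numDocs, float storedNorm) {
    return tf(freq) * idf(docFreq, numDocs) * storedNorm;
  }

  public static void main(String[] args) {
    // Three documents, all containing "action" once; fields of 2, 3 and 5 terms.
    System.out.println(score(1, 3, 3, 0.625F));
    System.out.println(score(1, 3, 3, 0.5F));
    System.out.println(score(1, 3, 3, 0.4375F));
    // Norms disabled: the norm factor is assumed to be 1.0, leaving only idf.
    System.out.println(score(1, 3, 3, 1F));
    // Two documents of equal length; one contains "action" twice.
    System.out.println(score(2, 2, 2, 0.625F));
    System.out.println(score(1, 2, 2, 0.625F));
  }
}
```

If these assumptions hold, the printed values should land very close to the scores in the listing, which suggests that field length and term frequency are the only things separating those documents.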

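Field.Index has five options:

NO - not indexed.
ANALYZED - tokenized and indexed.
NOT_ANALYZED - indexed without tokenizing.
ANALYZED_NO_NORMS - tokenized and indexed, with Norms disabled.
NOT_ANALYZED_NO_NORMS - indexed without tokenizing, with Norms disabled.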

A field that is not indexed cannot be searched. It can (in fact must) be combined with Field.Store.YES so that its value is stored in the index.

A field that is not tokenized is found only by an exact match. For example, if "Lucene in action" is not tokenized, only "Lucene in action" finds it; neither "Lucene" nor "action" will.
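The difference can be pictured with a toy model in plain Java. This is not Lucene code: the tokenizer, the one-word stop list and all names are hypothetical stand-ins, meant only to show which terms end up in the index and why a term query has to match one of them exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzedVsNotAnalyzed {

  // Tiny stand-in for an analyzer's stop-word list (hypothetical, not Lucene's).
  private static final Set<String> STOP_WORDS =
      new HashSet<String>(Arrays.asList("in"));

  /** ANALYZED (toy model): lower-case, split on non-alphanumerics, drop stop words. */
  public static List<String> analyzedTerms(String value) {
    List<String> terms = new ArrayList<String>();
    for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
        terms.add(token);
      }
    }
    return terms;
  }

  /** NOT_ANALYZED (toy model): the whole value is indexed as a single term. */
  public static List<String> notAnalyzedTerms(String value) {
    return Arrays.asList(value);
  }

  /** A term query matches only if that exact term is in the index. */
  public static boolean matches(List<String> indexedTerms, String queryTerm) {
    return indexedTerms.contains(queryTerm);
  }

  public static void main(String[] args) {
    String title = "Lucene in action";
    System.out.println(matches(analyzedTerms(title), "action"));
    System.out.println(matches(notAnalyzedTerms(title), "action"));
    System.out.println(matches(notAnalyzedTerms(title), "Lucene in action"));
  }
}
```

In this model the tokenized field is indexed as the two terms "lucene" and "action", which is also why the tests in this note query for the lower-case term.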

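Disabling Norms only affects how results are ranked (Boost values and field length are ignored); it does not change how you query. That is why every score in the test above became identical after switching to Field.Index.ANALYZED_NO_NORMS.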

Document Boost

  public void testBoostDocument() {
    System.out.println("testBoostDocument...");
    try {

      // index
      this.writer = this.createWriter();
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Lucene in action", 0.5F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Spring in action", 1.5F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Hibernate in action", 1F));
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_TITLE, "action")), 10);

      // output
      Document doc;
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }

Set a different boost value on each Document; the results are as follows.

// Same Boost value (default is 1)
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// Different Boost values
Spring in action - 0.71231794
Hibernate in action - 0.4451987
Lucene in action - 0.22259936
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794

The higher the Boost value, the higher the score, and vice versa.
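One detail in these numbers is that the 1.5 boost lifted the score by a factor of about 1.6, not 1.5. A small check, sketched below, works backwards from the printed scores. It assumes that with the term occurring once the score is idf × norm, so dividing by the no-norms score (0.71231794) recovers the norm that was actually used, and it compares that with boost / sqrt(2), where 2 is the assumed term count of each title once "in" is dropped. The names are hypothetical.

```java
public class BoostNormCheck {

  /** Norm implied by a printed score, assuming score = idf * norm when tf = 1. */
  public static float impliedNorm(float score, float idf) {
    return score / idf;
  }

  /** Norm before storage: Document Boost times 1/sqrt(numTerms). */
  public static float rawNorm(float docBoost, int numTerms) {
    return (float) (docBoost / Math.sqrt(numTerms));
  }

  public static void main(String[] args) {
    float idf = 0.71231794F;  // score printed when Norms are disabled
    float[] boosts = {1.5F, 1F, 0.5F};
    float[] scores = {0.71231794F, 0.4451987F, 0.22259936F};
    for (int i = 0; i < boosts.length; i++) {
      System.out.println("boost " + boosts[i]
          + ": raw norm " + rawNorm(boosts[i], 2)
          + ", implied norm " + impliedNorm(scores[i], idf));
    }
  }
}
```

The implied norms come out lower than the raw ones, which is consistent with the norm being rounded down when it is stored. The ordering by Boost is preserved, but the score ratios are not guaranteed to equal the Boost ratios.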

Field Boost

  public void testBoostField() {
    System.out.println("testBoostField...");
    try {
      
      // index
      this.writer = this.createWriter();
      Document doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Lucene in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Lucene in action, Manning", 0.5F));
      this.writer.addDocument(doc);
      doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Spring in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Spring in action, Manning", 1F));
      this.writer.addDocument(doc);
      doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Hibernate in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Hibernate in action, Manning", 1.5F));
      this.writer.addDocument(doc);
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_BODY, "manning")), 10);
      
      // output
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }

Set a different boost value on each Field; the results are as follows.

// Same Boost value (default is 1)
Lucene in action - 0.35615897
Spring in action - 0.35615897
Hibernate in action - 0.35615897
// Different Boost values
Hibernate in action - 0.53423846
Spring in action - 0.35615897
Lucene in action - 0.17807949
// Using Field.Index.ANALYZED_NO_NORMS instead
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794

The higher the Boost value, the higher the score, and vice versa.

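There is only one reason to disable Norms: queries need less memory. How much memory is involved depends on how much has been indexed. Once Norms have been used, disabling them means rebuilding the entire index; otherwise, as long as even one field still uses Norms, the whole query will use Norms.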
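For a rough sense of the memory at stake, the sketch below assumes that Norms cost one byte per document for each field that has them. That figure is my understanding of Lucene 3.x rather than something measured here, and the index size in the example is hypothetical.

```java
public class NormMemoryEstimate {

  /** Assumes one byte per document for every field that has Norms. */
  public static long normBytes(long numDocs, int fieldsWithNorms) {
    return numDocs * fieldsWithNorms;
  }

  public static void main(String[] args) {
    // Hypothetical index: 10 million documents, 5 fields indexed with Norms.
    long bytes = normBytes(10000000L, 5);
    System.out.println(bytes + " bytes");
  }
}
```

On a small index the saving is negligible, so disabling Norms mainly pays off when there are many documents or many indexed fields.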
---

2012-07-18