2012-07-18

Boost and Norms in Lucene 3.6.0

This note started from two unfamiliar terms: "Norms" and "Boost".

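By default, Lucene sorts query results by score, highest first. The score calculation is fairly involved, but two index-time factors stand out:
  • The longer a Field's value (that is, the more terms it contains), the lower the score.
  • A Boost can raise or lower the score of a specific Document or Field.
Together these two factors are what Lucene calls "Norms": the Field and Document Boost values and the field length recorded when the index is built, all of which affect the score at search time.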
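To make the two factors concrete, here is roughly how they enter the score of a single-term query under Lucene 3.x's default scoring (DefaultSimilarity). This is a simplified sketch: the symbol names are my own shorthand, query boost and coord are taken to be 1, and the norm is only approximate because Lucene stores it in a single byte.

```latex
\text{score}(t, d) \approx
  \underbrace{\sqrt{\text{freq}(t, d)}}_{\text{tf}}
  \cdot
  \underbrace{\left(1 + \ln\frac{\text{numDocs}}{\text{docFreq}(t) + 1}\right)}_{\text{idf}}
  \cdot
  \underbrace{\text{docBoost}(d) \cdot \text{fieldBoost}(f) \cdot \frac{1}{\sqrt{\text{numTerms}(f)}}}_{\text{norm}}
```

Only the last factor is a Norm: it is fixed at index time, which is why changing a Boost or the field length means indexing the Document again.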

The tests below use a JUnit TestCase together with a RAMDirectory.

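import java.io.IOException;

import junit.framework.Assert;
import junit.framework.TestCase;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BoostNormsTestCase extends TestCase {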

  private static final Version VERSION = Version.LUCENE_36;
  private static final String F_TITLE = "title";
  private static final String F_BODY = "body";
  private Directory directory = new RAMDirectory();
  private IndexWriter writer;
  private IndexReader reader;
  private IndexSearcher searcher;

  private Document createDocument(Index indexType, String value, float boost) {
    Document doc = new Document();
    doc.add(new Field(BoostNormsTestCase.F_TITLE, value, Field.Store.YES,
        indexType));
    doc.setBoost(boost);
    return doc;
  }

  private Field createField(Index indexType, String name, String value,
      float boost) {
    Field field = new Field(name, value, Field.Store.YES, indexType);
    field.setBoost(boost);
    return field;
  }

  private IndexWriter createWriter() throws CorruptIndexException,
      LockObtainFailedException, IOException {
    IndexWriterConfig config = new IndexWriterConfig(
        BoostNormsTestCase.VERSION, new StandardAnalyzer(
            BoostNormsTestCase.VERSION));
    config.setOpenMode(OpenMode.CREATE);
    return new IndexWriter(this.directory, config);
  }

  private IndexSearcher createSearcher() throws CorruptIndexException,
      IOException {
    this.reader = IndexReader.open(this.directory);
    return new IndexSearcher(this.reader);
  }

  private void closeWriter() {
    if (this.writer != null) {
      try {
        this.writer.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

  private void closeSearcher() {
    if (this.reader != null) {
      try {
        this.reader.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
    if (this.searcher != null) {
      try {
        this.searcher.close();
      }
      catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

  // The test methods shown in the sections below belong inside this class.
}

Field length affects the score
  public void testFieldLengthBoost() {
    System.out.println("testFieldLengthBoost...");
    try {

      // index
      this.writer = this.createWriter();
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Lucene in action", 1F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Spring in action, Manning", 1F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Hibernate in action 2e, Manning Publication", 1F));
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_TITLE, "action")), 10);

      // output
      Document doc;
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }
Index three Documents of different lengths, then search for a keyword they all share. The results:
// same length
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// different lengths
Lucene in action - 0.4451987
Spring in action, Manning - 0.35615897
Hibernate in action 2e, Manning Publication - 0.3116391
// same length; the more often the keyword occurs, the higher the score
Action in action - 0.5254995
Lucene in action - 0.37158427
// switching to Field.Index.ANALYZED_NO_NORMS
Lucene in action - 0.71231794
Spring in action, Manning - 0.71231794
Hibernate in action 2e, Manning Publication - 0.71231794
The longer the field, the lower the score, and vice versa; how often the keyword occurs also affects the score.
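The figures above can be reproduced without Lucene, which helps show where each number comes from. The sketch below is my own reconstruction, not code from Lucene: `NormScoreSketch` and its method names are hypothetical, `encodeNorm`/`decodeNorm` re-implement the one-byte norm encoding that Lucene 3.x keeps in `SmallFloat` (`floatToByte315` / `byte315ToFloat`), and query boost and coord are assumed to be 1. The term counts (2, 3 and 5) assume that StandardAnalyzer drops the stop word "in".

```java
public class NormScoreSketch {

    // Re-implementation (for illustration) of Lucene 3.x's one-byte norm encoding.
    public static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;          // keep sign, exponent and the top mantissa bits
        int zero = (63 - 15) << 3;       // smallest representable exponent
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // zero/negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;            // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    public static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // tf grows with the square root of the term frequency.
    public static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // idf shrinks as more documents contain the term.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Length norm 1/sqrt(numTerms), after the lossy round trip through one byte.
    public static float storedLengthNorm(int numTerms) {
        float raw = (float) (1.0 / Math.sqrt(numTerms));
        return decodeNorm(encodeNorm(raw));
    }

    // Score of a single-term query; with norms disabled the norm factor is 1.
    public static float score(int freq, int docFreq, int numDocs,
                              int numTerms, boolean useNorms) {
        float norm = useNorms ? storedLengthNorm(numTerms) : 1.0f;
        return tf(freq) * idf(docFreq, numDocs) * norm;
    }

    public static void main(String[] args) {
        // Three documents, all containing "action" once.
        System.out.println("2 terms:  " + score(1, 3, 3, 2, true));
        System.out.println("3 terms:  " + score(1, 3, 3, 3, true));
        System.out.println("5 terms:  " + score(1, 3, 3, 5, true));
        // Two documents, "action" twice in a 2-term field.
        System.out.println("freq 2:   " + score(2, 2, 2, 2, true));
        // Norms disabled: field length no longer matters.
        System.out.println("no norms: " + score(1, 3, 3, 5, false));
    }
}
```

The printed values should land very close to the scores listed above. Note that a 2-term field is stored as 0.625 rather than 1/√2 ≈ 0.707: the byte encoding truncates, so fields of similar length can end up with exactly the same norm.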

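Field.Index has five options:
  • NO - not indexed.
  • ANALYZED - tokenized by the analyzer, then indexed.
  • NOT_ANALYZED - indexed as a single term, without tokenizing.
  • ANALYZED_NO_NORMS - tokenized and indexed, with Norms disabled.
  • NOT_ANALYZED_NO_NORMS - indexed as a single term, with Norms disabled.
A field that is not indexed cannot be searched on. It has to be paired with Field.Store.YES so that its value is at least stored in the index; Lucene rejects a field that is neither indexed nor stored.

A field that is not tokenized is only found by an exact match. For example, if "Lucene in action" is indexed without tokenizing, only the term "Lucene in action" finds it; neither "Lucene" nor "action" does.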
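The difference between tokenized and untokenized indexing is easy to see in a toy inverted index. This is a plain-Java illustration, not Lucene code: `AnalyzedVsNotAnalyzedSketch` is a hypothetical class, and its `analyze` method is a simplified stand-in for an analyzer (it lowercases and splits on non-alphanumeric characters, and unlike StandardAnalyzer it keeps stop words).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AnalyzedVsNotAnalyzedSketch {

    // Simplified stand-in for an analyzer: lowercase, split on non-alphanumerics.
    public static List<String> analyze(String value) {
        List<String> terms = new ArrayList<String>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Builds term -> document ids. When 'analyzed' is false the whole value is one term.
    public static Map<String, Set<Integer>> index(List<String> values, boolean analyzed) {
        Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
        for (int docId = 0; docId < values.size(); docId++) {
            List<String> terms;
            if (analyzed) {
                terms = analyze(values.get(docId));
            } else {
                terms = Collections.singletonList(values.get(docId));
            }
            for (String term : terms) {
                Set<Integer> docs = postings.get(term);
                if (docs == null) {
                    docs = new TreeSet<Integer>();
                    postings.put(term, docs);
                }
                docs.add(docId);
            }
        }
        return postings;
    }

    // Exact term lookup, comparable in spirit to a TermQuery.
    public static Set<Integer> search(Map<String, Set<Integer>> postings, String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Lucene in action", "Spring in action");

        Map<String, Set<Integer>> analyzed = index(titles, true);
        Map<String, Set<Integer>> notAnalyzed = index(titles, false);

        System.out.println(search(analyzed, "action"));               // both documents
        System.out.println(search(notAnalyzed, "action"));            // nothing
        System.out.println(search(notAnalyzed, "Lucene in action"));  // first document only
    }
}
```

In the untokenized index the only key for the first document is the full string, which is why a lookup for "action" comes back empty.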


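Disabling Norms only affects how results are ranked (index-time Boost values and field length are no longer taken into account); it does not change how you query. That is why every score came out the same once the test above was switched to Field.Index.ANALYZED_NO_NORMS.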

Document Boost
  public void testBoostDocument() {
    System.out.println("testBoostDocument...");
    try {

      // index
      this.writer = this.createWriter();
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Lucene in action", 0.5F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Spring in action", 1.5F));
      this.writer.addDocument(this.createDocument(Field.Index.ANALYZED,
          "Hibernate in action", 1F));
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_TITLE, "action")), 10);

      // output
      Document doc;
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }
Give each Document a different boost value. The results:
// same Boost value (the default is 1)
Lucene in action - 0.4451987
Spring in action - 0.4451987
Hibernate in action - 0.4451987
// different Boost values
Spring in action - 0.71231794
Hibernate in action - 0.4451987
Lucene in action - 0.22259936
// switching to Field.Index.ANALYZED_NO_NORMS
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794
The higher the Boost value, the higher the score, and vice versa.

Field Boost
  public void testBoostField() {
    System.out.println("testBoostField...");
    try {
      
      // index
      this.writer = this.createWriter();
      Document doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Lucene in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Lucene in action, Manning", 0.5F));
      this.writer.addDocument(doc);
      doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Spring in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Spring in action, Manning", 1F));
      this.writer.addDocument(doc);
      doc = new Document();
      doc.add(this.createField(Field.Index.ANALYZED,
          BoostNormsTestCase.F_TITLE, "Hibernate in action", 1F));
      doc.add(this.createField(Field.Index.ANALYZED, BoostNormsTestCase.F_BODY,
          "Hibernate in action, Manning", 1.5F));
      this.writer.addDocument(doc);
      this.closeWriter();

      // search
      this.searcher = this.createSearcher();
      TopDocs results = this.searcher.search(new TermQuery(new Term(
          BoostNormsTestCase.F_BODY, "manning")), 10);
      
      // output
      for (ScoreDoc sdoc : results.scoreDocs) {
        doc = this.searcher.doc(sdoc.doc);
        System.out.println(doc.get(BoostNormsTestCase.F_TITLE) + " - "
            + sdoc.score);
      }
    }
    catch (IOException e) {
      Assert.fail(e.getMessage());
    }
    finally {
      this.closeWriter();
      this.closeSearcher();
    }
  }
Give each Field a different boost value. The results:
// same Boost value (the default is 1)
Lucene in action - 0.35615897
Spring in action - 0.35615897
Hibernate in action - 0.35615897
// different Boost values
Hibernate in action - 0.53423846
Spring in action - 0.35615897
Lucene in action - 0.17807949
// switching to Field.Index.ANALYZED_NO_NORMS
Lucene in action - 0.71231794
Spring in action - 0.71231794
Hibernate in action - 0.71231794
The higher the Boost value, the higher the score, and vice versa.
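The boosted scores are not in an exact 0.5 : 1 : 1.5 ratio, and a rough worked calculation suggests why. Assuming Lucene 3.x's default scoring, the boost is multiplied into the length norm 1/√numTerms and the product is then stored in a single byte, which truncates it to a coarse grid of values. With three documents that all contain the query term, idf = 1 + ln(3/4) ≈ 0.7123; the title holds 2 terms and the body 3, assuming StandardAnalyzer drops the stop word "in". The "stored norm" column is reconstructed from the scores above rather than read out of Lucene.

```latex
\begin{array}{llccc}
\text{test} & \text{boost} & \text{boost} \cdot 1/\sqrt{\text{numTerms}} & \text{stored norm} & \text{idf} \cdot \text{stored norm} \\
\hline
\text{Document (title, 2 terms)} & 0.5 & 0.3536 & 0.3125 & 0.2226 \\
\text{Document (title, 2 terms)} & 1.0 & 0.7071 & 0.625  & 0.4452 \\
\text{Document (title, 2 terms)} & 1.5 & 1.0607 & 1.0    & 0.7123 \\
\text{Field (body, 3 terms)}     & 0.5 & 0.2887 & 0.25   & 0.1781 \\
\text{Field (body, 3 terms)}     & 1.0 & 0.5774 & 0.5    & 0.3562 \\
\text{Field (body, 3 terms)}     & 1.5 & 0.8660 & 0.75   & 0.5342 \\
\end{array}
```

The last column matches the test output in both cases, which suggests that Document Boost and Field Boost end up in the same stored norm and that small differences in boost can disappear in the rounding.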

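The main reason to disable Norms is to use less memory at search time; how much memory Norms take depends on the size of the index. Once a field has been indexed with Norms, disabling them later means rebuilding the whole index: as long as any Document was indexed with Norms on that field, searches on the field keep using them.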
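To get a feel for the memory involved: Lucene's documentation for the NO_NORMS options describes Norms as costing one byte of RAM per document for each indexed field that keeps them. A back-of-the-envelope sketch follows; `NormsMemorySketch` is a hypothetical helper and the document and field counts are made-up numbers, not measurements.

```java
import java.util.Locale;

public class NormsMemorySketch {

    // One byte per document for every indexed field that keeps norms.
    public static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long numDocs = 10000000L;   // assumed index size
        int fieldsWithNorms = 5;    // assumed number of fields indexed with norms

        long bytes = normsBytes(numDocs, fieldsWithNorms);
        double mib = bytes / (1024.0 * 1024.0);

        // Roughly 47.7 MiB for these assumed numbers.
        System.out.println(String.format(Locale.ROOT, "%d bytes (%.1f MiB)", bytes, mib));
    }
}
```

For a small index the saving is negligible, so dropping Norms is mostly worth considering for fields where ranking by length or index-time boost does not matter.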