三津石智巳

👦🏻👦🏻👧🏻 Father of 3 | 🗺️ Service Reliability Engineering Manager at Rakuten Travel | 📚 Avid Reader | 👍 Wagashi | 👍 Caffe Latte | 👍 Owarai

[Impressions] Lucene in Action, Second Edition

I feel a strong need to learn about Lucene.

"If you’re the curious type, and you just won’t leave any stone unturned, or your application needs to use all the bells and whistles, the rest of this chapter is for you!"

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

Beside the main point, but these are handy English expressions.

"It need not have the same fields as the previous document you added. It can even have the same fields, with different options, than in other documents."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

Again beside the main point, but I was pleased to notice that "need" is used as a modal auxiliary verb here ("need not have").

"In this case, you can tell Lucene to skip indexing the term frequency and positions by calling Field.setOmitTermFreqAndPositions(true). This approach will save some disk space in the index, and may also speed up searching and filtering, but will silently prevent searches that require positional information, such as PhraseQuery and SpanQuery, from working. Let’s move on to controlling how Lucene stores a field’s value."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

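If you only need to check whether a term is present, rather than score with the Vector Space Model, call Field.setOmitTermFreqAndPositions(true). The price is that queries needing positions, such as PhraseQuery and SpanQuery, silently stop working on that field.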
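To make the trade-off concrete for myself, here is a minimal sketch in plain Java. It is a toy in-memory index, not Lucene and not its on-disk format. The class OmitPositionsSketch and all of its method names are hypothetical, made up for this note. It contrasts postings that keep positions with postings that keep only document IDs.

```java
import java.util.*;

// Toy model (assumption: whitespace tokenization, in-memory maps).
// Not Lucene's implementation or file format.
class OmitPositionsSketch {
    // Full postings: term -> docId -> positions of the term in that document.
    static Map<String, Map<Integer, List<Integer>>> indexWithPositions(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] tokens = docs.get(docId).toLowerCase().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    // Reduced postings: term -> docIds only. Frequencies and positions are dropped.
    static Map<String, Set<Integer>> indexDocIdsOnly(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Two-word phrase match. It needs positions, so it can only be
    // written against the full postings, not the docId-only ones.
    static Set<Integer> phrase(Map<String, Map<Integer, List<Integer>>> index,
                               String first, String second) {
        Set<Integer> hits = new TreeSet<>();
        Map<Integer, List<Integer>> a = index.getOrDefault(first, Map.of());
        Map<Integer, List<Integer>> b = index.getOrDefault(second, Map.of());
        for (Map.Entry<Integer, List<Integer>> e : a.entrySet()) {
            List<Integer> secondPositions = b.get(e.getKey());
            if (secondPositions == null) continue;
            for (int pos : e.getValue()) {
                // The second word must directly follow the first.
                if (secondPositions.contains(pos + 1)) hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

With the documents "new york city" and "york new", the docId-only index can still say which documents contain "york", but only the index with positions can tell that the phrase "new york" occurs in the first document alone.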

"But what exactly are term vectors? Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

A term vector can be thought of as an inverted index for a single document.
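The "keyed first by document ID, then by term" structure can be sketched with nested maps. This is only my mental model in plain Java, assuming whitespace tokenization. TermVectorSketch and its methods are hypothetical names, not Lucene's term vector API.

```java
import java.util.*;

// Toy model of term vectors: docId -> (term -> positions in that document).
// An assumption-laden sketch, not how Lucene stores term vectors.
class TermVectorSketch {
    static Map<Integer, Map<String, List<Integer>>> build(List<String> docs) {
        Map<Integer, Map<String, List<Integer>>> vectors = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // The miniature inverted index for this one document.
            Map<String, List<Integer>> perDoc = new TreeMap<>();
            String[] tokens = docs.get(docId).toLowerCase().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                perDoc.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
            }
            vectors.put(docId, perDoc);
        }
        return vectors;
    }

    // Term frequency within a single document: look up by docId first, then by term.
    static int termFrequency(Map<Integer, Map<String, List<Integer>>> vectors,
                             int docId, String term) {
        return vectors.getOrDefault(docId, Map.of())
                      .getOrDefault(term, List.of())
                      .size();
    }
}
```

For the document "to be or not to be", looking up document 0 and then the term "to" gives positions 0 and 4. An ordinary inverted index would go the other way round, from the term to every document that contains it.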

"Lucene’s search results are ranked according to how closely each document matches the query, and each matching document is assigned a score. Lucene’s scoring formula consists of a number of factors, and the boost factor is one of them."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

I wonder whether this scoring could be driven by our own business logic.

"MaxFieldLength.UNLIMITED, which means no truncation will take place, and MaxFieldLength.LIMITED, which means fields are truncated at 10,000 terms. You can also instantiate MaxFieldLength with your own limit."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

The limit on the number of terms indexed per field is configurable.
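What truncation at a term limit means can be shown in a few lines of plain Java. This is a toy under my own assumptions (whitespace tokenization, a hypothetical MaxFieldLengthSketch class with stand-in constants), not Lucene's MaxFieldLength class itself.

```java
import java.util.*;

// Toy illustration of truncating a field at a maximum number of terms.
// The constants are stand-ins chosen for this sketch, mirroring the quote.
class MaxFieldLengthSketch {
    static final int UNLIMITED = Integer.MAX_VALUE; // no truncation
    static final int LIMITED = 10_000;              // the 10,000-term limit from the quote

    // Returns the terms that would be indexed; anything past the limit is dropped.
    static List<String> termsToIndex(String fieldValue, int maxFieldLength) {
        String[] tokens = fieldValue.trim().split("\\s+");
        int kept = Math.min(tokens.length, maxFieldLength);
        return new ArrayList<>(Arrays.asList(tokens).subList(0, kept));
    }
}
```

With a custom limit of 3, the field value "a b c d e" keeps only "a", "b" and "c", so a search for "d" would never match that field even though the original text contains it.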

"Optimizing only improves searching speed, not indexing speed."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

Merging segments (optimizing) improves only search performance, not indexing performance.

"When a flush occurs, the writer creates new segment and deletion files in the Directory. However, these files are neither visible nor usable to a newly opened IndexReader until the writer commits the changes and the reader is reopened. It’s important to understand this difference. Flushing is done to free up memory consumed by buffered changes to the index. Committing is done to make all changes (buffered or already flushed) persistent and visible in the index. This means IndexReader always sees the starting state of the index (when IndexWriter was opened), until the writer commits."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

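Flush writes the changes buffered in memory out to the Directory in order to free memory, but those files are not yet visible to readers. Commit makes all changes, buffered or already flushed, persistent and visible to a newly opened or reopened IndexReader.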
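The visibility rules are easy to mix up, so here is a small state model in plain Java. It only tracks which documents a reader could see. FlushCommitSketch and its methods are hypothetical names for this note, and nothing here reflects how IndexWriter is actually implemented.

```java
import java.util.*;

// Toy model of visibility only. Three stages a document passes through:
// buffered in memory -> flushed to files -> committed (visible to readers).
class FlushCommitSketch {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    void addDocument(String doc) {
        buffered.add(doc);
    }

    // Flush: frees the in-memory buffer, but readers still cannot see the result.
    void flush() {
        flushed.addAll(buffered);
        buffered.clear();
    }

    // Commit: makes everything, buffered or already flushed, visible.
    void commit() {
        flush();
        committed.addAll(flushed);
        flushed.clear();
    }

    int bufferedCount() {
        return buffered.size();
    }

    // A newly opened reader is a snapshot of the last commit and never changes afterwards.
    List<String> openReader() {
        return List.copyOf(committed);
    }
}
```

In this model, a reader opened after flush() but before commit() sees nothing new, and a reader opened before commit() keeps its old snapshot, which is why the quote says the reader has to be reopened.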

"But a newly opened near-real-time reader (see section 2.8) is able to see the changes without requiring a commit() or close()."

via Lucene in Action, Second Edition - https://learning.oreilly.com/library/view/-/9781933988177/kindle_split_015.html

A near-real-time reader achieves near-real-time search at the cost of forcing a flush.