三津石智巳

👦🏻👦🏻👧🏻 Father of 3 | 🗺️ Service Reliability Engineering Manager at Rakuten Travel | 📚 Avid Reader | 👍 Wagashi | 👍 Caffe Latte | 👍 Owarai

[Thoughts] Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application

I found this book by searching for "postings list". It leaves me feeling that I should be studying Lucene rather than Elasticsearch.

"Search, in the context of IR systems, is the art of extracting information with high relevance for a given query."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_1_Chapter.xhtml

I had thought of search as a secondary index scan, so it is interesting that "high relevance" is part of the definition.

"Note that the indexing referred to here and in later chapters differs from typical indexing in relational database systems. Relational database systems typically refer to indexing as the creation of a secondary data structure such as B-tree on top of a table attribute for faster searching. In this context, indexing refers to building an inverted index on top of the given dataset."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_1_Chapter.xhtml

So this is the difference between databases and IR.
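To make "building an inverted index on top of the given dataset" concrete, here is a minimal sketch in plain Python. It is not Lucene's implementation: the whitespace tokenizer, the `build_inverted_index` name, and the sample documents are all assumptions made for illustration. Each term maps to a sorted list of document IDs, which is the postings list.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of document IDs (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analysis step: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "hotel near tokyo station",
    2: "ryokan near kyoto station",
    3: "hotel in kyoto",
}
index = build_inverted_index(docs)
print(index["hotel"])    # [1, 3]
print(index["station"])  # [1, 2]
```

A B-tree on a table attribute answers "which rows have this exact value", while this structure answers "which documents contain this term", which is the starting point for relevance ranking.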

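Is the thing I want to build DB search or Web search (a.k.a. information retrieval)? If an F-measure of 1 is naturally expected, then it is probably fair to call it DB search. Things DB search could still learn from Web search include: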

  1. Single-field and multifield searching
  2. Sorting and faceting
  3. Multi-index searches

and so on.
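For reference, the "F-measure of 1" criterion above can be sketched as follows. This assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the function name and the sample ID lists are hypothetical.

```python
def f_measure(retrieved, relevant):
    """Balanced F-measure (F1) of a result set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)  # share of results that are relevant
    recall = hits / len(relevant)      # share of relevant items returned
    return 2 * precision * recall / (precision + recall)

# An exact retrieval returns all relevant rows and nothing else.
print(f_measure([1, 2, 3], [1, 2, 3]))  # 1.0

# A "wide" search returns extra documents, so precision drops below 1.
print(f_measure([1, 2, 3, 4], [1, 2]))
```

The score reaches 1 only when the result set equals the relevant set, which is what one expects from an exact-match query against a database.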

"user queries are deconstructed into terms and then sent for further processing."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_1_Chapter.xhtml

I would like to bring this idea of deconstructing user queries into DB search as well.
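A rough sketch of what deconstructing a query into terms could look like, followed by one possible "further processing" step: intersecting the postings lists of all terms. The regex-based `analyze` function and the hand-written index are assumptions for illustration, not Lucene's analyzer chain.

```python
import re

def analyze(query):
    """Toy analyzer: lowercase the query and split it into word terms."""
    return re.findall(r"\w+", query.lower())

def intersect(a, b):
    """Merge two sorted postings lists, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search_all_terms(index, query):
    """Return IDs of documents that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return []
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, index.get(term, []))
    return result

# Hypothetical index: term -> sorted postings list of document IDs.
index = {"hotel": [1, 3], "kyoto": [2, 3], "station": [1, 2]}
print(analyze("Kyoto Hotel!"))                  # ['kyoto', 'hotel']
print(search_all_terms(index, "Kyoto Hotel!"))  # [3]
```

Because both lists are sorted, the intersection walks each list once, which is the kind of index intersection a DB-style conjunctive search could borrow.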

"Databases are typically used for precise and exact retrievals, whereas IR systems are normally used for imprecise and “wide” searches, where all documents can match a query, and hence the concepts of ranking and relevance apply."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_1_Chapter.xhtml

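The difference between DB search and information retrieval is described here as well. By the time a user has arrived at what the previously mentioned 「最強データベース講義」 calls a "transactional site", it seems to me that DB search alone may be all the search that needs to be offered.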

"The optimal way to execute such queries is to have a columnar store that maps document IDs to field values, giving all the benefits described earlier. So, the value of using DocValues should be obvious to you now."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_3_Chapter.xhtml

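This covered what I wanted to know about sorting in Lucene. It is obvious when you think about it, but an inverted index only gives you the document IDs that match a query, so sorting by anything other than ranking/scoring requires, in the naive approach, reading the relevant field of every document in the hits. According to the book, DocValues exists to do this without putting pressure on memory.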
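The idea of "a columnar store that maps document IDs to field values" can be sketched as below. This only illustrates the concept, not Lucene's on-disk DocValues format; the `price_column` data and the `sort_hits_by_field` helper are hypothetical.

```python
from array import array

# Column-oriented store for one field: position = document ID,
# value = that document's "price" (hypothetical sample data).
price_column = array("l", [12000, 8000, 30000, 9500])

def sort_hits_by_field(hits, column, reverse=False):
    """Sort matching document IDs by a field value read from the column."""
    return sorted(hits, key=lambda doc_id: column[doc_id], reverse=reverse)

# Document IDs returned by the inverted index for some query.
hits = [0, 2, 3]
print(sort_hits_by_field(hits, price_column))  # [3, 0, 2]
```

Sorting only touches one compact array of values per field, rather than loading every stored document in the hits to read a single field.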

"While fielddata defaults to loading values into memory on the fly, this is not the only option. It can also be written to disk at index time in a way that provides all the functionality of in-memory fielddata, but without the heap memory usage. This alternative format is called doc values."

 

via Check out this quote from Elasticsearch: The Definitive Guide - https://learning.oreilly.com/library/view/-/9781449358532/part04ch10.html

In Elasticsearch as well, this can be specified under the very same name, doc_values. Either way, my understanding is that an inverted index is ill-suited to sorting by field.

"During the scoring, the Weight creates a Scorer, which iterates over all documents in the index, filters out the irrelevant documents, and applies the scoring algorithm on the candidate documents."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_3_Chapter.xhtml

So scoring runs after IndexSearcher? I am curious how performance is achieved when scores are computed online.
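One common way to keep online scoring affordable is to retain only the best k results in a bounded heap instead of sorting every scored candidate. The sketch below assumes a toy score (raw term frequency) and hypothetical names; it is not Lucene's Scorer or its similarity function.

```python
import heapq

def top_k(candidates, score, k):
    """Keep only the k best (score, doc_id) pairs using a bounded min-heap."""
    heap = []
    for doc_id in candidates:
        entry = (score(doc_id), doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the current worst of the top k: swap it in.
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)

# Hypothetical per-document term frequencies used as a toy score.
term_freq = {1: 3, 2: 1, 3: 5, 4: 2}
print(top_k(term_freq, lambda d: float(term_freq[d]), 2))
# [(5.0, 3), (3.0, 1)]
```

With n candidates the heap never grows beyond k entries, so the cost is roughly O(n log k) rather than O(n log n) for a full sort.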

"that also signifies an information explosion in the amount of data that is available at any given point in time for such queries. Hence it is imperative that we build storage, indexing, and searching strategies that scale with the data while consistently delivering a reasonable degree of performance."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_5_Chapter.xhtml

This. Exactly this.

"Many geographic information system (GIS) tools that are available do the task of performing geographic searches. However, the advantage of leveraging a search engine for this use case is that you can combine structured and unstructured data."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_5_Chapter.xhtml

I see. I think the same could be said of DB search, but the advantage of using Web search is that information can be combined at the application level.

"How do we solve this problem of storage vs. search quality? The next section answers this question for you."

 

via Check out this quote from Practical Apache Lucene 8: Uncover the Search Capabilities of Your Application - https://learning.oreilly.com/library/view/-/9781484263457/html/485809_1_En_5_Chapter.xhtml

"I want to know...!" I thought, but it turns out to be essentially about geohash. How could I apply it...?
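For readers unfamiliar with geohash, here is a minimal sketch of the standard encoding: longitude and latitude bits are interleaved by repeatedly halving the ranges, and every 5 bits become one base-32 character. The function name and default precision are assumptions; this is not the book's or Lucene's code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

A shorter hash is a prefix of a longer one for the same point, so truncating the string trades precision for a smaller, coarser cell.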

 

This was a very good first book on Lucene.

 

Next keywords

  • postings list
  • multifield searching
  • index intersection
  • Scorer/IndexReader
  • performance improvement via top k