Word Embeddings - An Alternative and Efficient Approach to Search for Documents

Duration: 50 mins
Ananth Gundabattula
Senior Architect, Commonwealth Bank of Australia

Open source document search engines typically implement search over a document collection using the TF/IDF principle. However, recent developments in the field of NLP have shown positive results in representing text as more concise vectors rather than as a bag of words. These approaches also enrich the information model, for example by capturing analogies and the semantics of words. This talk walks through an end-to-end data workflow that enables such a construct.

The first part of the session describes how a search query is typically processed by default in Lucene-powered search engines today. The concept of TF/IDF is also introduced in this part of the session.
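To make the TF/IDF idea concrete, here is a minimal sketch in plain Python. It uses a textbook formulation (raw term count multiplied by a smoothed logarithmic inverse document frequency) and is an assumption for illustration only; it is not Lucene's actual scoring formula. The helper names `tokenize`, `build_index`, `tf_idf_score`, and `search` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and per-term document frequencies."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        # Each document contributes at most 1 to a term's document frequency.
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def tf_idf_score(query, counts, doc_freq, n_docs):
    """Sum tf * idf over the query terms that occur in the document."""
    score = 0.0
    for term in tokenize(query):
        tf = counts[term]
        if tf == 0:
            continue
        # Rare terms get a larger weight; the +1 keeps the weight positive
        # even for a term that appears in every document.
        idf = math.log(n_docs / doc_freq[term]) + 1.0
        score += tf * idf
    return score


def search(query, docs):
    """Return indices of matching documents, best score first."""
    term_counts, doc_freq = build_index(docs)
    scored = [
        (tf_idf_score(query, counts, doc_freq, len(docs)), i)
        for i, counts in enumerate(term_counts)
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0]
```

Given the three documents "the cat sat on the mat", "the dog chased the cat", and "stock markets fell sharply", the query "cat mat" ranks the first document above the second because "mat" is rarer than "cat". The third document is not returned at all, since it shares no terms with the query.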

The session then proceeds to describe the concept of word embeddings using a library like Facebook's fastText.
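The sketch below illustrates what "taking care of analogies" means once words are vectors. The three-dimensional vectors are made up by hand for this example; in practice they would come from a trained model such as fastText and have far more dimensions. The names `TOY_VECTORS`, `cosine`, and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors, NOT learned from data.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.4],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the three input words, then pick the closest remaining word.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))
```

With these toy values, `analogy("man", "king", "woman")` returns `"queen"`: subtracting "man" from "king" and adding "woman" lands closest to the "queen" vector.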

Subsequently, a representative data pipeline is discussed, showing how an incoming stream of data can be turned into vector representations and made searchable with a turnaround time of a few seconds.
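As a rough illustration of such a pipeline, the sketch below turns each incoming document into a vector by averaging its word vectors, stores it in an in-memory index, and answers queries by cosine similarity. Everything here is an assumption made for illustration: the two-dimensional `WORD_VECTORS` are invented, the `VectorIndex` class is hypothetical, and a brute-force scan stands in for whatever streaming and indexing components the talk actually uses.

```python
import math

# Invented 2-d word vectors; a real pipeline would load a trained model.
WORD_VECTORS = {
    "loan": [0.9, 0.1],
    "mortgage": [0.85, 0.2],
    "credit": [0.8, 0.15],
    "football": [0.1, 0.9],
    "match": [0.2, 0.8],
    "goal": [0.15, 0.85],
}


def embed(text):
    """Average the vectors of known words; return None if none are known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class VectorIndex:
    """Brute-force in-memory index of document vectors."""

    def __init__(self):
        self._entries = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, text):
        """Embed one incoming document and make it searchable immediately."""
        vector = embed(text)
        if vector is not None:
            self._entries.append((doc_id, vector))

    def search(self, query, top_k=3):
        """Return up to top_k document ids, most similar first."""
        query_vector = embed(query)
        if query_vector is None:
            return []
        scored = sorted(
            ((cosine(query_vector, vec), doc_id) for doc_id, vec in self._entries),
            key=lambda pair: -pair[0],
        )
        return [doc_id for _, doc_id in scored[:top_k]]
```

After adding document "d1" with the text "mortgage credit" and document "d2" with the text "football match goal", the query "loan" returns "d1" first even though the word "loan" never appears in it. A purely term-based match would find nothing for that query.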

The session closes with a few references to more recent developments in this space.

