I have a few thousand notes, a pile of blog posts and tweets, a huge amount of sent email, and various other data, all written by me. To get more value out of this dataset, I built a prototype of a personal search engine using NLP. I set up a system that automatically connects my emails, notes, posts, chat messages, tweets, book highlights, and so on. Now, when writing a new note, I get a list of relevant items I wrote at some earlier time. I think this has a lot of potential for serendipity – at least it has proven to be amusing.
For example, I wrote some observations about a paper describing how an ML model analysed and graded Airbnb photos: What Makes a Good Image? Airbnb Demand Analytics Leveraging Interpretable Image Features. My system surfaced my old note from 2014 about photography. Very nice! Now I had useful extra context for the new note. Making interesting connections makes data cumulatively better. Both of these notes are now better than they would have been without the connection.
System Structure
The first issue to tackle was importing my data from the separate data silos. The next was applying data science methods and a learning algorithm to distil meaning from the mess. In short:
- Import my data from various sources.
- Parse and process the data into indexable documents.
- Analyse the documents and build a search index.
- Implement a search over the index.
Data Import
I like Karlicoss’s HPI. It’s a concept and an extensible framework for accessing personal data. The philosophy is to provide a data access layer that abstracts away the trouble of managing multiple data sources with different data models. HPI can import data from APIs, when available, or from local files like the Twitter GDPR export that I used. Note to self: next time I build a service, make sure there’s an API for downloading my data. Written in Python and built on namespace packages, HPI lets me customize it and extend it to custom data sources. So I implemented my personal hpi-overlay.
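As a rough illustration, an overlay module can be a plain Python file dropped into the `my.` namespace from a separate package that sits earlier on the path. The module, class, and directory names below are hypothetical, for illustration only, not HPI’s actual API:

```python
# Hypothetical overlay module, e.g. my/personal_notes.py in an overlay
# namespace package placed before HPI on PYTHONPATH.
# Note, NOTES_DIR, and notes() are illustrative names, not HPI's API.
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator

NOTES_DIR = Path("~/notes").expanduser()

@dataclass
class Note:
    created: datetime
    path: Path
    text: str

def notes() -> Iterator[Note]:
    """Yield every markdown note as a timestamped document."""
    for path in sorted(NOTES_DIR.rglob("*.md")):
        yield Note(
            created=datetime.fromtimestamp(path.stat().st_mtime),
            path=path,
            text=path.read_text(encoding="utf-8"),
        )
```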
Document Index
For indexing, I tried existing tools such as Carrot2 and Solr, but the results were not good enough. My main problem was that my data are multilingual. The majority is in English, but a lot of the data are in other languages: Finnish, German, and French. And by “a lot” I actually mean “a little”, because the overall amount of data in my personal space is small from a data science perspective (fewer than 10,000 documents when excluding emails).
“Traditional” indexing, clustering, and useful search require some language processing. Tokenizing and extracting keywords needs language-specific stemming, so multiple languages would require multiple setups for stemming, stopwords, etc., as sketched below.
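To make that overhead concrete, here is a sketch using the snowballstemmer package (my choice of example library, not necessarily what a real pipeline would use): each language needs its own stemmer, and picking the right one requires language detection first.

```python
# Each language needs its own stemmer (and stopword list, not shown);
# snowballstemmer is used here purely for illustration.
import snowballstemmer

stemmers = {
    lang: snowballstemmer.stemmer(lang)
    for lang in ("english", "finnish", "german", "french")
}

# The right stemmer can only be chosen after detecting the language.
print(stemmers["finnish"].stemWord("taloissa"))  # strips case endings
```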
Word Vectors
Instead, I turned to a machine learning approach. Fasttext is a word embedding implementation that utilizes subword information. I figured it would be a good candidate for a mixed-language dataset where some languages (Finnish) are morphologically rich.
I implemented a quick preprocessing tool and exported the texts from all the various sources into a single corpus. The resulting corpus amounted to a mere 12 MB. The next step was to train a Fasttext model from scratch, using both subwords and wordNgrams. Training on this multi-language corpus was fast, and a quick test with Fasttext’s query tool showed that the model had learned something meaningful: querying with a misspelled word returned the correct spelling, querying with a concept returned related concepts, and so on. For example:
```
Query word? intelligence
intelligent 0.882046
intellectual 0.778315
artificial 0.777577
episodes 0.724803
treatise 0.714352
terrifying 0.711142
psychology 0.710391
inductive 0.705542
fundamentals 0.703758
visionists 0.701901
```
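For reference, a minimal sketch of this training and querying step with the fasttext Python bindings; the corpus path and hyperparameters are illustrative, not my exact setup:

```python
# Train a skipgram model with subword (character n-gram) information;
# the hyperparameters shown are illustrative, not the exact values used.
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt",       # the single preprocessed multi-language corpus
    model="skipgram",
    dim=100,            # embedding dimensionality
    minn=3, maxn=6,     # subword n-gram lengths
    wordNgrams=2,       # also use word n-grams
)
model.save_model("personal.bin")

# Nearest neighbours in the embedding space, as in the query above.
print(model.get_nearest_neighbors("intelligence"))
```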
The training step could probably be improved considerably by utilizing pre-trained word vectors and putting more effort into the preprocessing. A cross-lingual embedding space could also prove beneficial for my dataset.
Document Similarity
Now I had a custom model for getting word vectors. Next, I took a document, got the vector for each word in it, and created a document vector by averaging the individual word vectors. This is a crude approximation; more accurate methods exist, e.g. Le and Mikolov, 2014.
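A minimal sketch of this averaging, assuming the model trained above and a naive whitespace tokenizer (my simplification, not necessarily the original preprocessing):

```python
import numpy as np
import fasttext

model = fasttext.load_model("personal.bin")

def document_vector(text: str) -> np.ndarray:
    """Average the word vectors of a document into one vector."""
    words = text.lower().split()  # naive tokenization, for illustration
    if not words:
        return np.zeros(model.get_dimension())
    return np.mean([model.get_word_vector(w) for w in words], axis=0)
```

The fasttext bindings also ship a get_sentence_vector helper that performs a similar averaging with per-word normalization.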
My search index thus became a combination of documents and their corresponding document vectors.
How do I find the documents most similar to the one I’m currently working on? Similarity here implies that the documents are somehow related and should be clustered together. Documents are related if they cover the same or related topics. For example, is a packing list for a road trip related to route planning? Would it be useful to surface both when thinking about the trip?
“We have argued that the automated measurement of the similarity between text documents is fundamentally a psychological modeling problem.”
– Lee et al., the “Lee corpus” paper
There are multiple potential similarity metrics; see e.g. https://github.com/taki0112/Vector_Similarity for an implementation of TS-SS, which takes both vector magnitude and direction into account. A common and simple choice is cosine similarity.
For a new note, I compute its document vector and then the cosine similarity against every document vector in the index. As this calculation happens “online”, it has a big effect on the user experience, and I was worried that Python would be too slow to be usable. My first implementation confirmed that this was indeed the case, but after turning to Pandas/Numpy and implementing a vectorized version of the computation, the delay became negligible.
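A sketch of the vectorized version with Numpy; `index_vectors` is assumed to be an (n_docs, dim) matrix stacked from the document vectors above:

```python
import numpy as np

def most_similar(query_vec: np.ndarray,
                 index_vectors: np.ndarray,
                 k: int = 10) -> np.ndarray:
    """Return the indices of the k most cosine-similar documents."""
    norms = np.linalg.norm(index_vectors, axis=1) * np.linalg.norm(query_vec)
    norms[norms == 0] = 1e-9  # guard against zero vectors
    sims = index_vectors @ query_vec / norms
    return np.argsort(sims)[::-1][:k]
```

A single matrix-vector product replaces the per-document Python loop, which is what makes the delay negligible.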
Conclusion
Our data are typically siloed in different services, and it is hard to link between items. As so often before, I was again surprised at how much work it took just to build a dataset for a machine learning project. However, the process is now in place, and HPI is helping to keep it going. Another interesting tool is Promnesia.
Training a Fasttext model from scratch was surprisingly convenient. The process is fast, and the results are useful. Out-of-vocabulary words are handled via subword information. Working in the word embedding space seems to make indexing and similarity more semantic than just counting keyword frequencies. Word vectors capture meaning; see e.g. the blog post Less Sexism in Finnish.
After using the system for a while, I’ve been pleasantly surprised to find long-forgotten notes, emails, and messages related to something I’m working on. These reminders from the past have felt useful in making connections between concepts and building understanding.