BloomJS
A JavaScript search engine.
Basic idea
I have a static weblog, generated with [Blogit](https://github.com/phyks/blogit) (caution, this code is ugly). As I only want HTML files on my server, I needed a way to let users search my blog.
An index is generated by a Python script when the pages are generated, and the client downloads it dynamically when the user wants to search the content.
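To illustrate the idea, here is a minimal sketch of a per-article Bloom filter index in Python. It is not the code from `pybloom.py` or `generate_index.py`: the `TinyBloom` class, the filter size, the number of hashes and the SHA-256-based hashing are all assumptions made for this example, and the stemming step is left out.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical, not the pybloom implementation)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions by salting the word with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def build_index(articles):
    """Build one filter per article from a {name: text} mapping."""
    index = {}
    for name, text in articles.items():
        bloom = TinyBloom()
        for word in text.lower().split():
            bloom.add(word)
        index[name] = bloom
    return index


def search(index, word):
    """Return the articles whose filter probably contains the word."""
    return [name for name, bloom in index.items() if word.lower() in bloom]


index = build_index({
    "a.html": "bloom filters are fun",
    "b.html": "a static weblog",
})
matches = search(index, "weblog")
```

A Bloom filter never misses a word that was added, but it can report a word that was not, so a search may occasionally return an article that does not contain the query.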
Files
Index generation (index_generation/ folder)
- generate_index.py: Python script to generate the index (runs only at page generation) in a nice format for JavaScript
- pybloom.py: Library to handle Bloom filters in Python
- stemmer.py: Implementation of the Porter stemming algorithm in Python, from Vivake Gupta.
Example html search form
- index.html
- js/bloom.js: main JS code
- js/bloomfilters.js: JS library to use BloomFilters
Examples
- samples/: samples for testing purposes (taken from my blog articles)
Data storing
One of the main problems was transmitting the binary data from the Python script to the JS script. I found an article about handling binary data in JavaScript which helped me a lot.
The data from the Python script is just the array of Bloom filter bitarrays, written as a binary file (data/search_index), which I open with JS. The list of articles is also written as JSON in a separate file (data/pages_index.json).
Here's the format of the output from the Python script:
- [16 bits]: number of articles (== number of bitarrays)
- for each bitarray:
  - [16 bits]: length of the bitarray
  - […]: the bitarray itself
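As a rough sketch, the layout above could be written and read back with Python's `struct` module as shown below. The helper names `pack_index` and `unpack_index` are hypothetical, and two details are assumptions because the format description does not state them: the 16-bit fields are taken to be unsigned little-endian, and the length is counted in bytes.

```python
import struct


def pack_index(bitarrays):
    """Serialize a list of bytes objects as: count, then (length, payload) each."""
    # Assumption: 16-bit unsigned little-endian fields ("<H").
    chunks = [struct.pack("<H", len(bitarrays))]
    for bits in bitarrays:
        # Assumption: the length field counts bytes, not bits.
        chunks.append(struct.pack("<H", len(bits)))
        chunks.append(bytes(bits))
    return b"".join(chunks)


def unpack_index(blob):
    """Inverse of pack_index: return the list of bitarrays as bytes objects."""
    (count,) = struct.unpack_from("<H", blob, 0)
    offset = 2
    bitarrays = []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        bitarrays.append(blob[offset:offset + length])
        offset += length
    return bitarrays


blob = pack_index([b"\x01\x02", b"\xff"])
restored = unpack_index(blob)
```

With 16-bit fields, this layout caps both the number of articles and the length of each bitarray at 65535.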
Notes
- I got the idea while reading this page found on Sebsauvage's shaarli. I searched a bit for code doing what I wanted and found two projects. I wasn't fully satisfied with the first one, and I found the second one too heavy and complicated for my purpose, so I ended up coding this.
- This code is mainly a proof of concept. As such, it is not fully optimized (actually, I just tweaked it until the resulting files and calculations could be considered "acceptable"). For those looking for more efficient solutions, here are a few things I found while looking for information on the web:
  - The stemming algorithm used may not be the most efficient one. People wanting to work with non-English languages or to optimize the overall computation of the index can easily move to a more efficient algorithm. See Wikipedia and the stemming library in Python, which has C wrappers for best performance.