BloomJS
====
A JavaScript search engine for static websites.

Have you ever dreamt of having a search engine on your static website? BloomySearch generates a static index when you generate your web pages, and provides a client-side JavaScript script which implements all the search logic: it downloads the index and performs the search query.

To preserve bandwidth, the index is stored in a binary file, using Bloom filters, instead of using a JSON index as <a href="http://lunrjs.com/">Lunr.JS</a> does.

For full details about BloomySearch, please refer to <a href="http://phyks.me/2014/11/bloomysearch.html">this blog post</a>.

## Basic idea

I have a static weblog, generated with [Blogit](https://github.com/phyks/blogit) (caution, this code is ugly) and, as I only want to have HTML files on my server, I needed to find a way to let users search my blog.

A Python script generates an index when the pages are generated, and the client downloads it dynamically when the user wants to search the content.

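To make the idea concrete, here is a minimal sketch of a per-page Bloom filter index, written in Python for illustration only. It is not the code from `pybloom.py` or `js/bloom.js`: the `BloomFilter` class, the filter size, the number of hashes and the SHA-256 based hashing are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical, not the pybloom implementation)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting the word with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def build_index(pages):
    """Build one filter per page from a {url: text} mapping."""
    index = {}
    for url, text in pages.items():
        bloom = BloomFilter()
        for word in text.lower().split():
            bloom.add(word)
        index[url] = bloom
    return index


def search(index, query):
    """Return the pages whose filter reports every query word as present."""
    words = query.lower().split()
    return [url for url, bloom in index.items() if all(w in bloom for w in words)]
```

A page can occasionally be returned although it does not contain the word (a false positive), but a page that contains all query words is never missed. That trade-off is what keeps the index small.
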
## Files

### Index generation (`index_generation/` folder)

* `generate_index.py`: Python script to generate the index (runs only at page generation) in a format convenient for JavaScript
* `pybloom.py`: Library to handle Bloom filters in Python
* `stemmer.py`: Implementation of the Porter stemming algorithm in Python, from Vivake Gupta.

### Example HTML search form

* `index.html`
* `js/bloom.js`: main JS code
* `js/bloomfilters.js`: JS library to use Bloom filters

### Examples

* `samples/`: samples for testing purposes (taken from my blog articles)

## Data storing

One of the main problems was transmitting the binary data from the Python script to the JS script. I found [an article about handling binary data in JavaScript](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data) which helped me a lot.

The data from the Python script is just the array of Bloom filter bitarrays written as a binary file (`data/search_index`), which I open with JS. The list of articles is also written as JSON in a separate file (`data/pages_index.json`).

Here's the format of the output from the Python script:

* [16 bits] : number of articles (== number of bitarrays)
* for each bitarray:
    * [16 bits] : length of the bitarray
    * […] : the bitarray itself

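The layout above can be sketched with Python's `struct` module. This is only an illustration, not the code from `generate_index.py`: the function names are hypothetical, and the README does not state the byte order or the unit of the length field, so the sketch assumes unsigned little-endian 16-bit integers and a length counted in bytes.

```python
import struct


def pack_index(bitarrays):
    """Serialise a list of bytes objects following the layout above.

    Assumptions: "<H" (unsigned little-endian 16-bit) fields, and each
    length field counts bytes rather than bits.
    """
    out = bytearray(struct.pack("<H", len(bitarrays)))
    for bits in bitarrays:
        out += struct.pack("<H", len(bits))
        out += bits
    return bytes(out)


def unpack_index(blob):
    """Inverse of pack_index: return the list of bitarrays as bytes."""
    (count,) = struct.unpack_from("<H", blob, 0)
    offset = 2
    bitarrays = []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        bitarrays.append(blob[offset:offset + length])
        offset += length
    return bitarrays
```

With 16-bit fields, such a file can describe at most 65535 articles, and each bitarray length must also fit in 16 bits.
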
## Notes

* I got the idea while reading [this page](http://www.stavros.io/posts/bloom-filter-search-engine/?print) found on [Sebsauvage's shaarli](http://sebsauvage.net/links/). I searched a bit for code doing what I wanted and found these:

    * https://github.com/olivernn/lunr.js
    * https://github.com/reyesr/fullproof

    But I wasn't fully satisfied with the first one, and I found the second one too heavy and complicated for my purpose, so I ended up coding this.

* This code is mainly a proof of concept. As such, it is not fully optimized (actually, I just tweaked it until the resulting files and calculations could be considered "acceptable"). For those looking for more efficient solutions, here are a few things I found while looking for information on the web:

    * The stemming algorithm used may not be the most efficient one. People wanting to work with non-English languages or to optimize the overall computation of the index can easily move to a more effective algorithm. See [Wikipedia](http://en.wikipedia.org/wiki/Stemming) and [the stemming library in Python](https://pypi.python.org/pypi/stemming/1.0) which has C wrappers for better performance.

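One way to keep the stemmer easy to swap is to pass it as a plain callable to the tokenizer that feeds the index. The sketch below is hypothetical and does not reflect how `generate_index.py` is organised; `naive_stem` is a toy suffix stripper standing in for a real stemmer, not the Porter algorithm.

```python
import re


def naive_stem(word):
    """Toy suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, stem=naive_stem):
    """Lowercase the text, split it into ASCII words and stem each one."""
    return [stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())]
```

Any function taking a word and returning its stem can be passed as `stem`, so a stemmer for another language can replace the default without touching the rest of the pipeline. The same stemmer has to be applied to the query on the client side, otherwise indexed words and searched words will not match.
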
## License

TLDR; I don't give a damn to anything you can do using this code. It would just
be nice to quote where the original code comes from. All the included libraries
(pybloom and the stemming library) have their own license.

* -----------------------------------------------------------------------------
* "THE NO-ALCOHOL BEER-WARE LICENSE" (Revision 42):
* Phyks (webmaster@phyks.me) wrote this file. As long as you retain this notice
* you can do whatever you want with this stuff (and you can also do whatever
* you want with this stuff without retaining it, but that's not cool...). If we
* meet some day, and you think this stuff is worth it, you can buy me a
* <del>beer</del> soda in return.
* Phyks
* ------------------------------------------------------------------------------