BloomJS
====
A JavaScript search engine.
## Basic idea
I have a static weblog generated with [Blogit](https://github.com/phyks/blogit) (caution: this code is ugly). Since I only want to serve HTML files from my server, I needed a way to let users search my blog.
A Python script generates an index whenever the pages are generated, and the client downloads it dynamically when the user wants to search the contents.
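To make the idea concrete, here is a minimal sketch of a Bloom-filter index in Python. It assumes one filter per article and a simple salted SHA-256 hashing scheme; the names `BloomFilter`, `tokenize`, `build_index` and `search`, as well as the filter size and number of hashes, are hypothetical and are not the actual API of `pybloom.py` or `generate_index.py`.

```python
import hashlib
import re


class BloomFilter:
    """A tiny Bloom filter: a bit array plus k hash functions (illustrative only)."""

    def __init__(self, size_bits=1024, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def tokenize(text):
    # Lowercase and split on non-letters; a real index would also stem each word.
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Map each article URL to a Bloom filter holding its words."""
    index = {}
    for url, text in articles.items():
        bloom = BloomFilter()
        for word in tokenize(text):
            bloom.add(word)
        index[url] = bloom
    return index


def search(index, query):
    """Return the URLs whose filter (probably) contains every query word."""
    words = tokenize(query)
    return [url for url, bloom in index.items() if all(w in bloom for w in words)]


# Made-up sample data, for illustration only.
articles = {
    "/bloom.html": "Bloom filters are compact probabilistic sets",
    "/python.html": "Generating a static weblog with Python",
}
index = build_index(articles)
print(search(index, "bloom filters"))
```

A Bloom filter never misses a word that was added, but it can occasionally report a word that was not, so a search may return a few spurious articles. In exchange, the index stays much smaller than a full word list, which matters when the client has to download it.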
## Files
### Index generation (`index_generation/` folder)
* `generate_index.py`: Python script that generates the index in a format convenient for JavaScript (runs only at page generation)
* `pybloom.py`: Library to handle Bloom filters in Python
* `stemmer.py`: Implementation of the Porter stemming algorithm in Python, from Vivake Gupta.
### Example HTML search form
* `index.html`
* `js/bloom.js`: main JS code
* `js/bloomfilters.js`: JS library to use Bloom filters
* `js/jquery-2.0.3.min.js`: jQuery, for its convenience functions; it will most likely be dropped in the future.
### Examples
* `samples/`: samples for testing purposes (taken from my blog articles)
## Notes
* I got the idea while reading [this page](http://www.stavros.io/posts/bloom-filter-search-engine/?print), found on [Sebsauvage's shaarli](http://sebsauvage.net/links/). I searched a bit for code doing what I wanted and found these:
    * https://github.com/olivernn/lunr.js
    * https://github.com/reyesr/fullproof

    But I wasn't fully satisfied with the first one, and I found the second one too heavy and complicated for my purpose, so I ended up coding this.
* This code is mainly a proof of concept. As such, it is not fully optimized (actually, I just tweaked it until the resulting files and calculations could be considered "acceptable"). For those looking for more efficient solutions, here are a few things I found while looking for information on the web:
    * The stemming algorithm used may not be the most efficient one. People wanting to work with non-English languages or to speed up the computation of the index can easily move to a more efficient algorithm. See [Wikipedia](http://en.wikipedia.org/wiki/Stemming) and [the stemming library in Python](https://pypi.python.org/pypi/stemming/1.0), which has C wrappers for better performance.
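To illustrate why stemming matters for the index, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm and not the interface of `stemmer.py`; `naive_stem` and `SUFFIXES` are hypothetical names chosen for this example, which only shows that indexed words and query words must be reduced to the same form to match.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Words as they might appear in an article, and in a user's query.
indexed_terms = {naive_stem(w) for w in "searching filtered blogs".split()}
query_terms = [naive_stem(w) for w in "searches filter blog".split()]

# Every query term matches an indexed term once both sides are stemmed.
print(all(term in indexed_terms for term in query_terms))
```

Whatever stemmer is chosen, the same one (or an exactly equivalent one) has to be applied both when generating the index and when processing the query on the client side, otherwise the stems will not line up.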