Phyks d759e7c8ab Clean + switch to bloom filters and bitarrays
* Refactor of the repo structure, for better usability.
* README.md refactored.
* Switch to BloomFilters in python script, to decrease the index file.

TODO:
* Handle binary files in JS to pass the BloomFilters from python to JS.

Note: Current implementations of BloomFilters differ in JS and Python
lib.
2014-01-02 21:24:22 +01:00
index_generation Clean + switch to bloom filters and bitarrays 2014-01-02 21:24:22 +01:00
js Clean + switch to bloom filters and bitarrays 2014-01-02 21:24:22 +01:00
samples Initial commit 2013-12-26 17:16:12 +01:00
index.html First working, clearly not optimized, version 2013-12-28 20:42:06 +01:00
README.md Clean + switch to bloom filters and bitarrays 2014-01-02 21:24:22 +01:00

BloomJS

A JavaScript search engine.

Basic idea

I have a static weblog, generated with [Blogit](https://github.com/phyks/blogit) (caution: this code is ugly). As I only want to serve HTML files from my server, I needed a way to let users search my blog without any server-side code.

An index is generated by a Python script when the pages are generated, and is downloaded dynamically by the client when the user wants to search the content.

Files

Index generation (index_generation/ folder)

  • generate_index.py: Python script that generates the index (it runs only at page generation) in a format that is convenient for JavaScript
  • pybloom.py: Library to handle Bloom filters in Python
  • stemmer.py: Implementation of the Porter stemming algorithm in Python, by Vivake Gupta.
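
To make the roles of these files more concrete, here is a minimal sketch of the general idea: one Bloom filter per page, filled with the page's words and then queried word by word. It is not the actual code of generate_index.py or pybloom.py. The names `TinyBloomFilter`, `normalize`, `build_index` and `search` are hypothetical, the filter size and hash count are arbitrary, and `normalize` only lowercases and splits the text where the real script uses the Porter stemmer.

```python
import hashlib
import re


class TinyBloomFilter:
    """Fixed-size Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both parameters are arbitrary values chosen for this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


def normalize(text):
    # Stand-in for real stemming: lowercase and split on non-letters.
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each page name to a Bloom filter of its words."""
    index = {}
    for name, text in pages.items():
        bloom = TinyBloomFilter()
        for word in normalize(text):
            bloom.add(word)
        index[name] = bloom
    return index


def search(index, query):
    """Return the pages whose filter may contain every query word."""
    words = normalize(query)
    return sorted(name for name, bloom in index.items()
                  if all(word in bloom for word in words))


pages = {
    "bloom.html": "Bloom filters keep the search index small",
    "python.html": "A Python script generates the index",
}
index = build_index(pages)
print(search(index, "bloom filters"))
```

A Bloom filter never misses a word that was added, but it can report a word that was not, so a search may occasionally return a page that does not contain the query. That trade-off is what keeps the index file small.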

Example html search form

  • index.html
  • js/bloom.js: main JS code
  • js/bloomfilters.js: JS library to use BloomFilters
  • js/jquery-2.0.3.min.js: jQuery, used for its convenience functions; it will most likely be dropped in the future.

Examples

  • samples/: samples for testing purposes (taken from my blog articles)

Notes

  • I got the idea while reading this page found on Sebsauvage's shaarli. I searched a bit for code doing what I wanted and found these:

    But I wasn't fully satisfied by the first one, and I found the second one too heavy and complicated for my purpose, so I ended up coding this.

  • This code is mainly a proof of concept. As such, it is not fully optimized (actually, I just tweaked it until the resulting files and computation times could be considered "acceptable"). For those looking for more efficient solutions, here are a few things I found while looking for information on the web:

    • The stemming algorithm used may not be the most efficient one. People who want to work with non-English languages, or to speed up the computation of the index, can easily switch to a more suitable algorithm. See Wikipedia and the stemming library in Python, which has C wrappers for best performance.