The git repo behind my blog.

proof-of-concept-bloomysearch-a-javascript-client-side-search-engine-for-static-websites.html 14KB

  1. <!DOCTYPE html>
  2. <html lang="en">
  3. <head>
  4. <meta charset="utf-8" />
  5. <meta http-equiv="X-UA-Compatible" content="IE=edge" />
  6. <meta name="HandheldFriendly" content="True" />
  7. <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8. <meta name="robots" content="index, follow" />
  9. <link href='https://phyks.me/theme/stylesheet/fonts.css' rel='stylesheet' type='text/css'>
  10. <link rel="stylesheet" type="text/css" href="https://phyks.me/theme/stylesheet/style.min.css">
  11. <link rel="stylesheet" type="text/css" href="https://phyks.me/theme/pygments/monokai.min.css">
  12. <link rel="stylesheet" type="text/css" href="https://phyks.me/theme/font-awesome/css/font-awesome.min.css">
  13. <link href="https://phyks.me/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Phyks' blog Atom">
  14. <link rel="shortcut icon" href="/images/favicon.ico" type="image/x-icon">
  15. <link rel="icon" href="/images/favicon.ico" type="image/x-icon">
  16. <!-- Chrome, Firefox OS and Opera -->
  17. <meta name="theme-color" content="#333">
  18. <!-- Windows Phone -->
  19. <meta name="msapplication-navbutton-color" content="#333">
  20. <!-- iOS Safari -->
  21. <meta name="apple-mobile-web-app-capable" content="yes">
  22. <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
  23. <!-- Microsoft EDGE -->
  24. <meta name="msapplication-TileColor" content="#333">
  25. <meta name="author" content="Phyks" />
  26. <meta name="description" content="Overview Many websites and blogs are statically generated and the webserver only serves static files. It is the case of many doc websites and more and more blogs, starting from this one, as Jekyll / Pelican develops. This is really useful to reduce the complexity of the website and the load …" />
  27. <meta name="keywords" content="">
  28. <meta property="og:site_name" content="Phyks' blog"/>
  29. <meta property="og:title" content="Proof-of-concept: BloomySearch, a (JavaScript) client-side search engine for static websites"/>
  30. <meta property="og:description" content="Overview Many websites and blogs are statically generated and the webserver only serves static files. It is the case of many doc websites and more and more blogs, starting from this one, as Jekyll / Pelican develops. This is really useful to reduce the complexity of the website and the load …"/>
  31. <meta property="og:locale" content="en_US.UTF-8"/>
  32. <meta property="og:url" content="https://phyks.me/2014/11/proof-of-concept-bloomysearch-a-javascript-client-side-search-engine-for-static-websites.html"/>
  33. <meta property="og:type" content="article"/>
  34. <meta property="article:published_time" content="2014-11-08 18:45:00+01:00"/>
  35. <meta property="article:modified_time" content=""/>
  36. <meta property="article:author" content="https://phyks.me/author/phyks.html">
  37. <meta property="article:section" content="Dev"/>
  38. <meta property="og:image" content="/images/profile.png">
  39. <title>Phyks' blog &ndash; Proof-of-concept: BloomySearch, a (JavaScript) client-side search engine for static websites</title>
  40. </head>
  41. <body>
  42. <aside>
  43. <div>
  44. <a href="https://phyks.me">
  45. <img src="/images/profile.png" alt="Phyks" title="Phyks">
  46. </a>
  47. <h1><a href="https://phyks.me">Phyks</a></h1>
  48. <p>I write about dev, FOSS, DIY and more, in French and English.</p>
  49. <ul class="social">
  50. <li><a class="sc-rss" href="feeds/all.atom.xml" target="_blank"><i class="fa fa-rss"></i></a></li>
  51. <li><a class="sc-envelope-o" href="mailto:phyks+blog@phyks.me" target="_blank"><i class="fa fa-envelope-o"></i></a></li>
  52. <li><a class="sc-github" href="http://github.com/phyks/" target="_blank"><i class="fa fa-github"></i></a></li>
  53. <li><a class="sc-gitlab" href="https://git.phyks.me/phyks" target="_blank"><i class="fa fa-gitlab"></i></a></li>
  54. </ul>
  55. </div>
  56. </aside>
  57. <main>
  58. <nav>
  59. <a href="https://phyks.me">Home</a>
  60. <a href="https://links.phyks.me">Bookmarks</a>
  61. <a href="/pages/hosted-tools.html">Tools</a>
  62. <a href="/archives.html">Archives</a>
  63. <a href="/pages/memo-autohebergement.html">Autohébergement</a>
  64. <a href="https://phyks.me/feeds/all.atom.xml">Atom</a>
  65. </nav>
  66. <article class="single">
  67. <header>
  68. <h1 id="proof-of-concept-bloomysearch-a-javascript-client-side-search-engine-for-static-websites">Proof-of-concept: BloomySearch, a (JavaScript) client-side search engine for static&nbsp;websites</h1>
  69. <p>
  70. Posted on November 08, 2014 in <a href="https://phyks.me/category/dev.html">Dev</a>
  71. &#8226; 5 min read
  72. </p>
  73. </header>
  74. <div>
  75. <h2>Overview</h2>
  76. <p>Many websites and blogs are statically generated, and the webserver only serves static files. This is the case for many documentation websites and a growing number of blogs, including this one, as generators such as <a href="http://jekyllrb.com/">Jekyll</a> and <a href="http://blog.getpelican.com/">Pelican</a> gain&nbsp;popularity.</p>
  77. <p>This greatly reduces the complexity of the website and the load on the webserver, since all the complex logic runs at generation&nbsp;time.</p>
  78. <p>However, this also means you do not have dynamic pages on your website to handle search queries. You are then left with two (or three)&nbsp;choices:</p>
  79. <ol>
  80. <li>Use an external search engine, such as an embedded Google search box. This raises some privacy concerns and makes you depend on an external&nbsp;service.</li>
  81. <li>(Use a <span class="caps">JS</span> search engine such as the <a href="http://www.airpair.com/angularjs#/10-filters-core-">filters</a> provided by Angular.<span class="caps">JS</span>. This only works on the displayed content, and is not a real&nbsp;solution).</li>
  82. <li>Stop worrying about a search engine on your website and let users <code>wget</code> and <code>grep</code> your website on their own computers. This is not the most user-friendly&nbsp;solution…</li>
  83. </ol>
  84. <p>There are a couple of solutions around, mostly based on <a href="http://lunrjs.com/">Lunr.js</a>, which generates an index from the available articles and uses this index for full-text search. This is the best solution I have found so far, but it is still not perfect. Although there is a stemmer and an index generation step to reduce the amount of data to be transferred, the data is not stored in a very efficient way, and the full index is sent as <span class="caps">JSON</span>. An example implementation for Jekyll is available through <a href="https://github.com/slashdotdash/jekyll-lunr-js-search">the jekyll-lunr-js-search plugin</a>.</p>
  85. <p>I had had the idea of a client-side search engine in mind for a while, but was facing the same problem as Lunr.js: how to avoid sending a full (very large) index over the network to every single client? Without an optimized data structure, this would basically mean sending the content of the website to the client twice. It may not be a practical problem nowadays, as transfer speed is not always the limiting resource, but I still do not consider it good practice, especially if your website might be accessed from mobile&nbsp;devices.</p>
  86. <p>I came across <a href="http://www.stavros.io/posts/bloom-filter-search-engine/?print">this article</a> from Stavros Korokithakis and thought something similar could be achieved directly in the browser. Instead of using a standard dictionary to store the index, this article proposes using one Bloom filter per article. Bloom filters are very interesting probabilistic data structures which record whether or not an element is in a set, using a fixed number of bits. They can return false positives but never false negatives: if an element is in the set, the filter always returns <code>True</code>, but if an element is not in the set, the filter may still claim that it is, with a small probability. The <a href="https://en.wikipedia.org/wiki/Bloom_filter">Wikipedia page</a> on the subject covers everything needed to understand these data&nbsp;structures.</p>
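<p>To make the behaviour described above concrete, here is a minimal Bloom filter sketch in Python. It is only an illustration: the class name <code>TinyBloomFilter</code>, the use of SHA-256 and the big-integer bit array are assumptions made for this example, not what BloomySearch or bloomfilter.js actually do.</p>

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical, not the BloomySearch layout)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a fixed-size bit array

    def _positions(self, word):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        # All positions set: the word was probably added (false positives
        # are possible). Any position unset: the word was definitely not added.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


bloom = TinyBloomFilter(n_bits=256, n_hashes=3)
for word in ("bloom", "filter", "search"):
    bloom.add(word)
```

<p>Every word that was added is always reported as present, whereas a word that was never added is usually, but not always, reported as absent.</p>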
  87. <p>I wrote it in the context of my blog, which means a Python script to generate the index when the pages are generated, and a client-side search engine in JavaScript, running in the&nbsp;browser.</p>
  88. <p>A demo is available <a href="https://phyks.github.io/BloomySearch/">here</a>. It contains all the articles of my blog at the time of writing, totalling 160k characters, with only 7kB of index when allowing 10% of false positives, which may be a bit too much for a really reliable search engine. Reducing the error rate increases the index size (11kB for 1% of false positives and the same amount of&nbsp;characters).</p>
  89. <h2>Details of the&nbsp;implementation</h2>
  90. <p>As JavaScript is not the easiest language to use for hashing and binary data manipulation, I started by implementing the client-side search engine. It would then be easier to adapt the Python code to the <span class="caps">JS</span> lib than the other way around. Actually, I found <a href="https://github.com/jasondavies/bloomfilter.js">this bloomfilter.js library</a> from Jason Davies, which was doing most of the job and did not need many modifications. I edited it a bit to support construction from a <code>capacity</code> and an <code>error_rate</code>, instead of an explicit number of bits and number of times to apply the hashing function. This forked version is available <a href="https://github.com/Phyks/bloomfilter.js/blob/master/bloomfilter.js">here</a>.</p>
  91. <p>Then, I reimplemented this library in Python, to generate Bloom filters that the JavaScript code can&nbsp;read.</p>
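<p>As an illustration of what constructing a filter from a <code>capacity</code> and an <code>error_rate</code> involves, the textbook sizing formulas can be sketched as follows. The function name <code>bloom_parameters</code> is hypothetical, and the forked library may round or compute these values differently.</p>

```python
import math


def bloom_parameters(capacity, error_rate):
    """Return (n_bits, n_hashes) for a Bloom filter, using the textbook formulas.

    capacity   -- expected number of distinct words stored in the filter
    error_rate -- target false positive probability, strictly between 0 and 1
    """
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    n_bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, with at least one hash function
    n_hashes = max(1, round(n_bits / capacity * math.log(2)))
    return n_bits, n_hashes
```

<p>With these formulas, lowering the error rate for a fixed capacity increases both the number of bits and the number of hash functions.</p>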
  92. <h3>Server&nbsp;side</h3>
  93. <p>The generation script takes every article in a given directory and, for each of&nbsp;them:</p>
  94. <ol>
  95. <li>It gets the set of all the words in this article, ignoring words that are too&nbsp;short.</li>
  96. <li>It applies the <a href="http://tartarus.org/martin/PorterStemmer/">Porter stemming algorithm</a> to drastically reduce the number of words to&nbsp;keep.</li>
  97. <li>It generates a Bloom filter containing all of these&nbsp;words.</li>
  98. </ol>
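<p>The first two steps above can be sketched as follows. This is a rough illustration under assumptions: the function name <code>article_words</code>, the minimum length of 3 and the regular expression are made up for the example, and the stemmer is passed in as a plain function (an identity function by default) instead of a real Porter stemmer.</p>

```python
import re


def article_words(text, min_length=3, stem=lambda word: word):
    """Return the set of stemmed words of an article, ignoring short words.

    stem is any function mapping a word to its stem; a real Porter stemmer
    would be plugged in here (the default leaves words unchanged).
    """
    # Runs of letters only: digits, underscores and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {stem(word) for word in words if len(word) >= min_length}
```

<p>The resulting set is what would then be added, word by word, to the Bloom filter of the article.</p>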
  99. <p>Finally, it concatenates all the per-article Bloom filters into a single binary file, to be sent to the client. It also generates a <span class="caps">JSON</span> index mapping the id of each Bloom filter in the binary file to the <span class="caps">URL</span> and title of the corresponding&nbsp;article.</p>
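<p>One possible way to produce such a pair of files is sketched below. The binary layout (a length prefix followed by little-endian signed 32-bit integers for each filter), the <span class="caps">JSON</span> field names and the function name <code>pack_filters</code> are all assumptions made for this example; the actual format used by BloomySearch may differ.</p>

```python
import json
import struct


def pack_filters(articles):
    """Pack Bloom filters into one binary blob plus a JSON index.

    articles -- list of (url, title, buckets) tuples, where buckets is the
                filter content as a list of signed 32-bit integers.
    """
    blob = bytearray()
    index = []
    for filter_id, (url, title, buckets) in enumerate(articles):
        # Hypothetical layout: number of buckets, then the buckets themselves.
        blob += struct.pack("<I", len(buckets))
        blob += struct.pack(f"<{len(buckets)}i", *buckets)
        index.append({"id": filter_id, "url": url, "title": title})
    return bytes(blob), json.dumps(index)
```

<p>The position of a filter in the blob matches the <code>id</code> stored in the index, which is what lets the client map a matching filter back to an article.</p>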
  100. <h3>Client&nbsp;side</h3>
  101. <p>Upon loading, the JavaScript script downloads the binary file containing the Bloom filters (see <a href="https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data">this <span class="caps">MDN</span> doc</a> for more details) and the <span class="caps">JSON</span> index, and regenerates the Bloom filters on the client&nbsp;side.</p>
  102. <p>When the client searches for something, the JavaScript script splits the query into words and iterates over the Bloom filters to look for these words. That&#8217;s it&nbsp;=)</p>
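<p>The lookup logic can be sketched in a few lines. The sketch is written in Python for consistency with the generation script, although the real client is JavaScript. It assumes that an article matches only when all the query words are found in its filter, which the description above does not specify, and the function name <code>search</code> is hypothetical. Any object supporting the <code>in</code> operator can play the role of a filter.</p>

```python
def search(query, filters, index, stem=lambda word: word):
    """Return the index entries of the articles whose filter matches the query.

    filters -- one Bloom filter (or any container supporting `in`) per article
    index   -- list of entries (URL, title...) in the same order as filters
    stem    -- same stemming function as the one used at generation time
    """
    words = [stem(word) for word in query.lower().split()]
    if not words:
        return []
    # Assumption: every word of the query must be present in the filter.
    return [index[i] for i, bloom in enumerate(filters)
            if all(word in bloom for word in words)]
```

<p>Since Bloom filters can return false positives, some of the returned articles may not actually contain the words.</p>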
  103. <h2>(Fun) facts found while reimplementing the Bloom filters library in&nbsp;Python</h2>
  104. <p>The first problem I had to deal with was the difference between JavaScript&#8217;s <code>Number</code> type and Python&#8217;s <code>int</code>. JavaScript has only one type for all numbers (integers or floats), which is <code>Number</code> (see <a href="https://stackoverflow.com/questions/307179/what-is-javascripts-highest-integer-value-that-a-number-can-go-to-without-losin">this <span class="caps">SO</span> thread</a>). These are 64-bit floating point values, which represent integers exactly only up to a magnitude of 2<sup>53</sup>. However, bitwise operations convert their operands to 32-bit integers before operating. This is something to take care of, because Python&#8217;s <code>int</code> is not limited to 32 bits: it can be 64 bits or wider (<a href="http://legacy.python.org/dev/peps/pep-0237/">http://legacy.python.org/dev/peps/pep-0237/</a>). As a result, a bitwise operation that overflows in JavaScript does not overflow the same way in&nbsp;Python.</p>
  105. <p>The solution to this problem was to use <code>ctypes.c_int</code> in Python for bitwise operations, as proposed <a href="https://stackoverflow.com/questions/1694507/difference-between-operator-in-js-and-python">here</a>.</p>
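<p>A minimal sketch of this trick is shown below. It uses <code>ctypes.c_int32</code> instead of <code>ctypes.c_int</code> to make the 32-bit width explicit (both are usually the same type), and the helper names <code>to_int32</code> and <code>js_shift_left</code> are made up for this example.</p>

```python
import ctypes


def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, as JS bitwise operators do."""
    # ctypes integer types do no overflow checking: extra high bits are dropped.
    return ctypes.c_int32(value).value


def js_shift_left(value, shift):
    """Mimic JavaScript's `value << shift` on 32-bit integers."""
    # JavaScript only uses the five low bits of the shift count.
    return to_int32(to_int32(value) << (shift & 31))
```

<p>For instance, <code>1 &lt;&lt; 31</code> is a positive number in Python but a negative one in JavaScript, and <code>js_shift_left(1, 31)</code> reproduces the JavaScript result.</p>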
  106. <p>Another problem was the difference in modulo behaviour with negative numbers between Python and JavaScript. Unlike in C, C++ and JavaScript, where the result takes the sign of the dividend, Python&#8217;s modulo operator (%) always returns a number having the same sign as the divisor (<a href="https://stackoverflow.com/questions/3883004/negative-numbers-modulo-in-python">Source</a>). We therefore have to reimplement the C behaviour in a modulo function in&nbsp;Python.</p>
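<p>Such a function can be as short as the sketch below, for integers. The name <code>c_mod</code> is hypothetical and this is not necessarily how the BloomySearch script writes it.</p>

```python
def c_mod(a, b):
    """Integer remainder with the sign of the dividend, as in C and JavaScript."""
    remainder = abs(a) % abs(b)
    return -remainder if a < 0 else remainder
```

<p>For example, <code>-7 % 3</code> gives <code>2</code> in Python, whereas <code>c_mod(-7, 3)</code> gives <code>-1</code>, like <code>-7 % 3</code> in JavaScript.</p>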
  107. <p>Finally, Python has no &#8220;shift right adding zeros&#8221; (logical right shift) operator, unlike <span class="caps">JS</span> where it is written <code>&gt;&gt;&gt;</code>, see <a href="https://stackoverflow.com/questions/5832982/how-to-get-the-logical-right-binary-shift-in-python">this <span class="caps">SO</span> thread</a>.</p>
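<p>A common workaround, sketched here for 32-bit values, is to mask the number to its unsigned 32-bit representation before shifting. The function name <code>logical_right_shift</code> is an assumption made for this example.</p>

```python
def logical_right_shift(value, shift):
    """Mimic JavaScript's `value >>> shift` for 32-bit integers."""
    # Masking maps negative numbers to their unsigned 32-bit representation,
    # so that the shift fills the high bits with zeros instead of the sign.
    return (value & 0xFFFFFFFF) >> (shift & 31)
```

<p>With Python&#8217;s own <code>&gt;&gt;</code> operator, <code>-1 &gt;&gt; 28</code> stays <code>-1</code>, whereas <code>logical_right_shift(-1, 28)</code> returns <code>15</code>.</p>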
  108. </div>
  109. <div class="tag-cloud">
  110. <p>
  111. </p>
  112. </div>
  113. </article>
  114. <footer>
  115. <p>
  116. &copy; 2017 - This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>
  117. </p>
  118. <p>Powered by <a href="http://getpelican.com" target="_blank">Pelican</a> - <a href="https://github.com/alexandrevicenzi/flex" target="_blank">Flex</a> theme by <a href="http://alexandrevicenzi.com" target="_blank">Alexandre Vicenzi</a></p><p>
  119. <a rel="license"
  120. href="http://creativecommons.org/licenses/by-nc-sa/4.0/"
  121. target="_blank">
  122. <img alt="Creative Commons License"
  123. title="Creative Commons License"
  124. style="border-width:0"
  125. src="https://phyks.me/theme/img/cc/by-nc-sa.png"
  126. width="80"
  127. height="15"/>
  128. </a>
  129. </p> </footer>
  130. </main>
  131. <script type="application/ld+json">
  132. {
  133. "@context" : "http://schema.org",
  134. "@type" : "Blog",
  135. "name": " Phyks' blog ",
  136. "url" : "https://phyks.me",
  137. "image": "/images/profile.png",
  138. "description": "I write about dev, FOSS, DIY and more, in French and English."
  139. }
  140. </script>
  141. </body>
  142. </html>