
fetching paper references updated

Phyks (Lucas Verney) 3 years ago
parent
commit
3d48121e82

+ 23
- 3
blog/2016/01/fetching_references_papers.html

@@ -16,12 +16,12 @@
16 16
 
17 17
                     <h2>Catégories</h2>
18 18
                         <nav id="sidebar-tags">
19
-                            <div class="tag"><a href="//phyks.me/tags/aNimble.html">/aNimble (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Arch.html">/Arch (48)</a> </div><div class="tag"><a href="//phyks.me/tags/Autohébergement.html">/Autohébergement (48)</a> </div><div class="tag"><a href="//phyks.me/tags/CoffeeShops.html">/CoffeeShops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Development.html">/Development (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Devops.html">/Devops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Dev.html">/Dev (128)</a> </div><div class="tag"><a href="//phyks.me/tags/DIY.html">/DIY (33)</a> </div><div class="tag"><a href="//phyks.me/tags/Électronique.html">/Électronique (32)</a> </div><div class="tag"><a href="//phyks.me/tags/Game Engine.html">/Game Engine (1)</a> </div><div class="tag"><a href="//phyks.me/tags/GeoData.html">/GeoData (1)</a> </div><div class="tag"><a href="//phyks.me/tags/JavaScript.html">/JavaScript (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Known.html">/Known (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Libre.html">/Libre (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Linux.html">/Linux (96)</a> </div><div class="tag"><a href="//phyks.me/tags/Localization.html">/Localization (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Mobile.html">/Mobile (1)</a> </div><div class="tag"><a href="//phyks.me/tags/OpenAccess.html">/OpenAccess (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Phyks.html">/Phyks (3)</a> </div><div class="tag"><a href="//phyks.me/tags/RaspberryPi.html">/RaspberryPi (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Science.html">/Science (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Selfhost.html">/Selfhost (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Smartphone.html">/Smartphone (32)</a> </div><div class="tag"><a href="//phyks.me/tags/TupperVim.html">/TupperVim (1)</a> </div><div 
class="tag"><a href="//phyks.me/tags/Vim.html">/Vim (17)</a> </div><div class="tag"><a href="//phyks.me/tags/Webapp.html">/Webapp (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Web.html">/Web (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Weechat.html">/Weechat (32)</a> </div><div class="tag"><a href="//phyks.me/tags/workstation.html">/workstation (1)</a> </div>
19
+                            <div class="tag"><a href="//phyks.me/tags/aNimble.html">/aNimble (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Arch.html">/Arch (48)</a> </div><div class="tag"><a href="//phyks.me/tags/Autohébergement.html">/Autohébergement (48)</a> </div><div class="tag"><a href="//phyks.me/tags/CoffeeShops.html">/CoffeeShops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Development.html">/Development (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Devops.html">/Devops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Dev.html">/Dev (128)</a> </div><div class="tag"><a href="//phyks.me/tags/DIY.html">/DIY (33)</a> </div><div class="tag"><a href="//phyks.me/tags/Électronique.html">/Électronique (32)</a> </div><div class="tag"><a href="//phyks.me/tags/Game Engine.html">/Game Engine (1)</a> </div><div class="tag"><a href="//phyks.me/tags/GeoData.html">/GeoData (1)</a> </div><div class="tag"><a href="//phyks.me/tags/i3.html">/i3 (1)</a> </div><div class="tag"><a href="//phyks.me/tags/JavaScript.html">/JavaScript (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Known.html">/Known (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Libre.html">/Libre (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Linux.html">/Linux (96)</a> </div><div class="tag"><a href="//phyks.me/tags/Localization.html">/Localization (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Mobile.html">/Mobile (1)</a> </div><div class="tag"><a href="//phyks.me/tags/OpenAccess.html">/OpenAccess (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Phyks.html">/Phyks (3)</a> </div><div class="tag"><a href="//phyks.me/tags/RaspberryPi.html">/RaspberryPi (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Science.html">/Science (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Selfhost.html">/Selfhost (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Smartphone.html">/Smartphone (32)</a> </div><div class="tag"><a 
href="//phyks.me/tags/TupperVim.html">/TupperVim (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Vim.html">/Vim (17)</a> </div><div class="tag"><a href="//phyks.me/tags/Webapp.html">/Webapp (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Web.html">/Web (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Weechat.html">/Weechat (32)</a> </div><div class="tag"><a href="//phyks.me/tags/workstation.html">/workstation (1)</a> </div>
20 20
                         </nav>
21 21
 
22 22
                     <h2>Derniers articles</h2>
23 23
                         <ul id="sidebar-articles">
24
-                            <li><a href="//phyks.me/2016/01/fetching_references_papers.html">Comparison of tools to fetch references for scientific papers</a></li><li><a href="//phyks.me/2015/12/localizing_webapp.html">Localizing a webapp with webL10n.js</a></li><li><a href="//phyks.me/2015/12/putting_metadata_on_arxiv.html">Let's add some metadata on arXiv!</a></li><li><a href="//phyks.me/2015/10/low_cost_telepresence.html">Doing low cost telepresence (for under $200)</a></li><li><a href="//phyks.me/2015/10/working_in_paris.html">Working on the go in Paris</a></li><li><a href="//phyks.me/archives.html">Archives</a></li>
24
+                            <li><a href="//phyks.me/2016/05/i3_back_and_forth.html">Improved back and forth between workspaces</a></li><li><a href="//phyks.me/2016/01/fetching_references_papers.html">Comparison of tools to fetch references for scientific papers</a></li><li><a href="//phyks.me/2015/12/localizing_webapp.html">Localizing a webapp with webL10n.js</a></li><li><a href="//phyks.me/2015/12/putting_metadata_on_arxiv.html">Let's add some metadata on arXiv!</a></li><li><a href="//phyks.me/2015/10/low_cost_telepresence.html">Doing low cost telepresence (for under $200)</a></li><li><a href="//phyks.me/archives.html">Archives</a></li>
25 25
                         </ul>
26 26
 
27 27
                     <h2>Liens</h2>
@@ -67,7 +67,27 @@
67 67
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and take five papers for each of these journals, from <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
68 68
 <p>I downloaded some articles, to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
69 69
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16MB).</p>
70
-<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
70
+<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
71
+<p><strong>EDIT</strong>:</p>
72
+<ul>
73
+<li>
74
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
75
+of interest. However, you first need to extract the text from your PDF file. I have
76
+not yet tested it in depth.</p>
77
+</li>
78
+<li>
79
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
80
+  to confirm the results I had, that Grobid is the best one.</p>
81
+</li>
82
+<li>
83
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
84
+  small web service written in Python that lets a user upload a paper,
85
+  parses its citations and tries to assess the open-access availability of the cited
86
+  papers. It uses CERMINE as it was the easiest way to go, especially since it
87
+  offers a web API, which lets me distribute a script that simply works,
88
+  without any additional requirements.</p>
89
+</li>
90
+</ul>
71 91
 		<footer><p class="date">Le 19/01/2016 à 14:45</p>
72 92
 		<p class="tags">Tags : <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
73 93
 	</div>

+ 23
- 3
blog/2016/01/index.html

@@ -16,12 +16,12 @@
16 16
 
17 17
                     <h2>Catégories</h2>
18 18
                         <nav id="sidebar-tags">
19
-                            <div class="tag"><a href="//phyks.me/tags/aNimble.html">/aNimble (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Arch.html">/Arch (48)</a> </div><div class="tag"><a href="//phyks.me/tags/Autohébergement.html">/Autohébergement (48)</a> </div><div class="tag"><a href="//phyks.me/tags/CoffeeShops.html">/CoffeeShops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Development.html">/Development (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Devops.html">/Devops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Dev.html">/Dev (128)</a> </div><div class="tag"><a href="//phyks.me/tags/DIY.html">/DIY (33)</a> </div><div class="tag"><a href="//phyks.me/tags/Électronique.html">/Électronique (32)</a> </div><div class="tag"><a href="//phyks.me/tags/Game Engine.html">/Game Engine (1)</a> </div><div class="tag"><a href="//phyks.me/tags/GeoData.html">/GeoData (1)</a> </div><div class="tag"><a href="//phyks.me/tags/JavaScript.html">/JavaScript (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Known.html">/Known (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Libre.html">/Libre (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Linux.html">/Linux (96)</a> </div><div class="tag"><a href="//phyks.me/tags/Localization.html">/Localization (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Mobile.html">/Mobile (1)</a> </div><div class="tag"><a href="//phyks.me/tags/OpenAccess.html">/OpenAccess (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Phyks.html">/Phyks (3)</a> </div><div class="tag"><a href="//phyks.me/tags/RaspberryPi.html">/RaspberryPi (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Science.html">/Science (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Selfhost.html">/Selfhost (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Smartphone.html">/Smartphone (32)</a> </div><div class="tag"><a href="//phyks.me/tags/TupperVim.html">/TupperVim (1)</a> </div><div 
class="tag"><a href="//phyks.me/tags/Vim.html">/Vim (17)</a> </div><div class="tag"><a href="//phyks.me/tags/Webapp.html">/Webapp (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Web.html">/Web (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Weechat.html">/Weechat (32)</a> </div><div class="tag"><a href="//phyks.me/tags/workstation.html">/workstation (1)</a> </div>
19
+                            <div class="tag"><a href="//phyks.me/tags/aNimble.html">/aNimble (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Arch.html">/Arch (48)</a> </div><div class="tag"><a href="//phyks.me/tags/Autohébergement.html">/Autohébergement (48)</a> </div><div class="tag"><a href="//phyks.me/tags/CoffeeShops.html">/CoffeeShops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Development.html">/Development (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Devops.html">/Devops (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Dev.html">/Dev (128)</a> </div><div class="tag"><a href="//phyks.me/tags/DIY.html">/DIY (33)</a> </div><div class="tag"><a href="//phyks.me/tags/Électronique.html">/Électronique (32)</a> </div><div class="tag"><a href="//phyks.me/tags/Game Engine.html">/Game Engine (1)</a> </div><div class="tag"><a href="//phyks.me/tags/GeoData.html">/GeoData (1)</a> </div><div class="tag"><a href="//phyks.me/tags/i3.html">/i3 (1)</a> </div><div class="tag"><a href="//phyks.me/tags/JavaScript.html">/JavaScript (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Known.html">/Known (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Libre.html">/Libre (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Linux.html">/Linux (96)</a> </div><div class="tag"><a href="//phyks.me/tags/Localization.html">/Localization (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Mobile.html">/Mobile (1)</a> </div><div class="tag"><a href="//phyks.me/tags/OpenAccess.html">/OpenAccess (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Phyks.html">/Phyks (3)</a> </div><div class="tag"><a href="//phyks.me/tags/RaspberryPi.html">/RaspberryPi (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Science.html">/Science (2)</a> </div><div class="tag"><a href="//phyks.me/tags/Selfhost.html">/Selfhost (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Smartphone.html">/Smartphone (32)</a> </div><div class="tag"><a 
href="//phyks.me/tags/TupperVim.html">/TupperVim (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Vim.html">/Vim (17)</a> </div><div class="tag"><a href="//phyks.me/tags/Webapp.html">/Webapp (1)</a> </div><div class="tag"><a href="//phyks.me/tags/Web.html">/Web (112)</a> </div><div class="tag"><a href="//phyks.me/tags/Weechat.html">/Weechat (32)</a> </div><div class="tag"><a href="//phyks.me/tags/workstation.html">/workstation (1)</a> </div>
20 20
                         </nav>
21 21
 
22 22
                     <h2>Derniers articles</h2>
23 23
                         <ul id="sidebar-articles">
24
-                            <li><a href="//phyks.me/2016/01/fetching_references_papers.html">Comparison of tools to fetch references for scientific papers</a></li><li><a href="//phyks.me/2015/12/localizing_webapp.html">Localizing a webapp with webL10n.js</a></li><li><a href="//phyks.me/2015/12/putting_metadata_on_arxiv.html">Let's add some metadata on arXiv!</a></li><li><a href="//phyks.me/2015/10/low_cost_telepresence.html">Doing low cost telepresence (for under $200)</a></li><li><a href="//phyks.me/2015/10/working_in_paris.html">Working on the go in Paris</a></li><li><a href="//phyks.me/archives.html">Archives</a></li>
24
+                            <li><a href="//phyks.me/2016/05/i3_back_and_forth.html">Improved back and forth between workspaces</a></li><li><a href="//phyks.me/2016/01/fetching_references_papers.html">Comparison of tools to fetch references for scientific papers</a></li><li><a href="//phyks.me/2015/12/localizing_webapp.html">Localizing a webapp with webL10n.js</a></li><li><a href="//phyks.me/2015/12/putting_metadata_on_arxiv.html">Let's add some metadata on arXiv!</a></li><li><a href="//phyks.me/2015/10/low_cost_telepresence.html">Doing low cost telepresence (for under $200)</a></li><li><a href="//phyks.me/archives.html">Archives</a></li>
25 25
                         </ul>
26 26
 
27 27
                     <h2>Liens</h2>
@@ -67,7 +67,27 @@
67 67
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and take five papers for each of these journals, from <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
68 68
 <p>I downloaded some articles, to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
69 69
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16MB).</p>
70
-<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
70
+<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
71
+<p><strong>EDIT</strong>:</p>
72
+<ul>
73
+<li>
74
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
75
+of interest. However, you first need to extract the text from your PDF file. I have
76
+not yet tested it in depth.</p>
77
+</li>
78
+<li>
79
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
80
+  to confirm the results I had, that Grobid is the best one.</p>
81
+</li>
82
+<li>
83
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
84
+  small web service written in Python that lets a user upload a paper,
85
+  parses its citations and tries to assess the open-access availability of the cited
86
+  papers. It uses CERMINE as it was the easiest way to go, especially since it
87
+  offers a web API, which lets me distribute a script that simply works,
88
+  without any additional requirements.</p>
89
+</li>
90
+</ul>
71 91
 		<footer><p class="date">Le 19/01/2016 à 14:45</p>
72 92
 		<p class="tags">Tags : <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
73 93
 	</div>

+ 39
- 35
blog/2016/index.html

@@ -45,47 +45,51 @@
45 45
                 <div id="articles">
46 46
 <article>
47 47
 	<aside>
48
-		<p class="day">29</p>
49
-		<p class="month">Mai</p>
48
+		<p class="day">19</p>
49
+		<p class="month">Janvier</p>
50 50
 	</aside>
51 51
 	<div class="article">
52
-		<header><h1 class="article_title"><a href="//phyks.me/2016/05/i3_back_and_forth.html">Improved back and forth between workspaces</a></h1></header>
52
+		<header><h1 class="article_title"><a href="//phyks.me/2016/01/fetching_references_papers.html">Comparison of tools to fetch references for scientific papers</a></h1></header>
53 53
 		<!-- 
54 54
     @author=Phyks
55
-    @date=29052016-1934
56
-    @title=Improved back and forth between workspaces
57
-    @tags=i3
55
+    @date=19012016-1445
56
+    @title=Comparison of tools to fetch references for scientific papers
57
+    @tags=Science,OpenAccess
58 58
 -->
59 59
 
60
-<p>i3 has <a href="https://i3wm.org/docs/userguide.html#_automatic_back_and_forth_when_switching_to_the_current_workspace">a 
61
-feature</a> 
62
-to enable going back and forth between workspaces. Once enabled, if you are on 
63
-workspace 1 and switch to workspace 2 and then just press <code>mod+2</code> again to 
64
-switch to workspace 2, you will go back to workspace 1.</p>
65
-<p>However, this feature is quite limited as it does not remember more than one 
66
-previous workspace. For example, say you are on workspace 1, switch to 
67
-workspace 2 and then to workspace 3. Then, typing <code>mod+3</code> will send you back 
68
-to workspace 2 as expected. But then, typing <code>mod+2</code> will send you back to 
69
-workspace 3 whereas one may have expected it to switch to workspace 1 (as does 
70
-Weechat with buffers switch for instance).</p>
71
-<p>This can be solved by wrapping around the workspace switching in the i3 
72
-config. I wrote <a href="https://gist.github.com/Phyks/4fbc2572dcc5eed96caa">this small 
73
-script</a> to handle it.</p>
74
-<p>Basically, you have to start the script when you start i3 by putting</p>
75
-<p><code>exec_always --no-startup-id "python PATH_TO_/workspace_back_and_forth_enhanced.py"</code></p>
76
-<p>in your <code>.i3/config</code> file.</p>
77
-<p>Then, you can replace your <code>bindsym</code> commands to switch workspaces, calling 
78
-the same script:</p>
79
-<p><code>bindsym $mod+agrave exec "echo 10 | socat - 
80
-UNIX-CONNECT:$XDG_RUNTIME_DIR/i3/i3-back-and-forth-enhanced.sock"</code>
81
-(Replace <code>$XDG_RUNTIME_DIR</code> by <code>/tmp</code> if this environment variable is not
82
-defined on your system.)</p>
83
-<p>This script does maintain a queue of 20 previously seen workspaces (so you can 
84
-go back 20 workspaces ago in your history). This can be increased by editing 
85
-the <code>WORKSPACES_STACK = deque(maxlen=20)</code> line according to your needs.</p>
86
-<p>Hope this helps!&nbsp;:) </p>
87
-		<footer><p class="date">Le 29/05/2016 à 19:34</p>
88
-		<p class="tags">Tags : <a href="//phyks.me/tags/i3.html">i3</a></p></footer>
60
+<p>Recently, I tried to gather in <a href="https://github.com/Phyks/libbmc/">a single place</a> the various pieces of code I had written to handle scientific papers. One feature I was missing, and would like to add, was the ability to automatically fetch the references from a given paper. For arXiv papers, I had a <a href="http://known.phyks.me/2015/lets-some-metadata-on-arxiv">simple solution</a> using the LaTeX sources, but I wanted something more universal, taking a plain PDF file as input (thanks <a href="https://www.linkedin.com/in/john-dove-a8825">John</a> for the suggestion, and <a href="http://www.alstevens.org/">Al</a> for the tips on existing software solutions).</p>
61
+<p>I compared three existing tools for extracting references from a PDF file:</p>
62
+<ul>
63
+<li><a href="https://github.com/CrossRef/pdfextract">pdfextract</a> from Crossref, very easy to use, written in Ruby.</li>
64
+<li><a href="https://github.com/kermitt2/grobid">Grobid</a>, more advanced (using machine learning models), written in Java, but quite easy to use too.</li>
65
+<li><a href="https://github.com/CeON/CERMINE">Cermine</a>, using the same approach as Grobid, but I could not get it to build on my computer. I used their <a href="http://cermine.ceon.pl/index.html">REST service</a> instead.</li>
66
+</ul>
67
+<p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and take five papers for each of these journals, from <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
68
+<p>I downloaded some articles, to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
69
+<p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16MB).</p>
70
+<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
71
+<p><strong>EDIT</strong>:</p>
72
+<ul>
73
+<li>
74
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
75
+of interest. However, you first need to extract the text from your PDF file. I have
76
+not yet tested it in depth.</p>
77
+</li>
78
+<li>
79
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
80
+  to confirm the results I had, that Grobid is the best one.</p>
81
+</li>
82
+<li>
83
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
84
+  small web service written in Python that lets a user upload a paper,
85
+  parses its citations and tries to assess the open-access availability of the cited
86
+  papers. It uses CERMINE as it was the easiest way to go, especially since it
87
+  offers a web API, which lets me distribute a script that simply works,
88
+  without any additional requirements.</p>
89
+</li>
90
+</ul>
91
+		<footer><p class="date">Le 19/01/2016 à 14:45</p>
92
+		<p class="tags">Tags : <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
89 93
 	</div>
90 94
 </article>
91 95
             </div>

+ 21
- 1
blog/index.html

@@ -112,7 +112,27 @@ the <code>WORKSPACES_STACK = deque(maxlen=20)</code> line according to your need
112 112
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and take five papers for each of these journals, from <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
113 113
 <p>I downloaded some articles, to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
114 114
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16MB).</p>
115
-<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
115
+<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
116
+<p><strong>EDIT</strong>:</p>
117
+<ul>
118
+<li>
119
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
120
+of interest. However, you first need to extract the text from your PDF file. I have
121
+not yet tested it in depth.</p>
122
+</li>
123
+<li>
124
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
125
+  to confirm the results I had, that Grobid is the best one.</p>
126
+</li>
127
+<li>
128
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
129
+  small web service written in Python that lets a user upload a paper,
130
+  parses its citations and tries to assess the open-access availability of the cited
131
+  papers. It uses CERMINE as it was the easiest way to go, especially since it
132
+  offers a web API, which lets me distribute a script that simply works,
133
+  without any additional requirements.</p>
134
+</li>
135
+</ul>
116 136
 		<footer><p class="date">Le 19/01/2016 à 14:45</p>
117 137
 		<p class="tags">Tags : <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
118 138
 	</div>

+ 22
- 2
blog/rss.xml

@@ -7,7 +7,7 @@
7 7
 		<language>fr</language>
8 8
 		<copyright>CC BY</copyright>
9 9
 		<webMaster>webmaster@phyks.me (Phyks)</webMaster>
10
-		<lastBuildDate>Sun, 29 May 2016 16:34:20 -0000</lastBuildDate>
10
+		<lastBuildDate>Sun, 29 May 2016 16:42:28 -0000</lastBuildDate>
11 11
 		<item>
12 12
 			<title>Improved back and forth between workspaces</title>
13 13
 			<link>http://phyks.me/2016/05/i3_back_and_forth.html</link>
@@ -90,7 +90,27 @@ Recently, I tried to aggregate in a single place various codes I had written to
90 90
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and take five papers for each of these journals, from <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
91 91
 <p>I downloaded some articles, to get a (hopefully) representative set, composed of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
92 92
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16MB).</p>
93
-<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
93
+<p>Briefly comparing the results, the machine learning based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine gives a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
94
+<p><strong>EDIT</strong>:</p>
95
+<ul>
96
+<li>
97
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a> which may be
98
+of interest. Though, you first need to extract text from your PDF file. I did 
99
+not yet test it more in depth.</p>
100
+</li>
101
+<li>
102
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
103
+  to confirm the results I had, that Grobid is the best one.</p>
104
+</li>
105
+<li>
106
+<p>If it can be useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
107
+  small web service written in Python to allow a user to upload a paper and
108
+  parse citations and try to assess open-access availability of the cited
109
+  papers. It uses CERMINE as it was the easiest way to go, especially since it
110
+  offers a web API, which allows me to distribute a simply working script,
111
+  without any additional requirements.</p>
112
+</li>
113
+</ul>
94 114
 <footer>
95 115
 <p class="tags">Tags : <a href="http://phyks.me/tags/Science.html">Science</a>, <a href="http://phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
96 116
 </div>]]></content:encoded>

blog/tags/OpenAccess.html (+21, -1)

@@ -67,7 +67,27 @@
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and pick five papers from each of them, using <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
 <p>I downloaded some of these articles to get a (hopefully) representative set of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single-page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16 MB).</p>
-<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
+<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
+<p><strong>EDIT</strong>:</p>
+<ul>
+<li>
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
+of interest. However, you first need to extract the text from your PDF file. I have
+not tested it in depth yet.</p>
+</li>
+<li>
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
+  to confirm my results, namely that Grobid is the best one.</p>
+</li>
+<li>
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
+  small web service written in Python that lets a user upload a paper,
+  parses its citations and tries to assess the open-access availability of the cited
+  papers. It uses CERMINE, as it was the easiest way to go, especially since it
+  offers a web API, which lets me distribute a script that simply works,
+  without any additional requirements.</p>
+</li>
+</ul>
 		<footer><p class="date">On 19/01/2016 at 14:45</p>
 		<p class="tags">Tags: <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
 	</div>

blog/tags/Science.html (+21, -1)

@@ -67,7 +67,27 @@
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and pick five papers from each of them, using <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
 <p>I downloaded some of these articles to get a (hopefully) representative set of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single-page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16 MB).</p>
-<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
+<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
+<p><strong>EDIT</strong>:</p>
+<ul>
+<li>
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
+of interest. However, you first need to extract the text from your PDF file. I have
+not tested it in depth yet.</p>
+</li>
+<li>
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
+  to confirm my results, namely that Grobid is the best one.</p>
+</li>
+<li>
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
+  small web service written in Python that lets a user upload a paper,
+  parses its citations and tries to assess the open-access availability of the cited
+  papers. It uses CERMINE, as it was the easiest way to go, especially since it
+  offers a web API, which lets me distribute a script that simply works,
+  without any additional requirements.</p>
+</li>
+</ul>
 		<footer><p class="date">On 19/01/2016 at 14:45</p>
 		<p class="tags">Tags: <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
 	</div>

gen/2016/01/fetching_references_papers.gen (+21, -1)

@@ -22,7 +22,27 @@
 <p>To compare them, I asked <a href="http://antonin.delpeuch.eu/">Antonin</a> to build a list of the most important journals and pick five papers from each of them, using <a href="http://dissem.in/">Dissemin</a>. This gives us a <a href="http://pub.phyks.me/paper_references_extractor/papers.json">JSON file</a> containing around 500 papers.</p>
 <p>I downloaded some of these articles to get a (hopefully) representative set of 147 different papers from various journals (I did not have access to some of them, so I could not fetch the full dataset). I ran <code>pdfextract</code>, <code>Grobid</code> and <code>Cermine</code> on each of them and compared the results.</p>
 <p>The raw results are available <a href="http://pub.phyks.me/paper_references_extractor/">here</a> for each paper, and I generated a single-page comparison to ease the visual diff between the three results, available <a href="http://pub.phyks.me/paper_references_extractor/diff.html">here</a> (note that this webpage is <strong>very</strong> heavy, around 16 MB).</p>
-<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself. </p>
+<p>Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.</p>
+<p><strong>EDIT</strong>:</p>
+<ul>
+<li>
+<p>I also found <a href="https://github.com/knmnyn/ParsCit">ParsCit</a>, which may be
+of interest. However, you first need to extract the text from your PDF file. I have
+not tested it in depth yet.</p>
+</li>
+<li>
+<p><a href="https://twitter.com/_krisjack/status/736490898192764928">This tweet</a> tends
+  to confirm my results, namely that Grobid is the best one.</p>
+</li>
+<li>
+<p>In case it is useful, <a href="https://github.com/Phyks/CitationExtractor">here</a> is a
+  small web service written in Python that lets a user upload a paper,
+  parses its citations and tries to assess the open-access availability of the cited
+  papers. It uses CERMINE, as it was the easiest way to go, especially since it
+  offers a web API, which lets me distribute a script that simply works,
+  without any additional requirements.</p>
+</li>
+</ul>
 		<footer><p class="date">On 19/01/2016 at 14:45</p>
 		<p class="tags">Tags: <a href="//phyks.me/tags/Science.html">Science</a>, <a href="//phyks.me/tags/OpenAccess.html">OpenAccess</a></p></footer>
 	</div>

raw/2016/01/fetching_references_papers.md (+16, -0)

@@ -23,3 +23,19 @@ The raw results are available [here](http://pub.phyks.me/paper_references_extrac
 
 
 Briefly comparing the results, the machine-learning-based models (Cermine and Grobid) seem to give far better results than the simple approach taken by pdfextract, at the expense of being more difficult to build and run. Cermine outputs a lot of information, too much in my opinion, and I think Grobid gives the most reusable and complete results. Feel free to compare them yourself.
+
+**EDIT**:
+
+* I also found [ParsCit](https://github.com/knmnyn/ParsCit), which may be
+  of interest. However, you first need to extract the text from your PDF file. I have
+  not tested it in depth yet.
+
+* [This tweet](https://twitter.com/_krisjack/status/736490898192764928) tends
+  to confirm my results, namely that Grobid is the best one.
+
+* In case it is useful, [here](https://github.com/Phyks/CitationExtractor) is a
+  small web service written in Python that lets a user upload a paper,
+  parses its citations and tries to assess the open-access availability of the cited
+  papers. It uses CERMINE, as it was the easiest way to go, especially since it
+  offers a web API, which lets me distribute a script that simply works,
+  without any additional requirements.