
BiblioManager
=============

BiblioManager is a simple script to download and store your articles. Read on if you want more info :)

**Note :** This script is currently a work in progress.

## What is BiblioManager (and what is it **not**)?

I used to have a folder full of poorly named papers and books, and wanted something to help me manage it. I don't like Mendeley, Zotero and the like, which are heavy and overkill for my needs. I just want to feed a script with PDF files of papers and books, or URLs to PDF files, and have it automatically maintain a BibTeX index of these files, to help me cite them and find them again. Then I want it to give me an easy way to retrieve a file, by author, by title or with some other search method, along with the associated BibTeX entry.

This is the goal of BiblioManager. This script can:
* Download or import PDF/DjVu files
* Try to fetch the metadata of the files automatically (keywords, author, journal, …)
* Store all the metadata in a BibTeX file
* Rename your files to store them in a logical and consistent way, according to a user-defined mask
* Help you find them again
* Give you the BibTeX entry needed to cite them
* Remove some of the watermarks included in those files (for instance, the front page with your IP address on IOP papers)

BiblioManager will always use standard formats such as BibTeX, so that you can easily edit your library, export it and manage it by hand, even if you stop using this software for any reason.


## Current status

* Able to import a PDF / DjVu file, automagically find the DOI / ISBN, fetch the BibTeX entry and add it to the library. If the DOI / ISBN search fails, it will prompt you for it.
* Able to download a file from a URL, using any specified proxy (you can list several and it will try all of them), and store the PDF file with its metadata.

It should be mostly working and usable now, but should still be considered **experimental**.

**Important note:** I use it for personal use, but I don't read articles from many journals. If you find a file that does not work, please file an issue or send me an e-mail with the relevant information. There are alternative ways to get the metadata, for example, and I didn't really know which one was best when writing this code.


## Installation

* Clone this git repository wherever you want: `git clone https://github.com/Phyks/BMC`
* Install `requesocks`, `PyPDF2` and `isbntools` _via_ PyPI
* Install `pdftotext` (provided by Xpdf) and `djvulibre` _via_ your package manager
* Copy `params.py.example` to `params.py` and customize it to fit your needs

## Usage

### To import an existing PDF / Djvu file

Run `./main.py import PATH_TO_FILE [article|book]`. `[article|book]` is an optional argument (either `article` or `book`) that restricts the search to DOI only or ISBN only, and thus speeds up the import.

The script will automatically fetch the BibTeX entry corresponding to the document, and you will be prompted for confirmation. It will then copy the file to your papers dir, renaming it according to the mask specified in `params.py`.
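As a rough illustration of the identifier search, here is a minimal sketch of locating a DOI in text already extracted from a document (for instance by `pdftotext`). The pattern and the `find_doi` helper are assumptions made for this example, not necessarily what BiblioManager itself uses.

```python
import re

# Hypothetical pattern: "10.", a 4-9 digit registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-looking string found in text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI at the end of a sentence.
    return match.group(0).rstrip(".,;)")
```

When such a search returns nothing, the script falls back to prompting you for the identifier, as described above.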

### To download a PDF / Djvu file

Run `./main.py download URL_TO_PDF [article|book]`, where `[article|book]` (either `article` or `book`) is again an optional argument that restricts the search to DOI only or ISBN only, and thus speeds up the import. The `URL_TO_PDF` parameter should be a direct link to the PDF file: the link to the PDF itself, which may sit behind an authentication portal, and not the abstract page that many publishers' websites show first.

The script will try to download the file with the proxies specified in `params.py` until it manages to get the file, or runs out of available proxies.
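The fallback logic can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `urllib` rather than `requesocks` (so it handles HTTP(S) proxies but not SOCKS), and the function names `fetch_with_urllib` and `download_with_proxies` are hypothetical.

```python
import urllib.request


def fetch_with_urllib(url, proxy, timeout=30):
    """Fetch url through an HTTP(S) proxy; an empty proxy means a direct connection."""
    proxy_map = {"http": proxy, "https": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy_map))
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def download_with_proxies(url, proxies, fetch=fetch_with_urllib):
    """Try each proxy in order and return (data, proxy) for the first success."""
    for proxy in proxies:
        try:
            return fetch(url, proxy), proxy
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            continue
    # Every proxy failed.
    return None, None
```

Passing the fetch function as a parameter keeps the retry loop independent of the HTTP library, so a SOCKS-capable client could be swapped in without touching it.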

The script will automatically fetch the BibTeX entry corresponding to the document, and you will be prompted for confirmation. It will then put the file in your papers dir, renaming it according to the mask specified in `params.py`.
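To give an idea of what mask-based renaming involves, here is a minimal sketch. The `{first_author}-{year}-{title}` placeholder syntax and the `filename_from_mask` helper are assumptions for illustration only; check `params.py.example` for the mask format the script actually expects.

```python
import re


def filename_from_mask(mask, entry, extension=".pdf"):
    """Fill a mask with fields from a BibTeX-like dict and make it filesystem-safe."""
    # Hypothetical mask syntax: Python str.format placeholders.
    name = mask.format(**entry)
    # Replace runs of characters that are awkward in file names with "_".
    name = re.sub(r"[^\w.-]+", "_", name).strip("_")
    return name + extension
```

For example, `filename_from_mask("{first_author}-{year}-{title}", {"first_author": "Doe", "year": "2014", "title": "A study of things"})` returns `Doe-2014-A_study_of_things.pdf`.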

### Delete an entry

Run `./main.py delete PARAM`, where `PARAM` is either a path to a paper file or an identifier in the BibTeX index. This will remove the corresponding entry from the BibTeX index and delete the file from your papers dir. Although it will prompt you for confirmation, there is no way to recover your file after deletion, so use it with care.

### Search for an entry

TODO

### List all entries

TODO


### Edit entries

TODO

### Data storage

All your documents will be stored in the papers dir specified in `params.py`. All the BibTeX entries will be added to the `index.bib` file. You should **not** add entries to this file by hand (although you can edit existing entries without any problem), as this will break the synchronization between the documents in the papers dir and the index. If you do so anyway, you can resync the index file with `./main.py resync`.

The resync option will check that every BibTeX entry has a corresponding file and that every file has a corresponding BibTeX entry. It will ask you what to do with unmatched entries.
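The consistency check amounts to a two-way set difference. Below is a minimal sketch, assuming that each index entry records the name of its file and that `index.bib` lives inside the papers dir; both points, and the `find_unmatched` helper itself, are assumptions for this example rather than a description of the actual code.

```python
import os


def find_unmatched(indexed_files, papers_dir):
    """Return (entries whose file is missing, files with no index entry)."""
    on_disk = {
        name
        for name in os.listdir(papers_dir)
        # Assumption: the index itself is stored in the papers dir.
        if name != "index.bib" and os.path.isfile(os.path.join(papers_dir, name))
    }
    indexed = {os.path.basename(path) for path in indexed_files}
    return sorted(indexed - on_disk), sorted(on_disk - indexed)
```

Each of the two returned lists corresponds to one kind of prompt: entries to drop or re-link, and files to import or delete.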

## License

All the source code I wrote is under a `no-alcohol beer-ware license`. All functions that I didn't write myself remain under their original license, and their origin is specified in the function itself.
```
* --------------------------------------------------------------------------------
* "THE NO-ALCOHOL BEER-WARE LICENSE" (Revision 42):
* Phyks (webmaster@phyks.me) wrote this file. As long as you retain this notice you
* can do whatever you want with this stuff (and you can also do whatever you want
* with this stuff without retaining it, but that's not cool...). If we meet some 
* day, and you think this stuff is worth it, you can buy me a <del>beer</del> soda 
* in return.
*																		Phyks
* ---------------------------------------------------------------------------------
```

I used the `tearpages.py` script from sciunto, which can be found [here](https://github.com/sciunto/tear-pages) and is released under a GNU GPLv3 license.

## Inspiration

Here are some sources of inspiration for this project:

* MPC
* http://en.dogeno.us/2010/02/release-a-python-script-for-organizing-scientific-papers-pyrenamepdf-py/
* [Bibsoup](http://openbiblio.net/2012/02/09/bibsoup-beta-released/)
* [Paperbot](https://github.com/kanzure/paperbot)

## Ideas, TODO

A list of ideas and TODOs. Don't hesitate to give feedback on the ones you really want, or to propose your own.

10. Refactor
    11. Use bibtex-parser lib to write bibtex, instead of parsed2BibTex
20. No DOI for arXiv / HAL
30. Parameter to disable remote search
40. Open file
45. Doc / Man
50. Webserver interface
60. Categories
70. Edit an entry instead of deleting it and adding it again

## Issues ?

* Remove the watermarks from PDF files: done. Okular shows some warnings on the generated PDF, but the output seems fine; this looks like a bug in Okular.