flatisfy/doc/0.getting_started.md

245 lines
11 KiB
Markdown

Getting started
===============
## Dependency on Woob
**Important**: Flatisfy relies on [Woob](https://gitlab.com/woob/woob/) to fetch
housing posts from housing websites.
If you `pip install -r requirements.txt` it will install the latest
development version of [Woob](https://gitlab.com/woob/woob/) and the
[Woob modules](https://gitlab.com/woob/modules/), which should be the
best version available out there. You should update these packages regularly,
as they evolve quickly.
Woob is made of two parts: a core and modules (which is the actual code
fetching data from websites). Modules tend to break often and are then updated
often, you should keep them up to date. This can be done by installing and
upgrading the packages listed in the `requirements.txt` and using the default
configuration.
This is a safe default configuration. However, a better option is usually to
clone [Woob git repo](https://gitlab.com/woob/woob/) somewhere, on
your disk, to point `modules_path` configuration option to
`path_to_woob_git/modules` (see the configuration section below) and to run
a `git pull; python setup.py install` in the Woob git repo often.
A copy of the Woob modules is available in the `modules` directory at the
root of this repository, you can use `"modules_path": "/path/to/flatisfy/modules"` to use them.
This copy may or may not be more up to date than the current state of official
Woob modules. Some changes are made there, which are not backported
upstream. Woob official modules are not synced in the `modules` folder on a
regular basis, so try both and see which ones match your needs! :)
## TL;DR
An alternative method is available using Docker. See [2.docker.md](2.docker.md).
1. Clone the repository.
2. Install required Python modules: `pip install -r requirements.txt`.
3. Init a configuration file: `python -m flatisfy init-config > config.json`.
Edit it according to your needs (see below).
4. Build the required data files:
`python -m flatisfy build-data --config config.json`.
5. You can now run `python -m flatisfy import --config config.json` to fetch
available flats, filter them and import everything in a SQLite database,
usable with the web visualization.
6. Install JS libraries and build the webapp:
`npm install && npm run build:dev` (use `build:prod` in production).
7. Use `python -m flatisfy serve --config config.json` to serve the web app.
_Note_: `Flatisfy` requires an up-to-date Node version. You can find
instructions on the [NodeJS website](https://nodejs.org/en/) to install latest
LTS version.
_Note_: Alternatively, you can `python -m flatisfy fetch --config config.json`
to fetch available flats, filter them and output them as a filtered JSON list
(the web visualization will not be able to display them). This is mainly
useful if you plan in integrating Flatisfy in your own pipeline.
## Available commands
The available commands are:
* `init-config` to generate an empty configuration file, either on the `stdin`
or in the specified file.
* `build-data` to rebuild OpenData datasets.
* `fetch` to load and filter housings posts and output a JSON dump.
* `filter` to filter again the flats in the database (and update their status)
according to changes in config. It can also filter a previously fetched list
of housings posts, provided as a JSON dump (with a `--input` argument).
* `import` to import and filter housing posts into the database.
* `serve` to serve the built-in webapp with the development server. Do not use
in production.
_Note:_ Fetching flats can be quite long and take up to a few minutes. This
should be better optimized. To get a verbose output and have an hint about the
progress, use the `-v` argument. It can remain stuck at "Loading flats for
constraint XXX...", which simply means it is fetching flats (using Woob
under the hood) and this step can be super long if there are lots of flats to
fetch. If this happens to you, you can set `max_entries` in your config to
limit the number of flats to fetch.
### Common arguments
You can pass some command-line arguments to Flatisfy commands, common to all the available commands. These are
* `--help`/`-h` to get some help message about the current command.
* `--data-dir DIR` to overload the `data_directory` value from config.
* `--config CONFIG` to use the config file located at `CONFIG`.
* `--passes [0, 1, 2, 3]` to overload the `passes` value from config.
* `--max-entries N` to overload the `max_entries` value from config.
* `-v` to enable verbose output.
* `-vv` to enable debug output.
* `--constraints` to specify a list of constraints to use (e.g. to restrict
import to a subset of available constraints from the config). This list
should be passed as a comma-separated list.
## Configuration
List of configuration options:
* `data_directory` is the directory in which you want data files to be stored.
`null` is the default value and means default `XDG` location (typically
`~/.local/share/flatisfy/`)
* `max_entries` is the maximum number of entries to fetch.
* `passes` is the number of passes to run on the data. First pass is a basic
filtering and using only the informations from the housings list page.
Second pass loads any possible information about the filtered flats and does
better filtering.
* `database` is an SQLAlchemy URI to a database file. Defaults to `null` which
means that it will store the database in the default location, in
`data_directory`.
* `navitia_api_key` is an API token for [Navitia](https://www.navitia.io/)
which is required to compute travel times for `PUBLIC_TRANSPORT` mode.
* `mapbox_api_key` is an API token for [Mapbox](http://mapbox.com/)
which is required to compute travel times for `WALK`, `BIKE` and `CAR`
modes.
* `modules_path` is the path to the Woob modules. It can be `null` if you
want Woob to use the locally installed [Woob
modules](https://gitlab.com/woob/modules/), which you should install
yourself. This is the default value. If it is a string, it should be an
absolute path to the folder containing Woob modules.
* `port` is the port on which the development webserver should be
listening (default to `8080`).
* `host` is the host on which the development webserver should be listening
(default to `127.0.0.1`).
* `webserver` is a server to use instead of the default Bottle built-in
webserver, see [Bottle deployment
doc](http://bottlepy.org/docs/dev/deployment.html).
* `backends` is a list of Woob backends to enable. It defaults to any
available and supported Woob backend.
* `store_personal_data` is a boolean indicated whether or not Flatisfy should
fetch personal data from housing posts and store them in database. Such
personal data include contact phone number for instance. By default,
Flatisfy does not store such personal data.
* `max_distance_housing_station` is the maximum distance (in meters) between
an housing and a public transport station found for this housing (default is
`1500`). This is useful to avoid false-positive.
* `duplicate_threshold` is the minimum score in the deep duplicate detection
step to consider two flats as being duplicates (defaults to `15`).
* `serve_images_locally` lets you download all the images from the housings
websites when importing the posts. Then, all your Flatisfy works standalone,
serving the local copy of the images instead of fetching the images from the
remote websites every time you look through the fetched housing posts.
_Note:_ In production, you can either use the `serve` command with a reliable
webserver instead of the default Bottle webserver (specifying a `webserver`
value) or use the `wsgi.py` script at the root of the repository to use WSGI.
### Constraints
You should specify some constraints to filter the resulting housings list,
under the `constraints` key. The available constraints are:
* `type` is the type of housing you want, either `RENT` (to rent), `SALE` (to
buy), `SHARING` (for a shared housing), `FURNISHED_RENT` (for a furnished
rent), `VIAGER` (for a viager, lifetime sale).
* `house_types` is a list of house types you are looking for. Values can be
`APART` (flat), `HOUSE`, `PARKING`, `LAND`, `OTHER` (everything else) or
`UNKNOWN` (anything which was not matched with one of the previous
categories).
* `area` (in m²), `bedrooms`, `cost` (in currency unit), `rooms`: this is a
tuple of `(min, max)` values, defining an interval in which the value should
lie. A `null` value means that any value is within this bound.
* `postal_codes` (as strings) is a list of postal codes. You should include any postal code
you want, and especially the postal codes close to the precise location you
want.
* `time_to` is a dictionary of places to compute travel time to them.
Typically,
```
"time_to": {
"foobar": {
"gps": [LAT, LNG],
"mode": A transport mode,
"time": [min, max]
}
}
```
means that the housings must be between the `min` and `max` bounds (possibly
`null`) from the place identified by the GPS coordinates `LAT` and `LNG`
(latitude and longitude), and we call this place `foobar` in human-readable
form. `mode` should be either `PUBLIC_TRANSPORT`, `WALK`, `BIKE` or `CAR`.
Beware that `time` constraints are in **seconds**. You should take
some margin as the travel time computation is done with found nearby public
transport stations, which is only a rough estimate of the flat position. For
`PUBLIC_TRANSPORT` the travel time is computed assuming a route the next
Monday at 8am.
* `minimum_nb_photos` lets you filter out posts with less than this number of
photos.
* `description_should_contain` lets you specify a list of terms that should
be present in the posts descriptions. Typically, if you expect "parking" to
be in all the posts Flatisfy fetches for you, you can set
`description_should_contain: ["parking"]`. You can also use list of terms
which acts as an "or" operation. For example, if you are looking for a flat
with a parking and with either a balcony or a terrace, you can use
`description_should_contain: ["parking", ["balcony", "terrace"]]`
* `description_should_not_contain` lets you specify a list of terms that should
never occur in the posts descriptions. Typically, if you wish to avoid
"coloc" in the posts Flatisfy fetches for you, you can set
`description_should_not_contain: ["coloc"]`.
You can think of constraints as "a set of criterias to filter out flats". You
can specify as many constraints as you want, in the configuration file,
provided that you name each of them uniquely.
## Building the web assets
If you want to build the web assets, you can use `npm run build:dev`
(respectively `npm run watch:dev` to build continuously and monitor changes in
source files). You can use `npm run build:prod` (`npm run watch:prod`) to do
the same in production mode (with minification etc).
## Upgrading
To update the app, you can simply `git pull` the latest version. The database
schema might change from time to time. Here is how to update it automatically:
* First, edit the `alembic.ini` file and ensure the `sqlalchemy.url` entry
points to the database URI you are actually using for Flatisfy.
* Then, run `alembic upgrade head` to run the required migrations.
## Misc
### Other tools more or less connected with Flatisfy
+ [ZipAround](https://github.com/guix77/ziparound) generates a list of ZIP codes centered on a city name, within a radius of N kilometers and within a certain travel time by car (France only). You can invoke it with:
```sh
npm ziparound
# or alternatively
npm ziparound --code 75001 --distance 3
```