Phyks (Lucas Verney) 977e354646 Reduce number of requests to housing websites

Keep track of the last seen date and start crawling again from there for
the next crawl, instead of crawling everything at each invocation.

Can be configured through configuration options.

2021-04-12 23:28:42 +02:00

12 KiB

Raw Blame History

Getting started

Dependency on Woob

Important: Flatisfy relies on Woob to fetch housing posts from housing websites.

If you pip install -r requirements.txt it will install the latest development version of Woob and the Woob modules, which should be the best version available out there. You should update these packages regularly, as they evolve quickly.

Woob is made of two parts: a core and modules (which is the actual code fetching data from websites). Modules tend to break often and are then updated often, you should keep them up to date. This can be done by installing and upgrading the packages listed in the requirements.txt and using the default configuration.

This is a safe default configuration. However, a better option is usually to clone Woob git repo somewhere, on your disk, to point modules_path configuration option to path_to_woob_git/modules (see the configuration section below) and to run a git pull; python setup.py install in the Woob git repo often.

A copy of the Woob modules is available in the modules directory at the root of this repository, you can use "modules_path": "/path/to/flatisfy/modules" to use them. This copy may or may not be more up to date than the current state of official Woob modules. Some changes are made there, which are not backported upstream. Woob official modules are not synced in the modules folder on a regular basis, so try both and see which ones match your needs! :)

TL;DR

An alternative method is available using Docker. See 2.docker.md.

Clone the repository.
Install required Python modules: pip install -r requirements.txt.
Init a configuration file: python -m flatisfy init-config > config.json. Edit it according to your needs (see below).
Build the required data files: python -m flatisfy build-data --config config.json.
You can now run python -m flatisfy import --config config.json to fetch available flats, filter them and import everything in a SQLite database, usable with the web visualization.
Install JS libraries and build the webapp: npm install && npm run build:dev (use build:prod in production).
Use python -m flatisfy serve --config config.json to serve the web app.

Note: Flatisfy requires an up-to-date Node version. You can find instructions on the NodeJS website to install latest LTS version.

Note: Alternatively, you can python -m flatisfy fetch --config config.json to fetch available flats, filter them and output them as a filtered JSON list (the web visualization will not be able to display them). This is mainly useful if you plan in integrating Flatisfy in your own pipeline.

Available commands

The available commands are:

init-config to generate an empty configuration file, either on the stdin or in the specified file.
build-data to rebuild OpenData datasets.
fetch to load and filter housings posts and output a JSON dump.
filter to filter again the flats in the database (and update their status) according to changes in config. It can also filter a previously fetched list of housings posts, provided as a JSON dump (with a --input argument).
import to import and filter housing posts into the database.
serve to serve the built-in webapp with the development server. Do not use in production.

Note: Fetching flats can be quite long and take up to a few minutes. This should be better optimized. To get a verbose output and have an hint about the progress, use the -v argument. It can remain stuck at "Loading flats for constraint XXX...", which simply means it is fetching flats (using Woob under the hood) and this step can be super long if there are lots of flats to fetch. If this happens to you, you can set max_entries in your config to limit the number of flats to fetch.

Common arguments

You can pass some command-line arguments to Flatisfy commands, common to all the available commands. These are

--help/-h to get some help message about the current command.
--data-dir DIR to overload the data_directory value from config.
--config CONFIG to use the config file located at CONFIG.
--passes [0, 1, 2, 3] to overload the passes value from config.
--max-entries N to overload the max_entries value from config.
-v to enable verbose output.
-vv to enable debug output.
--constraints to specify a list of constraints to use (e.g. to restrict import to a subset of available constraints from the config). This list should be passed as a comma-separated list.

Configuration

List of configuration options:

data_directory is the directory in which you want data files to be stored. null is the default value and means default XDG location (typically ~/.local/share/flatisfy/)
max_entries is the maximum number of entries to fetch.
passes is the number of passes to run on the data. First pass is a basic filtering and using only the informations from the housings list page. Second pass loads any possible information about the filtered flats and does better filtering.
database is an SQLAlchemy URI to a database file. Defaults to null which means that it will store the database in the default location, in data_directory.
navitia_api_key is an API token for Navitia which is required to compute travel times for PUBLIC_TRANSPORT mode.
mapbox_api_key is an API token for Mapbox which is required to compute travel times for WALK, BIKE and CAR modes.
modules_path is the path to the Woob modules. It can be null if you want Woob to use the locally installed Woob modules, which you should install yourself. This is the default value. If it is a string, it should be an absolute path to the folder containing Woob modules.
port is the port on which the development webserver should be listening (default to 8080).
host is the host on which the development webserver should be listening (default to 127.0.0.1).
webserver is a server to use instead of the default Bottle built-in webserver, see Bottle deployment doc.
backends is a list of Woob backends to enable. It defaults to any available and supported Woob backend.
force_fetch_all is a boolean indicating whether or not Flatisfy should fetch all available flats or only theones added from the last fetch (relying on last known housing date). By default, Flatisfy will only iterate on housings until the last known housing date.
store_personal_data is a boolean indicating whether or not Flatisfy should fetch personal data from housing posts and store them in database. Such personal data include contact phone number for instance. By default, Flatisfy does not store such personal data.
max_distance_housing_station is the maximum distance (in meters) between an housing and a public transport station found for this housing (default is 1500). This is useful to avoid false-positive.
duplicate_threshold is the minimum score in the deep duplicate detection step to consider two flats as being duplicates (defaults to 15).
serve_images_locally lets you download all the images from the housings websites when importing the posts. Then, all your Flatisfy works standalone, serving the local copy of the images instead of fetching the images from the remote websites every time you look through the fetched housing posts.

Note: In production, you can either use the serve command with a reliable webserver instead of the default Bottle webserver (specifying a webserver value) or use the wsgi.py script at the root of the repository to use WSGI.

Constraints

You should specify some constraints to filter the resulting housings list, under the constraints key. The available constraints are:

type is the type of housing you want, either RENT (to rent), SALE (to buy), SHARING (for a shared housing), FURNISHED_RENT (for a furnished rent), VIAGER (for a viager, lifetime sale).
house_types is a list of house types you are looking for. Values can be APART (flat), HOUSE, PARKING, LAND, OTHER (everything else) or UNKNOWN (anything which was not matched with one of the previous categories).
area (in m²), bedrooms, cost (in currency unit), rooms: this is a tuple of (min, max) values, defining an interval in which the value should lie. A null value means that any value is within this bound.
postal_codes (as strings) is a list of postal codes. You should include any postal code you want, and especially the postal codes close to the precise location you want.
time_to is a dictionary of places to compute travel time to them. Typically,
```
"time_to": {
  "foobar": {
      "gps": [LAT, LNG],
      "mode": A transport mode,
      "time": [min, max]
  }
}
```
means that the housings must be between the min and max bounds (possibly null) from the place identified by the GPS coordinates LAT and LNG (latitude and longitude), and we call this place foobar in human-readable form. mode should be either PUBLIC_TRANSPORT, WALK, BIKE or CAR. Beware that time constraints are in seconds. You should take some margin as the travel time computation is done with found nearby public transport stations, which is only a rough estimate of the flat position. For PUBLIC_TRANSPORT the travel time is computed assuming a route the next Monday at 8am.
minimum_nb_photos lets you filter out posts with less than this number of photos.
description_should_contain lets you specify a list of terms that should be present in the posts descriptions. Typically, if you expect "parking" to be in all the posts Flatisfy fetches for you, you can set description_should_contain: ["parking"]. You can also use list of terms which acts as an "or" operation. For example, if you are looking for a flat with a parking and with either a balcony or a terrace, you can use description_should_contain: ["parking", ["balcony", "terrace"]]
description_should_not_contain lets you specify a list of terms that should never occur in the posts descriptions. Typically, if you wish to avoid "coloc" in the posts Flatisfy fetches for you, you can set description_should_not_contain: ["coloc"].

You can think of constraints as "a set of criterias to filter out flats". You can specify as many constraints as you want, in the configuration file, provided that you name each of them uniquely.

Building the web assets

If you want to build the web assets, you can use npm run build:dev (respectively npm run watch:dev to build continuously and monitor changes in source files). You can use npm run build:prod (npm run watch:prod) to do the same in production mode (with minification etc).

Upgrading

To update the app, you can simply git pull the latest version. The database schema might change from time to time. Here is how to update it automatically:

First, edit the alembic.ini file and ensure the sqlalchemy.url entry points to the database URI you are actually using for Flatisfy.
Then, run alembic upgrade head to run the required migrations.

Misc

Other tools more or less connected with Flatisfy

ZipAround generates a list of ZIP codes centered on a city name, within a radius of N kilometers and within a certain travel time by car (France only). You can invoke it with:

npm ziparound
# or alternatively
npm ziparound --code 75001 --distance 3

12 KiB Raw Blame History