libreCATASTRO

An open-source, MIT-licensed application that scrapes the official Spanish Cadaster registry (Catastro) and stores the information in Elasticsearch.

The Master/Develop branches work with ELK Stack versions 6 < x < 7. If you want 7 < x < 8, please check out the Elastic7.5 branch.

Features

Scraping

  • From the XML web services. Check http://www.catastro.meh.es/ws/Webservices_Libres.pdf
  • From the HTML web pages.
  • Scrapes all properties, including houses, flats, garages, storehouses, even buildings in ruins!
  • Scrapes all usages and purposes: living, commercial, religious, military...
  • Scrapes rural (parcelas) and urban properties.
  • Retrieves the building plan of every property.
  • Skips already scraped information.
  • Can be queried to scrape a list of provinces.
  • Can be queried to scrape by a polygon of coordinates.
  • Can be queried to start from a specific city in a province.

Storing

  • Stores the data in Elasticsearch (see the query example below)
  • Supports automatic map visualization in Kibana
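
Once some properties have been scraped, the stored data can be queried directly through Elasticsearch's standard REST API. A minimal check, assuming Elasticsearch is exposed on localhost:9200 (the actual index name and mapping are defined in initialize_elasticsearch.py):

# Return a single stored document from any index, just to confirm data is arriving
curl -s "http://localhost:9200/_search?size=1&pretty"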

Visualization

Includes a preconfigured Kibana dashboard that shows:

  1. A heatmap on the map of Spain (world map) showing where the properties are located
  2. All data in tables
  3. The picture of the property

DoS Warning

The Spanish Cadaster has set restrictions, temporarily banning IPs that issue more than 10 queries in 5 seconds. A sleep of 5 seconds has been added where needed; it can be configured at your own risk (see the example after the cron entries below).

The bans seem to happen more often at night, and even with the 5-second sleep you may get a "Connection reset by peer" error. To try to avoid this, add these two cron entries after launching libreCatastro, so that the process is suspended at 23:00 and resumed at 09:00 every day:

0 23 * * * ps aux | grep "[l]ibreCatastro" | awk '{print $2}' | xargs kill -TSTP
0 09 * * * ps aux | grep "[l]ibreCatastro" | awk '{print $2}' | xargs kill -CONT
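
If you prefer to simply slow the scraper down, the delay can be raised with --sleep. For example (the province name is only an illustration; use --listprovinces to get the exact names):

# Wait 10 seconds between requests instead of the default 5
python3 libreCatastro.py --provinces Madrid --sleep 10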

Installation

With Docker and docker-compose installed, first run:

docker-compose up -d
pip install -r requirements.txt

Then configure the Elasticsearch index:

python3 initialize_elasticsearch.py
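
To verify that Elasticsearch is reachable and the index has been created, a quick check with curl should be enough, assuming docker-compose exposes Elasticsearch on the default localhost:9200:

# Cluster health plus the list of indices created by initialize_elasticsearch.py
curl -s "http://localhost:9200/_cluster/health?pretty"
curl -s "http://localhost:9200/_cat/indices?v"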

And finally, execute libreCatastro as described in the next step.

Execution

$ python libreCatastro.py --help

usage: libreCatastro.py [-h] [--coords]
                        [--filenames FILENAMES [FILENAMES ...]]
                        [--provinces PROVINCES [PROVINCES ...]]
                        [--sleep SLEEP] [--html] [--scale SCALE] [--pictures]
                        [--startcity STARTCITY] [--listprovinces]
                        [--listcities LISTCITIES] [--health]

Runs libreCatastro

optional arguments:
  -h, --help            show this help message and exit
  --coords (scrapping by coordinates. By default, if not set, it's by provinces)
  --filenames FILENAMES [FILENAMES ...] (for files with polygon coordinates)
  --provinces PROVINCES [PROVINCES ...] (for a list of provinces to scrap)
  --sleep SLEEP (time to sleep to avoid Cadaster DoS)
  --html (if you prefer to scrap HTML or if XML servers are down)
  --scale SCALE (for scrapping by coordinates, how big is the step)
  --pictures (scrap also the plan of the house)
  --startcity STARTCITY (start from a specific city in a province, in alphabetic order)
  --listprovinces (just list all provinces in alphabetic order)
  --listcities PROVINCE (just list all cities of a province in alphabetic order)
  --health (check if Cadaster servers are up)
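
Some typical invocations (the province, city and file names below are only illustrative; use --listprovinces and --listcities to get the exact spelling expected by the Cadaster):

# List all provinces, then all cities of one province
python3 libreCatastro.py --listprovinces
python3 libreCatastro.py --listcities Madrid

# Scrape a whole province through the XML web services, sleeping 5 seconds between requests
python3 libreCatastro.py --provinces Madrid --sleep 5

# Scrape by coordinates from a polygon file, using the HTML pages instead of the XML web services
python3 libreCatastro.py --coords --filenames mypolygon.txt --html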

Health

First of all, I highly recommend executing python3 libreCatastro.py --health to check whether the XML and HTML servers are up.

Time to get the complete DB

Taking into account that the restrictions prevent scraping faster than one page every 5 seconds, scraping the whole country can take a very long time, so:

  1. Go directly to the provinces / cities you need the most. Leave the rest for later.
  2. Use different IP addresses and query in parallel.
  3. Write me an email to jjmcarrascosa@gmail.com to get the full DB.

Using additional machines (parallel extraction)

You won't need to repeat the previous steps, because all machines will share a single Elasticsearch. For each additional machine, do the following:

  1. Make sure you have successfully run all the previous steps and Elasticsearch is running on one machine.
  2. Copy the public IP address of that machine.
  3. On the new machine, clone this repository and run:
pip install -r requirements.txt
export ES_HOST="{IP OR HOST OF THE MACHINE RUNNING ELASTICSEARCH}"
export ES_PORT="{PORT OF THE MACHINE RUNNING ELASTICSEARCH. USUALLY 9200}"
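
A quick way to confirm that the new machine can reach the remote Elasticsearch is the standard cluster health endpoint, using the values just exported:

# Should return a JSON document with "status": "green" or "yellow"
curl -s "http://$ES_HOST:$ES_PORT/_cluster/health?pretty"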

And finally, run libreCatastro:

python libreCatastro.py [....]
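
For example, each machine can be given a different slice of the work (the province and city names are only illustrative; pick them from --listprovinces and --listcities):

# Machine 1: one province
python3 libreCatastro.py --provinces Madrid --sleep 5

# Machine 2: another province, resuming from a specific city (alphabetical order)
python3 libreCatastro.py --provinces Barcelona --startcity Badalona --sleep 5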