From PDF to interactive map

Let's say you are thinking about moving to Rome in the near future.

Let's say you have a family, and you want to find all daycares within 30 minutes by public transport of your prospective new house.

Or maybe you want to find a house that's near a daycare, which in turn should be within 30 minutes of your workplace.

In the past, I would have done this manually: find a list of daycares, look at a map, check workplace, apartment, eventually find something that works.

But with a little JavaScript, some scripting skills, and a couple of hours to spare, it turns out that this sort of problem is really easy to solve using public APIs.

Before I get started, and only if you are curious, you can see the outcome here: http://rome.rabexc.org. The source code can also be found on GitHub, in this repository.

This article can serve as a very quick start and brief introduction to the Google Maps APIs.

Extracting the data

The first step for me was finding the data: the list of all daycares / pre-schools in Rome.

It turns out that Google Maps, Yelp, and the other usual suspects don't have very good data about Rome: if you just search for "scuola dell'infanzia", "asilo" or similar, you will only get a handful of results.

However, it turns out that usrlazio.it, the regional entity responsible for licensing schools, maintains some relatively good lists (this one, for example).

The bad news? The lists are provided as PDF files containing tables, which turn out to be fairly hard to parse automatically.

I was hoping that tools like pdftotext and pdftohtml from poppler-utils would produce a simple text file, with one record per line and some amount of whitespace separating the fields. Instead, the vertical centering of the text in cells confused the tools: records were spread over multiple lines, with no easy way to tell which line belonged to which record.

Fortunately, I found tabula-java, a simple tool written for the specific purpose of extracting tables out of PDF files. After a few attempts, running something like:

java -jar ./tabula-0.9.1-jar-with-dependencies.jar -r --pages all --guess ../data/ELENCONONPARITARIELAZIO2016_2017.pdf

gave me a nice CSV file that, from manual inspection, looked mostly correct. Converting all the PDFs was then as easy as running:

for file in *.pdf; do
  java -jar ../tools/tabula-0.9.1-jar-with-dependencies.jar -r --pages all --guess "$file" > "${file%%.pdf}.csv";
done

Geolocating the schools

The next step was to turn all the .csv files, in different formats, into a uniform format with the list of schools and their geographic coordinates. So:

  1. Parse the CSV: easy - pretty much in any language. I picked golang, just because it is one of my favourite languages lately.

  2. Turn an address typed by a human being into coordinates: easy as well - just use the Google Maps Geocoding API. I started off with this example, and modified it to suit my needs.

  3. Generate a JSON file to consume from a web site: easy as well. Really, not much to say here. My final JSON file can be found here.
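The three steps above can be sketched in JavaScript as well (my actual tool was written in Go, so this is only an illustration, not the code I ran). The record layout (`name`, `address`) and the helper names are assumptions made up for this example; the request URL and the response shape are those of the public Geocoding web service.

```javascript
// Sketch only: helper names and the record layout are hypothetical.

// Build a Geocoding API request URL for a free-form address.
function geocodeUrl(address, apiKey) {
  const params = new URLSearchParams({ address: address, key: apiKey });
  return "https://maps.googleapis.com/maps/api/geocode/json?" + params.toString();
}

// Extract {lat, lng} from a Geocoding API JSON response, or return
// null if the address could not be resolved.
function firstLocation(response) {
  if (response.status !== "OK" || !response.results.length) return null;
  const loc = response.results[0].geometry.location;
  return { lat: loc.lat, lng: loc.lng };
}

// Merge a parsed CSV record with its coordinates into the object
// that would end up in the JSON file. Returns null on failure.
function toSchool(record, response) {
  const location = firstLocation(response);
  if (!location) return null;
  return { name: record.name, address: record.address, lat: location.lat, lng: location.lng };
}
```

Fetching `geocodeUrl(...)` and parsing the body as JSON would give the `response` object that `toSchool` expects. Records for which it returns null are the ones worth checking by hand.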

Within about 1 hour of work, I had the list of schools, with latitude and longitude associated.

The only tricky part, in all fairness, was obtaining an API key from Google I could use, and opening it up so it could be used from my laptop. Not that hard, though: you can manage your API keys from this console, and generate new ones by following one of the thousands of "GET A KEY" links in pretty much any of the Google Maps APIs tutorials.

Drawing a map

The next step was drawing a map and showing some points. I started with something really simple:

<!doctype html>
<html lang="en">
  <head>
    <title>Test</title>

    <style>
       /* important! how big is the map? 400px high, 100% wide */
       #map {
        height: 400px;
        width: 100%;
       }
    </style>
  </head>
  <body>
    <!-- where the map will be! -->
    <div id="map"></div>

    <script>
    // creates a map, centered on rome
    function initMap() {
      var rome = {lat: 41.9028, lng: 12.4964};
      var map = new google.maps.Map(document.getElementById('map'), {
        zoom: 12,
        center: rome
      });
    }
    </script>

    <!-- actually loads the maps APIs -->
    <script async defer src="https://maps.googleapis.com/maps/api/js?key=YOUR_KEY&libraries=places&callback=initMap"></script>
  </body>
</html>

This displayed a simple map, centered on Rome. I moved on from there:

  • Add a marker on the map? Really easy:

     var coordinates = {lat, lng};
     var marker = new google.maps.Marker({
        position: coordinates,
        title: "My cool marker",
        map: map,
     });
    
  • Hide the marker?

    marker.setMap(null);
    
  • Show it again?

    marker.setMap(map);
    
  • Make it clickable? So a nice pop up would show details about the school?

    var infoWindow = new google.maps.InfoWindow({content: "<b>Arbitrary HTML HERE</b>"});
    marker.addListener('click', function() {
      infoWindow.open(map, marker);
    });
    
  • Automatically close a pop up when another marker is clicked?

    var nowOpen = null;
    [...]
    marker.addListener('click', function() {
      if (nowOpen) nowOpen.close();
      infoWindow.open(map, marker);
      nowOpen = infoWindow;
    });
    

This pretty much got me a static map with all the schools in less than 30 minutes of work.
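Putting the snippets above together, a minimal sketch of the loop that creates one marker per school could look like the following. The function name and the `{name, lat, lng}` shape of each school are assumptions for this example; `maps` is expected to be the `google.maps` namespace, passed in as a parameter so the function can be tried out with stand-in objects.

```javascript
// Sketch only: `addSchoolMarkers` is a hypothetical helper. `maps` is
// expected to be the google.maps namespace, and each school is assumed
// to look like {name, lat, lng}, with a name that is safe to embed as HTML.
function addSchoolMarkers(maps, map, schools) {
  let nowOpen = null; // the pop up currently shown, if any

  return schools.map(function (school) {
    const marker = new maps.Marker({
      position: { lat: school.lat, lng: school.lng },
      title: school.name,
      map: map,
    });
    const infoWindow = new maps.InfoWindow({ content: "<b>" + school.name + "</b>" });

    // Close whatever pop up is open before showing this one.
    marker.addListener("click", function () {
      if (nowOpen) nowOpen.close();
      infoWindow.open(map, marker);
      nowOpen = infoWindow;
    });
    return marker;
  });
}
```

In the page this would be called as `addSchoolMarkers(google.maps, map, schools)` from within `initMap`. Keeping the returned array of markers around makes it possible to hide and show them later.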

Picking a start address

Now I wanted to be able to 1) pick an address, and 2) only show the schools that were reachable by public transport within a given time from that address.

I started by adding two input boxes in the HTML:

  • One to specify a time, in minutes.
  • One to select an address, a starting point.

For this second box, I wanted to have an address autocomplete that worked well. Once again, it was really easy to do:

// Find the input box autocomplete should work on.
var input = $("#address-from")[0];
// Attach the autocomplete functionality to it.
var autocomplete = new google.maps.places.Autocomplete(input);
// Limit the search to the map on the screen.
autocomplete.bindTo("bounds", map);

and then:

autocomplete.getPlace();

would just return the selected place, with its coordinates available as geometry.location.
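A minimal sketch of reading those coordinates, assuming a hypothetical helper called `placeToLatLng`. When the user types free text without picking a suggestion, the returned place has no `geometry`, so the helper returns null in that case.

```javascript
// Sketch only: `placeToLatLng` is a hypothetical helper name.
// Returns {lat, lng} for a place returned by getPlace(), or null when
// no suggestion with a geometry was selected.
function placeToLatLng(place) {
  if (!place || !place.geometry || !place.geometry.location) return null;
  const location = place.geometry.location; // a google.maps.LatLng
  return { lat: location.lat(), lng: location.lng() };
}

// In the page, this would typically run whenever the selection changes:
//
//   autocomplete.addListener("place_changed", function () {
//     var from = placeToLatLng(autocomplete.getPlace());
//     if (from) map.setCenter(from);
//   });
```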

Filtering the schools by distance

The filtering part was actually the hardest. It turns out that there is a very simple API (Distance Matrix) to fetch the travel distance and time between multiple addresses with Google Maps. However:

  • The API is limited to about 25 origins per request. I had ~700 schools.
  • The API is limited to a few requests per second, unless you pay for the API. I did not want to pay for a toy project.

What I ended up with works well and it is still not that hard:

  • Query the distance for the first 25 schools.
  • Wait some time. Repeat for the next 25 schools.
  • If an error is returned, retry the same schools after waiting some time.

Something like this turned out to work pretty well:

// List of the coordinates of the schools.
var schools = [ ... {lat, lng}, {lat, lng}, ...];
// Prepare to use distancematrix API.
var service = new google.maps.DistanceMatrixService();

// Computes the distance for the 25 schools starting at
// offset start, and then the next 25, and so on.
var filterSome = function (start) {
  if (start >= schools.length) return;

  service.getDistanceMatrix({
    origins: schools.slice(start, start + 25),
    destinations: [autocomplete.getPlace().geometry.location],
    travelMode: 'TRANSIT',
  }, function(response, status) {
    // Retry the same set of schools in 1 second if the query limit was hit.
    if (status == "OVER_QUERY_LIMIT") {
      setTimeout(filterSome, 1000, start);
      return;
    }

    $.each(response.rows, function (key, value) {
      // [...] marker.setMap(null) or marker.setMap(map) to hide/show
      // schools, update a progress bar, add some text to the school
      // description.
    });

    // Move to the next set of schools if distance was computed correctly.
    setTimeout(filterSome, 1000, start + 25);
  });
};

// Start filtering schools from the beginning.
filterSome(0);

If you look at the HTML of the page at rome.rabexc.org, you can see that the code is a bit more complex, but not that much.

The main reason for the difference is that I wanted to display a progress bar, as it would take about 30 seconds for the filtering to complete, and well, there's actually code to compute the time, and display or hide points accordingly.
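As a rough idea of what the elided part of the callback might do, here is a sketch under a few assumptions: the helper names are made up, there is a single destination per request, and `markers[start + i]` is the marker of the i-th origin in the batch. It is not the exact code running on the site.

```javascript
// Sketch only: `reachableWithin` and `applyFilter` are hypothetical names.
// Given a Distance Matrix response with one destination, returns one
// boolean per origin: true when a route exists and takes at most
// `maxMinutes`.
function reachableWithin(response, maxMinutes) {
  return response.rows.map(function (row) {
    const element = row.elements[0];
    // Elements without a route have a status such as "ZERO_RESULTS"
    // or "NOT_FOUND" and carry no duration.
    if (element.status !== "OK") return false;
    return element.duration.value <= maxMinutes * 60; // value is in seconds
  });
}

// Show or hide the markers of one batch, where markers[start + i]
// corresponds to the i-th origin of the request.
function applyFilter(markers, start, reachable, map) {
  reachable.forEach(function (ok, i) {
    markers[start + i].setMap(ok ? map : null);
  });
}
```

Inside the callback this would be used as `applyFilter(markers, start, reachableWithin(response, minutes), map)`, with `minutes` read from the first input box.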

Conclusions

Doing something like this 10 years ago would have probably been hard or unfeasible from the comfort of my home. Just finding good geolocation and route calculation APIs with public transport and related data would have been both hard and expensive.

As of today, I managed to put something quite usable together in the time it would have taken to watch a movie, which hopefully will save me at least a few hours of planning, as we go through the list of apartments and actually find a daycare / pre-school we like.

The best part? The list from usrlazio.it shows about 600 to 700 schools, against the tens I could find on Maps / Bing / Google and similar. By manually checking some of them, it seems like they are indeed pre-schools and daycares, even though the information is often buried deep in their web sites (e.g., an elementary school that also provides pre-school, or a religious establishment, ...).

The worst part? That same list seems manually maintained. The addresses, school names, and so on contain several errors or typos. Sometimes the address is a legal address, rather than where the school actually is, and sometimes the Google Maps APIs could not figure out the typos or the correct address based on the text in the .pdf.

By skimming through the points and by double checking the addresses, I would say about 5% of the records have something wrong. The pop-up, though, is now showing all the data from the .pdf, and linking to a Google search. So it is pretty easy to sift through the errors.
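For reference, the content of such a pop up can be built with something along these lines. This is a sketch: the function names and the `name` / `address` fields are assumptions for this example, and the real page shows more fields from the .pdf.

```javascript
// Sketch only: function names and the record fields are hypothetical.

// Escape text so it can be embedded safely in the pop up HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the pop up content: the raw data plus a Google search link.
function popupHtml(school) {
  const query = encodeURIComponent(school.name + " " + school.address);
  return "<b>" + escapeHtml(school.name) + "</b><br>" +
    escapeHtml(school.address) + "<br>" +
    '<a href="https://www.google.com/search?q=' + query +
    '" target="_blank">Search on Google</a>';
}
```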

