The api/service/geosearch API can be used to query the geo index in PostGIS. It currently serves three styles of queries:
By default, results are returned in geojson format [2]. Other supported formats are json, the typical Metaweb JSON format; kml, to work with Google Earth; kml/maps, to work with Google Maps; ids, to return only the Freebase ids of matches; and guids, to return only the Freebase guids of matches.
The geosearch API uses both a geo index and the graph to produce its results. The query is first sent to the geo index, which returns the guids of matches along with the geo data these were indexed with. These guids are then sent to the graph via a MQL query to extract more information about the matching topics, such as name, thumbnail image and id.
By default, the MQL query used is:
[ {
    "guid": None,
    "id": None,
    "name": None,
    "type": [],
    "/common/topic/image": [ {
        "guid": None,
        "id": None,
        "type": "/common/image",
        "optional": True,
        "limit": 1,
        "index": None,
        "sort": "index",
    } ]
} ]
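The query above is written as a Python literal (None, True). As a minimal sketch, assuming the graph receives the query as standard JSON text (the source does not show the serialized form), the same structure can be serialized with Python's json module. The constant name DEFAULT_MQL_QUERY is chosen for illustration only.

```python
import json

# The default MQL query from the text, written as a Python literal.
# DEFAULT_MQL_QUERY is an illustrative name, not one used by the service.
DEFAULT_MQL_QUERY = [{
    "guid": None,
    "id": None,
    "name": None,
    "type": [],
    "/common/topic/image": [{
        "guid": None,
        "id": None,
        "type": "/common/image",
        "optional": True,
        "limit": 1,
        "index": None,
        "sort": "index",
    }],
}]

# json.dumps maps Python's None to null and True to true.
serialized = json.dumps(DEFAULT_MQL_QUERY, indent=2)
print(serialized)
```

Parsing the text back with json.loads() yields the original structure, so the two notations are interchangeable.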
Depending on the format selected, the MQL results are then inserted into the final query results as follows. When the format chosen is json or geojson, the MQL results are inserted via a properties dictionary.
When kml is selected, these MQL results are processed to extract the name, the Freebase URL and the Freebase thumbnail for the topic into the feature's description. The MQL results are also inserted into an <ExtendedData> element tree under each <Placemark> as specified at [6]. Any / characters in MQL key names are converted to - when used in XML element names.
The MQL query step is skipped when mql_output is "null" or when format is ids or guids.
Pay close attention to the scope of a geo query: a query that is too broad can easily return the entire database.
All three styles of queries described above normally require that one or more locations be specified via the location or the mql_input parameters.
If neither location nor mql_input is specified, the queries are run against the whole world; the generated SQL contains no geo constraints and the only supported value for inside is true or 1.
The location parameter specifies a single location and accepts a variety of input formats:
- An id for a graph topic such as /en/california that was indexed because it had location data.
- A guid for a graph topic starting with # that was indexed because it had location data. (Note that # starts the fragment part of a URL, so it must be encoded as %23.)
- One or more terms such as Berkeley or San Francisco that are passed to the Lucene relevance server to retrieve the actual topic id. The optional location_type parameter may be specified to help the Lucene relevance query return the desired topic. This parameter defaults to /location/location.
- A GeoJSON shape as a dictionary as specified at [1].
- A bounding box as a list [x0, y0, x1, y1] where (x0, y0) and (x1, y1) specify two diagonally opposite points of a rectangle. x denotes a longitude and y, a latitude.
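The input formats listed above can be turned into request URLs along the following lines. This is a sketch under stated assumptions: the helper name location_query is hypothetical, dictionaries and lists are assumed to be sent as JSON text (the source does not state the wire format), the guid shown is a placeholder and the coordinates are arbitrary. URLs are built relative to the service, as in the other examples in this document.

```python
import json
from urllib.parse import urlencode


def location_query(location, **params):
    """Build a relative geosearch URL for one location value.

    Assumption: dict (GeoJSON shape) and list (bounding box) values are
    serialized as compact JSON text before URL encoding.
    """
    if isinstance(location, (dict, list)):
        location = json.dumps(location, separators=(",", ":"))
    query = {"location": location}
    query.update(params)
    # safe="/" keeps topic ids such as /en/california readable.
    return "geosearch?" + urlencode(query, safe="/")


# A topic id.
print(location_query("/en/california"))
# A guid (placeholder value); the leading # is encoded as %23.
print(location_query("#00000000000000000000000000000000"))
# Search terms with a location_type hint.
print(location_query("San Francisco", location_type="/location/citytown"))
# A GeoJSON shape and a bounding box [x0, y0, x1, y1] (arbitrary values).
print(location_query({"type": "Point", "coordinates": [0, 0]}))
print(location_query([0, 0, 1, 1]))
```

Letting urlencode() handle the escaping avoids encoding each value by hand and keeps parameter order stable.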
The mql_input parameter specifies a MQL query to run against the graph to collect the guids of matching topics to query the geo index with. This query may be single or multi cardinality. If a topic was not indexed because it had no location data, the results may be empty or not contain any geo data. For example, the MQL query about all cities named "San Francisco":
[
{
"name" : "San Francisco",
"type" : "/location/citytown"
}
]
finds five such cities, only one of which has actual geo data and occurs in the geo index.
Please note that both location and mql_input usually take values that must be encoded for URL use. In Python 2, urllib.quote() does the trick:
>>> import urllib
>>> urllib.quote('[{ "name" : "San Francisco","type" : "/location/citytown" }]')
'%5B%7B%20%22name%22%20%3A%20%22San%20Francisco%22%2C%22type%22%20%3A%20%22/location/citytown%22%20%7D%5D'
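In Python 3, the same function is available as urllib.parse.quote(). A minimal sketch using the query string from the text:

```python
from urllib.parse import quote, unquote

mql_input = '[{ "name" : "San Francisco","type" : "/location/citytown" }]'

# quote() leaves "/" unescaped by default, as in the output shown above.
encoded = quote(mql_input)
print(encoded)

# unquote() reverses the encoding.
assert unquote(encoded) == mql_input
```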
The simplest possible query is to retrieve one or several shapes for a given topic. When several shapes are present for a topic, they're sorted in decreasing order of dimension (see ST_Dimension() function at [3]).
Retrieving the shapes for San Francisco: geosearch?location=San+Francisco&location_type=/location/citytown
Retrieving the shapes for San Francisco in KML format: geosearch?location=San+Francisco&location_type=/location/citytown&format=kml
Retrieving one shape for San Francisco, typically its outline: geosearch?location=San+Francisco&location_type=/location/citytown&limit=1
Retrieving the bounding box for San Francisco: geosearch?location=San+Francisco&location_type=/location/citytown&limit=1&accessor=envelope
Retrieving the shapes of all the cities called Berkeley: geosearch?mql_input=[{"name":"Berkeley","type":"/location/citytown"}]
The more complex queries supported involve finding topics in proximity to a given anchor topic or location. Two styles of proximity query are supported:
One of these parameters must be specified for a proximity query. If more than one is used, function overrides operator which overrides inside which overrides within.
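The precedence rule can be sketched as a small lookup. The function name effective_proximity_parameter and the use of a plain dictionary for the request parameters are assumptions made for illustration; this is not the service's code.

```python
# Precedence stated in the text: function > operator > inside > within.
PRECEDENCE = ("function", "operator", "inside", "within")


def effective_proximity_parameter(params):
    """Return the (name, value) pair that takes effect for a request.

    params is a dict of request parameters. Raises ValueError when none
    of the four proximity parameters is present.
    """
    for name in PRECEDENCE:
        if name in params:
            return name, params[name]
    raise ValueError("a proximity query needs one of: " + ", ".join(PRECEDENCE))


# inside overrides within when both are given.
print(effective_proximity_parameter({"inside": "true", "within": "5"}))
```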
Finding the restaurants in San Francisco: geosearch?location=San+Francisco&location_type=/location/citytown&type=/dining/restaurant&inside=true&indent=1
Finding the restaurants in San Francisco and returning KML. Enter this URL into http://maps.google.com for a cool rendering of the results: geosearch?location=San+Francisco&location_type=/location/citytown&type=/dining/restaurant&inside=true&format=kml/maps
Finding the restaurants within 5 km of Berkeley and returning KML: geosearch?location=/en/berkeley_california&type=/dining/restaurant&within=5&format=kml
If a consistent order is required, but neither relevance nor distance matter, order_by=uid can be used to request the results be sorted in the order they are stored in the PostgreSQL table.
Finding the 100 most relevant restaurants in California outside of San Francisco and returning KML: geosearch?location=San+Francisco&location_type=/location/citytown&type=/dining/restaurant&outer_bounds=/en/california&inside=false&format=kml&limit=100&order_by=relevance
Finding the 100 restaurants in California that are closest to San Francisco but not in it and returning KML. Enter this URL into http://maps.google.com for a cool rendering of the results: geosearch?location=San+Francisco&location_type=/location/citytown&type=/dining/restaurant&outer_bounds=/en/california&inside=false&format=kml/maps&limit=100&order_by=distance
When geojson or json is returned, the results include the relevance value or the distance value the query results were ordered by.
Counting the restaurants in California outside of San Francisco: geosearch?location=San+Francisco&location_type=/location/citytown&type=/dining/restaurant&outer_bounds=/en/california&inside=false&count=1&indent=1
Counting all indexed locations in San Francisco: geosearch?location=San+Francisco&location_type=/location/citytown&inside=true&count=1&indent=1
Containment queries against complex shapes can become quite slow. When the inside parameter is used, the hull of each shape is used in place of the shape itself, which greatly speeds up the query at the expense of some accuracy.
When the function or operator parameter is used, the actual complex shape is used instead, ensuring an accurate, albeit much slower, query.
Depending on the shape, using its convex hull can have considerable precision drawbacks. In particular, multipolygons become one simple polygon that contains all the in-between parts. For example, the hull around France contains pieces of the Mediterranean, Italy, Switzerland and Germany because of the position of Corsica.
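The effect can be reproduced on a toy multipolygon. The following is a pure-Python illustration, not what the service runs (the geo index lives in PostGIS): it computes a convex hull with the monotone chain algorithm and uses a ray-casting test to show that a point lying between a "mainland" square and an offshore "island" square falls inside the hull but inside neither shape. All coordinates and names are made up for the example.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def contains(polygon, point):
    """Ray-casting point-in-polygon test for a simple polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy multipolygon: a mainland square and a small offshore island.
mainland = [(0, 0), (4, 0), (4, 4), (0, 4)]
island = [(6, -3), (7, -3), (7, -2), (6, -2)]
hull = convex_hull(mainland + island)

sea_point = (5, -1)  # lies between the two shapes
print(contains(mainland, sea_point), contains(island, sea_point))
print(contains(hull, sea_point))
```

A containment query against the hull would therefore report the point in the sea as a match, which is the precision drawback described above.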
Another trade-off between performance and accuracy is possible with the optional simplify parameter, which triggers the use of the PostGIS ST_Simplify() function to simplify complex shapes with the Douglas-Peucker algorithm [7]. It takes a floating point number, a so-called tolerance value, expressed in degrees, which is best explained by reading [7] where one can visualize its meaning. The right tolerance to use depends on the actual geographical size of the shape used in the query.
Douglas-Peucker shape simplification introduces a new trade-off between performance and accuracy. A hull, while less precise, is guaranteed to contain the entire original shape. A simplified shape, on the other hand, may have lost some area. Again, this can be visualized by reading [7].
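To make the loss of area concrete, here is a textbook Douglas-Peucker sketch in pure Python. It is not the PostGIS implementation behind ST_Simplify(), and the ring and tolerance are made-up values in arbitrary units. A square ring with a shallow bump on its top edge loses the bump, and its area, once the tolerance exceeds the bump's height.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. a closed ring's first == last point).
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def simplify(points, tolerance):
    """Recursive Douglas-Peucker simplification of a point sequence."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [first, last]
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    # The split point ends left and starts right; keep it once.
    return left[:-1] + right


def ring_area(ring):
    """Shoelace area of a closed ring (first point repeated at the end)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A 10 x 10 square with a bump of height 0.5 on its top edge.
ring = [(0, 0), (10, 0), (10, 10), (5, 10.5), (0, 10), (0, 0)]
simplified = simplify(ring, tolerance=1.0)

print(simplified)
print(ring_area(ring), ring_area(simplified))
```

Any topic located inside the bump would be missed by a containment query against the simplified ring, whereas a hull of the original ring would still contain it.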
| [1] | http://wiki.geojson.org/GeoJSON_draft_version_6#Geometries |
| [2] | http://wiki.geojson.org/GeoJSON_draft_version_6 |
| [3] | http://postgis.refractions.net/docs/ch06.html#id2595672 |
| [4] | http://postgis.refractions.net/docs/ch06.html#id2594839 |
| [5] | http://postgis.refractions.net/docs/ch06.html#id2597154 |
| [6] | http://code.google.com/apis/kml/documentation/extendeddata.html#opaquedata |
| [7] | http://marblemice.com/2007/09/12/douglas-peuker-line-simplification-explained/ |
| [8] | http://mql.freebaseapps.com/ch02.html#typedatetime |