Constraints and Joins

Count, Optional

Let's start a new type-based query: what are the writing systems for various (human) languages? For example, Kanji is a writing system for the Japanese language. You can try to use the Query Editor to formulate this query by yourself. Presumably you'll end up with something like this:

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": null
  }]
}]

If you don't have the second "name" : null in your own query then put it in, as it'll serve to illustrate something important soon.

Run the query and then pick the Tree tab in the output pane on the right. You can see that each result (each chemical element) is numbered, starting from 0. The last number is 99, meaning that there are 100 results. But we know there can't be just 100 human languages. You might already know that there are a few thousand human languages, or if you don't, the perfect number 100 should look suspicious to you. So, apparently Freebase has limited the number of results to just 100.

We can find out how many results there actually are by adding "return" : "count" to the query:

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": null
  }],
  "return" : "count"
}]

return is a directive, meaning that it is an explicit instruction for the query engine regarding how to evaluate the query. In this case, it tells the query engine not to return the results, but to count them and return that count instead.

When you run that query in the Query Editor, you'd get something like this

{
  "code": "/api/status/ok",
  "result": [
    547
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2009-07-04T04:42:55Z;0023"
}

meaning that there are 547 results. This is better than 100, but we know that there are more languages than just 547. So what's going on?

What has happened is as follows: In the query, the query node

  "writing_system": [{
    "name": null
  }]

serves as a blank where the name of the writing system is to be filled in. But that query node also serves as a constraint, requiring that the name blank must be filled in. Thus, if a human language has no writing system, then that constraint cannot be satisfied for that particular human language, and that human language cannot be included in the results. Thus, the query as written only returns the number of human languages for which there is at least one writing system each.

In order to count all human languages, including those with no writing system, we have to either eliminate that query node

[{
  "type": "/language/human_language",
  "name": null,
  "return" : "count"
}]

or mark that query node as optional, like so

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": null,
    "optional" : true
  }],
  "return" : "count"
}]

Now when you run either versions, you'll get a number in the thousands.

The lesson here is that if you're getting fewer results than you expect, chances are your query is more restrictive than you think--more query nodes in the query act as constraints than you have intended. This brings us to the next topic: query constraints.

Constraints

If you're familiar with SQL, you'd expect some form of a WHERE clause for specifying query constraints (conditions):

SELECT ... fields ... FROM ... tables ... WHERE ... conditions ...

In MQL, there is no such WHERE clause. In MQL, a constraint is typically just a property value that has been provided, in the form of a string or a number or a boolean. For example, in our very first query on this page,

[{
  "id": "/id/the_dark_knight",
  "/film/film/directed_by": [{}]
}]

"id" : "/en/the_dark_knight" is a constraint, because it requires any applicable result to have an ID equal /en/the_dark_knight (there is only one such applicable result). Recall from the Basic Concepts section that each topic has a GUID; hence, we could have rewritten that same query as follows:

[{
  "guid": "#9202a8c04000641f8000000000acd666",
  "/film/film/directed_by": [{}]
}]

Or even as follows:

[{
  "id": "/guid/9202a8c04000641f8000000000acd666",
  "/film/film/directed_by": [{}]
}]

In the chemical element query, "type" : "/chemistry/chemical_element" is a constraint, because it requires any applicable result to have the type /chemistry/chemical_element.

Here is a query with 2 constraints

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": "Chinese character"
  }]
}]

The first constraint is on the type of each result, and the second constraint is on the name of the writing system of each result. Actually, the first constraint is superfluous. We could have written the query with just one constraint, provided that we fully qualify the property writing_system:

[{
  "name": null,
  "/language/human_language/writing_system": [{
    "name": "Chinese character"
  }]
}]

The type constraint is superfluous in the first version of the query because each result should already be assigned the type /language/human_language before it is given a /language/human_language/writing_system property value.

Here's a query with a numerical constraint:

[{
  "type": "/chemistry/chemical_element",
  "name": null,
  "van_der_waals_radius" : 200
}]

So far we've only used exact equality constraints. You can also specify comparators on numerical properties

[{
  "type": "/chemistry/chemical_element",
  "name": null,
  "/chemistry/chemical_element/atomic_mass": {
    "mass<": 10,
    "mass": null
  }
}]

and comparators on date properties and patterns on string properties

[{
  "type": "/film/actor",
  "name~=": "Christian *",
  "name": null,
  "/people/person/date_of_birth>": "1970",
  "/people/person/date_of_birth": null
}]

When you include more than one constraint (as in the immediately preceding query), the constraints are considered in conjunction with one another. It's another way of saying that they are AND'ed together, or that all of them have to be true for each result. Thus, in the preceding query, every actor in the result is named Christian and was born in 1970 or later. (It's a bit harder to do disjunctive constraints, and we'll discuss them in the advanced section.)

Finally, as we've seen previously, inner query nodes can also act as constraints. In this query, the inner query node requires each human language in the results to have at least one writing system:

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": null
  }]
}]

So that the inner query node only serves to retrieve more data without constraining the whole query, we can add the "optional" : true directive:

[{
  "type": "/language/human_language",
  "name": null,
  "writing_system": [{
    "name": null,
    "optional" : true
  }]
}]

TODO: prefix in property keys, |= operator

Joints

As you've just seen, MQL is quite different from SQL when it comes to constraints. Their joins are also quite different. Whereas SQL requires you to decide which joins to use and how (which tables, which ordering of the tables, and which foreign keys), MQL takes care of all the joins for you. In fact, you would almost never ever think that you are using joins, even though almost everything you do in MQL translates into joins inside the MQL query engine. Perhaps because almost every MQL query requires a lot of joins that if you were to have to think about the joins yourself, you would be confounded. So, the MQL language is designed to hide all of those mundane details away from you.

Consider a simple query for films and their directors

[{
  "type": "/film/film",
  "name": null,
  "directed_by": [{
    "name": null
  }]
}]

This query involves films and directors as well as their many-to-many relationships (one film can be directed by several people, and one person can direct several films). Thus, if you were to design a database schema to store such data, you would need at least 3 tables, and the SQL query equivalent to the MQL query above is:

SELECT films.name, directors.name FROM films, directorships, directors WHERE
  directorships.film_id = films.id AND directorships.director_id = directors.id

You can see that the explicit joins in SQL are a lot harder to read than the implicit joins in MQL (implicit by the nesting in the JSON).

Now let's go a little deeper into our MQL query. For each film's director in the MQL query above, we also want the director's parent(s) and the film(s) that the parent(s) directed:

[{
  "type": "/film/film",
  "name": null,
  "directed_by": [{

    "name": null,
    "/people/person/parents": [{

      "name": null,
      "/film/director/film": [{
        "name": null
      }]

    }]

  }]
}]

For example, one of the results will be the film Rails and Ties, directed by Alison Eastwood, whose father, Clint Eastwood, has directed several movies such as Mystic River, Changeling, etc. Here is a visualization of the query as well as one particular result:

Note that the SQL query equivalent to this MQL would contain at least 6 joins, which are becoming quite unmanageable.

So far, joins in MQL are specified by adding inner (nested) query nodes. You can also perform joins just by specifying properties of different types on the same query node. This is effectively joining the film director table with the company founder table:

[{
  "name": null,
  "/film/director/film": [{
    "name": null
  }],
  "/business/company_founder/companies_founded": [{
    "name": null
  }]
}]

In fact, it is best not to even think in terms of joins when you deal with MQL queries. How to think of joins in MQL brings us to the next topic: understanding how MQL queries are evaluated.

Understanding How Queries are Evaluated Conceptually

Conceptually, evaluating a query consists of two phases:

  • Considering all the constraints in the query, enumerate all the applicable results
  • For each result, retrieves the data requested (property values, related topics), i.e., filling in the blanks

The second phase is pretty easy to understand, but the first phase deserves more explanation.

Consider this simple MQL query with 3 constraints:

[{
  "type": "/film/film",
  "name": null,
  "directed_by": {
    "id": "/en/terry_gilliam"
  },
  "/film/film/starring": {
    "actor": {
      "id": "/en/heath_ledger"
    }
  }
}]

In English, the query means: retrieve the names of films (constraint #1) that are directed by Terry Gilliam (constraint #2) and star Heath Ledger (constraint #3). We can visualize this query as follows:

just as we've done at the end of the "Formulating Simple Queries" section. This time the "tree" doesn't simply grow downward; it's just more convenient that way for subsequent discussion. The idea that the query is a tree of nodes still holds.

Most of the query has been specified and the only missing data is the name of the film, marked with "?". Note that you can refer to a type directly by its ID and you don't need to use the { "id" : ... } construct. This is one of those shortcuts.

Now, think of Freebase data as an immense web of interconnected real life things--people, places, events, artifacts,... The formal and more generic name of such a web of interconnected things is a graph. That graph is too big to draw here, but if we were to take a magnifying glass and look closely at one part of that huge graph, particularly where the film "The Brothers Grimm" is mentioned, then we would see something like this:

That is, we would see how the film is connected to its directors, to its stars, etc., and how each person, say, the director Terry Gilliam, is connected to his films, and possibly to everything else in his life. If you compare the two diagrams--the query and the subgraph of Freebase surrounding The Brothers Grimm--you would see that the constraint parts of the query match the blue parts of the subgraph. This means that the blue parts of the subgraph form an applicable result for that query.

Thus, when working with such a graph database as Freebase and such a graph-based query language as MQL, think of queries in terms of subgraph patterns that match whichever parts of the graph that you want.

Rephrasing Queries

In English, the query above can be phrased in several ways:

  • What are the films that are directed by Terry Gilliam and star Heath Ledger?
  • Among Terry Gilliam-directed films, which are the ones that star Heath Ledger?
  • Among Heath Ledger-starring films, which are the ones directed by Terry Gilliam?

Each phrasing places emphasis on a different constraint, or in other words, we're looking at the same matter from different perspectives. We can visualize the three phrasings as follows

To change from one phrasing to another, it's just a matter of conceptually grabbing another of the 3 colored nodes and lifting it to the top. The rest of the nodes will dangle themselves. This conceptual operation of rephrasing a query is thus called re-dangling.

The first phrasing can be expressed in JSON as follows:

[{
  "type": "/film/film",
  "name": null,
  "directed_by": {
    "id": "/en/terry_gilliam"
  },
  "/film/film/starring": {
    "actor": {
      "id": "/en/heath_ledger"
    }
  }
}]

The second:

{
  "id": "/en/terry_gilliam",
  "!/film/film/directed_by": [{
    "type": "/film/film",
    "name": null,
    "/film/film/starring": {
      "actor": {
        "id": "/en/heath_ledger"
      }
    }
  }]
}

Note the !: it is used because the arrow from the film node to the director node is now pointing upward. Also note that any property, when reversed, must be fully qualified.

Because the property /film/film/directed_by does have a reverse property, /film/director/film (you can verify here), we could also have written the second phrasing as

{
  "id": "/en/terry_gilliam",
  "/film/director/film": [{
    "type": "/film/film",
    "name": null,
    "/film/film/starring": {
      "actor": {
        "id": "/en/heath_ledger"
      }
    }
  }]
}

Similarly, the third phrasing can be written as

{
  "id": "/en/heath_ledger",
  "!/film/performance/actor": [{
    "!/film/film/starring": {
      "type": "/film/film",
      "name": null,
      "directed_by": {
        "id": "/en/terry_gilliam"
      }
    }
  }]
}

or as

{
  "id": "/en/heath_ledger",
  "/film/actor/film": [{
    "/film/performance/film": {
      "type": "/film/film",
      "name": null,
      "directed_by": {
        "id": "/en/terry_gilliam"
      }
    }
  }]
}

These different versions of the same conceptual query give you the same result albeit expressed differently. Try them in the Query Editor and see for yourself.

Rephrasing an English question is not so hard, but rephrasing/re-dangling a query can be very errorprone. This is why the Query Editor provides an experimental feature that does this automatically for you. Try it as follows. First, paste the first version of the query above into the Query Editor, then place your cursor where indicated below by the red vertical bar:

[{
  "type": "/film/film",
  "name": null,
  "directed_by": {
    "id": "/en/terry_gilliam" |
  },
  "/film/film/starring": {
    "actor": {
      "id": "/en/heath_ledger"
    }
  }
}]

That tells the query editor which node you want to lift to the top. Then in the control panel of the Query Editor (below the query input text box), click the Turn Inside Out button (in the Tools tab). The resulting query replaces the original one in the query input text box.