Finding Netflix keys and IMDb keys in data dump files?

  1.  
    1. I got a Freebase data dump of all the movie info (as of March 2009). It is basically a lot of TSV flat files. One file, films.tsv (after extraction), lists over 59,000 movies, with many columns for dates, director, and other things that have a 1-to-1 relationship with a film, including columns for a netflix id and an imdb id. This is where I would expect to find the exact data I want. In that entire file, I did not see a single value in the "netflix id" column. The same goes for the imdb id column: no data for any of the roughly 60,000 movies in there. I DID, however, see a blog post (back in October?) saying that Netflix keys for over 23,000 movies were recently added to the films base. Does anyone know whether these can be found in one of the data dump files?
      1. The TSV data dumps don't currently include any keys. If you'd like to download keys in bulk, you can use the Link Export (1.6 GB, http://download.freebase.com/datadumps/). In that file, IMDB and Netflix keys will look like this:

        /guid/9202a8c04000641f8000000000009e89 /type/object/key /authority/imdb/title tt0083658
        /guid/9202a8c04000641f8000000000009e89 /type/object/key /authority/netflix/movie 70053131

        Will that work for you? Let me know if there's anything else I can help with.

      2. I think it should work, depending on the formatting within the dump files, and how long it takes me to successfully pull down a 1.6 gig file...

        I'll post back when done. Thanks! 

      3. OK. It took about an hour to download, which was faster than I expected. It also took about an hour to unzip, which was quite a bit more.

        I am afraid that I now have a 25 gig file here that I am not sure how to approach. I think I still have much reading to do on your site (or somewhere) to figure out how I can start parsing some data out of that file.

        Do I need to run a graph database over here locally for this? 

      4. Or, let me back up. What I am really trying to get is something that would normally be returned by a SQL query that looks something like this:

        SELECT imdb_id, netflix_id FROM films WHERE imdb_id IS NOT NULL AND netflix_id IS NOT NULL;

        Is there a simpler way to get this subset of data easily? 

      5. You don't need any kind of database to start using this data. If you're using a Unix-like environment, the simplest way to get at the keys you want is to treat the dump as a plain text file and filter it with grep for the key namespaces you want, like this:

        grep "/authority/imdb/title" freebase-datadump-quadruples.tsv > imdb_keys.tsv

        In this file, keys are represented as "id property key-namespace key-value". You could also filter by "/type/object/key" to get all the keys in Freebase.

      6. That worked like a charm with the imdb data giving 19,558 records.

        The netflix version returned 0 records, so I shortened the "search for" string down to "/netflix/" and got a lot of records with:

        /user/hal/netflix/movie  (23,929 records)

         and   /user/hal/netflix/role  (8,397 records)

        Are these the netflix records you referred to above as:

        /guid/9202a8c04000641f8000000000009e89 /type/object/key /authority/netflix/movie 70053131

        ?

      7. Correlating the /user/hal/netflix/movie records with the /authority/imdb/title records, I now have 11,189 matches -- that is, 11,189 movies in the Freebase dump file that have both an imdb id and a netflix id.

        Does that sound about right? 

      8. As I write this, there are 11,182 topics on the live graph which have both IMDB and Netflix title keys, so 11,189 from the slightly out of date data in the dumps is a perfectly reasonable number.

        As an aside, 19 of those topics are not currently typed as a film. The first one of those that I looked at, The Human Factor, is a book and needs to be split into two topics; similar things may well apply for the other 18. Just a small health warning :-)

      9. Yep, that looks right. Sorry about the namespace confusion: /user/hal/netflix/movie is correct -- it's actually exactly the same as /authority/netflix/movie. We should probably deprecate the /user/hal name for it.
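
For anyone who wants to reproduce the correlation step from this thread without a database, here is a minimal Python sketch. It assumes the dump is tab-separated with exactly four fields per line ("id property key-namespace key-value", as described above), and it treats /user/hal/netflix/movie and /authority/netflix/movie as the same namespace, as stated in the last reply. The function name `collect_key_pairs` and the script name in the comment are made up for illustration; this is not an official Freebase tool.

```python
import sys
from collections import defaultdict

KEY_PROPERTY = "/type/object/key"
IMDB_NS = "/authority/imdb/title"
# Per the thread, both names point at the same Netflix movie namespace.
NETFLIX_NS = {"/authority/netflix/movie", "/user/hal/netflix/movie"}


def collect_key_pairs(lines):
    """Return sorted (imdb_id, netflix_id) pairs for topics that have both keys.

    Assumes each line is "id<TAB>property<TAB>namespace<TAB>key".
    Lines that do not fit that shape are skipped.
    """
    imdb = defaultdict(set)     # topic id -> IMDb keys
    netflix = defaultdict(set)  # topic id -> Netflix keys
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4 or fields[1] != KEY_PROPERTY:
            continue
        topic, _, namespace, key = fields
        if namespace == IMDB_NS:
            imdb[topic].add(key)
        elif namespace in NETFLIX_NS:
            netflix[topic].add(key)

    # Keep only topics present in both maps (the "IS NOT NULL" part of the query).
    pairs = set()
    for topic in imdb.keys() & netflix.keys():
        for imdb_id in imdb[topic]:
            for netflix_id in netflix[topic]:
                pairs.add((imdb_id, netflix_id))
    return sorted(pairs)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage:
    #   python join_keys.py freebase-datadump-quadruples.tsv > imdb_netflix.tsv
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        for imdb_id, netflix_id in collect_key_pairs(dump):
            print(f"{imdb_id}\t{netflix_id}")
```

The file is read one line at a time, so the 25 gig dump never has to fit in memory; only the matching keys are kept. A topic with more than one IMDb or Netflix key produces one row per combination, so the row count can come out slightly higher than the number of matching topics.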
