Discussions on Contributing Large Datasets

  1.  

    NNDB Links for People

    1. I would like to add about 20,000 links from people to their NNDB pages. To do this I've created the NNDB Profile Page type.

      I've gone through all the people on Freebase and matched them to their NNDB page by name. In the cases where several people share the same name, I simply ignore those pages.

      Would it be appropriate to upload this data to the sandbox?

      1. It would indeed. Please go ahead.

      2. OK, the complete set of links has been written to the sandbox. Please check out the results and let me know if I can write them to the main database.

      3. The data looks good!

        However, I lied when I said earlier, on the mailing list, that the IMDB Profile Page model was the way to go. We now have the ability to use keys into an external database as a way to generate URIs, which further provides uniqueness checking. I am working on converting our IMDb references into this form. It would be great if you could wait on the final NNDB load and model it that way; I will be happy to show you how once I figure it out myself. (-:

        1. Have you made any progress on this? I've looked at the documentation on enumerations but I can't figure out how to apply it to my model.
      4. Sounds like a great way to model these things. I look forward to learning how to use this new technique.

      5. Oh, yeah! It’s totally done and I forgot to come back here.

        First, I made namespaces (/authority/imdb/film, name, character). Then I made properties that enumerate those namespaces and attached URI templates to them.

        Check out the IMDb profile property on film. It expects Enumeration as its type. Then you have to get a little fancy and switch to the admin view. Set the Enumeration property of the property to the expected namespace. Then co-type the property as a Foreign key property and set a URI template. Set its type to URI Template and fill in both the canonical template (used to generate and recognize URIs) and the other templates (used only for recognition).

        The schema UI will support this at some point… just not yet, as it’s kind of a power-user thing.

      6. Ok, I've created the namespace, I've set up the enumeration property on the NNDB Person type and I've attached a URI template to that property.

        Then I added a sample key to the Paul Newman topic and the NNDB link shows up as expected. Unfortunately, the ID has a forward slash in it which gets escaped and breaks the link. I went back to the Foreign Key Property and explicitly disabled URI encoding but that hasn't fixed it. Any ideas on how to handle this?

        Coincidentally, tsegaran added a foreign key to a NYT page on the same Paul Newman topic, and his key also contains forward slashes, but he seems to have entered the weblink separately without using a URI template.

      7. Ah, yes, the char escaping with keys and URI templates... I am in the process of converting the NYT keys to use URI templating, and I also ran into that problem.  There's a bug filed to have the UI behave properly when it encounters escaped URLs - I'll post back when there's a status update to this.

        Toby (tsegaran) actually added a key, and created a discrete weblink (it's not using URI templating).  When I implement the URI templating, I'll be removing the superfluous weblink.

        BTW, good work!

      8. This sounds like simply a bug. You're right that the NYTimes links got added separately, and they'll probably need to be fixed. It may be too late to get a fix in for next week's release, but I'll try. For my and others' reference, this is CLI-4538 in our bug system.
      9. Ok, thanks guys. I'll watch for CLI-4538 in the release notes.
      10. Looks like everything is working fine now in the new release. Thanks for fixing this.
      11. I've uploaded a new version of the data to the sandbox. If no one has any objections, I'll add it to the main site.

        Is there a limit on how many writes I can do on the main site? Will I be able to make 19,000 writes in a day?

      12. I believe the normal limit is 10,000 writes.  I'll see if I can get your limit increased.
      13. Thanks Brian. The exact count should be 19,619 writes to the API with 2 properties being updated each time.
      14. Any luck getting my limit increased?
      15. I'll try to get it done before the sandbox refresh (Monday PM PST) so you can test against sandbox once more before going live.

      16. Shawn, your limit is now 25K, on sandbox and www.  Happy loading!
      17. Thanks for updating the limit. I ran one more test on sandbox and then uploaded them to www and exceeded my limit. I guess that's 25k combined sandbox & www so I'll upload the rest tomorrow.
      18. I tried running it again today and I exceeded my limit again about half way through. Is it 25k writes or 25k facts? This only happens on www. The sandbox was able to write the whole dataset at once.
      19. Just noticed that the "limit exceeded" error message says that my max_writes is 10,000 per day. I guess www is still using the old limit.
      20. Hmmm, we just recently moved datacenters, and in the shuffle your write limit, which was previously upped, might have gotten reverted. I'll check it out and get back to you...
      21. Additionally, are you adding keys to 25K topics, or something less? Co-typing and adding the NNDB key is 2 primitives per topic, so if you are trying to do that for 25K topics, we really need to set your limit to 50K.
      22. I'm adding the NNDB Person type to the topic and adding the key, so if the limit is on the number of primitives rather than the number of writes then I would need a limit of 50k.
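
The arithmetic behind replies 21 and 22 can be sketched as follows. This is only a rough illustration: it assumes, as reply 21 suggests but the thread never confirms, that the daily limit counts primitives rather than API calls, and that the limit resets once per day. The function name `days_needed` is made up for this example; the topic count and the three limits are the figures quoted in the thread.

```python
import math

def days_needed(topics, primitives_per_topic, daily_limit):
    """Days required to finish a load if each topic costs a fixed number
    of primitives and the limit is a per-day cap on primitives (assumption)."""
    total_primitives = topics * primitives_per_topic
    return math.ceil(total_primitives / daily_limit)

TOPICS = 19_619            # write count quoted in reply 13
PRIMITIVES_PER_TOPIC = 2   # add the NNDB Person type + add the key (reply 22)

total = TOPICS * PRIMITIVES_PER_TOPIC

# Limits mentioned in the thread: the default, the raised one, the proposed one.
plan = {limit: days_needed(TOPICS, PRIMITIVES_PER_TOPIC, limit)
        for limit in (10_000, 25_000, 50_000)}

for limit, days in plan.items():
    print(f"limit {limit:>6}: {days} day(s) for {total} primitives")
```

Under these assumptions the full load fits in a single day only with the 50K limit; a 25K limit would spread it over two days.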

  2.  

    biology Topics

    1. Hi :-)

      I'm a grad student, and I'm helping to develop an experiment which will ask people to annotate web services with tags.  The experiment is part of the ED project, described at http://www.connotea.org/wiki/EntityDescriber.

      What I want to do is load the list of candidate tags into Freebase as Topics, all under a single Type.  However, there are a lot of them.  (I estimate about 100,000.)  I expect that I will hit the user upload limit when I try to do this, and I was wondering if you could raise the limit temporarily for me?  (50,000 assertions per day would be okay, and it could be set back to normal in two weeks.)

      Thanks for your help, 

      - Ben Vandervalk

      Masters Student

      University of British Columbia

      Wilkinson Lab

      1. Hi Ben,

        I've passed your request on to the data team - someone should be replying to you shortly.

        1. Hi cheunger,

          Any word yet?

           - Ben V.

      2. Hey Ben -

        Are you working with Benjamin Good on this?  I thought his original idea for using Freebase to annotate Connotea was great. I am very supportive of using Freebase IDs as content tags.

        Could you send me a note about the nature of the candidate tags?  Are they tied to other source vocabularies we could reference?  I'd also be curious if there are other semantic relationships we could draw out between the tags (or other Topic vocabularies) to make navigating the annotated relationships in Connotea more powerful.

        Jamie [at] metaweb

      3. Hi Jamie, 

        Wow, you know about ED already.  Cool! 

         Yes, I am working with Ben Good on this experiment.   Basically, what he is doing is using ED to annotate BioMOBY web services with Freebase Topics (each BioMOBY web service has a unique URI).  In addition to general tags to describe a service, he is also collecting a special set of tags which describe the relationship between the input and output data: i.e.  "Each output is a <insert tag here> about/of the input(s)", e.g. "Each output is a _Homolog_ about/of the input(s)"

        Since the services are all bioinformatics related, my plan was to load terms from a biology/bioinformatics ontology into Freebase.   My choice is the NCI Thesaurus (http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl), which has about 80,000 classes representing all sorts of biological things.

        The main reason we want to use Freebase for this is the great type-ahead feature.    I was planning to keep it simple and just load all the terms under one Type.  So to be honest, it's "throw away" data that I would remove from Freebase after the experiment was done.

        Thanks for your interest :-)

         - Ben Vandervalk

      4. Oh also, you can browse the NCI Thesaurus at http://bioportal.bioontology.org/visualize/13578

         - Ben V.

      5. Hi Ben,

        I've been working on some ontology mapping types based off of some of Jamie's own Web Ontology types. If you import 80,000 topics into Freebase I would be really interested in seeing those Freebase topics mapped back to their original URIs in the NCI Thesaurus.

        To give you some idea of how this would look, I have a basic Freebase to FOAF mapping here

        Shawn 

      6. Cool, nice work :-) That's the proper way to do it, I think.

        My intention is to do something quick and dirty though, just so I can use the Freebase type-ahead feature in our annotation experiment.   Mapping all 80,000 classes and umpteen properties to existing Freebase entities would be a ton of work, and I just don't have time to do that!

        - Ben V.

      7. Hi Ben,

        Have you been communicating with Jamie about this?  Can you send him an email regarding the nature of the candidate tags - his email is jamie [at] metaweb.  Thanks.

      8. Sure, I'll copy my message above to jamie@metaweb.com.

        -Ben V.


  3.  

    Schools in England and Wales

    1. I have a great big XLS file with every school (not universities or nurseries) in England and Wales in it, including: school name, local authority, county, town, street, postcode, telephone number, headteacher, headteacher's degree, type of school, phase of education, and minimum/maximum age. Obviously not all of that would be useful, but I have no idea how to go about importing this into Freebase.
      1. Hi cooksey87,

         I've passed this information on to our data team and someone should be getting back to you shortly!

      2. Thanks, is it possible that this will be done soon?
      3. Cooksey, if you email me the XLS to kirrily@metaweb.com I can help progress it.
      4. Thanks very much, will do. I didn't mean to sound rude, but I might not have had it for much longer.

  4.  

    Link airports as containedby for appropriate locations

    1. Airport is co-typed as location, so I would like to set airport as the containedby value for the correct locations.

      For example, Amsterdam Schiphol should be contained by Amsterdam and Netherlands.

      I wrote a script and ran it on the sandbox. It gets the airport's serves location and adds the airport as a contains value for the serves location and for all locations that contain the serves location.

      For example, Amsterdam Schiphol serves Amsterdam, so Schiphol is added as a contains value for Amsterdam; Amsterdam is contained by Netherlands, so Schiphol is added as a contains value for Netherlands.

      1. Is Schiphol actually contained by Amsterdam?  I am not that familiar with Dutch location containment, but it seems like Schiphol is contained by the municipality of Haarlemmermeer and possibly contained by the city Schiphol-Rijk.

        Are you just adding containment by country, or containment by administrative division also?  If it is also the latter, there could also be the situation where an airport may serve one location, but be contained by a different administrative division.  The example I'm thinking of is Newark Airport is contained by New Jersey, but serves New York City.

      2. You have a point. I think no airport is actually in the city it serves; it would be just outside the city. But nobody says they're flying to Schiphol-Rijk or whatever (unknown, small) town is closest to the airport or contains the airport.

        I guess that's why serves is there. In the case of Schiphol, the administrative division Amsterdam is responsible for Schiphol.

        I will only add the countries containing the airport.

      3. FYI - We are trying to get our hands on some country/airport data also, so if licensing is compatible, we may be able to add country containment to a large percentage of airport topics, and perhaps add new ones!
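
The script described in the opening post, together with the caveat raised in the replies, can be sketched roughly as below. This is not the poster's actual script and it does not call the Freebase API; it only walks in-memory dictionaries. The names `containing_locations`, `serves`, `contained_by` and `countries` are hypothetical stand-ins for the serves and containedby properties, and the sample data is limited to the examples mentioned in the thread.

```python
from collections import deque

def containing_locations(airport, serves, contained_by):
    """Collect the locations an airport serves plus every location that
    (transitively) contains them, using a breadth-first walk."""
    seen = set()
    queue = deque(serves.get(airport, []))
    while queue:
        location = queue.popleft()
        if location in seen:
            continue  # guard against cycles or repeated parents
        seen.add(location)
        queue.extend(contained_by.get(location, []))
    return seen

# Illustrative data only, taken from the examples in the thread.
serves = {
    "Amsterdam Schiphol": ["Amsterdam"],
    "Newark Airport": ["New York City"],
}
contained_by = {
    "Amsterdam": ["Netherlands"],
    "New York City": ["New York"],
    "New York": ["United States"],
}
countries = {"Netherlands", "United States"}

schiphol_all = containing_locations("Amsterdam Schiphol", serves, contained_by)
newark_all = containing_locations("Newark Airport", serves, contained_by)

# Restricting the result to countries, as settled on in reply 2, avoids
# asserting that Newark Airport is contained by New York.
schiphol_countries = schiphol_all & countries
newark_countries = newark_all & countries
```

Walking from the served location gives the wrong state for Newark Airport, which is why the country-only filter is the safer choice when serves and containment can disagree.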


  5.  

    Postsecondary Schools

    1. I'd like to get the list of all postsecondary schools loaded.

      The nasty format is here: http://www.ed.gov/offices/OSFAP/PEPS/dataextracts.html and I can get it into a better format, but the current definition of "school" in Freebase leaves a lot to be desired.

      What should I do?

      1. The types you should be looking at are /education/institution and /education/university. Between them, they should have all the properties for post-secondary schools. (Educational Institution has properties that are common to schools of all levels.)

        If there is data in your dataset that you want to load, and which doesn't match any existing properties, we can talk about adding new properties. The first place to start, though, would be to create a type in your private domain to hold those properties so you can test them out and so that people can review them before adding them to the types in the education domain.

        This looks like a great dataset. Feel free to keep posting questions while you work on it.


  6.  

    Food Recipes
