Discussions on Loading Data and Contributing Large Datasets
-
-
Ignore the last paragraph in this document. The correct procedure to request an increase in write quotas is to open a ticket in Jira http://bugs.freebase.com
-
No, that's not quite true. We'll respond to requests for quota increases whatever forum you ask for them in. You just have to ask clearly and we'll see it, whether it's here in the discussions, on the dev list, or on Jira.
Note that it may take us a few days to get to it, and that we will want to review your work before increasing your quota.
-
-
-
Does a "create" : "unless_exists" which returns "existed" count against my quota?
If I cotype something with three additional types in a single write, how many does that count? I'm guessing three.
How about creating a CVT with a date and two outgoing links (3 properties) and connecting it all in a single write? 1, 2, 4, other?
And just for good measure, when does the count get reset? I'm trying to make sure that I don't run out of quota in the middle of a midnight debug session. :-)
-
I asked this question on the developers' email list and was told that the set of queries below would return the information needed (replace 'tfmorris' with your username and adjust the dates appropriately), although I found the results to be off by more than 25% and I'm not sure why.
I was also told:
- it's a rolling 24-hour clock, so no fixed reset time
- the sandbox isn't currently write throttled
'http://api.freebase.com/api/service/mqlread?queries={"q0":{"query":{"creator":"/user/tfmorris","return":"count"}},"q1":{"query":{"type":"/type/link","creator":"/user/tfmorris","return":"count"}},"q2":{"query":{"creator":"/user/tfmorris","return":"count"},"as_of_time":"2009-02-01"},"q3":{"query":{"type":"/type/link","creator":"/user/tfmorris","return":"count"},"as_of_time":"2009-02-01"}}'
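For anyone who would rather build that URL in a script than edit it by hand, here is a minimal Python sketch. It only assembles the same four count queries and subtracts the "as of" counts from the current ones; the function names are made up for illustration, the endpoint is copied from the post above and may no longer be reachable, and how the two differences relate to the actual quota is not something this thread confirms.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the post above; it is not contacted by this sketch.
MQLREAD_URL = "http://api.freebase.com/api/service/mqlread"


def build_count_queries(username, as_of_time):
    """Build the four count queries: objects and links, now and as of a past time."""
    creator = "/user/" + username
    objects = {"creator": creator, "return": "count"}
    links = {"type": "/type/link", "creator": creator, "return": "count"}
    return {
        "q0": {"query": objects},
        "q1": {"query": links},
        "q2": {"query": objects, "as_of_time": as_of_time},
        "q3": {"query": links, "as_of_time": as_of_time},
    }


def build_url(username, as_of_time):
    """Return the mqlread URL with the queries envelope URL-encoded."""
    queries = build_count_queries(username, as_of_time)
    return MQLREAD_URL + "?" + urlencode({"queries": json.dumps(queries)})


def created_since(counts):
    """Given integer counts for q0..q3, return how many objects and links
    were created after as_of_time (current count minus past count)."""
    return {
        "objects": counts["q0"] - counts["q2"],
        "links": counts["q1"] - counts["q3"],
    }
```

To approximate a rolling 24-hour window, as_of_time would be set to 24 hours before the moment the query is run.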
-
-
-
I have been trying to upload lists of filming locations. I have lists for a lot of different cities in the United States that I would like to upload, but when I use the list uploader for Filming location, I have no option to add the films shot in those locations as well. Can this be added, or is there some way, currently, for me to do this?
-
-
-
I would like to add about 20,000 links from people to their NNDB pages. To do this I've created the NNDB Profile Page type.
I've gone through all the people on Freebase and matched them to their NNDB page by name. In the cases where several people share the same name, I simply ignore those pages.
Would it be appropriate to upload this data to the sandbox?
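For readers wondering what "matched by name, ignoring shared names" looks like in practice, here is a rough Python sketch. It is not the script actually used; the record layout (ID, name pairs), the function names and the case/whitespace normalization are all assumptions made for illustration.

```python
from collections import defaultdict


def index_by_name(records):
    """Group record IDs by lower-cased, whitespace-normalized name."""
    index = defaultdict(list)
    for record_id, name in records:
        index[" ".join(name.lower().split())].append(record_id)
    return index


def match_unambiguous(freebase_people, nndb_pages):
    """Pair a Freebase person with an NNDB page only when the name is
    unique on both sides; any name shared by several people is skipped."""
    fb = index_by_name(freebase_people)
    nndb = index_by_name(nndb_pages)
    matches = {}
    for name, fb_ids in fb.items():
        nndb_ids = nndb.get(name, [])
        if len(fb_ids) == 1 and len(nndb_ids) == 1:
            matches[fb_ids[0]] = nndb_ids[0]
    return matches
```

Skipping ambiguous names trades coverage for precision: some valid links are lost, but two different people with the same name are never merged.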
-
It would indeed. Please go ahead.
-
Ok, the complete set of links has been written to the sandbox. Please check out the results and let me know if I can write them to the main database.
-
The data looks good!
However, I lied when I said earlier, on the mailing list, that the IMDB Profile Page model was the way to go. We now have the ability to use keys into an external database as a way to generate URIs, which further provides uniqueness checking. I am working on converting our IMDb references into this form. It would be great if you could wait on the final NNDB load and model it that way; I will be happy to show you how once I figure it out myself. (-:
-
Sounds like a great way to model these things. I look forward to learning how to use this new technique.
-
Oh, yeah! It’s totally done and I forgot to come back here.
First, I made namespaces (/authority/imdb/film, name, character). Then I made properties that enumerate those namespaces and attached URI templates to them.
Check out the IMDb profile property on film. It expects Enumeration as its type. Then you have to get a little fancy and switch to the admin view. Set the Enumeration property of the property to the expected namespace. Then co-type the property as a Foreign key property and set a URI template. Set its type to URI Template and fill in both the canonical template (used to generate and recognize URIs) and the other templates (used only for recognition).
The schema UI will support this at some point… just not yet, as it’s kind of a power-user thing.
-
Ok, I've created the namespace, I've set up the enumeration property on the NNDB Person type and I've attached a URI template to that property.
Then I added a sample key to the Paul Newman topic and the NNDB link shows up as expected. Unfortunately, the ID has a forward slash in it which gets escaped and breaks the link. I went back to the Foreign Key Property and explicitly disabled URI encoding but that hasn't fixed it. Any ideas on how to handle this?
Coincidentally, tsegaran added a foreign key to a NYT page on the same Paul Newman topic and his key also contains forward slashes but he seems to have entered the weblink separately without using a URI template.
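To make the escaping problem concrete, here is a small Python sketch of what the link generation would need to do: unescape the stored key before dropping it into the template. It assumes keys are escaped as "$" followed by four hex digits (so "/" becomes "$002F") and that the safe set is letters, digits, "_" and "-"; the "{key}" placeholder, the example.org template and the sample key are hypothetical, not taken from the actual schema.

```python
import re

# Assumption: characters outside this safe set are stored in a key as "$"
# plus four uppercase hex digits, e.g. "/" -> "$002F".
_SAFE = re.compile(r"[A-Za-z0-9_-]")
_ESCAPED = re.compile(r"\$([0-9A-Fa-f]{4})")


def quote_key(raw):
    """Escape a raw external ID so it can be stored as a key."""
    return "".join(ch if _SAFE.fullmatch(ch) else "$%04X" % ord(ch) for ch in raw)


def unquote_key(key):
    """Turn "$XXXX" escapes back into the characters they stand for."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), key)


def expand_template(template, key):
    """Fill a hypothetical "{key}" placeholder with the unescaped key."""
    return template.replace("{key}", unquote_key(key))
```

With a made-up key, quote_key("123/000456789") gives "123$002F000456789", and expanding "http://example.org/people/{key}/" with that stored key restores the slash in the resulting URL.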
-
Ah, yes, the char escaping with keys and URI templates... I am in the process of converting the NYT keys to use URI templating, and I also ran into that problem. There's a bug filed to have the UI behave properly when it encounters escaped URLs - I'll post back when there's a status update to this.
Toby (tsegaran) actually added a key, and created a discrete weblink (it's not using URI templating). When I implement the URI templating, I'll be removing the superfluous weblink.
BTW, good work!
-
This sounds like simply a bug - you're right that the NYTimes links got added separately, and they'll probably need to be fixed. It may be too late to get a fix in for next week's release, but I'll try. For my and others' reference, this is CLI-4538 in our bug system.
-
Ok, thanks guys. I'll watch for CLI-4538 in the release notes.
-
Looks like everything is working fine now in the new release. Thanks for fixing this.
-
I've uploaded a new version of the data to the sandbox. If no one has any objections, I'll add it to the main site.
Is there a limit on how many writes I can do on the main site? Will I be able to make 19,000 writes in a day?
-
I believe the normal limit is 10,000 writes. I'll see if I can get your limit increased.
-
Thanks Brian. The exact count should be 19,619 writes to the API with 2 properties being updated each time.
-
Any luck getting my limit increased?
-
I'll try to get it done before the sandbox refresh (Monday PM PST) so you can test against sandbox once more before going live.
-
Shawn, your limit is now 25K, on sandbox and www. Happy loading!
-
Thanks for updating the limit. I ran one more test on sandbox and then uploaded them to www and exceeded my limit. I guess that's 25k combined sandbox & www so I'll upload the rest tomorrow.
-
I tried running it again today and I exceeded my limit again about halfway through. Is it 25k writes or 25k facts? This only happens on www. The sandbox was able to write the whole dataset at once.
-
Just noticed that the "limit exceeded" error message says that my max_writes is 10,000 per day. I guess www is still using the old limit.
-
Hmmm, we just recently moved datacenters, and in the shuffle, your write limit which was previously upped, might have gotten reverted. I'll check it out and get back to you...
-
Additionally, are you adding keys to 25K topics, or something less? Co-typing and adding the NNDB key is 2 primitives, so if you are trying to do that for 25K topics, we really need to set your limit to 50K.
-
I'm adding the NNDB Person type to the topic and adding the key, so if the limit is on the number of primitives rather than the number of writes then I would need a limit of 50k.
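The arithmetic behind that, as a quick Python sketch. It assumes the limit counts primitives rather than API calls, that each topic costs exactly 2 primitives (one for the type, one for the key), and that the full daily budget is available on each day; the function name is made up.

```python
import math


def days_needed(topics, primitives_per_topic, daily_limit):
    """Return (total primitives, whole days needed at the given daily limit)."""
    total = topics * primitives_per_topic
    return total, math.ceil(total / daily_limit)


# 19,619 topics, each co-typed and given a key (2 primitives per topic).
total, days_at_25k = days_needed(19619, 2, 25000)   # 39,238 primitives, 2 days
_, days_at_10k = days_needed(19619, 2, 10000)       # 4 days at the default limit
_, days_at_50k = days_needed(19619, 2, 50000)       # 1 day at a 50K limit
```

This is consistent with the load stopping about halfway through at 25K if primitives are what is being counted.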
-
-
-
Hi :-)
I'm a grad student, and I'm helping to develop an experiment which will ask people to annotate web services with tags. The experiment is part of the ED project, described at http://www.connotea.org/wiki/EntityDescriber.
What I want to do is load the list of candidate tags into Freebase as Topics, all under a single Type. However, there are a lot of them. (I estimate about 100,000.) I expect that I will hit the user upload limit when I try to do this, and I was wondering if you could raise the limit temporarily for me? (50,000 assertions per day would be okay, and it could be set back to normal in two weeks.)
Thanks for your help,
- Ben Vandervalk
Masters Student
University of British Columbia
Wilkinson Lab
-
Hi Ben,
I've passed your request on to the data team - someone should be replying to you shortly.
-
Hey Ben -
Are you working with Benjamin Good on this? I thought his original idea for using Freebase to annotate Connotea was great. I am very supportive of using Freebase IDs as content tags.
Could you send me a note about the nature of the candidate tags? Are they tied to other source vocabularies we could reference? I'd also be curious if there are other semantic relationships we could draw out between the tags (or other Topic vocabularies) to make navigating the annotated relationships in Connotea more powerful.
Jamie [at] metaweb
-
Hi Jamie,
Wow, you know about ED already. Cool!
Yes, I am working with Ben Good on this experiment. Basically, what he is doing is using ED to annotate BioMOBY web services with Freebase Topics (each BioMOBY web service has a unique URI). In addition to general tags to describe a service, he is also collecting a special set of tags which describe the relationship between the input and output data: i.e. "Each output is a about/of the input(s)", e.g. "Each output is a _Homolog_ about/of the input(s)"
Since the services are all bioinformatics related, my plan was to load terms from a biology/bioinformatics ontology into Freebase. My choice is the NCI Thesaurus (http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl) , which has about 80,000 classes representing all sorts of biological things.
The main reason we want to use Freebase for this is because of the great type-ahead feature. I was planning to keep it simple and just load all the terms under one Type. So to be honest, it's "throw away" data, that I would remove from Freebase after the experiment was done.
Thanks for your interest :-)
- Ben Vandervalk
-
Oh also, you can browse the NCI Thesaurus at http://bioportal.bioontology.org/visualize/13578
- Ben V.
-
Hi Ben,
I've been working on some ontology mapping types based off of some of Jamie's own Web Ontology types. If you import 80,000 topics into Freebase I would be really interested in seeing those Freebase topics mapped back to their original URIs in the NCI Thesaurus.
To give you some idea of how this would look, I have a basic Freebase to FOAF mapping here.
Shawn
-
Cool, nice work :-) That's the proper way to do it, I think.
My intention is to do something quick and dirty though, just so I can use the Freebase type-ahead feature in our annotation experiment. Mapping all 80,000 classes and umpteen properties to existing Freebase entities would be a ton of work, and I just don't have time to do that!
- Ben V.
-
Hi Ben,
Have you been communicating with Jamie about this? Can you send him an email regarding the nature of the candidate tags - his email is jamie [at] metaweb. Thanks.
-
Sure, I'll copy my message above to jamie@metaweb.com.
-Ben V.
-
-
-
I have a great big xls file with every school (not universities or nurseries) in England and Wales in it, including: school name, local authority, county, town, street, postcode, telephone number, headteacher, headteacher's degree, type of school, phase of education, and minimum/maximum age. Obviously not all of that would be useful, but I have no idea how to go about importing this into Freebase.
-
Hi cooksey87,
I've passed this information on to our data team and someone should be getting back to you shortly!
-
Thanks, is it possible that this will be done soon?
-
Cooksey, if you email me the XLS to kirrily@metaweb.com I can help progress it.
-
Thanks very much, will do. I didn't mean to sound rude, but I might not have had it for much longer.
-
-
-
Airport is co-typed as location, so I would like to set the containedby value of each airport to the correct locations.
For example Amsterdam Schiphol should be contained by Amsterdam and Netherlands.
I have a script, which I have run on the sandbox. It gets the location an airport serves, then adds the airport as a contains value for that location and for every location that contains it.
For example, Amsterdam Schiphol serves Amsterdam, so Schiphol is added as a contains value for Amsterdam; Amsterdam is contained by the Netherlands, so Schiphol is also added as a contains value for the Netherlands.
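A rough Python sketch of the logic described in this post, for anyone who wants to reason about what it would write. It is not the actual script; the two dictionaries standing in for the "serves" and "contained by" data, and the function names, are assumptions for illustration.

```python
def containing_locations(location, contained_by):
    """Return the location itself plus everything that transitively contains it."""
    seen = []
    stack = [location]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the containment data
        seen.append(current)
        stack.extend(contained_by.get(current, []))
    return seen


def contains_links(airport, serves, contained_by):
    """List (container, airport) pairs: the airport becomes a contains value
    of each served location and of every location containing that one."""
    links = []
    for served in serves.get(airport, []):
        for container in containing_locations(served, contained_by):
            pair = (container, airport)
            if pair not in links:
                links.append(pair)
    return links


serves = {"Amsterdam Schiphol": ["Amsterdam"]}
contained_by = {"Amsterdam": ["Netherlands"]}
links = contains_links("Amsterdam Schiphol", serves, contained_by)
```

Note that this treats "serves" as if it implied containment, which is exactly the assumption the approach rests on.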
-
Is Schiphol actually contained by Amsterdam? I am not that familiar with Dutch location containment, but it seems like Schiphol is contained by the municipality of Haarlemmermeer and possibly contained by the city Schiphol-Rijk.
Are you just adding containment by country, or containment by administrative division also? If it is also the latter, there could also be the situation where an airport may serve one location, but be contained by a different administrative division. The example I'm thinking of is Newark Airport is contained by New Jersey, but serves New York City.
-
You have a point. I think no airport is actually in the city it serves; it would be just outside the city. But nobody says they're flying to Schiphol-Rijk or whatever (unknown small) town is closest to the airport or contains it.
I guess that's why serves is there. In the case of Schiphol, the administrative division Amsterdam is responsible for Schiphol.
I will only add the countries containing the airport.
-
FYI - We are trying to get our hands on some country/airport data also, so if licensing is compatible, we may be able to add country containment to a large percentage of airport topics, and perhaps add new ones!
-
-
-
I'd like to get the list of all postsecondary schools loaded.
The nasty format is here: http://www.ed.gov/offices/OSFAP/PEPS/dataextracts.html and I can get it into a better format but the current definition of "school" in freebase leaves a lot to be desired.
What should I do?
-
The types you should be looking at are /education/institution and /education/university. Between them, they should have all the properties for post-secondary schools. (Educational Institution has properties that are common to schools of all levels.)
If there is data in your dataset that you want to load, and which doesn't match any existing properties, we can talk about adding new properties. The first place to start, though, would be to create a type in your private domain to hold those properties so you can test them out and so that people can review them before adding them to the types in the education domain.
This looks like a great dataset. Feel free to keep posting questions while you work on it.
-
-
-
Hello, I have found that there are a lot of recipes available in the public domain in MealMaster format or, often, as RML (recipe XML files). I have written a script to convert those formats and insert them into my SQL database. I am sure this is very valuable information for Freebase, and it would also help me write my application without needing to store all the recipes myself.
The list of recipes I am talking about is here: http://dsquirrel.tripod.com/recipeml/indexrecipes2.html
Please take a look at one of the files and tell me if this is valuable information for Freebase, and I will start experimenting on sandbox.freebase.com to see how the data will be stored. Also, I did not see a category for recipes here on Freebase; are you planning to include one at a later stage?
Thanks,
Kiril
-
Hi Kiril-
There was some discussion in the "Food" domain awhile back about adding recipes. You might want to post something there and see if anyone else is starting to work on it. You can read that discussion here.
-
The link is broken or leads to nothing?! Is there any solution for adding recipes to Freebase?
sebastian
-
Hi Sebastian,
User skud was looking for reviewers on her food domain. I would think that would be a good starting point for developing a recipe type that could use those types. Perhaps you may want to collaborate with other users on creating a recipe type? You could start a discussion on the data-modeling list (if you're not part of the list, you can join here).
-
-
-
Hi there,
I have a data set of about 38,000 artists, each with a MusicBrainz ID and a list of the artists most related to it according to last.fm.
I have about 60,000 more without a MusicBrainz ID and I can obviously retrieve more data through the last.fm web services. My plan is to add all artists with a MusicBrainz ID that are more than 80% similar to an artist (about two or three artists, usually) as a 'similar artist' relation. Is that ok?
As an example, I added links for about 100 artists to the sandbox. I do my lookups based on the MusicBrainz ID, and do not create new artists. See: Édith Piaf
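In code terms, the selection rule would look roughly like the Python sketch below. The tuple layout (source ID, target ID, score between 0 and 1), the function name and the placeholder IDs are assumptions for illustration; it does not call last.fm or Freebase.

```python
def similar_artist_links(similarities, known_mbids, threshold=0.8):
    """Keep (source, target) pairs whose similarity is above the threshold
    and where both artists already have a known MusicBrainz ID, so that
    no new artist topics would have to be created."""
    links = []
    for source, target, score in similarities:
        if score > threshold and source in known_mbids and target in known_mbids:
            links.append((source, target))
    return links


# Placeholder IDs, not real MusicBrainz IDs.
known = {"mbid-a", "mbid-b", "mbid-c"}
sample = [
    ("mbid-a", "mbid-b", 0.93),   # kept
    ("mbid-a", "mbid-c", 0.80),   # dropped: not more than 80%
    ("mbid-a", "mbid-x", 0.99),   # dropped: target has no known ID
]
links = similar_artist_links(sample, known)
```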
Any comments? Should I somehow link the artists to their last.fm page after processing? Add their last.fm urlname as a key, perhaps?
Thanks,
- Jeroen
-
Thanks, Jeroen. First, please be sure that the data can be contributed legally; we did not collect the artist similarity information from MusicBrainz, for instance, because it is licensed CreativeCommons-Non-Commercial, and as a commercial enterprise, we can’t legally use that data.
If you have permission to load that data, then this sounds like a great idea! You could also add the last.fm page as a Web link. For lookup, since last.fm and Freebase both use MusicBrainz keys, a key is probably not needed.
-
Ugh, right, I forgot to check the NonCommercial part. Never mind...
-
-
-
I've just set up an Exchange rate compound value and an Exchanged currency type for handling exchange rates for currency. I've currently associated one value to US $ and would like to do a much larger import of exchange rate data from the Federal Reserve Bank.
-
Can you give us an estimate of how large a data-set you're talking about?
-
Right now, I've got USD-AUD from 1990 to 1999, which is 2561 records. I'm getting the data from: http://www.federalreserve.gov/releases/h10/Hist/default1999.htm. I have a script ready to go to do the import of the AUD-USD dataset, and afterwards I'd like to import more recent data, and then start to import other currencies from the reserve bank's datasets.
-
You've generated a lot of interest here over your dataset, and we'd like to work with you to put your types in the top-level /finance domain. What we probably would do is hook the "exchange rate" type to the existing "currency" type, rather than keeping the "exchanged currency" type. What do you think?
-
That was what I originally wanted to do, but realized I obviously couldn't. Having made exchanged currency though, I'm not 100% sure that it's not a better option now. The basic idea is that not all currencies are exchanged. On the other hand, I think just having a blank target/source field may be enough to let you know it's not exchanged :)
-
I've connected "exchange rate" to the currency type; it was more complicated to do than I expected, so please take a look at the types and let me know whether I did it correctly. If it's set up correctly, we can move the "exchange rate" type to the finance domain.
-
I looked at the currency pages and they look fine, and the schema looks fine, so I'm going to assume it was connected correctly. I've been trying to figure out if we want to do reciprocal exchange rates, since with AUD/USD I can calculate USD/AUD easily, would it be a good idea to add both records, or to just let people do the calculations themselves?
-
I've moved "exchange rate" to the finance domain. This change will be copied to sandbox tonight, so you will be able to try out your import there. I'm not sure about the reciprocal exchange rates, but I'll ask around.
-
I think that because the source and target are explicitly modeled there is no need to add the reciprocal rates since an application can just reverse them.
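As a sketch of what "just reverse them" means for an application, here is a short Python example. The record layout (source, target, date, rate) is a made-up stand-in for the exchange rate type, and the rate value is invented for the example; Fraction is used only so the inverse is exact.

```python
from fractions import Fraction


def reciprocal(record):
    """Swap source and target and invert the rate; the date is unchanged."""
    return {
        "source": record["target"],
        "target": record["source"],
        "date": record["date"],
        "rate": 1 / record["rate"],
    }


# Invented value for illustration: 1 AUD = 0.8 USD on some date.
aud_usd = {"source": "AUD", "target": "USD", "date": "1990-01-02", "rate": Fraction("0.8")}
usd_aud = reciprocal(aud_usd)  # rate becomes Fraction(5, 4), i.e. 1.25
```

Because the inverse is derived on demand, only one direction needs to be stored, which also halves the number of writes for the import.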
-
I attempted an import on sandbox today. It seems not to have worked out so well. After doing a lot of reformatting on the date to get it into a form that the API would accept, I ended up getting a 503 every time I tried to do the import. I didn't see anything on sandbox's web interface, so I kept trying. Much later I looked and there were records, but it didn't look like they'd all been imported, so I made my import script do the imports in 100-unit segments. I then started to get a "2 entries match" error, so I looked that up and discovered that I had duplicate records even though I'd used unless_exists in my create clause! I'm not exactly sure what to do at this point. Is there a way to delete all the records in a type on sandbox?
-
There's no quick way to delete anything. You could de-type them, though, which would get them out of the way of your scripts, at least. Questions about scripts, MQL, and the API can be posted to the Freebase developers' list, also, where you'll probably get a quicker response than the 17 hours it took for this (sorry! my RSS reader lost this message for some reason): http://lists.freebase.com/mailman/listinfo/developers.
-
I've subscribed to developers now. Thanks for the detyping tip. I de-typed all the entries and imported 3 datasets (1971-1989, 1990-1999, 2000-2008) for USD-AUD into sandbox and everything looks ok. I did get a 400 error about midway into doing 1971-1989 but I just restarted it from the last entry that was added and it worked out.
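For anyone attempting a similar load, here is a minimal Python sketch of the two tactics mentioned in this thread: writing in segments of 100 and restarting from the last entry that was added. The send_batch callable is hypothetical (it stands in for whatever performs the actual write and is assumed to raise IOError on an HTTP error), and keying records by a "date" field is also an assumption.

```python
import time


def batches(records, size=100):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def remaining_after(records, last_written_key, key=lambda r: r["date"]):
    """Skip everything up to and including the last record known to be written.
    If that record is not found, return all records."""
    for index, record in enumerate(records):
        if key(record) == last_written_key:
            return records[index + 1:]
    return records


def import_all(records, send_batch, size=100, retries=3, delay=0.0):
    """Send records in batches, retrying a failed batch up to `retries` times.
    Returns the number of records sent."""
    written = 0
    for batch in batches(records, size):
        for attempt in range(retries):
            try:
                send_batch(batch)
                break
            except IOError:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)
        written += len(batch)
    return written
```

One caution, given the experience described above: a batch that returns an error may still have been written, so blindly retrying can create duplicates. Checking what actually landed before resending is the safer route.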
-
-
-
We would like to use freebase to store census data of India. There are over 120 fields and many thousands of records. Say over 300,000 records. Our idea is to also add GIS info for these records eventually. For now, it would be great if we can upload a few thousand records, which are available as xls files.
-
dinesh - This sounds like a very interesting data set. As you may have noticed we have been somewhat selective in our use of the US Census data, trying to identify the level of detail that is useful to people as well as integrating the Census data across Freebase domains. I'd be very interested to work with you to select and map the India Census fields as well as cull the appropriate level of detail from the records (for instance, we have deliberately omitted the "block" level of detail in the US Census.)
-
I did not get a notification of this response, even though I am watching this thread. Thus my delay in response.
Good to know you can assist. Look forward. What next?
Let me know how we can communicate further. Shall I upload a few files of census data somewhere and make it available to you?
BTW, I have not found a discussion on US census data on Freebase. Can you give me a link? In fact I wanted to look at any census data type but could not locate them (on Freebase).
d
ps: cc replies to my email if you can
-
-
-
I've just signed on with an interest in quantitative economic data (and other stuff, but let's start there). First, I don't see any such data available; a query only produced a definition. Since much of this is government prepared it is public domain and I'd like to begin uploading it with currently active time series and some interesting historical series. My vision was that if we had a fair number of others involved we could easily allocate the tasks of regular upload of current data as well. I'm not a developer, so I need guidance, but I do know data and where it is available in a public domain format. Let me know how to proceed (and when if it is still too soon). Of course, if I've missed it and this kind of data is already there, let me know. Thanks, Bob A
-
Hi Bob- The best way to get something like this going is to create the data types in your private domain, then try loading a small subset of the data to see if your model is working or not (you don't need to dump it all up there). Getting the model right is actually the hard part. :-) Once you're happy with the model, add a discussion post to the domain you think your data would be most appropriate for, and the admins can help you take it from there. And feel free to post any questions that come up during the modeling process. Good luck - and have fun!
-
Hi Bob, I'm interested in time series as well, and have initiated a discussion of freebase support for them associated with the Data Modelling entry in the help pages. fyi, I'm developing a wiki environment for collaborative analysis of public domain data, especially time series. I'm investigating the possibility of using Freebase as the data engine. Regards, Mike
-
-