I have been working for some time on a large-scale import of taxonomic data, reconciled with the topics we already have. I would like to briefly summarize where I'm at and solicit feedback on what is or is not important in the first load.
There are about 38,000 taxa already in the system as topics. These include plants and fungi, for example Erythroxylum ellipticum. Step one is to type these topics. Step two is to add them into a hierarchy with their upper and lower taxa and their rank. I have written tools to do this based on the ITIS database.
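To make step two concrete, here is a minimal sketch of building a hierarchy from parent-pointer records. It is not the actual import tooling: the row layout (ID, parent ID, rank, name) is only assumed to resemble what ITIS provides, the IDs are made-up placeholders, and the lineage is abbreviated to a few ranks.

```python
from collections import defaultdict

# Hypothetical rows: (taxon_id, parent_id, rank, name).
# IDs are placeholders, not real ITIS serial numbers, and the
# lineage is abbreviated (intermediate ranks are left out).
ROWS = [
    (1, None, "Kingdom", "Plantae"),
    (2, 1, "Family", "Erythroxylaceae"),
    (3, 2, "Genus", "Erythroxylum"),
    (4, 3, "Species", "Erythroxylum ellipticum"),
]


def build_index(rows):
    """Index taxa by ID and collect the lower taxa of each parent."""
    by_id = {}
    children = defaultdict(list)
    for taxon_id, parent_id, rank, name in rows:
        by_id[taxon_id] = {"parent": parent_id, "rank": rank, "name": name}
        if parent_id is not None:
            children[parent_id].append(taxon_id)
    return by_id, children


def lineage(by_id, taxon_id):
    """Return (rank, name) pairs from the root down to the given taxon."""
    path = []
    seen = set()
    while taxon_id is not None:
        if taxon_id in seen:
            raise ValueError("cycle in parent pointers")
        seen.add(taxon_id)
        node = by_id[taxon_id]
        path.append((node["rank"], node["name"]))
        taxon_id = node["parent"]
    return list(reversed(path))


by_id, children = build_index(ROWS)
for rank, name in lineage(by_id, 4):
    print(f"{rank}: {name}")
```

Each topic ends up with one higher taxon, a list of lower taxa, and a rank, which is the shape the hierarchy step needs.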
This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:
ITIS
Species 2000
DiscoverLife
NCBI
EOL
International Plant Names Index
Wikispecies
Wikipedia-en taxoboxes
If there are others that folks are aware of please let me know. There are a number of issues related to licensing, style of attribution, quality, and size of these databases.
The ITIS database currently seems like the best one to start with, with the addition of foreign keys for Species 2000 and NCBI. It's a high-quality 'core' set and the data is indeed unencumbered.
DiscoverLife and Species 2000 are both interesting and larger datasets. They both combine approximately 50 databases in one place. This greatly complicates licensing, since they are republishing data that comes from domain experts such as Fishbase. Both SP2000 and DiscoverLife aim to get to the approximately 1.8 million species mark in a few years. I think we should be just as concerned with the richness of the interlinking of the data on Freebase as we are with completeness. For example, being able to query across genes, diseases, and species requires interlinking them all, not just having 20,000 Scarabaeidae unlinked to much else.
I'm also very interested in sources of CC or GFDL images of organisms if anyone has researched that world.
Proposed Taxo import
-
-
-
If you look at the Species page, in the right column there are already many topics that have been filled in with taxonomies all the way up to Domain, using the type Organism Classification. I'm worried that the data load will overwrite the work already done on this type.
-
Hi Jeff.
What I'm proposing to load would not break any data loaded by users, but add to it. It's not feasible to add several hundred thousand things by hand, so we need to work together on this. I will contact you by email and we can chat about how best to collaborate.
-
As you suggested, I'll follow up on the Organism Classification discussion page.
-
I can't find the Organism Classification discussion page, but wondered if there was any progress on this. I see there are about 212 organism classification entries so far and wanted to know the progress of the proposed bulk upload, especially in the Animal kingdom. I have some of this data available but perhaps someone else is going to upload it? Thoughts?
-
On the type page for Organism Classification, in the Actions window click Discuss "Organism Classification". I don't know the status of the data upload that jg is working on.
-
Hi Hilary
I have made some progress on the bulk upload. There were 70,000+ Wikipedia topics to reconcile with, but I'm almost done, so I expect to load something on sandbox this week. I'll post more when it's uploaded for inspection. After the 70,000 load I'll do 400,000 more from ITIS, again on sandbox.
-
> This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:
> ITIS
> Species 2000
> DiscoverLife
> NCBI
> EOL
> International Plant Names Index
> Wikispecies
> Wikipedia-en taxoboxes
There's Tree of Life, which I've mentioned before but which apparently has a licensing issue: http://www.tolweb.org/tree/home.pages/downloadtree.html
( http://www.dbfordummies.com/Example/Ex710.asp )
( http://paste.uni.cc/11838 )
For butterflies and moths there is BAMONA and All Leps.
(BAMONA isn't easily accessible online; however, the maintainer Thomas Naberhaus is willing to create an extract for use elsewhere. I'm working with him on that just now, in fact, for a Flickr project to create a leps field guide.)
http://www.lepbarcoding.org/files/nth_am_lep_full_checklist.xls
I have a question which may be more appropriate in some techie thread, but I'll ask it here anyway since the specific example that first brought it to mind was the TOL dump: will there be facilities for importing via XML? It seems at least as useful as spreadsheet imports.
Spreadsheets seem fine for flat data typical of a relational database, but I hope that Freebase has support for hierarchical data as well.
-
This question is probably better discussed on the developers mailing list.
However, the reason that spreadsheets are supported first is that it is easy for non-technical people with domain expertise to understand a spreadsheet, and a mapping is fairly straightforward.
Freebase data is not hierarchical; it is a general graph, which can represent hierarchies, but isn’t constrained the way XML is. That means that the mappings are more complex, and probably better handled with a custom application, at least until we come up with some superduper UI assistant.
If the data you want to import can be flattened, then a flat import tool can be used, but otherwise, you will probably need some kind of application that comprehends and maps your data.
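As a rough illustration of what "flattening" could look like for a tree-shaped XML file, here is a small sketch that turns nested elements into one row per node, with the parent's name as a column. The element name `node` and the attributes `name` and `rank` are hypothetical; this is not the format of the Tree of Life dump, nor a format that any Freebase import tool is known to expect.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, made up for this example.
XML = """
<node name="Lepidoptera" rank="Order">
  <node name="Papilionidae" rank="Family">
    <node name="Papilio" rank="Genus"/>
  </node>
  <node name="Nymphalidae" rank="Family"/>
</node>
"""


def flatten(element, parent_name=""):
    """Return one (name, rank, parent_name) row per node, depth first."""
    name = element.get("name")
    rows = [(name, element.get("rank"), parent_name)]
    for child in element.findall("node"):
        rows.extend(flatten(child, name))
    return rows


rows = flatten(ET.fromstring(XML.strip()))
for row in rows:
    # Tab-separated output could be pasted into a spreadsheet.
    print("\t".join(row))
```

This only works because each node here has exactly one parent. Data that is a general graph, where a node can have several parents or several kinds of links, would need one row per link instead, which is where a custom application starts to make more sense.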
-