Discussions on Biology

  1.  

    monotypic taxa

    1. If a taxon is monotypic (having only one member, such as a genus that has only 1 species), there is no Freebase topic for the monotypic taxon.  Instead, the monotypic taxon is listed as an alias of the next lower taxon that has >1 member, or if there is only one member of the group, that member.

      For example, the genus Gavia (Loons) is the only member of the family Gaviidae, which is the only family in the order Gaviiformes.  There is only a topic for Gavia, and Gaviidae and Gaviiformes are listed as aliases.

      My instinct was to add topics for the family and order, so that we could programmatically answer questions like "what order is the Red-throated Loon in?" -- it seems wrong to have structured data in the alias.  But it looks like this was a deliberate decision that was made when the topics were loaded, so I thought I'd better throw it out there.

      (I have a list of birds from the Americas that I wanted to add data for -- there were a couple of hundred genera for which there is no Freebase topic, and a few random searches revealed that many of these genera had only 1 species.)

      1. This wasn’t a deliberate decision within Freebase, but a side-effect of the import from Wikipedia. The Wikipedia article has no problem being about a genus, a family, and an order, all at once. However, the Freebase topic asserts that it is about the genus; there should be discrete topics for the family and the order. I wonder how easy it would be to automatically find these Wikipedian conflations.
      2. Since an organism_classification instance has a rank, I think we definitely want to add the additional ranks and not conflate the topics.

      3. +1 for separate topics.
      4. +1 for separate topics.
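
On the question of finding these conflations automatically, here is a minimal sketch of one possible heuristic: flag aliases whose endings follow the usual naming conventions for ranks above genus. The topic layout (a dict with `name`, `rank` and `aliases`), the suffix table and the sample data are all assumptions made for illustration, not the Freebase schema or API, and a suffix match would only produce candidates for manual review.

```python
# Hypothetical suffix heuristics: conventional endings for ranks above genus.
RANK_SUFFIXES = {
    "idae": "family (animals)",
    "iformes": "order (birds, fishes)",
    "aceae": "family (plants)",
    "ales": "order (plants)",
}


def suspected_conflations(topics):
    """Yield (topic name, alias, guessed rank) for aliases that look like higher taxa."""
    for topic in topics:
        for alias in topic.get("aliases", []):
            for suffix, rank in RANK_SUFFIXES.items():
                if alias.lower().endswith(suffix):
                    yield topic["name"], alias, rank
                    break


# Made-up sample topics, loosely based on the Gavia example in this thread.
topics = [
    {"name": "Gavia", "rank": "genus", "aliases": ["Loons", "Gaviidae", "Gaviiformes"]},
    {"name": "Crocus vernus", "rank": "species", "aliases": ["Dutch Crocus"]},
]

flagged = list(suspected_conflations(topics))
for name, alias, rank in flagged:
    print(f"{name}: alias '{alias}' looks like a {rank}")
```

Each flagged alias would then be a candidate for its own topic with the appropriate rank.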




  2.  

    "Found in" property?

    1. I'm interested in gardening (the landscaping kind) and a couple of the big issues/topics are using "native" plants and not using "invasives". I know in the US the states keep lists of what plants are considered invasive, or potentially invasive. Native is a whole can of worms (no agreement on what is really native), but we could at least use the plants.usda.gov info to show where the plants were found to be growing.  It would present the data in a better way for analysis.

      Related to this are animals in a given area.  It would be great to be able to find what snakes, for example, live in a certain state or country.  Or where else you can find your favorite bird, including see where it winters.  I know there are databases out there that track at least the birds, but not sure if we can incorporate that somehow?

      I'm not sure how any of this could fit in exactly, but I wanted to mention it in case anyone had any ideas.

      1. I agree, this would be great.  Would it be worth adding a separate type(s)?  For birds, something like "Migrating Bird" with Locations for where it spends summers and winters ...  Maybe it's worth having a separate "Plant" type anyway, to capture perennial vs annual as well as native status ...
      2. Ooh, birds. You may want to talk to spatialed, from what I understand he's quite the expert on bird conservation. He's the creator of the Birds domain.



  3.  

    Proposed Taxo import

    1. I have been working for some time on a large scale import of taxo data, reconciled with the topics we already have. I would like to briefly summarize where I'm at and solicit feedback on what is or is not important in the first load.

      There are about 38,000 taxa already in the system as topics. These include plants and fungi. For example Erythroxylum ellipticum. Step one is to type these things. Step two is to add them into a hierarchy with their upper and lower taxa and their rank. I have written tools to do this based on the ITIS database.

      This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:

      ITIS
      Species 2000
      DiscoverLife
      NCBI
      EOL
      International Plant Names Index
      Wikispecies
      Wikipedia-en taxoboxes

      If there are others that folks are aware of please let me know. There are a number of issues related to licensing, style of attribution, quality, and size of these databases.

      The ITIS database currently seems like the best one to start with, with the addition of foreign keys for Species 2000 and NCBI. It's a high quality 'core' set and the data is indeed unencumbered.

      DiscoverLife and Species 2000 are both interesting and larger datasets. They both combine approx 50 databases in one place. This greatly complicates licensing since they are republishing data that comes from domain experts such as Fishbase. Both SP2000 and DiscoverLife aim to get to the approximately 1.8 million species mark in a few years. I think we should be just as concerned with the richness of the interlinking of the data on freebase as we are with completeness. For example being able to query across genes, diseases and species requires interlinking them all, not just having 20,000 Scarabaeidae unlinked to much else.

      I'm also very interested in sources of CC or GFDL images of organisms if anyone has researched that world.

      1. If you look at the Species page in the right column there are already many topics that have been filled in with taxonomies all the way up to Domain, using the type Organism Classification. I'm worried that data load will overwrite the work already done on this type.

      2. Hi Jeff.
        What I'm proposing to load would not break any data loaded by users, but add to it. It's not feasible to add several hundred thousand things by hand so we need to work together on this. I will contact you by email and we can chat about how best to collaborate.

      3. As you suggested, I'll follow up on the Organism Classification discussion page.

      4. I can't find the Organism Classification discussion page, but wondered if there was any progress on this. I see there are about 212 organism classification entries so far and wanted to know the progress of the proposed bulk upload, especially in the Animal kingdom. I have some of this data available but perhaps someone else is going to upload it? Thoughts?

      5. On the type page for Organism Classification, in the Actions window click Discuss "Organism Classification". I don't know the status of the data upload that jg is working on.

      6. Hi Hilary
        I have made some progress on the bulk upload. There were 70,000+ Wikipedia topics to reconcile with but I'm almost done, so I expect to load something on sandbox this week. I'll post more when it's uploaded for inspection. After the 70,000 load I'll do 400,000 more from ITIS, again on sandbox.

      7. > This brings us to the topic of datasets. There are several efforts in the world at a comprehensive database of life. The ones I have studied (in no particular order) include:

        > ITIS
        > Species 2000
        > DiscoverLife
        > NCBI
        > EOL
        > International Plant Names Index
        > Wikispecies
        > Wikipedia-en taxoboxes

        There's Tree of life which I've mentioned before but which apparently has a licensing issue.

        http://www.tolweb.org/tree/home.pages/downloadtree.html
          ( http://www.dbfordummies.com/Example/Ex710.asp )
          ( http://paste.uni.cc/11838 )

         For butterflies and moths there is BAMONA and All Leps.

        (BAMONA isn't easily accessible online; however, the maintainer, Thomas Naberhaus, is willing to create an extract for use elsewhere. I'm working with him on that just now, in fact, for a Flickr project to create a leps field guide.)

         http://www.lepbarcoding.org/files/nth_am_lep_full_checklist.xls

         

        I have a question which may be more appropriate in some techie thread but I'll ask it here anyway since the specific example that first brought it to mind was the TOL dump; will there be facilities for importing via XML?  It seems at least as useful as spreadsheet imports.

        Spreadsheets seem fine for flat data typical of a relational database, but I hope that Freebase has support for hierarchical data as well.

      8. This question is probably better discussed on the developers mailing list.

        However, the reason that spreadsheets are supported first is that it is easy for non-technical people with domain expertise to understand a spreadsheet, and a mapping is fairly straightforward.

        Freebase data is not hierarchical; it is a general graph, which can represent hierarchies, but isn’t constrained the way XML is. That means that the mappings are more complex, and probably better handled with a custom application, at least until we come up with some superduper UI assistant.

        If the data you want to import can be flattened, then a flat import tool can be used, but otherwise, you will probably need some kind of application that comprehends and maps your data. 
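
As an illustration of the flattening mentioned in the last reply, here is a minimal sketch that turns a nested XML hierarchy into flat rows of (name, rank, parent), the kind of table a spreadsheet-style import could take. The `<taxon>` element layout and the sample data are made up for this example; this is not the Tree of Life dump format, and the import tool's actual input format is not assumed.

```python
import xml.etree.ElementTree as ET

# Made-up XML layout for illustration; not the Tree of Life dump format.
SAMPLE = """
<taxon name="Gaviiformes" rank="order">
  <taxon name="Gaviidae" rank="family">
    <taxon name="Gavia" rank="genus"/>
  </taxon>
</taxon>
"""


def flatten(element, parent_name=None):
    """Return one (name, rank, parent name) row per taxon, depth first."""
    rows = [(element.get("name"), element.get("rank"), parent_name)]
    for child in element.findall("taxon"):
        rows.extend(flatten(child, element.get("name")))
    return rows


rows = flatten(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The parent column carries the hierarchy as ordinary links between rows, which is closer to the graph model described above than to nested XML. Data with richer cross-links than a single parent would still need a custom mapping.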




  4.  

    Selecting species from a certain kingdom

    1. Jg I have been following your progress on the Organism Classification type in the sandbox. It's looking good. I have a question: how will one select all species from a certain kingdom? e.g. select all species in the Plantae kingdom. Thanks.

      // Frank

      1. Hi Frank.
        The plan is to connect the taxa in a hierarchy, with parents and children. So to get everything in Plantae you would start there and then follow the lower_classifications all the way down.

      2. That makes sense, thanks. Would this be an expensive (slow?) query for the Freebase system?

      3. Also sorry for the double post but will the 'Also known as' field contain a plant's common names from the ITIS data? I'm working on a web application to incorporate this data and because users search a lot by common name it would be very useful to have. Double thanks!

      4. Can you give an example? Usually the common name (if known) is the topic title and the scientific name is in the Scientific Name field.

      5. Frank, we don't expect the query to be slow, but as soon as we have the data loaded I'll try it out. Some databases (e.g. Species 2000) have a field which is the flattened complete hierarchy, which we could fall back to if we need it.

        As for common names, I agree with Jeff that they should be the name field, and there is alias if we need it. Also ITIS has many names in Spanish, which I plan to import.

      6. >>> Can you give an example? Usually the common name (if known) is the topic title and the scientific name is in the Scientific Name field. <<<

        Sure. For example Crocus vernus ( http://sandbox.freebase.com/view/crocus_vernus ) has two common names, Dutch Crocus and Spring Crocus. If I have followed correctly the final format would be that one of the two common names would be the Name, 'Crocus vernus' is in the Scientific Name field and the remaining common name can be added as an alias?

        >>> As for common names, I agree with Jeff that they should be the name field, and there is alias if we need it. Also ITIS has many names in Spanish, which I plan to import. <<<

        Above all I was curious as to how common names would be handled because I didn't see a common name field per se. It never occurred to me that the Name field may be used :)

        Thanks for the prompt reply guys.

      7. >>> Sure. For example Crocus vernus ( http://sandbox.freebase.com/view/crocus_vernus ) has two common names, Dutch Crocus and Spring Crocus. If I have followed correctly the final format would be that one of the two common names would be the Name, 'Crocus vernus' is in the Scientific Name field and the remaining common name can be added as an alias? <<<

        That sounds right.

        >>> Thanks for the prompt replies guys. <<<

        Now that they've got the RSS feeds working, the discussion boards work a lot better.

      8. For that specific example, Spring Crocus may be too generic. For example this database only has the other common name http://plants.usda.gov/java/profile?symbol=CRVE4

      9. Here's a better example http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=34342 has 5 common names listed.

      10. OK, let me look at the actual ITIS database and see if we can distinguish one as primary, which would allow the others to be aliases....

      11. Well it turns out that they are not distinguished in ITIS. http://www.itis.gov/vernac.pdf

        I propose that if the first name matches the Wikipedia name we leave it alone, and we load all the vernaculars as /common/topic/alias. For new topics from ITIS we use the first name, capitalize the first word and use the remainder (if any) as aliases.

      12. Sounds like a good plan of action jg :-) Looking forward to updates.
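
The traversal described in the first reply of this thread (start at the kingdom and follow lower_classifications all the way down) can be sketched as a breadth-first walk. The in-memory `TAXA` dict and the `species_in` helper are hypothetical stand-ins for illustration; this is not a Freebase query, and the toy hierarchy omits the intermediate ranks.

```python
from collections import deque

# Toy hierarchy: taxon name -> (rank, lower_classifications).
# Intermediate ranks are omitted to keep the example short.
TAXA = {
    "Plantae": ("kingdom", ["Iridaceae"]),
    "Iridaceae": ("family", ["Crocus"]),
    "Crocus": ("genus", ["Crocus vernus"]),
    "Crocus vernus": ("species", []),
    "Animalia": ("kingdom", ["Gavia"]),
    "Gavia": ("genus", ["Gavia stellata"]),
    "Gavia stellata": ("species", []),
}


def species_in(root, taxa):
    """Collect every species reachable from root via lower_classifications."""
    found = []
    queue = deque([root])
    while queue:
        name = queue.popleft()
        rank, lower = taxa[name]
        if rank == "species":
            found.append(name)
        queue.extend(lower)
    return found


plant_species = species_in("Plantae", TAXA)
print(plant_species)
```

The walk visits each taxon under the starting point once, so its cost grows with the size of the subtree. A stored, flattened hierarchy field like the one mentioned for Species 2000 would trade that walk for a single lookup.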




  5.  

    Initial human genomics data loaded

    1. I've uploaded data from the human genome project and some annotations. These include the genes and their locations on the genome (when known). The Gene Ontology groups and hierarchy are also now online, with membership info for human genes and evidence codes. Better links for citations will be added soon (e.g. links out to public web pages for genes, Pubmed for publications).

      Very much looking forward to seeing how this small kernel grows. Transcripts, disease annotations, and known chromosomal aberrations would be valuable, and of course links to the protein schema. With the addition of other species' genomes, notions of synteny and orthology will be logical directions to explore.

      I would be interested in hearing the thoughts of others on the schema, data, and ideas on how to proceed.

      1. This looks good. Here's a link to /biology/gene for anyone who would like to start there.

        One question - is the NCBI ID the same as an Entrez ID?

        1. Yes, the NCBI ID and Entrez IDs should be the same. I'm not sure which name for the identifier is more appropriate (or perhaps some other name).

          I originally had some confusion when doing a cursory search to make sure the IDs were the same. Some things match up nicely, but others do not. Here's an example. The protein col13a1 has an Entrez ID of 12817. The gene has an ID of 1305. There was no protein on Freebase with ID 1305 and no gene on Freebase with ID 12817. Going to the Entrez gene site, it looks like the ID 12817 may correspond to a mouse protein rather than a human one.

          It might be worthwhile to see if we can programmatically compare the two data sets in whole instead of trying to do it piecemeal.

      2. This is great. Looking forward to playing with the information

      3. The protein information from the signaling gateway is species specific, and includes lots of mouse proteins. However, when Patrick imported the data, there was no "organism" or "species" object to match that link to, so this information is not captured (and may lead to confusion). For example, the Cyclin D1 (http://www.freebase.com/view/%239202a8c04000641f80000000051757c6) entry with Entrez ID (12443) is for Mus musculus, but that's not clear.

        I'd love to be able to link up the protein information with corresponding organism/species information. I'm assuming the appropriate class is "Organism Classification"? Should we (I?) modify the Protein type appropriately?

      4. In terms of linking proteins to species, we originally had a species property on the Genome type, but decided to punt on that until taxonomy was figured out. The model then would be that a gene links to its genome which then links to the species. For orthologs we could add specific forward and reverse properties between the genes of two species. If this were a good way to go for genes, then a similar model would work for proteins (and a Proteome) as well.
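
The model outlined in the last reply (a gene links to its genome, the genome links to the species, and orthologs are linked in both directions) can be sketched with plain Python classes. All class and function names here are hypothetical and are not the Freebase schema. The IDs are the ones quoted earlier in this thread, where the mouse assignment of 12817 was only tentative.

```python
from dataclasses import dataclass, field


@dataclass
class Species:
    name: str


@dataclass
class Genome:
    species: Species


@dataclass(eq=False)  # compare genes by identity; ortholog links are cyclic
class Gene:
    symbol: str
    entrez_id: int
    genome: Genome
    orthologs: list = field(default_factory=list)

    @property
    def species(self):
        """A gene reaches its species through its genome."""
        return self.genome.species


def link_orthologs(a, b):
    """Record the ortholog relation in both directions."""
    a.orthologs.append(b)
    b.orthologs.append(a)


human_genome = Genome(Species("Homo sapiens"))
mouse_genome = Genome(Species("Mus musculus"))

# IDs as quoted in the thread; the mouse assignment is tentative there.
human_gene = Gene("COL13A1", 1305, human_genome)
mouse_gene = Gene("Col13a1", 12817, mouse_genome)
link_orthologs(human_gene, mouse_gene)

print(human_gene.species.name, "<->", human_gene.orthologs[0].species.name)
```

A Protein and Proteome pair could follow the same shape, with the species reached through the proteome.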


