Proteins & Genes - Mouse & Human

  1.  
    1. I'm just learning all this stuff, but I'm confused about how the proteins and genes are being modeled in relation to each other. When I look at the Sorcin gene and Sorcin protein, they don't seem to mesh very well. I understand that genes typically get named with the same name as their associated protein, even though they're two different things.

      • Aren't the NCBI ID (/biology/gene/ncbi_id) and the Entrez ID (/biology/protein/entrez_gene_id) the same thing? Consistent terminology (and documentation!) would help make this clear.

      • Why does the protein link to the mouse gene (only)?

      • Why is the linkage done via an external identifier, rather than directly?

      I know Luke has been working on related areas (particularly cleaning up duplicates), so I've copied him on the discussion. Are there others doing active work in this space?

      1. I forgot one:

        • Should the gene symbols and alternate symbols (e.g. SRI for Sorcin) be added to the aliases list to make them easier to find and duplicates less likely to be created?
      2. I agree with all these points. In particular, I think we need a direct linkage between proteins and genes.

        I don't know if we've captured any mouse genes inside Freebase, though that information is present in Wikipedia and probably in NCBI - I think we should! Maybe creating a "Mouse genome" topic and "Mouse chromosome x" topics would be a good start ... So while Wikipedia conflates human genes and mouse genes, I think we should keep them separate.

      3. Dan (druderman) apparently tried to reply too, but his response got trashed by the system :-(

        https://bugs.freebase.com/browse/FREEBASE-1127

        As far as mouse genes go, the Sorcin protein's Entrez Gene ID, 109552, actually identifies a mouse gene, so there are already mouse genes in Freebase (although this looks like a bug, since the schema says these should be human proteins).

        Wikipedia has a single page for all three (protein, human gene, mouse gene), but they do list the mouse gene and human gene ids separately, so they're not really conflated. The Wikipedia page was changed (yesterday!) to say that Sorcin is the protein and SRI is the gene, but Entrez http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=6717 gives Sorcin as the "Official Full Name" of the gene and SRI as the symbol. The mouse gene (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=109552) uses the same terms.

        Dan - if you're still listening and you want to email me a reply (@gmail), I'll make sure it gets posted.

      4. Dan (/user/drunderman) emailed the attached to post here:

        I agree with keeping the genes for different organisms separate. The data model that appeals to me most is one of collecting scientific statements rather than amassing "facts". So, for example, in a particular human genome build a given genomic locus (given by chromosome, strand, and base range) is identified as a given named gene. This annotation may someday change (e.g. it may turn out to be a pseudogene). Similarly, someone may have defined an orthology between genomic loci in two different organisms, thus equating them with a single gene name. Again, this is subject to change. What's nice about Freebase is that you can define a compound value type which links these inter-organismic loci as an "Orthology" (a Freebase Type one would create), and, importantly, that orthology would have attached to it some data about who said it. Note that not everyone may agree on the same set of orthologs! So keeping track of who said what is important.

        What we will end up with over time is a dynamic view of the genome and its annotations as they change over the years. This way we can mine knowledge across time rather than just referring to the current view of the "truth".
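
        A minimal Python sketch of the "scientific statements" idea above, not a Freebase schema. Every name here (Locus, OrthologyStatement, statements_as_of) is hypothetical, and the builds, coordinates, sources, and dates are placeholders rather than real annotations. It only shows how recording who asserted an orthology, and when, lets you ask what was claimed at a given point in time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Locus:
    """A genomic locus within one genome build (all values below are placeholders)."""
    organism: str
    genome_build: str
    chromosome: str
    strand: str
    start: int
    end: int


@dataclass(frozen=True)
class OrthologyStatement:
    """A claim that several loci correspond to one named gene, with provenance."""
    loci: Tuple[Locus, ...]
    gene_name: str
    asserted_by: str                      # who said it
    asserted_on: date                     # when they said it
    retracted_on: Optional[date] = None   # set if the claim is later withdrawn


def statements_as_of(statements: List[OrthologyStatement], when: date) -> List[OrthologyStatement]:
    """Return the statements that had been made, and not yet retracted, on a given date."""
    return [
        s for s in statements
        if s.asserted_on <= when and (s.retracted_on is None or when < s.retracted_on)
    ]


# Placeholder loci: the coordinates are invented for illustration only.
human_locus = Locus("human", "build-1", "chrN", "+", 1, 100)
mouse_locus = Locus("mouse", "build-1", "chrM", "+", 1, 100)

statements = [
    OrthologyStatement((human_locus, mouse_locus), "SRI", "source A", date(2007, 1, 1)),
    OrthologyStatement((human_locus, mouse_locus), "SRI", "source B", date(2008, 1, 1)),
]

early_view = statements_as_of(statements, date(2007, 6, 1))
later_view = statements_as_of(statements, date(2008, 6, 1))
```

        Because nothing is overwritten, sources that disagree simply contribute different statements, and an earlier view stays queryable after the annotation changes.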

      5. Darn embedded markup! That was just supposed to be a dashed line separating my preface, not a giant heading.

        I love Dan's model. It aligns with thoughts I've had about expressing research conclusions (his "scientific statements"). One key piece, I think, is to develop a stronger citation model and practices to link the "in Freebase" with the "outside Freebase."

      6. Why don't we start with adding a property to Gene of "Protein encoded" and the corresponding (reciprocal) property to Protein of "Encoded by"? (I assume that each Gene encodes one Protein but each Protein could be encoded by multiple Genes, e.g. mouse and human.)

        I do like the idea of an Orthology type - can we work on that later?
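
        As a rough illustration of the reciprocal properties proposed above, here is a small Python sketch that keeps "Protein encoded" and "Encoded by" in sync. The class and method names are hypothetical, and the one-protein-per-gene rule is just the assumption stated in this post, not a biological claim. The gene labels use the Entrez Gene IDs quoted earlier in the thread.

```python
from collections import defaultdict


class GeneProteinLinks:
    """Keeps the two directions of the gene/protein relationship consistent."""

    def __init__(self):
        self.protein_encoded = {}            # gene -> protein ("Protein encoded")
        self.encoded_by = defaultdict(set)   # protein -> genes ("Encoded by")

    def link(self, gene, protein):
        # Assumption from the post: each gene encodes exactly one protein.
        previous = self.protein_encoded.get(gene)
        if previous is not None and previous != protein:
            raise ValueError(f"{gene} is already linked to {previous}")
        self.protein_encoded[gene] = protein
        self.encoded_by[protein].add(gene)


links = GeneProteinLinks()
links.link("human gene 6717", "Sorcin")
links.link("mouse gene 109552", "Sorcin")
```

        After these two calls, looking up "Sorcin" in encoded_by gives both the human and the mouse gene, which is the many-to-one shape described above.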

      7. Might be best to define a compound value type for this linkage so it is clear how the correspondence between gene and protein was arrived at. I have some experience with this so I'm happy to help with the schema.

        What information are you thinking of using to relate protein to gene?

        Dan

      8. The Entrez Gene database seemed like a good starting point.

        If you want to suggest a schema, Dan, that would be great.

      9. Entrez Gene sounds good. Looks like the NCBI ID that I placed with each gene is the same as the Entrez Gene ID (please correct me if I'm wrong). Can you point me to the online data source you'd like to use which maps between mouse and human?

        As for a schema, in the big picture I'd like to eventually include genomic locations of exons, mRNA transcripts, and then their corresponding protein products. But that's longer term. For now we might want to simply create a CVT which links a protein entry to a gene entry and is annotated as an identity based on Entrez Gene ID. So the CVT would have the gene at one end and the protein at the other. We could either define a general CVT for linking gene to protein and then a more specific CVT which is an Entrez Gene ID link, or we could just have one CVT for gene to protein with a flag inside it which explains the link (e.g. some text, like "Entrez Gene ID link").

        Other ideas?

        Dan
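
        The simpler of the two options above (one CVT with a text flag explaining the link) might look roughly like this in Python. This is only a sketch: GeneProteinLink and link_by_entrez_id are hypothetical names, and it assumes that a gene's NCBI ID and a protein's Entrez gene ID share one identifier space, which is still an open question in this thread. The IDs are the ones quoted earlier for Sorcin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneProteinLink:
    """Stand-in for the proposed CVT: gene at one end, protein at the other."""
    gene: str
    protein: str
    evidence: str   # free-text flag explaining how the link was derived


def link_by_entrez_id(genes, proteins):
    """Pair each protein with the gene that carries the same Entrez Gene ID.

    genes:    list of (gene_topic, ncbi_id)
    proteins: list of (protein_topic, entrez_gene_id)
    """
    gene_by_id = {ncbi_id: topic for topic, ncbi_id in genes}
    return [
        GeneProteinLink(gene_by_id[entrez_id], protein, "Entrez Gene ID link")
        for protein, entrez_id in proteins
        if entrez_id in gene_by_id
    ]


genes = [("SRI (human gene)", "6717"), ("SRI (mouse gene)", "109552")]
proteins = [("Sorcin", "109552")]   # the protein topic currently carries the mouse gene's ID

links = link_by_entrez_id(genes, proteins)
```

        With the data as it stands, this yields a single link from Sorcin to the mouse gene, which matches the behaviour questioned at the top of the thread. A more specific link type could later replace the text flag without changing the two ends.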

      10. Sounds good to me. (I always prefer starting with the simpler scheme and then building up gradually to the more complex.)

        I don't know of any data sources that map between mouse and human (except for Wikipedia!) - I was hoping that the linkage would fall out of the data.

      11. Homologene is supposed to show cross-species gene connections, I think. Here's the entry for the human SRI gene:

        http://www.ncbi.nlm.nih.gov/sites/entrez?Db=homologene&Cmd=Retrieve&list_uids=37736&log$=seqview_homolog

        At the general level, the proteins produced by these genes have the same name:

        Human gene http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=6717#geneGeneral%20protein%20info

        Mouse gene http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=109552#geneGeneral%20protein%20info

        but it looks like at the detailed level, even the proteins are tracked separately by species. This link has a cluster of 18 "different" proteins across 11 different species.

        http://www.uniprot.org/uniref/?query=member%3aP30626+identity:0.9

        I'm not sure what level this stuff needs to be modeled at. Also, the more I look at the various databases that are available, the more I wonder what value Freebase would add to the ecosystem. What gaps need to be filled?
