About LinkHub and LinkHub Help

Background and Motivation

A key abstraction in representing biological data is the notion of unique identifiers for biological entities and the relationships among them. For example, each protein sequence in the UniProt database is given a unique accession by the UniProt curators, e.g. Q60996; this accession uniquely identifies its associated protein sequence and can be used as a key to access its sequence record in UniProt. UniProt sequence records also contain cross-references to related information in other genomics databases, e.g. Q60996 is cross-linked in UniProt to Gene Ontology identifier GO:0005634 and Pfam identifier PF01603 (although the kinds of relationships, which would here be "functional annotation" and "family membership" respectively, are not specified in UniProt). Two identifiers such as Q60996 and GO:0005634, together with the cross-reference between them, can be viewed as a single edge between two nodes in a graph. Conceptually, then, the whole of biological knowledge can be viewed as a massive graph whose nodes are biological entities (proteins, genes, etc.) represented by identifiers, and whose links are typed and represent the specific relationships among those entities.
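The graph view described above can be sketched in a few lines of Python. This is only an illustration, not LinkHub's actual data model: the function name `build_graph` and the adjacency-map layout are assumptions, and the edges are treated as undirected for simplicity. The identifiers and relationship labels are the ones used in the example above.

```python
from collections import defaultdict

# Typed edges between identifiers. The relationship labels are the ones the
# text suggests; UniProt itself does not record them.
EDGES = [
    ("Q60996", "GO:0005634", "functional annotation"),
    ("Q60996", "PF01603", "family membership"),
]


def build_graph(edges):
    """Return an adjacency map: identifier -> list of (neighbor, relation).

    Assumption: edges are stored in both directions (undirected graph).
    """
    graph = defaultdict(list)
    for source, target, relation in edges:
        graph[source].append((target, relation))
        graph[target].append((source, relation))
    return dict(graph)


graph = build_graph(EDGES)
for neighbor, relation in graph["Q60996"]:
    print(f"Q60996 --[{relation}]--> {neighbor}")
```

Each identifier is a node, and each cross-reference becomes a typed edge that can be followed in either direction.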

The internet and web are vast, and people and groups are making biological information available independently and mostly without knowledge of one another. There is simply too much biological data, and it is changing too fast, for any single individual or group to have complete knowledge and be able to make all possible connections in a fully centralized fashion. Implicit among all the web-accessible biological resources (both structured and unstructured), however, is a rich web of complex semantic relationships. Unfortunately, this web of relationships is for the most part not made explicit: many biological resources are largely independent, with limited connections to each other, or more generally lack important contextual information. The motivation for LinkHub is that centralization to an extent does make sense. For example, a single lab or organization might want to interrelate its various resources to one another and to larger, well-known resources such as UniProt or GenBank, i.e. create a local central hub of interconnections among its individual data resources; but it does not want to have to explicitly connect its data resources to everything in existence, which is impossible. The key idea is that if groups independently maintaining data resources each connect their resources to some other resource X, then any of them can reach any other through these connections to X, and we can collectively achieve incremental global integration of genomics data in this way. LinkHub is a software system that aims to help realize this goal by enabling one to create such local minor hubs of data interconnections and to connect them to major hubs of data such as UniProt or GenBank in a federated "hub of hubs" framework.
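The "connect to a shared resource X" idea can be illustrated with a small reachability sketch. Everything here is hypothetical: the identifiers `labA:gene42` and `labB:struct7` are invented for the example, the choice of the UniProt accession P26364 as the shared hub is arbitrary, and the breadth-first search is just one simple way to follow the links, not a description of how LinkHub does it.

```python
from collections import deque

# Hypothetical cross-references: two labs each link one local identifier to
# the same UniProt accession, but never directly to each other.
LINKS = {
    "labA:gene42": ["P26364"],
    "labB:struct7": ["P26364"],
    "P26364": ["labA:gene42", "labB:struct7"],
}


def find_path(links, start, goal):
    """Breadth-first search; returns the shortest identifier path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in links.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(find_path(LINKS, "labA:gene42", "labB:struct7"))
# ['labA:gene42', 'P26364', 'labB:struct7']
```

Neither lab knows about the other, yet each can reach the other's data through the shared hub identifier.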

LinkHub as the Gerstein Lab's 'Links Portal'

Practically, LinkHub serves as the main gateway into the various web resources built and maintained within the Gerstein Lab and affiliated groups (such as the Northeast Structural Genomics Consortium) and their relationships with external biological knowledge (mainly through UniProt). The primary way into LinkHub is to provide a biological identifier, e.g. a UniProt accession, a PDB identifier, etc. LinkHub takes the given identifier, determines its mappings to other identifiers (and their mappings to further identifiers, etc.) and then presents the graph of identifier relationships stemming from the given identifier as a DHTML expandable/collapsible list, together with hyperlinks to particular information pages for each identifier in the graph, e.g. the UniProt entry page for a UniProt accession. A single layer of the graph is shown at a time, and the user may then selectively expand fringe nodes to explore more. Thus, users do not need to know about particular local Gerstein Lab identifier naming schemes or its particular web resources: they can simply give an identifier that they know, such as a UniProt accession, and the system will internally translate it to other identifiers and present all that is known about the given identifier.

LinkHub Interface

Below is a screenshot of the LinkHub interface view for UniProt accession P26364:

LinkHub Screenshot for P26364

P26364 is presented at the root of the list (with its identifier type 'UniProtKB/Swiss-Prot Acc' prepended to it), and lower levels contain information on additional related identifiers. Each identifier has two subsections: Links, which gives a list of hyperlinks to web documents directly relevant to the identifier; and Equivalent or Related Ids, which contains a list of additional identifiers related to the first identifier by various relationship types (the relationship type, if it exists, is given in parentheses; a synonym relationship is assumed if no relationship is given). The identifiers in the Equivalent or Related Ids section may themselves be related to further identifiers, which will have their own Links and Equivalent or Related Ids sections, and so on. The initial display shows the transitive closure of the root identifier one level deep, and dynamic callbacks to the server retrieve additional data when the user clicks on identifiers whose subsections have not yet been loaded; in this way, users can explore the relationship paths they want without the performance penalty of loading the whole graph and without "information overload". The interface is dynamic: clicking a list icon expands it to show the hidden underlying content, and clicking it again hides that content. In the example screenshot above, P26364 is shown to be equivalent to Yeast ORF YER170W, and through YER170W there are hyperlinks to further information in the Gerstein Lab web resources GeneCensus, Topology of Networks (TopNet), and others. Further links are given through other identifier relationships with PIR, Gene Ontology and Pfam. Note that hyperlinks to external, non-Gerstein Lab web resources are also given (e.g. SGD, Pfam, GO, etc.).
The default behavior is, for a given identifier, to show all relationships stemming from it and all known hyperlinks; however, the user can filter the display to show only the hyperlinks of specific web resources or subsets of web resources. This functionality is accessible from the LinkHub front page described next.
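The level-by-level loading described above can be sketched as a tree whose children are fetched only on first expansion. This is a minimal sketch under stated assumptions, not LinkHub's implementation: the `Node` class, the `fetch` function, and the in-memory `MAPPINGS` table are hypothetical stand-ins for the browser-side list and the server callback, and `EXAMPLE:1` is an invented identifier. Only the P26364/YER170W pairing comes from the example above.

```python
# Toy stand-in for the server-side mapping store. Only the P26364/YER170W
# pairing comes from the text; "EXAMPLE:1" is a hypothetical identifier.
MAPPINGS = {
    "P26364": [("YER170W", None)],          # None: synonym assumed
    "YER170W": [("EXAMPLE:1", "related")],  # hypothetical relationship
    "EXAMPLE:1": [],
}


class Node:
    """One identifier in the display tree; children load on first expand."""

    def __init__(self, identifier, relation=None):
        self.identifier = identifier
        self.relation = relation
        self.children = None  # None means "not fetched yet"

    def expand(self, fetch):
        """Fetch children once, then reuse the cached list."""
        if self.children is None:
            self.children = [Node(i, r) for i, r in fetch(self.identifier)]
        return self.children


fetch_calls = []


def fetch(identifier):
    """Simulates the callback to the server for one identifier's neighbors."""
    fetch_calls.append(identifier)
    return MAPPINGS.get(identifier, [])


root = Node("P26364")
first_level = root.expand(fetch)             # initial display: one level deep
root.expand(fetch)                           # expanding again does not refetch
second_level = first_level[0].expand(fetch)  # user clicks on YER170W
print(fetch_calls)
```

Only the identifiers the user actually opens trigger a fetch, which is what keeps the initial page small.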

LinkHub Front Page

Below is a screenshot of the LinkHub front page (accessible at either hub.gersteinlab.org or hub.nesg.org):

LinkHub Front Page screenshot

ID is where you enter the identifier you are interested in. If you just enter an identifier and then click "Get Links" you will be given the interface view for that identifier with the default behavior, i.e. showing all known hyperlinks. Wildcard searching is supported for the ID field by using the '*' character; e.g. '*xpn' will search for identifiers ending in 'xpn' and '*lsid*' will search for identifiers that contain 'lsid'. If you also select a Resource Name then only links from the selected resource will be shown; if only one link results you will automatically be redirected to that link. Note that in LinkHub identifiers and resources are considered separate and independent. While many resources (e.g. Gene Ontology) have their own associated identifier types and identifiers (e.g. GO:0050897), LinkHub treats these as separate: a given identifier can be a key to information at many resources and not just the resource that originated it, e.g. UniProt identifiers are used to access information at many web sites besides UniProt. Resource groups are subsets of all the known resources, and selecting a resource group in addition to an identifier will cause only links from resources within the resource group to be shown; for example, GERSTEIN_LAB is the resource group that contains all the web resources of the Gerstein Lab, and selecting it will show only hyperlinks to Gerstein Lab resources for a given identifier. Finally, if you select only a Resource Name (no ID) then you will be provided with a list of all known identifiers that have hyperlinks to that resource; this list consists of hyperlinks to the LinkHub interface view for the identifiers.
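The wildcard and filtering behavior described above can be approximated as follows. This is a sketch under assumptions, not LinkHub's code: the function names, the tuple layout of the link records, the example.org URLs, and the identifiers `abcxpn` and `urn:lsid:example` are all hypothetical, and matching is assumed to be case-sensitive, which the text does not specify. The resource names and the GERSTEIN_LAB group are taken from the descriptions on this page.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern in which '*' matches any run of characters."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def search_ids(identifiers, pattern):
    """Return the identifiers that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [i for i in identifiers if regex.fullmatch(i)]


# Hypothetical link records: (identifier, resource name, URL).
LINKS = [
    ("YOR133W", "SGD", "https://example.org/sgd/YOR133W"),
    ("YOR133W", "GeneCensus", "https://example.org/genecensus/YOR133W"),
    ("YOR133W", "TopNet", "https://example.org/topnet/YOR133W"),
]

# Resource groups are named subsets of the known resources.
GROUPS = {"GERSTEIN_LAB": {"GeneCensus", "TopNet"}}


def filter_links(links, identifier, resource=None, group=None):
    """Return URLs for an identifier, optionally limited to one resource
    or to the resources in one resource group."""
    urls = []
    for ident, res, url in links:
        if ident != identifier:
            continue
        if resource is not None and res != resource:
            continue
        if group is not None and res not in GROUPS[group]:
            continue
        urls.append(url)
    return urls


ids = ["YOR133W", "YER170W", "abcxpn", "urn:lsid:example"]
print(search_ids(ids, "YOR*"))
print(search_ids(ids, "*xpn"))
print(search_ids(ids, "*lsid*"))
print(filter_links(LINKS, "YOR133W", group="GERSTEIN_LAB"))

# A single result is the case in which LinkHub redirects straight to the link.
sgd_only = filter_links(LINKS, "YOR133W", resource="SGD")
print(len(sgd_only) == 1)
```

Keeping identifiers and resources in separate fields mirrors the point made above that an identifier can be a key into many resources, not just the one that issued it.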

Examples

Here are some examples (ID, resource and resource group are encoded in the URL):
  1. Show everything known for PDB 6ldh (i.e. no filtering based on resource/resource group)
  2. Show everything for UniProt P26364
  3. Show everything for Yeast YOR133W
  4. Gerstein Lab links only for Yeast YOR133W
  5. Only SGD links for Yeast YOR133W (redirect there if only one)
  6. Wildcard search for YOR*

LinkHub Paper

A BMC Bioinformatics paper about LinkHub is available here.