About LinkHub and LinkHub Help
Background and Motivation
A key abstraction in representing biological data is the notion of unique identifiers for biological
entities and relationships among them. For example, each protein sequence in
the UniProt database is given a unique accession by the UniProt curators, e.g.
Q60996; this accession uniquely identifies its associated protein sequence and
can be used as a key to access its sequence record in UniProt. UniProt sequence
records also contain
cross-references to related information in other genomics databases, e.g.
Q60996 is cross-linked in UniProt to Gene Ontology identifier GO:0005634 and
Pfam identifier PF01603 (although the kinds of relationships, which would here
be "functional annotation" and "family membership" respectively, are not
specified in UniProt). Two identifiers such as Q60996 and GO:0005634, together
with the cross-reference between them, can be viewed as a single edge between
two nodes in a graph. Conceptually, the whole of biological knowledge can then
be viewed as a massive graph whose nodes are biological entities (proteins,
genes, etc.) represented by identifiers, and whose edges are typed by the
specific relationships among those entities.
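The graph view described above can be sketched in a few lines of Python. This is only an illustration and not LinkHub's actual data model: the class and method names (CrossReference, IdentifierGraph, add, edges_from) are hypothetical, and the two relationship labels are the ones suggested in the text rather than anything UniProt itself records.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class CrossReference(NamedTuple):
    """One typed edge between two identifiers (hypothetical structure)."""
    source: str               # identifier the cross-reference starts from
    target: str               # identifier it points to
    relation: Optional[str]   # relationship type, or None if none is stated


class IdentifierGraph:
    """Minimal adjacency-list graph keyed by identifier string."""

    def __init__(self) -> None:
        self._edges = defaultdict(list)

    def add(self, source: str, target: str, relation: Optional[str] = None) -> None:
        self._edges[source].append(CrossReference(source, target, relation))

    def edges_from(self, identifier: str) -> list:
        # .get avoids creating empty entries for unknown identifiers
        return list(self._edges.get(identifier, []))


# The two cross-references mentioned in the text, with the relationship
# types the text proposes for them.
graph = IdentifierGraph()
graph.add("Q60996", "GO:0005634", "functional annotation")
graph.add("Q60996", "PF01603", "family membership")

for edge in graph.edges_from("Q60996"):
    print(f"{edge.source} --[{edge.relation}]--> {edge.target}")
```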
The internet and web are vast and people and groups are making biological
information available independently and mostly without knowledge of one
another. There is simply too much biological data, and it is changing too fast,
for any single individual or group to have complete knowledge and be able to
make all possible connections in a fully centralized fashion. Implicit among
all the web-accessible biological resources (both structured and unstructured),
however, is a rich web of complex semantic relationships. Unfortunately, this
rich web of relationships is for the most part not made explicit, and many
biological resources are largely independent, with limited connections to each
other or, more generally, lacking important contextual information.
The motivation for LinkHub is that centralization to an extent does make sense,
e.g. a single lab or organization might want to interrelate its various
resources to one another and to larger, well-known resources such as UniProt or
GenBank, i.e. create a local central hub of interconnections among its
individual data resources; but it does not want to have to explicitly
connect its data resources up to everything in existence, which is impossible.
The key idea is that if groups independently maintaining data resources each
connect their resources up to some other resource X, then any of them can reach
any other through these connections to X, and we can collectively achieve
incremental global integration of genomics data in this way. LinkHub is a
software system which aims to help realize this goal by enabling one to
create such local minor hubs of data interconnections and connecting them
to major hubs of data such as UniProt or GenBank in a federated "hub of hubs"
framework.
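The key idea, that resources which each connect to a common resource X can reach one another through X, amounts to reachability in an undirected graph. The sketch below illustrates this with a breadth-first search. The two local identifiers (LabA:gene42 and LabB:prot7) and the function names are invented for the example; only the UniProt accession comes from this document.

```python
from collections import defaultdict, deque


def build_links(pairs):
    """Build an undirected adjacency map from (identifier, identifier) pairs."""
    links = defaultdict(set)
    for a, b in pairs:
        links[a].add(b)
        links[b].add(a)
    return links


def find_path(links, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(links.get(node, ())):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Two hypothetical labs each link their own identifier to the same
# UniProt accession; neither knows about the other.
links = build_links([
    ("LabA:gene42", "UniProt:P26364"),
    ("LabB:prot7", "UniProt:P26364"),
])

print(find_path(links, "LabA:gene42", "LabB:prot7"))
```

Neither lab declared a connection to the other, yet the path exists because both connected to the shared hub.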
LinkHub as the Gerstein Lab's 'Links Portal'
Practically, LinkHub serves as the main gateway into the various web resources
built and maintained within the Gerstein Lab and affiliated groups (such as the
Northeast Structural Genomics Consortium) and their relationships with external
biological knowledge (mainly through UniProt). The primary access into LinkHub
is by providing a biological identifier, e.g. UniProt accession, PDB
identifier, etc. LinkHub takes the given identifier, determines its mappings
to other identifiers (and their mappings to further identifiers, and so on),
and then presents the graph of identifier relationships stemming from the
given identifier to the user as a DHTML expandable/collapsible list, together
with hyperlinks to particular information pages for each identifier in the
graph (e.g. the UniProt entry page for a UniProt accession). A single layer of
the graph is shown at a time, and the user may then selectively expand fringe
nodes to explore further. Thus, the user does not need to know about particular
local Gerstein Lab identifier naming schemes or web resources; they can simply
give an identifier that they know, such as a UniProt accession, and the system
will internally translate it to other identifiers and present all that is
known about the given identifier.
LinkHub Interface
Below is a screenshot of the LinkHub interface view for UniProt accession P26364:
P26364 is presented at the root of the list (with its identifier type
'UniProtKB/Swiss-Prot Acc' prepended to it), and lower levels contain
information on additional related identifiers. Each identifier has two
subsections: Links, which gives a list of hyperlinks to web documents
directly relevant to the identifier; and Equivalent or Related Ids, which
contains a list of additional identifiers related to the first identifier by
various relationship types (the relationship type, if it exists, is given in
parentheses; a synonym relationship is assumed if no relationship is given).
The identifiers in the Equivalent or Related Ids section may themselves be
further related to other identifiers, which will have their own Links and
Equivalent or Related Ids sections, and so on. The initial display shows
the relationships of the root identifier one level deep, and dynamic
callbacks to the server retrieve additional data when the user clicks on
identifiers whose subsections have not yet been loaded; in this way, the
user can explore the relationship paths of interest without the performance
penalty of loading the whole graph and without "information overload". The
interface is dynamic: a collapsed list icon can be clicked to expand and view
the hidden underlying content, and an expanded list icon can be
clicked to hide the content. In the example screenshot above, it is seen that
P26364 is equivalent to Yeast ORF YER170W, and through YER170W there are
hyperlinks to further information in the Gerstein Lab web resources GeneCensus,
Topology of Networks (TopNet), and others. Further links are given through
other identifier relationships with PIR, Gene Ontology and Pfam.
Note that hyperlinks to external, non-Gerstein Lab web resources are also
given (e.g. SGD, Pfam, GO, etc.).
The default behavior is, for a given identifier, to
show all relationships stemming from it and all known hyperlinks; however,
the user can filter to show only the hyperlinks of specific web
resources or subsets of web resources. This functionality is accessible from
the LinkHub front page described next.
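The load-on-demand behavior of the interface (one level shown initially, further levels fetched only when the user expands a node) can be sketched as follows. This is a simplified model under stated assumptions, not the actual LinkHub server or DHTML code: the Node class, the fetch function, and the in-memory tables standing in for the server are all hypothetical, and link URLs are omitted.

```python
# Stand-in for server-side data (hypothetical layout). A relation of None
# means no relationship type is given, so a synonym is assumed.
RELATED = {"P26364": [("YER170W", None)]}
LINKS = {"YER170W": ["GeneCensus", "TopNet"]}  # resource names only

fetch_log = []  # records every simulated callback to the server


def fetch(identifier):
    """Simulated server callback: returns (links, related ids) for one identifier."""
    fetch_log.append(identifier)
    return LINKS.get(identifier, []), RELATED.get(identifier, [])


class Node:
    """One identifier in the expandable list; subsections load on first expand."""

    def __init__(self, identifier, relation=None):
        self.identifier = identifier
        self.relation = relation or "synonym"
        self.links = None       # None means "not loaded yet"
        self.children = None

    @property
    def loaded(self):
        return self.children is not None

    def expand(self, fetcher):
        if not self.loaded:     # fetch only once per node
            links, related = fetcher(self.identifier)
            self.links = list(links)
            self.children = [Node(i, r) for i, r in related]
        return self.children


# Initial display: the root is expanded one level deep.
root = Node("P26364")
root.expand(fetch)
child = root.children[0]
print(child.identifier, child.relation, "loaded:", child.loaded)

# The user clicks on the child, which triggers one more callback.
child.expand(fetch)
child.expand(fetch)             # a second click does not fetch again
print(child.links, fetch_log)
```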
LinkHub Front Page
Below is a screenshot of the LinkHub front page (accessible at either
hub.gersteinlab.org or hub.nesg.org):
ID is where you enter the identifier you are interested in. If you just
enter an identifier and then click "Get Links" you will be given the
interface view for that identifier with the default behavior, i.e. showing
all known hyperlinks. Note that wildcard searching is supported for
the ID field by using the '*' character; e.g. '*xpn' will search for identifiers ending in
'xpn' and '*lsid*' will search for identifiers that contain 'lsid'.
If you also select a Resource Name then only links from
the selected resource will be shown; if only one link results you will
automatically be redirected to that link. Note that in LinkHub identifiers
and resources are considered separate and independent. While many resources
(e.g. Gene Ontology) have their own associated identifier types and identifiers
(e.g. GO:0050897), LinkHub considers these separate: a given identifier can
be a key to information at many resources and not just the resource that originated
it, e.g. UniProt identifiers are used to access information at many web sites besides
UniProt itself. Resource groups are subsets of all
the known resources, and selecting a resource group in addition to an
identifier will cause only links from resources within the resource group
to be shown; for example GERSTEIN_LAB is the resource group which contains
all the web resources of the Gerstein Lab, and selecting it will show only
hyperlinks to Gerstein Lab resources for a given identifier. Finally, if
you select only a Resource Name (no ID) then you will be provided with a
list of all known identifiers which have hyperlinks to that resource; this
list consists of hyperlinks to the LinkHub interface view for the
identifiers.
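The front-page behavior described above (wildcard matching on the ID field, filtering by resource or resource group, and redirecting when a single link results) can be illustrated with the following sketch. It is not LinkHub's implementation: the function names, the example.org URLs, and the assignment of GeneCensus and TopNet links to YOR133W are placeholders chosen for the example, and case-sensitive matching is an assumption.

```python
import re

# Illustrative data only; URLs and link assignments are placeholders.
RESOURCE_GROUPS = {"GERSTEIN_LAB": {"GeneCensus", "TopNet"}}
LINKS = {
    "YOR133W": [
        ("SGD", "https://example.org/sgd/YOR133W"),
        ("GeneCensus", "https://example.org/genecensus/YOR133W"),
        ("TopNet", "https://example.org/topnet/YOR133W"),
    ],
    "YER170W": [("SGD", "https://example.org/sgd/YER170W")],
}


def match_ids(pattern, identifiers):
    """Return identifiers matching pattern; '*' matches any run of characters."""
    # Escape everything except '*', so other characters are taken literally.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    return [i for i in identifiers if regex.fullmatch(i)]


def get_links(identifier, resource=None, group=None):
    """Return (resource, url) pairs, optionally filtered by resource or group."""
    links = LINKS.get(identifier, [])
    if resource is not None:
        links = [link for link in links if link[0] == resource]
    if group is not None:
        members = RESOURCE_GROUPS.get(group, set())
        links = [link for link in links if link[0] in members]
    return links


def respond(identifier, resource=None, group=None):
    """Redirect when a selected resource yields exactly one link, else list."""
    links = get_links(identifier, resource, group)
    if resource is not None and len(links) == 1:
        return ("redirect", links[0][1])
    return ("list", links)


print(match_ids("YOR*", LINKS))                   # identifiers starting with YOR
print(get_links("YOR133W", group="GERSTEIN_LAB")) # Gerstein Lab links only
print(respond("YOR133W", resource="SGD"))         # single link, so redirect
```

Keeping identifiers and resources in separate tables mirrors the point made above that an identifier can be a key to information at many resources, not only the one that originated it.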
Examples
Here are some examples (ID, resource and resource group are encoded in the URL):
- Show everything known for PDB 6ldh (i.e. no filtering based on
resource/resource group)
- Show everything for UniProt P26364
- Show everything for Yeast YOR133W
- Gerstein Lab links only for Yeast YOR133W
- Only SGD links for Yeast YOR133W
(redirect there if only one)
- Wildcard search for YOR*
LinkHub Paper
A BMC Bioinformatics paper about LinkHub is available here.