BioUML Genome Browser
A genome browser is software that lets users display and interactively navigate genomic data obtained from a biological database. A few dozen free and commercial genome browsers are currently available, both as standalone (desktop) and web applications. Standalone genome browsers include Argo Genome Browser http://www.broad.mit.edu/annotation/argo/ [Engels et al, 2006], Artemis Genome Browser http://www.sanger.ac.uk/resources/software/artemis/ [Carver et al, 2011], Gaggle Genome Browser http://gaggle.systemsbiology.net/docs/geese/genomebrowser/ [Bare et al, 2010], Integrated Genome Browser http://igb.bioviz.org/ [Nicol et al, 2009] and Integrative Genomics Viewer [Thorvaldsdóttir et al, 2012], among others. Many standalone genome browsers still need Internet access to download genomic annotation from online databases.
Notable web-based genome browsers include the Ensembl genome browser http://www.ensembl.org/ [Hubbard et al, 2002], the UCSC Genome Browser http://genome.ucsc.edu/ [Kent et al, 2002] and JBrowse http://jbrowse.org/ [Skinner et al, 2009]. Modern web technologies allow web applications to be as interactive as standalone ones, though the Ensembl genome browser still lacks interactive features such as “drag-and-scroll”. The only notable remaining drawback of web-based genome browsers is that users must upload their own data before it can be displayed in the browser.
Most genome browsers (both standalone and web) come as independent products or as add-ons to an online database (e.g. the Ensembl genome browser). While they can display both predefined and user-uploaded tracks, they lack integration with analysis tools. Users usually have to launch separate tools to process their data and then import the results into the genome browser to visualize them. On the other hand, many notable genomic data analysis platforms such as Galaxy (http://galaxyproject.org/ [Goecks et al, 2010]) lack their own genome browsers. Galaxy can import genomic features from the UCSC Genome Browser but cannot send them back, so the user still has to download data files and pass them to the genome browser in some way. These steps slow down research, and the problem grows with next-generation sequencing (NGS) data because of the huge data volumes.
Since the BioUML platform contains genomic data analyses, including a genome browser was a straightforward decision, so that users can visualize analysis results directly in the system. The integrated genome browser should be able to visualize genomic sequences, predefined tracks (e.g. Ensembl genes), user-uploaded files and analysis results, combining tracks from various sources on a single display. All meta-information associated with track sites should be displayed on user request. The genome browser should be available both in the BioUML workbench (desktop edition) and in BioUML web edition.
Methods and technologies
BioUML web edition is client-server software. The server code is implemented in Java 1.6 and uses the Apache Tomcat web server to handle HTTP requests.
The BioUML workbench is a standalone Java 1.6 application which is based on the Swing user interface library.
In general, most genome browser implementations contain a genomic sequence and a set of individual tracks. Each track contains a number of “features” or “sites”. Every site has genomic coordinates (chromosome, start and end positions) and other information such as name, strand and intensity, which can be used to display it correctly.
The BioUML genome browser implementation schema is displayed in Fig. 1. It consists of several software layers:
Backend: data sources and storage;
Model: abstract model-level interfaces to the data sources;
View: the hierarchy of graphical primitives describing visual representation of the selected part.
Let’s consider these layers in more detail.
Tracks and sequences displayed in the genome browser are gathered from various sources. The currently supported sequence sources are:
Locally installed Ensembl database [Hubbard et al, 2002].
User-uploaded sequence file. Several sequence formats are supported, including FASTA, FASTQ [Cock et al, 2009], EMBL and GenBank. An index is created for the file so that positions can be reached quickly.
Sequences from remote web-services accessible via DAS protocol [Jenkinson et al, 2008].
Sequences from a remote BioUML server.
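The index created for an uploaded sequence file can be pictured with a minimal sketch. The Java code below is an illustration only and not the BioUML implementation: the class names FastaIndexEntry and FastaIndex, the in-memory byte array, and the assumptions of '\n' line endings and equal-length sequence lines are all hypothetical. It records where each FASTA sequence starts and how its lines are laid out, so that a region can be read without scanning the whole file.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical index entry: where a sequence starts and how its lines are laid out. */
class FastaIndexEntry {
    final long offset;   // byte offset of the first base of the sequence
    int lineBases;       // bases per full line
    int lineBytes;       // bytes per full line, including the line terminator
    long length;         // total number of bases

    FastaIndexEntry(long offset) {
        this.offset = offset;
    }
}

/** Sketch of a FASTA index; assumes '\n' line endings and uniform line length. */
class FastaIndex {
    /** Scans the data once and records one entry per '>' header. */
    static Map<String, FastaIndexEntry> build(byte[] data) {
        Map<String, FastaIndexEntry> index = new LinkedHashMap<>();
        FastaIndexEntry current = null;
        int pos = 0;
        while (pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            int lineLen = eol - pos;                    // bytes without the terminator
            int next = Math.min(eol + 1, data.length);  // start of the following line
            if (lineLen > 0 && data[pos] == '>') {
                String header = new String(data, pos + 1, lineLen - 1, StandardCharsets.US_ASCII);
                String name = header.trim().split("\\s+")[0];
                current = new FastaIndexEntry(next);
                index.put(name, current);
            } else if (current != null && lineLen > 0) {
                if (current.lineBases == 0) {           // first sequence line defines the layout
                    current.lineBases = lineLen;
                    current.lineBytes = next - pos;
                }
                current.length += lineLen;
            }
            pos = next;
        }
        return index;
    }

    /** Returns bases start..end (1-based, inclusive) using only index arithmetic. */
    static String fetch(byte[] data, FastaIndexEntry entry, long start, long end) {
        if (start < 1 || end > entry.length || start > end) {
            throw new IllegalArgumentException("region out of range");
        }
        StringBuilder bases = new StringBuilder();
        for (long p = start - 1; p < end; p++) {
            long bytePos = entry.offset + (p / entry.lineBases) * entry.lineBytes + p % entry.lineBases;
            bases.append((char) data[(int) bytePos]);
        }
        return bases.toString();
    }
}
```

For a file on disk the same arithmetic would give the argument of a seek call (for example RandomAccessFile.seek), so only the requested region has to be read.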
The currently supported track sources are:
Locally installed Ensembl database provides 5 tracks: karyotype, genes, repeats, variations and GC content.
SQL-track: table in MySQL database representing genomic intervals with arbitrary annotation. It can be created in a number of ways:
User-uploaded track file. A variety of file formats are supported including the following:
Output of many programs including the following:
Output of many built-in BioUML analyses.
BAM-track: BAM (Binary sequence Alignment/Map [Li et al, 2009]) file, either uploaded by the user or created by an analysis tool. An index is created for the file to retrieve information faster.
DAS-track: sites (features) are retrieved from the web-service via DAS protocol.
Client track: track data is retrieved from a remote BioUML server.
Virtual track: track based on existing track(s) with some transformation applied:
A set of interfaces is used to encapsulate the track and sequence implementations. The main interfaces are Sequence and Track, which provide convenient access to sequence data (nucleotide letters) and track data (site information). The Site interface represents a single feature on the track. The Site and Track interfaces with some of the implementing classes are shown in Fig. 2.
The Track interface provides methods to retrieve either the collection of all sites or the sites located in a specific region of a specific sequence (chromosome). The SlicedTrack abstract class supports partial loading of track data from the back-end. The WritableTrack interface allows adding sites to the track. Two WritableTrack implementations are available: SqlTrack, which serializes the track data into a MySQL database, and TrackImpl, which stores all the sites in memory (this can be useful inside some analyses). ClientTrack is used as a proxy for a Track located on a remote BioUML server.
The Site interface allows retrieving site information such as the sequence and coordinates where the site is located, the strand, the site type and so on. Any custom information associated with the site can be retrieved using the getProperties() method.
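The relationship between sites, tracks and writable tracks can be sketched as follows. This is a simplified stand-in, not the BioUML API: apart from getProperties(), which is named in the text, every type name (SketchSite, SketchTrack, SketchWritableTrack, SimpleSite, InMemoryTrack) and every method signature is an assumption made for illustration. The in-memory track plays the role described for TrackImpl and answers region queries with a plain linear scan.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a site; signatures are assumptions. */
interface SketchSite {
    String getSequenceName();
    int getFrom();                        // 1-based, inclusive
    int getTo();                          // 1-based, inclusive
    char getStrand();                     // '+', '-' or '.'
    Map<String, Object> getProperties();  // arbitrary annotation
}

/** Read-only access: all sites, or the sites overlapping a region. */
interface SketchTrack {
    List<SketchSite> getAllSites();
    List<SketchSite> getSites(String sequenceName, int from, int to);
}

/** A track that also accepts new sites. */
interface SketchWritableTrack extends SketchTrack {
    void addSite(SketchSite site);
}

class SimpleSite implements SketchSite {
    private final String sequenceName;
    private final int from;
    private final int to;
    private final char strand;
    private final Map<String, Object> properties;

    SimpleSite(String sequenceName, int from, int to, char strand, Map<String, Object> properties) {
        this.sequenceName = sequenceName;
        this.from = from;
        this.to = to;
        this.strand = strand;
        this.properties = properties;
    }

    public String getSequenceName() { return sequenceName; }
    public int getFrom() { return from; }
    public int getTo() { return to; }
    public char getStrand() { return strand; }
    public Map<String, Object> getProperties() { return properties; }
}

/** Keeps every site in memory; a region query is a linear scan. */
class InMemoryTrack implements SketchWritableTrack {
    private final List<SketchSite> sites = new ArrayList<>();

    public void addSite(SketchSite site) {
        sites.add(site);
    }

    public List<SketchSite> getAllSites() {
        return Collections.unmodifiableList(sites);
    }

    public List<SketchSite> getSites(String sequenceName, int from, int to) {
        List<SketchSite> result = new ArrayList<>();
        for (SketchSite site : sites) {
            // Two closed intervals overlap when each starts before the other ends.
            if (site.getSequenceName().equals(sequenceName) && site.getFrom() <= to && site.getTo() >= from) {
                result.add(site);
            }
        }
        return result;
    }
}
```

A database-backed or sliced track would replace the linear scan with an indexed query that loads only the requested region, while callers keep using the same read-only interface.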
There is also a TrackViewBuilder class, which builds a view of a specific region of a track given specific visualization options. Most tracks use the default builder, but a custom builder is sometimes necessary to display special tracks such as the karyotype.
To store and display graphical representation of the selected genomic region, the View class hierarchy is used. This hierarchy represents a number of graphical primitives as well as the CompositeView class, which can combine several views together. The class diagram is shown in Fig. 3.
The View abstract class defines basic methods for all graphical elements including manipulation methods (move, scale, setModel, setSelectable), informational methods (getBounds, isSelectable, getModel), rendering (paint) and the serialization method (toJSON). The CompositeView class is a type of View which may contain other views.
A model can be associated with an individual view. In our case this binds the view representing an individual site to the site itself, so the user can simply select a site to get all the annotation associated with it.
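The composite structure and the view-to-model binding can be illustrated with a small sketch. It is not the BioUML View hierarchy: the class names SketchView, SketchBoxView and SketchCompositeView, the integer bounds, the findAt method and the JSON layout are assumptions, and rendering (paint) is left out. The point shown is that selecting a position resolves to a primitive view, whose model is the site it represents.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified graphical element with bounds and an optional model object. */
abstract class SketchView {
    protected int x, y, width, height;
    private Object model;

    SketchView(int x, int y, int width, int height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    void move(int dx, int dy) { x += dx; y += dy; }
    Object getModel() { return model; }
    void setModel(Object model) { this.model = model; }

    boolean contains(int px, int py) {
        return px >= x && px < x + width && py >= y && py < y + height;
    }

    /** Returns the view under the point, or null if the point is outside. */
    SketchView findAt(int px, int py) {
        return contains(px, py) ? this : null;
    }

    abstract String toJSON();
}

/** A primitive: a plain rectangle, e.g. one site. */
class SketchBoxView extends SketchView {
    SketchBoxView(int x, int y, int width, int height) {
        super(x, y, width, height);
    }

    @Override
    String toJSON() {
        return "{\"type\":\"box\",\"x\":" + x + ",\"y\":" + y
                + ",\"width\":" + width + ",\"height\":" + height + "}";
    }
}

/** A view made of other views; its bounds are the union of the children. */
class SketchCompositeView extends SketchView {
    private final List<SketchView> children = new ArrayList<>();

    SketchCompositeView() {
        super(0, 0, 0, 0);
    }

    void add(SketchView child) {
        children.add(child);
        updateBounds();
    }

    @Override
    void move(int dx, int dy) {
        for (SketchView child : children) {
            child.move(dx, dy);
        }
        updateBounds();
    }

    /** Searches children from last to first, so later-added views win. */
    @Override
    SketchView findAt(int px, int py) {
        for (int i = children.size() - 1; i >= 0; i--) {
            SketchView hit = children.get(i).findAt(px, py);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    @Override
    String toJSON() {
        StringBuilder json = new StringBuilder("{\"type\":\"composite\",\"children\":[");
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append(children.get(i).toJSON());
        }
        return json.append("]}").toString();
    }

    private void updateBounds() {
        if (children.isEmpty()) {
            x = y = width = height = 0;
            return;
        }
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
        for (SketchView child : children) {
            minX = Math.min(minX, child.x);
            minY = Math.min(minY, child.y);
            maxX = Math.max(maxX, child.x + child.width);
            maxY = Math.max(maxY, child.y + child.height);
        }
        x = minX;
        y = minY;
        width = maxX - minX;
        height = maxY - minY;
    }
}
```

In this sketch a click handler would call findAt on the root view and pass getModel() of the result to the information panel.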
To make the genome browser more user-friendly, a semantic zoom feature is implemented. Depending on the current zoom level and the number of visible sites, a different view is displayed to the user, as shown in Fig. 4.
Here a BED file uploaded by the user is used as an example track. The BED file contains aligned reads of a ChIP-Seq experiment performed on the FoxA1 protein. The genome browser in Fig. 4 displays Homo sapiens chromosome 22 (NCBI36 genome build). As can be seen, the genome browser can display this track at any scale from per-nucleotide to chromosome-wide. Upon zooming in, the coverage profile view changes to the individual reads view. Further zooming adds read titles (the strand direction is used as the read title in this example), and then the reads are displayed as arrows. At the most detailed scale the nucleotide sequence becomes visible.
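The switching between representations can be sketched as a simple decision on the zoom level and the number of visible sites, together with a binned coverage profile for the zoomed-out case. The sketch below is an assumption-laden illustration: the enum ZoomMode, the class SemanticZoomSketch and all threshold values are hypothetical, since the text does not give the rules actually used by BioUML.

```java
/** Representations named after the stages described in the text. */
enum ZoomMode { PROFILE, READS, TITLED_READS, ARROWS }

class SemanticZoomSketch {
    // Hypothetical thresholds; the real values are not given in the text.
    static final int MAX_INDIVIDUAL_SITES = 2000;
    static final double TITLE_PIXELS_PER_BASE = 0.5;
    static final double ARROW_PIXELS_PER_BASE = 2.0;
    static final double SEQUENCE_PIXELS_PER_BASE = 8.0;

    /** Picks a representation from the scale and the number of visible sites. */
    static ZoomMode chooseMode(double pixelsPerBase, int visibleSites) {
        if (visibleSites > MAX_INDIVIDUAL_SITES) {
            return ZoomMode.PROFILE;        // too many sites to draw one by one
        }
        if (pixelsPerBase < TITLE_PIXELS_PER_BASE) {
            return ZoomMode.READS;
        }
        if (pixelsPerBase < ARROW_PIXELS_PER_BASE) {
            return ZoomMode.TITLED_READS;
        }
        return ZoomMode.ARROWS;
    }

    /** Nucleotide letters are drawn only when each base is wide enough. */
    static boolean showSequence(double pixelsPerBase) {
        return pixelsPerBase >= SEQUENCE_PIXELS_PER_BASE;
    }

    /**
     * Coverage profile: for each of the equal-width bins over [from, to],
     * counts the reads that overlap the bin. Coordinates are 1-based, inclusive.
     */
    static int[] coverage(int[] readStarts, int[] readEnds, int from, int to, int bins) {
        int[] profile = new int[bins];
        double binWidth = (to - from + 1) / (double) bins;
        for (int i = 0; i < readStarts.length; i++) {
            int start = Math.max(readStarts[i], from);   // clip the read to the region
            int end = Math.min(readEnds[i], to);
            if (start > end) {
                continue;                                // read lies outside the region
            }
            int first = (int) ((start - from) / binWidth);
            int last = Math.min(bins - 1, (int) ((end - from) / binWidth));
            for (int b = first; b <= last; b++) {
                profile[b]++;
            }
        }
        return profile;
    }
}
```

Choosing the number of bins equal to the display width in pixels would give one profile value per pixel column.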
Asynchronous track loading
In BioUML web edition the genome browser supports asynchronous track loading. Some tracks may load and display slowly depending on their source, the requested region and the settings applied. For example, a DAS track can be slow because of network latency or delays on the remote DAS server. Thanks to asynchronous track loading, the genome browser does not stall because of slow tracks: while they are being loaded, it displays the other tracks with all functionality working.
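The idea of asynchronous loading can be sketched with one independent task per track. The code below is only an illustration under stated assumptions: the class AsyncTrackLoader and its methods are hypothetical, a list of strings stands in for the display, and it relies on CompletableFuture from Java 8, which is newer than the Java 1.6 mentioned above. How the BioUML web client actually schedules its requests is not described in the text.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Loads each track on a worker thread and "renders" it as soon as it arrives. */
class AsyncTrackLoader {
    // Daemon threads, so a stalled source never keeps the JVM alive.
    private final ExecutorService pool = Executors.newFixedThreadPool(4, task -> {
        Thread thread = new Thread(task);
        thread.setDaemon(true);
        return thread;
    });

    // Stands in for the display: one entry per track that finished loading.
    private final List<String> rendered = new CopyOnWriteArrayList<>();

    /** Starts loading a track; the returned future completes once it is rendered. */
    CompletableFuture<Void> load(String trackName, Supplier<List<String>> source) {
        return CompletableFuture.supplyAsync(source, pool)
                .thenAccept(sites -> rendered.add(trackName + ":" + sites.size()))
                .exceptionally(error -> {
                    rendered.add(trackName + ":failed");   // a broken source affects only its own track
                    return null;
                });
    }

    List<String> getRendered() {
        return rendered;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Because every track has its own future, a slow or failing source delays only its own entry, and the remaining tracks are rendered in the order in which their data arrives.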
BioUML web edition genome browser
The genome browser is available as a part of the BioUML platform. The web version is freely available at http://ie.biouml.org/bioumlweb/ . Users may either press the “Demo” button for anonymous access or register to get a personal project folder.
Example genome browser views are available in the “Data” tree tab, under “data/Examples/GenomeBrowser/Data/Views”. You may use a shortcut URL to get right there: http://ie.biouml.org/samples/gb. Fig. 5 shows a screenshot of this demonstration.
This stored view contains the chromosome 1 sequence from the Ensembl database located at “databases/EnsemblHuman52”. Three tracks are added from the same Ensembl database. The fourth track is a DAS-track, which queries data from the EBI genomic DAS server (http://www.ebi.ac.uk/das-srv/genomicdas ) using the DAS source “CytoChip v3”. Some of the available interactive features are listed below:
Click and drag with your mouse to navigate along the track.
Select a range with the mouse while holding the Ctrl or Shift key to zoom into this range.
Select a sequence in the “Sequence (chromosome)” drop-down list or type a new position in the “Position” box to navigate to the new range.
Use the mouse wheel while holding the Ctrl key to zoom in or out.
Use the “Semantic zoom out” and “Semantic zoom in” toolbar buttons to zoom in or out.
Move the mouse pointer over the genome browser display to get context-sensitive information like current position or site name.
Click on an individual site to get the site information in the Info box in the bottom-left corner.
Drag track names up and down to rearrange them.
Drag tracks from the tree to the genome browser to add them into the view.
Use the “Sites” tab to see the list of all visible sites on the selected track along with their annotation.
Use the “Tracks” tab to rename or remove tracks, set up their options and see the legend.
Another sample displaying a BAM-track and an SQL-track (user loaded BED file) is located in “data/Examples/GenomeBrowser/Data/Views/BAM & BED samples”. Navigate to this location in the tree and then double-click to open it. This view is shown in Fig. 6.
Here Homo sapiens (GRCh37) chromosome 1 is displayed from the Ensembl database. The first track, called “Short bam example”, is a user-uploaded BAM file. At the current zoom level it is displayed in the profile mode. The next two tracks are loaded from Ensembl, and the last track, called “Reads_FoxA1_treatment”, is a user-loaded BED file which was also displayed in Fig. 4.
To create your own genome browser view you should first select the genome. You may navigate, for example, into “databases/Ensembl/Sequences/chromosomes GRCh37” and double-click on this folder or on the chromosome name. Alternatively you may upload your own sequences (for example, in FASTA format) and open them. Anonymous users have a shared project which is located under “data/Collaboration/Demo/Data”. You may import your data into this location.
After opening a sequence you may be prompted to add some tracks associated with it. You can always add more tracks later by simply dragging them from the tree. To save the resulting view, press the “Save view” button on the toolbar.
BioUML workbench genome browser
The BioUML workbench can be downloaded and installed from the BioStore service (https://bio-store.org/). After free registration and login, navigate to the “Downloads” area and download the BioUML workbench installer from there. Please note that Java 1.6 or later is required for the BioUML workbench to work correctly.
Upon successful installation you can launch the BioUML workbench. It has a preinstalled “EnsemblRemote” collection, which is connected to the “databases/Ensembl” collection on the ie.biouml.org BioUML server. Open it, navigate into the “Sequences” folder and double-click on “chromosomes GRCh37”. A dialog box will appear asking which tracks you want to add to the genome browser, as shown in Fig. 7. Select some of them (use Ctrl or Shift to select several) and press “Ok”.
The genome browser will open on the first chromosome, as shown in Fig. 8. Most of the functions work in the same manner as in BioUML web edition: you can navigate through the sequence, switch to another chromosome, obtain site information, and so on. The “Sites” and “Tracks” tabs below the genome browser are similar to those on the web.
Genome browsers can be classified in a number of ways. One of the main characteristics is the application type: either standalone or web. Web-based genome browsers differ in how they render the graphics: images can be generated on the server or on the client side.
Another interesting characteristic is the list of data formats supported. Currently there are many track data formats including BED, GFF, GTF, VCF, Wiggle, BigWig, SAM, BAM and so on. To make a genome browser convenient most of these formats should be supported. Other data sources are also interesting: DAS-tracks, locally installed NCBI or Ensembl database and so on.
Some genome browsers allow the user to display custom sequences. This may be crucial for users working with custom genome builds or with genomes not available in public databases.
For BAM files it is important which visualization features are supported: coverage (profile) visualization, reads visualization, phred quality profiles and variation (SNP) visualization. One more important BAM-related feature is automatic indexing. Although the samtools package [Li et al, 2009] includes a separate tool for indexing BAM files, automatic indexing makes it easier to upload them.
Sometimes it’s necessary to get the tabular representation of the features displayed along with their coordinates and annotation. Some genome browsers support this feature.
Table 1 shows a feature comparison between the BioUML genome browser and some other well-known genome browsers including web-based and desktop ones.
A fully functional BioUML genome browser was created and proved to be useful in genomic research. Analysis results obtained in BioUML are much more convenient to visualize in the BioUML genome browser than in any external genome browser. Most of the features available in other modern genome browsers are supported by the BioUML genome browser as well.
The BioUML genome browser is capable of working with very large tracks. For example, the Ensembl Human version 65 Variation track contains more than 40,000,000 individual variations. User-uploaded BED and VCF files containing more than 50,000,000 sites were tested successfully. BAM files containing more than 500,000,000 aligned reads were also visualized without any problems. Thus the BioUML genome browser is suitable for NGS data as well.