Summary: The challenges of Molecular Biology Computing

What is a database

Types of databases

(A) Flat-file Databases

(B) Relational Databases

(C) World Wide Web access to databases

(D) The historical problem

(E) Unifying approaches to link databases


A. Molecular Biology Databases

Bioinformatics scientists collect and organize the sequence data that is generated and make it available to all biologists.

Today data is shared and integrated between the three major data repositories, namely GenBank (which forms part of the NCBI), the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ).

In October 1996, GenBank contained 1,021,211 sequence records, totalling 652,000,000 bases of DNA sequence and occupying 3.1 gigabytes of computer storage space. By June 1997 this had grown to 1,491,000 records and 967,000,000 bases. Check out the growth in sequence records from 1982 to 2004.

The contents of GenBank are now doubling in less than a year, and the doubling rate is accelerating, i.e. the data generated and collected is growing exponentially.
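As a rough illustration of what a doubling time means, the sketch below estimates one from the two GenBank snapshots quoted above. It assumes steady exponential growth between the two dates and an elapsed time of 8 months (October 1996 to June 1997); the function name is hypothetical. The estimate applies only to that 1996-97 window and need not match the present-day rate.

```python
import math


def doubling_time(start_value, end_value, elapsed_months):
    """Months needed for a quantity to double, assuming exponential growth."""
    growth = math.log(end_value / start_value)
    return elapsed_months * math.log(2) / growth


# Figures quoted in the text; the 8-month interval is an assumption.
bases = doubling_time(652_000_000, 967_000_000, 8)
records = doubling_time(1_021_211, 1_491_000, 8)
print(f"bases: {bases:.1f} months, records: {records:.1f} months")
```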

Lists of completely sequenced genomes and ongoing genome projects are maintained at the Genomes Online Database (GOLD), The Institute for Genomic Research (TIGR), the DOE Joint Genome Institute and the GenomeNet Database.

Even simple computation on, or searching of, these enormous databases requires a huge amount of computer power. What will be needed in 5 to 10 years' time is hard to imagine.

B. The Resources at NCBI

NCBI was established in 1988 as a national resource for molecular biology information. It creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.

The NCBI can be summarised as having 3 arms:

The various sequence databases and the PubMed literature database are linked as shown below

ENTREZ is the core search and retrieval system that integrates and links the various databases. To maximise the benefits of the various databases, it is imperative that you read and learn from the ENTREZHELP file.

C. Exercise 1

This exercise shows you the power of using ENTREZ. You will first search for all molecular sequences in the database which contain the term "penicillin-binding" and then limit the search to the Mycobacterium tuberculosis genome only.
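The same two-step search can also be expressed as a query string. The sketch below only builds the request URLs for NCBI's E-utilities "esearch" service and does not contact the server. The choice of the "nuccore" database, the [Organism] field limit, the retmax value and the function name are assumptions made for illustration; the exercise itself is carried out in the ENTREZ web interface.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, organism=None, db="nuccore", retmax=20):
    """Build an esearch URL, optionally limited to one organism."""
    query = term
    if organism:
        # Field tags such as [Organism] restrict a term to one index.
        query = f'{term} AND "{organism}"[Organism]'
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Step 1: every record containing the term.
broad = build_search_url("penicillin-binding")
# Step 2: the same term, limited to Mycobacterium tuberculosis.
limited = build_search_url("penicillin-binding", "Mycobacterium tuberculosis")
print(broad)
print(limited)
```

Comparing the two printed URLs shows that limiting a search amounts to adding one more condition to the query.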

Remember that if you wish to make your searches more efficient and effective, then you will need to read and understand the documents GenBank records and ENTREZHELP.

D. Ribosomal Database Project (RDP)

The RDP database contains aligned and unaligned small subunit ribosomal RNA (rRNA) sequences. Most of the sequences have been extracted from GenBank, and RDP is updated once or twice a year. It can therefore be regarded as a specialist database that is a subset of GenBank.

In addition, the database contains a set of integrated online bioinformatics tools for aligning user-supplied sequences based on rRNA secondary-structure constraints and for constructing phylogenies. It is also possible to download sequences in aligned and unaligned forms. The sequences are in GenBank format.

E. KEGG Database

The Kyoto Encyclopedia of Genes and Genomes (KEGG) database is an excellent database which links the metabolic pathways of all the organisms whose genomes have been sequenced. It also has links to the genes involved in the metabolic pathways.

NOTE: KEGG is part of the GenomeNet and Bioinformatics in Japan. It also houses a range of online tools such as BLAST and CLUSTAL, and is worth looking through.

F. Exercise 2

Sequence Formats

Sequences in different databases are written in different formats (also known as file formats), and it is normal to convert one file format to another when using different computational tools for analysis.

(i) Copy the following GenBank-format DNA sequence record from the NCBI database
(You may need to use Control C to copy)

LOCUS       AY078053                1275 bp    DNA     linear   BCT 13-FEB-2002
DEFINITION  Corbulabacter subterraneus 16S ribosomal RNA gene, partial
VERSION     AY078053
SOURCE      Corbulabacter subterraneus.
  ORGANISM  Corbulabacter subterraneus
            Bacteria; Proteobacteria; alpha subdivision; Rhizobiaceae group;
            Beijerinckia group; Corbulabacter.
REFERENCE   1  (bases 1 to 1275)
  AUTHORS   Patel,B.K.C. and Kanso,S.
  TITLE     Corbulabacter subterraneus gen. nov. sp. nov., a novel bacterium
            from the subsurface Great Artesian Basin of Australia thermal
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1275)
  AUTHORS   Patel,B.K.C. and Kanso,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-FEB-2002) Biomolecular and Biomedical Sciences,
            Griffith University, Nathan, Brisbane 4111, Australia
FEATURES             Location/Qualifiers
     source          1..1275
                     /organism="Corbulabacter subterraneus"
                     /note="type strain; Fai4; ATCC BAA-295; DSM 14364"
     rRNA            <1..>1275
                     /product="16S ribosomal RNA"
BASE COUNT      306 a    305 c    413 g    251 t
        1 atcctggctc agaacgaacg ctggcggcag gcttaacaca tgcaagtcga acgggccctt
       61 cggggtcagt ggcagacggg tgagtaacac gtgggaacgt gcccttcagt tcggaataac
      121 ccagggaaac ttgggctaat accggatacg cccttttggg gaaagattta tcgctgaagg
      181 atcggcccgc gtctgattag ctagttggtg gggtaaaggc tcaccaaggc gacgatcagt
      241 agctggtctg agaggatgat cagccacact gggactgaga cacggcccag actcctacgg
      301 gaggcagcag tggggaatat tggacaatgg gcgcaagcct gatccagcca tgccgcgtga
      361 gtgatgaagg ccttagggtt gtaaagctct ttcggcgggg acgataatga cggtacccgc
      421 agaagaagcc ccggctaact tcgtgccagc agccgcggta atacgaaggg ggctagcgtt
      481 gttcggaatc actgggcgta aagggcgcgt aggcggcttt gtaagtcggg ggtgaaagcc
      541 tgtggctcaa ccacagaatt gccttcggat actgcatggc ttgagaccgg aagaggtaag
      601 tggaactgcg agtgtagagg tgaaattcgt agatattcgc aagaacaccc agtggcgaag
      661 gcggcttact ggtccggatc tgacgctgag gcgcgaaagc gtggggagca aacaggatta
      721 gataccctgg tagtccacgc cgtaaacgat gaatgccaac cgttgggcag cttgctgctc
      781 agtggcgcag ctaacgcttt aagcattccg cctggggagt acggtcgcaa aattaaaact
      841 caaagaaatt gacgggggcc cgcacaagcg gtggagcatg tggtttaatt cgaagcaacg
      901 cgcagaacct taccagcctt tgacatgtcc ggtatggatc ctggagacag gttccttcag
      961 ttcggctggc cggaacacag gtgctgcatg gctgtcgtca gctcgtgtcg tgagatgttg
     1021 ggttaagtcc cgcaacgagc gcaaccctcg cccttagttg ccatcattca gttgggcact
     1081 ctaaggggac tgccggtgat aagccgagag gaaggtgggg atgacgtcaa gtcctcatgg
     1141 cccttacggg ctgggctaca cacgtgctac aatggcggtg acaatgggca gcgaacccgc
     1201 gagggggagc taatcccaaa aagccgtctc agttcggatt gcactctgca actcgagtgc
     1261 atgaaggtgg aatcg

(ii) Paste the sequence into the Sequence File Format Converter and/or Readseq, which is a more recent version of the Sequence File Format Converter program. The same program can thus be presented in many different ways without the underlying program changing. NOTE: You may need to use Control V to paste the sequence into the two programs.

(iii) Select "GenBank" from the INPUT FORMAT pull-down menu and select "Fasta" from the OUTPUT FORMAT pull-down menu. Note the differences between the input format and the output format.
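To make the difference between the two formats concrete, here is a minimal sketch of a GenBank-to-FASTA conversion. It is not how the converter programs above are implemented. It assumes a single record whose sequence lines begin with a base position number, as in the record in step (i), and the function name is hypothetical. The header keeps only the accession from the LOCUS line and the DEFINITION text; all other annotation is dropped.

```python
import re


def genbank_to_fasta(record, width=60):
    """Convert the text of one GenBank record to a FASTA entry (simplified)."""
    accession = "unknown"
    definition = []
    in_definition = False
    bases = []
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                accession = fields[1]
            in_definition = False
        elif line.startswith("DEFINITION"):
            definition.append(line[len("DEFINITION"):].strip())
            in_definition = True
        elif in_definition and line.startswith(" "):
            # Indented continuation of a multi-line DEFINITION.
            definition.append(line.strip())
        elif re.match(r"^\s*\d+\s+[a-zA-Z]", line):
            # Sequence line: drop the position number and the spaces.
            in_definition = False
            bases.append(re.sub(r"[^a-zA-Z]", "", line))
        else:
            in_definition = False
    sequence = "".join(bases)
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    header = f">{accession} {' '.join(definition)}".rstrip()
    return "\n".join([header] + chunks) + "\n"


# Shortened copy of the record in step (i), first two sequence lines only.
example = """LOCUS       AY078053                1275 bp    DNA     linear   BCT 13-FEB-2002
DEFINITION  Corbulabacter subterraneus 16S ribosomal RNA gene, partial
VERSION     AY078053
BASE COUNT      306 a    305 c    413 g    251 t
        1 atcctggctc agaacgaacg ctggcggcag gcttaacaca tgcaagtcga acgggccctt
       61 cggggtcagt ggcagacggg tgagtaacac gtgggaacgt gcccttcagt tcggaataac
"""
print(genbank_to_fasta(example))
```

The FASTA output is a single ">" header line followed by the bare sequence, which is why most of the GenBank annotation is lost in the conversion.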

Different bioinformatics software may require different file formats. For example, the Phylip format is required for the PHYLIP suite of software. (NOTE FOR STUDENTS: CAN YOU FIND AN EXAMPLE OF THE PHYLIP SEQUENCE FILE FORMAT?)

Send comments to Professor Bharat Patel:
[Created: 10 Jan 1999]
[Modified: 10 March 2009]