JaMBW Chapter 4.1

Pattern search


With this applet, a given sequence (nucleic acid or protein) can be searched for the presence of known patterns. The practical purposes for which this applet can be used are countless, and include:

How to use it

Pretty simple:
  1. Paste the nucleic acid or protein sequence that you want analyzed into the appropriate window. Every letter in the sequence is used, but digits and spaces are ignored. Therefore, if your sequence is in GCG format, paste everything after the double dot (..); if it is in FASTA format, paste every line except the first (the one starting with the ">" symbol).
  2. Paste the list of patterns that you want to use into the REBASE window, using the REBASE format. REBASE, the Restriction Enzyme Database, is a collection of information about restriction enzymes, methylases, the microorganisms from which they have been isolated, recognition sequences, cleavage sites, methylation specificity, the commercial availability of the enzymes, and references, covering both published and unpublished observations (dating back to 1952). It is freely available from vent.neb.com, and its author can be reached by email at roberts@neb.com. The REBASE data must be pasted in GCG format, which was chosen because it makes the data very simple to use directly on your desktop. The present distribution includes the complete REBASE site database.
  3. Alternatively, paste the pattern or patterns that you are looking for into the "FINDPATTERN" window, using the notation adopted by the GCG FINDPATTERN program. Again, an existing format was adopted so that users can reuse their own data without further manipulation.
  4. Push the Compute button.
  5. Once the job has finished and the graphical window displays the match locations and clickable information, you can get a printable version by pushing the "print/save" button at the bottom of the window.
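
The input preparation described in step 1 can also be done outside the applet. The following is only a sketch in Python: the function name clean_sequence is hypothetical, and it assumes that a FASTA entry begins with a ">" header line and that a GCG entry contains ".." exactly where the sequence starts.

```python
import re

def clean_sequence(pasted: str) -> str:
    """Reduce pasted text to bare sequence letters (hypothetical helper)."""
    text = pasted.lstrip()
    if text.startswith(">"):
        # FASTA: drop the first line (the ">" header), keep the rest.
        _, _, text = text.partition("\n")
    elif ".." in text:
        # GCG: keep only what follows the first double dot.
        _, _, text = text.partition("..")
    # Digits and white space are ignored, as in the applet.
    return re.sub(r"[\d\s]", "", text)

fasta = ">seq1 test\nACGT ACGT\n1 GGCC\n"
print(clean_sequence(fasta))
```

The same function can be applied to plain sequence text, in which case only digits and white space are removed.
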
This distribution includes the following pattern files:


The FINDPATTERN language

In addition to the simple matching / substituting language used by the REBASE pattern-search input, the applet accepts the FINDPATTERN language, which the GCG developers designed to allow a more flexible description of nucleic acid / protein patterns. Today, the fully flexible and rigorous way of performing such searches is certainly the regular expression approach. However, regular expression syntax is not as intuitive as the FINDPATTERN language, and because most users already know the latter well, it is the one used here. Future editions of this applet will also support regular expressions (with many examples), in order to introduce users to this much more powerful and general mechanism. Thanks to the work of sbrandt@stevesoft.win.net, regular expression applets are in fact already available, so the cumbersome aspects of implementing them are no longer an obstacle.

DEFINING PATTERNS

          FindPatterns,  Map,  MapSort,  MapPlot,  and  Motifs  all  let  you search  with
          ambiguous  expressions that match many different sequences.  The expressions can
          include any legal GCG sequence  character  (see Appendix III).  The  expressions
          can also include  several non-sequence characters, which are used  to specify OR
          matching,  NOT matching,  begin  and  end constraints,  and repeat counts.   For
          instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of
          any base,  followed by ATG.   Following  is  an  explanation  of  the syntax for
          pattern specification.
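
For readers who know regular expressions, the example above can be translated by hand. This is only a sketch in Python, assuming that N stands for any one of the four bases A, C, G, T; the names SPACER_PATTERN and sequence are made up for the illustration.

```python
import re

# Hand translation of TAATA(N){20,30}ATG.
# Assumption: N means any one of A, C, G, T, written as the class [ACGT].
SPACER_PATTERN = re.compile(r"TAATA[ACGT]{20,30}ATG")

# A made-up test sequence: TAATA, 25 bases of spacer, then ATG.
sequence = "CC" + "TAATA" + "G" * 25 + "ATG" + "CC"

match = SPACER_PATTERN.search(sequence)
# Convert Python's 0-based, end-exclusive span to 1-based inclusive positions.
start, end = match.start() + 1, match.end()
print(start, end)
```
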

                    Implied Sets and Repeat Counts

               Parentheses () enclose one or more symbols that can be repeated some number
               of times.   Braces {} enclose numbers that  tell how many times the symbols
               within the preceding parentheses must be found.

               Sometimes, you can leave out  part  of  an  expression.  If  braces  appear
               without preceding parentheses, the numbers in the  braces define the number
               of  repeats  for  the immediately  preceding  symbol.   One  or both of the
               numbers within  the  braces  may be  missing.   For instance,  the  pattern
               GATG{2,}A  means GAT,  followed by  G  repeated from  2 to  350,000  times,
               followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0
               to  350,000  times,  followed  by  A;  the pattern GAT(TG){,2}A means  GAT,
               followed  by TG repeated from 0 to 2 times, followed by A.  (If the pattern
               in the parentheses is an  OR expression (see below),  it cannot be repeated
               more than 2,000 times.)
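
The repeat-count rules can be illustrated by rewriting them as regular expression quantifiers. This is a sketch in Python, not the GCG or JaMBW implementation: the function name repeats_to_regex is hypothetical, and the treatment of a single number such as {3} as an exact count is an assumption, since the text above does not cover that case.

```python
import re

MAX_REPEAT = 350_000  # upper bound quoted in the text above

def repeats_to_regex(pattern: str) -> str:
    """Rewrite FINDPATTERN repeat counts as regex quantifiers (sketch)."""
    # (TG) becomes the non-capturing group (?:TG).
    regex = pattern.replace("(", "(?:")

    def expand(match):
        low, comma, high = match.group(1).partition(",")
        if low and not comma:
            # {3}: treated as an exact count (assumption, not in the text).
            return "{" + low + "}"
        # A missing lower bound means 0, a missing upper bound means 350,000.
        return "{%s,%s}" % (low or 0, high or MAX_REPEAT)

    return re.sub(r"\{([^}]*)\}", expand, regex)

for findpattern in ("GATG{2,}A", "GATG{}A", "GAT(TG){,2}A"):
    print(findpattern, "->", repeats_to_regex(findpattern))
```
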

                    OR Matching

               If  you  are searching  nucleic  acids,  the ambiguity symbols  defined  in
               Appendix III let you define any combination of  G, A, T, or C.   If you are
               searching  proteins, you  can specify any  of  several  symbol  choices  by
               enclosing the  different choices in parentheses and separating  the choices
               with commas.  For instance, RGF(Q,A)S means RGF followed  by either Q  or A
               followed by S.   The length of  choices need not be the same, and there can
               be up to 31 different choices within each set of parentheses.  The  pattern
               GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from
               1 to 4 times followed by A.   The sequence GATTGGA  matches  this  pattern.
               There can be several parentheses  in  a pattern, but parentheses  cannot be
               nested.
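
An OR expression corresponds to an alternation in regular expression syntax. The sketch below, in Python, rewrites each set of comma-separated choices accordingly; the function name or_to_regex is hypothetical, and the code illustrates the notation only, not how GCG or the applet performs the search.

```python
import re

MAX_CHOICES = 31  # limit on choices per set of parentheses quoted in the text

def or_to_regex(pattern: str) -> str:
    """Rewrite each (X,Y,...) choice set as the alternation (?:X|Y|...)."""
    def alternation(match):
        choices = match.group(1).split(",")
        if len(choices) > MAX_CHOICES:
            raise ValueError("too many choices in one set of parentheses")
        return "(?:" + "|".join(choices) + ")"

    # [^()]* cannot span another parenthesis, in line with the no-nesting rule.
    return re.sub(r"\(([^()]*)\)", alternation, pattern)

protein = re.compile(or_to_regex("RGF(Q,A)S"))
nucleic = re.compile(or_to_regex("GAT(TG,T,G){1,4}A"))

print(bool(protein.fullmatch("RGFQS")))    # Q is one of the allowed choices
print(bool(nucleic.fullmatch("GATTGGA")))  # GAT + TG + G + A
```
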

                    NOT Matching

               The pattern GC~CAT means GC, followed by  any symbol except C, followed  by
               AT.  The pattern GC~(A,T)CC means GC, followed by any symbol except A or T,
               followed by CC.
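
In regular expression syntax, NOT matching corresponds to a negated character class. The following Python sketch assumes that every choice after the ~ is a single symbol, as in the two examples above; the function name not_to_regex is hypothetical, and how ambiguity symbols behave under ~ is not covered here.

```python
import re

def not_to_regex(pattern: str) -> str:
    """Rewrite ~X and ~(X,Y) as negated character classes (sketch)."""
    def negate(match):
        # Group 1 holds the content of ~(...), group 2 the symbol of ~X.
        if match.group(1) is not None:
            symbols = match.group(1)
        else:
            symbols = match.group(2)
        # Assumption: every choice is a single symbol, so commas can be dropped.
        return "[^" + symbols.replace(",", "") + "]"

    return re.sub(r"~(?:\(([^()]*)\)|(.))", negate, pattern)

print(not_to_regex("GC~CAT"))      # GC[^C]AT
print(not_to_regex("GC~(A,T)CC"))  # GC[^AT]CC
```
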

                    Begin and End Constraints

               The pattern <GACCAT can only be found if it occurs  at the beginning of the
               sequence range being searched.  Likewise, the pattern GACCAT> would only be
               found if it occurs at the end of the sequence range.
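
The begin and end constraints behave like anchors applied to the searched range rather than to the whole sequence. The Python sketch below illustrates this; the function names anchors_to_regex and find_in_range are hypothetical, and the 1-based inclusive range convention is an assumption.

```python
import re

def anchors_to_regex(pattern: str) -> str:
    """Rewrite a leading < and a trailing > as regex anchors (sketch)."""
    regex = pattern
    if regex.startswith("<"):
        regex = r"\A" + regex[1:]
    if regex.endswith(">"):
        regex = regex[:-1] + r"\Z"
    return regex

def find_in_range(pattern: str, sequence: str, begin: int, end: int):
    """Search positions begin..end (1-based, inclusive; assumed convention)."""
    # Slicing first makes the anchors refer to the range, not the full sequence.
    window = sequence[begin - 1:end]
    match = re.search(anchors_to_regex(pattern), window)
    # Report the 1-based position in the full sequence, or None if not found.
    return None if match is None else match.start() + begin

seq = "TTGACCATGG"
print(find_in_range("<GACCAT", seq, 1, 10))  # not at the start of the range
print(find_in_range("<GACCAT", seq, 3, 10))  # range now starts at GACCAT
print(find_in_range("GACCAT>", seq, 1, 8))   # range now ends at GACCAT
```
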