JaMBW Chapter 4.1
Pattern search
This applet searches a given sequence (nucleic acid or protein) for occurrences of known patterns.
It has many practical uses, including:
- restriction mapping of nucleic acids
- restriction mapping of peptides and proteins
- identification of binding sites
- discovery of promoters
- ...
How to use it
Pretty simple:
- paste the nucleic acid or protein sequence that you want to analyze into the appropriate
window. Every letter in the sequence is used, while digits and spaces are ignored.
Therefore, if your sequence is in GCG format, paste everything
after the double dot (..); if it is in FASTA format, paste every line
except the first (the one starting with the ">" symbol).
- paste the list of patterns that you want to use into the REBASE window, in REBASE
format. REBASE, the Restriction Enzyme Database, is a collection of information about restriction enzymes, methylases, the microorganisms from which they have been
isolated, recognition sequences, cleavage sites, methylation specificity, the commercial availability of the enzymes, and references, covering both published and
unpublished observations (dating back to 1952). It is freely available from vent.neb.com, and
its author can be reached by email at roberts@neb.com. The REBASE data must be pasted
in GCG format, which was chosen because it makes the database simple to use directly
on your desktop. The complete REBASE site database is included in the present distribution.
- alternatively, paste the pattern or patterns that you are looking for into the
"FINDPATTERN" window, using the notation adopted by the GCG FINDPATTERN program. Again,
an existing format is used so that you can reuse your own data
without further manipulation.
- push the Compute button
- once the job has finished and the graphical window displays the pattern locations
and clickable information, you can get a printable version by pushing the "print/save" button
at the bottom of the window.
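The pasting rules in the first step can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not the applet's own code: the function name clean_sequence is hypothetical, and two choices are assumptions (every non-letter character is dropped, not only digits and spaces, and the result is upper-cased).

```python
import re

def clean_sequence(raw: str) -> str:
    """Return only the letters of a pasted sequence (hypothetical helper)."""
    # FASTA format: drop any header line starting with ">".
    lines = [ln for ln in raw.splitlines() if not ln.lstrip().startswith(">")]
    text = "\n".join(lines)
    # GCG format: keep only what follows the first double dot, if present.
    if ".." in text:
        text = text.split("..", 1)[1]
    # Assumption: digits, spaces, and every other non-letter are discarded,
    # and the sequence is upper-cased for later matching.
    return re.sub(r"[^A-Za-z]", "", text).upper()

print(clean_sequence(">seq1 test\nACGT ACGT\n1 ttga\n"))
print(clean_sequence("My seq  Length: 8 ..\n   1 acgt acgt\n"))
```

Both calls return a bare string of letters that can then be searched for patterns.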
The following pattern files are included in this distribution:
The FINDPATTERN language
In addition to the simple matching / substituting language used by the REBASE pattern searching
input, this applet supports the FINDPATTERN language, which the GCG developers created to allow
a more flexible description of nucleic acid and protein patterns. Regular expressions are
nowadays the most flexible and rigorous way of performing such searches.
However, regular expression syntax is not as intuitive as the FINDPATTERN
language, and because most users already know the latter well, it is the one
used here. Future editions of this applet will also support regular expressions
(with many examples), in order to introduce users to this much more powerful
and general mechanism. Thanks to the work of sbrandt@stevesoft.win.net, regular expression
applets are in fact already available, so the difficulty of implementing them is not
an obstacle.
DEFINING PATTERNS
FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with
ambiguous expressions that match many different sequences. The expressions can
include any legal GCG sequence character (see Appendix III). The expressions
can also include several non-sequence characters, which are used to specify OR
matching, NOT matching, begin and end constraints, and repeat counts. For
instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of
any base, followed by ATG. Following is an explanation of the syntax for
pattern specification.
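For readers who know regular expressions, the example TAATA(N){20,30}ATG can be written by hand as a Python regular expression. This is a sketch for comparison only; the applet does not work this way. It assumes that N stands for any of the four bases A, C, G, and T, and the test sequence is invented for the illustration.

```python
import re

# Hand translation of the FINDPATTERN expression TAATA(N){20,30}ATG.
# Assumption: N stands for any of the four bases A, C, G, T.
pattern = re.compile(r"TAATA[ACGT]{20,30}ATG")

# Invented test sequence: two flanking bases, TAATA, a 25-base spacer, ATG.
spacer = "C" * 25
sequence = "GG" + "TAATA" + spacer + "ATG" + "TT"

match = pattern.search(sequence)
print(match.start(), match.end())  # 0-based start, exclusive end

# A spacer shorter than 20 bases does not satisfy the repeat count.
too_short = pattern.search("TAATA" + "C" * 5 + "ATG")
print(too_short)
```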
Implied Sets and Repeat Counts
Parentheses () enclose one or more symbols that can be repeated some number
of times. Braces {} enclose numbers that tell how many times the symbols
within the preceding parentheses must be found.
Sometimes, you can leave out part of an expression. If braces appear
without preceding parentheses, the numbers in the braces define the number
of repeats for the immediately preceding symbol. One or both of the
numbers within the braces may be missing. For instance, the pattern
GATG{2,}A means GAT, followed by G repeated from 2 to 350,000 times,
followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0
to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT,
followed by TG repeated from 0 to 2 times, followed by A. (If the pattern
in the parentheses is an OR expression (see below), it cannot be repeated
more than 2,000 times.)
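The three repeat-count examples above correspond roughly to the following Python regular expressions. The table is a hand-written sketch, not output of the applet, and the names REPEAT_EXAMPLES and matches are hypothetical. The upper limits of 350,000 and 2,000 repeats quoted above are not reproduced, since regular expression quantifiers such as * have no built-in upper limit.

```python
import re

# FINDPATTERN expression -> hand-written regular expression (assumed equivalent).
REPEAT_EXAMPLES = {
    "GATG{2,}A": r"GATG{2,}A",           # G repeated 2 or more times
    "GATG{}A": r"GATG*A",                # empty braces: G repeated 0 or more times
    "GAT(TG){,2}A": r"GAT(?:TG){0,2}A",  # TG repeated 0 to 2 times
}

def matches(findpattern: str, sequence: str) -> bool:
    """True if the whole sequence matches the regex listed for findpattern."""
    return re.fullmatch(REPEAT_EXAMPLES[findpattern], sequence) is not None

print(matches("GATG{2,}A", "GATGGA"))        # two G's after GAT
print(matches("GATG{2,}A", "GATGA"))         # only one G
print(matches("GATG{}A", "GATA"))            # zero G's is allowed
print(matches("GAT(TG){,2}A", "GATTGTGA"))   # TG twice
print(matches("GAT(TG){,2}A", "GATTGTGTGA")) # TG three times
```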
OR Matching
If you are searching nucleic acids, the ambiguity symbols defined in
Appendix III let you define any combination of G, A, T, or C. If you are
searching proteins, you can specify any of several symbol choices by
enclosing the different choices in parentheses and separating the choices
with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A
followed by S. The length of choices need not be the same, and there can
be up to 31 different choices within each set of parentheses. The pattern
GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from
1 to 4 times followed by A. The sequence GATTGGA matches this pattern.
There can be several parentheses in a pattern, but parentheses cannot be
nested.
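A comma-separated set of choices maps naturally onto regular expression alternation. The sketch below rewrites each parenthesised set as a non-capturing group and leaves everything else untouched. The function name or_sets_to_regex is hypothetical, and the code assumes the pattern contains only sequence letters, un-nested parentheses, and brace counts with both numbers present.

```python
import re

def or_sets_to_regex(findpattern: str) -> str:
    """Rewrite (X,Y,Z) choice sets as (?:X|Y|Z); hypothetical helper."""
    def convert(m: re.Match) -> str:
        choices = m.group(1).split(",")
        return "(?:" + "|".join(re.escape(c) for c in choices) + ")"
    # Parentheses cannot be nested, so a simple non-greedy scan is enough.
    return re.sub(r"\(([^()]*)\)", convert, findpattern)

protein_rx = or_sets_to_regex("RGF(Q,A)S")
dna_rx = or_sets_to_regex("GAT(TG,T,G){1,4}A")
print(protein_rx)
print(dna_rx)

# The sequence GATTGGA reads as GAT + TG + G + A, so it matches.
print(re.fullmatch(dna_rx, "GATTGGA") is not None)
```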
NOT Matching
The pattern GC~CAT means GC, followed by any symbol except C, followed by
AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T,
followed by CC.
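In regular expression terms, NOT matching resembles a negated character class. The following sketch covers only the two forms shown above, under the assumption that each excluded choice is a single letter. The function name not_to_regex is hypothetical. Note that a negated class accepts any other character at all, so the sequence should already be reduced to letters.

```python
import re

def not_to_regex(findpattern: str) -> str:
    """Rewrite ~X and ~(X,Y) as negated character classes; hypothetical helper."""
    # ~(A,T) -> [^AT]  (assumes single-letter choices inside the parentheses)
    rx = re.sub(r"~\(([^()]*)\)",
                lambda m: "[^" + m.group(1).replace(",", "") + "]",
                findpattern)
    # ~C -> [^C]
    return re.sub(r"~([A-Za-z])", r"[^\1]", rx)

single = not_to_regex("GC~CAT")
group = not_to_regex("GC~(A,T)CC")
print(single)
print(group)
print(re.fullmatch(single, "GCGAT") is not None)  # G is not C
print(re.fullmatch(single, "GCCAT") is not None)  # C is excluded
```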
Begin and End Constraints
The pattern <GACCAT can only be found if it occurs at the beginning of the
sequence range being searched. Likewise, the pattern GACCAT> would only be
found if it occurs at the end of the sequence range.
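The begin and end constraints behave like regular expression anchors applied to the range being searched. A minimal sketch, assuming the range is selected by slicing the sequence first; the function name anchored_to_regex and the test sequence are invented for the illustration.

```python
import re

def anchored_to_regex(findpattern: str) -> str:
    """Rewrite a leading < and a trailing > as regex anchors; hypothetical helper."""
    at_start = findpattern.startswith("<")
    at_end = findpattern.endswith(">")
    core = findpattern[1:] if at_start else findpattern
    core = core[:-1] if at_end else core
    # \A and \Z anchor to the very start and end of the searched string.
    return (r"\A" if at_start else "") + core + (r"\Z" if at_end else "")

sequence = "GACCATTTGACCAT"  # invented: GACCAT at both ends

begin_hit = re.search(anchored_to_regex("<GACCAT"), sequence)
end_hit = re.search(anchored_to_regex("GACCAT>"), sequence)
print(begin_hit.start(), end_hit.start())

# Searching a sub-range: the slice no longer starts with GACCAT.
sub_range = sequence[1:]
print(re.search(anchored_to_regex("<GACCAT"), sub_range))
```

Slicing is used here because the anchors then refer to the range, not to the full sequence.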