GDE2.2 rev1


Genetic Data Environment
	version 2.2

Table of Contents
Introduction
What's New for this Release
System Requirements
Note to Motif users
Installing the GDE
Using the GDE
Data Types
Menu Functions
File menu
Edit menu
DNA/RNA menu
External Functions
Bug reports/extensions
Acknowledgments
Appendix A, File Formats
Appendix B, Adding Functions
Appendix C, External functions


.c.Introduction

The Genetic Data Environment is part of a growing 
set of programs for manipulating and analyzing 
"genetic" data.  It differs in design from other 
analysis programs in that it is intended to be an 
expandable and customizable system, while still 
being easy to use.

There are a tremendous number of publicly available 
programs for sequence analysis.  Many of these 
programs have found their way into commercial 
packages which incorporate them into integrated, 
easy to use systems.  The goal of the GDE is to 
minimize the amount of effort required to integrate 
sequence analysis functions into a common 
environment.  The GDE takes care of the user 
interface issues, and allows the programmer to 
concentrate on the analysis itself.  Existing programs 
can be tied into the GDE in a matter of hours (or 
minutes) as opposed to days or weeks.  Programs 
may be written in any language, and still seamlessly 
be incorporated into the GDE.

These programs are, and will continue to be, 
available at no charge.  It is the hope that this 
system will grow in functionality as more and more 
people see the benefits of a modular analysis 
environment.  Users are encouraged to make 
modifications to the system, and forward all changes 
and additions to Steven Smith at 
smith@bioimage.millipore.com.  

.c.What's New for this Release

GDE 2.2 is a maintenance release.  Several small 
bugs have been fixed, and new editing features and 
user interface elements have been added.  I have 
also tried to update all of the contributed external 
programs to their latest release.  Updated programs 
include:

Phylip 
Treetool
LoopTool
Readseq
Blast
Fasta

Improved versions of the printing and translate 
functions are included as well.  As for new editing 
features, a useful "yanking" feature has been added 
by Scott Ferguson from Exxon Research, along with 
the capability to export the colormap for a sequence 
(see Appendices A and C).  Among the bugs fixed in 
this release are:

Selection mask problems when exporting to 
Genbank (fixed in 2.1)
Memory leaks (fixed in 2.1)
Correct handling of circular sequences
More liberal interpretation of Genbank formatted 
files. (not column dependent)


.c.System Requirements

GDE 2.2 currently runs on the Sun family of 
workstations.  This includes the Sun3 and Sun4 
(SPARCstation) systems.  It was written in XView, 
and runs on Suns using OpenWindows 3.0 or MIT's 
X Windows.  It runs in both monochrome and color, 
and can be run remotely on any system capable of 
running X Windows Release 4.  You should have at 
least 15 MB of free disk space available.  The binary 
release for SPARCstations was compiled under 
SunOS 4.1.2 and OpenWindows 3.0.

We are also supporting a DECStation version of 
GDE.  This is running under XView 3.0/X11R5. We 
encourage interested people to port the programs to 
their favorite Unix platform.  There are informal 
ports to the SGI line of Unix machines.

.c.Note to Motif users

GDE2.2 can be run using different window 
managers.  The most common alternative to olwm is 
the Motif window manager (mwm).  The only 
problem in using another window manager is that 
the status line is not displayed.  We have added a 
"Message panel" as an option under 
"File->Properties" which displays all of the 
information contained on the status line. 

People using other window managers may also 
prefer to use xterm and xedit as the default terminal 
and file editor.  This can be accomplished by 
replacing all occurrences of 'shelltool' and 'textedit' 
with 'xterm -e' and 'xedit' in the 
$GDE_HELP_DIR/.GDEmenus file.
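
The substitution described above can be done in any 
text editor.  As a rough illustration only, here is a 
small Python sketch of the same replacement.  The 
function name use_x11_tools is made up for this 
example and is not part of the GDE; the only facts 
taken from this manual are the four program names.

```python
import re

def use_x11_tools(menu_text):
    """Return menu text with shelltool/textedit swapped for xterm -e/xedit.

    Whole-word matching is an assumption of this sketch, made so that
    longer words containing these names are left alone.
    """
    menu_text = re.sub(r"\bshelltool\b", "xterm -e", menu_text)
    return re.sub(r"\btextedit\b", "xedit", menu_text)

if __name__ == "__main__":
    print(use_x11_tools("shelltool some_program; textedit some_file"))
```

In practice one would read the .GDEmenus file, keep 
a backup copy, and write the converted text back.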


.c.Installing the GDE

Instructions for the source code release are included 
in the README.install file.

The binary installation consists of creating a GDE 
directory, such as /usr/local/GDE, and untarring the 
installation tarfile into that directory.  If you are 
installing the GDE for your own use, then you can 
simply make a GDE subdirectory.  There is no need 
to be superuser (root) to do the installation in your 
own directory.  For example:

demo% mkdir /usr/local/GDE
demo% cp GDE2.2.tar /usr/local/GDE
demo% cd /usr/local/GDE
demo% tar -xf GDE2.2.tar

After this, each new user will need to add two lines 
to their .cshrc file so that they can find the gde 
programs and files.

demo% cat >> ~/.cshrc
set path = ($path /usr/local/GDE/bin)
setenv GDE_HELP_DIR /usr/local/GDE/help/
^D

You may wish to make a copy of the .GDEmenus 
file from the help directory into your home directory.  
This is only necessary if you wish to modify your 
menus.  Copy the demo files from 
/usr/local/GDE/demo into your local directory, and 
you are now ready to use the GDE.

FastA and Blast need to have properly formatted 
databases installed in $GDE_HELP_DIR under 
the directories FASTA/PIR, FASTA/GENBANK, 
BLAST/pir and BLAST/genbank.  For FASTA, simply 
copy a version of PIR and Genbank into the proper 
directory.  Alternatively, the PIR and GENBANK 
files can be symbolic links to copies of the databases 
held elsewhere on your system.  You may need to 
look at the .GDEmenus file in $GDE_HELP_DIR to 
verify that you are using the same divisions for 
these databases.

Blast installation involves converting PIR and 
GENBANK to a temporary FASTA format (using 
pir2fasta and gb2fasta) and then using pressdb for 
nucleic acid, and setdb for amino acid to reformat the 
databases again into blast format.  The .GDEmenus 
file is currently set up to search with blast using the 
following databases: pir, genpept, genupdate, and 
genbank.  If you wish to divide these into 
subdivisions, then the .GDEmenus file will have to 
be edited.

The most up to date release of blast can be obtained 
via anonymous ftp to ncbi.nlm.nih.gov.  The most 
recent release of FASTA can be obtained via 
anonymous ftp to uvaarpa.virginia.edu.  It is 
strongly recommended that you retrieve these copies, 
and become familiar with their setup.

.c.Using the GDE

It is assumed that the user is familiar with the Unix 
and OpenWindows/X Windows environments.  It is 
also assumed that people running standard MIT 
X Windows will be using the OpenLook window 
manager (olwm).  Other window managers work 
with varied success.  If you are not certain how 
your system is set up, please contact your systems 
administrator.

Once the window system has started and a terminal 
window (xterm, shelltool, etc.) is open, you can start 
up the GDE by typing:

demo% gde tRNAs

This should load the sample data set tRNAs into 
GDE, and the following window should appear:

[Figure: the GDE main window with the tRNAs data set loaded]


This is the sequence alignment editor.  It consists of 
a color alignment display, a set of command menus, 
horizontal and vertical scroll bars to navigate the 
alignment, a list of short sequence names (usually 
the LOCUS of a Genbank entry), and a status line.  
The cursor is located in the upper left corner.


Using the Mouse

The mouse follows OpenLook standards for 
operation.  The functions for each button are:

[Figure: mouse button functions]


The left mouse button is used for placing the cursor, 
selecting sequences by their short names, 
scrolling/paging, performing split screens, and 
resizing.  The right button is used for pop up menus, 
and scrollbar menus.  The middle button is used for 
extending a text selection.

Cursor Movement

The cursor can be moved using the arrow keys, or by 
clicking the mouse within a sequence.  The cursor's 
position is displayed on the status line in both 
sequence position and alignment column number.  
The right hand side of the status line shows the left 
and right column positions of the currently active 
display.

Scrolling is controlled by the scrollbar elevator.  By 
clicking (left mouse button) on one of the elevator 
arrows, the screen will scroll one character in that 
direction.  By dragging the elevator center, the 
screen can be moved directly to any location.  By 
clicking directly to one side of the elevator, the 
screen will scroll one full screen in that direction.  
And by clicking on the scrollbar anchor, the elevator 
will move to that anchor.  Scrollbars also have 
menus associated with them giving other scroll 
options.  Use the right mouse button to activate the 
menu.


Selecting Sequences

Sequence selection is necessary before most 
functions can be performed.  Selecting sequences is 
accomplished by clicking or dragging (left button) 
over the short name associated with the sequence(s).  
The name of the sequence should become 
highlighted on the release of the mouse button.  By 
holding down the shift key, you can toggle the 
selection on or off for any set of sequences.  By 
clicking just to the right of any sequence short 
name, you will deselect all of them.

Selecting Text

Selecting text is accomplished in much the same 
way as selecting entire sequences.  In the editing 
window, you can drag the mouse pointer over a 
rectangular region to select a block of text.  By 
using the shift key (or the middle mouse button) 
you can adjust the selection to include other 
sequences, or other columns of text.  If groups are 
enabled, GDE will automatically select all sequences 
in a group if any one sequence in a group is selected 
(See Sequence Editing).

Sequence Protection

All sequences can be individually protected against 
accidental modification.  This is accomplished by 
selecting the set of sequences that you are interested 
in editing, and choosing the "Set protections" menu 
item under the File menu.  Your choices are:

Unambiguous modification		
	Changing/adding/deleting regular characters
Ambiguous changes			
	Changing ambiguous codes ('N', 'X'...)
Alignment modifications			
	Changing alignment gaps ('-', '~')

Sequence Editing

Sequences can be edited by simply typing to insert, 
and using the delete or backspace key to delete 
characters.  Sequences must have the proper 
protections set to allow the type of modifications 
that you are attempting.  The default protection level 
only allows modification to the alignment, but not 
to the sequences themselves.  The Sun function 
keys Cut, Copy and Paste are used to edit selected 
text.  Text selections work in rectangular (possibly 
disjoint) regions.  You can cut or copy a block of 
sequence text, and paste it to a new cursor location 
using these three keys.


Sequence Yanking:

Yanking refers to the "pulling" of a base to fill a 
gapped position like beads on an abacus.  Place the 
cursor over a gap character, and type  k to yank 
the character from the left into the current position.  
Type l to pull the character from the right.  
Repeat counts are honored ("20  l" will yank 
20 characters from the right).


Repeat Counts

By typing a numeric value before an editing 
function you can insert, delete or move a number of 
characters at a time.  The current repeat count is 
displayed on the status line, and can be cleared by 
clicking the left mouse button in the alignment 
window.  In order to insert twenty gaps into a 
sequence, one would type "20-".  In order to move 
down five sequences, one would type "5" followed by 
the down arrow key.  This 
works with all sequence types, however the meta 
(diamond) key must be held down when the cursor 
is in a text or mask sequence.  This is because 
numbers are valid characters in these sequences, and 
would otherwise be confused with repeat counts.  

Split Screen

Split screen editing allows viewing one region 
while editing another.  This is very useful for 
aligning "downstream" regions by editing 
"upstream".

The alignment window can be split horizontally into 
two or more windows into the alignment.  These 
windows scroll independently of each other both 
horizontally and vertically.  The short names 
displayed to the left of the alignment correspond to 
the window that was last scrolled or edited.  Care 
should be taken in any modifications done in this 
mode so that edits are performed on the correct 
sequence.  To avoid confusion during split screen 
operations, the vertical scroll bars may be locked so 
that all windows scroll together.
	
	
                                  


In order to split a window into two views, grab (left 
button) the left or right anchor (small rectangle) at 
either end of the horizontal scrollbar and drag to the 
middle of the window.  This should split the 
window into two views.  To join two views, place 
the mouse pointer on the horizontal scroll bar and 
use the menu (right button).
		
The views are NOT two copies of the alignment.  
Changes in one window are reflected in the other.  
Users should not be confused by this fact.  

Sequence Grouping

Sequences can be grouped for editing functions.  
This is very helpful when trying to adjust several 
sub alignments.  When grouped, all sequences 
within a group will be affected by editing in any 
member of the group.  All sequences within a group 
must have protections set to allow modification 
before any one will be modified.  

In order to group sequences, select the names of the 
sequences that should fall within a group, and select 
Group under the Edit menu.  A number will be 
placed at the left of the sequence representing its 
assigned group number.  To ungroup any sequence or 
sequences, select those sequences and use 
the Ungroup command under the Edit menu.

Special keys

There are also a few special function keys used in 
the GDE.  Some functions have meta key 
equivalents so that they can be called from the 
keyboard instead of through the menu system.  The 
"meta" key is a standard property of X Windows, and 
may be remapped to a different key symbol for 
different keyboards.  For example, meta on Sun 
workstations is labeled with a diamond, whereas on a 
Macintosh running MacX it might be the "apple" 
key.  The key operates like the control or shift key: 
it is held down while pressing the second key in the 
sequence.

Cut text, copy text and paste text are mapped to the 
OpenLook equivalent keys (L10, L6, and L8 on Sun 
keyboards).  Other meta keys are defined in the 
.GDEmenus file, and may be changed to suit your 
preferences.

.c.Data Types

The GDE supports several data types.  The data 
types supported in 2.2 are DNA, RNA, protein 
(single letter codes), mask sequence, and text.  

DNA and RNA

Nucleic acid sequences are tightly type cast, and can 
contain any IUPAC code (ACGTUMRSVWYHKDBN) 
as well as two alignment gap characters ('~' and '-').  
Some keys are remapped to fit IUPAC codes.  For 
example, 'X' is mapped to 'N'.  All other nonstandard 
characters are mapped to the alignment gap '-'.  
Upper and lower case are both supported, and the 
T/U characters are mapped based on whether you are 
working with DNA or RNA.  The color coding for 
DNA and RNA is identical.  The color for ambiguous 
characters and for alignment gaps is grey.

Amino Acid Sequence

Amino acid sequences are loosely type cast, and can 
contain any valid ASCII character.  The results of 
analysis on nonstandard characters are not guaranteed. 
The color for nonstandard amino acid characters and 
for alignment gaps is grey.

Text Sequence

Any valid ASCII printable character can be entered 
into a text sequence.  Care should be taken with 
using space characters, as these will only be saved 
properly in Genbank format, and not in flat file 
format.  The characters @#% and " should be 
avoided as well, as these can confuse the reading of 
flat files if saved in that format.

Mask Sequence

Mask sequence is identical to text sequence with the 
following exceptions.  Mask sequence can have the 
ability (function dependent) of masking out 
positions in an alignment for analysis.  If a mask 
sequence is selected along with some other 
sequence(s) for an analysis function that permits 
masking, then all columns that contain a '0' in the 
mask sequence will be ignored by the function.  The 
mask itself would not be passed to the analysis 
function either.  Some functions allow masking, 
some do not.  Refer to the instruction page for each 
function to see whether or not it supports sequence 
masking.
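
As an illustration of how a selection mask affects the 
data seen by an analysis function, here is a minimal 
Python sketch.  The function name apply_mask is 
hypothetical, and the treatment of columns beyond the 
end of the mask (they are kept) is an assumption of 
this example rather than documented GDE behavior.

```python
def apply_mask(mask, sequences):
    """Drop every alignment column whose mask character is '0'.

    The mask itself is not returned, mirroring the rule that the mask
    is not passed on to the analysis function.
    """
    def masked(i):
        return i < len(mask) and mask[i] == "0"

    return ["".join(c for i, c in enumerate(seq) if not masked(i))
            for seq in sequences]
```

With the mask "1100", the sequences "ACGU" and "GG-A" 
would be passed on as "AC" and "GG".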

Color Masks

Color masks give color to a sequence on a position 
by position basis.  Individual sequences can have 
color masks attached to them, or one color mask can 
be used for an entire alignment.  Color masks are 
generated externally by some analysis functions, and 
are then passed back to the GDE.  The file format for 
a colormask is described in Appendix A.


.c.Menu Functions: File menu

The GDE has several built-in menu functions under 
the File and Edit menus.  These functions are unique 
in that they are part of the primary display editor, 
and are not described in the .GDEmenus file.

Open...
Selecting this will bring up the open file dialog 
box.  Users can scroll through a list of files in the 
current directory, move up and down the directory 
tree, and open any individual data file.  The 
sequence data in that file is loaded into the current 
editing window below any existing sequences.  The 
open command will open any Genbank formatted 
file, or a GDE flat file.

Save as...
This function will save the entire alignment to a 
specified file in either Genbank or flat file format.  
The file will be saved in the local directory unless a 
relative or absolute path is specified.

Properties...
Properties controls the display settings.  Those 
settings include character size, color type, and insert 
direction.  The screen can also be inverted, and 
vertical scroll lock and keyboard clicks (tactile 
feedback) can be turned on or off.  Vertical scrollbar 
lock will cause all split views to scroll together in 
the vertical direction.
			
	                  



Protections...
This will display, and then set the default 
protections for all selected sequences.  If two or 
more of the sequences differ in their current 
protection settings, a warning message will appear in 
the protection dialog box.  The protections currently 
available are alignment gap protection, ambiguous 
character protection, unambiguous character 
protection, and translation protection.
				
	           

Get info...
This option allows the viewing and setting of 
attributes associated with each individual sequence.  
These attributes include short name, full name, 
description, author, comments, and the sequence 
type. The attributes loosely correspond to fields in a 
Genbank entry.  Comments can be included for each 
sequence in the comments field.

	
	
                                   



.c2. Edit menu

Select All
Selects all sequences.  This is helpful when you 
have several dozen sequences.

Select by name...
Select all sequences containing a given string in 
their short names field.  No wild cards are allowed, 
and only selecting is allowed, not de-selecting.  The 
search is started when the Return key is pressed, and 
multiple searches can be accumulated.  Press Done 
when finished.

Cut/Copy/Paste sequences
Cut, copy and paste are primarily useful for 
reordering sequences, and for making duplicate 
copies of a given sequence.  They do not pass 
information to other programs.  This capability will 
be implemented in a later release.  Cut and copy will 
place the selected sequences on an internal clipboard.  
They can then be pasted back at the top of the 
editing window (default) or under the last selected 
sequence. 

Group/Ungroup
Assign a group number to the selected sequences.  
Edit operations in any one sequence within the 
group will be propagated to all within the group.  
Sequence protections from any one sequence are also 
imposed upon all other sequences in the given group.  
If a given operation is illegal in one sequence in a 
group (e.g. alignment modification) then it will not 
work in any of the sequences in that group.  
Ungroup will remove the selected sequences from a 
given group.

Compress 
Compress will remove gap characters from the 
selected sequences.  The user has the option of 
removing all gaps, or simply all columns containing 
nothing but gaps.  This is useful for minimizing the 
length of a subalignment.
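
The two Compress options can be sketched in Python as 
follows.  This is only an illustration of the 
behavior described above; the function names are 
invented, and padding short sequences with '-' before 
comparing columns is an assumption of this example.

```python
GAPS = "-~"

def remove_all_gaps(seqs):
    """Option 1: strip every gap character from each sequence."""
    return ["".join(c for c in s if c not in GAPS) for s in seqs]

def remove_gap_columns(seqs):
    """Option 2: drop only the columns that hold nothing but gaps."""
    width = max((len(s) for s in seqs), default=0)
    padded = [s.ljust(width, "-") for s in seqs]
    keep = [i for i in range(width)
            if any(s[i] not in GAPS for s in padded)]
    return ["".join(s[i] for i in keep) for s in padded]
```

The second form keeps the sequences aligned with one 
another, which is why it is the one suited to 
shortening a subalignment.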

Reverse Sequence
Reverses the selected sequences.  Alignment gaps are 
reversed as well.  The selected sequences will remain 
aligned after reversal.

.c2. DNA/RNA menu

Complement Sequence
Converts DNA/RNA into its complement strand 
(keeping full IUPAC ambiguity).  This function has 
no effect on text, protein, or mask sequence.  Note 
that this function does not produce the reverse 
strand of DNA but merely converts A<->T and 
G<->C.  If the reverse strand is needed, remember to 
Complement and Reverse the sequence (Edit menu).
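
The difference between complementing and reverse 
complementing can be shown with a short Python sketch.  
It uses the standard IUPAC complement pairs (R<->Y, 
M<->K, B<->V, D<->H, with S, W and N unchanged); the 
function names are made up for this example and this 
is not the GDE's own implementation.

```python
_SRC = "ACGTRYMKSWBDHVN"
_DST = "TGCAYRKMSWVHDBN"

def _table(t):
    """Build a translation table using t ('T' or 'U') as A's partner."""
    src, dst = _SRC.replace("T", t), _DST.replace("T", t)
    return str.maketrans(src + src.lower(), dst + dst.lower())

_DNA, _RNA = _table("T"), _table("U")

def complement(seq, rna=False):
    """Complement each base in place; gaps and case are left as they are."""
    return seq.translate(_RNA if rna else _DNA)

def reverse_complement(seq, rna=False):
    """Complement, then reverse, to obtain the opposite strand."""
    return complement(seq, rna)[::-1]
```

Because gap characters are not in the table, an 
aligned sequence keeps its gaps in the same columns 
after complement().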


.c.External Functions

See Appendix C for a full description of functions 
supported in GDE 2.2.  All external functions are 
described in the configuration file .GDEmenus.  Here 
is a brief description of some of the basic functions 
included.

File menu

New Sequence
	Create a new sequence.  Prompts for sequence type 
	and short name.

Import foreign format
Export foreign format
	Load and save sequences using Readseq by Don 
	Gilbert (see Appendix C).

Save Selection
	Save the currently selected sequences in a 
	specified file.

Pretty print
	Print using the sequence formatter supplied by 
	Readseq.

Print Selection
	Print the selected sequences to the chosen printer.  
	This function supports the Unix command enscript 
	as well as lpr.  The .GDEmenus file may need to 
	be modified to add the names of local printers to 
	the printer list.

Edit menu

Sort...
	Sort the selected sequences by a primary and 
	secondary key.  The new order is passed to a new 
	GDE window.

Extract
	Extract the selected sequences into a new window.

DNA/RNA Menu

Translate
	Translate the selected sequences from DNA/RNA to 
	amino acid.  The user can specify the desired 
	reading frame, and the minimum open reading frame 
	(stop codon to stop codon) to translate.  The user 
	can also choose between single letter codes and 
	triple letter codes.  There is also an option to 
	allow each ORF to be entered as a separate 
	sequence.

Dot plot
	Display a dotplot identity matrix for the selected 
	sequence(s).  If only one sequence is selected, 
	then the dotplot is a self comparison.  If two or 
	more sequences are selected, then the first two 
	sequences are compared.

Clustal Align
	Align the selected sequences using the clustalv 
	algorithm by Des Higgins. (See Appendix C)

Find All
	Search the selected sequences for a given 
	substring and highlight the matches.  A specified 
	percentage of mismatches can also be allowed.

Variable Positions
	The selected sequences are scored column by column 
	for conservation.  The result is returned as a 
	grey scale alignment color mask.  This can be 
	useful in selecting PCR primers.

Sequence Consensus
	Return the consensus for the selected sequences.  
	This can either be a majority consensus, or an 
	ambiguity consensus using IUPAC coding.

Distance Matrix
	Calculate a distance matrix for the selected 
	sequences. (See Appendix C)

MFOLD
	Fold the selected sequences using MFOLD by 
	Michael Zuker.  The resulting structure is 
	returned as a nested bracket ('[]') representation 
	of the secondary structure. (See Appendix C)

Draw Secondary Struct
	Draw the selected sequence using the proposed 
	secondary structure.  Both the secondary structure 
	prediction and the RNA sequence should be selected 
	before calling this function.  The drawing program 
	is LoopTool. (See Appendix C)

Highlight Helix
	Show all violations to a proposed RNA secondary 
	structure.  The secondary structure 
	representation must be selected, as well as the 
	aligned sequences to be tested.  The selected 
	sequences will then be colored according to 
	whether or not they support the proposed secondary 
	structure.  Standard Watson/Crick pairing will be 
	colored dark blue, G-U pairing will be colored 
	light blue, mismatches will be colored gold, and 
	pairing to gaps will be red.

Blastn/BlastX
	Search the selected sequence (select only one) 
	against a given database with the BLAST searching 
	tool written by Altschul, Gish, Miller, Myers, and 
	Lipman.  Blastn searches DNA against DNA 
	databases; blastx searches DNA against AA 
	databases by translating the sequence in all six 
	reading frames. (See Appendix C)

FastA
	Search the selected sequence (select only one) 
	against a given database using the FASTA 
	similarity search program written by Pearson and 
	Lipman. (See Appendix C)

Protein Menu


Clustal Align
	Align the selected amino acid sequences using the 
	clustal algorithm. (See Appendix C)

Blastp, Tblastn, Blast3
	Search the selected sequence (select only one) 
	against a given database with the BLAST searching 
	tool written by Altschul, Gish, Miller, Myers, and 
	Lipman.  Blastp searches AA against AA databases; 
	tblastn searches AA against DNA databases by 
	translating the database in all six reading 
	frames.  Blast3 finds three way alignments that 
	could not be found with only pairwise comparisons. 
	(See Appendix C)

Sequence Management Menu

Assemble contigs
	Assemble the selected sequences into contigs using 
	the program CAP (Contig Assembly Program) written 
	by Xiaoqiu Huang.  The resulting sequences are 
	returned to the current GDE window, and they are 
	grouped into contigs.  The user can then sort the 
	sequences by group and offset to produce an 
	ordered list of the contigs. (See Appendix C)

Strategy view
	Pass out the selected sequences to StratView.  
	This program will display contigs in a greatly 
	reduced line drawing.  This is very useful for 
	large contigs.

Restriction sites
	Search the selected sequences for the restriction 
	enzymes specified in the given enzyme file.  The 
	restriction sites are then colored by enzyme.

Phylogeny menu

DeSoete Tree fit
	Calculate a phylogenetic tree using a least 
	squares fitting algorithm on a distance matrix 
	calculated from the selected sequences.  The 
	results can then be passed on to treetool for 
	display and manipulation. (See Appendix C)

Phylip 3.5
	Pass the selected data to one of the treeing 
	programs in Phylip, written by Joe Felsenstein.  
	The chosen phylip program is started in its own 
	window, with the selected sequences already 
	loaded. (See Appendix C)




Citation of work

We ask that any published work using any of the 
external functions in GDE cite the appropriate 
authors.  Please see Appendix C for references.



.c.Bug reports/extensions

Any bug reports, requests for enhancements, and 
useful extensions to the GDE should be forwarded by 
electronic mail to:

smith@bioimage.millipore.com

Please include as much detail as possible in bug 
reports so that the bug can be reproduced. 
Correspondence should be addressed to:

Steven Smith
Millipore Imaging Systems
777 E. Eisenhower Pkwy
Ann Arbor, MI	48108





.c.Acknowledgments

I would like to thank the following people for their 
input, assistance, and code used in the 
development of the GDE:

Carl Woese, Gary Olsen and Mike Maciukenas at 
University of Illinois Dept of Microbiology, Ross 
Overbeek at Argonne National Laboratories, Walter 
Gilbert, Patrick Gillevet, Chunwei Wang, Susan 
Russo and Erik Bunce at the Harvard Genome 
Laboratory.  I would also like to personally thank 
the following people for their permission to include 
their software with this release of GDE.

Tim Littlejohn
Scott Ferguson
Brian Fristensky
Des Higgins
David Lipman and the group at NCBI
William Pearson
Don Gilbert
Xiaoqiu Huang
Joe Felsenstein
Michael Zuker
Geert DeSoete


Many thanks to all the people who have directly and 
indirectly helped with the ongoing support of GDE. 
It is only by the generosity of these people that 
GDE has been successful.



.c.Appendix A, File Formats

The currently supported file formats include GDE 
data files, Genbank formatted files (with type 
extensions), a generic flat file format, and a color 
mask file.

GDE format
GDE format is a tagged field format used for storing 
all available information about a sequence.  The 
format matches very closely the GDE internal 
structures for sequence data.  The format consists of 
text records starting and ending with braces ('{}').  
Between the open and close braces are several tagged 
field lines specifying different pieces of information 
about a given sequence.  The tag values can be 
wrapped with double quote characters ('""') as 
needed.  If quotes are not used, the first whitespace 
delimited string is taken as the value. The allowable 
fields are:

{
name		"Short name for sequence"
longname	"Long (more descriptive) name for sequence"
sequence-ID	"Unique ID number"
creation-date	"mm/dd/yy hh:mm:ss"
direction	[-1|1]
strandedness	[1|2]
type		[DNA|RNA|PROTEIN|TEXT|MASK]
offset		(-999999,999999)
group-ID	(0,999)
creator		"Author's name"
descrip		"Verbose description"
comments	"Lines of comments that can be fairly arbitrary
text about a sequence.  Return characters are allowed, but no internal
double quotes or brace characters.  Remember to close with a double
quote"
sequence	"gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc"
}

Any fields that are not specified are assumed to be 
the default values.  Offsets can be negative as well 
as positive.  Genbank entries written out in this 
format will have all (") converted to ('), and all ({}) 
converted to ([]) to avoid confusion in the parser.  
Leading and trailing gaps are removed prior to 
writing each sequence.  This format is deliberately 
verbose in order to be simple to duplicate.
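
Since the format is meant to be simple to duplicate, 
a reader for it can be quite short.  The following 
Python sketch is one possible reading of the rules 
above (records in braces, tag followed by either a 
quoted value or a single whitespace-delimited word).  
The name parse_gde and the decision to return plain 
dictionaries of strings are choices made for this 
example, not part of the GDE.

```python
import re

# A record is everything between '{' and '}'; quoted values may not
# contain braces, so a simple non-nested match is enough here.
RECORD = re.compile(r"\{([^{}]*)\}")
# A field is a tag, whitespace, then a "quoted value" or one bare word.
FIELD = re.compile(r'([A-Za-z][\w-]*)\s+(?:"([^"]*)"|(\S+))')

def parse_gde(text):
    """Return a list of {tag: value} dictionaries, one per record."""
    records = []
    for body in RECORD.findall(text):
        record = {}
        for match in FIELD.finditer(body):
            tag, quoted, bare = match.groups()
            value = quoted if quoted is not None else bare
            if tag == "sequence":
                # Sequence data may be wrapped over several lines.
                value = "".join(value.split())
            record[tag] = value
        records.append(record)
    return records
```

Fields that are missing from a record are simply 
absent from the dictionary, leaving it to the caller 
to supply the default values.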


Genbank format:
GDE can read a concatenated list of Genbank entries, 
and extract certain fields from such files.  The 
default method for storing nucleic acid, amino acid, 
masking sequences or text is in Genbank format.  
The following fields are recognized:

LOCUS:		Short name for this sequence (Maximum of 32 characters)
DEFINITION:	Definition of sequence (Maximum of 80 characters)
ORGANISM:	Full name of organism (Maximum of 80 characters)
AUTHORS:	Authors of this sequence (Maximum of 80 characters)
ACCESSION:	ID Number for this sequence (Maximum of 80 characters)
ORIGIN:		Beginning of sequence data
//		End of sequence data

All other lines are retained as comments.  The 
LOCUS line also specifies what type of sequence 
follows.  The form of this line is:

LOCUS       name		size bp	type	date
	

where name is the Genbank Locus name, size is total 
base count, type is one of DNA, RNA, PROTEIN, 
MASK, or TEXT and date is of the form 
dd-MON-yyyy.  In this way, the standard Genbank 
format is extended to store all text, mask and 
protein data.  The Genbank character set has also 
been extended in order to support these other data 
types.  Valid characters are:

DNA/RNA:	Full IUPAC coding as well as '-' and '~' 
		characters for alignment gaps.
Protein:	All valid single letter codes plus '-' and '~'.  
		Other ASCII characters may be inserted, however 
		external functions may be confused by such 
		characters.
Mask:		All legal printable ASCII characters.  If used 
		as a selection mask, all columns containing a 
		'0' will be removed from any analysis.
Text:		All valid ASCII characters.

Here are valid Genbank entries for two E. coli 
tRNAs:

LOCUS         ECOTRNT4      76 bp     RNA               28-JAN-1991
DEFINITION  E. coli (T4 infected) vulnerable tRNA (A).
  ORGANISM  Escherichia coli
   AUTHORS  Amitsur,M., Levitz,R. and Kaufmann,G.
FEATURES       From  To/Span     Description
    tRNA          1       76     vulnerable tRNA(A)
BASE COUNT  ?
ORIGIN
        1 GGGUCGUUAG CUCAGUUGGU AGAGCAGUUG ACUUUUAAUC AAUUGGNCGC AGGUUCGAAU
       61 CCUGCACGAC CCACCA
//
LOCUS          ECOTRQ1      75 bp     RNA               28-JAN-1991
DEFINITION  E.coli Gln-tRNA-1.
  ORGANISM  Escherichia coli
   AUTHORS  Yaniv,M. and Folk,W.R.
SOURCE      -REFERENCE   [1]  JOURNAL   J. Biol. Chem. 250, 3243-3253 (1975)
FEATURES       From  To/Span     Description
    tRNA          1       75     Gln-tRNA-1 (NAR: 0510)
    refnumbr      1        1     sequence not numbered in [1]
BASE COUNT  ?
ORIGIN
        1 UGGGGUAUCG CCAAGCGGUA AGGCACCGGU UUUUGAUACC GGCAUUCCCU GGUUCGAAUC
       61 CAGGUACCCC AGCCA
//
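
A program that needs to read this format can get by with very little code.  The following Python sketch is an illustration only, not the reader used by the GDE: it pulls out the LOCUS name, size and type, the recognized fields, and the sequence between ORIGIN and //.  The function name and dictionary keys are hypothetical, and stripping digits and blanks from sequence lines is only appropriate for DNA, RNA and protein entries, not for TEXT.

```python
import re

FIELDS = ("DEFINITION", "ORGANISM", "AUTHORS", "ACCESSION")


def parse_entries(text):
    """Split concatenated entries into dictionaries of fields plus sequence."""
    entries, current, in_sequence = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()  # LOCUS name size bp type date
            current = {"name": parts[1], "size": int(parts[2]),
                       "type": parts[4], "sequence": ""}
            in_sequence = False
        elif current is None:
            continue  # ignore anything before the first LOCUS line
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            entries.append(current)
            current, in_sequence = None, False
        elif in_sequence:
            # Drop position numbers and the blanks between blocks of ten.
            current["sequence"] += re.sub(r"[\s\d]", "", line)
        else:
            stripped = line.strip()
            for field in FIELDS:
                if stripped.startswith(field):
                    current[field.lower()] = stripped[len(field):].strip()
    return entries


sample = """LOCUS         ECOTRNT4      76 bp     RNA               28-JAN-1991
  ORGANISM  Escherichia coli
ORIGIN
        1 GGGUCGUUAG CUCAGUUGGU AGAGCAGUUG ACUUUUAAUC AAUUGGNCGC AGGUUCGAAU
       61 CCUGCACGAC CCACCA
//
"""
for entry in parse_entries(sample):
    print(entry["name"], entry["type"], entry["size"], len(entry["sequence"]))
```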


Flat file format:
This is a simplified format for importing sequence 
data, and passing it out to analysis functions.  Very 
little information is actually retained in this format, 
and it should be used carefully so as not to lose 
attribute information.  It is defined as follows:

type_character short_name
sequence_data
sequence_data
sequence_data
...

The type character is # for DNA/RNA, % for protein 
sequence, @ for mask sequence, and " for text.  The 
short name is the same as the name on the LOCUS 
line in Genbank.  This is followed by lines of 
sequence, each ending with a return character.  
These lines are read until the next type character is 
encountered, or until the end of the file is reached.  
Care should be taken in using this format with text, 
as space characters are stripped automatically.  As 
of release 2.0, flat file format allows for an optional 
offset to be specified in parentheses after the 
sequence name.  An offset represents how many 
leading gap characters should be placed before the 
start of a sequence.  If no offset is given, it is 
taken to be 0.  

Here is a sample flat file for two E. coli tRNAs:

#ECOTRNT4
GGGUCGUUAGCUCAGUUGGUAGAGCAGUUGACUUUUAAUCAAUUGGNCGCAGGUUCGAAU
CCUGCACGACCCACCA
#ECOTRQ1
UGGGGUAUCGCCAAGCGGUAAGGCACCGGUUUUUGAUACCGGCAUUCCCUGGUUCGAAUC
CAGGUACCCCAGCCA
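
Because the format is this small, an external function can read it with a few lines of code.  The Python sketch below is one possible reader, not the GDE's own.  The exact header layout for the optional offset (name immediately followed by the offset in parentheses) is an assumption, and the function and key names are hypothetical.

```python
import re

TYPE_CHARS = {"#": "DNA/RNA", "%": "protein", "@": "mask", '"': "text"}
# Assumed header layout: short name, then an optional "(offset)".
HEADER = re.compile(r"^(?P<name>[^()\s]+)\s*(?:\((?P<offset>-?\d+)\))?\s*$")


def parse_flat(text):
    """Return one record per sequence found in a flat file."""
    records = []
    for line in text.splitlines():
        if line[:1] in TYPE_CHARS:
            match = HEADER.match(line[1:].strip())
            if match is None:
                raise ValueError(f"unrecognised header line: {line!r}")
            records.append({
                "type": TYPE_CHARS[line[0]],
                "name": match.group("name"),
                "offset": int(match.group("offset") or 0),  # default is 0
                "sequence": "",
            })
        elif records:
            # Space characters are stripped, as the description warns.
            records[-1]["sequence"] += "".join(line.split())
    return records


sample = "#ECOTRNT4(5)\nGGGUCGUUAG\nCUCAGUUGGU\n%PEPTIDE\nMKV LA\n"
for record in parse_flat(sample):
    print(record["name"], record["type"], record["offset"], record["sequence"])
```

Note that any line beginning with a type character is treated as the start of a new sequence, so mask or text data that begins a line with one of those characters cannot be represented safely.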


Color mask:
The format for a color mask has been kept simple to 
make implementation of color functions easy.  The 
format optionally defines which sequence to color, 
whether or not to color alignment gaps in the 
existing sequence, and how long the following mask 
will be.  It is then followed by a list of decimal 
color codes (range 0 to 15) for each position in the 
sequence.  There are four keywords used in the color 
mask file.  Those keywords are:

name:short name
		If short name matches a currently loaded
		sequence, then impose this color mask on that
		sequence.  If this line is omitted, then color
		all sequences with this mask; the color mask is
		then expected to start at the leftmost column
		on the screen.

length:length
		The following list is length entries long.

nodash:
		Skip over dash characters when imposing this
		color mask on the named sequence.  This allows
		an unaligned color mask to be placed over
		aligned sequence.

start:
		Begin reading the color mask on the next line.

Here is a sample color mask file:

name:test_sequence
length:10
nodash:
start:
3
3
3
6
5
3
3
3
2
7

The colors in the default color lookup table are:
0	White			8	Black
1	Yellow			9	Grey 1
2	Violet			10	Grey 2
3	Red			11	Grey 3
4	Aqua			12	Grey 4
5	Lime Green		13	Grey 5
6	Blue			14	Grey 6
7	Purple			15	White
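
As an illustration of how the four keywords and the nodash: rule fit together, here is a Python sketch that reads a single color mask and lays it over an aligned sequence.  It is a sketch under stated assumptions, not the GDE's implementation: the function names are hypothetical, only one mask per file is handled, and columns left without a color are reported as None.

```python
def parse_color_mask(text):
    """Read the keywords, then the list of decimal color codes after start:."""
    mask = {"name": None, "length": None, "nodash": False, "colors": []}
    reading_colors = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if reading_colors:
            code = int(line)
            if not 0 <= code <= 15:
                raise ValueError(f"color code out of range: {code}")
            mask["colors"].append(code)
        elif line.startswith("name:"):
            mask["name"] = line[len("name:"):]
        elif line.startswith("length:"):
            mask["length"] = int(line[len("length:"):])
        elif line.startswith("nodash:"):
            mask["nodash"] = True
        elif line.startswith("start:"):
            reading_colors = True
    return mask


def color_columns(sequence, mask):
    """Pair each column with a color code; with nodash, '-' columns get None."""
    codes = iter(mask["colors"])
    colored = []
    for residue in sequence:
        if mask["nodash"] and residue == "-":
            colored.append((residue, None))
        else:
            colored.append((residue, next(codes, None)))
    return colored


sample = "name:test_sequence\nlength:10\nnodash:\nstart:\n3\n3\n3\n6\n5\n3\n3\n3\n2\n7\n"
mask = parse_color_mask(sample)
print(color_columns("AC--GUACGUAC", mask))
```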



.c.Appendix B, Adding Functions

The GDE uses a menu description language to 
define what external programs it can call, and what 
parameters and data to pass to each function.  This 
language allows users to customize their own 
environment to suit individual needs.

The following is how the GDE handles external 
programs when selected from a menu:

	
                                 


Each step in this process is described in a file 
.GDEmenus in the user's current or home directory.

The language used in this file describes three phases 
to an external function call.  The first phase 
describes the menu item as it will appear, and the 
Unix command line that is actually run when it is 
selected.  The second phase describes how to prompt 
for the parameters needed by the function.  The third 
phase describes what data needs to be passed as 
input to the external function, and what data (if any) 
needs to be read back from its output.

The form of the language is a simple keyword/value 
list delimited by the colon (:) character.  The 
language retains old values until new ones are set.  
For example, setting the menu name is done once for 
all items in that menu, and is only reset when the 
next menu is reached.
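
The "retain old values" rule can be shown with a small Python sketch that groups keyword:value lines into menu items and carries the menu name forward.  This is an illustration only, not the parser used by the GDE.  The function name is hypothetical, and the second item in the sample (Rewrap, using the standard fold command) is made up for the example.

```python
def parse_menus(text):
    """Group keyword:value lines into items, carrying the menu name forward."""
    items, menu, item = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        keyword, value = line.split(":", 1)
        if keyword == "menu":
            menu = value  # retained until the next menu: line
        elif keyword == "item":
            item = {"menu": menu, "item": value, "settings": []}
            items.append(item)
        elif item is not None:
            item["settings"].append((keyword, value))
    return items


sample = """menu:File
item:All caps
itemmethod:tr '[a-z]' '[A-Z]' < INPUT > OUTPUT
in:INPUT
informat:flat
item:Rewrap
itemmethod:fold -w 60 INPUT > OUTPUT
in:INPUT
"""
for entry in parse_menus(sample):
    print(entry["menu"], "/", entry["item"], entry["settings"][0])
```

Both items end up under the File menu even though menu: appears only once.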

The keywords for phase one are:

menu:menu name
		Name of current menu
item:item name
		Name of current menu item
itemmeta:meta_key
		Meta key equivalence (quick keys)
itemhelp:help_file
		Help file (either full path, or in GDE_HELP_DIR)
itemmethod:Unix command

The itemmethod command is a bit more involved.  It 
is the Unix command that will actually run the 
intended external program.  It is one line long, and 
can be up to 256 characters in length.  It can have 
embedded variable names (starting with a '$') that 
will be replaced with appropriate values later on.  It 
can consist of multiple Unix commands separated by 
semi-colons (;), and may contain shell scripts and 
background processes as well as simple command 
names.  Examples will be given later.
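
To make the substitution step concrete, the following Python sketch expands an itemmethod template into a command line.  It only approximates the behaviour described here and is not the GDE's code.  The function name, the argument name COUNT-style variable WIDTH, the temporary file names, and the choice to replace file placeholders as whole words are all assumptions.

```python
import re

MAX_LENGTH = 256  # limit on the itemmethod line stated above


def expand_itemmethod(template, arguments, files):
    """Replace $name with argument values, then file placeholders with paths."""
    if len(template) > MAX_LENGTH:
        raise ValueError("itemmethod line is longer than 256 characters")

    def lookup(match):
        # Unknown variables are left untouched so the shell still sees them.
        return str(arguments.get(match.group(1), match.group(0)))

    command = re.sub(r"\$(\w+)", lookup, template)
    for placeholder, path in files.items():
        pattern = r"\b" + re.escape(placeholder) + r"\b"
        command = re.sub(pattern, lambda _match, path=path: path, command)
    return command


command = expand_itemmethod(
    "fold -w $WIDTH INPUT > OUTPUT",
    arguments={"WIDTH": 60},
    files={"INPUT": ".gde_001", "OUTPUT": ".gde_002"},
)
print(command)  # fold -w 60 .gde_001 > .gde_002
```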

The keywords for phase two are:

arg:argument_variable_name
		Name of this variable.  It will appear in the
		itemmethod: line with a dollar sign ($) in
		front of it.

argtype:slider,chooser,choice_menu or text
		The type of graphic object representing this
		argument.

arglabel:descriptive label
		A short description of what this argument
		represents.

argmin:minimum_value (integer)
		Used for sliders.

argmax:maximum_value (integer)
		Used for sliders.

argvalue:default_value (integer)
		The numeric value associated with sliders, or
		the default choice in choosers, choice_menus,
		and choice_lists (the first choice is 0, the
		second is 1, etc.).

argtext:default value
		Used for text fields.

argchoice:displayed value:passed value
		Used for choosers and choice_menus.  The first
		value is displayed on screen, and the second
		value is passed to the itemmethod line.

The keywords for phase three are as follows:

in:input_file
		GDE will replace this name with a randomly
		generated temporary file name.  It will then
		write the selected data out to this file.

informat:file_format
		Write data to this file in this format for
		input to this function.  Currently supported
		values are Genbank and flat.

inmask:
		This data can be controlled by a selection
		mask.

insave:
		Do not remove this file after running the
		external function.  This is useful for
		functions put in the background.

out:output_file
		GDE will replace this name with a randomly
		generated temporary file name.  It is up to
		the external function to fill this file with
		any results that might be read back into the
		GDE.

outformat:file_format
		The data in the output file will be in this
		format.  Currently supported values are
		colormask, Genbank, and flat.

outsave:
		Do not remove this file after reading.  This
		is useful for background tasks.

outoverwrite:
		Overwrite existing sequences in the current
		GDE window.  Currently supported with "gde"
		format only.



Here is a sample dialog box, and its entry in the 
.GDEmenus file:
	
	
                              

Using the default parameters given in the dialog 
box, the executed Unix command line would be:

(tr '[a-z]' '[A-Z]' < .gde_001 >.gde_001.tmp ; mv .gde_001.tmp CAPS ; gde CAPS -Wx medium ; rm .gde_001 ) &

where .gde_001 is the name of the temporary file 
generated by the GDE which contains the selected 
sequences in flat file format.  Since the GDE runs 
this command in the background ('&' at the end) it 
is necessary to specify the insave: line, and to 
remove all temporary files manually.  No output file 
is specified because the data is not loaded 
back into the current GDE window, but rather a new 
GDE window is opened on the file.  A simpler 
command that reloads the data after conversion 
might be:

item:All caps
itemmethod:tr '[a-z]' '[A-Z]' < INPUT > OUTPUT

in:INPUT
informat:flat

out:OUTPUT
outformat:flat

In this example, no arguments are specified, and so 
no dialog box will appear.  The command is not run 
in the background, so the GDE can clean up after 
itself automatically.  The converted sequence is 
automatically loaded back into the current GDE 
window.

In general, the easiest type of program to integrate 
into the GDE is a program completely driven from a 
Unix command line.  Interactive programs can be 
tied in (MFOLD for example), however shell scripts 
must be used to drive the parameter entry for these 
programs.  Programs of the form:

program_name -a1 argument1 -a2 argument2 -f inputfile -er errorfile > outputfile

can be specified in the .GDEmenus file directly.  As 
this is the general form of most Unix commands, 
these tend to be simpler to implement under the 
GDE.
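
As a sketch of what such a command-line driven function might look like, here is a small Python filter that takes its input file from a -f option and writes its result to standard output.  It is a hypothetical example, not part of the GDE distribution: it upper-cases the sequence lines of a flat file while leaving the "type_character short_name" lines alone.  The demonstration at the end writes a temporary file only so that the sketch can be run on its own.

```python
import argparse
import os
import sys
import tempfile

TYPE_CHARS = ("#", "%", "@", '"')


def upper_case_flat(lines):
    """Upper-case sequence lines; header lines are passed through unchanged."""
    for line in lines:
        yield line if line.startswith(TYPE_CHARS) else line.upper()


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Upper-case the sequences in a flat file.")
    parser.add_argument("-f", dest="infile", required=True,
                        help="flat file to read")
    args = parser.parse_args(argv)
    with open(args.infile) as handle:
        sys.stdout.writelines(upper_case_flat(handle))


# Demonstration: build a small flat file, run the filter on it, clean up.
with tempfile.NamedTemporaryFile("w", suffix=".flat", delete=False) as tmp:
    tmp.write("#ECOTRQ1\nuggggUAUCGccaagc\n")
main(["-f", tmp.name])
os.remove(tmp.name)
```

Saved under a name such as allcaps.py (a made-up name), a script like this could be called from an itemmethod: line of the form "python allcaps.py -f INPUT > OUTPUT", with in:INPUT and out:OUTPUT both set to flat format.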

As functions grow in complexity, they may begin to 
need a user interface of their own.  In these cases, the 
command line calling arguments are still necessary 
in order to allow the GDE to hand them the 
appropriate data, and possibly retrieve results after 
some external manipulation.


.c.Appendix C, External functions

ClustalV - Cluster multiple sequence alignment 

Author: Des Higgins.

Reference:	Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991)
		CLUSTAL V: improved software for multiple
		sequence alignment.  ms. submitted to CABIOS

Parameters:
		k-tuple pairwise search
			Word size for pairwise comparisons
		Window size
			Smaller values give faster alignments,
			larger values are more sensitive.
		Transitions weighted
			Can weight transitions twice as high as
			transversions (DNA only).
		Fixed gap penalty
			Gap insertion penalty; lower value, more gaps
		Floating gap penalty
			Gap extension penalty; lower value, longer gaps

Comments:
		ClustalV is a directed multiple sequence
		alignment algorithm that aligns a set of
		sequences based on their level of similarity.
		It first uses Lipman-Pearson pairwise
		similarity scoring to find "clusters" of
		similar sequences, and pre-aligns those
		sequences.  It then adds other sequences to
		the alignment in the order of their similarity
		so as to produce the cleanest alignment.

		Warning:  ClustalV only uses unambiguous
		character codes.  It will also convert all
		sequences to upper case in the process of
		aligning.  Clustal does not pass back
		comments, author etc.  Be sure to keep copies
		of your sequences if you do not wish to lose
		this information.


MFOLD - RNA secondary structure prediction

Author: Michael Zuker

Reference: 	M. Zuker
		On Finding All Suboptimal Foldings of an RNA Molecule.
		Science, 244, 48-52, (1989)

		J. A. Jaeger, D. H. Turner and M. Zuker
		Improved Predictions of Secondary Structures for RNA.
		Proc. Natl. Acad. Sci. USA, BIOCHEMISTRY, 86, 7706-7710, (1989)

		J. A. Jaeger, D. H. Turner and M. Zuker
		Predicting Optimal and Suboptimal Secondary Structure for RNA.
		in "Molecular Evolution: Computer Analysis of Protein and
		Nucleic Acid Sequences", R. F. Doolittle ed.
		Methods in Enzymology, 183, 281-306 (1990)

Parameters:
		Linear/circular RNA fold
		ct File to save results

Comments:
		MFOLD passes its output to a program
		Zuk_to_gen that translates the secondary
		structure prediction to a nested bracket ([])
		notation.  This notation can then be used in
		the Highlight Helix, and Draw Secondary
		structure (LoopTool) functions.

		MFOLD currently does not support much in the
		way of additional parameters.  We hope to have
		all additional parameters available soon.


Blast - Basic Local Alignment Search Tool

Reference:
		Karlin, Samuel and Stephen F. Altschul (1990).
		Methods for assessing the statistical
		significance of molecular sequence features by
		using general scoring schemes, Proc. Natl.
		Acad. Sci. USA 87:2264-2268.

		Altschul, Stephen F., Warren Gish, Webb Miller,
		Eugene W. Myers, and David J. Lipman (1990).
		Basic local alignment search tool, J. Mol.
		Biol. 215:403-410.

		Altschul, Stephen F. (1991).  Amino acid
		substitution matrices from an information
		theoretic perspective. J. Mol. Biol.
		219:555-565.

Parameters:
		Which Database
			Which nucleic or amino acid database to
			search.

		Word Size
			Length of initial hit.  After locating a
			match of this length, alignment extension
			is attempted.
	Blastn
		Match score
			Score for matches in secondary alignment
			extension
		Mismatch score
			Score for mismatches in secondary alignment
			extension

	Blastx, tblastn, blastp, blast3
		Substitution Matrix
			PAM120 or PAM250

Comments:
		The report is loaded into a text editor.  This
		should be saved as a new file, as the default
		file is removed after execution.  The latest
		version of blast can be obtained via anonymous
		ftp to ncbi.nlm.nih.gov.




FastA - Similarity search

Reference:
		W. R. Pearson and D. J. Lipman (1988),
		"Improved Tools for Biological Sequence
		Analysis", PNAS 85:2444-2448

		W. R. Pearson (1990) "Rapid and Sensitive
		Sequence Comparison with FASTP and FASTA"
		Methods in Enzymology 183:63-98

Parameters:
		Database
			Which database to search
		Number of alignments to report
		SMATRIX
			Which similarity matrix to use

Comments:
		The FastA package includes several additional
		programs for pairwise alignment.  We have only
		included a bare bones link to FastA.  We hope
		to include a more complete setup for the
		actual 2.2 release.




Assemble Contigs - CAP Contig Assembly Program

Author: Xiaoqiu Huang
		Department of Computer Science
		Michigan Technological University
		Houghton, MI 49931
		E-mail: huang@cs.mtu.edu

		Minor modifications for I/O by S. Smith

Reference:
		"A Contig Assembly Program Based on Sensitive
		Detection of Fragment Overlaps" (submitted to
		Genomics, 1991)

Parameters:
		Minimum overlap
			Number of bases required for overlap
		Percent match within overlap
			Percentage match required in the overlap
			region before merge is allowed.

Comments:
		CAP returns the aligned sequences to the
		current editor window.  The sequences are
		placed into contigs by setting the groupid.
		CAP does not change the order of the
		sequences, and so the results should be sorted
		by group and offset (see sort under the Edit
		menu).


Lsadt - Least squares additive tree analysis

Author: Geert De Soete, 'C' implementation by Mike Maciukenas, University of Illinois

Reference: LSADT, 1983 Psychometrika, 1984 Quality and Quantity

Parameters:
		Distance correction to use in distance matrix
		  calculations (see Count below).
		What should be used for initial parameter
		  estimates
		Random number seed
		Display method (see TreeTool below)

Comments:
		The program has been rewritten in 'C' and will
		be included with the rRNA Database
		phylogenetic package being written at the
		University of Illinois Department of
		Microbiology.

		Count is a short program to calculate a
		distance matrix from a sequence alignment (see
		below).



Count - Distance matrix calculator

Author: Steven Smith

Parameters:
		Correction method
			Currently Jukes-Cantor or none
		Include dashed columns
		Match upper case to lower


Comments:
		Passes back a distance matrix in a 
format readable by LSADT.
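
For readers unfamiliar with the correction, the standard Jukes-Cantor formula is d = -(3/4) ln(1 - (4/3) p), where p is the fraction of compared columns that differ.  The Python sketch below applies it to a set of aligned sequences.  It is not the source of Count: the function names are hypothetical, and the way the "Include dashed columns" and "Match upper case to lower" options are interpreted here is an assumption.

```python
import math

GAPS = "-~"


def jukes_cantor(seq_a, seq_b, correct=True, include_dashes=False, match_case=True):
    """Distance between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if match_case:
        seq_a, seq_b = seq_a.upper(), seq_b.upper()
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: a column is skipped if either sequence has a gap in it.
        if not include_dashes and (a in GAPS or b in GAPS):
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differences / compared
    if not correct:
        return p
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences, **options):
    """Square matrix of pairwise distances, in input order."""
    return [[jukes_cantor(a, b, **options) for b in sequences] for a in sequences]


for row in distance_matrix(["ACGUACGU", "ACGUACGA", "ACGU-CGA"]):
    print(" ".join(f"{value:.4f}" for value in row))
```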




Treetool - Tree drawing/manipulation

Author:	Michael Maciukenas, University of Illinois

Comments:
		See included documentation for 
TreeTool usage.



Readseq - format conversion program

Author:		Don Gilbert

Parameters:	Many, but can easily be run in interactive mode.

Comments:
		Readseq is a very useful program for format
		conversion.  The latest version supports over
		a dozen different file formats, as well as
		formatting capabilities for publication.  GDE
		makes use of Readseq for importing and
		exporting sequences, as well as a filtering
		tool for some external functions.



	

Copyright Notice

The Genetic Data Environment (GDE) software and 
documentation are not in the public domain.  
Portions of this code are owned and copyrighted by 
the Board of Trustees of the University of 
Illinois and by Steven Smith. External functions 
used by GDE are the property of their respective 
authors. This release of the GDE program and 
documentation may not be sold, or incorporated into 
a commercial product, in whole or in part without 
the expressed written consent of the University of 
Illinois and of its author, Steven Smith.

All interested parties may redistribute the GDE as 
long as all copies are accompanied by this 
documentation,  and all copyright notices remain 
intact.  Parties interested in redistribution must do 
so on a non-profit basis, charging only for cost of 
media.  Modifications to the GDE core editor should 
be forwarded to the author Steven Smith.  External 
programs used by the GDE are copyright by, and are 
the property of their respective authors unless 
otherwise stated.


While all attempts have been made to ensure the 
integrity of these programs:

Disclaimer

THE UNIVERSITY OF ILLINOIS, HARVARD 
UNIVERSITY AND THE AUTHOR, STEVEN 
SMITH GIVE NO WARRANTIES, EXPRESSED 
OR IMPLIED FOR THE SOFTWARE AND 
DOCUMENTATION PROVIDED, INCLUDING, 
BUT NOT LIMITED TO WARRANTY OF 
MERCHANTABILITY AND WARRANTY OF 
FITNESS FOR A PARTICULAR PURPOSE.  
User understands the software is a research tool for 
which no warranties as to capabilities or accuracy are 
made, and user accepts the software "as is."  User 
assumes the entire risk as to the results and 
performance of the software and documentation.  The 
above parties cannot be held liable for any direct, 
indirect, consequential or incidental damages with 
respect to any claim by user or any third party on 
account of, or arising from the use of software and 
associated materials.  This disclaimer covers both the 
GDE core editor and all external programs used by 
the GDE.

  Required field