Copyright (C) 2006, 2007 The Institute for
Genomic Research (TIGR). All rights reserved.
This program is free software; you can
redistribute it and/or modify it under the terms of the GNU
General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more
details.
You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free
Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
Boston, MA 02110-1301, USA.
CREDITS
PATCHES
INTRODUCTION
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS/DEPENDENCIES
INCLUDED IN DISTRIBUTION
INSTALLATION
REQUIRED INPUT FILES
INVOCATION
PHAGE_FINDER PIPELINE
EXAMPLE DATA
CREDITS
--------------
Math::Round was written by Geoffrey
Rommel and is available at:
(http://search.cpan.org/~grommel/Math-Round-0.05/Round.pm).
PATCHES
--------------
Suggested updates should be directed
to Derrick Fouts (dfouts@tigr.org) for consideration.
INTRODUCTION
-------------------------
Phage_Finder is a heuristic computer
program written in PERL that uses BLASTP data, HMM results,
and tRNA/tmRNA information to find prophage regions in
complete bacterial genome sequences. For more information
please visit the Phage_Finder website at:
http://www.tigr.org/software/phage_finder
Phage_Finder was written by:
Derrick E. Fouts, Ph.D.
Microbial Genomics
The Institute for Genomic Research (TIGR)
9712 Medical Center Drive
Rockville, MD 20850
(301) 795-7874
dfouts@tigr.org
SYSTEM REQUIREMENTS
----------------------------------------
Phage_Finder should run on all Unix platforms, but it has not
been tested on all of them. It has been tested on Linux
(Red Hat, SUSE and Debian) and Mac OS X 10.3/10.4.
SOFTWARE REQUIREMENTS/DEPENDENCIES
----------------------------------------------------------------------
Phage_Finder.pl requires the following
programs or packages for full functionality:
PERL version 5.8.5 or later
NCBI BLASTALL version 2.2.10 or later
    Linux
    Mac OS X (Apple/Genentech optimized NCBI BLAST)
WUBLAST 2.0
HMMSEARCH
tRNAscan-SE
Aragorn version 1.1.6
FASTA33, MUMMER (if you want to use these to find attachment sites)
Math::Round PERL module [INCLUDED in this distribution]
PHAGE::Phage_subs PERL module [INCLUDED in this distribution]
Getopt::Std [should be installed with PERL]
XGRAPH
INCLUDED IN DISTRIBUTION
---------------------------------------------
-PERL SCRIPTS AND MODULES-
~/phage_finder/bin/phage_finder.pl: The main PERL script for
    finding prophage regions
~/phage_finder/lib/PHAGE/Phage_subs.pm: PERL module that contains
    several reusable subroutines needed by phage_finder.pl (NOT
    object oriented)
~/phage_finder/lib/Math/Round.pm: The Math::Round PERL module by
    Geoffrey Rommel. Phage_Finder uses its nhimult and nlowmult
    functions to round numbers up or down to a defined multiple.
    Very cool module, Geoffrey :)
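To illustrate what rounding to a defined multiple means, here is a
minimal Python sketch of the behaviour that the Math::Round
documentation describes for nhimult and nlowmult, restricted to
integers. The function names next_higher_multiple and
next_lower_multiple are made up for this example; Phage_Finder
itself calls the PERL module, and where it applies the rounding is
not shown here.

```python
def next_higher_multiple(multiple, value):
    """Round value up to the next multiple (cf. Math::Round nhimult)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # -(-a // b) is ceiling division for integers
    return -(-value // multiple) * multiple


def next_lower_multiple(multiple, value):
    """Round value down to the next multiple (cf. Math::Round nlowmult)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return (value // multiple) * multiple


if __name__ == "__main__":
    # 5000 is chosen only because it is the default STEP size (-s)
    print(next_lower_multiple(5000, 12345))   # 10000
    print(next_higher_multiple(5000, 12345))  # 15000
```

A value that is already a multiple is returned unchanged by both
functions.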
-BASH SHELL SCRIPTS-
~/phage_finder/bin/Phage_Finder.sh: A BASH shell script used to
    run the entire Phage_Finder pipeline (including BLAST,
    HMM-finding, tRNA/tmRNA-finding and the Phage_Finder.pl script
    itself)
~/phage_finder/bin/HMM_searchs.sh: A BASH shell script to run all
    of the GLOCAL model phage HMMs, reporting the progress as %
    completed and concatenating the results into a
    combined.hmm_GLOCAL file
~/phage_finder/bin/HMM_FRAG_searches.sh: Similar to
    HMM_searchs.sh, but searching all of the FRAGMENT model phage
    HMMs
-BLAST DATABASE-
~/phage_finder/DB/phage_06_06_05_release.db:
NCBI BLAST-formatted (.phr, .pin, and .psq) and
WUBLAST-formatted (.ahd, .atb, and .bsq) files.
-Helper ascii files-
~/phage_finder/HMM_master.lst: File containing the trusted and
    noise cutoff values and the name of each GLOCAL mode phage HMM
~/phage_finder/HMM_master_FRAG.lst: File containing the trusted
    and noise cutoff values and the name of each FRAGMENT mode
    phage HMM
~/phage_finder/phage_com_names_combo.txt: List of acceptable phage
    annotation
~/phage_finder/phage_exclude.list: List of accessions to exclude
    from analysis
~/phage_finder/PHAGE_core_HMM.lst: List of "core" phage HMMs
~/phage_finder/lysin_holin.lst: List of phage lysin or holin HMMs
~/phage_finder/tails_hmm.lst: List of phage tail HMMs
~/phage_finder/terminase_hmm.lst: List of phage large terminase
    HMMs
~/phage_finder/portal_hmm.lst: List of phage portal HMMs
~/phage_finder/Large_term.lst: List of manually-curated phage
    large terminase accessions
~/phage_finder/portal.lst: List of manually-curated phage portal
    accessions
-Hidden Markov Models (HMMs)-
~/phage_finder/PHAGE_HMMs_dir/: Directory containing 295 GLOCAL
    mode phage HMM models
~/phage_finder/PHAGE_FRAG_HMMS_dir/: Directory containing 146
    FRAGMENT mode phage HMM models
~/phage_finder/examples_dir/: Directory containing the 42-genome
    test dataset
INSTALLATION
------------------------
First, place the distribution tarball in your home directory (~/).
Second, uncompress the distribution tarball by typing:
% tar -xvzf phage_finder.tar.gz
REQUIRED INPUT FILES
-------------------------------------
1) WU-BLAST or NCBI (-m 8 option) btab input file
2) phage_finder_info.txt file (a tab-delimited file containing
   scaffold/contig/assembly_ID size_of_molecule feat_name end5
   end3 com_name)
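As an illustration of the six-column layout of phage_finder_info.txt,
the following Python sketch reads such a file into named records. It
is not part of Phage_Finder. The names InfoRecord and read_info, the
sample rows, and the reading of end5 > end3 as a reverse-strand gene
are assumptions made for this example.

```python
import csv
import io
from typing import NamedTuple


class InfoRecord(NamedTuple):
    asmbl_id: str          # scaffold/contig/assembly_ID
    size_of_molecule: int
    feat_name: str
    end5: int
    end3: int
    com_name: str


def read_info(handle):
    """Yield one InfoRecord per non-empty line of a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, got {len(row)}: {row!r}")
        asmbl_id, size, feat_name, end5, end3, com_name = row
        yield InfoRecord(asmbl_id, int(size), feat_name,
                         int(end5), int(end3), com_name)


if __name__ == "__main__":
    # Invented rows for illustration only
    sample = (
        "XX_000000\t100000\tORF0001\t190\t255\thypothetical protein\n"
        "XX_000000\t100000\tORF0002\t5020\t3734\tputative integrase\n"
    )
    for rec in read_info(io.StringIO(sample)):
        # Assumption: end5 > end3 means the gene lies on the reverse strand
        strand = "+" if rec.end5 <= rec.end3 else "-"
        print(rec.feat_name, rec.end5, rec.end3, strand, rec.com_name)
```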
INVOCATION
---------------------
Usage: Phage_Finder.pl <options>
Example: Phage_Finder.pl -t ncbi.out -i phage_finder_info.txt -r tRNAscan.out -n tmRNA_aragorn.out -A NC_000913.con -S
Switch: -h for help
Options:
-b: base directory path [default = PWD]
-p: path to btab file (default = base directory)
-t: name of WU-BLAST or NCBI (-m 8 option) btab input file
    [REQUIRED]
-i: tab-delimited flat file containing
    scaffold/contig/assembly_ID size_of_molecule feat_name end5
    end3 com_name [REQUIRED]
-m: htab file containing HMM data (REQUIRED for finding integrases
    and att sites)
-F: search method (B or b for NCBI BLAST, M or m for MUMmer, F or
    f for FASTA33) (default = BLAST)
-r: tRNAscan-SE output file [optional]
-n: Aragorn tmRNA-finding output file (-m option in Aragorn)
    [optional]
-w: Scanning WINDOW size (default = 10000 nucleotides)
-s: STEP size (default = 5000 nucleotides)
-E: E-value (default = 0.000001)
-H: Number of allowable hits per window to mark a region
    (default = 4)
-a: User-defined asmbl_id to search (default picks asmbl_id with
    largest size)
-A: File name of .1con
-B: Path to .1con if not in base directory
-V: print version information
-S: Strict mode: print only regions that have core HMM hits or
    Mu-like and are > 10 Kbp (default = 0)
-d: DEBUG MODE (default = 0)
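To show how the -w, -s, -H and -E options relate to one another, here
is a rough Python sketch of a sliding-window hit count. It is not the
algorithm implemented in Phage_Finder.pl. The function names,
0-based window starts, the "at least -H hits" rule and the
"E-value <= cutoff" rule are all assumptions made for illustration.

```python
def count_hits_per_window(hits, genome_size, window=10000, step=5000,
                          max_evalue=1e-6):
    """Return (window_start, hit_count) for each window along the molecule.

    hits is a list of (position, evalue) pairs; hits whose E-value is
    above max_evalue are ignored (assumed reading of -E).
    """
    positions = [pos for pos, evalue in hits if evalue <= max_evalue]
    counts = []
    for start in range(0, genome_size, step):
        end = start + window
        n = sum(1 for pos in positions if start <= pos < end)
        counts.append((start, n))
    return counts


def flag_windows(counts, min_hits=4):
    """Return starts of windows with at least min_hits hits (assumed -H)."""
    return [start for start, n in counts if n >= min_hits]


if __name__ == "__main__":
    # Invented hit coordinates and E-values
    hits = [(1200, 1e-20), (2500, 1e-8), (4100, 1e-30),
            (6000, 1e-10), (7800, 0.5), (31000, 1e-12)]
    counts = count_hits_per_window(hits, genome_size=40000)
    print(flag_windows(counts))  # [0]
```

Because the default step is half the default window, adjacent windows
overlap by 5000 nucleotides, so a cluster of hits that straddles a
window boundary is still counted together in the next window.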
Output: All output files are stored in a subdirectory of the current
working directory ($PWD) named after the genome contig or accession
id (e.g., NC_000913):
1) phage_phinder_<id>.log: a log file recording Phage_Finder
   progress
2) phgraph file: an XGRAPH plot of the phage regions
3) phreport file: a tab-delimited report file that lists the
   window coordinate (incremented by the step size), the number of
   hits per window, and the feat_name or locus name of the hits
4) phpico, phmedio, phregions: tab-delimited files containing the
   5' end of each gene, tRNA or att site within each region, the
   name of the feature, and the annotation/database match/HMM
   match, as well as the G+C% content of each region, a best guess
   for the type of region, and the coordinates of each region with
   or without att site adjustments. The file name depends on the
   size of the regions: 1-10000 bp [phpico], 10001-18000 bp
   [phmedio] and >18000 bp [phregions]
5) .tab file: a tab-delimited file containing contig_id, size of
   the genome, G+C% content of the genome, 5' end of the phage
   region, 3' end of the phage region, size of region in bp, label
   (small, medium, large), region type (prophage, integrated
   element, degenerate), sequence of attR, sequence of attL, name
   of integration target, G+C% of region, 5' feat_name or locus
   name, 3' feat_name or locus name, # integrase HMM hits, #
   core_HMM hits, # above noise core_HMM hits, # lytic gene HMM
   hits, # tail HMM hits, # Mu HMM hits, orientation of the
   prophage based on orientation of the target or the position of
   the integrase, the distance from att site to integrase, and the
   number of genes in the region
6) .1con file: a file in FASTA format containing the DNA sequence
   of the phage region
7) .seq file: a file in FASTA format containing the DNA sequence
   of each gene within the phage region
8) .pep file: a file in FASTA format containing the protein
   sequence of each gene within the phage region
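The size ranges given for the phpico, phmedio and phregions files in
item 4 can be written as a small lookup. The Python sketch below only
restates those ranges. The function name region_file_label is made
up, and the sketch assumes that every region larger than 18000 bp
goes to phregions.

```python
def region_file_label(size_bp):
    """Map a region size in bp to the output file type that reports it."""
    if size_bp < 1:
        raise ValueError("region size must be at least 1 bp")
    if size_bp <= 10000:
        return "phpico"
    if size_bp <= 18000:
        return "phmedio"
    return "phregions"  # assumption: everything above 18000 bp


if __name__ == "__main__":
    for size in (8500, 15000, 42000):
        print(size, region_file_label(size))
```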
PHAGE_FINDER PIPELINE
----------------------------------------
To run the complete Phage_Finder pipeline automatically, you need
to have the following:
1) each genome within a directory that is named by accession or
   contig_id (e.g., NC_000913) (see examples_dir)
2) protein sequences for each genome in their respective
   directories with extension (.pep or .faa)
3) phage_finder_info.txt or GenBank .ptt file in their respective
   directories
4) complete genome sequence (.con or .fna file) in their
   respective directories
% findPhage.sh <file with list of accessions>
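Before launching the pipeline it can help to confirm that each genome
directory holds the inputs listed above. The following Python sketch
is one way to do that and is not part of the distribution. The
function names are made up, and it assumes the accession list file
has one accession per line and that the genome directories sit under
a common base directory.

```python
import sys
from pathlib import Path


def missing_inputs(genome_dir):
    """Return descriptions of pipeline inputs absent from genome_dir."""
    d = Path(genome_dir)
    if not d.is_dir():
        return [f"directory {d} not found"]
    problems = []
    if not any(d.glob("*.pep")) and not any(d.glob("*.faa")):
        problems.append("protein sequences (.pep or .faa)")
    if not (d / "phage_finder_info.txt").is_file() and not any(d.glob("*.ptt")):
        problems.append("phage_finder_info.txt or GenBank .ptt file")
    if not any(d.glob("*.con")) and not any(d.glob("*.fna")):
        problems.append("genome sequence (.con or .fna)")
    return problems


def check_accessions(list_file, base_dir="."):
    """Map each accession named in list_file to its missing inputs."""
    report = {}
    for line in Path(list_file).read_text().splitlines():
        accession = line.strip()
        if accession:  # assumption: one accession per line
            report[accession] = missing_inputs(Path(base_dir) / accession)
    return report


if __name__ == "__main__" and len(sys.argv) == 2:
    for accession, problems in check_accessions(sys.argv[1]).items():
        status = "OK" if not problems else "missing: " + "; ".join(problems)
        print(accession, status)
```

Saved under a name of your choice, the script takes the same
accession list file that is passed to the pipeline and prints one
status line per accession.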
EXAMPLE DATA
-------------------------
Sample data can be found in
~/phage_finder/examples_dir
The file phage_phinder_postprocess.out contains data parsed
using PERL from each test genome's *.tab file generated by
Phage_Finder.
The columns in this file (numbered from 0, by PERL convention)
are:
0) ORGANISM = The name of the organism tested
1) ACCESSION = The GenBank accession number of the test genome
2) SIZE = The size in base pairs of the test genome
3) GC% = The G+C% of the test genome
4) # = The number of the prophage region
5) pgc% = The G+C% of the prophage region
6) end5 = The 5' end of the prophage region
7) end3 = The 3' end of the prophage region
8) loci = The gene span of the prophage region (locus 5'-locus 3')
9) ori = The orientation of the prophage region
10) #phgbp = The size (in base pairs) of the prophage region
11) Ty = Region type (PRO=prophage, BAC=bacteriocin,
    DEG=degenerate, IEL=integrated element)
12) class = Region class (LRG=large, MED=medium, SML=small,
    SAT=satellite, RET=retron element, Mu=Mu-like, P2=P2-like,
    P4=P4-like)
13) att = Predicted attachment site found (Y/N)
14) target = Predicted target of insertion
15) In = The number of integrase HMM hits
16) Co = The number of core phage HMM hits above the trusted
    cutoff
17) NC = The number of core phage HMM hits above the noise cutoff
18) Ly = The number of phage lysis HMM hits
19) Ta = The number of phage tail HMM hits
20) Mu = The number of Mu-like phage HMM hits
21) dX = The distance from integrase to attachment site in base
    pairs
22) OC = The total number of open reading frames (ORFs) within
    region
You can also see the results from searching 302 complete
bacterial genomes in the file
~/phage_finder/examples_dir/302_bac_genomes.txt
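The column layout of phage_phinder_postprocess.out listed above can
be used to read the file by column name. The Python sketch below
counts the regions typed as PRO for each accession. It assumes the
file is tab-delimited with exactly the 23 columns listed, which is
not stated in this README. The function names and the sample row are
invented for illustration.

```python
from collections import Counter

# Column names in index order 0..22, as listed in this README
COLUMNS = ["ORGANISM", "ACCESSION", "SIZE", "GC%", "#", "pgc%", "end5",
           "end3", "loci", "ori", "#phgbp", "Ty", "class", "att", "target",
           "In", "Co", "NC", "Ly", "Ta", "Mu", "dX", "OC"]


def parse_row(line):
    """Split one tab-delimited line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def prophages_per_accession(lines):
    """Count rows whose region type (Ty) is PRO, grouped by accession."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_row(line)
        if row["Ty"] == "PRO":
            counts[row["ACCESSION"]] += 1
    return counts


if __name__ == "__main__":
    # Invented row for illustration; not taken from the real file
    demo = "\t".join(["Example organism", "XX_000000", "4000000", "50.5",
                      "1", "48.9", "100000", "140000", "g0100-g0150", "+",
                      "40001", "PRO", "LRG", "Y", "tRNA-Arg", "1", "5",
                      "7", "2", "3", "0", "250", "51"])
    print(prophages_per_accession([demo]))
```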