Phage_Finder


Documentation

Copyright (C) 2006, 2007 The Institute for Genomic Research (TIGR). All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

CREDITS
PATCHES
INTRODUCTION
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS/DEPENDENCIES
INCLUDED IN DISTRIBUTION
INSTALLATION
REQUIRED INPUT FILES
INVOCATION
PHAGE_FINDER PIPELINE
EXAMPLE DATA


CREDITS
--------------

Math::Round was written by Geoffrey Rommel and is available at:

(http://search.cpan.org/~grommel/Math-Round-0.05/Round.pm).

PATCHES
--------------

Suggested updates should be directed to Derrick Fouts (dfouts@tigr.org) for consideration.

INTRODUCTION
-------------------------

Phage_Finder is a heuristic computer program written in PERL that uses BLASTP data, HMM results, and tRNA/tmRNA information to find prophage regions in complete bacterial genome sequences. For more information please visit the Phage_Finder website at:

http://www.tigr.org/software/phage_finder

Phage_Finder was written by:

Derrick E. Fouts, Ph.D.
Microbial Genomics
The Institute for Genomic Research (TIGR)
9712 Medical Center Drive
Rockville, MD 20850
(301) 795-7874
dfouts@tigr.org

SYSTEM REQUIREMENTS
----------------------------------------

The programs should run on any Unix platform. They have been tested on Linux (Red Hat, SUSE, and Debian) and Mac OS X 10.3/10.4; other Unix platforms have not been tested.

SOFTWARE REQUIREMENTS/DEPENDENCIES
----------------------------------------------------------------------

Phage_Finder.pl requires the following programs or packages for full functionality:

PERL version 5.8.5 or later

NCBI BLASTALL version 2.2.10 or later

    Linux
    Mac OS X (Apple/Genentech optimized NCBI BLAST)

WUBLAST 2.0

HMMSEARCH

tRNAscan-SE

Aragorn version 1.1.6

FASTA33, MUMmer (only needed if you want to use them to find attachment sites)

Math::Round PERL module [INCLUDED in this distribution]

PHAGE::Phage_subs PERL module [INCLUDED in this distribution]

Getopt::Std [should be installed with PERL]

XGRAPH

INCLUDED IN DISTRIBUTION
---------------------------------------------

-PERL SCRIPTS AND MODULES-

~/phage_finder/bin/phage_finder.pl: The main PERL script for finding prophage regions

~/phage_finder/lib/PHAGE/Phage_subs.pm: PERL module that contains several reusable subroutines needed by phage_finder.pl (NOT object oriented)

~/phage_finder/lib/Math/Round.pm: The Math::Round PERL module by Geoffrey Rommel. Phage_Finder uses its nhimult and nlowmult functions to round numbers up or down to a defined multiple. Very cool module, Geoffrey :)
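
To illustrate the kind of rounding described above, here is a minimal Python sketch of rounding to the next higher or lower multiple. It is an analogue written for this documentation, not the Math::Round source: the function names next_lower_multiple and next_higher_multiple are hypothetical, and integer inputs are assumed. Where Phage_Finder applies the rounded values is not shown here.

```python
def next_lower_multiple(multiple, value):
    """Round value down to a multiple of `multiple` (analogous to nlowmult)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return (value // multiple) * multiple


def next_higher_multiple(multiple, value):
    """Round value up to a multiple of `multiple` (analogous to nhimult)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-value // multiple) * multiple


# A coordinate of 12345 snapped to a 5000-nucleotide grid.
print(next_lower_multiple(5000, 12345))   # 10000
print(next_higher_multiple(5000, 12345))  # 15000
```

Values that are already an exact multiple are returned unchanged by both functions.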

-BASH SHELL SCRIPTS-

~/phage_finder/bin/Phage_Finder.sh: A BASH shell script used to run the entire Phage_Finder pipeline (including BLAST, HMM-finding, tRNA/tmRNA-finding and the Phage_Finder.pl script itself)

~/phage_finder/bin/HMM_searchs.sh: A BASH shell script to run all of the GLOCAL model phage HMMs, reporting the progress as % completed and concatenating the results into a combined.hmm_GLOCAL file

~/phage_finder/bin/HMM_FRAG_searches.sh: Similar to HMM_searchs.sh, but searching all of the FRAGMENT model phage HMMs.

-BLAST DATABASE-

~/phage_finder/DB/phage_06_06_05_release.db: NCBI BLAST-formatted (.phr, .pin, and .psq) and WUBLAST-formatted (.ahd, .atb, and .bsq) files.

-HELPER ASCII FILES-

~/phage_finder/HMM_master.lst: File containing the trusted and noise cutoff values and the name of each GLOCAL mode phage HMM

~/phage_finder/HMM_master_FRAG.lst: File containing the trusted and noise cutoff values and the name of each FRAGMENT mode phage HMM

~/phage_finder/phage_com_names_combo.txt: List of acceptable phage annotations

~/phage_finder/phage_exclude.list: List of accessions to exclude from analysis

~/phage_finder/PHAGE_core_HMM.lst: List of "core" phage HMMs

~/phage_finder/lysin_holin.lst: List of lysis or holin phage HMMs

~/phage_finder/tails_hmm.lst: List of phage tail HMMs

~/phage_finder/terminase_hmm.lst: List of phage large terminase HMMs

~/phage_finder/portal_hmm.lst: List of phage portal HMMs

~/phage_finder/Large_term.lst: List of manually-curated phage large terminase accessions

~/phage_finder/portal.lst: List of manually-curated phage portal accessions
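
The HMM_master.lst and HMM_master_FRAG.lst files listed above hold a trusted and a noise cutoff for each HMM. As a rough illustration of how two cutoffs partition a hit score, consider the Python sketch below. The function name classify_hmm_score, the returned labels, and the use of "greater than or equal to" at each boundary are assumptions made for this example; the layout of the .lst files and the exact comparison used by Phage_Finder are not documented here.

```python
def classify_hmm_score(score, trusted_cutoff, noise_cutoff):
    """Place an HMM hit score relative to its trusted and noise cutoffs.

    Assumption: a score equal to a cutoff counts as meeting it.
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "trusted"
    if score >= noise_cutoff:
        return "above_noise"
    return "below_noise"


# Invented cutoffs and scores, for illustration only.
print(classify_hmm_score(120.0, trusted_cutoff=100.0, noise_cutoff=40.0))  # trusted
print(classify_hmm_score(55.0, trusted_cutoff=100.0, noise_cutoff=40.0))   # above_noise
print(classify_hmm_score(10.0, trusted_cutoff=100.0, noise_cutoff=40.0))   # below_noise
```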

-HIDDEN MARKOV MODELS (HMMs)-

~/phage_finder/PHAGE_HMMs_dir/: Directory containing 295 GLOCAL mode phage HMM models

~/phage_finder/PHAGE_FRAG_HMMS_dir/: Directory containing 146 FRAGMENT mode phage HMM models

~/phage_finder/examples_dir/: Directory containing the 42-genome test dataset

INSTALLATION
------------------------

First, place the distribution tarball in your home directory (~/).

Second, uncompress the distribution tarball by typing:

% tar -xvzf phage_finder.tar.gz

REQUIRED INPUT FILES
-------------------------------------

1) WU-BLAST or NCBI (-m 8 option) btab input file

2) phage_finder_info.txt file (a tab-delimited file containing the columns scaffold/contig/assembly_ID, size_of_molecule, feat_name, end5, end3, and com_name)
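
As a quick way to sanity-check a phage_finder_info.txt file before running Phage_Finder, the six tab-delimited columns listed above can be read as in the Python sketch below. The Feature type and parse_info_lines function are hypothetical names introduced for this example, the sample record is invented, and treating the size and coordinate columns as integers is an assumption.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    asmbl_id: str          # scaffold/contig/assembly_ID
    size_of_molecule: int
    feat_name: str
    end5: int
    end3: int
    com_name: str


def parse_info_lines(lines: Iterable[str]) -> List[Feature]:
    """Parse tab-delimited info records, skipping blank lines."""
    features = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"line {number}: expected 6 columns, got {len(fields)}")
        asmbl_id, size, feat_name, end5, end3, com_name = fields[:6]
        features.append(
            Feature(asmbl_id, int(size), feat_name, int(end5), int(end3), com_name)
        )
    return features


# Invented record, for illustration only.
example = ["contig_1\t50000\tORF0001\t190\t255\thypothetical protein\n"]
features = parse_info_lines(example)
print(features[0].feat_name, features[0].end5, features[0].end3)  # ORF0001 190 255
```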

INVOCATION
---------------------

Usage: Phage_Finder.pl <options>

Example: Phage_Finder.pl -t ncbi.out -i phage_finder_info.txt -r tRNAscan.out -n tmRNA_aragorn.out -A NC_000913.con -S

Switch: -h for help

Options:

-b: base directory path [default = PWD]

-p: path to btab file (default = base directory)

-t: name of WU-BLAST or NCBI (-m 8 option) btab input file [REQUIRED]

-i: tab-delimited flat file containing scaffold/contig/assembly_ID size_of_molecule feat_name end5 end3 com_name [REQUIRED]

-m: htab file containing HMM data (REQUIRED for finding integrases and att sites)

-F: search method (B or b for NCBI BLAST, M or m for MUMmer, F or f for FASTA33) (default = BLAST)

-r: tRNAscan-SE output file [optional]

-n: Aragorn tmRNA-finding output file (-m option in Aragorn) [optional]

-w: Scanning WINDOW size (default = 10000 nucleotides)

-s: STEP size (default = 5000 nucleotides)

-E: E-value (default = 0.000001)

-H: Number of allowable hits per window to mark a region (default = 4)

-a: User-defined asmbl_id to search (default picks asmbl_id with largest size)

-A: File name of .1con

-B: Path to .1con if not in base directory

-V: print version information

-S: Strict mode: print only regions that have core HMM hits or are Mu-like, and that are > 10 Kbp (default = 0)

-d: DEBUG MODE (default = 0)
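
To make the interplay of the -E, -w, -s, and -H options more concrete, the Python sketch below counts hits in overlapping windows and reports the windows that reach the hit threshold. It is a simplified illustration under stated assumptions, not the Phage_Finder algorithm itself: the function names are hypothetical, hits are assumed to be (position, E-value) pairs, a hit passes when its E-value is at or below the cutoff, and a window is flagged when it holds at least the -H number of hits.

```python
def positions_passing_evalue(hits, max_evalue=0.000001):
    """Keep the positions of hits whose E-value is at or below the cutoff."""
    return [pos for pos, evalue in hits if evalue <= max_evalue]


def scan_windows(positions, genome_size, window=10000, step=5000, min_hits=4):
    """Return (window_start, hit_count) for every window with enough hits."""
    flagged = []
    for start in range(0, genome_size, step):
        end = start + window
        count = sum(1 for pos in positions if start <= pos < end)
        if count >= min_hits:
            flagged.append((start, count))
    return flagged


# Invented hits as (position, E-value) pairs, for illustration only.
hits = [
    (12000, 1e-20), (13500, 1e-8), (14200, 1e-30), (16000, 1e-10),
    (17800, 1e-7), (18500, 0.5), (40000, 1e-12),
]
positions = positions_passing_evalue(hits)
print(scan_windows(positions, genome_size=50000))  # [(10000, 5)]
```

Because the default step is half the default window, each position falls into two consecutive windows, so a cluster of hits that straddles a window boundary is still counted together in the neighboring window.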

Output: All output is stored within a subdirectory of the current working directory ($PWD) named after the genome contig or accession id (e.g., NC_000913)

1) phage_phinder_<id>.log: a log file recording Phage_Finder progress

2) phgraph file: an XGRAPH plot of the phage regions

3) phreport file: a tab-delimited report file that shows the coordinate (incremented by the step size), the # hits per window, and the feat_name or locus name of the hits

4) phpico, phmedio, phregions: tab-delimited files containing the 5' end of each gene, tRNA or att site within each region, the name of the feature, and the annotation/database match/HMM match, as well as the G+C% content of each region, a best guess for the type of region, and the coordinates of each region with or without att site adjustments. The file name depends on the size of the regions: 1-10000 bp [phpico], 10001-18000 bp [phmedio], and >18000 bp [phregions]

5) .tab file: a tab-delimited file containing (contig_id, size of the genome, G+C% content of the genome, 5' end of the phage region, 3' end of the phage region, size of region in bp, label (small, medium, large), region type (prophage, integrated element, degenerate), sequence of attR, sequence of attL, name of integration target, G+C% of region, 5' feat_name or locus name, 3' feat_name or locus name, # integrase HMM hits, # core_HMM hits, # above noise core_HMM hits, # lytic gene HMM hits, # tail HMM hits, # Mu HMM hits, orientation of the prophage based on orientation of the target or the position of the integrase, the distance from att site to integrase, and the number of genes in the region)

6) .1con file: a file in FASTA format containing the DNA sequence of the phage region

7) .seq file: a file in FASTA format containing the DNA sequence of each gene within the phage region

8) .pep file: a file in FASTA format containing the protein sequence of each gene within the phage region
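
For readers who want to inspect the FASTA output themselves, the Python sketch below reads a FASTA record, computes its G+C% content, and reports which of the three size-based file names (phpico, phmedio, phregions) a region of that length would fall under. All function names are hypothetical, the sequence is invented, G+C% is computed over unambiguous A/C/G/T bases only, and every region longer than 18000 bp is treated as large; these choices are assumptions and may differ from what Phage_Finder does internally.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {header: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def gc_percent(seq):
    """G+C% over A/C/G/T bases; ambiguous bases are ignored (assumption)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def size_class_file(size_bp):
    """Map a region size in bp to the size-based output file name."""
    if size_bp < 1:
        raise ValueError("region size must be at least 1 bp")
    if size_bp <= 10000:
        return "phpico"
    if size_bp <= 18000:
        return "phmedio"
    return "phregions"


# Invented sequence, for illustration only.
fasta_text = """>example_region
ATGCGGCC
ATAT
"""
seq = read_fasta(fasta_text.splitlines())["example_region"]
print(len(seq), gc_percent(seq), size_class_file(len(seq)))  # 12 50.0 phpico
```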

PHAGE_FINDER PIPELINE
----------------------------------------

To run the complete Phage_Finder pipeline automatically, you need to have the following:

1) each genome within a directory that is named by accession or contig_id (e.g., NC_000913) (see examples_dir)

2) protein sequences for each genome in their respective directories with extension (.pep or .faa)

3) phage_finder_info.txt or GenBank .ptt file in their respective directories

4) complete genome sequence (.con or .fna file) in their respective directories

Then run:

% findPhage.sh <file with list of accessions>
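
Before launching the pipeline on many genomes, it can help to confirm that each genome directory holds the inputs listed above. The Python sketch below is one possible pre-flight check, not part of the distribution: the missing_inputs function is a hypothetical helper, and it only looks at file names and extensions, not at file contents.

```python
import tempfile
from pathlib import Path


def missing_inputs(genome_dir):
    """Return descriptions of required pipeline inputs absent from genome_dir."""
    genome_dir = Path(genome_dir)
    if not genome_dir.is_dir():
        return ["genome directory"]
    files = [p for p in genome_dir.iterdir() if p.is_file()]
    names = {p.name for p in files}
    suffixes = {p.suffix for p in files}
    missing = []
    if not suffixes & {".pep", ".faa"}:
        missing.append("protein sequences (.pep or .faa)")
    if "phage_finder_info.txt" not in names and ".ptt" not in suffixes:
        missing.append("phage_finder_info.txt or .ptt file")
    if not suffixes & {".con", ".fna"}:
        missing.append("genome sequence (.con or .fna)")
    return missing


# Demonstration on a temporary directory that lacks the genome sequence.
with tempfile.TemporaryDirectory() as tmp:
    genome = Path(tmp) / "NC_000913"
    genome.mkdir()
    (genome / "NC_000913.faa").touch()
    (genome / "NC_000913.ptt").touch()
    demo_missing = missing_inputs(genome)

print(demo_missing)  # ['genome sequence (.con or .fna)']
```

An empty list means that all three kinds of input were found for that genome.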

EXAMPLE DATA
-------------------------

Sample data can be found in ~/phage_finder/examples_dir

The file phage_phinder_postprocess.out contains data parsed using PERL from each test genome's *.tab file generated by Phage_Finder.

Each column (numbered from 0, by PERL convention) in this file represents:

0) ORGANISM = The name of the organism tested

1) ACCESSION = The GenBank accession number of the test genome

2) SIZE = The size in base pairs of the test genome

3) GC% = The G+C% of the test genome

4) # = The number of the prophage region

5) pgc% = The G+C% of the prophage region

6) end5 = The 5' end of the prophage region

7) end3 = The 3' end of the prophage region

8) loci = The gene span of the prophage region (locus 5'-locus 3')

9) ori = The orientation of the prophage region

10) #phgbp = The size (in base pairs) of the prophage region

11) Ty = Region type (PRO=prophage, BAC=bacteriocin, DEG=degenerate, IEL=integrated element)

12) class = Region class (LRG=large, MED=medium, SML=small, SAT=satellite, RET=retron element, Mu=Mu-like, P2=P2-like, P4=P4-like)

13) att = Predicted attachment site found (Y/N)

14) target = Predicted target of insertion

15) In = The number of integrase HMM hits

16) Co = The number of core phage HMM hits above the trusted cutoff

17) NC = The number of core phage HMM hits above the noise cutoff

18) Ly = The number of phage lysis HMM hits

19) Ta = The number of phage tail HMM hits

20) Mu = The number of Mu-like phage HMM hits

21) dX = The distance from integrase to attachment site in base pairs

22) OC = The total number of open reading frames (ORFs) within region
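
The column list above can be turned into a small lookup so that fields are addressed by name rather than by index. The Python sketch below assumes that phage_phinder_postprocess.out is tab-delimited with exactly these 23 columns and no header line, which is not stated here; the parse_row function is a hypothetical helper and the sample row is invented.

```python
COLUMNS = [
    "ORGANISM", "ACCESSION", "SIZE", "GC%", "#", "pgc%", "end5", "end3",
    "loci", "ori", "#phgbp", "Ty", "class", "att", "target", "In", "Co",
    "NC", "Ly", "Ta", "Mu", "dX", "OC",
]


def parse_row(line, sep="\t"):
    """Map one line of the post-processed output to {column name: value}."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Invented row, for illustration only; real values and formats may differ.
sample = "\t".join([
    "Example organism", "XX_000000", "5000000", "50.0", "1", "48.5",
    "100000", "140000", "ORF0100-ORF0150", "+", "40001", "PRO", "LRG",
    "Y", "tRNA-Arg", "1", "5", "7", "2", "3", "0", "250", "51",
])
row = parse_row(sample)
print(row["ACCESSION"], row["Ty"], row["class"], row["OC"])  # XX_000000 PRO LRG 51
```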

You can also see the results from searching 302 complete bacterial genomes in the file ~/phage_finder/examples_dir/302_bac_genomes.txt

