Commands reference

From PileLine

Latest revision as of 12:18, 31 August 2011

Processing and Annotation Commands

  • pileline-fastseek

Prints a given range of a GP file.

Usage: pileline-fastseek -p <GP_file> -s <range> [--seq-col <int>] [--pos-col <int>]

Option                                  Description                            
------                                  -----------                            
-f, --regions-file <File>               File with seek positions in the form of seq:start[:end], one per line [required if no -s].
                                        Please note: if several regions are provided, they will not be merged, and the output will follow the order of the input intervals
-p, --gp-file <File>                    SORTED genome position file to seek [required]
--pos-col <Integer>                     position column for the gp-file. The first is 1 (default: 2)
-s                                      seek positions in the form of seq:start[:end] [required if no -f].
                                        Please note: if several regions are provided, they will not be merged, and the output will follow the order of the input intervals
--seq-col <Integer>                     sequence column for gp-file. The first is 1 (default: 1)

Example of use:

pileline-fastseek -p <GP_file> -s chr10:100:10000
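
The seq:start[:end] notation accepted by -s and -f can be illustrated with a small Python sketch. This is not PileLine's own parser: the names Region, parse_region and read_regions are hypothetical, and the sketch assumes that sequence names contain no ':' character. It keeps the regions in input order, as the note in the options table describes.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    seq: str
    start: int
    end: Optional[int]  # None when the optional ":end" part is omitted


def parse_region(text: str) -> Region:
    """Parse 'seq:start' or 'seq:start:end' into a Region."""
    parts = text.strip().split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected seq:start[:end], got {text!r}")
    seq = parts[0]
    if not seq:
        raise ValueError("empty sequence name")
    start = int(parts[1])
    end = int(parts[2]) if len(parts) == 3 else None
    if end is not None and end < start:
        raise ValueError("end is smaller than start")
    return Region(seq, start, end)


def read_regions(lines):
    """Parse one region per non-empty line, preserving the input order."""
    return [parse_region(line) for line in lines if line.strip()]
```

For example, parse_region("chr10:100:10000") yields the sequence name, start and end as separate fields, while a region without an end keeps end as None.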


  • pileline-sort

Sorts a GP file by position coordinate.

Usage: pileline-sort -i <GP_file> -o <outfile> [OPTIONS]

Option                                  Description                            
------                                  -----------                            
-T, --temp-dir                          Directory for temporary files [default is the system's temp dir]            
-i, --input-file                        Input file to sort. Use - for stdin [required]
--max-chars-chunk <Long>                max chars per temporary file (default: 2000000)
-o, --output-file                       Output sorted file. Use - for stdout [required]                           
--pos-col <Integer>                     position column in the input file. The first is 1 (default: 2)              
--seq-col <Integer>                     sequence column in the input file. The first is 1 (default: 1)  

Example of use:

pileline-sort -i <GP_file> -o <outfile>
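
Several commands on this page require SORTED input. The following Python sketch shows one way to check a file before passing it on. It is an illustration only: the function name is_sorted_gp is hypothetical, the file is assumed to be tab-delimited, and because this page does not state how PileLine orders the sequences themselves, the check only tests a minimal property (each sequence forms one contiguous block with non-decreasing positions).

```python
def is_sorted_gp(lines, seq_col=1, pos_col=2):
    """Return True if every sequence appears as one contiguous block
    whose positions never decrease. Columns are 1-based, as in PileLine."""
    seen = set()          # sequences whose block has already started
    current_seq = None
    last_pos = None
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        seq = fields[seq_col - 1]
        pos = int(fields[pos_col - 1])
        if seq != current_seq:
            if seq in seen:
                return False  # the sequence reappears after another one
            seen.add(seq)
            current_seq = seq
        elif pos < last_pos:
            return False      # position goes backwards inside a block
        last_pos = pos
    return True
```

If the check fails, running pileline-sort on the file first is the documented remedy.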


  • pileline-fastjoin

Joins two sorted GP files. Note: use pileline-sort first if your GP files are not yet sorted.

Usage: 
pileline-fastjoin.sh -a <left_file> -b <right_file> [--right-outer-join | --left-outer-join][--noprint-a | --noprint-b][--seq-col-a <int>][--pos-col-a <int>][--seq-col-b <int>][--pos-col-b <int>]

Option                                  Description                            
------                                  -----------                            
-a, --left-file <File>                 left tab-delimited AND SORTED genome position file [required]             
-b, --right-file <File>                right tab-delimited AND SORTED genome position file [required]             
--left-outer-join                      performs a left outer join: all A records will be in the output, missing B records are shown as NULL
--noprint-a                            prints only data fields of A           
--noprint-b                            prints only data fields of B           
--pos-col-a <Integer>                  position column for the left file. The first is 1 (default: 2)              
--pos-col-b <Integer>                  position column for the right file. The first is 1 (default: 2)          
--right-outer-join                     performs a right outer join: all B records will be in the output, missing A records are shown as NULL
--seq-col-a <Integer>                  sequence column for the left file. The first is 1 (default: 1)              
--seq-col-b <Integer>                  sequence column for the right file. The first is 1 (default: 1)

Example of use:

pileline-fastjoin -a <GP_file> -b <GP_file>           
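
The difference between the join modes can be sketched in Python. This is a conceptual illustration, not PileLine's implementation: the function join_gp is hypothetical, it assumes that the default mode (no outer-join flag) is an inner join and that each file has at most one line per position, and it uses in-memory dictionaries instead of streaming over the sorted files.

```python
def join_gp(a_rows, b_rows, how="inner"):
    """Join two lists of (seq, pos, data) rows on (seq, pos).

    how is 'inner', 'left' or 'right'. A missing side is reported
    as the string 'NULL', mirroring the option descriptions above.
    """
    if how not in ("inner", "left", "right"):
        raise ValueError(f"unknown join mode: {how!r}")
    a_index = {(seq, pos): data for seq, pos, data in a_rows}
    b_index = {(seq, pos): data for seq, pos, data in b_rows}
    # The driving side decides which positions can appear in the output.
    driver = b_rows if how == "right" else a_rows
    result = []
    for seq, pos, _ in driver:
        key = (seq, pos)
        if how == "inner" and not (key in a_index and key in b_index):
            continue
        result.append((seq, pos, a_index.get(key, "NULL"), b_index.get(key, "NULL")))
    return result
```

With A holding chr1:10 and chr1:20 and B holding chr1:20 and chr2:5, the inner join keeps only chr1:20, the left outer join adds chr1:10 with NULL for B, and the right outer join adds chr2:5 with NULL for A.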


  • pileline-fulljoin

Merges two or more GP files, printing, for each genome position, the corresponding line of each input file (if any).

Usage:
pileline-fulljoin -i <GP_file> -i <GP_file2> [-i <GP_file3> ...] [--seq-col <int>] [--pos-col <int>]

Option                                  Description                            
------                                  -----------                            
-i, --gp-file <File>                    SORTED genome position files to full   
                                         join [required 2 or more]            
--pos-col <Integer>                     position column for all gp-files. The  
                                         first column is 1 (default: 2)       
--pos-cols                              comma-separated position columns for each input gp-    
                                         file (--pos-col will be ignored).    
                                         The first column is 1                
--seq-col <Integer>                     sequence column for all gp-files. The  
                                         first column is 1 (default: 1)       
--seq-cols                              comma-separated sequence columns for each input gp-    
                                         file (--seq-col will be ignored).    
                                         The first column is 1  

Example of use:

pileline-fulljoin -i <GP_file1> -i <GP_file2> -i <GP_file3>
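
A minimal Python sketch of the full-join idea described above: every position present in at least one input appears once, together with the matching line of each file, or an empty slot when that file has none. The function full_join is hypothetical, the output layout is not PileLine's, and ordering the sequences alphabetically is an assumption made only to get a deterministic result.

```python
def full_join(*files):
    """Full join of any number of inputs.

    Each input is a list of (seq, pos, line) tuples. The result holds one
    entry per distinct (seq, pos), with a list that contains the line of
    each input at that position, or None when the input has no such line.
    """
    indexes = [{(seq, pos): line for seq, pos, line in rows} for rows in files]
    keys = sorted(set().union(*indexes))  # all positions seen in any input
    return [(seq, pos, [index.get((seq, pos)) for index in indexes])
            for seq, pos in keys]
```

Unlike pileline-fastjoin, no input is privileged here: a position found in only one of the files still produces an output entry.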


  • pileline-rfilter.sh

Filters (or annotates) a positional file with range-based annotations (in bed, gff or custom formats). Each position that is inside of a specific range is annotated.

Usage: 
pileline-rfilter [--annotate] [-v] -A <GP_file> [-b <bed> | -g <gff> | -i <intervals_file>] [-w <int>] [--seq-col-input <int>] [--pos-col-input <int>] 
                 [--seq-col-intervals <int>] [--start-col-intervals <int>] [--end-col-intervals <int>]

Option                                  Description                            
------                                  -----------                            
-A, --input-file                        SORTED genome position file. Use - for stdin. [required] Positions are considered 1-based
--annotate                              Do not filter. Annotate the lines with the ranges (last column)
-b, --intervals-bed-file <File>         intervals file in BED format. [required -b or -g] Intervals are taken as 0-based and the end-position is exclusive
--end-col-intervals <Integer>           end position column in the intervals file. The first is 1 (default: 3)
-g, --intervals-gff-file <File>         intervals file in GFF format. [required -b or -g] Intervals are taken as 1-based and the end-position is inclusive
-i, --intervals-gp-file <File>          intervals file in any other format     
--pos-col-input <Integer>               position column in the input file. The first is 1 (default: 2)
--seq-col-input <Integer>               sequence column in the input file. The first is 1 (default: 1)
--seq-col-intervals <Integer>           sequence column in the intervals file. The first is 1 (default: 1)
--start-col-intervals <Integer>         start position column in the intervals file. The first is 1 (default: 2)
-v, --inverse                           inverse filtering, that is, output lines that are OUTSIDE of the provided intervals
-w, --window <Integer>                  expand each interval with <window> size at both sides (default: 0)
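
The coordinate conventions in the table (1-based input positions, 0-based end-exclusive BED intervals, 1-based end-inclusive GFF intervals) are easy to mix up, so here is a small Python sketch of the containment test they imply. The function names are hypothetical, and the handling of the window (widening each interval by that many bases on both sides) is an assumption about what -w does, based only on its description.

```python
def bed_contains(pos, start, end, window=0):
    """True if the 1-based position lies inside a BED interval.

    BED is 0-based with an exclusive end, so the interval [start, end)
    covers the 1-based positions start+1 .. end.
    """
    return (start + 1 - window) <= pos <= (end + window)


def gff_contains(pos, start, end, window=0):
    """True if the 1-based position lies inside a GFF interval.

    GFF is 1-based with an inclusive end, so the interval covers
    the positions start .. end directly.
    """
    return (start - window) <= pos <= (end + window)
```

For instance, the single base at 1-based position 100 is written as 99-100 in BED and as 100-100 in GFF, and both functions accept position 100 for those intervals.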

Examples of use:

#on target filtering
pileline-rfilter -A <GP_file.txt> -i <targets.bed>
#simple annotation 
pileline-rfilter --annotate -A <GP_file.txt> -i <annotations.bed>
#multiple annotation (combining UNIX commands)
cat <GP_file.txt> | pileline-rfilter.sh --annotate -A - -i <annotations1.bed> | pileline-rfilter.sh --annotate -A - -i <annotations2.bed> > <myfullyannotated_GP_file.txt> 

Please note: If you are experiencing memory issues, please provide the intervals file in a compressed and indexed form with tabix. (Please see: Compress/index input with bgzip+tabix)


  • pileline-genindex.sh

Indexes a FASTA genome and then performs range-based queries on that genome.

Usage: pileline-genindex [OPTIONS]

Option                                  Description                            
------                                  -----------                            
-g, --genome-file <File>                genome file to index in one unique fasta (on index mode) [required in --index]                               
-i, --index-file <File>                 index file to create (on index mode) or to access (on seek mode) [required]                           
--index                                 Index mode                             
-s                                      Seek position in the form of seq:start[:end] [required]             
--seek                                  Seek mode [default if no --index]

Examples of use:

pileline-genindex --index -g <fasta> -i <new_index>
pileline-genindex --seek -i <index> -s chr1:1000:2000


  • pileline-pileup2sift.sh

Generates a SIFT-compatible change column for each variant line in pileup files.

Usage: pileline-pileup2sift -i <pileup>

Option                                  Description                            
------                                  -----------                            
-i, --pileup-file                       variants pileup (pileup -c) file to annotate. Use - for stdin.

Example of use:

pileline-pileup2sift -i <pileup_file> 


  • pileline-pileup2polyphen.sh

Generates a PolyPhen-compatible change column for each variant line in pileup files.

Usage: pileline-pileup2polyphen -i <pileup>

Option                                  Description                            
------                                  -----------                            
-i, --pileup-file                       variants pileup (pileup -c) file to annotate. Use - for stdin.

Example of use:

pileline-pileup2polyphen -i <pileup_file>


  • pileline-pileup2firestar.sh

Generates a Firestar-compatible input for each variant line in pileup files.

Usage: pileline-pileup2firestar -i <pileup>

Option                                  Description                            
------                                  -----------                            
-i, --pileup-file                       variants pileup (pileup -c) file to annotate. Use - for stdin.

Example of use:

pileline-pileup2firestar -i <pileup_file>

Analysis Commands

  • pileline-2smc

Looks for discrepancies in the genotypes of two samples (e.g., case vs. control) in pileup format files. It can also annotate each output position with a user-provided BED file containing custom annotations. The INPUT FILES MUST BE SORTED.

Usage: pileline-2smc -a <pileup> -b <pileup> --variants-a <pileup> --variants-b <pileup> [OPTIONS]

Option                                  Description                            
------                                  -----------                            
--AdiscrepantB                          Calculate variants present in sample A (-v) and in sample B (-w), but with different genotype
-a, --genotype-a <File>                 Whole genotype pileup (with MAQ consensus) file of sample A [required]
--all                                   Calculate all mutations (onlyA, onlyB, AdiscrepantB and both) [default]
--annotate <File>                       Annotate positions with those of the provided BED file
-b, --genotype-b <File>                 Whole genotype pileup (with MAQ consensus) file of sample B [required]
--both                                  Calculate variants present in sample A (-v) and in sample B (-w) with the same genotype
--cq-column <Integer>                   MAQ consensus quality column in variants and genotype files (default: 5)
-d, --genotype-depth-filter-threshold <Integer>   genotype depth filter threshold (default: 10)
-o, --out-prefix <File>                 Output files prefix [required]
--onlyA                                 Calculate mutations which are variants in sample A (-v) and are homozygous-reference in B
--onlyB                                 Calculate mutations which are variants in sample B (-w) and are homozygous-reference in A
-r, --reference-column <Integer>        reference genotype column in genotype files (options -a and -b) (default: 3)
-t, --genotype-depth-filter-column <Integer>      genotype depth column (default: 8)
-v, --variants-a <File>                 Variants of interest in pileup format (with MAQ consensus) of sample A [required]
-w, --variants-b <File>                 Variants of interest in pileup format (with MAQ consensus) of sample B [required]

Example of use:

pileline-2smc -a <pileup> -b <pileup> --variants-a <pileup> --variants-b <pileup> --annotate <bed> -d 30
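
The four categories named in the options (onlyA, onlyB, AdiscrepantB and both) can be summarised with a short Python sketch. It is a simplification built only from the option descriptions above: the function classify is hypothetical, genotypes are assumed to be given as two-letter allele pairs, and the depth and consensus-quality filters that the real command applies are ignored.

```python
def classify(ref, genotype_a, genotype_b):
    """Classify one position from the reference base and two genotypes.

    Genotypes are two-letter allele pairs such as 'AG'; allele order is
    ignored. Returns 'onlyA', 'onlyB', 'both', 'AdiscrepantB', or None
    when neither sample differs from the homozygous reference.
    """
    hom_ref = ref.upper() * 2
    a = "".join(sorted(genotype_a.upper()))
    b = "".join(sorted(genotype_b.upper()))
    a_is_variant = a != hom_ref
    b_is_variant = b != hom_ref
    if a_is_variant and not b_is_variant:
        return "onlyA"
    if b_is_variant and not a_is_variant:
        return "onlyB"
    if a_is_variant and b_is_variant:
        return "both" if a == b else "AdiscrepantB"
    return None
```

With reference base A, the pair AG/AA falls into onlyA, AG/GA into both, and AG/GG into AdiscrepantB.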


  • pileline-nsmc

Takes the output of several 2smc comparison commands and reports where variants are reproduced. It can operate in two modes: by exact position or by intervals (e.g., genes). For intervals mode, you have to provide an additional intervals file (.bed, .gff or custom). The INPUT FILES MUST BE SORTED.

Usage (by position): pileline-nsmc -a <GP_file> -a <GP_file> -a <GP_file>... -b <GP_file> -b <GP_file> -b <GP_file>... [OPTIONS] -o <OUTFILE>
Usage (by intervals): pileline-nsmc [-B <bed_file> | -G <gff_file>] -a <GP_file> -a <GP_file> -a <GP_file>... -b <GP_file> -b <GP_file> -b <GP_file>... [OPTIONS] -o <OUTFILE>

Option                                  Description                            
------                                  -----------                            
-B, --intervals-bed-file <File>         intervals file in BED format.          
                                         Intervals are taken as 0-based and   
                                         the end-position is exclusive        
-G, --intervals-gff-file <File>         intervals file in GFF format.          
                                         Intervals are taken as 1-based and   
                                         the end-position is inclusive        
-I, --intervals-gp-file <File>          intervals file in custom format. You   
                                         can provide the columns for          
                                         sequence, start and stop in the      
                                         appropriate parameters               
-a                                      variant pileup (pileup -c) files for   
                                         sample A (one or more, e.g. -a file1 
                                         -a file2 -a file3...) [required]     
--allele-col <Integer>                  allele (variant) column in sample      
                                         files. The first is 1 (default: 4)   
-b                                      variant pileup (pileup -c) files for   
                                         sample B (one or more, e.g. -b file4 
                                         -b file5 -b file6...) [required]     
-c, --expand-cells-col <Integer>        When using -e, fill the cell with the  
                                         info of the specified column.        
                                         (default: 0)                         
-e, --expand-cells                      In exact position mode: fill each cell 
                                         in the output with the corresponding 
                                         pileup line if it exists, separated  
                                         by '|' (by default, YES or NO appears 
                                         in the cell). In intervals mode:     
                                         show how many entries in the         
                                         variants file are within the         
                                         interval (this is not taken into     
                                         account in the Fisher test)          
--end-col-intervals <Integer>           end position column in the intervals   
                                         file. The first is 1 (default: 3)    
-o                                      output file. Use - for stdout          
                                         [required]
--ref-col <Integer>                     reference genome column in sample      
                                         files. The first is 1 (default: 3)   
--seq-col-intervals <Integer>           sequence column in the intervals file. 
                                         The first is 1 (default: 1)                             
--start-col-intervals <Integer>         start position column in the intervals 
                                         file. The first is 1 (default: 2)    

Examples of use:

pileline-nsmc -a <GP_file> -a <GP_file> -b <GP_file> -b <GP_file> -o <OUTFILE>
pileline-nsmc -G <gff_file> -a <GP_file> -a <GP_file> -a <GP_file> -o <OUTFILE>


  • pileline-genotest.sh

Calculates the NGS performance on genotyping, surveying a set of genomic positions whose genotype is known in the sample.

Usage: 
pileline-genotest --create-genotest-file <new_genotest> -p <pileup> -g <gold> -r <reference>
pileline-genotest -a <new_genotest> -t <int> [--print-help-table] [--depth-filter <int>]
pileline-genotest -a <new_genotest> --roc
pileline-genotest -a <new_genotest> --batch-t 0,255,1

Option                                  Description                            
------                                  -----------                            
-a, --genotest-file <File>              the genotest intermediate file to analyze [required if no -c]          
--batch-t                               A sequence of thresholds to test, specified as <start>,<end>,<step> [required -t or --roc or --batch-t]  
-c, --create-genotest-file <File>       creates the genotest intermediate file for further analysis                 
--depth-filter <Integer>                treat positions below this depth as having no base called (default: 0)
-g, --gold-genotype <File>              gold genotype file (chr<tab>pos<tab>genotype; the genotype is two letters, including NN) [required if -c]
-p, --pileup <File>                     complete pileup [required if -c]
--print-help-table                      print measures help table
-r, --ref-genome <File>                 index of the reference genome (created with pileline-genindex) [required if -c]
--roc                                   output ROC values [required -t or --roc or --batch-t]
--simple-output                         print only the performance measures in a single line. Useful to include in scripts                              
-t, --threshold <Double>                SNPq threshold to report variant [required -t or --roc or --batch-t] (default: 1.0)

Example of use:

# Warning: Check that your alleles in the <gold_genotype.sorted> file are expressed in the same strand as the 
#          reference genome sequence used in your NGS experiment. Typically forward (+) strand. 

## Step1.

#Create reference index <ref_genome.pileline> using pileline-genindex command.
pileline-genindex --index -i  <ref_genome.pileline> -g <ref_genome.fa>

## Step2.
#Create genotest file (required).
pileline-genotest --create-genotest-file <experiment.genotest> -p <GP_file.txt> -g <gold_genotype.sorted> -r <ref_genome.pileline>

## Step3. QC analysis.

#Generate a metrics table of performance at a given threshold.
pileline-genotest -a <experiment.genotest> -t <snpq_threshold>

#Generate all performance metrics for several thresholds
pileline-genotest -a <experiment.genotest> --batch-t 0,255,1

#Generate values for a ROC curve plot (output compatible with the ROCR R package)
pileline-genotest -a <experiment.genotest> --roc
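
The <start>,<end>,<step> notation of --batch-t can be pictured with the Python sketch below, which expands a specification such as 0,255,1 into the list of thresholds it describes. The function expand_thresholds is hypothetical, and treating the end value as inclusive is an assumption, since this page does not say whether the last threshold is tested.

```python
def expand_thresholds(spec):
    """Expand '<start>,<end>,<step>' into a list of thresholds.

    Assumption: the end value is included when it falls exactly on a step.
    Thresholds are computed as start + i * step to avoid accumulating
    floating-point error from repeated addition.
    """
    start, end, step = (float(part) for part in spec.split(","))
    if step <= 0:
        raise ValueError("step must be positive")
    count = int((end - start) / step + 1e-9) + 1
    return [start + i * step for i in range(count)]
```

Under that assumption, 0,255,1 expands to 256 thresholds, from 0 to 255.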