Main Page

From PileLine

(Difference between revisions)
(What's new)
 
(109 intermediate revisions not shown)
Line 1: Line 1:
 +
= What's new =
 +
== PileLine GUI 1.3 has been released on July 29, 2011 ==
 +
The major changes in this version are:
 +
* Compatible with the latest pileline tools (version 1.2).
 +
* GP Files viewer: using JXTable instead of JTable to show data rows and added a new functionality to set the number of rows to retrieve by the seeker.
 +
* Genome Browser:
 +
** Added support to 2smc output Pileup files.
 +
** File loader.
 +
** Using double-buffer for each track.
 +
** Allow the user to center the sequence in a given position.
 +
* Several bug fixes in genome browser.
 +
* Bugfix when joining files with rightouter/leftouter options.
 +
 +
 +
== PileLine 1.2 has been released on July 29, 2011 ==
 +
The major changes in this version are:
 +
* pileline-nsmc
 +
** Added new parameters: -c (column to include when using -e), and --ref-col and --allele-col for the reference genome and alternative (allele) columns.
 +
* pileline-fulljoin:
 +
** Support for different seq-col and pos-col for each input file via --seq-cols and --pos-cols
 +
* pileline-fastjoin:
 +
** Bugfix. NullPointerException when using custom columns and the sequence or position are the last column.
 +
** Added new behaviour: The data of each joined file now corresponds to all columns but those of sequence and position.
 +
* pileline-fastseek:
 +
** Bugfix. Hang when using a custom --seq-col
 +
* pileline-2smc, pileline-nsmc:
 +
** Added column names in output files as a comment.
 +
* core:
 +
** Bugfix in tabix-based intervals index. The exception "Cannot getOverlappingIntervals with an iteration in progress" was thrown when requesting overlapping intervals and reading only the first one. This also caused a unit test to fail.
 +
 +
 +
== PileLine 1.1 has been released on Feb 11, 2011 ==
 +
The major changes in this version are:
 +
* New pileline-fulljoin command to join n GP files.
 +
* New pileline-nsmc mode: intervals mode. It can now check the reproducibility of mutations over entire intervals (e.g., genes, provided as .bed) instead of only at exact genome positions.
 +
* Added BGZF compatibility. This is the samtools compressed format of GP files. All PileLine commands now accept either bgz-compressed or uncompressed files transparently. To learn how to compress your GP files, see: [[Compress/index input with bgzip+tabix]]
 +
* Added .fai compatibility for genome indexes (needed in pileline-genotest). The pileline-genindex command is now deprecated. You should use the "samtools faidx" command instead.
 +
* Added more flexibility to include custom GP files, by indicating the sequence, start and stop columns in files.
 +
* Bugfix: BED files are now treated in the standard form (0-based, with the last position of the interval excluded).
 +
* Bugfix: Prevent fastseek from hanging on bad input files (e.g., zip-compressed, non-ASCII, or non-tab-separated). A small heuristic check now runs before the file is processed.
 +
 +
'''Please note: we are updating the commands help wiki pages'''
 +
= Welcome to PileLine Wiki =
= Welcome to PileLine Wiki =
-
'''PileLine''' (Pileup pipeLine) is  a flexible command-line toolkit for efficient handling, filtering, and comparison of locus text files produced by next-generation sequencing experiments (i.e. [http://samtools.sourceforge.net/pileup.shtml pileup] files from [http://samtools.sourceforge.net SAMtools]).  
+
'''PileLine''' is a flexible command-line toolkit for efficient handling, filtering, and comparison of genomic position (GP) files produced by next-generation sequencing experiments (e.g., [http://samtools.sourceforge.net/pileup.shtml pileup], [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED], [http://genome.ucsc.edu/FAQ/FAQformat#format3 GFF], or [http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 VCF] files).
-
'''PileLine''' is designed to be memory efficient by performing on-disk operations over sorted locus files directly.  
+
'''PileLine''' is designed to be memory efficient by performing on-disk operations over sorted GP files directly.  
-
'''PileLine''' is available for downloading at: [http://sourceforge.net/projects/pileline http://sourceforge.net/projects/pileline]
+
'''PileLine''' is available for downloading at: [http://sourceforge.net/projects/pilelinetools/ http://sourceforge.net/projects/pilelinetools/]
 +
[[#PileLine_GUI|PileLine GUI]] includes a front-end of the PileLine toolkit, plus a genome browser.
==Main Features==
==Main Features==
-
# Filtering and comparison of locus text files.
+
[[File:Glez_Peña_et_al_Figure2.png|right|thumb|PileLine commands and accepted input files.]]
-
# Full annotation of locus files with human [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP], [http://www.genenames.org/ HGNC Gene Symbol] and [http://www.ensembl.org/ Ensembl IDs]. Custom annotations are also allowed and may be supplied through standard [http://genome.ucsc.edu/FAQ/FAQformat#format1 .BED] or [http://genome.ucsc.edu/FAQ/FAQformat#format3 .GFF] files.  
+
 
-
# [http://sift.jcvi.org/ SIFT] and [http://genetics.bwh.harvard.edu/pph2/ PolyPhen-2] compatible outputs to facilitate the biological interpretation of huge lists of variants.  
+
# Quick filtering and search within GP files without indexing steps.
 +
# GP files comparisons.
 +
# Full annotation of GP files with human [http://www.ncbi.nlm.nih.gov/projects/SNP/ dbSNP], [http://www.genenames.org/ HGNC Gene Symbol] and [http://www.ensembl.org/ Ensembl IDs]. Custom annotations are also allowed and may be supplied through standard [http://genome.ucsc.edu/FAQ/FAQformat#format1 .BED] or [http://genome.ucsc.edu/FAQ/FAQformat#format3 .GFF] files.  
 +
# [http://sift.jcvi.org/ SIFT], [http://genetics.bwh.harvard.edu/pph2/ PolyPhen-2] and [http://ubio.bioinfo.cnio.es/inb/firestar/firestar_batch/firestar.html Firestar] compatible inputs to facilitate the biological interpretation of huge lists of variants.  
# Genotyping quality control functionality to estimate performance metrics [http://www.ncbi.nlm.nih.gov/pubmed/19327155 (Harismendi et al. 2009)] on detecting homozygous/heterozygous variants against a given gold standard genotype.
# Genotyping quality control functionality to estimate performance metrics [http://www.ncbi.nlm.nih.gov/pubmed/19327155 (Harismendi et al. 2009)] on detecting homozygous/heterozygous variants against a given gold standard genotype.
 +
# Modular design to facilitate the inclusion of new functionalities.
 +
 +
==Getting started==
 +
New to PileLine?
 +
Please follow our [[Quick_Start | Quick Start]].
==PileLine Commands==
==PileLine Commands==
-
===Processing Commands===
+
===Processing and Annotation Commands===
-
*'''''pileline-fastseek.sh'''''
+
*'''''pileline-fastseek'''''
-
Prints a given range of a locus file.
+
Prints a given range of a GP file.
 +
 
 +
*'''''pileline-sort'''''
 +
Sorts GP files by coordinate.
-
*'''''pileline-fastsjoin.sh'''''
+
*'''''pileline-fastjoin'''''
-
Joins two positional files.
+
Joins two SORTED GP files.
-
*'''''pileline-rfilter.sh'''''
+
*'''''pileline-rfilter'''''
Filters (or annotates) a positional file with range-based annotations (in bed format). Each position that is inside of a specific range is annotated.
Filters (or annotates) a positional file with range-based annotations (in bed format). Each position that is inside of a specific range is annotated.
-
*'''''pileline-genindex.sh'''''
+
*'''''pileline-pileup2sift'''''
-
Indexes fasta genome and then can perform range based queries in that genome.
+
Generates SIFT compatible infiles from pileup files.
 +
 
 +
*'''''pileline-pileup2polyphen'''''
 +
Generates PolyPhen-2 compatible infiles from pileup files.
 +
 
 +
*'''''pileline-pileup2firestar'''''
 +
Generates Firestar compatible infiles from GP files.
===Analysis Commands===
===Analysis Commands===
-
*'''''pileline-2smc.sh'''''
+
*'''''pileline-2smc'''''
-
Looks for discrepancies in genotypes of two samples (i.e.: case vs control). It also can annotate each output position with a given positional file containing custom annotations (i.e. dbSNP). Also produces a SIFT and PolyPhen-2 compatible outfiles.
+
Looks for discrepancies in genotypes of two samples (e.g., case vs. control). It can also annotate each output position with a user-provided BED file containing custom annotations.
-
*'''''pileline-nsmc.sh'''''
+
*'''''pileline-nsmc'''''
-
Takes the output of several 2smc comparisons commands to reports where variants are reproduced.
+
Compares n samples and reports consistent variants.
-
*'''''pileline-genotest.sh'''''
+
*'''''pileline-genotest'''''
-
Calculates the NGS performance on genotyping, surveying a set of genomic positions whose genotype is known in the sample.
+
Calculates the NGS performance on genotyping, surveying a set of genomic positions whose genotype is known in the sample. This functionality requires a previous step using the '''''pileline-genindex''''' command for genome indexing.
==Use Cases==
==Use Cases==
-
[[File:Figure_paper_Final.png|right|thumb|PileLine coupled to SAMtools facilitating pileup handling.]]
+
[[File:Figure_paper_Final.png|right|thumb|PileLine coupled to [http://samtools.sourceforge.net SAMtools] facilitating pileup handling. NS: non-synonymous]]
-
*'''Annotate a locus file with dbSNP.'''
+
*'''Perform 2 samples comparison'''
-
pileline-fastjoin.sh –a <locus_file.txt> -b dbSNP130.txt --left-outer-join
+
-
 
+
-
*'''Annotate a locus file with genes.'''
+
-
pileline-rfilter.sh --annotate –A <locus_file.txt> –b <genes.bed>
+
-
 
+
-
*'''Filter pileup to exon loci.'''
+
-
pileline-rfilter.sh –A <locus_file.txt> –b <exons.bed>
+
-
 
+
-
*'''Perform 2 samples comparison.'''
+
  pileline-2smc.sh  
  pileline-2smc.sh  
-
  –a <locusfile_A.txt> –b <locusfile_B.txt>
+
  -a <file_A.pileup> -b <file_B.pileup>
-
  –v <variants_locusfile_A.txt> –w <variants__locusfile_B.txt>  
+
  -v <variants_file_A.pileup> -w <variants_file_B.pileup>
  -o <out.txt> -d <min_depth>
  -o <out.txt> -d <min_depth>
-
*'''Perform n samples comparison.'''
+
*'''Perform n samples comparison'''
  pileline-nsmc.sh
  pileline-nsmc.sh
-
  --a-samples<locusfile_a1>,<locusfile_a2>,<locusfile_a3>  
+
  --a-samples <GPfile_a1>,<GPfile_a2>,<GPfile_a3>
-
  --b-samples <locusfile_b1>,<locusfile_b2>,<locusfile_b3>
+
  --b-samples <GPfile_b1>,<GPfile_b2>,<GPfile_b3>
 +
 
 +
*'''Sort GP files'''
 +
pileline-sort.sh -i <input_GP_file.txt> -o <outfile.sorted.txt>
 +
 
 +
*'''Annotate a GP file with dbSNP'''
 +
pileline-fastjoin.sh -a <GP_file.txt> -b dbSNP130.txt --left-outer-join
 +
 
 +
*'''Annotate a GP file with genes'''
 +
pileline-rfilter.sh --annotate -A <GP_file.txt> -b <genes.bed>
 +
 
 +
*'''Filter pileup to exon loci'''
 +
pileline-rfilter.sh -A <GP_file.txt> -b <exons.bed>
 +
 
 +
*'''Generate SIFT-compatible input'''
 +
pileline-pileup2sift.sh -i <file.pileup>
 +
 
 +
*'''Perform a genotyping test for quality control'''
 +
[[File:Genotest_output_table.png|right|thumb|Genotest metrics table description. It may be obtained by using --print-help-table argument.]]
 +
# ''Warning: Check that your alleles in the <gold_genotype.sorted> file are expressed in the same strand as the''
 +
#          ''reference genome sequence used in your NGS experiment. Typically forward (+) strand.''
 +
 +
## Step1.
 +
 +
#Create reference index <ref_genome.pileline> using pileline-genindex command.
 +
pileline-genindex --index -i  <ref_genome.pileline> -g <ref_genome.fa>
 +
 +
## Step2.
 +
#Create genotest file (required).
 +
pileline-genotest --create-genotest-file <experiment.genotest> -p <GP_file.txt> -g <gold_genotype.sorted> -r <ref_genome.pileline>
 +
 +
## Step3. QC analysis.
 +
  #Generate all performance metrics for several thresholds
 +
pileline-genotest -a <experiment.genotest> --batch-t 0,255,1
 +
 +
#Generate values for ROC curve plot (outfile compatible with the ROCR R package)
 +
pileline-genotest -a <experiment.genotest> --roc
 +
 +
#Generate a metrics table of performance at a given threshold.
 +
pileline-genotest -a <experiment.genotest> -t <snpq_threshold>
 +
 
 +
==PileLine GUI==
 +
 
 +
 
 +
PileLine GUI is a front-end of the PileLine toolkit, plus a '''genome browser'''. With this intuitive graphical desktop application you can run the following tasks:
 +
# Processing commands of GP files, like seek, join, annotate and filtering.
 +
# Perform 2-samples and n-samples point somatic mutation calling (via the PileLine 2smc and nsmc commands).
 +
# Browse GP files in an interactive local genome browser.
 +
 
 +
You can download PileLine GUI from [[Downloads]].
-
*'''Perform a genotyping test for quality control.'''
+
{|
-
pileline-genotest –p <locus_file.txt> –g <gold_genotype> -r <ref_genome.pileline> -t <snpq_treshold>
+
|[[File:pileline_gui_scheme.png|thumb|General scheme of the PileLine GUI software.]]
 +
|[[File:genome_browser.png|thumb|PileLine GUI's interactive genome browser.]]
 +
|[[File:gpfiles_view.png|thumb|PileLine GUI showing an instantly navigable .pileup file.]]
 +
|}

Latest revision as of 10:05, 29 July 2011

What's new

PileLine GUI 1.3 has been released on July 29, 2011

The major changes in this version are:

  • Compatible with the latest pileline tools (version 1.2).
  • GP Files viewer: using JXTable instead of JTable to show data rows and added a new functionality to set the number of rows to retrieve by the seeker.
  • Genome Browser:
    • Added support to 2smc output Pileup files.
    • File loader.
    • Using double-buffer for each track.
    • Allow the user to center the sequence in a given position.
  • Several bug fixes in genome browser.
  • Bugfix when joining files with rightouter/leftouter options.


PileLine 1.2 has been released on July 29, 2011

The major changes in this version are:

  • pileline-nsmc
    • Added new parameters: -c (column to include when using -e), and --ref-col and --allele-col for the reference genome and alternative (allele) columns.
  • pileline-fulljoin:
    • Support for different seq-col and pos-col for each input file via --seq-cols and --pos-cols
  • pileline-fastjoin:
    • Bugfix. NullPointerException when using custom columns and the sequence or position are the last column.
    • Added new behaviour: The data of each joined file now corresponds to all columns but those of sequence and position.
  • pileline-fastseek:
    • Bugfix. Hang when using a custom --seq-col
  • pileline-2smc, pileline-nsmc:
    • Added column names in output files as a comment.
  • core:
    • Bugfix in tabix-based intervals index. The exception "Cannot getOverlappingIntervals with an iteration in progress" was thrown when requesting overlapping intervals and reading only the first one. This also caused a unit test to fail.


PileLine 1.1 has been released on Feb 11, 2011

The major changes in this version are:

  • New pileline-fulljoin command to join n GP files.
  • New pileline-nsmc mode: intervals mode. It can now check the reproducibility of mutations over entire intervals (e.g., genes, provided as .bed) instead of only at exact genome positions.
  • Added BGZF compatibility. This is the samtools compressed format of GP files. All PileLine commands now accept either bgz-compressed or uncompressed files transparently. To learn how to compress your GP files, see: Compress/index input with bgzip+tabix
  • Added .fai compatibility for genome indexes (needed in pileline-genotest). The pileline-genindex command is now deprecated. You should use the "samtools faidx" command instead.
  • Added more flexibility to include custom GP files, by indicating the sequence, start and stop columns in files.
  • Bugfix: BED files are now treated in the standard form (0-based, with the last position of the interval excluded).
  • Bugfix: Prevent fastseek from hanging on bad input files (e.g., zip-compressed, non-ASCII, or non-tab-separated). A small heuristic check now runs before the file is processed.
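
The BED convention mentioned in the list above is easy to get wrong by one position. The following is a minimal Python sketch of the coordinate check, not PileLine's own code. It assumes GP positions are 1-based (as in pileup files) and BED intervals are 0-based with an exclusive end; the function name `position_in_bed_interval` is hypothetical.

```python
def position_in_bed_interval(pos_1based: int, bed_start: int, bed_end: int) -> bool:
    """Return True if a 1-based position lies inside a BED interval.

    BED intervals are 0-based and half-open: [bed_start, bed_end).
    """
    pos_0based = pos_1based - 1
    return bed_start <= pos_0based < bed_end


# The BED interval "chr1  100  200" covers 1-based positions 101..200.
print(position_in_bed_interval(100, 100, 200))  # False
print(position_in_bed_interval(101, 100, 200))  # True
print(position_in_bed_interval(200, 100, 200))  # True
print(position_in_bed_interval(201, 100, 200))  # False
```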

Please note: we are updating the commands help wiki pages

Welcome to PileLine Wiki

PileLine is a flexible command-line toolkit for efficient handling, filtering, and comparison of genomic position (GP) files produced by next-generation sequencing experiments (e.g., pileup, BED, GFF, or VCF files). PileLine is designed to be memory efficient by performing on-disk operations over sorted GP files directly.

PileLine is available for downloading at: http://sourceforge.net/projects/pilelinetools/

PileLine GUI includes a front-end of the PileLine toolkit, plus a genome browser.

Main Features

PileLine commands and accepted input files.
  1. Quick filtering and search within GP files without indexing steps.
  2. GP files comparisons.
  3. Full annotation of GP files with human dbSNP, HGNC Gene Symbol and Ensembl IDs. Custom annotations are also allowed and may be supplied through standard .BED or .GFF files.
  4. SIFT, PolyPhen-2 and Firestar compatible inputs to facilitate the biological interpretation of huge lists of variants.
  5. Genotyping quality control functionality to estimate performance metrics (Harismendi et al. 2009) on detecting homozygous/heterozygous variants against a given gold standard genotype.
  6. Modular design to facilitate the inclusion of new functionalities.
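
To illustrate how a sorted GP file can be searched on disk without a separate index, here is a minimal Python sketch of a binary search over byte offsets. It is not PileLine's implementation, which this page does not describe. The sketch assumes a tab-separated file holding a single sequence, sorted by ascending integer position in the second column; the function names are hypothetical.

```python
import os
import tempfile


def line_start_at_or_after(f, offset):
    """Return the byte offset of the first line starting at or after `offset`."""
    if offset == 0:
        return 0
    f.seek(offset - 1)
    f.readline()  # skip to the end of the line that contains byte offset-1
    return f.tell()


def seek_first_at_or_after(path, target_pos, pos_col=1):
    """Binary-search a position-sorted, tab-separated file directly on disk.

    Returns the first line whose position is >= target_pos, or None.
    Assumes one sequence per file and ascending integer positions.
    """
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(line_start_at_or_after(f, mid))
            line = f.readline()
            # Past the last line, or a line at/after the target: search left.
            if not line or int(line.split(b"\t")[pos_col]) >= target_pos:
                hi = mid
            else:
                lo = mid + 1
        f.seek(line_start_at_or_after(f, lo))
        line = f.readline()
    return line.decode().rstrip("\n") if line else None


# Build a tiny sorted GP-like file and look up a position in it.
rows = ["chr1\t10\tA", "chr1\t20\tC", "chr1\t30\tG", "chr1\t40\tT"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
print(seek_first_at_or_after(tmp.name, 25))
os.remove(tmp.name)
```

Only a logarithmic number of short reads touch the file, so memory use stays constant regardless of file size.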

Getting started

New to PileLine? Please follow our Quick Start.

PileLine Commands

Processing and Annotation Commands

  • pileline-fastseek

Prints a given range of a GP file.

  • pileline-sort

Sorts GP files by coordinate.

  • pileline-fastjoin

Joins two SORTED GP files.

  • pileline-rfilter

Filters (or annotates) a positional file with range-based annotations (in bed format). Each position that is inside of a specific range is annotated.

  • pileline-pileup2sift

Generates SIFT compatible infiles from pileup files.

  • pileline-pileup2polyphen

Generates PolyPhen-2 compatible infiles from pileup files.

  • pileline-pileup2firestar

Generates Firestar compatible infiles from GP files.
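
The requirement that pileline-fastjoin inputs be sorted suggests a merge-style join, in which each file is read once from start to end. The Python sketch below shows that general technique only; it is not taken from PileLine's source. It assumes each row is a `(key, data)` pair with `key = (sequence, position)`, that keys are unique within each input, and that both inputs are sorted by the same key order. The name `merge_join` and the `left_outer` flag are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Key = Tuple[str, int]
Row = Tuple[Key, List[str]]


def merge_join(
    a: Iterable[Row], b: Iterable[Row], left_outer: bool = False
) -> Iterator[Tuple[Key, List[str], Optional[List[str]]]]:
    """Join two key-sorted row streams, reading each stream only once."""
    b_iter = iter(b)
    b_row = next(b_iter, None)
    for a_key, a_data in a:
        # Advance b until its key is no longer smaller than a's key.
        while b_row is not None and b_row[0] < a_key:
            b_row = next(b_iter, None)
        if b_row is not None and b_row[0] == a_key:
            yield a_key, a_data, b_row[1]
        elif left_outer:
            # Keep unmatched rows from a, with no data from b.
            yield a_key, a_data, None


gp_rows = [(("chr1", 5), ["A"]), (("chr1", 9), ["C"]), (("chr2", 1), ["G"])]
annotations = [(("chr1", 9), ["rs1"]), (("chr2", 1), ["rs2"]), (("chr2", 7), ["rs3"])]

for row in merge_join(gp_rows, annotations, left_outer=True):
    print(row)
```

Because neither input is loaded into memory, the approach scales to files much larger than available RAM, which is why unsorted input has to be sorted first.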

Analysis Commands

  • pileline-2smc

Looks for discrepancies in genotypes of two samples (e.g., case vs. control). It can also annotate each output position with a user-provided BED file containing custom annotations.

  • pileline-nsmc

Compares n samples and reports consistent variants.

  • pileline-genotest

Calculates the NGS performance on genotyping, surveying a set of genomic positions whose genotype is known in the sample. This functionality requires a previous step using the pileline-genindex command for genome indexing.

Use Cases

PileLine coupled to SAMtools facilitating pileup handling. NS: non-synonymous
  • Perform 2 samples comparison
pileline-2smc.sh 
-a <file_A.pileup> -b <file_B.pileup>
-v <variants_file_A.pileup> -w <variants_file_B.pileup>
-o <out.txt> -d <min_depth>
  • Perform n samples comparison
pileline-nsmc.sh
--a-samples <GPfile_a1>,<GPfile_a2>,<GPfile_a3>
--b-samples <GPfile_b1>,<GPfile_b2>,<GPfile_b3>
  • Sort GP files
pileline-sort.sh -i <input_GP_file.txt> -o <outfile.sorted.txt>
  • Annotate a GP file with dbSNP
pileline-fastjoin.sh -a <GP_file.txt> -b dbSNP130.txt --left-outer-join
  • Annotate a GP file with genes
pileline-rfilter.sh --annotate -A <GP_file.txt> -b <genes.bed>
  • Filter pileup to exon loci
pileline-rfilter.sh -A <GP_file.txt> -b <exons.bed>
  • Generate SIFT-compatible input
pileline-pileup2sift.sh -i <file.pileup>
  • Perform a genotyping test for quality control
Genotest metrics table description. It may be obtained by using the --print-help-table argument.
# Warning: Check that your alleles in the <gold_genotype.sorted> file are expressed in the same strand as the 
#          reference genome sequence used in your NGS experiment. Typically forward (+) strand. 

## Step1.

#Create reference index <ref_genome.pileline> using pileline-genindex command.
pileline-genindex --index -i  <ref_genome.pileline> -g <ref_genome.fa>

## Step2.
#Create genotest file (required).
pileline-genotest --create-genotest-file <experiment.genotest> -p <GP_file.txt> -g <gold_genotype.sorted> -r <ref_genome.pileline>

## Step3. QC analysis.
 #Generate all performance metrics for several thresholds
pileline-genotest -a <experiment.genotest> --batch-t 0,255,1

#Generate values for ROC curve plot (outfile compatible with the ROCR R package)
pileline-genotest -a <experiment.genotest> --roc

#Generate a metrics table of performance at a given threshold.
pileline-genotest -a <experiment.genotest> -t <snpq_threshold>
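
The idea behind sweeping a quality threshold and plotting a ROC curve can be sketched in Python as follows. This is a generic illustration, not PileLine's metric definitions (those are given in the genotest metrics table). It assumes each surveyed position has a SNP quality score and a gold-standard flag saying whether it is truly a variant, and that a position is called a variant when its score reaches the threshold. The start/end/step sweep is also an assumption; whether `--batch-t 0,255,1` follows that convention is not confirmed here.

```python
from typing import Iterable, List, Sequence, Tuple


def roc_points(
    scores: Sequence[int], is_variant: Sequence[bool], thresholds: Iterable[int]
) -> List[Tuple[int, float, float]]:
    """Return (threshold, true positive rate, false positive rate) per threshold."""
    positives = sum(is_variant)
    negatives = len(is_variant) - positives
    points = []
    for t in thresholds:
        # A position is called a variant when its score is at least t.
        tp = sum(1 for s, v in zip(scores, is_variant) if v and s >= t)
        fp = sum(1 for s, v in zip(scores, is_variant) if not v and s >= t)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((t, tpr, fpr))
    return points


# Hypothetical data: quality scores and gold-standard variant flags.
scores = [10, 40, 60, 90, 120, 200]
is_variant = [False, False, True, False, True, True]

for t, tpr, fpr in roc_points(scores, is_variant, range(0, 256, 50)):
    print(f"threshold={t}\tTPR={tpr:.2f}\tFPR={fpr:.2f}")
```

Raising the threshold trades sensitivity for fewer false calls; a single-threshold metrics table corresponds to one point on this curve.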

PileLine GUI

PileLine GUI is a front-end of the PileLine toolkit, plus a genome browser. With this intuitive graphical desktop application you can run the following tasks:

  1. Run processing commands on GP files, such as seek, join, annotate, and filter.
  2. Perform 2-sample and n-sample point somatic mutation calling (via the PileLine 2smc and nsmc commands).
  3. Browse GP files in an interactive local genome browser.

You can download PileLine GUI from Downloads.

General scheme of the PileLine GUI software.
PileLine GUI's interactive genome browser.
PileLine GUI showing an instantly navigable .pileup file.