Quick Start

From PileLine

(Difference between revisions)
Jump to: navigation, search
(PipeLine Guided Example)
(3. You can also compare multiple samples to report consistent variants.)
 
(91 intermediate revisions not shown)
Line 1: Line 1:
-
==PipeLine Input Files==
+
==PileLine Input Files==
 +
[[File:Glez_Peña_et_al_Figure2.png|right|thumb|PileLine commands and accepted input files.]]
'''PileLine''' is capable to handle, filter and compare genomic position files (GP) including standard [http://samtools.sourceforge.net/pileup.shtml pileup], [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED],[http://genome.ucsc.edu/FAQ/FAQformat#format3 GFF], or [http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 VCF] files.  
'''PileLine''' is capable to handle, filter and compare genomic position files (GP) including standard [http://samtools.sourceforge.net/pileup.shtml pileup], [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED],[http://genome.ucsc.edu/FAQ/FAQformat#format3 GFF], or [http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2 VCF] files.  
-
Basically, GP are tabular files where the two first columns contain chromosome name and position coordinate respectively.  
+
Basically, GP are tabular files where the two first columns contain sequence name (i.e. chromosome name) and position coordinate respectively.  
Additional optional fields are accepted in '''PileLine''', see an example of GP input file below:
Additional optional fields are accepted in '''PileLine''', see an example of GP input file below:
-
  10     118829    optional1    optional2    optional3    ...     
+
  chr10     118829    optional1    optional2    optional3    ...     
-
  10     121207    optional1    optional2    optional3    ...
+
  chr10     121207    optional1    optional2    optional3    ...
-
  10     121337    optional1    optional2    optional3    ...
+
  chr10     121337    optional1    optional2    optional3    ...
-
  10     121636    optional1    optional2    optional3    ...
+
  chr10     121636    optional1    optional2    optional3    ...
-
==PipeLine Guided Example==
+
==PileLine Guided Example #1==
-
'''1.''' '''Download''' GP example files (pileup format) to your working directory:
+
The code included in this tutorial runs on UNIX systems. For Windows you must use '''PileLine''' commands adding .bat extension (i.e. pileline-2smc.bat)  
-
* Experiment 1.
+
==='''1.'''  '''Download''' and uncompress GP files (pileup format) and Annotation files to your working directory:===
-
:[[File:Control1Files.zip]]
+
-
:[[File:Case1Files.zip]]
+
-
* Experiment 2.
+
* GP files-Experiment 1.
-
:[[File:Control2Files.zip]]
+
:[[Media:Control1Files.zip|Control1Files.zip]]
-
:[[File:Case2Files.zip]]
+
:[[Media:Case1Files.zip|Case1Files.zip]]
-
Each .zip file contains 2 pileup files:
+
* GP files-Experiment 2.
 +
:[[Media:Control2Files.zip|Control2Files.zip]]
 +
:[[Media:Case2Files.zip|Case2Files.zip]]
 +
 
 +
* Annotation files
 +
:[[Media:Hg18_hgnc_ensembl_genes.bed.zip|Hg18_hgnc_ensembl_genes.bed.zip]]
 +
:[[Media:DbSNP_36.3.bed.bgz.zip|DbSNP_36.3.bed.bgz.zip]] (to annotate using rfilter --annotate)
 +
:[[Media:DbSNP_36.3.gpfile.zip|DbSNP_36.3.gpfile.zip]] (to annotate using fastjoin)
 +
 
 +
 
 +
Each Experiment .zip file contains 2 pileup files:
* Whole pileup file (.pileup)
* Whole pileup file (.pileup)
* Variants against reference genome pileup file (.variants.pileup).
* Variants against reference genome pileup file (.variants.pileup).
-
'''2.'''  You may '''compare 2 samples''' at variant level using '''''pileline-2smc.sh''''' functionality. Use this command line to compare Case1 vs Control1:
+
==='''2.'''  You may '''compare 2 samples''' at variant level using '''''pileline-2smc.sh''''' functionality.===
 +
Use this command line to compare file A (Control1) vs file B (Case1), remind that INPUT FILES MUST BE SORTED :
  $ cd DOWNLOADED_FILES_DIRECTORY  
  $ cd DOWNLOADED_FILES_DIRECTORY  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh  
-
  –a ./Control1.pileup –b ./Case1.pileup
+
  -a ./Control1.pileup -b ./Case1.pileup
-
  –v ./Control1.variants.pileup –w ./Case1.variants.pileup  
+
  -v ./Control1.variants.pileup -w ./Case1.variants.pileup  
-
  –o ./myoutput1.txt
+
  -o ./myoutput1.txt
 +
 
-
Executing this code you will obtain 4 output files containing:
+
Executing this command using by default parameters you will obtain 4 output files containing:
-
* [[myoutput1.txt.onlyA]]: Variants found in Control1 but not in Case1 (i.e. germ-line reverted mutations or SNPs)
+
* myoutput1.txt.onlyA: Only variants found in File A. In this example variants found in  Control1 but not in Case1 (i.e. germ-line reverted mutations or SNPs)
-
* [[myoutput1.txt.onlyB]]: Variants found in Case1 but not in Control1 (i.e. somatic mutations or SNPs)
+
* myoutput1.txt.onlyB: Only variants found in File B. In this case, variants found in Case1 but not in Control1 (i.e. somatic mutations or SNPs)
-
* [[myoutput1.txt.both]]: Case1 and Control1 variants are similar alleles and both of them are different to the reference genome allele.  (i.e. germ-line variants or SNPs)
+
* myoutput1.txt.both: Both File A and File B variants present the same allele and both of them are different to the reference genome allele.  (i.e. germ-line variants or SNPs)
-
* [[myoutput1.txt.AdiscrepantB]]: Both Case1 and Control1 variants are different alleles and both of them are different to the reference genome allele.  (i.e. germ-line variants mutated or SNPs).  
+
* myoutput1.txt.AdiscrepantB: Both File A and File B variants present different alleles and both of them are different to the reference genome allele.  (i.e. germ-line variants mutated or SNPs).  
See an example in this table:
See an example in this table:
Line 42: Line 53:
{| border="2"
{| border="2"
|+ '''''pileline-2smc.sh''''' output files
|+ '''''pileline-2smc.sh''''' output files
-
! Ref genome !! Control (-a file) !! Case (-b file) !! Output File Name
+
! Ref genome !! File A !! File B !! Output File Name
|-
|-
! T
! T
Line 59: Line 70:
-
'''''pileline-2smc.sh''''' output file format consists in both input files contents joined by variant positions. In this way:
+
'''''pileline-2smc.sh''''' output file format specifications are shown in the next table:
{| border="2"
{| border="2"
|+ '''''pileline-2smc.sh''''' output files format
|+ '''''pileline-2smc.sh''''' output files format
-
! Chr !! Position !! Control1_data.pileup !! Case1_data.pileup !! Variant Score  
+
! Chr !! Position !! Ref Genome Allele !! Variant Allele !! File ID !! Variant Score !! FileA_inputdata !! FileB_inputdata 
|-
|-
|}  
|}  
-
Note: If you apply '''pileline-2smc.sh''' to pileup files you'll find a variant score in the last column of the output file. This score reports the sum of Phred-scale Consensus Qualities (PCQ) for each position in both conditions (PCQ_Control + PCQ_Case). The higher is the score the more consistent is the variant.
+
See an explained example here:
 +
 
 +
10 116237 G A A 234 ... ... # Variant (G => A) found in File A (Control1 in this example)
 +
10 116402 T C B 424 ... ... # Variant (T => C) found in File B (Case1 in this example)
 +
10 118829 A G AB 125 ... ... # Variant (A => G) found in both A and B files (Control and Case1)
 +
10 118845 A K D 127 ... ... # Discrepant variant (A => K) found in both A and B files (Control and Case1). IUPAC code is used here.
 +
 
 +
Note: If you apply '''pileline-2smc.sh''' to pileup files you'll find a variant score in the output file. This score reports the sum of Phred-scale Consensus Qualities (PCQ) for each position in both conditions (PCQ_Control + PCQ_Case).  
Line 73: Line 91:
Now, run '''''pileline-2smc.sh''''' to compare Case2 vs Control2:
Now, run '''''pileline-2smc.sh''''' to compare Case2 vs Control2:
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh  
-
  –a ./Control2.pileup –b ./Case2.pileup
+
  -a ./Control2.pileup -b ./Case2.pileup
-
  –v ./Control2.variants.pileup –w ./Case2.variants.pileup  
+
  -v ./Control2.variants.pileup -w ./Case2.variants.pileup  
-
  –o ./myoutput2.txt
+
  -o ./myoutput2.txt
-
'''3.'''  You can also '''compare  multiple samples''' to report consistent variants. You should use '''''pileline-nsmc.sh''''' command. In the following example we compare 2 samples (Case1 and Case2 variants) but '''''pileline-nsmc.sh''''' can be employed for n samples:
+
==='''3.'''  You can also '''compare  multiple samples''' to report consistent variants. ===
 +
You should use '''''pileline-nsmc.sh''''' command.  
 +
In the following example we compare 4 samples (Control1 and Control2 variants vs Case1 and Case2 variants) but '''''pileline-nsmc.sh''''' can be applied to n samples (reminder: INPUT FILES MUST BE SORTED):
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh  
Line 92: Line 112:
|}
|}
-
See an example here:
+
'''''pileline-nsmc.sh''''' performs a Fisher's test to estimate dependency amongst variants presence and samples type. The False Discovery Rate (FDR) is obtained by using Benjamini-Hochberg adjustment.
 +
 
 +
 
 +
See next an example of the pileline-nsmc.sh output file:
  10 115839 C Y NO NO NO YES 1 0.50000002 1.0
  10 115839 C Y NO NO NO YES 1 0.50000002 1.0
Line 103: Line 126:
  10 42940557 G R NO NO YES YES 2 0.166666667 1
  10 42940557 G R NO NO YES YES 2 0.166666667 1
-
In this case the variant located at position 6101971 has been found  in both Control samples (-a files) but not in Case samples (-b files). On the contrary variant located in 42940557 has  been found in both Cases samples but not in Controls.
+
In this case the variant located at position 6101971 has been found  in both Control samples (-a files) but not in Case samples (-b files). On the contrary, the variant located in position 42940557 has  been found in both Cases samples but not in Controls.
-
'''''pileline-nsmc.sh''''' performs a Fisher's test to estimate dependency amongst variants presence and samples type. The False Discovery Rate (FDR) is obtained by using Benjamini-Hochberg adjustment.
 
-
Additionally, '''''pileline-nsmc.sh'''' is particularly useful whether you want to find common variants in biological replicates. You should run '''''pileline-nsmc.sh''' in this way:
+
Note that '''pileline-nsmc.sh''' command may be particularly useful whether you want to find common variants in biological replicates. You should run '''''pileline-nsmc.sh''' in this way:
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh  
-
  -a /Case1.variants.pileup -b ./Case2.variants.pileup
+
  -a /Case1.variants.pileup -a ./Case2.variants.pileup
  -o  ./mycommonvariants_in_Cases.txt
  -o  ./mycommonvariants_in_Cases.txt
-
'''4.''' At this point it could be useful to '''annotate SNPs''' in variants found between Case1 and Control1.
+
==='''4.''' Variants annotation (e.g. SNPs, Ensembl IDs, HGNC Gene Names ...)===
 +
To annotate SNPs in variants found between Case1 and Control1, you have two choices:
-
To this end, you should execute '''''pileline-fastjoin.sh''''' command as follows:
+
* Use the dbSNP_36.3.bed.bgz (a BED, that is, an intervals GP file) and execute '''''pileline-rfilter.sh --annotate''''' command as follows:
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter --annotate
 +
-A ./myoutput1.txt.onlyB.pileup
 +
-i ./dbSNP_36.3.gpfile > ./mydbSNPannotation1.txt
 +
 
 +
* Use the dbSNP_36.3.gpfile (a single-nucleotide GP file) and execute '''''pileline-fastjoin.sh''''' command as follows:
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh  
  $ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh  
-
  –a ./myoutput1.txt.onlyB.pileup  
+
  -a ./myoutput1.txt.onlyB.pileup  
-
  -b YOUR_PATH_TO_PILELINE/dbSNP_36.3.txt --left-outer-join > ./mydbSNPannotation1.txt
+
  -b ./dbSNP_36.3.gpfile --left-outer-join > ./mydbSNPannotation1.txt
-
The output file (mydbSNPannotation1.txt) generated contains the same information as the input file but adding a new annotation column. SNPs are annotated by their respective dbSNP ID (i.e.rs7906865). Those variants which do not match to reported SNPs are annotated with NULL label.
+
The output file (mydbSNPannotation1.txt) generated contains the same information as the input file but adding a new annotation column. SNPs are annotated with their respective dbSNP ID (i.e.rs7906865). Those variants which do not match to reported SNPs are annotated with NULL label.
-
Gene Symbol
+
In order to '''annotate the variants with  Gene Symbols''', you need the '''pileline-rfilter.sh''' command:
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh
 +
--annotate -A ./mydbSNPannotation1.txt
 +
-b ./hg18_hgnc_ensembl_genes.bed > ./mydbSNPandGeneannotation1.txt
-
'''5.'''  SIFT y Polyphen.
+
Note here that you can customize your annotations using your own BED file. Combining ''pileline-rfilter.sh''  and basic UNIX commands you can even '''annotate your pileup files with multiple .BED files''' using a single command line. See the following example:
 +
  cat mypileup | pileline-rfilter.sh --annotate -A - -i annotations_file1.bed | pileline-rfilter.sh --annotate -A - -i annotations_file2.bed > myfullyannotated.pileup
 +
 
 +
==='''5.''' Generate input for external mutational effects prediction software.===
 +
Finally, you may be interested in '''generate SIFT, Polyphen and Firestar compatible input files''' to predict the functional consequences of variants found between Case1 and Control1. Use '''pileline-pileup2sift.sh''' and '''pileline-pileup2polyphen.sh''' commands to this end.
 +
 
 +
#To generate SIFT-compatible inputs
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2sift.sh -i ./mydbSNPandGeneannotation1.txt > ./mysiftinput.txt
 +
 
 +
#To generate PolyPhen-compatible inputs:
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2polyphen.sh -i ./mydbSNPandGeneannotation1.txt > ./mypolypheninput.txt
 +
 
 +
#To generate Firestar-compatible inputs:
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2firestar.sh -i ./mydbSNPandGeneannotation1.txt > ./myfirestarinput.txt
 +
 
 +
==PipeLine Guided Example #2==
 +
 
 +
The code included in this tutorial runs on UNIX systems. For Windows you must use '''PileLine''' commands adding .bat extension (i.e. pileline-fastseek.bat)
 +
 
 +
==='''1.'''  '''Download''' GP files (vcf, .bed and .gff format):===
 +
 
 +
:[[Media:GP_Files_vcf_bed_gff.zip|GP_Files_vcf_bed_gff.zip]]
 +
 
 +
This .zip compressed file contains 3 GP files:
 +
 
 +
#'''TestVcf.vcf'''
 +
#'''TestBed.bed'''
 +
#'''TestGff.gff'''
 +
#'''My.intervals.gff'''
 +
 
 +
* Note: All the guided examples showed here may be also applied to .pileup files.
 +
 
 +
==='''2.'''  '''Sort''' GP files by chromosomic positions:===
 +
'''.vcf'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestVcf.vcf -o ./My.sorted.Vcf
 +
'''.bed'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestBed.bed -o ./My.sorted.bed
 +
'''.gff'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestGff.gff --pos-col 4  -o ./My.sorted.gff
 +
 
 +
==='''3.'''  '''Search''' for chromosomic positions within the GP files:===
 +
'''.vcf'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.Vcf -s 1:762969:804635
 +
'''.bed'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.bed -s 9:138938078:138940781
 +
'''.gff'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.gff --pos-col 4 -s 7:157540895:157681138
 +
 
 +
==='''4.'''  '''Join''' two sorted GP files by common rows (if any):===
 +
'''.vcf and .bed'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.Vcf -b My.sorted.bed
 +
'''.vcf and .gff'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.Vcf -b My.sorted.gff --pos-col-b 4
 +
'''bed and .gff'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.bed -b My.sorted.gff --pos-col-b 4
 +
 
 +
==='''5.''' Filters''' (or annotates) a GP file:===
 +
'''Annotate a .vcf file with Ensembl IDs contained in a .bed file'''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh --annotate -A ./My.sorted.Vcf -b ./hg18_hgnc_ensembl_genes.bed
 +
'''Filter .bed using a .gff file containing intervals to be filtered '''
 +
$ cd DOWNLOADED_FILES_DIRECTORY
 +
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh -A ./My.sorted.bed -g ./My.intervals.gff --start-col-intervals 4 --end-col-intervals 5

Latest revision as of 12:15, 31 August 2011

Contents

PileLine Input Files

PileLine commands and accepted input files.

PileLine is capable to handle, filter and compare genomic position files (GP) including standard pileup, BED,GFF, or VCF files.

Basically, GP are tabular files where the two first columns contain sequence name (i.e. chromosome name) and position coordinate respectively. Additional optional fields are accepted in PileLine, see an example of GP input file below:

chr10     118829     optional1     optional2     optional3     ...    
chr10     121207     optional1     optional2     optional3     ...
chr10     121337     optional1     optional2     optional3     ...
chr10     121636     optional1     optional2     optional3     ...

PileLine Guided Example #1

The code included in this tutorial runs on UNIX systems. For Windows you must use PileLine commands adding .bat extension (i.e. pileline-2smc.bat)

1. Download and uncompress GP files (pileup format) and Annotation files to your working directory:

  • GP files-Experiment 1.
Control1Files.zip
Case1Files.zip
  • GP files-Experiment 2.
Control2Files.zip
Case2Files.zip
  • Annotation files
Hg18_hgnc_ensembl_genes.bed.zip
DbSNP_36.3.bed.bgz.zip (to annotate using rfilter --annotate)
DbSNP_36.3.gpfile.zip (to annotate using fastjoin)


Each Experiment .zip file contains 2 pileup files:

  • Whole pileup file (.pileup)
  • Variants against reference genome pileup file (.variants.pileup).

2. You may compare 2 samples at variant level using pileline-2smc.sh functionality.

Use this command line to compare file A (Control1) vs file B (Case1), remind that INPUT FILES MUST BE SORTED :

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh 
-a ./Control1.pileup -b ./Case1.pileup
-v ./Control1.variants.pileup -w ./Case1.variants.pileup 
-o ./myoutput1.txt


Executing this command using by default parameters you will obtain 4 output files containing:

  • myoutput1.txt.onlyA: Only variants found in File A. In this example variants found in Control1 but not in Case1 (i.e. germ-line reverted mutations or SNPs)
  • myoutput1.txt.onlyB: Only variants found in File B. In this case, variants found in Case1 but not in Control1 (i.e. somatic mutations or SNPs)
  • myoutput1.txt.both: Both File A and File B variants present the same allele and both of them are different to the reference genome allele. (i.e. germ-line variants or SNPs)
  • myoutput1.txt.AdiscrepantB: Both File A and File B variants present different alleles and both of them are different to the reference genome allele. (i.e. germ-line variants mutated or SNPs).

See an example in this table:

pileline-2smc.sh output files
Ref genome File A File B Output File Name
T A T myoutput1.txt.onlyA
T T G myoutput1.txt.onlyB
T A A myoutput1.txt.both
T C G myoutput1.txt.AdiscrepantB


pileline-2smc.sh output file format specifications are shown in the next table:

pileline-2smc.sh output files format
Chr Position Ref Genome Allele Variant Allele File ID Variant Score FileA_inputdata FileB_inputdata

See an explained example here:

10	116237	G	A	A	234	...	...	# Variant (G => A) found in File A (Control1 in this example)		
10	116402	T	C	B	424	...	...	# Variant (T => C) found in File B (Case1 in this example)
10	118829	A	G	AB	125	...	...	# Variant (A => G) found in both A and B files (Control and Case1)
10	118845	A	K	D	127	...	...	# Discrepant variant (A => K) found in both A and B files (Control and Case1). IUPAC code is used here.

Note: If you apply pileline-2smc.sh to pileup files you'll find a variant score in the output file. This score reports the sum of Phred-scale Consensus Qualities (PCQ) for each position in both conditions (PCQ_Control + PCQ_Case).


Now, run pileline-2smc.sh to compare Case2 vs Control2:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh 
-a ./Control2.pileup -b ./Case2.pileup
-v ./Control2.variants.pileup -w ./Case2.variants.pileup 
-o ./myoutput2.txt

3. You can also compare multiple samples to report consistent variants.

You should use pileline-nsmc.sh command. In the following example we compare 4 samples (Control1 and Control2 variants vs Case1 and Case2 variants) but pileline-nsmc.sh can be applied to n samples (reminder: INPUT FILES MUST BE SORTED):

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh 
-a  ./Control1.variants.pileup -a ./Control2.variants.pileup 
-b ./Case1.variants.pileup -b ./Case2.variants.pileup
-o ./mycommonvariants.txt

The output file (mycommonvariants.txt in this tutorial) contains the following information:

pileline-nsmc.sh output files
Chr Position Ref Genome Allele Variant Allele Presence of variant in -a files Presence of variant in -b files # of samples containing the variant Fisher's test p-value FDR

pileline-nsmc.sh performs a Fisher's test to estimate dependency amongst variants presence and samples type. The False Discovery Rate (FDR) is obtained by using Benjamini-Hochberg adjustment.


See next an example of the pileline-nsmc.sh output file:

10	115839	C	Y	NO	NO	NO	YES	1	0.50000002	1.0
10	116237	G	R	NO	YES	NO	YES	2	1.00000002	1.0
10	116402	T	C	YES	YES	NO	YES	3	0.50000002	1.0
10	116699	C	M	NO	YES	NO	YES	2	1.00000002	1.0
10	118829	A	R	YES	YES	YES	YES	4	1.00000002	1.0
10	6101971	A	M	YES	YES	NO	NO	2	0.16666669	1.0
...
10	42940557	G	R	NO	NO	YES	YES	2	0.166666667	1

In this case the variant located at position 6101971 has been found in both Control samples (-a files) but not in Case samples (-b files). On the contrary, the variant located in position 42940557 has been found in both Cases samples but not in Controls.


Note that pileline-nsmc.sh command may be particularly useful whether you want to find common variants in biological replicates. You should run pileline-nsmc.sh in this way:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh 
-a /Case1.variants.pileup -a ./Case2.variants.pileup
-o  ./mycommonvariants_in_Cases.txt

4. Variants annotation (e.g. SNPs, Ensembl IDs, HGNC Gene Names ...)

To annotate SNPs in variants found between Case1 and Control1, you have two choices:

  • Use the dbSNP_36.3.bed.bgz (a BED, that is, an intervals GP file) and execute pileline-rfilter.sh --annotate command as follows:
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter --annotate 
-A ./myoutput1.txt.onlyB.pileup 
-i ./dbSNP_36.3.gpfile > ./mydbSNPannotation1.txt
  • Use the dbSNP_36.3.gpfile (a single-nucleotide GP file) and execute pileline-fastjoin.sh command as follows:
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh 
-a ./myoutput1.txt.onlyB.pileup 
-b ./dbSNP_36.3.gpfile --left-outer-join > ./mydbSNPannotation1.txt

The output file (mydbSNPannotation1.txt) generated contains the same information as the input file but adding a new annotation column. SNPs are annotated with their respective dbSNP ID (i.e.rs7906865). Those variants which do not match to reported SNPs are annotated with NULL label.


In order to annotate the variants with Gene Symbols, you need the pileline-rfilter.sh command:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh 
--annotate -A ./mydbSNPannotation1.txt
-b ./hg18_hgnc_ensembl_genes.bed > ./mydbSNPandGeneannotation1.txt

Note here that you can customize your annotations using your own BED file. Combining pileline-rfilter.sh and basic UNIX commands you can even annotate your pileup files with multiple .BED files using a single command line. See the following example:

cat mypileup | pileline-rfilter.sh --annotate -A - -i annotations_file1.bed | pileline-rfilter.sh --annotate -A - -i annotations_file2.bed > myfullyannotated.pileup

5. Generate input for external mutational effects prediction software.

Finally, you may be interested in generate SIFT, Polyphen and Firestar compatible input files to predict the functional consequences of variants found between Case1 and Control1. Use pileline-pileup2sift.sh and pileline-pileup2polyphen.sh commands to this end.

#To generate SIFT-compatible inputs
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2sift.sh -i ./mydbSNPandGeneannotation1.txt > ./mysiftinput.txt
#To generate PolyPhen-compatible inputs:
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2polyphen.sh -i ./mydbSNPandGeneannotation1.txt > ./mypolypheninput.txt
#To generate Firestar-compatible inputs:
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-pileup2firestar.sh -i ./mydbSNPandGeneannotation1.txt > ./myfirestarinput.txt

PipeLine Guided Example #2

The code included in this tutorial runs on UNIX systems. For Windows you must use PileLine commands adding .bat extension (i.e. pileline-fastseek.bat)

1. Download GP files (vcf, .bed and .gff format):

GP_Files_vcf_bed_gff.zip

This .zip compressed file contains 3 GP files:

  1. TestVcf.vcf
  2. TestBed.bed
  3. TestGff.gff
  4. My.intervals.gff
  • Note: All the guided examples showed here may be also applied to .pileup files.

2. Sort GP files by chromosomic positions:

.vcf

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestVcf.vcf -o ./My.sorted.Vcf

.bed

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestBed.bed -o ./My.sorted.bed

.gff

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-sort.sh -i ./TestGff.gff --pos-col 4  -o ./My.sorted.gff

3. Search for chromosomic positions within the GP files:

.vcf

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.Vcf -s 1:762969:804635

.bed

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.bed -s 9:138938078:138940781

.gff

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastseek.sh -p ./My.sorted.gff --pos-col 4 -s 7:157540895:157681138

4. Join two sorted GP files by common rows (if any):

.vcf and .bed

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.Vcf -b My.sorted.bed

.vcf and .gff

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.Vcf -b My.sorted.gff --pos-col-b 4

bed and .gff

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh -a My.sorted.bed -b My.sorted.gff --pos-col-b 4

5. Filters (or annotates) a GP file:

Annotate a .vcf file with Ensembl IDs contained in a .bed file

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh --annotate -A ./My.sorted.Vcf -b ./hg18_hgnc_ensembl_genes.bed

Filter .bed using a .gff file containing intervals to be filtered

$ cd DOWNLOADED_FILES_DIRECTORY 
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-rfilter.sh -A ./My.sorted.bed -g ./My.intervals.gff --start-col-intervals 4 --end-col-intervals 5
Personal tools