Quick Start

From PileLine

Revision as of 14:51, 23 June 2010

PileLine Input Files

PileLine can handle, filter, and compare genomic position (GP) files, including standard pileup, BED, GFF, and VCF files.

GP files are tabular files whose first two columns contain the chromosome name and the position coordinate, respectively. PileLine accepts additional optional fields. See an example of a GP input file below:

10     118829     optional1     optional2     optional3     ...    
10     121207     optional1     optional2     optional3     ...
10     121337     optional1     optional2     optional3     ...
10     121636     optional1     optional2     optional3     ...
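
As a quick sanity check before running PileLine on your own data, the sketch below flags lines that do not fit the GP layout described above. It assumes whitespace-separated columns and no header or comment lines (VCF header lines, for instance, would be flagged). The names check_gp and gp_demo.txt are hypothetical and not part of PileLine.

```shell
# Hypothetical helper (not part of PileLine): print every line that has
# fewer than two columns or a non-integer position in column 2.
# Returns a non-zero status if any such line is found.
check_gp() {
  awk '
    NF < 2 || $2 !~ /^[0-9]+$/ { printf "line %d: %s\n", NR, $0; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Small demo file reusing two rows from the example above.
printf '10\t118829\toptional1\n10\t121207\toptional1\n' > gp_demo.txt
check_gp gp_demo.txt && echo "gp_demo.txt looks like a GP file"
```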

PileLine Guided Example

1. Download GP example files (pileup format) to your working directory:

  • Experiment 1.
File:Control1Files.zip
File:Case1Files.zip
  • Experiment 2.
File:Control2Files.zip
File:Case2Files.zip

Each .zip file contains 2 pileup files:

  • Whole pileup file (.pileup)
  • Variants against reference genome pileup file (.variants.pileup).

2. You can compare 2 samples at the variant level with pileline-2smc.sh. Use this command line to compare Case1 vs Control1:

$ cd DOWNLOADED_FILES_DIRECTORY
$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh \
    -a ./Control1.pileup -b ./Case1.pileup \
    -v ./Control1.variants.pileup -w ./Case1.variants.pileup \
    -o ./myoutput1.txt

Running this command produces 4 output files:

  • myoutput1.txt.onlyA: Variants found in Control1 but not in Case1 (i.e. germ-line reverted mutations or SNPs).
  • myoutput1.txt.onlyB: Variants found in Case1 but not in Control1 (i.e. somatic mutations or SNPs).
  • myoutput1.txt.both: Case1 and Control1 carry the same allele, and it differs from the reference genome allele (i.e. germ-line variants or SNPs).
  • myoutput1.txt.AdiscrepantB: Case1 and Control1 carry different alleles, and both differ from the reference genome allele (i.e. mutated germ-line variants or SNPs).

See an example in this table:

pileline-2smc.sh output files

Ref genome     Control (-a file)     Case (-b file)     Output file name
T              A                     T                  myoutput1.txt.onlyA
T              T                     G                  myoutput1.txt.onlyB
T              A                     A                  myoutput1.txt.both
T              C                     G                  myoutput1.txt.AdiscrepantB
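
The rule behind this table can be sketched as a small shell function. This is only an illustration of the logic for single-letter alleles; it is not how pileline-2smc.sh is implemented, and classify_variant is a hypothetical name. Real pileup consensus calls may use IUPAC ambiguity codes, which this sketch treats as plain strings.

```shell
# Hypothetical illustration: classify one position from its reference,
# control (-a) and case (-b) alleles, following the table above.
classify_variant() {
  ref=$1
  control=$2
  case_allele=$3
  if [ "$control" = "$ref" ] && [ "$case_allele" = "$ref" ]; then
    echo "none"            # no variant in either sample
  elif [ "$case_allele" = "$ref" ]; then
    echo "onlyA"           # variant only in the control (-a) sample
  elif [ "$control" = "$ref" ]; then
    echo "onlyB"           # variant only in the case (-b) sample
  elif [ "$control" = "$case_allele" ]; then
    echo "both"            # same non-reference allele in both samples
  else
    echo "AdiscrepantB"    # two different non-reference alleles
  fi
}

classify_variant T A T   # prints onlyA
classify_variant T T G   # prints onlyB
classify_variant T A A   # prints both
classify_variant T C G   # prints AdiscrepantB
```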


The pileline-2smc.sh output format consists of the contents of both input files joined by variant position, as follows:

pileline-2smc.sh output files format
Chr     Position     Control1_data.pileup     Case1_data.pileup     Variant Score

Note: If you apply pileline-2smc.sh to pileup files, you will find a variant score in the last column of the output file. This score is the sum of the Phred-scaled Consensus Qualities (PCQ) at that position in both conditions (PCQ_Control + PCQ_Case). The higher the score, the more consistent the variant.
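
Since the score sits in the last column, the output can be ranked with standard tools. The sketch below assumes tab-separated output without embedded spaces; rank_by_score and scores_demo.txt are hypothetical names, and the demo rows are made up for illustration only.

```shell
# Hypothetical helper: sort rows by their last column (the variant score),
# highest score first. The score is copied to the front as a sort key and
# removed again afterwards.
rank_by_score() {
  awk '{ print $NF "\t" $0 }' "$1" | sort -k1,1nr | cut -f2-
}

# Made-up demo rows: chromosome, position, placeholder data, score.
printf '10\t100\ta\t30\n10\t200\tb\t90\n10\t150\tc\t60\n' > scores_demo.txt
rank_by_score scores_demo.txt
```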


Now, run pileline-2smc.sh to compare Case2 vs Control2:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-2smc.sh \
    -a ./Control2.pileup -b ./Case2.pileup \
    -v ./Control2.variants.pileup -w ./Case2.variants.pileup \
    -o ./myoutput2.txt

3. You can also compare multiple samples to report consistent variants with the pileline-nsmc.sh command. The following example compares 2 samples per group (Control1 and Control2 as -a files, Case1 and Case2 as -b files), but pileline-nsmc.sh can be used with n samples:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh \
    -a ./Control1.variants.pileup -a ./Control2.variants.pileup \
    -b ./Case1.variants.pileup -b ./Case2.variants.pileup \
    -o ./mycommonvariants.txt

The output file (mycommonvariants.txt in this tutorial) contains the following information:

pileline-nsmc.sh output files
Chr	Position	Ref Genome Allele	Variant Allele	Presence of variant in -a files	Presence of variant in -b files	# of samples containing the variant	Fisher's test p-value	FDR

See an example here (with two -a files and two -b files there are four YES/NO presence columns, one per file):

10	115839	C	Y	NO	NO	NO	YES	1	0.50000002	1.0
10	116237	G	R	NO	YES	NO	YES	2	1.00000002	1.0
10	116402	T	C	YES	YES	NO	YES	3	0.50000002	1.0
10	116699	C	M	NO	YES	NO	YES	2	1.00000002	1.0
10	118829	A	R	YES	YES	YES	YES	4	1.00000002	1.0
10	6101971	A	M	YES	YES	NO	NO	2	0.16666669	1.0
...
10	42940557	G	R	NO	NO	YES	YES	2	0.166666667	1

In this case, the variant at position 6101971 was found in both Control samples (-a files) but in neither Case sample (-b files). Conversely, the variant at position 42940557 was found in both Case samples but in neither Control.
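
Rows like these can be pulled out of the output with awk. The sketch below assumes the layout of the example above (two -a files in columns 5 and 6, two -b files in columns 7 and 8); the column numbers change with the number of input files. The function and file names are hypothetical, and the demo rows are copied from the example.

```shell
# Hypothetical helpers for a run with two -a files and two -b files.
only_in_controls() {
  awk '$5 == "YES" && $6 == "YES" && $7 == "NO" && $8 == "NO"' "$1"
}
only_in_cases() {
  awk '$5 == "NO" && $6 == "NO" && $7 == "YES" && $8 == "YES"' "$1"
}

# Demo rows taken from the example output above.
cat > nsmc_demo.txt <<'EOF'
10 116402 T C YES YES NO YES 3 0.50000002 1.0
10 118829 A R YES YES YES YES 4 1.00000002 1.0
10 6101971 A M YES YES NO NO 2 0.16666669 1.0
10 42940557 G R NO NO YES YES 2 0.166666667 1
EOF

only_in_controls nsmc_demo.txt   # prints the row for position 6101971
only_in_cases nsmc_demo.txt      # prints the row for position 42940557
```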

pileline-nsmc.sh performs a Fisher's test to estimate the dependency between variant presence and sample type. The False Discovery Rate (FDR) is obtained with the Benjamini-Hochberg adjustment.
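
To make the adjustment concrete, here is a minimal sketch of the standard Benjamini-Hochberg procedure in shell and awk. It is not PileLine's own code, which this page does not show. It assumes one plain decimal p-value per line (no scientific notation); bh_adjust and pvalues_demo.txt are hypothetical names and the demo p-values are made up.

```shell
# Hypothetical helper: read one p-value per line and print
# "p-value<TAB>adjusted value", ordered from largest to smallest p-value.
bh_adjust() {
  sort -rn "$1" | awk '
    { p[NR] = $1 }
    END {
      m = NR          # total number of tests
      running = 1     # adjusted values are capped at 1
      for (i = 1; i <= m; i++) {
        rank = m - i + 1               # rank of p[i] in ascending order
        adj = p[i] * m / rank
        if (adj < running) running = adj   # keep adjusted values monotone
        printf "%s\t%.6g\n", p[i], running
      }
    }'
}

# Made-up p-values for illustration.
printf '0.01\n0.04\n0.03\n0.5\n' > pvalues_demo.txt
bh_adjust pvalues_demo.txt
```

The adjusted value depends on the total number of tests, so it has to be computed over the complete output file rather than over a subset of rows.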

Additionally, pileline-nsmc.sh is particularly useful if you want to find common variants in biological replicates. In that case, run pileline-nsmc.sh in this way:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-nsmc.sh \
    -a ./Case1.variants.pileup -b ./Case2.variants.pileup \
    -o ./mycommonvariants_in_Cases.txt

4. At this point it can be useful to annotate known SNPs among the variants found between Case1 and Control1.

To do so, run the pileline-fastjoin.sh command as follows:

$ sh YOUR_PATH_TO_PILELINE/cmd/pileline-fastjoin.sh \
    -a ./myoutput1.txt.onlyB.pileup \
    -b YOUR_PATH_TO_PILELINE/dbSNP_36.3.txt --left-outer-join > ./mydbSNPannotation1.txt

The output file (mydbSNPannotation1.txt) contains the same information as the input file plus a new annotation column. SNPs are annotated with their dbSNP ID (e.g. rs7906865). Variants that do not match any reported SNP are annotated with the NULL label.
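
A quick way to summarise the annotation is to count how many variants received a dbSNP ID and how many were labelled NULL. The sketch below assumes the annotation is the last whitespace-separated column; count_annotations and annot_demo.txt are hypothetical names, and the demo rows are made up apart from the dbSNP ID quoted above.

```shell
# Hypothetical helper: count variants with a dbSNP ID ("known") and
# variants labelled NULL ("novel") in the last column.
count_annotations() {
  awk '
    $NF == "NULL" { novel++ }
    $NF != "NULL" { known++ }
    END { printf "known=%d novel=%d\n", known, novel }
  ' "$1"
}

# Made-up demo rows: chromosome, position, placeholder data, annotation.
printf '10\t100\tdata\trs7906865\n10\t200\tdata\tNULL\n10\t300\tdata\tNULL\n' > annot_demo.txt
count_annotations annot_demo.txt   # prints known=1 novel=2
```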


Gene Symbol


5. SIFT and PolyPhen.
