GitHub - Evodify - Genotype-Files-Manipulations - Set of Scripts To Manipulate Tab-Delimited Genotype Calls Files As Well As To Convert Calls-Files To Other Popular Formats

05/11/2018 GitHub - evodify/genotype-files-manipulations: Set of scripts to manipulate tab-delimited genotype calls files as …
Genotype calls files manipulations

Set of scripts to manipulate tab-delimited genotype calls files as well as to convert them to other
popular formats.
Each Python script describes its input and output data formats in the header of the file. To
see the available options, run a script with the --help option: python script.py --help
Most of these scripts require the custom Python module calls, so make sure that you also
download the file calls.py and put it in the same directory as your scripts.
Examples of a tab-delimited genotype calls file (hereafter, tab file).
Two-character coded table (e.g. produced with VariantsToTable from the GATK):
CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A T/A ./. ./. A/A ./. ./. ./. ./.
chr_1 2 C T/C T/C ./. C/C C/C ./. C/C ./.
chr_1 3 C C/GCC C/C ./. C/C C/C C/C C/C C/C
chr_1 4 T T/T T/T ./. T/T T/T T/T T/T T/T
chr_2 1 A A/A A/A ./. A/A A/A A/A A/A A/A
chr_2 2 C C/C C/C ./. C/C C/C C/C C/C C/C
chr_2 3 C AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT
chr_2 4 C C/C T/T C/C C/C C/C C/C C/C C/C
chr_2 5 T T/T C/C T/T C/T T/T C/T T/T T/T
chr_3 1 G G/G ./. ./. G/G ./. ./. ./. ./.
chr_3 2 C G/C C/C ./. C/C C/C ./. C/C ./.
chr_3 3 CTT CTT/CTT CTT/C CTT/C CTT/CTT CTT/CTT CTT/CTT CTT/CTT CTT/CTT
chr_3 4 TA T/T T/T ./. T/T T/T T/T T/T T/TA
chr_3 5 G */* G/* ./. G/G G/G G/G C/C G/G
One-character coded tab file, where heterozygous genotypes are represented by the IUPAC
ambiguity codes R, Y, M, K, S, W (produced from a two-character coded table
with vcfTab_to_callsTab.py):
CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A W N N A N N N N
chr_1 2 C Y Y N C C N C N
chr_1 3 C N C N C C C C C
chr_1 4 T T T N T T T T T
chr_2 1 A A A N A A A A A
chr_2 2 C C C N C C C C C
chr_2 3 C N N N N N N N N
chr_2 4 C C T C C C C C C
chr_2 5 T T C T Y T Y T T
chr_3 1 G G N N G N N N N
chr_3 2 C S C N C C N C N
chr_3 3 N N N N N N N N N
chr_3 4 N T T N T T T T N
chr_3 5 G - N N G G G C G
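The relationship between the two example tables can be sketched in a few lines of Python. This is only an illustration inferred from the example rows above, not the code of vcfTab_to_callsTab.py; the names encode_genotype and encode_line are hypothetical. It assumes unphased genotypes separated by "/", maps "./.", indels and mixed "*" calls to N, maps "*/*" to "-", and replaces a multi-base REF with N.

```python
# IUPAC ambiguity codes for heterozygous pairs of single bases.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}
BASES = set("ACGT")


def encode_genotype(gt):
    """Convert one two-character genotype (e.g. 'T/A') to a one-character code."""
    a, _, b = gt.partition("/")
    if a == b == "*":
        return "-"  # homozygous deletion, as in the example row chr_3 5
    if a not in BASES or b not in BASES:
        return "N"  # missing data (./.), indels, or a mixed call such as G/*
    if a == b:
        return a  # homozygous single-base genotype
    return IUPAC[frozenset((a, b))]  # heterozygous single-base genotype


def encode_line(line):
    """Convert one data row (not the header) of a two-character coded table."""
    chrom, pos, ref, *genotypes = line.split()
    ref = ref if ref in BASES else "N"  # assumption: multi-base REF becomes N
    return "\t".join([chrom, pos, ref] + [encode_genotype(g) for g in genotypes])


print(encode_line("chr_1 2 C T/C T/C ./. C/C C/C ./. C/C ./."))
```

For real data, use vcfTab_to_callsTab.py itself; the sketch only shows how the codes in the second table follow from the genotypes in the first.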
https://github.com/evodify/genotype-files-manipulations
addGOannotation-to-gff3.py adds GO annotation to the gff3 file.
annotate_genes_withSlidingWindowsStats.py annotates genes from a sliding-window analysis
with the stats per gene.
assessNs_in_callsTab.py calculates missing data (Ns) per position/sample and visualizes the
results.
calculateNsPerWindow.py calculates the number of positions with missing data (Ns) using a
sliding-window approach.
calls.py is a custom Python module. It is a dependency of most of the scripts listed here.
calls_to_ped_map.py converts genotype calls file to ped and map files suitable for PLINK.
calls_to_treeMix_input.py outputs the allele counts file that is required as input for TreeMix.
callsToBED.py converts a tab-delimited file to a bed file.
callsToFastaPhy_RAM.py converts genotype calls file to FASTA and PHYLIP with little RAM
consumption.
callsToFastaPhy_speed.py converts genotype calls file to FASTA and PHYLIP quickly but consumes a
lot of RAM.
combine_overlapping_BEDintervals.py combines overlapping genetic intervals in the BED format.
Ensembl.dat-to-topGO.db.py converts the Ensembl.dat file to the GO reference file used in the
topGO R program.
extractSIFT4Gannotation.py extracts the SIFT4G annotation for a given set of samples according
to their genotypes.
FastaToPhylip.py converts FASTA to PHYLIP.
FastaToTab.py converts FASTA to tab-delimited file with columns: Chr, Pos, REF.
filterByNs_callsTab.py removes all sites that contain more than a given amount of missing data
(Ns).
find_popSpecificAlleles_in_callsTab.py outputs only the alleles unique to one population relative to
another.
findCommonAlleles.py outputs common and rare alleles in a given set of samples.
GFFextract.py extracts various info from the gff3 file.
keep_biallelic_in_callsTab.py removes sites with more than two alleles.
make_input_MSMC_from_callsTab.py makes input for MSMC.
makeSweepFinderInput_from_callsTab.py makes an input file for SweepFinder.
make_input_stairway_plot_v1_BS.py makes input files including bootstrap replicates
for Stairway version 1.
make_input_stairway_plot_v2.py makes an input file for Stairway version 2.
merge_phased_callsTab.py merges phased sites into two-character coded genotype file.
merge_SNP_wholeGenome_TabFiles.py merges whole genome and SNPs tab files. This is
needed because non-polymorphic sites and SNPs are filtered differently with GATK.
mergeChrPos_in_callsTab.py merges all chromosomes into continuous genomic coordinates.
mergeTabFiles.py merges two tab files by their overlapping positions.
polarizeGT_in_callsTab.py polarizes the genotype data by keeping only derived alleles relative to
an outgroup/ancestral sequence.
pseudoPhasingHetero_in_callsTab.py phases the sequences by random split of heterozygous
sites.
MAFtoTAB.py transforms the MAF file to tab file. Indels are skipped.
MAF-Calls_alignment-complement.py processes the Calls-MAF aligned file to complement the
reverse complemented sequences of MAF and outputs a tab file with the coordinates of the new
genome.
MAF-TAB_reference.py transforms the MAF file to tab file with Chr Pos of both sequences. Indels
are skipped.
RefSeqGene_extract_summary.py extracts summary info from NCBI Gene annotation.
remove_Insertions_from_callsTab.py removes insertions longer than 1 bp and replaces
deletions of 1 bp marked as "*" with "-".
remove_masked_intervals_from_callsTab.py removes the masked sites from a tab file. The
masked sites are provided in a BED file.
remove_masked_intervals_fromBED.py compares a BED interval file with the BED file of masked
regions and removes them.
removeMonomorphic_in_callsTab.py removes monomorphic positions, i.e. keeps only SNPs.
select_genes_by_intervals.py extracts gene names from a bed file by provided coordinates.
select_intervals_in_callsTab.py extracts lines from a calls file according to scaffold name, start and
end positions.
selectSamples_in_callsTab.py subsamples a genotype calls file by sample names. It can also be
used to rearrange samples in a calls file.
slidingWindowSNPs.py cuts genotype calls file with the given window size and outputs FASTA files
for every window.
split_calls_by_chromosomes.py splits a calls file into several files by chromosomes.
summarySIFT.awk summarizes the extracted SIFT4G annotation (output
of extractSIFT4Gannotation.py).
summarizeTAB.awk summarizes the genotype file by counting homozygous, heterozygous, missing,
etc.
vcf_to_SIFT4G.py converts a VCF file to SIFT4G input.
vcfTab_to_callsTab.py converts the two-character coded table produced with VariantsToTable
(GATK) to the one-character coded genotype table (calls format).
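
To give a feel for what the filtering scripts in this list do with a one-character coded tab file, here is a minimal Python sketch of two of the ideas: dropping sites with too much missing data and dropping monomorphic sites. It is not the code of filterByNs_callsTab.py or removeMonomorphic_in_callsTab.py. The function names, the max_missing parameter and the choice to ignore the REF column are assumptions made for illustration, and the demo rows are a shortened, hypothetical four-sample table.

```python
# Heterozygous IUPAC codes expanded to their two alleles.
HET = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def alleles(calls):
    """Return the set of alleles seen among the non-missing calls of one site."""
    found = set()
    for call in calls:
        if call == "N":
            continue  # missing data carries no allele information
        found.update(HET.get(call, call))
    return found


def filter_sites(lines, max_missing=0.5):
    """Yield the header and every site that passes both filters.

    A site is dropped if its fraction of N calls exceeds max_missing,
    or if fewer than two alleles are observed among the samples.
    """
    lines = iter(lines)
    yield next(lines).rstrip("\n")  # header line is passed through
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        calls = fields[3:]  # columns after CHROM, POS, REF
        if calls.count("N") / len(calls) > max_missing:
            continue
        if len(alleles(calls)) < 2:
            continue
        yield "\t".join(fields)


rows = [
    "CHROM\tPOS\tREF\tsample1\tsample2\tsample3\tsample4",
    "chr_1\t1\tA\tW\tN\tN\tN",
    "chr_1\t2\tC\tY\tY\tN\tC",
    "chr_1\t4\tT\tT\tT\tN\tT",
]
for kept in filter_sites(rows, max_missing=0.5):
    print(kept)
```

With these rows, only the header and the chr_1 2 site are kept: the first site has too many Ns and the last one shows a single allele. The real scripts document their own input, output and options in their headers and via --help.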
