
05/11/2018 GitHub - evodify/genotype-files-manipulations: Set of scripts to manupulate tab-delimited genotype calls files as …

Genotype calls files manipulations


A set of scripts to manipulate tab-delimited genotype calls files and to convert them to other
popular formats.

Every Python script describes its input and output data formats in the header of the file. To
see the available options, run a script with the --help option: python script.py --help

Most of these scripts require the custom Python module calls, so make sure that you also
download the file calls.py and put it in the same directory as the scripts.

Examples of a tab-delimited genotype calls file (hereafter, tab file).

Two-character coded table (e.g. produced with VariantsToTable from GATK):

CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A T/A ./. ./. A/A ./. ./. ./. ./.
chr_1 2 C T/C T/C ./. C/C C/C ./. C/C ./.
chr_1 3 C C/GCC C/C ./. C/C C/C C/C C/C C/C
chr_1 4 T T/T T/T ./. T/T T/T T/T T/T T/T
chr_2 1 A A/A A/A ./. A/A A/A A/A A/A A/A
chr_2 2 C C/C C/C ./. C/C C/C C/C C/C C/C
chr_2 3 C AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT
chr_2 4 C C/C T/T C/C C/C C/C C/C C/C C/C
chr_2 5 T T/T C/C T/T C/T T/T C/T T/T T/T
chr_3 1 G G/G ./. ./. G/G ./. ./. ./. ./.
chr_3 2 C G/C C/C ./. C/C C/C ./. C/C ./.
chr_3 3 CTT CTT/CTT CTT/C CTT/C CTT/CTT CTT/CTT CTT/CTT CTT/CTT CTT/CTT
chr_3 4 TA T/T T/T ./. T/T T/T T/T T/T T/TA
chr_3 5 G */* G/* ./. G/G G/G G/G C/C G/G

One-character coded tab file, where heterozygous genotypes are represented by the IUPAC
ambiguity codes R, Y, M, K, S, W (produced from a two-character coded table
with vcfTab_to_callsTab.py):

CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A W N N A N N N N
chr_1 2 C Y Y N C C N C N
chr_1 3 C N C N C C C C C
chr_1 4 T T T N T T T T T
chr_2 1 A A A N A A A A A
chr_2 2 C C C N C C C C C
chr_2 3 C N N N N N N N N
chr_2 4 C C T C C C C C C
chr_2 5 T T C T Y T Y T T
chr_3 1 G G N N G N N N N
chr_3 2 C S C N C C N C N
chr_3 3 N N N N N N N N N
chr_3 4 N T T N T T T T N
chr_3 5 G - N N G G G C G
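The mapping between the two example tables can be sketched in a few lines of Python. This is only an illustration inferred from the rows shown above, not the code of vcfTab_to_callsTab.py: the function names encode_genotype, encode_ref and encode_row are hypothetical, and the rules (missing data, indels and mixed deletions become N, a homozygous */* becomes -) are assumptions read off the example output.

```python
# Illustrative sketch only; names and rules are assumptions based on the
# example tables, not the actual implementation of vcfTab_to_callsTab.py.

# IUPAC ambiguity codes for the six possible heterozygous base pairs.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}
BASES = set("ACGT")


def encode_genotype(gt):
    """Collapse a two-character genotype such as 'T/A' into one character."""
    a, b = gt.split("/")
    if a == "*" and b == "*":
        return "-"  # homozygous deletion, as in the example row chr_3 5
    if a not in BASES or b not in BASES:
        return "N"  # missing call, indel allele, or deletion mixed with a base
    if a == b:
        return a  # homozygous single base
    return IUPAC_HET[frozenset((a, b))]  # heterozygous single bases


def encode_ref(ref):
    """Keep a single-base reference allele; anything longer becomes N."""
    return ref if ref in BASES else "N"


def encode_row(line):
    """Convert one data row (not the header) of a two-character coded table."""
    chrom, pos, ref, *genotypes = line.split()
    return [chrom, pos, encode_ref(ref)] + [encode_genotype(g) for g in genotypes]


row = "chr_3 5 G */* G/* ./. G/G G/G G/G C/C G/G"
print("\t".join(encode_row(row)))
```

Applied to the data rows of the two-character table, this sketch reproduces the corresponding rows of the one-character table; the header line has to be passed through unchanged.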

https://github.com/evodify/genotype-files-manipulations 1/3

addGOannotation-to-gff3.py adds GO annotation to the gff3 file.

annotate_genes_withSlidingWindowsStats.py annotates genes from a sliding windows analysis
with the stats per gene.

assessNs_in_callsTab.py calculates missing data (Ns) per position/sample and visualizes the
results.

calculateNsPerWindow.py calculates number of positions with missing data (Ns) using the sliding
window approach.

calls.py is a custom Python module. It is a dependency for most of the scripts listed here.

calls_to_ped_map.py converts a genotype calls file to ped and map files suitable for PLINK.

calls_to_treeMix_input.py outputs the allele counts file that is required as input for TreeMix.

callsToBED.py converts a tab-delimited file to a bed file.

callsToFastaPhy_RAM.py converts a genotype calls file to FASTA and PHYLIP with low RAM
consumption.

callsToFastaPhy_speed.py converts a genotype calls file to FASTA and PHYLIP quickly but consumes a
lot of RAM.

combine_overlapping_BEDintervals.py combines overlapping genetic intervals in the BED format.

Ensembl.dat-to-topGO.db.py converts the Ensembl.dat file to the GO reference file used in the
topGO R program.

extractSIFT4Gannotation.py extracts the SIFT4G annotation for a given set of samples according
to their genotypes.

FastaToPhylip.py converts FASTA to PHYLIP.

FastaToTab.py converts FASTA to a tab-delimited file with columns: Chr, Pos, REF.

filterByNs_callsTab.py removes all sites with more than a given amount of missing data
(Ns).

find_popSpecificAlleles_in_callsTab.py outputs only the alleles of one population that are unique
relative to another.

findCommonAlleles.py outputs common and rare alleles in a given set of samples.

GFFextract.py extracts various info from the gff3 file.

keep_biallelic_in_callsTab.py removes sites with more than two alleles.

make_input_MSMC_from_callsTab.py makes input for MSMC.

makeSweepFinderInput_from_callsTab.py makes an input file for SweepFinder.

make_input_stairway_plot_v1_BS.py makes input files including bootstrap replicates
for Stairway version 1.

make_input_stairway_plot_v2.py makes an input file for Stairway version 2.

merge_phased_callsTab.py merges phased sites into a two-character coded genotype file.



merge_SNP_wholeGenome_TabFiles.py merges whole genome and SNPs tab files. This is
needed because non-polymorphic sites and SNPs are filtered differently with GATK.

mergeChrPos_in_callsTab.py merges all chromosomes into continuous genomic coordinates.

mergeTabFiles.py merges two tab files by their overlapping positions.

polarizeGT_in_callsTab.py polarizes the genotype data by keeping only derived alleles relative to
an outgroup/ancestral sequence.

pseudoPhasingHetero_in_callsTab.py phases the sequences by randomly splitting heterozygous
sites.

MAFtoTAB.py converts a MAF file to a tab file. Indels are skipped.

MAF-Calls_alignment-complement.py processes the Calls-MAF aligned file to complement the
reverse-complemented sequences of the MAF and outputs a tab file with the coordinates of the new
genome.

MAF-TAB_reference.py converts a MAF file to a tab file with the Chr and Pos of both sequences. Indels
are skipped.

RefSeqGene_extract_summary.py extracts summary info from NCBI Gene annotation.

remove_Insertions_from_callsTab.py removes insertions longer than 1 bp and replaces
deletions of 1 bp marked as "*" with "-".

remove_masked_intervals_from_callsTab.py removes the masked sites from a tab file. The
masked sites are provided in a BED file.

remove_masked_intervals_fromBED.py compares a BED interval file with a BED file of masked
regions and removes the masked regions from the intervals.

removeMonomorphic_in_callsTab.py removes monomorphic positions, i.e. keeps only SNPs.

select_genes_by_intervals.py extracts gene names from a BED file by the provided coordinates.

select_intervals_in_callsTab.py extracts lines from a calls file according to scaffold name, start and
end positions.

selectSamples_in_callsTab.py subsamples a genotype calls file by sample names. It can also be
used to rearrange samples in a calls file.

slidingWindowSNPs.py splits a genotype calls file into windows of a given size and outputs a FASTA
file for every window.

split_calls_by_chromosomes.py splits a calls file into several files by chromosomes.

summarySIFT.awk summarizes the extracted SIFT4G annotation (output
of extractSIFT4Gannotation.py).

summarizeTAB.awk summarizes the genotype file by counting homozygous, heterozygous, missing
genotypes, etc.

vcf_to_SIFT4G.py converts a VCF file to SIFT4G input.

vcfTab_to_callsTab.py converts the two-character coded table produced with VariantsToTable
(GATK) to the one-character coded genotype table (calls format).
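Several of the filters listed above (filterByNs_callsTab.py, keep_biallelic_in_callsTab.py, removeMonomorphic_in_callsTab.py) reduce to simple per-site checks on the one-character coded format. The following is a minimal sketch of such checks, not the scripts' actual logic: the helper names are hypothetical, and it assumes that N marks missing data, that alleles are counted among the samples only (the REF column is ignored), and that "-" counts as an allele.

```python
# Illustrative per-site checks on one-character coded genotypes.
# Hypothetical helpers; the real scripts may define these criteria differently.

MISSING = "N"
# Alleles carried by each heterozygous IUPAC ambiguity code.
HET_ALLELES = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def missing_fraction(calls):
    """Fraction of samples with missing data (N) at one site."""
    return calls.count(MISSING) / len(calls)


def alleles(calls):
    """Set of distinct alleles observed among the called genotypes."""
    seen = set()
    for call in calls:
        if call == MISSING:
            continue
        # Heterozygous codes contribute both alleles; others contribute themselves.
        seen.update(HET_ALLELES.get(call, call))
    return seen


def is_monomorphic(calls):
    """True if at most one allele is observed (assumption: REF is ignored)."""
    return len(alleles(calls)) <= 1


def has_at_most_two_alleles(calls):
    """True if the site would survive a 'no more than two alleles' filter."""
    return len(alleles(calls)) <= 2


site = "T C T Y T Y T T".split()  # samples of row chr_2 5 in the example table
print(missing_fraction(site), sorted(alleles(site)), is_monomorphic(site))
```

A filter then amounts to reading the tab file line by line, applying one of these checks to the sample columns, and writing out only the lines that pass.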
