Each Python script describes its input and output data formats in the header of the file. To
see the available options, run a script with the --help option: python script.py --help
Most of these scripts require the custom Python module calls, so make sure that you also
download the file calls.py and put it in the same directory as your scripts.
Two-character coded table (e.g. produced with VariantsToTable from GATK):
CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A T/A ./. ./. A/A ./. ./. ./. ./.
chr_1 2 C T/C T/C ./. C/C C/C ./. C/C ./.
chr_1 3 C C/GCC C/C ./. C/C C/C C/C C/C C/C
chr_1 4 T T/T T/T ./. T/T T/T T/T T/T T/T
chr_2 1 A A/A A/A ./. A/A A/A A/A A/A A/A
chr_2 2 C C/C C/C ./. C/C C/C C/C C/C C/C
chr_2 3 C AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT AT/AT
chr_2 4 C C/C T/T C/C C/C C/C C/C C/C C/C
chr_2 5 T T/T C/C T/T C/T T/T C/T T/T T/T
chr_3 1 G G/G ./. ./. G/G ./. ./. ./. ./.
chr_3 2 C G/C C/C ./. C/C C/C ./. C/C ./.
chr_3 3 CTT CTT/CTT CTT/C CTT/C CTT/CTT CTT/CTT CTT/CTT CTT/CTT CTT/CTT
chr_3 4 TA T/T T/T ./. T/T T/T T/T T/T T/TA
chr_3 5 G */* G/* ./. G/G G/G G/G C/C G/G
One-character coded tab-delimited file in which heterozygous genotypes are represented by
the IUPAC ambiguity codes R, Y, M, K, S, W (produced from a two-character coded table
with vcfTab_to_callsTab.py):
CHROM POS REF sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8
chr_1 1 A W N N A N N N N
chr_1 2 C Y Y N C C N C N
chr_1 3 C N C N C C C C C
chr_1 4 T T T N T T T T T
chr_2 1 A A A N A A A A A
chr_2 2 C C C N C C C C C
chr_2 3 C N N N N N N N N
chr_2 4 C C T C C C C C C
chr_2 5 T T C T Y T Y T T
chr_3 1 G G N N G N N N N
chr_3 2 C S C N C C N C N
chr_3 3 N N N N N N N N N
chr_3 4 N T T N T T T T N
chr_3 5 G - N N G G G C G
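The mapping between the two tables above can be sketched in Python. This is a minimal illustration inferred only from the example rows, not the actual code of vcfTab_to_callsTab.py: the names encode_genotype and convert_line are hypothetical, and the rules (indels and missing calls become N, */* becomes -, a multi-base REF becomes N) are assumptions read off the tables.

```python
# Hypothetical sketch: rules are inferred from the example tables only.
BASES = set("ACGT")

# IUPAC ambiguity codes for heterozygous biallelic SNP genotypes.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def encode_genotype(gt):
    """Collapse a two-character genotype such as 'T/A' into one character."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        return "N"  # unexpected format is treated as missing (assumption)
    a, b = alleles
    if a == "*" and b == "*":
        return "-"  # both alleles deleted, as in the example table
    if a not in BASES or b not in BASES:
        return "N"  # missing calls, indels and partial deletions
    return a if a == b else IUPAC[frozenset((a, b))]


def convert_line(line):
    """Convert one data row: CHROM POS REF genotype1 genotype2 ..."""
    chrom, pos, ref, *genotypes = line.split()
    ref = ref if ref in BASES else "N"  # multi-base REF becomes N
    return "\t".join([chrom, pos, ref] + [encode_genotype(g) for g in genotypes])


if __name__ == "__main__":
    print(convert_line("chr_1 2 C T/C T/C ./. C/C C/C ./. C/C ./."))
```

Applied to the rows of the two-character table, convert_line yields the corresponding rows of the one-character table; the header line is not handled here and would be passed through unchanged.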
https://github.com/evodify/genotype-files-manipulations 1/3
05/11/2018 GitHub - evodify/genotype-files-manipulations: Set of scripts to manupulate tab-delimited genotype calls files as …
assessNs_in_callsTab.py calculates missing data (Ns) per position/sample and visualizes the
results.
calculateNsPerWindow.py calculates the number of positions with missing data (Ns) using a
sliding-window approach.
calls.py is a custom Python module. It is a dependency of most of the scripts listed here.
calls_to_ped_map.py converts a genotype calls file to PED and MAP files suitable for PLINK.
calls_to_treeMix_input.py outputs the allele counts file that is required as input for TreeMix.
callsToFastaPhy_RAM.py converts a genotype calls file to FASTA and PHYLIP with low RAM
consumption.
callsToFastaPhy_speed.py converts a genotype calls file to FASTA and PHYLIP quickly but
consumes a lot of RAM.
Ensembl.dat-to-topGO.db.py converts the Ensembl.dat file to the GO reference file used in the
topGO R program.
extractSIFT4Gannotation.py extracts the SIFT4G annotation for a given set of samples according
to their genotypes.
FastaToTab.py converts FASTA to a tab-delimited file with columns: Chr, Pos, REF.
filterByNs_callsTab.py removes all sites that contain more than a given amount of missing data
(Ns).
polarizeGT_in_callsTab.py polarizes the genotype data by keeping only derived alleles relative to
an outgroup/ancestral sequence.
MAFtoTAB.py transforms a MAF file into a tab file. Indels are skipped.
MAF-TAB_reference.py transforms a MAF file into a tab file with Chr and Pos of both sequences.
Indels are skipped.
remove_masked_intervals_fromBED.py compares a BED interval file with a BED file of masked
regions and removes the masked regions from it.
select_intervals_in_callsTab.py extracts lines from a calls file according to scaffold name, start and
end positions.
slidingWindowSNPs.py splits a genotype calls file into windows of the given size and outputs a
FASTA file for every window.
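To illustrate what the missing-data scripts in this list compute, here is a minimal Python sketch that counts Ns per sample and per site in a one-character coded calls table and then drops sites above a missing-data threshold. It is not the code of assessNs_in_callsTab.py or filterByNs_callsTab.py: all function names are hypothetical, the threshold is assumed to be a fraction of samples, and the data are the first four samples of a few rows from the example table.

```python
# Hypothetical sketch: not the repository's implementation.
CALLS = """\
CHROM POS REF sample1 sample2 sample3 sample4
chr_1 1 A W N N A
chr_1 2 C Y Y N C
chr_1 4 T T T N T
chr_2 4 C C T C C
"""


def parse_calls(text):
    """Return (sample names, list of (chrom, pos, genotypes)) from a calls table."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    samples = rows[0][3:]  # columns after CHROM, POS, REF
    sites = [(f[0], int(f[1]), f[3:]) for f in rows[1:]]
    return samples, sites


def ns_per_sample(samples, sites):
    """Count missing genotypes (N) for every sample."""
    counts = dict.fromkeys(samples, 0)
    for _, _, genotypes in sites:
        for sample, gt in zip(samples, genotypes):
            if gt == "N":
                counts[sample] += 1
    return counts


def ns_fraction(genotypes):
    """Fraction of samples with a missing genotype at one site."""
    return genotypes.count("N") / len(genotypes)


def filter_by_ns(sites, max_missing):
    """Keep sites whose missing fraction does not exceed max_missing (assumed unit)."""
    return [site for site in sites if ns_fraction(site[2]) <= max_missing]


if __name__ == "__main__":
    samples, sites = parse_calls(CALLS)
    print(ns_per_sample(samples, sites))
    print([(chrom, pos) for chrom, pos, _ in filter_by_ns(sites, 0.25)])
```

With a threshold of 0.25, the site chr_1 1 (two of four samples missing) is removed and the other three sites are kept.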