SNP Discovery from Single and Multiplex Genome Assemblies of Non-model Organisms

i. Abstract: Population genetic studies of non-model organisms often rely on initial ascertainment of genetic markers from a single individual or a small pool of individuals. This initial screening has been a significant barrier to beginning population studies on non-model organisms (1, 2). As genomic data become increasingly available for non-model species, SNP ascertainment from across the genome can be performed directly from published genome contigs and short-read archive data. Alternatively, low to medium genome coverage from shotgun NGS library sequencing of single or pooled samples, or from reduced-representation libraries (e.g., capture enrichment; see Ref. 3), can produce sufficient new data for SNP discovery with limited investment. We describe protocols for assembly of short-read data to reference or related-species genome contig sequences, followed by SNP discovery and filtering to obtain an optimal set of SNPs for population genotyping using a variety of downstream high-throughput genotyping methods.


Introduction
Despite the increasing ease, decreasing costs, and wide variety of methods for sequencing complete genomes (4), there is a continuing need for finding and genotyping specified polymorphic sites, for projects ranging from whole-genome association studies (5) to population genetics (e.g., 6 and references therein), phylogenetics (7, 8), and identification of functional variants (9, 10). Once appropriate SNPs are identified, genotypes from hundreds to thousands of individuals can be obtained through cost-effective directed genotyping methods that can make use of small amounts of and/or degraded DNA (6, 11).
Prior to highly parallel "next-generation" sequencing (NGS) methods, Sanger sequencing of individual loci was a limiting step, especially when genomic data were not available for PCR primer design (1, 2). With NGS technologies, generation of targeted-locus (3, 12, 13) or whole-genome sequence data is now relatively routine, yielding large amounts of genomic data that can be screened for sites that are variable in a single genome (14) or among individuals (3, 15, 16). A widely used method for both detecting novel SNPs and obtaining genotypes directly from reduced-representation genomic libraries is the RADseq method (17). This method typically uses sample sets for detecting thousands to tens of thousands of genomic SNPs from larger numbers of individuals (where high-quality DNA is available from population samples; e.g., see Ref. 18). Software and methods for SNP detection and genotyping-by-sequencing (GBS) have been described previously (e.g., 19, 20), so we will not discuss these methods further.
In this chapter, we will focus on the bioinformatics necessary to screen for SNPs from single-individual genome sequences representative of a species of interest, and from low-coverage genome sequences from one or more individuals combined with a reference genome of the same or a related species. There are several large consortium genome sequencing projects producing genomic data at an ever-increasing rate (e.g., 21, 22), providing raw data for SNP discovery. Even when a complete, high-coverage genome assembly for a species of interest is not available, generating low-coverage genome data and aligning to a related species can provide relatively quick and inexpensive access to thousands of SNPs in a target species.
The methods in this chapter assume availability of a Unix or Linux computing system with sufficient capacity (memory and storage) to handle gigabyte-scale data sets. The primary software packages are all freely available, though some additional commercial software is recommended (but not necessary) for some analyses.

Materials
The methods in this chapter are based on the versions of software listed in Table 1. There is no single public software package that performs all of the steps described below, so we have used multiple freely available programs for data processing. We also describe several scripts written in different languages (e.g., Python, R, Perl), so the interpreters for these languages are necessary. Software can change, so it is important to use the specified versions, or to test newer versions to make sure that the commands perform the same functions and the options remain as stated. The best way to do this (and to familiarize yourself with the software) is to read the help documents or guidelines for each software package and the specified commands.
The current versions of the scripts described below are available through GitHub at https://github.com/PAMorin/SNPdiscovery. Unless otherwise stated, software was run under the Red Hat Enterprise 6 (64-bit) Linux Server operating system. Commands and software should function equivalently on other Unix or Linux systems, but have not been tested on all of them.

Methods
This chapter assumes that short-read sequence data (typically 50-150 bp, though reads up to approximately 500 bp are produced by some sequencing platforms) are available in one of the forms below, and that publicly available data have been quality-checked to remove adapter sequences and poor-quality reads. However, we outline adapter-trimming steps in Section 3.6 for raw, unprocessed shotgun sequence data (random sequence fragments from genomic DNA libraries) generated de novo for SNP discovery. For our purposes, we will consider two types of data: publicly available genome data that have already been assembled into a draft genome, and shotgun sequence data. For the latter, the assembled genome scaffold of a related species can be used to assist assembly; in the absence of a reference genome, de novo assembly can be used to generate contigs to serve as the reference, though in most cases this will result in shorter contigs and potentially lower-quality assemblies (23).
All Linux command lines are shown in Courier font and are prefaced by a comment line beginning with "#" that describes the function of the subsequent command. This is to facilitate copy/paste of the commands into a text document or directly into the Linux interface.

Publicly available genome assembly and short read archive (SRA) data.
The benefits of a reference genome assembly include longer contigs (or scaffolds), quality-checked assemblies, repeat masking, and in some cases annotation and chromosome assignment. These can all facilitate high-quality SNP discovery and selection based on additional knowledge of genome position and association with known genes or chromosomes. The NCBI web site is a good place to search for reference genomes (http://www.ncbi.nlm.nih.gov/genome/browse/), but the organization can be confusing and there are important issues to be aware of. As an example, browsing for the genus "Orcinus" results in one record for the genome of Orcinus orca, the killer whale. This is a member of the family Delphinidae and could be used as the reference for other species in the family, and even in other families of cetaceans, though potentially at the loss of coverage and with some bias in genome region coverage due to divergence. The genome overview page for Orcinus orca (www.ncbi.nlm.nih.gov/genome/14034) summarizes the project information, publications, summaries, and links to the mitochondrial and nuclear genome information (under "Replicon Region"), and Genome Region, with links to the assembled scaffolds. We recommend starting under the Summary and reviewing the BioProjects links to see the projects contributing to the killer whale genome. These may include multiple biosamples (different animals) and data types. Every project is different, so these need to be reviewed to determine which project(s) has the appropriate assembled contigs/scaffolds and SRA data for re-assembly and SNP discovery. For this example there are two BioProject IDs, and the project data for them are shown in Fig. 1a, b.
Both BioProjects are based on the same single biosample and link to the same assembly details, but only one links to the SRA experiments (the sequence read data). If there are multiple biosamples, it is important to determine which SRA files are from each individual if the goal is to use either pooled-sample data or just a single individual for SNP discovery.
On the BioProject description web page, the Assembly link leads to the project summary.
All of the assembly files can be obtained via FTP through the NCBI site: ftp://ftp.ncbi.nlm.nih.gov/genomes/. After connecting as a guest, you can navigate to the species directory and view the README file that describes each of the directories and files within. For our purposes, the important files are in the directory CHR_Un, which contains the "Chromosome directories", or "concatenated sequence data for scaffolds that have been assembled from individual GenBank records", in a variety of formats. The files ending in ".fa.gz" are contigs in compressed FASTA format, and the ones ending in ".mfa.gz" are repeat-masked contigs in compressed FASTA format. These can be downloaded by a variety of methods; probably the most useful in a Linux environment is "wget", for example:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Orcinus_orca/CHR_Un/oor_ref_Oorc_1.1_chrUn.mfa.gz
File transfer may take several hours, depending on the size of the file and the speed of the connection. It is usually not advised to map reads to a repeat-masked genome, as reads neighboring the masked repeats may not map, thereby reducing coverage around the repeats. However, if you have a reference genome with reasonably high coverage, the loss will be minimal relative to the amount of high-coverage genome data, and using the repeat-masked reference sequence will reduce the need for filtering out SNPs in repeats later.
The next step is to identify and download the short-read archive data.
Clicking on the SRA link in the "Project Data" table on the BioProject page (e.g., "7" under "Number of Links" in the killer whale example) produces a list of the SRA experiments, each of which may have ≥1 sequence run files. You can look through these on the web site to identify the type of sequence (e.g., Illumina HiSeq 2000, paired-end) and the size of the individual files. Once you know which files you want to use, you can download individual files from the SRA portal http://sra.dnanexus.com. At the site, you can search for your species, select the appropriate study (if there is more than one), and click the "related runs" button to show all of the sequencing runs associated with that project. For this example, the project with accession SRP015826 has 7 experiments with 16 runs total, representing the 16 short-read archive files. The files can be selected by checking the boxes of those you want and using the "Download" button. This generates a web page with download buttons for individual files, or you can select files and click the button for "download SRA URLs" to download a text file containing all of the URL links, to facilitate command-line transfer of the files using the "wget" command in a Linux environment. Example SRA URLs file (download_sra_urls.txt) for the Orcinus orca project:
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR574/SRR574967/SRR574967.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR574/SRR574968/SRR574968.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR574/SRR574969/SRR574969.sra
(etc.)
# Transfer each of the files using "wget" and the URLs in the
# download_sra_urls.txt document:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR574/SRR574967/SRR574967.sra
These files are often very large, and the combined files can be several hundred GB, so the transfer may take a long time and require a lot of storage space on your system. It is important to determine how much storage space will be needed prior to starting, to be sure you have sufficient capacity for the raw files and the assemblies. Although this is difficult to estimate, and depends on whether or not you delete files that are no longer needed (e.g., delete SAM files after converting to BAM format), you should expect to use about 20 times as much disk space as the size of your decompressed short-read archive FASTQ file(s).
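Rather than issuing a separate wget command for each URL, the whole list can be fetched in one call with wget's -i option, which reads URLs from a file; a minimal sketch:
# Download every SRA file listed in download_sra_urls.txt;
# -c resumes partially completed files if the transfer is interrupted.
wget -c -i download_sra_urls.txt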
SRA files are compressed binary files and need to be expanded into FASTQ files prior to use in assembly. This requires the fastq-dump function from the SRAtoolkit software. File sizes will expand substantially, typically to 4-5× the original SRA file size.
# Convert each SRA file into FASTQ files.
fastq-dump SRR574967.sra --split-3
The "--split-3" option indicates that the file contains paired-end reads and will be split into two FASTQ files, one for read 1 and a second for read 2 (e.g., in this case, SRR574967_1.fastq and SRR574967_2.fastq). If the file contains only single-read data, only a single file will be generated, so it is recommended that this option always be used to ensure data are correctly parsed into files.
When only one SRA file is available on the SRA portal for a high-coverage genome, the resulting FASTQ file may be too large to be processed. fastq-splitter can be used to divide the large file into several smaller files that are easier to handle. After downloading the program, you need to make sure that the perl script is set as an executable.
# Set the perl script to be executable.
chmod +x fastq-splitter.pl
# Split the file into 6 parts for both the forward and reverse reads.
./fastq-splitter.pl --n-parts 6 --check SRAfileF.fastq
./fastq-splitter.pl --n-parts 6 --check SRAfileR.fastq
# The path to the perl script (fastq-splitter.pl) should be given.
# --n-parts indicates into how many files the large SRA file
#   should be divided (6 in this example).
# --check does some verification of FASTQ format correctness.

Creating a de novo set of reference contigs:
A draft genome sequence is not necessary to align sequence reads for SNP discovery, but some reference is necessary. In the absence of a reference genome, a set of reference contigs can be created by de novo assembly of short-read data. The details of de novo assembly are not the focus of this chapter, but there are commercially available programs (e.g., CLC Genomics Workbench (CLCbio), Geneious (BioMatters Limited)) and freely available programs such as SOAPdenovo2 (24), DISCOVAR de novo (Broad Institute), and SGA (25), some of which are optimized for different operating systems or types of data (e.g., see the list at http://omictools.com/genome-assembly-category). The goal of using these packages is to generate high-quality contigs and then use them as the reference sequences for re-aligning reads for SNP discovery. Once a set of de novo assembled contigs has been produced (in FASTA format), they can be treated the same as a reference genome sequence for read alignment and SNP discovery (see Ref. 23).

Alignment of reads to reference contigs:
Alignment of reads to reference contigs is fairly straightforward, but requires multiple steps that can be confusing. There are multiple alignment programs, but one of the most frequently used and easiest to use is the Burrows-Wheeler Aligner (BWA; 26).
When there are multiple smaller FASTQ files for a single individual, they can be concatenated into a single file for each read direction using the Linux "cat" function:
# Concatenate forward read files.
cat filename1_read1.fastq filename2_read1.fastq > forward_reads.fastq
However, when the files are large (e.g., >10 GB), running separate alignments of each FASTQ file (or pair of files for paired-end data) will improve speed by allowing parallel processes to run on separate cores (see Note 1). The resulting alignments can be combined after conversion to the smaller BAM files (see below).
Although there are many options that can be altered to vary alignment parameters, the updated "BWA-MEM" algorithm (27), implemented in BWA, has been optimized for aligning short reads to a large genome such as human without the need to specify alignment parameters, and performs as well as or better than many other aligners with short-read (100 bp) and longer read-length data (27).
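The bwa mem invocation itself is not shown in the commands reproduced in this chapter; a minimal sketch of a paired-end BWA-MEM run might look like the following (file names are placeholders; -t sets the number of threads):
# Align paired-end reads to the indexed reference with BWA-MEM,
# writing the alignments in SAM format.
bwa mem -t 4 oor_ref_Oorc_1.1_chrUn.mfa forward_reads.fastq reverse_reads.fastq > aln_pe1.sam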
Mitochondrial DNA can make up a significant portion of sequence reads, so removal of reads that map to the mitogenome can reduce file sizes and speed up subsequent analyses.
A reference mitogenome FASTA file can typically be downloaded from GenBank for the target species or a closely related species, or the mitogenome sequence can often be assembled de novo from the NGS reads (e.g., 28). Map the reads to the mitogenome with BWA, convert the SAM file to a BAM file, and sort with SAMtools. The BAM file will also retain the reads that did not map to the mitogenome reference, i.e., the reads from the nuclear genome that we are interested in. These unmapped reads can be extracted to a new BAM file and then converted to a FASTQ file. The output SAM files are typically very large, and can be reduced to the binary form, or BAM file format, for subsequent analyses; the SAM files can then be deleted, as they are no longer needed. If there are several BAM files for an individual (e.g., aln-se1_sorted_unique.bam, aln-se2_sorted_unique.bam, etc.), they need to be merged into one file (individual1.bam).
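A minimal sketch of this filtering step, assuming a mitogenome reference named mito.fa that has been indexed with 'bwa index' (the -f 4 flag keeps reads whose "unmapped" flag is set; newer SAMtools versions name the bam2fq command 'samtools fastq'):
# Map reads to the mitogenome.
bwa mem -t 4 mito.fa reads_1.fastq reads_2.fastq > mito_aln.sam
# Keep only reads that did NOT map to the mitogenome (-f 4),
# sorted by read name so that pairs remain adjacent.
samtools view -bS -f 4 mito_aln.sam | samtools sort -n - nuclear_reads
# Convert the unmapped (nuclear) reads back to FASTQ for re-alignment.
samtools bam2fq nuclear_reads.bam > nuclear_reads.fastq
# Merge multiple per-run BAM files for one individual (if needed).
samtools merge individual1.bam aln-se1_sorted_unique.bam aln-se2_sorted_unique.bam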
At this stage, the sorted BAM file contains both mapped and unmapped sequencing reads. You may therefore want to remove the unmapped sequencing reads to save disk storage space and improve the efficiency of your downstream analysis. Beware, however, that the unmapped reads may contain useful data, such as reads associated with the Y chromosome if you mapped your reads to a female reference, or microbiome data, etc.
# Keep reads that have mapped to the reference genome with a
#   mapping quality (MAPQ) ≥30; -F4 excludes unmapped reads.
samtools view -bh -F4 -q 30 individual1.bam > individual1_q30.bam
If the sequence reads have been aligned to a related species rather than to reference contigs from the same species, a new target-species reference needs to be generated from the merged alignment file, and the alignment process started over (see the consensus-extraction and SEQTK commands below). This results in the reads being mapped to contigs of the same species, so that inferred SNPs are due to intraspecific variation rather than to differences between the reference and target species sequences.

Pipeline 1: SNP discovery from high-coverage genome assembly.
SNP discovery from alignment files (BAM) with medium to high average coverage (>10×) uses depth and quality criteria, and extracts SNPs with flanking sequence data for assay design (e.g., for AmpliSeq, GTseq, TaqMan, or capture-enrichment GBS). We describe a simple SNP discovery pipeline that uses the variant-finder function of BCFtools (29) to scan an alignment (BAM file) for SNPs that meet specified criteria. The minimum and maximum coverage criteria depend on the average coverage of the genome alignment. When coverage is moderate (i.e., 10-30×), we set the minimum depth of coverage for a SNP at 10 reads and a maximum depth of 100, in order to ignore potentially collapsed duplicated sequence in the assembly. However, when the average depth of an alignment is high (i.e., >30×), the criteria should be changed to span from one-third to twice the average coverage (see "Estimate coverage" in Section 3.8); for example, for a 45× alignment, the minimum would be set to ~15 and the maximum to ~90.
The BAM file to be used for SNP discovery needs to be indexed for SAMtools.
# Index BAM file for SAMtools.

samtools index individual1_q30.bam
The SNP discovery pipeline is managed through a shell script (see Appendix 1) that performs several functions. The shell script is launched by executing the script in a directory that must also contain the reference FASTA file (previously indexed with samtools faidx), the sorted alignment BAM file, and the SNP discovery Python script "Script2_generate_genotype_blocks.py" (see Appendix 2). The shell script takes as input the sorted alignment BAM file, the reference FASTA file, and the total desired length of flanking sequence around each SNP (by default 300 bp). The shell script takes these objects in order, assigning them to aliases that are used to fill the appropriate filenames into a series of commands. Execution of the shell script first runs samtools mpileup to compile the read bases, quality, and coverage information at each site in the alignment, then calls SNPs from the pileup using BCFtools, applying filters to exclude loci with depth of coverage below or above specified levels (set within the shell script; see Appendix 1). The custom Python script "Script2_generate_genotype_blocks.py" further filters the SNPs based on user-defined quality cut-off and flanking sequence length parameters, and outputs two results files (see Note 2).
# 'output' represents the file name that will be assigned to the
#   output files.
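The invocation itself is not reproduced in the text; based on the argument order described above, it would look approximately like the following (a sketch only; the exact argument order should be checked against Appendix 1):
# Run the SNP discovery shell script with the sorted BAM file, the
# reference FASTA file, the total flanking sequence length, and the
# output file name.
./Script1_SNP_call_filter.sh individual1_q30.bam reference.fasta 300 output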
The output files from this script are a FASTA file containing the designated target sequences surrounding each SNP, and a genotype VCF file. The VCF file format is described in detail elsewhere (https://github.com/samtools/hts-specs). This file contains information about each SNP, starting with the contig ID ("chrom"), the SNP position, the reference and alternate alleles, a quality score, and a list of additional "info" about the SNP, including the depth of coverage (DP), various quality measures, the counts of forward and reverse reads for each allele used in variant calling (DP4), and the consensus quality (FQ) (see Fig. 2). The total number of reads in DP4 may be lower than DP because low-quality bases are not counted. Given one sample in the alignment, the consensus quality (FQ) is typically positive for heterozygous loci and negative for homozygous loci (see more details at http://samtools.sourceforge.net/mpileup.shtml). Although it seems counterintuitive that there would be homozygous SNPs, they may be frequent in some data sets, especially if the reference sequence is different from the reads mapped to it (a different species, subspecies, or even just a different individual fixed for the alternate allele), and when reads are mapped incorrectly to the reference. A negative FQ value may also indicate that one of the alleles is rare.
The overall depth and the count of each base at the SNP site can be useful for further filtering of SNPs based on the number of nucleotides at the SNP position and their frequency. For example, if the SNPs are from a single individual, you would expect an approximately even distribution of the two alleles at heterozygous SNPs. A SNP with DP=20 and DP4=0,0,11,0 (no. of forward ref alleles, reverse ref alleles, forward non-ref alleles, reverse non-ref alleles) would indicate that, even though there is a total depth of 20, only 11 reads met the quality threshold, and all of them were forward reads of the non-reference allele, so this is not likely to be a high-quality candidate SNP. A SNP with DP=20 and DP4=7,0,13,0 would have close to the expected 50% of each of the two alleles, with all 20 reads passing the specified quality filters.
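As an illustration (not part of the published scripts), SNPs with strongly unbalanced DP4 counts could be flagged directly from the VCF with a short awk command; the input file name and the 0.2-0.8 allele-balance window are placeholders:
# Keep VCF records in which the non-reference allele fraction
# (alt reads / total DP4 reads) falls between 0.2 and 0.8.
awk 'BEGIN{FS="\t"} /^#/ {print; next} {
  if (match($8, /DP4=[0-9]+,[0-9]+,[0-9]+,[0-9]+/)) {
    split(substr($8, RSTART+4, RLENGTH-4), d, ",");
    tot = d[1]+d[2]+d[3]+d[4]; alt = d[3]+d[4];
    if (tot > 0 && alt/tot >= 0.2 && alt/tot <= 0.8) print;
  }
}' output.vcf > balanced_SNPs.vcf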

Pipeline 2: SNP discovery from low-coverage multiplex shotgun data
SNPs can be called from low-coverage shotgun sequencing data (<1×) from multiple samples using genotype likelihoods, thereby taking uncertainty in base-calling into account (30). This section describes read sequence data processing, genotype likelihood estimation, and a filtering pipeline for SNP discovery using this approach (following 31).
Data requirement: FASTQ-formatted sequence read data from multiple samples. As an example, one lane of Illumina HiSeq or NextSeq sequencing of around 30 individuals (genome size of 2.5 Gb) would be suitable for genome-wide SNP discovery (31). In the latter study, the mean sequencing read depth per site, considering all samples together, was approximately 5×, but the majority of sites had sequencing reads from just a small proportion of the individuals sampled, each sequenced at <1× coverage. We describe data processing for de novo generated read data (i.e., generated by the investigator specifically for SNP discovery) and unmasked reference sequences, followed by SNP calling that can be applied either to the assembled de novo sequence data or to SRA data assembled to a masked reference genome from GenBank.

Sequence data processing
This section assumes that single- or paired-end read data have been de-multiplexed to separate reads by the index sequences of each sample in the multiplex pool. SRA files downloaded from GenBank have presumably been processed to remove adapters and filtered for quality. For data generated specifically for SNP discovery, these steps need to be done prior to mapping reads to the reference genome. For each FASTQ file, AdapterRemoval (or another program, e.g., Trimmomatic; 32) can be used to remove adapter sequences from the sequencing reads and to remove sequence reads that are ≤30 bp (or another length threshold) following trimming. Reads in FASTQ format must then be mapped to the reference genome of the studied species or a related species, then converted to a sorted BAM file following the same steps outlined above. The reference can be repeat hard-masked, or repeats can be masked in the BAM file following mapping.
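A minimal adapter-trimming sketch using AdapterRemoval v2, assuming paired-end FASTQ input (the 30 bp minimum length matches the threshold mentioned above; adapter sequences are auto-detected by default):
# Trim adapters and low-quality bases, and discard reads shorter
# than 30 bp after trimming.
AdapterRemoval --file1 reads_1.fastq --file2 reads_2.fastq --trimns --trimqualities --minlength 30 --basename sample1_trimmed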
To hard-mask the reference FASTA file to which reads are being mapped, we use RepeatMasker. Repeat libraries must also be obtained from the RepBase database, which is available through the GIRI web site (http://www.girinst.org/repbase/index.html). Note that this requires opening an account; however, there is currently no charge for academic, non-profit users. The output files include a masked copy of the input file, which contains the query sequences with identified repeats and low-complexity sequences masked and replaced with 'N's. A map file is also generated, in which coordinates are given in terms of contig, start, and end position; this can then be used as a .bed file (see below). The advantage of this approach is that reads can be mapped to the unmasked reference and then removed, which increases coverage in regions adjacent to repetitive elements.

SNP calling using genotype likelihoods for low-coverage alignments
SNP calling is performed using genotype likelihoods, which take uncertainty into account and can be estimated in ANGSD (33) from multiple samples, as recommended by Nielsen et al. (30). Uncertainty in SNP discovery in low-coverage data can be the result of sequencing, base-calling, mapping, and alignment errors. Briefly, genotype likelihoods are calculated using probabilistic methods and take the quality scores of the sequencing reads into account (30). A base is considered a SNP if the minor allele frequency is significantly different from 0, as inferred from a likelihood ratio test (34). The threshold is set with the '-SNP_pval' option, and P < 0.000001 is considered conservative. The genotype likelihood is calculated using the SAMtools method with the '-GL 1' option (29, 33), the major and minor alleles are inferred with '-doMajorMinor 1', and major and minor allele frequencies are estimated using the EM algorithm with '-doMaf 2' (34, 35) (see Note 3).
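Putting the options named above together, the ANGSD call would look approximately like the following (a hedged reconstruction; option spellings should be checked against the ANGSD documentation for the version listed in Table 1):
# Infer SNPs from the alignments listed in "data.bamlist"
# using genotype likelihoods in ANGSD.
angsd -bam data.bamlist -GL 1 -doMajorMinor 1 -doMaf 2 -SNP_pval 1e-6 -nInd 30 -minMapQ 30 -minQ 30 -out genolike_data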
Running ANGSD produces a compressed .mafs output file: genolike_data.mafs.gz. The file must be decompressed (e.g., with gunzip) prior to use in the filtering steps below.

SNP filtering
A filtering pipeline is then applied to further avoid biases linked to next-generation sequencing and low-coverage data. The different filters are:
- Filter 1: discarding SNPs in regions of poor mapping quality (MAPQ<30).
- Filters 2 & 3: removing SNPs in regions of excessive coverage and in regions covered in a high number of individuals. Filter 2 removes SNPs that are in regions with more than twice the mean coverage. As this is a rather arbitrary threshold, we advise plotting and exploring the distribution of coverage at this stage. Illumina shotgun sequencing data typically show an approximately Poisson distribution of coverage depths, so the point at which the upper bound of the depth-of-coverage distribution starts to deviate from this distribution can be a more formal way to set a threshold for regions of excessive coverage.
Filter 3 removes SNPs that are found (i.e., for which there is coverage) in a high number of individuals. The cut-off is defined by the upper tail of the distribution of the number of SNPs plotted against the number of individuals; SNPs in the upper tail of the distribution are removed. The high coverage of these SNPs may arise because they are in unmasked repeat regions, in nuclear copies of mitochondrial DNA (NUMTs), or in other mapping artifacts (e.g., paralogous loci). This filter is only recommended for ultra-low coverage data.
- Filter 4: discarding SNPs with an estimated minor allele frequency (MAF) < 0.05, as this is the smallest MAF theoretically possible with a cut-off of 10 individuals (the threshold defined earlier; it may change depending on the data analyzed). Estimation of genotype likelihoods should, however, reduce the error rate. Additionally, rare variants are important for several applications, such as inference of demographic history based on the site frequency spectrum and estimation of several population genetic parameters (36, 37); hence, depending on the downstream analyses, it may be worth keeping those SNPs.
- Filter 5: selecting only SNPs with flanking sequence of a specified length (e.g., 150 bp on either side of the target SNP) and with no N's in the flanking region, to allow design of primers and/or probes.
- Filter 6 (optional): SNP validation. This step is not necessary, but if such resources are available, the identified SNPs can be compared to existing genomic resources, such as published high-coverage genomes, to provide further confidence in the discovered SNPs and, if the resources are from different geographical areas, to identify a set of globally shared variants.
- Filter 7 (optional): removing SNPs with excess heterozygosity. This step can only be conducted meaningfully when there are sufficient samples to infer a significant deviation from the expectations of Hardy-Weinberg equilibrium (HWE), and is useful for identifying putative SNPs that are the result of gene duplications, gene families, or repetitive regions.
- Filter 8 (optional): comparing SNP locus sequences to reference databases (e.g., GenBank) to filter out loci that match non-target species, duplicated loci, gene families, or repeat regions.
The commands to run these eight filtering steps are detailed below. Three output files are produced, including data.depthSample and data.depthGlobal. In data.depthSample, each line represents one sample (i.e., one individual); column 1 is the number of sites with a depth of 0×, column 2 the number of sites with a depth of 1×, etc. In data.depthGlobal, the number of sites in each depth category is given across all samples. The mean depth across all samples can be calculated from this file (see Note 5), as sketched below. It should be noted that the sequencing depth given here is for all sites, not only SNPs.
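For illustration, the mean depth can be computed from data.depthGlobal with a single awk command, assuming one whitespace-separated line of counts in which column i holds the number of sites with depth i-1:
# Mean depth = sum(depth x number of sites) / total number of sites.
awk '{for (i = 1; i <= NF; i++) { s += (i-1) * $i; n += $i }} END { print s/n }' data.depthGlobal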
# -doQsDist 1 counts the number of bases for each quality score.
# -maxDepth 100 sets the maximum depth at 100; sites covered at
#   depth >100 are combined, which may produce a peak at 100
#   for samples with higher coverage.
# -out indicates the output file, followed by the output file name.
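The comment lines above describe options to an ANGSD quality-control command that is not itself reproduced in the text; a hedged reconstruction, based on those options and on the bam.qc output prefix named below, might be:
# Summarize base-quality-score and depth distributions across samples.
angsd -bam data.bamlist -doQsDist 1 -doDepth 1 -doCounts 1 -maxDepth 100 -minMapQ 30 -minQ 30 -out bam.qc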
This command produces four output files: bam.qc.arg, bam.qc.depthGlobal, bam.qc.depthSample, and bam.qc.qs. These files can be plotted using the R script Script10_plotQC.R (see Appendix 12). The script creates a PDF file of the plotted output and the file "bam.qc.info" with the q-score and global depth data.
As an example, we assume here that the mean coverage across all samples is 5× and that regions with coverage >10× are potentially unmasked repeat regions or mapping artifacts. Regions of poor mapping quality (MAPQ<30) and excessive coverage (>10×) are detected using the CALLABLELOCI tool in GATK. Running CALLABLELOCI requires some data formatting.
# Create a dictionary file for the initial reference using Picard
#   tools (see the CreateSequenceDictionary command below).
CallableLoci in GATK can then be run to identify the regions of poor mapping quality (MAPQ<30) and excessive coverage (>2× mean coverage).
# -mmq: the minimum mapping quality (filter 1: regions with a
#   mapping quality lower than this threshold will be considered
#   of poor mapping quality).
The data_callable.bed file contains all sites. Each line indicates the contig, scaffold, or chromosome (depending on the available information), the start and end positions of the region, and the characteristic of the region (i.e., whether the region has no coverage, low coverage, poor mapping quality, or excessive coverage, or whether the region is callable). For example, on scaffold JH472447, bases from position 601658 to 601711 are in a region of poor mapping quality, and on scaffold JH472461, bases from position 671966 to 671996 are in a region of excessive coverage.
Two similar R scripts are used to independently select the regions with poor mapping quality and those with excessive coverage.
# Source the script (Appendix 3) file using "Rscript", or
#   execute individual lines from the script in the R environment.
The "poor_mapping_quality.txt" output file looks like the following: In the first column (i.e."chromo"), the contig, scaffold or chromosome number is given (depending on the assembly level of the reference genome).The second column corresponds to the position on the scaffold, contig or chromosome.The major and minor alleles are given."unknownEM" corresponds to the minor allele frequency."pu-EM" is the p-value indicating that the minor allele frequency is significantly different from 0 (i.e. that the base is a SNP)."nInd" indicates the number of individuals for which there is coverage at each SNP.
The SNP at position 601678 on scaffold JH472447 in the .mafs file is in a region of poor mapping quality (JH472447, positions 601658-601711); the R script is used to filter this SNP out. All of the other SNPs in these example lines do not fall in any region listed in the "poor_mapping_quality.txt" filter described above, and will be kept.
# Source the "remove_poor_mapping_quality" script (Appendix 5) # file using "Rscript" or execute individual lines from the # script in the R environment.
# The output file is "SNPs_goodquality.txt" A similar script is used the remove the SNPs in regions of excessive coverage.The SNPs in those regions are likely to be in a repeat region, NUMT or in a region with some other mapping artifact.The script is run on the output file of Filter 1 (SNPs_goodquality.txt)using the "excessive_coverage.txt" filter.The R script to discard the SNPs that have a MAF less than 0.05 would be the same as for the number of individuals.The column that is used for the filtering is "unknownEM".

Applications
This approach can be applied to any species for which a reference genome, or the genome of a closely related species, exists to map the reads against. The main advantage of this method is that it results in the discovery of a large number of SNPs. As this approach does not perform simultaneous SNP discovery and genotyping, the identified SNPs can then be target-sequenced using custom-produced SNP arrays or target enrichment capture arrays for many applications in population genomics. These SNPs would be particularly useful for applications that need a large number of SNPs, such as inferences of demographic history based on the site frequency spectrum (38) or inferences of selective sweeps (39). Samples used for SNP discovery should ideally span the geographical range of the populations that will be target-sequenced, to avoid ascertainment bias (2).
SNP discovery in the absence of population data can result in some false positives, where identified SNP loci are either the result of sequencing errors or, more commonly, of assembly errors that cause duplicated loci or repeat regions to be combined into single assemblies. The effect of including false positives that are in fact monomorphic (false positives due to sequencing error) is relatively minor, typically resulting in a small portion of loci being uninformative. The effect of including false positives that fall in repeated regions can be much more significant, as highly repeated loci can heavily bias the read distribution in genotyping methods that rely on capture enrichment or amplification of groups of loci. Fig. 3 shows the distribution of reads from a set of 384 SNP loci genotyped from multiplexed amplicons (lifetechnologies.com/ampliseqcustom) of 95 DNA samples from blue whales. The first set of 384 loci was obtained from a draft blue whale genome assembly that had not been masked for repeats. Seven of the SNPs represented 90% of the read coverage, leaving coverage of the remaining loci too low for genotyping (see Fig. 3a). The SNP loci were screened for repeats using RepeatMasker as described above, and loci showing evidence of repeats were replaced with SNP loci that did not.
Genotyping of the second pool of 384 SNPs resulted in an almost 6× increase in average coverage of non-repeated loci and an even distribution of coverage across most loci (see Fig. 3b). We also recommend using BLAST to further screen for loci that either map to multiple loci or map to non-target species, e.g., bacterial or parasite DNA that could contaminate the sequence assembly.
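A sketch of such a BLAST screen using the BLAST+ command-line tools, assuming the SNP target sequences are in a FASTA file and a local copy of the nt database is available (file names and thresholds are illustrative):
# Compare SNP flanking sequences against the NCBI nt database;
# tabular output (-outfmt 6) is convenient for flagging loci with
# multiple strong hits or hits to non-target taxa.
blastn -query SNP_targets.fasta -db nt -outfmt 6 -evalue 1e-10 -max_target_seqs 5 -out SNP_targets_blast.txt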

Notes
1. If there are multiple sample files, running them sequentially through the same process can be facilitated by using shell scripts (e.g., Script1_SNP_call_filter.sh, Appendix 1). A shell script is simply a text document containing the commands that would otherwise be entered sequentially. To run the same commands on a set of sample files, the same command can be entered for each sample with only the filenames (input and output) changed, or a loop can be used (see the sketch below). The script is executed by typing './' before the script name in the directory where your script and input files reside. Permissions may need to be set appropriately to allow execution of shell scripts (see "chmod" in the Linux commands, Table 2).
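As a sketch, a loop over per-sample FASTQ files (hypothetical names) that runs the same alignment commands on each:
#!/bin/bash
# Align each sample's reads and produce a sorted BAM file,
# substituting the sample name into the input and output filenames.
for fq in sample*.fastq; do
  name=${fq%.fastq}   # strip the .fastq suffix
  bwa mem -t 4 ref.fa "$fq" | samtools view -bS - | samtools sort - "${name}_sorted"
done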
2. Within the shell script text file, the parameters for minimum and maximum depth of coverage are set by the user, based on the average depth of coverage of the BAM alignment file. The script also calls the Python script, which must be in the same directory, and the path to the Python program may also have to be specified.
3. The filenames provided in these steps are used later in the R scripts, so changing the names will require changes in the R scripts as well. This process is slow and may take days to complete. It does not typically use more than one thread, though it may increase thread use as it adds sample files. A newly published wrapper for the program ANGSD (40) may make some of these steps easier or improve visualization, but it has not yet been tested by us.
4. If the mapping quality filter is applied and results in an empty output file (i.e., SNPs with mapping quality <30 have all been previously filtered out), then the subsequent filter will produce an error. This can be circumvented by copying the first line of the "excessive_coverage.txt" file and pasting it into the empty "poor_mapping_quality.txt" file prior to running Script4a. This will simply remove one of the excessive-coverage SNPs and allow the script to proceed.
Table 2: Useful Linux commands.

top — Typing "top" shows the current processes on the system, including the percent CPU (i.e., 100% = 1 CPU) and the percent of the total memory being used by each process. Exit with the keyboard combination "control" and "c".
df -h — Shows the current file system status, including size, amount used, amount available, and percent used.
control-c — A process that is active (i.e., not run in background mode) can be cancelled with the keyboard combination "control" and "c".
kill — When a process is running in the background, it can be "killed" by using the "top" command to find the process ID (PID), then typing "kill" followed by the PID.
Rscript — Run an R script (follow by specifying the path/name of the script).

[Fragment of the Fig. 3 legend: "... was sequenced on an Ion Proton (6,198,109 aligned reads), so the Plate 2 per-locus read depth was scaled by the ratio of plate 2 to plate 1 total reads for comparison purposes."]

# Transfer reference genome file (compressed FASTA).
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Orcinus_orca/CHR_Un/oor_ref_Oorc_1.1_chrUn.mfa.gz
# Decompress the file.
gunzip oor_ref_Oorc_1.1_chrUn.mfa.gz
# Index your reference using the faidx function in SAMtools
#   to allow efficient random access to the reference bases.
#   This creates a file with the same name but ending in ".fai".
samtools faidx oor_ref_Oorc_1.1_chrUn.mfa
# Create the index files needed for BWA read alignment to the
#   reference. This creates 5 files with the same file name but
#   different suffixes.
bwa index oor_ref_Oorc_1.1_chrUn.mfa

# Convert the SAM file to a BAM file and sort the BAM file.
samtools view -@1 -T ref.fa -b aln_se1.sam | samtools sort -@1 - aln_se1_sorted
# -@ indicates the number of threads to use (default = 1). This
#   option is only available in samtools for the 'view' and
#   'sort' functions.
# -b output in the BAM format.
# -T FASTA-format reference file.
PCR duplicates should be removed using the rmdup function of SAMtools. This is very important, as PCR amplification can bias the distribution of reads across loci and alleles.
# Collapse clonal reads.
samtools rmdup -S aln-se1_sorted.bam aln-se1_sorted_unique.bam
# NOTE: there is a bug in SAMtools 1.2 that causes this command to
#   fail; both earlier and later (v1.3) versions have been reported
#   to work.

# Extract the consensus sequences of the target-species contigs
#   from the alignment.
samtools mpileup -uf ref.fa individual1.bam | bcftools call -c | vcfutils.pl vcf2fq > target_consensus.fq
# Convert the output FASTQ file to a FASTA file using SEQTK,
#   to use as the new reference sequence, and index it using
#   'bwa index' as above.
seqtk seq -a target_consensus.fq > target_consensus.fasta

# Mask repeats.
RepeatMasker genome.fa -nolow -norna -species Cetartiodactyla
# genome.fa is the input reference FASTA file.
# -nolow specifies that only interspersed repeats are masked,
#   and STRs and low-complexity regions are retained.
# -norna prevents the default masking of small RNAs.
# -species specifies the taxonomic group the query sequence
#   belongs to (in our example, the Cetartiodactyla), and then
#   checks the RepBase file for known repetitive elements for
#   this taxonomic group.

# Run BEDtools to remove repeated regions.
intersectBed -abam data.bam -b RepeatMask.bed -v | samtools view -h - | samtools view -bSh - > data_noRepeats.bam
# The BAM file from which the repeated regions should be removed
#   is given with the option -abam, and the file containing the
#   repeated regions with -b.
Follow the steps described above to download a reference genome file in FASTA format and index it for processing with SAMtools and BWA. Align and remove mitogenome reads, align the remaining reads to the reference, convert the resulting SAM file to BAM format, collapse clonal reads, merge multiple files for single samples (if necessary), and remove poor-quality alignments (MAPQ<30) as previously described (see Section 3.3 above). A "bamlist" file giving the path to the BAM file for each individual then needs to be created. It is a text file (data.bamlist) with one line for the path to each individual BAM file; if the files are in the same directory as the ongoing analysis, just the file names are listed.
# NOTE: if the files are not in the current directory, the path
#   must be specified (e.g., /path/individual1_q30.bam, etc.). See
#   the documentation for ANGSD for more details.
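If all of the filtered per-individual BAM files are in the working directory, the bamlist can be generated in a single command (a sketch assuming the _q30.bam naming used above):
# List all filtered BAM files, one per line, into the bamlist.
ls *_q30.bam > data.bamlist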

Filter 1 & 2 (
removing SNPs in regions of poor mapping quality (MAPQ<30) and SNPs that are in regions with twice the mean coverage): Commands will be the same for filter 1 (poor mapping quality) and 2 (excessive coverage), the two scripts are provided and the differences between the two filters are highlighted below.The mapping quality filter is only needed if the regions of high mapping quality (MAPQ>30) were not previously selected using samtools view (see Section 3.3 and Note 4).First, coverage is estimated in angsd using the doDepth function.This function estimates the distribution of the depth of coverage for each individual as well as across all individuals.#Estimate coverage angsd -bam data.bamlist-doDepth 1 -out data -doCounts 1 -nInd 30 -minMapQ 30 -minQ 30 # Reads with a mapping quality (minMapQ) above 30 and nucleotide # qscore (minQ) above 30 are kept.# -nInd corresponds to the number of sequenced individuals.# -doDepth 1 function in angsd also requires the -doCounts 1 # option to estimate coverage.# -out indicates the output file, followed by the output filename.
java -jar path_to/GenomeAnalysisTK.jar -T CallableLoci -R ref.fa -I data_withheader.bam --maxDepth 10 -mmq 30 -summary data_callable.summary -o data_callable.bed
# -T = name of the tool to run.
# -R = the reference genome against which the sequence data were
#   mapped. GATK requires an index file (samtools faidx) and a
#   dictionary file accompanying the reference; they need to have
#   the same stem filename, ending in .dict for the dictionary
#   file and .fai for the index file.
# -I = input file containing sequence data (BAM or CRAM).

# Filter 2: remove SNPs in regions of excessive coverage.
# Source the "remove_excessive_coverage" script (Appendix 6)
#   file using "Rscript", or execute individual lines from the
#   script in the R environment.
Rscript Script4b_Filter2_Remove_excessive_coverage.R
# This script requires the input files "SNPs_goodquality.txt"
#   and "excessive_coverage.txt".
# The output file is "Good_coverage_SNPs.txt".
Filter 3 (remove SNPs covered in an excessive number of individuals): SNPs that are covered in an excessive number of individuals are removed using an R script, to further remove SNPs in potential repeat regions or mapping artifacts. The number of individuals for which there is coverage for a given SNP is indicated in the last column of the .mafs file (see above). The cut-off is defined by the user after examining the upper tail of the distribution of the number of SNPs plotted against the number of individuals; SNPs in the upper tail of the distribution are discarded. Here, as an example, a cut-off of 10 individuals was defined.
# Source the "excessive_individuals" script (Appendix 7)
#   file using "Rscript", or execute individual lines from the
#   script in the R environment.
Rscript Script5_Filter3_excessive_individuals.R
# Requires the input file from Script4 ("Good_coverage_SNPs.txt").
# The output file is "SNP_10ind.txt".
Filter 4 (remove SNPs with a MAF (minor allele frequency) < 0.05):

# Source the "Remove_rare_SNPs" script (Appendix 8)
#   file using "Rscript", or execute individual lines from the
#   script in the R environment.
Rscript Script6_Filter4_Remove_rare_SNPs.R
# Requires the input file from Script5 ("SNP_10ind.txt").
# The output file is "SNPs_MAF_good.txt".
Filter 5 (scripts 7 and 8): select SNPs with flanking sequences of a specified length for primer/probe design. This step only selects SNPs that have at least the specified length of sequence on either side of the SNP, without N's.
# Scripts 7 and 8 (Appendix 9 and 10) must both be in the
#   target directory, along with the output of Script6
#   ("SNPs_MAF_good.txt") and the reference FASTA file.
Table 2 (continued):

chmod 755 — Changes permissions on a file so that everyone is allowed to read and execute the file, and the file owner is also allowed to write to it. Follow with the file name.
grep — Useful for extracting text. For subsetting a FASTA file from a list of loci:
grep -Fwf SNP_list.txt -A1 file.fasta | grep -v '^--$' > subset_file.fasta
# SNP_list.txt contains the list of sequence names that will be
#   found and selected from file.fasta, along with the following
#   DNA sequence line for each locus.
java -jar path_to/picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict
# The output file must have the same stem name as the reference
#   file so that it can be used by GATK based on the reference
#   file name.
If using low-coverage data, it may be advisable to merge the individual BAM files into one consensus alignment to better identify putative repetitive regions or other mapping artifacts.
# RGID: read group ID (default value: 1).
# RGLB: read group library (required).
# RGPL=illumina; can be changed for other NGS platforms.
# RGSM: read group sample name (required).
# RGPU: read group platform unit (e.g., run barcode; required).
# CREATE_INDEX: create a BAM index when writing a coordinate-sorted
#   BAM file.
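These options correspond to Picard's AddOrReplaceReadGroups tool; the invocation itself does not appear in the text, but a hedged reconstruction that would produce the data_withheader.bam file used by CallableLoci above (input name and read-group values are placeholders) is:
# Add the read-group information required by GATK.
java -jar path_to/picard.jar AddOrReplaceReadGroups I=data_noRepeats.bam O=data_withheader.bam RGID=1 RGLB=lib1 RGPL=illumina RGPU=run1 RGSM=sample1 CREATE_INDEX=true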

Table 1: Required and optional software. The listed version numbers were tested. Later versions should be compatible, but sometimes changes are made that are not compatible with the scripts and commands as we have presented them.