Package picard.sam
Class PositionBasedDownsampleSam
java.lang.Object
picard.cmdline.CommandLineProgram
picard.sam.PositionBasedDownsampleSam
Summary
Class to downsample a SAM/BAM file based on the position of the read in a flowcell. As with DownsampleSam,
all the reads with the same queryname are either kept or dropped as a unit.
Details
The downsampling is not random (and there is no random seed); it is determined by the position of each read
within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all
the edges of the tile, and uses this ellipse to decide whether to keep or drop each record. Since reads
with the same name have the same position (mates, secondary and supplementary alignments), the decision is the
same for all of them. The concern that motivates this method is "optical duplicates": downsampling
randomly can create a result that has a different optical duplicate rate, and therefore a different estimated library
size (when running MarkDuplicates). This method keeps (physically) close reads together, so that (except
for reads near the boundary of the ellipse) optical duplicates are kept or dropped as a group.
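The position-based decision described above can be illustrated with a deliberately simplified sketch. It assumes a single ellipse centered on the tile whose area is FRACTION of the tile's area, which only fits inside the tile for fractions up to about 0.785, and it ignores the edge coverage mentioned above. The class and method names are hypothetical, and this is not Picard's actual implementation.

```java
// Simplified sketch, NOT Picard's code: keep a read when its (x, y) position
// falls inside a centered ellipse whose area is `fraction` of the tile's area.
final class EllipseKeepSketch {

    /** Returns true when (x, y) lies inside the ellipse covering `fraction` of the tile. */
    static boolean keep(double x, double y, double tileWidth, double tileHeight, double fraction) {
        // Normalize the offset from the tile center by the tile size.
        double dx = (x - tileWidth / 2.0) / tileWidth;
        double dy = (y - tileHeight / 2.0) / tileHeight;
        // Semi-axes r * width and r * height with r^2 = fraction / pi give
        // an area of pi * (r * width) * (r * height) = fraction * width * height.
        return dx * dx + dy * dy <= fraction / Math.PI;
    }

    public static void main(String[] args) {
        // Reads sharing a name share a position, so they get the same decision.
        System.out.println(keep(500, 500, 1000, 1000, 0.1)); // tile center
        System.out.println(keep(10, 10, 1000, 1000, 0.1));   // near a corner
    }
}
```

Because the decision depends only on the position, two reads that sit close together (such as optical duplicates) almost always receive the same answer.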
By default the program expects the read names to have 5 or 7 fields separated by colons (:), and it takes the last two
to indicate the x and y coordinates of the read within the tile on which it was sequenced. See
ReadNameParser.DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling.
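The default read-name handling and the first traversal can be sketched as follows. This is an illustration under stated assumptions, not the code in ReadNameParser: the class and method names are hypothetical, the last three fields are assumed to be plain integers (tile, x, y), and the tile size is approximated by the largest x and y seen in each tile.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of colon-based read-name parsing and a first pass over the names.
final class ReadNameCoordinatesSketch {

    /** Returns {tile, x, y} taken from the last three of 5 or 7 colon-separated fields. */
    static int[] parse(String readName) {
        String[] fields = readName.split(":");
        if (fields.length != 5 && fields.length != 7) {
            throw new IllegalArgumentException("Expected 5 or 7 fields in read name: " + readName);
        }
        int n = fields.length;
        return new int[] {
            Integer.parseInt(fields[n - 3]), // tile
            Integer.parseInt(fields[n - 2]), // x
            Integer.parseInt(fields[n - 1])  // y
        };
    }

    /** First pass: records the largest x and y observed in each tile as {maxX, maxY}. */
    static Map<Integer, int[]> tileExtents(Iterable<String> readNames) {
        Map<Integer, int[]> extents = new HashMap<>();
        for (String name : readNames) {
            int[] c = parse(name);
            int[] max = extents.computeIfAbsent(c[0], tile -> new int[] {0, 0});
            max[0] = Math.max(max[0], c[1]);
            max[1] = Math.max(max[1], c[2]);
        }
        return extents;
    }

    public static void main(String[] args) {
        int[] c = parse("EAS139:136:FC706VJ:2:2104:15343:197393");
        System.out.println("tile=" + c[0] + " x=" + c[1] + " y=" + c[2]);
    }
}
```

A second pass would then revisit every record and use the per-tile extents to decide whether each read is kept.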
Downsampling invalidates the duplicate flag because duplicate reads before downsampling may not all remain duplicated
after downsampling. Thus, the default setting also removes the duplicate information.
Example
java -jar picard.jar PositionBasedDownsampleSam \
      I=input.bam \
      O=downsampled.bam \
      FRACTION=0.1
Caveats
- This method is technology and read-name dependent. If the read names do not have coordinate information
  embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing
  machines), this will not work properly. It has been designed to work with Illumina technology and read names.
  Consider modifying READ_NAME_REGEX in other cases.
- The code has been designed to simulate, as accurately as possible, sequencing less, not to produce an exact
  downsampled fraction (use DownsampleSam for that). In particular, since the reads may be distributed unevenly
  within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
  argument FRACTION.
- Consider running MarkDuplicates after downsampling in order to "expose" the duplicates whose representative
  has been downsampled away.
- The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
  PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam
  always places a PG record in the header of its output, and aborts when it finds such a PG record in its
  input (see ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS).
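The PG-record guard in the last caveat can be pictured with a minimal sketch that works on raw SAM header lines. It assumes the guard matches the PN (program name) tag of @PG lines; the program-name string, class, and method names below are hypothetical, and the real value lives in PG_PROGRAM_NAME.

```java
import java.util.List;

// Hypothetical sketch of the guard against repeated position-based downsampling.
final class PgGuardSketch {

    // Assumed program name; the actual constant is PG_PROGRAM_NAME in the class.
    static final String PROGRAM_NAME = "PositionBasedDownsampleSam";

    /** Returns true if any @PG header line carries PN:<PROGRAM_NAME>. */
    static boolean alreadyDownsampled(List<String> headerLines) {
        for (String line : headerLines) {
            if (!line.startsWith("@PG\t")) {
                continue; // only program records are relevant
            }
            for (String field : line.split("\t")) {
                if (field.equals("PN:" + PROGRAM_NAME)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Aborts on previously downsampled input unless the caller explicitly allows it. */
    static void checkInput(List<String> headerLines, boolean allowMultiple) {
        if (alreadyDownsampled(headerLines) && !allowMultiple) {
            throw new IllegalStateException("Input was already downsampled by " + PROGRAM_NAME);
        }
    }
}
```

The `allowMultiple` parameter plays the role that ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS appears to play on the command line.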
Field Summary
Fields
Fields inherited from class picard.cmdline.CommandLineProgram
COMPRESSION_LEVEL, CREATE_INDEX, CREATE_MD5_FILE, GA4GH_CLIENT_SECRETS, MAX_ALLOWABLE_ONE_LINE_SUMMARY_LENGTH, MAX_RECORDS_IN_RAM, QUIET, REFERENCE_SEQUENCE, referenceSequence, specialArgumentsCollection, SYNTAX_TRANSITION_URL, TMP_DIR, USE_JDK_DEFLATER, USE_JDK_INFLATER, VALIDATION_STRINGENCY, VERBOSITY
Constructor Summary
Constructors
Method Summary
Methods inherited from class picard.cmdline.CommandLineProgram
checkRInstallation, getCommandLine, getCommandLineParser, getCommandLineParserForArgs, getDefaultHeaders, getFaqLink, getMetricsFile, getPGRecord, getStandardUsagePreamble, getStandardUsagePreamble, getVersion, hasWebDocumentation, instanceMain, instanceMainWithExit, makeReferenceArgumentCollection, parseArgs, requiresReference, setDefaultHeaders, useLegacyParser
Field Details
- INPUT
- OUTPUT
- FRACTION
  @Argument(shortName="F", doc="The (approximate) fraction of reads to be kept, between 0 and 1.")
  public Double FRACTION
- REMOVE_DUPLICATE_INFORMATION
  @Argument(doc="Determines whether the duplicate tag should be reset since the downsampling requires re-marking duplicates.")
  public boolean REMOVE_DUPLICATE_INFORMATION
- READ_NAME_REGEX
  @Argument(doc="Use these regular expressions to parse read names in the input SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y coordinates are used to determine the downsample decision. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.")
  public String READ_NAME_REGEX
- STOP_AFTER
  @Argument(doc="Stop after processing N reads, mainly for debugging.", optional=true)
  public Long STOP_AFTER
- ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS
  @Argument(doc="Allow downsampling again despite this being a bad idea with possibly unexpected results.", optional=true)
  public boolean ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS
- PG_PROGRAM_NAME
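The READ_NAME_REGEX description asks for three capture groups (tile, x, y, in that order) and a match against the entire read name. The sketch below shows what such an expression could look like for an invented naming scheme; both the read-name format and the regular expression are hypothetical and are not the default used by Picard.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical custom read-name regex with three capture groups: tile, x, y.
final class ReadNameRegexSketch {

    // Invented format for illustration, e.g. "run7_t2104_x15343_y197393".
    static final Pattern NAME = Pattern.compile("[^_]+_t([0-9]+)_x([0-9]+)_y([0-9]+)");

    /** Returns {tile, x, y}; Matcher.matches() requires the whole name to match. */
    static int[] parse(String readName) {
        Matcher m = NAME.matcher(readName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Read name does not match regex: " + readName);
        }
        return new int[] {
            Integer.parseInt(m.group(1)), // tile
            Integer.parseInt(m.group(2)), // x
            Integer.parseInt(m.group(3))  // y
        };
    }

    public static void main(String[] args) {
        int[] c = parse("run7_t2104_x15343_y197393");
        System.out.println("tile=" + c[0] + " x=" + c[1] + " y=" + c[2]);
    }
}
```

Testing a candidate expression like this against a handful of real read names before passing it to the tool is a cheap way to catch a pattern that matches only part of the name.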
Constructor Details
- PositionBasedDownsampleSam
  public PositionBasedDownsampleSam()
Method Details
- customCommandLineValidation
  Description copied from class: CommandLineProgram
  Put any custom command-line validation in an override of this method. clp is initialized at this point and can be used to print usage and access argv. Any options set by the command-line parser can be validated.
  Overrides: customCommandLineValidation in class CommandLineProgram
  Returns: null if the command line is valid. If the command line is invalid, returns an array of error messages to be written to the appropriate place.
- doWork
  protected int doWork()
  Description copied from class: CommandLineProgram
  Do the work after the command line has been parsed. A RuntimeException may be thrown by this method and is reported appropriately.
  Specified by: doWork in class CommandLineProgram
  Returns: program exit status.
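The return contract of customCommandLineValidation (null when the command line is valid, otherwise an array of error messages) can be sketched in isolation. The check on FRACTION below is an assumption based on the argument's documented range of 0 to 1; the checks the class actually performs are not stated on this page, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the "null or error messages" validation contract.
final class FractionValidationSketch {

    /** Returns null when `fraction` is usable, otherwise an array of error messages. */
    static String[] validate(Double fraction) {
        List<String> errors = new ArrayList<>();
        if (fraction == null) {
            errors.add("FRACTION must be specified.");
        } else if (!(fraction >= 0.0 && fraction <= 1.0)) {
            // The negated range test also rejects NaN.
            errors.add("FRACTION must be between 0 and 1, found: " + fraction);
        }
        return errors.isEmpty() ? null : errors.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] problems = validate(1.5);
        System.out.println(problems == null ? "valid" : String.join("; ", problems));
    }
}
```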