Ticket #41437: bedtools.1

File bedtools.1, 161.8 KB (added by arjanvandervelde (Arjan van der Velde), 11 years ago)
Line 
1.\" Man page generated from reStructuredText.
2.
3.TH "BEDTOOLS" "1" "November 17, 2013" "2.16.2" "bedtools"
4.SH NAME
5bedtools \- Bedtools Documentation
6.
7.nr rst2man-indent-level 0
8.
9.de1 rstReportMargin
10\\$1 \\n[an-margin]
11level \\n[rst2man-indent-level]
12level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
13-
14\\n[rst2man-indent0]
15\\n[rst2man-indent1]
16\\n[rst2man-indent2]
17..
18.de1 INDENT
19.\" .rstReportMargin pre:
20. RS \\$1
21. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
22. nr rst2man-indent-level +1
23.\" .rstReportMargin post:
24..
25.de UNINDENT
26. RE
27.\" indent \\n[an-margin]
28.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
29.nr rst2man-indent-level -1
30.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
31.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
32..
33.
34.nr rst2man-indent-level 0
35.
36.de1 rstReportMargin
37\\$1 \\n[an-margin]
38level \\n[rst2man-indent-level]
39level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
40-
41\\n[rst2man-indent0]
42\\n[rst2man-indent1]
43\\n[rst2man-indent2]
44..
45.de1 INDENT
46.\" .rstReportMargin pre:
47. RS \\$1
48. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
49. nr rst2man-indent-level +1
50.\" .rstReportMargin post:
51..
52.de UNINDENT
53. RE
54.\" indent \\n[an-margin]
55.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
56.nr rst2man-indent-level -1
57.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
58.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
59..
60.sp
61Brief paragraph of the software.
62.SH OVERVIEW
63.SS 1.1 Background
64.sp
65The development of BEDTools was motivated by a need for fast, flexible tools with which to compare large sets of genomic
66features. Answering fundamental research questions with existing tools was either too slow or required modifications to the
67way they reported or computed their results. We were aware of the utilities on the UCSC Genome Browser and Galaxy websites, as
68well as the elegant tools available as part of Jim Kent’s monolithic suite of tools (“Kent source”). However, we found that
69the web\-based tools were too cumbersome when working with large datasets generated by current sequencing technologies.
70Similarly, we found that the Kent source command line tools often required a local installation of the UCSC Genome Browser.
71These limitations, combined with the fact that we often wanted an extra option here or there that wasn’t available with
72existing tools, led us to develop our own from scratch. The initial version of BEDTools was publicly released in the spring of
732009. The current version has evolved from our research experiences and those of the scientists using the suite over the last
74year. The BEDTools suite enables one to answer common questions of genomic data in a fast and reliable manner. The fact that
75almost all the utilities accept input from “stdin” allows one to “stream / pipe” several commands together to facilitate more
76complicated analyses. Also, the tools allow fine control over how output is reported. The initial version of BEDTools
77supported solely 6\-column \fI\%BED\fP files. \fIHowever, we have subsequently added support for sequence alignments in\fP \fI\%BAM\fP
78\fIformat, as well as for features in\fP \fI\%GFF\fP , \fI“blocked” BED format, and\fP
79\fI\%VCF\fP \fIformat\fP\&.
80The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets. This manual seeks to describe the behavior and
81available functionality for each BEDTool. Usage examples are scattered throughout the text, and formal examples are
82provided in the last two sections, we hope that this document will give you a sense of the flexibility of
83the toolkit and the types of analyses that are possible with BEDTools. If you have further questions, please join the BEDTools
84discussion group, visit the Usage Examples on the Google Code site (usage, advanced usage), or take a look at the nascent
85“Usage From the Wild” page.
86.SS 1.2 Summary of available tools.
87.sp
88BEDTools support a  wide range of operations for  interrogating and manipulating genomic features. The table below summarizes
89the tools available in the suite.
90.TS
91center;
92|l|l|.
93_
94T{
95Utility
96T}      T{
97Description
98T}
99_
100T{
101\fBintersectBed\fP
102T}      T{
103Returns overlaps between two BED/GFF/VCF files.
104T}
105_
106T{
107\fBpairToBed\fP
108T}      T{
109Returns overlaps between a paired\-end BED file and a regular BED/VCF/GFF file.
110T}
111_
112T{
113\fBbamToBed\fP
114T}      T{
115Converts BAM alignments to BED6, BED12, or BEDPE format.
116T}
117_
118T{
119\fBbedToBam\fP
120T}      T{
121Converts BED/GFF/VCF features to BAM format.
122T}
123_
124T{
125\fBbed12ToBed6\fP
126T}      T{
127Converts "blocked" BED12 features to discrete BED6 features.
128T}
129_
130T{
131\fBbedToIgv\fP
132T}      T{
133Creates IGV batch scripts for taking multiple snapshots from BED/GFF/VCF features.
134T}
135_
136T{
137\fBcoverageBed\fP
138T}      T{
139Summarizes the depth and breadth of coverage of features in one BED versus features (e.g, windows, exons, etc.) defined in another BED/GFF/VCF file.
140T}
141_
142T{
143\fBmultiBamCov\fP
144T}      T{
145Counts sequence coverage for multiple position\-sorted bams at specific loci defined in a BED/GFF/VCF file
146T}
147_
148T{
149\fBtagBam\fP
150T}      T{
151Annotates a BAM file with custom tag fields based on overlaps with BED/GFF/VCF files
152T}
153_
154T{
155\fBnuclBed\fP
156T}      T{
157Profiles the nucleotide content of intervals in a fasta file
158T}
159_
160T{
161\fBgenomeCoverageBed\fP
162T}      T{
163Creates either a histogram, BEDGRAPH, or a "per base" report of genome coverage.
164T}
165_
166T{
167\fBunionBedGraphs\fP
168T}      T{
169Combines multiple BedGraph? files into a single file, allowing coverage/other comparisons between them.
170T}
171_
172T{
173\fBannotateBed\fP
174T}      T{
175Annotates one BED/VCF/GFF file with overlaps from many others.
176T}
177_
178T{
179\fBgroupBy\fP
180T}      T{
181Deprecated. Now in the filo package.
182T}
183_
184T{
185\fBoverlap\fP
186T}      T{
187Returns the number of bases pairs of overlap b/w two features on the same line.
188T}
189_
190T{
191\fBpairToPair\fP
192T}      T{
193Returns overlaps between two paired\-end BED files.
194T}
195_
196T{
197\fBclosestBed\fP
198T}      T{
199Returns the closest feature to each entry in a BED/GFF/VCF file.
200T}
201_
202T{
203\fBsubtractBed\fP
204T}      T{
205Removes the portion of an interval that is overlapped by another feature.
206T}
207_
208T{
209\fBwindowBed\fP
210T}      T{
211Returns overlaps between two BED/VCF/GFF files based on a user\-defined window.
212T}
213_
214T{
215\fBmergeBed\fP
216T}      T{
217Merges overlapping features into a single feature.
218T}
219_
220T{
221\fBcomplementBed\fP
222T}      T{
223Returns all intervals not spanned by the features in a BED/GFF/VCF file.
224T}
225_
226T{
227\fBfastaFromBed\fP
228T}      T{
229Creates FASTA sequences based on intervals in a BED/GFF/VCF file.
230T}
231_
232T{
233\fBmaskFastaFromBed\fP
234T}      T{
235Masks a FASTA file based on BED coordinates.
236T}
237_
238T{
239\fBshuffleBed\fP
240T}      T{
241Randomly permutes the locations of a BED file among a genome.
242T}
243_
244T{
245\fBslopBed\fP
246T}      T{
247Adjusts each BED entry by a requested number of base pairs.
248T}
249_
250T{
251\fBflankBed\fP
252T}      T{
253Creates flanking intervals for each feature in a BED/GFF/VCF file.
254T}
255_
256T{
257\fBsortBed\fP
258T}      T{
259Sorts a BED file by chrom, then start position. Other ways as well.
260T}
261_
262T{
263\fBlinksBed\fP
264T}      T{
265Creates an HTML file of links to the UCSC or a custom browser.
266T}
267_
268.TE
269.SS 1.3 Fundamental concepts.
270.SS 1.3.1 What are genome features and how are they represented?
271.sp
272Throughout this manual, we will discuss how to use BEDTools to manipulate, compare and ask questions of genome “features”. Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g.
273SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that
274an individual lab or researcher defines (e.g., my novel gene or variant).
275.sp
276The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the
277feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “\-“), and the name of the feature if one is applicable.
278.sp
279The two most widely used formats for representing genome features are the BED (Browser Extensible Data) and GFF (General Feature Format) formats. BEDTools was originally written to work exclusively with genome features
280described using the BED format, but it has been recently extended to seamlessly work with BED, GFF and VCF files.
281.sp
282Existing annotations for the genomes of many species can be easily downloaded in BED and GFF
283format from the UCSC Genome Browser’s “Table Browser” (\fI\%http://genome.ucsc.edu/cgi-bin/hgTables?command=start\fP) or from the “Bulk Downloads” page (\fI\%http://hgdownload.cse.ucsc.edu/downloads.html\fP). In addition, the
284Ensemble Genome Browser contains annotations in GFF/GTF format for many species (\fI\%http://www.ensembl.org/info/data/ftp/index.html\fP)
285.SS 1.3.2 Overlapping / intersecting features.
286.sp
287Two genome features (henceforth referred to as “features”) are said to overlap or intersect if they share at least one base in common.
288In the figure below, Feature A intersects/overlaps Feature B, but it does not intersect/overlap Feature C.
289.sp
290\fBTODO: place figure here\fP
291.SS 1.3.3 Comparing features in file “A” and file “B”.
292.sp
293The previous section briefly introduced a fundamental naming convention used in BEDTools. Specifically, all BEDTools that compare features contained in two distinct files refer to one file as feature set “A” and the other file as feature set “B”. This is mainly in the interest of brevity, but it also has its roots in set theory.
294As an example, if one wanted to look for SNPs (file A) that overlap with exons (file B), one would use intersectBed in the following manner:
295.INDENT 0.0
296.INDENT 3.5
297.sp
298.nf
299.ft C
300intersectBed –a snps.bed –b exons.bed
301.ft P
302.fi
303.UNINDENT
304.UNINDENT
305.sp
306There are two exceptions to this rule: 1) When the “A” file is in BAM format, the “\-abam” option must bed used. For example:
307.INDENT 0.0
308.INDENT 3.5
309.sp
310.nf
311.ft C
312intersectBed –abam alignedReads.bam –b exons.bed
313.ft P
314.fi
315.UNINDENT
316.UNINDENT
317.sp
318And 2) For tools where only one input feature file is needed, the “\-i” option is used. For example:
319.INDENT 0.0
320.INDENT 3.5
321.sp
322.nf
323.ft C
324mergeBed –i repeats.bed
325.ft P
326.fi
327.UNINDENT
328.UNINDENT
329.SS 1.3.4 BED starts are zero\-based and BED ends are one\-based.
330.sp
331BEDTools users are sometimes confused by the way the start and end of BED features are represented. Specifically, BEDTools uses the UCSC Genome Browser’s internal database convention of making the start position 0\-based and the end position 1\-based: (\fI\%http://genome.ucsc.edu/FAQ/FAQtracks#tracks1\fP)
332In other words, BEDTools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base:
333.INDENT 0.0
334.INDENT 3.5
335.sp
336.nf
337.ft C
338chr1   0        1    first_base
339.ft P
340.fi
341.UNINDENT
342.UNINDENT
343.sp
344Why, you might ask? The advantage of storing features this way is that when computing the length of a feature, one must simply subtract the start from the end. Were the start position 1\-based,
345the calculation would be (slightly) more complex (i.e. (end\-start)+1). Thus, storing BED features this way reduces the computational burden.
346.SS 1.3.5 GFF starts and ends are one\-based.
347.sp
348In contrast, the GFF format uses 1\-based coordinates for both the start and the end positions. BEDTools is aware of this and adjusts the positions accordingly.
349In other words, you don’t need to subtract 1 from the start positions of your GFF features for them to work correctly with BEDTools.
350.SS 1.3.6 VCF coordinates are one\-based.
351.sp
352The VCF format uses 1\-based coordinates. As in GFF, BEDTools is aware of this and adjusts the positions accordingly.
353In other words, you don’t need to subtract 1 from the start positions of your VCF features for them to work correctly with BEDTools.
354.SS 1.3.7 File B is loaded into memory (most of the time).
355.sp
356Whenever a BEDTool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B.
357Therefore to minimize memory usage, one should set the smaller of the two files as the B file. One salient example is the comparison of aligned sequence reads from a
358current DNA sequencer to gene annotations.      In this case, the aligned sequence file (in BED format) may have tens of millions of features (the sequence alignments),
359while the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.
360.SS 1.3.8 Feature files \fImust\fP be tab\-delimited.
361.sp
362This is rather self\-explanatory. While it is possible to allow BED files to be space\-delimited, we have decided to require tab delimiters for three reasons:
363.INDENT 0.0
364.IP 1. 3
365By requiring one delimiter type, the processing time is minimized.
366.IP 2. 3
367Tab\-delimited files are more amenable to other UNIX utilities.
368.IP 3. 3
369GFF files can contain spaces within attribute columns. This complicates the use of space\-delimited files as spaces must therefore be treated specially depending on the context.
370.UNINDENT
371.SS 1.3.9 All BEDTools allow features to be “piped” via standard input.
372.sp
373In an effort to allow one to combine multiple BEDTools and other UNIX utilities into more complicated “pipelines”, all BEDTools allow features
374to be passed to them via standard input. Only one feature file may be passed to a BEDTool via standard input.
375The convention used by all BEDTools is to set either file A or file B to “stdin” or "\-". For example:
376.INDENT 0.0
377.INDENT 3.5
378.sp
379.nf
380.ft C
381cat snps.bed | intersectBed –a stdin –b exons.bed
382cat snps.bed | intersectBed –a \- –b exons.bed
383.ft P
384.fi
385.UNINDENT
386.UNINDENT
387.sp
388In addition, all BEDTools that simply require one main input file (the \-i file) will assume that input is
389coming from standard input if the \-i parameter is ignored. For example, the following are equivalent:
390.INDENT 0.0
391.INDENT 3.5
392.sp
393.nf
394.ft C
395cat snps.bed | sortBed –i stdin
396cat snps.bed | sortBed
397.ft P
398.fi
399.UNINDENT
400.UNINDENT
401.SS 1.3.10 Most BEDTools write their results to standard output.
402.sp
403To allow one to combine multiple BEDTools and other UNIX utilities into more complicated “pipelines”,
404most BEDTools report their output to standard output, rather than to a named file. If one wants to write the output to a named file, one can use the UNIX “file redirection” symbol “>” to do so.
405Writing to standard output (the default):
406.INDENT 0.0
407.INDENT 3.5
408.sp
409.nf
410.ft C
411intersectBed –a snps.bed –b exons.bed
412chr1 100100 100101 rs233454
413chr1 200100 200101 rs446788
414chr1 300100 300101 rs645678
415.ft P
416.fi
417.UNINDENT
418.UNINDENT
419.sp
420Writing to a file:
421.INDENT 0.0
422.INDENT 3.5
423.sp
424.nf
425.ft C
426intersectBed –a snps.bed –b exons.bed > snps.in.exons.bed
427
428cat snps.in.exons.bed
429chr1 100100 100101 rs233454
430chr1 200100 200101 rs446788
431chr1 300100 300101 rs645678
432.ft P
433.fi
434.UNINDENT
435.UNINDENT
436.SS 1.3.11 What is a “genome” file?
437.sp
438Some of the BEDTools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of
439the chromosomes for the organism for which your BED files are based. When using the UCSC Genome
440Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working.
441The way you do this for BEDTools is to create a “genome” file, which simply lists the names of the
442chromosomes (or scaffolds, etc.) and their size (in basepairs).
443Genome files must be tab\-delimited and are structured as follows (this is an example for C. elegans):
444.INDENT 0.0
445.INDENT 3.5
446.sp
447.nf
448.ft C
449chrI 15072421
450chrII 15279323
451\&...
452chrX 17718854
453chrM 13794
454.ft P
455.fi
456.UNINDENT
457.UNINDENT
458.sp
459BEDTools includes predefined genome files for human and mouse in the /genomes directory included
460in the BEDTools distribution. Additionally, the “chromInfo” files/tables available from the UCSC
461Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here:
462\fI\%http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz\fP
463.SS 1.3.12 Paired\-end BED files (BEDPE files).
464.sp
465We have defined a new file format (BEDPE) to concisely describe disjoint genome features, such as
466structural variations or paired\-end sequence alignments. We chose to define a new format because the
467existing BED block format (i.e. BED12) does not allow inter\-chromosomal feature definitions. Moreover,
468the BED12 format feels rather bloated when one want to describe events with only two blocks.
469.SS 1.3.13 Use “\-h” for help with any BEDTool.
470.sp
471Rather straightforward. If you use the “\-h” option with any BEDTool, a full menu of example usage
472and available options (when applicable) will be reported.
473.SS 1.3.14 BED features must not contain negative positions.
474.sp
475BEDTools will typically reject BED features that contain negative positions. In special cases, however,
476BEDPE positions may be set to \-1 to indicate that one or more ends of a BEDPE feature is unaligned.
477.SS 1.3.15 The start position must be <= to the end position.
478.sp
479BEDTools will reject BED features where the start position is greater than the end position.
480.SS 1.3.16 Headers are allowed in GFF and BED files
481.sp
482BEDTools will ignore headers at the beginning of BED and GFF files. Valid header lines begin with a
483“#” symbol, the work “track”, or the word “browser”. For example, the following examples are valid
484headers for BED or GFF files:
485.INDENT 0.0
486.INDENT 3.5
487.sp
488.nf
489.ft C
490track name=aligned_read description="Illumina aligned reads”
491chr5 100000 500000 read1 50 +
492chr5 2380000 2386000 read2 60 \-
493
494#This is a fascinating dataset
495chr5 100000 500000 read1 50 +
496chr5 2380000 2386000 read2 60 \-
497
498browser position chr22:1\-20000
499chr5 100000 500000 read1 50 +
500chr5 2380000 2386000 read2 60 \-
501.ft P
502.fi
503.UNINDENT
504.UNINDENT
505.SS 1.3.17 GZIP support: BED, GFF, VCF, and BEDPE file can be “gzipped”
506.sp
507BEDTools will process gzipped BED, GFF, VCF and BEDPE files in the same manner as
508uncompressed files. Gzipped files are auto\-detected thanks to a helpful contribution from Gordon
509Assaf.
510.SS 1.3.18 Support for “split” or “spliced” BAM alignments and “blocked” BED features
511.sp
512As of Version 2.8.0, five BEDTools (\fBintersectBed\fP, \fBcoverageBed\fP, \fBgenomeCoverageBed\fP,
513\fBbamToBed\fP, and \fBbed12ToBed6\fP) can properly handle “split”/”spliced” BAM alignments (i.e., having an
514“N” CIGAR operation) and/or “blocked” BED (aka BED12) features.
515.sp
516\fBintersectBed\fP, \fBcoverageBed\fP, and \fBgenomeCoverageBed\fP will optionally handle “split” BAM and/or
517“blocked” BED by using the \fB\-split\fP option. This will cause intersects or coverage to be computed only
518for the alignment or feature blocks. In contrast, without this option, the intersects/coverage would be
519computed for the entire “span” of the alignment or feature, regardless of the size of the gaps between
520each alignment or feature block. For example, imagine you have a RNA\-seq read that originates from
521the junction of two exons that were spliced together in a mRNA. In the genome, these two exons
522happen to be 30Kb apart. Thus, when the read is aligned to the reference genome, one portion of the
523read will align to the first exon, while another portion of the read will align ca. 30Kb downstream to the
524other exon. The corresponding CIGAR string would be something like (assuming a 76bp read):
52530M*3000N*46M. In the genome, this alignment “spans” 3076 bp, yet the nucleotides in the sequencing
526read only align “cover” 76bp. Without the \fB\-split\fP option, coverage or overlaps would be reported for the
527entire 3076bp span of the alignment. However, with the \fB\-split\fP option, coverage or overlaps will only
528be reported for the portions of the read that overlap the exons (i.e. 30bp on one exon, and
52946bp on the other).
530.sp
531Using the \-split option with bamToBed causes “spliced/split” alignments to be reported in BED12
532format. Using the \-split option with bed12ToBed6 causes “blocked” BED12 features to be reported in
533BED6 format.
534.SS 1.3.19 Writing uncompressed BAM output.
535.sp
536When working with a large BAM file using a complex set of tools in a pipe/stream, it is advantageous
537to pass uncompressed BAM output to each downstream program. This minimizes the amount of time
538spent compressing and decompressing output from one program to the next. All BEDTools that create
539BAM output (e.g. \fBintersectBed\fP, \fBwindowBed\fP) will now optionally create uncompressed BAM output
540using the \fB\-ubam\fP option.
541.SS 1.4 Implementation and algorithmic notes.
542.sp
543BEDTools was implemented in C++ and makes extensive use of data structures and fundamental
544algorithms from the Standard Template Library (STL). Many of the core algorithms are based upon the
545genome binning algorithm described in the original UCSC Genome Browser paper (Kent et al, 2002).
546The tools have been designed to inherit core data structures from central source files, thus allowing
547rapid tool development and deployment of improvements and corrections. Support for BAM files is
548made possible through Derek Barnett’s elegant C++ API called BamTools.
549.SS 1.5 License and availability.
550.sp
551BEDTools is freely available under a GNU Public License (Version 2) at:
552\fI\%http://bedtools.googlecode.com\fP
553.SS 1.6 Mailing list.
554.sp
555A discussion group for reporting bugs, asking questions of the developer and of the user community, as
556well as for requesting new features is available at:
557\fI\%http://groups.google.com/group/bedtools-discuss\fP
558.SS 1.7 Contributors.
559.sp
560As open\-source software, BEDTools greatly benefits from contributions made by other developers and
561users of the tools. We encourage and welcome suggestions, contributions and complaints. This is how
562software matures, improves and stays on top of the needs of its user community. The Google Code
563(GC) site maintains a list of individuals who have contributed either source code or useful ideas for
564improving the tools. In the near future, we hope to maintain a source repository on the GC site in
565order to facilitate further contributions. We are currently unable to do so because we use Git for
566version control, which is not yet supported by GC.
567.SH INSTALLATION
568.sp
569BEDTools is intended to run in a "command line" environment on UNIX, LINUX and Apple OS X
570operating systems. Installing BEDTools involves downloading the latest source code archive followed by
571compiling the source code into binaries on your local system. The following commands will install
572BEDTools in a local directory on a NIX or OS X machine. Note that the \fB"<version>"\fP refers to the
573latest posted version number on \fI\%http://bedtools.googlecode.com/\fP\&.
574.sp
575Note: \fIThe BEDTools "makefiles" use the GCC compiler. One should edit the Makefiles accordingly if
576one wants to use a different compiler.\fP:
577.INDENT 0.0
578.INDENT 3.5
579.sp
580.nf
581.ft C
582curl http://bedtools.googlecode.com/files/BEDTools.<version>.tar.gz > BEDTools.tar.gz
583tar \-zxvf BEDTools.tar.gz
584cd BEDTools\-<version>
585make clean
586make all
587ls bin
588.ft P
589.fi
590.UNINDENT
591.UNINDENT
592.sp
593At this point, one should copy the binaries in BEDTools/bin/ to either usr/local/bin/ or some
594other repository for commonly used UNIX tools in your environment. You will typically require
595administrator (e.g. "root" or "sudo") privileges to copy to usr/local/bin/. If in doubt, contact you
596system administrator for help.
597.SH QUICK START
598.SS Install BEDTools
599.INDENT 0.0
600.INDENT 3.5
601.sp
602.nf
603.ft C
604curl http://bedtools.googlecode.com/files/BEDTools.<version>.tar.gz > BEDTools.tar.gz
605tar \-zxvf BEDTools.tar.gz
606cd BEDTools
607make clean
608make all
609sudo cp bin/* /usr/local/bin/
610.ft P
611.fi
612.UNINDENT
613.UNINDENT
614.SS Use BEDTools
615.sp
616Below are examples of typical BEDTools usage. \fBAdditional usage examples are described in
617section 6 of this manual.\fP Using the "\-h" option with any BEDTools will report a list of all command
618line options.
619.sp
620A. Report the base\-pair overlap between the features in two BED files.
621.INDENT 0.0
622.INDENT 3.5
623.sp
624.nf
625.ft C
626intersectBed \-a reads.bed \-b genes.bed
627.ft P
628.fi
629.UNINDENT
630.UNINDENT
631.sp
632B. Report those entries in A that overlap NO entries in B. Like "grep \-v"
633.INDENT 0.0
634.INDENT 3.5
635.sp
636.nf
637.ft C
638intersectBed \-a reads.bed \-b genes.bed ?Cv
639.ft P
640.fi
641.UNINDENT
642.UNINDENT
643.sp
644C. Read BED A from stdin. Useful for stringing together commands. For example, find genes that overlap LINEs
645but not SINEs.
646.INDENT 0.0
647.INDENT 3.5
648.sp
649.nf
650.ft C
651intersectBed \-a genes.bed \-b LINES.bed | intersectBed \-a stdin \-b SINEs.bed ?Cv
652.ft P
653.fi
654.UNINDENT
655.UNINDENT
656.sp
657D. Find the closest ALU to each gene.
658.INDENT 0.0
659.INDENT 3.5
660.sp
661.nf
662.ft C
663closestBed \-a genes.bed \-b ALUs.bed
664.ft P
665.fi
666.UNINDENT
667.UNINDENT
668.sp
669E. Merge overlapping repetitive elements into a single entry, returning the number of entries merged.
670.INDENT 0.0
671.INDENT 3.5
672.sp
673.nf
674.ft C
675mergeBed \-i repeatMasker.bed \-n
676.ft P
677.fi
678.UNINDENT
679.UNINDENT
680.sp
681F. Merge nearby repetitive elements into a single entry, so long as they are within 1000 bp of one another.
682.INDENT 0.0
683.INDENT 3.5
684.sp
685.nf
686.ft C
687mergeBed \-i repeatMasker.bed \-d 1000
688.ft P
689.fi
690.UNINDENT
691.UNINDENT
692.SH GENERAL USAGE
693.SS 4.1 Supported file formats
694.SS 4.1.1 BED format
695.sp
696As described on the UCSC Genome Browser website (see link below), the BED format is a concise and
697flexible way to represent genomic features and annotations. The BED format description supports up to
69812 columns, but only the first 3 are required for the UCSC browser, the Galaxy browser and for
699BEDTools. BEDTools allows one to use the "BED12" format (that is, all 12 fields listed below).
700However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12
701"blocks" when computing overlaps, etc., via the \fB"\-split"\fP option. For all other tools, the last six columns
702are not used for any comparisons by the BEDTools. Instead, they will use the entire span (start to end)
703of the BED12 entry to perform any relevant feature comparisons. The last six columns will be reported
704in the output of all comparisons.
705.sp
706The file description below is modified from: \fI\%http://genome.ucsc.edu/FAQ/FAQformat#format1\fP\&.
707.INDENT 0.0
708.IP 1. 3
709\fBchrom\fP \- The name of the chromosome on which the genome feature exists.
710.UNINDENT
711.INDENT 0.0
712.INDENT 3.5
713.INDENT 0.0
714.IP \(bu 2
715\fIAny string can be used\fP\&. For example, "chr1", "III", "myChrom", "contig1112.23".
716.IP \(bu 2
717\fIThis column is required\fP\&.
718.UNINDENT
719.UNINDENT
720.UNINDENT
721.INDENT 0.0
722.IP 2. 3
723\fBstart\fP \- The zero\-based starting position of the feature in the chromosome.
724.UNINDENT
725.INDENT 0.0
726.INDENT 3.5
727.INDENT 0.0
728.IP \(bu 2
729\fIThe first base in a chromosome is numbered 0\fP\&.
730.IP \(bu 2
731\fIThe start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20,inclusive\fP\&.
732.IP \(bu 2
733\fIThis column is required\fP\&.
734.UNINDENT
735.UNINDENT
736.UNINDENT
737.INDENT 0.0
738.IP 3. 3
739\fBend\fP \- The one\-based ending position of the feature in the chromosome.
740.UNINDENT
741.INDENT 0.0
742.INDENT 3.5
743.INDENT 0.0
744.IP \(bu 2
745\fIThe end position in each BED feature is one\-based. See example above\fP\&.
746.IP \(bu 2
747\fIThis column is required\fP\&.
748.UNINDENT
749.UNINDENT
750.UNINDENT
751.INDENT 0.0
752.IP 4. 3
753\fBname\fP \- Defines the name of the BED feature.
754.UNINDENT
755.INDENT 0.0
756.INDENT 3.5
757.INDENT 0.0
758.IP \(bu 2
759\fIAny string can be used\fP\&. For example, "LINE", "Exon3", "HWIEAS_0001:3:1:0:266#0/1", or "my_Feature".
760.IP \(bu 2
761\fIThis column is optional\fP\&.
762.UNINDENT
763.UNINDENT
764.UNINDENT
765.INDENT 0.0
766.IP 5. 3
767\fBscore\fP \- The UCSC definition requires that a BED score range from 0 to 1000, inclusive. However, BEDTools allows any string to be stored in this field in order to allow greater flexibility in annotation features. For example, strings allow scientific notation for p\-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
768.UNINDENT
769.INDENT 0.0
770.INDENT 3.5
771.INDENT 0.0
772.IP \(bu 2
773\fIAny string can be used\fP\&. For example, 7.31E\-05 (p\-value), 0.33456 (mean enrichment value), "up", "down", etc.
774.IP \(bu 2
775\fIThis column is optional\fP\&.
776.UNINDENT
777.UNINDENT
778.UNINDENT
779.INDENT 0.0
780.IP 6. 3
781\fBstrand\fP \- Defines the strand \- either \(aq+\(aq or \(aq\-\(aq.
782.UNINDENT
783.INDENT 0.0
784.INDENT 3.5
785.INDENT 0.0
786.IP \(bu 2
787\fIThis column is optional\fP\&.
788.UNINDENT
789.UNINDENT
790.UNINDENT
791.INDENT 0.0
792.IP 7. 3
793\fBthickStart\fP \- The starting position at which the feature is drawn thickly.
794.UNINDENT
795.INDENT 0.0
796.INDENT 3.5
797.INDENT 0.0
798.IP \(bu 2
799\fIAllowed yet ignored by BEDTools\fP\&.
800.UNINDENT
801.UNINDENT
802.UNINDENT
803.INDENT 0.0
804.IP 8. 3
805\fBthickEnd\fP \- The ending position at which the feature is drawn thickly.
806.UNINDENT
807.INDENT 0.0
808.INDENT 3.5
809.INDENT 0.0
810.IP \(bu 2
811\fIAllowed yet ignored by BEDTools\fP\&.
812.UNINDENT
813.UNINDENT
814.UNINDENT
815.INDENT 0.0
816.IP 9. 3
817\fBitemRgb\fP \- An RGB value of the form R,G,B (e.g. 255,0,0).
818.UNINDENT
819.INDENT 0.0
820.INDENT 3.5
821.INDENT 0.0
822.IP \(bu 2
823\fIAllowed yet ignored by BEDTools\fP\&.
824.UNINDENT
825.UNINDENT
826.UNINDENT
827.INDENT 0.0
828.IP 10. 3
829\fBblockCount\fP \- The number of blocks (exons) in the BED line.
830.UNINDENT
831.INDENT 0.0
832.INDENT 3.5
833.INDENT 0.0
834.IP \(bu 2
835\fIAllowed yet ignored by BEDTools\fP\&.
836.UNINDENT
837.UNINDENT
838.UNINDENT
839.INDENT 0.0
840.IP 11. 4
841\fBblockSizes\fP \- A comma\-separated list of the block sizes.
842.UNINDENT
843.INDENT 0.0
844.INDENT 3.5
845.INDENT 0.0
846.IP \(bu 2
847\fIAllowed yet ignored by BEDTools\fP\&.
848.UNINDENT
849.UNINDENT
850.UNINDENT
851.INDENT 0.0
852.IP 12. 4
853\fBblockStarts\fP \- A comma\-separated list of block starts.
854.UNINDENT
855.INDENT 0.0
856.INDENT 3.5
857.INDENT 0.0
858.IP \(bu 2
859\fIAllowed yet ignored by BEDTools\fP\&.
860.UNINDENT
861.UNINDENT
862.UNINDENT
863.sp
864BEDTools requires that all BED input files (and input received from stdin) are \fBtab\-delimited\fP\&. The following types of BED files are supported by BEDTools:
865.INDENT 0.0
866.IP 1. 3
867.nf
868\fBBED3\fP: A BED file where each feature is described by \fBchrom\fP, \fBstart\fP, and \fBend\fP\&.
869For example: chr1          11873   14409
870.fi
871.sp
872.IP 2. 3
873.nf
874\fBBED4\fP: A BED file where each feature is described by \fBchrom\fP, \fBstart\fP, \fBend\fP, and \fBname\fP\&.
875For example: chr1  11873  14409  uc001aaa.3
876.fi
877.sp
878.IP 3. 3
879.nf
880\fBBED5\fP: A BED file where each feature is described by \fBchrom\fP, \fBstart\fP, \fBend\fP, \fBname\fP, and \fBscore\fP\&.
881For example: chr1 11873 14409 uc001aaa.3 0
882.fi
883.sp
884.IP 4. 3
885.nf
886\fBBED6\fP: A BED file where each feature is described by \fBchrom\fP, \fBstart\fP, \fBend\fP, \fBname\fP, \fBscore\fP, and \fBstrand\fP\&.
887For example: chr1 11873 14409 uc001aaa.3 0 +
888.fi
889.sp
890.IP 5. 3
891.nf
892\fBBED12\fP: A BED file where each feature is described by all twelve columns listed above.
893For example: chr1 11873 14409 uc001aaa.3 0 + 11873
89411873 0 3 354,109,1189, 0,739,1347,
895.fi
896.sp
897.UNINDENT
898.SS 4.1.2 BEDPE format
899.sp
900We have defined a new file format (BEDPE) in order to concisely describe disjoint genome features,
901such as structural variations or paired\-end sequence alignments. We chose to define a new format
902because the existing "blocked" BED format (a.k.a. BED12) does not allow inter\-chromosomal feature
903definitions. In addition, BED12 only has one strand field, which is insufficient for paired\-end sequence
904alignments, especially when studying structural variation.
905.sp
906The BEDPE format is described below. The description is modified from: \fI\%http://genome.ucsc.edu/FAQ/FAQformat#format1\fP\&.
907.INDENT 0.0
908.IP 1. 3
909\fBchrom1\fP \- The name of the chromosome on which the \fBfirst\fP end of the feature exists.
910.UNINDENT
911.INDENT 0.0
912.INDENT 3.5
913.INDENT 0.0
914.IP \(bu 2
915\fIAny string can be used\fP\&. For example, "chr1", "III", "myChrom", "contig1112.23".
916.IP \(bu 2
917\fIThis column is required\fP\&.
918.IP \(bu 2
919\fIUse "." for unknown\fP\&.
920.UNINDENT
921.UNINDENT
922.UNINDENT
923.INDENT 0.0
924.IP 2. 3
925\fBstart1\fP \- The zero\-based starting position of the \fBfirst\fP end of the feature on \fBchrom1\fP\&.
926.UNINDENT
927.INDENT 0.0
928.INDENT 3.5
929.INDENT 0.0
930.IP \(bu 2
931\fIThe first base in a chromosome is numbered 0\fP\&.
932.IP \(bu 2
933\fIAs with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature. This column is required\fP\&.
934.IP \(bu 2
935\fIUse \-1 for unknown\fP\&.
936.UNINDENT
937.UNINDENT
938.UNINDENT
939.INDENT 0.0
940.IP 3. 3
941\fBend1\fP \- The one\-based ending position of the first end of the feature on \fBchrom1\fP\&.
942.UNINDENT
943.INDENT 0.0
944.INDENT 3.5
945.INDENT 0.0
946.IP \(bu 2
947\fIThe end position in each BEDPE feature is one\-based\fP\&.
948.IP \(bu 2
949\fIThis column is required\fP\&.
950.IP \(bu 2
951\fIUse \-1 for unknown\fP\&.
952.UNINDENT
953.UNINDENT
954.UNINDENT
955.INDENT 0.0
956.IP 4. 3
957\fBchrom2\fP \- The name of the chromosome on which the \fBsecond\fP end of the feature exists.
958.UNINDENT
959.INDENT 0.0
960.INDENT 3.5
961.INDENT 0.0
962.IP \(bu 2
963\fIAny string can be used\fP\&. For example, "chr1", "III", "myChrom", "contig1112.23".
964.IP \(bu 2
965\fIThis column is required\fP\&.
966.IP \(bu 2
967\fIUse "." for unknown\fP\&.
968.UNINDENT
969.UNINDENT
970.UNINDENT
971.INDENT 0.0
972.IP 5. 3
973\fBstart2\fP \- The zero\-based starting position of the \fBsecond\fP end of the feature on \fBchrom2\fP\&.
974.UNINDENT
975.INDENT 0.0
976.INDENT 3.5
977.INDENT 0.0
978.IP \(bu 2
979\fIThe first base in a chromosome is numbered 0\fP\&.
980.IP \(bu 2
981\fIAs with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature. This column is required\fP\&.
982.IP \(bu 2
983\fIUse \-1 for unknown\fP\&.
984.UNINDENT
985.UNINDENT
986.UNINDENT
987.INDENT 0.0
988.IP 6. 3
989\fBend2\fP \- The one\-based ending position of the \fBsecond\fP end of the feature on \fBchrom2\fP\&.
990.UNINDENT
991.INDENT 0.0
992.INDENT 3.5
993.INDENT 0.0
994.IP \(bu 2
995\fIThe end position in each BEDPE feature is one\-based\fP\&.
996.IP \(bu 2
997\fIThis column is required\fP\&.
998.IP \(bu 2
999\fIUse \-1 for unknown\fP\&.
1000.UNINDENT
1001.UNINDENT
1002.UNINDENT
1003.INDENT 0.0
1004.IP 7. 3
1005\fBname\fP \- Defines the name of the BEDPE feature.
1006.UNINDENT
1007.INDENT 0.0
1008.INDENT 3.5
1009.INDENT 0.0
1010.IP \(bu 2
1011\fIAny string can be used\fP\&. For example, "LINE", "Exon3", "HWIEAS_0001:3:1:0:266#0/1", or "my_Feature".
1012.IP \(bu 2
1013\fIThis column is optional\fP\&.
1014.UNINDENT
1015.UNINDENT
1016.UNINDENT
1017.INDENT 0.0
1018.IP 8. 3
1019\fBscore\fP \- The UCSC definition requires that a BED score range from 0 to 1000, inclusive. \fIHowever, BEDTools allows any string to be stored in this field in order to allow greater flexibility in annotation features\fP\&. For example, strings allow scientific notation for p\-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
1020.UNINDENT
1021.INDENT 0.0
1022.INDENT 3.5
1023.INDENT 0.0
1024.IP \(bu 2
1025\fIAny string can be used\fP\&. For example, 7.31E\-05 (p\-value), 0.33456 (mean enrichment value), "up", "down", etc.
1026.IP \(bu 2
1027\fIThis column is optional\fP\&.
1028.UNINDENT
1029.UNINDENT
1030.UNINDENT
1031.INDENT 0.0
1032.IP 9. 3
1033\fBstrand1\fP \- Defines the strand for the first end of the feature. Either \(aq+\(aq or \(aq\-\(aq.
1034.UNINDENT
1035.INDENT 0.0
1036.INDENT 3.5
1037.INDENT 0.0
1038.IP \(bu 2
1039\fIThis column is optional\fP\&.
1040.IP \(bu 2
1041\fIUse "." for unknown\fP\&.
1042.UNINDENT
1043.UNINDENT
1044.UNINDENT
1045.INDENT 0.0
1046.IP 10. 3
1047\fBstrand2\fP \- Defines the strand for the second end of the feature. Either \(aq+\(aq or \(aq\-\(aq.
1048.UNINDENT
1049.INDENT 0.0
1050.INDENT 3.5
1051.INDENT 0.0
1052.IP \(bu 2
1053\fIThis column is optional\fP\&.
1054.IP \(bu 2
1055\fIUse "." for unknown\fP\&.
1056.UNINDENT
1057.UNINDENT
1058.UNINDENT
1059.INDENT 0.0
1060.IP 11. 4
1061\fBAny number of additional, user\-defined fields\fP \- BEDTools allows one to add as many additional fields to the normal, 10\-column BEDPE format as necessary. These columns are merely "passed through" \fBpairToBed\fP and \fBpairToPair\fP and are not part of any analysis. One would use these additional columns to add extra information (e.g., edit distance for each end of an alignment, or "deletion", "inversion", etc.) to each BEDPE feature.
1062.UNINDENT
1063.INDENT 0.0
1064.INDENT 3.5
1065.INDENT 0.0
1066.IP \(bu 2
1067\fIThese additional columns are optional\fP\&.
1068.UNINDENT
1069.UNINDENT
1070.UNINDENT
1071.sp
1072Entries from an typical BEDPE file:
1073.INDENT 0.0
1074.INDENT 3.5
1075.sp
1076.nf
1077.ft C
1078chr1  100   200   chr5  5000  5100  bedpe_example1  30   +  \-
1079chr9  1000  5000  chr9  3000  3800  bedpe_example2  100  +  \-
1080.ft P
1081.fi
1082.UNINDENT
1083.UNINDENT
1084.sp
1085Entries from a BEDPE file with two custom fields added to each record:
1086.INDENT 0.0
1087.INDENT 3.5
1088.sp
1089.nf
1090.ft C
1091chr1  10    20    chr5  50    60    a1     30       +    \-  0  1
1092chr9  30    40    chr9  80    90    a2     100      +    \-  2  1
1093.ft P
1094.fi
1095.UNINDENT
1096.UNINDENT
1097.SS 4.1.3 GFF format
1098.sp
1099The GFF format is described on the Sanger Institute\(aqs website (\fI\%http://www.sanger.ac.uk/resources/software/gff/spec.html\fP). The GFF description below is modified from the definition at this URL. All nine columns in the GFF format description are required by BEDTools.
1100.INDENT 0.0
1101.IP 1. 3
1102\fBseqname\fP \- The name of the sequence (e.g. chromosome) on which the feature exists.
1103.UNINDENT
1104.INDENT 0.0
1105.INDENT 3.5
1106.INDENT 0.0
1107.IP \(bu 2
1108\fIAny string can be used\fP\&. For example, "chr1", "III", "myChrom", "contig1112.23".
1109.IP \(bu 2
1110\fIThis column is required\fP\&.
1111.UNINDENT
1112.UNINDENT
1113.UNINDENT
1114.INDENT 0.0
1115.IP 2. 3
1116\fBsource\fP \- The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.
1117.UNINDENT
1118.INDENT 0.0
1119.INDENT 3.5
1120.INDENT 0.0
1121.IP \(bu 2
1122\fIThis column is required\fP\&.
1123.UNINDENT
1124.UNINDENT
1125.UNINDENT
1126.INDENT 0.0
1127.IP 3. 3
1128\fBfeature\fP \- The feature type name. Equivalent to BED\(aqs \fBname\fP field.
1129.UNINDENT
1130.INDENT 0.0
1131.INDENT 3.5
1132.INDENT 0.0
1133.IP \(bu 2
1134\fIAny string can be used\fP\&. For example, "exon", etc.
1135.IP \(bu 2
1136\fIThis column is required\fP\&.
1137.UNINDENT
1138.UNINDENT
1139.UNINDENT
1140.INDENT 0.0
1141.IP 4. 3
1142\fBstart\fP \- The one\-based starting position of feature on \fBseqname\fP\&.
1143.UNINDENT
1144.INDENT 0.0
1145.INDENT 3.5
1146.INDENT 0.0
1147.IP \(bu 2
1148\fIThis column is required\fP\&.
1149.IP \(bu 2
1150\fIBEDTools accounts for the fact the GFF uses a one\-based position and BED uses a zero\-based start position\fP\&.
1151.UNINDENT
1152.UNINDENT
1153.UNINDENT
1154.INDENT 0.0
1155.IP 5. 3
1156\fBend\fP \- The one\-based ending position of feature on \fBseqname\fP\&.
1157.UNINDENT
1158.INDENT 0.0
1159.INDENT 3.5
1160.INDENT 0.0
1161.IP \(bu 2
1162\fIThis column is required\fP\&.
1163.UNINDENT
1164.UNINDENT
1165.UNINDENT
1166.INDENT 0.0
1167.IP 6. 3
1168\fBscore\fP \- A score assigned to the GFF feature. Like BED format, BEDTools allows any string to be stored in this field in order to allow greater flexibility in annotation features. We note that this differs from the GFF definition in the interest of flexibility.
1169.UNINDENT
1170.INDENT 0.0
1171.INDENT 3.5
1172.INDENT 0.0
1173.IP \(bu 2
1174\fIThis column is required\fP\&.
1175.UNINDENT
1176.UNINDENT
1177.UNINDENT
1178.INDENT 0.0
1179.IP 7. 3
1180\fBstrand\fP \- Defines the strand. Use \(aq+\(aq, \(aq\-\(aq or \(aq.\(aq
1181.UNINDENT
1182.INDENT 0.0
1183.INDENT 3.5
1184.INDENT 0.0
1185.IP \(bu 2
1186\fIThis column is required\fP\&.
1187.UNINDENT
1188.UNINDENT
1189.UNINDENT
1190.INDENT 0.0
1191.IP 8. 3
1192\fBframe\fP \-  The frame of the coding sequence. Use \(aq0\(aq, \(aq1\(aq, \(aq2\(aq, or \(aq.\(aq.
1193.UNINDENT
1194.INDENT 0.0
1195.INDENT 3.5
1196.INDENT 0.0
1197.IP \(bu 2
1198\fIThis column is required\fP\&.
1199.UNINDENT
1200.UNINDENT
1201.UNINDENT
1202.INDENT 0.0
1203.IP 9. 3
1204\fBattribute\fP \- Taken from \fI\%http://www.sanger.ac.uk/resources/software/gff/spec.html\fP: From version 2 onwards, the attribute field must have an tag value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A\-Za\-z][
1205.nf
1206AZa\-z0\-9_
1207.fi
1208]*). Free text values must be quoted with double quotes. \fINote: all non\-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash\-escaped representation (e.g. newlines as \(aqn\(aq, tabs as \(aqt\(aq)\fP\&. As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required).
1209.UNINDENT
1210.INDENT 0.0
1211.INDENT 3.5
1212.INDENT 0.0
1213.IP \(bu 2
1214\fIThis column is required\fP\&.
1215.UNINDENT
1216.UNINDENT
1217.UNINDENT
1218.sp
1219An entry from an example GFF file :
1220.INDENT 0.0
1221.INDENT 3.5
1222.sp
1223.nf
1224.ft C
1225seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ;
1226E_value 0.0003 dJ102G20 GD_mRNA coding_exon 7105 7201 . \- 2 Sequence
1227"dJ102G20.C1.1"
1228.ft P
1229.fi
1230.UNINDENT
1231.UNINDENT
1232.SS 4.1.3 GFF format
1233.sp
1234Some of the BEDTools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of
1235the chromosomes for the organism for which your BED files are based. When using the UCSC Genome
1236Browser, Ensemble, or Galaxy, you typically indicate which which species/genome build you are
1237working. The way you do this for BEDTools is to create a "genome" file, which simply lists the names of
1238the chromosomes (or scaffolds, etc.) and their size (in basepairs).
1239.sp
1240Genome files must be \fBtab\-delimited\fP and are structured as follows (this is an example for \fIC. elegans\fP):
1241.INDENT 0.0
1242.INDENT 3.5
1243.sp
1244.nf
1245.ft C
1246chrI  15072421
1247chrII 15279323
1248\&...
1249chrX  17718854
1250chrM  13794
1251.ft P
1252.fi
1253.UNINDENT
1254.UNINDENT
1255.sp
1256BEDTools includes pre\-defined genome files for human and mouse in the \fB/genomes\fP directory included
1257in the BEDTools distribution.
1258.SS 4.1.5 SAM/BAM format
1259.sp
1260The SAM / BAM format is a powerful and widely\-used format for storing sequence alignment data (see
1261\fI\%http://samtools.sourceforge.net/\fP for more details). It has quickly become the standard format to which
1262most DNA sequence alignment programs write their output. Currently, the following BEDTools
1263support inout in BAM format: \fIintersectBed, windowBed, coverageBed, genomeCoverageBed,
1264pairToBed, bamToBed\fP\&. Support for the BAM format in BEDTools allows one to (to name a few):
1265compare sequence alignments to annotations, refine alignment datasets, screen for potential mutations
1266and compute aligned sequence coverage.
1267.sp
1268The details of how these tools work with BAM files are addressed in \fBSection 5\fP of this manual.
1269.SS 4.1.6 VCF format
1270.sp
1271The Variant Call Format (VCF) was conceived as part of the 1000 Genomes Project as a standardized
1272means to report genetic variation calls from SNP, INDEL and structural variant detection programs
1273(see \fI\%http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0\fP for details).
1274BEDTools now supports the latest version of this format (i.e, Version 4.0). As a result, BEDTools can
1275be used to compare genetic variation calls with other genomic features.
1276.SH THE BEDTOOLS SUITE
1277.sp
1278This section covers the functionality and default / optional usage for each of the available BEDTools.
1279Example "figures" are provided in some cases in an effort to convey the purpose of the tool. The
1280behavior of each available parameter is discussed for each tool in abstract terms. More concrete usage
1281examples are provided in \fBSection 6\fP\&.
1282.SS Table of contents
1283.SS 5.1 intersect
1284.sp
1285By far, the most common question asked of two sets of genomic features is whether or not any of the
1286features in the two sets "overlap" with one another. This is known as feature intersection. \fBbedtools intersect\fP
1287allows one to screen for overlaps between two sets of genomic features. Moreover, it allows one to have
1288fine control as to how the intersections are reported. \fBbedtools intersect\fP works with both BED/GFF/VCF
1289and BAM files as input.
1290.SS 5.1.1 Usage and option summary
1291.sp
1292\fBUsage\fP:
1293.INDENT 0.0
1294.INDENT 3.5
1295.sp
1296.nf
1297.ft C
1298bedtools intersect [OPTIONS] [\-a <BED/GFF/VCF> || \-abam <BAM>] \-b <BED/GFF/VCF>
1299
1300intersectBed [OPTIONS] [\-a <BED/GFF/VCF> || \-abam <BAM>] \-b <BED/GFF/VCF>
1301.ft P
1302.fi
1303.UNINDENT
1304.UNINDENT
1305.TS
1306center;
1307|l|l|.
1308_
1309T{
1310Option
1311T}      T{
1312Description
1313T}
1314_
1315T{
1316\fB\-a\fP
1317T}      T{
1318BED/GFF/VCF file A. Each feature in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe.
1319T}
1320_
1321T{
1322\fB\-b\fP
1323T}      T{
1324BED/GFF/VCF file B. Use "stdin" if passing B with a UNIX pipe.
1325T}
1326_
1327T{
1328\fB\-abam\fP
1329T}      T{
1330BAM file A. Each BAM alignment in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe: For example: samtools view \-b <BAM> | bedtools intersect \-abam stdin \-b genes.bed
1331T}
1332_
1333T{
1334\fB\-ubam\fP
1335T}      T{
1336Write uncompressed BAM output. The default is write compressed BAM output.
1337T}
1338_
1339T{
1340\fB\-bed\fP
1341T}      T{
1342When using BAM input (\-abam), write output as BED. The default is to write output in BAM when using \-abam. For example:   bedtools intersect \-abam reads.bam \-b genes.bed \-bed
1343T}
1344_
1345T{
1346\fB\-wa\fP
1347T}      T{
1348Write the original entry in A for each overlap.
1349T}
1350_
1351T{
1352\fB\-wb\fP
1353T}      T{
1354Write the original entry in B for each overlap. Useful for knowing what A overlaps. Restricted by \-f and \-r.
1355T}
1356_
1357T{
1358\fB\-wo\fP
1359T}      T{
1360Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by \-f and \-r.
1361T}
1362_
1363T{
1364\fB\-wao\fP
1365T}      T{
1366Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by \-f and \-r.
1367T}
1368_
1369T{
1370\fB\-u\fP
1371T}      T{
1372Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by \-f and \-r.
1373T}
1374_
1375T{
1376\fB\-c\fP
1377T}      T{
1378For each entry in A, report the number of hits in B while restricting to \-f. Reports 0 for A entries that have no overlap with B. Restricted by \-f and \-r.
1379T}
1380_
1381T{
1382\fB\-v\fP
1383T}      T{
1384Only report those entries in A that have no overlap in B. Restricted by \-f and \-r.
1385T}
1386_
1387T{
1388\fB\-f\fP
1389T}      T{
1390Minimum overlap required as a fraction of A. Default is 1E\-9 (i.e. 1bp).
1391T}
1392_
1393T{
1394\fB\-r\fP
1395T}      T{
1396Require that the fraction of overlap be reciprocal for A and B. In other words, if \-f is 0.90 and \-r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
1397T}
1398_
1399T{
1400\fB\-s\fP
1401T}      T{
1402Force "strandedness". That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
1403T}
1404_
1405T{
1406\fB\-split\fP
1407T}      T{
1408Treat "split" BAM (i.e., having an "N" CIGAR operation) or BED12 entries as distinct BED intervals.
1409T}
1410_
1411.TE
1412.SS 5.1.2 Default behavior
1413.sp
1414By default, if an overlap is found, \fBbedtools intersect\fP reports the shared interval between the two
1415overlapping features.
1416.sp
1417For example:
1418.INDENT 0.0
1419.INDENT 3.5
1420.sp
1421.nf
1422.ft C
1423cat A.bed
1424chr1  10  20
1425chr1  30  40
1426
1427cat B.bed
1428chr1  15   20
1429
1430bedtools intersect \-a A.bed \-b B.bed
1431chr1  15   20
1432.ft P
1433.fi
1434.UNINDENT
1435.UNINDENT
1436.SS 5.1.3 (\-wa) Reporting the original A feature
1437.sp
1438Instead, one can force \fBbedtools intersect\fP to report the \fIoriginal\fP \fB"A"\fP feature when an overlap is found. As
1439shown below, the entire "A" feature is reported, not just the portion that overlaps with the "B" feature.
1440.sp
1441For example:
1442.INDENT 0.0
1443.INDENT 3.5
1444.sp
1445.nf
1446.ft C
1447cat A.bed
1448chr1  10  20
1449chr1  30   40
1450
1451cat B.bed
1452chr1  15  20
1453
1454bedtools intersect \-a A.bed \-b B.bed \-wa
1455chr1  10   20
1456.ft P
1457.fi
1458.UNINDENT
1459.UNINDENT
1460.SS 5.1.4 (\-wb) Reporting the original B feature
1461.sp
1462Similarly, one can force \fBbedtools intersect\fP to report the \fIoriginal\fP \fB"B"\fP feature when an overlap is found. If
1463just \-wb is used, the overlapping portion of A will be reported followed by the \fIoriginal\fP \fB"B"\fP\&. If both \-wa
1464and \-wb are used, the \fIoriginals\fP of both \fB"A"\fP and \fB"B"\fP will be reported.
1465.sp
1466For example (\-wb alone):
1467::
1468For example:
1469.INDENT 0.0
1470.INDENT 3.5
1471.sp
1472.nf
1473.ft C
1474cat A.bed
1475chr1  10  20
1476chr1  30  40
1477
1478cat B.bed
1479chr1  15   20
1480
1481bedtools intersect \-a A.bed \-b B.bed \-wb
1482chr1  15  20  chr 15  20
1483.ft P
1484.fi
1485.UNINDENT
1486.UNINDENT
1487.sp
1488Now \-wa and \-wb:
1489.INDENT 0.0
1490.INDENT 3.5
1491.sp
1492.nf
1493.ft C
1494cat A.bed
1495chr1  10  20
1496chr1  30  40
1497
1498cat B.bed
1499chr1  15   20
1500
1501bedtools intersect \-a A.bed \-b B.bed \-wa \-wb
1502chr1  10  20  chr 15  20
1503.ft P
1504.fi
1505.UNINDENT
1506.UNINDENT
1507.SS 5.1.5 (\-u) Reporting the presence of \fIat least one\fP overlapping feature
1508.sp
1509Frequently a feature in "A" will overlap with multiple features in "B". By default, \fBbedtools intersect\fP will
1510report each overlap as a separate output line. However, one may want to simply know that there is at
1511least one overlap (or none). When one uses the \-u option, "A" features that overlap with one or more
1512"B" features are reported once. Those that overlap with no "B" features are not reported at all.
1513.sp
1514For example (\fIwithout\fP \-u):
1515.INDENT 0.0
1516.INDENT 3.5
1517.sp
1518.nf
1519.ft C
1520cat A.bed
1521chr1  10  20
1522chr1  30  40
1523
1524cat B.bed
1525chr1  15   20
1526chr1  18   25
1527
1528bedtools intersect \-a A.bed \-b B.bed \-wb
1529chr1  10  20  chr 15  20
1530chr1  10  20  chr 18  25
1531.ft P
1532.fi
1533.UNINDENT
1534.UNINDENT
1535.sp
1536For example (\fIwith\fP \-u):
1537.INDENT 0.0
1538.INDENT 3.5
1539.sp
1540.nf
1541.ft C
1542cat A.bed
1543chr1  10  20
1544chr1  30  40
1545
1546cat B.bed
1547chr1  15   20
1548chr1  18   25
1549
1550bedtools intersect \-a A.bed \-b B.bed \-u
1551chr1  10  20
1552.ft P
1553.fi
1554.UNINDENT
1555.UNINDENT
1556.SS 5.1.6 (\-c) Reporting the number of overlapping features
1557.sp
1558The \-c option reports a column after each "A" feature indicating the \fInumber\fP (0 or more) of overlapping
1559features found in "B". Therefore, \fIeach feature in A is reported once\fP\&.
1560.sp
1561For example:
1562.INDENT 0.0
1563.INDENT 3.5
1564.sp
1565.nf
1566.ft C
1567cat A.bed
1568chr1    10    20
1569chr1    30    40
1570
1571cat B.bed
1572chr1    15  20
1573chr1    18  25
1574
1575bedtools intersect \-a A.bed \-b B.bed \-u
1576chr1    10    20    2
1577chr1    30    40    0
1578.ft P
1579.fi
1580.UNINDENT
1581.UNINDENT
1582.SS 5.1.7 (\-v) Reporting the absence of any overlapping features
1583.sp
1584There will likely be cases where you\(aqd like to know which "A" features do not overlap with any of the
1585"B" features. Perhaps you\(aqd like to know which SNPs don\(aqt overlap with any gene annotations. The \-v
1586(an homage to "grep \-v") option will only report those "A" features that have no overlaps in "B".
1587.sp
1588For example:
1589.INDENT 0.0
1590.INDENT 3.5
1591.sp
1592.nf
1593.ft C
1594cat A.bed
1595chr1  10  20
1596chr1  30  40
1597
1598cat B.bed
1599chr1  15  20
1600
1601bedtools intersect \-a A.bed \-b B.bed \-v
1602chr1  30   40
1603.ft P
1604.fi
1605.UNINDENT
1606.UNINDENT
1607.SS 5.1.8 (\-f) Requiring a minimal overlap fraction
1608.sp
1609By default, \fBbedtools intersect\fP will report an overlap between A and B so long as there is at least one base
1610pair is overlapping. Yet sometimes you may want to restrict reported overlaps between A and B to cases
1611where the feature in B overlaps at least X% (e.g. 50%) of the A feature. The \-f option does exactly
1612this.
1613.sp
1614For example (note that the second B entry is not reported):
1615.INDENT 0.0
1616.INDENT 3.5
1617.sp
1618.nf
1619.ft C
1620cat A.bed
1621chr1 100 200
1622
1623cat B.bed
1624chr1 130 201
1625chr1 180 220
1626
1627bedtools intersect \-a A.bed \-b B.bed \-f 0.50 \-wa \-wb
1628chr1 100 200 chr1 130 201
1629.ft P
1630.fi
1631.UNINDENT
1632.UNINDENT
1633.SS 5.1.9 (\-r, combined with \-f)Requiring reciprocal minimal overlap fraction
1634.sp
1635Similarly, you may want to require that a minimal fraction of both the A and the B features is
1636overlapped. For example, if feature A is 1kb and feature B is 1Mb, you might not want to report the
1637overlap as feature A can overlap at most 1% of feature B. If one set \-f to say, 0.02, and one also
1638enable the \-r (reciprocal overlap fraction required), this overlap would not be reported.
1639.sp
1640For example (note that the second B entry is not reported):
1641.INDENT 0.0
1642.INDENT 3.5
1643.sp
1644.nf
1645.ft C
1646cat A.bed
1647chr1 100 200
1648
1649cat B.bed
1650chr1 130 201
1651chr1 130 200000
1652
1653bedtools intersect \-a A.bed \-b B.bed \-f 0.50 \-r \-wa \-wb
1654chr1 100 200 chr1 130 201
1655.ft P
1656.fi
1657.UNINDENT
1658.UNINDENT
1659.SS 5.1.10 (\-s)Enforcing "strandedness"
1660.sp
1661By default, \fBbedtools intersect\fP will report overlaps between features even if the features are on opposite
1662strands. However, if strand information is present in both BED files and the "\-s" option is used, overlaps
1663will only be reported when features are on the same strand.
1664.sp
1665For example (note that the second B entry is not reported):
1666.INDENT 0.0
1667.INDENT 3.5
1668.sp
1669.nf
1670.ft C
1671cat A.bed
1672chr1 100 200 a1 100 +
1673
1674cat B.bed
1675chr1 130 201 b1 100 \-
1676chr1 130 201 b2 100 +
1677
1678bedtools intersect \-a A.bed \-b B.bed \-wa \-wb \-s
1679chr1 100 200 a1 100 + chr1 130 201 b2 100 +
1680.ft P
1681.fi
1682.UNINDENT
1683.UNINDENT
1684.SS 5.1.11 (\-abam)Default behavior when using BAM input
1685.sp
1686When comparing alignments in BAM format (\fB\-abam\fP) to features in BED format (\fB\-b\fP), \fBbedtools intersect\fP
1687will, \fBby default\fP, write the output in BAM format. That is, each alignment in the BAM file that meets
1688the user\(aqs criteria will be written (to standard output) in BAM format. This serves as a mechanism to
1689create subsets of BAM alignments are of biological interest, etc. Note that only the mate in the BAM
1690alignment is compared to the BED file. Thus, if only one end of a paired\-end sequence overlaps with a
1691feature in B, then that end will be written to the BAM output. By contrast, the other mate for the
1692pair will not be written. One should use \fBpairToBed(Section 5.2)\fP if one wants each BAM alignment
1693for a pair to be written to BAM output.
1694.sp
1695For example:
1696.INDENT 0.0
1697.INDENT 3.5
1698.sp
1699.nf
1700.ft C
1701bedtools intersect \-abam reads.unsorted.bam \-b simreps.bed | samtools view \- | head \-3
1702
1703BERTHA_0001:3:1:15:1362#0 99 chr4 9236904 0 50M = 9242033 5 1 7 9
1704AGACGTTAACTTTACACACCTCTGCCAAGGTCCTCATCCTTGTATTGAAG W c T U ] b \e g c e g X g f c b f c c b d d g g V Y P W W _
1705\ec\(gadcdabdfW^a^gggfgd XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:19 X1:i:2 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
1706BERTHA _0001:3:1:16:994#0 83 chr6 114221672 37 25S6M1I11M7S =
1707114216196 \-5493 G A A A G G C C A G A G T A T A G A A T A A A C A C A A C A A T G T C C A A G G T A C A C T G T T A
1708gffeaaddddggggggedgcgeggdegggggffcgggggggegdfggfgf XT:A:M NM:i:3 SM:i:37 AM:i:37 XM:i:2 X O : i :
17091 XG:i:1 MD:Z:6A6T3
1710BERTHA _0001:3:1:16:594#0 147 chr8 43835330 0 50M =
171143830893 \-4487 CTTTGGGAGGGCTTTGTAGCCTATCTGGAAAAAGGAAATATCTTCCCATG U
1712\ee^bgeTdg_Kgcg\(gaggeggg_gggggggggddgdggVg\egWdfgfgff XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:10 X1:i:7 X M : i :
17132 XO:i:0 XG:i:0 MD:Z:1A2T45
1714.ft P
1715.fi
1716.UNINDENT
1717.UNINDENT
1718.SS 5.1.12 (\-bed)Output BED format when using BAM input
1719.sp
1720When comparing alignments in BAM format (\fB\-abam\fP) to features in BED format (\fB\-b\fP), \fBbedtools intersect\fP
1721will \fBoptionally\fP write the output in BED format. That is, each alignment in the BAM file is converted
1722to a 6 column BED feature and if overlaps are found (or not) based on the user\(aqs criteria, the BAM
1723alignment will be reported in BED format. The BED "name" field is comprised of the RNAME field in
1724the BAM alignment. If mate information is available, the mate (e.g., "/1" or "/2") field will be
1725appended to the name. The "score" field is the mapping quality score from the BAM alignment.
1726.sp
1727For example:
1728.INDENT 0.0
1729.INDENT 3.5
1730.sp
1731.nf
1732.ft C
1733bedtools intersect \-abam reads.unsorted.bam \-b simreps.bed \-bed | head \-20
1734
1735chr4  9236903   9236953   BERTHA_0001:3:1:15:1362#0/1  0   +
1736chr6  114221671 114221721 BERTHA_0001:3:1:16:994#0/1   37  \-
1737chr8  43835329  43835379  BERTHA_0001:3:1:16:594#0/2   0   \-
1738chr4  49110668  49110718  BERTHA_0001:3:1:31:487#0/1   23  +
1739chr19 27732052  27732102  BERTHA_0001:3:1:32:890#0/2   46  +
1740chr19 27732012  27732062  BERTHA_0001:3:1:45:1135#0/1  37  +
1741chr10 117494252 117494302 BERTHA_0001:3:1:68:627#0/1   37  \-
1742chr19 27731966  27732016  BERTHA_0001:3:1:83:931#0/2   9   +
1743chr8  48660075  48660125  BERTHA_0001:3:1:86:608#0/2   37  \-
1744chr9  34986400  34986450  BERTHA_0001:3:1:113:183#0/2  37  \-
1745chr10 42372771  42372821  BERTHA_0001:3:1:128:1932#0/1 3   \-
1746chr19 27731954  27732004  BERTHA_0001:3:1:130:1402#0/2 0   +
1747chr10 42357337  42357387  BERTHA_0001:3:1:137:868#0/2  9   +
1748chr1  159720631 159720681 BERTHA_0001:3:1:147:380#0/2  37  \-
1749chrX  58230155  58230205  BERTHA_0001:3:1:151:656#0/2  37  \-
1750chr5  142612746 142612796 BERTHA_0001:3:1:152:1893#0/1 37  \-
1751chr9  71795659  71795709  BERTHA_0001:3:1:177:387#0/1  37  +
1752chr1  106240854 106240904 BERTHA_0001:3:1:194:928#0/1  37  \-
1753chr4  74128456  74128506  BERTHA_0001:3:1:221:724#0/1  37  \-
1754chr8  42606164  42606214  BERTHA_0001:3:1:244:962#0/1  37  +
1755.ft P
1756.fi
1757.UNINDENT
1758.UNINDENT
1759.SS 5.1.13 (\-split)Reporting overlaps with spliced alignments or blocked BED features
1760.sp
1761As described in section 1.3.19, bedtools intersect will, by default, screen for overlaps against the entire span
1762of a spliced/split BAM alignment or blocked BED12 feature. When dealing with RNA\-seq reads, for
1763example, one typically wants to only screen for overlaps for the portions of the reads that come from
1764exons (and ignore the interstitial intron sequence). The \fB\-split\fP command allows for such overlaps to be
1765performed.
1766.sp
1767For example, the diagram below illustrates the \fIdefault\fP behavior. The blue dots represent the "split/
1768spliced" portion of the alignment (i.e., CIGAR "N" operation). In this case, the two exon annotations
1769are reported as overlapping with the "split" BAM alignment, but in addition, a third feature that
1770overlaps the "split" portion of the alignment is also reported.
1771.INDENT 0.0
1772.INDENT 3.5
1773.sp
1774.nf
1775.ft C
1776Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1777
1778Exons       \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-                                       \-\-\-\-\-\-\-\-\-\-
1779
1780BED/BAM  A     ************.......................................****
1781
1782BED File B  ^^^^^^^^^^^^^^^                     ^^^^^^^^          ^^^^^^^^^^
1783
1784Result      ===============                     ========          ==========
1785.ft P
1786.fi
1787.UNINDENT
1788.UNINDENT
1789.sp
1790In contrast, when using the \fB\-split\fP option, only the exon overlaps are reported.
1791.INDENT 0.0
1792.INDENT 3.5
1793.sp
1794.nf
1795.ft C
1796Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1797
1798Exons       \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-                                       \-\-\-\-\-\-\-\-\-\-
1799
1800BED/BAM  A     ************.......................................****
1801
1802BED File B  ^^^^^^^^^^^^^^^                     ^^^^^^^^          ^^^^^^^^^^
1803
1804Result      ===============                                       ==========
1805.ft P
1806.fi
1807.UNINDENT
1808.UNINDENT
1809.SS 5.2 pairToBed
1810.sp
1811\fBpairToBed\fP compares each end of a BEDPE feature or a paired\-end BAM alignment to a feature file in
1812search of overlaps.
1813.sp
1814\fBNOTE: pairToBed requires that the BAM file is sorted/grouped by the read name. This
1815allows pairToBed to extract correct alignment coordinates for each end based on their
1816respective CIGAR strings. It also assumes that the alignments for a given pair come in
1817groups of twos. There is not yet a standard method for reporting multiple alignments
1818using BAM. pairToBed will fail if an aligner does not report alignments in pairs.\fP
1819.SS 5.2.1 Usage and option summary
1820.sp
1821\fBUsage:\fP
1822.INDENT 0.0
1823.INDENT 3.5
1824.sp
1825.nf
1826.ft C
1827pairToBed [OPTIONS] [\-a <BEDPE> || \-abam <BAM>] \-b <BED/GFF/VCF>
1828.ft P
1829.fi
1830.UNINDENT
1831.UNINDENT
1832.TS
1833center;
1834|l|l|.
1835_
1836T{
1837Option
1838T}      T{
1839Description
1840T}
1841_
1842T{
1843\fB\-a\fP
1844T}      T{
1845BEDPE file A. Each feature in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe. Output will be in BEDPE format.
1846T}
1847_
1848T{
1849\fB\-b\fP
1850T}      T{
1851BED file B. Use "stdin" if passing B with a UNIX pipe.
1852T}
1853_
1854T{
1855\fB\-abam\fP
1856T}      T{
1857BAM file A. Each end of each BAM alignment in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe: For example: samtools view ?Cb <BAM> | pairToBed ?Cabam stdin ?Cb genes.bed | samtools view \-
1858T}
1859_
1860T{
1861\fB\-ubam\fP
1862T}      T{
1863Write uncompressed BAM output. The default is write compressed BAM output.
1864T}
1865_
1866T{
1867\fB\-bedpe\fP
1868T}      T{
1869When using BAM input (\-abam), write output as BEDPE. The default is to write output in BAM when using \-abam. For example: pairToBed ?Cabam reads.bam ?Cb genes.bed ?Cbedpe
1870T}
1871_
1872T{
1873\fB\-ed\fP
1874T}      T{
1875Use BAM total edit distance (NM tag) for BEDPE score. Default for BEDPE is to use the \fIminimum\fP of the two mapping qualities for the pair. When \-ed is used the \fItotal\fP edit distance from the two mates is reported as the score.
1876T}
1877_
1878T{
1879\fB\-f\fP
1880T}      T{
1881Minimum overlap required as a fraction of A. Default is 1E\-9 (i.e. 1bp).
1882T}
1883_
1884T{
1885\fB\-s\fP
1886T}      T{
1887Force "strandedness". That is, only report hits in B that overlap A on the \fBsame\fP strand. By default, overlaps are reported without respect to strand.
1888T}
1889_
1890T{
1891\fB\-type\fP
1892T}      T{
1893Approach to reporting overlaps between BEDPE and BED.
1894.INDENT 0.0
1895.INDENT 3.5
1896.INDENT 0.0
1897.INDENT 3.5
1898\fBeither\-\fP Report overlaps if either end of A overlaps B.
1899.INDENT 0.0
1900.IP \(bu 2
1901\fIDefault\fP
1902.UNINDENT
1903.sp
1904\fBneither\-\fP Report A if neither end of A overlaps B.
1905.sp
1906\fBxor\-\fP Report overlaps if one and only one end of A overlaps B.
1907.sp
1908\fBboth\-\fP Report overlaps if both ends of A overlap B.
1909.sp
1910\fBnotboth\-\fP Report overlaps if neither end or one and only one end of A overlap B.
1911.sp
1912\fBispan\-\fP Report overlaps between [end1, start2] of A and B.
1913.INDENT 0.0
1914.IP \(bu 2
1915Note: If chrom1 <> chrom2, entry is ignored.
1916.UNINDENT
1917.UNINDENT
1918.UNINDENT
1919.sp
1920\fBospan\-\fP Report overlaps between [start1, end2] of A and B.
1921.INDENT 0.0
1922.INDENT 3.5
1923.INDENT 0.0
1924.IP \(bu 2
1925Note: If chrom1 <> chrom2, entry is ignored.
1926.UNINDENT
1927.sp
1928\fBnotispan\-\fP  Report A if ispan of A doesn\(aqt overlap B.
1929\- Note: If chrom1 <> chrom2, entry is ignored.
1930.sp
1931\fBnotospan\-\fP  Report A if ospan of A doesn\(aqt overlap B.
1932\- Note: If chrom1 <> chrom2, entry is ignored.
1933.UNINDENT
1934.UNINDENT
1935.UNINDENT
1936.UNINDENT
1937T}
1938_
1939.TE
1940.SS 5.2.2 Default behavior
1941.sp
1942By default, a BEDPE / BAM feature will be reported if \fIeither\fP end overlaps a feature in the BED file.
1943In the example below, the left end of the pair overlaps B yet the right end does not. Thus, BEDPE/
1944BAM A is reported since the default is to report A if either end overlaps B.
1945.sp
1946Default: Report A if \fIeither\fP end overlaps B.
1947.INDENT 0.0
1948.INDENT 3.5
1949.sp
1950.nf
1951.ft C
1952Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1953
1954BEDPE/BAM A         *****.................................*****
1955
1956BED File B         ^^^^^^^^                                          ^^^^^^
1957
1958Result              =====.................................=====
1959.ft P
1960.fi
1961.UNINDENT
1962.UNINDENT
1963.SS 5.2.3 (\-type)Optional overlap requirements
1964.sp
1965Using then \fB\-type\fP option, \fBpairToBed\fP provides several other overlap requirements for controlling how
1966overlaps between BEDPE/BAM A and BED B are reported. The examples below illustrate how each
1967option behaves.
1968.sp
1969\fB\-type both\fP: Report A only if \fIboth\fP ends overlap B.
1970.INDENT 0.0
1971.INDENT 3.5
1972.sp
1973.nf
1974.ft C
1975Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1976
1977BEDPE/BAM A         *****.................................*****
1978
1979BED File B         ^^^^^^^^                                          ^^^^^^
1980
1981Result
1982
1983
1984
1985BEDPE/BAM A         *****.................................*****
1986
1987BED File B         ^^^^^^^^                                   ^^^^^^
1988
1989Result              =====.................................=====
1990.ft P
1991.fi
1992.UNINDENT
1993.UNINDENT
1994.sp
1995\fB\-type neither\fP: Report A only if \fIneither\fP end overlaps B.
1996.INDENT 0.0
1997.INDENT 3.5
1998.sp
1999.nf
2000.ft C
2001Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2002
2003BEDPE/BAM A         *****.................................*****
2004
2005BED File B         ^^^^^^^^                                          ^^^^^^
2006
2007Result
2008
2009
2010
2011BEDPE/BAM A         *****.................................*****
2012
2013BED File B   ^^^^                                                  ^^^^^^
2014
2015Result              =====.................................=====
2016.ft P
2017.fi
2018.UNINDENT
2019.UNINDENT
2020.sp
2021\fB\-type xor\fP: Report A only if \fIone and only one\fP end overlaps B.
2022.INDENT 0.0
2023.INDENT 3.5
2024.sp
2025.nf
2026.ft C
2027Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2028
2029BEDPE/BAM A         *****.................................*****
2030
2031BED File B         ^^^^^^^^                                          ^^^^^^
2032
2033Result              =====.................................=====
2034
2035
2036
2037BEDPE/BAM A         *****.................................*****
2038
2039BED File B         ^^^^                                   ^^^^^^
2040
2041Result
2042.ft P
2043.fi
2044.UNINDENT
2045.UNINDENT
2046.sp
2047\fB\-type notboth\fP: Report A only if \fIneither end\fP \fBor\fP \fIone and only one\fP end overlaps B. Thus "notboth"
2048includes what would be reported by "neither" and by "xor".
2049.INDENT 0.0
2050.INDENT 3.5
2051.sp
2052.nf
2053.ft C
2054Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2055
2056BEDPE/BAM A         *****.................................*****
2057
2058BED File B         ^^^^^^^^                                          ^^^^^^
2059
2060Result              =====.................................=====
2061
2062
2063
2064BEDPE/BAM A         *****.................................*****
2065
2066BED File B     ^^^                                               ^^^^^^
2067
2068Result              =====.................................=====
2069
2070
2071
2072BEDPE/BAM A         *****.................................*****
2073
2074BED File B         ^^^^                                   ^^^^^^
2075
2076Result
2077.ft P
2078.fi
2079.UNINDENT
2080.UNINDENT
2081.sp
2082\fB\-type ispan\fP: Report A if it\(aqs "\fIinner span\fP" overlaps B. Applicable only to intra\-chromosomal features.
2083.INDENT 0.0
2084.INDENT 3.5
2085.sp
2086.nf
2087.ft C
2088Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2089
2090              Inner span |\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|
2091
2092BEDPE/BAM A         *****.................................*****
2093
2094BED File B                         ^^^^^^^^
2095
2096Result              =====.................................=====
2097
2098
2099
2100BEDPE/BAM A         =====.................................=====
2101
2102BED File B         ====
2103
2104Result
2105.ft P
2106.fi
2107.UNINDENT
2108.UNINDENT
2109.sp
2110\fB\-type ospan\fP: Report A if it\(aqs "\fIouter span\fP" overlaps B. Applicable only to intra\-chromosomal features.
2111.INDENT 0.0
2112.INDENT 3.5
2113.sp
2114.nf
2115.ft C
2116Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2117
2118        Outer span  |\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|
2119
2120BEDPE/BAM A         *****.................................*****
2121
2122BED File B             ^^^^^^^^^^^^
2123
2124Result              =====.................................=====
2125
2126
2127
2128BEDPE/BAM A         *****.................................*****
2129
2130BED File B     ^^^^
2131
2132Result
2133.ft P
2134.fi
2135.UNINDENT
2136.UNINDENT
2137.sp
2138\fB\-type notispan\fP: Report A only if it\(aqs "\fIinner span\fP" does not overlap B. Applicable only to intrachromosomal
2139features.
2140.INDENT 0.0
2141.INDENT 3.5
2142.sp
2143.nf
2144.ft C
2145Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2146
2147              Inner span |\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|
2148
2149BEDPE/BAM A         *****.................................*****
2150
2151BED File B                         ^^^^^^^^
2152
2153Result
2154
2155
2156
2157BEDPE/BAM A         *****.................................*****
2158
2159BED File B         ^^^^
2160
2161Result              =====.................................=====
2162.ft P
2163.fi
2164.UNINDENT
2165.UNINDENT
2166.sp
2167\fB\-type notospan\fP: Report A if it\(aqs "\fIouter span\fP" overlaps B. Applicable only to intra\-chromosomal
2168features.
2169.INDENT 0.0
2170.INDENT 3.5
2171.sp
2172.nf
2173.ft C
2174Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2175
2176        Outer span  |\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-|
2177
2178BEDPE/BAM A         *****.................................*****
2179
2180BED File B             ^^^^^^^^^^^^
2181
2182Result
2183
2184
2185
2186BEDPE/BAM A         *****.................................*****
2187
2188BED File B     ^^^^
2189
2190Result              =====.................................=====
2191.ft P
2192.fi
2193.UNINDENT
2194.UNINDENT
2195.SS 5.2.4 (\-f)Requiring a minimum overlap fraction
2196.sp
2197By default, \fBpairToBed\fP will report an overlap between A and B so long as there is at least one base
2198pair is overlapping on either end. Yet sometimes you may want to restrict reported overlaps between A
2199and B to cases where the feature in B overlaps at least X% (e.g. 50%) of A. The \fB?Cf\fP option does exactly
2200this. The \fB\-f\fP option may also be combined with the \-type option for additional control. For example,
2201combining \fB\-f 0.50\fP with \fB\-type both\fP requires that both ends of A have at least 50% overlap with a
2202feature in B.
2203.sp
2204For example, report A only at least 50% of one of the two ends is overlapped by B.
2205.INDENT 0.0
2206.INDENT 3.5
2207.sp
2208.nf
2209.ft C
2210pairToBed \-a A.bedpe \-b B.bed \-f 0.5
2211
2212
2213Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2214
2215BEDPE/BAM A         *****.................................*****
2216
2217BED File B         ^^                                           ^^^^^^
2218
2219Result
2220
2221
2222
2223BEDPE/BAM A         *****.................................*****
2224
2225BED File B         ^^^^                                         ^^^^^^
2226
2227Result              =====.................................=====
2228.ft P
2229.fi
2230.UNINDENT
2231.UNINDENT
2232.SS 5.2.5 (\-s)Enforcing "strandedness"
2233.sp
2234By default, \fBpairToBed\fP will report overlaps between features even if the features are on opposing
2235strands. However, if strand information is present in both files and the \fB"\-s"\fP option is used, overlaps will
2236only be reported when features are on the same strand.
2237.sp
2238For example, report A only at least 50% of one of the two ends is overlapped by B.
2239.INDENT 0.0
2240.INDENT 3.5
2241.sp
2242.nf
2243.ft C
2244pairToBed \-a A.bedpe \-b B.bed \-s
2245
2246
2247
2248Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2249
2250BEDPE/BAM A         >>>>>.................................<<<<<
2251
2252BED File B         <<                                           >>>>>
2253
2254Result
2255
2256
2257
2258BEDPE/BAM A         >>>>>.................................<<<<<
2259
2260BED File B         >>                                          >>>>>
2261
2262Result              >>>>>.................................<<<<<
2263.ft P
2264.fi
2265.UNINDENT
2266.UNINDENT
2267.SS 5.2.6 (\-abam)Default is to write BAM output when using BAM input
2268.sp
2269When comparing \fIpaired\fP alignments in BAM format (\fB\-abam\fP) to features in BED format (\fB\-b\fP),
2270\fBpairToBed\fP will , by default, write the output in BAM format. That is, each alignment in the BAM
2271file that meets the user\(aqs criteria will be written (to standard output) in BAM format. This serves as a
2272mechanism to create subsets of BAM alignments are of biological interest, etc. Note that both
2273alignments for each aligned pair will be written to the BAM output.
2274.sp
2275For example:
2276.INDENT 0.0
2277.INDENT 3.5
2278.sp
2279.nf
2280.ft C
2281pairToBed ?Cabam pairedReads.bam ?Cb simreps.bed | samtools view \- | head \-4
2282
2283JOBU_0001:3:1:4:1060#0 99 chr10 42387928 29 50M = 42393091 5 2 1 3
2284AA A A A C G G A A T T A T C G A A T G G A A T C G A A G A G A A T C T T C G A A C G G A C C C G A
2285dcgggggfbgfgdgggggggfdfgggcggggfcggcggggggagfgbggc XT:A:R NM:i:5 SM:i:0 AM:i:0 X0:i:3 X 1 : i :
22863 XM:i:5 XO:i:0 XG:i:0 MD:Z:0T0C33A5T4T3
2287JOBU_0001:3:1:4:1060#0 147 chr10 42393091 0 50M = 42387928 \- 5 2 1 3
2288AAATGGAATCGAATGGAATCAACATCAAATGGAATCAAATGGAATCATTG K g d c g g d e c d g
2289\ed\(gaggfcgcggffcgggc^cgfgccgggfc^gcdgg\ebg XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:3 X1:i:13 XM:i:2 X O : i :
22900 XG:i:0 MD:Z:21T14G13
2291JOBU_0001:3:1:8:446#0 99 chr10 42388091 9 50M = 42392738 4 6 9 7
2292GAATCGACTGGAATCATCATCGGATGGAAATGAATGGAATAATCATCGAA f _ O f f \(ga ] I e Y f f \(ga f f e d d c f e f c P \(ga c _ W \e \e R _ ]
2293_BBBBBBBBBBBBBBBB XT:A:U NM:i:4 SM:i:0 AM:i:0 X0:i:1 X1:i:3 XM:i:4 XO:i:0 XG:i:0 M D : Z :
22947A22C9C2T6
2295JOBU_0001:3:1:8:446#0 147 chr10 42392738 9 50M = 42388091 \- 4 6 9 7
2296TTATCGAATGCAATCGAATGGAATTATCGAATGCAATCGAATAGAATCAT df^ffec_JW[\(gaMWceRec\(ga\(gafee\(gadcecfeeZae\(gac]
2297f^cNeecfccf^ XT:A:R NM:i:1 SM:i:0 AM:i:0 X0:i:2 X1:i:2 XM:i:1 XO:i:0 XG:i:0 MD:Z:38A11
2298.ft P
2299.fi
2300.UNINDENT
2301.UNINDENT
2302.SS 5.2.7 (\-bedpe)Output BEDPE format when using BAM input
2303.sp
2304When comparing \fIpaired\fP alignments in BAM format (\fB\-abam\fP) to features in BED format (\fB\-b\fP),
2305\fBpairToBed\fP will optionally write the output in BEDPE format. That is, each alignment in the BAM
2306file is converted to a 10 column BEDPE feature and if overlaps are found (or not) based on the user\(aqs
2307criteria, the BAM alignment will be reported in BEDPE format. The BEDPE "name" field is comprised
2308of the RNAME field in the BAM alignment. The "score" field is the mapping quality score from the
2309BAM alignment.
2310.sp
2311For example:
2312.INDENT 0.0
2313.INDENT 3.5
2314.sp
2315.nf
2316.ft C
2317pairToBed ?Cabam pairedReads.bam ?Cb simreps.bed \-bedpe | head \-5
2318chr10 42387927     42387977    chr10   42393090   42393140
2319      JOBU_0001:3:1:4:1060#0   29      +     \-
2320chr10 42388090 42388140        chr10   42392737   42392787
2321      JOBU_0001:3:1:8:446#0    9       +     \-
2322chr10 42390552 42390602        chr10   42396045   42396095
2323      JOBU_0001:3:1:10:1865#0  9       +     \-
2324chrX  139153741 139153791      chrX    139159018  139159068
2325      JOBU_0001:3:1:14:225#0   37      +     \-
2326chr4  9236903 9236953          chr4    9242032    9242082
2327      JOBU_0001:3:1:15:1362#0  0       +     \-
2328.ft P
2329.fi
2330.UNINDENT
2331.UNINDENT
2332.SS 5.3 pairToPair
2333.sp
2334\fBpairToPair\fP compares two BEDPE files in search of overlaps where each end of a BEDPE feature in A
2335overlaps with the ends of a feature in B. For example, using pairToPair, one could screen for the exact
2336same discordant paired\-end alignment in two files. This could suggest (among other things) that the
2337discordant pair suggests the same structural variation in each file/sample.
2338.SS 5.3.1 Usage and option summary
2339.sp
2340\fBUsage:\fP
2341.INDENT 0.0
2342.INDENT 3.5
2343.sp
2344.nf
2345.ft C
2346pairToPair [OPTIONS] \-a <BEDPE> \-b <BEDPE>
2347.ft P
2348.fi
2349.UNINDENT
2350.UNINDENT
2351.TS
2352center;
2353|l|l|.
2354_
2355T{
2356Option
2357T}      T{
2358Description
2359T}
2360_
2361T{
2362\fB\-a\fP
2363T}      T{
2364BEDPE file A. Each feature in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe.
2365T}
2366_
2367T{
2368\fB\-b\fP
2369T}      T{
2370BEDPE file B. Use "stdin" if passing B with a UNIX pipe.
2371T}
2372_
2373T{
2374\fB\-f\fP
2375T}      T{
2376Minimum overlap required as a fraction of A. Default is 1E\-9 (i.e. 1bp).
2377T}
2378_
2379T{
2380\fB\-is\fP
2381T}      T{
2382Force "strandedness". That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
2383T}
2384_
2385T{
2386\fB\-type\fP
2387T}      T{
2388.INDENT 0.0
2389.INDENT 3.5
2390Approach to reporting overlaps between BEDPE and BED.
2391.UNINDENT
2392.UNINDENT
2393.nf
2394\fBeither\fP Report overlaps if either ends of A overlap B.
2395.fi
2396.sp
2397.INDENT 0.0
2398.INDENT 3.5
2399.nf
2400\fBneither\fP Report A if neither end of A overlaps B.
2401.fi
2402.sp
2403.nf
2404\fBboth\fP Report overlaps if both ends of A overlap B.   \-\fIDefault behavior.\fP
2405.fi
2406.sp
2407.UNINDENT
2408.UNINDENT
2409T}
2410_
2411.TE
2412.SS 5.3.2 Default behavior
2413.sp
2414By default, a BEDPE feature from A will be reported if \fIboth\fP ends overlap a feature in the BEDPE B
2415file. If strand information is present for the two BEDPE files, it will be further required that the
2416overlaps on each end be on the same strand. This way, an otherwise overlapping (in terms of genomic
2417locations) F/R alignment will not be matched with a R/R alignment.
2418.sp
2419Default: Report A if \fIboth\fP ends overlaps B.
2420.INDENT 0.0
2421.INDENT 3.5
2422.sp
2423.nf
2424.ft C
2425Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2426
2427BEDPE/BAM A         *****.................................*****
2428
2429BED File B         ^^^^^^^^                                          ^^^^^^
2430
2431Result              =====.................................=====
2432.ft P
2433.fi
2434.UNINDENT
2435.UNINDENT
2436.sp
2437Default when strand information is present in both BEDPE files: Report A if \fIboth\fP ends overlaps B \fIon
2438the same strands\fP\&.
2439.INDENT 0.0
2440.INDENT 3.5
2441.sp
2442.nf
2443.ft C
2444Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2445
2446BEDPE A         >>>>>.................................>>>>>
2447
2448BEDPE B            <<<<<.............................>>>>>
2449
2450Result
2451
2452
2453
2454BEDPE A         >>>>>.................................>>>>>
2455
2456BEDPE B            >>>>>.............................>>>>>
2457
2458Result          >>>>>.................................>>>>>
2459.ft P
2460.fi
2461.UNINDENT
2462.UNINDENT
2463.SS 5.3.3 (\-type neither)Optional overlap requirements
2464.sp
2465Using then \fB\-type neither, pairToPair\fP will only report A if \fIneither\fP end overlaps with a BEDPE
2466feature in B.
2467.sp
2468\fB\-type neither\fP: Report A only if \fIneither\fP end overlaps B.
2469.INDENT 0.0
2470.INDENT 3.5
2471.sp
2472.nf
2473.ft C
2474Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2475
2476BEDPE/BAM A         *****.................................*****
2477
2478BED File B         ^^^^^^^^......................................^^^^^^
2479
2480Result
2481
2482
2483
2484BEDPE/BAM A         *****.................................*****
2485
2486BED File B    ^^^^................................................^^^^^^
2487
2488Result              =====.................................=====
2489.ft P
2490.fi
2491.UNINDENT
2492.UNINDENT
2493.SS 5.4 bamToBed
2494.sp
2495\fBbamToBed\fP is a general purpose tool that will convert sequence alignments in BAM format to either
2496BED6, BED12 or BEDPE format. This enables one to convert BAM files for use with all of the other
2497BEDTools. The CIGAR string is used to compute the alignment end coordinate in an "ungapped"
2498fashion. That is, match ("M"), deletion ("D"), and splice ("N") operations are observed when computing
2499alignment ends.
2500.SS 5.4.1 Usage and option summary
2501.sp
2502\fBUsage:\fP
2503.INDENT 0.0
2504.INDENT 3.5
2505.sp
2506.nf
2507.ft C
2508bamToBed [OPTIONS] \-i <BAM>
2509.ft P
2510.fi
2511.UNINDENT
2512.UNINDENT
2513.TS
2514center;
2515|l|l|.
2516_
2517T{
2518Option
2519T}      T{
2520Description
2521T}
2522_
2523T{
2524\fB\-bedpe\fP
2525T}      T{
2526.INDENT 0.0
2527.INDENT 3.5
2528.INDENT 0.0
2529.TP
2530.B Write BAM alignments in BEDPE format. Only one alignment from paired\-end reads will be reported. Specifically, it each mate is aligned to the same chromosome, the BAM alignment reported will be the one where the BAM insert size is greater than zero. When the mate alignments are interchromosomal, the lexicographically lower chromosome will be reported first. Lastly, when an end is unmapped, the chromosome and strand will be set to "." and the start and end coordinates will be set to \-1. \fIBy default, this is disabled and the output will be reported in BED format\fP\&.
2531\fBNOTE: When using this option, it is required that the BAM file is sorted/grouped by the read name. This allows bamToBed to extract correct alignment coordinates for each end based on their respective CIGAR strings. It also assumes that the alignments for a given pair come in groups of twos. There is not yet a standard method for reporting multiple alignments using BAM. bamToBed will fail if an aligner does not report alignments in pairs\fP\&.
2532.UNINDENT
2533.UNINDENT
2534.UNINDENT
2535.sp
2536BAM files may be piped to bamToBed by specifying "\-i stdin". See example below.
2537T}
2538_
2539T{
2540\fB\-bed12\fP
2541T}      T{
2542Write "blocked" BED (a.k.a. BED12) format. This will convert "spliced" BAM alignments (denoted by the "N" CIGAR operation) to BED12.
2543T}
2544_
2545T{
2546\fB\-ed\fP
2547T}      T{
2548Use the "edit distance" tag (NM) for the BED score field. Default for BED is to use mapping quality. Default for BEDPE is to use the \fIminimum\fP of the two mapping qualities for the pair. When \-ed is used with \-bedpe, the total edit distance from the two mates is reported.
2549T}
2550_
2551T{
2552\fB\-tag\fP
2553T}      T{
2554Use other \fInumeric\fP BAM alignment tag for BED score. Default for BED is to use mapping quality. Disallowed with BEDPE output.
2555T}
2556_
2557T{
2558\fB\-color\fP
2559T}      T{
2560An R,G,B string for the color used with BED12 format. Default is (255,0,0).
2561T}
2562_
2563T{
2564\fB\-split\fP
2565T}      T{
2566Report each portion of a "split" BAM (i.e., having an "N" CIGAR operation) alignment as a distinct BED intervals.
2567T}
2568_
2569.TE
2570.sp
2571By default, each alignment in the BAM file is converted to a 6 column BED. The BED "name" field is
2572comprised of the RNAME field in the BAM alignment. If mate information is available, the mate (e.g.,
2573"/1" or "/2") field will be appended to the name. The "score" field is the mapping quality score from the
2574BAM alignment, unless the \fB\-ed\fP option is used.
2575.sp
2576Examples:
2577.INDENT 0.0
2578.INDENT 3.5
2579.sp
2580.nf
2581.ft C
2582bamToBed \-i reads.bam | head \-5
2583chr7   118970079   118970129   TUPAC_0001:3:1:0:1452#0/1   37   \-
2584chr7   118965072   118965122   TUPAC_0001:3:1:0:1452#0/2   37   +
2585chr11  46769934    46769984    TUPAC_0001:3:1:0:1472#0/1   37   \-
2586
2587bamToBed \-i reads.bam \-tag NM | head \-5
2588chr7   118970079   118970129   TUPAC_0001:3:1:0:1452#0/1   1    \-
2589chr7   118965072   118965122   TUPAC_0001:3:1:0:1452#0/2   3    +
2590chr11  46769934    46769984    TUPAC_0001:3:1:0:1472#0/1   1    \-
2591
2592bamToBed \-i reads.bam \-bedpe | head \-3
2593chr7   118965072   118965122   chr7   118970079   118970129
2594       TUPAC_0001:3:1:0:1452#0 37     +     \-
2595chr11  46765606    46765656    chr11  46769934    46769984
2596       TUPAC_0001:3:1:0:1472#0 37     +     \-
2597chr20  54704674    54704724    chr20  54708987    54709037
2598       TUPAC_0001:3:1:1:1833#0 37     +
2599.ft P
2600.fi
2601.UNINDENT
2602.UNINDENT
2603.sp
2604One can easily use samtools and bamToBed together as part of a UNIX pipe. In this example, we will
2605only convert properly\-paired (BAM flag == 0x2) reads to BED format.
2606.INDENT 0.0
2607.INDENT 3.5
2608.sp
2609.nf
2610.ft C
2611samtools view \-bf 0x2 reads.bam | bamToBed \-i stdin | head
2612chr7   118970079   118970129   TUPAC_0001:3:1:0:1452#0/1   37   \-
2613chr7   118965072   118965122   TUPAC_0001:3:1:0:1452#0/2   37   +
2614chr11  46769934    46769984    TUPAC_0001:3:1:0:1472#0/1   37   \-
2615chr11  46765606    46765656    TUPAC_0001:3:1:0:1472#0/2   37   +
2616chr20  54704674    54704724    TUPAC_0001:3:1:1:1833#0/1   37   +
2617chr20  54708987    54709037    TUPAC_0001:3:1:1:1833#0/2   37   \-
2618chrX   9380413     9380463     TUPAC_0001:3:1:1:285#0/1    0    \-
2619chrX   9375861     9375911     TUPAC_0001:3:1:1:285#0/2    0    +
2620chrX   131756978   131757028   TUPAC_0001:3:1:2:523#0/1    37   +
2621chrX   131761790   131761840   TUPAC_0001:3:1:2:523#0/2    37   \-
2622.ft P
2623.fi
2624.UNINDENT
2625.UNINDENT
2626.SS 5.4.2 (\-split)Creating BED12 features from "spliced" BAM entries.
2627.sp
2628bamToBed will, by default, create a BED6 feature that represents the entire span of a spliced/split
2629BAM alignment. However, when using the \fB\-split\fP command, a BED12 feature is reported where BED
2630blocks will be created for each aligned portion of the sequencing read.
2631.INDENT 0.0
2632.INDENT 3.5
2633.sp
2634.nf
2635.ft C
2636Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2637
2638Exons       ***************                                    **********
2639
2640BED/BAM A      ^^^^^^^^^^^^....................................^^^^
2641
2642Result      ===============                                    ====
2643.ft P
2644.fi
2645.UNINDENT
2646.UNINDENT
2647.SS 5.5 windowBed
2648.sp
2649Similar to \fBintersectBed\fP, \fBwindowBed\fP searches for overlapping features in A and B. However,
2650\fBwindowBed\fP adds a specified number (1000, by default) of base pairs upstream and downstream of
2651each feature in A. In effect, this allows features in B that are "near" features in A to be detected.
2652.SS 5.5.1 Usage and option summary
2653.sp
2654\fBUsage:\fP
2655.INDENT 0.0
2656.INDENT 3.5
2657.sp
2658.nf
2659.ft C
2660windowBed [OPTIONS] \-a <BED/GFF/VCF> \-b <BED/GFF/VCF>
2661.ft P
2662.fi
2663.UNINDENT
2664.UNINDENT
2665.TS
2666center;
2667|l|l|.
2668_
2669T{
2670Option
2671T}      T{
2672Description
2673T}
2674_
2675T{
2676\fB\-abam\fP
2677T}      T{
2678BAM file A. Each BAM alignment in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe: For example:  samtools view \-b <BAM> | windowBed \-abam stdin \-b genes.bed
2679T}
2680_
2681T{
2682\fB\-ubam\fP
2683T}      T{
2684Write uncompressed BAM output. The default is write compressed BAM output.
2685T}
2686_
2687T{
2688\fB\-bed\fP
2689T}      T{
2690When using BAM input (\-abam), write output as BED. The default is to write output in BAM when using \-abam. For example:  windowBed \-abam reads.bam \-b genes.bed \-bed
2691T}
2692_
2693T{
2694\fB\-w\fP
2695T}      T{
2696Base pairs added upstream and downstream of each entry in A when searching for overlaps in B. \fIDefault is 1000 bp\fP\&.
2697T}
2698_
2699T{
2700\fB\-l\fP
2701T}      T{
2702Base pairs added upstream (left of) of each entry in A when searching for overlaps in B. \fIAllows one to create assymetrical "windows". Default is 1000bp\fP\&.
2703T}
2704_
2705T{
2706\fB\-r\fP
2707T}      T{
2708Base pairs added downstream (right of) of each entry in A when searching for overlaps in B. \fIAllows one to create assymetrical "windows". Default is 1000bp\fP\&.
2709T}
2710_
2711T{
2712\fB\-sw\fP
2713T}      T{
2714Define \-l and \-r based on strand. For example if used, \-l 500 for a negative\-stranded feature will add 500 bp downstream. \fIBy default, this is disabled\fP\&.
2715T}
2716_
2717T{
2718\fB\-sm\fP
2719T}      T{
2720Only report hits in B that overlap A on the same strand. \fIBy default, overlaps are reported without respect to strand\fP\&.
2721T}
2722_
2723T{
2724\fB\-u\fP
2725T}      T{
2726Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B.
2727T}
2728_
2729T{
2730\fB\-c\fP
2731T}      T{
2732For each entry in A, report the number of hits in B while restricting to \-f. Reports 0 for A entries that have no overlap with B.
2733T}
2734_
2735.TE
2736.SS 5.5.2 Default behavior
2737.sp
2738By default, \fBwindowBed\fP adds 1000 bp upstream and downstream of each A feature and searches for
2739features in B that overlap this "window". If an overlap is found in B, both the \fIoriginal\fP A feature and the
2740\fIoriginal\fP B feature are reported. For example, in the figure below, feature B1 would be found, but B2
2741would not.
2742.INDENT 0.0
2743.INDENT 3.5
2744.sp
2745.nf
2746.ft C
2747Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2748                                      "window" = 10
2749BED File A                 <\-\-\-\-\-\-\-\-\-\-*************\-\-\-\-\-\-\-\-\-\->
2750
2751BED File B            ^^^^^^^^                                          ^^^^^^
2752
2753Result                ========
2754.ft P
2755.fi
2756.UNINDENT
2757.UNINDENT
2758.sp
2759For example:
2760.INDENT 0.0
2761.INDENT 3.5
2762.sp
2763.nf
2764.ft C
2765cat A.bed
2766chr1  100  200
2767
2768cat B.bed
2769chr1  500  1000
2770chr1  1300 2000
2771
2772windowBed \-a A.bed \-b B.bed
2773chr1  100  200  chr1  500  1000
2774.ft P
2775.fi
2776.UNINDENT
2777.UNINDENT
2778.SS 5.5.3 (\-w)Defining a custom window size
2779.sp
2780Instead of using the default window size of 1000bp, one can define a custom, \fIsymmetric\fP window around
2781each feature in A using the \fB\-w\fP option. One should specify the window size in base pairs. For example,
2782a window of 5kb should be defined as \fB\-w 5000\fP\&.
2783.sp
2784For example (note that in contrast to the default behavior, the second B entry is reported):
2785.INDENT 0.0
2786.INDENT 3.5
2787.sp
2788.nf
2789.ft C
2790cat A.bed
2791chr1  100  200
2792
2793cat B.bed
2794chr1  500  1000
2795chr1  1300 2000
2796
2797windowBed \-a A.bed \-b B.bed \-w 5000
2798chr1  100  200  chr1  500   1000
2799chr1  100  200  chr1  1300  2000
2800.ft P
2801.fi
2802.UNINDENT
2803.UNINDENT
2804.SS 5.5.4 (\-l and \-r)Defining assymteric windows
2805.sp
2806One can also define asymmetric windows where a differing number of bases are added upstream and
2807downstream of each feature using the \fB\-l (upstream)\fP and \fB\-r (downstream)\fP options.
2808.sp
2809For example (note the difference between \-l 200 and \-l 300):
2810.INDENT 0.0
2811.INDENT 3.5
2812.sp
2813.nf
2814.ft C
2815cat A.bed
2816chr1  1000  2000
2817
2818cat B.bed
2819chr1  500   800
2820chr1  10000 20000
2821
2822windowBed \-a A.bed \-b B.bed \-l 200 \-r 20000
2823chr1  100   200  chr1  10000  20000
2824
2825windowBed \-a A.bed \-b B.bed \-l 300 \-r 20000
2826chr1  100   200  chr1  500    800
2827chr1  100   200  chr1  10000  20000
2828.ft P
2829.fi
2830.UNINDENT
2831.UNINDENT
2832.SS 5.5.5 (\-sw)Defining assymteric windows based on strand
2833.sp
2834Especially when dealing with gene annotations or RNA\-seq experiments, you may want to define
2835asymmetric windows based on "strand". For example, you may want to screen for overlaps that occur
2836within 5000 bp upstream of a gene (e.g. a promoter region) while screening only 1000 bp downstream of
2837the gene. By enabling the \fB\-sw\fP ("stranded" windows) option, the windows are added upstream or
2838downstream according to strand. For example, imagine one specifies \fB\-l 5000 \-r 1000\fP as well as the \fB\-
2839sw\fP option. In this case, forward stranded ("+") features will screen 5000 bp to the \fIleft\fP (that is, \fIlower\fP
2840genomic coordinates) and 1000 bp to the \fIright\fP (that is, \fIhigher\fP genomic coordinates). By contrast,
2841reverse stranded ("\-") features will screen 5000 bp to the \fIright\fP (that is, \fIhigher\fP genomic coordinates) and
28421000 bp to the \fIleft\fP (that is, \fIlower\fP genomic coordinates).
2843.sp
2844For example (note the difference between \-l 200 and \-l 300):
2845.INDENT 0.0
2846.INDENT 3.5
2847.sp
2848.nf
2849.ft C
2850cat A.bed
2851chr1  10000  20000  A.forward  1  +
2852chr1  10000  20000  A.reverse  1  \-
2853
2854cat B.bed
2855chr1  1000   8000   B1
2856chr1  24000  32000  B2
2857
2858windowBed \-a A.bed \-b B.bed \-l 5000 \-r 1000 \-sw
2859chr1  10000  20000  A.forward  1  +  chr1  1000   8000   B1
2860chr1  10000  20000  A.reverse  1  \-  chr1  24000  32000  B2
2861.ft P
2862.fi
2863.UNINDENT
2864.UNINDENT
2865.SS 5.5.6 (\-sm)Enforcing "strandedness"
2866.sp
2867This option behaves the same as the \-s option for intersectBed while scanning for overlaps within the
2868"window" surrounding A. See the discussion in the intersectBed section for details.
2869.SS 5.5.7 (\-u)Reporting the presence of at least one overlapping feature
2870.sp
2871This option behaves the same as for intersectBed while scanning for overlaps within the "window"
2872surrounding A. See the discussion in the intersectBed section for details.
2873.SS 5.5.8 (\-c)Reporting the number of overlapping features
2874.sp
2875This option behaves the same as for intersectBed while scanning for overlaps within the "window"
2876surrounding A. See the discussion in the intersectBed section for details.
2877.SS 5.5.9 (\-v)Reporting the absence of any overlapping features
2878.sp
2879This option behaves the same as for intersectBed while scanning for overlaps within the "window"
2880surrounding A. See the discussion in the intersectBed section for details.
2881.SS 5.6 closestBed
2882.sp
2883Similar to \fBintersectBed, closestBed\fP searches for overlapping features in A and B. In the event that
2884no feature in B overlaps the current feature in A, \fBclosestBed\fP will report the \fIclosest\fP (that is, least
2885genomic distance from the start or end of A) feature in B. For example, one might want to find which
2886is the closest gene to a significant GWAS polymorphism. Note that \fBclosestBed\fP will report an
2887overlapping feature as the closest\-\-\-that is, it does not restrict to closest \fInon\-overlapping\fP feature.
2888.SS 5.6.1 Usage and option summary
2889.sp
2890\fBUsage:\fP
2891.INDENT 0.0
2892.INDENT 3.5
2893.sp
2894.nf
2895.ft C
2896closestBed [OPTIONS] \-a <BED/GFF/VCF> \-b <BED/GFF/VCF>
2897.ft P
2898.fi
2899.UNINDENT
2900.UNINDENT
2901.TS
2902center;
2903|l|l|.
2904_
2905T{
2906Option
2907T}      T{
2908Description
2909T}
2910_
2911T{
2912\fB\-s\fP
2913T}      T{
2914Force strandedness. That is, find the closest feature in B overlaps A on the same strand. \fIBy default, this is disabled\fP\&.
2915T}
2916_
2917T{
2918\fB\-d\fP
2919T}      T{
2920In addition to the closest feature in B, report its distance to A as an extra column. The reported distance for overlapping features will be 0.
2921T}
2922_
2923T{
2924\fB\-t\fP
2925T}      T{
2926How ties for closest feature should be handled. This occurs when two features in B have exactly the same overlap with a feature in A. \fIBy default, all such features in B are reported\fP\&.
2927.INDENT 0.0
2928.INDENT 3.5
2929Here are the other choices controlling how ties are handled:
2930.sp
2931\fIall\-\fP   Report all ties (default).
2932.sp
2933\fIfirst\-\fP   Report the first tie that occurred in the B file.
2934.sp
2935\fIlast\-\fP   Report the last tie that occurred in the B file.
2936.UNINDENT
2937.UNINDENT
2938T}
2939_
2940.TE
2941.SS 5.6.2 Default behavior
2942.sp
2943\fBclosestBed\fP first searches for features in B that overlap a feature in A. If overlaps are found, the feature
2944in B that overlaps the highest fraction of A is reported. If no overlaps are found, \fBclosestBed\fP looks for
2945the feature in B that is \fIclosest\fP (that is, least genomic distance to the start or end of A) to A. For
2946example, in the figure below, feature B1 would be reported as the closest feature to A1.
2947.INDENT 0.0
2948.INDENT 3.5
2949.sp
2950.nf
2951.ft C
2952Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2953
2954BED FILE A                             *************
2955
2956BED File B         ^^^^^^^^                            ^^^^^^
2957
2958Result                                                 ======
2959.ft P
2960.fi
2961.UNINDENT
2962.UNINDENT
2963.sp
2964For example:
2965.INDENT 0.0
2966.INDENT 3.5
2967.sp
2968.nf
2969.ft C
2970cat A.bed
2971chr1  100  200
2972
2973cat B.bed
2974chr1  500  1000
2975chr1  1300 2000
2976
2977closestBed \-a A.bed \-b B.bed
2978chr1  100  200  chr1  500  1000
2979.ft P
2980.fi
2981.UNINDENT
2982.UNINDENT
2983.SS 5.6.3 (\-s)Enforcing "strandedness"
2984.sp
2985This option behaves the same as the \-s option for intersectBed while scanning for the closest
2986(overlapping or not) feature in B. See the discussion in the intersectBed section for details.
2987.SS 5.6.4 (\-t)Controlling how ties for "closest" are broken
2988.sp
2989When there are two or more features in B that overlap the \fIsame fraction\fP of A, \fBclosestBed\fP will, by
2990default, report both features in B. Imagine feature A is a SNP and file B contains genes. It can often
2991occur that two gene annotations (e.g. opposite strands) in B will overlap the SNP. As mentioned, the
2992default behavior is to report both such genes in B. However, the \-t option allows one to optionally
2993choose the just first or last feature (in terms of where it occurred in the input file, not chromosome
2994position) that occurred in B.
2995.sp
2996For example (note the difference between \-l 200 and \-l 300):
2997.INDENT 0.0
2998.INDENT 3.5
2999.sp
3000.nf
3001.ft C
3002cat A.bed
3003chr1  100  101  rs1234
3004
3005cat B.bed
3006chr1  0  1000  geneA  100  +
3007chr1  0  1000  geneB  100  \-
3008
3009closestBed \-a A.bed \-b B.bed
3010chr1  100  101  rs1234  chr1  0  1000  geneA  100  +
3011chr1  100  101  rs1234  chr1  0  1000  geneB  100  \-
3012
3013closestBed \-a A.bed \-b B.bed \-t all
3014chr1  100  101  rs1234  chr1  0  1000  geneA  100  +
3015chr1  100  101  rs1234  chr1  0  1000  geneB  100  \-
3016
3017closestBed \-a A.bed \-b B.bed \-t first
3018chr1  100  101  rs1234  chr1  0  1000  geneA  100  +
3019
3020closestBed \-a A.bed \-b B.bed \-t last
3021chr1  100  101  rs1234  chr1  0  1000  geneB  100  \-
3022.ft P
3023.fi
3024.UNINDENT
3025.UNINDENT
3026.SS 5.6.5 (\-d)Reporting the distance to the closest feature in base pairs
3027.sp
3028ClosestBed will optionally report the distance to the closest feature in the B file using the \fB\-d\fP option.
3029When a feature in B overlaps a feature in A, a distance of 0 is reported.
3030.INDENT 0.0
3031.INDENT 3.5
3032.sp
3033.nf
3034.ft C
3035cat A.bed
3036chr1  100  200
3037chr1  500  600
3038
3039cat B.bed
3040chr1  500  1000
3041chr1  1300 2000
3042
3043closestBed \-a A.bed \-b B.bed \-d
3044chr1  100  200  chr1  500  1000  300
3045chr1  500  600  chr1  500  1000  0
3046.ft P
3047.fi
3048.UNINDENT
3049.UNINDENT
3050.SS 5.7 subtractBed
3051.sp
3052\fBsubtractBed\fP searches for features in B that overlap A. If an overlapping feature is found in B, the
3053overlapping portion is removed from A and the remaining portion of A is reported. If a feature in B
3054overlaps all of a feature in A, the A feature will not be reported.
3055.SS 5.7.1 Usage and option summary
3056.sp
3057Usage:
3058.INDENT 0.0
3059.INDENT 3.5
3060.sp
3061.nf
3062.ft C
3063subtractBed [OPTIONS] \-a <BED/GFF/VCF> \-b <BED/GFF/VCF>
3064.ft P
3065.fi
3066.UNINDENT
3067.UNINDENT
3068.TS
3069center;
3070|l|l|.
3071_
3072T{
3073Option
3074T}      T{
3075Description
3076T}
3077_
3078T{
3079\fB\-f\fP
3080T}      T{
3081Minimum overlap required as a fraction of A. Default is 1E\-9 (i.e. 1bp).
3082T}
3083_
3084T{
3085\fB\-s\fP
3086T}      T{
3087Force strandedness. That is, find the closest feature in B overlaps A on the same strand.  \fIBy default, this is disabled\fP\&.
3088T}
3089_
3090.TE
3091.SS 5.7.2 Default behavior
3092.sp
3093Figure:
3094.INDENT 0.0
3095.INDENT 3.5
3096.sp
3097.nf
3098.ft C
3099Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3100
3101BED FILE A             *************            ******
3102
3103BED File B         ^^^^^^^^                   ^^^^^^^^^^^
3104
3105Result                     =========
3106.ft P
3107.fi
3108.UNINDENT
3109.UNINDENT
3110.sp
3111For example:
3112.INDENT 0.0
3113.INDENT 3.5
3114.sp
3115.nf
3116.ft C
3117cat A.bed
3118chr1  100  200
3119chr1  10   20
3120
3121cat B.bed
3122chr1  0    30
3123chr1  180  300
3124
3125subtractBed \-a A.bed \-b B.bed
3126chr1  100  180
3127.ft P
3128.fi
3129.UNINDENT
3130.UNINDENT
3131.SS 5.7.3  (\-f)Requiring a minimal overlap fraction before subtracting
3132.sp
3133This option behaves the same as the \-f option for intersectBed. In this case, subtractBed will only
3134subtract an overlap with B if it covers at least the fraction of A defined by \-f. If an overlap is found,
3135but it does not meet the overlap fraction, the original A feature is reported without subtraction.
3136.sp
3137For example:
3138.INDENT 0.0
3139.INDENT 3.5
3140.sp
3141.nf
3142.ft C
3143cat A.bed
3144chr1  100  200
3145
3146cat B.bed
3147chr1  180  300
3148
3149subtractBed \-a A.bed \-b B.bed \-f 0.10
3150chr1  100  180
3151
3152subtractBed \-a A.bed \-b B.bed \-f 0.80
3153chr1  100  200
3154.ft P
3155.fi
3156.UNINDENT
3157.UNINDENT
3158.SS 5.7.4 (\-s)Enforcing "strandedness"
3159.sp
3160This option behaves the same as the \-s option for intersectBed while scanning for features in B that
3161should be subtracted from A. See the discussion in the intersectBed section for details.
3162.SS 5.8 mergeBed
3163.sp
3164\fBmergeBed\fP combines overlapping or "book\-ended" (that is, one base pair away) features in a feature file
3165into a single feature which spans all of the combined features.
3166.SS 5.8.1 Usage and option summary
3167.sp
3168Usage:
3169.INDENT 0.0
3170.INDENT 3.5
3171.sp
3172.nf
3173.ft C
3174mergeBed [OPTIONS] \-i <BED/GFF/VCF>
3175.ft P
3176.fi
3177.UNINDENT
3178.UNINDENT
3179.TS
3180center;
3181|l|l|.
3182_
3183T{
3184Option
3185T}      T{
3186Description
3187T}
3188_
3189T{
3190\fB\-s\fP
3191T}      T{
3192Force strandedness. That is, only merge features that are the same strand. \fIBy default, this is disabled\fP\&.
3193T}
3194_
3195T{
3196\fB\-n\fP
3197T}      T{
3198Report the number of BED entries that were merged. \fI1 is reported if no merging occurred\fP\&.
3199T}
3200_
3201T{
3202\fB\-d\fP
3203T}      T{
3204Maximum distance between features allowed for features to be merged. \fIDefault is 0. That is, overlapping and/or book\-ended features are merged\fP\&.
3205T}
3206_
3207T{
3208\fB\-nms\fP
3209T}      T{
3210Report the names of the merged features separated by semicolons.
3211T}
3212_
3213.TE
3214.SS 5.8.2 Default behavior
3215.sp
3216Figure:
3217.INDENT 0.0
3218.INDENT 3.5
3219.sp
3220.nf
3221.ft C
3222Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3223
3224BED FILE       *************   ***************   **********************
3225                       ********
3226
3227Result         ===============================   ======================
3228.ft P
3229.fi
3230.UNINDENT
3231.UNINDENT
3232.sp
3233For example:
3234.INDENT 0.0
3235.INDENT 3.5
3236.sp
3237.nf
3238.ft C
3239cat A.bed
3240chr1  100  200
3241chr1  180  250
3242chr1  250  500
3243chr1  501  1000
3244
3245mergeBed \-i A.bed
3246chr1  100  500
3247chr1  501  1000
3248.ft P
3249.fi
3250.UNINDENT
3251.UNINDENT
3252.SS 5.8.3 (\-s)Enforcing "strandedness"
3253.sp
3254This option behaves the same as the \-s option for intersectBed while scanning for features that should
3255be merged. Only features on the same strand will be merged. See the discussion in the intersectBed
3256section for details.
3257.SS 5.8.4 (\-n)Reporting the number of features that were merged
3258.sp
3259The \-n option will report the number of features that were combined from the original file in order to
3260make the newly merged feature. If a feature in the original file was not merged with any other features,
3261a "1" is reported.
3262.sp
3263For example:
3264.INDENT 0.0
3265.INDENT 3.5
3266.sp
3267.nf
3268.ft C
3269cat A.bed
3270chr1  100  200
3271chr1  180  250
3272chr1  250  500
3273chr1  501  1000
3274
3275mergeBed \-i A.bed \-n
3276chr1  100  500  3
3277chr1  501  1000 1
3278.ft P
3279.fi
3280.UNINDENT
3281.UNINDENT
3282.SS 5.8.5 (\-d)Controlling how close two features must be in order to merge
3283.sp
3284By default, only overlapping or book\-ended features are combined into a new feature. However, one can
3285force mergeBed to combine more distant features with the \-d option. For example, were one to set \-d to
32861000, any features that overlap or are within 1000 base pairs of one another will be combined.
3287.sp
3288For example:
3289.INDENT 0.0
3290.INDENT 3.5
3291.sp
3292.nf
3293.ft C
3294cat A.bed
3295chr1  100  200
3296chr1  501  1000
3297
3298mergeBed \-i A.bed
3299chr1  100  200
3300chr1  501  1000
3301
3302mergeBed \-i A.bed \-d 1000
3303chr1  100  200  1000
3304.ft P
3305.fi
3306.UNINDENT
3307.UNINDENT
3308.SS 5.8.6 (\-nms)Reporting the names of the features that were merged
3309.sp
3310Occasionally, one might like to know that names of the features that were merged into a new feature.
3311The \-nms option will add an extra column to the mergeBed output which lists (separated by
3312semicolons) the names of the merged features.
3313.sp
3314For example:
3315.INDENT 0.0
3316.INDENT 3.5
3317.sp
3318.nf
3319.ft C
3320cat A.bed
3321chr1  100  200  A1
3322chr1  150  300  A2
3323chr1  250  500  A3
3324
3325mergeBed \-i A.bed \-nms
3326chr1  100  500  A1;A2;A3
3327.ft P
3328.fi
3329.UNINDENT
3330.UNINDENT
3331.SS 5.9 coverageBed
3332.sp
3333\fBcoverageBed\fP computes both the \fIdepth\fP and \fIbreadth\fP of coverage of features in file A across the features
3334in file B. For example, \fBcoverageBed\fP can compute the coverage of sequence alignments (file A) across 1
3335kilobase (arbitrary) windows (file B) tiling a genome of interest. One advantage that \fBcoverageBed\fP
3336offers is that it not only \fIcounts\fP the number of features that overlap an interval in file B, it also
3337computes the fraction of bases in B interval that were overlapped by one or more features. Thus,
3338\fBcoverageBed\fP also computes the \fIbreadth\fP of coverage for each interval in B.
3339.SS 5.9.1 Usage and option summary
3340.sp
3341Usage:
3342.INDENT 0.0
3343.INDENT 3.5
3344.sp
3345.nf
3346.ft C
3347coverageBed [OPTIONS] \-a <BED/GFF/VCF> \-b <BED/GFF/VCF>
3348.ft P
3349.fi
3350.UNINDENT
3351.UNINDENT
3352.TS
3353center;
3354|l|l|.
3355_
3356T{
3357Option
3358T}      T{
3359Description
3360T}
3361_
3362T{
3363\fB\-abam\fP
3364T}      T{
3365.INDENT 0.0
3366.INDENT 3.5
3367BAM file A. Each BAM alignment in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe: For example:
3368.UNINDENT
3369.UNINDENT
3370.nf
3371samtools view \-b <BAM> | intersectBed \-abam stdin \-b genes.bed
3372.fi
3373T}
3374_
3375T{
3376\fB\-s\fP
3377T}      T{
3378Force strandedness. That is, only features in A are only counted towards coverage in B if they are the same strand. \fIBy default, this is disabled and coverage is counted without respect to strand\fP\&.
3379T}
3380_
3381T{
3382\fB\-hist\fP
3383T}      T{
3384Report a histogram of coverage for each feature in B as well as a summary histogram for _all_ features in B.
3385.nf
3386Output (tab delimited) after each feature in B:
3387.fi
3388.sp
3389.INDENT 0.0
3390.INDENT 3.5
3391.nf
33921) depth
33932) # bases at depth
33943) size of B
33954) % of B at depth
3396.fi
3397.sp
3398.UNINDENT
3399.UNINDENT
3400T}
3401_
3402T{
3403\fB\-d\fP
3404T}      T{
3405Report the depth at each position in each B feature. Positions reported are one based. Each position and depth follow the complete B feature.
3406T}
3407_
3408T{
3409\fB\-split\fP
3410T}      T{
3411Treat "split" BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).
3412T}
3413_
3414.TE
3415.SS 5.9.2 Default behavior
3416.sp
3417After each interval in B, \fBcoverageBed\fP will report:
3418.INDENT 0.0
3419.IP 1. 3
3420The number of features in A that overlapped (by at least one base pair) the B interval.
3421.IP 2. 3
3422The number of bases in B that had non\-zero coverage from features in A.
3423.IP 3. 3
3424The length of the entry in B.
3425.IP 4. 3
3426The fraction of bases in B that had non\-zero coverage from features in A.
3427.UNINDENT
3428.sp
3429Below are the number of features in A (N=...) overlapping B and fraction of bases in B with coverage.
3430.INDENT 0.0
3431.INDENT 3.5
3432.sp
3433.nf
3434.ft C
3435Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3436
3437BED FILE B  ***************     ***************     ******    **************
3438
3439BED File A  ^^^^ ^^^^              ^^             ^^^^^^^^^    ^^^ ^^ ^^^^
3440              ^^^^^^^^                                      ^^^^^ ^^^^^ ^^
3441
3442Result      [  N=3, 10/15 ]     [  N=1, 2/16 ]     [N=1,6/6]   [N=5, 11/12 ]
3443.ft P
3444.fi
3445.UNINDENT
3446.UNINDENT
3447.sp
3448For example:
3449.INDENT 0.0
3450.INDENT 3.5
3451.sp
3452.nf
3453.ft C
3454cat A.bed
3455chr1  10  20
3456chr1  20  30
3457chr1  30  40
3458chr1  100 200
3459
3460cat B.bed
3461chr1  0   100
3462chr1  100 200
3463chr2  0   100
3464
3465coverageBed \-a A.bed \-b B.bed
3466chr1  0   100  3  30  100 0.3000000
3467chr1  100 200  1  100 100 1.0000000
3468chr2  0   100  0  0   100 0.0000000
3469.ft P
3470.fi
3471.UNINDENT
3472.UNINDENT
3473.SS 5.9.4 (\-s)Calculating coverage by strand
3474.sp
3475Use the "\fB\-s\fP" option if one wants to only count coverage if features in A are on the same strand as the
3476feature / window in B. This is especially useful for RNA\-seq experiments.
3477.sp
3478For example (note the difference in coverage with and without \fB\-s\fP:
3479.INDENT 0.0
3480.INDENT 3.5
3481.sp
3482.nf
3483.ft C
3484cat A.bed
3485chr1  10  20  a1  1  \-
3486chr1  20  30  a2  1  \-
3487chr1  30  40  a3  1  \-
3488chr1  100 200 a4  1  +
3489
3490cat B.bed
3491chr1  0   100 b1  1  +
3492chr1  100 200 b2  1  \-
3493chr2  0   100 b3  1  +
3494
3495coverageBed \-a A.bed \-b B.bed
3496chr1  0   100 b1  1  +  3  30  100  0.3000000
3497chr1  100 200 b2  1  \-  1  100 100  1.0000000
3498chr2  0   100 b3  1  +  0  0   100  0.0000000
3499
3500coverageBed \-a A.bed \-b B.bed \-s
3501chr1  0   100 b1  1  +  0  0   100  0.0000000
3502chr1  100 200 b2  1  \-  0  0   100  0.0000000
3503chr2  0   100 b3  1  +  0  0   100  0.0000000
3504.ft P
3505.fi
3506.UNINDENT
3507.UNINDENT
3508.SS 5.9.5 (\-hist)Creating a histogram of coverage for each feature in the B file
3509.sp
3510One should use the "\fB\-hist\fP" option to create, for each interval in B, a histogram of coverage of the
3511features in A across B.
3512.sp
3513In this case, each entire feature in B will be reported, followed by the depth of coverage, the number of
3514bases at that depth, the size of the feature, and the fraction covered. After all of the features in B have
3515been reported, a histogram summarizing the coverage among all features in B will be reported.
3516.INDENT 0.0
3517.INDENT 3.5
3518.sp
3519.nf
3520.ft C
3521cat A.bed
3522chr1  10  20  a1  1  \-
3523chr1  20  30  a2  1  \-
3524chr1  30  40  a3  1  \-
3525chr1  100 200 a4  1  +
3526
3527cat B.bed
3528chr1  0   100 b1  1  +
3529chr1  100 200 b2  1  \-
3530chr2  0   100 b3  1  +
3531
3532coverageBed \-a A.bed \-b B.bed \-hist
3533chr1  0   100 b1  1  +  0  70  100  0.7000000
3534chr1  0   100 b1  1  +  1  30  100  0.3000000
3535chr1  100 200 b2  1  \-  1  100 100  1.0000000
3536chr2  0   100 b3  1  +  0  100 100  1.0000000
3537all   0   170 300 0.5666667
3538all   1   130 300 0.4333333
3539.ft P
3540.fi
3541.UNINDENT
3542.UNINDENT
3543.SS 5.9.6 (\-hist)Reporting the per\-base of coverage for each feature in the B file
3544.sp
3545One should use the "\fB\-d\fP" option to create, for each interval in B, a detailed list of coverage at each of the
3546positions across each B interval.
3547.sp
3548The output will consist of a line for each one\-based position in each B feature, followed by the coverage
3549detected at that position.
3550.INDENT 0.0
3551.INDENT 3.5
3552.sp
3553.nf
3554.ft C
3555cat A.bed
3556chr1  0  5
3557chr1  3  8
3558chr1  4  8
3559chr1  5  9
3560
3561cat B.bed
3562chr1  0  10
3563
3564coverageBed \-a A.bed \-b B.bed \-d
3565chr1  0  10  B  1  1
3566chr1  0  10  B  2  1
3567chr1  0  10  B  3  1
3568chr1  0  10  B  4  2
3569chr1  0  10  B  5  3
3570chr1  0  10  B  6  3
3571chr1  0  10  B  7  3
3572chr1  0  10  B  8  3
3573chr1  0  10  B  9  1
3574chr1  0  10  B  10 0
3575.ft P
3576.fi
3577.UNINDENT
3578.UNINDENT
3579.SS 5.9.7 (\-split)Reporting coverage with spliced alignments or blocked BED features
3580.sp
3581As described in section 1.3.19, coverageBed will, by default, screen for overlaps against the entire span
3582of a spliced/split BAM alignment or blocked BED12 feature. When dealing with RNA\-seq reads, for
3583example, one typically wants to only tabulate coverage for the portions of the reads that come from
3584exons (and ignore the interstitial intron sequence). The \fB\-split\fP command allows for such coverage to be
3585performed.
3586.SS 5.10 genomeCoverageBed
3587.sp
3588\fBgenomeCoverageBed\fP computes a histogram of feature coverage (e.g., aligned sequences) for a given
3589genome. Optionally, by using the \fB\-d\fP option, it will report the depth of coverage at \fIeach base\fP on each
3590chromosome in the genome file (\fB\-g\fP).
3591.SS 5.10.1 Usage and option summary
3592.sp
3593Usage:
3594.INDENT 0.0
3595.INDENT 3.5
3596.sp
3597.nf
3598.ft C
3599genomeCoverageBed [OPTIONS] \-i <BED> \-g <GENOME>
3600.ft P
3601.fi
3602.UNINDENT
3603.UNINDENT
3604.sp
3605NOTE: genomeCoverageBed requires that the input BED file be sorted by
3606chromosome. A simple sort \-k1,1 will suffice.
3607.TS
3608center;
3609|l|l|.
3610_
3611T{
3612Option
3613T}      T{
3614Description
3615T}
3616_
3617T{
3618\fB\-ibam\fP
3619T}      T{
3620.INDENT 0.0
3621.INDENT 3.5
3622BAM file as input for coverage. Each BAM alignment in A added to the total coverage for the genome. Use "stdin" if passing it with a UNIX pipe: For example:
3623.UNINDENT
3624.UNINDENT
3625.nf
3626samtools view \-b <BAM> | genomeCoverageBed \-ibam stdin \-g hg18.genome
3627.fi
3628T}
3629_
3630T{
3631\fB\-d\fP
3632T}      T{
3633Report the depth at each genome position. \fIDefault behavior is to report a histogram\fP\&.
3634T}
3635_
3636T{
3637\fB\-max\fP
3638T}      T{
3639Combine all positions with a depth >= max into a single bin in the histogram.
3640T}
3641_
3642T{
3643\fB\-bg\fP
3644T}      T{
3645Report depth in BedGraph format. For details, see: \fI\%http://genome.ucsc.edu/goldenPath/help/bedgraph.html\fP
3646T}
3647_
3648T{
3649\fB\-bga\fP
3650T}      T{
3651Report depth in BedGraph format, as above (i.e., \-bg). However with this option, regions with zero coverage are also reported. This allows one to quickly extract all regions of a genome with 0 coverage by applying: "grep \-w 0$" to the output.
3652T}
3653_
3654T{
3655\fB\-split\fP
3656T}      T{
3657Treat "split" BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).
3658T}
3659_
3660T{
3661\fB\-strand\fP
3662T}      T{
3663Calculate coverage of intervals from a specific strand. With BED files, requires at least 6 columns (strand is column 6).
3664T}
3665_
3666.TE
3667.SS 5.10.2 Default behavior
3668.sp
3669By default, \fBgenomeCoverageBed\fP will compute a histogram of coverage for the genome file provided.
3670The default output format is as follows:
36711. chromosome (or entire genome)
36722. depth of coverage from features in input file
36733. number of bases on chromosome (or genome) with depth equal to column 2.
36744. size of chromosome (or entire genome) in base pairs
36755. fraction of bases on chromosome (or entire genome) with depth equal to column 2.
3676.sp
3677For example:
3678.INDENT 0.0
3679.INDENT 3.5
3680.sp
3681.nf
3682.ft C
3683cat A.bed
3684chr1  10  20
3685chr1  20  30
3686chr2  0   500
3687
3688cat my.genome
3689chr1  1000
3690chr2  500
3691
3692genomeCoverageBed \-i A.bed \-g my.genome
3693chr1   0  980  1000  0.98
3694chr1   1  20   1000  0.02
3695chr2   1  500  500   1
3696genome 0  980  1500  0.653333
3697genome 1  520  1500  0.346667
3698.ft P
3699.fi
3700.UNINDENT
3701.UNINDENT
3702.SS 5.10.3 (\-max)Controlling the histogram\(aqs maximum depth
3703.sp
3704Using the \fB\-max\fP option, \fBgenomeCoverageBed\fP will "lump" all positions in the genome having feature
3705coverage greather than or equal to \fBmax\fP into the \fBmax\fP histogram bin. For example, if one sets \fB\-max\fP
3706equal to 50, the max depth reported in the output will be 50 and all positions with a depth >= 50 will
3707be represented in bin 50.
3708.SS 5.10.4 (\-d)Reporting "per\-base" genome coverage
3709.sp
3710Using the \fB\-d\fP option, \fBgenomeCoverageBed\fP will compute the depth of feature coverage for each base
3711on each chromosome in genome file provided.
3712.sp
3713The "per\-base" output format is as follows:
37141. chromosome
37152. chromosome position
37163. depth (number) of features overlapping this chromosome position.
3717.sp
3718For example:
3719.INDENT 0.0
3720.INDENT 3.5
3721.sp
3722.nf
3723.ft C
3724cat A.bed
3725chr1  10  20
3726chr1  20  30
3727chr2  0   500
3728
3729cat my.genome
3730chr1  1000
3731chr2  500
3732
3733genomeCoverageBed \-i A.bed \-g my.genome \-d | head \-15 | tail \-n 10
3734chr1  6  0
3735chr1  7  0
3736chr1  8  0
3737chr1  9  0
3738chr1  10 0
3739chr1  11 1
3740chr1  12 1
3741chr1  13 1
3742chr1  14 1
3743chr1  15 1
3744.ft P
3745.fi
3746.UNINDENT
3747.UNINDENT
3748.SS 5.1.13 (\-split)Reporting coverage with spliced alignments or blocked BED features
3749.sp
3750As described in section 1.3.19, genomeCoverageBed will, by default, screen for overlaps against the
3751entire span of a spliced/split BAM alignment or blocked BED12 feature. When dealing with RNA\-seq
3752reads, for example, one typically wants to only screen for overlaps for the portions of the reads that
3753come from exons (and ignore the interstitial intron sequence). The \fB\-split\fP command allows for such
3754overlaps to be performed.
3755.sp
3756For additional details, please visit the Usage From The Wild site and have a look at example 5,
3757contributed by Assaf Gordon.
3758.SS 5.11 fastaFromBed
3759.sp
3760\fBfastaFromBed\fP extracts sequences from a FASTA file for each of the intervals defined in a BED file.
3761The headers in the input FASTA file must exactly match the chromosome column in the BED file.
3762.SS 5.11.1 Usage and option summary
3763.sp
3764Usage:
3765.INDENT 0.0
3766.INDENT 3.5
3767.sp
3768.nf
3769.ft C
3770fastaFromBed [OPTIONS] \-fi <input FASTA> \-bed <BED/GFF/VCF> \-fo <output FASTA>
3771.ft P
3772.fi
3773.UNINDENT
3774.UNINDENT
3775.TS
3776center;
3777|l|l|.
3778_
3779T{
3780Option
3781T}      T{
3782Description
3783T}
3784_
3785T{
3786\fB\-name\fP
3787T}      T{
3788Use the "name" column in the BED file for the FASTA headers in the output FASTA file.
3789T}
3790_
3791T{
3792\fB\-tab\fP
3793T}      T{
3794Report extract sequences in a tab\-delimited format instead of in FASTA format.
3795T}
3796_
3797T{
3798\fB\-s\fP
3799T}      T{
3800Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. \fIDefault: strand information is ignored\fP\&.
3801T}
3802_
3803.TE
3804.SS 5.11.2 Default behavior
3805.sp
3806\fBfastaFromBed\fP will extract the sequence defined by the coordinates in a BED interval and create a
3807new FASTA entry in the output file for each extracted sequence. By default, the FASTA header for each
3808extracted sequence will be formatted as follows: "<chrom>:<start>\-<end>".
3809.sp
3810For example:
3811.INDENT 0.0
3812.INDENT 3.5
3813.sp
3814.nf
3815.ft C
3816$ cat test.fa
3817>chr1
3818AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3819
3820cat test.bed
3821chr1 5 10
3822
3823fastaFromBed \-fi test.fa \-bed test.bed \-fo test.fa.out
3824
3825cat test.fa.out
3826>chr1:5\-10
3827AAACC
3828.ft P
3829.fi
3830.UNINDENT
3831.UNINDENT
3832.SS 5.11.3 Using the BED "name" column as a FASTA header.
3833.sp
3834Using the \fB\-name\fP option, one can set the FASTA header for each extracted sequence to be the "name"
3835columns from the BED feature.
3836.sp
3837For example:
3838.INDENT 0.0
3839.INDENT 3.5
3840.sp
3841.nf
3842.ft C
3843cat test.fa
3844>chr1
3845AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3846
3847cat test.bed
3848chr1 5 10 myseq
3849
3850fastaFromBed \-fi test.fa \-bed test.bed \-fo test.fa.out \-name
3851
3852cat test.fa.out
3853>myseq
3854AAACC
3855.ft P
3856.fi
3857.UNINDENT
3858.UNINDENT
3859.SS 5.11.4 Creating a tab\-delimited output file in lieu of FASTA output.
3860.sp
3861Using the \fB\-tab\fP option, the \fB\-fo\fP output file will be tab\-delimited instead of in FASTA format.
3862.sp
3863For example:
3864.INDENT 0.0
3865.INDENT 3.5
3866.sp
3867.nf
3868.ft C
3869cat test.fa
3870>chr1
3871AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3872
3873cat test.bed
3874chr1 5 10 myseq
3875
3876fastaFromBed \-fi test.fa \-bed test.bed \-fo test.fa.out.tab \-name \-tab
3877
3878cat test.fa.out
3879myseq AAACC
3880.ft P
3881.fi
3882.UNINDENT
3883.UNINDENT
3884.SS 5.11.5 (\-s)Forcing the extracted sequence to reflect the requested strand
3885.sp
3886\fBfastaFromBed\fP will extract the sequence in the orientation defined in the strand column when the "\-s"
3887option is used.
3888.sp
3889For example:
3890.INDENT 0.0
3891.INDENT 3.5
3892.sp
3893.nf
3894.ft C
3895cat test.fa
3896>chr1
3897AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3898
3899cat test.bed
3900chr1 20 25 forward 1 +
3901chr1 20 25 reverse 1 \-
3902
3903fastaFromBed \-fi test.fa \-bed test.bed \-s \-name \-fo test.fa.out
3904
3905cat test.fa.out
3906>forward
3907CGCTA
3908>reverse
3909TAGCG
3910.ft P
3911.fi
3912.UNINDENT
3913.UNINDENT
3914.SS 5.12 maskFastaFromBed
3915.sp
3916\fBmaskFastaFromBed\fP masks sequences in a FASTA file based on intervals defined in a feature file. The
3917headers in the input FASTA file must exactly match the chromosome column in the feature file. This
3918may be useful fro creating your own masked genome file based on custom annotations or for masking all
3919but your target regions when aligning sequence data from a targeted capture experiment.
3920.SS 5.12.1 Usage and option summary
3921.sp
3922Usage:
3923.INDENT 0.0
3924.INDENT 3.5
3925.sp
3926.nf
3927.ft C
3928maskFastaFromBed [OPTIONS] \-fi <input FASTA> \-bed <BED/GFF/VCF> \-fo <output FASTA>
3929.ft P
3930.fi
3931.UNINDENT
3932.UNINDENT
3933.sp
3934NOTE: The input and output FASTA files must be different.
3935.TS
3936center;
3937|l|l|.
3938_
3939T{
3940Option
3941T}      T{
3942Description
3943T}
3944_
3945T{
3946\fB\-soft\fP
3947T}      T{
3948Soft\-mask (that is, convert to lower\-case bases) the FASTA sequence. \fIBy default, hard\-masking (that is, conversion to Ns) is performed\fP\&.
3949T}
3950_
3951.TE
3952.SS 5.12.2 Default behavior
3953.sp
3954\fBmaskFastaFromBed\fP will mask a FASTA file based on the intervals in a BED file. The newly masked
3955FASTA file is written to the output FASTA file.
3956.sp
3957For example:
3958.INDENT 0.0
3959.INDENT 3.5
3960.sp
3961.nf
3962.ft C
3963cat test.fa
3964>chr1
3965AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3966
3967cat test.bed
3968chr1 5 10
3969
3970maskFastaFromBed \-fi test.fa \-bed test.bed \-fo test.fa.out
3971
3972cat test.fa.out
3973>chr1
3974AAAAANNNNNCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3975.ft P
3976.fi
3977.UNINDENT
3978.UNINDENT
3979.SS 5.12.3 Soft\-masking the FASTA file.
3980.sp
3981Using the \fB\-soft\fP option, one can optionally "soft\-mask" the FASTA file.
3982.sp
3983For example:
3984.INDENT 0.0
3985.INDENT 3.5
3986.sp
3987.nf
3988.ft C
3989cat test.fa
3990>chr1
3991AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
3992
3993cat test.bed
3994chr1 5 10
3995
3996maskFastaFromBed \-fi test.fa \-bed test.bed \-fo test.fa.out \-soft
3997
3998cat test.fa.out
3999>chr1
4000AAAAAaaaccCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG
4001.ft P
4002.fi
4003.UNINDENT
4004.UNINDENT
4005.SS 5.13 shuffleBed
4006.sp
4007\fBshuffleBed\fP will randomly permute the genomic locations of a fearure file among a genome defined in a
4008genome file. One can also provide an "exclusions" BED/GFF/VCF file that lists regions where you do
4009not want the permuted features to be placed. For example, one might want to prevent features from
4010being placed in known genome gaps. \fBshuffleBed\fP is useful as a \fInull\fP basis against which to test the
4011significance of associations of one feature with another.
4012.SS 5.13.1 Usage and option summary
4013.sp
4014Usage:
4015.INDENT 0.0
4016.INDENT 3.5
4017.sp
4018.nf
4019.ft C
4020shuffleBed [OPTIONS] \-i <BED/GFF/VCF> \-g <GENOME>
4021.ft P
4022.fi
4023.UNINDENT
4024.UNINDENT
4025.TS
4026center;
4027|l|l|.
4028_
4029T{
4030Option
4031T}      T{
4032Description
4033T}
4034_
4035T{
4036\fB\-excl\fP
4037T}      T{
4038A BED file of coordinates in which features from \-i should \fInot\fP be placed (e.g., genome gaps).
4039T}
4040_
4041T{
4042\fB\-chrom\fP
4043T}      T{
4044Keep features in \-i on the same chromosome. Solely permute their location on the chromosome. \fIBy default, both the chromosome and position are randomly chosen\fP\&.
4045T}
4046_
4047T{
4048\fB\-seed\fP
4049T}      T{
4050Supply an integer seed for the shuffling. This will allow feature shuffling experiments to be recreated exactly as the seed for the pseudo\-random number generation will be constant. \fIBy default, the seed is chosen automatically\fP\&.
4051T}
4052_
4053.TE
4054.SS 5.13.2 Default behavior
4055.sp
4056By default, \fBshuffleBed\fP will reposition each feature in the input BED file on a random chromosome at a
4057random position. The size and strand of each feature are preserved.
4058.sp
4059For example:
4060.INDENT 0.0
4061.INDENT 3.5
4062.sp
4063.nf
4064.ft C
4065cat A.bed
4066chr1  0  100  a1  1  +
4067chr1  0  1000 a2  2  \-
4068
4069cat my.genome
4070chr1  10000
4071chr2  8000
4072chr3  5000
4073chr4  2000
4074
4075shuffleBed \-i A.bed \-g my.genome
4076chr4  1498  1598  a1  1  +
4077chr3  2156  3156  a2  2  \-
4078.ft P
4079.fi
4080.UNINDENT
4081.UNINDENT
4082.SS 5.13.3 (\-chrom)Requiring that features be shuffled on the same chromosome
4083.sp
4084The "\fB\-chrom\fP" option behaves the same as the default behavior except that features are randomly
4085placed on the same chromosome as defined in the BED file.
4086.sp
4087For example:
4088.INDENT 0.0
4089.INDENT 3.5
4090.sp
4091.nf
4092.ft C
4093cat A.bed
4094chr1  0  100  a1  1  +
4095chr1  0  1000 a2  2  \-
4096
4097cat my.genome
4098chr1  10000
4099chr2  8000
4100chr3  5000
4101chr4  2000
4102
4103shuffleBed \-i A.bed \-g my.genome \-chrom
4104chr1  9560  9660  a1  1  +
4105chr1  7258  8258  a2  2  \-
4106.ft P
4107.fi
4108.UNINDENT
4109.UNINDENT
4110.SS 5.13.4 Excluding certain genome regions from shuffleBed
4111.sp
4112One may want to prevent BED features from being placed in certain regions of the genome. For
4113example, one may want to exclude genome gaps from permutation experiment. The "\fB\-excl\fP" option
4114defines a BED file of regions that should be excluded. \fBshuffleBed\fP will attempt to permute the
4115locations of all features while adhering to the exclusion rules. However it will stop looking for an
4116appropriate location if it cannot find a valid spot for a feature after 1,000,000 tries.
4117.sp
4118For example (\fInote that the exclude file excludes all but 100 base pairs of the chromosome\fP):
4119.INDENT 0.0
4120.INDENT 3.5
4121.sp
4122.nf
4123.ft C
4124cat A.bed
4125chr1  0  100   a1  1  +
4126chr1  0  1000  a2  2  \-
4127
4128cat my.genome
4129chr1  10000
4130
4131cat exclude.bed
4132chr1  100  10000
4133
4134shuffleBed \-i A.bed \-g my.genome \-excl exclude.bed
4135chr1  0  100  a1  1  +
4136Error, line 2: tried 1000000 potential loci for entry, but could not avoid excluded
4137regions. Ignoring entry and moving on.
4138.ft P
4139.fi
4140.UNINDENT
4141.UNINDENT
4142.sp
4143For example (\fInow the exclusion file only excludes the first 100 bases of the chromosome\fP):
4144.INDENT 0.0
4145.INDENT 3.5
4146.sp
4147.nf
4148.ft C
4149cat A.bed
4150chr1  0  100  a1  1  +
4151chr1  0  1000 a2  2  \-
4152
4153cat my.genome
4154chr1  10000
4155
4156cat exclude.bed
4157chr1  0  100
4158
4159shuffleBed \-i A.bed \-g my.genome \-excl exclude.bed
4160chr1  147  247  a1  1  +
4161chr1  2441 3441 a2  2  \-
4162.ft P
4163.fi
4164.UNINDENT
4165.UNINDENT
4166.SS 5.13.5 Defining a "seed" for the random replacement.
4167.sp
4168\fBshuffleBed\fP uses a pseudo\-random number generator to permute the locations of BED features.
4169Therefore, each run should produce a different result. This can be problematic if one wants to exactly
4170recreate an experiment. By using the "\fB\-seed\fP" option, one can supply a custom integer seed for
4171\fBshuffleBed\fP\&. In turn, each execution of \fBshuffleBed\fP with the same seed and input files should produce
4172identical results.
4173.sp
4174For example (\fInote that the exclude file below excludes all but 100 base pairs of the chromosome\fP):
4175.INDENT 0.0
4176.INDENT 3.5
4177.sp
4178.nf
4179.ft C
4180cat A.bed
4181chr1 0 100 a1 1 +
4182chr1 0 1000 a2 2 \-
4183
4184cat my.genome
4185chr1 10000
4186
4187shuffleBed \-i A.bed \-g my.genome \-seed 927442958
4188chr1 6177 6277 a1 1 +
4189chr1 8119 9119 a2 2 \-
4190
4191shuffleBed \-i A.bed \-g my.genome \-seed 927442958
4192chr1 6177 6277 a1 1 +
4193chr1 8119 9119 a2 2 \-
4194
4195\&. . .
4196
4197shuffleBed \-i A.bed \-g my.genome \-seed 927442958
4198chr1 6177 6277 a1 1 +
4199chr1 8119 9119 a2 2 \-
4200.ft P
4201.fi
4202.UNINDENT
4203.UNINDENT
4204.SS 5.14 slopBed
4205.sp
4206\fBslopBed\fP will increase the size of each feature in a feature file be a user\-defined number of bases. While
4207something like this could be done with an "\fBawk \(aq{OFS="t" print $1,$2\-<slop>,$3+<slop>}\(aq\fP",
4208\fBslopBed\fP will restrict the resizing to the size of the chromosome (i.e. no start < 0 and no end >
4209chromosome size).
4210.SS 5.14.1 Usage and option summary
4211.sp
4212Usage:
4213.INDENT 0.0
4214.INDENT 3.5
4215.sp
4216.nf
4217.ft C
4218slopBed [OPTIONS] \-i <BED/GFF/VCF> \-g <GENOME> [\-b or (\-l and \-r)]
4219.ft P
4220.fi
4221.UNINDENT
4222.UNINDENT
4223.TS
4224center;
4225|l|l|.
4226_
4227T{
4228Option
4229T}      T{
4230Description
4231T}
4232_
4233T{
4234\fB\-b\fP
4235T}      T{
4236Increase the BED/GFF/VCF entry by the same number base pairs in each direction. \fIInteger\fP\&.
4237T}
4238_
4239T{
4240\fB\-l\fP
4241T}      T{
4242The number of base pairs to subtract from the start coordinate. \fIInteger\fP\&.
4243T}
4244_
4245T{
4246\fB\-r\fP
4247T}      T{
4248The number of base pairs to add to the end coordinate. \fIInteger\fP\&.
4249T}
4250_
4251T{
4252\fB\-s\fP
4253T}      T{
4254Define \-l and \-r based on strand. For example. if used, \-l 500 for a negative\-stranded feature, it will add 500 bp to the \fIend\fP coordinate.
4255T}
4256_
4257.TE
4258.SS 5.14.2 Default behavior
4259.sp
4260By default, \fBslopBed\fP will either add a fixed number of bases in each direction (\fB\-b\fP) or an asymmetric
4261number of bases in each direction (\fB\-l\fP and \fB\-r\fP).
4262.sp
4263For example:
4264.INDENT 0.0
4265.INDENT 3.5
4266.sp
4267.nf
4268.ft C
4269cat A.bed
4270chr1 5 100
4271chr1 800 980
4272
4273cat my.genome
4274chr1 1000
4275
4276slopBed \-i A.bed \-g my.genome \-b 5
4277chr1 0 105
4278chr1 795 985
4279
4280slopBed \-i A.bed \-g my.genome \-l 2 \-r 3
4281chr1 3 103
4282chr1 798 983
4283.ft P
4284.fi
4285.UNINDENT
4286.UNINDENT
4287.sp
4288However, if the requested number of bases exceeds the boundaries of the chromosome, \fBslopBed\fP will
4289"clip" the feature accordingly.
4290.INDENT 0.0
4291.INDENT 3.5
4292.sp
4293.nf
4294.ft C
4295cat A.bed
4296chr1  5   100
4297chr1  800 980
4298
4299cat my.genome
4300chr1  1000
4301
4302slopBed \-i A.bed \-g my.genome \-b 5000
4303chr1  0   1000
4304chr1  0   1000
4305.ft P
4306.fi
4307.UNINDENT
4308.UNINDENT
4309.SS 5.14.3 Resizing features according to strand
4310.sp
4311\fBslopBed\fP will optionally increase the size of a feature based on strand.
4312.sp
4313For example:
4314.INDENT 0.0
4315.INDENT 3.5
4316.sp
4317.nf
4318.ft C
4319cat A.bed
4320chr1 100 200 a1 1 +
4321chr1 100 200 a2 2 \-
4322
4323cat my.genome
4324chr1 1000
4325
4326slopBed  \-i A.bed \-g my.genome \-l 50 \-r 80 \-s
4327chr1 50  280 a1 1 +
4328chr1 20  250 a2 2 \-
4329.ft P
4330.fi
4331.UNINDENT
4332.UNINDENT
4333.SS 5.15 sortBed
4334.sp
4335\fBsortBed\fP sorts a feature file by chromosome and other criteria.
4336.SS 5.15.1 Usage and option summary
4337.sp
4338Usage:
4339.INDENT 0.0
4340.INDENT 3.5
4341.sp
4342.nf
4343.ft C
4344sortBed [OPTIONS] \-i <BED/GFF/VCF>
4345.ft P
4346.fi
4347.UNINDENT
4348.UNINDENT
4349.TS
4350center;
4351|l|l|.
4352_
4353T{
4354Option
4355T}      T{
4356Description
4357T}
4358_
4359T{
4360\fB\-sizeA\fP
4361T}      T{
4362Sort by feature size in ascending order.
4363T}
4364_
4365T{
4366\fB\-sizeD\fP
4367T}      T{
4368Sort by feature size in descending order.
4369T}
4370_
4371T{
4372\fB\-chrThenSizeA\fP
4373T}      T{
4374Sort by chromosome, then by feature size (asc).
4375T}
4376_
4377T{
4378\fB\-chrThenSizeD\fP
4379T}      T{
4380Sort by chromosome, then by feature size (desc).
4381T}
4382_
4383T{
4384\fB\-chrThenScoreA\fP
4385T}      T{
4386Sort by chromosome, then by score (asc).
4387T}
4388_
4389T{
4390\fB\-chrThenScoreD\fP
4391T}      T{
4392Sort by chromosome, then by score (desc).
4393T}
4394_
4395.TE
4396.SS 5.15.2 Default behavior
4397.sp
4398By default, \fBsortBed\fP sorts a BED file by chromosome and then by start position in ascending order.
4399.sp
4400For example:
4401.INDENT 0.0
4402.INDENT 3.5
4403.sp
4404.nf
4405.ft C
4406cat A.bed
4407chr1 800 1000
4408chr1 80  180
4409chr1 1   10
4410chr1 750 10000
4411
4412sortBed \-i A.bed
4413chr1 1   10
4414chr1 80  180
4415chr1 750 10000
4416chr1 800 1000
4417.ft P
4418.fi
4419.UNINDENT
4420.UNINDENT
4421.SS 5.15.3 Optional sorting behavior
4422.sp
4423\fBsortBed\fP will also sorts a BED file by chromosome and then by other criteria.
4424.sp
4425For example, to sort by chromosome and then by feature size (in descending order):
4426.INDENT 0.0
4427.INDENT 3.5
4428.sp
4429.nf
4430.ft C
4431cat A.bed
4432chr1 800 1000
4433chr1 80  180
4434chr1 1   10
4435chr1 750 10000
4436
4437sortBed \-i A.bed \-sizeD
4438chr1 750 10000
4439chr1 800 1000
4440chr1 80  180
4441chr1 1   10
4442.ft P
4443.fi
4444.UNINDENT
4445.UNINDENT
4446.sp
4447\fBDisclaimer:\fP it should be noted that \fBsortBed\fP is merely a convenience utility, as the UNIX sort utility
4448will sort BED files more quickly while using less memory. For example, UNIX sort will sort a BED file
4449by chromosome then by start position in the following manner:
4450.INDENT 0.0
4451.INDENT 3.5
4452.sp
4453.nf
4454.ft C
4455sort \-k 1,1 \-k2,2 \-n a.bed
4456chr1 1   10
4457chr1 80  180
4458chr1 750 10000
4459chr1 800 1000
4460.ft P
4461.fi
4462.UNINDENT
4463.UNINDENT
4464.SS 5.16 linksBed
4465.sp
4466Creates an HTML file with links to an instance of the UCSC Genome Browser for all features /
4467intervals in a file. This is useful for cases when one wants to manually inspect through a large set of
4468annotations or features.
4469.SS 5.16.1 Usage and option summary
4470.sp
4471Usage:
4472.INDENT 0.0
4473.INDENT 3.5
4474.sp
4475.nf
4476.ft C
4477linksBed [OPTIONS] \-i <BED/GFF/VCF> > <HTML file>
4478.ft P
4479.fi
4480.UNINDENT
4481.UNINDENT
4482.TS
4483center;
4484|l|l|.
4485_
4486T{
4487Option
4488T}      T{
4489Description
4490T}
4491_
4492T{
4493\fB\-base\fP
4494T}      T{
4495The "basename" for the UCSC browser. \fIDefault: http://genome.ucsc.edu\fP
4496T}
4497_
4498T{
4499\fB\-org\fP
4500T}      T{
4501The organism (e.g. mouse, human). \fIDefault: human\fP
4502T}
4503_
4504T{
4505\fB\-db\fP
4506T}      T{
4507The genome build. \fIDefault: hg18\fP
4508T}
4509_
4510.TE
4511.SS 5.16.2 Default behavior
4512.sp
4513By default, \fBlinksBed\fP creates links to the public UCSC Genome Browser.
4514.sp
4515For example:
4516.INDENT 0.0
4517.INDENT 3.5
4518.sp
4519.nf
4520.ft C
4521head genes.bed
4522chr21 9928613  10012791  uc002yip.1 0  \-
4523chr21 9928613  10012791  uc002yiq.1 0  \-
4524chr21 9928613  10012791  uc002yir.1 0  \-
4525chr21 9928613  10012791  uc010gkv.1 0  \-
4526chr21 9928613  10061300  uc002yis.1 0  \-
4527chr21 10042683 10120796  uc002yit.1 0  \-
4528chr21 10042683 10120808  uc002yiu.1 0  \-
4529chr21 10079666 10120808  uc002yiv.1 0  \-
4530chr21 10080031 10081687  uc002yiw.1 0  \-
4531chr21 10081660 10120796  uc002yix.2 0  \-
4532
4533linksBed \-i genes.bed > genes.html
4534.ft P
4535.fi
4536.UNINDENT
4537.UNINDENT
4538.sp
4539When genes.html is opened in a web browser, one should see something like the following, where each
4540link on the page is built from the features in genes.bed:
4541.SS 5.16.3 Creating HTML links to a local UCSC Browser installation
4542.sp
4543Optionally, \fBlinksBed\fP will create links to a local copy of the UCSC Genome Browser.
4544.sp
4545For example:
4546.INDENT 0.0
4547.INDENT 3.5
4548.sp
4549.nf
4550.ft C
4551head \-3 genes.bed
4552chr21 9928613 10012791 uc002yip.1 0 \-
4553chr21 9928613 10012791 uc002yiq.1 0 \-
4554
4555linksBed \-i genes.bed \-base http://mirror.uni.edu > genes.html
4556.ft P
4557.fi
4558.UNINDENT
4559.UNINDENT
4560.sp
4561One can point the links to the appropriate organism and genome build as well:
4562.INDENT 0.0
4563.INDENT 3.5
4564.sp
4565.nf
4566.ft C
4567head \-3 genes.bed
4568chr21 9928613 10012791 uc002yip.1 0 \-
4569chr21 9928613 10012791 uc002yiq.1 0 \-
4570
4571linksBed \-i genes.bed \-base http://mirror.uni.edu \-org mouse \-db mm9 > genes.html
4572.ft P
4573.fi
4574.UNINDENT
4575.UNINDENT
4576.SS 5.17 complementBed
4577.sp
4578\fBcomplementBed\fP returns the intervals in a genome that are not by the features in a feature file. An
4579example usage of this tool would be to return the intervals of the genome that are not annotated as a
4580repeat.
4581.SS 5.17.1 Usage and option summary
4582.sp
4583Usage:
4584.INDENT 0.0
4585.INDENT 3.5
4586.sp
4587.nf
4588.ft C
4589complementBed [OPTIONS] \-i <BED/GFF/VCF> \-g <GENOME>
4590.ft P
4591.fi
4592.UNINDENT
4593.UNINDENT
4594.sp
4595\fBNo additional options.\fP
4596.SS 5.17.2 Default behavior
4597.sp
4598Figure:
4599.INDENT 0.0
4600.INDENT 3.5
4601.sp
4602.nf
4603.ft C
4604Chromosome  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4605
4606BED FILE A     *************   ***************     ******************
4607
4608Result      ===             ===               =====                  =======
4609.ft P
4610.fi
4611.UNINDENT
4612.UNINDENT
4613.sp
4614For example:
4615.INDENT 0.0
4616.INDENT 3.5
4617.sp
4618.nf
4619.ft C
4620cat A.bed
4621chr1  100  200
4622chr1  400  500
4623chr1  500  800
4624
4625cat my.genome
4626chr1  1000
4627
4628complementBed \-i A.bed \-g my.genome
4629chr1  0    100
4630chr1  200  400
4631chr1  800  1000
4632.ft P
4633.fi
4634.UNINDENT
4635.UNINDENT
4636.SS 5.18 bedToBam
4637.sp
4638\fBbedToBam\fP converts features in a feature file to BAM format. This is useful as an efficient means of
4639storing large genome annotations in a compact, indexed format for visualization purposes.
4640.SS 5.18.1 Usage and option summary
4641.sp
4642Usage:
4643.INDENT 0.0
4644.INDENT 3.5
4645.sp
4646.nf
4647.ft C
4648bedToBam [OPTIONS] \-i <BED/GFF/VCF> \-g <GENOME> > <BAM>
4649.ft P
4650.fi
4651.UNINDENT
4652.UNINDENT
4653.TS
4654center;
4655|l|l|.
4656_
4657T{
4658Option
4659T}      T{
4660Description
4661T}
4662_
4663T{
4664\fB\-mapq\fP
4665T}      T{
4666Set a mapping quality (SAM MAPQ field) value for all BED entries. \fIDefault: 255\fP
4667T}
4668_
4669T{
4670\fB\-ubam\fP
4671T}      T{
4672Write uncompressed BAM output. The default is write compressed BAM output.
4673T}
4674_
4675T{
4676\fB\-bed12\fP
4677T}      T{
4678Indicate that the input BED file is in BED12 (a.k.a "blocked" BED) format. In this case, bedToBam will convert blocked BED features (e.g., gene annotaions) into "spliced" BAM alignments by creating an appropriate CIGAR string.
4679T}
4680_
4681.TE
4682.SS 5.18.2 Default behavior
4683.sp
4684The default behavior is to assume that the input file is in unblocked format. For example:
4685.INDENT 0.0
4686.INDENT 3.5
4687.sp
4688.nf
4689.ft C
4690head \-5 rmsk.hg18.chr21.bed
4691chr21 9719768  9721892  ALR/Alpha  1004  +
4692chr21 9721905  9725582  ALR/Alpha  1010  +
4693chr21 9725582  9725977  L1PA3 3288 +
4694chr21 9726021  9729309  ALR/Alpha  1051  +
4695chr21 9729320  9729809  L1PA3 3897 \-
4696
4697bedToBam \-i rmsk.hg18.chr21.bed \-g human.hg18.genome > rmsk.hg18.chr21.bam
4698
4699samtools view rmsk.hg18.chr21.bam | head \-5
4700ALR/Alpha  0   chr21 9719769  255  2124M *  0  0  *  *
4701ALR/Alpha  0   chr21 9721906  255  3677M *  0  0  *  *
4702L1PA3      0   chr21 9725583  255  395M  *  0  0  *  *
4703ALR/Alpha  0   chr21 9726022  255  3288M *  0  0  *  *
4704L1PA3      16  chr21 9729321  255  489M  *  0  0  *  *
4705.ft P
4706.fi
4707.UNINDENT
4708.UNINDENT
4709.SS 5.18.3 Creating "spliced" BAM entries from "blocked" BED features
4710.sp
4711Optionally, \fBbedToBam\fP will create spliced BAM entries from "blocked" BED features by using the
4712\-bed12 option. This will create CIGAR strings in the BAM output that will be displayed as "spliced"
4713alignments. The image illustrates this behavior, as the top track is a BAM representation (using
4714bedToBam) of a BED file of UCSC genes.
4715.sp
4716For example:
4717.INDENT 0.0
4718.INDENT 3.5
4719.sp
4720.nf
4721.ft C
4722bedToBam \-i knownGene.hg18.chr21.bed \-g human.hg18.genome \-bed12 > knownGene.bam
4723
4724samtools view knownGene.bam | head \-2
4725uc002yip.1  16   chr21 9928614   2                       5                        5
4726
4727298M1784N71M1411N93M3963N80M1927N106M3608N81M1769N62M11856N89M98N82M816N61M6910N65M
4728738N64M146N100M1647N120M6478N162M1485N51M6777N60M9274N54M880N54M1229N54M2377N54M112
472968N58M2666N109M2885N158M     *   0  0  *  *
4730uc002yiq.1  16   chr21 9928614   2                       5                        5
4731
4732298M1784N71M1411N93M3963N80M1927N106M3608N81M1769N62M11856N89M98N82M816N61M6910N65M
4733738N64M146N100M1647N120M6478N162M1485N51M6777N60M10208N54M1229N54M2377N54M11268N58M
47342666N109M2885N158M       *   0   0  *  *
4735.ft P
4736.fi
4737.UNINDENT
4738.UNINDENT
4739.SS 5.19 overlap
4740.sp
4741\fBoverlap\fP computes the amount of overlap (in the case of positive values) or distance (in the case of
4742negative values) between feature coordinates occurring on the same input line and reports the result at
4743the end of the same line. In this way, it is a useful method for computing custom overlap scores from
4744the output of other BEDTools.
4745.SS 5.19.1 Usage and option summary
4746.sp
4747Usage:
4748.INDENT 0.0
4749.INDENT 3.5
4750.sp
4751.nf
4752.ft C
4753overlap [OPTIONS] \-i <input> \-cols s1,e1,s2,e2
4754.ft P
4755.fi
4756.UNINDENT
4757.UNINDENT
4758.TS
4759center;
4760|l|l|.
4761_
4762T{
4763Option
4764T}      T{
4765Description
4766T}
4767_
4768T{
4769\fB\-i\fP
4770T}      T{
4771Input file. Use "stdin" for pipes.
4772T}
4773_
4774T{
4775\fB\-cols\fP
4776T}      T{
4777Specify the columns (1\-based) for the starts and ends of the features for which you\(aqd like to compute the overlap/distance. The columns must be listed in the following order: \fIstart1,end1,start2,end2\fP
4778T}
4779_
4780.TE
4781.SS 5.19.2 Default behavior
4782.sp
4783The default behavior is to compute the amount of overlap between the features you specify based on the
4784start and end coordinates. For example:
4785.INDENT 0.0
4786.INDENT 3.5
4787.sp
4788.nf
4789.ft C
4790windowBed \-a A.bed \-b B.bed \-w 10
4791chr1  10  20  A  chr1  15  25  B
4792chr1  10  20  C  chr1  25  35  D
4793.ft P
4794.fi
4795.UNINDENT
4796.UNINDENT
4797.sp
4798# Now let\(aqs say we want to compute the number of base pairs of overlap
4799# between the overlapping features from the output of windowBed.
4800.INDENT 0.0
4801.INDENT 3.5
4802.sp
4803.nf
4804.ft C
4805windowBed \-a A.bed \-b B.bed \-w 10 | overlap \-i stdin \-cols 2,3,6,7
4806chr1  10  20  A  chr1  15  25  B  5
4807chr1  10  20  C  chr1  25  35  D  \-5
4808.ft P
4809.fi
4810.UNINDENT
4811.UNINDENT
4812.SS 5.20 bedToIgv
4813.sp
4814\fBbedToIgv\fP creates an IGV (\fI\%http://www.broadinstitute.org/igv/\fP) batch script (see: \fI\%http://\fP
4815www.broadinstitute.org/igv/batch for details) such that a ??snapshot?? will be taken at each features in a
4816feature file. This is useful as an efficient means for quickly collecting images of primary data at several
4817loci for subsequent screening, etc.
4818.sp
4819\fBNOTE: One must use IGV version 1.5 or higher.\fP
4820.SS 5.20.1 Usage and option summary
4821.sp
4822Usage:
4823.INDENT 0.0
4824.INDENT 3.5
4825.sp
4826.nf
4827.ft C
4828bedToIgv [OPTIONS] \-i <BED/GFF/VCF> > <igv.batch>
4829.ft P
4830.fi
4831.UNINDENT
4832.UNINDENT
4833.TS
4834center;
4835|l|l|.
4836_
4837T{
4838Option
4839T}      T{
4840Description
4841T}
4842_
4843T{
4844\fB\-path\fP
4845T}      T{
4846The full path to which the IGV snapshots should be written. \fIDefault: ./\fP
4847T}
4848_
4849T{
4850\fB\-sess\fP
4851T}      T{
4852The full path to an existing IGV session file to be loaded prior to taking snapshots. \fIDefault is for no session to be loaded and the assumption is that you already have IGV open and loaded with your relevant data prior to running the batch script\fP\&.
4853T}
4854_
4855T{
4856\fB\-sort\fP
4857T}      T{
4858The type of BAM sorting you would like to apply to each image. \fBValid sorting options\fP: \fIbase, position, strand, quality, sample, and readGroup Default is to apply no sorting at all\fP\&.
4859T}
4860_
4861T{
4862\fB\-clps\fP
4863T}      T{
4864Collapse the aligned reads prior to taking a snapshot. \fIDefault is to not collapse\fP\&.
4865T}
4866_
4867T{
4868\fB\-name\fP
4869T}      T{
4870Use the "name" field (column 4) for each image\(aqs filename. \fIDefault is to use the "chr:start\-pos.ext"\fP\&.
4871T}
4872_
4873T{
4874\fB\-slop\fP
4875T}      T{
4876Number of flanking base pairs on the left & right of the image.
4877T}
4878_
4879T{
4880\fB\-img\fP
4881T}      T{
4882The type of image to be created. \fBValid options\fP: \fIpng, eps, svg Default is png\fP\&.
4883T}
4884_
4885.TE
4886.SS 5.20.2 Default behavior
4887.sp
4888Figure:
4889.INDENT 0.0
4890.INDENT 3.5
4891.sp
4892.nf
4893.ft C
4894bedToIgv \-i data/rmsk.hg18.chr21.bed | head \-9
4895snapshotDirectory ./
4896goto chr21:9719768\-9721892
4897snapshot chr21:9719768\-9721892.png
4898goto chr21:9721905\-9725582
4899snapshot chr21:9721905\-9725582.png
4900goto chr21:9725582\-9725977
4901snapshot chr21:9725582\-9725977.png
4902goto chr21:9726021\-9729309
4903snapshot chr21:9726021\-9729309.png
4904.ft P
4905.fi
4906.UNINDENT
4907.UNINDENT
4908.SS 5.20.3 Using a bedToIgv batch script within IGV.
4909.sp
4910Once an IGV batch script has been created with \fBbedToIgv\fP, it is simply a matter of running it from
4911within IGV.
4912.sp
4913For example, first create the batch script:
4914.INDENT 0.0
4915.INDENT 3.5
4916.sp
4917.nf
4918.ft C
4919bedToIgv \-i data/rmsk.hg18.chr21.bed > rmsk.igv.batch
4920.ft P
4921.fi
4922.UNINDENT
4923.UNINDENT
4924.sp
4925Then, open and launch the batch script from within IGV. This will immediately cause IGV to begin
4926taking snapshots of your requested regions.
4927.SS 5.21 bed12ToBed6
4928.sp
4929\fBbed12ToBed6\fP is a convenience tool that converts BED features in BED12 (a.k.a. "blocked" BED
4930features such as genes) to discrete BED6 features. For example, in the case of a gene with six exons,
4931bed12ToBed6 would create six separate BED6 features (i.e., one for each exon).
4932.SS 5.21.1 Usage and option summary
4933.sp
4934Usage:
4935.INDENT 0.0
4936.INDENT 3.5
4937.sp
4938.nf
4939.ft C
4940bed12ToBed6 [OPTIONS] \-i <BED12>
4941.ft P
4942.fi
4943.UNINDENT
4944.UNINDENT
4945.TS
4946center;
4947|l|l|.
4948_
4949T{
4950Option
4951T}      T{
4952Description
4953T}
4954_
4955T{
4956\fB\-i\fP
4957T}      T{
4958The BED12 file that should be split into discrete BED6 features. \fIUse "stdin" when using piped input\fP\&.
4959T}
4960_
4961.TE
4962.SS 5.21.2 Default behavior
4963.sp
4964Figure:
4965.INDENT 0.0
4966.INDENT 3.5
4967.sp
4968.nf
4969.ft C
4970head data/knownGene.hg18.chr21.bed | tail \-n 3
4971chr21 10079666  10120808   uc002yiv.1  0  \-  10081686  1 0 1 2 0 6 0 8
4972      0     4   528,91,101,215, 0,1930,39750,40927,
4973chr21 10080031  10081687   uc002yiw.1  0  \-  10080031  1 0 0 8 0 0 3 1
4974      0     2   200,91,    0,1565,
4975chr21 10081660  10120796   uc002yix.2  0  \-  10081660  1 0 0 8 1 6 6 0
4976      0     3   27,101,223,0,37756,38913,
4977
4978head data/knownGene.hg18.chr21.bed | tail \-n 3 | bed12ToBed6 \-i stdin
4979chr21 10079666  10080194  uc002yiv.1 0  \-
4980chr21 10081596  10081687  uc002yiv.1 0  \-
4981chr21 10119416  10119517  uc002yiv.1 0  \-
4982chr21 10120593  10120808  uc002yiv.1 0  \-
4983chr21 10080031  10080231  uc002yiw.1 0  \-
4984chr21 10081596  10081687  uc002yiw.1 0  \-
4985chr21 10081660  10081687  uc002yix.2 0  \-
4986chr21 10119416  10119517  uc002yix.2 0  \-
4987chr21 10120573  10120796  uc002yix.2 0  \-
4988.ft P
4989.fi
4990.UNINDENT
4991.UNINDENT
4992.SS 5.22 groupBy
4993.sp
4994\fBgroupBy\fP is a useful tool that mimics the "groupBy" clause in database systems. Given a file or stream
4995that is sorted by the appropriate "grouping columns", groupBy will compute summary statistics on
4996another column in the file or stream. This will work with output from all BEDTools as well as any other
4997tab\-delimited file or stream.
4998.sp
4999\fBNOTE: When using groupBy, the input data must be ordered by the same
5000columns as specified with the \-grp argument. For example, if \-grp is 1,2,3, the the
5001data should be pre\-grouped accordingly. When groupBy detects changes in the
5002group columns it then summarizes all lines with that group\fP\&.
5003.SS 5.22.1 Usage and option summary
5004.sp
5005Usage:
5006.INDENT 0.0
5007.INDENT 3.5
5008.sp
5009.nf
5010.ft C
5011groupBy [OPTIONS] \-i <input> \-opCol <input column>
5012.ft P
5013.fi
5014.UNINDENT
5015.UNINDENT
5016.TS
5017center;
5018|l|l|.
5019_
5020T{
5021Option
5022T}      T{
5023Description
5024T}
5025_
5026T{
5027\fB\-i\fP
5028T}      T{
5029.INDENT 0.0
5030.INDENT 3.5
5031The input file that should be grouped and summarized. \fIUse "stdin" when using piped input\fP\&.
5032.UNINDENT
5033.UNINDENT
5034.sp
5035\fBNote: if \-i is omitted, input is assumed to come from standard input (stdin)\fP
5036T}
5037_
5038T{
5039\fB\-g OR \-grp\fP
5040T}      T{
5041Specifies which column(s) (1\-based) should be used to group the input. The columns must be comma\-separated and each column must be explicitly listed. No ranges (e.g. 1\-4) yet allowed. \fIDefault: 1,2,3\fP
5042T}
5043_
5044T{
5045\fB\-c OR \-opCol\fP
5046T}      T{
5047Specify the column (1\-based) that should be summarized. \fIRequired\fP\&.
5048T}
5049_
5050T{
5051\fB\-o OR \-op\fP
5052T}      T{
5053Specify the operation that should be applied to \fBopCol\fP\&.
5054.nf
5055Valid operations:
5056.fi
5057.sp
5058.INDENT 0.0
5059.INDENT 3.5
5060.nf
5061\fBsum\fP \- \fInumeric only\fP
5062\fBcount\fP \- \fInumeric or text\fP
5063\fBmin\fP \- \fInumeric only\fP
5064\fBmax\fP \- \fInumeric only\fP
5065\fBmean\fP \- \fInumeric only\fP
5066\fBstdev\fP \- \fInumeric only\fP
5067\fBmedian\fP \- \fInumeric only\fP
5068\fBmode\fP \- \fInumeric or text\fP
5069\fBantimode\fP \- \fInumeric or text\fP
5070\fBcollapse\fP (i.e., print a comma separated list) \- \fInumeric or text\fP
5071\fBfreqasc\fP \- \fIprint a comma separated list of values observed and the number of times they were observed. Reported in ascending order of frequency\fP
5072.fi
5073.sp
5074.UNINDENT
5075.UNINDENT
5076.nf
5077\fBfreqdesc\fP \- \fIprint a comma separated list of values observed and the number of times they were observed. Reported in descending order of frequency\fP
5078.fi
5079.sp
5080.INDENT 0.0
5081.INDENT 3.5
5082.nf
5083\fIDefault: sum\fP
5084.fi
5085.sp
5086.UNINDENT
5087.UNINDENT
5088T}
5089_
5090.TE
5091.SS 5.22.2 Default behavior.
5092.sp
5093Let\(aqs imagine we have three incredibly interesting genetic variants that we are studying and we are
5094interested in what annotated repeats these variants overlap.
5095.INDENT 0.0
5096.INDENT 3.5
5097.sp
5098.nf
5099.ft C
5100cat variants.bed
5101chr21  9719758 9729320 variant1
5102chr21  9729310 9757478 variant2
5103chr21  9795588 9796685 variant3
5104
5105intersectBed \-a variants.bed \-b repeats.bed \-wa \-wb > variantsToRepeats.bed
5106cat variantsToRepeats.bed
5107chr21  9719758 9729320 variant1   chr21  9719768 9721892 ALR/Alpha   1004  +
5108chr21  9719758 9729320 variant1   chr21  9721905 9725582 ALR/Alpha   1010  +
5109chr21  9719758 9729320 variant1   chr21  9725582 9725977 L1PA3       3288  +
5110chr21  9719758 9729320 variant1   chr21  9726021 9729309 ALR/Alpha   1051  +
5111chr21  9729310 9757478 variant2   chr21  9729320 9729809 L1PA3       3897  \-
5112chr21  9729310 9757478 variant2   chr21  9729809 9730866 L1P1        8367  +
5113chr21  9729310 9757478 variant2   chr21  9730866 9734026 ALR/Alpha   1036  \-
5114chr21  9729310 9757478 variant2   chr21  9734037 9757471 ALR/Alpha   1182  \-
5115chr21  9795588 9796685 variant3   chr21  9795589 9795713 (GAATG)n    308   +
5116chr21  9795588 9796685 variant3   chr21  9795736 9795894 (GAATG)n    683   +
5117chr21  9795588 9796685 variant3   chr21  9795911 9796007 (GAATG)n    345   +
5118chr21  9795588 9796685 variant3   chr21  9796028 9796187 (GAATG)n    756   +
5119chr21  9795588 9796685 variant3   chr21  9796202 9796615 (GAATG)n    891   +
5120chr21  9795588 9796685 variant3   chr21  9796637 9796824 (GAATG)n    621   +
5121.ft P
5122.fi
5123.UNINDENT
5124.UNINDENT
5125.sp
5126We can see that variant1 overlaps with 3 repeats, variant2 with 4 and variant3 with 6. We can use
5127groupBy to summarize the hits for each variant in several useful ways. The default behavior is to
5128compute the \fIsum\fP of the opCol.
5129.INDENT 0.0
5130.INDENT 3.5
5131.sp
5132.nf
5133.ft C
5134groupBy \-i variantsToRepeats.bed \-grp 1,2,3 \-opCol 9
5135chr21 9719758 9729320 6353
5136chr21 9729310 9757478 14482
5137chr21 9795588 9796685 3604
5138.ft P
5139.fi
5140.UNINDENT
5141.UNINDENT
5142.SS 5.22.3 Computing the min and max.
5143.sp
5144Now let\(aqs find the \fImin\fP and \fImax\fP repeat score for each variant. We do this by "grouping" on the variant
5145coordinate columns (i.e. cols. 1,2 and 3) and ask for the min and max of the repeat score column (i.e.
5146col. 9).
5147.INDENT 0.0
5148.INDENT 3.5
5149.sp
5150.nf
5151.ft C
5152groupBy \-i variantsToRepeats.bed \-g 1,2,3 \-c 9 \-o min
5153chr21 9719758 9729320 1004
5154chr21 9729310 9757478 1036
5155chr21 9795588 9796685 308
5156.ft P
5157.fi
5158.UNINDENT
5159.UNINDENT
5160.sp
5161We can also group on just the \fIname\fP column with similar effect.
5162.INDENT 0.0
5163.INDENT 3.5
5164.sp
5165.nf
5166.ft C
5167groupBy \-i variantsToRepeats.bed \-grp 4 \-opCol 9 \-op min
5168variant1 1004
5169variant2 1036
5170variant3 308
5171.ft P
5172.fi
5173.UNINDENT
5174.UNINDENT
5175.sp
5176What about the \fImax\fP score? Let\(aqs keep the coordinates and the name of the variants so that we
5177stay in BED format.
5178.INDENT 0.0
5179.INDENT 3.5
5180.sp
5181.nf
5182.ft C
5183groupBy \-i variantsToRepeats.bed \-grp 1,2,3,4 \-opCol 9 \-op max
5184chr21 9719758 9729320 variant1 3288
5185chr21 9729310 9757478 variant2 8367
5186chr21 9795588 9796685 variant3 891
5187.ft P
5188.fi
5189.UNINDENT
5190.UNINDENT
5191.SS 5.22.4 Computing the mean and median.
5192.sp
5193Now let\(aqs find the \fImean\fP and \fImedian\fP repeat score for each variant.
5194.INDENT 0.0
5195.INDENT 3.5
5196.sp
5197.nf
5198.ft C
5199cat variantsToRepeats.bed | groupBy \-g 1,2,3,4 \-c 9 \-o mean
5200chr21 9719758 9729320 variant1 1588.25
5201chr21 9729310 9757478 variant2 3620.5
5202chr21 9795588 9796685 variant3 600.6667
5203
5204groupBy \-i variantsToRepeats.bed \-grp 1,2,3,4 \-opCol 9 \-op median
5205chr21 9719758 9729320 variant1 1030.5
5206chr21 9729310 9757478 variant2 2539.5
5207chr21 9795588 9796685 variant3 652
5208.ft P
5209.fi
5210.UNINDENT
5211.UNINDENT
5212.SS 5.22.5 Computing the mode and "antimode".
5213.sp
5214Now let\(aqs find the \fImode\fP and \fIantimode\fP (i.e., the least frequent) repeat score for each variant (in this case
5215they are identical).
5216.INDENT 0.0
5217.INDENT 3.5
5218.sp
5219.nf
5220.ft C
5221groupBy \-i variantsToRepeats.bed \-grp 1,2,3,4 \-opCol 9 \-op mode
5222chr21 9719758 9729320 variant1 1004
5223chr21 9729310 9757478 variant2 1036
5224chr21 9795588 9796685 variant3 308
5225
5226groupBy \-i variantsToRepeats.bed \-grp 1,2,3,4 \-opCol 9 \-op antimode
5227chr21 9719758 9729320 variant1 1004
5228chr21 9729310 9757478 variant2 1036
5229chr21 9795588 9796685 variant3 308
5230.ft P
5231.fi
5232.UNINDENT
5233.UNINDENT
5234.SS 5.22.6 Computing the count of lines for a given group.
5235.sp
5236Figure:
5237.INDENT 0.0
5238.INDENT 3.5
5239.sp
5240.nf
5241.ft C
5242groupBy \-i variantsToRepeats.bed \-g 1,2,3,4 \-c 9 \-c count
5243chr21 9719758 9729320 variant1 4
5244chr21 9729310 9757478 variant2 4
5245chr21 9795588 9796685 variant3 6
5246.ft P
5247.fi
5248.UNINDENT
5249.UNINDENT
5250.SS 5.22.7 Collapsing: listing all of the values in the opCol for a given group.
5251.sp
5252Now for something different. What if we wanted all of the names of the repeats listed on the same line
5253as the variants? Use the collapse option. This "denormalizes" things. Now you have a list of all the
5254repeats on a single line.
5255.INDENT 0.0
5256.INDENT 3.5
5257.sp
5258.nf
5259.ft C
5260groupBy \-i variantsToRepeats.bed \-grp 1,2,3,4 \-opCol 9 \-op collapse
5261chr21 9719758 9729320 variant1 ALR/Alpha,ALR/Alpha,L1PA3,ALR/Alpha,
5262chr21 9729310 9757478 variant2 L1PA3,L1P1,ALR/Alpha,ALR/Alpha,
5263chr21 9795588 9796685 variant3 (GAATG)n,(GAATG)n,(GAATG)n,(GAATG)n,(GAATG)n,(GAATG)n,
5264.ft P
5265.fi
5266.UNINDENT
5267.UNINDENT
5268.SS 5.22.8 Computing frequencies: freqasc and freqdesc.
5269.sp
5270Now for something different. What if we wanted all of the names of the repeats listed on the same line
5271as the variants? Use the collapse option. This "denormalizes" things. Now you have a list of all the
5272repeats on a single line.
5273.INDENT 0.0
5274.INDENT 3.5
5275.sp
5276.nf
5277.ft C
5278cat variantsToRepeats.bed | groupBy \-g 1 \-c 8 \-o freqdesc
5279chr21 (GAATG)n:6,ALR/Alpha:5,L1PA3:2,L1P1:1,
5280
5281cat variantsToRepeats.bed | groupBy \-g 1 \-c 8 \-o freqasc
5282chr21 L1P1:1,L1PA3:2,ALR/Alpha:5,(GAATG)n:6,
5283.ft P
5284.fi
5285.UNINDENT
5286.UNINDENT
5287.SS 5.23 unionBedGraphs
5288.sp
5289\fBunionBedGraphs\fP combines multiple BEDGRAPH files into a single file such that one can directly
5290compare coverage (and other text\-values such as genotypes) across multiple sample
5291.SS 5.23.1 Usage and option summary
5292.sp
5293Usage:
5294.INDENT 0.0
5295.INDENT 3.5
5296.sp
5297.nf
5298.ft C
5299unionBedGraphs [OPTIONS] \-i FILE1 FILE2 FILE3 ... FILEn
5300.ft P
5301.fi
5302.UNINDENT
5303.UNINDENT
5304.TS
5305center;
5306|l|l|.
5307_
5308T{
5309Option
5310T}      T{
5311Description
5312T}
5313_
5314T{
5315\fB\-header\fP
5316T}      T{
5317Print a header line, consisting of chrom, start, end followed by the names of each input BEDGRAPH file.
5318T}
5319_
5320T{
5321\fB\-names\fP
5322T}      T{
5323A list of names (one per file) to describe each file in \-i. These names will be printed in the header line.
5324T}
5325_
5326T{
5327\fB\-empty\fP
5328T}      T{
5329Report empty regions (i.e., start/end intervals w/o values in all files). \fIRequires the \(aq\-g FILE\(aq parameter (see below)\fP\&.
5330T}
5331_
5332T{
5333\fB\-g\fP
5334T}      T{
5335The genome file to be used to calculate empty regions.
5336T}
5337_
5338T{
5339\fB\-filler TEXT\fP
5340T}      T{
5341Use TEXT when representing intervals having no value. Default is \(aq0\(aq, but you can use \(aqN/A\(aq or any other text.
5342T}
5343_
5344T{
5345\fB\-examples\fP
5346T}      T{
5347Show detailed usage examples.
5348T}
5349_
5350.TE
5351.SS 5.23.2 Default behavior
5352.sp
5353Figure:
5354.INDENT 0.0
5355.INDENT 3.5
5356.sp
5357.nf
5358.ft C
5359cat 1.bg
5360chr1 1000 1500 10
5361chr1 2000 2100 20
5362
5363cat 2.bg
5364chr1 900 1600 60
5365chr1 1700 2050 50
5366
5367cat 3.bg
5368chr1 1980 2070 80
5369chr1 2090 2100 20
5370
5371cat sizes.txt
5372chr1 5000
5373
5374unionBedGraphs \-i 1.bg 2.bg 3.bg
5375chr1 900  1000 0  60 0
5376chr1 1000 1500 10 60 0
5377chr1 1500 1600 0  60 0
5378chr1 1700 1980 0  50 0
5379chr1 1980 2000 0  50 80
5380chr1 2000 2050 20 50 80
5381chr1 2050 2070 20 0  80
5382chr1 2070 2090 20 0  0
5383chr1 2090 2100 20 0  20
5384.ft P
5385.fi
5386.UNINDENT
5387.UNINDENT
5388.SS 5.23.3 Add a header line to the output
5389.sp
5390Figure:
5391.INDENT 0.0
5392.INDENT 3.5
5393.sp
5394.nf
5395.ft C
5396unionBedGraphs \-i 1.bg 2.bg 3.bg \-header
5397chrom  start  end  1  2  3
5398chr1   900    1000 0  60 0
5399chr1   1000   1500 10 60 0
5400chr1   1500   1600 0  60 0
5401chr1   1700   1980 0  50 0
5402chr1   1980   2000 0  50 80
5403chr1   2000   2050 20 50 80
5404chr1   2050   2070 20 0  80
5405chr1   2070   2090 20 0  0
5406chr1   2090   2100 20 0  20
5407.ft P
5408.fi
5409.UNINDENT
5410.UNINDENT
5411.SS 5.23.4 Add a header line with custom file names to the output
5412.sp
5413Figure:
5414.INDENT 0.0
5415.INDENT 3.5
5416.sp
5417.nf
5418.ft C
5419unionBedGraphs \-i 1.bg 2.bg 3.bg \-header \-names WT\-1 WT\-2 KO\-1
5420chrom  start  end   WT\-1  WT\-2  KO\-1
5421chr1   900    1000  0     60    0
5422chr1   1000   1500  10    60    0
5423chr1   1500   1600  0     60    0
5424chr1   1700   1980  0     50    0
5425chr1   1980   2000  0     50    80
5426chr1   2000   2050  20    50    80
5427chr1   2050   2070  20    0     80
5428chr1   2070   2090  20    0     0
5429chr1   2090   2100  20    0     20
5430.ft P
5431.fi
5432.UNINDENT
5433.UNINDENT
5434.SS 5.23.5 Include regions that have zero coverage in all BEDGRAPH files.
5435.sp
5436Figure:
5437.INDENT 0.0
5438.INDENT 3.5
5439.sp
5440.nf
5441.ft C
5442unionBedGraphs \-i 1.bg 2.bg 3.bg \-empty \-g sizes.txt \-header
5443chrom  start  end  WT\-1  WT\-2  KO\-1
5444chrom  start  end  1     2     3
5445chr1   0      900  0     0     0
5446chr1   900    1000 0     60    0
5447chr1   1000   1500 10    60    0
5448chr1   1500   1600 0     60    0
5449chr1   1600   1700 0     0     0
5450chr1   1700   1980 0     50    0
5451chr1   1980   2000 0     50    80
5452chr1   2000   2050 20    50    80
5453chr1   2050   2070 20    0     80
5454chr1   2070   2090 20    0     0
5455chr1   2090   2100 20    0     20
5456chr1   2100   5000 0     0     0
5457.ft P
5458.fi
5459.UNINDENT
5460.UNINDENT
5461.SS 5.23.6 Use a custom value for missing values.
5462.sp
5463Figure:
5464.INDENT 0.0
5465.INDENT 3.5
5466.sp
5467.nf
5468.ft C
5469unionBedGraphs \-i 1.bg 2.bg 3.bg \-empty \-g sizes.txt \-header \-filler N/A
5470chrom start end  WT\-1  WT\-2  KO\-1
5471chrom start end  1     2     3
5472chr1  0     900  N/A   N/A   N/A
5473chr1  900   1000 N/A   60    N/A
5474chr1  1000  1500 10    60    N/A
5475chr1  1500  1600 N/A   60    N/A
5476chr1  1600  1700 N/A   N/A   N/A
5477chr1  1700  1980 N/A   50    N/A
5478chr1  1980  2000 N/A   50    80
5479chr1  2000  2050 20    50    80
5480chr1  2050  2070 20    N/A   80
5481chr1  2070  2090 20    N/A   N/A
5482chr1  2090  2100 20    N/A   20
5483chr1  2100  5000 N/A   N/A   N/A
5484.ft P
5485.fi
5486.UNINDENT
5487.UNINDENT
5488.SS 5.23.7 Use BEDGRAPH files with non\-numeric values.
5489.sp
5490Figure:
5491.INDENT 0.0
5492.INDENT 3.5
5493.sp
5494.nf
5495.ft C
5496cat 1.snp.bg
5497chr1 0 1 A/G
5498chr1 5 6 C/T
5499
5500cat 2.snp.bg
5501chr1 0 1 C/C
5502chr1 7 8 T/T
5503
5504cat 3.snp.bg
5505chr1 0 1 A/G
5506chr1 5 6 C/T
5507
5508unionBedGraphs \-i 1.snp.bg 2.snp.bg 3.snp.bg \-filler \-/\-
5509chr1 0 1 A/G C/C A/G
5510chr1 5 6 C/T \-/\- C/T
5511chr1 7 8 \-/\- T/T \-/\-
5512.ft P
5513.fi
5514.UNINDENT
5515.UNINDENT
5516.SS 5.24 annotateBed
5517.sp
5518\fBannotateBed\fP annotates one BED/VCF/GFF file with the coverage and number of overlaps observed
5519from multiple other BED/VCF/GFF files. In this way, it allows one to ask to what degree one feature
5520coincides with multiple other feature types with a single command.
5521.SS 5.24.1 Usage and option summary
5522.sp
5523Usage:
5524.INDENT 0.0
5525.INDENT 3.5
5526.sp
5527.nf
5528.ft C
5529annotateBed [OPTIONS] \-i <BED/GFF/VCF> \-files FILE1 FILE2 FILE3 ... FILEn
5530.ft P
5531.fi
5532.UNINDENT
5533.UNINDENT
5534.TS
5535center;
5536|l|l|.
5537_
5538T{
5539Option
5540T}      T{
5541Description
5542T}
5543_
5544T{
5545\fB\-namesr\fP
5546T}      T{
5547A list of names (one per file) to describe each file in \-i. These names will be printed as a header line.
5548T}
5549_
5550T{
5551\fB\-counts\fP
5552T}      T{
5553Report the count of features in each file that overlap \-i. Default behavior is to report the fraction of \-i covered by each file.
5554T}
5555_
5556T{
5557\fB\-both\fP
5558T}      T{
5559Report the count of features followed by the % coverage for each annotation file. Default is to report solely the fraction of \-i covered by each file.
5560T}
5561_
5562T{
5563\fB\-s\fP
5564T}      T{
5565Force strandedness. That is, only include hits in A that overlap B on the same strand. By default, hits are included without respect to strand.
5566T}
5567_
5568.TE
5569.SS 5.24.2 Default behavior \- annotate one file with coverage from others.
5570.sp
5571By default, the fraction of each feature covered by each annotation file is reported after the complete
5572feature in the file to be annotated.
5573.INDENT 0.0
5574.INDENT 3.5
5575.sp
5576.nf
5577.ft C
5578cat variants.bed
5579chr1 100  200   nasty 1  \-
5580chr2 500  1000  ugly  2  +
5581chr3 1000 5000  big   3  \-
5582
5583cat genes.bed
5584chr1 150  200   geneA 1  +
5585chr1 175  250   geneB 2  +
5586chr3 0    10000 geneC 3  \-
5587
5588cat conserve.bed
5589chr1 0    10000 cons1 1  +
5590chr2 700  10000 cons2 2  \-
5591chr3 4000 10000 cons3 3  +
5592
5593cat known_var.bed
5594chr1 0    120   known1   \-
5595chr1 150  160   known2   \-
5596chr2 0    10000 known3   +
5597
5598annotateBed \-i variants.bed \-files genes.bed conserv.bed known_var.bed
5599chr1 100  200  nasty 1 \-  0.500000  1.000000  0.300000
5600chr2 500  1000 ugly  2 +  0.000000  0.600000  1.000000
5601chr3 1000 5000 big   3 \-  1.000000  0.250000  0.000000
5602.ft P
5603.fi
5604.UNINDENT
5605.UNINDENT
5606.SS 5.24.3 Report the count of hits from the annotation files
5607.sp
5608Figure:
5609.INDENT 0.0
5610.INDENT 3.5
5611.sp
5612.nf
5613.ft C
5614annotateBed \-counts \-i variants.bed \-files genes.bed conserv.bed known_var.bed
5615chr1 100  200  nasty 1 \- 2 1 2
5616chr2 500  1000 ugly  2 + 0 1 1
5617chr3 1000 5000 big   3 \- 1 1 0
5618.ft P
5619.fi
5620.UNINDENT
5621.UNINDENT
5622.SS 5.24.4 Report both the count of hits and the fraction covered from the annotation files
5623.sp
5624Figure:
5625.INDENT 0.0
5626.INDENT 3.5
5627.sp
5628.nf
5629.ft C
5630annotateBed \-both \-i variants.bed \-files genes.bed conserv.bed known_var.bed
5631#chr start end  name  score +/\-  cnt1 pct1     cnt2 pct2     cnt3 pct3
5632chr1 100   200  nasty 1     \-    2    0.500000 1    1.000000 2    0.300000
5633chr2 500   1000 ugly  2     +    0    0.000000 1    0.600000 1    1.000000
5634chr3 1000  5000 big   3     \-    1    1.000000 1    0.250000 0    0.000000
5635.ft P
5636.fi
5637.UNINDENT
5638.UNINDENT
5639.SS 5.24.5 Restrict the reporting to overlaps on the same strand.
5640.sp
5641Note: Compare with the result from 5.24.3
5642.INDENT 0.0
5643.INDENT 3.5
5644.sp
5645.nf
5646.ft C
5647annotateBed \-s \-i variants.bed \-files genes.bed conserv.bed known_var.bed
5648chr1  100   200   nasty  var1  \-  0.000000  0.000000  0.000000
5649chr2  500   1000  ugly   var2  +  0.000000  0.000000  0.000000
5650chr3  1000  5000  big    var3  \-  1.000000  0.000000  0.000000
5651.ft P
5652.fi
5653.UNINDENT
5654.UNINDENT
5655.SH EXAMPLE USAGE
5656.sp
5657Below are several examples of basic BEDTools usage. Example BED files are provided in the
5658/data directory of the BEDTools distribution.
5659.SS 6.1 intersectBed
5660.sp
56616.1.1 Report the base\-pair overlap between sequence alignments and genes.
5662.INDENT 0.0
5663.INDENT 3.5
5664.sp
5665.nf
5666.ft C
5667intersectBed \-a reads.bed \-b genes.bed
5668.ft P
5669.fi
5670.UNINDENT
5671.UNINDENT
5672.sp
56736.1.2 Report whether each alignment overlaps one or more genes. If not, the alignment is not reported.
5674.INDENT 0.0
5675.INDENT 3.5
5676.sp
5677.nf
5678.ft C
5679intersectBed \-a reads.bed \-b genes.bed \-u
5680.ft P
5681.fi
5682.UNINDENT
5683.UNINDENT
5684.sp
56856.1.3 Report those alignments that overlap NO genes. Like "grep \-v"
5686.INDENT 0.0
5687.INDENT 3.5
5688.sp
5689.nf
5690.ft C
5691intersectBed \-a reads.bed \-b genes.bed \-v
5692.ft P
5693.fi
5694.UNINDENT
5695.UNINDENT
5696.sp
56976.1.4 Report the number of genes that each alignment overlaps.
5698.INDENT 0.0
5699.INDENT 3.5
5700.sp
5701.nf
5702.ft C
5703intersectBed \-a reads.bed \-b genes.bed \-c
5704.ft P
5705.fi
5706.UNINDENT
5707.UNINDENT
5708.sp
57096.1.5 Report the entire, original alignment entry for each overlap with a gene.
5710.INDENT 0.0
5711.INDENT 3.5
5712.sp
5713.nf
5714.ft C
5715intersectBed \-a reads.bed \-b genes.bed \-wa
5716.ft P
5717.fi
5718.UNINDENT
5719.UNINDENT
5720.sp
57216.1.6 Report the entire, original gene entry for each overlap with a gene.
5722.INDENT 0.0
5723.INDENT 3.5
5724.sp
5725.nf
5726.ft C
5727intersectBed \-a reads.bed \-b genes.bed \-wb
5728.ft P
5729.fi
5730.UNINDENT
5731.UNINDENT
5732.sp
57336.1.7 Report the entire, original alignment and gene entries for each overlap.
5734.INDENT 0.0
5735.INDENT 3.5
5736.sp
5737.nf
5738.ft C
5739intersectBed \-a reads.bed \-b genes.bed \-wa \-wb
5740.ft P
5741.fi
5742.UNINDENT
5743.UNINDENT
5744.sp
57456.1.8 Only report an overlap with a repeat if it spans at least 50% of the exon.
5746.INDENT 0.0
5747.INDENT 3.5
5748.sp
5749.nf
5750.ft C
5751intersectBed \-a exons.bed \-b repeatMasker.bed \-f 0.50
5752.ft P
5753.fi
5754.UNINDENT
5755.UNINDENT
5756.sp
57576.1.9 Only report an overlap if comprises 50% of the structural variant and 50% of the segmental duplication. Thus, it is reciprocally at least a 50% overlap.
5758.INDENT 0.0
5759.INDENT 3.5
5760.sp
5761.nf
5762.ft C
5763intersectBed \-a SV.bed \-b segmentalDups.bed \-f 0.50 \-r
5764.ft P
5765.fi
5766.UNINDENT
5767.UNINDENT
5768.sp
57696.1.10 Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs.
5770.INDENT 0.0
5771.INDENT 3.5
5772.sp
5773.nf
5774.ft C
5775intersectBed \-a genes.bed \-b LINES.bed | intersectBed \-a stdin \-b SINEs.bed \-v
5776.ft P
5777.fi
5778.UNINDENT
5779.UNINDENT
5780.sp
57816.1.11 Retain only single\-end BAM alignments that overlap exons.
5782.INDENT 0.0
5783.INDENT 3.5
5784.sp
5785.nf
5786.ft C
5787intersectBed \-abam reads.bam \-b exons.bed > reads.touchingExons.bam
5788.ft P
5789.fi
5790.UNINDENT
5791.UNINDENT
5792.sp
57936.1.12 Retain only single\-end BAM alignments that do not overlap simple sequence
5794repeats.
5795.INDENT 0.0
5796.INDENT 3.5
5797.sp
5798.nf
5799.ft C
5800intersectBed \-abam reads.bam \-b SSRs.bed \-v > reads.noSSRs.bam
5801.ft P
5802.fi
5803.UNINDENT
5804.UNINDENT
5805.SS 6.2 pairToBed
5806.sp
58076.2.1 Return all structural variants (in BEDPE format) that overlap with genes on either
5808end.
5809.INDENT 0.0
5810.INDENT 3.5
5811.sp
5812.nf
5813.ft C
5814pairToBed \-a sv.bedpe \-b genes > sv.genes
5815.ft P
5816.fi
5817.UNINDENT
5818.UNINDENT
5819.sp
58206.2.2 Return all structural variants (in BEDPE format) that overlap with genes on both
5821end.
5822.INDENT 0.0
5823.INDENT 3.5
5824.sp
5825.nf
5826.ft C
5827pairToBed \-a sv.bedpe \-b genes \-type both > sv.genes
5828.ft P
5829.fi
5830.UNINDENT
5831.UNINDENT
5832.sp
58336.2.3 Retain only paired\-end BAM alignments where neither end overlaps simple
5834sequence repeats.
5835.INDENT 0.0
5836.INDENT 3.5
5837.sp
5838.nf
5839.ft C
5840pairToBed \-abam reads.bam \-b SSRs.bed \-type neither > reads.noSSRs.bam
5841.ft P
5842.fi
5843.UNINDENT
5844.UNINDENT
5845.sp
58466.2.4 Retain only paired\-end BAM alignments where both ends overlap segmental
5847duplications.
5848.INDENT 0.0
5849.INDENT 3.5
5850.sp
5851.nf
5852.ft C
5853pairToBed \-abam reads.bam \-b segdups.bed \-type both > reads.SSRs.bam
5854.ft P
5855.fi
5856.UNINDENT
5857.UNINDENT
5858.sp
58596.2.5 Retain only paired\-end BAM alignments where neither or one and only one end
5860overlaps segmental duplications.
5861.INDENT 0.0
5862.INDENT 3.5
5863.sp
5864.nf
5865.ft C
5866pairToBed \-abam reads.bam \-b segdups.bed \-type notboth > reads.notbothSSRs.bam
5867.ft P
5868.fi
5869.UNINDENT
5870.UNINDENT
5871.SS 6.3 pairToPair
5872.sp
58736.3.1 Find all SVs (in BEDPE format) in sample 1 that are also in sample 2.
5874.INDENT 0.0
5875.INDENT 3.5
5876.sp
5877.nf
5878.ft C
5879pairToPair \-a 1.sv.bedpe \-b 2.sv.bedpe | cut \-f 1\-10 > 1.sv.in2.bedpe
5880.ft P
5881.fi
5882.UNINDENT
5883.UNINDENT
5884.sp
58856.3.2 Find all SVs (in BEDPE format) in sample 1 that are not in sample 2.
5886.INDENT 0.0
5887.INDENT 3.5
5888.sp
5889.nf
5890.ft C
5891pairToPair \-a 1.sv.bedpe \-b 2.sv.bedpe \-type neither | cut \-f 1\-10 >
5892.ft P
5893.fi
5894.UNINDENT
5895.UNINDENT
5896.sp
58971.sv.notin2.bedpe
5898.SS 6.4 bamToBed
5899.sp
59006.4.1 Convert BAM alignments to BED format.
5901.INDENT 0.0
5902.INDENT 3.5
5903.sp
5904.nf
5905.ft C
5906bamToBed \-i reads.bam > reads.bed
5907.ft P
5908.fi
5909.UNINDENT
5910.UNINDENT
5911.sp
59126.4.2 Convert BAM alignments to BED format using the BAM edit distance (NM) as the
5913BED "score".
5914.INDENT 0.0
5915.INDENT 3.5
5916.sp
5917.nf
5918.ft C
5919bamToBed \-i reads.bam \-ed > reads.bed
5920.ft P
5921.fi
5922.UNINDENT
5923.UNINDENT
5924.sp
59256.4.3 Convert BAM alignments to BEDPE format.
5926.INDENT 0.0
5927.INDENT 3.5
5928.sp
5929.nf
5930.ft C
5931bamToBed \-i reads.bam \-bedpe > reads.bedpe
5932.ft P
5933.fi
5934.UNINDENT
5935.UNINDENT
5936.SS 6.5 windowBed
5937.sp
59386.5.1 Report all genes that are within 10000 bp upstream or downstream of CNVs.
5939.INDENT 0.0
5940.INDENT 3.5
5941.sp
5942.nf
5943.ft C
5944windowBed \-a CNVs.bed \-b genes.bed \-w 10000
5945.ft P
5946.fi
5947.UNINDENT
5948.UNINDENT
5949.sp
59506.5.2 Report all genes that are within 10000 bp upstream or 5000 bp downstream of
5951CNVs.
5952.INDENT 0.0
5953.INDENT 3.5
5954.sp
5955.nf
5956.ft C
5957windowBed \-a CNVs.bed \-b genes.bed \-l 10000 \-r 5000
5958.ft P
5959.fi
5960.UNINDENT
5961.UNINDENT
5962.sp
59636.5.3 Report all SNPs that are within 5000 bp upstream or 1000 bp downstream of genes.
5964Define upstream and downstream based on strand.
5965.INDENT 0.0
5966.INDENT 3.5
5967.sp
5968.nf
5969.ft C
5970windowBed \-a genes.bed \-b snps.bed \-l 5000 \-r 1000 \-sw
5971.ft P
5972.fi
5973.UNINDENT
5974.UNINDENT
5975.SS 6.6 closestBed
5976.sp
5977Note: By default, if there is a tie for closest, all ties will be reported. \fBclosestBed\fP allows overlapping
5978features to be the closest.
5979.sp
59806.6.1 Find the closest ALU to each gene.
5981.INDENT 0.0
5982.INDENT 3.5
5983.sp
5984.nf
5985.ft C
5986closestBed \-a genes.bed \-b ALUs.bed
5987.ft P
5988.fi
5989.UNINDENT
5990.UNINDENT
5991.sp
59926.6.2 Find the closest ALU to each gene, choosing the first ALU in the file if there is a
5993tie.
5994.INDENT 0.0
5995.INDENT 3.5
5996.sp
5997.nf
5998.ft C
5999closestBed \-a genes.bed \-b ALUs.bed \-t first
6000.ft P
6001.fi
6002.UNINDENT
6003.UNINDENT
6004.sp
60056.6.3 Find the closest ALU to each gene, choosing the last ALU in the file if there is a
6006tie.
6007.INDENT 0.0
6008.INDENT 3.5
6009.sp
6010.nf
6011.ft C
6012closestBed \-a genes.bed \-b ALUs.bed \-t last
6013.ft P
6014.fi
6015.UNINDENT
6016.UNINDENT
6017.SS 6.7 subtractBed
6018.sp
6019Note: If a feature in A is entirely "spanned" by any feature in B, it will not be reported.
6020.sp
60216.7.1 Remove introns from gene features. Exons will (should) be reported.
6022.INDENT 0.0
6023.INDENT 3.5
6024.sp
6025.nf
6026.ft C
6027subtractBed \-a genes.bed \-b introns.bed
6028.ft P
6029.fi
6030.UNINDENT
6031.UNINDENT
6032.SS 6.8 mergeBed
6033.sp
60346.8.1 Merge overlapping repetitive elements into a single entry.
6035.INDENT 0.0
6036.INDENT 3.5
6037.sp
6038.nf
6039.ft C
6040mergeBed \-i repeatMasker.bed
6041.ft P
6042.fi
6043.UNINDENT
6044.UNINDENT
6045.sp
60466.8.2 Merge overlapping repetitive elements into a single entry, returning the number of
6047entries merged.
6048.INDENT 0.0
6049.INDENT 3.5
6050.sp
6051.nf
6052.ft C
6053mergeBed \-i repeatMasker.bed \-n
6054.ft P
6055.fi
6056.UNINDENT
6057.UNINDENT
6058.sp
60596.8.3 Merge nearby (within 1000 bp) repetitive elements into a single entry.
6060.INDENT 0.0
6061.INDENT 3.5
6062.sp
6063.nf
6064.ft C
6065mergeBed \-i repeatMasker.bed \-d 1000
6066.ft P
6067.fi
6068.UNINDENT
6069.UNINDENT
6070.SS 6.9 coverageBed
6071.sp
60726.9.1 Compute the coverage of aligned sequences on 10 kilobase "windows" spanning the
6073genome.
6074.INDENT 0.0
6075.INDENT 3.5
6076.sp
6077.nf
6078.ft C
6079coverageBed \-a reads.bed \-b windows10kb.bed | head
6080chr1 0     10000 0  10000 0.00
6081chr1 10001 20000 33 10000 0.21
6082chr1 20001 30000 42 10000 0.29
6083chr1 30001 40000 71 10000 0.36
6084.ft P
6085.fi
6086.UNINDENT
6087.UNINDENT
6088.sp
60896.9.2 Compute the coverage of aligned sequences on 10 kilobase "windows" spanning the
6090genome and created a BEDGRAPH of the number of aligned reads in each window for
6091display on the UCSC browser.
6092.INDENT 0.0
6093.INDENT 3.5
6094.sp
6095.nf
6096.ft C
6097coverageBed \-a reads.bed \-b windows10kb.bed | cut \-f 1\-4 > windows10kb.cov.bedg
6098.ft P
6099.fi
6100.UNINDENT
6101.UNINDENT
6102.sp
61036.9.3 Compute the coverage of aligned sequences on 10 kilobase "windows" spanning the
6104genome and created a BEDGRAPH of the fraction of each window covered by at least
6105one aligned read for display on the UCSC browser.
6106.INDENT 0.0
6107.INDENT 3.5
6108.sp
6109.nf
6110.ft C
6111coverageBed \-a reads.bed \-b windows10kb.bed | awk ??{OFS="\et"; print $1,$2,$3,$6}??
6112> windows10kb.pctcov.bedg
6113.ft P
6114.fi
6115.UNINDENT
6116.UNINDENT
6117.SS 6.10 complementBed
6118.sp
61196.10.1 Report all intervals in the human genome that are not covered by repetitive
6120elements.
6121.INDENT 0.0
6122.INDENT 3.5
6123.sp
6124.nf
6125.ft C
6126complementBed \-i repeatMasker.bed \-g hg18.genome
6127.ft P
6128.fi
6129.UNINDENT
6130.UNINDENT
6131.SS 6.11 shuffleBed
6132.sp
61336.11.1 Randomly place all discovered variants in the genome. However, prevent them
6134from being placed in know genome gaps.
6135.INDENT 0.0
6136.INDENT 3.5
6137.sp
6138.nf
6139.ft C
6140shuffleBed \-i variants.bed \-g hg18.genome \-excl genome_gaps.bed
6141.ft P
6142.fi
6143.UNINDENT
6144.UNINDENT
6145.sp
61466.11.2 Randomly place all discovered variants in the genome. However, prevent them
6147from being placed in know genome gaps and require that the variants be randomly
6148placed on the same chromosome.
6149.INDENT 0.0
6150.INDENT 3.5
6151.sp
6152.nf
6153.ft C
6154shuffleBed \-i variants.bed \-g hg18.genome \-excl genome_gaps.bed \-chrom
6155.ft P
6156.fi
6157.UNINDENT
6158.UNINDENT
6159.SH ADVANCED USAGE
6160.SS 7.1 Mask all regions in a genome except for targeted capture regions.
6161.sp
6162# Add 500 bp up and downstream of each probe
6163.INDENT 0.0
6164.INDENT 3.5
6165.sp
6166.nf
6167.ft C
6168slopBed \-i probes.bed \-b 500 > probes.500bp.bed
6169.ft P
6170.fi
6171.UNINDENT
6172.UNINDENT
6173.sp
6174# Get a BED file of all regions not covered by the probes (+500 bp up/down)
6175.INDENT 0.0
6176.INDENT 3.5
6177.sp
6178.nf
6179.ft C
6180complementBed \-i probes.500bp.bed \-g hg18.genome > probes.500bp.complement.bed
6181.ft P
6182.fi
6183.UNINDENT
6184.UNINDENT
6185.sp
6186# Create a masked genome where all bases are masked except for the probes +500bp
6187.INDENT 0.0
6188.INDENT 3.5
6189.sp
6190.nf
6191.ft C
6192maskFastaFromBed \-in hg18.fa \-bed probes.500bp.complement.bed \-fo hg18.probecomplement.
6193masked.fa
6194.ft P
6195.fi
6196.UNINDENT
6197.UNINDENT
6198.SS 7.2 Screening for novel SNPs.
6199.sp
6200# Find all SNPs that are not in dbSnp and not in the latest 1000 genomes calls
6201.INDENT 0.0
6202.INDENT 3.5
6203.sp
6204.nf
6205.ft C
6206intersectBed \-a snp.calls.bed \-b dbSnp.bed \-v | intersectBed \-a stdin \-b 1KG.bed
6207\-v > snp.calls.novel.bed
6208.ft P
6209.fi
6210.UNINDENT
6211.UNINDENT
6212.sp
6213you can first use intersectBed with the "\-f 1.0" option.
6214.INDENT 0.0
6215.INDENT 3.5
6216.sp
6217.nf
6218.ft C
6219intersectBed \-a features.bed \-b windows.bed \-f 1.0 | coverageBed \-a stdin \-b
6220windows.bed > windows.bed.coverage
6221.ft P
6222.fi
6223.UNINDENT
6224.UNINDENT
6225.SS 7.4 Computing the coverage of BAM alignments on exons.
6226.sp
6227# One can combine SAMtools with BEDtools to compute coverage directly from the BAM
6228data by using bamToBed.
6229.INDENT 0.0
6230.INDENT 3.5
6231.sp
6232.nf
6233.ft C
6234bamToBed \-i reads.bam | coverageBed \-a stdin \-b exons.bed > exons.bed.coverage
6235.ft P
6236.fi
6237.UNINDENT
6238.UNINDENT
6239.sp
6240# Take it a step further and require that coverage be from properly\-paired reads.
6241.INDENT 0.0
6242.INDENT 3.5
6243.sp
6244.nf
6245.ft C
6246samtools view \-bf 0x2 reads.bam | bamToBed \-i stdin | coverageBed \-a stdin \-b
6247exons.bed > exons.bed.proper.coverage
6248.ft P
6249.fi
6250.UNINDENT
6251.UNINDENT
6252.SS 7.5 Computing coverage separately for each strand.
6253.sp
6254# Use grep to only look at forward strand features (i.e. those that end in "+").
6255.INDENT 0.0
6256.INDENT 3.5
6257.sp
6258.nf
6259.ft C
6260bamToBed \-i reads.bam | grep \e+$ | coverageBed \-a stdin \-b genes.bed >
6261genes.bed.forward.coverage
6262.ft P
6263.fi
6264.UNINDENT
6265.UNINDENT
6266.sp
6267# Use grep to only look at reverse strand features (i.e. those that end in "\-").
6268.INDENT 0.0
6269.INDENT 3.5
6270.sp
6271.nf
6272.ft C
6273bamToBed \-i reads.bam | grep \e\-$ | coverageBed \-a stdin \-b genes.bed >
6274genes.bed.forward.coverage
6275.ft P
6276.fi
6277.UNINDENT
6278.UNINDENT
6279.SS 7.6 Find structural variant calls that are private to one sample.
6280.sp
6281# :
6282.INDENT 0.0
6283.INDENT 3.5
6284.sp
6285.nf
6286.ft C
6287pairToPair \-a sample1.sv.bedpe \-b othersamples.sv.bedpe \-type neither >
6288sample1.sv.private.bedpe
6289.ft P
6290.fi
6291.UNINDENT
6292.UNINDENT
6293.SS 7.7 Exclude SV deletions that appear to be ALU insertions in the reference genome.
6294.sp
6295# We\(aqll require that 90% of the inner span of the deletion be overlapped by a
6296recent ALU.
6297.INDENT 0.0
6298.INDENT 3.5
6299.sp
6300.nf
6301.ft C
6302pairToBed \-a deletions.sv.bedpe \-b ALUs.recent.bed \-type notispan \-f 0.80 >
6303deletions.notALUsinRef.bedpe
6304.ft P
6305.fi
6306.UNINDENT
6307.UNINDENT
6308.sp
6309Refer to the mailing list.
6310.SH AUTHOR
6311UVa
6312.SH COPYRIGHT
63132012
6314.\" Generated by docutils manpage writer.
6315.