Control of Developmental Regulators by Polycomb in Human Embryonic Stem Cells
Microarray Design
We used three types of microarray for location analysis experiments:
» Whole Genome Array
» Promoter Array
» Transcription Factor Array
Whole Genome Array
We designed a set of 115 60-mer oligonucleotide arrays to cover
the non-repeat masked region of the sequenced human genome. Arrays
were produced by Agilent Technologies (www.agilent.com).
Selection of regions and design of subsequences
We tiled the genome with variable density: transcription units (defined
below) were tiled with higher density and non-transcription regions
were tiled with a slightly lower density.
To define transcription units, we first selected transcripts from
five different databases: RefSeq, Ensembl, MGC, VEGA (www.vega.sanger.ac.uk)
and Broad (www.broad.mit.edu). The first three are commonly used
databases for gene annotation; the last two are manually annotated
databases covering subsets of the human genome from the Sanger Institute
and Broad Institute, respectively. We also added all microRNAs from
the Rfam database (Griffiths-Jones et al., 2003) and a small set
of manually selected non-coding RNAs.
The entire collection of transcripts was sorted by chromosomal
order. We then extended each transcript 10 kb upstream to capture
proximal promoter regions. Each of these extended transcripts was
considered a “transcription unit”. In cases where two
or more transcription units overlapped, we merged the transcription
units into a single, larger unit. We extracted DNA sequence for
all transcription units. Separately, we extracted intervening genomic
DNA (“intergenic units”) between transcription units.
All sequences and coordinates are from the May 2004 build of the
human genome (NCBI build 35), using the repeatmasked (-s) option.
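The construction of transcription units and intergenic units described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Transcript` type, the half-open coordinate convention, and the strand-aware reading of "upstream" are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple

UPSTREAM_BP = 10_000  # upstream extension stated in the text

class Transcript(NamedTuple):  # hypothetical record type
    chrom: str
    start: int   # lowest coordinate (half-open interval, an assumption)
    end: int     # highest coordinate
    strand: str  # "+" or "-"

Interval = Tuple[str, int, int]

def extend_upstream(t: Transcript) -> Interval:
    """Add 10 kb on the 5' side (strand-aware; assumed interpretation)."""
    if t.strand == "+":
        return (t.chrom, max(0, t.start - UPSTREAM_BP), t.end)
    return (t.chrom, t.start, t.end + UPSTREAM_BP)

def build_transcription_units(transcripts: List[Transcript]) -> List[Interval]:
    """Extend each transcript, sort by position, and merge overlaps."""
    units: List[Interval] = []
    for chrom, start, end in sorted(extend_upstream(t) for t in transcripts):
        if units and units[-1][0] == chrom and start <= units[-1][2]:
            prev = units[-1]
            units[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            units.append((chrom, start, end))
    return units

def intergenic_units(units: List[Interval]) -> List[Interval]:
    """Return the intervening intervals between consecutive units."""
    gaps: List[Interval] = []
    for (c1, _, e1), (c2, s2, _) in zip(units, units[1:]):
        if c1 == c2 and s2 > e1:
            gaps.append((c1, e1, s2))
    return gaps
```

Regions before the first unit and after the last unit on each chromosome are not handled in this sketch.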
We then separated sequences into subsequences in order to efficiently
process sequences for oligo selection. We first removed all unmasked
regions 100 bp or smaller. The small size of these regions makes
it more difficult to identify high quality oligos for use on the
array. These small regions represented a small fraction of the genome
and were often covered by neighboring probes designed against larger
subsequences. For unmasked regions that were 101 to 300 bp long,
we treated each strand (Watson and Crick) as a separate subsequence.
This ensured that we would have two oligos to represent these subsequences
if the region could not be covered by neighboring 60-mers. For regions
that were 301 to 640 bp long, we divided the region into two, evenly
sized subsequences. Unmasked regions greater than 640 bp were divided
into evenly sized subsequences such that no individual subsequence
was greater than 320 bp.
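The size rules above can be expressed compactly in code. The following is a minimal sketch under stated assumptions: regions are half-open `(start, end)` intervals, the function name `split_region` is hypothetical, and the strand is reported as `None` where the text does not specify one.

```python
import math
from typing import List, Optional, Tuple

Subsequence = Tuple[int, int, Optional[str]]  # (start, end, strand)

def split_region(start: int, end: int) -> List[Subsequence]:
    """Split one unmasked region according to the length rules in the text."""
    length = end - start
    if length <= 100:
        # Regions of 100 bp or smaller are removed.
        return []
    if length <= 300:
        # 101 to 300 bp: each strand is treated as a separate subsequence.
        return [(start, end, "+"), (start, end, "-")]
    # 301 to 640 bp: two even pieces; longer: enough pieces that none exceeds 320 bp.
    n = 2 if length <= 640 else math.ceil(length / 320)
    bounds = [start + (i * length) // n for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1], None) for i in range(n)]
```

Integer division keeps the piece sizes within 1 bp of each other, so a 641 bp region yields three pieces of 213, 214, and 214 bp.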
We used the program ArrayOligoSelector (AOS) (Bozdech et al., 2003)
to score 60-mers for use on the array, but modified the oligo selection
process. We had two primary reasons for this. First, AOS uses a
relative quality scale in selecting oligos. For any particular subsequence,
it generates scores based on four parameters to evaluate each 60-mer
in the subsequence and looks for the best oligos within that set,
ignoring the absolute quality of the oligo. As a result, lower quality
oligos can be selected. Second, AOS does not have a parameter to
set distance between oligos. Consequently, resolution is largely
set by defining subsequence size but is still subject to highly
variable placement within each subsequence. For instance, if the
desired tiling density is 300 bp, we would select subsequences 300
bp long. For any two adjacent subsequences, probes could be separated
by as little as 0 bp (both probes placed next to the shared subsequence
border) or as much as 480 bp (both probes placed at opposite subsequence
ends).
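The 0 bp and 480 bp figures follow from simple arithmetic on two adjacent 300 bp subsequences and 60-mer probes. A small sketch, with a hypothetical function name, makes the calculation explicit:

```python
PROBE_LEN = 60  # oligo length used on the arrays

def separation_range(subseq_len: int, probe_len: int = PROBE_LEN) -> tuple:
    """Return (min_gap, max_gap) in bp between one probe in each of two
    adjacent subsequences of equal length."""
    # Closest case: both probes abut the shared border.
    min_gap = 0
    # Farthest case: each probe sits at the outer end of its subsequence,
    # so the gap is the combined length minus the two probes.
    max_gap = 2 * subseq_len - 2 * probe_len
    return min_gap, max_gap

print(separation_range(300))  # (0, 480)
```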
To avoid selecting lower quality oligos, we ran AOS to derive scores
for every 60-mer in all subsequences and then eliminated oligos based
on these scores. AOS uses a scoring system for four criteria: GC
content, self-binding, complexity and uniqueness. We selected the
following ranges for each parameter: GC content between 30 percent
and 100 percent, self-binding score less than 100, complexity score
less than or equal to 24, uniqueness greater than or equal to –40.
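The absolute thresholds above amount to a simple filter over scored oligos. The sketch below assumes a hypothetical `ScoredOligo` record; it does not reproduce the actual AOS output format, which is not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredOligo:  # hypothetical container for one scored 60-mer
    chrom: str
    start: int
    gc_percent: float
    self_binding: float
    complexity: float
    uniqueness: float

def is_qualified(o: ScoredOligo) -> bool:
    """Apply the four absolute cutoffs listed in the text."""
    return (
        30 <= o.gc_percent <= 100
        and o.self_binding < 100
        and o.complexity <= 24
        and o.uniqueness >= -40
    )

def qualified_oligos(oligos):
    """Keep only oligos that pass every cutoff."""
    return [o for o in oligos if is_qualified(o)]
```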
To achieve more uniform tiling, we instituted a method to find
probes within a particular distance from each other for the transcription
unit subsequences. We sorted all qualified probes into chromosomal
order and identified gaps in the genomic sequence that were not
covered by one or more 60-mers. These gaps typically represented
regions that were repeat masked or that yielded only
low-quality oligos. For our purposes, gaps longer than
640 bp represented potential dead zones or “borders”.
Based on empirical experience with genome-wide location analysis
technology, we conservatively estimated that we would not identify
binding events that occurred more than 320 bp away from the genomic location
of any particular probe. As a result, gaps longer than 640 bp
likely contained one or more base pairs
that would not be detected even if we used the closest qualified
oligos as probes. Using these borders, we split the set of all probes
into “packages” containing all qualified probes between
two borders.
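Splitting qualified probes into packages can be sketched as a single pass over sorted probe positions. This is an illustration only: it assumes probes on one chromosome are given by their start coordinates and that a gap is measured from the end of one 60-mer to the start of the next.

```python
from typing import List

PROBE_LEN = 60   # oligo length
MAX_GAP = 640    # gaps longer than this are treated as borders

def split_into_packages(starts: List[int]) -> List[List[int]]:
    """Group sorted probe starts into packages separated by borders."""
    packages: List[List[int]] = []
    current: List[int] = []
    prev_end = None
    for s in sorted(starts):
        # An uncovered stretch longer than MAX_GAP closes the current package.
        if prev_end is not None and s - prev_end > MAX_GAP:
            packages.append(current)
            current = []
        current.append(s)
        prev_end = s + PROBE_LEN
    if current:
        packages.append(current)
    return packages
```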
For packages up to 300 bp long, we designed two probes where possible,
one from each strand (Watson and Crick). This resulted in two different
probes in the region, compensating for those instances where a small
region would be found isolated by two borders from the nearest,
potentially informative, neighboring probe. For packages longer
than 300 bp, we selected the first qualified probe in the package
(lowest chromosomal coordinate), then selected the next qualified
probe that was between 150 bp and 280 bp away. If there were multiple
eligible probes, we chose the most distal probe within the 280 bp
limit. If there were no probes within this limit, we continued scanning
until we found the next acceptable probe. The process was then repeated
with the most recently selected probe. If the most recently selected
probe was within 250 bp of the next border, we automatically selected
the qualified probe closest to the next border. This ensured that
we were selecting probes as close to the ends of packages as possible.
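The selection procedure for packages longer than 300 bp can be sketched as a greedy walk. Several details are assumptions made for illustration: distances are measured between probe start coordinates, and the "next border" is approximated by the last qualified probe in the package.

```python
from typing import List

MIN_STEP = 150     # closest acceptable next probe
MAX_STEP = 280     # preferred upper limit for the next probe
BORDER_ZONE = 250  # within this distance of the border, jump to the last probe

def select_probes(package: List[int]) -> List[int]:
    """Greedily pick probes from a sorted list of qualified probe starts."""
    if not package:
        return []
    selected = [package[0]]
    last = package[-1]
    while selected[-1] != last:
        cur = selected[-1]
        if last - cur <= BORDER_ZONE:
            # Close to the border: take the probe nearest the package end.
            selected.append(last)
            break
        in_window = [p for p in package if MIN_STEP <= p - cur <= MAX_STEP]
        if in_window:
            # Several eligible probes: take the most distal one.
            selected.append(in_window[-1])
        else:
            # Nothing within the limit: continue scanning to the next probe.
            beyond = [p for p in package if p - cur > MAX_STEP]
            if not beyond:
                break
            selected.append(beyond[0])
    return selected
```

With qualified probes every 50 bp from 0 to 1000, the walk selects positions 0, 250, 500, 750, and 1000.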
For intergenic unit tiling, we generated subsequences and identified
borders and packages as described for transcription unit tiling. We divided packages
into evenly sized segments where the maximum segment size was 480
bp. We then selected the qualified probe closest to the midpoint
of each segment.
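The intergenic selection rule can be sketched in the same style. The package extent (first probe start to last probe end) and the use of probe centers when measuring distance to a segment midpoint are assumptions made for this example.

```python
import math
from typing import List

PROBE_LEN = 60
MAX_SEGMENT = 480  # maximum segment size stated in the text

def select_intergenic(package: List[int]) -> List[int]:
    """Pick, for each evenly sized segment, the probe nearest its midpoint."""
    if not package:
        return []
    lo, hi = package[0], package[-1] + PROBE_LEN
    span = hi - lo
    n_segments = max(1, math.ceil(span / MAX_SEGMENT))
    chosen: List[int] = []
    for i in range(n_segments):
        midpoint = lo + (i + 0.5) * span / n_segments
        best = min(package, key=lambda p: abs(p + PROBE_LEN / 2 - midpoint))
        if best not in chosen:  # avoid selecting the same probe twice
            chosen.append(best)
    return chosen
```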
All probes from both transcription unit and intergenic unit tiling
were combined and grouped by chromosome and sorted by position.
Compiled Probes and Controls
The design process described above led to the production of a set of 115 Agilent microarrays containing a total of 4,652,484 features. Each array contains 40,457 features except for array #115, which contains 40,386 features. The probes are arranged such that array 1 begins with the left arm of chromosome 1, array 2 picks up where array 1 ends, array 3 picks up where array 2 ends, and so on. There are some gaps in coverage that reflect our inability to identify high quality unique 60-mers: these tend to be unsequenced regions, highly repetitive regions that are not repeat masked (such as telomeres or gene families) and certain regions that are probably genome duplications. We estimate that only 10% of the total, non-repeat masked region is not covered by probes. As an estimate of probe density, 95% of all 60-mers are within 450 bp of another 60-mer; 80% of all 60-mers are within 350 bp of another 60-mer.
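The stated feature total is consistent with the per-array counts, as a quick check shows (114 full arrays plus the smaller array #115):

```python
N_ARRAYS = 115
FEATURES_PER_ARRAY = 40_457   # arrays 1 to 114
FEATURES_LAST_ARRAY = 40_386  # array #115

total_features = (N_ARRAYS - 1) * FEATURES_PER_ARRAY + FEATURES_LAST_ARRAY
print(total_features)  # 4652484
```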
We added several sets of control probes (1,500 total)
to the array designs. On each array, there are 40 oligos designed
against five Arabidopsis thaliana genes that are printed
in triplicate, and thus available for use with spike-in controls.
These Arabidopsis oligos were BLASTed against the human
genome and do not register any significant hits. Since E2F4 chromatin
immunoprecipitations can be accomplished with a wide range of cell
types and have provided a convenient positive control for ChIP-Chip
experiments (for putative regulators where no prior knowledge of
targets exists, for example), we added a total of 80 oligos representing
four proximal promoter regions of genes that are known targets of
the transcriptional regulator E2F4 (NM_001211, NM_002907, NM_031423,
NM_001237). Each of the four promoters is represented by 20 different
oligos that are evenly positioned across the region from 3 kb upstream
to 2 kb downstream of the transcription start site. We also included
a control probe set that provides a means to normalize intensities
across multiple slides throughout the entire signal range. There
are 384 oligos printed as intensity controls; based on test hybridizations,
this set of oligos gives signal intensities that cover the entire
dynamic range of the array. Twenty additional intensity controls,
representing the entire range of intensities, were selected and
printed fifteen times each for an additional 300 control features.
We also incorporated 616 “gene desert” controls. To design these
probes, we identified intergenic regions of 1 Mb or greater and
designed probes in the middle of these regions. These are intended
to identify genomic regions that are least likely to be bound by
promoter-binding transcriptional regulators (by virtue of their
extreme distance from any known gene). We have used these as normalization
controls in situations where a factor binds to a large number of
promoter regions. In addition to these 1,500 controls, there are
2,256 controls added by Agilent (standard) and 77 blank spots.
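The 1,500 designed control features can be tallied from the individual sets described above. The dictionary keys below are descriptive labels chosen for this example, not names from the array design files.

```python
designed_controls = {
    "Arabidopsis oligos (40 printed in triplicate)": 40 * 3,
    "E2F4 promoter oligos (4 promoters x 20 oligos)": 4 * 20,
    "intensity controls": 384,
    "selected intensity controls (20 printed 15 times)": 20 * 15,
    "gene desert controls": 616,
}

total_designed = sum(designed_controls.values())
print(total_designed)  # 1500
```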
Promoter Array
A single oligonucleotide array covering 18,002 unique transcription
start sites (-800 bp to +200 bp) was produced to further confirm Suz12
results from the whole genome set and to identify H3K27 methylation
at promoters. The oligonucleotides were selected from a previously
designed 10-slide promoter array (Boyer et al., 2005). The array
also included series of oligonucleotides selected from the whole
genome arrays that tile several developmental transcription factors
and other gene clusters. Tiled regions derived from the whole genome
design included HoxA (Chr7:26753403-27181969), HoxB (Chr17:43800473-44332824),
HoxC (Chr12:52430195-52980924), HoxD (Chr2:176600271-177020015),
IL1-beta gene cluster (Chr2:113150361-113850210), IGF2 (Chr11:1596557-2370046),
β-globin (Chr11:4699307-5710863), Interleukin gene cluster (Chr5:131285278-132332015),
APO gene cluster (Chr11:116001474-116507932), MEIS1 (Chr2:66433939-66850408),
MEIS2 (Chr15:34785111-35408790), MEIS3 (Chr19:52500385-52753118),
MLL1 (Chr11:117677663-118110213), PBX1 (Chr1:161166894-161690455),
PBX2 (Chr6:32239846-32303475), PBX3 (Chr9:125450610-125940831),
and PBX4 (Chr19:19465384-19622058). Controls for the single-slide
array are described above for the whole genome array and include the gene desert, intensity,
Arabidopsis, and Agilent controls.
Transcription Factor Array
This array was designed to cover regions between -5 kb and +5 kb relative to the
transcription start sites of 2,288 human genes encoding transcription factors as determined
by GO classifications and manual annotation. Probes were designed essentially as described above for the whole genome array, although tiling density was slightly higher (approximately 1 probe every 250 bp). There are a total of 2,079 control spots on the
transcription factor array. The 40 Arabidopsis oligos and 80 E2F4 oligos described above
for the whole genome design are each printed once. A total of 404 intensity controls are
printed twice. A total of 1,085 "gene desert" controls (described above in the whole
genome design) are each printed once. The intensity controls and "gene desert" controls
are expanded sets of the controls described above for the whole genome design.
Array Designs