Young Lab

Control of Developmental Regulators by Polycomb in Human Embryonic Stem Cells


Microarray Design

We used three types of microarray for location analysis experiments:
» Whole Genome Array
» Promoter Array
» Transcription Factor Array

Whole Genome Array

We designed a set of 115 60-mer oligonucleotide arrays to cover the non-repeat masked region of the sequenced human genome. Arrays were produced by Agilent Technologies (www.agilent.com).

Selection of regions and design of subsequences
We tiled the genome with variable density: transcription units (defined below) were tiled with higher density and non-transcription regions were tiled with a slightly lower density.

To define transcription units, we first selected transcripts from five different databases: RefSeq, Ensembl, MGC, VEGA (www.vega.sanger.ac.uk) and Broad (www.broad.mit.edu). The first three are commonly used databases for gene annotation; the last two are manually annotated databases covering subsets of the human genome, from the Sanger Institute and the Broad Institute, respectively. We also added all microRNAs from the Rfam database (Griffiths-Jones et al., 2003) and a small, manually selected set of non-coding RNAs.

The entire collection of transcripts was sorted by chromosomal order. We then extended each transcript 10 kb upstream to capture proximal promoter regions. Each of these extended transcripts was considered a “transcription unit”. In cases where one or more transcription units overlapped, we merged the transcription units into a single, larger unit. We extracted DNA sequence for all transcription units. Separately, we extracted intervening genomic DNA (“intergenic units”) between transcription units. All sequences and coordinates are from the May 2004 build of the human genome (NCBI build 35), using the repeatmasked (-s) option.
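The extension and merging steps above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes half-open coordinates on a single chromosome, assumes that "upstream" is strand-aware, and merges units that abut as well as those that overlap. The function names are hypothetical.

```python
from typing import List, Tuple

UPSTREAM_BP = 10_000  # promoter extension stated in the text

def extend_upstream(start: int, end: int, strand: str) -> Tuple[int, int]:
    """Extend a transcript 10 kb upstream (assumed to depend on strand)."""
    if strand == "+":
        return max(0, start - UPSTREAM_BP), end
    return start, end + UPSTREAM_BP

def merge_units(units: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (or abutting) units into larger transcription units."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(units):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intergenic_units(merged: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the intervening intervals between consecutive merged units."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(merged, merged[1:])]

# Hypothetical transcripts: (start, end, strand)
transcripts = [(50_000, 60_000, "+"), (55_000, 70_000, "-"), (200_000, 210_000, "+")]
units = merge_units([extend_upstream(*t) for t in transcripts])
gaps = intergenic_units(units)
```

In this toy input, the first two transcripts overlap after extension and collapse into one unit, leaving a single intergenic unit before the third.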

We then separated sequences into subsequences in order to efficiently process sequences for oligo selection. We first removed all unmasked regions 100 bp or smaller. The small size of these regions makes it more difficult to identify high quality oligos for use on the array. These small regions represented a small fraction of the genome and were often covered by neighboring probes designed against larger subsequences. For unmasked regions that were 101 to 300 bp long, we treated each strand (Watson and Crick) as a separate subsequence. This ensured that we would have two oligos to represent these subsequences if the region could not be covered by neighboring 60-mers. For regions that were 301 to 640 bp long, we divided the region into two, evenly sized subsequences. Unmasked regions greater than 640 bp were divided into evenly sized subsequences such that no individual subsequence was greater than 320 bp.
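The size rules above can be expressed as a small function. This is a sketch under assumptions: coordinates are half-open, the function name is hypothetical, and regions longer than 300 bp are labeled with the "+" strand only as a placeholder, since the text does not say which strand was used for them.

```python
import math
from typing import List, Tuple

def split_unmasked_region(start: int, end: int) -> List[Tuple[int, int, str]]:
    """Split one unmasked region into subsequences following the stated size rules."""
    length = end - start
    if length <= 100:
        return []  # too short to yield high quality oligos
    if length <= 300:
        # Each strand is treated as a separate subsequence.
        return [(start, end, "+"), (start, end, "-")]
    # 301-640 bp: two even pieces; longer: enough even pieces to stay <= 320 bp.
    n_parts = 2 if length <= 640 else math.ceil(length / 320)
    # Integer division keeps piece sizes within 1 bp of each other.
    bounds = [start + (i * length) // n_parts for i in range(n_parts + 1)]
    return [(bounds[i], bounds[i + 1], "+") for i in range(n_parts)]

pieces = split_unmasked_region(0, 1000)
```

For a 1,000 bp region this gives four 250 bp subsequences, the smallest number of even pieces that keeps each at or below 320 bp.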

We used the program ArrayOligoSelector (AOS) (Bozdech et al., 2003) to score 60-mers for use on the array, but modified the oligo selection process. We had two primary reasons for this. First, AOS uses a relative quality scale in selecting oligos. For any particular subsequence, it generates scores based on four parameters to evaluate each 60-mer in the subsequence and looks for the best oligos within that set, ignoring the absolute quality of the oligo. As a result, lower quality oligos can be selected. Second, AOS does not have a parameter to set the distance between oligos. Consequently, resolution is largely set by defining subsequence size but is still subject to highly variable placement within each subsequence. For instance, if the desired tiling density is 300 bp, we would select subsequences 300 bp long. For any two adjacent subsequences, probes could be separated by as little as 0 bp (both probes placed next to the shared subsequence border) or as much as 480 bp (both probes placed at opposite subsequence ends).
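The 0 bp and 480 bp figures in this example can be reproduced with a short calculation. The sketch below assumes that separation is measured from the end of the first 60-mer to the start of the second, and that each probe must lie entirely within its own 300 bp subsequence.

```python
PROBE_LEN = 60    # oligo length used on the arrays
SUBSEQ_LEN = 300  # subsequence length in the example

# All start positions at which a 60-mer fits inside each of two adjacent subsequences.
first_starts = range(0, SUBSEQ_LEN - PROBE_LEN + 1)
second_starts = range(SUBSEQ_LEN, 2 * SUBSEQ_LEN - PROBE_LEN + 1)

# Separation = start of the second probe minus end of the first probe.
gaps = [s2 - (s1 + PROBE_LEN) for s1 in first_starts for s2 in second_starts]
min_gap, max_gap = min(gaps), max(gaps)
```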

To avoid selecting lower quality oligos, we ran AOS to derive scores for every 60-mer in all subsequences and then eliminated oligos based on these scores. AOS uses a scoring system for four criteria: GC content, self-binding, complexity and uniqueness. We selected the following ranges for each parameter: GC content between 30 percent and 100 percent, self-binding score less than 100, complexity score less than or equal to 24, uniqueness greater than or equal to –40.
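The absolute thresholds above amount to a simple filter over scored 60-mers. The sketch below does not call AOS; it assumes the four scores have already been parsed into a record, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ScoredOligo:
    """One 60-mer with its four scores (hypothetical record layout)."""
    start: int
    gc_percent: float
    self_binding: float
    complexity: float
    uniqueness: float

def is_qualified(oligo: ScoredOligo) -> bool:
    """Apply the absolute cutoffs stated in the text."""
    return (
        30 <= oligo.gc_percent <= 100
        and oligo.self_binding < 100
        and oligo.complexity <= 24
        and oligo.uniqueness >= -40
    )

# Made-up scores for illustration only.
candidates: List[ScoredOligo] = [
    ScoredOligo(0, 45.0, 20.0, 10.0, -5.0),    # passes every cutoff
    ScoredOligo(10, 25.0, 20.0, 10.0, -5.0),   # GC content too low
    ScoredOligo(20, 45.0, 100.0, 10.0, -5.0),  # self-binding not below 100
    ScoredOligo(30, 45.0, 20.0, 24.0, -40.0),  # exactly on the inclusive limits
]
qualified = [o for o in candidates if is_qualified(o)]
```

Filtering on absolute scores, rather than taking the best oligo per subsequence, means a subsequence may end up with no qualified probe at all.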

To achieve more uniform tiling, we instituted a method to find probes within a particular distance from each other for the transcription unit subsequences. We sorted all qualified probes into chromosomal order and identified gaps in the genomic sequence that were not covered by one or more 60-mers. These gaps typically represented regions that were repeat masked or that generated consistently low quality oligos. For our purposes, gaps longer than 640 bp represented potential dead zones or “borders”. Based on empirical experience with genome-wide location analysis technology, we conservatively estimated that we would not identify binding events that occurred more than 320 bp away from the genomic location of any particular probe. As a result, a gap longer than 640 bp likely contained one or more base pairs that would not be detected even if we used the closest qualified oligos as probes. Using these borders, we split the set of all probes into “packages” containing all qualified probes between two borders.
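The border rule follows from the 320 bp detection estimate: a gap longer than 2 × 320 = 640 bp must contain at least one base that is more than 320 bp from the probes on either side. A minimal sketch of splitting probes into packages is shown below; it assumes a gap is measured from the end of one qualified 60-mer to the start of the next, and the function name is hypothetical.

```python
from typing import List

PROBE_LEN = 60
MAX_GAP = 2 * 320  # 640 bp: twice the estimated detection distance

def split_into_packages(starts: List[int]) -> List[List[int]]:
    """Group qualified probe start positions into packages separated by borders."""
    packages: List[List[int]] = []
    current: List[int] = []
    for start in sorted(starts):
        # Uncovered stretch between the previous probe's end and this probe's start.
        if current and start - (current[-1] + PROBE_LEN) > MAX_GAP:
            packages.append(current)
            current = []
        current.append(start)
    if current:
        packages.append(current)
    return packages

packages = split_into_packages([0, 10, 200, 901, 950, 1700])
```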

For packages up to 300 bp long, we designed two probes where possible, one from each strand (Watson and Crick). This resulted in two different probes in the region, compensating for those instances where a small region would be isolated by two borders from the nearest, potentially informative, neighboring probe. For packages 301 bp or longer, we selected the first qualified probe in the package (lowest chromosomal coordinate), then selected the next qualified probe that was between 150 bp and 280 bp away. If there were multiple eligible probes, we chose the most distal probe within the 280 bp limit. If there were no probes within this limit, we continued scanning until we found the next acceptable probe. The process was then repeated with the most recently selected probe. If the most recently selected probe was within 250 bp of the next border, we automatically selected the qualified probe closest to the next border. This ensured that we were selecting probes as close to the ends of packages as possible.
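The selection walk for longer packages can be sketched as a greedy loop. Several details are assumptions made for illustration: distances are measured between probe start positions, the "next border" is approximated by the last qualified probe in the package, and "the next acceptable probe" is taken to be the first qualified probe more than 280 bp away. The two-strand rule for packages up to 300 bp is not covered here.

```python
from typing import List

MIN_STEP, MAX_STEP, BORDER_DIST = 150, 280, 250  # distances stated in the text

def select_probes(starts: List[int]) -> List[int]:
    """Greedy selection of probes within one package of qualified probe starts."""
    starts = sorted(starts)
    last = starts[-1]
    selected = [starts[0]]
    while selected[-1] != last:
        cur = selected[-1]
        if last - cur <= BORDER_DIST:
            # Close to the border: take the probe nearest the package end and stop.
            selected.append(last)
            break
        window = [s for s in starts if MIN_STEP <= s - cur <= MAX_STEP]
        if window:
            selected.append(window[-1])  # most distal probe within the limit
        else:
            # No probe in the window: keep scanning to the next one beyond it.
            # Non-empty here because last - cur > MAX_STEP in this branch.
            selected.append(next(s for s in starts if s - cur > MAX_STEP))
    return selected

dense = select_probes(list(range(0, 1001, 50)))
sparse = select_probes([0, 100, 400, 420, 900])
```

With qualified probes every 50 bp, the walk settles on 250 bp spacing; with sparse probes it falls back to whatever lies beyond the window.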

For intergenic unit tiling, we generated subsequences and identified borders and packages as described for genic tiling. We divided packages into evenly sized segments where the maximum segment size was 480 bp. We then selected the qualified probe closest to the midpoint of each segment.
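A minimal sketch of the intergenic selection step is given below. It assumes that a probe's position is its center (start + 30 bp), that the closest probe is chosen from the whole package rather than only from within the segment, and that a probe chosen for two neighboring segments is kept once. The function name and arguments are hypothetical.

```python
import math
from typing import List

PROBE_LEN = 60
MAX_SEGMENT = 480  # maximum segment size stated in the text

def select_intergenic(starts: List[int], pkg_start: int, pkg_end: int) -> List[int]:
    """Pick the qualified probe closest to the midpoint of each even segment."""
    length = pkg_end - pkg_start
    n_segments = max(1, math.ceil(length / MAX_SEGMENT))
    chosen: List[int] = []
    for i in range(n_segments):
        midpoint = pkg_start + (i + 0.5) * length / n_segments
        best = min(starts, key=lambda s: abs(s + PROBE_LEN / 2 - midpoint))
        if best not in chosen:  # avoid selecting the same probe twice
            chosen.append(best)
    return chosen

# Hypothetical package spanning 0-960 bp with a qualified probe every 25 bp.
picked = select_intergenic(list(range(0, 901, 25)), 0, 960)
```

Here the package divides into two 480 bp segments with midpoints at 240 and 720, so one probe is selected near each.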

All probes from both transcription unit and intergenic unit tiling were combined and grouped by chromosome and sorted by position.


Compiled Probes and Controls
The design process described above led to the production of a set of 115 Agilent microarrays containing a total of 4,652,484 features. Each array contains 40,457 features except for array #115, which contains 40,386 features. The probes are arranged such that array 1 begins with the left arm of chromosome 1, array 2 picks up where array 1 ends, array 3 picks up where array 2 ends, and so on. There are some gaps in coverage that reflect our inability to identify high quality unique 60-mers: these tend to be unsequenced regions, highly repetitive regions that are not repeat masked (such as telomeres or gene families) and certain regions that are probably genome duplications. We estimate that only 10% of the total, non-repeat masked region is not covered by probes. As an estimate of probe density, 95% of all 60-mers are within 450 bp of another 60-mer; 80% of all 60-mers are within 350 bp of another 60-mer.

We added several sets of control probes (1,500 total) to the array designs. On each array, there are 40 oligos designed against five Arabidopsis thaliana genes that are printed in triplicate, and thus available for use with spike-in controls. These Arabidopsis oligos were BLASTed against the human genome and do not register any significant hits. Since E2F4 chromatin immunoprecipitations can be accomplished with a wide range of cell types and have provided a convenient positive control for ChIP-Chip experiments (for putative regulators where no prior knowledge of targets exists, for example), we added a total of 80 oligos representing four proximal promoter regions of genes that are known targets of the transcriptional regulator E2F4 (NM_001211, NM_002907, NM_031423, NM_001237). Each of the four promoters is represented by 20 different oligos that are evenly positioned across the region from 3 kb upstream to 2 kb downstream of the transcription start site. We also included a control probe set that provides a means to normalize intensities across multiple slides throughout the entire signal range. There are 384 oligos printed as intensity controls; based on test hybridizations, this set of oligos gives signal intensities that cover the entire dynamic range of the array. Twenty additional intensity controls, representing the entire range of intensities, were selected and printed fifteen times each for an additional 300 control features. We also incorporated 616 “gene desert” controls. To design these probes, we identified intergenic regions of 1 Mb or greater and designed probes in the middle of these regions. These are intended to identify genomic regions that are least likely to be bound by promoter-binding transcriptional regulators (by virtue of their extreme distance from any known gene). We have used these as normalization controls in situations where a factor binds to a large number of promoter regions.
In addition to these 1,500 controls, there are 2,256 controls added by Agilent (standard) and 77 blank spots.
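The totals quoted in this section can be cross-checked with simple arithmetic. The short sketch below uses only the counts given in the text; the variable names are illustrative.

```python
# Feature count across the whole genome set: 114 full arrays plus array #115.
N_ARRAYS = 115
total_features = (N_ARRAYS - 1) * 40_457 + 40_386

# Control features added to each array design, as itemized in the text.
controls = {
    "arabidopsis": 40 * 3,           # 40 oligos printed in triplicate
    "e2f4_promoters": 4 * 20,        # four promoters, 20 oligos each
    "intensity": 384,
    "intensity_replicated": 20 * 15, # 20 oligos printed fifteen times each
    "gene_desert": 616,
}
total_controls = sum(controls.values())
```

Both sums agree with the figures stated above: 4,652,484 features in total and 1,500 added controls.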


Promoter Array

A single oligonucleotide array covering 18,002 unique transcription start sites (-800 bp to +200 bp) was produced to further confirm Suz12 results from the whole genome set and to identify H3K27 methylation at promoters. The oligonucleotides were selected from a previously designed 10-slide promoter array (Boyer et al., 2005). The array also included series of oligonucleotides selected from the whole genome arrays that tile several developmental transcription factors and other gene clusters. Tiled regions derived from the whole genome design included HoxA (Chr7:26753403-27181969), HoxB (Chr17:43800473-44332824), HoxC (Chr12:52430195-52980924), HoxD (Chr2:176600271-177020015), IL1-beta gene cluster (Chr2:113150361-113850210), IGF2 (Chr11:1596557-2370046), β-globin (Chr11:4699307-5710863), Interleukin gene cluster (Chr5:131285278-132332015), APO gene cluster (Chr11:116001474-116507932), MEIS1 (Chr2:66433939-66850408), MEIS2 (Chr15:34785111-35408790), MEIS3 (Chr19:52500385-52753118), MLL1 (Chr11:117677663-118110213), PBX1 (Chr1:161166894-161690455), PBX2 (Chr6:32239846-32303475), PBX3 (Chr9:125450610-125940831), and PBX4 (Chr19:19465384-19622058). Controls for the single slide array have already been described and include the gene desert, intensity, Arabidopsis, and Agilent controls.


Transcription Factor Array

This array was designed to cover regions between -5kb and +5kb relative to the transcription start sites of 2,288 human genes encoding transcription factors as determined by GO classifications and manual annotation. Probes were designed essentially as described above for the whole genome array although tiling density was slightly improved (1 probe approximately every 250 bp). There are a total of 2,079 control spots on the transcription factor array. The 40 Arabidopsis oligos and 80 E2F4 oligos described above for the whole genome design are each printed once. A total of 404 intensity controls are printed twice. A total of 1,085 "gene desert" controls (described above in the whole genome design) are each printed once. The intensity controls and "gene desert" controls are expanded sets of the controls described above for the whole genome design.

Array Designs



YOUNG LAB, Whitehead Institute, 9 Cambridge Center, Cambridge, MA 02142 [T] 617.258.5218 [F] 617.258.0376