Introduction

Materials and methods

Variation analysis

http://www.phrap.org/
http://www.sanger.ac.uk/HGP/Chr6/MHC/Xfile

Major indel sequences, appearing as breaks in the cross_match discrepancy lists between two clones from different haplotypes, were extracted and analysed with RepeatMasker to detect the presence of retrotransposable elements.

Gene annotation

http://www.repeatmasker.org
http://www.sanger.ac.uk/HGP/havana/

HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1, HLA-DQB1
http://www.ebi.ac.uk/imgt/hla/

Annotation status of haplotypes

Combination of variation and annotation data

http://www.sanger.ac.uk/Software/formats/GFF/

Distribution of sequenced HLA haplotypes in Europeans

HLA-DRB1, DQB1

Resources

http://www.VEGA.sanger.ac.uk
http://www.ncbi.nlm.nih.gov/SNP
http://www.bacpac.chori.org/
http://www.sanger.ac.uk/HGP/Chr6/MHC/
http://www.das.ensembl.org/das

ens_35_COX_SNP   ens_35_COX_DIP
ens_35_QBL_SNP   ens_35_QBL_DIP
ens_35_SSTO_SNP  ens_35_SSTO_DIP
ens_35_APD_SNP   ens_35_APD_DIP
ens_35_DBB_SNP   ens_35_DBB_DIP
ens_35_MANN_SNP  ens_35_MANN_DIP
ens_35_MCF_SNP   ens_35_MCF_DIP

These can be accessed via the VEGA browser.
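The major-indel step described under Variation analysis (indels appearing as breaks between consecutive matches of two clones) can be illustrated with a minimal sketch. It does not parse real cross_match output: the `Block` record, the function name `find_major_indels` and the `min_len` threshold of 100 bp are assumptions made only for illustration, and the blocks are assumed to be colinear and non-overlapping. The reported position follows the convention stated in the Table 10 caption: the midpoint of the inserted sequence when it is present in the reference (PGF), otherwise the position before the deletion in the reference.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One aligned segment between a reference and a query clone.

    Hypothetical record, not the cross_match output format.
    Coordinates are 1-based and inclusive.
    """
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


class MajorIndel(NamedTuple):
    kind: str      # "insertion_in_reference" or "deletion_in_reference"
    ref_pos: int   # position on the reference, see convention above
    length: int    # size difference between the two unaligned stretches


def find_major_indels(blocks: List[Block], min_len: int = 100) -> List[MajorIndel]:
    """Report breaks between consecutive blocks that imply a large indel.

    min_len is an illustrative threshold, not a value taken from the paper.
    """
    ordered = sorted(blocks, key=lambda b: b.ref_start)
    found = []
    for prev, nxt in zip(ordered, ordered[1:]):
        ref_gap = nxt.ref_start - prev.ref_end - 1   # unaligned reference bases
        qry_gap = nxt.qry_start - prev.qry_end - 1   # unaligned query bases
        extra = ref_gap - qry_gap
        if extra >= min_len:
            # Sequence present only in the reference: report its midpoint.
            midpoint = (prev.ref_end + nxt.ref_start) // 2
            found.append(MajorIndel("insertion_in_reference", midpoint, extra))
        elif -extra >= min_len:
            # Sequence present only in the query: report the base before it.
            found.append(MajorIndel("deletion_in_reference", prev.ref_end, -extra))
    return found


if __name__ == "__main__":
    example = [
        Block(1, 1000, 1, 1000),
        Block(1301, 2000, 1001, 1700),   # 300 bp present only in the reference
        Block(2001, 3000, 2001, 3000),   # 300 bp present only in the query
    ]
    for indel in find_major_indels(example):
        print(indel)
```

In a real workflow the sequence spanning each reported break would then be extracted and passed to RepeatMasker, as the text describes; that step is outside this sketch.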
Results and discussion

Variation analysis

Table 1 Haplotype sequence contig length, number of gaps and HLA allele types

Haplotype  Length (bp)  Gaps  HLA-A         HLA-B        HLA-C         HLA-DQA1     HLA-DQB1     HLA-DRB1
PGF        4754829      0     A*03010101    B*070201     Cw*07020103   DQA1*010201  DQB1*0602    DRB1*150101
COX        4731878      0     A*01010101    B*080101     Cw*070101     DQA1*050101  DQB1*020101  DRB1*030101
QBL        4249272      5     A*260101      B*180101     Cw*050101     DQA1*050101  DQB1*020101  DRB1*030101
APD        4160965      16    A*01010101    –            –             –            –            –
DBB        2330101      28    A*02010101    –            Cw*06020101   DQA1*0201    DQB1*030302  DRB1*070101
MANN       4191014      10    A*290201      B*440301     Cw*160101     DQA1*0201    DQB1*0202    DRB1*070101
MCF        4087413      15    [A*020101]    B*15010101   Cw*030401     DQA1*0303    DQB1*030101  –
SSTO       3704249      22    A*320101      B*44020101   Cw*050101     DQA1*030101  DQB1*030501  DRB1*040301

Sequence length (bp) and number of gaps in each haplotype sequence, together with the HLA gene types obtained by BLAST against the IMGT/HLA database. Dashes or data in square brackets indicate the absence or the partial presence, respectively, of a gene owing to a sequence gap.

Fig. 1 (panels a, b, c; OR2J1)

Table 2 Distribution of substitutions and indels amongst haplotypes

Haplotype  Substitutions  Indels  All
COX        15,967         2,393   18,360
QBL        15,282         2,360   17,642
SSTO       14,982         2,300   17,282
APD        4,230          683     4,913
DBB        14,255         1,975   16,230
MANN       12,102         1,654   13,756
MCF        10,790         1,545   12,335
Overall    37,451         7,093   44,544

Number of variations found by comparing the PGF haplotype sequence with each of the other haplotype sequences in turn.
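As a small worked check of the arithmetic in Table 2, the sketch below recomputes the ALL column and the share of indels per haplotype from the tabulated counts. The dictionary and function names are illustrative only. The final lines compare the summed per-haplotype substitutions with the Overall row; the Overall figure is much smaller, which would be consistent with variants shared between haplotypes being counted once, although the table itself does not state how that row was derived.

```python
# Counts transcribed from Table 2: haplotype -> (substitutions, indels) vs PGF.
TABLE2 = {
    "COX": (15967, 2393),
    "QBL": (15282, 2360),
    "SSTO": (14982, 2300),
    "APD": (4230, 683),
    "DBB": (14255, 1975),
    "MANN": (12102, 1654),
    "MCF": (10790, 1545),
}
OVERALL = (37451, 7093)  # "Overall" row of Table 2


def all_variations(counts):
    """Return substitutions + indels per haplotype (the ALL column)."""
    return {hap: subs + indels for hap, (subs, indels) in counts.items()}


def indel_share(counts):
    """Return the fraction of variations in each haplotype that are indels."""
    return {hap: indels / (subs + indels) for hap, (subs, indels) in counts.items()}


if __name__ == "__main__":
    totals = all_variations(TABLE2)
    shares = indel_share(TABLE2)
    for hap in TABLE2:
        print(f"{hap:5s} all={totals[hap]:6d} indel share={shares[hap]:.3f}")
    summed_subs = sum(subs for subs, _ in TABLE2.values())
    print("sum of per-haplotype substitutions:", summed_subs)
    print("Overall substitutions in Table 2:  ", OVERALL[0])
```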
Table 3 Distribution of substitutions and indels within different sequence regions amongst haplotypes

                               COX             QBL             SSTO            APD             DBB             MANN            MCF
Sequence region     Base pairs S       ID      S       ID      S       ID      S       ID      S       ID      S       ID      S       ID
Coding              247,505    353     8       503     19      380     2       74      0       351     6       401     9       348     2
UTR                 155,960    382     34      438     59      331     35      38      9       326     39      303     35      309     31
Intronic            1,283,472  3,141   571     3,135   590     2,658   505     602     147     2,897   509     2,185   393     2,126   404
Total intragenic    1,686,937  3,876   613     4,076   668     3,369   542     714     156     3,574   554     2,889   437     2,783   437
Pseudogenic         57,223     235     15      226     21      227     19      101     8       191     10      109     6       113     10
Pseudogenic intron  63,108     507     54      220     27      215     18      158     20      258     22      98      13      179     13
Transcript exon     78,092     190     30      207     33      119     22      71      8       136     17      88      16      70      15
Transcript intron   332,705    1,243   197     1,186   216     1,053   155     85      29      1,245   192     1,081   161     268     53
REPEATS:
LINEs               608,429    2,110   221     2,015   240     2,388   255     755     93      2,097   217     2,084   193     1,530   164
SINEs               428,567    1,381   428     1,316   401     1,311   385     346     134     1,229   318     928     241     936     271
Other repeats       487,863    2,605   207     2,518   229     2,514   207     925     56      2,748   199     2,198   177     2,170   169
Total in repeats    1,524,859  6,096   856     5,849   870     6,213   847     2,026   283     6,074   734     5,210   611     4,636   604
Microsatellite      15,185     186     168     95      85      222     198     14      29      60      76      61      71      90      68
All above           3,297,590  12,333  1,933   11,859  1,920   11,418  1,801   3,169   533     11,538  1,605   9,536   1,315   8,139   1,200
Other intergenic    996,720    3,634   460     3,423   440     3,564   499     1,061   150     2,717   370     2,566   339     2,651   345
Total               4,754,829  15,967  2,393   15,282  2,360   14,982  2,300   4,230   683     14,255  1,975   12,102  1,654   10,790  1,545

S substitutions, ID indels

Table 4 Codon variation caused by substitutions in HLA and other gene loci

                        COX                  QBL                  SSTO                 APD                  DBB                  MANN                 MCF
Codon variation         HLA    Other  Total  HLA    Other  Total  HLA    Other  Total  HLA    Other  Total  HLA    Other  Total  HLA    Other  Total  HLA    Other  Total
Synonymous              49     81     130    71     106    177    72     57     129    1      24     25     66     69     135    59     79     138    80     52     132
Non-synonymous, total   125    76     201    184    121    305    164    72     236    19     27     46     120    76     196    144    91     235    147    56     203
  Conservative          68     42     110    102    72     174    92     39     131    11     18     29     67     40     107    77     60     137    82     35     117
  Non-conservative      57     34     91     82     49     131    72     33     105    8      9      17     53     36     89     67     31     98     65     21     86
Total                   174    157    331    255    227    482    236    129    365    20     51     71     186    145    331    203    170    373    227    108    335

HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRA, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1

Gene annotation

ZNF452, ZBTB9

Table 5 Splice-variant statistics for PGF annotation

Type                      No.
Total splice variants     1,267
Coding                    523
Unprocessed_pseudogene    50
Processed_pseudogene      41
Expressed_pseudogene      7
Transcript                271
Putative                  71
Retained_intron           263
Nonsense_mediated_decay   30
Artefact                  11
Total loci                320

Splice variants annotated in the PGF haplotype.

Table 6 Gene annotation statistics for eight MHC haplotypes

Locus type         PGF    COX    QBL    SSTO   APD    DBB    MANN   MCF
Coding             165    159    150    131    82     146    129    150
Transcript         28     28     26     26     19     26     27     22
Putative           18     18     15     15     6      16     12     14
Pseudogenes total  98     95     93     98     59     92     95     75
  Unprocessed      50     48     48     53     36     52     53     42
  Processed        41     42     40     39     19     34     37     28
  Expressed        7      5      5      6      4      6      5      5
Artefact           11     11     10     11     0      0      0      0
Total loci         320    311    294    281    166    281    264    261
Total variants     1,267  1,191  1,155  1,058  568    1,138  960    1,115

VEGA database and browser

OR2J1

Annotation changes

MCCD1 MCCD1P1 MCCD1P2 ZBTB9 C6orf21 LY6G6D LY6G6D LY6G6D C6orf21 HLA-DRB1 HLA-DRBDR52 HLA-DRB3 DR53 HLA-DRB4 HLA-DRB53 HLA-DRB4 HLA-DRB7 HLA-DRB8 DASS-218M11.1 DASS-23B5.1 DASS-23B5.2 DASS-23B5.1 PRKRAP1 FAM8A5P

HLA-V and HLA-P

HLA-V HLA-P HLA-75 HLA-90 HLA-P HLA-DPB2, HLA-J, CYP21A1P, HLA-DRB6, HLA-L PPP1R2P1

RCCX hypervariable region

RP-C4A/B-CYP21-TNXB CYP21A1P TNXA STK19P C4 C4A C4B C4A C4B C4A C4B C4B C4A C4B C4A C4B C4 CYP21A1P TNXA STK19P C4A C6orf205 C6orf205 MICA MICA PPP1R2P1 PPP1R2P1 PSORS1C1 POU5F1 POU5F1 OR2J1 OR2J1

Other annotation differences

HCG4P11 HCG4P8 HCG4P7 HCG4P5 HCG4P3 HLA-X, C6orf215 HCG2P7 HCG8 HCP5P2
HCP5P3 HCP5P6 HCP5P12 HCP5P13 HCP5P14 HCP5P15 HCG8 HCG26

Table 7 Other newly annotated loci

Locus                 Locus type
XXbac-BCX196D17.5     Transcript
XXbac-BPG116M5.14     Putative
XXbac-BPG116M5.15     Putative
XXbac-BPG116M5.16     Putative
XXbac-BPG118E17.9     Putative
XXbac-BPG126D10.10    Processed pseudogene
XXbac-BPG126D10.11    Processed pseudogene
XXbac-BPG13B8.10      Transcript
XXbac-BPG13B8.9       Unprocessed pseudogene
XXbac-BPG154L12.4     Putative
XXbac-BPG181B23.4     Transcript
XXbac-BPG181M17.4     Putative
XXbac-BPG246D15.8     Transcript
XXbac-BPG248L24.10    Unprocessed pseudogene
XXbac-BPG248L24.9     Processed pseudogene
XXbac-BPG249D20.9     Putative
XXbac-BPG250I8.13     Transcript
XXbac-BPG254F23.5     Putative
XXbac-BPG254F23.6     Putative
XXbac-BPG254F23.7     Transcript
XXbac-BPG254F23.7     Putative
XXbac-BPG27H4.7       Transcript
XXbac-BPG27H4.8       Transcript
XXbac-BPG294E21.7     Processed pseudogene
XXbac-BPG296P20.14    Putative
XXbac-BPG296P20.15    Putative
XXbac-BPG299F13.14    Putative
XXbac-BPG308J9.3      Transcript
XXbac-BPG308K3.5      Putative
XXbac-BPG308K3.6      Transcript
XXbac-BPG309N1.15     Unprocessed pseudogene
XXbac-BPG32J3.18      Putative
XXbac-BPG8G10.2       Unprocessed pseudogene
DAQB-12N14.5          Transcript
DAQB-331I12.5         Putative
DAQB-335A13.8         Transcript

Newly annotated loci without HGNC symbols.

Non-canonical splice sites

HLA-DQA1

Table 8 Haplotype variation at splice sites

Gene      Variant  Affected exons  Donor* Acceptor*  dbSNP cluster ID  Best evidence  PGF  QBL  COX  SSTO  DBB  APD  MANN  MCF
TRIM31    2        3/4             ggt g             rs28400887        cDNA           NC   NC   NC   C     ND   NC   NC    C
TRIM31    5        2/3             ggt g             rs28400887        EST            NC   NC   NC   C     ND   NC   NC    C
C4B       7        3/4             ggt cgg           –                 EST            NC   ND   NC   C     NC   ND   ND    ND
C4A       7        3/4             ggt cgg           –                 EST            NC   NC   ND   C     NC   ND   ND    NC
HLA-DQA1  4        4/5             ggt g             rs707947          cDNA           C    C    C    NC    NC   ND   NC    NC
HLA-DQA1  5        4/5             ggt a a           rs3667            cDNA           NC   NC   NC   C     C    ND   C     C
HLA-DRB1  2        2/3             a cag             rs9271083         EST            NC   C    C    C     C    ND   C     ND

Combination of variation and annotation data

HLA-F, HLA-G, HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DQA2, HLA-DQB2

Fig. 2 (legend: orange bar, arrows; dark grey, green, red, black, blue-green, dark red)

Table 9 Variation status of the main coding variant of each gene in the PGF haplotype annotation

Invariable: ABCF1, AGPAT1, AIF1, APOM, ATP6V1G2, B3GALT4, C6orf134, C6orf136 a, C6orf26, C6orf48, CLIC1, CSNK2B, CUTA, CYP21A2, DDAH2, FLOT1, HLA-DPA1, HLA-DRB5, HSD17B8, KIFC1 a, LSM2, LST1, LTB, LY6G5C, LY6G6E, MAS1L, MRPS18B, NCR3, NEU1, NRM, OR2B3, OR2H1, OR2W1, PFDN6, PPP1R10, PRRT1, PSMB8 b, RDBP, RGL2, RPS18, SLC39A7, STK19, TNF, TUBB, ZFP57

Synonymous variation only: BAT1 a, BAT5, C2, CREBL1, DAXX, DDR1 a, GNL1, GPSM3, GTF2H4, HLA-DOA a, HSPA1B, LY6G6C, MSH5, PBX2, POU5F1, PPP1R11, PRR3, RING1, RNF5, RXRB, SYNGAP1, TRIM10, TRIM26, TRIM27, TRIM39 a, VPS52, ZBTB12, ZBTB9, ZNRD1

Non-synonymous variation, conservative: AGER, BRD2 a, BTNL2, C6orf21, C6orf27, CFB, DOM3Z, DPCR1, EGFL8, EHMT2, FKBPL, GABBR1, HLA-DMA, HLA-DOB, HLA-DQB2, HLA-DRA, HSPA1A, LY6G6D, MCCD1, MOG a, OR11A1, OR2H2, OR2J1, OR2J2, OR2J3, PHF1, PSMB9, RPP21, SFTPG, SKIV2L, SLC44A4, TAP2, TRIM15, WDR46, ZBTB22, ZNF311

Non-synonymous variation, non-conservative: BAT2, BAT3, BAT4, C4A, C4B, C6orf10, C6orf100, C6orf15, C6orf205, C6orf25, C6orf47, CCHCR1, CDSN, COL11A2, DHX16, HLA-A, HLA-B, HLA-C, HLA-DMB, HLA-DPB1, HLA-DPB2, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRB1, HLA-E, HLA-F, HLA-G, HSPA1L, IER3, KIAA1949, LTA, LY6G5B, MDC1, MICA, MICB, NFKBIL1, NOTCH4, OR10C1, OR12D2, OR12D3, OR5U1, OR5V1, PPT2, PSORS1C1, PSORS1C2, RNF39, TAP1, TAPBP, TCF19, TNXB, TRIM31, TRIM40, UBD, VARS, VARSL

LY6G6E, HLA-DPB2, C4A, C4B, HLA-DRB5
a BAT1, BRD2, DDR1, C6orf136, HLA-DOA, MOG, KIFC1, TRIM39
b PSMB8

Table 10 Major indels in the form of retrotransposable elements

Chr6 pos'n  Flanking loci       PGF   COX   QBL   SSTO  APD   DBB   MANN  MCF   Details
29002370    TRIM27:C6orf100     C     C     C     C     ?     ?     C     C     Complex region (A)
29440424    OR5V1:OR12D3        ✓     ✓     ?     ✓     ?     ?     X     X     AluYa5
29784097    C6orf40:HCP5P15     ✓     X     ✓     ✓     ?     X     X     X     AluYa5/8 175..304
29788451    within HCP5P15      X     X     ✓     X     ?     ✓     ✓     X     AluYa5/8 176..310
29794763    HCP5P15:HLA-F       ✓     X     X     ✓     ?     X     X     X     SVA_E plus simple rpt.s
29922942    HLA-G:MICF          ✓     X     ✓     ✓     ✓     ✓     ✓     ✓     L1ME3B 5940..6165
29954495    MICF:HLA-H          ✓     X     X     X     X     X     X     X     HERVK9 inserted in MER9
30008633    HLA-K:HLA-21        ✓     X     X     ✓     X     X     ✓     ?     SVA E/F plus simple rpt.
30106475    HCG8:ETF1P1         X     ✓     X     X     ✓     ✓     X     X     AluYb8
30547387    SUCLA2P:RANP1       X     X     X     ✓     ?     X     X     ?     AluJb 1..283 and parts of MLT1D/L1PBa
31079582    C6orf205:HCG22      X     X     ✓     X     X     X     ?     X     AluYb8 37..297
31117638    C6orf205:HCG22      ✓     X     X     ✓     ✓     X     ✓     ✓     AluY (whole & part) and MER63 1017..1062
31301931    HCG27:HLA-C         ✓     ✓     X     ✓     ?     ✓     ✓     ✓     HERV3 part (6489...7339)
31320352    HCG27:HLA-C         ✓     X     X     X     ?     X     X     X     SVA_F 349..850 plus GC rich rpt.
31358220    RPL3P2:WASF5P       X     X     ✓     X     ?     X     X     X     AluY 35..306
31400900    WASF5P:HLA-B        ✓     ✓     ✓     ✓     ?     X     X     X     AluSp plus L1PREC2 part (3205...4617)
31405648    WASF5P:HLA-B        ✓     X     ✓     ✓     ?     X     x     x     HERVIP10F (part) and AluSg (only cf CX DB)
31418854    WASF5P:HLA-B        ✓     ✓     ✓     ✓     ?     ✓     ✓     X     L1PA5 part (5503..5876)
31530995    MICA:HCP5           ✓     X     ?     ✓     ?     ?     X     ?     SVA B/F plus simple rpt.s
32421915    within C6orf10      ✓     X     X     ✓     X     X     ✓     X     AluYb8
32486228    BTNL2:HLA-DRA       ✓     ✓     ✓     ✓     ✓     X     X     X     L1P1/L1HS parts
32655545    HLA-DRB1 intron 5   ✓     x     x     X     ?     ✓     ✓     ?     AluYa5 within more or less partial LTR12
32660731    HLA-DRB1 intron 1   X/X   ✓/X   X/X   ✓/✓   ?     ✓/✓   ✓/✓   ?     Tigger4/AluSx
32661119    HLA-DRB1 intron 1   C     C     C     C     ?     C     C     ?     Complex region (B)
32663167    HLA-DRB1 intron 1   X/✓   ✓/✓   ✓/✓   ✓/X   ?     ✓/X   ✓/X   ?     AluSq/AluY
32669534    HLA-DRB1:HLA-DQA1   C     C     C     C     ?     C     C     ?     Complex region (C)
32679461    HLA-DRB1:HLA-DQA1   ✓     X     X     X     ?     X     X     ?     AluY
32693271    HLA-DRB1:HLA-DQA1   ✓     ✓     ✓     ✓     ?     X     ✓     ?     L1PA4 (parts)
32697545    HLA-DRB1:HLA-DQA1   X     X     X     X     ?     ✓     ✓     ?     L1HS 7..6032
32701428    HLA-DRB1:HLA-DQA1   ✓     X     ✓     ✓     ?     x     X     x     L1PA2 part and from CX: MER2B and AluY
32728179    HLA-DQA1:HLA-DQB1   C     C     C     C     ?     C     C     C     Complex region (D)
32739664    within HLA-DQB1     X     X     ✓     X     ?     X     ✓     X     AluY
32743646    HLA-DQB1:MTCO3P1    X     X     X     X     ?     ✓     X     X     LTR13
32746780    HLA-DQB1:MTCO3P1    X     X     X     X     ?     ✓     X     ✓     L1PA4 (parts)
32751442    HLA-DQB1:MTCO3P1    X     X     X     X     ?     X     ✓     X     LTR5_Hs
32753489    HLA-DQB1:MTCO3P1    ✓     ✓     ✓     ✓     ?     X     ✓     X     L1PA10 268..4888 around L1PA4 (part)
32756020    HLA-DQB1:MTCO3P1    X     X     X     X     ?     X     ✓     X     LTR5_Hs
32764047    HLA-DQB1:MTCO3P1    ✓     ✓     ✓     ✓     ?     X     ✓     X     AluSx
32765930    HLA-DQB1:MTCO3P1    X     X     X     X     ?     X     ✓     X     AluYa5
32785062    MTCO3P1:HLA-DQB3    ✓     ✓     ✓     ✓     ?     X     X     X     Tigger4 (Zombi)/L1HS (parts) and T-rich
32795150    MTCO3P1:HLA-DQB3    X     X     X     X     X     ✓     X     ✓     AluY
32796573    MTCO3P1:HLA-DQB3    X     X     X     X     X     ✓     X     ✓     AluY
32815974    HLA-DQB3:HLA-DQA2   X     ✓     X     X     ✓     X     X     X     AluYa5
32857369    HLA-DQB2:HLA-DOB    ✓     X     ✓     ✓     X     X     ✓     X     AluYg6
32881426    HLA-DQB2:HLA-DOB    X     X     ?     ✓     ✓     X     X     ?     AluYa5
32887265    HLA-DQB2:HLA-DOB    ✓     X     ?     X     X     X     ✓     ✓     LTR42 and parts of L1MC5 and AluSc 3..105
33201559    within HLA-DPB2     ✓     X     X     X     ✓     ?     X     ?     AluYb8
33234360    HCG24:COL11A2       ✓     ✓     ✓     ?     ?     ✓     ✓     X     AluY (1..293) AluJb (26..306)

Where there was a break in the cross_match discrepancy list between two clones, the inserted sequence was extracted and analysed with RepeatMasker to assess how many major indels resulted from retrotransposable elements. The chromosome 6 position (NCBI35/36) given for each inserted sequence is its midpoint where the sequence is an insertion in PGF, or the position before the deletion where it is absent from PGF. Flanking loci were retrieved from the annotation. Insertion in a haplotype is indicated by '✓', deletion by 'X' and complex regions by 'C'. Where there is a sequence gap in a haplotype corresponding to the indel, this is shown by '?'. Four complex deletion/insertion events are listed: A, B, C and D; for details, see text.

AluSx AluSg AluY AluSx TRIM27 C6orf100 HLA-DRB1 HLA-DRB1 HLA-DQA1 AluSx/Alu HLA-DQA1 HLA-DQB1 AluSx AluY AluYd2 AluSx AluY AluSg

Representation of haplotypes within European populations

The eight haplotypes analysed in this study were selected on the basis of their association with type 1 diabetes and multiple sclerosis and their high population frequencies. To assess how representative these haplotypes are of SNP haplotypic diversity, we determined their distribution in the haplotypic tree space of the European population.

Fig. 3 (panels a, b; HLA-DRB, HLA-DQB1; DRB1, DQB1; circles, shaded areas)

Conclusion and outlook