MCS
The Millennium Cohort Study (MCS), known as ‘Child of the New Century’ to cohort members and their families, is following the lives of around 19,000 young people born across England, Scotland, Wales and Northern Ireland in 2000-02. The study began with an original sample of 18,818 cohort members. Cohort members were genotyped at age 14.
Data availability
Data type | Array / Imputation panel | Number of samples | Coverage |
---|---|---|---|
Genetic (non QCd) | GSA Array v1 | 21,159 | 618,540 genetic variants |
Imputed (QCd) | TOPMed | 20,247 | 8,720,874 genetic variants |
Whole exome sequencing | TWIST | 14,753 | 1,916,636 sites |
Genetic QC
Genotyping of 21,556 samples (21,418 individuals) was performed on the Infinium Global Screening Array-24 v1.0. For details of sample collection, DNA extraction methods and laboratory procedures, see Fitzsimons et al. (2022). Genotype calling was performed using GenomeStudio (v2.0, Illumina), and quality control was completed using PLINK1.9 and PLINK2.0 (Chang et al., 2015). All 21,556 samples were successfully read into GenomeStudio, and data for 618,540 variants were written to a final manifest file and to PLINK ped format using the PLINK genotyping module. Samples that could no longer be included in the study (e.g. because consent had been withdrawn) were removed prior to QC (338 samples excluded). Individuals were excluded if (i) they had > 2% missing data (677 samples excluded), (ii) they had excess heterozygosity [> 3 standard deviations (SD) from the mean] (78 samples excluded), or (iii) their sex predicted from X chromosome homozygosity was discordant with their reported sex (females with an F value > 0.2 and males with an F value < 0.8) and the discordance could not be rectified using family relationships (86 samples excluded).
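The sample-level exclusion rules above can be illustrated with a minimal Python sketch. The function names and input structures are hypothetical and chosen for illustration only; the QC itself was run in PLINK, and this sketch simply mirrors the stated thresholds (> 2% missingness, > 3 SD heterozygosity, F value cut-offs of 0.2 and 0.8).

```python
from statistics import mean, stdev


def high_missingness(missing_rate, max_missing=0.02):
    """True when a sample's proportion of missing genotypes exceeds the threshold."""
    return missing_rate > max_missing


def heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return IDs whose heterozygosity rate is more than n_sd SDs from the mean.

    het_rates: dict mapping a (hypothetical) sample ID to its heterozygosity rate.
    """
    values = list(het_rates.values())
    mu, sd = mean(values), stdev(values)
    return {sid for sid, h in het_rates.items() if abs(h - mu) > n_sd * sd}


def sex_discordant(reported_sex, f_stat, female_max_f=0.2, male_min_f=0.8):
    """True when the X chromosome F value contradicts the reported sex.

    reported_sex: 'F' or 'M'. Females are flagged when F > 0.2,
    males when F < 0.8, matching the thresholds quoted in the text.
    """
    if reported_sex == "F":
        return f_stat > female_max_f
    if reported_sex == "M":
        return f_stat < male_min_f
    raise ValueError("reported_sex must be 'F' or 'M'")
```

In this sketch a discordant sample is only flagged; the text notes that some discordances were rectified using family relationships before any exclusion.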
We used KING to create a list of parent and family IDs to update, excluding duplicate samples (retaining the sample with the higher genotyping rate). We updated the family IDs and then the parent IDs. Samples where parents and children were mixed up, and where this could not be rectified, were removed (26 samples excluded). We also identified samples that had been merged into larger families containing multiple mothers and fathers. No samples were removed, but we have created a flag in the basic demographics file indicating where one trio from each family has been selected (212 individuals can be excluded based on this flag). It is worth noting that KING highlighted some inbreeding within this cohort, where a cohort member's (CM's) parents are related to one another (e.g. to the third or fourth degree). Additionally, there is relatedness between CMs and other members of the cohort (e.g. cousins, MZ/DZ twins and siblings). This will need to be accounted for in analysis. For final validation, we ran the PLINK Mendelian error check, which checks that the transmission of every variant within each trio is possible.
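To illustrate what a Mendelian error check tests, the following Python sketch checks whether a child's genotype at an autosomal, biallelic variant could have arisen from the parents' genotypes. This is not PLINK's implementation: the function names are hypothetical, genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2), and sex chromosomes are not handled.

```python
def possible_gametes(genotype):
    """Alleles (0 = ref, 1 = alt) that a parent with this alt-allele count can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def mendel_consistent(child, mother, father):
    """True if the child's alt-allele count can be formed from one maternal
    and one paternal allele (autosomal, biallelic variant)."""
    return any(
        m + f == child
        for m in possible_gametes(mother)
        for f in possible_gametes(father)
    )


def count_mendel_errors(child_gts, mother_gts, father_gts):
    """Count variants in one trio whose transmission is impossible.

    Variants with a missing genotype (None) in any trio member are skipped.
    """
    errors = 0
    for c, m, f in zip(child_gts, mother_gts, father_gts):
        if None in (c, m, f):
            continue
        if not mendel_consistent(c, m, f):
            errors += 1
    return errors
```

For example, a child who is homozygous for the reference allele (0) cannot have a mother who is homozygous for the alternate allele (2), so `mendel_consistent(0, 2, 0)` is `False`.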
Prior to imputation, SNPs with high levels of missing data (> 3%), a Hardy-Weinberg equilibrium P < 1e-6 (based on a subset of unrelated European samples) or a minor allele frequency < 1% were excluded. The cleaned data were checked against the HRC reference panel r1.1 site list (Haplotype Reference Consortium, 2016) for strand issues using the Perl script HRC-1000G-check-bim.pl, v4.3.0 (Rayner, 2020), from W. Rayner's McCarthy Group Tools. We used the HRC panel for this step because the data were on genome build GRCh37 and the samples were mostly European.
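The pre-imputation variant filters can be sketched in Python as follows. The function names are hypothetical, genotypes are assumed to be coded as alternate-allele counts with `None` for missing calls, and the Hardy-Weinberg P value is assumed to have been computed elsewhere (for example by PLINK on the unrelated European subset); the sketch only applies the thresholds quoted above.

```python
def variant_stats(genotypes):
    """Return (missing_rate, minor_allele_frequency) for one variant.

    genotypes: list of alt-allele counts (0, 1, 2), with None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1 - alt_freq)


def passes_variant_qc(missing_rate, maf, hwe_p,
                      max_missing=0.03, min_maf=0.01, min_hwe_p=1e-6):
    """True when a variant survives all three filters:
    missingness <= 3%, MAF >= 1% and Hardy-Weinberg P >= 1e-6."""
    return (
        missing_rate <= max_missing
        and maf >= min_maf
        and hwe_p >= min_hwe_p
    )
```

A variant is kept only if it passes all three filters, so failing any single threshold is enough to exclude it.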
The genetic data were then recoded as VCF files and uploaded to the TOPMed Imputation Server, which uses Eagle2 to phase haplotypes and Minimac4 (https://genome.sph.umich.edu/wiki/Minimac4) to impute genotypes against the TOPMed reference panel. The genome build was updated to hg38 using LiftOver, which is implemented within the TOPMed server. Imputed genotypes were then filtered with PLINK2.0alpha, excluding SNPs with an R2 INFO score < 0.8, and recoded in binary PLINK format. Samples with > 2% missing values and SNPs with > 3% missing values, more than two alleles (using --max-alleles 2) or a minor allele frequency < 1% were excluded (indels have not been excluded). Duplicate samples were removed, retaining the sample with the higher genotyping rate (83 individuals excluded). We identified European samples using the GenoPred pipeline, which involves (i) merging the MCS genotypes with data from 1000 Genomes Phase 3, (ii) linkage disequilibrium pruning the overlapping single nucleotide polymorphisms (SNPs) such that no pair of SNPs within 1000 bp had r2 > 0.20 and (iii) using an elastic net model to establish which of the superpopulations the samples fall into (Africans [AFR], Admixed Americans [AMR], East Asians [EAS], Europeans [EUR] and South Asians [SAS]). Although each sample is assigned to a superpopulation, there are some ancestral outliers within these groups (e.g. > 4 SD from the mean on the principal components [PCs]), which researchers may want to remove. We have not excluded non-European samples or outliers, but have included a column in the basic demographics file that indicates them, which researchers can use to limit their samples to Europeans (N = 17,105). A further 10 individuals have been removed due to withdrawal of consent.
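Researchers who want to remove ancestral outliers themselves could apply a rule like the one sketched below. This is an illustration under stated assumptions, not the GenoPred implementation: the function name and the input structure (a dictionary of PC scores for the samples assigned to one superpopulation) are hypothetical, and the 4 SD cut-off is the example threshold mentioned above.

```python
from statistics import mean, stdev


def pc_outliers(pcs, n_sd=4.0):
    """Return IDs of samples lying more than n_sd SDs from the group mean on any PC.

    pcs: dict mapping a (hypothetical) sample ID to a list of PC scores,
    all lists having the same length and all samples belonging to the
    same superpopulation.
    """
    ids = list(pcs)
    n_pcs = len(pcs[ids[0]])
    flagged = set()
    for k in range(n_pcs):
        column = [pcs[s][k] for s in ids]
        mu, sd = mean(column), stdev(column)
        if sd == 0:
            continue  # no variation on this PC, so nothing can be an outlier
        flagged.update(s for s in ids if abs(pcs[s][k] - mu) > n_sd * sd)
    return flagged
```

The function would be run separately for each superpopulation, so that means and SDs are computed within the group rather than across the whole cohort.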
The final dataset consists of 20,247 samples and 8,720,874 genetic variants (genome build: hg38). The data are provided in PLINK binary format.
Number of mothers, fathers and children after QC:
Category | Count |
---|---|
Mother [M] | 7,781 |
Father [F] | 4,635 |
Child/ cohort member [C] | 7,841 |
Mother-cohort member duos | 6,431 |
Father-cohort member duos | 3,804 |
Mother-father-cohort member trios | 3,119 |
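Counts of this kind can be derived from the parent IDs stored in a PLINK .fam file, as in the minimal Python sketch below. It assumes the standard six-column .fam layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype, with "0" for a missing parent) and counts a parent only if that parent is also present in the file. The function name is hypothetical, and treating children in trios as also contributing to both duo counts is an assumption about how the table above was compiled.

```python
def count_family_structures(fam_lines):
    """Count mother-child duos, father-child duos and complete trios.

    fam_lines: iterable of PLINK .fam rows
    (FID IID PAT MAT SEX PHENO, whitespace separated).
    """
    rows = [line.split() for line in fam_lines if line.strip()]
    present = {(r[0], r[1]) for r in rows}
    counts = {"mother_duos": 0, "father_duos": 0, "trios": 0}
    for fid, iid, pat, mat, *_ in rows:
        # A parent counts only if genotyped, i.e. listed in the same file.
        has_father = pat != "0" and (fid, pat) in present
        has_mother = mat != "0" and (fid, mat) in present
        counts["mother_duos"] += has_mother
        counts["father_duos"] += has_father
        counts["trios"] += has_mother and has_father
    return counts
```

Passing the lines of an open .fam file, for example `count_family_structures(open(path))` with a path of your choosing, returns a dictionary with the three counts.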
Genetic data (non-QCd)
These data have not been QCd; however, where sample swaps were identified, they have been rectified in this dataset. The final dataset consists of 21,159 samples and 618,540 genetic variants (genome build: hg19/GRCh37). The data are provided in PLINK binary format unless requested otherwise.
Whole exome sequencing (WES) data
The WES data were generated and QCd by Sanger. 14,791 individuals from MCS, including 7,807 children and 6,975 of their parents, were exome-sequenced using TWIST capture baits (Twist Custom Panel: Core exome plus Broad panel; Twist Design ID: NGSTECustom_0001418) and Illumina NovaSeq S4 100PE, to an average depth of ~68X. Samples with a VerifyBamID score > 0.05 were excluded because of possible contamination. Reads were mapped to GRCh38 with BWA-MEM, and SNV and indel calling was then conducted with GATK HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs (GATK version 4.2.4.0), following GATK best practices. Hail v0.2.105 was used to conduct sample, variant and genotype QC, as described below.