Copy-number variation

Copy number variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals in the human population. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. However, note that although modern genomics research is mostly focused on human genomes, copy number variations also occur in a variety of other organisms including E. coli and S. cerevisiae. Recent research indicates that approximately two thirds of the entire human genome is composed of repeats and 4.8–9.5% of the human genome can be classified as copy number variations. In mammals, copy number variations play an important role in generating necessary variation in the population as well as disease phenotype. Copy number variations can be generally categorized into two main groups: short repeats and long repeats. However, there are no clear boundaries between the two groups and the classification depends on the nature of the loci of interest.
Short repeats include mainly dinucleotide repeats (two repeating nucleotides, e.g. A-C-A-C-A-C...) and trinucleotide repeats. Long repeats include repeats of entire genes. Classification by repeat size is the most obvious approach, because size is an important factor in determining which mechanisms most likely gave rise to the repeats and hence the likely effects of these repeats on phenotype. One of the best-known examples of a short copy number variation is the CAG trinucleotide repeat in the Huntingtin gene, the gene responsible for the neurological disorder Huntington's disease. In this particular case, once the CAG trinucleotide repeats more than 36 times, Huntington's disease will likely develop in the individual and will likely be inherited by his or her offspring. The number of CAG repeats is inversely correlated with the age of onset of Huntington's disease: more repeats are associated with earlier onset. These types of short repeats are often thought to be due to errors in polymerase activity during replication, including polymerase slippage, template switching, and fork switching, which will be discussed in detail later. The short repeat unit of these copy number variations makes them prone to polymerase errors: the repeated regions can be misrecognized by the polymerase, and regions that have already been replicated may be replicated again, leading to extra copies of the repeat. In addition, if these trinucleotide repeats are in the same reading frame in the coding portion of a gene, they may lead to a long chain of the same amino acid, possibly creating protein aggregates in the cell; if they fall into the non-coding portion of the gene, they may affect gene expression and regulation. On the other hand, a variable number of repeats of entire genes is less commonly identified in the genome.
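As a rough illustration of the repeat threshold described above for the Huntingtin gene, the following Python sketch finds the longest uninterrupted run of CAG units in a sequence and compares it with the figure of 36 repeats given in the text. The function names and the toy sequences are assumptions made for this example; the sequences are not real Huntingtin sequence, and the sketch is not a diagnostic method.

```python
import re

# Threshold taken from the text above: more than 36 CAG repeats.
REPEAT_THRESHOLD = 36


def longest_cag_run(seq):
    """Return the largest number of consecutive CAG units found in seq."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    # Each match is a whole run of CAG units, so its length is a multiple of 3.
    return max((len(run) // 3 for run in runs), default=0)


def exceeds_threshold(seq, threshold=REPEAT_THRESHOLD):
    """True when the longest CAG run is strictly longer than the threshold."""
    return longest_cag_run(seq) > threshold


# Toy alleles, invented for illustration only.
short_allele = "ATG" + "CAG" * 20 + "CCGCCA"
long_allele = "ATG" + "CAG" * 40 + "CCGCCA"

for name, allele in [("short", short_allele), ("long", long_allele)]:
    print(name, longest_cag_run(allele), exceeds_threshold(allele))
```

Counting only uninterrupted runs is a simplification chosen for brevity; a run broken by a single different codon would be counted here as two shorter runs.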
One example of a whole gene repeat is the alpha-amylase 1 gene (AMY1), which encodes alpha-amylase and shows significant copy number variation between populations with different diets. Although the specific mechanism that allows the AMY1 gene to increase or decrease its copy number is still a topic of debate, some hypotheses suggest that non-homologous end joining or microhomology-mediated end joining is likely responsible for these whole gene repeats. Repeats of entire genes have immediate effects on the expression of that particular gene, and the fact that the copy number variation of the AMY1 gene has been related to diet is a remarkable example of recent human evolutionary adaptation. Although these are the general groups into which copy number variations are placed, the exact number of base pairs that a copy number variation affects depends on the specific loci of interest. Currently, using data from all reported copy number variations, the mean size of a copy number variant is around 118 kb, and the median is around 18 kb.
In terms of the structural architecture of copy number variations, research has suggested and defined hotspot regions in the genome where copy number variations are four times more enriched. These hotspot regions were defined as regions containing long repeats that are 90–100% similar, known as segmental duplications, which may be either tandem or interspersed; most importantly, these hotspot regions have an increased rate of chromosomal rearrangement. It was thought that these large-scale chromosomal rearrangements give rise to normal variation and genetic diseases, including copy number variations. Moreover, these copy number variation hotspots are consistent throughout many populations from different continents, implying that the hotspots were either independently acquired by all the populations and passed on through generations, or acquired in early human evolution before the populations split; the latter seems more likely.
Lastly, there does not seem to be a spatial bias in where copy number variations are most densely distributed in the genome. Although fluorescence in situ hybridization and microsatellite analysis originally indicated that copy number repeats are localized to highly repetitive regions such as telomeres, centromeres, and heterochromatin, recent genome-wide studies have concluded otherwise. Namely, the subtelomeric and pericentromeric regions are where most chromosomal rearrangement hotspots are found, and there is no considerable increase in copy number variations in those regions. Furthermore, these regions of chromosomal rearrangement hotspots do not have decreased gene numbers, again implying that there is minimal spatial bias in the genomic location of copy number variations.
On the basis of cytogenetic observations, copy number variation was initially thought to occupy an extremely small and negligible portion of the genome. Copy number variations were generally associated only with small tandem repeats or specific genetic disorders, and were therefore initially examined only in terms of specific loci. However, breakthroughs in the past decade or so have led to an increasing number of highly accurate ways of identifying and studying copy number variations, one of which is the genome-wide association study, that allow copy number variations in general to be located and identified in the genome. Copy number variations were originally studied by cytogenetic techniques, which allow one to observe the physical structure of the chromosome. One of these techniques is fluorescence in situ hybridization (FISH), which involves introducing fluorescent probes that require a high degree of complementarity in the genome for binding. Comparative genomic hybridization was also commonly used to detect copy number variations by fluorophore visualization and then comparing the length of the chromosomes.
One major drawback of these early techniques is that the genomic resolution is relatively low, so only large repeats such as whole gene repeats can be detected. Recent advances in biotechnology gave rise to many important techniques of extremely high genomic resolution and, as a result, an increasing number of copy number variations in the genome have been reported. One of these advances involves using bacterial artificial chromosome (BAC) arrays with clones spaced at intervals of around 1 megabase throughout the entire genome. BAC arrays can also detect copy number variations in rearrangement hotspots, which allowed the detection of 119 novel copy number variations. Throughout the past decade or so, high-throughput genomic sequencing has revolutionized the field of human genomics, and in silico studies have been performed to detect copy number variations in the genome. Reference sequences have been compared to other sequences of interest using fosmids, with the fosmid clones strictly controlled to be 40 kb. Sequencing the end reads provides adequate information to align the sequence of interest to the reference sequence, and any misalignments are easily noticeable and are thus concluded to be copy number variations within that region of the clone. This type of detection technique offers high genomic resolution and a precise location of the repeat in the genome, and it can also detect other types of structural variation such as inversions. In addition, another way of detecting copy number variation that can ensure high genomic resolution is to use single nucleotide polymorphisms (SNPs). Since the International HapMap Project began, common SNPs that occur in four different populations from different continents have been sequenced and located. Due to the abundance of human SNP data, the direction of copy number variation detection has changed to make use of these SNPs.
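The fosmid end-read comparison described above can be sketched in Python as follows. The idea shown is that a clone of known size (40 kb, as stated in the text) should have its two end reads map about 40 kb apart on the reference and facing each other; a clearly different spacing or orientation marks the clone as discordant. The tolerance value, the function name, and the strand convention are assumptions introduced for this sketch, not parameters from any published pipeline.

```python
EXPECTED_INSERT = 40_000  # fosmid clone size stated in the text (40 kb)
TOLERANCE = 8_000         # hypothetical tolerance, chosen only for this sketch


def classify_clone(left_pos, right_pos, left_strand, right_strand):
    """Classify one clone from where its two end reads map on the reference."""
    # Assumed convention: concordant end reads face each other,
    # '+' for the left read and '-' for the right read.
    if (left_strand, right_strand) != ("+", "-"):
        return "inversion candidate"
    span = right_pos - left_pos
    if span > EXPECTED_INSERT + TOLERANCE:
        # Ends map too far apart: the sample lacks sequence found in the reference.
        return "deletion candidate"
    if span < EXPECTED_INSERT - TOLERANCE:
        # Ends map too close together: the clone carries extra sequence.
        return "insertion candidate"
    return "concordant"


# Toy mapped positions, invented for illustration.
clones = [
    (100_000, 140_500, "+", "-"),
    (100_000, 175_000, "+", "-"),
    (100_000, 120_000, "+", "-"),
    (100_000, 140_000, "+", "+"),
]
for clone in clones:
    print(clone, classify_clone(*clone))
```

In practice several independent clones supporting the same discordance would be required before calling a variant; a single clone is used here only to keep the example short.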
Relying on the fact that human recombination is relatively rare and that many recombination events occur in specific regions of the genome known as recombination hotspots, linkage disequilibrium can be used to identify copy number variations. Efforts have been made to associate copy number variations with specific haplotype SNPs by analyzing linkage disequilibrium; using these associations, one is able to recognize copy number variations in the genome with SNPs as markers. One drawback of this method is that, because the SNPs in the International HapMap are not optimized for detecting copy number variations, the data are biased towards large copy number variations. Next-generation sequencing has also been used recently to detect copy number variations at high genomic resolution. Using whole-genome shotgun sequencing data, assays have been developed to accurately detect and identify regions of duplication. On the other hand, it is very challenging to detect CNVs in targeted sequencing, because it is extremely unlikely that breakpoints will occur inside the small number of regions captured by a gene panel. Thus, soft-clipped reads and discordant reads are unlikely to be found in targeted sequencing. On average, there is about 1 SNP per 800 bp, so over a long enough region, B-allele frequencies (BAF) can be used to detect copy number changes. However, in targeted sequencing, there are not enough heterozygous variants within a short region to detect deviations from the expected 50% BAF. Lastly, high-resolution microarrays that have copy number probes as well as SNP probes are the gold standard for detecting copy number changes down to 50 kb with whole-genome coverage.
Accurately detecting, identifying, and categorizing copy number variations is extremely important because of the complications they bring to DNA sequencing.
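The B-allele frequency reasoning above can be made concrete with a short Python sketch. At a heterozygous SNP in a two-copy region, about half of the reads are expected to carry the B allele (BAF near 0.5); in a three-copy region the expectation shifts towards roughly 1/3 or 2/3. The sketch computes the mean deviation from 0.5 across sites and the approximate number of SNP positions in a region, using the figure of 1 SNP per 800 bp from the text. The function names and read counts are invented for illustration.

```python
BALANCED_BAF = 0.5  # expected BAF at a heterozygous site with two copies
BP_PER_SNP = 800    # average SNP spacing quoted in the text


def expected_snp_count(region_bp, bp_per_snp=BP_PER_SNP):
    """Approximate number of SNP positions expected in a region of region_bp."""
    return region_bp // bp_per_snp


def mean_baf_deviation(sites):
    """Mean |BAF - 0.5| over (b_allele_reads, total_reads) pairs.

    Sites without any reads are skipped; returns None if no site is usable.
    """
    deviations = [abs(b / total - BALANCED_BAF) for b, total in sites if total > 0]
    if not deviations:
        return None
    return sum(deviations) / len(deviations)


# Toy read counts at heterozygous sites (invented for illustration).
balanced_sites = [(15, 30), (14, 30), (16, 30)]  # consistent with two copies
skewed_sites = [(20, 30), (10, 30), (21, 30)]    # consistent with three copies

print(expected_snp_count(100_000), expected_snp_count(2_000))
print(mean_baf_deviation(balanced_sites), mean_baf_deviation(skewed_sites))
```

The two region sizes illustrate the point made in the text: a 100 kb region offers on the order of a hundred SNP positions, of which only some will be heterozygous, whereas a 2 kb target captured by a gene panel offers only a couple, which is too few to average out sampling noise.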
Traditionally, DNA sequencing relied heavily on sequencing short reads from a large genome and using any overlapping regions of the reads to combine the short reads into longer sequences. These are eventually mapped together to give the sequence of the entire genome. However, the issues related to copy number variations arise in linking the overlapping regions together. By definition, a copy number variation is a region of the genome duplicated a variable number of times in the population. Because the number of times such a region is duplicated varies so widely, it becomes unclear when mapping overlapping sequences whether a shared region is a true overlap or a duplicated region. With all the challenges faced by sequencing in detecting copy number variations, high-resolution microarrays are the technology of choice.
There are two main types of molecular mechanism for the formation of copy number variations: homologous based and non-homologous based. Although many suggestions have been put forward, most of these theories are speculation and conjecture. There is no conclusive evidence that correlates a specific copy number variation to a specific mechanism. One of the best-recognized theories for the origin of copy number variations, as well as deletions and inversions, is non-allelic homologous recombination. During meiotic recombination, homologous chromosomes pair up and form two-ended double-stranded breaks leading to Holliday junctions. However, in the aberrant mechanism, the chromosomes are misaligned during the formation of Holliday junctions and the crossover lands between similar sequences at non-allelic positions. When the Holliday junction is resolved, the unequal crossing over event transfers genetic material between the two homologous chromosomes and, as a result, a portion of the DNA is duplicated on one homologue and lost from the other.
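A toy Python model may help make the unequal crossing over step concrete. Each chromosome is represented as a list of labelled segments in which "R" marks two copies of a similar repeat flanking a segment "B". Pairing the first repeat copy on one homologue with the second copy on the other and exchanging everything downstream of the crossover yields one product that has lost "B" and a reciprocal product that carries "B" twice. The segment labels and the function name are assumptions made for this sketch; it models only the outcome of the exchange, not the molecular details of Holliday junction resolution.

```python
def unequal_crossover(chrom_a, chrom_b, i, j):
    """Exchange the segments downstream of chrom_a[i] and chrom_b[j].

    The crossover is only allowed where both positions hold the same
    (similar) repeat, mimicking homology-dependent pairing.
    """
    if chrom_a[i] != chrom_b[j]:
        raise ValueError("crossover requires the same repeat at both positions")
    product_1 = chrom_a[: i + 1] + chrom_b[j + 1:]
    product_2 = chrom_b[: j + 1] + chrom_a[i + 1:]
    return product_1, product_2


# Two homologues with the same layout: repeat copies R flank segment B.
homologue = ["A", "R", "B", "R", "C"]

# Allelic pairing (first R with first R) leaves both products unchanged.
print(unequal_crossover(homologue, homologue, 1, 1))

# Non-allelic pairing (first R with second R) gives a deletion and a duplication.
deleted, duplicated = unequal_crossover(homologue, homologue, 1, 3)
print(deleted)     # ['A', 'R', 'C']
print(duplicated)  # ['A', 'R', 'B', 'R', 'B', 'R', 'C']
```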
Since the repeated regions are no longer segregating independently, the duplicated region of the chromosome is inherited. Another type of homologous recombination based mechanism that can lead to copy number variation is known as break-induced replication. When a double-stranded break occurs unexpectedly in the genome, the cell activates pathways that mediate the repair of the break. Errors in repairing the break, similar to non-allelic homologous recombination, can lead to an increase in the copy number of a particular region of the genome. During the repair of a double-stranded break, the broken end can invade its homologous chromosome instead of rejoining the original strand. As in the non-allelic homologous recombination mechanism, an extra copy of a particular region is transferred to another chromosome, leading to a duplication event. Furthermore, cohesin proteins are found to aid the repair of double-stranded breaks by clamping the two ends in close proximity, which prevents interchromosomal invasion of the ends. If cohesin activity is affected for any reason, such as activation of ribosomal RNA, there may be a local increase in double-stranded break repair errors.
The other class of possible mechanisms hypothesized to lead to copy number variations is non-homologous based. To distinguish between this and homologous based mechanisms, one must understand the concept of homology. Homologous pairing of chromosomes involves DNA strands that are highly similar to each other (~97%), and these strands must be longer than a certain length to avoid short but highly similar pairings. Non-homologous pairings, on the other hand, rely on only a few base pairs of similarity between two strands; it is therefore possible for genetic material to be exchanged or duplicated in the process of non-homologous based double-stranded break repair.
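The two conditions for homologous pairing given above, high similarity (~97%) and a minimum length, can be expressed as a simple check in Python. The sketch assumes the two strands are already aligned and of equal length, and it takes the minimum length as a required argument because the text does not give a value; the figure of 50 used in the example is hypothetical, as are the function names and sequences.

```python
MIN_IDENTITY = 97.0  # percent similarity quoted in the text (~97%)


def percent_identity(a, b):
    """Percentage of matching positions between two aligned, equal-length strands."""
    if len(a) != len(b):
        raise ValueError("this sketch expects pre-aligned strands of equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def meets_homology_criteria(a, b, min_length, min_identity=MIN_IDENTITY):
    """True when the strands are long enough and similar enough."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity


def substitute(seq, positions):
    """Return seq with a different base placed at each listed position."""
    bases = list(seq)
    for p in positions:
        bases[p] = "A" if bases[p] == "T" else "T"
    return "".join(bases)


# Toy strands, invented for illustration.
reference = "ACGT" * 25                            # 100 bases
near_copy = substitute(reference, [10, 50])        # 2 mismatches
diverged = substitute(reference, range(0, 100, 10))  # 10 mismatches

HYPOTHETICAL_MIN_LENGTH = 50  # not from the text; chosen for the example
print(meets_homology_criteria(reference, near_copy, HYPOTHETICAL_MIN_LENGTH))
print(meets_homology_criteria(reference, diverged, HYPOTHETICAL_MIN_LENGTH))
print(meets_homology_criteria("ACGTAC", "ACGTAC", HYPOTHETICAL_MIN_LENGTH))
```

The last case shows why the length condition matters: a six-base stretch can match perfectly yet still fail the check, which corresponds to the short but highly similar pairings that the length requirement is meant to exclude.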
