Basic text processing - Howard Assignments

Write code to read, store, and analyze the latest human genome assembly (found at
/common/contrib/classroom/inf503/genomes/human.txt). At minimum, your code must contain
(10pts):
• A character array to store the entire human genome in a single data structure
• A separate function to read the human genome file
• A function to compute the number of A, C, G, or T characters in the human genome
• Comments describing major code blocks and control structures
A. (20pts) Read in and store the human genome. There will be multiple scaffolds (each with a
separate header denoted by “>”). Concatenate the entire genome (discard headers) into a
single character array data structure. Collect the following statistics (see below) as you are
reading the file. Hint: you can keep running totals or store scaffold sizes and names in a
separate set of arrays.
• How many scaffolds were there?
• What were the longest and shortest scaffolds? Provide the name and length of each.
• What was the average scaffold length?
B. (20pts) Write a function to assess the content of the human genome – count the total number
of a given character (A, C, G, or T) in the whole genome.
• What is the ‘big O’ notation of your search (linear / quadratic / cubic / etc.)?
• How long does it take (in seconds) to execute this function? Hint: You will need to use
system time within your code to get accurate time estimates.
• What was the GC content of the human genome (percent of C’s and G’s in the genome)?
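The questions in part B could be approached along the lines of the following C++ sketch. The function names countBase, gcPercent, and timeCountSeconds are hypothetical, and the sketch makes assumptions the assignment does not state: matching is case-sensitive (so lowercase bases are not counted as their uppercase forms), and GC content is computed against the total number of stored characters, including any N characters. Timing uses std::chrono::steady_clock as one way of reading system time.

```cpp
#include <chrono>
#include <cstddef>

// Counts how many times `target` occurs in genome[0 .. length).
// Each character is examined exactly once.
std::size_t countBase(const char* genome, std::size_t length, char target) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (genome[i] == target) ++count;
    }
    return count;
}

// GC content as a percentage of all stored characters
// (assumption: the denominator is the full array length).
double gcPercent(const char* genome, std::size_t length) {
    if (length == 0) return 0.0;
    std::size_t gc = countBase(genome, length, 'G')
                   + countBase(genome, length, 'C');
    return 100.0 * static_cast<double>(gc) / static_cast<double>(length);
}

// Runs countBase once, stores its result in `countOut`, and returns the
// elapsed wall-clock time in seconds.
double timeCountSeconds(const char* genome, std::size_t length,
                        char target, std::size_t& countOut) {
    auto start = std::chrono::steady_clock::now();
    countOut = countBase(genome, length, target);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Because countBase makes a single pass over the array, its running time grows in direct proportion to the genome length. If the genome is held in a std::vector<char>, its data() and size() members can be passed as the first two arguments.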