| 
  • If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!

View
 

BGSUbioinformaticsseminar

Page history last edited by Ying-Ju Chen 9 years, 3 months ago Saved with comment

BGSU Bioinformatics Seminar

In Fall of 2012 we invited all Biology, Chemistry, Mathematics and Statistics, and Computer Science majors to attend a weekly one-hour seminar on bioinformatics.  To the extent possible, the content of the seminar meetings are recorded here.

 

Session 1, Monday, October 15, 2012

 

Attendees filled out a simple sheet of background questions:  Background questions.docx  Aside from basic information about class year and major, there are questions about programming experience and knowledge of molecular biology, and questions about what people would like to learn in the seminar.  Those who wanted their email address to be added to a list wrote it on the front of the page.

Some students asked if the seminar could be set up as a course so that it would appear on their transcripts, because this might be helpful for getting into medical school, for example.

 

We have funding from the Choose Ohio First bioinformatics program, for Ohio Residents.  Aside from Ohio residency, the requirements are that students attend and participate in the bioinformatics seminar, that they join a research group on campus, and that they take courses to move them in the direction of bioinformatics.  Dr. Paul Morris, Biology, gave a description of the program and its requirements, how much money is available in total, and the like.

 

We talked about the appropriate courses for students in each major to take.  

  • For Biology, Chemistry, and Mathematics majors, the suggestion from the faculty was that students take CS 2010, a first programming course taught with C++.  CS majors confirmed that that would be a good course to start with, much more appropriate than CS 1010, which is an introduction to using computers and uses Visual Basic.  Biology and Chemistry majors can also take Math 2220, Discrete Mathematics, which the CS majors agreed would be a good course for them, also Math 2470, a course on probability and a bit of statistics, or possibly Math 3220, a higher-level discrete mathematics course.
  • For Computer Science and Mathematics majors, the faculty suggest BIOL 2050, and the Biology majors agreed that that would be a good first course, and that it is not at all necessary to take chemistry before BIOL 2050.  BIOL 2040 is also required for the Biology major, but the Biology majors present said it would not be relevant to bioinformatics.  After BIOL 2050, the faculty suggest BIOL 3500, Genetics.  The Biology majors pointed out that BIOL 2040 is also required for BIOL 3500, but in fact Dr. Morris, who teaches BIOL 3500, will waive that part of the prerequisite for bioinformatics students.  Students who cannot take biology next semester can take CHEM 1250 and 1270.
  • Some students suggested that they may not be able to take any of these courses next semester, because of courses that they need to take first, particularly in freshman year.  They were referred to Dr. Morris. 

 

We talked briefly about careers in bioinformatics, that there will continue to be more and more jobs in this area as more and more genetic information gets used in medicine and as the price of sequencing a genome continues to drop below $1000.  However, since bioinformatics is a very new area, it might not be easy to find advertisements for "Bioinformatician".  Not as easy as for an Accounting major to find job ads for accountants, for example.  A few participants indicated that they or a close friend/relative work for bioinformatics companies, and so were able to assure us that one can get a job.  At any rate, when looking for a CS or Biology job, it will always be helpful to stand out from the other applicants, and having a background in bioinformatics should be helpful.

 

We mentioned that we will have researchers from BGSU come to speak at the seminar to explain their research and to give some idea what a bioinformatics student might be able to do as part of their research group.  It's not perfectly clear how we will match up students with research groups, but it will probably work out OK.

 

One goal of the bioinformatics seminar is to have people learn how to program in Python.  We suggested that people get started by setting up an account at Code Academy and start working on Python.  In the future, we will work on small programming projects.

 

In the last 10-15 minutes of the session, Dr. Craig Zirbel, Mathematics and Statistics, gave an introduction to DNA and how it resembles and differs from the hard drive on your computer.  

  • DNA can be thought of as a long chain of the letters A, C, G, and T.  In humans, the sequence is about 3 billion letters long.  It is split over 23 chromosomes, but we don't have to worry about that now.  Every organism on earth uses DNA.  It is a relatively stable storage medium, like your hard drive.  It encodes large amounts of digital data, like your hard drive.  Parts of it are "read" now and then, like your hard drive.  DNA can be copied, like your hard drive.  
  • But DNA is quite unlike your hard drive in many ways.  First of all, DNA is a physical molecule.  The one sequence of A, C, G, T is always matched with a complementary strand (A with T, G with C).  The double strand twists into a very long double helix which itself gets wound around proteins called histones.  This makes some parts of the DNA more accessible to be copied than other parts.  Thus, quite unlike your hard drive, DNA is not random access, but what is accessible is controlled by the physical shape of the DNA molecule.  
  • When DNA is copied, errors occur.  There is a physical error correcting mechanism, but it does not correct all errors.  The error rate is very low, but definitely non-zero, and changes to the DNA accumulate every time a cell divides.  For example, your skin cells divide fairly often, and by my age, plenty of mutations have accumulated.  When you go out in the sun, the ultraviolet light immediately starts to damage your DNA.  Most of the damage is corrected, but not all.
  • One thing we understand about DNA is that if you have a sequence of DNA about 22 nucleotides long, it is very likely to be unique in the whole genome.  So once you can read that much of the DNA from a new sample, you can tell where it came from.  Of course longer subsequences will be even easier to place uniquely.  Micro RNAs, which are like fine tuning knobs for metabolism in the cell, are about 22-25 nucleotides long, which give them the ability to uniquely target parts specific sites for regulation.
  • DNA sequencing is becoming faster (under a week for a human genome) and cheaper (under $1000 for a human genome).  One technique for sequencing DNA involves making little flashes of light as the DNA strand is traversed, in different colors for A, C, G, and T.  These flashes are watched by a camera through a microscope.  Thus, the primary data for DNA sequencing is a set of images, and these images take up enormous amounts of computer storage space.  The flashes of light aren't perfect.  Sometimes a flash is missed, sometimes the colors can't be distinguished, so there are errors in the sequencing.  A good math/stat question is how many times you should read the same DNA sequence to be sure that you have gotten enough data to make the right call on each letter.
  • An older way of DNA sequencing involved taking, say, 10 copies of a DNA sequence and randomly chopping it into shorter strands of length around 200, sequencing those, and then reassembling the short sequences into a complete sequence by looking for overlaps between sequences.  This is a challenging computational and algorithmic problem. 

 

Session 2, Monday, October 22, 2012

 

Attendees were broken into four groups of 4 each.  Each person got a printed copy of the codon table from http://en.wikipedia.org/wiki/Genetic_code  Each group had at least one person prepared to explain translation of messenger RNA into amino acids, and whatever else they could explain about protein synthesis and the chemical properties of amino acids.  Each group also had people prepared to explain for loops, and work through an example.  Each group worked on explaining these things to each other for about 20 minutes, then we had two questions for them to try to answer. 

  • For those who had just learned about translation, the question went like this:  you have a messenger RNA that codes for a protein that is important to you, but I am going to mutate one letter in one of the codons.  Which of the three letters, the first, the second, or the third, would you like me to mutate?  Why?  People identified the third codon as being least likely to cause a serious change after mutation, by reading the codon table.  The follow question:  suppose I was going to mutate either the first or second letter.  Which one would be safest to mutate?  Here opinion was split, but the discussion was good.  A biochemist suggested that the worst kind of mutation was an insertion or deletion in the mRNA, and discussion followed about why that would have such a big impact.
  • Now to those who had just learned about for loops.  First they were asked to write a for loop that would compute the sum 1 + 2 + 3 + ... + 400, but some groups had already done something just like that, so instead they were asked to write a for loop that would compute the sum 1*2 + 2*3 + 3*4 + 4*5 + ... + 399*400.  They worked on it for a few minutes, then one student wrote this solution on the board:
    • x = 0
    • for i = 1 to 399
    • x = i*(i+1) + x
    • print x
  • We talked about how, in python, the line x = i*(i+1) + x would be indented to show that it should be repeated multiple times.  We worked through the various stages of the for loop, in this way:
    • when i = 1, x = 1*2 + 0 = 2
    • when i = 2, x = 2*3 + 2 = 8
    • when i = 3, x = 3*4 + 8 = 20
    • when i = 4, x = 4*5 + 20 = 40
  • We did not talk about how to make sure that the sum ended in the right place.

 

In the last 6 minutes, Dr. Zirbel talked about structured RNA and proteins, how each of them functions by folding up into a three-dimensional structure, and how mutations in the sequence can disrupt this folding and so reduce the fitness of the cell, how some mutations simply have no effect, and how very occasionally the mutation will improve the fitness of the cell and so allow it to outcompete its neighbors.

 

Session 3, Monday, October 29, 2012

 

People were asked to download and install Python 2.7 ahead of time.  Directions can be found at http://thenewboston.org/watch.php?cat=36&number=1 or http://www.python.org/getit/mac/ for the Mac.  Or look around on the web for instructions.  On Windows, you should be able to simply use this link:  http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi

 

Note on Windows:  you will probably be better off in the future if you install Python "for this user only" rather than "for all users" because some additional Python packages do not look in the Windows registry for programs (like Python) that are installed for all users.

 

We made sure that everyone had Python 2.7 downloaded and installed.  We used a text editor such as PSPad or the equivalent for the Mac to type a simple program like this:

 

print "Hello world!"

for i in range(1,10):

 print i 

 

We made a new folder to save Python programs in, with a name like BioInfo.  We saved the program in this folder with a name like firstprogram.py.  (Note to Windows users using Notepad:  File, Save As, change file type to All Files, then save.  If you don't do this, Notepad will probably call the file firstprogram.py.txt, which is not what you want.)  We opened a command window (Windows) or terminal window (Mac) and changed directory to BioInfo.  On windows, you might type "cd C:\Users\zirbel\Documents\BioInfo" to change to the BioInfo directory.  We ran the program using one of the following commands (try each of them to see which one will work for your installation of Python):

 

python firstprogram.py

c:\python27\python firstprogram.py

firstprogram.py

 

We talked about why the values of i that got printed only ran from 1 to 9, and not 1 to 10 like you might think from having typed in range(1,10).  We challenged people to type a program that would compute 1*2 + 2*3 + ... + 399*400 and tell us what value they got.

 

Here is a tutorial for Windows users that shows what many of the steps above look like.  Make sure that you are able to follow all of them, as we will be doing this often in the seminar.  Type and run a Python program on Windows.docx

 

We will be working with this file, which contains aligned ribosomal small subunit (SSU) sequences from 1096 bacteria and archaea.  SSU_GG_MSA_extract_2012-10-29_1000.fasta  The alignment is from the Greengenes project.  The original file has 1,011,633 entries which takes 7.5 GB uncompressed, but a small percentage were randomly selected and written out to this data file.  They all have at least 1400 ACGTU characters and fewer than 10 N characters.

 

Session 4, Monday, November 5, 2012

 

First, as an exercise, everyone was asked to write a for loop that will calculate the average value of the numbers 10*12, 11*13, 12*14,  ..., 56*58, and to explain how they are sure that they have the right answer.  After doing that, calculate the average value of 21*23, 23*25, ..., 67*69, and explain how you know you are right.  One thing that might be useful is a line like this:

  print "Adding ", i, ' * ', i+2

This shows how to display text and variables on the same line with Python.  It is best to copy and paste this text into your editor.

 

We were going to download and install BioPython.  The download site is http://biopython.org/wiki/Download.  Unfortunately, the BioPython download site does not seem to be working today.

 

Third, we started to work directly with the RNA sequence alignment extract that is listed above.  First, people were asked to download this file and to save it in the same folder where their Python programs are.  Second, they were asked to write this simple Python program to read this file and print the lines in it to the screen:

 

f = open('SSU_GG_MSA_extract_2012-10-29_1000.fasta','r')

for line in f:

  print line

f.close()

 

This program simply reads each line of the file and prints it to the screen.  That makes it very hard to make any sense of it.  People were asked to do some of the following things:

 

  1. Count the number of lines in the file.  (Make a new variable, set it to 0, increment by 1 for each line that is read.)
  2. Print the first two lines of the file and then stop printing.
  3. Print lines 1, 3, 5, 7, 9, etc., all the odd lines of the file.  These are called header lines.  Make sure to not put extra blank lines in between the lines you write out to the screen.  Note that each line that gets read from the file comes with a "newline" character.
  4. Write lines 1, 3, 5, 7, 9, etc. to a separate file.  Here are three helpful lines of code, but you have to figure out where to put them:
  5. g = open('headers.txt','w')
    g.write(line)
    g.close() 
  6. Find any header lines that contain the text "Escherichia" and count how many there are.  Use a line like this
    if "Escherichia" in line: 
  7. For a line with a sequence, find the number of characters in the line.
  8. Find the number of A, C, G, T, U, and - characters in a line.
  9. Find all other characters that appear in some sequence line.  Count them.
  10. Find the line with the most non-"ACGTU-" characters and display it along with its header line.  You will want to keep track of the largest number you find and store the line itself, so that when you get to the end of the file, you can print the line that has the largest number.

 

Here is a good reference about string operations in Python:  http://docs.python.org/2/library/stdtypes.html#string-methods

People were encouraged to find their own things to do.  We can continue to write simple programs to deal with a data file like this one for a few weeks.

 

Note that we will not meet on Monday, November 12 because of the observance of Veterans' Day.

 

Session 5, Monday, November 19, 2012

We had two guest speakers.  Anton Petrov described the work of the BGSU RNA group and some of its many web servers, and Xiaohui Xu of Biology described her work with genomes , and then will continue to work on the basic tasks listed above, reading a sequence alignment file.

 

Here are some additional tasks for the RNA sequence alignment file that we were reading above.

  1. While reading through the file, store each line as an entry in a list.  These commands will be useful.  Before you start the loop, make an empty list this way: 
  2. alllines = []  
    within the for loop, add each line you read from the file to the list this way:
    alllines.append(line)
    Now we can go from column to column, then go through each of the rows of the alignment.
  3. Accumulate the letters in column 407.  This is the basic idea:
    col = ""
    for k in range(0,len(alllines)):
       col = col + alllines[k][407]
    print col
  4. Now count the number of A, C, G, T letters in this column.  You should get  154 A.  How many C, G, and T?

 

Session 6, Monday, November 26, 2012

We had one guest speaker, Caitlin Knowlton, who is a master's student in Biology at BGSU.  She spoke about her work with ice core samples, sequencing the microbes found there and using bioinformatic tools to figure out roughly where in the tree of life they come from, a topic called metagenomics.  Her experience is that people are very happy to hear that an job applicant has bioinformatics experience, and anticipates that they would be even happier to hear that the person can write bioinformatics programs themselves.

 

Craig Zirbel gave an introduction to Python dictionaries.  Here is an explanation for those who did not attend.  Suppose you are taking orders for drinks.  You would like to store these orders in a variable called Orders.  There are various kinds of drinks, and for each drink, you would like to record the number ordered.  Here is some code that can accomplish this.

 

Orders = {}

Orders['lemonade'] = 12

Orders['orange juice'] = 3

Orders['water'] = 18

Orders['Coke'] = 7

 

Having done that, you find that there is one last lemonade order.  You can increase the lemonade order by 1 using code like this:

 

Orders['lemonade'] += 1

 

However, if this is the first lemonade order, Python will object that it has no entry in the dictionary for lemonade.  This is safer:

 

if 'lemonade' in Orders:

 Orders['lemonade'] += 1

else:

 Orders['lemonade'] = 1

 

If you want to see the orders, you can do something simple like this:

 

print Orders

 

or more complicated, like this:

 

for d in Orders:

 print d, Orders[d] 

 

Session 7, Monday, December 3, 2012

 

We will continue to work with the RNA multiple sequence alignment.  In order to get better at using Python dictionaries, let's use dictionaries to count letters in each row of the alignment and to count transitions (A to A, A to C, etc.) in each row in the alignment.  Here are some useful code snippets.

 

DNA = ['A','C','G','T']                               # DNA is a list of the four letters representing DNA

lettercount = {}                                        # set up an empty dictionary

for L in DNA:

  lettercount[L] = 0                                  # initialize the count to 0

 

if L in DNA:

 lettercount[L] += 1                              # add one to the count

 

for L in DNA:

  print L, lettercount[L] 

 

To count the number of transitions from A to A, A to C, A to G, etc. in each line, it will be helpful to store counts for these pairs of letters (AA, AC, AG, etc.) in a dictionary.

 

transcount = {}

for a in DNA:

 for b in DNA:

    transcount[a+b] =0

 

for i in range(0,len(line)):

  if (a in DNA) and (b in DNA):

    transcount[a+b] += 1

 

What is left for you to do is to figure out what a and b ought to be.  Since there are many gap characters, you need to figure out how to work around those.  Also pay attention to the fact that there are characters like N, R, Y in different lines of the data.

 

Session 8, Monday, January 7, 2013

 

This is the first meeting of spring semester in 2013. We mostly worked with people who are new to the seminar, helping them get Python installed and getting a basic program running. 

 

We quick reviewed how to write loops. People were asked to write a for loop that would show two columns. One is numbers from 0 to 10; the other is squares of those numbers. Then they worked on how to compute the sum 1**2+2**2+…+10**2  (the sum of squares of numbers) and show the result.

 

Since sum() is a built-in function in Python, it is better not to use sum as a variable. How it works? 

For example, the following code gives us the sum of numbers from 0 to 10:

a = range(0,11)

s = sum(a)

print s

 

We also worked through the various stages of the for loop, in the way:

  • when i = 0, i is even because i = 2*0+0
  • when i = 1, i is odd because i =2*0+1
  • when i = 2, i is even because i =2*1+0
  • when i = 3, i is odd because i =2*1+1

 

Session 9, Monday, January 14, 2013

 

One thing we did today was to read data from a text file, searched for patterns of text in it, and wrote out data to a file.  Click this link to download the data file:  e_coli_23S_rRNA_sequence.txt

 

Here is a program that we ran, and which illustrates searching for text in a string.  read_sequence_file_count_subsequences.py

 

First, people were asked to download the data file:  e_coli_23S_rRNA_sequence.txt. One way to do it is to click the link of the file and find “Pages & Files”. Under the folder “Pages & Files”, we can find the file. Moving the mouse to the file and clicking “More” will give us the option to download the file. We should save the file in the same folder (here, it is “bioinfo”) where we put our Python programs.  Second, they were asked to write the simple Python program to read this file and count the numbers of occurrences of RNA letters ‘A’, ‘C’, ‘G’, and ‘U’, respectively.

 

f = open('e_coli_23S_rRNA_sequence.txt')

 

for line in f:

    print line.count('A')

    print line.count('C')

    print line.count('G')

    print line.count('U')

 

f.close()

 

This program simply reads each line of the file and prints the numbers of occurrences of letters ‘A’, ‘C’, ‘G’, and ‘U’, respectively. People were asked to do some of the following things:

 

  1. Find the pattern GNRA in a DNA sequence. That is to find the pattern G[ACGU][AG]A in a RNA sequence. ‘N’ can be referred to ‘A’ or ‘C’ or ‘G’ or ‘U’ and ‘R’ can be referred to ‘A’ or ‘G’.
  2. Find the numbers of occurrences of all combinations of two letters which are from RNAletters = [‘A’, ‘C’, ‘G’, ‘U’].
  3. Find the patterns

                                   GGNRAC

                                   CGNRAG

                                   AGNRAU

                                   UGNRAU

                                                       in a DNA sequence.

     4. Write the data into a text file.

 

In the last 10-15 minutes of the session, Dr. Craig Zirbel explained the commands in the file read_sequence_file _count_subsequences.py. And he also explained a little about the difference between DNA and RNA and why we are interested in finding the patterns above.

 

Note that we will not meet on Monday, January 21 because of Martin Luther King, Jr. Day. 

 

Session 10, Monday, January 28, 2013

 

We had a short presentation by R. Nyonyi, an undergraduate, who started learning Python and bioinformatics last semester and are now working with Dr. Zhaohui Xu in Biology on genetics research. He gave an introduction to the project Thermotoga RQ7. They are doing the comparison between Genes in Thermotoga RQ7 Genomes and other Genomes by using Python. The two main tools they use are web2py and biopython. Web2by deals with python web frame work and biopython is a python tool for computational biology. Dr. Craig Zirbel also leaded a small discussion about how they do the comparison and Dr. Zhaohui Xu explained why it is important to do the comparison.

 

After that, we worked on a dynamic programming activity, the Manhattan tourist problem. Dr. Craig Zirbel talked about the connection between this activity and how to do align the RNA sequences to get the most correspondences between them. This is essentially what BLAST does; BLAST, which is a web-based search tool, finds regions of similarity between biological sequences. In the last 5-10 minutes, we worked on the worksheet, Dynamic programming to align two short DNA sequences

 

Both activity sheets are available on http://www-math.bgsu.edu/~zirbel/dp

 

Session 11, Monday, February 4, 2013

 

Please install numpy (for Windows installation) before you come; it has support for vectors and matrices in Python.

Here is the link for MAC OS installation http://sourceforge.net/projects/numpy/files/NumPy/1.7.0rc1/ .

 

We are going to write a program to align two sequences of characters.  Here is a skeleton of the program:  alignment_skeleton.py

 

Based on the activity sheets we had on January 28, 2013, we started writing a program to align two sequences of characters. Dr. Craig Zirbel explained the algorithm and how to make the program work. People were asked to fill in many steps in a skeleton of the program.

Two main matrices we calculated are Score and BackArrow,.  We use the matrix Score to record score on each circle and the matrix BackArrow to record the direction where we got the score, respectively. We all ended up with a working program.

 

Session 12, Monday, February 11, 2013

 

We will continue working on the sequence alignment program.  Here is the program we left off with last time:  alignment_partial.py

 

This time we will work on the "traceback" part of the algorithm, in which we follow the back arrows to discover the optimal alignment.  This part of the program is not yet written.

 

In the example that was provided with the program and which appears in the handout Dynamic programming to align two short DNA sequences, you may notice that there are multiple possible ways to attain the optimal alignment, because at several junctures the maximum score could have come from more than one previous node (represented by circles).  Think about how to change the scoring scheme slightly so that a better alignment is obtained, one in which all aligned letters appear together.  Try this using the sequences 'matcwwwwwwwwmathwwwwwwwwmatchwwwwwwwwmtch' and 'match' which are found in the Python programs provided.

 

At the beginning of the seminar, Elizabeth Ross who is a program manager in SETGO (Science, Engineering & Technology Gateway Ohio) gave a brief introduction about SETGO Summer Research opportunities and the scholarship for undergraduates.

 

Then we continued working on the “traceback” part of the algorithm. Dr. Craig Zirbel explained how we use the back arrows and the matrix BackArrow to discover the optimal alignment. We used a variable matches to record the number of matches, two strings (a1) and (a2) with hyphens to show where non-aligned characters are, and two lists of sequence positions (p1) and (p2) indicating which positions are aligned.  

 

Session 13, Monday, February 18, 2013

 

In the last two weeks, we wrote a program that aligns two sequences of characters. We altered the program to make it better at aligning a short sequence against a very long sequence today. Dr. Craig Zirbel modified the version of the program which is called alignment_local.py and made copies of it for students. The students read the program line by line and tried to figure out the difference between the old version and new version of the program and understand the commands in Python. In the new version of the program, Dr. Zirbel introduced a new variable GapOpen which is a penalty score for putting in a gap. He explained how GapOpen works in the program and why a better alignment is obtained after changing the scoring scheme. Then he talked about how to spot obvious bad alignments. Dr. Zirbel also explained how people align two DNA sequences in the end of the seminar.  

 

In the last 5-10 minutes, we learned how to use Python functions to call the alignment program from another program since that will make it easier to read sequences from a file and align them. 

 

alignment_local.py

 

Session 14, Monday, February 25, 2013

 

The first order of business is to package the alignment program we have been writing into a Python function.  This PDF explains how the two programs should look.  Function calls.pdf

 

alignment_program2.py

 

The second thing is to read in some new sequences, which are sequences of amino acids that make up the protein hemoglobin in different organisms.  Here is a file of sequences of alpha hemoglobin.  Hemoglobin is a protein that binds iron, which can bind oxygen in your lungs and release it when it gets to cells in other parts of your body.  Many different organisms have hemoglobin.    hemoglobin_alpha.txt  Download this file, then start a new program, open this file for reading, then read through the lines in the file, appending them to a list.  (Reminder: in section 4, you can find the code for opening a file in Python.) Then, write a for loop to align each sequence to the first sequence, using the alignment_local function that we wrote earlier.

 

You will notice that each line in the hemoglobin sequence file has an organism name, then spaces, then the actual amino acid sequence.  We would like to separate these, so we don't align organism names.  One way to do this is with code like this:

t = line.split()
sequences.append(t[1])

The method "split" when applied to a string will divide the string up using space characters.

 

Finally, we should modify the alignment program so that it allows similar amino acids to be aligned to one another.  Many people use the BLOSUM62 scoring matrix for that.  Instead of scoring an exact sequence match as 1 and different letters as 0, the BLOSUM62 scoring matrix allows other scores between "good" and "bad".  From this site:  http://www.bio.cam.ac.uk/~tjs23/python/SequenceAlignment.html  we can find the following Python code, which stores the BLOSUM62 scores:

 

BLOSUM62={
 'A':{'A': 4,'R':-1,'N':-2,'D':-2,'C': 0,'Q':-1,'E':-1,'G': 0,'H':-2,'I':-1,
      'L':-1,'K':-1,'M':-1,'F':-2,'P':-1,'S': 1,'T': 0,'W':-3,'Y':-2,'V': 0,'X':0},
 'R':{'A':-1,'R': 5,'N': 0,'D':-2,'C':-3,'Q': 1,'E': 0,'G':-2,'H': 0,'I':-3,
      'L':-2,'K': 2,'M':-1,'F':-3,'P':-2,'S':-1,'T':-1,'W':-3,'Y':-2,'V':-3,'X':0},
 'N':{'A':-2,'R': 0,'N': 6,'D': 1,'C':-3,'Q': 0,'E': 0,'G': 0,'H': 1,'I':-3,
      'L':-3,'K': 0,'M':-2,'F':-3,'P':-2,'S': 1,'T': 0,'W':-4,'Y':-2,'V':-3,'X':0},
 'D':{'A':-2,'R':-2,'N': 1,'D': 6,'C':-3,'Q': 0,'E': 2,'G':-1,'H':-1,'I':-3,
      'L':-4,'K':-1,'M':-3,'F':-3,'P':-1,'S': 0,'T':-1,'W':-4,'Y':-3,'V':-3,'X':0},
 'C':{'A': 0,'R':-3,'N':-3,'D':-3,'C': 9,'Q':-3,'E':-4,'G':-3,'H':-3,'I':-1,
      'L':-1,'K':-3,'M':-1,'F':-2,'P':-3,'S':-1,'T':-1,'W':-2,'Y':-2,'V':-1,'X':0},
 'Q':{'A':-1,'R': 1,'N': 0,'D': 0,'C':-3,'Q': 5,'E': 2,'G':-2,'H': 0,'I':-3,
      'L':-2,'K': 1,'M': 0,'F':-3,'P':-1,'S': 0,'T':-1,'W':-2,'Y':-1,'V':-2,'X':0},
 'E':{'A':-1,'R': 0,'N': 0,'D': 2,'C':-4,'Q': 2,'E': 5,'G':-2,'H': 0,'I':-3,
      'L':-3,'K': 1,'M':-2,'F':-3,'P':-1,'S': 0,'T':-1,'W':-3,'Y':-2,'V':-2,'X':0},
 'G':{'A': 0,'R':-2,'N': 0,'D':-1,'C':-3,'Q':-2,'E':-2,'G': 6,'H':-2,'I':-4,
      'L':-4,'K':-2,'M':-3,'F':-3,'P':-2,'S': 0,'T':-2,'W':-2,'Y':-3,'V':-3,'X':0},
 'H':{'A':-2,'R': 0,'N': 1,'D':-1,'C':-3,'Q': 0,'E': 0,'G':-2,'H': 8,'I':-3,
      'L':-3,'K':-1,'M':-2,'F':-1,'P':-2,'S':-1,'T':-2,'W':-2,'Y': 2,'V':-3,'X':0},
 'I':{'A':-1,'R':-3,'N':-3,'D':-3,'C':-1,'Q':-3,'E':-3,'G':-4,'H':-3,'I': 4,
      'L': 2,'K':-3,'M': 1,'F': 0,'P':-3,'S':-2,'T':-1,'W':-3,'Y':-1,'V': 3,'X':0},
 'L':{'A':-1,'R':-2,'N':-3,'D':-4,'C':-1,'Q':-2,'E':-3,'G':-4,'H':-3,'I': 2,
      'L': 4,'K':-2,'M': 2,'F': 0,'P':-3,'S':-2,'T':-1,'W':-2,'Y':-1,'V': 1,'X':0},
 'K':{'A':-1,'R': 2,'N': 0,'D':-1,'C':-3,'Q': 1,'E': 1,'G':-2,'H':-1,'I':-3,
      'L':-2,'K': 5,'M':-1,'F':-3,'P':-1,'S': 0,'T':-1,'W':-3,'Y':-2,'V':-2,'X':0},
 'M':{'A':-1,'R':-1,'N':-2,'D':-3,'C':-1,'Q': 0,'E':-2,'G':-3,'H':-2,'I': 1,
      'L': 2,'K':-1,'M': 5,'F': 0,'P':-2,'S':-1,'T':-1,'W':-1,'Y':-1,'V': 1,'X':0},
 'F':{'A':-2,'R':-3,'N':-3,'D':-3,'C':-2,'Q':-3,'E':-3,'G':-3,'H':-1,'I': 0,
      'L': 0,'K':-3,'M': 0,'F': 6,'P':-4,'S':-2,'T':-2,'W': 1,'Y': 3,'V':-1,'X':0},
 'P':{'A':-1,'R':-2,'N':-2,'D':-1,'C':-3,'Q':-1,'E':-1,'G':-2,'H':-2,'I':-3,
      'L':-3,'K':-1,'M':-2,'F':-4,'P': 7,'S':-1,'T':-1,'W':-4,'Y':-3,'V':-2,'X':0},
 'S':{'A': 1,'R':-1,'N': 1,'D': 0,'C':-1,'Q': 0,'E': 0,'G': 0,'H':-1,'I':-2,
      'L':-2,'K': 0,'M':-1,'F':-2,'P':-1,'S': 4,'T': 1,'W':-3,'Y':-2,'V':-2,'X':0},
 'T':{'A': 0,'R':-1,'N': 0,'D':-1,'C':-1,'Q':-1,'E':-1,'G':-2,'H':-2,'I':-1,
      'L':-1,'K':-1,'M':-1,'F':-2,'P':-1,'S': 1,'T': 5,'W':-2,'Y':-2,'V': 0,'X':0},
 'W':{'A':-3,'R':-3,'N':-4,'D':-4,'C':-2,'Q':-2,'E':-3,'G':-2,'H':-2,'I':-3,
      'L':-2,'K':-3,'M':-1,'F': 1,'P':-4,'S':-3,'T':-2,'W':11,'Y': 2,'V':-3,'X':0},
 'Y':{'A':-2,'R':-2,'N':-2,'D':-3,'C':-2,'Q':-1,'E':-2,'G':-3,'H': 2,'I':-1,
      'L':-1,'K':-2,'M':-1,'F': 3,'P':-3,'S':-2,'T':-2,'W': 2,'Y': 7,'V':-1,'X':0},
 'V':{'A': 0,'R':-3,'N':-3,'D':-3,'C':-1,'Q':-2,'E':-2,'G':-3,'H':-3,'I': 3,
      'L': 1,'K':-2,'M': 1,'F':-1,'P':-2,'S':-2,'T': 0,'W':-3,'Y':-1,'V': 4,'X':0},
 'X':{'A': 0,'R': 0,'N': 0,'D': 0,'C': 0,'Q': 0,'E': 0,'G': 0,'H': 0,'I': 0,
      'L': 0,'K': 0,'M': 0,'F': 0,'P': 0,'S': 0,'T': 0,'W': 0,'Y': 0,'V': 0,'X':0}}


The idea is this:  insert the code defining the BLOSUM62 matrix into our code for local sequence alignment.  When building the score matrix, rather than checking to see whether s1[i] equals s2[j], instead, let b = BLOSUM62[s1[i]][s2[j]].  This retrieves the right row and right column from the BLOSUM62 matrix.  Now use these lines of code to adjust the score:

if Score[i-1,j-1] + b > Score[i,j]:

  BackArrow[i,j] = 3
  Score[i,j] = Score[i-1,j-1] + b

Also, change the "open gap" penalty to -3, which puts it on the same scale as the penalties for misaligning amino acids.

 

Now you should be able to align hemoglobin sequences from different organisms.  Unfortunately, they are longer than 80 characters, so they are hard to read in the command prompt.  A good idea would be to write out the alignments to a text file, together with organism names. The Python code for lining up columns in a text file:

 

g = open('read_hemoglobin.txt','w')

g.write('%-15s %s \n' % (organisms[0], a1))

# %-15s means we set 15 spaces for the first column, %s means we don't specify the space for the second column. 

 

Session 15, Monday March 11, 2013

 

At the beginning of the seminar, Dr. Craig Zirbel explained that how to make alignments of amino acid sequences from proteins. This allows aligning letters which are not exact matches, but which correspond to amino acids that share enough physical similarity that they can substitute for one another without disturbing the structure of the protein they need to fold into.  

 

Alignment

            ------ exact sequence matches

                        ACGT / ACGU

                    mis-spellings in English

            ------ inexact sequence matches

 

We allow non-exact matches by modifying the scoring scheme in the alignment program (alignment_program2.py) that we have already written. We wrote out the pairwise alignments and made them into files. Dr. Craig Zirbel also talked about the need for multiple sequence alignments. He distributed the sheets of RNA codon table to the students so that they can read the BLOSOM62 matrix to see how the scores are distributed between letters as well as the relationship between the scoring scheme and RNA codon table.

 

For the rest of the semester,  we plan to divide people into several groups and start on a project.  This project will require different groups to write different programs. Next, we will put them on Github to share and keep versions. After making the plan, people keep working on writing the program to align hemoglobin sequences.

 

Session 16, Monday March 18, 2013

 

Today we will start a group programming project.  We will download some existing programs and files from a Git repository, we will work on writing more programs, and we will upload our changes to Git so that other people can make use of them.

 

First, get an account with Git:  https://github.com/signup/free

Second, email your username to Dr. Zirbel so he can add you as a collaborator on this project.

Next, download and install Git.  If you have a Windows machine, use http://windows.github.com/  If you have a Mac, you can go to http://mac.github.com  Once you get it installed, you will need to enter your username and password.  Git will make a folder on your computer, probably a folder called "Github" in the "Documents" folder.

Next, clone the "correspondences" project; this will make a local copy on your computer.

Windows users should make a shortcut to the command prompt which will start in the Github/correspondences folder.  This will make it easier to run Python programs in that folder.

 

The Github repository is available online here:  https://github.com/clzirbel/correspondences

 

Here is some background on what we are working on.  Think of a specific RNA or protein molecule that is important for life.  Many such molecules are present in many different organisms from different species.  Some are common to bacteria and animals, they are so important and have been around for so long.  There are laboratory techniques to extract such molecules from organisms and sequence them, but people find that the sequences are not identical between different organisms due to random errors made during reproduction.  We have been doing sequence alignments, which are our best guess of how individual positions in sequences from different organisms correspond to one another.

 

Some molecules have been studied intensively by laboratories that manage to crystallize them, shine x-rays through them, and infer the locations of all the atoms in the molecules.  This tells us exactly what the 3D structure of the molecule is.  People find that the 3D structure of RNA and protein molecules is extremely stable over time, even over 1 billion years of independent lines of reproduction.  The BGSU RNA group has written a program that aligns the nucleotides in two 3D structures using detailed analyses of the local context of each nucleotide.  You can see an alignment like this at this site:  http://rna.bgsu.edu/r3dalign/results/514313b142ec5  (Allow the Java applet there to run so you can see the molecules.)  This is an alignment of the small ribosomal subunit of two bacteria, E. coli (2QBD) and Thermus thermophilus (3I8H).  E. coli lives in the environment all around us, but Thermus thermophilus lives in deep sea vents under very high pressures and temperatures.  Even so, their ribosomal RNAs are very similar to one another.  I have downloaded the alignment and have written it out in a standard form in the file 2QBD_3I8H_correspondences_514313b142ec5.txt, which is in the Github repository.  Here are the first few lines of the file:

# 2QBD to 3I8H alignment produced by R3D Align
# This alignment is available online at http://rna.bgsu.edu/r3dalign/results/514313b142ec5
2QBD|1|A|U|5 corresponds_to 3I8H|1|A|U|5
2QBD|1|A|G|6 corresponds_to 3I8H|1|A|G|6
2QBD|1|A|A|7 corresponds_to 3I8H|1|A|G|7
2QBD|1|A|A|8 corresponds_to 3I8H|1|A|A|8
2QBD|1|A|G|9 corresponds_to 3I8H|1|A|G|9
2QBD|1|A|G|9 corresponds_to 3I8H|1|A|G|9

There are two header lines, which begin with #, then there are many lines that tell which nucleotides from 3D structure file 2QBD align to nucleotides from 3I8H.  The format is (PDB filename)|(Model number)|(Chain)|(Base)|(Nucleotide number), but you don't need to worry about that, because the programs that generate these nucleotide identifiers will be careful to produce them accurately.

 

Some questions about alignments of 3D structures are:

  1. Are two alignments equal?  CompareCorrespondences.py will compare two 3D to 3D alignments.
  2. How can we combine an alignment of structure A to structure B with an alignment of structure B to structure C to get an alignment of structure A to structure C?  CombineCorrespondences.py will do that.

 

A very important idea is to align 3D structures, which are very rare, to sequence alignments, which are large because it is relatively easy to get sequences from many organisms and align them.  Then we can use the sequence alignments to understand the allowed sequence variability for certain regions in a 3D structure.  For example, you know that in a DNA helix, an AT basepair can be replaced with a GC basepair.  Similarly, in an RNA helix, an AU basepair can be replaced by a GC basepair.  We know this from sequence alignments.  The BGSU RNA group would like to learn about sequence variability of many other RNA motifs.  The first step is to make an alignment between a 3D structure and the columns of a multiple sequence alignment.  In the correspondences repository, there is an extract from the GreenGenes 16S ribosomal sequence alignment, called GG16S2012.fasta.  Steve Dinda (a former PhD student in our group) and I have extracted the nucleotide sequence from 3D structure file 2QBD and aligned it to the nearest sequence in the GG16S2012.fasta file, and wrote out the correspondences between nucleotides in the 3D structure file and columns of the alignment in the file 2QBD_GG16S2012_correspondences.txt.  The first few lines of that file look like this:

# 2QBD to GreenGenes alignment calculated by Steve Dinda and Craig Zirbel in March, 2013
# Alignment columns are numbered starting at 1
# The file 2QBD was downloaded from PDB
# The GreenGenes alignment was downloaded from http://greengenes.lbl.gov/Download/Sequence_Data/Greengenes_format/greengenes16SrRNAgenes.txt.gz dated 25-Jun-2012
2QBD|1|A|U|5 corresponds_to GG16S2012|106
2QBD|1|A|G|6 corresponds_to GG16S2012|107
2QBD|1|A|A|7 corresponds_to GG16S2012|108
2QBD|1|A|A|8 corresponds_to GG16S2012|109
2QBD|1|A|G|9 corresponds_to GG16S2012|110


We should be able to use CombineCorrespondences.py to use the correspondences between 2QBD and 3I8H and between 2QBD and the sequence alignment to tell how the nucleotides in 3I8H correspond to the columns in the sequence alignment GG16S2012.fasta.

 

At the beginning of the seminar, people got accounts with Git, installed Git on their laptops and cloned the “correspondences” project from Git. Dr. Craig Zirbel explained how to automatically sync master files from Git or to Git. He also talked about some background on what we are working on. WriteCorrespondences.py writes an alignment of structure A to structure B into a file. CombineCorrespondences.py lets us combine an alignment of structure A to structure B with an alignment of structure B to structure C to get an alignment of structure A to structure C. CompareCorrespondences.py allows us compare two alignments of structure A to structure B.  

 

Then we divided people into three groups to work on the three programs.

 

Session 17, Monday March 25, 2013

 

We continued to work on a group programming project, writing code to work with correspondences between biological sequences (protein and RNA) and their corresponding 3D structures, and between the 3D structures themselves. We will start to use these programs to compare alignments generated by the BGSU RNA group. The goal is to create tools that will allow other research groups to make better use of correspondences between sequences and structures, to learn more about possible 3D structure changes by studying sequences from many, many organisms.

 

At the beginning of the seminar, Dr. Craig Zirbel reviewed what programs we wrote last Monday and gave some ideas about which kind of problems for aligning sequences we may want to solve. Here are some examples:

 

A---G---U

A---G---C

G---G---C

...

 

We may want to delete the gaps between letters.

RemoveGaps.py will help us to do this.

     Input: sequence: a list of strings, all of the same length

     Output: newsequence: a list of strings, but with all gap columns removed

 

RNA group at BGSU has written programs that align 3D structure to one and another.  In fact, it would be nice to be able to align sequences from E. coli, Thermus thermophiles, Yeast and H.M., like four things at once and make four way alignment, like a multiple sequence alignment. We can do all the different pairwise alignments. The question is if you would try to intercept them, what things line up properly between all four of them?

Start from 3 correspondences:

     Input: 3 correspondences A, B and C.

     Output: 3-tuples of correspondences that agree between all 3

 

In the last 30 minutes of the session, people worked on the programs.

 

Session 18, Monday April 1, 2013

 

Today we continued to work on the group programming project.

Dr. Craig Zirbel put 4.5 MB files on GitHub. People started to get GitHub program and synchronize.

 

Dr. Craig Zirbel talked about the three programs we would like to write for the project. People have gotten sequences from many, many different organisms and they put all the sequences in a file and put gaps or hyphen characters in some of the sequences to line up the columns. The usual format that we are using is called FASTA format. The 3D structure has a sequence and there are many, many different sequences in a fasta format file. What we are interested in is to take an individual sequence and compare it to the first sequence, the second sequence, … in the fasta format file. Do a sequence alignment and see which sequence already in their alignments is closest to the one we have in 3D structure.

1. FindBestAlignmentRow.py

Input:

          sequence: a biological sequence,

          MSAfilename: the file name of a multiple sequence alignment in FASTA format

          sequencetype: a string which can be "protein" or "RNA" or "DNA"

Output: 

     sequencenumber: an integer, starting at 1, telling which sequence in the file was the best match

     header: the header of the best-matching sequence

     MSAsequence: the best-matching sequence from the alignment

 

The job is take a sequence and the alignments, go through line after line, do the alignments. And then keep track of which sequence has the highest score. We save the information about the sequence.  

 

2. SequenceToColumnCorrespondences.py

We want to align a sequence to an un-gap sequence from the alignment, we want to find the sequence that's the same. Then we want to see how the sequence corresponds to that line to the alignment. For example, sequence position 1 has the first character of the sequence we got from 3D structure corresponds to column 106 in the alignment. There would be lot of extra columns in the alignment.  Maybe sequence position 2 corresponds to column 109, there are always extra columns in the alignment. Then we have the correspondence between the sequence and the columns of the alignment. This program will figure out correspondence between positions and sequence and columns in the alignment.

 

3. RemoveGaps.py

Once we take those columns out of the alignment, one of the problems is sometimes there would be gaps all the way down one column of the alignment.

Here is an example:

                                   AA--GGT--A

                                   AC--GAT--C

                                   GGG-A-A--C

                                   .......

We would like to write a program to identify those columns and leave them out.

 

Session 19, Monday April 8, 2013

 

The bioinformatics seminar continues.  We are in the middle of a group programming project concerning correspondences between 3D structures, between sequences and structures, and between sequences and alignments. This work has already helped the BGUS RNA group, and will eventually help other research groups as well.

 

This week we will start by learning about how correspondences allow us to extract certain columns and certain ranges from a multiple sequence alignment, columns which correspond to features of interest in an RNA 3D structure.  Then we will finish off some programs we have been working on and write additional code to test that they each run correctly and that they deal gracefully with incorrect or faulty input.  This is a very important part of writing code that will be used by other people.

 

There are couples that people were working on. There were RemoveGaps.py and TestRemoveGaps.py.

Test1

Test2: figure out some better ways to visualize the results and make sure that working out.

A lot of  those alignments that we have been reading are in FASTA format and stockholm format. Dr. Craig Zirbel wrote a program called ReadFASTAAlignment.py and it is on GitHub. He explained the contents in the FASTA format file that we may have and things that we may want to deal with. There are three thing to do here. 1. To read header lines, 2. Combine together the sequence lines and 3. Remove the spaces and also '\n' character in the end of each line.

 

Session 20, Monday April 15, 2013

 

We continued a group programming project concerning correspondences between 3D structures, between sequences and structures, and between sequences and alignments. This work has already helped the BGSU RNA group, and will eventually help other research groups as well. This week we will finish off some programs we have been working on and write additional code to test that they each run correctly. This is a very important part of writing code that will be used by other people.

 

At beginning of this section, some people were still working on the program to remove gaps from the alignments. They were trying to catch up with the program. Dr. Craig Zirbel introduced a program, ReadFASTAAlignment.py.

 

>Organism name #1

  Sequence line #1 \n

  Sequence line #2 \n

….

 (until the next organism)

>Organism name #2

   Sequence line #1 \n

            …

The idea is to go through and accumulate all sequence lines under the organism line until the next organism.

 

Here is a new project, it is called TransposeFASTAAlignment.py,

Let’s image that we leave out or omit the organism names and also image that all sequence lines are combined. For example,

 

  1.  T T A A A T T G A A G A G T T T ...
  2.  T T A A A T T G A A G A G T T T …
  3. ---A-T -G-A --G -T T  ...

 

This may be 9000 column form. Every so often working with RNA work, what we want are some columns from the sequences. But in order to express those columns, we need to read the entire file. We need to go through every line of the file, read the first line cross those columns separate somewhere, read the second line cross those columns, etc. The problem is that the file may have 1 million lines long and so it takes long time to go through the whole file. What we are interested in here is to take the whole alignment and take column 1 here, all the way down this column. And we would like to write out a file whose first line is column 1 and the second row should be the next column. Basically, what we want to do is to transpose the sequences. We would like to have the same data, but in a transpose way. The problem is the data with 9000 columns wide and 1 million long, this is too big for the software memory. We want to know what the efficient way to do this.  

 

Session 21, Monday April 22, 2013

 

This is the last section of Bioinformatics seminar in Spring, 2013. At beginning of the section, Dr. Craig Zirbel talked about the project TransposeFASTAAlignment.py that he introduced last week. Some people worked on the project and formed a small group to discuss how we can have a more efficient way to record the transpose of sequences. Dr. Craig Zirbel also introduced a new project, Run-length encoder.

 

Input:    A string with no numbers in it

Output: A string with multiple characters in a row replaced by the characters and a number telling how many times the character occurred

 

Here is a simple example:

                                        Input:       AAAA-----

                                        Output:    A4-5

 

The students worked on this project for the rest of time in this section.

 

 

 

Comments (0)

You don't have permission to comment on this page.