Finding bacterial ORI using Python – Part I | Aaisha Muhammad

Where in ~~the world~~ the genome does DNA replication began? Let us look at Salmonella Enterica – the very same bacteria responsible for Salmonellosis. That is a type of food poisoning – and that is also none of our concern, today. For now, we shall focus on the genome of S. Enterica. Join me as we journey through 1,008,898,178 nucleotides and find the origin of DNA replication.

A record of my journey through this excellent book, which covers various bioinformatics principles through a series of hands-on, real-world genomics projects.

A little background

Before we dive into the actual finding part, let’s skim over what the ORI is. This is rather biology-heavy, but stick with me for a moment and we can move into the actual code.

The ORI – short for origin of replication, which I refuse to use because it is tediously long – is the part of DNA where the replication process starts from. This process is initiated by the DnaA protein. Without going into much detail, there is a location within the genome that this protein must bind to for the replication process. This is called the DnaA Box. This occurs within the ORI. That is all we need to know for now.

But why bother?

Good question. Why does it matter? Why do we care where S. Enterica – or anything – replicates its DNA from? Live and let live! Replicate your DNA and let others do their own.

Well, would you like to munch on some frost-resistant tomatoes? How about some pest-resistant corn? No? Hm, maybe I can tempt you with some juicy, virus-resistant plums. Still no?! Well, maybe you’d like to save a poor child from SCID and a lifetime of living in a sterile room? That’s more like it.

Today’s day and age is the era of genetic modification and gene therapies and everything to do with modifying our very building blocks. To successfully modify our DNA – be it to produce virus-resistant plums or to treat SCID – we need to make sure we don’t accidentally break our genome. If we were to break up the ORI with introduced genes, we’d just end up preventing the genome from replicating – forever and ever.

Do you see now why it is so critical for us to know where in the genome does replication begins? While learning that of S. Enterica might not be revolutionary, the same methods and technologies will prove helpful in other cases.

Keying our cypher

Before we can attempt to locate the ORI in S. Enterica, we must key our cypher – which is the same as training our algorithm, but that sounds less cool. So we will begin our work with Vibrio cholerae. Aside from sporting a much smaller genome at only 4,033,460 basepairs long, it is also one whose ORI and DnaA Boxes have already been found and experimentally confirmed.

So, cool, we’ve found our ORI in V. Cholerae, but how does that help us find that of S. Enterica? Well, we must look for short strings that occur very frequently in small regions of the genome.

But why?

Must you question everything I say? Sigh. Oh well, I’ll explain a bit more.

Research has proven that in areas of the genome where proteins bind, we often see these repeated strings. These could be to decrease the chances of a protein “missing” its binding spot – there are multiple, so if it slides by one it will get to the other ones. It could also decrease the chances of mutations disrupting protein binding. They could also just be extroverted and think, the more the merrier.

Whichever case it may be, what matters is that these strings could clue us to finding the DnaA boxes within the ORI. After all, the DnaA box is a protein binding site, so we can expect it to follow the same behavior as other protein binding sites.

Sliding down the known ORI section of V. Cholerae, we can check if they are any repeated strings that would point to the presence of DnaA boxes. And that leads us to this simple script, which will check how frequently a given pattern occurs within a longer genome.

'''
A script to count how many times a specicfic pattern occurs within a larger genome.

Input: genome, pattern
Output: Number of times 'pattern' occurs in 'genome'.
'''

def PatternCount(gen, pat):
    count = 0
    for i in range(len(gen) - len(pat) +1):
        if gen[i:i+len(pat)] == pat:
            count += 1
    return(count)

Well… Cool! We’ve got a script that tells us how frequently a given pattern appears within the genome. So very helpful to keying our cypher, right? Wrong. This is little more than a warm-up – it is vastly impractical to hand-pick strings and check if they are frequent. As such, even my horrible scripting skills wrote this one with ease – but never doubt that I gloated over this little victory.

Well, onwards! Now that we can manually, painstakingly figure out frequent strings, let’s automate it! Simple, right?

A point to clarify is that we must keep track of the most frequent strings, not simply any recurring ones. This means we must find a way to keep track of the total count, and then check each recurring string to make sure it is the most frequent.

It is simple enough to carry this out with repeated for-loops, but the book kindly advises us to check how painfully slow our task is. Yeah, that puts a damper on our enthusiasm, when our script takes the foreseeable future trying to find frequent words. Okay, a touch of exaggeration there, but still. It is slow and clunky.

So let’s introduce our genius solution! Well, the book’s, but we implemented their pseudocode, so it counts… right?

'''
A script that counts the frequency of fixed-length strings within a larger genome.

Input: str genome, int k
Output: Most frequent strings of length k within genome

'''

def BetterFrequentWords(txt, k):
    FrequentPatterns = []
    freqMap = FrequencyTable(txt, k)
    maxi = MaxMap(freqMap)
    for i in freqMap:
        if freqMap[i] == maxi:
            FrequentPatterns.append(i)
    return FrequentPatterns

It gets the job done, and is somewhat efficient. Well, my initial take was something painful to the eyes, so I won’t show that version. But here is the refined and elegant and fit-for-the-public-eye version. The rest is history, and we don’t discuss it.

The script depends on some other functions – such as MaxMap and FrequencyTable – which aren’t displayed here, but they can be found on my GitHub if you’re interested in seeing the nitty-gritties.

Reverse Complement

Let’s find the reverse complement of DNA, shall we?

What’s that?

Again with the questions? Fine.

Ever heard of the DNA letters, A T G C? Ever heard how A pairs with T, G pairs with C? No? Well, head to Google – this isn’t a biology class. Kidding, kidding. But just know that in a double strand of DNA, A will always be opposite to T, G will always be opposite to C, and the same thing for the other way around.

If you have, then you will understand what a reverse complement is. We need to take a genomic string, generate its opposite version, and reverse it. Before you ask, we reverse it because DNA is always read in the same direction, which is 5′ to 3′. You don’t need to worry about what those mean. In a double-strand of DNA, the two lie in opposite directions. Meaning if we just convert the nucleotides to their complements, we get the complement in the opposite direction, which is the wrong direction for reading DNA. It’s like reading English, but right-to-left. So we reverse it back into the 5′-3′ direction – or read English left-to-right again.

So, simple script! Once again, my initial solution was… inelegant, to say the least. For some reason I cannot currently comprehend, I decided that making no less than four variables was essential to my script. This was followed by a chain if-elif… and then an else statement. I still cannot remember why I put it there. Its only purpose was to pass. But that’s why you don’t get to see it.

'''
A script that produces the reverse complement of a given genomic string
by substituing each nucleotide with its complement, then reversing the result string.

'''

def DnaReverse(txt):
    newtxt = list(txt)
    nucs = {
        "A":"T",
        "C":"G",
        "G":"C",
        "T":"A",
    }
    for i in range(len(txt)):
        newtxt[i] = nucs[txt[i]]
    newdna = "".join(newtxt)
    newdna = newdna[::-1]
    return newdna

So this is the pretty version you will see. To be honest, the dictionary might be a touch overkill, but I was curious about a more readable solution than this:

def DnaReverse(txt):
    newtxt = list()
    for i in txt:
        if i == "A":
            newtxt.append("T")
        elif i == "T":
            newtxt.append("A")
        elif i == "G":
            newtxt.append("C")
        elif i == "C":
            newtxt.append("G")

    newdna = "".join(newtxt)
    newdna = newdna[::-1]
    return newdna

(This is not my initial solution. Trust me, you’re never going to see it. Trust me, you don’t want to see it.)

When I stumbled upon a dictionary substitute method, I thought I’d modify it and use it here. I tried to find out if it’s less resource-intensive than if-elif chains, but I can’t find any conclusive evidence either way. But my point is, the chain if-elif would probably have worked just fine for this simple of a script.

Uhh, why did we just do this?

Right, where were we? For once, you asked a good question. We were trying to find the most frequent words, so how did we end up doing reverse complements instead?

Let’s go back to the results of our frequent words finder on V. Cholerae’s ORI. If I were to post the entire results here, we’d see a lot of frequent strings of varying lengths, but we’ll just ignore all those and focus on the most frequent 9-mers – that is, strings of length nine. Why the special attention? That’s because most bacterial DnaA boxes are 9-mers, so we can safely disregard all the others.

atgatcaag cttgatcat tcttgatca ctcttgatc

All of these 9-mers occurred 3 times – each – within the ORI section of the genome.

Still doesn’t answer the question…

Oh, be patient, won’t you! If we were to take a careful-er look at our four strings, we will notice – hopefully – that atgatcaag and cttgatcat are reverse complements of each other. Go on, test it with my awesome script – I’m sure it won’t let you down.

This means that technically speaking, it’s the same string that occurs six times within 500 characters! That’s a lot more significant than two different 9-mers occurring three times each. (The ORI is 500 characters long. I will give no proof for this – just take my word for it, as well as the book’s, and ask no questions.)

A note: we can only count the reverse complement as the same string in this particular instance. Since DnaA is a replication protein, it makes no difference which side of the DNA strand it binds to, therefore allowing us to count them both as the same DnaA box.

Awesome! We found the DnaA boxes! Are we done yet?

Not so fast. Firstly, we were only keying our cypher. Secondly, even that part is not quite done. Sure, we found a 9-mer that occurs six times in a five-hundred character long ORI. However, what if this particular 9-mer is common over the entire genome? Remember, we’re only assuming the DnaA boxes will occur multiple times in a short region because that’s how protein binding sites work. Note: in a short region. If this string is common in the entire genome, we can safely rule it out as a DnaA box, because that is not a short region anymore.

And so that leads us to the next script.

'''
A script to find all the positions where a given string occurs in a longer genome. 

Input: Strings Pattern and Genome
Output: All starting positions where Pattern occurs in Genome.

'''


def pattern_match(gen, pat):
    starters = []
    for i in range(len(gen) - len(pat) + 1):
        if gen[i:i+len(pat)] == pat:
            starters.append(i)

    return starters

This script is a simple one – given a pattern, it must return all the locations where this string occurs in the longer genome.

Now let’s run this script along the entire V. Cholerae genome, looking for our candidate for DnaA boxes – atgatcaag and cttgatcat.

Here are the results for atgatcaag:


116556 149355 151913 152013 152394 186189 194276 200076 224527 307692 479770 610980 653338 679985 768828 878903 985368

And here are those for cttgatcat:


60039 98409 129189 152283 152354 152411 163207 197028 200160 357976 376771 392723 532935 600085 622755 1065555

Well, it appears that we are out of luck, no? These strings are certainly scattered all over the genome.

Well, not quite. Let’s take a closer look at just how scattered they are. Notice how both of these strings occur three times within the 15,200 range? That forms a clump. Notice how there is only one such clump, while all the other instances of this string are scattered throughout the genome?

Turns out, we’re actually in luck – these clumps are within the ORI region! The ORI region begins around the same range that these clumps appear in. We can therefore assume – supported by strong statistical evidence – that atgatcaag/cttgatcat represents a possible DnaA box within V. Cholerae.

What next?

Would you look at that – another good question!

Well, we’ve got a long journey to go still. We have found the DnaA box of V. Cholerae, but that doesn’t much help us in finding the DnaA boxes of any other bacteria. If we re-orient ourselves, the initial goal of finding S. Enterica’s ORI and DnaA boxes is no closer to fruition. But despair not – we have found ourselves clues. 9-mers that appear in clumps are a good starting position.

And now you must reward yourself for reaching this far. Take a break, slurp on some coffee, stuff your brain with more questions, and return for Part II.

(That is a different way of saying this post got too long. Part II will be up shortly, continuing from where we left off. Hopefully, there will be no need for a Part III!)

A little background#

But why bother?#

Keying our cypher#

Reverse Complement#

A little background

But why bother?

Keying our cypher

Reverse Complement