Welcome back to the second part of our long-winded journey towards uncovering the ORI and DnaA boxes of S. enterica. I am going to assume you’ve read the first part beforehand - if you haven’t, head back there first. Otherwise, everything I say will make even less sense. Given it probably doesn’t make a lot of sense to begin with, that would be rather inconvenient.

Anyway, moving on. If you recall, the last topic we covered was finding the most frequent 9-mers within the ORI. We then wrote a function to check how scattered any such string is throughout the rest of the genome.

So let’s use the one function we wrote previously that did not require us to give it a pattern as well as the genome:

def FrequencyTable(txt, k):
    # Count every k-mer in txt and how often it occurs.
    freqMap = {}
    n = len(txt)
    for i in range(n - k + 1):
        Pattern = txt[i:i + k]
        freqMap[Pattern] = freqMap.get(Pattern, 0) + 1
    return freqMap

And use it to create a more generalized script: one that finds all strings of length k that appear at least t times within a window of length L.

Our script should therefore slide along the entire genome in windows of length L. In each window it should check every single string of length k and keep count of how many times it appears. Then it should compile a list of all the k-mers that appear at least t times within said window.

While this does automate our previous procedure - we don’t have to manually check for clumps - you can hopefully see how quickly this script can become a clunky disaster. If not, well - I am not going to show you just how clunky mine was before I refined it - you only see the refined versions, after all.

Here is where the FrequencyTable function comes in handy. We’ve already got a function that returns all possible k-mers with their specific counts. All that remains is to call this function in every window, check if the frequency is at least t, and then collect all of those strings and output them.

And thus arises this script.

# k, l, t - kept short to match the problem statement's notation. Rename if too vague.
def FindClumps(gen, k, l, t):
    patterns = []
    # Slide a window of length l along the genome, one position at a time.
    for i in range(len(gen) - l + 1):
        window = gen[i:i + l]
        freqMap = FrequencyTable(window, k)
        for s in freqMap:
            if freqMap[s] >= t:
                patterns.append(s)

    # Deduplicate while preserving the order of first appearance.
    patterns = list(dict.fromkeys(patterns))
    return patterns

Let’s do a test run on E. coli’s genome, just for fun. We’ll look for all 9-mers that appear at least three times within 500-nucleotide windows. If we count each 9-mer only once, rather than every single time it appears, we discover that there are actually 1904 different 9-mers forming such clumps! Can you see how much manual work we saved with this script?
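If you’d like to try this yourself, the call looks something like the snippet below. Fair warning: this brute-force version recomputes the frequency table for millions of windows, so it crawls on a full genome. The file name is just my placeholder - point it at wherever you’ve saved the sequence.

# Hypothetical usage - "e_coli.txt" is a placeholder name for a plain-text genome file.
with open("e_coli.txt") as f:
    genome = f.read().strip()

clumps = FindClumps(genome, 9, 500, 3)
print(len(clumps))  # the run described above found 1904 distinct 9-mers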

At the same time, we also notice that there are thousands of clumps all over E. coli’s genome - so how do we find the ORI? Previously the clumps we found were clustered around the ORI region, but that doesn’t seem to be the case here.

GC Skew

Fear not! We have another handy bit of statistics to aid us in our problem. We’ll continue to use E. coli for this segment. When we examine the genome of E. coli, we discover that the frequencies of G and C nucleotides have a significant “skew” - that is, they do not match, as we might expect from evenly distributed nucleotides. This is called the GC skew. Concretely, the skew at position i is the number of Gs minus the number of Cs in the first i nucleotides. Further handy statistics and research show that we can expect the GC skew to be very low around the ORI region - lowest, in fact, in the entire genome - because the stretch of genome leading up to the ORI is relatively rich in C and poor in G. Some great researchers have this all worked out, and so I won’t go over it. Mostly because it is needless, but also because it is very technical and requires a whole lot more biology than I intend to include in this post.

So leaving aside the whys and hows, let’s just stick to the fact that we now have one more way of finding the ORI in cases where our clump-finding betrays us.

def skew(genome):
    # Running tally of (#G - #C), starting at 0 before the first nucleotide.
    skew_list = [0]
    skew_count = 0
    for nucleotide in genome:
        if nucleotide == "G":
            skew_count += 1
        elif nucleotide == "C":
            skew_count -= 1
        else:
            pass  # A and T leave the skew unchanged
        skew_list.append(skew_count)
    return skew_list
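
A quick sanity check on a short string - this is the standard worked example for the skew problem:

print(skew("CATGGGCATCGGCCATACGCC"))
# [0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]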

Relatively simple script, easily written and done with. This is more of a trial run, making sure our understanding of the concept is solid. On its own it is practically useless - a stream of individual skew values doesn’t help us find the ORI. But it does lead us to the next script, which is significantly more useful.

Remember the whole lowest-skew thing? That is exactly what this script finds. It runs through the genome and outputs the positions where the skew is at its minimum. This gives us a second way to find the ORI.

'''
A script that finds the index (or indices) of the genome where the GC skew is at its minimum.

Input: str genome
Output: list of indices where the GC skew is minimum.

'''

def min_skew(genome):
    skew_count = 0
    minim = 0
    minims_list = [0]  # the skew is 0 before the first nucleotide, so index 0 starts as the minimum

    for index, item in enumerate(genome, start=1):
        if item == "G":
            skew_count += 1
        elif item == "C":
            skew_count -= 1

        if skew_count < minim:
            # Found a new lowest skew: discard the old candidates.
            minim = skew_count
            minims_list = [index]
        elif skew_count == minim:
            # Ties with the current minimum are kept as well.
            minims_list.append(index)

    return minims_list
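
Another quick sanity check, using the standard sample for this problem:

print(min_skew("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"))
# [11, 24]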

I want to go over this script for a bit. Aside from calculating the GC skew the same way the previous script does - minus an else statement that was probably unnecessary - this script must also track the lowest value seen so far, and check each new value against it before storing anything.

We have to be careful here not to keep only a single lowest value - a mistake I initially made. The genome can have multiple points that share the same lowest skew. While in an ideal world there would be one perfect lowest-skew point, genomes are messy, organic things and will never follow the perfect case.

Which is why we must keep track of the minimums and be able to store multiple values. We must also track the indices, rather than just the skew values.

My initial solution iterated over numbers, which meant I had to access the genome again and again using the index.

The solution worked, but let’s face it - it’s ugly. I tried to ignore its ugliness, but in the end went hunting for a better solution because I had to show it around. On my GitHub. And I was rudely reminded of the existence of the enumerate function. Which I had entirely forgotten about. So yeah, let’s not discuss that further.

So, moving on. We’ve got our handy solution that finds all the places in the genome where the skew is lowest.

Cool! We’ve found our ORI now, right?

Wrong. Let’s run this on E. coli’s genome again. We find the lowest skew position, all good! But when we try to find frequent 9-mers that appear three or more times within our 500-nucleotide ORI region, we find… nothing! So that throws a spanner in the works. It appears that despite having found a likely ORI region, we have not found any DnaA boxes. Given that those are the actual places where replication begins, we haven’t quite solved our problem.

Mismatches

Now what? We’ve hit another roadblock. Where our clump-finding failed, we tried minimum skew - and that also failed.

Let’s head back for a moment to V. cholerae - the one whose DnaA boxes and ORI we worked with before. If we alter our algorithm just a touch to allow slight variance in the 9-mers, we find several more instances of our two target strings. Given that DnaA proteins can bind to exact DnaA boxes as well as to ones with slight variations, we can now add these mismatched strings to our DnaA box collection. This gives our chosen strings a boost in likelihood, by the way, since they now occur far more often in a short region than we initially thought.

But we have to be wary of accepting mismatches. Too many, and a string no longer counts as a possible DnaA box. We need to keep track of these mismatches and make sure they don’t exceed a set figure. The number of mismatches between two strings is called the Hamming distance, and it is this distance that must not exceed our chosen limit.

'''
A script that calculates the total number of mismatches between two strings of identical length.

'''

def hamming(genome_a, genome_b):
    h_dis = 0
    # Walk both strings in lockstep and count the positions that differ.
    for a, b in zip(genome_a, genome_b):
        if a != b:
            h_dis += 1
    return h_dis
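
For instance:

print(hamming("GGGCCGTTGGT", "GGACCGTTGAC"))  # 3 - positions 2, 9 and 10 differ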

Simple script for a simple purpose. As per usual, my initial solution was rather clumsy, but then I remembered the handy zip function and how to iterate two objects simultaneously. Clumsy index-iteration gotten rid of, and this… nicer solution implemented.

We now have a way of tracking mismatches. Cool! Now we must rewrite our pattern-matching script so it calculates mismatches.

'''
A script to find all instances where a given string appears within the genome, with up to a specified number of mismatches.

Input: strings pattern and genome, int d
Output: list of starting positions of pattern within genome, with up to d mismatches.

'''

def approx_pat_match(pattern, genome, d):
    starters = list()
    for i in range(len(genome) - len(pattern) + 1):
        # Keep this window's start index if it is within d mismatches of pattern.
        if hamming(pattern, genome[i:i + len(pattern)]) <= d:
            starters.append(i)
    return starters

The core functionality here is unchanged. The script still slides over the genome in windows of a specified length. But instead of checking for an exact match, it computes the Hamming distance of each window against the given pattern. If the Hamming distance is less than or equal to the specified buffer, the script appends the index position to a list, which it later outputs. Be careful not to accidentally output only those whose Hamming distance is strictly less than the buffer. When we’re working with such low values, an off-by-one like that can have a huge effect on our outputs.
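
A toy run, just to see it in action:

print(approx_pat_match("GATC", "GATCGACCGATG", 1))
# [0, 4, 8] - an exact match at 0, one mismatch each at 4 (GACC) and 8 (GATG)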

You might’ve noticed that we’ve taken a step back here. We are back to working with functions that need a pattern as well as the genome - it’s not finding frequent k-mers by itself. But be patient, we’ll get back there in a moment.

Now that we have the script that returns all the starting indices, let’s modify it a touch so it tells us the frequency instead. We could just output the length of our list, but that’s rather clunky. Instead, we adjust the function to keep track of a count rather than a list of starting positions.

def count(pattern, genome, d):
    total = 0
    for i in range(len(genome) - len(pattern) + 1):
        # Every window within d mismatches of pattern adds one to the tally.
        window = genome[i:i + len(pattern)]
        if hamming(pattern, window) <= d:
            total += 1
    return total
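
Sanity-checking with the usual sample for this problem:

print(count("GAGG", "TTTAGAGCCTTCAGAGG", 2))  # 4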

Pretty easy to modify our outputs, and so we move on.

We’ve finally reached what was, for me, the most challenging part to tackle. That is, finding all frequent k-mers with up to d mismatches. We implemented this previously without the mismatches, and we can use some of the previous functionality. But things turn difficult when we have to account for mismatches.

'''
An improved version of the frequent words script that will account for minor mismatches.

Input: str Text, ints k and d
Output: all frequently occurring k-mers within Text with up to d mismatches.

'''

def FrequentWordsWithMismatches(Text, k, d):
    Patterns = []
    freqMap = {}
    n = len(Text)
    for i in range(n - k + 1):
        Pattern = Text[i:i + k]
        # Every string within d mismatches of this window gets credited with one occurrence.
        neighborhood = Neighbors(Pattern, d)
        for neighbor in neighborhood:
            if neighbor in freqMap:
                freqMap[neighbor] += 1
            else:
                freqMap[neighbor] = 1

    # MaxMap returns the highest count in the map (a minimal version is shown below).
    m = MaxMap(freqMap)

    for pattern in freqMap:
        if freqMap[pattern] == m:
            Patterns.append(pattern)
    return Patterns
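
One thing to note: the script leans on a small MaxMap helper that hasn’t appeared in this post. If you don’t have one lying around from before, a minimal version is simply:

def MaxMap(freqMap):
    # The highest occurrence count in the frequency map.
    return max(freqMap.values())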

You’ll notice that this script in itself is not very complicated. It does rely on another mysterious Neighbors function, which is one I haven’t introduced yet… and is also the most mind-bending one.

Let’s quickly run over the basic functionality before I head to the Neighbors function. This script receives the genome Text, as well as the length k of the k-mers it searches for and how large its mismatch buffer d may be.

Then it must go through the genome in chunks of length k and generate all possible variations of each chunk that have up to d mismatches. It then builds a frequency map out of this entire collection. Afterwards it must output the k-mers with the highest count. There might be several tied for the highest frequency, so it must output all of them.

The only hiccup is how on earth do we come up with mismatches?

Here is where Neighbors steps in.

'''
A script to generate all variants of a given string with up to d mismatches.

Input: str pattern, int d
Output: list of all strings within Hamming distance d of pattern.

'''

def Neighbors(pattern, d):
    # No mismatches allowed: the only neighbor is the pattern itself.
    if d == 0:
        return [pattern]
    # A single character can always be swapped for any nucleotide.
    if len(pattern) == 1:
        return ["A", "C", "G", "T"]

    neighborhood = []

    # Recursively generate the neighbors of everything after the first character.
    suffixNeighbors = Neighbors(pattern[1:], d)

    for text in suffixNeighbors:
        if hamming(pattern[1:], text) < d:
            # Mismatch budget left over: the first character may be anything.
            for n in ["A", "C", "G", "T"]:
                neighborhood.append(n + text)
        else:
            # Budget spent: the first character must stay as it was.
            neighborhood.append(pattern[0] + text)
    return neighborhood
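
To make this concrete, here is what it produces for a three-letter string with one allowed mismatch - the original plus its nine single-swap variants:

print(sorted(Neighbors("ACG", 1)))
# ['AAG', 'ACA', 'ACC', 'ACG', 'ACT', 'AGG', 'ATG', 'CCG', 'GCG', 'TCG']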

Here is also where the mind-boggled me steps in. Neighbors is a recursive function, meaning it calls itself. This allows us to break down the input and swap out each character a bit more conveniently.

Let’s go over how this function came into existence.

def ImmediateNeighbors(Pattern):
    # Start from the pattern itself, then swap each position for the other three nucleotides.
    neighborhood = [Pattern]
    nucs = ["A", "G", "T", "C"]
    for i in range(len(Pattern)):
        symbol = Pattern[i]
        for nuc in nucs:
            if nuc == symbol:
                continue  # skipping the original letter avoids duplicate copies of Pattern
            Neighbor = Pattern[:i] + nuc + Pattern[i+1:]
            neighborhood.append(Neighbor)
    return neighborhood

This is a simplified version: one that does not use recursion, for a start, and only generates strings whose Hamming distance is at most one. We will call these variations the immediate neighbors.

This is a helpful stepping stone towards Neighbors, a function that can be told exactly how mismatched its generated strings may be.

Our idea for the Neighbors function is to break the string down to individual letters, and then swap each of them until we’ve hit our d limit. So we can start by chopping off the first letter of our three-letter string, and then swapping out each of the remaining two letters with the other nucleotides. Then we take a step back, re-introduce the first letter, and either keep the original nucleotide - if we’ve already reached our d limit - or swap that one out too - if we’ve not yet hit the d limit.

This concept can be expanded to fit a string of any length, not just three. And that is where our recursion steps in. We need our script to successively break down the string and swap letters, by calling itself again and again until it hits d.

Yeah, it twisted my brain too. Still does, actually.

I hit a very big roadblock here, and it took me a very long time to figure out how to implement this script. I Googled, Stack Overflow-ed, visited shady websites with numerical links that I probably shouldn’t have visited, tried research papers, everything I could possibly lay my hands on. But alas, no fellow students came to save me from my plight, and I was obliged to pick my poor brain and figure it out myself.

Hopefully this will help some other poor student stuck like myself – or perhaps, I will succeed in my initial goal, that is to twist the brains of everyone even more. Hahahahah.

Anyway. I hope this explanation makes sense of what’s going on in the Neighbors script. Another point worth mentioning is the pair of checks at the top of the function, before it attempts to recurse any further. For one, if our mismatch allowance is zero, there is no point in doing this entire exercise. For another, if the input pattern is only one character long, there is no need to go through the whole function either - we can just return a list of all four nucleotides.

And thus, after much brain-picking and googling and frustration and almost-giving-up, I finally figured this script out.

Which takes us all the way back to where we began - the FrequentWordsWithMismatches script above. With Neighbors finally in place, our frequent words finder with mismatches is complete and runs as advertised.

Mismatches and Reverses

Let’s complicate this step further, shall we? Now that we’ve figured out how to calculate mismatches, let’s make it do that again - but in reverse, this time.

Remember reverse complements? I mentioned them at some point in Part I - go back and read that if you need a refresher. Either way, we did write a script there that can generate the reverse complement of a string. We can utilize it here to generate the Neighbors of the reverse complement.
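
Since DnaReverse lives back in Part I, here is a minimal stand-in so this post stays self-contained - my reconstruction, not necessarily identical to the original:

def DnaReverse(genome):
    # Complement every base, then read the result backwards.
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(genome))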

I am not sure if there is a more elegant and simplified way to implement this than the one I chose. But it works and I did not notice any performance issues, so I’ve let it be.

'''
A final improved version of the frequent words script that accounts for the reverse complement as well as a set number of mismatches.

'''

def FrequentWordsWithMismatches(Text, k, d):
    Patterns = []
    freqMap = {}
    n = len(Text)
    rev = DnaReverse(Text)
    for i in range(n - k + 1):
        Pattern = Text[i:i + k]
        rev_pattern = rev[i:i + k]
        neighborhood = Neighbors(Pattern, d)
        rev_neighborhood = Neighbors(rev_pattern, d)

        # Tally the forward strand's neighborhood...
        for neighbor in neighborhood:
            if neighbor in freqMap:
                freqMap[neighbor] += 1
            else:
                freqMap[neighbor] = 1
        # ...and the reverse complement's neighborhood into the same map.
        for neighbor in rev_neighborhood:
            if neighbor in freqMap:
                freqMap[neighbor] += 1
            else:
                freqMap[neighbor] = 1

    m = MaxMap(freqMap)

    for pattern in freqMap:
        if freqMap[pattern] == m:
            Patterns.append(pattern)
    return Patterns

As you can see, this part is not all that complicated - the core functionality is already covered. We simply generate the reverse complement of our entire genome, slide over it with the same windows, and tally its neighborhoods into the same frequency map as the forward strand’s. Note that rev_pattern is not the reverse complement of Pattern at the same position - but since every window of the reverse complement corresponds to some window of the original, the totals come out the same.

The Finale

Phew, what a journey! But finally, we have assembled all the tools needed to break the code. Our cypher is keyed, and we can make a daring attempt to find S. enterica’s DnaA boxes and ORI region. We did a trial run beforehand on E. coli, and since the results were satisfactory, we can take a shot at S. enterica.

with open("./salmonella_enterica.txt") as f:
    genome = f.read().strip()

skew_pos = min(min_skew(genome))  # finding the minimum skew to anchor our scanning window
window = genome[skew_pos:skew_pos + 500]  # specifying a window of 500 nucleotides
out = FrequentWordsWithMismatches(window, 9, 1)
print(out)

This is it. This is all we need to write - the amalgamation of all our previous work, the climax - this is what it all was for. Let’s stop being dramatic and walk through it.

After we’ve got our data in, we first find the minimum skew locations, to get an idea of where we can begin our work from. Then, to account for random skew fluctuations, we give ourselves some breathing room - 500 nucleotides is good enough.

Then we call our grandest function - the frequent words finder, with mismatches and reverse complements and all. We’ll shoot for 9-mers - bacterial DnaA boxes are usually around nine nucleotides long, so it’s a safe assumption. A mismatch allowance of one is fine - we don’t want to cast too wide a net.

CCAGGATCC CCCGGATCC CCGGATCCG CGGATCCGG GGATCCGGG GGATCCTGG

There it is. This is our prediction for S. enterica’s DnaA boxes within our chosen window.

As previously mentioned, this is not confirmed. With E. coli, everything was experimentally verified; such is not the case for S. enterica - so my guess is as good as yours, or that of a researcher in some fancy lab somewhere. Well, maybe their script-writing skills would be better, so let’s leave that territory for now.


Thus ends my journey through Chapter 1 of Bioinformatics I: Finding Hidden Messages in DNA. Since I gave myself a stern warning not to start Chapter 2 until I had finished this ridiculously long blogpost, I shall now head off and start that. Eventually. With lots of animation thrown in between, because I’m disorganized.