Tuesday, June 19, 2012

Finding Genetic Open Reading Frames (ORF) In FASTA

A recent project I worked on required finding open reading frames (ORFs) of a certain minimum length in genetic material that arrived as FASTA files (.fa).

My natural choice for this kind of string matching was, of course, regular expressions, and the language I use in the project is Perl.

After a few attempts I finally settled on the following code:

my $FIND_ORF_REGEXREADY = "(ATG(?:[ATGCN]{3}){13,}(?:TAG|TAA|TGA))";
if ($CDSString =~ m/$FIND_ORF_REGEXREADY/i) {              # check forward
    # handle $1
}
elsif (reverse($CDSString) =~ m/$FIND_ORF_REGEXREADY/i) {  # check reverse
    # handle $1
}

I'll break down the code:
The outer parentheses form a capturing group, so when the match operator (m//) succeeds, the matched ORF is available in $1.
The string I'm searching for must begin with the start codon ATG, continue with a run of three-letter codons, and end with one of the stop codons (TAG, TAA, TGA).
Since each codon should only be composed of A, T, G, C or N (N can be any of the bases), I put those letters in a character class (the square brackets), which matches any single one of them, and {3} repeats it three times. The codon is wrapped in a () group so that the quantifier that follows applies to the whole codon. In case you are wondering about the two kinds of OR: square brackets choose between single characters, while alternatives longer than one character, such as the stop codons, are separated by | inside a group.
I've also added the {13,} to enforce a minimum ORF length, in my case 45 (3 + 3*13 + 3). The curly brackets say the group must be repeated at least that many times, and the comma with nothing after it means there is no maximum. If I wanted a maximum I would add a number after the comma (like {13,20}).
I also used (?:...) to make the inner groups non-capturing: they group without storing what they match, so only the outer parentheses capture and the whole ORF ends up in $1.
Finally, I closed with /i to avoid upper-case/lower-case issues (although there shouldn't have been any).
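
To see the pieces described above working together, here is a minimal sketch that runs the same pattern against two made-up sequences. The helper name first_orf and the test sequences are my own assumptions for illustration, not part of the original project.

```perl
use strict;
use warnings;

# Same pattern as in the post: start codon, at least 13 more codons, stop codon.
my $FIND_ORF_REGEXREADY = "(ATG(?:[ATGCN]{3}){13,}(?:TAG|TAA|TGA))";

# Made-up test sequences (not real data).
my $long_orf  = "ATG" . ("GCC" x 13) . "TAA";   # 3 + 3*13 + 3 = 45 letters
my $short_orf = "ATG" . ("GCC" x 12) . "TAA";   # 42 letters, one codon too few

# Hypothetical helper: returns the captured ORF, or undef when nothing matches.
sub first_orf {
    my ($seq) = @_;
    # In list context a successful match returns its captures, here just $1.
    my ($orf) = $seq =~ m/$FIND_ORF_REGEXREADY/i;
    return $orf;
}

# Lower-case input with flanking letters still matches thanks to /i.
my $hit = first_orf("cc" . lc($long_orf) . "gg");
print "found ORF of length ", length($hit), "\n" if defined $hit;
print "short sequence rejected\n" unless defined first_orf($short_orf);
```

One thing worth checking with a sketch like this: [ATGCN]{3} also matches stop codons and {13,} is greedy, so when several in-frame stop codons follow the ATG, the match extends to the last one rather than stopping at the first.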

The if was pretty simple: I search in one direction. If a match is found, good; if not, I reverse the string and search in the other direction.
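
The two-direction search can be wrapped in a small function, roughly like this. The names find_orf and revcomp are hypothetical helpers of mine. The first mirrors the if/elsif from the post. The second is only there because, if the goal is the opposite DNA strand, the usual approach is to search the reverse complement rather than the plain reversed string; whether that applies depends on the data, so treat it as an assumption.

```perl
use strict;
use warnings;

my $FIND_ORF_REGEXREADY = "(ATG(?:[ATGCN]{3}){13,}(?:TAG|TAA|TGA))";

# Hypothetical helper: returns (orf, direction), or an empty list if nothing is found.
sub find_orf {
    my ($CDSString) = @_;
    my $reversed = reverse $CDSString;    # scalar assignment, so the string is reversed
    if ($CDSString =~ m/$FIND_ORF_REGEXREADY/i) {      # check forward
        return ($1, "forward");
    }
    elsif ($reversed =~ m/$FIND_ORF_REGEXREADY/i) {    # check reverse
        return ($1, "reverse");
    }
    return;
}

# Hypothetical helper: reverse complement, i.e. the opposite strand read 5' to 3'.
# N is left unchanged.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $orf = "ATG" . ("GCC" x 13) . "TAA";    # made-up 45-letter ORF
my ($hit, $dir) = find_orf(scalar reverse($orf));
print "$dir: $hit\n" if defined $hit;
```

To search the opposite strand instead, one would pass revcomp($CDSString) to the pattern in place of the reversed string.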

More on regular expressions can be found in this excellent cheat sheet and in the Perl documentation.
