Encoding Choices for Error Resistant DNA Computers.

Abstract

Though double stranded DNA appears to be a good, stable storage medium for information, most proposed DNA computation systems use single stranded DNA (oligonucleotides) for storage (and computation). This choice is largely because these proposals depend on the annealing or hybridization of oligonucleotides to perform data storage or computation actions. Unfortunately hybridization is imprecise, and incorrect hybridizations easily occur. This places strict requirements on codeword formation and puts an upper bound on the amount of information which can be stored. This paper proposes that DNA computation methods attempt to use both single and double stranded DNA molecules for storage. It is also suggested that multiple independent encodings be used to increase error-resistance.

Double Stranded DNA is Good for Storage.

DNA is a good medium for storing information an a compact and stable way. Double stranded DNA is quite stable, contains redundancy, and can be maintained in vitro with error correcting enzymes.

When acting as a static storage medium, double stranded DNA tends to maintain its integrity. It is, though, vulnerable to hydrolysis reactions. Hydrolysis will cause "nicks" in the DNA where a bond is broken between nucleotides. This decay can be repaired with ligase, and is probably not too much of an issue.

When DNA is amplified by PCR it is subject to errors when being duplicated by polymerase. Taq polymerase, commonly used in PCR has an in vitro error rate of 1/9000 [Smith96].

This error rate is not a big deal, given reasonable redundancy in data encoding. A compact disc with scratches on the surface has a much (much!) larger error rate than this, but will still play back perfectly, thanks to error correcting codes also stored on the disc.

In fact, a large concern with traditional error correction coding is how to deal with burst errors, where long sequences of bits are corrupted. There does not appear to be an analogous problem with DNA. Hydrolysis and PCR errors are independent of each other and do not occur in groups. (Except of course when they randomly coincide). Error correction in DNA appears to be an easier problem than what standard error correction codes are designed for.

Hence, minimal precautions with error correcting codes are enough to maintain data integrity with double stranded DNA encodings.

Single Stranded DNA and Hybridization.

Proposed models of computation tend to use single stranded DNA (oligonucleotides) as their main form of data storage [Adl94, Baum96 Roweis96]. Oligonucleotides are useful because they will anneal or hybridize together with their complementary strand. This seeking out and finding of your complementary strand in solution appears to be the fundamental unit of computation in DNA computer proposals. It is analogous to (yet more powerful than) the switching of a transistor in an electronic computer.

Single stranded Hydrolysis

Oligonucleotides are less stable than their double stranded counterparts. Hydrolysis will simply cut the molecule into two parts, both of which will drift away in solution, never to be joined again. (A mechanism to re-join hydrolized oligonucleotides would probably be a larger problem than the one the DNA computer is programmed to solve).

This problem is not recoverable, however it is detectable. The oligonucleotide could be coded with its expected length so that truncated strands can be discounted from a solution. It is not, however, likely that this length could be easily updated as the strand length is changed during ligations.

Failing this sort of detection, it is likely that a random hydrolysis break would create invalid code-words (or even code-word sequences).

Single strand hydrolysis cuts are estimated to occur 1.4 e-6 times per base per hour [Smith96], and may not be a problem given enough redundant strands in solution.

Hybridization Errors and Hairpins

Hybridization is the action of one oligonucleotide annealing to its complement. For example the 5'-> 3' sequence ATAGC will tend to anneal to its complement GCTAT. This fundamental "operation" is used often to detect solutions, or to build intermediate data structures.

Unfortunately, unlike the switching of a transistor mentioned earlier, DNA hybridization is not very reliable. Just as exact complements will tend to hybridize, close matches can hybridize as well.

Code Words

In order to represent data on DNA, we must choose "code words" to represent information. A code word is simply a distinct sequence of nucleotides (different from all the other code words) that represents a concept or a symbol. For example, code words could be chosen to represent concepts like:

  1. There is a binary 1 at position 16
  2. There is a path leaving Chicago destined for New York
  3. The letter J
  4. A Turing machine tape symbol

An analogy in conventional computing is using an ASCII code to represent a printed character. Unlike ASCII codes, DNA code words cannot be chosen merely by flipping adjacent nucleotides. A near complement (differing by e.g. 1 base from the true complement) of a code word is almost as likely to hybridize as the true complement is.

Hamming Distance

Code words must be chosen with as large a "Hamming distance" as possible. A Hamming distance is loosely defined as the number of mis-matched base-pairs between two code words. For example, the Hamming distance between ATCcGtA and AgGcGAT is 2 (note that mis-matches are in lower case).

On a practical level, [Deaton96] claimed that for length 20 code words, hamming distances should be at least 6 or greater to prevent unwanted hybridizations. In their paper a genetic search algorithm was used to find valid code words.

[Frutos97] found 108 useful code words of length 16 with at least 4 base-pair mismatches.

Shifting

Having a large Hamming distance between code words is not enough, however. Since DNA code words do not have an understanding of where code word boundaries are, they can shift around and still hybridize across a code word boundary. Although any two codewords might have a hamming distance of 6 or more, a code word might find a close match where two adjacent code words join.

For this reason, code words must be chosen to have high Hamming distances from any combination of two adjoining code-words, with an arbitrary shift. [Garzon97] called this the H-measure of a pair of code words:

where is the left-shift operator, and H(*,*) is the ordinary Hamming distance. Using this metric, they computed good code words for lengths up to 9 nucleotides.

Hairpins

There are yet more constraints on code-words. A single strand of DNA can bind with itself in a hairpin fashion. For example the sequence ATTCGAggca...gtaaTCGAAT could bend around and connect with itself.

The effect of this sort of self-hybridization is to remove the corresponding sequences as available hybridization sites for valid oligonucleotides. This effectively reduces the "yield" of the reaction or computation.

To avoid this circumstance, code words must be chosen so that they, and any combination of them do not contain any mirror-image complements. This adds yet another restriction on code word choice. [Baum96] investigated the possible code sequences such that there are no identical subsequences or mirrored complements on a dna strand of k or less length. For or , the maximum set size is:

He does not appear to consider close matches of larger length that might not have k sequential matches.

Encoding choices

A good encoding must satisfy all of these conditions at once in order to be useful for both hybridization reactions and data storage. These conditions are somewhat difficult to satisy. And consequently it is difficult to compute valid code words. These considerations put upper limits on the amount of data that can be both stored, and operated on in the same test-tube.

Suggested Approaches to Alleviate Encoding Problems

  1. Keep the data-base mostly in double stranded form.

    Algorithms should be designed such that they only operate on a limited area of the database at any given time. This both helps maintain data integrity, but also reduces the number of "exposed" code words that are available for hybridization.

    This approach is analogous to the way that human DNA is kept coiled up when it is not "in use".

    [Smith96]'s approach does this at least part of the time, although not necessarily for the reasons stated above.

    Restriction enzymes can be used to separate double DNA strands into singular ones, though controlling this reaction in a meaningful fashion (while maintaining useful code words) may be difficult.

  2. Use multiple, "orthogonal" encodings in parallel (instead of just multiple copies of the same strands) to alleviate difficulties with the encoding causing errors.

    This method uses rendundancy to correct errors, but is probably inefficient in its use of redundancy. (The payoff might not be worth the trouble). It is probably disadvantageous to use the multiple encodings in the same tube since they might exacerbate hybridization errors.

    I am unsure of what to suggest to use as "orthogonal" encodings. Perhaps just shifting code words around to represent different objects is enough?

Conclusion

DNA is a stable molecule suitable for storage of large amounts of information. However, the use of single stranded DNA in computational models of DNA present a problem for simultaneous computation and storage using these molecules.

The inaccurate nature of hybridization presents several obstacles to storing information using oligonucleotides. Code words must be carefully chosen, and there are upper limits to the amount of code words available.

To combat these problems, a DNA computer should attempt to convert inactive data to double stranded form to reduce the length of oligonucleotides sequences "exposed" in solution.

Additionally it may be beneficial to use redundant, yet different encodings to reduce errors caused by unexpected hybridizations.

Bibliography