关于安全:INF2B-技术难点

64次阅读

共计 32775 个字符,预计需要花费 82 分钟才能阅读完成。

COURSEWORK 1 FOR INF2B (ADS THREAD, 2018-19)
STRING MATCHING
Submission Deadlines: The coursework consists of two parts (of a different nature) relating to
one problem. As shown below these have separate deadlines, you cannot swap the parts round
or submit either part later than the stated deadline (unless you have permission to do so from the
yeare organiser).
Part 1 consists of Exercise 1 on page 4, §2.1 and Exercise 2 on page 9, §2.2; these involve
fairly straightforward analysis.
Submit Part 1 by: 4.00PM, 25 February 2019.
Part 2 consists of the tasks in §3; these are all software related.
Submit Part 2 by: 4.00PM, 08 March 2019.
See §4 at the end of this document for instructions on how to submit.
§1. Introduction
In this practical we will consider two methods of searching for all occurrences of a given pattern
within some text (both given as strings). For example we might have a long piece of English text
about Scotland and we wish to search for all occurrences of“Edinburgh”within it. Our aim is to
return the offset position of all occurrences, if there are no occurrences then we indicate this by
a special convention discussed below.
The practical has several aims:

  1. Practice at asymptotic analysis (with guidance).
  2. Careful implementation of one algorithm (an implementation of the other algorithm is
    supplied).
  3. Carry out timing experiments in order to:
    (a) Compare the algorithm that you implement with one that is supplied and decide the
    point from which one is more efficient than the other (i.e., the overheads are overweighed
    by the advantages).
    (b) Determine the constant for the asymptotic analysis as it applies to your particular
    implementation.
    You can concentrate on Part 1 (up to and including §2) first and then on the rest. However it
    would be a good idea to read the entire document through initially (without going into great
    detail) so that you know what is expected for the complete coursework. Note that although the
    document is fairly long the tasks you are required to carry out are quite modest. The emphasis
    here is on understanding the problem well and so plenty of discussion has been included.
    1
    The practical requires you to submit various things. Each of these has some appropriate
    marks assigned. It would however be a serious error to assume that by carrying out only these
    tasks you have obtained the most (or indeed the main) benefit from the practical. Throughout
    the text you are prompted to think about various matters that are not part of the submission. You
    should consider these carefully along with possibly other issues as they occur to you. In other
    words avoid the pitfall of just going through the motions in a mechanical fashion; in any case this
    is not possible for most parts. The practical is issued to help you develop your understanding of
    the material, the marks are a by-product (of course an important one) and far from being its only
    rationale.
    Notes
    Good Scholarly Practice: Please remember the University requirement as regards all assessed
    work for credit. Details and advice about this can be found at:
    http://web.inf.ed.ac.uk/infwe…
    and links from there. Note that, in particular, you are required to take reasonable measures to
    protect your assessed work from unauthorised access. For example, if you put any such work
    on a public repository then you must set access permissions appropriately (generally permitting
    access only to yourself, or your group in the case of group practicals).
    Following instructions: The various parts of this practical place certain requirements with a
    penalty if these are not followed. No negotiation of any kind will be entered into where the
    requirements are not followed. The point of the requirements is to ensure that in doing the
    various parts you practice and demonstrate understanding of the appropriate areas of the course.
    What you submit must be considered work; imagine that your work was being given to you and
    others as a sample answer. If you, or others, would find it unclear and unhelpful then your work
    needs further revision.
    Regrettably the various requirements and penalties can appear somewhat censorious and limiting.
    This is not the intention, indeed much of what is stated can be seen in the much more
    positive light of reassurance that the various tasks have quite simple answers. Please be assured
    that plenty of variation is possible even when following the requirements, i.e., there isn’t necessarily
    a unique way to do and present things clearly, concisely and correctly.
    Obtaining Help: If, after careful study and deep thought, you need help then use the forum for
    questions that you will find on the link at:
    http://www.inf.ed.ac.uk/teach…
    Note that this forum is for questions on this practical only. Any questions must be posted here
    and not sent by email. The TA for this thread will monitor the forum at various times. You should
    allow at least 2 days for a response.
    2
    §2. String matching
    Our discussion is based on the chapter on string matching from the book Introduction to Algorithms
    by Cormen, Leiserson, Rivest and Stein. We are interested in strings over some fixed
    alphabet (which we need not worry about at this stage, e.g., it could be ASCII characters). We
    are given some text which is simply a string over the alphabet. We will represent the text by
    means of an array T which we will assume is indexed starting with 1 (this keeps some of the operations
    more straightforward1). Our convention will be to think of the text as the array T[1 ..n],
    i.e., the array has n entries and T[i] gives us the ith character of the text. We are also given a
    pattern which is simply another string over our alphabet. We will always represent the pattern
    by an array P[1 ..m]. If m>n then the problem is trivial as there can be no occurrences of the
    pattern within the text. We will therefore assume throughout that m ? n.
    Before going any further let us look at a simple example. Suppose the text is bacababa and the
    pattern is bab, so that n = 8 and m = 3. We can think of their representations diagrammatically
    as follows:
    text T b a c a b a b a
    pattern P b a b
    Clearly the pattern occurs in the the text starting at position 5 but nowhere else.
    We will adopt the following convention. We say that the pattern P occurs in the text with
    shift (or offset) s if and only if
    P[1 ..m] = T[s + 1 ..s + m]. (1)
    The equation is to be understood as stating that P[i] = T[s + i], for 1 ? i ? m. One advantage
    of this definition is that it can be expressed as P[t..m] = T[s + t..s + m] where t is the base
    address of our arrays (so in this handout we have t = 1 and in Java t = 0). The point is that the
    offset is independent of t.
    Our problem is to find all shifts s such that equation (1) holds. We will return these shifts as a
    list in increasing order. Thus the pattern occurs in the text if and only if the list is not empty. So
    for our example we would return a list of one cell containing the number 4, i.e., the shift (note
    that the shift is one less than the position at which a pattern occurs in the text since we number
    array locations from 1 onwards [what happens if we number them from 0?]). We can think of our
    list as a queue, though we only ever add items to it (of course subsequent use of the list would
    remove items to find out where the pattern occurs though we will not do this).
    §2.1 Naive algorithm
    There is an obvious algorithm for our problem. We can think of it in terms of placing the pattern
    under the text in successive positions (starting with 1) and testing to see if the pattern matches
    the text immediately above it.
    1Although you are probably used to starting arrays at 0 it is an important part of your training to be able to follow
    and adapt to other conventions that are in use. Demonstrating this ability is part of this assignment
    3
    Naive-Matcher(T,P)
  4. n T.length
  5. m P.length
  6. create a new empty queue S
  7. for s 0 to n
    m do
    5a. if P[1 ..m] = T[s + 1 ..s + m] then enqueue(s, S)
    Of course the test P[1 ..m] = T[s + 1 ..s + m] has to be implemented as a loop, this has not
    been shown above for the sake of clarity. Line 5a should be seen as shorthand for the following
    code:
  8. matches TRUE
  9. for i 1 to m do
  10. if P[i] 6= T[s + i] then
  11. matches FALSE
  12. exit for loop
  13. if matches then enqueue(s, S)
    The lines have been numbered so that this can be slotted into the preceding pseudocode and in
    your analysis assume that this has been done so you can refer to the line numbers as given
    We have also shown S as a parameter to enqueue for the sake of clarity, if implemented
    in Java then we make the appropriate method call. Another point to note is that, as mentioned
    above, our convention here is to number arrays starting at 1 whereas Java starts at 0. We will
    discuss this below, for now let us not worry about the implementation language.
    Exercise 1. For item 1 you must use the O(·) notation along with any appropriate properties as
    discussed in Lecture Note 2. You may not use named constants as part of the analysis. For item 2
    you could use named constants but in fact this is not necessary, the ?(·) notation enables things to
    be expressed simply and well. Finally, note that for full marks you must justify each expression
    or claim with reference to the algorithm. Any answers that do not follow these requirements will
    be awarded 0.
    Let TNM(n, m) denote the worst case runtime of the algorithm Naive-Matcher on inputs of
    size n and m respectively.
  14. Prove that TNM(n, m) = O((n
    m + 1)m). Note that the additive constant +1 is needed
    to take care of the case when m = n. [15%]
  15. Prove that TNM(n, m) = ?((n
    m + 1)m). Thus you must produce inputs of size n,
    m that cause the algorithm to have runtime as claimed. You cannot fix either of n or m.
    (HINT: There are very simple examples.) [10%]
  16. Is it true to say that TNM(n, m) = ?((n
    m + 1)m)? Justify your answer briefly. [5%]
    Note: The second part of the exercise is a good opportunity to underline the important point
    that for many algorithms the worst case runtime for given input size(s) does not happen for all
    possible inputs but only for some appropriate ones. As an example here, if we take the text
    4
    to consist entirely of the character a and the pattern to consist of anything starting with the
    character b then the runtime is O(n). We saw the same point in connection with the simple linear
    search algorithm.
    §2.2 The Knuth-Morris-Pratt algorithm
    Let us look again at the simple example. In trying to see if there is a match of the pattern with
    the text starting at position 1, the first two characters match but the third fails. Our knowledge of
    the text can be pictured as:
    text T b a ? …
    pattern P b a b
    (It is true that we looked at the next character in the text but we want to relate our knowledge
    of the text to the pattern, taking the extra character into account does not help in the following
    discussion.) This tells us immediately that it is not worth trying a match of the pattern with the
    text starting at position 2, since the first character of the pattern fails to match its second character.
    On the other hand it is worth trying position 3. This observation (and its generalisation below) is
    crucial for the development of the algorithm.
    Consider the general situation where we are comparing the pattern against the text starting at
    position s+ 1 (recall that s is the shift from position 1 of the text). We have a match exactly up to
    position q of the pattern (i.e., position s+q of the text). We allow q = 0 to mean there is no match
    at all (this makes sense, q is the number of characters matched). We have found a complete match
    of the pattern in the text if and only if q = m otherwise there is a mismatch with the character at
    position q + 1 of the pattern. Whatever is the case we now know that T[s+ 1 ..s+q] = P[1 ..q].
    Given this knowledge, we can decide just from the pattern if it makes sense to try matching it
    with the text starting at position s + 2. We do this by comparing the pattern with itself thus:
    a1 a2 a3 a4 … aq
    a1 a2 a3 … aq1
    We see that it is only worth trying a match with the text at the next position if and only if the
    match above is successful (this is the same as saying that T[s+2 ..s+q] = P[1 ..q1]).
    Suppose
    that this fails, i.e., the shift of the pattern does not match with the pattern up to position q. We
    can again decide just from the pattern if it is worth trying at the next position. Our diagram is:
    a1 a2 a3 a4 … aq
    a1 a2 … aq2
    Here we need T[s + 3 ..s + q] = P[1 ..q
    2].
    The preceding discussion shows that what we need is the least shift s0 > s such that
    T[s0 + 1 ..s + q] = P[1 ..s + q
    s0
    ].
    5
    We can rephrase this as: the least shift s0 > s such that
    T[s0 + 1 ..s + q] = P[1 ..k],
    where s0 + k = s + q. We can also write the equation as T[s0 + 1 ..s0 + k] = P[1 ..k] [why?].
    The crucial point here is that we can pre-compute these shifts from the pattern alone (recall that,
    by assumption, T[s0 + 1 ..s + q] consists of the first q characters of the pattern). An equivalent
    interpretation is to find the largest k<q such that the stated condition holds. Since s0 = s+qk
    we can eliminate s0 altogether and write the equation as
    T[s + q
    k + 1 ..s + q] = P[1 ..k]. (?)
    (It is important to get the indices right but don’t let this disguise the main idea which is quite
    simple and intuitive.) Note that such a k always does exist since k = 0 satisfies the condition
    (vacuously) except that it might not be the largest. On the other hand the requirement k<q
    means that the possible values of k are bounded from above and hence there is a maximum.
    At this point it might be helpful to recall that our discussion is based on the assumption that
    T[s + 1 ..s + q] = P[1 ..q]. It follows that the equation (?) above is equivalent to
    P[q
    k + 1 ..q] = P[1 ..k]
    since T[s + q
    k + 1 ..s + q] = P[q
    k + 1 ..q]. In terms of a diagram the last displayed
    equation states
    a1 a2 … aqk+1
    aqk+2
    … aq
    a1 a2 … ak
    So once we fix q we can find k just from the pattern P.
    We define the function ? : {1, 2,…,m} ! {0, 1,…,m
    1} by requiring that ?(q) is the
    value k as just described. (We do not need to define ?(0) because q = 0 means there was no
    match at all and so we will automatically move on. ) Note that the function ? can be represented
    by an array indexed from 1 to m and we will use this without further comment. From the
    preceding discussion we know that
  17. (q) < q
    for all q. Make sure that you understand fully the meaning of this function and how it relates to
    the discussion above. The following example will help.
    The function for our simple pattern bab is given by:
    q 1 2 3
    (q) 0 0 1
    Let us see exactly why these values are correct by looking at them in turn. Our discussion will
    refer to T (to help you relate the calculations to their intended use) but remember that this is not
    necessary for the calculations.
    6
    (1) : Here T[s + 1 ..s + 1] is the same as the pattern entries P[1 . . 1], i.e., T[s + 1] is the same
    character as P[1]. We want the largest k such that P[1 ..k] = T[s0 + 1 ..s0 + k] where
    s0 + k = s + 1. Since we require s0 > s this immediately means that s0 = s + 1 and k = 0.
    Note that this argument applies to any pattern, we always have ?(1) = 0. It is exactly what
    we would expect: we are told that the attempt to match the pattern with the text starting
    at position s + 1 failed at position s + 2 (i.e., T[s + 2] 6= P[2]). It follows that the next
    position at which we should try a match is at s + 2.
    (2) : We have T[s + 1, s + 2] = P[1, 2]. Is it worth trying for a match at position s + 2 of the
    text? Well such an attempt would look like:
    Clearly this is bound to fail. Thus we set k = 0 and this means that s0 = s + 2, there is no
    point in attempting a match at position s + 2 (recall that the next attempted match starts at
    position s0 + 1).
    (3) : We have T[s + 1, s + 2, s + 3] = P[1, 2, 3]. Is it worth trying for a match at position s + 2
    or s + 3 of the text? An attempt at position s + 1 would look like:
    b a b
    b a
    Clearly this would fail (we already knew this from the previous case of course). So now
    we ask if it is worth trying for a match at position s + 3. Here the picture is:
    b a b
    b
    Thus it is worth trying this (and from the information we have we cannot rule out a match).
    So k = 1 and s0 = s + 2.
    It is worth stressing again that although the discussion above has referred to parts of the text
    input, all these are simply initial segments of the pattern. For any initial segment (i.e., value of q)
    we have a value for ?(q). The references to the text were simply a matter of convenience and
    for aiding understanding. As noted above, for a given value of q the value of(q) is simply the
    largest k such that
    P[q
    k + 1 ..q] = P[1 ..k].
    Note that for k = 0 both sides of this equation consist of an empty array so the equation holds.
    In terms of a picture we require the pattern to look like:
    Pattern up to position q a1 ··· aqk
    a1 a2 ··· ak
    Pattern shifted to position q
    k + 1 a1 a2 ··· ak
    7
    In intuitive terms just think about keeping one copy of the pattern up to position q. We then
    place another copy under it but shifted along by one position, ignore the last entry (it falls out
    of scope), and check for a match. If not we shift the bottom along by one more place and again
    check for a match with the characters directly above. We keep doing this till we either find a
    match (in which case k is the number of entries matched) or the bottom copy falls off the right
    hand end (in which case k = 0). You should now be able to see that the lengthy discussion above
    for the pattern bab can be summarised as:
    q 1 2 3
    Pattern b a b
    Shift 1 b a
    Shift 2 b
    (q) 0 0 1
    Make sure you understand how we can work out ?(q) for each q straight from the array of the
    pattern and its shifts. Start with q = 1, how much of the displayed array is relevant? For each q
    cover up the irrelevant part, if any, and then look for the maximum number of characters that
    match completely on any shift line with the relevant part of the pattern.
    The method we have used so far to compute ? is not particularly efficient (as a function of m,
    the length of the pattern P). The crucial aspect of the approach is that we can compute the
    function very efficiently. Here is the pseudocode (recall that we hold the value of ?(i) in an array
    which we will also call ?):
    Compute-Prefix-Function(P)
  18. m P.length
  19. ?[1] 0
  20. k 0
  21. for q 2 to m do
  22. while k > 0 and P[k + 1] 6= P[q] do k [k]
  23. if P[k + 1] = P[q] then k k + 1
  24. [q] k
  25. return
    Remember that in this section we assume arrays start at index 1 but of course in Java they start at
    index 0 so you will need to make an adjustment from the pseudocode to the Java implementation
    of Compute-Prefix-Function, see the note on p. 13. It is by no means obvious that this algorithm
    is correct, indeed the runtime is also not obvious. A good way to begin the study of an algorithm
    is to simulate it on simple examples. In order to help you with this here are two further examples
    of patterns and their functions:
  26. P = bbcbbabbc q 1 2 3 4 5 6 7 8 9
    ?(q) 0 1 0 1 2 0 1 2 3
  27. P = ababbabababc
    q 1 2 3 4 5 6 7 8 9 10 11 12
    (q) 0 0 1 2 0 1 2 3 4 3 4 0
    8
    You should first check that these functions are correct by computing ? by the naive method. Then
    simulate Compute-Prefix-Function, at least for the first few values. In fact, the supplied code
    (see §3.1) has an implementation of the naive method of computing the prefix function (note
    that this does not implement Compute-Prefix-Function). This is in the Matcher class and the
    method is:
    public static int[] buildPrefixFunctionNaively(String pattern)
    Important: This method is supplied to help you in cross checking your work. You must not
    use it as part of your implementation since it would invalidate all the timing tests (in effect you
    would be implementing an algorithm that is significantly slower than naive matching!).
    A proof of correctness of the Compute-Prefix-Function is not easy. Again it would be a
    good exercise to try proving the correctness of the algorithm, or more accurately think about
    how you might approach such a task (put a limit on the time you spend on this). You can find a
    complete proof in the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein.
    The worst case runtime is ?(m) but a naive analysis does not work because of the presence
    of a while loop (on line 5) within the overall for loop. We can derive the bound by using a form
    of amortized analysis. We will not go into details for Compute-Prefix-Function because the
    approach is similar to the one discussed below for KMP-Matcher.
    We are now ready to look at a more efficient string matching algorithm.
    KMP-Matcher(T,P)
  28. n T.length
  29. m P.length
  30. Compute-Prefix-Function(P)
  31. q 0
  32. create new (empty) queue Q
  33. for i 1 to n do
  34. while q > 0 and P[q + 1] 6= T[i] do q [q]
  35. if P[q + 1] = T[i] then q q + 1
  36. if q = m then
  37. enqueue(Q, i
    m)
  38. q [q]
  39. return Q
    The correctness of this algorithm is also not obvious but not so hard to establish given the correctness
    of Compute-Prefix-Function. The next exercise will find the worst case runtime. Note
    that this is a type of analysis that we do not otherwise cover in the course. It has therefore been
    broken down into small and simple steps so that the things you have to do are ones that are common
    with the type of analysis we have seen many times. Finally, although the question shows
    three main items, the last one concludes the analysis for you and is not a question as such; in
    your submission include only answers to the first two items.
    Exercise 2. Define the function (s)
    as follows: (0)
    = 0 while for s > 0 the value of (s)
    is
    the value of q just after the body of the for loop of line 6 in KMP-Matcher has executed for the
    value i = s. Consider an execution of the for loop on line 6 when i = s and count the following:
    9
    the number of executions of the statement controlled by the while loop on line 7;
    the number of executions of the statement controlled by the if on line 8;
    the number of executions of the statements on lines 10, 11 (controlled by the if on line 9).
    Denote this count by cs. Define
    cs = cs + (s).
    This is called the amortized cost of the execution of the loop and the function
    is called a
    potential.
  40. Let c be the total cost of executing the for loop in KMP-Matcher, counted in terms of
    the statements listed above, and c the total amortized cost (i.e., c = c1 + ··· + cn and
    c = c1 + ··· + cn). Prove that c = c + (n)

(0).
Explain briefly why it now follows
that c c. [15%]

  1. Prove that cs = O(1) for all s (indeed cs 4). Break this down into the following easy
    parts:
    (a) Let q0 be the value of q before the while loop on line 7 is executed (i.e., q0 = (s1))
    and let q1 be the value of q just after the while loop terminates. Let ws be the number
    of times that the while loop body is executed. Prove that ws+q1q0

    1. Recall that

    (q) < q for all q; what happens to the value of q with each execution of the loop?
    (b) Let q2 be the value of q after line 8 is executed. Prove that 1 + q2
    q1 2.
    (c) Let q3 the value of q just after the if statement starting on line 9 is executed (i.e.,
    q3 = (s)).
    Prove that 2 + q3
    q2 2.
    (d) Now put these observations together to deduce the overall claim. Note that cs ?
    ws +1+2, thus
    cs (ws + 1 + 2) + (s).
    NOTE: Each of these requires no more than a few lines to justify. So think carefully and
    express your justification very clearly. Overlong or unclear answers will receive very much
    reduced credit and any answers that cannot be understood after 2 minutes (per part) will
    be awarded 0. [20%]

  2. We now use the preceding two main parts to deduce that the worst case runtime of the
    algorithm is O(n); there is nothing for you to submit here but you should read through this
    conclusion of the analysis and make sure you understand it. We use without proof the fact
    that the worst case runtime of Compute-Prefix-Function is ?(m); recall that m n.
    Lines 1, 2, 4 and 12 of KMP-Matcher all cost O(1) while line 3 is O(m). The cost
    of executing the for loop on line 6 for i = s is at most acs + b for some constants a
    and b; if you have time give a brief justification of this observation. It follows that the
    10
    overall cost of executing the for loop on line 6 is at most ac + bn and hence at most
    ac + bn. From the preceding part we deduce that ac + bn = O(n). Thus the total cost is
    O(1) + O(1) + O(1) + O(1) + O(m) + O(n) = O(n) since m n.
    In fact, we can improve our analysis to show that the worst case runtime is (n) but we
    will not do this here (would you expect it to be sub-linear).
    §3. Software tasks
    We first give a guide to some of the suppled classes and methods (some more are mentioned
    later).
    §3.1 Supplied software
    The files that you will need for this coursework can be found on-line at the coursework webpage:
    http://www.inf.ed.ac.uk/teach…
    This page contains a file called inf2bcw1.tar which you should download. The file is tarred
    so you need to extract it, e.g. by tar -xf inf2bcw1.tar, this will create a subdirectory
    src of your current one that contains the software. The java package for this coursework is
    package matcher. The only class that should be changed is the nearly empty class in the file
    StudentClass.java which is where you will implement your part of the project. In fact,
    the amount of code you need to write is fairly modest.
    Recall that in the discussion above we assumed that arrays were indexed starting from 1.
    However Java starts with 0. Rather than invent a new class in Java it is simplest to subtract 1
    from all array indices. This is very important otherwise your software will not work.
    Note: It is a part of the exercise that you translate the description of the algorithm based on
    arrays starting at 1 to arrays starting at 0 in Java. As mentioned above, this is a deliberate choice
    so that you practice switching conventions. In this case it is a very simple matter. You cannot
    throw away location 0 of Java arrays and start using them from 1 onwards.
    Important: Do not remove any headers from the supplied software including those in the
    StudentClass.java file.
    In the software the text and pattern will be represented by String variables text and
    pattern; our alphabet is just any valid Java character (in experiments we will use a restricted
    alphabet, this is discussed below). The following methods and classes are already supplied for
    you.
    class Queue. This implements the queue we use to hold the offsets for occurrences of the
    text. In fact, for this practical we only need two methods.
    – public Queue enqueue(Integer offset)
    This puts s at the end of the queue.
    11
    – public Queue toString()
    This simply returns a comma separated string of all the entries in the queue so that
    you can print it out for checking your implementation. If the queue is empty then the
    empty string is returned.
    Class Matcher that contains various methods including
    – public Queue matchNaively(String text, String pattern)
    This takes text and pattern and uses NaiveMatcher to return a Queue structure
    which consists of all the shifts in the text at which the pattern can be found. For example
    if our text is “aababaab” and the pattern is “aa” then the returned Queue structure
    has 0, 5 as its elements in this order. With the same text but with the pattern “aaa” the
    returned Queue is empty.
    §3.2 Tasks
    Note: You are asked to write some code that is in fact fairly simple thanks to the supplied
    methods. Any code that is incorrect will be awarded 0. Correct code will be marked for style
    as well. This means that it should have useful comments and helpful indentation. Note that
    pointless comments (e.g., stating the obvious such as“now we add the two variables together”)
    or excessive indentation are almost as bad as the complete absence of either.
    Important: The checking of your code includes an automated procedure. Partly for this reason,
    you must follow the specification exactly and must not change any of the method names or their
    parameters. Otherwise your code will fail the test and be awarded 0. In particular remember not
    to remove any headers including those in the StudentClass.java file.
    The“texts”that we will be searching are in fact DNA sequences. A DNA sequence or genetic
    sequence is a succession of letters representing the primary structure of a real or hypothetical
    DNA molecule or strand, with the capacity to carry information as described by molecular biology.
    The possible letters are A, C, G, and T (note they are all in upper case) representing the four
    nucleotide bases of a DNA strand: adenine, cytosine, guanine and thymine that are linked to a
    phosphodiester backbone. Sequences can be derived from the biological raw material through a
    process called DNA sequencing. This description is purely for background, you do not need to
    know it for the exercise, all that is needed is that our alphabet consists of A, C, G, T (and indeed
    this is only relevant to carrying out tests).
    Your programming tasks are as follows.
  3. In the StudentCode class, provide implementations of the method
    public static int[] computePrefixFunction(String pattern) [10%]
    In the StudentCode inner class KMPMatcher, implement the string matching algorithm:
    public void search() [15%]
    12
    Note: Recall that the function
    . In
    §
  4. we used the
    convention that arrays start at index 1. However in Java they start at index 0. Since we
    are just computing the array that defines
    we now reduce each argument to the abstract
    function
    by 1. To be clear if pi is the Java array that gives us the values of the function
    then
    is given by pi[i-1]. If we were implementing a Java method getPi, say, then
    we could hide this so that a call to getPi(i) looks up pi[i-1] and returns this as the
    value. However to avoid extra coding we just work directly with the array pi
    .
    Keep your code straightforward, follow the algorithms as outlined in
    §2.2 (taking care
    of the issue regarding array indexing). Note that in designing the matching algorithm we
    made the reasonable assumption that the pattern is not longer than the text. Your code must
    test for this and return an empty queue if the pattern it strictly longer than the text. Remember
    also that you must not use the supplied method buildPrefixFunctionNaively
    as part of your implementation and if you do then you will be awarded 0 for the entire
    coding part
    .
    Test your implementation by using the following methods in the Matcher class: public static Boolean testPrefixFunction(String pattern)
    This checks for occurrences of pattern with itself as discussed for building the
    prefix function. Returns true if the list of occurrences (represented as shifts) is
    correct otherwise false. You will not get any other information.
    public static void testKMPMatcher(int t, int l)
    Generates
    t random patterns of length
    l and finds their occurrences in a fixed text
    using the method KMPMatcher.find. If all results are correct, then it returns an
    appropriate message. Else, it reports on the failure. Note that the integer parameters
    here and elsewhere must not be negative (an exception is raised otherwise).
    Naturally the test procedures do not guarantee correctness. All we can say is that if the
    code fails the tests then there is something wrong. If it passes all tests then it is probably
    correct.
  5. The aim of this part is to compare the runtimes for NaiveMatcher with those for
    KMPMatcher. We would expect that as the pattern gets larger the differences would be
    more significant. Note however that there could be patterns for which NaiveMatcher
    runs faster (you should think carefully why this can be the case).
    However we have a problem in trying to compare these two algorithms. The runtime of
    Naive-Matcher depends on both
    n and
    m while that of KMP-Matcher depends only on
    n
    .
    Of course for any fixed value of
    m the runtime of Naive-Matcher is linear in
    n, recall that
    its runtime is
  6. We would expect that for big enough fixed values of
    m
    the advantage of KMP-Matcher would be clear (but note that if
    m is close to
    n then the
    runtime of Naive-Matcher is again linear in
    n). In order to avoid complications we will
    2Do not make the mistake of concluding falsely from the statement about fixed values of
    m that the runtime of
    Naive-Matcher is linear in
    n. This is clearly nonsense, e.g., consider
    m
    13
    carry out experiments with the following choices of
    m for any given value
    n: we use
    m = 10
    , 10
    2,…, 10
    r where
    r is the smallest integer such that 10
    . The
    upper bound is chosen because
    m has its maximum value at
    The fact that we multiply the length of the pattern by 10 each time means that the number
    of patterns is not enormous.
    The supplied Matcher class method ? public static void getRuntimes(int p, int t, String f)
    does the following:
    (a) Generates
    p patterns for each size from 10
    , 10
    2,… as discussed above using
    n
    =
  7. as the initial text size (the patterns are not random so that experiments are
    repeatable).
    (b) Searches for all occurrences of each pattern within a text of size
    n using the classes
    NaiveMatcher and KMPMatcher, and takes the cpu times for these.
    In fact, the text is always taken from the initial portion of size
    n of the same longer
    text.
    (c) Records the worst case times for each search taken over the
    p patterns (note that it is
    pefectly possible that the worst case runtimes for the two algorithm implementations
    occur for different patterns).
    (d) Repeats the above with texts of size 20000
    , 30000,…, 10000
    t
    .
    (e) Outputs the result for each iteration on the file named
    f in the format
    pattern-size text-size naiveMatcher-worst-case-runtime
    KMPMatcher-worst-case-runtime
    For
    f you can give a path name but the simplest thing is to run code from your
    working directory and use just the name for
    f; the file will be placed in your working
    directory.
    (Note that the data above are given on a single line of the file, the text is too long to fit on
    one line of this document.) Carry out experiments with increasing values of
    t until you find
    a clear difference between the two algorithms. In order to avoid excessive waiting choose
    the value of
    p to be fairly moderate. Naturally it makes sense to continue the experiment
    a little beyond the point of the first clear difference (indeed due to various factors there
    might be a temporary turn around, this doesn’t happen for long).
    Finally, run the experiment with p=10
    , t=100 and keep the results in a file with the name
    matcherTimes.txt. Study the output to see what happens in terms of one algorithm
    outperforming the other. [5%]
    Use this file to find the crossover point(s) at which KMPMatcher is consistently faster
    than naiveMatcher. This is a rough and ready way of obtaining an estimate for the
    14
    settling in period before the overheads outweigh the benefits. In this case our approach is
    open to criticism (think about why this is the case) but is not too misleading.
    From now on we will focus on KMPMatcher.
  8. We know that the worst case runtime T(n) of KMPMatcher satisfies T(n) ? cn for some
    constant c. Of course asymptotic notation allows for a settling in period, i.e., from some
    value n0 of n onwards. In fact, this period is not really necessary at least from n = 1
    onwards because we can just take a larger constant if needed (with the obvious downside
    that this gives a pessimistic performance guarantee for all large enough values of n). It
    should be fairly clear from the algorithm that the settling in period is not long; essentially
    we just need n to be big enough so that cn dominates the cost of initializing variables and
    setting up data structures.
    The supplied Matcher method
    public static void getRatios(int p, int t, int xOver,
    String fRatios)
    is similar to getRuntimes except that for each experiment it only uses KMPMatcher.
    Once it has found the worst case runtime for a text size n it records the value of that runtime
    divided by n. The results are output to the file fRatios (use the name
    matcherRatios.txt for fRatios) in the format
    integer-size ratio
    one per line as before. At the end of this output it also appends four extra lines:
    Ignoring ratios before cross over point: xOver
    Sorted ratios are: r1, r2,…
    Maximum ratio is: max-ratio
    Average ratio is: ave-ratio
    Thus the final two lines give us the overall maximum as well as the average of all the
    ratios. The case for taking the average is that it smooths out to some extent the effects of
    any garbage collection. In a more detailed study we need to be more sophisticated but here
    we will keep things simple and rely on the plot (see below) to give us an idea of how useful
    the average is. You can also use the sorted ratios to get an idea of the effects of garbage
    collection (we would expect ratios affected by this to be significantly bigger than the“real”
    ones).
    Use the following values for the parameters:
    p set to 10,
    t set to 100.
    15
    xOver: use the crossover point (TextSize in matcherTimes.txt) you found
    where KMPMatcher becomes more efficient than NaiveMatcher.
    Finally you will plot the data from the file matcherTimes.txt of the previous part
    and the worst case runtime as determined by the theoretical analysis together with the two
    estimates for the constant found by the preceding experiment (the maximum and average).
    Use the supplied Matcher method
    public static void plotRuntimes(double c, double a,
    String f)
    to produce your plot. For c use the maximum constant found above, for a use the average
    and for the file f pass the data in the file matcherTimes.txt from the preceding part.
    This will bring up a plot which you should save in a file called plot.jpg. [5%]
    Compiling and Running
    From the commandline, type cd path-to-src-folder and compile with the command javac
    matcher/*.java. Run the program with the command java matcher.ClassName
    where ClassName is the .java file where your main() target is.
    Notes
  9. Java provides pattern matching. You must not use this in your implementation. The point
    is to do things from scratch. If you ignore this requirement you will be awarded 0 for the
    coding part.
  10. As stated above your code must be well laid out showing its logical structure. Just like
    good prose or good mathematical writing it must be set out to aid understanding. This way
    you and others can maintain it as well as spot any errors more easily.
    §4. Submitting the Coursework
    You must submit the two parts of your work by the deadlines given at the head of this document.
    If you complete the coursework in full, you should be submitting:
    Part A: Analysis of algorithms (hardcopy only).
  11. Your answer to Exercise 1 of §2.1.
  12. Your answer to Exercise 2 of §2.2.
    Part B: Software.
  13. Your code in the class file StudentClass.java (electronic only).
  14. matcherTimes.txt (electronic only).
  15. plot.jpg (electronic only).
    16
    Note: Your answers to the two exercises of Part 1 are best handwritten (unless you have a medical
    condition that prevents this) and neatly presented following the guidelines of Lecture Note 2 (and
    the notes in general). Your hardcopy submission should be made at the appropriate time to:
    ITO, Appleton Tower Room 6.05, Crichton Street, Edinburgh, EH8 9LE
    Important: Put your matriculation number very clearly at the top of your submissions. You do
    not have to put your name; marks will be returned to the ITO by matriculation number (this is
    how the submit command (see below) organises things. If you need to submit more than one
    piece of paper for a part please staple the sheets together and put your matriculation number on
    the top of each sheet (in case they get separated). Do not use folders to hold several sheets as this
    holds up the marking process significantly.
    For electronic submission follow these instructions.
    First put your submission files (and nothing else) into one directory called inf2b-cw1.
    Submit this directory using the following command (when logged into DICE):
    submit inf2b cw1 inf2b-cw1
    Warning: The rule for courseworks is“We mark what is submitted.”Before you submit your
    coursework, make sure that you are submitting the correct files. In some previous years students
    submitted the wrong files and lost marks that way.
    Bibliography
  16. D. E. Knuth, J.H. Morris, Jr. and V.R. Pratt, Fast pattern matching in strings. SIAM Journal
    on Computing, 6 (2), 1977, 323–350.
  17. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction
    to Algorithms. McGraw-Hill, 2002.
    Kyriakos Kalorkoti
    Software by Prachya Boonkwan, Xuan Huang,
    Srikanth Ronanki and Naums Mogers
正文完
 0