- Due Date: 11:55 pm on Fri 09 May 2025
- Weight: 20 marks
- Code Requirement: You must write all the code yourself. Only use standard libraries. Standard data structures (like list, dictionary, tuple, set) are allowed as long as they are space/time - efficient for your purpose. Input/output and other unavoidable routines are exempted.
- Submission Procedure:
- File Naming and Content: All your scripts must contain your name and student ID.
- Zip File: Upload a
.ziparchive via Moodle with the filename in the form<student ID>.zip. The archive should extract to a directory named after your student ID. Includea2.pyscript anda2report.pdfin this directory, and nothing else. - Submission Method: Submit the archive electronically via Moodle.
- Specification Adherence: Strictly follow the specifications in each question. Do not hard - code input filenames; they should be passed from the command line.
- Academic Integrity: Monash University upholds high standards of honesty. All submissions will be scanned via MOSS or JPLAG. Using generative AI to answer this assessed task is prohibited.
- Task: Given a collection of text files and multiple pattern files, report all occurrences of each pattern that matches, under the DL - distance ≤1 threshold, the regions of texts in the collection.
- Data Structure Requirement: Use the suffix tree data structure as the primary search data structure for DL - distance ≤1 searches. You may also use supplementary data structures or algorithms from this unit or its prerequisite units to optimize performance.
- Input Files Details:
- Character Set: Each text/pattern file contains characters from the 7 - bit ASCII printable characters (32–126) and ASCII whitespace control codes (8–13, 32). Assume the ‘$’ symbol does not appear in text/pattern files.
- File Content: Each text/pattern file has at least one character from the specified alphabet. The longest pattern is at most as long as the shortest text. Multiple text files can have identical contents, and a larger text file may contain the contents of a smaller one. The same applies to pattern files.
- Program Name:
a2.py - Argument: The program accepts one argument, which is the filename of a run - configuration file. The run - configuration file format is:
<N=number of text files> <M=number of pattern files>
1 <text_filename_1.txt>
2 <text_filename_2.txt>
...
N <text_filename_N.txt>
1 <pattern_filename_1.txt>
2 <pattern_filename_2.txt>
...
M <pattern_filename_M.txt>
- Command Line Usage:
a2.py <run - configuration - fileaname> - Output File Name:
output a2.txt - Output Format:
<pattern number> <text number> <position of occurrence> <DL - distance>
...
- **Numbering**: The pattern number and text number in each line should follow the numbering in the run - configuration file. They need not be in sorted order.
- **Position of Occurrence**: It is the 1 - based position where the pattern was observed in the text under a DL - distance of 0 or 1.
Answer the following questions in the report:
- How did you handle multiple texts in your approach?
- How did you handle DL - distance = 0 search on multiple patterns in your approach?
- How did you handle DL - distance = 1 search on multiple patterns in your approach? Report one subsection for each of the 4 edit operations: (1) substitution, (2) transposition, (3) insertion (of a character in a pattern), and (4) deletion (of a character in a pattern).
- Were changes made to the suffix tree data structure (compared to the version delivered in your lecture)? If yes, state the changes and justify them.
- Did you use any other supporting data structures or algorithms not already covered in your responses to the above?
- What is the worst - case time AND space complexity of your suffix tree construction over multiple files?
- What is the worst - case time AND space complexity of your search approach under the DL - distance ≤1 threshold? # FIT3155 a2