COMP9315 21T1 - Assignment 2Signature IndexesDBMS ImplementationLast updated: Friday 2nd April 8:40pmMost recent changes are shown in red;older changes are shown in brown.A changelog is at the end of the file.Hopefully, this changelog will be very short.summary introduction commands data-types tasks testing submission changelogAimsThis assignment aims to give you an understanding ofhow database files are structured and accessedhow superimposed codeword (SIMC) signatures are implementedhow concatenated codeword (CATC) signatures are implementedhow partial-match retrieval searching is implemented using signaturesthe performance differences between different types of signaturesThe goal is to build a simple implementation of a signature indexed file, including applications to create such files, insert tuples intothem, and search for tuples based on partial-match retrieval queries.SummaryDeadline: 11:00am on Monday 19 AprilLate Penalty: 0.125 marks off the ceiling mark for each hour late (i.e. 3 marks/day)Marks: contributes 20 marks toward your total mark for this course.Groups: you must complete this assignment individuallySubmission:login to Course Web Site > Assignments > Assignment 2 > Submission > upload ass2.taror login to any CSE server > give cs9315 ass2 ass2.tarWorkspace: any machine wth a C compiler (preferably gcc); you do not need to use GriegThe ass2.tar file must contain the Makefile plus all of the .c and .h files that are needed to compile the create, insert andselect executables.You are not allowed to change the following files: create.c, insert.c, select.c, stats.c, dump.c, hash.h, hash.c, x1.c,x2.c, x3.c. We supply them when we/you test your files, so any changes you make will be overwritten. Do not include them in theass2.tar file. Details on how to build the ass2.tar file are given below.Note that the code in create.c, insert.c, select.c, stats.c, dump.c assumes that you honour the interfaces to the ADTsdefined in the *.[ch] file pairs. If you change the interfaces to data types like bits.h and page.h, then your program will be treatedas incorrect.Make sure that you read this assignment specification carefully and completely before starting work on the assignment.Questions which indicate that you haven't done this will simply get the response "Please read the spec".Note: this assignment does not require you to do anything with PostgreSQL.IntroductionSignatures are a style of indexing where (in its simplest form) each tuple is associated with a compact representation of its values (i.e.its signature). Signatures are used in the context of partial-match retrieval queries, and are particularly effective for large tuples.Selection is performed by first forming a query signature, based on the values of the known attributes, and then scanning the storedsignatures, matching them against the query signature, to identify potentially matching tuples. Only these tuples are read from thedata pages and compared against the query to check whether they are true matching tuples. Signature matching can result in "falsematches", where the query and tuple signatures match, but the tuple is not a valid result for the query. Note that signature matchingcan be quite efficient if the signatures are small, and efficient bit-wise operations are used to check for signature matches.The kind of signature matching described above uses one signature for each tuple (as in the diagram below). Other kinds ofsignatures exist, and one goal is to implement them and compare their performance to that of tuple signatures.2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 2/13In files such as the above, queries are evaluated as follows:Input: pmr query, Output: set of tuples satisfying the queryqrySig = makeSignature(query)Pages = {} // set of pages containing possibly matching tuplesforeach tupSig in SignatureFile {if (tupSig matches qrySig) {// potential matchPID = page of tuple associated with tupSigadd PID to Pages}}Results = {} // set of tuples satisfying queryforeach PID in Pages {buf = fetch data page PIDforeach tuple in buf {// check for real matchif (tuple satisfies query) add tuple to Results}}Note that above algorithm is an abstract view of what you must implement in your code. The function makeSignature() does notliterally exist, but you need to build analogues to it in your code.SignaturesWe will consider two methods for building signatures: superimposed codewords (SIMC), and concatenated codewords (CATC). Eachcodeword is formed using the value from one attribute.In SIMC signatures, all codewords and signatures are m bits wide, and each codeword has k bits set to 1. In CATC signatures,signatures are m bits wide, but codewords occupy approximately equal numbers of bits of the signature. Since there are m bits in thesignature and n attributes, each codeword is u = m/n bits long, except for the lower-order codeword (the one for the first attribute).This codeword is u bits long + m mod n bits, so that the total number of codeword bits is equal to m. The following diagram shows theparts of a concenated codeword signature:In this example, the signature is m=42 bits wide. Each codeword, except the lower-order one, is u=10 bits wide. The lower-ordercodeword has two extra bits to make up to 42. Each codeword has half of its bits set to 1; in CATC codewords, k = u/2. This is differentto SIMC codewords, where we need to determine k to ensure that roughly half of the bits in the signature are set to 1.2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 3/13The way we build CATC signatures is conceptually straightforward: form n codewords, each of which is m/n bits wide, andconcatenate them. In practice, we build n codewords, each of which is m bits wide, with the lower-order u bits set as the codeword,and then, shifted into the position that it would occupy in a concatenated codeword signature. The diagram below illustrates this:Note: the fact individual codewords are 8-bits long is not intended to suggest that codewords will always be whole bytes. Individualcodewords would be 6-bits if m = 24, or 9-bits if m = 36. And, as noted above, if m = 42, the codeword for attribute 1 would be 12-bitsand all other attributes would have 10-bit codewords.In subsequent discussions, we denote the length of tuple signatures as m, the length of page signatures as mp, and the length ofCATC codewords as u (remembering that all SIMC codewords have the same length as the signatures they produce).RelationsIn our system, a relation R is represented by five physical files:R.info containing global information such asthe number of attributes and size of each tuplethe number of data pages and number of tuplesthe base type of signatures (simc or catc)the sizes of the various kinds of signaturesthe number of signatures and signature pagesetc. etc. etc.The R.info file contains a copy of the RelnParams structure given in the reln.h file (see below).R.data containing data pages, where each data page containsa count of the number of tuples in the pagethe tuples (as comma-separated character sequences)Each data page has a capacity of c tuples. If there are n tuples then there will be b = ⌈n/c⌉ pages in the data file. All pagesexcept the last are full. Tuples are never deleted.R.tsig containing tuple signatures, where each page containsa count of the number of signatures in the pagethe signatures themselves (as bit strings)Each tuple signature is formed by incorporating the codewords from each attribute in the tuple. How this is done differs betweenSIMC and CATC, but the overall result is a single m-bit long signature. If there are n tuples in the relation, there will be n tuplesignatures, in bt pages. All tuple signature pages except the last are full.R.psig containing page signatures, where each page containsa count of the number of signatures in the pagethe signatures themselves (as bit strings)Page signatures are much larger than tuple signatures, and are formed by incorporating the codewords of all attribute values inall tuples in the page. How this is done differs between SIMC and CATC, but the result is a single mp-bit long signature There isone page signature for each page in the data file.R.bsig containing bit-sliced signatures, where each page containsa count of the number of signatures in the pagethe bit-slices themselves (as bit strings)Bit-slices give an alternate 90o-rotated view of page signatures. If there are b data pages, then each bit-slice is b-bits long. Ifpage signatures are pm bits long, then there are pm bit-slices.The following diagram gives a very simple example of the correspondence between page signatures and bit-slices:2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 4/13PagesThe different types of pages (tuple, signature, slice) were described above. Internally, all pages have a similar structure: a counterholding the number of items in the page, and the items themselves (tuples or signatures or slices). All of the items in a page are thesame size. The following diagram shows the structure of pages in the files of a signature-indexed relation:We have developed some infrastructure for you to use in implementing these signatur-indexed files. The code we give you is notcomplete; you can find the bits that need to be completed by searching for TODO in the code.How you implement the missing parts of the code is up to you, but your implementation must conform to the conventions used in ourcode. In particular, you should preserve the interfaces to the supplied modules (e.g. Bits, Reln, Query, Tuple) and ensure that yoursubmitted modules work with the supplied code in the create, insert and select commands.CommandsIn our context, signature-indexed relations are a collection of files that represent one relational table. These relations can bemanipulated by a number of supplied commands:create RelName SigType #tuples #attrs 1/pFCreates an empty relation called RelName with all tuples having #attrs attributes. SigType specifies how signatures should beformed, and can have one of two values: simc or catc. The #tuples parameter gives the expected number of tuples that arelikely to be inserted into a relation; this, in turn, determines parameters like the number of data pages and length of bit-slicedsuperimposed codewords. The 1/pF parameter gives the inverse of the false match probability; for example, a value of 1000corresponds to a false match probability of 1/1000 (0.001).These parameters are combined using the formulas given in lectures to determine how large tuple- and page-signatures are.Each bit-slice has a number of bits equal to the number of data pages, which is determined from #attrs, #tuples and the pagesize.This gives you storage for one relation/table, and is analogous to making an SQL data definition like:create table R ( a1 integer, a2 text, ... an text );Note that internally, attributes are indexed 0..n-1 rather than 1..n.2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 5/13The following example of using create makes a relation called abc where each tuple has 4 attributes and the indexing has afalse match probability of 1/100. The relation can hold up to 10000 tuples (it can actually hold more, but only the first 10000 willbe indexed via the bit-sliced signatures).$ ./create abc simc 10000 4 100insert RelNameReads tuples, one per line, from standard input and inserts them into the relation specified on the command line. Tuples all takethe form val1,val2,...,valn. The values can be any sequence of alpha-numeric characters and '-'. The characters ',' (fieldseparator) and '?' (query wildcard) are treated specially.Since all tuples need to be the same length, it is simplest to use gendata to generate them, and pipe the generated tuples intothe insert commandselect RelName QueryString IndexTypeTakes a "query tuple" on the command line, and finds all tuples in the data pages of the relation RelName that match the query.IndexType has a value of either t, p or p, indicating whether it should used the tuple, page, or bit-sliced signatures. Queriestake the form val1,val2,...,valn, where some of the vali can be '?' (without the quotes). Some examples, and their interpretationare given below. You can find more examples in the lecture slides and course notes.?,?,? # matches any tuple in the relation10,?,? # matches any tuple with 10 as the value of attribute 1?,abc,? # matches any tuple with abc as the value of attribute 210,abc,? # matches any tuple with 10 and abc as the values of attributes 1 and 2There are also a number of auxiliary commands to assist with building and examining relations:gendata #tuples #attributes [startID] [seed]Generates a specified number of n-attribute tuples in the appropriate format to insert into a created relation. All tuples are thesame format and look likeUniqID,RandomString,a3-Num,a4-Num,...,an-NumFor example, the following 4-attribute tuples could be generated by a call like gendata 1000 47654321,aTwentyCharLongStrng,a3-013,a4-0013456789,aTwentyChrLongString,a3-042,a4-128Of course, the above call to gendata will generate 1000 tuples like these.A tuple is represented by a sequence of comma-separated fields. The first field is a unique 7-digit number; the second field is arandom 20-char string (most likely unique in a given database); the remaining fields have a field identifier followed by a nonunique3-digit number. The size of each tuple is7+1 + 20+1 + (n-2)(6+1)-1 = 28 + 7(n-2) bytesThe -1 is because the last attribute doesn't have a trailing comma, and (n-2)*(6+1) assumes that it does.Note that tuples are limited to at most 9 attributes, which means that the maximum tuple size is a modest 77 bytes. (If you wish,you can work with larger tuples by tweaking the gendata and create commands and the newRelation() function, but thisnot required for the assignment).stats RelNamePrints information about the sizes of various aspects of the relation. Note that some aspects are static (e.g. the size of tuples)and some aspects are dynamic (e.g. the number of tuples). An example of using the stats command is given below.You can use it to help with debugging, by making sure that the files have been correctly built after the create command, andthat the files have been correctly updated after some tuples have been inserted.dump RelNameWrites all tuples from the relation RelName, one per line, to standard output. This is like an inverse of the insert command.Tuples are dumped in a form that could be used by insert to rebuild a database.You can use it to help with debugging, by making sure that the tuples are inserted correctly into the data file.Setting UpYou should make a working directory for this assignment and put the supplied code there, and start reading to make sure that youunderstand all of the data types and operations used in the system.2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 6/13$ mkdir your/ass2/directory$ cd your/ass2/directory$ unzip /web/cs9315/21T1/assignments/ass2/ass2.zipYou should see the following files in the directory:$ lsMakefile dump.c psig.c stats.c x1.cbits.c gendata.c psig.h tsig.c x2.cbits.h hash.c query.c tsig.h x3.cbsig.c hash.h query.h tuple.cbsig.h insert.c reln.c tuple.hcreate.c page.c reln.h util.cdefs.h page.h select.c util.hThe .h files define data types and function interfaces for the various types used in the system. The corresponding .c files contain theimplementation of the functions on the data type. The remaining .c files either provide the commands described above, or are testharnesses for individual types (x1.c, x2.c, x3.c). You can add additional testing files, bu there is no need to submit them.The above files give you a partial implementation of signature-based indexing. You need to complete the code so that it provides thefunctionality described above.You should be able to build the supplied partial implementation via the following:$ makegcc -std=gnu99 -Wall -Werror -g -c -o query.o query.cgcc -std=gnu99 -Wall -Werror -g -c -o page.o page.cgcc -std=gnu99 -Wall -Werror -g -c -o reln.o reln.cgcc -std=gnu99 -Wall -Werror -g -c -o tuple.o tuple.cgcc -std=gnu99 -Wall -Werror -g -c -o util.o util.cgcc -std=gnu99 -Wall -Werror -g -c -o tsig.o tsig.cgcc -std=gnu99 -Wall -Werror -g -c -o psig.o psig.cgcc -std=gnu99 -Wall -Werror -g -c -o bsig.o bsig.cgcc -std=gnu99 -Wall -Werror -g -c -o hash.o hash.cgcc -std=gnu99 -Wall -Werror -g -c -o bits.o bits.cgcc -std=gnu99 -Wall -Werror -g -c -o create.o create.cgcc -o create create.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.o -lmgcc -std=gnu99 -Wall -Werror -g -c -o insert.o insert.cgcc insert.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.o -o insertgcc -std=gnu99 -Wall -Werror -g -c -o select.o select.cgcc select.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.o -o selectgcc -std=gnu99 -Wall -Werror -g -c -o stats.o stats.cgcc stats.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.o -o statsgcc -std=gnu99 -Wall -Werror -g -c -o gendata.o gendata.cgcc -o gendata gendata.o util.o -lmgcc -std=gnu99 -Wall -Werror -g -c -o dump.o dump.cgcc dump.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.o -o dumpgcc -std=gnu99 -Wall -Werror -g -c -o x1.o x1.cgcc -o x1 x1.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.ogcc -std=gnu99 -Wall -Werror -g -c -o x2.o x2.cgcc -o x2 x2.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.ogcc -std=gnu99 -Wall -Werror -g -c -o x3.o x3.cgcc -o x3 x3.o query.o page.o reln.o tuple.o util.o tsig.o psig.o bsig.o hash.o bits.oThis should not produce any errors on the CSE servers; let me know ASAP if this is not the case.The gendata command should work completely without change. For example, the following command generates 5 tuples, each ofwhich has 4 attributes. Values in the first attribute are unique; values in the second attribute are highly likely to be unique. Note thatthe third and fourth attributes cycle through values at different rates, so they won't always have the same number.$ ./gendata 5 41000000,lrfkQyuQFjKXyQVNRTyS,a3-000,a4-000 -> 01000001,FrzrmzlYGFvEulQfpDBH,a3-001,a4-001 -> 01000002,lqDqrrCRwDnXeuOQqekl,a3-002,a4-002 -> 01000003,AITGDPHCSPIjtHbsFyfv,a3-003,a4-003 -> 01000004,lADzPBfudkKlrwqAOzMi,a3-004,a4-004 -> 0The create command itself is complete, but some of the functions it calls are not complete. It will allow you to make an emptyrelation, although without a complete bit-slice file (you add this as one of the assignment tasks). The stats command is complete andcan display information about a relation. Using these commands, you could do the following: use the create command to create anempty relation which can hold 4-attribute tuples and able to index up to 5000 tuples (using bit-slices), with a false match probability of1/1000. The stats command then displays the generated parameter values.$ ./create R simc 5000 4 1000$ ./stats RGlobal Info:Dynamic:2021/4/4 COMP9315 21T1 - Assignment 2https://cgi.cse.unsw.edu.au/~... 7/13
...