COM 1100 Fall 2000 -- Prof. Futrelle -- Lab 7 Directions

Accessing files to answer queries

Lab date: Wednesday, November 15th. Due Tuesday the 21st by noon.
These directions were posted Saturday, November 11th.
NEW --> Example file given out in class and emailed 11/20.
And here's the small text file used, f1.txt.

Goal of Lab #7:

To build a system that can answer questions ("queries") about a moderate-sized file containing structured information by reading the file and searching the resulting strings. Two files are provided for you to choose from. This lab will also help you with Lab #8 on the 29th, when you'll be reading and writing large quantities of data and placing them in arrays for manipulation and later output.

The information in the files: The first file contains information on 150 famous computer games in a simple format that I created by converting an HTML file. The second file is more challenging and is offered as an alternative to people who would like to deal directly with an HTML-formatted file. It contains links to some 2000 classical music sites.

The games file structure: Each entry in the file is "tagged" by two-character tags that consist of a "#" followed by an upper case letter for the tag type. The first line of the file starts with "#H", indicating it is the header line containing the title. The second starts with "#U" and contains the URL from which I got the original file. The 150 lines that follow have the structure shown by this example:

#N4. #TRed Baron #PSierra  #D1990

Here "#N" indicates the game number, "#T" the title, "#P" the publisher and "#D" the date. Following some of the entries are comment blocks whose first line begins with a "#C". The comment block may continue for a number of lines that carry no "#" tag at all, but every comment ends with a "#E" tag. Here is the entry for "Doom" followed by its comment block:

#N5. #TDoom #Pid Software  #D1993
#CSimply the best action game of all time. Even though DOOM wasn't
true 3D, it transformed the way everyone thought about the PC as a fast gaming
machine. If you want to see us rhapsodize some more, check out this month's
Hall of Fame.#E
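
The tag structure above suggests one way to pull an entry apart; the following is only a sketch, not a required design. The function names extractField and readComment are made up for illustration, and the code assumes that no title or publisher itself contains a "#" character, which you should check against the file.

```cpp
#include <sstream>   // also provides std::istream and std::getline for streams
#include <string>

// Return the text that follows `tag` (for example "#T") in `line`, up to the
// next '#' or the end of the line, with surrounding blanks trimmed.
// Returns an empty string when the tag is not present.
std::string extractField(const std::string& line, const std::string& tag) {
    std::string::size_type start = line.find(tag);
    if (start == std::string::npos) return "";
    start += tag.size();                                  // skip the tag itself
    std::string::size_type end = line.find('#', start);   // start of the next tag
    if (end == std::string::npos) end = line.size();
    std::string field = line.substr(start, end - start);

    // Trim blanks; the file pads some fields with extra spaces.
    std::string::size_type first = field.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    std::string::size_type last = field.find_last_not_of(" \t\r");
    return field.substr(first, last - first + 1);
}

// Collect a comment block. `firstLine` is the line that begins with "#C";
// further lines are read from `in` until the closing "#E" tag has been seen.
std::string readComment(std::istream& in, const std::string& firstLine) {
    std::string text = firstLine.substr(2);               // drop the leading "#C"
    std::string line;
    while (text.find("#E") == std::string::npos && std::getline(in, line)) {
        text += "\n" + line;
    }
    std::string::size_type end = text.find("#E");
    if (end != std::string::npos) text.erase(end);        // drop "#E" and what follows
    return text;
}
```

For the Red Baron line shown earlier, extractField(line, "#T") gives "Red Baron" and extractField(line, "#N") gives "4." with the period still attached, so you may want to strip it before printing the game number.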

The games file is 383 lines long and contains 20300 characters (bytes). You can access it here. You'll need to save this file as text from your browser and place it in your project folder so you can easily access it for file input. It is relatively small and will easily fit onto a floppy. And you can access a copy of the original HTML file here.

The classical music file structure: This file can be used as an alternative to the games file. It is more challenging in that it has all of the original HTML tagging, which is not as simple as the tagging I've used above. You will need to save the HTML source of the file and examine it to see what its structure is and how you can develop a program to answer queries about it. You are welcome to edit the file slightly to remove the header information and anything else that is not one of its entries. Here is the file. It is about 200K in size, so it will also fit easily onto a floppy. It is 7000 lines long and has about 2000 entries, which are links to classical music sites.

What you are to do:

What you are to do is best described by some examples. The user should be given a few lines of instructions, which will be printed whenever the user enters "h", "H" or "?". For the games file, entering a tag letter followed by a string will cause the system to read through the file and print out matching entries. Thus, entering "T Wing Commander" will return the entry for the game with the title (name) "Wing Commander" in an easy-to-read format, e.g.,

Game number: 7
Title: Wing Commander 
Publisher: Origin 
Date: 1990

Note that the titles may contain blanks, so you must use getline() to input them. If there are comments attached to the game they should be printed out also. When a title string is matched the output should include all entries that contain it. For example, the query "T Star" should return the six entries whose titles contain the string "Star". Queries should be allowed for publishers and dates also and these will typically produce multiple entries. The final command a user can enter is "Q" which will cause the system to quit. Since some of these entries are long, they will overflow your output window. Therefore you must arrange for the queries and the results to also be written to a file at the same time you write them to the screen. This is exactly what files are for, to deal with large amounts of data. Of course their other purpose is to store data permanently, even when the power is off or the computer crashes.

For the classical file (alternate assignment) there are two primary fields, the URL and the text labelling the URL. Occasionally, there is more text following the link tag, </A>. Most of the searches will be done on the text inside and outside of the link, not in the URL itself. You could allow queries on the URLs, for example, to find UK sites or .edu sites.
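
For the alternate assignment, the sketch below shows one possible way to separate the URL, the link text, and any trailing text. It assumes each entry sits on a single line in the form <A HREF="...">label</A>, with the URL in double quotes; examine the real file before relying on that. The names Link, toLower and parseLink are hypothetical.

```cpp
#include <cctype>
#include <string>

struct Link {
    std::string url;     // value of the HREF attribute
    std::string label;   // text between <A ...> and </A>
    std::string extra;   // any text after </A> on the same line
    bool found;          // false when the line holds no complete link
};

// Lower-case copy, so that tag searches ignore case (<A HREF> or <a href>).
std::string toLower(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

Link parseLink(const std::string& line) {
    Link link;
    link.found = false;
    const std::string lower = toLower(line);   // same length, same positions

    std::string::size_type href = lower.find("href=\"");
    if (href == std::string::npos) return link;
    std::string::size_type urlStart = href + 6;            // just past href="
    std::string::size_type urlEnd = lower.find('"', urlStart);
    if (urlEnd == std::string::npos) return link;
    std::string::size_type tagEnd = lower.find('>', urlEnd);
    if (tagEnd == std::string::npos) return link;
    std::string::size_type close = lower.find("</a>", tagEnd);
    if (close == std::string::npos) return link;

    // Positions were found in the lower-case copy; cut from the original line.
    link.url = line.substr(urlStart, urlEnd - urlStart);
    link.label = line.substr(tagEnd + 1, close - tagEnd - 1);
    link.extra = line.substr(close + 4);                   // 4 == length of "</a>"
    link.found = true;
    return link;
}
```

A query for UK sites could then test link.url.find(".uk") instead of searching the label text.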

Your program will make extensive use of the string functions described in Sec. 3.7 of your textbook. This will include string matching to search the lines of the file as you read them in with getline(), as well as extracting parts of the strings between tags in order to get the parts you need for your output. As always, define various functions to help organize the various input, search, formatting and output operations. The book does not give extensive examples of using string functions, but there are many examples on the web, for example, some notes from Milwaukee, and an Italian site.
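
As a small illustration of searching lines while reading them with getline(), here is one possible arrangement. It assumes the games file layout and a case-sensitive match; the names fieldContains and matchingEntries are made up for this sketch. The search is limited to the text of one tagged field, so that a title query does not accidentally match a publisher.

```cpp
#include <sstream>   // also provides std::istream and std::getline for streams
#include <string>
#include <vector>

// True when the field introduced by `tag` (e.g. "#T") contains `needle`.
// The field runs from just after the tag to the next '#' or the end of line.
bool fieldContains(const std::string& line, const std::string& tag,
                   const std::string& needle) {
    std::string::size_type start = line.find(tag);
    if (start == std::string::npos) return false;          // tag not on this line
    start += tag.size();
    std::string::size_type end = line.find('#', start);
    std::string field = (end == std::string::npos)
                            ? line.substr(start)
                            : line.substr(start, end - start);
    return field.find(needle) != std::string::npos;
}

// Read `in` line by line and keep every line whose tagged field matches.
std::vector<std::string> matchingEntries(std::istream& in, const std::string& tag,
                                         const std::string& needle) {
    std::vector<std::string> matches;
    std::string line;
    while (std::getline(in, line)) {
        if (fieldContains(line, tag, needle)) matches.push_back(line);
    }
    return matches;
}
```

Because matchingEntries takes a std::istream, the same function works on an std::ifstream opened on the games file and on an std::istringstream holding a few test lines.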

As I've said before, always include some samples of your system in operation, input and output lines, in your source program in a comment block at the end, to make your source code a fully self-contained explanatory document. You may also include listings of the longer files you write as output that logs the user's input and your program's output. Include a few of these files on your floppy also. Remember, what you hand in should "tell a story" -- it should say who you are, the class, the date, the assignment (a URL for it wouldn't hurt), what you did, how you did it, what you got, and what it all means.

EXTRA CREDIT: Allow the user to enter queries whose words are connected by AND or OR. An AND query finds entries that contain both words; an OR query finds entries that contain either word (or both). These logical connectives will not require the words to be adjacent or in any particular order.
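
One possible approach to the connectives, offered only as a sketch: split the query at " OR " first and at " AND " second, so that AND binds more tightly than OR. That precedence, the requirement that the connectives be typed in upper case with blanks around them, and the name matchesQuery are all my assumptions for this example.

```cpp
#include <string>

// True when `text` satisfies `query`. The query is either a plain search
// string or several search strings joined by " AND " and " OR ".
bool matchesQuery(const std::string& text, const std::string& query) {
    // Split at the first OR: either side may satisfy the query.
    std::string::size_type pos = query.find(" OR ");
    if (pos != std::string::npos) {
        return matchesQuery(text, query.substr(0, pos)) ||
               matchesQuery(text, query.substr(pos + 4));
    }
    // No OR left; split at the first AND: both sides must match.
    pos = query.find(" AND ");
    if (pos != std::string::npos) {
        return matchesQuery(text, query.substr(0, pos)) &&
               matchesQuery(text, query.substr(pos + 5));
    }
    // A plain search string: a simple substring test, in any position.
    return text.find(query) != std::string::npos;
}
```

Since each word is looked up separately with find(), the words may appear anywhere in the entry and in any order, as the extra credit requires.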