Text processing: Extracting code listings



Text processing

If you come from a C or C++ background, you might be skeptical at first of Java’s power when it comes to handling text. Indeed, one drawback is that execution speed is slower and that could hinder some of your efforts. However, the tools (in particular the String class) are quite powerful, as the examples in this section show (and performance improvements have been promised for Java).

As you’ll see, these examples were created to solve problems that arose in the creation of this book. However, they are not restricted to that and the solutions they offer can easily be adapted to other situations. In addition, they show the power of Java in an area that has not previously been emphasized in this book.

Extracting code listings

You’ve no doubt noticed that each complete code listing (not code fragment) in this book begins and ends with special comment tag marks ‘//:’ and ‘///:~’. This meta-information is included so that the code can be automatically extracted from the book into compilable source-code files. In my previous book, I had a system that allowed me to automatically incorporate tested code files into the book. In this book, however, I discovered that it was often easier to paste the code into the book once it was initially tested and, since it’s hard to get right the first time, to perform edits to the code within the book. But how to extract it and test the code? This program is the answer, and it could come in handy when you set out to solve a text processing problem. It also demonstrates many of the String class features.

I first save the entire book in ASCII text format into a separate file. The CodePackager program has two modes (which you can see described in usageString): if you use the -p flag, it expects to see an input file containing the ASCII text from the book. It will go through this file and use the comment tag marks to extract the code, and it uses the file name on the first line to determine the name of the file. In addition, it looks for the package statement in case it needs to put the file into a special directory (chosen via the path indicated by the package statement).

But that’s not all. It also watches for the change in chapters by keeping track of the package names. Since all packages for each chapter begin with c02, c03, c04, etc. to indicate the chapter where they belong (except for those beginning with com, which are ignored for the purpose of keeping track of chapters), as long as the first listing in each chapter contains a package statement with the chapter number, the CodePackager program can keep track of when the chapter changed and put all the subsequent files in the new chapter subdirectory.

As each file is extracted, it is placed into a SourceCodeFile object that is then placed into a collection. (This process will be more thoroughly described later.) These SourceCodeFile objects could simply be stored in files, but that brings us to the second use for this project. If you invoke CodePackager without the -p flag it expects a “packed” file as input, which it will then extract into separate files. So the -p flag means that the extracted files will be found “packed” into this single file.

Why bother with the packed file? Because different computer platforms have different ways of storing text information in files. A big issue is the end-of-line character or characters, but other issues can also exist. However, Java has a special type of IO stream – the DataOutputStream – which promises that, regardless of what machine the data is coming from, the storage of that data will be in a form that can be correctly retrieved by any other machine by using a DataInputStream. That is, Java handles all of the platform-specific details, which is a large part of the promise of Java. So the -p flag stores everything into a single file in a universal format. You download this file and the Java program from the Web, and when you run CodePackager on this file without the -p flag the files will all be extracted to appropriate places on your system. (You can specify an alternate subdirectory; otherwise the subdirectories will just be created in the current directory.) To ensure that no system-specific formats remain, File objects are used everywhere a path or a file is described. In addition, there’s a sanity check: an empty file is placed in each subdirectory; the name of that file indicates how many files you should find in that subdirectory.
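The round trip that DataOutputStream and DataInputStream promise can be sketched in a few lines. This is only an illustration under stated assumptions: the class name PortableRoundTrip and its methods are hypothetical, the data goes to an in-memory byte array rather than a file, and writeUTF( )/readUTF( ) are used here for brevity even though the listing in this section relies on writeBytes( ).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of CodePackager.
class PortableRoundTrip {
    // Writes a file name and its contents in DataOutputStream's
    // machine-independent format and returns the resulting bytes.
    static byte[] pack(String name, String contents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);      // limited to 65535 encoded bytes
            out.writeUTF(contents);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the two strings back in the same order they were written.
    static String[] unpack(byte[] data) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            String contents = in.readUTF();
            return new String[] { name, contents };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] back = unpack(pack("Hello.java", "class Hello {}\n"));
        System.out.println(back[0]); // prints "Hello.java"
    }
}
```

The bytes produced by pack( ) have the same layout on every platform, which is the property the packed file depends on.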

Here is the code, which will be described in detail at the end of the listing:


// 'Packs' and 'unpacks' the code in 'Thinking

// in Java' for cross-platform distribution.

/* Commented so CodePackager sees it and starts

a new chapter directory, but so you don't

have to worry about the directory where this

program lives:

package c17;

*/


import java.util.*;


class Pr


class IO catch(IOException e)

return in;


static BufferedReader disOpen(String fname)

static DataOutputStream dosOpen(File f) catch(IOException e)

return in;


static DataOutputStream dosOpen(String fname)

static PrintWriter psOpen(File f) catch(IOException e)

return in;


static PrintWriter psOpen(String fname)

static void close(Writer os) catch(IOException e)


static void close(DataOutputStream os) catch(IOException e)


static void close(Reader os) catch(IOException e)



class SourceCodeFile ///:~', // End of source

endMarker2 = "}; ///:~", // C++ file end

beginContinue = "} ///:Continued",

endContinue = "///:Continuing",

packMarker = "###", // Packed file header tag

eol = // Line separator on current system


filesep = // System's file path separator


public static String copyright = "";

static catch(Exception e)


private String filename, dirname,

contents = new String();

private static String chapter = "c02";

// The file name separator from the old system:

public static String oldsep;

public String toString()

// Constructor for parsing from document file:

public SourceCodeFile(String firstLine,

BufferedReader in)

// Convert package name to path name:

pdir = pdir.replace(

'.', filesep.charAt(0));

System.out.println("package " + pdir);

dirname = pdir;


contents += s + eol;

// Move past continuations:


while((s = in.readLine()) != null)


// Watch for end of code listing:

if(s.startsWith(endMarker) ||





"End marker not found before EOF");

System.out.println("Chapter: " + chapter);

} catch(IOException e)


// For recovering from a packed file:

public SourceCodeFile(BufferedReader pFile)

contents += s + eol;


} catch(IOException e)


public boolean hasFile()

public String directory()

public String filename()

public String contents()

// To write to a packed file:

public void writePacked(DataOutputStream out) catch(IOException e)


// To generate the actual file:

public void writeFile(String rootpath)


class DirMap

DirMap(String alternateDir)

public void add(SourceCodeFile f)

public void writePackedFile(String fname) catch(IOException e)

Enumeration e = t.keys();





// Write all the files in their directories:

public void write()

// Add file indicating file quantity

// written to this directory as a check:


new File(new File(rootpath, dir),





public class CodePackager

public static void main(String[] args)



private static String currentLine;

private static BufferedReader in;

private static DirMap dm;

private static void

createPackedFile(String[] args)

else if(currentLine.startsWith(


Pr.error("file has no start marker");

// Else ignore the input line


} catch(IOException e)




private static void

extractPackedFile(String[] args) catch(IOException e)

// Capture the separator used in the system

// that packed the file:

if(s.indexOf("###Old Separator:") != -1 )

SourceCodeFile sf = new SourceCodeFile(in);




} ///:~

You’ll first notice the package statement that is commented out. Since this is the first program in the chapter, the package statement is necessary to tell CodePackager that the chapter has changed, but putting it in a package would be a problem. When you create a package, you tie the resulting program to a particular directory structure, which is fine for most of the examples in this book. Here, however, the CodePackager program must be compiled and run from an arbitrary directory, so the package statement is commented out. It will still look like an ordinary package statement to CodePackager, though, since the program isn’t sophisticated enough to detect multi-line comments. (It has no need for such sophistication, a fact that comes in handy here.)

The first two classes are support/utility classes designed to make the rest of the program more consistent to write and easier to read. The first, Pr, is similar to the ANSI C library perror, since it prints an error message (but also exits the program). The second class encapsulates the creation of files, a process that was shown in Chapter 10 as one that rapidly becomes verbose and annoying. In Chapter 10, the proposed solution created new classes, but here static method calls are used. Within those methods the appropriate exceptions are caught and dealt with. These methods make the rest of the code much cleaner to read.

The first class that helps solve the problem is SourceCodeFile, which represents all the information (including the contents, file name, and directory) for one source code file in the book. It also contains a set of String constants representing the markers that start and end a file, a marker used inside the packed file, the current system’s end-of-line separator and file path separator (notice the use of System.getProperty( ) to get the local version), and a copyright notice, which is extracted from the following file Copyright.txt.

// Copyright (c) Bruce Eckel, 1998

// Source code file from the book 'Thinking in Java'

// All rights reserved EXCEPT as allowed by the

// following statements: You may freely use this file

// for your own work (personal or commercial),

// including modifications and distribution in

// executable form only. Permission is granted to use

// this file in classroom situations, including its

// use in presentation materials, as long as the book

// 'Thinking in Java' is cited as the source.

// Except in classroom situations, you may not copy

// and distribute this code; instead, the sole

// distribution point is

// (and official mirror sites) where it is

// freely available. You may not remove this

// copyright and notice. You may not distribute

// modified versions of the source code in this

// package. You may not use this file in printed

// media without the express permission of the

// author. Bruce Eckel makes no representation about

// the suitability of this software for any purpose.

// It is provided 'as is' without express or implied

// warranty of any kind, including any implied

// warranty of merchantability, fitness for a

// particular purpose or non-infringement. The entire

// risk as to the quality and performance of the

// software is with you. Bruce Eckel and the

// publisher shall not be liable for any damages

// suffered by you or any third party as a result of

// using or distributing software. In no event will

// Bruce Eckel or the publisher be liable for any

// lost revenue, profit, or data, or for direct,

// indirect, special, consequential, incidental, or

// punitive damages, however caused and regardless of

// the theory of liability, arising out of the use of

// or inability to use software, even if Bruce Eckel

// and the publisher have been advised of the

// possibility of such damages. Should the software

// prove defective, you assume the cost of all

// necessary servicing, repair, or correction. If you

// think you've found an error, please email all

// modified files with clearly commented changes to:

// (please use the same

// address for non-code errors found in the book).

When extracting files from a packed file, the file separator of the system that packed the file is also noted, so it can be replaced with the correct one for the local system.

The subdirectory name for the current chapter is kept in the field chapter, which is initialized to c02. (You’ll notice that the listing in Chapter 2 doesn’t contain a package statement.) The only time that the chapter field changes is when a package statement is discovered in the current file.
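The chapter-tracking rule is small enough to sketch on its own. The following is a minimal illustration, assuming the package name has already been pulled out of the package statement; the class ChapterTracker and the method chapterFor( ) are hypothetical names, not taken from the listing.

```java
// Hypothetical helper illustrating how the chapter might be tracked.
class ChapterTracker {
    // Returns the chapter subdirectory to use after seeing a package
    // name. Packages beginning with "com" leave the chapter unchanged;
    // otherwise the first component (c02, c03, ...) becomes the chapter.
    static String chapterFor(String packageName, String current) {
        if (packageName.startsWith("com")) {
            return current;
        }
        int firstDot = packageName.indexOf('.');
        return firstDot == -1
            ? packageName
            : packageName.substring(0, firstDot);
    }

    public static void main(String[] args) {
        String chapter = "c02"; // same starting value as the text
        chapter = chapterFor("c05.tools", chapter);
        System.out.println(chapter); // prints "c05"
        chapter = chapterFor("com.bruceeckel.tools", chapter);
        System.out.println(chapter); // prints "c05"
    }
}
```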

Building a packed file

The first constructor is used to extract a file from the ASCII text version of this book. The calling code (which appears further down in the listing) reads each line in until it finds one that matches the beginning of a listing. At that point, it creates a new SourceCodeFile object, passing it the first line (which has already been read by the calling code) and the BufferedReader object from which to extract the rest of the source code listing.

At this point, you begin to see heavy use of the String methods. To extract the file name, the overloaded version of substring( ) is called that takes the starting offset and goes to the end of the String. This starting index is produced by finding the length( ) of the startMarker. trim( ) removes white space from both ends of the String. The first line can also have words after the name of the file; these are detected using indexOf( ), which returns -1 if it cannot find the character you’re looking for, and otherwise returns the index of the first instance of that character. Notice there is also an overloaded version of indexOf( ) that takes a String instead of a character.
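Here is a minimal sketch of that first-line parsing, assuming the line really does begin with the start marker. FirstLineParser and fileNameFrom( ) are hypothetical names used only for illustration.

```java
// Hypothetical helper showing substring( ), trim( ) and indexOf( ).
class FirstLineParser {
    // The listing start marker described in the text.
    static final String START_MARKER = "//:";

    // Returns the file name that follows the marker, dropping any
    // extra words that appear after it on the same line.
    static String fileNameFrom(String firstLine) {
        // substring(int) runs from the offset to the end of the String.
        String rest =
            firstLine.substring(START_MARKER.length()).trim();
        // indexOf(char) returns -1 when the character is absent.
        int space = rest.indexOf(' ');
        return space == -1 ? rest : rest.substring(0, space);
    }

    public static void main(String[] args) {
        // prints "CodePackager.java"
        System.out.println(fileNameFrom("//: CodePackager.java"));
        // prints "Hello.java"
        System.out.println(fileNameFrom("//: Hello.java  a note"));
    }
}
```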

Once the file name is parsed and stored, the first line is placed into the contents String (which is used to hold the entire text of the source code listing). At this point, the rest of the lines are read and concatenated into the contents String. It’s not quite that simple, since certain situations require special handling. One case is error checking: if you run into a startMarker, it means that no end marker was placed at the end of the listing that’s currently being collected. This is an error condition that aborts the program.

The second special case is the package keyword. Although Java is a free-form language, this program requires that the package keyword be at the beginning of the line. When the package keyword is seen, the package name is extracted by looking for the space at the beginning and the semicolon at the end. (Note that this could also have been performed in a single operation by using the overloaded substring( ) that takes both the starting and ending indexes.) Then the dots in the package name are replaced by the file separator, although an assumption is made here that the file separator is only one character long. This is probably true on all systems, but it’s a place to look if there are problems.
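The package-to-directory conversion might look like the sketch below. It uses the two-index substring( ) mentioned in the parenthetical note and takes the separator as a parameter so the result is predictable; PackagePath and toDirectory( ) are hypothetical names.

```java
import java.io.File;

// Hypothetical helper: turns a package statement into a directory.
class PackagePath {
    // Returns the directory for a line such as "package c17.util;",
    // or null if the line does not begin with the package keyword.
    static String toDirectory(String line, char separator) {
        if (!line.startsWith("package ")) {
            return null;
        }
        int start = line.indexOf(' ');
        int end = line.indexOf(';');
        if (end == -1 || end < start) {
            return null; // no terminating semicolon on this line
        }
        // substring(begin, end) extracts the name in one operation.
        String name = line.substring(start + 1, end).trim();
        // Like the text, this assumes a one-character separator.
        return name.replace('.', separator);
    }

    public static void main(String[] args) {
        // prints "c17/util"
        System.out.println(toDirectory("package c17.util;", '/'));
        // Uses the local separator, so the output varies by platform.
        System.out.println(
            toDirectory("package c17.util;", File.separatorChar));
    }
}
```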

The default behavior is to concatenate each line to contents, along with the end-of-line string, until the endMarker is discovered, which indicates that the constructor should terminate. If the end of the file is encountered before the endMarker is seen, that’s an error.
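The collection loop can be sketched as follows. This is a simplified illustration, not the constructor from the listing: it reports errors by throwing exceptions instead of exiting the program, always joins lines with '\n' rather than the system's end-of-line string, and treats any line containing ///:~ as the end of the listing. ListingCollector and its methods are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper that gathers one listing from a reader.
class ListingCollector {
    static final String START_MARKER = "//:";
    static final String END_MARKER = "///:~";

    // firstLine has already been read by the caller; the rest of the
    // listing is read from in, up to and including the end marker.
    static String collect(String firstLine, BufferedReader in)
            throws IOException {
        StringBuilder contents = new StringBuilder(firstLine).append('\n');
        String s;
        while ((s = in.readLine()) != null) {
            if (s.indexOf(END_MARKER) != -1) {
                return contents.append(s).append('\n').toString();
            }
            if (s.startsWith(START_MARKER)) {
                throw new IllegalStateException(
                    "No end marker before the next listing");
            }
            contents.append(s).append('\n');
        }
        throw new IllegalStateException(
            "End marker not found before EOF");
    }

    // Convenience wrapper for trying the sketch on an in-memory String.
    static String collectFromText(String firstLine, String rest) {
        try {
            return collect(firstLine,
                new BufferedReader(new StringReader(rest)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(collectFromText("//: A.java",
            "class A {\n} ///:~\nText after the listing\n"));
    }
}
```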

Extracting from a packed file

The second constructor is used to recover the source code files from a packed file. Here, the calling method doesn’t have to worry about skipping over the intermediate text. The file contains all the source-code files, placed end-to-end. All you need to hand to this constructor is the BufferedReader where the information is coming from, and the constructor takes it from there. There is some meta-information, however, at the beginning of each listing, and this is denoted by the packMarker. If the packMarker isn’t there, it means the caller is mistakenly trying to use this constructor where it isn’t appropriate.

Once the packMarker is found, it is stripped off and the directory name (terminated by a ‘#’) and the file name (which goes to the end of the line) are extracted. In both cases, the old separator character is replaced by the one that is current to this machine using the String replace( ) method. The old separator is placed at the beginning of the packed file, and you’ll see how that is extracted later in the listing.
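The header parsing described above can be sketched like this, assuming a header line of the form marker, directory, ‘#’, file name. PackedHeader and parse( ) are hypothetical names, and the separators are passed in as parameters rather than read from static fields.

```java
// Hypothetical helper that splits a packed-file header line.
class PackedHeader {
    static final String PACK_MARKER = "###";

    // Returns { directory, fileName } with the old separator replaced
    // by the local one.
    static String[] parse(String line, char oldSep, char localSep) {
        if (!line.startsWith(PACK_MARKER)) {
            throw new IllegalArgumentException(
                "Not a packed-file header: " + line);
        }
        String rest = line.substring(PACK_MARKER.length());
        int hash = rest.indexOf('#');
        if (hash == -1) {
            throw new IllegalArgumentException(
                "No '#' after directory name: " + line);
        }
        String dir = rest.substring(0, hash).replace(oldSep, localSep);
        String file = rest.substring(hash + 1).replace(oldSep, localSep);
        return new String[] { dir, file };
    }

    public static void main(String[] args) {
        // A header packed on a system that used '\' as its separator.
        String[] parts = parse("###c17\\util#Foo.java", '\\', '/');
        System.out.println(parts[0]); // prints "c17/util"
        System.out.println(parts[1]); // prints "Foo.java"
    }
}
```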

The rest of the constructor is quite simple. It reads and concatenates each line to the contents until the endMarker is found.

Accessing and writing the listings

The next set of methods are simple accessors: directory( ), filename( ) (notice the method can have the same spelling and capitalization as the field) and contents( ), and hasFile( ) to indicate whether this object contains a file or not. (The need for this will be seen later.)

The final three methods are concerned with writing this code listing into a file, either a packed file via writePacked( ) or a Java source file via writeFile( ). All writePacked( ) needs is the DataOutputStream, which was opened elsewhere, and represents the file that’s being written. It puts the header information on the first line and then calls writeBytes( ) to write contents in a “universal” format.

When writing the Java source file, the file must be created. This is done via IO.psOpen( ), handing it a File object that contains not only the file name but also the path. But the question now is: does this path exist? The user has the option of placing all the source code directories into a completely different subdirectory, which might not even exist. So before each file is written, File.mkdirs( ) is called with the path that you want to write the file into. This will make the entire path all at once.
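A minimal sketch of the "make the path, then write the file" step follows. It opens the file with a plain PrintWriter rather than the IO.psOpen( ) helper from the listing, and the names EnsureDirs, writeInto( ) and demo( ) are hypothetical.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical helper: create the directory path, then the file.
class EnsureDirs {
    static File writeInto(File root, String dir, String name,
                          String contents) {
        File path = new File(root, dir);
        // mkdirs( ) creates every missing directory along the path.
        // It returns false if nothing was created, for example when
        // the path already exists, so the result is not checked here.
        path.mkdirs();
        File target = new File(path, name);
        try (PrintWriter out = new PrintWriter(new FileWriter(target))) {
            out.print(contents);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    // Writes one file under the system temp directory, checks that it
    // exists, then removes the file and the directories it created.
    static boolean demo() {
        File root = new File(System.getProperty("java.io.tmpdir"),
            "codepackager-sketch-" + System.nanoTime());
        File written = writeInto(root, "c17" + File.separator + "util",
            "Hello.java", "class Hello {}\n");
        boolean ok = written.isFile() && written.length() > 0;
        for (File f = written;
             f != null && !f.equals(root.getParentFile());
             f = f.getParentFile()) {
            f.delete();
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```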

Containing the entire collection of listings

It’s convenient to organize the listings as subdirectories while the whole collection is being built in memory. One reason is another sanity check: as each subdirectory of listings is created, an additional file is added whose name contains the number of files in that directory.

The DirMap class produces this effect and demonstrates the concept of a “multimap.” This is implemented using a Hashtable whose keys are the subdirectories being created and whose values are Vector objects containing the SourceCodeFile objects in that particular directory. Thus, instead of mapping a key to a single value, the “multimap” maps a key to a set of values via the associated Vector. Although this sounds complex, it’s remarkably straightforward to implement. You’ll see that most of the size of the DirMap class is due to the portions that write to files, not to the “multimap” implementation.

There are two ways you can make a DirMap: the default constructor assumes that you want the directories to branch off of the current one, and the second constructor lets you specify an alternate absolute path for the starting directory.

The add( ) method is where quite a bit of dense action occurs. First, the directory( ) is extracted from the SourceCodeFile you want to add, and then the Hashtable is examined to see if it contains that key already. If not, a new Vector is added to the Hashtable and associated with that key. At this point, the Vector is there, one way or another, and it is extracted so the SourceCodeFile can be added. Because Vectors can be easily combined with Hashtables like this, the power of both is amplified.
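The "multimap" idea fits in a few lines. This sketch stores plain file names instead of SourceCodeFile objects and uses generics, which the original listing could not have used; DirMultiMap and its methods are hypothetical names.

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical multimap: one directory key maps to many file names.
class DirMultiMap {
    private final Hashtable<String, Vector<String>> table =
        new Hashtable<>();

    // Adds a file under its directory, creating the Vector on first use.
    void add(String dir, String file) {
        Vector<String> files = table.get(dir);
        if (files == null) {
            files = new Vector<>();
            table.put(dir, files);
        }
        // One way or another the Vector now exists.
        files.addElement(file);
    }

    // Returns the files for a directory, or null if none were added.
    Vector<String> filesIn(String dir) {
        return table.get(dir);
    }

    public static void main(String[] args) {
        DirMultiMap map = new DirMultiMap();
        map.add("c17", "CodePackager.java");
        map.add("c17", "ClassScanner.java");
        map.add("c02", "HelloDate.java");
        System.out.println(map.filesIn("c17").size()); // prints "2"
    }
}
```

The size of each Vector is also what a per-directory file count, like the sanity-check file described earlier, would be based on.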

Writing a packed file involves opening the file to write (as a DataOutputStream so the data is universally recoverable) and writing the header information about the old separator on the first line. Next, an Enumeration of the Hashtable keys is produced and stepped through to select each directory and to fetch the Vector associated with that directory so each SourceCodeFile in that Vector can be written to the packed file.

Writing the Java source files to their directories in write( ) is almost identical to writePackedFile( ) since both methods simply call the appropriate method in SourceCodeFile. Here, however, the root path is passed into SourceCodeFile.writeFile( ) and when all the files have been written the additional file with the name containing the number of files is also written.

The main program

The previously described classes are used within CodePackager. First you see the usage string that gets printed whenever the end user invokes the program incorrectly, along with the usage( ) method that calls it and exits the program. All main( ) does is determine whether you want to create a packed file or extract from one, then it ensures the arguments are correct and calls the appropriate method.

When a packed file is created, it’s assumed to be made in the current directory, so the DirMap is created using the default constructor. After the file is opened each line is read and examined for particular conditions:

If the line starts with the starting marker for a source code listing, a new SourceCodeFile object is created. The constructor reads in the rest of the source listing. The handle that results is directly added to the DirMap.

If the line starts with the end marker for a source code listing, something has gone wrong, since end markers should be found only by the SourceCodeFile constructor.

When extracting a packed file, the extraction can be into the current directory or into an alternate directory, so the DirMap object is created accordingly. The file is opened and the first line is read. The old file path separator information is extracted from this line. Then the input is used to create the first SourceCodeFile object, which is added to the DirMap. New SourceCodeFile objects are created and added as long as they contain a file. (The last one created will simply return when it runs out of input and then hasFile( ) will return false.)

Checking capitalization style

Although the previous example can come in handy as a guide for some project of your own that involves text processing, this project will be directly useful because it performs a style check to make sure that your capitalization conforms to the de-facto Java style. It opens each .java file in the current directory and extracts all the class names and identifiers, then shows you if any of them don’t meet the Java style.

For the program to operate correctly, you must first build a class name repository to hold all the class names in the standard Java library. You do this by moving into all the source code subdirectories for the standard Java library and running ClassScanner in each subdirectory. Provide as arguments the name of the repository file (using the same path and name each time) and the -a command-line option to indicate that the class names should be added to the repository.

To use the program to check your code, run it and hand it the path and name of the repository to use. It will check all the classes and identifiers in the current directory and tell you which ones don’t follow the typical Java capitalization style.

You should be aware that the program isn’t perfect; there are a few times when it will point out what it thinks is a problem but on looking at the code you’ll see that nothing needs to be changed. This is a little annoying, but it’s still much easier than trying to find all these cases by staring at your code.

The explanation immediately follows the listing:


// Scans all files in directory for classes

// and identifiers, to check capitalization.

// Assumes properly compiling code listings.

// Doesn't do everything right, but is a very

// useful aid.


import java.util.*;

class MultiStringMap extends Hashtable

public Vector getVector(String key)

return (Vector)get(key);


public void printValues(PrintStream p)



public class ClassScanner


void scanListing(String fname)

if(in.sval.equals("import") ||



else // It's an identifier or keyword

identMap.add(fname, in.sval);



} catch(IOException e)


void discardLine() catch(IOException e)


// StreamTokenizer's comment removal seemed

// to be broken. This extracts them:

void eatComments()


} catch(IOException e)


public String[] classNames()

public void checkClassNames()



public void checkIdentNames()






static final String usage =

"Usage: \n" +

"ClassScanner classnames -a\n" +

"\tAdds all the class names in this \n" +

"\tdirectory to the repository file \n" +

"\tcalled 'classnames'\n" +

"ClassScanner classnames\n" +

"\tChecks all the java files in this \n" +

"\tdirectory for capitalization errors, \n" +

"\tusing the repository file 'classnames'";

private static void usage()

public static void main(String[] args) catch(IOException e)


if(args.length == 1)

// Write the class names to a repository:

if(args.length == 2) catch(IOException e)




class JavaFilter implements FilenameFilter

} ///:~

The class MultiStringMap is a tool that allows you to map a group of strings onto each key entry. As in the previous example, it uses a Hashtable (this time with inheritance) with the key as the single string that’s mapped onto the Vector value. The add( ) method simply checks to see if there’s a key already in the Hashtable, and if not it puts one there. The getVector( ) method produces a Vector for a particular key, and printValues( ), which is primarily useful for debugging, prints out all the values Vector by Vector.

To keep life simple, the class names from the standard Java libraries are all put into a Properties object (from the standard Java library). Remember that a Properties object is a Hashtable that holds only String objects for both the key and value entries. However, it can be saved to disk and restored from disk in one method call, so it’s ideal for the repository of names. Actually, we need only a list of names, and a Hashtable can’t accept null for either its key or its value entry. So the same object will be used for both the key and the value.
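Using a Properties object as a saved list of names might look like the sketch below. It is an assumption-laden illustration: the data is written to an in-memory byte array rather than a repository file, it uses the store( ) and load( ) methods found in current JDKs, and NameRepository and roundTrip( ) are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: a Properties object used as a set of names.
class NameRepository {
    // Stores each name as both key and value, writes the list out,
    // and returns a fresh Properties object loaded from those bytes.
    static Properties roundTrip(String... names) {
        Properties classes = new Properties();
        for (String name : names) {
            classes.put(name, name); // null values are not allowed
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            // The second argument becomes a header comment.
            classes.store(bytes, "Class names");
            Properties loaded = new Properties();
            loaded.load(new ByteArrayInputStream(bytes.toByteArray()));
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties p = roundTrip("String", "Vector", "Hashtable");
        System.out.println(p.containsKey("Vector")); // prints "true"
        System.out.println(p.containsKey("vector")); // prints "false"
    }
}
```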

For the classes and identifiers that are discovered for the files in a particular directory, two MultiStringMaps are used: classMap and identMap. Also, when the program starts up it loads the standard class name repository into the Properties object called classes, and when a new class name is found in the local directory that is also added to classes as well as to classMap. This way, classMap can be used to step through all the classes in the local directory, and classes can be used to see if the current token is a class name (which indicates a definition of an object or method is beginning, so grab the next tokens – until a semicolon – and put them into identMap).

The default constructor for ClassScanner creates a list of file names (using the JavaFilter implementation of FilenameFilter, as described in Chapter 10). Then it calls scanListing( ) for each file name.

Inside scanListing( ) the source code file is opened and turned into a StreamTokenizer. In the documentation, passing true to slashStarComments( ) and slashSlashComments( ) is supposed to strip those comments out, but this seems to be a bit flawed (it doesn’t quite work in Java 1.0). Instead, those lines are commented out and the comments are extracted by another method. To do this, the ‘/’ must be captured as an ordinary character rather than letting the StreamTokenizer absorb it as part of a comment, and the ordinaryChar( ) method tells the StreamTokenizer to do this. This is also true for dots (‘.’), since we want to have the method calls pulled apart into individual identifiers. However, the underscore, which is ordinarily treated by StreamTokenizer as an individual character, should be left as part of identifiers since it appears in such static final values as TT_EOF etc., used in this very program. The wordChars( ) method takes a range of characters you want to add to those that are left inside a token that is being parsed as a word. Finally, when parsing for one-line comments or discarding a line we need to know when an end-of-line occurs, so by calling eolIsSignificant(true) the eol will show up rather than being absorbed by the StreamTokenizer.
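The effect of those StreamTokenizer settings is easiest to see on a single line of input. The following sketch applies the same four calls and collects only the word tokens; TokenizerSetup and words( ) are hypothetical names, and the input comes from a String instead of a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing the tokenizer configuration.
class TokenizerSetup {
    static List<String> words(String source) {
        StreamTokenizer in =
            new StreamTokenizer(new StringReader(source));
        in.ordinaryChar('/');      // don't treat '/' as a comment start
        in.ordinaryChar('.');      // split a.b into a, '.', b
        in.wordChars('_', '_');    // keep '_' inside identifiers
        in.eolIsSignificant(true); // report end-of-line as a token
        List<String> result = new ArrayList<>();
        try {
            while (in.nextToken() != StreamTokenizer.TT_EOF) {
                if (in.ttype == StreamTokenizer.TT_WORD) {
                    result.add(in.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "[in, ttype, StreamTokenizer, TT_EOF]"
        System.out.println(
            words("in.ttype == StreamTokenizer.TT_EOF;"));
    }
}
```

Without the ordinaryChar('.') call the dotted expressions would come back as single words, and without wordChars('_', '_') the constant would be split at the underscore.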

The rest of scanListing( ) reads and reacts to tokens until the end of the file, signified when nextToken( ) returns the final static value StreamTokenizer.TT_EOF.

If the token is a / it is potentially a comment, so eatComments( ) is called to deal with it. The only other situation we’re interested in here is if it’s a word, of which there are some special cases.

If the word is class or interface then the next token represents a class or interface name, and it is put into classes and classMap. If the word is import or package, then we don’t want the rest of the line. Anything else must be an identifier (which we’re interested in) or a keyword (which we’re not, but they’re all lowercase anyway so it won’t spoil things to put those in). These are added to identMap.

The discardLine( ) method is a simple tool that looks for the end of a line. Note that any time you get a new token, you must check for the end of the file.

The eatComments( ) method is called whenever a forward slash is encountered in the main parsing loop. However, that doesn’t necessarily mean a comment has been found, so the next token must be extracted to see if it’s another forward slash (in which case the line is discarded) or an asterisk. But if it’s neither of those, it means the token you’ve just pulled out is needed back in the main parsing loop! Fortunately, the pushBack( ) method allows you to “push back” the current token onto the input stream so that when the main parsing loop calls nextToken( ) it will get the one you just pushed back.
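The look-ahead and push-back logic can be sketched as below. This is not the method from the listing: it is a static version that takes the tokenizer as an argument, and it also pushes back inside a block comment so that a run such as **/ is still recognized as the end. CommentEater and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: skips comments by hand, keeps other words.
class CommentEater {
    static List<String> words(String source) {
        StreamTokenizer in =
            new StreamTokenizer(new StringReader(source));
        in.ordinaryChar('/');
        in.eolIsSignificant(true);
        List<String> result = new ArrayList<>();
        try {
            while (in.nextToken() != StreamTokenizer.TT_EOF) {
                if (in.ttype == '/') {
                    eatComments(in);
                } else if (in.ttype == StreamTokenizer.TT_WORD) {
                    result.add(in.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    // Called after a '/' has been read by the main loop.
    static void eatComments(StreamTokenizer in) throws IOException {
        int next = in.nextToken();
        if (next == '/') {
            // One-line comment: discard tokens up to the end of line.
            while (in.nextToken() != StreamTokenizer.TT_EOF
                    && in.ttype != StreamTokenizer.TT_EOL) {
                // nothing to do, the token is simply dropped
            }
        } else if (next == '*') {
            // Block comment: discard tokens until '*' followed by '/'.
            while (in.nextToken() != StreamTokenizer.TT_EOF) {
                if (in.ttype == '*') {
                    if (in.nextToken() == '/') {
                        break;
                    }
                    in.pushBack(); // re-examine this token next time
                }
            }
        } else {
            // Not a comment: hand the token back to the main loop.
            in.pushBack();
        }
    }

    public static void main(String[] args) {
        // prints "[a, b, c, d]"
        System.out.println(
            words("a / b // hidden words\nc /* skip\nthese */ d"));
    }
}
```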

For convenience, the classNames( ) method produces an array of all the names in the classes collection. This method is not used in the program but is helpful for debugging.

The next two methods are the ones in which the actual checking takes place. In checkClassNames( ), the class names are extracted from the classMap (which, remember, contains only the names in this directory, organized by file name so the file name can be printed along with the errant class name). This is accomplished by pulling each associated Vector and stepping through that, looking to see if the first character is lower case. If so, the appropriate error message is printed.

In checkIdentNames( ), a similar approach is taken: each identifier name is extracted from identMap. If the name is not in the classes list, it’s assumed to be an identifier or keyword. A special case is checked: if the identifier length is 3 or more and all the characters are uppercase, this identifier is ignored because it’s probably a static final value such as TT_EOF. Of course, this is not a perfect algorithm, but it assumes that you’ll eventually notice any all-uppercase identifiers that are out of place.

Instead of reporting every identifier that starts with an uppercase character, this method keeps track of which ones have already been reported in a Vector called reportSet. This treats the Vector as a “set” that tells you whether an item is already in the set. The item is produced by concatenating the file name and identifier. If the element isn’t in the set, it’s added and then the report is made.
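The two capitalization tests and the report-once "set" can be sketched together. The class IdentCheck and its method names are hypothetical, and the sketch leaves out the lookup in the classes list that the text describes.

```java
import java.util.Vector;

// Hypothetical helper illustrating the identifier checks.
class IdentCheck {
    // All-uppercase names of three or more characters are assumed to
    // be static final values such as TT_EOF and are ignored.
    static boolean looksLikeConstant(String id) {
        return id.length() >= 3 && id.equals(id.toUpperCase());
    }

    // An identifier is suspect if it starts with an uppercase letter
    // and is not an all-uppercase constant.
    static boolean isSuspect(String id) {
        return !id.isEmpty()
            && !looksLikeConstant(id)
            && Character.isUpperCase(id.charAt(0));
    }

    // A Vector used as a set of already-reported items.
    private final Vector<String> reportSet = new Vector<>();

    // Returns true only the first time a file/identifier pair is seen.
    boolean firstReport(String fileName, String id) {
        String key = fileName + id;
        if (reportSet.contains(key)) {
            return false;
        }
        reportSet.addElement(key);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSuspect("Count"));   // prints "true"
        System.out.println(isSuspect("TT_EOF"));  // prints "false"
        System.out.println(isSuspect("count"));   // prints "false"
        IdentCheck check = new IdentCheck();
        System.out.println(check.firstReport("A.java", "Count")); // true
        System.out.println(check.firstReport("A.java", "Count")); // false
    }
}
```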

The rest of the listing consists of main( ), which busies itself by handling the command line arguments and figuring out whether you’re building a repository of class names from the standard Java library or checking the validity of code you’ve written. In both cases it makes a ClassScanner object.

Whether you’re building a repository or using one, you must try to open the existing repository. By making a File object and testing for existence, you can decide whether to open the file and load( ) the Properties list classes inside ClassScanner. (The classes from the repository add to, rather than overwrite, the classes found by the ClassScanner constructor.) If you provide only one command-line argument it means that you want to perform a check of the class names and identifier names, but if you provide two arguments (the second being “-a”) you’re building a class name repository. In this case, an output file is opened and the Properties list is written into it in a single method call, along with a string that provides header information.
