
Tricky Business

Converting legacy documents to the Web

From "Integration",  Access to Wang, February 1997

With increasing volume, trade publications and the general press have been sounding the call for internal web publishing (intranets), which exploit the information distribution potential of the Internet within a closed loop of organizational insiders. Few hard numbers accompany this coverage, so it is difficult to assess whether the new medium is really cost-effective or just today's fad. Sure, sharing information is what data processing was supposed to be about; but can internal webs play a real role in the success of the enterprise?

The answer is a qualified yes. Like all communication forms - 3-ring binders, radio, COBOL data entry programs, PC applications - Web publishing has strengths and weaknesses that must be observed for best results. And while there are gains to be made by simply putting your existing document content on-line, even more benefit can be seen by changing the format to work better within the medium.

The following discussion is intended to give you some idea of the challenges you face in moving information into this new medium and some of the ways to meet those challenges.

Differences in Electronic Media

Even if the content and format of an original document can be preserved, electronic media differ from their printed cousins in significant ways. The viewing area on all personal computers is much smaller than that of a printed page, and much of that area must be devoted to non-content purposes such as screen and window control. The image quality is poor (even with good monitors), and long sessions reading computer screens can result in eyestrain.

Electronic media must also coexist with other applications that might be in use. For example, a document that tells how to overcome a problem with a word processing application must allow you to see the program screen in addition to the documentation so you can follow its directions.

Some of the other differences lie in limits on the presentation medium itself. The original HTML specification was intended to allow rapid distribution of text materials without regard to the individual requirements of the destination system; instead, it is up to the receiving system and its software to accommodate the text and present it in an appropriate way. Therefore, there are few ways to provide normal formatting controls, such as line widths. As a result, HTML documents should be tested using a variety of browsers and platforms to see how these tools differ in their interpretation of the format.

Simple Document Conversion versus Redevelopment

By now you have heard the pitches of many vendors of conversion software, boasting of "one-click" conversion of documents to the format of the World Wide Web, Hypertext Markup Language (HTML). Most of these tools work reasonably well, preserving much of the structure and appearance of the originals.

Even if the appearance of these documents were the same, there are some fundamental differences in the ways the documents are used that make a straight conversion less usable than a document that has been reordered to take more advantage of the medium. For example, consider the problem of finding topics in a large document. In both printed and on-line media, you can use the index or Table of Contents to find information quickly - if you already have some idea of the words to search for. If you are unfamiliar with the topic, it becomes difficult to use indexes or other tabular information to find what you need. In this case, users of printed documents can flip through the material, scanning for words to use in a search or, perhaps, finding the information they are looking for. Users of electronic versions must try different search keywords until they hit the information they need, or slog through the material scanning for a hint. Since the amount of material that can be presented effectively on a screen is far less than that on a printed page, this can be a painful process.

As a rule, it is not desirable to create web pages that require long waits to download over slow connections. My personal goal: no page ever exceeds 100,000 characters in total size (including graphics and all text). Naturally, you can adjust this size upward if your connections are fast (e.g. over a local network).
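To make the size guideline concrete, the sketch below estimates how long a page of a given size takes to arrive over a dial-up connection. The function name, the link speeds, and the figure of ten bits on the wire per character (eight data bits plus start and stop bits, with compression and protocol overhead ignored) are assumptions made for illustration, not figures from this article.

```python
# Rough download-time estimate for a Web page over a dial-up link.
# Assumption (not from the article): each character costs 10 bits on the
# wire (8 data bits plus start/stop bits); compression and protocol
# overhead are ignored.

BITS_PER_CHARACTER = 10


def download_seconds(page_characters: int, bits_per_second: int) -> float:
    """Return the estimated transfer time in seconds."""
    if bits_per_second <= 0:
        raise ValueError("bits_per_second must be positive")
    return page_characters * BITS_PER_CHARACTER / bits_per_second


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only for illustration.
    for speed in (14_400, 28_800):
        seconds = download_seconds(100_000, speed)
        print(f"{speed} bps: about {seconds:.0f} seconds for 100,000 characters")
```

Under these assumptions, a page at the 100,000-character ceiling needs a little over a minute at 14,400 bps and about half that at 28,800 bps, which suggests why a lower ceiling may suit readers on slow links.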

Example One: Converting a Small Document

Most organizations have moderately small documents that are good candidates for electronic distribution. My example is a travel and reimbursement policy document we recently converted and placed on the corporate Web. The original document was fifteen pages and included a Table of Contents. Using Internet Assistant for Word, we converted the original document to HTML in a few minutes, then spent about an hour creating hyperlinks (jumps from one area of the document to another) from the Table of Contents to the related sections of the document. The resulting Web page made information easy to find and was accessible to everyone in the organization - even employees who were traveling or otherwise away from the office. One drawback was that the converted document (about 60,000 characters) took too long to download on a slow connection.
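The hand-linking step described above can be sketched in a few lines. This is a minimal illustration, not the procedure Internet Assistant follows: it assumes each section begins with a plain <h2> heading, and the function name and the sec1, sec2, ... anchor names are invented for the example.

```python
import re

# Assumption: every major section starts with a plain <h2>...</h2> heading.
HEADING = re.compile(r"<h2>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def add_contents_links(html: str) -> str:
    """Name each <h2> heading and prepend a linked Table of Contents."""
    titles = []

    def name_heading(match):
        # Wrap the heading text in a named anchor so it can be a jump target.
        title = match.group(1).strip()
        titles.append(title)
        return f'<h2><a name="sec{len(titles)}">{title}</a></h2>'

    body = HEADING.sub(name_heading, html)
    entries = "\n".join(
        f'<li><a href="#sec{number}">{title}</a></li>'
        for number, title in enumerate(titles, start=1)
    )
    return f"<h1>Table of Contents</h1>\n<ul>\n{entries}\n</ul>\n{body}"


if __name__ == "__main__":
    sample = "<h2>Airfare</h2><p>Book early.</p><h2>Meals</h2><p>Keep receipts.</p>"
    print(add_contents_links(sample))
```

The named-anchor form (a name="...") is the convention of HTML from this period; current HTML would normally put an id attribute on the heading instead.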

The document conversion was refined further by creating separate Web documents for each major section, including a separate document for the Table of Contents. Splitting the document into small sections significantly reduced the time needed to download each one.
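A split of this kind might be automated along the following lines. The sketch assumes that major sections are marked by plain <h2> headings; the function name and the section1.html and contents.html file names are hypothetical, and any text before the first heading is simply dropped.

```python
import re

# Assumption: every major section starts with a plain <h2>...</h2> heading.
HEADING = re.compile(r"<h2>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def split_into_pages(html: str) -> dict[str, str]:
    """Return {file name: page content}, one page per <h2> section."""
    starts = [match.start() for match in HEADING.finditer(html)]
    ends = starts[1:] + [len(html)]

    pages = {}
    links = []
    for number, (start, end) in enumerate(zip(starts, ends), start=1):
        chunk = html[start:end]  # one heading plus the text that follows it
        title = HEADING.match(chunk).group(1).strip()
        file_name = f"section{number}.html"
        pages[file_name] = chunk
        links.append(f'<li><a href="{file_name}">{title}</a></li>')

    # The Table of Contents becomes its own small page of links.
    pages["contents.html"] = "<ul>\n" + "\n".join(links) + "\n</ul>"
    return pages


if __name__ == "__main__":
    sample = "<h2>Airfare</h2><p>Book early.</p><h2>Meals</h2><p>Keep receipts.</p>"
    for name, content in split_into_pages(sample).items():
        print(name, len(content), "characters")
```

Each returned entry can then be written to its own file, so a reader downloads only the contents page and the sections actually needed.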

(Examples of some of these approaches can be viewed at the addresses listed in Figure 1 below.)

Example Two: Converting a Large Document

Early last year, I converted my 1989 book, Useraids: A Guide to Low-Cost Software for Wang VS Systems, into HTML and made it available for public use. The original work was over two hundred pages and covered 96 utilities in depth. In addition, there were cross-references by file name and program purpose, and a glossary that made references to programs covered in more depth elsewhere in the document.

Part of my intent in the original printed version was to provide alternate ways to access the material - by file name, by subject, or arranged for easy browsing. For the electronic version, I wanted to improve the usability of these paths by providing hyperlinks into the material. Other design goals included reasonable document size (for performance reasons) and an attractive presentation of the material. The completed work consisted of 228 files, the largest of which was less than 50,000 characters. The total size of the work (all files) was 650,265 characters. The project took about forty hours, using a combination of Unix-style text tools and manual editing.
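Figures like the file count, largest file, and total size quoted above can be gathered with a short script. The sketch below is one possible approach, not the tooling used for this project; the function names and the 50,000 default limit are assumptions, and sizes are counted in bytes, which equals characters only for single-byte text.

```python
from pathlib import Path


def sizes_in_directory(directory: str) -> dict[str, int]:
    """Map each file under the directory to its size in bytes."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def summarize_sizes(sizes: dict[str, int], limit: int = 50_000) -> dict:
    """Report the file count, total size, largest file, and files over the limit."""
    largest = max(sizes, key=sizes.get) if sizes else None
    return {
        "file_count": len(sizes),
        "total_size": sum(sizes.values()),
        "largest_file": largest,
        "over_limit": sorted(name for name, size in sizes.items() if size > limit),
    }


if __name__ == "__main__":
    # Summarize the current directory; substitute the folder holding the pages.
    print(summarize_sizes(sizes_in_directory(".")))
```

Running a check like this after each round of editing makes it easy to spot a page that has grown past the chosen size goal.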

Conclusion

The choice of document conversion approaches is similar to that facing program developers: rough conversions can be accomplished rapidly, but the best results will be seen with redevelopment. With the addition of HTML capabilities to standard desktop software (word processors, spreadsheets, etc.), it becomes possible to simultaneously publish in traditional and electronic formats. This capability promises a new level of convenience and accessibility of information.


Figure 1: References and Examples

Examples

Useraids: A Guide to Low-Cost Software for Wang VS Systems
http://www.dsbarnes.com/useraids

HTML conversion of my 1989 book on user-supported VS software.

Tools

Microsoft Internet Assistant for Word
http://www.microsoft.com/msword/internet/ia/

An add-in for Word that provides HTML capabilities in addition to normal document editing. Available only for recent versions of Word for Windows 95, Windows NT, and the Macintosh.

References

A Beginner's Guide to HTML
http://www.ncsa.uiuc.edu/General/Internet/WWW/HTMLPrimer.html

The classic guide to HTML from the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign. A good starting place.

Composing Good HTML
http://www.cs.cmu.edu/~tilt/cgh/

A style guide from James "Eric" Tilton.

World Wide Web FAQ
http://www.boutell.com/faq/

A question-and-answer guide to HTML by Thomas Boutell.




Copyright © 1997 Dennis S. Barnes
Reprints of this article are permitted without notification if the source of the information is clearly identified