Understanding Data Exchange FormatsHow to keep your information from getting lost in translation |
|
From "VS Workshop", Access to Wang, March 1993 |
|
[ Prior Article ] [ Return to the Catalog of articles ] [ Next Article ] |
One result of the increasing penetration of small computers into business is that many of your customers have discovered data transfer and may expect to be able to exchange information electronically with their vendors. Frequently, this involves exchanges between unlike, seemingly incompatible systems. For example, your customer may be using an inventory system written in FoxPro, which has a file structure based on the dBase file format. How are you gonna get their order information into your VS?
Part of the answer lies in understanding and using some of the many data interchange formats available. Fortunately, such interchange formats abound; unfortunately, they are sometimes subject to wide interpretation - consider the wide variations between "text files" - and are not always easy to meet with conventional VS programming tools. Nonetheless, mastery of common file exchanges can give your organization a competitive advantage by enabling faster and more accurate exchange of information with your clients.
Interchange formats provide a relatively easy translation between complex data structures by describing specific product features in common forms - usually in text. The task of meeting a typical interchange format cannot be described as easy, since it has many unfamiliar aspects in comparison to routine VS programming work, but it is far easier than writing to a proprietary format and far more portable.
Before delving into the specifics of data interchange formats, let's consider some of their characteristics and the constraints they work within.
To be successful in bridging unlike systems, any interchange format must be generic. Specific requirements that must be addressed include:
Character set: The binary values of the characters. ASCII characters are used in the PC, VS, and UNIX worlds, but there is some disparity in usage for characters above and below the standard "typewriter" characters.
Field separators: In traditional data processing systems, fields are fixed in length and can be identified by their position within the record format. In comparison, most interchange formats allow variation in field size and use other means of identifying and separating field information.
Record separators: Records in VS files are identified internally; the separation between records is handled transparently by the operating system. Other environments may require specific characters for this purpose.
Error detection: Some data formats include counters in header and trailer records that helps identify lost records or incomplete transmissions.
Identification: Explicit information on the sender and intended receiver is a part of some interchange formats.
Translated, most data interchange specifications use an uncompressed consecutive organization ("text files"), a limited character set, records and fields that vary in size according to need, and multiple record types. The text file format is necessary to overcome differences between operating environments, since there are few industry standards for record indexing or compression. Ordinary characters - numbers, punctuation, and alphabetic characters - are common ties between all systems and sidestep differences of character interpretation between environments. Variable-length field and record areas allow the record size to remain compact, even in an uncompressed file format - an important consideration in low-speed data transmissions.
An example of such a data interchange format is the humble tab-delimited text file. Tab-delimited files can be imported into and exported from most PC spreadsheets, database products, and word processing systems. In this format, data items are separated by a tab character - an ASCII value of 9 - and records are defined by the combination of a line feed and carriage return characters - like files produced by most text editors. There are usually the same number of "columns" - delimited fields - in each row, similar to the fixed row-and-column arrangement of a spreadsheet or database view. A typical data record might look like this:
ITEMNUMB<T>DESC<T>QUAN<T>UM<T>COST 1234<T>"Striped Widget, Large"<T>5<T>EA<T>5.70 45678Hand cutter 1 150.00 (In this example, the <T> character indicates a tab character.) The first row contains field labels for the columns, followed by two rows of data. Note that the third line has no value for UM (Unit of Measure); instead, there are two tab characters so that the COST column is placed in the correct column. Another interesting point is the double quotes around the description of the Striped Widget; some spreadsheet and database products quote fields that contain a comma or quotes within the text area, since these characters are used as delimiters and might confuse the separation of the fields. (If you have access to a product that produces tab-delimited output, experiment now to understand how it handles this situation.)
This looks simple, but some care is necessary to ensure compatibility with other environments. Number formats should have leading signs that "float" to the leading number position; leading zeros are often acceptable, too. Text fields that contain commas should be quoted. Dates may have to be converted into the internal spreadsheet format - an integer value representing the number of days since the turn of the century - if comparisons will be made between dates. Fields may be reduced to their non-blank area and tab characters added through the methods described previously in this space. (For a description of integer date conversion, see the LOTUSDAT procedure in Access to Wang, June 1989. Producing tab- delimited files from COBOL was the subject in the September 1992 Access to Wang.)
Don't underestimate the confusion of record delimiters when your transfers include a mix of MS-DOS, UNIX, or Macintosh systems; each has a different convention. Most MS-DOS applications use a carriage return (hex OD) followed by a line feed (hex OA), but some reverse this combination. UNIX systems use only the line feed. Inexplicably, Apple chose to use only a carriage return for their record delimiter. Some data communications packages can convert these into the scheme used by the local system; otherwise, it may be necessary to find or build utilities that can perform this translation. (Mega Edit - a Windows-based text editor distributed as Shareware - reads and writes files using any of these formats and can be used as a conversion tool. Contact Computer Witchcraft, Inc. at (415) 752-2477 for details.)
Less common these days but still widely support, the comma-delimited approach uses a combination of double quotes and commas to separate fields. Numeric fields are unquoted; all other text is surrounded by double quotes. Here is how the example above would look in this format:
"ITEMNUMB","DESC","QUAN","UM","COST" 1234,"Striped Widget, Large",5,"EA",5.7 45678,"Cutter, hand",1,,150This is the format you should choose if your intended receiver uses an early version of Lotus 1-2-3, since there are few other import choices with this product. All of the leading spreadsheets still offer it for import and export, usually with the .CSV file extension to indicate its internal structure.
Other popular exchange formats include:
DIF (Data Interchange Format): Created by VisiCalc and absorbed by Lotus, DIF can be used to describe almost any file format with rows and columns, using only text. Files usually has an extension of .PRN or .DIF.
SYLK (Symbolic Link): A Microsoft format first used by Multiplan. Still supported by all Microsoft spreadsheet products, including those for the Macintosh. Similar to DIF in format. Files usually have an extension of .SLK.
SDI (Super Data Interchange): A derivative of DIF developed by Computer Associates and used by SuperCalc.
RTF (Rich Text Format): Another Microsoft format developed primarily for word processing conversion. Uses text to describe document format and content, including attributes such as color and binary images. Primary means of exchange between Macintosh and MS-DOS versions of Microsoft Word and Works. Files usually have an extension of .RTF.
ACH (Automated Clearing House): A series of standards used by the financial industry for automatic payments and withdrawals. Contains extensive internal checks for record count, transaction amounts, and other information. Text format; uses varied character sets and transmission approaches as dictated by the need.
EDI (Electronic Data Interchange): A large group of industry-specific data exchange standards promoted by the Data Interchange Standards Association. Primarily used by large manufacturers and their suppliers. Complex format; most organizations use value-added network services to simplify the delivery and translation processes.
I hope this brief discussion of interchange approaches has given you some ideas of how such informational exchanges can improve your organization's service. Here are some more sources of information on this topic:
File Formats for Popular PC Software: A Programmer's Reference
Jeff Walden; John Wiley & Sons, Inc.; 1986
File structures for SYLK, DIF, dBase II and III, Lotus, etc.
VS COBOL 74 Reference Manual
Wang Laboratories; 800-1201G
Information on the STRING and UNSTRING verbs, two important tools for string handling in COBOL.
"Rich Text Format Specifications"
Microsoft Corporation
Full specifications for the RTF format. Available from Microsoft by request or through their bulletin board (Microsoft Download Service, (206) 936-6735; up to 9600 baud).
"An Introduction to Electronic Data Interchange"
Data Interchange Standards Assoc., Inc. (DISA), (703) 548-7005; Pub. ASC X12S 91-690 (Sep. '91)
Good description of EDI principals and the standards process.
Copyright © 1993 Dennis S. Barnes
Reprints of this article are permitted without notification
if the source of the information is clearly identified