Chapter 4. Converting Flat Files to XML

Table of Contents

The Budget
The Model
Determining the Output Format
Building Hierarchical Structures from Flat Data
Alternatives to Java
Imposing Hierarchy with XSLT
The XML Query Language
Relational Databases

Relatively little of the world’s data is currently stored in XML. Much of it is stored in flat files as tab-delimited text, comma separated values, or some similar format. More is locked up in databases of one kind or another, whether relational, hierarchical, or object-based. Even more is hidden inside unstructured documents including Microsoft Word files, HTML documents, and plain text. XML tools are not suitable for working with any of this.

There are no magic bullets that will convert all your data to semantically tagged XML. There are a few specialized programs that convert certain formats such as Word documents to particular XML applications such as XHTML. However, the output from even the best of these tools often needs to be cleaned up by hand. How much clean-up work you need to do generally depends on how structured the data format is to start with and how clean the data is. It’s relatively easy to encode a relational table from a DB2 database as XML because it already has a lot of structure and a mandatory schema. It’s a lot harder in practice to convert tab-delimited text files because they tend to be full of mistakes and dirty data. Records are missing fields. Fields get swapped with each other. A field that is supposed to contain a number between 1 and 12 may contain a list of foodstuffs the data entry clerk was supposed to buy on his way home one day. All of these things can and do happen, and you have to account for them when doing anything with such data, whether that’s converting it to XML or summarizing it for an annual report.

When you’re tasked with converting legacy data to XML, you just have to roll up your sleeves and attack the problem. You need to understand the current structure of the data. You need to write a program that reads the input format and writes out XML. You need to debug the inevitable problems that arise when the data in the input isn’t exactly you thought it was or what it’s supposed to be. By far the hardest part of this problem is parsing the input data, in whatever form it takes. Once you’ve loaded the data into your program, writing it back out again in XML is a cake walk.

The Budget

As an example of this process, I’m going to use U.S. Federal Government budget authorization data, which the Office of Management and Budget publishes in a variety of equivalent flat formats, even though the data itself is relatively unflat. This is a good example of the sort of legacy data developers often have to deal with. The complete document consists of 3185 line items. Each line item has 43 separate fields. In the comma separated values (CSV) version of the file, a typical line item looks like this:

"418","National Endowment for the Humanities","00","National Endowment for the Humanities","0200","National Endowment for the Humanities: grants and administration","59","503","Research and general education aids","Discretionary","On-budget", 0, 0, 0, 121275, 145231, 150100, 151299, 130560, 135447, 140118, 139478, 132582, 138890, 140435, 153000, 156910, 170002, 175955, 177413, 177491, 172000, 110000, 110000, 111000, 112000, 115000, 120000, 121000, 124000, 126000, 129000, 132000

Each field is separated from the following field by a comma. Strings are enclosed in double quotes, and may contain commas that do not delimit fields. Dollar amounts are written as integers divided by 1000. That is, the last value in the above line is 132 million dollars, not 132 thousand dollars. Table 4.1 identifies the 43 separate fields.[1]

Table 4.1. Budget Fields

Field numberField NameDescription
1Agency codeA unique 3-digit numerical code for the cabinet department or independent agency
2Agency nameThe name of the cabinet level department (e.g. Department of Defense) or independent agency (e.g. Peace Corps). Even though only the executive branch actually has true agencies, for purposes of the budget, offices within the legislative branch are given the agency name “Legislative Branch” and offices within the judicial branch are given the agency name “Judicial Branch”. Agency names have a maximum of 89 characters
3Bureau code2-digit numerical code for the bureau within the cabinet department or independent agency; bureau codes are unique only within an agency.
4Bureau nameThe name of the sub-office within the agency, e.g. “Coast Guard” or “Federal Aviation Administration”. Amounts for the agency as a whole that are not part of a specific bureau are generally categorized under a fictional bureau with the same name as the agency. Bureau names have a maximum of 89 characters
5Account codeA 4-digit code (outlays) or 6-digit code (offsetting receipts) for the account within the bureau.
6Account nameThe name of the budgeted function within the bureau. Account names have a maximum of 160 characters
7Treasury Agency code2-digit numerical code for the agency, assigned by the Treasury Department
8Subfunction code3-digit numerical code for the subfunction within an account
9Subfunction titleThe name of the subfunction within the account. Subfunction titles have a maximum of 72 characters
10BEA categoryBudget Enforcement Act category: “Mandatory”, “Discretionary”, or “Net interest”
11On- and off-budget indicator“On-budget” or “Off-budget”; Social Security trust funds and the Postal Service are off-budget, all other accounts are on-budget
121976 valueActual amounts, in thousands of dollars, for fiscal year (FY) 1976. Budget authority is usually shown as a positive value. Offsetting receipts are usually negative values.
13TQ valueActual amount, in thousands of dollars, for the “transitional quarter” in 1976 when the government shifted the start of its fiscal year ahead from July to October.
14-371977-2000Actual amounts, in thousands of dollars, for each fiscal year from 1977 to 2000
38-end2001-2006Estimated amounts, in thousands of dollars, for FY 2001 through FY 2006

[1] Adapted from PUBLIC BUDGET DATABASE USER’s GUIDE: Budget of the United States Government Fiscal Year 2002, Budget Analysis Branch Office of Management and Budget, April 9, 2001, p. 8

Copyright 2001, 2002 Elliotte Rusty Haroldelharo@metalab.unc.eduLast Modified September 10, 2001
Up To Cafe con Leche