Before providing some comments on common archived data forms, a few words about computer terminology may be in order. The smallest element of information contained in a computer is called a bit. Each bit may be `on' or `off' and is represented by a `1' or `0', respectively. Computers store both text and numbers as sequences of bits. A sequence of 8 bits is called a byte and is often (but not always) used to describe a ``character'' of text (e.g., `a', `q', `;', ` ', `6', etc.). An ordered sequence of bytes is called a word. Generally, computer workstations and supercomputers used by atmospheric and oceanographic scientists have word lengths of 32 and 64 bits, respectively. If the bytes refer to characters, then a 32-bit word could contain 4 characters, while a 64-bit word could contain 8 characters.

A computer word used for storing floating point numeric values consists of three segments: a sign bit, a characteristic (biased exponent) and a mantissa. An integer is represented by two segments: a sign bit and a sequence of magnitude bits. A 32-bit word can store floating point numbers with six to seven decimal digits of precision, while a 64-bit word can store numbers with thirteen to fourteen decimal digits of precision. Workstations which normally operate with 32-bit words can also use and store numeric data in 64-bit mode by using type declaration statements in FORTRAN (double precision) and C (double and long int).

Most computer systems store numbers with the most significant byte first (called ``big-endian'' form). Other systems use the ``little-endian'' form, in which the byte ordering is reversed. This characteristic of computer hardware architecture can cause some difficulty when using binary data created on different machines. However, software is often available to transform the data to the appropriate form.
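Byte ordering is easy to test empirically. The following minimal C sketch (assuming a C99 compiler for the fixed-width integer type) stores a known 32-bit pattern and inspects which byte appears first in memory:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t word = 0x01020304;              /* a known 32-bit pattern */
        unsigned char *byte = (unsigned char *)&word;

        /* On a big-endian machine the most significant byte (0x01) is
           stored first; on a little-endian machine the least significant
           byte (0x04) comes first. */
        if (byte[0] == 0x01)
            printf("big-endian\n");
        else
            printf("little-endian\n");
        return 0;
    }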
In the early years of computers, the size of a dataset was often described in terms of the number of 7- or 9-track computer tapes required to archive the data. Today, there are a variety of storage media available, and the sizes of datasets are typically described in terms of kilobytes (KB), megabytes (MB), gigabytes (GB) and terabytes (TB). (Note: a kilobyte of computer memory refers to 1024 bytes.) Table 2.1 summarizes these dataset size descriptors. (A reasonably full 9-track tape holds about 130 MB.)
Character format: This is often the most convenient form for the user. The most commonly used character set is that based upon the ASCII standard. The other character set which may be encountered is EBCDIC, an IBM standard. ASCII is a 7-bit code normally stored one character per 8-bit byte, while EBCDIC is an 8-bit code. Conversion from one character set to the other may be accomplished using various software ``tools'' (``filters'' in UNIX jargon). The advantage of using character data is that it may be read directly by a human being or through the standard input and output statements provided in the FORTRAN and C programming languages. Character formats are convenient for `small' datasets that must be used on a variety of machines. However, the number of computer instructions needed to read character data is considerable, and character data take up relatively large amounts of space on external media such as disks or tapes. For example, the numbers `9', `679.43' and `-0.123456E+05' require 1, 6 and 13 bytes, respectively, in ASCII. As explained below, these numbers can be archived much more concisely using `packed-binary' representation.
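The contrast in storage cost, and the conversion step that character data requires, can be seen in a small C sketch (the value chosen here simply reuses the example above):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The same value in character form and in native binary form. */
        const char *text  = "-0.123456E+05";    /* 13 ASCII bytes */
        float       value = -0.123456e+05f;     /* typically 4 bytes */

        printf("character form: %zu bytes\n", strlen(text));
        printf("native float:   %zu bytes\n", sizeof value);

        /* Reading character data requires a conversion step: */
        float parsed;
        sscanf(text, "%f", &parsed);
        printf("parsed value:   %.1f\n", parsed);
        return 0;
    }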
Native Format: Computers from different manufacturers may use different schemes for the representation of both characters and numbers. The internal format used by a particular computer is commonly called its native format. As previously described, characters are generally stored in ASCII, although some machines use EBCDIC. Numeric values often use a standard IEEE format or vendor-specific representations (e.g., Cray, DEC). Reading and writing data in native format can be very fast because no data conversions need to be performed. In addition, no precision is lost. (This can be very important in some numerical models or matrix inversion problems.) However, reading one machine's native format on a computer which uses a different native format can be slow because a conversion algorithm must be used. One additional drawback is that each numeric value archived in native form requires the full word length of the machine (e.g., 32 or 64 bits). Thus each value is stored with full machine precision even if it uses more bits than necessary for the known accuracy of the values (e.g., 12.1 degrees C stored as 12.1357 degrees C).
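A minimal C sketch of native-format I/O follows (the file name is hypothetical). The in-memory bit patterns are copied to disk verbatim, so no conversion is performed and no precision is lost:

    #include <stdio.h>

    int main(void)
    {
        double out[3] = { 12.1357, 679.43, -0.123456e+05 };
        double in[3];
        FILE *fp;

        /* Write: bit patterns go to disk exactly as stored in memory. */
        fp = fopen("native.dat", "wb");
        if (fp == NULL) return 1;
        fwrite(out, sizeof(double), 3, fp);
        fclose(fp);

        /* Read back on the same architecture: equally fast, but the
           file is not portable to a machine with a different native
           representation or byte order. */
        fp = fopen("native.dat", "rb");
        if (fp == NULL) return 1;
        fread(in, sizeof(double), 3, fp);
        fclose(fp);

        printf("%g %g %g\n", in[0], in[1], in[2]);
        return 0;
    }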
Packed Binary: Because atmospheric and oceanographic datasets can be quite large, it is desirable to optimize the amount of information that can be archived on external media. This optimization, representing information using a minimum number of bits, is called packed binary. It is an efficient method for archiving data and is independent of machine representation. Packing data values means expressing integer and, most often, floating point values in sequences of bits just sufficient to capture the required precision of the data. For example, the floating point number ``3.1'' requires 32 or 64 bits in native format, or 24 bits (three ASCII or EBCDIC characters) in character format. However, in packed binary it requires only five bits: scaled by ten, 3.1 becomes the integer 31, which fits in five bits. (Appendix C gives an example of packing and unpacking a number.) Obviously, significant reductions in storage space can be realized.
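A sketch of the offset-and-scale arithmetic behind this example is shown below (the helper name bits_needed is ours; actual packing schemes differ in detail):

    #include <stdio.h>

    /* Number of bits needed to pack values in [vmin, vmax] kept to a
       given precision: scale to non-negative integers, then count the
       bits required for the largest one. */
    static int bits_needed(double vmin, double vmax, double precision)
    {
        unsigned long range = (unsigned long)((vmax - vmin) / precision + 0.5);
        int nbits = 0;
        do { nbits++; range >>= 1; } while (range > 0);
        return nbits;
    }

    int main(void)
    {
        /* 3.1 with offset 0.0 and precision 0.1 scales to the
           integer 31, which needs five bits. */
        printf("%d bits\n", bits_needed(0.0, 3.1, 0.1));
        return 0;
    }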
These packed binary numbers are sequentially stored in, and subsequently retrieved from, a bit stream. NCAR has FORTRAN and C routines, called `gbytes' and `sbytes', that will unpack and pack the bit streams. The software and documentation may be obtained from NCAR via anonymous ftp (see Chapters 10 and 11). Most vendors also provide software that provides bit-level access for manipulation of bit stream data.
Although software must be used to convert the bit groups to a machine's internal format, this conversion is often considerably faster than converting character data. An additional benefit of packed binary is that the smaller volume allows data to be transmitted electronically more efficiently.
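The actual NCAR `gbytes'/`sbytes' interfaces differ, but the simplified C sketch below illustrates the underlying idea: routines that pack and extract arbitrary-width bit fields from a byte stream, here storing 3.1 in five bits after scaling by ten:

    #include <stdio.h>

    /* Pack the 'nbits' low-order bits of 'value' into byte stream
       'buf', starting at bit offset 'bitpos' (big-endian bit order). */
    static void sbyte(unsigned char *buf, long bitpos, int nbits,
                      unsigned long value)
    {
        for (int i = nbits - 1; i >= 0; i--, bitpos++) {
            if ((value >> i) & 1UL)
                buf[bitpos / 8] |=  (1U << (7 - bitpos % 8));
            else
                buf[bitpos / 8] &= ~(1U << (7 - bitpos % 8));
        }
    }

    /* Extract 'nbits' bits starting at bit offset 'bitpos'. */
    static unsigned long gbyte(const unsigned char *buf, long bitpos,
                               int nbits)
    {
        unsigned long value = 0;
        for (int i = 0; i < nbits; i++, bitpos++)
            value = (value << 1) | ((buf[bitpos / 8] >> (7 - bitpos % 8)) & 1U);
        return value;
    }

    int main(void)
    {
        unsigned char stream[8] = {0};

        /* Pack 3.1 as the 5-bit integer 31 (value scaled by 10). */
        sbyte(stream, 0, 5, (unsigned long)(3.1 * 10.0 + 0.5));

        double unpacked = gbyte(stream, 0, 5) / 10.0;
        printf("unpacked value: %.1f\n", unpacked);   /* prints 3.1 */
        return 0;
    }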
Scientific Data Formats
There are a number of ``standard'' scientific data formats.
Documentation and software necessary to implement these formats are
generally available via computer networks (see Appendix B).
Architecture-independent standard formats commonly used for
atmospheric and oceanographic datasets include:
Merging of HDF and netCDF: In July 2005, netCDF 4.0 will be released. Unidata's description follows: "The netCDF API will be extended and implemented on top of the HDF5 data format. NetCDF users will be able to create HDF5 files with benefits not available with the netCDF format, such as much larger files and multiple unlimited dimensions. Backward compatibility in accessing old netCDF files will be supported. The combined library will preserve the desirable common characteristics of netCDF and HDF5 while taking advantage of their separate strengths: the widespread use and simplicity of netCDF and the generality and performance of HDF5."
Why are there so many `standards'? The answer is partly historical and
partly practical. Historically, many agencies developed their own
internal format standards for data archival prior to working with
other organizations. When other groups requested data, the originating
agency sent the data in its own format (of course!). Soon several groups
were using the data in a particular format and it became a
``de-facto'' standard. On a practical level, the development of any
data archiving and exchange format involves trade-offs among various
features: compactness, simplicity, ease of communications,
portability, sortability, ease-of-use, etc. Thus, some formats are
better for archival and transmission and others for accessibility. For
example, netCDF requires field widths to be a multiple of 8 bits,
while GRIB has no such restriction. A data type which can be
represented most efficiently by 9 bits will require almost twice the
disk space in netCDF as it will in GRIB. However, an advantage of
netCDF is that it is a self-describing format: a rich set of data
descriptors (metadata) can be attached to each data file and to
each variable in a file (see the sketch at the end of this section). These
descriptors may include multi-dimensional grid definitions, scaling
factors, units of the values, titles, comments, etc. With data files
structured in netCDF and with appropriate netCDF software the data may
be quickly accessed without concern or knowledge of the internal
format. GRIB is not a truly self-describing format because the user
needs an external table to decipher the information. GRIB is used by
the world's largest operational meteorological centers (NMC and ECMWF)
for gridded data because it allows the data to be efficiently packed
and thus moved from one site to another. A table which lists specific
attributes of several standards, including how to find out more
about them, is in An Introduction to Atmospheric and Oceanographic Datasets.
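As an illustration of netCDF's self-describing nature, the short sketch below uses the netCDF C interface to attach a title and units to a file and variable (the file name, variable name and attribute values are hypothetical; error checking is omitted for brevity). Any program linked with the netCDF library can later read these descriptors back without knowledge of the internal format:

    #include <string.h>
    #include <netcdf.h>

    int main(void)
    {
        int ncid, time_dim, sst_var, dimids[1];
        const char *title = "Example SST time series";

        /* Define a file with one unlimited dimension and one variable. */
        nc_create("sst.nc", NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim);
        dimids[0] = time_dim;
        nc_def_var(ncid, "sst", NC_FLOAT, 1, dimids, &sst_var);

        /* Attach descriptors (metadata) to the file and the variable. */
        nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
        nc_put_att_text(ncid, sst_var, "units",
                        strlen("degrees_C"), "degrees_C");
        nc_enddef(ncid);
        nc_close(ncid);
        return 0;
    }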