Before providing some comments on common archived data forms, a few words about computer terminology may be in order. The smallest element of information contained in a computer is called a bit. Each bit may be `on' or `off' and is represented by a `1' or `0', respectively. Computers store both text and numbers as sequences of bits. A sequence of 8 bits is called a byte and is often (but not always) used to describe a ``character'' of text (e.g., `a', `q', `;', ` ', `6', etc.). An ordered sequence of bytes is called a word. Generally, computer workstations and supercomputers used by atmospheric and oceanographic scientists have word lengths of 32 and 64 bits, respectively. If the bytes refer to characters, then a 32-bit word could contain 4 characters, while a 64-bit word could contain 8 characters.

A computer word used for storing floating point numeric values consists of three segments: a sign bit, a characteristic (biased exponent) and a mantissa. An integer is represented by two segments: a sign bit and a sequence of magnitude bits. A 32-bit word can store floating point numbers with six to seven decimal digits of precision, while a 64-bit word can store numbers with thirteen to fourteen decimal digits of precision. Workstations which normally operate with 32-bit words can also use and store numeric data in 64-bit mode by using type declaration statements in FORTRAN (double precision) and C (double and long int).

Most computer systems store numbers with the most significant byte first (called ``big-endian'' form). Other systems use the ``little-endian'' form, in which the byte ordering is reversed. This characteristic of computer hardware architecture can cause some difficulty when using binary data created on different machines. However, software is often available to transform the data to the appropriate form.
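Byte ordering is easy to test empirically. The following minimal C sketch (assuming a C99 compiler for the fixed-width integer type) stores a known 32-bit pattern and inspects which byte appears first in memory:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t word = 0x01020304;              /* a known 32-bit pattern */
        unsigned char *byte = (unsigned char *)&word;

        /* On a big-endian machine the most significant byte (0x01) is
           stored first; on a little-endian machine the least significant
           byte (0x04) comes first. */
        if (byte[0] == 0x01)
            printf("big-endian\n");
        else
            printf("little-endian\n");
        return 0;
    }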
In the early years of computers, the size of a dataset was often described in terms of the number of 7- or 9-track computer tapes required to archive the data. Today, there are a variety of storage media available, and the sizes of datasets are typically described in terms of kilobytes (KB), megabytes (MB), gigabytes (GB) and terabytes (TB). (Note: a kilobyte of computer memory refers to 1024 bytes.) Table 2.1 summarizes these dataset size descriptors. (A reasonably full 9-track tape holds about 130 MB.)
Character format: This is often the most convenient form for the user. The most commonly used character set is that based upon the ASCII standard. The other character set which may be encountered is EBCDIC, an IBM standard. ASCII is a 7-bit code normally stored one character per 8-bit byte, while EBCDIC is an 8-bit code. Conversion from one character set to the other may be accomplished using various software ``tools'' (``filters'' in UNIX jargon). The advantage of using character data is that it may be read directly by a human being or through the standard input and output statements provided in the FORTRAN and C programming languages. Character formats are convenient for `small' datasets that must be used on a variety of machines. However, the number of computer instructions needed to read character data is considerable, and character data take up relatively large amounts of space on external media such as disks or tapes. For example, the numbers `9', `679.43' and `-0.123456E+05' require 1, 6 and 13 bytes, respectively, in ASCII. As explained below, these numbers can be archived much more concisely using `packed-binary' representation.
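The contrast in storage cost, and the conversion step that character data requires, can be seen in a small C sketch (the value chosen here simply reuses the example above):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The same value in character form and in native binary form. */
        const char *text  = "-0.123456E+05";    /* 13 ASCII bytes */
        float       value = -0.123456e+05f;     /* typically 4 bytes */

        printf("character form: %zu bytes\n", strlen(text));
        printf("native float:   %zu bytes\n", sizeof value);

        /* Reading character data requires a conversion step: */
        float parsed;
        sscanf(text, "%f", &parsed);
        printf("parsed value:   %.1f\n", parsed);
        return 0;
    }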
Native Format: Computers from different manufacturers may use different schemes for the representation of both characters and numbers. The internal format used by a particular computer is commonly called its native format. As previously described, characters are generally stored in ASCII, although some machines use EBCDIC. Numeric values often use a standard IEEE format or vendor-specific representations (e.g., Cray, DEC). Reading and writing data in native format can be very fast because no data conversions need to be performed. In addition, no precision is lost. (This can be very important in some numerical models or matrix inversion problems.) However, reading one machine's native format on a computer which uses a different native format can be slow because a conversion algorithm must be used. One additional drawback is that each numeric value archived in native form requires the full word length of the machine (e.g., 32 or 64 bits). Thus each value is stored with full machine precision even if it uses more bits than necessary for the known accuracy of the values (e.g., 12.1 degrees C stored as 12.1357 degrees C).
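A minimal C sketch of native-format I/O follows (the file name is hypothetical). The in-memory bit patterns are copied to disk verbatim, so no conversion is performed and no precision is lost:

    #include <stdio.h>

    int main(void)
    {
        double out[3] = { 12.1357, 679.43, -0.123456e+05 };
        double in[3];
        FILE *fp;

        /* Write: bit patterns go to disk exactly as stored in memory. */
        fp = fopen("native.dat", "wb");
        if (fp == NULL) return 1;
        fwrite(out, sizeof(double), 3, fp);
        fclose(fp);

        /* Read back on the same architecture: equally fast, but the
           file is not portable to a machine with a different native
           representation or byte order. */
        fp = fopen("native.dat", "rb");
        if (fp == NULL) return 1;
        fread(in, sizeof(double), 3, fp);
        fclose(fp);

        printf("%g %g %g\n", in[0], in[1], in[2]);
        return 0;
    }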
Packed Binary: Because atmospheric and oceanographic datasets can be quite large, it is desirable to optimize the amount of information that can be archived on external media. This optimization, representing information using a minimum number of bits, is called packed binary. It is an efficient method for archiving data and is independent of machine representation. Packing data values means expressing integer and, most often, floating point values in sequences of bits just sufficient to capture the required precision of the data. For example, the floating point number ``3.1'' requires 32 or 64 bits in native format, or 24 bits (three ASCII or EBCDIC characters) in character format. However, in packed binary it requires only five bits: scaled by ten, 3.1 becomes the integer 31, which fits in five bits. (Appendix C gives an example of packing and unpacking a number.) Obviously, significant reductions in storage space can be realized.
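A sketch of the offset-and-scale arithmetic behind this example is shown below (the helper name bits_needed is ours; actual packing schemes differ in detail):

    #include <stdio.h>

    /* Number of bits needed to pack values in [vmin, vmax] kept to a
       given precision: scale to non-negative integers, then count the
       bits required for the largest one. */
    static int bits_needed(double vmin, double vmax, double precision)
    {
        unsigned long range = (unsigned long)((vmax - vmin) / precision + 0.5);
        int nbits = 0;
        do { nbits++; range >>= 1; } while (range > 0);
        return nbits;
    }

    int main(void)
    {
        /* 3.1 with offset 0.0 and precision 0.1 scales to the
           integer 31, which needs five bits. */
        printf("%d bits\n", bits_needed(0.0, 3.1, 0.1));
        return 0;
    }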
These packed binary numbers are sequentially stored in, and subsequently retrieved from, a bit stream. NCAR has FORTRAN and C routines, called `gbytes' and `sbytes', that will unpack and pack the bit streams. The software and documentation may be obtained from NCAR via anonymous ftp (see Chapters 10 and 11). Most vendors also provide software that provides bit-level access for manipulation of bit stream data.
Although software must be used to convert the bit groups to a machine's internal format, this conversion is often considerably faster than converting character data. An additional benefit of packed binary is that the smaller volume allows data to be transmitted electronically more efficiently.
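The actual NCAR `gbytes'/`sbytes' interfaces differ, but the simplified C sketch below illustrates the underlying idea: routines that pack and extract arbitrary-width bit fields from a byte stream, here storing 3.1 in five bits after scaling by ten:

    #include <stdio.h>

    /* Pack the 'nbits' low-order bits of 'value' into byte stream
       'buf', starting at bit offset 'bitpos' (big-endian bit order). */
    static void sbyte(unsigned char *buf, long bitpos, int nbits,
                      unsigned long value)
    {
        for (int i = nbits - 1; i >= 0; i--, bitpos++) {
            if ((value >> i) & 1UL)
                buf[bitpos / 8] |=  (1U << (7 - bitpos % 8));
            else
                buf[bitpos / 8] &= ~(1U << (7 - bitpos % 8));
        }
    }

    /* Extract 'nbits' bits starting at bit offset 'bitpos'. */
    static unsigned long gbyte(const unsigned char *buf, long bitpos,
                               int nbits)
    {
        unsigned long value = 0;
        for (int i = 0; i < nbits; i++, bitpos++)
            value = (value << 1) | ((buf[bitpos / 8] >> (7 - bitpos % 8)) & 1U);
        return value;
    }

    int main(void)
    {
        unsigned char stream[8] = {0};

        /* Pack 3.1 as the 5-bit integer 31 (value scaled by 10). */
        sbyte(stream, 0, 5, (unsigned long)(3.1 * 10.0 + 0.5));

        double unpacked = gbyte(stream, 0, 5) / 10.0;
        printf("unpacked value: %.1f\n", unpacked);   /* prints 3.1 */
        return 0;
    }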
Scientific Data Formats
There are a number of ``standard'' scientific data formats.
Documentation and software necessary to implement these formats are
generally available via computer networks (see Appendix B).
Architecture-independent standard formats commonly used for
atmospheric and oceanographic datasets include:
Merging of HDF and netCDF: In July 2005, netCDF 4.0 will be released. Unidata's description follows: "The netCDF API will be extended and implemented on top of the HDF5 data format. NetCDF users will be able to create HDF5 files with benefits not available with the netCDF format, such as much larger files and multiple unlimited dimensions. Backward compatibility in accessing old netCDF files will be supported. The combined library will preserve the desirable common characteristics of netCDF and HDF5 while taking advantage of their separate strengths: the widespread use and simplicity of netCDF and the generality and performance of HDF5."
Why are there so many `standards'? The answer is partly historical and
partly practical. Historically, many agencies developed their own
internal format standards for data archival prior to working with
other organizations. When other groups requested data, the originating
agency sent the data in its own format (of course!). Soon several groups
were using the data in a particular format and it became a
``de-facto'' standard. On a practical level, the development of any
data archiving and exchange format involves trade-offs among various
features: compactness, simplicity, ease of communications,
portability, sortability, ease-of-use, etc. Thus, some formats are
better for archival and transmission and others for accessibility. For
example, netCDF requires field widths to be a multiple of 8 bits,
while GRIB has no such restriction. A data type which can be
represented most efficiently by 9 bits will require almost twice the
disk space in netCDF as it will in GRIB. However, an advantage of
netCDF is that it is a self-describing format: a rich set of data
descriptors (metadata) can be attached to each data file and to
each variable in a file (see the sketch at the end of this section). These
descriptors may include multi-dimensional grid definitions, scaling
factors, units of the values, titles, comments, etc. With data files
structured in netCDF and with appropriate netCDF software the data may
be quickly accessed without concern or knowledge of the internal
format. GRIB is not a truly self-describing format because the user
needs an external table to decipher the information. GRIB is used by
the world's largest operational meteorological centers (NMC and ECMWF)
for gridded data because it allows the data to be efficiently packed
and thus moved from one site to another. A table which lists specific
attributes of several standards, including how to find out more
about them, is in An Introduction to Atmospheric and Oceanographic Datasets.
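As an illustration of netCDF's self-describing nature, the short sketch below uses the netCDF C interface to attach a title and units to a file and variable (the file name, variable name and attribute values are hypothetical; error checking is omitted for brevity). Any program linked with the netCDF library can later read these descriptors back without knowledge of the internal format:

    #include <string.h>
    #include <netcdf.h>

    int main(void)
    {
        int ncid, time_dim, sst_var, dimids[1];
        const char *title = "Example SST time series";

        /* Define a file with one unlimited dimension and one variable. */
        nc_create("sst.nc", NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim);
        dimids[0] = time_dim;
        nc_def_var(ncid, "sst", NC_FLOAT, 1, dimids, &sst_var);

        /* Attach descriptors (metadata) to the file and the variable. */
        nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
        nc_put_att_text(ncid, sst_var, "units",
                        strlen("degrees_C"), "degrees_C");
        nc_enddef(ncid);
        nc_close(ncid);
        return 0;
    }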