Skip navigation links

Package org.apache.jute

Hadoop record I/O contains classes and a record description language translator for simplifying serialization and deserialization of records in a language-neutral manner.

See: Description

Package org.apache.jute Description

Hadoop record I/O contains classes and a record description language translator for simplifying serialization and deserialization of records in a language-neutral manner.

Introduction

Software systems of any significant complexity require mechanisms for data interchange with the outside world. These interchanges typically involve the marshaling and unmarshaling of logical units of data to and from data streams (files, network connections, memory buffers etc.). Applications usually have some code for serializing and deserializing the data types that they manipulate embedded in them. The work of serialization has several features that make automatic code generation for it worthwhile. Given a particular output encoding (binary, XML, etc.), serialization of primitive types and simple compositions of primitives (structs, vectors etc.) is a very mechanical task. Manually written serialization code can be susceptible to bugs especially when records have a large number of fields or a record definition changes between software versions. Lastly, it can be very useful for applications written in different programming languages to be able to share and interchange data. This can be made a lot easier by describing the data records manipulated by these applications in a language agnostic manner and using the descriptions to derive implementations of serialization in multiple target languages. This document describes Hadoop Record I/O, a mechanism that is aimed at The goals of Hadoop Record I/O are similar to those of mechanisms such as XDR, ASN.1, PADS and ICE. While these systems all include a DDL that enables the specification of most record types, they differ widely in what else they focus on. The focus in Hadoop Record I/O is on data marshaling and multi-lingual support. We take a translator-based approach to serialization. Hadoop users have to describe their data in a simple data description language. The Hadoop DDL translator rcc generates code that users can invoke in order to read/write their data from/to simple stream abstractions. Next we list explicitly some of the goals and non-goals of Hadoop Record I/O.

Goals

Non-Goals

The remainder of this document describes the features of Hadoop record I/O in more detail. Section 2 describes the data types supported by the system. Section 3 lays out the DDL syntax with some examples of simple records. Section 4 describes the process of code generation with rcc. Section 5 describes target language mappings and support for Hadoop types. We include a fairly complete description of C++ mappings with intent to include Java and others in upcoming iterations of this document. The last section talks about supported output encodings.

Data Types and Streams

This section describes the primitive and composite types supported by Hadoop. We aim to support a set of types that can be used to simply and efficiently express a wide range of record types in different programming languages.

Primitive Types

For the most part, the primitive types of Hadoop map directly to primitive types in high level programming languages. Special cases are the ustring (a Unicode string) and buffer types, which we believe find wide use and which are usually implemented in library code and not available as language built-ins. Hadoop also supplies these via library code when a target language built-in is not present and there is no widely adopted "standard" implementation. The complete list of primitive types is:

Composite Types

Hadoop supports a small set of composite types that enable the description of simple aggregate types and containers. A composite type is serialized by sequentially serializing it constituent elements. The supported composite types are:

Streams

Hadoop generates code for serializing and deserializing record types to abstract streams. For each target language Hadoop defines very simple input and output stream interfaces. Application writers can usually develop concrete implementations of these by putting a one method wrapper around an existing stream implementation.

DDL Syntax and Examples

We now describe the syntax of the Hadoop data description language. This is followed by a few examples of DDL usage.

Hadoop DDL Syntax


recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record := "class" name "{" 1*(field) "}"
field := type name ";"
name :=  ALPHA (ALPHA / DIGIT / "_" )*
type := (ptype / ctype)
ptype := ("byte" / "boolean" / "int" |
          "long" / "float" / "double"
          "ustring" / "buffer")
ctype := (("vector" "<" type ">") /
          ("map" "<" type "," type ">" ) ) / name)
A DDL file describes one or more record types. It begins with zero or more include declarations, a single mandatory module declaration followed by zero or more class declarations. The semantics of each of these declarations are described below:

Examples

Code Generation

The Hadoop translator is written in Java. Invocation is done by executing a wrapper shell script named named rcc. It takes a list of record description files as a mandatory argument and an optional language argument (the default is Java) --language or -l. Thus a typical invocation would look like:

$ rcc -l C++  ...

Target Language Mappings and Support

For all target languages, the unit of code generation is a record type. For each record type, Hadoop generates code for serialization and deserialization, record comparison and access to record members.

C++

Support for including Hadoop generated C++ code in applications comes in the form of a header file recordio.hh which needs to be included in source that uses Hadoop types and a library librecordio.a which applications need to be linked with. The header declares the Hadoop C++ namespace which defines appropriate types for the various primitives, the basic interfaces for records and streams and enumerates the supported serialization encodings. Declarations of these interfaces and a description of their semantics follow:

namespace hadoop {

  enum RecFormat { kBinary, kXML, kCSV };

  class InStream {
  public:
    virtual ssize_t read(void *buf, size_t n) = 0;
  };

  class OutStream {
  public:
    virtual ssize_t write(const void *buf, size_t n) = 0;
  };

  class IOError : public runtime_error {
  public:
    explicit IOError(const std::string& msg);
  };

  class IArchive;
  class OArchive;

  class RecordReader {
  public:
    RecordReader(InStream& in, RecFormat fmt);
    virtual ~RecordReader(void);

    virtual void read(Record& rec);
  };

  class RecordWriter {
  public:
    RecordWriter(OutStream& out, RecFormat fmt);
    virtual ~RecordWriter(void);

    virtual void write(Record& rec);
  };


  class Record {
  public:
    virtual std::string type(void) const = 0;
    virtual std::string signature(void) const = 0;
  protected:
    virtual bool validate(void) const = 0;

    virtual void
    serialize(OArchive& oa, const std::string& tag) const = 0;

    virtual void
    deserialize(IArchive& ia, const std::string& tag) = 0;
  };
}
Two files are generated for each record file (note: not for each record). If a record file is named "name.jr", the generated files are "name.jr.cc" and "name.jr.hh" containing serialization implementations and record type declarations respectively. For each record in the DDL file, the generated header file will contain a class definition corresponding to the record type, method definitions for the generated type will be present in the '.cc' file. The generated class will inherit from the abstract class hadoop::Record. The DDL files module declaration determines the namespace the record belongs to. Each '.' delimited token in the module declaration results in the creation of a namespace. For instance, the declaration module docs.links results in the creation of a docs namespace and a nested docs::links namespace. In the preceding examples, the Link class is placed in the links namespace. The header file corresponding to the links.jr file will contain:

namespace links {
  class Link : public hadoop::Record {
    // ....
  };
};
Each field within the record will cause the generation of a private member declaration of the appropriate type in the class declaration, and one or more acccessor methods. The generated class will implement the serialize and deserialize methods defined in hadoop::Record+. It will also implement the inspection methods type and signature from hadoop::Record. A default constructor and virtual destructor will also be generated. Serialization code will read/write records into streams that implement the hadoop::InStream and the hadoop::OutStream interfaces. For each member of a record an accessor method is generated that returns either the member or a reference to the member. For members that are returned by value, a setter method is also generated. This is true for primitive data members of the types byte, int, long, boolean, float and double. For example, for a int field called MyField the folowing code is generated.

...
private:
  int32_t mMyField;
  ...
public:
  int32_t getMyField(void) const {
    return mMyField;
  };

  void setMyField(int32_t m) {
    mMyField = m;
  };
  ...
For a ustring or buffer or composite field. The generated code only contains accessors that return a reference to the field. A const and a non-const accessor are generated. For example:

...
private:
  std::string mMyBuf;
  ...
public:

  std::string& getMyBuf() {
    return mMyBuf;
  };

  const std::string& getMyBuf() const {
    return mMyBuf;
  };
  ...

Examples

Suppose the inclrec.jr file contains:

module inclrec {
    class RI {
        int      I32;
        double   D;
        ustring  S;
    };
}
and the testrec.jr file contains:

include "inclrec.jr"
module testrec {
    class R {
        vector VF;
        RI            Rec;
        buffer        Buf;
    };
}
Then the invocation of rcc such as:

$ rcc -l c++ inclrec.jr testrec.jr
will result in generation of four files: inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}. The inclrec.jr.hh will contain:

#ifndef _INCLREC_JR_HH_
#define _INCLREC_JR_HH_

#include "recordio.hh"

namespace inclrec {
  
  class RI : public hadoop::Record {

  private:

    int32_t      mI32;
    double       mD;
    std::string  mS;

  public:

    RI(void);
    virtual ~RI(void);

    virtual bool operator==(const RI& peer) const;
    virtual bool operator<(const RI& peer) const;

    virtual int32_t getI32(void) const { return mI32; }
    virtual void setI32(int32_t v) { mI32 = v; }

    virtual double getD(void) const { return mD; }
    virtual void setD(double v) { mD = v; }

    virtual std::string& getS(void) const { return mS; }
    virtual const std::string& getS(void) const { return mS; }

    virtual std::string type(void) const;
    virtual std::string signature(void) const;

  protected:

    virtual void serialize(hadoop::OArchive& a) const;
    virtual void deserialize(hadoop::IArchive& a);

    virtual bool validate(void);
  };
} // end namespace inclrec

#endif /* _INCLREC_JR_HH_ */

The testrec.jr.hh file will contain:


#ifndef _TESTREC_JR_HH_
#define _TESTREC_JR_HH_

#include "inclrec.jr.hh"

namespace testrec {
  class R : public hadoop::Record {

  private:

    std::vector mVF;
    inclrec::RI        mRec;
    std::string        mBuf;

  public:

    R(void);
    virtual ~R(void);

    virtual bool operator==(const R& peer) const;
    virtual bool operator<(const R& peer) const;

    virtual std::vector& getVF(void) const;
    virtual const std::vector& getVF(void) const;

    virtual std::string& getBuf(void) const ;
    virtual const std::string& getBuf(void) const;

    virtual inclrec::RI& getRec(void) const;
    virtual const inclrec::RI& getRec(void) const;
    
    virtual bool serialize(hadoop::OutArchive& a) const;
    virtual bool deserialize(hadoop::InArchive& a);
    
    virtual std::string type(void) const;
    virtual std::string signature(void) const;
  };
}; // end namespace testrec
#endif /* _TESTREC_JR_HH_ */

Java

Code generation for Java is similar to that for C++. A Java class is generated for each record type with private members corresponding to the fields. Getters and setters for fields are also generated. Some differences arise in the way comparison is expressed and in the mapping of modules to packages and classes to files. For equality testing, an equals method is generated for each record type. As per Java requirements a hashCode method is also generated. For comparison a compareTo method is generated for each record type. This has the semantics as defined by the Java Comparable interface, that is, the method returns a negative integer, zero, or a positive integer as the invoked object is less than, equal to, or greater than the comparison parameter. A .java file is generated per record type as opposed to per DDL file as in C++. The module declaration translates to a Java package declaration. The module name maps to an identical Java package name. In addition to this mapping, the DDL compiler creates the appropriate directory hierarchy for the package and places the generated .java files in the correct directories.

Mapping Summary


DDL Type        C++ Type            Java Type 

boolean         bool                boolean
byte            int8_t              byte
int             int32_t             int
long            int64_t             long
float           float               float
double          double              double
ustring         std::string         Text
buffer          std::string         java.io.ByteArrayOutputStream
class type      class type          class type
vector    std::vector   java.util.ArrayList
map  std::map java.util.TreeMap

Data encodings

This section describes the format of the data encodings supported by Hadoop. Currently, three data encodings are supported, namely binary, CSV and XML.

Binary Serialization Format

The binary data encoding format is fairly dense. Serialization of composite types is simply defined as a concatenation of serializations of the constituent elements (lengths are included in vectors and maps). Composite types are serialized as follows: Serialization of primitives is more interesting, with a zero compression optimization for integral types and normalization to UTF-8 for strings. Primitive types are serialized as follows:

CSV Serialization Format

The CSV serialization format has a lot more structure than the "standard" Excel CSV format, but we believe the additional structure is useful because Serialization formats for the various types are detailed in the grammar that follows. The notable feature of the formats is the use of delimiters for indicating the certain field types. The CSV format can be described by the following grammar:

record = primitive / struct / vector / map
primitive = boolean / int / long / float / double / ustring / buffer

boolean = "T" / "F"
int = ["-"] 1*DIGIT
long = ";" ["-"] 1*DIGIT
float = ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
double = ";" ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]

ustring = "'" *(UTF8 char except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )

buffer = "#" *(BYTE except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )

struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
map = "m{" [*(record "," record)] "}"

XML Serialization Format

The XML serialization format is the same used by Apache XML-RPC (http://ws.apache.org/xmlrpc/types.html). This is an extension of the original XML-RPC format and adds some additional data types. All record I/O types are not directly expressible in this format, and access to a DDL is required in order to convert these to valid types. All types primitive or composite are represented by <value> elements. The particular XML-RPC type is indicated by a nested element in the <value> element. The encoding for records is always UTF-8. Primitive types are serialized as follows: Composite types are serialized as follows: For example:

class {
  int           MY_INT;            // value 5
  vector MY_VEC;            // values 0.1, -0.89, 2.45e4
  buffer        MY_BUF;            // value '\00\n\tabc%'
}
is serialized as

<value>
  <struct>
    <member>
      <name>MY_INT</name>
      <value><i4>5</i4></value>
    </member>
    <member>
      <name>MY_VEC</name>
      <value>
        <array>
          <data>
            <value><ex:float>0.1</ex:float></value>
            <value><ex:float>-0.89</ex:float></value>
            <value><ex:float>2.45e4</ex:float></value>
          </data>
        </array>
      </value>
    </member>
    <member>
      <name>MY_BUF</name>
      <value><string>%00\n\tabc%25</string></value>
    </member>
  </struct>
</value> 
Skip navigation links

Copyright © 2018 The Apache Software Foundation