Java – Create multiple columns from a single Hive UDF

Create multiple columns from a single Hive UDF… here is a solution to the problem.

Create multiple columns from a single Hive UDF

I’m using Amazon EMR and Hive 0.11. I’m trying to create a Hive UDF that will return multiple columns from one UDF call.

For example, I want to call a UDF like this and have it return several (named) columns:

SELECT get_data(columnname) FROM table;

I’m having a hard time finding documentation on this, but I’ve heard it’s possible with a generic UDF. Does anyone know what the evaluate() method needs to return for this to work?

Solution

Use GenericUDTF. After you extend GenericUDTF, your UDTF must implement two important methods: initialize and process.

  • In initialize, check that the argument types are valid and define the structure of the rows the UDTF will return.
    For example, with ObjectInspectorFactory.getStandardStructObjectInspector you declare the output columns: the column names go in the structFieldNames parameter and the column value types in structFieldObjectInspectors. The number of output columns equals the size of the structFieldNames list.
    ObjectInspectors come in two flavors, Java and Hadoop Writable: the Java ones are named javaXXObjectInspector (javaStringObjectInspector wraps java.lang.String), while the Writable ones are named writableXXObjectInspector (writableStringObjectInspector corresponds to hadoop.io.Text).
  • In process, the logic is much like an ordinary UDF, except that you use the ObjectInspectors saved in initialize() to convert the incoming Objects to concrete values such as String, Integer, etc. You then call forward() to emit a row; the row object (forwardColObj in the example below) holds one value per output column.

Here’s a simple example:


import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class UDFExtractDomainMethod extends GenericUDTF {

    // the number of output columns
    private static final Integer OUT_COLS = 2;
    // the row object passed to forward(); one slot per output column
    private transient Object[] forwardColObj = new Object[OUT_COLS];

    private transient ObjectInspector[] inputOIs;

    /**
     * @param argOIs check that the argument is valid.
     * @return the output column structure.
     * @throws UDFArgumentException
     */
    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        if (argOIs.length != 1 || argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || !argOIs[0].getTypeName().equals(serdeConstants.STRING_TYPE_NAME)) {
            throw new UDFArgumentException("split_url only takes one argument with type of string");
        }

        inputOIs = argOIs;
        List<String> outFieldNames = new ArrayList<String>();
        List<ObjectInspector> outFieldOIs = new ArrayList<ObjectInspector>();
        outFieldNames.add("host");
        outFieldNames.add("method");
        // javaStringObjectInspector corresponds to java.lang.String,
        // writableStringObjectInspector corresponds to hadoop.io.Text
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(outFieldNames, outFieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        try {
            // use the ObjectInspector saved in initialize() to convert the input Object to a Java String
            String inUrl = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(objects[0]);
            URI uri = new URI(inUrl);
            forwardColObj[0] = uri.getHost();
            forwardColObj[1] = uri.getRawPath();
            // output one row with two columns
            forward(forwardColObj);
        } catch (URISyntaxException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
    }
}
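
To use the UDTF from Hive, register the class and call it either directly in the SELECT list or through LATERAL VIEW. A minimal sketch, assuming the class above is compiled into a jar; the jar path and the function name extract_domain_method are placeholders:

ADD JAR /path/to/your-udtf.jar;
CREATE TEMPORARY FUNCTION extract_domain_method AS 'UDFExtractDomainMethod';

-- a UDTF called directly must be the only expression in the SELECT list
SELECT extract_domain_method(columnname) AS (host, method) FROM table;

-- LATERAL VIEW lets you select other columns alongside the generated ones
SELECT t.host, t.method
FROM table
LATERAL VIEW extract_domain_method(columnname) t AS host, method;

The AS (host, method) aliases line up positionally with the field names declared in initialize(); if you omit them in the direct call, Hive uses the declared names.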
