In the previous post we introduced CSV (comma-separated values) files as a way to store data. Besides CSV, HDF5 (Hierarchical Data Format) is another common cross-platform storage format.
HDF5 was developed at the University of Illinois at Urbana-Champaign (UIUC). It can store different kinds of images and numerical data, the files can be moved between machines of different types, and a unified library is available for handling this file format.
HDF5 Structure
HDF5 files usually carry the suffix .h5 or .hdf5, and dedicated software is needed to open and preview their contents. An HDF5 file is built from two primary objects: Groups and Datasets.
- Groups work like folders; every HDF5 file is, in fact, a root group '/'.
- Datasets are similar to arrays in NumPy.
Each dataset can be divided into two parts: the raw data values and the metadata (a set of data that describes and gives information about other data, here the raw data).
+-- Dataset
| +-- (Raw) Data Values (eg: a 4 x 5 x 6 matrix)
| +-- Metadata
| | +-- Dataspace (eg: Rank = 3, Dimensions = {4, 5, 6})
| | +-- Datatype (eg: Integer)
| | +-- Properties (eg: Chunked, Compressed)
| | +-- Attributes (eg: attr1 = 32.4, attr2 = "hello", ...)
|
From the structure above:
- Dataspace gives the rank and dimensions of the raw data
- Datatype gives the data type
- Properties describe how the dataset is chunked and compressed (see the short h5py sketch after this list)
  - Chunked: better access time for subsets; extendible
  - Chunked & Compressed: improves storage efficiency and transmission speed
- Attributes are additional user-defined properties of the dataset
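As a quick illustration of chunking and compression (a minimal sketch using h5py, the Python interface introduced later in this post; the file name chunk_demo.hdf5 and the dataset name big are made up for this example), both are simply options passed when a dataset is created:

import h5py
import numpy as np

with h5py.File("chunk_demo.hdf5", "w") as f:
    # Store a 1000 x 1000 array in 100 x 100 chunks and compress it with gzip.
    dset = f.create_dataset("big", data=np.random.rand(1000, 1000),
                            chunks=(100, 100), compression="gzip")
    print(dset.chunks, dset.compression)    # -> (100, 100) gzip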
The overall structure of an HDF5 file then looks like this:
+-- /
| +-- group_1
| | +-- dataset_1_1
| | | +-- attribute_1_1_1
| | | +-- attribute_1_1_2
| | | +-- ...
| | |
| | +-- dataset_1_2
| | | +-- attribute_1_2_1
| | | +-- attribute_1_2_2
| | | +-- ...
| | |
| | +-- ...
| |
| +-- group_2
| | +-- dataset_2_1
| | | +-- attribute_2_1_1
| | | +-- attribute_2_1_2
| | | +-- ...
| | |
| | +-- dataset_2_2
| | | +-- attribute_2_2_1
| | | +-- attribute_2_2_2
| | | +-- ...
| | |
| | +-- ...
| |
| +-- ...
|
Downloading and Installing HDF5
There are several ways to install HDF5. On macOS you can simply run brew install hdf5, and on Linux the corresponding package manager works just as well. Alternatively, you can download a package for your operating system from the official site: https://portal.hdfgroup.org/display/support/Downloads.
After installation you can use the h5dump command in the terminal to inspect the contents of an HDF5 file. The official site also provides HDFView, a Java-based visualization tool that works on all platforms; note that the path of the file you open must not contain Chinese (non-ASCII) characters.
- Note: when you install the h5py library for Python via conda install h5py or pip install h5py, some HDF5 binaries (such as h5dump, h5cc/h5c++, h5fc) and library files are installed along with it. They may be incomplete, in which case the HDF5 C/C++ compiler wrappers h5cc/h5c++ and the Fortran wrapper h5fc will not work properly.
- Solution: if h5c++ fails to compile C++ files, run which h5c++ in the terminal. If it points to a binary inside Python's bin directory, locate the h5c++ installed by brew or another package manager (usually under /usr/local/bin), or the one unpacked from the official download, and add alias h5c++=/usr/local/bin/h5c++ to the .bashrc in your home directory (~) (or to the config file of whatever shell you use, e.g. .zshrc for zsh).
If you want to compile with clang++ or g++ instead of h5c++, you only need to add the appropriate header (-I) and library (-L) flags. First make sure h5c++ itself compiles correctly, then run h5c++ -show in the terminal. It prints CXX_COMPILER + CXX_FLAGS, for example: g++ -I/usr/local/opt/szip/include -L/usr/local/Cellar/hdf5/1.10.6/lib /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5.a -L/usr/local/opt/szip/lib -lsz -lz -ldl -lm. You can therefore compile a C++ file as CXX_COMPILER + XXX.cpp + CXX_FLAGS; because of link-order dependencies, CXX_FLAGS usually goes last and XXX.cpp goes before it, otherwise the build may fail.
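Finally, if you prefer to stay in Python rather than using h5dump, a rough equivalent of browsing a file's contents is to walk it with h5py (a minimal sketch; it assumes the h5py_example.hdf5 file created later in this post):

import h5py

def show(name, obj):
    # Print each object's path, roughly mimicking a bare-bones h5dump listing.
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    print(kind, ":", "/" + name)

with h5py.File("h5py_example.hdf5", "r") as f:
    f.visititems(show)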
Reading and Writing HDF5 Files with Python
- For convenience, the code from this post can also be found in the accompanying GitHub repository: https://github.com/wwang721/Python-Examples/tree/master/HDF5
The Python library h5py is fairly straightforward to use; here is a simple example:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
#
# Created by WW on Jan. 26, 2020
# All rights reserved.
#
import h5py
import numpy as np
def main():
    #===================================================================================
    # Create an HDF5 file.
    f = h5py.File("h5py_example.hdf5", "w")    # mode = {'w', 'r', 'a'}

    # Create two groups under root '/'.
    g1 = f.create_group("bar1")
    g2 = f.create_group("bar2")

    # Create a dataset under root '/'.
    d = f.create_dataset("dset", data=np.arange(16).reshape([4, 4]))

    # Add two attributes to dataset 'dset'.
    d.attrs["myAttr1"] = [100, 200]
    d.attrs["myAttr2"] = "Hello, world!"

    # Create a group and a dataset under group "bar1".
    c1 = g1.create_group("car1")
    d1 = g1.create_dataset("dset1", data=np.arange(10))

    # Create a group and a dataset under group "bar2".
    c2 = g2.create_group("car2")
    d2 = g2.create_dataset("dset2", data=np.arange(10))

    # Save and exit the file.
    f.close()

    ''' h5py_example.hdf5 file structure
    +-- '/'
    |   +-- group "bar1"
    |   |   +-- group "car1"
    |   |   |   +-- None
    |   |   |
    |   |   +-- dataset "dset1"
    |   |
    |   +-- group "bar2"
    |   |   +-- group "car2"
    |   |   |   +-- None
    |   |   |
    |   |   +-- dataset "dset2"
    |   |
    |   +-- dataset "dset"
    |   |   +-- attribute "myAttr1"
    |   |   +-- attribute "myAttr2"
    |   |
    |
    '''

    #===================================================================================
    # Read the HDF5 file.
    f = h5py.File("h5py_example.hdf5", "r")    # mode = {'w', 'r', 'a'}

    # Print the keys of groups and datasets under '/'.
    print(f.filename, ":")
    print([key for key in f.keys()], "\n")

    #===================================================
    # Read dataset 'dset' under '/'.
    d = f["dset"]

    # Print the data of 'dset'.
    print(d.name, ":")
    print(d[:])

    # Print the attributes of dataset 'dset'.
    for key in d.attrs.keys():
        print(key, ":", d.attrs[key])
    print()

    #===================================================
    # Read group 'bar1'.
    g = f["bar1"]

    # Print the keys of groups and datasets under group 'bar1'.
    print([key for key in g.keys()])

    # Three methods to print the data of 'dset1'.
    print(f["/bar1/dset1"][:])      # 1. absolute path
    print(f["bar1"]["dset1"][:])    # 2. relative path: file[][]
    print(g['dset1'][:])            # 3. relative path: group[]

    # Delete a dataset.
    # Notice: the file must be opened in mode 'a' (not 'r') before a dataset can be deleted.
    '''
    del g["dset1"]
    '''

    # Save and exit the file.
    f.close()

if __name__ == "__main__":
    main()
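As a side note, h5py File objects also work as context managers, so a with block can replace the explicit close() calls. A minimal sketch, reusing the file created above (the attribute name myAttr3 is made up):

import h5py

with h5py.File("h5py_example.hdf5", "a") as f:    # mode 'a': read/write, create if missing
    # Add one more attribute; the file is closed automatically when the block
    # exits, even if an exception is raised.
    f["dset"].attrs["myAttr3"] = 3.14
    print(list(f.keys()))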
Reading and Writing HDF5 Files with C++
Reading and writing HDF5 files in C++ is more involved. Following the Examples on the official website, here is one example that creates an HDF5 file and one that reads and edits an HDF5 file:
1. h5cpp_creating.cpp:
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright © 2020 Wei Wang. *
* Created by WW on 2020/01/26. *
* All rights reserved. *
* *
* This example illustrates how to create a dataset that is a 4 x 6 array. *
* Reference: HDF5 Tutorial (https://portal.hdfgroup.org/display/HDF5/HDF5) *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
//
// h5cpp_creating.cpp
// CPP
//
#include <iostream>
#include <string>
#include <cstdio>    // for sprintf
#include "H5Cpp.h"
#ifndef _H5_NO_NAMESPACE_
using namespace H5;
#ifndef _H5_NO_STD_
using std::cout;
using std::endl;
#endif /* _H5_NO_STD_ */
#endif /* _H5_NO_NAMESPACE_ */
#define PI 3.1415926535
/*
* Define the names of HDF5 file, groups, datasets, and attributes.
* Use H5::H5std_string for name strings.
*/
const H5std_string FILE_NAME("h5cpp_example.hdf5");
const H5std_string GROUP_NAME("group1");
const H5std_string DATASET_NAME("dset");
const H5std_string ATTR_NAME1("myAttr1");
const H5std_string ATTR_NAME2("myAttr2");
const int DIM0 = 4; // dataset dimensions
const int DIM1 = 6;
const int RANK = 2;
int main (int argc, char **argv)
{
// Try block to detect exceptions raised by any of the calls inside it.
try
{
/*
* Turn off the auto-printing when failure occurs so that we can
* handle the errors appropriately.
*/
Exception::dontPrint();
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
double data[DIM0][DIM1]; // buffer for data to write
for (int i = 0; i < DIM0; i++)
for (int j = 0; j < DIM1; j++)
data[i][j] = (i + 1) * PI + j;
// Create a new file using the default property lists.
// H5::H5F_ACC_TRUNC : create a new file or overwrite an existing file.
H5File file(FILE_NAME, H5F_ACC_TRUNC);
// Create a group under root '/'.
Group group(file.createGroup(GROUP_NAME));
// Use H5::hsize_t (similar to int) for dimensions.
hsize_t dims[RANK]; // dataset dimensions
dims[0] = DIM0;
dims[1] = DIM1;
// Create the dataspace for a dataset first.
DataSpace dataspace(RANK, dims);
// Create the dataset under group with specified dataspace.
DataSet dataset = group.createDataSet(DATASET_NAME, PredType::NATIVE_DOUBLE, dataspace);
// Write data in buffer to dataset.
dataset.write(data, PredType::NATIVE_DOUBLE);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
int attr1_data[2] = {100, 200}; // buffer for attribute data to write
hsize_t attr1_dims[1] = {2}; // attribute dimension, rank = 1
// Create the dataspace for an attribute first.
DataSpace attr1_dataspace(1, attr1_dims); // rank = 1
// Create the attribute of dataset with specified dataspace.
Attribute attribute1 = dataset.createAttribute(ATTR_NAME1, PredType::STD_I32BE, attr1_dataspace);
// Write data in buffer to attribute.
attribute1.write(PredType::NATIVE_INT, attr1_data);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/* String Data */
char attr2_data[31]; // buffer for attribute data to write (30 chars + terminating '\0')
sprintf(attr2_data, "hello, world!\nAuthor: Wei Wang");
hsize_t attr2_dims[1] = {30}; // attribute dimension (30 chars), rank = 1
// Create the dataspace for an attribute first.
DataSpace attr2_dataspace(1, attr2_dims); // rank = 1
// Create the attribute of dataset with specified dataspace.
Attribute attribute2 = dataset.createAttribute(ATTR_NAME2, PredType::NATIVE_CHAR, attr2_dataspace);
// Write data in buffer to attribute.
attribute2.write(PredType::NATIVE_CHAR, attr2_data);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Save and exit the group.
group.close();
// Save and exit the file.
file.close();
/* h5cpp_example.hdf5 file structure
* +-- '/'
* | +-- group 'group1'
* | | +-- dataset 'dset'
* | | | +-- attribute 'myAttr1'
* | | | +-- attribute 'myAttr2'
* | | |
* | |
* |
*/
} // end of try block
// Catch failure caused by the H5File operations.
catch(FileIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSet operations.
catch(DataSetIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSpace operations.
catch(DataSpaceIException error)
{
error.printErrorStack();
return -1;
}
return 0; // successfully terminated
}
2. h5cpp_reading.cpp:
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright © 2020 Wei Wang. *
* Created by WW on 2020/01/26. *
* All rights reserved. *
* *
* This example illustrates how to read and edit an existing dataset. *
* Reference: HDF5 Tutorial (https://portal.hdfgroup.org/display/HDF5/HDF5) *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
//
// h5cpp_reading.cpp
// CPP
//
#include <iostream>
#include <string>
#include "H5Cpp.h"
#ifndef _H5_NO_NAMESPACE_
using namespace H5;
#ifndef _H5_NO_STD_
using std::cout;
using std::endl;
#endif /* _H5_NO_STD_ */
#endif /* _H5_NO_NAMESPACE_ */
/*
* Define the names of HDF5 file, groups, datasets, and attributes.
* Use H5::H5std_string for name strings.
*/
const H5std_string FILE_NAME("h5cpp_example.hdf5");
const H5std_string GROUP_NAME("group1");
const H5std_string DATASET_NAME("dset");
const H5std_string ATTR_NAME("myAttr2");
int main (int argc, char **argv)
{
// Try block to detect exceptions raised by any of the calls inside it.
try
{
/*
* Turn off the auto-printing when failure occurs so that we can
* handle the errors appropriately
*/
Exception::dontPrint();
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/* HOW TO DELETE A DATASET */
/*
// Open an existing file.
// H5::H5F_ACC_RDWR : read or edit an existing file.
H5File file_d(FILE_NAME, H5F_ACC_RDWR);
// Open an existing group.
Group group_d = file_d.openGroup(GROUP_NAME);
// Use H5::H5Ldelete to delete an existing dataset.
int result = H5Ldelete(group_d.getId(), DATASET_NAME.c_str(), H5P_DEFAULT);
// String.c_str() convert "string" to "const char *".
cout << result << endl;
// Non-negative: successfully deleted;
// Otherwise: fail.
// Save and exit the group.
group_d.close();
// Save and exit the file.
file_d.close();
// Important! The two close()s above can't be omitted!
// Otherwise, the deleting behavior won't be saved to file.
*/
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Open an existing file.
// H5::H5F_ACC_RDWR : read or edit an existing file.
H5File file(FILE_NAME, H5F_ACC_RDWR);
// Open an existing group of the file.
Group group = file.openGroup(GROUP_NAME);
// Open an existing dataset of the group.
DataSet dataset = group.openDataSet(DATASET_NAME);
// Get the dataspace of the dataset.
DataSpace filespace = dataset.getSpace();
// Get the rank of the dataset.
int rank = filespace.getSimpleExtentNdims();
// Use H5::hsize_t (similar to int) for dimensions
hsize_t dims[rank]; // dataset dimensions
// Get the dimensions of the dataset.
rank = filespace.getSimpleExtentDims(dims);
cout << DATASET_NAME << " rank = " << rank << ", dimensions "
<< dims[0] << " x "
<< dims[1] << endl;
// Dataspace for data read from file.
DataSpace myspace(rank, dims);
double data_out[dims[0]][dims[1]]; // buffer for data read from file
// Read data from file to buffer.
dataset.read(data_out, PredType::NATIVE_DOUBLE, myspace, filespace);
for (int i = 0; i < dims[0]; i++)
{
for (int j = 0; j < dims[1]; j++)
cout << data_out[i][j] << " ";
cout << endl;
}
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Read the attribute of the dataset.
cout << endl;
// Open an existing attribute of the dataset.
Attribute attr = dataset.openAttribute(ATTR_NAME);
// Get the dataspace of the attribute.
DataSpace attr_space = attr.getSpace();
// Get the rank of the attribute.
int attr_rank = attr_space.getSimpleExtentNdims();
// Use H5::hsize_t (similar to int) for dimensions.
hsize_t attr_dims[attr_rank]; // attribute dimensions
// Get the dimension of the attribute.
attr_rank = attr_space.getSimpleExtentDims(attr_dims);
cout << ATTR_NAME << " rank = " << attr_rank << ", dimensions " << attr_dims[0] << endl;
char attr_data_out[attr_dims[0] + 1]; // buffer for attribute data read from file (+1 for '\0')
// Read attribute data from file to buffer.
attr.read(PredType::NATIVE_CHAR, attr_data_out);
attr_data_out[attr_dims[0]] = '\0'; // null-terminate before printing as a C string
cout << attr_data_out << endl;
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Save and exit the group.
group.close();
// Save and exit the file.
file.close();
} // end of try block
// Catch failure caused by the H5File operations.
catch(FileIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSet operations.
catch(DataSetIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSpace operations.
catch(DataSpaceIException error)
{
error.printErrorStack();
return -1;
}
return 0; // successfully terminated
}
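Assuming h5c++ is set up as described in the installation section, these two examples can be compiled directly with it, e.g. h5c++ h5cpp_reading.cpp -o h5cpp_reading, or with g++/clang++ plus the flags printed by h5c++ -show.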
Summary
For more advanced uses of the API (Application Program Interface), such as Subsets, Hyperslabs, Chunks, Compression, Single-Writer/Multiple-Reader (SWMR), Parallel HDF5 (i.e. parallel I/O through MPI, the Message Passing Interface), and Virtual Datasets (VDS), see the Documentation on the official website.
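As a small taste of the subset/hyperslab idea (a minimal h5py sketch reusing the h5py_example.hdf5 file from above), slicing a dataset reads only the selected region from disk:

import h5py

with h5py.File("h5py_example.hdf5", "r") as f:
    # Only the selected 2 x 2 block of the 4 x 4 dataset 'dset' is read,
    # which corresponds to an HDF5 hyperslab selection.
    block = f["dset"][1:3, 0:2]
    print(block)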
Besides numerical data, HDF5 files can also be used to store images, PDF files, and even Excel files. Given my current research needs, though, .tsv and .txt still suit me better: they are easier to inspect and easy to read and write across platforms and languages. To me, the .csv files discussed in the previous post already count as a fairly advanced storage format 😆. 🎉 🎉 🎉

