In the previous post we introduced CSV (comma-separated values) files as a way to store data. Besides CSV, HDF5 (Hierarchical Data Format) is another common cross-platform storage format.
HDF5 was developed at the University of Illinois at Urbana-Champaign (UIUC). It can store different kinds of images and numerical data, the files can be moved between machines of different types, and a unified library is available for handling this file format.
HDF5 Structure
HDF5 files usually carry the suffix .h5 or .hdf5, and dedicated software is needed to open and preview their contents. An HDF5 file is built from two primary objects: Groups and Datasets.
- Groups work like folders; every HDF5 file is, in fact, a root group '/'.
- Datasets are similar to arrays in NumPy.
Each dataset can be divided into two parts: the raw data values and the metadata (a set of data that describes and gives information about other data, here the raw data).
+-- Dataset
| +-- (Raw) Data Values (eg: a 4 x 5 x 6 matrix)
| +-- Metadata
| | +-- Dataspace (eg: Rank = 3, Dimensions = {4, 5, 6})
| | +-- Datatype (eg: Integer)
| | +-- Properties (eg: Chunked, Compressed)
| | +-- Attributes (eg: attr1 = 32.4, attr2 = "hello", ...)
|
From the structure above:
- Dataspace gives the rank and dimensions of the raw data
- Datatype gives the data type
- Properties describe how the dataset is chunked and compressed (see the short h5py sketch after this list)
  - Chunked: better access time for subsets; extendible
  - Chunked & Compressed: improves storage efficiency and transmission speed
- Attributes are additional user-defined properties of the dataset
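As a quick illustration of chunking and compression (a minimal sketch using h5py, the Python interface introduced later in this post; the file name chunk_demo.hdf5 and the dataset name big are made up for this example), both are simply options passed when a dataset is created:

import h5py
import numpy as np

with h5py.File("chunk_demo.hdf5", "w") as f:
    # Store a 1000 x 1000 array in 100 x 100 chunks and compress it with gzip.
    dset = f.create_dataset("big", data=np.random.rand(1000, 1000),
                            chunks=(100, 100), compression="gzip")
    print(dset.chunks, dset.compression)    # -> (100, 100) gzip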
The overall structure of an HDF5 file then looks like this:
+-- /
| +-- group_1
| | +-- dataset_1_1
| | | +-- attribute_1_1_1
| | | +-- attribute_1_1_2
| | | +-- ...
| | |
| | +-- dataset_1_2
| | | +-- attribute_1_2_1
| | | +-- attribute_1_2_2
| | | +-- ...
| | |
| | +-- ...
| |
| +-- group_2
| | +-- dataset_2_1
| | | +-- attribute_2_1_1
| | | +-- attribute_2_1_2
| | | +-- ...
| | |
| | +-- dataset_2_2
| | | +-- attribute_2_2_1
| | | +-- attribute_2_2_2
| | | +-- ...
| | |
| | +-- ...
| |
| +-- ...
|
Downloading and Installing HDF5
There are several ways to install HDF5. On macOS you can simply run brew install hdf5, and on Linux the corresponding package manager works just as well. Alternatively, you can download a package for your operating system from the official site: https://portal.hdfgroup.org/display/support/Downloads.
After installation you can use the h5dump command in the terminal to inspect the contents of an HDF5 file. The official site also provides HDFView, a Java-based visualization tool that works on all platforms; note that the path of the file you open must not contain Chinese (non-ASCII) characters.
- Note: when you install the h5py library for Python via conda install h5py or pip install h5py, some HDF5 binaries (such as h5dump, h5cc/h5c++, h5fc) and library files are installed along with it. They may be incomplete, in which case the HDF5 C/C++ compiler wrappers h5cc/h5c++ and the Fortran wrapper h5fc will not work properly.
- Solution: if h5c++ fails to compile C++ files, run which h5c++ in the terminal. If it points to a binary inside Python's bin directory, locate the h5c++ installed by brew or another package manager (usually under /usr/local/bin), or the one unpacked from the official download, and add alias h5c++=/usr/local/bin/h5c++ to the .bashrc in your home directory (~) (or to the config file of whatever shell you use, e.g. .zshrc for zsh).
If you want to compile with clang++ or g++ instead of h5c++, you only need to add the appropriate header (-I) and library (-L) flags. First make sure h5c++ itself compiles correctly, then run h5c++ -show in the terminal. It prints CXX_COMPILER + CXX_FLAGS, for example: g++ -I/usr/local/opt/szip/include -L/usr/local/Cellar/hdf5/1.10.6/lib /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_cpp.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5_hl.a /usr/local/Cellar/hdf5/1.10.6/lib/libhdf5.a -L/usr/local/opt/szip/lib -lsz -lz -ldl -lm. You can therefore compile a C++ file as CXX_COMPILER + XXX.cpp + CXX_FLAGS; because of link-order dependencies, CXX_FLAGS usually goes last and XXX.cpp goes before it, otherwise the build may fail.
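Finally, if you prefer to stay in Python rather than using h5dump, a rough equivalent of browsing a file's contents is to walk it with h5py (a minimal sketch; it assumes the h5py_example.hdf5 file created later in this post):

import h5py

def show(name, obj):
    # Print each object's path, roughly mimicking a bare-bones h5dump listing.
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    print(kind, ":", "/" + name)

with h5py.File("h5py_example.hdf5", "r") as f:
    f.visititems(show)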
Reading and Writing HDF5 Files with Python
- For convenience, the code from this post can also be found in the accompanying GitHub repository: https://github.com/wwang721/Python-Examples/tree/master/HDF5
The Python library h5py is fairly straightforward to use; here is a simple example:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
#
# Created by WW on Jan. 26, 2020
# All rights reserved.
#
import h5py
import numpy as np
def main():
    #===================================================================================
    # Create an HDF5 file.
    f = h5py.File("h5py_example.hdf5", "w")    # mode = {'w', 'r', 'a'}

    # Create two groups under root '/'.
    g1 = f.create_group("bar1")
    g2 = f.create_group("bar2")

    # Create a dataset under root '/'.
    d = f.create_dataset("dset", data=np.arange(16).reshape([4, 4]))

    # Add two attributes to dataset 'dset'.
    d.attrs["myAttr1"] = [100, 200]
    d.attrs["myAttr2"] = "Hello, world!"

    # Create a group and a dataset under group "bar1".
    c1 = g1.create_group("car1")
    d1 = g1.create_dataset("dset1", data=np.arange(10))

    # Create a group and a dataset under group "bar2".
    c2 = g2.create_group("car2")
    d2 = g2.create_dataset("dset2", data=np.arange(10))

    # Save and exit the file.
    f.close()

    ''' h5py_example.hdf5 file structure
    +-- '/'
    |   +-- group "bar1"
    |   |   +-- group "car1"
    |   |   |   +-- None
    |   |   |
    |   |   +-- dataset "dset1"
    |   |
    |   +-- group "bar2"
    |   |   +-- group "car2"
    |   |   |   +-- None
    |   |   |
    |   |   +-- dataset "dset2"
    |   |
    |   +-- dataset "dset"
    |   |   +-- attribute "myAttr1"
    |   |   +-- attribute "myAttr2"
    |   |
    |
    '''

    #===================================================================================
    # Read the HDF5 file.
    f = h5py.File("h5py_example.hdf5", "r")    # mode = {'w', 'r', 'a'}

    # Print the keys of groups and datasets under '/'.
    print(f.filename, ":")
    print([key for key in f.keys()], "\n")

    #===================================================
    # Read dataset 'dset' under '/'.
    d = f["dset"]

    # Print the data of 'dset'.
    print(d.name, ":")
    print(d[:])

    # Print the attributes of dataset 'dset'.
    for key in d.attrs.keys():
        print(key, ":", d.attrs[key])
    print()

    #===================================================
    # Read group 'bar1'.
    g = f["bar1"]

    # Print the keys of groups and datasets under group 'bar1'.
    print([key for key in g.keys()])

    # Three methods to print the data of 'dset1'.
    print(f["/bar1/dset1"][:])      # 1. absolute path
    print(f["bar1"]["dset1"][:])    # 2. relative path: file[][]
    print(g['dset1'][:])            # 3. relative path: group[]

    # Delete a dataset.
    # Notice: the file must be opened in mode 'a' (not 'r') before a dataset can be deleted.
    '''
    del g["dset1"]
    '''

    # Save and exit the file.
    f.close()

if __name__ == "__main__":
    main()
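As a side note, h5py File objects also work as context managers, so a with block can replace the explicit close() calls. A minimal sketch, reusing the file created above (the attribute name myAttr3 is made up):

import h5py

with h5py.File("h5py_example.hdf5", "a") as f:    # mode 'a': read/write, create if missing
    # Add one more attribute; the file is closed automatically when the block
    # exits, even if an exception is raised.
    f["dset"].attrs["myAttr3"] = 3.14
    print(list(f.keys()))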
Reading and Writing HDF5 Files with C++
Reading and writing HDF5 files in C++ is more involved. Following the Examples on the official website, here is one example that creates an HDF5 file and one that reads and edits an HDF5 file:
1. h5cpp_creating.cpp:
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright © 2020 Wei Wang. *
* Created by WW on 2020/01/26. *
* All rights reserved. *
* *
* This example illustrates how to create a dataset that is a 4 x 6 array. *
* Reference: HDF5 Tutorial (https://portal.hdfgroup.org/display/HDF5/HDF5) *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
//
// h5cpp_creating.cpp
// CPP
//
#include <iostream>
#include <string>
#include <cstdio>    // for sprintf
#include "H5Cpp.h"
#ifndef _H5_NO_NAMESPACE_
using namespace H5;
#ifndef _H5_NO_STD_
using std::cout;
using std::endl;
#endif /* _H5_NO_STD_ */
#endif /* _H5_NO_NAMESPACE_ */
#define PI 3.1415926535
/*
* Define the names of HDF5 file, groups, datasets, and attributes.
* Use H5::H5std_string for name strings.
*/
const H5std_string FILE_NAME("h5cpp_example.hdf5");
const H5std_string GROUP_NAME("group1");
const H5std_string DATASET_NAME("dset");
const H5std_string ATTR_NAME1("myAttr1");
const H5std_string ATTR_NAME2("myAttr2");
const int DIM0 = 4; // dataset dimensions
const int DIM1 = 6;
const int RANK = 2;
int main (int argc, char **argv)
{
// Try block to detect exceptions raised by any of the calls inside it.
try
{
/*
* Turn off the auto-printing when failure occurs so that we can
* handle the errors appropriately.
*/
Exception::dontPrint();
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
double data[DIM0][DIM1]; // buffer for data to write
for (int i = 0; i < DIM0; i++)
for (int j = 0; j < DIM1; j++)
data[i][j] = (i + 1) * PI + j;
// Create a new file using the default property lists.
// H5::H5F_ACC_TRUNC : create a new file or overwrite an existing file.
H5File file(FILE_NAME, H5F_ACC_TRUNC);
// Create a group under root '/'.
Group group(file.createGroup(GROUP_NAME));
// Use H5::hsize_t (similar to int) for dimensions.
hsize_t dims[RANK]; // dataset dimensions
dims[0] = DIM0;
dims[1] = DIM1;
// Create the dataspace for a dataset first.
DataSpace dataspace(RANK, dims);
// Create the dataset under group with specified dataspace.
DataSet dataset = group.createDataSet(DATASET_NAME, PredType::NATIVE_DOUBLE, dataspace);
// Write data in buffer to dataset.
dataset.write(data, PredType::NATIVE_DOUBLE);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
int attr1_data[2] = {100, 200}; // buffer for attribute data to write
hsize_t attr1_dims[1] = {2}; // attribute dimension, rank = 1
// Create the dataspace for an attribute first.
DataSpace attr1_dataspace(1, attr1_dims); // rank = 1
// Create the attribute of dataset with specified dataspace.
Attribute attribute1 = dataset.createAttribute(ATTR_NAME1, PredType::STD_I32BE, attr1_dataspace);
// Write data in buffer to attribute.
attribute1.write(PredType::NATIVE_INT, attr1_data);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/* String Data */
char attr2_data[31]; // buffer for attribute data to write (30 chars + terminating '\0')
sprintf(attr2_data, "hello, world!\nAuthor: Wei Wang");
hsize_t attr2_dims[1] = {30}; // attribute dimension (30 chars), rank = 1
// Create the dataspace for an attribute first.
DataSpace attr2_dataspace(1, attr2_dims); // rank = 1
// Create the attribute of dataset with specified dataspace.
Attribute attribute2 = dataset.createAttribute(ATTR_NAME2, PredType::NATIVE_CHAR, attr2_dataspace);
// Write data in buffer to attribute.
attribute2.write(PredType::NATIVE_CHAR, attr2_data);
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Save and exit the group.
group.close();
// Save and exit the file.
file.close();
/* h5cpp_example.hdf5 file structure
* +-- '/'
* | +-- group 'group1'
* | | +-- dataset 'dset'
* | | | +-- attribute 'myAttr1'
* | | | +-- attribute 'myAttr2'
* | | |
* | |
* |
*/
} // end of try block
// Catch failure caused by the H5File operations.
catch(FileIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSet operations.
catch(DataSetIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSpace operations.
catch(DataSpaceIException error)
{
error.printErrorStack();
return -1;
}
return 0; // successfully terminated
}
2. h5cpp_reading.cpp:
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright © 2020 Wei Wang. *
* Created by WW on 2020/01/26. *
* All rights reserved. *
* *
* This example illustrates how to read and edit an existing dataset. *
* Reference: HDF5 Tutorial (https://portal.hdfgroup.org/display/HDF5/HDF5) *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
//
// h5cpp_reading.cpp
// CPP
//
#include <iostream>
#include <string>
#include "H5Cpp.h"
#ifndef _H5_NO_NAMESPACE_
using namespace H5;
#ifndef _H5_NO_STD_
using std::cout;
using std::endl;
#endif /* _H5_NO_STD_ */
#endif /* _H5_NO_NAMESPACE_ */
/*
* Define the names of HDF5 file, groups, datasets, and attributes.
* Use H5::H5std_string for name strings.
*/
const H5std_string FILE_NAME("h5cpp_example.hdf5");
const H5std_string GROUP_NAME("group1");
const H5std_string DATASET_NAME("dset");
const H5std_string ATTR_NAME("myAttr2");
int main (int argc, char **argv)
{
// Try block to detect exceptions raised by any of the calls inside it.
try
{
/*
* Turn off the auto-printing when failure occurs so that we can
* handle the errors appropriately
*/
Exception::dontPrint();
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
/* HOW TO DELETE A DATASET */
/*
// Open an existing file.
// H5::H5F_ACC_RDWR : read or edit an existing file.
H5File file_d(FILE_NAME, H5F_ACC_RDWR);
// Open an existing group.
Group group_d = file_d.openGroup(GROUP_NAME);
// Use H5::H5Ldelete to delete an existing dataset.
int result = H5Ldelete(group_d.getId(), DATASET_NAME.c_str(), H5P_DEFAULT);
// String.c_str() convert "string" to "const char *".
cout << result << endl;
// Non-negative: successfully deleted;
// Otherwise: fail.
// Save and exit the group.
group_d.close();
// Save and exit the file.
file_d.close();
// Important! The two close()s above can't be omitted!
// Otherwise, the deleting behavior won't be saved to file.
*/
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Open an existing file.
// H5::H5F_ACC_RDWR : read or edit an existing file.
H5File file(FILE_NAME, H5F_ACC_RDWR);
// Open an existing group of the file.
Group group = file.openGroup(GROUP_NAME);
// Open an existing dataset of the group.
DataSet dataset = group.openDataSet(DATASET_NAME);
// Get the dataspace of the dataset.
DataSpace filespace = dataset.getSpace();
// Get the rank of the dataset.
int rank = filespace.getSimpleExtentNdims();
// Use H5::hsize_t (similar to int) for dimensions
hsize_t dims[rank]; // dataset dimensions
// Get the dimensions of the dataset.
rank = filespace.getSimpleExtentDims(dims);
cout << DATASET_NAME << " rank = " << rank << ", dimensions "
<< dims[0] << " x "
<< dims[1] << endl;
// Dataspace for data read from file.
DataSpace myspace(rank, dims);
double data_out[dims[0]][dims[1]]; // buffer for data read from file
// Read data from file to buffer.
dataset.read(data_out, PredType::NATIVE_DOUBLE, myspace, filespace);
for (int i = 0; i < dims[0]; i++)
{
for (int j = 0; j < dims[1]; j++)
cout << data_out[i][j] << " ";
cout << endl;
}
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Read the attribute of the dataset.
cout << endl;
// Open an existing attribute of the dataset.
Attribute attr = dataset.openAttribute(ATTR_NAME);
// Get the dataspace of the attribute.
DataSpace attr_space = attr.getSpace();
// Get the rank of the attribute.
int attr_rank = attr_space.getSimpleExtentNdims();
// Use H5::hsize_t (similar to int) for dimensions.
hsize_t attr_dims[attr_rank]; // attribute dimensions
// Get the dimension of the attribute.
attr_rank = attr_space.getSimpleExtentDims(attr_dims);
cout << ATTR_NAME << " rank = " << attr_rank << ", dimensions " << attr_dims[0] << endl;
char attr_data_out[attr_dims[0] + 1]; // buffer for attribute data read from file (+1 for '\0')
// Read attribute data from file to buffer.
attr.read(PredType::NATIVE_CHAR, attr_data_out);
attr_data_out[attr_dims[0]] = '\0'; // null-terminate before printing as a C string
cout << attr_data_out << endl;
/* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
// Save and exit the group.
group.close();
// Save and exit the file.
file.close();
} // end of try block
// Catch failure caused by the H5File operations.
catch(FileIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSet operations.
catch(DataSetIException error)
{
error.printErrorStack();
return -1;
}
// Catch failure caused by the DataSpace operations.
catch(DataSpaceIException error)
{
error.printErrorStack();
return -1;
}
return 0; // successfully terminated
}
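Assuming h5c++ is set up as described in the installation section, these two examples can be compiled directly with it, e.g. h5c++ h5cpp_reading.cpp -o h5cpp_reading, or with g++/clang++ plus the flags printed by h5c++ -show.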
Summary
For more advanced uses of the API (Application Program Interface), such as Subsets, Hyperslabs, Chunks, Compression, Single-Writer/Multiple-Reader (SWMR), Parallel HDF5 (i.e. parallel I/O through MPI, the Message Passing Interface), and Virtual Datasets (VDS), see the Documentation on the official website.
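As a small taste of the subset/hyperslab idea (a minimal h5py sketch reusing the h5py_example.hdf5 file from above), slicing a dataset reads only the selected region from disk:

import h5py

with h5py.File("h5py_example.hdf5", "r") as f:
    # Only the selected 2 x 2 block of the 4 x 4 dataset 'dset' is read,
    # which corresponds to an HDF5 hyperslab selection.
    block = f["dset"][1:3, 0:2]
    print(block)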
Besides numerical data, HDF5 files can also be used to store images, PDF files, and even Excel files. Given my current research needs, though, .tsv and .txt still suit me better: they are easier to inspect and easy to read and write across platforms and languages. To me, the .csv files discussed in the previous post already count as a fairly advanced storage format 😆. 🎉 🎉 🎉

