Dataset curation and modification

Very often, datasets and test corpora need to be curated and/or modified before conducting experimental analysis with them. Typical tasks include:

cleaning and homogenization
validation and filtering
storage format change
file renaming
subset splitting

The enb library provides several tools to help do this in the enb.sets and enb.isets modules.

Structure and symbolic links in the ./datasets folder

By default, the ./datasets directory is considered the base path for data samples.

You can copy any number of files into any (potentially multilevel) subdirectory structure. A “corpus” column is automatically added to each data sample with the name of the folder containing it.

You can also use symbolic links, which are treated as regular files. This way:

You can arrange data samples from multiple sources and still have a consistent corpus name.
You can change the name of a symbolic link, and that name will be employed within the experiments.
You can mix symbolic links and regular files as needed.

For instance, the following dataset folder setup (-> indicates a symbolic link):

- datasets/
  - C1
    - A.txt -> /data/source1/A.txt
    - B.txt -> /home/shared/altsource/some_name.txt
  - C2
    - C.txt -> /home/shared/altsource/C.txt
    - D.txt -> /data/source2/D.txt
    - E.txt

would assign corpus “C1” to samples A.txt and B.txt, and “C2” to samples C.txt, D.txt and E.txt, regardless of the physical folders where those samples are stored.
The data in /home/shared/altsource/some_name.txt would be known as datasets/C1/B.txt to the experiment.
File datasets/C2/E.txt is not a symbolic link and is treated normally.
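
The naming rule above can be illustrated with the standard library alone. This is a minimal sketch, not enb's actual implementation: corpus_name is a hypothetical helper that assumes the corpus is simply the name of the folder holding the sample path as given, without resolving symbolic links. It needs a platform where creating symbolic links is allowed.

```python
import os
import tempfile
from pathlib import Path


def corpus_name(sample_path):
    """Hypothetical helper: name of the folder that directly contains the sample.

    The path is deliberately NOT resolved, so a symbolic link reports the
    folder where the link lives, not the folder of its target.
    """
    return Path(sample_path).parent.name


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Physical location of the data, outside the datasets folder.
    source = tmp / "altsource" / "some_name.txt"
    source.parent.mkdir(parents=True)
    source.write_text("sample data\n")

    # datasets/C1/B.txt is a symbolic link to the file above.
    link = tmp / "datasets" / "C1" / "B.txt"
    link.parent.mkdir(parents=True)
    os.symlink(source, link)

    link_corpus = corpus_name(link)                       # folder of the link
    target_corpus = corpus_name(os.path.realpath(link))   # folder of the target
    link_contents = link.read_text()                      # read through the link

print(link_corpus, target_corpus)
```

Under that assumption, the link is reported under the folder it was placed in, while its contents are still those of the target file.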

The `enb.sets.FileVersionTable` class

The enb.sets.FileVersionTable base class lets you transform an input folder into an output folder in a simple way. You just need to:

Create a subclass of enb.sets.FileVersionTable,
Override its dataset_files_extension attribute to select, by extension, the input files that will be processed,
Redefine its enb.sets.FileVersionTable.version() method, which transforms a single input into an output, and
Instantiate your subclass (specifying the input and output dirs) and run its get_df method.

The following toy example shows how to normalize all text files in an input directory converting them to lowercase and removing leading and trailing spaces from each line:

import enb

# 1 - Definition of the FileVersionTable subclass
class TextNormalizationTable(enb.sets.FileVersionTable):
    # 2 - Input file extension definition
    dataset_files_extension = "txt"

    # 3 - Redefinition of the version method
    def version(self, input_path, output_path, row):
        with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
            contents = input_file.read()
            output_file.write("\n".join(l.lower().strip() for l in contents.splitlines()))


if __name__ == '__main__':
    # 4 - Instantiation and execution
    tnt = TextNormalizationTable(
        original_base_dir="original_data",
        version_base_dir="versioned_data",
        csv_support_path="")
    tnt.get_df()

This code is available as a plugin named file_version_example, which can be installed as follows (see “Using existing image compression codecs” for more information about installing and using plugins):

enb plugin install file_version_example ./fve

Note

Tip: you can pass check_generated_files=False to the initializer of enb.sets.FileVersionTable so that enb.sets.FileVersionTable.version() is not required to produce a file with the output path passed as argument. This is particularly useful when:

renaming files
filtering out invalid samples.
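
As an illustration of the filtering case, the sketch below shows what the body of a version() method might look like. It is written as a standalone function with the same parameters (minus self) so that it runs without enb installed. The validity rule is_valid_sample is a hypothetical placeholder, assumed here to mean "the file is not empty".

```python
import os
import shutil
import tempfile


def is_valid_sample(path):
    """Hypothetical validity rule: a sample is valid if it is not empty."""
    return os.path.getsize(path) > 0


def version(input_path, output_path, row=None):
    """Copy valid samples; write nothing for invalid ones.

    Used as a method of a FileVersionTable subclass, the table would need
    check_generated_files=False, because some calls intentionally produce
    no file at output_path.
    """
    if not is_valid_sample(input_path):
        return  # filtered out: no output file is created
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    shutil.copyfile(input_path, output_path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "in", "good.txt")
    empty = os.path.join(tmp, "in", "empty.txt")
    os.makedirs(os.path.dirname(good))
    with open(good, "w") as f:
        f.write("data\n")
    open(empty, "w").close()  # zero-byte file, considered invalid here

    version(good, os.path.join(tmp, "out", "good.txt"))
    version(empty, os.path.join(tmp, "out", "empty.txt"))
    produced = sorted(os.listdir(os.path.join(tmp, "out")))

print(produced)
```

Only the valid sample reaches the output folder, which is the situation in which the default check for generated files would have to be disabled.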

Note

The subdirectory structure of the input set is preserved by default in the output (versioned) directory.
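
The mapping from input to output paths can be pictured as follows. This is only a sketch of the idea, not enb's code: versioned_path is a hypothetical helper that assumes the output path is obtained by re-rooting the input path from the original base directory to the versioned one.

```python
import os


def versioned_path(input_path, original_base_dir, version_base_dir):
    """Hypothetical helper: mirror input_path's position under version_base_dir."""
    relative = os.path.relpath(input_path, start=original_base_dir)
    return os.path.join(version_base_dir, relative)


example = versioned_path(
    input_path=os.path.join("original_data", "C1", "sub", "A.txt"),
    original_base_dir="original_data",
    version_base_dir="versioned_data")
print(example)
```

With this scheme, a file nested several levels deep in the input folder ends up at the same relative position in the output folder.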

Predefined classes

enb includes several predefined subclasses of enb.sets.FileVersionTable (autogenerated list):

enb.isets.BILToBSQ: Convert raw images (no header) from band-interleaved line order (BIL) to band-sequential order (BSQ).
enb.isets.BIPToBSQ: Convert raw images (no header) from band-interleaved pixel order (BIP) to band-sequential order (BSQ).
enb.isets.DivisibleSizeVersion: Crop the spatial dimensions of all (raw) images in a directory so that they are all multiples of a given number. Useful for quickly curating datasets that can be divided into blocks of a given size.
enb.compression.fits.FITSVersionTable: Read FITS files and convert them to raw files, sorting them by type (integer or float) and by bits per pixel.
enb.isets.ImageVersionTable: Transform all images and save the transformed versions.
enb.compression.jpg.JPEGCurationTable: Given a directory tree containing JPEG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.compression.png.PDFToPNG: Take all .pdf files in the input dir and save them as .png files into the output dir, maintaining the relative folder structure.
enb.compression.pgm.PGMCurationTable: Given a directory tree containing PGM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.compression.png.PNGCurationTable: Given a directory tree containing PNG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.compression.pgm.PPMCurationTable: Given a directory tree containing PPM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.isets.QuantizedImageVersion: Apply uniform quantization and store the results.
enb.isets.ReindexedVersion: Read the N unique (signed or unsigned) sample values in a file, apply a bijective mapping between the original sample values and the index values in [0, 1, …, N-1]. The resulting indices are stored as unsigned integers using width_bytes bytes in the same order as the original samples. The output files share the name of the originals, except for enb data type name tags (e.g., u16be, s16le, etc.), which are transformed to u8be, u16be or u32be depending on the value of width_bytes. In addition to the index data, a second file is created that contains a list of the unique original sample values as well as the original data type. This file is needed to reconstruct the original data from the indices, although no attempt is made at compressing this data. This second file has the same name as the output index file, with the addition of “.meta” at the end of it.
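
To make the reindexing idea in the last entry more concrete, here is a minimal in-memory sketch. It is not the code of enb.isets.ReindexedVersion: the functions reindex and restore are hypothetical, file formats and width_bytes are not modeled, and assigning indices in ascending order of sample value is an assumption (the description only requires the mapping to be bijective).

```python
def reindex(samples):
    """Map each sample to the index of its value among the sorted unique values."""
    unique_values = sorted(set(samples))
    value_to_index = {value: index for index, value in enumerate(unique_values)}
    indices = [value_to_index[value] for value in samples]
    return indices, unique_values


def restore(indices, unique_values):
    """Invert reindex() using the list of unique values (the ".meta" role)."""
    return [unique_values[index] for index in indices]


samples = [-300, 7, 7, 1024, -300, 0]
indices, unique_values = reindex(samples)
restored = restore(indices, unique_values)
print(indices, unique_values)
```

The indices keep the order of the original samples and fit in a small unsigned type, while the list of unique values plays the role of the second file needed to recover the original data.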

If you create your own subclasses, don’t hesitate to submit them to us (e.g., via a pull request on GitHub).
