Dataset curation and modification

Datasets and test corpora often need to be curated or modified before they can be used in experiments. Typical tasks include:

  • cleaning and homogenization

  • validation and filtering

  • storage format change

  • file renaming

  • subset splitting

The enb library provides several tools for these tasks in the enb.sets and enb.isets modules.

The enb.sets.FileVersionTable class

The enb.sets.FileVersionTable base class lets you transform an input folder into an output folder in a simple way. You just need to:

  1. Create a subclass of enb.sets.FileVersionTable,

  2. Override its dataset_files_extension attribute to select, by file extension, which input files will be processed,

  3. Redefine its enb.sets.FileVersionTable.version() method, which transforms a single input file into an output file, and

  4. Instantiate your subclass, specifying the input and output directories, and call its get_df method.

The following toy example shows how to normalize all text files in an input directory by converting them to lowercase and removing leading and trailing whitespace from each line:

import enb

# 1 - Definition of the FileVersionTable subclass
class TextNormalizationTable(enb.sets.FileVersionTable):
    # 2 - Input file extension definition
    dataset_files_extension = "txt"

    # 3 - Redefinition of the version method
    def version(self, input_path, output_path, row):
        # Lowercase each line and strip its leading and trailing whitespace
        with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
            contents = input_file.read()
            output_file.write("\n".join(
                line.lower().strip() for line in contents.splitlines()))


if __name__ == '__main__':
    # 4 - Instantiation and execution
    tnt = TextNormalizationTable(
        original_base_dir="original_data",
        version_base_dir="versioned_data",
        csv_support_path="")
    tnt.get_df()

This code is available as a plugin named file_version_example (see Using existing image compression codecs for more information about installing and using plugins). It can be installed into the ./fve folder with:

enb plugin install file_version_example ./fve
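The transformation applied by version() in the toy example can also be tried on its own, without enb. The following minimal sketch applies the same expression to a temporary file. The helper names normalize_text_file and demo are hypothetical and are not part of enb.

```python
import os
import tempfile


def normalize_text_file(input_path, output_path):
    """Same per-file logic as the version() method of the toy example.

    Hypothetical standalone helper for illustration; not part of enb.
    """
    with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
        contents = input_file.read()
        output_file.write(
            "\n".join(line.lower().strip() for line in contents.splitlines()))


def demo():
    """Normalize a small sample file and return the resulting text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_path = os.path.join(tmp_dir, "sample.txt")
        output_path = os.path.join(tmp_dir, "sample_normalized.txt")
        with open(input_path, "w") as sample_file:
            sample_file.write("  Hello World  \n\tSECOND Line\n")
        normalize_text_file(input_path, output_path)
        with open(output_path, "r") as result_file:
            return result_file.read()


if __name__ == "__main__":
    print(repr(demo()))  # 'hello world\nsecond line'
```

Because the lines are joined with "\n", the output has no trailing newline even when the input ends with one.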

Note

Tip: you can pass check_generated_files=False to the initializer of enb.sets.FileVersionTable so that enb.sets.FileVersionTable.version() is not required to produce a file at the output path passed as an argument. This is particularly useful when:

  • renaming files

  • filtering out invalid samples.
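As an illustration of the filtering case, a version() method could copy only the inputs that pass a validity check and skip the rest, relying on check_generated_files=False so that a skipped input is not treated as a missing output. The sketch below shows only the per-file logic as a plain function, so that it runs without enb. The names copy_if_valid and is_valid_sample are hypothetical, and the validity criterion (a non-empty file) is an assumption chosen for this example.

```python
import os
import shutil
import tempfile


def is_valid_sample(path):
    """Hypothetical validity criterion: the file must not be empty."""
    return os.path.getsize(path) > 0


def copy_if_valid(input_path, output_path):
    """Copy input_path to output_path only if the sample is valid.

    Returns True if the output file was produced, and False if the sample
    was filtered out (in which case no output file is written).
    """
    if not is_valid_sample(input_path):
        return False
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
    shutil.copyfile(input_path, output_path)
    return True


def demo():
    """Filter two sample files and return the names present in the output."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        input_dir = os.path.join(tmp_dir, "original_data")
        output_dir = os.path.join(tmp_dir, "versioned_data")
        os.makedirs(input_dir)
        for name, contents in (("good.txt", "data"), ("empty.txt", "")):
            with open(os.path.join(input_dir, name), "w") as sample_file:
                sample_file.write(contents)
        for name in sorted(os.listdir(input_dir)):
            copy_if_valid(os.path.join(input_dir, name),
                          os.path.join(output_dir, name))
        return sorted(os.listdir(output_dir))


if __name__ == "__main__":
    print(demo())  # ['good.txt']
```

Inside a subclass of enb.sets.FileVersionTable, the version(self, input_path, output_path, row) method would then only need to call copy_if_valid(input_path, output_path).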

Note

The subdirectory structure of the input set is preserved by default in the output (versioned) directory.
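The effect of this behavior on file paths can be sketched with the standard library. This only illustrates how an output path mirrors the position of its input within the original directory. It is not enb's actual implementation, and the function name mirrored_output_path is hypothetical.

```python
import os


def mirrored_output_path(input_path, original_base_dir, version_base_dir):
    """Return the path under version_base_dir that mirrors the location of
    input_path relative to original_base_dir."""
    relative_path = os.path.relpath(input_path, start=original_base_dir)
    return os.path.join(version_base_dir, relative_path)


if __name__ == "__main__":
    # An input nested two folders deep keeps the same nesting in the output
    print(mirrored_output_path(
        os.path.join("original_data", "books", "en", "novel.txt"),
        original_base_dir="original_data",
        version_base_dir="versioned_data"))
```

On a POSIX system, this prints versioned_data/books/en/novel.txt.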

Predefined classes

enb includes several predefined subclasses of enb.sets.FileVersionTable (autogenerated list):

  • enb.isets.BILToBSQ: Convert raw images (no header) from band-interleaved line order (BIL) to band-sequential order (BSQ).

  • enb.isets.BIPToBSQ: Convert raw images (no header) from band-interleaved pixel order (BIP) to band-sequential order (BSQ).

  • enb.isets.DivisibleSizeVersion: Crop the spatial dimensions of all (raw) images in a directory so that they are all multiples of a given number. Useful for quickly curating datasets that can be divided into blocks of a given size.

  • enb.compression.fits.FITSVersionTable: Read FITS files and convert them to raw files, sorting them by type (integer or float) and by bits per pixel.

  • enb.isets.ImageVersionTable: Transform all images and save the transformed versions.

  • enb.compression.jpg.JPEGCurationTable: Given a directory tree containing JPEG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.compression.png.PDFToPNG: Take all .pdf files in the input directory and save them as .png files in the output directory, maintaining the relative folder structure.

  • enb.compression.pgm.PGMCurationTable: Given a directory tree containing PGM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.compression.png.PNGCurationTable: Given a directory tree containing PNG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.compression.pgm.PPMCurationTable: Given a directory tree containing PPM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.isets.QuantizedImageVersion: Apply uniform quantization and store the results.

  • enb.isets.ReindexedVersion: Read the N unique (signed or unsigned) sample values in a file, apply a bijective mapping between the original sample values and the index values in [0, 1, …, N-1]. The resulting indices are stored as unsigned integers using width_bytes bytes in the same order as the original samples. The output files share the name of the originals, except for enb data type name tags (e.g., u16be, s16le, etc.), which are transformed to u8be, u16be or u32be depending on the value of width_bytes. In addition to the index data, a second file is created that contains a list of the unique original sample values as well as the original data type. This file is needed to reconstruct the original data from the indices, although no attempt is made at compressing this data. This second file has the same name as the output index file, with the addition of “.meta” at the end of it.
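The reindexing described for enb.isets.ReindexedVersion can be illustrated with a small pure-Python sketch. It works on an in-memory list instead of files, it assumes that the unique values are numbered in ascending order (the description above does not state the order), and all function names are hypothetical. It is not enb's implementation.

```python
def reindex(samples):
    """Map each sample to the index of its value among the unique values.

    Returns (indices, unique_values). Numbering the unique values in
    ascending order is an assumption made for this sketch.
    """
    unique_values = sorted(set(samples))
    value_to_index = {value: index for index, value in enumerate(unique_values)}
    indices = [value_to_index[sample] for sample in samples]
    return indices, unique_values


def restore(indices, unique_values):
    """Invert reindex() using the stored list of unique values."""
    return [unique_values[index] for index in indices]


def minimum_width_bytes(unique_count):
    """Smallest of 1, 2 or 4 bytes able to hold indices 0..unique_count-1.

    Hypothetical helper: it only illustrates why the index width matters.
    """
    for width_bytes in (1, 2, 4):
        if unique_count <= 2 ** (8 * width_bytes):
            return width_bytes
    raise ValueError("More than 2**32 unique values")


if __name__ == "__main__":
    samples = [-300, 7, 7, 1024, -300, 0]
    indices, unique_values = reindex(samples)
    print(indices)        # [0, 2, 2, 3, 0, 1]
    print(unique_values)  # [-300, 0, 7, 1024]
    print(restore(indices, unique_values) == samples)  # True
```

In this sketch, the list of unique values plays the role of the “.meta” file: without it, the indices cannot be mapped back to the original samples.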

If you create your own subclasses, don’t hesitate to submit them to us (e.g., via a pull request on GitHub).