Dataset curation and modification

Very often, datasets and test corpora need to be curated and/or modified before conducting experimental analysis with them. Typical tasks include:

  • cleaning and homogenization

  • validation and filtering

  • storage format change

  • file renaming

  • subset splitting

The enb library provides several tools to help do this in the enb.sets and enb.isets modules.

The enb.sets.FileVersionTable class

The enb.sets.FileVersionTable base class lets you transform an input folder into an output folder in a simple way. You just need to:

  1. Create a subclass of enb.sets.FileVersionTable,

  2. Override its dataset_files_extension attribute to select the extension of the input files that will be processed,

  3. Redefine its enb.sets.FileVersionTable.version() method, which transforms a single input into an output, and

  4. Instantiate your subclass (specifying the input and output dirs) and run its get_df method.

The following toy example shows how to normalize all text files in an input directory converting them to lowercase and removing leading and trailing spaces from each line:

import enb

# 1 - Definition of the FileVersionTable subclass
class TextNormalizationTable(enb.sets.FileVersionTable):
    # 2 - Input file extension definition
    dataset_files_extension = "txt"

    # 3 - Redefinition of the version method
    def version(self, input_path, output_path, row):
        with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
            contents = input_file.read()
            output_file.write("\n".join(line.lower().strip() for line in contents.splitlines()))


if __name__ == '__main__':
    # 4 - Instantiation and execution
    tnt = TextNormalizationTable(
        original_base_dir="original_data",
        version_base_dir="versioned_data",
        csv_support_path="")
    tnt.get_df()

This code is available as a plugin named file_version_example (see Using existing image compression codecs for more information about installing and using plugins), which can be installed with:

enb plugin install file_version_example ./fve
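The text transformation applied by the version method in the example above can also be tried in isolation, without enb. The following is a minimal sketch using only the standard library; normalize_text is a hypothetical helper name introduced here for illustration and is not part of enb.

```python
def normalize_text(contents):
    """Lowercase every line and strip its leading and trailing whitespace."""
    return "\n".join(line.lower().strip() for line in contents.splitlines())


sample = "  Hello World  \n\tSECOND Line\n"
print(repr(normalize_text(sample)))  # 'hello world\nsecond line'
```

Note that the final newline of the input is not kept: splitlines() drops it and join() does not add one back.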

Note

Tip: you can pass check_generated_files=False to the initializer of enb.sets.FileVersionTable so that enb.sets.FileVersionTable.version() is not required to produce a file at the output path passed as argument. This is particularly useful when:

  • renaming files

  • filtering out invalid samples.
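As an illustration of the filtering case, the sketch below shows the kind of logic that could go in the body of a version method when some inputs should produce no output. It is a standalone function using only the standard library and is not taken from enb. The name version_if_not_empty and the "empty file means invalid" rule are assumptions made for this example; the (input_path, output_path, row) signature mirrors the one used in the example above.

```python
import os
import shutil
import tempfile


def version_if_not_empty(input_path, output_path, row=None):
    """Copy input_path to output_path unless the input file is empty.

    Empty files are treated as invalid samples (an assumption of this
    sketch) and no output file is written for them. `row` is unused here.
    """
    if os.path.getsize(input_path) == 0:
        return  # invalid sample: skip it without producing an output file
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    shutil.copyfile(input_path, output_path)


def demo():
    """Run the filter on one valid and one empty file; list the outputs."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        valid_path = os.path.join(tmp_dir, "valid.txt")
        empty_path = os.path.join(tmp_dir, "empty.txt")
        with open(valid_path, "w") as valid_file:
            valid_file.write("data")
        open(empty_path, "w").close()

        out_dir = os.path.join(tmp_dir, "versioned")
        for path in (valid_path, empty_path):
            version_if_not_empty(path, os.path.join(out_dir, os.path.basename(path)))
        return sorted(os.listdir(out_dir))


print(demo())  # ['valid.txt']
```

Because the empty file yields no output, a table that verifies the existence of every output file would presumably reject this run, which is the situation check_generated_files=False is meant for.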

Note

The subdirectory structure of the input set is preserved by default in the output (versioned) directory.
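To make the idea of a preserved subdirectory structure concrete, the following sketch computes where an input file would land if its path relative to the input directory is reused under the output directory. It only illustrates the concept with the standard library and is not enb's implementation; mirrored_output_path is a hypothetical helper name.

```python
import os


def mirrored_output_path(input_path, original_base_dir, version_base_dir):
    """Return the path under version_base_dir that mirrors input_path."""
    relative_path = os.path.relpath(input_path, start=original_base_dir)
    return os.path.join(version_base_dir, relative_path)


input_path = os.path.join("original_data", "books", "en", "a.txt")
# The books/en subfolders are kept below versioned_data
print(mirrored_output_path(input_path, "original_data", "versioned_data"))
```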

Predefined classes

enb includes several predefined subclasses of enb.sets.FileVersionTable (autogenerated list):

  • enb.isets.BILToBSQ: Convert raw images (no header) from band-interleaved line order (BIL) to band-sequential order (BSQ).

  • enb.isets.BIPToBSQ: Convert raw images (no header) from band-interleaved pixel order (BIP) to band-sequential order (BSQ).

  • enb.isets.DivisibleSizeVersion: Crop the spatial dimensions of all (raw) images in a directory so that they are all multiples of a given number. Useful for quickly curating datasets that can be divided into blocks of a given size.

  • enb.fits.FITSVersionTable: Read FITS files and convert them to raw files, sorting them by type (integer or float) and by bits per pixel.

  • enb.isets.ImageVersionTable: Transform all images and save the transformed versions.

  • enb.jpg.JPEGCurationTable: Given a directory tree containing JPEG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.png.PDFToPNG: Take all .pdf files in input dir and save them as .png files into output_dir, maintaining the relative folder structure.

  • enb.pgm.PGMCurationTable: Given a directory tree containing PGM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.png.PNGCurationTable: Given a directory tree containing PNG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.

  • enb.isets.QuantizedImageVersion: Apply uniform quantization and store the results.
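Two of the operations named in the list above are easy to state in a few lines. The sketch below shows, on plain Python lists, what reordering samples from BIP to BSQ means and how a dimension can be cropped to a multiple of a given number. It is not the code used by the enb classes, which operate on image files; bip_to_bsq and largest_multiple are hypothetical names introduced for this illustration.

```python
def bip_to_bsq(samples, width, height, bands):
    """Reorder a flat sequence of samples from BIP to BSQ order.

    BIP stores all bands of the first pixel, then all bands of the second
    pixel, and so on. BSQ stores all pixels of the first band, then all
    pixels of the second band, and so on.
    """
    pixel_count = width * height
    if len(samples) != pixel_count * bands:
        raise ValueError("sample count does not match the given geometry")
    return [samples[pixel * bands + band]
            for band in range(bands)
            for pixel in range(pixel_count)]


def largest_multiple(size, factor):
    """Return the largest multiple of factor that does not exceed size."""
    return size - size % factor


# A 2x1 image with 3 bands: pixel 0 is (r0, g0, b0), pixel 1 is (r1, g1, b1)
bip = ["r0", "g0", "b0", "r1", "g1", "b1"]
print(bip_to_bsq(bip, width=2, height=1, bands=3))
# ['r0', 'r1', 'g0', 'g1', 'b0', 'b1']

# Cropping a dimension of 1000 samples to a multiple of 64
print(largest_multiple(1000, 64))  # 960
```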

If you create your own subclasses, don’t hesitate to submit them to us (e.g., via a pull request on GitHub).