Dataset curation and modification
Very often, datasets and test corpora need to be curated and/or modified before conducting experimental analysis with them. Typical tasks include:
cleaning and homogenization
validation and filtering
storage format change
file renaming
subset splitting
The enb library provides several tools to help do this in the enb.sets and enb.isets modules.
Structure and symbolic links in the ./datasets folder
By default, the ./datasets directory is considered the base path for data samples.
You can copy any number of files into any (potentially multilevel) subdirectory structure. A “corpus” column is automatically added to each data sample with the name of the folder containing it.
You can also use symbolic links, which are treated as regular files. This way:
You can arrange data samples from multiple sources and still have a consistent corpus name.
You can change the name of a symbolic link, and that name will be employed within the experiments.
You can mix symbolic links and regular files as needed.
For instance, the following dataset folder setup (-> indicates a symbolic link):
- datasets/
  - C1
    - A.txt -> /data/source1/A.txt
    - B.txt -> /home/shared/altsource/some_name.txt
  - C2
    - C.txt -> /home/shared/altsource/C.txt
    - D.txt -> /data/source2/D.txt
  - E.txt
would assign corpus “C1” to samples A.txt and B.txt, and “C2” to samples C.txt and D.txt, regardless of the physical folders where those samples are stored.
The data in /home/shared/altsource/some_name.txt would be known as datasets/C1/B.txt to the experiment.
File datasets/E.txt is not a symbolic link and is treated normally.
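The naming rule described above can be illustrated with a short standard-library sketch. It does not use enb and is not enb's implementation: the corpus_of and demo helpers are hypothetical names introduced here, and the sketch only assumes that the corpus is the name of the folder that directly contains the sample (or the link to it), without resolving the link.

```python
import os
import tempfile
from pathlib import Path


def corpus_of(sample_path):
    """Return the name of the folder that directly contains the sample.

    Hypothetical helper (not part of enb): the path is inspected as given,
    so a symbolic link is NOT resolved to its target.
    """
    return Path(sample_path).parent.name


def demo():
    """Build a tiny datasets/ tree with one symbolic link and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)

        # Physical file stored outside of the datasets folder
        source = tmp / "altsource" / "some_name.txt"
        source.parent.mkdir(parents=True)
        source.write_text("sample data\n")

        # Symbolic link with a different name inside corpus folder C1
        link = tmp / "datasets" / "C1" / "B.txt"
        link.parent.mkdir(parents=True)
        os.symlink(source, link)

        # Corpus and sample name come from the link; contents from the target
        return corpus_of(link), link.name, link.read_text()


if __name__ == "__main__":
    print(demo())
```

In this sketch the sample is reported under the link's folder and name (C1 and B.txt), while its contents are read from the physical file, which matches the behavior described for datasets/C1/B.txt.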
The enb.sets.FileVersionTable class
The enb.sets.FileVersionTable base class lets you transform an input folder into an output folder in a simple way. You just need to:
Create a subclass of enb.sets.FileVersionTable,
Overwrite its dataset_files_extension attribute to select the extension of the input files that will be processed,
Redefine its enb.sets.FileVersionTable.version() method, which transforms a single input into an output, and
Instantiate your subclass (specifying the input and output directories) and run its get_df method.
The following toy example shows how to normalize all text files in an input directory converting them to lowercase and removing leading and trailing spaces from each line:
import enb

# 1 - Definition of the FileVersionTable subclass
class TextNormalizationTable(enb.sets.FileVersionTable):
    # 2 - Input file extension definition
    dataset_files_extension = "txt"

    # 3 - Redefinition of the version method
    def version(self, input_path, output_path, row):
        with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
            contents = input_file.read()
            output_file.write("\n".join(l.lower().strip() for l in contents.splitlines()))

if __name__ == '__main__':
    # 4 - Instantiation and execution
    tnt = TextNormalizationTable(
        original_base_dir="original_data",
        version_base_dir="versioned_data",
        csv_support_path="")
    tnt.get_df()
This code is made available as a plugin named file_version_example (see Using existing image compression codecs for more information about installing and using plugins), i.e.,
enb plugin install file_version_example ./fve
Note
Tip: you can pass check_generated_files=False to the initializer of enb.sets.FileVersionTable so that enb.sets.FileVersionTable.version() is not required to produce a file at the output path passed as argument. This is particularly useful when:
renaming files
filtering out invalid samples
Note
The subdirectory structure of the input set is preserved by default in the output (versioned) directory.
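To make the preserved subdirectory structure concrete, the following stand-alone sketch mirrors a folder tree while applying a per-file transform. It does not use enb and is not how enb.sets.FileVersionTable is implemented; version_tree, normalize_text and demo are hypothetical names introduced for illustration, and only the standard library is assumed.

```python
import tempfile
from pathlib import Path


def normalize_text(input_path, output_path):
    """Lowercase a text file and strip each line (same transform as the toy example)."""
    lines = Path(input_path).read_text().splitlines()
    Path(output_path).write_text("\n".join(l.lower().strip() for l in lines))


def version_tree(original_base_dir, version_base_dir, extension, version_fn):
    """Apply version_fn to every file with the given extension.

    Hypothetical helper (not part of enb). Each output is written at the
    same relative path under version_base_dir as its input has under
    original_base_dir. Returns the list of relative paths processed.
    """
    original_base_dir = Path(original_base_dir)
    version_base_dir = Path(version_base_dir)
    produced = []
    for input_path in sorted(original_base_dir.rglob(f"*.{extension}")):
        if not input_path.is_file():
            continue
        relative = input_path.relative_to(original_base_dir)
        output_path = version_base_dir / relative
        # Recreate the input subdirectory structure in the output tree
        output_path.parent.mkdir(parents=True, exist_ok=True)
        version_fn(input_path, output_path)
        produced.append(relative.as_posix())
    return produced


def demo():
    """Version a small nested tree and return what was produced."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original_data"
        versioned = Path(tmp) / "versioned_data"
        (original / "C1" / "sub").mkdir(parents=True)
        (original / "C1" / "sub" / "A.txt").write_text("  Hello World  \nSECOND line ")
        (original / "C1" / "ignored.bin").write_bytes(b"\x00")  # wrong extension
        produced = version_tree(original, versioned, "txt", normalize_text)
        return produced, (versioned / "C1" / "sub" / "A.txt").read_text()


if __name__ == "__main__":
    print(demo())
```

Here original_data/C1/sub/A.txt ends up as versioned_data/C1/sub/A.txt, and files with other extensions are skipped.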
Predefined classes
enb includes several predefined subclasses of enb.sets.FileVersionTable (autogenerated list):
enb.isets.BILToBSQ: Convert raw images (no header) from band-interleaved line order (BIL) to band-sequential order (BSQ).
enb.isets.BIPToBSQ: Convert raw images (no header) from band-interleaved pixel order (BIP) to band-sequential order (BSQ).
enb.isets.DivisibleSizeVersion: Crop the spatial dimensions of all (raw) images in a directory so that they are all multiples of a given number. Useful for quickly curating datasets that can be divided into blocks of a given size.
enb.fits.FITSVersionTable: Read FITS files and convert them to raw files, sorting them by type (integer or float) and by bits per pixel.
enb.isets.ImageVersionTable: Transform all images and save the transformed versions.
enb.jpg.JPEGCurationTable: Given a directory tree containing JPEG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.png.PDFToPNG: Take all .pdf files in input dir and save them as .png files into output_dir, maintaining the relative folder structure.
enb.pgm.PGMCurationTable: Given a directory tree containing PGM images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.png.PNGCurationTable: Given a directory tree containing PNG images, copy those images into a new directory tree in raw BSQ format adding geometry information tags to the output names recognized by enb.isets.load_array_bsq.
enb.isets.QuantizedImageVersion: Apply uniform quantization and store the results.
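As an illustration of the kind of reordering that a BIL to BSQ conversion performs, here is a minimal pure-Python sketch operating on a flat list of samples. It is not the code of enb.isets.BILToBSQ; the bil_to_bsq function is a hypothetical name, and the sketch assumes the usual layouts: BIL stores, for each image row, one line of samples per band, while BSQ stores each band completely before the next one.

```python
def bil_to_bsq(samples, width, height, bands):
    """Reorder a flat BIL sample sequence into BSQ order.

    Hypothetical helper (not part of enb).
    BIL index of sample (band z, row y, column x): (y * bands + z) * width + x
    BSQ index of sample (band z, row y, column x): (z * height + y) * width + x
    """
    if len(samples) != width * height * bands:
        raise ValueError("sample count does not match the given geometry")
    bsq = []
    for z in range(bands):
        for y in range(height):
            # Start of the line for row y, band z in the BIL sequence
            start = (y * bands + z) * width
            bsq.extend(samples[start:start + width])
    return bsq


if __name__ == "__main__":
    # Each value encodes its position as 100 * band + 10 * row + column
    bil = [0, 1, 100, 101, 10, 11, 110, 111]
    print(bil_to_bsq(bil, width=2, height=2, bands=2))
```

In the usage example, the sample values encode their own coordinates, so it is easy to check that all samples of band 0 come before those of band 1 in the result.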
If you create your own subclasses, don’t hesitate to submit them to us (e.g., via a pull request on GitHub).