Transports

Motivation

When you use a local or remote workflow queue, all your function returns are automatically stored for later. This is a useful feature for persisting your data, but it does not help you keep control over your files. Therefore, Caliber implements the transport.

The Caliber transport is an interface for saving your global data object and your input and output files in a storage layer. Once you have provided the settings needed to connect to a storage layer, Caliber does the rest automatically.

Common use cases for this feature could be:

  • You split your workflow in two or more parts, where the first part performs calculations, generates data and saves it to a storage layer, and the other workflows read data from the storage layer and do some post-processing.

  • You wish to store related input and output data, as well as input and output files, to later be able to reproduce the workflow run, or to facilitate quality assurance and traceability.

Development and production

It can be useful to think of your project work as divided into two phases: development and production. In the development phase you define your workflows and run them partially or completely to test and eliminate bugs, or to iterate on various design solutions. In the production phase you have one or more complete workflows that you run to produce your final results. The storage layer can be a good way to separate the two phases and make a distinct deploy to production. Whereas you might use various forms of storage in the development phase, where all you need is some temporary storage layer, you should settle for the storage layer with the highest level of integrity in the production phase.

The ZipTransport

The Zip file format is a common archiving and compression standard. The ZipTransport stores your global data object, infiles and outfiles in a Zip file. The ZipTransport is initiated by providing the following environment variables in .caliberenv, where _read refers to the file to restore data and files from and _write refers to the file where data and files are stored:

.caliberenv
caliber_zip_filename_read=myzipfile.zip
caliber_zip_filename_write=myzipfile.zip

View your zip file

The zip file created by the ZipTransport can be opened by any program suitable for working with compressed archives, e.g. 7-Zip, but the easiest way to restore the files is to use the CLI.
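If you prefer to inspect the archive programmatically, Python's built-in zipfile module can list its contents without extracting anything. The archive layout below is an illustrative assumption for the sketch, not the exact structure the ZipTransport produces:

```python
import zipfile

# Create a small archive resembling what a transport might produce.
# The member names below are purely illustrative assumptions.
with zipfile.ZipFile("myzipfile.zip", "w") as zf:
    zf.writestr("globals.pickle", b"...")
    zf.writestr("infiles/input.csv", "a,b\n1,2\n")
    zf.writestr("outfiles/result.txt", "42\n")

# List the archive contents without extracting anything
with zipfile.ZipFile("myzipfile.zip") as zf:
    for name in zf.namelist():
        print(name)
```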

Restore from a transport using the CLI

After you have successfully stored your files and data using a transport, you can access the files through the CLI. To restore the infiles, you can type caliber transport restore infiles followed by -f and the format of your storage layer. E.g.:

caliber transport restore infiles -f zip

If you wish to restore your outfiles, simply replace infiles with outfiles in the command above.

Restore globals in a workflow

The global data object that you send to a storage layer can be restored using the restore_globals_from_storage() function. Try using this function as the first task in your post-processing workflow and do your post-processing on the data restored from the storage layer!

Example

Suppose you ran the example in the previous section with the environment variables shown above to produce the Zip file myzipfile.zip. The workflow below, which uses the functions in post_functions.py below, restores the global data object from myzipfile.zip and performs some calculation based on the results. The workflow is defined in post.py, and is run by typing caliber workflow run post.json after creating post.json with py post.py.

post.py
import caliber
import post_functions

# Define tasks
restore_g = caliber.Task(
    function=caliber.restore_globals_from_storage,
    name='Restore g from storage layer',
    args=[
        'zip',
    ],
)

cubic_root = caliber.Task(
    function=post_functions.root,
    name='Calculate cubic root',
)

print_results = caliber.Task(
    function=post_functions.print_result,
    name='Print output',
)

# Collect tasks in process
post_processing = caliber.Process(
    name='Post-processing',
    tasks=[
        restore_g,
        cubic_root,
        print_results,
    ],
)

# Create workflow
do_post_processing = caliber.Workflow(
    process=post_processing,
    name='Post-process the results found in the storage layer'
)

do_post_processing.to_json('post.json')
post_functions.py
from caliber import g, print


def root():
    """Calculate cubic root of a number."""
    g.root = g.square**(1/3)


def print_result():
    """Prints the output of the workflow."""
    # Get dict representation of g
    g_dct = dict(g)

    # Print output
    print(f'The cubic root:\n{g_dct["root"]:.2f}')

.caliberignore

In many cases, your workflow will generate files that you are not interested in storing for later, e.g. temporary binary files created by external computer programs. Your working directory might also contain files and folders that you do not want Caliber to consider for storage, e.g. the .git folder if your working directory is under git version control, or a venv or env folder if you have initialized a virtual environment. To control this, Caliber implements the .caliberignore file, which works more or less in the same manner as the .gitignore file.

Say you wish that Caliber ignores all files with the ending .txt, a file called temp.bin and a folder called dist. Create a file named .caliberignore in the folder where you run your workflow and write the following:

.caliberignore
*.txt
temp.bin
dist/*

Caliber will now ignore these files and folders in the next run.

For convenience, a list of default ignores is hardcoded in Caliber, including among others the .caliberignore file itself, any environment variable file .*env, the .git folder and virtual environment folders named *env/. To see the full list of default ignored files and folders, open an interpreter, import Caliber and type caliber.config.DEFAULT_IGNORED_FILES.

You can always use the CLI to get an overview of the files in the working directory that Caliber will store if they are modified or created during a workflow run:

caliber transport listfiles

Under the hood of .caliberignore

The .caliberignore file is implemented using the fnmatch.fnmatch() function. Read more about the function and the valid patterns of the .caliberignore file in the official Python docs. Note that even though a file type can be ignored using .caliberignore, infiles of that type are still stored when they are listed in the .attach_files attribute of the Workflow.
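To illustrate the matching, here is a minimal sketch that applies fnmatch.fnmatch() directly with the patterns from the example above. The helper function and file names are assumptions for demonstration, not Caliber's actual implementation:

```python
from fnmatch import fnmatch

# The patterns from the .caliberignore example above
patterns = ['*.txt', 'temp.bin', 'dist/*']

def is_ignored(path):
    """Return True if the path matches any ignore pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)

print(is_ignored('notes.txt'))    # matches *.txt
print(is_ignored('temp.bin'))     # exact match
print(is_ignored('dist/app.js'))  # matches dist/*
print(is_ignored('results.csv'))  # no pattern matches
```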

The TransportInterface

The TransportInterface is an abstract base class which serves as the blueprint for concrete interface implementations. It declares abstract methods for saving data and files to, and restoring them from, a storage layer. A concrete interface provides implementations of all the abstract methods.

Under the hood of the TransportInterface

The TransportInterface implements the following abstract methods: .add_globals(), .restore_globals(), .add_infiles(), .restore_infiles(), .add_outfiles(), and .restore_outfiles(). The _globals methods work on the global data object. The _infiles methods work on the files listed under the attach_files attribute of the workflow, and whereas Speckle only handles text-based files, the transports handle binary files as well. Before and after the workflow run, Caliber searches through the working directory and finds the files that are either new or modified during the run. The _outfiles methods work on these files.
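As a rough sketch (not Caliber's actual code), an abstract base class with these method names might look like the following, here paired with a hypothetical in-memory transport as the concrete implementation:

```python
from abc import ABC, abstractmethod


class TransportInterface(ABC):
    """Blueprint for concrete transport implementations (sketch)."""

    @abstractmethod
    def add_globals(self, g): ...

    @abstractmethod
    def restore_globals(self): ...

    @abstractmethod
    def add_infiles(self, paths): ...

    @abstractmethod
    def restore_infiles(self): ...

    @abstractmethod
    def add_outfiles(self, paths): ...

    @abstractmethod
    def restore_outfiles(self): ...


class InMemoryTransport(TransportInterface):
    """Hypothetical transport that keeps everything in a dict."""

    def __init__(self):
        self._store = {}

    def add_globals(self, g):
        self._store['globals'] = dict(g)

    def restore_globals(self):
        return self._store['globals']

    def add_infiles(self, paths):
        self._store['infiles'] = list(paths)

    def restore_infiles(self):
        return self._store['infiles']

    def add_outfiles(self, paths):
        self._store['outfiles'] = list(paths)

    def restore_outfiles(self):
        return self._store['outfiles']
```

Because the base class is abstract, it cannot be instantiated directly; only subclasses that implement all six methods can.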

Storing the global data object

The global data object is stored using the built-in pickle module. The pickle module implements binary protocols for serializing and deserializing objects, so your global data object can contain any picklable object type. Remember that to be able to restore special types, e.g. numpy.ndarray or pandas.DataFrame, the package that implements the special type must be installed in the environment where you restore.
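A minimal sketch of the round trip, assuming the global data object behaves like a dict (the keys and values below are illustrative):

```python
import pickle

# A stand-in for the global data object (assumed dict-like here)
g = {'square': 9.0, 'root': 9.0 ** (1 / 3)}

# Serialize to bytes, as a transport might before writing to storage
payload = pickle.dumps(g)

# ...later, restore the object from the storage layer
restored = pickle.loads(payload)
print(restored['square'])  # 9.0
```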