The code operator runs Python code in a predefined environment. The currently available environments are:
- pyspark:3.2.1
A code operator typically processes the input data and produces an output that later steps in the pipeline can consume. The following image shows this process.
[Image 1: the code operator's input/output flow]
When a code operator runs, it goes through 3 main steps:
- Step #1: Download the input data to the /input/ directory as Parquet files, based on the configuration you defined (details are explained in the Configuration section below)
- Step #2: Run the code you submitted in the environment you selected
- Step #3: Convert the output.parquet file your code created into a table so it can be used later in the pipeline (an example is given in the code section of each environment; a sketch also follows this list)
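To make these steps concrete, here is a minimal main.py sketch for the pyspark environment. The /input/ directory and the output.parquet path follow the steps above; the transformation logic and column name are hypothetical placeholders, and the exact output location may differ per environment (check the code section of the environment you use).

```python
# main.py -- a minimal sketch of a code operator entry point (pyspark).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("code-operator-job").getOrCreate()

# Step #1 result: the operator has already downloaded the configured
# input table to /input/ as Parquet files. If you configured multiple
# input tables, adjust the path to the table you need.
df = spark.read.parquet("/input/")

# Step #2: your own processing logic runs here (hypothetical example:
# stamp each row with the processing time).
result = df.withColumn("processed_at", F.current_timestamp())

# Step #3 input: write output.parquet so the operator can convert it
# into a table for downstream pipeline steps.
result.coalesce(1).write.mode("overwrite").parquet("output.parquet")

spark.stop()
```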
Configuration
The Configuration menu has 4 main sections:
- Uploading File: this should be a zip file that includes at least one main file plus any other files required by the selected environment. For example, if you select the pyspark environment, the zip file should include a requirements.txt and a main.py file (a packaging sketch follows this list).
- Input Table: your code may need to interact with the input data when the code operator runs. The code operator downloads that data to the /input/ directory as Parquet files so your code can read it; unlike CSV, the Parquet format preserves the data type of each column.
- Environment: the environment in which your code runs. This documentation has a section for each environment.
- Machine Type: the code operator runs on GCP instances, so select a machine type that matches your workload.
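As a reference for the Uploading File section above, here is a minimal sketch, using only Python's standard library, of how such a bundle could be packaged. The name bundle.zip is hypothetical; main.py and requirements.txt are the files the pyspark environment expects.

```python
# build_bundle.py -- packages the files the code operator expects
# into a zip archive ("bundle.zip" is a hypothetical name).
import zipfile

with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("main.py")           # entry point executed by the operator
    zf.write("requirements.txt")  # extra Python dependencies, if any
```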