## Pipelines
Canals aims to support pipelines of (close to) arbitrary complexity. It currently supports a variety of topologies, such as:
- Simple linear pipelines
- Branching pipelines where all or only some branches are executed
- Pipelines merging a variable number of inputs, depending on decisions taken upstream
- Simple loops
- Multiple entry components, either alternative or parallel
- Multiple exit components, either alternative or parallel
Check the pipeline's test suite for some examples.
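For example, a simple linear pipeline could be assembled as follows. This is a minimal sketch: it assumes that `AddValue` and `Double` components (like the ones used in the serialization example further down) are already defined.

```python
from canals.pipeline import Pipeline

# Build a three-step linear pipeline: add 1, double, add 1 again.
# AddValue and Double are hypothetical example components, not part of Canals itself.
pipeline = Pipeline()
pipeline.add_component("first_addition", AddValue(add=1))
pipeline.add_component("double", Double())
pipeline.add_component("second_addition", AddValue(add=1))
pipeline.connect("first_addition", "double")
pipeline.connect("double", "second_addition")
```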
### Validation
`Pipeline` performs validation at the connection type level: when `Pipeline.connect()` is called, it uses the `@component.input` and `@component.output` dataclass fields to make sure that the connection is possible.

On top of this, specific connections can be stated explicitly with the syntax `component_name.input_or_output_field`.

For example, let's imagine we have two components with the following I/O declared:
```python
from typing import List

from canals import component


@component
class ComponentA:
    @component.input
    def input(self):
        class Input:
            input_value: int
        return Input

    @component.output
    def output(self):
        class Output:
            output_value: str
        return Output

    def run(self, data):
        return self.output(output_value="hello")


@component
class ComponentB:
    @component.input
    def input(self):
        class Input:
            input_value: str
        return Input

    @component.output
    def output(self):
        class Output:
            output_value: List[str]
        return Output

    def run(self, data):
        return self.output(output_value=["h", "e", "l", "l", "o"])
```
This is the behavior of `Pipeline.connect()`:
```python
pipeline = Pipeline()
pipeline.add_component('component_a', ComponentA())
pipeline.add_component('component_b', ComponentB())

# All of these succeed
pipeline.connect('component_a', 'component_b')
pipeline.connect('component_a.output_value', 'component_b')
pipeline.connect('component_a', 'component_b.input_value')
pipeline.connect('component_a.output_value', 'component_b.input_value')
```
These, instead, fail:
```python
pipeline.connect('component_a', 'component_a')
# canals.errors.PipelineConnectError: Cannot connect 'component_a' with 'component_a': no matching connections available.
# 'component_a':
#  - output_value (str)
# 'component_a':
#  - input_value (int, available)

pipeline.connect('component_b', 'component_a')
# canals.errors.PipelineConnectError: Cannot connect 'component_b' with 'component_a': no matching connections available.
# 'component_b':
#  - output_value (List[str])
# 'component_a':
#  - input_value (int, available)
```
In addition, component names are validated:
```python
pipeline.connect('component_a', 'component_c')
# ValueError: Component named component_c not found in the pipeline.
```
The same applies to input and output names, when they are stated explicitly:
```python
pipeline.connect('component_a.input_value', 'component_b')
# canals.errors.PipelineConnectError: 'component_a.input_value does not exist. Output connections of component_a are: output_value (type str)

pipeline.connect('component_a', 'component_b.output_value')
# canals.errors.PipelineConnectError: 'component_b.output_value does not exist. Input connections of component_b are: input_value (type str)
```
### Save and Load
Pipelines can be serialized to Python dictionaries, which can then be dumped to JSON or any other suitable format (YAML, TOML, HCL, etc.). Serialized pipelines can later be loaded back.
Here is an example of `Pipeline` saving and loading:
```python
import json

from canals.pipeline import Pipeline, save_pipelines, load_pipelines

pipe1 = Pipeline()
pipe2 = Pipeline()

# ... assemble the pipelines ...

# Save the pipelines
save_pipelines(
    pipelines={
        "pipe1": pipe1,
        "pipe2": pipe2,
    },
    path="my_pipelines.json",
    _writer=json.dumps,
)

# Load the pipelines
new_pipelines = load_pipelines(
    path="my_pipelines.json",
    _reader=json.loads,
)

assert new_pipelines["pipe1"] == pipe1
assert new_pipelines["pipe2"] == pipe2
```
Note how the save/load functions accept a `_writer`/`_reader` function: this choice frees us from committing strongly to a specific serialization format. Although a default will be set (be it YAML, TOML, HCL or anything else), the decision can be overridden by passing an explicit reader/writer function to `save_pipelines`/`load_pipelines`.
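For instance, YAML could be swapped in along these lines. This is a sketch under the assumption that, like `json.dumps`/`json.loads` above, the `_writer` turns the pipelines dictionary into a string and the `_reader` does the reverse:

```python
import yaml  # PyYAML, assumed to be installed

# Hypothetical: use YAML instead of JSON as the serialization format.
save_pipelines(
    pipelines={"pipe1": pipe1, "pipe2": pipe2},
    path="my_pipelines.yaml",
    _writer=yaml.safe_dump,   # dict -> YAML string
)
new_pipelines = load_pipelines(
    path="my_pipelines.yaml",
    _reader=yaml.safe_load,   # YAML string -> dict
)
```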
This is what the resulting file would look like, assuming a JSON writer was chosen.
`my_pipelines.json`
```
{
    "pipelines": {
        "pipe1": {
            # All the components that would be added with a
            # Pipeline.add_component() call
            "components": {
                "first_addition": {
                    "type": "AddValue",
                    "init_parameters": {
                        "add": 1
                    },
                },
                "double": {
                    "type": "Double",
                    "init_parameters": {}
                },
                "second_addition": {
                    "type": "AddValue",
                    "init_parameters": {
                        "add": 1
                    },
                },
                # This is how instances of the same component are reused
                "third_addition": {
                    "refer_to": "pipe1.first_addition"
                },
            },
            # All the connections that would be made with a
            # Pipeline.connect() call
            "connections": [
                ("first_addition", "double", "value/value"),
                ("double", "second_addition", "value/value"),
                ("second_addition", "third_addition", "value/value"),
            ],
            # All other Pipeline.__init__() parameters go here.
            "metadata": {"type": "test pipeline", "author": "me"},
            "max_loops_allowed": 100,
        },
        "pipe2": {
            "components": {
                "first_addition": {
                    # We can reference components from other pipelines too!
                    "refer_to": "pipe1.first_addition",
                },
                "double": {
                    "type": "Double",
                    "init_parameters": {}
                },
                "second_addition": {
                    "refer_to": "pipe1.second_addition"
                },
            },
            "connections": [
                ("first_addition", "double", "value/value"),
                ("double", "second_addition", "value/value"),
            ],
            "metadata": {"type": "another test pipeline", "author": "you"},
            "max_loops_allowed": 100,
        },
    },
    # A list of "dependencies" for the application.
    # Used to ensure all external components are present when loading.
    "dependencies": ["my_custom_components_module"],
}
```