Pathway: a Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG

Pathway comes with an easy-to-use Python API, allowing you to seamlessly integrate your favorite Python ML libraries. Pathway code is versatile and robust: you can use it in both development and production environments, handling both batch and streaming data effectively. The same code can be used for local development, CI/CD tests, running batch jobs, handling stream replays, and processing data streams.

Pathway is powered by a scalable Rust engine based on Differential Dataflow and performs incremental computation. Your Pathway code, despite being written in Python, is run by the Rust engine, enabling multithreading, multiprocessing, and distributed computations. The entire pipeline is kept in memory and can be easily deployed with Docker and Kubernetes.

Why Pathway

It's simple. From installation to deployment.

With its full Python compatibility, Pathway is easy to use, from installation to maintenance.

  1. Python native: Pathway is a Python framework, and as such it is compatible with the whole Python ecosystem. It will integrate perfectly into your Python architecture and will allow you to use your favorite libraries.
  2. Installation: you can install Pathway with a simple pip install pathway.
  3. Many data sources: Pathway provides a multitude of connectors to access your favorite data sources. You can also set up your own.
  4. Transformation and Machine Learning: you can easily design your data pipeline using Pathway transformations. You can define your own UDFs, use any Python library, and integrate Machine Learning models.
  5. Many destinations: Pathway provides output connectors to send the results to the destination you want. You can also create your own.
  6. RAG and LLM-ready: Pathway provides most of the common utilities to develop your LLM applications and RAG. This includes complete AI pipelines with structured and unstructured data ingestion, chunking, and indexing.
  7. Data indexing: Pathway offers real-time data indexes (vector search, full text search, and more) allowing you to effortlessly synchronize your index with data sources in real time. Don't bother installing a dedicated vector store, Pathway has got it covered for you!

It's fast, scalable, and safe.

A powerful Rust engine: Pathway is not bound by Python's limits, as it relies on a powerful Rust engine that ensures computations are fast.

  1. Scalable: thanks to the Rust engine, Pathway provides multi-threading, multi-processing and distributed computations. You can easily deploy your Pathway pipeline in the cloud.
  2. Differential Dataflow and incremental computation: Pathway's engine incrementally processes data updates. Results are computed with the minimum work needed, ensuring low latency.
  3. Stateful operations: Pathway supports stateful operations such as groupby and windows.
  4. Persistence: you can save the state of the ongoing computation, be it for updating your pipeline or for recovery.

It takes the pain out of temporal & event data

Batch and stream processing alike: Pathway does both batch and stream processing. No matter your use case, Pathway is a good fit.

  1. Same syntax: Your pipeline can run on both batch and streaming data, without modifying your code.
  2. Same engine: Pathway's unified Rust engine makes your computation fast and scalable, whether you choose batch or stream processing.
  3. Consistent results: for stream processing, Pathway returns outputs in real time that match what you would obtain by processing the same data in batch.
  4. Streaming complexity is hidden: All the challenges of stream processing, such as handling late and out-of-order data, are handled by the engine and hidden from the user.
  5. Time-related operations: Pathway offers advanced temporal operations such as as-of-join and temporal windows.

Python + Rust: the best of both worlds

Pathway combines the convenience of Python with the power of Rust.

Python makes everything easy

Pathway is a fully Python-compatible framework. You can install it with a simple pip install pathway and import it like any other Python library. Pathway provides a Python interface and experience created with data developers in mind. You can easily build pipelines by manipulating Pathway tables and rely on the vast resources and libraries of the Python ecosystem. Pathway also integrates seamlessly into your CI/CD chain, as it is inherently compatible with popular tools such as mypy or pytest.

Your Pathway pipelines can be automatically tested, built, and deployed like any other Python workflow. Pathway can be easily deployed with any container-based method (Docker, Kubernetes) that supports Python-based projects.

Rust makes your pipeline fast and scalable

Pathway relies on a powerful Rust engine to ensure high performance for your pipelines, whether you are dealing with batch or streaming data. Pathway's engine makes the most of Rust's speed and memory safety to provide efficient parallel and distributed processing without being limited by Python's GIL. The engine is based on Differential Dataflow, a computational framework known for efficiently processing large volumes of data. Its incremental computation lets it quickly process data updates: only the minimum work needed by each algorithm or transformation is performed to refresh its results when fresh data arrives.

A unified framework to end the debate between batch and stream processing

Batch processing and stream processing are seen as two distinct approaches to handling data.

Pathway is a unified data processing framework that allows you to use the same code for batch and streaming. All the complexity, including late data and consistency, are automatically handled and hidden from the user. Pathway provides advanced streaming operations, such as temporal windows, while keeping the simplicity of batch processing.

With Pathway, you don't have to choose between batch and stream processing. You can build your pipeline and focus on the data transformation you want to do. The resulting pipeline will work with both batch and stream processing. Not having to distinguish between batch and stream (and use different tools for each) greatly simplifies your architecture (bye-bye, Lambda architecture) and the development of your pipeline.

What can it be used for?

With its unified engine and full Python compatibility, Pathway makes data processing as easy as possible. It's the ideal solution for a wide range of data processing pipelines, including:

  • Real-time analytics on IoT and event data.
  • AI RAG pipelines at scale.
  • Real-time Document Indexing.
  • ETL on unstructured data.


Use-cases and templates

Ready to see what Pathway can do?

Try one of our easy-to-run examples!

Available in both notebook and Docker formats, these ready-to-run examples can be launched in just a few clicks. Pick one and start your hands-on experience with Pathway today!

Event processing and real-time analytics pipelines

With its unified engine for batch and streaming and its full Python compatibility, Pathway makes data processing as easy as possible. It is the ideal solution for event processing and real-time analytics pipelines on live event and IoT data.

AI Pipelines

Pathway provides dedicated LLM tooling to build live LLM and RAG pipelines. Wrappers for the most common LLM services and utilities are included, making working with LLM and RAG pipelines incredibly easy. Check out our LLM xpack documentation.

Features

  • A wide range of connectors: Pathway comes with connectors for external data sources such as Kafka, GDrive, PostgreSQL, or SharePoint. Its Airbyte connector allows you to connect to more than 300 different data sources. If the connector you want is not available, you can build your own custom connector using Pathway's Python connector.
  • Stateless and stateful transformations: Pathway supports stateful transformations such as joins, windowing, and sorting. Many transformations are implemented directly in Rust. In addition to the provided transformations, you can use any Python function: implement your own, or rely on any Python library to process your data.
  • Persistence: Pathway can save the state of the ongoing computation. This allows you to restart your pipeline after an update or a crash. Your pipelines are in good hands with Pathway!
  • Consistency: Pathway handles time for you, making sure all your computations are consistent. In particular, Pathway manages late and out-of-order data points by updating its results whenever new (or late, in this case) data points come into the system. The free version of Pathway provides "at least once" consistency, while the enterprise version provides "exactly once" consistency.
  • Scalable Rust engine: with Pathway's Rust engine, you are free from the usual limits imposed by Python. You can easily do multithreading, multiprocessing, and distributed computations.
  • LLM helpers: Pathway provides an LLM extension with all the utilities to integrate LLMs with your data pipelines (LLM wrappers, parsers, embedders, splitters), including an in-memory real-time vector index and integrations with LlamaIndex and LangChain. You can quickly build and deploy RAG applications with your live documents.

Getting started

Installation

Pathway requires Python 3.10 or above.

You can install the current release of Pathway using pip:

$ pip install -U pathway

Pathway is available on macOS and Linux. Users of other systems should run Pathway on a virtual machine.

Example: computing the sum of positive values in real time.

import pathway as pw

# Define the schema of your data (Optional)
class InputSchema(pw.Schema):
  value: int

# Connect to your data using connectors
input_table = pw.io.csv.read(
  "./input/",
  schema=InputSchema
)

# Define your operations on the data
filtered_table = input_table.filter(input_table.value >= 0)
result_table = filtered_table.reduce(
  sum_value=pw.reducers.sum(filtered_table.value)
)

# Load your results to external systems
pw.io.jsonlines.write(result_table, "output.jsonl")

# Run the computation
pw.run()

You can find more examples here.

Deployment

Locally

To use Pathway, you only need to import it:

import pathway as pw

Now, you can easily create your processing pipeline, and let Pathway handle the updates. Once your pipeline is created, you can launch the computation on streaming data with a one-line command:

pw.run()

You can then run your Pathway project (say, main.py) just like a normal Python script: $ python main.py. Pathway comes with a monitoring dashboard that allows you to keep track of the number of messages sent by each connector and the latency of the system. The dashboard also includes log messages.

Alternatively, you can use the Pathway CLI:

$ pathway spawn python main.py

Pathway natively supports multithreading. To launch your application with 3 threads, you can do as follows:

$ pathway spawn --threads 3 python main.py

To jumpstart a Pathway project, you can use our cookiecutter template.

Docker

You can easily run Pathway using Docker.

Pathway image

You can use the Pathway docker image, using a Dockerfile:

FROM pathwaycom/pathway:latest

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "./your-script.py" ]

You can then build and run the Docker image:

docker build -t my-pathway-app .
docker run -it --rm --name my-pathway-app my-pathway-app

Run a single Python script

When dealing with single-file projects, creating a full-fledged Dockerfile might seem unnecessary. In such scenarios, you can execute a Python script directly using the Pathway Docker image. For example:

docker run -it --rm --name my-pathway-app -v "$PWD":/app pathwaycom/pathway:latest python my-pathway-app.py

Python docker image

You can also use a standard Python image and install Pathway using pip with a Dockerfile:

FROM --platform=linux/x86_64 python:3.10

RUN pip install -U pathway
COPY ./pathway-script.py pathway-script.py

CMD ["python", "-u", "pathway-script.py"]

Kubernetes and cloud

Docker containers are ideally suited for deployment on the cloud with Kubernetes. If you want to scale your Pathway application, you may be interested in our Pathway for Enterprise. Pathway for Enterprise is specially tailored towards end-to-end data processing and real-time intelligent analytics. It scales using distributed computing on the cloud and supports distributed Kubernetes deployment with external persistence setup.

You can easily deploy Pathway using services like Render: see how to deploy Pathway in a few clicks.

Performance

Pathway is designed to outperform state-of-the-art technologies for streaming and batch data processing, including Flink, Spark, and Kafka Streams. It also makes it possible to implement many algorithms/UDFs in streaming mode that are not readily supported by other streaming frameworks (notably temporal joins, iterative graph algorithms, and machine learning routines).

If you are curious, here are some benchmarks to play with.