rasmusab 8 days ago

Pure Python scripts, maybe using the #%% convention (https://code.visualstudio.com/docs/python/jupyter-support-py...) so you get the best of both notebooks and scripts, in a right-sized instance/container/machine. And if you need to run jobs in parallel, orchestrate with make, like so: https://www.sumsar.net/blog/makefile-recipe-python-data-pipe...
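
For anyone unfamiliar with the #%% convention: it is just special comments in an ordinary .py file that editors like VS Code treat as notebook-style cells. A minimal sketch (the data and step names here are made up):

```python
# %% [markdown]
# Each "# %%" line starts a cell that the editor can run interactively,
# but the file remains a plain Python script you can run end to end.

# %% Build some example data
records = [{"key": "a", "value": 1}, {"key": "a", "value": 2}, {"key": "b", "value": 3}]

# %% Aggregate per key
totals = {}
for r in records:
    totals[r["key"]] = totals.get(r["key"], 0) + r["value"]

# %% Inspect the result
print(totals)  # {'a': 3, 'b': 3}
```

Because it is still a plain script, the same file works in cron, make, or CI without any notebook tooling.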

  • niwtsol 7 days ago

    Yeah, I love this: pure Python with cron or periodic tasks (e.g., Django) works great. Celery tasks for parallelization, and if you pipe logs/alerts into a Slack channel, you can actually get really far without needing a "proper" orchestration layer.
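
For the simple fan-out case, you don't even need Celery; the stdlib covers it. A sketch using `concurrent.futures` instead (the task body and table names are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(table_name):
    # Placeholder for real extract/load work (API call, DB copy, etc.)
    return f"loaded {table_name}"

tables = ["users", "orders", "events"]

# Fan out independent jobs; map() preserves input order in its results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(load_table, tables))

print(results)  # ['loaded users', 'loaded orders', 'loaded events']
```

Celery buys you the same fan-out across machines plus a broker-backed queue, which matters once one box isn't enough.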

    I recently took over an Airflow system from a former colleague, and in our case, it’s just overly complex for what’s really a pretty simple data flow.

    • mettamage 7 days ago

      I don’t know much about Airflow.

      But isn’t it also just Python with cron?

      • datadrivenangel 5 days ago

        Airflow is Python with cron, plus the option of very sophisticated and useful orchestration features, like retries, dependencies, etc. All the stuff you'll end up rolling yourself as your simple scheduled tasks grow.
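
To make the "stuff you'll end up rolling yourself" concrete, here is a sketch of hand-rolled retry logic, the kind of thing Airflow gives you as a one-line `retries=` setting (the flaky job is a contrived stand-in):

```python
import time

def run_with_retries(fn, retries=3, delay=0.0):
    """Re-run fn up to `retries` times, re-raising the last error."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise          # out of attempts: propagate the failure
            time.sleep(delay)  # back off before the next attempt

# A contrived job that fails twice, then succeeds.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retries(flaky_job, retries=3)
print(result, "after", calls["n"], "attempts")  # ok after 3 attempts
```

Add dependency ordering, backfills, and alerting on top and you have rebuilt a decent chunk of an orchestrator.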

vitorbaptistaa 8 days ago

My experience so far:

* Luigi -- extensive usage (4y+)

* Makefiles -- (15y+)

* GitHub Actions -- (4y+)

* Airflow -- little usage (<6 months)

* Dagster -- very little, just trying it out

* Prefect -- just followed tutorial

Although it lacks a lot of the monitoring and the advanced web UI other platforms have (or maybe precisely because of that), Luigi is the simplest to reason about, IMHO.

For a new project that will require complex orchestrations, I'd probably go with Dagster or Prefect nowadays. Dagster seems more complex and more powerful with its data lineage functionality, but I have very little experience with either tool.

If it's a simple project, a mix of Makefiles + GH Actions can work well.

  • vector_spaces 8 days ago

    Is there anything even more lightweight, where you don't have to write your code any differently? For instance, say I have 10 jobs that don't depend on each other, all of them pretty small.

    Dagster and even Luigi feel like overkill, but I'd still like to plug those jobs into a unified interface where I can view previous runs, mainly logs and exit codes. Being able to do some light job configuration or add retries would be nice but not required. For the moment I just use a logging handler that writes to a database table, and that's fine.

    • disgruntledphd2 8 days ago

      I think Airflow 2 added a decorator-based mode (the TaskFlow API) that you can use directly on plain functions.

      Honestly, just use Airflow; it has its issues, but it sucks in well-known and predictable ways.
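
For reference, the decorator mode mentioned above looks roughly like this. This is a sketch of an Airflow 2 TaskFlow-style DAG definition (essentially configuration for the scheduler; the names, schedule, and task bodies are made up, and it needs a running Airflow deployment to actually execute):

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(
    schedule="@daily",                  # "schedule_interval" on Airflow < 2.4
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
)
def example_pipeline():
    @task
    def extract():
        return [1, 2, 3]                # placeholder data

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())                     # passing the value wires the dependency

example_pipeline()
```

Note that the functions stay ordinary Python; the decorators add the scheduling, retry, and dependency machinery around them.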

    • cicdw 7 days ago

      One of the goals of Prefect's SDK is to be minimally invasive from a code standpoint (in the simplest case you only need two lines to convert a script to a `flow`). Our deployment model also makes infrastructure job config a first-class citizen, so you might have a good time trying it out. (Disclosure: I work at Prefect.)
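
In case it helps, the "two lines" in the simplest case are the import and the decorator. A sketch (Prefect 2.x; the function body is a stand-in, and running it assumes a Prefect installation):

```python
from prefect import flow   # line one

@flow                      # line two: turns the entry point into a tracked flow
def my_script():
    data = [1, 2, 3]       # stand-in for the script's real work
    return sum(data)

if __name__ == "__main__":
    my_script()            # runs locally, with Prefect recording the flow run
```

Everything else in the script stays untouched, which is the point being made about minimal invasiveness.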

      • aradox66 7 days ago

        Love Prefect! But for workflows involving concurrency, Prefect code needs to get somewhat invasive.

        Prefect relies on prefect.task()-wrapped functions as the lowest granularity of concurrency in a program, and requires you to use the (somewhat immature) Prefect task APIs to implement that concurrency.

        more on this complaint here: https://austinweisgrau.github.io/migrating-to-prefect-part-3...
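
Concretely, the pattern being described looks roughly like this in Prefect 2.x (a sketch; the task body is made up, and it assumes a Prefect installation):

```python
from prefect import flow, task

@task
def fetch(i: int) -> int:
    return i * i               # placeholder for real I/O-bound work

@flow
def fan_out():
    # .submit() hands each task call to Prefect's task runner so the
    # calls can overlap; .result() blocks until each one finishes.
    futures = [fetch.submit(i) for i in range(5)]
    return [f.result() for f in futures]
```

The invasiveness complaint is that any code you want to run concurrently has to be factored into these `@task` units rather than using plain threads or asyncio directly.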

        • cicdw 7 days ago

          This is an excellent write-up, thank you for sharing! Yeah, our concurrency API needs an upgrade; coincidentally, this is going to be a theme of the next sprint or two, so I hope I can report some improvements back soon.

PaulHoule 8 days ago

Straightforward programs in languages like Java, Python, etc.

The tools you describe all have the endpoint "you can't get there from here", and the only difference is whether it takes you 5 seconds, 5 minutes, 5 days, 5 weeks, or 5 months to learn that.

djsjajah 7 days ago

A few people have mentioned Dagster, and I took a look at it for some machine learning things I was playing with, but then I found DVC (data version control [1]) and I think it's fantastic. It also has more applications than just machine learning; really anything with data. If you have a bunch of shell scripts that write to files to pass data around, then DVC might be a good fit: it will do things like only rerun steps if it needs to. Also, for totally non-data stuff, Prefect is great.

[1] https://dvc.org

saturn8601 7 days ago

I used to work for an automation company that produced a product called ActiveBatch. It was such an amazing tool for drag-and-drop automation. Its focus was full-fledged workflow automation, not just data orchestration.

What I loved was its simplicity plus its out-of-the-box features. Setting it up took just a simple MS SQL DB + an installer. Bam, you were up and running with an absolutely rock-solid scheduler (I've seen a million+ jobs running on it without it breaking a sweat). Then you could install (or use it to deploy) execution agents on all the servers you wanted as workers.

It also installed a robust desktop GUI with many services built in and ready to go (anything from executing scripts all the way to performing direct actions against countless products a company would have, or against various cloud services).

There were so many pre-built actions where all you had to do was input credentials, and it would enumerate the appropriate properties from that service automatically. Then you could connect things together (e.g., pull something from the cloud, process it on some other server, store it, pass it along to another service, whatever you wanted).

The only problem was that this is very much a B2B application, and their sales team is really only interested in selling to enterprises, not end users. I really wish we had something like this that regular people could download.

Everything I've seen listed here requires extensive setup, requires coding, or lacks a robust desktop GUI, offering instead some half-baked web GUI that might require dropping back down to scripts/coding. You could set up hundreds or thousands of automated steps in ActiveBatch without writing a single line of code. I miss that product.

  • margor 6 days ago

    As someone who worked adjacent to people running thousands of jobs in ActiveBatch: that software was indeed very simple to use, and its GUI might have been awesome, but it was a double-edged sword. With hundreds of people working on it, it became a maintenance nightmare, and promoting changes between environments was non-existent, causing multiple incidents.

    Mind you, it might have just been the culture at that place, but I don't think it's as good an example as you make it out to be. Sure, it was easy to get started with and made life easier at the beginning, but running it at scale was not in any way easy.

    • saturn8601 4 days ago

      How far back are you talking about? Do you remember which version? When I was working there, improvements were being made to "Change Management", i.e., promoting changes from Test -> Production, and I heard the improvements continued after I left. At the time it was a ~50-person company that was very focused, so this was a pain point they were well aware of.

    • pramodbiligiri 6 days ago

      The parent comment made me curious enough to go look it up. Is it this same ActiveBatch that you both are referring to? https://www.advsyscon.com/

      • saturn8601 4 days ago

        Yes. It's best to Google for some real pics of the GUI; they just have "drawings" on their site.

        When I was there it was not owned by Redwood; after 30+ years in business, the original owners sold the company to them a few years back to retire.

      • margor 4 days ago

        ActiveBatch by Redwood, correct. It's widely used by some hedge funds.

rich_sasha 7 days ago

I wrote my own in half a day. Worked 24/7 for 3 years... then I quit.

Seriously, it took me much less time than setting up Airflow. It even had a webpage in the end, with all the tasks, a tree view, downstream and upstream tasks (these were incremental improvements beyond the initial half day), a CLI... the works.

I now know the points of fragility I didn't know before, but I'd do it again.

itfollowsthen 7 days ago

At my last startup I asked a friend to help me debug an Airflow DAG. He just pip-installed Prefect, and I've never really looked back. At the time, everything else felt too hard to figure out.

speedgoose 7 days ago

I like having containers running as CronJobs or Deployments in Kubernetes, but Argo Workflows has been a pretty reliable Kubernetes add-on for the more advanced scenarios.

However, it's only simple if you're already familiar with software containers and Kubernetes. But those are perhaps better to learn than having to deal with dependency hell in Python or Java.

recursive4 8 days ago

Either Prefect or Dagster. FWIW, the Dagster team is actively reducing the learning curve with each release.

myfakebadcode 8 days ago

I’ve been using Airflow for quite some time. Given the maturity of where we're at, and while I’ve tested other solutions, I don’t really see us changing things.

rubenfiszel 7 days ago

You should give Windmill a try. It's more of a workflow engine than a data orchestration tool, but it's intuitive and open-source.

scary-size 8 days ago

We’ve migrated to Flyte. We're mostly using the Java/Scala API, which can be a bit verbose. The official Python API is actually easy on the eyes.

jusonchan81 7 days ago

Unmeshed - it’s not open source. It’s a new version of Netflix Conductor. It scales really well and has a GitHub Actions-style agent that can be used to run commands orchestrated by the platform. It’s probably the cheapest commercial tool you can get.

fmariluis 7 days ago

If you're inside AWS, have a fully containerized workflow, and/or can run some tasks in Lambda, Step Functions is probably OK? I personally prefer Airflow, but I wouldn't say it's the 'simplest data orchestration tool'.

markus_zhang 7 days ago

From my experience, cloud-managed Airflow is the easiest to manage and use.

It's a bit expensive, but the only thing they push you to do is upgrade managed K8s and Airflow once in a while.

akgfab 7 days ago

spaCy’s weasel package allows you to put a bunch of commands meant to be run in sequence in one project.yml file, pull assets, etc. I find it to be the right level of abstraction, and I’m pretty sure it’s not trying to become a cloud-hosted do-everything tool: https://github.com/explosion/weasel

tdeck 7 days ago

The simplest for sure was using ActiveJob in Rails with Clockwork for scheduling and Postgres for queueing things up.

fforflo 7 days ago

Makefile with make2graph to visualize DAGs.