Most of this is not about Python; it’s about matplotlib. If you want the admittedly very thoughtful design of ggplot in Python, use plotnine.
> I would consider the R code to be slightly easier to read (notice how many quotes and brackets the Python code needs)
This isn’t about Python, it’s about the tidyverse. The reason you can use this simpler syntax in R is that its non-standard evaluation allows packages to extend the syntax in a way Python does not expose: http://adv-r.had.co.nz/Computing-on-the-language.html
If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.
R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.
The root cause seems to be that we still haven't figured out the best language to use to manipulate tabular data yet (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kind of similar to dplyr. But no standard, except perhaps SQL.
FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.
The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.
If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.
None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.
Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.
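As a small illustration of the non-table case: a graph is usually easier to handle as an adjacency mapping than as a dataframe of edges. A minimal sketch, with a made-up edge list and a hypothetical `reachable` helper (neither comes from the article):

```python
from collections import defaultdict, deque

# Made-up undirected edges, purely for illustration.
edges = [("a", "b"), ("b", "c"), ("d", "e")]

# Adjacency mapping: node -> set of neighbouring nodes.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

def reachable(start):
    """Breadth-first search: return every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Neighbour lookups here are a single dictionary access, where an edge-list dataframe would need a filter over the whole table for every step.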
If you're doing data science all day, you should learn R, even if it's so weird at first (for somebody coming from a C-style language) that it seems way harder; R is made for the way statisticians work and think, not the way computer programmers work and think. If you're doing data science all day, you should start thinking and working like a statistician and working in R, and the fact that it seems to bend your mind is probably at least in part good, because a statistician needs to think differently than a programmer.
I work in python, though, almost all of the time.
    groups = {}
    for row in filtered:
        key = (row['species'], row['island'])
        if key not in groups:
            groups[key] = []
        groups[key].append(row['body_mass_g'])
can be rewritten as:
    groups = collections.defaultdict(list)
    for row in filtered:
        groups[(row['species'], row['island'])].append(row['body_mass_g'])
and
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std_dev = math.sqrt(variance)
as:
    std_dev = statistics.stdev(values)
These days I run some big query on an OLAP database and download the results to parquet stored on the local disk of a cloud notebook VM and then mine it to bits with duckdb reading straight from these parquet files.
The notebooks end up with very clear SQL queries and results (most notebook servers support SQL cells with highlighting and completion etc), and small pockets of python cells for doing those corner case things that an imperative language makes easier.
So when I get to the bottom of the article where it shows the difference between Python and R, I'm screaming "wouldn't that look better in SQL?!" :)
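For illustration, here is roughly what the article's grouped mean and standard deviation could look like as a single SQL query. This is a minimal sketch under some loud assumptions: it uses Python's built-in sqlite3 instead of duckdb so that it runs without extra packages, and the rows are made up rather than the real penguins data. SQLite has no standard-deviation aggregate, so the sample standard deviation is finished in Python from the SQL aggregates; duckdb provides a `stddev_samp` aggregate that would make that step unnecessary.

```python
import math
import sqlite3

# Made-up sample rows (species, island, body_mass_g); not the real penguins data.
ROWS = [
    ("Adelie", "Torgersen", 3750.0),
    ("Adelie", "Torgersen", 3800.0),
    ("Adelie", "Torgersen", None),  # missing value, filtered out in SQL
    ("Gentoo", "Biscoe", 4500.0),
    ("Gentoo", "Biscoe", 5700.0),
    ("Gentoo", "Biscoe", 5400.0),
]

QUERY = """
SELECT species, island,
       COUNT(*)                       AS n,
       AVG(body_mass_g)               AS mean,
       SUM(body_mass_g * body_mass_g) AS sum_sq
FROM penguins
WHERE body_mass_g IS NOT NULL
GROUP BY species, island
ORDER BY species, island
"""

def summarize(rows):
    """Return (species, island, mean, sample standard deviation) per group."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE penguins (species TEXT, island TEXT, body_mass_g REAL)")
    con.executemany("INSERT INTO penguins VALUES (?, ?, ?)", rows)
    out = []
    for species, island, n, mean, sum_sq in con.execute(QUERY):
        if n > 1:
            # Sample variance from the aggregates: (sum of squares - n * mean^2) / (n - 1).
            # Simple, but less numerically stable than a two-pass calculation.
            variance = max((sum_sq - n * mean * mean) / (n - 1), 0.0)
            sd = math.sqrt(variance)
        else:
            sd = float("nan")  # undefined for a single observation
        out.append((species, island, mean, sd))
    con.close()
    return out

results = summarize(ROWS)
```

The filtering, grouping, and aggregation all live in the query text, which is the part that would carry over to duckdb reading parquet files.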
Of course there's a bunch of loops and things; you're exposing what has to happen in both R and Python under the hood of all those packages.
BTW AI is not helping and in fact is leading to a generation of scientists who know how to write prompts, but do not understand the code those prompts generate or have the ability to peer review it.
> Python is pretty good for deep learning. There’s a reason PyTorch is the industry standard. When I’m talking about data science here, I’m specifically excluding deep learning.
I've written very little deep learning code over my career, but made very frequent use of the GPU and differentiable programming for non-deep learning specific tasks. In general Python is much easier to write quantitative programs that make use of the hardware, and you have a lot more options when your problem doesn't fit into RAM.
> I have been running a research lab in computational biology for over two decades.
I've been working nearly exclusively in industry for these two decades and a major reason I find Python just better is it's much, much easier to interface with other parts of engineering when you're using a truly general-purpose PL. I've actually never worked for a pure Python shop, but it's generally much easier to get production ML/DS solutions into prod when working with Python.
> Data science as I define it here involves a lot of interactive exploration of data and quick one-off analyses or experiments
This re-iterates the previous difference. In my experience I would call this "step one" in all my DS related work. The first step is to understand the problem and de-risk. But the vast majority of code and work is related to delivering a scalable product.
You can say that's not part of "data science", but if you did you'd have a hard time finding a job on most of the teams I've worked on.
All that said, my R vs Python experience has boiled down to: If your end result is a PDF report, R is superior. If your end result is shipping a product, then Python is superior. And my experience has been that, outside of university labs, there aren't a lot of jobs out there for DS folks who only want to deliver PDFs.
That’s a bad argument or a naive and obvious one, depending on how you look at it.
Python wasn’t designed for Data Science. It is not a DSL for it. MATLAB was arguably designed for scientific computing, and yet it’s the most disliked language in the StackOverflow liked/disliked index.
Here’s a different way to look at it. A good programming language is like the weather in a city. I would love to live somewhere where it’s 72F/23C all year round. But if it’s in the middle of nowhere and I’ve got no friends to hang out with, would I? I don’t think so.
FWIW, Python is like Sweden or Finland, with shitty weather for 6 months of the year yet thriving against all odds.
PS: I think the article’s topic is a bit clickbaity (not a particularly useful discussion) because it’s polarizing and no one will be 100% right about it. It’s perhaps best thought of as an opinion piece.
[1] https://link.springer.com/article/10.1007/s11336-017-9581-x
    using TidierData, DataFrames
    using PalmerPenguins: load
    penguins = load()
    @chain penguins begin
        DataFrame
        @drop_missing(body_mass_g)
        @group_by(species, island)
        @summarize(
            body_weight_mean =
                mean(body_mass_g),
            body_weight_std =
                std(body_mass_g)
        )
        show(_, allrows=true)
    end
[Data Preparation] --> [Data Analysis] --> [Result Preparation]
Neither Python nor R does a good job at all of these.
The original article seems to focus on challenges in using Python for data preparation/processing, mostly pointing out challenges with Pandas and "raw" Python code for data processing.
This could be solved by switching to something like duckdb and SQL to process data.
As far as data analysis, both Python and R have their own niches, depending on field. Similarly, there are other specialized languages (e.g., SAS, Matlab) that are still used for domain-specific applications.
I personally find result preparation somewhat difficult in both Python and R. Stargazer is ok for exporting regression tables but it's not really that great. Graphing is probably better in R within the ggplot universe (I'm aware of the python port).
Python doesn't need to be the best at any one thing; it just has to be serviceable for a lot of things. You can take someone who has expertise in a completely different domain in software (web dev, devops, sysadmin, etc.) and introduce them to the data science domain without making them learn an entirely new language and toolchain.
A lot of data science code is already in Python. That’s where it’s going to stay because rewriting code is time consuming. My guess is we will continue to improve Python gradually and keep refactoring the code.
If by data science you mean loading data to memory and running canned routines for regression, classification and other problems, then Python is great and mostly calls C/FORTRAN binaries under the hood, so Python itself has relatively little overhead.
If your data is already in a table, and you’re using Python, you’re doing it because you want to learn Python for your next job. Not because it’s the best tool for your current job. The one thing Python has on all those other options is $$$. You will be far more employable than if you stick to R.
And the reason for that is because Python is one of the best languages for data and ML engineering, which is about 80% of what a data science job actually entails.
1. Is easy to read
2. Was easy to extend in languages that people who work with scientific data happen to like.
When I did my masters we hacked around in the numpy source and contributed here and there while doing astrophysics.
Stuff existed in Java and R, but we had learned C in the first semester and python was easier to read and contrary to MATLAB numpy did not need a license.
When data science came into the picture, the field was full of physicists that had done similar things. They brought their tools as did others.
But the article says that very exotic syntax is more readable. I think this is mostly about the libraries, where honestly I equally don’t like matplotlib and R’s ggplot. But I would not think it’s a language problem.
I was hoping to find some performance benchmarks or something more than feelings about a certain block of code. Don’t get me wrong, I am also not a die-hard fan of Python, although I have written a lot of production code in it. Mentioning bloated, boilerplate code… I am afraid the author should look at Java or any modern JavaScript project.
As annoying as it is to admit it, python is a great language for data science almost strictly because it has so many people doing data science with it. The popularity is, itself, a benefit.
There's Julia -- it has serious drawbacks, like slow cold start if you launch a Julia script from the shell, which makes it unsuitable for CLI workflows.
Otherwise you have to switch to compiled languages, with their tradeoffs.
If I want to wrangle, explore, or visualise data I’ll always reach for R.
If I want to build ML/DL models or work with LLM’s I will usually reach for Python.
Often in the same document - nowadays this is very easy with Quarto.
While plotting may be clunky, I just don’t see R as much better. Plus in 2025 I can just give an LLM a sample of data and a description of the plot I want, and I get zero-shot code for it.
Author sounds very academic to me.
This is completely aside, but I wouldn't hold this against the students or Python. The students may be following an age-old rule of office politics: "Never troubleshoot in front of an audience." And why this is more prevalent among the students who use Python, well... sample size of 30.
My take (and my own experience) is that python won because the rest of the team knows it. I prefer R but our web developers don't know it, and it's way better for me to write code that the rest of our team can review, extend, and maintain.
The other thing is that a lot of R’s strengths are really the tidyverse’s. Some of that is to R’s credit as an extensible language that enables a skilled API designer to really shine of course, but I think there’s no reason Python the language couldn’t have similar libraries. In fact it has, in plotnine. (I haven’t tried Polars yet but it does at least seem to have a more consistent API.)
A better stdlib-only version would be:
    from palmerpenguins import load_penguins
    import math
    from itertools import groupby
    from statistics import fmean, stdev
    penguins = load_penguins()
    # Convert DataFrame to list of dictionaries
    penguins_list = penguins.to_dict('records')
    # create key function for grouping/sorting by species/island
    def key_func(x):
        return x['species'], x['island']
    # Filter out rows where body_mass_g is missing and sort by species and island
    filtered = sorted((row for row in penguins_list if not math.isnan(row['body_mass_g'])), key=key_func)
    # Group by species and island
    groups = groupby(filtered, key=key_func)
    # Calculate mean and standard deviation for each group
    results = []
    for (species, island), group in groups:
        values = [row['body_mass_g'] for row in group]
        mean_value = fmean(values)
        sd_value = stdev(values, xbar=mean_value)
        results.append({
            'species': species,
            'island': island,
            'body_weight_mean': mean_value,
            'body_weight_sd': sd_value
        })
But in my personal projects, my favorite language to use is Dart.
https://blog.genesmindsmachines.com/p/python-is-not-a-great-...
It also helps that in R any function can completely change how its arguments are evaluated, allowing the tidyverse packages to do things like evaluate arguments in the context of a data frame or add a pipe operator as a new language feature. This is a very dangerous feature to put in the hands of statisticians, but it allows more syntactic innovation than is possible in Python.
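To make the contrast concrete, here is a rough Python sketch. Python evaluates arguments before the call happens, so a library never gets to see a bare `body_mass_g`; the usual workarounds are column names as strings or explicit lambdas. The `summarize` helper and the sample rows below are hypothetical, written only for this illustration, and are not the API of pandas, Polars, or any other real library.

```python
from statistics import fmean

# Made-up rows; in R, summarize(mean(body_mass_g)) can refer to the column directly.
rows = [
    {"species": "Adelie", "body_mass_g": 3750.0},
    {"species": "Adelie", "body_mass_g": 3800.0},
    {"species": "Gentoo", "body_mass_g": 5000.0},
]

def summarize(rows, by, **aggregations):
    """Hypothetical helper: group rows by the `by` column, then apply each
    aggregation (a callable that receives the list of rows in one group)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)
    return {
        key: {name: fn(group) for name, fn in aggregations.items()}
        for key, group in groups.items()
    }

# The column has to be spelled as a string inside a lambda...
result = summarize(
    rows,
    by="species",
    body_weight_mean=lambda g: fmean(r["body_mass_g"] for r in g),
)

# ...because a bare name is evaluated before summarize() is ever called.
try:
    summarize(rows, by="species", body_weight_mean=body_mass_g)
except NameError as exc:
    error = str(exc)
```

The quotes and lambdas are the "logistics" that R's non-standard evaluation lets a package hide from the user.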
Python, the language itself, might not be a great language for data science. BUT the author can use Pandas or Polars or another data-science-related library/framework in Python to get the job done that s/he was trying to write in R. I could read both her R and Pandas code snippets and understand them equally.
This article reads just like, "Hey, I'm cooking everything by making all ingredients from scratch and see how difficult it is!".
I feel like I'm using Python more and more in a way that is just not native to it. Strict typing is simply necessary at some point in time, but I don't get any of the performance benefits or compile-time warnings that other languages provide. But more than Pandas/Polars/DuckDB, I think it is the plotting ecosystem that keeps me in the Python universe. Seaborn, altair, plotnine all take ggplot's GoG and bring it to Python and I'm really grateful for that.
I don't see an alternative to Python for someone who needs to: 1) Work with data that fits in memory and is mostly tabular (thus Pandas/Python), 2) needs to visualize this data often, 3) does exploratory work (Jupyter notebooks).
Personally I've found polars has solved most of the "ugly" problems that I had with pandas. It's way faster, has an ergonomic API, seamless pandas interop and amazing support for custom extensions. We have to keep in mind Pandas is almost 20 years old now.
I will agree that Shiny is an amazing package, but I would argue it's less important now that LLMs will write most of your code.
Now, is Python a SUCCESSFUL language? Very.
> Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave.
I feel that, to most programming folks, R is in the same category. R is to them what Octave is to the author. R is nice, but do they really want to learn a "niche" language, even if it has some better features than Python? Is holding a whole new paradigm, syntax, library ecosystem in your head worth it?
> Contrast this with equivalent code that is full of logistics, where I’m using only basic Python language features and no special data wrangling package:
    n = len(values)
    # Calculate mean
    mean = sum(values) / n
    # Calculate standard deviation
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std_dev = math.sqrt(variance)
He doesn't know about the statistics package in the standard library of Python (https://docs.python.org/3/library/statistics.html). Of course, if you do not know how to use Python, you will have a lot of boilerplate.
While I am not a python cheerleader, but a user because the reality is that it is a pretty good glue language, the above is a bit of a problem.
Duckdb, pandas, numpy etc.. is what makes python nice.
About a decade ago I worked at a major BI software company and ran into another silly problem when trying to evangelize R: wikis, KBs, and search engines don’t like single-letter search terms.
So it didn’t matter how much better R was at the time, people found learning it more difficult than it should have been.
Work experience says that 90% of work is gathering, cleaning and transforming data from different sources. In this capacity Python has more options available.
[0] https://kotlinlang.org/docs/kotlin-notebook-overview.html
At the same time it is an absolute necessity to know if you are doing numerics. What this shows, at least to me, is that it is "good enough" and that the million integrations, examples and pieces of documentation matter more than whether the peculiarities of the language work in favor of its given use case, as long as the shortcomings can be mostly addressed.
Recently I am seeing that Python is heavily pushed for all data science related things. Sometimes objectively Python may not be the best option especially for stats. It is hard to change something after it becomes the "norm" regardless of its usability.
It lives in a sterile, idealized world.
Python is a great language for data science in practice because it turns out data science is also:
- gluing a lot of data sources
- cleaning up a ton of terribly shaped data
- validation and error handling
- I/O, networking, and format conversion
- onboarding non-programmers into programming
- wrapping a lot of compiled languages' libs or plugging system
- prototyping stuff and exposing that prototype to some people
- turning prototypes into more permanent projects
And it turns out Python and its ecosystem are good at those while remaining decent at the other things. There are other languages excellent at some of those, or some of the other things, but rarely good at most. And because humanity is vast, diverse, and constantly renewing, being the second best at those is eventually always winning.
Because whoever you are, you will be annoyed at not having the best experience at task X. But you would be mortified if you had the worst experience at doing task Y and Z. And task X, Y, and Z change depending on who you ask.
And you want to get things done, while days have 24 hours.
As usual, to understand the Python phenomenon, you have to see the whole picture. Not your little corner of the bubble. Not the ideal world in your head either. Life is not a maths problem with a clearly laid out premise and an elegant answer.
That's the same debate about why PHP won the web in 2000 no matter the size of the spaghetti plate, why Windows stayed in use for so long despite being terrible, why people keep using iPhones after all the abuses, etc. There is more to it than the use case you have every day. People have needs you haven't thought about.
So it's not "let the language war begin". It's, "dude, get more experience, go work with accountants, ngos, govs and logistic chains, go work in china, africa and south america, go from a startup to schools to corporate, satisfy the geeks, the artists and the business people, then we'll talk".
More terse, more efficient, less error prone, hopefully more numerically accurate, as if Python had an ecosystem of well designed libraries on par with R.
Best part is, write a --help, and you can load them into LLMs as tools to help the LLMs figure it out for you.
Fight me.
Worked quite well, but the TS/JS langgraph version is way behind. React agents are just a few lines of code, compared to 50 odd lines for the same thing in JS/TS.
Better to use a different language, even one i'm not familiar with, to be able to maintain a few lines of code vs 50 lines.
I can't help but conclude that Python is as good as R because I still have the choice of using Pandas when I need it. What did I get wrong?
It was easy to think about the structures (iterators), it was easy to extend, and it had a good community.
And for that, people start extending it via libraries.
There are plenty more alternatives now.
Python is not a great language
First, the white space requirements are a bad flashback to 1970s fortran.
Second, it is the language that is least compatible with itself.
You will say that not everything that is successful is great, and you will be right, but the success of python came organically, and not because of advertisement, de facto monopoly, politics, money, or first-arrived-advantage.
Although there is one cause that isn't intrinsic to python but comes from the people who built numpy. The fact that there is a single numerical library in the whole ecosystem, extremely easy to use, fast and extensive, was huge.
mediocre!
Add to that the fact that TS-based exploratory code can potentially plot SVG via d3 and maybe even be exported to a webpage.
But the language has many rough edges
1. non-standard eval is very weird; rlang fixes these shortcomings
2. unintuitive names, or functions not belonging to packages; base has a mix of functions
3. S3 mixes with naming: no problem personally with S3, and S7 is even better, but mixing S3 names with ordinary names is unintuitive; keep snake case
4. data.frames are unintuitive; tidyverse fixes this
5. f(a=) seriously? or working with unintuitive functions in body for discrete ranges of function arguments?
6. no imports per file in packages; I can live with this .. still ...
7. AST functions are unintuitive
R has some excellent parts:
non-standard evaluation, AST in the base language, lazy evaluation
but it is being killed by the bad parts
I think all the external fixes and sanity in names should go into base
but it will take a lot of time if it ever happens due to legacy.
Julia fixes many of these, not as elegantly as R, but its pragmatic approach is too attractive.
I was all hyped up, ready to see the amazing examples and arguments that would convince me to pick up R, and it gave me absolutely nothing (except quotes and brackets..).
Disappointing.
- A General programming language like Python is good enough for data science but isn't specifically designed for it.
- A language that is specifically designed for Data Science like R is better at Data Science.
Who would have thought?
It makes it look like perl on a bad day, or worse, autogenerated JavaScript.
Why on earth is it so many levels deep in objects?
R is kind of a super-specialized language. Python is much more general purpose.
R failed to evolve, let's be honest. Python won via jupyter - I see this used ALL the time in universities. R is used too, but mostly for statistics related courses only, give or take.
Perhaps R is better for its niche, but Python has more momentum and thus dominates over R. That's simply the reality of the situation. It is like a bulldozer moving forward at a fast speed.
> I say “This is great, but could you quickly plot the data in this other way?”
Ok so ... he would have to adjust R code too, right? And finding good info on that is simply harder. He says he has experience with universities. Well, I do too, and my experience is that people are WAY better with python than with R. You simply see that more students will drop out from R than from python. That's also simply the reality of the situation.
> They appear to be sufficiently cumbersome or confusing that requests that I think should be trivial frequently are not.
I am sure the reverse also applies. Pick some python library, do something awesome, then tell the R students to do the same. I bet he will have the same problems.
> So many times, I felt that things that would be just a few lines of simple R code turned out to be quite a bit longer and fairly convoluted.
Ok, so here he is trolling. Flat out - I said it.
I wrote a LOT of python and quite a bit of R. There is no way in life that the R code is more succinct than the python code for about 90% of the use cases out there. Sorry, that's simply not the case. R is more verbose.
> Here is the relevant code in R, using the tidyverse approach:
    penguins |>
      filter(!is.na(body_mass_g)) |>
      group_by(species, island) |>
      summarize(
This is like perl. They also don't adapt. R is going to lose ground. This professor just hasn't realised that he is slowly becoming a fossil himself, by being unable to see that x is better than y.
Of course, if your DS is mixed with ML & modern AI you just:
pip install rpy2
But then, why choose? No need to be dogmatic, if R is nice for you:
install.packages("reticulate")