Anecdotally, the Matplotlib maintainers were told:
"About 15% of arXiv papers use Matplotlib"
arXiv is the preeminent repository for scholarly preprint articles, especially in scientific fields like physics, mathematics, and chemistry. It stores millions of articles used across the sciences. It's also publicly accessible, so given enough compute power we can scan the entire thing.
Since the early 2010s, Matplotlib has included the bytes //<[[b"Matplotlib"]]> in every PNG and PDF it produces. These bytes often persist in PDFs that contain Matplotlib plots, including the PDFs stored on arXiv. As a result, it's pretty simple to check whether a PDF contains a Matplotlib image: we just scan every PDF for these bytes; no parsing required.
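A minimal sketch of that check (the helper name and the local example file are our own, not from the original code):

```python
def contains_matplotlib(data: bytes) -> bool:
    """Return True if the raw bytes of a file mention Matplotlib.

    Matplotlib stamps its own name into the PNGs and PDFs it writes,
    so a plain substring search is enough -- no PDF parsing required.
    """
    return b"Matplotlib" in data


# Hypothetical example: check a single paper downloaded to disk
with open("example-paper.pdf", "rb") as f:
    print(contains_matplotlib(f.read()))
```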
The data is stored in a requester pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3) and also on GCS hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv).
The data is about 1TB in size. We're going to use Dask for this.
This is a good example of writing plain vanilla Python code to solve a problem, running into issues of scale, and then using Dask to easily jump over those problems.
Our data is stored in a requester pays S3 bucket in the //<[[us-east-1]]> region. Each file is a tar file which contains a directory of papers.
There are about 5,000 of these tar files.
Mostly we have to muck about with tar files. This wasn't hard. The //<[[tarfile]]> library is in the standard library. It's not beautiful, but it's also not hard to use.
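Here's a rough sketch of what processing one tar file might look like. The //<[[extract]]> name and the example path are assumptions of ours; the important details are passing //<[[requester_pays=True]]> to the S3 filesystem and streaming the tar members so nothing has to touch disk.

```python
import tarfile
import s3fs

def extract(path: str):
    """Scan one arXiv tar file on S3 and record, for each PDF inside,
    whether it contains the Matplotlib byte signature."""
    # The arxiv bucket is requester-pays, so the downloader foots the bill
    s3 = s3fs.S3FileSystem(requester_pays=True)
    out = []
    with s3.open(path, "rb") as f:
        # "r|" streams the tar sequentially, which avoids seeking around in S3
        with tarfile.open(fileobj=f, mode="r|") as tar:
            for member in tar:
                if not member.name.endswith(".pdf"):
                    continue
                data = tar.extractfile(member).read()
                out.append((member.name, b"Matplotlib" in data))
    return out

# Hypothetical path -- one of the ~5,000 monthly tar files
records = extract("s3://arxiv/pdf/arXiv_pdf_0011_001.tar")
```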
//<[[CPU times: user 3.99 s, sys: 1.79 s, total: 5.78 s]]>
//<[[Wall time: 51.5 s]]>
We see that none of these files included a Matplotlib image. That's not surprising. The filenames start with "0011" which means year 2000, month 11. Matplotlib wasn't even around back then 🙂
Great, we can get a record for each file of whether or not it used Matplotlib. Each tar file takes about a minute to process on my local machine, so processing all 5,000 files serially would take about 5,000 minutes, or roughly 83 hours.
We can accelerate this in two ways: by processing many tar files in parallel across many machines, and by running those machines in the same cloud region as the data so we're not bottlenecked on our local network connection.
We can do both easily with Dask (parallel computing) and Coiled (to set up the Dask infrastructure).
We start a Dask cluster on AWS in the same region where the data is stored.
We mimic the local software environment on the cluster with //<[[package_sync=True]]>.
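A sketch of that setup is below. The worker count is illustrative, and the exact keyword for choosing the region may differ across Coiled versions:

```python
import coiled
from dask.distributed import Client

cluster = coiled.Cluster(
    n_workers=100,        # illustrative; enough parallelism for ~5,000 tar files
    region="us-east-1",   # same region as the s3://arxiv bucket
    package_sync=True,    # replicate the local Python environment on every worker
)
client = Client(cluster)
```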
//<[[CPU times: user 9.72 s, sys: 1.08 s, total: 10.8 s]]>
//<[[Wall time: 1min 25s]]>
Let's scale up this work across all of the directories in our dataset.
Hopefully it will also be faster because the Dask workers are in the same region as the dataset itself.
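One way to fan the work out, reusing the //<[[extract]]> helper sketched above and assuming the tar files live under //<[[s3://arxiv/pdf]]>:

```python
import s3fs

# List every monthly tar file in the bucket
s3 = s3fs.S3FileSystem(requester_pays=True)
filenames = ["s3://" + path for path in s3.ls("arxiv/pdf") if path.endswith(".tar")]

# Submit one task per tar file and collect the results as they finish
futures = client.map(extract, filenames)
results = client.gather(futures)

# Flatten the per-file lists into one list of (filename, has_matplotlib) records
records = [record for result in results for record in result]
```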
//<[[CPU times: user 11.7 s, sys: 1.77 s, total: 13.5 s]]>
//<[[Wall time: 5min 58s]]>
Now that we're done with the large data problem we can turn off Dask and proceed with pure Pandas. There's no reason to deal with scalable tools if we don't have to.
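Shutting things down and dropping into pandas is just a couple of lines (the column names here are our own):

```python
import pandas as pd

# Release the cloud resources; everything from here on runs locally
client.close()
cluster.close()

df = pd.DataFrame(records, columns=["filename", "has_matplotlib"])
```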
Let's enhance our data a bit. The filename of each paper includes the year and month when it was published. After extracting these we'll be able to see a time series of Matplotlib adoption.
//<[[2122438 rows x 2 columns]]>
Yup. That seems to work. Let's map this function over our dataset.
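A rough sketch of that function and the mapping. The exact layout of the member names is an assumption here (old-style names like //<[["0011/astro-ph0011001.pdf"]]> lead with a YYMM directory):

```python
import datetime

def date(filename: str) -> datetime.datetime:
    """Pull the publication year and month out of an arXiv filename.

    Member names look roughly like "0011/astro-ph0011001.pdf" or
    "1601/1601.00001.pdf" -- the leading directory is YYMM.
    """
    yymm = filename.split("/")[0]
    year = int(yymm[:2])
    year += 2000 if year < 50 else 1900    # "0011" -> 2000, "9911" -> 1999
    month = int(yymm[2:4])
    return datetime.datetime(year, month, 1)

df["date"] = df.filename.map(date)
```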
Now we can just fool around with Pandas and Matplotlib.
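For example, something along these lines gives the monthly fraction of papers that contain a Matplotlib figure:

```python
import matplotlib.pyplot as plt

# Fraction of papers in each month that contain a Matplotlib image
fraction = df.groupby("date").has_matplotlib.mean()

fig, ax = plt.subplots()
fraction.plot(ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Fraction of arXiv papers using Matplotlib")
plt.show()
```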
I did the plot above. Then Thomas Caswell (Matplotlib maintainer) came by and, in true form, made something much better 🙂
Yup. Matplotlib is used pretty commonly on arXiv. Go team.
This data was slightly painful to procure. Let's save the results locally for future analysis. That way other researchers can further analyze the results without having to muck about with parallelism or cloud stuff.
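Something like the following; the output filenames here are placeholders, not necessarily what the repository uses:

```python
# Full per-paper results: one row per PDF with its date and Matplotlib flag
df.to_parquet("arxiv-matplotlib.parquet")

# Small monthly summary that's convenient to eyeball or plot directly
fraction.to_csv("arxiv-matplotlib-by-month.csv")
```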
These are available in the GitHub repository https://github.com/mrocklin/arxiv-matplotlib if you want to play around with them.
It's incredible to see the steady growth of Matplotlib across arXiv. It's worth noting that this is all papers, even from fields like theoretical mathematics that are unlikely to include computer generated plots. Is this Matplotlib growing in popularity? Is it Python generally?
For future work, we should break this down by subfield. The filenames actually contained the name of the field for a while, like "hep-ex" for "high energy physics, experimental", but it looks like arXiv stopped doing this at some point. My guess is that there is a list mapping filenames to fields somewhere though. The filenames are all in the Pandas dataframe / parquet dataset, so doing this analysis shouldn't require any scalable computing.
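As a starting point, here's a hedged sketch of pulling the field prefix out of the old-style filenames. The regex and column name are ours, and it returns None for the newer, purely numeric names:

```python
import re

def field(filename: str):
    """Extract the subject prefix ("hep-ex", "astro-ph", ...) from an
    old-style arXiv filename; return None for new-style numeric names."""
    match = re.search(r"/([a-z-]+)\d", filename)
    return match.group(1) if match else None

df["field"] = df.filename.map(field)
df.groupby("field").has_matplotlib.mean().sort_values()
```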
Dask and Coiled were built to make it easy to answer large questions.
We started this notebook with some generic Python code. When we wanted to scale up we invoked Dask+Coiled, did some work, and then tore things down, all in about ten minutes. The problem of scale or "big data" didn't get in the way of us analyzing data and making a delightful discovery.
This is exactly why these projects exist.
There are many ways that this work could be extended (by you?)
This is a fun dataset representing the forefront of human science. It's now easy for us to inspect in its raw form. Fun!