arXiv is the preprint server that sets the pace of research in artificial intelligence, machine learning, physics, mathematics, and quantitative biology. With hundreds of new papers submitted daily, it is where research findings become public weeks or months before formal journal publication. For technology companies monitoring AI research fronts, academic corpus builders assembling training datasets, and competitive intelligence teams tracking the publication activity of leading research labs, a reliable arXiv data pipeline is a strategic asset. Extracting that data systematically, across categories, with both full text and metadata, and with the completeness that reliable intelligence requires, demands a proxy stack configured for arXiv's access environment.
arXiv is an open academic resource operated by Cornell University, and its automated access guidelines ask bulk collectors to use the OAI-PMH API for metadata harvesting and to respect rate limits on web-based access. At gsocks.net, our arXiv proxy stack is configured to complement these guidelines rather than circumvent them. The challenge our customers face is not evading arXiv's access controls; arXiv actively wants its data to be used. The challenge is sustaining the consistent, high-throughput access that large-scale corpus building and real-time research monitoring require across multi-day and multi-week collection campaigns, without accumulating the per-IP session density that triggers arXiv's automated traffic management.
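To make the OAI-PMH route concrete, here is a minimal Python sketch of a metadata harvest loop that follows resumption tokens and pauses between requests. It illustrates the protocol's standard `ListRecords` flow and is not a description of our production stack. The function names, the injected `fetch` callable, and the five-second default delay are all assumptions made for illustration. The endpoint URL and the appropriate request interval should be taken from arXiv's current documentation.

```python
import time
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    Per OAI-PMH, resumptionToken is an exclusive argument: when present,
    only `verb` and `resumptionToken` are sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


def harvest(base_url, fetch, delay_seconds=5.0, **kwargs):
    """Yield record identifiers page by page.

    `fetch` is an injected callable (URL -> response text), so any HTTP
    client can be plugged in. `delay_seconds` is a placeholder value, not
    a figure taken from arXiv's guidelines.
    """
    token = None
    while True:
        url = build_list_records_url(base_url, resumption_token=token, **kwargs)
        identifiers, token = parse_list_records(fetch(url))
        yield from identifiers
        if not token:
            break  # an empty or missing token marks the final page
        time.sleep(delay_seconds)
```

Passing `fetch` in as a parameter keeps the paging logic independent of the transport layer, so the same loop can run over a direct connection or a proxied one and can be tested without network access.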