Google Summer of Code 2018 | Final Report | Scrapinghub | PSF
Google Summer of Code is a global program that introduces college students to open source development. It connects aspiring student developers with experienced developers of open source software. The most exciting part of GSoC is that students get to work on real, open source software. The best part of GSoC is its steep learning curve, which pushes students out of their comfort zone.
Details of my Organisation and Project
Scrapy is an open source web scraping framework written in Python. Scrapy supports an object-oriented style of development, which makes it well suited for big projects. Its user-friendly syntax and easy-to-understand documentation are a great bonus as well.
Before I start writing my report, I would like to thank my mentors, Daniel Graña and Mikhail Korobov. They have been very helpful in guiding me through the project, and have been quite responsive. Special thanks to Cathal Garvey for guiding me through the project here at Scrapinghub.
Blogging and details of my work
I have documented my progress in a blog. The entire journey can be reviewed here.
Description of my project
Starting from Python 3.5, Python has provided support for native coroutines using the async/await syntax (the asyncio framework itself was introduced in Python 3.4). Users who have used Scrapy are familiar with callbacks, which are methods that receive a response when it becomes available. async/await lets the response to a request be received on the same line, instead of in a separate callback. Frameworks built on asyncio are really fast, and they support the async/await syntax along with the other new features of asyncio. The goals of my project were:
- Having an awaitable request method that awaits a response.
- Supporting the async/await syntax in Scrapy.
- Supporting different asyncio-based frameworks.
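For contrast with the await-based style the project adds, here is a minimal sketch of the classic callback style that Scrapy users are familiar with; the spider name and URLs are placeholders.

```python
import scrapy

class CallbackSpider(scrapy.Spider):
    # a conventional spider: each response is delivered to a callback
    name = "callback_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # follow every link; each detail response arrives later,
        # in a different method, not on this line
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

The sections below describe how the project replaces this indirection with native coroutines.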
A short description of my work
My project required me to support the async/await syntax in Scrapy, and also to support asyncio in Scrapy. While I knew the technologies and frameworks that Scrapy was using and intended to use, it was still quite a challenging project to complete. A few tasks remain before it can be merged into the Scrapy codebase, but the completed work is readily usable in Scrapy.
My work can be roughly divided into three broad parts:
- Supporting async def start_requests(...)
The project started with me making a PR, before I had even submitted my Google Summer of Code proposal. I made the PR because I wanted to participate in Scrapy, but more than that, I wanted a practical example to test my understanding of asyncio and Twisted. This PR enabled the use of async def start_requests, but it was just one piece of the bigger puzzle, so by itself it did not do much.
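A minimal sketch of what this enables, assuming Scrapy's standard Spider API; the URLs and the asynchronous setup step are illustrative.

```python
import asyncio
import scrapy

class AsyncStartSpider(scrapy.Spider):
    name = "async_start"

    async def start_requests(self):
        # start_requests() can now be a native coroutine (here, an async
        # generator), so asynchronous setup work fits in naturally
        await asyncio.sleep(0.1)  # stand-in for e.g. fetching an auth token
        for page in range(1, 3):
            yield scrapy.Request(f"https://example.com/page/{page}")
```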
- Supporting async def parse(...)
After my proposal was submitted, I started working on supporting async def parse. While the first requirement of the project was to support await scrapy.Request(...), the next task was to support asynchronous generators. As Scrapy also supports yielding items, supporting asynchronous generators was necessary.
I started working on supporting asynchronous generators. I found a useful resource, inline_requests, a separate utility that lets a spider yield a request and receive its response on the same line. I was quite impressed by it for one reason in particular: if we want to use native coroutines (async/await), we should ensure that no synchronous methods are left in the chain; all the methods should be asynchronous. I took that idea and implemented the same for native coroutines.
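A minimal sketch of a parse() method written as an asynchronous generator, assuming the project's async support is enabled; the selectors and awaited step are illustrative.

```python
import asyncio
import scrapy

class AsyncParseSpider(scrapy.Spider):
    name = "async_parse"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # as an async generator, parse() can await coroutines and still
        # yield items and follow-up requests like a regular callback
        await asyncio.sleep(0.1)  # stand-in for any awaitable work
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page)
```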
After supporting asynchronous generators, I shifted my attention to supporting asyncio frameworks. I researched this topic a lot and finally landed on this source. That blog post helped me join all the parts of the asyncio-Twisted puzzle. I worked through a few example prototypes and finally designed an implementation based on it.
At the end, I was left with await scrapy.Request(...). As I worked on supporting this syntax, I got stuck on circular imports: a module importing itself over and over again. This problem plagued my progress, so I shifted to introducing a new method, scrapy.core.Fetch(Request). This method can be awaited, and returns the response as and when it is available.
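A hypothetical sketch of how this could look in a spider, based on the description above; the exact import path and call signature of Fetch may differ in the final PR.

```python
import scrapy
from scrapy.core import Fetch  # assumed location, per the description above

class FetchSpider(scrapy.Spider):
    name = "fetch_spider"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # awaiting Fetch yields the response on the same line,
        # with no separate callback method involved
        detail = await Fetch(scrapy.Request("https://example.com/detail"))
        yield {"status": detail.status, "url": detail.url}
```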
- Supporting Asyncio frameworks
While I wanted to support asyncio frameworks, I was wary that asyncio might not be fully supported, which would land me back at the same problem. After researching this, I got the idea of running Twisted on top of asyncio.
I went through existing implementations of this approach, and decided to tackle the task with Twisted's asyncio reactor.
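A minimal sketch of the core trick, using Twisted's public asyncioreactor API; installing this reactor makes Twisted and asyncio share a single event loop.

```python
import asyncio
from twisted.internet import asyncioreactor

# install the asyncio-backed reactor; this must happen before anything
# else imports twisted.internet.reactor
asyncioreactor.install(asyncio.get_event_loop())

from twisted.internet import reactor

# the reactor is now driven by the asyncio event loop, so Deferreds
# and asyncio coroutines can interoperate in one process
print(type(reactor))  # AsyncioSelectorReactor
```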
I designed the implementation and coded the new interoperability support. After trying it out with a few libraries, such as aiohttp and aioredis, the new API was working well, so asyncio frameworks are now supported through it. Users can try out asyncio libraries such as aiohttp and aioredis, to name a few.
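A hypothetical example of calling an asyncio library from a spider once this interoperability is in place; the endpoint URL is a placeholder.

```python
import aiohttp
import scrapy

class AiohttpSpider(scrapy.Spider):
    name = "aiohttp_spider"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # aiohttp's client coroutines can be awaited directly, because
        # Twisted and asyncio now run on the same event loop
        async with aiohttp.ClientSession() as session:
            async with session.get("https://httpbin.org/get") as resp:
                data = await resp.json()
        yield {"client_ip": data.get("origin")}
```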
Requirements before it is merged into the Scrapy codebase
While I have built most of the API required for using the async/await syntax, some tasks remain before it can be merged into Scrapy's codebase.
- Writing test suites: I intended to finish the test suites as laid out in the proposal, and I did write some of them, but the main project grew rather demanding towards the end. So while the new API achieves all the expectations of the proposal, a new API cannot be rolled out until the code is well tested.
- Writing documentation: the new API adds a few hooks to Scrapy and provides async/await support as an additional feature, but writing documentation and some example tests is still mandatory before the common user can adopt it.
- Supporting Python 2.7: the Scrapy codebase is backwards compatible, and most of my code is too, but some hooks and additional features are only possible in Python ≥3.7. The code should check the Python version and use the appropriate methods accordingly.
- Twisted: this one is completely out of my hands. Twisted has a bug in which some variables are named async; starting from Python 3.7, async and await are reserved keywords. The bug has been fixed on Twisted's GitHub page, but until the corrected release is out, one has to wait to try it. You can clone it to your local machine from here.
What does it mean for Scrapy users?
After this new API is merged into Scrapy, users will be able to take advantage of the new async/await syntax and run libraries that require asyncio. The new API also lets users receive the response on the same line, rather than in a separate callback method.
What I learned through the project
The project itself has been quite challenging, but that is the real beauty of Google Summer of Code. I had grown used to the framework Scrapy is built on, namely Twisted, but asyncio was quite new to me, and its evolving nature means developers will certainly have a hard time keeping up with it.
Right before applying for the project, I had learnt Django and Flask, but going through Scrapy made me look at it with awe: the concurrency achieved by Twisted's single-threaded design pushed me to learn the framework properly. As for asyncio, I played with it and read quite a lot of blogs, and I came to understand that it will take a fair bit of time before asyncio becomes a mainstream event-driven networking framework.
There were moments where I chalked out milestones for the project, and a few moments where I certainly drifted from the deadline. But I had allotted time with the complexity of the project in mind, so I covered most of the project within the stipulated time.
I also gained a lot of practical knowledge of generators and asynchronous generators in Python, and after working through a fair number of use cases, I feel quite confident using them in future projects.
Important Links
- Asyncio support in Scrapy — The PR for the project
- My Blogpost — (https://yashrsharma44.github.io/)
- Medium page for the Blogs — (https://medium.com/@yashrsharma44/list-of-blogs-for-google-summer-of-code-18-674ddf91a34d)
- Documentation for the project — (https://gist.github.com/yashrsharma44/9730e8817419ea3ce3de2c6ebed3673a)
I would like to thank my mentors again, as they helped me a lot in getting used to the codebase and in splitting the first task into small, manageable pieces. The start of the coding period is always daunting, but they eased me into the project. I also want to thank GSoC for providing me with an opportunity to apply my skills. Thanks :)