Web Scraping with Python: Collecting Data from the Modern Web Paperback – Jul 24 2015
About the Author
Ryan Mitchell is a Software Engineer at LinkeDrive in Boston, where she develops their API and data analysis tools. She is a graduate of Olin College of Engineering, and is a master's degree student at Harvard University School of Extension Studies. Prior to joining LinkeDrive, she was a Software Engineer working on web scraping and data analysis at Abine.
From the Publisher
Q&A with author Ryan Mitchell
What got you interested in web scraping?
In 2011, I started working for a company called Abine that offered a service to remove customers’ personal information from various sites on the Internet. In the early days of the company, the process of looking for someone’s personal information on all of these sites, filling out all these opt-out forms, faxing, emailing, compiling reports to send back to the customers -- it all took a lot of time! I started looking into ways to streamline these processes and add additional features. I built bots that could search for profiles, store information in our database, fill out web forms, create documents, and send the emails and faxes automatically. Some of these sites were fairly bot-resistant, so I had to learn, and even invent, some interesting techniques to deal with them. I really fell in love with building bots and scraping the web, and continued to do it even after I left the company!
Why is Python such a good fit for web scraping and building web crawlers?
I’ll be honest: As far as high-performance programming languages go, Python does not win many speed contests. But with web scraping, you’re not looking for speed -- sending and receiving data across the Internet will be thousands of times slower than any relatively tiny differences in language performance, so you can throw that metric out the window! What you need is something that’s lightweight, easy to deploy to remote machines, that can be installed and run anywhere, that’s easy to write and modify, and, perhaps most importantly, that has a plethora of well-documented tools for just about any situation. Python has all of these in spades.
What’s the most interesting way you’ve used web scraping, for professional or side projects?
One of my favorite scraping projects, and something I introduce in Web Scraping with Python, is scraping Wikipedia for historical edits by IP address, time of the edit, and language. You can resolve the IP address to a geographic location, and explore when and where speakers of different languages are making edits. Lots of interesting sociological research potential there!
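As a rough illustration of the first step in that project (not the book's own code), here is a minimal Python sketch that pulls anonymous editors' IP addresses out of a revision-history page. It uses only the standard library and runs against a made-up HTML snippet rather than a live page. The `mw-anonuserlink` class name, the `SAMPLE_HISTORY` markup, and the helper names are assumptions for this example; check them against the real page before relying on them. Resolving each address to a location would need a separate geolocation database or service, which is not shown.

```python
import ipaddress
from html.parser import HTMLParser


class AnonymousEditorParser(HTMLParser):
    """Collects the text of links whose class list marks an anonymous editor."""

    def __init__(self, anon_class="mw-anonuserlink"):
        super().__init__()
        # Assumed class name for anonymous-editor links; verify on the live page.
        self.anon_class = anon_class
        self._in_anon_link = False
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            classes = (dict(attrs).get("class") or "").split()
            self._in_anon_link = self.anon_class in classes

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anon_link = False

    def handle_data(self, data):
        if self._in_anon_link and data.strip():
            self.candidates.append(data.strip())


def extract_editor_ips(history_html):
    """Return the valid IPv4/IPv6 addresses found in anonymous-editor links."""
    parser = AnonymousEditorParser()
    parser.feed(history_html)
    ips = []
    for text in parser.candidates:
        try:
            ips.append(str(ipaddress.ip_address(text)))
        except ValueError:
            # Link text was not an IP address; skip it.
            continue
    return ips


# Hypothetical history markup using reserved documentation addresses.
SAMPLE_HISTORY = """
<ul>
  <li><a class="mw-userlink mw-anonuserlink" href="#">203.0.113.7</a> 12:01, 3 May 2015</li>
  <li><a class="mw-userlink" href="#">ExampleUser</a> 11:45, 3 May 2015</li>
  <li><a class="mw-userlink mw-anonuserlink" href="#">2001:db8::1</a> 09:30, 2 May 2015</li>
</ul>
"""

print(extract_editor_ips(SAMPLE_HISTORY))
```

Validating each candidate with `ipaddress.ip_address` keeps registered usernames and other stray link text out of the result, so only real addresses reach the later geolocation step.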
A recent hobby of mine has also been automated CAPTCHA solving. I really enjoy analyzing new types of CAPTCHAs for vulnerabilities, writing scripts to pre-process the images, creating data sets for machine learning algorithms, and seeing how high I can get the success percentage of my bots! No real practical applications these days, but you never know when it will come in handy.
What information do you hope that readers of your book will walk away with?
I try to stress a couple of things throughout the book:
First, no website is bot-proof. Attempts to make websites more bot-proof generally also result in a loss of usability for human users. That loss of usability may be in the form of slower loading times, poor browser compatibility, lack of accessibility for users with mobility or visual impairments, or users on mobile devices. And many of these measures have no real deterring effect on web scrapers. If you can view the data in a browser, you can capture it with a scraper.
What’s the most exciting or important thing happening in your space right now?
Like many fields, especially computer science fields, there’s a lot being done with machine learning and big data. The split of page requests between humans and bots is about 50/50 right now, and as more humans are getting on the Internet, more bots are too -- and outpacing them! There’s just so much data, and so many machines collecting that data, and so many connections we haven’t been able to make before, waiting to be made. And these aren’t just data scientists and server farm owners making them, either! The kind of research that once might have required months or years of surveys and data collection is now just a Python script, a database, and a weekend of coding away!
Most Helpful Customer Reviews on Amazon.com (beta)
1. It is a great introduction to web scraping. The reader is given confidence to use well-known Python packages such as BeautifulSoup and get useful results from scraping webpages in a very short time.
2. Where to go after learning the basics? - the author describes the tools, techniques and frameworks to use for scraping dynamic websites, including code examples. This is the most challenging part of the book because it frequently involves combining tools and the reader will have to get his/her hands dirty and learn by doing also. This is reasonable since different websites present different challenges.
3. I liked the author's writing style. She favors simple explanations, identifies potential pitfalls and makes clear, technical recommendations based on her experience.
Highly recommended. I wish I had this book two years ago.
I also really appreciated Ryan's nuanced reasoning on the ethics of scraping, which went a layer deeper than simply throwing responsibility over to the reader with some version of "with great power comes great responsibility", and got into a thoughtful discussion of how to make the call about whether a potential application is legitimate.
Look for similar items by category
- Books > Computers & Technology > Internet & Social Media > Online Searching
- Books > Computers & Technology > Internet & Social Media > Web Browsers
- Books > Computers & Technology > Microsoft > Web Browsers
- Books > Computers & Technology > Programming > Languages & Tools > Python
- Books > Computers & Technology > Web Development > Programming
- Books > Computers & Technology > Web Development > Web Services
- Books > Textbooks > Computer Science & Information Systems > Programming Languages