Web Scraping with Python: Data Extraction from the Modern Web, 3rd Edition, by Ryan Mitchell
The 3rd Edition of "Web Scraping with Python: Data Extraction from the Modern Web" by Ryan Mitchell was published by O'Reilly Media in early 2024. This updated edition is a comprehensive guide for programmers and data analysts who want to extract, store, and process information from the internet using Python.
Key Book Information (3rd Edition)
- Release Date: Published between February and March 2024.
- Length: Approximately 352 pages.
- ISBN-13: 978-1098145354 (Print) / 978-1098145316 (eBook).
- Publisher: O'Reilly Media.
Core Content & Topics
The book is divided into two primary sections:
- Part I: Building Scrapers (Mechanics)
  - Internet Fundamentals: Understanding HTTP, HTML, CSS, and how web servers respond.
  - Basic Scraping: Using BeautifulSoup and the Requests library to parse static HTML.
  - Crawling: Developing automated crawlers to traverse entire domains or multiple sites.
  - Frameworks: Writing spiders with the Scrapy framework for large-scale projects.
  - Data Storage: Practical methods for saving data to CSV, MySQL, or other formats.
- Part II: Advanced Scraping
  - Modern Challenges: Handling JavaScript and dynamic content using Selenium.
  - API Interactions: Scraping undocumented APIs and parsing JSON responses.
  - Data Processing: Reading PDFs and Word docs, cleaning "dirty" data, and basic Natural Language Processing (NLP).
  - Image Recognition: Using Tesseract and Pillow for OCR and solving CAPTCHAs.
  - Avoiding Traps: Techniques for handling cookies, headers, and avoiding bot blockers or honeypots.
  - Parallel Scraping: Utilizing multithreading and multiprocessing to speed up data collection.
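To give a flavor of the Part I topics (parsing static HTML, collecting links for a crawler, and saving results as CSV), here is a minimal sketch. It is not taken from the book: the book works with BeautifulSoup and Requests, while this example swaps in Python's standard-library `html.parser`, `urllib.parse`, and `csv` so it runs without any installs. The sample HTML, the `example.com` URLs, and the names `LinkCollector`, `extract_links`, and `links_to_csv` are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Example Catalog</h1>
  <a href="/items/1">First item</a>
  <a href="/items/2">Second item</a>
  <a href="https://other.example.org/about">About</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (absolute_url, link_text) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Parse static HTML and return the links it contains."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def links_to_csv(links):
    """Serialize (url, text) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url", "text"])
    writer.writerows(links)
    return buffer.getvalue()


links = extract_links(SAMPLE_HTML, "https://example.com/")
csv_text = links_to_csv(links)
print(csv_text)
```

A crawler along the lines the book describes would feed the collected URLs back into a queue of pages to visit; that loop, along with politeness measures such as rate limiting, is omitted here for brevity.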
About the Author
Ryan Mitchell is a principal software engineer at Gerson Lehrman Group (GLG) and an expert in web scraping, application security, and data science. She has also authored Unlocking Python and hosts several LinkedIn Learning courses on Python.