Web scraping with Python: collecting more data from the modern web
Author
Publisher
O'Reilly Media
Publication Date
2018.
Language
English
More Details
ISBN
9781491985564
9781491985526
9781491985571
9781491985540
Table of Contents
From the eBook - Second edition.
Intro; Preface; What Is Web Scraping?; Why Web Scraping?; About This Book; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; I. Building Scrapers; 1. Your First Web Scraper; Connecting; An Introduction to BeautifulSoup; Installing BeautifulSoup; Running BeautifulSoup; Connecting Reliably and Handling Exceptions; 2. Advanced HTML Parsing; You Don't Always Need a Hammer; Another Serving of BeautifulSoup; find() and find_all() with BeautifulSoup; Other BeautifulSoup Objects; Navigating Trees; Dealing with children and other descendants
Dealing with siblings; Dealing with parents; Regular Expressions; Regular Expressions and BeautifulSoup; Accessing Attributes; Lambda Expressions; 3. Writing Web Crawlers; Traversing a Single Domain; Crawling an Entire Site; Collecting Data Across an Entire Site; Crawling Across the Internet; 4. Web Crawling Models; Planning and Defining Objects; Dealing with Different Website Layouts; Structuring Crawlers; Crawling Sites Through Search; Crawling Sites Through Links; Crawling Multiple Page Types; Thinking About Web Crawler Models; 5. Scrapy; Installing Scrapy; Initializing a New Spider
Writing a Simple Scraper; Spidering with Rules; Creating Items; Outputting Items; The Item Pipeline; Logging with Scrapy; More Resources; 6. Storing Data; Media Files; Storing Data to CSV; MySQL; Installing MySQL; Some Basic Commands; Integrating with Python; Database Techniques and Good Practice; "Six Degrees" in MySQL; Email; II. Advanced Scraping; 7. Reading Documents; Document Encoding; Text; Text Encoding and the Global Internet; A history of text encoding; Encodings in action; CSV; Reading CSV Files; PDF; Microsoft Word and .docx; 8. Cleaning Your Dirty Data; Cleaning in Code
Data Normalization; Cleaning After the Fact; OpenRefine; Installation; Using OpenRefine; Filtering; Cleaning; 9. Reading and Writing Natural Languages; Summarizing Data; Markov Models; Six Degrees of Wikipedia: Conclusion; Natural Language Toolkit; Installation and Setup; Statistical Analysis with NLTK; Lexicographical Analysis with NLTK; Additional Resources; 10. Crawling Through Forms and Logins; Python Requests Library; Submitting a Basic Form; Radio Buttons, Checkboxes, and Other Inputs; Submitting Files and Images; Handling Logins and Cookies; HTTP Basic Access Authentication
Other Form Problems; 11. Scraping JavaScript; A Brief Introduction to JavaScript; Common JavaScript Libraries; jQuery; Google Analytics; Google Maps; Ajax and Dynamic HTML; Executing JavaScript in Python with Selenium; Additional Selenium Webdrivers; Handling Redirects; A Final Note on JavaScript; 12. Crawling Through APIs; A Brief Introduction to APIs; HTTP Methods and APIs; More About API Responses; Parsing JSON; Undocumented APIs; Finding Undocumented APIs; Documenting Undocumented APIs; Finding and Documenting APIs Automatically; Combining APIs with Other Data Sources; More About APIs