PHP Web Scraping

This list contains PHP libraries related to web scraping and data processing.

  • PHP Web Scraping

    • Network

    • Web-Scraping Frameworks

    • HTML/XML Parsing

    • Text Processing

    • Specific Formats Processing

    • Natural Language Processing

    • Browser automation and emulation

    • Multiprocessing

    • Queue

    • Cloud Computing

    • Email

    • URL Manipulation

    • Web Content Extracting

    • Asynchronous

    • WebSocket

    • DNS Resolving

    • Computer Vision

    • Geocoding

    • API Clients

    • Other PHP Lists

Network

Web-Scraping Frameworks

  • Crawler (crwlr) - A library for rapid (web) crawler and scraper development.

  • Roach - A port of the popular Scrapy package for Python. Includes adapters for Laravel and Symfony.

HTML/XML Parsing

  • HTML5 PHP - An HTML5 parser and serializer library.

  • QueryPath - A jQuery-like library for working with XML and HTML documents in PHP. It supports HTML5 via the HTML5-PHP project.

  • DiDOM - A fast HTML parser built on top of plain PHP.

  • PHPScraper - A highly opinionated web-scraping interface.

  • DomCrawler (Symfony) - A component that eases DOM navigation for HTML and XML documents.
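
For orientation, the kind of DOM navigation these parsers wrap can be sketched with PHP's built-in DOM extension alone. This is a minimal sketch and not the API of any library listed above; the sample markup and the `extractLinks` helper name are hypothetical.

```php
<?php
// Hypothetical helper (not part of any listed library):
// collect href/text pairs from every anchor that has an href attribute.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings silently
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => trim($node->textContent),
        ];
    }
    return $links;
}

// Sample markup, assumed for illustration only.
$html = '<html><body><a href="/docs">Docs</a> <a href="https://example.com/">Home</a></body></html>';
print_r(extractLinks($html));
```

The libraries above generally offer friendlier selectors (for example CSS-style queries) on top of this kind of traversal, which is the main reason to prefer them over hand-written XPath.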

Text Processing

Libraries for parsing and manipulating plain texts.

  • General

    • ANSI to HTML5 - An ANSI to HTML5 converter library.

    • Patchwork UTF-8 - A portable library for working with UTF-8 strings.

    • Hoa String - Another UTF-8 string library.

    • Stringy - A string manipulation library with multibyte support.

    • Color Jizz - A library for manipulating and converting colours.

    • Text - A text manipulation library.

    • Flux - A regular expression building library.

  • Transliteration

    • Urlify - A PHP port of Django's URLify.js.

    • Slugify - A library to convert strings to slugs.

  • User-agent

    • Device Detector - A library for parsing user agent strings.

    • Mobile-Detect - A lightweight PHP class for detecting mobile devices (including tablets).

    • UA Parser - Another library for parsing user agent strings.

  • Units of measure

    • ByteUnits - A library to parse, format and convert byte units in binary and metric systems.

    • PHP Units of Measure - A library for converting between units of measure.

    • PHP Conversion - Another library for converting between units of measure.

  • Phone number

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • CSV

    • CSV - A CSV data manipulation library.

  • Office

    • PHPWord - A library for working with Microsoft Word documents.

    • PHPExcel - A library for working with Microsoft Excel documents.

    • PHPPowerPoint - A library for working with Microsoft PowerPoint documents.

    • ExcelAnt - A library for manipulating Microsoft Excel documents.

  • Markdown

  • BBCode

    • Decoda - A lightweight lexical string parser for BBCode styled markup.

  • JSON

    • JsonMapper - A library that maps nested JSON structures onto PHP classes.

  • vCard

    • vobject - The VObject library allows you to easily parse and manipulate iCalendar and vCard objects.

  • File Type Detection

    • Hoa Mime - A MIME detection library.

    • Canal - A library to determine internet media types.

    • Apache MIME Types - A library that parses Apache MIME types.

  • GeoJSON

    • GeoJSON - A GeoJSON implementation.

Natural Language Processing

Libraries for working with human languages.

  • PHP NlpTools - Natural language processing tools in PHP.

  • nlpTools - A natural language processing toolkit for PHP.

Browser automation and emulation

  • php-webdriver - A PHP client for WebDriver.

  • PHP PhantomJS - Executes PhantomJS commands through PHP.

  • Mink - A universal API for multiple browser emulators (Selenium, Zombie.js, Goutte).

Multiprocessing

  • Spork - A process forking library.

Asynchronous

Libraries for asynchronous network programming.

  • React - An event driven non-blocking I/O library.

  • Rx.PHP - A reactive extension library.

  • Hoa EventSource - An event source library.

  • Evenement - An event dispatcher library.

  • Event - An event library with a focus on domain events.

  • Broadway - An event source and CQRS library.

Queue

Cloud Computing

  • TODO

Email

Libraries for parsing email.

URL Manipulation

Libraries for parsing URLs.

  • Purl - A URL manipulation library.

  • PHP Domain Parser - A domain suffix parser library.

  • Uri (The PHP League) - A simple URL manipulation library (PSR-7 compatible).

  • Url (crwlr) - A Swiss Army knife for URLs.
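
As a rough illustration of the task these libraries address, the sketch below normalises a URL with PHP's built-in `parse_url`, `parse_str` and `http_build_query` only. The `stripQueryParams` helper and the sample URL are hypothetical and are not taken from any library listed above.

```php
<?php
// Hypothetical helper: remove the named query parameters from an absolute URL.
// Simplification for this sketch: user info and the fragment are dropped.
function stripQueryParams(string $url, array $remove): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Decode the query string into an array, then drop unwanted keys.
    $query = [];
    parse_str($parts['query'] ?? '', $query);
    $query = array_diff_key($query, array_flip($remove));

    // Reassemble the URL from its remaining components.
    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '/';
    if ($query !== []) {
        $result .= '?' . http_build_query($query);
    }
    return $result;
}

echo stripQueryParams('https://example.com/list?page=2&utm_source=feed', ['utm_source']), PHP_EOL;
// prints: https://example.com/list?page=2
```

Stripping tracking parameters this way is a common step when deduplicating crawled URLs. A dedicated library is usually the safer choice once relative URLs, internationalised domains or public-suffix handling come into play.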

Web Content Extracting

  • Text and Meta Data from Web Documents

    • Essence - A library for extracting web media.

    • Embera - An oEmbed consumer library.

    • Embed - An awesome library for getting useful information from a webpage.

  • Video

    • Youtube-Downloader - A PHP script for downloading videos from YouTube. It can also turn YouTube feeds into RSS enclosures for podcatchers.

WebSocket

Libraries for working with WebSocket.

DNS Resolving

  • Net_DNS2 - A native PHP DNS resolver and updater.

Computer Vision

Geocoding

Other PHP Lists
