Qetch


A framework for site content extractors and downloaders.
A WIP migration from youtube-dl.

Documentation

Getting Started

This framework is my attempt at modernizing the type of content extraction that youtube-dl performs. It’s called “Qetch” because I couldn’t think of anything better…

I started this because I needed a way of extracting and downloading raw content when a user simply drops in a URL. The issue with existing solutions is that they have unintuitive APIs and overcomplicated implementations (no offense intended, I really appreciate the work that went into them).

But I’m a stickler and wanted a cleaner, more modular way of building extractors and quicker downloaders; also something that doesn’t strive to be “Pure Python”, because pure Python isn’t real Python.

Note

Qetch requires Python 3.6+. Because support for Python 2.7 is being dropped and because of the many improvements since 3.5, it was decided unanimously (meaning just me) that this project will only support 3.6+.

Installation

Since Qetch is in pre-development/proof-of-concept stages, it is not yet on PyPI. You can install Qetch by cloning the repository at stephen-bunn/qetch and installing the dependencies.

git clone https://github.com/stephen-bunn/qetch.git
cd ./qetch
pip install -r ./requirements.txt

Pipenv is also an option! If you don’t yet know about Pipenv, you should definitely start using it!

Basic Usage

The quickest way to utilize Qetch is to just allow Qetch to discover what extractors/downloaders are required for a URL you give it.

import os
import qetch

# discover what extractor can handle a URL and initialize it
extractor = qetch.get_extractor(URL, init=True)

# extract the first list of discovered content and take its first quality variant
content = next(extractor.extract(URL))[0]

# discover what downloader can handle the extracted content and initialize it
downloader = qetch.get_downloader(content, init=True)

# download the content to a given filepath
downloader.download(content, os.path.expanduser('~/Downloads/downloaded_file'))

As shown in the example above, there are several objects that make up Qetch. You can learn more about them in the Project Structure documentation and the Qetch Package reference.

Project Structure

I like pictures, so bear with me while I use a couple that some of you might roll your eyes at.

Qetch mainly consists of four separate components. These are listed in the following sections with a quick and simple description of each one and its purpose.

Content

The Content is a simple object which stores all the required information needed to download something.

(diagram: _images/content.png)

Most of the attributes in this object are sugar used to better represent the content. The only three that really matter are the uid, extractor, and fragments.

The uid is simply a unique identifier for the content. The extractor is just a reference to the BaseExtractor subclass that was used to extract the content.

The actual URLs which need to be downloaded to form the full content are items in the fragments list. In most cases the length of this list is 1 (because the raw content is not hosted as segments). However, for sites that do stream segments of media, the fragments list will likely contain more than one item.

Because content may be split across fragments, the size of the full content must be calculated rather than read directly. This is performed through the get_size() method.
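To make the shape concrete, here is a minimal sketch of what such a content object might look like. The field names follow the attributes described above, but the implementation is illustrative, not Qetch's actual code; the size-fetching callable is injected so the summing logic stays testable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Content:
    """Illustrative stand-in for Qetch's content object."""

    uid: str
    source: str
    fragments: List[str]
    quality: float = 0.0
    extractor: Optional[type] = None

    def get_size(self, fetch_size: Callable[[str], int]) -> int:
        # the full size is the sum of the sizes of every fragment
        return sum(fetch_size(url) for url in self.fragments)
```

In practice the callable passed to get_size would issue an HTTP HEAD request and read the Content-Length header for each fragment URL.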

Extractors

All extractors are subclasses of BaseExtractor, and provide special logic to handle the extraction of certain URLs. This usually means that a handled domain will have an extractor to deal with that domain’s URLs.

This is essentially the core of the project, since it requires contributions from the community to grow and include the ability for different domains to have their content extracted. If you have the logic to create an extractor for a domain that is not yet handled, please make a pull request following our guidelines.

(diagram: _images/extractors.png)

The overall purpose of extractors is to yield one or more lists of Content instances that can be downloaded from a given URL.

The reason extractors yield lists is that a site might host various levels of quality for what is essentially the same content. This allows the user to choose which quality of content they want from the available qualities found at the given URL.
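A hypothetical extractor following this pattern might look like the sketch below. ExampleExtractor, its URL pattern, and the dict-based content records are all invented for illustration; real extractors subclass BaseExtractor and yield lists of Content instances.

```python
import re
from typing import Iterator, List


class ExampleExtractor:
    # pattern of URLs this extractor claims to handle (hypothetical domain)
    handles = re.compile(r"https?://example\.com/(?P<media_id>\w+)")

    @classmethod
    def can_handle(cls, url: str) -> bool:
        return bool(cls.handles.match(url))

    def extract(self, url: str) -> Iterator[List[dict]]:
        media_id = self.handles.match(url).group("media_id")
        # yield ONE list per logical piece of content; each entry in the
        # list is the same content at a different level of quality
        yield [
            {"uid": f"{media_id}-mp4",
             "fragments": [f"https://example.com/{media_id}.mp4"],
             "quality": 1.0},
            {"uid": f"{media_id}-gif",
             "fragments": [f"https://example.com/{media_id}.gif"],
             "quality": 0.5},
        ]
```

Each yielded list represents one resource; its entries are quality variants of that resource, so a caller can pick whichever format suits them.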

Authentication

Sometimes there is no good way to retrieve the necessary information for a certain URL due to authentication requirements imposed by the site itself. To handle this, the AuthRegistry was created so that extractors can declare what kind of authentication is required before they can extract content.

(diagram: _images/auth.png)

An extractor specifies the necessary AuthTypes literal in the authentication property and applies any authentication in the authenticate() method before extraction.

The AuthRegistry is a Borg-pattern dictionary which shares authentication information across all instances of the registry.
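The Borg pattern can be sketched in a few lines; this shows the general technique (all instances point at one shared state dict), not necessarily Qetch's exact implementation:

```python
class AuthRegistry:
    """Borg pattern: every instance shares the same underlying state,
    so credentials registered anywhere are visible everywhere."""

    _shared_state: dict = {}

    def __init__(self):
        # point this instance's attribute dict at the shared one
        self.__dict__ = self._shared_state

    def __setitem__(self, domain, credentials):
        self.__dict__[domain] = credentials

    def __getitem__(self, domain):
        return self.__dict__[domain]

    def __contains__(self, domain):
        return domain in self.__dict__
```

Unlike a singleton, the Borg pattern lets callers construct registries freely while still behaving as one shared store.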

Downloaders

Downloaders are similarly structured to extractors, but their purpose is to download a single Content instance to a specified filepath. They all extend BaseDownloader and provide progress hooks into the download process.

(diagram: _images/downloaders.png)

All of the downloaders should support multi-threaded/multi-connection downloads similar to the HTTPDownloader.
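The core of a multi-connection download is splitting a known content length into byte ranges, one per connection, which are then fetched in parallel via HTTP Range requests. A minimal sketch of the range-splitting step (illustrative, not the HTTPDownloader's actual code):

```python
from typing import List, Tuple


def split_ranges(size: int, connections: int) -> List[Tuple[int, int]]:
    """Split ``size`` bytes into inclusive (start, end) byte ranges,
    one per connection; the last range absorbs any remainder."""
    chunk = size // connections
    ranges = []
    for index in range(connections):
        start = index * chunk
        # the final range runs to the last byte to cover the remainder
        end = size - 1 if index == connections - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges
```

Each (start, end) pair then maps directly to a `Range: bytes=start-end` request header on its own connection.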

The optional merging of fragments is handled by the extractor itself in the merge() method (since downloaders are abstracted away from extraction details). If the extractor does require downloaded fragments to be merged, it is necessary for the extractor to override that method.

Basic Overview

Just to visualize the overall process involved in downloading a URL from start to finish, here is a simple flow chart describing the process.

(diagram: _images/download_flow.png)

Changelog

All notable changes to qetch will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

unreleased

  • added basic project structure migration from previous proof-of-concepts
  • enhanced documentation to make it readable
  • fixed multi-connection threaded progress reporting
  • removed broken WIP extractors from previous repositories

Reference

Contributing

When contributing to this repository, please first discuss the change you wish to make via an issue to the owners of this repository before submitting a pull request.

Important

We have an enforced style guide and a code of conduct. Please follow them in all your interactions with this project.

Style Guide

  • We strictly follow PEP8 and utilize Sphinx docstrings on all classes and functions.
  • We employ flake8 as our linter with exceptions to the following rules:
    • D203
    • F401
    • E123
  • Linting and test environments are configured via tox.ini.
  • An .editorconfig file is included in this repository which dictates whitespace, indentation, and file encoding rules.
  • Although requirements.txt and requirements_dev.txt do exist, Pipenv is utilized as the primary virtual environment and package manager for this project.
  • We strictly utilize Semantic Versioning as our version specification.

Issues

Issues should follow the included ISSUE_TEMPLATE found in .github/ISSUE_TEMPLATE.md.

  • Issues should contain the following sections:
    • Expected Behavior
    • Current Behavior
    • Possible Solution
    • Steps to Reproduce (for bugs)
    • Context
    • Your Environment

These sections help the developers greatly by providing a thorough understanding of the context of the bug or requested feature without having to launch a full-fledged discussion inside of the issue.

Pull Requests

Pull requests should follow the included PULL_REQUEST_TEMPLATE found in .github/PULL_REQUEST_TEMPLATE.md.

  • Pull requests should always be from a topic/feature/bugfix (left side) branch. Pull requests from master branches will not be merged.
  • Pull requests should not fail our requested style guidelines or linting checks.

Code of Conduct

Our code of conduct is taken directly from the Contributor Covenant since it directly hits all of the points we find necessary to address.

Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards

Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others’ private information, such as a physical or electronic address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting
Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at stephen@bunn.io. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

Qetch Package

This is the base qetch package.


qetch.auth

qetch.content

This is the base content instance which is used to normalize hosted media for use between extractors and downloaders. The most important attributes of this object are the following:

  • uid: The unique id that identifies the content
    (even unique between levels of quality).
  • source: The url that was given to the extractor for extracting.
  • fragments: A list of urls where the raw content can be retrieved from
    (is a list in case that content is fragmented/segmented).
  • quality: A float value between 0 and 1, 1 being the best quality format.
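Since each extracted list holds the same content at different qualities, picking the best variant is a one-liner over the quality attribute. Here Content is a stand-in namedtuple for illustration only:

```python
from collections import namedtuple

# stand-in for Qetch's content object, for illustration only
Content = namedtuple("Content", ["uid", "quality"])

variants = [
    Content("clip-gif", 0.25),
    Content("clip-webm", 0.75),
    Content("clip-mp4", 1.0),
]

# quality is a float in [0, 1] with 1 being best, so max() picks the best format
best = max(variants, key=lambda content: content.quality)
```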

qetch.extractors

Below is a list of the currently included extractors, all of which should extend BaseExtractor. The purpose of extractors is to take a URL and yield lists of similar content instances.

This allows content with various levels of quality to have a relationship with each other. For example, gfycat.com hosts various formats of some media (mp4, webm, webp, gif, etc…). When extracting the content for a gfycat url, an extractor will yield a list containing a different content instance for each of these formats, each with its own quality value. This hopefully allows the developer to correctly choose the desired content from a list of content extracted for a single resource.

BaseExtractor
gfycat
imgur
fourchan

qetch.downloaders

Below is a list of the currently included downloaders, all of which should extend BaseDownloader. The purpose of downloaders is to take an extracted Content instance, then download and merge its fragments so that the content ends up at a given local system path.

Downloaders should be built to allow parallel fragment downloading and multiple connections for each fragment download. For example, the HTTPDownloader allows both max_fragments and max_connections as parameters to the download() method. This allows max_fragments to be processed at the same time and max_connections to be used for the download of each of those fragments. This means that up to (max_fragments * max_connections) connections may exist between your IP and the host at any point during the download.

It is best to cap this at 10 connections or fewer, since many hosts will flag/ban IPs using more than 10 connections. By default, max_fragments and max_connections are set to 1 and 8 respectively, allowing a maximum of 8 connections from your IP to the host at any point, but only 1 fragment to be downloaded at a time.

Downloaders should also support the usage of a progress_hook which is sent updates on the download progress every update_delay seconds. See the example in download() for a very simple example.
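A progress hook is just a callable the downloader invokes periodically with progress information. The exact signature below (download_id, current, total) is assumed for illustration; check the download() documentation for the real one.

```python
def format_progress(download_id: str, current: int, total: int) -> str:
    """Render one progress update as a human-readable line."""
    percent = (current / total) * 100.0 if total else 0.0
    return f"[{download_id}] {percent:5.1f}% ({current}/{total} bytes)"


def progress_hook(download_id, current, total):
    # a hook like this would be passed as progress_hook=... to download()
    print(format_progress(download_id, current, total))
```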

BaseDownloader
http