Qetch Package

This is the base qetch package.

qetch.get_downloader(content, init=False, *args, **kwargs)[source]

Gets the first downloader that can handle a given content.

Parameters:
  • content (Content) – The content that needs to be downloaded
  • init (bool, optional) – If True initializes the class, otherwise returns the class
Returns:

The downloader that can handle the content.

Return type:

downloaders._common.BaseDownloader

Examples

Basic usage…

>>> import qetch
>>> content = next(qetch.get_extractor(GFYCAT_URL, init=True)
...     .extract(GFYCAT_URL))[0]
>>> downloader = qetch.get_downloader(content, init=True)
>>> print(downloader)
<HTTPDownloader at 0xABCDEF1234567890>
qetch.get_extractor(url, init=False, *args, **kwargs)[source]

Gets the first extractor that can handle a given url.

Parameters:
  • url (str) – The url that needs to be extracted
  • init (bool, optional) – If True initializes the class, otherwise returns the class
Returns:

The extractor that can handle the url.

Return type:

extractors._common.BaseExtractor

Examples

Basic usage…

>>> import qetch
>>> extractor = qetch.get_extractor(GFYCAT_URL, init=True)
>>> print(extractor)
<GfycatExtractor "gfycat">

qetch.auth

class qetch.auth.AuthRegistry(**kwargs)[source]

Bases: dict

Custom borg-style registry dictionary.

This registry dictionary utilizes the borg design pattern and maintains the same state across multiple instances. This means that multiple instances of this object can exist, but the values between them will stay synchronized.

Examples

Basic usage…

>>> from qetch.auth import (AuthRegistry,)
>>> from qetch.extractors import (GfycatExtractor,)
>>> registry_1 = AuthRegistry()
>>> registry_1[GfycatExtractor.name] = ('KEY', 'SECRET',)
>>> print(registry_1[GfycatExtractor.name])
('KEY', 'SECRET')
>>> registry_2 = AuthRegistry()
>>> print(registry_2[GfycatExtractor.name])
('KEY', 'SECRET')
>>> registry_1[GfycatExtractor.name] = ('USERNAME', 'PASSWORD',)
>>> print(registry_2[GfycatExtractor.name])
('USERNAME', 'PASSWORD')
clear() → None. Remove all items from D.[source]
copy() → a shallow copy of D[source]
items() → a set-like object providing a view on D's items[source]
keys() → a set-like object providing a view on D's keys[source]
pop(k[, d]) → v, remove specified key and return the corresponding value.[source]

If key is not found, d is returned if given, otherwise KeyError is raised

update([E, ]**F) → None. Update D from dict/iterable E and F.[source]

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D's values[source]
class qetch.auth.AuthTypes[source]

Bases: enum.Enum

An enumeration of available authentication types.

Values:
  • NONE: No authentication required
  • BASIC: Basic (username, password) authentication required
  • OAUTH: Standard OAuth (key, secret) authentication required
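
A minimal sketch of dispatching on these values when building an auth tuple (the build_auth helper is hypothetical, not part of the package):

>>> from qetch.auth import (AuthTypes,)
>>> def build_auth(auth_type, key=None, secret=None):
...     # hypothetical helper, NONE requires no credentials
...     if auth_type is AuthTypes.NONE:
...         return None
...     # both BASIC and OAUTH expect a two-item tuple
...     return (key, secret)
>>> build_auth(AuthTypes.OAUTH, 'KEY', 'SECRET')
('KEY', 'SECRET')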

qetch.content

This is the base content instance which is used to normalize hosted media for use between extractors and downloaders. The most important attributes of this object are the following:

  • uid: The unique id that identifies the content
    (even unique between levels of quality).
  • source: The url that was given to the extractor for extracting.
  • fragments: A list of urls from which the raw content can be retrieved
    (a list in case the content is fragmented/segmented).
  • quality: A float value between 0 and 1, 1 being the best quality format.
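
For example, assuming GFYCAT_URL is a supported url, the attributes of an extracted instance can be inspected directly (a sketch, the printed values are only illustrative):

>>> import qetch
>>> content = next(qetch.get_extractor(GFYCAT_URL, init=True)
...     .extract(GFYCAT_URL))[0]
>>> content.uid
'gfycat-GFYCAT_ID-mp4Url'
>>> content.quality
1.0
>>> len(content.fragments)
1
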
class qetch.content.Content(uid, source, fragments, extractor, extension=None, title=None, description=None, quality=0.0, uploaded_by=None, uploaded_date=None, metadata={})[source]

Bases: object

The resulting content instance yielded by extractors.

get_size()[source]

Returns the sum of the length of the fragments.

Returns:The sum of the length of the fragments.
Return type:int
description

The description of the content.

Returns:The description of the content.
Return type:str
extension

The extension of the resulting content.

Returns:The extension for the resulting content.
Return type:str
extractor

The extractor which discovered the content.

Returns:The extractor which discovered the content.
Return type:BaseExtractor
fragments

A list of urls which represent the raw content.

Returns:A list of urls which represent the raw content.
Return type:list[str]
metadata

Any metadata for the current content.

Returns:Any metadata for the current content.
Return type:dict[str,…]
quality

The contextual quality for the current content.

Returns:The contextual quality for the current content.
Return type:float
source

The given source url from where the content came from.

Returns:The given source url from where the content came from.
Return type:furl.furl
title

The title of the content.

Returns:The title of the content.
Return type:str
uid

The unique id of the discovered content.

Returns:The unique id of the discovered content.
Return type:str
uploaded_by

A string of the uploader’s name.

Returns:A string of the uploader’s name.
Return type:str
uploaded_date

The datetime the content was uploaded.

Returns:The datetime the content was uploaded.
Return type:datetime.datetime

qetch.extractors

Below is a list of the currently included extractors, all of which should extend BaseExtractor. The purpose of extractors is to take a url and yield lists of similar content instances.

This allows content with various levels of quality to have a relationship with each other. For example, gfycat.com hosts various formats of the same media (mp4, webm, webp, gif, etc…). When extracting the content for a gfycat url, an extractor will yield a list containing a different content instance for each of these formats, each with its own quality value. This allows the developer to choose the desired content from the list extracted for a single resource.
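
A simple way to pick the best available format from a yielded list is to select the instance with the highest quality value (a sketch, assuming GFYCAT_URL is a supported url):

>>> from qetch.extractors import (GfycatExtractor,)
>>> content_list = next(GfycatExtractor().extract(GFYCAT_URL))
>>> best = max(content_list, key=lambda content: content.quality)
>>> print(best)
<Content (1.0) "gfycat-GFYCAT_ID-mp4Url">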

BaseExtractor

class qetch.extractors._common.BaseExtractor[source]

Bases: abc.ABC

The base extractor. All extractors should extend this.

authenticate(auth)[source]

Handles authenticating the extractor if necessary.

Parameters:auth (tuple[str, str]) – The authentication tuple, if available.
classmethod can_handle(url)[source]

Determines if an extractor can handle a url.

Parameters:url (str) – The url to check
Returns:True if the extractor can handle the url, otherwise False
Return type:bool
extract(url, auth=None)[source]

Extracts lists of content from a url.

Note

When an extractor can handle a url with a given {handle_name: regex} dictionary, the extract() method assumes that a method handle_{handle_name} exists to handle that specific url.

If an appropriately named method does not exist, a NotImplementedError is raised.

Parameters:
  • url (str) – The url to extract content from.
  • auth (tuple[str, str], optional) – The auth tuple if available.
Raises:

NotImplementedError – If a given handle_{handle_name} method does not exist.

Yields:

list[Content] – A list of similar content of different qualities

Examples

Basic usage where GFYCAT_ID is the id determined from GFYCAT_URL.

>>> from qetch.extractors import (GfycatExtractor,)
>>> for content_list in GfycatExtractor().extract(GFYCAT_URL):
...    for content in content_list:
...        print(content)
<Content (1.0) "gfycat-GFYCAT_ID-mp4Url">
<Content (0.5) "gfycat-GFYCAT_ID-webmUrl">
<Content (0.0) "gfycat-GFYCAT_ID-webpUrl">
<Content (0.0) "gfycat-GFYCAT_ID-mobileUrl">
<Content (0.0) "gfycat-GFYCAT_ID-mobilePosterUrl">
<Content (0.0) "gfycat-GFYCAT_ID-posterUrl">
<Content (0.0) "gfycat-GFYCAT_ID-thumb360Url">
<Content (0.0) "gfycat-GFYCAT_ID-thumb360PosterUrl">
<Content (0.0) "gfycat-GFYCAT_ID-thumb100PosterUrl">
<Content (0.0) "gfycat-GFYCAT_ID-max5mbGif">
<Content (0.0) "gfycat-GFYCAT_ID-max2mbGif">
<Content (0.0) "gfycat-GFYCAT_ID-mjpgUrl">
<Content (0.0) "gfycat-GFYCAT_ID-miniUrl">
<Content (0.0) "gfycat-GFYCAT_ID-miniPosterUrl">
<Content (0.25) "gfycat-GFYCAT_ID-gifUrl">
Return type:Generator[List[Any], None, None]
classmethod get_handle(url)[source]

Gets the handle match for a given url.

Parameters:url (str) – The url to get the handle match for.
Returns:A tuple of handle and the match for the url.
Return type:tuple[str, Match]
merge(ordered_filepaths)[source]

Handles merging downloaded fragments into a resulting file.

Parameters:ordered_filepaths (list[str]) – The list of ordered filepaths to downloaded fragments.
Returns:The resulting merged file’s filepath.
Return type:str
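
Basic usage is a single call with the fragment filepaths in order (a sketch, both the fragment paths and the returned filepath are hypothetical):

>>> from qetch.extractors import (GfycatExtractor,)
>>> GfycatExtractor().merge(['/tmp/fragment_0.mp4', '/tmp/fragment_1.mp4'])
'/tmp/merged_result.mp4'
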
session

The default session for the extractor.

Returns:The default session for the extractor.
Return type:requests.Session
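
Tying the conventions above together, a minimal custom extractor might look like the following sketch (EchoExtractor, its domain, and its regex are hypothetical; only the handles/handle_{handle_name} convention comes from the documentation above, and additional overrides such as merge() may be required depending on which methods are abstract):

>>> from qetch.content import (Content,)
>>> from qetch.extractors._common import (BaseExtractor,)
>>> class EchoExtractor(BaseExtractor):
...     name = 'echo'
...     description = 'Hypothetical host which serves urls back directly.'
...     domains = ['example.com']
...     handles = {
...         'raw': r'^https?://(?:www\.)?example\.com/(?P<id>[a-zA-Z0-9]+)$'}
...     def handle_raw(self, source, match):
...         # yield a single-entry list of content for the matched url
...         yield [Content(
...             uid=f'echo-{match.group("id")}',
...             source=source,
...             fragments=[source],
...             extractor=self,
...             quality=1.0)]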

gfycat

class qetch.extractors.gfycat.GfycatExtractor[source]

Bases: qetch.extractors._common.BaseExtractor

The extractor for links to media from gfycat.com.

handle_basic(source, match)[source]

Handles basic links to gfycat media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

handle_raw(source, match)[source]

Handles raw links to gfycat media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

authentication = None
description = 'Site which hosts short high-quality video for sharing.'
domains = ['gfycat.com']
handles = {'basic': '^https?://(?:www\\.)?gfycat\\.com/(?:gifs/detail/)?(?P<id>[a-zA-Z]+)/?$', 'raw': '^https?://(?:[a-z]+\\.)gfycat\\.com/(?P<id>[a-zA-Z]+)(?:\\.[a-zA-Z0-9]+)$'}
name = 'gfycat'
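
Given the handles above, url matching can be checked without instantiating the extractor (the url here is only illustrative):

>>> from qetch.extractors import (GfycatExtractor,)
>>> GfycatExtractor.can_handle('https://gfycat.com/ForkedTestedGnat')
True
>>> GfycatExtractor.get_handle('https://gfycat.com/ForkedTestedGnat')[0]
'basic'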

imgur

class qetch.extractors.imgur.ImgurExtractor[source]

Bases: qetch.extractors._common.BaseExtractor

The extractor for links to media from imgur.com.

authenticate(auth)[source]

Handles authenticating the extractor if necessary.

Parameters:auth (tuple[str, str]) – The authentication tuple, if available.
handle_album(source, match)[source]

Handles album links to imgur media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

handle_basic(source, match)[source]

Handles basic links to imgur media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

handle_raw(source, match)[source]

Handles raw links to imgur media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

authentication = ('KEY', 'SECRET')
description = 'Dedicated image host originally built for Reddit.'
domains = ['imgur.com', 'i.imgur.com']
handles = {'album': '^https?://(?:www\\.)?imgur\\.com/(?:a|gallery)/(?P<id>[a-zA-Z0-9]+)/?$', 'basic': '^https?://(?:www\\.)?imgur\\.com/(?P<id>[a-zA-Z0-9]+)/?$', 'raw': '^https?://(?:www\\.)?(?:[a-z]\\.)imgur\\.com/(?P<id>[a-zA-Z0-9]+)\\..*$'}
name = 'imgur'

fourchan

class qetch.extractors.fourchan.FourChanExtractor[source]

Bases: qetch.extractors._common.BaseExtractor

The extractor for links to media from 4chan.org.

handle_raw(source, match)[source]

Handles raw links to 4chan media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

handle_thread(source, match)[source]

Handles thread links to 4chan media.

Parameters:
  • source (str) – The source url
  • match (Match) – The source match regex
Yields:

list[Content] – A list of various levels of quality content for the same source url

Return type:

Generator[List[Content], None, None]

authentication = None
description = 'A no-limits and lightly categorized temporary image host.'
domains = ['4chan.org', 'i.4chan.org']
handles = {'raw': '^https?://(?:www\\.)?i\\.4cdn\\.org/(?P<board>.*)/(?P<id>.*)\\.(?:[a-zA-Z0-9]+)$', 'thread': '^https?://(?:www\\.)?(?:boards\\.)?4chan\\.org/(?P<board>.*)/thread/(?P<id>.*)/?.*$'}
name = '4chan'

qetch.downloaders

Below is a list of the currently included downloaders, all of which should extend BaseDownloader. The purpose of downloaders is to take an extracted Content instance, download and merge its fragments, and save the resulting content to a given local system path.

Downloaders should be built to allow parallel fragment downloading and multiple connections for each fragment download. For example, the HTTPDownloader accepts both max_fragments and max_connections as parameters to the download() method, allowing max_fragments fragments to be processed at the same time with max_connections connections used for the download of each of those fragments. This means that up to (max_fragments * max_connections) connections between your IP and the host may exist at any point during the download.

It is best to cap this at a maximum of 10 connections, since many hosts will flag/ban IPs using more than 10 connections. By default, max_fragments and max_connections are set to 1 and 8 respectively, allowing a maximum of 8 connections from your IP to the host at any point, but only 1 fragment to be downloaded at a time.
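
For example, raising both limits while staying at that 10 connection threshold might look like the following sketch (2 fragments at a time with 5 connections each, for at most 2 * 5 = 10 connections; content stands in for a previously extracted Content instance):

>>> import os
>>> from qetch.downloaders import (HTTPDownloader,)
>>> saved_to = HTTPDownloader().download(
...     content,
...     os.path.expanduser('~/Downloads/saved_content.mp4'),
...     max_fragments=2,
...     max_connections=5)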

Downloaders should also support the usage of a progress_hook, which is sent updates on the download progress every update_delay seconds. See download() for a simple example.

BaseDownloader

class qetch.downloaders._common.BaseDownloader[source]

Bases: abc.ABC

The abstract base downloader. All downloaders must extend this class.

download(content, to_path, max_fragments=1, max_connections=8, progress_hook=None, update_delay=0.1)[source]

The simplified download method.

Note

The max_fragments and max_connections rules imply that potentially (max_fragments * max_connections) connections from the local system’s IP can exist at any time.

Many hosts will flag/ban IPs which utilize more than 10 connections for a single resource. For this reason, max_fragments and max_connections are set to 1 and 8 respectively by default.

Parameters:
  • content (Content) – The content instance to download.
  • to_path (str) – The path to save the resulting download to.
  • max_fragments (int, optional) – The number of fragments to process in parallel.
  • max_connections (int, optional) – The number of connections to allow for downloading a single fragment.
  • progress_hook (callable, optional) – A progress hook that accepts the arguments (download_id, current_size, total_size) for progress updates.
  • update_delay (float, optional) – The frequency (in seconds) at which progress updates are sent to the given progress_hook.
Returns:

The downloaded file’s local path.

Return type:

str

Examples

Basic usage where $HOME is the home directory of the currently executing user.

>>> import os
>>> from qetch.extractors import (GfycatExtractor,)
>>> from qetch.downloaders import (HTTPDownloader,)
>>> content = next(GfycatExtractor().extract(GFYCAT_URL))[0]
>>> saved_to = HTTPDownloader().download(
...     content,
...     os.path.expanduser('~/Downloads/saved_content.mp4'))
>>> print(saved_to)
$HOME/Downloads/saved_content.mp4

Similar basic usage, but with a given progress hook sent updates every 0.1 seconds.

>>> def progress(download_id, current, total):
...     print(f'{((current / total) * 100.0):6.2f}')
>>> saved_to = HTTPDownloader().download(
...     content,
...     os.path.expanduser('~/Downloads/saved_content.mp4'),
...     progress_hook=progress,
...     update_delay=0.1)
  0.00
  0.00
 23.01
 54.32
 73.09
 90.49
 97.12
100.00
>>> print(saved_to)
$HOME/Downloads/saved_content.mp4
handle_progress(download_id, content_length, update_delay=0.1)[source]

The progress reporting handler.

Parameters:
  • download_id (str) – The unique id of the download request.
  • content_length (int) – The total size of the downloading content.
  • update_delay (float, optional) – The frequency (in seconds) at which progress updates are emitted.
download_state

dict[str,DownloadState] – The download state dictionary.

progress_store

dict[str,int] – The downloaded content size for progress reporting.

class qetch.downloaders._common.DownloadState[source]

Bases: enum.Enum

An enum of allowed download states.

Values:
  • STOPPED: indicates the download is stopped (an error occurred)
  • RUNNING: indicates the download is running
  • PREPARING: indicates the download is starting up
  • FINISHED: indicates the download is finished (successfully)
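
A progress hook might consult these states through a downloader's download_state mapping (a sketch, assuming the mapping is keyed by download_id):

>>> from qetch.downloaders import (HTTPDownloader,)
>>> from qetch.downloaders._common import (DownloadState,)
>>> downloader = HTTPDownloader()
>>> def progress(download_id, current, total):
...     state = downloader.download_state.get(download_id)
...     if state is DownloadState.RUNNING:
...         print(f'{((current / total) * 100.0):6.2f}')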

http

class qetch.downloaders.http.HTTPDownloader[source]

Bases: qetch.downloaders._common.BaseDownloader

The downloader for HTTP served content.

classmethod can_handle(content)[source]

Determines if a given content can be handled by this downloader.

Parameters:content (Content) – The content to check.
Returns:True if the content can be handled, otherwise False.
Return type:bool
handle_chunk(download_id, url, to_path, start, end, chunk_size=1024)[source]

Handles downloading a specific range of bytes for a url.

Parameters:
  • download_id (str) – The unique id of the download request.
  • url (str) – The url to download.
  • to_path (str) – The local path to save the download.
  • start (int) – The starting byte position to download.
  • end (int) – The ending byte position to download.
  • chunk_size (int, optional) – The size of the chunks to stream in.
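
The underlying technique can be approximated with a plain requests call (a sketch of ranged HTTP downloading, not the actual implementation; it assumes the file at to_path already exists at its full size so each range can be written in place):

>>> import requests
>>> def fetch_range(url, to_path, start, end, chunk_size=1024):
...     # ask the server for only the bytes between start and end
...     response = requests.get(
...         url, headers={'Range': f'bytes={start}-{end}'}, stream=True)
...     with open(to_path, 'r+b') as stream:
...         stream.seek(start)
...         for chunk in response.iter_content(chunk_size=chunk_size):
...             stream.write(chunk)
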
handle_download(download_id, url, to_path, max_connections=8)[source]

Handles downloading a specific url.

Note

max_connections defaults to 8 because many content hosting sites will typically flag/ban IPs that use over 10 connections.

Parameters:
  • download_id (str) – The unique id of the download request.
  • url (str) – The url to download.
  • to_path (str) – The local path to save the download.
  • max_connections (int, optional) – The number of allowed connections for parallel downloading of the url.
session

requests.Session – The requests session to use for downloading.

Return type:requests.Session