Qetch Package

This is the base qetch package.

qetch.get_downloader(content, init=False, *args, **kwargs)
    Gets the first downloader that can handle a given content.

    Parameters:
        content (Content) – The content instance to find a downloader for.
        init (bool, optional) – If True, return an initialized instance
            of the downloader rather than the downloader class.

    Returns: The downloader that can handle the content.
    Return type: BaseDownloader

    Examples:
        Basic usage…

        >>> import qetch
        >>> content = next(qetch.get_extractor(GFYCAT_URL, init=True)
        ...     .extract(GFYCAT_URL))[0]
        >>> downloader = qetch.get_downloader(content, init=True)
        >>> print(downloader)
        <HTTPDownloader at 0xABCDEF1234567890>
qetch.get_extractor(url, init=False, *args, **kwargs)
    Gets the first extractor that can handle a given url.

    Parameters:
        url (str) – The url to find an extractor for.
        init (bool, optional) – If True, return an initialized instance
            of the extractor rather than the extractor class.

    Returns: The extractor that can handle the url.
    Return type: BaseExtractor

    Examples:
        Basic usage…

        >>> import qetch
        >>> extractor = qetch.get_extractor(GFYCAT_URL, init=True)
        >>> print(extractor)
        <GfycatExtractor "gfycat">
qetch.auth

class qetch.auth.AuthRegistry(**kwargs)
    Bases: dict

    Custom borg-style registry dictionary.

    This registry dictionary utilizes the borg design pattern and
    maintains the same state across multiple instances. This means that
    multiple instances of this object can exist, but the values between
    them will stay synchronized.

    Examples:
        Basic usage…

        >>> from qetch.auth import (AuthRegistry,)
        >>> from qetch.extractors import (GfycatExtractor,)
        >>> registry_1 = AuthRegistry()
        >>> registry_1[GfycatExtractor.name] = ('KEY', 'SECRET',)
        >>> print(registry_1[GfycatExtractor.name])
        ('KEY', 'SECRET')
        >>> registry_2 = AuthRegistry()
        >>> print(registry_2[GfycatExtractor.name])
        ('KEY', 'SECRET')
        >>> registry_1[GfycatExtractor.name] = ('USERNAME', 'PASSWORD',)
        >>> print(registry_2[GfycatExtractor.name])
        ('USERNAME', 'PASSWORD')
    pop(k[, d]) → v
        Remove the specified key and return the corresponding value. If
        the key is not found, d is returned if given, otherwise a
        KeyError is raised.
qetch.content

This is the base content instance which is used to normalize hosted
media for use between extractors and downloaders.

The most important attributes of this object are the following:

    uid: The unique id that identifies the content (unique even
        between levels of quality).
    source: The url that was given to the extractor for extracting.
    fragments: A list of urls where the raw content can be retrieved
        from (a list in case the content is fragmented/segmented).
    quality: A float value between 0 and 1, 1 being the best quality
        format.
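As a rough sketch of how these attributes fit together (the values
below are hypothetical placeholders, not real gfycat data), a content
instance could be built directly from the Content constructor:

    >>> from qetch.content import Content
    >>> content = Content(
    ...     uid='gfycat-SomeId-mp4Url',
    ...     source='https://gfycat.com/SomeId',
    ...     fragments=['https://giant.gfycat.com/SomeId.mp4'],
    ...     extractor=None,  # normally the extractor that found it
    ...     extension='mp4',
    ...     quality=1.0)
    >>> print(content.quality)
    1.0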
class qetch.content.Content(uid, source, fragments, extractor, extension=None, title=None, description=None, quality=0.0, uploaded_by=None, uploaded_date=None, metadata={})
    Bases: object

    The resulting content instance yielded by extractors.
    get_size()
        Returns the sum of the lengths of the fragments.

        Returns: The sum of the lengths of the fragments.
        Return type: int
    description
        The description of the content.

        Returns: The description of the content.
        Return type: str

    extension
        The extension of the resulting content.

        Returns: The extension for the resulting content.
        Return type: str

    extractor
        The extractor which discovered the content.

        Returns: The extractor which discovered the content.
        Return type: BaseExtractor

    fragments
        A list of urls which represent the raw content.

        Returns: A list of urls which represent the raw content.
        Return type: list[str]

    metadata
        Any metadata for the current content.

        Returns: Any metadata for the current content.
        Return type: dict[str, ...]

    quality
        The contextual quality for the current content.

        Returns: The contextual quality for the current content.
        Return type: float

    source
        The given source url from which the content came.

        Returns: The given source url from which the content came.
        Return type: furl.furl

    uid
        The unique id of the discovered content.

        Returns: The unique id of the discovered content.
        Return type: str

    uploaded_by
        The uploader's name.

        Returns: The uploader's name.
        Return type: str

    uploaded_date
        The datetime the content was uploaded.

        Returns: The datetime the content was uploaded.
        Return type: datetime.datetime
qetch.extractors

Below is a list of the currently included extractors, all of which
should extend BaseExtractor.

The purpose of extractors is to take a url and yield lists of similar
content instances. This allows content with various levels of quality
to have a relationship with each other. For example, gfycat.com hosts
some media in various levels of quality and formats (mp4, webm, webp,
gif, etc.). When extracting the content for a gfycat url, an extractor
will yield a list containing a different content instance for each of
these formats and quality values. This allows the developer to choose
the desired content from the list extracted for a single resource, as
in the sketch below.
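For instance (a minimal sketch, assuming GFYCAT_URL is a url handled
by GfycatExtractor), the best available content in each yielded list
can be selected by its quality value:

    >>> from qetch.extractors import (GfycatExtractor,)
    >>> for content_list in GfycatExtractor().extract(GFYCAT_URL):
    ...     best = max(content_list, key=lambda content: content.quality)
    ...     print(best)
    <Content (1.0) "gfycat-GFYCAT_ID-mp4Url">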
BaseExtractor

class qetch.extractors._common.BaseExtractor
    Bases: abc.ABC

    The base extractor. All extractors should extend this.
    authenticate(auth)
        Handles authenticating the extractor if necessary.

        Parameters: auth (tuple[str, str]) – The authentication tuple,
            if available.
    classmethod can_handle(url)
        Determines if an extractor can handle a url.

        Parameters: url (str) – The url to check.
        Returns: True if the extractor can handle the url, otherwise
            False.
        Return type: bool
    extract(url, auth=None)
        Extracts lists of content from a url.

        Note: When an extractor can handle a url with a given
        {handle_name: regex} dictionary, the extract() method assumes
        that a method handle_{handle_name} exists to handle that
        specific url. If an appropriately named method does not exist,
        a NotImplementedError is raised.

        Parameters:
            url (str) – The url to extract content from.
            auth (tuple[str, str], optional) – The authentication
                tuple, if available.

        Raises: NotImplementedError – If a given handle_{handle_name}
            method does not exist.

        Yields: list[Content] – A list of similar content of different
            qualities.

        Examples:
            Basic usage where GFYCAT_ID is the id determined from
            GFYCAT_URL.

            >>> from qetch.extractors import (GfycatExtractor,)
            >>> for content_list in GfycatExtractor().extract(GFYCAT_URL):
            ...     for content in content_list:
            ...         print(content)
            <Content (1.0) "gfycat-GFYCAT_ID-mp4Url">
            <Content (0.5) "gfycat-GFYCAT_ID-webmUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-webpUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-mobileUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-mobilePosterUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-posterUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-thumb360Url">
            <Content (0.0) "gfycat-GFYCAT_ID-thumb360PosterUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-thumb100PosterUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-max5mbGif">
            <Content (0.0) "gfycat-GFYCAT_ID-max2mbGif">
            <Content (0.0) "gfycat-GFYCAT_ID-mjpgUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-miniUrl">
            <Content (0.0) "gfycat-GFYCAT_ID-miniPosterUrl">
            <Content (0.25) "gfycat-GFYCAT_ID-gifUrl">

        Return type: Generator[List[Any], None, None]
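To illustrate the handle naming convention (a hypothetical sketch;
MyExtractor, its domain, and its regex are placeholders and not part
of the package), an extractor declaring a handle named example must
define a matching handle_example method:

    from qetch.extractors._common import BaseExtractor

    class MyExtractor(BaseExtractor):
        name = 'example'
        description = 'A hypothetical example extractor.'
        authentication = None
        domains = ['example.com']
        handles = {
            'example': r'^https?://(?:www\.)?example\.com/(?P<id>[a-zA-Z0-9]+)/?$',
        }

        def handle_example(self, source, match):
            # Build and yield lists of Content instances for the
            # matched url; a real handler would construct Content
            # objects from the resource identified by match.group('id').
            yield []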
    classmethod get_handle(url)
        Gets the handle match for a given url.

        Parameters: url (str) – The url to get the handle match for.
        Returns: A tuple of the handle name and the match for the url.
        Return type: tuple[str, Match]
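For example (assuming GFYCAT_URL matches the basic handle shown in
the GfycatExtractor entry below), the returned tuple pairs the handle
name with the regex match for the url:

    >>> from qetch.extractors import (GfycatExtractor,)
    >>> handle, match = GfycatExtractor.get_handle(GFYCAT_URL)
    >>> print(handle)
    basic
    >>> print(match.group('id'))
    GFYCAT_ID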
    merge(ordered_filepaths)
        Handles merging downloaded fragments into a resulting file.

        Parameters: ordered_filepaths (list[str]) – The list of ordered
            filepaths to downloaded fragments.
        Returns: The resulting merged file's filepath.
        Return type: str
    session
        The default session for the extractor.

        Returns: The default session for the extractor.
        Return type: requests.Session
gfycat

class qetch.extractors.gfycat.GfycatExtractor
    Bases: qetch.extractors._common.BaseExtractor

    The extractor for links to media from gfycat.com.

    handle_basic(source, match)
        Handles basic links to gfycat media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    handle_raw(source, match)
        Handles raw links to gfycat media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    authentication = None

    description = 'Site which hosts short high-quality video for sharing.'

    domains = ['gfycat.com']

    handles = {'basic': '^https?://(?:www\\.)?gfycat\\.com/(?:gifs/detail/)?(?P<id>[a-zA-Z]+)/?$',
               'raw': '^https?://(?:[a-z]+\\.)gfycat\\.com/(?P<id>[a-zA-Z]+)(?:\\.[a-zA-Z0-9]+)$'}

    name = 'gfycat'
imgur

class qetch.extractors.imgur.ImgurExtractor
    Bases: qetch.extractors._common.BaseExtractor

    The extractor for links to media from imgur.com.

    authenticate(auth)
        Handles authenticating the extractor if necessary.

        Parameters: auth (tuple[str, str]) – The authentication tuple,
            if available.

    handle_album(source, match)
        Handles album links to imgur media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    handle_basic(source, match)
        Handles basic links to imgur media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    handle_raw(source, match)
        Handles raw links to imgur media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    authentication = ('KEY', 'SECRET')

    description = 'Dedicated image host originally built for Reddit.'

    domains = ['imgur.com', 'i.imgur.com']

    handles = {'album': '^https?://(?:www\\.)?imgur\\.com/(?:a|gallery)/(?P<id>[a-zA-Z0-9]+)/?$',
               'basic': '^https?://(?:www\\.)?imgur\\.com/(?P<id>[a-zA-Z0-9]+)/?$',
               'raw': '^https?://(?:www\\.)?(?:[a-z]\\.)imgur\\.com/(?P<id>[a-zA-Z0-9]+)\\..*$'}

    name = 'imgur'
fourchan

class qetch.extractors.fourchan.FourChanExtractor
    Bases: qetch.extractors._common.BaseExtractor

    The extractor for links to media from 4chan.org.

    handle_raw(source, match)
        Handles raw links to 4chan media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    handle_thread(source, match)
        Handles thread links to 4chan media.

        Parameters:
            source (str) – The source url.
            match (Match) – The source match regex.

        Yields: list[Content] – A list of various levels of quality
            content for the same source url.
    authentication = None

    description = 'A no-limits and lightly categorized temporary image host.'

    domains = ['4chan.org', 'i.4chan.org']

    handles = {'raw': '^https?://(?:www\\.)?i\\.4cdn\\.org/(?P<board>.*)/(?P<id>.*)\\.(?:[a-zA-Z0-9]+)$',
               'thread': '^https?://(?:www\\.)?(?:boards\\.)?4chan\\.org/(?P<board>.*)/thread/(?P<id>.*)/?.*$'}

    name = '4chan'
qetch.downloaders

Below is a list of the currently included downloaders, all of which
should extend BaseDownloader.

The purpose of downloaders is to take an extracted Content instance
and download and merge its fragments, resulting in the content being
downloaded to a given local system path.

Downloaders should be built to allow parallel fragment downloading
and multiple connections for each fragment download. For example, the
HTTPDownloader allows both max_fragments and max_connections as
parameters to the download() method. This allows max_fragments to be
processed at the same time and max_connections to be used for the
download of each of those fragments. This means that up to
(max_fragments * max_connections) connections between your IP and the
host may exist at any point during the download. It is best to keep
this product at 10 or fewer, since many hosts will flag/ban IPs using
more than 10 connections. By default, max_fragments and
max_connections are set to 1 and 8 respectively, allowing a maximum
of 8 connections from your IP to the host at any point, but only 1
fragment to be downloaded at a time.

Downloaders should also support the usage of a progress_hook, which
is sent updates on the download progress every update_delay seconds.
See the example in download() for a very simple example.
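For instance (a minimal sketch reusing the GFYCAT_URL placeholder
from the examples in this documentation), two parallel fragments at
five connections each stays within the 10-connection budget:

    >>> import os
    >>> from qetch.extractors import (GfycatExtractor,)
    >>> from qetch.downloaders import (HTTPDownloader,)
    >>> content = next(GfycatExtractor().extract(GFYCAT_URL))[0]
    >>> HTTPDownloader().download(
    ...     content,
    ...     os.path.expanduser('~/Downloads/saved_content.mp4'),
    ...     max_fragments=2,
    ...     max_connections=5)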
BaseDownloader

class qetch.downloaders._common.BaseDownloader
    Bases: abc.ABC

    The abstract base downloader. All downloaders must extend this
    class.
    download(content, to_path, max_fragments=1, max_connections=8, progress_hook=None, update_delay=0.1)
        The simplified download method.

        Note: The max_fragments and max_connections rules imply that
        potentially (max_fragments * max_connections) connections from
        the local system's IP can exist at any time. Many hosts will
        flag/ban IPs which utilize more than 10 connections for a
        single resource. For this reason, max_fragments and
        max_connections are set to 1 and 8 respectively by default.

        Parameters:
            content (Content) – The content instance to download.
            to_path (str) – The path to save the resulting download to.
            max_fragments (int, optional) – The number of fragments to
                process in parallel.
            max_connections (int, optional) – The number of connections
                to allow for downloading a single fragment.
            progress_hook (callable, optional) – A progress hook that
                accepts the arguments (download_id, current_size,
                total_size) for progress updates.
            update_delay (float, optional) – The frequency (in seconds)
                at which progress updates are sent to the given
                progress_hook.

        Returns: The downloaded file's local path.
        Return type: str

        Examples:
            Basic usage where $HOME is the home directory of the
            currently executing user.

            >>> import os
            >>> from qetch.extractors import (GfycatExtractor,)
            >>> from qetch.downloaders import (HTTPDownloader,)
            >>> content = next(GfycatExtractor().extract(GFYCAT_URL))[0]
            >>> saved_to = HTTPDownloader().download(
            ...     content,
            ...     os.path.expanduser('~/Downloads/saved_content.mp4'))
            >>> print(saved_to)
            $HOME/Downloads/saved_content.mp4

            Similar basic usage, but with a given progress hook sent
            updates every 0.1 seconds.

            >>> def progress(download_id, current, total):
            ...     print(f'{((current / total) * 100.0):6.2f}')
            >>> saved_to = HTTPDownloader().download(
            ...     content,
            ...     os.path.expanduser('~/Downloads/saved_content.mp4'),
            ...     progress_hook=progress,
            ...     update_delay=0.1)
              0.00
              0.00
             23.01
             54.32
             73.09
             90.49
             97.12
            100.00
            >>> print(saved_to)
            $HOME/Downloads/saved_content.mp4
    handle_progress(download_id, content_length, update_delay=0.1)
        The progress reporting handler.

        Parameters:
            download_id (str) – The unique id of the download request.
            content_length (int) – The total length of the content
                being downloaded.
            update_delay (float, optional) – The frequency (in seconds)
                at which progress updates are sent.
    download_state
        dict[str, DownloadState] – The download state dictionary.

    progress_store
        dict[str, int] – The downloaded content size for progress
        reporting.
class qetch.downloaders._common.DownloadState
    Bases: enum.Enum

    An enum of allowed download states.

    Values:
        STOPPED: indicates the download is stopped (an error occurred)
        RUNNING: indicates the download is running
        PREPARING: indicates the download is starting up
        FINISHED: indicates the download finished (successfully)
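As a rough sketch (assuming downloader is a BaseDownloader subclass
instance and download_id is an id previously passed to a progress
hook; both names are placeholders), the current state can be read
from the downloader's download_state mapping:

    >>> from qetch.downloaders._common import DownloadState
    >>> state = downloader.download_state[download_id]
    >>> if state is DownloadState.RUNNING:
    ...     print('download is still running')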
http

class qetch.downloaders.http.HTTPDownloader
    Bases: qetch.downloaders._common.BaseDownloader

    The downloader for HTTP served content.
    classmethod can_handle(content)
        Determines if a given content can be handled by this
        downloader.

        Parameters: content (Content) – The content to check.
        Returns: True if the content can be handled, otherwise False.
        Return type: bool
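For example (assuming content is an instance produced by one of the
extractors above and served over HTTP):

    >>> from qetch.downloaders import (HTTPDownloader,)
    >>> print(HTTPDownloader.can_handle(content))
    True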
    handle_chunk(download_id, url, to_path, start, end, chunk_size=1024)
        Handles downloading a specific range of bytes for a url.

        Parameters:
            download_id (str) – The unique id of the download request.
            url (str) – The url to download.
            to_path (str) – The local path to save the download.
            start (int) – The starting byte position to download.
            end (int) – The ending byte position to download.
            chunk_size (int, optional) – The size of the chunks to
                stream in.
    handle_download(download_id, url, to_path, max_connections=8)
        Handles downloading a specific url.

        Note: max_connections defaults to 8 because many content
        hosting sites will typically flag/ban IPs that use over 10
        connections.

        Parameters:
            download_id (str) – The unique id of the download request.
            url (str) – The url to download.
            to_path (str) – The local path to save the download.
            max_connections (int, optional) – The number of connections
                to allow for downloading the url.
    session
        The requests session to use for downloading.

        Return type: requests.Session