-
-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ie/HockeyCanada] Add extractor #10002
base: master
Are you sure you want to change the base?
Conversation
|
||
|
||
class HockeyCanadaIE(InfoExtractor): | ||
_VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)' | |
_VALID_URL = r'https?://video\.hockeycanada\.ca/[a-z]{2}(?:-[a-z]{2})?/c/[\w-]+\.(?P<id>\d+)' |
'upload_date': '20211015', | ||
'tags': ['English', '2021', "National Women's Team"], | ||
'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23', | ||
'thumbnail': str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use regex for thumbnail
values, even it's something very lenient like r're:^https?://.+\.jpg'
, so that we test we're returning actual image urls
'timestamp': 1634310409, | ||
'upload_date': '20211015', | ||
'tags': ['English', '2021', "National Women's Team"], | ||
'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also if you want, you could use regex for the description
values too instead of md5 hashes. up to you
webpage = self._download_webpage(url, video_id) | ||
|
||
data_url = self._html_search_regex( | ||
r'content_api:\s*(["\'])(?P<url>.+?)\1', webpage, 'content api url', group='url') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this regex a bit more strict/defined so we don't accidentally capture a huge block of JS
media_config = traverse_obj( | ||
self._download_json(data_url, video_id), | ||
('config', {lambda x: json.loads(base64.b64decode(x).decode())})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this API call into _real_extract
media_config = traverse_obj(
self._download_json(data_url, video_id),
('config', {base64.b64decode}, {bytes.decode}, {json.loads}, {dict}))
and pass media_config
to _yield_formats
instead of data_url
for media_source in traverse_obj(media_config, ('media', 'source', ..., { | ||
'url': ('src', {url_or_none}), | ||
'type': ('type', {mimetype2ext}), | ||
})): | ||
if not (media_url := media_source.get('url')): | ||
continue | ||
media_type = media_source.get('type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing dict key traversal in the for
loop is not useful if we're assigning to variables afterwards anyways IMO. We could just filter like this:
for media_source in traverse_obj(media_config, ('media', 'source', ..., { | |
'url': ('src', {url_or_none}), | |
'type': ('type', {mimetype2ext}), | |
})): | |
if not (media_url := media_source.get('url')): | |
continue | |
media_type = media_source.get('type') | |
for media_source in traverse_obj(media_config, ('media', 'source', lambda_, v: url_or_none(v['src']))): | |
media_url = media_source['src'] | |
media_type = mimetype2ext(media_source.get('type')) |
media_type = media_source.get('type') | ||
|
||
if media_type == 'm3u8': | ||
yield from self._extract_m3u8_formats(media_url, video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be non-fatal if there possibly other formats available. And specifying format ids would be nice to differentiate hls from http etc
yield from self._extract_m3u8_formats(media_url, video_id) | |
yield from self._extract_m3u8_formats(media_url, video_id, fatal=False, m3u8_id='hls') |
yield from self._extract_m3u8_formats(media_url, video_id) | ||
elif media_type == 'mp4': | ||
fmt = { | ||
'url': media_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'url': media_url, | |
'format_id': 'http', | |
'url': media_url, |
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
Fixes #1377
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?