Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ie/HockeyCanada] Add extractor #10002

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

pzhlkj6612
Copy link
Contributor

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Fixes #1377

Hockey Canada is the national governing body for grassroots hockey in the country.

from: https://www.hockeycanada.ca/en-ca/corporate/about/mandate-mission

Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

@pzhlkj6612 pzhlkj6612 marked this pull request as ready for review May 22, 2024 17:41
@seproDev seproDev added the site-request Request to support a new website label May 22, 2024


class HockeyCanadaIE(InfoExtractor):
_VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)'
_VALID_URL = r'https?://video\.hockeycanada\.ca/[a-z]{2}(?:-[a-z]{2})?/c/[\w-]+\.(?P<id>\d+)'

'upload_date': '20211015',
'tags': ['English', '2021', "National Women's Team"],
'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23',
'thumbnail': str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use regex for thumbnail values, even it's something very lenient like r're:^https?://.+\.jpg', so that we test we're returning actual image urls

'timestamp': 1634310409,
'upload_date': '20211015',
'tags': ['English', '2021', "National Women's Team"],
'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if you want, you could use regex for the description values too instead of md5 hashes. up to you

webpage = self._download_webpage(url, video_id)

data_url = self._html_search_regex(
r'content_api:\s*(["\'])(?P<url>.+?)\1', webpage, 'content api url', group='url')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make this regex a bit more strict/defined so we don't accidentally capture a huge block of JS

Comment on lines +78 to +80
media_config = traverse_obj(
self._download_json(data_url, video_id),
('config', {lambda x: json.loads(base64.b64decode(x).decode())}))
Copy link
Member

@bashonly bashonly Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this API call into _real_extract

        media_config = traverse_obj(
            self._download_json(data_url, video_id),
            ('config', {base64.b64decode}, {bytes.decode}, {json.loads}, {dict}))

and pass media_config to _yield_formats instead of data_url

Comment on lines +82 to +88
for media_source in traverse_obj(media_config, ('media', 'source', ..., {
'url': ('src', {url_or_none}),
'type': ('type', {mimetype2ext}),
})):
if not (media_url := media_source.get('url')):
continue
media_type = media_source.get('type')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing dict key traversal in the for loop is not useful if we're assigning to variables afterwards anyways IMO. We could just filter like this:

Suggested change
for media_source in traverse_obj(media_config, ('media', 'source', ..., {
'url': ('src', {url_or_none}),
'type': ('type', {mimetype2ext}),
})):
if not (media_url := media_source.get('url')):
continue
media_type = media_source.get('type')
for media_source in traverse_obj(media_config, ('media', 'source', lambda_, v: url_or_none(v['src']))):
media_url = media_source['src']
media_type = mimetype2ext(media_source.get('type'))

media_type = media_source.get('type')

if media_type == 'm3u8':
yield from self._extract_m3u8_formats(media_url, video_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be non-fatal if there possibly other formats available. And specifying format ids would be nice to differentiate hls from http etc

Suggested change
yield from self._extract_m3u8_formats(media_url, video_id)
yield from self._extract_m3u8_formats(media_url, video_id, fatal=False, m3u8_id='hls')

yield from self._extract_m3u8_formats(media_url, video_id)
elif media_type == 'mp4':
fmt = {
'url': media_url,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'url': media_url,
'format_id': 'http',
'url': media_url,

@bashonly bashonly added the pending-fixes PR has had changes requested label Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending-fixes PR has had changes requested site-request Request to support a new website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hockey Canada
3 participants