[ie/HockeyCanada] Add extractor #10002

pzhlkj6612 · 2024-05-22T17:40:57Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

Hockey Canada is the national governing body for grassroots hockey in the country.

from: https://www.hockeycanada.ca/en-ca/corporate/about/mandate-mission

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

bashonly · 2024-06-17T18:52:59Z

yt_dlp/extractor/hockeycanada.py

+
+
+class HockeyCanadaIE(InfoExtractor):
+    _VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)'


Suggested change

_VALID_URL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)'

_VALID_URL = r'https?://video\.hockeycanada\.ca/[a-z]{2}(?:-[a-z]{2})?/c/[\w-]+\.(?P<id>\d+)'
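A quick standard-library check of what the suggested pattern changes. This is only a sketch: the three URLs below are invented for illustration and are not taken from the PR's tests, and `video_id` is a hypothetical helper that uses `re.match`.

```python
import re

# Pattern originally submitted: dots are unescaped and only https is accepted.
ORIGINAL = r'https://video.hockeycanada.ca/(en(?:-\w+)?|fr)/c/.+\.(?P<id>\d+)'
# Pattern from the suggested change.
SUGGESTED = r'https?://video\.hockeycanada\.ca/[a-z]{2}(?:-[a-z]{2})?/c/[\w-]+\.(?P<id>\d+)'

# Hypothetical URLs, made up for this sketch.
good = 'https://video.hockeycanada.ca/en/c/some-video-title.12345'
plain_http = 'http://video.hockeycanada.ca/fr-ca/c/some-video-title.12345'
lookalike = 'https://videoXhockeycanadaXca/en/c/some-video-title.12345'


def video_id(pattern, url):
    """Return the captured id, or None when the URL does not match."""
    m = re.match(pattern, url)
    return m.group('id') if m else None


for name, pattern in (('original', ORIGINAL), ('suggested', SUGGESTED)):
    for url in (good, plain_http, lookalike):
        print(name, url, video_id(pattern, url))
```

With these inputs, the original pattern accepts the lookalike host because an unescaped `.` matches any character, and it rejects the plain `http` URL; the suggested pattern does the opposite in both cases.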

bashonly · 2024-06-17T18:54:48Z

yt_dlp/extractor/hockeycanada.py

+            'upload_date': '20211015',
+            'tags': ['English', '2021', "National Women's Team"],
+            'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23',
+            'thumbnail': str,


use regex for thumbnail values, even if it's something very lenient like r're:^https?://.+\.jpg', so that we test we're returning actual image urls
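To illustrate why a `re:` value is a stronger check than `str`, here is a simplified sketch of how such expected values could be compared. `matches_expected` is a hypothetical stand-in written for this example, not the project's actual test helper, and the sample values are made up.

```python
import hashlib
import re


def matches_expected(expected, got):
    """Simplified, hypothetical comparison of an expected test value with a result."""
    if isinstance(expected, type):
        # A bare type such as `str` only checks the type of the result.
        return isinstance(got, expected)
    if isinstance(expected, str) and expected.startswith('re:'):
        # 're:<pattern>' checks the result against a regular expression.
        return re.match(expected[len('re:'):], got) is not None
    if isinstance(expected, str) and expected.startswith('md5:'):
        # 'md5:<hash>' checks the MD5 digest of the result.
        return expected == 'md5:' + hashlib.md5(got.encode()).hexdigest()
    return expected == got


# `str` accepts any string, including one that is not an image URL.
print(matches_expected(str, 'not a url'))
# The lenient regex from the comment rejects it but accepts a .jpg URL.
print(matches_expected(r're:^https?://.+\.jpg', 'not a url'))
print(matches_expected(r're:^https?://.+\.jpg', 'https://example.com/thumb.jpg'))
```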

bashonly · 2024-06-17T18:55:50Z

yt_dlp/extractor/hockeycanada.py

+            'timestamp': 1634310409,
+            'upload_date': '20211015',
+            'tags': ['English', '2021', "National Women's Team"],
+            'description': 'md5:efb1cf6165b48cc3f5555c4262dd5b23',


also if you want, you could use regex for the description values too instead of md5 hashes. up to you

bashonly · 2024-06-17T18:57:26Z

yt_dlp/extractor/hockeycanada.py

+        webpage = self._download_webpage(url, video_id)
+
+        data_url = self._html_search_regex(
+            r'content_api:\s*(["\'])(?P<url>.+?)\1', webpage, 'content api url', group='url')


Maybe make this regex a bit more strict/defined so we don't accidentally capture a huge block of JS
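One way the lenient pattern can over-capture, and one possible tightening, sketched with the standard library. The page snippets are invented (the real page markup is not shown in this thread), and `STRICT` is only an assumption about what "more strict" could look like, not the pattern that was adopted.

```python
import re

# Pattern from the PR.
LENIENT = r'content_api:\s*(["\'])(?P<url>.+?)\1'
# Hypothetical stricter variant: the value must look like an http(s) URL
# and may not contain quotes or whitespace.
STRICT = r'content_api:\s*(["\'])(?P<url>https?://[^"\'\s]+)\1'

# Made-up page snippets for illustration.
normal = "content_api: 'https://example.com/api/1', other: 'abc'"
empty_value = "content_api: '', other: 'abc'"


def find_url(pattern, page):
    """Return the captured value, or None when nothing matches."""
    m = re.search(pattern, page)
    return m.group('url') if m else None


for page in (normal, empty_value):
    print(repr(find_url(LENIENT, page)), repr(find_url(STRICT, page)))
```

With an empty value, `.+?` has to consume at least one character, so the lenient pattern runs on to the next matching quote and returns unrelated script text; the stricter variant returns nothing instead.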

bashonly · 2024-06-17T19:01:57Z

yt_dlp/extractor/hockeycanada.py

+        media_config = traverse_obj(
+            self._download_json(data_url, video_id),
+            ('config', {lambda x: json.loads(base64.b64decode(x).decode())}))


I would move this API call into _real_extract

media_config = traverse_obj(
    self._download_json(data_url, video_id),
    ('config', {base64.b64decode}, {bytes.decode}, {json.loads}, {dict}))

and pass media_config to _yield_formats instead of data_url
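For readers unfamiliar with the chained form, here is a plain-Python approximation of what the decode steps appear to aim for: base64-decode, decode bytes to text, parse JSON, and keep the result only if it is a dict. This is a sketch under that assumption and does not reproduce how `traverse_obj` handles errors internally; `decode_config` and the sample payload are hypothetical.

```python
import base64
import json


def decode_config(response):
    """Return the decoded 'config' dict, or None if any step fails."""
    encoded = response.get('config') if isinstance(response, dict) else None
    if not isinstance(encoded, (str, bytes)):
        return None
    try:
        # base64 -> bytes -> str -> parsed JSON; the errors these steps raise
        # (binascii.Error, UnicodeDecodeError, JSONDecodeError) are ValueErrors.
        decoded = json.loads(base64.b64decode(encoded).decode())
    except ValueError:
        return None
    # Mirrors the final type check: anything other than a dict is dropped.
    return decoded if isinstance(decoded, dict) else None


# Made-up payload for illustration.
payload = {'media': {'source': [{'src': 'https://example.com/v.m3u8'}]}}
response = {'config': base64.b64encode(json.dumps(payload).encode()).decode()}
print(decode_config(response))
```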

bashonly · 2024-06-17T19:04:15Z

yt_dlp/extractor/hockeycanada.py

+        for media_source in traverse_obj(media_config, ('media', 'source', ..., {
+            'url': ('src', {url_or_none}),
+            'type': ('type', {mimetype2ext}),
+        })):
+            if not (media_url := media_source.get('url')):
+                continue
+            media_type = media_source.get('type')


Doing dict key traversal in the for loop is not useful if we're assigning to variables afterwards anyways IMO. We could just filter like this:

Suggested change

for media_source in traverse_obj(media_config, ('media', 'source', ..., {
    'url': ('src', {url_or_none}),
    'type': ('type', {mimetype2ext}),
})):
    if not (media_url := media_source.get('url')):
        continue
    media_type = media_source.get('type')

for media_source in traverse_obj(media_config, ('media', 'source', lambda _, v: url_or_none(v['src']))):
    media_url = media_source['src']
    media_type = mimetype2ext(media_source.get('type'))
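The idea of the suggestion, filtering the sources up front so the loop body only sees entries with a usable URL, can be sketched without yt-dlp. Everything here is an assumption for illustration: `looks_like_url` is a rough stand-in and not `url_or_none`, `usable_sources` is a hypothetical helper, and the config is made up.

```python
from urllib.parse import urlparse


def looks_like_url(value):
    """Hypothetical stand-in for a URL sanity check; not yt-dlp's url_or_none."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)


def usable_sources(media_config):
    """Keep only source entries that are dicts with a plausible 'src' URL."""
    sources = (media_config.get('media') or {}).get('source') or []
    return [s for s in sources if isinstance(s, dict) and looks_like_url(s.get('src'))]


# Made-up config for illustration.
media_config = {'media': {'source': [
    {'src': 'https://example.com/master.m3u8', 'type': 'application/x-mpegURL'},
    {'type': 'video/mp4'},   # no 'src' at all
    {'src': None, 'type': 'video/mp4'},
]}}

for source in usable_sources(media_config):
    # Inside the loop, 'src' is known to be present and URL-like.
    print(source['src'], source.get('type'))
```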

bashonly · 2024-06-17T19:07:26Z

yt_dlp/extractor/hockeycanada.py

+            media_type = media_source.get('type')
+
+            if media_type == 'm3u8':
+                yield from self._extract_m3u8_formats(media_url, video_id)


Should be non-fatal if there are possibly other formats available. And specifying format ids would be nice to differentiate hls from http etc

Suggested change

yield from self._extract_m3u8_formats(media_url, video_id)

yield from self._extract_m3u8_formats(media_url, video_id, fatal=False, m3u8_id='hls')

bashonly · 2024-06-17T19:08:29Z

yt_dlp/extractor/hockeycanada.py

+                yield from self._extract_m3u8_formats(media_url, video_id)
+            elif media_type == 'mp4':
+                fmt = {
+                    'url': media_url,


Suggested change

'url': media_url,

'format_id': 'http',

'url': media_url,

[ie/HockeyCanada] Add extractor

d90cf45

pzhlkj6612 marked this pull request as ready for review May 22, 2024 17:41

seproDev added the site-request Request to support a new website label May 22, 2024

pzhlkj6612 added 4 commits May 23, 2024 05:35

use the magic traverse_obj(); no int_or_none() for an integer

0c76800

merge "master"

71f5137

merge 'master'

18e57f6

merge 'master'

ed5604b

bashonly requested changes Jun 17, 2024

bashonly added the pending-fixes PR has had changes requested label Jun 17, 2024
