CBC News Videos Extractor Not Working: "Unable to download XML: HTTP Error 404: Not Found" #10170

Open
11 tasks done
LifesGottaBeFun opened this issue Jun 13, 2024 · 2 comments
Labels
site-bug (Issue with a specific website), triage (Untriaged issue)

Comments

@LifesGottaBeFun

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

Region

Non-Geoblocked

Provide a description that is worded well enough to be understood

I tried to download this video: https://www.cbc.ca/player/play/video/9.6420651

However, it failed and gave me the "Unable to download XML: HTTP Error 404: Not Found" error.

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['https://www.cbc.ca/player/play/video/9.6420651', '-o', 'D:\\Downloaded Audio-Video Tracks\\ViaYouTubeDL\\cbc.ca\\Custom\\%(title)s-%(id)s.%(ext)s', '-o', 'D:/EdmontonAirMonitoring.mp4', '-vU']
[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out cp1252 (No VT), error cp1252 (No VT), screen cp1252 (No VT)
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [12b248ce6] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: ffmpeg 6.0-essentials_build-www.gyan.dev (setts), ffprobe 6.0-essentials_build-www.gyan.dev
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, curl_cffi-0.5.10, mutagen-1.47.0, requests-2.32.2, sqlite3-3.35.5, urllib3-2.2.1, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets, curl_cffi
[debug] Loaded 1820 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: [email protected] from yt-dlp/yt-dlp
yt-dlp is up to date ([email protected] from yt-dlp/yt-dlp)
[cbc.ca:player] Extracting URL: https://www.cbc.ca/player/play/video/9.6420651
[cbc.ca:player] 9.6420651: Downloading webpage
[ThePlatform] Extracting URL: http://link.theplatform.com/s/ExhSPC/media/guid/2655402169/None?mbr=true&formats=MPEG4,FLV,MP3#__youtubedl_smuggle=%7B%22force_smil_url%22%3A+true%7D
[ThePlatform] None: Downloading SMIL data
[ThePlatform] None: Unable to download XML: HTTP Error 404: Not Found (caused by <HTTPError 404: Not Found>)
  File "yt_dlp\extractor\common.py", line 734, in extract
  File "yt_dlp\extractor\theplatform.py", line 313, in _real_extract
  File "yt_dlp\extractor\theplatform.py", line 34, in _extract_theplatform_smil
  File "yt_dlp\extractor\common.py", line 1133, in download_content
  File "yt_dlp\extractor\common.py", line 1093, in download_handle
  File "yt_dlp\extractor\adobepass.py", line 1366, in _download_webpage_handle
  File "yt_dlp\extractor\common.py", line 954, in _download_webpage_handle
  File "yt_dlp\extractor\common.py", line 903, in _request_webpage
  File "yt_dlp\extractor\common.py", line 890, in _request_webpage
  File "yt_dlp\YoutubeDL.py", line 4142, in urlopen
  File "yt_dlp\networking\common.py", line 117, in send
  File "yt_dlp\networking\_helper.py", line 208, in wrapper
  File "yt_dlp\networking\common.py", line 337, in send
  File "yt_dlp\networking\_requests.py", line 366, in _send
yt_dlp.networking.exceptions.HTTPError: HTTP Error 404: Not Found
An error occured
LifesGottaBeFun added the site-bug (Issue with a specific website) and triage (Untriaged issue) labels on Jun 13, 2024
@trainman261
Contributor

trainman261 commented Jun 17, 2024

I've noticed the same problem. It seems like #9534 was a precursor to this. As far as I can tell, the page no longer has a MediaID key, which is what was being used to get the video files from ThePlatform. Looking through how the site works now, I can't find any reference to ThePlatform anymore (although I am a bit of a noob at this, so feel free to tell me I'm wrong).
What definitely works (I tried it manually) is:

  • Load the webpage and find the JSON that follows <script id="initialStateDom">window.__INITIAL_STATE__ =, as we already do
  • In that extracted JSON, look under video for all the info we need (including metadata). Most importantly for the video, we need video/currentClip/media/assets/key, which links to another block of JSON
  • In that second block of JSON, grab the url key, which links to the master m3u8 file containing all the info needed. I've tried feeding that URL to yt-dlp and it works with the generic extractor. Note that the link expires fairly quickly (after no more than a few minutes, IIRC)
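
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the yt-dlp extractor code: the function names are hypothetical, the sketch assumes that video/currentClip/media/assets/key is reached through plain nested dictionaries, and the example URLs are placeholders. Only the `window.__INITIAL_STATE__` marker, the key path, and the `url` key come from the description above.

```python
import json
import re
import urllib.request

# Marker that precedes the embedded JSON on the player page (from the comment above).
INITIAL_STATE_RE = re.compile(r'window\.__INITIAL_STATE__\s*=\s*')


def extract_initial_state(webpage: str) -> dict:
    """Step 1: parse the JSON object that follows window.__INITIAL_STATE__ =."""
    match = INITIAL_STATE_RE.search(webpage)
    if match is None:
        raise ValueError('window.__INITIAL_STATE__ not found in page')
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing ";</script>").
    state, _end = json.JSONDecoder().raw_decode(webpage, match.end())
    return state


def asset_info_url(state: dict) -> str:
    """Step 2: follow video/currentClip/media/assets/key.

    Assumption: every level of the path is a dict. Adjust if the real
    structure differs (e.g. if assets turns out to be a list).
    """
    return state['video']['currentClip']['media']['assets']['key']


def master_playlist_url(asset_info: dict) -> str:
    """Step 3: the second JSON block carries the master m3u8 link under url."""
    return asset_info['url']


def fetch_json(url: str) -> dict:
    """Hypothetical helper: download and decode a JSON document."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

In use, the page HTML would go through `extract_initial_state`, the result of `asset_info_url` would be fetched with `fetch_json`, and `master_playlist_url` would give the link to hand to yt-dlp. Because that link expires within minutes, it should be resolved right before downloading.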

By analyzing the traffic, I've also found that the complete TS, MP4, and VTT files are directly accessible (for the MP4s, after some messing around with the URLs pulled). In the meantime, I've found that the direct link to the VTT file can be extracted from the first block of JSON, but I'm still looking for a solid pattern for the TS and MP4 files.

The first option is the most straightforward, but it works via HLS and tends to download ~30 files per minute of video (~45 if you add subtitles), meaning ~2000 files for a 45-minute video with subtitles. The second option would be a nice addition, but it is somewhat more complex.
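
As a quick check of that estimate, here is a small sketch using the approximate per-minute figures quoted above (the function name is hypothetical, and the rates are the rough observations from this comment, not measured constants):

```python
def estimated_hls_files(minutes: int, with_subtitles: bool = False) -> int:
    """Rough count of files fetched for an HLS download of the given length."""
    # ~30 files per minute of video, ~45 when subtitles are included.
    files_per_minute = 45 if with_subtitles else 30
    return minutes * files_per_minute


print(estimated_hls_files(45, with_subtitles=True))  # 2025, i.e. roughly 2000
```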

I'll try to convert the first option into code within the coming week, but if someone else gets around to it sooner, feel free to go ahead.

@trainman261
Contributor

Update: I've gotten around to implementing a rudimentary solution and pushed it to a branch on my dev fork. It works on my end, so if someone needs a stopgap, feel free to use it until I polish it up and submit a PR.
