
Intermediate URL redirect #140

Open
nmandic78 opened this issue Feb 16, 2024 · 5 comments

Comments

@nmandic78

It looks like Google now only provides an intermediate URL that redirects to the real news site URL:
'news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8yMDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVycy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen'

I tried to resolve the redirected URL with requests, but it seems Google uses JavaScript for the final hop, so that won't do; I only get to the consent page. I don't know how to tackle it without Selenium or similar, and that is overhead I don't want for my project.
If someone has a solution or a pointer in the right direction, I would be grateful.
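
For reference, a minimal sketch of what I tried (using the example URL above); requests follows the HTTP redirects but never reaches the publisher:

import requests

url = "https://news.google.com/articles/CBMiU2h0dHBzOi8vd3d3LnRoZXZlcmdlLmNvbS8yMDI0LzIvMTQvMjQwNzI3OTIvYXBwbGUtdmlzaW9uLXByby1lYXJseS1hZG9wdGVycy1yZXR1cm5z0gEA?hl=en-US&gl=US&ceid=US%3Aen"

# requests follows plain HTTP redirects, but Google performs the final hop
# with JavaScript, so resp.url stays on a Google (consent) page
resp = requests.get(url, allow_redirects=True)
print(resp.url)  # not the article URL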

@talhaanwarch

Try this:

urls = googlenews.get_links()

After getting the news URLs, you have to resolve them one by one. Here is an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_final_url(initial_url):
    # Configure Chrome options for headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    # Set up Chrome WebDriver
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=chrome_options)

    try:
        # Open the initial URL
        driver.get(initial_url)

        # Wait until an <article> element is visible, indicating that the
        # JavaScript redirect has completed and the publisher page has loaded
        wait = WebDriverWait(driver, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article")))
        final_url = driver.current_url
        return final_url
    except TimeoutException:
        print("Timed out waiting for page to load")
        return None
    finally:
        # Close the WebDriver session
        driver.quit()

# Example usage:
initial_url = f"https://{urls[0]}"
final_url = get_final_url(initial_url)
if final_url:
    print("Final URL after content loaded:", final_url)

@nmandic78
Author

@talhaanwarch, thank you. As I said, Selenium is overkill for my use case, so I dropped this library and solved what I needed with the Bing Search API. Anyway, thanks again, and maybe somebody will find your snippet useful. Regards.

@deanm0000

deanm0000 commented Apr 3, 2024

This is much simpler than it seems; you don't even need BeautifulSoup.

Use this:

def get_link_url(txt):
    # The interstitial page shows "Opening <a href="...">" as a fallback
    # when the redirect is slow; grab the href that follows "Opening".
    i = txt.find("Opening")
    j = txt.find("a href=", i)
    k = txt.find('"', j + 8)  # j + 8 skips past 'a href="' to the URL start
    return txt[j + 8 : k]

You still have to GET the intermediate URL, but if you do:

import requests

resp = requests.get(intermediate_url)
real_link = get_link_url(resp.text)

It relies on a bit of markup in the intermediate page that you're supposed to see if the redirect doesn't fire fast enough, the part that says it's "Opening" the article. You just use a normal Python find to look for that text, then find where the URL begins immediately after it, then find where the URL ends and extract it. Poof, no Selenium (or even bs4) required.
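
A slightly hardened sketch of the same idea, returning None if the markers aren't found (the get_link_url_safe name is just for illustration; it assumes the interstitial keeps the 'Opening <a href="...' markup):

import requests

def get_link_url_safe(txt):
    # Same "Opening" trick, but bail out gracefully if the markup changes.
    i = txt.find("Opening")
    if i == -1:
        return None
    j = txt.find('a href="', i)
    if j == -1:
        return None
    start = j + len('a href="')
    end = txt.find('"', start)
    return txt[start:end] if end != -1 else None

resp = requests.get(intermediate_url)  # intermediate_url as above
real_link = get_link_url_safe(resp.text)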

@HurinHu
Member

HurinHu commented Apr 3, 2024

Be aware that sending too many requests to Google may trigger 429 errors. Each link is sent to Google first, and only then do you get the actual link.
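
If you do hit 429s, a simple backoff loop usually helps; a minimal sketch (the retry count and delays are arbitrary choices, not anything this library does):

import time
import requests

def get_with_backoff(url, retries=3, base_delay=2.0):
    # Retry on HTTP 429 with exponential backoff between attempts
    resp = None
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            break
        time.sleep(base_delay * (2 ** attempt))
    return resp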

@deanm0000

deanm0000 commented Apr 4, 2024

I ended up really wanting async support, so I wrote my own version that skips the intermediate URL altogether. Having done this DIY, I'm not actually sure where the intermediate URL comes from, as the real URL is right there in the search results. It's not very pretty or typed, so it's not ready to be its own repo, but if somebody wants to clean it up and incorporate it here or publish it elsewhere, then please do:

import httpx
from bs4 import BeautifulSoup
from headers import HEADERS
from urllib.parse import quote_plus


async def search_news(search_terms, date_range=None):
    params = dict(q=quote_plus(search_terms), tbm="nws")
    if date_range is not None and isinstance(date_range, (list, tuple)):
        start_date = date_range[0].strftime("%m/%d/%Y")
        end_date = date_range[1].strftime("%m/%d/%Y")
        params["tbs"] = quote_plus(f"cdf:1,cd_min:{start_date},cd_max:{end_date}")
    # Run the request through an async client; close it when done
    async with httpx.AsyncClient(http2=True, headers=HEADERS) as dlclient:
        resp = await dlclient.get(
            "https://www.google.com/search",
            params=params,
        )
    # BeautifulSoup needs the response body, not the response object
    rbs = BeautifulSoup(resp.text, features="lxml")
    links = [
        x
        for x in rbs.find_all("a")
        if "href" in x.attrs
        and "https" in x.attrs["href"]
        and "google" not in x.attrs["href"]
    ]

    pages = []
    for link in links:
        url = link.attrs["href"]
        # Extract the embedded https URL and drop trailing query parameters
        url_begin = url.find("https")
        url = url[url_begin:].split("&")[0]
        # Text nodes whose parent is a div hold the title and snippet text
        misc = link.find_all(string=True)
        misc = [x for x in misc if x.parent.name == "div"]
        pages.append({"url": url, "title": misc[0], "misc": misc[1:]})
    return pages

It assumes you have a file headers.py with a dict of browser headers in a variable called HEADERS. Google doesn't actually seem to mind if you don't use browser headers, so it's probably superfluous.
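
Calling it would look roughly like this (a sketch; the search terms and dates are just placeholders):

import asyncio
from datetime import date

async def main():
    pages = await search_news(
        "apple vision pro returns",
        date_range=(date(2024, 2, 1), date(2024, 2, 16)),
    )
    for page in pages:
        print(page["title"], "->", page["url"])

asyncio.run(main())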
