
Any success with fetching images! #33

Open
ayushbits opened this issue Oct 25, 2020 · 12 comments


ayushbits commented Oct 25, 2020

Dear @HurinHu,

Thanks for the package!
Fetching images is still a pain point. The images are stored as a JavaScript object in the self.content variable (shown in the screenshot below). I tried extracting the value of the variable but didn't succeed. Could you take a look?

[screenshot of self.content]

HurinHu (Member) commented Oct 25, 2020

Well, Google only loads the real image after the page has loaded, so when I fetch the original page it shows the loading placeholder rather than the real image. I am still looking for a solution.

ayushbits (Author) commented Oct 25, 2020 via email

HurinHu (Member) commented Oct 25, 2020

I know, but when you use a script to fetch the page, the JS is not executed, and the images are loaded dynamically by JS. That is what I found last time; I will check again later to see whether Google has made any changes.

ayushbits (Author):

Sure, thanks! Let us know whatever the result turns out to be.

rbshadow (Contributor):

We can get the image by using another module inside this one. Would that be a convenient approach?

HurinHu (Member) commented Nov 18, 2020

Which module? @rbshadow


rbshadow (Contributor):


newspaper3k

HurinHu (Member) commented Nov 19, 2020

Well, currently it returns the default loading image. Google loads the real image through JS, so you would need to execute the JS to get the correct URL; any fetching script without JS execution will not help. I have checked newspaper3k, and it uses the requests.get() method, which will not help either. I am not sure how you got your result; can you post some sample code?
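To illustrate the point about static fetching: the markup served to a non-JS client typically carries inline base64 placeholder images that JS later swaps for real thumbnails, so there is no usable URL to scrape. A minimal offline sketch of checking for that (the HTML snippet here is hypothetical, standing in for a statically fetched results page):

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src', ''))


# Hypothetical markup: inline data URIs instead of real image URLs,
# which is what a fetch without JS execution tends to see.
html = """
<div><img src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
<div><img src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
"""

parser = ImgSrcCollector()
parser.feed(html)
placeholders = [s for s in parser.sources if s.startswith('data:')]
print(f"{len(placeholders)} of {len(parser.sources)} img tags are placeholders")
```

If every collected src is a data: URI, the real image URLs simply are not present in the fetched HTML.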


rbshadow (Contributor):

Here I have attached the full code that I'm currently using.

Code

from GoogleNews import GoogleNews as GN
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

# newspaper's article.nlp() needs the punkt tokenizer
nltk.download('punkt')


def download_news(data_frame, news_name):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/50.0.2661.102 Safari/537.36'
    config = Config()
    config.browser_user_agent = user_agent
    rows = []
    df = data_frame
    for ind in df.index:
        row = {}  # renamed from "dict" to avoid shadowing the built-in
        try:
            if news_name == 'Google_News':
                # Fetch and parse each article page; top_image is newspaper's
                # best guess at the article's lead image.
                article = Article(df['link'][ind], config=config)
                article.download()
                article.parse()
                article.nlp()
                row['Date'] = df['date'][ind]
                row['Title'] = article.title
                row['Top_Image'] = article.top_image
                row['Link'] = df['link'][ind]
                rows.append(row)
        except Exception as e:
            print(e)

    news_df = pd.DataFrame(rows)
    news_df.to_json(news_name + '_articles.json', orient='index', indent=4)  # JSON output


def google_news(start_date, end_date, search_query):
    # Convert DD-MM-YYYY user input to the MM/DD/YYYY format GoogleNews expects
    start_date = start_date.split('-')
    start_date = start_date[1] + '/' + start_date[0] + '/' + start_date[2]
    end_date = end_date.split('-')
    end_date = end_date[1] + '/' + end_date[0] + '/' + end_date[2]

    googlenews = GN(start=start_date, end=end_date)
    googlenews.search(search_query)
    result = googlenews.result(sort=True)
    return pd.DataFrame(result)


def start():
    start_date = input('Enter start date (DD-MM-YYYY): ')
    end_date = input('Enter end date (DD-MM-YYYY): ')
    search_query = input('Enter Search Query: ')
    return start_date, end_date, search_query


if __name__ == '__main__':
    start_date, end_date, search_query = start()
    news_df = google_news(start_date, end_date, search_query)
    download_news(news_df, news_name='Google_News')

Output

[screenshot of JSON output]
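A small aside on the date handling above: the manual split-and-join conversion silently accepts malformed input. Python's datetime can both validate and convert in one step. A sketch (the helper name is mine, not part of either library):

```python
from datetime import datetime


def to_googlenews_date(date_str: str) -> str:
    """Convert DD-MM-YYYY user input to the MM/DD/YYYY format that
    GoogleNews' start/end parameters expect, raising ValueError on
    malformed or impossible dates instead of passing them through."""
    return datetime.strptime(date_str, '%d-%m-%Y').strftime('%m/%d/%Y')


print(to_googlenews_date('25-10-2020'))  # 10/25/2020
```

With this, an input like '2020-10-25' or '32-13-2020' fails fast with a ValueError rather than producing a nonsense query date.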

HurinHu (Member) commented Nov 20, 2020

Well, it is a solution, but it gets the images from each news article's own page, fetching the items one by one rather than from Google News directly. It is not an ideal approach, since it may take a long time to process ten or more web requests just to get the images. If multiple pages are requested, there can be side effects, like being blocked by the website for fetching URLs too frequently, or having to wait much longer.

If anybody has this kind of need, this method may help, but be aware: add some delay between requests, or you might easily get blocked.
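The suggested delay can be added with a small generator wrapper around the per-article loop. A sketch (a fixed pause; random jitter or a proper rate limiter would be better in production, and the links here are hypothetical):

```python
import time


def throttled(iterable, delay_seconds=2.0):
    """Yield items with a fixed pause between them, so that per-article
    requests do not hammer the target site."""
    for i, item in enumerate(iterable):
        if i > 0:
            time.sleep(delay_seconds)
        yield item


# Usage sketch: wrap the loop that fetches each article page.
links = ['https://example.com/a', 'https://example.com/b']
for link in throttled(links, delay_seconds=2.0):
    pass  # download/parse the article here
```

The first item is yielded immediately; every subsequent item waits delay_seconds, spacing the requests out.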


rbshadow (Contributor):

Yes, you are right. That's why I asked earlier. Also, the delay is important, as you mentioned.
Thanks @HurinHu for your great tool.


jacobhtye:

@HurinHu I just added some comments to the pull request you closed. Let me know if that makes any difference.
