
Any success with fetching images! #33

Open
ayushbits opened this issue Oct 25, 2020 · 12 comments


ayushbits commented Oct 25, 2020

Dear @HurinHu,

Thanks for the package!
Fetching images is still a pain point. The images are stored as a JavaScript object in the self.content variable (shown in the screenshot below). I tried extracting the value of the variable but didn't succeed. Could you take a look?

[screenshot of self.content]

HurinHu (Member) commented Oct 25, 2020

Well, Google only loads the real image after the page has loaded, so when I fetch the original page it shows the loading placeholder rather than the real image. I am still looking for a solution.

ayushbits (Author) commented Oct 25, 2020 via email

HurinHu (Member) commented Oct 25, 2020

I know, but when you use a script to fetch the page, the JS is not executed, and the images are loaded dynamically by JS. That is what I found last time; I will check again later to see whether Google has made any changes.

ayushbits (Author):

Sure, thanks! Let us know whatever the result turns out to be.

rbshadow (Contributor):

We can get the image by using another module inside this one. Would that be a convenient approach?

HurinHu (Member) commented Nov 18, 2020

Which module? @rbshadow


rbshadow (Contributor):


newspaper3k

HurinHu (Member) commented Nov 19, 2020

Well, currently it returns the default loading image. Google loads the real image through JS, so you would need to execute the JS to get the correct URL; any fetching script without JS execution will not help. I have checked newspaper3k, and it uses the requests.get() method, which will not help either. I am not sure how you got your result; can you post some sample code?
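To illustrate the point about static fetching: the markup served to a non-JS client typically carries inline base64 placeholder images that JS later swaps for real thumbnails, so there is no usable URL to scrape. A minimal offline sketch of checking for that (the HTML snippet here is hypothetical, standing in for a statically fetched results page):

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src', ''))


# Hypothetical markup: inline data URIs instead of real image URLs,
# which is what a fetch without JS execution tends to see.
html = """
<div><img src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
<div><img src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
"""

parser = ImgSrcCollector()
parser.feed(html)
placeholders = [s for s in parser.sources if s.startswith('data:')]
print(f"{len(placeholders)} of {len(parser.sources)} img tags are placeholders")
```

If every collected src is a data: URI, the real image URLs simply are not present in the fetched HTML.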


rbshadow (Contributor):

Here I have attached the full code that I'm currently using.

Code

from GoogleNews import GoogleNews as GN
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

# newspaper's article.nlp() needs the punkt tokenizer
nltk.download('punkt')


def download_news(data_frame, news_name):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/50.0.2661.102 Safari/537.36'
    config = Config()
    config.browser_user_agent = user_agent
    rows = []
    df = data_frame
    for ind in df.index:
        row = {}  # renamed from "dict" to avoid shadowing the built-in
        try:
            if news_name == 'Google_News':
                # Fetch and parse each article page; top_image is newspaper's
                # best guess at the article's lead image.
                article = Article(df['link'][ind], config=config)
                article.download()
                article.parse()
                article.nlp()
                row['Date'] = df['date'][ind]
                row['Title'] = article.title
                row['Top_Image'] = article.top_image
                row['Link'] = df['link'][ind]
                rows.append(row)
        except Exception as e:
            print(e)

    news_df = pd.DataFrame(rows)
    news_df.to_json(news_name + '_articles.json', orient='index', indent=4)  # JSON output


def google_news(start_date, end_date, search_query):
    # Convert DD-MM-YYYY user input to the MM/DD/YYYY format GoogleNews expects
    start_date = start_date.split('-')
    start_date = start_date[1] + '/' + start_date[0] + '/' + start_date[2]
    end_date = end_date.split('-')
    end_date = end_date[1] + '/' + end_date[0] + '/' + end_date[2]

    googlenews = GN(start=start_date, end=end_date)
    googlenews.search(search_query)
    result = googlenews.result(sort=True)
    return pd.DataFrame(result)


def start():
    start_date = input('Enter start date (DD-MM-YYYY): ')
    end_date = input('Enter end date (DD-MM-YYYY): ')
    search_query = input('Enter Search Query: ')
    return start_date, end_date, search_query


if __name__ == '__main__':
    start_date, end_date, search_query = start()
    news_df = google_news(start_date, end_date, search_query)
    download_news(news_df, news_name='Google_News')

Output

[screenshot of JSON output]
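A small aside on the date handling above: the manual split-and-join conversion silently accepts malformed input. Python's datetime can both validate and convert in one step. A sketch (the helper name is mine, not part of either library):

```python
from datetime import datetime


def to_googlenews_date(date_str: str) -> str:
    """Convert DD-MM-YYYY user input to the MM/DD/YYYY format that
    GoogleNews' start/end parameters expect, raising ValueError on
    malformed or impossible dates instead of passing them through."""
    return datetime.strptime(date_str, '%d-%m-%Y').strftime('%m/%d/%Y')


print(to_googlenews_date('25-10-2020'))  # 10/25/2020
```

With this, an input like '2020-10-25' or '32-13-2020' fails fast with a ValueError rather than producing a nonsense query date.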

HurinHu (Member) commented Nov 20, 2020

Well, it is a solution, but it gets the images from each news article's own page, fetching the items one by one rather than from Google News directly. It is not an ideal approach, since it may take a long time to process ten or more web requests just to get the images. If multiple pages are requested, there can be side effects, like being blocked by the website for fetching URLs too frequently, or having to wait much longer.

If anybody has this kind of need, this method may help, but be aware: add some delay between requests, or you might easily get blocked.
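The suggested delay can be added with a small generator wrapper around the per-article loop. A sketch (a fixed pause; random jitter or a proper rate limiter would be better in production, and the links here are hypothetical):

```python
import time


def throttled(iterable, delay_seconds=2.0):
    """Yield items with a fixed pause between them, so that per-article
    requests do not hammer the target site."""
    for i, item in enumerate(iterable):
        if i > 0:
            time.sleep(delay_seconds)
        yield item


# Usage sketch: wrap the loop that fetches each article page.
links = ['https://example.com/a', 'https://example.com/b']
for link in throttled(links, delay_seconds=2.0):
    pass  # download/parse the article here
```

The first item is yielded immediately; every subsequent item waits delay_seconds, spacing the requests out.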


rbshadow (Contributor):

Yes, you are right. That's why I asked earlier. Also, the delay is important, as you mentioned.
Thanks @HurinHu for your great tool.


jacobhtye:

@HurinHu I just added some comments to the pull request you closed. Let me know if that makes any difference.
