
Build a stage 1 worker using search engine results #7

Open
rom1504 opened this issue Oct 30, 2021 · 7 comments

Comments

@rom1504

rom1504 commented Oct 30, 2021

Let's fill in all the details on this idea by @christophschuhmann

For example:

  • Which search engine?
  • What is the input data frame (columns, size, format)?
  • What is the output (probably image link and caption in parquet files)?
  • What kind of processing needs to be done?
  • Do we foresee any limitations from the search engines?
  • Have any experiments already been done, and what were the first findings?
@rvencu
Member

rvencu commented Oct 30, 2021

We need to use the 3-stage workflow. The output will be sent to the Postgres database via the SQLAlchemy engine. We should use the CAH format for that: https://github.com/rvencu/crawlingathome-gpu-hcloud/blob/43eec102d3c4f08145a7704d4c65648619677768/ccpp.py#L375

The issue I have is that while we can use private workers, crowdsourced workers would expose the DB credentials, and I still have no idea how to curate the output before allowing it to be saved to the database. At this point I can only operate our swarm of private workers.

While testing we can use a test table instead of the production one.
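
As a rough illustration (not the actual ccpp.py code), a stage 1 worker could push its rows to such a test table with a plain SQLAlchemy engine; the connection string and the dataset_test table name below are placeholders:

# Minimal sketch, assuming a Postgres test table with the same columns as the
# production table; the connection string and table name are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://cah:PASSWORD@db-host:5432/cah")

rows = [
    {"sampleid": 1, "url": "https://example.com/a.jpg", "text": "a red bicycle",
     "hash": "0" * 32, "url_hash": "0" * 32},
]

insert_stmt = text(
    "INSERT INTO dataset_test (sampleid, url, text, hash, url_hash) "
    "VALUES (:sampleid, :url, :text, :hash, :url_hash)"
)

with engine.begin() as conn:  # commits on success, rolls back on error
    conn.execute(insert_stmt, rows)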

@rvencu
Member

rvencu commented Oct 30, 2021

Table structure is:

create table dataset
(
    sampleid bigint      not null
        constraint dataset_pk
            primary key,
    url      text        not null,
    text     text        not null,
    license  varchar(80),
    domain   varchar(60),
    wat      integer,
    status   smallint default 0,
    illegal  boolean  default false,
    hash     varchar(32) not null,
    modified timestamp,
    url_hash varchar(32) not null
);

alter table dataset
    owner to cah;

create index dataset_status_index
    on dataset (status);

create unique index dataset_url_hash_uindex
    on dataset (url_hash);

create trigger update_customer_modtime
    before update
    on dataset
    for each row
execute procedure update_modified_column();

The trigger just updates the modified timestamp whenever a row is updated.
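
For illustration, a worker row could be mapped onto this schema roughly as in the sketch below; using md5 for the two varchar(32) hash columns and the URL netloc for domain are assumptions on my side, and the unique index on url_hash can then be used for deduplication at insert time:

# Sketch only: build one row matching the table above. md5 for hash/url_hash and
# netloc for domain are assumptions (the varchar(32) columns suggest md5 hex digests).
import hashlib
from urllib.parse import urlparse

def make_row(sampleid, url, caption, license=None, wat=None):
    return {
        "sampleid": sampleid,
        "url": url,
        "text": caption,
        "license": license,
        "domain": urlparse(url).netloc[:60],  # fits varchar(60)
        "wat": wat,
        "hash": hashlib.md5((url + caption).encode("utf-8")).hexdigest(),
        "url_hash": hashlib.md5(url.encode("utf-8")).hexdigest(),
        # status defaults to 0, illegal to false, modified is set by the trigger
    }

# Deduplicate against the unique url_hash index when inserting, e.g.:
#   INSERT INTO dataset (...) VALUES (...) ON CONFLICT (url_hash) DO NOTHING;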

@christophschuhmann

Update from the Bing image search query tests: at first I got about 300 image-text pairs per second with my Colab code. Then, after a few hundred thousand samples, the IP gets blocked and the rate drops to ~10 samples per second. Still not bad, but maybe using Tor could be a good idea. I need to do some tests again.

@christophschuhmann

Here's the general plan:

  • We query several image search engines at the same time, e.g. Bing, Google, Yandex, DuckDuckGo, ..., for prepared queries using small droplets (stage 1).

  • The queries are distributed by a tracker to the stage 1 workers. Each time, a stage 1 worker gets enough queries for ~1 h of work.

  • The output of stage 1 is a list of image URL / text pairs.

  • Stage 1 workers use multiprocessing to create processes, each of which opens a connection to the Tor network, e.g. using Torpy. Each process receives a list of queries and sends them over this connection to the search engines, pausing between consecutive queries to the same engine to avoid getting banned quickly. The Tor connection should stay open for several requests, not just one, because opening a new Tor connection takes several seconds (see the sketch after this list).

  • It may work for a while without Tor, but we should try to change IPs using Tor, to avoid complaints being sent to the droplet providers we're using.

  • The queries are built from:

  1. All combinations of (English adjective from NLTK) + (English verb/noun from NLTK)
  2. All entities that have Wikipedia entries (organisations, concepts, places, ...)
  3. All celebrities listed on IMDb, each combined with one of a set of emotional adjectives

(Additionally, we could take all named entities mentioned at least x times in Wikipedia, The Pile, ...)
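
A rough sketch of what one stage 1 worker process could look like under this plan, assuming Torpy's requests helper; the search URL, pause length and parse_results() are placeholders rather than a working scraper:

# Rough sketch of one stage 1 worker process: keep a Tor circuit open for many
# queries and pause between requests to the same engine. The search URL,
# the pause length and parse_results() are placeholders.
import time
from multiprocessing import Pool

from torpy.http.requests import TorRequests  # pure-python Tor client

PAUSE_SECONDS = 5  # placeholder politeness delay per engine

def parse_results(html):
    # placeholder: extract (image_url, caption) pairs from the result page
    return []

def run_queries(queries):
    pairs = []
    with TorRequests() as tor_requests:
        # reuse the same circuit for many requests; building one takes seconds
        with tor_requests.get_session() as session:
            for query in queries:
                resp = session.get("https://www.bing.com/images/search",
                                   params={"q": query})
                pairs.extend(parse_results(resp.text))
                time.sleep(PAUSE_SECONDS)
    return pairs

if __name__ == "__main__":
    # the tracker would hand out ~1 h worth of queries; these are dummy batches
    batches = [["red bicycle", "old lighthouse"], ["sleeping cat", "mountain lake"]]
    with Pool(processes=len(batches)) as pool:
        results = pool.map(run_queries, batches)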

@rom1504
Author

rom1504 commented Oct 30, 2021

I expect that if this starts working at scale, search engines will actively work on banning us and will succeed.
I am wondering whether a crawling approach wouldn't be better (and/or some kind of partnership with an existing crawling organization).

@christophschuhmann

Let's try Tor.

@TheoCoombes
Member

Happy to help out on the tracker side of this :)

Sounds very promising
