This repository extracts the shop directories of major Hong Kong malls by web scraping. The data is intended for subsequent analysis.
When working in the mall leasing department of a sizable property company in Hong Kong, it is useful to monitor the merchant leasing situation of competitor malls on an ongoing basis. To make this monitoring timely, this repository aims to provide a web scraping pipeline that makes it easy to reproduce and export the shop directory data.
Run "export_data.ipynb" to export the malls' shop directories into the "data" folder. By default, all malls are extracted in one go. To extract only particular malls, amend the "mall" variable so that it includes only the desired mall(s).
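As a rough illustration of restricting the export to a few malls, the sketch below assumes that "mall" is a list of mall names. The helper `select_malls` and the `all_malls` list are hypothetical and are not taken from the notebook; the actual variable in "export_data.ipynb" may be structured differently.

```python
# Hypothetical sketch: the real notebook may define `mall` differently.
# A small subset of the supported malls, used here only for illustration.
all_malls = ["CityGate", "Elements", "HarbourCity", "IFC"]


def select_malls(requested, available):
    """Return the requested malls, rejecting any name without a scraper."""
    unknown = [m for m in requested if m not in available]
    if unknown:
        raise ValueError(f"No scraper available for: {', '.join(unknown)}")
    return list(requested)


# Only these malls would be passed on to the export step.
mall = select_malls(["Elements", "IFC"], all_malls)
print(mall)
```

Validating the names up front makes a misspelled mall fail immediately rather than partway through a long scraping run.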
.
├── README.md
├── LICENSE.txt
├── data # Exported web scraping data
│ └── (Malls folder)
├── webscraper # Web scraper script
│ └── (Malls folder)
└── export_data.ipynb
The web scraping method depends on the website design. If the website is not JavaScript-based, BeautifulSoup is the main tool used in the web scraper. Since the shop list page usually does not contain the shop details, data from the shop list page and the shop detail pages are scraped separately and then combined into the shop master data. For JavaScript-based websites, the website's own API is called instead.
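The combine step described above can be sketched as follows. This is a minimal illustration using plain dictionaries, not the repository's actual code: the function name `combine_shop_data` and the sample records are made up, and it assumes that the list page and the detail pages share a common `shop_id`.

```python
def combine_shop_data(shop_list, shop_details):
    """Join detail-page records onto list-page records by shop_id.

    Every shop on the list page is kept; detail fields are added
    when a matching detail record exists.
    """
    details_by_id = {d["shop_id"]: d for d in shop_details}
    master = []
    for shop in shop_list:
        detail = details_by_id.get(shop["shop_id"], {})
        # Detail-page fields supplement (and override) list-page fields.
        master.append({**shop, **detail})
    return master


# Made-up records standing in for parsed list and detail pages.
shop_list = [
    {"shop_id": "1", "shop_name_en": "Shop A", "shop_floor": "L1"},
    {"shop_id": "2", "shop_name_en": "Shop B", "shop_floor": "L2"},
]
shop_details = [
    {"shop_id": "1", "phone": "0000 0000", "opening_hours": "10:00-22:00"},
]

master = combine_shop_data(shop_list, shop_details)
for row in master:
    print(row)
```

Keeping every list-page shop, even when its detail page could not be scraped, means a failed detail request leaves gaps in individual fields rather than dropping the shop from the output.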
In this project, two main functions (getShopCategory, getShopMaster) extract the shop categories and the shop master data from each mall. The data fields are standardized across all malls. If the corresponding data cannot be extracted from the website, the field is set to NULL.
The exported data sets are defined below:
Shop Category
Field name | Data type | Description |
---|---|---|
mall | String | Name of the mall |
type | String | Type of the shop (either Shopping or Dining) |
shop_category_id | String | A unique identifier of the shop category assigned by mall |
shop_category_name | String | Name of the shop category |
update_date | Date | Date of web scraping |
Shop Master
Field name | Data type | Description |
---|---|---|
mall | String | Name of the mall |
type | String | Type of the shop (either Shopping or Dining) |
shop_id | String | A system-generated unique identifier of the shop assigned by the mall |
shop_name_en | String | Name of the shop in English |
shop_name_tc | String | Name of the shop in Traditional Chinese |
shop_number | String | A unique identifier of the shop assigned by mall and usually used to indicate the location of the shop |
shop_floor | String | Floor of the mall on which the shop is located |
phone | String | Contact phone number of the shop |
opening_hours | String | Opening hours of the shop |
loyalty_offer | String | Name of the mall loyalty offer available at the shop |
voucher_acceptance | Boolean | Flag indicating whether the shop accepts mall vouchers |
shop_category_id | String | A unique identifier of the shop category assigned by mall |
shop_category_name | String | Name of the shop category |
tag | String | Other additional tagging added to the shop by mall |
update_date | Date | Date of web scraping |
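
One way to apply the standardized Shop Master fields, with missing values left empty, is sketched below. The field names come from the table above; the helper `standardize_record` and the sample input are hypothetical, and Python's `None` stands in for NULL. The repository's getShopMaster may achieve this differently.

```python
from datetime import date

# Field names as listed in the Shop Master table.
SHOP_MASTER_FIELDS = [
    "mall", "type", "shop_id", "shop_name_en", "shop_name_tc",
    "shop_number", "shop_floor", "phone", "opening_hours",
    "loyalty_offer", "voucher_acceptance", "shop_category_id",
    "shop_category_name", "tag", "update_date",
]


def standardize_record(raw, mall, scrape_date=None):
    """Map a scraped record onto the standard fields, using None for gaps."""
    record = {field: None for field in SHOP_MASTER_FIELDS}
    # Copy only known fields; anything else the site exposes is ignored.
    record.update({k: v for k, v in raw.items() if k in record})
    record["mall"] = mall
    record["update_date"] = scrape_date or date.today()
    return record


# Made-up scraped values; "website" is not a standard field and is dropped.
raw = {"shop_id": "123", "shop_name_en": "Shop A", "website": "ignored"}
record = standardize_record(raw, mall="Elements", scrape_date=date(2024, 1, 1))
print(record)
```

Building each record from the same field list keeps the column order identical across malls, so the exported files can be concatenated directly.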
Web scrapers for 2 to 3 additional malls are expected to be added to this project each week.
Malls that have been web scraped:
- CityGate
- Citylink
- CityPlaza
- Citywalk
- Elements
- FestivalWalk
- HarbourCity
- IFC
- ISquare
- K11ArtMall
- K11Musea
- Landmark
- LanghamPlace
- LeeGardenMalls
- LinkHKMalls
- LukYeungGalleria
- MaritimeSquare
- MiraPlace
- OlympianCity
- PacificPlace
- ParadiseMall
- PlazaAscot
- PlazaHollywood
- PopCorn
- TelfordPlaza
- TheLane
- TheLohas
- TheOne
- TimeSquare
- Tmtplaza
- Windsor
Malls to be web scraped:
- Megabox
- TheForest
- DPark
- D2Place
- 1881Heritage
- MOKO
- NewTownPlaza
- OtherSHKPMalls
- MOSTown
- MetroCity
- OtherHendersonMalls
- FashionWalk
- OtherHangLungMalls
Please refer to the license page.