Extraction of Professional Details from Web URLs using DeepDive
The idea was to extract important details about a professional from their website. We picked "doctors" as the profession and mined the following information:
Name
Email
Location
Contact Number
Specialisation
The project was divided into two segments:
Extract information from web pages that describe a single doctor.
Extract information from web pages that mention multiple doctors, where their emails, locations and specialisations are intermixed.
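As a rough illustration of the first segment, the sketch below pulls email and phone-number candidates out of the plain text of a single-doctor page. It is not the project's actual extractor: the regular expressions, the Candidates class and the extract_candidates function are hypothetical names chosen for this example, and the patterns are deliberately loose.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately permissive patterns (assumptions, not from the project).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class Candidates:
    """Raw candidate mentions found on one page."""
    emails: list = field(default_factory=list)
    phones: list = field(default_factory=list)


def extract_candidates(text):
    """Return every email-like and phone-like string found in the page text."""
    return Candidates(
        emails=EMAIL_RE.findall(text),
        phones=[p.strip() for p in PHONE_RE.findall(text)],
    )


page = "Dr. Jane Roe, Cardiology. Email: jane.roe@example.org Phone: +1 (555) 010-2000"
found = extract_candidates(page)
print(found.emails, found.phones)
```

On a single-doctor page every candidate can be attached to that one doctor. On a multi-doctor page the same candidates still have to be matched to the right person, which is the problem the second segment addresses.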
Algorithms Used:
Distant Supervision - A technique for training a relation extractor without hand-labelled data: heuristic rules (or an existing knowledge base) assign noisy labels to candidate entity pairs, and a model is learned from those labels. We used it to learn which pieces of information (name, location, email, etc.) belong to which doctor/professional.
Knowledge Graphs - Constructing knowledge graphs that link the different entities in the data.
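The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's DeepDive pipeline: KNOWN_EMAILS is a hypothetical seed set of trusted (doctor, email) pairs, each doctor is assumed to have exactly one email, and the function names are invented for the example.

```python
from itertools import product

# Hypothetical seed knowledge: pairs we already trust (an assumption).
KNOWN_EMAILS = {"Jane Roe": "jane.roe@example.org"}


def label_pair(name, email, known=KNOWN_EMAILS):
    """Distant-supervision label: 1 positive, -1 negative, 0 unlabeled."""
    if known.get(name) == email:
        return 1   # pair matches the seed set
    if name in known or email in known.values():
        return -1  # contradicts the seed set (one email per doctor assumed)
    return 0       # nothing known; left for the model to decide


def label_candidates(names, emails, known=KNOWN_EMAILS):
    """Label every (name, email) pair that co-occurs on a page."""
    return {(n, e): label_pair(n, e, known) for n, e in product(names, emails)}


def build_graph(labels):
    """Keep positive pairs as edges: doctor -> [(relation, value), ...]."""
    graph = {}
    for (name, email), label in labels.items():
        if label == 1:
            graph.setdefault(name, []).append(("has_email", email))
    return graph


names = ["Jane Roe", "John Doe"]
emails = ["jane.roe@example.org", "john.doe@example.org"]
labels = label_candidates(names, emails)
print(labels)
print(build_graph(labels))
```

In a full pipeline the noisy labels would train a statistical model, and the graph edges would come from that model's predictions for the unlabeled pairs rather than directly from the seed labels. The graph here is built from the labels only to keep the sketch short.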
Tools Used:
DeepDive - Stanford’s system for extracting structured information from unstructured text, applied here to web pages.