Crawling and Extracting Information

Background

A large bank in Vietnam with over 13,224 employees and an annual revenue of approximately 20,000 billion VND wishes to leverage real estate data from various websites to support its business activities and decision-making processes.

Challenge

  • Collect real estate information from multiple websites.
  • Compile selling prices and rental rates to evaluate market value and price trends.
  • Improve accuracy when extracting data from inconsistent sources.
  • Integrate data into the bank's system to support various operations such as valuation, lending, etc.
  • Automate data collection and extraction, without manual work or heavy dependence on Selenium.

Solution

  • Use GenAI (generative AI) to extract data from websites.
  • Employ a lightweight AI model that can be deployed on suitable, cost-effective infrastructure for text data extraction.
  • Automate the process of collecting and extracting information without using Selenium.
  • Manage and cross-check extracted data via a web portal.
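
As an illustration of collecting page text without Selenium, the sketch below fetches a page over plain HTTP and reduces the HTML to visible text using only the Python standard library. It assumes the target pages render their listing content server-side; the function names, the User-Agent string, and the choice of Python are illustrative assumptions, not details from the project.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleTextParser(HTMLParser):
    """Collects visible text and skips script/style/noscript content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty fragments.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page over plain HTTP(S); no browser is involved."""
    # Hypothetical User-Agent value, chosen only for this sketch.
    request = Request(url, headers={"User-Agent": "listing-crawler-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The text returned by `html_to_text` is what would be handed to the extraction model. Pages that build their content in the browser with JavaScript would need a different approach.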

Key Components

  • AI Service: LLM, Phi
  • Cloud: AWS, Azure
  • Backend: NodeJS
  • Database: PostgreSQL

Architecture

  • The system has three main parts: an automated crawler that retrieves pages from real estate websites, an AI Service that uses the Phi model to extract information, and a backend that processes the data and stores it in a PostgreSQL database. All are deployed on an on-premise platform.
  • A web portal manages the extracted data and supports cross-checking and verification.
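
One way to picture how these parts fit together is a small pipeline that passes each URL through a crawler stage, an extraction stage, and a storage stage, and marks new records as awaiting review in the portal. This is a minimal sketch under assumptions: the class names, the `pending_review` status, and the use of Python are hypothetical, and the storage callable stands in for the PostgreSQL-backed backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ListingRecord:
    url: str
    fields: Dict[str, object]
    # Hypothetical status flag: a reviewer in the portal would mark it verified.
    status: str = "pending_review"


@dataclass
class Pipeline:
    fetch: Callable[[str], str]                # crawler: URL -> page text
    extract: Callable[[str], Dict[str, object]]  # AI service: text -> fields
    store: Callable[[ListingRecord], None]     # backend: persist the record

    def run(self, urls: List[str]) -> List[Tuple[str, str]]:
        """Process each URL; return (url, error message) for failures."""
        failures = []
        for url in urls:
            try:
                fields = self.extract(self.fetch(url))
            except Exception as exc:
                # One bad page should not stop the whole crawl.
                failures.append((url, repr(exc)))
                continue
            self.store(ListingRecord(url=url, fields=fields))
        return failures
```

Keeping each stage behind a plain callable means the crawler, the model, and the database can be swapped or stubbed independently, which also makes the flow easy to test.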

Implementation

  • Developed a crawler to automatically collect data from real estate websites.
  • Integrated an AI model API to automatically extract relevant information (selling price, rental rate, location, area, etc.) aligned with the bank's business needs.
  • Built a backend using NodeJS to process, store, and provide APIs for the bank's operational systems.
  • Deployed the system on the cloud (AWS, Azure) to ensure scalability and stability.
  • Created a web interface to facilitate data checking and cross-verification.
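
The extraction step can be sketched as a prompt that asks the model for a fixed set of fields as JSON, followed by defensive parsing of the reply. Everything here is an assumption for illustration: the field names, the prompt wording, and `call_model`, a hypothetical callable standing in for whatever API serves the Phi model. The sketch is in Python for brevity and does not represent the NodeJS backend.

```python
import json

# Hypothetical output schema; the source only names price, rental rate,
# location, and area as examples of extracted information.
FIELDS = ("listing_type", "price_vnd", "area_m2", "location")

INSTRUCTIONS = (
    "Extract the following fields from the real estate listing and reply "
    "with a single JSON object: listing_type ('sale' or 'rent'), "
    "price_vnd (number), area_m2 (number), location (string). "
    "Use null for any field that is not stated."
)


def build_prompt(page_text: str) -> str:
    return INSTRUCTIONS + "\n\nListing text:\n" + page_text


def parse_model_reply(reply: str) -> dict:
    """Pull the JSON object out of a model reply and validate each field."""
    # Models may wrap the JSON in extra prose, so take the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])

    record = {name: data.get(name) for name in FIELDS}
    if record["listing_type"] not in ("sale", "rent"):
        record["listing_type"] = None
    for name in ("price_vnd", "area_m2"):
        value = record[name]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            record[name] = None  # leave for manual review instead of guessing
    location = record["location"]
    if not isinstance(location, str) or not location.strip():
        record["location"] = None
    return record


def extract_listing(page_text: str, call_model) -> dict:
    """call_model is a hypothetical function: prompt string -> reply string."""
    return parse_model_reply(call_model(build_prompt(page_text)))
```

Fields that fail validation are set to `None` instead of being corrected automatically, so they can surface in the web interface for cross-verification.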

Results

  • Average extraction accuracy reached 91.2%.
  • Fully automated the process of crawling and extracting information.
  • Saved time and manpower compared to the previous manual process.
  • Created dashboards to compare real estate values by region and over time, and to track market fluctuations…
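
The source does not state how the average accuracy was measured. One plausible reading, sketched here as an assumption, is per-field exact-match accuracy against human-verified records, averaged across fields. The function name and field names are hypothetical.

```python
def field_accuracy(extracted, verified, fields):
    """Compare extracted records with verified ones, field by field.

    Returns (per_field, average): per_field maps each field name to the
    share of records where the extracted value equals the verified value,
    and average is the unweighted mean of those shares.
    """
    if len(extracted) != len(verified):
        raise ValueError("extracted and verified must have the same length")
    if not verified or not fields:
        raise ValueError("need at least one record and one field")

    per_field = {}
    for name in fields:
        matches = sum(
            1
            for got, want in zip(extracted, verified)
            if got.get(name) == want.get(name)
        )
        per_field[name] = matches / len(verified)

    average = sum(per_field.values()) / len(per_field)
    return per_field, average
```

Exact matching is strict; a real evaluation might normalize units or allow a tolerance on numeric fields before comparing.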

Impact

  • Provided diverse, continuously updated real estate data to support the bank's business activities.
  • Enhanced efficiency in valuation, risk management, and lending decisions.
  • Supported the development of new products and services based on big data, boosting the bank's competitiveness.

Lessons Learned

  • Applying GenAI saves effort and improves accuracy compared to traditional methods.
  • Managing data quality and cross-checking remain essential to ensure reliability.
  • Cloud solutions enable the system to scale easily as demand increases.

Conclusion

The GenAI-based real estate data crawling and extraction project has enabled the bank to fully automate data collection and processing. It achieves high accuracy, saves resources, and delivers significant value to business operations while supporting more effective decision-making.
6/12/2025