I have doubts regarding web scraping

Perumal · September 3, 2024, 5:01pm

I tried scraping the website "https://www.tamil2lyrics.com", but I have a doubt. When I fetch the HTML, I can only extract data from the main page. Is it possible to extract data from the other pages as well? So far I have managed to extract the lyrics of a single song. Is it possible to extract the lyrics of all the songs on the website?

tshrinivasan · September 3, 2024, 5:05pm

Try the mechanize module:

mechanize — mechanize 0.4.8 documentation

Mechanize Module in Python - Javatpoint

tshrinivasan · September 3, 2024, 5:09pm

The website you mentioned is WordPress-based.

You can easily get all the content of a WordPress site from its REST API, instead of scraping.

Explore the links below:

Retrieving Post Data Using the WordPress API with Python Script - Machine Learning Applications
Wordpress REST API V2: how to get list of all posts? - WordPress Development Stack Exchange
How to get all Wordpress posts as JSON using Python & the Wordpress REST API | TechOverflow
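
As a rough illustration of the REST API approach, here is a minimal sketch that pages through the standard WordPress `/wp-json/wp/v2/posts` endpoint using only the standard library. It assumes the site has the REST API enabled and publishes its lyrics as regular posts; if the songs are stored as a custom post type, the endpoint path would differ. The names `fetch_json` and `iter_posts` and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and return (parsed JSON body, response headers)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    req = urllib.request.Request(url, headers={"User-Agent": "lyrics-demo/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp), resp.headers


def iter_posts(base_url, fetch=fetch_json, per_page=100):
    """Yield every post from a WordPress site, one API page at a time."""
    endpoint = base_url.rstrip("/") + "/wp-json/wp/v2/posts"
    page = 1
    while True:
        query = urllib.parse.urlencode({"per_page": per_page, "page": page})
        posts, headers = fetch(endpoint + "?" + query)
        yield from posts
        # WordPress reports the number of result pages in this header.
        # If the header is missing, stop after the current page.
        total_pages = int(headers.get("X-WP-TotalPages", page))
        if page >= total_pages:
            break
        page += 1
```

Each post returned by this endpoint is a dictionary; the title and body are normally under `post["title"]["rendered"]` and `post["content"]["rendered"]`, and the page address under `post["link"]`. Looping over `iter_posts("https://www.tamil2lyrics.com")` would then visit every post, provided the assumptions above hold for this site.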

mohan43u · September 3, 2024, 7:42pm

Why are you able to extract only the main page? Normally, when you open that website in a browser, what do you do to get from the main page to another page that you need? What happens when you do that?

Perumal · September 3, 2024, 7:56pm

To go to the next page, I click the link for whichever page I need.
Then the page I need appears on the screen.

mohan43u · September 3, 2024, 7:58pm

What does the browser do when you click the link? How does that relate to giving a URL in Python and scraping it?

Perumal · September 3, 2024, 8:04pm

At that point, the page's link expands into the web address of the next page.

My question is: how can I get the data for all the pages starting from a single page?

mohan43u · September 3, 2024, 8:08pm

What does this mean? Please state it clearly. Once the link has been converted into a full URL, what does the browser do? How does it show you the information at that link?

Perumal · September 3, 2024, 8:12pm

To explain: when I go from my starting page link “https://www.tamil2lyrics.com/” to the next page, it expands like this: “Naanama Maivizhiyil Song Lyrics - Uyir Film”

mohan43u · September 3, 2024, 8:18pm

Read the information given here carefully. In particular, make sure you understand the fourth point. Then state how it relates to the scraping you are doing in Python.

https://g.co/gemini/share/5c3e123f1cb7
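
The questions above point at one idea: clicking a link makes the browser send a new request for that link's URL, so a scraper has to do the same for every link it finds. A minimal sketch of that idea, using only the standard library, might look like the following. The names `LinkParser`, `fetch_html`, and `crawl` and the `max_pages` limit are assumptions made for this example, not part of any code posted in this thread.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Do what the browser does on a click: send a GET request for the URL."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of same-site pages; returns {url: html}."""
    site = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the current page, drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            # Stay on the same site and never queue a page twice.
            if urlsplit(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("https://www.tamil2lyrics.com/")` would fetch the main page, then each page it links to, and so on until `max_pages` is reached. Extracting the lyrics from each downloaded page is a separate step that depends on the site's HTML.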

Sh4d0wS6 · September 4, 2024, 6:55am

How to Use This Code

Step 1: Set Up a Python Virtual Environment

Before running the code, it is good practice to create a virtual environment. This isolates your project and ensures that its dependencies do not conflict with those of other projects.

python3 -m venv mysite
source mysite/bin/activate

Step 2: Clone My GitHub Repository

Clone the repository to your local machine using the following command:

git clone "https://github.com/Tpj-root/site_cloner.git"

Step 3: Install the Required Dependencies

Navigate to the project directory and install the required dependencies:

cd site_cloner
pip install -r requirements.txt

Step 4: Run the Code

Finally, run the code with the following command:

python3 main_2.py

Note: main_1.py is the script that generates and collects all the film URLs.

Time Required for Runtime

For python3 main_1.py:

The code processes 270 pages, one URL request per page, and each request takes about 10 seconds.
The total runtime is therefore approximately 45 minutes (270 × 10 s = 2,700 s).
The code has already been run, and the film URLs it collected are in movie_urls.txt.

For python3 main_2.py:

The command cat movie_urls.txt | wc -l shows that there are approximately 4048 films.
Downloading all the movie lyrics therefore takes more than 11 hours in total.
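
The runtime figures above can be checked with a short calculation. This sketch assumes that main_2.py also spends about 10 seconds per film, the same per-request figure quoted for main_1.py; the post does not state that explicitly. The function name `estimated_runtime_seconds` is made up for this example.

```python
def estimated_runtime_seconds(request_count, seconds_per_request=10):
    """Total time for sequential requests at a fixed cost per request."""
    return request_count * seconds_per_request


listing = estimated_runtime_seconds(270)   # main_1.py: 270 listing pages
films = estimated_runtime_seconds(4048)    # main_2.py: about 4048 film pages

print(f"main_1.py: {listing / 60:.0f} minutes")  # prints "main_1.py: 45 minutes"
print(f"main_2.py: {films / 3600:.1f} hours")    # prints "main_2.py: 11.2 hours"
```

Under that assumption the numbers agree with the estimates given: 45 minutes for collecting the URLs and a little over 11 hours for downloading the lyrics.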