Java – Crawler engine architecture – Java/Perl integration

I want to develop a management solution around our web crawler Perl scripts. At the moment the scripts are stored in SVN and started manually by system administrators, developers, etc. Every time we need to retrieve data from a new source, we have to create a ticket with a business description and goal. As you can imagine, this is not an optimal solution.

The system has three recurring themes:

  1. For lack of a better phrase, data retrieval has a “conceptual structure”, i.e. retrieving the information follows a specific path.
  2. We are only looking for very specific information, so we don’t have to worry about mass crawling (thousands or millions of pages) for the time being.
  3. Crawling is URL-based rather than site-based.

As I move this alpha version to a more production-level beta, I want to add automation and management of data retrieval. Also, our other systems are Java (which I’m more proficient in), and I want to isolate the Perl parts so we don’t have to rely so much on outside help.

I have evaluated the usual suspects (Nutch, Droid, etc.), but the time it would take to modify these frameworks to fit our particular kind of information retrieval is unreasonable.

So, I’d like to know what you think about the following architecture.

I want to create a solution that:

  • Uses Java as an interface for managing and executing the Perl scripts
  • Uses Java for configuration and data access
  • Sticks with Perl for the retrieval itself

An example use case:

  1. A data analyst gives us a scraping requirement.
  2. A Perl developer writes the required script and submits it through the webapp (it is saved to the file system).
  3. The script is started from the webapp with specific parameters.
    ….

The webapp should be able to start multiple instances of the Perl scripts concurrently, so that several crawlers can run at once.
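To make that requirement concrete, here is a rough sketch of how a Java webapp might launch several crawler commands at once, each in its own operating-system process, with a fixed-size thread pool capping how many run in parallel. The class name `CrawlerPool` and the method `runAll` are made up for illustration, and the caller supplies the actual command lines (for example `perl` followed by a script path and its arguments).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical launcher: runs several external crawler commands concurrently. */
final class CrawlerPool {

    /**
     * Runs each command in its own process, at most maxParallel at a time.
     * Returns the exit codes in the same order as the input commands.
     */
    static List<Integer> runAll(List<List<String>> commands, int maxParallel)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> command : commands) {
                // Each task starts one child process and blocks until it exits.
                pending.add(pool.submit(() -> {
                    Process process = new ProcessBuilder(command)
                            .inheritIO() // child output goes to the webapp's own stdout/stderr
                            .start();
                    return process.waitFor();
                }));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> result : pending) {
                exitCodes.add(result.get()); // rethrows if the process could not be started
            }
            return exitCodes;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The threads here only wait on child processes; the crawling itself still happens in the separate Perl processes, so a misbehaving script cannot take the webapp's JVM down with it.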

So my questions are:

  1. What do you think?
  2. How reliable is the integration between Java and Perl, especially when calling Perl from Java?
  3. Has anyone used a system like this, which is effectively part Perl script repository?

The goal is really to stop having a pile of disorganized Perl scripts and to bring some management and organization to our information retrieval. I also know I could use Perl for the web part we want, but as I mentioned earlier, I’m trying to keep Perl focused on one job. Still, the approach feels backwards, and I’m not against making it an all-Perl solution.

Any suggestions and comments are welcome.

Thanks

Solution

“How solid is the integration between Java and Perl, specifically when calling Perl from Java?”

IMO, the best way to call Perl from Java is to have Java start the Perl programs in a separate process. You could try to call Perl directly from Java using JNI/JNA, but that is hard to get right, and if you get it wrong you will be dealing with JVM crashes.
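As a minimal sketch of the separate-process approach, the helper below starts an external command with `ProcessBuilder`, sends its output to a temporary file, enforces a timeout, and returns the exit code together with the captured output. The names `ScriptRunner`, `Result` and `run` are hypothetical, and the temporary-file capture is just one simple way to avoid blocking on the child's output pipe.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external command (e.g. a Perl script) in its own process. */
final class ScriptRunner {

    /** Exit code plus everything the process wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    static Result run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Capture output in a temp file so a chatty script cannot fill a pipe and stall.
        Path log = Files.createTempFile("script-", ".log");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)   // merge stderr into stdout
                    .redirectOutput(log.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                // The script is still running: kill it and wait for it to go away.
                process.destroyForcibly().waitFor();
                throw new IOException("Timed out after " + timeoutSeconds + "s: " + command);
            }
            String output = new String(Files.readAllBytes(log), StandardCharsets.UTF_8);
            return new Result(process.exitValue(), output);
        } finally {
            Files.deleteIfExists(log);
        }
    }
}
```

A call might look like `ScriptRunner.run(Arrays.asList("perl", "crawl.pl", "--url", url), 300)`, where `crawl.pl` and `--url` are placeholder names. If a script crashes or hangs, the Java side sees only a non-zero exit code or a timeout, and the JVM itself is unaffected.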

“Open to any and all suggestions and opinions.”

IMO, you will get a solution that is easier to maintain if you use either pure Perl or pure Java. If that means you have to learn Perl, so be it. (It is possible to write well-structured, maintainable applications in Perl. You just need to be disciplined about it.)
