I need an API that I can use to search for a URL, or part of a URL, within an external site's sitemap.
Stack: Laravel & MySQL
A sitemap is not always located in the same place, nor is its location always mentioned in the robots.txt. So we need to save sitemap locations in a MySQL table, which lets us try those locations on other domains and so locate more sitemaps.
Finding a sitemap

Create a MySQL table "sitemaps" (example name) that we can use to save sitemap file names (e.g. sitemap.xml, sitemap_index.php, etc.). The table has a 'sitemap' and a 'count' field; the count field is simply a counter incremented each time we find a sitemap with the same name.

Check whether the given domain has a robots.txt (https://example.com/robots.txt). If there is one, look for the Sitemap directive, e.g. "Sitemap: https://www.example.com/example.xml" (there can be multiple). Save each sitemap location to the sitemaps table; if it already exists, do a +1 on the count field.

If we don't find a sitemap location in the robots.txt, we try all the sitemap names we have in our sitemaps table (the more we collect, the higher the chance we find it), checking the names with the highest counts first.
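The robots.txt step above can be sketched in plain PHP. This is a minimal illustration, not the final implementation: parseSitemapDirectives() is a hypothetical helper name, and fetching the file and talking to the database are left out.

```php
<?php
// Extract all "Sitemap:" directive URLs from a robots.txt body.
// The directive name is case-insensitive per the sitemaps.org protocol,
// and a file may contain more than one.
function parseSitemapDirectives(string $robotsTxt): array
{
    $sitemaps = [];
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Match "Sitemap: <url>", tolerating extra whitespace and mixed case.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $sitemaps[] = $m[1];
        }
    }
    return $sitemaps;
}
```

For the counter, the "+1 if it already exists" behavior maps naturally onto a single MySQL upsert such as `INSERT ... ON DUPLICATE KEY UPDATE count = count + 1` (with a unique key on the sitemap column), so no separate SELECT is needed.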
Finding a URL

Once you find the sitemap(s), you create an index of all URLs in the sitemap and its nested sitemaps. Then you simply try to find the given search term using a MySQL query or a regex.
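Building that index means handling both sitemap index files (which point to nested sitemaps) and regular urlsets. A minimal sketch using PHP's SimpleXML extension, assuming a hypothetical extractSitemapEntries() helper; download logic and recursion over the nested sitemaps are omitted:

```php
<?php
// Parse one sitemap XML document and return [$childSitemaps, $urls]:
// a <sitemapindex> root yields nested sitemap locations,
// a <urlset> root yields page URLs.
function extractSitemapEntries(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [[], []]; // not valid XML
    }
    $childSitemaps = [];
    $urls = [];
    if ($doc->getName() === 'sitemapindex') {
        foreach ($doc->sitemap as $entry) {
            $childSitemaps[] = (string) $entry->loc;
        }
    } else { // <urlset>
        foreach ($doc->url as $entry) {
            $urls[] = (string) $entry->loc;
        }
    }
    return [$childSitemaps, $urls];
}
```

The caller would fetch each child sitemap and call this again until no nested sitemaps remain, collecting all URLs into the index.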
Example request: /sitemap?domain=example.com&search=url
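In Laravel, that request shape could be registered roughly as below. The controller and method names (SitemapSearchController::search) and the lowercase path are assumptions for illustration, not part of the spec:

```php
<?php

use Illuminate\Support\Facades\Route;
use App\Http\Controllers\SitemapSearchController;

// Handles GET /sitemap?domain=example.com&search=url;
// the controller reads the domain and search query parameters
// and returns the JSON structure shown below.
Route::get('/sitemap', [SitemapSearchController::class, 'search']);
```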
Example API response: I want the API to return the matching URLs in JSON format:

{
  "search": "example",
  "domain": "domain.com",
  "statistics": {
    "sitemaps_found": 3,
    "sitemaps": {
      "1": "www.domain.com/sitemap1.xml",
      "2": "www.domain.com/sitemap453.xml",
      "3": "www.domain.com/sitemap345.xml"
    },
    "urls": 28892,
    "matches": 25
  },
  "matches": {
    "1": "www.domain.com/example/13324223",
    "2": "www.domain.com/example/94827497"
  }
}
Discussion: we can save the sitemap files we find to our server and search within those files, or we can insert all sitemap URLs into a MySQL table and search from there. I'm not sure which is faster; let's discuss.
Save all URLs in MySQL
Pro: fast searching
Pro: easy to create a cron job that deletes entries older than x hours
Pro: easy maintenance
Con: need to extract all URLs from the sitemap files (potentially hundreds of thousands of URLs)
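If the URLs live in MySQL, the search step reduces to a LIKE query, but the user's search term must have LIKE wildcards escaped first so % and _ match literally. A sketch, where escapeLike() and the sitemap_urls table (with domain and url columns) are hypothetical names:

```php
<?php
// Backslash-escape \, % and _ so user input matches literally inside LIKE.
function escapeLike(string $term): string
{
    return addcslashes($term, '\\%_');
}

// Hedged Laravel query-builder sketch using the helper above:
// $matches = DB::table('sitemap_urls')
//     ->where('domain', $domain)
//     ->where('url', 'like', '%' . escapeLike($search) . '%')
//     ->pluck('url');
```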
Save sitemaps as files
Pro: no need to extract the URLs and put them in MySQL
Con: downloading files that might contain vulnerabilities
Con: saving whole files costs more space than saving only the URLs in MySQL
>>> Outside the scope of the initial task, but would be a follow-up task; do not price this in.