Generate sitemaps with sitemapgen4j

This post is about automatically generating sitemaps. I chose this topic because it is fresh in my mind: I have recently started using sitemaps for Podcastpedia.org. After some research I came to the conclusion that this would be a good thing. At the time of posting, Google had 3,171 URLs indexed for the website (it has been live for 3 months now), whereas after generating the sitemaps there were 87,818 URLs submitted. I am curious how many will get indexed after that…

Because I didn’t want to enter over 80k URLs manually, I had to come up with an automated solution. Since Podcastpedia.org is developed in Java, sitemapgen4j was an easy choice.

Maven dependency

Check out the latest version here:


The podcasts from Podcastpedia.org have an update frequency (DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN) associated with them, so it made sense to organize the sub-sitemaps by frequency and make use of the lastMod and changeFreq properties accordingly. This way you can modify the lastMod of the daily sitemap in the sitemap index without modifying the lastMod of the monthly sitemap, and the Google bot doesn’t need to check the monthly sitemap every day.

Generation of sitemap

Method: createSitemapForPodcastsWithFrequency – generates one sitemap file

	/**
	 * Creates sitemap for podcasts/episodes with update frequency
	 *
	 * @param  updateFrequency  update frequency of the podcasts
	 * @param  sitemapsDirectoryPath the location where the sitemap will be generated
	 */
	public void createSitemapForPodcastsWithFrequency(
			UpdateFrequencyType updateFrequency, String sitemapsDirectoryPath)  throws MalformedURLException {

		//number of URLs counted
		int nrOfURLs = 0;

		File targetDirectory = new File(sitemapsDirectoryPath);
		WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.podcastpedia.org", targetDirectory)
									.fileNamePrefix("sitemap_" + updateFrequency.toString()) // name of the generated sitemap
									.gzip(true) //recommended - as it decreases the file's size significantly
									.build();

		// reads reachable podcasts, together with their episodes, from the database for the given update frequency
		List<Podcast> podcasts = readDao.getPodcastsAndEpisodeWithUpdateFrequency(updateFrequency);
		for(Podcast podcast : podcasts) {
			String url = "http://www.podcastpedia.org" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl();
			WebSitemapUrl wsmUrl = new WebSitemapUrl.Options(url)
		     							.lastMod(podcast.getPublicationDate()) // date of the last published episode
		     							.priority(0.9) // high priority, just below the start page, which has a priority of 1
		     							.changeFreq(changeFrequencyFromUpdateFrequency(updateFrequency))
		     							.build();
			wsg.addUrl(wsmUrl);
			nrOfURLs++;

			for(Episode episode : podcast.getEpisodes() ){
				url = "http://www.podcastpedia.org" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl()
						+ "/episodes/" + episode.getEpisodeId() + "/" + episode.getTitleInUrl();

				//build websitemap url
				wsmUrl = new WebSitemapUrl.Options(url)
			     				.lastMod(episode.getPublicationDate()) //publication date of the episode
			     				.priority(0.8) //high priority but smaller than podcast priority
			     				.changeFreq(changeFrequencyFromUpdateFrequency(UpdateFrequencyType.TERMINATED)) // a published episode is not expected to change
			     				.build();
				wsg.addUrl(wsmUrl);
				nrOfURLs++;
			}
		}

		// One sitemap can contain a maximum of 50,000 URLs.
		if(nrOfURLs <= 50000){
			wsg.write();
		} else {
			// in this case multiple sitemap files are created, plus a sitemap_index.xml file describing them; that index file is ignored later on
			// workaround for the issue described at http://code.google.com/p/sitemapgen4j/issues/attachmentText?id=8&aid=80003000&name=Admit_Single_Sitemap_in_Index.patch&token=p2CFJZ5OOE5utzZV1UuxnVzFJmE%3A1375266156989
			wsg.write();
			wsg.writeSitemapsWithIndex();
		}

	}
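
The method above relies on a helper, changeFrequencyFromUpdateFrequency, which is not shown in this post. A minimal sketch of what it might look like follows. The two enums are local stand-ins so the example compiles without the application code or sitemapgen4j on the classpath: ChangeFreq mirrors the values of the library's enum of the same name. The MONTHLY and TERMINATED mappings match the generated output shown further down; the mapping for UNKNOWN is purely an assumption.

```java
// Stand-in for the application's update frequency type (values as listed in the post).
enum UpdateFrequencyType { DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN }

// Local stand-in mirroring the values of sitemapgen4j's ChangeFreq enum.
enum ChangeFreq { ALWAYS, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, NEVER }

class ChangeFrequencyMapping {

    /** Maps a podcast's update frequency to the changefreq value written to the sitemap. */
    static ChangeFreq changeFrequencyFromUpdateFrequency(UpdateFrequencyType updateFrequency) {
        switch (updateFrequency) {
            case DAILY:
                return ChangeFreq.DAILY;
            case WEEKLY:
                return ChangeFreq.WEEKLY;
            case MONTHLY:
                return ChangeFreq.MONTHLY;
            case TERMINATED:
                // a terminated podcast (or an already published episode) no longer changes
                return ChangeFreq.NEVER;
            case UNKNOWN:
            default:
                // assumption: fall back to a middle-of-the-road value when nothing is known
                return ChangeFreq.WEEKLY;
        }
    }

    public static void main(String[] args) {
        for (UpdateFrequencyType type : UpdateFrequencyType.values()) {
            System.out.println(type + " -> " + changeFrequencyFromUpdateFrequency(type));
        }
    }
}
```

In the real service the method would return the library's ChangeFreq, so the result can be passed straight to the changeFreq(...) option.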

The generated file contains URLs to podcasts and episodes, with changeFreq and lastMod set accordingly.
Snippet from the generated sitemap_MONTHLY.xml:


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
  <url>
    <loc>http://www.podcastpedia.org/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast</loc>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>http://www.podcastpedia.org/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast/episodes/130/Episode-40-Mobile-Multiplattform-Anwendungen-am-Beispiel-von-jQuery-Mobile</loc>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.podcastpedia.org/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast/episodes/90/Episode-39-Entwicklung-fr-Embedded-Systeme-mit-mbeddr</loc>
    <lastmod>2013-03-11T15:40+01:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  .....
</urlset>

Generation of sitemap index

After sitemaps are generated for all update frequencies, a sitemap index is generated to list all the sitemaps. This file will be submitted in the Google Webmaster Tools.
Method: createSitemapIndexFile

	/**
	 * Creates a sitemap index from all the files from the specified directory excluding the test files and sitemap_index.xml files
	 *
	 * @param  sitemapsDirectoryPath the location where the sitemap index will be generated
	 */
	public void createSitemapIndexFile(String sitemapsDirectoryPath) throws MalformedURLException {

		File targetDirectory = new File(sitemapsDirectoryPath);
		// the sitemap index is written to sitemap_index.xml in the same directory as the sitemaps
		File outFile = new File(sitemapsDirectoryPath + "/sitemap_index.xml");
		SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.podcastpedia.org", outFile);

		//get all the files from the specified directory
		File[] files = targetDirectory.listFiles();
		for(int i=0; i < files.length; i++){
			// skip the sitemap index itself and any test files
			boolean isNotSitemapIndexFile = !files[i].getName().startsWith("sitemap_index") && !files[i].getName().startsWith("test");
			if(isNotSitemapIndexFile){
				SitemapIndexUrl sitemapIndexUrl = new SitemapIndexUrl("http://www.podcastpedia.org/" + files[i].getName(), new Date(files[i].lastModified()));
				sig.addUrl(sitemapIndexUrl);
			}

		}
		sig.write();
	}

The process is quite simple: the method looks in the folder where the sitemap files were created and generates a sitemap index from these files, setting each lastmod value to the time the file was last modified (the lastModified() call passed to the SitemapIndexUrl constructor).
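
The file selection described above can also be pulled out into a small helper that is easy to test on its own. The following is only a sketch using plain java.io; the class and method names (SitemapFileSelection, isSitemapFile, selectSitemapFiles) are made up for this example and are not part of sitemapgen4j. Both exclusion conditions have to hold at the same time, hence the logical AND.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

class SitemapFileSelection {

    /** A file belongs in the index only if it is neither the index itself nor a test file. */
    static boolean isSitemapFile(String fileName) {
        return !fileName.startsWith("sitemap_index") && !fileName.startsWith("test");
    }

    /** Returns the sitemap files found in the directory, sorted by path. */
    static List<File> selectSitemapFiles(File directory) {
        List<File> selected = new ArrayList<>();
        File[] files = directory.listFiles(); // null if the path is not a readable directory
        if (files == null) {
            return selected;
        }
        Arrays.sort(files); // listFiles() makes no promise about ordering
        for (File file : files) {
            if (file.isFile() && isSitemapFile(file.getName())) {
                selected.add(file);
            }
        }
        return selected;
    }

    public static void main(String[] args) throws IOException {
        // build a throwaway directory with one real sitemap and two files to be skipped
        File dir = Files.createTempDirectory("sitemaps").toFile();
        for (String name : new String[] {"sitemap_DAILY.xml.gz", "sitemap_index.xml", "test_sitemap.xml"}) {
            new File(dir, name).createNewFile();
        }
        for (File file : selectSitemapFiles(dir)) {
            // the modification time is what ends up as lastmod in the index
            System.out.println(file.getName() + " last modified " + new Date(file.lastModified()));
        }
    }
}
```

The null check matters because listFiles() returns null rather than an empty array when the directory does not exist.
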
Et voilà sitemap_index.xml:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.podcastpedia.org/sitemap_DAILY.xml.gz</loc>
    <lastmod>2013-08-01T07:24:38.450+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.podcastpedia.org/sitemap_MONTHLY.xml.gz</loc>
    <lastmod>2013-08-01T07:25:01.347+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.podcastpedia.org/sitemap_TERMINATED.xml.gz</loc>
    <lastmod>2013-08-01T07:25:10.392+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.podcastpedia.org/sitemap_UNKNOWN.xml.gz</loc>
    <lastmod>2013-08-01T07:26:33.067+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.podcastpedia.org/sitemap_WEEKLY.xml.gz</loc>
    <lastmod>2013-08-01T07:24:53.957+02:00</lastmod>
  </sitemap>
</sitemapindex>
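
Putting the two methods from this post together, a complete run could look roughly like the sketch below: one sitemap per update frequency first, then the index over everything that was written. The SitemapService interface is a hypothetical stand-in declaring just these two methods, and generateAll is a made-up name for the driver; UpdateFrequencyType is redefined locally so the example is self-contained.

```java
import java.net.MalformedURLException;

// Stand-in for the application's update frequency type (values as listed in the post).
enum UpdateFrequencyType { DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN }

// Hypothetical interface declaring the two methods described in this post.
interface SitemapService {
    void createSitemapForPodcastsWithFrequency(UpdateFrequencyType updateFrequency,
                                               String sitemapsDirectoryPath) throws MalformedURLException;

    void createSitemapIndexFile(String sitemapsDirectoryPath) throws MalformedURLException;
}

class SitemapGenerationRun {

    /** Generates one sitemap per update frequency, then the index listing all of them. */
    static void generateAll(SitemapService service, String sitemapsDirectoryPath)
            throws MalformedURLException {
        for (UpdateFrequencyType frequency : UpdateFrequencyType.values()) {
            service.createSitemapForPodcastsWithFrequency(frequency, sitemapsDirectoryPath);
        }
        // the index has to come last, since it is built from the files already on disk
        service.createSitemapIndexFile(sitemapsDirectoryPath);
    }

    public static void main(String[] args) throws MalformedURLException {
        // a logging implementation, used here only to show the order of the calls
        SitemapService loggingService = new SitemapService() {
            @Override
            public void createSitemapForPodcastsWithFrequency(UpdateFrequencyType frequency, String dir) {
                System.out.println("would write sitemap_" + frequency + " into " + dir);
            }

            @Override
            public void createSitemapIndexFile(String dir) {
                System.out.println("would write sitemap_index.xml into " + dir);
            }
        };
        generateAll(loggingService, "/tmp/sitemaps");
    }
}
```

Such a driver could be triggered by a scheduled job, so that only the sitemaps whose content changed get a new lastmod in the index.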


Source code

  • SitemapService.zip – the archive contains the interface and class implementation for the methods described in the post


Adrian Matei is the creator of Podcastpedia.org and Codingpedia.org, computer science engineer, husband, father, curious and passionate about science, computers, software, education, economics, social equity, philosophy.

