Reading/Parsing RSS and Atom feeds in Java with Rome

As you might have already guessed, Podcastpedia.org is all about podcasts, and podcasting is all about distributing audio or video content via RSS or Atom. This post presents how Atom and RSS podcast feeds are parsed and added to the directory, with the help of the Java project Rome.

Maven dependencies

In order to use Rome in your Java project, you have to add rome.jar and jdom.jar to your classpath or, if you use Maven, add the following dependencies to your pom.xml file:

<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>
<dependency>
    <groupId>org.jdom</groupId>
    <artifactId>jdom</artifactId>
    <version>1.1</version>
</dependency>

Building a SyndFeed object

ROME represents syndication feeds (RSS and Atom) as instances of the com.sun.syndication.synd.SyndFeed interface. The SyndFeed interface and its properties follow the JavaBean pattern. The default implementations provided with ROME are all lightweight classes.

XmlReader

ROME includes parsers to process syndication feeds into SyndFeed instances. The SyndFeedInput class selects the correct parser based on the syndication feed being processed. The developer does not need to worry about choosing the right parser for a syndication feed; SyndFeedInput takes care of it by peeking at the feed's structure. All it takes to read a syndication feed using ROME are the following two lines of code:

SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedUrl));

The first line creates a SyndFeedInput instance that will work with any syndication feed type (all RSS and Atom versions). The second line instructs the SyndFeedInput to read the syndication feed from the character-based input stream of a URL pointing to the feed. The XmlReader is a character-based Reader that resolves the encoding following the HTTP MIME type and the XML rules for it. The SyndFeedInput.build() method returns a SyndFeed instance that can be easily processed.
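To illustrate the XML side of that encoding resolution, here is a simplified, self-contained sketch (the class name and regex are my own, not part of Rome) that only looks at the encoding pseudo-attribute of the XML declaration; the real XmlReader additionally honors the HTTP Content-Type header and byte-order marks:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration,
    // e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern ENCODING_PATTERN =
            Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']");

    /** Returns the declared encoding, or "UTF-8" as a fallback default. */
    public static String sniffEncoding(String xmlProlog) {
        Matcher m = ENCODING_PATTERN.matcher(xmlProlog);
        return m.find() ? m.group(1) : "UTF-8";
    }

    public static void main(String[] args) {
        // prints "ISO-8859-1"
        System.out.println(sniffEncoding("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><rss/>"));
        // no declared encoding, prints the fallback "UTF-8"
        System.out.println(sniffEncoding("<?xml version=\"1.0\"?><rss/>"));
    }
}
```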

InputSource

The approach just mentioned works fine for most of the podcast feeds out there, but for some, mean exceptions like "Content is not allowed in prolog" or "Invalid byte 2 of 3-byte UTF-8 sequence" started to occur. To tackle these exceptions I replaced the XmlReader with an InputSource, which solved most of the problems – thank you Paŭlo Ebermann on StackOverflow for researching this. The following code snippet shows how it is used to parse the feeds:

public SyndFeed getSyndFeedForUrl(String url) throws MalformedURLException, IOException, IllegalArgumentException, FeedException {

	SyndFeed feed = null;
	InputStream is = null;

	try {

		URLConnection openConnection = new URL(url).openConnection();
		// reuse the connection already opened above instead of opening a second one
		is = openConnection.getInputStream();
		if("gzip".equals(openConnection.getContentEncoding())){
			is = new GZIPInputStream(is);
		}
		InputSource source = new InputSource(is);
		SyndFeedInput input = new SyndFeedInput();
		feed = input.build(source);

	} catch (Exception e){
		LOG.error("Exception occurred when building the feed object out of the url", e);
	} finally {
		if( is != null)	is.close();
	}

	return feed;
}

Note the line if("gzip".equals(openConnection.getContentEncoding())) – this is needed because some web sites serve their files gzip-compressed, and although you might not notice this in the browser (browsers decompress the files automatically), you have to decompress the stream programmatically in your code.
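The decompression step can be verified in isolation without any network access. The sketch below (class and method names are my own for the example) simulates a server that gzip-compresses a feed and shows that wrapping the raw stream in a GZIPInputStream recovers the original XML:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    /** Reads all bytes from a stream into a UTF-8 String. */
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = is.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String feedXml = "<?xml version=\"1.0\"?><rss version=\"2.0\"/>";

        // simulate a server that delivers the feed gzip-compressed
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(feedXml.getBytes(StandardCharsets.UTF_8));
        }

        // wrapping the raw stream in GZIPInputStream recovers the original XML,
        // just like in getSyndFeedForUrl() above
        InputStream is = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()));
        System.out.println(readAll(is).equals(feedXml)); // prints "true"
    }
}
```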

FileInputStream

If for some reason ("Content is not allowed in prolog", "Invalid byte 2 of 3-byte UTF-8 sequence" etc.) you cannot parse the feed online via its URL, you can save it to a local file, modify the encoding to your needs (very easy with Notepad++, for example) and parse it from there:

public SyndFeed getSyndFeedFromLocalFile(String filePath)
		throws MalformedURLException, IOException,
		IllegalArgumentException, FeedException {

	SyndFeed feed = null;
	FileInputStream fis = null;
	try {
		fis = new FileInputStream(filePath);
		InputSource source = new InputSource(fis);
		SyndFeedInput input = new SyndFeedInput();
		feed = input.build(source);
	} finally {
		if (fis != null) fis.close();
	}

	return feed;
}

Using the SyndFeed interface

Once the SyndFeed instance is created, it is used to extract the metadata of the podcast (like title, description, author, copyright etc.):

@SuppressWarnings("unchecked")
public void setPodcastFeedAttributes(Podcast podcast,  boolean feedPropertyHasBeenSet) throws Exception {

	SyndFeed syndFeed = podcast.getPodcastFeed();

	if(syndFeed!=null){
		//set DESCRIPTION for podcast - used in search
		if(syndFeed.getDescription()!=null
				&& !syndFeed.getDescription().equals("")){
			String description = syndFeed.getDescription();
			//remove tags from the description, if any, and store a shortened version
			String descWithoutTags = description.replaceAll("\\<[^>]*>", "");
			if(descWithoutTags.length() > MAX_LENGTH_DESCRIPTION) {
				podcast.setDescription(descWithoutTags.substring(0, MAX_LENGTH_DESCRIPTION));
			} else {
				podcast.setDescription(descWithoutTags);
			}
		} else {
			podcast.setDescription("NO DESCRIPTION AVAILABLE for FEED");
		}

		//set TITLE - used in search
		String podcastTitle = syndFeed.getTitle();
		podcast.setTitle(podcastTitle);

		//set author
		podcast.setAuthor(syndFeed.getAuthor());

		//set COPYRIGHT
		podcast.setCopyright(syndFeed.getCopyright());

		//set LINK
		podcast.setLink(syndFeed.getLink());

		//set url link of the podcast's image, shown when selecting the podcast in the main application
		SyndImage podcastImage = syndFeed.getImage();
		if(null!= podcastImage){
			if(podcastImage.getUrl() != null){
				podcast.setUrlOfImageToDisplay(podcastImage.getUrl());
			} else if (podcastImage.getLink() != null){
				podcast.setUrlOfImageToDisplay(podcastImage.getLink());
			} else {
				podcast.setUrlOfImageToDisplay(configBean.get("NO_IMAGE_LOCAL_URL"));
			}
		} else {
			podcast.setUrlOfImageToDisplay(configBean.get("NO_IMAGE_LOCAL_URL"));
		}

		podcast.setPublicationDate(null);//default value is null, if cannot be set

		//set url media link of the last episode - this is used when generating the ATOM and RSS feeds from the Start page for example
		for(SyndEntryImpl entry: (List)syndFeed.getEntries()){
			//get the list of enclosures
			List enclosures = (List) entry.getEnclosures();

			if(null != enclosures){
				//if there is a media type (audio or video) in the enclosure list, it will be set as the link of the episode
				for(SyndEnclosureImpl enclosure : enclosures){
					if(null!= enclosure){
						podcast.setLastEpisodeMediaUrl(enclosure.getUrl());
						break;
					}
				}
			}

			if(entry.getPublishedDate() == null){
				LOG.warn("PodURL[" + podcast.getUrl() + "] - " + "COULD NOT SET publication date for podcast, default date 08.01.1983 will be used " );
			} else {
				podcast.setPublicationDate(entry.getPublishedDate());
			}
			//first episode in the list is last episode - normally (are there any exceptions?? TODO -investigate)
			break;
		}
	}
}
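The tag-stripping and truncation logic used for the podcast description above can be exercised in isolation. This is a small standalone sketch (the class name and the length limit of 50 are hypothetical, chosen just for the example) using the same regular expression as the method above:

```java
public class DescriptionCleaner {

    // hypothetical limit for the example; the real constant lives in the service class
    private static final int MAX_LENGTH_DESCRIPTION = 50;

    /** Strips HTML/XML tags and truncates, as in setPodcastFeedAttributes(). */
    public static String cleanDescription(String description) {
        String withoutTags = description.replaceAll("\\<[^>]*>", "");
        return withoutTags.length() > MAX_LENGTH_DESCRIPTION
                ? withoutTags.substring(0, MAX_LENGTH_DESCRIPTION)
                : withoutTags;
    }

    public static void main(String[] args) {
        // prints "A podcast about Java"
        System.out.println(cleanDescription("<p>A podcast about <b>Java</b></p>"));
    }
}
```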

Well, that’s all folks. Many thanks to the Rome creators and contributors, to the open source communities, to Google, StackOverflow and to all the great people out there.

Thanks for sharing and connecting with us

Don’t forget to check out Podcastpedia.org – you might find it really interesting. We are grateful for your support.

Resources

  1. Rome project
  2. Reads and prints any RSS/Atom feed type
  3. Problem with charset and Rome – StackOverflow
  4. Java: Resolving org.xml.sax.SAXParseException: Content is not allowed in prolog
  5. StackOverflow – Getting strange characters when trying to read UTF-8 document from URL

Adrian Matei

Creator of Podcastpedia.org and Codingpedia.org, computer science engineer, husband, father, curious and passionate about science, computers, software, education, economics, social equity, philosophy - but these are just outside labels and not that important, deep inside we are all just consciousness, right?

P.S. The stack trace of the mean “Content is not allowed in prolog” error is listed below:

2013-09-19 06:23:43,529 ERROR [org.podcastpedia.admin.service.impl.UpdateServiceImpl:?] -
com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.
	at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:226)
	at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
	at org.podcastpedia.admin.service.utils.impl.UtilsImpl.getSyndFeedForUrl(UtilsImpl.java:552)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.getSyndFeedForUpdate(UpdateServiceImpl.java:472)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.getNewEpisodes(UpdateServiceImpl.java:389)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.updatePodcastById(UpdateServiceImpl.java:221)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.updatePodcastsFromRange(UpdateServiceImpl.java:607)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
	at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:89)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.jdom.input.JDOMParseException: Error on line 1: Content is not allowed in prolog.
	at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
	at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:222)
	... 19 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
	... 20 more
