Archive for the ‘Code’ Category
Picasa Photo Scraper Using GData
I took a trip to Japan and China about a year and a half ago. I normally don’t take pictures (I don’t even own a digital camera!), but I have a friend who took over 1000. Of course, once we got back, I wanted them. The problem was that he had uploaded them to Picasa and then gotten rid of his own copies. So I was stuck trying to get over 1000 pictures out of Picasa, which I obviously didn’t want to do by hand. I wasn’t familiar with the software, and I assumed (without ever bothering to verify) that you could not just download all the pictures in an album in one shot.
A friend had told me about the Google Data APIs (GData), which he was using to display photos from Picasa on another website. This sounded like exactly what I was looking for. You can actually use GData to interface with all of the different data hosting services Google provides; it’s very handy. The homepage for the project has the relevant downloads, as well as some documentation. You can use GData from within a variety of languages, including .NET. C# just happens to be my “quick ‘n dirty” language of choice, so I was all set.
At first it was difficult finding a good quick start guide, or something to get me up and running. Modeling the URLs correctly to get the data you want isn’t exactly intuitive. This situation may have improved since I wrote this little application, which was over a year ago. I’m going to go through the code I wrote to scrape the pictures out of Picasa. I will assume you have knowledge of C#. This app was intended to be written quickly for a specific purpose. All data is hard coded. My knowledge of GData is limited to what is contained in this app, since I haven’t had a need for it since. In order to get started, you need to download and install GData from the project page mentioned above.
Here is the code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using Google.GData.Client;
using Google.GData.Photos;

namespace PicasaPhotoFetcher
{
    class Program
    {
        // Modify the following data according to your needs.
        // This array contains the folder names that will be created for each album.
        private static readonly string[] Albums = {
            "Your album name here"
        };
        // This array contains the URLs for the data feed for each album.
        private static readonly string[] Urls = {
            "http://picasaweb.google.com/data/feed/api/user/[USER]/album/[ALBUM]?kind=photo"
        };
        // This is the absolute path to the folder where the albums will be stored.
        private static readonly string DownloadPath = "C:\\My Picasa Photos\\";

        // You shouldn't need to modify beyond this point.
        static void Main(string[] args)
        {
            PicasaService photoService = new PicasaService("Picasa");
            for (int i = 0; i < Albums.Length; i++) {
                FeedQuery photosQuery = new FeedQuery(Urls[i]);
                PicasaFeed albumFeed = photoService.Query(photosQuery) as PicasaFeed;
                DownloadAllPhotos(Albums[i], albumFeed.Entries);
                Console.WriteLine();
            }
        }

        static void DownloadAllPhotos(string albumName, AtomEntryCollection photoList)
        {
            DirectoryInfo dirInfo = Directory.CreateDirectory(DownloadPath + albumName);
            int photoNum = 1;
            foreach (AtomEntry photo in photoList) {
                Console.SetCursorPosition(0, Console.CursorTop);
                Console.Write("Fetching image {0} of {1}...", photoNum, photoList.Count);
                HttpWebRequest photoRequest = WebRequest.Create(photo.Content.AbsoluteUri +
                    "?imgmax=800") as HttpWebRequest;
                string imgPath = Path.Combine(dirInfo.FullName,
                    "image" + photoNum++ + ".jpg");
                // The using blocks close the response and flush the file,
                // even if a read fails part way through.
                using (HttpWebResponse photoResponse = photoRequest.GetResponse() as
                    HttpWebResponse)
                using (Stream input = photoResponse.GetResponseStream())
                using (FileStream imgOut = File.Create(imgPath)) {
                    byte[] buffer = new byte[1024];
                    int bytesRead;
                    // Copy the image to disk in 1024 byte chunks.
                    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0) {
                        imgOut.Write(buffer, 0, bytesRead);
                    }
                }
            }
        }
    }
}
In order to build this code, you will have to add references to the “Google Data API Core Library” and the “Google Data API Picasa Library” to your project.
As you can see, the class starts out with three data members that contain all of the configuration data required by the application. You can add as many albums as you want to the string arrays. Each index in the first array corresponds to the same index in the second array (the arrays must have the same number of elements). The first array contains the names of the folders that will be created on disk for each album. The second array contains the actual URL sent to Google to get the album contents. There are a few things to note here. Obviously, the [USER] and [ALBUM] markers should be replaced with your username and album. Also, if the album is private but has an auth key, you can add the key to the URL as a parameter:
&authkey=[KEY]
Where [KEY] is your auth key. I believe you can also pass your username and password in the URL, but I am not sure of the parameter names for that offhand.
There is nothing too fancy going on in DownloadAllPhotos. We are essentially reading the data of each image in 1024-byte chunks and dumping those chunks to the image file. One additional thing to note is that we are appending the parameter:
?imgmax=800
to the photo URL. This limits the maximum dimension of the image to 800 pixels. The aspect ratio is maintained when the image is scaled. If you do not want to scale your images automatically, then you can simply remove this piece of the code.
My colleague Johnathan has written a post discussing how to retrieve Picasa content with PHP and has also written a WordPress plugin for this purpose. Check them out.
Automatic Deallocation With AutoPtr
One of the major concepts in C++ that makes it so powerful, and therefore so difficult, is memory management. Even experienced programmers sometimes struggle with allocating and deallocating memory correctly and effectively. However, when it is done correctly (which is, of course, rather subjective), a C++ program avoids the runtime overhead that a garbage collector brings with it.
I recently ran into a memory management issue. I was employing a partial caching strategy, so in certain scenarios I wanted to return a pointer to memory stored in cache, and in other scenarios I needed to allocate new memory to return because the data did not exist in cache. This left me with two options: copy the data, or try to deal with deallocating the memory in the latter case. I chose to deal with deallocating the memory.
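To make the dilemma concrete, here is a rough sketch of what such a lookup looks like when the ownership flag is passed around by hand. The Record type, the recordCache map, and the lookup and readValue functions are all invented for this illustration; they are not from my actual code.

```cpp
#include <map>
#include <string>

// Hypothetical record type, invented for this illustration.
struct Record {
    std::string value;
    explicit Record(const std::string& v) : value(v) { }
};

// Hypothetical cache; it owns every Record stored in it.
static std::map<int, Record*> recordCache;

// Returns the cached Record for key when there is one, otherwise a freshly
// allocated Record. callerOwns reports whether the caller must delete it.
Record* lookup(int key, bool& callerOwns) {
    std::map<int, Record*>::iterator it = recordCache.find(key);
    if (it != recordCache.end()) {
        callerOwns = false;   // memory belongs to the cache
        return it->second;
    }
    callerOwns = true;        // memory belongs to the caller
    return new Record("loaded on demand");
}

// Every caller has to carry the flag around and honor it on each exit path.
std::string readValue(int key) {
    bool owns = false;
    Record* record = lookup(key, owns);
    std::string result = record->value;
    if (owns) delete record;
    return result;
}
```

The flag and the pointer travel separately here, so it is easy to forget one of them on some branch. Bundling the two together is what the class below is for.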
Initially I considered std::auto_ptr; there is no reason to reinvent the wheel. However, std::auto_ptr does not let you specify at construction that it does not own the memory, which is something I needed. For this reason, and simply for the learning opportunity, I wrote my own version of an auto_ptr class with the functionality I needed (I realize I could have simply inherited from std::auto_ptr to provide this functionality, but how fun would that have been?). Here is the source code for this class:
template<typename _T>
class AutoPtr
{
    // Lets the converting constructor and operator= below reach the
    // members of an AutoPtr instantiated with a different type.
    template<typename _T1> friend class AutoPtr;
public:
    explicit AutoPtr(_T* ptr = 0, bool owned = true)
        : _owned(owned),
          _ptr(ptr) { }
    // _owned is declared before _ptr, so other._owned is copied
    // before other.detach() clears it.
    AutoPtr(AutoPtr<_T>& other)
        : _owned(other._owned),
          _ptr(other.detach()) { }
    template<typename _T1>
    AutoPtr(AutoPtr<_T1>& other)
        : _owned(other._owned),
          _ptr(other.detach()) { }
    virtual ~AutoPtr() {
        if (_owned) delete _ptr;
    }
    _T* operator->() { return _ptr; }
    _T& operator*() { return *_ptr; }
    operator _T*() { return _ptr; }
    const _T* operator->() const { return _ptr; }
    const _T& operator*() const { return *_ptr; }
    operator const _T*() const { return _ptr; }
    AutoPtr& operator=(AutoPtr& rhs) {
        // Read the flag first: detach() clears it, and the order in which
        // function arguments are evaluated is unspecified.
        bool owned = rhs._owned;
        reset(rhs.detach(), owned);
        return *this;
    }
    template<typename _T1>
    AutoPtr& operator=(AutoPtr<_T1>& rhs) {
        bool owned = rhs._owned;
        reset(rhs.detach(), owned);
        return *this;
    }
    _T* detach() {
        _T* t = _ptr;
        _ptr = 0;
        _owned = false;
        return t;
    }
    void reset(_T* ptr = 0, bool owned = true) {
        if (_owned) delete _ptr;
        _owned = owned;
        _ptr = ptr;
    }
protected:
    bool _owned;
    _T* _ptr;
protected:
    struct ReferenceHelper
    {
        bool _owned;
        _T* _ptr;
        explicit ReferenceHelper(bool owned, _T* ptr)
            : _owned(owned),
              _ptr(ptr) { }
    };
public:
    AutoPtr(ReferenceHelper helper)
        : _owned(helper._owned),
          _ptr(helper._ptr) { }
    AutoPtr& operator=(ReferenceHelper rhs) {
        reset(rhs._ptr, rhs._owned);
        return *this;
    }
    operator ReferenceHelper() {
        bool owned = _owned;
        return ReferenceHelper(owned, detach());
    }
};
Now, this type of thing has been done a thousand times in the past, but that’s OK. I’ll take the opportunity to walk through the code anyway. I’ll assume you have at least some basic template knowledge.
We start with a basic constructor and copy constructor, easy enough. Next is something slightly more interesting:
template<typename _T1>
AutoPtr(AutoPtr<_T1>& other);
This is a copy constructor that allows us to copy from an AutoPtr of a convertible type, e.g., an AutoPtr of a child class type being passed to the constructor of an AutoPtr of its parent class type. We then have some simple operators that give the class pointer semantics. These operators are what allow us to treat the AutoPtr class just like a real pointer. We then have two operator=’s, which mirror the two copy constructors.
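The child-to-parent case can be sketched like this. This is a condensed stand-in for the class above, not the full thing: it leaves out the pointer conversion operators and assignment, uses T instead of _T, and adds get() and owned() accessors purely so the result can be inspected. Shape and Triangle are made-up types.

```cpp
// Condensed stand-in for the AutoPtr class above: only what this example needs.
template<typename T>
class AutoPtr
{
    // Lets AutoPtr<T> read the members of AutoPtr<U> in the constructor below.
    template<typename U> friend class AutoPtr;
public:
    explicit AutoPtr(T* ptr = 0, bool owned = true)
        : _owned(owned), _ptr(ptr) { }

    AutoPtr(AutoPtr& other)
        : _owned(other._owned), _ptr(other.detach()) { }

    // Converting copy constructor: only compiles when U* converts to T*.
    template<typename U>
    AutoPtr(AutoPtr<U>& other)
        : _owned(other._owned), _ptr(other.detach()) { }

    ~AutoPtr() { if (_owned) delete _ptr; }

    T* operator->() { return _ptr; }
    T* get() const { return _ptr; }          // hypothetical accessor, for inspection only
    bool owned() const { return _owned; }    // hypothetical accessor, for inspection only

    T* detach() {
        T* t = _ptr;
        _ptr = 0;
        _owned = false;
        return t;
    }
private:
    AutoPtr& operator=(const AutoPtr&);      // assignment left out of this sketch
    bool _owned;   // declared before _ptr so it is copied before detach() clears it
    T* _ptr;
};

// Made-up class hierarchy for the demonstration.
struct Shape {
    virtual ~Shape() { }
    virtual int sides() const { return 0; }
};
struct Triangle : Shape {
    virtual int sides() const { return 3; }
};

int sidesThroughParent() {
    AutoPtr<Triangle> child(new Triangle);
    AutoPtr<Shape> parent(child);    // child is now empty; parent owns the Triangle
    return parent->sides();
}
```

Note that Shape needs a virtual destructor here, since the Triangle ends up being deleted through a Shape pointer.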
Next we have:
_T* detach();
This method explicitly takes ownership of the managed memory from the AutoPtr instance. It is, of course, used in the copy constructors and operator=’s. And:
void reset(_T* ptr = 0, bool owned = true);
Which instructs the AutoPtr instance to manage new memory, destroying any previously managed memory first.
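Here is a small sketch of how detach and reset behave, including the non-owning case that motivated the class in the first place. Again this is a condensed stand-in (T instead of _T, copying disabled, a get() accessor added for inspection), and the Tracked type with its liveCount counter is invented so the effect of each call can be observed.

```cpp
// Counts live Tracked objects so the effect of each call is observable.
static int liveCount = 0;

struct Tracked {
    Tracked() { ++liveCount; }
    ~Tracked() { --liveCount; }
};

// Condensed stand-in for the AutoPtr class above: only what this example needs.
template<typename T>
class AutoPtr
{
public:
    explicit AutoPtr(T* ptr = 0, bool owned = true)
        : _owned(owned), _ptr(ptr) { }
    ~AutoPtr() { if (_owned) delete _ptr; }

    T* get() const { return _ptr; }   // hypothetical accessor, for inspection only

    // Hands the pointer to the caller and forgets about it.
    T* detach() {
        T* t = _ptr;
        _ptr = 0;
        _owned = false;
        return t;
    }
    // Deletes the current object if it is owned, then manages ptr instead.
    void reset(T* ptr = 0, bool owned = true) {
        if (_owned) delete _ptr;
        _owned = owned;
        _ptr = ptr;
    }
private:
    AutoPtr(const AutoPtr&);             // copying left out of this sketch
    AutoPtr& operator=(const AutoPtr&);
    bool _owned;
    T* _ptr;
};

// Returns the number of live Tracked objects just before returning.
int resetAndDetachDemo() {
    Tracked cached;                        // stands in for memory owned by a cache
    {
        AutoPtr<Tracked> p(new Tracked);   // p owns a new object: 2 alive
        p.reset(&cached, false);           // owned object deleted, cached one borrowed: 1 alive
        p.reset(new Tracked);              // borrowed object left alone, new one owned: 2 alive
        Tracked* raw = p.detach();         // p is empty; deleting raw is now our job
        delete raw;                        // 1 alive
    }                                      // p owns nothing, so its destructor deletes nothing
    return liveCount;                      // only cached remains
}
```

Starting from a count of zero, resetAndDetachDemo() returns 1, and the count drops back to zero once cached itself goes out of scope.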
The ReferenceHelper type is also interesting. This simple struct is what lets an AutoPtr be returned from a function by value:
AutoPtr<MyClass> getMyClass()
{
    return AutoPtr<MyClass>(new MyClass);
}
int main()
{
    AutoPtr<MyClass> ptr = getMyClass();
}
Without this struct, we would not be able to properly manage the new’d instance of MyClass when returning from getMyClass, because the copy constructor takes a non-const reference and therefore cannot accept the temporary that getMyClass returns. What actually happens here is:
AutoPtr is implicitly converted to ReferenceHelper.
ReferenceHelper is implicitly converted to AutoPtr.
This allows us to correctly remember whether or not we own the allocated memory, while not destructing it when returning from getMyClass.
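The two conversions can be sketched as follows, once implicitly and once written out by hand. This is a condensed stand-in for the class above with a few liberties that I should flag: T instead of _T, no pointer conversion operators, ReferenceHelper made public so the demo can name it, and a get() accessor added for inspection. The implicit case uses assignment from a temporary rather than initialization, because a newer compiler is allowed to build the returned object in place and skip the copy entirely. The destroyed counter and the id member of MyClass are invented for the demonstration.

```cpp
static int destroyed = 0;   // counts MyClass destructor calls

struct MyClass {
    int id;
    explicit MyClass(int i) : id(i) { }
    ~MyClass() { ++destroyed; }
};

// Condensed stand-in for the AutoPtr class above: only what this example needs.
template<typename T>
class AutoPtr
{
public:
    // Public here so the demo can name it; the class above keeps it protected.
    struct ReferenceHelper {
        bool _owned;
        T* _ptr;
        ReferenceHelper(bool owned, T* ptr) : _owned(owned), _ptr(ptr) { }
    };

    explicit AutoPtr(T* ptr = 0, bool owned = true)
        : _owned(owned), _ptr(ptr) { }
    AutoPtr(AutoPtr& other)                 // cannot bind to a temporary
        : _owned(other._owned), _ptr(other.detach()) { }
    AutoPtr(ReferenceHelper helper)         // step 2: ReferenceHelper to AutoPtr
        : _owned(helper._owned), _ptr(helper._ptr) { }
    ~AutoPtr() { if (_owned) delete _ptr; }

    AutoPtr& operator=(AutoPtr& rhs) {
        bool owned = rhs._owned;            // read before detach() clears it
        reset(rhs.detach(), owned);
        return *this;
    }
    AutoPtr& operator=(ReferenceHelper rhs) {
        reset(rhs._ptr, rhs._owned);
        return *this;
    }
    operator ReferenceHelper() {            // step 1: AutoPtr to ReferenceHelper
        bool owned = _owned;
        return ReferenceHelper(owned, detach());
    }

    T* operator->() { return _ptr; }
    T* get() const { return _ptr; }         // hypothetical accessor, for inspection only
    T* detach() {
        T* t = _ptr;
        _ptr = 0;
        _owned = false;
        return t;
    }
    void reset(T* ptr = 0, bool owned = true) {
        if (_owned) delete _ptr;
        _owned = owned;
        _ptr = ptr;
    }
private:
    bool _owned;
    T* _ptr;
};

AutoPtr<MyClass> getMyClass(int id) {
    return AutoPtr<MyClass>(new MyClass(id));
}

// Implicit version: the temporary goes through ReferenceHelper on its way in.
int roundTrip() {
    AutoPtr<MyClass> ptr;
    ptr = getMyClass(7);
    return ptr->id;
}

// The same two conversions, written out by hand.
int explicitSteps() {
    AutoPtr<MyClass> source(new MyClass(1));
    AutoPtr<MyClass>::ReferenceHelper helper = source;  // step 1: source is now empty
    AutoPtr<MyClass> target(helper);                    // step 2: target owns the object
    return source.get() == 0 ? target->id : -1;
}
```

In both functions the MyClass instance is destroyed exactly once, when the final AutoPtr goes out of scope.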
This class is especially useful in complex methods that “save” data, where errors can essentially happen at any point. If memory is allocated and managed with AutoPtr, we do not have to worry about cleaning up the allocated memory on various different code branches.
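As a rough illustration of that last point, here is a made-up save routine with three different ways out. The Buffer type, the liveBuffers counter, and the save function are all hypothetical, and the AutoPtr shown is a stripped-down stand-in for the class above (T instead of _T, copying disabled).

```cpp
#include <stdexcept>
#include <string>

static int liveBuffers = 0;   // counts Buffer objects that have not been destroyed

// Hypothetical scratch buffer, invented for this illustration.
struct Buffer {
    std::string bytes;
    explicit Buffer(const std::string& b) : bytes(b) { ++liveBuffers; }
    ~Buffer() { --liveBuffers; }
};

// Stripped-down stand-in for the AutoPtr class above.
template<typename T>
class AutoPtr
{
public:
    explicit AutoPtr(T* ptr = 0, bool owned = true)
        : _owned(owned), _ptr(ptr) { }
    ~AutoPtr() { if (_owned) delete _ptr; }
    T* operator->() { return _ptr; }
private:
    AutoPtr(const AutoPtr&);             // copying left out of this sketch
    AutoPtr& operator=(const AutoPtr&);
    bool _owned;
    T* _ptr;
};

// Hypothetical save routine. There is no delete on any branch;
// the AutoPtr destructor frees the buffer however we leave the function.
bool save(const std::string& name, const std::string& data) {
    AutoPtr<Buffer> buffer(new Buffer(data));
    if (name.empty())
        return false;                                     // error exit
    if (buffer->bytes.empty())
        throw std::invalid_argument("nothing to save");   // exceptional exit
    // ... the real work of writing buffer->bytes would go here ...
    return true;                                          // normal exit
}
```

Whichever exit is taken, liveBuffers is back to zero once save has returned or thrown.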