Smart Groups' Scraping Error

Just got hold of the Criterion Collection (CC) edition of one of my favourite trilogies, Infernal Affairs. The original Blu-ray rips have always shown up just fine in Infuse, so I thought I’d use the new Smart Groups feature to add the CC edition to the same titles. But… that doesn’t seem to work as advertised.

Here’s what my folder structure looks like—

Infernal Affairs (2002)
> Extras
- folder.jpg
- Infernal Affairs (2002) {edition-1 The Criterion Collection}-poster.jpg
- Infernal Affairs (2002) {edition-1 The Criterion Collection}.mkv
- Infernal Affairs (2002) {edition-2 Theatrical Release}-poster.jpg
- Infernal Affairs (2002) {edition-2 Theatrical Release}.mkv

Infernal Affairs II (2003)
> Extras
- folder.jpg
- Infernal Affairs II (2003) {edition-1 The Criterion Collection}-poster.jpg
- Infernal Affairs II (2003) {edition-1 The Criterion Collection}.mkv
- Infernal Affairs II (2003) {edition-2 Theatrical Release}-poster.jpg
- Infernal Affairs II (2003) {edition-2 Theatrical Release}.mkv

Infernal Affairs III (2003)
> Extras
- folder.jpg
- Infernal Affairs III (2003) {edition-1 The Criterion Collection}-poster.jpg
- Infernal Affairs III (2003) {edition-1 The Criterion Collection}.mkv
- Infernal Affairs III (2003) {edition-2 Theatrical Release}-poster.jpg
- Infernal Affairs III (2003) {edition-2 Theatrical Release}.mkv

When I open Infuse (and I have waited for it to scan for changes), this is how it shows up—

I now see 2 entries for Infernal Affairs instead of the 3 that should show up (based on my folder structure). The 3rd Infernal Affairs poster you see at the end of that same row is actually a file from the Extras folder (which is supposed to be ignored, but, well, that’s another battle). Anyways, there are actually a few things wrong here.

  1. The poster shown is actually from Infernal Affairs II;
  2. But the title shows Infernal Affairs III; and
  3. It shows 4 items within that title, which again is wrong, because there are only 2 editions of every part.

Looking inside Infernal Affairs I, everything looks exactly the way it should. Fab!

But when I look into the wrongly scraped title, it shows me 4 files (2 from the 2nd part and 2 from the 3rd part) all under the same title.

Scrolling down to verify the file names, they seem right, but they are getting scraped wrongly. Infuse seems to think Infernal Affairs II and Infernal Affairs III are one and the same. They were released the same year (2003), but they are 2 different titles.

What am I doing wrong?

For whatever reason, the top result for ‘Infernal Affairs II’ on TMDB is Infernal Affairs III.

What you can do is use the Edit Metadata option to select the correct title for the 2 copies of II, or just swap the roman numerals in the filename to numbers.

EG

Infernal Affairs 3 (2003) {edition-1 The Criterion Collection}.mkv
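
If there are a lot of files to rename, something like this rough Python sketch could do the numeral swap in bulk. The function name `arabic_sequel_name` and the numeral table are made up for illustration, and it assumes the numeral sits directly before the `(YYYY)` year tag, as in the filenames in this thread. It only computes the new name; it does not touch any files.

```python
import re

# Roman numerals commonly seen in sequel titles, mapped to digits.
# (Hypothetical table for illustration; extend it if needed.)
ROMAN_TO_ARABIC = {
    "II": "2", "III": "3", "IV": "4", "V": "5",
    "VI": "6", "VII": "7", "VIII": "8", "IX": "9", "X": "10",
}

# A run of I/V/X characters standing as its own word right before " (YYYY)".
NUMERAL_BEFORE_YEAR = re.compile(r"\b([IVX]+)(?= \(\d{4}\))")


def arabic_sequel_name(filename: str) -> str:
    """Return filename with a trailing Roman sequel numeral swapped for digits."""
    def swap(match: re.Match) -> str:
        numeral = match.group(1)
        # Leave anything that is not in the table untouched.
        return ROMAN_TO_ARABIC.get(numeral, numeral)

    return NUMERAL_BEFORE_YEAR.sub(swap, filename)


print(arabic_sequel_name(
    "Infernal Affairs III (2003) {edition-1 The Criterion Collection}.mkv"
))
```

Worth eyeballing the results before renaming anything for real: a title that legitimately ends in a lone V or X would get changed too.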

Ah! Thanks @james. I’ll try changing the Roman numerals for now.

The scraping algorithm has demonstrated difficulty distinguishing between similarly named titles when the only difference between them is a short word (or “word”, as in the case of II vs III), and especially so when a title consists entirely of a single short word. By default, the algorithm fails to prioritize exact title matches over near-matches that include what ought to be exclusionary “extraneous” content (such as an additional I).
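
To make "prioritize exact title matches" concrete, here is a minimal Python sketch of the idea. It is not how Infuse or TMDB actually rank results; `pick_match`, `normalize`, and the shape of the candidate list (dicts with a `title` key) are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def normalize(title: str) -> str:
    """Lower-case and collapse whitespace so comparisons ignore trivial differences."""
    return " ".join(title.casefold().split())


def pick_match(query: str, candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Prefer a candidate whose title equals the query; else fall back to the top result."""
    if not candidates:
        return None
    wanted = normalize(query)
    for candidate in candidates:
        if normalize(candidate["title"]) == wanted:
            return candidate
    # No exact match: keep the search engine's own ordering.
    return candidates[0]


# Hypothetical result order, mirroring the mismatch described in this thread.
results = [{"title": "Infernal Affairs III"}, {"title": "Infernal Affairs II"}]
print(pick_match("Infernal Affairs II", results)["title"])
```

With the order above, taking the top result alone would give III, while the exact-match pass picks II.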

Swapping the numerals works because TMDB users have previously identified the above issue and created alternative title entries to circumvent it.

Infuse relies on the top result as provided by TMDB for automatic matching. This is usually pretty reliable, but there are some edge cases like this where issues can come up.

You can see this on the TMDB site as well by searching for Infernal Affairs II.

Yes, I know. I wasn’t throwing blame at either of you (Firecore or TMDB). My presumption would be that TMDB controlled the search API and algorithm.

I’ve been posting screencaps of manual TMDB web site searches here for nearly two years; both when demonstrating this specific issue, and when helping other Infuse users here troubleshoot their own various scraping mismatch issues.

Right, just wanted to clarify.

FWIW, the ‘scraping’ of names is done by Infuse. We filter out a bunch of irrelevant things that may be included in filenames and send the resulting title (with or without a year) to TMDB, and then we get a list of potential matches back.
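
For anyone curious what that kind of filtering might look like, here is a rough Python sketch. It is a guess for illustration, not Infuse's actual parser: the function name `search_terms` and both patterns are assumptions, and it only handles the `(YYYY)` year and `{edition-…}` tags used in the filenames in this thread.

```python
import os
import re
from typing import Optional, Tuple

# Tags used in this thread's filenames; a real parser would strip far more.
EDITION_TAG = re.compile(r"\s*\{edition-[^}]*\}")
YEAR_TAG = re.compile(r"\s*\((\d{4})\)")


def search_terms(filename: str) -> Tuple[str, Optional[int]]:
    """Split a media filename into a (title, year) pair suitable for a search query."""
    stem, _extension = os.path.splitext(filename)
    stem = EDITION_TAG.sub("", stem)           # drop "{edition-1 ...}"
    year_match = YEAR_TAG.search(stem)
    year = int(year_match.group(1)) if year_match else None
    title = YEAR_TAG.sub("", stem).strip()     # drop "(2003)"
    return title, year


print(search_terms(
    "Infernal Affairs II (2003) {edition-1 The Criterion Collection}.mkv"
))
```

The resulting title and year are what would then be sent off as the search query.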

Important distinction, and I appreciate the correction. In the previous post (and probably several others) I have used the term to refer to the whole process: both the parsing of users’ media filenames to generate search queries for the TMDB API (scraping?), and Infuse’s assignment of metadata to users’ files based on the pairing of those filenames with the top search result TMDB returns for the provided search terms (matching?).

Which word would be best used to refer to the whole process?

Indexing or scanning maybe?

  1. Look for files
  2. Get names
  3. Search for matches
  4. Pick match
  5. Download metadata/artwork
  6. Check file specs (codec, duration, etc…)
