Problem scraping a Chinese-language site with the Feeds module - Drupal大学

陈蹊, Wednesday, 04/30/2014

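I'm using Feeds to scrape a Chinese-language site whose pages are encoded in GB2312. The Chinese text in the fetched content comes back garbled and cannot be parsed. How do I convert the encoding in Feeds?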

2 answers

刘伯彪 posted: 2014-04-30 16:40

We have a ready-made solution that detects and decodes the encoding automatically, but it is our company's trade secret and we sell it.

赵高欣 posted: 2014-07-07 16:59

I ran into this problem recently too. Here is my solution, offered in the hope that it prompts better ones :P

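Add a Feeds fetcher plugin that extends FeedsHTTPFetcher.
Implement fetch() so that any non-UTF-8 document is converted to UTF-8: (1) rewrite the charset declaration; (2) convert the content encoding with iconv.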
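For Feeds to offer the new fetcher, the class has to be registered as a plugin. A minimal sketch of that registration through `hook_feeds_plugins()`, assuming Feeds 7.x-2.x and a custom module named `gbfetcher` with a class `GbHTTPFetcher` in `GbHTTPFetcher.inc` (all three names are hypothetical and not from the original post):

```php
<?php
/**
 * Implements hook_feeds_plugins().
 *
 * Assumption: the module is called "gbfetcher" and the fetcher class
 * "GbHTTPFetcher" lives in GbHTTPFetcher.inc in the module directory.
 */
function gbfetcher_feeds_plugins() {
  $info = array();
  $info['GbHTTPFetcher'] = array(
    'name' => 'HTTP Fetcher (converts to UTF-8)',
    'description' => 'Downloads a page over HTTP and converts it to UTF-8.',
    'handler' => array(
      // Reuse everything from the stock HTTP fetcher except fetch().
      'parent' => 'FeedsHTTPFetcher',
      'class' => 'GbHTTPFetcher',
      'file' => 'GbHTTPFetcher.inc',
      'path' => drupal_get_path('module', 'gbfetcher'),
    ),
  );
  return $info;
}
```

Plugin definitions are cached, so a cache clear is usually needed before the fetcher appears in the importer settings.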

The code for fetch() is as follows:

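  public function fetch(FeedsSource $source) {
    $fetcher_result = parent::fetch($source);
    $raw = $fetcher_result->getRaw();

    // Convert the document to UTF-8 if its <meta> tag declares another charset.
    // Matches both <meta charset="gb2312"> and the http-equiv form, quoted or not.
    if (preg_match('/<meta[^>]+charset=["\']?([-\w]+)[^>]*>/i', $raw, $matches) && strtolower($matches[1]) !== 'utf-8') {
      // Rewrite the declared charset so the parser does not decode the content again.
      $meta = str_ireplace($matches[1], 'utf-8', $matches[0]);
      $raw = str_replace($matches[0], $meta, $raw);
      // iconv() returns FALSE on failure; keep the original result in that case.
      $converted = @iconv($matches[1], 'utf-8//IGNORE', $raw);
      if ($converted !== FALSE) {
        $fetcher_result = new FeedsFetcherResult($converted);
      }
    }

    return $fetcher_result;
  }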
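The conversion step can also be tried outside Drupal. The sketch below pulls the same logic into a standalone function, `example_html_to_utf8()`, which is a hypothetical name and not part of Feeds. One further assumption is labelled in the code: pages that declare gb2312 often contain characters from the larger GBK set, so the sketch decodes them as GBK, the way browsers treat that label.

```php
<?php
/**
 * Hypothetical helper, not part of Feeds: converts an HTML string to UTF-8
 * using the charset declared in its <meta> tag.
 */
function example_html_to_utf8($raw) {
  // Find the declared charset; return the input unchanged if none is found.
  if (!preg_match('/<meta[^>]+charset=["\']?([-\w]+)[^>]*>/i', $raw, $matches)) {
    return $raw;
  }
  $declared = $matches[1];
  if (strtolower($declared) === 'utf-8') {
    return $raw;
  }
  // Assumption: decode gb2312-labelled pages with the GBK superset.
  $from = strtolower($declared) === 'gb2312' ? 'GBK' : $declared;
  $converted = @iconv($from, 'UTF-8//IGNORE', $raw);
  if ($converted === FALSE) {
    // Unknown or unsupported charset: leave the document as it was.
    return $raw;
  }
  // Update the declaration so later parsers do not decode the text again.
  $meta = str_ireplace($declared, 'utf-8', $matches[0]);
  return str_replace($matches[0], $meta, $converted);
}

// Sample page: the two characters "中文" as GB2312 bytes (D6 D0 CE C4).
$page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head>'
  . "<body>\xD6\xD0\xCE\xC4</body></html>";
$utf8 = example_html_to_utf8($page);
echo $utf8, "\n";
```

Relying on the `<meta>` tag alone misses pages that declare their charset only in the HTTP `Content-Type` header, so a fuller version would probably check that header as well.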
